SmartFlowAI
SmartFlowAI (机智流) Top Conference & Journal Discussion Group
Full text: about 2,400 characters; estimated reading time: 6 minutes.
From November 12 to 16, EMNLP 2024 is being held in Florida, USA, and this year's best papers were announced in the early hours of today, Beijing time. This post rounds up the five papers that won Best Paper. We will continue to publish roundups of highly cited EMNLP 2024 work in different areas; reply "盘点" in the chat box of the SmartFlowAI (机智流) official account to join the conference paper roundup discussion group.
1. An image speaks a thousand words, but can everyone listen? On image transcreation for cultural relevance
2. Towards Robust Speech Representation Learning for Thousands of Languages
3. Backward Lens: Projecting Language Model Gradients into the Vocabulary Space
4. Pretraining Data Detection for Large Language Models: A Divergence-based Calibration Method
5. CoGen: Learning from Feedback with Coupled Comprehension and Generation
An image speaks a thousand words, but can everyone listen? On image transcreation for cultural relevance
https://arxiv.org/pdf/2404.01247
Summary: With the rise of multimedia content, human translators increasingly adapt not only words but also other modalities such as images for cultural relevance, so as to convey the same meaning, yet machine translation systems remain confined to handling language in speech and text. This paper takes a first step towards translating images to make them culturally relevant. It builds three pipelines composed of state-of-the-art generative models to perform the task and constructs a two-part evaluation dataset (a concept part with 600 cross-culturally coherent images and an application part with 100 images curated from real-world applications), then conducts a multi-faceted human evaluation of the translated images to assess cultural relevance and meaning preservation. The results show that current image-editing models fail at this task but can be improved by bringing LLMs and retrievers into the loop; the best pipelines translate only 5% of images for some countries even in the easier concept dataset, and fail to produce any successful translation for some countries in the application dataset, highlighting how challenging the task is. Code and data are released at the URL given in the paper.
Abstract: Given the rise of multimedia content, human translators increasingly focus on culturally adapting not only words but also other modalities such as images to convey the same meaning. While several applications stand to benefit from this, machine translation systems remain confined to dealing with language in speech and text. In this work, we take a first step towards translating images to make them culturally relevant. First, we build three pipelines comprising state-of-the-art generative models to do the task. Next, we build a two-part evaluation dataset: i) concept: comprising 600 images that are cross-culturally coherent, focusing on a single concept per image, and ii) application: comprising 100 images curated from real-world applications. We conduct a multi-faceted human evaluation of translated images to assess for cultural relevance and meaning preservation. We find that as of today, image-editing models fail at this task, but can be improved by leveraging LLMs and retrievers in the loop. Best pipelines can only translate 5% of images for some countries in the easier concept dataset and no translation is successful for some countries in the application dataset, highlighting the challenging nature of the task. Our code and data is released here: this https URL.
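As a rough illustration of the kind of "LLMs and retrievers in the loop" pipeline the abstract alludes to, here is a minimal sketch. The caption_model, llm, retriever, and editor interfaces are hypothetical assumptions, not the authors' actual implementation.

```python
def transcreate_image(image, target_country, caption_model, llm, retriever, editor):
    """Sketch of an LLM-and-retriever-in-the-loop transcreation pipeline (illustrative)."""
    # 1) Describe what the source image depicts.
    caption = caption_model.describe(image)  # e.g. "a plate of pancakes"
    # 2) Ask an LLM for a culturally equivalent concept for the target audience.
    concept = llm.ask(
        f"Suggest a culturally equivalent replacement for '{caption}' "
        f"for an audience in {target_country}. Answer with a short noun phrase."
    )
    # 3) Prefer retrieving a real image of the adapted concept...
    hits = retriever.search(concept, country=target_country)
    if hits:
        return hits[0]
    # 4) ...otherwise fall back to instructing an image-editing model.
    return editor.edit(image, instruction=f"Replace the main subject with {concept}.")
```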
Towards Robust Speech Representation Learning for Thousands of Languages
https://arxiv.org/abs/2407.00837
Summary: Self-supervised learning (SSL) has reduced the need for labeled data and helped extend speech technologies to more languages, but models remain far from supporting the world's 7,000+ languages. This paper proposes XEUS, a Cross-lingual Encoder for Universal Speech, trained on more than 1 million hours of data spanning 4,057 languages, extending the language coverage of SSL models four-fold. It combines 1 million hours of speech from existing public corpora with a newly created corpus of 7,400+ hours covering 4,057 languages (to be publicly released), and, to handle the diverse conditions of multilingual speech data, augments the standard SSL masked-prediction objective with a new dereverberation objective for added robustness. XEUS outperforms or matches state-of-the-art (SOTA) SSL models on multiple benchmarks and sets a new SOTA on ML-SUPERB, beating MMS 1B and w2v-BERT 2.0 v2 by 0.8% and 4.4% respectively despite having fewer parameters or less pre-training data. Checkpoints, code, and data are available at the URL given in the paper.
Abstract: Self-supervised learning (SSL) has helped extend speech technologies to more languages by reducing the need for labeled data. However, models are still far from supporting the world's 7000+ languages. We propose XEUS, a Cross-lingual Encoder for Universal Speech, trained on over 1 million hours of data across 4057 languages, extending the language coverage of SSL models 4-fold. We combine 1 million hours of speech from existing publicly accessible corpora with a newly created corpus of 7400+ hours from 4057 languages, which will be publicly released. To handle the diverse conditions of multilingual speech data, we augment the typical SSL masked prediction approach with a novel dereverberation objective, increasing robustness. We evaluate XEUS on several benchmarks, and show that it consistently outperforms or achieves comparable results to state-of-the-art (SOTA) SSL models across a variety of tasks. XEUS sets a new SOTA on the ML-SUPERB benchmark: it outperforms MMS 1B and w2v-BERT 2.0 v2 by 0.8% and 4.4% respectively, despite having less parameters or pre-training data. Checkpoints, code, and data are found in this https URL.
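The abstract describes augmenting masked prediction with a dereverberation objective. Below is a minimal, hedged sketch of how such a combined loss might look, assuming a hypothetical model with encode, prediction_head, and dereverb_head components; the paper's actual architecture and loss may differ.

```python
import torch.nn.functional as F

def ssl_loss(model, reverberant_wave, clean_feats, mask, codebook_targets, w_derev=1.0):
    """Masked prediction plus a dereverberation term (illustrative only)."""
    hidden = model.encode(reverberant_wave)       # frame-level representations, shape (T, d)
    # 1) Standard SSL masked prediction: predict discrete targets at masked frames.
    logits = model.prediction_head(hidden[mask])  # (num_masked, codebook_size)
    l_masked = F.cross_entropy(logits, codebook_targets[mask])
    # 2) Dereverberation objective: reconstruct clean ("dry") features from the
    #    reverberant input, encouraging robustness to recording conditions.
    recon = model.dereverb_head(hidden)           # (T, feat_dim)
    l_derev = F.l1_loss(recon, clean_feats)
    return l_masked + w_derev * l_derev
```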
Backward Lens: Projecting Language Model Gradients into the Vocabulary Space
https://arxiv.org/abs/2402.12865
Summary: Understanding how Transformer-based language models (LMs) learn and recall information is a key goal of deep learning research. Recent interpretability methods project weights and hidden states from the forward pass onto the model's vocabulary, helping to reveal how information flows within LMs. This work extends that methodology to the backward pass and gradients: it first proves that a gradient matrix can be cast as a low-rank linear combination of the forward and backward passes' inputs, then develops methods to project these gradients onto vocabulary items, exploring how new information is stored in LM neurons.
Abstract: Understanding how Transformer-based Language Models (LMs) learn and recall information is a key goal of the deep learning community. Recent interpretability methods project weights and hidden states obtained from the forward pass to the models' vocabularies, helping to uncover how information flows within LMs. In this work, we extend this methodology to LMs' backward pass and gradients. We first prove that a gradient matrix can be cast as a low-rank linear combination of its forward and backward passes' inputs. We then develop methods to project these gradients into vocabulary items and explore the mechanics of how new information is stored in the LMs' neurons.
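For intuition on the claim that a gradient matrix is a low-rank combination of forward- and backward-pass inputs: for a single example through a linear layer y = W x, the weight gradient is the outer product of the backward signal and the forward input. The sketch below shows that identity plus a logit-lens-style projection onto the vocabulary; the project_to_vocab helper and tokenizer interface are illustrative assumptions, not the paper's exact method.

```python
import torch

def linear_layer_grad(x, delta):
    """Gradient of the loss w.r.t. W for y = W x, on a single example.

    x is the forward input (d_in,); delta is dL/dy from the backward pass (d_out,).
    Their outer product is rank one; over a batch the gradient is a sum of such
    rank-one terms, hence a low-rank combination of forward/backward inputs.
    """
    return torch.outer(delta, x)  # shape (d_out, d_in)

def project_to_vocab(vec, unembedding, tokenizer, k=10):
    """Logit-lens-style reading of a hidden-space vector (illustrative helper).

    unembedding is the (vocab_size, d_model) output-embedding matrix; tokenizer is
    any object exposing a decode(list_of_ids) method.
    """
    scores = unembedding @ vec
    top_ids = torch.topk(scores, k).indices.tolist()
    return [tokenizer.decode([i]) for i in top_ids]
```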
Pretraining Data Detection for Large Language Models: A Divergence-based Calibration Method
https://arxiv.org/abs/2409.14781
Summary: As the training corpora of large language models (LLMs) grow, model developers are increasingly reluctant to disclose details about their data, which poses challenges for scientific evaluation and ethical deployment. Pretraining data detection methods have been explored; the Min-K% Prob method achieves strong results but has limitations, since it tends to misclassify non-training texts composed of common, high-probability words. To address this, the paper introduces a divergence-based calibration method that calibrates token probabilities for pretraining data detection, and builds PatentMIA, a Chinese-language benchmark for evaluating detection methods on Chinese text. Experiments show the proposed method outperforms existing approaches; the code and benchmark are available at the URL given in the paper.
Abstract: As the scale of training corpora for large language models (LLMs) grows, model developers become increasingly reluctant to disclose details on their data. This lack of transparency poses challenges to scientific evaluation and ethical deployment. Recently, pretraining data detection approaches, which infer whether a given text was part of an LLM's training data through black-box access, have been explored. The Min-K% Prob method, which has achieved state-of-the-art results, assumes that a non-training example tends to contain a few outlier words with low token probabilities. However, the effectiveness may be limited as it tends to misclassify non-training texts that contain many common words with high probabilities predicted by LLMs. To address this issue, we introduce a divergence-based calibration method, inspired by the divergence-from-randomness concept, to calibrate token probabilities for pretraining data detection. We compute the cross-entropy (i.e., the divergence) between the token probability distribution and the token frequency distribution to derive a detection score. We have developed a Chinese-language benchmark, PatentMIA, to assess the performance of detection approaches for LLMs on Chinese text. Experimental results on English-language benchmarks and PatentMIA demonstrate that our proposed method significantly outperforms existing methods. Our code and PatentMIA benchmark are available at this https URL.
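To make the contrast concrete, here is a hedged sketch of the baseline Min-K% Prob score next to one plausible reading of the divergence-based calibration, namely discounting each token's model log-probability by its frequency in a reference corpus. The paper's exact formula may differ, and corpus_freq is an assumed lookup table.

```python
import math

def min_k_prob_score(token_logprobs, k=0.2):
    """Baseline Min-K% Prob: mean of the lowest k fraction of token log-probs."""
    n = max(1, int(len(token_logprobs) * k))
    lows = sorted(token_logprobs)[:n]
    return sum(lows) / n  # higher => more likely to be training data

def frequency_calibrated_score(token_ids, token_logprobs, corpus_freq):
    """Hedged sketch of a divergence-based calibration.

    Discounts each token's model log-probability by the log of its relative
    frequency in a reference corpus, so that texts made of common, inherently
    high-probability tokens are not mistaken for training members.
    corpus_freq: dict mapping token id -> relative frequency (assumed lookup).
    """
    calibrated = [
        lp - math.log(corpus_freq.get(tid, 1e-9))
        for tid, lp in zip(token_ids, token_logprobs)
    ]
    return sum(calibrated) / len(calibrated)
```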
CoGen: Learning from Feedback with Coupled Comprehension and Generation
https://arxiv.org/abs/2408.15992
Summary: Systems that can both comprehend and generate language benefit from the tight connection between the two abilities. This work studies coupling comprehension and generation, with a focus on continually learning from interaction with users. It proposes techniques that tightly integrate the two capabilities for both learning and inference, using two-player reference games as the study setting and deploying various models over thousands of interactions with human users while learning from interaction feedback signals. The results show dramatic improvement over time: comprehension-generation coupling yields performance gains of up to 26% in absolute terms and up to 17% higher accuracy than a non-coupled system, and coupling also has a substantial qualitative effect on the system's language, making it markedly more human-like.
Abstract: Systems with both language comprehension and generation capabilities can benefit from the tight connection between the two. This work studies coupling comprehension and generation with focus on continually learning from interaction with users. We propose techniques to tightly integrate the two capabilities for both learning and inference. We situate our studies in two-player reference games, and deploy various models for thousands of interactions with human users, while learning from interaction feedback signals. We show dramatic improvements in performance over time, with comprehension-generation coupling leading to performance improvements up to 26% in absolute terms and up to 17% higher accuracies compared to a non-coupled system. Our analysis also shows coupling has substantial qualitative impact on the system's language, making it significantly more human-like.
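One common way to couple generation with comprehension at inference time in a reference game is to rerank generated candidates by how reliably a comprehension model resolves them back to the intended referent. The sketch below illustrates only that idea; gen_model.sample and comp_model.prob_of_target are assumed interfaces rather than CoGen's actual API, and the paper additionally couples the two capabilities during learning.

```python
def coupled_generation(context, intended_referent, gen_model, comp_model, n_candidates=10):
    """Rerank generated descriptions by how well a comprehender resolves them.

    context is the set of candidate referents visible to both players;
    gen_model and comp_model are assumed, duck-typed interfaces.
    """
    candidates = [gen_model.sample(intended_referent, context) for _ in range(n_candidates)]
    # Keep the description the comprehension model is most likely to resolve
    # back to the intended referent (a speaker-listener consistency check).
    return max(
        candidates,
        key=lambda utterance: comp_model.prob_of_target(utterance, context, intended_referent),
    )
```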