
2023-09-29-insights

Three interesting papers today.

Effective Long-Context Scaling of Foundation Models

New research from Meta AI. The authors try out various long-sequence-length variants and find that Llama 2, after this training, surpasses turbo-16k on long-context benchmarks. They also find that a long-context model does not need much long data during pretraining; finetuning on a small amount of long data at the end is sufficient.
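To make the recipe concrete, below is a minimal sketch, assuming a Hugging Face setup, of "pretrain short, then briefly finetune on a small long-context corpus". The checkpoint name, the long_docs.jsonl file, the 16k sequence length, and the RoPE-scaling stand-in are all illustrative assumptions, not the paper's exact method.

```python
# A minimal sketch of the recipe above: take a short-context pretrained checkpoint
# and briefly finetune it on a small corpus of long documents.
from datasets import load_dataset
from transformers import (AutoConfig, AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = "meta-llama/Llama-2-7b-hf"            # assumed base checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token          # Llama has no pad token by default

# Extend the context window; HF's linear RoPE scaling is used here as a stand-in
# for whatever positional-encoding adjustment the paper actually applies.
config = AutoConfig.from_pretrained(model_name)
config.rope_scaling = {"type": "linear", "factor": 4.0}   # 4096 -> 16384
model = AutoModelForCausalLM.from_pretrained(model_name, config=config)

# A small set of long documents (hypothetical local file) is assumed to suffice.
long_docs = load_dataset("json", data_files="long_docs.jsonl", split="train")
long_docs = long_docs.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=16384),
    batched=True, remove_columns=long_docs.column_names,
)

args = TrainingArguments(
    output_dir="llama2-long-ft",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    num_train_epochs=1,            # only a brief finetuning pass
    learning_rate=2e-5,
    bf16=True,
)
trainer = Trainer(
    model=model, args=args, train_dataset=long_docs,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```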

Demystifying CLIP Data

Another work from FAIR, Meta AI. The authors argue that the OpenAI CLIP model from a few years ago was essentially a data-engineering project, with its capabilities coming from a closed-source 400M-pair dataset. In this paper they introduce MetaCLIP, which focuses entirely on the data and adds no extra tricks. Comparative experiments at the same data scale beat CLIP (70.8% > 68.3%).

In addition, the authors run scaling experiments: with 1B pairs they reach 72.4%, and if the model size is scaled up as well, accuracy goes as high as 80.5% with ViT-H.

Most importantly, it is open source! Meta AI is all you need.
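For reference, a comparison like 70.8% vs 68.3% is typically a zero-shot ImageNet top-1 accuracy. Below is a minimal sketch of how such a number is computed for a CLIP-style model; the model tag, the local dataset path, and the single prompt template are assumptions, and real evaluations map folder names to readable class names and use prompt ensembles.

```python
# A minimal sketch of zero-shot top-1 evaluation for a CLIP-style model.
import torch
import open_clip
from torch.utils.data import DataLoader
from torchvision.datasets import ImageFolder

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _, preprocess = open_clip.create_model_and_transforms("ViT-B-32", pretrained="openai")
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model = model.to(device).eval()

dataset = ImageFolder("imagenet/val", transform=preprocess)   # assumed local ImageNet copy
loader = DataLoader(dataset, batch_size=256, num_workers=8)

# One text embedding per class from a simple prompt template.
prompts = [f"a photo of a {name}" for name in dataset.classes]
with torch.no_grad():
    text_feats = model.encode_text(tokenizer(prompts).to(device))
    text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)

    correct = total = 0
    for images, labels in loader:
        img_feats = model.encode_image(images.to(device))
        img_feats = img_feats / img_feats.norm(dim=-1, keepdim=True)
        preds = (img_feats @ text_feats.T).argmax(dim=-1).cpu()
        correct += (preds == labels).sum().item()
        total += labels.numel()

print(f"zero-shot top-1 accuracy: {correct / total:.3f}")
```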

AnyMAL: An Efficient and Scalable Any-Modality Augmented Language Model

Meta again. The authors train an efficient, scalable multimodal encoder on 200 million images, 2.2 million audio clips, 500 thousand IMU sequences, and 28 million videos, and align all of it with Llama2-70B. This essentially lets Llama 2 understand multimodal inputs, although it can only output text.

The authors evaluate on a pile of benchmarks and reach state of the art on nearly all of them.

As an aside, is this how GPT-4V was built, combined with a pile of closed-source super data engineering like that discussed in the previous paper (Demystifying CLIP Data)?
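As a rough illustration of the alignment idea, here is a minimal PyTorch sketch in which a learned projector maps features from a frozen modality encoder into the LLM's token-embedding space, and the resulting soft tokens are prepended to the embedded text prompt. The module name, dimensions, and number of soft tokens are assumptions, not AnyMAL's actual design.

```python
# A minimal sketch: project pooled modality features into LLM-space "soft tokens"
# and prepend them to the text prompt; the LLM itself still only outputs text.
import torch
import torch.nn as nn

class ModalityProjector(nn.Module):
    def __init__(self, modality_dim: int, llm_dim: int, num_tokens: int = 32):
        super().__init__()
        self.num_tokens, self.llm_dim = num_tokens, llm_dim
        # Map one pooled encoder feature vector to `num_tokens` LLM-space tokens.
        self.proj = nn.Linear(modality_dim, num_tokens * llm_dim)

    def forward(self, modality_feats: torch.Tensor) -> torch.Tensor:
        # modality_feats: (batch, modality_dim), pooled output of a frozen encoder
        batch = modality_feats.shape[0]
        return self.proj(modality_feats).view(batch, self.num_tokens, self.llm_dim)

# Only the projector is trained; the LLM and the modality encoder stay frozen.
projector = ModalityProjector(modality_dim=1024, llm_dim=8192)  # 8192 = Llama2-70B hidden size
image_feats = torch.randn(2, 1024)       # stand-in for pooled image-encoder features
text_embeds = torch.randn(2, 16, 8192)   # stand-in for the embedded text prompt
llm_inputs = torch.cat([projector(image_feats), text_embeds], dim=1)
print(llm_inputs.shape)                  # torch.Size([2, 48, 8192])
```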

My insights

Is Meta really this strong now? I basically picked today's arXiv papers by gut feeling, and they all turned out to be from Meta; their focus really is good. With OpenAI leading the way into closed source and a tiger like Meta in the open-source world, those of us doing research at universities still need to work harder.