Allen Institute for AI Unveils OLMo: A Truly Open Source Large Language Model

The Allen Institute for AI (AI2) has introduced OLMo, a large language model that boasts complete transparency in its design, training, and evaluation processes, making it a "truly open source" solution for companies building applications.

OLMo is released along with its model code, weights, training code, training data, and evaluation suite, letting users examine every aspect of how it was built. Trained on Dolma, a three-trillion-token dataset developed by AI2, OLMo comes in several variants at roughly seven billion parameters, putting it in competition with other large language models such as Meta's Llama 2-7B and Mistral AI's Mixtral 8x7B.
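Because the weights are openly published, the model can be loaded with standard tooling. The sketch below is a minimal illustration, assuming the 7B checkpoint is hosted on the Hugging Face Hub under an identifier such as "allenai/OLMo-7B"; the exact repository name and any `trust_remote_code` requirement may differ.

```python
# Minimal sketch: load an OLMo checkpoint and generate text.
# The repo id "allenai/OLMo-7B" is an assumption; check the Hub for the
# exact name and loading requirements.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "allenai/OLMo-7B"  # assumed identifier

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

inputs = tokenizer("Open language models enable", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```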

Founded by the late Microsoft co-founder Paul Allen, AI2 aims to empower academics and researchers by offering access to the full training aspects of OLMo, facilitating collective study of language models. The institute believes that this open approach not only reduces developmental redundancies but also contributes to the decarbonization of AI by enabling more efficient fine-tuning of models.

By removing the need to rely on qualitative assumptions about how a model was trained and how it performs, OLMo's transparency also lets researchers work faster, and it aligns with the growing push for openness in the AI community. While big names like Meta have pledged to open-source their AI systems, AI experts continue to dispute what counts as a truly open model.

Eric Horvitz, Microsoft's chief scientific officer and a founding member of AI2's Scientific Advisory Board, expressed enthusiasm for OLMo's release, highlighting the value of community-driven open-source initiatives in shaping the future of AI.

Alongside OLMo, AI2 also introduced Paloma, a benchmark for evaluating open language models across a wide range of text domains, from niche artist communities to Reddit forums on mental health, giving researchers and developers a broad evaluation tool.
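Domain-level evaluation of this kind generally comes down to measuring perplexity on held-out text from each domain. The sketch below illustrates that general technique with a generic Hugging Face causal language model; it is not Paloma's actual harness, and the model identifier is assumed.

```python
# Illustrative sketch of per-domain perplexity evaluation, the kind of
# measurement a benchmark like Paloma reports. Not Paloma's actual code.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "allenai/OLMo-7B"  # assumed identifier
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
model.eval()

def perplexity(text: str) -> float:
    """Perplexity of `text` under the model (lower means a better fit)."""
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # With labels supplied, the model returns the mean cross-entropy loss.
        loss = model(**enc, labels=enc["input_ids"]).loss
    return math.exp(loss.item())

# Held-out snippets standing in for different evaluation domains.
domains = {
    "web_text": "The city council voted on Tuesday to approve the new budget.",
    "code_comments": "# Returns the index of the first element greater than x.",
}
for name, sample in domains.items():
    print(f"{name}: perplexity = {perplexity(sample):.1f}")
```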

Companies can also obtain OLMo's pretraining dataset, Dolma, for commercial applications via the Hugging Face Hub; it comprises three trillion tokens drawn from web content, academic publications, and books. This extensive dataset broadens OLMo's applicability across domains and applications, solidifying its position as a notable advance in open-source AI.
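As a rough illustration, the corpus could be pulled with the `datasets` library, streaming rather than downloading given its size. The dataset identifier "allenai/dolma" and the availability of streaming access are assumptions; the exact name and any license gating may differ.

```python
# Minimal sketch: stream records from the Dolma pretraining corpus.
# Assumes the dataset is hosted on the Hugging Face Hub as "allenai/dolma"
# and supports streaming; check the Hub page for access requirements.
from datasets import load_dataset

dolma = load_dataset("allenai/dolma", split="train", streaming=True)

# Inspect a handful of documents without downloading the full corpus.
for i, doc in enumerate(dolma):
    print(doc.get("text", "")[:200])
    if i >= 2:
        break
```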