Superseded by the European Open Source AI Index. The table below is provided for historical purposes but is no longer updated. We have tripled the number of models covered and now include code, audio, and image models at osai-index.eu.

There is a growing number of instruction-tuned text generators billing themselves as 'open source'. How open are they really? 🔗 FAccT'24 🔗 CUI'23

Column groups: Availability (Open code, LLM data, LLM weights, RL data, RL weights, License), Documentation (Code, Architecture, Preprint, Paper, Modelcard, Datasheet), and Access (Package, API).

| Project | Maker | LLM base | RL base | Open code | LLM data | LLM weights | RL data | RL weights | License | Code | Architecture | Preprint | Paper | Modelcard | Datasheet | Package | API |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| OLMo 7B Instruct | Ai2 | OLMo 7B | OpenInstruct | ✔︎ | ✔︎ | ✔︎ | ✔︎ | ✔︎ | ✔︎ | ✔︎ | ✔︎ | ✔︎ | ✘ | ✔︎ | ✔︎ | ✔︎ | ~ |
| BLOOMZ | bigscience-workshop | BLOOMZ, mT0 | xP3 | ✔︎ | ✔︎ | ✔︎ | ✔︎ | ~ | ~ | ✔︎ | ✔︎ | ✔︎ | ✔︎ | ✔︎ | ✔︎ | ✘ | ✔︎ |
| AmberChat | LLM360 | Amber | ShareGPT + Evol-Instruct (synthetic) | ✔︎ | ✔︎ | ✔︎ | ✔︎ | ✔︎ | ✔︎ | ~ | ~ | ✔︎ | ✘ | ~ | ~ | ✘ | ✔︎ |
| Open Assistant | LAION-AI | Pythia 12B | OpenAssistant Conversations | ✔︎ | ✔︎ | ✔︎ | ✔︎ | ✘ | ✔︎ | ✔︎ | ✔︎ | ~ | ✘ | ✘ | ✘ | ✔︎ | ✔︎ |
| OpenChat 3.5 7B | Tsinghua University | Mistral 7B | ShareGPT with C-RLFT | ✔︎ | ✘ | ✔︎ | ✘ | ✔︎ | ✔︎ | ~ | ✔︎ | ✔︎ | ✔︎ | ~ | ✘ | ✔︎ | ~ |
| Pythia-Chat-Base-7B-v0.16 | togethercomputer | EleutherAI pythia | OIG | ✔︎ | ✔︎ | ✔︎ | ✔︎ | ✘ | ✔︎ | ✔︎ | ✔︎ | ~ | ✘ | ~ | ~ | ✔︎ | ✘ |
| Cerebras GPT 111M Instruction | Cerebras + Schramm | Cerebras | Alpaca (synthetic) | ~ | ✔︎ | ✔︎ | ✔︎ | ✔︎ | ~ | ✘ | ✔︎ | ~ | ✘ | ✘ | ✔︎ | ✘ | ✔︎ |
| RedPajama-INCITE-Instruct-7B | TogetherComputer | RedPajama-INCITE-7B-Base | various (GPT-JT recipe) | ~ | ✔︎ | ✔︎ | ✔︎ | ✔︎ | ~ | ~ | ~ | ✘ | ✘ | ✔︎ | ✔︎ | ✘ | ~ |
| dolly | databricks | EleutherAI pythia | databricks-dolly-15k | ✔︎ | ✔︎ | ✔︎ | ✔︎ | ✘ | ✔︎ | ✔︎ | ✔︎ | ~ | ✘ | ✘ | ✘ | ✔︎ | ✘ |
| Tulu V2 DPO 70B | AllenAI | Llama2 | Tulu SFT, Ultrafeedback | ✔︎ | ✘ | ~ | ✔︎ | ✔︎ | ~ | ~ | ~ | ✔︎ | ✘ | ~ | ~ | ✘ | ✔︎ |
| MPT-30B Instruct | MosaicML | MosaicML | dolly, anthropic | ✔︎ | ~ | ✔︎ | ~ | ✘ | ✔︎ | ✔︎ | ~ | ✘ | ✘ | ~ | ✘ | ✔︎ | ~ |
| MPT-7B Instruct | MosaicML | MosaicML | dolly, anthropic | ✔︎ | ~ | ✔︎ | ~ | ✘ | ✔︎ | ✔︎ | ~ | ✘ | ✘ | ✔︎ | ✘ | ✔︎ | ✘ |
| trlx | carperai | various (pythia, flan, OPT) | various | ✔︎ | ✔︎ | ✔︎ | ~ | ✘ | ✔︎ | ✔︎ | ~ | ✘ | ✘ | ✘ | ✘ | ~ | ✔︎ |
| NeuralChat 7B | Intel | Mistral 7B | Orca | ~ | ✘ | ✔︎ | ✔︎ | ✔︎ | ✔︎ | ~ | ~ | ✘ | ✘ | ~ | ~ | ~ | ✘ |
| Vicuna 13B v 1.3 | LMSYS | LLaMA | ShareGPT | ✔︎ | ~ | ✔︎ | ✘ | ✘ | ~ | ✔︎ | ✘ | ✔︎ | ✘ | ~ | ✘ | ✔︎ | ~ |
| minChatGPT | ethanyanjiali | GPT2 | anthropic | ✔︎ | ✔︎ | ✔︎ | ~ | ✘ | ✔︎ | ✔︎ | ~ | ✘ | ✘ | ✘ | ✘ | ✘ | ✔︎ |
| ChatRWKV | BlinkDL/RWKV | RWKV-LM | alpaca, shareGPT (synthetic) | ✔︎ | ~ | ✔︎ | ✘ | ✘ | ✔︎ | ~ | ~ | ~ | ✘ | ✘ | ✘ | ✔︎ | ~ |
| BELLE | KE Technologies | LLaMA & BLOOMZ | alpaca, shareGPT, Belle (synthetic) | ✔︎ | ~ | ~ | ~ | ~ | ✘ | ~ | ✔︎ | ✔︎ | ✘ | ✘ | ~ | ✘ | ✘ |
| Geitje Ultra 7B | Bram Vanroy | Mistral 7B | Ultrafeedback Dutch (synthetic) | ✘ | ~ | ✔︎ | ✔︎ | ✔︎ | ✘ | ✘ | ~ | ~ | ✘ | ~ | ~ | ✘ | ~ |
| Phi 3 Instruct | Microsoft | Phi3 | Unspecified | ✘ | ✘ | ✘ | ✘ | ✔︎ | ✔︎ | ✘ | ✔︎ | ~ | ✘ | ✔︎ | ✘ | ~ | ✔︎ |
| WizardLM 13B v1.2 | Microsoft & Peking University | LLaMA2-13B | Evol-Instruct (synthetic) | ~ | ✘ | ~ | ✔︎ | ✔︎ | ~ | ~ | ✔︎ | ✔︎ | ✘ | ✘ | ✘ | ✘ | ✘ |
| Airoboros L2 70B GPT4 | Jon Durbin | Llama2 | Airoboros (synthetic) | ~ | ✘ | ~ | ✔︎ | ✔︎ | ~ | ~ | ~ | ✘ | ✘ | ~ | ~ | ✘ | ✘ |
| ChatGLM-6B | THUDM | GLM (own) | Unspecified | ~ | ~ | ✔︎ | ✘ | ✘ | ✔︎ | ~ | ~ | ✘ | ~ | ✘ | ✘ | ✘ | ✔︎ |
| Mistral 7B-Instruct | Mistral AI | unclear | unspecified | ~ | ✘ | ✔︎ | ✘ | ~ | ✔︎ | ✘ | ~ | ~ | ✘ | ✘ | ✘ | ~ | ✔︎ |
| WizardLM-7B | Microsoft & Peking University | LLaMA-7B | Evol-Instruct (synthetic) | ~ | ~ | ✘ | ✔︎ | ~ | ~ | ~ | ✔︎ | ✔︎ | ✘ | ✘ | ✘ | ✘ | ✘ |
| Mistral NeMo Instruct | Mistral AI | Mistral NeMo | unspecified | ~ | ✘ | ✔︎ | ✘ | ~ | ✔︎ | ✘ | ~ | ✘ | ✘ | ✘ | ✘ | ~ | ✔︎ |
| Qwen 1.5 | Alibaba Cloud | QwenLM | Unspecified | ~ | ✘ | ✔︎ | ✘ | ✔︎ | ✘ | ~ | ~ | ✘ | ✘ | ✘ | ✘ | ~ | ✔︎ |
| StableVicuna-13B | CarperAI | LLaMA | OASST1 (human), GPT4All (human), Alpaca (synthetic) | ~ | ✘ | ~ | ~ | ~ | ~ | ~ | ~ | ~ | ✘ | ~ | ✘ | ✘ | ~ |
| Falcon-40B-instruct | Technology Innovation Institute | Falcon 40B | Baize (synthetic) | ✘ | ~ | ✔︎ | ~ | ✘ | ✔︎ | ✘ | ~ | ~ | ✘ | ~ | ✘ | ✘ | ✘ |
| UltraLM | OpenBMB | LLaMA2 | UltraFeedback (part synthetic) | ✘ | ✘ | ~ | ✔︎ | ~ | ✘ | ✘ | ~ | ✔︎ | ✘ | ~ | ~ | ✘ | ✘ |
| Yi 34B Chat | 01.AI | Yi 34B | unspecified | ~ | ✘ | ✔︎ | ✘ | ✔︎ | ~ | ✘ | ✘ | ✔︎ | ✘ | ✘ | ✘ | ✘ | ~ |
| Koala 13B | BAIR | LLaMA 13B | HC3, ShareGPT, alpaca (synthetic) | ✔︎ | ~ | ~ | ~ | ✘ | ~ | ~ | ~ | ✘ | ✘ | ✘ | ✘ | ✘ | ✘ |
| Llama 3.1 | Facebook Research | Meta Llama 3 | Meta, undocumented | ~ | ✘ | ~ | ✘ | ✘ | ✘ | ~ | ~ | ✘ | ✘ | ~ | ✘ | ✔︎ | ~ |
| Mixtral 8x7B Instruct | Mistral AI | Mistral | Unspecified | ✘ | ✘ | ✔︎ | ✘ | ~ | ✔︎ | ✘ | ~ | ~ | ✘ | ✘ | ✘ | ~ | ✘ |
| Stable Beluga 2 | Stability AI | LLaMA2 | Orca-style (synthetic) | ✘ | ✘ | ~ | ✘ | ✔︎ | ~ | ✘ | ~ | ~ | ✘ | ~ | ✘ | ✘ | ~ |
| Stanford Alpaca | Stanford University CRFM | LLaMA | Self-Instruct (synthetic) | ✔︎ | ✘ | ~ | ~ | ~ | ✘ | ~ | ✔︎ | ✘ | ✘ | ✘ | ✘ | ✘ | ✘ |
| Falcon-180B-chat | Technology Innovation Institute | Falcon 180B | OpenPlatypus, Ultrachat, Airoboros (synthetic) | ✘ | ~ | ~ | ~ | ~ | ✘ | ✘ | ~ | ~ | ✘ | ~ | ✘ | ✘ | ✘ |
| Gemma 7B Instruct | Google DeepMind | Gemma | Unspecified | ~ | ✘ | ~ | ✘ | ~ | ✘ | ✘ | ~ | ~ | ✘ | ✔︎ | ✘ | ✘ | ✘ |
| Orca 2 | Microsoft Research | LLaMA2 | FLAN, Math, undisclosed (synthetic) | ✘ | ✘ | ~ | ✘ | ✔︎ | ✘ | ✘ | ~ | ~ | ✘ | ~ | ✘ | ✘ | ~ |
| Command R+ | Cohere AI | unspecified | Aya Collection | ✘ | ✘ | ✘ | ✔︎ | ✔︎ | ~ | ✘ | ✘ | ✘ | ✘ | ~ | ✘ | ✘ | ✘ |
| LLaMA2 Chat | Facebook Research | LLaMA2 | Meta, StackExchange, Anthropic | ✘ | ✘ | ~ | ✘ | ~ | ✘ | ✘ | ~ | ~ | ✘ | ~ | ✘ | ✘ | ~ |
| Nanbeige2-Chat | Nanbeige LLM lab | Unknown | Unknown | ✔︎ | ✘ | ✘ | ✘ | ✔︎ | ~ | ✘ | ✘ | ✘ | ✘ | ✘ | ✘ | ✘ | ~ |
| Llama 3 Instruct | Facebook Research | Meta Llama 3 | Meta, undocumented | ✘ | ✘ | ~ | ✘ | ~ | ✘ | ✘ | ~ | ✘ | ✘ | ~ | ✘ | ✘ | ~ |
| Solar 70B | Upstage AI | LLaMA2 | Orca-style, Alpaca-style | ✘ | ✘ | ~ | ✘ | ~ | ✘ | ✘ | ✘ | ✘ | ✘ | ~ | ✘ | ✘ | ~ |
| Xwin-LM | Xwin-LM | LLaMA2 | unknown | ✘ | ✘ | ~ | ✘ | ✘ | ✘ | ✘ | ✘ | ✘ | ✘ | ✘ | ✘ | ✘ | ~ |
| ChatGPT | OpenAI | GPT 3.5 | Instruct-GPT | ✘ | ✘ | ✘ | ✘ | ✘ | ✘ | ✘ | ✘ | ~ | ✘ | ✘ | ✘ | ✘ | ✘ |

How to use this table. Every cell records a three-level openness judgement (✔︎ open, ~ partial, ✘ closed) with a direct link to the available evidence; on hover, a cell displays the notes we have on file for that judgement. The name of each project links directly to its source data. The table is sorted by cumulative openness, where ✔︎ counts as 1 point, ~ as 0.5, and ✘ as 0. Note that RL may refer to RLHF or to other forms of fine-tuning aimed at fostering instruction-following behaviour.
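To make the sorting concrete, here is a minimal sketch of the scoring arithmetic in Python. The function and feature names are ours, invented for illustration (they are not from any project's codebase); the judgements for OLMo 7B Instruct are read off the table above.

```python
# Cumulative openness: each of the 14 features is judged
# open (1 point), partial (0.5) or closed (0); the table is
# sorted by the sum of these points.

SCORES = {"open": 1.0, "partial": 0.5, "closed": 0.0}

FEATURES = [
    "open code", "LLM data", "LLM weights", "RL data", "RL weights",
    "license", "code docs", "architecture", "preprint", "paper",
    "modelcard", "datasheet", "package", "API",
]

def cumulative_openness(judgements: dict) -> float:
    """Sum the per-feature points for one project."""
    return sum(SCORES[judgements[f]] for f in FEATURES)

# Judgements for OLMo 7B Instruct as recorded in the table above:
# every feature is open except Paper (closed) and API (partial).
olmo = dict.fromkeys(FEATURES, "open")
olmo["paper"] = "closed"
olmo["API"] = "partial"

print(cumulative_openness(olmo))  # 12.5 out of a maximum of 14.0
```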

Why is openness important?

Open research is the lifeblood of cumulative progress in science and engineering. Openness is key for fundamental research, for fostering critical computational literacy, and for making informed choices for or against deployment of instruction-tuned LLM architectures. The closed & proprietary nature of ChatGPT and kin makes them fundamentally unfit for responsible use in research and education.

Open alternatives provide ways to build reproducible workflows, chart resource costs, and lessen reliance on corporate whims. One aim of our work here is to provide tools to track openness, transparency and accountability in the fast-evolving landscape of instruction-tuned text generators. Read more in the paper (PDF) or contribute to the repo.

TL;DR

Our paper surveys the openness of these instruction-tuned text generators along the dimensions tabulated above, and identifies recurrent patterns in how purportedly open models are released and documented.

We conclude as follows:

Openness is not the full solution to the scientific and ethical challenges of conversational text generators. Open data will not mitigate the harmful consequences of thoughtless deployment of large language models, nor the questionable copyright implications of scraping all publicly available data from the internet. However, openness does make original research possible, including efforts to build reproducible workflows and understand the fundamentals of instruction-tuned LLM architectures. Openness also enables checks and balances, fostering a culture of accountability for data and its curation, and for models and their deployment. We hope that our work provides a small step in this direction.

Papers

Liesenfeld, Andreas, Alianda Lopez, and Mark Dingemanse. 2023. “Opening up ChatGPT: Tracking Openness, Transparency, and Accountability in Instruction-Tuned Text Generators.” In CUI '23: Proceedings of the 5th International Conference on Conversational User Interfaces. July 19–21, Eindhoven. doi: 10.1145/3571884.3604316 (PDF).

Liesenfeld, Andreas, and Mark Dingemanse. 2024. “Rethinking Open Source Generative AI: Open Washing and the EU AI Act.” In FAccT '24: Proceedings of the 2024 ACM Conference on Fairness, Accountability, and Transparency, 1774–1787. New York: Association for Computing Machinery. doi: 10.1145/3630106.3659005.

"