Language models like GPT-4 and Claude are powerful and useful, but the data they are trained on is a closely guarded secret. The Allen Institute for AI (AI2) aims to reverse this trend with a massive new text dataset that is free to use and open to inspection. Called Dolma, the dataset is intended to serve as the basis for AI2's planned open language model, OLMo. That approach stands in contrast to companies like OpenAI and Meta, which treat details about their training data as proprietary.

Dolma is the largest open dataset of its kind so far, at 3 trillion tokens, and its sources and processing pipeline are publicly documented. It is released under AI2's "ImpACT" license, which requires users to provide contact information, disclose derivative creations, distribute derivatives under the same license, and agree not to apply Dolma to prohibited uses. Anyone concerned that their personal data appears in the dataset can request its removal.

Dolma is available via Hugging Face.
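For readers who want to poke at the data themselves, the following is a minimal sketch of how one might stream Dolma with the Hugging Face `datasets` library. The repository id `allenai/dolma` and the record field names are assumptions rather than details confirmed by this article; check the dataset card on Hugging Face for the current id and any ImpACT-license acceptance steps before running it.

```python
# Minimal sketch: streaming a few Dolma documents via Hugging Face `datasets`.
# The repo id "allenai/dolma" and the "text" field are assumptions; consult the
# dataset card for the authoritative id, configs, and license-acceptance steps.
from datasets import load_dataset

# Stream rather than download: a trillion-token-scale corpus will not fit on
# a typical workstation, and streaming fetches records lazily.
dolma = load_dataset("allenai/dolma", split="train", streaming=True)

# Print the start of the first few documents to inspect the record structure.
for i, doc in enumerate(dolma):
    print(doc.get("text", "")[:200])
    if i >= 4:
        break
```

Streaming mode keeps the example self-contained while avoiding a multi-terabyte download; for real training runs one would typically shard and cache the data instead.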