Language models like GPT-4 and Claude are powerful and useful, but the data they are trained on is a closely guarded secret. The Allen Institute for AI (AI2) aims to reverse this trend with a massive new text dataset that is free to use and open to inspection. Called Dolma, the dataset is intended to serve as the basis for AI2's planned open language model, OLMo. That approach stands in contrast to companies like OpenAI and Meta, which treat details about their training data as proprietary.

Dolma is the largest open dataset of its kind so far, at 3 trillion tokens, and its sources and processing pipeline are publicly documented. It is released under AI2's "ImpACT" license, which requires users to provide contact information, disclose derivative creations, distribute derivatives under the same license, and agree not to apply Dolma to prohibited uses. Anyone concerned that their personal data appears in the dataset can request its removal.

Dolma is available via Hugging Face.
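For readers who want to poke at the data themselves, the following is a minimal sketch of how one might stream Dolma with the Hugging Face `datasets` library. The repository id `allenai/dolma` and the record field names are assumptions rather than details confirmed by this article; check the dataset card on Hugging Face for the current id and any ImpACT-license acceptance steps before running it.

```python
# Minimal sketch: streaming a few Dolma documents via Hugging Face `datasets`.
# The repo id "allenai/dolma" and the "text" field are assumptions; consult the
# dataset card for the authoritative id, configs, and license-acceptance steps.
from datasets import load_dataset

# Stream rather than download: a trillion-token-scale corpus will not fit on
# a typical workstation, and streaming fetches records lazily.
dolma = load_dataset("allenai/dolma", split="train", streaming=True)

# Print the start of the first few documents to inspect the record structure.
for i, doc in enumerate(dolma):
    print(doc.get("text", "")[:200])
    if i >= 4:
        break
```

Streaming mode keeps the example self-contained while avoiding a multi-terabyte download; for real training runs one would typically shard and cache the data instead.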