Access to high-quality data matters. Although the mass media tends to focus on building large data centers and scaling up models, when I speak with friends at companies that train foundation models, many describe a large share of their daily challenges as data preparation. Specifically, a significant fraction of their day-to-day work follows the usual data-centric AI practices: identifying high-quality data (books are one important source), cleaning data (the ruling describes Anthropic taking steps like removing book pages' headers, footers, and page numbers), carrying out error analyses to figure out what types of data to acquire more of, and inventing new ways to generate synthetic data.
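To make the cleaning step concrete, here is a minimal sketch of how one might strip page numbers and repeated headers or footers from scanned book pages. This is an illustration only, not Anthropic's actual pipeline, which the ruling does not detail. The function name `clean_pages`, the `min_repeat_fraction` parameter, and the heuristic itself (treat a first or last line as a header or footer if it recurs across many pages) are all assumptions of mine.

```python
import re
from collections import Counter

# Hypothetical pattern: a line holding only a number, optionally prefixed by "Page".
PAGE_NUMBER = re.compile(r"^\s*(?:page\s+)?\d+\s*$", re.IGNORECASE)


def clean_pages(pages, min_repeat_fraction=0.5):
    """Strip page-number lines and repeated header/footer lines from page texts.

    `pages` is a list of strings, one per page. A first (or last) line counts
    as a header (or footer) if it appears on at least `min_repeat_fraction`
    of the pages, and on at least two pages.
    """
    # Split each page into non-blank lines.
    split_pages = [[ln for ln in page.splitlines() if ln.strip()] for page in pages]

    # Drop lines that contain nothing but a page number.
    split_pages = [
        [ln for ln in lines if not PAGE_NUMBER.match(ln)] for lines in split_pages
    ]

    # Count how often each first and last line recurs across pages.
    first = Counter(lines[0].strip() for lines in split_pages if lines)
    last = Counter(lines[-1].strip() for lines in split_pages if lines)
    threshold = max(2, int(len(pages) * min_repeat_fraction))
    headers = {text for text, count in first.items() if count >= threshold}
    footers = {text for text, count in last.items() if count >= threshold}

    cleaned = []
    for lines in split_pages:
        if lines and lines[0].strip() in headers:
            lines = lines[1:]
        if lines and lines[-1].strip() in footers:
            lines = lines[:-1]
        cleaned.append("\n".join(lines))
    return cleaned
```

A heuristic like this has obvious failure modes (for example, a body line consisting only of a number would be dropped), which is part of why this kind of work takes up so much day-to-day effort.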

musk on rephrasing

HLE leaderboard