CS 294-288: Data-Centric Large Language Models

Instructor: Sewon Min
Class hours: Tue/Thu 12:30–2:00 (12:40–2:00, accounting for Berkeley time)
Class location: Berkeley Way West 1203 (Tuesdays) and 1104 (Thursdays)
Office hours: By appointment
Contact: sewonm@berkeley.edu (Please include “294-288” in the email subject)
Feedback form: https://forms.gle/ytMKnKWLm2onutqb7

If you are interested in taking the course and can’t directly enroll, please submit this form.


Advances in large language models (LLMs) have been driven by the growing availability of large and diverse datasets. But where do these datasets come from? How are they being used? How can we leverage them in better, more creative ways? What challenges or issues do they present, and how might we address them? In this seminar, we will explore these questions as part of a broader effort to rethink how data is used in the development of LLMs: what data we use, how we use it, why it works, and what problems it brings.

The class is primarily designed for PhD students and is based on paper readings, discussions, and an open-ended project. Students are expected to understand the assigned papers independently and to have a background in ML, NLP (CS 288 or equivalent), and LLMs.

We will use Slack for most communication (no Ed!). You should have been added to Slack by now; if not, email us or come to the first class.

Class Syllabus

Date Class
08/28 Thu Introduction [slides]
08/28 Thu 6pm is the deadline for submitting the presentation / role preferences.
09/02 Tue Pre-training data curation [slides]
Language Models are Few-Shot Learners
DataComp-LM: In search of the next generation of training sets for language models
FineWeb: decanting the web for the finest text data at scale

Suggested readings:
Language Models are Unsupervised Multitask Learners (Sec 2.1)
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer (Sec 2.2)
Deduplicating Training Data Makes Language Models Better
The Pile: An 800GB Dataset of Diverse Text for Language Modeling
Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research
Nemotron-CC: Transforming Common Crawl into a Refined Long-Horizon Pretraining Dataset
09/04 Thu Synthetic pre-training data; Guest Lecture by Pratyush Maini (CMU / DatologyAI) [slides]
Rephrasing the Web: A Recipe for Compute and Data-Efficient Language Modeling
BeyondWeb: Lessons from Scaling Synthetic Data for Trillion-scale Pretraining

Suggested readings:
TinyStories: How Small Can Language Models Be and Still Speak Coherent English?
Textbooks are all you need
Cosmopedia: how to create large-scale synthetic data for pre-training
Kimi K2: Open Agentic Intelligence (Sec 2.2)
09/09 Tue Scaling laws by Jongho Park & Prasann Singhal [pre-lecture questions]
Training Compute-Optimal Large Language Models
Language models scale reliably with over-training and on downstream tasks

Suggested readings:
Deep Learning Scaling is Predictable, Empirically
Scaling Laws for Neural Language Models
Emergent Abilities of Large Language Models
Scaling Laws Are Unreliable for Downstream Tasks: A Reality Check
Scaling Laws for Data Filtering – Data Curation cannot be Compute Agnostic
Scaling Data-Constrained Language Models
Compute-Constrained Data Selection
09/11 Thu Post-training data by Qiuyang Mang & Huanzhi Mao [pre-lecture questions]
Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback (Sec 2)
Tulu 3: Pushing Frontiers in Open Language Model Post-Training

Suggested readings:
Self-Instruct: Aligning Language Models with Self-Generated Instructions
The Llama 3 Herd of Models (Sec 4 and relevant parts in Sec 5)
Vicuna
Secrets of RLHF in Large Language Models Part I: PPO
Direct Preference Optimization: Your Language Model is Secretly a Reward Model
09/16 Tue Synthetic data & distillation by Ryan Wang & Charles Xu [pre-lecture questions]
Alpaca: A Strong, Replicable Instruction-Following Model and Textbooks Are All You Need
The False Promise of Imitating Proprietary LLMs

Suggested readings:
Self-Instruct: Aligning Language Models with Self-Generated Instructions
LIMA: Less Is More for Alignment
The Unlocking Spell on Base LLMs: Rethinking Alignment via In-Context Learning
09/18 Thu Bias and copyright by Nathan Ju & Sidhika Balachandar [pre-lecture questions]
Bias: Whose Language Counts as High Quality? Measuring Language Ideologies in Text Data Selection
Copyright: The Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text

Suggested readings:
Bias: Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus
Bias: From Pretraining Data to Language Models to Downstream Tasks: Tracking the Trails of Political Biases Leading to Unfair NLP Models
Copyright: Foundation Models and Fair Use
Copyright: Consent in Crisis: The Rapid Decline of the AI Data Commons
Copyright: SILO Language Models: Isolating Legal Risk In a Nonparametric Datastore
09/23 Tue Evaluation by Sangdae Nam & Xutao Ma
Measuring Massive Multitask Language Understanding
Measuring short-form factuality in large language models

Suggested readings:
GPQA: A Graduate-Level Google-Proof Q&A Benchmark
OLMES: A Standard for Language Model Evaluations
UQ: Assessing Language Models on Unsolved Questions
09/25 Thu Reasoning models I; Guest Lecture by Negin Raoof (UC Berkeley) on “OpenThoughts: Data Recipes for Reasoning Models”
09/30 Tue Reasoning models II by Charlie Ruan & Kaiwen Hu
s1: Simple test-time scaling and LIMO: Less is More for Reasoning
Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning

Suggested readings:
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
OpenThoughts: Data Recipes for Reasoning Models
Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning
DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
10/02 Thu Rethinking reasoning models by Dennis Jacob & Harman Singh
Spurious Rewards: Rethinking Training Signals in RLVR
Reasoning or Memorization? Unreliable Results of Reinforcement Learning Due to Data Contamination

Suggested readings:
LIMA: Less Is More for Alignment
LLMs Can Easily Learn to Reason from Demonstrations; Structure, not content, is what matters!
Learning to Reason without External Rewards
Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?
Revisiting Reinforcement Learning for LLM Reasoning from A Cross-Domain Perspective
10/07 Tue Replaced by offline feedback sessions on project proposals
10/09 Thu Replaced by offline feedback sessions on project proposals
10/09 Thu 6pm is the deadline for the project proposal
10/14 Tue Mixture of Experts by Sanjay Adhikesaven & Yuezhou Hu
OLMoE: Open Mixture-of-Experts Language Models and DeepSeek-V3 Technical Report
Branch-Train-MiX: Mixing Expert LLMs into a Mixture-of-Experts LLM

Suggested readings:
Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding
Branch-Train-Merge: Embarrassingly Parallel Training of Expert Language Models
Nexus: Specialization meets Adaptability for Efficiently Training Mixture of Experts
FlexOlmo: Open Language Models for Flexible Data Use
10/16 Thu Retrieval-based LMs by Yichuan Wang & Bhavya Chopra
Improving language models by retrieving from trillions of tokens
Scaling Retrieval-Based Language Models with a Trillion-Token Datastore

Suggested readings:
In-Context Retrieval-Augmented Language Models
When Not to Trust Language Models: Investigating Effectiveness of Parametric and Non-Parametric Memories
Can Retriever-Augmented Language Models Reason? The Blame Game Between the Retriever and the Language Model
Great Memory, Shallow Reasoning: Limits of kNN-LMs
Frustratingly Simple Retrieval Improves Challenging, Reasoning-Intensive Benchmarks
10/21 Tue Long context and retrieval by Dongwei Lyu & Siddharth Gollapudi
Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More?
Long-Context LLMs Meet RAG: Overcoming Challenges for Long Inputs in RAG
10/23 Thu Guest Lecture by Barlas Oğuz (Meta AI)
10/28 Tue Creativity & Model Collapse by Téa Wright & Donghyun Lee
The Curious Decline of Linguistic Diversity: Training Language Models on Synthetic Text
Is Model Collapse Inevitable? Breaking the Curse of Recursion by Accumulating Real and Synthetic Data

Suggested readings:
How much do language models copy from their training data? Evaluating linguistic novelty in text generation using RAVEN
AI as Humanity’s Salieri: Quantifying Linguistic Creativity of Language Models via Systematic Attribution of Machine Text against Web Text
The Curse of Recursion: Training on Generated Data Makes Models Forget
Strong Model Collapse
Self-Consuming Generative Models Go MAD (vision)
10/30 Thu Class Activity: Discussion of Talks from the BAIR-NLP Workshop
11/04 Tue Memorization by Kalvin Chang & Juno Kim

Option 1: Membership inference
Detecting Pretraining Data from Large Language Models
Membership Inference Attacks Cannot Prove that a Model Was Trained On Your Data

Suggested readings:
Do Membership Inference Attacks Work on Large Language Models?
Reassessing EMNLP 2024’s Best Paper: Does Divergence-Based Calibration for Membership Inference Attacks Hold Up?
LLM Dataset Inference: Did you train on my dataset?

Option 2: Training data extraction
Extracting Training Data from Large Language Models
Language Models May Verbatim Complete Text They Were Not Explicitly Trained On

Suggested readings:
Recite, Reconstruct, Recollect: Memorization in LMs as a Multifaceted Phenomenon
Extracting memorized pieces of (copyrighted) books from open-weight language models
Measuring memorization in language models via probabilistic extraction
11/06 Thu Search agents by Hanchen Li & Shangyin Tan
An Empirical Study on Reinforcement Learning for Reasoning-Search Interleaved LLM Agents
ZeroSearch: Incentivize the Search Capability of LLMs without Searching
11/11 Tue Academic and Administrative Holiday
11/13 Thu Vision-language models by Junyi Zhang & Colin Wang
Emerging Properties in Unified Multimodal Pretraining
Holistic Evaluation for Interleaved Text-and-Image Generation
11/18 Tue Project Presentation I
11/20 Thu Project Presentation II
11/25 Tue Replaced by offline feedback sessions on project reports
11/27 Thu Thanksgiving Break
12/02 Tue Replaced by offline feedback sessions on project reports
12/04 Thu Replaced by offline feedback sessions on project reports
12/10 Wed 6pm is the deadline for the final paper

Deadlines

Weekly deadlines

  • Main presenters: Submit slides in the dedicated Slack channel 72 hours before class.
  • Panelists & follow-up researchers: Submit slides 48 hours before class in the same channel.
  • Auditors: Submit after-class reviews via the Google Form before the next class.
  • All other students: Submit responses to pre-lecture questions via the Google Form before class.
  • 08/28 (Thu): Submit the form for topic/role preferences and proposals for open slots.
    • Before 09/01 (Mon), we will announce assignments on Slack.
  • 09/17 (Wed): Submit the project preferences (teammates, topic) here.
    • By 09/19 (Fri), project team assignments will be announced on Slack.
  • 10/09 (Thu): Submit the project proposal here.
  • 11/17 (Mon): If you are presenting on 11/18, submit the project presentation slides here with a filename starting with your 2-digit team number.
  • 11/19 (Wed): If you are presenting on 11/20, submit the project presentation slides here with a filename starting with your 2-digit team number.
  • 12/10 (Wed): Submit the final paper here.

Acknowledgement

We are grateful to VESSL AI and Google Cloud for providing compute credits to support our final projects.


The class format and guidelines are largely adapted from the following classes: