CS 294-288: Data-Centric Large Language Models

Instructor: Sewon Min
Class hours: Tue/Thu 12:30–2:00 (12:40–2:00, accounting for Berkeley time)
Class location: Berkeley Way West 1203 (Tuesdays) and 1104 (Thursdays)
Office hours: By appointment
Contact: sewonm@berkeley.edu (Please include “294-288” in the email subject)
Feedback form: https://forms.gle/ytMKnKWLm2onutqb7

If you are interested in taking the course and can’t directly enroll, please submit this form.


Advances in large language models (LLMs) have been driven by the growing availability of large and diverse datasets. But where do these datasets come from? How are they being used? How can we leverage them in better, more creative ways? What challenges or issues do they present, and how might we address them? In this seminar, we will explore these questions as part of a broader effort to rethink how data is used in the development of LLMs: what data we use, how we use it, why it works, and what problems it brings.

The class is primarily designed for PhD students and is based on paper readings, discussions, and an open-ended project. Students are expected to understand the assigned papers independently and to have a background in ML, NLP (CS 288 or equivalent), and LLMs.

We will use Slack for most communication (no Ed!). You should have been added to Slack by now; if not, email us or come to the first class.

Class Syllabus

Date Class
08/28 Thu Introduction [slides]
08/28 Thu 6pm is the deadline for submitting the presentation / role preferences.
09/02 Tue Pre-training data curation [slides]
Language Models are Few-Shot Learners
DataComp-LM: In search of the next generation of training sets for language models
FineWeb: decanting the web for the finest text data at scale

Suggested readings:
Language Models are Unsupervised Multitask Learners (Sec 2.1)
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer (Sec 2.2)
Deduplicating Training Data Makes Language Models Better
The Pile: An 800GB Dataset of Diverse Text for Language Modeling
Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research
Nemotron-CC: Transforming Common Crawl into a Refined Long-Horizon Pretraining Dataset
09/04 Thu Synthetic pre-training data; Guest Lecture by Pratyush Maini (CMU / DatologyAI) [slides]
Rephrasing the Web: A Recipe for Compute and Data-Efficient Language Modeling
BeyondWeb: Lessons from Scaling Synthetic Data for Trillion-scale Pretraining

Suggested readings:
TinyStories: How Small Can Language Models Be and Still Speak Coherent English?
Textbooks are all you need
Cosmopedia: how to create large-scale synthetic data for pre-training
Kimi K2: Open Agentic Intelligence (Sec 2.2)
09/09 Tue Scaling laws by Jongho Park & Prasann Singhal [pre-lecture questions]
Training Compute-Optimal Large Language Models
Language models scale reliably with over-training and on downstream tasks

Suggested readings:
Deep Learning Scaling is Predictable, Empirically
Scaling Laws for Neural Language Models
Emergent Abilities of Large Language Models
Scaling Laws Are Unreliable for Downstream Tasks: A Reality Check
Scaling Laws for Data Filtering – Data Curation cannot be Compute Agnostic
Scaling Data-Constrained Language Models
Compute-Constrained Data Selection
09/11 Thu Post-training data by Qiuyang Mang & Huanzhi Mao [pre-lecture questions]
Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback (Sec 2)
Tulu 3: Pushing Frontiers in Open Language Model Post-Training

Suggested readings:
Self-Instruct: Aligning Language Models with Self-Generated Instructions
The Llama 3 Herd of Models (Sec 4 and relevant parts in Sec 5)
Vicuna
Secrets of RLHF in Large Language Models Part I: PPO
Direct Preference Optimization: Your Language Model is Secretly a Reward Model
09/16 Tue Synthetic data & distillation by Ryan Wang & Charles Xu [pre-lecture questions]
Alpaca: A Strong, Replicable Instruction-Following Model and Textbooks Are All You Need
The False Promise of Imitating Proprietary LLMs

Suggested readings:
Self-Instruct: Aligning Language Models with Self-Generated Instructions
LIMA: Less Is More for Alignment
The Unlocking Spell on Base LLMs: Rethinking Alignment via In-Context Learning
09/18 Thu Bias and copyright by Nathan Ju & Sidhika Balachandar [pre-lecture questions]
Bias: Whose Language Counts as High Quality? Measuring Language Ideologies in Text Data Selection
Copyright: The Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text

Suggested readings:
Bias: Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus
Bias: From Pretraining Data to Language Models to Downstream Tasks: Tracking the Trails of Political Biases Leading to Unfair NLP Models
Copyright: Foundation Models and Fair Use
Copyright: Consent in Crisis: The Rapid Decline of the AI Data Commons
Copyright: SILO Language Models: Isolating Legal Risk In a Nonparametric Datastore
09/23 Tue Evaluation by Sangdae Nam & Xutao Ma
Measuring Massive Multitask Language Understanding
Measuring short-form factuality in large language models

Suggested readings:
GPQA: A Graduate-Level Google-Proof Q&A Benchmark
OLMES: A Standard for Language Model Evaluations
UQ: Assessing Language Models on Unsolved Questions
09/25 Thu Reasoning models I; Guest Lecture by Negin Raoof (UC Berkeley) on “OpenThoughts: Data Recipes for Reasoning Models”
09/30 Tue Reasoning models II by Charlie Ruan & Kaiwen Hu
s1: Simple test-time scaling and LIMO: Less is More for Reasoning
Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning

Suggested readings:
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
OpenThoughts: Data Recipes for Reasoning Models
Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning
DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
10/02 Thu Rethinking reasoning models by Dennis Jacob & Harman Singh
Spurious Rewards: Rethinking Training Signals in RLVR
Reasoning or Memorization? Unreliable Results of Reinforcement Learning Due to Data Contamination

Suggested readings:
LIMA: Less Is More for Alignment
LLMs Can Easily Learn to Reason from Demonstrations; Structure, not content, is what matters!
Learning to Reason without External Rewards
Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?
Revisiting Reinforcement Learning for LLM Reasoning from A Cross-Domain Perspective
10/07 Tue Replaced by offline feedback sessions on project proposals
10/09 Thu Replaced by offline feedback sessions on project proposals
10/09 Thu 6pm is the deadline for the project proposal
10/14 Tue Mixture of Experts by Sanjay Adhikesaven & Yuezhou Hu
OLMoE: Open Mixture-of-Experts Language Models and DeepSeek-V3 Technical Report
Branch-Train-MiX: Mixing Expert LLMs into a Mixture-of-Experts LLM

Suggested readings:
Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding
Branch-Train-Merge: Embarrassingly Parallel Training of Expert Language Models
Nexus: Specialization meets Adaptability for Efficiently Training Mixture of Experts
FlexOlmo: Open Language Models for Flexible Data Use
10/16 Thu Retrieval-based LMs by Yichuan Wang & Bhavya Chopra
Improving language models by retrieving from trillions of tokens
Scaling Retrieval-Based Language Models with a Trillion-Token Datastore

Suggested readings:
In-Context Retrieval-Augmented Language Models
When Not to Trust Language Models: Investigating Effectiveness of Parametric and Non-Parametric Memories
Can Retriever-Augmented Language Models Reason? The Blame Game Between the Retriever and the Language Model
Great Memory, Shallow Reasoning: Limits of kNN-LMs
Frustratingly Simple Retrieval Improves Challenging, Reasoning-Intensive Benchmarks
10/21 Tue Long context and retrieval by Dongwei Lyu & Siddharth Gollapudi
Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More?
Long-Context LLMs Meet RAG: Overcoming Challenges for Long Inputs in RAG
10/23 Thu Guest Lecture by Barlas Oğuz (Meta AI)
10/28 Tue Creativity & Model Collapse by Téa Wright & Donghyun Lee
The Curious Decline of Linguistic Diversity: Training Language Models on Synthetic Text
Is Model Collapse Inevitable? Breaking the Curse of Recursion by Accumulating Real and Synthetic Data

Suggested readings:
How much do language models copy from their training data? Evaluating linguistic novelty in text generation using RAVEN
AI as Humanity’s Salieri: Quantifying Linguistic Creativity of Language Models via Systematic Attribution of Machine Text against Web Text
The Curse of Recursion: Training on Generated Data Makes Models Forget
Strong Model Collapse
Self-Consuming Generative Models Go MAD (vision)
10/30 Thu Class Activity: Discussion of Talks from the BAIR-NLP Workshop
11/04 Tue Memorization by Kalvin Chang & Juno Kim

Option 1: Membership inference
Detecting Pretraining Data from Large Language Models
Membership Inference Attacks Cannot Prove that a Model Was Trained On Your Data

Suggested readings:
Do Membership Inference Attacks Work on Large Language Models?
Reassessing EMNLP 2024’s Best Paper: Does Divergence-Based Calibration for Membership Inference Attacks Hold Up?
LLM Dataset Inference: Did you train on my dataset?

Option 2: Training data extraction
Extracting Training Data from Large Language Models
Language Models May Verbatim Complete Text They Were Not Explicitly Trained On

Suggested readings:
Recite, Reconstruct, Recollect: Memorization in LMs as a Multifaceted Phenomenon
Extracting memorized pieces of (copyrighted) books from open-weight language models
Measuring memorization in language models via probabilistic extraction
11/06 Thu Search agents by Hanchen Li & Shangyin Tan
An Empirical Study on Reinforcement Learning for Reasoning-Search Interleaved LLM Agents
ZeroSearch: Incentivize the Search Capability of LLMs without Searching
11/11 Tue Academic and Administrative Holiday
11/13 Thu Vision-language models by Junyi Zhang & Colin Wang
Emerging Properties in Unified Multimodal Pretraining
Holistic Evaluation for Interleaved Text-and-Image Generation
11/18 Tue Project Presentation I
11/20 Thu Project Presentation II
11/25 Tue Replaced by offline feedback sessions on project reports
11/27 Thu Thanksgiving Break
12/02 Tue Replaced by offline feedback sessions on project reports
12/04 Thu Replaced by offline feedback sessions on project reports
12/10 Wed 6pm is the deadline for the final paper

Deadlines

Weekly deadlines

  • Main presenters: Submit slides in the dedicated Slack channel 72 hours before class.
  • Panelists & follow-up researchers: Submit slides 48 hours before class in the same channel.
  • Auditors: Submit after-class reviews via the Google Form before the next class.
  • All other students: Submit responses to pre-lecture questions via the Google Form before class.
  • 08/28 (Thu): Submit the form for topic/role preferences and proposals for open slots.
    • Before 09/01 (Mon), we will announce assignments on Slack.
  • 09/17 (Wed): Submit the project preferences (teammates, topic) here.
    • By 09/19 (Fri), project team assignments will be announced on Slack.
  • 10/09 (Thu): Submit the project proposal here.
  • 11/17 (Mon): If you are presenting on 11/18, submit the project presentation slides here with a filename starting with your 2-digit team number.
  • 11/19 (Wed): If you are presenting on 11/20, submit the project presentation slides here with a filename starting with your 2-digit team number.
  • 12/10 (Wed): Submit the final paper here.

Acknowledgement

We are grateful to VESSL AI and Google Cloud for providing compute credits to support our final projects.


The class format and guidelines are largely adapted from the following classes: