CS 294-288: Data-Centric Large Language Models

Instructor: Sewon Min
Class hours: TuThu 12:30–2:00 (12:40–2:00 considering Berkeley time)
Class location: Berkeley Way West 1203 (Tuesdays) and 1104 (Thursdays)
Office hours: By appointment
Contact: sewonm@berkeley.edu (Please include "294-288" in the email subject)
Feedback form: https://forms.gle/ytMKnKWLm2onutqb7
If you are interested in taking the course and can't directly enroll, please submit this form.
Advances in large language models (LLMs) have been driven by the growing availability of large and diverse datasets. But where do these datasets come from? How are they being used? How can we leverage them in better, more creative ways? What challenges or issues do they present, and how might we address them? In this seminar, we will explore these questions as part of a broader effort to rethink how data is used in the development of LLMs: what data we use, how we use it, why it works, and what problems it brings.
The class is primarily designed for PhD students and is based on paper readings, discussions, and an open-ended project. Students are expected to independently understand assigned papers and have a background in ML, NLP (CS 288 or equivalent), and LLMs.
We will use Slack for most communications (no Ed!). You should have been added to Slack by now. If not, email us or come to the first class.
Class Syllabus

08/28 Thu: Introduction [slides]
(Presentation/role preferences due 08/28 Thu 6pm.)

09/02 Tue: Pre-training data curation [slides]
• Language Models are Few-Shot Learners
• DataComp-LM: In search of the next generation of training sets for language models
• FineWeb: decanting the web for the finest text data at scale
Suggested readings:
• Language Models are Unsupervised Multitask Learners (Sec 2.1)
• Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer (Sec 2.2)
• Deduplicating Training Data Makes Language Models Better
• The Pile: An 800GB Dataset of Diverse Text for Language Modeling
• Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research
• Nemotron-CC: Transforming Common Crawl into a Refined Long-Horizon Pretraining Dataset

09/04 Thu: Synthetic pre-training data; guest lecture by Pratyush Maini (CMU / DatologyAI) [slides]
• Rephrasing the Web: A Recipe for Compute and Data-Efficient Language Modeling
• BeyondWeb: Lessons from Scaling Synthetic Data for Trillion-scale Pretraining
Suggested readings:
• TinyStories: How Small Can Language Models Be and Still Speak Coherent English?
• Textbooks Are All You Need
• Cosmopedia: how to create large-scale synthetic data for pre-training
• Kimi K2: Open Agentic Intelligence (Sec 2.2)

09/09 Tue: Scaling laws, by Jongho Park & Prasann Singhal [pre-lecture questions]
• Training Compute-Optimal Large Language Models
• Language models scale reliably with over-training and on downstream tasks
Suggested readings:
• Deep Learning Scaling is Predictable, Empirically
• Scaling Laws for Neural Language Models
• Emergent Abilities of Large Language Models
• Scaling Laws Are Unreliable for Downstream Tasks: A Reality Check
• Scaling Laws for Data Filtering – Data Curation cannot be Compute Agnostic
• Scaling Data-Constrained Language Models
• Compute-Constrained Data Selection

09/11 Thu: Post-training data, by Qiuyang Mang & Huanzhi Mao [pre-lecture questions]
• Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback (Sec 2)
• Tulu 3: Pushing Frontiers in Open Language Model Post-Training
Suggested readings:
• Self-Instruct: Aligning Language Models with Self-Generated Instructions
• The Llama 3 Herd of Models (Sec 4 and relevant parts in Sec 5)
• Vicuna
• Secrets of RLHF in Large Language Models Part I: PPO
• Direct Preference Optimization: Your Language Model is Secretly a Reward Model

09/16 Tue: Synthetic data & distillation, by Ryan Wang & Charles Xu [pre-lecture questions]
• Alpaca: A Strong, Replicable Instruction-Following Model, and Textbooks Are All You Need
• The False Promise of Imitating Proprietary LLMs
Suggested readings:
• Self-Instruct: Aligning Language Models with Self-Generated Instructions
• LIMA: Less Is More for Alignment
• The Unlocking Spell on Base LLMs: Rethinking Alignment via In-Context Learning

09/18 Thu: Bias and copyright, by Nathan Ju & Sidhika Balachandar [pre-lecture questions]
• Bias: Whose Language Counts as High Quality? Measuring Language Ideologies in Text Data Selection
• Copyright: The Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text
Suggested readings:
• Bias: Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus
• Bias: From Pretraining Data to Language Models to Downstream Tasks: Tracking the Trails of Political Biases Leading to Unfair NLP Models
• Copyright: Foundation Models and Fair Use
• Copyright: Consent in Crisis: The Rapid Decline of the AI Data Commons
• Copyright: SILO Language Models: Isolating Legal Risk In a Nonparametric Datastore

09/23 Tue: Evaluation, by Sangdae Nam & Xutao Ma
• Measuring Massive Multitask Language Understanding
• Measuring short-form factuality in large language models
Suggested readings:
• GPQA: A Graduate-Level Google-Proof Q&A Benchmark
• OLMES: A Standard for Language Model Evaluations
• UQ: Assessing Language Models on Unsolved Questions

09/25 Thu: Reasoning models I; guest lecture by Negin Raoof (UC Berkeley) on "OpenThoughts: Data Recipes for Reasoning Models"

09/30 Tue: Reasoning models II, by Charlie Ruan & Kaiwen Hu
• s1: Simple test-time scaling, and LIMO: Less is More for Reasoning
• Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning
Suggested readings:
• DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
• OpenThoughts: Data Recipes for Reasoning Models
• Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning
• DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

10/02 Thu: Rethinking reasoning models, by Dennis Jacob & Harman Singh
• Spurious Rewards: Rethinking Training Signals in RLVR
• Reasoning or Memorization? Unreliable Results of Reinforcement Learning Due to Data Contamination
Suggested readings:
• LIMA: Less Is More for Alignment
• LLMs Can Easily Learn to Reason from Demonstrations: Structure, not content, is what matters!
• Learning to Reason without External Rewards
• Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?
• Revisiting Reinforcement Learning for LLM Reasoning from A Cross-Domain Perspective

10/07 Tue: No class; replaced with offline feedback sessions on project proposals

10/09 Thu: No class; replaced with offline feedback sessions on project proposals
(Project proposal due 10/09 Thu 6pm.)

10/14 Tue: Mixture of Experts, by Sanjay Adhikesaven & Yuezhou Hu
• OLMoE: Open Mixture-of-Experts Language Models, and DeepSeek-V3 Technical Report
• Branch-Train-MiX: Mixing Expert LLMs into a Mixture-of-Experts LLM
Suggested readings:
• Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
• GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding
• Branch-Train-Merge: Embarrassingly Parallel Training of Expert Language Models
• Nexus: Specialization meets Adaptability for Efficiently Training Mixture of Experts
• FlexOlmo: Open Language Models for Flexible Data Use

10/16 Thu: Retrieval-based LMs, by Yichuan Wang & Bhavya Chopra
• Improving language models by retrieving from trillions of tokens
• Scaling Retrieval-Based Language Models with a Trillion-Token Datastore
Suggested readings:
• In-Context Retrieval-Augmented Language Models
• When Not to Trust Language Models: Investigating Effectiveness of Parametric and Non-Parametric Memories
• Can Retriever-Augmented Language Models Reason? The Blame Game Between the Retriever and the Language Model
• Great Memory, Shallow Reasoning: Limits of kNN-LMs
• Frustratingly Simple Retrieval Improves Challenging, Reasoning-Intensive Benchmarks

10/21 Tue: Long context and retrieval, by Dongwei Lyu & Siddharth Gollapudi
• Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More?
• Long-Context LLMs Meet RAG: Overcoming Challenges for Long Inputs in RAG

10/23 Thu: Guest lecture by Barlas Oğuz (Meta AI)

10/28 Tue: Creativity & model collapse, by Téa Wright & Donghyun Lee
• The Curious Decline of Linguistic Diversity: Training Language Models on Synthetic Text
• Is Model Collapse Inevitable? Breaking the Curse of Recursion by Accumulating Real and Synthetic Data
Suggested readings:
• How much do language models copy from their training data? Evaluating linguistic novelty in text generation using RAVEN
• AI as Humanity's Salieri: Quantifying Linguistic Creativity of Language Models via Systematic Attribution of Machine Text against Web Text
• The Curse of Recursion: Training on Generated Data Makes Models Forget
• Strong Model Collapse
• Self-Consuming Generative Models Go MAD (vision)

10/30 Thu: Class activity: discussion of talks from the BAIR-NLP Workshop

11/04 Tue: Memorization, by Kalvin Chang & Juno Kim
Option 1: Membership inference
• Detecting Pretraining Data from Large Language Models
• Membership Inference Attacks Cannot Prove that a Model Was Trained On Your Data
Suggested readings:
• Do Membership Inference Attacks Work on Large Language Models?
• Reassessing EMNLP 2024's Best Paper: Does Divergence-Based Calibration for Membership Inference Attacks Hold Up?
• LLM Dataset Inference: Did you train on my dataset?
Option 2: Training data extraction
• Extracting Training Data from Large Language Models
• Language Models May Verbatim Complete Text They Were Not Explicitly Trained On
Suggested readings:
• Recite, Reconstruct, Recollect: Memorization in LMs as a Multifaceted Phenomenon
• Extracting memorized pieces of (copyrighted) books from open-weight language models
• Measuring memorization in language models via probabilistic extraction

11/06 Thu: Search agents, by Hanchen Li & Shangyin Tan
• An Empirical Study on Reinforcement Learning for Reasoning-Search Interleaved LLM Agents
• ZeroSearch: Incentivize the Search Capability of LLMs without Searching

11/11 Tue: Academic and Administrative Holiday (no class)

11/13 Thu: Vision-language models, by Junyi Zhang & Colin Wang
• Emerging Properties in Unified Multimodal Pretraining
• Holistic Evaluation for Interleaved Text-and-Image Generation

11/18 Tue: Project Presentation I

11/20 Thu: Project Presentation II

11/25 Tue: No class; replaced with offline feedback sessions on project reports

11/27 Thu: Thanksgiving Break

12/02 Tue: No class; replaced with offline feedback sessions on project reports

12/04 Thu: No class; replaced with offline feedback sessions on project reports

12/10 Wed: Final paper due at 6pm
Deadlines

Weekly deadlines:
• Main presenters: Submit slides in the dedicated Slack channel 72 hours before class.
• Panelists & follow-up researchers: Submit slides 48 hours before class in the same channel.
• Auditors: Submit after-class reviews via the Google Form before the next class.
• All other students: Submit responses to pre-lecture questions via the Google Form before class.

One-time deadlines:
• 08/28 (Thu): Submit the form for topic/role preferences and proposals for open slots. We will announce assignments on Slack before 09/01 (Mon).
• 09/17 (Wed): Submit your project preferences (teammates, topic) here. Project team assignments will be announced on Slack by 09/19 (Fri).
• 10/09 (Thu): Submit the project proposal here.
• 11/17 (Mon): If you are presenting on 11/18, submit the project presentation slides here, with a filename starting with your 2-digit team number.
• 11/19 (Wed): If you are presenting on 11/20, submit the project presentation slides here, with a filename starting with your 2-digit team number.
• 12/10 (Wed): Submit the final paper here.

Acknowledgement

We are grateful to VESSL AI and Google Cloud for providing compute credits to support our final projects.
The class format and guidelines are largely adapted from the following classes: