CS 294-288: Data-Centric Large Language Models
Instructor: Sewon Min 
 Class hours: TuThu 12:30–2:00 (12:40–2:00 considering Berkeley time) 
 Class location: Berkeley Way West 1203 (Tuesdays) and 1104 (Thursdays) 
 Office hours: By appointment 
 Contact: sewonm@berkeley.edu (Please include “294-288” in the email subject) 
 Feedback form: https://forms.gle/ytMKnKWLm2onutqb7
If you are interested in taking the course and can’t directly enroll, please submit this form.
Advances in large language models (LLMs) have been driven by the growing availability of large and diverse datasets. But where do these datasets come from? How are they being used? How can we leverage them in better, more creative ways? What challenges or issues do they present, and how might we address them? In this seminar, we will explore these questions as part of a broader effort to rethink how data is used in the development of LLMs: what data we use, how we use it, why it works, and what problems it brings.
The class is primarily designed for PhD students and is based on paper readings, discussions, and an open-ended project. Students are expected to independently understand assigned papers and have a background in ML, NLP (CS 288 or equivalent), and LLMs.
We will use Slack for most communications (no Ed!). You should be added to Slack by now. If not, email us or come to the first class.
Class Syllabus
Deadlines
Weekly deadlines
- Main presenters: Submit slides in the dedicated Slack channel 72 hours before class.
- Panelists & follow-up researchers: Submit slides 48 hours before class in the same channel.
- Auditors: Submit after-class reviews via the Google Form before the next class.
- All other students: Submit responses to pre-lecture questions via the Google Form before class.
 
Deadlines related to role assignment and project (all by 6pm PST)
- 08/28 (Thu): Submit the form for topic/role preferences and proposals for open slots.
  - Before 09/01 (Mon), we will announce assignments on Slack.
- 09/17 (Wed): Submit the project preferences (teammates, topic) here.
  - By 09/19 (Fri), project team assignments will be announced on Slack.
- 10/09 (Thu): Submit the project proposal here.
- 11/17 (Mon): If you are presenting on 11/18, submit the project presentation slides in #class-discussion.
- 11/19 (Wed): If you are presenting on 11/20, submit the project presentation slides in #class-discussion.
- 12/10 (Wed): Submit the final paper here.
 
Acknowledgement
We are grateful to VESSL AI and Google Cloud for providing compute credits to support our final projects.
 
The class format and guidelines are largely adapted from the following classes:
- Alane Suhr (UC Berkeley) CS294-258 Language Agents in Interaction
- Pang Wei Koh (UW) CSE599J: Data-centric Machine Learning
- Danqi Chen and Sanjeev Arora (Princeton) COS 597R: Deep Dive into Large Language Models