CS 294-288: Data-Centric Large Language Models
Instructor: Sewon Min
Class hours: TuThu 12:30–2:00 (12:40–2:00 considering Berkeley time)
Class location: Cory 540AB
Office hours: By appointment
Contact: sewonm@berkeley.edu (Please include “294-288” in the email subject)
Feedback form: https://forms.gle/ytMKnKWLm2onutqb7
If you are interested in taking the course and can’t directly enroll, please submit this form.
Advances in large language models (LLMs) have been driven by the growing availability of large and diverse datasets. But where do these datasets come from? How are they being used? How can we leverage them in better, more creative ways? What challenges or issues do they present, and how might we address them? In this seminar, we will explore these questions as part of a broader effort to rethink how data is used in the development of LLMs: what data we use, how we use it, why it works, and what problems it brings.
The class is primarily designed for PhD students and is based on paper readings, discussions, and an open-ended project. Students are expected to independently understand assigned papers and have a background in ML, NLP (CS 288 or equivalent), and LLMs.
We will use Slack for most communications (no Ed!). You will be added to Slack after the first lecture. If you join the class late, email us and we’ll add you. Once you’re on Slack, we prefer Slack messages over emails for all logistical questions. We also encourage students to use Slack for paper discussion and project collaboration.
Class Syllabus
This schedule is tentative and will be finalized after the first lecture.
Open slots
As you can see from the syllabus, there are two “class pick” sessions. We will accept topic proposals for these sessions. In your proposal, please include:
- Presenter names (you and your co-presenter)
- Two papers to present and a debate topic
- The topic’s significance and relevance to the class
If there are topics you’re excited about and/or have expertise in, this is a great opportunity to share them with the class! To foster open and honest discussion, we discourage selecting your own work.
Deadlines
Weekly deadlines
- Main presenters: Submit slides in the dedicated Slack channel 48 hours before class.
- Panelists & follow-up researchers: Submit slides 24 hours before class in the same channel.
- Auditors: Submit after-class reviews via the Google Form at least 24 hours before the next class.
- All other students: Submit responses to pre-lecture questions via the Google Form 24 hours before class.
Deadlines related to role assignment and project (all by 6pm)
- 08/28 (Thu): Submit the form for topic/role preferences and proposals for open slots.
- By 08/29 (Fri), we will announce role assignments on Slack. Note that assignments can change as class enrollment changes; if there are any updates, we will let you know at least a week before the affected class.
- 09/17 (Wed): Submit the project preferences (teammates, topic).
- By 09/19 (Fri), project team assignment will be announced on Slack.
- 10/09 (Thu): Project proposal.
- 11/17 (Mon): Submit the project presentation slides (if you are presenting on 11/18).
- 11/19 (Wed): Submit the project presentation slides (if you are presenting on 11/20).
- 12/10 (Wed): Final paper.
Acknowledgement
The class format and guidelines are largely adapted from the following classes:
- Alane Suhr (UC Berkeley) CS294-258 Language Agents in Interaction
- Pang Wei Koh (UW) CSE599J: Data-centric Machine Learning
- Danqi Chen and Sanjeev Arora (Princeton) COS 597R: Deep Dive into Large Language Models