Handmade Datasets: Strategies for working critically with small data and Artificial Intelligence
Handmade Datasets is currently running as a course taught by Aarati Akkapeddi with Isabella Haid at School for Poetic Computation.
Previously, it was taught by Aarati Akkapeddi at Gray Area.
This course unpacks the data pipeline behind large AI systems like Stable Diffusion by tracing the journey from web scraping to generation. We'll examine each step: how websites get crawled and archived, how images and their alt-text descriptions become training material, how click-workers and automated systems curate data, and how all of this shapes what a model generates. We'll then propose an alternative: "handmade datasets," or human-scale, personally assembled datasets, small enough that creators can handle each datapoint. Drawing on examples from artists like Anna Ridler and Stephanie Dinkins as well as events like the Dataset Farmers Market, we'll explore how slowness can be a form of resistance, creating space for consent, stewardship, and intentionality. Students will learn practical techniques for training models with limited data, including data augmentation, transfer learning, GANs, LoRA fine-tuning, RAVE, and RAG. Through hands-on projects and critical discussion, students will develop both the technical capability to work with small-scale ML and a more nuanced understanding of data collection and the labor behind AI systems. By the end of the intensive, participants will have created their own handmade dataset and trained a custom model.
Students will complete weekly assignments to gradually build text, image, and audio models, then expand one of them into a final project. Participants can expect to spend 3-4 hours per week on work outside of class.
By the end of this intensive, students will:
- Understand the full pipeline of text-to-image AI systems, from web scraping through training to generation, and recognize how dataset composition shapes model outputs and reproduces social biases
- Gain hands-on experience with multiple approaches to building image, text, and audio models from small datasets: GANs, LoRA fine-tuning, RAG, and RAVE
- Learn methods for collecting, organizing, and augmenting personal datasets, and understand the trade-offs between data augmentation, transfer learning, and training from scratch
- Develop insight into where training data comes from (ImageNet, LAION-5B, FFHQ), the hidden labor behind large-scale AI (click-workers, exploitation), and how to center consent, ownership, and stewardship when working with data
- Create a personally meaningful handmade dataset and train a custom model
- Explore how slowness can be a form of resistance and intentionality in data collection
- Learn to troubleshoot common technical challenges (mode collapse, training instability, computational limits, dataset balance) and make informed decisions about model selection based on available resources and ethical considerations
- Connect critical theory with hands-on technical practice
- Build capacity to critically evaluate AI systems and their outputs through understanding their construction
- Connect with others exploring small-scale, intentional approaches to ML and build a supportive network for continued experimentation
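To make the data augmentation outcome above a little more concrete, here is a minimal sketch of the idea: one hand-collected image becomes several training examples through flips and quarter-turn rotations. It is an illustration only, not the course's own script. The toy image format (nested lists of grayscale values) and the function names `flip_horizontal`, `rotate_90_clockwise`, and `augment` are assumptions made for this example; in practice an imaging library would do this work on real files.

```python
from typing import List

# Hypothetical toy format: an image is a list of rows of grayscale values.
Image = List[List[int]]


def flip_horizontal(image: Image) -> Image:
    """Mirror each row left to right."""
    return [list(reversed(row)) for row in image]


def rotate_90_clockwise(image: Image) -> Image:
    """Rotate the image a quarter turn clockwise."""
    # Reversing the row order and then transposing gives a clockwise rotation.
    return [list(column) for column in zip(*image[::-1])]


def augment(image: Image) -> List[Image]:
    """Return the original plus flipped and rotated variants, without duplicates."""
    candidates = [image, flip_horizontal(image)]
    rotated = image
    for _ in range(3):
        rotated = rotate_90_clockwise(rotated)
        candidates.extend([rotated, flip_horizontal(rotated)])

    # Symmetric images produce repeated variants, so keep only distinct ones.
    unique: List[Image] = []
    for variant in candidates:
        if variant not in unique:
            unique.append(variant)
    return unique


if __name__ == "__main__":
    tiny = [[1, 2], [3, 4]]
    for variant in augment(tiny):
        print(variant)
```

An image with no symmetry yields up to eight variants this way, while a fully symmetric one yields only itself, which is one reason augmentation stretches a small dataset but does not replace collecting varied material.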
No prior machine learning or advanced coding experience is required. Students should have basic computer literacy (file management, running applications) and a willingness to experiment with new tools. We will focus on understanding concepts, collecting meaningful data, and remixing existing code rather than writing algorithms from scratch. An interest in questions around data ethics, labor, representation, personal archives, or critical approaches to technology is encouraged.
Disclaimer
This class works primarily with Google Colab and Google Drive. Alternatives to Google for working with ML will be shared as resources to explore in the future. Students will need to purchase compute units from Google Colab in order to run ML training processes. How much to spend is completely up to each student, but we recommend budgeting $30-45 for compute units.
Course Schedule
| Date & Slides | Topic | Tech | Homework |
|---|---|---|---|
| | Introduction & Landscape | Intro to pix2pix | |
| | Dataset Collection Strategies and Ethics | Google Colab, Google Drive; Python scripts: making still images from video | |
| | Dataset Prep and Augmentation | Python scripts: batch cropping / converting / resizing; edge detection; face detection; pose detection | |
| | Image Models | pix2pix, GANs (training via Google Colab) | |
| | Stable Diffusion Models & Fine-tuning | LoRA | |
| | LoRA | Training a LoRA | |
| | Running local LLMs / Customizing LLMs with RAG | RAG with Ollama & LangChain | |
| | Present RAG project | | |
| | Intro to Audio Models | RAVE | |
| | Audio Models | RAVE | |
| | Present audio projects / Guest lecture / Final assignment intro | TBD | |
| | Final Project development | | |
| | Final Project Presentations | N/A | |
| | Make up day | | |
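The schedule lists a Python script for making still images from video. As a rough sketch of the planning step such a script might include, the example below works out which frame numbers to save when sampling one still every few seconds. It does not decode any video; that part would be handled by a video library and is not shown. The function names `frame_indices` and `frame_filename`, the parameters, and the file naming scheme are all assumptions made for illustration, not the course's actual code.

```python
from typing import List


def frame_indices(total_frames: int, fps: float, every_seconds: float) -> List[int]:
    """Return the indices of frames to save, one every `every_seconds` seconds."""
    if total_frames < 0 or fps <= 0 or every_seconds <= 0:
        raise ValueError("total_frames must be >= 0; fps and every_seconds must be > 0")
    # Convert the time interval to a whole number of frames, never less than one.
    step = max(1, round(fps * every_seconds))
    return list(range(0, total_frames, step))


def frame_filename(index: int, prefix: str = "frame") -> str:
    """Build a zero-padded filename so saved stills sort in playback order."""
    return f"{prefix}_{index:06d}.png"


if __name__ == "__main__":
    # Hypothetical clip: 100 frames at 25 frames per second, one still per second.
    for i in frame_indices(total_frames=100, fps=25, every_seconds=1):
        print(frame_filename(i))
```

For the hypothetical 100-frame clip at 25 frames per second, sampling once per second selects frames 0, 25, 50, and 75. Sampling sparsely like this keeps near-duplicate frames out of a small dataset, and choosing the interval by hand is one small place where the dataset's maker stays in control of what goes in.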