Handmade Datasets: Strategies for working critically with small data and Artificial Intelligence

Handmade Datasets is currently running as a course taught by Aarati Akkapeddi with Isabella Haid at School for Poetic Computation.
Previously, it was taught by Aarati Akkapeddi at Gray Area.

On the left is a photograph of Aarati's mother surrounded by blurry generated images of faces that look somewhat like her. On the right is a screenshot of Aarati's desktop showing an open folder of image files and another window with a photo of physical family photographs scattered on a table.

This course unpacks the data pipeline behind large AI systems like Stable Diffusion by tracing the journey from web scraping to generation. We'll examine each step: how websites get crawled and archived, how images and their alt-text descriptions become training material, how click-workers and automated systems curate data, and how all of this shapes what a model generates. We'll then propose an alternative: "handmade datasets," or human-scale, personally assembled datasets, small enough that creators can handle each datapoint. Drawing on examples from artists like Anna Ridler and Stephanie Dinkins as well as events like the Dataset Farmers Market, we'll explore how slowness can be a form of resistance, creating space for consent, stewardship, and intentionality. Students will learn practical techniques for training models with limited data, including data augmentation, transfer learning, GANs, LoRA fine-tuning, RAVE, and RAG. Through hands-on projects and critical discussion, students will develop both the technical capability to work with small-scale ML and a more nuanced understanding of data collection and the labor behind AI systems. By the end of the intensive, participants will have created their own handmade dataset and trained a custom model.
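As a rough illustration of data augmentation, one of the limited-data techniques named above, here is a minimal, dependency-free sketch. It treats a tiny grayscale image as a list of rows and produces flipped and rotated copies, so one handmade datapoint becomes several training examples. The function names are hypothetical and are not taken from the course materials. A real workflow would more likely use an imaging library on actual image files.

```python
# Hypothetical sketch of data augmentation, not the course's own scripts.
# A tiny grayscale "image" is represented as a list of rows of pixel values.


def flip_horizontal(image):
    """Mirror the image left to right."""
    return [list(reversed(row)) for row in image]


def flip_vertical(image):
    """Mirror the image top to bottom."""
    return [list(row) for row in reversed(image)]


def rotate_90(image):
    """Rotate the image 90 degrees clockwise."""
    return [list(row) for row in zip(*reversed(image))]


def augment(image):
    """Return a copy of the original image plus three transformed copies."""
    return [
        [list(row) for row in image],
        flip_horizontal(image),
        flip_vertical(image),
        rotate_90(image),
    ]


if __name__ == "__main__":
    tiny = [[1, 2],
            [3, 4]]
    for variant in augment(tiny):
        print(variant)
```

Whether a given transformation is appropriate depends on the dataset: a mirrored portrait may be acceptable, while mirrored handwriting may not be.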

Students will complete weekly assignments to gradually build text, image, and audio models, then expand one of them into a final project. Participants can expect to spend 3-4 hours per week on work outside of class.

By the end of this intensive, students will:

No prior machine learning or advanced coding experience is required. Students should be comfortable with basic computer literacy (file management, running applications) and be willing to experiment with new tools. We will focus on understanding concepts, collecting meaningful data, and remixing existing code rather than writing algorithms from scratch. An interest in questions around data ethics, labor, representation, personal archives, or critical approaches to technology is encouraged.

Disclaimer

While this class works primarily with Google Colab and Google Drive, alternatives to Google for working with ML will be shared as resources to explore in the future. Students will need to purchase compute units from Google Colab in order to run ML training processes. It's completely up to each student how much they would like to spend, but we recommend budgeting $30-45 for compute units.

Course Schedule

Course schedule with dates, topics, technology, and homework
| Date & Slides | Topic | Tech | Homework |
| --- | --- | --- | --- |
| | Introduction & Landscape | Intro to pix2pix | |
| | Dataset Collection Strategies and Ethics | Google Colab, Google Drive; Python scripts: making still images from video | Intake form; bring 50% of your B images; fill out datasheet questionnaire |
| | Dataset Prep and Augmentation | Python scripts: batch cropping / converting / resizing, edge detection, face detection, pose detection | Bring 100% of your A & B images |
| | Image Models | pix2pix, GANs (training via Google Colab) | Finish training pix2pix and prepare output examples |
| | Stable Diffusion Models & Fine-tuning | LoRA | Finish training data (captions) |
| | LoRA | Training a LoRA | Finish training your LoRA model |
| | Running local LLMs / Customizing LLMs with RAG | RAG with Ollama & LangChain | Finish your custom RAG model |
| | Present RAG project | | |
| | Intro to Audio Models | RAVE | Gather audio data; fill out datasheet |
| | Audio Models | RAVE | Finish audio model training |
| | Present audio projects / Guest lecture / Final assignment intro | TBD | |
| | Final Project development | | |
| | Final Project Presentations | N/A | |
| | Make up day | | |
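The schedule asks students to bring A and B images for pix2pix, which learns from paired examples (an input image and its corresponding target). Before training, it can help to confirm that every A image has a B partner. The sketch below is one possible check, under loudly stated assumptions: that the two sets live in separate folders and that partners share a filename stem (for example `001.jpg` and `001.png`). The folder layout, the accepted file formats, and the function names are all hypothetical and not confirmed by the course.

```python
from pathlib import Path

# Assumed image formats; adjust to match your own dataset.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def image_stems(folder):
    """Map each image's filename stem to its path within one folder."""
    return {
        path.stem: path
        for path in Path(folder).iterdir()
        if path.is_file() and path.suffix.lower() in IMAGE_SUFFIXES
    }


def match_pairs(folder_a, folder_b):
    """Pair images by shared filename stem (an assumed naming convention).

    Returns (pairs, only_in_a, only_in_b), where pairs is a list of
    (path_in_a, path_in_b) tuples and the other two are sorted lists of
    stems that have no partner in the other folder.
    """
    a = image_stems(folder_a)
    b = image_stems(folder_b)
    shared = sorted(a.keys() & b.keys())
    pairs = [(a[stem], b[stem]) for stem in shared]
    only_in_a = sorted(a.keys() - b.keys())
    only_in_b = sorted(b.keys() - a.keys())
    return pairs, only_in_a, only_in_b
```

Calling `match_pairs("dataset/A", "dataset/B")` on a hypothetical folder layout would return the matched pairs along with any stems missing a partner. In a handmade dataset those leftovers are few enough to resolve by hand.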