Handmade Datasets: Strategies for working critically with small data and Artificial Intelligence

Handmade Datasets is currently running as a course taught by Aarati Akkapeddi with Isabella Haid at School for Poetic Computation.
Previously, it was taught by Aarati Akkapeddi at Gray Area.

Class Hours: Section 1, Sundays 6:30 – 9:00pm | Section 2, Wednesdays 6:30 – 9:00pm

Instructor email: aarati.akkapeddi[at]gmail.com

Class Recordings: Section 1 | Section 2

SFPC Community Agreements

Are.na Channel

On the left is a photograph of Aarati's mother surrounded by blurry generated images of faces that look somewhat like her. On the right is a screenshot of Aarati's desktop showing an open folder with multiple image files and another window with a photo of physical family photographs scattered on a table.

This course unpacks the data pipeline behind large AI systems like Stable Diffusion by tracing the journey from web scraping to generation. We'll examine each step: how websites get crawled and archived, how images and their alt-text descriptions become training material, how click-workers and automated systems curate data, and how all of this shapes what a model generates. We'll then propose an alternative: "handmade datasets," or human-scale, personally assembled datasets, small enough that creators can handle each datapoint. Drawing on examples from artists like Anna Ridler and Stephanie Dinkins as well as events like the Dataset Farmers Market, we'll explore how slowness can be a form of resistance, creating space for consent, stewardship, and intentionality. Students will learn practical techniques for training models with limited data, including data augmentation, transfer learning, GANs, LoRA fine-tuning, RAVE, and RAG. Through hands-on projects and critical discussion, students will develop both the technical capability to work with small-scale ML and a more nuanced understanding of data collection and the labor behind AI systems. By the end of the intensive, participants will have created their own handmade dataset and trained a custom model.

Students will complete weekly assignments to gradually build text, image, and audio models, expanding on one to work with as a final project. Participants can expect to spend 3-4 hours weekly on work outside of class.

By the end of this intensive, students will:

Understand the full pipeline of text-to-image AI systems, from web scraping through training to generation, and recognize how dataset composition shapes model outputs and reproduces social biases
Gain hands-on experience with multiple approaches to training image models with small datasets: GANs, LoRA fine-tuning, RAG, RAVE
Learn methods for collecting, organizing, and augmenting personal datasets, and understand data augmentation vs. transfer learning vs. training from scratch
Develop insight into where training data comes from (ImageNet, LAION-5B, FFHQ), the hidden labor behind large-scale AI (click-workers, exploitation), and how to center consent, ownership, and stewardship when working with data
Create a personally meaningful handmade dataset and train a custom model
Explore how slowness can be a form of resistance and intentionality in data collection
Learn to troubleshoot common technical challenges (mode collapse, training instability, computational limits, dataset balance) and make informed decisions about model selection based on available resources and ethical considerations
Connect critical theory with hands-on technical practice
Build capacity to critically evaluate AI systems and their outputs through understanding their construction
Connect with others exploring small-scale, intentional approaches to ML and build a supportive network for continued experimentation
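
As a small illustration of one technique named in the outcomes above, data augmentation, here is a minimal Python sketch. It assumes a toy "image" stored as a grid of numbers; the function names (flip_horizontal, flip_vertical, rotate_90, augment) are hypothetical and are not taken from the course materials. The idea is simply that each handmade datapoint can yield several transformed copies, so a small dataset stretches further during training.

```python
# Illustrative sketch only: an "image" here is a small grid of pixel values
# (a list of rows). All function names are hypothetical examples.

def flip_horizontal(image):
    """Mirror the image left to right."""
    return [list(reversed(row)) for row in image]

def flip_vertical(image):
    """Mirror the image top to bottom."""
    return [list(row) for row in reversed(image)]

def rotate_90(image):
    """Rotate the image 90 degrees clockwise."""
    return [list(row) for row in zip(*reversed(image))]

def augment(dataset):
    """Return each original image plus three transformed copies of it."""
    augmented = []
    for image in dataset:
        augmented.append([list(row) for row in image])  # untouched copy
        augmented.append(flip_horizontal(image))
        augmented.append(flip_vertical(image))
        augmented.append(rotate_90(image))
    return augmented

tiny_image = [[1, 2],
              [3, 4]]
handmade_dataset = [tiny_image]
bigger_dataset = augment(handmade_dataset)
print(len(handmade_dataset), "->", len(bigger_dataset))  # prints: 1 -> 4
```

Real image pipelines usually rely on a library for this step, and which transformations make sense depends on the data: flipping a portrait may be harmless, while flipping handwriting changes its meaning.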

No prior machine learning or advanced coding experience required. Students should be comfortable with basic computer literacy (file management, running applications) and have a willingness to experiment with new tools. We will focus on understanding concepts, collecting meaningful data, and remixing existing code rather than writing algorithms from scratch. An interest in questions around data ethics, labor, representation, personal archives, or critical approaches to technology is encouraged.

Accessibility and Support

I am committed to making this course accessible to all students. If you have any specific needs or accommodations you would like to discuss, please reach out to me directly or through the Intake form.

I am committed to writing alt-text for all media on this website and on Google Slides.

Online class can be difficult on the body. While I will do my best to provide opportunities for breaks, please don't hesitate to advocate for more, or to take them as you need.

Keeping your camera on is of course welcome but not required.

Link to SFPC Student Resource Guide

For additional support, including:

1-on-1 sessions
Mental health resources
Community Agreement concerns

You may reach out to SFPC's community counselor, Ngozi Alston (they/them)
Discord: @ngwagwa
E-mail: community@sfpc.study
Calendly booking page in welcome email + pinned in the "general" channel

Disclaimer

While this class works primarily with Google Colab and Google Drive, there are alternatives to Google for working with ML, which will be shared as resources to explore in the future. Students will need to purchase compute units from Google Colab in order to run ML training processes. It's completely up to each student how much they would like to spend, but I recommend budgeting $30-45 for compute units and using the pay-as-you-go option.

Collective Class Community Guidelines

Love your bugs 🐝 Mistakes are learning opportunities. Learning is messy and imperfect.
Mode collapse is beautiful
Practice and practice - “the parable of the pottery class” where quantity leads to quality
Enhance the glitch
Be respectful - to others AND yourself. Be kind.
Speak up during critiques -- if you tend to be quieter, push yourself to give feedback; if you tend to speak up, push yourself to listen
Ask questions along with giving feedback
Sometimes questions reveal even more than comments/direct observations
Be open to criticism
Be curious!
There are no stupid questions
Ask a human/each other first? (before asking Claude/ChatGPT/insert-corporate-llm-here?)
Share often - learn from each other’s process
Be generous and kind
Bring your background (cultural, work history, studies) into the work
Diversity of perspectives is a strength
Assume best intent +1
Be aware of, and true to, yourself in the context of the work (positionality: histories, realities, experiences)
Be honest about areas of expertise and lack
Offer content warnings when discussing sensitive topics (but assume best intent if someone doesn’t or forgets to)
Practice active listening and provide feedback especially in small group breakouts
Take care of our bodies -- take breaks when we need to
Different days bring different vibes and that’s ok, creativity isn’t automatic
Be present
Class is a pool → (the waves we make affect each other)
Be open to not knowing what to do (or what to do next), expect uncertainty
Have compassion for yourself and others
Things suck right now and we’re all trying our best
Unlearn ableism
Don’t chastise yourself for (temporarily) using Google Services GPUs :p
Center our shared humanity
Critique the work, not the person

Link to SFPC-wide Community Agreements

Course Schedule

Go to "weekly schedule" in the nav