Summer of Research - Data Curation

Data Data Data!

Jul 20, 2024

Introduction

This spring and summer, I was selected for Blueyard Capital's DYOR research program. My proposal was to research data curation and what it would take to build a framework for open, collaborative datasets. Such datasets could power AI systems, including LLMs like ChatGPT, as well as non-AI systems, while offering privacy, attribution, usability, flexible reputation, collaboration, and potentially reward systems. What follows is a brief introduction to my research journey and where I'm going.

TL;DR

Data curation is crucial for the future of AI and digital systems. My research focuses on creating a framework for collaborative datasets that ensures privacy, attribution, usability, reputation, and collaboration. The rest of this post introduces the space and shares my findings and progress so far.

Research Journey

The future depends on data curation, and lately, I've been obsessed with it. Before we begin, let me state what I mean by data. I like the Latin origin of the word, meaning "items given," which I find particularly useful. In this context, I use data to mean anything we can reproduce in digital form: images, video, words, the color of the sky, ideas, and other abstract things.

When we use AI to guide us to a perfect conversation or create a dank meme, data is behind it. Specifically, multiple curated sets of data, commonly referred to as a training dataset, allow you to type in a series of words (data) and receive an image of a cat riding a wave or a "7-foot tall basketball player from the Orlando Magic in the 90s that is now a TV pundit and DJ." We can't just say "Shaq" because of restrictions on generating images of specific people.

We technical folk are already curating data for you. There's no public governance behind it; just people doing what they know. Before going further, it helps to lay down some more groundwork.

What is Curation?

Curation comes from the Latin "curare," meaning "to care for." In the early days of art curation, curators collected and cared for works. Eventually, they created exhibitions, displayed works, shared them, and provided cultural context. Similarly, data curation involves including and excluding data to provide context or information for machines or humans. This data drives future cultural outputs and scientific discoveries, and it connects us to our historical past.

Data in AI

Data in AI, specifically for LLMs like ChatGPT, exists in three major phases: base model data, alignment data, and post-training augmented data or fine-tuned data.

Base Model Data: During this phase, data is gathered from the internet and some book databases like Anna's Archive. High-quality data is crucial here. The model learns to predict the next piece of text, which lets it mimic the documents it was trained on.
Alignment Data: In this phase, we exclude undesirable data, such as racist, sexist, or otherwise biased content, porn, passwords, and harmful instructions. The model is trained to recognize and avoid these or to yield different responses.
Post-Training Augmented Data or Fine-Tuned Data: The model is fine-tuned with specific data to create a specialized output, like a GPT assistant or a classifier. This data can come from RLHF (reinforcement learning from human feedback), where people rate responses and a reward model trained on those ratings is used to score the AI's output.

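This step is tricky, and efforts are ongoing to reduce its complexity. One example is DPO (direct preference optimization), which trains the model directly on pairs of preferred and rejected responses instead of training a distinct reward model.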
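To make the DPO idea a bit more concrete, here is a minimal sketch of the per-example DPO loss in plain Python. It assumes you already have the summed log-probabilities of a preferred ("chosen") and a rejected response under both the model being trained and a frozen reference model. The function name, argument names, and the default beta are illustrative choices of mine, not taken from any particular library.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-example DPO loss: -log(sigmoid(beta * margin difference)).

    All inputs are log-probabilities of a full response (hypothetical
    values here; in practice they come from the two models).
    """
    # How much more the policy likes each response than the reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(x)) equals log(1 + exp(-x)); branch for numerical stability.
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))


# The loss shrinks as the policy favors the chosen response more than
# the reference model does.
neutral = dpo_loss(-2.0, -2.0, -2.0, -2.0)
improved = dpo_loss(-1.0, -3.0, -2.0, -2.0)
print(neutral, improved)
```

Notice that no reward model appears anywhere: the human preference is encoded only in which response is labeled chosen and which is labeled rejected.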

Finally, we have the post-model fine-tuned stage. Here, we can either use the model as is or fine-tune it further to be useful for a specific context. One warning: further fine-tuning can cost the model some of its abilities outside the fine-tuning data, a problem often called catastrophic forgetting.

A recent popular alternative to fine-tuning is retrieval-augmented generation (RAG), a technique that retrieves relevant data and adds it to the LLM's prompt as needed, meaning you can supply context beyond what the model was trained on.
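Here is a toy sketch of the retrieval half of RAG, using only the Python standard library. It scores documents against a question with word-count cosine similarity and pastes the best matches into a prompt. Real systems typically use learned embeddings and a vector index instead; the function names and the prompt wording below are my own assumptions for illustration, and the final call to an LLM is left out.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two word-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def retrieve(query, documents, k=2):
    """Return the k documents most similar to the query."""
    q = Counter(tokenize(query))
    ranked = sorted(documents,
                    key=lambda d: cosine(q, Counter(tokenize(d))),
                    reverse=True)
    return ranked[:k]


def build_prompt(query, documents, k=2):
    """Put the retrieved documents in front of the question."""
    context = "\n".join(f"- {d}" for d in retrieve(query, documents, k))
    return f"Use the context to answer.\n\nContext:\n{context}\n\nQuestion: {query}"


docs = [
    "Curation comes from the Latin curare, meaning to care for.",
    "A reward model scores the output of the assistant.",
    "Cats can ride waves in generated images.",
]
print(build_prompt("What does curation mean in Latin?", docs, k=1))
```

The important point for data curation is that the list of documents can change at any time without retraining the model, so whoever curates that list shapes the answers.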

Focus on Fine-Tuning and RAG

In all three stages, we deal with potential bias and data curation. I believe we should focus on the final stage, specifically fine-tuning and RAG. These techniques let us leverage the model's generality while adding context that changes its output.

Framework for Cultural Commons

I am researching how to create a framework for generating new cultural commons. I believe that data is the way to do that. Following the early tenets of privacy, attribution, reputation, collaboration, and usability, I believe there is a framework that we can use to change the way we consume and generate data that is human-led and machine-friendly.

Tenets of the Framework

Privacy: Contributions to a model can stay anonymous, and data sharing can be controlled.
Attribution: Credit should be pseudonymous, allowing flexible identity.
Reputation: Reputation systems should be flexible and reflect different perspectives.
Collaboration: Data contributions can be public, private, or semi-private.
Usability: The framework must be simple to deploy and manage, with minimal invasiveness in adding data.
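As a rough illustration of how some of these tenets might look in code, here is a small sketch of a contribution record with pseudonymous attribution and three visibility levels. Everything in it is hypothetical: the class and function names, the fields, and the access rule are assumptions of mine, not a description of the framework itself. In particular, hashing a secret is only a stand-in for a real privacy scheme.

```python
import hashlib
from dataclasses import dataclass
from enum import Enum


class Visibility(Enum):
    PUBLIC = "public"
    SEMI_PRIVATE = "semi-private"
    PRIVATE = "private"


def pseudonym(contributor_secret: str, dataset_id: str) -> str:
    """Derive a stable per-dataset pseudonym from a contributor secret.

    Illustration only: a different dataset yields a different pseudonym,
    so identities are not trivially linkable across datasets.
    """
    raw = f"{dataset_id}:{contributor_secret}".encode("utf-8")
    return hashlib.sha256(raw).hexdigest()[:16]


@dataclass(frozen=True)
class Contribution:
    dataset_id: str
    author: str                     # a pseudonym, never a real name
    content: str
    visibility: Visibility
    allowed_readers: frozenset = frozenset()  # used for semi-private data


def can_read(contribution: Contribution, reader: str) -> bool:
    """Decide whether a reader (identified by pseudonym) may see the data."""
    if contribution.visibility is Visibility.PUBLIC:
        return True
    if reader == contribution.author:
        return True
    if contribution.visibility is Visibility.SEMI_PRIVATE:
        return reader in contribution.allowed_readers
    return False


me = pseudonym("my-secret", "recipes")
friend = pseudonym("friend-secret", "recipes")
note = Contribution("recipes", me, "Add a pinch of salt.",
                    Visibility.SEMI_PRIVATE, frozenset({friend}))
print(can_read(note, friend), can_read(note, "stranger"))
```

Reputation and rewards are left out on purpose, since they depend on choices the research has not settled yet.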

Cultural and Community Aspects

The last matter is the cultural and community aspect. In software, it's too easy to focus on technical solutions while overlooking social or cultural aspects. Data curation tools are long-term tools for creating more culture and community. It's important to think about creating a multifaceted ecosystem.

Future Directions

On today's web, we have a high degree of fragmentation and specificity, and very few tools to break out of our own graph. Keeping this in mind, I will be thinking about ways to share knowledge and curated data, along with some kind of reputational view, so that people can stumble upon perspectives that do not necessarily mirror their own. Over the coming months, I look forward to sharing the progress and experiments I'm undertaking to bring this vision to fruition. I'd love to have you as an eventual beta tester and as a subscriber to the Substack, where I'll continue to share progress and tidbits on the road to code.

Conclusion

Special thanks to Blueyard Capital for creating space for me to realize this vision.

Zane’s Substack
