Introducing DataChain: an open-source library to curate and process unstructured data at scale
DataChain, an open-source Python library designed to address the complex data requirements of multimodal generative AI, offers scalable unstructured data processing and AI-driven curation capabilities, including cloud integration, efficient performance, and embedded databases.
With generative AI moving past text-based language functionality and into the realm of multimodality, the data demands of generative AI applications have become increasingly complex, leaving developers to grapple with enormous volumes of unstructured data in a variety of types and formats. Addressing the data requirements of multimodal generative AI is no longer as simple as designating a versioned folder of relevant files as a dataset, and traditional data processing and curation techniques will not yield the expected results.
DataChain is an open-source Python library introduced to address exactly these issues. It is well suited to scalable unstructured data processing and curation because it satisfies the core requirements of the modern data stack: it incorporates AI-driven data curation, following the trend of having models judge their own outputs or annotate and curate their own datasets; it scales to billions of data points; and it does not rely on JSON to store every feature, instead treating Python objects as first-class AI dataset elements.
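To make the last point concrete, here is a minimal, library-agnostic sketch of the difference between a raw JSON record and a typed Python object as a dataset element. It uses only the standard library (`dataclasses`, `json`) rather than DataChain's actual API; the `ImageMeta` model and its fields are invented for illustration:

```python
import json
from dataclasses import dataclass

# A JSON-style record: untyped, so errors surface only when a field is used.
raw = json.loads('{"path": "img_001.jpg", "width": "1024", "label": "cat"}')
# Note raw["width"] came back as a string, not an int.

# A typed Python object: structure and types are explicit and validated up front.
@dataclass
class ImageMeta:
    path: str
    width: int
    label: str

record = ImageMeta(path=raw["path"], width=int(raw["width"]), label=raw["label"])
doubled = record.width * 2  # typed field supports arithmetic directly
```

With typed objects, the dataset schema lives in code and mistakes are caught at construction time instead of deep inside a processing pipeline.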
More specifically, DataChain aims to serve as a component in data curation workflows or as a mechanism for evaluating existing AI applications. Built with this role in mind, DataChain boasts the following key features:
- Cloud integration: DataChain covers local file storage, but it also goes one step further by letting users read data stored in their preferred cloud (S3, Google Cloud, or Azure), so they can create persistent and versioned datasets.
- Pydantic compatibility: DataChain lets users define data models using Pydantic. Features can then be stored as validated data objects with automatic serialization/deserialization.
- Data transformation support: DataChain enables data transformation by running local ML models, external LLM calls, or custom Python code.
- Efficient performance: Inference code can run in parallel and out-of-core on datasets larger than memory, allowing users to process millions of files on practically any device.
- Embedded databases: DataChain handles the intricacies of database management. In the open-source DataChain version, Python objects are efficiently stored in SQLite databases, which support vectorized operations such as similarity search and analytical queries.
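To illustrate what an embedded database buys here, the following self-contained sketch stores small embedding vectors in SQLite and runs a cosine-similarity search over them. This is plain `sqlite3` plus standard-library math, not DataChain's internal code, and the file names and three-dimensional vectors are made up for the example:

```python
import json
import math
import sqlite3

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Embedded database: no server process, just a file (or memory) handle.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (name TEXT, embedding TEXT)")

# Toy embeddings, serialized as JSON text for storage.
vectors = {
    "cat.jpg": [1.0, 0.0, 0.2],
    "dog.jpg": [0.9, 0.1, 0.3],
    "car.jpg": [0.0, 1.0, 0.9],
}
conn.executemany(
    "INSERT INTO items VALUES (?, ?)",
    [(name, json.dumps(vec)) for name, vec in vectors.items()],
)

# Similarity search: rank every stored item against a query vector.
query = [1.0, 0.0, 0.1]
ranked = sorted(
    (
        (name, cosine(query, json.loads(emb)))
        for name, emb in conn.execute("SELECT name, embedding FROM items")
    ),
    key=lambda pair: pair[1],
    reverse=True,
)
best = ranked[0][0]  # the item most similar to the query
```

A production system would store vectors in a binary format and push more of the computation into the database, but the shape of the workflow, an embedded store queried with vector math, is the same.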
DataChain has the potential to become the foundation for new libraries dedicated to unstructured data processing and for AI-driven data curation solutions. The DataChain GitHub repository contains everything needed to get started with the library. Since DataChain is an open-source project just getting started, feedback and contributions are paramount and much appreciated.