Databricks' newly open-sourced Unity Catalog is now available
Databricks has open-sourced its Unity Catalog, calling it the industry's first open-source catalog for unified data and AI governance across clouds, data formats, and data platforms. The release brings multi-format, multi-engine, and multimodal support, and has drawn backing from across the data and AI ecosystem.
Databricks has open-sourced its Unity Catalog, which the company now calls "the industry’s first open-source catalog for data and AI governance across clouds, data formats, and data platforms." According to the company, the move to open-source the Unity Catalog is part of an initiative to provide users with an open foundation unaffected by the challenges associated with data platforms using proprietary table formats. Starting June 13, the project will be available on GitHub and hosted at Linux Foundation AI & Data.
Databricks' decision to open-source the Unity Catalog builds upon its choice to become a platform where all tables are in an open format by default and the recent general availability of UniForm. The latter is a unification format that relies on the fact that Delta Lake, Iceberg, and Hudi all store their data in Apache Parquet files: it generates the metadata for Iceberg and Hudi tables alongside Delta Lake's to guarantee interoperability across the three ecosystems.
The Unity Catalog is built using the OpenAPI specification and released under an Apache 2.0 license. It also supports Apache Hive's metastore API and Apache Iceberg's REST catalog API. In addition, the Unity Catalog features multi-format support for Delta Lake, Iceberg, Parquet, CSV, and others; multi-engine compatibility for reading cataloged data; and a multimodal design supporting tables, files, functions, and AI models. Finally, in the spirit of open collaboration, the Unity Catalog project has garnered support from Amazon Web Services, Microsoft Azure, Google Cloud, Nvidia, Salesforce, DuckDB, LangChain, dbt Labs, Fivetran, Confluent, Unstructured, Onehouse, Immuta, and Informatica, among others.
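To give a rough idea of what Iceberg REST catalog support means for a client, the sketch below builds the request paths that the Iceberg REST catalog specification defines for listing and loading tables. It makes no network calls. The base URL, the helper names, and the example namespace and table names are assumptions made for illustration; the article does not say where a Unity Catalog server exposes this API.

```python
from urllib.parse import quote

# Hypothetical base URL, chosen for illustration only.
BASE_URL = "http://localhost:8080/iceberg"

# The Iceberg REST specification joins the levels of a multi-level
# namespace with the unit separator byte (0x1F) in URL paths.
NAMESPACE_SEPARATOR = "\x1f"


def encode_namespace(levels):
    """Join namespace levels and percent-encode the result for a URL path."""
    return quote(NAMESPACE_SEPARATOR.join(levels), safe="")


def list_tables_url(base_url, namespace_levels, prefix=""):
    """Build the path for GET /v1/{prefix}/namespaces/{namespace}/tables."""
    parts = [base_url.rstrip("/"), "v1"]
    if prefix:
        parts.append(quote(prefix, safe=""))
    parts += ["namespaces", encode_namespace(namespace_levels), "tables"]
    return "/".join(parts)


def load_table_url(base_url, namespace_levels, table, prefix=""):
    """Build the path for GET /v1/{prefix}/namespaces/{namespace}/tables/{table}."""
    return list_tables_url(base_url, namespace_levels, prefix) + "/" + quote(table, safe="")


if __name__ == "__main__":
    print(list_tables_url(BASE_URL, ["sales"]))
    print(load_table_url(BASE_URL, ["sales", "raw"], "orders", prefix="main"))
```

Any engine that already speaks the Iceberg REST protocol would issue requests of this shape, which is why supporting the API lets such engines read cataloged tables without a catalog-specific client.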
The Unity Catalog's multimodal design, which covers both data and AI, caters to the rapidly evolving AI landscape by enabling unified governance of data and AI assets, such as unstructured data for compound models or tool catalogs for LLM applications. Databricks' existing customers already leverage capabilities like a single namespace for organizing tables, data, and AI assets, centralized audit logs, unified lineage tracking, and secure cross-organization collaboration. With the 0.1 release, Unity Catalog includes these features along with the Iceberg REST API support mentioned above, and provides credential vending for secure cloud storage access.
Looking forward, Databricks plans to continue porting functionality from its original closed-source offering to the new open-source project in stages. Upcoming features include format-agnostic table write APIs, views, Delta Sharing, models with MLflow integration, remote functions, access control APIs, and more.