Databricks Unity Catalog is one of the most capable governance systems in the data platform space. It puts access control, lineage and cross-workspace governance for your entire Databricks account under one roof. It’s also a system where practitioners routinely get tripped up — not on any single concept, but on how the pieces interact: How the hierarchy is structured, how identities map to privileges and where isolation between resources actually lives.
Getting the architecture right pays off quickly. It means fewer access control surprises, cleaner migrations and designs that hold up past the next quarter. And the stakes are higher than they look: Ungoverned data is a blind spot, and you can’t value what you can’t see or protect what you aren’t tracking.
Databricks Workspaces and Unity Catalog
Before exploring Unity Catalog itself, let’s first look at the Databricks account level and Databricks workspaces, and how they relate to Unity Catalog. This will become important when discussing Unity Catalog’s purpose and capabilities.
The account sits at the top of the Databricks hierarchy. A single account can have multiple workspaces, which are the areas for execution and collaboration. In short, this is where users can log in and work with clusters, SQL warehouses, jobs, and notebooks. Databricks administrators can assign users and groups to workspaces based on their role and what data they will be interacting with. The focus of this post is Unity Catalog, so we won’t go into the classic vs. serverless compute architecture — just know that workspace compute can either be hosted in your cloud provider account (AWS, Azure, GCP) or in Databricks’ cloud account.
Unity Catalog is an account-level, unified governance layer for all data and AI assets in Databricks. Workspaces are associated with a metastore, the top-level structure within Unity Catalog (shown in the figure below). A metastore is per-region, which means it is shared across all workspaces in a given region. In essence, if you are only operating within a single region, then there is one metastore that all workspaces are linked to. Conversely, a workspace can only be tied to one metastore and can only span a single region.
The result is unified governance and access control across workspaces, which makes tracking data assets and permissions much easier, both from the administrators’ and users’ point of view: For example, a data engineer working in a data-engineering workspace and an analyst working in an analytics workspace may interact with the same underlying schemas and data.
To sum up this section: Workspaces are where users’ work happens, while Unity Catalog is where data is defined, secured and tracked. Understanding how those two planes interact is a thread that runs through most of the sections that follow.
What does Unity Catalog do?
One frequent point of confusion worth addressing right off the bat: Unity Catalog does not store the data itself. It stores metadata — in other words, the definitions, schemas, permissions, and lineage records that describe where your data lives and who can access it. Your actual data stays in your cloud object storage, such as AWS S3 or Azure Blob Storage. Earlier, Unity Catalog was described as a governance layer. More specifically, Unity Catalog handles the following within Databricks:
- Access control — Define who can access what data via a centralized privilege management system
- Automatic column-level lineage and audit logging tied to user identity — Trace where your data stems from
- Data discovery and business semantics — Explore catalogs and schemas you have access to from your workspaces and define business semantics via metric views
- Delta Sharing for governed cross-organizational data access without copying data — Share data with low overhead
Unity Catalog Hierarchy
To support these capabilities, Unity Catalog relies on a three-level namespace to organize data within Databricks. You reference any object using its three-part name: catalog.schema.table. For example, prod_finance.reporting.monthly_revenue. This naming convention is enforced everywhere in Databricks — notebooks, SQL editor, Jobs and the REST API all use it.
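For illustration, the same table can be reached fully qualified or through session defaults (the names below are the example ones from above):

SELECT * FROM prod_finance.reporting.monthly_revenue;

-- Equivalent, after setting session defaults:
USE CATALOG prod_finance;
USE SCHEMA reporting;
SELECT * FROM monthly_revenue;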
You may be wondering why workspaces aren’t in this hierarchy: Workspaces sit outside the three-level namespace because they’re execution environments, not data containers — i.e. the same catalog can be reached from any workspace bound to its metastore, so workspace boundaries don’t provide data isolation on their own. Isolation happens at the catalog level, with workspace-catalog binding serving as an additional guardrail on top (more on binding in the “Access Control” section of the post).
You can think of the figure below as a zoomed-in view into just the “metastore” component from the previous diagram:
The metastore is the top-level container within Unity Catalog. As mentioned, there is one metastore per Databricks account per region. Every workspace in that region attaches to the same metastore. The metastore stores metadata, including the definitions of catalogs, schemas, tables, privileges and lineage. What the metastore does not provide is data isolation — and this is one of the most important things to understand early. The metastore is shared infrastructure. If you need to enforce access separation between environments, teams, or domains, that happens one level lower at the catalog level, not at the metastore level. Databricks states this explicitly in their best practices: “metastores provide regional isolation but are not intended as default units of data isolation.”
The catalog is where data isolation begins. This is where you enforce access separation between environments, business domains or teams. A catalog has its own owner, its own set of privileges, and its own storage location (for managed assets). When a user does not have USE CATALOG on a catalog, that catalog is inaccessible for querying. A data engineer working in the dev_finance catalog cannot query the prod_finance catalog unless they are explicitly granted access. This is the level where some of the most consequential governance decisions happen. We’ll discuss some common organization patterns in a following section. You can also have foreign catalogs, which connect to platforms outside of Databricks and are generally read-only.
The schema is a logical grouping within a catalog. Schemas are where you organize objects by project, team, pipeline stage or use case. Privileges can be scoped to the schema level when needed, though schemas don’t carry the same isolation weight as catalogs. Within the prod_finance catalog, you might have schemas called raw, reporting and ml_features.
Data objects sit at the bottom of the hierarchy, and include different object types. Tables and views are the most familiar. Volumes are the Unity Catalog construct for governing files that do not have a tabular structure. ML models registered in MLflow and user-defined functions can have privileges assigned to them the same way you would grant access to a table.
For tables and volumes, there is an important distinction to make, namely whether they are managed or external:
- Managed tables are tables where Unity Catalog owns both the metadata and the underlying data files. When you create one, Databricks writes the data to a storage location controlled by the catalog or schema. When you drop the table, Databricks deletes the data. For most teams building new pipelines on Unity Catalog, this is the right default — the lifecycle is simple, the governance is clean and you never have to think about cloud storage paths.
- External tables register data that already lives in a cloud storage location you control. Unity Catalog owns the metadata — the schema definition, the privileges, the lineage — but the data files stay where they are. Dropping the table deletes the table’s metadata but does not touch the actual table data. External tables are the right choice in two situations: You have existing data in your data lake that you want to bring under Unity Catalog governance without moving it, or you need to manage the storage location explicitly for cost, tiering or compliance reasons.
- Managed and External Volumes follow the same pattern applied to unstructured files. A managed volume is a Unity Catalog-controlled directory for files — you upload to it, Unity Catalog manages the storage and access. An external volume wraps an existing cloud storage path with Unity Catalog governance. Volumes are useful for the non-tabular data that lives alongside your tables: Raw ingest files, model artifacts, reference data in CSV format, images for computer vision pipelines.
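As a sketch of the external variants (the table, volume and bucket names here are illustrative, and the storage path must already be covered by an external location you have privileges on):

-- Register existing Delta files in your own bucket as an external table
CREATE TABLE prod_finance.raw.legacy_events
USING DELTA
LOCATION 's3://my-company-lake/legacy/events/';

-- Wrap an existing path with Unity Catalog governance for file access
CREATE EXTERNAL VOLUME prod_finance.raw.ingest_files
LOCATION 's3://my-company-lake/ingest/';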
The rule of thumb: Default to managed objects. They reduce operational overhead and produce cleaner governance. Switch to external only when you have a specific reason — typically existing data you cannot or do not want to move. By default, managed tables and volumes are stored in the metastore’s root storage location. But you can override this at the catalog or schema level by assigning a managed storage location (i.e. a cloud storage path tied to a storage credential) at creation time. Data written to managed objects then resides in the closest override: A schema-level location takes precedence over a catalog-level one, which takes precedence over the metastore default. This gives administrators control over where data physically lives and is useful for isolating environments, meeting data residency requirements or separating cost centers without changing how users interact with the three-level namespace.
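A sketch of assigning managed storage locations at creation time (the paths are illustrative and must be registered as external locations first):

-- Catalog-level override: managed objects in this catalog land here by default
CREATE CATALOG dev_finance
MANAGED LOCATION 's3://dev-finance-bucket/managed/';

-- A schema-level override takes precedence over the catalog-level one
CREATE SCHEMA dev_finance.raw
MANAGED LOCATION 's3://dev-finance-bucket/raw-managed/';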
As an aside, if you’re migrating from Hive metastore, the main mental shift is that governance is now account-wide rather than per-workspace, and the namespace gains a catalog level above schemas. In addition, Unity Catalog is open source, and available at github.com/unitycatalog/unitycatalog. Databricks released the specification and a reference implementation at Data + AI Summit 2024.
Real-World Organization Patterns
Understanding the hierarchy is one thing, but knowing how to organize your data assets is where teams spend most of their planning energy. For most teams, the recommended starting point is environment-based catalogs.
Environment-based catalogs use a prefix per environment and per domain: dev_finance, staging_finance, prod_finance. Environment isolation falls naturally out of catalog boundaries — developers get access to dev_* catalogs, not prod_*. The main tradeoff is catalog proliferation: A large organization can end up with dozens of catalogs, but the cost of that is much lower than the cost of retrofitting isolation into a structure not designed for it.
Alternative patterns exist — domain-based catalogs (where the environment lives inside as a schema or schema prefix) and medallion schemas within a single catalog. The domain-based approach makes sense when data contracts are organized around business units rather than SDLC stages. Medallion-in-a-catalog may work better when you have one tightly integrated team. Choosing the most suitable pattern depends on your organization’s size, compliance requirements and how many teams share the platform. When in doubt, start with environment-based catalogs. You can always consolidate later. Adding isolation after the fact is much harder. Once you’ve decided on how to organize your environments and data assets, the next step is to define the proper access control, which the next sections cover.
Access Control — Who Can See What
Binding Catalogs to Workspaces
The first access control check happens at the workspace level through workspace-catalog binding, which is the process of specifying which catalogs a workspace can access. This determines which workspaces can reach specific catalogs and it applies regardless of a user’s individual privileges.
Out of the box, every catalog is available to every workspace on the same metastore (except for the “Workspace Catalog,” a catalog that gets created for each new workspace and is scoped to only that workspace by default). Administrators can tighten this by binding catalogs to specific workspaces. Once bound, the restriction is absolute: A user with full privileges on a catalog is still denied access if they’re working from an unbound workspace. Essentially, no grant overrides a missing binding. Bindings can also be set to read-only, which further supports clean environment isolation: One example would be letting users in a development workspace query data in a production catalog without any possibility of modifying it. Once a catalog is bound to one or more workspaces, one can define specific privileges on the data.
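The isolation mode itself can be toggled in SQL; the workspace assignments are then managed in Catalog Explorer or via the workspace bindings REST API:

-- By default a catalog is open to all workspaces on the metastore;
-- ISOLATED restricts it to explicitly bound workspaces
ALTER CATALOG prod_finance SET ISOLATION MODE ISOLATED;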
One important thing to note: Binding isn’t limited to catalogs. External locations and storage credentials can be bound to workspaces in a similar fashion. This matters for environments that share cloud storage, so you can enforce that only the production workspace can use the production storage credential.
Overview of Privileges in Databricks
Privileges in Unity Catalog are hierarchical and cascade downward: Granting a privilege at the catalog level automatically extends it to all current and future child schemas and objects. This applies to all privileges, including gating privileges like USE CATALOG and USE SCHEMA.
To read a table, a user needs three things: USE CATALOG on the catalog the table lives in, USE SCHEMA on the schema and SELECT on the table itself. All three must be in place for a query to succeed. The key thing to understand about USE CATALOG and USE SCHEMA is that they are gating privileges — they control whether a user can interact with that level of the hierarchy at all, but they do not by themselves grant access to data. A user with USE CATALOG on prod_finance can see that the catalog exists and navigate into it, but they cannot read any tables unless they also have SELECT. Conversely, a user with SELECT on a specific table still cannot reach it unless the gating privileges are also in place. One exception: The BROWSE privilege bypasses the gating requirement for metadata discovery, letting users see that an object exists and view its metadata (description, columns, tags) — but not its actual data — without USE CATALOG or USE SCHEMA.
Because all privileges cascade, you have flexibility in how broad or narrow your grants are. Granting USE SCHEMA at the catalog level automatically applies it to all schemas in that catalog. Granting SELECT at the catalog level gives read access to every table in every schema. This lets you grant privileges on many objects at once when you want broader access. For tighter control, you can grant at the schema or individual object level instead. One standing rule that applies regardless of everything above: Assign privileges to groups, not to individual users. Individual-level grants can create an operational nightmare — they accumulate quietly and are hard to audit, so a user’s access should instead be a product of the groups they belong to.
To recap: Privileges cascade downward when granted at a higher level, and gating privileges are prerequisites that control visibility and navigation without themselves conferring data access.
A complete example of granting read access to a reporting table looks like this:
GRANT USE CATALOG ON CATALOG prod_finance TO `data-analysts`;
GRANT USE SCHEMA ON SCHEMA prod_finance.reporting TO `data-analysts`;
GRANT SELECT ON TABLE prod_finance.reporting.monthly_revenue TO `data-analysts`;
There are many more privileges in Databricks, such as MODIFY for writing to tables, privileges for creating objects (CREATE SCHEMA, CREATE TABLE, CREATE VOLUME, etc.) and privileges for working with volumes specifically (READ VOLUME, WRITE VOLUME). The example above covers a common read pattern for tables.
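Writes, auditing and revocation follow the same pattern (the group names are illustrative):

-- MODIFY covers INSERT, UPDATE and DELETE on a table
GRANT MODIFY ON TABLE prod_finance.reporting.monthly_revenue TO `finance-engineers`;

-- Inspect or roll back existing grants
SHOW GRANTS ON TABLE prod_finance.reporting.monthly_revenue;
REVOKE SELECT ON TABLE prod_finance.reporting.monthly_revenue FROM `data-analysts`;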
Service principals are worth calling out specifically. When a Databricks Job runs, it should authenticate as a service principal — not as whoever happened to set the job up. Unity Catalog supports granting privileges directly to service principals the same way it does for groups. This keeps your production jobs from running with broad permissions inherited from whoever set them up (and prevents issues if the job creator leaves your organization). Service principals used in Unity Catalog grants must exist at the account level, rather than at the workspace level, to ensure consistency across workspaces and a single place to manage access.
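Granting to a service principal looks the same as granting to a group, using the principal’s application ID (the UUID below is a placeholder):

GRANT USE CATALOG ON CATALOG prod_finance TO `11111111-2222-3333-4444-555555555555`;
GRANT SELECT ON SCHEMA prod_finance.reporting TO `11111111-2222-3333-4444-555555555555`;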
Administrative Access
The account, workspace, and metastore levels each come with their own administrator roles, allowing for a separation of duties. Account admins sit at the top — they create metastores, assign metastore admins and control account-wide settings like restricting workspace admin capabilities. Metastore admins govern a specific metastore and hold exclusive powers such as granting metastore-level privileges, or transferring ownership of any object. Databricks recommends assigning a group rather than an individual to this role. Workspace admins manage workspace-level settings and identities, and in auto-enabled workspaces they receive default privileges (like CREATE CATALOG) which let them provision new objects without needing a metastore admin to step in. These defaults apply to their workspace and don’t carry over to others. The key takeaway: Admin roles manage the system, but they don’t automatically grant data access to end users — that still requires explicit privilege grants.
Fine-grained Data Controls: Row-level Security, Column Masking and Attribute-based Policies
For finer-grained control, Unity Catalog supports row-level and column-level security. Column masking lets you define a policy that replaces sensitive column values — like a social security number or email address — with a masked representation for users who do not have elevated access. Row filters let you define a predicate that Unity Catalog applies automatically, so a user only sees the rows they are authorized to see. Both are defined as SQL functions and attached to the table.
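A minimal sketch of both mechanisms (the table, column and group names are illustrative; is_account_group_member is the built-in membership check):

-- Column mask: only hr-admins see raw SSNs
CREATE FUNCTION prod_finance.reporting.mask_ssn(ssn STRING)
RETURN CASE WHEN is_account_group_member('hr-admins') THEN ssn ELSE '***-**-****' END;

ALTER TABLE prod_finance.reporting.employees
ALTER COLUMN ssn SET MASK prod_finance.reporting.mask_ssn;

-- Row filter: non-admins only see US rows
CREATE FUNCTION prod_finance.reporting.us_only(region STRING)
RETURN is_account_group_member('admins') OR region = 'US';

ALTER TABLE prod_finance.reporting.employees
SET ROW FILTER prod_finance.reporting.us_only ON (region);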
Unity Catalog also supports attribute-based access control (ABAC), which uses tags on data assets to drive access policies — a newer capability that complements the row and column security mechanisms. Instead of manually attaching functions table by table, you tag your data assets and define reusable policies that automatically apply masks or filters to any table carrying a matching tag — scaling across catalogs and schemas without per-table configuration.
Discovering and Sharing Data
Catalog Explorer is the UI layer that ties everything in this post together. It’s where users browse the three-level namespace, explore table schemas and sample data, view and grant permissions, trace lineage graphs, check data quality health indicators, review classification tags, and inspect metric view definitions — all in one place.
For discovery specifically, the combination of BROWSE on catalogs and Catalog Explorer’s search is what makes data findable without granting read access. Users can see that a table exists, read its description, view column names and tags, check its quality status, and then request access directly from the UI if they need the actual data. Databricks recommends granting BROWSE to all account users at the catalog level for this reason: It makes the catalog a searchable inventory rather than a locked vault where users don’t know what to ask for.
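That recommendation amounts to a one-line grant per catalog, using the built-in account users group:

GRANT BROWSE ON CATALOG prod_finance TO `account users`;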
Discovery inside a metastore is one thing. Getting data across metastore boundaries is another. Because metastores are scoped to a region, Delta Sharing is how Unity Catalog handles access across that boundary, between metastores in different regions, different cloud providers or even different organizations. The recipient queries shared tables in place, without data being copied, and the sharing relationship is governed through Unity Catalog with its own securable objects (shares and recipients).
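A sketch of the provider side (the share, recipient and table names are illustrative):

CREATE SHARE finance_share;
ALTER SHARE finance_share ADD TABLE prod_finance.reporting.monthly_revenue;

-- Token-based recipient; Databricks-to-Databricks recipients are created
-- with USING ID '<sharing identifier>' instead
CREATE RECIPIENT partner_org;
GRANT SELECT ON SHARE finance_share TO RECIPIENT partner_org;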
One gotcha worth flagging: Lineage does not traverse the sharing boundary, so if a downstream team is consuming a shared table, you won’t see their usage reflected in the source metastore’s lineage graph.
Data Lineage, Auditing and Semantics: Tracking and Defining your Data
Lineage answers two questions practitioners actually need answered: “Where did this data come from?” and “What breaks if I change it?” Forward lineage tells you the impact of modifying a table — every downstream table, view, job and dashboard that will be affected. Backward lineage tells you where the data originated, including what sources contributed to it and through which transformations. Unity Catalog captures data lineage automatically as part of its runtime architecture. Every query that reads from or writes to a Unity Catalog table emits lineage events derived from Spark execution plans. This works across SQL, Python, R and Scala, covers both batch and streaming workloads, and spans different compute entities, such as notebooks, jobs, DLT pipelines, SQL warehouses and dashboards.
Because lineage is built into the metastore layer, it inherits the same structural properties as the rest of Unity Catalog. The lineage graph is aggregated at the metastore level, so data flows captured in one workspace are visible from any other workspace on the same metastore. Permissions follow the same model in that users need at least BROWSE on a catalog to see its objects in lineage.
Lineage operates at two levels of granularity. Table-level lineage records which tables were read and written by each operation, along with the compute entity that triggered it (notebook ID, job run ID, pipeline ID and so on). Column-level lineage maps exactly which source columns flow into which target columns, making it possible to trace a PII field through a chain of transformations or understand how a derived metric is constructed. Column-level capture only works when both source and target are referenced by table name — path-based references preserve table-level lineage but lose the column mapping.
Beyond lineage, Unity Catalog’s system catalog hosts a broader set of system tables containing operational data from all workspaces in your account, giving admin teams a single audit surface without needing to aggregate logs across environments. The data stored here spans audit logs, billing usage, query history, job runs and more. Combined with the lineage tables, they form a complete observability layer: Essentially, lineage tells you where data has been, audit logs tell you who did what and query history tells you exactly how.
Lineage and audit data can be accessed in several ways, each fitting a different part of the architecture: Visually through Catalog Explorer’s lineage graph, programmatically via the REST API, and directly in SQL through the system tables described above.
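As one example, the lineage system tables can be queried directly with SQL (the target table name is illustrative; check the system table reference for the exact column set):

-- Which upstream tables feed monthly_revenue?
SELECT source_table_full_name, entity_type, event_time
FROM system.access.table_lineage
WHERE target_table_full_name = 'prod_finance.reporting.monthly_revenue';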
Two limitations are worth knowing about lineage in Databricks. The first: Lineage breaks when data is referenced by a cloud storage path instead of a Unity Catalog table name. If a notebook reads from s3://my-bucket/some/path/ rather than prod.raw.customer_events, Unity Catalog can’t connect that read to the table it represents. This is one more reason to always reference data through its three-part name. That way, you preserve the lineage graph and avoid invisible dead ends.
The second: Lineage only captures what runs through the Databricks runtime. External pipeline tools — Airflow, Fivetran and similar orchestrators — don’t push lineage into Unity Catalog automatically. Integrations exist (dbt has deep native support), and for tools without built-in integration, Unity Catalog provides an external lineage metadata API (“Bring Your Own Lineage,” currently in Public Preview) for pushing lineage manually. But teams that assume full-stack lineage coverage often discover the gaps when they’re already trying to trace a problem.
Assessing the Quality of your Data
Unity Catalog’s data quality monitoring watches your tables automatically. Enable it at the schema level, and Databricks scans every table, building per-table statistical models from historical commit patterns and row counts.
Two signals are tracked out of the box. Freshness models each table’s commit cadence and flags it as stale when a commit is unusually late. Completeness predicts expected row counts over a rolling 24-hour window and flags tables that fall short. A column-level percent null check is available in beta for catching null floods that pass the row-count check.
What makes this fit into Unity Catalog’s architecture is the integration with lineage. Each incident includes a downstream impact score — how many tables and queries are affected — and root cause analysis that traces upstream to the specific job runs that may have caused the problem, with direct links to the job run page. Results land in the system table system.data_quality_monitoring.table_results, following the same pattern as lineage and audit. Alerting uses standard Databricks SQL alerts pointed at that system table.
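Because results land as plain rows in a system table, a quality check is just a query (the schema of this table is still evolving, so treat this as a sketch):

-- Recent data quality monitoring results across monitored tables
SELECT *
FROM system.data_quality_monitoring.table_results;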
Classification
Another Unity Catalog feature is data classification, which can find sensitive data at scale without manual column-by-column review. Enable it at the catalog level (with optional schema-level scoping) and Databricks scans every table using an agentic AI system powered by Llama 3.1 running on Mosaic AI Model Serving.
Detected columns are stamped with system-governed tags in the class.* namespace, like class.email_address, class.credit_card, class.date_of_birth and region-specific tags. Each detection carries a confidence level and scanning is incremental, so only new or changed tables are rescanned automatically.
The governance payoff is how these tags connect to ABAC. From the classification results page, you can create a column-mask policy bound to a tag. Then, any column tagged class.email_address across your catalog gets masked automatically, including columns in tables that don’t exist yet. This is very practical, as discovery, tagging and enforcement all collapse into one workflow.
Metric Views
Moving into business semantics, metric views solve the problem where the same business metric gets defined differently across dashboards, notebooks and BI tools. More specifically, metric views are Unity Catalog objects that store authoritative definitions for measures. As an example, it may specify that total_revenue is defined as SUM(total_price). Unlike a regular view, the GROUP BY isn’t baked in — the engine generates the right aggregation at query time based on which dimensions the consumer selects. When someone queries total_revenue by region, the engine groups by region. When someone queries it by month, the engine groups by month. The definition lives once in the catalog. The grouping is decided at query time by whoever is asking. You can also define measures that reference other measures, so shared logic doesn’t have to be redefined in every measure.
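A hedged sketch of a definition and two queries against it (the source table, columns and the exact YAML shape follow the metric view spec at the time of writing and are illustrative):

CREATE VIEW prod_finance.reporting.revenue_metrics
WITH METRICS
LANGUAGE YAML
AS $$
version: 0.1
source: prod_finance.reporting.orders
dimensions:
  - name: region
    expr: region
  - name: order_month
    expr: DATE_TRUNC('MONTH', order_date)
measures:
  - name: total_revenue
    expr: SUM(total_price)
$$;

-- The engine derives the aggregation from the selected dimensions:
SELECT region, MEASURE(total_revenue) FROM prod_finance.reporting.revenue_metrics GROUP BY region;
SELECT order_month, MEASURE(total_revenue) FROM prod_finance.reporting.revenue_metrics GROUP BY order_month;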
Because metric views are Unity Catalog securable objects, they inherit privileges, lineage, tags and audit. They also carry agent metadata (display names, synonyms and format specs) that tell AI agents how to interpret the metrics. For example, when Genie encounters, “What was our revenue last quarter?” it resolves the term against metric view synonyms and generates SQL using the governed definition. As a result, the human writing SQL and the AI answering a question get the same number.
Conclusion
Ultimately, Unity Catalog’s strength isn’t any single feature, but the fact that all of these governance features are cohesive and consistent. The “unity” is what makes the Unity Catalog system powerful, but its feature-richness can also make the learning curve feel steep.
The teams that get the most out of Unity Catalog tend to internalize a few principles early. Assign ownership to groups, not individuals. Grant BROWSE broadly for discovery but keep USE CATALOG and USE SCHEMA tight. Use workspace binding for environment isolation before reaching for more complex solutions. Reference tables by name rather than path to preserve column-level lineage. And treat the system catalog — lineage tables, audit logs, query history — as a first-class part of your governance stack, not an afterthought.
Ungoverned data remains a blind spot, but a governance system you’ve deployed without understanding is a different kind of risk — it gives you confidence without clarity. The goal of this post was to close that gap: Not just to explain what Unity Catalog does, but to show how its layers connect so that when something doesn’t behave the way you expect, you know where to look.
