The Medallion Data Architecture

Data

The Medallion Data Architecture

The term “Medallion Data Architecture” was first coined by Databricks where they describe data at different stages of processing as being “bronze,” “silver” or “gold” level data. The term has since been used by Microsoft to describe stages of processing within Microsoft Fabric. It can equally be applied to data in other platforms such as Snowflake, RedShift or BigQuery.

  • Bronze data refers to data in its unprocessed state, exactly as loaded from the data source.
  • Silver data refers to data at various stages of intermediate processing.
  • Gold level data is fully cleaned and prepared ready for use by a data consumer.

Separating the bronze stage from the rest of the architecture means you always have a copy of the data exactly as it was loaded from the data source. If a fault occurs, it allows you to quickly determine if the the problem is related to source data or processing within the data platform.

Separating the gold stage from the rest of architecture insulates your data consumers from the complexities of data ingestion and integration. All they are able to access is the fully cleaned, secured and maybe pre-aggregated data.

At InterWorks, we’ve been implementing staged architectures like this for years, though in the past, we’ve referred to the different stages as “raw,” “integration” and “presentation” or something along those lines, rather than “bronze,” “silver” and “gold.” Stages in data processing don’t seem to have much to do with either Olympic medals or car wash options, but sometimes names just stick.

Snowflake’s cloud data platform is a great place to implement a medallion style data architecture:

  • Snowflake SQL extensions can natively handle semi-structured data such as JSON or Parquet, so it’s straightforward to convert things like API data into regular tables.
  • The Snowflake system architecture that de-couples storage and compute allows for fast and efficient sharing of data across databases and between Snowflake accounts without having to make separate copies of the data.
  • Snowflake dynamic tables allow for efficient incremental processing of data without the need for additional SQL to separate existing and new data.
  • Row and column access policies can be applied to any table to hide or mask data that a user is not authorised to see with no need for additional code.

Beyond the three basic stages, there’s value to organising processing in a fixed order. As a data engineer, if you know that data type conversion is done at a particular point, then you don’t have to keep checking or reconverting the data in different places.

If you identify a problem that relates to a particular function, you know just where that processing has occurred.

Overall, an organised medallion data architecture will make your data platform more efficient and easier to maintain, ultimately reducing its cost.

Here’s what a medallion data architecture might look like with data processed in a structured way.

More About the Author

Mike Oldroyd

Data Architect
Diagnosing Issues in Matillion ETL Using Component Level Logging Sometimes when you’re developing a data pipeline in Matillion ETL, you may find that a component that you’re working on just isn’t ...
Domain Ownership: Data Products within Business Functions A few years ago, the dominant architecture for data and analytics was based around an enterprise data warehouse (EDW). The intention ...

See more from this author →

InterWorks uses cookies to allow us to better understand how the site is used. By continuing to use this site, you consent to this policy. Review Policy OK

×

Interworks GmbH
Ratinger Straße 9
40213 Düsseldorf
Germany
Geschäftsführer: Mel Stephenson

Kontaktaufnahme: markus@interworks.eu
Telefon: +49 (0)211 5408 5301

Amtsgericht Düsseldorf HRB 79752
UstldNr: DE 313 353 072

×

Love our blog? You should see our emails. Sign up for our newsletter!