Concurrency with Unlimited Scale in Matillion’s Data Productivity Cloud

Data

Concurrency with Unlimited Scale in Matillion’s Data Productivity Cloud

//

Ever found yourself stuck in development limbo, waiting for a job to finish, only to realize you might need to tweak it again? It’s one of the less enjoyable parts of being a data engineer. But what if I told you there’s a way to speed things up and make your life easier? The answer is simple: run your tasks concurrently. I know you know, but did you know this is possible in Matillion Data Productivity Cloud? Let’s explore how you can achieve this and discover the benefits it brings!

What is Matillion’s Data Productivity Cloud?

In a previous blog post, I highlighted that Matillion DPC is a cloud-based product offered as a service, eliminating the need to manage virtual machines or staging space, and accessible entirely through a web browser. The Data Productivity Cloud combines data loading and data transformation capabilities into one robust platform. It supports unlimited users, projects and environments, providing unlimited scalability with Matillion-managed or user-managed agents.

It features intuitive access control, allowing permissions to be granted to projects by assigning users to one of three roles: read-only, user or admin. This flexibility promotes effective collaboration, enabling team members to share resources and work efficiently together using Git, which is crucial for version control and collaborative teamwork.

Matillion DPC is designed to be user-friendly, yet offers a wide range of capabilities, significantly reducing the need for multiple tools to meet business requirements. By focusing on one comprehensive tool, data engineers can save time, reduce costs and enhance the overall efficiency of data management processes.

For more information on Matillion DPC, check out our blog series on the topic.

What are the Benefits of Concurrency?

Any jobs that use multiple threads can take advantage of the scaling capability of Matillion-managed or user-managed agents. By leveraging Concurrency, you can observe a reduction in total job execution time. This can be highly advantageous in situations where:

  • A single task tries to execute numerous subtasks — for example, when using iterators.
  • Multiple tasks are being run on the same agent simultaneously — such as when various schedules initiate at 6AM.

How Does it Look Like?

At InterWorks, we have tested the concurrency capabilities of Matillion DPC by simulating a real-world data engineering scenario: loading tables from a database into Snowflake. This test is comprehensive, utilizing a wide variety of components.

Our primary job is divided into two orchestration sub-jobs:

  • Metadata Management
  • Data Load

Screenshot 2024-07-25 at 13.55.08.png

Metadata Management

This job creates a schema and a metadata table if they do not already exist. It extracts metadata from the source database (PostgreSQL) and inserts this metadata into the newly created metadata table. The “Metadata Merge” transformation job ensures that an existing metadata table is not dropped, if appropriate.

Screenshot 2024-07-25 at 13.52.00.png

Data Load

This job also creates a schema if it does not already exist. It reads table names from the metadata table and saves them to a Grid Variable. It then sorts the table names (optional but nice to have) and uses a Grid Iterator to call Import Single Table, passing the table name.

The Import Single Table job, created by InterWorks, covers many use cases. It checks if a table already exists and if it can be loaded incrementally. If it can, the job extracts the incremental columns and generates the incremental query. If not, it generates a full load query. In both cases, the generated query triggers the data load for that table. Importantly, the grid variable passing the table name can be configured to run either concurrently or sequentially.

Screenshot 2024-07-25 at 13.54.31.png

Testing Procedure

For our tests, we used a Python script to generate tables containing 20 columns of various data types and 100 rows each. These tables were loaded into a PostgreSQL database, with three different table counts being tested: 10, 100 and 1000. Before each measurement, the schemas for the metadata and the tables were dropped.

The results of our tests are represented in the following chart:

image.png

There’s a noticeable 58% decrease in total runtime when processing 10 tables. This performance enhancement becomes even more apparent as the number of processed tables rises, reaching an impressive 93% with 1000 tables.

Wrap Up

The concurrency capabilities of Matillion DPC demonstrate impressive performance, particularly when executing multiple subtasks simultaneously, which significantly reduces runtime. This efficiency is not limited to grid variables but extends to all iterators available in Matillion DPC. Additionally, Matillion DPC will execute jobs concurrently if they are scheduled at the same time, making it a comprehensive SaaS tool for data loading and transformation. Matillion DPC is an invaluable asset for your data engineering tasks, enhancing both speed and efficiency.

KeepWatch by InterWorks

Whether you need support for one platform or many, our technical experts have you covered.

More About the Author

Fadi Al Rayes

Data Engineer
Concurrency with Unlimited Scale in Matillion’s Data Productivity Cloud Ever found yourself stuck in development limbo, waiting for a job to finish, only to realize you might need to tweak it again? ...
Simplifying Secure Access to Snowflake via Okta SSO This is the second and last part of the series on managing Snowflake users and roles via Okta. In our first part, we introduced a user ...

See more from this author →

InterWorks uses cookies to allow us to better understand how the site is used. By continuing to use this site, you consent to this policy. Review Policy OK

×

Interworks GmbH
Ratinger Straße 9
40213 Düsseldorf
Germany
Geschäftsführer: Mel Stephenson

Kontaktaufnahme: markus@interworks.eu
Telefon: +49 (0)211 5408 5301

Amtsgericht Düsseldorf HRB 79752
UstldNr: DE 313 353 072

×

Love our blog? You should see our emails. Sign up for our newsletter!