Manipulating dbt Files with Shell Scripts

If you’re working with a bunch of sources within your dbt project, maintaining a single-source file (or even a couple) can quickly become cumbersome to manage, especially if you have multiple people editing and committing changes to the same file.

It is much better to have a single file per source table; this is easier to navigate, and it makes tracking changes simpler for distributed teams. In pursuit of that, I’ve created a separate folder in the root dbt project directory called sources, under which there are subfolders representing database and schema. So if I have a source table, raw.data_lake.table1, in my database, my dbt project would have both folders raw and data_lake, underneath which table1.yml would live:

```
./sources
./sources/raw
./sources/raw/data_lake
./sources/raw/data_lake/table1.yml
```
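If you’re starting from scratch, that skeleton can be created in one go; the database/schema/table names below are just the running example, not anything dbt requires:

```shell
# Create the example per-table sources layout (names from the example above)
mkdir -p sources/raw/data_lake
touch sources/raw/data_lake/table1.yml
```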

I picked up this concept from a blog post by Shopify engineering, although the layout above is a slightly more “verbose” version of the one described in that post.

I could, of course, do everything by hand manually—write each yml separately—but my attention span is (at a push) five files. After that, my brain starts wondering what’s for dinner, whether landscaping work in my garden is ever going to end (or begin), and mistakes creep in.

So, here’s a semi-manual solution to this very, very niche problem. It’s semi-manual because you still need to copy-paste and press a few Return keys.

Prerequisites

This blog assumes some familiarity with dbt. You’re welcome to run with this blog regardless, of course, but I’d also suggest checking out dbt’s excellent online courses page, where you’ll find a much more methodical introduction to its concepts.

I’m also going to lean heavily on UNIX utilities to bulk-edit files, so you may want to read up on their usage separately, but there is a glossary for each switch used in this blog that should explain enough. The code below was tested with zsh on macOS, but where possible I used POSIX-compliant syntax, so it should work verbatim on WSL, Ubuntu or any other POSIX-compliant UNIX-like system. YMMV if you use an alternative shell/OS.

  • sed: stands for stream editor. This utility takes either stdin or a file reference and processes the contents as requested by the user. Think of it as find-and-replace on steroids.
  • find: like ls on steroids (bit of a pattern here) when it comes to recursively searching the current directory. It uses slightly more system resources, but its switches (notably -exec) make it worthwhile. We could have used ls, but there’s a well-known argument against parsing ls output in scripts.
  • csplit: splits a file into sections determined by a pattern, then writes those sections individually into new, separate files.

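As a tiny taste of sed’s find-and-replace, here is a one-liner on stdin; the strings are made up purely for the demo:

```shell
# Substitute one literal string for another on a stream
printf 'description: ""\n' | sed 's/description: ""/description: "TODO"/'
# prints: description: "TODO"
```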
To make use of the dbt macro, you’ll need to add a new file in your dbt project directory. Call it packages.yml, the contents of which should read something like this:

```yml
packages:
  - package: dbt-labs/codegen
    version: 0.5.0
```

Version 0.5.0 was the latest at the time of writing, but make sure you reference the latest stable version available. After that, run dbt deps to pull and install dependencies (codegen in this case), and you’re good to go for the next section.

Use dbt to Generate sources.yml

Using a pre-built macro from dbt-labs, generate the yml file for the source schema. Within dbt Cloud, run the following:

```shell
$ dbt run-operation generate_source --args '{"schema_name": "data_lake", "database_name": "raw", "generate_columns": true, "include_descriptions": true}'
```

This will generate output (click on View Logs). Note the extraneous timestamp printed at the start of the log as well.

```yml
12:30:14  version: 2

sources:
  - name: schema_name
    database: database_name
    tables:
      - name: table_name1
        columns:
          - name: col1
            description: ""
          - name: col2
            description: ""
          - name: <...>
      - name: table_name2 
        columns:
          - name: <...>
```

Paste the output starting from line 7 into a separate file called, for example, sources.yml. The file should look something like this (indentation is important):

```yml
      - name: table_name1
        columns:
          - name: col1
            description: ""
          - name: col2
            description: ""
          - name: <...>
      - name: table_name2 
        columns:
          - name: <...>
```

Use SQL to Generate sources.yml

Of course, you could get the exact same information from information_schema and use SQL text functions to print out the valid yml. Let’s do that:

```sql
select
  concat(
    '\n  - name: ', lower(table_schema)
    , '\n    database: ', lower(table_catalog)
    , '\n    description: ""'
    , '\n    tables: '
    , '\n      - name: ', lower(table_name)
    , '\n        columns: '
    , '\n'
  ) 
  || listagg(
    concat(
      '          - name: ', column_name, '\n'
      , '            description: ""', '\n'
      , '            meta: ', '\n'
      , '              pii_type: '
    )
    , '\n')
  || '\n' as source_yml
from RAW.information_schema.columns
where 
  concat_ws('.', table_catalog, table_schema) 
    = 'DATABASE.SCHEMA'
group by 
  table_catalog
  , table_schema
  , table_name

```

Each row in the output covers one source table, in the same yaml shape as the dbt output above (plus the extra meta keys).

The concat() and concat_ws() functions concatenate disparate strings together (the latter with a separator, a dot in our case), while listagg() will collapse strings across different rows into a single row.

For example, listagg() will take raw output from information_schema.columns (abridged):

| ROW | table_name | column_name |
| --- | ---------- | ----------- |
| 1   | TABLE1     | COLUMN1     |
| 2   | TABLE1     | COLUMN2     |
| 3   | TABLE2     | COLUMN3     |
| 4   | TABLE2     | COLUMN4     |
| 5   | TABLE2     | COLUMN5     |
| 6   | TABLE3     | COLUMN6     |
| 7   | TABLE3     | COLUMN7     |

And if you run the following,

```sql
select listagg(column_name, ', ') 
from information_schema.columns
group by table_name
```

it would return:

| ROW | listagg(column_name, ', ') |
| --- | -------------------------- |
| 1   | COLUMN1, COLUMN2           |
| 2   | COLUMN3, COLUMN4, COLUMN5  |
| 3   | COLUMN6, COLUMN7           |

The rest of the SQL just dresses up the above to make it a valid yaml file. See the docs pages on concat(), concat_ws() and listagg() for more info.

Optional: Add Custom yaml Keys

If you used the dbt macro to create the yml file, you might want to add a few extra keys. If you need placeholder space for metadata on each of the columns, now’s the time to do it. I’ll add tests and meta keys using the script below:

```bash
sed -i 's/description: \"\"/description: "" \
            meta: \
              pii_type: \
            tests:/' sources.yml
# where
#   -i:
#       modify in place (overwrite the original file);
#       note: BSD/macOS sed requires a backup-suffix argument
#       (e.g. `sed -i ''`), while GNU sed takes bare -i
#   's/pattern/replacement/options':
#       "s" stands for substitute. This runs a regex
#       find-replace of pattern with replacement
#       (options are... optional)
#   \ (at the end of the line):
#       line continuation indicating that the
#       command continues on the next line
```

This will look for the exact string description: "" and replace it with the description plus the new keys. I’m using description: "" here because that string repeats for each of the columns within my tables. The output of the code above should look something like this:

```
      - name: table_name1
        columns:
          - name: col1
            description: ""
            meta:
              pii_type:
            tests:
          - name: col2
            description: ""
            meta:
              pii_type:
            tests:
          - name: <...>
      - name: table_name2 
        columns:
          - name: <...>
```
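If you’d rather rehearse the substitution before overwriting sources.yml, you can drop -i and write to a scratch copy first. A minimal sketch, with illustrative file names (demo.yml, demo_preview.yml) and a one-line stand-in for the real file:

```shell
# Rehearse the substitution on a scratch copy before touching the real file
printf '            description: ""\n' > demo.yml
sed 's/description: ""/description: "" \
            meta: \
              pii_type: \
            tests:/' demo.yml > demo_preview.yml
cat demo_preview.yml
```

Once the preview looks right, point the original `sed -i` command at sources.yml.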

Split Single yaml File into Set of Separate Files

Given the input file sources.yml containing output from the previous section, use csplit to write separate files named _src\d{2}.yml, treating the empty line as the file separator:

```bash
$ csplit --quiet --prefix=_src --suffix-format=%02d.yml --suppress-matched sources.yml /^$/ {*}

# where
#   --quiet:
#       do not print counts of output file sizes
#   --prefix:
#       use PREFIX instead of 'xx'
#   --suffix-format:
#       use sprintf FORMAT instead of %02d
#   --suppress-matched:
#       suppress the lines matching PATTERN
#   /^$/:
#       regex pattern to split the files by
#   {*}:
#       repeat the previous pattern as many times as possible
# note: these long options are GNU csplit; on macOS, install
# coreutils and use gcsplit
```

See the rest of this excellent man page if you want to learn more. The code above was shamelessly stolen verbatim from this StackOverflow answer. There shouldn’t be any output in the stdout, but you will now notice a bunch of new files named _src00.yml, _src01.yml and so on.
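Here is a self-contained miniature of the same split, so you can see the naming scheme in action; the demo file contents are made up:

```shell
# Split a two-record file on blank lines, using the same GNU csplit flags
printf 'record one\n\nrecord two\n' > demo.txt
csplit --quiet --prefix=_demo --suffix-format=%02d.txt \
  --suppress-matched demo.txt '/^$/' '{*}'
cat _demo00.txt   # record one
cat _demo01.txt   # record two
```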

Insert Missing yaml Keys at Each File’s Start

Given a set of files in the current directory matching the pattern _src.*.yml, insert a multi-line string at the beginning of the file:

```bash
$ sed -i '1i version: 2 \
sources: \
  - name: schema_name \
    database: database_name \
    tables:' _src*
```

The code was adapted from this StackOverflow answer. Make sure to replace schema_name and database_name with the appropriate schema and database names. Each file should now be properly formatted yaml:

```yml
version: 2
sources:
  - name: schema_name
    database: database_name
    tables:
      - name: table_name1
        columns:
          - name: col1
            description: ""
          - name: col2
            description: ""
          - name: <...>
      - name: table_name2 
        columns:
          - name: <...>
```

Rename Each yaml File to Match Targeted Table Name

Given a set of files in the current directory matching the pattern _src.*.yml, find the sixth line within each file. Extract the text matching the pattern `.*- name: (.*)` from the contents of that line, then construct a rename statement using the captured group (removing unnecessary characters and lower-casing the string just in case):

```bash
$ find . -maxdepth 1 -regex '.*_src.*yml' -exec \
  sed -n "6 s|.*- name: \(.*\)|mv '{}' \L\1.yml|p" '{}' \
  \;

#  where
#    sed -n:
#        commonly used with the `p` flag; tells sed to only
#        print the lines you ask it to print (the default is
#        to print everything).
#        See https://superuser.com/a/852491 for more info.
#    sed \L:
#        lower-cases the rest of the replacement
#        (a GNU sed extension)
#    find -maxdepth N:
#        defines the maximum subdirectory depth that `find`
#        descends to, so a value of `2` would mean find does
#        not go further down than 1 subfolder.
#    find -regex '.*':
#        filters files on a regex pattern. A pattern of
#        '.*\.csv' would only grab files with a .csv extension
#    find -exec <command> {}:
#        executes another command for each of the files found.
#        Each file found using `find` is represented by `{}` here.
```

This should output (in the stdout) commands like the below:

```bash
mv './_src00.yml' 'name-from-_src00.yml-line-6.yml'
mv './_src01.yml' 'name-from-_src01.yml-line-6.yml'
...
mv './_srcNN.yml' 'name-from-_srcNN.yml-line-6.yml'
```

Finally, if you add | sh at the end of that code above, it would execute the rename commands (`mv`) in order, so the full script should look like this:

```bash
$ find . -maxdepth 1 -regex '.*_src.*yml' -exec \
  sed -n "6 s|.*- name: \(.*\)|mv '{}' \L\1.yml|p" '{}' \
  \; | sh
```

This should then produce a set of new yaml files with their proper names. You can now put those files in the appropriate directory listed in the source-paths config option (model-paths in dbt 1.0+) within your dbt_project.yml, and dbt will pick those up on the next run. This is adapted from this StackOverflow answer.
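As a hedged follow-up sketch (not part of the StackOverflow recipe), the renamed files can then be shuffled into the sources tree from the start of the post. The stand-in files below just make the loop runnable as-is; point it at your real yml files and paths instead:

```shell
# Move renamed per-table yml files into the sources tree
# (paths and file names are from the running example)
mkdir -p sources/raw/data_lake
touch table1.yml table2.yml          # stand-ins for the renamed files
for f in table1.yml table2.yml; do
  mv "$f" sources/raw/data_lake/
done
```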

Conclusion

And there you have it: some snippets of code you can use to manipulate the files generated through dbt or SQL. The beauty of having these scripts is that as soon as a new source is added, we can generate a new set of yml files pretty painlessly.

At the end of all these steps, the example folder structure for sources matches the layout shown at the start of this post: one yml file per source table.

More About the Author

Danny Tishenko

Data Engineer
