The orange logo that looks like a fidget spinner. The immense praise it gets every time it is mentioned. What is dbt? In 5 minutes or less, I will explain what this analytics engineering tool is used for and show what it’s like to work with it. For a full-on tutorial, check out the dbt Fundamentals course. For an in-depth read on dbt, check out their official documentation.
dbt (short for “data build tool”) is a data transformation and modeling tool that lets data teams transform data in their warehouse using software engineering best practices. It is often called the “T” in ELT (Extract, Load, Transform). By simply writing SELECT statements in SQL, you let dbt handle turning those statements into tables, views, and more for use in the warehouse. dbt handles how to build those data structures; deciding what to build remains the data team’s responsibility.
From ETL to ELT
Traditionally, data warehouses relied on ETL (Extract, Transform, Load): collect data from its sources, transform it, and then load it into its final place in the warehouse. This era produced modeling approaches such as the Kimball-style star schema, which were optimized for the days when storage and compute were expensive. These days, compute and storage are much cheaper, so we can flip the order to ELT (Extract, Load, Transform): load the source data first, then transform it directly in the data warehouse. Different teams and departments can then transform the same source data however they want for their own purposes, instead of having to re-transform the data and load everything again, which is a lot of repetitive work. dbt is great for this transformation step.
In dbt Cloud, the interface is a web IDE (dbt can also be run locally from the command line as dbt Core). It connects to a Git repository so that you can version control your transformation code. Each project follows a directory template with folders such as /models, /tests, and /macros, keeping the different features of dbt organized. At the bottom of the interface is the dbt command line, where commands such as dbt run or dbt test build and test the data models you create. By writing SQL code, testing it to create models, and checking it into a repo, the dbt analytics engineering workflow resembles software engineering best practices quite a bit.
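To make the directory template concrete, here is a sketch of a typical dbt project layout (the project name and subfolders are illustrative, not a fixed requirement):

```
my_dbt_project/          # hypothetical project name
├── dbt_project.yml      # project-level configuration
├── models/              # SELECT statements that become tables and views
│   ├── staging/
│   └── marts/
├── tests/               # singular data tests
├── seeds/               # CSV files dbt can load into the warehouse
├── macros/              # reusable Jinja snippets
└── snapshots/           # point-in-time captures of changing data
```

Running dbt run from the command line builds the models in models/, and dbt test executes the tests defined for them.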
A model in dbt is simply a SELECT statement. Just one single SELECT statement that transforms raw data into a final table ready for analysis, or into an intermediate step. It really is just as simple as that. dbt handles the DDL and DML while you specify what kind of model is built in a .yml configuration file: a table, a view, an ephemeral model, an incremental table, or even a snapshot. All you have to do is choose a materialization for your data warehouse and execute dbt run to have it created.
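As a minimal sketch of what that looks like in practice (the model, source, and column names here are hypothetical), a model file is just a SELECT, optionally with an inline config choosing the materialization:

```sql
-- models/staging/stg_orders.sql (hypothetical file)
-- The config() block picks the materialization; dbt generates the
-- CREATE TABLE / CREATE VIEW statements for you.
{{ config(materialized='view') }}

select
    id as order_id,
    user_id as customer_id,
    order_date,
    status
from {{ source('raw_shop', 'orders') }}
```

The {{ source(...) }} (or {{ ref(...) }}) Jinja call is how dbt knows which upstream data a model depends on, which is also how it builds the dependency graph between models.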
When software engineers write code, they have to test it to make sure it works as intended. Likewise, you should test data to make sure it fulfills certain conditions. dbt includes four built-in schema tests to check qualities of columns in tables throughout the project. They test for:
- unique values
- non-null values
- accepted values
- relationships (referential integrity to another table)
You can also write data tests for specific cases, such as checking that a total payment is ≥ 0. By running these tests, you can be sure the data fits expectations before releasing it to production for analysis.
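Schema tests are declared in a .yml file next to the models. Here is a hedged sketch of what that might look like for a hypothetical payments model (model and column names are illustrative):

```yaml
# models/schema.yml (hypothetical model and columns)
version: 2

models:
  - name: payments
    columns:
      - name: payment_id
        tests:
          - unique
          - not_null
      - name: status
        tests:
          - accepted_values:
              values: ['pending', 'completed', 'refunded']
      - name: order_id
        tests:
          - relationships:
              to: ref('orders')
              field: order_id
```

A custom data test, by contrast, is a SELECT in the tests/ folder that returns the rows that violate your rule, such as payments with a negative total; dbt test fails if that query returns any rows.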
Ah, documentation. It is so easy to forget to document your work because it is often prioritized lower than the actual code. But that always has a way of coming back to bite you in the ass. With dbt, documentation no longer has to be separate from your code. This makes it easier to include it in your workflow instead of treating it as separate work.
Documentation in dbt is written in the same .yml files as your model specifications and tests. The really interesting thing about dbt documentation is that it has its own command, dbt docs generate, which builds a documentation site from your project. Descriptions can be written in Markdown, the docs come out very neatly formatted, and you even get a DAG showing which models depend on each other. Excellent for tracking data lineage.
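Descriptions slot right in next to the tests from earlier. A small sketch, again with hypothetical model and column names:

```yaml
# models/schema.yml (descriptions sit alongside tests)
version: 2

models:
  - name: orders
    description: "One row per order, built from raw order events."
    columns:
      - name: order_id
        description: "Primary key of the orders model."
        tests:
          - unique
          - not_null
```

After running dbt docs generate, dbt docs serve lets you browse the generated site, including the lineage DAG, locally.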
dbt is an awesome tool for bringing better structure and software engineering practices to your data warehouse. With little more than SELECT statements and YAML files, an analytics engineer can model, transform, and test your data, making it more robust down the line when it is used to drive decisions in the business. It’s a great tool for amplifying the value of both data engineering and data analysis, so it’s easy to see why it is highly regarded in the data community.