Data Stack in a Box — New South Wales Department of Education (ft. DuckDB, Dagster, dbt, dlt and Evidence.dev)
TL;DR
You can setup your own data stack in a box with the click of a button over on GitHub: nsw-doe-data-stack-in-a-box
Open Data
Public data, and their re-use, are key resources for social innovation and economic growth. Open Data provides new opportunities for governments to collaborate with citizens and evaluate public services by giving citizens access to data about those services. — worldbank
The New South Wales (NSW) Department of Education (DOE) recently (June 2024) migrated its datasets to Data.NSW. I wanted to experience how the public sees our data.
Here is a quick peak at exploring the catalog of data on Data.NSW.
But what if I want to view metrics across different data sets?
Or better yet get a holistic view of all the metrics available?
And how do I combine Data.NSW with other publicly available datasets not on Data.NSW?
We need a data stack!
What is a Data Stack
In short it’s a set of tools and practices to bring data together to derive analytics such as (but not limited to):
- Descriptive Analytics tells you what happened in the past.
- Diagnostic Analytics helps you understand why something happened in the past.
- Predictive Analytics predicts what is most likely to happen in the future.
- Prescriptive Analytics recommends actions you can take to affect those outcomes.
The tools usually include data ingestion, transformation, storage, and visualisation tools.
In a box is referring to all the tools being in one place enabling single node analytics. In other words, all the tools can run on your machine (no external server dependencies).
Why a data stack in a box?
Data engineering used to be so challenging to reproduce in terms of time and money to setup.
Personally its been expensive, I have given AWS, GPC, and Azure tonnes of my money 💰 by accident. I once left AWS Sagemaker compute cluster on for a few days and got whacked $300 ouch.
However, we can create entire data stacks in a few mins 🚀 and do it completely free. Go have a beer 🍺 on me with the money saved, you deserve it.
Here are some of my favorite data stacks that I have stolen ideas from:
What are we trying to Solve with this Data Stack in a Box?
Main quests:
- Simple, one click deployment that is a totally free 💲 data stack, aiding in learning and proof of concepts.
- Getting humans excited about the publicly available data curated by NSW Department of Education and our partners.
Side quests:
- Help identify data quality and reliability issues with our data. This project is being run daily, with several tests so that we can improve quality and identify issues.
- Building a report that shows all metrics related to NSW Department of Education (DOE) data and consolidate any other publicly available data.
Architecture
Replicating the data Stack
Check out the repository nsw-doe-data-stack-in-a-box for instruction on how to set up the data stack.
And here is what it looks like to run the data stack with Dagster:
Dagster will coordinate our data stack. First it kicks off our data ingestion tool dlt, which loads our in memory data warehouse DuckDB with all that public data. Dagster then kicks off our data transformations with dbt to create data models ready for reporting.
Data Storage — Why DuckDB?
It’s a lightweight enterprise data warehouse. With DuckDB you have the simplicity of SQLite and the functionality of Snowflake, on your local machine.
Want to know more, see: https://kojo.blog/duckdb/
Reporting — Evidence.dev
Behind the scenes NSW Data uses CKAN to power the data portal. There are some reporting capabilities built into CKAN, however I wanted to see all the metrics in one place, which will help with data validation.
Now we have all of this data we can create a dashboard. For my sanity I needed to use a code-first approach to data visualisation so I choose evidence.dev. This has the added benefit of being able to incorporate into our CICD pipeline.
- Note — ideally I would have built a proper semantic layer. This is an area I want to work on. The features that I wanted are available from dbt and cube.dev but they are behind a pay wall. So I skipped them for this project. Instead I am just creating a basic denormalised metrics table.
Conclusion
This project is just getting started. We have already completed our first main quest for a Simple, one click deployment that is a totally free 💲 data stack.
The ability for everyone to reproduce your entire data stack in a box is a superpower. This fast tracks our ability to onboard new engineers whilst also providing the ability for us to learn and test new parts of the data stack.
Each time this data stack in a box is built it should always have the same Operating System, Programming Languages (Pyhton & Java Script), packages and Integrated Development Environments (including extensions).
This will no longer be an issue:
The project is also tackling several other challenges including DevOps (CICD), data quality, debugging, metadata management, and pipeline visibility. But more on that in the future. This is a living breathing project.
You can find the full code of the application on github over in the repo nsw-doe-data-stack-in-a-box.
Disclaimer: these are my own thoughts and opinions and not a reflection of my employer.