Delta tables are the backbone of data storage in Microsoft Fabric. If you’re working with lakehouses, understanding Delta is essential.
In this post, I’m going to explain what Delta tables are, why we need them, and how they work in Microsoft Fabric. This is part of a larger series where I’ll cover everything you need to know about Delta tables, so let’s dive in.
Why We Need Delta Tables
A delta table is, at its core, a set of files on a data lake. And most data lakes fall short in a very simple way: files just get dumped onto storage, and then it's incredibly hard to actually do anything useful with them.
You cannot easily query raw files. You cannot easily run aggregations across them, filter data, or perform the everyday operations you'd do as a data engineer or data analyst. That's where Delta comes in.
Delta Lake (there are other formats that do similar things) is what Microsoft Fabric has chosen as its default table format. It gives you database-like behavior on top of file storage on the data lake.
So you get the best of both worlds. You get cheap, reliable, and fast storage on the data lake, and you can query that data with tables, columns, data types, and ACID transactions as if it were a database.
The official definition? Delta Lake is an open-source project that enables building a lakehouse architecture on top of data lakes.
What Is a Delta Table?
Forgive my handwriting; I'm better at Fabric than I am at drawing. But let me explain what's inside a delta table.
Imagine a big box called “the delta table”. Inside this box are two types of objects.
On the left side, you have the delta log. This contains all the transactions that have been added to the delta table—every insert, update, or delete.
On the right side, you have the parquet files. There can be one or more parquet files, and they store the actual physical data.
Now here's the interesting bit. The parquet files combined aren't necessarily the current state of the delta table. They can represent more than the current state, because every transaction can generate one or more new parquet files while older files stay on disk.
The delta log is the transaction log that points to the correct parquet files. When you query with Spark or SQL, the delta engine uses this log to work out which files it needs to read to serve you the correct data.
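To make that concrete, here's a minimal sketch (plain Python, no Spark needed) of how that mapping works: each JSON commit in the _delta_log folder adds or removes data files, and replaying those commits tells you which parquet files make up the current state. It ignores checkpoint files and uses a hypothetical local copy of a table folder, so treat it as an illustration of the protocol rather than production code.

```python
import json
import pathlib

# Illustration only: replay the JSON commits in _delta_log to find the
# parquet files that make up the current state of a Delta table.
# "TaxiZoneLookup" is a hypothetical local copy of a table folder; real
# tables also use checkpoint files, which this sketch ignores.
table_path = pathlib.Path("TaxiZoneLookup")
active_files = set()

for commit in sorted((table_path / "_delta_log").glob("*.json")):
    for line in commit.read_text().splitlines():
        action = json.loads(line)
        if "add" in action:          # this commit added a data file
            active_files.add(action["add"]["path"])
        elif "remove" in action:     # this commit logically removed one
            active_files.discard(action["remove"]["path"])

# Only these parquet files represent the current state of the table.
print(sorted(active_files))
```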
Why Use Delta?
We use Delta to get ACID transactions, metadata handling, and capabilities that go beyond plain storage.
A CSV file, a JSON file, or even a plain parquet file can store data. But Delta adds rules (such as schema enforcement), history, safe writes, and correct handling of concurrent writers.
Those are things you simply cannot get with just dumping files on a data lake.
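To show what that means in practice, here's a hedged sketch of operations you can run against a Delta table from a Fabric Spark notebook (where the spark session is already available) but simply can't do against a pile of CSV files. It assumes a table already exists at Tables/TaxiZoneLookup with the columns of the taxi zone lookup; adjust the names to your own lakehouse.

```python
from delta.tables import DeltaTable

# Assumes a Delta table already exists at this path in the attached lakehouse.
zones = DeltaTable.forPath(spark, "Tables/TaxiZoneLookup")

# An in-place update: recorded as a single atomic commit in the delta log.
zones.update(
    condition="Zone = 'Newark Airport'",
    set={"service_zone": "'EWR'"},
)

# A delete is just another commit. The old parquet files stay on disk until
# they are vacuumed, which is exactly what makes time travel possible later.
zones.delete("Borough = 'Unknown'")
```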
Delta Tables in Microsoft Fabric
Let's look at how this actually works in Fabric. I've opened a lakehouse called "Gold" and I'm looking at the tables section. There's a table called silver.TaxiZoneLookup that contains the taxi zone lookup from the New York City TLC open dataset that I like to use in my demos.
We’re looking at a delta table right now. You wouldn’t necessarily know this—it could have been a SQL table or a CSV file displayed as a table. But trust me, this is a delta table.
How do I know? Because we're in the tables section of the lakehouse, and any table you create in the tables section is stored as a delta table on the lakehouse.
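For example, writing a DataFrame with saveAsTable from a Spark notebook is enough to land it in the tables section as a delta table. This is a sketch with made-up file paths and a table name without a schema; if your lakehouse uses schemas (like the silver schema above), prefix the name accordingly.

```python
# Sketch: any table saved like this from a Fabric notebook shows up in the
# lakehouse tables section as a Delta table. Paths and names are illustrative.
df = spark.read.option("header", True).csv("Files/raw/taxi_zone_lookup.csv")

df.write.format("delta").mode("overwrite").saveAsTable("TaxiZoneLookup")
```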
Under the Hood on the Data Lake
Let's find out how that actually works on the data lake. If we open Microsoft Azure Storage Explorer and navigate to the Tables folder of the lakehouse, we can see how the managed storage is laid out with Delta Lake.
Every folder in this tables folder is a delta table. If we open the silver.TaxiZoneLookup folder (the table we were just looking at in Fabric), we can see what’s actually there.
We have one parquet file; it's a small file, only 7 kilobytes. Then we have a _delta_log folder that contains a JSON file for the initial commit, and potentially more JSON files for later commits.
So under the hood on the data lake, a delta table is just a bunch of files: a folder containing parquet files and a delta log.
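If you'd rather not leave the notebook, you can peek at those same files directly. This sketch assumes the lakehouse is attached as the default lakehouse, which Fabric mounts under /lakehouse/default in Spark notebooks.

```python
import os

# Assumes the lakehouse is attached as the default lakehouse in the notebook,
# so its Tables folder is mounted under /lakehouse/default/Tables.
table_dir = "/lakehouse/default/Tables/TaxiZoneLookup"

for root, _dirs, files in os.walk(table_dir):
    for name in files:
        print(os.path.join(root, name))

# Expected output: one small .parquet data file, plus the JSON commit files
# inside the _delta_log folder.
```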
Benefits of Delta Lake in Fabric
The benefits we get in Fabric from using Delta Lake are mostly about reliability, performance, and flexibility in how we consume the data.
We get one table for storing data, but we get multiple consumption methods. We can query the delta tables that are in our lakehouse using Power BI with Direct Lake semantic models. We can query them using SQL (either in Spark SQL or T-SQL notebooks). We can use the SQL Analytics Endpoint. We can use Spark notebooks with Python.
There are a lot of consumption methods available. We have one format that can serve many different outputs.
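As a quick illustration, here's the same table queried two ways from a single Spark notebook: once with Spark SQL and once with the Python DataFrame API. The table and column names follow the taxi zone lookup used above and are assumptions about your lakehouse.

```python
# Same Delta table, two consumption methods from one notebook.

# 1) Spark SQL
spark.sql("""
    SELECT Borough, COUNT(*) AS zones
    FROM TaxiZoneLookup
    GROUP BY Borough
""").show()

# 2) Python DataFrame API
df = spark.read.table("TaxiZoneLookup")
df.filter(df.Borough == "Manhattan").show()
```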
Governance and Time Travel
The changes we make to the data are all governed by the delta log, and that gives us time travel.
For example, we can see a history of all the commits that have been made to the table. We can query the table as it was at a point in time in the past. So we can run SELECT * FROM table TIMESTAMP AS OF '2024-01-01', if the history goes back that far, of course.
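Here's what that looks like in Spark SQL from a notebook, as a hedged sketch; the table name, date, and version number are placeholders.

```python
# Full commit history of the table (who changed what, and when).
spark.sql("DESCRIBE HISTORY TaxiZoneLookup").show(truncate=False)

# Time travel: query the table as it was at a point in time...
spark.sql("SELECT * FROM TaxiZoneLookup TIMESTAMP AS OF '2024-01-01'").show()

# ...or at a specific version number from the history.
spark.sql("SELECT * FROM TaxiZoneLookup VERSION AS OF 3").show()
```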
Those are things you simply cannot get by just storing files on a data lake, or by storing tables in SQL Server.
Wrapping Up
To do a quick recap: we have an open-source project for building lakehouses on data lakes, and that’s called Delta (Delta.io or Delta Lake).
It’s very powerful. It gives you ACID transactions, metadata, a transaction log, and it helps you create one single format for tabular data storage that is queryable through a lot of different technologies—mainly Spark, SQL, and Power BI Direct Lake. Those are the three I use the most.
In my next post, I’ll be looking at the anatomy of a delta table in more detail. Stay tuned!
What are you using Delta tables for in your Fabric environment? Let me know in the comments below!