Query Delta Tables Using Python Notebooks and DuckDB

Today I’m going to show you how you can use Python notebooks inside Microsoft Fabric together with DuckDB to speed up querying your delta tables, and maybe save you some money, because you won’t need to spin up a Spark cluster every time you want to do some simple querying.

I’ve never worked with DuckDB before myself. On my other screen, I have some documentation and examples open. So let’s find out how easy it is. It’s supposed to be very easy, so I think we’ll be able to figure it out in a couple of minutes.

Setting Up Your Python Notebook

We’re looking at a Python notebook here in Microsoft Fabric. I went into my “That Fabric Guy” workspace, created a new notebook, and called it “DuckDB”. The name doesn’t matter—it’s just so I can find all these examples quite easily.

If we go to the language toggle here on top, that’s where the magic happens. Right now it’s set to PySpark (Python). We need to switch it to Python, because then the notebook won’t run on the Spark runtime at all; it will run inside a plain Python runtime.

Mind you, there are only two versions of Python we can use—3.10 and 3.11. The more recent ones aren’t included in Fabric at this time. I think 3.11 should be fine.

I’ve also attached my Gold lakehouse to this notebook in order to start working.

Importing DuckDB

Before we can start querying our delta tables with DuckDB, we first have to import it.

Let’s create a cell for our imports:

import duckdb

When running this import (you can hit Control+Enter on a Windows laptop or Command+Enter on a Mac), you’ll find that the session starts in a couple of seconds. The compute is really fast on a Python notebook because there’s no Spark cluster that needs to be started up.

We’re using just one machine for Python instead of the minimum of two machines (a driver and a worker) that Spark needs. So this should be faster and cheaper.

Creating the Connection

Now that we’ve imported DuckDB, we can create the connection, the source path, and so on.

Let’s create a connection object:

con = duckdb.connect()
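
By the way, calling connect() without arguments gives you an in-memory database that only lives for the notebook session. If you wanted the DuckDB catalog itself to persist, you could point it at a file instead; the file name below is purely an illustration:

# Default: an in-memory DuckDB database, gone when the session ends
con = duckdb.connect()

# Hypothetical alternative: persist DuckDB's own catalog to a local file
# con = duckdb.connect("my_duckdb_catalog.duckdb")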

Next, let’s define a source path for our delta table. It points into the default lakehouse:

source_path = "/lakehouse/default/Tables/gold_dim_dates"

The default lakehouse is the lakehouse we’ve attached here: the one marked with the pin icon. You can attach more than one lakehouse, but only one of them is the default.

When reading from /lakehouse/default, we’re actually reading from that lakehouse.

We read from Tables because, if we open up the Gold lakehouse, we’ll find two folders in that data lake: one called Tables and one called Files. The delta tables live under Tables, which is what we want in this case.
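
If you want to double-check what the notebook actually sees under that mount, a quick directory listing does the trick (this is just a sanity check, not something DuckDB needs):

import os

# List the delta tables the default lakehouse exposes under Tables
print(os.listdir("/lakehouse/default/Tables"))

# The Files folder sits right next to it
print(os.listdir("/lakehouse/default/Files"))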

Registering Your Delta Table

Now we want to register our delta table. We call execute on our connection object:

con.execute(f"""
    CREATE TABLE gold_dim_dates AS 
    SELECT * FROM delta_scan('{source_path}')
""")

What this is doing: the curly braces inject the value of the source_path variable into the string. That works because we put an f in front of the string definition, turning it into an f-string.

When I first ran this cell, I got an error near the slashes and had to shuffle the quotes around a bit: delta_scan expects a string, so the injected path has to be wrapped in single quotes inside the SQL.

Running this creates a DuckDB table called gold_dim_dates, populated from the delta table at source_path.
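
One thing I picked up from the DuckDB documentation while doing this (so treat it as an optional variation): CREATE TABLE copies the data into DuckDB itself, whereas a view just wraps the delta_scan and reads the delta files at query time. For a bigger table, that can be the nicer option; the view name here is just an example:

# Hypothetical variation: register a view instead of copying the data into DuckDB
con.execute(f"""
    CREATE VIEW gold_dim_dates_v AS
    SELECT * FROM delta_scan('{source_path}')
""")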

Querying Your Data

Now, this is where the magic happens. We can run SQL queries against the registered table and pull the results into dataframes:

df_select = con.execute(f"""
    SELECT * FROM gold_dim_dates
""").fetchdf()

display(df_select)

What I always do is use triple quotation marks for multi-line strings. You can then put an f in front to make the string dynamic if needed.

When I first ran this, the display wasn’t working because I had forgotten to actually fetch the result of the query into a dataframe. That’s what .fetchdf() does.

And now we should see the data from this table. Indeed, it works!
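
You’re not limited to SELECT * either; anything in DuckDB’s SQL dialect works against the registered table. Two small examples that don’t depend on specific column names:

# Count the rows without pulling the whole table into pandas
df_count = con.execute("SELECT COUNT(*) AS row_count FROM gold_dim_dates").fetchdf()
display(df_count)

# DESCRIBE shows the column names and types DuckDB picked up from the delta table
df_schema = con.execute("DESCRIBE gold_dim_dates").fetchdf()
display(df_schema)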

When to Use DuckDB vs Spark

This is something new for me as well: I hadn’t worked with DuckDB before, and I didn’t prepare beyond reading a few documentation posts. But that’s exactly what shows how easy and convenient it is to set this up.

If you know a little bit about Python, it will be very easy and convenient to set up a DuckDB connection and start writing SQL on your delta tables. That’s actually really cool.

My recommendation would be not to start reading billions of rows through DuckDB. If you really need to process data at that scale, you’ll benefit greatly from Spark’s distributed compute power.

With Spark, you can create a cluster with multiple nodes that each handle their own share of the processing. That isn’t possible with DuckDB, which runs on a single machine.

However, if you’re working with smaller tables—like a date dimension, taxi types, or any other smaller datasets—then Python notebooks using DuckDB and SQL will help you save a bit of money by not running the more expensive Spark clusters in Microsoft Fabric.
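
And if a table sits somewhere in between, you can at least keep the amount of data you pull into pandas small by filtering or limiting inside the SQL rather than after the fetch. A minimal sketch (the LIMIT value is arbitrary):

# Only materialize a slice of the table in pandas; the filtering happens inside DuckDB
df_preview = con.execute("""
    SELECT * FROM gold_dim_dates
    LIMIT 100
""").fetchdf()
display(df_preview)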

Quick Recap

Here’s what we covered:

  • Set your notebook to Python runtime (not PySpark)
  • Import DuckDB: import duckdb
  • Create a connection: con = duckdb.connect()
  • Register your delta table using delta_scan()
  • Query using SQL and fetch results with .fetchdf()
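
Put together, the whole flow fits in one short cell (same assumptions as above: a default lakehouse attached to the notebook with a gold_dim_dates delta table under Tables):

import duckdb

# Connect to an in-memory DuckDB database
con = duckdb.connect()

# Path to the delta table in the default lakehouse
source_path = "/lakehouse/default/Tables/gold_dim_dates"

# Register the delta table in DuckDB
con.execute(f"""
    CREATE TABLE gold_dim_dates AS
    SELECT * FROM delta_scan('{source_path}')
""")

# Query with plain SQL and fetch the result as a pandas dataframe
df_select = con.execute("SELECT * FROM gold_dim_dates").fetchdf()
display(df_select)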

DuckDB is perfect for smaller datasets and quick queries. For big data processing, stick with Spark. But for dimensional tables and quick analysis? DuckDB will save you money and startup time.

Have you tried DuckDB with Fabric yet? What’s your experience been? Let me know in the comments below!
