Today I’m going to teach you how the Python merge statement works, something we use a lot in Microsoft Fabric.
This is very useful information if you’re a data engineer, and if you’re considering moving into Microsoft Fabric, writing a merge statement in a notebook is a great skill to have. So let’s dive in.
What Is a Merge Statement?
A merge statement (also called an upsert) is how you intelligently load data into a table. If a row already exists, you update it. If it doesn’t exist, you insert it as a new row.
This is fundamental to data engineering. Instead of dropping and recreating tables every time (which is expensive and slow), you merge new data into existing tables.
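Before we touch Spark, here’s a quick plain-Python analogy (my own illustration, not part of the notebook): an upsert behaves just like updating a dict that is keyed on the row’s identifier.
# Plain-Python analogy, not Spark: an upsert into a dict keyed by id
target = {1: 'Alice', 2: 'Bob'}
source = {2: 'Bob Updated', 3: 'David'}
target.update(source)  # matching keys get updated, new keys get inserted
print(target)          # {1: 'Alice', 2: 'Bob Updated', 3: 'David'}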
Setting Up the Notebook
I’ve prepared a notebook that shows you exactly how this works. Let me walk you through it step by step.
Step 1: Import Delta Table
First, we need to import the Delta Table library. This lets us talk to delta tables on the data lake:
from delta.tables import DeltaTable
Step 2: Create Target Dataset
Next, we’ll create our target data set. This is the table that already exists—the one we’re merging data into:
columns = ['source_system', 'id', 'name', 'score']
data = [
    ('A', 1, 'Alice', 85),
    ('A', 2, 'Bob', 90),
    ('B', 1, 'Charlie', 78)
]
target_df = spark.createDataFrame(data, columns)
display(target_df)
# Write to delta table
target_df.write.format("delta").mode("overwrite").saveAsTable("test")
As you can see, this table of three rows will be loaded into our delta table called “test”.
If we open the explorer, go to our lakehouse, and refresh, we’ll see that we now have a test table with the columns we just defined.
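If you’d rather verify from code than from the explorer, here’s a quick sketch (using the same table name as above):
# Optional check: confirm the table exists and read it back from the lakehouse
print(spark.catalog.tableExists("test"))
display(spark.read.table("test"))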
Step 3: Create Source Dataset
Now we’ll create our source data set. This source data set will contain the same structure of data but with different content. There might be some rows that are updated, and there might be some rows that are new:
source_data = [
    ('A', 2, 'Bob Updated', 95),  # This will UPDATE A2
    ('A', 3, 'David', 88),        # This will INSERT (new)
    ('C', 1, 'Eve', 92)           # This will INSERT (new)
]
source_df = spark.createDataFrame(source_data, columns)
display(source_df)
This source data set is just being created in memory as a Spark dataframe. We’re not loading this into delta as of right now.
By comparing the rows, we should be able to figure out what’s going to happen:
- The A1 row is not present in the new data, so it won’t be touched
- The A2 row will be updated (name and score change)
- B1 won’t be touched
- A3 and C1 will be added as new rows
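If you want to confirm that prediction before actually merging, you can join source and target on the keys yourself. This preview isn’t part of the original notebook, just an illustrative sketch:
# Optional preview (illustration only): which source rows will update vs insert?
keys = ['source_system', 'id']
will_update = source_df.join(target_df, on=keys, how='left_semi')  # keys already in target
will_insert = source_df.join(target_df, on=keys, how='left_anti')  # keys not yet in target
display(will_update)  # these rows will be updated
display(will_insert)  # these rows will be inserted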
The Merge Statement
This is where the magic happens.
Step 4: Configure the Key
First, we need to configure the key of our table. In this case, we say the source_system column and the ID column together are the key of our table:
table_keys = ['source_system', 'id']
Step 5: Generate the Merge Statement
Now we’re going to generate the merge statement:
# Create delta table object
delta_table = DeltaTable.forName(spark, "test")
# Define the merge condition
merge_condition = ' AND '.join([f"target.{key} = source.{key}" for key in table_keys])
# Perform the merge
delta_table.alias('target').merge(
    source=source_df.alias('source'),
    condition=merge_condition
).whenMatchedUpdateAll() \
 .whenNotMatchedInsertAll() \
 .execute()
Let me break down what’s happening here.
We create an object called delta_table and point it to the table name “test” that we defined earlier in the notebook.
We define the merge condition by building a “target key equals source key” comparison for every key in our table_keys list and joining those comparisons with AND. In our case that produces a simple join condition: target.source_system = source.source_system AND target.id = source.id.
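If you’re curious what that list comprehension actually produces, you can print the condition string:
print(merge_condition)
# target.source_system = source.source_system AND target.id = source.id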
Then we merge. The delta table is the target we’re merging onto, and our dataframe is the source (aliased to “source”).
Why the aliases? Because the merge condition refers to columns as target.&lt;key&gt; and source.&lt;key&gt;, both sides need those names so the condition resolves correctly.
Understanding the Logic
A merge condition is nothing more than this: if the key of an incoming row matches a key in the target table, we already know this row, so we update it in place.
If the key does not match anything in the target, this is new data, so we insert it as a new row.
So we put the merge condition in, and then we say:
- When matched: update
- When not matched: insert
We execute the merge statement, and this will actually commit data into the delta table.
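whenMatchedUpdateAll and whenNotMatchedInsertAll simply take every column from the source. If you need finer control, the Delta API also lets you list the columns explicitly. Here’s a sketch of that variant for our example (same table and keys as above, not what the notebook actually runs):
# Variant: update/insert only explicitly listed columns
delta_table.alias('target').merge(
    source=source_df.alias('source'),
    condition=merge_condition
).whenMatchedUpdate(set={'name': 'source.name', 'score': 'source.score'}) \
 .whenNotMatchedInsert(values={
     'source_system': 'source.source_system',
     'id': 'source.id',
     'name': 'source.name',
     'score': 'source.score'
 }) \
 .execute()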
Checking the Results
Let’s do a simple check on the results:
result_df = spark.sql("SELECT * FROM test")
display(result_df)
Here we are. We have five records. Is that what we expected?
Yes! We started out with three records and merged a source of three records into the table. One of those was an in-place update of an existing row, and two were new inserts.
Three original records plus two new ones: five records, exactly as expected.
And this data isn’t just the in-memory result of our merge; it has already been committed to the delta table. We’re literally reading it back from the delta table here using a Spark SQL query.
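Because the merge is a regular Delta transaction, it also shows up in the table’s history. A quick way to see it (sketch, using the delta_table object from earlier):
# The merge appears as a MERGE operation in the Delta transaction log
display(delta_table.history().select('version', 'operation', 'operationMetrics'))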
Wrapping Up
That was a short post, but hopefully it explained a very important concept—the Python merge statement (or upsert).
It’s quite easy to understand once you see it in action. You’re comparing keys between source and target data. If the key matches, update. If it doesn’t match, insert. That’s it.
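The whole recipe is small enough to wrap into a reusable helper. Here’s a sketch, my own generalization of the notebook code with a hypothetical function name:
from delta.tables import DeltaTable

def upsert(spark, table_name, source_df, keys):
    # Generic upsert: update rows whose keys match, insert the rest
    target = DeltaTable.forName(spark, table_name)
    condition = ' AND '.join([f"target.{k} = source.{k}" for k in keys])
    (target.alias('target')
        .merge(source=source_df.alias('source'), condition=condition)
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())

# For the example in this post:
# upsert(spark, "test", source_df, ['source_system', 'id'])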
This pattern is fundamental to building efficient data pipelines in Microsoft Fabric. Instead of full table rewrites, you intelligently merge only the changes.
Are you using merge statements in your Fabric pipelines? What challenges have you run into? Let me know in the comments below!