I recently ran into a sneaky issue while migrating a customer from SQL Server to Microsoft Fabric. We were doing what every lazy developer does—copy-pasting code—and suddenly our joins stopped working. Half the data was missing, relationships weren’t connecting, and it took me way too long to figure out why.
The culprit? Case sensitivity. SQL Server doesn’t care if your key is “A1” or “a1”—they’re the same. But PySpark in Fabric? It absolutely cares. And if you’re not aware of this difference, you’re going to have a bad time.
Today I’m going to show you a simple trick to handle case sensitivity issues when migrating code to Fabric. This is a short one, so let’s dive in.
The Problem: Case Sensitivity Breaks Your Joins
Here’s the scenario. You’ve got a dimension table with product keys and a fact table with sales data. In SQL Server, these tables join perfectly fine even if the keys have different casing. But when you migrate that same code to Fabric and run it in a PySpark notebook, suddenly your joins are broken.
Let me show you what I mean with some demo data.
# Create a products dimension with UPPERCASE keys
products_data = [
    ("A1", "Road Bike"),
    ("B1", "Mountain Bike"),
    ("C1", "Hybrid Bike"),
    ("D1", "City Bike")
]

dim_products = spark.createDataFrame(products_data, ["product_key", "product_name"])

# Create sales fact table with mixed-case keys
sales_data = [
    ("A1", 10),  # Uppercase - will match
    ("b1", 5),   # Lowercase - won't match
    ("c1", 8),   # Lowercase - won't match
    ("d1", 3),   # Lowercase - won't match
    ("E1", 7),   # Doesn't exist in products
    ("a1", 12)   # Lowercase - won't match
]

fact_sales = spark.createDataFrame(sales_data, ["sales_product_key", "sales_quantity"])
Now let’s try a basic join like you’d write in SQL Server:
# This join will only match the uppercase "A1"
result = fact_sales.join(
    dim_products,
    fact_sales.sales_product_key == dim_products.product_key,
    "left"
).select(
    "sales_product_key",
    "product_key",
    "product_name",
    "sales_quantity"
)

result.show()
When you run this, you’ll see that only the uppercase “A1” matches. None of the lowercase keys (b1, c1, d1, a1) match, even though they should. E1 doesn’t match either, but that’s expected since it doesn’t exist in the products table at all.
This is annoying if you’re migrating from SQL Server where the default collation is case-insensitive. You might copy-paste your code assuming case doesn’t matter, and suddenly your data is incomplete. Not good.
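Before fixing anything, it can help to measure how much of the mismatch is actually a casing problem. Here’s a minimal sketch using the demo DataFrames above: an anti join finds fact rows with no exact-case match, and a semi join then counts how many of those would match if casing were ignored.

from pyspark.sql.functions import lower

# Fact rows with no exact-case match in the dimension
exact_misses = fact_sales.join(
    dim_products,
    fact_sales.sales_product_key == dim_products.product_key,
    "left_anti"
)

# Of those, the rows that would match if we ignored casing
casing_only_misses = exact_misses.join(
    dim_products,
    lower(exact_misses.sales_product_key) == lower(dim_products.product_key),
    "left_semi"
)

print(exact_misses.count(), casing_only_misses.count())

With the demo data this should print 5 and 4: five sales rows find no match at all, and four of those are casing issues rather than genuinely missing products.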
The Simple Fix: Convert Everything to Lowercase
The easiest solution is to convert all your keys to lowercase when joining. Here’s how:
from pyspark.sql.functions import lower

# Join on lowercase versions of the keys
result_fixed = fact_sales.join(
    dim_products,
    lower(fact_sales.sales_product_key) == lower(dim_products.product_key),
    "left"
).select(
    "sales_product_key",
    "product_key",
    "product_name",
    "sales_quantity"
)

result_fixed.show()
Now all the keys match properly! The lowercase “b1” in sales matches the uppercase “B1” in products, and so on. The only record that doesn’t match is E1, which genuinely doesn’t exist in the products table, and that’s exactly what we’d expect.
This is a very easy fix for migrating code from case-insensitive systems like SQL Server to case-sensitive systems like PySpark in Fabric.
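And if the code you’re porting is still mostly T-SQL, the same trick works in Spark SQL inside the notebook. A minimal sketch, assuming the two DataFrames are registered as temp views first:

dim_products.createOrReplaceTempView("dim_products")
fact_sales.createOrReplaceTempView("fact_sales")

result_sql = spark.sql("""
    SELECT s.sales_product_key, p.product_key, p.product_name, s.sales_quantity
    FROM fact_sales AS s
    LEFT JOIN dim_products AS p
        ON lower(s.sales_product_key) = lower(p.product_key)
""")

result_sql.show()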
Performance Considerations for Large Tables
Now, before you go applying lower() to all your joins, there’s an important caveat: this can have a negative performance impact on very large tables.
If you’re working with tables that have a few thousand or even a few hundred thousand rows, using lower() in your joins is fine. It’s the most lightweight and easiest solution to implement right now.
But if you’re dealing with tables that have tens of millions or billions of rows, applying lower() during the join operation will slow things down significantly. The engine has to convert every single key value to lowercase before it can compare them, and that adds up fast on large datasets.
The Better Solution for Large Tables
For very large tables, my advice is to add a computed lowercase column when you ingest the data into your Fabric lakehouse. Do the conversion once during ingestion, store it as a separate column, and then use that pre-computed column for joins.
from pyspark.sql.functions import lower

# Add lowercase key columns during ingestion
dim_products_with_lower = dim_products.withColumn(
    "product_key_lower",
    lower("product_key")
)

fact_sales_with_lower = fact_sales.withColumn(
    "sales_product_key_lower",
    lower("sales_product_key")
)

# Save these versions to your lakehouse
dim_products_with_lower.write.format("delta").mode("overwrite").saveAsTable("dim_products")
fact_sales_with_lower.write.format("delta").mode("overwrite").saveAsTable("fact_sales")

# Now join on the pre-computed lowercase columns
result_optimized = fact_sales_with_lower.join(
    dim_products_with_lower,
    fact_sales_with_lower.sales_product_key_lower == dim_products_with_lower.product_key_lower,
    "left"
).select(
    "sales_product_key",
    "product_key",
    "product_name",
    "sales_quantity"
)

result_optimized.show()
This way, you’re only doing the lowercase conversion once when you write the data, not every single time you run a query. The join performance will be much better because it’s comparing already-lowercase values without any function calls.
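Downstream notebooks can then just read those tables back and join on the stored columns, with nothing recomputed at query time. A minimal sketch, assuming the table names from the write step above:

dim = spark.read.table("dim_products")
fact = spark.read.table("fact_sales")

result = fact.join(
    dim,
    fact.sales_product_key_lower == dim.product_key_lower,
    "left"
).select("sales_product_key", "product_key", "product_name", "sales_quantity")

result.show()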
When You’ll Run Into This Issue
This case sensitivity problem is most common when:
- Migrating from SQL Server or Azure SQL to Fabric
- Your source systems have inconsistent casing in key columns
- You’re copy-pasting SQL code into PySpark notebooks
- Different teams created tables with different naming conventions
- You can’t guarantee data integrity across source systems
The default collation in SQL Server is case-insensitive, so joins work regardless of casing. In PySpark, string comparisons are always case-sensitive, so “B1” and “b1” are simply different values. If you’re not aware of this difference, your migrated code will silently produce incomplete results, and that’s a nasty bug to track down.
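One cheap safety net while migrating is to check for unmatched rows right after the join, so incomplete results fail loudly instead of slipping through. A sketch against the fixed result from earlier:

from pyspark.sql.functions import col

# Rows where the dimension side came back null, i.e. the join found no match
unmatched = result_fixed.filter(col("product_key").isNull()).count()
if unmatched > 0:
    print(f"Warning: {unmatched} sales rows have no matching product key")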
Quick Summary
Here’s the quick takeaway:
- Small to medium tables: Use lower() directly in your join conditions
- Large tables: Add lowercase computed columns during data ingestion and join on those
- Always be aware: PySpark is case-sensitive, SQL Server (usually) isn’t
This is a simple trick, but it can save you hours of debugging when you’re migrating code from SQL Server to Fabric. Trust me, I learned this the hard way 🙂
Have you run into case sensitivity issues when migrating to Fabric? How did you handle it? Let me know in the comments below!