There’s a common debate in the Fabric community: should you use low-code tools like Dataflows Gen2, or should you write pro-code solutions with PySpark notebooks? Today I’m going to settle this debate with actual performance and cost data.
Spoiler alert: low-code is easier, but you’re paying for that convenience with significantly higher compute costs. Let me show you exactly how much.
The Test Scenario
For this test, I used the Open Brewery DB API—a free REST API that contains over 8,000 breweries from around the world. Shout out to the team at Open Brewery DB for providing this awesome resource for testing and demos!
The scenario is simple: extract all brewery data from the API and load it into a lakehouse table. I’ll do this twice—once using Dataflows Gen2 (low-code) and once using a PySpark notebook (pro-code)—and compare the performance and cost implications.
To make the test more interesting, I’m forcing both solutions to loop through the data in small batches of 20 records per page. Even though the API supports up to 200 records per page, I want to force the engine to make lots of iterations so we can really see the performance differences.
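To put that page size in perspective, here is a quick back-of-the-envelope sketch of how many requests each setting implies. The total of 8,400 records is an assumption based on the approximate figure quoted later in this post, not an exact count from the API, and `pages_needed` is just a helper name made up for this illustration.

```python
import math

# Assumed total, based on the approximate "8,400+" figure in this post
TOTAL_RECORDS = 8_400


def pages_needed(total_records: int, per_page: int) -> int:
    """Number of requests needed to page through every record."""
    return math.ceil(total_records / per_page)


small_pages = pages_needed(TOTAL_RECORDS, 20)   # page size used in this test
large_pages = pages_needed(TOTAL_RECORDS, 200)  # largest page size the API supports

print(f"20 per page:  {small_pages} requests")
print(f"200 per page: {large_pages} requests")
```

Under that assumption, the small page size means roughly ten times as many round trips, which is exactly the kind of per-iteration overhead this test is meant to expose.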
The Low-Code Approach: Dataflows Gen2
Full disclosure: I’m not a Power Query M expert. I had to get ChatGPT to help me write this dataflow because M isn’t exactly my strong suit. Shout out to my friend who actually knows this stuff (I definitely don’t) 🙂
But that’s kind of the point, right? With low-code tools, you can use AI assistants or visual interfaces to get things done without being a coding expert.
Setting Up the Dataflow
The M query I created does the following:
- Connects to the Open Brewery DB API
- Loops through pages with 20 records each
- Increments the page number with each iteration
- Continues until it gets an empty result
- Expands the results into columns
For the data destination, I configured it to write to a lakehouse table. The credentials were set to anonymous since this is a public API—no authentication needed.
I’ll be honest, I struggled a bit with the Dataflow interface. I’m not a big Dataflows expert, so figuring out how to properly configure the lakehouse destination took some trial and error. But eventually I got it working.
Results: 4.5 Minutes
After running the dataflow, I checked the “Recent runs” section and found that it completed in just under four and a half minutes.
When I checked the lakehouse, the data was there. I forgot to rename the query from the default “Query” name, but the table contained all the brewery records. The preview showed the first 1,000 rows, and everything looked good. Success!
The Pro-Code Approach: PySpark Notebook
Now for the PySpark version. I speak a lot more PySpark than I speak M, so this was significantly easier for me to write. Here’s the basic approach:
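    import time

    import requests
    from pyspark.sql import SparkSession

    # Fabric notebooks already provide `spark`; getOrCreate() simply returns that session
    spark = SparkSession.builder.getOrCreate()

    base_url = "https://api.openbrewerydb.org/v1/breweries"
    page = 1
    per_page = 20  # deliberately small to force many iterations
    all_records = []

    while True:
        url = f"{base_url}?page={page}&per_page={per_page}"
        response = requests.get(url, timeout=30)
        try:
            data = response.json()
        except ValueError:
            # Body was not valid JSON (for example an error page): stop paging
            break
        if not data:
            break  # empty page: no more records
        all_records.extend(data)
        if len(data) < per_page:
            break  # short page: this was the last one
        page += 1
        time.sleep(1)  # Rate limiting

    print(f"Fetched {len(all_records)} records")

    # Create Spark dataframe and save to Delta table
    df = spark.createDataFrame(all_records)
    df.write.format("delta").mode("overwrite").saveAsTable("breweries_notebook")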
The logic is straightforward: start at page 1, loop through pages of 20 records each, and keep adding records to a list. The loop stops when a page comes back empty, when a page holds fewer than 20 records, or when the response isn't valid JSON. At the end, convert the list to a Spark DataFrame and save it as a Delta table.
Demo Struggles (The Real Learning Happens Here)
Now, this is where things got interesting. When I first ran the notebook, I hit a "too many requests" error. The API has rate limiting, and I was hitting it too fast.
So I added a 1-second sleep between requests using time.sleep(1). That should help, right?
Nope. Then I ran into a Cloudflare issue. Apparently, Cloudflare was blocking requests coming from Microsoft Fabric because it looked like bot traffic rather than legitimate browser requests. This is what you get when you don't prepare properly for a demo, right? But maybe that's the fun of this channel—we try things and see what happens, warts and all.
The good news is that before it got blocked, the notebook had already fetched over 2,400 records in just a couple of seconds. So even though we couldn't complete the full run, we got enough data to make a comparison.
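For the "too many requests" part of the problem, a fixed one-second sleep is a blunt tool. A common alternative is to retry only when the server answers with HTTP 429 and to wait longer each time. The sketch below is a minimal illustration of that idea, not what I ran in the demo: `get_with_backoff` and its parameters are hypothetical names, and whether this API sends a `Retry-After` header is an assumption. It would not help with the Cloudflare block, which is a different problem.

```python
import time
from typing import Any, Callable


def get_with_backoff(
    send: Callable[[], Any],
    max_retries: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Any:
    """Call send() until it returns a non-429 response or retries run out.

    `send` is any zero-argument callable returning an object with
    `.status_code` and `.headers` (for example a requests.Response).
    """
    response = send()
    for attempt in range(max_retries):
        if response.status_code != 429:
            break
        # Prefer the server's Retry-After hint (whole seconds) when present;
        # otherwise double the wait on each attempt: 1s, 2s, 4s, ...
        retry_after = str(response.headers.get("Retry-After", ""))
        if retry_after.isdigit():
            delay = float(retry_after)
        else:
            delay = base_delay * 2 ** attempt
        sleep(delay)
        response = send()
    return response
```

Inside the paging loop, the plain request could then become something like `response = get_with_backoff(lambda: requests.get(url, timeout=30))`. The `sleep` parameter exists only so the waiting can be swapped out when testing.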
Performance Comparison
Performance-wise, it definitely felt like the dataflow was slower. The notebook grabbed 2,400 records in seconds before hitting the rate limit. If we extrapolate that by 3 or 4 times to get all 8,400+ records, the notebook should have been significantly faster than the dataflow's 4.5 minutes.
But I can't prove it 100% since the notebook didn't complete. This is just a feeling I have, which would make sense—pro-code solutions should generally be faster than low-code alternatives. But let's look at the real smoking gun: the cost metrics.
The Cost Comparison: Here's Where It Gets Interesting
Let's head over to the Fabric Capacity Metrics app to see what actually happened with our compute consumption. Before we can analyze anything, we need to refresh the semantic model since it's not real-time.
After refreshing, here's what the metrics show:
- Dataflow Gen2: 3,986 CU seconds (capacity unit seconds)
- PySpark Notebook: 119 CU seconds
Let me repeat that because it's important. The dataflow consumed almost 4,000 CU seconds. The notebook consumed 119.
Now, the notebook was killed early due to rate limiting, but I also ran it two or three times while testing with different delay settings. So we're probably pretty close to what the full runtime would have been if it had successfully extracted all 8,400 records.
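On the raw numbers, that is about 33 times the notebook's consumption (3,986 / 119), but the notebook never finished a full extraction, so the raw ratio overstates the gap. A conservative reading is that the dataflow consumed at least 2-3 times more capacity than the notebook approach. That's a massive difference.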
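Because the notebook never completed a clean full run, any ratio here is rough, so it's worth doing the arithmetic explicitly. The sketch below runs it two ways on the figures reported above: the raw ratio, and a projected ratio that assumes the notebook's CU seconds grow linearly with the number of records fetched. That linear scaling is my assumption for illustration only; it ignores fixed overhead such as Spark session startup and the fact that the 119 figure may cover more than one partial run.

```python
# Figures reported by the Fabric Capacity Metrics app in this test
DATAFLOW_CU_SECONDS = 3_986
NOTEBOOK_CU_SECONDS = 119

# Records fetched before the block vs. the approximate full dataset
RECORDS_FETCHED = 2_400
RECORDS_TOTAL = 8_400

# Ratio taken directly from the reported numbers
raw_ratio = DATAFLOW_CU_SECONDS / NOTEBOOK_CU_SECONDS

# Assumption: notebook CU seconds scale linearly with records fetched
scale = RECORDS_TOTAL / RECORDS_FETCHED
projected_notebook = NOTEBOOK_CU_SECONDS * scale
projected_ratio = DATAFLOW_CU_SECONDS / projected_notebook

print(f"Raw ratio:               {raw_ratio:.1f}x")
print(f"Projected notebook cost: {projected_notebook:.1f} CU seconds")
print(f"Projected ratio:         {projected_ratio:.1f}x")
```

Treat both results as bounds on a messy measurement rather than a benchmark. A fair comparison would need a notebook run that actually finishes.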
What This Means for Your Organization
Here's the bottom line: if you go with low-code solutions like Dataflows Gen2, yes, they're easier to build. You don't need a developer or data engineer to create them. You can use visual interfaces and AI assistance to get things done.
But you're paying for that privilege with higher compute costs. Not just a little higher—we're talking 2-3x more expensive in this test.
And it's not just about the money. The dataflow also appeared to run longer (I couldn't confirm by how much, since the notebook never completed), which means it occupies your capacity for longer periods. If you're working with a decently sized Fabric capacity that's getting close to its limits, you could potentially cut your infrastructure costs in half by moving low-code dataflow workloads over to pro-code PySpark workloads.
Think about that for a second. If you have 10 dataflows running daily, converting them to notebooks could save you enough capacity to avoid upgrading from, say, an F8 to an F16. That's real money saved every month.
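To sanity-check a claim like that against your own workloads, you can compare daily consumption with what a capacity provides. An F SKU's number is its capacity unit count, so an F8 supplies 8 CU seconds every second. The sketch below uses the per-run figures from this small test purely as placeholders, and the helper names are made up for this example. It also ignores how Fabric smooths and bursts usage over time, so treat it as a rough first pass.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def daily_budget_cu_seconds(capacity_units: int) -> int:
    """CU seconds a capacity provides over 24 hours (F8 -> 8 capacity units)."""
    return capacity_units * SECONDS_PER_DAY


def share_of_budget(cu_seconds_per_run: float, runs_per_day: int, capacity_units: int) -> float:
    """Fraction of the daily budget consumed by a recurring workload."""
    used = cu_seconds_per_run * runs_per_day
    return used / daily_budget_cu_seconds(capacity_units)


f8_budget = daily_budget_cu_seconds(8)

# Placeholder inputs: per-run figures from this test, 10 runs per day
dataflow_share = share_of_budget(3_986, 10, 8)
notebook_share = share_of_budget(119, 10, 8)

print(f"F8 daily budget: {f8_budget:,} CU seconds")
print(f"10 dataflow runs: {dataflow_share:.1%} of an F8")
print(f"10 notebook runs: {notebook_share:.1%} of an F8")
```

The runs in this test are tiny, so the interesting part is the method. Swap in the CU seconds of your real dataflows from the Capacity Metrics app to see how much headroom a conversion would actually free up.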
If you want to learn more about managing your Fabric costs and understanding CU consumption, check out my complete guide on Microsoft Fabric costs.
So Which Should You Use?
Look, I'm not saying you should never use Dataflows Gen2. They have their place, especially for:
- Quick prototypes and one-off data extractions
- Teams without developers or data engineers
- Simple transformations that don't run frequently
- Cases where development time is more expensive than compute costs
But if you're building production data pipelines that run daily or hourly, and you have the technical skills (or can learn them), PySpark notebooks are the way to go. The cost savings alone justify the investment in learning pro-code approaches.
And honestly? PySpark isn't that hard to learn. If you can write basic Python, you can write PySpark. There are tons of resources out there, including right here on my channel where I publish videos every week about Microsoft Fabric and data engineering.
Final Thoughts
This test showed me exactly what I suspected: low-code tools are convenient, but that convenience comes at a significant cost premium. In this case, we're talking about 2-3x higher compute consumption for essentially the same outcome.
For organizations serious about optimizing their Fabric spend, investing in pro-code skills isn't optional—it's essential. The ROI on that investment is clear when you're looking at potentially halving your infrastructure costs.
Have you done similar comparisons between low-code and pro-code in your environment? What were your findings? Let me know in the comments below!