Most APIs these days have some kind of pagination built in. This ensures that queries against the underlying database do not return too much data at once, which would hurt database performance and push overly large messages across the network. Often, these APIs tell you in their responses how many pages or records to expect, or they include a link to the next page (until there is none). But that is not always the case. This blog post is about extracting data from those types of paginated APIs.
The Problem with ‘Blind’ Pagination
Working with APIs that handle data in pages is something we all deal with in ETL processes. Most of the time, these APIs make our lives easier by giving us metadata that tells us exactly what to expect – you’ll see fields like “hasMore”: false or “nextPage”: null that let us know when we’ve reached the end of our dataset. Some APIs even go the extra mile and provide a total count of pages or records, making it simple to calculate how many iterations we need.
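For contrast, when that metadata is present the loop practically writes itself. Here is a minimal sketch in Python, assuming a hypothetical API that returns a “rows” array and a “nextPage” link (the field names will differ per API):

import requests

def fetch_all_pages(first_page_url, headers):
    # Hypothetical metadata-driven pagination: the API tells us where the next page is.
    rows = []
    url = first_page_url
    while url is not None:
        payload = requests.get(url, headers=headers).json()
        rows.extend(payload["rows"])
        # "nextPage" is null (None in Python) on the last page, which stops the loop.
        url = payload.get("nextPage")
    return rows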
But what happens when we don’t have any of these helpful indicators? I ran into this situation recently when working with a legacy API that provided zero metadata about its pagination. This is what I like to call “blind pagination” – you’re essentially walking through the data in the dark, having to figure out manually when you’ve hit the end.
Let me paint you a picture of what we’re dealing with here. Imagine you’re going through a filing cabinet where none of the folders are numbered. As long as you keep finding documents in each folder, you know there might be more to come. But it’s only when you hit an empty folder that you can be sure you’ve reached the end. That’s exactly what we’re doing with these APIs – we keep making requests until we either get an empty array back or receive fewer records than our page size.
This creates some interesting technical challenges we need to solve. The trickiest one shows up when the total number of records happens to be exactly divisible by our page size. Think about it – if every page is supposed to return 100 records, and we have exactly 300 records, how do we know that third full page is actually our last one? Then there’s the whole issue of API rate limits and stability. We can’t just blindly keep requesting pages – that’s a quick way to either hit rate limits or burn through our API quota.
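To make that “exactly divisible” case concrete: with exactly 300 records and a page size of 100, pages one to three all come back full, so only a fourth, empty response tells us we are done. Below is a minimal sketch of that stopping logic, using a hypothetical fetch_page(skip, take) helper:

def read_everything(fetch_page, page_size=100):
    # fetch_page(skip, take) is a hypothetical helper returning a list of rows.
    all_rows = []
    skip = 0
    while True:
        rows = fetch_page(skip, page_size)
        all_rows.extend(rows)
        # With exactly 300 records and page_size=100, the third page is still full,
        # so only a fourth call, returning an empty list, tells us we are done.
        if len(rows) < page_size:
            break
        skip += page_size
    return all_rows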
But don’t worry – we can handle all of this with some smart coding in Microsoft Fabric Notebooks. I’ll show you how to build a robust solution that deals with these edge cases while keeping our data pipeline efficient and reliable. Let’s dig into the code 🙂
The Solution in Python
The solution is actually surprisingly easy.
Setting Up Pagination Handling
To set up pagination in Python, in this particular example I query the API endpoint once. This specific API implementation (AFAS) uses skip and take parameters (often called offset and limit in other APIs) to page through results.
The first call therefore contains the parameters “?skip=0&take=1000”. Instead of 1000, you can of course choose a page size that is appropriate for your specific API.
Then, I capture the response in a JSON file on my lakehouse in Fabric. I call it ‘0.json’.
Detecting the End of the API Output
By creating an object containing the response data, I can figure out the row count of the result set. If the row count of the result set is equal to the take (or limit) parameter, I assume there is more data.
By implementing a ‘while loop’ in my notebook, I can keep iterating over an ever-increasing skip parameter until the number of rows returned is no longer equal to the take parameter. That is when I’ve reached the end of the result set.
Working Example (on the AFAS API)
Here is part of the notebook I used to extract data out of the paginated AFAS API. As you can see, the calls are set up dynamically so that many different endpoints (or what AFAS calls GetConnectors) can be extracted with the same piece of code.
import requests

# skip/take control the pagination: we fetch 1000 records per request.
skip = 0
take = 1000

# base_url, get_connector, table_key_url_encoding, encoded_token, integration_id,
# specific_landing_path and write_json_to_lakehouse are defined elsewhere in the notebook.
api_url = f"{base_url}/{get_connector}?skip={skip}&take={take}&orderbyfieldids={table_key_url_encoding}"
headers = {
    'Authorization': f'AfasToken {encoded_token}',
    'IntegrationId': integration_id
}

# First page: skip=0, take=1000.
response = requests.get(api_url, headers=headers)
data = response.json()
write_json_to_lakehouse(
    landing_path=specific_landing_path,
    output_filename=str(skip // take),   # 0.json for the first page
    data=data
)

# As long as a full page comes back, there might be more data.
while len(data['rows']) == take:
    skip += take
    api_url = f"{base_url}/{get_connector}?skip={skip}&take={take}&orderbyfieldids={table_key_url_encoding}"
    response = requests.get(api_url, headers=headers)
    data = response.json()
    write_json_to_lakehouse(
        landing_path=specific_landing_path,
        output_filename=str(skip // take),   # 1.json, 2.json, ...
        data=data
    )
The variables in the first two lines are used to control the size of the output. These can be further parametrised of course, but for now let’s assume we want to fetch 1000 records per iteration.
Then, we build the API URL and the headers we need to send to the API (containing the authorization token and other information).
We then use the requests library to fetch results from the API. The first request uses skip=0 and take=1000, which gives us the first page of data.
That page can then be stored in the lakehouse. In this case I wrote a small helper method for this so I can reuse the logic.
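The helper itself is not shown in the snippet above, but as a rough idea, something like the sketch below would do the job, assuming the default lakehouse of the Fabric notebook is mounted under /lakehouse/default/Files (the real method in the notebook may look slightly different):

import json
import os

def write_json_to_lakehouse(landing_path, output_filename, data):
    # Write the API response as {pagenumber}.json to the Files area of the lakehouse.
    target_dir = f"/lakehouse/default/Files/{landing_path}"
    os.makedirs(target_dir, exist_ok=True)
    with open(f"{target_dir}/{output_filename}.json", "w") as f:
        json.dump(data, f)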
For naming the JSON files in the lakehouse I use the pattern {pagenumber}.json. By performing skip // take I get the floor of the division of the two variables. That means I get 0 for the first iteration, 1 for the second, and so on. Perfect!
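In other words, the file name is simply the page number:

take = 1000
for skip in (0, 1000, 2000):
    print(f"{skip // take}.json")   # prints 0.json, 1.json, 2.json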
Then, I jump into the while loop, because as long as the result set is as large as the take parameter, there might be more data!
We increase the skip by the value of the take and contact the API again, now with skip=1000 and take=1000 (the next iteration will use skip=2000, and so on). The resulting JSON response is stored in the lakehouse, and as long as the number of rows in the response (data['rows']) equals the take, we perform another iteration.
Conclusion
As we’ve seen, extracting data from paginated APIs without metadata is not as daunting as it might seem at first. By using a simple while loop and checking the size of our result set against our page size parameter, we can reliably determine when we’ve reached the end of our data. The trick is in the details – storing the responses as separate JSON files on our lakehouse gives us the flexibility to process the data further down the line, and using skip/take parameters (or offset/limit in other APIs) gives us precise control over our data retrieval.
I’ve shown you how to implement this for the AFAS API, but the same principle applies to any paginated API that uses similar pagination parameters. Just remember to adjust your page size based on what makes sense for your specific API and use case, and always consider potential rate limiting when determining how aggressive you want to be with your pagination.
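As a rough idea of what “not too aggressive” can look like, the sketch below adds a small pause between calls and backs off when the API answers with HTTP 429 (Too Many Requests). The Retry-After header is a common convention, but check the documentation of your specific API:

import time
import requests

def get_page_politely(api_url, headers, pause_seconds=0.5, max_retries=5):
    # Pause briefly after every call and retry with backoff when throttled.
    for attempt in range(max_retries):
        response = requests.get(api_url, headers=headers)
        if response.status_code != 429:
            time.sleep(pause_seconds)
            return response.json()
        # Honour the Retry-After header if the API provides one.
        wait_seconds = int(response.headers.get("Retry-After", 2 ** attempt))
        time.sleep(wait_seconds)
    raise RuntimeError(f"Still throttled after {max_retries} attempts: {api_url}")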
The next time you encounter an API without those helpful pagination indicators, you’ll know exactly how to handle it efficiently in your Microsoft Fabric notebooks.