Introduction

In most software systems, there’s often a need to process, transfer, or display large datasets. Because of memory, network bandwidth, or screen real estate constraints, it’s usually impossible to do this all at once. A very common design technique for dealing with this constraint is pagination: portions of the dataset are loaded in batches while maintaining a sort of pointer to where we are in the dataset. Every time a new batch is loaded, the pointer is advanced, until we reach the end of the dataset, at which point we come to a graceful stop. While the pagination technique may seem simple, there are a lot of implementation pitfalls that engineers should watch out for. That’s why I’m writing this post: to share some of the pitfalls I’ve seen, or even fallen victim to, throughout my practice.

What Makes A Pagination Implementation Correct?

It’d be useful to briefly describe what correct pagination looks like: first from an engineering perspective, and then in terms of what an external user of the system should observe.

From an engineering perspective, the task is to construct a query that, given an input dataset, outputs a new dataset that’s predictably ordered by some criterion (explicit or implicit) and that remains unchanged for as long as the query and the underlying data remain unchanged. The only thing that changes between requests is which section of this ordered dataset we want to return. What makes correct pagination even more challenging to implement is a dynamic dataset that may be updated in real time while the pagination is ongoing.
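
To make that concrete, here’s a minimal sketch of the idea in Python (the orders table and its columns are hypothetical): the ordering is fixed by the query, and the only thing that varies between requests is the window into that ordering.

```python
import sqlite3

# Hypothetical table/columns; the point is that the ORDER BY pins down a
# deterministic sequence, and only the window (LIMIT/OFFSET) changes per page.
PAGE_QUERY = """
    SELECT id, created_at, amount
    FROM orders
    ORDER BY created_at, id      -- explicit, stable ordering criterion
    LIMIT ? OFFSET ?             -- the only part that varies between requests
"""

def fetch_page(conn: sqlite3.Connection, page_size: int, page_number: int):
    offset = page_size * page_number
    return conn.execute(PAGE_QUERY, (page_size, offset)).fetchall()
```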

From the perspective of an external observer using the system, an ideal pagination should meet the following criteria:

  1. Every entry in the source data should be seen, and seen exactly once.
  2. The total number of entries returned per page should be at most the batch size.
  3. It should be able to reach the end of the dataset gracefully.
  4. It should be able to meet all 3 invariants above for dynamic datasets too.
  5. It should be able to handle changes in page size without breaking the first 3 invariants.
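
As a rough illustration, invariants 1 to 3 can be expressed as a small check over a finished pagination run (a static dataset and unique entry ids are assumed here):

```python
def check_pagination(source_ids, pages, page_size):
    """Sanity-check a finished pagination run against invariants 1-3.

    source_ids: ids of every entry in the (static) source dataset
    pages:      the pages that were returned, each a list of ids
    """
    seen = [entry for page in pages for entry in page]

    # Invariant 2: no page is larger than the batch size.
    assert all(len(page) <= page_size for page in pages), "page exceeds batch size"

    # Invariant 1: every entry is seen, and seen exactly once.
    assert sorted(seen) == sorted(source_ids), "entries skipped or duplicated"

    # Invariant 3 (graceful stop) is implied by the run having finished at all:
    # `pages` is a finite list, so the loop that produced it did terminate.
```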

Yet many pagination implementations fail at least one of these invariants. Below is a non-exhaustive list of common pitfalls that lead to incorrect pagination. I won’t get too detailed, but for each one I’ll give a short reason why it breaks the invariants.

Pitfall 1: Paginating without sorting the data

While this may seem unlikely or obviously wrong, I’ve seen buggy pagination implementations that fail to sort the source data. Without an explicit sort, most data stores make no guarantee about the order rows come back in, so consecutive pages may be cut from differently ordered results, leading to skips and duplicates.
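
A minimal sketch of the difference, using a hypothetical users table:

```python
# Both queries page through the same hypothetical "users" table. The first
# gives the database no ordering to honour, so consecutive pages may be cut
# from differently ordered result sets; the second pins the order down.
BROKEN = "SELECT id, name FROM users LIMIT ? OFFSET ?"
FIXED = "SELECT id, name FROM users ORDER BY id LIMIT ? OFFSET ?"
```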

Pitfall 2: Sorting by a mutable field

This is bad because it can lead to seeing the same item more than once, or to skipping an item entirely, if the sort field is updated while pagination is ongoing.
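
Here’s a toy simulation of the problem, with made-up rows and a cursor keyed on a mutable score field: row B is updated between page fetches, jumps across the cursor, and is returned twice (had it moved the other way, it would have been skipped instead).

```python
rows = [{"id": "A", "score": 1}, {"id": "B", "score": 2}, {"id": "C", "score": 3}]

def page_after(cursor, size):
    # Cursor pagination: order by score, return the rows past the cursor value.
    ordered = sorted(rows, key=lambda r: r["score"])
    return [r for r in ordered if r["score"] > cursor][:size]

page1 = page_after(cursor=0, size=2)   # -> A (score 1), B (score 2)
rows[1]["score"] = 5                   # B's sort field changes mid-pagination
page2 = page_after(cursor=2, size=2)   # -> C (score 3), B again (score 5)
```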

Pitfall 3: Sorting by a non-unique field with a smaller variance than the page size

This is when the sort field has more duplicates than the page size. An example is a paginated dataset sorted by first name where 15 entries share the same first name but the page size is, say, 10. With the offset pagination pattern, this can lead to skipping or duplication; with cursor-based pagination, it will almost invariably lead to a deadlock where the cursor never advances past the duplicates and you never reach the end of the data stream.
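
Here’s a toy version of that scenario: 15 rows share the same first name, the page size is 10, and a non-strict cursor comparison means the cursor can never move past the duplicates, so every page is the same 10 rows (a strict comparison would instead silently skip the remaining 5).

```python
rows = sorted(
    [{"id": i, "first_name": "Ann"} for i in range(15)]        # 15 duplicates
    + [{"id": i, "first_name": "Zoe"} for i in range(15, 20)],
    key=lambda r: r["first_name"],
)

def page_from(cursor, size=10):
    # Cursor pagination keyed only on the non-unique first_name field.
    return [r for r in rows if r["first_name"] >= cursor][:size]

page1 = page_from(cursor="")                          # 10 "Ann" rows
page2 = page_from(cursor=page1[-1]["first_name"])     # the same 10 "Ann" rows, forever
```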

Pitfall 4: Sorting by a non-unique field

This is only a problem for cursor-based pagination techniques, where it will invariably lead to returning the same item multiple times: entries that share the cursor’s sort value straddle a page boundary and get picked up again on the next page.
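
A toy illustration with made-up rows: the duplicated value straddles a page boundary, and the cursor comparison picks the boundary row up again on the next page (a common remedy is to add a unique tie-breaker such as the id to both the sort key and the cursor).

```python
rows = [{"id": 1, "name": "Ann"}, {"id": 2, "name": "Bob"},
        {"id": 3, "name": "Bob"}, {"id": 4, "name": "Cal"}]

def page_from(cursor, size=2):
    # Cursor pagination keyed only on the non-unique name field.
    return [r for r in rows if r["name"] >= cursor][:size]

page1 = page_from(cursor="")                 # ids 1, 2
page2 = page_from(cursor=page1[-1]["name"])  # ids 2, 3 -> id 2 is returned twice
```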

Pitfall 5: Unstable order of sort fields when multiple fields are used for sorting

This is bad because an unpredictable ordering of the sort fields produces a differently ordered dataset on each pagination iteration. While this may sound unlikely, it can happen in implementations where the query fetching the data is dynamically generated.
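
A sketch of how this can sneak in when the query is generated dynamically (hypothetical users table): the sort columns are held in a Python set, whose iteration order is not defined, so different processes may emit ORDER BY last_name, created_at on one run and ORDER BY created_at, last_name on another.

```python
sort_fields = {"last_name", "created_at"}   # unordered collection of sort columns

# Iteration order of a set is not guaranteed, so the generated ORDER BY clause
# (and therefore the ordering being paginated over) can differ between processes.
UNSTABLE = f"SELECT * FROM users ORDER BY {', '.join(sort_fields)} LIMIT ? OFFSET ?"

# Fixing the field order (here, simply sorting the names) makes the query stable.
STABLE = f"SELECT * FROM users ORDER BY {', '.join(sorted(sort_fields))} LIMIT ? OFFSET ?"
```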

Pitfall 6: Using absolute indexing on a mutable dataset

This is only a problem for limit..offset pagination techniques. Any time an entry is added to or removed from the dataset while pagination is ongoing, the absolute positions of the remaining rows shift, so data will either be skipped or returned multiple times.
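
A toy run of offset pagination over a dataset that changes between page fetches: an entry inserted ahead of the current offset shifts every position by one, so the last row of page 1 reappears at the start of page 2 (a deletion would instead cause a row to be skipped).

```python
rows = ["a", "b", "c", "d"]            # already sorted

def page(number, size=2):
    # Offset pagination: absolute slice positions into the current dataset.
    return rows[number * size:(number + 1) * size]

page1 = page(0)                        # ["a", "b"]
rows.insert(0, "_new")                 # a new entry lands before the current offset
page2 = page(1)                        # ["b", "c"] -> "b" is returned twice
```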

All of these pitfalls share the same failure modes: skipping some entries, returning some entries more than once, or even, in extreme situations, an actual deadlock where the system can never reach the end of the dataset. Depending on your use case, duplication or skipping may be a tolerable tradeoff, but if you’re, say, paginating over a list of transactions feeding an inventory system, you’ll want to avoid both skipped and duplicated entries. Deadlocks, on the other hand, are non-negotiable, and all pagination algorithms should be designed to avoid them.