The Evils of Sequential IDs
3 reasons not to use Sequential IDs as Primary Keys
- Sequential IDs have several re-occurring risks that are known
- We will discuss 3 risks associated with Sequential IDs as Primary Keys
- Some valid cases for maintaining Sequential numbers are discussed
2001 was a very interesting time for me.
I had just made a major career change, realising Nursing was not really my thing, I had changed fields and graduated with a CompSci degree formalising all the things I had tried to learn myself. The Y2K panic had happened, and the DOTCOM bubble had just burst, flooding the field with people who had a lot more experience than I did and making finding a job all the more challenging. 9/11 brought security into crystal focus for professionals everywhere, information security being part of that. In this context, I found my start with a small consulting company that helped small businesses transition from in-house software solutions to robust managed solutions, and further into online solutions available from anywhere.
During one of these conversions, the owner of my company brought up an article he had recently read, and a debate ensued. According to the article, people should stop using sequential IDs for their record identifiers. This seemed absurd to me, Sequence IDs are easy to create, and they are easier to read; it makes no difference what you use internally. He was being absurd.
Unfortunately, he lost the debate.
- I have long since learned to be more open in my discussions
- He was right
It was many years later, and I was working on a refactoring of a website. I knew there were several vulnerabilities, but time is a precious resource, and I had prioritised them as best I could. That’s when we got the report, a customer had noticed that when you log in, you get logged into Company 5, he assumed that if there is a 5, there must also be a 6… and he gained access to data he had no business seeing.
If I had been more open to my boss’ argument, I might have seen how risky sequential IDs are and prioritised that particular issue more highly.
This issue was not unique to the system I was working on. Almost a decade later, I was in a team-building exercise: an intro C# class. It was all fairly basic stuff, but there was one thing that stuck out … right in the Microsoft Branded textbook, the instructor pointed out a recommendation to use UUIDs for primary keys in your database.
The reason was simple, at some point a PK is going to get exposed and it is going to act as the starting point for people to derive further information.
The very solution my boss had proposed, to a problem I had dismissed. This problem is so widespread, that it was necessary to be discussed in introductory programming classes, and the issues with it obvious to anyone that has grown up in this modern data security and privacy world!
So … why do we still need to be reminded? Why are we so obsessed with putting numbers in order?
False sense of utility
One argument that is often raised in favour of sequential IDs is the “familiarity” of numbers that are in order. This is expressed as
- We have a sense of how many records there are
- It gives me a way to orient myself
- Numbers are easier to read
What people are expressing is an emotional state: we are pattern-seeking monkeys and seeing ordinal growth gives us a sense of comfort. As hard as it is, the data is not there to satisfy your psychological comfort; rather we are experts who deal with the uncomfortable realities of data.
This puts us in a tough place. Humans are really good at post-hoc rationalisations. Any time we experience discomfort, we will reach for the comfortable, and justify it to ourselves in any way we can. Post-Hoc rationalisation is a problem we must each individually struggle with. It's been a while since I’ve read The Righteous Mind, but Haidt always nails this point.
We cannot allow ourselves to use what is comfortable, but rather to become comfortable with the most appropriate techniques. To do that we need to take time to consider the benefits.
If we are using a sequence of numbers as a primary key, we are using arbitrary keys. These are not for human consumption, and therefore they should not make your reading easier, they should make the computer’s reading easier, and hopefully our maintenance.
A key touch-point in Software development, expressed mostly from the perspective of Functional Programming, is to avoid side effects. Each command, each function, should do one thing, and should not be used to affect the system beyond what was specified.
When searching for the above link, Google gave me a sidebar that stated
Side effects are any changes in the state of the program or the environment that are not reflected in the function’s output. Side effects can make the program unpredictable, hard to test, and difficult to debug.
I’ll add to that, that we don’t want to become dependent on side effects because, by definition, there is more than one effect coming from a single action. We now cannot change the action to accommodate one effect, because that would impact the other effect.
Using the Primary Key as a sequence? Now you can’t change its data type to something else if you need to because that would make it non-sequential. Need to change the order? Not possible, it has foreign key references.
Our job, as experts, is not to define data in ways that make us comfortable; rather it is our job to define data as it is.
define whatever it is we perceive — to trace its outline — so we can see what it really is: its substance. Stripped bare. As a whole. Unmodified. And to call it by its name — the thing itself and its components … Nothing is so conducive to spiritual growth as this capacity for logical and accurate analysis of everything that happens to us.
Risks
Security
The two main examples from my personal experience draw on the security risk associated with sequential numbers. Numbers in sequence give us a pattern which we can use to predict the next, or previous, value.
Whether we meant to or not, by using sequential IDs, we have introduced information into our dataset.
In exposed applications that control information, this has been known to hint to people that there is more information available.
In Nova Scotia, in 2018, an individual using a government data transfer system noticed that when retrieving FOIP information releases online, the numbers were relatively small. He was able to surmise that the numbers were probably sequential and simply tried to type the next number into his browser. He was surprised when he got a FOIP record that did not belong to him. Had non-sequential values been used, his guess would likely have resulted in a miss. The large spaces with non-valid values would have been missed.
Clicking on the item’s link brings the user to the item’s page and puts the URL for that item, including its unique identification number, in the browser’s address bar. Changing just the number in the address bar takes the user to a different item directly
— OIPC Investigation Report
While this type of risk is well known it remains common, but at least it is obvious. Unfortunately, there are far more insidious ways of data leaks occurring.
Sequential IDs are famously capable of being exploited statistically to gain estimates of total counts. This is known as The German Tank problem.
During WWII, it was necessary to identify how many tanks Germany was producing. Using the serial numbers off the wheels of destroyed tanks, analysts were able to estimate the total number of wheels (and therefore tanks) that were in operation.
wheels, which were observed to be sequentially numbered … Analysis of wheels from two tanks … yielded an estimate of 270 tanks produced in February 1944 … German records after the war showed production for the month of February 1944 was 276
— German Tank Problem, Wikipedia
Randomly selecting a sample and then looking at their sequence numbers gives us an idea of how dense the population is. Using one of several functions, we can estimate population (and also sub-population) size from a sample.
When we share large datasets for analysis, we often exclude records deemed to be outside the Data Scientist’s domain of interest (often for privacy). Unfortunately, if we are sharing underlying sequential values we are exposing more information about the larger set than we intended. The customer can derive information about the unshared portion of the data.
Distributed Systems
In the early 90s, the Open Source Foundation (OSF) released the Distributed Compute System, bringing large-scale compute systems to the fore. Suddenly, individual analysts were not tied to their local desktop and could distribute their processing across an unlimited number of computers and processors. While not as readily available as modern Cloud services, intra-organizational processing just got a lot more powerful.
One problem that arose from using distributed compute systems is that it takes time for the processing units to talk to each other. To optimize a distributed solution, the processing units must be able to act independently of each other as much as possible.
Unfortunately, sequential keys require coordination. At the very least, every record must be touched by a single computer to coordinate the generation of the next
value.
This acts as a bottleneck in processing.
For the processing units to assign a coordinated value, they must communicate with each other to serialize the values. This means that either the sequential value must be added to the records before distributing the data (requiring the cost of producing a new dataset), or the values must be assigned in blocks (effectively randomly across partitions)
While the OSF came up with a very interesting solution to this problem, random numbers and cryptographic hashes have become more popular in the intervening decades. Using extremely large random numbers, we can be reasonably assured that the number generated will be unique to the population; meaning there is no need to coordinate between systems. Each processor can independently assign identifiers without concerning itself with the work of other processors in the system. The results can later be merged as one step.
Hashes serve a similar purpose but have the benefit of being deterministic. It is possible to do processing on a unit and identify which unit it is without having to do expensive scanning on the larger set. If some feature of the data can be used as a unique identifier, hashes can be used to convert that feature into a reasonably sized and well-distributed integer, suitable for use as a key. Like random numbers, we can be reasonably assured that there will not be a conflicting number, but we also have the benefit of being able to integrate overlapping entities if they are then determined to represent the same thing.
Imbalanced Indexes
The law of threes dictates that I make three subpoints, and so I had to come up with a third reason not to use Sequential IDs; but this one is a little more of a stretch.
There is an argument that during partitioning for storage or processing, or for indexing of datasets, sequential IDs do not balance well in the binary tree. This first came up during a (now lost) lecture I watched by Jan (can’t remember her last name) of IBM. She discussed why random keys were so important to database performance and asked audience members if their keys became imbalanced. This highlighted issues I remember from Project Gutenberg’s data store in which they partition based on the first digit of the sequential number of the book submission.
Under these conditions, Benford’s Law is going to burn us.
Benford’s Law states that often, the first digit of a number is low. In a sequential set, it is obvious why this would be true: every time you change scales (say 1’s to 10’s) it takes ten times longer to change the first digit (in other cases this remains a mystery to me).
For partitioning, this means that partition 1 is going to get 30% of the work, while partition 9 is only going to get 5% of the work; creating massive bottlenecks while one grinds away, and another sits idle.
While I have seen real-world examples of this, and do recognize the problem as a problem, I am generally dismissive of it. It is relatively simple to implement a fast, balanced hashing algorithm that resolves this problem.
Using the example from above, Benford’s Law can be circumvented by simply reversing the numbers (effectively a Mod10 hash). So instead of the sequence [101, 102, 103, 104, 105]
being sent to the same partition, we can instead use [101, 201, 301, 401, 501]
to distribute the values more equitably.
Having come across articles on the hashing algorithms used in indexing by some of the big database engines, I would expect the balancing of data during partitioning to be a well-understood and handled problem in any platform I choose to adopt. If you have to balance the values yourself, these may be considerations you need to take, but looking at your chosen tools may be in order.
Valid Uses
As with any hard and fast rule … well … there is no such thing. There are exceptions and valid reasons for why you might want a sequential identifier.
Audits
Going back to my very early days in my career, when my colleagues and I were just trying to figure out how to turn a dollar, a few of us took to teaching “How to use Office” night classes at local community centres, job retraining centres, or (mine) crime exit programs; all of us had a module “Use Excel to Make an Invoice”.
One weekend, I was at one of these friends' houses, we were having a few beers and working on a hobby problem together when he mentioned a lesson he got from one of his students.
As part of his invoice class, he taught how to make invoice numbers automatically (last inv number + 1
), but he told the class, “If you want to make your business look more impressive to your customers, change that formula to last invoice number + 10
. Even if you only have one customer, it won’t look that way.”
A hand, near the back slowly went up.
I am an accountant, and have to recommend people not do that. When you undergo a tax audit, they will identify the gaps as missing invoices and assume you have done 10x as much business as you have reported. They will estimate what those invoices are worth, and send you a bill for the apparently missing nine invoices
That’s not something you want to have happen, but the ability to match patterns and find gaps, which is what poses the original risk, also has its benefits. The main reason I will reach for a sequential ID is for audit purposes: a sequential ID can be used to verify missing data by creating gaps in the sequence when data is removed.
Non-technical individuals often don’t understand how malleable information is in a digital world, it is easier for them to comprehend that the number has incremented than it is to understand a chain of hashes, so adding a sequence ID for visiting auditors can be a courtesy.
Replication Checkpoints
One of the databases I like to use a lot is CouchDB, and one of its super-powers is its ability to replicate the data across multiple peers.
To achieve this, there are a few things that have to take place to address conflict resolution, change histories, and split brains; but one of the key elements is the simplest: every change to the database gets a sequence number.
This number is used during replication to act as a bookmark, allowing one database to request every change since a point in time.
This is an important tool to be used during replication, being able to state: “give me everything since checkpoint X
”.
A similar construct is also available in MS SQL Server, called a rowversion
(it was a timestamp
last I used it); and I’m sure most other DBs have the same concept: a globally incrementing sequence.
Solutions
Given all of this, what should we do when constructing identifiers?
Big Integers
No matter what style of identifier you use, make sure you use really big integers.
In those early days of early web applications, the databases we used had auto-incrementing indexes that could have “random” specified as the sequence. However, they were constrained to a 32-bit integer: that’s only room for about 4 billion records.
Not a lot by modern standards.
If you are using random numbers, you only derive the benefit of the gaps if they are reasonably large gaps. If you are using sequential values, you don’t want to run out.
Random values, Hashes, and UUIDs
Generally, I suggest using (at least) a 128-bit integer for storage. I don’t choose this number arbitrarily; yes it’s large, but it is also the size of a UUID. This has the convenience of having a ready-to-use storage type on most systems as well as standardised functions for generating random (or near random) numbers of that size.
I say “near random” because while UUIDv4 is a random number, other versions of UUID are not. For example, version 1 was dependent on the computer’s MAC address to form part of its uniqueness. Further, some implementations of UUID generators are not truly standards conformant, instead including timestamps in the number for convenience.
Hashes are another interesting option, while they are deterministic, they are effectively random. In the case that a dataset has a natural key that we do not necessarily want to share, but makes an obvious primary key, hashes can make a convenient way to mask the natural key. It generates a large, (effectively) random, integer. While an md5
will fit into a 128-bit integer, most modern hashes require more space.
One interesting benefit of using a hash occurs when you require scrambling of your keys. Through the use of a salt value (perhaps a salt value per customer), we can allow for simple regeneration of different random-like keys if we ever require them to change.
Primary Keys are never to be sequential
There are reasons to use sequential numbers, but making a primary lookup value creates the temptation to expose it to the client, and the moment you have something that looks tempting, it will happen. Let me say that again:
Your primary key will get exposed to the customer.
There are all kinds of reasons we promise it won’t, but at some point, somebody is going to make a mistake and expose that number. After all, it is the value we look at individual records by. If for no other reason than an analyst is going to want to do a join against two tables, that primary key is going to get exposed.
By using random-like values for PKs, there is less temptation to use them for inappropriate things. We can’t derive information from them directly and therefore don’t try to use them for secondary purposes. If we need to add sequential values, putting them in a field that explicitly defines why they exist, making their purpose (and non-purpose) clear.
Non-primary, Unique Identifiers
Just keep an eye out for them.
Do you have a field called created_time
?
If your data has a created timestamp on it, it has something that may be very close to a unique identifier. While this may not be sensitive information in itself, users may be able to use this to join data against other datasets.
While useful for diagnostic purposes, these low entropy values pose a risk when exposed to the wider customer group. By not having them as the primary key, we can simply remove these values from data shares. By having these unique values not acting as primary keys, we are not bound by their side effects and we are less tempted to share them inappropriately.
If these values are required for diagnostics, then we must maintain them. Our only defence is to be cognizant of the risks and to keep the values unbound from each other.
Conclusion
While there are valid reasons for maintaining sequential values in data, using sequential values as the primary key poses significant risks to data. Given the nature of primary identifiers, the cost of the risks versus the perceived benefits do not justify their use.
Separating valid sequential use cases from the primary identifier allows us to separate the concerns of the purpose of the data, allowing us to move parts independently of side effects. Very large integers have been widely available for over 30 years, and allow us to create large gaps of negative space in our data to evade malicious detection. Modern random number generation and cryptographic hashes offer a secure way to populate that large space.
Given these simple and well-established solutions, the risks to both security and performance posed by using sequential primary keys do not outweigh their perceived benefits.
As engineers and architects, we need to consider not the way we have always perceived data, but rather to look at the historic and scientific context that surrounds the protection of that data. We may find that the secure mechanisms of the past, no longer suit the computational realities of the present.