Maintaining Privacy in Data Shares

A guide to big data privacy for dummy developers

Jefferey Cave
11 min read · Dec 18, 2023
  • Balancing the power of modern computational capabilities with individual privacy is a formidable challenge in data sharing.
  • Benevolent actors, in their quest for insights, may inadvertently breach privacy by applying multiple dimensions to data, exposing individuals.
  • Introducing a privacy metric to mechanically measure the risk associated with shared datasets, aiding in decision-making for responsible data sharing.
[Wikicommons — CCSA]

When we delve into the intricacies of data sharing, one paramount consideration arises — privacy. The very essence of privacy hinges on how informative the shared data is, necessitating a quantifiable approach.

In today’s landscape of information gathering, the sheer capacity often leaves one in awe. Privacy and data ethics, perennially debated topics, trace their roots back more than 230 years to the US Constitution’s Fourth Amendment, which asserts the right to be secure in personal papers against unreasonable searches and seizures. It is just one significant example of recognising the perils of exposing personal information to authorities.

Long suspected, the scale of governmental information gathering was confirmed in 2013, bringing the capabilities of organizations such as the NSA to the forefront. The revelation that the NSA accumulates records of every phone call from every individual was not just mind-boggling; it raised significant questions about privacy, data ethics, the potential for misuse, and the safeguards in place.

We’re not talking about datasets reminiscent of the Access Databases of the 1990s for mailing lists; these are colossal databases maintained by governments and mega-corporations. They harbour the potential to craft exact profiles of individuals. The sheer volume of data generated by individuals, when amalgamated, provides a unique perspective on the individual.

ASIDE

K68mPpQOmdxPfzQtgikKaw==

My unique computer identifier on the internet.

This value is generated from my browser settings and allows my online activity to be uniquely tracked across multiple websites, even without the assistance of things like cookies.

What’s your fingerprint?

How do we balance the power of modern computational capabilities with the privacy of the individuals whose data we collect?

The Road to Hell is Paved with Good Intentions

There are many layers to security, but often the first that should be applied is to guard against misuse: legitimate users using the data for inappropriate purposes.

In movies, a private investigator will be seen buying information from an informant; super-spies are seen breaking into vaults to steal information about an individual. These are not complete fabrications; over the years, I have been involved in several investigations regarding leaks of personal information. I have been involved in at least two cases of data theft (once as a Nurse and once as a Data Professional), and in both cases a private investigator had hired an insider to look up information in the system. In both cases, gathering evidence to trace the activity was as simple as looking up data access logs that did not align with business duties, but did align with the suspected misuse. The mechanics of the system were sufficient to restrict access to the data, and to identify and prove its misuse.
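
As a sketch only, assuming a hypothetical access_log table recording who looked up which record, and an assignment table recording which records each employee was actually working on, that evidence-gathering query might look something like this:

-- hypothetical tables: access_log(employee_id, record_id, accessed_at)
--                      assignment(employee_id, record_id)
-- find lookups that had no corresponding business assignment
select
    l.employee_id,
    l.record_id,
    l.accessed_at
from
    access_log l
    left join assignment a
        on a.employee_id = l.employee_id
        and a.record_id = l.record_id
where
    a.record_id is null
order by
    l.employee_id, l.accessed_at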

Data Warehouses and modern analytics add a wrinkle to this problem.

Data Warehouses can be a very powerful information resource. Depending on your business lines, they may contain the joined personal information of customers and employees across the entire organisation. For a governmental organization like the NSA, the “customers” are the citizenry and visitors to the country. This is a very powerful tool for analysis; however, it carries with it risks of misuse.

Mechanically, the nature of the bulk data that is being shared makes it vulnerable to misuse by benign, malicious, and (most importantly) benevolent actors.

  • Keeping malicious actors out is obvious: do security background checks to find honest people, hire honest people, and create guidelines for use that honest people can follow.
  • Benign actors are … well … benign. They are the honest people you hire, happy to follow corporate policies like: “You must not look up yourself, your friends, or your family”.
  • Benevolent actors are more complicated.

If we have screened for honest people, we have probably also biased our hiring toward helpful people. Combine that with large datasets, and the fact that Data Analysts are a curious lot, and we have a recipe for disaster.

To derive value from our large datasets, we have to perform analysis on them; the benefit comes from aggregate analysis over many detailed values. If an organisation holds sensitive data and an honest analyst can offer some significant insight by inspecting it, the organisation will make the data available to the analyst. This generally takes the form of the analyst proposing their study and describing their data needs, and the organisation sending them the data that aligns with their request.

… and this is where it starts to fall apart …

A perfectly reasonable request for data may arise within the United States Census Bureau: create a report that identifies a wage gap between the genders.

To support the request, the analyst is given access to a dataset containing all tax filings for the past dozen years. The analyst then loads the data into the analysis tool of their choice, runs a simple aggregation by state and year, and sees the results:

-- average income by gender, per state and year
select
    year,
    state,
    sum(case when gender = 'M' then income else 0 end)
        / nullif(sum(case when gender = 'M' then 1 else 0 end), 0) as avgM,
    sum(case when gender = 'W' then income else 0 end)
        / nullif(sum(case when gender = 'W' then 1 else 0 end), 0) as avgW,
    sum(income) / count(*) as avgAll
from
    dataset
group by
    state, year

Their manager approves hitting the publish button, and the whole team heads out for lunch together.

Over lunch, the original author is discussing their findings with their colleagues, when someone asks a simple question:

I wonder if age has anything to do with that?

That’s an interesting question and may be a useful thing to add to future reports. So the analyst goes back to their favourite tool; the data is still loaded, and all they have to do is adjust the query… and this is where information security starts to break down. This is not what the data was authorised to be used for.
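
As a sketch of how little friction is involved, and assuming the extract happens to include an age column, the “quick adjustment” is only a couple of lines:

-- hypothetical follow-up: the same report, with an age bracket added
select
    year,
    state,
    floor(age / 10.0) * 10 as age_bracket,   -- assumes an age column exists in the extract
    sum(case when gender = 'M' then income else 0 end)
        / nullif(sum(case when gender = 'M' then 1 else 0 end), 0) as avgM,
    sum(case when gender = 'W' then income else 0 end)
        / nullif(sum(case when gender = 'W' then 1 else 0 end), 0) as avgW
from
    dataset
group by
    state, year, floor(age / 10.0) * 10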

While this is a benign example, each dimension of data brings us a little closer to revealing the individual. In their rush to discover meaningful insights, the analyst has applied two dimensions to the individual in question. While this may not be a big deal at the state level, imagine applying filters like this to the town of Albertville, population 86. Suddenly, age may be distinguishing enough to identify some individuals, and publishing their wages to their neighbours could cause some bad blood in town.
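
To see how quickly that happens, a rough group-size check (in the spirit of k-anonymity) shows which combinations of dimensions describe only a handful of people; the town and age columns here are purely illustrative:

-- which combinations of quasi-identifiers are nearly unique?
select
    town,
    gender,
    floor(age / 10.0) * 10 as age_bracket,
    count(*) as group_size
from
    dataset
group by
    town, gender, floor(age / 10.0) * 10
having
    count(*) < 5   -- groups this small risk identifying individuals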

Security (privacy) of customer data is of paramount importance and requires a means of measuring how much information about the individual we are sharing. This concept of revealing the individual is effectively that of information entropy. Entropy is a measure of the amount of surprise, and in this case, it is the amount of surprise we have when we discover the actual person.
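
In information-theoretic terms, if a released combination of attribute values is shared by k people, and each of them is equally likely to be the person behind a record, the remaining uncertainty about who that record belongs to is \log_2 k bits. Averaging over every record in the share gives the entropy left around the individual (a sketch of the idea, not a formal treatment):

H(\text{person} \mid \text{shared data}) = \sum_{g} \frac{k_g}{N} \log_2 k_g

where the sum runs over the groups of records that are indistinguishable in the shared data, k_g is the size of group g, and N is the total number of records. A column that is unique per person (an email address) leaves zero bits of surprise; a column everyone shares leaves the full \log_2 N.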

Aside from purely mechanical safeguards like network access controls, encryption, and permissions, we need to consider the ability of shared data to be used inappropriately. The ultimate security tool is to simply not share data we do not want people to have access to, or (more importantly) to control the context in which the data is interpreted. This allows us to maintain a high level of entropy around the individual.

Every time someone asks for access, we need to evaluate whether we are releasing enough information to expose the individual.

This requires human thought and analysis, and humans make mistakes. Overworked, over-tired, or pressured by the office bully, a human may give permission to expose more data than is appropriate, allowing researchers to accidentally dox an individual. Some mechanism that mechanically, automatically, and without bias measures the privacy risk of a proposed dataset is necessary because…

Hell truly is paved with good intentions

Hell’s Gate (Darvaza Crater) was created when someone threw a match in to stop a gas leak [Wikicommons, CCA]

The Problem

To best understand the problem, let us consider a simplified example.

Studying this dataset, we can see it is a basic set of income data for US citizens. It includes some demographic information, like name and gender, as well as some contact information. While it does have a random identifier, it also contains their Social Security Number, something we do not want to just hand out to the first Private Investigator who asks for it.
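
The dataset itself is not reproduced here, but as an illustrative sketch (the column names and types are assumptions), the table being described might look something like this:

-- illustrative schema for the simplified example
create table dataset (
    record_id   varchar(36),    -- random identifier (e.g. a UUID)
    ssn         char(11),       -- Social Security Number: never to be handed out
    first_name  varchar(50),    -- demographic information
    last_name   varchar(50),
    gender      char(1),
    email       varchar(255),   -- contact information
    country     varchar(2),     -- 'US' for everyone in this dataset
    income      numeric(12, 2)  -- the value our analysts actually care about
)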

Our goal is to give information to the customers (analysts) who request it, but also to ensure we do not give so much that we end up exposing the individual. We can define this as:

the information given to the analyst must not be sufficient to identify a single individual

The very first step is to remove the account identifiers; we don’t want those being handed out. The rest of the data is not so clear. What data does the Analyst need to satisfy their research needs?

Our customer is studying income, so we must include it, but the concern is that it gets exposed. While an individual might recognise their own income, assuming they have kept it private, it should stay private and not be associated with them.

Let’s start by considering the email address. Email addresses are designed to be unique to the individual: if we give the dataset to our customer with nothing but an email address and an income, we have effectively tied the income to that person. On the other hand, if we expose the country, it does not tell us anything unique about the individual (everyone in our dataset is from the USA). So if we are going to share data, we want to share data with minimal impact on privacy.
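
One quick way to see the difference is to compare how many distinct values each candidate column has against the number of rows; a minimal sketch, using the illustrative columns from above:

-- how identifying is each column on its own?
select
    count(*)                as total_rows,
    count(distinct email)   as distinct_emails,   -- roughly one per row: identifying
    count(distinct country) as distinct_countries -- a single value: reveals almost nothing
from
    dataset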

Our analyst studying gender inequality (Alice) is doing pretty well, having only asked for Gender and income.

Another analyst, a couple of desks over (Bob), has been working on an algorithm for a while and thinks he has a way to use Last Name as a proxy for race. He would like to do a study using names and incomes. The Privacy analyst, on the ball, notices that full names are effectively unique, so offers a list containing only last names and incomes.

Alice and Bob are discussing their findings over lunch one day when they are overheard by their co-worker Eve. Eve’s ears perk up: she is also a Private Investigator, always looking for exciting datasets, and she surreptitiously “acquires” the two datasets.

Eve knows that each dataset she has acquired is useless in itself: the two have been vetted to ensure that the individuals involved cannot be uniquely identified. However, on closer inspection, she notes that the incomes are unique. Using this insight, she joins the two tables.
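
Assuming the two vetted extracts ended up in tables we might call alice_extract (gender, income) and bob_extract (last_name, income), Eve’s join is a one-liner, and the “likely salutation” described next is just string concatenation on top of it:

-- hypothetical table names for the two vetted extracts
select
    b.last_name,
    a.gender,
    a.income,
    case when a.gender = 'W'
         then concat('Ms. ', b.last_name)
         else concat('Mr. ', b.last_name)
    end as likely_salutation
from
    alice_extract a
    join bob_extract b
        on b.income = a.income   -- incomes happen to be unique, so this re-links the rows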

Eve has achieved an interesting effect: she has not only joined the two original datasets to get a more complete picture, she has also constructed likely salutations, new data that exceeds the scope of either of the original datasets. While it is not a complete picture, she has started to build a profile of individuals. These profiles can be used for purposes that exceed the original permitted use of the data.

A Real Risk

While this story is obviously made up, it is not unrealistic. If one looks at the way we package modern reports, we can see an element of these risks all around us.

Modern data presentations demand some level of interactivity. The ability to filter, change, and compare the data on the fly is a very powerful and compelling tool. But there is a risk: to share these visualisations and make them dynamic, we have to embed sufficient detail in the data we are sharing.

That data is in the file, and just because you don’t know how to extract it, doesn’t mean nobody does. For the tool to be useful, some people do know how … that’s how they make the visualisations work.

Like Alice and Bob, we act with the best of intentions, but we easily become Eve, sharing data inappropriately, if we aren’t careful. We end up sharing data that carries too little entropy, simply hidden behind a mask.

Screenshot of a dynamic visualization with embedded data [Tableau Public]

What Can We Do?

For people concerned with individual privacy, this is bad. So the question becomes, as custodians, how can we prevent it?

While human thought and analysis will always be necessary, it is subject to bias, inconsistency, and mistakes. Automating decision-making, or offering automated decision-making aids to humans, is always a good idea. In light of this need, can we develop a metric that measures the level of privacy in our shared datasets? Can this metric be used to aid in the decision-making process around approving data shares?

Increase the Entropy

There are a few standard practices we can use to increase the entropy around the individual. In our example, the two innocent datasets were joined through income, which happened to be unique.

Originally, income was allowed because it was necessary, but that is what created the vulnerability. We can increase the entropy around the individual by simply rounding the field to some level of precision.

Do we need it to be accurate to the dollar? Would the nearest thousand dollars do?

In doing so, we increase the number of people who will match each value, and protect their anonymity.
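
A minimal sketch of what that might look like: band incomes to the nearest thousand before sharing, and check how many people fall into each band (in most SQL dialects, round() with a negative number of digits rounds to the left of the decimal point):

-- band incomes to the nearest thousand before sharing
select
    gender,
    round(income, -3) as income_band,
    count(*)          as people_in_band   -- the bigger the band, the better the anonymity
from
    dataset
group by
    gender, round(income, -3)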

Measure the Entropy

If we can measure the entropy, we should. Rather than leaving it to custodians to use their best judgment, we can offer them a way to measure the risk objectively.

This metric can be made visible to both requesters and approvers to help them decide the appropriateness of the request. We can estimate the entropy of a request before it is even approved, keeping safety at the forefront of our minds. Later, we can measure the entropy of the actual extract to ensure it is sufficiently anonymized before it is released.
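
Carrying on with the entropy formula from earlier, and assuming the proposed share exposes only gender and a rounded income band, a sketch of the metric can be computed directly over the candidate extract (ln() is the natural logarithm; some dialects spell it differently):

-- average remaining entropy (in bits) around the individual for a proposed share
with groups as (
    select
        gender,
        round(income, -3) as income_band,
        count(*)          as k              -- people indistinguishable within this group
    from
        dataset
    group by
        gender, round(income, -3)
)
select
    sum(k) as total_rows,
    sum((k * 1.0 / (select sum(k) from groups)) * (ln(k) / ln(2))) as avg_bits_of_anonymity
from
    groups

Zero bits means every row in the share is unique; the higher the number, the safer the share. A custodian could set a floor on this value (or on the smallest group size) and have the approval tooling flag any request that falls below it.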

Create Versatile Environments

Don’t dictate, cooperate.

One of the risks mentioned earlier in our example was the analysts holding on to the data. We produce the “safe” dataset and then make it available to the analyst to load into their tool of choice.

This is a common story: we have powerful computers, and we have tools we’ve trained on. Those tools are powerful, but they carry the risk of moving data off of controlled environments.

By creating powerful and versatile environments, we allow customers to do their analysis in a controlled environment. We should accommodate the needs of experts and build environments capable of meeting their diverse needs.

This passively discourages requests to take the data off-system by giving analysts access to the tools they want.


Jefferey Cave

I’m interested in the beauty of data and complex systems. I use story telling to help others see that beauty. https://www.buymeacoffee.com/jeffereycave