How To Quantify Privacy in Datasets

An Entropy-Based Approach for Objective Evaluation and Automated Approval

Jefferey Cave
Dec 27, 2023

We will discuss tools for estimating and automating the enforcement of privacy in datasets.

  • Objective Data Privacy Evaluation: Discuss a straightforward approach using objective measures to gauge and improve data privacy in shared datasets.
  • Estimating Data Privacy Levels: Learn through real-world examples how estimating data privacy levels can be a powerful tool, minimizing bias and enhancing informed decision-making.
  • Automated Checks for Enhanced Workflow: Explore the application of automated checks in the approval workflow to simplify processes and boost productivity in safeguarding sensitive information.

Head over to ObservableHQ to see an interactive version of this.

In a previous article, I delved into the intricate task of preserving privacy, underscoring the critical nature of cautious information sharing. This principle extends beyond safeguarding personal identities to shielding covert subjects like structures, military units, and other sensitive entities. The guiding rule is clear: the less revealed, the better.

Implementing this principle poses a formidable challenge. Over-sharing not only risks exposing individuals but also jeopardizes the confidentiality of various subjects. Assessing the “privacy” of shared datasets demands meticulous effort. Privacy analysts invest time scrutinizing datasets, applying rules, and leveraging professional judgment to ensure the concealment of both personal and classified subjects.

However, relying solely on subjective analysis has its inherent risks. Analysts, being human, are susceptible to biases, fatigue, and external pressures, occasionally leading to lapses in judgment. Despite these challenges, there’s a growing need to share data for diverse benefits.

The predicament endures: how do we establish a threshold for sharing information without compromising the privacy of individuals or the secrecy of sensitive subjects? How can we alleviate the burden of the privacy evaluators while simultaneously ensuring that shared data doesn’t pose excessive risks?

Enter Claude Shannon’s Information Entropy concept, originating from “The Mathematical Theory of Communication” in 1949. Shannon’s concept of the smallest piece of indivisible information provides a quantifiable measure for data. This measure is not only applicable to personal privacy but also to the confidentiality of secret subjects. It furnishes an objective metric to estimate the privacy risk associated with exposing a dataset, serving as a valuable tool for analysts and automated systems alike in assessing risks related to both personal and classified information.

These concepts present a practical tool for striking a balance between safeguarding the privacy of subjects and the imperative to share data for analysis.

Definitions

To ensure precision and clarity, we have chosen to substitute the term “individual” with the more inclusive term “subject.” While my main focus revolves around protecting people's privacy, it’s crucial to recognize that certain fields manage data of a covert nature that extends beyond human entities. Examples include “Tight Holes” in the Oil and Gas sector, clandestine police and military locations, or discreetly insured and transported tangible objects. Remarkably, the strategies for privacy protection can be universally applied, regardless of the nature of the subject.

Central to our exploration is the concept of “privacy,” which we define as resistance to deducing a subject’s identity. Think of it like the classic game of “Guess Who?”, where your opponent uses the provided data to guess the subject’s identity and then uses a broader dataset to gather more information. Privacy erodes when data boosts our confidence in identifying an individual, and it increases when our confidence in distinguishing one individual from another drops.

It’s crucial to emphasize that privacy isn’t only preserved by obscuring the name or ID of a person. Even with fabricated labels for the subject, comprehensive knowledge about them can still lead to privacy breaches. As an analogy, personal privacy can be compromised even if you know everything about a subject but refer to them by a pseudonym, much like my limited knowledge about my neighbour, whom I simply recognize as “that lady next door.”

There are three significant roles in any informational message transfer: the sender, the receiver, and a potential interceptor. When a dataset is shared, all three actors are present. Privacy Analysts sit between the source data and the outside world, filtering it and acting like a sender transmitting a sanitized message. The intended recipient is a Data Analyst, whether an internal colleague or a member of the public. Lastly, that Data Analyst could also be a malicious actor, either gaining access to the data nefariously or, more pertinently, using a permitted dataset in a nefarious way.

While “entropy” is often linked with the predictability of physical systems, it can be better understood as a measure of chaos. Information Entropy, similar to the concept of chaos, quantifies the level of “surprise” within a system. In a stable or predictable state, a system exhibits low entropy. Consequently, possessing complete knowledge about a subject leaves minimal room for surprise; conversely, knowing nothing leaves plenty of room for unexpected discoveries.

While Claude Shannon is renowned for the bit, there are several names and related measures that apply to the same concept of information entropy. The Shannon measure represents the smallest, indivisible unit of information, a binary state (true or false); it is synonymous with a bit (or “binary digit”). This concept builds on previous work defining the Hartley (Ralph Hartley, 1928), which uses a decimal base and can also be referred to as the dit, among other terms used by Turing, Good, and others.
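
To make the relationship between these units concrete, here is a minimal sketch (the value 1000 is purely illustrative): the same quantity of information differs only by the logarithm base used to express it.

// The same amount of information expressed in different bases:
// a field with N equally likely values carries log2(N) Shannons (bits)
// or log10(N) Harts (decimal digits).
let N = 1000;
let bits = Math.log(N) / Math.LN2;   // ≈ 9.97 Shannons
let harts = Math.log(N) / Math.LN10; // = 3 Harts
// one Hart is log2(10) ≈ 3.32 bits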

An Example

To work through an explanation of enforcing privacy through an automated mechanism, an example is useful: samples help us orient ourselves to the task at hand. Naturally, in a discussion about privacy, we want a dataset of individuals whose privacy we want to protect, while also being sure those individuals are not representative of anyone real. To achieve this, we have downloaded a sample customer dataset from Sling Academy, which gives us a list of individuals and some personal information about them.

I’ve also “cleaned” the data to make it a little more suitable for our purposes: mostly by adding some randomized identifiers, but also by introducing some nulls and removing extraneous characters.

The base data used includes 1000 rows from a sample customer set (ObservableHQ)

We have 1000 rows that are representative of data you might find in sales, lending, or social welfare domains. Each row represents one customer, with a unique ID for each person: a large random number we assign as a primary key. The individual’s social security number and name are present to further identify them. We also have general contact information (phone, email), demographic information (gender, job), and some facts that are meaningful to the business (number of sales and total sales dollars). Note that gender uses the public toilet symbols for woman (🚺), man (🚹), and neutral (🚻) as mandated by several US States.

If we look more closely at some of these fields, we might notice that the values inside them occur with different frequencies. If we consider gender, we can see that there are only 3 possible values, and that for the most part knowing a subject's gender does not tell us a lot about who they are. For example, if I tell you that our subject is a man, you have only a 1 in 508 chance of guessing who that person is. If, on the other hand, I tell you the subject's first name is "David", you have a much better chance of identifying the subject (1 in 17).

The probability of guessing an individual by their gender depends on which gender is exposed (ObservableHQ)

This ability to measure predictability brings us close to Claude Shannon’s definition of “entropy”, or surprise. Given the probability of guessing the person, we can also calculate the amount of surprise. By calculating the probability of randomly selecting an individual from within each category we can get an idea of how private the field is.

In our case, the selected field, gender, has 3 categories. By taking an average of their probabilities (0.0430), we get a general sense of its level of privacy.
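
As a sketch of that calculation (reusing the basedata table the notebook is built on, and the gender field that appears in the requests below), the average guess probability for a single field can be computed like this:

// Estimate a field's "probability of a correct guess" by averaging
// 1/frequency across that field's categories.
function guessProbability(rows, field) {
  // count how many rows share each value of the field
  let counts = rows.reduce((a, row) => {
    let v = row[field];
    a[v] = (a[v] || 0) + 1;
    return a;
  }, {});
  // chance of picking the right subject given only this value, per category
  let probs = Object.values(counts).map(freq => 1 / freq);
  // average across the categories
  return probs.reduce((a, p) => a + p, 0) / probs.length;
}

guessProbability(basedata, 'gender'); // ≈ 0.0430 for our sample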

We can repeat this for all fields, giving us a privacy profile for the dataset.

Measures of privacy for each field as a probability, Shannon, or Hart (ObservableHQ)

The privacy factor calculated for each field matches our intuition: perfectly unique IDs carry a very low privacy factor, while items like first_name are relatively anonymous. We also find some non-intuitive results, such as the very low privacy associated with municipality, along with a cautionary reminder: emails and phone numbers are nearly unique to individuals.

// Calculate the privacy profile of all the fields
PrivacyProfile = (function(){
  // one profile row per field in the summary object `fields1`
  let table = Object.entries(fields1).map(([key, d]) => {
    let total = d.total;
    let freqs = Object.values(d.values);
    let categories = freqs.length;

    // average of (1 - 1/frequency): high when values are widely shared, 0 when unique
    let privFactor = freqs
      .map(freq => 1 - (1 / freq))
      .reduce((a, d) => a + d, 0) / categories;

    // the same idea expressed as information content in bits (Shannons)...
    let shannon = freqs
      .map(freq => Math.log(freq) / Math.LN2)
      .reduce((a, d) => a + d, 0) / categories;

    // ...and in decimal digits (Harts)
    let hart = freqs
      .map(freq => Math.log(freq) / Math.LN10)
      .reduce((a, d) => a + d, 0) / categories;

    return {
      "Field": key,
      "PrivacyFactor": privFactor,
      "Shannons": shannon,
      "Harts": hart,
      "Probability": 1 - privFactor,
      "Values": total,
      "Categories": categories,
    };
  });

  // index the profiles by field name for quick lookups later
  let hashed = table.reduce((a, d) => { a[d.Field] = d; return a; }, {});

  return { table: table, hashed: hashed };
})();
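
As a quick usage sketch, the hashed index lets us pull a single field's profile by name (gender and municipality are field names used in the requests later on):

// Look up individual fields in the profile
PrivacyProfile.hashed['gender'].PrivacyFactor;       // close to 1: gender reveals little
PrivacyProfile.hashed['municipality'].PrivacyFactor; // surprisingly low, as noted above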

Applying the Measure to Whole Datasets

If our goal is to retain privacy while sharing data in a large-scale environment, we can use this measure as a tool. Rather than relying purely on people's subjective judgement, we can aid that judgement with an objective measure. When a Data Analyst comes to the Privacy Team looking for access to sensitive data, we can measure the privacy level of their data request.

We can calculate the net privacy of a request by taking a cumulative product of the privacy factors of the individual fields being requested.
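
Written out, with PF_f denoting the privacy factor of field f and R the set of requested fields, the convention used here is simply:

\mathrm{NetPrivacy}(R) = \prod_{f \in R} \mathrm{PF}_f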

Intuitively, we can immediately see that the entire dataset offers little to no privacy; however, this can be more formally stated by taking the product of all the fields.

PrivacyProfile.table.reduce((a, d) => {
  return a * d.PrivacyFactor;
}, 1);

Items like the SSN have a privacy factor of 0, and including them in the product drives the Net Privacy Factor to a hard zero: no privacy at all.

We should expect a Data Request to cover only the data required for the analysis at hand; the request should therefore include a reduced set of fields. To perform this calculation on demand, we can create a generic function.

function GenerateReqest(req = null){
  // if nothing specific was requested, make this a randomized selection
  if(!req){
    let fldprob = Math.ceil(PrivacyProfile.table.length * Math.random()) / PrivacyProfile.table.length;
    req = PrivacyProfile.table.filter(d => (Math.random() < fldprob)).map(d => d.Field);
  }
  // select only the items that actually exist
  req = PrivacyProfile.table.filter(d => req.includes(d.Field)).map(d => d.Field);

  // prepare the return
  let rtn = {
    req: req
  };
  // calculate the privacy factor as the product of the requested fields' factors
  rtn.score = rtn.req.reduce((a, d) => a * PrivacyProfile.hashed[d]['PrivacyFactor'], 1);
  // if the factor is so small as to require scientific notation, just call it zero
  // (this convention is only in place to make reading the results easier)
  rtn.score = rtn.score.toString().includes('e') ? 0 : rtn.score;

  // get the filtered dataset
  rtn.data = basedata.map(d => {
    // keep only the requested fields
    d = rtn.req.reduce((a, f) => {
      a[f] = d[f];
      return a;
    }, {});
    // generate a random id for each record
    d = Object.assign({
      id: Math.floor(Math.random() * Number.MAX_SAFE_INTEGER).toString(32)
    }, d);
    return d;
  });
  return rtn;
}

This generic function takes a list of variables that the analyst wants access to, calculates the cumulative score, and produces the requested dataset. This can then be used to make a request (R00001) for items we know are more generic fields.

R00001 = GenerateReqest(['gender','age','hobbies']);
// {req: ["gender", "age", "hobbies"], score: 0.8534004245932628}

Again, the results match our expectations. We know that gender, age, and hobbies have much higher Privacy Factors and should therefore be much safer. However, now that we have a clear measure, we can be more explicit and state that the cumulative product of the request has a Privacy Factor of 0.8534.

Another Data Analyst may make an innocent request for data that surprises us.

R00002 = GenerateReqest(['state','municipality','registered']);
// {req: ['state','municipality','registered'], score: 0.006712926793301226}

Their intent appears to be a longitudinal study of where the organisation’s successes and failures are. Intuitively, a city should be a reasonably anonymous item. Surprisingly — likely due to our small data size — this request shows a significantly lower Privacy Factor (0.0067).

This request should be looked at much more closely by the Privacy Team to determine whether there is a problem, and whether there are further transforms that could be applied to reduce the risk profile of the supplied dataset.
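
One lightweight way to explore such mitigations is to re-score the request with the problem field dropped or generalized, reusing the same helper (a sketch; because every factor lies between 0 and 1, removing a field can only raise the score):

// Re-score the request without the low-privacy municipality field
R00002b = GenerateReqest(['state', 'registered']);
// compare R00002b.score against R00002.score before deciding on approval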

Using the Thresholds in Automated Checks

There is no point in having this tool if we can’t use it in our systems to make our lives easier. We have already discussed using it to assist in evaluation, but it is also possible to apply it as an automated check. By setting a predefined threshold, we can automatically reject any request whose Privacy Factor falls below that value.

While there is no one threshold suitable for all cases, organizations can inspect sample requests to identify suitable points at which automatic approval, automatic rejection, or calls for further inspection occur. Further, these thresholds can be applied at different points in the approval chain:

  1. Creating the Request
  2. Privacy Approval
  3. Data generation

When the Data Analyst begins to put their request together, we can offer them an estimate of what the Privacy Factor of their request will be. As an inexpensive calculation, this can be done live, during the request process, through a web or application interface. This gives the customer early feedback about what they are requesting and offers them the opportunity to plan their justifications, or mitigating transforms that may reduce the risk profile. Assuming the Privacy Factor is above a predetermined safe threshold, the request can be fulfilled immediately, with no further review.
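
A minimal sketch of such a gate is shown below; the 0.8 and 0.001 cut-offs are illustrative placeholders rather than recommendations, and would be chosen by the organization.

// Automated triage of a data request based on its Privacy Factor
const AUTO_APPROVE = 0.8;   // illustrative threshold only
const AUTO_REJECT = 0.001;  // illustrative threshold only

function triageRequest(request) {
  if (request.score >= AUTO_APPROVE) return 'approve'; // deliver immediately
  if (request.score < AUTO_REJECT) return 'reject';    // block outright
  return 'review';                                     // route to the Privacy Team
}

triageRequest(R00001); // 'approve' (score ≈ 0.8534)
triageRequest(R00002); // 'review'  (score ≈ 0.0067, needs a closer look)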

Upon submission of a request requiring further review, Privacy Evaluators can have the privacy estimate available to them as a holistic estimate of the risks involved. This offers them a tool for formulating their thoughts and a metric that allows them to focus on problem areas. It also has the benefit of reducing evaluator bias, where one evaluator may personally have a different risk tolerance than another. The metric can form the basis for discussion among evaluators, and with the customer, focusing the conversation on real (rather than perceived) issues.

Lastly, we can continue this process of evaluation even as the customer performs their transforms. A dataset might be approved for use with the understanding that the trusted analyst will perform aggregation that increases its Privacy Factor, but how do we know that they succeeded? If the calculation is performed on a platform we control, we can use these same measures to evaluate the produced dataset. This evaluation can even go further, using a more complete evaluation of the privacy, rather than the estimate we used to get this far.

If their privacy measures are unsuccessful, and we control the platform, we can issue an error message indicating that the privacy threshold was not met.

Judging the appropriate threshold for any given system will be contextual. Privacy Analysts will need to evaluate where the cut-off should lie based on the openness of the shares, and on the nature of the data itself. Secure data managed in secure facilities will require fewer checks than data being published openly on the web.

Conclusion

The ability to measure the anonymity of data before any requests for access offers several advantages to us in managing Data Repositories. Claude Shannon’s concept of Information Entropy gives us just such a mechanism, by measuring the amount of entropy in a potential channel (shared dataset).

Using this estimate, combined with an evaluation of appropriate thresholds, it is possible to partially automate the approval mechanism for data shares. While complete automation is not possible, we are still able to reduce the burden on Privacy experts, as well as use these objective measures to reduce the impact of any personal biases the experts may have.

Enjoyed the article? Consider buying me a coffee.

We all recognize the importance of maintaining the privacy of people, and of secure assets, in shared datasets. The processes and techniques described here help reduce the cost of doing so by making the evaluation objective.

Going a Little Further

While this was an interesting idea to pursue, there are so many more layers that weren’t covered.

  • Privacy Weights: every field should have a manually set weight associated with it. This can be used to account for values that look unique in the raw data but that will be released with other protections. For example, a Unique ID should be transformed for every request, so its effective privacy factor is known beforehand but is not represented by the actual data. Manually set adjustments can compensate for that.
  • Anonymity Functions: several very simple, automatically applicable functions could be included in the initial request. Rounding all dollar figures to the nearest thousand, or keeping only the first three letters of a name, will drastically change the privacy factor (a small sketch follows this list).
  • The Thing I Forgot: I’m sure there are a million little inclusions you could make to a Data Request that would allow for more refined automation and control. Feel free to drop them in the comments.
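
As a sketch of the Anonymity Functions idea, two such transforms might look like this (illustrative helpers, not part of the notebook):

// Round dollar figures to the nearest thousand
function roundToThousand(dollars) {
  return Math.round(dollars / 1000) * 1000;
}

// Keep only the first three letters of a name
function truncateName(name) {
  return (name || '').slice(0, 3);
}

roundToThousand(12345.67); // 12000
truncateName('David');     // 'Dav'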

Further Reading

Some general articles that showed up in Google searches while I was writing this looked interesting. I will be reading them later.

While we discussed measuring entropy to ensure it was sufficient, there are several things you can do to artificially increase entropy when it is not.

I really found Utrecht’s book on the matter interesting. As I was looking for references to help make points I kept finding myself referring back to this book.


Jefferey Cave

I’m interested in the beauty of data and complex systems. I use storytelling to help others see that beauty. https://www.buymeacoffee.com/jeffereycave