The 'Anonymous Data' Myth, Debunked
Published 2026-06-02
Companies routinely claim they share 'anonymous' data with partners. Here's why that's almost always misleading, and how re-identification works in practice.
The Claim
Every privacy policy you've ever skimmed contains some version of: 'We share anonymous / aggregated / de-identified data with our partners.' The phrasing implies that the data shared can't be linked back to you specifically. The reality is more complicated.
What 'Anonymous' Usually Means
- Pseudonymous — your real name is removed but replaced with a stable identifier (a hashed user ID, a device ID). Anyone with the mapping table can re-link the data. The data sharer keeps the table.
- Aggregated — counts and percentages without individual rows. Genuinely anonymous if the aggregates are large enough to prevent re-identification. Usually combined with pseudonymous individual data, not a replacement.
- De-identified — obvious identifiers (name, email, address) removed but everything else (browsing history, purchase patterns, location traces) kept. Frequently re-identifiable.
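To make the pseudonymous case concrete, here is a minimal sketch, assuming the 'anonymisation' is an unsalted SHA-256 hash of the email address. The function name, the two example datasets and the email addresses are all hypothetical. The sketch shows two things: the same person gets the same pseudonym in every dataset, and anyone with a list of candidate emails can rebuild the mapping table themselves.

```python
import hashlib


def pseudonymise(email: str) -> str:
    """Replace an email with an unsalted SHA-256 hash (a stable pseudonym)."""
    normalised = email.strip().lower()
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


# Two hypothetical datasets, shared with two different partners.
purchases = [{"user": pseudonymise("alice@example.com"), "item": "vitamins"}]
locations = [{"user": pseudonymise("Alice@Example.com "), "place": "clinic"}]

# The pseudonym is identical in both datasets, so the records link up.
linked = purchases[0]["user"] == locations[0]["user"]

# Anyone holding a list of candidate emails can rebuild the mapping table.
candidates = ["bob@example.com", "alice@example.com"]
lookup = {pseudonymise(email): email for email in candidates}
recovered = lookup.get(purchases[0]["user"])

print(linked)     # True
print(recovered)  # alice@example.com
```

A secret salt or key makes the dictionary lookup harder for outsiders, but the identifier is still stable, so records about the same person can still be joined.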
How Re-Identification Works
Researchers have repeatedly shown that 'anonymous' datasets can be re-identified with high accuracy when joined with public data:
- An estimated 87% of Americans are uniquely identifiable by just 5-digit ZIP code + full date of birth + gender (Sweeney 2000)
- Subscribers in Netflix's anonymised Netflix Prize viewing dataset were re-identified by cross-referencing public IMDb ratings (Narayanan + Shmatikov 2008)
- Poorly anonymised taxi trip records exposed celebrities' rides and tips, and pointed to the likely home addresses of patrons of a Manhattan strip club (NYC Taxi data, 2014)
- An aggregated 'anonymous' fitness tracker heatmap revealed the locations and layouts of military bases and other sensitive sites (Strava heatmap, 2018)
- In pseudonymous credit card records, knowing just 4 purchases (shop and day) was enough to single out 90% of people (de Montjoye et al. 2015)
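Most of these attacks share one mechanism: join the 'anonymous' rows to a named public dataset on the attributes both have in common. The sketch below is a toy illustration of that join, not a reconstruction of any of the studies above. All records, names and field names are invented.

```python
from collections import defaultdict

# Hypothetical 'de-identified' records: names removed, quasi-identifiers kept.
deidentified = [
    {"zip": "02138", "dob": "1970-01-15", "sex": "M", "diagnosis": "hypertension"},
    {"zip": "02139", "dob": "1962-03-14", "sex": "F", "diagnosis": "asthma"},
    {"zip": "02139", "dob": "1962-03-14", "sex": "F", "diagnosis": "diabetes"},
]

# Hypothetical public dataset with names attached (think: a voter roll).
public = [
    {"name": "J. Doe", "zip": "02138", "dob": "1970-01-15", "sex": "M"},
    {"name": "A. Roe", "zip": "02139", "dob": "1962-03-14", "sex": "F"},
]

QUASI_IDENTIFIERS = ("zip", "dob", "sex")


def key(row):
    """The combination of attributes both datasets share."""
    return tuple(row[q] for q in QUASI_IDENTIFIERS)


def link(deid_rows, public_rows):
    """Return {name: diagnosis} for every person who matches exactly one row."""
    by_key = defaultdict(list)
    for row in deid_rows:
        by_key[key(row)].append(row)

    matches = {}
    for person in public_rows:
        rows = by_key.get(key(person), [])
        if len(rows) == 1:  # a unique match means the row is re-identified
            matches[person["name"]] = rows[0]["diagnosis"]
    return matches


reidentified = link(deidentified, public)
print(reidentified)  # {'J. Doe': 'hypertension'}
```

A. Roe escapes here only because two rows share her attributes. That is exactly the protection k-anonymity tries to guarantee for everyone.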
Why It's So Easy
Each piece of behavioural data is a constraint that narrows down who you could be. Your morning commute pattern (location at 8am and 5pm) can eliminate the vast majority of the population. Your shopping habits eliminate most of the rest. Your TV viewing pattern eliminates most of who's left. Within a few constraints, you're unique, even if your name was never in the data.
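The narrowing effect is easy to simulate. The sketch below assumes a synthetic population of 100,000 people with four independent, uniformly distributed attributes. The attribute names and bucket counts are made up, and real behaviour is correlated rather than uniform, so treat it as an illustration of the arithmetic only.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

N = 100_000
population = [
    {
        "id": i,
        "home_area": random.randrange(200),      # hypothetical: 200 neighbourhoods
        "work_area": random.randrange(200),      # hypothetical: 200 work districts
        "grocery_store": random.randrange(50),   # hypothetical: 50 store chains
        "favourite_show": random.randrange(100), # hypothetical: 100 shows
    }
    for i in range(N)
]

target = population[12345]  # the person we are trying to single out
candidates = population
remaining = []  # how many people still match after each constraint

for attribute in ("home_area", "work_area", "grocery_store", "favourite_show"):
    candidates = [p for p in candidates if p[attribute] == target[attribute]]
    remaining.append(len(candidates))
    print(f"after {attribute}: {len(candidates)} candidates left")
```

Under these assumptions, matching on home area alone leaves about 1 in 200 people, and adding the work area leaves about 1 in 40,000. The remaining attributes then whittle that handful down to the target.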
What 'Real' Anonymisation Looks Like
- k-anonymity: each row is indistinguishable from at least k-1 other rows on the identifying attributes. Achieved by coarsening granularity (e.g. a 5-digit ZIP code becomes a 3-digit prefix). Helps for static datasets, but breaks down for streaming data.
- Differential privacy: adds calibrated noise so that published results look almost the same whether or not any one person's row is in the data, even to an attacker with auxiliary information. Used by Apple (for some keyboard analytics), Google (for some Chrome metrics), and the US Census Bureau (2020 onwards).
- Aggregation only: never publish individual rows; only counts above a minimum threshold (e.g. 'at least 10 users in this bucket').
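The first and third techniques can be sketched in a few lines. This is a minimal illustration with made-up rows and hypothetical helper names (`generalise`, `k_of`, `safe_counts`). It measures k before and after coarsening, then publishes only the buckets that meet a minimum size.

```python
from collections import Counter

# Hypothetical individual-level rows.
rows = [
    {"zip": "10001", "age": 34},
    {"zip": "10002", "age": 36},
    {"zip": "10003", "age": 31},
    {"zip": "94110", "age": 52},
    {"zip": "94117", "age": 58},
]


def generalise(row):
    """Coarsen quasi-identifiers: 3-digit ZIP prefix and 10-year age band."""
    decade = row["age"] // 10 * 10
    return (row["zip"][:3], f"{decade}-{decade + 9}")


def k_of(rows, quasi):
    """The k in k-anonymity: size of the smallest group of identical rows."""
    return min(Counter(quasi(r) for r in rows).values())


def safe_counts(rows, quasi, threshold):
    """Aggregation only: publish a bucket count only if it meets the threshold."""
    counts = Counter(quasi(r) for r in rows)
    return {group: n for group, n in counts.items() if n >= threshold}


raw_k = k_of(rows, lambda r: (r["zip"], r["age"]))
coarse_k = k_of(rows, generalise)
published = safe_counts(rows, generalise, threshold=3)

print(raw_k)      # 1: every raw row is unique
print(coarse_k)   # 2: each coarsened row has at least one twin
print(published)  # {('100', '30-39'): 3}
```

The threshold of 3 is only there to keep the example small. The minimum of 10 mentioned above would suppress every bucket in a dataset this tiny.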
Differential privacy is the current gold standard. Most 'anonymous' data sharing in industry does NOT meet this bar.
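For a sense of what 'calibrated noise' means, here is a minimal sketch of the Laplace mechanism applied to a single counting query. The function names, the counts and the epsilon value are illustrative assumptions. Production systems also need privacy-budget accounting and more careful noise sampling than plain floating-point arithmetic provides.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def private_count(true_count: int, epsilon: float, rng: random.Random) -> float:
    """Noisy count. Adding or removing one person changes a count by at most 1
    (sensitivity 1), so the noise scale is 1 / epsilon."""
    return true_count + laplace_noise(1.0 / epsilon, rng)


rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical query: how many users visited a clinic, with and without Alice.
with_alice = [private_count(120, 0.5, rng) for _ in range(5)]
without_alice = [private_count(119, 0.5, rng) for _ in range(5)]

print([round(x, 1) for x in with_alice])
print([round(x, 1) for x in without_alice])
```

The two sets of answers overlap heavily, so a single published number says very little about whether Alice is in the data. Smaller epsilon values mean more noise and stronger privacy, at the cost of accuracy.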
What to Do About It
- Read privacy policies for the actual words. 'Anonymous' often means 'pseudonymous'.
- Use disposable email and other techniques to limit what data ties to your identity in the first place.
- For services that handle highly sensitive data (health, finance, legal), prefer providers with end-to-end encryption where they can't access the data even if compelled.
- Support legal frameworks (such as GDPR and CCPA) that treat pseudonymous data as personal data and restrict how it can be shared.
Bottom Line
'Anonymous data' is a marketing term, not a technical guarantee. Treat any sharing of behavioural data as potentially re-identifiable. The strongest defence is data minimisation: don't generate the data in the first place.
Related Guides
See also: how data brokers profile you, data harvesting in free apps, and browser fingerprinting.