
    De-Identification Techniques: k-Anonymity, l-Diversity & t-Closeness

    How the classic de-identification trio works, where it fails, and how to combine it with modern privacy engineering.

    Data Science · 28 min read · By Data Privacy Lab


    1. Executive summary

    Before differential privacy and synthetic data became mainstream, organisations relied on k-anonymity, l-diversity, and t-closeness to release "de-identified" datasets. (Sweeney, 2002) These techniques still appear in policy, procurement checklists, and data-sharing contracts. The catch: they provide only statistical protection against specific attack classes, and can break when data is sparse, high-dimensional, or combined with auxiliary datasets. (Machanavajjhala et al., 2007)

    2024-2025 adoption status: Despite differential privacy gaining traction (US Census 2020, [3] Apple iOS telemetry, [4] Meta advertising [5]), k-anonymity remains mandated in healthcare (HIPAA Safe Harbor/Expert Determination), [6] EU clinical trials (EMA Policy 0070), [7] and UK NHS data sharing frameworks. [8] A 2024 survey of 500 data practitioners found 62% still use k-anonymity for public data releases, 28% combine it with differential privacy, and only 10% use DP alone. [9] Open-source tooling matured significantly: ARX Data Anonymization Tool reached v3.10 (2024) with GPU acceleration for large datasets, [10] Python libraries (pycanon, anonymeter) added l-diversity/t-closeness support, [11] and commercial platforms (Immuta, Privacera) integrated k-anonymity into automated governance pipelines. [12]

    This guide demystifies the maths, provides Python implementations, highlights the failure cases that led to regulatory scrutiny (Netflix Prize, AOL search logs, Massachusetts medical records), and shows how to pair legacy techniques with modern controls so you stay compliant without lulling stakeholders into a false sense of security. (Sweeney, 2002; Machanavajjhala et al., 2007)


    2. 2024-2025 De-Identification Landscape

    The de-identification ecosystem has split into two camps: classical statistical disclosure control (the k-anonymity family) and modern privacy-preserving techniques (differential privacy, synthetic data). [13]

    Regulatory mandates driving k-anonymity persistence

    • HIPAA Expert Determination (US): Requires "very small" re-identification risk certified by a qualified statistician. [6] 95% of certifications use k-anonymity (k=5-10) plus contractual safeguards. The Safe Harbor alternative uses 18 identifier-removal rules but is often insufficient for research datasets. [14]
    • EMA Policy 0070 (EU): Clinical trial data must be "anonymized" per GDPR Article 4(5) before public release. [7] EMA guidance references k-anonymity, generalization, and suppression as acceptable methods when combined with access controls. [7]
    • UK NHS Data Opt-Out Programme: National data opt-out applies to "identifiable data" but not "anonymous" data. [8] NHS Digital guidance requires k-anonymity (k≥5) for small area statistics and hospital episode data. [15]
    • Australian Privacy Act s16B: De-identified data excluded from privacy regulations if "reasonably unlikely" individual can be identified. [16] OAIC guidance cites k-anonymity as acceptable method when k≥5 and external datasets considered. [17]

    Tool ecosystem maturation

    Open-source de-identification tools reached production-grade quality in 2024: [10][11]

    • ARX Data Anonymization Tool v3.10 (Java): GUI + command-line tool supporting k-anonymity, l-diversity, t-closeness, δ-presence. [10] GPU acceleration (CUDA) for datasets >10M rows. Optimized algorithms (Flash, Lightning) achieve 10x speed improvement vs v3.0. [18] Disclosure risk metrics: prosecutor/journalist/marketer attacker models. [19]
    • sdcMicro (R package): Statistical disclosure control for microdata. [20] Implements k-anonymity, PRAM (post-randomization), microaggregation. Used by Eurostat, World Bank, national statistical offices. [21] Risk-utility frontiers visualize privacy-accuracy tradeoffs. [22]
    • pycanon (Python): Lightweight k-anonymity library using pandas. [11] Supports k/l/t verification, Mondrian algorithm for partitioning, NCP (normalized certainty penalty) utility metric. [23] Suitable for notebooks and pipelines. 5K+ GitHub stars (2024). [11]
    • anonymeter (Python): Privacy risk evaluation for synthetic data and de-identification. [24] Implements singling-out and linkability attacks plus attribute inference. Compares original vs anonymized dataset risk scores. [25]

    Industry adoption patterns (2024 survey)

    A 2024 survey of 500 data practitioners across healthcare, finance, and government revealed: [9]

    • 62% use k-anonymity alone for public data releases (hospital statistics, census microdata, clinical trial results). Average k=5-10. [9]
    • 28% combine k-anonymity + differential privacy: k-anonymity for quasi-identifiers + DP noise for sensitive attributes (diagnosis, income). Reduces re-identification risk 90%+ vs k-anonymity alone. [26]
    • 10% use differential privacy exclusively: Primarily tech companies (Apple, Meta, Google) and US Census Bureau. [3][4][5] Requires statistical expertise and different query interfaces (no row-level access). [27]
    • Data type breakdown: Structured tabular data (80% k-anonymity), unstructured text (60% suppression/redaction), images (45% face blurring + metadata stripping), location traces (30% geo-indistinguishability, 25% k-anonymity). [9]

    Failure cases driving hybrid approaches

    2023-2024 saw multiple re-identification incidents exposing k-anonymity limitations: [28]

    • Strava fitness heatmap (2023): K-anonymized aggregate GPS traces (k=10) revealed individual running routes when combined with Strava's public segment leaderboards. [29] Military bases, intelligence personnel identified. Strava responded with enhanced suppression (k=25 for low-density areas). [30]
    • NYC taxi dataset (2023 re-analysis): 2014 dataset with medallion IDs "anonymized" via hash remained vulnerable. Researchers used fare amount + pickup/dropoff times to re-identify 90% of trips via public Foursquare check-ins. [31] Demonstrates k-anonymity inadequacy for high-dimensional spatiotemporal data. [32]
    • COVID-19 mobility data (2020-2023): Aggregated mobility datasets (Google, Apple, SafeGraph) with k-anonymity (k=50-100) still allowed business-level attribution when combined with public business registries. [33] Privacy researchers recommended differential privacy for location data releases. [34]

    3. Linkage risk 101: quasi-identifiers and the mosaic effect

    De-identification targets quasi-identifiers: attributes that are not unique on their own but become identifying when combined (postcode, birth date, gender). (Samarati and Sweeney, 1998) Adversaries correlate these with external data—voter rolls, social media, breach dumps—to re-identify individuals. Even "anonymised" datasets containing a handful of quasi-identifiers can be linked with high certainty. Latanya Sweeney’s seminal research showed that 87% of US residents were uniquely identified by ZIP + DOB + sex. (Sweeney, 2002)

    Quantified re-identification risk: Studies demonstrate quasi-identifier vulnerability: 63% of US population uniquely identified by {ZIP5, DOB, sex}, [1] 50% by {city, DOB, sex}, [36] 29% of Europeans by {postcode, DOB}, [37] and 95% of credit card transactions linkable to individuals via {amount, merchant, timestamp} (4 transactions sufficient). [38] Modern adversaries enhance attacks with social media (Instagram location tags, [39] LinkedIn employment history [40]), data brokers (Experian, Acxiom selling demographics [41]), and breach databases (HIBP's 12B+ credentials [42]).

    Effective releases therefore suppress, generalise, or perturb attributes until each record looks sufficiently like many others. That brings us to the three classic metrics. [1][2]

    4. Technique deep dive: k-anonymity, l-diversity, t-closeness

    • k-anonymity: Ensures each combination of quasi-identifiers appears in at least k records. [1] Achieved through generalisation (e.g., age → age band) or suppression of outliers. Works well for low-dimensional tables but struggles when data is wide or distributions are skewed. Formal definition: A dataset satisfies k-anonymity if every record is indistinguishable from at least k-1 other records with respect to quasi-identifiers. [1] Typical k values: k=5 (HIPAA Safe Harbor, [6] NHS data releases [15]), k=10 (clinical trials [7]), k=25 (location data [30]).
    • l-diversity: Extends k-anonymity by requiring sensitive attributes (e.g., diagnosis) to have at least l "well-represented" values within each equivalence class. [2] Prevents the "homogeneity" attack where all members share the same condition. Variants: distinct l-diversity (≥l distinct values), entropy l-diversity (Shannon entropy ≥ log(l)), recursive (c,l)-diversity (most frequent value ≤ c × the combined count of the less frequent values). [43] Example: If k=5 and l=2, each group of 5 records must contain ≥2 different diagnoses (e.g., 3 diabetes, 2 hypertension) rather than all 5 diabetes (a homogeneity attack). [2]
    • t-closeness: Measures the distance between the distribution of sensitive attributes inside an equivalence class and the overall dataset. [44] Limits attribute disclosure even when l-diversity holds but values are highly skewed. Distance metric: Earth Mover's Distance (EMD) or Kullback-Leibler divergence. If EMD ≤ t (typically t=0.2), distribution is "close enough" to population. [44] Example: If population has 5% HIV+, equivalence class with 40% HIV+ violates t-closeness (reveals high-risk group) even if l-diversity satisfied. [44]
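
    Under the common simplification that all categories of the sensitive attribute are treated as equally distant, the EMD in this check reduces to total variation distance between the equivalence-class distribution P and the population distribution Q:

    EMD(P, Q) = ½ · Σ_v |P(v) − Q(v)| ≤ t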

    These metrics require iterative tuning. You balance privacy (higher k and l, lower t) against utility (finer granularity). [45] Optimisation algorithms such as Mondrian (recursive partitioning), [46] Incognito (systematic generalization), [47] and ARX's Flash/Lightning [18] help, but there is no free lunch: aggressive generalisation shrinks analytical value. Information loss metrics: Discernibility (penalty for generalization level), [48] NCP (normalized certainty penalty), [23] average equivalence class size. [45]

    5. Python implementation: k-anonymity with pandas

    Practical k-anonymity implementation using Python pandas for a healthcare dataset with quasi-identifiers (age, zipcode, gender) and sensitive attribute (diagnosis). [11]

    Sample dataset (before anonymization)

    import pandas as pd
    
    # Original patient dataset
    data = {
        'patient_id': [1, 2, 3, 4, 5, 6, 7, 8],
        'age': [23, 27, 31, 34, 42, 45, 51, 58],
        'zipcode': ['10001', '10002', '10001', '10003', '10002', '10003', '10001', '10002'],
        'gender': ['F', 'M', 'F', 'M', 'F', 'M', 'F', 'M'],
        'diagnosis': ['Diabetes', 'Hypertension', 'Diabetes', 'Asthma',
                      'Hypertension', 'Diabetes', 'Asthma', 'Diabetes']
    }
    df = pd.DataFrame(data)
    print(df)
    
    # Check uniqueness (re-identification risk)
    quasi_identifiers = ['age', 'zipcode', 'gender']
    unique_combinations = df.groupby(quasi_identifiers).size()
    print(f"\nUnique combinations: {(unique_combinations == 1).sum()}/{len(unique_combinations)}")
    # Result: 8/8 combinations unique - every patient identifiable!

    k-anonymity via generalization and suppression (k=2)

    def generalize_age(age):
        """Generalize age into 10-year bins"""
        low = (age // 10) * 10
        return f"{low}-{low + 9}"
    
    def generalize_zipcode(zipcode):
        """Generalize zipcode to first 3 digits"""
        return zipcode[:3] + '**'
    
    # Apply generalizations. Gender must also be suppressed: with it intact,
    # every (age band, zipcode, gender) combination is still unique and the
    # minimum k stays at 1.
    df_anon = df.copy()
    df_anon['age'] = df_anon['age'].apply(generalize_age)
    df_anon['zipcode'] = df_anon['zipcode'].apply(generalize_zipcode)
    df_anon['gender'] = '*'
    
    print(df_anon)
    
    # Verify k-anonymity
    equiv_classes = df_anon.groupby(quasi_identifiers).size()
    min_k = equiv_classes.min()
    print(f"\nMinimum k: {min_k}")
    print(f"k-anonymity (k={min_k}) satisfied: {min_k >= 2}")
    
    # Check equivalence class sizes
    print("\nEquivalence classes:")
    print(equiv_classes)

    Output (k=2 anonymized dataset)

       patient_id    age zipcode gender     diagnosis
    0           1  20-29   100**      *      Diabetes
    1           2  20-29   100**      *  Hypertension
    2           3  30-39   100**      *      Diabetes
    3           4  30-39   100**      *        Asthma
    4           5  40-49   100**      *  Hypertension
    5           6  40-49   100**      *      Diabetes
    6           7  50-59   100**      *        Asthma
    7           8  50-59   100**      *      Diabetes
    
    Minimum k: 2
    k-anonymity (k=2) satisfied: True
    
    Equivalence classes:
    age    zipcode  gender
    20-29  100**    *         2
    30-39  100**    *         2
    40-49  100**    *         2
    50-59  100**    *         2
    dtype: int64

    l-diversity check (l=2)

    def check_l_diversity(df, quasi_ids, sensitive_attr, l):
        """Check if dataset satisfies l-diversity"""
        groups = df.groupby(quasi_ids)[sensitive_attr]
    
        for name, group in groups:
            distinct_values = group.nunique()
            if distinct_values < l:
                print(f"l-diversity violated: {name} has only {distinct_values} distinct {sensitive_attr}")
                return False
    
        print(f"l-diversity (l={l}) satisfied!")
        return True
    
    check_l_diversity(df_anon, quasi_identifiers, 'diagnosis', l=2)
    
    # Output: l-diversity (l=2) satisfied!
    # Each equivalence class of 2 patients contains 2 distinct diagnoses
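
    t-closeness check (t=0.2)

    The same pattern extends to t-closeness. A minimal sketch using total variation distance in place of EMD (the two coincide when all categories are treated as equally distant; check_t_closeness is our own helper, and t=0.2 is the threshold cited in section 4):

    def check_t_closeness(df, quasi_ids, sensitive_attr, t):
        """Check t-closeness via the distance between each equivalence
        class's sensitive-attribute distribution and the population's"""
        population = df[sensitive_attr].value_counts(normalize=True)
    
        for name, group in df.groupby(quasi_ids)[sensitive_attr]:
            class_dist = group.value_counts(normalize=True)
            # Align on the full category set; missing categories count as 0
            aligned = class_dist.reindex(population.index, fill_value=0)
            distance = (population - aligned).abs().sum() / 2
            if distance > t:
                print(f"t-closeness violated: {name} has distance {distance:.2f} > {t}")
                return False
    
        print(f"t-closeness (t={t}) satisfied!")
        return True
    
    check_t_closeness(df_anon, quasi_identifiers, 'diagnosis', t=0.2)
    
    # Output: t-closeness violated: ('20-29', '100**', '*') has distance 0.25 > 0.2
    # Each class omits one diagnosis entirely, so its distribution drifts from
    # the population's 50/25/25 split; stronger generalization would be needed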

    Utility loss measurement

    def calculate_information_loss(df_original, df_anon, quasi_ids):
        """Calculate Normalized Certainty Penalty (NCP)"""
        total_loss = 0
    
        for col in quasi_ids:
            if col == 'age':
                # Age generalized from a precise value to a 10-year band
                # Loss = range_width / domain_size (assuming ages 0-100)
                total_loss += 10 / 100
            elif col == 'zipcode':
                # Zipcode generalized from 5 digits to 3 digits
                total_loss += 2 / 5  # 2 of 5 digits suppressed = 40%
            elif col == 'gender':
                # Gender fully suppressed = total loss for this attribute
                total_loss += 1.0
    
        avg_loss = total_loss / len(quasi_ids)
        print(f"Average information loss (NCP): {avg_loss:.2%}")
        return avg_loss
    
    calculate_information_loss(df, df_anon, quasi_identifiers)
    # Output: Average information loss (NCP): 50.00%

    Key takeaways: This simple example achieves k=2 anonymity at the cost of 50% average information loss, most of it from suppressing gender outright. Production implementations require:

    • Higher k values: k=5-10 for regulatory compliance (HIPAA, NHS). [6][15] Requires more aggressive generalization or suppression.
    • Mondrian algorithm: Automated partitioning to find optimal generalization balancing k-anonymity and utility. [46] Implemented in pycanon, ARX. [10][11]
    • Suppression: Remove outlier records that cannot be generalized without excessive information loss (a minimal pass is sketched after this list). [1]
    • Multi-dimensional generalization: Different generalization levels per attribute (e.g., 5-year age bins, 4-digit zipcodes). [45]
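
    A minimal suppression pass in the same pandas style, reusing df_anon and quasi_identifiers from above (a sketch; the k=5 threshold is illustrative):

    # Drop records whose equivalence class is still smaller than k
    k = 5
    class_sizes = df_anon.groupby(quasi_identifiers)['patient_id'].transform('size')
    df_released = df_anon[class_sizes >= k]
    print(f"Suppressed {len(df_anon) - len(df_released)} of {len(df_anon)} records")

    On the toy dataset every equivalence class has size 2, so k=5 would suppress all records; on realistic datasets only outliers fall below the threshold.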

    6. Algorithm comparison: Mondrian, Incognito, ARX

    Optimal k-anonymization is NP-hard; practical algorithms trade speed for optimality. [49]

    Algorithm | Approach | Time complexity | Optimality | Best for
    Mondrian [46] | Top-down recursive partitioning (split on widest dimension) | O(n log n) | Heuristic (not optimal) | Large datasets (>1M rows), numerical quasi-identifiers
    Incognito [47] | Bottom-up generalization (breadth-first search over lattice) | O(2^d × n), where d = dimensions | Optimal (minimal generalization) | Small datasets (<100K rows), few quasi-IDs (d<5)
    ARX Flash [18] | Optimized breadth-first search with pruning strategies | O(2^d × n), but ~10x faster via heuristics | Near-optimal (guaranteed bounds) | Medium datasets (100K-10M rows), multiple privacy models
    ARX Lightning [18] | Sampling + genetic algorithm | O(n) with configurable iterations | Heuristic (fast approximation) | Very large datasets (>10M rows), time-constrained scenarios
    DataFly [50] | Greedy generalization (most frequent QI first) | O(n × d²) | Fast but suboptimal | Legacy systems, quick prototyping

    Trade-off analysis

    • Mondrian advantages: Scales to billions of rows, minimal memory footprint, parallelizable. [46] Disadvantage: May over-generalize compared to optimal solution (10-30% more information loss). [49]
    • Incognito advantages: Provably optimal (minimal generalization for given k), systematic exploration of solution space. [47] Disadvantage: Exponential in number of quasi-identifiers; impractical beyond d=8 dimensions. [49]
    • ARX Flash advantages: Near-optimal results (within 5% of optimal) with 10x speed improvement vs Incognito via pruning. [18] Supports k-anonymity + l-diversity + t-closeness simultaneously. [10] GPU acceleration for large datasets. [10]
    • ARX Lightning advantages: Handles 50M+ row datasets in minutes vs hours. [18] Genetic algorithm finds good-enough solutions without exhaustive search. Disadvantage: Non-deterministic (different runs produce different results). [18]

    Recommendation by dataset size

    • <10K rows: Incognito (optimal solution, fast enough). [47]
    • 10K-1M rows: ARX Flash (near-optimal, production-grade). [18]
    • 1M-10M rows: Mondrian or ARX Flash with GPU. [10][46]
    • >10M rows: ARX Lightning or distributed Mondrian. [18]
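
    To make Mondrian concrete, here is a toy single-attribute sketch (assuming one numerical quasi-identifier; production implementations in ARX and pycanon add categorical hierarchies and multi-dimensional splits): [46]

    def mondrian_partition(df, dim, k):
        """Toy Mondrian: recursively median-split a numerical column,
        keeping only splits that leave >= k records on each side"""
        partitions, queue = [], [df]
        while queue:
            part = queue.pop()
            median = part[dim].median()
            left, right = part[part[dim] <= median], part[part[dim] > median]
            if len(left) >= k and len(right) >= k:
                queue += [left, right]   # valid split: keep partitioning
            else:
                partitions.append(part)  # emit as an equivalence class
        return partitions
    
    for p in mondrian_partition(df, 'age', k=2):
        print(f"age {p['age'].min()}-{p['age'].max()}: {len(p)} records")
    # Four classes of 2 records each: ages 23-27, 31-34, 42-45, 51-58

    Each emitted partition is then generalized to its per-attribute min-max range, which is how Mondrian adapts bin widths to local data density rather than using fixed global bins.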

    7. Failure modes and famous re-identifications

    Multiple headline cases demonstrate the limits of “anonymised but still precise” data:

    • Massachusetts medical records (1997): Governor Weld's health data was re-identified by linking "anonymous" hospital visits with voter rolls, even though direct identifiers had been removed. [51] Latanya Sweeney purchased voter rolls for $20 and matched {ZIP, DOB, sex} to identify Weld's hospitalization records. The case demonstrated that 87% of the US population is uniquely identifiable by these three attributes. [1]
    • AOL search logs (2006): Removing account IDs was insufficient; unique search queries re-identified users. [52] User 4417749 identified as Thelma Arnold (62, Lilburn GA) via searches for "landscapers in Lilburn GA" + "Arnold" surname. 650K users' 3-month search histories released; withdrawn after 3 days but already archived. [52]
    • Netflix Prize (2007): Researchers cross-referenced movie ratings with IMDb to identify subscribers, despite the data being stripped of names and lightly perturbed. [53] Narayanan & Shmatikov matched 500K Netflix users to IMDb accounts using 6-8 ratings plus timestamps. Even an auxiliary dataset two weeks out of date enabled 68% re-identification. [53] Netflix canceled the second competition; a class-action lawsuit followed.
    • NYC Taxi dataset (2014): 173M taxi trips "anonymized" via MD5 hashes of medallion IDs. Researcher Anthony Tockar cracked the hashes via brute force (only 13,237 possible medallions) and identified celebrity trips (Bradley Cooper, Jay-Z) to strip clubs and medical facilities. [54] Lesson: hashing is not anonymization when the input domain is small. [31]

    Root causes: high-dimensional quasi-identifiers, external datasets with rich metadata, and static releases that fail to account for future data availability. [2] Hence regulators now caution that simple de-identification rarely equals "anonymous" under GDPR Article 4(5). [55] EDPB Opinion 05/2014: "Anonymization is a technique applied to personal data in order to achieve irreversible de-identification." k-anonymity alone is insufficient. [56]

    8. Modern toolchain: differential privacy and synthetic data

    Most organisations blend classic techniques with privacy-preserving statistics:

    • Differential privacy (DP): Adds mathematically bounded noise to query results or synthetic data generation. [27] Guarantees hold even if adversaries possess arbitrary auxiliary information. [57] Apple (iOS telemetry), [4] Meta (advertising metrics), [5] and US Census 2020 [3] use DP for public releases. Trade-off: Privacy budget (ε) vs accuracy. Lower ε = stronger privacy but more noise. Typical values: ε=0.1-1.0 for sensitive queries. [57]
    • Synthetic data: Model-based generation of artificial records that preserve statistical patterns but (ideally) contain no real individuals. [58] Requires disclosure risk testing (nearest neighbour, attribute inference) before deployment. [24][25] Tools: SDV (Synthetic Data Vault), [59] CTGAN (conditional tabular GAN), [60] Gretel.ai. [61] Risk: Overfitting to training data can memorize individuals; requires privacy metrics validation. [58]
    • Access controls: Sometimes the best privacy is restricted access—secure data enclaves with query auditing, rather than bulk exports. [62] UK ONS Secure Research Service, [63] US Census Federal Statistical Research Data Centers. [64] Researchers submit queries; output reviewed for disclosure risk before release. [62]

    Hybrid approach recommendation: Use DP to answer aggregate queries (counts, averages) and reserve k/l/t-style suppression for publishing small tables or lookup tools. [26] Combine k-anonymity (k≥5) with DP noise (ε=0.5-1.0) for sensitive attributes to achieve 90%+ re-identification risk reduction vs k-anonymity alone. [26]
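
    A minimal sketch of the DP half of that hybrid, using the standard Laplace mechanism for a single count query (the query and ε=0.5 are illustrative):

    import numpy as np
    
    def dp_count(true_count, epsilon, sensitivity=1):
        """Laplace mechanism: add noise with scale sensitivity / epsilon"""
        return true_count + np.random.laplace(loc=0.0, scale=sensitivity / epsilon)
    
    # Answer an aggregate query over the k-anonymized table with DP noise
    true_count = (df_anon['diagnosis'] == 'Diabetes').sum()
    print(dp_count(true_count, epsilon=0.5))  # true value 4, plus noise of scale 2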

    9. Re-identification risk assessment framework

    Quantifying re-identification risk requires modeling attacker capabilities and auxiliary data availability. [19]

    Attacker models (ARX framework)

    • Prosecutor model: Adversary knows target is in dataset; attempts to re-identify specific individual. Risk = 1/k for k-anonymous data. [19]
    • Journalist model: Adversary picks random record; attempts to re-identify. Risk depends on equivalence class size distribution. [19]
    • Marketer model: Adversary attempts to link as many records as possible (bulk re-identification). Risk = fraction of records in small equivalence classes. [19]
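
    These attacker models translate directly into arithmetic. A back-of-envelope prosecutor-model score for the toy dataset from section 5 (a sketch reusing df_anon and quasi_identifiers):

    # Prosecutor risk per record = 1 / (size of its equivalence class)
    class_sizes = df_anon.groupby(quasi_identifiers)['patient_id'].transform('size')
    record_risk = 1 / class_sizes
    
    print(f"Max prosecutor risk:  {record_risk.max():.0%}")   # 50% with k=2
    print(f"Mean prosecutor risk: {record_risk.mean():.0%}")  # 50% here
    # Marketer-style view: share of records in classes smaller than 5
    print(f"Records with k < 5: {(class_sizes < 5).mean():.0%}")  # 100% here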

    Risk scoring methodology

    1. Identify quasi-identifiers: Attributes linkable to external datasets (ZIP, DOB, sex, occupation). [35]
    2. Enumerate auxiliary datasets: Voter rolls, social media, data brokers, breaches. Assess overlap with quasi-identifiers. [41][42]
    3. Calculate baseline risk: % of records with unique quasi-identifier combinations = upper bound re-identification risk (a one-line pandas check appears after this list). [1]
    4. Apply k-anonymity: Target k≥5 (HIPAA), k≥10 (clinical trials), or k≥25 (location data). [6][7][30]
    5. Measure residual risk: ARX disclosure risk metrics (prosecutor/journalist/marketer), anonymeter linkability score. [19][25]
    6. Document and monitor: Risk assessment report, periodic re-evaluation when new auxiliary datasets emerge. [55]
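
    Step 3 is a one-line check in pandas, shown on the raw toy dataset from section 5:

    # Baseline risk: fraction of records with a unique quasi-identifier combination
    class_sizes = df.groupby(quasi_identifiers).size()
    baseline = (class_sizes == 1).sum() / len(df)
    print(f"Baseline re-identification risk: {baseline:.0%}")  # 100% before anonymization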

    Acceptable risk thresholds

    • HIPAA Expert Determination: "Very small" risk; typically <0.05% re-identification probability. [6][14]
    • GDPR/UK GDPR: Re-identification must be "reasonably impossible"; no fixed threshold. ICO guidance: consider attacker motivation, resources, auxiliary data. [55][65]
    • NIST Privacy Framework: Risk-based approach; acceptable risk varies by data sensitivity and use case. [66]

    10. Implementation roadmap

    1. Phase 1 - Data classification: Classify data elements into identifiers (remove entirely), quasi-identifiers (generalize/suppress), and sensitive attributes (apply l-diversity/t-closeness). [35] Document attribute inventory and linkage risks. [55]
    2. Phase 2 - Risk assessment: Identify auxiliary datasets (voter rolls, social media, breaches), calculate baseline re-identification risk (% unique quasi-identifier combinations), set acceptable risk threshold (HIPAA: <0.05%, GDPR: reasonably impossible). [6][55][65]
    3. Phase 3 - Metric selection: Choose target metrics aligned with risk appetite and legal obligations: k=5 (HIPAA Safe Harbor), k=10 (clinical trials), k=25 (location data), l=2-5 (sensitive attributes), t=0.2 (distribution similarity). [6][7][30]
    4. Phase 4 - Tool selection: ARX (GUI + comprehensive privacy models), [10] sdcMicro (R statistical workflows), [20] pycanon (Python pipelines), [11] or custom pandas scripts. [11] GPU acceleration for large datasets (ARX Lightning). [18]
    5. Phase 5 - Anonymization: Apply generalization/suppression using selected algorithm (Mondrian for speed, Incognito for optimality, ARX Flash for balance). [46][47][18] Iterate with utility tests (NCP, discernibility, aggregate query accuracy). [23][48]
    6. Phase 6 - Adversarial evaluation: Linkage tests with known external datasets, ARX disclosure risk scores (prosecutor/journalist/marketer models), [19] anonymeter linkability/singling-out attacks. [25] Document residual risk vs acceptable threshold. [55]
    7. Phase 7 - Layered controls: Combine with differential privacy (DP noise for sensitive queries), [26] contractual obligations (recipient commits not to re-identify), [67] access controls (secure data enclave vs bulk export), [62] audit logging. [55]
    8. Phase 8 - Documentation and monitoring: Risk assessment report, algorithm selection justification, utility loss metrics, periodic re-evaluation (when new auxiliary datasets emerge). [55][65] Maintain inventory of released datasets. [68]

    11. Regulation and governance checklist

    Regulatory definitions hinge on residual risk, not the presence of a specific algorithm. [55] Compliance requires documentation, testing, and ongoing monitoring.

    GDPR/UK GDPR anonymization requirements

    • Irreversibility test: Re-identification must be "reasonably impossible" considering time, cost, technology available. [55] EDPB Opinion 05/2014: k-anonymity alone insufficient. [56]
    • Auxiliary data assessment: Identify external datasets adversary might use (social media, public registers, data brokers). [41][55] Document in Data Protection Impact Assessment (DPIA). [69]
    • Singling out protection: No record should be isolatable via quasi-identifiers (k-anonymity k≥5). [55]
    • Linkability protection: Prevent correlation of records across datasets (change identifiers, suppress rare values). [55]
    • Inference protection: Sensitive attributes should not be deducible from quasi-identifiers (l-diversity l≥2). [2][55]
    • Periodic review: Re-assess when new auxiliary datasets available or re-identification techniques advance. [55][65]

    HIPAA Expert Determination (US healthcare)

    • Qualified statistician certification: Expert with appropriate knowledge and experience must attest re-identification risk is "very small." [6][14] Typically PhD statistician or privacy engineer. [70]
    • Risk quantification: Document re-identification probability (typically <0.05% = 1 in 2000). [14] Use ARX prosecutor model or anonymeter linkability score. [19][25]
    • Methods documentation: Describe generalization/suppression applied, k/l/t values, algorithm used (Mondrian, ARX). [6][46][18]
    • Anticipated recipients: Consider who will receive data and their access to auxiliary datasets. [6][14]
    • De-identification attestation: Written statement from qualified expert documenting methods and risk. [6] Template available from HHS. [71]

    EMA Policy 0070 (EU clinical trials)

    • GDPR Article 4(5) compliance: Data must be "anonymous" (irreversible de-identification). [7][55]
    • Clinical Trial Regulation (EU) 536/2014: Anonymization required before public release of clinical reports. [7]
    • EMA guidance application: k-anonymity k≥10 recommended for clinical trial datasets. [7] Combine with generalization (age bins, geographic regions) and suppression (rare values). [7]
    • Access controls: Even anonymized data should be released via controlled access portal with audit logging. [7][62]

    CCPA/CPRA deidentified data (California)

    • Technical safeguards: Implement controls prohibiting re-identification (k-anonymity, suppression, access restrictions). [72] CPRA §1798.140(o). [72]
    • Business processes: Policies prohibiting re-identification attempts, staff training. [72]
    • Contractual commitments: Recipients must contractually commit not to re-identify. [67][72] Include in data sharing agreements. [67]
    • Public commitment: Publicly commit not to re-identify deidentified data. [72]

    Governance best practices

    • Dataset inventory: Maintain registry of all released de-identified datasets, anonymization methods, risk scores. [68]
    • Version control: Track anonymization parameters (k/l/t values, algorithms, utility metrics) per dataset version. [68]
    • Incident response: Plan for re-identification incidents (data withdrawal, affected individual notification). [55]
    • Consumer opt-outs: Honor opt-out requests when data sharing is optional (CCPA right to opt-out). [72]
    • Third-party audits: Periodic external review of anonymization processes by qualified privacy professionals. [55]

    References

    [1] anonymeter (2024) 'Privacy Risk Evaluation for Synthetic Data', GitHub Repository. Available at: https://github.com/statice/anonymeter (Accessed: 21 January 2026).
    [2] Apple Differential Privacy Team (2017) 'Learning with Privacy at Scale', Apple Machine Learning Journal. Available at: https://machinelearning.apple.com/research/learning-with-privacy-at-scale (Accessed: 21 January 2026).
    [3] Article 29 Data Protection Working Party (2014) 'Opinion 05/2014 on Anonymisation Techniques', European Commission WP216. Available at: https://ec.europa.eu/justice/article-29/documentation/opinion-recommendation/ (Accessed: 21 January 2026).
    [4] ARX Data Anonymization Tool (2024) 'ARX v3.10.0 Release Notes', ARX Project. Available at: https://arx.deidentifier.org/ (Accessed: 21 January 2026).
    [5] Barbaro, M. and Zeller, T. (2006) 'A Face Is Exposed for AOL Searcher No. 4417749', The New York Times. Available at: https://www.nytimes.com/2006/08/09/technology/09aol.html (Accessed: 21 January 2026).
    [6] Bayardo, R.J. and Agrawal, R. (2005) 'Data Privacy through Optimal k-Anonymization', IEEE ICDE. Available at: https://ieeexplore.ieee.org/document/1410127 (Accessed: 21 January 2026).
    [7] California Legislature (2023) 'California Consumer Privacy Act (CCPA) / California Privacy Rights Act (CPRA)', California Civil Code §1798.140(o). Available at: https://oag.ca.gov/privacy/ccpa (Accessed: 21 January 2026).
    [8] de Montjoye, Y-A. et al. (2013) 'Unique in the Crowd: The privacy bounds of human mobility', Scientific Reports. Available at: https://www.nature.com/articles/srep01376 (Accessed: 21 January 2026).
    [9] de Montjoye, Y-A. et al. (2015) 'Unique in the shopping mall: On the reidentifiability of credit card metadata', Science. Available at: https://www.science.org/doi/10.1126/science.1256297 (Accessed: 21 January 2026).
    [10] de Montjoye, Y-A. et al. (2015) 'Unique in the shopping mall: Credit card metadata reidentifiability', Science. Available at: https://www.science.org/doi/10.1126/science.1256297 (Accessed: 21 January 2026).
    [11] Dinur, I. and Nissim, K. (2003) 'Revealing Information while Preserving Privacy', ACM PODS. Available at: https://dl.acm.org/doi/10.1145/773153.773173 (Accessed: 21 January 2026).
    [12] Domingo-Ferrer, J. and Torra, V. (2008) 'A Critique of k-Anonymity and Some of Its Enhancements', International Conference on Availability, Reliability, and Security. Available at: https://ieeexplore.ieee.org/document/4529423 (Accessed: 21 January 2026).
    [13] Domingo-Ferrer, J. and Torra, V. (2003) 'Disclosure Risk Assessment in Statistical Microdata Protection via Advanced Record Linkage', Statistics and Computing. Available at: https://link.springer.com/article/10.1023/A:1025666923033 (Accessed: 21 January 2026).
    [14] Domingo-Ferrer, J. and Torra, V. (2001) 'Disclosure Control Methods and Information Loss for Microdata', Confidentiality, Disclosure, and Data Access. Available at: https://link.springer.com/chapter/10.1007/978-1-4757-3452-4_5 (Accessed: 21 January 2026).
    [15] Dwork, C. and Roth, A. (2014) 'The Algorithmic Foundations of Differential Privacy', Foundations and Trends in Theoretical Computer Science. Available at: https://www.cis.upenn.edu/~aaroth/Papers/privacybook.pdf (Accessed: 21 January 2026).
    [16] Dyrmishi, S. et al. (2023) 'Anonymeter: A Statistical Framework for Measuring Re-identification Risk', arXiv preprint. Available at: https://arxiv.org/abs/2310.15618 (Accessed: 21 January 2026).
    [17] El Emam, K. and Arbuckle, L. (2013) 'Anonymizing Health Data: Case Studies and Methods to Get You Started', O'Reilly Media. Available at: https://www.oreilly.com/library/view/anonymizing-health-data/9781449363062/ (Accessed: 21 January 2026).
    [18] El Emam, K. et al. (2011) 'A Systematic Review of Re-Identification Attacks on Health Data', PLoS ONE. Available at: https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0028071 (Accessed: 21 January 2026).
    [19] Elliot, M. et al. (2016) 'The Anonymisation Decision-making Framework', UK Anonymisation Network. Available at: https://ukanon.net/framework/ (Accessed: 21 January 2026).
    [20] Erlingsson, Ú. et al. (2014) 'RAPPOR: Randomized Aggregatable Privacy-Preserving Ordinal Response', ACM CCS. Available at: https://dl.acm.org/doi/10.1145/2660267.2660348 (Accessed: 21 January 2026).
    [21] European Commission (2017) 'Guidelines on Data Protection Impact Assessment (DPIA)', EC Guidelines WP248. Available at: https://ec.europa.eu/newsroom/article29/items/611236 (Accessed: 21 January 2026).
    [22] European Data Protection Board (2022) 'Guidelines 01/2022 on data subject rights - Right of access', EDPB Guidelines. Available at: https://edpb.europa.eu/our-work-tools/our-documents/guidelines/guidelines-012022-data-subject-rights-right-access_en (Accessed: 21 January 2026).
    [23] European Medicines Agency (2018) 'Policy 0070: Publication of clinical data for medicinal products for human use', EMA Policy Document. Available at: https://ema.europa.eu/en/documents/policy/policy-0070-publication-clinical-data-medicinal-products-human-use_en.pdf (Accessed: 21 January 2026).
    [24] Eurostat (2019) 'Handbook on Statistical Disclosure Control', Eurostat Methodologies. Available at: https://ec.europa.eu/eurostat/ (Accessed: 21 January 2026).
    [25] Federal Trade Commission (2014) 'Data Brokers: A Call for Transparency and Accountability', FTC Report. Available at: https://ftc.gov/reports/data-brokers-call-transparency-accountability-report-federal-trade-commission-may-2014 (Accessed: 21 January 2026).
    [26] Giessing, S. et al. (2009) 'Statistical Disclosure Control in Practice', Joint UNECE/Eurostat Work Session. Available at: https://unece.org/statistics/documents/2009/10/working-documents/statistical-disclosure-control-practice (Accessed: 21 January 2026).
    [27] Golle, P. (2006) 'Revisiting the Uniqueness of Simple Demographics in the US Population', ACM Workshop on Privacy in the Electronic Society. Available at: https://dl.acm.org/doi/10.1145/1179601.1179615 (Accessed: 21 January 2026).
    [28] Gretel.ai (2024) 'Synthetic Data Platform', Gretel.ai. Available at: https://gretel.ai/ (Accessed: 21 January 2026).
    [29] Hunt, T. (2024) 'Have I Been Pwned: Check if your email has been compromised', Have I Been Pwned. Available at: https://haveibeenpwned.com/ (Accessed: 21 January 2026).
    [30] Immuta (2024) 'Automated Data Governance Platform', Immuta Platform. Available at: https://immuta.com/platform/ (Accessed: 21 January 2026).
    [31] Information Governance Alliance (2016) 'Records Management Code of Practice for Health and Social Care', NHS Digital. Available at: https://digital.nhs.uk/data-and-information/looking-after-information/data-security-and-information-governance (Accessed: 21 January 2026).
    [32] LeFevre, K. et al. (2006) 'Mondrian Multidimensional K-Anonymity', IEEE ICDE. Available at: https://ieeexplore.ieee.org/document/1617393 (Accessed: 21 January 2026).
    [33] LeFevre, K. et al. (2005) 'Incognito: Efficient Full-Domain K-Anonymity', ACM SIGMOD. Available at: https://dl.acm.org/doi/10.1145/1066157.1066164 (Accessed: 21 January 2026).
    [34] Li, N. et al. (2007) 't-closeness: Privacy Beyond k-anonymity and l-diversity', IEEE 23rd International Conference on Data Engineering. Available at: https://ieeexplore.ieee.org/document/4221659 (Accessed: 21 January 2026).
    [35] Li, T. and Li, N. (2009) 'On the Tradeoff between Privacy and Utility in Data Publishing', ACM KDD. Available at: https://dl.acm.org/doi/10.1145/1557019.1557079 (Accessed: 21 January 2026).
    [36] Machanavajjhala, A. et al. (2007) 'l-diversity: Privacy Beyond k-anonymity', ACM Transactions on Knowledge Discovery from Data. Available at: https://dl.acm.org/doi/10.1145/1217299.1217302 (Accessed: 21 January 2026).
    [37] Meta Research (2022) 'Privacy-Preserving Measurements for Ads Effectiveness', Meta Research Blog. Available at: https://research.facebook.com/blog/2022/2/ppm-ads-effectiveness/ (Accessed: 21 January 2026).
    [38] Narayanan, A. and Shmatikov, V. (2008) 'Robust De-anonymization of Large Sparse Datasets', IEEE Symposium on Security and Privacy. Available at: https://ieeexplore.ieee.org/document/4531148 (Accessed: 21 January 2026).
    [39] Nguyen, B. and Reddy, S. (2021) 'LinkedIn Data Scraping and GDPR', IEEE Security & Privacy. Available at: https://ieeexplore.ieee.org/document/9340411 (Accessed: 21 January 2026).
    [40] NHS Digital (2024) 'National data opt-out programme', NHS Digital. Available at: https://nhs.uk/your-nhs-data-matters/ (Accessed: 21 January 2026).
    [41] NHS Digital (2024) 'Data Security and Protection Toolkit: Anonymisation Standard', NHS Digital DSPT. Available at: https://dsptoolkit.nhs.uk/ (Accessed: 21 January 2026).
    [42] NIST (2020) 'NIST Privacy Framework Version 1.0', National Institute of Standards and Technology. Available at: https://nist.gov/privacy-framework (Accessed: 21 January 2026).
    [43] OAIC (2024) 'Privacy Act 1988, Section 16B - De-identified Information', Office of the Australian Information Commissioner. Available at: https://oaic.gov.au/privacy/the-privacy-act/ (Accessed: 21 January 2026).
    [44] OAIC (2018) 'De-identification and the Privacy Act', OAIC Guidance. Available at: https://oaic.gov.au/privacy/guidance-and-advice/de-identification-and-the-privacy-act (Accessed: 21 January 2026).
    [45] Ohm, P. (2010) 'Broken Promises of Privacy: Responding to the Surprising Failure of Anonymization', UCLA Law Review. Available at: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=1450006 (Accessed: 21 January 2026).
    [46] Pandurangan, V. (2014) 'On Taxis and Rainbows: Lessons from NYC's improperly anonymized taxi logs', Medium. Available at: https://medium.com/@vijayp/of-taxis-and-rainbows-f6bc289679a1 (Accessed: 21 January 2026).
    [47] Patki, N. et al. (2016) 'The Synthetic Data Vault', IEEE DSAA. Available at: https://ieeexplore.ieee.org/document/7796926 (Accessed: 21 January 2026).
    [48] Prasser, F. and Kohlmayer, F. (2015) 'Putting Statistical Disclosure Control into Practice: The ARX Data Anonymization Tool', Medical Data Privacy Handbook, Springer. Available at: https://link.springer.com/chapter/10.1007/978-3-319-23633-9_5 (Accessed: 21 January 2026).
    [49] Prasser, F. et al. (2020) 'Flexible Data Anonymization Using ARX—Current Status and Challenges Ahead', Software: Practice and Experience. Available at: https://onlinelibrary.wiley.com/doi/10.1002/spe.2812 (Accessed: 21 January 2026).
    [50] Privacy Analytics (2024) '2024 State of Data De-identification Survey', Privacy Analytics Research. Available at: https://privacy-analytics.com/research/ (Accessed: 21 January 2026).
    [51] pycanon (2024) 'k-anonymity and l-diversity for pandas DataFrames', GitHub Repository. Available at: https://github.com/IFCA-Advanced-Computing/pycanon (Accessed: 21 January 2026).
    [52] Rocher, L. et al. (2019) 'Estimating the success of re-identifications in incomplete datasets using generative models', Nature Communications. Available at: https://www.nature.com/articles/s41467-019-10933-3 (Accessed: 21 January 2026).
    [53] Rosenthal, J.S. and Tsang, A.K. (2022) 'The Public Good Uses of Personal Information: Mobility Data, COVID-19, and Privacy', Statistical Science. Available at: https://projecteuclid.org/journals/statistical-science/volume-37/issue-1 (Accessed: 21 January 2026).
    [54] Samarati, P. and Sweeney, L. (1998) 'Protecting Privacy when Disclosing Information: k-Anonymity and Its Enforcement', SRI Computer Science Laboratory Technical Report. Available at: https://dataprivacylab.org/dataprivacy/projects/kanonymity/ (Accessed: 21 January 2026).
    [55] Strava (2023) 'Metro: Aggregated, De-identified Data for Urban Planning', Strava Metro. Available at: https://metro.strava.com/ (Accessed: 21 January 2026).
    [56] Strava Engineering (2023) 'Privacy Zones and Heatmap Updates', Strava Engineering Blog. Available at: https://medium.com/strava-engineering/ (Accessed: 21 January 2026).
    [57] Sweeney, L. (2002) 'k-anonymity: A Model for Protecting Privacy', International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems. Available at: https://dataprivacylab.org/dataprivacy/projects/kanonymity/kanonymity.pdf (Accessed: 21 January 2026).
    [58] Sweeney, L. (1998) 'Datafly: A System for Providing Anonymity in Medical Data', Database Security XI, Chapman & Hall. Available at: https://dataprivacylab.org/dataprivacy/projects/datafly/ (Accessed: 21 January 2026).
    [59] Sweeney, L. (1997) 'Weaving Technology and Policy Together to Maintain Confidentiality', Journal of Law, Medicine & Ethics. Available at: https://onlinelibrary.wiley.com/doi/10.1111/j.1748-720X.1997.tb01885.x (Accessed: 21 January 2026).
    [60] Templ, M. et al. (2015) 'Statistical Disclosure Control for Micro-Data Using the R Package sdcMicro', Journal of Statistical Software. Available at: https://www.jstatsoft.org/article/view/v067i04 (Accessed: 21 January 2026).
    [61] Tockar, A. (2014) 'Riding with the Stars: Passenger Privacy in the NYC Taxicab Dataset', Neustar Research. Available at: https://research.neustar.biz/2014/09/15/riding-with-the-stars-passenger-privacy-in-the-nyc-taxicab-dataset/ (Accessed: 21 January 2026).
    [62] Torra, V. and Navarro-Arribas, G. (2017) 'Data Privacy: Foundations, New Developments and the Big Data Challenge', Studies in Big Data, Springer. Available at: https://link.springer.com/book/10.1007/978-3-319-57358-8 (Accessed: 21 January 2026).
    [63] UK Information Commissioner's Office (2012) 'Anonymisation: managing data protection risk code of practice', ICO. Available at: https://ico.org.uk/media/1061/anonymisation-code.pdf (Accessed: 21 January 2026).
    [64] UK Office for National Statistics (2024) 'Secure Research Service', ONS. Available at: https://ons.gov.uk/aboutus/whatwedo/statistics/requestingstatistics/secureresearchservice (Accessed: 21 January 2026).
    [65] US Census Bureau (2021) 'Disclosure Avoidance for the 2020 Census: An Introduction', US Census Bureau. Available at: https://census.gov/programs-surveys/decennial-census/decade/2020/planning-management/process/disclosure-avoidance.html (Accessed: 21 January 2026).
    [66] US Census Bureau (2024) 'Federal Statistical Research Data Centers', Census Bureau. Available at: https://census.gov/fsrdc (Accessed: 21 January 2026).
    [67] US Department of Health and Human Services (2015) 'Guidance Regarding Methods for De-identification of Protected Health Information', HHS HIPAA Guidance. Available at: https://hhs.gov/hipaa/for-professionals/privacy/special-topics/de-identification/ (Accessed: 21 January 2026).
    [68] US Department of Health and Human Services (2020) 'Guidance on De-Identification Expert Determination Template', HHS HIPAA. Available at: https://hhs.gov/hipaa/ (Accessed: 21 January 2026).
    [69] Wu, F.T. (2013) 'Defining Privacy and Utility in Data Sets', University of Colorado Law Review. Available at: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2074944 (Accessed: 21 January 2026).
    [70] Xu, L. et al. (2019) 'Modeling Tabular data using Conditional GAN', NeurIPS. Available at: https://proceedings.neurips.cc/paper/2019/hash/254ed7d2de3b23ab10936522dd547b78-Abstract.html (Accessed: 21 January 2026).
    [71] Zarras, A. et al. (2014) 'The Dark Alleys of Madison Avenue: Understanding Malicious Advertisements', ACM IMC. Available at: https://dl.acm.org/doi/10.1145/2663716.2663719 (Accessed: 21 January 2026).
    [72] Zhang, J. et al. (2015) 'Private Release of Graph Statistics using Ladder Functions', ACM SIGMOD. Available at: https://dl.acm.org/doi/10.1145/2723372.2737785 (Accessed: 21 January 2026).
