https://github.com/cdcgov/covid_case_privacy_review

Privacy review and statistical disclosure control methods for covid public case data.

Keywords

coronavirus covid covid-19 privacy r sdc srrg

Last synced: 9 months ago · JSON representation

Repository

Privacy review and statistical disclosure control methods for covid public case data.

Basic Info

Host: GitHub
Owner: CDCgov
License: apache-2.0
Language: R
Default Branch: master
Homepage: https://data.cdc.gov/Case-Surveillance/COVID-19-Case-Surveillance-Public-Use-Data/vbim-akqf
Size: 178 MB

Statistics

Stars: 16
Watchers: 3
Forks: 13
Open Issues: 1
Releases: 0

Topics

coronavirus covid covid-19 privacy r sdc srrg

Created over 5 years ago · Last pushed almost 2 years ago

Metadata Files

Readme License

COVID-19 Case Privacy Review

This project contains the procedures used by the Data, Analytics, Visualization Task Force to review and verify that data sets include privacy protection controls and meet the defined k-anonymity and l-diversity thresholds established by the covid response.

Issues, questions, problems, suggestions

If you have any of the above, please submit an issue on github.

Requires

Detailed dependencies are in renv.lock, but main dependencies called out

R version >= 4.0.3 (although I suspect, but have not tested >=3.5)
sdcMicro version >= 5.5.1
arrow version >= 2.0.0
renv >= 0.12.3
optional for profiling
- DataExplorer >=0.8.2
optional for rmarkdown
- Some LaTeX installed (I used MacTeX, but TinyTex should do)
- flextable >=0.6.4

Install & run procedures

renv::restore() to install necessary packages

Then in or symlink a data file to the data/raw folder, update the script with the name of the data file, run from the R folder of this project. This script will generate output to the console and create a privacy report in the reports folder.

R/review_public.R - to review the public12 file
R/reviewpublicgeo.R - to review the public19 file

renv::snapshot() to update renv.lock with any changes made to dependencies

Description

These scripts are part of a two step process for statistical disclosure control implementation. These scripts do not directly perform suppression or modify any data, but are a separate step to validate that the data generation pipeline - implemented in HHSProtect as a Palantir code repository - that generates the data file so that it meets all the privacy protection requirements.

Data files are not contained in this repo and must be fetched independently and placed into the data/raw folder for review. The public use files can be retrieved from Data.CDC.gov: * COVID-19 Case Surveillance Public Use Data * COVID-19 Case Surveillance Public Use Data with Geography

K-anonymity is a technique to release person-specific data such that the ability to link to other information using the quasi-identifier is limited. Each person contained in the released data cannot be distinguished from at least k-1 other persons who share the same quasi identifiers. For example, a dataset is considered 5-anonymous it means that the smallest number of cells that share the same quasi-identifiers is 5.

While k-anonymity reduces risk of reidentification, l-diversity protects privacy by limiting the ability for finding confidential information on an individual. This technique extends k-anonymity to protect confidential information within a release so that confidential values cannot be identified to groups of individuals that share quasi-identifiers. For example, if a dataset is considered to have 2-diversity the smallest number of values shared by quasi identifiers is 2 and there are no unique values within that cell.

Thresholds are established during the privacy review procedures by subject matter experts, statisticians, informaticians, and public health scientists familiar with the case surveillance data. We evaluate the dataset directly as well as potential other data sets these data may be linked. Our goal is to reduce the risk of reidentification while providing useful data for researchers and the public to use to protect America's health.

These thresholds are applied to the quasi-identifiers and confidential variables and do not impact non-confidential variables. Records are never deleted, but individual fields will be changed from their value to NA. For quasi-identifiers, changed fields can be distinguished as they are the only NA values in those fields as missing values have been recoded to the string literal "Missing", but for confidential variables missing values and changed fields are NA to protect confidential values.

We are working to improve these privacy procedures over time and welcome feedback and improvements submitted to this project as issues or pull requests.

Data file characteristics - "COVID-19 Case Surveillance Public Use Data"

Checks its 12 variables for...

Quasi-identifiers (3)

Checked for k=5

age_group
sex
raceethnicitycombined

Confidential attributes (1)

posspecdt

Data file characteristics - "COVID-19 Case Surveillance Public Use Data with Geography"

Checks its 19 variables for...

Quasi-identifiers (8)

Checked for k=1000

res_state
res_county

Checked for k=11

case_month
res_state
res_county
age_group
sex
race
ethnicity
death_yn

Population level and geography specific checks

Checking county population and res_county should never be populated if the county population (by FIPS code) is under 20k.
Checking that sex, race, ethnicity demographic values should never be populated with the county subpopulation by those demographics is under 220 (k*20).
Checking for case county by sex, race, ethnicity in a county is never higher than 50% of the subpopulation by those demographics for the county.
Checking that there is never a situation where only a single county within a state is suppressed, allowing the state to be deduced by process of elimination.
Checks that if resstate is suppressed, then rescounty should also be suppressed.
FIPS code fields are associated with quasi-identifiers so they are checked to make sure that they are always suppressed when corresponding fields are suppressed.
- statefipscode, suppressed when res_state is suppressed
- countyfipscode, suppressed when res_state is suppressed

Confidential attributes

No confidential attributes are in this dataset.

Interpreting output

This script uses the sdcMicro package so much of the output is generated from this package. What we look for is the specific output linked variable violations ( 0 ), k-anon violations ( 0 ), and < 0 > l-diversity violations. If any violations are found then the file is not ready for publication, notify the data team so they can fix the data pipeline.

For the geography checks, there are multiple steps, so output should be reviewed to confirm that all steps have completed without identifying any violations that require correction prior to publishing: * linked variable violations ( 0 ) * k-anon violations ( 0 ) for k=( 1000 ) and quasi-identifiers ( res_state res_county ) * k-anon violations ( 0 ) for k=( 11 ) and quasi-identifiers ( case_month res_state res_county age_group sex race ethnicity death_yn ) * Low population county violations ( 0 ) * Subpopulation county violations, part 1 checking subpopulation for counties ( 0 ) * Subpopulation county violations, part 2, checking to make sure there aren't any res_county that aren't NA but have subpops ( 0 ) * Subpopulation population too small for cases ( 0 ) * County/state complementary violations ( 0 )

For convenience, a portion of this output is stored in reports/log.md to compare results on previous versions of the dataset.

Analysis and Visualization files

utilitysummarypublic.Rmd - RMarkdown analysis to generate PDF report showing utility summary for public dataset
utilitysummarypublic_geo.Rmd - RMarkdown analysis to generate PDF report showing utility summary for public dataset

Helper files

countypopdemoforverify.csv a utility dataset generated from the 2019 census estimates that contain populations counts for each county by sex, race, and ethnicity. Based on a shared utility dataset made by HHSProtect, formatted to be easier to use by the verification script.
profile_data.R that uses the DataExplorer package to create a profile report that is helpful for understanding and debugging the dataset. If you run it, it will output a new profile to the reports folder.
parquet2csv.R that uses the Arrow package to read the dataset in parquet format and output an equivalent CSV file.

Structure

These files and folders are meant to help organize and make it easier for others to understand and contribute.

sh analysis <- analysis and visualization files utility_summary_public.Rmd <- utility summary analysis for public dataset utility_summary_public.pdf <- generated report from utility summary analysis utility_summary_public_geo.Rmd <- utility summary analysis for public geo dataset utility_summary_public_geo.pdf <- generated report from utility summary analysis R <- R scripts functions.R <- functions that are reused in other scripts parquet2csv.R <- converts a parquet file to csv profile_data.R <- creates a data profile report for exploratory data analysis renv.lock <- dependency file (generated by renv::snapshot()) review_public.R <- script to review public12 data file review_public_geo.R <- script to review public19 data file data <- data files used by project raw <- raw files, original, immutable data dump county_pop_demo_for_verify.csv <- census county populations by demo output <- output files readme.md <- Description of project, instructions for how to run reports <- Generated reports and visualizations log.md <- logged results from reviewed files

TODOs

Investigate using dlookr instead of DataExplorer.

References

Lee, B., Dupervil, B., Deputy, N. P., Duck, W., Soroka, S., Bottichio, L., Silk, B., Price, J., Sweeney, P., Fuld, J., Weber, T., & Pollock, D. (2021). Protecting Privacy and Transforming COVID-19 Case Surveillance Datasets for Public Use. ArXiv:2101.05093. http://arxiv.org/abs/2101.05093
https://data.cdc.gov/Case-Surveillance/COVID-19-Case-Surveillance-Public-Use-Data/vbim-akqf
https://data.cdc.gov/Case-Surveillance/COVID-19-Case-Surveillance-Public-Use-Data-with-Ge/n8mc-b4w4
Samarati, P., & Sweeney, L. (1998). Protecting privacy when disclosing information: k-anonymity and its enforcement through generalization and suppression. Tech. rep. SRI-CSL-98-04, SRI Computer Science Laboratory, Palo Alto, CA.
Benschop, Thijs, and Matthew Welch. Statistical Disclosure Control for Microdata: A Practice Guide for SdcMicro SDC Practice Guide Documentation. June 2016, https://sdcpractice.readthedocs.io/en/latest/index.html.
Templ M, Kowarik A, Meindl B (2015). Statistical Disclosure Control for Micro-Data Using the R Package sdcMicro. Journal of Statistical Software, 67(4), 136. doi: 10.18637/jss.v067.i04
Templ, M., Meindl, B., & Kowarik, A. (2020). sdcMicro: Statistical Disclosure Control Methods for Anonymization of Data and Risk Estimation (5.5.1) [Computer software]. https://CRAN.R-project.org/package=sdcMicro
Ashwin Machanavajjhala, Daniel Kifer, Johannes Gehrke, and Muthuramakrishnan Venkitasubramaniam. 2007. L-diversity: Privacy beyond k-anonymity. ACM Trans. Knowl. Discov. Data 1, 1 (March 2007), 3es. DOI: https://doi.org/10.1145/1217299.1217302
Delcher, P. C., Edwards, K. T., Stover, J. A., Newman, L. M., Groseclose, S. L., & Rajnik, D. M. (2008). Data suppression strategies used during surveillance data release by sexually transmitted disease prevention programs. Journal of Public Health Management and Practice, 14(2), E1-E8.
Erika McCallister, Tim Grance, and Karen Scanfone. Guide to protecting the confidentiality of personally identifiable information (PII), April 2010. NIST Special Publication 800-122. https://csrc.nist.gov/publications/detail/sp/800-122/final
El Emam, K., & Dankar, F. K. (2008). Protecting privacy using k-anonymity. Journal of the American Medical Informatics Association, 15(5), 627-637. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2528029/
Chertov, O., & Pilipyuk, A. (2009, October). Statistical disclosure control methods for microdata. In International Symposium on Computing, Communication and Control (pp. 338-342). https://www.researchgate.net/publication/228827997_Statistical_Disclosure_Control_Methods_for_Microdata
Committee on Strategies for Responsible Sharing of Clinical Trial Data; Board on Health Sciences Policy; Institute of Medicine, & Medicine, I. of. (2015). Concepts and Methods for De-identifying Clinical Trial Data. National Academies Press (US). https://www.ncbi.nlm.nih.gov/books/NBK285994/
Angiuli, O., & Waldo, J. (2016). Statistical Tradeoffs between Generalization and Suppression in the De-identification of Large-Scale Data Sets. 2016 IEEE 40th Annual Computer Software and Applications Conference (COMPSAC), 2, 589593. https://doi.org/10.1109/COMPSAC.2016.198
Monkman, M. (n.d.). Chapter 8 Anonymity and Confidentiality | Data Science with R: A Resource Compendium. Retrieved May 27, 2020, from https://bookdown.org/martin_monkman/DataScienceResources_book/anonymity-and-confidentiality.html
Report on Statistical Disclosure Limitation Methodology, https://nces.ed.gov/FCSM/pdf/spwp22.pdf
Klein, R. J., Proctor, S. E., Boudreault, M. A., & Turczyn, K. M. (2002). Healthy People 2010 criteria for data suppression. Statistical notes, 24, 1-12. https://www.cdc.gov/nchs/data/statnt/statnt24.pdf
Centers for Disease Control and Prevention, Agency for Toxic Substances and Disease Registry. CDC/ ATSDR Policy on Releasing and Sharing Data. Atlanta, GA: Centers for Disease Control and Prevention, US Dept of Health and Human Services; 2005. https://www.cdc.gov/maso/policy/releasingdata.pdf
Clearance of Information Products Disseminated Outside CDC for Public Use, https://www.cdc.gov/os/policies/docs/CDC-GA-2005-06_Clearance_of_Information_Products_Disseminated_Outside_for_Public_Use.pdf
Data, S. (2005). CDC-CSTE Intergovernmental Data Release Guidelines Working Group (DRGWG) Report: CDC-ATSDR Data Release Guidelines and Procedures for Re-release of State-Provided Data January 24, 2005. https://stacks.cdc.gov/view/cdc/7563
Simon, G., Shortreed, S. M., Coley, R. Y., & Iturralde, E. M. (2020). Toolkit for Assessing and Mitigating Risk of Re-identification when Sharing Data Derived from Health Records. https://www.sentinelinitiative.org/sites/default/files/Methods/Sentinel_Report_Toolkit-Assessing-Mitigating-Risk-Re-Identification-Sharing-Data-Derived-from-Health-Records.pdf

Public Domain Standard Notice

This repository constitutes a work of the United States Government and is not subject to domestic copyright protection under 17 USC 105. This repository is in the public domain within the United States, and copyright and related rights in the work worldwide are waived through the CC0 1.0 Universal public domain dedication. All contributions to this repository will be released under the CC0 dedication. By submitting a pull request you are agreeing to comply with this waiver of copyright interest.

License Standard Notice

The repository utilizes code licensed under the terms of the Apache Software License and therefore is licensed under ASL v2 or later.

This source code in this repository is free: you can redistribute it and/or modify it under the terms of the Apache Software License version 2, or (at your option) any later version.

This source code in this repository is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the Apache Software License for more details.

You should have received a copy of the Apache Software License along with this program. If not, see http://www.apache.org/licenses/LICENSE-2.0.html

The source code forked from other open source projects will inherit its license.

Privacy Standard Notice

This repository contains only non-sensitive, publicly available data and information. All material and community participation is covered by the Disclaimer and Code of Conduct. For more information about CDC's privacy policy, please visit http://www.cdc.gov/other/privacy.html.

Contributing Standard Notice

Anyone is encouraged to contribute to the repository by forking and submitting a pull request. (If you are new to GitHub, you might start with a basic tutorial.) By contributing to this project, you grant a world-wide, royalty-free, perpetual, irrevocable, non-exclusive, transferable license to all users under the terms of the Apache Software License v2 or later.

All comments, messages, pull requests, and other submissions received through CDC including this GitHub page may be subject to applicable federal law, including but not limited to the Federal Records Act, and may be archived. Learn more at http://www.cdc.gov/other/privacy.html.

Records Management Standard Notice

This repository is not a source of government records, but is a copy to increase collaboration and collaborative potential. All government records will be published through the CDC web site.

Additional Standard Notices

Please refer to CDC's Template Repository for more information about contributing to this repository, public domain notices and disclaimers, and code of conduct.

Owner

Name: Centers for Disease Control and Prevention
Login: CDCgov
Kind: organization
Email: data@cdc.gov
Location: Atlanta, GA

Website: http://open.cdc.gov/
Twitter: CDCgov
Repositories: 114
Profile: https://github.com/CDCgov

CDC's collaborative software projects to protect America from health, safety, and security threats, both foreign and in the U.S.

GitHub Events

Total

Fork event: 3

Last Year

Fork event: 3

Committers

Last synced: about 1 year ago

All Time

Total Commits: 166
Total Committers: 9
Avg Commits per committer: 18.444
Development Distribution Score (DDS): 0.765

Past Year

Commits: 4
Committers: 1
Avg Commits per committer: 4.0
Development Distribution Score (DDS): 0.0

Top Committers

Name	Email	Commits
tuyendo2021	d**1@c**v	39
Brian Lee	b**e@c**v	35
Crosby	r**4@c**v	34
Seth Billiau	s**u@g**m	23
Terry Wang	d**2@c**v	13
manandappa	d**4@c**v	8
Deirdre Pratt	8****7	7
Nimesh-Patel	c**2@c**v	6
Kariuki	w**7@c**v	1

Committer Domains (Top 20 + Academic)

cdc.gov: 7

Issues and Pull Requests

Last synced: about 1 year ago

All Time

Total issues: 1
Total pull requests: 3
Average time to close issues: N/A
Average time to close pull requests: about 7 hours
Total issue authors: 1
Total pull request authors: 1
Average comments per issue: 5.0
Average comments per pull request: 0.33
Merged pull requests: 3
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 0
Pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Issue authors: 0
Pull request authors: 0
Average comments per issue: 0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

View more stats

Top Authors

Issue Authors

leebrian (1)

Pull Request Authors

leebrian (3)

Top Labels

Issue Labels

bug (1)

Pull Request Labels

enhancement (1) bug (1)

https://github.com/cdcgov/covid_case_privacy_review

Science Score: 49.0%

Keywords

Repository

Basic Info

Statistics

Topics

Metadata Files

readme.md

COVID-19 Case Privacy Review

Issues, questions, problems, suggestions

Requires

Install & run procedures

Description

Data file characteristics - "COVID-19 Case Surveillance Public Use Data"

Quasi-identifiers (3)

Confidential attributes (1)

Data file characteristics - "COVID-19 Case Surveillance Public Use Data with Geography"

Quasi-identifiers (8)

Population level and geography specific checks

Confidential attributes

Interpreting output

Analysis and Visualization files

Helper files

Structure

TODOs

References

Public Domain Standard Notice

License Standard Notice

Privacy Standard Notice

Contributing Standard Notice

Records Management Standard Notice

Additional Standard Notices

Owner

GitHub Events

Total

Last Year

Committers

All Time

Past Year

Top Committers

Committer Domains (Top 20 + Academic)

Issues and Pull Requests

All Time

Past Year

Top Authors

Issue Authors

Pull Request Authors

Top Labels

Issue Labels

Pull Request Labels