greasypop-co
Geographically REAlistic SYnthetic POPulation using Combinatorial Optimization
Repository
Basic Info
- Host: GitHub
- Owner: CDDEP-DC
- License: agpl-3.0
- Language: Julia
- Default Branch: main
- Size: 19.4 MB
Statistics
- Stars: 2
- Watchers: 5
- Forks: 1
- Open Issues: 0
- Releases: 2
Metadata Files
README.md
Geographically REAlistic SYnthetic POPulation using Combinatorial Optimization

- generates a synthetic population (people, households, schools, workplaces) from US census data for a specified region, at census block group (CBG) resolution
- generates a synthetic contact network of regular household, school, and work contacts
- this version groups people into workplaces by industry (for previous versions, see Releases)
citation: Tulchinsky, A. Y., Haghpanah, F., Hamilton, A., Kipshidze, N., & Klein, E. Y. (2024). Generating geographically and economically realistic large-scale synthetic contact networks: A general method using publicly available data (arXiv:2406.14698). arXiv. http://arxiv.org/abs/2406.14698
files in this project
geo.json - specify the region for which to generate the synth pop
config.json - misc other settings
download_data.R - script to download data from census API
pull_datasets.R - data download functions
census.py - processes input data; converts census and PUMS to a common format; extracts school and workplace data needed for synthesis
CO.jl - performs combinatorial optimization: selects households for each CBG from microdata samples (PUMS)
synthpop.jl - script that calls functions for population synthesis from the files below
households.jl - fills each household from CO.jl with people generated from PUMS data; also creates group quarters (GQ)
schools.jl - reads school data prepared by census.py and assigns students created in households.jl
workplaces.jl - creates workplaces based on data from census.py and assigns workers created in households.jl; also assigns teachers to schools and staff to GQs
netw.jl - generates synthetic contact network
utils.jl, fileutils.jl - various utility functions
export_synthpop.jl - exports synth pop to csv
export_network.jl - exports contact network to mtx
how to use
1. for automated data download:
(downloads 2019 ACS, PUMS, and LODES data)
(see below for manual download instructions)
edit geo.json
geos: list of areas to include in the synth pop
only CBGs starting with these strings will be included
(2 chars for state FIPS, 5 chars for county, more for sub-county; any combination is ok)
commute_states: list of state FIPS that are plausibly within commute distance of the synth pop
use_pums: list of state FIPS whose microdata should be used when generating cbgs by urban/rural proportion
(usually ok to leave this out; see methods paper for details)
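The prefix rule above can be sketched in a few lines. This is only an illustration: it assumes geo.json is a JSON object with the keys named above and that the codes are stored as strings. The FIPS values (Washington, DC plus Montgomery County, MD) and the two CBG GEOIDs are example inputs, and `in_synth_area` is a hypothetical helper, not a function from this repo.

```python
import json

# Hypothetical example values: all of Washington, DC (state FIPS "11")
# plus Montgomery County, MD (county FIPS "24031").
geo = {
    "geos": ["11", "24031"],
    "commute_states": ["11", "24", "51"],  # DC, Maryland, Virginia
    # "use_pums" is described above as usually optional, so it is omitted.
}

def in_synth_area(cbg_geoid, prefixes):
    """Return True if a CBG GEOID starts with any of the configured prefixes."""
    return any(cbg_geoid.startswith(p) for p in prefixes)

# Show the structure that would be saved as geo.json (assumed layout).
print(json.dumps(geo, indent=2))

# Made-up 12-digit GEOIDs: state (2) + county (3) + tract (6) + block group (1).
print(in_synth_area("240317001001", geo["geos"]))  # True: matches "24031"
print(in_synth_area("240338001001", geo["geos"]))  # False: county 24033 not listed
```

Mixing a 2-character state prefix with a 5-character county prefix works because each CBG is tested against every prefix independently.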
install R and required packages:
R 4.4.1, rjson, here, tidyverse, data.table, censusapi, tidycensus, lehdr, usmap, geojsonio
obtain a census API key https://api.census.gov/data/key_signup.html
paste it into download_data.R
key = "YOUR_CENSUS_API_KEY"
run script:
Rscript download_data.R
some data must still be downloaded manually:
into folder "geo"
from https://www.census.gov/programs-surveys/geography/guidance/geo-areas/pumas.html
(2010 file is already in this repo)
census tract to PUMA relationship file, *Census_Tract_to*PUMA*.*
(census boundaries were changed in 2020; choose the year corresponding to ACS year)
from geocorr https://mcdc.missouri.edu/applications/geocorr2018.html (if using < 2020 ACS)
or https://mcdc.missouri.edu/applications/geocorr2022.html (if using >= 2020 ACS)
(geocorr2018 data for all US states is already in this repo)
puma to county, rename to *puma_to_county*.*
puma to cbsa (latest), rename to *puma_to_cbsa*.*
puma to urban-rural portion, rename to *puma_urban_rural*.*
cbg to cbsa (latest), rename to *cbg_to_cbsa*.*
cbg to urban-rural portion, rename to *cbg_urban_rural*.*
into folder "work"
employer size data from https://www.census.gov/programs-surveys/cbp/data/datasets.html
(2016 complete county file is already in this repo; more complete than later data)
cbp16co.zip
into folder "school"
from https://nces.ed.gov/programs/edge/Geographic/SchoolLocations
school locations: EDGE_GEOCODE_PUBLICSCH_*.xlsx
GIS data: folder Shapefile_SCH
from https://nces.ed.gov/ccd/files.asp
(choose "Nonfiscal" and Level = "School" from the dropdown options)
info about grades offered: "Directory" file ccd_sch_029*.csv or .zip
enrollment data: "Membership" file ccd_sch_052*.csv or .zip
number of teachers: "Staff" file ccd_sch_059*.csv or .zip
2. (optional) edit config.json
inc_adj: current year ADJINC from PUMS "Data Dictionary" at https://www.census.gov/programs-surveys/acs/microdata/documentation.html
inc_cats: arbitrary labels for income categories
inc_cols: corresponding sets of columns from ACS table B19001
income_associativity_coefficient: SBM associativity between income groups when generating workplace networks
school_associativity_coefficient: SBM associativity between school grades when generating school networks
inst_res_per_worker: # of institutional group quarters residents per staff member
noninst_res_per_worker: # of non-institutional group quarters residents per staff member
workplace_K: mean degree for workplace networks (mean # of regular work contacts)
school_K: mean degree for school networks (mean # of regular school contacts)
gq_K: mean degree for group quarters networks (mean # of contacts within group quarters)
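To make the "mean degree" settings concrete: in a group of n people with mean degree K, there are n*K/2 contacts, because each contact counts toward two people's degrees. The sketch below draws that many random pairs. It is not the SBM procedure this project uses; `random_contacts` and `mean_degree` are hypothetical names for illustration only.

```python
import random
from itertools import combinations

def random_contacts(members, k, seed=0):
    """Pick random pairs so that the group's mean degree is close to k."""
    possible = list(combinations(members, 2))
    # Each edge adds 1 to the degree of two people, so edges = n * k / 2,
    # capped at the number of distinct pairs available in the group.
    target = min(len(possible), round(len(members) * k / 2))
    return random.Random(seed).sample(possible, target)

def mean_degree(members, edges):
    """Mean number of contacts per person: 2 * edges / people."""
    return 2 * len(edges) / len(members) if members else 0.0

workers = list(range(20))          # a hypothetical 20-person workplace
edges = random_contacts(workers, k=4)
print(len(edges), mean_degree(workers, edges))  # 40 4.0
```

In a group smaller than K + 1 people the cap applies, so the achieved mean degree is lower than the configured value.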
3. install python and julia libraries:
python 3.9.16, pandas 1.5.3, numpy 1.24.3, geopandas 0.12.2, shapely 2.0.1, openpyxl 3.0.10
julia 1.9.0, CSV v0.10.10, DataFrames v1.5.0, Graphs v1.8.0, InlineStrings v1.4.0, JSON v0.21.4, MatrixMarket v0.4.0, StatsBase v0.33.21, ProportionalFitting v0.3.0
4. run scripts:
python census.py
julia -p auto CO.jl
(searches for optimal combination of samples to match census data, takes a while)
(uses multiple local processors; "-p auto" uses all available cores)
julia synthpop.jl
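If you prefer to launch the three steps from one place, a small wrapper along these lines may help. It assumes the working directory is the repo root and that `python` and `julia` are on the PATH. `run_pipeline` is a hypothetical helper, and it defaults to a dry run so nothing is executed by accident.

```python
import subprocess

# Commands copied from the steps above, in the order they are listed.
PIPELINE = [
    ["python", "census.py"],
    ["julia", "-p", "auto", "CO.jl"],
    ["julia", "synthpop.jl"],
]

def run_pipeline(commands=PIPELINE, dry_run=True):
    """Print each step and, unless dry_run is set, execute it in order."""
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            # check=True raises CalledProcessError, so a failed step
            # stops the pipeline before later steps run.
            subprocess.run(cmd, check=True)

run_pipeline()  # dry run: prints the three commands without executing them
```

Call `run_pipeline(dry_run=False)` to actually run the steps; stopping at the first failure matters because each script depends on the output of the one before it.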
5. (optional) export population and/or network to csv
if continuing in julia, the population and contact network are serialized in folder "jlse"
otherwise, run export script(s):
julia export_synthpop.jl
julia export_network.jl
exports appear in folder "pop_export"
network is exported as a sparse matrix in Matrix Market native exchange format https://math.nist.gov/MatrixMarket/formats.html#MMformat
The file adjmatkeys maps the indices of the contact matrix to the people in people.csv. NOTE: the indices in the .mtx files begin at 1. If you read the matrix into Julia (or R), the indices stay 1-based and everything works as expected. If you read it into Python using scipy.io.mmread, 1 is automatically subtracted from every index, making the matrix 0-indexed. In adjmatkeys, use the column (indexone or indexzero) that matches how your matrix ends up being indexed. (In the older version, subtract 1 from the "index" column if your matrix becomes 0-indexed.)
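The off-by-one can be seen in a standard-library sketch that parses a tiny coordinate-format file by hand. The .mtx text and the person ids are made up, and the dict is only a stand-in for the indexzero column of adjmatkeys (the real file layout is not assumed here). With scipy.io.mmread the shift happens for you, as described above.

```python
import io

# A tiny hypothetical .mtx file: 3 people, 2 contacts, 1-based indices.
MTX_TEXT = """%%MatrixMarket matrix coordinate pattern symmetric
% hypothetical example, not real output
3 3 2
2 1
3 2
"""

def read_edges_zero_based(handle):
    """Parse coordinate entries and shift them from 1-based to 0-based."""
    lines = (ln.strip() for ln in handle)
    # Drop blank lines, the %% header, and % comment lines.
    body = [ln for ln in lines if ln and not ln.startswith("%")]
    rows, cols, nnz = (int(x) for x in body[0].split()[:3])
    edges = [tuple(int(x) - 1 for x in ln.split()[:2]) for ln in body[1:]]
    if len(edges) != nnz:
        raise ValueError("entry count does not match the size line")
    return rows, edges

# Hypothetical stand-in for adjmatkeys: indexzero -> person id.
keys_by_indexzero = {0: "p_a", 1: "p_b", 2: "p_c"}

n, edges = read_edges_zero_based(io.StringIO(MTX_TEXT))
contacts = [(keys_by_indexzero[i], keys_by_indexzero[j]) for i, j in edges]
print(contacts)  # [('p_b', 'p_a'), ('p_c', 'p_b')]
```

Looking up the same pairs without subtracting 1 would need the indexone column instead; mixing the two silently shifts every contact to the wrong person.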
Keep in mind that this is not a complete contact network for a population; it only describes contacts within households, group quarters, schools, and workplaces. You will probably need to generate other types of contacts depending on what you're using this for. The file adjoutworkers lists people who work outside of the synthesized area; they have jobs but are not part of any workplace network. The file adjdummykeys lists people who live outside but work within the synthesized area; they belong to a workplace network but are not part of any household.
manual data download
note: currently only works with data from 2010 - 2019 (format changed in 2020)
into folder "census"
create one sub-folder for each geographic area whose census data you will download; sub-folder names don't matter
(if you're only using data from one US state, make one sub-folder for it)
into each sub-folder, place the following data tables (from data.census.gov)
(ACS* = ACS 5yr survey, census block group (CBG) level, from year ####)
(DEC* = decennial census tables from preceding census, having same cbg boundaries)
ACSDT5Y####.B01001-Data.csv
ACSDT5Y####.B09018-Data.csv
ACSDT5Y####.B09019-Data.csv
ACSDT5Y####.B09020-Data.csv
ACSDT5Y####.B09021-Data.csv
ACSDT5Y####.B11004-Data.csv
ACSDT5Y####.B11012-Data.csv
ACSDT5Y####.B11016-Data.csv
ACSDT5Y####.B19001-Data.csv
ACSDT5Y####.B22010-Data.csv
ACSDT5Y####.B23009-Data.csv
ACSDT5Y####.B23025-Data.csv
ACSDT5Y####.B25006-Data.csv
ACSDT5Y####.B11001H-Data.csv
ACSDT5Y####.B11001I-Data.csv
ACSDT5Y####.C24010-Data.csv
ACSDT5Y####.C24030-Data.csv
DECENNIALSF1####.P43-Data.csv
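The file names above follow a fixed pattern, so a checklist for a given ACS year can be generated. This sketch assumes the "preceding census" for any ACS year from 2010 to 2019 is the 2010 decennial census. `expected_census_files` is a hypothetical helper, not part of this repo.

```python
# Table IDs copied from the list above.
ACS_TABLES = [
    "B01001", "B09018", "B09019", "B09020", "B09021", "B11004", "B11012",
    "B11016", "B19001", "B22010", "B23009", "B23025", "B25006",
    "B11001H", "B11001I", "C24010", "C24030",
]

def expected_census_files(acs_year):
    """List the file names expected in each census sub-folder."""
    if not 2010 <= acs_year <= 2019:
        # The manual workflow is documented for 2010 - 2019 data only.
        raise ValueError("only ACS years 2010-2019 are supported")
    # Assumption: the decennial census sharing CBG boundaries is the one
    # at the start of the decade (2010 for every supported year).
    decennial_year = acs_year - acs_year % 10
    names = [f"ACSDT5Y{acs_year}.{table}-Data.csv" for table in ACS_TABLES]
    names.append(f"DECENNIALSF1{decennial_year}.P43-Data.csv")
    return names

files = expected_census_files(2019)
print(len(files))   # 18
print(files[0])     # ACSDT5Y2019.B01001-Data.csv
print(files[-1])    # DECENNIALSF12010.P43-Data.csv
```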
into folder "pums"
PUMS data for the same 5-yr period as ACS
from https://www2.census.gov/programs-surveys/acs/data/pums/
psam_h??.* and psam_p??.*
for each state you want to draw samples from
(these are provided inside zip files named csv_h??.zip and csv_p??.zip)
into folder "geo"
from https://www.census.gov/programs-surveys/geography/guidance/geo-areas/pumas.html
census tract to PUMA relationship file, *Census_Tract_to*PUMA*.*
(census boundaries were changed in 2020; choose the year corresponding to ACS year)
from geocorr https://mcdc.missouri.edu/applications/geocorr2018.html (if using < 2020 ACS)
or https://mcdc.missouri.edu/applications/geocorr2022.html (if using >= 2020 ACS)
puma to county, rename to *puma_to_county*.*
puma to cbsa (latest), rename to *puma_to_cbsa*.*
puma to urban-rural portion, rename to *puma_urban_rural*.*
cbg to cbsa (latest), rename to *cbg_to_cbsa*.*
cbg to urban-rural portion, rename to *cbg_urban_rural*.*
cbg lat-long coords from https://www2.census.gov/geo/tiger/TIGER####/BG/ where #### is year
tl####_??_bg.zip where ?? is the FIPS code for each state in the synth area
into folder "work"
origin-destination work commute data from https://lehd.ces.census.gov/data/
use the version that has the same boundaries as the ACS data (v7 for < 2020; v8 for >= 2020)
use JT01, "primary" jobs (because JT00 counts 2+ jobs for the same individual)
main file for every state in the synth area, named *od_main_JT01*.csv.gz
aux file for every state in the synth area, named *od_aux_JT01*.csv.gz
if many people from your synth area commute to other states, also get the *aux* file for those states
workplace area characteristics (WAC) data from same site
one file for each state in the synth area, named *wac_S000_JT01*.csv.gz
employer size data from https://www.census.gov/programs-surveys/cbp/data/datasets.html
2016 complete county file (more complete than later data)
cbp16co.zip
into folder "school"
from https://nces.ed.gov/programs/edge/Geographic/SchoolLocations
school locations: EDGE_GEOCODE_PUBLICSCH_*.xlsx
GIS data: folder Shapefile_SCH
from https://nces.ed.gov/ccd/files.asp
(choose "Nonfiscal" and Level = "School" from the dropdown options)
info about grades offered: "Directory" file ccd_sch_029*.csv or .zip
enrollment data: "Membership" file ccd_sch_052*.csv or .zip
number of teachers: "Staff" file ccd_sch_059*.csv or .zip
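Before running the scripts, it may be worth checking that the manually downloaded files are in place. The sketch below globs for a subset of the name patterns listed above (the census, pums, TIGER, and Shapefile_SCH inputs are left out for brevity). `missing_inputs` is a hypothetical helper, and the folders are assumed to sit directly under the repo root.

```python
import tempfile
from pathlib import Path

# Folder -> file name patterns, copied from the instructions above.
REQUIRED = {
    "geo": ["*Census_Tract_to*PUMA*.*", "*puma_to_county*.*", "*puma_to_cbsa*.*",
            "*puma_urban_rural*.*", "*cbg_to_cbsa*.*", "*cbg_urban_rural*.*"],
    "work": ["*od_main_JT01*.csv.gz", "*od_aux_JT01*.csv.gz",
             "*wac_S000_JT01*.csv.gz", "cbp16co.zip"],
    "school": ["EDGE_GEOCODE_PUBLICSCH_*.xlsx", "ccd_sch_029*",
               "ccd_sch_052*", "ccd_sch_059*"],
}

def missing_inputs(root):
    """Return 'folder/pattern' for every pattern with no matching file."""
    root = Path(root)
    missing = []
    for folder, patterns in REQUIRED.items():
        for pattern in patterns:
            # glob() yields nothing if the folder itself does not exist.
            if not any((root / folder).glob(pattern)):
                missing.append(f"{folder}/{pattern}")
    return missing

# Demo in a throwaway directory containing a single input file.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "geo").mkdir()
    (Path(tmp) / "geo" / "puma_to_county.csv").touch()
    still_missing = missing_inputs(tmp)
    print("geo/*puma_to_county*.*" in still_missing)  # False
    print(len(still_missing))                         # 13
```

In practice you would call `missing_inputs(".")` from the repo root and download whatever it reports.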
Owner
- Name: CDDEP-DC
- Login: CDDEP-DC
- Kind: organization
- Repositories: 1
- Profile: https://github.com/CDDEP-DC
Citation (CITATION.cff)
# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!
cff-version: 1.2.0
title: GREASYPOP-CO
message: >-
  If you use this software, please cite it using the
  metadata from this file.
type: software
authors:
  - family-names: Tulchinsky
    given-names: Alexander Y.
    email: tulchinsky@onehealthtrust.org
    affiliation: One Health Trust
  - given-names: Fardad
    family-names: Haghpanah
    affiliation: One Health Trust
  - given-names: Alisa
    family-names: Hamilton
    affiliation: Johns Hopkins University
  - given-names: Nodar
    family-names: Kipshidze
    affiliation: One Health Trust
  - given-names: Eili Y.
    family-names: Klein
    affiliation: 'One Health Trust, Johns Hopkins School of Medicine'
identifiers:
  - type: url
    value: 'https://arxiv.org/abs/2406.14698'
repository-code: 'https://github.com/CDDEP-DC/GREASYPOP-CO'
license: AGPL-3.0
preferred-citation:
  type: article
  authors:
    - family-names: Tulchinsky
      given-names: Alexander Y.
      email: tulchinsky@onehealthtrust.org
      affiliation: One Health Trust
    - given-names: Fardad
      family-names: Haghpanah
      affiliation: One Health Trust
    - given-names: Alisa
      family-names: Hamilton
      affiliation: Johns Hopkins University
    - given-names: Nodar
      family-names: Kipshidze
      affiliation: One Health Trust
    - given-names: Eili Y.
      family-names: Klein
      affiliation: 'One Health Trust, Johns Hopkins School of Medicine'
  journal: ArXiv
  title: 'Generating geographically and economically realistic large-scale synthetic contact networks: A general method using publicly available data'
  year: 2024
  url: 'https://arxiv.org/abs/2406.14698'
  doi: 10.48550/arXiv.2406.14698
  identifiers:
    - type: other
      value: 'arXiv:2406.14698'
      description: 'Archive ID'
GitHub Events
Total
- Watch event: 2
- Fork event: 1
Last Year
- Watch event: 2
- Fork event: 1