greasypop-co

Geographically REAlistic SYnthetic POPulation using Combinatorial Optimization

https://github.com/cddep-dc/greasypop-co

Repository

Basic Info
  • Host: GitHub
  • Owner: CDDEP-DC
  • License: agpl-3.0
  • Language: Julia
  • Default Branch: main
  • Size: 19.4 MB
Statistics
  • Stars: 2
  • Watchers: 5
  • Forks: 1
  • Open Issues: 0
  • Releases: 2

README.md

  • generates a synthetic population (people, households, schools, workplaces) from US census data for a specified region, at census block group (CBG) resolution
  • generates a synthetic contact network of regular household, school, and work contacts
  • this version groups people into workplaces by industry (for previous versions, see Releases)

citation: Tulchinsky, A. Y., Haghpanah, F., Hamilton, A., Kipshidze, N., & Klein, E. Y. (2024). Generating geographically and economically realistic large-scale synthetic contact networks: A general method using publicly available data (arXiv:2406.14698). arXiv. http://arxiv.org/abs/2406.14698

files in this project

geo.json - specify the region for which to generate the synth pop
config.json - misc other settings
download_data.R - script to download data from census API
pull_datasets.R - data download functions
census.py - processes input data; converts census and PUMS to a common format; extracts school and workplace data needed for synthesis
CO.jl - performs combinatorial optimization: selects households for each CBG from microdata samples (PUMS)
synthpop.jl - script that calls functions for population synthesis from the files below
households.jl - fills each household from CO.jl with people generated from PUMS data; also creates group quarters (GQ)
schools.jl - reads school data prepared by census.py and assigns students created in households.jl
workplaces.jl - creates workplaces based on data from census.py and assigns workers created in households.jl; also assigns teachers to schools and staff to GQs
netw.jl - generates synthetic contact network
utils.jl, fileutils.jl - various utility functions
export_synthpop.jl - exports synth pop to csv
export_network.jl - exports contact network to mtx
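
To give a feel for what "selects households for each CBG from microdata samples" means, here is a toy sketch of a combinatorial search. It is a simplified hill-climbing stand-in written for illustration, not the code in CO.jl; the function names, the two-feature household records, and the error measure are all assumptions.

```python
import random

def total_error(selection, samples, targets):
    """Sum of absolute differences between selected totals and target totals."""
    totals = [sum(samples[i][k] for i in selection) for k in range(len(targets))]
    return sum(abs(t - g) for t, g in zip(totals, targets))

def select_households(samples, targets, n_households, n_iter=2000, seed=0):
    """Pick n_households samples (with replacement) whose totals approach targets.

    Hypothetical hill climbing: swap one selected household at a time and keep
    the swap only if the error does not get worse.
    """
    rng = random.Random(seed)
    selection = [rng.randrange(len(samples)) for _ in range(n_households)]
    best = total_error(selection, samples, targets)
    for _ in range(n_iter):
        if best == 0:
            break
        pos = rng.randrange(n_households)
        old = selection[pos]
        selection[pos] = rng.randrange(len(samples))
        err = total_error(selection, samples, targets)
        if err <= best:
            best = err            # keep the swap
        else:
            selection[pos] = old  # undo the swap
    return selection, best

# Made-up microdata: (people in household, children in household)
samples = [(1, 0), (2, 0), (3, 1), (4, 2), (5, 3)]
# Made-up targets for one CBG: 30 people and 8 children in 10 households
selection, err = select_households(samples, (30, 8), n_households=10)
print(selection, err)
```

The real optimization matches many more census tables at once; see the methods paper for the actual objective and search procedure.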

how to use

1. for automated data download:

(downloads 2019 ACS, PUMS, and LODES data)

(see below for manual download instructions)

edit geo.json

geos: list of areas to include in the synth pop
only CBGs starting with these strings will be included
(two chars for state FIPS, 5 chars for county, more for sub-county; any combination is ok)

commute_states: list of state FIPS that are plausibly within commute distance of the synth pop

use_pums: list of state FIPS whose microdata should be used when generating cbgs by urban/rural proportion
(usually ok to leave this out; see methods paper for details)
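
The prefix rule for geos can be illustrated with a short sketch. The key names come from the description above; the exact JSON layout (lists of strings) and the FIPS values used here (Baltimore City, DC, and neighboring states) are assumptions chosen only for illustration.

```python
import json

# Hypothetical geo.json contents: key names from this README, values illustrative.
geo = {
    "geos": ["24510", "11"],               # one county (5 chars) plus one state (2 chars)
    "commute_states": ["24", "11", "51"],  # states plausibly within commute distance
}

def cbg_included(cbg_geoid, prefixes):
    """True if the 12-digit CBG code starts with any of the listed prefixes."""
    return any(cbg_geoid.startswith(p) for p in prefixes)

print(json.dumps(geo, indent=2))
print(cbg_included("245100101001", geo["geos"]))  # CBG inside county 24510
print(cbg_included("240338001021", geo["geos"]))  # CBG in a different county of state 24
```

Mixing prefix lengths, as in this example, corresponds to "any combination is ok" above.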

install R and required packages:

R 4.4.1, rjson, here, tidyverse, data.table, censusapi, tidycensus, lehdr, usmap, geojsonio

obtain a census API key https://api.census.gov/data/key_signup.html

paste it into download_data.R

key = "YOUR_CENSUS_API_KEY"

run script:

Rscript download_data.R

some data must still be downloaded manually:

into folder "geo"

from https://www.census.gov/programs-surveys/geography/guidance/geo-areas/pumas.html

(2010 file is already in this repo)

census tract to PUMA relationship file, *Census_Tract_to*PUMA*.*
(census boundaries were changed in 2020; choose the year corresponding to ACS year)

from geocorr https://mcdc.missouri.edu/applications/geocorr2018.html (if using < 2020 ACS)

or https://mcdc.missouri.edu/applications/geocorr2022.html (if using >= 2020 ACS)

(geocorr2018 data for all US states is already in this repo)

puma to county, rename to *puma_to_county*.*
puma to cbsa (latest), rename to *puma_to_cbsa*.*
puma to urban-rural portion, rename to *puma_urban_rural*.*
cbg to cbsa (latest), rename to *cbg_to_cbsa*.*
cbg to urban-rural portion, rename to *cbg_urban_rural*.*

into folder "work"

employer size data from https://www.census.gov/programs-surveys/cbp/data/datasets.html

(2016 complete county file is already in this repo; more complete than later data)

cbp16co.zip

into folder "school"

from https://nces.ed.gov/programs/edge/Geographic/SchoolLocations

school locations: EDGE_GEOCODE_PUBLICSCH_*.xlsx
GIS data: folder Shapefile_SCH

from https://nces.ed.gov/ccd/files.asp

(choose "Nonfiscal" and Level = "School" from the dropdown options)

info about grades offered: "Directory" file ccd_sch_029*.csv or .zip
enrollment data: "Membership" file ccd_sch_052*.csv or .zip
number of teachers: "Staff" file ccd_sch_059*.csv or .zip

2. (optional) edit config.json

inc_adj: current year ADJINC from PUMS "Data Dictionary" at https://www.census.gov/programs-surveys/acs/microdata/documentation.html
inc_cats: arbitrary labels for income categories
inc_cols: corresponding sets of columns from ACS table B19001
income_associativity_coefficient: SBM associativity between income groups when generating workplace networks
school_associativity_coefficient: SBM associativity between school grades when generating school networks
inst_res_per_worker: # of institutional group quarters residents per staff member
noninst_res_per_worker: # of non-institutional group quarters residents per staff member
workplace_K: mean degree for workplace networks (mean # of regular work contacts)
school_K: mean degree for school networks (mean # of regular school contacts)
gq_K: mean degree for group quarters networks (mean # of contacts within group quarters)

3. install python and julia libraries:

python 3.9.16, pandas 1.5.3, numpy 1.24.3, geopandas 0.12.2, shapely 2.0.1, openpyxl 3.0.10
julia 1.9.0, CSV v0.10.10, DataFrames v1.5.0, Graphs v1.8.0, InlineStrings v1.4.0, JSON v0.21.4, MatrixMarket v0.4.0, StatsBase v0.33.21, ProportionalFitting v0.3.0

4. run scripts:

python census.py
julia -p auto CO.jl 
    (searches for optimal combination of samples to match census data, takes a while)
    (uses multiple local processors; "-p auto" uses all available cores)
julia synthpop.jl
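
If you prefer to launch the three steps from one place, a small driver along these lines could work. The commands are the ones listed above; the driver itself (PIPELINE, run_pipeline) is a hypothetical helper and not part of this repository.

```python
import subprocess

# Commands copied from the list above, in the order they must run.
PIPELINE = [
    ["python", "census.py"],
    ["julia", "-p", "auto", "CO.jl"],
    ["julia", "synthpop.jl"],
]

def run_pipeline(commands, runner=subprocess.run):
    """Run each command in order and stop at the first failure."""
    completed = []
    for cmd in commands:
        # check=True makes subprocess.run raise CalledProcessError on a non-zero exit
        runner(cmd, check=True)
        completed.append(cmd)
    return completed

# Uncomment to run from the repository root:
# run_pipeline(PIPELINE)
```

Stopping at the first failure matters because each step reads the output of the previous one.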

5. (optional) export population and/or network to csv

if continuing in julia, the population and contact network are serialized in folder "jlse"

otherwise, run export script(s):

julia export_synthpop.jl
julia export_network.jl

exports appear in folder "pop_export"

network is exported as a sparse matrix in Matrix Market native exchange format https://math.nist.gov/MatrixMarket/formats.html#MMformat

The file adjmatkeys maps the indices of the contact matrix to the people in people.csv. NOTE: the indices in the .mtx files begin at 1. If you read the matrix into Julia (or R), everything works as expected. If you read it into Python using scipy.io.mmread, the index values become 0-based (1 is subtracted from each). In adjmatkeys, use the column (indexone or indexzero) that matches how your matrix ends up indexed. (In the older version, subtract 1 from the "index" column if your matrix becomes 0-indexed.)
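
The indexing point can be checked on a toy file. The sketch below parses a tiny Matrix Market pattern matrix with the standard library only, keeps the 1-based indices as stored, then shifts them to 0-based the way a Python reader would. The matrix contents, the person_id column, and the key rows are made up for illustration; only the indexone and indexzero column names come from the text above.

```python
import io

# Toy 4-person symmetric contact matrix (contents made up for illustration).
MTX_TEXT = """%%MatrixMarket matrix coordinate pattern symmetric
% toy contact network
4 4 3
2 1
3 1
4 3
"""

def read_pattern_mtx(handle):
    """Return (n_rows, n_cols, entries); entries are 1-based (row, col) pairs."""
    header = handle.readline()
    if not header.startswith("%%MatrixMarket"):
        raise ValueError("not a Matrix Market file")
    line = handle.readline()
    while line.startswith("%"):  # skip comment lines
        line = handle.readline()
    n_rows, n_cols, nnz = (int(x) for x in line.split())
    entries = []
    for _ in range(nnz):
        parts = handle.readline().split()
        entries.append((int(parts[0]), int(parts[1])))
    return n_rows, n_cols, entries

n_rows, n_cols, one_based = read_pattern_mtx(io.StringIO(MTX_TEXT))
zero_based = [(i - 1, j - 1) for i, j in one_based]

# Hypothetical key rows; "person_id" is an assumed column name.
keys = [
    {"indexone": 1, "indexzero": 0, "person_id": "a"},
    {"indexone": 2, "indexzero": 1, "person_id": "b"},
    {"indexone": 3, "indexzero": 2, "person_id": "c"},
    {"indexone": 4, "indexzero": 3, "person_id": "d"},
]
by_one = {row["indexone"]: row["person_id"] for row in keys}
by_zero = {row["indexzero"]: row["person_id"] for row in keys}

# Either indexing gives the same contacts, as long as the matching column is used.
print([(by_one[i], by_one[j]) for i, j in one_based])
print([(by_zero[i], by_zero[j]) for i, j in zero_based])
```

Using the wrong column shifts every contact to a neighboring person without raising an error, which is why the choice of column matters.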

Keep in mind that this is not a complete contact network for a population; it only describes contacts within households, group quarters, schools, and workplaces. You will probably need to generate other types of contacts depending on what you're using this for. The file adjoutworkers lists people who work outside of the synthesized area; they have jobs but are not part of any workplace network. The file adjdummykeys lists people who live outside but work within the synthesized area; they belong to a workplace network but are not part of any household.

manual data download

note: currently only works with data from 2010 - 2019 (format changed in 2020)

into folder "census"

create one sub-folder for each geographic area whose census data you will download; sub-folder names don't matter

(if you're only using data from one US state, make one sub-folder for it)

into each sub-folder, place the following data tables (from data.census.gov)

(ACS* = ACS 5yr survey, census block group (CBG) level, from year ####)
(DEC* = decennial census tables from preceding census, having same cbg boundaries)

ACSDT5Y####.B01001-Data.csv
ACSDT5Y####.B09018-Data.csv
ACSDT5Y####.B09019-Data.csv
ACSDT5Y####.B09020-Data.csv
ACSDT5Y####.B09021-Data.csv
ACSDT5Y####.B11004-Data.csv
ACSDT5Y####.B11012-Data.csv
ACSDT5Y####.B11016-Data.csv
ACSDT5Y####.B19001-Data.csv
ACSDT5Y####.B22010-Data.csv
ACSDT5Y####.B23009-Data.csv
ACSDT5Y####.B23025-Data.csv
ACSDT5Y####.B25006-Data.csv
ACSDT5Y####.B11001H-Data.csv
ACSDT5Y####.B11001I-Data.csv
ACSDT5Y####.C24010-Data.csv
ACSDT5Y####.C24030-Data.csv
DECENNIALSF1####.P43-Data.csv
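
Because the list of tables is long, it can help to check each sub-folder before running census.py. The sketch below is a hypothetical helper (missing_tables is not part of this repository) that matches the file name patterns listed above with any year in place of ####.

```python
import glob
import os
import tempfile

# ACS table IDs copied from the list above.
REQUIRED_TABLES = [
    "B01001", "B09018", "B09019", "B09020", "B09021", "B11004", "B11012",
    "B11016", "B19001", "B22010", "B23009", "B23025", "B25006", "B11001H",
    "B11001I", "C24010", "C24030",
]

def missing_tables(folder):
    """Return the table IDs for which no matching file exists in folder."""
    missing = []
    for table in REQUIRED_TABLES:
        pattern = os.path.join(folder, f"ACSDT5Y*.{table}-Data.csv")
        if not glob.glob(pattern):
            missing.append(table)
    if not glob.glob(os.path.join(folder, "DECENNIALSF1*.P43-Data.csv")):
        missing.append("P43")
    return missing

# Demo on a temporary folder that contains only one of the required files.
with tempfile.TemporaryDirectory() as folder:
    open(os.path.join(folder, "ACSDT5Y2019.B01001-Data.csv"), "w").close()
    demo_missing = missing_tables(folder)
    print(len(demo_missing), "tables still missing")
```

Run it once per sub-folder of "census"; an empty list means every table in the list was found.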

into folder "pums"

PUMS data for the same 5-yr period as ACS

from https://www2.census.gov/programs-surveys/acs/data/pums/

psam_h??.* and psam_p??.*
for each state you want to draw samples from
(these are provided inside zip files named csv_h??.zip and csv_p??.zip)

into folder "geo"

from https://www.census.gov/programs-surveys/geography/guidance/geo-areas/pumas.html

census tract to PUMA relationship file, *Census_Tract_to*PUMA*.*
(census boundaries were changed in 2020; choose the year corresponding to ACS year)

from geocorr https://mcdc.missouri.edu/applications/geocorr2018.html (if using < 2020 ACS)

or https://mcdc.missouri.edu/applications/geocorr2022.html (if using >= 2020 ACS)

puma to county, rename to *puma_to_county*.*
puma to cbsa (latest), rename to *puma_to_cbsa*.*
puma to urban-rural portion, rename to *puma_urban_rural*.*
cbg to cbsa (latest), rename to *cbg_to_cbsa*.*
cbg to urban-rural portion, rename to *cbg_urban_rural*.*

cbg lat-long coords from https://www2.census.gov/geo/tiger/TIGER####/BG/ where #### is year

tl_####_??_bg.zip where #### is the year and ?? is the FIPS code for each state in the synth area

into folder "work"

origin-destination work commute data from https://lehd.ces.census.gov/data/

use the version that has the same boundaries as the ACS data (v7 for < 2020; v8 for >= 2020)

use JT01, "primary" jobs (because JT00 counts 2+ jobs for the same individual)

main file for every state in the synth area, named *od_main_JT01*.csv.gz
aux file for every state in the synth area, named *od_aux_JT01*.csv.gz
if many people from your synth area commute to other states, also get the *aux* file for those states

workplace area characteristics (WAC) data from same site

one file for each state in the synth area, named *wac_S000_JT01*.csv.gz

employer size data from https://www.census.gov/programs-surveys/cbp/data/datasets.html

2016 complete county file (more complete than later data)

cbp16co.zip

into folder "school"

from https://nces.ed.gov/programs/edge/Geographic/SchoolLocations

school locations: EDGE_GEOCODE_PUBLICSCH_*.xlsx
GIS data: folder Shapefile_SCH

from https://nces.ed.gov/ccd/files.asp

(choose "Nonfiscal" and Level = "School" from the dropdown options)

info about grades offered: "Directory" file ccd_sch_029*.csv or .zip
enrollment data: "Membership" file ccd_sch_052*.csv or .zip
number of teachers: "Staff" file ccd_sch_059*.csv or .zip

Owner

  • Name: CDDEP-DC
  • Login: CDDEP-DC
  • Kind: organization

Citation (CITATION.cff)

# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!

cff-version: 1.2.0
title: GREASYPOP-CO
message: >-
  If you use this software, please cite it using the
  metadata from this file.
type: software
authors:
  - family-names: Tulchinsky
    given-names: Alexander Y.
    email: tulchinsky@onehealthtrust.org
    affiliation: One Health Trust
  - given-names: Fardad
    family-names: Haghpanah
    affiliation: One Health Trust
  - given-names: Alisa
    family-names: Hamilton
    affiliation: Johns Hopkins University
  - given-names: Nodar
    family-names: Kipshidze
    affiliation: One Health Trust
  - given-names: Eili Y.
    family-names: Klein
    affiliation: 'One Health Trust, Johns Hopkins School of Medicine'
identifiers:
  - type: url
    value: 'https://arxiv.org/abs/2406.14698'
repository-code: 'https://github.com/CDDEP-DC/GREASYPOP-CO'
license: AGPL-3.0
preferred-citation:
    type: article
    authors:
      - family-names: Tulchinsky
        given-names: Alexander Y.
        email: tulchinsky@onehealthtrust.org
        affiliation: One Health Trust
      - given-names: Fardad
        family-names: Haghpanah
        affiliation: One Health Trust
      - given-names: Alisa
        family-names: Hamilton
        affiliation: Johns Hopkins University
      - given-names: Nodar
        family-names: Kipshidze
        affiliation: One Health Trust
      - given-names: Eili Y.
        family-names: Klein
        affiliation: 'One Health Trust, Johns Hopkins School of Medicine'
    journal: ArXiv
    title:  'Generating geographically and economically realistic large-scale synthetic contact networks: A general method using publicly available data'
    year: 2024
    url: 'https://arxiv.org/abs/2406.14698'
    doi: 10.48550/arXiv.2406.14698
    identifiers:
      - type: other
        value: 'arXiv:2406.14698'
        description: 'Archive ID'
