population.modelling.hue2010
Results presented for iGISc conference 2021
Census private data estimation
Author: Néstor de la Paz Ruíz
Code repository:
A CODECHECK certificate is available confirming that the computations underlying the poster could be independently reproduced:
1. Introduction
Enabling SMS with the estimation of block census private data
This document shows the steps taken to prepare the census data for
Spatial Microsimulation (SMS). Census data on age ranges, sex, school
attendance, and work allow the estimation of DWW pollutants (production
and location) as a function of population characteristics.
1.1 Problem
Data privacy policies protect inhabitants’ sensitive information and make population census data difficult to use in research activities. In particular, privacy policies make data partly inaccessible at small spatial units, such as neighborhood blocks with low population density.
Spatial Microsimulation (SMS) refers to “the creation, analysis and modelling of individual level data allocated to geographic zones” (Lovelace, 2018). Complete census data at the block level are required to implement SMS and to better investigate population behavioural dynamics in complex phenomena such as mobility, security, pollution, or health.
1.2 Objective
Develop a method to estimate inaccessible population census data to enable more complex applications such as SMS.
1.3 Variables selected
After analyzing the 198 census variables that exist at the block level,
45 were found to be required in the cleaning and transformation phase to
execute SMS and to provide input data for the DWW ABM. The selected
variables will be used to simulate the dynamics of mobility,
biological, and social behaviors as submodels of the ABM.
## [1] "POBTOT" "POBMAS" "POBFEM" "P_12YMAS" "P_12YMAS_M"
## [6] "P_12YMAS_F" "P_0A2" "P_0A2_M" "P_0A2_F" "P_3A5"
## [11] "P_3A5_M" "P_3A5_F" "P_6A11" "P_6A11_M" "P_6A11_F"
## [16] "P_12A14" "P_12A14_M" "P_12A14_F" "P_15A17" "P_15A17_M"
## [21] "P_15A17_F" "P_18A24" "P_18A24_M" "P_18A24_F" "P3A5_NOA"
## [26] "P3A5_NOA_M" "P3A5_NOA_F" "P6A11_NOA" "P6A11_NOAM" "P6A11_NOAF"
## [31] "P12A14NOA" "P12A14NOAM" "P12A14NOAF" "P15A17A" "P15A17A_M"
## [36] "P15A17A_F" "P18A24A" "P18A24A_M" "P18A24A_F" "PEA"
## [41] "PEA_M" "PEA_F" "PE_INAC" "PE_INAC_M" "PE_INAC_F"
1.4 NA Values
NA cells exist as a means of privacy protection. The raw block data contain 829 NA cells. A preview of the data missing due to privacy looks like the following:
| P_0A2 | P_0A2_M | P_0A2_F | P_3A5 | P_3A5_M | P_3A5_F | P_6A11 | P_6A11_M | P_6A11_F |
| ----: | ------: | ------: | ----: | ------: | ------: | -----: | -------: | -------: |
| 0 | 0 | 0 | NA | 0 | NA | NA | NA | 0 |
| 4 | NA | NA | 4 | 3 | NA | 6 | 4 | NA |
| 3 | NA | NA | 4 | 3 | NA | 6 | 3 | 3 |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| NA | NA | 0 | NA | 0 | NA | 0 | 0 | 0 |
| NA | 0 | NA | 0 | 0 | 0 | 8 | 5 | 3 |
| NA | NA | 0 | NA | 0 | NA | NA | NA | 0 |
| 7 | 3 | 4 | 3 | 3 | 0 | 12 | 4 | 8 |
| NA | NA | 0 | NA | 0 | NA | 7 | 4 | 3 |
| NA | NA | 0 | 7 | 4 | 3 | 8 | 5 | 3 |
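The NA count can be reproduced with base R. The following is a minimal sketch that uses only the first three columns of the preview above; the object names `preview`, `na_total`, and `na_by_col` are illustrative and do not come from the repository scripts.

``` r
# First three columns of the preview table (NA = confidential cell).
preview <- data.frame(
  P_0A2   = c(0, 4, 3, 0, NA, NA, NA, 7, NA, NA),
  P_0A2_M = c(0, NA, NA, 0, NA, 0, NA, 3, NA, NA),
  P_0A2_F = c(0, NA, NA, 0, 0, NA, 0, 4, 0, 0)
)

# is.na() returns a logical matrix; summing it counts the NA cells.
na_total  <- sum(is.na(preview))      # NA cells in the whole table
na_by_col <- colSums(is.na(preview))  # NA cells per variable
```

Applied to the full block table, the same `sum(is.na(...))` call would give the overall NA count reported above.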
1.5 NA Confidentiality criteria
The main confidentiality criteria defined by the Ley del Sistema Nacional de Información Estadística y Geográfica (LSNIEG) are the following (INEGI, 2020):
• At the municipality or territorial demarcation, locality, and AGEB levels, any indicator with fewer than three units appears as an asterisk, with the exception of the variables Total population (POBTOT), Total dwellings (VIVTOT), and Total inhabited dwellings (TVIVHAB).
• At the block level, the previous criterion also applies; additionally, blocks with one or two inhabited dwellings only present information for the variables Total population (POBTOT) and Total dwellings (VIVTOT), and the rest of the indicators appear as asterisks.
The above criteria are taken into account when filling NA cells. Note that in this
document asterisks (*) are replaced by NA cells.
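One way to turn asterisks into NA cells while reading the data is the `na.strings` argument of `read.csv`. The sketch below is illustrative only: the inline CSV text and the object names `raw_csv` and `blocks` are assumptions, and the repository may load the data differently.

``` r
# Hypothetical raw extract: "*" marks confidential cells.
raw_csv <- "POBTOT,P_0A2,P_0A2_M
13,*,0
64,4,*
0,0,0"

# na.strings tells read.csv to parse "*" as NA, so the columns stay numeric
# instead of being read as character.
blocks <- read.csv(text = raw_csv, na.strings = "*")
```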
2 R script rules
The prediction of DWW pollutants depends on the quantification of
population totals and their characteristics. NA cells represent
missing population, which leads to an underestimation of pollutants. For
that reason, NA cells are replaced with values following the rules in
this section. Once the values are filled, the requirements to execute
SMS are met and DWW pollutants can be estimated in a later step.
2.1 Functions to fill NA cells for each categorical variable
As a general overview, NA cells are filled with population counts
that match the following totals:
| Targeted totals | Description |
| --------------- | ----------------------------------------------------------------------- |
| By column | Representing the totals of the categorical values. |
| By row | Making sure that the totals of a block are not overestimated. |
| By gender | Considering the disaggregation of the totals by gender (and their sum). |
The variables selected represent the following categories:
• Population totals
• Age-ranges and sex
• School attendance or not
• Economically active or not (work)
Each category requires its own script for data cleaning and
transformation. This document explains the categorical variable
Age-ranges and sex. The functions for the rest of the categorical
variables follow a similar logic with adjustments. Readers interested in
the particularities of each script can explore the code for the
corresponding category.
The following script gives an overview of the core rules for cleaning
Age-ranges and sex (the code is described after the script).
To interpret the code, note the following terminology:
| Notation in the script | Description |
| ---------------------------- | ------------------------------------------------------------------------- |
| smcol_, smrow_               | Functions built to sum columns and rows across the categorical variable.  |
| _mf, _m, _f                  | Suffixes that identify the section of a function by gender.               |
| .mf, .m, .f                  | If present, identifies the variable by its gender.                        |
| vec.mf.n, vec.m.n, vec.f.n   | Vectors that store the new (.n) values.                                   |
| t.tot                        | Totals of the categorical value (columns).                                |
| hpcons                       | Original data table with all categorical variables.                       |
| df.c$col.f                   | Suffix that refers to a cleaned and/or calculated dataframe and variable. |
``` r
# RULES FOR TOTALS BY SUM OF MALE AND FEMALE (MF)
# Conditionals for totals MF: Filling missing values
while (t.tot.mf > smcol_mf(i)) { # While values are not == real total population
  for (i in 1:nrow(hpcons)) { # For each cell in the target variable column
    if (is.na(vec.mf[i]) # Targeting the missing fields to fill
        & (t.tot.mf > smcol_mf(i)) # Keep sum col values under real total
        & (POBTOT.c$POBTOT.f[i] > smrow_mf(i)) # Keep row sum variable under POB totals
        & (3 > vec.mf.n[i]) # Only values less than 3 are confidential
    ) {
      vec.mf.n[i] <- vec.mf.n[i] + 1 # Add individual
    }
  }
}

# RULES FOR TOTALS BY MALE (M)
# Conditionals for totals M: Filling missing values
# First make sure to cover all female conditions
while (t.tot.m > smcol_m(i)) { # While values are not = real total population
  for (i in 1:nrow(hpcons)) { # For each cell in the target variable column
    if (is.na(vec.m[i]) # Targeting the missing fields to fill
        & !is.na(vec.f[i]) # For female precondition for male cells
        & (POBTOT.c$POBMAS.f[i] > smrow_m(i)) # Keep sum row values under real total
        # First fill all values where total and females are defined
        & (((vec.mf + vec.mf.n)[i] - vec.f[i]) != vec.m.n[i])
    ) {
      while (((vec.mf + vec.mf.n)[i] - vec.f[i]) != vec.m.n[i]) {
        vec.m.n[i] <- vec.m.n[i] + 1
      }
    }
  }
  for (i in 1:nrow(hpcons)) {
    if (is.na(vec.m[i]) # Targeting the missing fields to fill
        & is.na(vec.f[i]) # 2nd possibility with female values
        & (t.tot.m > smcol_m(i)) # Keep sum col values under real total
        & (POBTOT.c$POBMAS.f[i] > smrow_m(i)) # Keep sum row values under real total
        & (vec.m.n[i] != vec.df.mf.n[i, c(1)]) # Must be different than the MF total of the cell
        & (3 > vec.m.n[i]) # Only values less than 3 are confidential
    ) { # Keep by block values under POB totals
      vec.m.n[i] <- vec.m.n[i] + 1 # Add individual
    }
  }
}

# RULES FOR TOTALS BY FEMALE (F)
# Conditionals for totals F: Filling missing values
while (t.tot.f > smcol_f(i)) { # While values are not = real total population
  for (i in 1:nrow(hpcons)) { # For each cell in the target variable column
    if (is.na(vec.f[i]) # Targeting the missing fields to fill
        & (t.tot.f > smcol_f(i)) # Keep sum col values under real total
        & (POBTOT.c$POBFEM.f[i] > smrow_f(i)) # Keep sum row values under real total
        # Special condition: verify that male[i] keeps under total MF[i]
        & ((vec.mf + vec.mf.n)[i] != ((vec.m + vec.m.n)[i] + vec.f.n[i]))
        # Special condition: Calculate difference of new totals
        & (((vec.mf + vec.mf.n)[i] - (vec.m + vec.m.n)[i]) != 0)
        & (3 > vec.m.n[i])) # Only values less than 3 are confidential
    {
      vec.f.n[i] <- vec.f.n[i] + 1 # Add individual
    }
  }
}
```
As can be seen in the script headers, the rules are separated into three sections: 1. Sum of male and female, 2. Male, 3. Female. All categorical variables are disaggregated by gender, and their functions all follow the same structure.
Across the three sections, the script for Male +
Female is repeated for Male and for Female, with some additions
that depend on the gender column. For that reason, the
explanation of the Male + Female code in the following section is
enough to understand the rest of the code.
The order in which the code is executed matters. There are
differences in the Male and Female scripts that
are discussed in the following sections. For the rest, the design of the
script is self-explanatory.
2.2 Columns (smcol_): Consistency in the totals by categorical variable.
This section describes the function smcol_. The condition
while(t.tot.mf > smcol_mf(i)) is the initial condition in each section
of the code. It makes sure that NA values are filled until the column
sum matches the total of the categorical variable.
Tables: t.tot & variable table
Each categorical variable consists of three columns: the total count
(male + female), the male count, and the female count,
e.g. P_18A24, P_18A24_M, P_18A24_F. The totals table (t.tot)
consists of one row and as many columns as there are categorical
variables. Example of the data tables:
Section of t.tot with the totals by categorical variable (columns):
| P15A17A | P15A17A_M | P15A17A_F |
| ------: | --------: | --------: |
| 136 | 59 | 77 |
Section of the variable table with NA values for a categorical
variable (block data):
| P15A17A | P15A17A_M | P15A17A_F |
| ------: | --------: | --------: |
| 0 | 0 | 0 |
| NA | 0 | NA |
| NA | NA | 0 |
| 0 | 0 | 0 |
| 0 | 0 | 0 |
| 4 | 0 | 4 |
The while conditionals
As the census data provide the total population for each categorical
variable in a locality, values can be added to NA cells until the
column sum matches that total in the locality of study. For that
purpose, a while conditional is used for each categorical variable and
its three columns:
``` r
# Totals: while (t.tot > smcol(i)), vec.n[i] <- vec.n[i]+1
# Male:   while (t.tot.m > smcol_m(i)), vec.m.n[i] <- vec.m.n[i]+1
# Female: while (t.tot.f > smcol_f(i)), vec.f.n[i] <- vec.f.n[i]+1
```
| Where: | Description |
| ------------ | --------------------------------------------------------------- |
| t.tot | Population totals of the categorical variable. |
| smcol(i) | Sum of column values of the variable in the cell iteration i. |
| vec.n | New vector whose value in cell i replaces the NA. |
| vec.n[i]+1 | Adds a unit in the new vector cell. |
For each iteration (i) over the cells of the column, the script targets an
NA cell, adds 1, and moves to the next NA cell. The process repeats
until the total of the categorical variable is no longer bigger than
the sum of the column across blocks. As a result, the missing population
of the targeted categorical variable is distributed over the blocks.
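The column rule can be sketched on a toy vector. This is a simplified illustration under stated assumptions, not the repository's function: it ignores the row and gender conditions, `smcol` takes no argument, the locality total of 9 is hypothetical, each NA cell is capped at 2 (one reading of the "fewer than three units" criterion), and a guard is added so the loop stops if no cell can be filled.

``` r
# Toy column for one categorical variable: NA marks confidential cells.
vec   <- c(0, NA, NA, 0, 0, 4, NA)
t.tot <- 9                      # hypothetical published locality total
vec.n <- rep(0, length(vec))    # increments that will replace the NA cells

# Current column sum: known values plus the increments added so far.
smcol <- function() sum(vec, na.rm = TRUE) + sum(vec.n)

while (t.tot > smcol()) {
  before <- smcol()
  for (i in seq_along(vec)) {
    if (is.na(vec[i]) && t.tot > smcol() && vec.n[i] < 2) {
      vec.n[i] <- vec.n[i] + 1  # add one individual to this NA cell
    }
  }
  if (smcol() == before) break  # guard: nothing could be added, stop
}

# Known values stay as they are; NA cells take their increments.
filled <- ifelse(is.na(vec), vec.n, vec)
```

Because the loop adds one individual per NA cell per pass, the missing population is spread across the NA cells instead of being placed in the first one.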
The if conditionals
Some conditions must be true in each iteration over the
cells to decide whether a value can be added to cell i during the
while loop. They are the following:
``` r
if (is.na(vec.mf[i]) # Targeting the missing fields to fill
    & (t.tot.mf > smcol_mf(i)) # Keep sum col values under real total
    & (POBTOT.c$POBTOT.f[i] > smrow_mf(i)) # Keep row sum variable under POB totals
    & (3 > vec.mf.n[i]) # Only values less than 3 are confidential
) {vec.mf.n[i] <- vec.mf.n[i] + 1}
```
The is.na function makes sure that only NA cells are selected. The
condition (t.tot.mf > smcol_mf(i)) is checked again at each
iteration i, because the column total may be reached partway through a
pass over the blocks. The condition (3 > vec.mf.n[i]) expresses the
confidentiality criterion. Finally, the condition
(POBTOT.c$POBTOT.f[i] > smrow_mf(i)) is explained in the following section.
2.3 Rows (smrow_): Consistency in the totals by block.
The totals by row make it possible to control the gradual addition of
individuals to NA cells, keeping every categorical variable under the
threshold of the total population in each block.
Tables: POBTOT & variable table
For each block (row) of the locality, totals by gender are provided, as shown in the following tables:
| | POBTOT | POBMAS | POBFEM |
| :- | -----: | -----: | -----: |
| 1 | 13 | 5 | 8 |
| 2 | 64 | 31 | 33 |
| 3 | 60 | 29 | 31 |
| 4 | 0 | 0 | 0 |
| 5 | 24 | 8 | 16 |
| 6 | 36 | 17 | 19 |
| 7 | 39 | 17 | 22 |

| | P15A17A | P15A17A_M | P15A17A_F |
| :- | ------: | --------: | --------: |
| 1 | 0 | 0 | 0 |
| 2 | NA | 0 | NA |
| 3 | NA | NA | 0 |
| 4 | 0 | 0 | 0 |
| 5 | 0 | 0 | 0 |
| 6 | 4 | 0 | 4 |
| 7 | NA | 0 | NA |
The block conditional
The totals by block and gender make it possible to specify a rule that
ensures the sum of the categorical variables in a block does not
overestimate the population under the previous conditionals (the while
and if conditions). For that reason, a smrow_ function is used to sum,
for each row, the targeted categorical variables whose sum must match
the total population in the block. The rules are the following:
| Gender | Function |
| ----------- | --------------------------- |
| Male+female | (POBTOT[i] > smrow_mf(i)) |
| Male | (POBMAS[i] > smrow_m(i)) |
| Female | (POBFEM[i] > smrow_f(i)) |
| Where | Description |
| ------------------------ | ----------------------------------------------------------------------------------------------- |
| POBTOT, POBMAS, POBFEM | Total population, total males and females. |
| smrow_ | Function that sums all rows of the categorical variables at the iteration cell based on gender. |
Once the smrow_ function returns the row sum for the cell in the
current iteration, the value is compared with the respective total
in the block. If the total is bigger than the row sum, the addition
vec.n[i] <- vec.n[i]+1 is allowed.
Note that, within a block, the sum of all columns with the header format P_#A#
(e.g. P_18A24) is equal to the POBTOT of that block. The same
applies to the male and female functions.
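The row check can be illustrated as follows. This is a sketch with hypothetical data: the matrix `ages`, its three columns, and the helper `can_add` are assumptions made for the example, and `smrow` here is a simplified stand-in for the repository's smrow_ functions.

``` r
# Hypothetical block data: POBTOT per block and three age-range columns.
POBTOT <- c(13, 5)
ages <- rbind(
  c(P_0A2 = NA, P_3A5 = 4,  P_6A11 = 6),  # block 1: known values sum to 10
  c(P_0A2 = 2,  P_3A5 = NA, P_6A11 = 3)   # block 2: known values sum to 5
)
ages.n <- matrix(0, nrow = nrow(ages), ncol = ncol(ages))  # increments so far

# Row sum for block i: known values plus increments already added.
smrow <- function(i) sum(ages[i, ], na.rm = TRUE) + sum(ages.n[i, ])

# An individual may be added to block i only while the block total is
# still bigger than the row sum.
can_add <- function(i) POBTOT[i] > smrow(i)
```

In this toy data, block 1 still has room (10 of 13 inhabitants are accounted for), while block 2 is already full, so its NA cell must stay at zero.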
2.4 Extra conditionals (M, F)
Some extra conditions apply specifically to the male column and others to the female column. They are explained next.
Male conditionals
The male column sits between the totals and the female columns, which
follows the order of execution of the function. Sometimes the female
values were already provided and the totals column was already defined.
This implies that, in blocks where the male cell is NA but the female
value is known, the male cell must be calculated before the regular
while iteration. To solve that issue, the following lines of code are
provided:
``` r
!is.na(vec.f[i]) # For female precondition for male cells
# First fill all values where total and females are defined
& (((vec.mf + vec.mf.n)[i] - vec.f[i]) != vec.m.n[i])
```
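The effect of this precondition can be sketched with a direct subtraction. The example below is an illustration on hypothetical values, not the repository's code: where the script increments vec.m.n one unit at a time until it equals total minus female, this sketch computes the same difference in one vectorised step, assuming the total is never smaller than the female count.

``` r
# Hypothetical values for three blocks.
vec.mf <- c(5, 2, 4)     # totals (male + female), already known or filled
vec.f  <- c(3, 0, NA)    # females: known in blocks 1 and 2
vec.m  <- c(NA, NA, 1)   # males: missing in blocks 1 and 2

# Male is missing while total and female are both available.
derivable <- is.na(vec.m) & !is.na(vec.f) & !is.na(vec.mf)

# Male count follows directly from total minus female.
vec.m.filled <- vec.m
vec.m.filled[derivable] <- vec.mf[derivable] - vec.f[derivable]
```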
Female conditionals
In the case of the female column, sometimes the totals and male values already account for the whole block, which means there is no room left for females. The conditional checks the status of the totals and males to select the priority blocks where females are required. The following code solves that issue.
``` r
# Special condition: verify that male[i] keeps under total MF[i]
& ((vec.mf + vec.mf.n)[i] != ((vec.m + vec.m.n)[i] + vec.f.n[i]))
# Special condition: Calculate difference of new totals
& (((vec.mf + vec.mf.n)[i] - (vec.m + vec.m.n)[i]) != 0)
```
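A small vectorised sketch of these two special conditions, on hypothetical values, shows which blocks are selected for females. The names `tot.mf`, `tot.m`, `room_left`, `not_closed`, and `priority` are illustrative and do not appear in the repository scripts.

``` r
# Hypothetical filled totals and males for four blocks.
tot.mf  <- c(2, 3, 0, 5)   # stands for (vec.mf + vec.mf.n)
tot.m   <- c(2, 1, 0, 5)   # stands for (vec.m + vec.m.n)
vec.f.n <- c(0, 0, 0, 0)   # female increments added so far

# Second special condition: total minus male leaves room for females.
room_left <- (tot.mf - tot.m) != 0

# First special condition: male plus female does not yet equal the total.
not_closed <- tot.mf != (tot.m + vec.f.n)

# Blocks where a female individual may still be added.
priority <- room_left & not_closed
```

Only the second block qualifies here: its total is 3 and males account for 1, so females can still be added.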
3 Generic cleaning function
Finally, the generic function that was built to clean
the data for all the columns of the categorical variable Age-ranges
and sex is declared as follows:
``` r
clean.pvar <- function(vec.mf = hpcons$P_12YMAS,
                       vec.m = hpcons$P_12YMAS_M,
                       vec.f = hpcons$P_12YMAS_F,
                       t.tot.mf = hptotcons$P_12YMAS,
                       t.tot.m = hptotcons$P_12YMAS_M,
                       t.tot.f = hptotcons$P_12YMAS_F,
                       nam.mf.n = "P_12YMAS.f",
                       nam.m.n = "P_12YMAS_M.f",
                       nam.f.n = "P_12YMAS_F.f") {
  # Function body omitted here
}
```
As mentioned in section 2.1, each category of variables has its own specific cleaning function.
4 Results
During data transformation, it was verified that the differences between categorical values (for example, totals minus male or female counts) produce no negative values. It was also verified that the sums of the filled values match the totals by column and by row.
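Checks of this kind can be expressed in a few lines of base R. The sketch below uses the first five rows of the filled P_0A2 columns reported in section 4.1.1; the object names are illustrative, and the checks shown are an assumption about how such a verification could look, not the repository's own test code.

``` r
# First five rows of the filled P_0A2 columns (section 4.1.1).
cleaned <- data.frame(
  P_0A2.f   = c(0, 4, 3, 0, 2),
  P_0A2_M.f = c(0, 2, 1, 0, 2),
  P_0A2_F.f = c(0, 2, 2, 0, 0)
)

no_na        <- !anyNA(cleaned)      # every NA cell was filled
no_negatives <- all(cleaned >= 0)    # no negative counts

# Total minus male minus female must be zero in every block.
gender_gap <- cleaned$P_0A2.f - cleaned$P_0A2_M.f - cleaned$P_0A2_F.f
gender_ok  <- all(gender_gap == 0)
```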
Here is a summary before data cleaning for the variable P_18A24_F:
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 3.000 4.000 4.405 6.000 16.000 21
Summary after cleaning P_18A24_F (eliminated NA values):
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 3.000 3.286 5.500 16.000
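Summaries like the ones above come from base R's `summary()`, which appends an `NA's` entry only when missing values are present. A minimal sketch on a hypothetical vector (the values and the names `before` and `after` are illustrative, not census data):

``` r
# Hypothetical column before cleaning: two confidential cells.
before <- c(0, 3, NA, 4, NA, 6, 16)

# After cleaning, the NA cells hold estimated small counts.
after <- before
after[is.na(after)] <- c(1, 2)

summary(before)  # includes an "NA's" entry
summary(after)   # no "NA's" entry once every cell is filled
```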
Here is a summary before data cleaning for the variable P_18A24_M:
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 0.000 4.000 4.071 6.000 14.000 21
Summary after cleaning P_18A24_M (eliminated NA values):
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 3.000 3.159 5.000 14.000
4.1 iGISc 2021 Conference Poster
This section refers to the result figures presented at iGISc, the International Conference on Geospatial Information Sciences, at CentroGeo, Mexico.
4.1.1 Evaluation
Section 2.1 introduces the target of the algorithm, which is the basis
for the evaluation. The evaluation covers the variables related to
age-ranges and gender, school, and work. The variables are
compared with their respective total values at the spatial units of
locality and block. The evaluation results show that the
algorithm target is met according to the following two criteria:
1. Comparing the observed census totals by locality, per variable, with the estimated values.
| Observed totals by locality | Estimated totals by locality |
| ---------------------------------- | ------------------------------------------------------------------------------------------------------ |
| Known value of a targeted variable | Defined as the sum of the estimated values of the blocks that make up the target variable (sum by columns) |
Figure: comparison of observed and estimated totals by locality for the variables of interest.
2. Comparing the observed census totals by block, per variable, with the estimated values.
| Observed totals by blocks | Estimated totals by blocks |
| ---------------------------------- | --------------------------------------------------------------------------------------------------- |
| Known value of a targeted variable | Defined as the sum of the estimated values of the blocks that make up the target variable (sum by rows) |
Figure: comparison of observed and estimated totals by block for the variables of interest.
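Both comparisons reduce to column sums and row sums of the estimated table. The sketch below is illustrative: the estimated values are the first five rows of two filled columns from the sample table in this section, while the observed totals by locality and by block are hypothetical numbers chosen for the example.

``` r
# Estimated (filled) values for five blocks and two variables.
estimated <- data.frame(
  P_0A2.f = c(0, 4, 3, 0, 2),
  P_3A5.f = c(2, 4, 4, 0, 2)
)

# Hypothetical observed totals (assumed for this example).
observed_by_locality <- c(P_0A2.f = 9, P_3A5.f = 12)
observed_by_block    <- c(2, 8, 7, 0, 4)

# Criterion 1: sum by columns against the locality totals.
by_locality <- data.frame(
  variable  = names(observed_by_locality),
  observed  = unname(observed_by_locality),
  estimated = unname(colSums(estimated)[names(observed_by_locality)])
)
by_locality$difference <- by_locality$observed - by_locality$estimated

# Criterion 2: sum by rows against the block totals.
by_block <- data.frame(
  block     = seq_len(nrow(estimated)),
  observed  = observed_by_block,
  estimated = rowSums(estimated)
)
by_block$difference <- by_block$observed - by_block$estimated
```

A difference of zero in every row of `by_locality` and `by_block` corresponds to the target being met for that variable or block.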
The table below is a sample of filled NA values.
| P_0A2.f | P_0A2_M.f | P_0A2_F.f | P_3A5.f | P_3A5_M.f | P_3A5_F.f | P_6A11.f | P_6A11_M.f | P_6A11_F.f |
| ------: | --------: | --------: | ------: | --------: | --------: | -------: | ---------: | ---------: |
| 0 | 0 | 0 | 2 | 0 | 2 | 2 | 2 | 0 |
| 4 | 2 | 2 | 4 | 3 | 1 | 6 | 4 | 2 |
| 3 | 1 | 2 | 4 | 3 | 1 | 6 | 3 | 3 |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 2 | 0 | 2 | 0 | 2 | 0 | 0 | 0 |
| 2 | 0 | 2 | 0 | 0 | 0 | 8 | 5 | 3 |
| 2 | 2 | 0 | 2 | 0 | 2 | 2 | 2 | 0 |
| 7 | 3 | 4 | 3 | 3 | 0 | 12 | 4 | 8 |
| 2 | 2 | 0 | 2 | 0 | 2 | 7 | 4 | 3 |
| 2 | 2 | 0 | 7 | 4 | 3 | 8 | 5 | 3 |
4.1.2 Conclusion
Results demonstrate that the algorithm can estimate inaccessible census data at small spatial units with low population density.
The proposed algorithmic method enables the implementation of more complex applications.
As future research, a complex pollution problem will be simulated.
Citation (CITATION.md)
# Poster presented at iGISc 2021, CentroGeo, CDMX
DeLaPaz-Ruíz, N., Augustijn, E.-W., Farnaghi, M. and Zurita-Milla, R. 2021. Population census estimation for Spatial Microsimulation. In: iGISc 2021, International Conference on Geospatial Information Sciences [Online]., p.1. Available from: http://igisc.org/.