irs990efile
R package for building a research database from IRS 990 efiler tax returns.
https://github.com/nonprofit-open-data-collective/irs990efile
Science Score: 39.0%
This score indicates how likely this project is to be science-related based on various indicators:
-
○CITATION.cff file
-
✓codemeta.json file
Found codemeta.json file -
✓.zenodo.json file
Found .zenodo.json file -
✓DOI references
Found 4 DOI reference(s) in README -
○Academic publication links
-
○Academic email domains
-
○Institutional organization owner
-
○JOSS paper metadata
-
○Scientific vocabulary similarity
Low similarity (12.6%) to scientific vocabulary
Repository
R package for building a research database from IRS 990 efiler tax returns.
Basic Info
- Host: GitHub
- Owner: Nonprofit-Open-Data-Collective
- License: mit
- Language: HTML
- Default Branch: main
- Homepage: https://nonprofit-open-data-collective.github.io/irs990efile/
- Size: 148 MB
Statistics
- Stars: 25
- Watchers: 8
- Forks: 5
- Open Issues: 1
- Releases: 5
Metadata Files
README.md
irs990efile
R package for building a research database from IRS 990 nonprofit efiler tax returns. Please cite as:
Lecy, J. (2025). The irs990efile Package for R (v.1.0.0). Zenodo: https://doi.org/10.5281/zenodo.14736813
Lecy, J. (2024). IRS 990 Efiler Concordance File (v1.0.0) [Data set]. Zenodo: https://doi.org/10.5281/zenodo.14544301
The Concordance File provides the crosswalk architecture for moving from XML files to rectangular tables.
The full set of table descriptions is available in the DATA DICTIONARY.
Preprocessed CSV files are available on the NCCS website: DOWNLOAD TABLES.
Installation
Note that *xmltools** is not available on CRAN so has to be installed remotely before installing the irs990efile package.*
```r
install.packages( 'devtools' )
devtools::installgithub( 'ultinomics/xmltools' ) devtools::installgithub( 'nonprofit-open-data-collective/irs990efile' ) ```
Use
The package is designed to make IRS 990 Efiler XML files accessible to the research community by converting them into rectangular CSV formats that will be more familiar to data analysts. The code is designed to make file retrieval and translation straight-forward:
```r
library( irs990efile ) ### DO NOT RUN THIS VERSION FIRST
build_database( years=2020:2022 ) ### (IT CAN TAKE DAYS TO BUILD)
```
The code is simple, but the computing time is not. To understand the package and make sure it's working in your local environment start with this simple example (a toy database of ~500 990 returns:
```r library( irs990efile ) test_build()
Roughly equivalent to:
index <- build_index( tax.years=2018:2022 )
index.small <- dplyr::sample_n( index, 500 )
build_database( index.small )
```
You will see the following messaging:
```
Building a small database (~500 990 returns).
Average build time 5-10 minutes.
Check BUILD-LOG.txt for progress.
(1) Create directory structure.
(2) Split the index into BATCHFILES, one for each year.
(3) Parse XML batches into tables and save as CSV files in the YEAR folders.
(4) Combine all batched CSV files into compiled tables in the COMPILED folder.
(5) Combine all logfiles of missing xpaths into the FIX folder for review.
DATABASE BUILD START TIME: 2025-01-30 13:44:29.551851
You have 16 cores available for parallel processing.
There are 530 returns in this build.
|Var1 | Freq|
|:----|----:|
|2018 | 107|
|2019 | 113|
|2020 | 105|
|2021 | 96|
|2022 | 109|
STARTING YEAR 2018
There are 107 returns in 2018.
There are 11 groups being sent for parallel collection.
>> 1:44:57 PM -- Jan 30 2025 -- COMPLETED {G1} {G2} {G3} {G4} {G5} {G6} {G7} {G8} {G9} {G10}
There were 0 failed URLS
Time for the 2018 loop: 0.61 minutes
...
DATABASE BUILD FINISH TIME: 2025-01-30 13:47:43.788732
TOTAL BUILD TIME: 0.05 HOURS
```
Upon execution the following directory structure will be created and populated with data as XML files are parsed.
r
NEW FOLDER (the name will be 5 random letters: QBWJH)
BUILD-LOG.txt (reports build progress)
HIST (replication files)
system-info.txt
rhistory
index.rds
FIX (catalog of concordance issues)
missing xpaths.csv
collapsed fields.txt
2019
BATCHFILE.RDS (all 2019 urls split into batches)
2019-F9-P00-T00-HEADER-batch-01.CSV
2019-F9-P00-T00-HEADER-batch-02.CSV
2019-F9-P01-T00-SUMMARY-batch-01.CSV
2019-F9-P01-T00-SUMMARY-batch-02.CSV
2019-F9-P12-T00-FINANCIAL-REPORTING-batch-01.CSV
2019-F9-P12-T00-FINANCIAL-REPORTING-batch-02.CSV
... all tables x batches
2020
BATCHFILE.RDS (all 2020 urls split into batches)
2020-F9-P00-T00-HEADER-batch-01.CSV
2020-F9-P00-T00-HEADER-batch-02.CSV
... all tables x batches
2021
BATCHFILE.RDS (all 2021 urls split into batches)
2021-F9-P00-T00-HEADER-batch-01.CSV
2021-F9-P00-T00-HEADER-batch-02.CSV
... all tables x batches
COMPILED (batches compiled into single file)
2019-F9-P00-T00-HEADER.CSV
2019-F9-P01-T00-SUMMARY.CSV
2019-F9-P12-T00-FINANCIAL-REPORTING.CSV
2020-F9-P00-T00-HEADER.CSV
2020-F9-P01-T00-SUMMARY.CSV
2020-F9-P12-T00-FINANCIAL-REPORTING.CSV
2021-F9-P00-T00-HEADER.CSV
2021-F9-P01-T00-SUMMARY.CSV
2021-F9-P12-T00-FINANCIAL-REPORTING.CSV
CSV files created in YYYY (year) folders are split into batches to enable parallel computing and to keep track of progress so the process can be paused and restarted as necessary using the resumebuilddatabase() feature. Once all XML files have been parsed the batches are combined in the COMPILED folder. There are approximately 110 tables defined in the concordance plus a SCHEDULE-TABLE-YEAR.CSV that consists of TRUE/FALSE indicators for whether a nonprofit has filed each of the sixteen 990 schedules in a given year.
```r gettablenames()
[1] "F9-P00-T00-HEADER"
[2] "F9-P01-T00-SUMMARY"
[3] "F9-P01-T00-SUMMARY-EZ"
[4] "F9-P02-T00-SIGNATURE"
[5] "F9-P03-T00-MISSION"
...
[111] "SR-P05-T01-TRANSACTIONS-RLTD-ORGS"
[112] "SR-P06-T01-UNRLTD-ORGS-TAXABLE-PARTNERSHIP"
```
- The BUILD-LOG.txt will record progress (similar to what is printed in the console) and any errors that occur.
- Files archived in the HIST folder are useful for replication purposes (the index file used for the build, system settings, and the Rhistory file).
- Files in the FIX folder are logs of xpaths currently missing from the concordance but encountered in XML files, and fields that were supposed to be single values (part of one-to-one tables) but were returned as vectors. They are collapsed into a quasi-JSON format in the CSV files so they can be stored in a single cell: "{AK};{MT};{NY}".
The files in the FIX folder are mainly to help package developers track any schema changes that might impact existing variables or identify XML pathologies that break parsing routines.
XML Files on the 990 Data Commons
The irs990efile package pulls XML 990 returns from the Giving Tuesday Data Lake, an AWS S3 bucket that contains the full universe (as close as possible) of Efile 990 returns, along with clean and accurate index files.
You can access index files using the following package functions:
r
options( timeout = 600 ) # allows 10 minutes before timeout for large files
index <- get_current_index_batch() # the most recent batch of files added to the S3 bucket
index <- get_current_index_full() # the full list of all files in the S3 bucket
download_current_index_full() # creates a local download instead of reading as a data frame
Visualizing XML Structure
r
plot_table_str( "PF-P08-T01-PROG-RLTD-INV" )
r
print_table_str( "PF-P08-T01-PROG-RLTD-INV" )
1 Return
2 --ReturnData
3 --IRS990ScheduleL
4 --Form990ScheduleLPartIII
5 --AmountOfGrant
6 --AmtOfGrantOrTypeOfAssistance
7 --AmountOfGrant
8 --TypeOfAssistance
9 --NameOfInterestedBusiness
10 --BusinessNameLine1
11 --BusinessNameLine2
12 --NameOfInterestedPerson
13 --NameBusiness
14 --BusinessNameLine1
15 --BusinessNameLine2
16 --NamePerson
17 --PurposeOfAssistance
18 --RelationshipWithOrganization
19 --TypeOfAssistance
20 --GrntAsstBnftInterestedPrsnGrp
21 --AssistancePurposeTxt
22 --BusinessName
23 --BusinessNameLine1
24 --BusinessNameLine1Txt
25 --BusinessNameLine2
26 --BusinessNameLine2Txt
27 --CashGrantAmt
28 --PersonNm
29 --RelationshipWithOrgTxt
30 --TypeOfAssistanceTxt
We can use these visual representation to identify the grouping variables or "table headers" that are used to parse 1:M tables (group nodes are extracted and each is flattened into a unique row of the table).
/ IRS990ScheduleL / Form990ScheduleLPartIII
/ IRS990ScheduleL / GrntAsstBnftInterestedPrsnGrp
r
TABLE.HEADERS <- get_table_headers()
TABLE.HEADERS[[ "SL-P03-T01-GRANTS-INTERESTED-PERS" ]]
$`SL-P03-T01-GRANTS-INTERESTED-PERS`
[1] "//IRS990ScheduleL/Form990ScheduleLPartIII"
[2] "//IRS990ScheduleL/GrntAsstBnftInterestedPrsnGrp"
The build_database() Function
The workhorse function in the package is the build_database() function, which is a wrapper for the primary data workflow:
r
build_database( index=NULL, years=NULL, group.size=200 )
- If no index is provided (=NULL), download the most recent version from the Data Commons.
- Limit the build to the specified years and only include 990 and 990EZ return types.
- Activate logging via BUILD-LOG.txt.
- Split the index into batches using the group.size argument (more RAM allows for larger groups, which minimizes the number of read/write steps).
- Build all tables for each batch and write them to temporary CSV files.
- Once all batches are complete, compile the temporary files into a one table for year for each table.
- Log all of the anomolies and save a history file from the session.
If we walk through these steps manually it would look something like:
```r TABLES <- gettablenames() # list tables defined in the concordance FX.NAMES <- getfxnames( TABLES ) # return the corresponding build functions
YEAR <- 2020 index2020 <- dplyr::filter( tinyindex, TaxYear == YEAR ) # built in index file for testing index100 <- dplyr::sample_n( index2020, size=100 )
CUSTOM TABLE SELECTION
TABLES <- c( "F9-P00-T00-HEADER", "F9-P03-T00-PROGRAM-ONE", "F9-P03-T00-PROGRAM-TWO", "F9-P03-T00-PROGRAM-THREE", "F9-P03-T01-PROGRAMS-OTHER" )
FX.NAMES <- getfxnames( TABLES )
PARSE DATA FOR ONE NONPROFIT
url <- index100$URL[1] one.npo <- parse_npo( url, year=2000, fx.names=FX.NAMES, logXP=F )
names(one.npo) # ALL PARSED TABLES STORED AS LIST
[1] "BUILDF9P00T00HEADER" "BUILDF9P03T00PROGRAMONE"
[3] "BUILDF9P03T00PROGRAMTWO" "BUILDF9P03T00PROGRAMTHREE"
[5] "BUILDF9P03T01PROGRAMSOTHER" "BUILDSCHEDULETABLE"
one.npo[["BUILDF9P03T00PROGRAM_ONE"]] |> t() |> knitr::kable() # TABLE VALUES FOR SINGLE NONPROFIT
| | | |:---------------------|:----------------------------------------------------------------------------------------------------------------------------------------------| |EIN2 |EIN-81-5406671 | |OBJECTID |OID-202211339349308111 | |ORGEIN |815406671 | |ORGNAMEL1 |RUTHERFORD COUNTY ECONOMIC | |ORGNAMEL2 |DEVELOPMENT ASSOCIATION | |RETURNAMENDEDX |FALSE | |RETURNGROUPX |FALSE | |RETURNPARTIALX |FALSE | |RETURNTAXPERDAYS |365 | |RETURNTIMESTAMP |2022-05-13T15:13:27-05:00 | |RETURNTYPE |990 | |TAXPERIODBEGINDATE |2020-07-01 | |TAXPERIODENDDATE |2021-06-30 | |TAXYEAR |2020 | |URL |https://gt990datalake-rawdata.s3.amazonaws.com/EfileData/XmlFiles/202211339349308111public.xml | |VERSION |2020v4.0 | |F903PROGCODE |NA | |F903PROGDESC |THE ASSOCIATION RECEIVES DONATIONS FROM LOCAL GOVERNMENTS AND BUSINESSES AND USES THOSE FUNDS TO FURTHER THE ECONOMY IN RUTHERFORD COUNTY, NC. | |F903PROGEXP |NA | |F903PROGGRANT |NA | |F903PROG_REV |NA |
PARSE DATA AND SAVE TABLES
FOR ALL NONPROFITS IN URLS VECTOR
urls <- index100$URL YEAR <- "2020" dir.create(YEAR) build_tables( urls, year=YEAR, table.names=TABLES )
EQUIVALENT TO ABOVE
build_tables( urls, year=YEAR, fx.names=FX.NAMES )
PARSE DATA FOR A LIST OF NONPROFITS
URLS <- index100$URL TIMESTAMP <- format(Sys.time(), "%b-%d-%Y-%Hh-%Mm")
all.npos <- purrr::map( URLS, parsenpo, FX.NAMES ) # parse tables for all orgs in URLS df.expenses <- "BUILDF9P09T00EXPENSES" %>% # compile table F9-P09-T00-EXPENSES from the list getfxdf( all.npos, TIMESTAMP, YEAR ) # and save a CSV to file with TABLE NAME + TIMESTAMP (unique batch)
ALL TABLES TOGETHER
build_tables( urls=URLS, year=YEAR ) # write all to CSV, returns failed URLS if any ```
These functions are useful for testing purposes, but once you surpass a minimal batch size (the number of returns processed together) you will eventually run out of RAM. Large samples need to be split into smaller parts so collection can be serialized. Each batch is saved as a set of CSV files (one for each table), and once finished the batched CSV files are compiled.
```r
SPLIT INDEX INTO BATCHES
YEAR <- 2020 create_batchfiles( index100, years=YEAR, group.size=20 ) # creates "2020/BATCHFILE.RDS"
ACCESS THE BATCHFILE
bf <- get_batchfile( 2020 ) names(bf)
"G1{20}" "G2{20}" "G3{20}"
"G4{20}" "G5{20}"
Each batch consists of 20 URLs
URLS.01 <- bf[[ "G1{20}" ]] # BATCH 01 build_tables( urls=URLS.01, year=2020, fx.names=FX.NAMES )
URLS.02 <- bf[[ "G2" ]] # BATCH 02 build_tables( urls=URLS.02, year=2020, fx.names=FX.NAMES )
ALL BATCHES IN PARALLEL
TABLES <- c( "F9-P00-T00-HEADER", "F9-P03-T00-PROGRAM-ONE", "F9-P03-T00-PROGRAM-TWO", "F9-P03-T00-PROGRAM-THREE", "F9-P03-T01-PROGRAMS-OTHER" )
FX.NAMES <- getfxnames( TABLES )
bf <- getbatchfile( 2020 ) processbatch( bf, year=2020, fx.names=FX.NAMES ) ```
Again, these steps are all wrapped into a single workflow function. This one line would be equivalent to the steps covered above:
r
build_database( index100, years=2020, batch.size=20 )
Since large builds can take a long time (several days without a large processor) they may get interrupted when a computer shuts down or freezes. The BATCHFILES keep track of which batches are complete and which are remaining, so you can restart a process at any time by navigating back to the project directory (the folder where BUILD-LOG.txt is stored) and resume data collection:
r
resume_build_database()
Package Updates
Any time the concordance file is updated it can impact the package. As a result, the build functions need to be updated.
r
create_code_chunk( "F9-P03-T00-PROGRAM-ONE" ) # updates BUILD_F9_P03_T00_PROGRAM_ONE()
create_code_chunk_rdb( "F9-P07-T01-COMPENSATION" ) # updates BUILD_F9_P07_T01_COMPENSATION()
Or all together:
r
build_all_chunks() # update functions in /R
update_tinyindex() # update objects in /data
update_concordance() # update objects in /data
update_xpaths() # update objects in /data
Owner
- Name: Nonprofit Open Data Collective
- Login: Nonprofit-Open-Data-Collective
- Kind: organization
- Repositories: 15
- Profile: https://github.com/Nonprofit-Open-Data-Collective
GitHub Events
Total
- Create event: 5
- Issues event: 1
- Release event: 4
- Watch event: 10
- Issue comment event: 2
- Push event: 34
- Fork event: 3
Last Year
- Create event: 5
- Issues event: 1
- Release event: 4
- Watch event: 10
- Issue comment event: 2
- Push event: 34
- Fork event: 3
Dependencies
- R.utils * imports
- XML * imports
- dplyr * imports
- httr * imports
- jsonlite * imports
- purrr * imports
- xml2 * imports
- xmltools * imports