https://github.com/catalyst-cooperative/pudl-scrapers

Scrapers used to acquire snapshots of raw data inputs for versioned archiving and replicable analysis.

Science Score: 10.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
  • .zenodo.json file
  • DOI references
  • Academic publication links
  • Committers with academic emails
    2 of 10 committers (20.0%) from academic institutions
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (7.4%) to scientific vocabulary

Keywords

archiving eia epa ferc open-data public-data public-dataset pudl python scraper scrapy webscraper

Keywords from Contributors

policy zenodo reproducibility environmental-data energy-data climate-change emissions natural-gas ghg etl
Last synced: 5 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: catalyst-cooperative
  • License: MIT
  • Language: HTML
  • Default Branch: main
  • Homepage:
  • Size: 347 KB
Statistics
  • Stars: 3
  • Watchers: 5
  • Forks: 3
  • Open Issues: 0
  • Releases: 0
Archived
Topics
archiving eia epa ferc open-data public-data public-dataset pudl python scraper scrapy webscraper
Created over 6 years ago · Last pushed about 3 years ago
Metadata Files
Readme License

README.md

PUDL Scrapers

Deprecated

This repo has been replaced by the new pudl-archiver repo, which combines the scraping and archiving processes.

Installation

We recommend using conda to create and manage your environment.

Run:

conda env create -f environment.yml
conda activate pudl-scrapers

Output location

Logs are collected in: [your home]/Downloads/pudl_scrapers/scraped/

Data from the scrapers is stored in: [your home]/Downloads/pudl_scrapers/scraped/[source_name]/[today #]
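The per-source, per-run output path described above can be sketched as a small helper. This is a hypothetical illustration, not code from the repo; in particular, the exact format of the `[today #]` directory name is not specified here, so an ISO date is assumed:

```python
from datetime import date
from pathlib import Path
from typing import Optional

def scraped_output_dir(source_name: str, home: Optional[Path] = None) -> Path:
    """Hypothetical helper mirroring the layout described above:
    [your home]/Downloads/pudl_scrapers/scraped/[source_name]/[today #].
    The date-directory format (ISO date) is an assumption."""
    home = home or Path.home()
    return (home / "Downloads" / "pudl_scrapers" / "scraped"
            / source_name / date.today().isoformat())

# Example: where today's eia860 crawl would land.
print(scraped_output_dir("eia860"))
```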

Running the scrapers

The general pattern is scrapy crawl [source_name] for one of the supported sources. Typically an additional "year" argument is available, in the form scrapy crawl [source_name] -a year=[year].

See below for exact commands and available arguments.
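For sources that accept a year argument, several years can be queued from the shell. A dry-run sketch (the `echo` prints each command instead of running it; drop the `echo` to actually crawl, with the conda environment active):

```shell
# Print the scrapy invocations for a range of years; remove `echo`
# to run the crawls for real.
for year in 2018 2019 2020; do
    echo scrapy crawl eia860 -a year="$year"
done
```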

2010 Census DP1 GeoDatabase

scrapy crawl censusdp1tract

No other options.

EPA CEMS

For full instructions:

scrapy crawl epacems --help

EIA Bulk Electricity Data

scrapy crawl eia_bulk_elec

No other options.

EPA CAMD to EIA Crosswalk

To collect the data and field descriptions:

scrapy crawl epacamd_eia

EIA860

To collect all the data:

scrapy crawl eia860

To collect a specific year (e.g., 2007):

scrapy crawl eia860 -a year=2007

EIA860M

To collect all the data:

scrapy crawl eia860m

To collect a specific month & year (e.g., August 2020):

scrapy crawl eia860m -a month=August -a year=2020

EIA861

To collect all the data:

scrapy crawl eia861

To collect a specific year (e.g., 2007):

scrapy crawl eia861 -a year=2007

EIA923

To collect all the data:

scrapy crawl eia923

To collect a specific year (e.g., 2007):

scrapy crawl eia923 -a year=2007

FERC Forms 1, 2, 6, & 60:

To collect all the data:

scrapy crawl ferc1
scrapy crawl ferc2
scrapy crawl ferc6
scrapy crawl ferc60

There are no subsets enabled.
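The four FERC form crawls above can be scripted in a single loop. A dry-run sketch (drop the `echo` to actually run each crawl):

```shell
# Print one scrapy invocation per FERC form; remove `echo` to run them.
for form in ferc1 ferc2 ferc6 ferc60; do
    echo scrapy crawl "$form"
done
```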

FERC 714

To collect the data:

scrapy crawl ferc714

There are no subsets.

Owner

  • Name: Catalyst Cooperative
  • Login: catalyst-cooperative
  • Kind: organization
  • Email: hello@catalyst.coop
  • Location: United States of America

Catalyst is a small data engineering cooperative working on electricity regulation and climate change.


Committers

Last synced: about 2 years ago

All Time
  • Total Commits: 173
  • Total Committers: 10
  • Avg Commits per committer: 17.3
  • Development Distribution Score (DDS): 0.763
Past Year
  • Commits: 37
  • Committers: 3
  • Avg Commits per committer: 12.333
  • Development Distribution Score (DDS): 0.541
Top Committers
Name Email Commits
Pablo Virgo m****x@p****m 41
Zane Selvans z****s@c****p 38
zschira z****3@c****u 28
dependabot[bot] 4****] 24
pre-commit-ci[bot] 6****] 15
Austen Sharpe a****e@g****m 12
bendnorman b****9@c****u 7
Christina Gosnell c****l@c****p 4
karldw k****w 2
t-desktop t****h@g****m 2

Issues and Pull Requests

Last synced: about 2 years ago

All Time
  • Total issues: 25
  • Total pull requests: 54
  • Average time to close issues: 4 months
  • Average time to close pull requests: 5 days
  • Total issue authors: 5
  • Total pull request authors: 10
  • Average comments per issue: 1.12
  • Average comments per pull request: 0.8
  • Merged pull requests: 51
  • Bot issues: 0
  • Bot pull requests: 37
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • zaneselvans (7)
  • cmgosnell (6)
  • zschira (6)
  • ptvirgo (5)
  • aesharpe (1)
Pull Request Authors
  • dependabot[bot] (24)
  • pre-commit-ci[bot] (13)
  • aesharpe (3)
  • bendnorman (3)
  • ptvirgo (2)
  • zschira (2)
  • cmgosnell (2)
  • zaneselvans (2)
  • TrentonBush (2)
  • karldw (1)
Top Labels
Issue Labels
rmi (4) ferc1 (3) datastore (2) xbrl (2) testing (1) ferc2 (1) ferc714 (1) dbf (1) eia923 (1)
Pull Request Labels
dependencies (17) rmi (3) github_actions (2) ferc1 (1) ferc2 (1) ferc6 (1) ferc60 (1) dbf (1) enhancement (1)

Dependencies

environment.yml conda
  • factory_boy >=2.12
  • pytest >=5.2
  • scrapy >=1.7
requirements.txt pypi
  • factory_boy >=2.12
  • pytest >=5.2
  • scrapy >=1.7