https://github.com/catalyst-cooperative/pudl-scrapers

Scrapers used to acquire snapshots of raw data inputs for versioned archiving and replicable analysis.

Science Score: 10.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
  • .zenodo.json file
  • DOI references
  • Academic publication links
  • Committers with academic emails
    2 of 10 committers (20.0%) from academic institutions
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (7.4%) to scientific vocabulary

Keywords

archiving eia epa ferc open-data public-data public-dataset pudl python scraper scrapy webscraper

Keywords from Contributors

policy zenodo reproducibility environmental-data energy-data climate-change emissions natural-gas ghg etl
Last synced: 5 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: catalyst-cooperative
  • License: MIT
  • Language: HTML
  • Default Branch: main
  • Homepage:
  • Size: 347 KB
Statistics
  • Stars: 3
  • Watchers: 5
  • Forks: 3
  • Open Issues: 0
  • Releases: 0
Archived
Topics
archiving eia epa ferc open-data public-data public-dataset pudl python scraper scrapy webscraper
Created over 6 years ago · Last pushed about 3 years ago
Metadata Files
Readme License

README.md

PUDL Scrapers

Deprecated

This repo has been replaced by the new pudl-archiver repo, which combines the scraping and archiving processes.

Installation

We recommend using conda to create and manage your environment.

Run:

conda env create -f environment.yml
conda activate pudl-scrapers

Output location

Logs are collected in: [your home]/Downloads/pudl_scrapers/scraped/

Data from the scrapers is stored in: [your home]/Downloads/pudl_scrapers/scraped/[source_name]/[today #]
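The per-source, per-run output path described above can be sketched as a small helper. This is a hypothetical illustration, not code from the repo; in particular, the exact format of the `[today #]` directory name is not specified here, so an ISO date is assumed:

```python
from datetime import date
from pathlib import Path
from typing import Optional

def scraped_output_dir(source_name: str, home: Optional[Path] = None) -> Path:
    """Hypothetical helper mirroring the layout described above:
    [your home]/Downloads/pudl_scrapers/scraped/[source_name]/[today #].
    The date-directory format (ISO date) is an assumption."""
    home = home or Path.home()
    return (home / "Downloads" / "pudl_scrapers" / "scraped"
            / source_name / date.today().isoformat())

# Example: where today's eia860 crawl would land.
print(scraped_output_dir("eia860"))
```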

Running the scrapers

The general pattern is scrapy crawl [source_name] for one of the supported sources. Typically an additional "year" argument is available, in the form scrapy crawl [source_name] -a year=[year].

See below for exact commands and available arguments.
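For sources that accept a year argument, several years can be queued from the shell. A dry-run sketch (the `echo` prints each command instead of running it; drop the `echo` to actually crawl, with the conda environment active):

```shell
# Print the scrapy invocations for a range of years; remove `echo`
# to run the crawls for real.
for year in 2018 2019 2020; do
    echo scrapy crawl eia860 -a year="$year"
done
```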

2010 Census DP1 GeoDatabase

scrapy crawl censusdp1tract

No other options.

EPA CEMS

For full instructions:

scrapy crawl epacems --help

EIA Bulk Electricity Data

scrapy crawl eia_bulk_elec

No other options.

EPA CAMD to EIA Crosswalk

To collect the data and field descriptions:

scrapy crawl epacamd_eia

EIA860

To collect all the data:

scrapy crawl eia860

To collect a specific year (e.g., 2007):

scrapy crawl eia860 -a year=2007

EIA860M

To collect all the data:

scrapy crawl eia860m

To collect a specific month & year (e.g., August 2020):

scrapy crawl eia860m -a month=August -a year=2020

EIA861

To collect all the data:

scrapy crawl eia861

To collect a specific year (e.g., 2007):

scrapy crawl eia861 -a year=2007

EIA923

To collect all the data:

scrapy crawl eia923

To collect a specific year (e.g., 2007):

scrapy crawl eia923 -a year=2007

FERC Forms 1, 2, 6, & 60:

To collect all the data:

scrapy crawl ferc1
scrapy crawl ferc2
scrapy crawl ferc6
scrapy crawl ferc60

There are no subsets enabled.
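The four FERC form crawls above can be scripted in a single loop. A dry-run sketch (drop the `echo` to actually run each crawl):

```shell
# Print one scrapy invocation per FERC form; remove `echo` to run them.
for form in ferc1 ferc2 ferc6 ferc60; do
    echo scrapy crawl "$form"
done
```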

FERC 714

To collect the data:

scrapy crawl ferc714

There are no subsets.

Owner

  • Name: Catalyst Cooperative
  • Login: catalyst-cooperative
  • Kind: organization
  • Email: hello@catalyst.coop
  • Location: United States of America

Catalyst is a small data engineering cooperative working on electricity regulation and climate change.


Committers

Last synced: about 2 years ago

All Time
  • Total Commits: 173
  • Total Committers: 10
  • Avg Commits per committer: 17.3
  • Development Distribution Score (DDS): 0.763
Past Year
  • Commits: 37
  • Committers: 3
  • Avg Commits per committer: 12.333
  • Development Distribution Score (DDS): 0.541
Top Committers
Name Email Commits
Pablo Virgo m****x@p****m 41
Zane Selvans z****s@c****p 38
zschira z****3@c****u 28
dependabot[bot] 4****] 24
pre-commit-ci[bot] 6****] 15
Austen Sharpe a****e@g****m 12
bendnorman b****9@c****u 7
Christina Gosnell c****l@c****p 4
karldw k****w 2
t-desktop t****h@g****m 2

Issues and Pull Requests

Last synced: about 2 years ago

All Time
  • Total issues: 25
  • Total pull requests: 54
  • Average time to close issues: 4 months
  • Average time to close pull requests: 5 days
  • Total issue authors: 5
  • Total pull request authors: 10
  • Average comments per issue: 1.12
  • Average comments per pull request: 0.8
  • Merged pull requests: 51
  • Bot issues: 0
  • Bot pull requests: 37
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • zaneselvans (7)
  • cmgosnell (6)
  • zschira (6)
  • ptvirgo (5)
  • aesharpe (1)
Pull Request Authors
  • dependabot[bot] (24)
  • pre-commit-ci[bot] (13)
  • aesharpe (3)
  • bendnorman (3)
  • ptvirgo (2)
  • zschira (2)
  • cmgosnell (2)
  • zaneselvans (2)
  • TrentonBush (2)
  • karldw (1)
Top Labels
Issue Labels
rmi (4) ferc1 (3) datastore (2) xbrl (2) testing (1) ferc2 (1) ferc714 (1) dbf (1) eia923 (1)
Pull Request Labels
dependencies (17) rmi (3) github_actions (2) ferc1 (1) ferc2 (1) ferc6 (1) ferc60 (1) dbf (1) enhancement (1)

Dependencies

environment.yml conda
  • factory_boy >=2.12
  • pytest >=5.2
  • scrapy >=1.7
requirements.txt pypi
  • factory_boy >=2.12
  • pytest >=5.2
  • scrapy >=1.7