https://github.com/datamade/car-scraper

💲Make spreadsheets out of Chicago Association of REALTORS® reports

Science Score: 36.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • ○
    CITATION.cff file
  • ✓
    codemeta.json file
    Found codemeta.json file
  • ✓
    .zenodo.json file
    Found .zenodo.json file
  • ○
    DOI references
  • ○
    Academic publication links
  • ✓
    Committers with academic emails
    1 of 2 committers (50.0%) from academic institutions
  • ○
    Institutional organization owner
  • ○
    JOSS paper metadata
  • ○
    Scientific vocabulary similarity
    Low similarity (13.7%) to scientific vocabulary

Keywords

makefile pdf-converter web-scraping
Last synced: 5 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: datamade
  • Language: Python
  • Default Branch: master
  • Homepage:
  • Size: 69.3 KB
Statistics
  • Stars: 3
  • Watchers: 2
  • Forks: 2
  • Open Issues: 2
  • Releases: 0
Created almost 9 years ago · Last pushed almost 3 years ago
Metadata Files
Readme

README.md

CAR Scraper

Grab Chicagoland real estate reports from the CAR website and convert them all to spreadsheets.

Requirements

Make sure you have OS-level requirements installed:

  • Python 3.3+ (standard DataMade tool)
  • Java (or any JRE)
  • pdfinfo (built in on Ubuntu and available for other Linux distros as part of Xpdf; Mac users can install the Poppler fork via Homebrew: brew install poppler)
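
Before moving on, it can help to confirm that these requirements are actually available. The following is a minimal sketch, not part of the repo: the function name `missing_requirements` is hypothetical, and it assumes the Java runtime is exposed as a `java` executable on your PATH.

```python
import shutil
import sys

# Executables the requirements list names (assumed to be on PATH under these names).
REQUIRED_TOOLS = ("java", "pdfinfo")


def missing_requirements(tools=REQUIRED_TOOLS, min_python=(3, 3)):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    if sys.version_info < min_python:
        problems.append("Python %d.%d+ is required" % min_python)
    for tool in tools:
        # shutil.which returns None when the executable is not found on PATH.
        if shutil.which(tool) is None:
            problems.append("%s not found on PATH" % tool)
    return problems


if __name__ == "__main__":
    for problem in missing_requirements():
        print(problem)
```

If the script prints nothing, the interpreter version and both executables were found.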

Then, make a virtualenv and install Python requirements:

mkvirtualenv car-scraper
pip install -U -r requirements.txt

Finally, build tabula-java 0.9.1 from source:

make tabula-java

Running the scraper

You'll need to decrypt the CAR login credentials before you can scrape the PDFs. If you're on the keyring for this repo, you can decrypt the secrets file:

blackbox_cat configs/secrets.py.gpg > scripts/secrets.py

Otherwise, copy over the example secrets file:

cp configs/secrets.example.py scripts/secrets.py

Then, adjust the variables to reflect your CAR username and password:

CAR_USER = '<your_username>'
CAR_PASS = '<your_password>'

Set the desired month and year for the reports in config.mk:

```bash
# follow this format:
year = 2016
month = 02
```

Use the DataMade Make standard operating procedure to get your files. make all produces the final output for the year/month you selected, and make clean removes all generated files from your repo.

Output

Output files land in the final/ directory. Files with monthly in the name catalogue month-over-month statistics, while files with yearly in the name catalogue year-to-date totals.

If you're interested in year-end statistics, just run the scraper for December of a given year ($(month) = 12) and grab the yearly files. These are the files we use in Where to Buy.
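
To pick out the yearly files after a run, something like the sketch below could work. It relies only on the naming rule described above (monthly or yearly appearing in the file name); the function name `split_outputs` is hypothetical and not part of the scraper.

```python
from pathlib import Path


def split_outputs(final_dir="final"):
    """Group files in final_dir by whether 'monthly' or 'yearly' appears in the name."""
    groups = {"monthly": [], "yearly": []}
    for path in sorted(Path(final_dir).glob("*")):
        for kind in groups:
            # Skip subdirectories; match on the file name only.
            if path.is_file() and kind in path.name:
                groups[kind].append(path.name)
    return groups
```

Calling `split_outputs()["yearly"]` from the repo root after a December run would then list the year-end files.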

Errors

In the process of cleaning the CSVs, the scraper will double-check to make sure that table values look plausible. It will print these errors to the console while making the target cleaned_csvs, but you can also examine the output file conversion_errors.csv if you want to inspect further. Error messages look something like this:

Percentage error in raw/csvs/suburbs/clean/DuPage_County_4.csv
Community: Carol Stream
Column: months_supply_change
Row value: -35.8
Calculated delta: -34.5
(Note: calculated deltas should be within +-1 of the row value.)

CAR often slightly miscalculates changes in values between years, as you can see above. This is the most frequent error I've encountered, and you can safely ignore it as long as the delta is within a reasonable range.
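
The kind of plausibility check described here can be sketched as follows. This is an illustration rather than the scraper's actual code: it assumes the calculated delta is a plain percent change between a previous and a current value, the function names are hypothetical, and the sample inputs (5.5 and 3.6) are invented to reproduce a delta of -34.5. Only the tolerance of 1 comes from the error message.

```python
def percent_change(previous, current):
    """Percent change from previous to current (assumed definition of the delta)."""
    if previous == 0:
        raise ValueError("percent change is undefined when the previous value is 0")
    return (current - previous) / previous * 100.0


def check_row(reported_change, previous, current, tolerance=1.0):
    """Return (is_plausible, calculated_delta) for one table row."""
    calculated = round(percent_change(previous, current), 1)
    # The row passes when the reported change is within the tolerance of the calculated one.
    return abs(calculated - reported_change) <= tolerance, calculated


# Hypothetical row: reported change of -35.8 with invented underlying values.
ok, calculated = check_row(-35.8, previous=5.5, current=3.6)
print(ok, calculated)
```

With these invented inputs the reported and calculated values differ by 1.3, so the row would be flagged, matching the shape of the message above.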

Team

  • Jean Cochrane - code
  • Forest Gregg - mentorship

Owner

  • Name: datamade
  • Login: datamade
  • Kind: organization
  • Email: info@datamade.us
  • Location: Chicago, IL

We build open source technology using open data to empower journalists, researchers, governments and advocacy organizations.

Committers

Last synced: 7 months ago

All Time
  • Total Commits: 19
  • Total Committers: 2
  • Avg Commits per committer: 9.5
  • Development Distribution Score (DDS): 0.053
Past Year
  • Commits: 0
  • Committers: 0
  • Avg Commits per committer: 0.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
Jean Cochrane j****n@j****m 18
Forest Gregg f****g@u****u 1
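
The summary figures above can be reproduced from the two commit counts. The sketch below assumes the Development Distribution Score is defined as 1 minus the top committer's share of all commits; that definition is an assumption that happens to match the numbers shown, not something this page states.

```python
def development_distribution_score(commit_counts):
    """Return 1 minus the top committer's share of all commits (assumed definition)."""
    total = sum(commit_counts)
    if total == 0:
        # No commits in the period: report 0.0, as the Past Year block does.
        return 0.0
    return 1 - max(commit_counts) / total


# Commit counts taken from the Top Committers listing.
commits = {"Jean Cochrane": 18, "Forest Gregg": 1}
dds = development_distribution_score(list(commits.values()))
average = sum(commits.values()) / len(commits)
print(round(dds, 3), average)
```

Under that assumption, 1 - 18/19 rounds to 0.053 and 19 commits across 2 committers averages 9.5, in line with the All Time figures.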
Issues and Pull Requests

Last synced: 7 months ago

All Time
  • Total issues: 0
  • Total pull requests: 3
  • Average time to close issues: N/A
  • Average time to close pull requests: 20 days
  • Total issue authors: 0
  • Total pull request authors: 1
  • Average comments per issue: 0
  • Average comments per pull request: 0.33
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 3
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
Pull Request Authors
  • dependabot[bot] (3)
Top Labels
Issue Labels
Pull Request Labels
  • dependencies (3)

Dependencies

requirements.txt pypi
  • lxml ==3.7.3
  • requests ==2.13.0