panda-loaders

Scripts to load data into a PANDA instance

https://github.com/higs4281/panda-loaders

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (12.8%) to scientific vocabulary
Last synced: 8 months ago

Repository

Scripts to load data into a PANDA instance

Basic Info
  • Host: GitHub
  • Owner: higs4281
  • License: MIT
  • Language: Python
  • Default Branch: main
  • Homepage:
  • Size: 104 KB
Statistics
  • Stars: 3
  • Watchers: 1
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Created almost 12 years ago · Last pushed 10 months ago
Metadata Files
Readme License Citation

README.md

panda-loaders

These scripts began as a way to load voter data into a PANDA instance via its API.

Along the way, the end-of-life of Python 2 appeared on the horizon, and the wonderful PANDA project has fallen a bit out of date.

We're not giving up on PANDA, but for now, the voter script has been updated to run in Python 3.6+ and to be able to create a Postgres database of voter data as an alternative to feeding the data to a PANDA instance.

If you use the Postgres option, you'll get an indexed database ready to be plugged into a Django project, if desired. This repo is not a working Django project, but a Django model that can be used to hook up the Postgres voter db is included in /voters/models.py.

Getting started

First, install the Python requirements:

```bash
pip install -r requirements/base.txt
```

The voter script is tailored to Florida but could be adapted. It requires some source data and a date value:

- County voter registration files, which in Florida are available from the state's Division of Elections.
- A VOTER_DATA_DATE value in YYYY-MM-DD form, provided by an environment variable of that name or by manually editing the script's global variable. This is used to name the database, to provide a default source_date in the `voters_voter` table and to provide a YEAR value for accessing the right data directory.
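The date handling described above can be sketched as follows. This is an illustration, not the script's actual code: the environment variable is assumed to be named VOTER_DATA_DATE (the README's formatting obscures the exact spelling), and the database-naming scheme and helper name are mine.

```python
import datetime
import os

# Assumed behavior, per the README: the date names the database and
# supplies the YEAR used to locate the data directory.
VOTER_DATA_DATE = os.getenv("VOTER_DATA_DATE", "2019-03-12")  # hypothetical default

def derive_run_values(date_string):
    """Parse a YYYY-MM-DD string and derive a year and a database name."""
    parsed = datetime.datetime.strptime(date_string, "%Y-%m-%d").date()
    year = str(parsed.year)  # e.g. data is expected under a 2019/ directory
    db_name = "voters_{}".format(date_string.replace("-", "_"))  # illustrative naming
    return parsed, year, db_name
```

Failing fast on a malformed date here (strptime raises ValueError) is preferable to discovering the problem mid-load.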

Local directories

The load_county_voters script uses some local directories as it transforms the raw data.

The assumed directory structure, all expected to live under a directory named for the data year (such as 2019/), includes these folders:

Folder | Use
:---- | :--
VoterDetail | Put raw county voter files here
load | Prepped voter files
loaded | Voter files that were sent to PANDA
prep | Processing folder
temp | Processing folder

The load_county_voters script will look for a local environment variable, PANDA_LOADERS_BASE_DIR, to use as the base directory for the above folders, and will default to /tmp as the base directory if no environment var is found.
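The directory resolution described above can be sketched like this; the folder names come from the table, while the helper function is a hypothetical stand-in for the script's internals:

```python
import os

# Folder names from the README's directory table.
FOLDERS = ["VoterDetail", "load", "loaded", "prep", "temp"]

def data_dirs(year):
    """Resolve the per-year data folders under the configured base directory,
    defaulting to /tmp when PANDA_LOADERS_BASE_DIR is unset (per the README)."""
    base = os.getenv("PANDA_LOADERS_BASE_DIR", "/tmp")
    year_dir = os.path.join(base, str(year))
    return {name: os.path.join(year_dir, name) for name in FOLDERS}
```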

Running the script

With raw voter files in place, you can run the voter script with `python load_county_voters.py` plus one of four arguments:

1. [RAW FILE]: Pass the name of a raw county voter file, such as BAY_20190312.txt. If the raw file is in /VoterDetail/, this preps that single file for loading to a database or to PANDA.
2. prep_files: Preps all raw county files found in /VoterDetail/ and makes them ready for export to a database or to PANDA.
3. load_to_postgres: After files are prepped, this creates, loads and indexes a Postgres voter database. If you put all 67 Florida county files in the /VoterDetail/ directory and prep them, this creates a statewide database of Florida voters.
4. export_to_panda: Exports all prepped county files to a PANDA instance, creating one dataset per county.
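A minimal sketch of how that four-way argument dispatch might look. The handler functions here are placeholders, not the script's actual internals:

```python
import sys

# Hypothetical handlers standing in for the real script's functions.
def prep_single_file(name):
    return "prepped {}".format(name)

COMMANDS = {
    "prep_files": lambda: "prepped all county files",
    "load_to_postgres": lambda: "created and loaded a postgres db",
    "export_to_panda": lambda: "exported prepped files to panda",
}

def dispatch(argument):
    """Route a known command to its handler; any other argument is
    treated as the name of a raw county voter file."""
    action = COMMANDS.get(argument)
    if action is not None:
        return action()
    return prep_single_file(argument)

if __name__ == "__main__" and len(sys.argv) > 1:
    print(dispatch(sys.argv[1]))
```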

PANDA loading

If you use the script to load data into a PANDA instance, the process uses PANDA's API, which is slow. But the API method has some advantages over PANDA's manual data upload process:

- It sidesteps memory issues that you can encounter in PANDA's loading GUI.
- It only uses PANDA index space, rather than index space + file storage space.
- It results in a dataset with external_id values, which makes rows editable via the API.
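The external_id advantage shows up in the shape of the payload sent to PANDA's bulk-update endpoint. A simplified builder, with illustrative field names (the citation script further below constructs the same structure inline):

```python
def build_bulk_payload(rows, id_key="id"):
    """Shape rows into PANDA's bulk-update payload. Setting external_id
    makes each row individually addressable, and therefore editable,
    through the API later."""
    return {
        "objects": [
            {
                "external_id": str(row[id_key]),
                "data": [value for key, value in row.items() if key != id_key],
            }
            for row in rows
        ]
    }
```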

Owner

  • Name: william higgins
  • Login: higs4281
  • Kind: user
  • Company: Consumer Financial Protection Bureau

Web developer @cfpb

Citation (citations/load_tampa_bike_citations.py)

from __future__ import unicode_literals

import csv
import datetime
import json
import os
import sys

import requests
from dateutil import parser
from django.template.defaultfilters import slugify


class RunVars:
    """Track runtime variables."""

    def __init__(self):
        self.starter = None
        self.processed = 0
        self.updated = 0
        self.created = 0
        self.passed = 0

# panda params
RUNVARS = RunVars()
RUNVARS.starter = datetime.datetime.now()
PANDA_BASE_URL = os.getenv('PANDA_BASE_URL')
PANDA_AUTH_PARAMS = {
    'email': os.getenv('PANDA_USER'),
    'api_key': os.getenv('PANDA_API_KEY')
}
QUERY_LIMIT = 1200
PANDA_BULK_UPDATE_SIZE = 1000
PANDA_API = '%s/api/1.0' % PANDA_BASE_URL
PANDA_ALL_DATA_URL = "%s/data/" % PANDA_API
PANDA_DATASET_BASE = "%s/dataset" % PANDA_API
DATASET_NAME = "Tampa bicycle citations 2003 to 2014"
DATASET_SLUG = slugify(DATASET_NAME)
DATASET_URL = '{}/{}/'.format(PANDA_DATASET_BASE, DATASET_SLUG)
DATA_URL = '{}data/'.format(DATASET_URL)


def parse_dob(value):
    turndate = datetime.date(2004, 1, 1)  # allow for any citees as young as 10
    try:
        dob = parser.parse(value.strip()).date()
    except Exception as e:
        print("Couldn't parse dob {} ({})".format(value, e))
        return None
    else:
        if dob == datetime.date.today():
            return None
        elif dob > turndate:
            return dob.replace(year=dob.year - 100)
        else:
            return dob


def panda_get(url, params=None):
    # Avoid a mutable default argument; updating a shared default dict
    # would leak auth params between calls.
    params = dict(params or {})
    params.update(PANDA_AUTH_PARAMS)
    return requests.get(url, params=params)


def panda_put(url, data, params=None):
    params = dict(params or {})
    params.update(PANDA_AUTH_PARAMS)
    return requests.put(
        url, data,
        params=params,
        headers={'Content-Type': 'application/json'})


def panda_delete(url, params=None):
    params = dict(params or {})
    params.update(PANDA_AUTH_PARAMS)
    return requests.delete(url, params=params)


"""
RAW HEADINGS
  1: ID
  2: Uniform Case Number
  3: Citaion Number
  4: Case Number
  5: Law Enf Agency Name
  6: Law Enf Officer Name
  7: Offense Date
  8: Received Date
  9: Last Name
 10: First Name
 11: Middle Name
 12: Suffix
 13: Address Line 1
 14: Address Line 2               ## add to 13
 15: City                                ## add to 13
 16: State                              ## add to 13
 17: Zip Code                        ## add to 13
 18: Date Of Birth
 19: Race
 20: Gender
 21: Driver License Number
 22: Driver License State       ## add to 21
 23: Commercial Vehicle
 24: Statute
 25: Statute Description
 26: Posted Speed
 27: Actual Speed
 28: Disposition
 29: Disposition Date
 30: Amount Paid
 31: Date Paid
 32: Defensive Driving School (DDS)
 33: DDS Court Ordered
 34: DDS Elected (Regular)
 35: DDS Elected (Advanced)
 36: Tag Number
 37: Tag State                  ## add to 36
 38: Case Filed Date
 39: Case Closed Date
 40: Offense Year
 41: Age

NOTE: can clean up the BOM (byte order mark) character,
and many other encoding problems, with ftfy's fix_text function, like so:

import ftfy
bad_col = reader.fieldnames[0]
bad_col
>> '\ufeffUniform Case Number'

ftfy.fix_text(bad_col)
>> 'Uniform Case Number'

desired-columns = '1,9,10,11,12,18,19,20,21,22,13,14,15,16,17,7,25,5,6'
"""


def initialize_dataset():
    """Initiate a PANDA dataset for bicycle citations."""
    output_headings = [
        'lname',
        'fname',
        'mname',
        'suffix',
        'DOB',
        'race',
        'gender',
        'DL',
        'address',
        'date',
        'statute',
        'agency',
        'officer',
    ]
    dataset_dict = {
        'name': DATASET_NAME,
        'description': (
            'A list of bicycle citations in Hillsborough County '
            'from 2003 to 2014.'),
        'categories': [
            '/api/1.0/category/all-dob/',
            '/api/1.0/category/crime/',
            '/api/1.0/category/traffic-tickets/'
        ]
    }
    rtest = panda_get(DATASET_URL)
    if rtest.status_code == 404:  # Dataset doesn't exist, so we can create it
        response = panda_put(
            DATASET_URL,
            json.dumps(dataset_dict),
            params={'columns': ','.join(output_headings)}
        )
        if response.status_code == 201:
            print("Created PANDA dataset {}".format(DATASET_NAME))
            return True
        else:
            print(
                "Failed with status code {} and reason {}".format(
                    response.status_code, response.reason))
            return False
    elif rtest.reason == 'OK':
        testd = json.loads(rtest.text)
        print(
            "Dataset is already created and has {} rows; "
            "We can proceed with upload.".format(testd['row_count']))
        return True


def load_tickets(infile):
    if not os.path.isfile(infile):
        print("Couldn't find the source file '{}'".format(infile))
        return
    address_rows = [
        'Address Line 1', 'Address Line 2', 'City', 'State', 'Zip Code']
    put_data = {'objects': []}
    with open(infile, 'r') as f:
        reader = csv.DictReader(f)
        for row in reader:
            RUNVARS.processed += 1
            pk = row['ID']
            addr = ", ".join([row[key] for key in address_rows if row[key]])
            dl = "{} ({})".format(
                row['Driver License Number'], row['Driver License State'])
            put_data['objects'].append({
                'external_id': str(pk),
                'data': [
                    row['Last Name'],
                    row['First Name'],
                    row['Middle Name'],
                    row['Suffix'],
                    row['Date Of Birth'],
                    row['Race'].replace('White', 'Wh').replace(
                        'Black', 'Bl'),
                    row['Gender'],
                    dl,
                    addr,
                    row['Offense Date'],
                    row['Statute Description'],
                    row['Law Enf Agency Name'],
                    row['Law Enf Officer Name']
                ]
            })
            if len(put_data['objects']) == PANDA_BULK_UPDATE_SIZE:
                RUNVARS.created += PANDA_BULK_UPDATE_SIZE
                print("Shipped {} rows".format(RUNVARS.created))
                panda_put(DATA_URL, json.dumps(put_data))
                put_data['objects'] = []
    if put_data['objects']:
        print('Shipping final {} rows.'.format(len(put_data['objects'])))
        panda_put(DATA_URL, json.dumps(put_data))
        RUNVARS.created += len(put_data['objects'])
        put_data['objects'] = []
    print("pushed {} rows to panda dataset {}; process took {}".format(
        RUNVARS.created,
        DATASET_NAME,
        (datetime.datetime.now() - RUNVARS.starter)))


if __name__ == "__main__":
    """
    Load a spreadsheet of Tampa bicycle citations into PANDA.

    If no source file path is passed,
    the default will be /data/AllBikeViolations.csv
    """
    if not initialize_dataset():
        print("Couldn't initialize the PANDA dataset.")
        sys.exit(1)
    if len(sys.argv) > 1:
        source = sys.argv[1]
    else:
        source = "/data/AllBikeViolations.csv"
    load_tickets(source)

GitHub Events

Total
  • Push event: 2
Last Year
  • Push event: 2

Committers

Last synced: 9 months ago

All Time
  • Total Commits: 62
  • Total Committers: 1
  • Avg Commits per committer: 62.0
  • Development Distribution Score (DDS): 0.0
Past Year
  • Commits: 3
  • Committers: 1
  • Avg Commits per committer: 3.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
William Higgins h****1@g****m 62

Issues and Pull Requests

Last synced: 9 months ago

All Time
  • Total issues: 3
  • Total pull requests: 5
  • Average time to close issues: 4 minutes
  • Average time to close pull requests: less than a minute
  • Total issue authors: 1
  • Total pull request authors: 1
  • Average comments per issue: 0.0
  • Average comments per pull request: 0.0
  • Merged pull requests: 5
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • higs4281 (3)
Pull Request Authors
  • higs4281 (5)
Top Labels
Issue Labels
Pull Request Labels

Dependencies

requirements/base.txt pypi
  • csvkit ==1.0.2
  • django >=2.1.6
  • requests >=2.20.0
requirements/test.txt pypi
  • coverage *
  • flake8 *
  • tox *