Science Score: 44.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: Found CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ○ DOI references
- ○ Academic publication links
- ○ Committers with academic emails
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (12.8%) to scientific vocabulary
Repository
Scripts to load data into a PANDA instance
Basic Info
Statistics
- Stars: 3
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
panda-loaders
These scripts began as a way to load voter data into a PANDA instance via its API.
Along the way, the end-of-life of Python 2 appeared on the horizon, and the wonderful PANDA project has fallen a bit out of date.
We're not giving up on PANDA, but for now, the voter script has been updated to run in Python 3.6+ and to be able to create a Postgres database of voter data as an alternative to feeding the data to a PANDA instance.
If you use the Postgres option, you'll get an indexed database ready to be plugged into a Django project, if desired. This repo is not a working Django project, but a Django model that can be used to hook up the Postgres voter db is included in /voters/models.py.
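Hooking the generated Postgres database into a Django project is mostly a settings change. A minimal sketch of the `DATABASES` entry, where the database name, user, and credentials are placeholders rather than values from this repo (the script derives the actual name from the data date):

```python
# Hypothetical Django DATABASES entry for the generated voter database.
# NAME, USER, PASSWORD, HOST and PORT are placeholders, not values from this repo.
DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.postgresql',
        'NAME': 'voters_2019_03_12',  # hypothetical; the script names the db from the data date
        'USER': 'postgres',
        'PASSWORD': '',
        'HOST': 'localhost',
        'PORT': '5432',
    }
}
```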
Getting started
First, install the Python requirements:
```bash
pip install -r requirements/base.txt
```
The voter script is tailored to Florida but could be adapted. It requires some source data and a date value:
- County voter registration files, which in Florida are available from the state's Division of Elections.
- A VOTER_DATA_DATE value in YYYY-MM-DD form, provided by an environment variable of that name or by manually editing the script's global variable. This is used to name the database, to provide a default source_date in the `voters_voter` table, and to provide a YEAR value for accessing the right data directory.
Local directories
The load_county_voters script uses some local directories as it transforms the raw data.
The assumed directory structure, all expected to live under a directory named for the data year (such as 2019/), includes these folders:
Folder | Use
:---- | :--
VoterDetail | Put raw county voter files here
load | Prepped voter files
loaded | Voter files that were sent to PANDA
prep | Processing folder
temp | Processing folder
The load_county_voters script will look for a local environment variable, PANDA_LOADERS_BASE_DIR, to use as the base directory for the above folders, and will default to /tmp as the base directory if no environment var is found.
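The base-directory lookup described above amounts to a one-line fallback. A sketch of that logic (the year directory is an example value):

```python
import os

# Use PANDA_LOADERS_BASE_DIR if set, otherwise fall back to /tmp
BASE_DIR = os.getenv('PANDA_LOADERS_BASE_DIR', '/tmp')

# The data folders live under a year directory beneath the base,
# e.g. <base>/2019/VoterDetail for the raw county files
raw_dir = os.path.join(BASE_DIR, '2019', 'VoterDetail')
```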
Running the script
With raw voter files in place, you can run the voter script with `python load_county_voters.py` plus one of four arguments:
1. [RAW FILE]: You can pass in a file name for a raw county voter file, such as BAY_20190312.txt. If the raw file is in /VoterDetail/, this will prep that single file for loading to a database or to PANDA.
2. prep_files: This preps all raw county files found in /VoterDetail/ and makes them ready for export to a database or to PANDA.
3. load_to_postgres: After files are prepped, this will create, load and index a Postgres voter database. If you put all 67 Florida county files in the /VoterDetail/ directory and prep them, this will create a statewide database of Florida voters.
4. export_to_panda: To export all prepped county files to a PANDA instance, creating one dataset per county.
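Putting the four modes together, typical invocations look like this (the file name is the example from above):

```bash
# Prep a single raw county file sitting in VoterDetail/
python load_county_voters.py BAY_20190312.txt

# Prep every raw county file found in VoterDetail/
python load_county_voters.py prep_files

# Create, load and index a Postgres voter database from the prepped files
python load_county_voters.py load_to_postgres

# Ship each prepped county file to a PANDA instance as its own dataset
python load_county_voters.py export_to_panda
```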
PANDA loading
If you use the script to load data into a PANDA instance, the process uses PANDA's API, which is slow. But the API method has some advantages over PANDA's manual data upload process:
- It sidesteps memory issues that you can encounter in PANDA's loading GUI.
- It uses only PANDA index space, rather than index space plus file storage space.
- It results in a dataset with external_id values, which makes rows editable via the API.
Owner
- Name: william higgins
- Login: higs4281
- Kind: user
- Company: Consumer Financial Protection Bureau
- Repositories: 43
- Profile: https://github.com/higs4281
Web developer @cfpb
Citation (citations/load_tampa_bike_citations.py)
from __future__ import unicode_literals
import csv
import datetime
import json
import os
import sys
import requests
from dateutil import parser
from django.template.defaultfilters import slugify
class RunVars:
    """Track runtime variables."""

    def __init__(self):
        self.starter = None
        self.processed = 0
        self.updated = 0
        self.created = 0
        self.passed = 0


# panda params
RUNVARS = RunVars()
RUNVARS.starter = datetime.datetime.now()
PANDA_BASE_URL = os.getenv('PANDA_BASE_URL')
PANDA_AUTH_PARAMS = {
    'email': os.getenv('PANDA_USER'),
    'api_key': os.getenv('PANDA_API_KEY')
}
QUERY_LIMIT = 1200
PANDA_BULK_UPDATE_SIZE = 1000
PANDA_API = '%s/api/1.0' % PANDA_BASE_URL
PANDA_ALL_DATA_URL = "%s/data/" % PANDA_API
PANDA_DATASET_BASE = "%s/dataset" % PANDA_API
DATASET_NAME = "Tampa bicycle citations 2003 to 2014"
DATASET_SLUG = slugify(DATASET_NAME)
DATASET_URL = '{}/{}/'.format(PANDA_DATASET_BASE, DATASET_SLUG)
DATA_URL = '{}data/'.format(DATASET_URL)


def parse_dob(value):
    """Parse a date of birth, shifting wrong-century parses back 100 years."""
    turndate = datetime.date(2004, 1, 1)  # allow for any citees as young as 10
    try:
        dob = parser.parse(value.strip()).date()
    except Exception as e:
        print("Couldn't parse dob {} ({})".format(value, e))
        return None
    else:
        if dob == datetime.date.today():
            return None
        elif dob > turndate:
            # A two-digit year was parsed into the wrong century
            return dob.replace(year=dob.year - 100)
        else:
            return dob


def panda_get(url, params=None):
    # Copy instead of mutating a shared default argument
    params = dict(params or {})
    params.update(PANDA_AUTH_PARAMS)
    return requests.get(url, params=params)


def panda_put(url, data, params=None):
    params = dict(params or {})
    params.update(PANDA_AUTH_PARAMS)
    return requests.put(
        url, data,
        params=params,
        headers={'Content-Type': 'application/json'})


def panda_delete(url, params=None):
    params = dict(params or {})
    params.update(PANDA_AUTH_PARAMS)
    return requests.delete(url, params=params)


"""
RAW HEADINGS
1: ID
2: Uniform Case Number
3: Citaion Number
4: Case Number
5: Law Enf Agency Name
6: Law Enf Officer Name
7: Offense Date
8: Received Date
9: Last Name
10: First Name
11: Middle Name
12: Suffix
13: Address Line 1
14: Address Line 2 ## add to 13
15: City ## add to 13
16: State ## add to 13
17: Zip Code ## add to 13
18: Date Of Birth
19: Race
20: Gender
21: Driver License Number
22: Driver License State ## add to 21
23: Commercial Vehicle
24: Statute
25: Statute Description
26: Posted Speed
27: Actual Speed
28: Disposition
29: Disposition Date
30: Amount Paid
31: Date Paid
32: Defensive Driving School (DDS)
33: DDS Court Ordered
34: DDS Elected (Regular)
35: DDS Elected (Advanced)
36: Tag Number
37: Tag State ## add to 36
38: Case Filed Date
39: Case Closed Date
40: Offense Year
41: Age

NOTE: can clean up the BOM (byte order mark) character,
and many other encoding problems, with ftfy's fix_text function, like so:
import ftfy
bad_col = reader.fieldnames[0]
bad_col
>> '\ufeffUniform Case Number'
ftfy.fix_text(bad_col)
>> 'Uniform Case Number'

desired-columns = '1,9,10,11,12,18,19,20,21,22,13,14,15,16,17,7,25,5,6'
"""


def initialize_dataset():
    """Initiate a PANDA dataset for bicycle citations."""
    output_headings = [
        'lname',
        'fname',
        'mname',
        'suffix',
        'DOB',
        'race',
        'gender',
        'DL',
        'address',
        'date',
        'statute',
        'agency',
        'officer',
    ]
    dataset_dict = {
        'name': DATASET_NAME,
        'description': (
            'A list of bicycle citations in Hillsborough County '
            'from 2003 to 2014.'),
        'categories': [
            '/api/1.0/category/all-dob/',
            '/api/1.0/category/crime/',
            '/api/1.0/category/traffic-tickets/'
        ]
    }
    rtest = panda_get(DATASET_URL)
    if rtest.status_code == 404:  # Dataset doesn't exist, so we can create it
        response = panda_put(
            DATASET_URL,
            json.dumps(dataset_dict),
            params={'columns': ','.join(output_headings)}
        )
        if response.status_code == 201:
            print("Created PANDA dataset {}".format(DATASET_NAME))
            return True
        else:
            print(
                "Failed with status code {} and reason {}".format(
                    response.status_code, response.reason))
            return False
    elif rtest.status_code == 200:
        testd = json.loads(rtest.text)
        print(
            "Dataset is already created and has {} rows; "
            "we can proceed with upload.".format(testd['row_count']))
        return True
    else:
        print(
            "Unexpected status code {} when checking for the dataset.".format(
                rtest.status_code))
        return False


def load_tickets(infile):
    if not os.path.isfile(infile):
        print("Couldn't find the source file '{}'".format(infile))
        return
    address_rows = [
        'Address Line 1', 'Address Line 2', 'City', 'State', 'Zip Code']
    put_data = {'objects': []}
    with open(infile, 'r') as f:
        reader = csv.DictReader(f)
        for row in reader:
            RUNVARS.processed += 1
            pk = row['ID']
            addr = ", ".join([row[key] for key in address_rows if row[key]])
            dl = "{} ({})".format(
                row['Driver License Number'], row['Driver License State'])
            put_data['objects'].append({
                'external_id': str(pk),
                'data': [
                    row['Last Name'],
                    row['First Name'],
                    row['Middle Name'],
                    row['Suffix'],
                    row['Date Of Birth'],
                    row['Race'].replace('White', 'Wh').replace(
                        'Black', 'Bl'),
                    row['Gender'],
                    dl,
                    addr,
                    row['Offense Date'],
                    row['Statute Description'],
                    row['Law Enf Agency Name'],
                    row['Law Enf Officer Name']
                ]
            })
            if len(put_data['objects']) == PANDA_BULK_UPDATE_SIZE:
                RUNVARS.created += PANDA_BULK_UPDATE_SIZE
                print("Shipped {} rows".format(RUNVARS.created))
                panda_put(DATA_URL, json.dumps(put_data))
                put_data['objects'] = []
    if put_data['objects']:
        print('Shipping final {} rows.'.format(len(put_data['objects'])))
        panda_put(DATA_URL, json.dumps(put_data))
        RUNVARS.created += len(put_data['objects'])
        put_data['objects'] = []
    print("Pushed {} rows to PANDA dataset {}; process took {}".format(
        RUNVARS.created,
        DATASET_NAME,
        (datetime.datetime.now() - RUNVARS.starter)))


if __name__ == "__main__":
    """
    Load a spreadsheet of Tampa bicycle citations into PANDA.
    If no source file path is passed,
    the default will be /data/AllBikeViolations.csv
    """
    if not initialize_dataset():
        print("Couldn't initialize the PANDA dataset.")
        sys.exit(1)
    if len(sys.argv) > 1:
        source = sys.argv[1]
    else:
        source = "/data/AllBikeViolations.csv"
    load_tickets(source)
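One subtlety in the script above is `parse_dob`'s century correction for two-digit years. A small self-contained demonstration of that logic (function copied from the script; requires python-dateutil):

```python
import datetime

from dateutil import parser


def parse_dob(value):
    """Parse a date of birth, shifting wrong-century parses back 100 years."""
    turndate = datetime.date(2004, 1, 1)  # allow for any citees as young as 10
    try:
        dob = parser.parse(value.strip()).date()
    except Exception as e:
        print("Couldn't parse dob {} ({})".format(value, e))
        return None
    if dob == datetime.date.today():
        return None
    elif dob > turndate:
        # A two-digit year landed in the wrong century; shift it back
        return dob.replace(year=dob.year - 100)
    return dob


print(parse_dob('1980-05-01'))  # before the 2004 cutoff, kept as-is
print(parse_dob('2010-06-15'))  # after the cutoff, shifted back a century
print(parse_dob('not a date'))  # unparseable, returns None
```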
GitHub Events
Total
- Push event: 2
Last Year
- Push event: 2
Committers
Last synced: 9 months ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| William Higgins | h****1@g****m | 62 |
Issues and Pull Requests
Last synced: 9 months ago
All Time
- Total issues: 3
- Total pull requests: 5
- Average time to close issues: 4 minutes
- Average time to close pull requests: less than a minute
- Total issue authors: 1
- Total pull request authors: 1
- Average comments per issue: 0.0
- Average comments per pull request: 0.0
- Merged pull requests: 5
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 0
- Pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- higs4281 (3)
Pull Request Authors
- higs4281 (5)
Top Labels
Issue Labels
Pull Request Labels
Dependencies
- csvkit ==1.0.2
- django >=2.1.6
- requests >=2.20.0
- coverage *
- flake8 *
- tox *