https://github.com/ccao-data/service-sqoop-iasworld

Service to continually import iasWorld backend data to Parquet using Apache Sqoop

Science Score: 26.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file: found
  • .zenodo.json file: found
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity: low (12.7%)

Keywords

docker, hadoop, service, shell, sqoop
Last synced: 5 months ago

Repository

Service to continually import iasWorld backend data to Parquet using Apache Sqoop

Basic Info
  • Host: GitHub
  • Owner: ccao-data
  • Language: Shell
  • Default Branch: master
  • Homepage:
  • Size: 407 KB
Statistics
  • Stars: 0
  • Watchers: 0
  • Forks: 0
  • Open Issues: 0
  • Releases: 1
Archived
Topics
docker, hadoop, service, shell, sqoop
Created over 2 years ago · Last pushed over 1 year ago
Metadata Files
Readme, Codeowners

README.md

Sqoop Extractor for iasWorld

This repository contains the dependencies and scripts necessary to run sqoop, a data extraction tool for transferring data from relational databases to Hadoop, Hive, or Parquet.

In this case, sqoop is used to export table dumps from iasWorld, the CCAO's system of record, to an HCatalog. The result is a set of partitioned and bucketed Parquet files which can be uploaded to AWS S3 and queried directly via AWS Athena.
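
As an illustration, a single-table import into an HCatalog-managed Parquet table looks roughly like the sketch below. The connection details, table name, and mapper count are placeholders; the actual commands are assembled by the scripts in scripts/ and may use different flags.

```bash
# Hypothetical sqoop invocation (placeholder host, credentials, and table);
# the real commands are generated by the scripts in scripts/
sqoop import \
  --connect "jdbc:oracle:thin:@//iasworld-db.example.com:1521/IPTS" \
  --username ipts_user \
  --password-file file:///run/secrets/IPTS_PASSWORD \
  --table IASWORLD.PARDAT \
  --hcatalog-database iasworld \
  --hcatalog-table pardat \
  --hcatalog-storage-stanza "STORED AS PARQUET" \
  --map-column-hive TAXYR=int \
  -m 4
```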

Structure

Directories

  • docker-config/ - Configuration and setup files for the Hadoop/Hive backend. Used during Docker build only
  • drivers/ - Mounted during run to provide connection drivers to sqoop. Put OJDBC files here (ojdbc8.jar or ojdbc7.jar)
  • logs/ - Location of temporary log files. Logs are manually uploaded to AWS CloudWatch after each run is complete
  • scripts/ - Runtime scripts to run sqoop jobs within Docker
  • secrets/ - Mounted during run to provide DB password via a file. Alter secrets/IPTS_PASSWORD to contain your password
  • tables/ - Table definitions and metadata used to create Hive tables for sqoop to extract to. Manually stored since certain tables include partitioning and bucketing
  • target/ - Mounted during run as output directory. All parquet files and job artifacts are saved temporarily before being uploaded to S3
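
For example, a first-time setup of the mounted directories above might look like the following (the driver jar path and password are placeholders):

```bash
# One-time host-side setup for the mounted directories described above
mkdir -p drivers secrets logs target

# Copy the Oracle JDBC driver into drivers/ (source path is a placeholder)
cp /path/to/ojdbc8.jar drivers/

# Write the database password that sqoop reads at runtime
echo 'your-db-password' > secrets/IPTS_PASSWORD
chmod 600 secrets/IPTS_PASSWORD
```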

Important Files

  • Dockerfile - Dockerfile to build sqoop and all dependencies from scratch if unavailable via the GitLab container registry
  • run.sh - Main entrypoint script. Idempotent. Run with sudo ./run.sh to extract all iasWorld tables.
  • docker-compose.yaml - Defines the containers and environment needed to run sqoop jobs in a small, distributed Hadoop/Hive environment
  • .env - Contains DB connection details. Alter before running to provide your own details

Usage

Dependencies

You will need Docker and Docker Compose installed before using this repo.

The rest of the dependencies for sqoop are installed using the included Dockerfile. To retrieve them, run either of the following commands within the repo directory:

  • docker-compose pull - Grabs the latest image from the CCAO GitHub registry, if it exists
  • docker-compose build - Builds the sqoop image from the included Dockerfile
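
If you are unsure which applies, you can try the pull first and fall back to a local build, e.g.:

```bash
# Pull the prebuilt image if available, otherwise build it from the Dockerfile
docker-compose pull || docker-compose build
```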

Update table schemas

If table schemas are altered in iasWorld (e.g. a column type change or new columns), then the associated schema files need to be updated before the altered tables can be extracted. To update the schema files (a consolidated sketch follows this list):

  1. (Optional) If new tables have been added, they must be added to tables/tables-list.csv
  2. Change /tmp/scripts/run-sqoop.sh to /tmp/scripts/get-tables.sh in docker-compose.yaml
  3. Run docker compose up and wait for the schema files (tables/$TABLE.sql) to update
  4. Run ./update-tables.sh to add bucketing and partitioning to the table schemas
  5. Update the cron job example in the README with any new tables, and update the actual cron job using sudo crontab -e
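
Steps 2 through 4 can be scripted roughly as follows; the sed edit is just one way to make the temporary entrypoint swap in docker-compose.yaml and is not part of the repo's tooling:

```bash
# Temporarily run the schema-dump script instead of the extraction script
sed -i 's|/tmp/scripts/run-sqoop.sh|/tmp/scripts/get-tables.sh|' docker-compose.yaml

# Regenerate the tables/$TABLE.sql schema files
docker compose up

# Re-apply bucketing and partitioning to the regenerated schemas
./update-tables.sh

# Revert the temporary swap when finished
sed -i 's|/tmp/scripts/get-tables.sh|/tmp/scripts/run-sqoop.sh|' docker-compose.yaml
```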

Export Tables

Nearly all the functionality of this repository is contained in run.sh. This script will complete four main tasks:

  1. Extract the specified tables from iasWorld and save them to the target/ directory as Parquet
  2. Remove any existing files on S3 for the extracted tables
  3. Upload the extracted Parquet files to S3
  4. Upload a logstream of the extraction and uploading process to CloudWatch

By default, sudo ./run.sh will export all tables in iasWorld to target/ (and then to S3). To extract a specific table or tables, prefix the run command with the environment variable IPTS_TABLE. For example, sudo IPTS_TABLE="ASMT_HIST CNAME" ./run.sh will grab the ASMT_HIST and CNAME tables.

You can also specify a TAXYR within IPTS_TABLE using conditional symbols. For example, sudo IPTS_TABLE="ASMT_HIST>2019 ADDRINDX=2020" ./run.sh will get only records with a TAXYR greater than 2019 for ASMT_HIST and only records with a TAXYR equal to 2020 for ADDRINDX. Only =, <, and > are allowed as conditional operators.
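
Summarizing the invocations above:

```bash
# Export every iasWorld table to target/ and then S3
sudo ./run.sh

# Export only the ASMT_HIST and CNAME tables
sudo IPTS_TABLE="ASMT_HIST CNAME" ./run.sh

# Export ASMT_HIST rows with TAXYR > 2019 and ADDRINDX rows with TAXYR = 2020
sudo IPTS_TABLE="ASMT_HIST>2019 ADDRINDX=2020" ./run.sh
```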

Scheduling

Table extractions are scheduled via cron. To edit the schedule file, use sudo crontab -e. The example below schedules daily jobs for frequently updated tables and weekly jobs for rarely updated tables.

```bash
# Extract recent years from frequently used tables on weekdays at 1 AM CST
0 6 * * 1,2,3,4,5 cd /local/path/to/repo && YEAR="$(($(date '+\%Y') - 2))" IPTS_TABLE="ADDN>$YEAR APRVAL>$YEAR ASMT_HIST>$YEAR ASMT_ALL>$YEAR COMDAT>$YEAR CVLEG>$YEAR DWELDAT>$YEAR ENTER HTPAR>$YEAR LEGDAT>$YEAR OBY>$YEAR OWNDAT>$YEAR PARDAT>$YEAR PERMIT SALES SPLCOM>$YEAR" /bin/bash ./run.sh

# Extract all tables except for ASMT_ALL and ASMT_HIST on Saturday at 1 AM CST
0 6 * * 6 cd /local/path/to/repo && IPTS_TABLE="AASYSJUR ADDN ADDRINDX APRVAL CNAME COMDAT COMFEAT COMINTEXT COMNT COMNT3 CVLEG CVOWN CVTRAN DEDIT DWELDAT ENTER EXADMN EXAPP EXCODE EXDET HTAGNT HTDATES HTPAR LAND LEGDAT LPMOD LPNBHD OBY OWNDAT PARDAT RCOBY PERMIT SALES SPLCOM VALCLASS" /bin/bash ./run.sh
```

Useful Resources

Documentation and Guides

Docker Resources

Hive/HCatalog Resources

Debugging

Data Type Mappings (used to populate --map-column-hive)

Owner

  • Name: Cook County Assessor's Office
  • Login: ccao-data
  • Kind: organization
  • Email: assessor.data@cookcountyil.gov

Issues and Pull Requests

Last synced: over 1 year ago

All Time
  • Total issues: 4
  • Total pull requests: 12
  • Average time to close issues: 10 days
  • Average time to close pull requests: about 9 hours
  • Total issue authors: 3
  • Total pull request authors: 1
  • Average comments per issue: 0.0
  • Average comments per pull request: 0.33
  • Merged pull requests: 11
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 2
  • Pull requests: 7
  • Average time to close issues: 6 days
  • Average time to close pull requests: about 14 hours
  • Issue authors: 2
  • Pull request authors: 1
  • Average comments per issue: 0.0
  • Average comments per pull request: 0.43
  • Merged pull requests: 6
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • dfsnow (4)
  • jeancochrane (1)
  • wrridgeway (1)
Pull Request Authors
  • dfsnow (15)
Top Labels
Issue Labels
enhancement (3)
Pull Request Labels
enhancement (2), bug (2)

Dependencies

.github/workflows/docker-build.yaml actions
  • actions/checkout v3 composite
  • docker/build-push-action v4 composite
  • docker/login-action v2 composite
  • docker/setup-buildx-action v2 composite
Dockerfile docker
  • centos 7 build
  • hadoop latest build