https://github.com/ccao-data/service-sqoop-iasworld

Service to continually import iasWorld backend data to Parquet using Apache Sqoop

Science Score: 26.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file: found
  • .zenodo.json file: found
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity: low (12.7%)

Keywords

docker, hadoop, service, shell, sqoop
Last synced: 5 months ago

Repository

Service to continually import iasWorld backend data to Parquet using Apache Sqoop

Basic Info
  • Host: GitHub
  • Owner: ccao-data
  • Language: Shell
  • Default Branch: master
  • Homepage:
  • Size: 407 KB
Statistics
  • Stars: 0
  • Watchers: 0
  • Forks: 0
  • Open Issues: 0
  • Releases: 1
Archived
Topics
docker, hadoop, service, shell, sqoop
Created over 2 years ago · Last pushed over 1 year ago
Metadata Files
Readme, Codeowners

README.md

Sqoop Extractor for iasWorld

This repository contains the dependencies and scripts necessary to run sqoop, a data extraction tool for transferring data from relational databases to Hadoop, Hive, or Parquet.

In this case, sqoop is used to export table dumps from iasWorld, the CCAO's system of record, to an HCatalog. The result is a set of partitioned and bucketed Parquet files which can be uploaded to AWS S3 and queried directly via AWS Athena.
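
As an illustration, a single-table import into an HCatalog-managed Parquet table looks roughly like the sketch below. The connection details, table name, and mapper count are placeholders; the actual commands are assembled by the scripts in scripts/ and may use different flags.

```bash
# Hypothetical sqoop invocation (placeholder host, credentials, and table);
# the real commands are generated by the scripts in scripts/
sqoop import \
  --connect "jdbc:oracle:thin:@//iasworld-db.example.com:1521/IPTS" \
  --username ipts_user \
  --password-file file:///run/secrets/IPTS_PASSWORD \
  --table IASWORLD.PARDAT \
  --hcatalog-database iasworld \
  --hcatalog-table pardat \
  --hcatalog-storage-stanza "STORED AS PARQUET" \
  --map-column-hive TAXYR=int \
  -m 4
```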

Structure

Directories

  • docker-config/ - Configuration and setup files for the Hadoop/Hive backend. Used during Docker build only
  • drivers/ - Mounted during run to provide connection drivers to sqoop. Put OJDBC files here (ojdbc8.jar or ojdbc7.jar)
  • logs/ - Location of temporary log files. Logs are manually uploaded to AWS CloudWatch after each run is complete
  • scripts/ - Runtime scripts to run sqoop jobs within Docker
  • secrets/ - Mounted during run to provide DB password via a file. Alter secrets/IPTS_PASSWORD to contain your password
  • tables/ - Table definitions and metadata used to create Hive tables for sqoop to extract to. Manually stored since certain tables include partitioning and bucketing
  • target/ - Mounted during run as output directory. All parquet files and job artifacts are saved temporarily before being uploaded to S3
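
For example, a first-time setup of the mounted directories above might look like the following (the driver jar path and password are placeholders):

```bash
# One-time host-side setup for the mounted directories described above
mkdir -p drivers secrets logs target

# Copy the Oracle JDBC driver into drivers/ (source path is a placeholder)
cp /path/to/ojdbc8.jar drivers/

# Write the database password that sqoop reads at runtime
echo 'your-db-password' > secrets/IPTS_PASSWORD
chmod 600 secrets/IPTS_PASSWORD
```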

Important Files

  • Dockerfile - Dockerfile to build sqoop and all dependencies from scratch if unavailable via the GitLab container registry
  • run.sh - Main entrypoint script. Idempotent. Run with sudo ./run.sh to extract all iasWorld tables.
  • docker-compose.yaml - Defines the containers and environment needed to run sqoop jobs in a small, distributed Hadoop/Hive environment
  • .env - Contains DB connection details. Alter before running to provide your own details

Usage

Dependencies

You will need Docker and Docker Compose installed before using this repo.

The rest of the dependencies for sqoop are installed using the included Dockerfile. To retrieve them, run either of the following commands within the repo directory:

  • docker-compose pull - Grabs the latest image from the CCAO GitHub registry, if it exists
  • docker-compose build - Builds the sqoop image from the included Dockerfile
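
If you are unsure which applies, you can try the pull first and fall back to a local build, e.g.:

```bash
# Pull the prebuilt image if available, otherwise build it from the Dockerfile
docker-compose pull || docker-compose build
```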

Update table schemas

If table schemas are altered in iasWorld (e.g. a column type change or new columns), then the associated schema files need to be updated before the altered tables can be extracted. To update the schema files (a consolidated sketch follows this list):

  1. (Optional) If new tables have been added, they must be added to tables/tables-list.csv
  2. Change /tmp/scripts/run-sqoop.sh to /tmp/scripts/get-tables.sh in docker-compose.yaml
  3. Run docker compose up and wait for the schema files (tables/$TABLE.sql) to update
  4. Run ./update-tables.sh to add bucketing and partitioning to the table schemas
  5. Update the cron job example in the README with any new tables, and update the actual cron job using sudo crontab -e
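
Steps 2 through 4 can be scripted roughly as follows; the sed edit is just one way to make the temporary entrypoint swap in docker-compose.yaml and is not part of the repo's tooling:

```bash
# Temporarily run the schema-dump script instead of the extraction script
sed -i 's|/tmp/scripts/run-sqoop.sh|/tmp/scripts/get-tables.sh|' docker-compose.yaml

# Regenerate the tables/$TABLE.sql schema files
docker compose up

# Re-apply bucketing and partitioning to the regenerated schemas
./update-tables.sh

# Revert the temporary swap when finished
sed -i 's|/tmp/scripts/get-tables.sh|/tmp/scripts/run-sqoop.sh|' docker-compose.yaml
```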

Export Tables

Nearly all the functionality of this repository is contained in run.sh. This script will complete four main tasks:

  1. Extract the specified tables from iasWorld and save them to the target/ directory as Parquet
  2. Remove any existing files on S3 for the extracted tables
  3. Upload the extracted Parquet files to S3
  4. Upload a logstream of the extraction and uploading process to CloudWatch

By default, sudo ./run.sh will export all tables in iasWorld to target/ (and then to S3). To extract a specific table or tables, prefix the run command with the environment variable IPTS_TABLE. For example, sudo IPTS_TABLE="ASMT_HIST CNAME" ./run.sh will grab the ASMT_HIST and CNAME tables.

You can also specify a TAXYR within IPTS_TABLE using conditional symbols. For example, sudo IPTS_TABLE="ASMT_HIST>2019 ADDRINDX=2020" ./run.sh will get only records with a TAXYR greater than 2019 for ASMT_HIST and only records with a TAXYR equal to 2020 for ADDRINDX. Only =, <, and > are allowed as conditional operators.
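
Summarizing the invocations above:

```bash
# Export every iasWorld table to target/ and then S3
sudo ./run.sh

# Export only the ASMT_HIST and CNAME tables
sudo IPTS_TABLE="ASMT_HIST CNAME" ./run.sh

# Export ASMT_HIST rows with TAXYR > 2019 and ADDRINDX rows with TAXYR = 2020
sudo IPTS_TABLE="ASMT_HIST>2019 ADDRINDX=2020" ./run.sh
```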

Scheduling

Table extractions are scheduled via cron. To edit the schedule file, use sudo crontab -e. The example below schedules daily jobs for frequently updated tables and weekly jobs for rarely updated tables.

```bash
# Extract recent years from frequently used tables on weekdays at 1 AM CST
0 6 * * 1,2,3,4,5 cd /local/path/to/repo && YEAR="$(($(date '+\%Y') - 2))" IPTS_TABLE="ADDN>$YEAR APRVAL>$YEAR ASMT_HIST>$YEAR ASMT_ALL>$YEAR COMDAT>$YEAR CVLEG>$YEAR DWELDAT>$YEAR ENTER HTPAR>$YEAR LEGDAT>$YEAR OBY>$YEAR OWNDAT>$YEAR PARDAT>$YEAR PERMIT SALES SPLCOM>$YEAR" /bin/bash ./run.sh

# Extract all tables except for ASMT_ALL and ASMT_HIST on Saturday at 1 AM CST
0 6 * * 6 cd /local/path/to/repo && IPTS_TABLE="AASYSJUR ADDN ADDRINDX APRVAL CNAME COMDAT COMFEAT COMINTEXT COMNT COMNT3 CVLEG CVOWN CVTRAN DEDIT DWELDAT ENTER EXADMN EXAPP EXCODE EXDET HTAGNT HTDATES HTPAR LAND LEGDAT LPMOD LPNBHD OBY OWNDAT PARDAT RCOBY PERMIT SALES SPLCOM VALCLASS" /bin/bash ./run.sh
```

Useful Resources

Documentation and Guides

Docker Resources

Hive/HCatalog Resources

Debugging

Data Type Mappings (used to populate --map-column-hive)

Owner

  • Name: Cook County Assessor's Office
  • Login: ccao-data
  • Kind: organization
  • Email: assessor.data@cookcountyil.gov

Issues and Pull Requests

Last synced: over 1 year ago

All Time
  • Total issues: 4
  • Total pull requests: 12
  • Average time to close issues: 10 days
  • Average time to close pull requests: about 9 hours
  • Total issue authors: 3
  • Total pull request authors: 1
  • Average comments per issue: 0.0
  • Average comments per pull request: 0.33
  • Merged pull requests: 11
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 2
  • Pull requests: 7
  • Average time to close issues: 6 days
  • Average time to close pull requests: about 14 hours
  • Issue authors: 2
  • Pull request authors: 1
  • Average comments per issue: 0.0
  • Average comments per pull request: 0.43
  • Merged pull requests: 6
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • dfsnow (4)
  • jeancochrane (1)
  • wrridgeway (1)
Pull Request Authors
  • dfsnow (15)
Top Labels
Issue Labels
enhancement (3)
Pull Request Labels
enhancement (2), bug (2)

Dependencies

.github/workflows/docker-build.yaml actions
  • actions/checkout v3 composite
  • docker/build-push-action v4 composite
  • docker/login-action v2 composite
  • docker/setup-buildx-action v2 composite
Dockerfile docker
  • centos 7 build
  • hadoop latest build