pipeline

Breviloquia Italica: data pipeline

https://github.com/breviloquia-italica/pipeline

Science Score: 67.0%

This score indicates how likely this project is to be science-related based on various indicators:

✓ CITATION.cff file: found CITATION.cff file
✓ codemeta.json file: found codemeta.json file
✓ .zenodo.json file: found .zenodo.json file
✓ DOI references: found 3 DOI reference(s) in README
✓ Academic publication links: links to zenodo.org
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: low similarity (4.4%) to scientific vocabulary

Last synced: 9 months ago

Repository

Basic Info

Host: GitHub
Owner: breviloquia-italica
Language: Jupyter Notebook
Default Branch: main
Size: 40.1 MB

Statistics

Stars: 2
Watchers: 3
Forks: 0
Open Issues: 0
Releases: 4

Created almost 3 years ago · Last pushed over 2 years ago

Metadata Files

Readme, Changelog, Citation, Zenodo

Breviloquia Italica: data pipeline

This resource contains the full source code for the data pipeline of the Breviloquia Italica project.

Description

The pipeline is organized as a series of numbered scripts, grouped into the stages of preparation, transformation, selection, annotation and export. Their dependencies are encoded in the Makefile, which can also be run to build specific outputs of the pipeline. Here is a dependency graph depicting inputs, scripts and outputs:

```mermaid
flowchart TD;

subgraph P1 [PREPARATION]

00[00_unpack-data.sh]:::code;
10[10_extract-places.sh]:::code;
11[11_extract-tweets.sh]:::code;
12[12_flatten-tweets.sh]:::code;
20[20_cleanup-places.py]:::code;
21[21_cleanup-tweets.py]:::code;

2022-MM-DD.jsonl[/data/2022-MM-DD.jsonl/]:::data;
places.jsonl[/places.jsonl/]:::data;
places.parquet[/places.parquet/]:::data;
tweets.jsonl[/tweets.jsonl/]:::data;
tweets.parquet[/tweets.parquet/]:::data;
tweets.csv[/tweets.csv/]:::data;

data.zip[/data.zip/]:::extdata --- 00;
00 --> 2022-MM-DD.jsonl;

2022-MM-DD.jsonl --- 10 --> places.jsonl;
places.jsonl --- 20 --> places.parquet;

2022-MM-DD.jsonl --- 11 --> tweets.jsonl;
tweets.jsonl --- 21 --> tweets.parquet;

2022-MM-DD.jsonl --- 12 ----> tweets.csv;

end

subgraph P2 [TRANSFORMATION]

30[30_tokenize-tweets.py]:::code;
31[31_locate-tweets.py]:::code;

tweets-tok.parquet[/tweets-tok.parquet/]:::data;
tweets-geo.parquet[/tweets-geo.parquet/]:::data;

_02[/italy-regions.geojson/]:::extdata --- 31;
places.parquet --- 31;
tweets.parquet --- 31;
31 --> tweets-geo.parquet;

tweets.parquet --- 30;
30 --> tweets-tok.parquet;

end

subgraph P3 [SELECTION]

40[40_compute-wforms-occ.py]:::code;
41[41_compute-wforms-usr.py]:::code;
42[42_compute-wforms-bat.py]:::code;

wforms-occ.parquet[/wforms-occ.parquet/]:::data;
wforms-usr.parquet[/wforms-usr.parquet/]:::data;
wforms-bat.parquet[/wforms-bat.parquet/]:::data;

tweets-tok.parquet --- 40 --> wforms-occ.parquet;

tweets-tok.parquet --- 41;
tweets.parquet --- 41;
41 --> wforms-usr.parquet;

wforms-occ.parquet --- 42;
_03[/attested-forms.csv/]:::extdata --- 42;
wforms-usr.parquet --- 42;
42 --> wforms-bat.parquet;

end

subgraph P4 [ANNOTATION]

50[50_export-ann-batches.py]:::code;
51[[51_process-ann-batches.md]]:::code;
52[52_import-ann-batches.py]:::code;

wforms-ann-batch-N.csv[/"wforms-ann-batch-{1,2}.csv"/]:::data;
wforms-ann-batch-N.gsheet.csv[/"wforms-ann-{batch-1,batch-2,patches}.gsheet.csv"/]:::extdata;
wforms-ann.parquet[/wforms-ann.parquet/]:::data;
wforms-ann.csv[/wforms-ann.csv/]:::data;

wforms-bat.parquet --- 50;
50 -.- wforms-ann.parquet;
50 --> wforms-ann-batch-N.csv;

wforms-ann-batch-N.csv --- 51;
tweets.csv --- 51;
51 --> wforms-ann-batch-N.gsheet.csv;

wforms-ann-batch-N.gsheet.csv --- 52;
52 --> wforms-ann.parquet;
52 --> wforms-ann.csv;

end

subgraph P5 [EXPORT]

60[60_export-tweets-ids.sh]:::code;
61[61_export-occurrences.py]:::code;

tweets-ids.csv[/tweets-ids.csv/]:::data;
occurrences.csv[/occurrences.csv/]:::data;

2022-MM-DD.jsonl --- 60 ----> tweets-ids.csv;

tweets.jsonl --> 61;
tweets-geo.parquet --> 61;
wforms-occ.parquet --> 61;
wforms-ann.parquet --> 61;
61 --> occurrences.csv;

end

P1 ~~~~~~~ P2;
P2 ~~~~ P3;
P3 ~~~~~~ P4;
P4 ~~~~~~~~ P5;

classDef code stroke:red;
classDef data stroke:green;
classDef extdata stroke:blue;
```
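To illustrate how the graph above determines an execution order, here is a minimal Python sketch. It is not the project's Makefile: the `DEPS` table is a hand transcription of the graph using its two-digit node ids, and the `plan` helper is a hypothetical name introduced for this example. The dotted link between 50 and wforms-ann.parquet is left out, since its meaning is not stated, and external inputs (data.zip, italy-regions.geojson, attested-forms.csv) are not modeled.

```python
from graphlib import TopologicalSorter

# Script id -> ids of the scripts whose outputs it reads.
# Transcribed by hand from the graph above (an assumption of this sketch).
# Note: 51 is a Markdown document rather than a script, so it is
# presumably a manual step.
DEPS = {
    "00": [],
    "10": ["00"], "11": ["00"], "12": ["00"],
    "20": ["10"], "21": ["11"],
    "30": ["21"], "31": ["20", "21"],
    "40": ["30"], "41": ["21", "30"], "42": ["40", "41"],
    "50": ["42"], "51": ["12", "50"], "52": ["51"],
    "60": ["00"], "61": ["11", "31", "40", "52"],
}


def plan(target: str) -> list[str]:
    """Return the steps needed for `target`, in a valid execution order."""
    # Collect the target and everything upstream of it.
    needed, stack = set(), [target]
    while stack:
        node = stack.pop()
        if node not in needed:
            needed.add(node)
            stack.extend(DEPS[node])

    # Sort the needed subgraph; sorting each ready batch keeps the
    # result deterministic.
    sorter = TopologicalSorter({node: DEPS[node] for node in needed})
    sorter.prepare()
    order = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        order.extend(ready)
        sorter.done(*ready)
    return order


print(plan("31"))  # ['00', '10', '11', '20', '21', '31']
```

Asking for step 31 (which writes tweets-geo.parquet) pulls in only the preparation steps it depends on; steps 12, 30 and everything downstream are skipped, which mirrors what targeting a single output with make is meant to achieve.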

Data visualizations and statistics are produced by a few Jupyter notebooks and Python scripts. The Makefile encodes their dependencies too, as depicted in this graph:

```mermaid
flowchart TB;

subgraph P5 [ANALYSIS]

90[90_basic-stats.ipynb]:::code;
91[91_choro-stats.ipynb]:::code;
92[92_annos-stats.ipynb]:::code;
98[98_parts-chart.py]:::code;
99[99_choro-chart.py]:::code;

places.parquet[/places.parquet/]:::data;
tweets.parquet[/tweets.parquet/]:::data;
tweets-tok.parquet[/tweets-tok.parquet/]:::data;
wforms-bat.parquet[/wforms-bat.parquet/]:::data;
world-nations.geojson[/world-nations.geojson/]:::extdata;
italy-regions.geojson[/italy-regions.geojson/]:::extdata;

world-nations.geojson ---- 90;
italy-regions.geojson ---- 90;
tweets.parquet --- 90;
places.parquet --- 90;
wforms-bat.parquet --- 90;
tweets-tok.parquet --- 90;
90 -.-> 90;

italy-regions.geojson ---- 91;
D9[/"wforms-{bat,ann}.parquet"/]:::dataref --- 91;
D8[/"tweets-{tok,geo}.parquet"/]:::dataref --- 91;
91 -.-> 91;

D1[/"wforms-{ann,bat,occ,usr}.parquet"/]:::dataref --- 92;
%% for spacing only:
italy-regions.geojson ~~~ D1;
92 -.-> 92;

D2[/"wforms-{occ,usr}.parquet"/]:::dataref --- 98;
98 --> subsets.pdf;
98 -.-> 98;

italy-regions.geojson ---- 99;
D3[/"wforms-{bat,ann}.parquet"/]:::dataref --- 99;
D4[/"tweets-{tok,geo}.parquet"/]:::dataref --- 99;
99 --> choros-*.pdf["choros-{sample,more-1,more-2}.pdf"];
99 -.-> 99;

end

classDef code stroke:red;
classDef data stroke:green;
classDef extdata stroke:blue;
classDef dataref stroke:green,stroke-width:2px,stroke-dasharray: 10 10,font-style:italic;
```
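Because these dependencies live in a Makefile, an output is normally rebuilt only when it is missing or older than one of its inputs. The following Python sketch illustrates that general rule of make; it is not code from this repository, and `is_stale` is a hypothetical helper. The file names are taken from the graph above and are created empty in a scratch directory, with timestamps set explicitly for the demonstration.

```python
import os
import tempfile
from pathlib import Path


def is_stale(target: Path, inputs: list[Path]) -> bool:
    """True if `target` is missing or older than any of `inputs`."""
    if not target.exists():
        return True
    built_at = target.stat().st_mtime
    # Assumes every input exists; a missing input raises FileNotFoundError.
    return any(src.stat().st_mtime > built_at for src in inputs)


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    inputs = [root / "wforms-occ.parquet", root / "wforms-usr.parquet"]
    target = root / "subsets.pdf"

    # Create empty stand-ins for the inputs with a fixed, old timestamp.
    for src in inputs:
        src.touch()
        os.utime(src, (1_000_000_000, 1_000_000_000))
    missing = is_stale(target, inputs)  # target has not been built yet

    # Pretend the target was built after both inputs.
    target.touch()
    os.utime(target, (1_100_000_000, 1_100_000_000))
    fresh = is_stale(target, inputs)

    # One input changes after the target was built.
    os.utime(inputs[0], (1_200_000_000, 1_200_000_000))
    outdated = is_stale(target, inputs)

print(missing, fresh, outdated)  # True False True
```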

Workbooks named XX_*.ipynb are dead ends or work in progress, so they are not documented in the graphs above.

jupyterlab.sh and Makefile.hpc are development tools used to prepare and run the pipeline in our HPC environment, and therefore are probably not of general interest.

requirements.txt lists all Python dependencies, as is customary.
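Since the dependencies are pinned to exact versions, it can be useful to check whether the active environment matches them. The sketch below is one possible way to do that with the standard library. `parse_pins` and `find_mismatches` are hypothetical helpers, not part of the repository, and the parser is deliberately simplified: it only understands `name==version` lines and ignores extras, environment markers and other specifiers.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def parse_pins(text: str) -> dict[str, str]:
    """Parse 'name==version' lines, skipping blanks and comments."""
    pins = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        if "==" not in line:
            continue
        name, pinned = line.split("==", 1)
        pins[name.strip()] = pinned.strip()
    return pins


def find_mismatches(pins: dict[str, str]) -> dict[str, tuple[str, Optional[str]]]:
    """Map each package that deviates from its pin to (pinned, installed)."""
    mismatches = {}
    for name, pinned in pins.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != pinned:
            mismatches[name] = (pinned, installed)
    return mismatches


# Two pins copied from the dependency listing; the result depends on
# what is installed in the environment running this sketch.
sample = "pandas==2.0.3\nspacy==3.6.1\n"
print(find_mismatches(parse_pins(sample)))
```

In practice one would pass the contents of requirements.txt instead of the two-line sample; an empty result means every pinned package is installed at the pinned version.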

Authors

Paolo Brasolin.

License

This work is openly licensed via CC BY 4.0.

Owner

Name: Breviloquia Italica
Login: breviloquia-italica
Kind: organization
Location: Italy

Repositories: 1
Profile: https://github.com/breviloquia-italica

Citation (CITATION.cff)

cff-version: 1.2.0
title: "Breviloquia Italica: data pipeline"
authors:
  - family-names: Brasolin
    given-names: Paolo
    orcid: https://orcid.org/0000-0003-2471-7797
url: https://github.com/breviloquia-italica/pipeline
doi: 10.5281/zenodo.8430341 # NOTE: this is the concept DOI
date-released: "2024-02-05"
version: 1.2.0
license: "CC-BY-4.0"

Dependencies

requirements.txt pypi

Babel ==2.12.1
Fiona ==1.9.4.post1
Jinja2 ==3.1.2
MarkupSafe ==2.1.3
Pillow ==10.0.0
PyYAML ==6.0.1
Pygments ==2.16.1
Send2Trash ==1.8.2
aiohttp ==3.8.5
aiohttp-cors ==0.7.0
aiosignal ==1.3.1
anyio ==3.7.1
argon2-cffi ==21.3.0
argon2-cffi-bindings ==21.2.0
arrow ==1.2.3
asttokens ==2.2.1
async-lru ==2.0.4
async-timeout ==4.0.2
attrs ==23.1.0
backcall ==0.2.0
beautifulsoup4 ==4.12.2
black ==23.7.0
bleach ==6.0.0
blessed ==1.20.0
blis ==0.7.10
cachetools ==5.3.1
catalogue ==2.0.9
certifi ==2023.7.22
cffi ==1.15.1
charset-normalizer ==3.2.0
click ==8.1.6
click-plugins ==1.1.1
cligj ==0.7.2
colorful ==0.5.5
comm ==0.1.4
confection ==0.1.1
contourpy ==1.1.0
cycler ==0.11.0
cymem ==2.0.7
debugpy ==1.6.7
decorator ==5.1.1
defusedxml ==0.7.1
distlib ==0.3.7
emoji ==2.7.0
exceptiongroup ==1.1.2
executing ==1.2.0
fastjsonschema ==2.18.0
filelock ==3.12.2
fonttools ==4.42.0
fqdn ==1.5.1
frozenlist ==1.4.0
fsspec ==2023.6.0
geopandas ==0.13.2
google-api-core ==2.11.1
google-auth ==2.22.0
googleapis-common-protos ==1.60.0
gpustat ==1.1
grpcio ==1.56.2
idna ==3.4
ipykernel ==6.25.1
ipython ==8.14.0
ipywidgets ==8.1.0
isoduration ==20.11.0
isort ==5.12.0
jedi ==0.19.0
json5 ==0.9.14
jsonpointer ==2.4
jsonschema ==4.19.0
jsonschema-specifications ==2023.7.1
jupyter-events ==0.7.0
jupyter-lsp ==2.2.0
jupyter-resource-usage ==1.0.0
jupyter_client ==8.3.0
jupyter_core ==5.3.1
jupyter_server ==2.7.0
jupyter_server_terminals ==0.4.4
jupyterlab ==4.0.4
jupyterlab-pygments ==0.2.2
jupyterlab-widgets ==3.0.8
jupyterlab_code_formatter ==2.2.1
jupyterlab_server ==2.24.0
kiwisolver ==1.4.4
langcodes ==3.3.0
matplotlib ==3.7.2
matplotlib-inline ==0.1.6
mistune ==3.0.1
modin ==0.23.0
msgpack ==1.0.5
multidict ==6.0.4
murmurhash ==1.0.9
mypy-extensions ==1.0.0
nbclient ==0.8.0
nbconvert ==7.7.3
nbformat ==5.9.2
nest-asyncio ==1.5.7
notebook_shim ==0.2.3
numpy ==1.25.2
nvidia-ml-py ==12.535.77
opencensus ==0.11.2
opencensus-context ==0.1.3
overrides ==7.4.0
packaging ==23.1
pandas ==2.0.3
pandocfilters ==1.5.0
parso ==0.8.3
pathspec ==0.11.2
pathy ==0.10.2
pexpect ==4.8.0
pickleshare ==0.7.5
platformdirs ==3.10.0
preshed ==3.0.8
prometheus-client ==0.17.1
prompt-toolkit ==3.0.39
protobuf ==4.23.4
psutil ==5.9.5
ptyprocess ==0.7.0
pure-eval ==0.2.2
py-spy ==0.3.14
pyarrow ==12.0.1
pyasn1 ==0.5.0
pyasn1-modules ==0.3.0
pycparser ==2.21
pydantic ==1.10.12
pyparsing ==3.0.9
pyproj ==3.6.0
python-dateutil ==2.8.2
python-json-logger ==2.0.7
pytz ==2023.3
pyzmq ==25.1.0
ray ==2.6.2
referencing ==0.30.2
requests ==2.31.0
rfc3339-validator ==0.1.4
rfc3986-validator ==0.1.1
rpds-py ==0.9.2
rsa ==4.9
scipy ==1.11.1
seaborn ==0.12.2
shapely ==2.0.1
six ==1.16.0
smart-open ==6.3.0
sniffio ==1.3.0
soupsieve ==2.4.1
spacy ==3.6.1
spacy-legacy ==3.0.12
spacy-loggers ==1.0.4
srsly ==2.4.7
stack-data ==0.6.2
terminado ==0.17.1
thinc ==8.1.11
tinycss2 ==1.2.1
tomli ==2.0.1
topojson ==1.5
tornado ==6.3.2
tqdm ==4.65.1
traitlets ==5.9.0
typer ==0.9.0
typing_extensions ==4.7.1
tzdata ==2023.3
uri-template ==1.3.0
urllib3 ==1.26.16
virtualenv ==20.21.0
wasabi ==1.1.2
wcwidth ==0.2.6
webcolors ==1.13
webencodings ==0.5.1
websocket-client ==1.6.1
widgetsnbextension ==4.0.8
yarl ==1.9.2
