https://github.com/ai4bharat/sangraha-internet-archive-download

Repository Containing Code for Download and Curation of Sangraha Data

Science Score: 13.0%

This score indicates how likely this project is to be science-related based on various indicators:

○ CITATION.cff file
✓ codemeta.json file (found codemeta.json file)
○ .zenodo.json file
○ DOI references
○ Academic publication links
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity (low similarity, 7.3%, to scientific vocabulary)

Last synced: 9 months ago

Repository

Basic Info

Host: GitHub
Owner: AI4Bharat
Language: Python
Default Branch: master
Size: 13.7 KB

Statistics

Stars: 0
Watchers: 5
Forks: 1
Open Issues: 0
Releases: 0

Created over 2 years ago · Last pushed over 2 years ago

Metadata Files

Readme

Sangraha Internet Archive Data Download

Scripts and utilities for downloading and curating Indic-language data from archive.org files.

Setup

Create a virtual environment and install the required Python dependencies listed in the requirements.txt file.

Single Machine Download from Internet Archive

The pipeline folder contains the Single Machine Download Python script, which downloads archive data onto your machine. The script takes a list of language names (e.g. Dogri, Tamil, Hindi), followed by optional arguments such as the pdfonly and idonly download options.
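
As a rough illustration of the flow described above, here is a minimal sketch built on the internetarchive package from requirements.txt. It is not the repository's script: the flag spellings (--pdfonly, --idonly), the search query shape, the CSV columns, and the function names build_query, parse_args, and run are all assumptions made for this example.

```python
import argparse
import csv
from pathlib import Path


def build_query(language: str) -> str:
    """Build an archive.org search query for one language (assumed query shape)."""
    return f'language:"{language}" AND mediatype:texts'


def parse_args(argv=None):
    """Parse language names plus the two optional switches (hypothetical flag names)."""
    parser = argparse.ArgumentParser(description="Sketch of a single-machine downloader")
    parser.add_argument("languages", nargs="+", help="e.g. Dogri Tamil Hindi")
    parser.add_argument("--pdfonly", action="store_true", help="download only PDF files")
    parser.add_argument("--idonly", action="store_true", help="only record identifiers")
    return parser.parse_args(argv)


def run(args, out_dir="downloads"):
    """Search each language, record identifiers, and optionally download files."""
    # Imported lazily so the argument handling can be tried without network access.
    from internetarchive import download, search_items

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    with open(out / "identifiers.csv", "w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(["language", "identifier"])  # assumed column layout
        for language in args.languages:
            target = out / language
            target.mkdir(exist_ok=True)
            for result in search_items(build_query(language)):
                identifier = result["identifier"]
                writer.writerow([language, identifier])
                if args.idonly:
                    continue  # identifiers only, skip the actual download
                download(
                    identifier,
                    destdir=str(target),
                    glob_pattern="*.pdf" if args.pdfonly else None,
                )


# Example call (performs network requests):
# run(parse_args(["Tamil", "--idonly"]))
```

With the id-only switch, a run like this only writes identifiers.csv, which is the kind of file the distributed setup consumes.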

Distributed Machine Download from Internet Archive

This setup was used to download data onto machines with more storage and to parallelize downloads.

Note: You will have to set up RabbitMQ on your server and client machines and configure the Credentials file accordingly.

The pipeline folder contains two files:

Multiple Machine Server, which queues the identifiers from the identifiers.csv file downloaded in the previous section using the id_only parameter.
Multiple Machine Client, which pulls identifiers from the server host and downloads data onto the client machine.
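
The server/client split can be sketched with pika and internetarchive, both listed in requirements.txt. This is an illustrative outline under stated assumptions, not the repository's code: the queue name, the identifier column name, the function names (read_identifiers, connect, serve, work), and passing credentials as plain arguments rather than through the Credentials file are all hypothetical.

```python
import csv
import os

QUEUE = "ia_identifiers"  # hypothetical queue name


def read_identifiers(csv_path):
    """Return the values of the 'identifier' column (assumed CSV layout)."""
    with open(csv_path, newline="") as fh:
        return [row["identifier"] for row in csv.DictReader(fh) if row.get("identifier")]


def connect(host, user, password):
    """Open a blocking RabbitMQ connection with username/password credentials."""
    import pika  # imported lazily so the CSV helper works without RabbitMQ installed

    params = pika.ConnectionParameters(
        host=host, credentials=pika.PlainCredentials(user, password)
    )
    return pika.BlockingConnection(params)


def serve(csv_path, host, user, password):
    """Server side: publish every identifier as one persistent message."""
    import pika

    connection = connect(host, user, password)
    channel = connection.channel()
    channel.queue_declare(queue=QUEUE, durable=True)
    for identifier in read_identifiers(csv_path):
        channel.basic_publish(
            exchange="",
            routing_key=QUEUE,
            body=identifier.encode("utf-8"),
            properties=pika.BasicProperties(delivery_mode=2),  # persist message
        )
    connection.close()


def work(host, user, password, destdir="downloads"):
    """Client side: pull identifiers one at a time and download each item."""
    from internetarchive import download

    os.makedirs(destdir, exist_ok=True)
    connection = connect(host, user, password)
    channel = connection.channel()
    channel.queue_declare(queue=QUEUE, durable=True)
    channel.basic_qos(prefetch_count=1)  # one unacknowledged item per client

    def on_message(ch, method, properties, body):
        download(body.decode("utf-8"), destdir=destdir)
        ch.basic_ack(delivery_tag=method.delivery_tag)  # ack only after download

    channel.basic_consume(queue=QUEUE, on_message_callback=on_message)
    channel.start_consuming()
```

Acknowledging a message only after the download finishes means an identifier is redelivered to another client if a machine dies mid-download, which suits long-running transfers spread across several machines.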

Owner

Name: AI4Bhārat
Login: AI4Bharat
Kind: organization
Email: opensource@ai4bharat.org
Location: India

Website: https://ai4bharat.org
Twitter: AI4Bharat
Repositories: 37
Profile: https://github.com/AI4Bharat

Artificial-Intelligence-For-Bhārat : Building open-source AI solutions for India!

Dependencies

requirements.txt pypi

apache-airflow-providers-celery *
internetarchive *
pandas *
pika *
pip-chill *
pyarrow *
scrapy *
