https://github.com/ai4bharat/setu

Setu is a comprehensive pipeline designed to clean, filter, and deduplicate diverse data sources including Web, PDF, and Speech data. Built on Apache Spark, Setu encompasses four key stages: document preparation, document cleaning and analysis, flagging and filtering, and deduplication.

Science Score: 13.0%

This score indicates how likely this project is to be science-related based on various indicators:

○ CITATION.cff file
✓ codemeta.json file (found codemeta.json file)
○ .zenodo.json file
○ DOI references
○ Academic publication links
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: low similarity (17.8%) to scientific vocabulary

Last synced: 9 months ago

Repository

Basic Info

Host: GitHub
Owner: AI4Bharat
License: mit
Language: HTML
Default Branch: main
Homepage:
Size: 35 MB

Statistics

Stars: 7
Watchers: 7
Forks: 0
Open Issues: 0
Releases: 1

Created almost 3 years ago · Last pushed about 2 years ago

Metadata Files

Readme License

README.md

Setu: A Comprehensive Pipeline for Data Cleaning, Filtering and Deduplication

Quickstart
Overview
Usage

Updates :

Added Documentation
Flat Project Structure

Future Updates :

Add support for speech transcripts
Add support for multiple data formats

Quickstart

This documentation provides an overview of Setu and its workflow, enabling users to efficiently manage and process Web, PDF, and Speech data with Apache Spark.

Note: users who want to run the pipeline on Windows should use WSL (Windows Subsystem for Linux), because some dependencies and scripts only work in a Linux environment.

Installation

Install Python onto WSL

Before installing, make sure your Python version is 3.10.X or above. For ease of installation we recommend 3.10.X. Also make sure you have Miniconda installed, as we will be using conda environments.

Install Java OpenJDK

sudo apt update
sudo apt install openjdk-11-jdk
java --version

Install Spark for Hadoop 3.3

Note: run the following commands from your home/user folder.

wget https://dlcdn.apache.org/spark/spark-3.5.1/spark-3.5.1-bin-hadoop3.tgz

mkdir hadoop
mkdir hadoop/spark-3.5.1
tar -xvzf spark-3.5.1-bin-hadoop3.tgz -C ~/hadoop/spark-3.5.1 --strip 1

Configuration

Edit your ~/.bashrc file and add the following lines. There must be no space after the = sign.

export SPARK_HOME=~/hadoop/spark-3.5.1
export PATH=$SPARK_HOME/bin:$PATH

Then reload the file in your current shell:

source ~/.bashrc

Copy the default Spark config template and save it as the config file.

cp $SPARK_HOME/conf/spark-defaults.conf.template $SPARK_HOME/conf/spark-defaults.conf

Edit the config file and set the Spark driver host address.

nano $SPARK_HOME/conf/spark-defaults.conf

Add this line to the file:

spark.driver.host localhost

Test your Spark installation by running spark-shell.

spark-shell

Setu Environment Setup

You can now create the conda environment directly from the provided environment.yml file.

conda env create -f environment.yml

Refer to the packages.txt file to verify the installed libraries. Some libraries need to be installed with pip.

Make sure that PySpark is working by running pyspark in the terminal.

pyspark
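As a complement to the manual checks above, the short script below reports which parts of the setup are visible from Python. It is only a sketch using the standard library; the function name `check_spark_environment` and the keys it returns are illustrative and are not part of Setu.

```python
import importlib.util
import os
import shutil


def check_spark_environment() -> dict:
    """Report which pieces of the setup described above are visible to Python."""
    return {
        # Spark needs a JVM, so the java binary should be on PATH.
        "java_on_path": shutil.which("java") is not None,
        # SPARK_HOME is the variable exported in ~/.bashrc.
        "spark_home_set": bool(os.environ.get("SPARK_HOME")),
        # The pyspark launcher lives in $SPARK_HOME/bin (or in the conda env).
        "pyspark_launcher_on_path": shutil.which("pyspark") is not None,
        # The pyspark Python package is installed by the conda environment.
        "pyspark_importable": importlib.util.find_spec("pyspark") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_spark_environment().items():
        print(f"{name}: {'ok' if ok else 'MISSING'}")
```

Run it inside the activated conda environment; any entry reported as MISSING points back to the corresponding installation step.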

Overview

As part of IndicLLMSuite: A Blueprint for Creating Pre-training and Fine-Tuning Datasets for Indian Languages, we release Sangraha, a 251-billion-token dataset spanning 22 languages, extracted from curated URLs, existing multilingual corpora, and large-scale translations.

To build the data corpus, we use webcorpus to crawl a large collection of web URLs curated across all 22 Indic languages. For PDF documents, we download book collections pertaining to Indic languages from the Internet Archive. For ease of downloading PDF files, you can refer to Sangraha Data Download.

Document Preparation

The first stage of Setu focuses on extracting text from a variety of sources to create text documents for further processing. For Web documents, Setu uses trafilatura (Barbaresi, 2021b) to extract text from HTML. PDFs go through a pipeline that generates OCR JSON outputs using the GCP Cloud Vision SDK. Once these JSONs are generated, Setu uses bounding-box information to filter out pages likely to be affected by recognition issues and noise.
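The README does not state which bounding-box criteria Setu applies, so the following is only an illustrative sketch of one plausible heuristic: flag a page when many OCR text boxes overlap each other heavily. The `Box` type, the function names, and both thresholds are assumptions made for this example, not Setu's implementation.

```python
from typing import List, NamedTuple


class Box(NamedTuple):
    """Axis-aligned OCR bounding box (hypothetical layout: left, top, right, bottom)."""
    x0: float
    y0: float
    x1: float
    y1: float


def area(box: Box) -> float:
    return max(box.x1 - box.x0, 0.0) * max(box.y1 - box.y0, 0.0)


def overlap_area(a: Box, b: Box) -> float:
    width = min(a.x1, b.x1) - max(a.x0, b.x0)
    height = min(a.y1, b.y1) - max(a.y0, b.y0)
    return width * height if width > 0 and height > 0 else 0.0


def overlapping_fraction(boxes: List[Box], min_overlap: float = 0.5) -> float:
    """Fraction of boxes covered by another box for more than `min_overlap` of their area."""
    if not boxes:
        return 0.0
    hits = 0
    for i, a in enumerate(boxes):
        a_area = area(a)
        if a_area == 0:
            continue  # degenerate box, nothing to measure
        if any(i != j and overlap_area(a, b) / a_area > min_overlap
               for j, b in enumerate(boxes)):
            hits += 1
    return hits / len(boxes)


def page_is_noisy(boxes: List[Box], max_fraction: float = 0.2) -> bool:
    """Assumed rule: too many overlapping boxes suggests OCR problems."""
    return overlapping_fraction(boxes) > max_fraction


if __name__ == "__main__":
    clean_page = [Box(0, 0, 100, 10), Box(0, 20, 100, 30), Box(0, 40, 100, 50)]
    messy_page = [Box(0, 0, 100, 10), Box(0, 2, 100, 12), Box(0, 40, 100, 50)]
    print(page_is_noisy(clean_page), page_is_noisy(messy_page))
```

In a real run the boxes would come from the OCR JSON produced in this stage; the thresholds here are placeholders that would need tuning on actual pages.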

Cleaning and Analysis Stage

In the cleaning and analysis stage, Setu focuses on reducing noise within individual documents. It employs a multi-model approach for language identification, leveraging outputs from three different language identification libraries.

Various statistics such as character and word counts, NSFW word count, and n-gram repetition ratio are computed during analysis.
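To make these statistics concrete, here is a minimal sketch of how they could be computed for a single document in plain Python. It is not Setu's Spark implementation: the whitespace tokenization, the placeholder `NSFW_WORDS` set, and the exact definition of the repetition ratio are all assumptions chosen for illustration.

```python
from collections import Counter
from typing import Dict, List

# Hypothetical placeholder list; Setu's real NSFW word lists are not shown in this README.
NSFW_WORDS = {"badword", "worseword"}


def ngram_repetition_ratio(words: List[str], n: int = 2) -> float:
    """Fraction of n-grams that repeat an n-gram seen earlier in the document."""
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    repeated = sum(count - 1 for count in counts.values())
    return repeated / len(ngrams)


def document_stats(text: str) -> Dict[str, float]:
    # Whitespace splitting is a simplification; it keeps Indic-script words intact.
    words = text.lower().split()
    return {
        "char_count": len(text),
        "word_count": len(words),
        "nsfw_word_count": sum(1 for word in words if word in NSFW_WORDS),
        "repetition_ratio": ngram_repetition_ratio(words, n=2),
    }


if __name__ == "__main__":
    print(document_stats("the cat sat the cat sat"))
```

For the sample sentence, two of the five bigrams repeat an earlier one, so the repetition ratio is 0.4.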

Flagging and Filtering Stage

During the flagging and filtering stage, Setu applies filters based on the computed statistics. Filters include line length filters, NSFW word filters, and repetition filters, aimed at removing noisy and toxic documents.
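The sketch below shows how such filters could be applied to the statistics of one document. The threshold values, the `Thresholds` class, and the flag names are hypothetical and only illustrate the flag-then-filter idea; Setu's actual filters and limits are defined in the repository.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Thresholds:
    """Illustrative limits; these numbers are assumptions, not Setu's defaults."""
    min_mean_line_length: float = 10.0
    max_nsfw_ratio: float = 0.05
    max_repetition_ratio: float = 0.3


def flag_document(text: str, stats: Dict[str, float],
                  limits: Thresholds = Thresholds()) -> Dict[str, bool]:
    """Return one boolean flag per filter; True means the document looks noisy."""
    lines = [line for line in text.splitlines() if line.strip()]
    mean_line_length = sum(len(line) for line in lines) / len(lines) if lines else 0.0
    nsfw_ratio = stats["nsfw_word_count"] / max(stats["word_count"], 1)
    return {
        "short_lines": mean_line_length < limits.min_mean_line_length,
        "nsfw": nsfw_ratio > limits.max_nsfw_ratio,
        "repetitive": stats["repetition_ratio"] > limits.max_repetition_ratio,
    }


def keep_document(flags: Dict[str, bool]) -> bool:
    """A document survives filtering only if no flag is raised."""
    return not any(flags.values())


if __name__ == "__main__":
    text = "This is a reasonably long line of text.\nAnd here is another line."
    stats = {"word_count": 13, "nsfw_word_count": 0, "repetition_ratio": 0.0}
    flags = flag_document(text, stats)
    print(flags, keep_document(flags))
```

Keeping the flags separate from the final keep/drop decision mirrors the stage name: documents are first flagged and then filtered, which makes it easy to inspect why a document was removed.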

Deduplication Stage

The deduplication stage of Setu performs fuzzy deduplication using MinHashLSH implemented in text-dedup. This stage helps in identifying and eliminating duplicate documents, enhancing data cleanliness and efficiency.
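For readers unfamiliar with the technique, the following standard-library sketch illustrates the idea behind MinHash with LSH banding. Setu itself relies on text-dedup for this stage; the code below does not use that library's API, and the parameters `NUM_PERM`, `BANDS`, and the word-trigram shingling are assumptions chosen to keep the example small.

```python
import hashlib
from collections import defaultdict
from itertools import combinations
from typing import Dict, List, Set, Tuple

NUM_PERM = 64  # number of hash functions in a signature (assumed value)
BANDS = 16     # LSH bands, so each band covers NUM_PERM // BANDS = 4 rows


def shingles(text: str, n: int = 3) -> Set[str]:
    """Word n-grams of the document; short texts yield a single shingle."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(len(words) - n + 1, 1))}


def minhash_signature(text: str) -> List[int]:
    """For each seeded hash function, keep the minimum hash over all shingles."""
    grams = shingles(text)
    signature = []
    for seed in range(NUM_PERM):
        salt = seed.to_bytes(8, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(g.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for g in grams
        ))
    return signature


def estimated_jaccard(sig_a: List[int], sig_b: List[int]) -> float:
    """Share of matching signature positions approximates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def lsh_candidate_pairs(signatures: Dict[str, List[int]]) -> Set[Tuple[str, str]]:
    """Documents sharing an identical band land in the same bucket."""
    rows = NUM_PERM // BANDS
    buckets = defaultdict(set)
    for doc_id, sig in signatures.items():
        for band in range(BANDS):
            key = (band, tuple(sig[band * rows:(band + 1) * rows]))
            buckets[key].add(doc_id)
    pairs = set()
    for doc_ids in buckets.values():
        pairs.update(combinations(sorted(doc_ids), 2))
    return pairs


if __name__ == "__main__":
    docs = {
        "a": "setu cleans filters and deduplicates web pdf and speech data",
        "b": "setu cleans filters and deduplicates web pdf and speech data",
        "c": "an unrelated sentence about cooking rice with lentils tonight",
    }
    sigs = {doc_id: minhash_signature(text) for doc_id, text in docs.items()}
    print(lsh_candidate_pairs(sigs))
```

Candidate pairs found this way are usually verified with the estimated (or exact) Jaccard similarity before one document of each pair is dropped; more bands with fewer rows catch looser near-duplicates at the cost of more false candidates.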

Usage

To run the different stages of Setu, refer to the commands.md file and the demo.ipynb files in the repo, which show the usage and output of each stage. Make sure you configure $USER and --master to point to your user folder and the corresponding Spark master URL. If you store your datasets in a different location, modify the path arguments of the commands accordingly.

Owner

Name: AI4Bhārat
Login: AI4Bharat
Kind: organization
Email: opensource@ai4bharat.org
Location: India

Website: https://ai4bharat.org
Twitter: AI4Bharat
Repositories: 37
Profile: https://github.com/AI4Bharat

Artificial-Intelligence-For-Bhārat : Building open-source AI solutions for India!

GitHub Events

Total

Issues event: 2
Watch event: 6

Last Year

Issues event: 2
Watch event: 6

Issues and Pull Requests

Last synced: 9 months ago

All Time

Total issues: 2
Total pull requests: 14
Average time to close issues: N/A
Average time to close pull requests: about 4 hours
Total issue authors: 2
Total pull request authors: 3
Average comments per issue: 0.0
Average comments per pull request: 0.0
Merged pull requests: 14
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 2
Pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Issue authors: 2
Pull request authors: 0
Average comments per issue: 0.0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Top Authors

Issue Authors

AamodThakur (1)
Gautam-Rajeev (1)

Pull Request Authors

Shanks0465 (12)
prikmm (1)
safikhanSoofiyani (1)

Top Labels

Issue Labels

Pull Request Labels

Dependencies

environment.yml conda

_libgcc_mutex 0.1
_openmp_mutex 4.5
abseil-cpp 20211102.0
aiohttp 3.9.3
aiosignal 1.2.0
arrow-cpp 14.0.2
asttokens 2.4.1
async-timeout 4.0.3
attrs 23.1.0
aws-c-auth 0.6.19
aws-c-cal 0.5.20
aws-c-common 0.8.5
aws-c-compression 0.2.16
aws-c-event-stream 0.2.15
aws-c-http 0.6.25
aws-c-io 0.13.10
aws-c-mqtt 0.7.13
aws-c-s3 0.1.51
aws-c-sdkutils 0.1.6
aws-checksums 0.1.13
aws-crt-cpp 0.18.16
aws-sdk-cpp 1.10.55
beautifulsoup4 4.12.2
blas 1.0
boost-cpp 1.82.0
bottleneck 1.3.7
brotli 1.0.9
brotli-bin 1.0.9
brotli-python 1.0.9
bzip2 1.0.8
c-ares 1.19.1
ca-certificates 2024.2.2
certifi 2024.2.2
click 8.1.7
comm 0.2.1
contourpy 1.2.0
cuda-cudart 11.8.89
cuda-cupti 11.8.87
cuda-libraries 11.8.0
cuda-nvrtc 11.8.89
cuda-nvtx 11.8.86
cuda-runtime 11.8.0
cycler 0.11.0
cyrus-sasl 2.1.28
datasets 2.12.0
dbus 1.13.18
debugpy 1.8.1
decorator 5.1.1
dill 0.3.6
exceptiongroup 1.2.0
executing 2.0.1
expat 2.5.0
fasttext 0.9.2
ffmpeg 4.3
filelock 3.13.1
flashtext 2.7
fontconfig 2.14.1
fonttools 4.25.0
freetype 2.12.1
frozenlist 1.4.0
fsspec 2023.10.0
geos 3.8.0
gflags 2.2.2
glib 2.78.4
glib-tools 2.78.4
glog 0.5.0
gmp 6.2.1
gmpy2 2.1.2
gnutls 3.6.15
grpc-cpp 1.48.2
gst-plugins-base 1.14.1
gstreamer 1.14.1
huggingface_hub 0.20.3
icu 73.1
idna 3.4
importlib-metadata 7.0.1
importlib_metadata 7.0.1
intel-openmp 2023.1.0
ipykernel 6.29.3
ipython 8.22.2
jedi 0.19.1
jinja2 3.1.3
joblib 1.2.0
jpeg 9e
jupyter_client 8.6.0
jupyter_core 5.7.1
kiwisolver 1.4.4
krb5 1.20.1
lame 3.100
lcms2 2.12
ld_impl_linux-64 2.38
lerc 3.0
libboost 1.82.0
libbrotlicommon 1.0.9
libbrotlidec 1.0.9
libbrotlienc 1.0.9
libclang13 14.0.6
libcublas 11.11.3.6
libcufft 10.9.0.58
libcufile 1.8.1.2
libcups 2.4.2
libcurand 10.3.4.107
libcurl 8.5.0
libcusolver 11.4.1.48
libcusparse 11.7.5.86
libdeflate 1.17
libedit 3.1.20230828
libev 4.33
libevent 2.1.12
libffi 3.4.4
libgcc-ng 13.2.0
libglib 2.78.4
libgomp 13.2.0
libiconv 1.16
libidn2 2.3.4
libjpeg-turbo 2.0.0
libllvm14 14.0.6
libnghttp2 1.57.0
libnpp 11.8.0.86
libnvjpeg 11.9.0.86
libpng 1.6.39
libpq 12.17
libprotobuf 3.20.3
libsodium 1.0.18
libssh2 1.10.0
libstdcxx-ng 13.2.0
libtasn1 4.19.0
libthrift 0.15.0
libtiff 4.5.1
libunistring 0.9.10
libuuid 1.41.5
libwebp-base 1.3.2
libxcb 1.15
libxkbcommon 1.0.1
libxml2 2.10.4
llvm-openmp 14.0.6
lz4-c 1.9.4
markupsafe 2.1.3
matplotlib 3.8.0
matplotlib-base 3.8.0
matplotlib-inline 0.1.6
mkl 2023.1.0
mkl-service 2.4.0
mkl_fft 1.3.8
mkl_random 1.2.4
mpc 1.1.0
mpfr 4.0.2
mpmath 1.3.0
multidict 6.0.4
multiprocess 0.70.14
munkres 1.1.4
mysql 5.7.24
ncurses 6.4
nest-asyncio 1.6.0
nettle 3.7.3
networkx 3.1
nltk 3.8.1
numexpr 2.8.7
numpy 1.26.4
numpy-base 1.26.4
openh264 2.1.1
openjpeg 2.4.0
openssl 3.2.1
orc 1.7.4
packaging 23.1
pandas 2.1.4
parso 0.8.3
pcre2 10.42
pexpect 4.9.0
pickleshare 0.7.5
pillow 10.2.0
pip 23.3.1
platformdirs 4.2.0
ply 3.11
prompt-toolkit 3.0.42
psutil 5.9.8
ptyprocess 0.7.0
pure_eval 0.2.2
py4j 0.10.9.7
pyarrow 14.0.2
pybind11 2.11.1
pybind11-global 2.11.1
pygments 2.17.2
pyparsing 3.0.9
pyqt 5.15.10
pyqt5-sip 12.13.0
pysocks 1.7.1
pyspark 3.5.1
python 3.10.13
python-dateutil 2.8.2
python-tzdata 2023.3
python-xxhash 2.0.2
python_abi 3.10
pytorch 2.2.1
pytorch-cuda 11.8
pytorch-mutex 1.0
pytz 2023.3.post1
pyyaml 6.0.1
pyzmq 25.1.2
qt-main 5.15.2
re2 2022.04.01
readline 8.2
regex 2023.10.3
requests 2.31.0
responses 0.13.3
s2n 1.3.27
safetensors 0.4.2
setuptools 68.2.2
shapely 2.0.1
sip 6.7.12
six 1.16.0
snappy 1.1.10
soupsieve 2.5
sqlite 3.41.2
stack_data 0.6.2
sympy 1.12
tbb 2021.8.0
tk 8.6.12
tokenizers 0.15.1
tomli 2.0.1
torchaudio 2.2.1
torchtriton 2.2.0
torchvision 0.17.1
tornado 6.3.3
tqdm 4.65.0
traitlets 5.14.1
transformers 4.38.2
typing-extensions 4.9.0
typing_extensions 4.9.0
tzdata 2024a
urllib3 2.1.0
utf8proc 2.6.1
wcwidth 0.2.13
wheel 0.41.2
xxhash 0.8.0
xz 5.4.6
yaml 0.2.5
yarl 1.9.3
zeromq 4.3.5
zipp 3.17.0
zlib 1.2.13
zstd 1.5.5
