https://github.com/ai4bharat/setu
Setu is a comprehensive pipeline designed to clean, filter, and deduplicate diverse data sources including Web, PDF, and Speech data. Built on Apache Spark, Setu encompasses four key stages: document preparation, document cleaning and analysis, flagging and filtering, and deduplication.
Repository
Basic Info
Statistics
- Stars: 7
- Watchers: 7
- Forks: 0
- Open Issues: 0
- Releases: 1
Metadata Files
README.md
Setu: A Comprehensive Pipeline for Data Cleaning, Filtering and Deduplication

Setu is a comprehensive pipeline designed to clean, filter, and deduplicate diverse data sources including Web, PDF, and Speech data. Built on Apache Spark, Setu encompasses four key stages: document preparation, document cleaning and analysis, flagging and filtering, and deduplication. For detailed codebase documentation, visit Setu Documentation.
Updates :
- Added Documentation
- Flat Project Structure
Future Updates :
- Add support for speech transcripts
- Add support for multiple data formats
Quickstart
This documentation provides an overview of Setu and its workflow, enabling users to efficiently manage and process Web, PDF, and Speech data with Apache Spark.
Note that users who want to run the pipeline on Windows are advised to use WSL (Windows Subsystem for Linux), because some dependencies and scripts only work in a Linux environment.
Installation
Install Python onto WSL
- Before installing, make sure your Python version is 3.10.X or above. For ease of installation, we recommend using 3.10.X. Also make sure you have Miniconda installed, as we will be using conda environments.
Install Java OpenJDK
bash
sudo apt update
sudo apt install openjdk-11-jdk
java --version
Install Spark for Hadoop 3.3
Note: Run the following commands from your home directory.
bash
wget https://dlcdn.apache.org/spark/spark-3.5.1/spark-3.5.1-bin-hadoop3.tgz
bash
mkdir -p ~/hadoop/spark-3.5.1
tar -xvzf spark-3.5.1-bin-hadoop3.tgz -C ~/hadoop/spark-3.5.1 --strip-components=1
Configuration
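- Edit your ~/.bashrc file and add the following lines. Do not put a space after the equals sign.
export SPARK_HOME="$HOME/hadoop/spark-3.5.1"
export PATH=$SPARK_HOME/bin:$PATH
- Reload the file so that the changes take effect in the current shell.
source ~/.bashrc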
- Copy the default Spark config template and save it as the config file.
cp $SPARK_HOME/conf/spark-defaults.conf.template $SPARK_HOME/conf/spark-defaults.conf
- Edit the config file and add the Spark driver host address.
nano $SPARK_HOME/conf/spark-defaults.conf
spark.driver.host localhost
- Test your Spark installation by running spark-shell.
bash
spark-shell
Setu Environment Setup
You can now create the conda environment directly from the provided environment.yml file.
bash
conda env create -f environment.yml
- Refer to the packages.txt file to verify the installed libraries. Some libraries need to be installed with pip.
Make sure that PySpark is working by running pyspark in the terminal.
bash
pyspark
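If spark-shell or pyspark does not start, a quick look at the environment can help narrow down the cause. The following is a minimal sketch, not part of Setu: the function name check_spark_setup and the list of checks are assumptions made for illustration. It only verifies that SPARK_HOME is set and that the java, spark-shell, and pyspark executables can be found on PATH.

```python
import os
import shutil
from pathlib import Path


def check_spark_setup(env=None):
    """Return simple pass/fail checks for the setup steps described above.

    `env` defaults to the current process environment; passing a dict
    makes the function easy to try out without touching the real shell.
    """
    env = os.environ if env is None else env
    spark_home = env.get("SPARK_HOME", "").strip()
    search_path = env.get("PATH", "")
    return {
        "SPARK_HOME is set": bool(spark_home),
        "SPARK_HOME exists": bool(spark_home) and Path(spark_home).expanduser().is_dir(),
        "java on PATH": shutil.which("java", path=search_path) is not None,
        "spark-shell on PATH": shutil.which("spark-shell", path=search_path) is not None,
        "pyspark on PATH": shutil.which("pyspark", path=search_path) is not None,
    }


if __name__ == "__main__":
    for name, ok in check_spark_setup().items():
        print(f"{'OK     ' if ok else 'MISSING'}  {name}")
```

A MISSING entry for SPARK_HOME usually points back to the ~/.bashrc step, while a missing pyspark suggests the conda environment is not activated.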
Overview
As part of IndicLLMSuite: A Blueprint for Creating Pre-training and Fine-Tuning Datasets for Indian Languages, we release Sangraha, a dataset of 251 billion tokens in total across 22 languages, extracted from curated URLs, existing multilingual corpora, and large-scale translations.
To build the data corpus, we use webcorpus to crawl a large collection of web URLs curated across all 22 Indic languages. For PDF documents, we download book collections pertaining to Indic languages from the Internet Archive. For ease of downloading PDF files, you can refer to Sangraha Data Download.
Document Preparation
The first stage of Setu focuses on extracting text from a variety of sources to create text documents for further processing. For Web documents, Setu utilizes trafilatura (Barbaresi, 2021b) to extract text from HTML. PDFs, meanwhile, go through a pipeline that generates OCR JSON outputs using the GCP Cloud Vision SDK. Once these JSONs are generated, Setu uses bounding-box information to filter out pages likely to be affected by recognition issues and noise.
Cleaning and Analysis Stage
In the cleaning and analysis stage, Setu focuses on reducing noise within individual documents. It employs a multi-model approach for language identification, leveraging outputs from three different language identification libraries.
Various statistics such as character and word counts, NSFW word count, and n-gram repetition ratio are computed during analysis.
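To make these statistics concrete, here is a minimal pure-Python sketch of how they could be computed for a single document. It is not Setu's implementation: the function names, the whitespace tokenization, the placeholder NSFW_WORDS set, and the definition of the repetition ratio (repeated n-grams divided by total n-grams) are all assumptions chosen for illustration.

```python
from collections import Counter

# Hypothetical placeholder list; real NSFW word lists are language-specific.
NSFW_WORDS = {"badword", "worseword"}


def ngram_repetition_ratio(words, n):
    """Fraction of word n-grams that repeat an earlier n-gram in the document."""
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    # Every occurrence after the first one counts as a repetition.
    repeated = sum(count - 1 for count in counts.values())
    return repeated / len(ngrams)


def document_stats(text, n=2):
    """Compute per-document statistics using whitespace tokenization."""
    words = text.split()
    return {
        "char_count": len(text),
        "word_count": len(words),
        "nsfw_word_count": sum(1 for w in words if w.lower() in NSFW_WORDS),
        "ngram_repetition_ratio": ngram_repetition_ratio(words, n),
    }


if __name__ == "__main__":
    print(document_stats("the cat sat the cat sat"))
```

Whitespace splitting is a simplification; in a Spark job, a function of this kind would typically be applied per row.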
Flagging and Filtering Stage
During the flagging and filtering stage, Setu applies filters based on the computed statistics. Filters include line length filters, NSFW word filters, and repetition filters, aimed at removing noisy and toxic documents.
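The flag-then-filter idea can be sketched as follows. This is an illustration under stated assumptions rather than Setu's code: the threshold values, the statistic names (mean_line_length, nsfw_word_count, and so on), and the functions flag_document and filter_documents are hypothetical.

```python
# Hypothetical thresholds, chosen only for illustration.
THRESHOLDS = {
    "min_mean_line_length": 10,
    "max_nsfw_word_ratio": 0.05,
    "max_ngram_repetition_ratio": 0.3,
}


def flag_document(stats, thresholds=THRESHOLDS):
    """Return the names of the filters a document fails (empty list means keep)."""
    flags = []
    if stats["mean_line_length"] < thresholds["min_mean_line_length"]:
        flags.append("line_length")
    word_count = max(stats["word_count"], 1)  # avoid division by zero
    if stats["nsfw_word_count"] / word_count > thresholds["max_nsfw_word_ratio"]:
        flags.append("nsfw")
    if stats["ngram_repetition_ratio"] > thresholds["max_ngram_repetition_ratio"]:
        flags.append("repetition")
    return flags


def filter_documents(docs, thresholds=THRESHOLDS):
    """Keep the ids of documents with no flags; docs is a list of (doc_id, stats)."""
    return [doc_id for doc_id, stats in docs if not flag_document(stats, thresholds)]


if __name__ == "__main__":
    clean = {"mean_line_length": 40.0, "word_count": 200,
             "nsfw_word_count": 0, "ngram_repetition_ratio": 0.05}
    noisy = {"mean_line_length": 4.0, "word_count": 20,
             "nsfw_word_count": 5, "ngram_repetition_ratio": 0.6}
    print(flag_document(noisy))
    print(filter_documents([("doc-a", clean), ("doc-b", noisy)]))
```

Keeping flagging separate from filtering means the flags can be inspected or the thresholds tuned before any document is actually dropped.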
Deduplication Stage
The deduplication stage of Setu performs fuzzy deduplication using MinHashLSH implemented in text-dedup. This stage helps in identifying and eliminating duplicate documents, enhancing data cleanliness and efficiency.
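For readers unfamiliar with MinHash LSH, the following from-scratch sketch shows the general technique. It does not use or reproduce the text-dedup API; the function names, the word 3-gram shingles, the 64 hash functions, and the 16 bands are assumptions picked for illustration.

```python
import hashlib
from collections import defaultdict


def shingles(text, n=3):
    """Set of word n-grams used as the document's features."""
    words = text.lower().split()
    if len(words) < n:
        return {" ".join(words)}
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def minhash_signature(features, num_perm=64):
    """For each seeded hash function, keep the minimum hash over all features."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "big")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(f.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for f in features
        ))
    return signature


def candidate_pairs(signatures, bands=16):
    """Documents that agree on every row of at least one band become candidates."""
    if not signatures:
        return set()
    rows = len(next(iter(signatures.values()))) // bands
    buckets = defaultdict(set)
    for doc_id, sig in signatures.items():
        for b in range(bands):
            buckets[(b, tuple(sig[b * rows:(b + 1) * rows]))].add(doc_id)
    pairs = set()
    for ids in buckets.values():
        ordered = sorted(ids)
        pairs.update((a, c) for i, a in enumerate(ordered) for c in ordered[i + 1:])
    return pairs


def estimated_jaccard(sig_a, sig_b):
    """Share of matching signature positions, an estimate of Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


if __name__ == "__main__":
    docs = {
        "a": "setu cleans filters and deduplicates large web corpora",
        "b": "setu cleans filters and deduplicates large web corpora",
        "c": "an unrelated sentence about cooking rice with vegetables",
    }
    sigs = {doc_id: minhash_signature(shingles(text)) for doc_id, text in docs.items()}
    print(candidate_pairs(sigs))
```

Candidate pairs are normally verified against a similarity threshold, here via estimated_jaccard, before one document of each pair is dropped. More bands with fewer rows per band make the candidate step more permissive.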
Usage
To run the different stages of Setu, refer to the commands.md file. You can also use the demo.ipynb files in the repo to understand the usage and output of each stage. Make sure you set $USER and --master to point to your user folder and the corresponding Spark master URL. If you store your datasets in a different location, modify the path arguments of the commands accordingly.
Owner
- Name: AI4Bhārat
- Login: AI4Bharat
- Kind: organization
- Email: opensource@ai4bharat.org
- Location: India
- Website: https://ai4bharat.org
- Twitter: AI4Bharat
- Repositories: 37
- Profile: https://github.com/AI4Bharat
Artificial-Intelligence-For-Bhārat : Building open-source AI solutions for India!
GitHub Events
Total
- Issues event: 2
- Watch event: 6
Last Year
- Issues event: 2
- Watch event: 6
Issues and Pull Requests
Last synced: 9 months ago
All Time
- Total issues: 2
- Total pull requests: 14
- Average time to close issues: N/A
- Average time to close pull requests: about 4 hours
- Total issue authors: 2
- Total pull request authors: 3
- Average comments per issue: 0.0
- Average comments per pull request: 0.0
- Merged pull requests: 14
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 2
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 2
- Pull request authors: 0
- Average comments per issue: 0.0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- AamodThakur (1)
- Gautam-Rajeev (1)
Pull Request Authors
- Shanks0465 (12)
- prikmm (1)
- safikhanSoofiyani (1)
Top Labels
Issue Labels
Pull Request Labels
Dependencies
- _libgcc_mutex 0.1
- _openmp_mutex 4.5
- abseil-cpp 20211102.0
- aiohttp 3.9.3
- aiosignal 1.2.0
- arrow-cpp 14.0.2
- asttokens 2.4.1
- async-timeout 4.0.3
- attrs 23.1.0
- aws-c-auth 0.6.19
- aws-c-cal 0.5.20
- aws-c-common 0.8.5
- aws-c-compression 0.2.16
- aws-c-event-stream 0.2.15
- aws-c-http 0.6.25
- aws-c-io 0.13.10
- aws-c-mqtt 0.7.13
- aws-c-s3 0.1.51
- aws-c-sdkutils 0.1.6
- aws-checksums 0.1.13
- aws-crt-cpp 0.18.16
- aws-sdk-cpp 1.10.55
- beautifulsoup4 4.12.2
- blas 1.0
- boost-cpp 1.82.0
- bottleneck 1.3.7
- brotli 1.0.9
- brotli-bin 1.0.9
- brotli-python 1.0.9
- bzip2 1.0.8
- c-ares 1.19.1
- ca-certificates 2024.2.2
- certifi 2024.2.2
- click 8.1.7
- comm 0.2.1
- contourpy 1.2.0
- cuda-cudart 11.8.89
- cuda-cupti 11.8.87
- cuda-libraries 11.8.0
- cuda-nvrtc 11.8.89
- cuda-nvtx 11.8.86
- cuda-runtime 11.8.0
- cycler 0.11.0
- cyrus-sasl 2.1.28
- datasets 2.12.0
- dbus 1.13.18
- debugpy 1.8.1
- decorator 5.1.1
- dill 0.3.6
- exceptiongroup 1.2.0
- executing 2.0.1
- expat 2.5.0
- fasttext 0.9.2
- ffmpeg 4.3
- filelock 3.13.1
- flashtext 2.7
- fontconfig 2.14.1
- fonttools 4.25.0
- freetype 2.12.1
- frozenlist 1.4.0
- fsspec 2023.10.0
- geos 3.8.0
- gflags 2.2.2
- glib 2.78.4
- glib-tools 2.78.4
- glog 0.5.0
- gmp 6.2.1
- gmpy2 2.1.2
- gnutls 3.6.15
- grpc-cpp 1.48.2
- gst-plugins-base 1.14.1
- gstreamer 1.14.1
- huggingface_hub 0.20.3
- icu 73.1
- idna 3.4
- importlib-metadata 7.0.1
- importlib_metadata 7.0.1
- intel-openmp 2023.1.0
- ipykernel 6.29.3
- ipython 8.22.2
- jedi 0.19.1
- jinja2 3.1.3
- joblib 1.2.0
- jpeg 9e
- jupyter_client 8.6.0
- jupyter_core 5.7.1
- kiwisolver 1.4.4
- krb5 1.20.1
- lame 3.100
- lcms2 2.12
- ld_impl_linux-64 2.38
- lerc 3.0
- libboost 1.82.0
- libbrotlicommon 1.0.9
- libbrotlidec 1.0.9
- libbrotlienc 1.0.9
- libclang13 14.0.6
- libcublas 11.11.3.6
- libcufft 10.9.0.58
- libcufile 1.8.1.2
- libcups 2.4.2
- libcurand 10.3.4.107
- libcurl 8.5.0
- libcusolver 11.4.1.48
- libcusparse 11.7.5.86
- libdeflate 1.17
- libedit 3.1.20230828
- libev 4.33
- libevent 2.1.12
- libffi 3.4.4
- libgcc-ng 13.2.0
- libglib 2.78.4
- libgomp 13.2.0
- libiconv 1.16
- libidn2 2.3.4
- libjpeg-turbo 2.0.0
- libllvm14 14.0.6
- libnghttp2 1.57.0
- libnpp 11.8.0.86
- libnvjpeg 11.9.0.86
- libpng 1.6.39
- libpq 12.17
- libprotobuf 3.20.3
- libsodium 1.0.18
- libssh2 1.10.0
- libstdcxx-ng 13.2.0
- libtasn1 4.19.0
- libthrift 0.15.0
- libtiff 4.5.1
- libunistring 0.9.10
- libuuid 1.41.5
- libwebp-base 1.3.2
- libxcb 1.15
- libxkbcommon 1.0.1
- libxml2 2.10.4
- llvm-openmp 14.0.6
- lz4-c 1.9.4
- markupsafe 2.1.3
- matplotlib 3.8.0
- matplotlib-base 3.8.0
- matplotlib-inline 0.1.6
- mkl 2023.1.0
- mkl-service 2.4.0
- mkl_fft 1.3.8
- mkl_random 1.2.4
- mpc 1.1.0
- mpfr 4.0.2
- mpmath 1.3.0
- multidict 6.0.4
- multiprocess 0.70.14
- munkres 1.1.4
- mysql 5.7.24
- ncurses 6.4
- nest-asyncio 1.6.0
- nettle 3.7.3
- networkx 3.1
- nltk 3.8.1
- numexpr 2.8.7
- numpy 1.26.4
- numpy-base 1.26.4
- openh264 2.1.1
- openjpeg 2.4.0
- openssl 3.2.1
- orc 1.7.4
- packaging 23.1
- pandas 2.1.4
- parso 0.8.3
- pcre2 10.42
- pexpect 4.9.0
- pickleshare 0.7.5
- pillow 10.2.0
- pip 23.3.1
- platformdirs 4.2.0
- ply 3.11
- prompt-toolkit 3.0.42
- psutil 5.9.8
- ptyprocess 0.7.0
- pure_eval 0.2.2
- py4j 0.10.9.7
- pyarrow 14.0.2
- pybind11 2.11.1
- pybind11-global 2.11.1
- pygments 2.17.2
- pyparsing 3.0.9
- pyqt 5.15.10
- pyqt5-sip 12.13.0
- pysocks 1.7.1
- pyspark 3.5.1
- python 3.10.13
- python-dateutil 2.8.2
- python-tzdata 2023.3
- python-xxhash 2.0.2
- python_abi 3.10
- pytorch 2.2.1
- pytorch-cuda 11.8
- pytorch-mutex 1.0
- pytz 2023.3.post1
- pyyaml 6.0.1
- pyzmq 25.1.2
- qt-main 5.15.2
- re2 2022.04.01
- readline 8.2
- regex 2023.10.3
- requests 2.31.0
- responses 0.13.3
- s2n 1.3.27
- safetensors 0.4.2
- setuptools 68.2.2
- shapely 2.0.1
- sip 6.7.12
- six 1.16.0
- snappy 1.1.10
- soupsieve 2.5
- sqlite 3.41.2
- stack_data 0.6.2
- sympy 1.12
- tbb 2021.8.0
- tk 8.6.12
- tokenizers 0.15.1
- tomli 2.0.1
- torchaudio 2.2.1
- torchtriton 2.2.0
- torchvision 0.17.1
- tornado 6.3.3
- tqdm 4.65.0
- traitlets 5.14.1
- transformers 4.38.2
- typing-extensions 4.9.0
- typing_extensions 4.9.0
- tzdata 2024a
- urllib3 2.1.0
- utf8proc 2.6.1
- wcwidth 0.2.13
- wheel 0.41.2
- xxhash 0.8.0
- xz 5.4.6
- yaml 0.2.5
- yarl 1.9.3
- zeromq 4.3.5
- zipp 3.17.0
- zlib 1.2.13
- zstd 1.5.5