https://github.com/ai4bharat/setu

Setu is a comprehensive pipeline designed to clean, filter, and deduplicate diverse data sources including Web, PDF, and Speech data. Built on Apache Spark, Setu encompasses four key stages: document preparation, document cleaning and analysis, flagging and filtering, and deduplication.

Science Score: 13.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (17.8%) to scientific vocabulary
Last synced: 9 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: AI4Bharat
  • License: mit
  • Language: HTML
  • Default Branch: main
  • Homepage:
  • Size: 35 MB
Statistics
  • Stars: 7
  • Watchers: 7
  • Forks: 0
  • Open Issues: 0
  • Releases: 1
Created almost 3 years ago · Last pushed about 2 years ago
Metadata Files
Readme License

README.md

Setu: A Comprehensive Pipeline for Data Cleaning, Filtering and Deduplication

Setu is a comprehensive pipeline designed to clean, filter, and deduplicate diverse data sources including Web, PDF, and Speech data. Built on Apache Spark, Setu encompasses four key stages: document preparation, document cleaning and analysis, flagging and filtering, and deduplication. For detailed codebase documentation, see the Setu Documentation.

Table of Contents

  1. Quickstart
  2. Overview
  3. Usage

Updates:

  • Added Documentation
  • Flat Project Structure

Future Updates:

  • Add support for speech transcripts
  • Add support for multiple data formats

Quickstart

This documentation provides an overview of Setu and its workflow, enabling users to efficiently manage and process Web, PDF, and Speech data with Apache Spark.

Note: if you want to run the pipeline on Windows, we recommend using WSL (Windows Subsystem for Linux), because some dependencies and scripts only work in a Linux environment.

Installation

Install Python on WSL

  • Before installing, make sure your Python version is 3.10.X or above. For ease of installation, we recommend 3.10.X. Also make sure you have Miniconda installed, as we will be using conda environments.

Install Java OpenJDK

    sudo apt update
    sudo apt install openjdk-11-jdk
    java --version

Install Spark for Hadoop 3.3

Note: Run the following commands from your home directory.

    wget https://dlcdn.apache.org/spark/spark-3.5.1/spark-3.5.1-bin-hadoop3.tgz

    mkdir -p ~/hadoop/spark-3.5.1
    tar -xvzf spark-3.5.1-bin-hadoop3.tgz -C ~/hadoop/spark-3.5.1 --strip-components=1

Configuration

  • Edit your ~/.bashrc file and add the following lines:

    export SPARK_HOME=~/hadoop/spark-3.5.1
    export PATH=$SPARK_HOME/bin:$PATH

  • Reload the file so that the changes take effect in the current shell:

    source ~/.bashrc

  • Copy the default Spark config template and save it as the config file.

    cp $SPARK_HOME/conf/spark-defaults.conf.template $SPARK_HOME/conf/spark-defaults.conf

  • Edit the config file and set the Spark driver host address.

    nano $SPARK_HOME/conf/spark-defaults.conf

  Add the following line to the file:

    spark.driver.host localhost

  • Test your Spark installation by running spark-shell.

    spark-shell

Setu Environment Setup

You can now create the conda environment directly from the provided environment.yml file.

    conda env create -f environment.yml

  • Refer to the packages.txt file to verify the installed libraries. Some libraries need to be installed with pip.

Make sure that PySpark is working by running pyspark in the terminal.

    pyspark
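As a quick sanity check before launching any stage, the installation steps above can be verified from Python. This is a minimal sketch, not part of Setu: the function name `check_setu_prerequisites` and the report keys are hypothetical, and it only checks that `SPARK_HOME` is set, that `java` and `spark-shell` are on `PATH`, and that `pyspark` is importable in the active environment.

```python
import importlib.util
import os
import shutil


def check_setu_prerequisites() -> dict:
    """Report which prerequisites from the installation steps are visible.

    Hypothetical helper for illustration; it is not shipped with Setu.
    """
    spark_home = os.environ.get("SPARK_HOME", "")
    return {
        # SPARK_HOME should point at the extracted Spark directory.
        "spark_home_set": bool(spark_home)
        and os.path.isdir(os.path.expanduser(spark_home)),
        # The java and spark-shell launchers should both be on PATH.
        "java_on_path": shutil.which("java") is not None,
        "spark_shell_on_path": shutil.which("spark-shell") is not None,
        # find_spec returns None when pyspark cannot be imported here.
        "pyspark_importable": importlib.util.find_spec("pyspark") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_setu_prerequisites().items():
        print(f"{name}: {'ok' if ok else 'missing'}")
```

Run it inside the activated conda environment; any entry reported as missing points back to the corresponding installation step.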

Overview

As part of IndicLLMSuite: A Blueprint for Creating Pre-training and Fine-Tuning Datasets for Indian Languages, we release Sangraha, a dataset of 251 billion tokens across 22 languages, extracted from curated URLs, existing multilingual corpora, and large-scale translations.

To build the corpus, we use webcorpus to crawl a large collection of web URLs curated across all 22 Indic languages. For PDF documents, we download book collections pertaining to Indic languages from the Internet Archive. To simplify downloading the PDF files, you can refer to Sangraha Data Download.

Document Preparation

The first stage of Setu focuses on extracting text from a variety of sources to create text documents for further processing. For Web documents, Setu uses trafilatura (Barbaresi, 2021b) to extract text from HTML. PDFs, meanwhile, go through a pipeline that generates OCR JSON outputs using the GCP Cloud Vision SDK. Once these JSONs are generated, Setu uses bounding-box information to filter out pages potentially afflicted with recognition issues and noise.
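To make the bounding-box idea concrete, here is a minimal sketch of one possible page-level check. Everything in it is an assumption for illustration: the README does not say which bounding-box signals Setu uses, so the box format `(x_min, y_min, x_max, y_max)`, the overlap heuristic, and the `max_overlap_ratio` threshold are all hypothetical.

```python
from typing import List, Tuple

# Hypothetical box format: (x_min, y_min, x_max, y_max) in page coordinates.
Box = Tuple[float, float, float, float]


def boxes_overlap(a: Box, b: Box) -> bool:
    """True when two axis-aligned boxes share a region of positive area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def overlap_ratio(boxes: List[Box]) -> float:
    """Fraction of boxes on a page that overlap at least one other box."""
    if not boxes:
        return 0.0
    overlapping = sum(
        any(boxes_overlap(box, other) for j, other in enumerate(boxes) if j != i)
        for i, box in enumerate(boxes)
    )
    return overlapping / len(boxes)


def keep_page(boxes: List[Box], max_overlap_ratio: float = 0.2) -> bool:
    """Keep a page only if few of its text boxes collide (assumed heuristic)."""
    return overlap_ratio(boxes) <= max_overlap_ratio


clean_page = [(0, 0, 10, 2), (0, 3, 10, 5), (0, 6, 10, 8)]
noisy_page = [(0, 0, 10, 4), (0, 2, 10, 6), (0, 7, 10, 9)]
print(keep_page(clean_page), keep_page(noisy_page))
```

Heavily overlapping text boxes are one plausible symptom of OCR trouble; a real filter would likely combine several such signals.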

Cleaning and Analysis Stage

In the cleaning and analysis stage, Setu focuses on reducing noise within individual documents. It employs a multi-model approach for language identification, leveraging the outputs of three different language identification libraries.
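One common way to combine several language identifiers is a majority vote. The sketch below assumes that approach; the README does not state how Setu actually reconciles the three outputs, and the function name `majority_language` and the `min_votes` parameter are hypothetical.

```python
from collections import Counter
from typing import Optional, Sequence


def majority_language(predictions: Sequence[str], min_votes: int = 2) -> Optional[str]:
    """Return the label chosen by at least `min_votes` identifiers, else None.

    `predictions` holds one language code per identifier, e.g. three codes
    for three libraries. The voting rule itself is an assumption.
    """
    if not predictions:
        return None
    label, votes = Counter(predictions).most_common(1)[0]
    return label if votes >= min_votes else None


print(majority_language(["hi", "hi", "mr"]))  # two of three agree
print(majority_language(["hi", "mr", "ne"]))  # no agreement
```

Documents for which the identifiers disagree can then be routed to a stricter check or dropped, depending on how conservative the pipeline needs to be.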

Various statistics such as character and word counts, NSFW word count, and n-gram repetition ratio are computed during analysis.
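The statistics named above can be sketched in plain Python as follows. This is an illustrative approximation, not Setu's Spark implementation: the whitespace tokenization, the definition of the repetition ratio (share of n-grams that occur more than once), and the names `document_stats` and `ngram_repetition_ratio` are assumptions.

```python
from collections import Counter
from typing import Dict, List, Set


def ngram_repetition_ratio(words: List[str], n: int = 2) -> float:
    """Share of n-gram occurrences that belong to an n-gram seen more than once."""
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    repeated = sum(count for count in counts.values() if count > 1)
    return repeated / len(ngrams)


def document_stats(text: str, nsfw_words: Set[str], n: int = 2) -> Dict[str, float]:
    """Compute per-document statistics using simple whitespace tokenization."""
    words = text.split()
    # Strip common punctuation so "word," and "word" are counted together.
    normalized = [w.strip(".,!?;:\"'()").lower() for w in words]
    return {
        "char_count": len(text),
        "word_count": len(words),
        "nsfw_word_count": sum(w in nsfw_words for w in normalized),
        "ngram_repetition_ratio": ngram_repetition_ratio(normalized, n),
    }


stats = document_stats("spam spam spam eggs", nsfw_words={"eggs"})
print(stats)
```

Whitespace splitting is a simplification; for Indic scripts a language-aware tokenizer would give more reliable word counts.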

Flagging and Filtering Stage

During the flagging and filtering stage, Setu applies filters based on the computed statistics. Filters include line length filters, NSFW word filters, and repetition filters, aimed at removing noisy and toxic documents.
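A threshold-based flagging step of this kind might look like the sketch below. The threshold values, the statistic names, and the flag labels are all hypothetical; Setu's actual filters and cut-offs are defined in its own configuration and are not reproduced here.

```python
from typing import Dict, List

# Hypothetical thresholds, chosen only to make the example concrete.
DEFAULT_THRESHOLDS: Dict[str, float] = {
    "min_mean_line_length": 10.0,
    "max_nsfw_word_ratio": 0.01,
    "max_ngram_repetition_ratio": 0.3,
}


def flag_document(stats: Dict[str, float],
                  thresholds: Dict[str, float] = DEFAULT_THRESHOLDS) -> List[str]:
    """Return the list of filters a document trips, based on precomputed stats."""
    flags = []
    if stats["mean_line_length"] < thresholds["min_mean_line_length"]:
        flags.append("short_lines")
    # Guard against division by zero for empty documents.
    word_count = max(stats["word_count"], 1)
    if stats["nsfw_word_count"] / word_count > thresholds["max_nsfw_word_ratio"]:
        flags.append("nsfw")
    if stats["ngram_repetition_ratio"] > thresholds["max_ngram_repetition_ratio"]:
        flags.append("repetitive")
    return flags


def filter_documents(docs: List[dict]) -> List[dict]:
    """Keep only documents that trip no filter."""
    return [doc for doc in docs if not flag_document(doc["stats"])]


clean = {"mean_line_length": 40.0, "word_count": 200,
         "nsfw_word_count": 0, "ngram_repetition_ratio": 0.05}
noisy = {"mean_line_length": 5.0, "word_count": 100,
         "nsfw_word_count": 5, "ngram_repetition_ratio": 0.6}
print(flag_document(clean), flag_document(noisy))
```

Separating flagging from filtering keeps the flags available for inspection, so thresholds can be tuned before any document is actually dropped.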

Deduplication Stage

The deduplication stage of Setu performs fuzzy deduplication using MinHashLSH implemented in text-dedup. This stage helps in identifying and eliminating duplicate documents, enhancing data cleanliness and efficiency.
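To illustrate the idea behind MinHash LSH, here is a small self-contained sketch in pure Python. It is not the text-dedup implementation that Setu uses: the shingle size, the 64 hash functions split into 16 bands of 4 rows, and all function names are assumptions chosen for readability.

```python
import hashlib
from collections import defaultdict
from typing import Dict, Set, Tuple

NUM_HASHES = 64            # assumed signature length
BANDS = 16                 # assumed number of LSH bands
ROWS = NUM_HASHES // BANDS  # rows per band


def shingles(text: str, n: int = 3) -> Set[str]:
    """Word n-grams of a document; short documents yield a single shingle."""
    words = text.lower().split()
    if len(words) < n:
        return {" ".join(words)}
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def seeded_hash(seed: int, token: str) -> int:
    """64-bit hash of a token; the salt selects one of the hash functions."""
    digest = hashlib.blake2b(
        token.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(text: str) -> Tuple[int, ...]:
    """Minimum hash value per hash function over the document's shingles."""
    tokens = shingles(text)
    return tuple(min(seeded_hash(seed, t) for t in tokens) for seed in range(NUM_HASHES))


def candidate_pairs(docs: Dict[str, str]) -> Set[Tuple[str, str]]:
    """Pairs of document ids that share at least one identical signature band."""
    buckets = defaultdict(list)
    for doc_id, text in docs.items():
        signature = minhash_signature(text)
        for band in range(BANDS):
            key = (band, signature[band * ROWS:(band + 1) * ROWS])
            buckets[key].append(doc_id)
    pairs = set()
    for ids in buckets.values():
        for i in range(len(ids)):
            for j in range(i + 1, len(ids)):
                pairs.add(tuple(sorted((ids[i], ids[j]))))
    return pairs


docs = {
    "a": "setu cleans filters and deduplicates web and pdf documents at scale",
    "b": "setu cleans filters and deduplicates web and pdf documents at scale",
    "c": "an unrelated sentence about cricket scores from yesterday evening",
}
print(candidate_pairs(docs))
```

Documents whose shingle sets overlap heavily are likely to agree on at least one band and so become candidates; candidates are usually verified against a similarity threshold before one copy is removed.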

Usage

To run the different stages of Setu, refer to the commands.md file; the demo.ipynb files in the repo also illustrate the usage and output of each stage. Make sure you configure $USER and --master to point to your user folder and the corresponding Spark master URL. If you store your datasets in a different location, modify the path arguments of the commands accordingly.

Owner

  • Name: AI4Bhārat
  • Login: AI4Bharat
  • Kind: organization
  • Email: opensource@ai4bharat.org
  • Location: India

Artificial-Intelligence-For-Bhārat : Building open-source AI solutions for India!

GitHub Events

Total
  • Issues event: 2
  • Watch event: 6
Last Year
  • Issues event: 2
  • Watch event: 6

Issues and Pull Requests

All Time
  • Total issues: 2
  • Total pull requests: 14
  • Average time to close issues: N/A
  • Average time to close pull requests: about 4 hours
  • Total issue authors: 2
  • Total pull request authors: 3
  • Average comments per issue: 0.0
  • Average comments per pull request: 0.0
  • Merged pull requests: 14
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 2
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 2
  • Pull request authors: 0
  • Average comments per issue: 0.0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • AamodThakur (1)
  • Gautam-Rajeev (1)
Pull Request Authors
  • Shanks0465 (12)
  • prikmm (1)
  • safikhanSoofiyani (1)
Top Labels
Issue Labels
Pull Request Labels

Dependencies

environment.yml conda
  • _libgcc_mutex 0.1
  • _openmp_mutex 4.5
  • abseil-cpp 20211102.0
  • aiohttp 3.9.3
  • aiosignal 1.2.0
  • arrow-cpp 14.0.2
  • asttokens 2.4.1
  • async-timeout 4.0.3
  • attrs 23.1.0
  • aws-c-auth 0.6.19
  • aws-c-cal 0.5.20
  • aws-c-common 0.8.5
  • aws-c-compression 0.2.16
  • aws-c-event-stream 0.2.15
  • aws-c-http 0.6.25
  • aws-c-io 0.13.10
  • aws-c-mqtt 0.7.13
  • aws-c-s3 0.1.51
  • aws-c-sdkutils 0.1.6
  • aws-checksums 0.1.13
  • aws-crt-cpp 0.18.16
  • aws-sdk-cpp 1.10.55
  • beautifulsoup4 4.12.2
  • blas 1.0
  • boost-cpp 1.82.0
  • bottleneck 1.3.7
  • brotli 1.0.9
  • brotli-bin 1.0.9
  • brotli-python 1.0.9
  • bzip2 1.0.8
  • c-ares 1.19.1
  • ca-certificates 2024.2.2
  • certifi 2024.2.2
  • click 8.1.7
  • comm 0.2.1
  • contourpy 1.2.0
  • cuda-cudart 11.8.89
  • cuda-cupti 11.8.87
  • cuda-libraries 11.8.0
  • cuda-nvrtc 11.8.89
  • cuda-nvtx 11.8.86
  • cuda-runtime 11.8.0
  • cycler 0.11.0
  • cyrus-sasl 2.1.28
  • datasets 2.12.0
  • dbus 1.13.18
  • debugpy 1.8.1
  • decorator 5.1.1
  • dill 0.3.6
  • exceptiongroup 1.2.0
  • executing 2.0.1
  • expat 2.5.0
  • fasttext 0.9.2
  • ffmpeg 4.3
  • filelock 3.13.1
  • flashtext 2.7
  • fontconfig 2.14.1
  • fonttools 4.25.0
  • freetype 2.12.1
  • frozenlist 1.4.0
  • fsspec 2023.10.0
  • geos 3.8.0
  • gflags 2.2.2
  • glib 2.78.4
  • glib-tools 2.78.4
  • glog 0.5.0
  • gmp 6.2.1
  • gmpy2 2.1.2
  • gnutls 3.6.15
  • grpc-cpp 1.48.2
  • gst-plugins-base 1.14.1
  • gstreamer 1.14.1
  • huggingface_hub 0.20.3
  • icu 73.1
  • idna 3.4
  • importlib-metadata 7.0.1
  • importlib_metadata 7.0.1
  • intel-openmp 2023.1.0
  • ipykernel 6.29.3
  • ipython 8.22.2
  • jedi 0.19.1
  • jinja2 3.1.3
  • joblib 1.2.0
  • jpeg 9e
  • jupyter_client 8.6.0
  • jupyter_core 5.7.1
  • kiwisolver 1.4.4
  • krb5 1.20.1
  • lame 3.100
  • lcms2 2.12
  • ld_impl_linux-64 2.38
  • lerc 3.0
  • libboost 1.82.0
  • libbrotlicommon 1.0.9
  • libbrotlidec 1.0.9
  • libbrotlienc 1.0.9
  • libclang13 14.0.6
  • libcublas 11.11.3.6
  • libcufft 10.9.0.58
  • libcufile 1.8.1.2
  • libcups 2.4.2
  • libcurand 10.3.4.107
  • libcurl 8.5.0
  • libcusolver 11.4.1.48
  • libcusparse 11.7.5.86
  • libdeflate 1.17
  • libedit 3.1.20230828
  • libev 4.33
  • libevent 2.1.12
  • libffi 3.4.4
  • libgcc-ng 13.2.0
  • libglib 2.78.4
  • libgomp 13.2.0
  • libiconv 1.16
  • libidn2 2.3.4
  • libjpeg-turbo 2.0.0
  • libllvm14 14.0.6
  • libnghttp2 1.57.0
  • libnpp 11.8.0.86
  • libnvjpeg 11.9.0.86
  • libpng 1.6.39
  • libpq 12.17
  • libprotobuf 3.20.3
  • libsodium 1.0.18
  • libssh2 1.10.0
  • libstdcxx-ng 13.2.0
  • libtasn1 4.19.0
  • libthrift 0.15.0
  • libtiff 4.5.1
  • libunistring 0.9.10
  • libuuid 1.41.5
  • libwebp-base 1.3.2
  • libxcb 1.15
  • libxkbcommon 1.0.1
  • libxml2 2.10.4
  • llvm-openmp 14.0.6
  • lz4-c 1.9.4
  • markupsafe 2.1.3
  • matplotlib 3.8.0
  • matplotlib-base 3.8.0
  • matplotlib-inline 0.1.6
  • mkl 2023.1.0
  • mkl-service 2.4.0
  • mkl_fft 1.3.8
  • mkl_random 1.2.4
  • mpc 1.1.0
  • mpfr 4.0.2
  • mpmath 1.3.0
  • multidict 6.0.4
  • multiprocess 0.70.14
  • munkres 1.1.4
  • mysql 5.7.24
  • ncurses 6.4
  • nest-asyncio 1.6.0
  • nettle 3.7.3
  • networkx 3.1
  • nltk 3.8.1
  • numexpr 2.8.7
  • numpy 1.26.4
  • numpy-base 1.26.4
  • openh264 2.1.1
  • openjpeg 2.4.0
  • openssl 3.2.1
  • orc 1.7.4
  • packaging 23.1
  • pandas 2.1.4
  • parso 0.8.3
  • pcre2 10.42
  • pexpect 4.9.0
  • pickleshare 0.7.5
  • pillow 10.2.0
  • pip 23.3.1
  • platformdirs 4.2.0
  • ply 3.11
  • prompt-toolkit 3.0.42
  • psutil 5.9.8
  • ptyprocess 0.7.0
  • pure_eval 0.2.2
  • py4j 0.10.9.7
  • pyarrow 14.0.2
  • pybind11 2.11.1
  • pybind11-global 2.11.1
  • pygments 2.17.2
  • pyparsing 3.0.9
  • pyqt 5.15.10
  • pyqt5-sip 12.13.0
  • pysocks 1.7.1
  • pyspark 3.5.1
  • python 3.10.13
  • python-dateutil 2.8.2
  • python-tzdata 2023.3
  • python-xxhash 2.0.2
  • python_abi 3.10
  • pytorch 2.2.1
  • pytorch-cuda 11.8
  • pytorch-mutex 1.0
  • pytz 2023.3.post1
  • pyyaml 6.0.1
  • pyzmq 25.1.2
  • qt-main 5.15.2
  • re2 2022.04.01
  • readline 8.2
  • regex 2023.10.3
  • requests 2.31.0
  • responses 0.13.3
  • s2n 1.3.27
  • safetensors 0.4.2
  • setuptools 68.2.2
  • shapely 2.0.1
  • sip 6.7.12
  • six 1.16.0
  • snappy 1.1.10
  • soupsieve 2.5
  • sqlite 3.41.2
  • stack_data 0.6.2
  • sympy 1.12
  • tbb 2021.8.0
  • tk 8.6.12
  • tokenizers 0.15.1
  • tomli 2.0.1
  • torchaudio 2.2.1
  • torchtriton 2.2.0
  • torchvision 0.17.1
  • tornado 6.3.3
  • tqdm 4.65.0
  • traitlets 5.14.1
  • transformers 4.38.2
  • typing-extensions 4.9.0
  • typing_extensions 4.9.0
  • tzdata 2024a
  • urllib3 2.1.0
  • utf8proc 2.6.1
  • wcwidth 0.2.13
  • wheel 0.41.2
  • xxhash 0.8.0
  • xz 5.4.6
  • yaml 0.2.5
  • yarl 1.9.3
  • zeromq 4.3.5
  • zipp 3.17.0
  • zlib 1.2.13
  • zstd 1.5.5