https://github.com/citiususc/pyplexity

Cleaning tool for web scraped text

Science Score: 13.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
  • DOI references
  • Academic publication links
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (17.9%) to scientific vocabulary

Keywords

information-retrieval nlp python scraping tag-cleaning

Keywords from Contributors

labels
Last synced: 6 months ago

Repository

Cleaning tool for web scraped text

Basic Info
  • Host: GitHub
  • Owner: citiususc
  • License: gpl-3.0
  • Language: Python
  • Default Branch: main
  • Homepage:
  • Size: 263 KB
Statistics
  • Stars: 38
  • Watchers: 3
  • Forks: 3
  • Open Issues: 2
  • Releases: 0
Topics
information-retrieval nlp python scraping tag-cleaning
Created about 4 years ago · Last pushed over 2 years ago
Metadata Files
Readme License

README.md

PyPlexity

This package provides a simple interface for applying perplexity filters to any document. A typical use case is removing boilerplate (sentences with a high perplexity score). It also provides a rough HTML tag cleaner and a bulk WARC and HTML processor with distributed capabilities.
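As a concrete illustration of the underlying idea (a minimal sketch, not pyplexity's implementation), a bigram language model trained on clean text assigns a higher perplexity to out-of-distribution boilerplate than to in-domain sentences:

```python
import math
from collections import Counter

def train_bigram_lm(corpus):
    """Count unigrams and bigrams over a list of tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sent in corpus:
        tokens = ["<s>"] + sent + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def perplexity(sentence, unigrams, bigrams, vocab_size):
    """Perplexity under a bigram model with add-one (Laplace) smoothing."""
    tokens = ["<s>"] + sentence + ["</s>"]
    log_prob = 0.0
    for prev, cur in zip(tokens, tokens[1:]):
        p = (bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab_size)
        log_prob += math.log(p)
    return math.exp(-log_prob / (len(tokens) - 1))

corpus = [s.split() for s in [
    "this is normal text",
    "this is a sentence",
    "text is normal here",
]]
uni, bi = train_bigram_lm(corpus)
ppl_good = perplexity("this is normal text".split(), uni, bi, len(uni))
ppl_bad = perplexity("HTML BODY 678346 NOR".split(), uni, bi, len(uni))
# In-domain text scores much lower than boilerplate-like junk.
```

pyplexity's real models are trained on large corpora (CORD-19, BNC) with the same principle: the higher a sentence's perplexity under the background LM, the more likely it is boilerplate.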

Cite

If you use this tool, please cite:

Fernández-Pichel, M., Prada-Corral, M., Losada, D. E., Pichel, J. C., & Gamallo, P. (2023). An unsupervised perplexity-based method for boilerplate removal. Natural Language Engineering, 1-18.

Models

English language

Memory intensive, but does not scale on CPU.

| Model | RAM usage | Download size |
| --- | --- | --- |
| bigrams-cord19 | 2 GB | 230 MB |
| bigrams-bnc | 5 GB | 660 MB |
| trigrams-cord19 | 6.6 GB | 1 GB |
| trigrams-bnc | 14 GB | 2.2 GB |

Two different datasets were selected to build the background language model (LM): CORD-19 dataset [1] and the British National Corpus (BNC) [2].

[1] Wang, L. L., Lo, K., Chandrasekhar, Y., Reas, R., Yang, J., Eide, D., ... & Kohlmeier, S. (2020). Cord-19: The covid-19 open research dataset. ArXiv.

[2] BNC Consortium. (2007). British national corpus. Oxford Text Archive Core Collection.

Galician language

COMING SOON: support for minority languages, via the Nós project. You can download the model from here.

Build and use custom models

If you want to build your own models, you can check it here. You can also use the parameter --model PATH to load local models.

Installation process

This package is published in the PyPI repository and can be installed with pip:

pip install pyplexity

Examples of usage options

Compute perplexity from console

The "perplexity" command computes the perplexity score of a given sentence under a given distribution, in this case the background LM (bigrams-bnc by default). The "--model" argument changes the model.
Documentation:

```
citius@pc:~$ pyplexity perplexity --help
Usage: pyplexity perplexity [OPTIONS] TEXT

Arguments:
  TEXT  [required]

Options:
  --model TEXT  [default: bigrams-bnc]
  --help        Show this message and exit.
```

By default, models are stored in ~/.cache/cached_path/, as per the cached-path package documentation.

Example:

```
citius@pc:~$ pyplexity perplexity "this is normal text"
downloading: 100%|##########| 660M/660M [00:11<00:00, 59.0MiB/s]
Loading model... Done.
1844.85540669094
citius@pc:~$ pyplexity perplexity "this is normal HTML PAGE BOI%& 678346 NOR text"
Loading model... Done.
44787.99199563819
```

As can be seen, malformed sentences obtain a higher perplexity value.

Bulk perplexity computation and cleaning of a directory

The previous command was a toy example; in real applications we will usually want to score complete datasets in order to clean them. This is where the bulk-perplexity functionality, which supports WARC or HTML directories, comes in.

Documentation:

```
citius@pc:~$ pyplexity bulk-perplexity --help
Usage: pyplexity bulk-perplexity [OPTIONS] INPUT_DIR

Arguments:
  INPUT_DIR  [required]

Options:
  --output-dir TEXT                 [default: outdir]
  --model TEXT                      [default: bigrams-bnc]
  --perpl-limit FLOAT               [default: 8000.0]
  --warc-input / --no-warc-input    [default: no-warc-input]
Distributed computing options:
  --distributed / --no-distributed  [default: no-distributed]
  --n-workers INTEGER               [default: 1]
  --node INTEGER                    [default: 1]
  --port INTEGER                    [default: 8866]
  --help                            Show this message and exit.
```

We will explain the distributed computing capabilities later. The input directory may contain recursive subdirectories with files. WARC containers and HTML files should have been previously tag-cleaned with the command below.

Example:

```
citius@pc:~$ pyplexity bulk-perplexity ./outdir/ --output-dir cleaned_files --model bigrams-cord19
downloading: 100%|##########| 233M/233M [00:03<00:00, 63.3MiB/s]
Loading model... Done.
Computed 1124 files in 0:00:01.905390.
```

NOTE: In this new version, we do not remove malformed sentences; we just tag them with ppl markers, giving more control to end users.
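The tag-instead-of-remove behaviour can be pictured with a small sketch (hypothetical, not pyplexity's actual output format or scorer): sentences whose score exceeds the limit are wrapped in a marker rather than dropped, so downstream code decides what to do with them:

```python
# Hypothetical illustration of tagging instead of removing: the <ppl> marker
# and the stand-in scorer below are illustrative only.
def tag_sentences(sentences, score_fn, perpl_limit=8000.0):
    tagged = []
    for sentence in sentences:
        if score_fn(sentence) > perpl_limit:
            tagged.append(f"<ppl>{sentence}</ppl>")  # flagged, not removed
        else:
            tagged.append(sentence)
    return tagged

# Stand-in scorer: pretend all-caps junk gets a huge perplexity.
fake_score = lambda s: 9000.0 if s.isupper() else 100.0
result = tag_sentences(
    ["This is a normal sentence.", "HJLDFUIA HTML BODY 678346"], fake_score)
```

An end user can then strip, keep, or post-process the flagged sentences as needed.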

Perform HTML tag cleaning of a directory

Our method does not remove HTML tags by default, which could degrade its overall performance. That is why we recommend removing HTML tags first, and we offer this option inside our package.

Documentation:

```
citius@pc:~$ pyplexity tag-remover --help
Usage: pyplexity tag-remover [OPTIONS] BASE_DIR

Arguments:
  BASE_DIR  [required]

Options:
  --output-dir TEXT                 [default: out_dir]
  --warc-input / --no-warc-input    [default: no-warc-input]
Distributed computing options:
  --distributed / --no-distributed  [default: no-distributed]
  --n-workers INTEGER               [default: 1]
  --node INTEGER                    [default: 1]
  --port INTEGER                    [default: 8866]
  --help                            Show this message and exit.
```

We will explain the distributed computing capabilities later. The input directory may contain recursive subdirectories with files. The command can process HTML files or WARC files; in the WARC case, it efficiently recompresses the archive after stripping out all the tags.

Example:

```
citius@pc:~$ pyplexity tag-remover ./html_source --output-dir ./output
Computed 1124 files in 0:00:00.543175.
```
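The directory-level cleaning described above can be pictured with a stdlib-only sketch (not pyplexity's implementation, which parses with html5lib/lxml and also handles WARC containers): walk the input tree recursively, strip tags from each HTML file, and mirror the structure into the output directory:

```python
import tempfile
from html.parser import HTMLParser
from pathlib import Path

class TextExtractor(HTMLParser):
    """Collect text nodes, skipping script/style content."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts, self._skip = [], 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())

def clean_tree(input_dir, output_dir):
    """Mirror input_dir into output_dir, stripping tags from every .html file."""
    in_root, out_root = Path(input_dir), Path(output_dir)
    for path in in_root.rglob("*.html"):   # recursive subdirectories allowed
        extractor = TextExtractor()
        extractor.feed(path.read_text(encoding="utf-8", errors="replace"))
        dest = out_root / path.relative_to(in_root)
        dest.parent.mkdir(parents=True, exist_ok=True)
        dest.write_text(" ".join(extractor.parts), encoding="utf-8")

# Tiny demonstration on a temporary directory tree.
tmp = Path(tempfile.mkdtemp())
(tmp / "in" / "sub").mkdir(parents=True)
(tmp / "in" / "sub" / "page.html").write_text(
    "<html><body><p>Hello world</p><script>var x = 1;</script></body></html>")
clean_tree(tmp / "in", tmp / "out")
cleaned = (tmp / "out" / "sub" / "page.html").read_text()
```

The real tag-remover is more robust (tolerant HTML parsing, WARC recompression), but the walk-and-mirror structure is the same.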

Parallel mode (cluster)

Previous documentation shows that our commands have integrated distributed computing capabilities. In cluster mode, all nodes must be interconnected on a local network and have access to the same files, mounted via SSHFS or another shared filesystem. A master node recursively loads the folder of files to be computed, with the command:

```
pyplexity fileserver /mnt/input_dir --port 8866
```

Clients on the other nodes then connect to the master node, asking for the names of files to process. This mechanism balances the load, as clients pull queued files from the master. For example, from a node:

```
pyplexity bulk-perplexity /mnt/input_dir --output-dir /mnt/output_dir --warc-input --distributed --n-workers 10 --node 2 --url master.local --port 8866
```

This command should be executed on every machine of the cluster. The node argument identifies the machine for logging purposes and has no functional relevance. The n-workers argument controls the number of worker threads per machine that concurrently query the master node for files. When the master server has served all the files, worker processes shut down accordingly. In our experiments, we used this feature to run HTML tag removal and perplexity computation on 20 threads × 15 machines.
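The pull-based work distribution can be pictured with a stdlib-only, in-process sketch (a simulation of the pattern, not pyplexity's actual network protocol): a master holds a queue of file names, and workers pull from it until it is drained, at which point they shut down:

```python
import queue
import threading

def run_cluster(files, n_workers=4):
    """Simulate the pull model: each worker asks the master queue for the
    next file name and stops when the queue is exhausted."""
    master = queue.Queue()
    for name in files:
        master.put(name)
    processed = []
    lock = threading.Lock()

    def worker():
        while True:
            try:
                name = master.get_nowait()  # "ask the master for a file"
            except queue.Empty:
                return                      # master drained: shut down
            with lock:
                processed.append(name)      # stand-in for real processing

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return processed

done = run_cluster([f"doc_{i}.warc" for i in range(100)], n_workers=10)
# Every file is processed exactly once, regardless of worker count.
```

Because workers pull rather than being assigned fixed shards, faster machines naturally take on more files, which is what makes the scheme balance load across a heterogeneous cluster.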

Interfacing from Python

We also offer the possibility of using pyplexity from Python code. As an example, we provide an API that serves a web app for small tests on directly cleaning texts or raw files.

Example 1: computing the perplexity score of a sentence:

```
from pyplexity import PerplexityModel

model = PerplexityModel.from_str("bigrams-cord19")
perpl = model.compute_sentence("this is normal text")
```

Example 2: cleaning sentences from a text:

```
from pyplexity import PerplexityModel, PerplexityProcessor

model = PerplexityModel.from_str("bigrams-cord19")
text_processor = PerplexityProcessor(perpl_model=model, perpl_limit=8000.0)
clean_text = text_processor.process("This is a normal sentence. Meanwhile, hjldfuia HTML BODY this one will be deleted LINK URL COUISUDOANLHJWQKEJK")
```

Example 3: removing HTML tags from a website:

```
import requests
from pyplexity.tag_remover import HTMLTagRemover

html = requests.get("https://example.com").text
text = HTMLTagRemover().process(html)
```

Web Demo

We also provide a web demo as a simple example of the power of our tool.

[screenshot]

Building the package

If you are interested, you can also build the same package version we have currently deployed in the PyPI repository:

```
git clone https://github.com/citiususc/pyplexity && cd pyplexity
curl -sSL https://install.python-poetry.org | python3 -
source $HOME/.poetry/env
poetry build
pip3 install dist/pyplexity-X.X.X-py3-none-any.whl
```

General Advice

As you may have noticed, this is an unsupervised method that requires choosing an appropriate model and threshold. From our experimentation, we concluded that using the bigrams-bnc model and removing sentences with a perplexity above 8,000 is a robust strategy for both an IR search task and a text classification task.

Owner

  • Name: CiTIUS
  • Login: citiususc
  • Kind: organization
  • Email: citius@usc.es
  • Location: Santiago de Compostela

Centro Singular de Investigación en Tecnoloxías Intelixentes da Universidade de Santiago de Compostela

Committers

Last synced: almost 3 years ago

All Time
  • Total Commits: 25
  • Total Committers: 6
  • Avg Commits per committer: 4.167
  • Development Distribution Score (DDS): 0.68
Top Committers
Name Email Commits
Manuel de Prada Corral eu@m****m 8
MarcosFP97 4****7@u****m 6
Manuel de Prada 6****a@u****m 6
Marcos Fernández Pichel m****l@u****s 3
dependabot[bot] 4****]@u****m 1
Manuel de Prada Corral m****a@g****m 1

Issues and Pull Requests

Last synced: 8 months ago

All Time
  • Total issues: 2
  • Total pull requests: 2
  • Average time to close issues: N/A
  • Average time to close pull requests: 1 minute
  • Total issue authors: 2
  • Total pull request authors: 2
  • Average comments per issue: 0.0
  • Average comments per pull request: 0.0
  • Merged pull requests: 2
  • Bot issues: 0
  • Bot pull requests: 1
Past Year
  • Issues: 1
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 1
  • Pull request authors: 0
  • Average comments per issue: 0.0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • Xiaoshu-Zhao (1)
  • johnnyjana730 (1)
Pull Request Authors
  • MarcosFP97 (1)
  • dependabot[bot] (1)
Top Labels
Issue Labels
Pull Request Labels
dependencies (1)

Packages

  • Total packages: 1
  • Total downloads:
    • pypi 81,196 last-month
  • Total dependent packages: 0
  • Total dependent repositories: 1
  • Total versions: 18
  • Total maintainers: 2
pypi.org: pyplexity

Perplexity filter for documents and bulk HTML and WARC boilerplate removal.

  • Versions: 18
  • Dependent Packages: 0
  • Dependent Repositories: 1
  • Downloads: 81,196 Last month
Rankings
Dependent packages count: 10.0%
Stargazers count: 11.1%
Downloads: 16.2%
Average: 16.4%
Dependent repos count: 21.8%
Forks count: 22.7%
Maintainers (2)
Last synced: 7 months ago

Dependencies

pyproject.toml pypi
  • pytest ^5.2 develop
  • Flask ^2.0.2
  • cached-path ^1.0.2
  • html5lib ^1.1
  • lxml ^4.7.1
  • memory-tempfile ^2.2.3
  • nltk ^3.6.7
  • pandas ^1.1.5
  • python ^3.6.1
  • storable ^1.2.4
  • typer ^0.4.0
  • warcio ^1.7.4
setup.py pypi