https://github.com/citiususc/pyplexity
Cleaning tool for web scraped text
Science Score: 13.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file: found codemeta.json file
- ○ .zenodo.json file
- ○ DOI references
- ○ Academic publication links
- ○ Committers with academic emails
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (17.9%) to scientific vocabulary
Statistics
- Stars: 38
- Watchers: 3
- Forks: 3
- Open Issues: 2
- Releases: 0
Metadata Files
README.md
PyPlexity
This package provides a simple interface to apply perplexity filters to any document. A possible use case for this technology could be the removal of boilerplate (sentences with a high perplexity score). Furthermore, it provides a rough HTML tag cleaner and a WARC and HTML bulk processor, with distributed capabilities.
Cite
If you use this tool, please cite:
Fernández-Pichel, M., Prada-Corral, M., Losada, D. E., Pichel, J. C., & Gamallo, P. (2023). An unsupervised perplexity-based method for boilerplate removal. Natural Language Engineering, 1-18.
Models
English language
Memory intensive but does not scale on CPU.

| Model | RAM usage | Download size |
| --- | --- | --- |
| bigrams-cord19 | 2GB | 230MB |
| bigrams-bnc | 5GB | 660MB |
| trigrams-cord19 | 6.6GB | 1GB |
| trigrams-bnc | 14GB | 2.2GB |
Two different datasets were selected to build the background language model (LM): CORD-19 dataset [1] and the British National Corpus (BNC) [2].
[1] Wang, L. L., Lo, K., Chandrasekhar, Y., Reas, R., Yang, J., Eide, D., ... & Kohlmeier, S. (2020). Cord-19: The covid-19 open research dataset. ArXiv.
[2] BNC Consortium. (2007). British national corpus. Oxford Text Archive Core Collection.
Galician language
COMING SOON. Support for minority languages, Nós project. You can download the model from here.
Build and use custom models
If you want to build your own models, you can check it here. You can also use the parameter --model PATH to load local models.
Installation process
This package is available from the PyPI repository and can be installed with pip:
pip install pyplexity
Examples of usage options
Compute perplexity from console
Command "perplexity". This first command computes the perplexity score of a given sentence according to a given distribution, in this case the background LM. The default model is bigrams-bnc; the argument "--model" selects a different one, for example "--model bigrams-cord19".
Documentation:
```
citius@pc:~$ pyplexity perplexity --help
Usage: pyplexity perplexity [OPTIONS] TEXT
Arguments: TEXT [required]
Options:
--model TEXT [default: bigrams-bnc]
--help Show this message and exit.
```
By default, models are stored in ~/.cache/cached_path/, as per the cached-path package documentation.

Example:
```
citius@pc:~$ pyplexity perplexity "this is normal text"
downloading: 100%|##########| 660M/660M [00:11<00:00, 59.0MiB/s]
Loading model... Done.
1844.85540669094
citius@pc:~$ pyplexity perplexity "this is normal HTML PAGE BOI%& 678346 NOR text"
Loading model... Done.
44787.99199563819
```
As can be seen, malformed sentences obtain a higher value.
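To make the idea concrete, the following is a minimal, self-contained sketch of how a perplexity score can be derived from a bigram language model. It is not pyplexity's implementation: the toy corpus, the add-one smoothing, the whitespace tokenisation, and the function names `train_bigram_counts` and `bigram_perplexity` are all assumptions made for illustration. The point is only that token sequences the model has rarely or never seen lower the average log-probability and therefore raise the perplexity.

```python
import math
from collections import Counter


def train_bigram_counts(corpus):
    """Count unigrams and bigrams over whitespace-tokenised sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in corpus:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def bigram_perplexity(sentence, unigrams, bigrams):
    """Perplexity of a sentence under an add-one smoothed bigram model."""
    tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
    vocab_size = len(unigrams)
    log_prob = 0.0
    for prev, word in zip(tokens, tokens[1:]):
        # Add-one (Laplace) smoothing: unseen bigrams get a small, non-zero probability.
        prob = (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)
        log_prob += math.log(prob)
    n_transitions = len(tokens) - 1
    # Perplexity is the exponential of the negative mean log-probability.
    return math.exp(-log_prob / n_transitions)


# Toy background corpus (an assumption; the real models are built from CORD-19 or the BNC).
corpus = ["this is normal text", "this is a normal sentence", "the text is normal"]
unigrams, bigrams = train_bigram_counts(corpus)

clean = bigram_perplexity("this is normal text", unigrams, bigrams)
noisy = bigram_perplexity("this is normal HTML PAGE BOI%& 678346 NOR text", unigrams, bigrams)
print(clean < noisy)
```

The absolute values from such a tiny corpus are not comparable to the scores printed by the command above; only the ordering between the well-formed and the malformed sentence carries over.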
Bulk perplexity computation and cleaning of a directory
The previous command was a toy example; in real applications, we usually want to score complete datasets in order to clean them up. This is where the bulk-perplexity command, which supports directories of WARC or HTML files, comes in.
Documentation:
```
citius@pc:~$ pyplexity bulk-perplexity --help
Usage: pyplexity bulk-perplexity [OPTIONS] INPUT_DIR
Arguments: INPUT_DIR [required]
Options:
--output-dir TEXT [default: outdir]
--model TEXT [default: bigrams-bnc]
--perpl-limit FLOAT [default: 8000.0]
--warc-input / --no-warc-input [default: no-warc-input]
Distributed computing options:
--distributed / --no-distributed [default: no-distributed]
--n-workers INTEGER [default: 1]
--node INTEGER [default: 1]
--port INTEGER [default: 8866]
--help Show this message and exit.
```
We will explain the distributed computing capabilities later. The input directory may contain nested subdirectories with files. WARC containers and HTML files should have been previously tag-cleaned with the command below.

Example:
```
citius@pc:~$ pyplexity bulk-perplexity ./outdir/ --output-dir cleaned_files --model bigrams-cord19
downloading: 100%|##########| 233M/233M [00:03<00:00, 63.3MiB/s]
Loading model... Done.
Computed 1124 files in 0:00:01.905390.
```
NOTE: In this new version, we do not remove the malformed sentences. We just tag them with ppl, giving more control to the end-users.
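As a rough illustration of tagging rather than deleting, the sketch below splits a text into sentences, scores each one, and marks those above a limit. Everything here is hypothetical: `tag_sentences` and `toy_score` are illustrative names, the regular-expression sentence splitter is a simplification, and `toy_score` is a stand-in heuristic, not a language model. The output format written by bulk-perplexity is not reproduced.

```python
import re
from typing import Callable, List, Tuple


def tag_sentences(
    text: str, score_fn: Callable[[str], float], limit: float = 8000.0
) -> List[Tuple[str, float, bool]]:
    """Return (sentence, score, flagged) triples; flagged means score > limit."""
    # Naive splitter: break after '.', '!' or '?' followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    tagged = []
    for sentence in sentences:
        score = score_fn(sentence)
        tagged.append((sentence, score, score > limit))
    return tagged


def toy_score(sentence: str) -> float:
    """Stand-in scorer (NOT a language model): grows with the share of all-caps tokens."""
    tokens = sentence.split()
    shouting = sum(1 for t in tokens if t.isupper())
    return 1000.0 + 20000.0 * shouting / max(len(tokens), 1)


text = (
    "This is a normal sentence. Meanwhile, hjldfuia HTML BODY this one "
    "will be deleted LINK URL COUISUDOANLHJWQKEJK"
)
tagged = tag_sentences(text, toy_score, limit=8000.0)
for sentence, score, flagged in tagged:
    print(f"{'FLAG' if flagged else 'KEEP'} {score:10.1f} {sentence}")
```

Keeping the score next to each sentence lets a downstream user pick a different threshold later without recomputing anything.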
Perform HTML tag cleaning of a directory
Our method does not remove HTML tags by default, which can degrade its overall performance. We therefore recommend removing HTML tags first, and the package offers a command for this.
Documentation:
```
citius@pc:~$ pyplexity tag-remover --help
Usage: pyplexity tag-remover [OPTIONS] BASE_DIR
Arguments: BASE_DIR [required]
Options:
--output-dir TEXT [default: out_dir]
--warc-input / --no-warc-input [default: no-warc-input]
Distributed computing options:
--distributed / --no-distributed [default: no-distributed]
--n-workers INTEGER [default: 1]
--node INTEGER [default: 1]
--port INTEGER [default: 8866]
--help Show this message and exit.
```
We will explain the distributed computing capabilities later. The input directory may contain nested subdirectories with files. The command can process HTML files or WARC files. For WARC input, it recompresses the WARC efficiently after stripping out all the tags.

Example:
```
citius@pc:~$ pyplexity tag-remover ./html_source --output-dir ./output
Computed 1124 files in 0:00:00.543175.
```
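For readers who want to see what a rough tag cleaner does, here is a minimal sketch built only on Python's standard `html.parser` module. It is an assumption-laden stand-in, not the code behind tag-remover: the class name `TextExtractor`, the helper `strip_tags`, and the choice to drop `<script>` and `<style>` contents are all illustrative.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside skipped elements.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return "\n".join(self._chunks)


def strip_tags(html: str) -> str:
    """Return the visible text of an HTML string, one text node per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


sample = (
    "<html><head><style>p {color: red}</style></head><body>"
    "<h1>Title</h1><p>Some <b>bold</b> text.</p>"
    "<script>var x = 1;</script></body></html>"
)
plain = strip_tags(sample)
print(plain)
```

Emitting one text node per line is a deliberate simplification; a production cleaner would also need to handle malformed markup, encodings, and block versus inline elements.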
Parallel mode (cluster)
The documentation above shows that our commands have integrated distributed computing capabilities. In cluster mode, all the nodes must be interconnected in a local network and have access to the same files, mounted via SSHFS or another shared filesystem. A master node recursively loads the folder of files to be processed with the command:
pyplexity fileserver /mnt/input_dir --port 8866
Now, clients from the nodes will connect to the master node asking for file names to be processed. This mechanism allows for load distribution, as clients are able to ask for files in queue for processing from the master. For example, from a node:
pyplexity bulk-perplexity /mnt/input_dir --output-dir /mnt/output_dir --warc-input --distributed --n-workers 10 --node 2 --url master.local --port 8866
That command should be executed on every machine of the cluster. The node argument identifies the machine for logging purposes and has no functional relevance. The n-workers argument controls the number of worker threads per machine that will query the master node for files concurrently. When the master server has served all the files, the worker processes shut down accordingly. In our experiments, we used this feature to run the HTML tag removal and perplexity computation in 20 threads * 15 machines.
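The pull-based load distribution described above can be sketched in a single process with a shared queue and a pool of threads. This is only an analogy under stated assumptions: the function name `process_all` and the `handle_file` callback are hypothetical, and the network protocol between the pyplexity fileserver and its clients is not reproduced here.

```python
import queue
import threading


def process_all(file_names, handle_file, n_workers=4):
    """Process every file name once, with n_workers threads pulling from one queue."""
    pending = queue.Queue()
    for name in file_names:
        pending.put(name)

    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            try:
                # Each worker asks for the next file; faster workers simply ask more often.
                name = pending.get_nowait()
            except queue.Empty:
                return  # Nothing left to serve: the worker shuts down.
            outcome = handle_file(name)
            with lock:
                results[name] = outcome

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


# Hypothetical file names; len() stands in for real per-file work.
files = [f"doc_{i}.html" for i in range(10)]
done = process_all(files, handle_file=len, n_workers=3)
print(len(done), "files processed")
```

Because workers pull work instead of receiving a fixed share up front, a slow file or a slow machine does not hold up the rest of the queue.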
Interfacing from Python
We also offer the possibility of utilising pyplexity from Python code. As an example, we provide an API that serves a web app to make some small tests on how to directly clean texts or raw files.
Example: computing the perplexity score for a sentence:
```
from pyplexity import PerplexityModel

# Load a background language model by name (downloaded and cached on first use).
model = PerplexityModel.from_str("bigrams-cord19")
perpl = model.compute_sentence("this is normal text")
```
Example 2: Cleaning sentences from a text:
```
from pyplexity import PerplexityModel, PerplexityProcessor

model = PerplexityModel.from_str("bigrams-cord19")
# Sentences scoring above perpl_limit are treated as boilerplate.
text_processor = PerplexityProcessor(perpl_model=model, perpl_limit=8000.0)
clean_text = text_processor.process("This is a normal sentence. Meanwhile, hjldfuia HTML BODY this one will be deleted LINK URL COUISUDOANLHJWQKEJK")
```
Example 3: Removing HTML tags from a website:
```
import requests
from pyplexity.tag_remover import HTMLTagRemover

# Fetch a page and strip its HTML tags.
html = requests.get("https://example.com").text
text = HTMLTagRemover().process(html)
```
Web Demo
We also provide a web demo as a simple example of the power of our tool.
Building the package
If you are interested, you can also build the same package version that is currently deployed in the PyPI repository.
git clone https://github.com/citiususc/pyplexity && cd pyplexity
curl -sSL https://install.python-poetry.org | python3 -
source $HOME/.poetry/env
poetry build
pip3 install dist/pyplexity-X.X.X-py3-none-any.whl
General Advice
As you may have noticed, this is an unsupervised method that requires choosing a model and a perplexity threshold. In our experiments, using the bigrams-bnc model and removing sentences with a perplexity above 8000 was a robust strategy for both an IR search task and a text classification task.
Owner
- Name: CiTIUS
- Login: citiususc
- Kind: organization
- Email: citius@usc.es
- Location: Santiago de Compostela
- Website: https://citius.gal
- Twitter: citiususc
- Repositories: 49
- Profile: https://github.com/citiususc
Centro Singular de Investigación en Tecnoloxías Intelixentes da Universidade de Santiago de Compostela
Committers
Last synced: almost 3 years ago
All Time
- Total Commits: 25
- Total Committers: 6
- Avg Commits per committer: 4.167
- Development Distribution Score (DDS): 0.68
Top Committers
| Name | Email | Commits |
|---|---|---|
| Manuel de Prada Corral | eu@m****m | 8 |
| MarcosFP97 | 4****7@u****m | 6 |
| Manuel de Prada | 6****a@u****m | 6 |
| Marcos Fernández Pichel | m****l@u****s | 3 |
| dependabot[bot] | 4****]@u****m | 1 |
| Manuel de Prada Corral | m****a@g****m | 1 |
Issues and Pull Requests
Last synced: 8 months ago
All Time
- Total issues: 2
- Total pull requests: 2
- Average time to close issues: N/A
- Average time to close pull requests: 1 minute
- Total issue authors: 2
- Total pull request authors: 2
- Average comments per issue: 0.0
- Average comments per pull request: 0.0
- Merged pull requests: 2
- Bot issues: 0
- Bot pull requests: 1
Past Year
- Issues: 1
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 1
- Pull request authors: 0
- Average comments per issue: 0.0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- Xiaoshu-Zhao (1)
- johnnyjana730 (1)
Pull Request Authors
- MarcosFP97 (1)
- dependabot[bot] (1)
Packages
- Total packages: 1
- Total downloads:
  - pypi: 81,196 last month
- Total dependent packages: 0
- Total dependent repositories: 1
- Total versions: 18
- Total maintainers: 2
pypi.org: pyplexity
Perplexity filter for documents and bulk HTML and WARC boilerplate removal.
- Homepage: https://github.com/citiususc/pyplexity
- Documentation: https://pyplexity.readthedocs.io/
- License: GPL-3.0-only
- Latest release: 0.2.12 (published over 2 years ago)
Dependencies
- pytest ^5.2 develop
- Flask ^2.0.2
- cached-path ^1.0.2
- html5lib ^1.1
- lxml ^4.7.1
- memory-tempfile ^2.2.3
- nltk ^3.6.7
- pandas ^1.1.5
- python ^3.6.1
- storable ^1.2.4
- typer ^0.4.0
- warcio ^1.7.4