text-selection

https://github.com/stefantaubert/text-selection

Science Score: 67.0%

This score indicates how likely this project is to be science-related based on various indicators:

- ✓ CITATION.cff file: Found CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ✓ DOI references: Found 3 DOI reference(s) in README
- ✓ Academic publication links: Links to zenodo.org
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (9.1%) to scientific vocabulary

Last synced: 10 months ago

Repository

Basic Info

Host: GitHub
Owner: stefantaubert
License: mit
Language: Python
Default Branch: master
Size: 634 KB

Statistics

Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Releases: 3

Created about 5 years ago · Last pushed over 2 years ago

Metadata Files

Readme License Citation

text-selection

Command-line interface (CLI) to select lines of a text file.

Features

dataset
- create: create a dataset based on a text file
- export-statistics: export statistics to a CSV file
subsets
- add: add subsets
- remove: remove subsets
- rename: rename subset
- select-all: select all lines
- select-fifo: select lines FIFO-style
- select-greedily: select lines greedily regarding units
- select-greedily-ep: select lines greedily regarding units (epoch-based)
- select-uniformly: select lines with units uniformly distributed
- select-randomly: select lines randomly
- filter-duplicates: filter duplicate lines
- filter-by-regex: filter lines by regex
- filter-by-text: filter lines by text
- filter-by-weight: filter lines by weight
- filter-by-vocabulary: filter lines by unit vocabulary
- filter-by-count: filter lines by global unit frequencies
- filter-by-unit-freq: filter lines by unit frequencies per line
- filter-by-line-nr: filter lines by line number
- sort-by-line-nr: sort lines by line number
- sort-by-text: sort lines by text
- sort-by-weight: sort lines by weights
- sort-by-shuffle: shuffle lines
- reverse: reverse lines
- export: export lines
weights
- create-from-file: create weights from file
- create-uniform: create uniform weights
- create-from-count: create weights from unit count
- divide: divide weights
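
To make the idea behind select-greedily ("select lines greedily regarding units") more concrete, here is a minimal Python sketch. It is not taken from the tool's source code: the function name `select_greedily`, the splitting of lines into units on whitespace, and the tie-breaking rule are all assumptions made for illustration. The sketch repeatedly picks the line that contributes the most units not yet covered by the selection.

```python
from typing import List, Set


def select_greedily(lines: List[str], limit: int) -> List[int]:
    """Return indices of up to `limit` lines, chosen greedily by new-unit coverage.

    Hypothetical helper for illustration only; units are assumed to be
    whitespace-separated tokens.
    """
    units_per_line = [set(line.split()) for line in lines]
    covered: Set[str] = set()
    remaining = list(range(len(lines)))
    selected: List[int] = []

    while remaining and len(selected) < limit:
        # Pick the line that adds the most uncovered units;
        # ties are resolved in favour of the earliest line.
        best = max(remaining, key=lambda i: (len(units_per_line[i] - covered), -i))
        if not units_per_line[best] - covered:
            break  # no remaining line adds a new unit
        selected.append(best)
        covered |= units_per_line[best]
        remaining.remove(best)
    return selected


if __name__ == "__main__":
    example = ["a b", "a b c d", "e", "c d"]
    print(select_greedily(example, limit=3))  # prints [1, 2]
```

In the example, the second line covers four units at once and the third line adds the only remaining unit, so the selection stops after two lines even though the limit is three.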

Roadmap

- add tests
- refactor the code
- outsource the greedy and KLD iterators

Installation

pip install text-selection --user

Usage

```txt
usage: text-selection-cli [-h] [-v] {dataset,subsets,weights} ...

CLI to select lines of a text file.

positional arguments:
  {dataset,subsets,weights}
                        description
    dataset             dataset commands
    subsets             subsets commands
    weights             weights commands

optional arguments:
  -h, --help            show this help message and exit
  -v, --version         show program's version number and exit
```

Dependencies

tqdm
numpy
scipy
pandas
ordered_set>=4.1.0

Contributing

If you notice an error, please don't hesitate to open an issue.

Development setup

```sh
# update
sudo apt update

# install Python 3.8, 3.9, 3.10 & 3.11 for ensuring that tests can be run
sudo apt install python3-pip \
  python3.8 python3.8-dev python3.8-distutils python3.8-venv \
  python3.9 python3.9-dev python3.9-distutils python3.9-venv \
  python3.10 python3.10-dev python3.10-distutils python3.10-venv \
  python3.11 python3.11-dev python3.11-distutils python3.11-venv

# install pipenv for creation of virtual environments
python3.8 -m pip install pipenv --user

# check out repo
git clone https://github.com/stefantaubert/text-selection.git
cd text-selection

# create virtual environment
python3.8 -m pipenv install --dev
```

Running the tests

```sh
# first install the tool like in "Development setup"
# then, navigate into the directory of the repo (if not already done)
cd text-selection

# activate environment
python3.8 -m pipenv shell

# run tests
tox
```

Final lines of test result output:

py38: commands succeeded
py39: commands succeeded
py310: commands succeeded
py311: commands succeeded
congratulations :)

License

MIT License

Acknowledgments

Funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) – Project-ID 416228727 – CRC 1410

Citation

If you want to cite this repo, you can use the BibTeX entry generated by GitHub (see About => Cite this repository).

Changelog

v0.0.3 (2023-05-30)
- Changed
  - Improved speed for filtering OOV/IV words by up to ~20k words/s
- Added
  - Added subsets select-randomly
  - Added subsets sort-by-shuffle
  - Added subsets add option --skip-existing
- Bugfix
  - Fixed evaluation of "from subsets" to ensure that the subsets exist
  - Fixed subsets remove, which did not work
v0.0.2 (2023-01-13)
- Added
  - Added creation of weights from lines
  - Added --limit to select duplicates
  - Added exit code
- Changed
  - Set --limit positional where applicable
  - Don't output expected warning from numpy on KLD selection
- Bugfixes
v0.0.1 (2022-05-25)
- Initial release

Owner

Name: Stefan Taubert
Login: stefantaubert
Kind: user
Location: Chemnitz, Germany
Company: Chemnitz University of Technology

Website: https://stefantaubert.com
Twitter: Stefan_Taubert
Repositories: 75
Profile: https://github.com/stefantaubert

Currently I am working on my PhD about the topic of speech synthesis at Chemnitz University of Technology.

Citation (CITATION.cff)

cff-version: 1.2.0
title: text-selection
abstract: Command-line interface (CLI) to select lines of a text file.
message: >-
  If you use this software, please cite it using the
  metadata from this file.
type: software
authors:
  - email: github@stefantaubert.com
    given-names: Stefan
    family-names: Taubert
    affiliation: Chemnitz University of Technology
    orcid: 'https://orcid.org/0000-0002-4932-2874'
    website: 'https://stefantaubert.com/'
version: 0.0.3
date-released: 2023-05-30
license: MIT
url: https://github.com/stefantaubert/text-selection
doi: 10.5281/zenodo.7984739

Issues and Pull Requests

Last synced: 10 months ago

All Time

Total issues: 0
Total pull requests: 1
Average time to close issues: N/A
Average time to close pull requests: about 18 hours
Total issue authors: 0
Total pull request authors: 1
Average comments per issue: 0
Average comments per pull request: 0.0
Merged pull requests: 1
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 0
Pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Issue authors: 0
Pull request authors: 0
Average comments per issue: 0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Top Authors

Issue Authors

Pull Request Authors

jasminsternkopf (1)

Packages

Total packages: 1
Total downloads:
- pypi: 13 last month

Total dependent packages: 0
Total dependent repositories: 1
Total versions: 3
Total maintainers: 1

pypi.org: text-selection

Command-line interface (CLI) to select lines of a text file.

Homepage: https://github.com/stefantaubert/text-selection
Documentation: https://text-selection.readthedocs.io/
License: MIT
Latest release: 0.0.3 (published about 3 years ago)

Versions: 3
Dependent Packages: 0
Dependent Repositories: 1
Downloads: 13 Last month

Rankings

Dependent packages count: 9.8%

Dependent repos count: 21.8%

Forks count: 29.9%

Average: 31.8%

Stargazers count: 38.9%

Downloads: 58.4%

Maintainers (1)

stefantaubert

Last synced: 10 months ago

Dependencies

Pipfile pypi

autoflake * develop
autopep8 * develop
isort * develop
pycodestyle * develop
pylint * develop
pytest * develop
rope * develop
twine * develop
txt-selection * develop
numpy *
ordered-set >=4.1.0
pandas *
scipy *
tqdm *

Pipfile.lock pypi

astroid ==2.11.5 develop
attrs ==21.4.0 develop
autoflake ==1.4 develop
autopep8 ==1.6.0 develop
bleach ==5.0.0 develop
certifi ==2022.5.18.1 develop
cffi ==1.15.0 develop
charset-normalizer ==2.0.12 develop
commonmark ==0.9.1 develop
cryptography ==37.0.2 develop
dill ==0.3.5.1 develop
docutils ==0.18.1 develop
idna ==3.3 develop
importlib-metadata ==4.11.4 develop
iniconfig ==1.1.1 develop
isort ==5.10.1 develop
jeepney ==0.8.0 develop
keyring ==23.5.1 develop
lazy-object-proxy ==1.7.1 develop
mccabe ==0.7.0 develop
numpy ==1.22.4 develop
ordered-set ==4.1.0 develop
packaging ==21.3 develop
pandas ==1.4.2 develop
pkginfo ==1.8.2 develop
platformdirs ==2.5.2 develop
pluggy ==1.0.0 develop
py ==1.11.0 develop
pycodestyle ==2.8.0 develop
pycparser ==2.21 develop
pyflakes ==2.4.0 develop
pygments ==2.12.0 develop
pylint ==2.13.9 develop
pyparsing ==3.0.9 develop
pytest ==7.1.2 develop
python-dateutil ==2.8.2 develop
pytz ==2022.1 develop
readme-renderer ==35.0 develop
requests ==2.27.1 develop
requests-toolbelt ==0.9.1 develop
rfc3986 ==2.0.0 develop
rich ==12.4.4 develop
rope ==1.1.1 develop
scipy ==1.8.1 develop
secretstorage ==3.3.2 develop
setuptools ==62.3.2 develop
six ==1.16.0 develop
text-selection * develop
toml ==0.10.2 develop
tomli ==2.0.1 develop
tqdm ==4.64.0 develop
twine ==4.0.0 develop
txt-selection * develop
typing-extensions ==4.2.0 develop
urllib3 ==1.26.9 develop
webencodings ==0.5.1 develop
wrapt ==1.14.1 develop
zipp ==3.8.0 develop
numpy ==1.22.4
ordered-set ==4.1.0
pandas ==1.4.2
python-dateutil ==2.8.2
pytz ==2022.1
scipy ==1.8.1
six ==1.16.0
tqdm ==4.64.0

pyproject.toml pypi

numpy *
ordered_set >=4.1.0
pandas *
scipy *
tqdm *
