Science Score: 67.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 3 DOI reference(s) in README
  • Academic publication links
    Links to: zenodo.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (9.1%) to scientific vocabulary

Repository

Basic Info
  • Host: GitHub
  • Owner: stefantaubert
  • License: mit
  • Language: Python
  • Default Branch: master
  • Size: 634 KB
Statistics
  • Stars: 0
  • Watchers: 1
  • Forks: 0
  • Open Issues: 0
  • Releases: 3
Created almost 5 years ago · Last pushed about 2 years ago
Metadata Files
Readme License Citation

README.md

text-selection


Command-line interface (CLI) to select lines of a text file.

Features

  • dataset
    • create: create a dataset based on a text file
    • export-statistics: export statistics to a CSV file
  • subsets
    • add: add subsets
    • remove: remove subsets
    • rename: rename subset
    • select-all: select all lines
    • select-fifo: select lines FIFO-style
    • select-greedily: select lines greedily regarding units
    • select-greedily-ep: select lines greedily regarding units (epoch-based)
    • select-uniformly: select lines with units uniformly distributed
    • select-randomly: select lines randomly
    • filter-duplicates: filter duplicate lines
    • filter-by-regex: filter lines by regex
    • filter-by-text: filter lines by text
    • filter-by-weight: filter lines by weight
    • filter-by-vocabulary: filter lines by unit vocabulary
    • filter-by-count: filter lines by global unit frequencies
    • filter-by-unit-freq: filter lines by unit frequencies per line
    • filter-by-line-nr: filter lines by line number
    • sort-by-line-nr: sort lines by line number
    • sort-by-text: sort lines by text
    • sort-by-weight: sort lines by weights
    • sort-by-shuffle: shuffle lines
    • reverse: reverse lines
    • export: export lines
  • weights
    • create-from-file: create weights from file
    • create-uniform: create uniform weights
    • create-from-count: create weights from unit count
    • divide: divide weights
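
To illustrate what "select lines greedily regarding units" can mean, here is a minimal Python sketch. It is not the code used by text-selection; the function name `select_greedily`, the `limit` parameter, and the choice of whitespace-separated words as units are assumptions made for this example only.

```python
from typing import Dict, List, Set


def select_greedily(lines: List[str], limit: int) -> List[int]:
    """Return indices of up to `limit` lines, each chosen to add the most new units."""
    # Assumption: a "unit" is a whitespace-separated word.
    units_per_line: Dict[int, Set[str]] = {
        i: set(line.split()) for i, line in enumerate(lines)
    }
    covered: Set[str] = set()
    selected: List[int] = []
    remaining = set(units_per_line)
    while remaining and len(selected) < limit:
        # Pick the line that contributes the most uncovered units;
        # ties are resolved in favour of the lowest line index.
        best = max(sorted(remaining), key=lambda i: len(units_per_line[i] - covered))
        if not units_per_line[best] - covered:
            break  # no remaining line adds a new unit
        selected.append(best)
        covered |= units_per_line[best]
        remaining.remove(best)
    return selected


example_lines = ["a b", "a b c d", "e", "c d"]
print(select_greedily(example_lines, limit=2))  # prints [1, 2]
```

Line 1 is picked first because it covers four units, then line 2 because it is the only line that still adds a new unit.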

Roadmap

  • add tests
  • refactoring
  • outsourcing greedy- and KLD-iterator

Installation

```sh
pip install text-selection --user
```

Usage

```txt
usage: text-selection-cli [-h] [-v] {dataset,subsets,weights} ...

CLI to select lines of a text file.

positional arguments:
  {dataset,subsets,weights}
                        description
    dataset             dataset commands
    subsets             subsets commands
    weights             weights commands

optional arguments:
  -h, --help            show this help message and exit
  -v, --version         show program's version number and exit
```

Dependencies

  • tqdm
  • numpy
  • scipy
  • pandas
  • ordered_set>=4.1.0

Contributing

If you notice an error, please don't hesitate to open an issue.

Development setup

```sh
# update
sudo apt update
# install Python 3.8, 3.9, 3.10 & 3.11 for ensuring that tests can be run
sudo apt install python3-pip \
  python3.8 python3.8-dev python3.8-distutils python3.8-venv \
  python3.9 python3.9-dev python3.9-distutils python3.9-venv \
  python3.10 python3.10-dev python3.10-distutils python3.10-venv \
  python3.11 python3.11-dev python3.11-distutils python3.11-venv
# install pipenv for creation of virtual environments
python3.8 -m pip install pipenv --user

# check out repo
git clone https://github.com/stefantaubert/text-selection.git
cd text-selection
# create virtual environment
python3.8 -m pipenv install --dev
```

Running the tests

```sh
# first install the tool like in "Development setup"
# then, navigate into the directory of the repo (if not already done)
cd text-selection
# activate environment
python3.8 -m pipenv shell
# run tests
tox
```

Final lines of test result output:

```log
py38: commands succeeded
py39: commands succeeded
py310: commands succeeded
py311: commands succeeded
congratulations :)
```

License

MIT License

Acknowledgments

Funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) – Project-ID 416228727 – CRC 1410

Citation

If you want to cite this repo, you can use the BibTeX entry generated by GitHub (see About => Cite this repository).

Changelog

  • v0.0.3 (2023-05-30)
    • Changed
      • Improved speed for filtering OOV/IV words by up to ~20k words/s
    • Added
      • Added subsets select-randomly
      • Added subsets sort-by-shuffle
      • Added subsets add option --skip-existing
    • Bugfix
      • Fixed evaluation of "from subsets" to ensure that the subsets exist
      • Fixed subsets remove, which did not work
  • v0.0.2 (2023-01-13)
    • Added
      • Added creation of weights from lines
      • Add --limit to select duplicates
      • Add exit code
    • Changed
      • Set --limit positional where applicable
      • Don't output expected warning from numpy on KLD selection
      • Bugfixes
  • v0.0.1 (2022-05-25)
    • Initial release

Owner

  • Name: Stefan Taubert
  • Login: stefantaubert
  • Kind: user
  • Location: Chemnitz, Germany
  • Company: Chemnitz University of Technology

I am currently working on my PhD on speech synthesis at Chemnitz University of Technology.

Citation (CITATION.cff)

cff-version: 1.2.0
title: text-selection
abstract: Command-line interface (CLI) to select lines of a text file.
message: >-
  If you use this software, please cite it using the
  metadata from this file.
type: software
authors:
  - email: github@stefantaubert.com
    given-names: Stefan
    family-names: Taubert
    affiliation: Chemnitz University of Technology
    orcid: 'https://orcid.org/0000-0002-4932-2874'
    website: 'https://stefantaubert.com/'
version: 0.0.3
date-released: 2023-05-30
license: MIT
url: https://github.com/stefantaubert/text-selection
doi: 10.5281/zenodo.7984739
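
The metadata above can also be turned into a BibTeX entry by hand. The following Python sketch shows one possible mapping; the helper `cff_to_bibtex`, the citation key, and the choice of the `@software` entry type are assumptions for illustration, and the entry GitHub generates may be formatted differently.

```python
def cff_to_bibtex(meta: dict, key: str) -> str:
    """Format a BibTeX @software entry from CITATION.cff-style metadata."""
    # Authors are joined in the "Family, Given and Family, Given" form BibTeX expects.
    authors = " and ".join(
        f"{a['family-names']}, {a['given-names']}" for a in meta["authors"]
    )
    fields = {
        "author": authors,
        "title": meta["title"],
        "version": meta["version"],
        "year": meta["date-released"][:4],  # year taken from the ISO release date
        "doi": meta["doi"],
        "url": meta["url"],
    }
    body = ",\n".join(f"  {name} = {{{value}}}" for name, value in fields.items())
    return f"@software{{{key},\n{body}\n}}"


# Values copied from the CITATION.cff shown above.
meta = {
    "title": "text-selection",
    "authors": [{"given-names": "Stefan", "family-names": "Taubert"}],
    "version": "0.0.3",
    "date-released": "2023-05-30",
    "doi": "10.5281/zenodo.7984739",
    "url": "https://github.com/stefantaubert/text-selection",
}

# The citation key is a hypothetical choice.
print(cff_to_bibtex(meta, key="taubert_text_selection"))
```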

Issues and Pull Requests

Last synced: 7 months ago

All Time
  • Total issues: 0
  • Total pull requests: 1
  • Average time to close issues: N/A
  • Average time to close pull requests: about 18 hours
  • Total issue authors: 0
  • Total pull request authors: 1
  • Average comments per issue: 0
  • Average comments per pull request: 0.0
  • Merged pull requests: 1
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
Pull Request Authors
  • jasminsternkopf (1)
Top Labels
Issue Labels
Pull Request Labels

Packages

  • Total packages: 1
  • Total downloads:
    • pypi 13 last-month
  • Total dependent packages: 0
  • Total dependent repositories: 1
  • Total versions: 3
  • Total maintainers: 1
pypi.org: text-selection

Command-line interface (CLI) to select lines of a text file.

  • Versions: 3
  • Dependent Packages: 0
  • Dependent Repositories: 1
  • Downloads: 13 Last month
Rankings
Dependent packages count: 9.8%
Dependent repos count: 21.8%
Forks count: 29.9%
Average: 31.8%
Stargazers count: 38.9%
Downloads: 58.4%
Maintainers (1)

Dependencies

Pipfile pypi
  • autoflake * develop
  • autopep8 * develop
  • isort * develop
  • pycodestyle * develop
  • pylint * develop
  • pytest * develop
  • rope * develop
  • twine * develop
  • txt-selection * develop
  • numpy *
  • ordered-set >=4.1.0
  • pandas *
  • scipy *
  • tqdm *
Pipfile.lock pypi
  • astroid ==2.11.5 develop
  • attrs ==21.4.0 develop
  • autoflake ==1.4 develop
  • autopep8 ==1.6.0 develop
  • bleach ==5.0.0 develop
  • certifi ==2022.5.18.1 develop
  • cffi ==1.15.0 develop
  • charset-normalizer ==2.0.12 develop
  • commonmark ==0.9.1 develop
  • cryptography ==37.0.2 develop
  • dill ==0.3.5.1 develop
  • docutils ==0.18.1 develop
  • idna ==3.3 develop
  • importlib-metadata ==4.11.4 develop
  • iniconfig ==1.1.1 develop
  • isort ==5.10.1 develop
  • jeepney ==0.8.0 develop
  • keyring ==23.5.1 develop
  • lazy-object-proxy ==1.7.1 develop
  • mccabe ==0.7.0 develop
  • numpy ==1.22.4 develop
  • ordered-set ==4.1.0 develop
  • packaging ==21.3 develop
  • pandas ==1.4.2 develop
  • pkginfo ==1.8.2 develop
  • platformdirs ==2.5.2 develop
  • pluggy ==1.0.0 develop
  • py ==1.11.0 develop
  • pycodestyle ==2.8.0 develop
  • pycparser ==2.21 develop
  • pyflakes ==2.4.0 develop
  • pygments ==2.12.0 develop
  • pylint ==2.13.9 develop
  • pyparsing ==3.0.9 develop
  • pytest ==7.1.2 develop
  • python-dateutil ==2.8.2 develop
  • pytz ==2022.1 develop
  • readme-renderer ==35.0 develop
  • requests ==2.27.1 develop
  • requests-toolbelt ==0.9.1 develop
  • rfc3986 ==2.0.0 develop
  • rich ==12.4.4 develop
  • rope ==1.1.1 develop
  • scipy ==1.8.1 develop
  • secretstorage ==3.3.2 develop
  • setuptools ==62.3.2 develop
  • six ==1.16.0 develop
  • text-selection * develop
  • toml ==0.10.2 develop
  • tomli ==2.0.1 develop
  • tqdm ==4.64.0 develop
  • twine ==4.0.0 develop
  • txt-selection * develop
  • typing-extensions ==4.2.0 develop
  • urllib3 ==1.26.9 develop
  • webencodings ==0.5.1 develop
  • wrapt ==1.14.1 develop
  • zipp ==3.8.0 develop
  • numpy ==1.22.4
  • ordered-set ==4.1.0
  • pandas ==1.4.2
  • python-dateutil ==2.8.2
  • pytz ==2022.1
  • scipy ==1.8.1
  • six ==1.16.0
  • tqdm ==4.64.0
pyproject.toml pypi
  • numpy *
  • ordered_set >=4.1.0
  • pandas *
  • scipy *
  • tqdm *