https://github.com/biodataanalysisgroup/kmeranalyzer

An alignment-free method capable of processing and counting k-mers in a reasonable time, while evaluating multiple values of the k parameter concurrently.

Keywords

feature-selection k-mers phylogenetics sars-cov-2 unsupervised-learning

Repository

Basic Info

Host: GitHub
Owner: BiodataAnalysisGroup
License: mit
Language: Python
Default Branch: master
Homepage:
Size: 4.64 MB

Statistics

Stars: 0
Watchers: 2
Forks: 1
Open Issues: 1
Releases: 0

Fork of togkousa/kmerAnalyzer

Created over 5 years ago · Last pushed over 4 years ago

https://github.com/BiodataAnalysisGroup/kmerAnalyzer/blob/master/

# A computational framework for pattern detection on unaligned sequences: An application on SARS-CoV-2 data

[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

## Overview
An alignment-free method capable of processing and counting k-mers in a reasonable time, while evaluating multiple values of the k parameter concurrently.
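
To make the idea concrete, the following is a minimal sketch of counting k-mers for several values of k in a single pass over the input. It is a naive illustration only: the function name `count_kmers` and its `kmin` parameter are hypothetical, and the repository's own implementation (which builds a k-mer tree and evaluates k-mers) may work quite differently.

```python
from collections import Counter


def count_kmers(sequences, kmax, kmin=1):
    """Count every k-mer of length kmin..kmax across all sequences.

    Hypothetical helper for illustration; not part of kmerAnalyzer.
    """
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for k in range(kmin, kmax + 1):
            # A sequence shorter than k contributes no k-mers of that length.
            for i in range(len(seq) - k + 1):
                counts[seq[i:i + k]] += 1
    return counts


counts = count_kmers(["ACGTAC", "ACG"], kmax=3)
print(counts["AC"], counts["ACG"])
```

Overlapping occurrences are counted, so a single `Counter` ends up holding the frequencies for every k between `kmin` and `kmax`.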

## Installation

### Python Version
kmerAnalyzer was originally implemented in **Python 2.7**, but it also appears to work well in **Python 3.8**.

### Required Python Packages
- [`pandas`](https://pandas.pydata.org/getting_started.html)
- [`numpy`](https://numpy.org/install/)

## Usage

### Input files
The application currently accepts only `.fasta` files as input.

### How to execute
1. Place exactly one `fasta` file inside the `data/` folder; it is used as the input to the k-mer analyzer toolkit.
2. Make sure the `Output/` folder is empty. Otherwise, the application will remove everything (files and subfolders) inside it. If the folder doesn't exist, it will be created automatically.
3. Set the `kmax` and `eval_factor` parameters in **lines 21-22** of the `featuresExtraction.py` script. The `eval_factor` parameter determines how strictly k-mers of each length are assessed. Recommended values for `eval_factor` lie in the interval [1, 2]; for best results, choose a value in [1.2, 1.5]. For values lower than 1, the application won't run properly.
4. Execute the Python script `featuresExtraction.py`.
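
The preconditions in steps 1-3 can be checked before running the script. The sketch below is a hypothetical pre-flight helper, not part of the repository: the function name `preflight` is invented here, and it assumes the input file carries the `.fasta` extension.

```python
from pathlib import Path


def preflight(project_dir, eval_factor):
    """Return a list of problems that would break the documented workflow.

    Hypothetical helper; assumes the input file ends in `.fasta`.
    """
    project = Path(project_dir)
    problems = []

    # Step 1: exactly one FASTA file must sit inside data/.
    data_dir = project / "data"
    fasta_files = sorted(data_dir.glob("*.fasta")) if data_dir.is_dir() else []
    if len(fasta_files) != 1:
        problems.append(
            f"expected exactly one .fasta file in data/, found {len(fasta_files)}"
        )

    # Step 2: anything left in Output/ would be removed by the application.
    output_dir = project / "Output"
    if output_dir.is_dir() and any(output_dir.iterdir()):
        problems.append("Output/ is not empty; its contents would be removed")

    # Step 3: values of eval_factor below 1 are documented as not working.
    if eval_factor < 1:
        problems.append("eval_factor must be at least 1")

    return problems
```

Calling `preflight(".", 1.3)` from the project root returns an empty list when all three conditions hold.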

### Output files
Assuming that the input file is called `filename.fasta`:

1. Inside the `Output/` directory there is a `.csv` file called `clustData.csv`, which is the resulting data matrix. Every sequence is represented by a number of k-mer-based features, and the value of each feature is the number of times the corresponding k-mer was detected in that sequence. There is also a CSV file called `heades_to_IDS.csv`, which maps the header of each sequence in the FASTA input to the code names ID-1, ID-2, etc.
2. Inside the `Output/filename/` sub-folder there are 3 `csv` files:
* File `output.csv` contains the list generated from the k-mer tree. Every row represents a k-mer. The first column is the k-mer itself, the second is its length, the third is its frequency (the number of times it was detected in the input data), and the fourth is its evaluation in the tree.
* The two remaining files record the sequences in which each k-mer appears and the number of times each k-mer appears in every sequence.
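
As a rough illustration of what the data matrix and the header mapping contain, the sketch below builds a per-sequence count matrix for a chosen set of k-mers and assigns the code names ID-1, ID-2, and so on. The function name `build_count_matrix` and the in-memory layout are assumptions made for this example; the exact column layout of `clustData.csv` is not specified here.

```python
from collections import Counter


def build_count_matrix(records, kmers):
    """Build k-mer count features for each sequence.

    records: list of (header, sequence) pairs.
    kmers:   list of k-mers used as features (hypothetical selection).
    Returns (header_ids, matrix), where matrix[i][j] is the number of
    times kmers[j] occurs, overlaps included, in the i-th sequence.
    """
    header_ids = []
    matrix = []
    lengths = sorted({len(kmer) for kmer in kmers})
    for index, (header, seq) in enumerate(records, start=1):
        # Map each FASTA header to a short code name.
        header_ids.append((header, f"ID-{index}"))
        counts = Counter()
        for k in lengths:
            for i in range(len(seq) - k + 1):
                counts[seq[i:i + k]] += 1
        matrix.append([counts[kmer] for kmer in kmers])
    return header_ids, matrix


records = [("seq one", "ACGACG"), ("seq two", "TTACG")]
header_ids, matrix = build_count_matrix(records, ["ACG", "TT", "GA"])
print(header_ids)
print(matrix)
```

Each row of `matrix` corresponds to one sequence and each column to one k-mer feature, which mirrors the description of the data matrix above.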

### Extra comments
- It's important to check the lengths of the sequences before executing kmerAnalyzer. For example, in the example dataset, the sequence with header `>ERR525627.984.1 984 length=31` has length 31, so it is probably better either to exclude this sequence (data filtering) or to examine lower k-values, e.g. up to 20. However, if we set `kmax = 35`, the code seems to work properly.
- While kmerAnalyzer is running, a folder called `input` is created inside the project directory, containing some files needed for the execution process. The folder is deleted at the end of the process.
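
The length check suggested in the first comment can be done with a few lines of Python. This is a minimal sketch with hypothetical helper names (`read_fasta`, `shorter_than`); it is not part of kmerAnalyzer and handles only plain FASTA text.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def shorter_than(records, kmax):
    """Return (header, length) for every sequence shorter than kmax."""
    return [(header, len(seq)) for header, seq in records if len(seq) < kmax]


example = [">read_a", "ACGTACGTAC", ">read_b", "ACGT", "ACG"]
short = shorter_than(read_fasta(example), kmax=8)
print(short)
```

To check a real input, pass an open file handle for the FASTA file in `data/` to `read_fasta` instead of the example list, then decide whether to filter the reported sequences or lower `kmax`.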

## Data availability

SARS-CoV-2 data have been downloaded from [NCBI SARS-CoV-2 Resources](https://www.ncbi.nlm.nih.gov/sars-cov-2/).

## Contributing

Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.

## License

This project is licensed under the [MIT](https://opensource.org/licenses/MIT) License; see the [LICENSE](LICENSE) file for details.
