Recent Releases of keggcharter
keggcharter - "--resume" now evaluates the files already produced
For both data_for_charting.tsv and taxon_to_mmap_to_orthologs.json:
* if the --resume parameter was used and the file is found, KEGGCharter won't generate it again.
* otherwise, KEGGCharter generates the data again and overwrites the file if it exists.
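The rule above can be sketched as a small helper. This is only an illustration: `should_generate` and its arguments are hypothetical names, not KEGGCharter's actual code.

```python
from pathlib import Path


def should_generate(path, resume):
    """Return True when the file must be (re)generated.

    Hypothetical helper: with resume set and the file present, the existing
    file is reused; in every other case the data is generated again
    (overwriting any existing file).
    """
    return not (resume and Path(path).is_file())
```

With `resume=False` the function always returns True, which matches the "overwrite the file if it exists" branch.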
Also, an important fix
When retrieving KEGG taxa prefixes, the check is now type(taxa) == str instead of taxa != np.nan, which is always true because NaN never compares equal to anything.
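A minimal sketch of why the old comparison could not detect missing values. The function names are hypothetical, and a plain float NaN stands in for `np.nan` (which is the same kind of float value).

```python
nan = float("nan")  # stand-in for np.nan, which is also a float NaN


def has_prefix_old(taxa):
    # Old style check: NaN compares unequal to everything, itself included,
    # so this is True even when the value is missing.
    return taxa != nan


def has_prefix_new(taxa):
    # New style check: only real strings pass.
    return type(taxa) == str


print(has_prefix_old(nan), has_prefix_new(nan))      # True False
print(has_prefix_old("eco"), has_prefix_new("eco"))  # True True
```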
- Python
Published by iquasere about 2 years ago
keggcharter - New needs, new regexes
Changed regex for EC numbers to account for provisional ECs
Changed ^(\d+)(\.(\d+|-)){3}$ to ^(\d+)(\.(\d+|-)){2}(\.(.*))?$, which accepts provisional EC numbers (e.g., 1.1.1.n1).
Changed regex for KEGG IDs to account for other taxonomy codes
Changed ^[A-Za-z]{3}:.+$ to ^[A-Za-z]+:.+$ to accept taxonomy codes that have fewer or more than three characters (e.g., pall:UYA_22060).
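The effect of both regex changes can be checked with Python's `re` module. The patterns are the ones quoted above. Of the sample values, 1.1.1.n1 and pall:UYA_22060 come from these notes, while the others are illustrative inputs.

```python
import re

OLD_EC = re.compile(r"^(\d+)(\.(\d+|-)){3}$")
NEW_EC = re.compile(r"^(\d+)(\.(\d+|-)){2}(\.(.*))?$")
OLD_KEGG = re.compile(r"^[A-Za-z]{3}:.+$")
NEW_KEGG = re.compile(r"^[A-Za-z]+:.+$")

# Each line prints: value, accepted by old pattern, accepted by new pattern.
for ec in ("1.1.1.1", "1.1.1.-", "1.1.1.n1"):
    print(ec, bool(OLD_EC.match(ec)), bool(NEW_EC.match(ec)))

for kegg_id in ("eco:b0001", "pall:UYA_22060"):
    print(kegg_id, bool(OLD_KEGG.match(kegg_id)), bool(NEW_KEGG.match(kegg_id)))
```

Note that the new EC pattern makes the fourth field optional, so it also accepts three-field values such as 1.1.1.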
Also, some bug fixes
- One of the weirdest bugs ever: pandas.DataFrame.groupby has a maximum number of columns (20).
- Fix on saving box2taxon when it is empty
- Also removed some code from the time when only one functional column was considered at a time
Published by iquasere about 2 years ago
keggcharter - Important fixes on ID cross-referencing, validation of functional ID columns and colormap picking
Validation of input data columns implemented
Four regexes will check if values in columns are valid.
- KEGG ID: ^[A-Za-z]{3}:.+$
- KO: ^K\d{5}$
- EC number: ^(\d+)(\.(\d+|-)){3}$
- COG: ^COG\d{4}$
A cell may hold several comma-separated values, but each value between commas must match the corresponding regex.
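A minimal sketch of this kind of check, using the four patterns listed above. The helper `cell_is_valid` and the sample values are hypothetical, not KEGGCharter's own code.

```python
import re

# Patterns as listed above for this release.
PATTERNS = {
    "KEGG ID": re.compile(r"^[A-Za-z]{3}:.+$"),
    "KO": re.compile(r"^K\d{5}$"),
    "EC number": re.compile(r"^(\d+)(\.(\d+|-)){3}$"),
    "COG": re.compile(r"^COG\d{4}$"),
}


def cell_is_valid(cell, kind):
    """Hypothetical helper: every comma-separated value must match the pattern."""
    return all(PATTERNS[kind].match(value) for value in cell.split(","))
```

For example, `cell_is_valid("K00001,K00002", "KO")` passes, while the same values separated by a semicolon or by a comma followed by a space do not.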
Also, several fixes
Fix on adding new ids from API
Merging new IDs with old ones was creating some disconnect between old and new columns, and the new IDs were being placed in new columns disconnected from the rest of the dataframe. It's fixed now.
Differential colormap starts at 0
Before, the colormap was being generated between the minimum and maximum values of the dataframe. Now, it begins at 0, up to the maximum of the dataframe.
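The difference can be illustrated with a plain linear scaling onto [0, 1], which is the position a value would take on a colormap. This is a sketch that assumes linear normalization; the function and the sample values are made up for illustration.

```python
def normalize(value, vmin, vmax):
    """Map value linearly onto [0, 1] between vmin and vmax."""
    return (value - vmin) / (vmax - vmin)


values = [5.0, 10.0, 20.0]  # made-up quantification values

# Old behaviour: scale between the minimum and maximum of the data.
old = [normalize(v, min(values), max(values)) for v in values]

# New behaviour: scale between 0 and the maximum of the data.
new = [normalize(v, 0.0, max(values)) for v in values]

print(new)  # [0.25, 0.5, 1.0]
```

Under the old scaling the smallest value always landed at the very bottom of the colormap (position 0.0), regardless of how large it was. Starting at 0 keeps positions proportional to the values.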
Implemented new parameter for choosing the colormap of differential maps
--differential-colormap allows choosing a new colormap instead of the default (summer). Valid values can be consulted in the matplotlib documentation.
Also, KEGGCharter now only creates output dirs when it passes input file validation
Published by iquasere about 2 years ago
keggcharter - Fix on having cog2ko available
This must be updated in the meta.yaml. The line changes from:
cp *.py resources/KEGGCharter_prokaryotic_maps.txt resources/cog2ko_keggcharter.tsv $PREFIX/share &&
to
cp *.py *.txt *.tsv $PREFIX/share &&
Published by iquasere about 2 years ago
keggcharter - Fix on checking for columns of functional IDs
KEGGCharter was only looking for KEGG IDs, KOs and EC numbers columns to check if some functional IDs column was inputted.
This would make it exit with error if only a column with COG IDs was inputted.
Now it also looks for COG columns, and accepts input with only a COG IDs column.
Also, still trying to understand why it doesn't find cog2ko.tsv.
Published by iquasere about 2 years ago
keggcharter - KEGGCharter as a proper tool of science
Implemented COG2KO
This idea belongs to Lovro Grum. For each KO, COGs are extracted from their KEGG HTML page. This information is reversed, and becomes COG to KO conversion.
New database, making KEGGCharter far more powerful! Makes for a great synergy with reCOGnizer.
Because this is web scraping, 403 (Forbidden) errors and timeouts may often occur.
KEGGCharter gives some time between failed tries, and at the end checks for any KOs whose HTMLs were not retrieved. It tries to retrieve those as well.
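The reversal step described above (KO to COGs becoming COG to KOs) can be sketched as follows. The function name and the identifiers in the example are made up, and the scraping itself is not shown.

```python
from collections import defaultdict


def invert_ko_to_cogs(ko_to_cogs):
    """Turn {KO: [COG, ...]} into {COG: [KO, ...]} (sorted for stable output)."""
    cog_to_kos = defaultdict(set)
    for ko, cogs in ko_to_cogs.items():
        for cog in cogs:
            cog_to_kos[cog].add(ko)
    return {cog: sorted(kos) for cog, kos in cog_to_kos.items()}


# Made-up identifiers, only to show the shape of the data.
example = {"K00001": ["COG0001", "COG0002"], "K00002": ["COG0002"]}
print(invert_ko_to_cogs(example))
```

A COG that appears on several KO pages ends up mapped to all of those KOs.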
Sanitization of input file
Checks if:
* inputted columns exist in the input file
* the --kegg-column, --ko-column, --ec-column and --cog-column columns have no invalid values / bad characters (" " and ";").
Added parameter for dividing quantification of each enzyme by the KOs assigned to it
When set, the --distribute-quantification parameter will instruct KEGGCharter to split the quantification of each enzyme by all the KOs that were assigned to it.
This information is outputted in data_for_charting.tsv.
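As a rough sketch of what distributing the quantification means, assuming an even split across the assigned KOs. The helper name and the values are hypothetical.

```python
def distribute_quantification(quantification, kos):
    """Hypothetical helper: split one enzyme's quantification evenly across its KOs."""
    if not kos:
        return {}
    share = quantification / len(kos)
    return {ko: share for ko in kos}


# An enzyme quantified at 12.0 with three KOs assigned contributes 4.0 to each.
print(distribute_quantification(12.0, ["K00001", "K00002", "K00003"]))
```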
New tests for several different parameters' combinations
show-available-maps for --show-available-maps parameter.
input-quantification-and-taxonomy for --input-taxonomy and --input-quantification parameters.
include-missing-genomes for --include-missing-genomes parameter.
map-all for --map-all parameter.
New output folders and writing of JSON information
KEGGCharter now stores metabolic maps representations in a maps folder. No brainer.
KEGGCharter additionally stores the information concerning the maps into a json folder. This folder will contain the dictionaries used for generating both the potential and differential maps.
"Potential" JSONs come in the form {box_id: [tax1, tax2, ...]}.
"Differential" JSONs come in the form {box_id: [col1, col2, ...]}. In the future, these should include the quantification value instead.
Also added lxml as dependency.
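The two JSON layouts described for this release can be pictured with a small example. The box IDs, taxa and column names are made up and only show the shape of the data.

```python
import json

# Made-up box IDs, taxa and column names, only to show the two shapes.
potential = {"12": ["Taxon A", "Taxon B"], "15": ["Taxon A"]}
differential = {"12": ["sample_1", "sample_2"], "15": ["sample_2"]}

print(json.dumps(potential, indent=2))
print(json.dumps(differential, indent=2))
```

JSON object keys are always strings, so box IDs would be stored as strings even if they are numeric in the KGML.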
Published by iquasere about 2 years ago
keggcharter - Sanitization of input file
Forces input file to have the columns specified through the command line.
Applies to taxa-column, kegg-column, ko-column, ec-column and columns specified through --quantification-columns.
Published by iquasere over 2 years ago
keggcharter - Information from "kegg-column", "ko-column" and "ec-column" is now all combined
Multiple new columns are now outputted, depending on the source of information, e.g., KO (kegg-column) contains the KOs obtained from the IDs on the column specified with -keggc.
All KOs obtained are grouped into the KO (KEGGCharter) column, now the only one used for charting functions.
Multiple IDs in the same cell now accepted and considered properly
Comma , is the only delimiter accepted for parsing multiple IDs inside the same cell.
Multiple KEGG IDs were accepted before, if separated by semicolons (;). This is now deprecated, and they must come comma-separated.
"Data" dataframe extends and compresses with each cycle of ID conversion.
Simplified input of quantification columns
No more --genomic-columns nor --transcriptomic-columns, only --quantification-columns (-tcols) now.
All maps ("potential" and "differential") are produced for those columns.
"gene" features now also mapped
KEGGCharter was only considering the orthologs attribute of the Pathway instances, but some boxes are present in the KGML as gene features. Now, KEGGCharter considers those as well.
Restructured the repo, simplified CICD, improved output to the command line, performance improvements
Maps inside resources folder, all yamls and CI files in cicd folder.
Much smaller keggcharter_input.tsv is still enough to build nice maps.
Had to specify version of libarchive (3.6.2=h039dbb9_1) in the Dockerfile.
More comprehensive messages.
Lighter progress bars.
The --map-all workflow was running the write_kgmls function for all taxa. It now simply runs for ko, and associates the information to all taxa. Much faster, less dumb.
Published by iquasere over 2 years ago
keggcharter - New options for dealing with tax information
Original workflow of KEGGCharter attempts to download taxa specific KGMLs for organisms in KEGG Genomes (Fig. 1).
Fig. 1 - Original KEGGCharter workflow. Only arcticus had KOs with functions for the TCA cycle attributed that, simultaneously, were present in the KGML for the TCA cycle and the taxon arcticus.
This type of workflow uses both taxon-specific information and results from the inputted datasets. All functions represented are validated by KEGG (i.e., those functions are available for those organisms), but many identifications may be lacking, since information at KEGG is often incomplete.
Setting "--include-missing-genomes" represents organisms that are not in KEGG Genomes
Organisms that are not identified in KEGG Genomes can still be represented, if running KEGGCharter with the option --include-missing-genomes. All functions for the KOs identified for that organism will be represented (Fig. 2).
Fig. 2 - KEGGCharter output expanded with --include-missing-genomes parameter. hydrocola is not present in KEGG Genomes, but all functions attributed to its KOs are still represented.
This setting still yields validated information for the taxonomies that are present in KEGG Genomes, while also allowing for representation of organisms not present in KEGG Genomes. It should offer the best compromise between false positives and false negatives.
Setting "--map-all" ignores KEGG Genomes completely, and represents all functions identified
Functions that are not present in organism-specific KGMLs can still be represented, if running KEGGCharter with the option --map-all. This will bypass all taxon-specific KGMLs, and map all functions for all KOs present in the input dataset (Fig. 3).
Fig. 3 - KEGGCharter output expanded with --map-all parameter. No functions for oleophylus and franklandus were simultaneously present in the KOs identified and available in their KGMLs. In this case, the requirement for presence in the KGMLs is bypassed, and all functions are represented for all taxa.
This setting represents the most information on the KEGG maps, and will produce the most colourful representations, but will likely return many false positives. Maps produced should be analyzed with caution. This setting may be required, however, if information for organisms in KEGG Genomes is very incomplete.
Published by iquasere over 2 years ago
keggcharter - Fixed mapping boxes' IDs and submitting too many IDs to KEGG
Major fix in mapping boxes IDs and positions in orthologs array
Difference between mapping by box.id and by the index in the pathway.orthologs array.
Also changed default "step" to 40
KEGG's API reports fewer ID mappings if many IDs are submitted in the same request. With this step, runs take much longer, but all information is obtained.
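A minimal sketch of submitting IDs in batches of a fixed step. Only the slicing is shown: no request is made, and the function name and identifiers are made up for illustration.

```python
def batches(ids, step=40):
    """Yield consecutive slices of at most `step` IDs."""
    for start in range(0, len(ids), step):
        yield ids[start:start + step]


ids = [f"K{n:05d}" for n in range(1, 101)]  # 100 made-up KO identifiers
sizes = [len(batch) for batch in batches(ids)]
print(sizes)  # [40, 40, 20]
```

Each batch would correspond to one request, so a smaller step means more requests for the same number of IDs.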
Published by iquasere over 2 years ago
keggcharter - KEGGCharter symbolic link changed
KEGGCharter is now called as keggcharter, instead of the old keggcharter.py.
Published by iquasere almost 3 years ago
keggcharter - Bug fix on checking for functions in taxonomic KGMLs
Organisms that should not be represented - because their KGMLs lacked the boxes in question - were being represented, if they were the first ones for a certain box.
Fixes #8
Published by iquasere about 3 years ago
keggcharter - Now uses inputted KOs directly
KOs are still converted to EC numbers and back to KOs, but the ones used in the metabolic maps are those inputted with the -koc parameter.
Users can still use the newly obtained KOs by reformatting the command to accept those in the "KO (KEGGCharter)" column, ideally using the --resume parameter.
Also, some fixes on adding boxes to grey taxonomy.
Published by iquasere over 3 years ago
keggcharter - Several improvements to taxonomy representation
- Changed font color to white for grey boxes
- Grey is added to taxonomies already present: when a box already has one or more taxonomies, the grey (other) colour is added to it if another taxon is also there
- Box labels written without prefixes ("ec:" and "ko:")
Several fixes
- correctly checks the most abundant taxonomies, by only considering KOs of the map being represented
- on the path used when searching for downloaded KGMLs if downloading resources
- correctly assumes manually set "input-taxonomy"
- if "input-taxonomy", correctly handles the taxon_to_mmap_to_orthologs and mmaps2taxa objects
Removed map00072
Published by iquasere over 3 years ago
keggcharter - Better help on --genomic-columns
When no parameter is set for --genomic-columns, KEGGCharter will ask if the user wants to input quantification - the same could be accomplished with the --input-quantification parameter.
Helps in handling the absence of genomic quantification.
Published by iquasere over 3 years ago
keggcharter - Resources moved to their own folders
KEGGCharter now downloads resources to proper folders, instead of dumping everything in rd
* kgmls moved to rd/kc_kgmls
* csvs moved to rd/kc_csvs
Published by iquasere about 4 years ago
keggcharter - Implemented CI
CI includes full analysis of KEGGCharter
Also fixed preparation of data for charting
Was happening before transcriptomic_columns was split
Published by iquasere about 4 years ago
keggcharter - MT columns quantification now normalized by number of KOs
For each protein, MT columns quantification is now divided by the number of KOs that were detected for that protein.
Fix
"--number-of-taxa" input working again
Published by iquasere about 4 years ago
keggcharter - Maps tailored for taxonomies
Genomic maps now don't include orthologs that are not associated with the analyzed taxa:
* KGMLs specific for different taxa are obtained
* Shows which pathways each organism can participate in
* Also shows which orthologs are available
Published by iquasere over 4 years ago
keggcharter - Better imports and progress bars
- Moved progress bars to tqdm
- Simplified imports, became more explicit
- Added openpyxl dependency because of updates on pandas
- Added poppler dependency because I'm dumb sometimes
- Added mscorefonts dependency: fixes pdftoppm error Syntax Error: Couldn't find a font for 'Helvetica'
Published by iquasere over 4 years ago
keggcharter - Bug fix in genomic maps to transcriptomic
Genomic information was presented in transcriptomic maps. It's fixed now
Published by iquasere over 4 years ago
keggcharter - Fix on not using genomic columns
When not specifying genomic columns, still tried to estimate most abundant taxa from genomic column
Published by iquasere about 5 years ago
keggcharter - KGMLs and ECs check, and handling of exception in PDF generation
KGMLs and ECs check
* number of orthologs in KGML is now compared with number of lines in EC list (CSV)
* if they don't match, both will be downloaded again
Handling of exception in PDF generation
* to_pdf started to fail with timeout exception
* a try except now handles that
Published by iquasere about 5 years ago
keggcharter - Fixed bugs in new functionalities
Taxonomy inputting was always happening
KGML loading always read from sys.path, now reads from --resources-directory
Published by iquasere about 5 years ago
keggcharter - Compatibility options for other tools
Two new parameters provide quantification and taxonomy input
--input-quantification provides quantification
* creates new column, named Quantification (KEGGCharter), with 1 in every cell
* new column serves as single value --genomic-columns
--input-taxonomy provides taxonomy
* takes string as input
* parameter serves as single value --taxa-list
* creates new column, named Taxonomy (KEGGCharter), with the parameter in every cell
These options allow KEGGCharter to take the output of UPIMAPI or reCOGnizer directly as input
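What the two parameters do to the input table can be sketched on a list of row dictionaries. The column names are the ones given above, while the helper and the sample rows are hypothetical and do not reflect how KEGGCharter handles its dataframe internally.

```python
def add_constant_columns(rows, taxonomy=None, input_quantification=False):
    """Hypothetical helper mirroring the two options on a list of row dicts."""
    for row in rows:
        if input_quantification:
            # Same value in every cell, so every row counts equally.
            row["Quantification (KEGGCharter)"] = 1
        if taxonomy is not None:
            # The string given on the command line is repeated in every cell.
            row["Taxonomy (KEGGCharter)"] = taxonomy
    return rows


rows = [{"KO": "K00001"}, {"KO": "K00002"}]  # made-up input rows
print(add_constant_columns(rows, taxonomy="Some taxon", input_quantification=True))
```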
Published by iquasere about 5 years ago
keggcharter - An even faster KEGGCharter
kegglink access was moved out of the main map generation loop
* intermediate step to allow multiprocessing
* added ```--resourcesdirectory``` parameter to allow customization of the KGMLs and CSVs folder
Published by iquasere about 5 years ago
keggcharter - KGMLs handled locally
KEGGCharter now downloads KGMLs, and charts information directly from them
* Significantly speeds up the KEGGCharter workflow
* Pathway object is created from KGML with KGML_parser.read
* Circumvents BioPython problem of forcing 0.7.1 DTD guidelines on the XML schema
Published by iquasere about 5 years ago
keggcharter - Major improvements in visualization
- KOs now converted to EC numbers directly to the maps
- No more gray outline boxes
- Exception for loading metabolic map improved
Published by iquasere over 5 years ago
keggcharter - New input options and improved handling
Can now be resumed after running it the first time
* to retry failed maps, or if something failed in the later steps
Also, accepts more diversified input
* allows setting different column names for taxonomy, KEGG IDs and KOs
Published by iquasere over 5 years ago
keggcharter - Changes for bioconda
KEGGCharter files now go to share folder
Also created try/except for maps that fail to download
Published by iquasere over 5 years ago
keggcharter - Ready for bioconda
Added LICENSE
Linted the YAML
Published by iquasere over 5 years ago
keggcharter - Some things for bioconda
Composed the meta.yaml
Removed some last references to reCOGnizer
Removed the solve EC numbers shenanigan - should deal with it in MOSCA
Several TODOS still to go!
Published by iquasere over 5 years ago
keggcharter - KEGGCharter
A tool for representing genomic potential and transcriptomic expression into KEGG pathways
Features
KEGGCharter is a user-friendly implementation of KEGG API and Pathway functionalities. It allows for:
* Conversion of KEGG IDs to KEGG Orthologs (KO) and of KO to EC numbers
* Representation of the metabolic potential of the main taxa in KEGG metabolic maps (up to the top 10, each distinguished by its own colour)
* Representation of differential expression between samples in KEGG metabolic maps (the collective sum of each function will be represented)
Installation
To install KEGGCharter, simply clone this repository and run install.bash! It requires Conda to be previously installed!
git clone https://github.com/iquasere/KEGGCharter.git
sudo KEGGCharter/install.bash
Usage
KEGGCharter needs an input file, but that is all it needs!
```
usage: kegg_charter.py [-h] [-f FILE] [-o OUTPUT] [--tsv] [-mm METABOLIC_MAPS]
                       [-mgc METAGENOMIC_COLUMNS]
                       [-mtc METATRANSCRIPTOMIC_COLUMNS] [-tc TAXA_COLUMN]
                       [-tls TAXA_LIST] [-not NUMBER_OF_TAXA]
                       [-koc KOS_COLUMN] [-v] [-utc]
                       [-tl {SPECIES,GENUS,FAMILY,ORDER,CLASS,PHYLUM,SUPERKINGDOM}]
                       [--show-available-maps]

reCOGnizer - a tool for domain based annotation with the COG database

optional arguments:
  -h, --help            show this help message and exit
  -f FILE, --file FILE  TSV or EXCEL table with information to chart
  -o OUTPUT, --output OUTPUT
                        Output directory
  --tsv                 Results will be outputed in TSV format (and not
                        EXCEL).
  -mm METABOLIC_MAPS, --metabolic-maps METABOLIC_MAPS
                        IDs of metabolic maps to output
  -mgc METAGENOMIC_COLUMNS, --metagenomic-columns METAGENOMIC_COLUMNS
                        Names of columns with metagenomic quantification
  -mtc METATRANSCRIPTOMIC_COLUMNS, --metatranscriptomic-columns METATRANSCRIPTOMIC_COLUMNS
                        Names of columns with metatranscriptomics
                        quantification
  -tc TAXA_COLUMN, --taxa-column TAXA_COLUMN
                        Column with the taxa designations to represent with
                        KEGGChart
  -tls TAXA_LIST, --taxa-list TAXA_LIST
                        List of taxa to represent in genomic potential charts
                        (comma separated)
  -not NUMBER_OF_TAXA, --number-of-taxa NUMBER_OF_TAXA
                        Number of taxa to represent in genomic potential
                        charts (comma separated)
  -koc KOS_COLUMN, --kos-column KOS_COLUMN
                        "If input file has a column "KO (KEGG Charter)",
                        setting this option will make KEGG Charter use those
                        KOs instead (THIS ARGUMENT OVERRIDES KEGG IDS COLUMNS,
                        USING KOS DIRECTLY INSTEAD!)
  -v, --version         show program's version number and exit

UniProt arguments:
  -utc, --uniprot-taxonomic-columns
                        If columns have UniProt names, KEGGCharter will search
                        for UniProt designations (e.g. Taxonomic
                        lineage(GENUS))
  -tl {SPECIES,GENUS,FAMILY,ORDER,CLASS,PHYLUM,SUPERKINGDOM}, --taxonomic-level {SPECIES,GENUS,FAMILY,ORDER,CLASS,PHYLUM,SUPERKINGDOM}
                        The taxonomic level to represent

Special functions:
  --show-available-maps
                        Outputs KEGG maps IDs and descriptions to the console
                        (so you may pick the ones you want!)
```
To run KEGGCharter, an input file must be supplied - see "Example" section - and the columns with genomic and/or transcriptomic information as well. Output directory is not mandatory, but may help find results.
python kegg_charter.py -f input_file.xlsx -o output_folder -mgc mg_column1,mg_column2 -mtc mt_column1,mt_column2 ...
Example
An example input file is available when downloading the GitHub repository. Inserting the KEGG IDs and genomic and transcriptomic quantifications in this file should allow KEGGCharter to run with no errors... in principle.
Outputs
KEGGCharter produces a table from the inputted data with two new columns - KO (KEGG Charter) and EC number (KEGG Charter) - containing the results of conversion of KEGG IDs to KOs and KOs to EC numbers, respectively. This file is saved as KEGGCharter_results in the output directory. KEGGCharter then represents this information in KEGG metabolic maps. If information is available as result of (meta)genomics analysis, KEGGCharter will locate the boxes whose functions are present in the organisms' genomes, mapping their genomic potential. If (meta)transcriptomics data is available, KEGGCharter will consider the sample as a whole, measuring gene expression and performing a multi-sample comparison for each function in the metabolic maps.
* maps with genomic information are identified with the prefix "potential_" from genomic potential
* maps with transcriptomic information are identified with the prefix "differential_" from differential expression
Published by iquasere over 5 years ago