recognizer - Fix on database creation from PN files

The SMP files were still being specified with the smp_directory part attached.

Published by iquasere over 1 year ago

recognizer - Databases' names changed to CD-batch search options

The database names passed to --databases have changed to accommodate the options available in CDD Batch Search. The new options are:

NCBI_Curated
Pfam
SMART
KOG
COG
PRK
TIGR

Domains now follow the lists at the PN files provided by NCBI

Domains related to the NCBI_Curated and PRK databases were not all being considered when building databases. This has been fixed, in accordance with the PN files provided with cdd.tar.gz.

Database construction reimplemented to use the PN files provided by CDD

If those are not available, reCOGnizer will still build the PN files, but with more added domains.

This should fix #19, but let's see.

Also removed deprecated parameters

The --download-resources and --skip-downloaded parameters now result in an error when specified.

Published by iquasere over 1 year ago

recognizer - Fix on regex search of EC numbers

re.escape is required for handling the regex search where strings are being concatenated.

E.g. to consider the literal ) when searching for (1.1.1.1), in the function in question.

This problem was caused by using the new r"regex" format.
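
A minimal sketch of the issue, assuming the EC number is concatenated into a larger pattern before searching. The helper name `find_ec` is hypothetical and is not reCOGnizer's actual function.

```python
import re


def find_ec(description, ec):
    """Return True if the literal text '(<ec>)' occurs in description.

    Hypothetical helper for illustration; not reCOGnizer's actual function.
    """
    # re.escape turns '(', ')' and '.' into literals before the search.
    pattern = re.escape('(' + ec + ')')
    return re.search(pattern, description) is not None


# Escaped: only the literal "(1.1.1.1)" matches.
print(find_ec('alcohol dehydrogenase (1.1.1.1)', '1.1.1.1'))
print(find_ec('alcohol dehydrogenase 1x1x1x1', '1.1.1.1'))
# Unescaped: the parentheses form a group and '.' matches any character,
# so text without parentheses or dots is matched as well.
print(re.search('(1.1.1.1)', 'alcohol dehydrogenase 1x1x1x1') is not None)
```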

Published by iquasere about 2 years ago

recognizer - Simpler download of databases and more robust COG2KO conversion

Much simpler download of databases

reCOGnizer relied on the --download-resources and --skip-downloaded parameters for setting up its databases.

--download-resources instructed reCOGnizer to download the files required for its execution, and --skip-downloaded instructed it to ignore already downloaded files, for cases such as a single file having been removed by mistake.

Now, reCOGnizer relies on the recognizer_dwnl.timestamp file to check if databases have already been downloaded. If the file exists, it skips installation. If the file doesn't exist, reCOGnizer removes all available files and downloads everything.
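
The timestamp check can be sketched roughly as below. This is an illustration only: the function names, the `download` callable, the content written to the timestamp file, and the choice to wipe the whole directory are assumptions, not reCOGnizer's actual code.

```python
from pathlib import Path
import shutil


def databases_ready(resources_directory):
    """Return True if the timestamp file marks a completed download."""
    return (Path(resources_directory) / 'recognizer_dwnl.timestamp').is_file()


def prepare_resources(resources_directory, download):
    """Skip installation if the timestamp exists; otherwise start from scratch.

    `download` is a hypothetical callable that fetches everything into the
    directory it receives. Returns True if a download was performed.
    """
    directory = Path(resources_directory)
    if databases_ready(directory):
        return False  # databases already installed, nothing to do
    # Remove any partial leftovers before downloading everything again.
    if directory.exists():
        shutil.rmtree(directory)
    directory.mkdir(parents=True)
    download(directory)
    # Assumed content; only the file's existence is checked above.
    (directory / 'recognizer_dwnl.timestamp').write_text('done\n')
    return True
```

Checking a single marker file keeps the logic binary: either everything is present, or everything is fetched again.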

COG2KO conversion more reliable

Previously, reCOGnizer built the cog2ko conversion as a collection of all KOs available for each protein mapping to the specific COG.

Now, reCOGnizer uses an approach similar to the cog2ec conversion: it only assigns a KO to a COG when over half of the instances of that COG have that particular KO.

This yields a more reliable COG2KO conversion, while keeping KOs for a considerable number of COGs.
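
A rough sketch of the majority rule described above. The input layout (one `(cog, ko)` pair per protein instance, with `None` for proteins lacking a KO), the function name and the IDs used in the example are assumptions for illustration, as is counting KO-less proteins in the denominator.

```python
from collections import Counter, defaultdict


def majority_cog2ko(protein_annotations):
    """Map each COG to the KO carried by over half of its proteins, if any.

    protein_annotations: iterable of (cog, ko) pairs, one per protein instance;
    ko may be None when the protein has no KO (assumed layout).
    """
    kos_by_cog = defaultdict(list)
    for cog, ko in protein_annotations:
        kos_by_cog[cog].append(ko)
    cog2ko = {}
    for cog, kos in kos_by_cog.items():
        ko, count = Counter(kos).most_common(1)[0]
        # Strict majority over all instances of the COG, including KO-less ones.
        if ko is not None and count > len(kos) / 2:
            cog2ko[cog] = ko
    return cog2ko


# Made-up example: COG_A has a majority KO, COG_B is a tie and gets none.
pairs = [('COG_A', 'KO_1'), ('COG_A', 'KO_1'), ('COG_A', 'KO_2'),
         ('COG_B', 'KO_3'), ('COG_B', 'KO_4')]
print(majority_cog2ko(pairs))
```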

Also removes the intermediate ssv files outputted during construction of the cog2ko database.

New parameters --test-run and --output-rpsbproc-columns will usually not be needed

The --test-run parameter had to be implemented as a consequence of the simpler database download. When set, reCOGnizer runs in a non-standard fashion, which is required for the tests at GitHub: reCOGnizer moves the cdd.tar.gz file available in the repo, and uses it as a valid cdd.tar.gz file.

--output-rpsbproc-columns will output the Superfamilies, Sites, Motifs columns, which are usually empty for almost all annotations.

Removed some unnecessary files

recognizer.log was produced at working directory. It only included rpsblast outputs, mainly for error assessment. Users can obtain that information by running reCOGnizer with the --debug parameter, and manually running the faulty commands.

taxonomy.rdf was obtained as part of building taxonomy.tsv. Now, reCOGnizer removes it once it is no longer needed.

Some fixes

reCOGnizer was not reporting the download of files when the --quiet flag was set, except when the files had already been downloaded, and it removed them.

Also updated regexes to the raw-string format, r'regex'.

Published by iquasere about 2 years ago

recognizer - Fixed KOG outputting

rpsbproc doesn't work with the KOG database. reCOGnizer's KOG report is now made directly from BLAST 6.

Published by iquasere over 2 years ago

recognizer - Fix when only downloading resources

reCOGnizer wasn't properly checking if the --file parameter had been specified. Therefore, reCOGnizer still attempted to perform annotation and searched for annotation outputs when no --file argument was given.

Now, it's working properly.

Published by iquasere over 2 years ago

recognizer - Custom databases workflow now multithreaded

Now works multithreaded

Removed the -db parameter, which was incorporated into -dbs. --custom-database changed to --custom-databases to reflect this change. Added input sanitization for custom/default databases. Custom and default databases cannot be used at the same time.

Also some necessary changes on the tests

The latest image of miniconda is not functional, so the version was fixed at 22.11.1. Added a test for custom-database-workflow. Tests now run simultaneously, instead of one at a time.

Published by iquasere over 2 years ago

recognizer - Fixed several annoyances

No more need to confirm you don't want to gunzip download resource files

If --skip-downloaded is set, reCOGnizer skips both the downloading and the gunzipping.

No more FutureWarning when trying to sum COGs

.sum(numeric_only=True) fixed that.

Published by iquasere almost 3 years ago

recognizer - reCOGnizer is called without ".py"

Now called as "recognizer"

reCOGnizer was always called through the shell as recognizer.py. Now, it is called as recognizer.

Now removes intermediate folders

Unused directories (tmp, rpsbproc and others), whose files were already being removed, are now themselves removed.

Also, several fixes

Fixed conversion COG2KO. Fixed future warning - xlsx_report.save() to xlsx_report.close().

Updated documentation

Added a nice interactive krona plot. Also corrected the parameters, and talked about the taxonomy thing.

Published by iquasere almost 3 years ago

recognizer - Fix on outputting COG categories

Due to reformatting how reCOGnizer outputs information, its capacity for outputting COG categories was damaged.

It is fixed now.

Published by iquasere over 3 years ago

recognizer - Increase maximum SMPs per database

Set option -max_smp_vol 1000000 for the makeprofiledb command.

Context: the blast package had an update, and the makeprofiledb tool now outputs a database for each 1000 HMM profiles by default.

Published by iquasere over 3 years ago

recognizer - Fix on COG2KO

Blocked it for now, so that reCOGnizer finishes its workflow.

Published by iquasere over 3 years ago

recognizer - Major improvements on reporting results

Columns have been standardized to have the same names, regardless of database.
For example, the COG functional category and cog columns were renamed to functional category and DB ID, respectively.
This helps to provide a simpler report, with far fewer NA values.

Databases now inputted as comma-separated values

No problem when using one or all default databases (without specifying values), but breaks backwards compatibility, and so version was upped to 1.8.

Also some miscellaneous fixes

Prohibited creating kronas when there is no annotation for the respective database (COG or KOG)
Removed Biopython as dependency

Published by iquasere over 3 years ago

recognizer - Intermediates now removed

Files in the asn, blast and rpsbproc directories are again removed.

Fixes in versions

So reCOGnizer can be integrated easily with other tools, versions for krona and Biopython were relaxed. Because of a previous bug in blast 2.11, version of blast was set to >=2.12.

Published by iquasere almost 4 years ago

recognizer - BLAST version relaxed

Any blast version can now be used, as newer releases fix the bug that prevented their use in reCOGnizer.

Published by iquasere almost 4 years ago

recognizer - EC numbers obtained from CDD and Smart

EC numbers are now obtained from parsing database descriptions of CDD and Smart.

For Smart, all EC numbers are obtained, as they always refer to the domain described.

In the case of CDD, only EC numbers in the form "(EC:X.X.X.X)" are obtained. Many more EC numbers are referenced in other formats, but those refer to other proteins in the same domain family rather than to the domain in question.
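
The two parsing rules can be sketched with regular expressions as below. The exact patterns, the handling of "-" as a placeholder level, the function name and the example descriptions are assumptions for illustration and may differ from what reCOGnizer uses.

```python
import re

# CDD: only the explicit "(EC:X.X.X.X)" form; each level is digits or '-'.
CDD_EC = re.compile(r'\(EC:(\d+(?:\.(?:\d+|-)){3})\)')
# Smart: any EC-like number found in the description.
SMART_EC = re.compile(r'\b\d+(?:\.(?:\d+|-)){3}')


def ec_numbers(description, database):
    """Return the EC numbers found in a description, per database rule."""
    if database == 'CDD':
        return CDD_EC.findall(description)
    return SMART_EC.findall(description)


# Made-up descriptions for illustration.
cdd_text = 'Alcohol dehydrogenase (EC:1.1.1.1); related to 2.7.1.1 kinases'
print(ec_numbers(cdd_text, 'CDD'))    # only the "(EC:...)" occurrence is kept
print(ec_numbers(cdd_text, 'Smart'))  # every EC-like number is kept
```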

Published by iquasere almost 4 years ago

recognizer - A working Continuous Integration

Added mini cdd.tar.gz with only some HMMs for all databases

New parameter of reCOGnizer, --skip-downloaded, mainly for CI: if set, files already downloaded will be skipped, and reCOGnizer no longer asks about the files one at a time

Also simplified some intermediate tasks

"Organize COGs to each tax ID" is now limited to when taxonomy is relevant
cog2ko downloads are simplified: silenced with the -q parameter of wget

Published by iquasere about 4 years ago

recognizer - Removal of artifacts and bug fixes

Removal of artifacts

Now removes CDD tarball
Now removes all files in helper directories: fasta, asn, blast, rpsbproc and tmp
Integrated cog2ec.py code

Bug fixes

Fix on pointing to directory where SMPs are now
Fix on only reporting time in hours, minutes and seconds: now also reports days
Removed redundant asking for resources download

Also changed default of --max-target-seqs from 1 to 20

Published by iquasere about 4 years ago

recognizer - Now downloads RPSBPROC files

reCOGnizer now downloads the following files to --resources_directory:
* https://ftp.ncbi.nih.gov/pub/mmdb/cdd/bitscorespecific.txt
* https://ftp.ncbi.nih.gov/pub/mmdb/cdd/cddannot.dat.gz
* https://ftp.ncbi.nih.gov/pub/mmdb/cdd/cddannotgeneric.dat.gz
* https://ftp.ncbi.nih.gov/pub/mmdb/cdd/cddid.tbl.gz
* https://ftp.ncbi.nih.gov/pub/mmdb/cdd/cdtrack.txt
* https://ftp.ncbi.nih.gov/pub/mmdb/cdd/familysuperfamilylinks

and gunzips the archives.

This fixes #4

Published by iquasere about 4 years ago

recognizer - Implemented COG taxonomic workflow

COG annotation can now follow an alternative workflow based on taxonomy.
* if --tax-file is inputted and --species-taxids is set
* --species-taxids new parameter, just for this
* SMPs will each be its own database
* Tax ID to list of COGs is estimated from NOG.members.tsv
* If a Tax ID from tax file is present in tax ID to COG, those COGs will be used as reference database

Published by iquasere over 4 years ago

recognizer - Fix on annotation of proteins without taxonomy

The obligatory bug fixing rampage after each major version

Fix on annotation of proteins without taxonomy when following taxonomic workflow

reCOGnizer was not associating correctly the corresponding databases

Fix on reports with duplicate entries

Some reports were creating duplicate entries incorrectly, dunno why

Fix on tax file input

When not inputting a --tax-file, reCOGnizer flipped its mind. Tis fixed now

Published by iquasere over 4 years ago

recognizer - Multiprocessing instead of multithreading

reCOGnizer no longer constructs split versions of the databases for multithreaded annotation.
This became necessary because of the taxonomic implementation: the use of databases split by taxonomy requires multiprocessing to not waste resources (since the number of split databases used can be less than threads available). However, this will inevitably lead to cases where multiple queries using multiple databases will provoke the use of more threads than specified by the user.
This also simplifies the logic of reCOGnizer, helping development. reCOGnizer now runs multiple instances of rpsblast in each and every annotation, whether or not a taxonomic file is used, each using one thread.
Fixed parsing empty rpsbproc file (hadn't I done this before?)

Fixes

rpsblast commands are no longer printed

Extras

No taxonomy (taxid = 0) proteins are now annotated with the entire databases (they were not being annotated at all)
reCOGnizer now removes all intermediate blast, asn and rpsbproc reports

Published by iquasere over 4 years ago

recognizer - Less verbose, more organized

Now creates folders for specific outputs:

fasta stores input FASTA file divided by taxa
asn stores ASN alignment reports (outfmt 11)
blast stores TSV alignment reports (outfmt 6)
rpsbproc stores RPSBPROC reports

No output of commands, only timed message of main steps

Removed output of commands for running rpsblast, rpsbproc, ktImportText (of krona package) and reading FASTA files

Also fixed Excel sheet numbering in final report

Published by iquasere over 4 years ago

recognizer - Retrieval of lineages in multiprocesses

Now uses multiprocessing.Pool.starmap to query the taxonomy dataframe for retrieval of taxonomies.
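
A small sketch of how Pool.starmap distributes lineage lookups. To keep the example runnable anywhere, it uses `multiprocessing.pool.ThreadPool`, which exposes the same `starmap` interface as the process-based `Pool`. The toy taxonomy table (with simplified parent links), the dictionary layout and the function names are assumptions, not reCOGnizer's data structures.

```python
from multiprocessing.pool import ThreadPool

# Toy table with simplified parent links: taxid -> (parent taxid, name).
TAXONOMY = {
    562: (561, 'Escherichia coli'),
    561: (543, 'Escherichia'),
    543: (1, 'Enterobacteriaceae'),
}


def get_lineage(taxid, taxonomy):
    """Walk up the parent links and return names from the top rank down."""
    names = []
    while taxid in taxonomy:
        parent, name = taxonomy[taxid]
        names.append(name)
        taxid = parent
    return names[::-1]


def get_lineages(taxids, taxonomy, threads=2):
    """Resolve several lineages concurrently, preserving input order."""
    # starmap unpacks each (taxid, taxonomy) tuple into get_lineage's arguments.
    with ThreadPool(threads) as pool:
        return pool.starmap(get_lineage, [(taxid, taxonomy) for taxid in taxids])


print(get_lineages([562, 561], TAXONOMY))
```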

Splits FASTA by tax ID in memory

Loads the entire FASTA and splits it by tax ID of --tax-col
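
Splitting a FASTA by tax ID in memory could look roughly like this. The parser, the `protein2taxid` mapping (standing in for the --tax-col lookup) and the use of tax ID 0 for proteins without taxonomy are illustrative assumptions.

```python
from collections import defaultdict


def parse_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith('>'):
            if header is not None:
                yield header, ''.join(chunks)
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        yield header, ''.join(chunks)


def split_by_taxid(fasta_lines, protein2taxid):
    """Group records by tax ID; proteins missing from the mapping get tax ID 0."""
    groups = defaultdict(list)
    for header, sequence in parse_fasta(fasta_lines):
        protein_id = header.split()[0]  # ID is the text before the first space
        groups[protein2taxid.get(protein_id, 0)].append((header, sequence))
    return dict(groups)


# Made-up records: p1 and p2 share a tax ID, p3 has none.
fasta = ['>p1 desc', 'MKV', 'LLA', '>p2', 'MST', '>p3', 'MAA']
print(split_by_taxid(fasta, {'p1': 562, 'p2': 562}))
```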

Also, several bug fixes

bug in cog2ko - bad string formatting - Fixes #3
merging of report_6 with blast - made no sense
bug when rpsbproc report was empty

Published by iquasere over 4 years ago

recognizer - Get upper taxid is now performed locally

Taxonomic RPS-BLAST with Pool

Multiprocessing now divides the annotation by:
* multiple FASTA files, one for each main taxon
* multiple databases, a selection for each taxon

Also, extensive debugging of 1.5:

Reads 'threads' and 'max-target-seqs' as int
Downloads cdd.info for DB version control
Get upper taxid is now performed locally
Multithreaded workflow with postprocessing as it must be
Removed "Warning: Examining 5 or more matches is recommended" hell verbosity
Fixed death bug on reading reports in "Handling results"

Published by iquasere over 4 years ago

recognizer - Taxonomic considerations and multi-domain resolution

Taxonomy can now be inputted to reCOGnizer

If taxonomy of inputted proteins is known, a taxonomy file can be inputted to reCOGnizer using the following parameters:
1. --tax-file - specifies the file where the taxonomic information is stored
2. --protein-id-col - specifies the column with IDs equal to the IDs in the FASTA inputted
3. --tax-col - specifies the column with taxonomic classification (must be tax IDs)

Multi-domain annotations are now solved with rpsbproc

An oversight of reCOGnizer's implementation was the possibility of multi-domain proteins. The rpsbproc utility has already been developed to tackle this problem, and it was implemented in reCOGnizer. Consequently:
* reCOGnizer now comes with a default of 20 for --max-target-seqs to provide enough options for rpsbproc;
* the first annotation of reCOGnizer produces the result in outfmt 11, but this format is converted to outfmt 6 with the blast_formatter utility;
* reCOGnizer now only outputs one match per domain, but for all domains. This is good, since there is less noise, but may be bad for downstream analyses that would benefit from a single annotation per protein. We may look at text mining in the future.

Published by iquasere over 4 years ago

recognizer - Fixed bug for 0 identifications

FASTA read with generator
No longer writes c(k)og_quantification.xlsx
ProgressBar replaced with tqdm

Published by iquasere over 4 years ago

recognizer - Fix on RPS-BLAST

Removed testing artifact

Published by iquasere over 4 years ago

recognizer - Results outputted in a single TSV report

Results from all databases are now also outputted in a single TSV report, reCOGnizer_results.tsv, ordered by qseqid and DB ID

Default for resources_directory is now ~/resources_directory

Published by iquasere over 4 years ago

recognizer - Added % of identity parameter

Set with argument --pident
Also changed default e-value to 1e-2

Published by iquasere almost 5 years ago

recognizer - Added option for setting e-value

Added parameter --evalue for setting e-value to RPS-BLAST

Published by iquasere almost 5 years ago

recognizer - Added xlsxwriter as dependency

Now it installs xlsxwriter when installing reCOGnizer

Published by iquasere almost 5 years ago

recognizer - Fix on blast version

blast was recently updated to 2.11, but it doesn't work. Now, its version in reCOGnizer is fixed to 2.10.1

Published by iquasere almost 5 years ago

recognizer - Deals with multiple categories

Some COGs and KOGs have several letters/functional categories attributed. Those were not properly handled. Now, reCOGnizer reads both cog-20.def and kog, adding extra rows to account for these multiple categories
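
A minimal sketch of expanding multi-letter categories into one row per category. The tuple layout, the function name and the example values are assumptions for illustration, not the actual structure of cog-20.def.

```python
def expand_categories(rows):
    """Yield one (cog, category, description) row per category letter.

    rows: iterable of (cog, categories, description), where categories is a
    string of one-letter codes such as 'EH' (assumed layout).
    """
    for cog, categories, description in rows:
        for letter in categories:
            yield cog, letter, description


# Made-up entries: the second one carries two categories and becomes two rows.
entries = [('COG_A', 'H', 'first function'), ('COG_B', 'EH', 'second function')]
for row in expand_categories(entries):
    print(row)
```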

Also improved sheet names in reCOGnizer_results, only adds number if multiple sheets will be needed (over 1M proteins)

Published by iquasere about 5 years ago

recognizer - Improvements on krona plotting

Fixed a bug in krona plot generation.

Now outputs krona plot of COG categories for KOG database.

Published by iquasere about 5 years ago

recognizer - Minor fix on downloading resources

Minor fix on downloading resources: tar command requires shell to interpret wildcards
Fix on krona installation: build.sh script was not calling install.pl

Published by iquasere about 5 years ago

recognizer - All databases from CDD are now usable

Added support for all databases for which there are HMMs in CDD:
* COG
* KOG
* NCBIfam
* Pfam
* Protein Clusters
* Smart
* TIGRFAM
* CDD itself

Results obtained for each database differ, with all databases getting the base CDD description. Besides that:
* NCBIfam, Protein Clusters, Pfam and TIGRFAM get taxonomic, domain name and EC number information
* COG keeps getting the COG categories and corresponding EC numbers and KOs
* KOG gets the COG categories
* Smart gets domain name information

Published by iquasere about 5 years ago

recognizer - Minor fix for integration in pipelines

Krona now automatically runs the symlinks script, so reCOGnizer now calls it through the symlink instead of the literal executable name, which was broken when integrated into environments such as MOSCA.

Published by iquasere over 5 years ago

recognizer - Several fixes on cog2ko

Fixed broken bash commands in cog2ko
Fixed the directories for storing the relational tables
Fixed the merging of dataframes

Published by iquasere over 5 years ago

recognizer - Now compatible with MacOS!

MacOS requires the use of gnu-tar instead of tar to use --wildcards
wget is now added as dependency
cog2ko default folder changed to the same as all others

Published by iquasere over 5 years ago

recognizer - COG to KO conversion added

Files are downloaded from StringDB
COGs are converted to StringDB IDs and then to KOs
All KOs for each COG are obtained in comma separated format
Only retrieves KOs at the end of file protein.info, may be improved in the future

Published by iquasere over 5 years ago

recognizer - Bug fixed in RPS-BLAST handling

Published by iquasere over 5 years ago

recognizer - blast info now outputted in protein report

Joined protein2cog with blast info
CDD IDs to COGs now in cdd_aligned.txt

Published by iquasere over 5 years ago

recognizer - New parameters provide more information

Added two new parameters:
* --remove-spaces replaces spaces with underscores to keep the full IDs (BLAST disregards everything after a space)
* --output-sequences will output protein2cog with a new column, "Sequences", with the sequences of the proteins inputted
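
What --remove-spaces does to FASTA headers can be sketched as follows. The function name and the example header are hypothetical, and only header lines are assumed to be rewritten.

```python
def remove_spaces(fasta_lines):
    """Yield FASTA lines with spaces in header lines replaced by underscores."""
    for line in fasta_lines:
        line = line.rstrip('\n')
        if line.startswith('>'):
            # Keep the whole header as the ID by removing its spaces.
            yield line.replace(' ', '_')
        else:
            yield line


# Made-up record for illustration.
for line in remove_spaces(['>prot1 some description\n', 'MKV\n']):
    print(line)
```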

Published by iquasere over 5 years ago

recognizer - Fully functional for bioconda

Even though reCOGnizer was already included in Bioconda, it has now been adapted to that new reality.
* Changed the header of the main script -> it can still be run with python recognizer.py if coming from GitHub
* -rd option now fully functional (using the default with Bioconda may produce strange results)
* Added the version parameter -> now, you can track it!

Published by iquasere almost 6 years ago

recognizer - Ready for Conda!

Made first modifications for including reCOGnizer into Conda.
* shifted Krona installation to Conda
* removed need for cdd2cog.pl script. Methods were created for implementing its functionality into main script

Published by iquasere almost 6 years ago

recognizer - Introduced conversion of COG functions to EC numbers

Conversion of COG functions to EC numbers is based on some eggNOG tables that convert COG functions to STRING IDs and then to EC numbers. Many thanks go to SciLifeLab Bioinformatics LTS for composing all of this. Important files include:
* the script used for construction of the relational table of cog2ec
* the NOG.members.tsv table that allows conversion of COG functions to STRING IDs
* the eggnog4.proteinidconversion.tsv table that allows conversion of STRING IDs to EC numbers

Published by iquasere almost 6 years ago

recognizer - Now outputs in TSV!

Added parameter for writing tables in TSV format. Also, cog_quantification now produces two tables, one for krona plotting, other for user interaction.

Published by iquasere almost 6 years ago

recognizer - Different time for database construction

The code for construction of the COG database from the SMP files has been improved.
* Later stages of reCOGnizer are no longer dependent on reCOGnizer location and corresponding databases.
* Added parameters for specifying that the database was built by reCOGnizer and for customizing the directory of databases

Also reformulated the downloading of resources for cdd2cog.
* Migrated commands to new bash script, download_resources
* Removed the download at installation time of reCOGnizer
* Now, reCOGnizer reacts to the absence of any of the files and runs the script to download everything again (might be overkill, but is safer)

Published by iquasere almost 6 years ago

recognizer - reCOGnizer 1.1

Custom databases now allowed

Custom databases can be inputted
If multiple databases are inputted, multithreading can be used
Previous reCOGnizer databases can also now be inputted!

Published by iquasere almost 6 years ago

recognizer - reCOGnizer 1.0

Fully tested and functional!

For next versions

Set argument --database for inputting custom user database
And... that's it! Hold the earth for reCOGnizer 1.1!

Published by iquasere about 6 years ago
