Recent Releases of upimapi
upimapi - Simplified database download
When inputting a database, there are three options:
1. input one of three reserved values: uniprot, swissprot or taxids
2. input a FASTA database
3. input a DIAMOND formatted database
UPIMAPI will first check whether the DIAMOND version of the database exists and, if it finds it, will run annotation with it.
1. in --resources-directory folder, either uniprot.dmnd, uniprot_sprot.dmnd or taxids_database.dmnd
2. the database filename with termination replaced with .dmnd
3. the database filename itself
If that doesn't exist, UPIMAPI will search for the FASTA format, and if it finds it, will convert to DIAMOND format.
1. in --resources-directory folder, either uniprot.fasta, uniprot_sprot.fasta or taxids_database.fasta
2. the database filename itself
3. otherwise, will exit with a file-not-found error
This removes the need to tinker with the --skip-db-check parameter, but more trust is placed on the user.
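The lookup order described above can be sketched roughly as follows. This is a minimal illustration, not UPIMAPI's actual code: the function name `resolve_database`, the `RESERVED` mapping and the `(path, needs_conversion)` return value are assumptions made for the example; only the file names come from the notes above.

```python
from pathlib import Path
from typing import Tuple

# Assumed mapping from reserved values to the file stems named in the notes.
RESERVED = {
    "uniprot": "uniprot",
    "swissprot": "uniprot_sprot",
    "taxids": "taxids_database",
}


def resolve_database(database: str, resources_directory: str) -> Tuple[Path, bool]:
    """Return (path, needs_conversion) for the database that should be used."""
    if database in RESERVED:
        # Reserved values live in the resources directory.
        stem = RESERVED[database]
        dmnd = Path(resources_directory) / f"{stem}.dmnd"
        fasta = Path(resources_directory) / f"{stem}.fasta"
    else:
        # Custom database: same filename with the termination replaced by .dmnd.
        # If the input already ends in .dmnd, both candidates are the file itself.
        fasta = Path(database)
        dmnd = fasta.with_suffix(".dmnd")
    if dmnd.is_file():
        return dmnd, False  # DIAMOND database found, use it directly
    if fasta.is_file():
        return fasta, True  # FASTA found, still has to be converted to DIAMOND format
    raise FileNotFoundError(f"Database not found: {database}")
```

A caller would run the FASTA-to-DIAMOND conversion only when the second element of the result is `True`.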
- Python
Published by iquasere over 2 years ago
upimapi - Sanitization of mapping columns
Wrong columns can no longer be inputted
Now UPIMAPI will report an error and exit with a code different from 0.
New command for showing available fields
upimapi --show available-fields will print the columns available for ID mapping, properly capitalized and extracted directly from the return fields page.
- Python
Published by iquasere over 2 years ago
upimapi - Fixed parsing of custom inputted "-cols"
This fix concerns the handling of the columns Organism, Organism (ID), Taxonomic lineage and Taxonomic lineage IDs when some of the Taxonomic lineage (LEVEL) or Taxonomic lineage IDs (LEVEL) columns are specified.
UPIMAPI now properly adds and discards columns during its execution, obeying the respective conditions.
Also, UPIMAPI now detects if input ends in a compressed format, i.e., if an input file is specified and ends with .zip, .tar, .gz or .bz2, UPIMAPI will stop executing and will exit.
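A check of this kind could look like the sketch below. The function name `is_compressed` is hypothetical, and the real check in UPIMAPI may differ (for example in how it treats upper-case extensions); only the list of extensions comes from the note above.

```python
# Extensions listed in the release note.
COMPRESSED_SUFFIXES = (".zip", ".tar", ".gz", ".bz2")


def is_compressed(filename: str) -> bool:
    """Return True if the filename ends with one of the compressed-format extensions."""
    # str.endswith accepts a tuple and matches any of its members.
    return filename.endswith(COMPRESSED_SUFFIXES)
```

For example, `is_compressed("proteins.fasta.gz")` is true, while `is_compressed("proteins.fasta")` is false.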
- Python
Published by iquasere over 2 years ago
upimapi - Fixed handling taxonomic columns
Columns were not being parsed correctly. Repeated columns were being outputted, i.e., Taxonomic lineage (SPECIES) and Taxonomic lineage IDs (SPECIES).
Also simplified repo structure extensively, put all into cicd folder.
- Python
Published by iquasere over 2 years ago
upimapi - Sorted the input of taxonomic columns
Specifying taxonomic columns (e.g., Taxonomic lineage (SPECIES), Taxonomic lineage IDs (SUPERKINGDOM)) was always outputting the columns Taxonomic lineage and Taxonomic lineage (IDs).
These columns are no longer outputted if not called for.
Also, several fixes
Fixed outputting taxonomy with extra space (e.g. Bacteria -> Bacteria).
Fixed the case where no additional IDs are mapped, which was throwing an error.
Fixed case where no columns are inputted.
Fixed getting fasta - request was badly formatted.
- Python
Published by iquasere over 2 years ago
upimapi - From/To ID mapping implemented
Implemented the ID mapping available at https://www.uniprot.org/id-mapping, triggered when "From database" and "To database" are different from the default values - "UniProtKB AC/ID" and "UniProtKB".
Two new parameters: --from-db and --to-db. Possible values for these can be consulted at https://rest.uniprot.org/configure/idmapping/fields
They can also be checked by inputting a wrong value to the parameter. Possible options will show up.
UPIMAPI will end execution after performing this new ID mapping. It can't be combined with the ID mapping that obtains columns of information from UniProt.
Re-added pyyaml as dependency, as api_info is now obtained again, and used directly.
- Python
Published by iquasere almost 3 years ago
upimapi - Columns outputted in order of input
Columns were being outputted in random order, because of set operations in UPIMAPI's code.
Columns are now properly outputted in the order specified by the user.
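The underlying issue is that a Python `set` does not keep insertion order. One common way to remove duplicates while keeping the user's order is `dict.fromkeys`, sketched below; this illustrates the general technique and is not necessarily how UPIMAPI does it. The helper name `unique_in_order` is made up for the example.

```python
def unique_in_order(columns):
    """Drop repeated column names while keeping the order of first appearance."""
    # Dictionaries preserve insertion order (Python 3.7+), unlike sets.
    return list(dict.fromkeys(columns))
```

With this helper, `["Entry", "Organism", "Entry", "Length"]` becomes `["Entry", "Organism", "Length"]`.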
- Python
Published by iquasere almost 3 years ago
upimapi - Fix on default memory
When memory is inputted with --max-memory, UPIMAPI assumes it comes as Gb.
The default in UPIMAPI (when not explicitly inputted) was checking for available memory, which comes in bytes. This led to memory values that were too large, which led to block-size values that were too small, and the reference database would be split into too many blocks. Then, UPIMAPI/DIAMOND would take forever.
Now, UPIMAPI parses default memory to Gb before determining block-size and number-of-chunks.
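The unit mismatch can be illustrated with a small sketch. The function name `resolve_max_memory_gb` is hypothetical, and the factor of 1024^3 bytes per Gb is an assumption (the note does not say whether binary or decimal units are used). How block-size and number-of-chunks are then derived is not shown, since the notes do not describe it.

```python
BYTES_PER_GB = 1024 ** 3  # assumption: binary gigabytes


def resolve_max_memory_gb(user_value_gb, available_bytes):
    """Return the memory limit in Gb, whichever way it was obtained."""
    if user_value_gb is not None:
        # --max-memory is already given in Gb.
        return float(user_value_gb)
    # The detected default comes in bytes and must be converted first.
    return available_bytes / BYTES_PER_GB
```

Both paths now return a value in the same unit, so later calculations see comparable numbers.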
- Python
Published by iquasere almost 3 years ago
upimapi - Important and nice options for homology search
Added control over DIAMOND search
--diamond-mode accepts six options (by increasing search time and sensitivity): fast, mid_sensitive, sensitive, more_sensitive, very_sensitive and ultra_sensitive.
It helps to dramatically decrease search times, and also to reduce memory usage and apparently disk usage as well (no idea why this one).
Added parameter for max memory
Set with --max-memory, read as float in Gb.
Allows DIAMOND parameters b and c to be calculated automatically.
Also two small bug fixes
Fixed the case where database was inputted with --skip-db-check and as a FASTA file - UPIMAPI would input the FASTA database directly to DIAMOND.
Fixed outputting days as float. Days don't float.
- Python
Published by iquasere about 3 years ago
upimapi - Added selection of mirror to download UniProt from
New parameter --mirror to determine where to download SwissProt and TrEMBL from. It allows the following options:
* expasy: https://ftp.expasy.org
* uniprot: https://ftp.uniprot.org/pub
* ebi: https://ftp.ebi.ac.uk/pub
More information at https://www.uniprot.org/help/downloads
- Python
Published by iquasere about 3 years ago
upimapi - UPIMAPI now run as "upimapi"
Changed the symbolic link from upimapi.py to upimapi.
Also changed TaxIDs database name to "taxids_database.fasta".
- Python
Published by iquasere about 3 years ago
upimapi - Fix when merging with previous ID mapping
Taxonomic columns were messing things up, and becoming repeated.
Now, it first produces these columns, and only then it merges with previous result.
Also fixes unpaired columns, as those become NAs.
- Python
Published by iquasere over 3 years ago
upimapi - Fixes on parsing taxonomy
Taxonomy parsing was not considering that some taxa have commas in their names. Now, it parses the Taxonomic lineage column fine.
Also set new default columns, concerning the seven most popular levels of taxonomy - Superkingdom, Phylum, Class, Order, Family, Genus, Species. These are extracted from the Taxonomic lineage and the Organism columns.
These are all exported as Taxonomic lineage (taxon level), as was previously the case in the old version of UniProt's API (the one that worked fine until it was ruined by idiotic development).
- Python
Published by iquasere over 3 years ago
upimapi - Correct handling of FASTA input when only ID mapping
Correct handling of FASTA input when only ID mapping
When inputting a FASTA file solely for ID mapping, UPIMAPI was not parsing the file correctly. It was not getting the IDs correctly (was retrieving the sequences alongside them) and trying to parse the IDs as "full IDs" was breaking UPIMAPI.
Now, UPIMAPI gets only the names of the sequences, and correctly parses them.
Removed unnecessary user input to check if annotation should be performed
Also removed unnecessary user input to check if annotation should be performed if the user inputs a FASTA file and specifies --no-annotation.
Users know what they want, and the default is to perform annotation. This was a leftover from when ID mapping was the main feature of UPIMAPI, and now is removed.
- Python
Published by iquasere over 3 years ago
upimapi - Accesses columns through API
No more need for apt-get packages! UPIMAPI now obtains available columns of the API through the API itself!
Also, it checks for valid and invalid columns, ignoring and reporting on the incorrect columns. Bit of an input sanitization.
- Python
Published by iquasere over 3 years ago
upimapi - Deal with dotted IDs
Dotted IDs (e.g. A1ZAI5.1) are identified by UniProt as valid IDs. However, mapping them will return a 400 error.
IDs are now split by the dot to return truly valid IDs (e.g. A1ZAI5).
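A minimal sketch of that split is shown below; the helper name `strip_version` is made up for illustration.

```python
def strip_version(identifier):
    """Return the part of the ID before the first dot (the ID itself if there is no dot)."""
    return identifier.split(".")[0]
```

So `strip_version("A1ZAI5.1")` gives `"A1ZAI5"`, and an ID without a dot is returned unchanged.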
- Python
Published by iquasere over 3 years ago
upimapi - Maaaaajor speed improvement
On taxonomy parsing for columns Taxonomic lineage and Taxonomic lineage IDs.
Changed to using pandas methods.
- Python
Published by iquasere over 3 years ago
upimapi - Removed testing artifact
The artifact was limiting ID mapping to only the first 1000 IDs.
- Python
Published by iquasere over 3 years ago
upimapi - Taxonomic lineage columns reestablished
Provides Taxonomic lineage and Taxonomic lineage IDs for all levels of taxonomy.
If some field of Taxonomic lineage or Taxonomic lineage IDs is specified, UPIMAPI will retrieve the corresponding column of ID mapping, i.e., Taxonomic lineage and Taxonomic lineage (IDs), respectively, and parse them to obtain the requested information.
E.g., if Taxonomic lineage (SPECIES) information was requested, UPIMAPI will search in the Taxonomic lineage column for some example of Species name (species), and retrieve the relevant information.
Taxonomic information that was not requested (other levels of taxonomy) is discarded.
This follows previous behavior of UPIMAPI (before version 1.8), and closes the adaptations that were necessary because of UniProt's update this year.
- Python
Published by iquasere over 3 years ago
upimapi - Fix on FASTA retrieval
Small (but vital) fix on retrieving FASTA result.
Also, changed the default number of threads to use all available.
- Python
Published by iquasere over 3 years ago
upimapi - Multiprocessed validation of IDs
UniProt's API can no longer handle more than 1000 IDs at a time, for any request. So now, the validation required before ID mapping is performed both multithreaded and in chunks.
- Multithreaded validation - list of IDs is split between the number of threads available
- Submission in chunks - IDs are split into chunks of 1000 IDs per request to UniProt
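The combination of chunking and threading might be sketched as follows. This is an illustration under assumptions: `validate_chunk` stands in for whatever function sends one request to UniProt (it is a hypothetical callable supplied by the caller), and the real implementation may distribute the work differently.

```python
from concurrent.futures import ThreadPoolExecutor


def chunk_ids(ids, size=1000):
    """Split the list of IDs into consecutive chunks of at most `size` IDs."""
    return [ids[i:i + size] for i in range(0, len(ids), size)]


def validate_in_chunks(ids, validate_chunk, threads=4, size=1000):
    """Run `validate_chunk` on every chunk, using up to `threads` workers."""
    chunks = chunk_ids(ids, size)
    with ThreadPoolExecutor(max_workers=threads) as pool:
        # map() keeps the results in the same order as the chunks.
        results = list(pool.map(validate_chunk, chunks))
    # Flatten the per-chunk results into a single list of valid IDs.
    return [valid_id for chunk_result in results for valid_id in chunk_result]
```

With 2500 IDs and the default chunk size, three requests would be made, carrying 1000, 1000 and 500 IDs.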
Also, if input is not FASTA, UPIMAPI never tries to do annotation
UPIMAPI will not prompt to skip annotation or not, it will assume it is not to be performed.
- Python
Published by iquasere almost 4 years ago
upimapi - Fix on default columns
Removed the last file from resources folders, default_columns.txt.
Integrated that information in the main script.
Also, IDs inputted as a TXT file can now include both commas and newlines.
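Reading IDs separated by commas, newlines or both could be done along these lines. The helper name `parse_ids` is hypothetical, and dropping empty tokens and surrounding whitespace is an assumption about the desired behavior.

```python
import re


def parse_ids(text):
    """Split on commas and newlines, dropping empty tokens and surrounding whitespace."""
    return [token.strip() for token in re.split(r"[,\n]", text) if token.strip()]
```

For instance, the text `"P12345,Q67890\nA1ZAI5\n"` yields three IDs.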
- Python
Published by iquasere almost 4 years ago
upimapi - Fixes for inputting custom columns
UPIMAPI now parses inputted columns properly.
Also forces Entry and Entry name fields to be present in the columns (it adds them if not set).
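A sketch of how the two fields could be forced into the list of columns follows. The helper name `with_required_columns` is made up, and placing the missing fields at the front is an assumption; the note only says they are added if not set.

```python
REQUIRED_COLUMNS = ["Entry", "Entry name"]


def with_required_columns(columns):
    """Return the columns with Entry and Entry name added when they are missing."""
    missing = [column for column in REQUIRED_COLUMNS if column not in columns]
    # Assumption: missing fields go first; existing columns keep their order.
    return missing + list(columns)
```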
- Python
Published by iquasere almost 4 years ago
upimapi - Several improvements in accordance to API updates
- Simplified validation of IDs - every page result is retrieved from a single POST request
- Databases are now obtained from return_fields
Also had to make some concessions to keep the tool in Bioconda
- Reformatted the web scraping so it is only run when needed (e.g., not when running upimapi.py -v)
- Removed installation of apt-get packages from meta.yaml
- Python
Published by iquasere almost 4 years ago
upimapi - Adjusted for new UniProt's API
Base URL is now rest.uniprot.org
Changed methods for accessing UniProt's API
- Retrieval of reference proteomes by taxonomy now through uniprotkb/stream
- Retrieval of UniProt information through uniprotkb/accessions
- New web scraping for composing dictionary of columns names -> id in request
- New dependencies for that: selenium, firefox, lxml, geckodriver, pyyaml
- Also requires apt packages for those new dependencies: packagekit-gtk3-module, libasound2, libdbus-glib-1-2, libx11-xcb1
- Added retrieval of API information, for now only for keeping base URL updated
Much information is no longer accessible
- Cross-references no longer available - databases field will be ignored for now
- Taxonomic lineage columns ignored - will fix it in the next version
Plus some minor alterations
- Now sorts results by E-value, not by % of identity, in the final report
- Relaxed versions of pandas and Biopython
- Removed SwissProt mapping from tests
- Python
Published by iquasere almost 4 years ago
upimapi - Final report is now sorted
by qseqid and pident columns - identifier of query sequences and percentage of identity between matches.
- Python
Published by iquasere almost 4 years ago
upimapi - Fix on obtaining consensus
Some out of the loop debauchery
- Python
Published by iquasere about 4 years ago
upimapi - Updated help, reflect CSV nature of ids.txt
Help message was still describing IDs inputted as txt as requiring a newline separated format. IDs should instead be inputted in comma-separated format.
fixes #1
- Python
Published by iquasere about 4 years ago
upimapi - Fix on --max-target-seqs
Parameter type was wrongly evaluated, so it failed when changed from the default. It is fixed now.
- Python
Published by iquasere about 4 years ago
upimapi - New option - "--skip-id-mapping"
--skip-id-mapping is useful for when ID mapping is not desired (e.g. when using a database not from UniProt)
Also, time of analysis is now reported with days (if it comes to that)
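Reporting elapsed time with a days field, and with every field as a whole number, could look like the sketch below. The function name `human_time` and the exact output format are assumptions for illustration only.

```python
def human_time(seconds):
    """Format a duration in seconds as [Nd]HHhMMmSSs using whole numbers only."""
    # divmod on integers keeps every field an integer, days included.
    days, remainder = divmod(int(seconds), 86400)
    hours, remainder = divmod(remainder, 3600)
    minutes, secs = divmod(remainder, 60)
    clock = f"{hours:02d}h{minutes:02d}m{secs:02d}s"
    # Days are only shown if it comes to that.
    return f"{days}d{clock}" if days else clock
```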
- Python
Published by iquasere about 4 years ago
upimapi - Consensus annotation now provided
Consensus annotation provides a simplified view of a BLAST annotation
Finds the relation of queries to reference sequences that best attributes unique identifications while minimizing the sum of E-values
Is only executed for --max-target-seqs > 1
Fix on renaming UniProt columns
Before, UniProt columns ending in [CC] were missing a space (e.g. Function[CC]). UPIMAPI used to fix that, but it has been fixed by UniProt. This alteration removes that previous fix.
Added option "skip-db-check"
When set, --skip-db-check skips checking if the FASTA database inputted exists, assumes it is already present, and doesn't replace it.
- Python
Published by iquasere about 4 years ago
upimapi - Default for local ID mapping set as False
Since it is still a beta feature, with one or two bugs to fix.
This functionality is working fine for almost all cases; however, mapping directly through the API will also be faster in most cases.
- Python
Published by iquasere over 4 years ago
upimapi - Implemented CI
CI jobs are:
- IDs inputted through TXT file
- ID mapping with SwissProt mapping
- IDs inputted through BLAST file
- Obtain FASTA sequences
- Full workflow, TaxIDs DB at Family level
Columns and databases now determined automatically, with web scraping
Columns from https://www.uniprot.org/help/uniprotkbcolumnnames
Databases from https://www.uniprot.org/docs/dbxref.txt
Fixes names of columns with Column name[CC] (adds space to get Column name [CC])
Removed uniprot_support.py
Default databases and columns now supplied in files
Executables and resources all go to "share" folder now
Tax IDs are now inputted as comma separated value
Much better for users
Also, some fixes
Fix on not inputting columns nor databases
Removed "KEGG Orthology (KO)" as default database
- Python
Published by iquasere over 4 years ago
upimapi - Local workflow now considers input columns and databases
By trimming the result from local ID mapping to only include those columns and databases.
Changed some default parameters
Removed "Taxonomic identifiers" as default columns:
* for "SUPERKINGDOM", "PHYLUM", "CLASS", "ORDER" and "FAMILY"
* but still kept it for "GENUS" and "SPECIES"
Increased the default number of IDs per request to 10000. Decreased default number of seconds to wait between requests to 3 seconds.
Local ID mapping now works from a TSV as well
As in the API workflow, it checks which IDs have already been mapped, and maps the remaining.
Some fixes in confusion of column names between web and local
Taxonomic lineage (SPECIES) is now available, among others. For now it doesn't print warning if it can't handle some part of description. Web and local results seem to align nicely now.
- Python
Published by iquasere over 4 years ago
upimapi - Almost all columns parsed from local ID mapping
Columns already implemented are presented in the following table:
| Column name from local mapping | Column name from API mapping |
|---|---|
| accessions | Entry (first one) |
| entryname | Entry name |
| dataclass | Status |
| sequencelength | Length |
| description | Protein names, EC number |
| sequence | Sequence |
| taxonomyid | Taxonomic identifier (SPECIES), Organism ID |
| hostorganism | Virus hosts (first one) |
| genename | Gene names, Gene names (ordered locus ), Gene names (ORF ), Gene names (primary ), Gene names (synonym ) |
| keywords | Keywords |
| organism | Organism |
| organismclassification | Taxonomic lineage (ALL LEVELS) |
| comments | Function [CC], Subunit structure [CC], Interacts with, Subcellular location [CC], Alternative products (isoforms), Tissue specificity, Post-translational modification, Polymorphism, Involvement in disease, Miscellaneous [CC], Sequence similarities, Caution, Sequence caution, Web resources (only available through local mapping), Mass spectrometry, RNA editing, Catalytic activity, Cofactor, Activity regulation, Pathway, Developmental stage, Induction, Allergenic properties, Biotechnological use, Disruption phenotype, Pharmaceutical use, Toxic dose, Domain [CC] |
| crossreferences | Cross-references (ALL DBS), Gene ontology (GO), Gene ontology IDs, Gene ontology (cellular component), Gene ontology (molecular function), Gene ontology (biological process) |
| created | Date of creation |
| annotationupdate | Date of last modification, Version (entry) |
| sequenceupdate | Date of last sequence modification, Version (sequence) |
| features | Alternative sequence, Natural variant, Non-adjacent residues, Non-standard residue, Non-terminal residue, Sequence conflict, Sequence uncertainty, Active site, Binding site, DNA binding, Metal binding, Nucleotide binding, Site, Intramembrane, Topological domain, Transmembrane, Chain, Cross-link, Disulfide bond, Glycosylation, Initiator methionine, Lipidation, Modified residue, Peptide, Propeptide, Signal peptide, Transit peptide, Beta strand, Helix, Turn, Coiled coil, Compositional bias, Domain [FT], Motif, Region, Repeat, Zinc finger, Mutagenesis, Calcium binding |
| organelle | Gene encoded by |
| seqinfo | Mass |
| references | PubMed ID |
| hosttaxonomyid | Virus hosts |
Information on columns already obtained, still not implemented, and about implementation, is available in the project page.
- Python
Published by iquasere over 4 years ago
upimapi - Implemented local mapping of SwissProt IDs
Makes use of data available in DAT format at UniProt's FTP.
It obtains information from the following columns:
| Column name from local mapping | Column name from API mapping |
|---|---|
| accessions | Entry (first one) |
| entryname | Entry name |
| dataclass | Status |
| sequencelength | Length |
| description | Protein names* |
| sequence | Sequence |
| taxonomyid | Taxonomic identifier (SPECIES) |
| hostorganism | Virus hosts (first one) |
| genename | Gene names (primary ) |
| keywords | Keywords |
| organism | Organism |
| organismclassification | Taxonomic lineage (ALL LEVELS) |
| comments | Function [CC], Subunit structure [CC], Interacts with, Subcellular location [CC], Alternative products (isoforms), Tissue specificity, Post-translational modification, Polymorphism, Involvement in disease, Miscellaneous [CC], Sequence similarities, Caution, Sequence caution, Web resources (only available through local mapping) |
| crossreferences | Cross-references (ALL DBS) |
| created | Date of creation |
| annotationupdate | Date of last modification, Version (entry) |
| sequenceupdate | Date of last sequence modification, Version (sequence) |
* Still requires some work
Still not implemented: molecule_type, organelle, host_taxonomy_id, references, features, protein_existence, seqinfo
- Python
Published by iquasere over 4 years ago
upimapi - Fix on input of "block size" and "index chunks"
--block-size and --index-chunks parameters are now read as int
- Python
Published by iquasere over 4 years ago
upimapi - "Taxonomic identifier" columns as default
"Taxonomic identifier" columns as default for easier integration with reCOGnizer
Added requests dependency
Fixed bug always reporting table override
- Python
Published by iquasere over 4 years ago
upimapi - New options for reference database
UPIMAPI now offers 4 options to automatically or manually build the reference database:
1. uniprot - UPIMAPI will download the entire UniProt and use it as reference
2. swissprot - UPIMAPI will download SwissProt and use it as reference
3. taxids - Reference proteomes will be downloaded for the taxa specified with the --taxids, and those will be used as reference
4. a custom database - Input will be considered as the database, and will be used as reference
Also no longer outputs results in EXCEL, never made sense.
- Python
Published by iquasere over 4 years ago
upimapi - Added sleep parameter
--sleep takes the time (in seconds) to wait between requests to UniProt's API
Backwards incompatibility notice: --annotation-columns and --annotation-databases renamed to --columns and --databases, respectively.
Also fixed some bugs in the new namespaces - "output_table" and "columns"
- Python
Published by iquasere over 4 years ago
upimapi - UPIMAPI now easier to handle
Added --output-table option:
* for specifying UniProt info table filename;
* overriding "output" parameter for that file.
Fixed bug in get_ids, now always checks for *
Expanded --full-id parameter, now accepts auto, true and false
* auto - default, if "|" is detected in IDs they are in "full" form
* true - IDs in "full" form
* false - IDs not in "full" form
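The three modes could be handled roughly as in the sketch below. The helper name `accession` is hypothetical, and deciding per ID (rather than over the whole input) is an assumption. Taking the second "|"-separated field follows the usual UniProt header layout, e.g. `sp|P12345|NAME`.

```python
def accession(identifier, full_id="auto"):
    """Extract the accession from an ID, depending on the --full-id mode."""
    if full_id == "auto":
        # auto: a "|" in the ID means it is in "full" form.
        is_full = "|" in identifier
    else:
        is_full = full_id == "true"
    # Full IDs look like db|ACCESSION|NAME; the accession is the middle field.
    return identifier.split("|")[1] if is_full else identifier
```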
progressbar replaced with tqdm for a more informative bar
- Python
Published by iquasere almost 5 years ago
upimapi - Main report outputted in TSV format
Now outputs UPIMAPI_results in TSV format besides EXCEL
- Python
Published by iquasere almost 5 years ago
upimapi - Added new parameters
--evalue: maximum E-value of matches
--pident: minimum percentage of identity
--bitscore: minimum bit score of matches
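The meaning of the three thresholds can be illustrated with a small filter over alignment hits. This is only a sketch of the semantics: UPIMAPI may well hand these values to the aligner instead of filtering itself, and the function name and dictionary keys (`evalue`, `pident`, `bitscore`, as in BLAST tabular output) are assumptions.

```python
def passes_thresholds(hit, max_evalue=None, min_pident=None, min_bitscore=None):
    """Return True if a hit satisfies every threshold that was set."""
    if max_evalue is not None and hit["evalue"] > max_evalue:
        return False  # E-value is a maximum
    if min_pident is not None and hit["pident"] < min_pident:
        return False  # percentage of identity is a minimum
    if min_bitscore is not None and hit["bitscore"] < min_bitscore:
        return False  # bit score is a minimum
    return True
```

A threshold left as `None` is simply not applied.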
- Python
Published by iquasere about 5 years ago
upimapi - Joins all information in a single report
Simplified the commands
* Removed --diamond-output: now output is folder where all is stored
* Removed --excel: now EXCEL report is created with all information, as well as TSV report just for UniProt info
* Columns and databases values are now inputted separated by &
Allows an easier connection to KEGGCharter and other tools
- Python
Published by iquasere about 5 years ago
upimapi - Output directories created automatically
Both the directory of --output and the parameter of --diamond-output will be created if not existent.
- Python
Published by iquasere about 5 years ago
upimapi - Max tries now apply to partial lists of IDs
Updated the get_uniprot_information method to take max_tries as input for more robust retrieval of information
* max tries still applies to whole workflow, so three times failed in one interval won't fail the whole workflow
New parameter - max-tries - to set the value of maximum tries for requests from the CLI
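A retry loop bounded by a maximum number of tries might be sketched as follows. This is not the get_uniprot_information method itself: `request_with_retries` and its `fetch` argument (any callable performing one request) are hypothetical names, and waiting between attempts is an assumption.

```python
import time


def request_with_retries(fetch, max_tries=3, sleep=0.0):
    """Call `fetch` until it succeeds, giving up after `max_tries` attempts."""
    last_error = None
    for attempt in range(1, max_tries + 1):
        try:
            return fetch()
        except Exception as error:
            last_error = error
            if attempt < max_tries:
                time.sleep(sleep)  # wait before the next attempt
    raise RuntimeError(f"Request failed after {max_tries} tries") from last_error
```

Applying such a loop to each partial list of IDs means one failing interval exhausts only its own tries.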
- Python
Published by iquasere over 5 years ago
upimapi - Unmapped IDs append to file
Unmapped IDs no longer overwrite previously written unmapped IDs; instead, they are appended to the file.
- Python
Published by iquasere over 5 years ago
upimapi - Now with DIAMOND annotation!
UPIMAPI can now be used to perform annotation with DIAMOND, with several improvements:
* threads, block size and number of index chunks can be inputted, or UPIMAPI will automatically determine best values for them
* output from DIAMOND follows directly to UniProt's ID mapping
Also fixed a bug in checking previous mapping with same file name
- Python
Published by iquasere over 5 years ago
upimapi - Fixed unmapped IDs
- when writing to working directory, it failed
- also, now can handle empty input file
- Python
Published by iquasere almost 6 years ago
upimapi - UPIMAPI now reads input through the command line!
Added the option to read from input the IDs to be mapped, might be nice to open that possibility. Also fixed some bugs related to EXCEL/TSV reading and writing. Now install script also mentions the EXCEL plugins for pandas, required for EXCEL I/O.
- Python
Published by iquasere about 6 years ago
upimapi - Improvements for integration into Conda
Changed header of main script
Added version info
- Python
Published by iquasere about 6 years ago
upimapi - Several improvements for integration in pipelines
UPIMAPI now:
* properly handles BLAST input
* accepts '' as input for columns and databases
* handles tr|XXX|XXX IDs if specified with the --full-id parameter
- Python
Published by iquasere about 6 years ago
upimapi - UPIMAPI 1.0
A tool for mapping UniProt IDs through the API made available by the good friends at EBI. It allows obtaining information regarding proteins' names & taxonomy, sequences, function, interactions, expression, gene ontology, pathology & biotech, subcellular location, PTM / processing, structure, publications, dates of publications, family & domains, taxonomic lineage, taxonomic identifier, and cross-references to 169 databases! UPIMAPI can get all this information with a simple input of UniProt IDs or a BLAST file, and export results to TSV and EXCEL format! UPIMAPI can also instead obtain the sequences of proteins corresponding to the IDs, and export results in FASTA format.
- Python
Published by iquasere about 6 years ago