https://github.com/bigdatabiology/gmsc-api

Science Score: 13.0%

This score indicates how likely this project is to be science-related based on various indicators:

○ CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
○ .zenodo.json file
○ DOI references
○ Academic publication links
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (5.6%) to scientific vocabulary

Keywords

bioinformatics gmsc small-proteins

Last synced: 10 months ago

Repository

Basic Info

Host: GitHub
Owner: BigDataBiology
Language: Python
Default Branch: main
Homepage:
Size: 64.5 KB

Statistics

Stars: 0
Watchers: 1
Forks: 2
Open Issues: 0
Releases: 0

Topics

bioinformatics gmsc small-proteins

Created about 3 years ago · Last pushed about 2 years ago

Metadata Files

Readme

GMSC API

API Endpoints

https://{{base_url}}/v1/seq-info/{{gmsc_id}}

Where {{gmsc_id}} is of the form GMSC10.100AA.xxx_xxx_xxx or GMSC10.90AA.xxx_xxx_xxx.

Returns

```json
{
  "id": "GMSC10.xxAA.xxx_xxx_xxxx",
  "nucleotide": "ATC...",
  "aminoacid": "MAV...",
  "taxonomy": "s__Bacteroides_vulgatus",
  "habitat": "human gut",
  "quality": {
    "antifam": true,
    "terminal": true,
    "rnacode": 0.9,
    "metat": 1,
    "metap": 1,
    "riboseq": 0.9
  }
}
```

Note that the quality field is only present for 90AA sequences.
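A minimal sketch of calling this endpoint from Python with only the standard library. The names `BASE_URL`, `seq_info_url`, `fetch_seq_info`, and `quality_of` are illustrative and not part of this project. The base URL assumes the local test server described under Install & Testing, and the digit-group pattern is inferred from the example identifiers in this README rather than from a formal specification.

```python
import json
import re
import urllib.request

# Assumption: the local test server; replace with the real base URL.
BASE_URL = "http://127.0.0.1:5000"

# Pattern inferred from the example identifiers (three groups of three digits).
GMSC_ID_RE = re.compile(r"^GMSC10\.(100|90)AA\.\d{3}_\d{3}_\d{3}$")


def seq_info_url(gmsc_id, base_url=BASE_URL):
    """Build the seq-info URL, rejecting identifiers of an unexpected form."""
    if not GMSC_ID_RE.match(gmsc_id):
        raise ValueError(f"unexpected GMSC identifier: {gmsc_id!r}")
    return f"{base_url}/v1/seq-info/{gmsc_id}"


def fetch_seq_info(gmsc_id, base_url=BASE_URL):
    """GET the record and decode the JSON body (requires a running server)."""
    with urllib.request.urlopen(seq_info_url(gmsc_id, base_url)) as resp:
        return json.load(resp)


def quality_of(record):
    """Return the quality dict, or None when the field is absent (100AA records)."""
    return record.get("quality")
```

Using `record.get("quality")` rather than `record["quality"]` keeps the same code path working for both 100AA and 90AA records.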

https://{{base_url}}/v1/seq-info-multi/

This is a POST-only endpoint, expecting a JSON payload consisting of a dictionary with an entry seq_ids, which is a list of strings (identifiers). For example:

```json
{
  "seq_ids": [
    "GMSC10.90AA.123_456_789",
    "GMSC10.90AA.123_456_790",
    ...
  ]
}
```

Returns a list of entries like the outputs of seq-info.
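The request described above could be built in Python roughly as follows. This is a sketch under assumptions: `BASE_URL`, `build_multi_request`, and `fetch_multi` are hypothetical names, the base URL is the local test server, and the `Content-Type: application/json` header is assumed to be what the server expects for a JSON payload.

```python
import json
import urllib.request

# Assumption: the local test server; replace with the real base URL.
BASE_URL = "http://127.0.0.1:5000"


def build_multi_request(seq_ids, base_url=BASE_URL):
    """Build a POST request whose JSON body is {"seq_ids": [...]}."""
    body = json.dumps({"seq_ids": list(seq_ids)}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/v1/seq-info-multi/",
        data=body,
        headers={"Content-Type": "application/json"},  # assumed header
        method="POST",
    )


def fetch_multi(seq_ids, base_url=BASE_URL):
    """Send the request and decode the list of records (needs a running server)."""
    with urllib.request.urlopen(build_multi_request(seq_ids, base_url)) as resp:
        return json.load(resp)
```

Keeping request construction separate from sending makes the payload easy to inspect without a running server.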

https://{{base_url}}/v1/seq-filter/

POST endpoint, with arguments:

hq_only: boolean. optional (only active for 90AA)
habitat: str. mandatory
taxonomy: str. optional
quality_antifam: boolean. optional
quality_terminal: boolean. optional
quality_rnacode: float. optional
quality_metat: integer. optional
quality_metap: integer. optional
quality_riboseq: float. optional

habitat is treated as a comma-separated list (e.g., marine,freshwater matches all the entries that are present in both marine and freshwater).

taxonomy is a substring match so you can pass any taxonomic level (e.g., passing o__Pelagibacterales will match d__Bacteria;p__Proteobacteria;c__Alphaproteobacteria;o__Pelagibacterales;f__Pelagibacteraceae;g__AAA240-E13).
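The two matching rules can be illustrated with a small client-side sketch. This is not the server's implementation, only a reading of the rules as described here: every requested habitat must be present, and taxonomy is a plain substring test. The function name `matches` and the sample record are made up for illustration.

```python
def matches(record, habitat, taxonomy=None):
    """Client-side illustration of the habitat and taxonomy rules."""
    wanted = {h.strip() for h in habitat.split(",") if h.strip()}
    present = {h.strip() for h in record["habitat"].split(",")}
    if not wanted <= present:  # every requested habitat must be present
        return False
    # taxonomy is a plain substring test against the full lineage string
    return taxonomy is None or taxonomy in record["taxonomy"]


# Hypothetical record, shaped like the entries this endpoint returns.
record = {
    "habitat": "marine,freshwater,sediment",
    "taxonomy": "d__Bacteria;p__Proteobacteria;c__Alphaproteobacteria;"
                "o__Pelagibacterales;f__Pelagibacteraceae;g__AAA240-E13",
}

both = matches(record, "marine,freshwater")          # both habitats present
order = matches(record, "marine", "o__Pelagibacterales")  # order-level match
soil = matches(record, "marine,soil")                # soil is missing
```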

Returns

```json
{
  "status": "Ok",
  "results": [
    {
      "habitat": "marine,plant associated,sediment",
      "seq_id": "GMSC10.90AA.000_013_322",
      "taxonomy": "d__Bacteria"
    },
    ....
  ]
}
```

At most 1,001 entries are returned.
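A sketch of assembling the seq-filter arguments in Python. The helper `build_filter_args` is hypothetical. It enforces the mandatory habitat argument and rejects argument names not listed above. The README does not state how the POST body is encoded, so the form encoding in the last line is an assumption, not a documented requirement.

```python
import urllib.parse

# Optional argument names as listed for this endpoint.
OPTIONAL_ARGS = (
    "hq_only", "taxonomy", "quality_antifam", "quality_terminal",
    "quality_rnacode", "quality_metat", "quality_metap", "quality_riboseq",
)


def build_filter_args(habitat, **optional):
    """Return the seq-filter arguments as a dict, checking argument names."""
    if not habitat:
        raise ValueError("habitat is mandatory")
    unknown = set(optional) - set(OPTIONAL_ARGS)
    if unknown:
        raise ValueError(f"unknown arguments: {sorted(unknown)}")
    # Accept either a ready-made string or a list of habitats.
    if not isinstance(habitat, str):
        habitat = ",".join(habitat)
    args = {"habitat": habitat}
    # Leave out optional arguments that were not given a value.
    args.update({k: v for k, v in optional.items() if v is not None})
    return args


args = build_filter_args(
    ["marine", "freshwater"],
    taxonomy="o__Pelagibacterales",
    quality_rnacode=0.9,
)
# Assumption: form encoding; the README does not say how the body is encoded.
body = urllib.parse.urlencode(args).encode("utf-8")
```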

https://{{base_url}}/v1/cluster-info/{{gmsc_90AA_id}}

Returns the membership of the given cluster. At most 20 results are thick (meaning that metadata is also returned). For the rest, only identifiers are returned. Example output:

```json
{
  "status": "Ok",
  "cluster": [
    {
      "aminoacid": "MAAAGFLIVSFKPFEKPSRNAATTAGFSAENFEFTMIALPYSLRP",
      "habitat": "soil",
      "nucleotide": "ATGGCCGCGGCCGGATTCTTGATCGTGTCCTTCAAGCCTTTCGAGAAGCCTTCGAGAAACGCCGCGACGACGGCCGGCTTCTCGGCCGAGAATTTCGAGTTCACGATGATCGCGCTGCCGTACAGCTTGAGACCGTAA",
      "seq_id": "GMSC10.100AA.547_444_661",
      "taxonomy": "d__Bacteria;p__Proteobacteria;c__Alphaproteobacteria;o__Rhizobiales;f__Xanthobacteraceae;g__VAZQ01;s__VAZQ01 sp005883115"
    },
    ...
  ]
}
```
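Because only some cluster members carry metadata, client code may want to separate the two kinds. The sketch below shows one way to do that. The README does not show what an identifier-only entry looks like, so the function `split_cluster` assumes it is either a bare string or an object containing only `seq_id`; adjust it once the real shape is known.

```python
def split_cluster(members):
    """Separate thick entries (with metadata) from identifier-only entries."""
    thick, thin_ids = [], []
    for member in members:
        # Assumption: a thin entry is a bare string or a dict with only "seq_id".
        if isinstance(member, str):
            thin_ids.append(member)
        elif set(member) == {"seq_id"}:
            thin_ids.append(member["seq_id"])
        else:
            thick.append(member)
    return thick, thin_ids


# Hypothetical cluster listing with one thick and two thin members.
members = [
    {"seq_id": "GMSC10.100AA.547_444_661", "habitat": "soil"},
    {"seq_id": "GMSC10.100AA.547_444_662"},
    "GMSC10.100AA.547_444_663",
]
thick, thin_ids = split_cluster(members)
```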

Sequence search interface (non-public interface)

NOTE. These endpoints are not recommended for public use. For large-scale analyses, we recommend that you use the GMSC-mapper command-line tool locally. Public API endpoints will be maintained for the long term. No such commitment is made for endpoints marked internal. You have been warned.

https://{{base_url}}/internal/seq-search (POST)

Arguments:

sequence_faa: FASTA formatted set of sequences
is_contigs: bool (when True, inputs are assumed to be DNA contigs)

Returns

```json
{
  "status": "message (normally 'Ok')",
  "search_id": "xxxxx"
}
```
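A sketch of preparing a search submission from Python. The names `to_fasta` and `build_search_request` are hypothetical. Two things are assumed rather than documented: that the server accepts URL-encoded form fields as well as the multipart form that the curl example under Install & Testing sends, and that the trailing slash used in that curl example is the right path. The `is_contigs` argument is left out because the README does not show how the boolean is encoded.

```python
import urllib.parse
import urllib.request

# Assumption: the local test server; replace with the real base URL.
BASE_URL = "http://127.0.0.1:5000"


def to_fasta(sequences):
    """Format a {name: sequence} mapping as FASTA text."""
    return "".join(f">{name}\n{seq}\n" for name, seq in sequences.items())


def build_search_request(fasta_text, base_url=BASE_URL):
    """Build the POST request carrying the FASTA text in sequence_faa."""
    # Assumption: form fields are accepted in URL-encoded form.
    body = urllib.parse.urlencode({"sequence_faa": fasta_text}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/internal/seq-search/", data=body, method="POST"
    )


request = build_search_request(to_fasta({"query_1": "MHEDVIQFARNEVWSLV"}))
```

Passing `request` to `urllib.request.urlopen` would submit the search to a running server.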

https://{{base_url}}/internal/seq-search/{{search_id}}

Returns

```json
{
  "search_id": "str",
  "status": "str",
  "results": [
    {
      "query_id": "query_1",
      "aminoacid": "MHEDVIQFARNEVWSLV....",
      "taxonomy": "s__Bacteroides_vulgatus",
      "habitat": "human gut",
      "hits": [
        {
          "id": "GMSC10.xxAA.xxx_xxx_xxxx",
          "e_value": "2.1e-23",
          "aminoacid": "MHEELIQFARNEV...",
          "identity": "98.4"
        },
        ...
      ]
    },
    ...
  ]
}
```

status will be one of Running (if the results are not yet ready), Done, or Expired. In the case of Done, the results field will be filled in.
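Since results are not available immediately, a client has to poll. The sketch below shows one possible polling loop over the three statuses. The function `wait_for_results`, its parameters, and the default interval are illustrative choices, not part of this project. The HTTP call is passed in as `fetch_status` so that the loop can be exercised without a server.

```python
import time


def wait_for_results(fetch_status, search_id, interval=2.0, max_tries=30):
    """Poll until the search is Done or Expired.

    fetch_status is any callable that takes a search ID and returns the
    decoded JSON reply of /internal/seq-search/{search_id}.
    """
    for _ in range(max_tries):
        reply = fetch_status(search_id)
        status = reply["status"]
        if status == "Done":
            return reply["results"]
        if status == "Expired":
            raise RuntimeError(f"search {search_id} expired")
        if status != "Running":
            raise RuntimeError(f"unexpected status: {status!r}")
        time.sleep(interval)  # still running; wait before asking again
    raise TimeoutError(f"search {search_id} still running after {max_tries} polls")
```

Raising on Expired, rather than returning an empty list, keeps an expired search distinguishable from a search with no results.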

Install & Testing

Dependencies

flask
numpy
pandas
polars

Running this (in test mode) can be done with:

```bash
python -m flask run
```

Testing can be done with curl:

```bash
curl http://127.0.0.1:5000/v1/seq-info/GMSC10.100AA.000_000_002
```

These examples assume you are running the test version on http://127.0.0.1:5000/. Adapt as necessary.

Searching requires using POST and a FASTA file. For example, if you have the file example.faa, you can use:

```bash
curl -X POST --form "sequence_faa=$(cat example.faa)" http://127.0.0.1:5000/internal/seq-search/
```

The output will look something like this:

```json
{"search_id":"1-jmgi","status":"Ok"}
```

You can later use the given ID (in this case 1-jmgi, but it will be different every time the app runs) to retrieve the results:

```bash
curl http://127.0.0.1:5000/internal/seq-search/1-jmgi
```

Results will look like one of the following:

{"search_id":"1-jmgi","status":"Running"}
{"search_id":"1-jmgi","status":"Done", "results":[...]}
{"search_id":"1-jmgi","status":"Expired"}

Search IDs are of the form #-xxxx, where # is just an index counting up and xxxx is a random string.
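A small sketch that splits a search ID into its two parts, which can be handy for logging or sanity checks. The function `parse_search_id` is hypothetical. The README does not specify the alphabet or length of the random part, so only the numeric index and the presence of a non-empty suffix are checked.

```python
def parse_search_id(search_id):
    """Split an ID of the form #-xxxx into (index, random_part)."""
    index, _, token = search_id.partition("-")
    if not index.isdigit() or not token:
        raise ValueError(f"unexpected search ID: {search_id!r}")
    return int(index), token


index, token = parse_search_id("1-jmgi")
```

Since the IDs differ on every run, client code should use the ID returned by the POST request instead of a hard-coded value.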

Indexing

Indexing is done by the make-indices.py Jug script. It expects FASTA and other files to be present in the gsmc-db subdirectory.
