betadescribe-code

Protein2Text: Providing Rich Descriptions from Protein Sequences

https://github.com/technion-cs-nlp/betadescribe-code

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (9.7%) to scientific vocabulary

Keywords

bioinformatics deep-learning nlp protein-annotations protein2text transformers

Repository

Basic Info
Statistics
  • Stars: 2
  • Watchers: 0
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Created over 1 year ago · Last pushed 8 months ago
Metadata Files
Readme Citation Support

README.md

Protein2Text: Providing Rich Descriptions from Protein Sequences

Abstract:

Understanding the functionality of proteins has been a focal point of biological research due to their critical roles in various biological processes. Unraveling protein functions is essential for advancements in medicine, agriculture, and biotechnology, enabling the development of targeted therapies, engineered crops, and novel biomaterials. However, this endeavor is challenging due to the complex nature of proteins, requiring sophisticated experimental designs and extended timelines to uncover their specific functions. Public large language models (LLMs), though proficient in natural language processing, struggle with biological sequences due to the unique and intricate nature of biochemical data. These models often fail to accurately interpret and predict the functional and structural properties of proteins, limiting their utility in bioinformatics. To address this gap, we introduce BetaDescribe, a collection of models designed to generate detailed and rich textual descriptions of proteins, encompassing properties such as function, catalytic activity, involvement in specific metabolic pathways, subcellular localizations, and the presence of particular domains. The trained BetaDescribe model receives protein sequences as input and outputs a textual description of these properties. BetaDescribe’s starting point was the LLAMA2 model, which was trained on trillions of tokens. Next, we trained our model on datasets containing both biological and English text, allowing biological knowledge to be incorporated. We demonstrate the utility of BetaDescribe by providing descriptions for proteins that share little to no sequence similarity to proteins with functional descriptions in public datasets. We also show that BetaDescribe can be harnessed to conduct in-silico mutagenesis procedures to identify regions important for protein functionality without needing homologous sequences for the inference. 
Altogether, BetaDescribe offers a powerful tool to explore protein functionality, augmenting existing approaches such as annotation transfer based on sequence or structure similarity.

BetaDescribe workflow. The generator processes the protein sequences and creates multiple candidate descriptions. Independently, the validators provide simple textual properties of the protein. The judge receives the candidate descriptions (from the generator) and the predicted properties (from the validators) and rejects or accepts each description. Finally, BetaDescribe provides up to three alternative descriptions for each protein.
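The workflow above can be sketched as a small data flow. This is only an illustration under stated assumptions: `Candidate`, `select_descriptions`, and the `score` field are hypothetical names, the three components are passed in as plain callables rather than real models, and ranking accepted candidates by a numeric score is an assumption, since the text does not say how the final descriptions are chosen.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Candidate:
    """One candidate description (hypothetical container)."""
    text: str
    score: float  # assumed ranking score; not specified in the README


def select_descriptions(
    sequence: str,
    generate: Callable[[str], List[Candidate]],
    validators: Dict[str, Callable[[str], str]],
    judge: Callable[[str, Dict[str, str]], bool],
    max_results: int = 3,
) -> List[Candidate]:
    # 1. The generator proposes several candidate descriptions.
    candidates = generate(sequence)
    # 2. Independently, each validator predicts one simple property.
    properties = {name: predict(sequence) for name, predict in validators.items()}
    # 3. The judge accepts or rejects each candidate given those properties.
    accepted = [c for c in candidates if judge(c.text, properties)]
    # 4. Keep at most `max_results` accepted candidates (assumed: best score first).
    accepted.sort(key=lambda c: c.score, reverse=True)
    return accepted[:max_results]
```

Passing the components as callables keeps the sketch independent of any particular model or API; in the actual repository these roles are filled by the three scripts described below.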

Examples of descriptions of unknown proteins:

SnRV-Env:

Sequence: MKLVLLFSLSVLLGTSVGRILEIPETNQTRTVQVRKGQLVQLTCPQLPPPQGTGVLIWGRNKRTGGGALDFNGVLTVPVGDNENTYQCMWCQNTTSKNAPRQKRSLRNQPTEWHLHMCGPPGDYICIWTNKKPVCTTYHEGQDTYSLGTHRKVLPKVTEACAVGQPPQIPGTYVASSKGWTMFNKFEVHSYPANVTQIKTNRTLHDVTLWWCHDNSIWRCTQMGFIHPHQGRRIQLGDGTRFRDGLYVIVSNHGDHHTVQHYMLGSGYTVPVSTATRVQMQKIGPGEWKIATSMVGLCLDEWEIECTGFCSGPPPCSLSITQQQDTVGGSYDSWNGCFVKSIHTPVMALNLWWRRSCKGLPEATGMVKIYYPDQFEIAPWMRPQPRQPKLILPFTVAPKYRRQRRGLNPSTTPDYYTNEDYSGSGGWEINDEWEYIPPTVKPTTPSVEFIQKVTTPRQDKLTTVLSRNKRGVNIASSGNSWKAEIDEIRKQKWQKCYFSGKLRIKGTDYEEIDTCPKPLIGPLSGFIPTGVTKTLKTGVTWTTAVVKIDLQQWVDILNSTCKDTLIGKHWIKVIQRLLREYQKTGVTFNLPQVQSLPNWETKNKDNPGHHIPKSRRKRIRRGLGEALGLGNFADNRWKDLQIAGLGVEQQKLMGLTREATFEAWNALKGISNELIKWEEDMVATLRQLLLQIKGTNTTLCSAMGPLMATNIQQIMFALQHGNLPEMSYSNPVLKEIAKQYNGQMLGVPVETTGNNLGIMLSLPTGGENIGRAVAVYDMGVRHNRTLYLDPNARWIHNHTEKSNPKGWVTIVDLSKCVETTGTIYCNEHGFRDRKFTKGPSELVQHLAGNTWCLNSGTWSSLKNETLYVSGRNCSFSLTSRRRPVCFHLNSTAQWRGHVLPFVSNSQEAPNTEIWEGLIEEAIREHNKVQDILTKLEQQHQNWKQNTDNALQNMKDAIDSMDNNMLTFRYEYTQYGLFIVCLLAFLFAVIFGWLCGVTVRLREVFTILSVKIHALKSQAHQLAMLRGLRDPETGEQDRQAPAYREPPTYQEWARRRGGRPPIVTFLIDRETGERHDGQIFQPIRNRSNQVHRPQPPRPTAPNPDNQRPIREPRPEEPEHGDFLQGASWMWQ

Description: FUNCTION$ The leader peptide is a component of released, infectious virions and is required for particle budding, & The transmembrane protein (TM) acts as a class I viral fusion protein. Under the current model, the protein has at least 3 conformational states: pre-fusion native state, pre-hairpin intermediate state, and post-fusion hairpin state. During viral and target cell membrane fusion, the coiled coil regions (heptad repeats) assume a trimer-of-hairpins structure, positioning the fusion peptide in close proximity to the C-terminal region of the ectodomain. The formation of this structure appears to drive apposition and subsequent fusion of viral and target cell membranes. Membranes fusion leads to delivery of the nucleocapsid into the cytoplasm, SUBCELLULAR LOCATION$ Endoplasmic reticulum membrane.

TGV-S:

Sequence: MISGHTLCMLVLFYLYSYSNAQHELQLNPTTYHWLNCATSDCKSWQACPSTQATTCVSFSYTGLAWHKQDNTIIGYSNFTSQSLYDTISYTFAPSYVLSHAMTNLEPQKLCSLKSTIQSFHGFTPADCCLNPSASPACSYFSTGDTSFITGTPYQCTASYYGYGSPYGTDCEPYFASVSPYGTSVTPSGDVFTNFGEKSVHTYDCFYENWARYRPAPYTNNPSDPRWNLCHSIYYYVWTLSDTNHQFTTVESEPGDKVIMKQLSSHTPVYLTLGGWTSNNTVLYQAISSRRLDTIAMLRDLHDNYGVTGVCIDFEFIGGSNQYSNIFLLDWVPDLLSFLSSVRLEFGPSYYITFVGLAVGSHFLPTIYQQIDPLIDAWLISGYDLHGDWEVKATQQAALVDDPKSDFPTYSLFTSVDNMLAITTPDKIILGLPQYTRGVYTSLTGSTTGPYPPTTPMCPTPPACGTDIVISTSHGEIPSTHDTTKGDIIIEDPSQPKFYISKGSRNGRTFNHFFMNSTTASHIRSTLQPKGITRWYSYASSMNLQTNTNFKTALLSQSRKARQLSTYYKYPAPAGSGVTSCPGIVVFTDTFVVTTTAYAGSHALPLLDGNFYSPRSTFTCSPGFSTLMPTTTTRCSGIDPSNLLPSDSSSVSIVCPDMTFFGAKIAICASSTTTSKPTHLQLEVSTSIEGQFQFNSLPIYSQHKVSTTSFSVPYKCINFTPIPSCISSVCGSSHSCVTKLQESPASYACQSAAAIAIVYNNTLDLVKRSQTTTELLFNQVVLESSKFGVVTHTRQTRGLFGILSITSLIMSGVALATSSSALYVSIKNQAELSSLRNDVNSKFTTIDQNFDQITSKFNHLSTTTSDAFIAQSNINTQLQSSINQLQENLEVLSNFVTTQLSSVSSSITQLSEAIDALSDQVNYLAYLTSGISSYTSRLTSVTVQATNTAVKFSTLQSHLSNCLTSLQQQSFTGCIHKSGNIIPLKVVYTPFGNTRYLSFIYAEAELLGYQQYKSALSYCDQNFLYSSSPGCFFLLNGSSIDHRSSLSAACPTPATVVSMSCQNVTLDLSSQSIVRPYVFPLLNLTLPTPVKTNISFTPGKAPVFQNITQIDQTLLLDLAQQLQAIQLQLNPVGPISTSSFSPVVIALTVISAVVFLAVTSIVIYMLCKTAPFKPSRKTA

Descriptions: 1. FUNCTION$ Envelope glycoprotein that forms spikes at the surface of virion envelope. Essential for the initial attachment to heparan sulfate moities of the host cell surface proteoglycans. Involved in fusion of viral and cellular membranes leading to virus entry into the host cell. Following initial binding to its host receptors, membrane fusion is mediated by the fusion machinery composed at least of gB and the heterodimer gH/gL. May be involved in the fusion between the virion envelope and the outer nuclear membrane during virion egress, SUBCELLULAR LOCATION$ Virion membrane, SUBUNIT$ Homotrimer; disulfide-linked. Binds to heparan sulfate proteoglycans. Interacts with gH/gL heterodimer, SIMILARITY$ Belongs to the herpesviridae glycoprotein B family.

  2. FUNCTION$ The surface protein (SU) attaches the virus to the host cell by binding to its receptor. This interaction triggers the refolding of the transmembrane protein (TM) and is thought to activate its fusogenic potential by unmasking its fusion peptide. Fusion occurs at the host cell plasma membrane, & The transmembrane protein (TM) acts as a class I viral fusion protein. Under the current model, the protein has at least 3 conformational states: pre-fusion native state, pre-hairpin intermediate state, and post-fusion hairpin state. During viral and target cell membrane fusion, the coiled coil regions (heptad repeats) assume a trimer-of-hairpins structure, positioning the fusion peptide in close proximity to the C-terminal region of the ectodomain. The formation of this structure appears to drive apposition and subsequent fusion of viral and target cell membranes. Membranes fusion leads to delivery of the nucleocapsid into the cytoplasm, SUBCELLULAR LOCATION$ Cell membrane. SUBUNIT$ The mature envelope protein (Env) consists of a trimer of SU-TM heterodimers attached by noncovalent interactions or by a labile interchain disulfide bond

Protein 1 (TiLV virus):

Sequence: MWAFQEGVCKGNLLSGPTSMKAPDSAARESLDRASEIMTGKSYNAVHTGDLSKLPNQGESPLRIVDSDLYSERSCCWVIEKEGRVVCKSTTLTRGMTGLLNTTRCSSPSELICKVLTVESLSEKIGDTSVEELLSHGRYFKCALRDQERGKPKSRAIFLSHPFFRLLSSVVETHARSVLSKVSAVYTATASAEQRAMMAAQVVESRKHVLNGDCTKYNEAIDADTLLKVWDAIGMGSIGVMLAYMVRRKCVLIKDTLVECPGGMLMGMFNATATLALQGTTDRFLSFSDDFITSFNSPAELREIEDLLFASCHNLSLKKSYISVASLEINSCTLTRDGDLATGLGCTAGVPFRGPLVTLKQTAAMLSGAVDSGVMPFHSAERLFQIKQQECAYRYNNPTYTTRNEDFLPTCLGGKTVISFQSLLTWDCHPFWYQVHPDGPDTIDQKVLSVLASKTRRRRTRLEALSDLDPLVPHRLLVSESDVSKIRAARQAHLKSLGLEQPTNFNYAIYKAVQPTAGC

Description: FUNCTION$ Probably involved in the RNA silencing pathway and required for the generation of small interfering RNAs (siRNAs), CATALYTIC ACTIVITY$ a ribonucleoside 5'-triphosphate + RNA(n) = diphosphate + RNA(n+1), SIMILARITY$ Belongs to the RdRP family.

Protein 2 (TiLV virus):

Sequence: MSQFGKSFKGRTEVTITEYRSHTVKDVHRSLLTADKSLRKSFCFRNALNQFLDKDLPLLPIRPKLESRVAVKKSKLRSQLSFRPGLTQEEAIDLYNKGYDGDSVSGALQDRVVNEPVAYSSADNDKFHRGLAALGYTLADRAFDTCESGFVRAIPTTPCGFICCGPGSFKDSLGFVIKIGEFWHMYDGFQHFVAVEDAKFLASKSPSFWLAKRLAKRLNLVPKEDPSIAAAECPCRKVWEASFARAPTALDPFGGRAFCDQGWVYHRDVGYATANHISQETLFQQALSVRNLGPQGSANVSGSIHTALDRLRAAYSRGTPASRSILQGLANLITPVGENFECDLDKRKLNIKALRSPERYITIEGLVVNLDDVVRGFYLDKAKVTVLSRSKWMGYEDLPQKPPNGTFYCRKRKAMLLISCSPGTYAKKRKVAVQEDRFKDMRVENFREVAENMDLNQ

Description: FUNCTION$ DNA-dependent RNA polymerase catalyzes the transcription of DNA into RNA using the four ribonucleoside triphosphates as substrates, CATALYTIC ACTIVITY$ a ribonucleoside 5'-triphosphate + RNA(n) = diphosphate + RNA(n+1), SIMILARITY$ Belongs to the RNA polymerase beta' chain family.

Protein 3 (TiLV virus):

Sequence: MDSRFAQLTGVFCDDFTYSEGSRRFLSSYSTVERRPGVPVEGDCYDCLKNKWIAFELEGQPRKFPKATVRCILNNDATYVCSEQEYQQICKVQFKDYLEIDGVVKVGHKASYDAELRERLLELPHPKSGPKPRIEWVAPPRLADISKETAELKRQYGFFECSKFLACGEECGLDQEARELILNEYARDREFEFRNGGWIQRYTVASHKPATQKILPLPASAPLARELLMLIARSTTQAGKVLHSDNTSILAVPVMRDSGKHSKRRPTASTHHLVVGLSKPGCEHDFEFDGYRAAVHVMHLDPKQSANIGEQDFVSTREIYKLDMLELPPISRKGDLDRASGLETRWDVILLLECLDSTRVSQAVAQHFNRHRLALSVCKDEFRKGYQLASEIRGTIPLSSLYYSLCAVRLRMTVHPFAR

Descriptions: 1. FUNCTION$ DNA-dependent RNA polymerase catalyzes the transcription of DNA into RNA using the four ribonucleoside triphosphates as substrates. Specific core component of RNA polymerase III which synthesizes small RNAs, such as 5S rRNA and tRNAs, SUBCELLULAR LOCATION$ Nucleus, SUBUNIT$ Component of the RNA polymerase III (Pol III) complex consisting of 17 subunits, SIMILARITY$ Belongs to the eukaryotic RPC3/POLR3C RNA polymerase subunit family.

  2. FUNCTION$ Decapping enzyme for NAD-capped RNAs: specifically hydrolyzes the nicotinamide adenine dinucleotide (NAD) cap from a subset of RNAs by removing the entire NAD moiety from the 5'-end of an NAD-capped RNA, SUBCELLULAR LOCATION$ Nucleus, COFACTOR$ a divalent metal cation, SIMILARITY$ Belongs to the DXO/Dom3Z family.
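The example descriptions above appear to follow a flat layout in which each field starts with an upper-case label followed by `$` (for example `FUNCTION$`, `SUBCELLULAR LOCATION$`, `SIMILARITY$`). The following is a minimal sketch for splitting such a string into fields. The layout is inferred from the examples only and is not a documented format, and `parse_description` is a hypothetical helper that is not part of the repository.

```python
import re
from typing import Dict

# A label is a run of upper-case words directly followed by '$'
# (assumption inferred from the example outputs above).
FIELD_LABEL = re.compile(r"([A-Z][A-Z ]*[A-Z])\$\s*")


def parse_description(text: str) -> Dict[str, str]:
    """Split a 'LABEL$ text, LABEL$ text' description into a dict."""
    fields: Dict[str, str] = {}
    matches = list(FIELD_LABEL.finditer(text))
    for index, match in enumerate(matches):
        # A field's text runs until the next label (or the end of the string).
        end = matches[index + 1].start() if index + 1 < len(matches) else len(text)
        value = text[match.end():end].strip()
        # Drop the comma that separates this field from the next one.
        fields[match.group(1)] = value.rstrip(",").strip()
    return fields


example = (
    "FUNCTION$ Probably involved in the RNA silencing pathway and required "
    "for the generation of small interfering RNAs (siRNAs), "
    "CATALYTIC ACTIVITY$ a ribonucleoside 5'-triphosphate + RNA(n) = "
    "diphosphate + RNA(n+1), SIMILARITY$ Belongs to the RdRP family."
)
for label, value in parse_description(example).items():
    print(f"{label}: {value}")
```

A repeated label would overwrite the earlier value in this sketch; collecting values in a list per label is a simple alternative if that matters.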

Scripts:

1: Use the provided script 1_run_models.py to generate candidate descriptions.

2: Use the provided script 2_reject_alternatives.py to reject invalid candidates.

3: Use the provided script 3_find_optimals.py to receive the three best descriptions.

supports_files:

Contains the support files used for predictions (the id2label and label2id JSON files referenced by the flags below).

1_run_models.py:

Generates the descriptions for proteins. The validators also predict the protein's origin, its subcellular location, and whether it is an enzyme.

Flags (1_run_models):

  • --protein_sequence (str, required): Input protein sequence.
  • --protein_name (str, optional): Input protein name. Default is 'protein'.
  • --id2label_path_cell_location (str, required): Path to cell location id2label JSON file (see supports_files).
  • --label2id_path_cell_location (str, required): Path to cell location label2id JSON file (see supports_files).
  • --model_path_cell_location (str, required): Path to cell location model.
  • --id2label_path_origin (str, required): Path to origin id2label JSON file (see supports_files).
  • --label2id_path_origin (str, required): Path to origin label2id JSON file (see supports_files).
  • --model_path_origin (str, required): Path to origin model.
  • --id2label_path_enzymes (str, required): Path to enzymes id2label JSON file (see supports_files).
  • --label2id_path_enzymes (str, required): Path to enzymes label2id JSON file (see supports_files).
  • --model_path_enzymes (str, required): Path to enzymes model.
  • --base_model (str, required): Path to base model.
  • --working_dir (str, required): Path to save predictions.
  • --temperature (float, optional): Temperature for generation. Default is 1.0.
  • --num_of_descriptions (int, optional): Number of descriptions per prompt. Default is 15.
  • --max_sequence_length (int, optional): Maximum number of tokens per generation. Default is 1024.
  • --validators_results_name (str, optional): Validators file name. Default is 'validators_results'.
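As a rough illustration of what the id2label files are for, the sketch below loads such a file and maps a predicted class index to a label. It assumes the files follow the common convention of a JSON object whose keys are class ids stored as strings; the actual file contents are not shown here, so the labels in the example are hypothetical, as are the helper names.

```python
import json
from typing import Dict


def normalize_id2label(raw: Dict[str, str]) -> Dict[int, str]:
    """JSON object keys are always strings, so convert them back to ints."""
    return {int(class_id): label for class_id, label in raw.items()}


def load_id2label(path: str) -> Dict[int, str]:
    """Read an id2label JSON file (assumed layout: {"0": "label", ...})."""
    with open(path, encoding="utf-8") as handle:
        return normalize_id2label(json.load(handle))


# Hypothetical labels, for illustration only.
id2label = normalize_id2label({"0": "Nucleus", "1": "Cytoplasm"})
predicted_class = 1
print(id2label[predicted_class])
```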

Examples (1_run_models):

python 1_run_models.py \
  --protein_sequence "MSEQNNTEMTFQIQRIYTKDISFEAPNAPHVFQAKGNRITRSSDLAQELNAQVDWLTLSPLTLLHSNLADLSMKMLQEEGESYQEVPSPTFLGNEPISTVPVPPTQPSTTGLVNADGNSNNLALNDNFAVICNQRQSDMVKKRAVFESGAGEIGSKQLSRSILAVVEFLTEGDLHFSVFYNHEGYQFSNTHGGGEIRKLQNVNAELSHVGKDYQEDYAAEYSRVMERNYQSEIAPHLVGNNTLVQDYIKSIKKDVKGDQWRQAAKPNDAILWLKDNKYHPFAGPLSYNNLSSLMVELSYYIPDRLEESY" \
  --protein_name "example_protein" \
  --id2label_path_cell_location "/path/to/cell_location/id2label.json" \
  --label2id_path_cell_location "/path/to/cell_location/label2id.json" \
  --model_path_cell_location "/path/to/cell_location/model" \
  --id2label_path_origin "/path/to/origin/id2label.json" \
  --label2id_path_origin "/path/to/origin/label2id.json" \
  --model_path_origin "/path/to/origin/model" \
  --id2label_path_enzymes "/path/to/enzymes/id2label.json" \
  --label2id_path_enzymes "/path/to/enzymes/label2id.json" \
  --model_path_enzymes "/path/to/enzymes/model" \
  --base_model "/path/to/base/model" \
  --working_dir "/path/to/working/directory" \
  --temperature 0.8 \
  --num_of_descriptions 10 \
  --max_sequence_length 1024 \
  --validators_results_name "my_validators_results"

python 1_run_models.py \
  --protein_sequence "MTMKMRILLTLALLALTLASPIRTLSSGLAYFETYLHIGYKTRNSEASKQQQQPPQPPPLIRAGAGLGGQFLVVATDGDGVNDSPHGDQMAKAVRKNGETGPESQERDNTAVRILMVEKEFSSLKCDEYFSTCIVTASDENGVSKYGLKPTHFLFVQISSSGLVAVDPVYANGGHTYVSGAGCISYRVGYRLPVGCVTLLTGYIGSGEITDGKKVVTIKTNWHSKTVVFYEDGSPSLLSKTPVFVNSDGVYHGNITVPFLKREAFHFLQSLPEFGTSHVLPVTWELRVKGVKEGECGSMTIRKVSFVHYPDPTITWVQTVLMQGYPGPSYHRPSTIQIRNLNFKLLEKTNVEVTYGSLAIAECAVMIRGIKNVNEVEPLTTVVSKTAPVPFPFHQKLNQTTVSDTHLEVT" \
  --protein_name "example_protein" \
  --id2label_path_cell_location "/path/to/cell_location/id2label.json" \
  --label2id_path_cell_location "/path/to/cell_location/label2id.json" \
  --model_path_cell_location "/path/to/cell_location/model" \
  --id2label_path_origin "/path/to/origin/id2label.json" \
  --label2id_path_origin "/path/to/origin/label2id.json" \
  --model_path_origin "/path/to/origin/model" \
  --id2label_path_enzymes "/path/to/enzymes/id2label.json" \
  --label2id_path_enzymes "/path/to/enzymes/label2id.json" \
  --model_path_enzymes "/path/to/enzymes/model" \
  --base_model "/path/to/base/model" \
  --working_dir "/path/to/working/directory" \
  --temperature 1.2 \
  --num_of_descriptions 20 \
  --max_sequence_length 2048 \
  --validators_results_name "custom_validators_results"
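Because the sequence is passed as a single command-line string, it can help to sanity-check it before starting a long model load. The sketch below is one possible pre-check, not part of the repository: `clean_protein_sequence` is a hypothetical helper, and restricting input to the 20 standard one-letter amino-acid codes is an assumption, since the README does not say how the scripts treat ambiguous codes such as X or B.

```python
# The 20 standard amino acids in one-letter code.
STANDARD_AMINO_ACIDS = frozenset("ACDEFGHIKLMNPQRSTVWY")


def clean_protein_sequence(raw: str) -> str:
    """Strip whitespace, upper-case, and reject non-standard residue codes."""
    sequence = "".join(raw.split()).upper()
    if not sequence:
        raise ValueError("empty protein sequence")
    unexpected = sorted(set(sequence) - STANDARD_AMINO_ACIDS)
    if unexpected:
        raise ValueError(f"unexpected residue codes: {', '.join(unexpected)}")
    return sequence


print(clean_protein_sequence("mkl vll\nfsl"))
```

The cleaned string can then be passed to `--protein_sequence`.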

2_reject_alternatives.py

For each candidate description, uses the validators' predictions to check whether the description is valid (the check is performed with ChatGPT).

Flags (2_reject_alternatives):

  • --protein_name: Input protein name (default: 'protein')
  • --working_dir: Path to save predictions (required)
  • --results_file_name: Results file name (default: 'rejection_summary')
  • --validators_results_name: Validators file name (default: 'validators_results')
  • --chat_gpt_api_key: ChatGPT API key (required)

Examples (2_reject_alternatives):

python 2_reject_alternatives.py \
  --protein_name "example_protein" \
  --working_dir "/path/to/working/directory" \
  --results_file_name "rejection_summary" \
  --validators_results_name "validators_results" \
  --chat_gpt_api_key "your_chatgpt_api_key"

3_find_optimals.py

Processes the valid descriptions and returns the three optimal ones.

Flags (3_find_optimals):

  • --protein_name: Input protein name (default: 'protein')
  • --working_dir: Path to save predictions (required)
  • --rejection_results_file_name: Rejection results file name (default: 'rejection_summary')
  • --optimal_results_file_name: Optimal descriptions file name (default: 'optimal_results')

Examples (3_find_optimals):

python 3_find_optimals.py \
  --protein_name "example_protein" \
  --working_dir "/path/to/working/directory" \
  --rejection_results_file_name "rejection_summary" \
  --optimal_results_file_name "optimal_results"
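The three scripts share `--protein_name` and `--working_dir`, so they can be chained from a small driver. The sketch below only builds the argument lists, using the script names and flags documented above; `build_pipeline_commands` and the `model_args` parameter are hypothetical names, and nothing here inspects what the scripts actually write to the working directory.

```python
from typing import Dict, List


def build_pipeline_commands(
    protein_name: str,
    protein_sequence: str,
    working_dir: str,
    chat_gpt_api_key: str,
    model_args: Dict[str, str],
) -> List[List[str]]:
    """Return argv lists for the three pipeline steps, in execution order."""
    shared = ["--protein_name", protein_name, "--working_dir", working_dir]

    # Step 1: generate candidates and run the validators.
    step1 = ["python", "1_run_models.py", "--protein_sequence", protein_sequence] + shared
    # Remaining step-1 flags (model and id2label/label2id paths), keyed by
    # flag name without the leading dashes, e.g. {"base_model": "..."}.
    for flag, value in model_args.items():
        step1 += [f"--{flag}", value]

    # Step 2: let the judge accept or reject each candidate.
    step2 = ["python", "2_reject_alternatives.py"] + shared + [
        "--chat_gpt_api_key", chat_gpt_api_key,
    ]

    # Step 3: keep the three best accepted descriptions.
    step3 = ["python", "3_find_optimals.py"] + shared
    return [step1, step2, step3]
```

Each list can be handed to `subprocess.run(cmd, check=True)` from the repository directory, which stops the chain as soon as one step fails.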

Install BetaDescribe:

BetaDescribe uses ChatGPT to accept or reject descriptions, so a ChatGPT API key (CHAT_GPT_API_KEY) is required. The installation process should take a few minutes.

```
export BETADESCRIBE_DIR="<PATH_TO_SAVE_BETADESCRIBE>"
cd $BETADESCRIBE_DIR
git clone https://github.com/technion-cs-nlp/BetaDescribe-code

mkdir $BETADESCRIBE_DIR/pythonvenv/
conda create -y -p $BETADESCRIBE_DIR/pythonvenv/BetaDescribe python=3.11
export PIP_CACHE_DIR=$BETADESCRIBE_DIR/pythonvenv/
conda activate $BETADESCRIBE_DIR/pythonvenv/BetaDescribe
pip install -r $BETADESCRIBE_DIR/BetaDescribe-code/requirements.txt
```

Run BetaDescribe pipeline:

We recommend using a GPU to reduce inference time. Loading the model onto the GPU typically takes a few minutes (and can take up to an hour in some cases), while inference for a single protein usually takes a few seconds (up to a minute). We tested the code and installation on:

  • a single NVIDIA L40S GPU with 46G of memory (CUDA version 12.6)
  • a single NVIDIA RTX A6000 GPU with 48G of memory (CUDA version 12.8)

```
export CHAT_GPT_API_KEY="<YOUR_TOKEN_ID>"
export BETADESCRIBE_DIR=""  # set to the directory used during installation

export PROTEIN_SEQUENCE="MWAFQEGVCKGNLLSGPTSMKAPDSAARESLDRASEIMTGKSYNAVHTGDLSKLPNQGESPLRIVDSDLYSERSCCWVIEKEGRVVCKSTTLTRGMTGLLNTTRCSSPSELICKVLTVESLSEKIGDTSVEELLSHGRYFKCALRDQERGKPKSRAIFLSHPFFRLLSSVVETHARSVLSKVSAVYTATASAEQRAMMAAQVVESRKHVLNGDCTKYNEAIDADTLLKVWDAIGMGSIGVMLAYMVRRKCVLIKDTLVECPGGMLMGMFNATATLALQGTTDRFLSFSDDFITSFNSPAELREIEDLLFASCHNLSLKKSYISVASLEINSCTLTRDGDLATGLGCTAGVPFRGPLVTLKQTAAMLSGAVDSGVMPFHSAERLFQIKQQECAYRYNNPTYTTRNEDFLPTCLGGKTVISFQSLLTWDCHPFWYQVHPDGPDTIDQKVLSVLASKTRRRRTRLEALSDLDPLVPHRLLVSESDVSKIRAARQAHLKSLGLEQPTNFNYAIYKAVQPTAGC"
export PROTEIN_NAME="protein1"

conda activate $BETADESCRIBE_DIR/pythonvenv/BetaDescribe
export HF_HOME=$BETADESCRIBE_DIR/pythonvenv/
export PYTHON_CODE_DIR=$BETADESCRIBE_DIR/BetaDescribe-code

# Validator label maps and models (check supports_files/ for the exact file names)
export ID2LABEL_PATH_CELL_LOCATION="$PYTHON_CODE_DIR/supports_files/id2labelcelllocation.json"
export LABEL2ID_PATH_CELL_LOCATION="$PYTHON_CODE_DIR/supports_files/label2idcelllocation.json"
export MODEL_PATH_CELL_LOCATION="dotan1111/BetaDescribe-Validator-SubcellularLocalization"
export ID2LABEL_PATH_ORIGIN="$PYTHON_CODE_DIR/supports_files/id2labellevel0origin.json"
export LABEL2ID_PATH_ORIGIN="$PYTHON_CODE_DIR/supports_files/label2idlevel0origin.json"
export MODEL_PATH_ORIGIN="dotan1111/BetaDescribe-Validator-HigherLevelTaxonomy"
export ID2LABEL_PATH_ENZYMES="$PYTHON_CODE_DIR/supports_files/id2labelenzyme.json"
export LABEL2ID_PATH_ENZYMES="$PYTHON_CODE_DIR/supports_files/label2idenzyme.json"
export MODEL_PATH_ENZYMES="dotan1111/BetaDescribe-Validator-EnzymaticActivity"

export BASE_MODEL="dotan1111/BetaDescribe-TheGenerator"
export WORKING_DIR="$PYTHON_CODE_DIR/testing/$PROTEIN_NAME"

cd $PYTHON_CODE_DIR

# Step 1: generate candidate descriptions and run the validators
python "1_run_models.py" \
  --protein_sequence $PROTEIN_SEQUENCE \
  --protein_name $PROTEIN_NAME \
  --id2label_path_cell_location $ID2LABEL_PATH_CELL_LOCATION \
  --label2id_path_cell_location $LABEL2ID_PATH_CELL_LOCATION \
  --model_path_cell_location $MODEL_PATH_CELL_LOCATION \
  --id2label_path_origin $ID2LABEL_PATH_ORIGIN \
  --label2id_path_origin $LABEL2ID_PATH_ORIGIN \
  --model_path_origin $MODEL_PATH_ORIGIN \
  --id2label_path_enzymes $ID2LABEL_PATH_ENZYMES \
  --label2id_path_enzymes $LABEL2ID_PATH_ENZYMES \
  --model_path_enzymes $MODEL_PATH_ENZYMES \
  --base_model $BASE_MODEL \
  --working_dir $WORKING_DIR

# Step 2: accept or reject each candidate
python "2_reject_alternatives.py" \
  --protein_name $PROTEIN_NAME \
  --working_dir $WORKING_DIR \
  --chat_gpt_api_key $CHAT_GPT_API_KEY

# Step 3: keep the three best descriptions
python "3_find_optimals.py" \
  --protein_name $PROTEIN_NAME \
  --working_dir $WORKING_DIR
```

Owner

  • Name: technion-cs-nlp
  • Login: technion-cs-nlp
  • Kind: organization

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this repo, please cite it as below."
authors:
- family-names: "Dotan"
  given-names: "Edo"
- family-names: "Lyubman"
  given-names: "Iris"
- family-names: "Bacharach"
  given-names: "Eran"
- family-names: "Pupko"
  given-names: "Tal"
  orcid: "https://orcid.org/0000-0001-9463-2575"
- family-names: "Belinkov"
  given-names: "Yonatan"
title: "Protein2Text: Providing Rich Descriptions for Protein Sequences"
version: 1.0.0
doi: 10.1101/2024.12.04.626777
date-released: 2024-12-07
url: "https://github.com/technion-cs-nlp/BetaDescribe-code"
preferred-citation: 
  type: article
  authors:
  - family-names: "Dotan"
    given-names: "Edo"
  - family-names: "Lyubman"
    given-names: "Iris"
  - family-names: "Bacharach"
    given-names: "Eran"
  - family-names: "Pupko"
    given-names: "Tal"
    orcid: "https://orcid.org/0000-0001-9463-2575"
  - family-names: "Belinkov"
    given-names: "Yonatan"
  doi: "10.1101/2024.12.04.626777"
  journal: "bioRxiv"
  month: 12
  title: "Protein2Text: Providing Rich Descriptions for Protein Sequences"
  year: 2024

GitHub Events

Total
  • Watch event: 2
  • Push event: 8
  • Public event: 1
Last Year
  • Watch event: 2
  • Push event: 8
  • Public event: 1

Dependencies

requirements.txt pypi
  • accelerate =0.21.0=pyhd8ed1ab_0
  • aiohttp =3.8.5=py311h459d7ec_0
  • aiosignal =1.3.1=pyhd8ed1ab_0
  • annotated-types =0.5.0=pyhd8ed1ab_0
  • anyio =4.4.0=pyhd8ed1ab_0
  • aom =3.5.0=h27087fc_0
  • arrow =1.3.0=pyhd8ed1ab_0
  • async-timeout =4.0.2=pyhd8ed1ab_0
  • asyncio =3.4.3=pypi_0
  • attrs =23.1.0=pyh71513ae_1
  • aws-c-auth =0.7.0=hf8751d9_2
  • aws-c-cal =0.6.0=h93469e0_0
  • aws-c-common =0.8.23=hd590300_0
  • aws-c-compression =0.2.17=h862ab75_1
  • aws-c-event-stream =0.3.1=h9599702_1
  • aws-c-http =0.7.11=hbe98c3e_0
  • aws-c-io =0.13.28=h3870b5a_0
  • aws-c-mqtt =0.8.14=h2e270ba_2
  • aws-c-s3 =0.3.13=heb0bb06_2
  • aws-c-sdkutils =0.1.11=h862ab75_1
  • aws-checksums =0.1.16=h862ab75_1
  • aws-crt-cpp =0.20.3=he9c0e7f_4
  • aws-sdk-cpp =1.10.57=hbc2ea52_17
  • binaryornot =0.4.4=py_1
  • blas =1.0=mkl
  • brotli =1.0.9=h9c3ff4c_4
  • brotli-python =1.0.9=py311ha362b79_9
  • bzip2 =1.0.8=h7f98852_4
  • c-ares =1.19.1=hd590300_0
  • ca-certificates =2024.7.4=hbcca054_0
  • cairo =1.16.0=h35add3b_1015
  • captum =0.7.0=0
  • certifi =2024.7.4=pyhd8ed1ab_0
  • chardet =5.2.0=py311h38be061_1
  • charset-normalizer =3.2.0=pyhd8ed1ab_0
  • click =8.1.6=unix_pyh707e725_0
  • colorama =0.4.6=pyhd8ed1ab_0
  • contourpy =1.2.1=py311h9547e67_0
  • cookiecutter =2.6.0=pyhca7485f_0
  • cuda-cudart =11.8.89=0
  • cuda-cupti =11.8.87=0
  • cuda-libraries =11.8.0=0
  • cuda-nvrtc =11.8.89=0
  • cuda-nvtx =11.8.86=0
  • cuda-runtime =11.8.0=0
  • cuda-version =12.0=hffde075_2
  • cycler =0.12.1=pyhd8ed1ab_0
  • dataclasses =0.8=pyhc8e2a94_3
  • datasets =2.13.1=pyhd8ed1ab_0
  • dav1d =1.2.1=hd590300_0
  • deepspeed =0.13.2=pypi_0
  • dill =0.3.6=pyhd8ed1ab_1
  • distro =1.9.0=pyhd8ed1ab_0
  • einops =0.7.0=pypi_0
  • evaluate =0.4.1=pyhd8ed1ab_0
  • exceptiongroup =1.2.2=pyhd8ed1ab_0
  • expat =2.5.0=hcb278e6_1
  • ffmpeg =6.0.0=gpl_hdbbbd96_103
  • filelock =3.12.2=pyhd8ed1ab_0
  • flash-attn =2.5.3=pypi_0
  • font-ttf-dejavu-sans-mono =2.37=hab24e00_0
  • font-ttf-inconsolata =3.000=h77eed37_0
  • font-ttf-source-code-pro =2.038=h77eed37_0
  • font-ttf-ubuntu =0.83=hab24e00_0
  • fontconfig =2.14.2=h14ed4e7_0
  • fonts-conda-ecosystem =1=0
  • fonts-conda-forge =1=0
  • fonttools =4.53.1=py311h61187de_0
  • freetype =2.12.1=hca18f0e_1
  • fribidi =1.0.10=h36c2ea0_0
  • frozenlist =1.4.0=py311h459d7ec_0
  • fsspec =2023.6.0=pyh1a96a4e_0
  • gettext =0.21.1=h27087fc_0
  • gflags =2.2.2=he1b5a44_1004
  • glog =0.6.0=h6f12383_0
  • gmp =6.2.1=h58526e2_0
  • gmpy2 =2.1.2=py311h6a5fa03_1
  • gnutls =3.7.8=hf3e180e_0
  • graphite2 =1.3.13=h58526e2_1001
  • h11 =0.14.0=pyhd8ed1ab_0
  • h2 =4.1.0=pyhd8ed1ab_0
  • harfbuzz =7.3.0=hdb3a94d_0
  • hjson-py =3.1.0=pyhd8ed1ab_0
  • hpack =4.0.0=pyh9f0ad1d_0
  • httpcore =1.0.5=pyhd8ed1ab_0
  • httpx =0.27.0=pyhd8ed1ab_0
  • huggingface-hub =0.24.3=pypi_0
  • huggingface_hub =0.16.4=pyhd8ed1ab_0
  • hyperframe =6.0.1=pyhd8ed1ab_0
  • icu =72.1=hcb278e6_0
  • idna =3.4=pyhd8ed1ab_0
  • ijson =3.2.3=pyhd8ed1ab_0
  • importlib-metadata =6.8.0=pyha770c72_0
  • importlib_metadata =6.8.0=hd8ed1ab_0
  • jinja2 =3.1.2=pyhd8ed1ab_1
  • joblib =1.3.0=pyhd8ed1ab_1
  • jpeg =9e=h0b41bf4_3
  • keyutils =1.6.1=h166bdaf_0
  • kiwisolver =1.4.5=py311h9547e67_1
  • krb5 =1.21.1=h659d440_0
  • lame =3.100=h166bdaf_1003
  • lcms2 =2.15=hfd0df8a_0
  • ld_impl_linux-64 =2.40=h41732ed_0
  • lerc =4.0.0=h27087fc_0
  • libabseil =20230125.3=cxx17_h59595ed_0
  • libaio =0.9.3=pypi_0
  • libarrow =12.0.1=h657c46f_5_cpu
  • libass =0.17.1=hc9aadba_0
  • libbrotlicommon =1.0.9=h166bdaf_9
  • libbrotlidec =1.0.9=h166bdaf_9
  • libbrotlienc =1.0.9=h166bdaf_9
  • libcrc32c =1.1.2=h9c3ff4c_0
  • libcublas =11.11.3.6=0
  • libcufft =10.9.0.58=0
  • libcufile =1.5.0.59=hcb278e6_0
  • libcurand =10.3.1.50=hcb278e6_0
  • libcurl =8.2.0=hca28451_0
  • libcusolver =11.4.1.48=0
  • libcusparse =11.7.5.86=0
  • libdeflate =1.17=h0b41bf4_0
  • libdrm =2.4.114=h166bdaf_0
  • libedit =3.1.20191231=he28a2e2_2
  • libev =4.33=h516909a_1
  • libevent =2.1.12=hf998b51_1
  • libexpat =2.5.0=hcb278e6_1
  • libffi =3.4.2=h7f98852_5
  • libgcc-ng =13.1.0=he5830b7_0
  • libglib =2.76.4=hebfc3b9_0
  • libgoogle-cloud =2.12.0=h840a212_1
  • libgrpc =1.56.2=h3905398_0
  • libhwloc =2.9.1=nocuda_h7313eea_6
  • libiconv =1.17=h166bdaf_0
  • libidn2 =2.3.4=h166bdaf_0
  • libnghttp2 =1.52.0=h61bc06f_0
  • libnpp =11.8.0.86=0
  • libnsl =2.0.0=h7f98852_0
  • libnuma =2.0.16=h0b41bf4_1
  • libnvjpeg =11.9.0.86=0
  • libopus =1.3.1=h7f98852_1
  • libpciaccess =0.17=h166bdaf_0
  • libpng =1.6.39=h753d276_0
  • libprotobuf =4.23.3=hd1fb520_0
  • libsentencepiece =0.1.99=h28b9611_1
  • libsqlite =3.42.0=h2797004_0
  • libssh2 =1.11.0=h0841786_0
  • libstdcxx-ng =13.1.0=hfd8a6a1_0
  • libtasn1 =4.19.0=h166bdaf_0
  • libthrift =0.18.1=h8fd135c_2
  • libtiff =4.5.0=h6adf6a1_2
  • libunistring =0.9.10=h7f98852_0
  • libutf8proc =2.8.0=h166bdaf_0
  • libuuid =2.38.1=h0b41bf4_0
  • libva =2.18.0=h0b41bf4_0
  • libvpx =1.13.0=hcb278e6_0
  • libwebp-base =1.3.1=hd590300_0
  • libxcb =1.13=h7f98852_1004
  • libxml2 =2.11.4=h0d562d8_0
  • libzlib =1.2.13=hd590300_5
  • llvm-openmp =16.0.6=h4dfa4b3_0
  • lz4-c =1.9.4=hcb278e6_0
  • markdown-it-py =3.0.0=pyhd8ed1ab_0
  • markupsafe =2.1.3=py311h459d7ec_0
  • matplotlib-base =3.8.0=py311h54ef318_1
  • mdurl =0.1.2=pyhd8ed1ab_0
  • mkl =2023.1.0=h84fe81f_48680
  • mkl-service =2.4.0=py311h5eee18b_1
  • mkl_fft =1.3.6=py311ha02d727_1
  • mkl_random =1.2.2=py311ha02d727_1
  • mpc =1.3.1=hfe3b2da_0
  • mpfr =4.2.0=hb012696_0
  • mpmath =1.3.0=pyhd8ed1ab_0
  • multidict =6.0.4=py311h2582759_0
  • multiprocess =0.70.15=py311h459d7ec_0
  • munkres =1.1.4=pyh9f0ad1d_0
  • ncurses =6.4=hcb278e6_0
  • nettle =3.8.1=hc379101_1
  • networkx =3.1=pyhd8ed1ab_0
  • numpy =1.25.0=py311h08b1b3b_0
  • numpy-base =1.25.0=py311hf175353_0
  • openai =1.37.1=pyhd8ed1ab_0
  • openh264 =2.3.1=hcb278e6_2
  • openjpeg =2.5.0=hfec8fc6_2
  • openssl =3.3.1=h4bc722e_2
  • orc =1.9.0=h385abfd_1
  • p11-kit =0.24.1=hc5aa10d_0
  • packaging =23.1=pyhd8ed1ab_0
  • pandas =2.0.3=py311h320fe9a_1
  • pcre2 =10.40=hc3806b6_0
  • peft =0.3.0=pyhd8ed1ab_0
  • pillow =9.4.0=py311h50def17_1
  • pip =23.2.1=pyhd8ed1ab_0
  • pixman =0.40.0=h36c2ea0_0
  • portalocker =2.10.1=py311h38be061_0
  • psutil =5.9.5=py311h2582759_0
  • pthread-stubs =0.4=h36c2ea0_1001
  • py-cpuinfo =9.0.0=pyhd8ed1ab_0
  • pyarrow =12.0.1=py311h39c9aba_5_cpu
  • pydantic =2.1.1=pyhd8ed1ab_0
  • pydantic-core =2.4.0=py311h46250e7_0
  • pygments =2.18.0=pyhd8ed1ab_0
  • pynvml =11.5.0=pypi_0
  • pyparsing =3.1.2=pyhd8ed1ab_0
  • pysocks =1.7.1=pyha2e5f31_6
  • python =3.11.4=hab00c5b_0_cpython
  • python-dateutil =2.8.2=pyhd8ed1ab_0
  • python-slugify =8.0.4=pyhd8ed1ab_0
  • python-tzdata =2023.3=pyhd8ed1ab_0
  • python-xxhash =3.2.0=py311h2582759_0
  • python_abi =3.11=3_cp311
  • pytorch =2.0.1=py3.11_cuda11.8_cudnn8.7.0_0
  • pytorch-cuda =11.8=h7e8668a_5
  • pytorch-mutex =1.0=cuda
  • pytz =2023.3=pyhd8ed1ab_0
  • pyyaml =6.0.1=pypi_0
  • re2 =2023.03.02=h8c504da_0
  • readline =8.2=h8228510_1
  • regex =2023.6.3=py311h459d7ec_0
  • requests =2.31.0=pyhd8ed1ab_0
  • responses =0.18.0=pyhd8ed1ab_0
  • rich =13.7.1=pyhd8ed1ab_0
  • s2n =1.3.46=h06160fa_0
  • sacrebleu =2.0.0=pyhd3eb1b0_1
  • sacremoses =0.0.53=pyhd8ed1ab_0
  • safetensors =0.4.3=pypi_0
  • sentencepiece =0.1.99=h38be061_1
  • sentencepiece-python =0.1.99=py311hf03188e_1
  • sentencepiece-spm =0.1.99=h28b9611_1
  • setuptools =68.0.0=pyhd8ed1ab_0
  • six =1.16.0=pyh6c4a22f_0
  • snappy =1.1.10=h9fff704_0
  • sniffio =1.3.1=pyhd8ed1ab_0
  • svt-av1 =1.6.0=h59595ed_0
  • sympy =1.12=pypyh9d50eac_103
  • tabulate =0.9.0=pyhd8ed1ab_1
  • tbb =2021.9.0=hf52228f_0
  • text-unidecode =1.3=pyhd8ed1ab_1
  • tk =8.6.12=h27826a3_0
  • tokenizers =0.19.1=pypi_0
  • torchaudio =2.0.2=py311_cu118
  • torchtriton =2.0.0=py311
  • torchvision =0.15.2=py311_cu118
  • tqdm =4.65.0=pyhd8ed1ab_1
  • transformers =4.43.3=pypi_0
  • types-python-dateutil =2.9.0.20240316=pyhd8ed1ab_0
  • typing =3.10.0.0=pyhd8ed1ab_1
  • typing-extensions =4.7.1=hd8ed1ab_0
  • typing_extensions =4.7.1=pyha770c72_0
  • tzdata =2023c=h71feb2d_0
  • ucx =1.14.1=h0aa22dc_1
  • urllib3 =2.0.4=pyhd8ed1ab_0
  • wheel =0.41.0=pyhd8ed1ab_0
  • x264 =1
  • x265 =3.5=h924138e_3
  • xorg-fixesproto =5.0=h7f98852_1002
  • xorg-kbproto =1.0.7=h7f98852_1002
  • xorg-libice =1.1.1=hd590300_0
  • xorg-libsm =1.2.4=h7391055_0
  • xorg-libx11 =1.8.4=h0b41bf4_0
  • xorg-libxau =1.0.11=hd590300_0
  • xorg-libxdmcp =1.1.3=h7f98852_0
  • xorg-libxext =1.3.4=h0b41bf4_2
  • xorg-libxfixes =5.0.3=h7f98852_1004
  • xorg-libxrender =0.9.10=h7f98852_1003
  • xorg-renderproto =0.11.1=h7f98852_1002
  • xorg-xextproto =7.3.0=h0b41bf4_1003
  • xorg-xproto =7.0.31=h7f98852_1007
  • xxhash =0.8.1=h0b41bf4_0
  • xz =5.2.6=h166bdaf_0
  • yaml =0.2.5=h7f98852_2
  • yarl =1.9.2=py311h459d7ec_0
  • zipp =3.16.2=pyhd8ed1ab_0
  • zlib =1.2.13=hd590300_5
  • zstd =1.5.2=hfc55251_7