amp-bert-biochem

Research material on LLMs based on the Transformer architecture, with application to antimicrobial peptide classification

https://github.com/thenuber/amp-bert-biochem

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (10.3%) to scientific vocabulary
Last synced: 9 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: TheNuber
  • Language: Jupyter Notebook
  • Default Branch: main
  • Size: 71.4 MB
Statistics
  • Stars: 0
  • Watchers: 1
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Created about 2 years ago · Last pushed about 2 years ago
Metadata Files
Readme Citation

README.md

AMP-BERT-BIOCHEM

Welcome!

This repository contains research material on Transformer-based large language models (LLMs) applied to antimicrobial peptide classification. Specifically, it contains source code for building custom Transformer-based architectures that support model parallelism for HuggingFace models.

It is part of a Bachelor of Science final project and is actively used by a research team, so it is still under active development.
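
As an illustration of the model-parallel idea mentioned above, the following minimal sketch computes how the layers of an encoder could be divided into contiguous blocks, one block per GPU. It is not taken from this repository: the function name `split_layers`, the 30-layer encoder, and the device names are assumptions chosen only for the example.

```python
from typing import Dict, List


def split_layers(num_layers: int, devices: List[str]) -> Dict[int, str]:
    """Assign contiguous blocks of layer indices to devices, as evenly as possible.

    Hypothetical helper for illustration; not part of this repository.
    """
    if num_layers < 0 or not devices:
        raise ValueError("need a non-negative layer count and at least one device")
    base, extra = divmod(num_layers, len(devices))
    placement: Dict[int, str] = {}
    layer = 0
    for rank, device in enumerate(devices):
        # The first `extra` devices take one additional layer each.
        block = base + (1 if rank < extra else 0)
        for _ in range(block):
            placement[layer] = device
            layer += 1
    return placement


if __name__ == "__main__":
    # Assumed setup: a 30-layer encoder spread over 4 GPUs.
    plan = split_layers(30, ["cuda:0", "cuda:1", "cuda:2", "cuda:3"])
    for device in sorted(set(plan.values())):
        print(device, [i for i, d in plan.items() if d == device])
```

In a PyTorch model, a placement like this would typically be applied by moving each block of layers to its device with `.to(device)` and transferring the activations between devices during the forward pass. Contiguous blocks keep the number of such transfers to one per device boundary.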

Overview

The repository is organized into functional folders: each folder contains the files needed for one role. Here is how to navigate it, depending on what you want to use it for:

  • To use it as a library or as a code reference: go to the /src folder. It contains two main files:

    • MultiGPUModels.py: this file defines several classes that represent components or full architectures of Large Language Models.
    • pipeline_tools.py: this file defines tools for a machine learning pipeline, covering the training and testing of models.
  • To review the experiments of the project: go to the /notebooks folder. It contains many notebooks, each one being a step of the research. The relevant ones are:

    • ReproductionModel.ipynb: this is the first notebook. It shows the re-engineering of AMP-BERT, along with the creation of a deep learning (DL) pipeline and a comparison of models.
    • ResultsVisualization.ipynb: this is the second notebook. It shows how UMAP plots were used to diagnose the possible flaws of AMP-BERT.
    • BioChem.ipynb: this is the third notebook. It features the design of AMP-BERT-BIOCHEM, an enhancement over our reproduction of AMP-BERT, and a comparison between the two.
    • FeatureAblation.ipynb: this is the fourth notebook. It runs several experiments in which the main model receives only subsets of the predictors, in order to analyze their importance.

    The rest of the notebooks represent work in progress.
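
The feature-ablation idea described for FeatureAblation.ipynb (evaluating the model on subsets of predictors) can be sketched as follows. This is a minimal illustration, not the notebook's code: `ablation_study`, `toy_evaluate`, the feature names, and the scores are all hypothetical, and `toy_evaluate` merely stands in for training the model on a subset and returning a validation metric.

```python
from itertools import combinations
from typing import Callable, Dict, Iterable, Sequence, Tuple

Subset = Tuple[str, ...]


def ablation_study(
    features: Sequence[str],
    evaluate: Callable[[Subset], float],
    subset_sizes: Iterable[int],
) -> Dict[Subset, float]:
    """Score every feature subset of the requested sizes with `evaluate`."""
    results: Dict[Subset, float] = {}
    for size in subset_sizes:
        for subset in combinations(features, size):
            results[subset] = evaluate(subset)
    return results


def toy_evaluate(subset: Subset) -> float:
    # Placeholder metric (assumed values): a real study would train and
    # validate the model using only the predictors named in `subset`.
    weights = {"charge": 0.10, "hydrophobicity": 0.15, "length": 0.05}
    return 0.5 + sum(weights[name] for name in subset)


if __name__ == "__main__":
    scores = ablation_study(["charge", "hydrophobicity", "length"], toy_evaluate, [1, 2])
    # Print subsets from best to worst score.
    for subset, score in sorted(scores.items(), key=lambda kv: -kv[1]):
        print(f"{score:.2f}", subset)
```

Comparing the score of each subset with the score of the full feature set gives a rough estimate of how much each predictor contributes. The number of subsets grows quickly with the number of features, so restricting `subset_sizes` keeps the study tractable.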

Owner

  • Login: TheNuber
  • Kind: user

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
- family-names: "Coll Sanchez"
  given-names: "Ruben"
  orcid: "https://orcid.org/0000-0000-0000-0000"
- family-names: "Botia-Blaya"
  given-names: "Juan Antonio"
- family-names: "Sanchez-Ferrer"
  given-names: "Alvaro"
  orcid: "https://orcid.org/0000-0001-7266-4402"
- family-names: "Albaladejo-Riad"
  given-names: "Nora"

title: "AMP-BERT-BIOCHEM"
version: 1.0.0
date-released: 2024-06-31
url: "https://github.com/TheNuber/AMP-BERT-BIOCHEM"
