Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (8.6%) to scientific vocabulary

Repository

Basic Info
  • Host: GitHub
  • Owner: burnsajohn
  • Language: Shell
  • Default Branch: main
  • Size: 1.42 MB
Statistics
  • Stars: 0
  • Watchers: 1
  • Forks: 0
  • Open Issues: 0
  • Releases: 1
Created over 4 years ago · Last pushed over 4 years ago
Metadata Files
Readme Citation

README.md

GenomePercPredict

These scripts were written during a fully virtual Research Experience for Undergraduates (REU) program in the summer of 2020. They subsample whole proteome files from organisms that are confirmed to eat by phagocytosis, or to have cells capable of phagocytosis, in order to explore the relationship between genome completeness, as measured by BUSCO counts, and functional predictions made with a machine learning framework.

The functional prediction of phagocytosis relies on the computational tool predictTrophicMode (https://github.com/burnsajohn/predictTrophicMode). The workflow also requires BUSCO and its broad eukaryote set of BUSCOs, eukaryota_odb10. In R, it requires the packages ggplot2 and drc.
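
Before starting, it may help to confirm that the required command-line tools can be found on the PATH. The following is a minimal sketch, not part of the repository: the helper name `missing_tools` is hypothetical, and the executable names in the usage example (`hmmsearch`, `busco`, `perl`, `Rscript`) are assumptions about a typical installation that may need adjusting.

```shell
#!/usr/bin/env bash
# Pre-flight check (hypothetical helper, not part of GenomePercPredict).
# Prints the name of every tool given as an argument that cannot be
# resolved on the PATH; prints nothing when all of them are available.
# The ( ) function body runs in a subshell, so "tool" stays local.
missing_tools() (
  for tool in "$@"; do
    # command -v exits non-zero when the name cannot be resolved
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
)

# Example call (executable names are assumptions):
#   missing_tools hmmsearch busco perl Rscript
```

The R packages can be checked separately, for example with `Rscript -e 'library(ggplot2); library(drc)'`, which fails if either package is not installed.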

The procedure to observe the relationship went as follows:

1) Starting with a set of whole proteome files, search the set of HMMs from the predictTrophicMode tool against all proteins in each file using the script runhmmer.sh:
   bash runhmmer.sh -d [fasta directory] -o [output directory] -c [control_file] -h [hmm file location] -t [number of threads]

2) Subsample the proteome files using the script subsampleFasta.sh:
   bash subsampleFasta.sh -d [fasta directory] -o [output directory]

3) Run BUSCO on the subsamples using the script runBuscoSubs.sh. It outputs a data table of the compiled BUSCO runs from all of the subsamples (BUSCOtable.csv):
   bash runBuscoSubs.sh -d [subsample directory] -o [output directory] -e [busco directory]

4) Map each proteome subsample to its significant HMM hits using the script runmapping.sh, which calls the Perl script mapsigmodels.pl. Run it from within the directory that contains the hmmsearch output files:
   bash runmapping.sh

5) Run the "sigModel" output files from the mapping step (step 4) through the predictTrophicMode tool by placing all of the "sigModel" files into the "TestGenomes" directory of the tool and running it in default mode. It outputs a data table containing the predictions.

6) Combine the compiled BUSCO output data (BUSCOtable.csv) and the compiled predictions data table (predictionsDataTable.txt) into one data table that gives a BUSCO number and trophic mode predictions for each subset of each proteome, using the R script combinePredBuscos.r. This may be best accomplished line by line in R; it has not been tested as a standalone script.

7) Plot the output in R using the scripts in the file plot.preds.vs.buscos.r. The scripts reproduce the output used in the manuscript describing this work if the provided data files from the "ManuscriptData" directory are used.
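
The command lines for steps 1 to 3 can be chained in a small wrapper so that a failure in one step stops the rest. The following is a minimal sketch, not part of the repository: the function names `run_cmd` and `run_steps_1_to_3`, the `DRY_RUN` variable, and every path in the usage example are hypothetical. It also assumes that the three scripts sit in the current working directory. Only the script names and flags are taken from the steps above.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper for steps 1-3 (not part of GenomePercPredict).

# Run a command, or only print it when DRY_RUN=1 is set.
run_cmd() {
  if [ "${DRY_RUN:-0}" = 1 ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Arguments, in order:
#   fasta dir, hmmsearch output dir, control file, hmm file, threads,
#   subsample dir, BUSCO output dir, busco dir
# The ( ) function body runs in a subshell, so the variables stay local.
run_steps_1_to_3() (
  fasta_dir=$1; hmm_out=$2; control_file=$3; hmm_file=$4; threads=$5
  sub_dir=$6; busco_out=$7; busco_dir=$8

  # "&&" stops the chain as soon as one step fails.
  run_cmd bash runhmmer.sh -d "$fasta_dir" -o "$hmm_out" \
    -c "$control_file" -h "$hmm_file" -t "$threads" &&
  run_cmd bash subsampleFasta.sh -d "$fasta_dir" -o "$sub_dir" &&
  run_cmd bash runBuscoSubs.sh -d "$sub_dir" -o "$busco_out" -e "$busco_dir"
)

# Example dry run (all paths are placeholders):
#   DRY_RUN=1; run_steps_1_to_3 proteomes hmm_out control.txt models.hmm 4 \
#     subsamples busco_out busco_dir
```

A dry run prints the three commands without executing them, which makes it easy to check the paths first. Steps 4 to 7 are left out of the sketch because they depend on the working directory, on the predictTrophicMode tool, and on interactive work in R.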

These scripts were written in collaboration by Jessica Liu, TreAndice Williams, and John A. Burns.

Owner

  • Name: John A Burns
  • Login: burnsajohn
  • Kind: user
  • Company: Bigelow Laboratory for Ocean Sciences

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
  - family-names: Burns
    given-names: John
    orcid: https://orcid.org/0000-0002-2348-8438
title: "GenomePercPredict"
version: 2.0.0
doi: 10.5281/zenodo.5544234
date-released: 2021-10-01
