repic

REliable PIcking by Consensus (REPIC) - an ensemble learning methodology for cryo-EM particle picking

https://github.com/ccameron/repic

Keywords

cryo-em ensemble-learning graph-theory integer-linear-programming particle-picking

Repository

Basic Info
Statistics
  • Stars: 6
  • Watchers: 4
  • Forks: 2
  • Open Issues: 0
  • Releases: 5
Created over 3 years ago · Last pushed 12 months ago

README.md

Overview

REliable PIcking by Consensus (REPIC) is a consensus methodology for harnessing multiple cryogenic-electron microscopy (cryo-EM) particle picking algorithms. It identifies particles common to multiple picked particle sets (i.e., consensus particles) using graph theory and integer linear programming (ILP). Picked particle sets may be found by a human specialist (manual), template matching, mathematical function (e.g., RELION's Laplacian-of-Gaussian auto-picking), or machine-learning method. A schematic representation of REPIC applied to the output of three CNN-based particle pickers is below:

REPIC expects particle sets to be in BOX file format (*.box) where each particle has coordinates, a detection box size (in pixels), and (optional) a score [0-1].
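As a rough illustration of the input format, the sketch below parses BOX-formatted text into simple records. It assumes the common EMAN-style column order (x, y, box width, box height, optional score); the text above only states that each particle has coordinates, a box size, and an optional score, so the exact layout, the `Particle` class, and the `parse_box_text` helper are illustrative assumptions rather than part of REPIC.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Particle:
    x: float                       # box x coordinate (pixels)
    y: float                       # box y coordinate (pixels)
    width: float                   # detection box width (pixels)
    height: float                  # detection box height (pixels)
    score: Optional[float] = None  # optional picker score in [0, 1]


def parse_box_text(text: str) -> List[Particle]:
    """Parse BOX-formatted text (hypothetical helper, assumed column order)."""
    particles = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        x, y, w, h = (float(v) for v in fields[:4])
        score = float(fields[4]) if len(fields) > 4 else None
        particles.append(Particle(x, y, w, h, score))
    return particles


# Made-up example: one scored particle and one unscored particle
example = "100\t200\t180\t180\t0.93\n420\t515\t180\t180\n"
parsed = parse_box_text(example)
for particle in parsed:
    print(particle)
```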

Software requirements

Required:

  1. Python v3.8 interpreter (Miniconda installation recommended)
  2. Python package dependencies described in setup.py
  3. Windows users - Ubuntu terminal environment with Windows Subsystem for Linux (WSL) (v22.04.2 LTS tested)

Optional:

  1. Gurobi ILP optimizer (v9.5.2 used) - requires free academic license **
  2. REgularised LIkelihood OptimisatioN (RELION) - particle and density analyses (v3.13 used)
  3. UCSF Chimera - map alignment and density visualization (v1.16 used)

** Required to reproduce manuscript results. If the Gurobi package is not found, REPIC will use the SciPy ILP optimizer instead.
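The fallback described above can be pictured with a small sketch that checks whether Gurobi's Python package (`gurobipy`) is importable and otherwise reports SciPy. This is not REPIC's actual selection code; the function name `choose_ilp_backend` and the returned labels are hypothetical.

```python
import importlib.util


def choose_ilp_backend() -> str:
    """Return 'gurobi' if gurobipy can be imported, else 'scipy' (hypothetical helper)."""
    if importlib.util.find_spec("gurobipy") is not None:
        return "gurobi"
    return "scipy"


backend = choose_ilp_backend()
print(f"ILP backend that would be used: {backend}")
```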

Installation guide

REPIC installation is expected to only take a few minutes.

:warning: WARNING: Only the Docker installation of REPIC includes pickers (SPHIRE-crYOLO, DeepPicker, and Topaz). If installing REPIC using either Conda or pip, pickers will need to be separately installed (see docs/ for installation instructions).

Install using Docker (recommended)

  1. Install Docker if the docker command is unavailable
  2. Install and set up NVIDIA Container Toolkit for building and running GPU-accelerated containers
  3. Build CUDA-supported image from Dockerfile in REPIC GitHub repo: docker build -t repic_img https://github.com/ccameron/REPIC.git#main
  4. Run container with GPU acceleration (example iter_pick.py command shown below): docker run --gpus all -v <file_path>/REPIC/examples:/examples repic_img repic iter_pick /examples/10057/iter_config.json 4 100

Install using Conda

:warning: WARNING: if Python package conflicts are encountered during the Conda installation of REPIC, please ensure Conda channels are properly set for Bioconda. See Bioconda Usage for more information

  1. Install Miniconda if the conda command is unavailable
  2. Create a separate Conda environment and install REPIC and Gurobi: conda create -n repic -c bioconda -c gurobi repic gurobi
  3. Activate REPIC Conda environment: conda activate repic
  4. Obtain a Gurobi license and set Gurobi key: grbgetkey <gurobi_key>
  5. Remove unused or temporary Conda files: conda clean --all

Install from source using pip

  1. Either download the package by clicking the "Clone or download" button, unzip the file in the desired location, and rename the directory "REPIC", or clone the repository from the command line: git clone https://github.com/ccameron/REPIC
  2. Install Miniconda if the conda command is unavailable
  3. Navigate to REPIC directory: cd <install_path>/REPIC
  4. Create a separate Conda environment and install Gurobi for REPIC: conda create -n repic -c gurobi python=3.8 gurobi
  5. Activate REPIC Conda environment: conda activate repic
  6. Install REPIC using pip: python -m pip install .
  7. Obtain a Gurobi license and set Gurobi key: grbgetkey <gurobi_key>
  8. Remove unused or temporary Conda files: conda clean --all

To check if REPIC was correctly installed, run the following command (after activating the REPIC Conda environment or using a REPIC container):

repic -h

A help menu should appear in the terminal.

Integration

Run using published Docker image (with Singularity/Apptainer)

A REPIC Docker image is published on both DockerHub and the GitHub container registry. Apptainer (formerly Singularity) can be used to run this image:

  1. Install Apptainer if the apptainer command is unavailable
  2. Install and set up NVIDIA Container Toolkit for building and running GPU-accelerated containers
  3. Pull REPIC Docker image and convert to Singularity image format (SIF) (requires >8 GB of memory and ~40 mins for conversion): apptainer pull docker://cjfcameron/repic
     If SIF file creation is taking a long time, increase the mksquashfs mem parameter in the Apptainer config file (apptainer.conf). See here for more information.

  4. Run container with GPU acceleration (example iter_pick.py command shown below): apptainer run --nv --bind <file_path>/REPIC/examples:/examples repic_latest.sif repic iter_pick /examples/10057/iter_config.json 4 100

Run using Google Colab

A Jupyter Notebook for installing and running REPIC on Google Colab is included in the REPIC GitHub repo: repic_colab.ipynb

To open the notebook in Google Colab:

  1. Navigate a browser to: https://colab.google/
  2. Select "Open Colab", then "GitHub"
  3. Enter the REPIC GitHub web URL: https://github.com/ccameron/REPIC.git
  4. Select the "repic_colab.ipynb" Jupyter Notebook

Run using Scipion plugin

:warning: WARNING: Scipion plugin currently only contains REPIC one-shot mode

REPIC is available as a Scipion plugin: https://github.com/scipion-em/scipion-em-repic

See here for information about installing plugins for Scipion.

Example data

Example SPHIRE-crYOLO, DeepPicker, and Topaz picked particle coordinate files for $\beta$-galactosidase (EMPIAR-10017) micrographs are found in examples/10017/. These files were generated by applying the pre-trained pickers to $\beta$-galactosidase micrographs, filtering false positives per author-suggested thresholds, and then converting the files to BOX format using coord_converter.py.

Example motion corrected T20S proteasome (EMPIAR-10057) micrographs and normative particles for iterative ensemble particle picking are freely available via Amazon Web Services (AWS). To download this data, please run get_examples.sh (see Quick start below for how to run this Bash script).

Installation instructions for SPHIRE-crYOLO, DeepPicker, and Topaz are found in docs/.

Example commands for fitting and running SPHIRE-crYOLO, DeepPicker, and Topaz models are found in repic/iterative_particle_picking/.

Parameters used for particle picking algorithms and RELION are found in supplemental_data_file_2.ods.

Quick start

Creating consensus particle sets

  1. Calculate the particle overlap (Jaccard index [JI]) and enumerate cliques using get_cliques.py (expected run time: 1-3 mins):

repic get_cliques examples/10017/ examples/10017/clique_files/ 180

Note - REPIC will use the folder names found in the provided input directory (e.g., examples/10017/) to assign method labels (e.g., "crYOLO", "deepPicker", "topaz")

Correctly executing the above command will produce the following files for each micrograph in the output folder examples/10017/clique_files/:

  • *_clique_coords.pickle: pickled clique (x,y) coordinates (in BOX format)
  • *_constraint_matrix.pickle: pickled constraint matrix file
  • *_weight_vector.pickle: pickled clique weight vector file
  • *_runtime.tsv: runtime tracking TSV file
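To make the overlap and clique step more concrete, here is a minimal sketch of the idea for equal-size square boxes. It is not the get_cliques.py implementation: the exhaustive `itertools.product` search (REPIC's release notes mention a k-d tree instead), the `min_ji` threshold of 0.3, and the use of the mean pairwise JI as a clique weight are all assumptions made for illustration, and the coordinates are made up.

```python
from itertools import product


def jaccard(a, b, size):
    """Jaccard index of two axis-aligned square boxes of side `size`,
    each given by the (x, y) position of the same box corner."""
    w = max(0.0, size - abs(a[0] - b[0]))  # overlap width
    h = max(0.0, size - abs(a[1] - b[1]))  # overlap height
    inter = w * h
    return inter / (2 * size * size - inter)


def enumerate_cliques(picks, size, min_ji=0.3):
    """picks: dict mapping picker name -> list of (x, y) coordinates.
    Returns (members, weight) tuples with one particle per picker, where
    every pair of members overlaps by more than min_ji (assumed threshold)."""
    names = sorted(picks)
    if len(names) < 2:
        return []
    cliques = []
    for combo in product(*(picks[n] for n in names)):
        jis = [jaccard(combo[i], combo[j], size)
               for i in range(len(combo))
               for j in range(i + 1, len(combo))]
        if all(ji > min_ji for ji in jis):
            # Assumed weight: mean pairwise JI of the clique members
            cliques.append((dict(zip(names, combo)), sum(jis) / len(jis)))
    return cliques


picks = {
    "crYOLO": [(100, 100), (500, 500)],
    "deepPicker": [(105, 98)],
    "topaz": [(97, 103), (900, 900)],
}
cliques = enumerate_cliques(picks, size=180)
for members, weight in cliques:
    print(members, round(weight, 3))
```

Only the three particles near (100, 100) overlap pairwise, so they form the single clique in this toy input.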

  2. Find optimal cliques using the ILP solver and create consensus particle BOX files using run_ilp.py (expected run time: <1 min):

repic run_ilp examples/10017/clique_files/ 180

Correctly executing the above command will produce a particle coordinate file (in BOX format) for each micrograph in the output directory examples/10017/clique_files/. The final column in these BOX files represents the clique weight for a consensus particle.
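The selection step can be illustrated on a toy problem. The sketch below uses a brute-force search over clique subsets, not an ILP solver, and it assumes the constraint is that each picked particle may belong to at most one selected clique while the total clique weight is maximized. The particle identifiers, weights, and the `select_cliques` helper are hypothetical; the search is only feasible for a handful of cliques.

```python
from itertools import combinations


def is_disjoint(cliques, indices):
    """True if no particle id appears in more than one of the chosen cliques."""
    used = set()
    for i in indices:
        if used & cliques[i]:
            return False
        used |= cliques[i]
    return True


def select_cliques(cliques, weights):
    """Exhaustively pick the disjoint subset of cliques with maximum total weight.
    A stand-in for the ILP on tiny inputs (exponential in the number of cliques)."""
    best, best_weight = (), 0.0
    for r in range(1, len(cliques) + 1):
        for indices in combinations(range(len(cliques)), r):
            if not is_disjoint(cliques, indices):
                continue
            total = sum(weights[i] for i in indices)
            if total > best_weight:
                best, best_weight = indices, total
    return list(best), best_weight


# Made-up cliques: ids such as "c0" stand for particle 0 from one picker
cliques = [
    frozenset({"c0", "d0", "t0"}),
    frozenset({"c0", "d1", "t1"}),  # shares c0 with clique 0
    frozenset({"c1", "d1", "t2"}),  # shares d1 with clique 1
]
weights = [0.9, 0.8, 0.7]
chosen, total_weight = select_cliques(cliques, weights)
print(chosen, total_weight)
```

Here cliques 0 and 2 are chosen: they share no particle, and together they outweigh any single clique.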

Particle picking by iterative ensemble learning

  1. Download example data from AWS S3 bucket using get_examples.sh (expected run time: 1-5 mins):

bash $(pip show repic | grep -in "Location" | cut -f2 -d ' ')/repic/iterative_particle_picking/get_examples.sh examples/10057/data/ &> aws_download.log

  2. Create a configuration file for iterative ensemble particle picking using iter_config.py (expected run time: <1 min):

repic iter_config examples/10057/ 176 224 <file_path>/gmodel_phosnet_202005_N63_c17.h5 <file_path>/DeepPicker-python 4 22

<file_path> must be replaced with the full file paths to the SPHIRE-crYOLO pre-trained model and DeepPicker directory, respectively. See picker installation instructions in docs/ for more information.

A configuration file iter_config.json will be created in the current working directory.

  3. Pick particles by iterative ensemble learning using iter_pick.py, a wrapper of run.sh (expected run time: 20-30 min/iteration):

repic iter_pick ./iter_config.json 4 100

The final set of consensus particles for the testing set should be found in: examples/10057/iterative_particle_picking/round_4/train_100/clique_files/test/*.box
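One quick sanity check after a run is to count how many consensus particles each output BOX file holds. The helper below is a small sketch, not part of REPIC; `count_particles` is a hypothetical name, it assumes one particle per non-empty line, and the demonstration writes made-up data to a temporary directory so that it runs anywhere.

```python
import tempfile
from pathlib import Path


def count_particles(box_dir):
    """Return {file name: number of non-empty lines} for each *.box file in box_dir."""
    counts = {}
    for box_file in sorted(Path(box_dir).glob("*.box")):
        with open(box_file) as handle:
            counts[box_file.name] = sum(1 for line in handle if line.strip())
    return counts


# Self-contained demonstration with made-up coordinates
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "mic_001.box").write_text(
        "10\t20\t176\t176\t0.8\n30\t40\t176\t176\t0.6\n"
    )
    demo_counts = count_particles(tmp)
print(demo_counts)
```

For the example run above, the function would be pointed at examples/10057/iterative_particle_picking/round_4/train_100/clique_files/test/.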

Command line details

Identifying consensus particle sets with REPIC

  1. Calculate particle overlap (JI) and enumerate cliques using get_cliques.py:

```
usage: repic get_cliques [-h] [--multi_out] [--get_cc] in_dir out_dir box_size

positional arguments:
  in_dir       path to input directory containing subdirectories of particle bounding box coordinate files
  out_dir      path to output directory (WARNING - script will delete directory if it exists)
  box_size     particle bounding box size (in int[pixels])

optional arguments:
  -h, --help   show this help message and exit
  --multi_out  set output of cliques to be members sorted by picker name
  --get_cc     filters cliques for those in the largest Connected Component (CC)
```

  2. Find optimal cliques using the ILP solver and create consensus particle BOX files using run_ilp.py:

```
usage: repic run_ilp [-h] [--num_particles NUM_PARTICLES] in_dir box_size

positional arguments:
  in_dir      path to input directory containing get_cliques.py output
  box_size    particle bounding box size (in int[pixels])

optional arguments:
  -h, --help  show this help message and exit
  --num_particles NUM_PARTICLES
              filter for the number of expected particles (int)
```

[run_ilp.py](https://github.com/ccameron/REPIC/blob/main/repic/commands/run_ilp.py) will create a plot of the consensus particle distribution (particledist.png) with a recommended (REC) num_particles value (70% of consensus particles) in in_dir.
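As an illustration of filtering by an expected particle count, the sketch below keeps the highest-weighted consensus particles and computes a count equal to 70% of the consensus set, matching the recommended (REC) value described above. That the filter ranks particles by the final (weight) column is an assumption, and both helper names are hypothetical rather than taken from run_ilp.py.

```python
def recommended_num_particles(n_consensus, fraction=0.7):
    """Count corresponding to `fraction` of the consensus particles (70% by default)."""
    return int(round(fraction * n_consensus))


def filter_top_particles(particles, num_particles):
    """Keep the num_particles rows with the largest final-column value (assumed weight)."""
    ranked = sorted(particles, key=lambda row: row[-1], reverse=True)
    return ranked[:num_particles]


# Made-up consensus particles: (x, y, box size, box size, clique weight)
consensus = [
    (10, 20, 180, 180, 0.91),
    (300, 40, 180, 180, 0.42),
    (55, 610, 180, 180, 0.77),
    (720, 95, 180, 180, 0.35),
    (410, 410, 180, 180, 0.88),
    (900, 15, 180, 180, 0.51),
    (60, 60, 180, 180, 0.66),
    (230, 780, 180, 180, 0.30),
    (640, 640, 180, 180, 0.95),
    (15, 870, 180, 180, 0.58),
]
rec = recommended_num_particles(len(consensus))
kept = filter_top_particles(consensus, rec)
print(rec, [row[-1] for row in kept])
```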

Particle picking by iterative ensemble learning

  1. Create a configuration file for iterative ensemble particle picking using iter_config.py:

```
usage: repic iter_config [-h] [--cryolo_env CRYOLO_ENV] [--deep_env DEEP_ENV]
                         [--deep_model DEEP_MODEL] [--topaz_env TOPAZ_ENV]
                         [--topaz_model TOPAZ_MODEL] [--out_file_path OUT_FILE_PATH]
                         data_dir box_size exp_particles cryolo_model deep_dir topaz_scale topaz_rad

positional arguments:
  data_dir       path to directory containing training data
  box_size       particle bounding box size (in int[pixels])
  exp_particles  number of expected particles (int)
  cryolo_model   path to LOWPASS SPHIRE-crYOLO model
  deep_dir       path to DeepPicker scripts
  topaz_scale    Topaz scale value (int)
  topaz_rad      Topaz particle radius size (in int[pixels])

optional arguments:
  -h, --help     show this help message and exit
  --cryolo_env CRYOLO_ENV
                 Conda environment name or prefix for SPHIRE-crYOLO installation (default:cryolo)
  --deep_env DEEP_ENV
                 Conda environment name or prefix for DeepPicker installation (default:deep)
  --deep_model DEEP_MODEL
                 path to pre-trained DeepPicker model (default:out-of-the-box model)
  --topaz_env TOPAZ_ENV
                 Conda environment name or prefix for Topaz installation (default:topaz)
  --topaz_model TOPAZ_MODEL
                 path to pre-trained Topaz model (default:out-of-the-box model)
  --out_file_path OUT_FILE_PATH
                 path for created config file (default:./iter_config.json)
```

data_dir/ is expected to contain a three-column TSV file of CTFFIND4 defocus values: (1) micrograph filename, (2) defocus x, and (3) defocus y. If this file is not found, then all micrographs will be assigned the same defocus value. A defocus file can be built from the output of a RELION CTF refinement job using the following Bash script:

    EMPIAR_ID=<complete> # only integers - i.e., EMPIAR-10017 would be 10017
    out=<install_path>/REPIC/examples/${EMPIAR_ID}/data/defocus_${EMPIAR_ID}.txt
    rm -rf ${out}
    for file in <relion_path>/relion/CtfFind/job00[0-9]/<mrc_pattern>; do
      grep '' /dev/null ${file} | tail -n 1 | awk -F ":| " '{print $1,$3,$4}' >> ${out}
    done

<mrc_pattern> is dependent on the naming convention used for micrographs and will need to be set to your specific substring. For EMPIAR-10017 and -10057, the substrings are '*0.txt' and '*[0-9].txt', respectively.

<relion_path>/relion/CtfFind/job00[0-9]/<mrc_pattern> should match all CTFFIND4 output files in RELION's CtfFind/.
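For checking a defocus file before a run, a short reader such as the one below may help. It is a sketch, not REPIC code: `read_defocus` is a hypothetical helper, the sample values are made up, and it splits on any whitespace because awk's default output separator is a single space rather than a tab.

```python
def read_defocus(text):
    """Map micrograph filename -> (defocus x, defocus y) from three-column text."""
    defocus = {}
    for line in text.splitlines():
        fields = line.split()  # accepts tab- or space-separated columns
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        name, defocus_x, defocus_y = fields
        defocus[name] = (float(defocus_x), float(defocus_y))
    return defocus


# Made-up rows: one space-separated, one tab-separated
sample = "mic_001.mrc 15234.5 15010.2\nmic_002.mrc\t18100.0\t17950.7\n"
table = read_defocus(sample)
for name, values in table.items():
    print(name, values)
```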

  2. Iteratively pick particles using iter_pick.py, a wrapper of run.sh:

```
usage: repic iter_pick [-h] [--semi_auto] [--sample_prob SAMPLE_PROB] [--score]
                       [--out_file_path OUT_FILE_PATH]
                       config_file num_iter train_size

positional arguments:
  config_file  path to REPIC config file
  num_iter     number of iterations (int)
  train_size   training subset percentage (int)

optional arguments:
  -h, --help   show this help message and exit
  --semi_auto  initialize training labels with known particles (semi-automatic)
  --sample_prob SAMPLE_PROB
               sampling probability of initial training labels for 'semi_auto' (default:1.)
  --score      evaluate picked particle sets
  --out_file_path OUT_FILE_PATH
               path for picking log file (default:/iterpick.log)
```

train_size references the output of build_subsets.py, which builds training subsets of sizes 1%, 25%, 50%, and 100% (i.e., 100% will use the entire training set). For more information on dataset handling, please see "iterative ensemble particle picking with REPIC" in the Methods section of the REPIC manuscript.
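To give a feel for what training subsets of 1%, 25%, 50%, and 100% look like, the sketch below draws nested random subsets from a list of micrograph names. It does not reproduce build_subsets.py: the nested sampling, the fixed seed, the rounding up to at least one micrograph, and the `sketch_subsets` name are all assumptions for illustration.

```python
import random


def sketch_subsets(micrographs, percentages=(1, 25, 50, 100), seed=0):
    """Return {percentage: list of micrographs}; smaller subsets are prefixes
    of larger ones after a single seeded shuffle (assumed behavior)."""
    shuffled = list(micrographs)
    random.Random(seed).shuffle(shuffled)
    subsets = {}
    for pct in percentages:
        # Ceiling division, keeping at least one micrograph per subset
        n = max(1, -(-len(shuffled) * pct // 100))
        subsets[pct] = shuffled[:n]
    return subsets


micrographs = [f"mic_{i:03d}.mrc" for i in range(40)]  # made-up file names
subsets = sketch_subsets(micrographs)
for pct, subset in subsets.items():
    print(f"{pct}% -> {len(subset)} micrographs")
```

With 40 micrographs this yields subsets of 1, 10, 20, and 40 micrographs.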

Testing

The REPIC software has been tested on two computer systems:

  1. Ubuntu 16.04.6 LTS (Xenial Xerus) running CUDA v10.1 with four Nvidia GP102 TITAN Xp
  2. Ubuntu 16.04.7 LTS (Xenial Xerus) running CUDA v11.3 with four Nvidia GeForce GTX 1080

Citing REPIC

If REPIC was used in your analysis/study, please cite:

Cameron, C.J.F., Seager, S.J.H., Sigworth, F.J., Tagare, H.D., and Gerstein, M.B. REliable PIcking by Consensus (REPIC): a consensus methodology for harnessing multiple cryo-EM particle pickers. Commun Biol. DOI: 10.1038/s42003-024-07045-0

Contact

Submitting a GitHub issue is preferred for all problems related to REPIC.

For other concerns, please email Christopher JF Cameron.

Releases

v1.0.0

  • Google Colab integrated
  • Release created for Nature Communications Biology publication

v0.2.1

  • Scipion plugin created
  • Docker/Singularity/Apptainer integrated

v0.2.0

  • k-d tree algorithm integrated to reduce graph building runtime
  • Approval to include DeepPicker with REPIC install/distribution added: https://github.com/ccameron/REPIC/blob/main/imgs/deeppicker_approval.png
  • Various bug fixes

v0.1.0

Owner

  • Login: ccameron
  • Kind: user

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
  - family-names: "Cameron"
    given-names: "Christopher JF"
    orcid: https://orcid.org/0000-0003-4088-0957
  - family-names: "Seager"
    given-names: "Sebastian JH"
  - family-names: "Sigworth"
    given-names: "Fred J"
  - family-names: "Tagare"
    given-names: "Hemant D"
  - family-names: "Gerstein"
    given-names: "Mark B"
title: "REPIC - an ensemble learning methodology for cryo-EM particle picking"
version: v1.0.0
doi: 10.1101/2023.05.13.540636v1
date-released: 2023-05-14
url: "https://github.com/ccameron/REPIC"
