repeatmaskerfaster

Repeat Masker optimized for running in batches on a cluster

https://github.com/emilytrybulec/repeatmaskerfaster

Basic Info
  • Host: GitHub
  • Owner: emilytrybulec
  • License: mit
  • Language: Nextflow
  • Default Branch: main
  • Size: 170 KB
Statistics
  • Stars: 0
  • Watchers: 1
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Created over 1 year ago · Last pushed 12 months ago

README.md

Introduction

emilytrybulec/repeatMaskerFaster is a bioinformatics pipeline that takes a finished genome and performs repeat masking in batches. It produces a masked genome (.fasta.masked), detailed information about the repetitive elements identified by RepeatMasker (.out), and multiple sequence alignment of the repetitive regions identified in the sequence with the corresponding consensus sequences from the RepeatMasker database (.align).

  1. Repeat Masker
  2. Repeat Masker on the Cluster

Usage

If you are new to Nextflow and nf-core, please refer to this page on how to set-up Nextflow.

Configurations and parameters

First, go through nextflow.config to configure the pipeline to your needs. The batch size determines the size of the chunks your genome is split into for faster processing, and the soft masking option controls how the output genome is masked.
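
To make the soft masking option concrete, here is a minimal sketch of the difference between soft and hard masking on a toy sequence. It assumes the usual RepeatMasker convention (soft masking lowercases repeat bases, hard masking replaces them with N). The `mask_intervals` helper and the coordinates are hypothetical and are not part of the pipeline.

```python
def mask_intervals(seq, intervals, soft=True):
    """Mask 0-based, half-open (start, end) intervals in seq.

    soft=True lowercases the masked bases; soft=False replaces them with N.
    Hypothetical helper for illustration only.
    """
    bases = list(seq)
    for start, end in intervals:
        for i in range(start, min(end, len(bases))):
            bases[i] = bases[i].lower() if soft else "N"
    return "".join(bases)


toy = "ACGTACGTACGT"
repeats = [(4, 8)]  # made-up repeat coordinates
print(mask_intervals(toy, repeats, soft=True))   # ACGTacgtACGT
print(mask_intervals(toy, repeats, soft=False))  # ACGTNNNNACGT
```

Soft masking keeps the original bases recoverable, which is why it is often preferred when the masked genome feeds into downstream annotation.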

nextflow.config:

```config
params {

    // Input options
    soft_mask                  = true
    batch_size                 = 50000000

    species                    = null
    genome_fasta               = null
    consensus_fasta            = null
    cluster                    = null

}
```
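
As a rough illustration of what batch_size controls, the sketch below cuts sequences into windows of at most batch_size bases. It assumes batches are simple fixed-size coordinate windows within each sequence; the pipeline's actual splitting logic may differ, and `batch_ranges` is a hypothetical helper with made-up sequence lengths.

```python
def batch_ranges(seq_lengths, batch_size=50_000_000):
    """Yield (name, start, end) windows of at most batch_size bases.

    seq_lengths maps sequence name -> length in bases; coordinates are
    0-based, half-open. Hypothetical helper, not the pipeline's own code.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for name, length in seq_lengths.items():
        for start in range(0, length, batch_size):
            yield name, start, min(start + batch_size, length)


# Made-up sequence lengths for illustration.
lengths = {"chr1": 120_000_000, "chr2": 30_000_000}
for name, start, end in batch_ranges(lengths):
    print(name, start, end)
```

Under this assumption, a smaller batch_size produces more, shorter jobs, while a larger one produces fewer, longer jobs.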

Next, create a params.yaml file to supply values for the null entries above. The same information can instead be given on the command line, as shown below. Supply your genome path and preferred output directory name. The RepeatMasker species flag is used to warm up RepeatMasker, and a consensus path can be supplied to process your genome against known repeats, if available.

params.yaml:

```yaml
genome_fasta: "/core/labs/Oneill/Finished_Genomes_for_Annotation/BayDuikerCDO11_5Jan2023_RaconR3.fasta"
outdir: "bay_duiker_softmask"
species: "cow"
consensus_fasta: "/core/labs/Oneill/etrybulec/bay_duiker/cephalophus_dorsalis_ad.fa"
cluster: "xanadu"
```
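
Before launching, it can help to confirm that the entries left as null in nextflow.config have been filled in. The following is a minimal sketch, assuming genome_fasta and outdir are the entries that must always be set (this choice of required keys is an assumption, not something the pipeline documents). `missing_params` is a hypothetical helper and the values are placeholders.

```python
import json

REQUIRED = ("genome_fasta", "outdir")  # assumption: the minimum needed to run


def missing_params(params, required=REQUIRED):
    """Return the required keys that are absent, None, or empty."""
    return [key for key in required if not params.get(key)]


params = {
    "genome_fasta": "genome.fasta",  # placeholder path
    "outdir": "results",             # placeholder directory name
    "species": None,
    "consensus_fasta": None,
    "cluster": None,
}

problems = missing_params(params)
if problems:
    raise SystemExit(f"Missing required parameters: {', '.join(problems)}")

# Nextflow's -params-file option accepts JSON as well as YAML, so the
# standard library is enough to produce a usable params file.
print(json.dumps({k: v for k, v in params.items() if v is not None}, indent=2))
```

The printed JSON could be saved to a file and passed with -params-file in place of params.yaml.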

Downloading/updating the pipeline

When you first run the pipeline, Nextflow automatically pulls the pipeline code from GitHub and stores it as a cached version. Subsequent runs always use the cached version if available, even if the pipeline has been updated since. To make sure that you are running the latest version of the pipeline, regularly update the cached copy:

```bash
nextflow pull emilytrybulec/repeatMaskerFaster
```

Running the pipeline

You can run the pipeline using:

```bash
nextflow run emilytrybulec/repeatMaskerFaster \
    -profile <docker/singularity/.../institute> \
    -params-file params.yaml
```

Xanadu users: please refer to the example script. You may run into permissions errors associated with the /home/FCAM/... directory if running in /core/..., so please chmod -R 777 your /home/FCAM/.../.nextflow directory after running nextflow pull.

OR... if you prefer to put all of your options in the command line, you will not need a params file:

```bash
nextflow run emilytrybulec/repeatMaskerFaster \
    -profile <docker/singularity/.../institute> \
    --genome_fasta /core/labs/Oneill/Finished_Genomes_for_Annotation/BayDuikerCDO11_5Jan2023_RaconR3.fasta \
    --consensus_fasta /core/labs/Oneill/etrybulec/bay_duiker/cephalophus_dorsalis_ad.fa \
    --outdir bay_duiker_repeatmasker \
    --cluster xanadu
```

Core Nextflow arguments

> [!NOTE]
> These options are part of Nextflow and use a single hyphen (pipeline parameters use a double hyphen).
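
The hyphen convention can be seen by assembling a command programmatically. This is a minimal sketch: `build_command` is a hypothetical helper, and the profile and parameter values are placeholders.

```python
import shlex


def build_command(pipeline, nextflow_opts, pipeline_params):
    """Build a `nextflow run` argument list.

    nextflow_opts get a single hyphen (core Nextflow options);
    pipeline_params get a double hyphen (pipeline parameters).
    """
    argv = ["nextflow", "run", pipeline]
    for name, value in nextflow_opts.items():
        argv += [f"-{name}", str(value)]
    for name, value in pipeline_params.items():
        argv += [f"--{name}", str(value)]
    return argv


cmd = build_command(
    "emilytrybulec/repeatMaskerFaster",
    {"profile": "singularity"},
    {"genome_fasta": "genome.fasta", "outdir": "results"},
)
print(shlex.join(cmd))
```

The sketch only handles options that take a value; a bare flag such as -resume would need separate handling.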

-profile

Use this parameter to choose a configuration profile. Profiles can give configuration presets for different compute environments.

Several generic profiles are bundled with the pipeline which instruct the pipeline to use software packaged using different methods (Docker, Singularity, Podman, Shifter, Charliecloud, Apptainer, Conda) - see below.

> [!IMPORTANT]
> We highly recommend the use of Docker or Singularity containers for full pipeline reproducibility. When this is not possible, Conda is also supported.

Note that multiple profiles can be loaded, for example: -profile test,docker - the order of arguments is important! They are loaded in sequence, so later profiles can overwrite earlier profiles.
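
The effect of profile order can be modelled as successive dictionary updates. This is only a sketch using flat dictionaries: the profile contents below are invented for illustration and do not reflect the settings actually bundled with the pipeline, and `resolve_profiles` is a hypothetical helper.

```python
def resolve_profiles(profiles, names):
    """Merge profile settings in the order given; later names win."""
    resolved = {}
    for name in names.split(","):
        resolved.update(profiles[name])
    return resolved


# Hypothetical profile contents.
profiles = {
    "test": {"genome_fasta": "test.fasta", "docker_enabled": False},
    "docker": {"docker_enabled": True},
}

print(resolve_profiles(profiles, "test,docker"))
print(resolve_profiles(profiles, "docker,test"))
```

In this toy model, `test,docker` leaves `docker_enabled` set to True, while `docker,test` leaves it False, because the last profile to set a key determines its final value.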

If -profile is not specified, the pipeline will run locally and expect all software to be installed and available on the PATH. This is not recommended, since it can lead to different results on different machines depending on the compute environment.

  • test
    • A profile with a complete configuration for automated testing
    • Includes links to test data so needs no other parameters
  • docker
    • A generic configuration profile to be used with Docker
  • singularity
    • A generic configuration profile to be used with Singularity
  • podman
    • A generic configuration profile to be used with Podman
  • shifter
    • A generic configuration profile to be used with Shifter
  • charliecloud
    • A generic configuration profile to be used with Charliecloud
  • apptainer
    • A generic configuration profile to be used with Apptainer
  • wave
    • A generic configuration profile to enable Wave containers. Use together with one of the above (requires Nextflow 24.03.0-edge or later).
  • conda
    • A generic configuration profile to be used with Conda. Please only use Conda as a last resort i.e. when it's not possible to run the pipeline with Docker, Singularity, Podman, Shifter, Charliecloud, or Apptainer.

-resume

Specify this when restarting a pipeline. Nextflow will use cached results from any pipeline steps where the inputs are the same, continuing from where it got to previously. For input to be considered the same, not only the names must be identical but the files' contents as well. For more info about this parameter, see this blog post.

You can also supply a run name to resume a specific run: -resume [run-name]. Use the nextflow log command to show previous run names.

-c

Specify the path to a specific config file (this is a core Nextflow command). See the nf-core website documentation for more information.

Resource requests

Whilst the default requirements set within the pipeline will hopefully work for most people and with most input data, you may find that you want to customise the compute resources that the pipeline requests. Each step in the pipeline has a default set of requirements for number of CPUs, memory and time, which can be found in the base configuration file. For most of the steps in the pipeline, if the job exits with any of the error codes specified here it will automatically be resubmitted with higher requests (2 x original, then 3 x original). If it still fails after the third attempt then the pipeline execution is stopped.
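
The retry behaviour described above amounts to multiplying the base request by the attempt number. The following is a minimal sketch with made-up base values; `requests_for_attempt` is a hypothetical helper, and the real defaults live in the base configuration file.

```python
MAX_ATTEMPTS = 3  # per the text: execution stops if the third attempt also fails


def requests_for_attempt(base, attempt):
    """Scale every base resource by the attempt number (1x, 2x, 3x)."""
    if not 1 <= attempt <= MAX_ATTEMPTS:
        raise ValueError(f"attempt must be between 1 and {MAX_ATTEMPTS}")
    return {resource: amount * attempt for resource, amount in base.items()}


# Made-up base requests for one pipeline step.
base = {"cpus": 2, "memory_gb": 12, "time_h": 4}
for attempt in range(1, MAX_ATTEMPTS + 1):
    print(attempt, requests_for_attempt(base, attempt))
```

If the third attempt of a step would exceed what your cluster allows, that is a sign the base request or the cluster limits need adjusting.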

To change the resource requests, please see the max resources and tuning workflow resources section of the nf-core website.

Credits

emilytrybulec/repeatMaskerFaster was originally written by the DFAM consortium and modified for the Xanadu cluster by Emily Trybulec with the help of Jessica Storer.

Contributions and Support

If you would like to contribute to this pipeline, please see the contributing guidelines.

Citations

If you use emilytrybulec/repeatMaskerFaster for your analysis, please cite this repository.

This pipeline uses code and infrastructure developed and maintained by the nf-core community, reused here under the MIT license.

The nf-core framework for community-curated bioinformatics pipelines.

Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.

Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.

Citation (CITATIONS.md)

# emilytrybulec/repeat_curation: Citations

## [nf-core](https://pubmed.ncbi.nlm.nih.gov/32055031/)

> Ewels PA, Peltzer A, Fillinger S, Patel H, Alneberg J, Wilm A, Garcia MU, Di Tommaso P, Nahnsen S. The nf-core framework for community-curated bioinformatics pipelines. Nat Biotechnol. 2020 Mar;38(3):276-278. doi: 10.1038/s41587-020-0439-x. PubMed PMID: 32055031.

## [Nextflow](https://pubmed.ncbi.nlm.nih.gov/28398311/)

> Di Tommaso P, Chatzou M, Floden EW, Barja PP, Palumbo E, Notredame C. Nextflow enables reproducible computational workflows. Nat Biotechnol. 2017 Apr 11;35(4):316-319. doi: 10.1038/nbt.3820. PubMed PMID: 28398311.
