Opfi

Opfi: A Python package for identifying gene clusters in large genomics and metagenomics data sets - Published in JOSS (2021)

https://github.com/wilkelab/opfi

Science Score: 95.0%

This score indicates how likely this project is to be science-related based on various indicators:

○ CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
✓ .zenodo.json file: Found .zenodo.json file
✓ DOI references: Found 1 DOI reference(s) in JOSS metadata
○ Academic publication links
✓ Committers with academic emails: 1 of 4 committers (25.0%) from academic institutions
○ Institutional organization owner
✓ JOSS paper metadata: Published in Journal of Open Source Software

Keywords from Contributors

mesh

Scientific Fields

Biology Life Sciences - 84% confidence

Last synced: 6 months ago · JSON representation

Repository

A Python package for discovery, annotation, and analysis of gene clusters in genomics or metagenomics data sets.

Basic Info

Host: GitHub
Owner: wilkelab
License: mit
Language: Python
Default Branch: master
Homepage: https://opfi.readthedocs.io/
Size: 23.6 MB

Statistics

Stars: 21
Watchers: 5
Forks: 5
Open Issues: 3
Releases: 3

Created about 6 years ago · Last pushed over 4 years ago

Metadata Files

Readme Changelog Contributing License Code of conduct

Opfi

A Python package for discovery, annotation, and analysis of gene clusters in genomics or metagenomics datasets.

Installation

The recommended way to install Opfi is with Bioconda, which requires the conda package manager. This will install Opfi and all of its dependencies (described in the documentation at https://opfi.readthedocs.io/).

Currently, Bioconda supports only 64-bit Linux and Mac OS. Windows users can still install Opfi with pip (see below); however, the complete installation procedure has not been fully tested on a Windows system.

Install with conda (Linux and Mac OS only)

First, set up conda and Bioconda following the quickstart guide. Once this is done, run:

conda install -c bioconda opfi

And that's it! Note that this will install Opfi in the conda environment that is currently active. To create a fresh environment with Opfi installed, do:

conda create --name opfi-env -c bioconda opfi
conda activate opfi-env

Install with pip

This method does not automatically install non-Python dependencies, so they must be installed separately, following their individual installation instructions. A complete list of required software is available in the documentation (https://opfi.readthedocs.io/). Once this step is complete, install Opfi with pip by running:

pip install opfi

For information about installing for development, check out the documentation site.

Gene Finder

Gene Finder iteratively executes homology searches to identify gene clusters of interest. Below is an example script that sets up a search for putative CRISPR-Cas systems in the Rippkaea orientalis PCC 8802 (cyanobacteria) genome. Data inputs are provided in the Opfi tutorial (tutorials/tutorial.ipynb).

```python
from gene_finder.pipeline import Pipeline
import os

genomic_data = "GCF_000024045.1_ASM2404v1_genomic.fna.gz"

# seed on cas1 hits, keep regions with other cas genes, then look for CRISPR arrays
p = Pipeline()
p.add_seed_step(db="cas1", name="cas1", e_val=0.001, blast_type="PROT", num_threads=1)
p.add_filter_step(db="cas_all", name="cas_all", e_val=0.001, blast_type="PROT", num_threads=1)
p.add_crispr_step()

# use the input filename as the job id
job_id = os.path.basename(genomic_data)
results = p.run(job_id=job_id, data=genomic_data, min_prot_len=90, span=10000, gzip=True)
```

Operon Analyzer

Operon Analyzer filters results from Gene Finder and identifies promising candidate operons according to a given set of criteria. It also contains tools for visualizing candidates and computing basic statistics.

Please note that the use of the word "operon" throughout this library is somewhat of an artifact from early development. At this time, Opfi does not predict whether a candidate system represents a true operon, that is, a set of genes under the control of a single promoter. Although a candidate gene cluster may certainly qualify as an operon, it is currently up to the user to make that distinction.

Analysis

The analysis module provides tools to identify operons that conform to certain rules, such as requiring that they contain a certain gene, or that two genes are within a given distance of each other (the full list is given below). The analysis writes CSV output to stdout, reporting the outcome for each putative operon.

Rules defined with the RuleSet determine whether an operon should be considered a candidate for further analysis. Filters defined with the FilterSet help define which features to consider when evaluating rules. You might, for example, want to exclude any operon containing a particular gene, but if a copy of that gene coincidentally exists 5 kb from the true operon, you might want to ignore it for the purposes of evaluating your rules.

A sample script that performs this task is given here:

```python
import sys
from operon_analyzer.analyze import analyze
from operon_analyzer.rules import RuleSet, FilterSet

# rules decide whether an operon is a candidate
rs = RuleSet().require('transposase') \
              .exclude('cas3') \
              .at_most_n_bp_from_anything('transposase', 500) \
              .same_orientation()

# filters decide which features the rules get to see
fs = FilterSet().pick_overlapping_features_by_bit_score(0.9) \
                .must_be_within_n_bp_of_anything(1000)

if __name__ == '__main__':
    analyze(sys.stdin, rs, fs)
```
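To make the interplay between filters and rules concrete, here is a minimal, self-contained sketch of the scenario described above: a stray copy of an excluded gene sits 5 kb from the cluster, and a distance filter removes it before the exclusion rule is evaluated. The `Feature` class and both helper functions are hypothetical stand-ins written for this illustration; they are not Opfi's own implementation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Feature:
    """Hypothetical stand-in for an annotated feature (not Opfi's class)."""
    name: str
    start: int
    end: int


def drop_distant_features(features: List[Feature], distance_bp: int) -> List[Feature]:
    """Keep only features within distance_bp of at least one other feature."""
    kept = []
    for f in features:
        for other in features:
            if other is f:
                continue
            # Gap between the two intervals; zero if they overlap.
            gap = max(other.start - f.end, f.start - other.end, 0)
            if gap <= distance_bp:
                kept.append(f)
                break
    return kept


def excludes(features: List[Feature], name: str) -> bool:
    """Rule: pass only if no feature carries the given name."""
    return all(f.name != name for f in features)


features = [
    Feature("transposase", 100, 1300),
    Feature("cas9", 1400, 5500),
    Feature("cas3", 10500, 13000),  # stray copy, 5 kb past the cluster
]

unfiltered_passes = excludes(features, "cas3")
filtered_passes = excludes(drop_distant_features(features, 1000), "cas3")
print(unfiltered_passes, filtered_passes)
```

Without the filter the exclusion rule rejects the operon; with it, the distant copy is ignored and the operon passes.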

List of available rules

exclude(feature_name: str): Forbid the presence of a particular feature.
require(feature_name: str): Require the presence of a particular feature.
max_distance(feature1_name: str, feature2_name: str, distance_bp: int): The two given features must be no further than distance_bp base pairs apart. Requires exactly one of each feature to be present.
at_least_n_bp_from_anything(feature_name: str, distance_bp: int): Requires that a feature be at least distance_bp base pairs away from any other feature. This is mostly useful for eliminating overlapping features.
at_most_n_bp_from_anything(feature_name: str, distance_bp: int): A given feature must be within distance_bp base pairs of another feature. Requires exactly one matching feature to be present. Returns False if the given feature is the only feature.
same_orientation(exceptions: Optional[List[str]] = None): All features in the operon must have the same orientation.
contains_any_set_of_features(sets: List[List[str]]): Returns True if the operon contains features with all of the names in at least one of the lists. Useful for determining if an operon contains all of the essential genes for a particular system, for example.
contains_exactly_one_of(feature1_name: str, feature2_name: str): An exclusive-or of the presence of two features. That is, one of the features must be present and the other must not.
contains_at_least_n_features(feature_names: List[str], feature_count: int, count_multiple_copies: bool = False): The operon must contain at least feature_count features in the list. By default, a matching feature that appears multiple times in the operon will only be counted once; to count multiple copies of the same feature, set count_multiple_copies=True.
contains_group(feature_names: List[str], max_gap_distance_bp: int, require_same_orientation: bool): The operon must contain a contiguous set of features (in any order) separated by no more than max_gap_distance_bp. Optionally, the user may require that the features all have the same orientation.
maximum_size(feature_name: str, max_bp: int, all_matching_features_must_pass: bool = False, regex: bool = False): The operon must contain at least one feature named feature_name with a size (in base pairs) of max_bp or smaller. If all_matching_features_must_pass is True, every matching feature must be at most max_bp long.
minimum_size(feature_name: str, min_bp: int, all_matching_features_must_pass: bool = False, regex: bool = False): The operon must contain at least one feature named feature_name with a size (in base pairs) of min_bp or larger. If all_matching_features_must_pass is True, every matching feature must be at least min_bp long.
custom(rule: 'Rule'): Add a rule with a user-defined function.
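The set-based rules above are easiest to understand by example. The following sketch re-implements the described semantics of three of them as plain functions over a list of feature names. These functions only mirror the descriptions in the list; they are an illustration under that assumption, not Opfi's code, and the gene names are arbitrary.

```python
from typing import List


def contains_exactly_one_of(names: List[str], feature1_name: str, feature2_name: str) -> bool:
    """Exclusive-or: one feature is present and the other is absent."""
    return (feature1_name in names) != (feature2_name in names)


def contains_at_least_n_features(names: List[str], feature_names: List[str],
                                 feature_count: int,
                                 count_multiple_copies: bool = False) -> bool:
    """Count matching features, collapsing duplicates unless told otherwise."""
    matches = [n for n in names if n in feature_names]
    if not count_multiple_copies:
        matches = set(matches)
    return len(matches) >= feature_count


def contains_any_set_of_features(names: List[str], sets: List[List[str]]) -> bool:
    """True if every name in at least one of the lists is present."""
    return any(all(name in names for name in s) for s in sets)


operon = ["cas1", "cas2", "cas2", "transposase"]

xor_ok = contains_exactly_one_of(operon, "cas1", "cas3")
xor_both = contains_exactly_one_of(operon, "cas1", "cas2")
unique_count = contains_at_least_n_features(operon, ["cas2", "cas9"], 2)
copy_count = contains_at_least_n_features(operon, ["cas2", "cas9"], 2,
                                          count_multiple_copies=True)
any_set = contains_any_set_of_features(operon, [["cas1", "cas9"], ["cas1", "cas2"]])
print(xor_ok, xor_both, unique_count, copy_count, any_set)
```

Note how the two copies of cas2 satisfy a count of 2 only when count_multiple_copies=True.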

List of available filters

must_be_within_n_bp_of_anything(distance_bp: int): If a feature is very far away from anything it's probably not part of an operon.
must_be_within_n_bp_of_feature(feature_name: str, distance_bp: int): There may be situations where two features always appear near each other in functional operons.
pick_overlapping_features_by_bit_score(minimum_overlap_threshold: float): If two features overlap by more than minimum_overlap_threshold, the one with the lower bit score is ignored.
custom(filt: 'Filter'): Add a filter with a user-defined function.
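As an illustration of the overlap filter, the sketch below drops the lower-scoring member of any pair of hits that overlap by more than a threshold. The README does not say how the overlap fraction is computed, so this sketch assumes it is the overlap length divided by the length of the shorter hit; the `Hit` class, that definition, and the tie-breaking choice are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Hit:
    """Hypothetical record for a homology hit (not Opfi's class)."""
    name: str
    start: int
    end: int
    bit_score: float


def overlap_fraction(a: Hit, b: Hit) -> float:
    """Overlap length relative to the shorter hit (assumed definition)."""
    overlap = min(a.end, b.end) - max(a.start, b.start)
    if overlap <= 0:
        return 0.0
    shorter = min(a.end - a.start, b.end - b.start)
    return overlap / shorter


def pick_by_bit_score(hits: List[Hit], minimum_overlap_threshold: float) -> List[Hit]:
    """Ignore the lower-scoring hit of every sufficiently overlapping pair."""
    ignored = set()
    for i, a in enumerate(hits):
        for j in range(i + 1, len(hits)):
            b = hits[j]
            if overlap_fraction(a, b) > minimum_overlap_threshold:
                # On a tie the later hit is dropped (arbitrary choice).
                ignored.add(i if a.bit_score < b.bit_score else j)
    return [h for k, h in enumerate(hits) if k not in ignored]


hits = [
    Hit("cas9", 100, 1100, 250.0),
    Hit("cas9-like", 150, 1100, 90.0),  # lies entirely inside the cas9 hit
    Hit("cas1", 1200, 2100, 180.0),
]
kept_names = [h.name for h in pick_by_bit_score(hits, 0.9)]
print(kept_names)
```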

Analysis Output

Each line of the CSV will contain an accession ID and the path to the file that contains it, the contig coordinates, and whether it passed or failed the given rules. If it passed, the last column will contain the word pass only. Otherwise it will start with fail followed by a comma-delimited list of the serialized rules that it failed to adhere to (with the name and parameters that were passed to the method).
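A short sketch of reading this output with the standard csv module follows. It assumes the column order is accession, file path, start, end, then pass or fail followed by one column per failed rule. The sample lines and rule names are made up for illustration and do not reflect Opfi's actual serialization.

```python
import csv
import io
from typing import Dict, Iterator, TextIO


def read_analysis(handle: TextIO) -> Iterator[Dict[str, object]]:
    """Yield one record per CSV line (the column order is an assumption)."""
    for row in csv.reader(handle):
        if not row:
            continue
        contig, filename, start, end, outcome = row[:5]
        yield {
            "contig": contig,
            "filename": filename,
            "start": int(start),
            "end": int(end),
            "passed": outcome == "pass",
            "failed_rules": row[5:],  # empty for operons that passed
        }


# Made-up lines for illustration only.
sample = io.StringIO(
    "contigA,genomes/a.fna.gz,1200,9800,pass\n"
    "contigB,genomes/b.fna.gz,300,7400,fail,rule-one,rule-two\n"
)
records = list(read_analysis(sample))
for record in records:
    print(record["contig"], record["passed"], record["failed_rules"])
```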

Visualization

Interesting operons can be visualized with a simple gene diagram. It is up to the user to decide which operons are interesting; the sample script below creates diagrams for all operons that passed all given rules:

```python
import sys
from operon_analyzer.analyze import load_analyzed_operons
from operon_analyzer.visualize import build_operon_dictionary, plot_operons

analysis_csv, pipeline_csv, image_directory = sys.argv[1:]
good_operons = []

with open(pipeline_csv) as f:
    operons = build_operon_dictionary(f)
with open(analysis_csv) as f:
    for contig, filename, start, end, result in load_analyzed_operons(f):
        # skip operons that failed any rule
        if result[0] != 'pass':
            continue
        op = operons.get((contig, filename, start, end))
        if op is None:
            continue
        good_operons.append(op)
plot_operons(good_operons, image_directory)
```

Overview Statistics

Some basic tools are provided to inspect the nature of operons that did not pass all given rules. The intent here is to help researchers determine if their filtering is too aggressive (or not aggressive enough), and to get an overall better feel for the data.

Simple bar plots can be produced as follows:

```python
import sys
import matplotlib.pyplot as plt
from operon_analyzer.overview import load_counts


def plot_bar_chart(filename, title, data, rotate=True):
    fig, ax = plt.subplots()
    x = [str(d[0]).replace(":", "\n") for d in data]
    y = [d[1] for d in data]
    ax.bar(x, y, edgecolor='k')
    if rotate:
        plt.xticks(rotation=90)
    ax.set_ylabel("Count")
    ax.set_title(title)
    # the ".png" extension is appended here, so pass filenames without it
    plt.savefig("%s.png" % filename, bbox_inches='tight')
    plt.close()


if __name__ == '__main__':
    unique_rule_violated, failed_rule_occurrences, rule_failure_counts = load_counts(sys.stdin)
    plot_bar_chart("sole-failure", "Number of times that each rule\nwas the only one that failed", sorted(unique_rule_violated.items()))
    plot_bar_chart("total-failures", "Total number of rule failures", sorted(failed_rule_occurrences.items()))
    plot_bar_chart("failures-at-each-contig", "Number of rules failed at each contig", sorted(rule_failure_counts.items()), rotate=False)
```
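The three tallies plotted above can be understood with a small standalone sketch. It assumes each operon is represented by the list of rules it failed and that the third tally counts operons by number of failed rules. The `tally` function is a hypothetical illustration of that bookkeeping, not the implementation of load_counts, and the rule names are placeholders.

```python
from collections import Counter
from typing import List, Tuple


def tally(failures_per_operon: List[List[str]]) -> Tuple[Counter, Counter, Counter]:
    """Summarize rule failures across operons."""
    sole_failure = Counter()      # rule was the only one that failed
    total_failures = Counter()    # every failure of each rule
    failures_per_size = Counter() # number of failed rules -> number of operons
    for failed in failures_per_operon:
        failures_per_size[len(failed)] += 1
        total_failures.update(failed)
        if len(failed) == 1:
            sole_failure[failed[0]] += 1
    return sole_failure, total_failures, failures_per_size


# Placeholder data: one inner list of failed rules per operon.
example = [[], ["rule-a"], ["rule-a", "rule-b"], ["rule-b"], ["rule-a"]]
sole, total, per_size = tally(example)
print(sorted(sole.items()), sorted(total.items()), sorted(per_size.items()))
```

A rule with a high sole-failure count is the one most often responsible, on its own, for rejecting an operon, which makes it a natural candidate to relax if filtering seems too aggressive.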

Owner

Name: Wilke Lab
Login: wilkelab
Kind: organization

Website: http://wilkelab.org
Repositories: 35
Profile: https://github.com/wilkelab

JOSS Publication

Opfi: A Python package for identifying gene clusters in large genomics and metagenomics data sets

Published

October 27, 2021

DOI

10.21105/joss.03678

Volume 6, Issue 66, Page 3678

Authors

Alexis M. Hill
Department of Integrative Biology, The University of Texas at Austin, Austin, Texas 78712, USA

James R. Rybarski
Department of Molecular Biosciences, The University of Texas at Austin, Austin, Texas 78712, USA

Kuang Hu
Department of Integrative Biology, The University of Texas at Austin, Austin, Texas 78712, USA, Department of Molecular Biosciences, The University of Texas at Austin, Austin, Texas 78712, USA

Ilya J. Finkelstein
Department of Molecular Biosciences, The University of Texas at Austin, Austin, Texas 78712, USA, Center for Systems and Synthetic Biology, The University of Texas at Austin, Austin, Texas, 78712, USA

Claus O. Wilke
Department of Integrative Biology, The University of Texas at Austin, Austin, Texas 78712, USA

Editor

Charlotte Soneson

Committers

Last synced: 7 months ago

All Time

Total Commits: 432
Total Committers: 4
Avg Commits per committer: 108.0
Development Distribution Score (DDS): 0.444

Past Year

Commits: 0
Committers: 0
Avg Commits per committer: 0.0
Development Distribution Score (DDS): 0.0

Top Committers

Name	Email	Commits
Alexis Hill	a**3@g**m	240
Jim Rybarski	j**m@r**m	190
dependabot[bot]	4****]	1
James Rybarski	r**j@w**u	1

Committer Domains (Top 20 + Academic)

wilkcomp01.ccbb.utexas.edu: 1
rybarski.com: 1

Issues and Pull Requests

Last synced: 6 months ago

All Time

Total issues: 34
Total pull requests: 67
Average time to close issues: 7 months
Average time to close pull requests: 2 days
Total issue authors: 5
Total pull request authors: 3
Average comments per issue: 0.44
Average comments per pull request: 0.24
Merged pull requests: 66
Bot issues: 0
Bot pull requests: 1

Past Year

Issues: 0
Pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Issue authors: 0
Pull request authors: 0
Average comments per issue: 0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Top Authors

Issue Authors

jimrybarski (13)
alexismhill3 (12)
afrubin (6)
Thomieh73 (2)
Jigyasa3 (1)

Pull Request Authors

jimrybarski (42)
alexismhill3 (24)
dependabot[bot] (1)

Top Labels

Issue Labels

tests (5) enhancement (4) bug (3) documentation (2) wontfix (2) maintenance (1)

Pull Request Labels

bug (1) dependencies (1)

Packages

Total packages: 1
Total downloads:
- pypi 13 last-month

Total dependent packages: 0
Total dependent repositories: 1
Total versions: 3
Total maintainers: 2

pypi.org: opfi

A package for discovery, annotation, and analysis of gene clusters in genomics or metagenomics datasets.

Homepage: https://github.com/wilkelab/Opfi
Documentation: https://opfi.readthedocs.io/
License: MIT License
Latest release: 0.1.2 (published over 4 years ago)

Versions: 3
Dependent Packages: 0
Dependent Repositories: 1
Downloads: 13 Last month

Rankings

Dependent packages count: 10.1%

Stargazers count: 13.2%

Forks count: 14.3%

Dependent repos count: 21.6%

Average: 22.1%

Downloads: 51.6%

Maintainers (2)

alexismhill3 clauswilke

Last synced: 6 months ago

Dependencies

docs/requirements.txt pypi

docutils ==0.16
sphinx ==4.0.1
sphinx-rtd-theme ==0.5.2

requirements.txt pypi

PyYAML ==5.4
biopython ==1.76
dna-features-viewer ==3.0.1
hypothesis ==5.1.1
matplotlib ==3.2.1
more-itertools ==8.4.0
parasail ==1.2
pytest ==5.3.2

setup.py pypi

PyYAML ==5.4
biopython ==1.76
dna-features-viewer ==3.0.1
hypothesis ==5.1.1
matplotlib ==3.2.1
more-itertools ==8.4.0
parasail ==1.2
pytest ==5.3.2
