fardes

Features Arrangement Description Miniformat

https://github.com/ggonnella/fardes

Science Score: 54.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Committers with academic emails
    2 of 2 committers (100.0%) from academic institutions
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (7.8%) to scientific vocabulary
Last synced: 6 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: ggonnella
  • License: other
  • Language: Python
  • Default Branch: main
  • Size: 39.1 KB
Statistics
  • Stars: 0
  • Watchers: 1
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Created about 3 years ago · Last pushed about 3 years ago
Metadata Files
Readme Changelog License Citation Authors

README.md

Fardes: Features Arrangement Description Miniformat

The miniformat described here makes it possible to describe the relative arrangement of named sequence features on one or more molecules: their order, the length of the interval between them, the possible presence of further features between them, their strand, and whether they lie on the same or on different molecules.

It was developed for an application in which the expected contents of prokaryotic genomes are expressed as a set of rules, some of which concern the relative arrangement of features.

Specification

The miniformat is described in the Markdown document SPECIFICATION.md in this repository.

Examples

Here are examples of how the format can be used to express different arrangements:

A,B,C,D: this is a list without any interval specifications, thus the features (whose IDs are given) will just follow each other without any other relevant feature in between.

A,B?,C: A may be followed by B and is always followed by C.

A,1,B,C: in this case, between A and B, there is exactly one further feature.

A,1(gene),C: in this case, between A and C, there is exactly one feature, which is of type gene.

A,3(rRNA;tRNA),C: in this case, between A and C, there are 3 features, each of type rRNA or tRNA.

A,8:10,B: in this case, between A and B, there are 8 to 10 other features.

A,1:*,B or A,>=1,B: these are two equivalent ways to express the fact that between A and B there is at least one other feature.

A,<10,B: between A and B there are fewer than 10 features (max 9).

A,0[1000:3000],B: there are no features between A and B, but there are between 1000 and 3000 bases.

A,0:1[1kb:3kb],B or A,<1[1kb:3kb],B: there are between 1000 and 3000 bases, and possibly a feature, in this interval.

A,[>30kbp],B or A,>=0[>30kbp],B: there are more than 30000 bases between A and B, including any number of features.

A,><,B,<>,C: A and B are close to each other (and thus also on the same molecule) and distant from C (which can be on the same molecule or another)

A,><,B,<.>,C: A and B are close to each other and distant from C, but all three are on the same molecule

A,><,B,<|>,C: A and B are close to each other on the same molecule, while C is on another molecule

A,&,^B: A and B overlap each other, on different strands

A,B,>C,^D: the order of the features is A, B, C and D, with no other feature in between; C and D are on opposite strands, while A and B can be on any strand.

A,B,=C: the feature C is on the same strand as A, but B can be on the same or on the opposite strand.

A,><,=B,><,=C: all three features are on the same strand and close to each other, with no features in between.
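The feature-count part of an interval in the examples above (8:10, 1:*, >=1, <10, or a plain number) can be read as a (min, max) pair. The following is a minimal sketch of that reading, not the parser shipped with the fardes package; the function name parse_count is hypothetical, and it handles only the count forms that appear in this README.

```python
import re
from typing import Optional, Tuple


def parse_count(spec: str) -> Tuple[int, Optional[int]]:
    """Return (min, max) for a feature-count spec; max=None means unbounded.

    Illustrative only: covers the forms shown in the README examples
    (N, N:M, N:*, >=N, >N, <N), not the full specification.
    """
    spec = spec.strip()

    # Range form: "8:10" or "1:*"
    m = re.fullmatch(r"(\d+):(\d+|\*)", spec)
    if m:
        lo = int(m.group(1))
        hi = None if m.group(2) == "*" else int(m.group(2))
        if hi is not None and hi < lo:
            raise ValueError(f"max smaller than min in {spec!r}")
        return lo, hi

    # Comparison form: ">=1", ">0", "<10"
    m = re.fullmatch(r"(>=|>|<)(\d+)", spec)
    if m:
        op, n = m.group(1), int(m.group(2))
        if op == ">=":
            return n, None
        if op == ">":
            return n + 1, None
        if n == 0:
            raise ValueError("'<0' cannot be satisfied")
        return 0, n - 1  # "<10" means at most 9

    # Plain number: exactly N features
    if re.fullmatch(r"\d+", spec):
        n = int(spec)
        return n, n

    raise ValueError(f"unsupported count spec: {spec!r}")


for example in ["3", "8:10", "1:*", ">=1", "<10"]:
    print(example, "->", parse_count(example))
```

Under this reading, 1:* and >=1 give the same pair, which matches the equivalence stated in the examples.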

Limitations

The format is designed to be as simple as possible. There is no way to express a branched graph structure; instead, each possible path must be spelled out linearly.

A possible way (not yet implemented) to introduce branching could be a syntax such as A,1,{,B,C,|,C,{,B,|,D,E},},F, expressing the set of paths A,1,B,C,F, A,1,C,B,F and A,1,C,D,E,F. This would require implementing additional validations, to check that the branch-opening { and branch-closing } are balanced and that the branch separator | is used properly.
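To make the proposal concrete, here is a minimal sketch of how such a branched string could be expanded into its linear paths, with the balance checks mentioned above. It is not part of the fardes package: the function names are hypothetical, and the tokenizer assumes that { and } are used only as branch delimiters and that | counts as a separator only when it stands alone between commas (so that the <|> interval is left untouched).

```python
import re
from typing import List, Tuple


def expand_branches(spec: str) -> List[str]:
    """Expand the proposed {...|...} branching syntax into linear paths."""
    # Braces become tokens of their own; everything else is split on commas.
    tokens = re.findall(r"[{}]|[^,{}]+", spec)
    paths, pos = _parse_sequence(tokens, 0, depth=0)
    return [",".join(path) for path in paths]


def _parse_sequence(tokens: List[str], pos: int,
                    depth: int) -> Tuple[List[List[str]], int]:
    """Parse tokens until '|', '}' or the end; return (paths, next position)."""
    paths: List[List[str]] = [[]]
    while pos < len(tokens):
        tok = tokens[pos]
        if tok in ("|", "}"):
            if depth == 0:
                raise ValueError(f"unexpected {tok!r} outside of a branch")
            break
        if tok == "{":
            alternatives, pos = _parse_group(tokens, pos + 1, depth + 1)
            # Every path so far continues with every alternative.
            paths = [p + alt for p in paths for alt in alternatives]
        else:
            paths = [p + [tok] for p in paths]
            pos += 1
    return paths, pos


def _parse_group(tokens: List[str], pos: int,
                 depth: int) -> Tuple[List[List[str]], int]:
    """Parse the alternatives after '{', up to and including the matching '}'."""
    alternatives: List[List[str]] = []
    while True:
        alt_paths, pos = _parse_sequence(tokens, pos, depth)
        alternatives.extend(alt_paths)
        if pos >= len(tokens):
            raise ValueError("unbalanced '{': missing closing '}'")
        if tokens[pos] == "}":
            return alternatives, pos + 1
        pos += 1  # skip the '|' separator


print(expand_branches("A,1,{,B,C,|,C,{,B,|,D,E},},F"))
# ['A,1,B,C,F', 'A,1,C,B,F', 'A,1,C,D,E,F']
```

Each expanded path is an ordinary linear string, so it could then be handed to the existing parser without changes to the format itself.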

Implementation as a Python package

The miniformat has been implemented as a TextFormats specification (fardes.tf.yaml).

This has been included in a Python module, fardes, which additionally includes cross-checks that cannot be expressed in TextFormats and normalizes the elements while parsing a string (e.g. by including implicit values and applying multipliers). The module can be installed using pip install fardes.
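As an illustration of what this normalization involves, the sketch below turns a length specification such as 1kb:3kb, >2Mb or ~3kb into explicit numbers. It is a simplified reimplementation for explanation only, not the code of the fardes module; the helper names are hypothetical, and only the multipliers (k, M) and unit suffixes (b, bp) that appear in this README are assumed.

```python
import re
from typing import Dict, Optional

# Assumption: only the multipliers seen in the README examples are handled.
_MULTIPLIERS = {"": 1, "k": 1_000, "M": 1_000_000}


def parse_length(text: str) -> int:
    """Convert e.g. '1000', '3kb', '30kbp' or '2Mb' into a number of bases."""
    m = re.fullmatch(r"(\d+)([kM]?)(?:bp|b)?", text)
    if not m:
        raise ValueError(f"unsupported length: {text!r}")
    return int(m.group(1)) * _MULTIPLIERS[m.group(2)]


def parse_length_range(spec: str) -> Dict[str, Optional[int]]:
    """Normalize a length spec into explicit 'min'/'max' (or 'approx') values."""
    if spec.startswith("~"):
        return {"approx": parse_length(spec[1:])}
    if spec.startswith(">"):
        # Strictly greater than: the minimum is one base more.
        return {"min": parse_length(spec[1:]) + 1, "max": None}
    if ":" in spec:
        lo, hi = spec.split(":", 1)
        return {"min": parse_length(lo),
                "max": None if hi == "*" else parse_length(hi)}
    n = parse_length(spec)
    return {"min": n, "max": n}  # a single value means an exact length


for example in ["1kb:3kb", ">2Mb", "~3kb", "3:*", "2"]:
    print(example, "->", parse_length_range(example))
```

The resulting dictionaries have the same shape as the length entries in the parser output shown in the usage example in this README.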

Example usage of the Python parser

Here is an example of usage of the module:

    import fardes
    elements = fardes.parse("A,1:10[1kb:3kb],>B,1(rRNA;tRNA),>C,1[2],>D,=E,[3:*],F,1:*[>2Mb],G,<>,H,>0,I,<4,J,[~3kb],K,<|>,L,><,M,&,N")

will result in the following:

    [{'type': 'unit', 'unit': 'A', 'prefix': ''},
     {'type': 'interval', 'length': {'min': 1000, 'max': 3000}, 'n_features': {'min': 1, 'max': 10}},
     {'type': 'unit', 'unit': 'B', 'prefix': '>'},
     {'type': 'interval', 'length': {'min': 0, 'max': None}, 'n_features': {'min': 1, 'max': 1, 'type_spec': {'types': ['rRNA', 'tRNA']}}},
     {'type': 'unit', 'unit': 'C', 'prefix': '>'},
     {'type': 'interval', 'length': {'min': 2, 'max': 2}, 'n_features': {'min': 1, 'max': 1}},
     {'type': 'unit', 'unit': 'D', 'prefix': '>'},
     {'type': 'interval', 'length': {'min': 0, 'max': None}, 'n_features': {'min': 0, 'max': 0}},
     {'type': 'unit', 'unit': 'E', 'prefix': '='},
     {'type': 'interval', 'length': {'min': 3, 'max': None}, 'n_features': {'min': 0, 'max': None}},
     {'type': 'unit', 'unit': 'F', 'prefix': ''},
     {'type': 'interval', 'length': {'min': 2000001, 'max': None}, 'n_features': {'min': 1, 'max': None}},
     {'type': 'unit', 'unit': 'G', 'prefix': ''},
     {'type': 'interval', 'special': 'distant'},
     {'type': 'unit', 'unit': 'H', 'prefix': ''},
     {'type': 'interval', 'length': {'min': 0, 'max': None}, 'n_features': {'min': 1, 'max': None}},
     {'type': 'unit', 'unit': 'I', 'prefix': ''},
     {'type': 'interval', 'length': {'min': 0, 'max': None}, 'n_features': {'min': 0, 'max': 3}},
     {'type': 'unit', 'unit': 'J', 'prefix': ''},
     {'type': 'interval', 'length': {'approx': 3000}, 'n_features': {'min': 0, 'max': None}},
     {'type': 'unit', 'unit': 'K', 'prefix': ''},
     {'type': 'interval', 'special': 'other_molecule'},
     {'type': 'unit', 'unit': 'L', 'prefix': ''},
     {'type': 'interval', 'special': 'near'},
     {'type': 'unit', 'unit': 'M', 'prefix': ''},
     {'type': 'interval', 'special': 'overlap'},
     {'type': 'unit', 'unit': 'N', 'prefix': ''}]

Acknowledgements

This specification has been created in the context of the DFG project GO 3192/1-1 “Automated characterization of microbial genomes and metagenomes by collection and verification of association rules”. The funders had no role in study design, data collection and analysis.

Name

The name Fardes is an acronym for "feature arrangement description". After naming the project, I noticed that, according to Wiktionary, in Belgian French a "farde" (plural: fardes) is a file, in the sense of stationery used to keep documents together. This fits the purpose of the format well.

Owner

  • Name: Giorgio Gonnella
  • Login: ggonnella
  • Kind: user
  • Location: Goettingen, Germany
  • Company: Bioinformatics, University of Goettingen

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
- family-names: "Gonnella"
  given-names: "Giorgio"
  orcid: "https://orcid.org/0000-0003-3900-5397"
title: "The EGC format and GenExpect: representation and storage of rules about prokaryotic genome contents"
version: 1.0
date-released: 2023-02-24
url: "https://github.com/ggonnella/fardes/"

Committers

Last synced: 8 months ago

All Time
  • Total Commits: 27
  • Total Committers: 2
  • Avg Commits per committer: 13.5
  • Development Distribution Score (DDS): 0.148
Past Year
  • Commits: 0
  • Committers: 0
  • Avg Commits per committer: 0.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
Giorgio Gonnella g****a@z****e 23
Giorgio Gonnella g****a@u****e 4

Issues and Pull Requests

Last synced: 7 months ago

All Time
  • Total issues: 0
  • Total pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Total issue authors: 0
  • Total pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0

Packages

  • Total packages: 1
  • Total downloads:
    • pypi 21 last-month
  • Total dependent packages: 0
  • Total dependent repositories: 0
  • Total versions: 3
  • Total maintainers: 1
pypi.org: fardes

A miniformat for expressing arrangements of sequence features

  • Versions: 3
  • Dependent Packages: 0
  • Dependent Repositories: 0
  • Downloads: 21 Last month
Rankings
Dependent packages count: 6.6%
Downloads: 16.6%
Average: 24.7%
Forks count: 30.5%
Dependent repos count: 30.6%
Stargazers count: 39.1%
Maintainers (1)
Last synced: 7 months ago