https://github.com/acenglish/regioners
A rust implementation of regioneR for interval overlap permutation testing
Basic Info
- Host: GitHub
- Owner: ACEnglish
- Language: Rust
- Default Branch: main
- Size: 211 KB
Statistics
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
regioners
A rust implementation of regioneR for interval overlap permutation testing.
Install
Binaries are available under releases.
Or, build from the repo:
```bash
git clone https://github.com/ACEnglish/regioners
cd regioners
cargo build --release
# build with progress bars by adding --features progbars
# executable in ./target/release/regioners
```
Quick Start
Check for the significance of CpG islands' intersection with promoters:
```bash
# Download test beds
bash testbeds/trackgetter.sh
# Test
./target/release/regioners -g testbeds/grch38.genome.txt \
    -A testbeds/grch38.epdpromoters.bed \
    -B testbeds/grch38.cpg_islands.bed \
    -o cpgiVprom.json
# Look at all options available
./target/release/regioners -h
```
Introduction
regioners performs a permutation test on the intersection of two bed files. It first counts the number of intersections
between the two bed files, then randomly shuffles one of the bed files and counts the intersections again.
This shuffling/counting is repeated --num-times times. The mean and standard deviation of the permuted counts are compared to the
original intersection count and a p-value is computed.
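
The summary step can be sketched in Python as follows. This is a minimal illustration, not the regioners implementation: the function name permutation_summary, the empirical p-value formula, and the use of the population standard deviation are assumptions. Only the key names mirror the output fields described in this README.

```python
import statistics

def permutation_summary(observed, perms):
    """Compare an observed overlap count to counts from shuffled inputs.

    Hypothetical helper: the p-value formula below is a common empirical
    estimate and is an assumption, not taken from the regioners source.
    """
    mean = statistics.mean(perms)
    stddev = statistics.pstdev(perms)  # assumption: population std. dev.
    # Test in the direction the observation lies relative to the mean.
    alt = "g" if observed >= mean else "l"
    if alt == "g":
        extreme = sum(1 for c in perms if c >= observed)
    else:
        extreme = sum(1 for c in perms if c <= observed)
    return {
        "alt": alt,
        "mean": mean,
        "numperms": len(perms),
        "observed": observed,
        "pval": (extreme + 1) / (len(perms) + 1),
        "stddev": stddev,
        "z_score": (observed - mean) / stddev if stddev > 0 else float("nan"),
    }

# Toy counts, for illustration only.
summary = permutation_summary(10, [1, 2, 3, 4, 5])
print(summary["alt"], summary["mean"], round(summary["pval"], 3))
```

Adding one to both the numerator and denominator keeps the estimated p-value from ever being exactly zero, which is why more permutations allow smaller reportable p-values.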
Parameter details
There are a number of options for controlling how regioners runs. Most have to do with IO and four are important for
the tests.
Randomization strategy --random [shuffle | circle | novl]
How intervals are randomized is an important part of the permutation test. By default, regioners will randomly
shuffle each region. For example, two regions at (x1, y1), (x2, y2) will each get a random shift (r) to
(x1±r1, y1±r1) and (x2±r2, y2±r2).
With circle, all regions are shifted by the same amount so that their spatial distances are preserved, i.e.
(x1±r1, y1±r1), (x2±r1, y2±r1).
The novl method is much like the shuffle method, except that regions won't overlap after shuffling. This is achieved
by looking at all uncovered spans of the genome and randomly breaking them apart into smaller segments. novl
then shuffles all regions with the uncovered segments. This shuffled list is then re-placed along the genome,
discarding the uncovered segments and updating the regions to their new position. Note that this strategy is slightly
less random. See src/gapbreaks.rs for details.
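
The difference between shuffle and circle can be sketched on a single chromosome. This is an illustrative Python sketch under stated assumptions, not a port of the Rust code: the function names are hypothetical, intervals are half-open (start, end) pairs, and the wrap-around handling in circle_regions (splitting a region that runs past the end) is a guess at one reasonable behavior. novl is not shown; see src/gapbreaks.rs for that.

```python
import random

def shuffle_regions(regions, genome_len, rng):
    """shuffle: every region gets its own random position."""
    out = []
    for start, end in regions:
        length = end - start
        new_start = rng.randrange(genome_len - length + 1)
        out.append((new_start, new_start + length))
    return sorted(out)

def circle_regions(regions, genome_len, rng):
    """circle: one shared shift, so spacing between regions is preserved."""
    shift = rng.randrange(genome_len)
    out = []
    for start, end in regions:
        new_start = (start + shift) % genome_len
        new_end = new_start + (end - start)
        if new_end <= genome_len:
            out.append((new_start, new_end))
        else:
            # Assumption: a region pushed past the end is split and wraps.
            out.append((new_start, genome_len))
            out.append((0, new_end - genome_len))
    return sorted(out)

rng = random.Random(0)
regions = [(100, 200), (500, 650)]
shuffled = shuffle_regions(regions, 1000, rng)
circled = circle_regions(regions, 1000, rng)
print(shuffled)
print(circled)
```

In both cases the total number of covered bases is unchanged. Only circle keeps the relative spacing of the regions.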
Controlling placement with --per-chrom
Some intervals shouldn't be shuffled across chromosomes. For example, genes are not randomly
distributed across chromosomes (ref).
Therefore, the randomization strategy may need to limit where intervals are moved.
The --per-chrom flag will keep intervals on their same chromosome.
Counting strategy --count [all | any]
The default, all, counts every overlap as an intersection. For example, if one -A region hits two -B regions,
that counts as two intersections. With any, only the presence of an intersection is counted, so the example above would count
as a single intersection.
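
The two counting modes can be illustrated with a brute-force Python sketch. The function name count_overlaps and the half-open interval convention are assumptions made for the example. A real implementation would use a faster interval lookup than this nested loop.

```python
def count_overlaps(a_regions, b_regions, mode="all"):
    """Count intersections between half-open (start, end) intervals.

    mode="all": every A/B overlapping pair counts.
    mode="any": each A region counts at most once.
    """
    total = 0
    for a_start, a_end in a_regions:
        hits = sum(
            1 for b_start, b_end in b_regions
            if a_start < b_end and b_start < a_end
        )
        total += hits if mode == "all" else min(hits, 1)
    return total

# One -A region hitting two -B regions, as in the example above.
a = [(0, 100)]
b = [(10, 20), (50, 60), (200, 300)]
count_all = count_overlaps(a, b, mode="all")
count_any = count_overlaps(a, b, mode="any")
print(count_all, count_any)
```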
Excluding genomic regions with --mask
The genome may have regions where intervals should not be placed (e.g. reference gaps). Input intervals overlapping masked regions are removed and randomization will not place intervals there.
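
The first half of that behavior, dropping input intervals that touch a masked region, could look like the Python sketch below. The name drop_masked is hypothetical and the intervals are assumed to be half-open (start, end) pairs on one chromosome. Keeping randomized placements out of masked regions is not shown.

```python
def drop_masked(regions, mask):
    """Return only the regions that overlap no masked interval."""
    return [
        (start, end)
        for start, end in regions
        if not any(start < m_end and m_start < end for m_start, m_end in mask)
    ]

regions = [(0, 50), (90, 120), (300, 400)]
mask = [(100, 200)]  # e.g. a reference gap
kept = drop_masked(regions, mask)
print(kept)
```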
Local z-score --window and --step
regioners will calculate a local z-score for the two intervals' overlap
(details).
The --window is how many basepairs upstream and downstream the intervals will be shifted to perform the local z-score and the
--step is the step size of the windows. For example, with a 1,000bp --window and --step of 100bp, the output will
have 20 local z-scores.
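
The arithmetic behind that count, and one way a local z-score could be computed, is sketched below in Python. The shift positions follow the range(-window, window, step) convention used in the plotting example in this README. Everything else (the function names, shifting -A rather than -B, and reusing the permutation mean and standard deviation at every shift) is an assumption for illustration.

```python
def count_overlaps(a_regions, b_regions):
    """Count overlapping pairs of half-open (start, end) intervals."""
    return sum(
        1
        for a_start, a_end in a_regions
        for b_start, b_end in b_regions
        if a_start < b_end and b_start < a_end
    )

def local_z_scores(a_regions, b_regions, mean, stddev, window, step):
    """Recompute the z-score after sliding A by each shift in the window."""
    shifts = list(range(-window, window, step))
    scores = []
    for shift in shifts:
        moved = [(start + shift, end + shift) for start, end in a_regions]
        scores.append((count_overlaps(moved, b_regions) - mean) / stddev)
    return shifts, scores

# Placeholder mean/stddev standing in for real permutation results.
shifts, scores = local_z_scores(
    [(1000, 1100)], [(1050, 1150)], mean=0.2, stddev=0.4, window=1000, step=100
)
print(len(shifts), shifts[0], shifts[-1])
```

A z-score that falls off quickly as the shift moves away from zero suggests the association depends on the exact positions of the regions.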
IO parameters
- --genome: A two-column file with chrom\tsize. This becomes the space over which we can shuffle regions. If there are any regions in the bed files on chromosomes not inside the --genome file, those regions will not be loaded.
- -A and -B: Bed files with genomic regions to test. They must be sorted and every start < stop.
- --num-times: Number of permutations to perform. See this for help on selecting a value.
- --no-merge-ovl: Turn off merging of overlapping intervals in -A and -B before processing. Incompatible with --random novl.
- --no-swap: Turn off swapping -A and -B if -A contains fewer intervals.
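
The loading behavior described for --genome, -A/-B, and the default overlap merging can be sketched in Python. All function names here are hypothetical, and whether regioners also merges intervals that merely touch is not stated in this README. The sketch merges only intervals that strictly overlap.

```python
def parse_genome(lines):
    """Parse tab-separated chrom/size lines into a dict."""
    genome = {}
    for line in lines:
        if line.strip():
            chrom, size = line.rstrip("\n").split("\t")[:2]
            genome[chrom] = int(size)
    return genome

def load_regions(bed_lines, genome):
    """Keep only BED records whose chromosome is in the genome file."""
    regions = []
    for line in bed_lines:
        if not line.strip():
            continue
        chrom, start, end = line.rstrip("\n").split("\t")[:3]
        if chrom in genome:
            regions.append((chrom, int(start), int(end)))
    return regions

def merge_overlaps(regions):
    """Merge overlapping intervals per chromosome (assumes start < end)."""
    merged = []
    for chrom, start, end in sorted(regions):
        if merged and merged[-1][0] == chrom and start < merged[-1][2]:
            prev = merged[-1]
            merged[-1] = (chrom, prev[1], max(prev[2], end))
        else:
            merged.append((chrom, start, end))
    return merged

genome = parse_genome(["chr1\t1000\n", "chr2\t500\n"])
bed = ["chr1\t0\t10\n", "chr1\t5\t20\n", "chr1\t30\t40\n", "chrUn\t0\t5\n"]
merged = merge_overlaps(load_regions(bed, genome))
print(merged)
```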
Performance Test
Test of 1,000 permutations on 29,598 promoter regions tested against 1,784,804 TRs using 4 cores on a MacBook. For comparison, a regioneR test of 100 permutations on the same data in an RStudio Docker container took 1292.313s.
- --random shuffle: 3.4s
- --random shuffle --per-chrom : 3.2s
- --random circle : 2.8s
- --random circle --per-chrom : 2.8s
- --random novl : 11.0s
- --random novl --per-chrom : 6.4s
Output
The output is a json with structure:
- Acnt : number of entries in -A (note: may be swapped from the original parameter)
- Bcnt : number of entries in -B (note: may be swapped from the original parameter)
- count : overlap counter used
- nomerge : if true, overlaps in the input beds were not merged before processing
- perchrom : randomization performed per-chromosome
- random : randomizer used
- swapped : were -A and -B swapped
- test : dictionary of test results
- localZ : dictionary of local z-score results
Test Key/Values
- alt : alternate hypothesis used for p-value - 'l'ess or 'g'reater
- mean : average number of overlaps of the permutations
- numperms : number of permutations performed
- observed : observed number of intersections
- pval : permutation test's p-value
- perms : list of permutations' number of intersections
- stddev : permutations' standard deviation
- z_score : permutation test's z-score

LocalZ Key/Values
- shifts : list of z-scores for each shift
- step : step size used
- window : window size used
Plotting
Using python with seaborn:
```python
import json
import seaborn as sb
import matplotlib.pyplot as plt

# Load results and get the test information for plotting
with open("regioners_output.json") as fh:
    results = json.load(fh)
test = results['test']

# Draw the permutations' distribution
p = sb.histplot(data=test, x="perms", color='gray', edgecolor='gray', kde=False, stat='density')
p = sb.kdeplot(data=test, x="perms", color='black', ax=p)

# Draw a line at the observed intersections
obs = test['observed']
plt.axvline(obs, color='blue')

# Draw a box for annotation
props = dict(boxstyle='round', facecolor='wheat', alpha=0.9)
y = 0.007
plt.text(obs, y, 'observed intersections', rotation=90, bbox=props, ma='center')
p.set(xlabel="Intersection Count", ylabel="Permutation Density")
plt.show()

# Plot the local z-scores
localz = results["localZ"]
p = sb.lineplot(x=range(-localz["window"], localz["window"], localz["step"]),
                y=localz['shifts'])
p.set(title="Local z-score values", xlabel="Shift", ylabel="z-score")
plt.show()
```

Future Features?:
- gzip file reading
Owner
- Name: Adam English
- Login: ACEnglish
- Kind: user
- Repositories: 7
- Profile: https://github.com/ACEnglish