hicberg

Statistical profiling based program for contact (Hi-C) and paired-end genomic data reconstruction

https://github.com/sebgra/hicberg

Science Score: 49.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 3 DOI reference(s) in README
  • Academic publication links
    Links to: zenodo.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (11.7%) to scientific vocabulary

Keywords

bioinformatics genomics hi-c statistics
Last synced: 6 months ago · JSON representation

Repository

Statistical profiling based program for contact (Hi-C) and paired-end genomic data reconstruction

Basic Info
Statistics
  • Stars: 2
  • Watchers: 4
  • Forks: 1
  • Open Issues: 8
  • Releases: 2
Topics
bioinformatics genomics hi-c statistics
Created over 2 years ago · Last pushed 8 months ago
Metadata Files
Readme License Citation

README.md

HiC-BERG

Badges

MIT License GPLv3 License AGPL License codecov

Python package to reconstruct Hi-C contact maps.

Documentation

Documentation and tutorial: https://sebgra.github.io/hicberg/


Environment and dependencies

Environment

Create the environment using the following command:

```bash
mamba env create -n [ENV_NAME] -f hicberg.yaml
```

Dependencies

To ensure that HiC-BERG works correctly, Bowtie2, Samtools, bedGraphToBigWig and BedTools have to be installed. These can be installed through:

```bash
mamba install bowtie2 -c bioconda
mamba install samtools -c bioconda
mamba install -c bioconda ucsc-bedgraphtobigwig
mamba install bedtools -c bioconda
```
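
A quick way to check that these tools are visible on your PATH, using only the Python standard library (an optional sanity check, not part of HiC-BERG):

```python
# Optional sanity check (illustration only): verify that the external tools
# HiC-BERG relies on are available on the PATH.
import shutil

for tool in ("bowtie2", "samtools", "bedtools", "bedGraphToBigWig"):
    print(f"{tool}: {'found' if shutil.which(tool) else 'MISSING'}")
```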

Depending on your aligner preferences, BWA and Minimap2 can be installed through:

```bash
mamba install bioconda::bwa
mamba install bioconda::minimap2
```

Installation

Install HiC-BERG with pip:

```bash
pip install hicberg
```

or in developer mode:

```bash
mamba activate [ENV_NAME]
pip install -e .
```

pip

Install HiC-BERG locally by using

```bash

pip install -e .

```

Conda / Mamba

We highly recommend installing HiC-BERG through Mamba.

```bash
conda install -c bioconda hicberg
```

```bash
wget "https://github.com/conda-forge/miniforge/releases/latest/download/Mambaforge-$(uname)-$(uname -m).sh"
bash Mambaforge-$(uname)-$(uname -m).sh
mamba create -n hicberg python=3.11.4
mamba activate hicberg
mamba install bioconda::bowtie2
mamba install bioconda::samtools
mamba install bioconda::bedtools
mamba install bioconda::ucsc-bedgraphtobigwig

# For exhaustive aligner usage
mamba install bioconda::bwa
mamba install bioconda::minimap2
```

Docker

HiC-BERG can be used via Docker to limit dependency compatibility problems. More information about containers such as Docker can be found there.

To be used through Docker, the HiC-BERG container has to be built using:

```bash
# Pick user and group variables
export HOST_UID=$(id -u);
export HOST_GID=$(id -g);

# Build container
sudo docker build --build-arg UID=$HOST_UID --build-arg GID=$HOST_GID -t hicberg:1.0.1 .
```

HiC-BERG can then be used with the usual commands detailed below:

```bash
sudo docker run -v $PWD:$PWD -w <working directory> -u $(id -u):$(id -g) hicberg:1.0.1 <hicberg_command>
```

N.B: the working directory (-w) and the output (-o argument) of the HiC-BERG command have to be the same.

For instance, to access the help of HiC-BERG through Docker:

```bash
sudo docker run -v $PWD:$PWD -w <working directory> -u $(id -u):$(id -g) hicberg:1.0.1 hicberg --help
```

HiC-BERG can also be used through Docker in interactive mode:

```bash
sudo docker run -v $PWD:$PWD -w <working directory> -u $(id -u):$(id -g) -it hicberg:1.0.1
```

The user is then placed directly in an interactive shell where HiC-BERG can be used by typing any HiC-BERG command, such as the examples below.

Usage/Examples

Full pipeline

HiC-BERG

All components of the pipeline can be run at once using the hicberg pipeline command. This generates a contact matrix and its reconstruction from reads in a single command. By default, the output is in COOL format. More detailed documentation can be found on the readthedocs website:

https://sebgra.github.io/hicberg/

```bash

hicberg pipeline [--enzyme=["DpnII", "HinfI"]] [--distance=1000] [--rate=1.0] [--cpus=1] [--mode="full"] [--aligner="bowtie2"] [--read-type="sr"] [--max-alignments=None] [--sensitivity="very-sensitive"] [--bins=2000] [--circular=""] [--mapq=35] [--kernel-size=11] [--deviation=0.5] [--start-stage="fastq"] [--exit-stage=None] [--output=DIR] [--index=None] [--blacklist=STR] [--force] [--config=STR]

```

For example, to run the pipeline using 8 threads, performing alignments with bowtie2 in default mode, using ARIMA Hi-C kit enzymes (DpnII & HinfI) without blacklisting, and to generate a matrix and its reconstruction in the directory out:

```bash
hicberg pipeline -e DpnII -e HinfI --cpus 8 -o out/ genome.fa reads_for.fq rev_reads.fq
```

Snakemake usage

Configuration

Several parameters and Hi-C libraries can be set in the file config.yaml. The parameters are the following:

```yaml
samples: "config/samples.csv"
base_dir: # Path to the base directory containing data
out_dir: # Path where to save results
ref_dir: # Path to the folder containing genomes
fastq_dir: # Path to the folder containing fastq files

name: library_name
ref: # Path to the reference genome from ref_dir
R1: # Path to the forward reads file from fastq_dir
R2: # Path to the reverse reads file from fastq_dir
circular: # Comma-separated list of circular chromosomes
enzymes: # Comma-separated list of enzymes used for the experiment
sampling_rate: # Sampling rate of the restriction map
res: # Resolution of the contact matrix (in bp)
```

The samples.csv file can be used to set the parameters for each library. The file is a csv file with the following columns:

```csv
library;species;sampling_rates;enzymes;kernel_sizes;deviations;modes;resolutions;max_reports;circularity

# name_1
library_name_1;species_1;sampling_rate_1;enzymes_1;kernel_sizes_1;deviations_1;mode_1;resolutions_1;max_reports_1

# name_2
library_name_2;species_2;sampling_rate_2;enzymes_2;kernel_sizes_2;deviations_2;mode_2;resolutions_2;max_reports_2
```
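
For reference, such a samples.csv can be read with pandas; a minimal sketch, assuming the semicolon-separated layout and the column names shown above:

```python
# Minimal sketch (illustration only): read the semicolon-separated samples.csv
# described above, skipping comment rows. Column names follow the example header.
import pandas as pd

samples = pd.read_csv("config/samples.csv", sep=";", comment="#")
for _, row in samples.iterrows():
    print(row["library"], row["enzymes"], row["modes"])
```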

Run

Locally

The HiC-BERG pipeline can be run using Snakemake. The pipeline is defined in the file Snakefile. The pipeline can be run using the following command:

```bash
snakemake --cores [cpus]
```

Cluster

HiC-BERG can also be run on a cluster using Snakemake. The cluster configuration is defined in the file cluster_slurm.yaml. The pipeline can be run using the following command:

```bash
snakemake --cluster "sbatch --mem {cluster.mem} -p {cluster.partition} --qos {cluster.queue} -c {cluster.ncpus} -n {cluster.ntasks} -J {cluster.name} -o {cluster.output} -e {cluster.error}" --cluster-config config/cluster_slurm.yaml -j 16 --rerun-incomplete
```

As for the local run, the libraries to process are defined in the file samples.csv. Computational resources can be set in the file cluster_slurm.yaml. In this file, the different parameters have to be specified by rules. For instance:

```yaml
hicberg_step_0:
  queue: normal
  partition: common
  ncpus: 16
  mem: 32G
  ntasks: 1
  name: hicberg.{rule}.{wildcards}
  output: logs/cluster/{rule}.{wildcards}.out
  error: logs/cluster/{rule}.{wildcards}.err
```

Jobs will be sent to the cluster as usual sbatch commands. The fields queue, partition, ncpus, mem, ntasks, name, output and error are mandatory.

Considering the previous example, the following command will be sent to the cluster:

```bash
sbatch --mem 32G -p common -c 16 -n 1 -J hicberg.hicberg_step_0.{wildcards} -o logs/cluster/hicberg_step_0.{wildcards}.out -e logs/cluster/hicberg_step_0.{wildcards}.err
```

The same goes for each rule defined in the file cluster_slurm.yaml, across all the libraries specified in samples.csv.

Log files will be saved in the folder logs/cluster. The output and error files will be named {rule}.{wildcards}.out and {rule}.{wildcards}.err respectively, where the wildcards are the parameters specified in the file samples.csv.

N.B: The parameter --rerun-incomplete is used to restart the pipeline from the last step if it has been interrupted.

N.B 2: The parameter -j is used to specify the number of jobs to run in parallel. It is recommended to set this parameter to the number of libraries to process.

N.B 3: The resources allocated to each job can be modified in the file cluster_slurm.yaml. The parameters mem and ncpus specify the memory and the number of CPUs allocated to each job, ntasks the number of tasks to run in parallel, queue the queue, and partition the partition. The parameters name, output and error specify the name of the job and its output and error files.

Configuration Files

Hicberg allows you to manage pipeline parameters efficiently using configuration files in the INI format. This is particularly useful for complex runs or for sharing reproducible settings across multiple analyses.

INI File Format

Hicberg uses the standard INI file format. Each file is organized into [sections] and contains key = value pairs. Comments can be added using ; or #.

A template of such config file is provided in the templates folder.

Example config.ini:

```ini
; General pipeline settings
[General]
sample_name = my_experiment_run
start_stage = fastq
exit_stage = None              ; Set to 'None' for full pipeline, or a stage name
output_directory = ./results
verbose_logging = True
force_overwrite = False        ; Set to True to overwrite existing output directories

; Input file paths
[InputFiles]
genome_fasta = /path/to/my_data/genome.fasta
genome_index = /path/to/my_data/genome_index_prefix ; Optional: if not provided, Hicberg will build it
forward_reads = /path/to/my_data/reads_R1.fastq.gz
reverse_reads = /path/to/my_data/reads_R2.fastq.gz
blacklist_regions = None       ; Set to 'None' or leave blank if no blacklist, otherwise provide either a bed file or a comma-separated list of coordinates in UCSC format

; Alignment parameters
[Alignment]
aligner = bowtie2              ; 'bowtie2', 'bwa' or 'minimap2'
cpus = 8
sensitivity = very-sensitive   ; presets: very-fast, fast, sensitive, very-sensitive
max_alignment = None           ; Max alignments to report per read (int) or 'None' for unlimited
mapq = 35
read_type = sr                 ; "sr", "map-pb", "map-hifi", "map-ont", "splice", "splice:hq", "asm5", "asm10", "ava-pb" or "ava-ont"

; Processing parameters
[Processing]
enzyme = DpnII,HinfI           ; Comma-separated list for multiple enzymes (e.g. 'DpnII,HinfI')
circular_genome =              ; e.g. 'chrM' if the mitochondrial genome is circular, otherwise leave blank or 'None'
rate = 1.0                     ; Downsampling rate (float)
distance = 1000                ; Distance parameter (int) for omics mode
bins = 2000                    ; Contact map resolution (int)
mode = standard                ; 'full', 'standard', 'density', or 'omics'
kernel_size = 11               ; Kernel size for density calculation (int)
deviation = 0.5                ; Deviation for density calculation (float)
```
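
Since this is a standard INI file, it can be inspected with Python's built-in configparser; a minimal sketch, assuming the section and key names of the example above (they are illustrative, not an official hicberg API):

```python
# Minimal sketch (illustration only): read the example config.ini above with
# the Python standard library. Section/key names follow the example.
from configparser import ConfigParser

config = ConfigParser(inline_comment_prefixes=(";", "#"))
config.read("config.ini")

print(config.get("Alignment", "aligner"))     # bowtie2
print(config.getint("Processing", "bins"))    # 2000
print(config.getfloat("Processing", "rate"))  # 1.0
```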

Using a Configuration File

To run the pipeline using a configuration file, specify its path with the -C or --config-file option.

Important Note: The positional arguments <genome_fasta>, <R1_fastq>, and <R2_fastq> are always required, even when using a configuration file. These arguments specify the primary input files for your pipeline run.

```bash
hicberg pipeline -C my_config.ini <path_to_genome.fasta> <path_to_R1.fastq.gz> <path_to_R2.fastq.gz>
```

Individual components

I/O

Create folder

Be careful to create a folder before running the pipeline. The folder can be created using the following command:

```bash
hicberg create-folder --output=DIR [--name="folder_name"] [--force]
```

For example, to create a folder named "test" on the desktop:

```bash
hicberg create-folder -o ~/Desktop/ -n test
```

The folder architecture will be the following:

```
output/
├── alignments
├── contacts
├── index
├── plots
└── statistics
```

Preprocessing

After having created a folder with the previous command mentioned in create folder, the genome can be processed to generate the fragment file fragments_fixed_sizes.txt and the dictionary of chromosome sizes chromosome_sizes.npy using the following command:

```bash
hicberg get-tables --output=DIR --genome=FILE [--bins=2000]
```

For example, to generate these files in a folder named "test" previously created on the desktop, with a binning size of 2000 bp:

```bash
hicberg get-tables -o ~/Desktop/test/ --bins 2000 <genome>
```

The files fragments_fixed_sizes.txt and chromosome_sizes.npy will be generated in the folder output/.

Alignment

After having created a folder and generated the fragment file fragments_fixed_sizes.txt and the dictionary of chromosome sizes chromosome_sizes.npy, the reads can be aligned using the following command:

```bash
hicberg alignment --output=DIR [--cpus=1] [--aligner="bowtie2"] [--read-type="sr"] [--max-alignments=None] [--sensitivity="very-sensitive"] [--index=index] [--verbosity] <genome> <forward> <reverse>
```

For example to align reads in a folder named "test" previously created on the desktop with 8 threads:

```bash
hicberg alignment -o ~/Desktop/test/ --cpus 8 <genome.fa> <reads_for.fq> <rev_reads.fq>
```

If the user has already built the index, the following command can be used:

```bash
hicberg alignment -o ~/Desktop/test/ --cpus 8 --index index_prefix <genome.fa> <reads_for.fq> <rev_reads.fq>
```

The files XXX.bt2l, 1.sorted.bam and 2.sorted.bam will be created when bowtie2 is used as aligner (--aligner parameter or -a).

If the aligner used is BWA, the files XXX.fa.amb, XXX.fa.ann, XXX.fa.bwt, XXX.fa.pac and XXX.fa.sa will be created.

Using Minimap2 for Alignment

When using Minimap2 (--aligner "minimap2"), it's crucial to specify the --read-type parameter to ensure optimal alignment for your specific sequencing data. Minimap2 uses different presets (-x option) that are highly optimized for various types of reads, affecting its performance and accuracy.

The hicberg alignment command supports the following read-type values, directly mapping to Minimap2's presets:

  • sr: For standard short genomic reads (e.g., Illumina, BGI). This is the default setting and is typically suitable for most Hi-C experiments that use short-read sequencing.
  • map-ont: For Oxford Nanopore Technologies (ONT) reads. These are long reads, often characterized by a higher error rate.
  • map-pb: For PacBio CLR (Continuous Long Read) data. These are also long reads but generally have different error profiles than ONT reads.
  • map-hifi: For PacBio HiFi reads. These are highly accurate long reads (circular consensus sequencing).
  • splice: For RNA-seq reads, which accounts for splicing events during alignment.
  • splice:hq: A higher-quality variant for RNA-seq reads, offering more accurate spliced alignment.
  • asm5 / asm10: For aligning reads during genome assembly, optimized for around 5% or 10% sequence divergence, respectively.
  • ava-pb / ava-ont: For all-versus-all read overlapping with PacBio or ONT reads, primarily used in assembly workflows.

Example for Nanopore reads:

To align long reads sequenced with Oxford Nanopore Technologies, you would use:

```bash
hicberg alignment -o ~/Desktop/test/ --cpus 8 --aligner "minimap2" --read-type map-ont <genome.fa> <reads_for.fq> <rev_reads.fq>
```

For more detailed information on Minimap2's presets and their underlying parameters, please refer to the official Minimap2 documentation.

Classification

Once the reads are aligned, they can be classified by mappability using the following command:

```bash
hicberg classify --output=DIR [--mapq=35]
```

Considering the previous example, to classify the reads in a folder named "test" previously created on the desktop:

```bash
hicberg classify -o ~/Desktop/test/
```

The files created are:

  • group0.1.bam and group0.2.bam: bam files containing the reads of group0, i.e. pairs where at least one read is unaligned.
  • group1.1.bam and group1.2.bam: bam files containing the reads of group1, i.e. pairs where both reads align exactly once.
  • group2.1.bam and group2.2.bam: bam files containing the reads of group2, i.e. pairs where at least one read aligns more than once.
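
To sanity-check the classification, the group bam files can be inspected with pysam; a minimal sketch, assuming the file names listed above and that it is run from the folder containing them:

```python
# Minimal sketch (illustration only): count the reads assigned to each group
# after `hicberg classify`, using pysam. until_eof=True avoids needing an index.
import pysam

for group in ("group0", "group1", "group2"):
    for mate in ("1", "2"):
        path = f"{group}.{mate}.bam"
        with pysam.AlignmentFile(path, "rb") as bam:
            n = sum(1 for _ in bam.fetch(until_eof=True))
        print(path, n)
```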

Pairs and matrix building

Build pairs

After having aligned the reads, the pairs file group1.pairs can be built using the following command:

```bash
hicberg build-pairs --output=DIR [--recover]
```

If the flag argument --recover is used, the pairs file will be built from the last step of the analysis, i.e. after the statistics have been computed and the reads from the group2 bam files have been reattributed.

Considering the previous example, to build the matrix in a folder named "test" previously created on the desktop:

```bash
hicberg build-pairs -o ~/Desktop/test/
```

The file group1.pairs will be created.

If the pairs file has to be built after the reassignment of group2 reads, the following command can be used:

```bash
hicberg build-pairs -o ~/Desktop/test/ --recover
```

The built pairs file will then be all_group.pairs.

Build matrix

After having aligned the reads and built the pairs file group1.pairs, the cooler matrix unrescued_map.cool can be built using the following command:

```bash
hicberg build-matrix --output=DIR [--recover]
```

If the flag argument --recover is used, the matrix file will be built from the last step of the analysis, i.e. after the statistics have been computed and the reads from the group2 bam files have been reattributed.

Considering the previous example, to build the matrix in a folder named "test" previously created on the desktop:

```bash
hicberg build-matrix -o ~/Desktop/test/
```

The file unrescued_map.cool will be created.

If the cooler file has to be built after the reassignment of group2 reads, the following command can be used:

```bash
hicberg build-matrix -o ~/Desktop/test/ --recover
```

Thus, the built matrix file will be rescued_map.cool.

Statistics

After having aligned the reads, built the pairs file group1.pairs and the cooler matrix unrescued_map.cool, the statistical laws for the reassignment of the reads from group2 can be learned using the following command:

```bash
hicberg statistics --output=DIR [--bins=bins_number] [--circular=""] [--rate=1.0] [--mode="standard"] [--kernel-size=11] [--deviation=0.5] [--blacklist=STR] <genome>
```

Considering the previous example, to learn the statistical laws (with respect to the ARIMA kit enzymes) with default density-estimation parameters, without sub-sampling the restriction map, without blacklisting regions, and considering "chrM" as circular, in a folder named "test" previously created on the desktop:

```bash
hicberg statistics -e DpnII -e HinfI -c "chrM" -o ~/Desktop/test/ <genome.fa>
```

The statistical laws are going to be saved as:

  • xs.npy: dictionary containing the log-binned genome, such as {chromosome: [log bins]}
  • uncuts.npy, loops.npy, weirds.npy: dictionaries containing the distributions of uncuts, loops and weirds, such as {chromosome: [distribution]}
  • pseudo_ps.pdf: plot of the distribution of the pseudo ps, i.e. the ps equivalent for trans-chromosomal cases, extracted from the reads of group1.
  • coverage.npy: dictionary containing the coverage of the genome, such as {chromosome: [coverage]}
  • d1d2.npy: np.array containing the d1d2 law, such as [distribution]
  • density_map.npy: dictionary containing the density maps, such as {chromosome_pair: [density map]}

Blacklisting regions

The user can specify regions to blacklist in the analysis. The regions to blacklist have to be specified either in a bed file or as a comma-separated list of genomic coordinates in UCSC format. The bed file has to be formatted as follows:

```bed
chr1  200000  220000
chr1  308000  314000
...
chr3  100000  120000
```

A list of blacklisted regions provided as a string can be specified as follows:

```
chr1:200000-220000,chr1:308000-314000,...,chr3:100000-120000
```
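
For illustration, such a region string can be parsed with a few lines of Python (a sketch of the format only, not hicberg's own parser):

```python
# Illustration only: parse a comma-separated UCSC-style region string
# ("chr:start-end,...") into (chromosome, start, end) tuples.
def parse_regions(region_string):
    regions = []
    for token in region_string.split(","):
        chrom, span = token.split(":")
        start, end = span.split("-")
        regions.append((chrom, int(start), int(end)))
    return regions

print(parse_regions("chr1:200000-220000,chr1:308000-314000,chr3:100000-120000"))
```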

The statistical laws can then be learned, as previously described, with regions blacklisted from the P(s) computation, using the following commands:

  • Using a bed file:

```bash
hicberg statistics -e DpnII -e HinfI -c "chrM" -o ~/Desktop/test/ -B <blacklist.bed> <genome>
```

  • Using a string:

```bash
hicberg statistics -o ~/Desktop/test/ -B "chr1:200000-220000,chr1:308000-314000,chr3:100000-120000" <genome>
```

Omics mode

The omics mode can be used to reconstruct any paired-end sequenced genomic data. The model used relies on the $P(s)$ and the coverage. After reconstruction, the data is located in the folder statistics/. The files generated are:

  • coverage.bed: bed file containing the coverage of the genome
  • coverage.bedgraph: bedgraph file containing the coverage of the genome
  • signal.bw: bigwig file containing the reconstructed signal (paired-end data)

The omics mode can be run using the following command:

```bash
hicberg pipeline -o <out folder> -t <cpus> -m omics -s <alignment sensitivity> genome.fa reads_for.fq rev_reads.fq
```

Reconstruction

After having learnt the statistical laws (based on reads of group1), the reads from group2 can be reassigned using the following command:

```bash
hicberg rescue --output=DIR [--enzyme=["DpnII", "HinfI"]] [--mode="full"] [--cpus=1] <genome>
```

Considering the previous example, to reassign the reads from group2 in a folder named "test" previously created on the desktop:

```bash
hicberg rescue -e DpnII -e HinfI -o ~/Desktop/test/ <genome>
```

The files group2.1.rescued.bam and group2.2.rescued.bam will be created.

Plot

To plot all the information about the analysis, the following command can be used:

```bash
hicberg plot --output=DIR [--bins=2000] <genome>
```

Considering all the previous analysis, to plot all the information with a bin size of 2000 bp in a folder named "test" previously created on the desktop:

```bash
hicberg plot -o ~/Desktop/test/ -b 2000 <genome>
```

The plots created are:

  • patterns_distribution_X.pdf: plot of the distribution of the different patterns extracted from the reads of group1.
  • coverage_X.png: plot of the genome coverage extracted from the reads of group1.
  • d1d2.pdf: plot of the d1d2 law extracted from the reads of group1.
  • density_X-Y.pdf: plot of the density map extracted from the reads of group1.
  • Couple_sizes_distribution.pdf: plot of the distribution of the plausible couple sizes extracted from the reads of group2.
  • chr_X.pdf: plot of the original map and the reconstructed one for each chromosome.

Tidy folder

To tidy the folder, the following command can be used:

```bash
hicberg tidy --output=DIR
```

Considering all the previous analysis, to tidy the folder in a folder named "test" previously created on the desktop:

```bash
hicberg tidy -o ~/Desktop/test/
```

After tidying, the folder architecture will be the following:

```
/home/sardine/Bureau/sample_name
├── alignments
│   ├── group0.1.bam
│   ├── group0.2.bam
│   ├── group1.1.bam
│   ├── group1.2.bam
│   ├── group2.1.bam
│   ├── group2.1.rescued.bam
│   ├── group2.2.bam
│   └── group2.2.rescued.bam
├── chunks
│   ├── chunk_for_X.bam
│   └── chunk_rev_X.bam
├── contacts
│   ├── matrices
│   │   ├── rescued_map.cool
│   │   └── unrescued_map.cool
│   └── pairs
│       ├── all_group.pairs
│       └── group1.pairs
├── fragments_fixed_sizes.txt
├── chromosome_sizes.txt
├── index
│   ├── index.1.bt2l (Bowtie2)
│   ├── index.2.bt2l (Bowtie2)
│   ├── index.3.bt2l (Bowtie2)
│   ├── index.4.bt2l (Bowtie2)
│   ├── index.rev.1.bt2l (Bowtie2)
│   ├── index.rev.2.bt2l (Bowtie2)
│   ├── index.fa.amb (BWA)
│   ├── index.fa.ann (BWA)
│   ├── index.fa.bwt (BWA)
│   ├── index.fa.pac (BWA)
│   └── index.fa.sa (BWA)
├── plots
│   ├── chr_X.pdf
│   ├── Couple_sizes_distribution.pdf
│   ├── coverage_X.pdf
│   ├── patterns_distribution_X.pdf
│   ├── pseudo_ps.pdf
│   └── density_X-Y.pdf
└── statistics
    ├── chromosome_sizes.npy
    ├── coverage.npy
    ├── d1d2.npy
    ├── density_map.npy
    ├── dist.frag.npy
    ├── loops.npy
    ├── restriction_map.npy
    ├── trans_ps.npy
    ├── uncuts.npy
    ├── weirds.npy
    ├── xs.npy
    ├── chromosome_sizes.bed (*)
    ├── coverage.bed (*)
    ├── coverage.bedgraph (*)
    └── signal.bw (*)
```

(*) : files generated by hicberg in omics mode

Chaining pipeline steps

It is possible to chain the different steps of the pipeline by using the following command:

```bash
# 0. Prepare analysis
hicberg pipeline --output=DIR [--cpus=1] [--enzyme=[STR, STR]] [--mode=STR] --name=NAME --start-stage fastq --exit-stage bam

# 1. Align reads
hicberg pipeline --output=DIR [--cpus=1] [--aligner=STR] [--read-type=STR] [--enzyme=[STR, STR]] [--mode=STR] --name=NAME --start-stage bam --exit-stage groups

# 2. Group reads
hicberg pipeline --output=DIR [--cpus=1] [--enzyme=[STR, STR]] [--mode=STR] --name=NAME --start-stage groups --exit-stage build

# 3. Build pairs & cool
hicberg pipeline --output=DIR [--cpus=1] [--enzyme=[STR, STR]] [--mode=STR] --name=NAME --start-stage build --exit-stage stats

# 4. Compute statistics
hicberg pipeline --output=DIR [--cpus=1] [--enzyme=[STR, STR]] [--mode=STR] --name=NAME --start-stage stats --exit-stage rescue

# 5. Reassign ambiguous reads, build pairs & cool, then get results
hicberg pipeline --output=DIR [--cpus=1] [--enzyme=[STR, STR]] [--mode=STR] --name=NAME --start-stage rescue --exit-stage final
```

Evaluating the model

General principle

HiC-BERG provides a method to evaluate the inferred reconstructed maps. The evaluation is based on a split of the original uniquely mapped reads into two sets:

  • group1.X.out.bam: alignment files whose reads are complementary to those of group1.X.in.bam. These reads remain uniquely mapped (as in the original alignment files) and are used to learn the statistics for read-couple inference.
  • group1.X.in.bam: alignment files whose reads are duplicated across all the genomic intervals defined by the user. Ambiguity is thereby introduced into the alignment of these reads.

The most plausible couple among the artificially duplicated reads of group1.X.in.bam is then inferred, and the corresponding Hi-C contact matrix is built and compared to the one built from the original reads (unrescued_map.cool). The two matrices are compared (on the bins modified through duplication) using the Pearson correlation coefficient, which reflects the quality of the reconstruction: the closer the coefficient is to 1, the better the reconstruction.
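
As an illustration of this comparison, a minimal sketch using the cooler Python library and the map names used in this README (a sketch of the idea, not hicberg's internal evaluation code):

```python
# Minimal sketch (illustration only): Pearson correlation between the original
# and reconstructed maps, on bins that are informative in at least one of them.
import cooler
from scipy.stats import pearsonr

original = cooler.Cooler("unrescued_map.cool").matrix(balance=False)[:]
rescued = cooler.Cooler("rescued_map.cool").matrix(balance=False)[:]

mask = (original + rescued) > 0
r, _ = pearsonr(original[mask], rescued[mask])
print(f"Pearson correlation: {r:.3f}")
```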

The evaluation pipeline can be illustrated as follows:

[Figure: HiC-BERG evaluation pipeline]

The genomic intervals used to duplicate the reads are defined by the user through source and target intervals. The source interval is set through the parameters --chromosome, --position and --bins. The target intervals are set through the parameter --strides and, optionally, --trans-chromosome with --trans-position.

In a basic example considering one chromosome and two artificially duplicated sequences, it is necessary to define a source interval corresponding to the chromosome of interest and a target interval corresponding to the duplicated sequence. The source interval is defined by the chromosome name (--chromosome), the position (--position) and the width of the interval in number of bins (--bins).

Thus the source interval is defined as $[chromosome:position - bins \times binsize\ ;\ chromosome:position + bins \times binsize]$ and the target interval as $[chromosome:(position - bins \times binsize) + stride\ ;\ chromosome:(position + bins \times binsize) + stride]$.

For example, if the source interval is chromosome 1, position 68000, with strides set to [0, 50000] and a bin size of 2000 bp, the source interval is chr1:68000-70000 and the target interval is chr1:118000-120000.
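
A minimal sketch reproducing this worked example (illustration only; it follows the interval convention of the example, not hicberg internals):

```python
# Illustration only: source/target intervals as [position, position + bins *
# bin_size], shifted by each stride (stride 0 gives the source interval).
def duplication_intervals(chromosome, position, strides, bins=1, bin_size=2000):
    width = bins * bin_size
    return [(chromosome, position + stride, position + stride + width)
            for stride in strides]

# [('chr1', 68000, 70000), ('chr1', 118000, 120000)]
print(duplication_intervals("chr1", 68000, strides=[0, 50000]))
```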

The files group1.1.in.bam, group1.2.in.bam, group1.1.out.bam and group1.2.out.bam will be created.

The duplicated aligned reads should look like this:

group1.1.in.bam :

```
NS500150:487:HNLLNBGXC:1:11101:1071:2862 0 chr1 69227 255 35M * 0 0 ATCTGTTGTGNNGAAGGATACTCCCAGAACTCGTT AAAAAEEEAE##EEEEEEEEEEEEEEEEEEEEEEE AS:i:-2 XN:i:0 XM:i:2 XO:i:0 NM:i:2 MD:Z:10G0A23 YT:Z:UU XG:i:230218
NS500150:487:HNLLNBGXC:1:11101:1071:2862 0 chr1 119227 255 35M * 0 0 ATCTGTTGTGNNGAAGGATACTCCCAGAACTCGTT AAAAAEEEAE##EEEEEEEEEEEEEEEEEEEEEEE AS:i:-2 XN:i:0 XM:i:2 XO:i:0 NM:i:2 MD:Z:10G0A23 YT:Z:UU XG:i:230218 XF:Z:Fake
NS500150:487:HNLLNBGXC:1:11101:3001:19423 16 chr1 118866 255 35M * 0 0 GAAAAAGGATTGGTCCAATAAGTGGGAAAAAAGAT EEAEEEAEE/EAE/EEEEEEEE/EEEEEE6AAAAA AS:i:0 XN:i:0 XM:i:0 XO:i:0 NM:i:0 MD:Z:35 YT:Z:UU XG:i:230218
NS500150:487:HNLLNBGXC:1:11101:3001:19423 16 chr1 68866 255 35M * 0 0 GAAAAAGGATTGGTCCAATAAGTGGGAAAAAAGAT EEAEEEAEE/EAE/EEEEEEEE/EEEEEE6AAAAA AS:i:0 XN:i:0 XM:i:0 XO:i:0 NM:i:0 MD:Z:35 YT:Z:UU XG:i:230218 XF:Z:Fake
NS500150:487:HNLLNBGXC:1:11101:4986:15168 16 chr1 69239 255 35M * 0 0 GAAGGATACTCCCAGAACTCGTTACTGTCTGGACT EEEEEEEEEEEEEEEEEEEEEEEAEEEAEEAAAAA AS:i:0 XN:i:0 XM:i:0 XO:i:0 NM:i:0 MD:Z:35 YT:Z:UU XG:i:230218
NS500150:487:HNLLNBGXC:1:11101:4986:15168 16 chr1 119239 255 35M * 0 0 GAAGGATACTCCCAGAACTCGTTACTGTCTGGACT EEEEEEEEEEEEEEEEEEEEEEEAEEEAEEAAAAA AS:i:0 XN:i:0 XM:i:0 XO:i:0 NM:i:0 MD:Z:35 YT:Z:UU XG:i:230218 XF:Z:Fake
...
```

group1.2.in.bam :

```
NS500150:487:HNLLNBGXC:1:11101:1071:2862 16 chr1 103994 255 35M * 0 0 TGCTTTTTTGGGATTGGGAATGATTTTTCCTCCTT EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAAAAA AS:i:0 XN:i:0 XM:i:0 XO:i:0 NM:i:0 MD:Z:35 YT:Z:UU XG:i:230218
NS500150:487:HNLLNBGXC:1:11101:3001:19423 16 chr1 121776 255 35M * 0 0 GGTCAAGAAATGGTTTTCACAGGCGAAATCATTGG EEEEEEEEEEE<EEEE/EEEEEEEEEEAEEAAAAA AS:i:0 XN:i:0 XM:i:0 XO:i:0 NM:i:0 MD:Z:35 YT:Z:UU XG:i:230218
NS500150:487:HNLLNBGXC:1:11101:4986:15168 0 chr1 86626 255 35M * 0 0 GATCTAGGGGTACCTCCTCGGGAAACATCCAGCCC AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE AS:i:0 XN:i:0 XM:i:0 XO:i:0 NM:i:0 MD:Z:35 YT:Z:UU XG:i:230218
...
```

The XF:Z:Fake tag marks the duplicated (fake) reads.

In the case of trans-chromosomal duplications, the user has to specify the names of the trans chromosomes and the relative position on each selected trans chromosome. The user has to provide as many positions as chromosome names.

For example, if the source interval is chromosome 1, position 100000, with strides set to [0, 50000], a bin size of 2000 bp, and the trans chromosomes and trans positions specified as [chr2, chr8] and [70000, 130000] respectively, then the source interval is chr1:100000-102000 and the target intervals are chr1:150000-152000, chr2:70000-72000 and chr8:130000-132000.

The stride is the number of base pairs between the first bin of the source interval and the first bin of the target interval. The stride can be negative or positive: if negative, the target interval is located before the source interval; if positive, after it. The stride can be set to 0, in which case the target interval is the same as the source interval. The target interval can be located on the same chromosome as the source interval or on another chromosome; in the latter case, the chromosome name and the position of the first bin of the target interval must be specified. All the parameters --position, --strides, --trans-chromosome and --trans-position should be provided as comma-separated lists.

The benchmark can be performed considering several modes, defined by the parameter --modes as a comma-separated list of strings. The modes are the same as those used for the reconstruction:

  • full
  • random
  • ps
  • cov
  • d1d2
  • density
  • standard (ps and cov)
  • one_enzyme (ps, cov and d1d2)
  • omics (ps, cov)

N.B: depending on the modes selected for the benchmark, if a mode was not included in the list of modes selected for the reconstruction, the reconstruction will not be performed for this mode and the corresponding statistics will not be computed.

The evaluation can be run using the following command :

```bash
hicberg benchmark --output=DIR [--chromosome=STR] [--position=INT] [--trans-chromosome=STR] [--trans-position=INT] [--stride=INT] [--bins=INT] [--auto=INT] [--rounds=INT] [--magnitude=FLOAT] [--modes=STR] [--pattern=STR] [--threshold=FLOAT] [--jitter=INT] [--trend] [--top=INT] [--genome=STR] [--force]
```

Consider a benchmark with 4 artificially duplicated sequences set at chr1:100000-102000 (source), chr1:200000-202000 (target 1), chr4:50000-52000 (target 2) and chr7:300000-302000 (target 3), with a bin size of 2000 bp, using the full and ps_only modes to measure reconstruction performance, in a folder named "test" previously created on the desktop containing the original alignment files and the unreconstructed maps. The command line is the following:

```bash
hicberg benchmark -o ~/Desktop/test/ -c chr1 -p 100000 -s 0,100000 -C chr4,chr7 -P 50000,300000 -m full,ps_only -g <genome>
```

It is also possible to let the source and target intervals be picked at random. In such cases, empty bins are not considered in the evaluation. The random mode is activated by setting the parameter --auto to the desired number of artificially duplicated sequences.

Thus, for a benchmark with 100 artificially duplicated sequences, with a bin size of 2000 bp, using the full and ps_only modes, in a folder named "test" previously created on the desktop containing the original alignment files and the unreconstructed maps, the command line is the following:

```bash
hicberg benchmark -o ~/Desktop/test/ -a 100 -m full,ps_only -g <genome.fa>
```

Pattern based evaluation

A pattern-based strategy can also be used. It is similar to the interval-based strategy described above, but instead of providing the genomic coordinates of the intervals for read selection, the user only specifies a pattern type, which determines the genomic intervals. The pattern is defined by the parameter --pattern.

Such patterns are detected in the original Hi-C map using Chromosight. The genomic coordinates of the detected patterns are then used to select the reads for the evaluation. The number of duplications can be adjusted by specifying the chromosome name with the --chromosome parameter, the --threshold parameter to set the minimum Pearson score for a pattern to be considered detected, and the --top parameter to keep only the top k% of the remaining detected patterns. The genomic intervals are then defined as the genomic coordinates of the detected patterns.

The same read-selection strategy is then applied. After reconstruction, the evaluation is performed using the Pearson correlation coefficient between the original and reconstructed bins selected from the original map. A second pattern detection with Chromosight is then run on the reconstructed map, and the precision, recall and F1 score are computed to evaluate the reconstruction, by comparing the number of retrieved patterns while identifying false positives and false negatives.
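
For reference, the three scores reduce to simple ratios over the counts of retrieved and lost patterns; a minimal sketch with made-up counts:

```python
# Illustration only: tp = patterns retrieved after reconstruction,
# fp = patterns detected only in the reconstructed map,
# fn = original patterns not retrieved after reconstruction.
def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

print(precision_recall_f1(tp=90, fp=10, fn=20))  # (0.9, 0.818..., 0.857...)
```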

N.B.: Because of the stochasticity of Chromosight when splitting the Hi-C map for pattern detection, some patterns can be counted as false positives or false negatives because their post-reconstruction coordinates lie just beside the original ones. We recommend using the --jitter parameter to allow a pattern to be considered retrieved post-reconstruction if it is detected within j bins of the original coordinates.

Considering a benchmark based on loop patterns on chromosome 7, with a threshold of 0.5, keeping all the detected patterns after thresholding (i.e. a 100% rate), a jitter of 0 and detrending, in full mode, the command line is the following:

```bash
hicberg benchmark -o ~/Desktop/test/ -c chr7 -p 100000 -S loops -t 0.5 -k 100 -j 0 -T -m full -g <genome.fa>
```

Python usage

All components of the hicberg program can be used as Python modules. See the documentation on readthedocs. The expected contact-map format for the library is a simple COOL file, and the objects handled by the library are simple NumPy arrays through Cooler. The various submodules of hicberg provide different utilities.

```python
import hicberg.io          # Functions for I/O and folder management
import hicberg.align       # Functions for sequence alignment steps
import hicberg.utils       # Functions for handling reads and alignments
import hicberg.statistics  # Functions to extract and create statistical models
import hicberg.omics       # Functions to treat non-Hi-C data
import hicberg.pipeline    # Functions to run end-to-end Hi-C map reconstruction
```

Connecting the modules

All the steps described here are handled automatically when running the hicberg pipeline. But if you want to connect the different modules manually, the intermediate input and output files can be processed with a little Python scripting (see the reading sketch at the end of this section).

File formats

  • pair files: This format is used for all intermediate files in the pipeline and is also used by hicberg build-matrix. It is a tab-separated format holding information about Hi-C pairs. It has an official specification defined by the 4D Nucleome data coordination and integration center.

    • readID: Read (pair) identifier.
    • chr1: Chromosome identifier of the forward read of the pair.
    • pos1: 0-based position of the forward mate, in base pairs.
    • chr2: Chromosome identifier of the reverse read of the pair.
    • pos2: 0-based position of the reverse mate, in base pairs.
    • strand1: Orientation of the aligned forward mate.
    • strand2: Orientation of the aligned reverse mate.

```
## pairs format v1.0
#columns: readID chr1 pos1 chr2 pos2 strand1 strand2
#chromsize: chr1 230218
#chromsize: chr2 813184
NS500150:487:HNLLNBGXC:1:11101:3066:7109 chr2 683994 chr2 684725 - -
NS500150:487:HNLLNBGXC:1:11101:6114:4800 chr2 795379 chr2 796279 + +
NS500150:487:HNLLNBGXC:1:11101:6488:14927 chr2 379433 chr2 379138 - +
...
```

  • cool files: This format is used to store genomic interaction data such as Hi-C contact matrices. These files can be handled using the cooler Python library.

  • npy files: This format is used to store dictionaries containing information about genomic coordinates, binning or statistical laws. Dictionaries are stored with chromosomes as keys and arrays as values. Such files can be handled using the numpy Python library.

    • chromosome_sizes.npy: stores the size of each chromosome. Structure: {chromosome: size}
    • xs.npy: stores the log-binned genome. Structure: {chromosome: [log bins]} with log bins a list of integers.
    • uncuts.npy: stores the distribution of uncuts. Structure: {chromosome: [distribution]} with distribution a list of integers.
    • loops.npy: stores the distribution of loops. Structure: {chromosome: [distribution]} with distribution a list of integers.
    • weirds.npy: stores the distribution of weirds. Structure: {chromosome: [distribution]} with distribution a list of integers.
    • pseudo_ps.npy: stores the distribution of pseudo ps. Structure: {(chrom1, chrom2): [map]} with (chrom1, chrom2) a tuple of distinct chromosomes and map a float value.
    • coverage.npy: stores the coverage of the genome. Structure: {chromosome: [coverage]} with coverage a list of integers.
    • d1d2.npy: stores the d1d2 law. Structure: [distribution] with distribution a list of integers.
    • density_map.npy: stores the density maps. Structure: {(chrom1, chrom2): [density map]} with (chrom1, chrom2) a tuple of chromosomes and density map a 2D numpy array.
  • bt2l files: This format is used to store genome indexes built with Bowtie2.

  • bam files: This format is used by several functions of hicberg to build analyses on. It is a compressed standard alignment format providing multiple pieces of information about read alignments performed by Bowtie2. Such files can be handled through Samtools and its Python wrapper PySam. More details about the SAM and BAM formats can be found here.

  • bed files: This format is used to store genomic intervals. It is a tab-separated format holding information about genomic intervals. It is a standard format used by the UCSC genome browser.

```
chr4 150 200
chr6 300 400
chr4 800 900
chr2 680000 684000
...
```

  • chromosome_sizes.bed: stores the size of each chromosome. Structure: chromosome start end
  • coverage.bed: stores the coverage of the genome. Structure: chromosome start end coverage
  • coverage.bedgraph: stores the coverage of the genome. Structure: chromosome start end coverage
  • signal.bw: stores the reconstructed coverage signal of the genome in bigwig format.

  • fragments_fixed_sizes.txt:

    • chrom: Chromosome identifier. Order should be the same as in pairs files.
    • start: 0-based start of the fragment, in base pairs.
    • end: 0-based end of the fragment, in base pairs.

```
chrom start end
chr1 0 2000
chr1 2000 4000
chr1 4000 6000
...
chr1 14000 16000
...
chr2 0 2000
chr2 2000 4000
...
```
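
To connect the modules manually, these files can be read with standard Python libraries; a minimal sketch, assuming the file names used in this README (the allow_pickle=True call assumes the .npy dictionaries were saved as pickled objects, which is an assumption rather than a documented hicberg behaviour):

```python
# Minimal reading sketch for the formats above (illustration only).
import numpy as np
import pandas as pd
import cooler

# pairs: tab-separated, header lines start with '#'
pairs = pd.read_csv(
    "group1.pairs", sep="\t", comment="#",
    names=["readID", "chr1", "pos1", "chr2", "pos2", "strand1", "strand2"],
)

# cool: dense contact matrix through the cooler library
matrix = cooler.Cooler("unrescued_map.cool").matrix(balance=False)[:]

# npy: dictionaries keyed by chromosome (or chromosome pair)
chromosome_sizes = np.load("chromosome_sizes.npy", allow_pickle=True).item()

print(len(pairs), matrix.shape, chromosome_sizes)
```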

Contributing

All contributions are welcome, in the form of bug reports, suggestions, documentation or pull requests. We use the Numpy standard for docstrings when documenting functions.

The code formatting standard we use is black, with --line-length=79 to follow PEP8 recommendations. We use pytest with the pytest-doctest and pytest-pylint plugins as our testing framework. Ideally, new functions should have associated unit tests, placed in the tests folder. To test the code, you can run:

```bash
coverage run --source=hicberg -m pytest -v tests --cov-report=xml
```

Authors

Citation

DOI

Owner

  • Login: sebgra
  • Kind: user
  • Location: France

I'm a bioengineer and developer specialized in bioinformatics, biological image processing, machine & deep learning, drug development and biomechanics.

GitHub Events

Total
  • Create event: 4
  • Issues event: 11
  • Release event: 9
  • Watch event: 2
  • Delete event: 1
  • Public event: 1
  • Push event: 77
  • Fork event: 1
Last Year
  • Create event: 4
  • Issues event: 11
  • Release event: 9
  • Watch event: 2
  • Delete event: 1
  • Public event: 1
  • Push event: 77
  • Fork event: 1

Packages

  • Total packages: 1
  • Total downloads:
    • pypi 19 last-month
  • Total dependent packages: 0
  • Total dependent repositories: 0
  • Total versions: 2
  • Total maintainers: 1
pypi.org: hicberg

Standalone command line tool to visualize coverage from a BAM file

  • Versions: 2
  • Dependent Packages: 0
  • Dependent Repositories: 0
  • Downloads: 19 Last month
Rankings
Dependent packages count: 9.8%
Average: 32.6%
Dependent repos count: 55.3%
Maintainers (1)
Last synced: 6 months ago

Dependencies

.github/workflows/build.yml actions
  • actions/checkout v2 composite
  • mamba-org/setup-micromamba v1 composite
Dockerfile docker
  • condaforge/mambaforge latest build
hicberg.egg-info/requires.txt pypi
  • bioframe *
  • biopython *
  • click *
  • cooler *
  • funcy *
  • hicstuff *
  • matplotlib *
  • numpy *
  • pandas *
  • pysam *
  • scikit-learn *
  • scipy *
  • statsmodels *
requirements.txt pypi
  • bioframe *
  • biopython *
  • click *
  • cooler *
  • funcy *
  • hicstuff *
  • matplotlib *
  • numpy *
  • pandas *
  • pysam *
  • scikit-learn *
  • scipy *
  • statsmodels *
setup.py pypi
.github/workflows/gh-pages.yml actions
  • JamesIves/github-pages-deploy-action 3.7.1 composite
  • actions/checkout v2 composite
  • actions/setup-python v1 composite
.github/workflows/publish.yml actions
  • actions/checkout v2 composite
  • actions/setup-python v2 composite
  • pypa/gh-action-pypi-publish 27b31702a0e7fc50959f5ad993c78deac1bdfc29 composite
environment.yml conda
  • _libgcc_mutex 0.1
  • _openmp_mutex 4.5
  • aioeasywebdav 2.4.0
  • aiohttp 3.9.1
  • aiosignal 1.3.1
  • alsa-lib 1.2.9
  • amply 0.1.6
  • appdirs 1.4.4
  • asttokens 2.2.1
  • attmap 0.13.2
  • attr 2.5.1
  • attrs 23.1.0
  • backcall 0.2.0
  • backports 1.0
  • backports.functools_lru_cache 1.6.5
  • bcrypt 4.1.1
  • boto3 1.33.6
  • botocore 1.33.6
  • brotli 1.1.0
  • brotli-bin 1.1.0
  • brotli-python 1.1.0
  • bzip2 1.0.8
  • c-ares 1.23.0
  • ca-certificates 2023.11.17
  • cachetools 5.3.2
  • cairo 1.16.0
  • cffi 1.16.0
  • chardet 4.0.0
  • charset-normalizer 3.3.2
  • coin-or-cbc 2.10.10
  • coin-or-cgl 0.60.7
  • coin-or-clp 1.17.8
  • coin-or-osi 0.108.8
  • coin-or-utils 2.11.9
  • coincbc 2.10.10
  • colorama 0.4.6
  • comm 0.1.3
  • configargparse 1.7
  • connection_pool 0.0.3
  • cryptography 41.0.5
  • datrie 0.8.2
  • dbus 1.13.6
  • debugpy 1.6.7
  • decorator 5.1.1
  • defusedxml 0.7.1
  • docutils 0.20.1
  • dpath 2.1.6
  • dropbox 11.36.2
  • eido 0.2.2
  • exceptiongroup 1.2.0
  • executing 1.2.0
  • expat 2.5.0
  • filechunkio 1.8
  • font-ttf-dejavu-sans-mono 2.37
  • font-ttf-inconsolata 3.000
  • font-ttf-source-code-pro 2.038
  • font-ttf-ubuntu 0.83
  • fontconfig 2.14.2
  • fonts-conda-ecosystem 1
  • fonts-conda-forge 1
  • freetype 2.12.1
  • frozenlist 1.4.0
  • ftputil 5.0.4
  • gettext 0.21.1
  • gitdb 4.0.11
  • gitpython 3.1.40
  • glib 2.78.1
  • glib-tools 2.78.1
  • google-api-core 2.14.0
  • google-api-python-client 2.109.0
  • google-auth 2.24.0
  • google-auth-httplib2 0.1.1
  • google-cloud-core 2.3.3
  • google-cloud-storage 2.13.0
  • google-crc32c 1.1.2
  • google-resumable-media 2.6.0
  • googleapis-common-protos 1.61.0
  • graphite2 1.3.13
  • grpcio 1.59.3
  • gst-plugins-base 1.22.5
  • gstreamer 1.22.5
  • harfbuzz 7.3.0
  • htslib 1.17
  • httplib2 0.22.0
  • humanfriendly 10.0
  • icu 72.1
  • idna 3.6
  • importlib-metadata 6.8.0
  • importlib_metadata 6.8.0
  • importlib_resources 6.1.1
  • iniconfig 2.0.0
  • ipykernel 6.24.0
  • ipympl 0.9.3
  • ipython 8.14.0
  • ipython_genutils 0.2.0
  • ipywidgets 8.1.1
  • jedi 0.18.2
  • jinja2 3.1.2
  • jmespath 1.0.1
  • jsonschema 4.20.0
  • jsonschema-specifications 2023.11.2
  • jupyter_client 8.3.0
  • jupyter_core 5.3.1
  • jupyterlab_widgets 3.0.9
  • keyutils 1.6.1
  • krb5 1.21.2
  • lame 3.100
  • lcms2 2.15
  • ld_impl_linux-64 2.40
  • lerc 4.0.0
  • libabseil 20230802.1
  • libblas 3.9.0
  • libbrotlicommon 1.1.0
  • libbrotlidec 1.1.0
  • libbrotlienc 1.1.0
  • libcap 2.69
  • libcblas 3.9.0
  • libclang 15.0.7
  • libclang13 15.0.7
  • libcrc32c 1.1.2
  • libcups 2.3.3
  • libcurl 8.4.0
  • libdeflate 1.18
  • libedit 3.1.20191231
  • libev 4.33
  • libevent 2.1.12
  • libexpat 2.5.0
  • libffi 3.4.2
  • libflac 1.4.3
  • libgcc-ng 13.1.0
  • libgcrypt 1.10.3
  • libgfortran-ng 13.2.0
  • libgfortran5 13.2.0
  • libglib 2.78.1
  • libgomp 13.1.0
  • libgpg-error 1.47
  • libgrpc 1.59.3
  • libiconv 1.17
  • libjpeg-turbo 2.1.5.1
  • liblapack 3.9.0
  • liblapacke 3.9.0
  • libllvm15 15.0.7
  • libnghttp2 1.58.0
  • libnsl 2.0.0
  • libogg 1.3.4
  • libopenblas 0.3.25
  • libopus 1.3.1
  • libpng 1.6.39
  • libpq 15.4
  • libprotobuf 4.24.4
  • libre2-11 2023.06.02
  • libsndfile 1.2.2
  • libsodium 1.0.18
  • libsqlite 3.42.0
  • libssh2 1.11.0
  • libstdcxx-ng 13.1.0
  • libsystemd0 254
  • libtiff 4.6.0
  • libuuid 2.38.1
  • libvorbis 1.3.7
  • libwebp-base 1.3.2
  • libxcb 1.15
  • libxkbcommon 1.6.0
  • libxml2 2.11.5
  • libzlib 1.2.13
  • logmuse 0.2.6
  • lz4-c 1.9.4
  • markdown-it-py 3.0.0
  • markupsafe 2.1.3
  • matplotlib-base 3.7.0
  • matplotlib-inline 0.1.6
  • mdurl 0.1.0
  • mpg123 1.32.3
  • multidict 6.0.4
  • munkres 1.1.4
  • mysql-common 8.0.33
  • mysql-libs 8.0.33
  • nbformat 5.9.2
  • ncurses 6.4
  • nest-asyncio 1.5.6
  • nspr 4.35
  • nss 3.92
  • oauth2client 4.1.3
  • openjpeg 2.5.0
  • openssl 3.1.4
  • packaging 23.1
  • paramiko 3.3.1
  • parso 0.8.3
  • pcre2 10.42
  • peppy 0.35.7
  • pexpect 4.8.0
  • pickleshare 0.7.5
  • pillow 10.0.1
  • pip 23.2
  • pixman 0.42.2
  • pkgutil-resolve-name 1.3.10
  • plac 1.4.1
  • platformdirs 3.9.1
  • ply 3.11
  • prettytable 3.9.0
  • prompt-toolkit 3.0.39
  • prompt_toolkit 3.0.39
  • protobuf 4.24.4
  • psutil 5.9.5
  • pthread-stubs 0.4
  • ptyprocess 0.7.0
  • pulp 2.7.0
  • pulseaudio-client 16.1
  • pure_eval 0.2.2
  • pyasn1 0.5.1
  • pyasn1-modules 0.3.0
  • pycparser 2.21
  • pygments 2.15.1
  • pynacl 1.5.0
  • pyopenssl 23.3.0
  • pyqt 5.15.9
  • pyqt5-sip 12.12.2
  • pysftp 0.2.9
  • pysocks 1.7.1
  • python 3.11.4
  • python-dateutil 2.8.2
  • python-fastjsonschema 2.19.0
  • python-irodsclient 1.1.9
  • python-tzdata 2023.3
  • python_abi 3.11
  • pyu2f 0.1.5
  • pyyaml 6.0.1
  • pyzmq 25.1.0
  • qt-main 5.15.8
  • re2 2023.06.02
  • readline 8.2
  • referencing 0.31.1
  • requests 2.31.0
  • reretry 0.11.8
  • rich 13.7.0
  • rpds-py 0.13.2
  • rsa 4.9
  • s3transfer 0.8.2
  • samtools 1.17
  • seqtk 1.4
  • setuptools 68.0.0
  • setuptools-scm 8.0.4
  • sip 6.7.12
  • six 1.16.0
  • slacker 0.14.0
  • smart_open 6.4.0
  • smmap 5.0.0
  • snakemake 7.32.3
  • snakemake-minimal 7.32.3
  • stack_data 0.6.2
  • stone 3.3.1
  • stopit 1.1.2
  • tabulate 0.9.0
  • throttler 1.2.2
  • tk 8.6.13
  • toml 0.10.2
  • tomli 2.0.1
  • toposort 1.10
  • tornado 6.3.2
  • traitlets 5.9.0
  • typing-extensions 4.7.1
  • typing_extensions 4.7.1
  • tzdata 2023c
  • ubiquerg 0.6.3
  • uritemplate 4.1.1
  • veracitools 0.1.3
  • wcwidth 0.2.6
  • wheel 0.40.0
  • widgetsnbextension 4.0.9
  • wrapt 1.16.0
  • xcb-util 0.4.0
  • xcb-util-image 0.4.0
  • xcb-util-keysyms 0.4.0
  • xcb-util-renderutil 0.3.9
  • xcb-util-wm 0.4.1
  • xkeyboard-config 2.40
  • xorg-kbproto 1.0.7
  • xorg-libice 1.1.1
  • xorg-libsm 1.2.4
  • xorg-libx11 1.8.7
  • xorg-libxau 1.0.11
  • xorg-libxdmcp 1.1.3
  • xorg-libxext 1.3.4
  • xorg-libxrender 0.9.11
  • xorg-renderproto 0.11.1
  • xorg-xextproto 7.3.0
  • xorg-xf86vidmodeproto 2.3.1
  • xorg-xproto 7.0.31
  • xz 5.2.6
  • yaml 0.2.5
  • yarl 1.9.3
  • yte 1.5.1
  • zeromq 4.3.4
  • zipp 3.16.2
  • zlib 1.2.13
  • zstd 1.5.5