https://github.com/01life/pynetcor

An efficient tool for large-scale correlation network analysis

Science Score: 26.0%

This score indicates how likely this project is to be science-related based on various indicators:

○ CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
○ .zenodo.json file
✓ DOI references: Found 4 DOI reference(s) in README
○ Academic publication links
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (12.0%) to scientific vocabulary

Last synced: 9 months ago

Repository

Basic Info

Host: GitHub
Owner: 01life
License: mit
Language: C++
Default Branch: main
Size: 2.34 MB

Statistics

Stars: 9
Watchers: 0
Forks: 0
Open Issues: 1
Releases: 0

Created about 2 years ago · Last pushed over 1 year ago

Metadata Files

Readme License

pyNetCor

PyNetCor is a fast Python C++ extension for correlation and network analysis on high-dimensional datasets. It aims to serve as a scalable foundational package to accelerate large-scale computations.

Features

Calculate correlation matrix using Pearson, Spearman, or Kendall methods
Process large-scale computations in chunks (for results larger than RAM)
Find top-k and differential correlations between each row of two arrays
Efficient P-value approximation and multiple testing correction
Handle missing values
Multi-threading support

For more details, please refer to the documentation.

Installation

You can install pyNetCor using pip:

```bash
pip install pynetcor
```

Quick Start

We provide a demo notebook to help you quickly understand how to use this project.

Create Data

```python
import numpy as np

features = 1000
samples = 100
arr1 = np.random.random((features, samples))
arr2 = np.random.random((features, samples))
```

Calculate correlation matrix

Compute and return the full matrix at once.

```python
from pynetcor.cor import corrcoef

# using 8 threads
# Pearson correlations between `arr1` and itself
cor_result = corrcoef(arr1, threads=8)
```

Compute the matrix in chunks and return an iterator. This is recommended for large-scale analyses whose results exceed RAM.

```python
from pynetcor.cor import chunked_corrcoef

# Calculate and return `chunk_size=1024` rows of the correlation matrix with each iteration.
cor_iter = chunked_corrcoef(arr1, chunk_size=1024, threads=8)
for cor_chunk_matrix in cor_iter:
    ...
```
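To make the chunking idea concrete, here is a minimal pure-Python sketch that yields a few rows of a Pearson correlation matrix at a time. It does not use pyNetCor and says nothing about its internals; the helper names `pearson` and `chunked_rows` are hypothetical and exist only for this illustration.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def chunked_rows(data, chunk_size):
    """Yield (start_row, block); block holds up to `chunk_size` rows of the matrix."""
    for start in range(0, len(data), chunk_size):
        rows = data[start:start + chunk_size]
        # Each row of the block is correlated against every feature.
        yield start, [[pearson(r, other) for other in data] for r in rows]


random.seed(0)
data = [[random.random() for _ in range(20)] for _ in range(10)]  # 10 features x 20 samples

# Only one block is held in memory per iteration.
shapes = [(start, len(block), len(block[0]))
          for start, block in chunked_rows(data, chunk_size=4)]
print(shapes)
```

Only one block of `chunk_size` rows exists at a time, so peak memory is governed by the chunk rather than by the full matrix.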

Top-k correlation search

Find the exact top-k correlations (Spearman correlation in this example).

```python
from pynetcor.cor import cor_topk

# top 0.1% correlations (k given as a fraction)
cor_topk_result = cor_topk(arr1, method="spearman", k=0.001, threads=8)

# top 100 correlations
cor_topk_result = cor_topk(arr1, method="spearman", k=100, threads=8)

# Return a 2D array with 4 columns: [row_index, col_index, correlation, pvalue]
```
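As a rough illustration of what a top-k search returns, the sketch below ranks all feature pairs with `heapq.nlargest`. It uses Pearson rather than Spearman for brevity, and ranking by absolute value is an assumption made here, not something stated for `cor_topk`. The helpers `pearson` and `topk_pairs` are hypothetical.

```python
import heapq
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def topk_pairs(data, k):
    """Return the k (row, col, correlation) tuples with the largest |correlation|."""
    scored = ((i, j, pearson(data[i], data[j]))
              for i, j in combinations(range(len(data)), 2))
    # nlargest keeps only k candidates, so the full pair list is never stored.
    return heapq.nlargest(k, scored, key=lambda t: abs(t[2]))


data = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.1],  # strongly correlated with row 0
    [4.0, 3.0, 2.0, 1.5],  # strongly anti-correlated with row 0
    [1.0, 3.0, 2.0, 5.0],
]
top2 = topk_pairs(data, k=2)
for row_index, col_index, correlation in top2:
    print(row_index, col_index, round(correlation, 4))
```

Because the heap holds at most `k` entries, memory stays small even when the number of feature pairs is very large.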

Top-k differential correlation search

Find the exact top-k differences in correlation between pairs of features across two states or time points.

```python
# Compute the pairwise correlations separately for `arr1` with `arr1`, and `arr2` with `arr2`,
# then identify the feature pairs with the largest difference
from pynetcor.cor import cor_topkdiff

# top 0.1% differential correlations (k given as a fraction)
cor_topkdiff_result = cor_topkdiff(x1=arr1, y1=arr1, x2=arr2, y2=arr2, k=0.001, threads=8)

# top 100 differential correlations
cor_topkdiff_result = cor_topkdiff(x1=arr1, y1=arr1, x2=arr2, y2=arr2, k=100, threads=8)

# Return a 2D array with 5 columns: [row_index, col_index, diffCor, cor1, cor2]
```
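The idea of a differential correlation can be sketched in plain Python: correlate each feature pair within each state, subtract, and rank by the size of the difference. The sign convention (`cor1 - cor2`) and the ranking by absolute difference are assumptions of this sketch, and `topk_diff` is a hypothetical helper, not pyNetCor's implementation.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def topk_diff(state1, state2, k):
    """Top-k pairs by |cor1 - cor2|; each row is (row, col, diff, cor1, cor2)."""
    rows = []
    for i, j in combinations(range(len(state1)), 2):
        cor1 = pearson(state1[i], state1[j])  # correlation in state 1
        cor2 = pearson(state2[i], state2[j])  # correlation in state 2
        rows.append((i, j, cor1 - cor2, cor1, cor2))
    rows.sort(key=lambda r: abs(r[2]), reverse=True)
    return rows[:k]


# Three features measured on four samples in each state.
state1 = [[1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 4.5], [2.0, 1.0, 4.0, 3.0]]
state2 = [[1.0, 2.0, 3.0, 4.0], [4.0, 3.0, 2.0, 1.5], [2.0, 1.0, 4.0, 3.5]]

top = topk_diff(state1, state2, k=1)
print(top[0])
```

In this toy data, features 0 and 1 move together in the first state and in opposite directions in the second, so that pair has the largest difference.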

P-value computation

Compute the P-values for correlations (Pearson or Spearman) using the Student's t-distribution. The approximation method is significantly faster than the classical method, with absolute errors almost always below 1e-8.

```python
from pynetcor.cor import corrcoef, pvalue_student_t

samples = arr1.shape[1]

# Generate the Pearson correlation matrix
cor_result = corrcoef(arr1, threads=8)

# P-value approximation
pvalue_result = pvalue_student_t(cor_result, df=samples-2, approx=True, threads=8)

# P-value classic
pvalue_result = pvalue_student_t(cor_result, df=samples-2, approx=False, threads=8)
```
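For reference, the textbook test behind these P-values converts a correlation `r` into the statistic `t = r * sqrt(df / (1 - r^2))` and takes the two-sided tail probability of a t-distribution with `df` degrees of freedom. The sketch below evaluates that tail with Simpson's rule using only the standard library. It is an illustration of the statistic, not pyNetCor's classical or approximate method, and it loses precision for extremely small P-values; `t_pdf` and `pvalue_two_sided` are hypothetical names.

```python
import math


def t_pdf(t, df):
    """Density of Student's t-distribution with `df` degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(t * t / df))


def pvalue_two_sided(r, df, steps=2000):
    """Two-sided P-value for correlation `r`; `steps` must be even (Simpson's rule)."""
    if abs(r) >= 1.0:
        return 0.0
    t = abs(r) * math.sqrt(df / (1.0 - r * r))
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    inner = total * h / 3.0  # P(0 < T < t)
    return max(0.0, 1.0 - 2.0 * inner)


# With 100 samples, df = samples - 2 = 98.
print(pvalue_two_sided(0.25, df=98))
```

A convenient sanity check is `df=1`, where the t-distribution is the Cauchy distribution: a statistic of `t = 1` has a two-sided P-value of exactly 0.5.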

Unified implementation for calculating correlations and P-values.

```python
from pynetcor.cor import cortest, chunked_cortest

# Pearson correlation & P-value approximation
cortest_result = cortest(arr1, approx_pvalue=True, threads=8)

# chunked computation, recommended for large-scale analyses that exceed RAM
for chunk in chunked_cortest(arr1, approx_pvalue=True, threads=8):
    for (row_index, col_index, correlation, pvalue) in chunk:
        ...

# Return a 2D array with 4 columns: [row_index, col_index, correlation, pvalue]
```

Multiple testing correction: holm, hochberg, bonferroni, BH, BY.

```python
from pynetcor.cor import cortest, chunked_cortest

# Pearson correlation & multiple testing correction
cortest_result = cortest(arr1, adjust_pvalue=True, adjust_method="BH", threads=8)

# chunked computation, recommended for large-scale analyses that exceed RAM
for chunk in chunked_cortest(arr1, adjust_pvalue=True, adjust_method="BH", threads=8):
    for (row_index, col_index, correlation, pvalue) in chunk:
        ...

# Return a 2D array with 5 columns: [row_index, col_index, correlation, pvalue, adjusted_pvalue]
```

NOTE: The chunked functions only support approximate adjusted P-values. PyNetCor uses approximation methods to achieve effective FDR control before all P-values have been computed.
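For context, the standard (exact) Benjamini-Hochberg procedure needs the complete list of P-values, because each adjusted value depends on its rank among all tests. The sketch below shows that textbook procedure in plain Python; it is not the approximation pyNetCor applies in chunked mode, and `bh_adjust` is a hypothetical helper.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted P-values, returned in the input order."""
    m = len(pvalues)
    # Visit P-values from largest to smallest so a running minimum
    # can enforce monotonicity of the adjusted values.
    order = sorted(range(m), key=lambda i: pvalues[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0
    for pos, idx in enumerate(order):
        rank = m - pos  # 1-based rank in ascending order
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


adjusted = bh_adjust([0.01, 0.04, 0.03, 0.005])
print(adjusted)
```

Since the rank of every P-value must be known up front, an exact adjustment cannot be produced chunk by chunk without a second pass over the results.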

Memory Management and Chunk Size Optimization

When conducting large-scale correlation analysis using pyNetCor, optimizing memory usage and chunk size is crucial for achieving optimal performance. Our experiments have revealed important relationships between dataset dimensions, chunk_size, runtime, and memory consumption (as illustrated in the figures below):

Larger chunk_size values generally lead to faster runtimes.
Runtime gains diminish once chunk_size exceeds roughly 500-1000.
Memory consumption increases linearly with chunk_size.

Based on these observations, we can provide users with recommendations for optimizing chunk_size:

Start with a moderate chunk_size: Begin with a chunk_size around 500-750. This range typically offers a good balance between runtime performance and memory usage.
Consider your dataset size: For smaller datasets (e.g., 70,000-90,000 features), you may be able to use a larger chunk_size without excessive memory consumption, which can speed up processing. For very large datasets (150,000+ features), you might need a smaller chunk_size to stay within memory limits. Always monitor system resources when working with large datasets.
Fine-tune for your specific use case: The optimal chunk_size depends on your dataset size and available RAM. We recommend referring to our experimental results to guide your configuration. The default, chunk_size = 512, is designed to accommodate most analytical needs, but don't hesitate to adjust it based on your specific requirements and system capabilities.
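A back-of-the-envelope estimate can help when picking a starting value. The sketch below assumes each chunk is a dense block of `chunk_size` rows by `n_features` columns stored as 8-byte floats. That storage layout is an assumption of this example, the real footprint of pyNetCor will include additional working memory, and `chunk_bytes` is a hypothetical helper.

```python
def chunk_bytes(chunk_size, n_features, bytes_per_value=8):
    """Bytes needed to hold one dense chunk of chunk_size x n_features values."""
    return chunk_size * n_features * bytes_per_value


# Compare a few chunk sizes for a dataset with 100,000 features.
for chunk_size in (256, 512, 1024):
    gib = chunk_bytes(chunk_size, n_features=100_000) / 2**30
    print(f"chunk_size={chunk_size:5d} -> {gib:.2f} GiB per chunk")
```

Under these assumptions the estimate doubles whenever chunk_size doubles, which matches the linear growth in memory consumption described above.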

Citation

If you use pyNetCor in your research, please cite the publication: PyNetCor: a high-performance Python package for large-scale correlation analysis.

Shibin Long, Yan Xia, Lifeng Liang, Ying Yang, Hailiang Xie, Xiaokai Wang, PyNetCor: a high-performance Python package for large-scale correlation analysis, NAR Genomics and Bioinformatics, Volume 6, Issue 4, December 2024, lqae177, https://doi.org/10.1093/nargab/lqae177

Owner

Name: 01life
Login: 01life
Kind: organization

Repositories: 1
Profile: https://github.com/01life

GitHub Events

Total

Issues event: 2
Watch event: 10
Issue comment event: 1
Push event: 6
Create event: 1

Last Year

Issues event: 2
Watch event: 10
Issue comment event: 1
Push event: 6
Create event: 1

Issues and Pull Requests

Last synced: 10 months ago

All Time

Total issues: 1
Total pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Total issue authors: 1
Total pull request authors: 0
Average comments per issue: 0.0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 1
Pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Issue authors: 1
Pull request authors: 0
Average comments per issue: 0.0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0
