https://github.com/01life/pynetcor

An efficient tool for large-scale correlation network analysis

Science Score: 26.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
  • DOI references
    Found 4 DOI reference(s) in README
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (12.0%) to scientific vocabulary
Last synced: 9 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: 01life
  • License: mit
  • Language: C++
  • Default Branch: main
  • Size: 2.34 MB
Statistics
  • Stars: 9
  • Watchers: 0
  • Forks: 0
  • Open Issues: 1
  • Releases: 0
Created about 2 years ago · Last pushed over 1 year ago
Metadata Files
Readme License

README.md

pyNetCor

PyNetCor is a fast Python C++ extension for correlation and network analysis on high-dimensional datasets. It aims to serve as a scalable foundational package to accelerate large-scale computations.

Features

  • Calculate correlation matrices using the Pearson, Spearman, or Kendall method
  • Process large-scale computations in chunks (larger than RAM)
  • Find top-k and differential correlations between each row of two arrays
  • Efficient P-value approximation and multiple testing correction
  • Handle missing values
  • Multi-threading support

For more details, please refer to the documentation.

Installation

You can install pyNetCor using pip:

```bash
pip install pynetcor
```

Quick Start

We provide a demo notebook to help you quickly understand how to use this project.

Create Data

```python
import numpy as np

features = 1000
samples = 100
arr1 = np.random.random((features, samples))
arr2 = np.random.random((features, samples))
```

Calculate correlation matrix

Compute and return the full matrix at once.

```python
from pynetcor.cor import corrcoef

# using 8 threads
# Pearson correlations between arr1 and itself
cor_result = corrcoef(arr1, threads=8)
```

Compute the matrix in chunks and return an iterator; recommended for large-scale analyses that exceed RAM.

```python
from pynetcor.cor import chunked_corrcoef

# Calculate and return chunk_size=1024 rows of the correlation matrix with each iteration.
cor_iter = chunked_corrcoef(arr1, chunk_size=1024, threads=8)
for cor_chunk_matrix in cor_iter:
    ...
```
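To make the chunking idea concrete, here is a minimal pure-Python sketch of computing a Pearson correlation matrix a few rows at a time. It is not PyNetCor's implementation: the names `standardize` and `chunked_pearson_sketch` are hypothetical, and the code assumes there are no missing values and no constant rows.

```python
import math

def standardize(row):
    """Center a row and scale it to unit length, so dot products equal Pearson r."""
    mean = sum(row) / len(row)
    centered = [v - mean for v in row]
    norm = math.sqrt(sum(v * v for v in centered))
    return [v / norm for v in centered]

def chunked_pearson_sketch(rows, chunk_size):
    """Yield the correlation matrix chunk_size rows at a time (hypothetical helper)."""
    z = [standardize(r) for r in rows]
    for start in range(0, len(z), chunk_size):
        block = z[start:start + chunk_size]
        # Each yielded chunk has len(block) rows and len(rows) columns.
        yield [[sum(a * b for a, b in zip(zi, zj)) for zj in z] for zi in block]

demo = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 1.0, 4.0, 3.0],
    [4.0, 3.0, 2.0, 1.0],
    [1.0, 3.0, 2.0, 5.0],
    [5.0, 1.0, 4.0, 2.0],
]
chunks = list(chunked_pearson_sketch(demo, chunk_size=2))
chunk_shapes = [(len(c), len(c[0])) for c in chunks]
print(chunk_shapes)
```

Only one chunk of the result needs to be held in memory at a time, which is why the iterator form suits matrices that would not fit in RAM as a whole.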

Top-k correlation search

Identify the exact top-k correlations (Spearman correlation in this example).

```python
from pynetcor.cor import cor_topk

# top 0.1% correlations (k=0.001)
cor_topk_result = cor_topk(arr1, method="spearman", k=0.001, threads=8)

# top 100 correlations
cor_topk_result = cor_topk(arr1, method="spearman", k=100, threads=8)

# Return a 2D array with 4 columns: [row_index, col_index, correlation, pvalue]
```
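As a rough illustration of what a top-k search returns, the sketch below ranks every row pair by absolute Spearman correlation and keeps the k largest with a heap. It is a brute-force stand-in, not PyNetCor's algorithm. The function names are hypothetical, ranking by absolute value is an assumption, and the shortcut formula used for Spearman's rho is only valid when a row contains no tied values.

```python
import heapq
from itertools import combinations

def ranks(values):
    """Rank values from 1 to n (assumes no ties, for brevity)."""
    order = sorted(range(len(values)), key=values.__getitem__)
    out = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        out[idx] = rank
    return out

def spearman_no_ties(x, y):
    """Spearman's rho via 1 - 6 * sum(d^2) / (n * (n^2 - 1)); valid without ties."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))

def topk_spearman_sketch(rows, k):
    """Return the k (row_index, col_index, correlation) tuples with the largest |rho|."""
    pairs = ((i, j, spearman_no_ties(rows[i], rows[j]))
             for i, j in combinations(range(len(rows)), 2))
    return heapq.nlargest(k, pairs, key=lambda t: abs(t[2]))

demo_rows = [
    [1, 2, 3, 4],
    [10, 20, 30, 45],
    [4, 3, 2, 1],
    [1, 3, 2, 4],
]
top3 = topk_spearman_sketch(demo_rows, k=3)
print(top3)
```

The heap keeps at most k candidates alive, so memory stays small even though every pair is visited once.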

Top-k differential correlation search

Identify the exact top-k differences in correlation between pairs of features across two states or time points.

```python
# Compute the pairwise correlations separately for arr1 with arr1, and arr2 with arr2,
# then identify the feature pairs with the largest difference
from pynetcor.cor import cor_topkdiff

# top 0.1% differential correlations (k=0.001)
cor_topkdiff_result = cor_topkdiff(x1=arr1, y1=arr2, x2=arr1, y2=arr2, k=0.001, threads=8)

# top 100 differential correlations
cor_topkdiff_result = cor_topkdiff(x1=arr1, y1=arr2, x2=arr1, y2=arr2, k=100, threads=8)

# Return a 2D array with 5 columns: [row_index, col_index, diffCor, cor1, cor2]
```
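The idea of a differential correlation can be sketched in a few lines of plain Python: correlate each feature pair within state 1, again within state 2, and rank the pairs by how much the correlation changed. This is an illustrative brute-force version with hypothetical names. The sign convention `cor1 - cor2` and the ranking by absolute difference are assumptions, not taken from PyNetCor.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Plain Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def topk_diff_sketch(state1, state2, k):
    """Return k rows of (row_index, col_index, diff, cor1, cor2) with the largest |diff|."""
    results = []
    for i, j in combinations(range(len(state1)), 2):
        cor1 = pearson(state1[i], state1[j])
        cor2 = pearson(state2[i], state2[j])
        results.append((i, j, cor1 - cor2, cor1, cor2))
    results.sort(key=lambda row: abs(row[2]), reverse=True)
    return results[:k]

# Three features measured on five samples in each of two states.
state1 = [[1, 2, 3, 4, 5], [2, 4, 6, 8, 11], [5, 3, 4, 1, 2]]
state2 = [[1, 2, 3, 4, 5], [9, 7, 6, 3, 1], [5, 3, 4, 1, 2.5]]
top_diff = topk_diff_sketch(state1, state2, k=2)
for row in top_diff:
    print(row)
```

In the demo data, features 0 and 1 move together in the first state and in opposite directions in the second, so that pair comes out on top.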

P-value computation

Compute the P-values for correlations (Pearson or Spearman) using the Student's t-distribution. The approximation method is significantly faster than the classical method, with absolute errors generally below 1e-8.

```python
from pynetcor.cor import corrcoef, pvalue_student_t

samples = arr1.shape[1]

# Generate the Pearson correlation matrix
cor_result = corrcoef(arr1, threads=8)

# P-value approximation
pvalue_result = pvalue_student_t(cor_result, df=samples-2, approx=True, threads=8)

# P-value classic
pvalue_result = pvalue_student_t(cor_result, df=samples-2, approx=False, threads=8)
```
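For readers who want to see the underlying test, a correlation r with df = samples - 2 degrees of freedom maps to the statistic t = r * sqrt(df / (1 - r^2)). The sketch below is neither PyNetCor's classical routine nor its approximation. It is a standard-library illustration that obtains the same two-sided P-value by integrating the null density of r, which is proportional to (1 - r^2)^((df - 2) / 2), with Simpson's rule. The function names and the `steps` parameter are hypothetical.

```python
import math

def t_statistic(r, df):
    """t = r * sqrt(df / (1 - r^2)) for a correlation r with df = n - 2."""
    return r * math.sqrt(df / (1.0 - r * r))

def pvalue_two_sided(r, df, steps=2000):
    """Two-sided P-value for r under H0 (no correlation), by Simpson integration."""
    if df < 2:
        raise ValueError("this sketch assumes df >= 2")
    lower = abs(r)
    if lower >= 1.0:
        return 0.0
    # Normalizing constant: 1 / Beta(1/2, df/2), computed in log space.
    log_beta = math.lgamma(0.5) + math.lgamma(df / 2) - math.lgamma((df + 1) / 2)
    scale = math.exp(-log_beta)

    def density(x):
        return scale * max(0.0, 1.0 - x * x) ** ((df - 2) / 2)

    h = (1.0 - lower) / steps  # steps must be even for Simpson's rule
    total = density(lower) + density(1.0)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * density(lower + i * h)
    return min(1.0, 2.0 * total * h / 3.0)

t_demo = t_statistic(0.5, df=2)
p_demo = pvalue_two_sided(0.5, df=2)
print(t_demo, p_demo)
```

For df = 2 the Student's t CDF has the closed form 1/2 + t / (2 * sqrt(2 + t^2)), which gives a two-sided P-value of 1 - |r| and makes a convenient sanity check.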

Unified implementation for calculating correlations and P-values.

```python
from pynetcor.cor import cortest, chunked_cortest

# Pearson correlation & P-value approximation
cortest_result = cortest(arr1, approx_pvalue=True, threads=8)

# chunked computation, recommended for large-scale analyses that exceed RAM
for chunk in chunked_cortest(arr1, approx_pvalue=True, threads=8):
    for (row_index, col_index, correlation, pvalue) in chunk:
        ...

# Return a 2D array with 4 columns: [row_index, col_index, correlation, pvalue]
```

Multiple testing correction: holm, hochberg, bonferroni, BH, BY.

```python
from pynetcor.cor import cortest, chunked_cortest

# Pearson correlation & multiple testing correction
cortest_result = cortest(arr1, adjust_pvalue=True, adjust_method="BH", threads=8)

# chunked computation, recommended for large-scale analyses that exceed RAM
for chunk in chunked_cortest(arr1, adjust_pvalue=True, adjust_method="BH", threads=8):
    for (row_index, col_index, correlation, pvalue) in chunk:
        ...

# Return a 2D array with 5 columns: [row_index, col_index, correlation, pvalue, adjusted_pvalue]
```
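For reference, the textbook Benjamini-Hochberg (BH) adjustment can be written in a few lines. The sketch below is the standard exact step-up procedure applied to a complete list of P-values, not PyNetCor's code and not the approximate variant used by the chunked functions. The name `bh_adjust` is hypothetical.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted P-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices, smallest P first
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotonicity with a running minimum.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted

raw = [0.01, 0.04, 0.03, 0.005]
adj = bh_adjust(raw)
print(adj)
```

Because each adjusted value depends on the rank of its P-value among all tests, the exact procedure needs every P-value before it can finish.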

NOTE: The chunked functions only support approximate adjusted P-values. PyNetCor uses approximation methods to achieve effective FDR control before all P-values have been computed.

Memory Management and Chunk Size Optimization

When conducting large-scale correlation analysis using pyNetCor, optimizing memory usage and chunk size is crucial for achieving optimal performance. Our experiments have revealed important relationships between dataset dimensions, chunk_size, runtime, and memory consumption (as illustrated in the figures below):

  • A larger chunk_size generally leads to faster runtimes.
  • The reduction in runtime becomes less significant once chunk_size exceeds 500-1000.
  • Memory consumption increases linearly with chunk_size.

Based on these observations, we can provide users with recommendations for optimizing chunk_size:

  1. Start with a moderate chunk_size: Begin with a chunk_size around 500-750. This range typically offers a good balance between runtime performance and memory usage.
  2. Consider your dataset size: For smaller datasets (e.g., 70,000-90,000 features), you may be able to use a larger chunk_size without excessive memory consumption, which can speed up processing. For very large datasets (150,000+ features), you might need a smaller chunk_size to stay within memory limits. Always monitor system resources when working with large datasets.
  3. Fine-tune for your specific use case: The optimal chunk_size depends on your dataset size and available RAM. We recommend referring to our experimental results to guide your configuration. The default, chunk_size = 512, is designed to accommodate most analytical needs, but don't hesitate to adjust it to your specific requirements and system capabilities.
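A quick back-of-the-envelope calculation can help when picking a starting value. The sketch below estimates the size of a single correlation chunk under two stated assumptions: values are stored as 8-byte floats, and a chunk holds chunk_size x n_features values. It ignores the input arrays, P-values, and any internal buffers, so treat it as a lower bound rather than a measurement. The helper name `chunk_bytes` is hypothetical.

```python
def chunk_bytes(chunk_size, n_features, bytes_per_value=8):
    """Rough size in bytes of one chunk_size x n_features block of float64 values."""
    return chunk_size * n_features * bytes_per_value

n_features = 100_000
estimates = {size: chunk_bytes(size, n_features) for size in (256, 512, 1024)}
for size, nbytes in estimates.items():
    print(f"chunk_size={size}: about {nbytes / 1e6:.1f} MB per chunk")
```

The estimate doubles whenever chunk_size doubles, which matches the linear growth in memory consumption described above.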

Citation

If you use pyNetCor in your research, please cite the publication: PyNetCor: a high-performance Python package for large-scale correlation analysis.

Shibin Long, Yan Xia, Lifeng Liang, Ying Yang, Hailiang Xie, Xiaokai Wang, PyNetCor: a high-performance Python package for large-scale correlation analysis, NAR Genomics and Bioinformatics, Volume 6, Issue 4, December 2024, lqae177, https://doi.org/10.1093/nargab/lqae177

Owner

  • Name: 01life
  • Login: 01life
  • Kind: organization

GitHub Events

Total
  • Issues event: 2
  • Watch event: 10
  • Issue comment event: 1
  • Push event: 6
  • Create event: 1
Last Year
  • Issues event: 2
  • Watch event: 10
  • Issue comment event: 1
  • Push event: 6
  • Create event: 1

Issues and Pull Requests

Last synced: 10 months ago

All Time
  • Total issues: 1
  • Total pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Total issue authors: 1
  • Total pull request authors: 0
  • Average comments per issue: 0.0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 1
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 1
  • Pull request authors: 0
  • Average comments per issue: 0.0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • s-a-nersisyan (1)

Dependencies

setup.py pypi