https://github.com/01life/pynetcor
An efficient tool for large-scale correlation network analysis
Science Score: 26.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ○ .zenodo.json file
- ✓ DOI references: Found 4 DOI reference(s) in README
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (12.0%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: 01life
- License: mit
- Language: C++
- Default Branch: main
- Size: 2.34 MB
Statistics
- Stars: 9
- Watchers: 0
- Forks: 0
- Open Issues: 1
- Releases: 0
Metadata Files
README.md
pyNetCor
PyNetCor is a fast Python C++ extension for correlation and network analysis on high-dimensional datasets. It aims to serve as a scalable foundational package to accelerate large-scale computations.
Features
- Calculate correlation matrix using Pearson, Spearman, or Kendall methods
- Process large-scale computations in chunks (for results larger than RAM)
- Find top-k and differential correlations between each row of two arrays
- Efficient P-value approximation and multiple testing correction
- Handle missing values
- Multi-threaded execution
For more details, please refer to the documentation.
Installation
You can install pyNetCor using pip:
```bash
pip install pynetcor
```
Quick Start
We provide a demo notebook to help you quickly understand how to use this project.
Create Data
```python
import numpy as np

features = 1000
samples = 100
arr1 = np.random.random((features, samples))
arr2 = np.random.random((features, samples))
```
Calculate correlation matrix
Compute and return the full matrix at once.
```python
from pynetcor.cor import corrcoef

# using 8 threads
# Pearson correlations between arr1 and itself
cor_result = corrcoef(arr1, threads=8)
```
Compute the matrix in chunks and return an iterator. This is recommended for large-scale analyses that exceed available RAM.
```python
from pynetcor.cor import chunked_corrcoef

# Calculate and return chunk_size=1024 rows of the correlation matrix with each iteration.
cor_iter = chunked_corrcoef(arr1, chunk_size=1024, threads=8)
for cor_chunk_matrix in cor_iter:
    ...
```
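To make the chunking idea concrete, here is a minimal pure-Python sketch of a generator that yields the Pearson correlation matrix a few rows at a time. It is not pyNetCor's implementation; the names `standardize_rows` and `chunked_pearson_sketch` are hypothetical, and the sketch assumes no missing values and no constant rows.

```python
import math


def standardize_rows(rows):
    """Center each row and scale it to unit Euclidean norm."""
    out = []
    for row in rows:
        mean = sum(row) / len(row)
        centered = [v - mean for v in row]
        # A constant row has zero norm and would raise ZeroDivisionError here.
        norm = math.sqrt(sum(v * v for v in centered))
        out.append([v / norm for v in centered])
    return out


def chunked_pearson_sketch(rows, chunk_size):
    """Yield consecutive blocks of rows of the Pearson correlation matrix."""
    z = standardize_rows(rows)
    for start in range(0, len(z), chunk_size):
        block = z[start:start + chunk_size]
        # The dot product of two standardized rows is their Pearson correlation.
        yield [[sum(a * b for a, b in zip(zi, zj)) for zj in z] for zi in block]


data = [[1, 2, 3, 4], [2, 4, 6, 8], [4, 3, 2, 1]]
for chunk in chunked_pearson_sketch(data, chunk_size=2):
    print(len(chunk), "row(s) of the correlation matrix")
```

Only one block of rows exists in memory at a time, so peak usage depends on `chunk_size` rather than on the size of the full matrix.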
Top-k correlation search
Accurately identify the top-k correlations (Spearman correlation in this example).
```python
from pynetcor.cor import cor_topk

# top 1% correlations
cor_topk_result = cor_topk(arr1, method="spearman", k=0.001, threads=8)

# top 100 correlations
cor_topk_result = cor_topk(arr1, method="spearman", k=100, threads=8)

# Returns a 2D array with 4 columns: [row_index, col_index, correlation, pvalue]
```
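As an illustration of what a top-k search does, the sketch below selects the k strongest pairs from a small, already computed correlation matrix using a bounded heap. It is not pyNetCor's algorithm. The function name `topk_pairs_sketch` is hypothetical, and two choices are assumptions made for this example: ranking by absolute correlation, and skipping self-pairs and mirrored duplicates.

```python
import heapq


def topk_pairs_sketch(cor_matrix, k):
    """Return the k (row, col, r) triples with the largest |r| among pairs row < col."""
    n = len(cor_matrix)
    pairs = (
        (i, j, cor_matrix[i][j])
        for i in range(n)
        for j in range(i + 1, n)
    )
    # heapq.nlargest consumes the generator while keeping only k items in memory.
    return heapq.nlargest(k, pairs, key=lambda item: abs(item[2]))


cor = [
    [1.0, 0.2, -0.9],
    [0.2, 1.0, 0.5],
    [-0.9, 0.5, 1.0],
]
print(topk_pairs_sketch(cor, k=2))
```

Because the heap never holds more than k candidates, the same pattern works when the pairs arrive chunk by chunk instead of from a full matrix.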
Top-k differential correlation search
Accurately identify the top-k differences in correlation between pairs of features across two states or time points.
```python
# Compute the pairwise correlations separately for arr1 with arr1, and arr2 with arr2, then identify the feature pairs with the largest difference
from pynetcor.cor import cor_topkdiff

# top 1% differential correlations
cor_topkdiff_result = cor_topkdiff(x1=arr1, y1=arr2, x2=arr1, y2=arr2, k=0.001, threads=8)

# top 100 differential correlations
cor_topkdiff_result = cor_topkdiff(x1=arr1, y1=arr2, x2=arr1, y2=arr2, k=100, threads=8)

# Returns a 2D array with 5 columns: [row_index, col_index, diffCor, cor1, cor2]
```
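The five-column result can be pictured with a small sketch that compares two precomputed correlation matrices, one per state. This is only an illustration, not pyNetCor's method. The function name `topk_diff_pairs_sketch` is hypothetical, and both the sign convention (`cor1 - cor2`) and ranking by absolute difference are assumptions.

```python
import heapq


def topk_diff_pairs_sketch(cor1, cor2, k):
    """Return the k (row, col, diff, cor1, cor2) tuples with the largest |diff|."""
    n_rows, n_cols = len(cor1), len(cor1[0])
    candidates = (
        (i, j, cor1[i][j] - cor2[i][j], cor1[i][j], cor2[i][j])
        for i in range(n_rows)
        for j in range(n_cols)
    )
    # Assumed convention: rank by the absolute difference between the two states.
    return heapq.nlargest(k, candidates, key=lambda item: abs(item[2]))


state1 = [[0.9, 0.1], [0.0, -0.5]]
state2 = [[0.8, -0.7], [0.1, 0.5]]
for row in topk_diff_pairs_sketch(state1, state2, k=2):
    print(row)
```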
P-value computation
Compute the P-values for correlations (Pearson or Spearman) using the Student's t-distribution. The approximation method is significantly faster than the classical method, with absolute errors almost always below 1e-8.
```python
from pynetcor.cor import corrcoef, pvalue_student_t

samples = arr1.shape[1]

# Generate the Pearson correlation matrix
cor_result = corrcoef(arr1, threads=8)

# P-value approximation
pvalue_result = pvalue_student_t(cor_result, df=samples-2, approx=True, threads=8)

# P-value classic
pvalue_result = pvalue_student_t(cor_result, df=samples-2, approx=False, threads=8)
```
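For readers who want to see the underlying test, a correlation r with df degrees of freedom maps to the statistic t = r * sqrt(df / (1 - r^2)), and the two-sided P-value is the probability mass of the t-distribution beyond |t|. The sketch below computes this with the standard library only, integrating the density numerically with Simpson's rule. It is neither of pyNetCor's two methods; `t_pdf` and `pvalue_from_r_sketch` are hypothetical names, and the numerical integration loses precision for very small P-values.

```python
import math


def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    log_norm = (
        math.lgamma((df + 1) / 2)
        - math.lgamma(df / 2)
        - 0.5 * math.log(df * math.pi)
    )
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))


def pvalue_from_r_sketch(r, df, steps=2000):
    """Two-sided P-value for a correlation r; steps must be even."""
    if abs(r) >= 1.0:
        return 0.0
    t = abs(r) * math.sqrt(df / (1.0 - r * r))
    # Simpson's rule for the integral of the density over [0, t].
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_mass = total * h / 3.0
    # The density is symmetric, so the two tails hold 1 - 2 * central_mass.
    return max(0.0, 1.0 - 2.0 * central_mass)


print(pvalue_from_r_sketch(0.25, df=98))
```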
Unified implementation for calculating correlations and P-values.
```python
from pynetcor.cor import cortest, chunked_cortest

# Pearson correlation & P-value approximation
cortest_result = cortest(arr1, approx_pvalue=True, threads=8)

# chunked computation, recommended for large-scale analyses that exceed RAM
for chunk in chunked_cortest(arr1, approx_pvalue=True, threads=8):
    for (row_index, col_index, correlation, pvalue) in chunk:
        ...

# Returns a 2D array with 4 columns: [row_index, col_index, correlation, pvalue]
```
Multiple testing correction: holm, hochberg, bonferroni, BH, BY.
```python
from pynetcor.cor import cortest, chunked_cortest

# Pearson correlation & multiple testing correction
cortest_result = cortest(arr1, adjust_pvalue=True, adjust_method="BH", threads=8)

# chunked computation, recommended for large-scale analyses that exceed RAM
for chunk in chunked_cortest(arr1, adjust_pvalue=True, adjust_method="BH", threads=8):
    for (row_index, col_index, correlation, pvalue) in chunk:
        ...

# Returns a 2D array with 5 columns: [row_index, col_index, correlation, pvalue, adjusted_pvalue]
```
NOTE: The chunked functions support only approximate adjusted P-values. PyNetCor uses approximation methods to achieve effective FDR control before all P-values have been computed.
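For reference, the exact Benjamini-Hochberg (BH) adjustment ranks all P-values together, which is presumably why chunked processing has to rely on an approximation. The sketch below shows the standard exact procedure in plain Python. It is not pyNetCor's code, and `bh_adjust_sketch` is a hypothetical name.

```python
def bh_adjust_sketch(pvalues):
    """Benjamini-Hochberg adjusted P-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value to the smallest so that adjusted values
    # never increase as the raw P-values decrease.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


print(bh_adjust_sketch([0.01, 0.04, 0.03, 0.005]))
```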
Memory Management and Chunk Size Optimization
When conducting large-scale correlation analysis using pyNetCor, optimizing memory usage and chunk size is crucial for achieving optimal performance. Our experiments have revealed important relationships between dataset dimensions, chunk_size, runtime, and memory consumption (as illustrated in the figures below):
- Larger chunk_size values generally lead to faster runtimes.
- The reduction in runtime becomes less significant once chunk_size exceeds 500-1000.
- Memory consumption increases linearly with chunk_size.
Based on these observations, we can provide users with recommendations for optimizing chunk_size:
- Start with a moderate chunk_size: Begin with a chunk_size of around 500-750. This range typically offers a good balance between runtime performance and memory usage.
- Consider your dataset size: For smaller datasets (e.g., 70,000-90,000 features), you may be able to use a larger chunk_size without excessive memory consumption, which can speed up processing. For very large datasets (150,000+ features), you might need a smaller chunk_size to stay within memory constraints. Always monitor system resources when working with large datasets.
- Fine-tune for your specific use case: The optimal chunk_size varies with your dataset size and available RAM. We recommend referring to our experimental results to guide your configuration. The default chunk_size is 512, which is designed to accommodate most analytical needs, but do not hesitate to adjust it to your specific requirements and system capabilities.
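A rough back-of-the-envelope estimate can help when picking a starting value. The sketch below assumes that each chunk is a dense block of chunk_size rows with one 8-byte floating-point value per feature. That layout is an assumption for illustration, not a statement about pyNetCor's internals, and real usage will be higher because of input data and working buffers. The function name `chunk_memory_bytes` is hypothetical.

```python
def chunk_memory_bytes(chunk_size, n_features, bytes_per_value=8):
    """Estimate the bytes needed for one dense chunk_size x n_features block."""
    return chunk_size * n_features * bytes_per_value


# The estimate grows linearly with chunk_size, matching the observation above.
for chunk_size in (256, 512, 1024):
    gib = chunk_memory_bytes(chunk_size, n_features=150_000) / 2**30
    print(f"chunk_size={chunk_size}: about {gib:.2f} GiB per chunk")
```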
Citation
If you use pyNetCor in your research, please cite the publication: PyNetCor: a high-performance Python package for large-scale correlation analysis.
Shibin Long, Yan Xia, Lifeng Liang, Ying Yang, Hailiang Xie, Xiaokai Wang, PyNetCor: a high-performance Python package for large-scale correlation analysis, NAR Genomics and Bioinformatics, Volume 6, Issue 4, December 2024, lqae177, https://doi.org/10.1093/nargab/lqae177
Owner
- Name: 01life
- Login: 01life
- Kind: organization
- Repositories: 1
- Profile: https://github.com/01life
GitHub Events
Total
- Issues event: 2
- Watch event: 10
- Issue comment event: 1
- Push event: 6
- Create event: 1
Last Year
- Issues event: 2
- Watch event: 10
- Issue comment event: 1
- Push event: 6
- Create event: 1
Issues and Pull Requests
Last synced: 10 months ago
All Time
- Total issues: 1
- Total pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Total issue authors: 1
- Total pull request authors: 0
- Average comments per issue: 0.0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 1
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 1
- Pull request authors: 0
- Average comments per issue: 0.0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- s-a-nersisyan (1)