OnlinePCA.jl: A Julia Package for Out-of-core and Sparse Principal Component Analysis
OnlinePCA.jl: A Julia Package for Out-of-core and Sparse Principal Component Analysis - Published in JOSS (2026)
Science Score: 87.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ○ codemeta.json file
- ○ .zenodo.json file
- ✓ DOI references: found 2 DOI reference(s) in README and JOSS metadata
- ✓ Academic publication links: links to arxiv.org, scholar.google, ncbi.nlm.nih.gov, sciencedirect.com, zenodo.org
- ○ Academic email domains
- ○ Institutional organization owner
- ✓ JOSS paper metadata: published in Journal of Open Source Software
Keywords
Repository
Online Principal Component Analysis
Basic Info
- Host: GitHub
- Owner: rikenbit
- License: other
- Language: Julia
- Default Branch: master
- Homepage: https://rikenbit.github.io/OnlinePCA.jl/
- Size: 9.14 MB
Statistics
- Stars: 24
- Watchers: 8
- Forks: 3
- Open Issues: 0
- Releases: 21
Topics
Metadata Files
README.md
OnlinePCA.jl
Online Principal Component Analysis
📚 Documentation
Description
OnlinePCA.jl binarizes CSV files, summarizes the data matrix, and runs several online PCA functions on extremely large matrices.
Algorithms
- Gradient-based
  - GD-PCA
  - SGD-PCA
  - Oja's method : Erkki Oja et al., 1985, Erkki Oja, 1992
  - CCIPCA : Juyang Weng et al., 2003
  - RSGD-PCA : Silvere Bonnabel, 2013
  - SVRG-PCA : Ohad Shamir, 2015
  - RSVRG-PCA : Hongyi Zhang, et al., 2016, Hiroyuki Sato, et al., 2017
- Krylov subspace-based
  - Orthogonal Iteration (A power method to calculate multiple eigenvectors at once) : Zhaofun Bai, 1987
  - Arnoldi method : Zhaofun Bai, 1987
  - Lanczos method : Zhaofun Bai, 1987
- Random projection-based
  - Halko's method : Halko, N., et al., 2011, Halko, N. et al., 2011
  - Algorithm971 : George C. Linderman, et al., 2017, Huamin, Li, et al., 2017, Vladimir Rokhlin, et al., 2009
  - Randomized Block Krylov Iteration : W, Yu, et al., 2017
  - Single-pass PCA : C Musco, et al., 2015
Learning Parameter Scheduling
- Robbins-Monro : Herbert Robbins, et al., 1951
- Momentum : Ning Qian, 1999
- Nesterov's Accelerated Gradient Descent (NAG) : Nesterov, 1983
- ADAGRAD : John Duchi, et al., 2011
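To make the four scheduling options concrete, the following is a minimal sketch of their textbook update rules applied to a toy quadratic objective. It is not the package's implementation: the function names (`robbins_monro`, `momentum`, `nag`, `adagrad`), the toy objective, and the hyperparameter values (`mu=0.9`, `epsilon=1e-8`) are all assumptions made for illustration.

```julia
using LinearAlgebra

# Toy objective f(w) = 0.5 * ||w - target||^2, whose gradient is w - target.
# (Hypothetical example problem, not related to PCA itself.)
target = [1.0, -2.0, 3.0]
grad(w) = w .- target

# Robbins-Monro: the step size decays as stepsize / t.
function robbins_monro(w0; stepsize=0.5, steps=200)
    w = copy(w0)
    for t in 1:steps
        w .-= (stepsize / t) .* grad(w)
    end
    return w
end

# Momentum: a velocity term accumulates past gradients.
function momentum(w0; stepsize=0.1, mu=0.9, steps=200)
    w, v = copy(w0), zero(w0)
    for _ in 1:steps
        v .= mu .* v .- stepsize .* grad(w)
        w .+= v
    end
    return w
end

# NAG: like momentum, but the gradient is evaluated at the look-ahead point.
function nag(w0; stepsize=0.1, mu=0.9, steps=200)
    w, v = copy(w0), zero(w0)
    for _ in 1:steps
        v .= mu .* v .- stepsize .* grad(w .+ mu .* v)
        w .+= v
    end
    return w
end

# AdaGrad: each coordinate is scaled by its accumulated squared gradients.
function adagrad(w0; stepsize=1.0, epsilon=1e-8, steps=200)
    w, G = copy(w0), zero(w0)
    for _ in 1:steps
        g = grad(w)
        G .+= g .^ 2
        w .-= stepsize .* g ./ (sqrt.(G) .+ epsilon)
    end
    return w
end
```

In this sketch the Robbins-Monro schedule approaches the optimum more slowly than the others because its step size shrinks at every iteration. That is also why the examples in this README use different `stepsize` values for different schedules.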
Installation
Requirements
- Julia 1.0 or later
Installation Methods
Method 1: Using Pkg.add()
```julia
julia> using Pkg
julia> Pkg.add(url="https://github.com/rikenbit/OnlinePCA.jl.git")
```
Method 2: Using Pkg REPL mode
```julia
# Press the "]" key and type the following command.
(v1.7) pkg> add https://github.com/rikenbit/OnlinePCA.jl
# Press Backspace or Ctrl+C to return to the Julia REPL
```
Optional Dependencies
For interactive visualization of PCA results:
```julia
using Pkg
Pkg.add("PlotlyJS")
```
Basic API usage
Preprocess of CSV
```julia
using OnlinePCA
using OnlinePCA: write_csv
using Distributions
using DelimitedFiles
using SparseArrays
using MatrixMarket

# CSV
tmp = mktempdir()
data = Int64.(ceil.(rand(NegativeBinomial(1, 0.5), 300, 99)))
data[1:50, 1:33] .= 100 * data[1:50, 1:33]
data[51:100, 34:66] .= 100 * data[51:100, 34:66]
data[101:150, 67:99] .= 100 * data[101:150, 67:99]
write_csv(joinpath(tmp, "Data.csv"), data)

# Binarization (Zstandard)
csv2bin(csvfile=joinpath(tmp, "Data.csv"), binfile=joinpath(tmp, "Data.zst"))

# Matrix Market (MM)
mmwrite(joinpath(tmp, "Data.mtx"), sparse(data))

# Summary of data for CSV/Dense Matrix
densepath = mktempdir()
sumr(binfile=joinpath(tmp, "Data.zst"), outdir=densepath)
```
Setting for plot
```julia
using DataFrames
using PlotlyJS

function subplots(respca, group)
    # data frame
    data_left = DataFrame(pc1=respca[:,1], pc2=respca[:,2], group=group)
    data_right = DataFrame(pc2=respca[:,2], pc3=respca[:,3], group=group)
    # plot
    p_left = Plot(data_left, x=:pc1, y=:pc2, mode="markers", marker_size=10, group=:group)
    p_right = Plot(data_right, x=:pc2, y=:pc3, mode="markers", marker_size=10, group=:group, showlegend=false)
    # marker colors per group
    p_left.data[1]["marker_color"] = "red"
    p_left.data[2]["marker_color"] = "blue"
    p_left.data[3]["marker_color"] = "green"
    p_right.data[1]["marker_color"] = "red"
    p_right.data[2]["marker_color"] = "blue"
    p_right.data[3]["marker_color"] = "green"
    # legend names
    p_left.data[1]["name"] = "group1"
    p_left.data[2]["name"] = "group2"
    p_left.data[3]["name"] = "group3"
    # titles and axis labels
    p_left.layout["title"] = "PC1 vs PC2"
    p_right.layout["title"] = "PC2 vs PC3"
    p_left.layout["xaxis_title"] = "pc1"
    p_left.layout["yaxis_title"] = "pc2"
    p_right.layout["xaxis_title"] = "pc2"
    p_right.layout["yaxis_title"] = "pc3"
    plot([p_left p_right])
end

group = vcat(repeat(["group1"], inner=33), repeat(["group2"], inner=33), repeat(["group3"], inner=33))
```
GD-PCA
```julia
out_gd1 = gd(input=joinpath(tmp, "Data.zst"), dim=3,
    scheduling="robbins-monro", stepsize=1E-3, numepoch=10,
    rowmeanlist=joinpath(densepath, "Feature_LogMeans.csv"))
out_gd2 = gd(input=joinpath(tmp, "Data.zst"), dim=3,
    scheduling="momentum", stepsize=1E-3, numepoch=10,
    rowmeanlist=joinpath(densepath, "Feature_LogMeans.csv"))
out_gd3 = gd(input=joinpath(tmp, "Data.zst"), dim=3,
    scheduling="nag", stepsize=1E-3, numepoch=10,
    rowmeanlist=joinpath(densepath, "Feature_LogMeans.csv"))
out_gd4 = gd(input=joinpath(tmp, "Data.zst"), dim=3,
    scheduling="adagrad", stepsize=1E-0, numepoch=10,
    rowmeanlist=joinpath(densepath, "Feature_LogMeans.csv"))

subplots(out_gd1[1], group) # Top, Left
subplots(out_gd2[1], group) # Top, Right
subplots(out_gd3[1], group) # Bottom, Left
subplots(out_gd4[1], group) # Bottom, Right
```

SGD-PCA
```julia
out_sgd1 = sgd(input=joinpath(tmp, "Data.zst"), dim=3,
    scheduling="robbins-monro", stepsize=1E-3, numbatch=100, numepoch=10,
    rowmeanlist=joinpath(densepath, "Feature_LogMeans.csv"))
out_sgd2 = sgd(input=joinpath(tmp, "Data.zst"), dim=3,
    scheduling="momentum", stepsize=1E-3, numbatch=100, numepoch=10,
    rowmeanlist=joinpath(densepath, "Feature_LogMeans.csv"))
out_sgd3 = sgd(input=joinpath(tmp, "Data.zst"), dim=3,
    scheduling="nag", stepsize=1E-3, numbatch=100, numepoch=10,
    rowmeanlist=joinpath(densepath, "Feature_LogMeans.csv"))
out_sgd4 = sgd(input=joinpath(tmp, "Data.zst"), dim=3,
    scheduling="adagrad", stepsize=1E-0, numbatch=100, numepoch=10,
    rowmeanlist=joinpath(densepath, "Feature_LogMeans.csv"))

subplots(out_sgd1[1], group) # Top, Left
subplots(out_sgd2[1], group) # Top, Right
subplots(out_sgd3[1], group) # Bottom, Left
subplots(out_sgd4[1], group) # Bottom, Right
```

Oja's method
```julia
out_oja1 = oja(input=joinpath(tmp, "Data.zst"), dim=3,
    scheduling="robbins-monro", stepsize=1E+0, numepoch=10,
    rowmeanlist=joinpath(densepath, "Feature_LogMeans.csv"))
out_oja2 = oja(input=joinpath(tmp, "Data.zst"), dim=3,
    scheduling="momentum", stepsize=1E-3, numepoch=10,
    rowmeanlist=joinpath(densepath, "Feature_LogMeans.csv"))
out_oja3 = oja(input=joinpath(tmp, "Data.zst"), dim=3,
    scheduling="nag", stepsize=1E-3, numepoch=10,
    rowmeanlist=joinpath(densepath, "Feature_LogMeans.csv"))
out_oja4 = oja(input=joinpath(tmp, "Data.zst"), dim=3,
    scheduling="adagrad", stepsize=1E-1, numepoch=10,
    rowmeanlist=joinpath(densepath, "Feature_LogMeans.csv"))

subplots(out_oja1[1], group) # Top, Left
subplots(out_oja2[1], group) # Top, Right
subplots(out_oja3[1], group) # Bottom, Left
subplots(out_oja4[1], group) # Bottom, Right
```
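For readers unfamiliar with Oja's rule, here is a minimal in-memory sketch of the update it performs. This is only an illustration under stated assumptions, not the `oja` function of OnlinePCA.jl: the name `oja_sketch`, the constant step size, the per-sample QR re-orthonormalization, and the assumption that `X` is already centered (features in rows, samples in columns) are all hypothetical choices.

```julia
using LinearAlgebra
using Random

# Oja-style update for the top-k subspace (hypothetical helper, not the package API).
# X is assumed to be centered, with features in rows and samples in columns.
function oja_sketch(X, k; stepsize=1e-2, numepoch=10, rng=MersenneTwister(1))
    d, n = size(X)
    # Random orthonormal starting basis (d x k).
    W = Matrix(qr(randn(rng, d, k)).Q)
    for epoch in 1:numepoch, j in 1:n
        x = view(X, :, j)
        # Rank-one stochastic gradient step: W <- W + stepsize * x * (x' * W)
        W .+= stepsize .* (x .* (W' * x)')
        # Re-orthonormalize the basis after every sample.
        W = Matrix(qr(W).Q)
    end
    return W
end
```

The returned `W` has orthonormal columns that approximately span the leading principal subspace. Each update touches only one sample at a time, which is what makes this family of methods suitable for out-of-core data.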

CCIPCA
```julia
out_ccipca1 = ccipca(input=joinpath(tmp, "Data.zst"), dim=3,
    stepsize=1E-0, numepoch=10,
    rowmeanlist=joinpath(densepath, "Feature_LogMeans.csv"))

subplots(out_ccipca1[1], group)
```

RSGD-PCA
```julia
out_rsgd1 = rsgd(input=joinpath(tmp, "Data.zst"), dim=3,
    scheduling="robbins-monro", stepsize=1E+2, numepoch=10,
    rowmeanlist=joinpath(densepath, "Feature_LogMeans.csv"))
out_rsgd2 = rsgd(input=joinpath(tmp, "Data.zst"), dim=3,
    scheduling="momentum", stepsize=1E-3, numepoch=10,
    rowmeanlist=joinpath(densepath, "Feature_LogMeans.csv"))
out_rsgd3 = rsgd(input=joinpath(tmp, "Data.zst"), dim=3,
    scheduling="nag", stepsize=1E-3, numepoch=10,
    rowmeanlist=joinpath(densepath, "Feature_LogMeans.csv"))
out_rsgd4 = rsgd(input=joinpath(tmp, "Data.zst"), dim=3,
    scheduling="adagrad", stepsize=1E-1, numepoch=10,
    rowmeanlist=joinpath(densepath, "Feature_LogMeans.csv"))

subplots(out_rsgd1[1], group) # Top, Left
subplots(out_rsgd2[1], group) # Top, Right
subplots(out_rsgd3[1], group) # Bottom, Left
subplots(out_rsgd4[1], group) # Bottom, Right
```

SVRG-PCA
```julia
out_svrg1 = svrg(input=joinpath(tmp, "Data.zst"), dim=3,
    scheduling="robbins-monro", stepsize=1E-5, numepoch=10,
    rowmeanlist=joinpath(densepath, "Feature_LogMeans.csv"))
out_svrg2 = svrg(input=joinpath(tmp, "Data.zst"), dim=3,
    scheduling="momentum", stepsize=1E-5, numepoch=10,
    rowmeanlist=joinpath(densepath, "Feature_LogMeans.csv"))
out_svrg3 = svrg(input=joinpath(tmp, "Data.zst"), dim=3,
    scheduling="nag", stepsize=1E-5, numepoch=10,
    rowmeanlist=joinpath(densepath, "Feature_LogMeans.csv"))
out_svrg4 = svrg(input=joinpath(tmp, "Data.zst"), dim=3,
    scheduling="adagrad", stepsize=1E-2, numepoch=10,
    rowmeanlist=joinpath(densepath, "Feature_LogMeans.csv"))

subplots(out_svrg1[1], group) # Top, Left
subplots(out_svrg2[1], group) # Top, Right
subplots(out_svrg3[1], group) # Bottom, Left
subplots(out_svrg4[1], group) # Bottom, Right
```

RSVRG-PCA
```julia
out_rsvrg1 = rsvrg(input=joinpath(tmp, "Data.zst"), dim=3,
    scheduling="robbins-monro", stepsize=1E-6, numepoch=10,
    rowmeanlist=joinpath(densepath, "Feature_LogMeans.csv"))
out_rsvrg2 = rsvrg(input=joinpath(tmp, "Data.zst"), dim=3,
    scheduling="momentum", stepsize=1E-6, numepoch=10,
    rowmeanlist=joinpath(densepath, "Feature_LogMeans.csv"))
out_rsvrg3 = rsvrg(input=joinpath(tmp, "Data.zst"), dim=3,
    scheduling="nag", stepsize=1E-6, numepoch=10,
    rowmeanlist=joinpath(densepath, "Feature_LogMeans.csv"))
out_rsvrg4 = rsvrg(input=joinpath(tmp, "Data.zst"), dim=3,
    scheduling="adagrad", stepsize=1E-2, numepoch=10,
    rowmeanlist=joinpath(densepath, "Feature_LogMeans.csv"))

subplots(out_rsvrg1[1], group) # Top, Left
subplots(out_rsvrg2[1], group) # Top, Right
subplots(out_rsvrg3[1], group) # Bottom, Left
subplots(out_rsvrg4[1], group) # Bottom, Right
```

Orthogonal Iteration (Power method)
```julia
out_orthiter = orthiter(input=joinpath(tmp, "Data.zst"), dim=3,
    rowmeanlist=joinpath(densepath, "Feature_LogMeans.csv"))

subplots(out_orthiter[1], group)
```
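As background, orthogonal iteration generalizes the power method to several eigenvectors at once by multiplying a basis by the matrix and re-orthonormalizing. The sketch below shows the idea on a small in-memory symmetric matrix. It is an assumption-laden illustration, not the package's `orthiter`: the name `orthiter_sketch`, the fixed iteration count, and the in-memory covariance matrix `C` are hypothetical.

```julia
using LinearAlgebra
using Random

# Orthogonal iteration on a symmetric matrix C (hypothetical helper, not the package API).
function orthiter_sketch(C, k; numiter=100, rng=MersenneTwister(1))
    d = size(C, 1)
    # Random orthonormal starting basis (d x k).
    W = Matrix(qr(randn(rng, d, k)).Q)
    for _ in 1:numiter
        # Multiply by C, then re-orthonormalize with a thin QR factorization.
        W = Matrix(qr(C * W).Q)
    end
    # Rayleigh quotients approximate the leading eigenvalues.
    vals = diag(W' * C * W)
    return W, vals
end
```

With `k = 1` this reduces to the ordinary power method. The out-of-core setting differs mainly in that the product `C * W` is accumulated over chunks of data instead of from a matrix held in memory.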

Arnoldi method
```julia
out_arnoldi = arnoldi(input=joinpath(tmp, "Data.zst"), dim=3,
    rowmeanlist=joinpath(densepath, "Feature_LogMeans.csv"))

subplots(out_arnoldi[1], group)
```

Lanczos method
```julia
out_lanczos = lanczos(input=joinpath(tmp, "Data.zst"), dim=3,
    rowmeanlist=joinpath(densepath, "Feature_LogMeans.csv"))

subplots(out_lanczos[1], group)
```

Halko's method
```julia
out_halko = halko(input=joinpath(tmp, "Data.zst"), dim=3,
    rowmeanlist=joinpath(densepath, "Feature_LogMeans.csv"))

subplots(out_halko[1], group)
```
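The random projection idea behind Halko's method can be sketched in a few lines for an in-memory matrix. The following is a generic randomized SVD with power iterations, written as an illustration only. The name `rsvd_sketch` and the parameters `oversample` and `niter` are assumptions and do not correspond to the options of the package's `halko` function.

```julia
using LinearAlgebra
using Random

# Randomized SVD in the style of Halko et al. (hypothetical helper, not the package API).
function rsvd_sketch(A, k; oversample=5, niter=2, rng=MersenneTwister(1))
    m, n = size(A)
    # Project A onto a random low-dimensional subspace and orthonormalize.
    Omega = randn(rng, n, k + oversample)
    Q = Matrix(qr(A * Omega).Q)
    # Power iterations sharpen the captured subspace.
    for _ in 1:niter
        Q = Matrix(qr(A' * Q).Q)
        Q = Matrix(qr(A * Q).Q)
    end
    # Solve the small problem and map the left singular vectors back.
    F = svd(Q' * A)
    U = Q * F.U
    return U[:, 1:k], F.S[1:k], F.V[:, 1:k]
end
```

Only products of `A` with thin matrices are needed, so in principle each product can be accumulated over blocks of `A` read from disk.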

Algorithm 971
```julia
out_algorithm971 = algorithm971(input=joinpath(tmp, "Data.zst"), dim=3,
    rowmeanlist=joinpath(densepath, "Feature_LogMeans.csv"))

subplots(out_algorithm971[1], group)
```

Randomized Block Krylov Iteration
```julia
out_rbkiter = rbkiter(input=joinpath(tmp, "Data.zst"), dim=3,
    rowmeanlist=joinpath(densepath, "Feature_LogMeans.csv"))

subplots(out_rbkiter[1], group)
```

Single-pass PCA type I
```julia
out_singlepass = singlepass(input=joinpath(tmp, "Data.zst"), dim=3,
    rowmeanlist=joinpath(densepath, "Feature_LogMeans.csv"))

subplots(out_singlepass[1], group)
```

Single-pass PCA type II
```julia
out_singlepass2 = singlepass2(input=joinpath(tmp, "Data.zst"), dim=3,
    rowmeanlist=joinpath(densepath, "Feature_LogMeans.csv"))

subplots(out_singlepass2[1], group)
```

Summarization for 10X-HDF5
```julia
tenxsumr(tenxfile="Data.h5", group="mm10", chunksize=100)
```
Algorithm 971 for 10X-HDF5
```julia
out_tenxpca = tenxpca(tenxfile="Data.h5", scale="sqrt",
    rowmeanlist="Feature_SqrtMeans.csv", dim=3, chunksize=100, group="mm10")
```
Summary of data for MM/Sparse Matrix
```julia
# Sparsification + Binarization (Zstandard + MM format)
mm2bin(mmfile=joinpath(tmp, "Data.mtx"), binfile=joinpath(tmp, "Data.mtx.zst"))

sparsepath = mktempdir()
sumr(binfile=joinpath(tmp, "Data.mtx.zst"), outdir=sparsepath, mode="sparse_mm")
```
Sparse Randomized SVD for MM format
```julia
out_sparse_rsvd = sparse_rsvd(
    input=joinpath(tmp, "Data.mtx.zst"),
    scale="ftt",
    rowmeanlist=joinpath(sparsepath, "Feature_FTTMeans.csv"),
    dim=3, chunksize=100)

subplots(out_sparse_rsvd[1], group)
```

Exact Out-of-Core PCA
Unlike the other PCA functions, this function assumes a data matrix laid out as data × dimensions (data points in rows). It is computationally efficient when the matrix is tall, that is, when the number of data points is much larger than the number of dimensions. The example below first prepares data of this shape. The function can also be used without running sumr beforehand to extract row and column statistics.
```julia
# CSV
tmp2 = mktempdir()
data2 = Int64.(ceil.(rand(NegativeBinomial(1, 0.5), 99, 30)))
data2[1:33, 1:10] .= 100 * data2[1:33, 1:10]
data2[34:66, 11:20] .= 100 * data2[34:66, 11:20]
data2[67:99, 21:30] .= 100 * data2[67:99, 21:30]
write_csv(joinpath(tmp2, "Data2.csv"), data2)

# Binarization (Zstandard)
csv2bin(csvfile=joinpath(tmp2, "Data2.csv"), binfile=joinpath(tmp2, "Data2.zst"))

# Matrix Market (MM)
mmwrite(joinpath(tmp2, "Data2.mtx"), sparse(data2))

# Binary COO (BinCOO)
data3 = Int64.(ceil.(rand(Binomial(1, 0.2), 99, 33)))
data3[1:33, 1:11] .= 1
data3[34:66, 12:22] .= 1
data3[67:99, 23:33] .= 1

# Write one "row column" pair per nonzero entry
bincoofile = joinpath(tmp2, "Data3.bincoo")
open(bincoofile, "w") do io
    for i in 1:size(data3, 1)
        for j in 1:size(data3, 2)
            if data3[i, j] != 0
                println(io, "$i $j")
            end
        end
    end
end

# Binarization (CSV + Zstandard)
csv2bin(csvfile=joinpath(tmp2, "Data2.csv"), binfile=joinpath(tmp2, "Data2.zst"))

# Binarization (MM + Zstandard)
mm2bin(mmfile=joinpath(tmp2, "Data2.mtx"), binfile=joinpath(tmp2, "Data2.mtx.zst"))

# Binarization (BinCOO + Zstandard)
bincoo2bin(bincoofile=bincoofile, binfile=joinpath(tmp2, "Data3.bincoo.zst"))
```
```julia
# Dense-mode
out_exact_ooc_pca_dense = exact_ooc_pca(
    input=joinpath(tmp2, "Data2.zst"),
    scale="raw", dim=3, chunksize=10)

subplots(out_exact_ooc_pca_dense[3], group)
```

```julia
# Sparse-mode (MM)
out_exact_ooc_pca_sparse_mm = exact_ooc_pca(
    input=joinpath(tmp2, "Data2.mtx.zst"),
    scale="raw", dim=3, chunksize=10, mode="sparse_mm")

subplots(out_exact_ooc_pca_sparse_mm[3], group)
```

```julia
# Sparse-mode (BinCOO)
out_exact_ooc_pca_sparse_bincoo = exact_ooc_pca(
    input=joinpath(tmp2, "Data3.bincoo.zst"),
    scale="raw", dim=3, chunksize=10, mode="sparse_bincoo")

subplots(out_exact_ooc_pca_sparse_bincoo[3], group)
```
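To illustrate why a tall matrix (many data points, few dimensions) is convenient for exact out-of-core PCA, here is a minimal sketch of one standard approach: accumulate the small Gram matrix chunk by chunk, then eigendecompose it. Whether `exact_ooc_pca` works this way internally is an assumption. The name `gram_pca_sketch` is hypothetical, and the "chunks" here are simply row slices of an in-memory matrix standing in for blocks read from disk.

```julia
using LinearAlgebra
using Random
using Statistics

# Exact PCA of an n x d matrix (n >> d) from chunk-wise accumulated statistics.
# Hypothetical helper for illustration; not the package API.
function gram_pca_sketch(X, k; chunksize=10)
    n, d = size(X)
    colsum = zeros(d)
    G = zeros(d, d)
    # First pass: accumulate column sums and the d x d Gram matrix.
    for start in 1:chunksize:n
        stop = min(start + chunksize - 1, n)
        chunk = Float64.(X[start:stop, :])  # stands in for a block read from disk
        colsum .+= vec(sum(chunk, dims=1))
        G .+= chunk' * chunk
    end
    mu = colsum ./ n
    # Covariance from the accumulated statistics.
    C = (G .- n .* (mu * mu')) ./ (n - 1)
    F = eigen(Symmetric(C))
    order = sortperm(F.values, rev=true)[1:k]
    V = F.vectors[:, order]
    # Second pass: project each centered chunk onto the eigenvectors.
    scores = zeros(n, k)
    for start in 1:chunksize:n
        stop = min(start + chunksize - 1, n)
        scores[start:stop, :] = (Float64.(X[start:stop, :]) .- mu') * V
    end
    return F.values[order], V, scores
end
```

The memory cost is dominated by the d × d matrix, independent of the number of data points, which is the property the text above relies on.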

Command line usage
All the CSV preprocessing functions and PCA functions can also be run as command line tools with the same parameter names, as shown below.
```bash
# CSV → Julia Binary (e.g., csv2bin, mm2bin)
julia YOURHOMEDIR/.julia/v0.x/OnlinePCA/bin/csv2bin \
    --csvfile Data.csv --binfile Data.zst

# Summary statistics extracted from Julia Binary (e.g., sumr, tenxsumr)
julia YOURHOMEDIR/.julia/v0.x/OnlinePCA/bin/sumr \
    --binfile Data.zst

# Perform PCA
julia YOURHOMEDIR/.julia/v0.x/OnlinePCA/bin/gd \
    --input Data.zst --dim 3 --scheduling robbins-monro --stepsize 10 \
    --numepoch 10 --rowmeanlist Feature_LogMeans.csv
```
Distributed Computing with Multiple Stepsize Setting
The online PCA algorithms run until the reconstruction error converges. Under the default stopping criteria, the calculation stops when the relative change is below 1E-3 or above 0.03. These values can be changed with the lower and upper options, respectively.
Convergence depends on the step size parameter, whose default value is 1000. This value is tuned for single-cell RNA-Seq datasets, but the appropriate value may change with the size and dynamic range of the data matrix.
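The stopping rule described above can be sketched as a small predicate. This is only an illustration: the exact definition of "relative change" used by the package is not given here, so the absolute relative difference between consecutive reconstruction errors is an assumption, and the names `relative_change` and `should_stop` are hypothetical.

```julia
# Relative change between consecutive reconstruction errors.
# Assumed definition; the package may compute this differently.
relative_change(prev_error, curr_error) = abs(curr_error - prev_error) / abs(prev_error)

# Stop when the change is tiny (converged) or large (possibly diverging).
function should_stop(prev_error, curr_error; lower=1e-3, upper=0.03)
    rc = relative_change(prev_error, curr_error)
    return rc < lower || rc > upper
end

# Usage with made-up error values
should_stop(100.0, 99.99)  # change of 1e-4 is below lower
should_stop(100.0, 99.0)   # change of 0.01 lies between lower and upper
should_stop(100.0, 110.0)  # change of 0.1 is above upper
```

Under this reading, a step size that is too large tends to trigger the upper bound early, while one that is too small triggers the lower bound before the error has dropped much. That is one reason to compare several step sizes.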
Combined with Grid Engine, this step is easily parallelized, because the calculations for different step sizes are performed independently. For example, we first make the following template file (e.g., oja_template) containing the online PCA script,
```bash
#!/bin/bash
julia YOURHOMEDIR/.julia/v0.x/OnlinePCA/bin/oja \
    --scale log \
    --input Data.zst \
    --outdir XXXXX \
    --rowmeanlist Feature_LogMeans.csv \
    --dim 10 \
    --stepsize YYYYY \
    --logdir XXXXX/log
```
and then rewrite the template with the sed command to set different step sizes, and submit each job with the qsub command.
```bash
#!/bin/bash
Steps=(1 10 100 1000 10000 100000 1000000)
for i in ${Steps[@]}; do
    OUT="Step"$i
    mkdir -p $OUT
    # Fill in the output directory, then the step size
    sed -e "s|XXXXX|$OUT|g" oja_template > TMPojascData.sh
    sed -e "s|YYYYY|$i|g" TMPojascData.sh > ojascData.sh
    chmod +x ojascData.sh
    qsub ojascData.sh
done
```
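The same placeholder substitution can also be done from Julia instead of sed. The sketch below only generates the script text for each step size and does not submit anything. The helper name `fill_template` and the one-line form of the template are assumptions for illustration; the placeholders `XXXXX` and `YYYYY` follow the template shown above.

```julia
# One-line version of the template, with the same placeholders as above.
template = "julia YOURHOMEDIR/.julia/v0.x/OnlinePCA/bin/oja --scale log --input Data.zst " *
           "--outdir XXXXX --rowmeanlist Feature_LogMeans.csv --dim 10 " *
           "--stepsize YYYYY --logdir XXXXX/log"

# Replace the output directory and step size placeholders (hypothetical helper).
function fill_template(template, outdir, stepsize)
    script = replace(template, "XXXXX" => outdir)
    return replace(script, "YYYYY" => string(stepsize))
end

# Step sizes 1, 10, ..., 1000000, one script per output directory.
steps = [10^p for p in 0:6]
scripts = Dict("Step$(s)" => fill_template(template, "Step$(s)", s) for s in steps)
```

Each value in `scripts` could then be written to a file with `write` and submitted or run as in the shell examples.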
Even without a distributed computing environment, the jobs can be run as background processes (just add & at the end of the command).
```bash
#!/bin/bash
Steps=(1 10 100 1000 10000 100000 1000000)
for i in ${Steps[@]}; do
    mkdir -p "Step"$i
    julia YOURHOMEDIR/.julia/v0.x/OnlinePCA/bin/oja \
        --scale log \
        --input Data.zst \
        --outdir "Step"$i \
        --rowmeanlist Feature_LogMeans.csv \
        --dim 10 \
        --stepsize $i \
        --logdir "Step"$i/log &
done

# Check the running background jobs
ps | grep julia
```
Contributing
If you have suggestions for how OnlinePCA.jl could be improved, or want to report a bug, open an issue! We'd love all and any contributions.
For more, check out the Contributing Guide.
Author
- Koki Tsuyuzaki
Owner
- Name: RIKEN BiT
- Login: rikenbit
- Kind: organization
- Location: Japan
- Website: https://bit.riken.jp/
- Twitter: dritoshien
- Repositories: 80
- Profile: https://github.com/rikenbit
Laboratory for Bioinformatics Research, RIKEN Center for Biosystems Dynamics Research
JOSS Publication
OnlinePCA.jl: A Julia Package for Out-of-core and Sparse Principal Component Analysis
Authors
Tags
Principal Component Analysis, Out-of-Core, Sparse, dimensionality reduction
GitHub Events
Total
- Release event: 13
- Delete event: 4
- Pull request event: 32
- Issues event: 6
- Watch event: 2
- Issue comment event: 8
- Push event: 59
- Create event: 30
Last Year
- Release event: 13
- Delete event: 4
- Pull request event: 32
- Issues event: 6
- Watch event: 1
- Issue comment event: 8
- Push event: 59
- Create event: 30
Issues and Pull Requests
Last synced: 3 months ago
All Time
- Total issues: 3
- Total pull requests: 13
- Average time to close issues: 33 minutes
- Average time to close pull requests: 8 days
- Total issue authors: 2
- Total pull request authors: 2
- Average comments per issue: 1.33
- Average comments per pull request: 0.0
- Merged pull requests: 12
- Bot issues: 0
- Bot pull requests: 13
Past Year
- Issues: 3
- Pull requests: 13
- Average time to close issues: 33 minutes
- Average time to close pull requests: 8 days
- Issue authors: 2
- Pull request authors: 2
- Average comments per issue: 1.33
- Average comments per pull request: 0.0
- Merged pull requests: 12
- Bot issues: 0
- Bot pull requests: 13
Top Authors
Issue Authors
- kokitsuyuzaki (2)
- JuliaTagBot (1)
Pull Request Authors
- github-actions[bot] (10)
- dependabot[bot] (3)
Top Labels
Issue Labels
Pull Request Labels
Packages
- Total packages: 1
- Total downloads: unknown
- Total dependent packages: 0
- Total dependent repositories: 0
- Total versions: 2
juliahub.com: OnlinePCA
Online Principal Component Analysis
- Homepage: https://rikenbit.github.io/OnlinePCA.jl/
- Documentation: https://docs.juliahub.com/General/OnlinePCA/stable/
- License: MIT
-
Latest release: 0.3.10
published 8 months ago
Rankings
Dependencies
- actions/checkout v4 composite
- julia-actions/cache v2 composite
- julia-actions/julia-buildpkg v1 composite
- julia-actions/julia-runtest v1 composite
- julia-actions/setup-julia v2 composite
- JuliaRegistries/TagBot v1 composite
- actions/checkout v4 composite
- docker/build-push-action v6 composite
- docker/login-action v3 composite
- julia 1.8.0-rc1-buster build
- actions/checkout v3 composite
- julia-actions/setup-julia v1 composite
