rfmix-reader
Science Score: 67.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: Found CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ✓ DOI references: Found 3 DOI reference(s) in README
- ✓ Academic publication links: Links to biorxiv.org, zenodo.org
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (17.7%) to scientific vocabulary
Keywords
Repository
Basic Info
- Host: GitHub
- Owner: heart-gen
- License: gpl-3.0
- Language: Python
- Default Branch: main
- Size: 650 KB
Statistics
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 12
- Releases: 2
Topics
Metadata Files
README.md
RFMix-reader
RFMix-reader is a Python package designed to efficiently read and process output
files generated by RFMix, a popular tool
for estimating local ancestry in admixed populations. The package employs a lazy
loading approach, which minimizes memory consumption by reading only the loci that
are accessed by the user, rather than loading the entire dataset into memory at
once. Additionally, we leverage GPU acceleration to improve computational speed.
Install
rfmix-reader can be installed using pip:
```bash
pip install rfmix-reader
```
GPU Acceleration:
rfmix-reader leverages GPU acceleration for improved performance. To use this
functionality, you will need to install the following libraries for your specific
CUDA version:
- RAPIDS: Refer to the official installation guide here
- PyTorch: Installation instructions can be found here
Additional Notes:
- We have not tested installation with Docker or Conda environments. Compatibility
may vary.
- If you do not have a GPU, you can still use the basic functionality of rfmix-reader.
This is still much faster than processing the files with standard scripting.
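Before choosing between the GPU and CPU paths, it can help to check which optional libraries are importable. The snippet below is a minimal sketch using only the standard library; the helper name `optional_gpu_libraries` and the module list (`cupy`, `cudf`, `torch`) are illustrative assumptions, not part of rfmix-reader. Note that an importable library does not guarantee that a working GPU is present.

```python
from importlib.util import find_spec


def optional_gpu_libraries(names=("cupy", "cudf", "torch")):
    """Report which optional GPU libraries can be imported here.

    Hypothetical helper: the module names are assumptions about a
    typical RAPIDS + PyTorch setup, not a list taken from rfmix-reader.
    """
    return {name: find_spec(name) is not None for name in names}


available = optional_gpu_libraries()
# Only treat the environment as GPU-ready if every library is importable.
use_gpu = all(available.values())
print(available, "-> GPU path" if use_gpu else "-> CPU path")
```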
Key Features
Lazy Loading
- Reads data on-the-fly as requested, reducing memory footprint.
- Ideal for working with large RFMix output files that may not fit entirely in memory.
Efficient Data Access
- Provides convenient access to specific loci or regions of interest.
- Allows for selective loading of data, enabling faster processing times.
Seamless Integration
- Designed to work seamlessly with existing Python data analysis workflows.
- Facilitates downstream analysis and manipulation of RFMix output data.
Loci Imputation
- Imputes local ancestry loci onto the genomic positions of a larger genotype dataset.
- Array-based data for ease of integration with downstream analysis.
Whether you are working with large-scale genomic datasets or have limited
computational resources, RFMix-reader offers an efficient and memory-conscious
solution for reading and processing RFMix output files. Its lazy loading approach
ensures optimal resource utilization, making it a valuable tool for researchers
and bioinformaticians working with admixed population data.
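To illustrate the lazy loading idea in general terms, here is a small standard-library sketch that reads a single row from a binary file only when that row is requested. It is not how rfmix-reader is implemented; the `LazyRows` class, the float32 row layout, and the demo file are all assumptions made for the example.

```python
import os
import struct
import tempfile


class LazyRows:
    """Hypothetical reader: fetch fixed-width float32 rows on demand."""

    def __init__(self, path, n_cols):
        self.path = path
        self.n_cols = n_cols
        self.row_bytes = 4 * n_cols  # 4 bytes per float32 value
        self.n_rows = os.path.getsize(path) // self.row_bytes

    def __len__(self):
        return self.n_rows

    def __getitem__(self, i):
        if not 0 <= i < self.n_rows:
            raise IndexError(i)
        # Seek straight to the requested row; nothing else is read.
        with open(self.path, "rb") as fh:
            fh.seek(i * self.row_bytes)
            return struct.unpack(f"<{self.n_cols}f", fh.read(self.row_bytes))


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.bin")
    # Write five rows of three values each as a stand-in dataset.
    with open(path, "wb") as fh:
        for r in range(5):
            fh.write(struct.pack("<3f", r, r + 0.5, r + 0.25))
    rows = LazyRows(path, n_cols=3)
    n_rows_seen = len(rows)
    third = rows[2]  # only this row's 12 bytes are read from disk

print(n_rows_seen, third)
```

Memory use depends on what is accessed rather than on the file size, which is the property the lazy loading feature relies on.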
Simulation Data
Simulation data for testing two- and three-population admixture is available on Synapse: syn61691659.
Usage
This works similarly to pandas-plink:
Two Population Admixture Example
This is a two-part process.
Generate Binary Files
To reduce computational time and memory, we leverage binary files.
While RFMix does not generate these directly, we provide a function
for their creation: create_binaries. This function can also be invoked
via the command line:
`create-binaries [-h] [--version] [--binary_dir BINARY_DIR] file_path`
```bash
create-binaries two_pops/out/
```
This writes the binary files to `./binary_files` by default, since
`--binary_dir` is an optional parameter.
```python
from rfmix_reader import create_binaries

# Generate binary files
file_path = "two_pops/out/"
binary_dir = "./binary_files"
create_binaries(file_path, binary_dir=binary_dir)
```
You can also do this on the fly.
```python
from rfmix_reader import read_rfmix

file_path = "two_pops/out/"
binary_dir = "./binary_files"
# generate_binary=True creates the binary files before reading
loci, rf_q, admix = read_rfmix(file_path, binary_dir=binary_dir,
                               generate_binary=True)
```
This is not turned on by default because it is the
rate-limiting step. It can take upwards of 20 to 25 minutes
to run, depending on the size of the *fb.tsv files.
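Because generating binaries is the slow step, one option is to request it only when the binary directory is missing or empty. The sketch below uses a hypothetical helper, `needs_binaries`, and checks only for the presence of files; it makes no assumption about how rfmix-reader names its binary files, so it cannot detect a partially written directory.

```python
import tempfile
from pathlib import Path


def needs_binaries(binary_dir="./binary_files"):
    """Return True when binary_dir is missing or contains no entries.

    Hypothetical helper, not part of rfmix-reader.
    """
    d = Path(binary_dir)
    return not (d.is_dir() and any(d.iterdir()))


# Demonstration on a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp) / "binary_files"
    before = needs_binaries(demo_dir)  # directory does not exist yet
    demo_dir.mkdir()
    (demo_dir / "placeholder.bin").write_bytes(b"\x00")
    after = needs_binaries(demo_dir)   # directory now has a file

print(before, after)

# Possible use with the reader (requires rfmix-reader to be installed):
# loci, rf_q, admix = read_rfmix(file_path, binary_dir=binary_dir,
#                                generate_binary=needs_binaries(binary_dir))
```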
Main Function
Once the binary files are generated, you can use the main function to process the RFMix results. With a GPU, this takes less than 5 minutes.
```python
from rfmix_reader import read_rfmix

file_path = "two_pops/out/"
loci, rf_q, admix = read_rfmix(file_path)
```
**Note:** `./binary_files` is the default for `binary_dir`,
so this is an optional parameter.
Three Population Admixture Example
RFMix-reader is adaptable for as many population admixtures as
needed.
```python
from rfmix_reader import read_rfmix

file_path = "examples/three_popuations/out/"
binary_dir = "./binary_files"
loci, rf_q, admix = read_rfmix(file_path, binary_dir=binary_dir,
                               generate_binary=True)
```
Loci Imputation
Imputing local ancestry loci information to genotype variant locations improves
integration of the local ancestry information with genotype data. As such, we provide
the `interpolate_array` function to efficiently interpolate missing values when local
ancestry loci are mapped onto the denser set of genotype variant locations.
It leverages the power of Zarr
arrays, making it suitable for handling substantial datasets while managing memory
usage effectively.
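Processing rows in fixed-size chunks is what keeps memory use bounded. As a minimal sketch of that idea (the generator `chunk_bounds` is illustrative and not taken from the package), the row ranges for a given chunk size can be produced like this:

```python
def chunk_bounds(n_rows, chunk_size):
    """Yield (start, stop) row ranges that cover n_rows in order."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, n_rows, chunk_size):
        # The last chunk may be shorter than chunk_size.
        yield start, min(start + chunk_size, n_rows)


bounds = list(chunk_bounds(n_rows=120_000, chunk_size=50_000))
print(bounds)  # [(0, 50000), (50000, 100000), (100000, 120000)]
```

Each range can then be loaded, interpolated, and written out before the next one is touched, so only one chunk needs to be held in memory at a time.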
Features
- CUDA Acceleration: Uses CUDA for performance enhancement when available; otherwise, it defaults to NumPy.
- Chunk Processing: Processes data in manageable chunks to optimize memory usage, making it ideal for large datasets.
- Progress Monitoring: Displays progress through a `tqdm` progress bar, providing real-time feedback during execution.
- Column-wise Interpolation: Employs the `_interpolate_col` function to perform interpolation along each column of the dataset.
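To make the column-wise step concrete, here is a pure-Python sketch that fills missing rows in each column by linear interpolation between the nearest known rows. Linear interpolation is an assumption chosen for illustration; the package's `_interpolate_col` may use a different rule, and the function names below are hypothetical.

```python
import math


def interpolate_column(values):
    """Linearly fill NaN gaps; gaps at the edges copy the nearest known value."""
    known = [i for i, v in enumerate(values) if not math.isnan(v)]
    out = list(values)
    if not known:
        return out  # nothing to anchor the interpolation on
    for i in range(known[0]):
        out[i] = values[known[0]]
    for i in range(known[-1] + 1, len(values)):
        out[i] = values[known[-1]]
    for left, right in zip(known, known[1:]):
        span = right - left
        for i in range(left + 1, right):
            w = (i - left) / span  # fractional distance from the left anchor
            out[i] = values[left] * (1 - w) + values[right] * w
    return out


def interpolate_rows(rows):
    """Apply interpolate_column to every column of a row-major table."""
    cols = [interpolate_column(list(col)) for col in zip(*rows)]
    return [list(row) for row in zip(*cols)]


nan = float("nan")
# Four variant positions; only the first and last carry local ancestry values.
expanded = [[0.0, 2.0], [nan, nan], [nan, nan], [3.0, 2.0]]
filled = interpolate_rows(expanded)
print(filled)
```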
Example Usage
```python
import numpy as np
import pandas as pd
import dask.array as da
# Assumption: interpolate_array is exported at the package top level
from rfmix_reader import interpolate_array

# Outer-merged DataFrame of loci and variant locations:
# "i" is from the loci information; "chrom" and "pos" come from both DataFrames
variant_loci_df = pd.DataFrame({'chrom': ['1', '1', '1', '1'],
                                'pos': [100, 200, 300, 400],
                                'i': [1, np.nan, np.nan, 2]})

# Dask array of admixture data, which has fewer rows than variant_loci_df
admix = da.random.random((2, 3))  # Random data here

# This expands the Dask array (admix) and interpolates missing data.
# Default chunk_size = 50,000, assuming variant_loci_df has 6-9M rows.
# Default batch_size = 10,000, assuming admix has 2-4M loci rows.
# Adjust these based on the size of variant_loci_df.
z = interpolate_array(variant_loci_df, admix, '/path/to/output',
                      chunk_size=100, batch_size=100)

# Check the shape of the resulting Zarr array, which should have the same
# number of rows as variant_loci_df
print(z.shape)  # Output: (4, 3)
```
Example Preprocessing Functions
The helper functions _load_genotypes and _load_admix are designed to facilitate
the loading of loci and genotype data for constructing the variant_loci_df
DataFrame.
- `_load_genotypes(plink_prefix_path)`: This function uses the `tensorqtl` library to read genotype data from PLINK files (PGEN). It returns both the loaded genotype data and a DataFrame containing variant information, which includes chromosome and position details. The chromosome identifiers are formatted to include the "chr" prefix for consistency.
- `_load_admix(prefix_path, binary_dir)`: This function employs the `rfmix_reader` library to load local ancestry data from specified paths. It reads the ancestry information into a suitable format for further processing, enabling integration with genotype data.
These functions ensure accurate loading and formatting of variant and local ancestry data, streamlining subsequent analyses.
```python
# Assumption: arr_mod is the GPU array module (CuPy), since .get() is used below
import cupy as arr_mod
# Assumption: interpolate_array is exported at the package top level
from rfmix_reader import interpolate_array


def _load_genotypes(plink_prefix_path):
    from tensorqtl import pgen
    pgr = pgen.PgenReader(plink_prefix_path)
    variant_df = pgr.variant_df
    # Add the "chr" prefix so chromosome names match the local ancestry data
    variant_df.loc[:, "chrom"] = "chr" + variant_df.chrom
    return pgr.load_genotypes(), variant_df


def _load_admix(prefix_path, binary_dir):
    from rfmix_reader import read_rfmix
    return read_rfmix(prefix_path, binary_dir=binary_dir)


def testing():
    basename = "/projects/b1213/largeprojects/braincolocapp/input"
    # Local ancestry
    prefix_path = f"{basename}/localancestryrfmix/m/"
    binary_dir = f"{basename}/localancestryrfmix/m/binary_files/"
    loci, _, admix = _load_admix(prefix_path, binary_dir)
    loci.rename(columns={"chromosome": "chrom",
                         "physical_position": "pos"}, inplace=True)
    # Variant data
    plink_prefix = f"{basename}/genotypes/TOPMedLIBD"
    _, variant_df = _load_genotypes(plink_prefix)
    variant_df = variant_df.drop_duplicates(subset=["chrom", "pos"],
                                            keep='first')
    # Keep all locations for more accurate imputation
    variant_loci_df = variant_df.merge(loci.to_pandas(), on=["chrom", "pos"],
                                       how="outer", indicator=True)\
                                .loc[:, ["chrom", "pos", "i", "_merge"]]
    data_path = f"{basename}/localancestryrfmix/m"
    z = interpolate_array(variant_loci_df, admix, data_path)
    # Match variant data genomic positions (drop rows found only in loci)
    arr_geno = arr_mod.array(
        variant_loci_df[~(variant_loci_df["_merge"] == "right_only")].index)
    new_admix = z[arr_geno.get(), :]
```
Note: Following imputation, variant_df will include genomic positions for
both local ancestry and genotype data.
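The outer merge with `indicator=True` used in the helper code above can be tried on a toy example. This sketch uses plain pandas with made-up positions; the frames `variants` and `loci` and their values are assumptions for illustration only. Rows flagged `right_only` exist only in the loci table and are dropped when matching back to genotype positions.

```python
import pandas as pd

# Toy genotype variant positions and local ancestry loci (made-up values).
variants = pd.DataFrame({"chrom": ["chr1"] * 4, "pos": [100, 200, 300, 400]})
loci = pd.DataFrame({"chrom": ["chr1"] * 3, "pos": [100, 250, 400],
                     "i": [0, 1, 2]})

# Outer merge keeps every position from both tables; "_merge" records
# whether a row came from the left table, the right table, or both.
merged = (
    variants.merge(loci, on=["chrom", "pos"], how="outer", indicator=True)
    .sort_values(["chrom", "pos"])
    .reset_index(drop=True)
)

# Row indices that correspond to genotype variants.
keep = merged["_merge"] != "right_only"
variant_rows = merged.index[keep].to_numpy()

print(merged)
print(variant_rows)
```

Positions present only in the genotype data have a missing `i`, which is what the interpolation step later fills in.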
Author(s)
Citation
If you use this software in your work, please cite it.
Benjamin, K. J. M. (2024). RFMix-reader (Version v0.1.15) [Computer software]. https://github.com/heart-gen/rfmix_reader
Kynon JM Benjamin. "RFMix-reader: Accelerated reading and processing for local ancestry studies." bioRxiv. 2024. DOI: 10.1101/2024.07.13.603370.
Funding
This work was supported by grants from the National Institutes of Health, National Institute on Minority Health and Health Disparities (NIMHD) K99MD016964 / R00MD016964.
Owner
- Login: heart-gen
- Kind: user
- Repositories: 1
- Profile: https://github.com/heart-gen
Citation (CITATION.cff)
# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!
cff-version: 1.2.0
title: RFMix-reader
message: >-
  If you use this software, please cite it using the
  metadata from this file.
type: software
authors:
  - given-names: Kynon J M
    name-particle:
    family-names: Benjamin
    email: heartgen.lab@gmail.com
    affiliation: Lieber Institute for Brain Development
    orcid: 'https://orcid.org/0000-0003-2016-4646'
identifiers:
  - type: doi
    value: 10.5281/zenodo.12629788
  - type: doi
    value: 10.1101/2024.07.13.603370
repository-code: 'https://github.com/heart-gen/rfmix_reader'
url: 'https://rfmix-reader.readthedocs.io/en/latest/index.html'
abstract: >-
  RFMix-reader is a Python package designed to efficiently
  read and process output files generated by RFMix, a
  popular tool for estimating local ancestry in admixed
  populations. The package employs a lazy loading approach,
  which minimizes memory consumption by reading only the
  loci that are accessed by the user, rather than loading
  the entire dataset into memory at once. Additionally, we
  leverage GPU acceleration to improve computational speed.
keywords:
  - admixed populations
  - local ancestry
  - GPU acceleration
  - RFMix
  - file parser
license: GPL-3.0
commit: 2761c43045572dcdf3f5d24f336fab43d6a7ed5e
version: v0.1.15
date-released: '2024-07-02'
GitHub Events
Total
- Create event: 14
- Release event: 1
- Issues event: 44
- Delete event: 4
- Member event: 1
- Issue comment event: 2
- Push event: 111
- Pull request event: 10
Last Year
- Create event: 14
- Release event: 1
- Issues event: 44
- Delete event: 4
- Member event: 1
- Issue comment event: 2
- Push event: 111
- Pull request event: 10
Issues and Pull Requests
Last synced: 6 months ago
All Time
- Total issues: 21
- Total pull requests: 5
- Average time to close issues: 2 months
- Average time to close pull requests: 1 minute
- Total issue authors: 1
- Total pull request authors: 2
- Average comments per issue: 0.05
- Average comments per pull request: 0.0
- Merged pull requests: 5
- Bot issues: 0
- Bot pull requests: 3
Past Year
- Issues: 21
- Pull requests: 5
- Average time to close issues: 2 months
- Average time to close pull requests: 1 minute
- Issue authors: 1
- Pull request authors: 2
- Average comments per issue: 0.05
- Average comments per pull request: 0.0
- Merged pull requests: 5
- Bot issues: 0
- Bot pull requests: 3
Top Authors
Issue Authors
- KrotosBenjamin (21)
- sapema (1)
Pull Request Authors
- dependabot[bot] (5)
- KrotosBenjamin (2)
Top Labels
Issue Labels
Pull Request Labels
Packages
- Total packages: 1
- Total downloads:
  - pypi: 120 last month
- Total dependent packages: 0
- Total dependent repositories: 0
- Total versions: 39
- Total maintainers: 1
pypi.org: rfmix-reader
- Homepage: https://rfmix-reader.readthedocs.io/en/latest/
- Documentation: https://rfmix-reader.readthedocs.io/
- License: GPL-3.0-or-later
- Latest release: 0.2.0 (published 10 months ago)
Rankings
Maintainers (1)
Dependencies
- alabaster 0.7.16
- babel 2.15.0
- certifi 2024.6.2
- charset-normalizer 3.3.2
- click 8.1.7
- cloudpickle 3.0.0
- colorama 0.4.6
- dask 2024.5.2
- docutils 0.20.1
- exceptiongroup 1.2.1
- fsspec 2024.5.0
- idna 3.7
- imagesize 1.4.1
- importlib-metadata 7.1.0
- iniconfig 2.0.0
- jinja2 3.1.4
- locket 1.0.0
- markupsafe 2.1.5
- numpy 1.26.4
- packaging 24.0
- pandas 2.2.2
- partd 1.4.2
- pluggy 1.5.0
- pygments 2.18.0
- pytest 8.2.1
- python-dateutil 2.9.0.post0
- pytz 2024.1
- pyyaml 6.0.1
- requests 2.32.3
- six 1.16.0
- snowballstemmer 2.2.0
- sphinx 7.3.7
- sphinx-autodoc-typehints 2.1.0
- sphinx-rtd-theme 2.0.0
- sphinxcontrib-applehelp 1.0.8
- sphinxcontrib-devhelp 1.0.6
- sphinxcontrib-htmlhelp 2.0.5
- sphinxcontrib-jquery 4.1
- sphinxcontrib-jsmath 1.0.1
- sphinxcontrib-qthelp 1.0.7
- sphinxcontrib-serializinghtml 1.1.10
- tomli 2.0.1
- toolz 0.12.1
- tqdm 4.66.4
- tzdata 2024.1
- urllib3 2.2.1
- zipp 3.19.1
- pytest ^8.2 develop
- sphinx ^7.3.7 docs
- sphinx-autodoc-typehints ^2.1.0 docs
- sphinx-rtd-theme ^2.0.0 docs
- dask ^2024.5
- numpy ^1.26
- pandas ^2.0
- python ^3.9
- tqdm ^4.66
- sphinx >=7.3.7
- sphinx-autodoc-typehints >=2.1.0
- sphinx-rtd-theme >=2.0.0