snpaimer

Check diagnostic power of SNP combinations

https://github.com/oksanave/snpaimer

Science Score: 13.0%

This score indicates how likely this project is to be science-related based on various indicators:

○ CITATION.cff file
○ codemeta.json file
○ .zenodo.json file
✓ DOI references: Found 3 DOI reference(s) in README
○ Academic publication links
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (13.8%) to scientific vocabulary

Last synced: 10 months ago

Repository

Basic Info

Host: GitHub
Owner: OksanaVe
Language: R
Default Branch: main
Size: 235 KB

Statistics

Stars: 0
Watchers: 2
Forks: 1
Open Issues: 0
Releases: 0

Created over 3 years ago · Last pushed almost 2 years ago

Metadata Files

Readme

README.md

snpAIMeR is available on CRAN, and a companion publication is available here.

This R package assesses the diagnostic power of SNP combinations using leave-one-out style cross-validation. To do so, it uses Discriminant Analysis of Principal Components within the adegenet R package. Its value is in (1) identifying ancestry informative markers (AIMs) and (2) evaluating how well different marker combinations can predict an unknown sample's population of origin.

The user provides candidate markers, SNP genotypes from individuals of known origin, a range of panel sizes, and a threshold value for an acceptable rate of correct sample identification.

snpAIMeR tests every marker combination within the specified minimum and maximum panel sizes. For each cross-validation replicate, individuals are randomly divided with 75% for the DAPC and 25% withheld as test samples. Results from the DAPC are used to predict the population of origin for each test individual, which is then compared with the known population label from the input file.
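The replicate procedure described above can be sketched in base R. This is a minimal illustration, not the package's implementation: `split_individuals()` and `nearest_centroid()` are hypothetical helpers, the genotypes are simulated, and a nearest-centroid rule stands in for DAPC so that the example runs without adegenet.

```r
# Hypothetical helper: randomly assign 75% of individuals to training
# and withhold the remaining 25% as test samples.
split_individuals <- function(n, train_frac = 0.75) {
  train <- sort(sample.int(n, size = floor(train_frac * n)))
  list(train = train, test = setdiff(seq_len(n), train))
}

# Stand-in classifier (NOT DAPC): assign each test individual to the
# population whose mean genotype vector is closest.
nearest_centroid <- function(geno, pop, idx) {
  train_pop <- pop[idx$train]
  pops <- sort(unique(train_pop))
  # One row of mean allele counts per population
  centroids <- do.call(rbind, lapply(pops, function(p) {
    colMeans(geno[idx$train[train_pop == p], , drop = FALSE])
  }))
  vapply(idx$test, function(i) {
    d <- colSums((t(centroids) - geno[i, ])^2)  # squared distance to each centroid
    pops[which.min(d)]
  }, character(1))
}

# Simulated data (assumption): 176 individuals in two groups, 5 biallelic SNPs
# coded as allele counts 0/1/2 with different allele frequencies per group.
set.seed(1)
pop  <- rep(c("group1", "group2"), times = c(120, 56))
freq <- ifelse(pop == "group1", 0.1, 0.9)
geno <- matrix(rbinom(length(pop) * 5, size = 2, prob = rep(freq, 5)), ncol = 5)

# 100 cross-validation replicates: split, predict, compare with known labels.
rates <- replicate(100, {
  idx  <- split_individuals(nrow(geno))
  pred <- nearest_centroid(geno, pop, idx)
  mean(pred == pop[idx$test])
})
mean(rates)  # mean correct assignment rate for this marker panel
```

In the package itself, the classification step is a DAPC fit on the training individuals followed by prediction for the withheld ones; only the split-and-compare logic is mirrored here.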

Because the number of possible combinations grows rapidly with the number of markers, we recommend testing no more than 15 markers. For example, testing 15 markers in panel sizes of 1 to 15 (32,767 total combinations) with 1,000 cross-validation replicates on a system with 48 processor cores took about 5 hours and 20 GB of RAM. To mitigate run time, snpAIMeR automatically uses all but one of the available processor cores (n - 1). Reducing the number of cross-validation replicates also reduces run time; however, we recommend no fewer than 100 replicates.
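The combination count quoted above can be checked with a few lines of base R. The function name `count_panels()` is hypothetical and the worker calculation is only a sketch of the "all but one core" rule, not the package's own code.

```r
# Number of marker panels when every combination of sizes
# min_size..max_size is drawn from n_markers candidates.
count_panels <- function(n_markers, min_size, max_size) {
  sum(choose(n_markers, min_size:max_size))
}

count_panels(15, 1, 15)          # 32767, i.e. 2^15 - 1
count_panels(15, 1, 15) * 1000   # model fits needed at 1,000 replicates per panel

# Worker count under the "all but one core" rule;
# detectCores() can return NA, so fall back to a single worker.
n_cores   <- parallel::detectCores()
n_workers <- if (is.na(n_cores)) 1L else max(1L, n_cores - 1L)
```

Running `count_panels()` for a planned marker set before starting is a quick way to judge whether a run is feasible.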

Requirements

.stru (STRUCTURE) formatted genotype file. Individuals must have population assignments. For file conversion from other formats (plink, vcf, etc.), please see PGDSpider.

Usage

snpAIMeR(run_mode, config_file = NULL, verbose = TRUE)

Run interactively (user-friendly)

```
snpAIMeR("interactive")
```

Upon executing the function, the user is prompted with the following (do not quote paths):

```
Enter path to working directory:
Enter path to STRUCTURE file:
```

Then, the user is prompted (by adegenet) for information about the SNP genotype file:

```
How many genotypes are there?
How many markers are there?
Which column contains labels for genotypes ('0' if absent)?
Which column contains the population factor ('0' if absent)?
Which other optional columns should be read (press 'return' when done)?
Which row contains the marker names ('0' if absent)?
Are genotypes coded by a single row (y/n)?
```

Finally, after a few messages about the data (again from adegenet), the user is prompted for the following (we recommend no less than 100 cross-validation replicates):

```
Minimum number of markers in combination:
Maximum number of markers in combination:
Assignment rate threshold (minimum rate of successful assignments):
Number of cross-validation replicates:
```

Run without interaction

```
snpAIMeR("non-interactive", "configfile")
```

Non-interactive mode requires a config file in YAML format. Example [here](https://github.com/OksanaVe/snpAIMeR/blob/main/snpAIMeR_config.yml)

```
minrange: 1                          # Minimum combination size
maxrange: 5                          # Maximum combination size
assignmentratethreshold: 0.9         # Value from 0 to 1
crossvalidationreplicates: 100       # We recommend no less than 100 replicates
workingdirectory: "./"               # Path name in quotes; use "./" for current directory

structurefile: "snpAIMeRtoydataset176inds5SNPs.str"   # Path name in quotes
numberofindividuals: 176             # Same as adegenet's "n.ind"
numberofloci: 5                      # Same as adegenet's "n.loc"
onedatarowperindividual: FALSE       # TRUE or FALSE
columnsampleIDs: 1                   # Column number with individual sample names
columnpopulationassignments: 2       # Column number with individual population of origin
columnotherinfo:                     # Column number
rowmarkernames: 1                    # Row number with marker names
nogenotypecharacter: -9              # Default is "-9"
optionalpopulationinfo:              # Optional
genotypecharacterseparator:          # Optional
```
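Before launching a long non-interactive run, it can help to sanity-check the numeric settings. The sketch below is an assumption-laden illustration: `check_config()` is a hypothetical helper that is not part of snpAIMeR, the field names are copied as they appear in the example config, and the rule that the maximum panel size cannot exceed the number of loci is inferred from the description rather than taken from the package.

```r
# Hypothetical pre-flight check on config values held in a named list.
# Returns a character vector of problems (empty if none were found).
check_config <- function(cfg) {
  problems <- character(0)
  if (cfg$minrange < 1 || cfg$minrange > cfg$maxrange)
    problems <- c(problems, "minrange must be >= 1 and <= maxrange")
  if (cfg$maxrange > cfg$numberofloci)
    problems <- c(problems, "maxrange cannot exceed numberofloci")
  if (cfg$assignmentratethreshold < 0 || cfg$assignmentratethreshold > 1)
    problems <- c(problems, "assignmentratethreshold must be between 0 and 1")
  if (cfg$crossvalidationreplicates < 100)
    problems <- c(problems, "fewer than 100 cross-validation replicates (not recommended)")
  problems
}

# Values taken from the example config for the toy dataset.
cfg <- list(minrange = 1, maxrange = 5, assignmentratethreshold = 0.9,
            crossvalidationreplicates = 100, numberofloci = 5)
check_config(cfg)  # no problems reported for these values
```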

Output

For each panel size, once all combinations have been evaluated, the replicate cross-validation data for the last combination is displayed as a histogram.

* "Allcombinationsassignrate.csv" has the mean correct assignment rate for each combination tested (average of all cross-validation replicates)
* "Panelsizeassignrate.csv" has the mean correct assignment rate for each panel size tested (average of all combinations)
* "Abovethresholdassign_rate.csv" lists the combinations with a mean correct assignment rate above the user-specified threshold.

"Singlemarkerassignmentrate.pdf" is each candidate marker's individual assignment rate. This is the same as running minrange=1, maxrange=1.
<img src="https://github.com/OksanaVe/SNPcheck/assets/131922755/0526f289-f2c7-45ff-95f0-0214b5d4a328" align="left" width="25%" height="25%" />

"Panelsizeassignrate.pdf" is a visualization of "Panelsizeassignrate.csv"

Toy dataset

A toy dataset of 5 SNPs and 176 individuals is provided here. The example YAML file is already set up for this dataset. For interactive mode, use the following prompt responses.

```
snpAIMeR("interactive")
Enter path to working directory: ./
Enter path to STRUCTURE file: snpAIMeRtoydataset176inds5SNPs.str

How many genotypes are there? 176
How many markers are there? 5
Which column contains labels for genotypes ('0' if absent)? 1
Which column contains the population factor ('0' if absent)? 2
Which other optional columns should be read (press 'return' when done)? 1:
Which row contains the marker names ('0' if absent)? 1
Are genotypes coded by a single row (y/n)? n

Converting data from a STRUCTURE .stru file to a genind object...

Data file contains 5 markers
File contains the following group definitions:
group1 group2
   120     56

Minimum number of markers in combination: 1
Maximum number of markers in combination: 5
Assignment rate threshold (minimum rate of successful assignments): 0.9
Number of cross-validation replicates: 100
```

Owner

Name: Oksana Vernygora
Login: OksanaVe
Kind: user

Repositories: 2
Profile: https://github.com/OksanaVe

Packages

Total packages: 1
Total downloads:
- cran 212 last-month

Total dependent packages: 0
Total dependent repositories: 0
Total versions: 2
Total maintainers: 1

cran.r-project.org: snpAIMeR

Assess the Diagnostic Power of Genomic Marker Combinations

Homepage: https://github.com/OksanaVe/snpAIMeR
Documentation: http://cran.r-project.org/web/packages/snpAIMeR/snpAIMeR.pdf
License: MIT + file LICENSE
Latest release: 2.1.1
published over 2 years ago

Versions: 2
Dependent Packages: 0
Dependent Repositories: 0
Downloads: 212 Last month

Rankings

Dependent packages count: 28.6%

Dependent repos count: 36.7%

Average: 50.4%

Downloads: 85.8%

Maintainers (1)

kim.vertacnik@mailbox.org

Last synced: 11 months ago
