kgp

1000 Genomes Project Metadata R Package

https://github.com/stephenturner/kgp

Science Score: 54.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: arxiv.org, pubmed.ncbi, ncbi.nlm.nih.gov, nature.com
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (13.2%) to scientific vocabulary

Keywords

1000genomes bioinformatics genetics genomics metadata population-genetics sequencing

Repository

Basic Info
Statistics
  • Stars: 20
  • Watchers: 2
  • Forks: 4
  • Open Issues: 5
  • Releases: 2
Created over 3 years ago · Last pushed about 3 years ago
Metadata Files
Readme Changelog License Citation

README.Rmd

---
output: github_document
---



```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%"
)
```

# kgp


[![CRAN status](https://www.r-pkg.org/badges/version/kgp)](https://CRAN.R-project.org/package=kgp)
[![Lifecycle: stable](https://img.shields.io/badge/lifecycle-stable-brightgreen.svg)](https://lifecycle.r-lib.org/articles/stages.html#stable)
[![arXiv](https://img.shields.io/badge/arXiv-2210.00539-b31b1b.svg)](https://arxiv.org/abs/2210.00539)


This kgp data package provides metadata about populations and data about samples from the 1000 Genomes Project, including the 2,504 samples sequenced for the Phase 3 release and the expanded collection of 3,202 samples with 602 additional trios.

## Installation

You can install the released version of kgp from [CRAN](https://CRAN.R-project.org/package=kgp) with:

```r
install.packages("kgp")
```

You can install the development version of kgp from [GitHub](https://github.com/stephenturner/kgp) with:

```r
# install.packages("devtools")
devtools::install_github("stephenturner/kgp")
```

## About the data

The 1000 Genomes Project Phase 3 release contains 2,504 samples with sequence data available. The collection was later expanded to 3,202 samples sequenced at high coverage, a set that includes 602 trios. Data is available through the [1000 Genomes FTP site](http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/) and [GitHub](https://github.com/igsr/1000Genomes_data_indexes/).

- Pilot publication: [A map of human genome variation from population-scale sequencing](https://www.nature.com/articles/nature09534)
- Phase 1 publication: [An integrated map of genetic variation from 1,092 human genomes](https://www.nature.com/articles/nature11632)
- Phase 3 publication: [A global reference for human genetic variation](https://www.nature.com/articles/nature15393)
- Expanded high-coverage publication: [High-coverage whole-genome sequencing of the expanded 1000 Genomes Project cohort including 602 trios](https://pubmed.ncbi.nlm.nih.gov/36055201/)

There are three data sets available in the kgp package.

```{r example}
library(kgp)
data(kgp)
```

The `kgp3` data contains pedigree and population information for the 2,504 samples included in the Phase 3 release of the 1000 Genomes Project data.

```{r}
kgp3
```

The `kgpe` data contains pedigree and population information for all 3,202 samples included in the expanded 1000 Genomes Project data, which includes 602 trios.

```{r}
kgpe
```

The `kgpmeta` data contains population metadata for the 26 populations across five continental regions.

```{r}
kgpmeta
```
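
As a quick sanity check on the counts quoted above, the sketch below tallies the rows of `kgpmeta` by region code. It assumes that `kgpmeta` holds one row per population, and it relies on the `reg` column that the map example in this README also uses.

```r
library(dplyr)
library(kgp)

# Assumption: kgpmeta has one row per population.
n_populations <- nrow(kgpmeta)

# Number of populations within each continental region code (`reg`).
pops_per_region <- kgpmeta %>% count(reg, name = "n_populations")

n_populations
pops_per_region
```

If the one-row-per-population assumption holds, the per-region counts should sum to the total number of populations.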

## Examples

```{r, message=FALSE, warning=FALSE}
library(dplyr)
library(ggplot2)
library(kgp)
data(kgp)
```

Count the number of samples in each region, or in each population: 

```{r}
kgp3 %>% 
  count(region) %>% 
  knitr::kable()
```

```{r}
kgp3 %>% 
  count(region, population) %>% 
  knitr::kable()
```

```{r kgp3barplot, fig.width=9, fig.height=12}
kgp3 %>% 
  count(region, population) %>% 
  arrange(region, n) %>% 
  mutate(population=forcats::fct_inorder(population)) %>% 
  ggplot(aes(population, n)) + 
  geom_col(aes(fill=region)) + 
  labs(fill=NULL, x=NULL, y="N") + 
  coord_flip() + 
  theme_bw() + 
  theme(legend.position="bottom")
```

The latitude and longitude coordinates in `kgpmeta` can be used to plot a map of the locations of the 1000 Genomes populations. There is also a column for region color, which provides a hexadecimal color code to enable reproduction of the population data map as shown on the IGSR population data page. The figure below shows a static map produced using ggplot2, but interactive maps such as that shown on the IGSR population data portal can be created with the leaflet package.

```{r kgpmap, fig.cap="Map showing locations of the 1000 Genomes Phase 3 populations.", fig.width=8, fig.height=6}
pal <- kgpmeta %>% distinct(reg, regcolor) %>% tibble::deframe()
ggplot() + 
  geom_polygon(data=map_data("world"), 
               aes(long, lat, group=group), 
               col="gray30", fill="gray95", lwd=.2, alpha=.5) + 
  geom_point(data=kgpmeta, aes(lng, lat, col=reg), size=4) + 
  scale_colour_manual(values=pal) +
  theme_minimal() + 
  theme(axis.ticks = element_blank(), 
        axis.text = element_blank(), 
        axis.title = element_blank(), 
        legend.title = element_blank(),
        panel.grid = element_blank(),
        legend.position = "bottom")
```
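
An interactive counterpart to the static map can be sketched with leaflet. This is a minimal illustration rather than a reproduction of the IGSR portal map: it uses only the `lng`, `lat`, `reg`, and `regcolor` columns of `kgpmeta` seen in the chunk above, and the marker size and opacity are arbitrary choices.

```r
library(leaflet)
library(kgp)

# One circle marker per population, colored with the hex codes in `regcolor`.
m <- leaflet(kgpmeta) %>%
  addTiles() %>%                      # default OpenStreetMap base layer
  addCircleMarkers(
    lng = ~lng, lat = ~lat,
    color = ~regcolor,                # per-row hex color (fill defaults to this)
    label = ~as.character(reg),       # hover label: region code
    radius = 6, stroke = FALSE, fillOpacity = 0.8
  )

m
```

Printing `m` in an interactive session or an HTML document renders the widget. It will not render in a `github_document` output, which is presumably why the README itself shows a static map.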

The table below shows a selection of samples from `kgpe` along with each sample's pedigree information. This information can be used in downstream analysis to filter out related individuals, select only trios, or visualize family structure.

```{r kgpe}
kgpe %>% 
  filter(pid!="0" & mid!="0") %>% 
  group_by(pop) %>% 
  slice(1) %>% 
  head(12) %>% 
  arrange(reg, pop) %>% 
  select(fid:reg) %>% 
  select(-sexf) %>% 
  knitr::kable()
```
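
The filtering ideas mentioned above might look like the following sketch. It assumes, as the chunk above does, that a missing parent is coded `"0"` in `pid` and `mid`; the object names are illustrative and not part of the package.

```r
library(dplyr)
library(kgp)

# Samples with both a father ID and a mother ID recorded.
children <- kgpe %>% filter(pid != "0" & mid != "0")

# Trios: children whose father and mother are both present as samples in kgpe.
trios <- children %>% filter(pid %in% kgpe$id, mid %in% kgpe$id)

# Samples with no recorded parent. This drops the child from each
# parent-offspring pair but does not detect other kinds of relatedness.
no_recorded_parents <- kgpe %>% filter(pid == "0" & mid == "0")
```

Keeping only samples without recorded parents is a coarse filter. Relationships that the pedigree columns do not capture would need a kinship-based check.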

The figure below shows an example of a pedigree plot made by parsing the pedigree information using [skater](https://cran.r-project.org/package=skater) and plotting using [kinship2](https://cran.r-project.org/package=kinship2). The skater package provides documentation, examples, and a vignette demonstrating how to iteratively plot all pedigrees in a given data set.

```{r pedplot, fig.height=5, fig.width=8, fig.cap="Trios in 1000 Genomes Project family 13291."}
kgpe %>% 
  filter(fid=="13291") %>% 
  transmute(fid, id, dadid=pid, momid=mid, sex, affected=1) %>% 
  skater::fam2ped() %>% 
  pull(ped) %>% 
  purrr::pluck(1) %>% 
  kinship2::plot.pedigree(mar=c(4,2,4,2), cex=.8)
```

Owner

  • Name: Stephen Turner
  • Login: stephenturner
  • Kind: user
  • Location: Charlottesville, VA
  • Company: @colossal-compsci

Data scientist in biotech, former academic, Principal Scientist and Head of Genomic Strategy at Colossal Biosciences

Citation (CITATION.cff)

# -----------------------------------------------------------
# CITATION file created with {cffr} R package, v0.2.3
# See also: https://docs.ropensci.org/cffr/
# -----------------------------------------------------------
 
cff-version: 1.2.0
message: 'To cite package "kgp" in publications use:'
type: software
license: Apache-2.0
title: 'kgp: 1000 Genomes Project Metadata'
version: 1.0.0
abstract: Metadata about populations and data about samples from the 1000 Genomes
  Project, including the 2,504 samples sequenced for the Phase 3 release and the expanded
  collection of 3,202 samples with 602 additional trios. The data is described in
  Auton et al. (2015) <doi:10.1038/nature15393> and Byrska-Bishop et al. (2022) <doi:10.1016/j.cell.2022.08.004>,
  and raw data is available at <http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/>.
authors:
- family-names: Turner
  given-names: Stephen
  email: vustephen@gmail.com
  orcid: https://orcid.org/0000-0001-9140-9028
preferred-citation:
  type: manual
  title: 'kgp: 1000 Genomes Project Metadata'
  authors:
  - family-names: Turner
    given-names: Stephen
    email: vustephen@gmail.com
    orcid: https://orcid.org/0000-0001-9140-9028
  version: 1.0.0
  abstract: Metadata about populations and data about samples from the 1000 Genomes
    Project, including the 2,504 samples sequenced for the Phase 3 release and the
    expanded collection of 3,202 samples with 602 additional trios. The data is described
    in Auton et al. (2015) <doi:10.1038/nature15393> and Byrska-Bishop et al. (2022)
    <doi:10.1016/j.cell.2022.08.004>, and raw data is available at <http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/>.
  repository-code: https://github.com/stephenturner/kgp
  url: https://stephenturner.github.io/kgp/
  contact:
  - family-names: Turner
    given-names: Stephen
    email: vustephen@gmail.com
    orcid: https://orcid.org/0000-0001-9140-9028
  keywords:
  - 1000genomes
  - bioinformatics
  - genetics
  - genomics
  - metadata
  - population-genetics
  - sequencing
  license: Apache-2.0
  year: '2022'
repository-code: https://github.com/stephenturner/kgp
url: https://stephenturner.github.io/kgp/
contact:
- family-names: Turner
  given-names: Stephen
  email: vustephen@gmail.com
  orcid: https://orcid.org/0000-0001-9140-9028
keywords:
- 1000genomes
- bioinformatics
- genetics
- genomics
- metadata
- population-genetics
- sequencing

GitHub Events

Total
  • Watch event: 1
Last Year
  • Watch event: 1

Committers

Last synced: about 1 year ago

All Time
  • Total Commits: 30
  • Total Committers: 1
  • Avg Commits per committer: 30.0
  • Development Distribution Score (DDS): 0.0
Past Year
  • Commits: 0
  • Committers: 0
  • Avg Commits per committer: 0.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
Stephen Turner v****n@g****m 30

Issues and Pull Requests

Last synced: 5 months ago

All Time
  • Total issues: 7
  • Total pull requests: 2
  • Average time to close issues: about 15 hours
  • Average time to close pull requests: 3 minutes
  • Total issue authors: 3
  • Total pull request authors: 1
  • Average comments per issue: 0.43
  • Average comments per pull request: 0.0
  • Merged pull requests: 2
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • stephenturner (5)
  • pamonlan (1)
  • carolhuaxia (1)
Pull Request Authors
  • stephenturner (2)

Packages

  • Total packages: 1
  • Total downloads:
    • cran 222 last-month
  • Total dependent packages: 0
  • Total dependent repositories: 0
  • Total versions: 3
  • Total maintainers: 1
cran.r-project.org: kgp

1000 Genomes Project Metadata

  • Versions: 3
  • Dependent Packages: 0
  • Dependent Repositories: 0
  • Downloads: 222 Last month
Rankings
Stargazers count: 14.6%
Forks count: 14.9%
Dependent packages count: 29.8%
Average: 32.6%
Dependent repos count: 35.5%
Downloads: 68.4%
Maintainers (1)
Last synced: 5 months ago

Dependencies

DESCRIPTION cran
  • R >= 2.10 depends
  • tibble * suggests
.github/workflows/pkgdown.yaml actions
  • JamesIves/github-pages-deploy-action 4.1.4 composite
  • actions/checkout v2 composite
  • r-lib/actions/setup-pandoc v2 composite
  • r-lib/actions/setup-r v2 composite
  • r-lib/actions/setup-r-dependencies v2 composite