https://github.com/alexslemonade/identifier-refinery

Tools and assets for easy gene identifier conversion

Basic Info
Statistics
  • Stars: 2
  • Watchers: 6
  • Forks: 1
  • Open Issues: 0
  • Releases: 0
Created almost 8 years ago · Last pushed over 7 years ago

![](https://i.imgur.com/GphUr2m.png)
# identifier-refinery [![](https://zenodo.org/badge/DOI/10.5281/zenodo.1322711.svg)](https://zenodo.org/record/1322711)

Tools and assets for easy and reproducible gene identifier conversion.

## Methods

This repository is used to build matrices which can convert between different gene identifiers.
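To illustrate what "converting" with such a matrix might look like, here is a minimal Python sketch. The column names (`PROBEID`, `ENSEMBL`, `SYMBOL`) and every value in the example table are assumptions made for illustration; the actual header and contents of the released TSV files may differ.

```python
import csv
import io

# Hypothetical excerpt of a conversion TSV; real column names and values may differ.
EXAMPLE_TSV = (
    "PROBEID\tENSEMBL\tSYMBOL\n"
    "probe_a\tGENE_ID_1\tSYM1\n"
    "probe_b\tGENE_ID_2\tSYM2\n"
    "probe_c\tGENE_ID_2\tSYM2\n"
)


def build_lookup(tsv_file, source, target):
    """Map each value in the `source` column to the set of values in `target`."""
    lookup = {}
    for row in csv.DictReader(tsv_file, delimiter="\t"):
        # Skip rows where either identifier is missing.
        if row[source] and row[target]:
            lookup.setdefault(row[source], set()).add(row[target])
    return lookup


probe_to_gene = build_lookup(io.StringIO(EXAMPLE_TSV), "PROBEID", "ENSEMBL")
gene_to_probes = build_lookup(io.StringIO(EXAMPLE_TSV), "ENSEMBL", "PROBEID")

print(probe_to_gene["probe_a"])
print(sorted(gene_to_probes["GENE_ID_2"]))
```

Values are kept as sets because identifier mappings are often one-to-many: several probes can map to the same gene, and the reverse can also happen.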

These conversion matrices are built by:

 * Randomly choosing raw CEL files from NCBI GEO for a given platform accession code (in `/cels`)
 * Reading the CEL header and joining Brainarray (e.g., `hgu133plus2hsensgprobe`) and Bioconductor (e.g., `hgu133plus2.db`) (x, y) coordinates
 * Finding intersecting probe identifiers
 * Extracting supported identifiers and probe IDs from the Bioconductor package
 * Filtering on probe IDs and Ensembl Gene IDs in Brainarray
 * Writing the output to a conversion TSV file
 * Checking that all output conversion TSV files share the same SHA1
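The coordinate join in the steps above can be pictured with a small Python sketch. It is not the code in `R/gene_convert.R`; it assumes each annotation source can be reduced to a mapping from an (x, y) position on the array to a probe identifier, and all identifiers shown are made up.

```python
# Toy (x, y) -> probe identifier maps; every value here is invented for illustration.
brainarray_xy = {(0, 1): "ba_1", (2, 3): "ba_2", (4, 5): "ba_3"}
bioconductor_xy = {(0, 1): "bc_1", (2, 3): "bc_2", (6, 7): "bc_4"}


def join_on_coordinates(left, right):
    """Inner join of two coordinate maps: keep only positions present in both."""
    shared = left.keys() & right.keys()
    return {xy: (left[xy], right[xy]) for xy in sorted(shared)}


joined = join_on_coordinates(brainarray_xy, bioconductor_xy)
for xy, (brainarray_id, bioconductor_id) in joined.items():
    print(xy, brainarray_id, bioconductor_id)
```

Positions that appear in only one source, such as `(4, 5)` and `(6, 7)` here, drop out of the result, which is what restricts the later steps to intersecting probes.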

## Repository Contents

### Source Files

The `cels` directory contains raw CEL files taken from GEO. The list of supported platforms is in `supported_microarray_platforms.csv`. Source files can be acquired by running the `acquire_cels.py` script.

### Docker Image

The conversion scripts are run on custom Docker images. 

Two Dockerfiles are provided in this repository: the `base` image, which installs the required R dependencies, and the `pd` image, which builds the required databases for a given platform.

### Conversion Scripts

A `build_and_convert.py` script is provided, which builds a unique Docker image for each package, mounts the downloaded CEL files as a volume, runs the gene conversion script `R/gene_convert.R` inside the image, and outputs the master conversion matrix. Output TSV files live in `cels/out/`.
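As a rough sketch of the "mount the CEL files and run the script" step, the snippet below assembles a `docker run` command line without executing it. The image name `example/pd-image`, the `/cels` mount point, and the way the R script is invoked are assumptions; `build_and_convert.py` may pass different arguments.

```python
import shlex
from pathlib import Path


def docker_run_command(image, cel_dir, script="R/gene_convert.R", mount_point="/cels"):
    """Build (but do not run) a `docker run` argv list with a bind-mounted volume."""
    host_dir = Path(cel_dir).resolve()  # Docker bind mounts need an absolute host path
    return [
        "docker", "run", "--rm",
        "--volume", f"{host_dir}:{mount_point}",
        image,
        "Rscript", script,
    ]


# "example/pd-image" is a hypothetical image name used only for this sketch.
cmd = docker_run_command("example/pd-image", "cels")
print(shlex.join(cmd))
```

Keeping the command as a list rather than a single string means it could later be handed to `subprocess.run(cmd, check=True)` without any shell quoting concerns.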

## Reproducing

The entire process can be reproduced by running the following script from a fresh checkout of this repository. It will take some time:

```
$ ./generate_matricies_from_scratch.sh
```

You can also build only a specific platform, e.g.:

```
$ ./generate_matricies_from_scratch.sh celegans
```
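After a run, the SHA1 check described under Methods could be repeated by hand. The following Python sketch groups files by their SHA1 digest; it uses temporary stand-in files so that it is self-contained, and how the repository itself performs this check is not shown here.

```python
import hashlib
import tempfile
from pathlib import Path


def sha1_of(path):
    """Return the hex SHA1 digest of a file, read in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def group_by_sha1(paths):
    """Group file names by digest; a single group means all files are identical."""
    groups = {}
    for path in paths:
        groups.setdefault(sha1_of(path), []).append(Path(path).name)
    return groups


# Stand-in files for illustration; real outputs would be read from `cels/out/`.
with tempfile.TemporaryDirectory() as tmp:
    for name, body in [("a.tsv", "x\ty\n"), ("b.tsv", "x\ty\n"), ("c.tsv", "x\tz\n")]:
        Path(tmp, name).write_text(body)
    groups = group_by_sha1(sorted(Path(tmp).glob("*.tsv")))

print(len(groups))
```

In this toy case `c.tsv` differs from the other two, so two groups are reported; outputs that agree would collapse into one.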

## Identifiers

Released assets in this repository are available under the DOI `10.5281/zenodo.1322711`, which can be viewed [on Zenodo](https://zenodo.org/record/1322711). This accession is up to date as of commit https://github.com/AlexsLemonade/identifier-refinery/commit/cace2849baf2666f21ec32f5eee6208d6ec19294.

## Related Projects

 * [AlexsLemonade/refinebio](https://github.com/AlexsLemonade/refinebio)

## Copyright

`identifier-refinery` output assets are released under a [CC0 1.0 Universal](https://creativecommons.org/publicdomain/zero/1.0/legalcode) license. All code is released under the BSD 3-clause license. Input assets are property of the original providers to NCBI GEO, but may be [freely downloaded and redistributed](https://www.ncbi.nlm.nih.gov/geo/info/disclaimer.html) unless otherwise noted.

Owner

  • Name: Alex's Lemonade Stand Foundation
  • Login: AlexsLemonade
  • Kind: organization

Childhood Cancer Data Lab of ALSF

Issues and Pull Requests

Last synced: 10 months ago

All Time
  • Total issues: 6
  • Total pull requests: 4
  • Average time to close issues: 14 days
  • Average time to close pull requests: 5 days
  • Total issue authors: 3
  • Total pull request authors: 1
  • Average comments per issue: 2.33
  • Average comments per pull request: 0.75
  • Merged pull requests: 4
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • Miserlou (3)
  • jaclyn-taroni (2)
  • kurtwheeler (1)
Pull Request Authors
  • jaclyn-taroni (4)
Dependencies

requirements.txt pypi
  • GEOparse *
  • pathlib2 *