agetl

Functions to process data files from agricultural and plant science experiments and aggregate them into a standard database table in a central repository, making the data available for a variety of analyses.

https://github.com/ds4ag/agetl

Science Score: 67.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 2 DOI reference(s) in README
  • Academic publication links
    Links to: frontiersin.org, zenodo.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (13.5%) to scientific vocabulary
Last synced: 6 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: DS4Ag
  • License: gpl-3.0
  • Language: Python
  • Default Branch: main
  • Size: 181 KB
Statistics
  • Stars: 2
  • Watchers: 1
  • Forks: 0
  • Open Issues: 0
  • Releases: 3
Created over 2 years ago · Last pushed almost 2 years ago
Metadata Files
Readme License Citation

README.md

Note: to open links in a new tab, use CTRL+click (Windows and Linux) or CMD+click (macOS).

What is AgETL?

The Agricultural Data Extract, Transform, and Load (AgETL) framework is a set of functions written in Python that let you process data files from different agricultural and plant science experiments and aggregate them into a standard database table in a central repository, making the data available for a variety of analyses.

The functions are split across two notebook files, each with its own configuration file.

  • Extraction and Transformation processes

    Runs the Extraction and Transformation processes and produces a CSV file in which the data from the different source files are aggregated and standardized into a single format.

    Notebook file: extract-transform.ipynb

    Configuration file: config_extract-transform.yml

  • Load processes

    Loads the data into a single table in a data warehouse.

    Notebook file: load.ipynb

    Configuration file: config_load.yml

If you are working on plant phenotyping experiments, we encourage you to follow the MIAPPE standards (https://www.miappe.org/) for creating your database tables.

How to run AgETL?

  • Option 1

    • Install either JupyterLab or Jupyter Notebook. You can install them directly or through an environment manager such as conda, mamba, or pipenv.
  • Option 2

Prerequisites

  • Option 1
    • Using Requirements File

        pip install -r requirements.txt

  • Option 2
    • Install the required libraries using the pip package installer for Python.

[PyYAML](https://pypi.org/project/PyYAML/)
```sh
    pip install pyyaml

```
[Pandas](https://pypi.org/project/pandas/)
```sh
    pip install pandas

```    
[psycopg2](https://pypi.org/project/psycopg2/)
```sh
    pip install psycopg2 

```    
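After installing, it can be useful to confirm that the three libraries listed above are importable from the Python environment that Jupyter uses. The following is a minimal sketch that uses only the standard library; the helper name `missing_packages` is hypothetical and not part of AgETL.

```python
import importlib.util

# Package name -> module name used in "import" statements.
# Note that PyYAML is imported as "yaml", not "pyyaml".
REQUIRED_MODULES = {"PyYAML": "yaml", "pandas": "pandas", "psycopg2": "psycopg2"}


def missing_packages(required=REQUIRED_MODULES):
    """Return the package names whose import module cannot be found."""
    return [
        package
        for package, module in required.items()
        if importlib.util.find_spec(module) is None
    ]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are installed.")
```

Running this in a notebook cell or terminal lists any package that still needs to be installed with pip.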

Clone or download AgETL from the GitHub repository

  • Clone option

    1. Open a new Jupyter Notebook terminal:

       New > Terminal

    2. Clone the GitHub repository:

```sh
    git clone https://github.com/Purdue-LuisVargas/agETL.git

```
  • Download option
    1. [Download](https://docs.github.com/en/repositories/creating-and-managing-repositories/cloning-a-repository) **AgETL** from the **GitHub** repository: [https://github.com/Purdue-LuisVargas/agETL](https://github.com/Purdue-LuisVargas/agETL).
    2. Unzip the entire folder, then copy the files into your Jupyter Notebook directory (if running Jupyter locally) or upload them to it (if using a JupyterHub environment).

Which files should I run?

To run the functions in AgETL, open the files in Jupyter Notebook. First modify the configuration file (.yml), then run the Python functions in the notebook (.ipynb). The process is divided into two tasks, as indicated below:

Raw data files (input) --> Extraction and transformation --> standardized dataframe (output) --> Load
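To illustrate the first half of this flow, the following is a minimal sketch of an extract and transform step written with pandas. It is not AgETL's implementation: the function name `extract_transform`, the column mapping, and the sample data are all hypothetical, and in AgETL the equivalent settings would come from config_extract-transform.yml.

```python
import io

import pandas as pd

# Hypothetical mapping from source-specific column names to standard names.
COLUMN_MAP = {
    "Plot": "plot_id",
    "plot": "plot_id",
    "Height_cm": "plant_height_cm",
    "height": "plant_height_cm",
}


def extract_transform(sources, column_map=COLUMN_MAP):
    """Read each CSV source, rename its columns to standard names, and stack the rows."""
    frames = []
    for name, source in sources.items():
        frame = pd.read_csv(source)                # Extract
        frame = frame.rename(columns=column_map)   # Transform
        frame["source_file"] = name                # record where each row came from
        frames.append(frame)
    return pd.concat(frames, ignore_index=True)


# Two in-memory "files" with different headers stand in for raw data files.
raw_files = {
    "trial_a.csv": io.StringIO("Plot,Height_cm\n101,85.0\n102,90.5\n"),
    "trial_b.csv": io.StringIO("plot,height\n201,78.2\n"),
}

standardized = extract_transform(raw_files)
# With no path argument, to_csv returns the CSV content as a string.
csv_text = standardized.to_csv(index=False)
print(csv_text)
```

The resulting data frame has one set of column names regardless of how each source file labeled them, which is the property the Load step relies on.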

  • Extraction and Transformation: The first set of functions runs the Extract and Transform processes. It outputs a CSV file where the data from different source files have been aggregated and standardized into a single format.

    You need the following files:
    
        extract-transform.ipynb
    
        config_extract-transform.yml
    
  • Loading: The second group of functions is used to load data into a single table in the database.

    You need the following files:
    
        load.ipynb
    
        config_load.yml
    

    To connect to the database, update the following information in the configuration file (config_load.yml), as in these examples:

    • Localhost database:

    ```yaml
    DATABASE_CREDENTIALS:
      Host: localhost
      Dbname: wanglab
      user: postgres
      port: 5432
      password: **************WAdxm1
    ```

    • Cloud server database:

    ```yaml
    DATABASE_CREDENTIALS:
      Host: containers-us-west-187.railway.app
      Dbname: railway
      user: postgres
      port: 7895
      password: **************WAdxm1
    ```
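As a rough illustration of how credentials like these could be used, the sketch below parses a DATABASE_CREDENTIALS section with PyYAML and maps it to keyword arguments for `psycopg2.connect`. This is an assumption about usage, not AgETL's own code: the function names `connection_kwargs` and `load_csv` are hypothetical, the password is a placeholder, and the nesting of the keys under DATABASE_CREDENTIALS is inferred from the examples.

```python
import yaml  # provided by the PyYAML package

# Example configuration text; in practice it would be read from config_load.yml.
# The password below is a placeholder, not a real credential.
CONFIG_TEXT = """
DATABASE_CREDENTIALS:
  Host: localhost
  Dbname: wanglab
  user: postgres
  port: 5432
  password: example-password
"""


def connection_kwargs(config_text):
    """Map the DATABASE_CREDENTIALS section to psycopg2.connect keyword arguments."""
    creds = yaml.safe_load(config_text)["DATABASE_CREDENTIALS"]
    # Lower-case the keys so that "Host" and "Dbname" match psycopg2's parameter names.
    creds = {key.lower(): value for key, value in creds.items()}
    return {
        "host": creds["host"],
        "dbname": creds["dbname"],
        "user": creds["user"],
        "port": creds["port"],
        "password": creds["password"],
    }


def load_csv(csv_path, table_name, kwargs):
    """Copy a CSV file with a header row into an existing table (not executed here).

    Assumes the CSV columns are in the same order as the table's columns.
    """
    import psycopg2  # imported lazily so the parsing step works without a database
    from psycopg2 import sql

    copy_stmt = sql.SQL("COPY {} FROM STDIN WITH (FORMAT csv, HEADER true)").format(
        sql.Identifier(table_name)
    )
    conn = psycopg2.connect(**kwargs)
    try:
        # "with conn" commits on success and rolls back on error; it does not close.
        with conn, conn.cursor() as cur, open(csv_path, newline="") as handle:
            cur.copy_expert(copy_stmt, handle)
    finally:
        conn.close()


params = connection_kwargs(CONFIG_TEXT)
print({key: value for key, value in params.items() if key != "password"})
```

Only `connection_kwargs` runs in this example; `load_csv` needs a reachable PostgreSQL server and an existing table, so it is shown as a function that is defined but not called.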

Cite as

Vargas-Rojas L, Ting T-C, Rainey KM, Reynolds M and Wang DR (2024) AgTC and AgETL: open-source tools to enhance data collection and management for plant science research. Front. Plant Sci. 15:1265073. doi: 10.3389/fpls.2024.1265073.

Contact

Diane Wang - drwang@purdue.edu

Luis Vargas Rojas - lvargasr@purdue.edu

Purdue University, Wang Lab dianewanglab.com

Owner

  • Name: Luis Vargas Rojas
  • Login: DS4Ag
  • Kind: user
  • Location: Mexico
  • Company: Purdue University

Citation (CITATION.cff)

@ARTICLE{10.3389/fpls.2024.1265073,

AUTHOR={Vargas-Rojas, Luis  and Ting, To-Chia  and Rainey, Katherine M.  and Reynolds, Matthew  and Wang, Diane R. },

TITLE={AgTC and AgETL: open-source tools to enhance data collection and management for plant science research},

JOURNAL={Frontiers in Plant Science},

VOLUME={15},

YEAR={2024},

URL={https://www.frontiersin.org/journals/plant-science/articles/10.3389/fpls.2024.1265073},

DOI={10.3389/fpls.2024.1265073},

ISSN={1664-462X},

ABSTRACT={Advancements in phenotyping technology have enabled plant science researchers to gather large volumes of information from their experiments, especially those that evaluate multiple genotypes. To fully leverage these complex and often heterogeneous data sets (i.e. those that differ in format and structure), scientists must invest considerable time in data processing, and data management has emerged as a considerable barrier for downstream application. Here, we propose a pipeline to enhance data collection, processing, and management from plant science studies comprising of two newly developed open-source programs. The first, called AgTC, is a series of programming functions that generates comma-separated values file templates to collect data in a standard format using either a lab-based computer or a mobile device. The second series of functions, AgETL, executes steps for an Extract-Transform-Load (ETL) data integration process where data are extracted from heterogeneously formatted files, transformed to meet standard criteria, and loaded into a database. There, data are stored and can be accessed for data analysis-related processes, including dynamic data visualization through web-based tools. Both AgTC and AgETL are flexible for application across plant science experiments without programming knowledge on the part of the domain scientist, and their functions are executed on Jupyter Notebook, a browser-based interactive development environment. Additionally, all parameters are easily customized from central configuration files written in the human-readable YAML format. Using three experiments from research laboratories in university and non-government organization (NGO) settings as test cases, we demonstrate the utility of AgTC and AgETL to streamline critical steps from data collection to analysis in the plant sciences.}}
