agetl

Functions to process data files from different agricultural and plant science experiments and aggregate them into a standard database table in a central repository, making the data available for a variety of data analyses.

https://github.com/ds4ag/agetl


Repository

Basic Info

Host: GitHub
Owner: DS4Ag
License: gpl-3.0
Language: Python
Default Branch: main
Size: 181 KB

Statistics

Stars: 2
Watchers: 1
Forks: 0
Open Issues: 0
Releases: 3

Created almost 3 years ago · Last pushed about 2 years ago

Metadata Files

Readme License Citation

README.md

Note: to open links in a new tab, use CTRL+click (Windows and Linux) or CMD+click (macOS).

What is AgETL?

The Agricultural Data Extract, Transform, and Load (AgETL) framework is a set of functions written in Python that allows you to process data files from different agricultural and plant science experiments and aggregate them into a standard database table in a central repository, making the data available for a variety of data analyses.

The workflow is divided into two steps, each with its own notebook file and configuration file.

Extraction and Transformation processes:

Runs the Extraction and Transformation processes and produces a CSV file in which the data from different source files are aggregated and standardized into a single format.

Notebook file: extract-transform.ipynb

Configuration file: config_extract-transform.yml

Load processes

Loads the data into a single table in a data warehouse.

Notebook file: load.ipynb

Configuration file: config_load.yml

If you are working on plant phenotyping experiments, we encourage you to follow the MIAPPE standards (https://www.miappe.org/) for creating your database tables.

How to run AgETL?

Option 1
- Install either JupyterLab or Jupyter Notebook directly, or use an environment manager such as conda, mamba, or pipenv.
Option 2
- Use a Jupyter Hub environment.

Prerequisites

Option 1
- Use the requirements file: `pip install -r requirements.txt`
Option 2
- Install the required libraries using the pip package installer for Python.

[PyYAML](https://pypi.org/project/PyYAML/)
```sh
    pip install pyyaml

```
[Pandas](https://pypi.org/project/pandas/)
```sh
    pip install pandas

```    
[psycopg2](https://pypi.org/project/psycopg2/)
```sh
    pip install psycopg2 

```
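Before opening the notebooks, it can help to confirm that the three libraries are importable in the active environment. The following is a minimal sketch using only the standard library; the helper name `missing_modules` is hypothetical and not part of AgETL. Note that PyYAML is imported under the name `yaml`.

```python
from importlib.util import find_spec

# Import names of the libraries listed above (PyYAML is imported as "yaml").
REQUIRED_MODULES = ["yaml", "pandas", "psycopg2"]


def missing_modules(names):
    """Return the module names from `names` that cannot be found."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing libraries:", ", ".join(missing))
    else:
        print("All required libraries are installed.")
```

If any library is reported as missing, install it with the corresponding `pip install` command above and run the check again.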

Clone or download AgETL from the GitHub repository

Clone option
1. Open a new Jupyter Notebook Terminal
New > Terminal

2. Clone the GitHub repository 

```sh
    git clone https://github.com/Purdue-LuisVargas/agETL.git

```

Download option

1. [Download](https://docs.github.com/en/repositories/creating-and-managing-repositories/cloning-a-repository) **AgETL** from the **GitHub** repository: [https://github.com/Purdue-LuisVargas/agETL](https://github.com/Purdue-LuisVargas/agETL).
2. Unzip the entire folder, then copy the files (if running Jupyter locally) or upload them (if using the Jupyter Hub environment) to your Jupyter Notebook directory.

Which files should I run?

To run the functions in AgETL, open the files in Jupyter Notebook. First modify the configuration file (.yml), then run the Python functions in the notebook (.ipynb). The process is divided into two tasks, as indicated below:

Raw data files (input) --> Extraction and transformation --> standardized dataframe (output) --> Load

Extraction and Transformation: The first set of functions runs the Extract and Transform processes. It outputs a CSV file where the data from different source files have been aggregated and standardized into a single format.
```
You need the following files:

    extract-transform.ipynb

    config_extract-transform.yml
```
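To illustrate what "aggregated and standardized into a single format" can look like, here is a minimal sketch using only the standard library. It is not AgETL's implementation (the notebooks rely on pandas); the file contents, the column names, and the `COLUMN_MAP` mapping are all invented for illustration.

```python
import csv
import io

# Hypothetical mapping from source column names to standard column names.
COLUMN_MAP = {
    "Plot": "plot_id", "plot": "plot_id",
    "Height (cm)": "plant_height_cm", "height_cm": "plant_height_cm",
}
STANDARD_COLUMNS = ["plot_id", "plant_height_cm", "source_file"]


def standardize(name, text):
    """Extract rows from one CSV text and rename columns to the standard."""
    rows = []
    for record in csv.DictReader(io.StringIO(text)):
        row = {COLUMN_MAP[key]: value
               for key, value in record.items() if key in COLUMN_MAP}
        row["source_file"] = name  # keep track of where each row came from
        rows.append(row)
    return rows


def aggregate(sources):
    """Combine standardized rows from all sources into one CSV string."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=STANDARD_COLUMNS, lineterminator="\n")
    writer.writeheader()
    for name, text in sources.items():
        writer.writerows(standardize(name, text))
    return out.getvalue()


# Two made-up source files with different headers for the same variables.
sources = {
    "trial_a.csv": "Plot,Height (cm)\n101,85.2\n102,90.1\n",
    "trial_b.csv": "plot,height_cm\n201,78.4\n",
}
combined_csv = aggregate(sources)
print(combined_csv)
```

In the real workflow, the equivalent mapping between source columns and standard columns would be set in config_extract-transform.yml rather than hard-coded.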
Loading: The second group of functions is used to load data into a single table in the database.
```
You need the following files:

    load.ipynb

    config_load.yml
```
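Loading a standardized CSV file into a single table could look roughly like the sketch below. This is an assumption about the general approach, not the code in load.ipynb: the function names are hypothetical, and `load_csv` expects a DB-API connection that uses `%s` placeholders, as psycopg2 does.

```python
import csv


def quote_ident(name):
    """Quote a PostgreSQL identifier, doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def build_insert(table, columns):
    """Build a parameterized INSERT statement using %s placeholders."""
    column_list = ", ".join(quote_ident(c) for c in columns)
    placeholders = ", ".join(["%s"] * len(columns))
    return f"INSERT INTO {quote_ident(table)} ({column_list}) VALUES ({placeholders})"


def load_csv(connection, csv_path, table):
    """Insert every row of a CSV file into `table` and return the row count."""
    with open(csv_path, newline="") as handle:
        reader = csv.reader(handle)
        columns = next(reader)  # the header row gives the column names
        rows = [tuple(row) for row in reader]
    with connection.cursor() as cursor:
        cursor.executemany(build_insert(table, columns), rows)
    connection.commit()
    return len(rows)
```

With psycopg2 installed, a call might look like `load_csv(psycopg2.connect(...), "standardized.csv", "measurements")`, where the file and table names are placeholders. Values are passed as text, so empty cells would need to be converted to `None` before loading into numeric columns.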
To connect to the database, update the following information in the configuration file (config_load.yml), as in the following examples:
- Localhost database:
```sh
DATABASE_CREDENTIALS:
    Host: localhost
    Dbname: wanglab
    user: postgres
    port: 5432
    password: **************WAdxm1
```
- Cloud server database:
```sh
DATABASE_CREDENTIALS:
    Host: containers-us-west-187.railway.app
    Dbname: railway
    user: postgres
    port: 7895
    password: **************WAdxm1
```
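As a rough illustration of how the credentials shown above could be turned into a database connection, the sketch below reads the `DATABASE_CREDENTIALS` section with PyYAML and lower-cases the keys so they match the keyword names accepted by `psycopg2.connect` (`host`, `dbname`, `user`, `port`, `password`). The function names are hypothetical, the sketch assumes the fields are nested under `DATABASE_CREDENTIALS`, and AgETL's notebooks may handle this differently.

```python
def to_connect_kwargs(credentials):
    """Lower-case the keys so they match keywords such as host and dbname."""
    return {str(key).lower(): value for key, value in credentials.items()}


def read_credentials(config_path):
    """Read the DATABASE_CREDENTIALS section from a YAML configuration file."""
    import yaml  # PyYAML, imported here so the helper above has no dependencies

    with open(config_path) as handle:
        config = yaml.safe_load(handle)
    return to_connect_kwargs(config["DATABASE_CREDENTIALS"])


# Example with made-up values (no password included).
example = {"Host": "localhost", "Dbname": "wanglab", "user": "postgres", "port": 5432}
print(to_connect_kwargs(example))
```

The result can then be unpacked into the connection call, for example `psycopg2.connect(**read_credentials("config_load.yml"))`.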

Cite as

Vargas-Rojas L, Ting T-C, Rainey KM, Reynolds M and Wang DR (2024) AgTC and AgETL: open-source tools to enhance data collection and management for plant science research. Front. Plant Sci. 15:1265073. doi: 10.3389/fpls.2024.1265073.

Contact

Diane Wang - drwang@purdue.edu

Luis Vargas Rojas - lvargasr@purdue.edu

Purdue University, Wang Lab dianewanglab.com

Owner

Name: Luis Vargas Rojas
Login: DS4Ag
Kind: user
Location: Mexico
Company: Purdue University

Twitter: L_VargasR
Repositories: 24
Profile: https://github.com/DS4Ag

Citation (CITATION.cff)

@ARTICLE{10.3389/fpls.2024.1265073,

AUTHOR={Vargas-Rojas, Luis  and Ting, To-Chia  and Rainey, Katherine M.  and Reynolds, Matthew  and Wang, Diane R. },

TITLE={AgTC and AgETL: open-source tools to enhance data collection and management for plant science research},

JOURNAL={Frontiers in Plant Science},

VOLUME={15},

YEAR={2024},

URL={https://www.frontiersin.org/journals/plant-science/articles/10.3389/fpls.2024.1265073},

DOI={10.3389/fpls.2024.1265073},

ISSN={1664-462X},

ABSTRACT={Advancements in phenotyping technology have enabled plant science researchers to gather large volumes of information from their experiments, especially those that evaluate multiple genotypes. To fully leverage these complex and often heterogeneous data sets (i.e. those that differ in format and structure), scientists must invest considerable time in data processing, and data management has emerged as a considerable barrier for downstream application. Here, we propose a pipeline to enhance data collection, processing, and management from plant science studies comprising of two newly developed open-source programs. The first, called AgTC, is a series of programming functions that generates comma-separated values file templates to collect data in a standard format using either a lab-based computer or a mobile device. The second series of functions, AgETL, executes steps for an Extract-Transform-Load (ETL) data integration process where data are extracted from heterogeneously formatted files, transformed to meet standard criteria, and loaded into a database. There, data are stored and can be accessed for data analysis-related processes, including dynamic data visualization through web-based tools. Both AgTC and AgETL are flexible for application across plant science experiments without programming knowledge on the part of the domain scientist, and their functions are executed on Jupyter Notebook, a browser-based interactive development environment. Additionally, all parameters are easily customized from central configuration files written in the human-readable YAML format. Using three experiments from research laboratories in university and non-government organization (NGO) settings as test cases, we demonstrate the utility of AgTC and AgETL to streamline critical steps from data collection to analysis in the plant sciences.}}
