https://github.com/bayer-group/bcs-ely-benchmark

Science Score: 36.0%

This score indicates how likely this project is to be science-related based on various indicators:

○ CITATION.cff file
✓ codemeta.json file (found codemeta.json file)
✓ .zenodo.json file (found .zenodo.json file)
○ DOI references
✓ Academic publication links (links to arxiv.org)
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity (low similarity, 10.9%, to scientific vocabulary)

Keywords

beat-undefined

Last synced: 10 months ago

Repository

Basic Info

Host: GitHub
Owner: Bayer-Group
License: bsd-3-clause
Default Branch: main
Size: 0 Bytes

Statistics

Stars: 1
Watchers: 1
Forks: 0
Open Issues: 0
Releases: 0

Topics

beat-undefined

Created over 1 year ago · Last pushed about 1 year ago

Metadata Files

Readme Contributing License

Bayer Crop Science US Corn Crop Protection Benchmark Data Set

Introduction

This repository is dedicated to sharing benchmarking questions related to Crop Protection for evaluating Q&A agents.

Bayer's E.L.Y. team has identified that while large language models (LLMs) are improving in general agronomy knowledge, there are still significant gaps in their understanding of specific agricultural products.

Current leading models demonstrate a broad understanding of agronomy but lack the depth required for providing high-stakes guidance on crop protection. This benchmark aims to quantify the performance differences when models are specifically trained on agricultural enterprise product portfolios.

The current benchmark focuses on the Bayer Crop Science US Corn Crop Protection Portfolio and includes a sample of 156 real-world product positioning questions. These questions reflect inquiries that farmers and channel partners might have regarding the application of Bayer’s corn crop protection products. The benchmark addresses common Crop Protection application questions related to:

Site of action
Application timing
Application rate
Common tank mixes
Common additives
Pests controlled

It encompasses herbicides, fungicides, and insecticides intended for use in field corn.

QnA Construction

Answers are written in the full-sentence style that large language models (LLMs) typically produce, rather than in traditional agricultural notation.

Example of Agricultural Notation:
“0.5 oz prior to Nov 1st soil pH < 6.8 or 0.3 oz after Nov 1st pH > 6.8”

Common LLM Output:
“The application rate of Autumn Super for field corn is 0.5 oz per acre if applied prior to November 1st for soils with a pH of less than 6.8, or 0.3 oz per acre if applied after November 1st for soils with a pH greater than 6.8.”

Limitations and Restrictions on the Use of This Dataset

This dataset is not intended for training language models or retrieval-augmented generation (RAG) solutions. It is solely meant for validating the performance of language models or language model-based solutions.
This Q&A set should not be used for making crop protection recommendations.
It is not permitted to crawl this dataset to train base models.

Results

Evaluation Methodology

Each question in the benchmark dataset was submitted with the same instruction, and each response was then evaluated using the G-Eval framework [1, 2]. G-Eval uses large language models (LLMs) with chain-of-thought (CoT) reasoning to assess LLM outputs against custom criteria.

For our evaluation, we used the following definition, with GPT-4o mini as the evaluating model. Our implementation returns G-Eval scores in the range [0, 1], where a score of 1 indicates the best match.

```python
# LLMTestCaseParams is provided by the DeepEval library.
from deepeval.test_case import LLMTestCaseParams

# Definition of the "Correctness" metric passed to G-Eval.
evaluator_params = {
    "name": "Correctness",
    "criteria": "Determine whether the actual output is factually correct based on the expected output.",
    "evaluation_steps": [
        "Compare the actual output directly with the expected output to verify factual accuracy.",
        "Check if all elements mentioned in the expected output are present and correctly represented in the actual output.",
        "Assess if there are any discrepancies in details, values, or information between the actual and expected outputs.",
    ],
    # The evaluating model sees only the expected and actual outputs.
    "evaluation_params": [
        LLMTestCaseParams.EXPECTED_OUTPUT,
        LLMTestCaseParams.ACTUAL_OUTPUT,
    ],
}
```
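A definition like this would typically be handed to a G-Eval metric and run once per question. The sketch below assumes the DeepEval library (suggested by the `LLMTestCaseParams` name) and an OpenAI API key in the environment. The helper names `score_answer` and `mean_score`, the judge model string, and the use of a plain mean as a per-model summary are illustrative assumptions, not the team's evaluation scripts. The explicit evaluation steps are omitted here for brevity.

```python
from statistics import mean


def score_answer(question: str, actual: str, expected: str,
                 judge_model: str = "gpt-4o-mini") -> float:
    """Score one answer with a G-Eval correctness metric (requires deepeval)."""
    # Imported lazily so mean_score() below can be used without deepeval installed.
    from deepeval.metrics import GEval
    from deepeval.test_case import LLMTestCase, LLMTestCaseParams

    metric = GEval(
        name="Correctness",
        criteria=("Determine whether the actual output is factually correct "
                  "based on the expected output."),
        evaluation_params=[
            LLMTestCaseParams.EXPECTED_OUTPUT,
            LLMTestCaseParams.ACTUAL_OUTPUT,
        ],
        model=judge_model,  # assumed judge model identifier
    )
    case = LLMTestCase(input=question, actual_output=actual,
                       expected_output=expected)
    metric.measure(case)  # calls the judge LLM over the network
    return metric.score


def mean_score(scores: list[float]) -> float:
    """Average per-question scores after checking each lies in [0, 1]."""
    if not scores:
        raise ValueError("no scores to aggregate")
    for s in scores:
        if not 0.0 <= s <= 1.0:
            raise ValueError(f"score out of range [0, 1]: {s}")
    return mean(scores)
```

With one score per benchmark question, `mean_score` gives a single number per model that can be compared across models, provided every model was judged with the same definition.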

Parameters for Model Runs

Prompts: To ensure consistency, the same prompt was used for all frontier models and the ELY endpoint.
Parameters: The following parameters were used for the frontier models:
- Max Tokens: 4096
- Temperature: 0.0

This dataset was not used in model training or ingested as part of ELY's RAG.

G-Eval Scores

Planned Next Steps

Have our subject matter experts (SMEs) update the benchmark annually as product portfolios and recommendations evolve.
Expand the benchmark to include additional crops and regions, collaborating with industry partners to develop a more comprehensive suite of agricultural AI benchmarks.
Regularly update and release results for frontier models.
Add metadata to the Q&A dataset during annual updates, following best-in-class practices for benchmarks.
Share the scripts used for evaluation.

Contacts

Balathasan Giritharan : balathasan.giritharan ~ at ~ bayer dot com
Daniel Kurdys : daniel.kurdys ~ at ~ bayer dot com

Owner

Name: Bayer Open Source
Login: Bayer-Group
Kind: organization

Website: https://bayer.com/
Repositories: 98
Profile: https://github.com/Bayer-Group

Science for a better life

GitHub Events

Total

Watch event: 5
Member event: 3
Push event: 1
Create event: 2

Last Year

Watch event: 5
Member event: 3
Push event: 1
Create event: 2
