llm-jp-eval

Modified llm-jp-eval with API and HF scripts for LFMs.

https://github.com/liquid4all/llm-jp-eval

Science Score: 26.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file (found)
  • .zenodo.json file (found)
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity: low similarity (8.5%)

Keywords

benchmark evaluation liquid-ai llm llm-jp-eval
Last synced: 6 months ago

Repository

Modified llm-jp-eval with API and HF scripts for LFMs.

Basic Info
  • Host: GitHub
  • Owner: Liquid4All
  • License: apache-2.0
  • Language: Python
  • Default Branch: main
  • Size: 5.69 MB
Statistics
  • Stars: 1
  • Watchers: 8
  • Forks: 1
  • Open Issues: 1
  • Releases: 0
Topics
benchmark evaluation liquid-ai llm llm-jp-eval
Created about 1 year ago · Last pushed 8 months ago
Metadata Files
Readme · License · Citation

README.md

Run Evaluation through vLLM API

Overview

  1. Run the model through vLLM with an OpenAI-compatible API.
    • For Liquid models, run the on-prem stack, or use Liquid Labs.
    • For other models, use the run-vllm.sh script, or use third-party providers.
  2. Run the evaluation script with the model API endpoint and API key (a quick endpoint sanity check is sketched below).
    • The evaluation can be run with Docker (recommended) or locally without Docker.
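
Before kicking off a full evaluation, it can save time to confirm that the endpoint actually speaks the OpenAI-compatible protocol. A minimal sketch using `curl`; the URL, key, and model name are illustrative, so substitute the values of your own deployment:

```bash
# List the models served by the endpoint; a JSON response containing your
# model name confirms the server is up and OpenAI-compatible.
curl -s http://localhost:8000/v1/models \
  -H "Authorization: Bearer $API_KEY" | python3 -m json.tool

# Minimal one-token chat completion as a smoke test.
curl -s http://localhost:8000/v1/chat/completions \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "lfm-3b-jp", "messages": [{"role": "user", "content": "ping"}], "max_tokens": 1}'
```

If both calls succeed, the same URL (including the `/v1` suffix) and key can be passed to the evaluation scripts below.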

Run Evaluation with Docker

```bash
bin/api/run_docker_eval.sh --config <config-file>.yaml \
  --model-name <model-name> \
  --model-url <model-url>/v1 \
  --model-api-key <API-KEY>
```

Examples

Run Swallow evaluation on `lfm-3b-jp` on-prem:

```bash
bin/api/run_docker_eval.sh --config config_api_swallow.yaml \
  --model-name lfm-3b-jp \
  --model-url http://localhost:8000/v1 \
  --model-api-key <API-KEY>

# output: ./results/swallow/lfm-3b-jp
```

Run Swallow evaluation on `lfm-3b-ichikara` on-prem:

```bash
bin/api/run_docker_eval.sh --config config_api_swallow.yaml \
  --model-name lfm-3b-ichikara \
  --model-url http://localhost:8000/v1 \
  --model-api-key <API-KEY>

# output: ./results/swallow/lfm-3b-ichikara
```

Run Nejumi evaluation on `lfm-3b-jp` on `labs`:

```bash
bin/api/run_docker_eval.sh --config config_api_nejumi.yaml \
  --model-name lfm-3b-jp \
  --model-url https://inference-1.liquid.ai/v1 \
  --model-api-key <API-KEY>

# output: ./results/nejumi/lfm-3b-jp
```

Run Evaluation without Docker

### Installation

It is recommended to create a brand new `conda` environment first, but this step is optional.

```bash
conda create -n llm-jp-eval python=3.10
conda activate llm-jp-eval
```

Run the following commands to set up the environment and install the dependencies. This step can take a few minutes. The commands are idempotent and safe to run multiple times.

```bash
bin/api/prepare.sh
bin/api/download_data.sh
```

Then run the evaluation script:

```bash
bin/api/run_api_eval.sh --config <config-file>.yaml \
  --model-name <model-name> \
  --model-url <model-url>/v1 \
  --model-api-key <API-KEY>
```

The config files are the same as the ones used in the Docker example above.

### Examples

Run Swallow evaluation on `lfm-3b-jp` on-prem:

```bash
bin/api/run_api_eval.sh --config config_api_swallow.yaml \
  --model-name lfm-3b-jp \
  --model-url http://localhost:8000/v1 \
  --model-api-key <API-KEY>

# output: ./results/swallow/lfm-3b-jp
```

Run Swallow evaluation on `lfm-3b-ichikara` on-prem:

```bash
bin/api/run_api_eval.sh --config config_api_swallow.yaml \
  --model-name lfm-3b-ichikara \
  --model-url http://localhost:8000/v1 \
  --model-api-key <API-KEY>

# output: ./results/swallow/lfm-3b-ichikara
```

Run Nejumi evaluation on `lfm-3b-jp` on `labs`:

```bash
bin/api/run_api_eval.sh --config config_api_nejumi.yaml \
  --model-name lfm-3b-jp \
  --model-url https://inference-1.liquid.ai/v1 \
  --model-api-key <API-KEY>

# output: ./results/nejumi/lfm-3b-jp
```

Configs

### Swallow

Both `configs/config_api.yaml` and `configs/config_api_swallow.yaml` are for running [Swallow](https://swallow-llm.github.io/evaluation/about.ja.html) evaluations. They run all samples, and set different numbers of shots for different tests:

| Test | Number of Shots |
| --- | --- |
| ALT, JCom, JEMHopQA, JSQuAD, MGSM, NIILC, WikiCorpus | 4 |
| JMMLU, MMLU_EN, XL-SUM (0-shot) | 5 |

`configs/config_api.yaml` has been deprecated and will be removed in the future. Please use `configs/config_api_swallow.yaml` instead.

### Nejumi

`configs/config_api_nejumi.yaml` is for running Nejumi evaluations. It sets **0-shot** and runs **100 samples** for each test.
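
To see exactly how the two presets differ, or to tweak a single knob for an ad-hoc run, the sketch below may help. The `diff` is safe as-is; the override example assumes the upstream llm-jp-eval entry point and Hydra-style config keys (`max_num_samples`, `metainfo.num_few_shots`), which may differ in this fork, so treat it as a starting point rather than a verified recipe:

```bash
# Compare the bundled presets to see which knobs differ (shots, samples).
diff configs/config_api_swallow.yaml configs/config_api_nejumi.yaml

# Hypothetical ad-hoc run with Hydra-style overrides; the key names are
# assumptions carried over from upstream llm-jp-eval -- verify them
# against the YAML files in configs/ first.
poetry run python scripts/evaluate_llm.py -cn config_api_nejumi.yaml \
  max_num_samples=100 \
  metainfo.num_few_shots=0
```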

Non-Liquid Model Evaluation

To launch any model from Hugging Face, first run the following command in the on-prem stack:

```bash
./run-vllm.sh \
  --model-name <model-name> \
  --hf-model-path <hf-model-path> \
  --hf-token <hf-token>

# e.g.
./run-vllm.sh \
  --model-name llama-7b \
  --hf-model-path "meta-llama/Llama-2-7b-chat-hf" \
  --hf-token hf_mock_token_abcd
```

Note that no API key is needed for a generic vLLM server launched by `run-vllm.sh`.

Then run the evaluation script using the relevant URL and model name.
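
Putting both steps together, an end-to-end run against the Llama example above might look like the sketch below. It is illustrative only: the port is an assumption about the defaults of `run-vllm.sh`, and the empty API key reflects the note above (check whether the eval script accepts an empty value in your setup):

```bash
# Evaluate the HF model served by run-vllm.sh; no API key is required for
# a generic vLLM server, so an empty value is passed through.
# Port 8000 is an assumption -- match it to your run-vllm.sh output.
bin/api/run_docker_eval.sh --config config_api_swallow.yaml \
  --model-name llama-7b \
  --model-url http://localhost:8000/v1 \
  --model-api-key ""
```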

Troubleshooting

### `PermissionError` when running `XL-SUM` tests

Tests like `XL-SUM` need to download extra models from Hugging Face for evaluation. This process requires access to the Hugging Face cache directory. The `bin/api/prepare.sh` script creates this directory up front. However, if the cache directory has already been created by root or another user on the machine, the download will fail with a `PermissionError` like the one below:

> PermissionError: [Errno 13] Permission denied: '/home/ubuntu/.cache/huggingface/hub/.locks/models--bert-base-multilingual-cased'

The fix is to change the ownership of the cache directory to the current user:

```bash
sudo chown $USER:$USER ~/.cache/huggingface/hub/.locks
```
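
To confirm that ownership is actually the culprit before changing anything, a quick check of the lock directory (path as in the error message above) suffices:

```bash
# Shows the owner and group of the lock directory; if it is root or
# another user rather than you, the chown fix above applies.
ls -ld ~/.cache/huggingface/hub/.locks
```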

Acknowledgement

This repository is modified from llm-jp/llm-jp-eval.

Owner

  • Name: Liquid AI
  • Login: Liquid4All
  • Kind: organization
  • Email: code@liquid.ai
  • Location: United States of America

Liquid AI, Inc.

GitHub Events

Total
  • Watch event: 1
  • Delete event: 3
  • Push event: 26
  • Public event: 1
  • Pull request event: 6
  • Fork event: 1
  • Create event: 3
Last Year
  • Watch event: 1
  • Delete event: 3
  • Push event: 26
  • Public event: 1
  • Pull request event: 6
  • Fork event: 1
  • Create event: 3

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 0
  • Total pull requests: 5
  • Average time to close issues: N/A
  • Average time to close pull requests: 7 minutes
  • Total issue authors: 0
  • Total pull request authors: 3
  • Average comments per issue: 0
  • Average comments per pull request: 0.2
  • Merged pull requests: 1
  • Bot issues: 0
  • Bot pull requests: 3
Past Year
  • Issues: 0
  • Pull requests: 5
  • Average time to close issues: N/A
  • Average time to close pull requests: 7 minutes
  • Issue authors: 0
  • Pull request authors: 3
  • Average comments per issue: 0
  • Average comments per pull request: 0.2
  • Merged pull requests: 1
  • Bot issues: 0
  • Bot pull requests: 3
Top Authors
Issue Authors
  • (none)
Pull Request Authors
  • tuliren (2)
  • dependabot[bot] (2)
  • devin-ai-integration[bot] (1)
Top Labels
Issue Labels
  • (none)
Pull Request Labels
  • dependencies (2)
  • github_actions (2)

Dependencies

.github/workflows/lint.yaml actions
  • actions/checkout v4 composite
  • actions/setup-python v5 composite
.github/workflows/requirements.yaml actions
  • actions/checkout v4 composite
  • actions/setup-python v5 composite
  • stefanzweifel/git-auto-commit-action v5 composite
.github/workflows/run-eval.yaml actions
  • actions/checkout v4 composite
  • actions/setup-python v5 composite
.github/workflows/test.yaml actions
  • actions/cache v4 composite
  • actions/checkout v4 composite
  • actions/setup-python v5 composite
Dockerfile docker
  • ubuntu 22.04 build
offline_inference/transformers/requirements_transformers_cuda118.txt pypi
  • bitsandbytes *
  • hydra-core *
  • peft >=0.12.0
  • torch ==2.4.0
  • transformers >=4.44.2
  • wandb >=0.17.7,<0.18.0
  • wheel *
offline_inference/transformers/requirements_transformers_cuda121.txt pypi
  • bitsandbytes *
  • hydra-core *
  • peft >=0.12.0
  • torch ==2.4.0
  • transformers >=4.44.2
  • wandb >=0.17.7,<0.18.0
  • wheel *
offline_inference/trtllm/requirements_trtllm.txt pypi
  • click ==8.0.2
  • cython <3.0.0
  • hydra-core <1.3.0
  • markdown-it-py <2.3.0
  • omegaconf <2.3.0
  • setuptools ==65.5.1
  • wandb >=0.17.7,<0.18.0
offline_inference/trtllm/requirements_trtllm_quantization.txt pypi
  • mpmath ==1.3.0
  • nemo-toolkit <=1.20.0,>=1.18.0
  • pydantic >=2.0.0
  • transformers_stream_generator ==0.0.4
offline_inference/vllm/requirements_vllm_cuda121.txt pypi
  • hydra-core *
  • numpy ==1.26.4
  • torch ==2.4.0
  • transformers >=4.45.1,<4.46.0
  • vllm ==0.6.2
  • vllm-flash-attn *
  • wandb >=0.17.7,<0.18.0
  • wheel *
poetry.lock pypi
  • 155 dependencies
pyproject.toml pypi
  • mock * develop
  • pytest ^7.4.3 develop
  • accelerate ^0.26.0
  • bert-score ^0.3.12
  • datasets ^2.9.0
  • fastparquet ^2023.10.0
  • fuzzywuzzy ^0.18.0
  • hydra-core ^1.3.2
  • langchain ^0.2
  • langchain-community ^0.2.3
  • langchain-huggingface ^0.0.2
  • langchain-openai ^0.1.7
  • pandas ^2.1.3
  • peft ^0.5.0
  • pyarrow ^15.0.0
  • pylint ^3.0.0
  • python >=3.9,<3.13
  • python-levenshtein ^0.25.1
  • rhoknp ^1.6.0
  • rouge-score ^0.1.2
  • sacrebleu ^2.3.0
  • scikit-learn ^1.3.1
  • sumeval ^0.2.2
  • tokenizers >=0.14.0
  • torch >=2.1.1
  • transformers ^4.42.0
  • typing-extensions ^4.8.0
  • unbabel-comet ^2.2.0
  • wandb >=0.16.0
  • xmltodict ^0.13.0
requirements.txt pypi
  • absl-py ==2.1.0
  • accelerate ==0.26.1
  • aiohappyeyeballs ==2.4.0
  • aiohttp ==3.10.5
  • aiosignal ==1.3.1
  • annotated-types ==0.7.0
  • antlr4-python3-runtime ==4.9.3
  • anyio ==4.4.0
  • astroid ==3.2.4
  • async-timeout ==4.0.3
  • attrs ==24.2.0
  • bert-score ==0.3.13
  • certifi ==2024.8.30
  • charset-normalizer ==3.3.2
  • click ==8.1.7
  • colorama ==0.4.6
  • contourpy ==1.3.0
  • cramjam ==2.8.3
  • cycler ==0.12.1
  • dataclasses-json ==0.6.7
  • datasets ==2.21.0
  • dill ==0.3.8
  • distro ==1.9.0
  • docker-pycreds ==0.4.0
  • entmax ==1.3
  • exceptiongroup ==1.2.2
  • fastparquet ==2023.10.1
  • filelock ==3.15.4
  • fonttools ==4.53.1
  • frozenlist ==1.4.1
  • fsspec ==2024.6.1
  • fuzzywuzzy ==0.18.0
  • gitdb ==4.0.11
  • gitpython ==3.1.43
  • greenlet ==3.0.3
  • h11 ==0.14.0
  • httpcore ==1.0.5
  • httpx ==0.27.2
  • huggingface-hub ==0.24.6
  • hydra-core ==1.3.2
  • idna ==3.8
  • importlib-resources ==6.4.4
  • ipadic ==1.0.0
  • isort ==5.13.2
  • jinja2 ==3.1.4
  • jiter ==0.5.0
  • joblib ==1.4.2
  • jsonargparse ==3.13.1
  • jsonpatch ==1.33
  • jsonpointer ==3.0.0
  • kiwisolver ==1.4.7
  • langchain ==0.2.16
  • langchain-community ==0.2.16
  • langchain-core ==0.2.38
  • langchain-huggingface ==0.0.2
  • langchain-openai ==0.1.23
  • langchain-text-splitters ==0.2.4
  • langsmith ==0.1.111
  • levenshtein ==0.25.1
  • lightning-utilities ==0.11.7
  • lxml ==5.3.0
  • markupsafe ==2.1.5
  • marshmallow ==3.22.0
  • matplotlib ==3.9.2
  • mccabe ==0.7.0
  • mecab-python3 ==1.0.9
  • mpmath ==1.3.0
  • multidict ==6.0.5
  • multiprocess ==0.70.16
  • mypy-extensions ==1.0.0
  • networkx ==3.2.1
  • nltk ==3.9.1
  • numpy ==1.26.4
  • nvidia-cublas-cu12 ==12.1.3.1
  • nvidia-cuda-cupti-cu12 ==12.1.105
  • nvidia-cuda-nvrtc-cu12 ==12.1.105
  • nvidia-cuda-runtime-cu12 ==12.1.105
  • nvidia-cudnn-cu12 ==9.1.0.70
  • nvidia-cufft-cu12 ==11.0.2.54
  • nvidia-curand-cu12 ==10.3.2.106
  • nvidia-cusolver-cu12 ==11.4.5.107
  • nvidia-cusparse-cu12 ==12.1.0.106
  • nvidia-nccl-cu12 ==2.20.5
  • nvidia-nvjitlink-cu12 ==12.6.68
  • nvidia-nvtx-cu12 ==12.1.105
  • omegaconf ==2.3.0
  • openai ==1.43.0
  • orjson ==3.10.7
  • packaging ==24.1
  • pandas ==2.2.2
  • peft ==0.5.0
  • pillow ==10.4.0
  • plac ==1.4.3
  • platformdirs ==4.2.2
  • portalocker ==2.10.1
  • protobuf ==4.25.4
  • psutil ==6.0.0
  • pyarrow ==15.0.2
  • pydantic ==2.8.2
  • pydantic-core ==2.20.1
  • pylint ==3.2.7
  • pyparsing ==3.1.4
  • python-dateutil ==2.9.0.post0
  • python-levenshtein ==0.25.1
  • pytorch-lightning ==2.4.0
  • pytz ==2024.1
  • pywin32 ==306
  • pyyaml ==6.0.2
  • rapidfuzz ==3.9.7
  • regex ==2024.7.24
  • requests ==2.32.3
  • rhoknp ==1.7.0
  • rouge-score ==0.1.2
  • sacrebleu ==2.4.3
  • safetensors ==0.4.4
  • scikit-learn ==1.5.1
  • scipy ==1.13.1
  • sentence-transformers ==3.0.1
  • sentencepiece ==0.1.99
  • sentry-sdk ==2.13.0
  • setproctitle ==1.3.3
  • setuptools ==74.1.1
  • six ==1.16.0
  • smmap ==5.0.1
  • sniffio ==1.3.1
  • sqlalchemy ==2.0.33
  • sumeval ==0.2.2
  • sympy ==1.13.2
  • tabulate ==0.9.0
  • tenacity ==8.5.0
  • text-generation ==0.7.0
  • threadpoolctl ==3.5.0
  • tiktoken ==0.7.0
  • tokenizers ==0.19.1
  • tomli ==2.0.1
  • tomlkit ==0.13.2
  • torch ==2.4.0
  • torchmetrics ==0.10.3
  • tqdm ==4.66.5
  • transformers ==4.44.2
  • triton ==3.0.0
  • typing-extensions ==4.12.2
  • typing-inspect ==0.9.0
  • tzdata ==2024.1
  • unbabel-comet ==2.2.2
  • urllib3 ==2.2.2
  • wandb ==0.17.8
  • xmltodict ==0.13.0
  • xxhash ==3.5.0
  • yarl ==1.9.8
  • zipp ==3.20.1
bin/api/Dockerfile docker
  • python 3.9-slim build