llm-jp-eval

Modified llm-jp-eval with API and HF scripts for LFMs.

https://github.com/liquid4all/llm-jp-eval

Science Score: 26.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file (found)
  • .zenodo.json file (found)
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity: low similarity (8.5%)

Keywords

benchmark evaluation liquid-ai llm llm-jp-eval
Last synced: 6 months ago

Repository

Modified llm-jp-eval with API and HF scripts for LFMs.

Basic Info
  • Host: GitHub
  • Owner: Liquid4All
  • License: apache-2.0
  • Language: Python
  • Default Branch: main
  • Size: 5.69 MB
Statistics
  • Stars: 1
  • Watchers: 8
  • Forks: 1
  • Open Issues: 1
  • Releases: 0
Topics
benchmark evaluation liquid-ai llm llm-jp-eval
Created about 1 year ago · Last pushed 8 months ago
Metadata Files
Readme · License · Citation

README.md

Run Evaluation through vLLM API

Overview

  1. Run the model through vLLM with an OpenAI-compatible API.
    • For Liquid models, run the on-prem stack, or use Liquid Labs.
    • For other models, use the run-vllm.sh script, or use third-party providers.
  2. Run the evaluation script with the model API endpoint and API key (a quick endpoint sanity check is sketched below).
    • The evaluation can be run with Docker (recommended) or locally without Docker.
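
Before kicking off a full evaluation, it can save time to confirm that the endpoint actually speaks the OpenAI-compatible protocol. A minimal sketch using `curl`; the URL, key, and model name are illustrative, so substitute the values of your own deployment:

```bash
# List the models served by the endpoint; a JSON response containing your
# model name confirms the server is up and OpenAI-compatible.
curl -s http://localhost:8000/v1/models \
  -H "Authorization: Bearer $API_KEY" | python3 -m json.tool

# Minimal one-token chat completion as a smoke test.
curl -s http://localhost:8000/v1/chat/completions \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "lfm-3b-jp", "messages": [{"role": "user", "content": "ping"}], "max_tokens": 1}'
```

If both calls succeed, the same URL (including the `/v1` suffix) and key can be passed to the evaluation scripts below.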

Run Evaluation with Docker

```bash
bin/api/run_docker_eval.sh --config <config-file>.yaml \
  --model-name <model-name> \
  --model-url <model-url>/v1 \
  --model-api-key <API-KEY>
```

Examples

Run Swallow evaluation on `lfm-3b-jp` on-prem:

```bash
bin/api/run_docker_eval.sh --config config_api_swallow.yaml \
  --model-name lfm-3b-jp \
  --model-url http://localhost:8000/v1 \
  --model-api-key <API-KEY>

# output: ./results/swallow/lfm-3b-jp
```

Run Swallow evaluation on `lfm-3b-ichikara` on-prem:

```bash
bin/api/run_docker_eval.sh --config config_api_swallow.yaml \
  --model-name lfm-3b-ichikara \
  --model-url http://localhost:8000/v1 \
  --model-api-key <API-KEY>

# output: ./results/swallow/lfm-3b-ichikara
```

Run Nejumi evaluation on `lfm-3b-jp` on `labs`:

```bash
bin/api/run_docker_eval.sh --config config_api_nejumi.yaml \
  --model-name lfm-3b-jp \
  --model-url https://inference-1.liquid.ai/v1 \
  --model-api-key <API-KEY>

# output: ./results/nejumi/lfm-3b-jp
```

Run Evaluation without Docker

### Installation

It is recommended to create a brand new `conda` environment first, but this step is optional.

```bash
conda create -n llm-jp-eval python=3.10
conda activate llm-jp-eval
```

Run the following commands to set up the environment and install the dependencies. This step can take a few minutes. The commands are idempotent and safe to run multiple times.

```bash
bin/api/prepare.sh
bin/api/download_data.sh
```

Then run the evaluation script:

```bash
bin/api/run_api_eval.sh --config <config-file>.yaml \
  --model-name <model-name> \
  --model-url <model-url>/v1 \
  --model-api-key <API-KEY>
```

The config files are the same as the ones used in the Docker example above.

### Examples

Run Swallow evaluation on `lfm-3b-jp` on-prem:

```bash
bin/api/run_api_eval.sh --config config_api_swallow.yaml \
  --model-name lfm-3b-jp \
  --model-url http://localhost:8000/v1 \
  --model-api-key <API-KEY>

# output: ./results/swallow/lfm-3b-jp
```

Run Swallow evaluation on `lfm-3b-ichikara` on-prem:

```bash
bin/api/run_api_eval.sh --config config_api_swallow.yaml \
  --model-name lfm-3b-ichikara \
  --model-url http://localhost:8000/v1 \
  --model-api-key <API-KEY>

# output: ./results/swallow/lfm-3b-ichikara
```

Run Nejumi evaluation on `lfm-3b-jp` on `labs`:

```bash
bin/api/run_api_eval.sh --config config_api_nejumi.yaml \
  --model-name lfm-3b-jp \
  --model-url https://inference-1.liquid.ai/v1 \
  --model-api-key <API-KEY>

# output: ./results/nejumi/lfm-3b-jp
```

Configs

### Swallow

Both `configs/config_api.yaml` and `configs/config_api_swallow.yaml` are for running [Swallow](https://swallow-llm.github.io/evaluation/about.ja.html) evaluations. They run all samples, and set different numbers of shots for different tests:

| Test | Number of Shots |
| --- | --- |
| ALT, JCom, JEMHopQA, JSQuAD, MGSM, NIILC, WikiCorpus | 4 |
| JMMLU, MMLU_EN, XL-SUM (0-shot) | 5 |

`configs/config_api.yaml` has been deprecated and will be removed in the future. Please use `configs/config_api_swallow.yaml` instead.

### Nejumi

`configs/config_api_nejumi.yaml` is for running Nejumi evaluations. It sets **0-shot** and runs **100 samples** for each test.
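
To see exactly how the two presets differ, or to tweak a single knob for an ad-hoc run, the sketch below may help. The `diff` is safe as-is; the override example assumes the upstream llm-jp-eval entry point and Hydra-style config keys (`max_num_samples`, `metainfo.num_few_shots`), which may differ in this fork, so treat it as a starting point rather than a verified recipe:

```bash
# Compare the bundled presets to see which knobs differ (shots, samples).
diff configs/config_api_swallow.yaml configs/config_api_nejumi.yaml

# Hypothetical ad-hoc run with Hydra-style overrides; the key names are
# assumptions carried over from upstream llm-jp-eval -- verify them
# against the YAML files in configs/ first.
poetry run python scripts/evaluate_llm.py -cn config_api_nejumi.yaml \
  max_num_samples=100 \
  metainfo.num_few_shots=0
```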

Non-Liquid Model Evaluation

To launch any model from Hugging Face, first run the following command in the on-prem stack:

```bash
./run-vllm.sh \
  --model-name <model-name> \
  --hf-model-path <hf-model-path> \
  --hf-token <hf-token>

# e.g.
./run-vllm.sh \
  --model-name llama-7b \
  --hf-model-path "meta-llama/Llama-2-7b-chat-hf" \
  --hf-token hf_mock_token_abcd
```

Note that no API key is needed for a generic vLLM server launched by `run-vllm.sh`.

Then run the evaluation script using the relevant URL and model name.
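
Putting both steps together, an end-to-end run against the Llama example above might look like the sketch below. It is illustrative only: the port is an assumption about the defaults of `run-vllm.sh`, and the empty API key reflects the note above (check whether the eval script accepts an empty value in your setup):

```bash
# Evaluate the HF model served by run-vllm.sh; no API key is required for
# a generic vLLM server, so an empty value is passed through.
# Port 8000 is an assumption -- match it to your run-vllm.sh output.
bin/api/run_docker_eval.sh --config config_api_swallow.yaml \
  --model-name llama-7b \
  --model-url http://localhost:8000/v1 \
  --model-api-key ""
```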

Troubleshooting

### `PermissionError` when running `XL-SUM` tests

Tests like `XL-SUM` need to download extra models from Hugging Face for evaluation. This process requires access to the Hugging Face cache directory. The `bin/api/prepare.sh` script creates this directory up front. However, if the cache directory has already been created by root or another user on the machine, the download will fail with a `PermissionError` like the one below:

> PermissionError: [Errno 13] Permission denied: '/home/ubuntu/.cache/huggingface/hub/.locks/models--bert-base-multilingual-cased'

The fix is to change the ownership of the cache directory to the current user:

```bash
sudo chown $USER:$USER ~/.cache/huggingface/hub/.locks
```
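
To confirm that ownership is actually the culprit before changing anything, a quick check of the lock directory (path as in the error message above) suffices:

```bash
# Shows the owner and group of the lock directory; if it is root or
# another user rather than you, the chown fix above applies.
ls -ld ~/.cache/huggingface/hub/.locks
```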

Acknowledgement

This repository is modified from llm-jp/llm-jp-eval.

Owner

  • Name: Liquid AI
  • Login: Liquid4All
  • Kind: organization
  • Email: code@liquid.ai
  • Location: United States of America

Liquid AI, Inc.

GitHub Events

Total
  • Watch event: 1
  • Delete event: 3
  • Push event: 26
  • Public event: 1
  • Pull request event: 6
  • Fork event: 1
  • Create event: 3
Last Year
  • Watch event: 1
  • Delete event: 3
  • Push event: 26
  • Public event: 1
  • Pull request event: 6
  • Fork event: 1
  • Create event: 3

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 0
  • Total pull requests: 5
  • Average time to close issues: N/A
  • Average time to close pull requests: 7 minutes
  • Total issue authors: 0
  • Total pull request authors: 3
  • Average comments per issue: 0
  • Average comments per pull request: 0.2
  • Merged pull requests: 1
  • Bot issues: 0
  • Bot pull requests: 3
Past Year
  • Issues: 0
  • Pull requests: 5
  • Average time to close issues: N/A
  • Average time to close pull requests: 7 minutes
  • Issue authors: 0
  • Pull request authors: 3
  • Average comments per issue: 0
  • Average comments per pull request: 0.2
  • Merged pull requests: 1
  • Bot issues: 0
  • Bot pull requests: 3
Top Authors
Issue Authors
  • (none)
Pull Request Authors
  • tuliren (2)
  • dependabot[bot] (2)
  • devin-ai-integration[bot] (1)
Top Labels
Issue Labels
  • (none)
Pull Request Labels
  • dependencies (2)
  • github_actions (2)

Dependencies

.github/workflows/lint.yaml actions
  • actions/checkout v4 composite
  • actions/setup-python v5 composite
.github/workflows/requirements.yaml actions
  • actions/checkout v4 composite
  • actions/setup-python v5 composite
  • stefanzweifel/git-auto-commit-action v5 composite
.github/workflows/run-eval.yaml actions
  • actions/checkout v4 composite
  • actions/setup-python v5 composite
.github/workflows/test.yaml actions
  • actions/cache v4 composite
  • actions/checkout v4 composite
  • actions/setup-python v5 composite
Dockerfile docker
  • ubuntu 22.04 build
offline_inference/transformers/requirements_transformers_cuda118.txt pypi
  • bitsandbytes *
  • hydra-core *
  • peft >=0.12.0
  • torch ==2.4.0
  • transformers >=4.44.2
  • wandb >=0.17.7,<0.18.0
  • wheel *
offline_inference/transformers/requirements_transformers_cuda121.txt pypi
  • bitsandbytes *
  • hydra-core *
  • peft >=0.12.0
  • torch ==2.4.0
  • transformers >=4.44.2
  • wandb >=0.17.7,<0.18.0
  • wheel *
offline_inference/trtllm/requirements_trtllm.txt pypi
  • click ==8.0.2
  • cython <3.0.0
  • hydra-core <1.3.0
  • markdown-it-py <2.3.0
  • omegaconf <2.3.0
  • setuptools ==65.5.1
  • wandb >=0.17.7,<0.18.0
offline_inference/trtllm/requirements_trtllm_quantization.txt pypi
  • mpmath ==1.3.0
  • nemo-toolkit <=1.20.0,>=1.18.0
  • pydantic >=2.0.0
  • transformers_stream_generator ==0.0.4
offline_inference/vllm/requirements_vllm_cuda121.txt pypi
  • hydra-core *
  • numpy ==1.26.4
  • torch ==2.4.0
  • transformers >=4.45.1,<4.46.0
  • vllm ==0.6.2
  • vllm-flash-attn *
  • wandb >=0.17.7,<0.18.0
  • wheel *
poetry.lock pypi
  • 155 dependencies
pyproject.toml pypi
  • mock * develop
  • pytest ^7.4.3 develop
  • accelerate ^0.26.0
  • bert-score ^0.3.12
  • datasets ^2.9.0
  • fastparquet ^2023.10.0
  • fuzzywuzzy ^0.18.0
  • hydra-core ^1.3.2
  • langchain ^0.2
  • langchain-community ^0.2.3
  • langchain-huggingface ^0.0.2
  • langchain-openai ^0.1.7
  • pandas ^2.1.3
  • peft ^0.5.0
  • pyarrow ^15.0.0
  • pylint ^3.0.0
  • python >=3.9,<3.13
  • python-levenshtein ^0.25.1
  • rhoknp ^1.6.0
  • rouge-score ^0.1.2
  • sacrebleu ^2.3.0
  • scikit-learn ^1.3.1
  • sumeval ^0.2.2
  • tokenizers >=0.14.0
  • torch >=2.1.1
  • transformers ^4.42.0
  • typing-extensions ^4.8.0
  • unbabel-comet ^2.2.0
  • wandb >=0.16.0
  • xmltodict ^0.13.0
requirements.txt pypi
  • absl-py ==2.1.0
  • accelerate ==0.26.1
  • aiohappyeyeballs ==2.4.0
  • aiohttp ==3.10.5
  • aiosignal ==1.3.1
  • annotated-types ==0.7.0
  • antlr4-python3-runtime ==4.9.3
  • anyio ==4.4.0
  • astroid ==3.2.4
  • async-timeout ==4.0.3
  • attrs ==24.2.0
  • bert-score ==0.3.13
  • certifi ==2024.8.30
  • charset-normalizer ==3.3.2
  • click ==8.1.7
  • colorama ==0.4.6
  • contourpy ==1.3.0
  • cramjam ==2.8.3
  • cycler ==0.12.1
  • dataclasses-json ==0.6.7
  • datasets ==2.21.0
  • dill ==0.3.8
  • distro ==1.9.0
  • docker-pycreds ==0.4.0
  • entmax ==1.3
  • exceptiongroup ==1.2.2
  • fastparquet ==2023.10.1
  • filelock ==3.15.4
  • fonttools ==4.53.1
  • frozenlist ==1.4.1
  • fsspec ==2024.6.1
  • fuzzywuzzy ==0.18.0
  • gitdb ==4.0.11
  • gitpython ==3.1.43
  • greenlet ==3.0.3
  • h11 ==0.14.0
  • httpcore ==1.0.5
  • httpx ==0.27.2
  • huggingface-hub ==0.24.6
  • hydra-core ==1.3.2
  • idna ==3.8
  • importlib-resources ==6.4.4
  • ipadic ==1.0.0
  • isort ==5.13.2
  • jinja2 ==3.1.4
  • jiter ==0.5.0
  • joblib ==1.4.2
  • jsonargparse ==3.13.1
  • jsonpatch ==1.33
  • jsonpointer ==3.0.0
  • kiwisolver ==1.4.7
  • langchain ==0.2.16
  • langchain-community ==0.2.16
  • langchain-core ==0.2.38
  • langchain-huggingface ==0.0.2
  • langchain-openai ==0.1.23
  • langchain-text-splitters ==0.2.4
  • langsmith ==0.1.111
  • levenshtein ==0.25.1
  • lightning-utilities ==0.11.7
  • lxml ==5.3.0
  • markupsafe ==2.1.5
  • marshmallow ==3.22.0
  • matplotlib ==3.9.2
  • mccabe ==0.7.0
  • mecab-python3 ==1.0.9
  • mpmath ==1.3.0
  • multidict ==6.0.5
  • multiprocess ==0.70.16
  • mypy-extensions ==1.0.0
  • networkx ==3.2.1
  • nltk ==3.9.1
  • numpy ==1.26.4
  • nvidia-cublas-cu12 ==12.1.3.1
  • nvidia-cuda-cupti-cu12 ==12.1.105
  • nvidia-cuda-nvrtc-cu12 ==12.1.105
  • nvidia-cuda-runtime-cu12 ==12.1.105
  • nvidia-cudnn-cu12 ==9.1.0.70
  • nvidia-cufft-cu12 ==11.0.2.54
  • nvidia-curand-cu12 ==10.3.2.106
  • nvidia-cusolver-cu12 ==11.4.5.107
  • nvidia-cusparse-cu12 ==12.1.0.106
  • nvidia-nccl-cu12 ==2.20.5
  • nvidia-nvjitlink-cu12 ==12.6.68
  • nvidia-nvtx-cu12 ==12.1.105
  • omegaconf ==2.3.0
  • openai ==1.43.0
  • orjson ==3.10.7
  • packaging ==24.1
  • pandas ==2.2.2
  • peft ==0.5.0
  • pillow ==10.4.0
  • plac ==1.4.3
  • platformdirs ==4.2.2
  • portalocker ==2.10.1
  • protobuf ==4.25.4
  • psutil ==6.0.0
  • pyarrow ==15.0.2
  • pydantic ==2.8.2
  • pydantic-core ==2.20.1
  • pylint ==3.2.7
  • pyparsing ==3.1.4
  • python-dateutil ==2.9.0.post0
  • python-levenshtein ==0.25.1
  • pytorch-lightning ==2.4.0
  • pytz ==2024.1
  • pywin32 ==306
  • pyyaml ==6.0.2
  • rapidfuzz ==3.9.7
  • regex ==2024.7.24
  • requests ==2.32.3
  • rhoknp ==1.7.0
  • rouge-score ==0.1.2
  • sacrebleu ==2.4.3
  • safetensors ==0.4.4
  • scikit-learn ==1.5.1
  • scipy ==1.13.1
  • sentence-transformers ==3.0.1
  • sentencepiece ==0.1.99
  • sentry-sdk ==2.13.0
  • setproctitle ==1.3.3
  • setuptools ==74.1.1
  • six ==1.16.0
  • smmap ==5.0.1
  • sniffio ==1.3.1
  • sqlalchemy ==2.0.33
  • sumeval ==0.2.2
  • sympy ==1.13.2
  • tabulate ==0.9.0
  • tenacity ==8.5.0
  • text-generation ==0.7.0
  • threadpoolctl ==3.5.0
  • tiktoken ==0.7.0
  • tokenizers ==0.19.1
  • tomli ==2.0.1
  • tomlkit ==0.13.2
  • torch ==2.4.0
  • torchmetrics ==0.10.3
  • tqdm ==4.66.5
  • transformers ==4.44.2
  • triton ==3.0.0
  • typing-extensions ==4.12.2
  • typing-inspect ==0.9.0
  • tzdata ==2024.1
  • unbabel-comet ==2.2.2
  • urllib3 ==2.2.2
  • wandb ==0.17.8
  • xmltodict ==0.13.0
  • xxhash ==3.5.0
  • yarl ==1.9.8
  • zipp ==3.20.1
bin/api/Dockerfile docker
  • python 3.9-slim build