Science Score: 28.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
  • .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: arxiv.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (3.6%) to scientific vocabulary
Last synced: 7 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: yizhilll
  • Language: JavaScript
  • Default Branch: main
  • Size: 2.45 MB
Statistics
  • Stars: 2
  • Watchers: 2
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Created about 2 years ago · Last pushed about 2 years ago
Metadata Files
Readme Citation

README.md

CIF-Bench

This is the official repository for the paper CIF-Bench: A Chinese Instruction-Following Benchmark for Evaluating the Generalizability of Large Language Models. The project page provides a more user-friendly UI for reading.

Results

Private Split

In the private split, 5 instructions are used for each task, yielding $5 \times 150 \times 50 = 37500$ data instances per model in evaluation.

| Model Name | Overall | Chinese Culture | Classification | Code | Commonsense | Creative NLG | Evaluation | Grammar | Linguistic | Motion Detection | NER | NLI | QA | Reasoning | Role Playing | Sentiment | Structured Data | Style Transfer | Summarization | Toxic | Translation |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Baichuan2-13B-Chat | .529 | .520 | .674 | .333 | .641 | .497 | .686 | .542 | .528 | .578 | .563 | .632 | .569 | .515 | .752 | .624 | .459 | .462 | .332 | .441 | .273 |
| Qwen-72B-Chat | .519 | .486 | .630 | .296 | .634 | .508 | .634 | .458 | .520 | .494 | .550 | .626 | .565 | .528 | .762 | .613 | .496 | .459 | .282 | .608 | .271 |
| Yi-34B-Chat | .512 | .483 | .606 | .347 | .623 | .497 | .598 | .480 | .490 | .575 | .525 | .619 | .554 | .494 | .757 | .580 | .472 | .439 | .346 | .514 | .259 |
| Qwen-14B-Chat | .500 | .481 | .582 | .307 | .614 | .494 | .645 | .428 | .475 | .496 | .513 | .616 | .548 | .507 | .764 | .583 | .469 | .453 | .283 | .575 | .262 |
| Deepseek-Llm-67B-Chat | .471 | .467 | .571 | .259 | .577 | .486 | .549 | .442 | .476 | .475 | .509 | .566 | .496 | .439 | .711 | .546 | .409 | .436 | .262 | .570 | .235 |
| Baichuan-13B-Chat | .450 | .408 | .491 | .286 | .552 | .439 | .670 | .417 | .422 | .482 | .486 | .565 | .505 | .377 | .704 | .552 | .387 | .402 | .350 | .431 | .304 |
| Chatglm3-6B | .436 | .381 | .439 | .330 | .541 | .452 | .577 | .310 | .358 | .436 | .453 | .544 | .503 | .414 | .762 | .560 | .446 | .402 | .321 | .391 | .270 |
| Yi-6B-Chat | .417 | .402 | .454 | .313 | .523 | .425 | .506 | .383 | .383 | .487 | .396 | .523 | .457 | .369 | .754 | .482 | .401 | .380 | .310 | .455 | .227 |
| Baichuan2-7B-Chat | .412 | .437 | .647 | .160 | .520 | .402 | .580 | .511 | .444 | .455 | .407 | .489 | .395 | .406 | .670 | .517 | .342 | .298 | .101 | .463 | .138 |
| Chatglm2-6B | .352 | .278 | .469 | .346 | .403 | .424 | .535 | .274 | .397 | .406 | .240 | .397 | .352 | .326 | .714 | .438 | .298 | .313 | .320 | .461 | .190 |
| Chatglm-6B-Sft | .349 | .265 | .454 | .365 | .385 | .462 | .554 | .296 | .379 | .427 | .232 | .380 | .321 | .292 | .718 | .415 | .296 | .333 | .351 | .441 | .190 |
| Chinese-Llama2-Linly-13B | .344 | .250 | .462 | .311 | .399 | .429 | .557 | .273 | .358 | .385 | .268 | .390 | .330 | .313 | .653 | .433 | .279 | .332 | .292 | .457 | .181 |
| Gpt-3.5-Turbo-Sft | .343 | .269 | .427 | .298 | .389 | .395 | .575 | .325 | .365 | .389 | .226 | .382 | .394 | .345 | .710 | .433 | .324 | .266 | .290 | .397 | .225 |
| Chinese-Alpaca-2-13B | .341 | .242 | .421 | .356 | .382 | .442 | .602 | .256 | .363 | .430 | .210 | .376 | .334 | .317 | .714 | .459 | .299 | .316 | .308 | .452 | .200 |
| Chinese-Alpaca-13B | .334 | .250 | .399 | .348 | .364 | .435 | .616 | .275 | .349 | .421 | .223 | .370 | .309 | .319 | .724 | .426 | .285 | .307 | .298 | .445 | .181 |
| Chinese-Alpaca-7B | .334 | .216 | .412 | .378 | .381 | .425 | .576 | .265 | .359 | .393 | .243 | .383 | .326 | .295 | .710 | .409 | .301 | .327 | .325 | .405 | .186 |
| Chinese-Llama2-Linly-7B | .333 | .218 | .451 | .330 | .396 | .427 | .583 | .248 | .350 | .410 | .231 | .367 | .345 | .276 | .698 | .433 | .259 | .315 | .310 | .469 | .168 |
| Tigerbot-13B-Chat | .331 | .205 | .397 | .309 | .385 | .420 | .614 | .310 | .379 | .341 | .276 | .363 | .329 | .301 | .694 | .419 | .280 | .310 | .283 | .393 | .186 |
| Telechat-7B | .329 | .267 | .338 | .321 | .420 | .404 | .420 | .272 | .265 | .327 | .320 | .388 | .355 | .244 | .672 | .344 | .334 | .335 | .299 | .364 | .184 |
| Ziya-Llama-13B | .329 | .196 | .402 | .324 | .341 | .428 | .616 | .312 | .349 | .400 | .228 | .351 | .279 | .313 | .721 | .468 | .311 | .291 | .278 | .431 | .175 |
| Chinese-Alpaca-33B | .326 | .234 | .370 | .372 | .364 | .429 | .614 | .246 | .318 | .377 | .221 | .368 | .300 | .314 | .713 | .428 | .288 | .303 | .295 | .401 | .199 |
| Tigerbot-7B-Chat | .325 | .218 | .395 | .306 | .370 | .413 | .631 | .294 | .370 | .368 | .215 | .355 | .313 | .292 | .713 | .415 | .283 | .315 | .290 | .389 | .171 |
| Chinese-Alpaca-2-7B | .323 | .215 | .374 | .335 | .366 | .415 | .546 | .257 | .326 | .395 | .215 | .375 | .318 | .289 | .698 | .417 | .285 | .303 | .312 | .439 | .193 |
| Aquilachat-7B | .309 | .162 | .234 | .291 | .320 | .437 | .344 | .135 | .266 | .309 | .287 | .337 | .342 | .236 | .609 | .255 | .249 | .400 | .527 | .430 | .306 |
| Moss-Moon-003-Sft | .302 | .214 | .405 | .274 | .347 | .380 | .448 | .305 | .341 | .378 | .232 | .317 | .321 | .267 | .694 | .375 | .251 | .259 | .288 | .424 | .152 |
| Qwen-7B-Chat | .301 | .211 | .410 | .289 | .349 | .391 | .531 | .219 | .387 | .404 | .208 | .325 | .297 | .278 | .681 | .419 | .266 | .251 | .248 | .371 | .157 |
| Belle-13B-Sft | .264 | .198 | .307 | .285 | .316 | .349 | .409 | .237 | .305 | .222 | .177 | .317 | .284 | .242 | .631 | .299 | .244 | .222 | .234 | .296 | .133 |
| Cpm-Bee-10B | .244 | .234 | .377 | .024 | .278 | .311 | .255 | .302 | .278 | .327 | .148 | .286 | .224 | .147 | .603 | .277 | .117 | .263 | .220 | .352 | .125 |

Public Split

In the public split, only one instruction is used for each task, yielding $1 \times 150 \times 50 = 7500$ data instances per model in evaluation.
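As a quick sanity check, the split sizes above can be reproduced with a short computation. This is an illustrative sketch, assuming the three factors in the formulas are instructions per task, number of tasks (150), and instances per instruction (50); the helper `split_size` is not part of the repository.

```python
def split_size(instructions_per_task: int,
               num_tasks: int = 150,
               instances_per_instruction: int = 50) -> int:
    """Total data instances evaluated per model for one split."""
    return instructions_per_task * num_tasks * instances_per_instruction

print(split_size(5))  # private split: 5 x 150 x 50 = 37500
print(split_size(1))  # public split:  1 x 150 x 50 = 7500
```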

| Model Name | Overall | Chinese Culture | Classification | Code | Commonsense | Creative NLG | Evaluation | Grammar | Linguistic | Motion Detection | NER | NLI | QA | Reasoning | Role Playing | Sentiment | Structured Data | Style Transfer | Summarization | Toxic | Translation |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Qwen-72B-Chat | .589 | .512 | .716 | .444 | .706 | .587 | .661 | .424 | .521 | .694 | .515 | .695 | .668 | .539 | .752 | .637 | .505 | .587 | .609 | .671 | .466 |
| Qwen-14B-Chat | .564 | .481 | .678 | .416 | .657 | .567 | .669 | .396 | .485 | .663 | .486 | .647 | .609 | .498 | .757 | .638 | .460 | .610 | .629 | .691 | .467 |
| Deepseek-LLM-67B-Chat | .526 | .477 | .617 | .364 | .609 | .559 | .573 | .374 | .458 | .631 | .493 | .588 | .624 | .444 | .694 | .592 | .384 | .576 | .594 | .666 | .439 |
| gpt-3.5-Public-SFT | .522 | .316 | .611 | .492 | .578 | .538 | .639 | .377 | .447 | .580 | .492 | .587 | .565 | .498 | .745 | .583 | .444 | .501 | .620 | .643 | .452 |
| Yi-34B-Chat | .516 | .452 | .607 | .437 | .624 | .516 | .545 | .254 | .382 | .671 | .398 | .631 | .592 | .460 | .761 | .566 | .440 | .551 | .610 | .608 | .408 |
| Baichuan2-13B-Chat | .512 | .446 | .623 | .403 | .600 | .505 | .582 | .352 | .423 | .633 | .435 | .600 | .591 | .474 | .751 | .597 | .434 | .525 | .572 | .494 | .372 |
| Tigerbot-13B-Chat | .494 | .350 | .558 | .447 | .599 | .528 | .707 | .352 | .447 | .551 | .498 | .571 | .569 | .413 | .732 | .560 | .365 | .502 | .607 | .601 | .306 |
| Chinese-Alpaca-2-13B | .492 | .260 | .572 | .434 | .533 | .562 | .574 | .318 | .417 | .624 | .467 | .566 | .545 | .420 | .712 | .595 | .382 | .488 | .641 | .740 | .347 |
| Chinese-Alpaca-33B | .484 | .274 | .546 | .470 | .527 | .540 | .703 | .332 | .382 | .582 | .464 | .550 | .506 | .423 | .732 | .548 | .342 | .494 | .629 | .648 | .334 |
| Ziya-Llama-13B | .479 | .287 | .550 | .422 | .523 | .551 | .650 | .294 | .384 | .610 | .437 | .546 | .499 | .404 | .749 | .582 | .367 | .499 | .629 | .722 | .313 |
| Chinese-Llama2-Linly-13B | .479 | .286 | .623 | .439 | .549 | .535 | .626 | .286 | .403 | .587 | .468 | .563 | .524 | .411 | .676 | .561 | .359 | .482 | .602 | .696 | .313 |
| Tigerbot-7B-Chat | .478 | .354 | .528 | .440 | .570 | .540 | .708 | .314 | .430 | .528 | .413 | .532 | .554 | .393 | .731 | .583 | .351 | .519 | .630 | .614 | .291 |
| ChatGLM3-6B | .472 | .321 | .488 | .436 | .527 | .503 | .588 | .290 | .328 | .574 | .415 | .557 | .526 | .397 | .749 | .612 | .431 | .529 | .620 | .589 | .392 |
| Chinese-Alpaca-13B | .471 | .264 | .553 | .443 | .495 | .525 | .587 | .334 | .394 | .653 | .457 | .524 | .513 | .402 | .726 | .526 | .323 | .486 | .628 | .702 | .336 |
| ChatGLM2-6B | .464 | .334 | .532 | .436 | .522 | .527 | .651 | .314 | .395 | .536 | .402 | .520 | .533 | .407 | .725 | .506 | .363 | .480 | .627 | .661 | .303 |
| Chinese-Alpaca-7B | .452 | .237 | .536 | .438 | .484 | .502 | .672 | .318 | .389 | .652 | .394 | .504 | .501 | .351 | .699 | .543 | .365 | .478 | .623 | .711 | .328 |
| Chinese-Alpaca-2-7B | .448 | .251 | .472 | .435 | .480 | .532 | .577 | .268 | .348 | .596 | .431 | .509 | .493 | .344 | .703 | .510 | .334 | .483 | .637 | .596 | .343 |
| Chinese-Llama2-Linly-7B | .443 | .264 | .558 | .419 | .497 | .522 | .664 | .236 | .381 | .593 | .381 | .496 | .546 | .350 | .713 | .559 | .323 | .495 | .603 | .584 | .293 |
| Qwen-7B-Chat | .442 | .313 | .549 | .404 | .520 | .515 | .646 | .244 | .411 | .570 | .368 | .489 | .514 | .384 | .713 | .563 | .328 | .463 | .576 | .639 | .281 |
| ChatGLM-6B | .440 | .311 | .499 | .446 | .484 | .548 | .558 | .278 | .382 | .484 | .386 | .480 | .483 | .353 | .738 | .460 | .346 | .480 | .633 | .543 | .322 |
| Baichuan-13B-Chat | .426 | .355 | .416 | .361 | .516 | .416 | .564 | .324 | .374 | .380 | .394 | .531 | .584 | .339 | .668 | .478 | .402 | .459 | .559 | .497 | .392 |
| Yi-6B-Chat | .420 | .320 | .439 | .395 | .489 | .449 | .493 | .230 | .293 | .587 | .341 | .496 | .516 | .344 | .742 | .488 | .348 | .498 | .627 | .510 | .285 |
| CPM-Bee-10B | .415 | .382 | .455 | .284 | .431 | .508 | .300 | .317 | .367 | .494 | .397 | .451 | .472 | .304 | .647 | .329 | .284 | .538 | .534 | .486 | .305 |
| Moss-Moon-003-SFT | .399 | .233 | .465 | .389 | .427 | .482 | .509 | .274 | .369 | .526 | .385 | .403 | .457 | .325 | .712 | .450 | .304 | .435 | .594 | .542 | .308 |
| Belle-SFT-Public | .397 | .196 | .503 | .376 | .426 | .472 | .543 | .269 | .371 | .512 | .356 | .450 | .430 | .338 | .645 | .426 | .300 | .398 | .558 | .683 | .224 |
| Telechat-7B | .350 | .172 | .299 | .438 | .386 | .456 | .400 | .138 | .202 | .412 | .322 | .375 | .414 | .261 | .660 | .341 | .320 | .462 | .639 | .494 | .304 |
| Aquilachat-7B | .350 | .203 | .270 | .357 | .404 | .449 | .394 | .090 | .260 | .348 | .322 | .385 | .426 | .274 | .595 | .308 | .267 | .434 | .607 | .409 | .355 |
| Baichuan2-7B-Chat | .339 | .345 | .595 | .154 | .455 | .327 | .523 | .362 | .354 | .466 | .233 | .414 | .349 | .339 | .673 | .429 | .300 | .246 | .097 | .357 | .130 |

TODOs

  • [ ] Inference code for demo model.
  • [ ] Evaluation code and prompts.
  • [ ] Public split data.

Others

For any discussion, please contact Yizhi Li and Ge Zhang.

@article{li2024cifbench,
      title={CIF-Bench: A Chinese Instruction-Following Benchmark for Evaluating the Generalizability of Large Language Models},
      author={Yizhi LI and Ge Zhang and Xingwei Qu and Jiali Li and Zhaoqun Li and Zekun Wang and Hao Li and Ruibin Yuan and Yinghao Ma and Kai Zhang and Wangchunshu Zhou and Yiming Liang and Lei Zhang and Lei Ma and Jiajun Zhang and Zuowen Li and Stephen W. Huang and Chenghua Lin and Wenhu Chen and Jie Fu},
      year={2024},
      eprint={2402.13109},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Owner

  • Name: Yizhi Li
  • Login: yizhilll
  • Kind: user
  • Company: DCS, University of Sheffield

NLP & MIR

Citation (CITATION.bib)

@article{li2024cifbench,
      title={CIF-Bench: A Chinese Instruction-Following Benchmark for Evaluating the Generalizability of Large Language Models}, 
      author={Yizhi LI and Ge Zhang and Xingwei Qu and Jiali Li and Zhaoqun Li and Zekun Wang and Hao Li and Ruibin Yuan and Yinghao Ma and Kai Zhang and Wangchunshu Zhou and Yiming Liang and Lei Zhang and Lei Ma and Jiajun Zhang and Zuowen Li and Stephen W. Huang and Chenghua Lin and Wenhu Chen and Jie Fu},
      year={2024},
      eprint={2402.13109},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

GitHub Events

Total
  • Issues event: 2
  • Watch event: 3
  • Issue comment event: 1
Last Year
  • Issues event: 2
  • Watch event: 3
  • Issue comment event: 1