BAGEL: Benchmarking Animal Knowledge Expertise in Language Models
The BAGEL benchmark evaluates language models' performance on animal knowledge using closed-book questions covering taxonomy, morphology, habitat, behavior, and related topics.
Key Findings
Methodology
The BAGEL benchmark is constructed from diverse scientific and reference sources such as bioRxiv, GloBI, Xeno-canto, and Wikipedia. It combines curated examples with automatically generated closed-book question-answer pairs to evaluate language models on animal knowledge. BAGEL covers multiple aspects of animal knowledge, including taxonomy, morphology, habitat, behavior, vocalization, geographic distribution, and species interactions. By focusing on closed-book evaluation, BAGEL measures models' animal-related knowledge without external retrieval during inference, allowing for precise analysis of model strengths and weaknesses.
Key Results
- Result 1: Performance is strongest on Wikipedia and bioRxiv, with GPT-5.4 achieving an overall accuracy of 76.01%, but weaker on Xeno-canto, indicating source sensitivity in animal knowledge.
- Result 2: Among open-weight models, Gemma 3 27B IT achieves the highest score (0.6789) under the benchmark's protocol, highlighting the gap with proprietary models.
- Result 3: Mid-sized open models perform well on text-heavy domains but may fail on Xeno-canto, indicating areas for improvement.
Significance
The BAGEL benchmark provides a new testbed for studying domain-specific knowledge generalization in language models, particularly in biodiversity-related applications. By enabling fine-grained analysis across source domains, taxonomic groups, and knowledge categories, BAGEL reveals systematic failure modes and strengths of models, offering clear directions for future model improvements.
Technical Contribution
BAGEL's technical contribution lies in its unique closed-book evaluation protocol, which tests models' animal knowledge without relying on external information. This approach allows for a more precise assessment of models' intrinsic capabilities in domain-specific knowledge and reveals potential weaknesses in handling complex biodiversity knowledge.
Novelty
BAGEL is the first benchmark focused on closed-book evaluation of animal knowledge, filling a gap in current language model evaluations for domain-specific knowledge. Unlike other benchmarks, BAGEL integrates animal knowledge from multiple sources, providing a comprehensive evaluation perspective.
Limitations
- Limitation 1: BAGEL's evaluation focuses on text data, not covering multimodal data processing capabilities, which may limit comprehensive evaluation for certain biodiversity applications.
- Limitation 2: The diversity of data sources may lead to incomplete knowledge in some areas, affecting comprehensive evaluation of models in specific domains.
- Limitation 3: Models perform weaker on Xeno-canto, indicating insufficient capability in handling textual descriptions of animal vocalizations.
Future Work
Future research directions include expanding BAGEL to cover multimodal data, further enhancing the evaluation of models' capabilities in handling complex biodiversity knowledge. Another direction is improving model performance on weaker source domains such as Xeno-canto, strengthening models' handling of textual descriptions of animal vocalizations.
AI Executive Summary
The BAGEL benchmark provides a new platform for evaluating language models' performance on animal knowledge. While large language models excel in broad-domain knowledge and reasoning benchmarks, their ability to handle specialized animal-related knowledge remains unclear. BAGEL is constructed from diverse scientific and reference sources, combining curated examples with automatically generated closed-book question-answer pairs. It covers multiple aspects of animal knowledge, including taxonomy, morphology, habitat, behavior, vocalization, geographic distribution, and species interactions. By focusing on closed-book evaluation, BAGEL measures models' animal-related knowledge without external retrieval during inference, allowing for precise analysis of model strengths and weaknesses.
The experimental results of BAGEL show that while performance is strongest on Wikipedia and bioRxiv, it is weaker on Xeno-canto, indicating source sensitivity in animal knowledge. Among open-weight models, Gemma 3 27B IT achieves the highest open score, highlighting the gap with proprietary models. Mid-sized open models perform well on text-heavy domains but may fail on Xeno-canto, indicating areas for improvement.
The BAGEL benchmark provides a new testbed for studying domain-specific knowledge generalization in language models, particularly in biodiversity-related applications. By enabling fine-grained analysis across source domains, taxonomic groups, and knowledge categories, BAGEL reveals systematic failure modes and strengths of models, offering clear directions for future model improvements.
BAGEL's technical contribution lies in its unique closed-book evaluation protocol, which tests models' animal knowledge without relying on external information. This approach allows for a more precise assessment of models' intrinsic capabilities in domain-specific knowledge and reveals potential weaknesses in handling complex biodiversity knowledge.
Future research directions include expanding BAGEL to cover multimodal data, further enhancing the evaluation of models' capabilities in handling complex biodiversity knowledge. Another direction is improving model performance on weaker source domains such as Xeno-canto, strengthening models' handling of textual descriptions of animal vocalizations.
Deep Analysis
Background
In recent years, large language models (LLMs) have excelled in a wide range of knowledge and reasoning tasks, particularly in benchmarks like Massive Multitask Language Understanding (MMLU) and ScienceQA. However, their performance in handling long-tail knowledge about the natural world remains unclear, especially when answering questions that require species-level facts, ecological relations, or natural-history reasoning. As language models are increasingly explored for biodiversity and animal-related applications, evaluating their capabilities in these domains becomes crucial. The BAGEL benchmark aims to fill this gap by systematically testing language models on animal knowledge through a closed-book evaluation protocol.
Core Problem
While current language models excel in broad-domain knowledge and reasoning tasks, their ability to handle specialized animal-related knowledge remains unclear. This is particularly true for questions requiring species-level facts, ecological relations, or natural-history reasoning. Solving this problem is crucial for improving the reliability of models in biodiversity and animal-related applications.
Innovation
The core innovation of the BAGEL benchmark lies in its unique closed-book evaluation protocol, which tests models' animal knowledge without relying on external information. Unlike other benchmarks, BAGEL integrates animal knowledge from multiple sources, including Wikipedia, GloBI, bioRxiv, and Xeno-canto, providing a comprehensive evaluation perspective. Additionally, BAGEL supports fine-grained analysis across source domains, taxonomic groups, and knowledge categories, revealing systematic failure modes and strengths of models.
Methodology
The BAGEL benchmark is constructed through the following steps:
- Data Sources: Data is sourced from diverse scientific and reference sources such as Wikipedia, GloBI, bioRxiv, and Xeno-canto.
- Data Processing: The acquired data undergoes preprocessing, including text cleaning, deduplication, and formatting.
- Question Generation: Closed-book question-answer pairs are generated using the GPT-4o-mini API, covering multiple aspects of animal knowledge.
- Evaluation Protocol: Focuses on closed-book evaluation without relying on external retrieval during inference.
- Result Analysis: Detailed analysis of models' performance across different source domains, taxonomic groups, and knowledge categories to reveal strengths and weaknesses.
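The data-processing step above (cleaning and deduplication) can be sketched as follows. This assumes simple whitespace normalization and exact-match deduplication; the paper does not specify its actual procedure, so treat this as a minimal illustration.

```python
def clean_text(text: str) -> str:
    """Collapse runs of whitespace and strip leading/trailing space."""
    return " ".join(text.split())

def deduplicate(passages: list[str]) -> list[str]:
    """Drop exact duplicates after normalization, keeping the
    first occurrence of each passage in its original form."""
    seen: set[str] = set()
    unique: list[str] = []
    for passage in passages:
        key = clean_text(passage).lower()
        if key not in seen:
            seen.add(key)
            unique.append(passage)
    return unique
```

Real pipelines often add fuzzy or embedding-based near-duplicate detection on top of exact matching; the exact-match version shown here is only the simplest baseline.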
Experiments
The experimental design of the BAGEL benchmark includes the following aspects:
- Dataset: Data sourced from Wikipedia, GloBI, bioRxiv, and Xeno-canto, totaling 11,852 four-option, single-answer multiple-choice questions.
- Baseline Models: Evaluated using multiple open-weight models and closed-source models, including GPT-5.4 and Claude Opus 4.6.
- Evaluation Metrics: Accuracy is used as the primary evaluation metric, evaluated separately on each source domain and overall.
- Hyperparameter Settings: Evaluations are conducted using a fixed random seed and greedy generation strategy to ensure reproducibility.
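Per-domain and overall accuracy, as described in the evaluation design above, amounts to a small aggregation over scored questions. The record format here (a `domain` string and a boolean `correct` flag) is an illustrative assumption, not the paper's actual data schema.

```python
from collections import defaultdict

def accuracy_by_domain(records: list[dict]) -> dict[str, float]:
    """Compute accuracy per source domain plus a pooled 'overall'
    entry across all questions. Assumes each record carries a
    'domain' string and a boolean 'correct' flag; a real dataset
    should avoid naming a source domain 'overall'."""
    totals: dict[str, int] = defaultdict(int)
    hits: dict[str, int] = defaultdict(int)
    for rec in records:
        for key in (rec["domain"], "overall"):
            totals[key] += 1
            hits[key] += int(bool(rec["correct"]))
    return {key: hits[key] / totals[key] for key in totals}
```

Evaluating each domain separately, as BAGEL does, is what surfaces the source sensitivity reported in the results (strong on Wikipedia and bioRxiv, weak on Xeno-canto).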
Results
The experimental results show that while performance is strongest on Wikipedia and bioRxiv, it is weaker on Xeno-canto, indicating source sensitivity in animal knowledge. Among open-weight models, Gemma 3 27B IT achieves the highest open score, highlighting the gap with proprietary models. Mid-sized open models perform well on text-heavy domains but may fail on Xeno-canto, indicating areas for improvement.
Applications
Application scenarios of the BAGEL benchmark include:
- Biodiversity Research: Provides researchers with a platform to evaluate language models' performance on animal knowledge, aiding in better understanding and conservation of biodiversity.
- Educational Applications: Offers educators a platform to assess students' animal knowledge, promoting the development of biology education.
- Scientific Research: Provides scientists with a platform to evaluate language models' capabilities in handling complex biodiversity knowledge, advancing research in related fields.
Limitations & Outlook
The limitations of the BAGEL benchmark include:
- The diversity of data sources may lead to incomplete knowledge in some areas, affecting comprehensive evaluation of models in specific domains.
- Models perform weaker on Xeno-canto, indicating insufficient capability in handling textual descriptions of animal vocalizations.
- The evaluation focuses on text data, not covering multimodal data processing capabilities, which may limit comprehensive evaluation for certain biodiversity applications.
Plain Language (Accessible to non-experts)
Imagine you're a zookeeper responsible for managing a large zoo. You need to know the habits, diet, habitat, and behavior of each animal to take better care of them. The BAGEL benchmark is like a closed-book exam on all of that knowledge: it asks a series of carefully designed questions about animals, with no chance to consult external resources. In this way, BAGEL assesses strengths and weaknesses in animal knowledge and identifies areas for improvement.
ELI14 (Explained like you're 14)
Hey there, friends! Today we're going to talk about something super cool called the BAGEL benchmark. Imagine you're playing a trivia game about animals, and this game asks you all sorts of questions about animals, like where they live, what they eat, how they sound, and so on. BAGEL is like the ultimate version of this game, testing how much you know about animals. It pulls questions from all sorts of scientific sources and then asks you to answer them without looking anything up. It's like testing your animal knowledge level to see if you're an animal trivia master!
Glossary
BAGEL Benchmark
BAGEL is a benchmark for evaluating language models' performance on animal knowledge. It tests models' knowledge of animal taxonomy, morphology, habitat, behavior, and more through closed-book questions.
BAGEL is used to evaluate language models' domain-specific knowledge generalization capabilities.
Closed-book Evaluation
A testing method that requires the test-taker to answer questions without consulting external resources.
BAGEL uses a closed-book evaluation protocol to test language models' intrinsic capabilities.
Language Model
A model that predicts the probability distribution of word sequences by learning from large amounts of text data.
BAGEL evaluates language models' performance on animal knowledge.
Biodiversity
The diversity of life forms on Earth, including species diversity, genetic diversity, and ecosystem diversity.
BAGEL provides an evaluation platform for biodiversity-related applications.
Taxonomy
The science of classifying and naming organisms.
BAGEL tests language models' knowledge of animal taxonomy.
Morphology
The study of the form and structure of organisms.
BAGEL tests language models' knowledge of animal morphology.
Habitat
The natural environment where an organism lives and reproduces.
BAGEL tests language models' knowledge of animal habitats.
Behavioral Science
The study of animal behavior and its mechanisms.
BAGEL tests language models' knowledge of animal behavior.
Vocalization
The act of producing sound by animals using vocal cords or other organs.
BAGEL tests language models' knowledge of animal vocalizations.
Geographic Distribution
The range of areas where a species is found on Earth.
BAGEL tests language models' knowledge of animal geographic distribution.
Open Questions (Unanswered questions from this research)
- Open Question 1: How can we improve language models' ability to handle textual descriptions of animal vocalizations? Current models perform weaker on Xeno-canto, indicating insufficient capability in this area. Further research is needed to enhance models' performance in this domain.
- Open Question 2: How can we expand BAGEL to cover multimodal data? Currently, BAGEL's evaluation focuses on text data, not covering multimodal data processing capabilities, which may limit comprehensive evaluation for certain biodiversity applications.
- Open Question 3: How can we enhance models' generalization capabilities in handling complex biodiversity knowledge? BAGEL's experimental results show varying performance across source domains, necessitating further research to improve models' generalization capabilities.
- Open Question 4: How can we improve models' performance in specific domains? BAGEL's experimental results indicate suboptimal performance in certain domains, requiring further research to enhance models' performance in these areas.
- Open Question 5: How can we improve models' performance in handling long-tail knowledge? Current language models' performance in handling long-tail knowledge about the natural world remains unclear, necessitating further research to enhance models' capabilities in this area.
Applications
Immediate Applications
Biodiversity Research
BAGEL provides researchers with a platform to evaluate language models' performance on animal knowledge, aiding in better understanding and conservation of biodiversity.
Educational Applications
BAGEL offers educators a platform to assess students' animal knowledge, promoting the development of biology education.
Scientific Research
BAGEL provides scientists with a platform to evaluate language models' capabilities in handling complex biodiversity knowledge, advancing research in related fields.
Long-term Vision
Multimodal Data Processing
Future research can expand BAGEL to cover multimodal data, enhancing the evaluation of models' capabilities in handling complex biodiversity knowledge.
Improving Model Performance
Research on improving models' performance in specific domains can enhance their capability in textual descriptions of animal vocalizations.
Abstract
Large language models have shown strong performance on broad-domain knowledge and reasoning benchmarks, but it remains unclear how well language models handle specialized animal-related knowledge under a unified closed-book evaluation protocol. We introduce BAGEL, a benchmark for evaluating animal knowledge expertise in language models. BAGEL is constructed from diverse scientific and reference sources, including bioRxiv, Global Biotic Interactions, Xeno-canto, and Wikipedia, using a combination of curated examples and automatically generated closed-book question-answer pairs. The benchmark covers multiple aspects of animal knowledge, including taxonomy, morphology, habitat, behavior, vocalization, geographic distribution, and species interactions. By focusing on closed-book evaluation, BAGEL measures animal-related knowledge of models without external retrieval at inference time. BAGEL further supports fine-grained analysis across source domains, taxonomic groups, and knowledge categories, enabling a more precise characterization of model strengths and systematic failure modes. Our benchmark provides a new testbed for studying domain-specific knowledge generalization in language models and for improving their reliability in biodiversity-related applications.
References (20)
SmolLM2: When Smol Goes Big - Data-Centric Training of a Small Language Model
Loubna Ben Allal, Anton Lozhkov, Elie Bakouch et al.
Gemma 3 Technical Report
Gemma Team: Aishwarya Kamath, Johan Ferret, Shreya Pathak et al.
Large language models possess some ecological knowledge, but how much?
Filip Dorm, Joseph W. Millard, Drew Purves et al.
Measuring Massive Multitask Language Understanding
Dan Hendrycks, Collin Burns, Steven Basart et al.
NatureLM-audio: an Audio-Language Foundation Model for Bioacoustics
David Robinson, Marius Miron, Masato Hagiwara et al.
Phi-4 Technical Report
Marah Abdin, J. Aneja, Harkirat Singh Behl et al.
BEANS: The Benchmark of Animal Sounds
Masato Hagiwara, Benjamin Hoffman, Jen-Yu Liu et al.
Qwen3 Technical Report
An Yang, Anfeng Li, Baosong Yang et al.
Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering
Pan Lu, Swaroop Mishra, Tony Xia et al.
The Llama 3 Herd of Models
Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey et al.
Mistral 7B
Albert Qiaochu Jiang, Alexandre Sablayrolles, Arthur Mensch et al.
Plausibly Problematic Questions in Multiple-Choice Benchmarks for Commonsense Reasoning
Shramay Palta, Nishant Balepur, Peter Rankel et al.
Large Language Models Sensitivity to The Order of Options in Multiple-Choice Questions
Pouya Pezeshkpour, Estevam Hruschka
Which of These Best Describes Multiple Choice Evaluation with LLMs? A) Forced B) Flawed C) Fixable D) All of the Above
Nishant Balepur, Rachel Rudinger, J. Boyd-Graber
OceanGPT: A Large Language Model for Ocean Science Tasks
Zhen Bi, Ningyu Zhang, Yida Xue et al.
Training language models to follow instructions with human feedback
Long Ouyang, Jeff Wu, Xu Jiang et al.
Overview of BioASQ 2023: The eleventh BioASQ challenge on Large-Scale Biomedical Semantic Indexing and Question Answering
A. Nentidis, Georgios Katsimpras, Anastasia Krithara et al.
SATA-BENCH: Select All That Apply Benchmark for Multiple Choice Questions
Weijie Xu, Shixian Cui, Xi Fang et al.
Answer Matching Outperforms Multiple Choice for Language Model Evaluation
Nikhil Chandak, Shashwat Goel, Ameya Prabhu et al.
Environmental large language model Evaluation (ELLE) dataset: A Benchmark for Evaluating Generative AI applications in Eco-environment Domain
Jing Guo, Nan Li, Ming Xu