BAGEL: Benchmarking Animal Knowledge Expertise in Language Models
The BAGEL benchmark evaluates language models' performance on animal knowledge using closed-book questions covering taxonomy, morphology, habitat, behavior, and related topics.
Key Findings
Methodology
The BAGEL benchmark is constructed from diverse scientific and reference sources such as bioRxiv, GloBI, Xeno-canto, and Wikipedia. It combines curated examples with automatically generated closed-book question-answer pairs to evaluate language models on animal knowledge. BAGEL covers multiple aspects of animal knowledge, including taxonomy, morphology, habitat, behavior, vocalization, geographic distribution, and species interactions. By focusing on closed-book evaluation, BAGEL measures models' animal-related knowledge without external retrieval during inference, allowing for precise analysis of model strengths and weaknesses.
Key Results
- Result 1: Performance is strongest on Wikipedia and bioRxiv, with GPT-5.4 achieving an overall accuracy of 76.01%, but weaker on Xeno-canto, indicating source sensitivity in animal knowledge.
- Result 2: Among open-weight models, Gemma 3 27B IT achieves the highest score (0.6789) under the benchmark's protocol, highlighting the gap with proprietary models.
- Result 3: Mid-sized open models perform well on text-heavy domains but may fail on Xeno-canto, indicating areas for improvement.
Significance
The BAGEL benchmark provides a new testbed for studying domain-specific knowledge generalization in language models, particularly in biodiversity-related applications. By enabling fine-grained analysis across source domains, taxonomic groups, and knowledge categories, BAGEL reveals systematic failure modes and strengths of models, offering clear directions for future model improvements.
Technical Contribution
BAGEL's technical contribution lies in its unique closed-book evaluation protocol, which tests models' animal knowledge without relying on external information. This approach allows for a more precise assessment of models' intrinsic capabilities in domain-specific knowledge and reveals potential weaknesses in handling complex biodiversity knowledge.
Novelty
BAGEL is the first benchmark focused on closed-book evaluation of animal knowledge, filling a gap in current language model evaluations for domain-specific knowledge. Unlike other benchmarks, BAGEL integrates animal knowledge from multiple sources, providing a comprehensive evaluation perspective.
Limitations
- Limitation 1: BAGEL's evaluation focuses on text data, not covering multimodal data processing capabilities, which may limit comprehensive evaluation for certain biodiversity applications.
- Limitation 2: The diversity of data sources may lead to incomplete knowledge in some areas, affecting comprehensive evaluation of models in specific domains.
- Limitation 3: Models perform weaker on Xeno-canto, indicating insufficient capability in handling textual descriptions of animal vocalizations.
Future Work
Future research directions include expanding BAGEL to cover multimodal data, further enhancing the evaluation of models' capabilities in handling complex biodiversity knowledge. Another direction is improving model performance on weaker source domains such as Xeno-canto, strengthening models' handling of textual descriptions of animal vocalizations.
AI Executive Summary
The BAGEL benchmark provides a new platform for evaluating language models' performance on animal knowledge. While large language models excel in broad-domain knowledge and reasoning benchmarks, their ability to handle specialized animal-related knowledge remains unclear. BAGEL is constructed from diverse scientific and reference sources, combining curated examples with automatically generated closed-book question-answer pairs. It covers multiple aspects of animal knowledge, including taxonomy, morphology, habitat, behavior, vocalization, geographic distribution, and species interactions. By focusing on closed-book evaluation, BAGEL measures models' animal-related knowledge without external retrieval during inference, allowing for precise analysis of model strengths and weaknesses.
The experimental results of BAGEL show that while performance is strongest on Wikipedia and bioRxiv, it is weaker on Xeno-canto, indicating source sensitivity in animal knowledge. Among open-weight models, Gemma 3 27B IT achieves the highest open score, highlighting the gap with proprietary models. Mid-sized open models perform well on text-heavy domains but may fail on Xeno-canto, indicating areas for improvement.
The BAGEL benchmark provides a new testbed for studying domain-specific knowledge generalization in language models, particularly in biodiversity-related applications. By enabling fine-grained analysis across source domains, taxonomic groups, and knowledge categories, BAGEL reveals systematic failure modes and strengths of models, offering clear directions for future model improvements.
BAGEL's technical contribution lies in its unique closed-book evaluation protocol, which tests models' animal knowledge without relying on external information. This approach allows for a more precise assessment of models' intrinsic capabilities in domain-specific knowledge and reveals potential weaknesses in handling complex biodiversity knowledge.
Future research directions include expanding BAGEL to cover multimodal data, further enhancing the evaluation of models' capabilities in handling complex biodiversity knowledge. Another direction is improving model performance on weaker source domains such as Xeno-canto, strengthening models' handling of textual descriptions of animal vocalizations.
Deep Analysis
Background
In recent years, large language models (LLMs) have excelled in a wide range of knowledge and reasoning tasks, particularly in benchmarks like Massive Multitask Language Understanding (MMLU) and ScienceQA. However, their performance in handling long-tail knowledge about the natural world remains unclear, especially when answering questions that require species-level facts, ecological relations, or natural-history reasoning. As language models are increasingly explored for biodiversity and animal-related applications, evaluating their capabilities in these domains becomes crucial. The BAGEL benchmark aims to fill this gap by systematically testing language models on animal knowledge through a closed-book evaluation protocol.
Core Problem
While current language models excel in broad-domain knowledge and reasoning tasks, their ability to handle specialized animal-related knowledge remains unclear. This is particularly true for questions requiring species-level facts, ecological relations, or natural-history reasoning. Solving this problem is crucial for improving the reliability of models in biodiversity and animal-related applications.
Innovation
The core innovation of the BAGEL benchmark lies in its unique closed-book evaluation protocol, which tests models' animal knowledge without relying on external information. Unlike other benchmarks, BAGEL integrates animal knowledge from multiple sources, including Wikipedia, GloBI, bioRxiv, and Xeno-canto, providing a comprehensive evaluation perspective. Additionally, BAGEL supports fine-grained analysis across source domains, taxonomic groups, and knowledge categories, revealing systematic failure modes and strengths of models.
Methodology
The BAGEL benchmark is constructed through the following steps:
- Data Sources: Data is sourced from diverse scientific and reference sources such as Wikipedia, GloBI, bioRxiv, and Xeno-canto.
- Data Processing: The acquired data undergoes preprocessing, including text cleaning, deduplication, and formatting.
- Question Generation: Closed-book question-answer pairs are generated using the GPT-4o-mini API, covering multiple aspects of animal knowledge.
- Evaluation Protocol: Focuses on closed-book evaluation without relying on external retrieval during inference.
- Result Analysis: Detailed analysis of models' performance across different source domains, taxonomic groups, and knowledge categories to reveal strengths and weaknesses.
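The data-processing step above (cleaning and deduplication) can be sketched as follows. This assumes simple whitespace normalization and exact-match deduplication; the paper does not specify its actual procedure, so treat this as a minimal illustration.

```python
def clean_text(text: str) -> str:
    """Collapse runs of whitespace and strip leading/trailing space."""
    return " ".join(text.split())

def deduplicate(passages: list[str]) -> list[str]:
    """Drop exact duplicates after normalization, keeping the
    first occurrence of each passage in its original form."""
    seen: set[str] = set()
    unique: list[str] = []
    for passage in passages:
        key = clean_text(passage).lower()
        if key not in seen:
            seen.add(key)
            unique.append(passage)
    return unique
```

Real pipelines often add fuzzy or embedding-based near-duplicate detection on top of exact matching; the exact-match version shown here is only the simplest baseline.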
Experiments
The experimental design of the BAGEL benchmark includes the following aspects:
- Dataset: Data sourced from Wikipedia, GloBI, bioRxiv, and Xeno-canto, totaling 11,852 four-option, single-answer multiple-choice questions.
- Baseline Models: Evaluated using multiple open-weight models and closed-source models, including GPT-5.4 and Claude Opus 4.6.
- Evaluation Metrics: Accuracy is used as the primary evaluation metric, evaluated separately on each source domain and overall.
- Hyperparameter Settings: Evaluations are conducted using a fixed random seed and greedy generation strategy to ensure reproducibility.
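Per-domain and overall accuracy, as described in the evaluation design above, amounts to a small aggregation over scored questions. The record format here (a `domain` string and a boolean `correct` flag) is an illustrative assumption, not the paper's actual data schema.

```python
from collections import defaultdict

def accuracy_by_domain(records: list[dict]) -> dict[str, float]:
    """Compute accuracy per source domain plus a pooled 'overall'
    entry across all questions. Assumes each record carries a
    'domain' string and a boolean 'correct' flag; a real dataset
    should avoid naming a source domain 'overall'."""
    totals: dict[str, int] = defaultdict(int)
    hits: dict[str, int] = defaultdict(int)
    for rec in records:
        for key in (rec["domain"], "overall"):
            totals[key] += 1
            hits[key] += int(bool(rec["correct"]))
    return {key: hits[key] / totals[key] for key in totals}
```

Evaluating each domain separately, as BAGEL does, is what surfaces the source sensitivity reported in the results (strong on Wikipedia and bioRxiv, weak on Xeno-canto).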
Results
The experimental results show that while performance is strongest on Wikipedia and bioRxiv, it is weaker on Xeno-canto, indicating source sensitivity in animal knowledge. Among open-weight models, Gemma 3 27B IT achieves the highest open score, highlighting the gap with proprietary models. Mid-sized open models perform well on text-heavy domains but may fail on Xeno-canto, indicating areas for improvement.
Applications
Application scenarios of the BAGEL benchmark include:
- Biodiversity Research: Provides researchers with a platform to evaluate language models' performance on animal knowledge, aiding in better understanding and conservation of biodiversity.
- Educational Applications: Offers educators a platform to assess students' animal knowledge, promoting the development of biology education.
- Scientific Research: Provides scientists with a platform to evaluate language models' capabilities in handling complex biodiversity knowledge, advancing research in related fields.
Limitations & Outlook
The limitations of the BAGEL benchmark include:
- The diversity of data sources may lead to incomplete knowledge in some areas, affecting comprehensive evaluation of models in specific domains.
- Models perform weaker on Xeno-canto, indicating insufficient capability in handling textual descriptions of animal vocalizations.
- The evaluation focuses on text data, not covering multimodal data processing capabilities, which may limit comprehensive evaluation for certain biodiversity applications.
Plain Language (Accessible to non-experts)
Imagine you're a zookeeper responsible for managing a large zoo. You need to know the habits, diet, habitat, and behavior of each animal to take better care of them. The BAGEL benchmark is like a closed-book exam on all of that knowledge: it asks a series of carefully designed questions about animals, with no chance to consult external resources. In this way, BAGEL assesses strengths and weaknesses in animal knowledge and identifies areas for improvement.
ELI14 (Explained like you're 14)
Hey there, friends! Today we're going to talk about something super cool called the BAGEL benchmark. Imagine you're playing a trivia game about animals, and this game asks you all sorts of questions about animals, like where they live, what they eat, how they sound, and so on. BAGEL is like the ultimate version of this game, testing how much you know about animals. It pulls questions from all sorts of scientific sources and then asks you to answer them without looking anything up. It's like testing your animal knowledge level to see if you're an animal trivia master!
Glossary
BAGEL Benchmark
BAGEL is a benchmark for evaluating language models' performance on animal knowledge. It tests models' knowledge of animal taxonomy, morphology, habitat, behavior, and more through closed-book questions.
BAGEL is used to evaluate language models' domain-specific knowledge generalization capabilities.
Closed-book Evaluation
A testing method that requires the test-taker to answer questions without consulting external resources.
BAGEL uses a closed-book evaluation protocol to test language models' intrinsic capabilities.
Language Model
A model that predicts the probability distribution of word sequences by learning from large amounts of text data.
BAGEL evaluates language models' performance on animal knowledge.
Biodiversity
The diversity of life forms on Earth, including species diversity, genetic diversity, and ecosystem diversity.
BAGEL provides an evaluation platform for biodiversity-related applications.
Taxonomy
The science of classifying and naming organisms.
BAGEL tests language models' knowledge of animal taxonomy.
Morphology
The study of the form and structure of organisms.
BAGEL tests language models' knowledge of animal morphology.
Habitat
The natural environment where an organism lives and reproduces.
BAGEL tests language models' knowledge of animal habitats.
Behavioral Science
The study of animal behavior and its mechanisms.
BAGEL tests language models' knowledge of animal behavior.
Vocalization
The act of producing sound by animals using vocal cords or other organs.
BAGEL tests language models' knowledge of animal vocalizations.
Geographic Distribution
The range of areas where a species is found on Earth.
BAGEL tests language models' knowledge of animal geographic distribution.
Open Questions (Unanswered questions from this research)
- Open Question 1: How can we improve language models' ability to handle textual descriptions of animal vocalizations? Current models perform weaker on Xeno-canto, indicating insufficient capability in this area. Further research is needed to enhance models' performance in this domain.
- Open Question 2: How can we expand BAGEL to cover multimodal data? Currently, BAGEL's evaluation focuses on text data, not covering multimodal data processing capabilities, which may limit comprehensive evaluation for certain biodiversity applications.
- Open Question 3: How can we enhance models' generalization capabilities in handling complex biodiversity knowledge? BAGEL's experimental results show varying performance across source domains, necessitating further research to improve models' generalization capabilities.
- Open Question 4: How can we improve models' performance in specific domains? BAGEL's experimental results indicate suboptimal performance in certain domains, requiring further research to enhance models' performance in these areas.
- Open Question 5: How can we improve models' performance in handling long-tail knowledge? Current language models' performance in handling long-tail knowledge about the natural world remains unclear, necessitating further research to enhance models' capabilities in this area.
Applications
Immediate Applications
Biodiversity Research
BAGEL provides researchers with a platform to evaluate language models' performance on animal knowledge, aiding in better understanding and conservation of biodiversity.
Educational Applications
BAGEL offers educators a platform to assess students' animal knowledge, promoting the development of biology education.
Scientific Research
BAGEL provides scientists with a platform to evaluate language models' capabilities in handling complex biodiversity knowledge, advancing research in related fields.
Long-term Vision
Multimodal Data Processing
Future research can expand BAGEL to cover multimodal data, enhancing the evaluation of models' capabilities in handling complex biodiversity knowledge.
Improving Model Performance
Research on improving models' performance in specific domains can enhance their capability in textual descriptions of animal vocalizations.
Abstract
Large language models have shown strong performance on broad-domain knowledge and reasoning benchmarks, but it remains unclear how well language models handle specialized animal-related knowledge under a unified closed-book evaluation protocol. We introduce BAGEL, a benchmark for evaluating animal knowledge expertise in language models. BAGEL is constructed from diverse scientific and reference sources, including bioRxiv, Global Biotic Interactions, Xeno-canto, and Wikipedia, using a combination of curated examples and automatically generated closed-book question-answer pairs. The benchmark covers multiple aspects of animal knowledge, including taxonomy, morphology, habitat, behavior, vocalization, geographic distribution, and species interactions. By focusing on closed-book evaluation, BAGEL measures animal-related knowledge of models without external retrieval at inference time. BAGEL further supports fine-grained analysis across source domains, taxonomic groups, and knowledge categories, enabling a more precise characterization of model strengths and systematic failure modes. Our benchmark provides a new testbed for studying domain-specific knowledge generalization in language models and for improving their reliability in biodiversity-related applications.
References (20)
SmolLM2: When Smol Goes Big - Data-Centric Training of a Small Language Model
Loubna Ben Allal, Anton Lozhkov, Elie Bakouch et al.
Gemma 3 Technical Report
Gemma Team: Aishwarya Kamath, Johan Ferret, Shreya Pathak et al.
Large language models possess some ecological knowledge, but how much?
Filip Dorm, Joseph W. Millard, Drew Purves et al.
Measuring Massive Multitask Language Understanding
Dan Hendrycks, Collin Burns, Steven Basart et al.
NatureLM-audio: an Audio-Language Foundation Model for Bioacoustics
David Robinson, Marius Miron, Masato Hagiwara et al.
Phi-4 Technical Report
Marah Abdin, J. Aneja, Harkirat Singh Behl et al.
BEANS: The Benchmark of Animal Sounds
Masato Hagiwara, Benjamin Hoffman, Jen-Yu Liu et al.
Qwen3 Technical Report
An Yang, Anfeng Li, Baosong Yang et al.
Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering
Pan Lu, Swaroop Mishra, Tony Xia et al.
The Llama 3 Herd of Models
Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey et al.
Mistral 7B
Albert Qiaochu Jiang, Alexandre Sablayrolles, Arthur Mensch et al.
Plausibly Problematic Questions in Multiple-Choice Benchmarks for Commonsense Reasoning
Shramay Palta, Nishant Balepur, Peter Rankel et al.
Large Language Models Sensitivity to The Order of Options in Multiple-Choice Questions
Pouya Pezeshkpour, Estevam Hruschka
Which of These Best Describes Multiple Choice Evaluation with LLMs? A) Forced B) Flawed C) Fixable D) All of the Above
Nishant Balepur, Rachel Rudinger, J. Boyd-Graber
OceanGPT: A Large Language Model for Ocean Science Tasks
Zhen Bi, Ningyu Zhang, Yida Xue et al.
Training language models to follow instructions with human feedback
Long Ouyang, Jeff Wu, Xu Jiang et al.
Overview of BioASQ 2023: The eleventh BioASQ challenge on Large-Scale Biomedical Semantic Indexing and Question Answering
A. Nentidis, Georgios Katsimpras, Anastasia Krithara et al.
SATA-BENCH: Select All That Apply Benchmark for Multiple Choice Questions
Weijie Xu, Shixian Cui, Xi Fang et al.
Answer Matching Outperforms Multiple Choice for Language Model Evaluation
Nikhil Chandak, Shashwat Goel, Ameya Prabhu et al.
Environmental large language model Evaluation (ELLE) dataset: A Benchmark for Evaluating Generative AI applications in Eco-environment Domain
Jing Guo, Nan Li, Ming Xu