Characterizing Cultural Localization in AI-Generated Stories
Proposes a method that combines lexical token analysis with multi-word similarity to quantify cultural localization in AI-generated stories, and finds that only 9-17% of the vocabulary accounts for cultural differences.
Key Findings
Methodology
This paper introduces a two-stage approach. First, it uses normalized pointwise mutual information (NPMI) to identify cultural markers, the lexical items that distinguish stories across nationalities. Second, it measures the homogeneity of the stories that remain after these markers are removed, using multi-word similarity metrics such as Longest Common Substring (LCS) and Jaccard similarity on 4-grams. The study generates stories with five different models (including GPT-3.5, GPT-4, Llama 3.1, Llama 3.3, Gemma 12B) across 193 countries and 125 topics. The cultural markers constitute only 9-17% of the vocabulary, and removing them yields stories with higher structural similarity, which points to shared, culture-agnostic narrative templates. The analysis also uses the SeeGULL dataset to evaluate the stereotypicality and offensiveness of cultural markers, finding that markers from 19 countries, mostly in the Global South, are on average offensive.
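The first stage can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes whitespace tokenization, document-level co-occurrence counts (a token counts once per story), and a hypothetical `threshold` of 0.5; the function name `npmi_markers` is also made up for this example. NPMI divides pointwise mutual information, log p(w,c) / (p(w) p(c)), by -log p(w,c), so scores lie in [-1, 1], and a token that occurs in every story of one nationality and nowhere else scores 1.

```python
import math
from collections import Counter, defaultdict


def npmi_markers(stories, threshold=0.5):
    """Return {nationality: {token: npmi}} for tokens scoring >= threshold.

    stories: list of (nationality, text) pairs.
    Assumptions (not from the paper): lowercased whitespace tokens,
    document-level counts, and an arbitrary illustrative threshold.
    """
    n_docs = len(stories)
    nat_count = Counter()    # stories per nationality
    tok_count = Counter()    # stories containing each token
    joint_count = Counter()  # stories of a nationality containing a token

    for nationality, text in stories:
        nat_count[nationality] += 1
        for token in set(text.lower().split()):
            tok_count[token] += 1
            joint_count[(token, nationality)] += 1

    markers = defaultdict(dict)
    for (token, nationality), count in joint_count.items():
        p_joint = count / n_docs
        if p_joint == 1.0:
            continue  # -log(1) = 0, so NPMI is undefined here
        p_tok = tok_count[token] / n_docs
        p_nat = nat_count[nationality] / n_docs
        pmi = math.log(p_joint / (p_tok * p_nat))
        npmi = pmi / -math.log(p_joint)
        if npmi >= threshold:
            markers[nationality][token] = npmi
    return dict(markers)


# Toy corpus: two nationalities, two shared story templates.
toy = [
    ("Japan", "Hiro walked to the market in Tokyo"),
    ("Japan", "Hiro found a lantern in Tokyo"),
    ("Kenya", "Amina walked to the market in Nairobi"),
    ("Kenya", "Amina found a lantern in Nairobi"),
]
print(npmi_markers(toy))
```

On this toy corpus only the names and cities clear the threshold; words shared by both nationalities, such as "market" or "in", score 0 because they are statistically independent of nationality.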
Key Results
- Analysis shows that only 9-17% of vocabulary (average ~12%) in generated stories is responsible for cultural differentiation. Removing these words makes the remaining story sequences more similar across cultures, indicating a shared underlying narrative structure.
- A classifier's accuracy in identifying a story's nationality drops from 96.8% on the original stories to near chance (around 0.5%, close to the 1/193 ≈ 0.52% expected for 193 balanced classes) after cultural markers are masked, which indicates that the detected markers carry nearly all of the nationality signal. Multi-word similarity scores (LCS and Jaccard) rise by 10-25% after removal, indicating greater content homogeneity.
- Cultural markers from certain countries, especially in Africa and West Asia, are rated as more offensive in SeeGULL, highlighting biases and potential harm in model outputs. This underscores the importance of bias mitigation in AI storytelling systems.
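The masking-then-similarity step behind the homogeneity result above can be illustrated with a small sketch. It assumes markers are replaced by a placeholder token rather than deleted, and that Jaccard similarity is taken over sets of word-level 4-grams; the helper names `mask_markers` and `jaccard_ngrams` are hypothetical and may differ from the paper's exact setup.

```python
def mask_markers(text, markers, placeholder="[MASK]"):
    """Lowercase and split text, replacing marker tokens with a placeholder."""
    return [placeholder if tok in markers else tok
            for tok in text.lower().split()]


def ngram_set(tokens, n=4):
    """Set of contiguous n-token windows."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard_ngrams(tokens_a, tokens_b, n=4):
    """Jaccard similarity between the n-gram sets of two token lists."""
    a, b = ngram_set(tokens_a, n), ngram_set(tokens_b, n)
    if not a and not b:
        return 0.0  # both texts shorter than n tokens
    return len(a & b) / len(a | b)


story_a = "Hiro walked to the market in Tokyo"
story_b = "Amina walked to the market in Nairobi"
markers = {"hiro", "tokyo", "amina", "nairobi"}  # illustrative marker set

# Before masking: 2 shared 4-grams out of 6 distinct ones.
before = jaccard_ngrams(story_a.lower().split(), story_b.lower().split())
# After masking: both stories reduce to the same token sequence.
after = jaccard_ngrams(mask_markers(story_a, markers),
                       mask_markers(story_b, markers))
print(before, after)
```

In this toy pair the similarity rises from one third to 1.0 once the four markers are masked, which is the pattern the templated-localization hypothesis predicts.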
Significance
This work provides a quantitative framework for understanding how AI models embed cultural information in generated stories. By revealing that cultural variation is primarily surface-level, it informs strategies to enhance cultural diversity and reduce stereotypes. The methodology offers tools for detecting and mitigating biases, fostering fairer AI content. Moreover, uncovering the structural homogeneity of stories across cultures suggests that models may rely heavily on shared templates, which has implications for creativity and authenticity in AI storytelling. These insights are crucial for advancing ethical AI deployment in multicultural contexts.
Technical Contribution
The paper combines a lexical statistical measure (NPMI) with multi-word similarity metrics (LCS, Jaccard) to detect and analyze cultural markers in generated stories. Its pipeline quantifies the proportion of vocabulary responsible for cultural differentiation and assesses the structural similarity of stories after cultural marker removal. The approach is applied across five models, 193 nationalities, and 125 topics. Integrating the SeeGULL dataset for stereotypicality and offensiveness evaluation extends the analysis to bias detection.
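For readers unfamiliar with the LCS metric, here is a minimal dynamic-programming sketch of a longest common substring over word tokens. Operating on tokens rather than characters is an assumption made for illustration, and the function name `longest_common_token_run` is hypothetical; the paper may normalize or aggregate the result differently.

```python
def longest_common_token_run(a, b):
    """Longest contiguous run of tokens that appears in both lists.

    Classic O(len(a) * len(b)) dynamic program: cell (i, j) holds the
    length of the common run ending at a[i-1] and b[j-1]. Only the
    previous row is kept to save memory.
    """
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return a[best_end - best_len:best_end]


story_a = "hiro walked to the market in tokyo".split()
story_b = "amina walked to the market in nairobi".split()
print(longest_common_token_run(story_a, story_b))
# ['walked', 'to', 'the', 'market', 'in']
```

A long shared run between stories written for different nationalities is the kind of multi-word overlap that suggests a common narrative template.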
Novelty
This study is the first to systematically quantify the extent of templated versus holistic cultural localization in AI story generation. It innovatively combines lexical mutual information with multi-word similarity measures to reveal the underlying narrative homogeneity masked by surface lexical differences. Unlike prior work that focused solely on lexical variation or stereotypes, this approach provides a nuanced understanding of the structural and lexical dimensions of cultural embedding, offering a new perspective on model biases and content diversity.
Limitations
- The analysis is limited to English-language stories and models, restricting its applicability across multilingual settings. Cross-lingual cultural template detection remains an open challenge.
- The reliance on lexical statistics (NPMI) may overlook subtle semantic or contextual cultural cues, leading to incomplete identification of cultural markers.
- Generated stories are influenced by training data biases, which may skew the analysis of cultural representation and reflect biases in the models themselves.
Future Work
Future research can extend this framework to multilingual and multimodal settings, incorporating semantic and discourse-level analyses for deeper cultural understanding. Developing cross-lingual detection methods will broaden applicability. Additionally, integrating user feedback and bias mitigation techniques will improve cultural sensitivity. Exploring the evolution of cultural content over time and across different contexts can further enhance model fairness and creativity. These directions aim to foster AI systems that generate culturally rich, diverse, and respectful narratives.
AI Executive Summary
As artificial intelligence continues to permeate global markets, the ability of AI systems to generate culturally appropriate content becomes increasingly critical. In particular, story generation models must navigate the complex landscape of cultural diversity, balancing surface-level markers with deeper narrative structures. Existing research has primarily focused on lexical variation—such as names, locations, and stereotypical tokens—yet the extent to which these surface cues reflect genuine cultural differences remains underexplored.
This paper introduces a novel analytical framework that combines lexical token analysis with multi-word similarity metrics to quantify cultural localization in AI-generated stories. The core idea is to identify a minimal set of cultural markers—lexical items highly associated with specific nationalities—and assess how their removal affects story similarity. Using five models, including GPT-3.5, GPT-4, Llama 3.1, Llama 3.3, and Gemma 12B, the authors generate stories across 193 countries and 125 topics, providing a comprehensive dataset for analysis.
The findings reveal that only about 9-17% of vocabulary in these stories accounts for cultural differences, with the remaining content exhibiting high structural similarity. After masking cultural markers, stories from different nationalities become markedly more homogeneous, suggesting that models predominantly rely on template-like structures with surface-level lexical insertions. This insight raises important questions about the depth of cultural understanding in AI storytelling systems.
Furthermore, by evaluating the stereotypicality and offensiveness of cultural markers using the SeeGULL dataset, the study uncovers a concerning trend: markers from many Global South countries tend to be more offensive, highlighting biases embedded within training data and model outputs. This underscores the urgent need for bias detection and mitigation strategies in AI content generation.
Overall, this research provides a powerful toolkit for quantifying and understanding cultural localization in AI stories. It offers a foundation for developing more culturally sensitive, diverse, and fair AI systems, with broad implications for applications in education, entertainment, and intercultural communication. Future work will focus on extending these methods to multilingual contexts, incorporating semantic and discourse analyses, and refining bias mitigation techniques to foster inclusive AI narratives.
Deep Dive
Abstract
The global use of artificial intelligence has increased interest in assessing the ability to generate culturally localized content, including stories. Cultural localization in stories often occurs through either templated localization -- the use of cultural markers (e.g., names, locations) in a generic narrative -- or holistic localization -- the variation of plots, values, and themes, in addition to cultural markers. We propose a method to measure the degree to which content was generated through templated localization. Specifically, we identify the lexical tokens that distinguish stories across nationalities and measure the similarity of the narratives that remain after removing them. In stories generated by five models on 125 topics for 193 nationalities, our method is able to detect that only a small subset (9-17%) of the vocabulary accounts for the variation across nationalities and that the narratives that remain after removing them contain repeated multi-word sequences, suggesting the presence of a shared culturally-agnostic narrative template. Finally, we characterize the cultural markers for their stereotypicality and offensiveness, finding that markers from 19 countries, mostly located in the Global South, are on average offensive.