Characterizing Cultural Localization in AI-Generated Stories

TL;DR

Proposes a method that combines lexical token analysis with multi-word similarity to quantify cultural localization in AI-generated stories, and finds that only 9-17% of the vocabulary accounts for variation across nationalities.

cs.CL, Advanced, 2026-06-13
Shaily Bhatt, Supriti Vijay, Jeremiah Milbauer, Fernando Diaz
Natural Language Processing, Cultural Localization, Story Generation, Template Detection, Model Analysis

Key Findings

Methodology

This paper introduces a two-stage approach. First, it uses normalized pointwise mutual information (NPMI) to identify cultural markers: lexical items that distinguish stories across nationalities. Second, it measures the homogeneity of the stories that remain after these markers are removed, using multi-word similarity metrics such as Longest Common Substring (LCS) and Jaccard similarity on 4-grams. The study generates stories with five models (including GPT-3.5, GPT-4, Llama 3.1, Llama 3.3, and Gemma 12B) across 193 nationalities and 125 topics. The cultural markers constitute only 9-17% of the vocabulary, and removing them yields stories with higher multi-word similarity, indicating a shared, culture-agnostic narrative template. The analysis also uses the SeeGULL dataset to evaluate the stereotypicality and offensiveness of cultural markers, finding that markers from 19 countries, mostly in the Global South, are on average offensive.
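The first stage can be illustrated with a minimal sketch of NPMI scoring between tokens and nationalities. This is not the authors' implementation: the whitespace tokenization, the story-level counting, the `threshold` and `min_count` values, and the function name `npmi_markers` are all assumptions made for illustration. It uses the standard definition NPMI(w, c) = PMI(w, c) / -log p(w, c).

```python
import math
from collections import Counter

def npmi_markers(stories, threshold=0.3, min_count=2):
    """stories: list of (nationality, text) pairs.
    Returns {nationality: {token: npmi}} for tokens scoring above threshold.
    threshold and min_count are illustrative values, not taken from the paper."""
    n_docs = len(stories)
    token_docs = Counter()   # number of stories containing each token
    nat_docs = Counter()     # number of stories per nationality
    joint_docs = Counter()   # number of stories of a nationality containing a token
    for nat, text in stories:
        nat_docs[nat] += 1
        for tok in set(text.lower().split()):  # assumed whitespace tokenization
            token_docs[tok] += 1
            joint_docs[(tok, nat)] += 1

    markers = {}
    for (tok, nat), joint in joint_docs.items():
        if joint < min_count:
            continue  # skip rare pairs, whose estimates are unreliable
        p_joint = joint / n_docs
        if p_joint == 1.0:
            continue  # NPMI is undefined when -log p(w, c) is zero
        pmi = math.log(p_joint / ((token_docs[tok] / n_docs) * (nat_docs[nat] / n_docs)))
        score = pmi / -math.log(p_joint)
        if score > threshold:
            markers.setdefault(nat, {})[tok] = score
    return markers

# Toy corpus (invented for this example)
stories = [
    ("Japanese", "Haruto walked to the market in Kyoto"),
    ("Japanese", "Haruto met a friend near the market"),
    ("Kenyan", "Wanjiru walked to the market in Nairobi"),
    ("Kenyan", "Wanjiru met a friend near the market"),
]
markers = npmi_markers(stories)
print(markers)
```

In this toy corpus, names that appear only in one nationality's stories score 1.0, while words such as "market" that appear everywhere score 0 and are not selected.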

Key Results

  • Analysis shows that only 9-17% of vocabulary (average ~12%) in generated stories is responsible for cultural differentiation. Removing these words makes the remaining story sequences more similar across cultures, indicating a shared underlying narrative structure.
  • A nationality classifier's accuracy drops from 96.8% on the original stories to near chance (around 0.5%, roughly 1 in 193) after cultural markers are masked, which indicates that the detected markers carry the nationality signal. Multi-word similarity measures (LCS and Jaccard) increase by 10-25% after removal, demonstrating increased content homogeneity.
  • Cultural markers from certain countries, especially in Africa and West Asia, are rated as more offensive in SeeGULL, highlighting biases and potential harm in model outputs. This underscores the importance of bias mitigation in AI storytelling systems.
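The effect of marker removal on 4-gram Jaccard similarity can be sketched as follows. This is an illustrative example under stated assumptions, not the paper's code: the two stories and the marker set are invented, tokens are split on whitespace, and markers are dropped rather than replaced with a mask token.

```python
def remove_markers(text, markers):
    """Lowercase, split on whitespace, and drop tokens that are cultural markers."""
    return [tok for tok in text.lower().split() if tok not in markers]

def ngram_set(tokens, n=4):
    """Set of contiguous n-token windows (empty if there are fewer than n tokens)."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def jaccard(tokens_a, tokens_b, n=4):
    """Jaccard similarity between the n-gram sets of two token sequences."""
    a, b = ngram_set(tokens_a, n), ngram_set(tokens_b, n)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

story_a = "Haruto walked to the market in Kyoto to buy fresh fish"
story_b = "Wanjiru walked to the market in Nairobi to buy fresh fish"
markers = {"haruto", "kyoto", "wanjiru", "nairobi"}  # hypothetical marker set

before = jaccard(story_a.lower().split(), story_b.lower().split())
after = jaccard(remove_markers(story_a, markers), remove_markers(story_b, markers))
print(f"before={before:.3f} after={after:.3f}")  # before=0.231 after=1.000
```

Here the two stories share 3 of 13 distinct 4-grams before removal and become identical afterwards, which is the templated pattern the similarity measures are meant to expose. For reference, uniform chance over 193 nationalities is 1/193, about 0.52%, consistent with the "around 0.5%" figure for the classifier.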

Significance

This work provides a quantitative framework for understanding how AI models embed cultural information in generated stories. By revealing that cultural variation is primarily surface-level, it informs strategies to enhance cultural diversity and reduce stereotypes. The methodology offers tools for detecting and mitigating biases, fostering fairer AI content. Moreover, uncovering the structural homogeneity of stories across cultures suggests that models may rely heavily on shared templates, which has implications for creativity and authenticity in AI storytelling. These insights are crucial for advancing ethical AI deployment in multicultural contexts.

Technical Contribution

The paper combines lexical statistical measures (NPMI) with multi-word similarity metrics (LCS, Jaccard) to detect and analyze cultural markers in generated stories. Its pipeline quantifies the proportion of vocabulary responsible for cultural differentiation and assesses the similarity of stories after cultural markers are removed. The approach is applied to stories from five models across 193 nationalities and 125 topics. Additionally, the SeeGULL dataset is used to evaluate the stereotypicality and offensiveness of the markers, extending the analysis to bias detection.
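The LCS metric named above can be sketched as a token-level longest common substring, that is, the longest contiguous run of tokens shared by two stories. This is a minimal illustration, assuming whitespace tokens and a standard dynamic-programming formulation; the function name and the example sentences are hypothetical, and the paper may compute or normalize the measure differently.

```python
def longest_common_substring(a, b):
    """Return the longest contiguous run of tokens that appears in both a and b."""
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)  # run lengths ending at the previous token of a
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the shared run that ended at a[i-2], b[j-2]
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return a[best_end - best_len:best_end]

a = "once upon a time in a small village lived a girl".split()
b = "long ago in a small village lived a boy".split()
shared = longest_common_substring(a, b)
print(" ".join(shared))  # in a small village lived a
```

Applied to stories with cultural markers removed, a long shared run like this one points to a reused narrative template rather than independently written text.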

Novelty

This study is the first to systematically quantify the extent of templated versus holistic cultural localization in AI story generation. It innovatively combines lexical mutual information with multi-word similarity measures to reveal the underlying narrative homogeneity masked by surface lexical differences. Unlike prior work that focused solely on lexical variation or stereotypes, this approach provides a nuanced understanding of the structural and lexical dimensions of cultural embedding, offering a new perspective on model biases and content diversity.

Limitations

  • The analysis is limited to English-language stories and models, restricting its applicability across multilingual settings. Cross-lingual cultural template detection remains an open challenge.
  • The reliance on lexical statistics (NPMI) may overlook subtle semantic or contextual cultural cues, leading to incomplete identification of cultural markers.
  • Generated stories are influenced by training data biases, which may skew the analysis of cultural representation and reflect biases in the models themselves.

Future Work

Future research can extend this framework to multilingual and multimodal settings, incorporating semantic and discourse-level analyses for deeper cultural understanding. Developing cross-lingual detection methods will broaden applicability. Additionally, integrating user feedback and bias mitigation techniques will improve cultural sensitivity. Exploring the evolution of cultural content over time and across different contexts can further enhance model fairness and creativity. These directions aim to foster AI systems that generate culturally rich, diverse, and respectful narratives.

AI Executive Summary

As artificial intelligence continues to permeate global markets, the ability of AI systems to generate culturally appropriate content becomes increasingly critical. In particular, story generation models must navigate the complex landscape of cultural diversity, balancing surface-level markers with deeper narrative structures. Existing research has primarily focused on lexical variation—such as names, locations, and stereotypical tokens—yet the extent to which these surface cues reflect genuine cultural differences remains underexplored.

This paper introduces a novel analytical framework that combines lexical token analysis with multi-word similarity metrics to quantify cultural localization in AI-generated stories. The core idea is to identify a minimal set of cultural markers—lexical items highly associated with specific nationalities—and assess how their removal affects story similarity. Using five models, including GPT-3.5, GPT-4, Llama 3.1, Llama 3.3, and Gemma 12B, the authors generate stories across 193 countries and 125 topics, providing a comprehensive dataset for analysis.

The findings reveal that only about 9-17% of vocabulary in these stories accounts for cultural differences, with the remaining content exhibiting high structural similarity. After masking cultural markers, stories from different nationalities become markedly more homogeneous, suggesting that models predominantly rely on template-like structures with surface-level lexical insertions. This insight raises important questions about the depth of cultural understanding in AI storytelling systems.

Furthermore, by evaluating the stereotypicality and offensiveness of cultural markers using the SeeGULL dataset, the study uncovers a concerning trend: markers from many Global South countries tend to be more offensive, highlighting biases embedded within training data and model outputs. This underscores the urgent need for bias detection and mitigation strategies in AI content generation.

Overall, this research provides a powerful toolkit for quantifying and understanding cultural localization in AI stories. It offers a foundation for developing more culturally sensitive, diverse, and fair AI systems, with broad implications for applications in education, entertainment, and intercultural communication. Future work will focus on extending these methods to multilingual contexts, incorporating semantic and discourse analyses, and refining bias mitigation techniques to foster inclusive AI narratives.

Deep Dive

Abstract

The global use of artificial intelligence has increased interest in assessing the ability to generate culturally localized content, including stories. Cultural localization in stories often occurs through either templated localization -- the use of cultural markers (e.g., names, locations) in a generic narrative -- or holistic localization -- the variation of plots, values, and themes, in addition to cultural markers. We propose a method to measure the degree to which content was generated through templated localization. Specifically, we identify the lexical tokens that distinguish stories across nationalities and measure the similarity of the narratives that remain after removing them. In stories generated by five models on 125 topics for 193 nationalities, our method is able to detect that only a small subset (9-17%) of the vocabulary accounts for the variation across nationalities and that the narratives that remain after removing them contain repeated multi-word sequences, suggesting the presence of a shared culturally-agnostic narrative template. Finally, we characterize the cultural markers for their stereotypicality and offensiveness, finding that markers from 19 countries, mostly located in the Global South, are on average offensive.

