The Algorithmic Caricature: Auditing LLM-Generated Political Discourse Across Crisis Events
Using a Computational Social Science framework, the study audits LLM-generated political discourse across nine crisis events and finds it more negative in sentiment and more structurally uniform than observed discourse.
Key Findings
Methodology
This study adopts a Computational Social Science (CSS) framework, constructing a paired corpus of 1,789,406 posts across nine political crisis events. By comparing observed discourse from social platforms with synthetic discourse generated for the same context, the study evaluates divergence along four dimensions: emotional intensity, structural regularity, lexical-ideological framing, and cross-event dependency. Mean gaps and dispersion evidence are used to assess population distortion.
Key Results
- Result 1: Synthetic discourse is more negative in sentiment, with an average sentiment score of -0.215 compared to +0.018 for observed discourse. The sentiment distribution of synthetic discourse is more concentrated, with a standard deviation of 0.458 compared to 0.522 for observed discourse.
- Result 2: Structurally, synthetic discourse averages 23.08 words per post with a standard deviation of 11.29, while observed discourse averages 32.16 words with a standard deviation of 55.93, indicating that synthetic posts are far more uniform in length.
- Result 3: In terms of lexical-ideological framing, synthetic discourse is more abstract and formalized, lacking the specific, event-dependent lexical markers found in observed discourse.
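The sentiment figures in Result 1 can be turned into the two quantities the methodology mentions, a mean gap and a dispersion comparison. The snippet below is a minimal sketch using only the reported summary statistics; the names `mean_gap` and `dispersion_ratio` are illustrative labels, not terms taken from the paper.

```python
# Summary statistics as reported in Result 1 (sentiment scores).
synthetic = {"mean": -0.215, "std": 0.458}
observed = {"mean": 0.018, "std": 0.522}

# Mean gap: how far the synthetic average sits from the observed average.
mean_gap = synthetic["mean"] - observed["mean"]

# Dispersion ratio: values below 1 mean synthetic scores are less spread out.
dispersion_ratio = synthetic["std"] / observed["std"]

print(f"mean gap: {mean_gap:+.3f}")                  # -0.233
print(f"dispersion ratio: {dispersion_ratio:.3f}")   # 0.877
```

On these numbers the synthetic average is about 0.23 points lower, and the synthetic spread is roughly 88% of the observed spread.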
Significance
This research highlights the limitations of LLM-generated political discourse during social crisis events, particularly in terms of emotional diversity and structural complexity. By introducing the 'Caricature Gap' as a simple event-level measure, the study provides a new perspective for evaluating the social realism of synthetic discourse. This not only complements traditional text detection methods but also offers a theoretical foundation for future applications of generative AI systems in social sciences.
Technical Contribution
The technical contribution of this paper lies in proposing a new framework for evaluating whether synthetic discourse reproduces the aggregate behavioral signatures of observed online publics. By treating the discourse population rather than the individual text as the unit of analysis, the study connects generative AI evaluation to central CSS concerns: collective behavior, political communication, crisis response, and the measurement of online publics.
Novelty
This study is the first to audit the social realism of synthetic political discourse at the population level rather than the sentence level. Unlike previous methods that focus on local textual features, this paper emphasizes the overall distortion of synthetic discourse in terms of emotion, structure, and lexicon.
Limitations
- Limitation 1: Synthetic discourse shows greater distortion in fast-moving and decentralized crisis events, possibly due to the model's limitations in handling informal and heterogeneous discourse.
- Limitation 2: The study is limited to nine specific crisis events, which may not comprehensively represent all types of political discourse.
- Limitation 3: The generation of synthetic discourse relies on specific prompts and parameter settings, which may affect the generalizability of the results.
Future Work
Future research could expand to more types of events and a broader range of social platforms to verify the performance of synthetic discourse in different social contexts. Additionally, research could explore how to improve generative models to better reproduce the observed emotional diversity and structural complexity.
AI Executive Summary
In today's digital age, social media has become a central platform for political expression and mobilization. However, the rapid development of Large Language Models (LLMs) has raised new concerns about the large-scale generation of synthetic discourse during crisis events. Existing AI text detection methods primarily focus on local linguistic features such as perplexity and burstiness, but these signals may become unreliable as generative systems improve.
This paper proposes a new Computational Social Science (CSS) framework for auditing the social realism of synthetic political discourse at the population level rather than the sentence level. The study constructs a paired corpus of 1,789,406 posts covering nine political crisis events, including the COVID-19 pandemic, the 2020 and 2024 US elections, and BLM protests. By comparing observed discourse from social platforms with synthetic discourse generated for the same context, the study evaluates divergence along four dimensions: emotional intensity, structural regularity, lexical-ideological framing, and cross-event dependency.
The study finds that, compared with observed discourse, synthetic discourse is more negative in sentiment, structurally more consistent, and lexically more abstract, lacking the specific, event-dependent lexical markers that observed discourse contains. These differences are more pronounced in fast-moving and decentralized crisis events and smaller in formal or institutionally mediated events. The study introduces the 'Caricature Gap' as a simple event-level measure to summarize these differences.
These findings suggest that the main limitation of synthetic political discourse is not grammar or fluency but reduced population realism. By treating the discourse population rather than the individual text as the unit of analysis, the study connects generative AI evaluation to central CSS concerns: collective behavior, political communication, crisis response, and the measurement of online publics.
While this paper provides a new perspective for evaluating the social realism of synthetic discourse, it also has some limitations. The study is limited to nine specific crisis events, which may not comprehensively represent all types of political discourse. Additionally, the generation of synthetic discourse relies on specific prompts and parameter settings, which may affect the generalizability of the results. Future research could expand to more types of events and a broader range of social platforms to verify the performance of synthetic discourse in different social contexts.
Deep Analysis
Background
In recent years, social media platforms have become central infrastructures for political expression, mobilization, and contestation. However, these platforms have also become key sites of manipulation, including coordinated influence operations, misinformation campaigns, and automated amplification. The rapid diffusion of Large Language Models (LLMs) raises a new concern within this landscape: the possibility of generating large volumes of fluent, politically charged synthetic discourse that can imitate grassroots expression at scale. Existing AI-generated text detection methods have primarily focused on local textual signatures, such as token predictability, burstiness, repetition, or perplexity-based irregularities. While these methods can be useful, they are increasingly vulnerable to model improvements, paraphrasing, and stylistic adaptation. In settings such as political communication, where language is noisy, emotional, and event-dependent, a narrow focus on sentence-level cues may miss broader population-level distortions.
Core Problem
The core problem addressed in this paper is: how does synthetic political discourse differ from observed online populations during crisis events? Traditional AI text detection methods primarily focus on local linguistic features, which may fail to capture the overall distortion of synthetic discourse in terms of emotion, structure, and lexicon. This paper proposes a new Computational Social Science (CSS) framework for auditing the social realism of synthetic discourse at the population level. By treating the discourse population rather than the individual text as the unit of analysis, the study connects generative AI evaluation to central CSS concerns: collective behavior, political communication, crisis response, and the measurement of online publics.
Innovation
The core innovations of this paper include: 1) Proposing a new CSS framework for evaluating whether synthetic discourse reproduces the aggregate behavioral signatures of observed online publics; 2) Constructing a paired corpus of 1,789,406 posts covering nine political crisis events; 3) Introducing the 'Caricature Gap' as a simple event-level measure to summarize the differences between synthetic and observed discourse. Unlike previous methods focusing on local textual features, this paper emphasizes the overall distortion of synthetic discourse in terms of emotion, structure, and lexicon.
Methodology
The methodology of this paper includes the following steps:
- Dataset Construction: Collecting observed and synthetic discourse for nine political crisis events, forming a paired corpus of 1,789,406 posts.
- Dimension Evaluation: Comparing divergence along four dimensions: emotional intensity, structural regularity, lexical-ideological framing, and cross-event dependency.
- Divergence Measurement: Using mean gaps and dispersion evidence to assess population distortion, introducing the 'Caricature Gap' as a simple event-level measure.
- Statistical Analysis: Conducting statistical analysis of differences in emotion, structure, and lexicon using VADER sentiment analysis and TF-IDF lexical analysis.
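For a single dimension, the per-event comparison described in these steps can be sketched as follows. This is a minimal sketch and not the authors' pipeline: the record layout, the event names, the scores, and the function `event_gaps` are all hypothetical, and the sentiment scores are assumed to have been computed beforehand (for example with VADER).

```python
from collections import defaultdict
from statistics import mean, pstdev

# Hypothetical records: (event, source, sentiment score in [-1, 1]).
records = [
    ("event_a", "observed", 0.40),
    ("event_a", "observed", -0.60),
    ("event_a", "observed", 0.20),
    ("event_a", "synthetic", -0.30),
    ("event_a", "synthetic", -0.20),
    ("event_a", "synthetic", -0.10),
    ("event_b", "observed", 0.10),
    ("event_b", "observed", -0.10),
    ("event_b", "synthetic", 0.00),
    ("event_b", "synthetic", -0.20),
]

def event_gaps(rows):
    """Return {event: (mean_gap, dispersion_ratio)}, synthetic vs observed."""
    grouped = defaultdict(lambda: {"observed": [], "synthetic": []})
    for event, source, score in rows:
        grouped[event][source].append(score)

    result = {}
    for event, groups in grouped.items():
        obs, syn = groups["observed"], groups["synthetic"]
        # Mean gap: synthetic average minus observed average.
        mean_gap = mean(syn) - mean(obs)
        # Dispersion ratio: synthetic spread relative to observed spread.
        obs_sd = pstdev(obs)
        ratio = pstdev(syn) / obs_sd if obs_sd > 0 else float("nan")
        result[event] = (mean_gap, ratio)
    return result

gaps = event_gaps(records)
for event, (gap, ratio) in sorted(gaps.items()):
    print(f"{event}: mean gap {gap:+.2f}, dispersion ratio {ratio:.2f}")
```

Grouping by event before comparing keeps the discourse population, rather than the individual post, as the unit of analysis. The same grouping could be reused for word counts or any other per-post measurement.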
Experiments
The experimental design includes the following aspects:
- Dataset: Constructing a paired corpus of 1,789,406 posts covering nine political crisis events.
- Baseline: Using observed discourse from social platforms as a baseline for comparison with synthetic discourse.
- Evaluation Metrics: Evaluating divergence along four dimensions: emotional intensity, structural regularity, lexical-ideological framing, and cross-event dependency.
- Hyperparameters: Using specific prompts and parameter settings in generating synthetic discourse to ensure alignment with event context.
Results
The main results are:
- Synthetic discourse is more negative in sentiment, with an average sentiment score of -0.215 compared to +0.018 for observed discourse.
- Structurally, synthetic discourse averages 23.08 words per post with a standard deviation of 11.29, while observed discourse averages 32.16 words with a standard deviation of 55.93, indicating that synthetic posts are far more uniform in length.
- In terms of lexical-ideological framing, synthetic discourse is more abstract and formalized, lacking the specific, event-dependent lexical markers found in observed discourse.
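One way to read the word-count figures is through the coefficient of variation (standard deviation divided by the mean), which puts the two spreads on a common scale. This is a minimal sketch based only on the reported summary statistics. The coefficient of variation is a standard statistic chosen here for illustration; the source does not say the paper reports it.

```python
# Word-count statistics as reported above: (mean, standard deviation).
synthetic_len = (23.08, 11.29)
observed_len = (32.16, 55.93)

def coefficient_of_variation(mean, std):
    """Standard deviation expressed as a fraction of the mean."""
    return std / mean

cv_syn = coefficient_of_variation(*synthetic_len)
cv_obs = coefficient_of_variation(*observed_len)

print(f"synthetic CV: {cv_syn:.2f}")  # 0.49
print(f"observed CV:  {cv_obs:.2f}")  # 1.74
```

For a non-negative quantity such as post length, a standard deviation larger than the mean (as in the observed posts) points to a long right tail: most posts are short and a few are very long. The synthetic posts show no such tail in these figures.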
Applications
The findings of this study can be applied in the following scenarios:
- Political Communication Analysis: Evaluating the social realism of synthetic discourse helps identify and understand potential manipulation and influence in political communication.
- Social Media Monitoring: The framework gives social media platforms a new tool for identifying and filtering potential synthetic discourse.
- Generative AI System Optimization: The findings offer a theoretical foundation for future applications of generative AI systems in the social sciences and can help improve models so that they better reproduce observed emotional diversity and structural complexity.
Limitations & Outlook
The limitations of this paper include:
- The study is limited to nine specific crisis events, which may not comprehensively represent all types of political discourse.
- The generation of synthetic discourse relies on specific prompts and parameter settings, which may affect the generalizability of the results.
- Synthetic discourse shows greater distortion in fast-moving and decentralized crisis events, possibly due to the model's limitations in handling informal and heterogeneous discourse.
Plain Language: Accessible to Non-Experts
Imagine you're at a large social gathering where everyone is enthusiastically discussing recent major events. Each person has their own perspective and emotions—some are excited, others are calm, and some are pondering solutions. This is similar to what we see on social media, where a variety of voices come together to form a complex social picture.
Now, suppose there's a robot that can mimic human speech and join these discussions. This robot is very smart and can generate seemingly fluent conversations, but its emotional expression and structure aren't as diverse as those of real humans. It might always express emotions in the same way or be overly negative in certain events.
This is the core of the study: how does synthetic discourse differ from observed real discourse during crisis events? By comparing the differences in emotion, structure, and lexicon between synthetic and real discourse, the study reveals the limitations of synthetic discourse in terms of social realism.
Just like at the gathering, we're not only interested in what each person says but also in how they say it and how these discourses reflect their true emotions and backgrounds. This way, we can better understand the role and impact of synthetic discourse in society.
ELI14: Explained Like You're 14
Hey, friends! Have you ever thought that some of the political discussions you see online might not be written by humans but generated by robots? Sounds like science fiction, right?
Actually, this is what scientists are studying. They want to know how these robot-generated texts differ from what humans write. For example, in major events, are the robot-generated texts always negative or very formal?
To study this, scientists collected a lot of online discussions, including those written by humans and those generated by robots. They found that while robot-generated texts seem fluent, they differ greatly from humans in terms of emotion and structure.
So next time you read an article online, think about this: is it really written by a human, or is there a smart robot behind it? It's an interesting thought, isn't it?
Glossary
Large Language Model
A Large Language Model is an AI model based on deep learning that can generate natural language text. It learns from large amounts of text data to understand and generate human language.
In this paper, Large Language Models are used to generate synthetic political discourse.
Computational Social Science
Computational Social Science is an interdisciplinary field that uses computational methods and tools to study social phenomena. It combines methods from social science and computer science.
The paper adopts a Computational Social Science framework to audit the social realism of synthetic discourse.
Sentiment Analysis
Sentiment Analysis is a natural language processing technique used to identify and extract emotional information from text. It is often used to determine the sentiment polarity (positive, negative, or neutral) of a text.
The paper uses VADER sentiment analysis to evaluate the emotional intensity of synthetic discourse.
Lexical-Ideological Framing
Lexical-Ideological Framing refers to how the vocabulary and expressions used in a text reflect and convey specific ideologies and viewpoints.
The paper compares synthetic and observed discourse in terms of lexical-ideological framing.
Caricature Gap
The Caricature Gap is a simple event-level measure introduced in the paper to summarize the differences between synthetic and observed discourse.
The Caricature Gap is used to assess the limitations of synthetic discourse in terms of social realism.
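This summary does not give the formula behind the Caricature Gap, so the snippet below should not be read as the paper's definition. It only illustrates the general idea of collapsing several per-dimension gaps into one event-level number. The function `illustrative_gap`, the averaging rule, the event names, and all input values are hypothetical.

```python
from statistics import mean

def illustrative_gap(dimension_gaps):
    """Average absolute gap across dimensions for one event.

    NOTE: hypothetical aggregate for illustration only. The actual
    Caricature Gap formula is not stated in this summary.
    """
    return mean([abs(gap) for gap in dimension_gaps.values()])

# Hypothetical, already-standardized gaps (synthetic minus observed).
events = {
    "fast_moving_event": {"sentiment": -0.8, "structure": -0.6, "lexicon": 0.7},
    "formal_event": {"sentiment": -0.2, "structure": -0.1, "lexicon": 0.3},
}

scores = {name: illustrative_gap(gaps) for name, gaps in events.items()}
for name, score in scores.items():
    print(f"{name}: {score:.2f}")
```

Taking absolute values keeps gaps in opposite directions from cancelling out, so a larger score always means synthetic discourse sits further from the observed population for that event.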
VADER
VADER is a sentiment analysis tool designed for social media text, capable of identifying the sentiment polarity and intensity of a text.
The paper uses VADER to evaluate the emotional intensity of synthetic discourse.
TF-IDF
TF-IDF is a statistical method used in text mining to evaluate the importance of a word in a document. It combines term frequency and inverse document frequency.
The paper uses TF-IDF to analyze lexical differences between synthetic and observed discourse.
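The way term frequency and inverse document frequency combine can be shown with a small pure-Python sketch. It uses the textbook form, raw term frequency multiplied by log(N / df). The function `tf_idf` and the three example posts are hypothetical, and the paper's exact TF-IDF variant and tokenization are not stated in this summary.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document.

    Uses tf = count / document length and idf = log(N / df).
    Libraries often apply smoothed variants, so values will differ.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))

    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

# Hypothetical posts, for illustration only.
docs = [
    "the election results are in",
    "the protest downtown is growing",
    "the election count continues",
]
weights = tf_idf(docs)
top_terms = sorted(weights[0], key=weights[0].get, reverse=True)
print(top_terms[:3])
```

A word that appears in every post ("the") receives a weight of zero, while a word unique to one post ("results") outranks one shared by two posts ("election"). This is the property that lets TF-IDF surface event-specific lexical markers.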
Social Media
Social Media refers to internet platforms that enable social interaction and information sharing.
The paper studies the differences between synthetic and observed discourse on social media.
Crisis Event
A Crisis Event is a sudden event that has a significant impact on society, such as natural disasters or political upheavals.
The paper studies the performance of synthetic discourse across nine political crisis events.
Synthetic Discourse
Synthetic Discourse refers to text generated by artificial intelligence to simulate human language expression.
The paper compares synthetic discourse with observed discourse across multiple dimensions.
Open Questions: Unanswered Questions from This Research
- Open Question 1: How does synthetic discourse perform in different cultural contexts? Existing research focuses mainly on specific political crisis events and has not comprehensively explored the performance of synthetic discourse in different cultural contexts.
- Open Question 2: How can generative models be improved to better reproduce observed emotional diversity and structural complexity? Current models have limitations in handling informal and heterogeneous discourse.
- Open Question 3: What is the long-term social impact of synthetic discourse? While this paper reveals the limitations of synthetic discourse, its long-term social impact has not been fully studied.
- Open Question 4: How can ideological bias in synthetic discourse generation be better controlled? Current models may inherit ideological biases from training data.
- Open Question 5: How does synthetic discourse perform on different types of social platforms? This paper mainly studies platforms such as Twitter, Telegram, and Reddit, leaving other platforms unexplored.
- Open Question 6: How does synthetic discourse perform in multilingual environments? Existing research focuses mainly on English text and has not comprehensively explored the performance of synthetic discourse in multilingual environments.
- Open Question 7: How can emotional intensity in synthetic discourse generation be better controlled? Current models may express emotions in an overly concentrated or overly negative way.
Applications
Immediate Applications
Political Communication Analysis
Evaluating the social realism of synthetic discourse helps identify and understand potential manipulation and influence in political communication.
Social Media Monitoring
The framework gives social media platforms a new tool for identifying and filtering potential synthetic discourse.
Generative AI System Optimization
The findings offer a theoretical foundation for future applications of generative AI systems in the social sciences and can help improve models so that they better reproduce observed emotional diversity and structural complexity.
Long-term Vision
Research on Synthetic Discourse in Multicultural Contexts
Explore the performance of synthetic discourse in different cultural contexts, helping improve the cross-cultural adaptability of models.
Research on Synthetic Discourse in Multilingual Environments
Study the performance of synthetic discourse in multilingual environments, promoting the multilingual application of generative AI systems.
Abstract
Large Language Models (LLMs) can generate fluent political text at scale, raising concerns about synthetic discourse during crises and social conflict. Existing AI-text detection often focuses on sentence-level cues such as perplexity, burstiness, or token irregularities, but these signals may weaken as generative systems improve. We instead adopt a Computational Social Science perspective and ask whether synthetic political discourse behaves like an observed online population. We construct a paired corpus of 1,789,406 posts across nine crisis events: COVID-19, the Jan. 6 Capitol attack, the 2020 and 2024 U.S. elections, Dobbs/Roe v. Wade, the 2020 BLM protests, U.S. midterms, the Utah shooting, and the U.S.-Iran war. For each event, we compare observed discourse from social platforms with synthetic discourse generated for the same context. We evaluate four dimensions: emotional intensity, structural regularity, lexical-ideological framing, and cross-event dependency, using mean gaps and dispersion evidence. Across events, synthetic discourse is fluent but population-level unrealistic. It is generally more negative and less dispersed in sentiment, structurally more regular, and lexically more abstract than observed discourse. Observed discourse instead shows broader emotional variation, longer-tailed structural distributions, and more context-specific, colloquial lexical markers. These differences are event-dependent: larger for fast-moving, decentralized crises and smaller for formal or institutionally mediated events. We summarize them with a simple event-level measure, the Caricature Gap. Our findings suggest that the main limitation of synthetic political discourse is not grammar or fluency, but reduced population realism. Population-level auditing complements traditional text-detection and provides a CSS framework for evaluating the social realism of generated discourse.