Interpretable Semantic Gradients in SSD: A PCA Sweep Approach and a Case Study on AI Discourse

TL;DR

Proposed a PCA sweep method to optimize dimension selection in SSD, enhancing interpretability and stability.

cs.CL 2026-03-13
Hubert Plisiecki Maria Leniarska Jan Piotrowski Marcin Zajenkowski
semantic gradient PCA interpretability AI text analysis

Key Findings

Methodology

The paper introduces a novel PCA sweep approach to optimize dimension selection in Supervised Semantic Differential (SSD). This method selects the appropriate number of dimensions K by jointly optimizing representation capacity, gradient interpretability, and stability. Specifically, the PCA sweep evaluates a sequence of K values, fits SSD at each K, and tracks cluster coherence, gradient alignment, and result stability based on cosine differences.

Key Results

  • Result 1: For the Admiration-related semantic gradient, the model explained 19% of the variance (R²_adj = 0.19, F = 6.32, p < 0.0001, r = 0.47), contrasting optimistic AI collaboration with distrustful discourse.
  • Result 2: The Rivalry model did not reach significance (R²_adj = 0.03, p = 0.095), indicating no robust semantic alignment for this trait.
  • Result 3: Comparison with a high-dimensional PCA solution showed it produced diffuse, weakly structured clusters, underscoring the value of the sweep-based choice of K.

Significance

By introducing the PCA sweep method, the study significantly reduces researcher degrees of freedom in SSD analysis while maintaining its interpretive aims. This method supports transparent and psychologically meaningful analyses of connotative meaning, particularly in text semantic analysis involving individual differences. The case study on AI discourse demonstrates how dimension selection can be constrained without sacrificing interpretability, thereby enhancing analysis transparency and stability.

Technical Contribution

The technical contribution lies in proposing a new PCA sweep method that not only optimizes dimension selection in SSD but also provides a systematic criterion by jointly optimizing representation capacity, gradient interpretability, and stability. This method reduces the risk of overfitting, increases analysis transparency, and provides a more robust foundation for interpreting semantic gradients.

Novelty

This study is the first to apply the PCA sweep method in SSD, providing a systematic criterion for dimension selection. Unlike traditional methods, this approach considers not only representation capacity but also integrates gradient interpretability and stability, ensuring reliable semantic analysis.

Limitations

  • Limitation 1: The case study is based on a relatively small dataset of AI-related short posts, limiting the generalizability of semantic gradients to other populations and task contexts.
  • Limitation 2: The analysis operates at the level of whole-text representations rather than concept-specific lexicon-based PCVs, which may blur finer-grained semantic distinctions.
  • Limitation 3: The PCA sweep addresses only one source of flexibility in SSD, namely dimensionality selection, while other design choices like embedding model selection remain open.

Future Work

Future directions include developing similarly principled criteria for upstream modeling choices, particularly the selection of the embedding model. Different embedding spaces encode distinct cultural, temporal, and stylistic regularities, and these choices can significantly shape the structure of recovered gradients. Extending the logic of stability- and interpretability-based diagnostics to the level of model selection represents an important next step toward a fully transparent and methodologically grounded SSD workflow.

AI Executive Summary

Supervised Semantic Differential (SSD) is a mixed-method approach for analyzing how text semantics vary with individual difference variables. However, it lacks a systematic standard for selecting dimensions in Principal Component Analysis (PCA), leading to excessive researcher freedom and increased risk of overfitting. This paper proposes a PCA sweep method that provides a systematic criterion for dimension selection by jointly optimizing representation capacity, gradient interpretability, and stability.

In a case study on AI discourse, the PCA sweep method successfully identified a stable and interpretable Admiration-related semantic gradient, contrasting optimistic AI collaboration with distrustful and derisive discourse. In contrast, the Rivalry trait did not form a robust semantic alignment, highlighting the method's differential performance across traits.

The PCA sweep method reduces researcher degrees of freedom while maintaining SSD's interpretive aims, supporting transparent and psychologically meaningful analyses of connotative meaning. By jointly optimizing representation capacity, gradient interpretability, and stability, it lowers the risk of overfitting, increases analysis transparency, and provides a more robust foundation for interpreting semantic gradients.

However, the case study draws on a relatively small dataset of AI-related short posts, which limits the generalizability of the recovered gradients to other populations and task contexts, and the analysis operates on whole-text representations rather than concept-specific lexicon-based PCVs, which may blur finer-grained semantic distinctions. Future directions include developing similarly principled criteria for upstream modeling choices, particularly the selection of the embedding model: different embedding spaces encode distinct cultural, temporal, and stylistic regularities, and these choices can significantly shape the structure of recovered gradients. Extending stability- and interpretability-based diagnostics to the level of model selection is an important next step toward a fully transparent and methodologically grounded SSD workflow.

Deep Analysis

Background

Supervised Semantic Differential (SSD) is an emerging mixed-method approach designed to analyze how text semantics vary with individual difference variables. The method is inspired by long-standing psychological semantic differential methods that measure the connotative meaning of concepts via polar concept opposites (e.g., warm/cold, strong/weak) and by modern distributional semantics, where word embeddings model relational structure in language use. SSD combines these traditional and modern approaches by estimating a semantic gradient in an embedding space and interpreting its poles through clustering and text retrieval. Although SSD applies PCA to reduce redundant dimensions in small corpora, there is currently no systematic method for choosing the number of components to retain, introducing avoidable researcher degrees of freedom, increasing the risk of overfitting, reducing the transparency of the analysis pipeline, and potentially biasing substantive interpretations of the resulting semantic gradients.

Core Problem

SSD lacks a systematic standard for selecting PCA dimensions. Although the original results were relatively stable across similar component ranges, leaving the choice of dimensionality to the researcher introduces avoidable researcher degrees of freedom, increases the risk of overfitting, reduces the transparency of the analysis pipeline, and can bias substantive interpretations of the resulting semantic gradients.

Innovation

The paper introduces a novel PCA sweep approach to optimize dimension selection in SSD. This method selects the appropriate number of dimensions K by jointly optimizing representation capacity, gradient interpretability, and stability. Specifically, the PCA sweep evaluates a sequence of K values, fits SSD at each K, and tracks cluster coherence, gradient alignment, and result stability based on cosine differences. Unlike traditional methods, this approach considers not only representation capacity but also integrates gradient interpretability and stability, ensuring reliable semantic analysis.

Methodology

  • The PCA sweep evaluates a sequence of candidate K values and fits SSD at each K.
  • Cluster coherence is calculated at each K as an interpretability criterion.
  • Gradient alignment with the clusters and result stability are assessed via cosine differences.
  • The dimensionality K is selected by jointly optimizing representation capacity, gradient interpretability, and stability.
  • Plateau-sensitive smoothing with a local neighborhood average emphasizes broad, stable plateaus over sharp spikes.
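The core of the sweep loop can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names (`pca_sweep`, `fit_gradient`) are hypothetical, and the choice to compare gradients at neighbouring K values on their shared leading components (which PCA's nested structure permits) is an assumption made for the sketch.

```python
import numpy as np

def pca_project(X, k):
    """Project rows of X onto the top-k principal components via SVD."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T

def fit_gradient(Z, y):
    """Least-squares semantic gradient: the direction in the reduced
    space that best predicts the trait score y."""
    beta, *_ = np.linalg.lstsq(Z, y - y.mean(), rcond=None)
    return beta / np.linalg.norm(beta)

def pca_sweep(X, y, ks):
    """Fit the gradient at each candidate K and track stability as the
    cosine difference between gradients at neighbouring K values,
    compared on their shared leading components."""
    grads, stability = {}, {}
    for k in ks:
        grads[k] = fit_gradient(pca_project(X, k), y)
    for k_prev, k in zip(ks, ks[1:]):
        a, b = grads[k_prev], grads[k][:k_prev]
        cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)
        stability[k] = 1.0 - cos  # lower = more stable
    return grads, stability
```

In the full method, this stability track would be combined with cluster coherence and gradient alignment, smoothed over a local neighbourhood, and the K on the broadest stable plateau selected.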

Experiments

The experimental design is based on a dataset of short posts about artificial intelligence written by Prolific participants who also completed Admiration and Rivalry narcissism scales. Tokenization and linguistic preprocessing were performed using spaCy, and embeddings were generated using the Dolma GloVe model. The PCA sweep method evaluates a sequence of K values, fits SSD at each K, and tracks cluster coherence, gradient alignment, and result stability based on cosine differences. The number of clusters was chosen by silhouette from a range, and other settings followed the original SSD configuration.
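The glossary below notes that text embeddings used Smooth Inverse Frequency (SIF) weighting over the GloVe vectors. A minimal sketch of SIF-style averaging, in the spirit of Arora et al. (2017), is shown here; `sif_embed`, the default `a=1e-3`, and the input format are illustrative assumptions, and the full SIF recipe additionally removes the first principal component of the embeddings, which is omitted for brevity.

```python
import numpy as np

def sif_embed(tokens_per_text, vectors, word_freq, a=1e-3):
    """SIF text embeddings: each text is the weighted average of its
    word vectors, with weight a / (a + p(w)) down-weighting frequent
    words (p(w) is the word's relative corpus frequency)."""
    total = sum(word_freq.values())
    dim = len(next(iter(vectors.values())))
    embs = []
    for tokens in tokens_per_text:
        acc, n = np.zeros(dim), 0
        for w in tokens:
            if w in vectors:
                p = word_freq.get(w, 0) / total
                acc += (a / (a + p)) * np.asarray(vectors[w], float)
                n += 1
        embs.append(acc / max(n, 1))  # empty texts map to the zero vector
    return np.vstack(embs)
```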

Results

For the Admiration-related semantic gradient, the model explained 19% of the variance (R²_adj = 0.19, F = 6.32, p < 0.0001, r = 0.47), contrasting optimistic AI collaboration with distrustful and derisive discourse. The Rivalry model did not reach significance (R²_adj = 0.03, p = 0.095), indicating no robust semantic alignment for this trait. Comparison with a high-dimensional PCA solution showed it produced diffuse, weakly structured clusters, underscoring the value of the sweep-based choice of K.
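For reference, the adjusted R² reported above penalizes plain R² for model complexity. A minimal helper illustrates the standard formula; the paper's exact sample size and predictor count are not restated here, so the values in the sketch are placeholders.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R-squared: 1 - (1 - R^2) * (n - 1) / (n - p - 1),
    where n is the number of observations and p the number of
    predictors. Larger p shrinks the adjusted value."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)
```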

Applications

The PCA sweep method can be directly applied in scenarios requiring analysis of text semantics in relation to individual differences, such as psychological research and social science studies. By reducing researcher degrees of freedom and enhancing analysis transparency and stability, this method provides a more reliable tool for semantic analysis in these fields.

Limitations & Outlook

The case study is based on a relatively small dataset of AI-related short posts, limiting the generalizability of semantic gradients to other populations and task contexts. The analysis operates at the level of whole-text representations rather than concept-specific lexicon-based PCVs, which may blur finer-grained semantic distinctions. The PCA sweep addresses only one source of flexibility in SSD, namely dimensionality selection, while other design choices like embedding model selection remain open. Future directions include developing similarly principled criteria for upstream modeling choices, particularly the selection of the embedding model.

Plain Language Accessible to non-experts

Imagine you're in a kitchen cooking a meal. You have a variety of ingredients, but you're not sure how many to use to make a delicious dish. The PCA sweep method is like a smart chef who tries different combinations of ingredients to find the perfect amount, making the dish both tasty and nutritious. In this study, researchers used the PCA sweep method to select the right number of dimensions, just like the chef chooses the right amount of ingredients. By trying different combinations of dimensions, they found a method that could explain changes in text semantics while keeping the results stable. It's like finding the perfect recipe that makes the dish both delicious and healthy. Through this method, researchers can better analyze the relationship between text semantics and individual differences, just like a chef can make a more delicious dish.

ELI14 Explained like you're 14

Hey there! Imagine you're playing a super complex puzzle game. This game has a lot of pieces, and you need to find the right number of pieces to complete a picture. The PCA sweep method is like a puzzle master who tries different combinations of pieces to find the perfect number, making the puzzle both complete and beautiful. In this study, scientists used the PCA sweep method to select the right number of dimensions, just like the puzzle master chooses the right number of pieces. By trying different combinations of dimensions, they found a method that could explain changes in text semantics while keeping the results stable. It's like finding the perfect puzzle solution that makes the picture both complete and beautiful. Through this method, scientists can better analyze the relationship between text semantics and individual differences, just like a puzzle master can create a more beautiful picture. Isn't that cool?

Glossary

Supervised Semantic Differential (SSD)

A mixed quantitative-interpretive method for analyzing how text semantics vary with individual difference variables.

Used to estimate a semantic gradient in an embedding space and interpret its poles.

Principal Component Analysis (PCA)

A linear dimensionality reduction technique that reduces data dimensions by extracting the most important components.

Used to reduce redundant dimensions in SSD.

Semantic Gradient

A direction estimated in an embedding space to represent changes in text semantics.

Interpreted through clustering and text retrieval.

Cosine Difference

A dissimilarity measure between two vectors, typically one minus their cosine similarity; lower values indicate greater similarity.

Used to assess result stability.
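Concretely, the quantity can be computed as below; this tiny sketch is illustrative and the function name is hypothetical.

```python
import numpy as np

def cosine_difference(a, b):
    """1 - cosine similarity: 0 for identical directions,
    1 for orthogonal vectors, 2 for opposite directions."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    return 1.0 - a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
```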

Admiration

A narcissistic trait reflecting assertive self-enhancement and status-seeking tendencies.

Analyzed in the case study on AI discourse.

Rivalry

A narcissistic trait reflecting defensive, antagonistic self-protection in response to perceived threat.

Analyzed in the case study on AI discourse.

Dolma GloVe Model

A pre-trained word vector model used for text embeddings.

Used to embed AI discourse into a 300-dimensional space.

Cluster Coherence

A measure of similarity within a cluster, with higher values indicating greater coherence.

Part of the interpretability criterion.

Silhouette Coefficient

A measure of cluster separation and tightness, with higher values indicating better clustering.

Used to select the number of clusters.
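A minimal NumPy sketch of the coefficient is given below for concreteness; in practice a standard implementation (e.g., scikit-learn's `silhouette_score`) would likely be used, and this hand-rolled version is an illustrative assumption, not the paper's code.

```python
import numpy as np

def silhouette(X, labels):
    """Mean silhouette coefficient: for each point, (b - a) / max(a, b),
    where a is its mean distance to its own cluster and b the smallest
    mean distance to any other cluster. Higher is better."""
    X, labels = np.asarray(X, float), np.asarray(labels)
    D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)  # pairwise distances
    scores = []
    for i, li in enumerate(labels):
        own = labels == li
        if own.sum() < 2:          # singleton clusters score 0 by convention
            scores.append(0.0)
            continue
        a = D[i][own].sum() / (own.sum() - 1)  # excludes the zero self-distance
        b = min(D[i][labels == l].mean() for l in set(labels) if l != li)
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))
```

Selecting the number of clusters then amounts to computing this score for each candidate clustering over a range and keeping the maximizer.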

Smooth Inverse Frequency (SIF) Weighting

A word vector weighting method that reduces noise through inverse frequency smoothing.

Used for embedding AI discourse.

Open Questions Unanswered questions from this research

  • 1 How can the effectiveness of the PCA sweep be validated on larger and more diverse datasets? The current study is based on a relatively small dataset of AI-related short posts, limiting the generalizability of semantic gradients to other populations and task contexts.
  • 2 How should the most suitable embedding model be selected in SSD? Different embedding spaces encode distinct cultural, temporal, and stylistic regularities, and these choices can significantly shape the structure of recovered gradients.
  • 3 How can concept-specific lexicons be better integrated into SSD? The current analysis operates at the level of whole-text representations, which may blur finer-grained semantic distinctions.
  • 4 How can researcher degrees of freedom in SSD be reduced further? The PCA sweep addresses only the flexibility of dimension selection, while other design choices, such as embedding model selection, remain open.
  • 5 Does SSD hold up across languages and cultural contexts? The current study is primarily based on English texts, and future work needs to validate the method in multilingual and multicultural settings.

Applications

Immediate Applications

Psychological Research

Using the PCA sweep method, researchers can more accurately analyze the relationship between text semantics and individual psychological traits, enhancing interpretability and stability.

Social Science Research

In social science research, the PCA sweep method can be used to analyze the relationship between text semantics and social behavior, providing a more reliable analysis tool.

Text Analysis Tool Development

The PCA sweep method can be integrated into text analysis tools to help users better understand changes in text semantics and improve analysis accuracy.

Long-term Vision

Cross-Cultural Semantic Analysis

By applying the PCA sweep method in different languages and cultural contexts, researchers can reveal cross-cultural semantic differences and promote the development of global semantic analysis.

Automated Semantic Analysis Systems

Develop automated semantic analysis systems based on the PCA sweep method to help businesses and research institutions conduct text analysis and decision support more efficiently.

Abstract

Supervised Semantic Differential (SSD) is a mixed quantitative-interpretive method that models how text meaning varies with continuous individual-difference variables by estimating a semantic gradient in an embedding space and interpreting its poles through clustering and text retrieval. SSD applies PCA before regression, but currently no systematic method exists for choosing the number of retained components, introducing avoidable researcher degrees of freedom in the analysis pipeline. We propose a PCA sweep procedure that treats dimensionality selection as a joint criterion over representation capacity, gradient interpretability, and stability across nearby values of K. We illustrate the method on a corpus of short posts about artificial intelligence written by Prolific participants who also completed Admiration and Rivalry narcissism scales. The sweep yields a stable, interpretable Admiration-related gradient contrasting optimistic, collaborative framings of AI with distrustful and derisive discourse, while no robust alignment emerges for Rivalry. We also show that a counterfactual high-dimensional PCA solution produces diffuse, weakly structured clusters instead, reinforcing the value of the sweep-based choice of K. The case study shows how the PCA sweep constrains researcher degrees of freedom while preserving SSD's interpretive aims, supporting transparent and psychologically meaningful analyses of connotative meaning.

