HMS-BERT: Hybrid Multi-Task Self-Training for Multilingual and Multi-Label Cyberbullying Detection
HMS-BERT uses hybrid multi-task self-training for multilingual, multi-label cyberbullying detection, achieving a macro F1-score of up to 0.9847 on the multi-label task.
Key Findings
Methodology
HMS-BERT is a hybrid multi-task self-training framework built on a pretrained multilingual BERT model. It integrates contextual representations with handcrafted linguistic features and jointly optimizes a fine-grained multi-label abuse classification task and a three-class main classification task. To address labeled data scarcity in low-resource languages, HMS-BERT introduces an iterative self-training strategy with confidence-based pseudo-labeling to facilitate cross-lingual knowledge transfer.
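The joint optimization of the two tasks can be sketched as a weighted sum of a binary cross-entropy loss (sigmoid multi-label head) and a categorical cross-entropy loss (softmax three-class head). This is a minimal illustration, not the paper's exact objective: the weighting factor `alpha` and the function names are assumptions.

```python
import numpy as np

def bce_loss(probs, targets, eps=1e-9):
    """Binary cross-entropy averaged over labels (multi-label head)."""
    return -np.mean(targets * np.log(probs + eps)
                    + (1 - targets) * np.log(1 - probs + eps))

def ce_loss(probs, target_idx, eps=1e-9):
    """Categorical cross-entropy for the three-class main head."""
    return -np.log(probs[target_idx] + eps)

def joint_loss(ml_probs, ml_targets, main_probs, main_target, alpha=0.5):
    """Weighted sum of the two task losses (alpha is illustrative)."""
    return alpha * bce_loss(ml_probs, ml_targets) + (1 - alpha) * ce_loss(main_probs, main_target)

# Toy example: sigmoid outputs over 5 abuse labels, softmax output over 3 main classes.
ml_probs = np.array([0.9, 0.1, 0.8, 0.2, 0.05])
ml_targets = np.array([1, 0, 1, 0, 0])
main_probs = np.array([0.1, 0.7, 0.2])
loss = joint_loss(ml_probs, ml_targets, main_probs, main_target=1)
print(round(loss, 4))  # → 0.2492
```

In practice the two heads would share the fused BERT-plus-features representation, so gradients from both tasks update the shared encoder.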
Key Results
- On the multi-label task, HMS-BERT achieves a macro F1-score of up to 0.9847 across four public datasets, significantly outperforming existing methods. This indicates the model's strong capability in handling multilingual and multi-label cyberbullying detection.
- For the main classification task, HMS-BERT achieves an accuracy of 0.6775, demonstrating robustness in the three-class classification task. Compared to baseline models, HMS-BERT excels in multilingual scenarios.
- Ablation studies verify the effectiveness of HMS-BERT's components, particularly the handcrafted features and self-training mechanism, which play crucial roles in enhancing model performance.
Significance
HMS-BERT's introduction holds significant implications for academia and industry. It addresses the limitations of existing methods in multilingual and multi-label scenarios, especially in low-resource languages. By combining multi-task learning and self-training strategies, the framework enhances cross-lingual generalization, providing a novel solution for multilingual cyberbullying detection.
Technical Contribution
HMS-BERT offers distinct technical contributions compared to existing state-of-the-art methods. It not only combines multi-task learning and self-training strategies but also enhances contextual understanding through handcrafted features. The framework's design provides new theoretical guarantees for multilingual multi-label learning and opens up new engineering possibilities.
Novelty
HMS-BERT is the first to integrate multi-task self-training strategies for multilingual multi-label cyberbullying detection. Compared to related work, the framework presents unique innovations in handling low-resource languages and multi-label classification tasks, particularly in cross-lingual knowledge transfer and pseudo-label generation.
Limitations
- HMS-BERT may underperform on extremely imbalanced datasets, particularly when certain categories have very few samples, limiting the model's generalization capability.
- The framework's computational complexity is relatively high, with long training times, which may not be suitable for real-time applications.
- In specific cultural contexts, handcrafted features may not fully capture the nuances of certain languages.
Future Work
Future research can expand in several directions: 1) further optimize self-training strategies to improve pseudo-label quality; 2) explore more efficient model architectures to reduce computational complexity; 3) extend to more low-resource languages to validate HMS-BERT's broad applicability.
AI Executive Summary
With the rapid rise of online communication, cyberbullying has become a pervasive and pressing social concern. Traditional methods for detecting cyberbullying have primarily focused on monolingual data, employing rule-based approaches and traditional machine learning models. However, these methods are limited in their effectiveness in multilingual and multi-label scenarios, particularly in low-resource languages.
HMS-BERT is an innovative hybrid multi-task self-training framework designed to address the challenges of multilingual and multi-label cyberbullying detection. Built upon a pretrained multilingual BERT model, HMS-BERT integrates contextual representations with handcrafted linguistic features and jointly optimizes a fine-grained multi-label abuse classification task and a three-class main classification task. To address labeled data scarcity in low-resource languages, HMS-BERT introduces an iterative self-training strategy with confidence-based pseudo-labeling to facilitate cross-lingual knowledge transfer.
In experiments, HMS-BERT demonstrates strong performance across four public datasets, achieving a macro F1-score of up to 0.9847 on the multi-label task and an accuracy of 0.6775 on the main classification task. Ablation studies further verify the effectiveness of HMS-BERT's components, particularly the handcrafted features and self-training mechanism, which play crucial roles in enhancing model performance.
HMS-BERT's introduction holds significant implications for academia and industry. It addresses the limitations of existing methods in multilingual and multi-label scenarios, especially in low-resource languages. By combining multi-task learning and self-training strategies, the framework enhances cross-lingual generalization, providing a novel solution for multilingual cyberbullying detection.
However, HMS-BERT may underperform on extremely imbalanced datasets, particularly when certain categories have very few samples, limiting the model's generalization capability. Additionally, the framework's computational complexity is relatively high, with long training times, which may not be suitable for real-time applications. Future research can expand in optimizing self-training strategies, exploring more efficient model architectures, and extending to more low-resource languages.
Deep Analysis
Background
Cyberbullying involves the transmission of abusive, threatening, or degrading content through digital channels. Compared to traditional bullying, cyberbullying is not limited by time or location, often occurring anonymously and spreading widely across platforms, causing greater psychological harm to victims. In recent years, the multilingual nature of user-generated content has underscored the need for detection systems that can operate effectively across languages. Additionally, online abuse often involves overlapping forms such as insults, discrimination, and threats, posing significant challenges for binary or single-label classifiers. Early research on cyberbullying detection primarily focused on English monolingual data, employing rule-based approaches and traditional machine learning models. However, these methods have clear limitations in capturing semantic nuances, contextual dependencies, and implicit aggression, especially in cross-domain scenarios.
Core Problem
Current methods for cyberbullying detection struggle in multilingual and multi-label scenarios. Existing approaches commonly assume monolingual or single-task formulations, which restrict their effectiveness in realistic settings, and the scarcity of labeled data in low-resource languages further limits generalization. Moreover, cyberbullying texts often contain multiple overlapping forms of aggression, making single-label classification inadequate for representing their semantic complexity. There is therefore a pressing need for a solution that can detect cyberbullying effectively in multilingual, multi-label settings.
Innovation
HMS-BERT's core innovations include its hybrid multi-task self-training framework:
- Combining multi-task learning and self-training strategies: by jointly optimizing the multi-label abuse classification task and the three-class main classification task, the framework enhances cross-lingual generalization.
- Introducing an iterative self-training strategy with confidence-based pseudo-labeling: by generating high-quality pseudo-labels, it facilitates cross-lingual knowledge transfer, particularly for low-resource languages.
- Integrating contextual representations with handcrafted linguistic features: this enhances the model's ability to detect semantic nuances and implicit aggression.
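The confidence-based pseudo-labeling step can be sketched as follows. The threshold value of 0.95 and the function name are assumptions for illustration; the paper's actual selection criterion may differ.

```python
import numpy as np

def select_pseudo_labels(probs, threshold=0.95):
    """Keep unlabeled examples the model is confident about.

    probs: (n_examples, n_classes) softmax outputs on unlabeled data.
    Returns (indices, pseudo_labels) for rows whose max class
    probability meets the (illustrative) confidence threshold.
    """
    confidence = probs.max(axis=1)
    keep = np.where(confidence >= threshold)[0]
    return keep, probs[keep].argmax(axis=1)

# Toy batch: only the first and last rows pass the 0.95 threshold.
probs = np.array([
    [0.97, 0.02, 0.01],   # confident -> pseudo-label 0
    [0.50, 0.30, 0.20],   # too uncertain, discarded
    [0.01, 0.03, 0.96],   # confident -> pseudo-label 2
])
idx, labels = select_pseudo_labels(probs)
print(idx.tolist(), labels.tolist())  # → [0, 2] [0, 2]
```

In an iterative loop, the selected examples are added to the training pool with their pseudo-labels and the model is retrained, which is how knowledge learned on labeled languages can transfer to low-resource ones.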
Methodology
The implementation of HMS-BERT involves the following key steps:
- Data Preprocessing: clean and standardize datasets from multiple sources to ensure consistent multi-label annotations.
- Input Representation: transform input text into two streams: multilingual BERT-encoded contextual semantics and handcrafted lexical features.
- Semantic Encoding: obtain sentence-level representations from the 12-layer Transformer encoder of multilingual BERT.
- Feature Enhancement: process handcrafted features through dropout and dense layers before fusing them with the BERT representations.
- Classification: use a sigmoid-activated fully connected layer for multi-label predictions and a softmax-activated fully connected layer for the three-class main task.
- Self-Training Optimization: run an iterative self-training loop that generates high-confidence pseudo-labels from unlabeled data, improving robustness and generalization.
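The encoding, fusion, and classification steps above can be summarized as a single forward pass. The layer sizes, fusion by concatenation, and random stand-in weights below are assumptions for illustration; the paper's exact architecture may differ.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Assumed dimensions: 768-d mBERT sentence vector, 16 handcrafted features,
# 5 fine-grained abuse labels, 3 main classes. Weights are random
# stand-ins for trained parameters.
D_BERT, D_HAND, D_DENSE, N_LABELS, N_CLASSES = 768, 16, 32, 5, 3
W_hand = rng.normal(size=(D_HAND, D_DENSE)) * 0.1    # dense layer on handcrafted stream
W_ml = rng.normal(size=(D_BERT + D_DENSE, N_LABELS)) * 0.01
W_main = rng.normal(size=(D_BERT + D_DENSE, N_CLASSES)) * 0.01

def forward(sent_vec, hand_feats):
    """Fuse mBERT sentence semantics with the handcrafted-feature stream,
    then score both task heads (sigmoid multi-label, softmax three-class)."""
    hand = np.maximum(hand_feats @ W_hand, 0.0)       # dense + ReLU (dropout off at inference)
    fused = np.concatenate([sent_vec, hand])
    return sigmoid(fused @ W_ml), softmax(fused @ W_main)

ml_probs, main_probs = forward(rng.normal(size=D_BERT), rng.normal(size=D_HAND))
print(ml_probs.shape, main_probs.shape)  # → (5,) (3,)
```

The key design point is that the multi-label head uses independent sigmoids (labels can co-occur) while the main head uses a softmax (the three classes are mutually exclusive).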
Experiments
The experimental design includes evaluations on four public datasets: HateXplain, Cyberbullying Classification, SCCDUser, and SCCDComment. HateXplain is used as the primary training resource for the multi-label classification task, while the remaining three datasets are used exclusively for pseudo-labeling and cross-lingual evaluation. The experiments employ metrics such as macro F1-score, accuracy, and MCC, and conduct ablation studies to verify the effectiveness of each component.
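As a concrete illustration of the headline metric, the macro F1-score over binary label columns averages per-label F1 without frequency weighting, so rare abuse categories count as much as common ones. The toy data below is illustrative, not from the paper's datasets.

```python
import numpy as np

def f1_binary(y_true, y_pred, eps=1e-12):
    """F1 for one binary label column."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    precision = tp / (tp + fp + eps)
    recall = tp / (tp + fn + eps)
    return 2 * precision * recall / (precision + recall + eps)

def macro_f1(Y_true, Y_pred):
    """Macro F1: unweighted mean of per-label F1 scores."""
    return float(np.mean([f1_binary(Y_true[:, j], Y_pred[:, j])
                          for j in range(Y_true.shape[1])]))

# Toy multi-label data: 4 examples, 3 labels; one false negative on label 1.
Y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]])
Y_pred = np.array([[1, 0, 1], [0, 1, 0], [1, 0, 0], [0, 0, 1]])
print(round(macro_f1(Y_true, Y_pred), 4))  # → 0.8889
```

This matches `sklearn.metrics.f1_score(..., average="macro")` on binarized multi-label data.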
Results
Experimental results show that HMS-BERT achieves a macro F1-score of up to 0.9847 on the multi-label task and an accuracy of 0.6775 on the main classification task. Compared to baseline models, HMS-BERT excels in multilingual scenarios, particularly in low-resource languages. Ablation studies verify the effectiveness of the handcrafted features and self-training mechanism in enhancing model performance.
Applications
HMS-BERT can be applied to real-time cyberbullying detection on multilingual online platforms, especially in low-resource language environments. The framework's design enables it to effectively handle multilingual and multi-label scenarios, providing a comprehensive abuse detection solution for social media platforms.
Limitations & Outlook
Despite HMS-BERT's strong performance in multilingual and multi-label scenarios, it may underperform on extremely imbalanced datasets. Additionally, the framework's computational complexity is relatively high, with long training times, which may not be suitable for real-time applications. Future research can expand in optimizing self-training strategies, exploring more efficient model architectures, and extending to more low-resource languages.
Plain Language (Accessible to non-experts)
Imagine you're in a large international school with students from different countries speaking various languages. The school wants to ensure every student can learn in a safe environment, so they decide to develop a system to detect any form of bullying. This system needs to understand texts in different languages and identify different types of bullying, such as insults, threats, or discrimination.
HMS-BERT is like the school's super detective, capable of handling multiple languages and identifying various bullying behaviors. It's like a smart teacher who can understand the vocabulary and expressions used by students in different languages. To achieve this, HMS-BERT uses a special training method called self-training. Just like a teacher teaching students in class, HMS-BERT continuously learns and updates its knowledge to improve its capabilities.
The system also uses some special techniques, such as combining contextual information and handcrafted features, just like a teacher noticing the tone and expressions of students. This way, HMS-BERT can more accurately identify bullying behaviors, even in less common languages.
ELI14 (Explained like you're 14)
Hey there! Imagine you're playing a super cool multiplayer online game with players from all over the world. Everyone's chatting in different languages, but sometimes you encounter some unfriendly players who might use language to attack others. To make the game environment friendlier, the game company decides to develop a super smart system to detect these unfriendly behaviors.
This system is called HMS-BERT, and it's like a superhero in the game, capable of understanding multiple languages and identifying various unfriendly behaviors. It not only understands every word you say but also determines if those words are bullying others. Just like you level up your character in the game, HMS-BERT also improves its abilities through self-training.
HMS-BERT is like a smart game admin, able to recognize unfriendly words spoken in different languages. Even if some players use less common languages, HMS-BERT can identify these behaviors with its super skills. This way, everyone can play in a safer and friendlier game environment!
Glossary
HMS-BERT
A framework for multilingual and multi-label cyberbullying detection, combining multi-task learning and self-training strategies.
HMS-BERT is the core method proposed in this paper for handling multilingual and multi-label scenarios.
Multilingual BERT (mBERT)
A pretrained language model capable of processing text in multiple languages, supporting cross-lingual semantic representation.
mBERT is the foundational model for HMS-BERT, used to generate contextual semantic representations.
Self-Training
A machine learning strategy that improves model generalization by using unlabeled data to generate pseudo-labels.
HMS-BERT uses self-training strategies to facilitate cross-lingual knowledge transfer.
Pseudo-Label
Labels generated by the model for unlabeled data, used for model updates during the self-training process.
HMS-BERT uses pseudo-labels to enhance detection capabilities in low-resource languages.
Multi-Task Learning
A learning strategy that improves overall model performance by optimizing multiple related tasks simultaneously.
HMS-BERT uses multi-task learning to optimize both multi-label and main classification tasks.
Macro F1-Score
A metric for evaluating the performance of multi-label classification models, considering precision and recall for each label.
HMS-BERT achieves a macro F1-score of up to 0.9847 on the multi-label task.
Handcrafted Features
Features manually designed to enhance the model's understanding of specific tasks.
HMS-BERT combines handcrafted features with contextual representations to improve detection accuracy.
Cross-Lingual Knowledge Transfer
Improving model performance in one language by leveraging knowledge learned in another language.
HMS-BERT achieves cross-lingual knowledge transfer through self-training strategies.
Ablation Study
An experimental method that evaluates the impact of removing certain components of a model on overall performance.
Ablation studies verify the effectiveness of HMS-BERT's components.
Main Classification Task
A task in HMS-BERT responsible for classifying text into three categories (normal, offensive, hateful).
The main classification task achieves an accuracy of 0.6775.
Open Questions (Unanswered questions from this research)
1. How can HMS-BERT's generalization capability be improved on extremely imbalanced datasets? Current methods may underperform on categories with very few samples, requiring new strategies to enhance model robustness.
2. How can HMS-BERT's computational complexity be reduced for real-time applications? The existing framework has high computational costs and long training times, limiting its use in real-time detection.
3. How can HMS-BERT's broad applicability be validated on more low-resource languages? Research needs to cover more languages and verify effectiveness across different cultural contexts.
4. How can pseudo-label quality be improved to further optimize self-training? Pseudo-label accuracy directly affects training quality, motivating more reliable generation methods.
5. How can more effective handcrafted features be designed for specific cultural contexts? Existing features may not fully capture the nuances of certain languages, requiring targeted optimization.
Applications
Immediate Applications
Social Media Platforms
HMS-BERT can be used for real-time cyberbullying detection on social media platforms, helping platform managers quickly identify and handle inappropriate content, enhancing user experience.
Educational Institutions
Educational institutions can use HMS-BERT to monitor student interactions on online learning platforms, promptly identifying and intervening in potential bullying behaviors to maintain a healthy learning environment.
Online Gaming
Online gaming companies can deploy HMS-BERT to detect inappropriate language in games, ensuring players engage in a friendly and safe environment.
Long-term Vision
Cross-Cultural Communication
HMS-BERT's multilingual capabilities can facilitate cross-cultural communication, helping people from different language backgrounds better understand and communicate, reducing misunderstandings and conflicts.
Global Cybersecurity
HMS-BERT can be part of global cybersecurity efforts, helping governments and organizations detect and prevent cyberbullying and hate speech, maintaining harmony and safety in cyberspace.
Abstract
Cyberbullying on social media is inherently multilingual and multi-faceted, where abusive behaviors often overlap across multiple categories. Existing methods are commonly limited by monolingual assumptions or single-task formulations, which restrict their effectiveness in realistic multilingual and multi-label scenarios. In this paper, we propose HMS-BERT, a hybrid multi-task self-training framework for multilingual and multi-label cyberbullying detection. Built upon a pretrained multilingual BERT backbone, HMS-BERT integrates contextual representations with handcrafted linguistic features and jointly optimizes a fine-grained multi-label abuse classification task and a three-class main classification task. To address labeled data scarcity in low-resource languages, an iterative self-training strategy with confidence-based pseudo-labeling is introduced to facilitate cross-lingual knowledge transfer. Experiments on four public datasets demonstrate that HMS-BERT achieves strong performance, attaining a macro F1-score of up to 0.9847 on the multi-label task and an accuracy of 0.6775 on the main classification task. Ablation studies further verify the effectiveness of the proposed components.