Sentiment and Emotion Classification of Indonesian E-Commerce Reviews via Multi-Task BiLSTM and AutoML Benchmarking

TL;DR

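This paper benchmarks TF-IDF with PyCaret AutoML against multi-task BiLSTM and TextCNN models on Indonesian e-commerce reviews. The AutoML track reaches an F1 of 0.9574 on binary sentiment, while five-class emotion classification peaks at a Macro-F1 of 0.5077 (TextCNN).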

cs.CL · Advanced · 2026-04-28
Hermawan Manurung Ibrahim Al-Kahfi Ahmad Rizqi Martin Clinton Tosima Manullang
Sentiment Analysis Emotion Recognition Indonesian NLP BiLSTM AutoML

Key Findings

Methodology

This paper presents a dual-track classification pipeline applied to the PRDECT-ID dataset. The first track uses TF-IDF vectorization with PyCaret AutoML for cross-validation of standard classifiers. The second track is a PyTorch Bidirectional Long Short-Term Memory (BiLSTM) network with a shared encoder and two task-specific output heads. A preprocessing module applies 14 sequential cleaning steps, including a 140-entry slang dictionary compiled from marketplace corpora. Four configurations are benchmarked: BiLSTM Baseline, BiLSTM Improved, BiLSTM Large, and TextCNN. Training uses class-weighted cross-entropy loss, ReduceLROnPlateau scheduling, and early stopping.
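As a rough illustration of what a few of the cleaning steps might look like, the sketch below lowercases a review, strips URLs, collapses elongated characters, and expands slang through a lookup table. The four dictionary entries, the step order, and the function name `normalize_review` are hypothetical; the paper's actual 140-entry dictionary and its 14 steps are not reproduced here. Emoji are simply dropped in this sketch, which is also an assumption.

```python
import re

# Hypothetical mini slang table; the paper's dictionary has 140 entries.
SLANG = {
    "bgt": "banget",
    "gk": "tidak",
    "brg": "barang",
    "yg": "yang",
}

URL_RE = re.compile(r"https?://\S+")
REPEAT_RE = re.compile(r"(.)\1{2,}")   # three or more of the same character
TOKEN_RE = re.compile(r"[a-z0-9]+")    # keeps only ASCII letters and digits


def normalize_review(text: str) -> str:
    """Apply a handful of cleaning steps in sequence (illustrative only)."""
    text = text.lower()
    text = URL_RE.sub(" ", text)               # remove links
    text = REPEAT_RE.sub(r"\1", text)          # "bagussss" -> "bagus"
    tokens = TOKEN_RE.findall(text)            # drops punctuation and emoji
    return " ".join(SLANG.get(tok, tok) for tok in tokens)


print(normalize_review("Brg bagussss bgt, gk nyesel! https://example.com/x"))
# prints "barang bagus banget tidak nyesel"
```

Collapsing every run of three or more characters to a single one is a blunt rule; a production pipeline would likely need exceptions for words with legitimate repeated letters.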

Key Results

  • In the binary sentiment classification task, the TF-IDF with Best AutoML model performed best, achieving accuracy, precision, recall, and F1 all at 0.9574. In contrast, the deep learning models' F1 scores ranged from 0.8474 to 0.8609.
  • For the five-class emotion classification task, the TextCNN model performed best, with an accuracy of 0.5399, Macro-F1 of 0.5077, and AUC of 0.8458.
  • The experimental results indicate that emotion classification is substantially more challenging than sentiment classification, as reflected in the lower performance achieved by all models.
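Because the emotion results are reported as Macro-F1, it may help to see how that metric differs from accuracy. The sketch below computes Macro-F1 from scratch on a toy label set; the labels and predictions are invented for illustration and are not taken from the paper.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, so rare classes count as much as common ones."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1    # predicted class gains a false positive
            fn[truth] += 1   # true class gains a false negative
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy data: four "Happy" reviews, two "Fear" reviews, one "Fear" misread as "Happy".
y_true = ["Happy"] * 4 + ["Fear"] * 2
y_pred = ["Happy"] * 4 + ["Happy", "Fear"]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(round(accuracy, 3), round(macro_f1(y_true, y_pred), 3))
```

In this toy case accuracy is 5/6 while Macro-F1 is 7/9, because the single error halves the recall of the minority class.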

Significance

This study addresses sentiment and emotion classification in Indonesian e-commerce reviews with a hybrid approach that pairs traditional machine learning with deep learning. The approach is designed for informal text containing slang, regional loanwords, and emojis. The findings contribute to NLP research for low-resource languages and have practical applications in industry, particularly for e-commerce platforms that need automated sentiment analysis.

Technical Contribution

The technical contributions of this paper include a hybrid framework that pairs TF-IDF and AutoML with deep learning models (BiLSTM and TextCNN) trained in a multi-task setup. The framework also provides a model registration system that allows configuration switching via a simple string key, and a preprocessing module intended to make the models more robust to informal text.
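A registry keyed by strings is a common way to switch configurations, and a minimal sketch of the idea follows. The names `MODEL_REGISTRY`, `register`, and `build`, the keys, and every hyperparameter value are hypothetical placeholders; the source does not describe how its registration system is implemented.

```python
from typing import Callable, Dict

# Maps a string key to a factory that returns a configuration.
MODEL_REGISTRY: Dict[str, Callable[[], dict]] = {}


def register(name: str):
    """Decorator that stores a config factory under the given key."""
    def decorator(factory):
        MODEL_REGISTRY[name] = factory
        return factory
    return decorator


@register("bilstm_baseline")
def bilstm_baseline() -> dict:
    # Placeholder values, not the paper's settings.
    return {"arch": "bilstm", "hidden_size": 128, "num_layers": 1}


@register("textcnn")
def textcnn() -> dict:
    # Placeholder values, not the paper's settings.
    return {"arch": "textcnn", "kernel_sizes": (3, 4, 5)}


def build(name: str) -> dict:
    """Look up a configuration by key, with a readable error for typos."""
    if name not in MODEL_REGISTRY:
        raise ValueError(f"unknown model {name!r}; choices: {sorted(MODEL_REGISTRY)}")
    return MODEL_REGISTRY[name]()


print(build("textcnn")["arch"])
```

With this pattern, an experiment script only needs to change one string to benchmark a different configuration.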

Novelty

This study is the first to combine multi-task BiLSTM with AutoML for sentiment and emotion classification of Indonesian e-commerce reviews. Compared to previous studies, this approach not only improves classification accuracy but also addresses the diversity issues in informal texts through detailed preprocessing steps.

Limitations

  • The method's performance in emotion classification tasks still has room for improvement, especially when dealing with class imbalance.
  • While the preprocessing module is effective, its complexity may lead to longer processing times, affecting real-time applications.
  • The model may perform poorly when dealing with extremely informal or emerging slang.

Future Work

Future research directions include optimizing the performance of emotion classification models, particularly in class imbalance scenarios. Additionally, more efficient preprocessing methods could be explored to reduce processing time and enhance real-time application feasibility. The framework could also be applied to sentiment and emotion analysis of other low-resource languages.

AI Executive Summary

Every day, millions of product reviews are written on Indonesian e-commerce platforms. These reviews contain not only standard vocabulary but also slang, regional loanwords, numeric shorthands, and emojis, making lexicon-based sentiment analysis tools unreliable in practice. Existing studies have shown that deep learning models perform well on user-generated review texts, but Indonesian marketplace reviews remain challenging due to their informal vocabulary, domain-specific abbreviations, and spelling variations.

This paper proposes a hybrid method combining traditional machine learning and deep learning, applied to the PRDECT-ID dataset. This dataset contains 5,400 product reviews from 29 Indonesian e-commerce categories, each labeled for binary sentiment (Positive/Negative) and five-class emotion (Happy, Sad, Fear, Love, Anger). The study employs two classification pipelines: the first uses TF-IDF vectorization with PyCaret AutoML for cross-validation of standard classifiers; the second is a PyTorch Bidirectional Long Short-Term Memory (BiLSTM) network with a shared encoder and two task-specific output heads.

In experiments, the TF-IDF with Best AutoML model performed best in the binary sentiment classification task, achieving accuracy, precision, recall, and F1 all at 0.9574. In contrast, the deep learning models' F1 scores ranged from 0.8474 to 0.8609. For the five-class emotion classification task, the TextCNN model performed best, with an accuracy of 0.5399, Macro-F1 of 0.5077, and AUC of 0.8458.

The results indicate that emotion classification is substantially more challenging than sentiment classification, as reflected in the lower performance achieved by all models. The study addresses both tasks for Indonesian e-commerce reviews with a hybrid approach that pairs traditional machine learning with deep learning, and it is designed for informal text containing slang, regional loanwords, and emojis.

Future research directions include optimizing the performance of emotion classification models, particularly in class imbalance scenarios. Additionally, more efficient preprocessing methods could be explored to reduce processing time and enhance real-time application feasibility. The framework could also be applied to sentiment and emotion analysis of other low-resource languages.

Deep Analysis

Background

Sentiment analysis and emotion recognition are two crucial research areas in natural language processing. With the advancement of deep learning, both have progressed significantly in recent years, including for low-resource languages. Existing studies have shown that deep learning models perform well on user-generated review texts, but Indonesian marketplace reviews remain challenging due to their informal vocabulary, domain-specific abbreviations, and spelling variations.

Core Problem

Indonesian e-commerce reviews mix standard vocabulary with slang, regional loanwords, numeric shorthands, and emojis, making lexicon-based sentiment analysis tools unreliable in practice. Existing studies have shown that deep learning models perform well on user-generated review texts, but Indonesian marketplace reviews remain challenging due to their informal vocabulary, domain-specific abbreviations, and spelling variations.

Innovation

The core innovations of this paper include a hybrid framework that pairs TF-IDF and AutoML with deep learning models (BiLSTM and TextCNN) trained in a multi-task setup. The framework also provides a model registration system that allows configuration switching via a simple string key, and a preprocessing module intended to make the models more robust to informal text.

Methodology

  • Use TF-IDF vectorization to extract features from preprocessed text sequences.
  • Employ the PyCaret AutoML framework to train and cross-validate a series of classification models, including Logistic Regression, Random Forest, LightGBM, Extra Trees, and SVM.
  • Implement a PyTorch Bidirectional Long Short-Term Memory (BiLSTM) network with a shared encoder and two task-specific output heads.
  • Apply a preprocessing module with 14 sequential cleaning steps, including a 140-entry slang dictionary compiled from marketplace corpora.
  • Benchmark four configurations: BiLSTM Baseline, BiLSTM Improved, BiLSTM Large, and TextCNN.
  • Train using class-weighted cross-entropy loss, ReduceLROnPlateau scheduling, and early stopping.
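The shared-encoder, two-head design in the list above can be written down compactly. The formulation below is one common way to express it, not necessarily the authors': pooling the encoder output into a single vector $h$, the weighting factor $\lambda$, and the summed objective are assumptions, since the source does not state how the two task losses are combined.

```latex
\begin{aligned}
h &= \mathrm{BiLSTM}_{\theta}(x_1, \dots, x_T)
  && \text{shared encoder over the token sequence} \\
\hat{y}^{\text{sent}} &= \mathrm{softmax}(W_s h + b_s) \in \mathbb{R}^{2}
  && \text{sentiment head (Positive/Negative)} \\
\hat{y}^{\text{emo}} &= \mathrm{softmax}(W_e h + b_e) \in \mathbb{R}^{5}
  && \text{emotion head (Happy, Sad, Fear, Love, Anger)} \\
\mathcal{L} &= \mathcal{L}_{\text{sent}} + \lambda \, \mathcal{L}_{\text{emo}}
  && \text{joint objective (assumed form)}
\end{aligned}
```

Both heads read the same $h$, so gradients from either task update the shared encoder parameters $\theta$.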

Experiments

The experimental design includes using the PRDECT-ID dataset, which contains 5,400 product reviews from 29 Indonesian e-commerce categories. The study employs two classification pipelines: the first uses TF-IDF vectorization with PyCaret AutoML for cross-validation of standard classifiers; the second is a PyTorch Bidirectional Long Short-Term Memory (BiLSTM) network with a shared encoder and two task-specific output heads. The experiments use class-weighted cross-entropy loss, ReduceLROnPlateau scheduling, and early stopping.

Results

In the binary sentiment classification task, the TF-IDF with Best AutoML model performed best, achieving accuracy, precision, recall, and F1 all at 0.9574. In contrast, the deep learning models' F1 scores ranged from 0.8474 to 0.8609. For the five-class emotion classification task, the TextCNN model performed best, with an accuracy of 0.5399, Macro-F1 of 0.5077, and AUC of 0.8458. The results indicate that emotion classification is substantially more challenging than sentiment classification, as reflected in the lower performance achieved by all models.

Applications

The applications of this study include automated sentiment analysis for Indonesian e-commerce platforms, particularly for scenarios requiring processing of informal texts containing slang, regional loanwords, and emojis. This method can help e-commerce platforms better understand user sentiment and emotion, thereby improving user experience and satisfaction.

Limitations & Outlook

The method's performance in emotion classification tasks still has room for improvement, especially when dealing with class imbalance. While the preprocessing module is effective, its complexity may lead to longer processing times, affecting real-time applications. The model may perform poorly when dealing with extremely informal or emerging slang. Future research directions include optimizing the performance of emotion classification models, particularly in class imbalance scenarios. Additionally, more efficient preprocessing methods could be explored to reduce processing time and enhance real-time application feasibility.

Plain Language (accessible to non-experts)

Imagine you're shopping in a marketplace with a variety of products and customers. Each customer leaves a review after purchasing, some positive, some negative. Our task is to automatically identify the sentiment and emotion of these reviews. It's like having a smart assistant that can quickly read each review and tell you if the customer is happy, sad, or angry.

To achieve this, we use a clever method that combines traditional statistical methods with modern machine learning techniques. First, we act like a librarian, counting the frequency of each word in the reviews, then use this information to help us understand the overall sentiment of the review.

Next, we use an advanced technique called BiLSTM, which is like a reader that can look both forward and backward, better understanding the context of the review. Finally, we use a technique called TextCNN, which is like a magnifying glass that identifies important words in the review, helping us more accurately identify emotions.

Through these methods, we can more accurately understand the sentiment and emotion of customers, like an experienced market analyst who can quickly identify customer satisfaction and dissatisfaction.

ELI14 (explained like you're 14)

Hey there! Did you know that when you shop online, every review you write tells the seller how happy you are with their product? Imagine if there was a super-smart robot that could read every review and tell the seller if you're happy, sad, or angry. How cool would that be?

This research is like giving that robot a super brain. First, it acts like a super detective, analyzing every word in each review to see which words appear the most. Then, it uses this information to guess the overall sentiment of the review.

Next, it uses a technique called BiLSTM, which is like a super reader that can look both forward and backward, understanding the meaning of the review better. Finally, it uses a technique called TextCNN, which is like a super magnifying glass that identifies important words in the review, helping it more accurately identify emotions.

With these methods, this robot can more accurately understand your sentiment and emotion, like an experienced market analyst who can quickly identify your satisfaction and dissatisfaction. Isn't that cool!

Glossary

Sentiment Analysis

Sentiment analysis is a natural language processing technique used to identify and classify the sentiment expressed in text, such as positive, negative, or neutral.

In this paper, sentiment analysis is used to identify whether an Indonesian e-commerce review is positive or negative.

Emotion Recognition

Emotion recognition involves identifying and classifying more nuanced emotional states in text, such as happiness, sadness, anger, etc.

In this paper, emotion recognition is used to classify five emotion categories in Indonesian e-commerce reviews.

Bidirectional Long Short-Term Memory (BiLSTM)

BiLSTM is a recurrent neural network that reads a sequence in both the forward and backward directions and combines the two passes, so the representation at each position reflects context on both sides.

In this paper, BiLSTM is used to process the contextual information in review texts.

AutoML

AutoML is a technology for automating the selection, training, and optimization of machine learning models.

In this paper, AutoML is used to select the best sentiment classification model.

TF-IDF

TF-IDF is a statistical measure of how important a word is to a document within a corpus: it grows with the word's frequency in the document (term frequency) and shrinks as the word appears in more documents (inverse document frequency).

In this paper, TF-IDF is used to extract features from review texts.
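To make the definition concrete, here is a small from-scratch sketch of the textbook form, weight = (count / document length) × log(N / document frequency). The three toy reviews are invented, and the exact variant used in the paper is not stated; library implementations such as scikit-learn's `TfidfVectorizer` default to a smoothed IDF with L2 normalization, so their numbers would differ.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))      # count each term once per document
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


# Toy corpus (hypothetical reviews, already lowercased and cleaned).
docs = ["barang bagus", "barang rusak", "pengiriman cepat bagus"]
vectors = tfidf(docs)
print(vectors[1])
```

In the second review, "rusak" occurs in only one document and therefore receives a higher weight than "barang", which occurs in two.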

TextCNN

TextCNN is a convolutional neural network architecture specifically designed for text classification tasks, capable of identifying local features in text.

In this paper, TextCNN is used for emotion classification tasks.

Preprocessing

Preprocessing is the process of cleaning and transforming raw data to facilitate better analysis.

In this paper, preprocessing includes normalizing slang and cleaning special symbols in review texts.

ReduceLROnPlateau

ReduceLROnPlateau is a learning rate scheduling strategy that automatically reduces the learning rate when model performance ceases to improve.

In this paper, this strategy is used to optimize model training.
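The plateau rule itself is simple enough to sketch without a deep learning framework. The class below is a simplified stand-in, not PyTorch's implementation (which adds options such as a threshold and a cooldown); the class name `PlateauScheduler` and the `factor`, `patience`, and loss values are hypothetical, as the source does not report the settings used.

```python
class PlateauScheduler:
    """Cut the learning rate once more than `patience` consecutive epochs
    pass without a new best (lowest) validation loss."""

    def __init__(self, lr: float, factor: float = 0.5, patience: int = 2,
                 min_lr: float = 1e-6):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.min_lr = min_lr
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss: float) -> float:
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs > self.patience:
                self.lr = max(self.lr * self.factor, self.min_lr)
                self.bad_epochs = 0       # start counting again after a cut
        return self.lr


# Made-up validation losses: improvement stalls after the second epoch.
scheduler = PlateauScheduler(lr=0.001, factor=0.5, patience=2)
lrs = [scheduler.step(loss) for loss in [0.9, 0.8, 0.8, 0.81, 0.82, 0.7]]
print(lrs)
```

Here the rate is halved at the fifth epoch, the third consecutive one without improvement.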

Class-weighted Cross-Entropy Loss

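Class-weighted cross-entropy is a loss function that assigns a different weight to each class, so that errors on under-represented classes contribute more to the loss and class imbalance is partly offset.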

In this paper, this loss function is used for sentiment and emotion classification tasks.
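A minimal sketch of the idea, under two stated assumptions: weights follow the inverse-frequency "balanced" heuristic n / (k × count), and the batch loss is normalized by the sum of the target weights, which matches the documented behaviour of PyTorch's `nn.CrossEntropyLoss` with `weight` and mean reduction. The paper does not say which weighting scheme it uses, and the toy labels are invented.

```python
import math
from collections import Counter


def balanced_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


def weighted_cross_entropy(probs, targets, weights):
    """Weighted mean of -log p(true class), normalized by the target weights."""
    numerator = sum(-weights[t] * math.log(p[t]) for p, t in zip(probs, targets))
    denominator = sum(weights[t] for t in targets)
    return numerator / denominator


# Toy training labels: "Fear" is four times rarer than "Happy".
weights = balanced_weights(["Happy"] * 8 + ["Fear"] * 2)
print(weights)

# The same confident mistake costs more when the true class is the rare one.
wrong = {"Happy": 0.9, "Fear": 0.1}
right = {"Happy": 0.9, "Fear": 0.1}
loss_rare_missed = weighted_cross_entropy([wrong, right], ["Fear", "Happy"], weights)
unweighted = {"Happy": 1.0, "Fear": 1.0}
loss_unweighted = weighted_cross_entropy([wrong, right], ["Fear", "Happy"], unweighted)
print(loss_rare_missed > loss_unweighted)
```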

Early Stopping

Early stopping is a technique to prevent model overfitting by stopping training when validation performance no longer improves.

In this paper, this strategy is used to optimize the model training process.
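The stopping rule can be sketched on a list of precomputed validation losses, which stands in for a real training loop. The function name `early_stopping_epoch`, the patience value, and the losses are hypothetical; the source does not report the patience it used.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return (best_epoch, stop_epoch) for a sequence of validation losses.

    Training halts at the first epoch that is `patience` epochs past the
    best one; a real loop would then restore the best checkpoint.
    """
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch
    return best_epoch, len(val_losses) - 1   # ran out of epochs first


# Made-up curve: validation loss bottoms out at epoch 2, then drifts upward.
losses = [0.9, 0.7, 0.65, 0.66, 0.67, 0.70, 0.60]
print(early_stopping_epoch(losses, patience=3))
# prints (2, 5)
```

The last value in the list is never reached, which shows the trade-off: a small patience saves compute but can stop before a late improvement.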

Open Questions (unanswered questions from this research)

  • 1. How can the performance of emotion classification models be further improved under class imbalance? Current methods handle class imbalance poorly, so new strategies are needed to improve model robustness.
  • 2. The model performs poorly on extremely informal or emerging slang. How can more flexible models be designed to adapt to these changes?
  • 3. The complexity of the preprocessing module may lead to longer processing times, affecting real-time applications. How can the preprocessing steps be optimized to improve efficiency?
  • 4. In a multi-task learning framework, how can information be better shared and allocated between tasks to improve overall performance?
  • 5. How can this framework be applied to sentiment and emotion analysis of other low-resource languages? What adjustments and optimizations are needed?

Applications

Immediate Applications

E-commerce Platform Sentiment Analysis

This method can help e-commerce platforms automatically analyze the sentiment of user reviews, improving user experience and satisfaction.

Market Research

By analyzing sentiment and emotion in user reviews, businesses can better understand consumer needs and preferences, optimizing products and services.

Social Media Monitoring

This technology can be used for real-time monitoring of user sentiment on social media, aiding brand management and crisis public relations.

Long-term Vision

Multilingual Sentiment Analysis

Expanding this framework to other languages, especially low-resource ones, could advance sentiment analysis research globally.

Intelligent Customer Service Systems

Combining sentiment and emotion analysis technology to develop smarter customer service systems that better understand and respond to users' emotional needs.

Abstract

Indonesian marketplace reviews mix standard vocabulary with slang, regional loanwords, numeric shorthands, and emoji, making lexicon-based sentiment tools unreliable in practice. This paper describes a two-track classification pipeline applied to the PRDECT-ID dataset, which contains 5,400 product reviews from 29 Indonesian e-commerce categories, each labeled for binary sentiment (Positive/Negative) and five-class emotion (Happy, Sad, Fear, Love, Anger). The first track applies TF-IDF vectorization with a PyCaret AutoML sweep across standard classifiers. The second track is a PyTorch Bidirectional Long Short-Term Memory (BiLSTM) network with a shared encoder and two task-specific output heads. A preprocessing module applies 14 sequential cleaning steps, including a 140-entry slang dictionary assembled from marketplace corpora. Four configurations are benchmarked: BiLSTM Baseline, BiLSTM Improved, BiLSTM Large, and TextCNN. Training uses class-weighted cross-entropy loss, ReduceLROnPlateau scheduling, and early stopping. Both tracks are deployed as Gradio applications on Hugging Face Spaces. Source code is publicly available at https://github.com/ikii-sd/pba2026-crazyrichteam.

