Investigating Large Language Models' Linguistic Abilities for Text Preprocessing

Braga, Milanese, Pasi

Text preprocessing is a fundamental component of Natural Language Processing, involving techniques such as stopword removal, stemming, and lemmatization to prepare text as input for further processing and analysis. Despite the context-dependent nature of the above techniques, traditional methods usually ignore contextual information. In this paper, we investigate the idea of using Large Language Models (LLMs) to perform various preprocessing tasks, due to their ability to take context into account without requiring extensive language-specific annotated resources. Through a comprehensive evaluation on web-sourced data, we compare LLM-based preprocessing (specifically stopword removal, lemmatization and stemming) to traditional algorithms across multiple text classification tasks in six European languages. Our analysis indicates that LLMs are capable of replicating traditional stopword removal, lemmatization, and stemming methods with accuracies reaching 97%, 82%, and 74%, respectively. Additionally, we show that ML algorithms trained on texts preprocessed by LLMs achieve an improvement of up to 6% with respect to the $F_1$ measure compared to traditional techniques. Our code, prompts, and results are publicly available at https://github.com/GianCarloMilanese/llm_pipeline_wi-iat.

Basic Information

Paper ID: 2510.11482
Title: Investigating Large Language Models' Linguistic Abilities for Text Preprocessing
Authors: Marco Braga (University of Milano-Bicocca), Gian Carlo Milanese (University of Milano-Bicocca), Gabriella Pasi (University of Milano-Bicocca)
Classification: cs.CL (Computational Linguistics), cs.AI (Artificial Intelligence)
Publication Date: October 13, 2025 (arXiv preprint)
Paper Link: https://arxiv.org/abs/2510.11482

Abstract

Text preprocessing is a fundamental component of natural language processing, involving techniques such as stopword removal, stemming, and lemmatization to prepare textual input for subsequent processing and analysis. Although these techniques are context-dependent, traditional methods typically overlook contextual information. This paper investigates the application of large language models (LLMs) to perform various preprocessing tasks, leveraging their ability to consider context without requiring extensive language-specific annotated resources. Through comprehensive evaluation on web data, we compare LLM-based preprocessing with traditional algorithms across multiple text classification tasks in six European languages. Our analysis indicates that LLMs can replicate traditional stopword removal, lemmatization, and stemming methods with accuracies reaching 97%, 82%, and 74%, respectively. Furthermore, machine learning algorithms trained on LLM-preprocessed text achieve up to 6% improvement in $F_1$ compared to traditional techniques.

Research Background and Motivation

Problem Definition

Text preprocessing is a critical step in the NLP pipeline, encompassing operations such as stopword removal, stemming, and lemmatization. These operations aim to normalize text, reduce computational costs, and eliminate noise and irrelevant information.

Limitations of Existing Methods

Lack of Context Awareness: Traditional preprocessing methods primarily rely on predefined stopword lists and fixed stemming/lemmatization rules, neglecting domain-specific information and contextual factors
Part-of-Speech Ambiguity: For example, the word "saw" should be lemmatized to "see" when used as a verb, but retained as "saw" when used as a noun
Domain Sensitivity: The same word may require different processing in different domains; for instance, "leaves" should be lemmatized to "leaf" in botanical documents but to "leave" in documents about employee leave
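The ambiguity described above can be illustrated with a toy example. The sketch below is not taken from the paper: the lookup tables, the hint labels, and both function names are hypothetical, and a real lemmatizer would rely on a full lexicon and a tagger rather than hand-written tables.

```python
# Toy illustration only: hypothetical lookup tables, not the paper's method.

# A context-free lemmatizer has exactly one answer per surface form.
CONTEXT_FREE = {"saw": "see", "leaves": "leaf"}

# A context-aware lemmatizer keys on the surface form plus a hint,
# here either a part-of-speech tag or a (made-up) domain label.
CONTEXT_AWARE = {
    ("saw", "VERB"): "see",
    ("saw", "NOUN"): "saw",
    ("leaves", "botany"): "leaf",
    ("leaves", "hr"): "leave",
}


def lemmatize_context_free(token: str) -> str:
    """Return the single stored lemma, ignoring any context."""
    token = token.lower()
    return CONTEXT_FREE.get(token, token)


def lemmatize_with_hint(token: str, hint: str) -> str:
    """Return the lemma for (token, hint), falling back to the token itself."""
    token = token.lower()
    return CONTEXT_AWARE.get((token, hint), token)


if __name__ == "__main__":
    print(lemmatize_context_free("saw"))            # always "see"
    print(lemmatize_with_hint("saw", "NOUN"))       # "saw"
    print(lemmatize_with_hint("leaves", "hr"))      # "leave"
```

The context-free table is forced to pick one lemma per word, which is the failure mode the two bullet points describe.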

Research Motivation

LLMs possess powerful language understanding capabilities and can consider linguistic context without requiring extensive language-specific annotated resources. This research hypothesizes that LLMs can dynamically detect stopwords, lemmas, and stems based on input documents, context, and task requirements.

Core Contributions

First Systematic Evaluation: Comprehensive assessment of LLMs' capabilities on text preprocessing tasks (stopword removal, lemmatization, stemming)
Multilingual Analysis: Validation of method effectiveness across six European languages (English, French, German, Italian, Portuguese, Spanish)
Downstream Task Evaluation: Demonstration of performance improvements of LLM preprocessing over traditional methods in text classification tasks
Open-Source Contribution: Public release of code, prompts, and experimental results to promote reproducible research

Methodology

Task Definition

This research defines three core preprocessing tasks:

Stopword Removal: Identification and removal of lexical items not relevant to specific tasks
Lemmatization: Reduction of lexical items to their dictionary form (lemma)
Stemming: Simplification of lexical items to their root form

LLM Preprocessing Approach

The study uses in-context learning, providing each LLM with:

Task Description: Formal definition of preprocessing operations
Examples: Few-shot preprocessing examples
Input Text: Text to be processed
Language Information: Language identifier for the text
Task Context: Specific information about downstream tasks
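The five components listed above could be assembled into a single prompt roughly as follows. This is a minimal sketch under stated assumptions: the function name `build_prompt`, the field labels ("Language:", "Input:", "Output:"), and the ordering are illustrative choices, not the authors' template. Their actual prompts are in the linked repository.

```python
from typing import List, Tuple


def build_prompt(
    task_description: str,
    task_context: str,
    language: str,
    examples: List[Tuple[str, str]],
    text: str,
) -> str:
    """Assemble a few-shot prompt from the five components (hypothetical layout)."""
    lines = [task_description, task_context, f"Language: {language}", ""]
    # Few-shot examples: each is an (input text, preprocessed text) pair.
    for source, target in examples:
        lines.append(f"Input: {source}")
        lines.append(f"Output: {target}")
        lines.append("")
    # The text to preprocess comes last, with the output left open.
    lines.append(f"Input: {text}")
    lines.append("Output:")
    return "\n".join(lines)


if __name__ == "__main__":
    prompt = build_prompt(
        task_description="You specialize in removing stopwords from text.",
        task_context="The relevant task is detecting the sentiment of a tweet.",
        language="English",
        examples=[("this is not a good day", "not good day")],
        text="the movie was not bad at all",
    )
    print(prompt)
```

Keeping the task context as a separate argument makes it easy to reuse one task description across different downstream tasks.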

Prompt Engineering

Specialized prompt templates were designed for different preprocessing tasks:

Stopword Removal Example:

You specialize in removing stopwords from text. Stopwords are words that are not relevant for processing a text. [...] In this case, the relevant task is detecting the sentiment of a tweet (positive, negative or neutral). In this task, the word 'not' is often not considered a stopword, and it should be kept in the text.

Lemmatization Example:

You specialize in text lemmatization. [...] Lemmatization depends on correctly identifying the intended part of speech and meaning of a word in a sentence, as well as within the larger context surrounding that sentence.

Multilingual Processing Strategy

For non-English languages, prompts were written both in English and in the target language
The comparison tests whether target-language prompts provide an additional contextual advantage

Experimental Setup

Datasets

English Datasets

SemEval Series: Including emoji prediction, sarcasm detection, hate speech detection, offensive language identification, and sentiment analysis
News Classification: Reuters and AG News datasets
Focus: Social media data such as Twitter, due to informal language and high noise levels

Multilingual Datasets

Tweet Sentiment Multilingual Corpus: Covering French, German, Italian, Portuguese, and Spanish
Sampling Strategy: Random sampling of up to 3,000 training and 3,000 test documents due to computational constraints
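Capped random sampling of this kind can be sketched in a few lines. The function name `sample_up_to` and the fixed seed are assumptions added for reproducibility; the summary does not state how the authors seeded or implemented their sampling.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def sample_up_to(documents: Sequence[T], limit: int = 3000, seed: int = 0) -> List[T]:
    """Return all documents if there are at most `limit`, else a random subset.

    The seed is a hypothetical choice so that repeated runs pick the same subset.
    """
    if len(documents) <= limit:
        return list(documents)
    return random.Random(seed).sample(list(documents), limit)


if __name__ == "__main__":
    corpus = [f"doc-{i}" for i in range(5000)]
    subset = sample_up_to(corpus)
    print(len(subset))
```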

Model Selection

Five state-of-the-art open-source LLMs were evaluated:

Gemma-2-9B and Gemma-3-4B: Primarily trained on English data
Llama-3.1-8B: Native multilingual model
Phi-4-mini (3.8B): Primarily English-trained
Qwen-2.5-7B: Native multilingual model

Baseline Methods

Stopword Removal: NLTK-provided stopword lists
Stemming: Porter, Lancaster, and Snowball algorithms
Lemmatization: Rule-based or edit-tree-based lemmatizers provided by spaCy

Evaluation Metrics

RQ1 Evaluation

SW: Percentage of lexical items removed by LLM that match NLTK stopword list
NSW: Percentage of non-stopwords removed by LLM
L: Percentage of LLM lemmatization results matching traditional methods
S: Percentage of LLM stemming results matching traditional methods
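The four agreement metrics can be sketched as below. The definitions above leave some details open, so this code fixes one possible reading and labels it as such: SW is taken as the share of LLM-removed tokens that are in the reference stopword list, NSW as the share of non-stopword tokens in the input that the LLM removed, and L and S as token-level agreement between aligned output sequences. The function names and the tiny stopword set are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Set, Tuple


def removed_tokens(original: Iterable[str], processed: Iterable[str]) -> Counter:
    """Multiset of tokens present in the original but missing from the output."""
    return Counter(original) - Counter(processed)


def stopword_metrics(
    original: List[str], processed: List[str], stopwords: Set[str]
) -> Tuple[float, float]:
    """Return (SW, NSW) as percentages, under the reading stated in the lead-in."""
    removed = removed_tokens(original, processed)
    n_removed = sum(removed.values())
    removed_sw = sum(count for tok, count in removed.items() if tok in stopwords)
    removed_non_sw = n_removed - removed_sw
    n_non_sw = sum(1 for tok in original if tok not in stopwords)
    sw = 100.0 * removed_sw / n_removed if n_removed else 0.0
    nsw = 100.0 * removed_non_sw / n_non_sw if n_non_sw else 0.0
    return sw, nsw


def agreement(llm_tokens: List[str], reference_tokens: List[str]) -> float:
    """Percentage of aligned positions where the LLM output equals the reference.

    Usable for both L (lemmas) and S (stems); assumes the sequences are aligned.
    """
    if len(llm_tokens) != len(reference_tokens):
        raise ValueError("token sequences must be aligned and of equal length")
    if not llm_tokens:
        return 0.0
    matches = sum(a == b for a, b in zip(llm_tokens, reference_tokens))
    return 100.0 * matches / len(llm_tokens)


if __name__ == "__main__":
    original = ["the", "cat", "is", "not", "happy"]
    processed = ["cat", "not"]
    print(stopword_metrics(original, processed, {"the", "is", "not"}))
    print(agreement(["see", "leaf"], ["see", "leave"]))
```

Because these metrics measure agreement with a traditional tool, a context-appropriate output that differs from the reference still counts as a mismatch.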

RQ2 Evaluation

Micro-averaged F1 score for classification performance
Averaged across three ML algorithms: decision trees, logistic regression, naive Bayes
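As a worked illustration of the RQ2 score, the sketch below computes micro-averaged F1 from pooled true positives, false positives, and false negatives, then averages it over several classifiers. It uses only the standard library; the function names and the example predictions are hypothetical, and the paper's own pipeline may compute the score differently.

```python
from typing import Dict, Hashable, Sequence


def micro_f1(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Micro-averaged F1: pool TP, FP, FN over all labels, then compute F1 once."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    tp = fp = fn = 0
    for label in set(y_true) | set(y_pred):
        for truth, pred in zip(y_true, y_pred):
            if pred == label and truth == label:
                tp += 1
            elif pred == label:
                fp += 1
            elif truth == label:
                fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def average_over_classifiers(
    y_true: Sequence[Hashable], predictions: Dict[str, Sequence[Hashable]]
) -> float:
    """Mean micro-F1 over the predictions of several classifiers."""
    scores = [micro_f1(y_true, preds) for preds in predictions.values()]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    y_true = [0, 1, 2, 1]
    predictions = {
        "decision_tree": [0, 1, 2, 1],
        "logistic_regression": [0, 2, 2, 1],
        "naive_bayes": [0, 0, 0, 0],
    }
    print(average_over_classifiers(y_true, predictions))
```

In single-label classification, where each document receives exactly one predicted class, micro-averaged F1 equals accuracy, which makes the scores easy to sanity-check.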

Experimental Results

Preprocessing Capability Assessment (RQ1)

English Results

Stopword Removal: Gemma-2 achieved best performance with 84.29% accuracy
Lemmatization: All models exceeded 77% accuracy, with Gemma-2 reaching 82.61%
Stemming: Relatively lower performance, with Gemma-2 achieving 75.65% (counting a match with any of the traditional algorithms)

Multilingual Results

Stopword Removal: Gemma-2 achieved 97% accuracy on French, with at least 79% on other languages
Lemmatization: Qwen-2.5 performed best on French, Italian, and Spanish
Language-Specific Prompts: No consistent evidence that target language prompts yield better results

Downstream Task Performance (RQ2)

English Text Classification

Overall Performance: LLMs surpassed traditional methods in 25 of 35 dataset-preprocessing task combinations
Best Results: Gemma-2 achieved 6.16% improvement over traditional methods on AG News dataset with stopword removal + lemmatization
Stemming Limitations: LLM stemming surpassed traditional methods in only 3 of 7 datasets

Multilingual Text Classification

Average Performance: LLMs achieved comparable or superior performance in approximately half of evaluated cases
Lemmatization Advantage: LLM-based lemmatization achieved the best performance in 4 of 5 datasets
Language-Specific Patterns: Llama-3.1 showed 80% performance improvement with language-specific prompts

Key Findings

Context Sensitivity: LLMs frequently remove words not traditionally considered stopwords, supporting the hypothesis that contextual understanding influences stopword selection
Stemming Inconsistency: LLMs may produce different stems for the same word across different documents, resulting in non-standardized text representation
Model Scale Effects: Gemma-3, despite having approximately half the parameters of the larger evaluated models, frequently achieved comparable or superior performance
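The stemming inconsistency noted above can be checked mechanically once per-document (word, stem) pairs are available. The sketch below is a hypothetical diagnostic, not part of the paper's code: the function name and the input layout are assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def inconsistent_stems(
    pairs_by_document: Iterable[List[Tuple[str, str]]]
) -> Dict[str, Set[str]]:
    """Map each word to its stems, keeping only words with more than one stem."""
    stems: Dict[str, Set[str]] = defaultdict(set)
    for pairs in pairs_by_document:
        for word, stem in pairs:
            stems[word.lower()].add(stem)
    return {word: found for word, found in stems.items() if len(found) > 1}


if __name__ == "__main__":
    # Made-up LLM outputs for two documents.
    docs = [
        [("running", "run"), ("cats", "cat")],
        [("running", "runn"), ("cats", "cat")],
    ]
    print(inconsistent_stems(docs))
```

A deterministic stemmer would return an empty dictionary here, since it always maps the same word to the same stem; any non-empty result signals that the vocabulary seen by the downstream classifier has been fragmented.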

Related Work

LLM Applications in NLP

LLMs achieve state-of-the-art performance across diverse tasks and are particularly effective in few-shot settings
They can be applied to unseen tasks or domains without additional supervised fine-tuning

Context-Aware Preprocessing

The relationship between preprocessing operations and the context of the input text has long been studied
Context-specific stopword definitions have been applied in information retrieval pipelines

Existing LLM Preprocessing Research

Prior work primarily focused on stemming in information retrieval pipelines
Lack of comprehensive analysis of LLM text preprocessing capabilities

Conclusions and Discussion

Main Conclusions

Replication Capability: LLMs effectively replicate traditional preprocessing methods, with accuracies reaching 97%, 82%, and 74% for stopword removal, lemmatization, and stemming, respectively
Performance Improvement: ML algorithms trained on LLM-preprocessed text achieve up to 6% improvement in F1 scores
Multilingual Effectiveness: The method demonstrates effectiveness across multiple European languages

Limitations

Evaluation Limitations: Cases where LLMs outperform traditional libraries may not be captured by the evaluation metrics
Computational Cost: LLM preprocessing is significantly more expensive to run than traditional methods
Prompt Engineering: Limited prompt engineering exploration may impact results
Stemming Consistency: LLMs lack consistency in stemming, affecting downstream task performance

Future Directions

Exploration of LLMs as tools for stemming and lemmatization in low-resource languages
Investigation of more effective prompting strategies and in-context learning approaches
Development of computationally efficient LLM preprocessing solutions

In-Depth Evaluation

Strengths

Research Novelty: First systematic evaluation of LLMs' capabilities on text preprocessing tasks
Experimental Comprehensiveness: Comprehensive evaluation spanning multiple languages, tasks, and models
Practical Value: Suggests a possible approach to text preprocessing for low-resource languages, which the paper leaves to future work
Open-Source Contribution: Public release of code, prompts, and results promotes reproducible research

Weaknesses

Insufficient Theoretical Analysis: Lack of in-depth theoretical analysis of LLM preprocessing mechanisms
Computational Efficiency Issues: Insufficient discussion of trade-offs between computational cost and performance gains
Prompt Sensitivity: Limited exploration of impact of different prompting strategies on results
Missing Error Analysis: Lack of detailed analysis of error types in LLM preprocessing

Impact

Academic Contribution: Provides new research directions for NLP preprocessing
Practical Value: Potentially applicable to low-resource languages lacking mature preprocessing tools
Methodological Inspiration: Demonstrates potential of LLMs in traditional NLP tasks

Applicable Scenarios

Low-Resource Language Processing: Languages lacking high-quality lemmatizers and stemmers
Domain-Specific Applications: Tasks requiring context-sensitive preprocessing
Multilingual Systems: Cross-lingual applications requiring unified preprocessing schemes

References

The paper cites 37 relevant references covering important works in LLMs, text preprocessing, information retrieval, and multilingual NLP, providing a solid theoretical foundation for the research.

Summary: This paper is an early systematic exploration of LLMs for text preprocessing. Through multilingual experiments, it shows that LLM-based, context-aware preprocessing can match or improve on traditional methods in many settings. Despite limitations such as high computational cost, it points to a promising approach for low-resource languages and context-sensitive preprocessing tasks.