Investigating Large Language Models' Linguistic Abilities for Text Preprocessing

Braga, Milanese, Pasi
Text preprocessing is a fundamental component of Natural Language Processing, involving techniques such as stopword removal, stemming, and lemmatization to prepare text as input for further processing and analysis. Despite the context-dependent nature of the above techniques, traditional methods usually ignore contextual information. In this paper, we investigate the idea of using Large Language Models (LLMs) to perform various preprocessing tasks, due to their ability to take context into account without requiring extensive language-specific annotated resources. Through a comprehensive evaluation on web-sourced data, we compare LLM-based preprocessing (specifically stopword removal, lemmatization and stemming) to traditional algorithms across multiple text classification tasks in six European languages. Our analysis indicates that LLMs are capable of replicating traditional stopword removal, lemmatization, and stemming methods with accuracies reaching 97%, 82%, and 74%, respectively. Additionally, we show that ML algorithms trained on texts preprocessed by LLMs achieve an improvement of up to 6% with respect to the $F_1$ measure compared to traditional techniques. Our code, prompts, and results are publicly available at https://github.com/GianCarloMilanese/llm_pipeline_wi-iat.

Basic Information

  • Paper ID: 2510.11482
  • Title: Investigating Large Language Models' Linguistic Abilities for Text Preprocessing
  • Authors: Marco Braga (University of Milano-Bicocca), Gian Carlo Milanese (University of Milano-Bicocca), Gabriella Pasi (University of Milano-Bicocca)
  • Classification: cs.CL (Computation and Language), cs.AI (Artificial Intelligence)
  • Publication Date: October 13, 2025 (arXiv preprint)
  • Paper Link: https://arxiv.org/abs/2510.11482

Abstract

Text preprocessing is a fundamental component of natural language processing, involving techniques such as stopword removal, stemming, and lemmatization to prepare textual input for subsequent processing and analysis. Although these techniques are context-dependent, traditional methods typically overlook contextual information. This paper investigates the application of large language models (LLMs) to perform various preprocessing tasks, leveraging their ability to consider context without requiring extensive language-specific annotated resources. Through comprehensive evaluation on web data, we compare LLM-based preprocessing with traditional algorithms across multiple text classification tasks in six European languages. Our analysis indicates that LLMs can replicate traditional stopword removal, lemmatization, and stemming methods with accuracies reaching 97%, 82%, and 74%, respectively. Furthermore, machine learning algorithms trained on LLM-preprocessed text achieve an improvement of up to 6% in F1 score compared to traditional techniques.

Research Background and Motivation

Problem Definition

Text preprocessing is a critical step in the NLP pipeline, encompassing operations such as stopword removal, stemming, and lemmatization. These operations aim to normalize text, reduce computational costs, and eliminate noise and irrelevant information.

Limitations of Existing Methods

  1. Lack of Context Awareness: Traditional preprocessing methods primarily rely on predefined stopword lists and fixed stemming/lemmatization rules, neglecting domain-specific information and contextual factors
  2. Part-of-Speech Ambiguity: For example, the word "saw" should be lemmatized to "see" when used as a verb, but retained as "saw" when used as a noun
  3. Domain Sensitivity: The same word may require different processing in different domains; for instance, "leaves" should be lemmatized to "leaf" in botanical documents but to "leave" in employee leave documents
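The part-of-speech ambiguity described above can be illustrated with a toy lookup lemmatizer. This is a minimal sketch and not the paper's code: the table `LEMMAS` and the function `lemmatize` are hypothetical, and the part-of-speech tag is assumed to come from some upstream component that reads the surrounding context.

```python
# Toy lemma table keyed by (surface form, part of speech).
# The entries are illustrative only, not taken from the paper or any library.
LEMMAS = {
    ("saw", "VERB"): "see",
    ("saw", "NOUN"): "saw",
}


def lemmatize(word, pos=None, default_pos="NOUN"):
    """Return the lemma for `word`; fall back to `default_pos` when no tag is given."""
    key = (word.lower(), pos or default_pos)
    # Words missing from the table are returned lowercased and otherwise unchanged.
    return LEMMAS.get(key, word.lower())


# Without context, every occurrence of "saw" receives the same (noun) lemma.
context_free = lemmatize("saw")
# With a tag derived from context, the verb reading is recovered.
context_aware = lemmatize("saw", pos="VERB")
```

A context-free method has to commit to one default reading per word form, which is the limitation the list above describes.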

Research Motivation

LLMs possess powerful language understanding capabilities and can consider linguistic context without requiring extensive language-specific annotated resources. This research hypothesizes that LLMs can dynamically detect stopwords, lemmas, and stems based on input documents, context, and task requirements.

Core Contributions

  1. First Systematic Evaluation: Comprehensive assessment of LLMs' capabilities on text preprocessing tasks (stopword removal, lemmatization, stemming)
  2. Multilingual Analysis: Validation of method effectiveness across six European languages (English, French, German, Italian, Portuguese, Spanish)
  3. Downstream Task Evaluation: Demonstration of performance improvements of LLM preprocessing over traditional methods in text classification tasks
  4. Open-Source Contribution: Public release of code, prompts, and experimental results to promote reproducible research

Methodology

Task Definition

This research defines three core preprocessing tasks:

  • Stopword Removal: Identification and removal of lexical items not relevant to specific tasks
  • Lemmatization: Reduction of lexical items to their dictionary form (lemma)
  • Stemming: Simplification of lexical items to their root form

LLM Preprocessing Approach

The research employs in-context learning methodology, providing LLMs with:

  1. Task Description: Formal definition of preprocessing operations
  2. Examples: Few-shot preprocessing examples
  3. Input Text: Text to be processed
  4. Language Information: Language identifier for the text
  5. Task Context: Specific information about downstream tasks
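The five components listed above could be assembled into a single prompt roughly as follows. This is a sketch under stated assumptions: the function `build_prompt`, the `Input:`/`Output:` labels, and the ordering of the parts are hypothetical choices made for illustration, and the few-shot example is invented. The prompts actually used are published in the authors' repository.

```python
def build_prompt(task_description, examples, text, language, task_context):
    """Assemble a few-shot preprocessing prompt from the five components."""
    lines = [task_description, task_context, f"Language: {language}", ""]
    for source, target in examples:
        lines.append(f"Input: {source}")
        lines.append(f"Output: {target}")
        lines.append("")
    # The final input is left without an output for the model to complete.
    lines.append(f"Input: {text}")
    lines.append("Output:")
    return "\n".join(lines)


prompt = build_prompt(
    task_description="You specialize in text lemmatization.",
    examples=[("The cats were sleeping", "the cat be sleep")],  # invented example
    text="She saw two birds",
    language="English",
    task_context="The downstream task is sentiment classification of tweets.",
)
```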

Prompt Engineering

Specialized prompt templates were designed for different preprocessing tasks:

Stopword Removal Example:

You specialize in removing stopwords from text. Stopwords are words that are not relevant for processing a text. [...] In this case, the relevant task is detecting the sentiment of a tweet (positive, negative or neutral). In this task, the word 'not' is often not considered a stopword, and it should be kept in the text.

Lemmatization Example:

You specialize in text lemmatization. [...] Lemmatization depends on correctly identifying the intended part of speech and meaning of a word in a sentence, as well as within the larger context surrounding that sentence.

Multilingual Processing Strategy

  • For non-English languages, both English and target language prompts were employed
  • Evaluation of whether language-specific prompts provide additional contextual advantages

Experimental Setup

Datasets

English Datasets

  • SemEval Series: Including emoji prediction, sarcasm detection, hate speech detection, offensive language identification, and sentiment analysis
  • News Classification: Reuters and AG News datasets
  • Focus: Social media data such as Twitter, due to informal language and high noise levels

Multilingual Datasets

  • Tweet Sentiment Multilingual Corpus: Covering French, German, Italian, Portuguese, and Spanish
  • Sampling Strategy: Random sampling of up to 3,000 training and 3,000 test documents due to computational constraints
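The capped random sampling described above might look like the following. This is a minimal sketch: the helper name `sample_up_to` and the fixed seed are assumptions, since the summary does not say how the authors seeded or implemented their sampling.

```python
import random


def sample_up_to(documents, limit=3000, seed=0):
    """Return at most `limit` documents, sampled uniformly without replacement."""
    if len(documents) <= limit:
        return list(documents)
    rng = random.Random(seed)  # fixed seed (an assumption) for repeatable subsets
    return rng.sample(list(documents), limit)


small = sample_up_to(["a", "b", "c"], limit=3000)
large = sample_up_to(list(range(10000)), limit=3000)
```

Datasets smaller than the cap are returned whole, which matches the "up to 3,000" wording.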

Model Selection

Five state-of-the-art open-source LLMs were evaluated:

  • Gemma-2-9B and Gemma-3-4B: Primarily trained on English data
  • Llama-3.1-8B: Native multilingual model
  • Phi-4-mini (3.8B): Primarily English-trained
  • Qwen-2.5-7B: Native multilingual model

Baseline Methods

  • Stopword Removal: NLTK-provided stopword lists
  • Stemming: Porter, Lancaster, and Snowball algorithms
  • Lemmatization: Rule-based or edit-tree-based lemmatizers provided by spaCy
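List-based stopword removal, as in the NLTK baseline, comes down to a membership test against a fixed set. The sketch below uses a tiny hand-written list rather than NLTK's, and the `keep` parameter is a hypothetical addition showing how a task-specific exception such as "not" would have to be hard-coded in a list-based approach.

```python
# A tiny illustrative stopword list; the real NLTK list is much longer.
STOPWORDS = {"the", "is", "a", "not", "this", "of"}


def remove_stopwords(tokens, stopwords=STOPWORDS, keep=()):
    """Drop tokens found in `stopwords`, except those explicitly listed in `keep`."""
    keep = {k.lower() for k in keep}
    return [t for t in tokens if t.lower() not in stopwords or t.lower() in keep]


tokens = "This is not a good movie".split()
plain = remove_stopwords(tokens)
sentiment_aware = remove_stopwords(tokens, keep=["not"])
```

Here `plain` loses the negation, which is the kind of context-blind behavior the LLM prompts try to avoid for sentiment tasks.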

Evaluation Metrics

RQ1 Evaluation

  • SW: Percentage of lexical items removed by LLM that match NLTK stopword list
  • NSW: Percentage of non-stopwords removed by LLM
  • L: Percentage of LLM lemmatization results matching traditional methods
  • S: Percentage of LLM stemming results matching traditional methods
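One possible reading of the four agreement metrics above is sketched below. The definitions in this summary are brief, so the exact denominators are assumptions: SW is taken as the share of LLM-removed tokens that appear in the reference stopword list, NSW as the share of non-stopword tokens that the LLM removed, and L and S as position-wise agreement over pre-aligned token lists. All function names are hypothetical.

```python
from collections import Counter


def removed_tokens(original, processed):
    """Tokens present in `original` but missing from `processed` (multiset difference)."""
    diff = Counter(original) - Counter(processed)
    return list(diff.elements())


def sw_score(original, processed, stopwords):
    """Share (%) of removed tokens that appear in the reference stopword list."""
    removed = removed_tokens(original, processed)
    if not removed:
        return 0.0
    hits = sum(1 for t in removed if t in stopwords)
    return 100.0 * hits / len(removed)


def nsw_score(original, processed, stopwords):
    """Share (%) of non-stopword tokens in the original that were removed."""
    non_stop = [t for t in original if t not in stopwords]
    if not non_stop:
        return 0.0
    removed = removed_tokens(original, processed)
    hits = sum(1 for t in removed if t not in stopwords)
    return 100.0 * hits / len(non_stop)


def agreement(llm_forms, reference_forms):
    """Share (%) of positions where the LLM form is among the accepted reference forms."""
    if not llm_forms:
        return 0.0
    hits = sum(1 for form, accepted in zip(llm_forms, reference_forms) if form in accepted)
    return 100.0 * hits / len(llm_forms)


stop = {"the", "is", "a"}
original = ["the", "film", "is", "a", "total", "mess"]
processed = ["film", "mess"]  # hypothetical LLM output that also drops "total"

sw = sw_score(original, processed, stop)
nsw = nsw_score(original, processed, stop)
# Each position lists the stems accepted from the traditional stemmers.
s = agreement(["film", "mess", "run"], [{"film"}, {"mess"}, {"runn", "running"}])
```

Passing a set of accepted forms per position lets the same `agreement` helper cover both lemmatization (one reference) and stemming (a match against any of several stemmers).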

RQ2 Evaluation

  • Micro-averaged F1 score for classification performance
  • Averaged across three ML algorithms: decision trees, logistic regression, naive Bayes
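The downstream score described above can be sketched as a micro-averaged F1 per classifier followed by a plain mean over the three classifiers. The labels and predictions below are invented for illustration, and the dictionary keys are hypothetical names; for single-label classification, micro-averaged F1 equals accuracy.

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1: pool TP, FP and FN over all classes, then compute F1."""
    labels = set(y_true) | set(y_pred)
    tp = fp = fn = 0
    for label in labels:
        for truth, pred in zip(y_true, y_pred):
            if pred == label and truth == label:
                tp += 1
            elif pred == label and truth != label:
                fp += 1
            elif pred != label and truth == label:
                fn += 1
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


y_true = ["pos", "neg", "neu", "pos"]
# Hypothetical predictions from the three classifier families.
predictions = {
    "decision_tree": ["pos", "neg", "pos", "pos"],
    "logistic_regression": ["pos", "neg", "neu", "pos"],
    "naive_bayes": ["neg", "neg", "neu", "neg"],
}

scores = {name: micro_f1(y_true, y_pred) for name, y_pred in predictions.items()}
mean_f1 = sum(scores.values()) / len(scores)
```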

Experimental Results

Preprocessing Capability Assessment (RQ1)

English Results

  • Stopword Removal: Gemma-2 achieved best performance with 84.29% accuracy
  • Lemmatization: All models exceeded 77% accuracy, with Gemma-2 reaching 82.61%
  • Stemming: Relatively lower performance, with Gemma-2 achieving 75.65% when a stem is counted as correct if it matches the output of any of the traditional algorithms

Multilingual Results

  • Stopword Removal: Gemma-2 achieved 97% accuracy on French, with at least 79% on other languages
  • Lemmatization: Qwen-2.5 performed best on French, Italian, and Spanish
  • Language-Specific Prompts: No consistent evidence that target language prompts yield better results

Downstream Task Performance (RQ2)

English Text Classification

  • Overall Performance: LLMs surpassed traditional methods in 25 of 35 dataset-preprocessing task combinations
  • Best Results: Gemma-2 achieved 6.16% improvement over traditional methods on AG News dataset with stopword removal + lemmatization
  • Stemming Limitations: LLM stemming surpassed traditional methods in only 3 of 7 datasets

Multilingual Text Classification

  • Average Performance: LLMs achieved comparable or superior performance in approximately half of evaluated cases
  • Lemmatization Advantage: Achieved best performance in 4 of 5 datasets
  • Language-Specific Patterns: Llama-3.1 showed 80% performance improvement with language-specific prompts

Key Findings

  1. Context Sensitivity: LLMs frequently remove words not traditionally considered stopwords, supporting the hypothesis that contextual understanding influences stopword selection
  2. Stemming Inconsistency: LLMs may produce different stems for the same word across different documents, resulting in non-standardized text representation
  3. Model Scale Effects: Gemma-3, despite having approximately half the parameters of the larger models evaluated, frequently achieved comparable or superior performance
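The stemming inconsistency noted in the findings above can be checked by collecting, for each surface word, the set of stems it received across documents. This is a minimal sketch with invented data; the function `inconsistent_stems` and the (word, stem) pair format are assumptions, not part of the paper's released code.

```python
from collections import defaultdict


def inconsistent_stems(documents):
    """Map each word to its set of stems, keeping only words with more than one stem.

    `documents` is a list of documents, each a list of (word, stem) pairs.
    """
    stems_by_word = defaultdict(set)
    for pairs in documents:
        for word, stem in pairs:
            stems_by_word[word.lower()].add(stem)
    return {w: s for w, s in stems_by_word.items() if len(s) > 1}


# Hypothetical LLM stemming output for two documents.
docs = [
    [("running", "run"), ("quickly", "quick")],
    [("running", "runn"), ("quickly", "quick")],
]
conflicts = inconsistent_stems(docs)
```

A deterministic stemmer always yields an empty result here, so any non-empty `conflicts` points to the non-standardized representation described in the findings.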

Related Work

LLM Applications in NLP

  • LLMs achieve state-of-the-art performance across diverse tasks, particularly effective in few-shot settings
  • Applicable to unseen tasks or domains without additional supervised fine-tuning

Context-Aware Preprocessing

  • The relationship between preprocessing operations and input text context has long been studied
  • Application of context-specific stopword definitions in information retrieval pipelines

Existing LLM Preprocessing Research

  • Prior work primarily focused on stemming in information retrieval pipelines
  • Lack of comprehensive analysis of LLM text preprocessing capabilities

Conclusions and Discussion

Main Conclusions

  1. Replication Capability: LLMs effectively replicate traditional preprocessing methods, with accuracies reaching 97%, 82%, and 74% for stopword removal, lemmatization, and stemming, respectively
  2. Performance Improvement: ML algorithms based on LLM preprocessing achieve up to 6% improvement in F1 scores
  3. Multilingual Effectiveness: The method demonstrates effectiveness across multiple European languages

Limitations

  1. Evaluation Limitations: The metrics measure agreement with traditional libraries, so cases where an LLM's output is better than the library's are not captured and may be counted as errors
  2. Computational Cost: Computational cost of LLM preprocessing is significantly higher than traditional methods
  3. Prompt Engineering: Limited prompt engineering exploration may impact results
  4. Stemming Consistency: LLMs lack consistency in stemming, affecting downstream task performance

Future Directions

  • Exploration of LLMs as tools for stemming and lemmatization in low-resource languages
  • Investigation of more effective prompting strategies and in-context learning approaches
  • Development of computationally efficient LLM preprocessing solutions

In-Depth Evaluation

Strengths

  1. Research Novelty: First systematic evaluation of LLMs' capabilities on text preprocessing tasks
  2. Experimental Comprehensiveness: Comprehensive evaluation spanning multiple languages, tasks, and models
  3. Practical Value: Provides novel solutions for text preprocessing in low-resource languages
  4. Open-Source Contribution: Complete code and data release promotes reproducible research

Weaknesses

  1. Insufficient Theoretical Analysis: Lack of in-depth theoretical analysis of LLM preprocessing mechanisms
  2. Computational Efficiency Issues: Insufficient discussion of trade-offs between computational cost and performance gains
  3. Prompt Sensitivity: Limited exploration of impact of different prompting strategies on results
  4. Missing Error Analysis: Lack of detailed analysis of error types in LLM preprocessing

Impact

  1. Academic Contribution: Provides new research directions for NLP preprocessing
  2. Practical Value: Particularly applicable to low-resource languages lacking mature preprocessing tools
  3. Methodological Inspiration: Demonstrates potential of LLMs in traditional NLP tasks

Applicable Scenarios

  1. Low-Resource Language Processing: Languages lacking high-quality lemmatizers and stemmers
  2. Domain-Specific Applications: Tasks requiring context-sensitive preprocessing
  3. Multilingual Systems: Cross-lingual applications requiring unified preprocessing schemes

References

The paper cites 37 relevant references covering important works in LLMs, text preprocessing, information retrieval, and multilingual NLP, providing a solid theoretical foundation for the research.


Summary: This paper is an early systematic exploration of LLMs for text preprocessing, demonstrating through comprehensive multilingual experiments the advantages of LLMs in context-aware preprocessing. Despite limitations such as high computational costs, it provides valuable solutions for low-resource languages and context-sensitive preprocessing tasks.