Investigating Large Language Models' Linguistic Abilities for Text Preprocessing
Braga, Milanese, Pasi
Text preprocessing is a fundamental component of Natural Language Processing, involving techniques such as stopword removal, stemming, and lemmatization to prepare text as input for further processing and analysis. Despite the context-dependent nature of the above techniques, traditional methods usually ignore contextual information. In this paper, we investigate the idea of using Large Language Models (LLMs) to perform various preprocessing tasks, due to their ability to take context into account without requiring extensive language-specific annotated resources. Through a comprehensive evaluation on web-sourced data, we compare LLM-based preprocessing (specifically stopword removal, lemmatization and stemming) to traditional algorithms across multiple text classification tasks in six European languages. Our analysis indicates that LLMs are capable of replicating traditional stopword removal, lemmatization, and stemming methods with accuracies reaching 97%, 82%, and 74%, respectively. Additionally, we show that ML algorithms trained on texts preprocessed by LLMs achieve an improvement of up to 6% with respect to the $F_1$ measure compared to traditional techniques. Our code, prompts, and results are publicly available at https://github.com/GianCarloMilanese/llm_pipeline_wi-iat.
academic
Investigating Large Language Models' Linguistic Abilities for Text Preprocessing
Text preprocessing is a fundamental component of natural language processing, involving techniques such as stopword removal, stemming, and lemmatization to prepare textual input for subsequent processing and analysis. Although these techniques are context-dependent, traditional methods typically overlook contextual information. This paper investigates the application of large language models (LLMs) to perform various preprocessing tasks, leveraging their ability to consider context without requiring extensive language-specific annotated resources. Through comprehensive evaluation on web data, we compare LLM-based preprocessing with traditional algorithms across multiple text classification tasks in six European languages. Our analysis demonstrates that LLMs can replicate traditional stopword removal, lemmatization, and stemming methods with accuracies of 97%, 82%, and 74%, respectively. Furthermore, machine learning algorithms trained on LLM-preprocessed text achieve up to 6% improvement in F1 scores compared to traditional techniques.
Text preprocessing is a critical step in the NLP pipeline, encompassing operations such as stopword removal, stemming, and lemmatization. These operations aim to normalize text, reduce computational costs, and eliminate noise and irrelevant information.
Lack of Context Awareness: Traditional preprocessing methods primarily rely on predefined stopword lists and fixed stemming/lemmatization rules, neglecting domain-specific information and contextual factors
Part-of-Speech Ambiguity: For example, the word "saw" should be lemmatized to "see" when used as a verb, but retained as "saw" when used as a noun
Domain Sensitivity: The same word may require different processing in different domains; for instance, "leaves" should be lemmatized to "leaf" in botanical documents but to "leave" in employee leave documents
LLMs possess powerful language understanding capabilities and can consider linguistic context without requiring extensive language-specific annotated resources. This research hypothesizes that LLMs can dynamically detect stopwords, lemmas, and stems based on input documents, context, and task requirements.
Specialized prompt templates were designed for different preprocessing tasks:
Stopword Removal Example:
You specialize in removing stopwords from text. Stopwords are words that are not relevant for processing a text. [...] In this case, the relevant task is detecting the sentiment of a tweet (positive, negative or neutral). In this task, the word 'not' is often not considered a stopword, and it should be kept in the text.
Lemmatization Example:
You specialize in text lemmatization. [...] Lemmatization depends on correctly identifying the intended part of speech and meaning of a word in a sentence, as well as within the larger context surrounding that sentence.
Context Sensitivity: LLMs frequently remove words not traditionally considered stopwords, supporting the hypothesis that contextual understanding influences stopword selection
Stemming Inconsistency: LLMs may produce different stems for the same word across different documents, resulting in non-standardized text representation
Model Scale Effects: Gemma-3, despite having approximately half the parameters of other large models, frequently achieved comparable or superior performance
Replication Capability: LLMs effectively replicate traditional preprocessing methods, achieving accuracies of 97%, 82%, and 74% for stopword removal, lemmatization, and stemming, respectively
Performance Improvement: ML algorithms based on LLM preprocessing achieve up to 6% improvement in F1 scores
Multilingual Effectiveness: The method demonstrates effectiveness across multiple European languages
The paper cites 37 relevant references covering important works in LLMs, text preprocessing, information retrieval, and multilingual NLP, providing a solid theoretical foundation for the research.
Summary: This paper pioneering explores the application of LLMs in text preprocessing, demonstrating through comprehensive multilingual experiments the advantages of LLMs in context-aware preprocessing. Despite limitations such as high computational costs, it provides valuable solutions for low-resource languages and context-sensitive preprocessing tasks.