Investigating Large Language Models' Linguistic Abilities for Text Preprocessing

Braga, Milanese, Pasi

Text preprocessing is a fundamental component of Natural Language Processing, involving techniques such as stopword removal, stemming, and lemmatization to prepare text as input for further processing and analysis. Despite the context-dependent nature of the above techniques, traditional methods usually ignore contextual information. In this paper, we investigate the idea of using Large Language Models (LLMs) to perform various preprocessing tasks, due to their ability to take context into account without requiring extensive language-specific annotated resources. Through a comprehensive evaluation on web-sourced data, we compare LLM-based preprocessing (specifically stopword removal, lemmatization and stemming) to traditional algorithms across multiple text classification tasks in six European languages. Our analysis indicates that LLMs are capable of replicating traditional stopword removal, lemmatization, and stemming methods with accuracies reaching 97%, 82%, and 74%, respectively. Additionally, we show that ML algorithms trained on texts preprocessed by LLMs achieve an improvement of up to 6% with respect to the $F_1$ measure compared to traditional techniques. Our code, prompts, and results are publicly available at https://github.com/GianCarloMilanese/llm_pipeline_wi-iat.

Basic Information

Paper ID: 2510.11482
Title: Investigating Large Language Models' Linguistic Abilities for Text Preprocessing
Authors: Marco Braga (University of Milano-Bicocca), Gian Carlo Milanese (University of Milano-Bicocca), Gabriella Pasi (University of Milano-Bicocca)
Categories: cs.CL (Computational Linguistics), cs.AI (Artificial Intelligence)
Published: October 13, 2025 (arXiv preprint)
Paper link: https://arxiv.org/abs/2510.11482

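Abstract

Text preprocessing is a foundational component of natural language processing. It covers techniques such as stopword removal, stemming, and lemmatization, which prepare text as input for later processing and analysis. Although these techniques are context-dependent, traditional methods usually ignore contextual information. The paper investigates using Large Language Models (LLMs) for these preprocessing tasks, because LLMs can take context into account without requiring extensive language-specific annotated resources. In a comprehensive evaluation on web-sourced data, the authors compare LLM-based preprocessing with traditional algorithms across multiple text classification tasks in six European languages. The analysis indicates that LLMs can replicate traditional stopword removal, lemmatization, and stemming with accuracies reaching 97%, 82%, and 74%, respectively. In addition, ML algorithms trained on LLM-preprocessed text improve by up to 6% in $F_1$ compared to traditional techniques.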

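Background and Motivation

Problem Definition

Text preprocessing is a key step in NLP pipelines and includes operations such as stopword removal, stemming, and lemmatization. These operations aim to normalize text, lower computational cost, and reduce noise and irrelevant information.

Limitations of Existing Methods

Lack of context awareness: traditional preprocessing relies mainly on predefined stopword lists and fixed stemming or lemmatization rules, ignoring domain-specific information and context
Part-of-speech ambiguity: the word "saw", for example, should be lemmatized to "see" when it is a verb but should stay "saw" when it is a noun
Domain sensitivity: the same word may need different handling in different domains; "leaves" should be lemmatized to "leaf" in a botany document but to "leave" in a document about employee leave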
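
The "saw" and "leaves" cases above can be made concrete with a small sketch. It is illustrative only: the lookup tables, the context labels, and both function names are hypothetical and are not taken from the paper. A context-free lemmatizer stores one lemma per surface form, so it cannot return both "see" and "saw"; a context-aware one needs some extra signal, which is supplied by hand here.

```python
# Toy lookup tables (hypothetical data, for illustration only).
# A context-free lemmatizer keeps exactly one lemma per surface form.
CONTEXT_FREE_LEMMAS = {"saw": "see", "leaves": "leaf"}

# A context-aware lemmatizer needs an extra signal (part of speech or
# domain). Here that signal is a hand-written label, not something inferred.
CONTEXT_AWARE_LEMMAS = {
    ("saw", "verb"): "see",
    ("saw", "noun"): "saw",
    ("leaves", "botany"): "leaf",
    ("leaves", "employee leave"): "leave",
}


def lemmatize_context_free(word: str) -> str:
    """Return the single stored lemma, whatever the surrounding text says."""
    key = word.lower()
    return CONTEXT_FREE_LEMMAS.get(key, key)


def lemmatize_with_context(word: str, context: str) -> str:
    """Return the lemma for a word given a hand-supplied context label."""
    key = word.lower()
    return CONTEXT_AWARE_LEMMAS.get((key, context), key)


# The context-free version gives "see" even when "saw" is a tool.
tool_lemma_without_context = lemmatize_context_free("saw")
tool_lemma_with_context = lemmatize_with_context("saw", "noun")
```

In the LLM-based setting the context label would not be given explicitly; the model is expected to work it out from the document and the task description.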

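Research Motivation

LLMs have strong language understanding abilities and can take linguistic context into account without requiring extensive language-specific annotated resources. The study hypothesizes that LLMs can dynamically identify stopwords, lemmas, and stems based on the input document, its context, and the task.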

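Core Contributions

First systematic evaluation: a comprehensive assessment of LLMs' abilities on text preprocessing tasks (stopword removal, lemmatization, stemming)
Multilingual analysis: the approach is validated on six European languages (English, French, German, Italian, Portuguese, Spanish)
Downstream task evaluation: LLM-based preprocessing is shown to improve text classification performance over traditional methods
Open-source release: code, prompts, and experimental results are publicly available to support reproducibility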

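Method Details

Task Definitions

The study defines three core preprocessing tasks:

Stopword removal: identify and remove words that are not important for the specific task
Lemmatization: reduce each word to its dictionary form (lemma)
Stemming: reduce each word to its stem, a root-like form that need not be a dictionary word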
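
For contrast with the LLM-based approach, the following is a minimal sketch of what context-free stopword removal and stemming look like. The stopword list and the suffix rules are toy assumptions made up for this example; they are not the traditional algorithms evaluated in the paper, and `naive_stem` is far cruder than an established stemmer.

```python
# Tiny, hypothetical stopword list; real lists are much longer.
STOPWORDS = frozenset({"the", "a", "is", "not", "of", "and"})

# Suffixes tried in order by the toy stemmer below.
SUFFIXES = ("ing", "ed", "es", "s")


def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Drop every token found in a fixed list, regardless of the task."""
    return [token for token in tokens if token.lower() not in stopwords]


def naive_stem(word):
    """Strip the first matching suffix if at least three characters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


tokens = "the movie is not good".split()
filtered = remove_stopwords(tokens)           # ['movie', 'good']
stems = [naive_stem(w) for w in ["walking", "walked", "leaves"]]
# stems == ['walk', 'walk', 'leav']
```

Because the list is fixed, "not" is removed even though it reverses the sentiment of the sentence, and "leaves" is cut to the non-word "leav" whichever sense was meant. These are the kinds of context-independent decisions the study sets out to improve on.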

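LLM-Based Preprocessing

The study uses in-context learning, providing the LLM with:

Task description: a formal definition of the preprocessing operation
Examples: a small number of preprocessing examples
Input text: the text to be processed
Language information: the language of the text
Task context: specifics of the downstream task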
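
The five components listed above could be assembled into a single prompt roughly as follows. This is a sketch under stated assumptions: the function name `build_preprocessing_prompt`, the field labels ("Language:", "Downstream task:", "Input:", "Output:"), and the ordering are all hypothetical. The prompts actually used by the authors are in their public repository and may be laid out quite differently.

```python
from typing import Sequence, Tuple


def build_preprocessing_prompt(
    task_description: str,
    examples: Sequence[Tuple[str, str]],
    text: str,
    language: str,
    task_context: str,
) -> str:
    """Combine the five in-context-learning components into one prompt.

    The layout is an assumption made for illustration, not the paper's format.
    """
    lines = [
        task_description,
        f"Language: {language}",
        f"Downstream task: {task_context}",
        "",
    ]
    # Few-shot demonstrations: each pair is (raw text, preprocessed text).
    for source, target in examples:
        lines.append(f"Input: {source}")
        lines.append(f"Output: {target}")
        lines.append("")
    # The text to process, left open for the model to complete.
    lines.append(f"Input: {text}")
    lines.append("Output:")
    return "\n".join(lines)


prompt = build_preprocessing_prompt(
    task_description="You specialize in removing stopwords from text.",
    examples=[("the film was not good", "film not good")],
    text="this is not a bad phone",
    language="English",
    task_context="detecting the sentiment of a tweet",
)
```

The resulting string would then be sent to whichever LLM is being evaluated; that call is left out here because it depends on the model and client library in use.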

Prompt Engineering

A dedicated prompt template was designed for each preprocessing task:

Stopword removal prompt (excerpt):

You specialize in removing stopwords from text. Stopwords are words that are not relevant for processing a text. [...] In this case, the relevant task is detecting the sentiment of a tweet (positive, negative or neutral). In this task, the word 'not' is often not considered a stopword, and it should be kept in the text.

Lemmatization prompt (excerpt):

You specialize in text lemmatization. [...] Lemmatization depends on correctly identifying the intended part of speech and meaning of a word in a sentence, as well as within the larger context surrounding that sentence.