Investigating Large Language Models' Linguistic Abilities for Text Preprocessing

Braga, Milanese, Pasi

Text preprocessing is a fundamental component of Natural Language Processing, involving techniques such as stopword removal, stemming, and lemmatization to prepare text as input for further processing and analysis. Despite the context-dependent nature of the above techniques, traditional methods usually ignore contextual information. In this paper, we investigate the idea of using Large Language Models (LLMs) to perform various preprocessing tasks, due to their ability to take context into account without requiring extensive language-specific annotated resources. Through a comprehensive evaluation on web-sourced data, we compare LLM-based preprocessing (specifically stopword removal, lemmatization and stemming) to traditional algorithms across multiple text classification tasks in six European languages. Our analysis indicates that LLMs are capable of replicating traditional stopword removal, lemmatization, and stemming methods with accuracies reaching 97%, 82%, and 74%, respectively. Additionally, we show that ML algorithms trained on texts preprocessed by LLMs achieve an improvement of up to 6% with respect to the $F_1$ measure compared to traditional techniques. Our code, prompts, and results are publicly available at https://github.com/GianCarloMilanese/llm_pipeline_wi-iat.

Investigating the Linguistic Abilities of Large Language Models for Text Preprocessing

Basic Information

Paper ID: 2510.11482
Title: Investigating Large Language Models' Linguistic Abilities for Text Preprocessing
Authors: Marco Braga (University of Milano-Bicocca), Gian Carlo Milanese (University of Milano-Bicocca), Gabriella Pasi (University of Milano-Bicocca)
Categories: cs.CL (Computation and Language), cs.AI (Artificial Intelligence)
Published: October 13, 2025 (arXiv preprint)
Paper link: https://arxiv.org/abs/2510.11482

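Summary

Text preprocessing is a fundamental component of natural language processing. It covers techniques such as stopword removal, stemming, and lemmatization, which prepare text as input for later processing and analysis. Although these techniques are context-dependent, traditional methods usually ignore contextual information. This paper investigates the idea of using large language models (LLMs) to perform various preprocessing tasks, since LLMs can take context into account without requiring extensive language-specific annotated resources. Through a comprehensive evaluation on web-sourced data, the authors compare LLM-based preprocessing with traditional algorithms across multiple text classification tasks in six European languages. The analysis indicates that LLMs can replicate traditional stopword removal, lemmatization, and stemming methods with accuracies reaching 97%, 82%, and 74%, respectively. In addition, machine learning algorithms trained on LLM-preprocessed text achieve an improvement of up to 6% in the F1 measure compared with traditional techniques.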

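Background and Motivation

Problem Definition

Text preprocessing is a key step in NLP pipelines and includes operations such as stopword removal, stemming, and lemmatization. These operations aim to normalize text, reduce computational cost, and cut down on noise and irrelevant information.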

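Limitations of Existing Methods

Lack of context awareness: traditional preprocessing methods rely mainly on predefined stopword lists and fixed stemming and lemmatization rules, ignoring domain-specific information and context.
Part-of-speech ambiguity: for example, the word "saw" should be lemmatized to "see" when used as a verb, but should remain "saw" when used as a noun.
Domain sensitivity: the same word may need different handling in different domains. For example, "leaves" should be lemmatized to "leaf" in a botany document and to "leave" in a document about employee leave.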
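The ambiguity described above can be illustrated with a small sketch. The two lookup tables, the hint labels (`VERB`, `NOUN`, `botany`, `employee_leave`), and both function names are hypothetical and chosen only to mirror the "saw" and "leaves" examples; they are not taken from the paper or from any library.

```python
# Toy lookup tables. All entries are illustrative assumptions, not data from the paper.
CONTEXT_FREE_LEMMAS = {"saw": "see", "leaves": "leaf"}

# The hint is a part-of-speech tag or a domain label, standing in for real context.
CONTEXT_AWARE_LEMMAS = {
    ("saw", "VERB"): "see",
    ("saw", "NOUN"): "saw",
    ("leaves", "botany"): "leaf",
    ("leaves", "employee_leave"): "leave",
}


def lemmatize_context_free(token: str) -> str:
    """Return one fixed lemma per surface form, whatever the context."""
    word = token.lower()
    return CONTEXT_FREE_LEMMAS.get(word, word)


def lemmatize_with_hint(token: str, hint: str) -> str:
    """Pick the lemma using a context hint, falling back to the context-free lemma."""
    word = token.lower()
    return CONTEXT_AWARE_LEMMAS.get((word, hint), lemmatize_context_free(word))
```

Here `lemmatize_context_free("saw")` always yields "see", even when the word is a noun, whereas `lemmatize_with_hint("saw", "NOUN")` keeps "saw". A real system would have to infer the hint from the surrounding text, which is the part the paper proposes to delegate to an LLM.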

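Research Motivation

LLMs have strong language understanding abilities and can take linguistic context into account without requiring extensive language-specific annotated resources. The study hypothesizes that LLMs can dynamically detect stopwords, lemmas, and stems based on the input document, its context, and the task.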

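Core Contributions

First systematic evaluation: a comprehensive evaluation of LLM capabilities on text preprocessing tasks (stopword removal, lemmatization, and stemming).
Multilingual analysis: the approach is validated on six European languages (English, French, German, Italian, Portuguese, and Spanish).
Downstream task evaluation: LLM-based preprocessing is shown to improve performance on text classification tasks compared with traditional methods.
Open-source contribution: the code, prompts, and experimental results are released publicly to support reproducible research.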

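Method Details

Task Definitions

The study defines three core preprocessing tasks:

Stopword removal: identifies and removes words that are not important for a specific task.
Lemmatization: reduces words to their dictionary form (lemma).
Stemming: reduces words to their root (stem) form.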
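For reference, the three tasks can be sketched as context-free baselines of the kind the paper compares against. This is a deliberately simplified illustration: the stopword list, the suffix list, the lemma dictionary, and the function names are all assumptions made for this example, and the suffix stripper is a toy rule, not the Porter or Snowball algorithm.

```python
# Toy resources. Every entry is an illustrative assumption, not data from the paper.
STOPWORDS = {"the", "a", "an", "is", "are", "of", "and", "to", "in"}
SUFFIXES = ("ing", "ed", "es", "s")  # checked in this order
LEMMAS = {"was": "be", "were": "be", "mice": "mouse", "running": "run"}


def remove_stopwords(tokens: list[str]) -> list[str]:
    """Drop every token found in a fixed, task-independent stopword list."""
    return [t for t in tokens if t.lower() not in STOPWORDS]


def stem(token: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    word = token.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def lemmatize(token: str) -> str:
    """Look the token up in a fixed dictionary of lemmas."""
    word = token.lower()
    return LEMMAS.get(word, word)
```

Note the difference between the last two functions: `stem("running")` produces the truncated form "runn", which is not a dictionary word, while `lemmatize("running")` returns "run". None of the three functions looks at neighbouring words, which is the limitation the LLM-based approach is meant to address.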

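LLM-Based Preprocessing Method

The study adopts an in-context learning approach, providing the LLM with the following:

Task description: a formal definition of the preprocessing operation
Examples: a small number of preprocessing examples
Input text: the text to be processed
Language information: the language of the text
Task context: specific information about the downstream task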
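One way to picture how these five components might be combined is the sketch below. It is only an assumption about the general shape of such a prompt: the class name `PreprocessingPrompt`, its fields, and the "Input:/Output:" layout are hypothetical, and the actual prompts used by the authors are the ones published in their repository.

```python
from dataclasses import dataclass


@dataclass
class PreprocessingPrompt:
    """Hypothetical container for the components of an in-context learning prompt."""

    task_description: str            # formal definition of the operation
    examples: list[tuple[str, str]]  # (raw text, preprocessed text) pairs
    language: str                    # language of the input text
    task_context: str                # information about the downstream task

    def render(self, text: str) -> str:
        """Assemble the components and the input text into a single prompt string."""
        parts = [self.task_description, self.task_context, f"Language: {self.language}"]
        for source, target in self.examples:
            parts.append(f"Input: {source}\nOutput: {target}")
        # The final block is left open so that the model completes the output.
        parts.append(f"Input: {text}\nOutput:")
        return "\n\n".join(parts)


prompt = PreprocessingPrompt(
    task_description="You specialize in removing stopwords from text.",
    examples=[("the movie was not good", "movie not good")],
    language="English",
    task_context="The relevant task is detecting the sentiment of a tweet.",
)
rendered = prompt.render("I did not like the ending")
```

The resulting string would then be sent to whichever LLM is being evaluated; that call is omitted here because it depends on the model and client library in use.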

Prompt Engineering

Dedicated prompt templates were designed for the different preprocessing tasks:

Stopword removal example:

You specialize in removing stopwords from text. Stopwords are words that are not relevant for processing a text. [...] In this case, the relevant task is detecting the sentiment of a tweet (positive, negative or neutral). In this task, the word 'not' is often not considered a stopword, and it should be kept in the text.

Lemmatization example:

You specialize in text lemmatization. [...] Lemmatization depends on correctly identifying the intended part of speech and meaning of a word in a sentence, as well as within the larger context surrounding that sentence.