Detection and Prevention of Smishing Attacks

Goel

Phishing is an online identity theft technique in which attackers steal users' personal information, leading to financial losses for individuals and organizations. With the increasing adoption of smartphones, which provide functionality similar to desktop computers, attackers are targeting mobile users. Smishing, a phishing attack carried out through the Short Message Service (SMS), has become prevalent due to the widespread use of SMS-based services. It involves deceptive messages designed to extract sensitive information. Despite the growing number of smishing attacks, limited research focuses on detecting these threats. This work presents a smishing detection model using a content-based analysis approach. To address the challenge posed by slang, abbreviations, and short forms in text communication, the model normalizes these into standard forms. A machine learning classifier is then used to classify messages as smishing or ham. Experimental results demonstrate the model's effectiveness, with classification accuracies of 97.14% for smishing and 96.12% for ham messages, and an overall accuracy of 96.20%.

Basic Information

Paper ID: 2501.00260
Title: Detection and Prevention of Smishing Attacks
Author: Diksha Goel (Roll No.: 31603217)
Advisor: Mr. Ankit Kumar Jain (Assistant Professor)
Classification: cs.CR cs.SI
Publication Date: June 2018 (Master of Technology Dissertation)
Institution: Department of Computer Engineering, National Institute of Technology Kurukshetra-136119, Haryana (India)
Paper Link: https://arxiv.org/abs/2501.00260

Abstract

As smartphone capabilities increasingly approach those of desktop computers, attackers have shifted their focus to mobile users. Smishing (SMS phishing) is a phishing attack conducted over SMS with the aim of stealing sensitive user information. Despite the growth in smishing attacks, research on detecting them remains limited. This study proposes a content-analysis-based smishing detection model that normalizes text to handle slang, abbreviations, and shorthand forms, then uses a machine learning classifier to distinguish smishing from legitimate (ham) SMS messages. Experimental results show classification accuracies of 97.14% for smishing messages and 96.12% for legitimate messages, with an overall accuracy of 96.20%.

Research Background and Motivation

Problem Definition

Primary Problem: With the surge in smartphone users (projected to reach 2.87 billion by 2020), SMS has become a primary channel for phishing attacks. Smishing attacks exploit the high trust users place in SMS (35% of users consider SMS the most trustworthy messaging platform) to commit fraud.
Problem Significance:
- 33% of mobile users have received smishing messages
- 42% of mobile users click on malicious links
- Smartphone users face 3 times higher risk of phishing attacks compared to desktop users
- 45% of users received smishing messages in 2017, representing a 2% increase from 2016
Limitations of Existing Methods:
- Abundant spam SMS detection techniques exist, but research specifically targeting smishing is limited
- Slang, abbreviations, and shorthand forms in text reduce classifier efficiency
- Lack of effective text normalization mechanisms
Research Motivation:
- Mobile device hardware limitations (small screens, lack of security indicators) increase attack success rates
- Need to effectively detect smishing attacks while protecting user privacy
- Existing solutions require improved accuracy

Core Contributions

Proposed a comprehensive smishing security model: A two-stage detection framework based on content analysis
Innovative text normalization method: Using the NoSlang dictionary to handle slang, abbreviations, and shorthand, significantly improving classification accuracy
Comprehensive mobile phishing attack taxonomy: Systematically organized 7 major categories of mobile phishing attack methods
Excellent detection performance: Achieving 96.20% overall accuracy on public datasets
In-depth literature review: Providing comprehensive analysis of mobile phishing attacks and defense mechanisms

Methodology Details

Task Definition

Input: SMS text messages
Output: Binary classification result (smishing message or ham message)
Constraints: Protect user privacy, real-time detection, high accuracy

Model Architecture

The model employs a two-stage architecture:

Stage 1: Preprocessing and Normalization

Algorithm 1: Preprocessing and Normalization Algorithm
Input: msg (message), dict (NoSlang dictionary), stop (stop words)
Output: n_msg (preprocessed and normalized message)

Specific Steps:

Tokenization: Splitting text into tokens
Lowercasing: Unified conversion to lowercase
Normalization: Replacing slang and abbreviations using the NoSlang dictionary
Stop Word Removal: Deleting 153 NLTK English stop words
Stemming: Reducing vocabulary to root forms
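
The five steps above can be sketched in Python as follows. This is a minimal illustration rather than the dissertation's implementation: the `SLANG` and `STOP_WORDS` tables are tiny hypothetical stand-ins for the NoSlang dictionary and the NLTK English stop-word list, and `toy_stem` is a crude suffix stripper used in place of a real stemmer so that the example needs only the standard library.

```python
import re

# Hypothetical stand-ins for illustration only: the source uses the NoSlang
# dictionary and the NLTK English stop-word list, not these small tables.
SLANG = {"u": "you", "ur": "your", "plz": "please", "txt": "text", "2": "to"}
STOP_WORDS = {"a", "an", "the", "to", "you", "your", "is", "and", "for", "now"}


def toy_stem(token):
    """Crude suffix stripper, swapped in here for a real stemming algorithm."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(msg, slang=SLANG, stop=STOP_WORDS):
    """Return the preprocessed and normalized token list (n_msg) for msg."""
    tokens = re.findall(r"[A-Za-z0-9']+", msg)        # 1. tokenization
    tokens = [t.lower() for t in tokens]              # 2. lowercasing
    normalized = []
    for t in tokens:                                  # 3. slang normalization
        normalized.extend(slang.get(t, t).split())
    kept = [t for t in normalized if t not in stop]   # 4. stop word removal
    return [toy_stem(t) for t in kept]                # 5. stemming


n_msg = preprocess("Plz txt ur PIN 2 claim the prizes now")
print(n_msg)
```

Normalization runs before stop word removal so that an expanded abbreviation such as "ur" (expanded to "your") can still be filtered out as a stop word.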

Stage 2: Classification

Algorithm 2: Classification Algorithm
Input: D (dataset), n_msg (preprocessed and normalized message)
Output: ham or smishing message

Naive Bayes Classifier: Using Bayes' theorem for classification:

$p(C_k|x) = \frac{p(x|C_k)p(C_k)}{p(x)}$

Where:

$p(C_k|x)$ : Posterior probability of class $C_k$ given feature $x$
$p(x|C_k)$ : Likelihood of feature $x$ given class $C_k$
$p(C_k)$ : Prior probability of class $C_k$
$p(x)$ : Evidence, the marginal probability of feature $x$ (identical for all classes)
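
As a rough illustration of how this rule turns into a classifier, the sketch below implements a small multinomial Naive Bayes in plain Python. It relies on the usual "naive" assumption that tokens are conditionally independent given the class, so the likelihood factorizes into per-token terms, and it drops $p(x)$ because that term is the same for every class. The add-one (Laplace) smoothing and the four-message training set are assumptions made for the example; the summary does not say how the dissertation handles unseen words.

```python
import math
from collections import Counter


def train_nb(samples):
    """samples: list of (tokens, label). Returns (priors, word_counts, vocab)."""
    class_docs = Counter(label for _, label in samples)
    word_counts = {label: Counter() for label in class_docs}
    for tokens, label in samples:
        word_counts[label].update(tokens)
    vocab = {t for tokens, _ in samples for t in tokens}
    priors = {c: n / len(samples) for c, n in class_docs.items()}
    return priors, word_counts, vocab


def classify(tokens, priors, word_counts, vocab):
    """Pick the class maximizing log p(C_k) + sum_i log p(x_i | C_k)."""
    scores = {}
    for c, prior in priors.items():
        total = sum(word_counts[c].values())
        score = math.log(prior)
        for t in tokens:
            # Add-one smoothing (an assumption) keeps unseen words from
            # driving the class probability to zero.
            score += math.log((word_counts[c][t] + 1) / (total + len(vocab)))
        scores[c] = score
    return max(scores, key=scores.get)


# Hypothetical toy training data, already preprocessed into tokens.
TRAIN = [
    (["claim", "prize", "call"], "smishing"),
    (["verify", "account", "link"], "smishing"),
    (["see", "lunch", "today"], "ham"),
    (["call", "me", "later"], "ham"),
]
model = train_nb(TRAIN)
print(classify(["claim", "prize", "link"], *model))
```

Working in log space avoids numerical underflow when many small per-token probabilities are multiplied together.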

Technical Innovations

Text Normalization Innovation:
- First application of NoSlang dictionary to smishing detection
- Systematic handling of informal language expressions in SMS
- Significantly improves the classifier's recognition of non-standard text (slang, abbreviations, shorthand)
Two-Stage Processing Framework:
- Preprocessing stage ensures text consistency
- Classification stage makes accurate judgments based on normalized text
Privacy Protection Design:
- Local processing without third-party services
- Based solely on text content features without collecting personal user information

Experimental Setup

Dataset

Data Source: SMS Spam Dataset v.1 (public dataset)
Original Scale: 5,574 messages (4,827 ham, 747 spam)
Processed Scale: 5,169 messages (4,807 ham, 362 smishing)
Data Sources:
- Grumbletext website: 425 spam
- Dr. Caroline Tag's dissertation: 450 ham
- NUS SMS Corpus: 3,375 ham
- SMS Spam Corpus v.0.1: 1,002 ham, 322 spam
- Pinterest collection: 71 smishing

Dataset Statistical Features

Feature	Ham Messages	Smishing Messages
Average Characters	74.55	148.72
Average Words	14.76	24.72
URL Frequency	0.0027	0.2513
Symbol ($,€) Frequency	0.0037	0.0193
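
Statistics of this kind could be computed along the following lines. The summary does not define the two frequency rows precisely, so this sketch assumes they mean the fraction of messages containing at least one URL or at least one currency symbol; the regular expression and the two sample messages are likewise hypothetical.

```python
import re

# Assumed URL pattern: anything starting with http://, https://, or www.
URL_RE = re.compile(r"(https?://|www\.)\S+", re.IGNORECASE)


def describe(messages):
    """Per-class summary statistics over a list of raw message strings."""
    n = len(messages)
    return {
        "avg_chars": sum(len(m) for m in messages) / n,
        "avg_words": sum(len(m.split()) for m in messages) / n,
        # Assumed definition: share of messages with at least one match.
        "url_freq": sum(bool(URL_RE.search(m)) for m in messages) / n,
        "symbol_freq": sum(("$" in m) or ("€" in m) for m in messages) / n,
    }


sample = ["Win $500 now at www.example.com", "See you at lunch"]
stats = describe(sample)
print(stats)
```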

Evaluation Metrics

True Positive Rate (TPR): $TPR = \frac{TP}{TP + FN}$
True Negative Rate (TNR): $TNR = \frac{TN}{TN + FP}$
False Positive Rate (FPR): $FPR = \frac{FP}{FP + TN}$
Accuracy: $A = \frac{TP + TN}{TP + TN + FP + FN}$
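
The metrics above, together with the false negative rate reported in the results table ($FNR = \frac{FN}{FN + TP}$), can be computed from confusion-matrix counts as in this short sketch. The counts passed in at the bottom are illustrative only and are not taken from the paper.

```python
def rates(tp, fn, tn, fp):
    """Compute the evaluation metrics from confusion-matrix counts.

    Smishing is treated as the positive class and ham as the negative class.
    """
    return {
        "TPR": tp / (tp + fn),
        "TNR": tn / (tn + fp),
        "FPR": fp / (fp + tn),
        "FNR": fn / (fn + tp),
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
    }


# Illustrative counts only (not from the paper).
r = rates(tp=90, fn=10, tn=180, fp=20)
print(r)
```

By construction TPR + FNR = 1 and TNR + FPR = 1, which is a quick consistency check on any reported row of results.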

Comparison Methods

S-Detector (Joo et al.): Naive Bayes classifier
SMSAssassin (Yadav et al.): Bayesian learning + SVM
Lee et al.: Cloud environment detection method

Implementation Details

Platform: Python
System Configuration: i5 processor, 2.4GHz, 8GB RAM
Dependencies: NLTK, csv, sys, ConfigParser
Data Split: 90% training, 10% testing
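
A 90/10 split of the 5,169 processed messages might look like the sketch below. The summary does not say whether the split was random, stratified, or repeated, so the seeded random shuffle here is an assumption, and `split_90_10` is a hypothetical helper name.

```python
import random


def split_90_10(rows, seed=0):
    """Shuffle rows with a fixed seed and cut them into 90% train, 10% test."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * 0.9)
    return rows[:cut], rows[cut:]


# Integer indices stand in for the 5,169 processed messages.
train, test = split_90_10(range(5169))
print(len(train), len(test))
```

With only 362 smishing messages in the corpus, a purely random split leaves a small number of positive test examples, so a stratified split may be worth considering in practice.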

Experimental Results

Main Results

Method	TPR	TNR	FPR	FNR	Accuracy
Without Preprocessing Normalization	94.28%	87.74%	12.25%	5.71%	88.20%
With Preprocessing Normalization	97.14%	96.12%	3.87%	2.85%	96.20%

Comparative Experimental Results

Method	Content Analysis	Text Normalization	Algorithm	Accuracy
Joo et al.	✓	✗	Naive Bayes	-
Yadav et al.	✓	✗	Bayes + SVM	84.75%
Lee et al.	✓	✗	Source Content Analysis	-
Proposed Method	✓	✓	Naive Bayes	96.20%

Ablation Study

By comparing results with and without preprocessing normalization, the importance of text normalization is demonstrated:

Accuracy Improvement: From 88.20% to 96.20% (+8.00 percentage points)
TPR Improvement: From 94.28% to 97.14%
TNR Improvement: From 87.74% to 96.12%
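
The size of these gains can be checked with a few lines of arithmetic on the reported figures. The relative error reduction computed at the end is derived here from the two accuracy values and is not a number stated in the paper.

```python
# Reported percentages without and with preprocessing/normalization.
without = {"TPR": 94.28, "TNR": 87.74, "accuracy": 88.20}
with_norm = {"TPR": 97.14, "TNR": 96.12, "accuracy": 96.20}

# Absolute gains in percentage points.
gain_pp = {k: round(with_norm[k] - without[k], 2) for k in without}

# Relative reduction in the overall error rate (100% minus accuracy).
err_before = 100 - without["accuracy"]
err_after = 100 - with_norm["accuracy"]
rel_err_reduction = round((err_before - err_after) / err_before * 100, 1)

print(gain_pp, rel_err_reduction)
```

Most of the gain comes from the true negative rate, meaning that normalization mainly reduces the number of ham messages wrongly flagged as smishing.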

Case Analysis

Examples of the effect of text normalization:

The smishing probability of the word "call" increased from 0.443425 to 0.464832
The smishing probability of the word "offer" increased from 0.033639 to 0.055046
Normalization maps variant spellings of a word to a single form, so its occurrences are counted together and the classifier's estimates become more consistent

Mobile Phishing Attack Classification

The paper proposes a comprehensive mobile phishing attack taxonomy:

Social Engineering Attacks: SMS, VoIP, websites, email
Mobile Application Attacks: Similarity attacks, forwarding attacks, background attacks
Malware Attacks: Trojans, worms, rootkits, ransomware
Social Network Attacks: Identity spoofing, malicious links, fake profiles
Content Injection Attacks: XSS attacks
Wireless Medium Attacks: Wi-Fi, Bluetooth attacks
Technical Deception Attacks: DNS poisoning, man-in-the-middle attacks

Defense Mechanism Classification

User Education: Warning mechanisms, gamified training
Smishing Detection: S-Detector, SMSAssassin, DCA methods
Phishing Website Detection: MobiFish, kAYO, MP-Shield
Malicious Application Detection: VeriUI, StopBankun, Andromaly
QR Code Technology: Single sign-on, authentication schemes
Personalized Security Indicators

Conclusions and Discussion

Main Conclusions

Importance of Text Normalization: Preprocessing and normalization significantly improve detection accuracy (+8.00 percentage points)
Method Effectiveness: Achieving excellent 96.20% accuracy on public datasets
Practical Value: Providing a complete smishing detection solution
Theoretical Contribution: Systematically organizing mobile phishing attacks and defense mechanisms

Limitations

Dataset Limitations:
- Lack of dedicated smishing dataset; manual extraction from spam required
- Relatively small dataset scale (362 smishing messages)
- English text support only
Method Limitations:
- Based solely on text content, not considering URLs, senders, and other features
- Dependent on dictionary quality; potential incomplete dictionary coverage
- Adaptability to novel attack methods requires verification
Experimental Limitations:
- Lack of comparison with more recent methods
- No cross-dataset validation
- Absence of real-time performance evaluation

Future Directions

URL Analysis: Combining URL features to detect malicious links and downloads
Context Understanding: Improving normalization process by selecting optimal word meanings based on context
Dataset Expansion: Building larger-scale, multilingual smishing datasets
Multimodal Fusion: Combining text, URL, sender information, and other features
Real-time Deployment: Optimizing algorithm efficiency for real-time detection on mobile devices

In-Depth Evaluation

Strengths

Strong Problem Targeting: Specifically addressing smishing, an important but understudied security threat
Methodological Innovation: First systematic application of text normalization to smishing detection
Sufficient Experiments: Ablation studies demonstrating component contributions
Comprehensive Literature Review: Providing one of the most comprehensive reviews in the field
High Practical Value: Simple, effective method easily deployable in practice

Weaknesses

Limited Technical Depth: Primarily using traditional machine learning; deep learning unexplored
Simple Feature Engineering: Using only text content; relatively limited features
Incomplete Evaluation: Lacking analysis of false positive rates' impact on user experience
Scalability Issues: Generalization ability to novel attack methods requires verification
Unknown Real-time Performance: Lacking performance testing on mobile devices

Impact

Academic Contribution:
- Filling research gaps in smishing detection
- Providing systematic attack and defense taxonomy
- Demonstrating importance of text normalization in security detection
Practical Value:
- Direct application to mobile security products
- Filtering solutions for SMS gateways
- Personal protection tools for users
Reproducibility:
- Using public datasets
- Clear method description
- Detailed algorithm procedures

Applicable Scenarios

Mobile Operators: Real-time SMS gateway filtering
Security Vendors: Integration into mobile security products
Enterprise Users: Internal SMS security monitoring
Individual Users: Smartphone security applications
Research Institutions: Baseline method for further improvements

References

The paper cites 63 relevant references, covering:

Classical phishing attack detection methods
Mobile security threat analysis
Machine learning applications in text classification
SMS spam filtering techniques
Mobile malware detection methods

Primary references include APWG phishing reports, IEEE and ACM conference papers, and important journal articles in related fields, with authoritative and comprehensive citation coverage.

Overall Assessment: This is a practical piece of research that addresses an important security problem, with some methodological innovation and satisfactory experimental results. Although its technical depth is limited, it provides an effective baseline method for smishing detection and has good academic and practical value.