Detection and Prevention of Smishing Attacks

Phishing is an online identity theft technique in which attackers steal users' personal information, leading to financial losses for individuals and organizations. With the increasing adoption of smartphones, which provide functionality similar to desktop computers, attackers are targeting mobile users. Smishing, a phishing attack carried out through the Short Message Service (SMS), has become prevalent due to the widespread use of SMS-based services. It involves deceptive messages designed to extract sensitive information. Despite the growing number of smishing attacks, limited research focuses on detecting these threats. This work presents a smishing detection model using a content-based analysis approach. To address the challenge posed by slang, abbreviations, and short forms in text communication, the model normalizes these into standard forms. A machine learning classifier is employed to classify messages as smishing or ham. Experimental results demonstrate the model's effectiveness, achieving classification accuracies of 97.14% for smishing and 96.12% for ham messages, with an overall accuracy of 96.20%.

Basic Information

  • Paper ID: 2501.00260
  • Title: Detection and Prevention of Smishing Attacks
  • Author: Diksha Goel (Roll No.: 31603217)
  • Advisor: Mr. Ankit Kumar Jain (Assistant Professor)
  • Classification: cs.CR cs.SI
  • Publication Date: June 2018 (Master of Technology Dissertation)
  • Institution: Department of Computer Engineering, National Institute of Technology Kurukshetra-136119, Haryana (India)
  • Paper Link: https://arxiv.org/abs/2501.00260

Abstract

As smartphone capabilities increasingly approach those of desktop computers, attackers have shifted their focus to mobile users. Smishing is a phishing attack conducted over SMS that aims to steal sensitive user information. Despite the rapid growth in smishing attacks, research on detecting them remains limited. This study proposes a content-analysis-based smishing detection model that normalizes text to handle slang, abbreviations, and shorthand forms, and then uses a machine learning classifier to distinguish smishing from legitimate (ham) SMS messages. Experimental results show that the model achieves 97.14% classification accuracy for smishing messages and 96.12% for legitimate messages, with an overall accuracy of 96.20%.

Research Background and Motivation

Problem Definition

  1. Primary Problem: With the surge in smartphone users (projected to reach 2.87 billion by 2020), SMS has become a primary channel for attackers to conduct phishing attacks. Smishing attacks exploit users' high trust in SMS (35% of users consider SMS the most trustworthy messaging platform) for fraud.
  2. Problem Significance:
    • 33% of mobile users have received smishing messages
    • 42% of mobile users click on malicious links
    • Smartphone users face 3 times higher risk of phishing attacks compared to desktop users
    • 45% of users received smishing messages in 2017, representing a 2% increase from 2016
  3. Limitations of Existing Methods:
    • Abundant spam SMS detection techniques exist, but research specifically targeting smishing is limited
    • Slang, abbreviations, and shorthand forms in text reduce classifier efficiency
    • Lack of effective text normalization mechanisms
  4. Research Motivation:
    • Mobile device hardware limitations (small screens, lack of security indicators) increase attack success rates
    • Need to effectively detect smishing attacks while protecting user privacy
    • Existing solutions require improved accuracy

Core Contributions

  1. Proposed a comprehensive smishing security model: A two-stage detection framework based on content analysis
  2. Innovative text normalization method: Using the NoSlang dictionary to handle slang, abbreviations, and shorthand, significantly improving classification accuracy
  3. Comprehensive mobile phishing attack taxonomy: Systematically organized 7 major categories of mobile phishing attack methods
  4. Excellent detection performance: Achieving 96.20% overall accuracy on public datasets
  5. In-depth literature review: Providing comprehensive analysis of mobile phishing attacks and defense mechanisms

Methodology Details

Task Definition

  • Input: SMS text messages
  • Output: Binary classification result (smishing message or ham message)
  • Constraints: Protect user privacy, real-time detection, high accuracy

Model Architecture

The model employs a two-stage architecture:

Stage 1: Preprocessing and Normalization

Algorithm 1: Preprocessing and Normalization Algorithm
Input: msg (message), dict (NoSlang dictionary), stop (stop words)
Output: n_msg (preprocessed and normalized message)

Specific Steps:

  1. Tokenization: Splitting text into tokens
  2. Lowercasing: Unified conversion to lowercase
  3. Normalization: Replacing slang and abbreviations using the NoSlang dictionary
  4. Stop Word Removal: Deleting 153 NLTK English stop words
  5. Stemming: Reducing vocabulary to root forms
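
The five steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the `SLANG` table is a tiny hypothetical stand-in for the NoSlang dictionary, `STOP_WORDS` stands in for the NLTK English stop-word list, and `naive_stem` is a crude suffix stripper used in place of a real stemmer.

```python
import re

# Hypothetical stand-ins (assumptions): the paper uses the NoSlang dictionary
# and NLTK's English stop-word list; these tiny tables are illustrative only.
SLANG = {"u": "you", "ur": "your", "2": "to", "pls": "please", "txt": "text"}
STOP_WORDS = {"a", "an", "the", "to", "you", "your", "is", "for", "and", "of"}
SUFFIXES = ("ing", "ed", "s")  # crude stand-in for a real stemming algorithm


def naive_stem(token):
    """Strip one common suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(msg, slang=SLANG, stop=STOP_WORDS):
    """Return the preprocessed and normalized token list for one message."""
    tokens = re.findall(r"[A-Za-z0-9']+", msg)      # 1. tokenization
    tokens = [t.lower() for t in tokens]            # 2. lowercasing
    expanded = []
    for t in tokens:                                # 3. slang normalization
        expanded.extend(slang.get(t, t).split())
    kept = [t for t in expanded if t not in stop]   # 4. stop-word removal
    return [naive_stem(t) for t in kept]            # 5. stemming


print(preprocess("Pls txt CLAIM 2 get ur prize"))
```

With these toy tables, the call prints `['please', 'text', 'claim', 'get', 'prize']`: the shorthand forms are expanded before stop words are dropped, so "2" and "ur" disappear as "to" and "your".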

Stage 2: Classification

Algorithm 2: Classification Algorithm
Input: D (dataset), n_msg (preprocessed and normalized message)
Output: ham or smishing message

Naive Bayes Classifier: Using Bayes' theorem for classification:

$$p(C_k \mid x) = \frac{p(x \mid C_k)\, p(C_k)}{p(x)}$$

Where:

  • $p(C_k \mid x)$: Posterior probability of class $C_k$ given feature $x$
  • $p(x \mid C_k)$: Likelihood of feature $x$ given class $C_k$
  • $p(C_k)$: Prior probability of class $C_k$
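
To make the decision rule concrete, here is a minimal sketch of a word-count Naive Bayes classifier. The source does not state which Naive Bayes variant or smoothing scheme the paper uses, so the multinomial form, the Laplace smoothing parameter `alpha`, and the toy training messages are all assumptions for illustration. The denominator $p(x)$ is omitted because it is the same for every class and does not change which class scores highest.

```python
import math
from collections import Counter


def train_nb(docs):
    """docs: list of (tokens, label). Returns priors, per-class word counts, vocabulary."""
    class_docs = Counter(label for _, label in docs)
    word_counts = {label: Counter() for label in class_docs}
    for tokens, label in docs:
        word_counts[label].update(tokens)
    vocab = {t for tokens, _ in docs for t in tokens}
    priors = {c: n / len(docs) for c, n in class_docs.items()}
    return priors, word_counts, vocab


def predict_nb(tokens, priors, word_counts, vocab, alpha=1.0):
    """Return the class maximizing log p(C_k) + sum of log p(x_i | C_k)."""
    scores = {}
    for c, prior in priors.items():
        total = sum(word_counts[c].values())
        score = math.log(prior)
        for t in tokens:
            if t in vocab:  # tokens never seen in training are ignored
                # Laplace-smoothed likelihood (assumed; not specified in the source)
                score += math.log((word_counts[c][t] + alpha) / (total + alpha * len(vocab)))
        scores[c] = score
    return max(scores, key=scores.get)


# Hypothetical toy training set, already preprocessed into tokens.
train = [
    (["free", "prize", "claim"], "smishing"),
    (["claim", "account", "verify"], "smishing"),
    (["lunch", "today"], "ham"),
    (["see", "you", "today"], "ham"),
]
model = train_nb(train)
print(predict_nb(["claim", "prize"], *model))
print(predict_nb(["lunch", "today"], *model))
```

On this toy data the first message is labeled `smishing` and the second `ham`. Working in log space avoids numerical underflow when many small probabilities are multiplied.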

Technical Innovations

  1. Text Normalization Innovation:
    • First application of NoSlang dictionary to smishing detection
    • Systematic handling of informal language expressions in SMS
    • Significantly improves the classifier's recognition of non-standard spellings
  2. Two-Stage Processing Framework:
    • Preprocessing stage ensures text consistency
    • Classification stage makes accurate judgments based on normalized text
  3. Privacy Protection Design:
    • Local processing without third-party services
    • Based solely on text content features without collecting personal user information

Experimental Setup

Dataset

  • Data Source: SMS Spam Dataset v.1 (public dataset)
  • Original Scale: 5,574 messages (4,827 ham, 747 spam)
  • Processed Scale: 5,169 messages (4,807 ham, 362 smishing)
  • Data Sources:
    • Grumbletext website: 425 spam
    • Dr. Caroline Tag's dissertation: 450 ham
    • NUS SMS Corpus: 3,375 ham
    • SMS Spam Corpus v.0.1: 1,002 ham, 322 spam
    • Pinterest collection: 71 smishing

Dataset Statistical Features

| Feature | Ham Messages | Smishing Messages |
|---|---|---|
| Average Characters | 74.55 | 148.72 |
| Average Words | 14.76 | 24.72 |
| URL Frequency | 0.0027 | 0.2513 |
| Symbol ($, €) Frequency | 0.0037 | 0.0193 |
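
Statistics of this kind could be computed along the following lines. This is only a sketch: the source does not define "frequency", so it is assumed here to mean the fraction of messages containing at least one URL or currency symbol, and the URL pattern and the `describe` helper are hypothetical.

```python
import re

# Assumed URL pattern; the paper's exact URL detection rule is not given.
URL_RE = re.compile(r"(https?://|www\.)\S+", re.IGNORECASE)


def describe(messages):
    """Per-class summary statistics for a non-empty list of raw SMS strings."""
    n = len(messages)
    return {
        "avg_chars": sum(len(m) for m in messages) / n,
        "avg_words": sum(len(m.split()) for m in messages) / n,
        # Fraction of messages with at least one URL (assumed definition).
        "url_freq": sum(bool(URL_RE.search(m)) for m in messages) / n,
        # Fraction of messages containing "$" or "€" (assumed definition).
        "symbol_freq": sum(any(s in m for s in "$€") for m in messages) / n,
    }


stats = describe(["Win $100 now at www.example.com", "see you soon"])
print(stats)
```

Running `describe` separately on the ham and smishing subsets would yield one column of the table each.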

Evaluation Metrics

  • True Positive Rate (TPR): $TPR = \frac{TP}{TP + FN}$
  • True Negative Rate (TNR): $TNR = \frac{TN}{TN + FP}$
  • False Positive Rate (FPR): $FPR = \frac{FP}{FP + TN}$
  • Accuracy: $A = \frac{TP + TN}{TP + TN + FP + FN}$
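
As a quick illustration, the metrics can be computed directly from confusion-matrix counts, treating smishing as the positive class. The counts below are hypothetical round numbers, not taken from the paper, and the false negative rate is included using its standard definition $FN / (FN + TP)$.

```python
def rates(tp, fn, tn, fp):
    """Compute the evaluation metrics from confusion-matrix counts."""
    return {
        "TPR": tp / (tp + fn),
        "TNR": tn / (tn + fp),
        "FPR": fp / (fp + tn),
        "FNR": fn / (fn + tp),  # standard definition, complement of TPR
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
    }


# Hypothetical counts for illustration only.
r = rates(tp=90, fn=10, tn=880, fp=20)
print(r)
```

Note that TPR and FNR sum to 1, as do TNR and FPR, so each pair carries the same information.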

Comparison Methods

  • S-Detector (Joo et al.): Naive Bayes classifier
  • SMSAssassin (Yadav et al.): Bayesian learning + SVM
  • Lee et al.: Cloud environment detection method

Implementation Details

  • Platform: Python
  • System Configuration: i5 processor, 2.4GHz, 8GB RAM
  • Dependencies: NLTK, plus the Python modules csv, sys, and ConfigParser
  • Data Split: 90% training, 10% testing
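
A 90/10 split can be sketched as below. The source does not say whether the split was random, stratified, or sequential, so the seeded shuffle and the `split_90_10` helper are assumptions for illustration.

```python
import random


def split_90_10(samples, seed=0):
    """Shuffle a copy of the samples and split it 90% / 10%."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    cut = len(shuffled) * 9 // 10          # integer arithmetic avoids float rounding
    return shuffled[:cut], shuffled[cut:]


# Integers stand in for the 5,169 processed messages.
train_set, test_set = split_90_10(range(5169))
print(len(train_set), len(test_set))
```

For 5,169 items this gives 4,652 training and 517 test samples. With only 362 smishing messages in the data, a stratified split would keep the class ratio stable in the small test set, which may be worth considering when reproducing the experiment.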

Experimental Results

Main Results

| Method | TPR | TNR | FPR | FNR | Accuracy |
|---|---|---|---|---|---|
| Without Preprocessing and Normalization | 94.28% | 87.74% | 12.25% | 5.71% | 88.20% |
| With Preprocessing and Normalization | 97.14% | 96.12% | 3.87% | 2.85% | 96.20% |

Comparative Experimental Results

| Method | Content Analysis | Text Normalization | Algorithm | Accuracy |
|---|---|---|---|---|
| Joo et al. | | | Naive Bayes | - |
| Yadav et al. | | | Bayes + SVM | 84.75% |
| Lee et al. | | | Source Content Analysis | - |
| Proposed Method | | | Naive Bayes | 96.20% |

Ablation Study

By comparing results with and without preprocessing normalization, the importance of text normalization is demonstrated:

  • Accuracy Improvement: From 88.20% to 96.20% (+8.00 percentage points)
  • TPR Improvement: From 94.28% to 97.14%
  • TNR Improvement: From 87.74% to 96.12%
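
The size of these gains can be checked with simple arithmetic on the reported figures. The sketch below computes each gain in percentage points and, as an additional derived quantity not stated in the source, the relative reduction in the overall error rate.

```python
# Reported results without and with preprocessing and normalization (percent).
before = {"accuracy": 88.20, "TPR": 94.28, "TNR": 87.74}
after = {"accuracy": 96.20, "TPR": 97.14, "TNR": 96.12}

# Absolute gain in percentage points for each metric.
gain_pp = {k: round(after[k] - before[k], 2) for k in before}

# Relative reduction of the overall error rate (100% minus accuracy).
error_before = 100 - before["accuracy"]
error_after = 100 - after["accuracy"]
error_reduction = round((error_before - error_after) / error_before * 100, 1)

print(gain_pp, error_reduction)
```

The gains are 8.00, 2.86, and 8.38 percentage points for accuracy, TPR, and TNR respectively, and the overall error rate falls from 11.80% to 3.80%, a relative reduction of about 67.8%. Most of the improvement therefore comes from fewer ham messages being misclassified as smishing.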

Case Analysis

Examples of the effect of text normalization:

  • The smishing probability of the token "call" rose from 0.443425 to 0.464832 after normalization
  • The smishing probability of the token "offer" rose from 0.033639 to 0.055046 after normalization
  • Normalization maps variant spellings onto a single token, so word statistics become more consistent and classifier accuracy improves

Mobile Phishing Attack Classification

The paper proposes a comprehensive mobile phishing attack taxonomy:

  1. Social Engineering Attacks: SMS, VoIP, websites, email
  2. Mobile Application Attacks: Similarity attacks, forwarding attacks, background attacks
  3. Malware Attacks: Trojans, worms, rootkits, ransomware
  4. Social Network Attacks: Identity spoofing, malicious links, fake profiles
  5. Content Injection Attacks: XSS attacks
  6. Wireless Medium Attacks: Wi-Fi, Bluetooth attacks
  7. Technical Deception Attacks: DNS poisoning, man-in-the-middle attacks

Defense Mechanism Classification

  1. User Education: Warning mechanisms, gamified training
  2. Smishing Detection: S-Detector, SMSAssassin, DCA methods
  3. Phishing Website Detection: MobiFish, kAYO, MP-Shield
  4. Malicious Application Detection: VeriUI, StopBankun, Andromaly
  5. QR Code Technology: Single sign-on, authentication schemes
  6. Personalized Security Indicators

Conclusions and Discussion

Main Conclusions

  1. Importance of Text Normalization: Preprocessing and normalization raise overall detection accuracy by 8.00 percentage points (88.20% to 96.20%)
  2. Method Effectiveness: Achieving excellent 96.20% accuracy on public datasets
  3. Practical Value: Providing a complete smishing detection solution
  4. Theoretical Contribution: Systematically organizing mobile phishing attacks and defense mechanisms

Limitations

  1. Dataset Limitations:
    • Lack of dedicated smishing dataset; manual extraction from spam required
    • Relatively small dataset scale (362 smishing messages)
    • English text support only
  2. Method Limitations:
    • Based solely on text content, not considering URLs, senders, and other features
    • Dependent on dictionary quality; potential incomplete dictionary coverage
    • Adaptability to novel attack methods requires verification
  3. Experimental Limitations:
    • Lack of comparison with more recent methods
    • No cross-dataset validation
    • Absence of real-time performance evaluation

Future Directions

  1. URL Analysis: Combining URL features to detect malicious links and downloads
  2. Context Understanding: Improving normalization process by selecting optimal word meanings based on context
  3. Dataset Expansion: Building larger-scale, multilingual smishing datasets
  4. Multimodal Fusion: Combining text, URL, sender information, and other features
  5. Real-time Deployment: Optimizing algorithm efficiency for real-time detection on mobile devices

In-Depth Evaluation

Strengths

  1. Strong Problem Targeting: Specifically addressing smishing, an important but understudied security threat
  2. Methodological Innovation: First systematic application of text normalization to smishing detection
  3. Sufficient Experiments: Ablation studies demonstrating component contributions
  4. Comprehensive Literature Review: Providing one of the most comprehensive reviews in the field
  5. High Practical Value: Simple, effective method easily deployable in practice

Weaknesses

  1. Limited Technical Depth: Primarily using traditional machine learning; deep learning unexplored
  2. Simple Feature Engineering: Using only text content; relatively limited features
  3. Incomplete Evaluation: Lacking analysis of false positive rates' impact on user experience
  4. Scalability Issues: Generalization ability to novel attack methods requires verification
  5. Unknown Real-time Performance: Lacking performance testing on mobile devices

Impact

  1. Academic Contribution:
    • Filling research gaps in smishing detection
    • Providing systematic attack and defense taxonomy
    • Demonstrating importance of text normalization in security detection
  2. Practical Value:
    • Direct application to mobile security products
    • Filtering solutions for SMS gateways
    • Personal protection tools for users
  3. Reproducibility:
    • Using public datasets
    • Clear method description
    • Detailed algorithm procedures

Applicable Scenarios

  1. Mobile Operators: Real-time SMS gateway filtering
  2. Security Vendors: Integration into mobile security products
  3. Enterprise Users: Internal SMS security monitoring
  4. Individual Users: Smartphone security applications
  5. Research Institutions: Baseline method for further improvements

References

The paper cites 63 relevant references, covering:

  • Classical phishing attack detection methods
  • Mobile security threat analysis
  • Machine learning applications in text classification
  • SMS spam filtering techniques
  • Mobile malware detection methods

Primary references include APWG phishing reports, IEEE and ACM conference papers, and important journal articles in related fields, with authoritative and comprehensive citation coverage.


Overall Assessment: This is practical research that addresses an important security problem, with some methodological innovation in applying text normalization to smishing detection and solid experimental results. Although its technical depth is limited, it provides an effective baseline for smishing detection with both academic and practical value.