Blockchain-Based Federated Learning: Incentivizing Data Sharing and Penalizing Dishonest Behavior

Jaberzadeh, Shrestha, Khan et al.
With the increasing importance of data sharing for collaboration and innovation, it is becoming more important to ensure that data is managed and shared in a secure and trustworthy manner. Data governance is a common approach to managing data, but it faces many challenges such as data silos, data consistency, privacy, security, and access control. To address these challenges, this paper proposes a comprehensive framework that integrates data trust in federated learning with InterPlanetary File System, blockchain, and smart contracts to facilitate secure and mutually beneficial data sharing while providing incentives, access control mechanisms, and penalizing any dishonest behavior. The experimental results demonstrate that the proposed model is effective in improving the accuracy of federated learning models while ensuring the security and fairness of the data-sharing process. The research paper also presents a decentralized federated learning platform that successfully trained a CNN model on the MNIST dataset using blockchain technology. The platform enables multiple workers to train the model simultaneously while maintaining data privacy and security. The decentralized architecture and use of blockchain technology allow for efficient communication and coordination between workers. This platform has the potential to facilitate decentralized machine learning and support privacy-preserving collaboration in various domains.

Basic Information

  • Paper ID: 2307.10492
  • Title: Blockchain-Based Federated Learning: Incentivizing Data Sharing and Penalizing Dishonest Behavior
  • Authors: Amir Jaberzadeh, Ajay Kumar Shrestha, Faijan Ahamad Khan, Mohammed Afaan Shaikh, Bhargav Dave, Jason Geng
  • Institutions: Bayes Solutions (USA) and Vancouver Island University (Canada)
  • Classification: cs.LG (Machine Learning)
  • Publication Date: July 2023
  • Paper Link: https://arxiv.org/abs/2307.10492

Abstract

This paper proposes a comprehensive framework addressing security and trust issues in data sharing by integrating federated learning with blockchain, smart contracts, and IPFS (InterPlanetary File System). The framework promotes secure and reciprocal data sharing through incentive mechanisms, access control, and penalty mechanisms. Experimental results demonstrate that the model achieves over 95% accuracy when training a CNN model on the MNIST dataset while ensuring security and fairness in the data sharing process. The platform supports multiple worker nodes training models simultaneously, maintaining data privacy and security through decentralized architecture and blockchain technology.

Research Background and Motivation

1. Core Problems to Address

This research tackles several key challenges:

  • Data Silos: Difficulty in sharing and integrating data across different organizations
  • Privacy and Security: Privacy breach risks associated with centralized data storage and sharing
  • Lack of Trust: Absence of reliable trust mechanisms among participants
  • Insufficient Incentives: Lack of effective mechanisms to promote high-quality data sharing
  • Malicious Behavior: Need to prevent and penalize participants providing low-quality or malicious data

2. Problem Significance

As data sharing becomes increasingly important for collaboration and innovation, ensuring data is managed and shared in a secure and trustworthy manner has become critical. Traditional data governance approaches face multiple challenges including data consistency, compatibility, privacy, security, access control, ownership, and sharing rewards.

3. Limitations of Existing Approaches

  • Traditional Federated Learning: Relies on central servers with single point of failure risks; central servers may be attacked, compromising system privacy
  • Centralized Storage: Increases data breach risks and raises concerns about data ownership and control
  • Existing FedAvg Variants: While proposing various improvements (e.g., momentum methods, adaptive learning rates), they remain insufficient in privacy protection, incentive mechanisms, and malicious behavior prevention

4. Research Motivation

This paper aims to construct a decentralized federated learning framework by integrating blockchain, smart contracts, IPFS, and cryptographic techniques, simultaneously addressing multiple challenges including privacy protection, incentive mechanisms, access control, and malicious behavior penalties.

Core Contributions

  1. Proposed a comprehensive decentralized federated learning framework: Integrating data trust, IPFS, blockchain, and smart contracts into federated learning to enable secure and reciprocal data sharing
  2. Designed collateral-based incentive and penalty mechanisms: Through smart contracts requiring participants to provide collateral, imposing economic penalties on those providing low-quality or malicious data, and distributing penalties to honest participants
  3. Implemented a dual encryption scheme: Combining symmetric encryption (AES) and asymmetric encryption (RSA) to protect model and data confidentiality with only 2% computational overhead
  4. Constructed IPFS-based decentralized model storage: Avoiding centralized storage risks and supporting peer-to-peer model sharing
  5. Verified framework effectiveness: Achieving over 95% accuracy on the MNIST dataset, demonstrating the feasibility and efficiency of the decentralized architecture

Methodology Details

Task Definition

This paper studies the task of building a decentralized federated learning platform enabling multiple participants (worker nodes) to collaboratively train a global machine learning model without sharing raw data. The system must satisfy the following requirements:

  • Input: Local datasets of each worker node, initial model, number of training rounds, total reward amount
  • Output: Trained global model
  • Constraints: Protect data privacy, prevent malicious behavior, fairly distribute rewards, decentralized architecture

Model Architecture

1. Overall Architecture Design

The system contains two types of roles:

  • Requester: Initiates federated learning tasks, deploys smart contracts, sets training parameters (rounds N, total reward D), pushes initial model to IPFS
  • Workers: Participate in training tasks, train models on local data, evaluate other nodes' models, receive rewards based on performance

Core components:

  • Blockchain and Smart Contracts: Coordinate FL tasks, manage participant information, allocate rewards and penalties
  • IPFS Storage: Decentralized storage for trained models
  • Encryption Module: Protect model and data confidentiality

2. Module Functions and Implementation

a) Data Trust, Access Control, and Incentive Mechanisms

  • Participants must register and provide collateral deposits
  • Collateral serves as an economic penalty mechanism preventing participants from providing low-quality or misleading data
  • If participants behave dishonestly, collateral is forfeited and distributed to honest participants
  • Smart contracts update and allocate total compensation based on participant contributions
  • Ensures each participant registers only once, with compensation distributed only when total compensation is positive
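
The registration and collateral rules above can be sketched as a small in-memory model. This is an illustration only, not the paper's smart contract: the class name CollateralPool, its method names, and the equal split of forfeited collateral among the remaining participants are assumptions.

```python
class CollateralPool:
    """Toy in-memory model of collateral handling (hypothetical, not the paper's contract)."""

    def __init__(self, required_deposit):
        self.required_deposit = required_deposit
        self.deposits = {}  # participant -> collateral currently held

    def register(self, participant, deposit):
        # Each participant may register only once and must post enough collateral.
        if participant in self.deposits:
            raise ValueError(f"{participant} is already registered")
        if deposit < self.required_deposit:
            raise ValueError("deposit is below the required collateral")
        self.deposits[participant] = deposit

    def penalize(self, dishonest):
        # Forfeit the dishonest participant's collateral and split it
        # equally among the remaining participants (assumed split rule).
        forfeited = self.deposits.pop(dishonest)
        honest = list(self.deposits)
        if not honest:
            return forfeited  # nobody left to receive it
        share = forfeited / len(honest)
        for participant in honest:
            self.deposits[participant] += share
        return forfeited


pool = CollateralPool(required_deposit=10)
for name in ("alice", "bob", "carol"):
    pool.register(name, 10)
pool.penalize("carol")
print(pool.deposits)  # alice and bob each hold 15.0
```

How dishonesty is detected is outside this sketch; it only shows the bookkeeping once a participant has been flagged.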

b) IPFS Storage

  • Uses InterPlanetary File System as a peer-to-peer distributed file system
  • Models are stored on user devices without requiring centralized storage
  • Reduces data breach risks and enhances data ownership and control

c) Confidentiality and Privacy Protection

  • Adopts hybrid encryption scheme:
    • Uses symmetric keys (AES) to encrypt actual data/models
    • Uses asymmetric keys (RSA) to encrypt symmetric keys
    • Ensures only recipients with corresponding private keys can decrypt data
  • Implements encryption functionality using Python's cryptography library
  • Implements methods for obtaining, decrypting, and pushing encrypted model states
  • Optimizes memory usage: maintains hash lists of pushed models, clearing after reaching specified count
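
A minimal sketch of the hybrid scheme described above, using Python's cryptography library. The summary does not state the AES mode, RSA padding, or key sizes, so AES-256-GCM, OAEP with SHA-256, and a 2048-bit RSA key are assumptions here, as are the function names encrypt_model and decrypt_model.

```python
import os

from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import padding, rsa
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

# OAEP padding used to wrap the AES key with RSA (assumed choice).
OAEP = padding.OAEP(
    mgf=padding.MGF1(algorithm=hashes.SHA256()),
    algorithm=hashes.SHA256(),
    label=None,
)


def encrypt_model(model_bytes, recipient_public_key):
    """Encrypt the payload with a fresh AES key, then wrap that key with RSA."""
    aes_key = AESGCM.generate_key(bit_length=256)
    nonce = os.urandom(12)  # 96-bit nonce, must be unique per key
    ciphertext = AESGCM(aes_key).encrypt(nonce, model_bytes, None)
    wrapped_key = recipient_public_key.encrypt(aes_key, OAEP)
    return wrapped_key, nonce, ciphertext


def decrypt_model(wrapped_key, nonce, ciphertext, recipient_private_key):
    """Unwrap the AES key with the RSA private key, then decrypt the payload."""
    aes_key = recipient_private_key.decrypt(wrapped_key, OAEP)
    return AESGCM(aes_key).decrypt(nonce, ciphertext, None)


private_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
serialized_model = b"serialized model state"  # placeholder for real model bytes
bundle = encrypt_model(serialized_model, private_key.public_key())
recovered = decrypt_model(*bundle, private_key)
```

Only the small AES key goes through RSA; the model itself is encrypted symmetrically, which is what keeps the cost low for large payloads.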

d) Smart Contract Functions

Smart contracts include the following key functions:

  • initializeTask: Requester initializes FL task, sets model URI and rounds, requires deposit
  • startTask: Requester starts task, status changes to "running"
  • joinTask: Worker nodes join task, register and obtain model URI
  • submitScore: Worker nodes submit model scores after evaluation each round
  • removeWorker: Worker nodes exit task
  • nextRound: Requester advances to next round
  • getSubmissions: Requester retrieves all submissions for current round
  • submitRoundTopK: Obtains top K performing worker nodes
  • distributeRewards: Distributes rewards to worker nodes (the top K performers receive half of the rewards; the remainder is distributed in smaller shares among the other workers)
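
The reward rule in distributeRewards can be illustrated with a short Python sketch. This is not the contract code: the function name distribute_rewards and the equal split within each group are assumptions, since the summary only says the top K receive half and the rest receive smaller shares.

```python
def distribute_rewards(scores, total_reward, k):
    """Split total_reward: half to the top-k scorers, half to everyone else."""
    if not 0 < k < len(scores):
        raise ValueError("k must be between 1 and len(scores) - 1")
    ranked = sorted(scores, key=scores.get, reverse=True)
    top, rest = ranked[:k], ranked[k:]
    half = total_reward / 2
    # Equal split inside each group is an assumption of this sketch.
    payouts = {worker: half / len(top) for worker in top}
    payouts.update({worker: half / len(rest) for worker in rest})
    return payouts


scores = {"w1": 0.97, "w2": 0.95, "w3": 0.90, "w4": 0.88, "w5": 0.80}
payouts = distribute_rewards(scores, total_reward=120, k=2)
print(payouts)  # w1, w2 get 30.0 each; w3, w4, w5 get 20.0 each
```

Under this split, the non-top shares are only smaller than the top shares when more than K workers fall outside the top group.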

3. Workflow

  1. Initialization Phase:
    • Requester deploys smart contract, sets training rounds N and total reward D
    • Requester pushes initial model to IPFS
    • Worker nodes join task through smart contract
  2. Training Phase (N rounds total):
    • Each round, worker nodes retrieve all other workers' trained models from IPFS
    • Worker nodes evaluate these models on local data, calculate scores
    • Scores are submitted to smart contract
    • Smart contract aggregates scores, determines top K performing worker nodes
    • Allocates rewards based on performance
    • Worker nodes train models on local data
    • Trained models are pushed to IPFS
    • Repeat for N rounds
  3. Completion Phase:
    • After training completes, requester retrieves final global model from IPFS
    • Calls smart contract function to close task
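
The round structure above can be simulated in a few lines of Python. Everything here is a stand-in: the dictionary ipfs replaces real IPFS, the ranking step replaces the smart contract, and the one-weight "model" with its train and score functions is a toy chosen only to make the loop runnable.

```python
import hashlib
import json

ipfs = {}  # stand-in for IPFS: content hash -> stored bytes


def push(model):
    """Store a model and return its content hash (stand-in for an IPFS CID)."""
    data = json.dumps(model).encode()
    digest = hashlib.sha256(data).hexdigest()
    ipfs[digest] = data
    return digest


def fetch(digest):
    return json.loads(ipfs[digest])


def train(model, local_data, lr=0.5):
    """Toy training step: move the single weight toward the local data mean."""
    target = sum(local_data) / len(local_data)
    return {"w": model["w"] + lr * (target - model["w"])}


def score(model, local_data):
    """Toy score: higher is better (negative mean absolute error)."""
    return -sum(abs(x - model["w"]) for x in local_data) / len(local_data)


def run_task(worker_data, rounds, k):
    initial = push({"w": 0.0})  # requester pushes the initial model
    latest = {w: initial for w in worker_data}  # worker -> hash of latest model
    top_k_history = []
    for _ in range(rounds):
        # Each worker scores every other worker's latest model on local data.
        totals = {w: 0.0 for w in worker_data}
        for evaluator, data in worker_data.items():
            for owner, digest in latest.items():
                if owner != evaluator:
                    totals[owner] += score(fetch(digest), data)
        # The "contract" ranks workers by aggregated score.
        top_k_history.append(sorted(totals, key=totals.get, reverse=True)[:k])
        # Each worker then trains locally and pushes the new model.
        latest = {w: push(train(fetch(latest[w]), data))
                  for w, data in worker_data.items()}
    return latest, top_k_history


worker_data = {"w1": [1.0, 1.2], "w2": [0.9, 1.1], "w3": [5.0, 5.2]}
latest, history = run_task(worker_data, rounds=3, k=2)
```

Reward payout and task closing are omitted; the sketch only shows the order of evaluation, ranking, training, and pushing within a round.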

4. Aggregation/Averaging Method

  • Worker nodes retrieve their own models and other workers' models from IPFS storage
  • Use averaging function to sum all models and divide by number of contributing worker nodes
  • Obtain averaged model to improve accuracy
  • This method avoids extensive communication between central server and clients in centralized FedAvg, reducing channel congestion and privacy attack risks
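
The averaging step amounts to an element-wise mean over the retrieved models. The sketch below uses plain dictionaries of parameter lists to stay dependency-free; in the PyTorch setting the same idea would apply to the tensors of each model's state dict. The function name average_models is hypothetical.

```python
def average_models(models):
    """Element-wise average of models given as dicts of parameter lists."""
    if not models:
        raise ValueError("need at least one model to average")
    n = len(models)
    return {
        name: [sum(values) / n for values in zip(*(m[name] for m in models))]
        for name in models[0]
    }


own = {"weight": [1.0, 2.0], "bias": [0.5]}
peer_a = {"weight": [3.0, 4.0], "bias": [1.5]}
peer_b = {"weight": [5.0, 6.0], "bias": [1.0]}
averaged = average_models([own, peer_a, peer_b])
print(averaged)  # {'weight': [3.0, 4.0], 'bias': [1.0]}
```

Every model is weighted equally here, matching the "sum and divide by the number of contributing workers" description.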

Technical Innovations

1. Differences from Baseline Methods

  • Decentralized Architecture: Does not rely on central servers, avoiding single point of failure and privacy attacks
  • Economic Incentive Mechanisms: Incentivizes honest behavior and penalizes malicious behavior through collateral and reward systems
  • Dual Encryption: Combines AES and RSA, controlling overhead to 2% while ensuring security
  • Blockchain + IPFS: Leverages blockchain's immutability and IPFS's decentralized storage

2. Design Rationality Analysis

  • Collateral Mechanism: Effectively constrains participant behavior through economic means, more deterrent than pure technical approaches
  • Multi-dimensional Performance Evaluation: Considers accuracy, consistency, precision, and recall, comprehensively assessing worker contributions
  • Hybrid Encryption: Symmetric encryption is efficient (suitable for large data), asymmetric encryption is secure (suitable for key exchange), combining both balances efficiency and security
  • IPFS Storage: Naturally fits decentralized architecture, content addressing mechanism ensures data integrity
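
The integrity argument for content addressing can be shown with a toy store: because the address is derived from the content, any tampering is detectable on retrieval. This sketch uses a bare SHA-256 hex digest as the address, which is a simplification; real IPFS identifiers (CIDs) are multihash-based. The class name ContentStore is hypothetical.

```python
import hashlib


class ContentStore:
    """Toy content-addressed store; real IPFS uses multihash-based CIDs."""

    def __init__(self):
        self._blobs = {}

    def put(self, data):
        address = hashlib.sha256(data).hexdigest()
        self._blobs[address] = data
        return address

    def get(self, address):
        data = self._blobs[address]
        # Re-hash on retrieval: tampered content no longer matches its address.
        if hashlib.sha256(data).hexdigest() != address:
            raise ValueError("content does not match its address")
        return data


store = ContentStore()
address = store.put(b"model weights v1")
assert store.get(address) == b"model weights v1"
```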

Experimental Setup

Dataset

  • Dataset Name: MNIST handwritten digit dataset
  • Data Scale:
    • Training set: 60,000 images
    • Test set: 10,000 images
  • Task: Classify handwritten digits 0-9
  • Data Distribution: Training set uniformly distributed to worker nodes at training start
  • Evaluation: Each worker node uses test set for evaluation and scoring
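
To make the data distribution concrete, the sketch below splits the 60,000 training indices evenly across workers. Whether the paper shuffles before splitting is not stated in this summary, so the contiguous shards and the function name partition_indices are assumptions.

```python
def partition_indices(num_samples, num_workers):
    """Split sample indices into contiguous, near-equal shards, one per worker."""
    base, extra = divmod(num_samples, num_workers)
    shards, start = [], 0
    for worker in range(num_workers):
        size = base + (1 if worker < extra else 0)
        shards.append(range(start, start + size))
        start += size
    return shards


for workers in (3, 5):
    shards = partition_indices(60_000, workers)
    print(workers, [len(s) for s in shards])
# 3 workers -> 20000 samples each; 5 workers -> 12000 samples each
```

With 3 workers each node holds 20,000 images, against 12,000 with 5 workers, so per-node data shrinks as the worker count grows.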

Evaluation Metrics

  • Accuracy: Percentage of correct classifications
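  • Precision: Fraction of predictions for a class that are correct (reported value: 0.973)
  • Recall: Fraction of true instances of a class that are correctly predicted (reported value: 0.97)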
  • Convergence Time: Time required for model to reach target accuracy
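
For a 10-class task such as MNIST, precision and recall must be aggregated across classes. The summary does not say which averaging the paper uses, so the sketch below assumes macro-averaging (the unweighted mean of per-class values); the function name macro_metrics is hypothetical.

```python
def macro_metrics(y_true, y_pred):
    """Return (accuracy, macro precision, macro recall) for class labels."""
    classes = sorted(set(y_true) | set(y_pred))
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    precisions, recalls = [], []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        predicted = sum(p == c for p in y_pred)  # how often c was predicted
        actual = sum(t == c for t in y_true)     # how often c truly occurs
        precisions.append(tp / predicted if predicted else 0.0)
        recalls.append(tp / actual if actual else 0.0)
    return accuracy, sum(precisions) / len(classes), sum(recalls) / len(classes)


y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
accuracy, precision, recall = macro_metrics(y_true, y_pred)
```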

Comparison Methods

  • Encrypted vs Unencrypted: Compare dual encryption's impact on convergence time
  • Different Worker Node Counts: 3 worker nodes vs 5 worker nodes

Implementation Details

  • Model: Simple feedforward neural network (CNN), N layers
  • Framework: PyTorch
  • Blockchain: Ethereum blockchain
  • Simulation Environment: Ganache (local Ethereum blockchain testing environment)
  • Hardware: Xeon CPU, 8 cores
  • Training Method: Decentralized client-server system implemented on local machine, executed sequentially (can also be parallel)
  • Maximum Rounds: 90 epochs

Experimental Results

Main Results

1. Performance Analysis

  • Accuracy: Achieves over 95% accuracy within 90 epochs
  • Precision: 0.973
  • Recall: 0.97
  • Total Training Time (3 worker nodes): 6525.46 seconds
  • Convergence Time per Worker Node: Approximately 36 minutes
  • Conclusion: Convergence time is comparable to centralized federated learning frameworks

2. Encryption Overhead Analysis

  • Dual Encryption Additional Overhead:
    • Total for all 3 worker nodes: 2 minutes 34 seconds
    • Per worker node: 51 seconds
    • Communication Cost Proportion: Only 2% of convergence time
  • Conclusion: Overhead from dual encryption/decryption and secure key pair transmission protocol is minimal, acceptable while maintaining same accuracy
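
The reported figures can be cross-checked with a few lines of arithmetic. The sketch assumes the 2% figure is the ratio of total encryption time to total training time for the 3-worker run; the summary does not state the denominator explicitly.

```python
total_training_s = 6525.46        # total training time, 3 workers (reported)
encryption_total_s = 2 * 60 + 34  # 2 min 34 s of encryption overhead (reported)
workers = 3

per_worker_min = total_training_s / workers / 60
encryption_per_worker_s = encryption_total_s / workers
overhead_pct = 100 * encryption_total_s / total_training_s

print(f"{per_worker_min:.1f} min per worker")                 # 36.3 min per worker
print(f"{encryption_per_worker_s:.1f} s encryption per worker")  # 51.3 s
print(f"{overhead_pct:.2f}% overhead")                        # 2.36% overhead
```

The numbers are mutually consistent: roughly 36 minutes and 51 seconds per worker, with an overhead of about 2.4%, which the paper rounds to 2%.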

3. Worker Node Count Comparison

  • 3 Worker Nodes:
    • More stable accuracy pattern
    • Reason: Each worker node possesses more training data
  • 5 Worker Nodes:
    • Achieves acceptable accuracy within similar epoch count
    • Can accelerate training process, expand training scale
    • Reduces computational requirements per worker node, enabling low-end devices as compute nodes
  • Conclusion:
    • Increasing worker node count does not negatively impact model convergence
    • Worker node count should be chosen in proportion to the training dataset size, so that each worker still receives enough data
    • In practical scenarios, a larger training dataset improves the stability of multi-worker training

Ablation Studies

The paper primarily conducted ablation studies on encryption overhead:

  • Compared convergence time with and without dual encryption
  • Demonstrated encryption mechanism adds only 2% overhead, validating design efficiency

Case Analysis

The paper demonstrates accuracy evolution during training:

  • All three worker nodes start with low initial accuracy
  • Accuracy significantly improves in first round (3 epochs)
  • Subsequently worker nodes train sequentially, accuracy steadily improves
  • Finally all worker nodes achieve over 95% accuracy

Experimental Findings

  1. Decentralized Architecture Feasibility: Experiments indicate that decentralized federated learning can achieve performance comparable to centralized methods
  2. Controllable Encryption Overhead: Dual encryption scheme adds only 2% time overhead, demonstrating good balance between security and efficiency
  3. Scalability: Increasing worker node count does not harm model performance, instead accelerating training and reducing individual node computational requirements
  4. Data Distribution Importance: Worker node count should match training dataset scale to maintain training stability

Related Work

1. Federated Learning Domain

  • FedAvg and Variants:
    • FedAvg [2]: Foundational federated averaging algorithm
    • Momentum Methods [6]: For local client training
    • Adaptive FedAvg [7]: Employs adaptive learning rates
    • Lazy and Quantized Gradients [8]: Reduce communication
    • Newton-type Schemes [9]: FedDANE
  • Decentralized Gradient Descent:
    • DGD and Variants [10-13]
    • DSGD [14]: Decentralized stochastic gradient descent
    • Asynchronous DSGD [15]
    • Quantized DSGD [16]

2. Blockchain + Federated Learning

  • Smart Healthcare [18]: Privacy-preserving architecture using blockchain and federated learning
  • Vehicular Networks [19]: Blockchain-based federated learning scheme with reputation-based incentive mechanisms

3. Advantages Over Related Work

  • Comprehensive Framework: Integrates incentive mechanisms, penalty mechanisms, access control, and privacy protection
  • Efficient Encryption: Dual encryption scheme with only 2% overhead
  • Practicality: Validates effectiveness on a real dataset (MNIST)
  • Economic Incentives: Introduces a collateral mechanism that constrains participant behavior from an economic perspective

Conclusions and Discussion

Main Conclusions

  1. The proposed decentralized federated learning architecture successfully integrates blockchain, smart contracts, and IPFS, achieving secure and efficient global model training
  2. Experimental results demonstrate the framework achieves over 95% accuracy within 90 epochs, with convergence time comparable to centralized federated learning frameworks
  3. The dual encryption scheme adds minimal 2% overhead, demonstrating good balance between security and efficiency
  4. The method effectively addresses multiple data management and sharing challenges by establishing trust among stakeholders, promoting reciprocal data sharing, and preventing behaviors that could compromise data security and accuracy

Limitations

  1. Experimental Scale: Only sequential execution testing on local machine, not validated in large-scale distributed environments
  2. Single Dataset: Only MNIST dataset used, lacking validation on more complex datasets and tasks
  3. Blockchain Costs: Lacks detailed analysis of blockchain transaction costs and scalability issues
  4. Malicious Behavior Detection: Collateral mechanism relies on accurate performance evaluation, but lacks in-depth discussion of detecting complex malicious behaviors (e.g., model poisoning attacks)
  5. Worker Node Selection: Does not discuss dynamic worker node selection and management, or handling dynamic node joining and leaving
  6. Practical Deployment Challenges: Does not address network latency, node heterogeneity, and other real-world deployment issues

Future Directions

Future research directions explicitly proposed in the paper:

  • Scalability Research: Explore scalability in real-world scenarios
  • Feasibility Validation: Validate model feasibility in practical applications

Other potential directions:

  • Test framework on more complex datasets and tasks
  • Research advanced malicious behavior detection and defense mechanisms
  • Optimize blockchain transaction costs and throughput
  • Develop dynamic worker node management mechanisms
  • Study performance under heterogeneous devices and network conditions

In-Depth Evaluation

Strengths

1. Method Innovation

  • Multi-technology Integration: Innovatively integrates blockchain, smart contracts, IPFS, and cryptographic techniques into federated learning, forming a complete ecosystem
  • Economic Incentive Mechanisms: Collateral and reward systems constrain participant behavior from economic perspective, effectively complementing technical approaches
  • Hybrid Encryption Scheme: AES+RSA combination balances efficiency and security

2. Experimental Sufficiency

  • Provides multi-dimensional evaluation including accuracy, precision, and recall
  • Compares encrypted and unencrypted performance differences
  • Tests impact of different worker node counts
  • Provides concrete timing and performance data

3. Result Convincingness

  • Over 95% accuracy demonstrates method effectiveness
  • 2% encryption overhead demonstrates scheme practicality
  • Convergence time comparable to existing methods demonstrates competitiveness

4. Writing Clarity

  • Clear architecture design with detailed process descriptions
  • Provides system architecture diagrams and experimental result figures
  • Complete smart contract function descriptions

Weaknesses

1. Method Limitations

  • Insufficient Malicious Behavior Detection: Primarily relies on performance evaluation, lacking defense against advanced attacks like gradient attacks and model poisoning
  • Collateral Amount Setting: Does not discuss how to determine reasonable collateral amounts
  • Byzantine Fault Tolerance: Does not clearly specify how many malicious nodes the system can tolerate

2. Experimental Setup Defects

  • Overly Simple Dataset: MNIST is classic but simple, difficult to reflect complex scenarios
  • Lack of Real Environment Testing: Only sequential execution on local machine, not tested in real distributed environments
  • Lack of Comparative Experiments: No direct comparison with other blockchain+federated learning schemes
  • Unanalyzed Blockchain Costs: Does not provide key metrics like Gas fees and transaction latency

3. Insufficient Analysis

  • Missing Scalability Analysis: Does not discuss performance when worker node count increases significantly
  • Network Condition Impact: Does not consider performance under different network conditions
  • Heterogeneity Handling: Does not discuss device heterogeneity and data heterogeneity impacts
  • Insufficient Theoretical Analysis: Lacks convergence proofs and theoretical guarantees

Impact

1. Contribution to Field

  • Comprehensive Solution: Provides complete framework integrating multiple technologies, serving as reference for subsequent research
  • Practice-Oriented: Focuses on incentive mechanisms and malicious behavior penalties, more aligned with practical application needs
  • Pioneering Work: Offers a useful early exploration of combining blockchain with federated learning

2. Practical Value

  • Privacy Protection: Applicable to privacy-sensitive domains like healthcare and finance
  • Decentralization: Suitable for scenarios distrusting central servers
  • Incentive Mechanisms: Can promote data sharing and collaboration
  • But Real Deployment Still Faces Challenges: Blockchain costs, scalability issues require further resolution

3. Reproducibility

  • Strengths:
    • Detailed system architecture and workflow descriptions
    • Smart contract function explanations provided
    • Technology stack specified (PyTorch, Ethereum, Ganache, etc.)
  • Weaknesses:
    • Code not open-sourced
    • Lacks detailed hyperparameter settings
    • Does not provide complete smart contract code

Applicable Scenarios

1. Highly Applicable Scenarios

  • Medical Data Collaboration: Multiple hospitals jointly train models, protecting patient privacy
  • Financial Risk Control: Multiple banks share data features without exposing raw data
  • Federated Recommendation Systems: Multiple platforms collaborate to improve recommendation algorithms
  • Edge Computing: IoT devices collaboratively train models

2. Applicable Conditions

  • Participants lack mutual trust, unwilling to use central servers
  • High data privacy requirements, cannot centralize storage
  • Need incentive mechanisms to promote data sharing
  • Can accept certain blockchain transaction costs

3. Less Applicable Scenarios

  • Applications requiring extreme real-time performance (blockchain transactions have latency)
  • Scenarios with extremely large participant numbers (scalability limitations)
  • Devices with extremely limited computational resources (encryption and blockchain operations have overhead)
  • Scenarios with existing trustworthy central servers (reduced necessity for decentralization)

References

The paper cites 21 important references, key ones including:

  1. Delacroix & Lawrence (2019): Foundational approaches to data trust
  2. McMahan et al. (2017): Original FedAvg algorithm paper
  3. Sun et al. (2022): Latest advances in decentralized federated averaging
  4. Singh et al. (2022): Blockchain and federated learning applications in IoT healthcare
  5. Wang et al. (2022): Privacy-preserving federated learning for vehicular networks based on blockchain
  6. Shrestha et al. (2020, 2021): Blockchain platforms for user data sharing and incentive mechanism design

Summary

This paper proposes an innovative blockchain federated learning framework addressing trust, incentive, and privacy challenges in decentralized machine learning by integrating multiple technologies (blockchain, smart contracts, IPFS, hybrid encryption). Experiments validate method effectiveness, but practical deployment, scalability, and complex attack defense require further research. This work provides valuable insights for privacy-preserving collaborative machine learning, with particular application potential in sensitive domains like healthcare and finance.