Blockchain-Based Federated Learning: Incentivizing Data Sharing and Penalizing Dishonest Behavior

Jaberzadeh, Shrestha, Khan et al.

With the increasing importance of data sharing for collaboration and innovation, it is becoming more important to ensure that data is managed and shared in a secure and trustworthy manner. Data governance is a common approach to managing data, but it faces many challenges such as data silos, data consistency, privacy, security, and access control. To address these challenges, this paper proposes a comprehensive framework that integrates data trust in federated learning with InterPlanetary File System, blockchain, and smart contracts to facilitate secure and mutually beneficial data sharing while providing incentives, access control mechanisms, and penalizing any dishonest behavior. The experimental results demonstrate that the proposed model is effective in improving the accuracy of federated learning models while ensuring the security and fairness of the data-sharing process. The research paper also presents a decentralized federated learning platform that successfully trained a CNN model on the MNIST dataset using blockchain technology. The platform enables multiple workers to train the model simultaneously while maintaining data privacy and security. The decentralized architecture and use of blockchain technology allow for efficient communication and coordination between workers. This platform has the potential to facilitate decentralized machine learning and support privacy-preserving collaboration in various domains.

Basic Information

Paper ID: 2307.10492
Title: Blockchain-Based Federated Learning: Incentivizing Data Sharing and Penalizing Dishonest Behavior
Authors: Amir Jaberzadeh, Ajay Kumar Shrestha, Faijan Ahamad Khan, Mohammed Afaan Shaikh, Bhargav Dave, Jason Geng
Institutions: Bayes Solutions (USA) and Vancouver Island University (Canada)
Classification: cs.LG (Machine Learning)
Publication Date: July 2023
Paper Link: https://arxiv.org/abs/2307.10492

Abstract

This paper proposes a comprehensive framework addressing security and trust issues in data sharing by integrating federated learning with blockchain, smart contracts, and IPFS (InterPlanetary File System). The framework promotes secure and reciprocal data sharing through incentive mechanisms, access control, and penalty mechanisms. Experimental results demonstrate that the model achieves over 95% accuracy when training a CNN model on the MNIST dataset while ensuring security and fairness in the data sharing process. The platform supports multiple worker nodes training models simultaneously, maintaining data privacy and security through decentralized architecture and blockchain technology.

Research Background and Motivation

1. Core Problems to Address

This research tackles several key challenges:

Data Silos: Difficulty in sharing and integrating data across different organizations
Privacy and Security: Privacy breach risks associated with centralized data storage and sharing
Lack of Trust: Absence of reliable trust mechanisms among participants
Insufficient Incentives: Lack of effective mechanisms to promote high-quality data sharing
Malicious Behavior: Need to prevent and penalize participants providing low-quality or malicious data

2. Problem Significance

As data sharing becomes increasingly important for collaboration and innovation, ensuring data is managed and shared in a secure and trustworthy manner has become critical. Traditional data governance approaches face multiple challenges including data consistency, compatibility, privacy, security, access control, ownership, and sharing rewards.

3. Limitations of Existing Approaches

Traditional Federated Learning: Relies on central servers with single point of failure risks; central servers may be attacked, compromising system privacy
Centralized Storage: Increases data breach risks and raises concerns about data ownership and control
Existing FedAvg Variants: While proposing various improvements (e.g., momentum methods, adaptive learning rates), they remain insufficient in privacy protection, incentive mechanisms, and malicious behavior prevention

4. Research Motivation

This paper aims to construct a decentralized federated learning framework by integrating blockchain, smart contracts, IPFS, and cryptographic techniques, simultaneously addressing multiple challenges including privacy protection, incentive mechanisms, access control, and malicious behavior penalties.

Core Contributions

Proposed a comprehensive decentralized federated learning framework: Integrating data trust, IPFS, blockchain, and smart contracts into federated learning to enable secure and reciprocal data sharing
Designed collateral-based incentive and penalty mechanisms: Smart contracts require participants to post collateral, impose economic penalties on those who provide low-quality or malicious data, and redistribute the forfeited collateral to honest participants
Implemented a dual encryption scheme: Combining symmetric encryption (AES) and asymmetric encryption (RSA) to protect model and data confidentiality with only 2% computational overhead
Constructed IPFS-based decentralized model storage: Avoiding centralized storage risks and supporting peer-to-peer model sharing
Verified framework effectiveness: Achieving over 95% accuracy on the MNIST dataset, demonstrating the feasibility and efficiency of the decentralized architecture

Methodology Details

Task Definition

This paper studies the task of building a decentralized federated learning platform enabling multiple participants (worker nodes) to collaboratively train a global machine learning model without sharing raw data. The system must satisfy the following requirements:

Input: Local datasets of each worker node, initial model, number of training rounds, total reward amount
Output: Trained global model
Constraints: Protect data privacy, prevent malicious behavior, fairly distribute rewards, decentralized architecture

Model Architecture

1. Overall Architecture Design

The system contains two types of roles:

Requester: Initiates federated learning tasks, deploys smart contracts, sets training parameters (rounds N, total reward D), pushes initial model to IPFS
Workers: Participate in training tasks, train models on local data, evaluate other nodes' models, receive rewards based on performance

Core components:

Blockchain and Smart Contracts: Coordinate FL tasks, manage participant information, allocate rewards and penalties
IPFS Storage: Decentralized storage for trained models
Encryption Module: Protect model and data confidentiality

2. Module Functions and Implementation

a) Data Trust, Access Control, and Incentive Mechanisms

Participants must register and provide collateral deposits
Collateral serves as an economic penalty mechanism preventing participants from providing low-quality or misleading data
If participants behave dishonestly, collateral is forfeited and distributed to honest participants
Smart contracts update and allocate total compensation based on participant contributions
The contract ensures that each participant registers only once and that compensation is distributed only when the total compensation is positive
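
The registration and penalty rules above can be illustrated with a minimal in-memory sketch. This is not the paper's contract code: the class name CollateralLedger, the method names register and slash, and the rule that forfeited collateral is split equally among the remaining participants are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class CollateralLedger:
    """Toy stand-in for the contract's registration and penalty logic (hypothetical)."""
    required_deposit: int
    deposits: dict = field(default_factory=dict)  # participant -> locked collateral
    balances: dict = field(default_factory=dict)  # participant -> credited funds

    def register(self, participant: str, deposit: int) -> None:
        # Each participant may register only once and must post enough collateral.
        if participant in self.deposits:
            raise ValueError(f"{participant} is already registered")
        if deposit < self.required_deposit:
            raise ValueError("deposit is below the required collateral")
        self.deposits[participant] = deposit
        self.balances[participant] = 0

    def slash(self, offender: str) -> None:
        # Forfeit the offender's collateral and split it among the others.
        # Assumption: equal split; the source does not state the exact rule.
        forfeited = self.deposits.pop(offender)
        self.balances.pop(offender, None)
        honest = sorted(self.deposits)
        if not honest:
            return  # nobody left to receive the forfeited amount
        share, remainder = divmod(forfeited, len(honest))
        for i, name in enumerate(honest):
            # Integer arithmetic; leftover units go to the first participants.
            self.balances[name] += share + (1 if i < remainder else 0)


ledger = CollateralLedger(required_deposit=100)
for name in ("alice", "bob", "carol"):
    ledger.register(name, 100)
ledger.slash("carol")  # carol is judged dishonest in this example
```

After the call to slash, carol's 100 units are credited to alice and bob in equal parts, while their own deposits stay locked.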

b) IPFS Storage

Uses InterPlanetary File System as a peer-to-peer distributed file system
Models are stored on user devices without requiring centralized storage
Reduces data breach risks and enhances data ownership and control
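
To show why content addressing helps with integrity, here is a small sketch of a content-addressed store. It is only a stand-in: real IPFS identifies content with CIDs built on multihashes and spreads blocks across peers, whereas this hypothetical ContentStore class keeps everything in one dictionary and uses a plain SHA-256 hex digest as the address.

```python
import hashlib


class ContentStore:
    """In-memory stand-in for a content-addressed store such as IPFS (hypothetical)."""

    def __init__(self) -> None:
        self._blocks = {}  # address (hex digest) -> stored bytes

    def put(self, data: bytes) -> str:
        # The address is derived from the content, so identical bytes
        # always map to the same address.
        address = hashlib.sha256(data).hexdigest()
        self._blocks[address] = data
        return address

    def get(self, address: str) -> bytes:
        data = self._blocks[address]
        # Re-hashing on retrieval detects any change to the stored bytes.
        if hashlib.sha256(data).hexdigest() != address:
            raise ValueError("content does not match its address")
        return data


store = ContentStore()
model_bytes = b"serialized model weights"
address = store.put(model_bytes)
retrieved = store.get(address)
```

Only the short address would need to be recorded on-chain; the model bytes themselves stay in the store.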

c) Confidentiality and Privacy Protection

Adopts hybrid encryption scheme:
- Uses symmetric keys (AES) to encrypt actual data/models
- Uses asymmetric keys (RSA) to encrypt symmetric keys
- Ensures only recipients with corresponding private keys can decrypt data
Implements encryption functionality using Python's cryptography library
Implements methods for obtaining, decrypting, and pushing encrypted model states
Optimizes memory usage: maintains hash lists of pushed models, clearing after reaching specified count
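
The hybrid scheme described above can be sketched with Python's cryptography library. This is an illustrative assumption rather than the paper's code: Fernet (which uses AES internally) stands in for the symmetric layer, RSA with OAEP padding wraps the symmetric key, and the function names encrypt_for and decrypt_with are hypothetical.

```python
from cryptography.fernet import Fernet
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import padding, rsa

# Recipient key pair; each worker would hold its own private key.
private_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
public_key = private_key.public_key()

# OAEP padding parameters shared by the wrap and unwrap steps.
oaep = padding.OAEP(
    mgf=padding.MGF1(algorithm=hashes.SHA256()),
    algorithm=hashes.SHA256(),
    label=None,
)


def encrypt_for(recipient_public_key, payload: bytes):
    # 1) Encrypt the bulk payload with a fresh symmetric key.
    symmetric_key = Fernet.generate_key()
    ciphertext = Fernet(symmetric_key).encrypt(payload)
    # 2) Encrypt the short symmetric key with the recipient's RSA public key.
    wrapped_key = recipient_public_key.encrypt(symmetric_key, oaep)
    return wrapped_key, ciphertext


def decrypt_with(recipient_private_key, wrapped_key: bytes, ciphertext: bytes) -> bytes:
    # Only the holder of the private key can recover the symmetric key.
    symmetric_key = recipient_private_key.decrypt(wrapped_key, oaep)
    return Fernet(symmetric_key).decrypt(ciphertext)


model_state = b"serialized model state"
wrapped_key, ciphertext = encrypt_for(public_key, model_state)
recovered = decrypt_with(private_key, wrapped_key, ciphertext)
```

The symmetric cipher handles the large model payload efficiently, while RSA only ever encrypts the short key, which is the usual reason for combining the two.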

d) Smart Contract Functions

Smart contracts include the following key functions:

initializeTask: Requester initializes FL task, sets model URI and rounds, requires deposit
startTask: Requester starts task, status changes to "running"
joinTask: Worker nodes join task, register and obtain model URI
submitScore: Worker nodes submit model scores after evaluation each round
removeWorker: Worker nodes exit task
nextRound: Requester advances to next round
getSubmissions: Requester retrieves all submissions for current round
submitRoundTopK: Obtains top K performing worker nodes
distributeRewards: Distributes rewards to top-performing worker nodes (the top K receive half of the rewards; the remainder is distributed in smaller shares)
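
One possible reading of the distributeRewards rule is sketched below. The exact split is not specified in this summary, so the sketch assumes that the top K workers share half of the round's reward equally and the remaining workers share the other half equally; the function name distribute_rewards and the integer units are illustrative. Under this assumption the remaining workers receive smaller individual shares only when they outnumber the top K.

```python
def distribute_rewards(ranked_workers, top_k, round_reward):
    """Split round_reward: half to the top_k workers, half to everyone else.

    ranked_workers is ordered best-first. Amounts are integers, and any
    indivisible remainder is left undistributed in this sketch.
    """
    if not 0 < top_k <= len(ranked_workers):
        raise ValueError("top_k must be between 1 and the number of workers")
    top, rest = ranked_workers[:top_k], ranked_workers[top_k:]
    # If every worker is in the top K, the whole reward goes to them.
    top_pool = round_reward // 2 if rest else round_reward
    rest_pool = round_reward - top_pool
    payouts = {worker: top_pool // len(top) for worker in top}
    payouts.update({worker: rest_pool // len(rest) for worker in rest})
    return payouts


payouts = distribute_rewards(["w3", "w1", "w4", "w2", "w5"], top_k=2, round_reward=1000)
```

With five workers and K = 2, each top worker receives 250 units and each of the other three receives 166, leaving 2 units undistributed because of integer division.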

3. Workflow

Initialization Phase:
- Requester deploys smart contract, sets training rounds N and total reward D
- Requester pushes initial model to IPFS
- Worker nodes join task through smart contract
Training Phase (N rounds total):
- Each round, worker nodes retrieve all other workers' trained models from IPFS
- Worker nodes evaluate these models on local data, calculate scores
- Scores are submitted to smart contract
- Smart contract aggregates scores, determines top K performing worker nodes
- Allocates rewards based on performance
- Worker nodes train models on local data
- Trained models are pushed to IPFS
- Repeat for N rounds
Completion Phase:
- After training completes, requester retrieves final global model from IPFS
- Calls smart contract function to close task
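
The scoring step of each round can be sketched as follows. The summary does not say how the contract aggregates scores, so this example assumes a simple mean of the scores each worker receives from its peers; the function name select_top_k and the sample scores are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def select_top_k(submissions, k):
    """submissions maps evaluator -> {evaluated worker -> score}."""
    received = defaultdict(list)
    for evaluator, scores in submissions.items():
        for worker, score in scores.items():
            if worker != evaluator:  # ignore any self-assessment
                received[worker].append(score)
    averages = {worker: mean(scores) for worker, scores in received.items()}
    # Sort by average score (descending), breaking ties by worker name.
    ranking = sorted(averages, key=lambda w: (-averages[w], w))
    return ranking[:k], averages


# Scores each worker assigns to the other workers' models in one round.
round_submissions = {
    "w1": {"w2": 0.90, "w3": 0.40},
    "w2": {"w1": 0.95, "w3": 0.50},
    "w3": {"w1": 0.93, "w2": 0.92},
}
top_workers, average_scores = select_top_k(round_submissions, k=2)
```

In this example w3 receives low scores from both peers and falls outside the top two, which is the signal the reward and penalty logic would act on.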

4. Aggregation/Averaging Method

Worker nodes retrieve their own models and other workers' models from IPFS storage
Use averaging function to sum all models and divide by number of contributing worker nodes
Obtain averaged model to improve accuracy
This method avoids extensive communication between central server and clients in centralized FedAvg, reducing channel congestion and privacy attack risks
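
The averaging step amounts to an element-wise mean over the workers' parameters. The sketch below represents each model as a dictionary of plain Python lists to stay dependency-free; in a PyTorch setting the same idea would apply to each tensor in a model's state dictionary. The function name average_models is an assumption.

```python
def average_models(models):
    """Element-wise mean of models given as {parameter name: list of floats}."""
    if not models:
        raise ValueError("at least one model is required")
    count = len(models)
    averaged = {}
    for name in models[0]:
        # Group the i-th value of this parameter across all models.
        columns = zip(*(model[name] for model in models))
        averaged[name] = [sum(values) / count for values in columns]
    return averaged


worker_models = [
    {"weight": [1.0, 2.0], "bias": [0.0]},
    {"weight": [3.0, 4.0], "bias": [3.0]},
    {"weight": [5.0, 6.0], "bias": [6.0]},
]
averaged_model = average_models(worker_models)
```

Because every worker can compute this average locally from the models it retrieves, no central server is needed for aggregation.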

Technical Innovations

1. Differences from Baseline Methods

Decentralized Architecture: Does not rely on central servers, avoiding single point of failure and privacy attacks
Economic Incentive Mechanisms: Incentivizes honest behavior and penalizes malicious behavior through collateral and reward systems
Dual Encryption: Combines AES and RSA, controlling overhead to 2% while ensuring security
Blockchain + IPFS: Leverages blockchain's immutability and IPFS's decentralized storage

2. Design Rationale

Collateral Mechanism: Constrains participant behavior through economic means, which can be a stronger deterrent than purely technical approaches
Multi-dimensional Performance Evaluation: Considers accuracy, consistency, precision, and recall, comprehensively assessing worker contributions
Hybrid Encryption: Symmetric encryption is efficient (suitable for large data) and asymmetric encryption is secure (suitable for key exchange); combining the two balances efficiency and security
IPFS Storage: Fits the decentralized architecture naturally, and its content-addressing mechanism supports data integrity checks

Experimental Setup

Dataset

Dataset Name: MNIST handwritten digit dataset
Data Scale:
- Training set: 60,000 images
- Test set: 10,000 images
Task: Classify handwritten digits 0-9
Data Distribution: Training set uniformly distributed to worker nodes at training start
Evaluation: Each worker node uses test set for evaluation and scoring
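
The effect of the worker count on per-worker data volume can be made concrete with a short sketch. It assumes that "uniformly distributed" means near-equal shard sizes and uses contiguous index ranges for simplicity; an actual setup would likely shuffle the indices first. The function name partition_indices is hypothetical.

```python
def partition_indices(num_samples, num_workers):
    """Split sample indices 0..num_samples-1 into contiguous, near-equal shards."""
    base, extra = divmod(num_samples, num_workers)
    shards, start = [], 0
    for worker in range(num_workers):
        # The first `extra` workers take one additional sample each.
        size = base + (1 if worker < extra else 0)
        shards.append(range(start, start + size))
        start += size
    return shards


shards_3 = partition_indices(60_000, 3)  # 3 worker nodes
shards_5 = partition_indices(60_000, 5)  # 5 worker nodes
```

With the 60,000-image training set, three workers hold 20,000 images each and five workers hold 12,000 each.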

Evaluation Metrics

Accuracy: Percentage of correct classifications
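Precision: Fraction of predictions for a class that are correct (reported value: 0.973)
Recall: Fraction of true instances of a class that are correctly identified (reported value: 0.97)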
Convergence Time: Time required for model to reach target accuracy
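
For a multi-class task such as digit classification, precision and recall must be averaged over classes in some way. The summary does not state which averaging the paper uses, so the sketch below assumes macro averaging (an unweighted mean over classes); the function name and the toy labels are illustrative only.

```python
def macro_precision_recall(y_true, y_pred, num_classes):
    """Macro-averaged precision and recall over classes 0..num_classes-1."""
    precisions, recalls = [], []
    for c in range(num_classes):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == c and p == c)  # true positives
        fp = sum(1 for t, p in pairs if t != c and p == c)  # false positives
        fn = sum(1 for t, p in pairs if t == c and p != c)  # false negatives
        precisions.append(tp / (tp + fp) if tp + fp else 0.0)
        recalls.append(tp / (tp + fn) if tp + fn else 0.0)
    return sum(precisions) / num_classes, sum(recalls) / num_classes


y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
precision, recall = macro_precision_recall(y_true, y_pred, num_classes=3)
```

On this toy input, four of six predictions are correct, and the per-class precisions (1/2, 2/3, 1) and recalls (1/2, 1, 1/2) are averaged with equal weight per class.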

Comparison Methods

Encrypted vs Unencrypted: Compare dual encryption's impact on convergence time
Different Worker Node Counts: 3 worker nodes vs 5 worker nodes

Implementation Details

Model: Simple feedforward neural network (CNN), N layers
Framework: PyTorch
Blockchain: Ethereum blockchain
Simulation Environment: Ganache (local Ethereum blockchain testing environment)
Hardware: Xeon CPU, 8 cores
Training Method: Decentralized client-server system implemented on local machine, executed sequentially (can also be parallel)
Maximum Rounds: 90 epochs

Experimental Results

Main Results

1. Performance Analysis

Accuracy: Achieves over 95% accuracy within 90 epochs
Precision: 0.973
Recall: 0.97
Total Training Time (3 worker nodes): 6525.46 seconds
Convergence Time per Worker Node: Approximately 36 minutes
Conclusion: Convergence time is comparable to decentralized federated learning frameworks

2. Encryption Overhead Analysis

Dual Encryption Additional Overhead:
- Total for all 3 worker nodes: 2 minutes 34 seconds
- Per worker node: 51 seconds
- Communication Cost Proportion: Only 2% of convergence time
Conclusion: Overhead from dual encryption/decryption and secure key pair transmission protocol is minimal, acceptable while maintaining same accuracy
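
The reported figures can be cross-checked with simple arithmetic. The sketch below assumes that the 6525.46-second total for three worker nodes is the baseline against which the encryption overhead is measured; the variable names are illustrative.

```python
total_training_s = 6525.46        # total training time for 3 worker nodes
num_workers = 3
encryption_total_s = 2 * 60 + 34  # 2 minutes 34 seconds across all workers

per_worker_training_s = total_training_s / num_workers
per_worker_encryption_s = encryption_total_s / num_workers
overhead_ratio = encryption_total_s / total_training_s

print(f"training per worker:   {per_worker_training_s / 60:.1f} min")
print(f"encryption per worker: {per_worker_encryption_s:.0f} s")
print(f"overhead:              {overhead_ratio:.1%}")
```

The per-worker values agree with the reported figures of roughly 36 minutes and 51 seconds, and the ratio comes out slightly above 2%, consistent with the rounded value quoted above.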

3. Worker Node Count Comparison

3 Worker Nodes:
- More stable accuracy pattern
- Reason: Each worker node possesses more training data
5 Worker Nodes:
- Achieves acceptable accuracy within similar epoch count
- Can accelerate training process, expand training scale
- Reduces computational requirements per worker node, enabling low-end devices as compute nodes
Conclusion:
- Increasing worker node count does not negatively impact model convergence
- Worker node count should be selected based on training dataset proportion
- In practical scenarios, increasing training dataset improves multi-worker model stability

Ablation Studies

The paper primarily conducted ablation studies on encryption overhead:

Compared convergence time with and without dual encryption
Demonstrated encryption mechanism adds only 2% overhead, validating design efficiency

Case Analysis

The paper demonstrates accuracy evolution during training:

All three worker nodes start with low initial accuracy
Accuracy significantly improves in first round (3 epochs)
Subsequently worker nodes train sequentially, accuracy steadily improves
Finally all worker nodes achieve over 95% accuracy

Experimental Findings

Decentralized Architecture Feasibility: Experiments prove decentralized federated learning achieves performance comparable to centralized methods
Controllable Encryption Overhead: Dual encryption scheme adds only 2% time overhead, demonstrating good balance between security and efficiency
Scalability: Increasing worker node count does not harm model performance, instead accelerating training and reducing individual node computational requirements
Data Distribution Importance: Worker node count should match training dataset scale to maintain training stability

Related Work

1. Federated Learning Domain

FedAvg and Variants:
- FedAvg [2]: Foundational federated averaging algorithm
- Momentum Methods [6]: For local client training
- Adaptive FedAvg [7]: Employs adaptive learning rates
- Lazy and Quantized Gradients [8]: Reduce communication
- Newton-type Schemes [9]: FedDANE
Decentralized Gradient Descent:
- DGD and Variants [10-13]
- DSGD [14]: Decentralized stochastic gradient descent
- Asynchronous DSGD [15]
- Quantized DSGD [16]

2. Blockchain + Federated Learning

Smart Healthcare [18]: Privacy-preserving architecture using blockchain and federated learning
Vehicular Networks [19]: Blockchain-based federated learning scheme with reputation-based incentive mechanisms

3. Advantages of This Work

Comprehensive Framework: Integrates incentive mechanisms, penalty mechanisms, access control, and privacy protection
Efficient Encryption: Dual encryption scheme with only 2% overhead
Practicality: Validates effectiveness on real datasets
Economic Incentives: Innovatively introduces collateral mechanism, constraining participant behavior from economic perspective

Conclusions and Discussion

Main Conclusions

The proposed decentralized federated learning architecture successfully integrates blockchain, smart contracts, and IPFS, achieving secure and efficient global model training
Experimental results demonstrate the framework achieves over 95% accuracy within 90 epochs, with convergence time comparable to centralized federated learning frameworks
The dual encryption scheme adds minimal 2% overhead, demonstrating good balance between security and efficiency
The method effectively addresses multiple data management and sharing challenges by establishing trust among stakeholders, promoting reciprocal data sharing, and preventing behaviors that could compromise data security and accuracy

Limitations

Experimental Scale: Only sequential execution testing on local machine, not validated in large-scale distributed environments
Single Dataset: Only MNIST dataset used, lacking validation on more complex datasets and tasks
Blockchain Costs: Lacks detailed analysis of blockchain transaction costs and scalability issues
Malicious Behavior Detection: Collateral mechanism relies on accurate performance evaluation, but lacks in-depth discussion of detecting complex malicious behaviors (e.g., model poisoning attacks)
Worker Node Selection: Does not discuss dynamic worker node selection and management, or handling dynamic node joining and leaving
Practical Deployment Challenges: Does not address network latency, node heterogeneity, and other real-world deployment issues

Future Directions

Future research directions explicitly proposed in the paper:

Scalability Research: Explore scalability in real-world scenarios
Feasibility Validation: Validate model feasibility in practical applications

Other potential directions:

Test framework on more complex datasets and tasks
Research advanced malicious behavior detection and defense mechanisms
Optimize blockchain transaction costs and throughput
Develop dynamic worker node management mechanisms
Study performance under heterogeneous devices and network conditions

In-Depth Evaluation

Strengths

1. Method Innovation

Multi-technology Integration: Innovatively integrates blockchain, smart contracts, IPFS, and cryptographic techniques into federated learning, forming a complete ecosystem
Economic Incentive Mechanisms: Collateral and reward systems constrain participant behavior from economic perspective, effectively complementing technical approaches
Hybrid Encryption Scheme: AES+RSA combination balances efficiency and security

2. Experimental Sufficiency

Provides multi-dimensional evaluation including accuracy, precision, and recall
Compares encrypted and unencrypted performance differences
Tests impact of different worker node counts
Provides concrete timing and performance data

3. Persuasiveness of Results

Over 95% accuracy demonstrates method effectiveness
2% encryption overhead demonstrates scheme practicality
Convergence time comparable to existing methods demonstrates competitiveness

4. Writing Clarity

Clear architecture design with detailed process descriptions
Provides system architecture diagrams and experimental result figures
Complete smart contract function descriptions

Weaknesses

1. Method Limitations

Insufficient Malicious Behavior Detection: Primarily relies on performance evaluation, lacking defense against advanced attacks like gradient attacks and model poisoning
Collateral Amount Setting: Does not discuss how to determine reasonable collateral amounts
Byzantine Fault Tolerance: Does not clearly specify how many malicious nodes the system can tolerate

2. Experimental Setup Defects

Overly Simple Dataset: MNIST is a classic but simple benchmark and does not reflect more complex scenarios
Lack of Real Environment Testing: Only sequential execution on local machine, not tested in real distributed environments
Lack of Comparative Experiments: No direct comparison with other blockchain+federated learning schemes
Unanalyzed Blockchain Costs: Does not provide key metrics like Gas fees and transaction latency

3. Insufficient Analysis

Missing Scalability Analysis: Does not discuss performance when worker node count increases significantly
Network Condition Impact: Does not consider performance under different network conditions
Heterogeneity Handling: Does not discuss device heterogeneity and data heterogeneity impacts
Insufficient Theoretical Analysis: Lacks convergence proofs and theoretical guarantees

Impact

1. Contribution to Field

Comprehensive Solution: Provides complete framework integrating multiple technologies, serving as reference for subsequent research
Practice-Oriented: Focuses on incentive mechanisms and malicious behavior penalties, more aligned with practical application needs
Pioneering Work: Offers a useful early exploration of combining blockchain with federated learning

2. Practical Value

Privacy Protection: Applicable to privacy-sensitive domains like healthcare and finance
Decentralization: Suitable for scenarios distrusting central servers
Incentive Mechanisms: Can promote data sharing and collaboration
Remaining Deployment Challenges: Blockchain costs and scalability issues still need to be resolved before real-world deployment

3. Reproducibility

Strengths:
- Detailed system architecture and workflow descriptions
- Smart contract function explanations provided
- Technology stack specified (PyTorch, Ethereum, Ganache, etc.)
Weaknesses:
- Code not open-sourced
- Lacks detailed hyperparameter settings
- Does not provide complete smart contract code

Applicable Scenarios

1. Highly Applicable Scenarios

Medical Data Collaboration: Multiple hospitals jointly train models, protecting patient privacy
Financial Risk Control: Multiple banks share data features without exposing raw data
Federated Recommendation Systems: Multiple platforms collaborate to improve recommendation algorithms
Edge Computing: IoT devices collaboratively train models

2. Applicable Conditions

Participants lack mutual trust, unwilling to use central servers
High data privacy requirements, cannot centralize storage
Need incentive mechanisms to promote data sharing
Can accept certain blockchain transaction costs

3. Less Applicable Scenarios

Applications requiring extreme real-time performance (blockchain transactions have latency)
Scenarios with extremely large participant numbers (scalability limitations)
Devices with extremely limited computational resources (encryption and blockchain operations have overhead)
Scenarios with existing trustworthy central servers (reduced necessity for decentralization)

References

The paper cites 21 important references, key ones including:

Delacroix & Lawrence (2019): Foundational approaches to data trust
McMahan et al. (2017): Original FedAvg algorithm paper
Sun et al. (2022): Latest advances in decentralized federated averaging
Singh et al. (2022): Blockchain and federated learning applications in IoT healthcare
Wang et al. (2022): Privacy-preserving federated learning for vehicular networks based on blockchain
Shrestha et al. (2020, 2021): Blockchain platforms for user data sharing and incentive mechanism design

Summary

This paper proposes an innovative blockchain federated learning framework addressing trust, incentive, and privacy challenges in decentralized machine learning by integrating multiple technologies (blockchain, smart contracts, IPFS, hybrid encryption). Experiments validate method effectiveness, but practical deployment, scalability, and complex attack defense require further research. This work provides valuable insights for privacy-preserving collaborative machine learning, with particular application potential in sensitive domains like healthcare and finance.