CaReTS: A Multi-Task Framework Unifying Classification and Regression for Time Series Forecasting

Yao, Zhao, Zheng et al.

Recent advances in deep forecasting models have achieved remarkable performance, yet most approaches still struggle to provide both accurate predictions and interpretable insights into temporal dynamics. This paper proposes CaReTS, a novel multi-task learning framework that combines classification and regression tasks for multi-step time series forecasting problems. The framework adopts a dual-stream architecture, where a classification branch learns the stepwise trend into the future, while a regression branch estimates the corresponding deviations from the latest observation of the target variable. The dual-stream design provides more interpretable predictions by disentangling macro-level trends from micro-level deviations in the target variable. To enable effective learning in output prediction, deviation estimation, and trend classification, we design a multi-task loss with uncertainty-aware weighting to adaptively balance the contribution of each task. Furthermore, four variants (CaReTS1--4) are instantiated under this framework to incorporate mainstream temporal modelling encoders, including convolutional neural networks (CNNs), long short-term memory networks (LSTMs), and Transformers. Experiments on real-world datasets demonstrate that CaReTS outperforms state-of-the-art (SOTA) algorithms in forecasting accuracy, while achieving higher trend classification performance.

Basic Information

Paper ID: 2511.09789
Title: CaReTS: A Multi-Task Framework Unifying Classification and Regression for Time Series Forecasting
Authors: Fulong Yao (Cardiff University), Wanqing Zhao (Newcastle University), Chao Zheng (Newcastle University), Xiaofei Han (University of Leeds)
Category: cs.LG (Machine Learning)
Publication Date: November 12, 2025 (arXiv preprint)
Paper Link: https://arxiv.org/abs/2511.09789

Abstract

Deep learning has achieved significant progress in time series forecasting, yet existing methods often struggle to provide interpretable insights into temporal dynamics while delivering accurate predictions. This paper proposes CaReTS, a multi-task learning framework combining classification and regression tasks for multi-step time series forecasting. The framework employs a dual-stream architecture: the classification branch learns stepwise future trends, while the regression branch estimates deviations relative to the most recent observation. This design provides more interpretable forecasts by decoupling macroscopic trends from microscopic deviations. To enable effective learning, an uncertainty-aware multi-task loss function is designed to adaptively balance task contributions. The paper instantiates four variants (CaReTS1-4) combined with mainstream temporal encoding architectures (CNN, LSTM, Transformer). Experiments demonstrate that CaReTS surpasses state-of-the-art algorithms in both prediction accuracy and trend classification performance.

Research Background and Motivation

1. Problem Statement

Time series forecasting is a fundamental problem in energy management, financial analysis, medical monitoring, and climate modeling. Multi-step forecasting is particularly critical but faces two major challenges:

Accuracy Degradation: Prediction precision typically decreases as the forecasting horizon increases
Insufficient Interpretability: In high-risk scenarios, model opacity reduces trustworthiness

2. Problem Significance

Multi-step forecasting is crucial for capturing both short-term and long-term temporal dynamics of systems, enabling informed decision-making. However, while existing deep learning models have improved accuracy, they remain significantly deficient in interpretability, limiting their reliability in practical applications.

3. Limitations of Existing Methods

Single Regression Paradigm: Most deep forecasting models formulate prediction as a single regression task, focusing solely on numerical prediction
Coupled Trends and Deviations: Difficulty in decoupling macroscopic trends (e.g., upward/downward trajectories) from microscopic deviations
Lack of Explicit Trend Modeling: While models like Autoformer and FEDformer introduce decomposition mechanisms, they primarily operate at input or representation layers rather than explicitly separating trends and magnitudes at the output layer

4. Research Motivation

The core insight of this work is that decomposing time series forecasting into two complementary tasks—trend classification (direction) and deviation regression (magnitude)—can simultaneously enhance both prediction accuracy and interpretability. This output-layer decoupling provides a novel multi-task learning perspective.

Core Contributions

Dual-Stream Architecture Design: Proposes the CaReTS framework with a dual-stream architecture where the classification branch predicts stepwise macroscopic trends and the regression branch estimates fine-grained deviations relative to the most recent observation
Uncertainty-Aware Multi-Task Learning: Designs an uncertainty-based multi-task loss function that jointly optimizes the classification and regression tasks through adaptive weighting, removing the need to hand-tune the task weights
Framework Generality: Instantiates four variants (CaReTS1-4) compatible with mainstream temporal encoders (CNN, LSTM, Transformer), demonstrating broad applicability
Performance Enhancement and Interpretability Improvement: Achieves state-of-the-art prediction accuracy on real datasets with trend classification accuracy exceeding 91% and manageable computational overhead

Method Details

Task Definition

Input: Time series $\mathbf{x} = \{x_1, x_2, \ldots, x_n\}$ , where $x_n$ is the most recent observation of the target variable
Output: K-step ahead forecasts $\hat{\mathbf{y}} = \{\hat{y}^{(1)}, \hat{y}^{(2)}, \ldots, \hat{y}^{(K)}\}$
Core Idea: Decompose each step's prediction into trend direction $d^{(k)}$ and deviation magnitude $\delta^{(k)}$
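
The decomposition can be illustrated with a minimal sketch. The function names are hypothetical, and the tie rule (a step with no change counts as "up") is an assumption, since the summary does not say how zero deviations are labelled.

```python
def decompose_targets(x_n, y_true):
    """Split each future value into a trend label and a non-negative deviation.

    x_n is the most recent observation; y_true holds the K future values.
    """
    directions, deviations = [], []
    for y_k in y_true:
        diff = y_k - x_n
        # Assumption: zero change is labelled +1 (up); the source does not specify this.
        directions.append(1 if diff >= 0 else -1)
        deviations.append(abs(diff))
    return directions, deviations


def reconstruct(x_n, directions, deviations):
    """Recombine trend labels and deviations into forecasts: x_n + d * delta."""
    return [x_n + d * delta for d, delta in zip(directions, deviations)]
```

Applying `reconstruct` to the output of `decompose_targets` recovers the original targets, which is what makes the two sub-tasks a lossless re-expression of the regression target.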

Model Architecture

1. Two Dual-Stream Architectures

Architecture (a): Parallel Dual-Stream

Temporal encoder (CNN/LSTM/Transformer) extracts temporal features
Features are fed in parallel to two independent fully-connected streams:
- Classification Stream: Predicts stepwise trends (up/down)
- Regression Stream: Estimates deviations relative to $x_n$
Residual Fusion: $\hat{y}^{(k)} = x_n + \text{Fusion}(\hat{d}^{(k)}, \hat{\delta}^{(k)})$

Architecture (b): Sequential Dual-Stream

First infers trends through the classification stream
Concatenates classification output with original temporal features
Feeds into regression stream for deviation estimation
Direct fusion: $\hat{y}^{(k)} = x_n + \hat{\delta}^{(k)}$

2. Four Model Variants

Model	Architecture	Trend Representation	Deviation Representation	Fusion Method
CaReTS1	(a)	Binary label $\hat{d}^{(k)} \in \{+1,-1\}$	Single non-negative deviation $\hat{\delta}^{(k)}$	$\hat{y}^{(k)} = x_n + \hat{d}^{(k)} \cdot \hat{\delta}^{(k)}$
CaReTS2	(a)	Binary label $\hat{d}^{(k)} \in \{+1,-1\}$	Direction-specific deviations $(\hat{\delta}^{(k)}_{up}, \hat{\delta}^{(k)}_{down})$	Select corresponding deviation by trend
CaReTS3	(a)	Probabilities $(p^{(k)}_{up}, p^{(k)}_{down})$	Direction-specific deviations $(\hat{\delta}^{(k)}_{up}, \hat{\delta}^{(k)}_{down})$	$\hat{y}^{(k)} = x_n + p^{(k)}_{up}\hat{\delta}^{(k)}_{up} - p^{(k)}_{down}\hat{\delta}^{(k)}_{down}$
CaReTS4	(b)	Probability $p^{(k)}$	Signed deviation $\hat{\delta}^{(k)}$	$\hat{y}^{(k)} = x_n + \hat{\delta}^{(k)}$
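
The four fusion rules in the table can be written out as plain functions for a single forecast step. This is a sketch under assumptions: the function names are hypothetical, and for CaReTS2 the "down" deviation is assumed to be a non-negative magnitude that is subtracted, by analogy with the CaReTS3 formula.

```python
def fuse_carets1(x_n, d_hat, delta_hat):
    """CaReTS1: binary label (+1 or -1) times a single non-negative deviation."""
    return x_n + d_hat * delta_hat


def fuse_carets2(x_n, d_hat, delta_up, delta_down):
    """CaReTS2: hard selection of the direction-specific deviation.

    Assumption: delta_down is a non-negative magnitude that is subtracted.
    """
    return x_n + delta_up if d_hat > 0 else x_n - delta_down


def fuse_carets3(x_n, p_up, p_down, delta_up, delta_down):
    """CaReTS3: probability-weighted (soft) fusion of both deviations."""
    return x_n + p_up * delta_up - p_down * delta_down


def fuse_carets4(x_n, delta_signed):
    """CaReTS4: the regression stream outputs a signed deviation directly."""
    return x_n + delta_signed
```

With `p_up = 1` and `p_down = 0`, the CaReTS3 rule reduces to the "up" branch of CaReTS2, so the soft fusion can be read as a relaxation of the hard selection.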

Multi-Task Loss Function

Loss Function for Architecture (a)

$L^{(a)} = \alpha_{ca}L_{ca} + \alpha_{de}L_{de} + \alpha_{op}L_{op}$

Where:

$L_{ca}$ : Trend classification loss (binary or categorical cross-entropy)
$L_{de}$ : Deviation estimation loss (MSE)
$L_{op}$ : Output prediction loss (MSE)

Loss Function for Architecture (b)

$L^{(b)} = \alpha_{ca}L_{ca} + \alpha_{op}L_{op}$

Uncertainty-Aware Weighting

Core innovation: Model task weights as learnable parameters, adaptively adjusted based on prediction uncertainty:

$\alpha_i = \frac{1}{2\sigma_i^2}, \quad i \in \{ca, de, op\}$

Implementation uses log-variance $\log \sigma_i^2$ as learnable parameters, with final loss:

$L^{(a)} = \sum_{i \in \{ca,de,op\}} \left(\frac{1}{2}e^{-\log \sigma_i^2}L_i + \frac{1}{2}\log \sigma_i^2\right)$

Stabilization Strategies:

Soft regularization: Add penalty terms to log-variance parameters
Value range constraint: Restrict $\log \sigma_i^2$ to $[-10, 10]$
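
The weighted loss and the range constraint can be sketched in plain Python as follows. This only evaluates the formula for given values: in an actual model the log-variances would be trainable parameters updated by the optimizer. The function name and the dictionary interface are assumptions, and the soft regularization penalty is omitted because its exact form is not given here.

```python
import math

LOG_VAR_MIN, LOG_VAR_MAX = -10.0, 10.0  # value range constraint on log(sigma_i^2)


def uncertainty_weighted_loss(task_losses, log_vars):
    """Sum over tasks of 0.5 * exp(-s_i) * L_i + 0.5 * s_i, with s_i = log(sigma_i^2).

    task_losses and log_vars are dicts keyed by task name (e.g. "ca", "de", "op").
    """
    total = 0.0
    for name, loss in task_losses.items():
        # Clamp the log-variance to the allowed range before using it.
        s = min(max(log_vars[name], LOG_VAR_MIN), LOG_VAR_MAX)
        total += 0.5 * math.exp(-s) * loss + 0.5 * s
    return total


# Architecture (a) uses three tasks; architecture (b) would pass only "ca" and "op".
example = uncertainty_weighted_loss(
    {"ca": 0.4, "de": 0.2, "op": 0.6},
    {"ca": 0.0, "de": 0.0, "op": 0.0},
)
```

For a fixed task loss $L_i$, setting the derivative with respect to $\log \sigma_i^2$ to zero gives $\sigma_i^2 = L_i$, so a task with a larger loss receives a smaller weight, while the $\frac{1}{2}\log \sigma_i^2$ term keeps the variance from growing without bound.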

Technical Innovations

Output-Layer Decoupling: Unlike Autoformer and similar models, which decompose at the input or representation layers, CaReTS explicitly separates trends and deviations at the output layer, providing more direct interpretability
Soft Fusion Mechanism (CaReTS3): Fuses deviations from both directions via probability weighting, enabling smooth transitions when trend uncertainty is high
Adaptive Task Balancing: Uncertainty-based weight learning removes the need to hand-tune the task weights, allowing the model to automatically focus on more reliable tasks
Progressive Complexity Design: From CaReTS1 to CaReTS4, gradually increases modeling capacity, systematically exploring the design space

Experimental Setup

Datasets

Two real-world time series forecasting tasks:

Electricity Price Forecasting: 8,784 hourly observations (one year)
Unmet Electricity Demand Forecasting: 8,784 hourly observations

Forecasting Configuration: 15-to-6 scheme (15 input features, 6 output steps)

Input: Month, day-of-week, hour of current timestep + past 12 observations of target variable
Output: Next 6 steps of target variable forecasts
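
One way to build such 15-to-6 samples from an hourly series is sketched below. The function name is hypothetical, and taking the calendar features from the timestamp of the most recent observation is an assumption about what "current timestep" means.

```python
from datetime import datetime, timedelta


def build_samples(timestamps, values, n_lags=12, horizon=6):
    """Return (features, targets) pairs: 3 calendar features + n_lags past values."""
    samples = []
    for end in range(n_lags, len(values) - horizon + 1):
        # Assumption: calendar features come from the last observed timestep.
        t = timestamps[end - 1]
        calendar = [t.month, t.weekday(), t.hour]
        lags = list(values[end - n_lags:end])        # past 12 observations
        targets = list(values[end:end + horizon])    # next 6 steps
        samples.append((calendar + lags, targets))
    return samples


# Tiny illustrative series: 24 hourly points with made-up values 0..23.
start = datetime(2024, 1, 1, 0)
stamps = [start + timedelta(hours=h) for h in range(24)]
series = [float(h) for h in range(24)]
samples = build_samples(stamps, series)
```

Each feature vector has 3 + 12 = 15 entries, which matches the "15" in the scheme name.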

Data Split:

Training set: 6,048 points
Test set: 2,736 points
Evaluation method: 10-fold cross-validation

Evaluation Metrics

RMSE (Root Mean Square Error): Measures prediction accuracy
Trend Classification Accuracy: Measures correctness of trend direction prediction
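
Both metrics are simple to compute; the sketch below uses hypothetical function names. Deriving a trend label from a numeric forecast by comparing it with the most recent observation, with ties counted as "up", is an assumption about how direction would be scored for models that have no classification branch.

```python
import math


def rmse(y_true, y_pred):
    """Root mean square error over two equal-length flat lists."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def directions_from_values(x_n, values):
    """Trend label per step: +1 if the value is at or above x_n, else -1 (assumed tie rule)."""
    return [1 if v >= x_n else -1 for v in values]


def trend_accuracy(true_dirs, pred_dirs):
    """Fraction of steps where the predicted direction matches the true one."""
    hits = sum(1 for t, p in zip(true_dirs, pred_dirs) if t == p)
    return hits / len(true_dirs)
```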

Comparison Methods

Design Baselines (3):

Baseline1: Traditional encoder-decoder architecture
Baseline2: Simplified version without residual connections
Baseline3: Single FC layer replacing fusion module

SOTA Algorithms (10):

Transformer series: Autoformer, FEDformer, Non-stationary Transformer, Informer
Hybrid models: TimesNet, TimeXer, D-CNN-LSTM
Lightweight models: DLinear, NLinear, TimeMixer
Fuzzy neural network: SOIT2FNN-MO

Implementation Details

Platform: Google Colab with T4 GPU
Encoder: 2 layers, 64 hidden units
- CNN: Kernel size 3, padding 1
- Transformer: 4 attention heads
Classification/Regression Branches: 2-layer FC, 64 hidden units
Optimizer: Adam, learning rate 0.001
Batch Size: 64
Training Epochs: Up to 600, early stopping (50 epochs without improvement)
Activation Function: ReLU
Normalization: Min-Max normalization
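
The stopping rule in the settings above (at most 600 epochs, stop after 50 epochs without improvement) can be sketched as follows. The function name is hypothetical, and which loss is monitored (for example a validation loss) is an assumption, since the summary does not say.

```python
def early_stopping_epoch(monitored_losses, max_epochs=600, patience=50):
    """Return (best_epoch, stop_epoch) for a list of per-epoch monitored losses.

    Training stops once `patience` consecutive epochs bring no improvement,
    or when `max_epochs` is reached. Epochs are 0-indexed.
    """
    best_loss = float("inf")
    best_epoch = stop_epoch = -1
    for epoch, loss in enumerate(monitored_losses[:max_epochs]):
        stop_epoch = epoch
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            break
    return best_epoch, stop_epoch
```

In practice the model weights from `best_epoch` would be restored before evaluation; that step is left out of this sketch.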

Experimental Results

Main Results

1. Architecture Evaluation (Table 2)

Unmet Electricity Demand Forecasting (Test Set RMSE):

Best: CaReTS2-Transformer (0.0691 ± 0.0018)
Second best: CaReTS3-CNN (0.0692 ± 0.0010)
All CaReTS2-4 variants outperform baselines

Electricity Price Forecasting (Test Set RMSE):

Best: CaReTS2-Transformer (0.0465 ± 0.0012)
CaReTS1-4 outperform baselines across all encoder configurations (except CaReTS1-LSTM)

Key Findings:

CaReTS2 shows most consistent performance, best in 4 of 6 configurations, second best in 2
Transformer encoder generally outperforms CNN and LSTM
CaReTS1 shows less advantage due to simplified deviation branch

2. Trend Classification Performance (Table 3)

All variants achieve >90% accuracy:

Unmet electricity: CaReTS2-Transformer highest (0.9192 ± 0.0022)
Electricity price: CaReTS2-Transformer highest (0.9146 ± 0.0019)

Cross-Step Analysis (Figure 5):

Trend classification accuracy remains stable across 6-step forecasting, even slightly improving
This contrasts with RMSE, which grows with the step index, and suggests that trend prediction stays consistent across the tested 6-step horizon

Ablation Studies

Multi-Task vs. Single-Task Learning (Table 4)

Using Transformer encoder as example:

Unmet Electricity:

CaReTS2 multi-task: RMSE 0.0691, trend accuracy 0.9192
CaReTS2 single-task: RMSE 0.0704, trend accuracy 0.9060
Improvement: RMSE reduced by 1.8%, trend accuracy improved by 1.3 percentage points (about 1.5% in relative terms)

Electricity Price:

CaReTS1 multi-task: RMSE 0.0473, trend accuracy 0.9142
CaReTS1 single-task: RMSE 0.0539, trend accuracy 0.8663
Improvement: RMSE reduced by 12.2%, trend accuracy improved by 5.5% in relative terms (4.8 percentage points)
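
The improvement figures can be reproduced from the reported Transformer-encoder numbers with a few lines of arithmetic. The helper names are hypothetical. Accuracy gains are shown both in percentage points and in relative terms, since the two differ noticeably.

```python
def relative_reduction_pct(baseline, improved):
    """Percentage by which `improved` is lower than `baseline`."""
    return (baseline - improved) / baseline * 100.0


def accuracy_gain(baseline_acc, improved_acc):
    """Return (percentage points, relative percent) gained over the baseline accuracy."""
    diff = improved_acc - baseline_acc
    return diff * 100.0, diff / baseline_acc * 100.0


# Unmet electricity demand, CaReTS2: single-task vs. multi-task.
unmet_rmse = relative_reduction_pct(0.0704, 0.0691)   # about 1.8
unmet_acc = accuracy_gain(0.9060, 0.9192)             # about (1.3, 1.5)

# Electricity price, CaReTS1: single-task vs. multi-task.
price_rmse = relative_reduction_pct(0.0539, 0.0473)   # about 12.2
price_acc = accuracy_gain(0.8663, 0.9142)             # about (4.8, 5.5)
```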

Computational Overhead:

Additional parameters: only 3 task weight scalars
Runtime increase is modest (253-401 seconds multi-task vs. 216-386 seconds single-task)

SOTA Comparison (Table 5)

Unmet Electricity:

CaReTS2: RMSE 0.0691, trend accuracy 0.9192
TimeXer (second-best SOTA): RMSE 0.0700, trend accuracy 0.9066
Advantage: RMSE reduced by 1.3%, trend accuracy improved by 1.4% in relative terms (1.3 percentage points)

Electricity Price:

CaReTS2: RMSE 0.0465, trend accuracy 0.9146
TimeXer (best SOTA): RMSE 0.0463, trend accuracy 0.9013
Advantage: RMSE is slightly higher, by 0.4%, but trend accuracy is higher by 1.5% in relative terms (1.3 percentage points)

Efficiency Comparison:

CaReTS runtime: 200-400 seconds
Lightweight models (DLinear/NLinear): <70 seconds
Heavy models (Autoformer/TimeXer): >460 seconds
Conclusion: CaReTS achieves good balance between accuracy and efficiency

Extended Experiments (Appendix A.6)

Under 15-to-4 and 15-to-8 forecasting configurations:

CaReTS2 consistently ranks in top three for both RMSE and trend accuracy
Validates framework stability across different forecasting horizons

Experimental Findings

Trend Stability: Trend classification accuracy does not decrease across the 6 forecast steps, indicating that macroscopic trend modeling is robust over the tested horizon
Complementary Learning: Multi-task learning promotes complementary learning rather than task interference; joint optimization outperforms single-task training
Encoder Compatibility: The framework works well with different encoders, and the Transformer generally performs best
Direction-Specific Modeling: CaReTS2's direction-specific deviation design captures asymmetric dynamics, outperforming single deviation (CaReTS1)
Soft Fusion Advantage: CaReTS3's probability weighting provides smooth transitions when trend uncertainty is high

Related Work

1. Deep Time Series Forecasting

CNN Methods: Extract local spatiotemporal patterns
RNN Methods: LSTM, GRU capture sequence dependencies
Transformer Methods:
- Informer: ProbSparse attention
- Autoformer: Seasonal-trend decomposition + autocorrelation attention
- FEDformer: Frequency-domain filtering
- PatchTST: Patch-based embedding
- iTransformer: Inverted modeling focusing on variable dependencies

2. Decomposition and Interpretability

Linear Decomposition: DLinear, NLinear achieve competitive results through simple trend-seasonal decomposition
Transformer Decomposition: ETSformer, Autoformer, FEDformer model components at input/representation layers
This Work's Distinction: Output-layer decoupling directly separates prediction targets' trends and deviations

3. Multi-Task and Modular Architectures

TimeXer: Distinguishes endogenous and exogenous signals
TimesNet: Multi-period modules capture different temporal scales
Lightweight MLPs: TimeMixer, LightTS, TSMixer
This Work's Innovation: Output-layer dual-stream framework with uncertainty-based adaptive task balancing

Conclusions and Discussion

Main Conclusions

CaReTS successfully decouples trend classification and deviation estimation through dual-stream architecture, simultaneously enhancing prediction accuracy and interpretability
The uncertainty-based multi-task learning mechanism effectively balances the contributions of the three tasks, removing the need to hand-tune the task weights
Four variants demonstrate framework flexibility, with CaReTS2-Transformer combination performing best
Matches or exceeds SOTA performance on real datasets, with trend classification accuracy exceeding 91% and manageable computational overhead

Limitations

Insufficient Long-Term Forecasting Validation: Limited by GPU resources, primarily evaluated on 6-step forecasting without thoroughly validating ultra-long-term prediction capability
Limited Dataset Diversity: Tested only on two power-related datasets, lacking cross-domain validation (e.g., finance, healthcare)
Limited Encoder Innovation: Employs standard encoders without exploring customized temporal feature extractors
Simplified Binary Trends: Models only up/down trends without considering stationary trends or finer-grained trend classification
Missing Interpretability Quantification: While claiming interpretability enhancement, lacks user studies or quantitative interpretability metrics

Future Directions

Long-Term Forecasting Extension: Validate ultra-long-term (e.g., 100+ steps) forecasting capability with greater computational resources
Cross-Domain Validation: Test framework generalization across diverse domains (finance, healthcare, climate)
Multi-Level Trend Classification: Extend to multi-class trends (e.g., strong up, weak up, stationary)
Customized Encoders: Explore feature extractors optimized for trend-deviation decomposition
Interpretability Research: Conduct user studies and quantitatively evaluate interpretability enhancement

In-Depth Evaluation

Strengths

Innovative Problem Decomposition: Decomposing time series forecasting into trend classification and deviation regression is intuitive and effective, providing a novel modeling perspective
Solid Theoretical Foundation: Uncertainty-aware multi-task learning builds on established theory (Kendall et al., 2018), and the implementation details are well designed
Systematic Design Exploration: Four variants progressively evolve from simple to complex, clearly showcasing the design space
Rigorous and Comprehensive Experiments:
- 10-fold cross-validation provides reliable estimates
- Comparison with 10 SOTA algorithms
- Ablation studies validate component contributions
- Cross-step analysis reveals trend stability
Strong Reproducibility: Provides anonymized code and detailed implementation settings
Clear Writing: Well-structured with rich figures and accurate technical descriptions

Weaknesses

Insufficient Interpretability Evaluation:
- Lacks visualization cases demonstrating how trend-deviation decomposition aids understanding
- No user studies validating interpretability enhancement
- Interpretability remains largely conceptual
Dataset Limitations:
- Only two related-domain datasets
- Relatively small sample size (8,784 points)
- Lacks multivariate time series validation
Missing Long-Term Forecasting Validation:
- Primarily evaluated on 6-step forecasting
- While Figure 5 shows trend stability, longer horizons not actually tested
- Limits judgment on long-term prediction capability
Coarse Computational Analysis:
- Only reports total runtime
- Lacks detailed time and memory complexity analysis
- No analysis of computational bottlenecks
Questionable Baseline Design:
- Three design baselines may be insufficient
- Lacks comparison with other multi-task learning approaches
Simplified Trend Definition:
- Binary trends (up/down) may be overly coarse
- Doesn't consider stationary states or trend strength

Impact

Academic Contribution:
- Provides new perspective on output-layer decomposition
- Application of uncertainty-aware multi-task learning to time series forecasting
- May inspire more trend-magnitude separation research
Practical Value:
- Demonstrates practicality in applications like electricity forecasting
- Trend classification provides decision support information
- Manageable computational overhead suitable for deployment
Reproducibility:
- Provides code (though anonymized)
- Complete implementation details
- Facilitates reproduction and extension
Limitation Impact:
- Dataset and long-term forecasting limitations may restrict impact
- Requires more cross-domain validation for widespread application

Applicable Scenarios

Suitable Scenarios:

Short to Medium-Term Forecasting (4-8 steps): The framework is evaluated at 4-, 6-, and 8-step horizons
Applications Requiring Trend Explanation: Finance, energy scheduling where trend direction matters more than exact values
Univariate or Low-Dimensional Time Series: Current experiments are univariate
Medium Data Volume Scenarios: Training samples ~6,000 points

Less Suitable Scenarios:

Ultra-Long-Term Forecasting (>10 steps): Unvalidated, effectiveness unknown
High-Dimensional Multivariate Time Series: Insufficient testing in multivariate settings
Real-Time Forecasting: The reported total runtime of 200-400 seconds may not meet real-time requirements, and per-prediction latency is not reported separately
Stationary Series Without Clear Trends: Trend classification may lack significant advantage

References

Key References Cited in Paper

Kendall et al. (2018): Multi-task learning using uncertainty to weigh losses for scene geometry and semantics. CVPR. Theoretical foundation for uncertainty weighting
Vaswani et al. (2017): Attention is all you need. NeurIPS. Transformer architecture
Zhou et al. (2021): Informer: Beyond efficient transformer for long sequence time-series forecasting. AAAI. ProbSparse attention
Wu et al. (2021): Autoformer: Decomposition transformers with auto-correlation for long-term series forecasting. NeurIPS. Seasonal-trend decomposition
Zhou et al. (2022): FEDformer: Frequency enhanced decomposed transformer for long-term series forecasting. ICML. Frequency-domain decomposition
Liu et al. (2023): iTransformer: Inverted transformers are effective for time series forecasting. arXiv. Inverted modeling
Zeng et al. (2023): Are transformers effective for time series forecasting? AAAI. DLinear/NLinear simple baselines
Wang et al. (2024c): TimeXer: Empowering transformers for time series forecasting with exogenous variables. NeurIPS. Exogenous variable modeling

Overall Assessment: This is a well-designed and rigorously executed time series forecasting paper. The core innovation—output-layer trend-deviation decomposition—is simple yet effective, providing a novel modeling perspective. The uncertainty-aware multi-task learning implementation is elegant. Experimental results demonstrate method effectiveness with improvements in both accuracy and interpretability. Main limitations include insufficient interpretability evaluation, limited dataset diversity, and missing long-term forecasting validation. Recommended future work includes validation across more domains and longer horizons, plus user studies to quantify interpretability gains. Overall, this represents a valuable contribution providing a new modeling paradigm for time series forecasting.