RATLIP: Generative Adversarial CLIP Text-to-Image Synthesis Based on Recurrent Affine Transformations

Lin, Lu, Chen

Synthesizing high-quality photorealistic images with textual descriptions as a condition is very challenging. Generative Adversarial Networks (GANs), the classical model for this task, frequently suffer from low consistency between image and text descriptions and insufficient richness in synthesized images. Recently, conditional affine transformations (CAT), such as conditional batch normalization and instance normalization, have been applied to different layers of GANs to control content synthesis in images. CAT is a multi-layer perceptron that independently predicts data based on batch statistics between neighboring layers, with global textual information unavailable to other layers. To address this issue, we first model CAT with a recurrent neural network, yielding recurrent affine transformations (RAT), to ensure that different layers can access global information. We then introduce shuffle attention between RAT blocks to mitigate the characteristic of information forgetting in recurrent neural networks. Moreover, both our generator and discriminator utilize the powerful pre-trained model CLIP, which has been extensively employed for establishing associations between text and images through the learning of multimodal representations in latent space. The discriminator utilizes CLIP's ability to comprehend complex scenes to accurately assess the quality of the generated images. Extensive experiments have been conducted on the CUB, Oxford, and CelebA-tiny datasets to demonstrate the superiority of the proposed model over current state-of-the-art models. The code is available at https://github.com/OxygenLu/RATLIP.

Basic Information

Paper ID: 2405.08114
Title: RATLIP: Generative Adversarial CLIP Text-to-Image Synthesis Based on Recurrent Affine Transformations
Authors: Chengde Lin, Xijun Lu, Guangxi Chen
Category: cs.CV (Computer Vision)
Publication Date: May 2024 (arXiv preprint)
Paper Link: https://arxiv.org/abs/2405.08114
Code Link: https://github.com/OxygenLu/RATLIP

Abstract

This paper proposes RATLIP, a generative adversarial CLIP text-to-image synthesis method based on recurrent affine transformations. Addressing the limitation in existing conditional affine transformation (CAT) methods where each layer makes independent predictions without access to global textual information, the authors propose modeling recurrent affine transformations (RAT) using recurrent neural networks to ensure that different layers can access global information. Additionally, a shuffle attention mechanism is introduced to mitigate the information forgetting characteristics of RNNs. The method leverages pre-trained CLIP models in both the generator and discriminator. Experiments on CUB, Oxford, and CelebA-tiny datasets demonstrate the superiority of the proposed approach.

Research Background and Motivation

Problem Definition

Text-to-image synthesis is a highly challenging cross-modal generation task that requires generating high-quality photorealistic images based on textual descriptions. This task has broad application prospects in text-driven image editing, virtual image synthesis, face reconstruction, and related domains.

Limitations of Existing Methods

Issues with Traditional GANs: Generative adversarial networks frequently suffer from low consistency between generated images and textual descriptions, as well as insufficient diversity in synthesized images.
Defects of Conditional Affine Transformations: Existing CAT methods, such as Conditional Batch Normalization (CBN) and Conditional Instance Normalization (CIN), use multi-layer perceptrons that make independent predictions at each layer based on batch statistics between adjacent layers, so global textual information is not shared with other layers.
Problems with Diffusion Models: Although diffusion models achieve impressive results, they suffer from long inference times and high computational overhead.

Research Motivation

The authors argue that isolated feature fusion blocks cause conditional instance normalization to operate independently at each layer, ignoring both the semantic relationships between the text information fused at different layers and those within the global textual information. Such isolated blocks are also difficult to optimize because they do not interact with one another within the model.

Core Contributions

Proposed Recurrent Affine Transformation Module: A recurrent affine transformation module based on LSTM skip connections across feature layers, enabling text information fusion across different layers to maintain semantic relationships within global textual information, thereby improving fusion effectiveness.
Introduced Shuffle Attention Mechanism: Shuffle attention is inserted between every two recurrent affine transformation modules, simulating the "learning-review" pattern in biological learning processes to suppress text information forgetting and maintain stable knowledge transfer.
CLIP Integration Framework: Both the generator and discriminator leverage powerful pre-trained CLIP models. The discriminator utilizes CLIP's ability to understand complex scenes to accurately assess generated image quality.
Experimental Validation: Comprehensive experiments on CUB, Oxford, and CelebA-tiny datasets demonstrate the superiority of the proposed method compared to current state-of-the-art models.

Methodology Details

Task Definition

Given a textual description T, generate high-quality images semantically consistent with it. The input consists of textual description T and noise vector Z, with the output being synthesized images.

Model Architecture

Overall Framework

RATLIP is an improvement over the GALIP framework, comprising three main components:

Pre-trained CLIP Text Encoder: Encodes input textual descriptions into sentence vectors T
Generator G: Contains RAT Bridge, CLIP-BLK, and Image-G modules
Discriminator D: Based on frozen CLIP-ViT, includes paired discriminators

RAT Block Design

The core innovation of the recurrent affine transformation is to replace the independent multi-layer perceptrons of traditional CAT with an LSTM:

Traditional CAT Formula:

$$\mathrm{Affine}(c \mid h_i) = \gamma_i \cdot c + \beta_i$$
$$\gamma_i = \mathrm{MLP}_1(h_i), \quad \beta_i = \mathrm{MLP}_2(h_i)$$
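
As a rough illustration of the CAT formula, the sketch below predicts one scale and one shift per channel from a conditioning vector and applies them to a feature map. It is a minimal NumPy sketch, not the authors' implementation: the two-layer ReLU perceptron, the random weights, and all sizes (`channels`, `cond_dim`, `hidden`) are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(x, w1, w2):
    """Two-layer perceptron with ReLU and no biases (kept minimal for illustration)."""
    return np.maximum(x @ w1, 0.0) @ w2

def conditional_affine(c, h, params):
    """Apply Affine(c | h) = gamma * c + beta channel-wise.

    c: feature map of shape (channels, height, width)
    h: conditioning vector of shape (cond_dim,)
    """
    gamma = mlp(h, *params["gamma"])  # one scale per channel
    beta = mlp(h, *params["beta"])    # one shift per channel
    return gamma[:, None, None] * c + beta[:, None, None]

# Hypothetical sizes and randomly initialised weights, for illustration only.
channels, cond_dim, hidden = 8, 16, 32
params = {
    "gamma": (rng.normal(size=(cond_dim, hidden)), rng.normal(size=(hidden, channels))),
    "beta": (rng.normal(size=(cond_dim, hidden)), rng.normal(size=(hidden, channels))),
}
c = rng.normal(size=(channels, 4, 4))
h = rng.normal(size=cond_dim)
out = conditional_affine(c, h, params)
```

Because each layer would own its own pair of perceptrons in this scheme, nothing ties the scale and shift predicted at one layer to those predicted at another, which is the isolation the summary describes.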

RAT Block's LSTM Modeling:

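$$h_0 = \mathrm{MLP}_3(z), \quad c_0 = \mathrm{MLP}_4(z)$$
$$\begin{pmatrix} i_t \\ f_t \\ o_t \\ u_t \end{pmatrix} = \begin{pmatrix} \sigma \\ \sigma \\ \sigma \\ \tanh \end{pmatrix} \left( T \begin{pmatrix} s \\ h_{t-1} \end{pmatrix} \right)$$
$$c_t = f_t \odot c_{t-1} + i_t \odot u_t$$
$$h_t = o_t \odot \tanh(c_t)$$
$$\gamma_t, \beta_t = \mathrm{MLP}_1^t(h_t), \mathrm{MLP}_2^t(h_t)$$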

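where i_t, f_t, and o_t are the input gate, forget gate, and output gate, respectively; u_t is the candidate cell update; σ is the logistic sigmoid; and ⊙ denotes element-wise multiplication. Here z is the noise vector, s is the sentence embedding, and T denotes a learned affine map (not the textual description T of the Task Definition).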
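
The recurrence can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the released code: single linear maps stand in for MLP3, MLP4 and the per-step heads, the map T is taken to be an affine transformation of the concatenation of s and the previous hidden state, and every name and size (`init_weights`, `rat_affine_params`, `z_dim`, `channels`, `num_layers`) is hypothetical. The hidden size of 64 follows the value the parameter analysis reports as best, and 512 is the text-embedding width of CLIP ViT-B/32.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init_weights(rng, z_dim, s_dim, hidden, channels, num_layers):
    """Random stand-ins for learned parameters (all names are hypothetical)."""
    def w(*shape):
        return 0.1 * rng.normal(size=shape)
    return {
        "W_h0": w(hidden, z_dim),                    # stands in for MLP3
        "W_c0": w(hidden, z_dim),                    # stands in for MLP4
        "W_T": w(4 * hidden, s_dim + hidden),        # affine map T, weights
        "b_T": np.zeros(4 * hidden),                 # affine map T, bias
        "W_gamma": w(num_layers, channels, hidden),  # per-step heads (MLP1^t)
        "W_beta": w(num_layers, channels, hidden),   # per-step heads (MLP2^t)
    }

def rat_affine_params(z, s, weights, num_layers):
    """Run one LSTM step per generator layer and return [(gamma_t, beta_t), ...]."""
    h = weights["W_h0"] @ z
    c = weights["W_c0"] @ z
    params = []
    for t in range(num_layers):
        # All four gates come from one affine map of [s; h_{t-1}].
        gates = weights["W_T"] @ np.concatenate([s, h]) + weights["b_T"]
        i, f, o, u = np.split(gates, 4)
        i, f, o, u = sigmoid(i), sigmoid(f), sigmoid(o), np.tanh(u)
        c = f * c + i * u          # cell state carries information across layers
        h = o * np.tanh(c)
        params.append((weights["W_gamma"][t] @ h, weights["W_beta"][t] @ h))
    return params

rng = np.random.default_rng(0)
z_dim, s_dim, hidden, channels, num_layers = 100, 512, 64, 32, 4
weights = init_weights(rng, z_dim, s_dim, hidden, channels, num_layers)
z = rng.normal(size=z_dim)
s = rng.normal(size=s_dim)
layer_params = rat_affine_params(z, s, weights, num_layers)
```

Unlike independent per-layer perceptrons, every (γ_t, β_t) pair here depends on the same sentence embedding and on the state left by the earlier layers, which is the sense in which later layers can see global information.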

Shuffle Attention Mechanism

To address the problem of LSTM forgetting information during long-term learning, the authors introduce shuffle attention between every two RAT blocks:

Input parameters are grouped according to rules
Spatial and channel information are processed separately
Results are re-fused to obtain rich information representations
Simulates the biological "learning-review" learning pattern
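
The grouping, separate processing, and re-fusion steps can be sketched on a single feature map as follows. The sketch assumes the commonly used shuffle-attention formulation (channel groups, a channel branch gated by global average pooling, a spatial branch gated by normalized activations, then a channel shuffle); how RATLIP wires this between RAT blocks is not specified in this summary, and the parameter names `cw`, `cb`, `sw`, `sb` are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_shuffle(x, groups):
    """Interleave channels from different groups; x has shape (C, H, W)."""
    c, h, w = x.shape
    return x.reshape(groups, c // groups, h, w).transpose(1, 0, 2, 3).reshape(c, h, w)

def shuffle_attention(x, groups, p):
    """x: (C, H, W) with C divisible by 2 * groups; p: per-channel scales and biases."""
    c, h, w = x.shape
    x = x.reshape(groups, c // groups, h, w)   # step 1: group the channels
    x_ch, x_sp = np.split(x, 2, axis=1)        # two branches per group

    # Step 2a, channel branch: gate each channel by its spatial average.
    pooled = x_ch.mean(axis=(2, 3), keepdims=True)
    x_ch = x_ch * sigmoid(p["cw"] * pooled + p["cb"])

    # Step 2b, spatial branch: gate each position by its normalized activation.
    mean = x_sp.mean(axis=(2, 3), keepdims=True)
    var = x_sp.var(axis=(2, 3), keepdims=True)
    normed = (x_sp - mean) / np.sqrt(var + 1e-5)
    x_sp = x_sp * sigmoid(p["sw"] * normed + p["sb"])

    # Step 3: re-fuse the branches, then shuffle so groups exchange information.
    fused = np.concatenate([x_ch, x_sp], axis=1).reshape(c, h, w)
    return channel_shuffle(fused, 2)

rng = np.random.default_rng(0)
channels, groups = 8, 2
half = channels // (2 * groups)
p = {k: rng.normal(size=(1, half, 1, 1)) for k in ("cw", "cb", "sw", "sb")}
x = rng.normal(size=(channels, 4, 4))
y = shuffle_attention(x, groups, p)
```

The output has the same shape as the input, so a block like this can be placed between two other modules without changing their interfaces.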

Technical Innovations

Global Information Access: LSTM skip connections and weight sharing keep textual information consistent across the fusion blocks of different layers.
Memory Enhancement: The shuffle attention mechanism effectively alleviates LSTM's forgetting characteristics, maintaining long-term stable knowledge transfer.
CLIP Integration: Fully leverages CLIP's multi-modal representation learning capabilities to enhance text-image association.

Experimental Setup

Datasets

CUB Dataset: Contains 11,788 bird images across 200 different categories
Oxford Dataset: Contains 8,189 flower images across 102 different categories
CelebA-tiny Dataset: 10,000 photographs randomly selected from CelebAMask-HQ, split into 8,000 training images and 2,000 test images

Each image in every dataset is accompanied by 10 descriptive sentences.
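
A CelebA-tiny style subset could be drawn along the following lines. This is only a sketch: the seed, the integer image identifiers, and the function name `split_celeba_tiny` are assumptions, and the authors' actual selection is not described here. The pool size of 30,000 is the size of CelebAMask-HQ.

```python
import random

def split_celeba_tiny(image_ids, n_total=10_000, n_train=8_000, seed=0):
    """Sample n_total ids without replacement, then split into train and test."""
    rng = random.Random(seed)               # fixed seed so the split is reproducible
    chosen = rng.sample(image_ids, n_total)
    return chosen[:n_train], chosen[n_train:]

all_ids = list(range(30_000))               # placeholder ids for the source images
train_ids, test_ids = split_celeba_tiny(all_ids)
```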

Evaluation Metrics

FID (Fréchet Inception Distance): Evaluates generated image quality; lower values are better
CLIP-Score (CS): Evaluates text-image consistency; higher values are better
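
For reference, FID is conventionally the Fréchet distance between two Gaussians fitted to Inception features of real and generated images. The notation below is introduced here and not taken from the paper: μ_r, Σ_r are the feature mean and covariance for real images, and μ_g, Σ_g those for generated images. The second line shows one common convention for CLIP-Score, the cosine similarity between CLIP image and text embeddings scaled by 100; that the paper uses exactly this scaling is an assumption, although the reported values near 30 are consistent with it.

$$\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2 + \mathrm{Tr}\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)$$
$$\mathrm{CS}(I, T) \approx 100 \cdot \cos\left( E_{\mathrm{img}}(I),\ E_{\mathrm{txt}}(T) \right)$$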

Implementation Details

CLIP model: ViT-B/32
Generator learning rate: 0.0001, Discriminator learning rate: 0.0004
Optimizer: Adam
Hardware: 3 × NVIDIA RTX 3090 GPUs

Comparison Methods

AttnGAN
LAFITE
DF-GAN
GALIP (baseline)

Experimental Results

Main Results

Method	FID↓ CUB	FID↓ CelebA-tiny	CS↑ CUB	CS↑ Oxford	CS↑ CelebA-tiny
AttnGAN	23.98	125.98	-	-	21.15
LAFITE	14.58	-	31.25	-	-
DF-GAN	14.81	137.6	29.20	26.67	24.41
GALIP	10.0	94.45	31.60	31.77	27.95
RATLIP	13.28	81.48	32.03	31.94	28.91

Key Findings:

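Achieves the best FID among the compared methods on CelebA-tiny (81.48, versus 94.45 for GALIP)
Improves CS over the GALIP baseline on all three datasets: +0.43 on CUB, +0.17 on Oxford, and +0.96 on CelebA-tiny (and +0.78 over LAFITE on CUB)
Ranks second on CUB FID (13.28), behind GALIP (10.0)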
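
The rankings and the gains over GALIP can be recomputed directly from the table. The short script below only copies the reported numbers into dictionaries (the variable names are ad hoc) and does the arithmetic.

```python
# Values copied from the main results table; "-" entries are omitted.
fid_cub = {"AttnGAN": 23.98, "LAFITE": 14.58, "DF-GAN": 14.81, "GALIP": 10.0, "RATLIP": 13.28}
fid_celeba = {"AttnGAN": 125.98, "DF-GAN": 137.6, "GALIP": 94.45, "RATLIP": 81.48}

# CLIP-Score in the order (CUB, Oxford, CelebA-tiny).
cs = {"GALIP": (31.60, 31.77, 27.95), "RATLIP": (32.03, 31.94, 28.91)}

# Lower FID is better, so sort ascending.
cub_rank = sorted(fid_cub, key=fid_cub.get)
celeba_rank = sorted(fid_celeba, key=fid_celeba.get)

# Per-dataset CS gain of RATLIP over GALIP, rounded to the table's precision.
cs_gain = [round(r - g, 2) for r, g in zip(cs["RATLIP"], cs["GALIP"])]

print(cub_rank)     # best to worst on CUB
print(celeba_rank)  # best to worst on CelebA-tiny
print(cs_gain)      # gains on CUB, Oxford, CelebA-tiny
```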

Ablation Study

Method	CS↑ CUB	CS↑ Oxford	CS↑ CelebA-tiny
Baseline (GALIP)	31.60	31.77	27.95
RAT	31.62	31.83	27.63
RAT+ATT	32.03	31.94	28.91

Analysis:

The RAT block alone yields marginal CS gains on CUB (+0.02) and Oxford (+0.06) but a drop on CelebA-tiny (-0.32)
Adding shuffle attention raises CS above the baseline on all three datasets (+0.43, +0.17, and +0.96, respectively), which supports the claim that the attention mechanism suppresses LSTM forgetting
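
To separate the contribution of each component, the ablation rows can be differenced step by step. The snippet below uses only the CS values from the ablation table; the helper name `delta` is ad hoc.

```python
datasets = ("CUB", "Oxford", "CelebA-tiny")
ablation = {
    "Baseline": (31.60, 31.77, 27.95),
    "RAT": (31.62, 31.83, 27.63),
    "RAT+ATT": (32.03, 31.94, 28.91),
}

def delta(before, after):
    """CS change per dataset when moving from one ablation row to another."""
    return {
        name: round(b - a, 2)
        for name, a, b in zip(datasets, ablation[before], ablation[after])
    }

rat_effect = delta("Baseline", "RAT")   # effect of adding the RAT block
att_effect = delta("RAT", "RAT+ATT")    # effect of adding shuffle attention on top
```

Read this way, most of the gain on each dataset appears when shuffle attention is added to RAT, and the CelebA-tiny drop from RAT alone is more than recovered.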

Parameter Analysis

The authors analyze the LSTM hidden layer size h (h = 0, 4, 8, 16, 32, 64, 128). Grad-CAM visualization indicates that h = 64 gives the best result, with the high-activation (red) region fully covering the target.

Case Analysis

Semantic Space Feature Analysis: By comparing generated results for descriptions "He is young, receding hairline" and "He is old, receding hairline", the authors found:

In the baseline, "young" is overshadowed by "receding hairline", resulting in facial wrinkles
RATLIP generates semantically more appropriate images where different age descriptions produce corresponding visual features
In latent space, RATLIP's feature vectors show clearer fusion, avoiding confused feature blending

Related Work

Text-to-Image Synthesis

Early Methods: Conditional GAN first introduced conditioning into GANs, performing coarse fusion by concatenating text features with noise vectors
Advanced Fusion Methods: CIN proposed a more sophisticated fusion method that uses an adaptive mean and variance to control image style
Attention Mechanisms: AttnGAN leveraged attention mechanisms for finer-grained synthesis
CLIP Integration: LAFITE and GALIP utilized CLIP for text-image contrastive learning

Attention Mechanisms in Text-to-Image Synthesis

AttnGAN achieved impressive results in generating high-resolution images
Stacked cross-attention mechanisms have been used to identify alignments between images and text more comprehensively
Spatial attention ensures semantic consistency between images and text

Conclusions and Discussion

Main Conclusions

Through recurrent affine transformations, RATLIP addresses the limitation of traditional CAT methods, in which layers lack access to global textual information.
The shuffle attention mechanism successfully mitigates LSTM's information forgetting characteristics, enhancing long-term memory capacity for textual information.
Deep integration with CLIP significantly improves text-image consistency and generation quality.
Experimental results show that RATLIP improves CLIP-Score over the compared methods on all three datasets and achieves the best FID on CelebA-tiny, while trailing GALIP on CUB FID.

Limitations

Computational Complexity: LSTM and attention mechanisms increase model computational overhead.
Parameter Sensitivity: LSTM hidden layer size requires careful tuning.
Dataset Scale: Experiments are primarily conducted on relatively small datasets; performance on large-scale datasets remains to be verified.
Inference Speed: Although faster than diffusion models, additional overhead remains compared to simple GANs.

Future Directions

Explore more efficient recurrent mechanisms as alternatives to LSTM
Investigate more advanced attention mechanisms
Extend to larger-scale and more complex datasets
Study model applications in other cross-modal tasks

In-Depth Evaluation

Strengths

Strong Novelty: Introducing recurrent neural networks into conditional affine transformations is a novel idea that effectively addresses the core problem of existing methods.
Solid Theoretical Foundation: LSTM-based modeling of global information access is theoretically sound and elegantly implemented.
Comprehensive Experiments: Includes detailed comparative experiments, ablation studies, and parameter analysis with scientific experimental design.
In-depth Visualization Analysis: Provides intuitive method understanding through Grad-CAM and latent space analysis.
High Practical Value: Improves generation quality while maintaining relatively fast inference speed.

Weaknesses

Writing Quality: The paper contains grammatical errors and some unclear expressions.
Insufficient Theoretical Analysis: Lacks in-depth theoretical analysis of why LSTM can solve the global information access problem.
Limited Experimental Scale: Validation is primarily on relatively simple datasets, lacking experiments on complex scene datasets.
Incomplete Comparisons: Lacks direct comparison with latest diffusion models.
Missing Computational Efficiency Analysis: No detailed analysis of computation time and memory usage.

Impact

Academic Contribution: Provides a new technical pathway for text-to-image synthesis, particularly in conditional information fusion.
Practical Value: The method is relatively simple to implement and likely to be adopted in practical applications.
Inspirational Significance: Introducing recurrent mechanisms into generative models provides new insights for subsequent research.

Applicable Scenarios

Text-Driven Image Editing: Applications requiring precise control over image generation processes.
Virtual Content Creation: Concept design in gaming, film, and entertainment industries.
Education and Training: Generating teaching materials based on textual descriptions.
Personalized Content Generation: Creating customized image content based on user descriptions.

References

The paper cites 42 related references, primarily including:

Diffusion model-related work (BoxDiff, Raphael, etc.)
Classic GAN text-to-image synthesis work (AttnGAN, DF-GAN, GALIP, etc.)
Attention mechanism research (CBAM, cross-attention, etc.)
CLIP-related applications (StyleCLIP, LAFITE, etc.)

Overall Assessment: This is an innovative work in the text-to-image synthesis domain that effectively addresses key problems in existing methods through the proposed recurrent affine transformation approach. Despite some shortcomings in writing quality and experimental scale, the technical contributions and experimental results demonstrate the method's effectiveness and practical value. This work provides new research directions for text-to-image synthesis and merits further exploration and improvement.