RATLIP: Generative Adversarial CLIP Text-to-Image Synthesis Based on Recurrent Affine Transformations

Lin, Lu, Chen

Synthesizing high-quality photorealistic images conditioned on textual descriptions is very challenging. Generative Adversarial Networks (GANs), the classical model for this task, frequently suffer from low consistency between images and text descriptions and from insufficient richness in the synthesized images. Recently, conditional affine transformations (CAT), such as conditional batch normalization and instance normalization, have been applied to different layers of GANs to control content synthesis in images. CAT is a multi-layer perceptron that independently predicts data based on batch statistics between neighboring layers, with global textual information unavailable to other layers. To address this issue, we first model CAT with a recurrent neural network (RAT) to ensure that different layers can access global information. We then introduce shuffle attention between RAT blocks to mitigate the information forgetting characteristic of recurrent neural networks. Moreover, both our generator and discriminator utilize the powerful pre-trained model CLIP, which has been extensively employed for establishing associations between text and images through the learning of multimodal representations in latent space. The discriminator utilizes CLIP's ability to comprehend complex scenes to accurately assess the quality of the generated images. Extensive experiments have been conducted on the CUB, Oxford, and CelebA-tiny datasets to demonstrate the superiority of the proposed model over current state-of-the-art models. The code is available at https://github.com/OxygenLu/RATLIP.

academic

Basic Information

Paper ID: 2405.08114
Title: RATLIP: Generative Adversarial CLIP Text-to-Image Synthesis Based on Recurrent Affine Transformations
Authors: Chengde Lin, Xijun Lu, Guangxi Chen
Category: cs.CV (Computer Vision)
Published: May 2024 (arXiv preprint)
Paper link: https://arxiv.org/abs/2405.08114
Code link: https://github.com/OxygenLu/RATLIP

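Summary

This paper proposes RATLIP, a generative adversarial CLIP text-to-image synthesis method based on recurrent affine transformations. In existing conditional affine transformation (CAT) methods, each layer predicts its parameters independently and has no access to global text information. To address this, the authors model recurrent affine transformations (RAT) with a recurrent neural network so that different layers can access global information. They also introduce a shuffle attention mechanism to mitigate the information forgetting typical of RNNs. Both the generator and the discriminator use the pre-trained CLIP model, and experiments on the CUB, Oxford, and CelebA-tiny datasets are reported to show the method's advantage.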

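Research Background and Motivation

Problem Definition

Text-to-image synthesis is a highly challenging cross-modal generation task: given a text description, the model must generate a high-quality, photorealistic image. The task has broad application prospects in text-driven image editing, virtual image synthesis, face reconstruction, and related areas.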

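Limitations of Existing Methods

Problems with traditional GAN methods: in text-to-image synthesis, generative adversarial networks often suffer from low consistency between image and text description and from insufficient richness in the synthesized images
Shortcomings of conditional affine transformations: existing CAT methods (such as conditional batch normalization, CBN, and conditional instance normalization, CIN) are multi-layer perceptrons that predict data independently from batch statistics between neighboring layers, so global text information is unavailable to other layers
Problems with diffusion models: although diffusion models achieve impressive results, their inference is slow and computationally expensive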

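Research Motivation

The authors argue that isolated feature-fusion blocks cause conditional instance normalization to occur independently at each layer, ignoring both the semantic relationships among the text information fused at different layers and the semantic relationships within the global text information. Such isolated fusion blocks are hard to optimize because the model treats them as not interacting with one another.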

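Core Contributions

Recurrent affine transformation module: a recurrent affine transformation module built on an LSTM with skip connections across feature layers, so that the text information fused at different layers is semantically related within the global text information, improving fusion quality
Shuffle attention mechanism: shuffle attention is inserted between every two recurrent affine transformation modules, mimicking the "learn-review" pattern of biological learning to suppress forgetting of text information and keep knowledge transfer stable
CLIP-integrated framework: both the generator and the discriminator use the powerful pre-trained CLIP model; the discriminator relies on CLIP's ability to understand complex scenes to assess the quality of generated images accurately
Experimental validation: extensive experiments on the CUB, Oxford, and CelebA-tiny datasets are reported to demonstrate the advantage of the proposed method over current state-of-the-art models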

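Method Details

Task Definition

Given a text description T, generate a high-quality image that is semantically consistent with it. The inputs are the text description T and a noise vector Z; the output is the synthesized image.

Model Architecture

Overall Framework

RATLIP improves on the GALIP framework and has three main components:

Pre-trained CLIP text encoder: encodes the input text description into a sentence vector T
Generator G: comprises the RAT Bridge, CLIP-BLK, and Image-G modules
Discriminator D: built on a frozen CLIP-ViT and includes a mate (paired) discriminator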

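RAT Block Design

The core innovation of the recurrent affine transformation is replacing the conventional multi-layer perceptron with an LSTM:

Conventional CAT formulation:

$$\mathrm{Affine}(c \mid h_i) = \gamma_i \cdot c + \beta_i$$
$$\gamma_i = \mathrm{MLP}_1(h_i), \quad \beta_i = \mathrm{MLP}_2(h_i)$$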
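
As an illustration of the CAT formula, the following minimal Python sketch predicts a per-channel scale and shift from a conditioning vector and applies them to a feature map. The helper names (`make_mlp`, `conditional_affine`), the layer sizes, and the random initialization are assumptions made for demonstration; they are not taken from the paper's code.

```python
import random

def linear(weights, bias, x):
    """Compute W @ x + b, with W given as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]

def make_mlp(in_dim, hidden_dim, out_dim, rng):
    """Return a one-hidden-layer MLP (ReLU) with small random weights."""
    def init(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]
    w1, b1 = init(hidden_dim, in_dim), [0.0] * hidden_dim
    w2, b2 = init(out_dim, hidden_dim), [0.0] * out_dim

    def forward(x):
        hidden = [max(0.0, v) for v in linear(w1, b1, x)]
        return linear(w2, b2, hidden)
    return forward

def conditional_affine(feature_map, cond, mlp_gamma, mlp_beta):
    """Apply gamma * c + beta per channel, with gamma and beta predicted from cond.

    feature_map: list of channels, each a flat list of activations (c).
    cond: conditioning vector (h_i in the formula).
    """
    gamma = mlp_gamma(cond)
    beta = mlp_beta(cond)
    return [[g * v + b for v in channel]
            for channel, g, b in zip(feature_map, gamma, beta)]

# Toy usage: 3 channels with 2 activations each, conditioned on a 4-d vector.
rng = random.Random(0)
mlp_gamma = make_mlp(4, 8, 3, rng)
mlp_beta = make_mlp(4, 8, 3, rng)
features = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
text_vector = [0.5, -0.2, 0.1, 0.9]
modulated = conditional_affine(features, text_vector, mlp_gamma, mlp_beta)
```

Because each layer would own its own pair of MLPs, nothing in this formulation ties one layer's scale and shift to another's, which is the isolation the RAT Block is meant to remove.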

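LSTM formulation of the RAT Block:

$$h_0 = \mathrm{MLP}_3(z), \quad c_0 = \mathrm{MLP}_4(z)$$
$$[i_t, f_t, o_t, u_t] = [\sigma, \sigma, \sigma, \tanh]\big(T([s;\, h_{t-1}])\big)$$
$$c_t = f_t \odot c_{t-1} + i_t \odot u_t$$
$$h_t = o_t \odot \tanh(c_t)$$
$$\gamma_t, \beta_t = \mathrm{MLP}_1^t(h_t), \mathrm{MLP}_2^t(h_t)$$

Here $i_t$, $f_t$, and $o_t$ are the input, forget, and output gates, $u_t$ is the candidate update, and $c_t$ is the LSTM cell state (not the feature $c$ modulated in the CAT formula). $T$ is taken to be an affine map applied to the concatenation of the sentence vector $s$ and the previous hidden state $h_{t-1}$, and $z$ is the noise vector.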
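
The recurrence can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: each MLP is reduced to a single linear layer, the gate pre-activations are assumed to come from one shared affine map over the concatenation of the sentence vector and the previous hidden state, and all names and dimensions (`lstm_step`, `recurrent_affine_params`, `hidden=12`, and so on) are hypothetical.

```python
import math
import random

def linear(weights, bias, x):
    """Compute W @ x + b, with W given as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]

def init_linear(out_dim, in_dim, rng):
    """Return (weights, bias) for a linear layer with small random weights."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
    return weights, [0.0] * out_dim

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def lstm_step(T, s, h_prev, c_prev):
    """One LSTM update driven by the sentence vector s and the previous state."""
    n = len(h_prev)
    pre = linear(*T, s + h_prev)                    # affine map over [s; h_{t-1}]
    i = [sigmoid(v) for v in pre[:n]]               # input gate
    f = [sigmoid(v) for v in pre[n:2 * n]]          # forget gate
    o = [sigmoid(v) for v in pre[2 * n:3 * n]]      # output gate
    u = [math.tanh(v) for v in pre[3 * n:]]         # candidate update
    c = [fv * cv + iv * uv for fv, cv, iv, uv in zip(f, c_prev, i, u)]
    h = [ov * math.tanh(cv) for ov, cv in zip(o, c)]
    return h, c

def recurrent_affine_params(z, s, channels_per_layer, hidden, rng):
    """Return one (gamma_t, beta_t) pair per generator layer t."""
    h = linear(*init_linear(hidden, len(z), rng), z)   # h_0 from z (one linear layer here)
    c = linear(*init_linear(hidden, len(z), rng), z)   # c_0 from z (one linear layer here)
    T = init_linear(4 * hidden, len(s) + hidden, rng)  # shared across all steps
    params = []
    for channels in channels_per_layer:
        h, c = lstm_step(T, s, h, c)
        gamma = linear(*init_linear(channels, hidden, rng), h)  # per-layer head for gamma_t
        beta = linear(*init_linear(channels, hidden, rng), h)   # per-layer head for beta_t
        params.append((gamma, beta))
    return params

# Toy usage: three generator layers with 32, 16, and 8 channels.
rng = random.Random(0)
z = [rng.gauss(0, 1) for _ in range(8)]
s = [rng.gauss(0, 1) for _ in range(16)]
params = recurrent_affine_params(z, s, channels_per_layer=[32, 16, 8], hidden=12, rng=rng)
```

The point of the sketch is that every layer's scale and shift are read from the same evolving hidden state, so later layers depend on what earlier layers have already seen.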

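Shuffle Attention Mechanism

To address the LSTM's tendency to forget information during long-horizon learning, the authors insert shuffle attention between every two RAT Blocks:

Split the input into groups according to a fixed rule
Process spatial and channel information separately
Re-fuse the results into a richer representation
Mimic the "learn-review" pattern of biological learning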
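
The grouping, two-branch processing, and re-fusion steps can be sketched as below, assuming the module follows the commonly used Shuffle Attention design (channel and spatial branches per group, followed by a channel shuffle). The gates use fixed weights instead of learned parameters, the spatial branch normalizes each channel on its own, and all function names are hypothetical, so this is a rough illustration and not the paper's module.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def channel_shuffle(channels, groups):
    """Interleave channels across groups: reshape to (groups, n), transpose, flatten."""
    if len(channels) % groups != 0:
        raise ValueError("number of channels must be divisible by groups")
    per_group = len(channels) // groups
    grouped = [channels[g * per_group:(g + 1) * per_group] for g in range(groups)]
    return [grouped[g][k] for k in range(per_group) for g in range(groups)]

def channel_gate(channel, weight=1.0, bias=0.0):
    """Channel branch: gate a whole channel by a sigmoid of its global average."""
    mean = sum(channel) / len(channel)
    gate = sigmoid(weight * mean + bias)
    return [gate * v for v in channel]

def spatial_gate(channel, weight=1.0, bias=0.0, eps=1e-5):
    """Spatial branch: gate each position by a sigmoid of its normalized value."""
    mean = sum(channel) / len(channel)
    var = sum((v - mean) ** 2 for v in channel) / len(channel)
    normed = [(v - mean) / math.sqrt(var + eps) for v in channel]
    return [sigmoid(weight * n + bias) * v for n, v in zip(normed, channel)]

def shuffle_attention(feature_map, groups):
    """Group channels, apply the two branches per group, concatenate, then shuffle."""
    per_group = len(feature_map) // groups
    if len(feature_map) % groups != 0 or per_group % 2 != 0:
        raise ValueError("channels must split evenly into groups of even size")
    half = per_group // 2
    out = []
    for g in range(groups):
        group = feature_map[g * per_group:(g + 1) * per_group]
        out += [channel_gate(ch) for ch in group[:half]]
        out += [spatial_gate(ch) for ch in group[half:]]
    return channel_shuffle(out, 2)

# The shuffle alone, shown on channel labels from two groups "a" and "b".
labels = ["a0", "a1", "a2", "b0", "b1", "b2"]
shuffled = channel_shuffle(labels, groups=2)
```

The final shuffle is what lets information cross group boundaries, since the two branches never mix channels from different groups on their own.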

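Technical Innovations

Global information access: LSTM skip connections and weight sharing keep the text information consistent across the fusion blocks of different layers
Memory enhancement: the shuffle attention mechanism mitigates the LSTM's forgetting and keeps knowledge transfer stable over the long term
CLIP integration: the method draws on CLIP's multimodal representation learning to strengthen the association between text and image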

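Experimental Setup

Datasets

CUB dataset: 11,788 bird images in 200 categories
Oxford dataset: 8,189 flower images in 102 categories
CelebA-tiny dataset: 10,000 photos randomly selected from CelebAMask-HQ, with 8,000 for training and 2,000 for testing

Every image in each dataset comes with 10 descriptive sentences.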
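
A split with the CelebA-tiny proportions (10,000 images, 8,000 for training and 2,000 for testing) could be produced along these lines. The seed, the function name `split_indices`, and the use of index shuffling are assumptions; the selection procedure actually used for the dataset is not described here.

```python
import random

def split_indices(num_images, num_train, seed=0):
    """Shuffle image indices with a fixed seed and split them into train and test."""
    indices = list(range(num_images))
    random.Random(seed).shuffle(indices)
    return indices[:num_train], indices[num_train:]

train_ids, test_ids = split_indices(10_000, 8_000)
```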

Evaluation Metrics

FID (Fréchet Inception Distance): measures the quality of generated images; lower is better
CLIP-Score (CS): measures text-image consistency; higher is better
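
To make the two metrics concrete, here is a small sketch. `clip_score` computes a scaled cosine similarity between an image embedding and a text embedding; the factor of 100 is an assumption chosen to match the scale of the scores reported in this note, and real embeddings would come from the CLIP encoders. `frechet_distance_1d` is only the univariate special case of the Fréchet distance between Gaussians; the actual FID applies the multivariate form to Inception features.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def clip_score(image_embedding, text_embedding, scale=100.0):
    """Scaled cosine similarity between image and text embeddings (scale is assumed)."""
    return scale * cosine_similarity(image_embedding, text_embedding)

def frechet_distance_1d(mu1, sigma1, mu2, sigma2):
    """Squared Fréchet distance between two univariate Gaussians."""
    return (mu1 - mu2) ** 2 + sigma1 ** 2 + sigma2 ** 2 - 2.0 * sigma1 * sigma2

# Toy usage with made-up 3-d embeddings and made-up feature statistics.
score = clip_score([0.2, 0.9, 0.1], [0.1, 0.8, 0.3])
distance = frechet_distance_1d(0.0, 1.0, 0.5, 1.5)
```

Identical distributions give a distance of zero, and embeddings pointing in the same direction give the maximum score, which matches the "lower is better" and "higher is better" readings above.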

Implementation Details

CLIP model: ViT-B/32
Generator learning rate: 0.0001; discriminator learning rate: 0.0004
Optimizer: Adam
Hardware: 3 × RTX 3090 GPUs
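
These settings could be collected in a small configuration object such as the one below. The class and field names are hypothetical, and hyperparameters not listed in this note (for example the Adam betas, batch size, or number of epochs) are deliberately left out.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TrainConfig:
    """Training settings listed in this note; field names are illustrative."""
    clip_model: str = "ViT-B/32"
    optimizer: str = "Adam"
    generator_lr: float = 1e-4
    discriminator_lr: float = 4e-4
    num_gpus: int = 3

config = TrainConfig()
# The discriminator is updated with a learning rate four times the generator's.
lr_ratio = config.discriminator_lr / config.generator_lr
```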

Compared Methods

AttnGAN
LAFITE
DF-GAN
GALIP (baseline)

Experimental Results

Main Results

Method	FID↓ CUB	FID↓ CelebA-tiny	CS↑ CUB	CS↑ Oxford	CS↑ CelebA-tiny
AttnGAN	23.98	125.98	-	-	21.15
LAFITE	14.58	-	31.25	-	-
DF-GAN	14.81	137.6	29.20	26.67	24.41
GALIP	10.0	94.45	31.60	31.77	27.95
RATLIP	13.28	81.48	32.03	31.94	28.91
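
To read the table against the GALIP baseline, the relative change of each RATLIP entry can be computed directly from the listed numbers. The dictionary keys and the `relative_change` helper are illustrative names; the values are copied from the table as given.

```python
results = {
    "GALIP": {"fid_cub": 10.0, "fid_celeba_tiny": 94.45,
              "cs_cub": 31.60, "cs_oxford": 31.77, "cs_celeba_tiny": 27.95},
    "RATLIP": {"fid_cub": 13.28, "fid_celeba_tiny": 81.48,
               "cs_cub": 32.03, "cs_oxford": 31.94, "cs_celeba_tiny": 28.91},
}

def relative_change(new, old):
    """Percentage change from old to new; negative means the value decreased."""
    return 100.0 * (new - old) / old

for metric, baseline in results["GALIP"].items():
    delta = relative_change(results["RATLIP"][metric], baseline)
    print(f"{metric}: {delta:+.2f}%")
```

With the values as listed, RATLIP's FID on CelebA-tiny is roughly 13.7% lower than GALIP's, while its FID on CUB is higher than GALIP's; the CLIP-Score gains are all below 4%.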