Text-to-image synthesis with self-supervised learning

Yong Xuan Tan; Chin Poo Lee; Mai Neo; Kian Ming Lim

doi:10.1016/j.patrec.2022.04.010

Research output: Journal Publication › Article › peer-review

17 Citations (Scopus)

Abstract

Text-to-image synthesis extracts the meaning of a text description and converts it into a corresponding image. It is widely used in applications such as graphic design and image editing. Text-to-image synthesis approaches are mainly built on generative adversarial networks. One of the main challenges is generating images that are visually realistic. In addition, text-to-image synthesis models are inherently susceptible to overconfidence and training instability. To address these challenges, this paper proposes a self-supervised text-to-image synthesis method with several enhancements: self-supervised learning, feature matching, L1 distance loss, and one-sided label smoothing. Self-supervised learning provides more image variations, thereby improving the classification power of the discriminator. The feature matching and L1 distance functions encourage the generator to synthesize images that are visually more similar to the real images for the given text description. One-sided label smoothing adds a penalty when the discriminator makes a correct classification, which alleviates the overconfidence problem and improves training stability. The proposed method is evaluated on the Oxford-102 and CUB datasets. The empirical results demonstrate that it generates images that have richer content diversity, are more visually realistic, and are more semantically consistent with the given text description. It also outperforms the compared methods in terms of Inception Score and Structural Similarity Index.
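
The three loss-related enhancements named in the abstract can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the function names, the smoothing target of 0.9, and the use of batch-mean features for feature matching are assumptions chosen for illustration, and the abstract does not state the exact formulations used in the paper.

```python
import math


def bce(prob, target):
    """Binary cross-entropy for one probability and one (possibly soft) target."""
    eps = 1e-12
    prob = min(max(prob, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(target * math.log(prob) + (1.0 - target) * math.log(1.0 - prob))


def discriminator_loss(real_probs, fake_probs, smooth=0.9):
    """One-sided label smoothing: real targets become `smooth` (assumed 0.9),
    while fake targets stay at 0."""
    real = sum(bce(p, smooth) for p in real_probs) / len(real_probs)
    fake = sum(bce(p, 0.0) for p in fake_probs) / len(fake_probs)
    return real + fake


def feature_matching_loss(real_feats, fake_feats):
    """Squared distance between batch-mean discriminator features
    (an assumed, commonly used form of feature matching)."""
    dim = len(real_feats[0])
    mean_real = [sum(f[i] for f in real_feats) / len(real_feats) for i in range(dim)]
    mean_fake = [sum(f[i] for f in fake_feats) / len(fake_feats) for i in range(dim)]
    return sum((r - f) ** 2 for r, f in zip(mean_real, mean_fake))


def l1_loss(real_pixels, fake_pixels):
    """Mean absolute difference between real and generated pixel values."""
    return sum(abs(r - f) for r, f in zip(real_pixels, fake_pixels)) / len(real_pixels)
```

With a smoothed target, a discriminator that outputs 0.999 for a real image incurs a larger loss than one that outputs 0.9, which is the penalty on overconfident correct classifications that the abstract describes.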

Original language	English
Pages (from-to)	119-126
Number of pages	8
Journal	Pattern Recognition Letters
Volume	157
DOIs	https://doi.org/10.1016/j.patrec.2022.04.010
Publication status	Published - May 2022
Externally published	Yes

Keywords

Generative adversarial network
Self-supervised learning
Text-to-image-synthesis

ASJC Scopus subject areas

Software
Signal Processing
Computer Vision and Pattern Recognition
Artificial Intelligence

Access to Document

10.1016/j.patrec.2022.04.010

Cite this

@article{2622400d0a30424e80c2d2de4e12a60a,

title = "Text-to-image synthesis with self-supervised learning",

abstract = "Text-to-image synthesis extracts the meaning from the text description and converts it into an image correspondingly. Text-to-image synthesis is widely leveraged in many applications, such as graphic design, image editing, etc. Text-to-image synthesis approaches are mainly built on the basis of generative adversarial networks. One of the main challenges in text-to-image synthesis is to generate images that are visually realistic. Not only that, the text-to-image synthesis model is inherently susceptible to overconfidence and training instability issues. To address these challenges, this paper proposes a self-supervised text-to-image synthesis with some enhancements, including self-supervised learning, feature matching, L1 distance loss, and one-sided label smoothing. The self-supervised learning offers more image variations thus improving the classification power of the discriminator. The feature matching and L1 distance functions motivate the generator to synthesize images that are visually more similar to the real images based on the given text description. The one-sided label smoothing adds a penalty value when the discriminator makes a correct classification to alleviate the overconfidence problem and to improve the training stability. The performance of the proposed self-supervised text-to-image synthesis is evaluated on the Oxford-102 and CUB datasets. The empirical results demonstrate that the proposed self-supervised text-to-image synthesis generates images with richer image content diversity, more visually realistic, and more semantically consistent with the given text description. The proposed self-supervised text-to-image synthesis also outshines the methods in comparison in terms of the inception score and Structural Similarity Index.",

keywords = "Generative adversarial network, Self-supervised learning, Text-to-image-synthesis",

author = "Tan, {Yong Xuan} and Lee, {Chin Poo} and Neo, Mai and Lim, {Kian Ming}",

note = "Publisher Copyright: {\textcopyright} 2022 Elsevier B.V.",

year = "2022",

month = may,

doi = "10.1016/j.patrec.2022.04.010",

language = "English",

volume = "157",

pages = "119--126",

journal = "Pattern Recognition Letters",

issn = "0167-8655",

publisher = "Elsevier",

}

TY - JOUR

T1 - Text-to-image synthesis with self-supervised learning

AU - Tan, Yong Xuan

AU - Lee, Chin Poo

AU - Neo, Mai

AU - Lim, Kian Ming

PY - 2022/5

Y1 - 2022/5

AB - Text-to-image synthesis extracts the meaning from the text description and converts it into an image correspondingly. Text-to-image synthesis is widely leveraged in many applications, such as graphic design, image editing, etc. Text-to-image synthesis approaches are mainly built on the basis of generative adversarial networks. One of the main challenges in text-to-image synthesis is to generate images that are visually realistic. Not only that, the text-to-image synthesis model is inherently susceptible to overconfidence and training instability issues. To address these challenges, this paper proposes a self-supervised text-to-image synthesis with some enhancements, including self-supervised learning, feature matching, L1 distance loss, and one-sided label smoothing. The self-supervised learning offers more image variations thus improving the classification power of the discriminator. The feature matching and L1 distance functions motivate the generator to synthesize images that are visually more similar to the real images based on the given text description. The one-sided label smoothing adds a penalty value when the discriminator makes a correct classification to alleviate the overconfidence problem and to improve the training stability. The performance of the proposed self-supervised text-to-image synthesis is evaluated on the Oxford-102 and CUB datasets. The empirical results demonstrate that the proposed self-supervised text-to-image synthesis generates images with richer image content diversity, more visually realistic, and more semantically consistent with the given text description. The proposed self-supervised text-to-image synthesis also outshines the methods in comparison in terms of the inception score and Structural Similarity Index.

KW - Generative adversarial network

KW - Self-supervised learning

KW - Text-to-image-synthesis

UR - http://www.scopus.com/inward/record.url?scp=85128379543&partnerID=8YFLogxK

U2 - 10.1016/j.patrec.2022.04.010

DO - 10.1016/j.patrec.2022.04.010

M3 - Article

AN - SCOPUS:85128379543

SN - 0167-8655

VL - 157

SP - 119

EP - 126

JO - Pattern Recognition Letters

JF - Pattern Recognition Letters

ER -
