TIGeR: Unified Text-to-Image Generation and Retrieval

1. NExT++ Lab, National University of Singapore
2. Nanyang Technological University
3. Hong Kong Polytechnic University
4. Harbin Institute of Technology (Shenzhen)
*Correspondence

Abstract

How humans can efficiently and effectively acquire images has long been a perennial question. A typical solution is text-to-image retrieval from an existing database given a text query; however, such a fixed database typically lacks creativity. By contrast, recent breakthroughs in text-to-image generation make it possible to produce novel and diverse visual content, yet generation still struggles to synthesize knowledge-intensive images. In this work, we rethink the relationship between text-to-image generation and retrieval and propose a unified framework in the context of Multimodal Large Language Models (MLLMs). Specifically, we first explore the intrinsic discriminative abilities of MLLMs and introduce a generative retrieval method that performs retrieval in a training-free manner. We then unify generation and retrieval within a single autoregressive generation process and propose an autonomous decision module that selects the better-matched image, generated or retrieved, as the response to the text query. In addition, we construct TIGeR-Bench, a benchmark spanning creative and knowledge-intensive domains, to standardize the evaluation of unified text-to-image generation and retrieval. Extensive experiments on TIGeR-Bench and two retrieval benchmarks, i.e., Flickr30K and MS-COCO, demonstrate the superiority and effectiveness of our proposed method.

TIGeR

For general image acquisition, we consider both creative and knowledge-intensive domains towards unified Text-to-Image Generation and Retrieval (TIGeR), and build TIGeR-Bench, which covers 8 domains, for unified and comprehensive evaluation.


Framework

Overview of the framework that unifies text-to-image generation and retrieval. Images in the database are first tokenized into discrete codes, and a lookup table maintains the correspondence between code sequences and images. The text prompt is fed into an MLLM, and Forward Beam Search retrieves and generates images in parallel. The prompt and the obtained images are then fed into the same MLLM for Reverse Re-Ranking and final decision-making.
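
To make the pipeline concrete, below is a minimal, illustrative sketch of the generate-or-retrieve control flow described above. All names here (CodeLookupTable, beam_search_codes, score_prompt_given_codes) are hypothetical placeholders standing in for the MLLM's visual tokenizer, decoding, and scoring routines; this is not the released TIGeR implementation.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Sequence, Tuple


@dataclass
class CodeLookupTable:
    """Maps the discrete visual-code sequence of each database image to its image ID."""
    table: Dict[Tuple[int, ...], str]

    def lookup(self, codes: Sequence[int]) -> Optional[str]:
        return self.table.get(tuple(codes))


def generate_or_retrieve(
    prompt: str,
    lookup: CodeLookupTable,
    beam_search_codes: Callable[[str, int], List[List[int]]],
    score_prompt_given_codes: Callable[[str, Sequence[int]], float],
    num_beams: int = 4,
) -> Tuple[str, List[int]]:
    """Forward Beam Search proposes candidate visual-code sequences; sequences found
    in the lookup table correspond to retrieved database images, while the rest are
    synthesized. Reverse Re-Ranking scores the prompt conditioned on each candidate's
    codes, and the best-matched candidate is returned as the final response."""
    candidates = beam_search_codes(prompt, num_beams)   # forward pass over visual tokens
    ranked = []
    for codes in candidates:
        image_id = lookup.lookup(codes)                  # table hit => retrieval, miss => generation
        source = image_id if image_id is not None else "<generated>"
        ranked.append((score_prompt_given_codes(prompt, codes), source, codes))
    _, source, codes = max(ranked, key=lambda item: item[0])  # decision-making
    return source, list(codes)
```

In the actual framework, both the forward beam search and the reverse scoring are performed by the same MLLM; they are abstracted as plain callables here only for readability.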


Experiments

Performance Comparison on TIGeR-Bench

"Token" refers to visual tokenization during image synthesis, including continuous (Cont.) and discrete (Dist.) approaches. Entries by gray are expert models for T2I retrieval or generation, and those with a blue background denote that an image query is first generated and then used to perform image-to-image retrieval. Entries with a gray background denote our methods.


Text-to-Image Retrieval Performance Comparison on Flickr30K and MS-COCO

Text-to-image retrieval performance comparison on Flickr30K and MS-COCO. Entries in gray denote dense retrieval methods; the others are generative retrieval methods.


Qualitative Results on TIGeR-Bench

The prefix prompt "Give me an image of" is omitted here. Green and red bounding boxes highlight correct and incorrect retrieval results, respectively.


Example of Multi-turn Generation and Retrieval

Example of multi-turn chat based on SEED-LLaMA with unified generation and retrieval.


Qualitative Results on Creative Domains of TIGeR-Bench

Ticks and crosses highlight which results, generated or retrieved, are selected. Green ticks indicate correct generated images, and red crosses indicate incorrect retrieved images.


Qualitative Results on Knowledge-intensive Domains of TIGeR-Bench

Ticks and crosses highlight which results, generated or retrieved, are selected. Green ticks indicate correct retrieved images, and red crosses indicate incorrect generated images.
