How humans can acquire images efficiently and effectively is a perennial question. A typical solution is text-to-image retrieval from an existing database given a text query; however, a fixed database typically falls short on creativity. By contrast, recent breakthroughs in text-to-image generation have made it possible to produce diverse and imaginative visual content, but generation still struggles to synthesize knowledge-intensive images. In this work, we rethink the relationship between text-to-image generation and retrieval and propose a unified framework in the context of Multimodal Large Language Models (MLLMs). Specifically, we first explore the intrinsic discriminative abilities of MLLMs and introduce a generative retrieval method to perform retrieval in a training-free manner. Subsequently, we unify generation and retrieval within a single autoregressive generation process and propose an autonomous decision module that selects, between the generated and retrieved images, the one best matching the text query as the response. Additionally, we construct a benchmark, TIGeR-Bench, covering creative and knowledge-intensive domains, to standardize the evaluation of unified text-to-image generation and retrieval. Extensive experimental results on TIGeR-Bench and two retrieval benchmarks, i.e., Flickr30K and MS-COCO, demonstrate the superiority and effectiveness of our proposed method.
We consider both creative and knowledge-intensive domains for general image acquisition toward unified Text-to-Image Generation and Retrieval (TIGeR), and build TIGeR-Bench, which covers 8 domains, for unified and comprehensive evaluation.
Overview of the framework to unify text-to-image generation and retrieval. Images from the database are first tokenized into discrete codes, and a lookup table is maintained to map discrete codes back to images. The given prompt is fed into an MLLM, and Forward Beam Search is performed to retrieve and generate images in parallel. The prompt and the obtained images are then fed into the same MLLM for Reverse Re-Ranking and Decision-Making.
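To make this pipeline concrete, below is a minimal, self-contained sketch of how forward beam search over discrete visual tokens can produce retrieval and generation candidates that are then re-ranked. It is not the paper's implementation: the MLLM is replaced by a toy next-token scorer, the database by a hand-built lookup table, and all identifiers (toy_token_logprobs, forward_beam_search, reverse_rerank, the lookup entries) are illustrative assumptions.

```python
# Minimal sketch of forward beam search + reverse re-ranking over discrete
# visual tokens. All scoring functions are toy stand-ins for the MLLM.
import math

VOCAB = list(range(8))   # toy discrete visual-token vocabulary
SEQ_LEN = 3              # toy code-sequence length per image

# Lookup table: discrete code sequence -> image id (database images pre-tokenized).
LOOKUP = {
    (1, 4, 2): "db_img_017",
    (1, 4, 5): "db_img_342",
    (3, 0, 6): "db_img_901",
}

def toy_token_logprobs(prompt, prefix):
    """Stand-in for the MLLM's next visual-token distribution (prompt ignored in this toy)."""
    scores = {tok: -math.log(len(VOCAB)) for tok in VOCAB}
    # Nudge tokens that continue some database code so retrieval stays reachable.
    for code in LOOKUP:
        if code[:len(prefix)] == tuple(prefix):
            scores[code[len(prefix)]] += 1.0
    return scores

def forward_beam_search(prompt, beam_size=4):
    """Decode visual tokens left-to-right, collecting generated and retrieved candidates."""
    beams = [((), 0.0)]
    for _ in range(SEQ_LEN):
        expanded = []
        for prefix, lp in beams:
            for tok, tok_lp in toy_token_logprobs(prompt, prefix).items():
                expanded.append((prefix + (tok,), lp + tok_lp))
        beams = sorted(expanded, key=lambda b: b[1], reverse=True)[:beam_size]
    candidates = []
    for code, lp in beams:
        if code in LOOKUP:   # the code hits a database entry -> retrieval candidate
            candidates.append(("retrieved", LOOKUP[code], code, lp))
        else:                # otherwise the code would be decoded into a new image
            candidates.append(("generated", f"gen_{'-'.join(map(str, code))}", code, lp))
    return candidates

def reverse_rerank(prompt, candidates):
    """Stand-in for reverse re-ranking: score the prompt conditioned on each candidate's tokens."""
    def toy_prompt_logprob(code):
        return -sum(abs(t - len(prompt) % len(VOCAB)) for t in code)  # toy compatibility score
    return max(candidates, key=lambda c: toy_prompt_logprob(c[2]))

if __name__ == "__main__":
    prompt = "Give me an image of a red panda"
    cands = forward_beam_search(prompt)
    source, image, code, _ = reverse_rerank(prompt, cands)
    print(f"Selected {source} image {image} with code {code}")
```

In this sketch, retrieval and generation share one decoding pass: a beam whose code sequence appears in the lookup table is treated as a retrieved database image, while any other beam is treated as a newly generated one, and the reverse pass then picks the final response.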
"Token" refers to visual tokenization during image synthesis, including continuous (Cont.) and discrete (Dist.) approaches. Entries by gray are expert models for T2I retrieval or generation, and those with a blue background denote that an image query is first generated and then used to perform image-to-image retrieval. Entries with a gray background denote our methods.
Text-to-image retrieval performance comparison on Flickr30K and MS-COCO. Entries in gray denote dense retrieval methods; the others are generative retrieval methods.
The prefix prompt "Give me an image of" is omitted here. Green and red bounding boxes highlight correct and wrong retrieval results, respectively.
Example of multi-turn chat based on SEED-LLaMA with unified generation and retrieval.
Ticks and crosses mark the results selected from generation and retrieval. Green ticks indicate correct generated images, and red crosses indicate wrong retrieved images.
Ticks and crosses mark the results selected from generation and retrieval. Green ticks indicate correct retrieved images, and red crosses indicate wrong generated images.