Discover ProFusion: a regularization-free framework for detail preservation in text-to-image synthesis

Text-to-image generation has been studied extensively, and recent progress has been significant. By training large-scale models on large datasets, researchers have enabled zero-shot text-to-image generation from arbitrary text input. Pioneering work such as DALL-E and CogView has been followed by many further methods, which can generate high-resolution images that align closely with their textual descriptions. These large-scale models have not only transformed text-to-image generation but have also had a profound impact on other applications, including image manipulation and video generation.

While these large-scale text-to-image models excel at producing text-aligned and creative output, they often struggle to generate new and unique concepts specified by users. Consequently, researchers have explored various methods for customizing pre-trained text-to-image generation models.

For example, some approaches fine-tune pre-trained generative models on a small number of samples. To prevent overfitting, several regularization techniques are employed. Other methods encode the new user-supplied concept into a word embedding, obtained either through an optimization process or from an encoder network. These approaches enable customized generation of new concepts while satisfying additional requirements specified in the user's input text.


Despite significant advances in text-to-image generation, recent research has raised concerns about the limitations that regularization methods may impose on customization. It is suspected that these regularization techniques may inadvertently restrict customized generation, resulting in the loss of fine-grained detail.
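To make the concern concrete, here is a minimal toy sketch in plain Python. It is not taken from the ProFusion paper or from any real text-to-image model: the "model" is a fixed 2x2 linear map, and the names fit_embedding, reg_weight, WEIGHTS, TARGET, and INIT are all hypothetical. It only illustrates the general idea of optimizing an embedding against a frozen model, and how a regularization term that ties the embedding to its starting point can leave part of the target unmatched.

```python
def predict(weights, embedding):
    """Frozen stand-in 'model': a fixed linear map applied to the embedding."""
    return [sum(w * e for w, e in zip(row, embedding)) for row in weights]


def fit_embedding(weights, target, init, reg_weight=0.0, lr=0.1, steps=500):
    """Gradient descent on ||W e - target||^2 + reg_weight * ||e - init||^2.

    Only the embedding e is updated; the weights W stay frozen.
    """
    embedding = list(init)
    for _ in range(steps):
        residual = [p - t for p, t in zip(predict(weights, embedding), target)]
        grad = [
            2.0 * sum(weights[i][j] * residual[i] for i in range(len(residual)))
            + 2.0 * reg_weight * (embedding[j] - init[j])
            for j in range(len(embedding))
        ]
        embedding = [e - lr * g for e, g in zip(embedding, grad)]
    return embedding


def reconstruction_error(weights, embedding, target):
    """Squared error between the model output and the target."""
    return sum((p - t) ** 2 for p, t in zip(predict(weights, embedding), target))


WEIGHTS = [[1.0, 0.5], [0.0, 1.0]]  # hypothetical frozen model, never updated
TARGET = [1.0, -2.0]                # stands in for the user's new concept
INIT = [0.0, 0.0]                   # stands in for a generic starting embedding

free = fit_embedding(WEIGHTS, TARGET, INIT, reg_weight=0.0)
regularized = fit_embedding(WEIGHTS, TARGET, INIT, reg_weight=1.0)

print("error without regularization:", reconstruction_error(WEIGHTS, free, TARGET))
print("error with regularization:   ", reconstruction_error(WEIGHTS, regularized, TARGET))
```

In this toy setting the unregularized fit matches the target almost exactly, while the regularized fit is held closer to its starting point and keeps a visibly larger error. That is only a loose analogue of the detail loss discussed above, not evidence about any real model.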

To overcome this challenge, a new framework called ProFusion has been proposed. Its architecture is presented below.

ProFusion consists of a pre-trained encoder called PromptNet, which infers a conditioning word embedding from an input image and random noise, and a new sampling method called Fusion Sampling. Unlike previous methods, ProFusion does not require regularization during training. Instead, the problem is addressed at inference time by Fusion Sampling.

The authors argue that while regularization enables faithful generation of text-conditioned content, it also causes detailed information to be lost, resulting in lower performance.

Fusion Sampling consists of two stages at each timestep. The first is a fusion stage, which encodes information from both the input image embedding and the conditioning text into a noisy partial result. A refinement stage follows, which updates the prediction according to chosen hyperparameters. This update helps Fusion Sampling preserve fine-grained information from the input image while conditioning the output on the input prompt.
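The two-stage structure described above can be pictured with a small toy loop. This is a sketch under heavy assumptions, not the authors' algorithm: the update rules are invented for illustration, the "denoisers" are simple functions that pull a vector toward a fixed target, and every name here (pull_toward, two_stage_step, fusion_weight, refine_rate, noise_scale) is hypothetical.

```python
import random


def pull_toward(target, strength=0.5):
    """Hypothetical stand-in for a conditional predictor: nudges x toward target."""
    def predict(x):
        return [v + strength * (t - v) for v, t in zip(x, target)]
    return predict


def two_stage_step(x, predict_from_embedding, predict_from_text,
                   fusion_weight=0.5, refine_rate=0.5, noise_scale=0.0, rng=None):
    """One toy timestep: a fusion stage followed by a refinement stage."""
    rng = rng or random.Random(0)
    # Stage 1 (fusion): blend the image-embedding and text predictions,
    # optionally adding noise so the result stays a noisy partial result.
    from_embedding = predict_from_embedding(x)
    from_text = predict_from_text(x)
    fused = [
        fusion_weight * a + (1.0 - fusion_weight) * b + noise_scale * rng.gauss(0.0, 1.0)
        for a, b in zip(from_embedding, from_text)
    ]
    # Stage 2 (refinement): re-predict from the fused result and move part of
    # the way toward that prediction (assumed rule, set by refine_rate).
    refined = predict_from_embedding(fused)
    return [f + refine_rate * (r - f) for f, r in zip(fused, refined)]


def sample(x, predict_from_embedding, predict_from_text, steps=20, seed=0, **kwargs):
    """Run the toy step repeatedly with one shared random generator."""
    rng = random.Random(seed)
    for _ in range(steps):
        x = two_stage_step(x, predict_from_embedding, predict_from_text,
                           rng=rng, **kwargs)
    return x


image_details = pull_toward([1.0, 1.0])   # stands in for the image embedding
text_prompt = pull_toward([-1.0, 1.0])    # stands in for the text condition

print(two_stage_step([0.0, 0.0], image_details, text_prompt))
print(sample([0.0, 0.0], image_details, text_prompt, steps=20))
```

The point of the sketch is only the control flow: both conditions contribute in the first stage, and the second stage applies a further update governed by hyperparameters. The real method operates on diffusion model predictions and is defined in the paper.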

This approach not only saves training time, but also eliminates the need to tune hyperparameters related to regularization methods.

The results below speak for themselves.

We can see a comparison between ProFusion and state-of-the-art approaches. The proposed approach outperforms all the other techniques shown, preserving fine-grained details, mainly those related to facial features.

That was the summary of ProFusion, a new regularization-free framework for text-to-image generation with state-of-the-art quality. If you are interested, you can learn more about this technique in the links below.

Check out the Paper and GitHub.


Daniele Lorenzi received his M.Sc. in ICT for Internet and Multimedia Engineering in 2021 from the University of Padua, Italy. He is a Ph.D. candidate at the Institute of Information Technology (ITEC) at the Alpen-Adria-Universität (AAU) Klagenfurt. He currently works at the Christian Doppler Laboratory ATHENA, and his research interests include adaptive video streaming, immersive media, machine learning, and QoS/QoE assessment.
