Rice University researchers have unveiled ElasticDiffusion, a new method designed to improve how generative AI models create images. The approach targets limitations of current diffusion models, including inconsistent image content and poor handling of aspect ratios other than square.
Moayed Haji Ali, a doctoral student in computer science at Rice University, introduced ElasticDiffusion at the Institute of Electrical and Electronics Engineers (IEEE) 2024 Conference on Computer Vision and Pattern Recognition (CVPR) in Seattle. The new technique promises to resolve common problems associated with diffusion models, which have struggled with generating non-square images and maintaining visual coherence.
Popular diffusion models such as Stable Diffusion, Midjourney, and DALL-E excel at generating lifelike, photorealistic images but struggle when the requested image is not square. Asked for a 16:9 image, for example, they often produce repeated elements and deformities such as extra fingers or distorted objects.
The core issue stems from how diffusion models are trained and generate images. A model trained on images of a single resolution effectively overfits to that size: it performs well at the resolution it saw during training but produces artifacts when asked for other resolutions or aspect ratios.
ElasticDiffusion addresses this by separating the local and global signals used in image generation. Conventional diffusion models pack local detail (pixel-level information such as texture) and global structure (the overall outline and aspect ratio) into a single signal, which leads to inconsistencies when the output size changes. ElasticDiffusion instead keeps these signals apart by using separate conditional and unconditional generation paths.
The global signal is obtained by taking the difference between the conditional and unconditional predictions, yielding a score that captures the image's overall content and aspect ratio. Local detail is then filled in patch by patch, in quadrants, so the model never has to stretch beyond its training resolution and is far less likely to repeat content or introduce visual imperfections. This lets ElasticDiffusion generate cleaner images across a range of aspect ratios without any additional training.
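To make the idea concrete, the sketch below illustrates the general shape of such a decomposition in Python. It is a simplified, hypothetical illustration, not the authors' implementation: `predict_noise` is a placeholder for a pretrained diffusion model's noise predictor (queried with and without the text prompt, as in classifier-free guidance), the patch size and guidance scale are arbitrary, and a real system would compute the global signal at the model's training resolution and resample it rather than over the full canvas.

```python
import numpy as np

# Hypothetical stand-in for one pretrained diffusion model's noise predictor,
# queried with and without the text prompt (classifier-free-guidance style).
def predict_noise(x, prompt=None):
    """Placeholder noise estimate; a real model would be a trained network."""
    rng = np.random.default_rng(0 if prompt is None else hash(prompt) % 2**32)
    return rng.standard_normal(x.shape) * 0.01


def elastic_style_step(x, prompt, guidance=7.5, patch=64):
    """One illustrative denoising update with global and local signals kept separate.

    Global signal: the difference between the conditional and unconditional
    predictions, which steers overall layout and content toward the prompt.
    Local signal: the unconditional prediction, evaluated one training-sized
    patch (quadrant) at a time, so fine detail is always produced at a size
    the model has seen, which is what suppresses repeated or stretched content.
    """
    h, w = x.shape[:2]

    # Global direction for layout and aspect ratio (simplified: whole canvas).
    global_dir = predict_noise(x, prompt) - predict_noise(x)

    # Local detail, filled in patch by patch.
    local = np.zeros_like(x)
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            tile = x[top:top + patch, left:left + patch]
            local[top:top + tile.shape[0], left:left + tile.shape[1]] = predict_noise(tile)

    # Combine: local detail plus a guided push from the global signal.
    noise_estimate = local + guidance * global_dir
    return x - 0.1 * noise_estimate  # toy update; real samplers differ


# Toy usage: denoise a 16:9 canvas without ever asking the model for
# full-resolution local detail at a size it was never trained on.
canvas = np.random.standard_normal((288, 512, 3))
for _ in range(5):
    canvas = elastic_style_step(canvas, "a mountain lake at sunrise")
```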
"ElasticDiffusion is a successful attempt to leverage intermediate representations of the model to achieve global consistency," said Vicente Ordóñez-Román, an associate professor of computer science at Rice University.
Despite these advantages, ElasticDiffusion's main drawback is speed: it currently takes six to nine times longer to generate an image than models such as Stable Diffusion or DALL-E. Haji Ali and his team aim to bring inference time down to match those models while preserving the improved image quality and adaptability.
Haji Ali hopes that further research will refine this approach and address the underlying issues causing repetitive elements in diffusion models, ultimately leading to a framework capable of adapting to any aspect ratio with the same efficiency as current models.