Guiding Instruction-based Image Editing
via Multimodal Large Language Models


Tsu-Jui Fu¹   Wenze Hu²   Xianzhi Du²
William Yang Wang¹   Yinfei Yang²   Zhe Gan²
¹UC Santa Barbara   ²Apple
International Conference on Learning Representations (ICLR) 2024 (Spotlight)


Abstract

Instruction-based image editing improves the controllability and flexibility of image manipulation via natural commands without elaborate descriptions or regional masks. However, human instructions are sometimes too brief for current methods to capture and follow. Multimodal large language models (MLLMs) show promising capabilities in cross-modal understanding and visual-aware response generation via LMs. We investigate how MLLMs facilitate edit instructions and present MLLM-Guided Image Editing (MGIE). MGIE learns to derive expressive instructions and provides explicit guidance. The editing model jointly captures this visual imagination and performs manipulation through end-to-end training. We evaluate various aspects of Photoshop-style modification, global photo optimization, and local editing. Extensive experimental results demonstrate that expressive instructions are crucial to instruction-based image editing, and our MGIE can lead to a notable improvement in automatic metrics and human evaluation while maintaining competitive inference efficiency.
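
The data flow described above (a brief instruction is expanded into an expressive instruction, which then guides the edit) can be sketched as follows. This is only an illustrative sketch under stated assumptions, not the MGIE implementation: `EditRequest`, `guided_edit`, `stub_mllm`, and `stub_editor` are hypothetical names, the two models are replaced by string-returning stubs, and the end-to-end training mentioned in the abstract is not modeled.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class EditRequest:
    """Hypothetical container for one edit: an image reference and a brief instruction."""
    image: str        # placeholder for image data, e.g. a file path
    instruction: str  # brief human instruction, e.g. "make the frame red"


# Hypothetical interfaces standing in for the two learned components.
DeriveFn = Callable[[str, str], str]  # (image, instruction) -> expressive instruction
EditFn = Callable[[str, str], str]    # (image, guidance) -> edited image


def guided_edit(req: EditRequest, derive: DeriveFn, edit: EditFn) -> Tuple[str, str]:
    """Expand the brief instruction first, then edit using the expanded guidance."""
    expressive = derive(req.image, req.instruction)
    edited = edit(req.image, expressive)
    return expressive, edited


def stub_mllm(image: str, instruction: str) -> str:
    """Stub for the multimodal LLM: returns a string that mentions both inputs."""
    return f"[expressive] {instruction} (grounded in {image})"


def stub_editor(image: str, guidance: str) -> str:
    """Stub for the editing model: returns a description instead of pixels."""
    return f"edited({image}; guidance={guidance!r})"


if __name__ == "__main__":
    request = EditRequest(image="photo.jpg", instruction="turn the day into night")
    expressive, edited = guided_edit(request, stub_mllm, stub_editor)
    print(expressive)
    print(edited)
```

Passing the two components as callables keeps the sketch independent of any particular model; a real system would substitute learned models for the stubs.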



Qualitative Comparison



Input | Instruction | InsPix2Pix | LGIE | MGIE | GroundTruth
turn the day into night
make the forest path into a beach
make the frame red
as if the shop was a library
make it the vatican
turn the sunset into a firestorm
Input | Instruction | InsPix2Pix | LGIE | MGIE | GroundTruth
remove text
show him on a frozen lake with snowy mountains
increase the brightness of the entire image
take the people out of the back in the photo
add tiger
change the background to blue
Input | Instruction | InsPix2Pix | LGIE | MGIE | GroundTruth
edit out skiers on right
make it look more professional
remove hot air balloons
make colors pop out
remove boy with red shirt from picture
lighten out yellow tone
Input | Instruction | InsPix2Pix | LGIE | MGIE | GroundTruth
add brightness so the clouds look bright white
make the color more green
add more contrast to simulate more light
remove the blue hue out of the picture
brighten image a lot, sharpen photo
need to clarified, more focus
Input | Instruction | InsPix2Pix | LGIE | MGIE | GroundTruth
have there be a birthday cake on the table
put buildings in the background of the image
make the face happy
let there be palm trees
has a green web page
replace food with soup


Expressive Instruction


Input | Instruction | Expressive Instruction by LGIE | Expressive Instruction by MGIE

Instruction: turn the day into night
LGIE: Turn into night, such as a photograph of a starry night sky, a moon, stars, or a planet. Hold your cursor over the night sky icon.
MGIE: If the day were to be turned into night in this image, the Eiffel Tower would be illuminated by artificial lights, creating a contrast against the dark sky.

Instruction: make the forest path into a beach
LGIE: The beach would be made from the dirt and rocks that were on the forest path, and the trees would be transplanted to make way for the new beach.
MGIE: The image would depict a person running along a sandy shore. The trees surrounding the path would be replaced with ocean waves.

Instruction: make the frame red
LGIE: You can edit the image to make the frame red. You could also crop the frame to remove the white borders and make it more prominent.
MGIE: If the glasses had a red frame instead of a black frame, the red frame would draw attention to the glasses with the gray background.

Instruction: as if the shop was a library
LGIE: For the idea of making the shop look like a library, you can organize the books and flowers in a way that creates a calm and orderly environment.
MGIE: This image shows a woman standing at a counter with an abundance of flowers. The books on the shelves behind her would replace the vases of flowers.

Instruction: make it the vatican
LGIE: The image would be draped in gold and crimson fabrics. The colors would be deep and rich, evoking a sense of reverence and awe.
MGIE: If the image were set at the Vatican, it would show a serene scene with the city's iconic architecture, such as St. Peter's Basilica or the Vatican City walls.

Instruction: turn the sunset into a firestorm
LGIE: The entire scene would be enveloped in a hazy, orange-reddish glow, and the horizon would be defined by a towering wall of flame.
MGIE: The already vibrant colors from the sunset would be intensified, with the orange hues becoming more fiery and possibly spreading across the sky.

Input | Instruction | Expressive Instruction by LGIE | Expressive Instruction by MGIE

Instruction: remove text
LGIE: The sentence would become blank, leaving no visual representation of the previously described image.
MGIE: The text could be removed completely, leaving only the dog and the pier.

Instruction: show him on a frozen lake with snowy mountains
LGIE: The image would likely be quite picturesque and beautiful with a touch of brown from the frozen lake and the snow.
MGIE: The man enjoys the thrill of gliding across the icy surface. The atmosphere would be accentuated by the snow-covered landscape and the distant mountains.

Instruction: increase the brightness of the entire image
LGIE: Increasing the brightness of an image. The result is a clearer and more vibrant image.
MGIE: Adjust the brightness to bring out the details. A setting with a group of women sitting around a round table would appear more vivid.

Instruction: take the people out of the back in the photo
LGIE: The photo would be less crowded and have more space without the people in the back, creating a more tranquil atmosphere.
MGIE: The scene would appear more empty and there would be a noticeable presence of the bride and groom.

Instruction: add tiger
LGIE: This image features a cartoon illustration of a beloved character of Tiger. His bright orange fur stands out against the green grass below.
MGIE: An image of a man holding a football in a field will be enhanced with the addition of tiger. It will engage in a playful interaction or enjoy the field.

Instruction: change the background to blue
LGIE: If you change the background color of an image to blue, the other colors in the image will be adjusted to match the new background color.
MGIE: A woman wearing a pink dress and holding yellow sunflowers would stand in front of a blue background.

Input | Instruction | Expressive Instruction by LGIE | Expressive Instruction by MGIE

Instruction: edit out skiers on right
LGIE: This image originally featured skiers on the right side, but they have been removed, leaving the image empty.
MGIE: The image would focus on the individual skier with the backpack instead of the group of skiers in the foreground.

Instruction: make it look more professional
LGIE: Using high-quality, professional-looking stock images or custom artwork. Make it more visually appealing, or remove any unwanted elements.
MGIE: This image could be made more professional by featuring a businessman and a composition that highlights the tall building in the background.

Instruction: remove hot air balloons
LGIE: The sky is now empty. These colorful vessels have now disappeared, leaving no trace of their existence.
MGIE: The mountain range would still be the focus of the photograph, but the balloon would no longer be a focal point.

Instruction: make colors pop out
LGIE: Adjust the colors in the image pop to make them look more vivid and striking. This will create a more eye-catching contrast.
MGIE: It would enhance the vibrancy and saturation of the man's tan trench coat and light blue sneakers, as well as the wooden bench and dirt field.

Instruction: remove boy with red shirt from picture
LGIE: The image would be filled in the area where the boy with the red shirt is currently visible.
MGIE: Without the boy, the scene would feature a man alone in a forested area, sitting on a log and enjoying a drink.

Instruction: lighten out yellow tone
LGIE: Reduce the amount of yellow tone in the image by adjusting the white balance to make the tone appear less yellow.
MGIE: The image of the man and woman in formal attire will appear more vibrant and lively, and the background will become less prominent.

Input | Instruction | Expressive Instruction by LGIE | Expressive Instruction by MGIE

Instruction: add brightness so the clouds look bright white
LGIE: Make the clouds in your image look brighter and more white. Or remove them and then add back in as white.
MGIE: Adjust the brightness of the image to bring out the details of the sky, the cloud, and the building in the picture.

Instruction: make the color more green
LGIE: The image would become greener if the color was made more green. This would mean more green pixels, resulting in a greener hue overall.
MGIE: Enhance the greenery in the image by making the grass greener. The field would have a richer hue and would stand out more against the backdrop.

Instruction: add more contrast to simulate more light
LGIE: Adding more contrast to an image will make it appear more vibrant, sharp, and detailed, just like what you see in real life.
MGIE: It would make the snow brighter, the skier's jacket stand out more, and other details more visible and dynamic.

Instruction: remove the blue hue out of the picture
LGIE: Colors and details will be easier to distinguish, allowing the image to truly stand out.
MGIE: The image would make the orange truck stand out. The sky in the background would also show more vivid colors.

Instruction: brighten image a lot, sharpen photo
LGIE: The image would be transformed into a painting-like look, with bright colors and sharp details. It would be highly saturated.
MGIE: Brighten the damaged window sill to bring out the details of the chipping paint and broken glass.

Instruction: need to clarified, more focus
LGIE: It encourages the viewer to take a closer look and focus on the key elements, bringing the image to better clarity and sharpness.
MGIE: This will help capture the intricate details of the yellow and orange flowers and maximize the visual impact of the bouquet.

Input | Instruction | Expressive Instruction by LGIE | Expressive Instruction by MGIE

Instruction: have there be a birthday cake on the table
LGIE: A birthday cake sits on a table, decorated with candles and frosting.
MGIE: This image features a kitchen with a dining table and chairs, and a birthday cake has been added to the table.

Instruction: put buildings in the background of the image
LGIE: The image would be a peaceful cityscape. Silhouetted buildings in the background would be a stunning contrast.
MGIE: The image would feature a train traveling down the tracks with city buildings in the background, creating a dynamic composition.

Instruction: make the face happy
LGIE: Making the face emoji happier would involve adding more emotion and energy, like smiling lines and cheerful expressions.
MGIE: Change the emoticon displayed on a cell phone screen to a cheerful one, next to a laptop on a desk.

Instruction: let there be palm trees
LGIE: The image would be transformed from its current state to a landscape with palm trees, bringing a sense of warmth and vibrancy.
MGIE: Palm trees would be a natural addition to a beach scene with a large clock in the foreground.

Instruction: has a green web page
LGIE: This web page will show a green background with white text. The vibrant color combination is sure to make any content stand out.
MGIE: The green web page is related to a project or task that the user is currently working on. The other scheme still remains.

Instruction: replace food with soup
LGIE: The bowl was empty, so I filled it with a bowl of steaming soup. The aroma filled the room as I stirred the soup, ready to enjoy.
MGIE: The image would change from sandwiches to bowls of soup. The tray would still have a plate but with a different meal option of soup.