We believe you shouldn’t have to “learn how to prompt” for images. It should be easy to try things out, iterate, refine and remix ideas visually. Like you’d do with a friend. So we’re trying something new!
Whisk is Google Labs’ latest generative imagery experiment, focused on fast visual ideation without needing to deeply understand prompting!
Just throw in a couple of images for directional reference (scene, subjects, styles) and Whisk will suggest some images for you to keep refining.
Whisk is powered by Google's Gemini (language model w/ visual understanding) and Imagen 3 (generative image model) working in concert.
Turn a drawing into a plushie? Create an epic holiday card? A beautiful mood board? Or the beginning of a story… We’re excited to see where you take it.
Prepare
Bring in visual elements for Whisk to remix. Drag and drop an image or upload it from a folder. You can also create a simple reference from a prompt… or have us seed a couple of ideas.
Behind the scenes: these assets go through Gemini’s visual understanding to be captioned, and those captions are what Whisk actually works with. Click edit to see if we got it right and refine as needed!
Explore
Time to whisk things up! You can select assets (1 or more subjects, 1 scene, 1 style) and put them to work. The system will bring those together.
See what Whisk comes up with, and keep riffing! You can throw in some light guidance to keep exploring.
“Make the characters eat ice cream.”
“The dinosaur and the cat are high fiving!”
“Make sure the enamel pin is round.”
“Also, adjust the color scheme to follow a pastel palette.”
Behind the scenes: Gemini combines the captions with your guidance to compose the prompt for you. Click edit to see what it’s been whispering to Imagen 3.
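Curious what that composition step could look like in code? Whisk’s internal implementation isn’t public, so the sketch below uses the public Gemini API (the google-generativeai Python package) as a stand-in; the function name, model choice, and prompt wording are all illustrative assumptions, not Whisk’s actual code.

```python
# Hypothetical sketch of the "compose" step: asset captions + user guidance -> one image prompt.
# Uses the public Gemini API as a stand-in for Whisk's internals.
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-1.5-flash")  # illustrative model choice


def compose_image_prompt(subject_captions, scene_caption, style_caption, guidance=""):
    """Merge the stored captions and optional user guidance into a single image prompt."""
    lines = [
        "Write one detailed prompt for a text-to-image model.",
        "Subjects: " + "; ".join(subject_captions),
        f"Scene: {scene_caption}",
        f"Style: {style_caption}",
    ]
    if guidance:
        lines.append(f"Extra guidance from the user: {guidance}")
    lines.append("Keep every subject recognizable and apply the style consistently.")
    return model.generate_content("\n".join(lines)).text


# Example use, echoing the kind of guidance above:
prompt = compose_image_prompt(
    ["a green toy dinosaur", "a fluffy orange cat"],
    "a sunny picnic blanket in a park",
    "soft felt plushie, studio lighting",
    guidance="the dinosaur and the cat are high fiving",
)
```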
Refine
See an image you like, but maybe that hat should be blue? Or there should be a sunset in the background? Enter refine mode and ask for small to medium changes that stay directionally close to the original.
Behind the scenes: Gemini updates the prompt based on your guidance! We still regenerate all the pixels from that prompt, but ask the model to stay close.
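As a rough mental model (again, not Whisk’s actual code), refining can be thought of as asking the language model for a minimally edited prompt and then regenerating the image from it. The helper below is hypothetical and, like the earlier sketch, leans on the public Gemini API as a stand-in.

```python
# Hypothetical sketch of refine mode: lightly edit the prompt, then regenerate everything.
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-1.5-flash")  # illustrative model choice


def refine_prompt(current_prompt, feedback):
    """Ask the language model for a lightly edited prompt that applies the feedback."""
    request = (
        "Here is a prompt for a text-to-image model:\n"
        f"{current_prompt}\n\n"
        f"Revise it to apply this change: {feedback}\n"
        "Change as little as possible and return only the revised prompt."
    )
    return model.generate_content(request).text


# The revised prompt then goes back to the image model, so every pixel is
# regenerated from text rather than the original image being edited in place.
```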
Diagnose
Let’s be honest, things might go in wild directions. Maybe some elements were dropped? Maybe that exact thing you’re looking for just isn’t coming through.
At any stage above, you can diagnose the underlying prompts by clicking the prompt button/icon and edit them to generate new options. Ultimately, you’re in control :-)
Subject
That’s what the image is about! Characters, objects, or a combination of them. An old rotary phone! A cool chair! A cardboard movie display. A mysterious Renaissance vampire. You can also throw yourself in as a directional reference and see what comes out :-)
Scene
Where the subjects will show up. A fashion runway? A pop-up holiday card? You can bring characters into the scene alongside the ones already there, or maybe swap them in? Worth trying out.
Style
Maybe you want to add more guidance on the aesthetic, material, or technique used to represent the above. Style is for that. Feel free to specify what you care about most in the main prompt box.
You can refer to them in natural language as you combine them (e.g. “our characters in the location, eating”).
We’ve included several ways for you to get a sense of how this works, natively in the tool.
“Playground” landing page: a simplified experience for you to feel the magic in one action. Drop in an image from the…
Inspire me flow: quickly run through an end-to-end flow, from assets to optional guidance to outputs.
Dice for a few example assets: quickly add subject, scene, style suggestions to get going… or keep riffing!
In order to whisk elements from different images together, we first need to develop an understanding of each image you reference. This is where Gemini’s multimodal understanding comes in! When you upload an image, Whisk uses Gemini to visually understand it and generate a text description (or caption) of it. In other words, it translates the image to text (I2T). These descriptions are meant to capture the essence of your references rather than replicate the originals, which makes it easier to remix ideas.
These captions, together with your guidance, are then used to write a detailed prompt for our latest and most powerful image generation model, Imagen 3. In other words, translating text back to image (T2I).
This process helps Whisk better understand and represent the ideas you’re forming, and iterate on them as you converse with it.
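To make the I2T half of that concrete, here’s a minimal sketch of what captioning an uploaded reference could look like. Whisk’s actual setup isn’t public; this uses the public Gemini API and Pillow as stand-ins, and the model name, function name, and caption instructions are illustrative assumptions.

```python
# Hypothetical sketch of the I2T step: turn an uploaded reference image into a caption.
# Uses the public Gemini API (google-generativeai) and Pillow; not Whisk's actual code.
import google.generativeai as genai
from PIL import Image

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-1.5-flash")  # illustrative model choice


def caption_reference(image_path, role):
    """Describe a reference image so it can be remixed; role is 'subject', 'scene', or 'style'."""
    image = Image.open(image_path)
    instruction = (
        f"Describe this image as a {role} reference for image generation. "
        "Capture its essence (key characters, objects, mood, colors) in a few sentences, "
        "without trying to reproduce it exactly."
    )
    response = model.generate_content([instruction, image])
    return response.text


# The resulting caption is what feeds the prompt-composition step sketched earlier,
# before the final prompt is handed to a text-to-image model such as Imagen 3 (T2I).
```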
Edit the prompt and take a look; try changing or adding words. Note that, as with any generative model, the system might not always interpret or respond to your guidance correctly.
Whisk is currently only available in the US, using text inputs in English. We’re working on expanding to more countries soon!
Yes, just click the download icon to save and share. We’d also love to see what you create, so please share it with us through our Discord channel too!
For information about your user data, user history, our generative policies, how to send feedback, and more, please check out the labs.google/fx FAQ.