What is IP-Adapter?
In recent years, generative AI has seen remarkable advancements, particularly in the domain of large text-to-image diffusion models. These models have astounded us with their ability to craft incredibly lifelike images. However, the process of coaxing these models to generate precisely what we desire from mere text prompts can be a daunting task, often requiring intricate and meticulous prompt engineering.
Fortunately, there’s an innovative solution that offers a fresh perspective – the use of image prompts. Instead of relying solely on text, this approach harnesses the power of visual cues.
Traditionally, one common strategy has been to fine-tune pretrained models on specific styles or objects to enhance their generative abilities. Yet, this method often demands substantial computing resources, placing it out of reach for many users.

This is where IP-Adapter steps into the spotlight. It emerges as a game-changing solution, an efficient and lightweight adapter that empowers pretrained text-to-image diffusion models with the remarkable capability to understand and respond to image prompts. In essence, it bridges the gap between textual instructions and vivid visual output, making the creative potential of these models more accessible to a broader audience.
Can we use IP-Adapter with Diffusers for image generation?
Yes, we can easily use IP-Adapter with the Diffusers library to generate images. If you want to learn more about Diffusers, I recommend checking out the following articles:
- Merge a Logo with an Image using ControlNet
- A Guide to Stable Diffusion Inpainting for Seamless Photo Enhancements
We will use IP-Adapter and Diffusers in Google Colab to generate images of a person in a specified pose. Make sure you have GPU enabled in your Colab notebook.
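If you are unsure whether a GPU is active, a quick check like the one below can confirm it before moving on.
# optional: confirm that a CUDA GPU runtime is active
import torch
assert torch.cuda.is_available(), "No GPU detected - enable one via Runtime > Change runtime type"
print(torch.cuda.get_device_name(0))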
How to install IP-Adapters in Colab?
First, connect the Colab notebook to a GPU runtime. Then, in the first cell of the notebook, run the following commands to clone the GitHub repository – github.com/tencent-ailab/IP-Adapter – and download a few model files.
# install Diffusers and Transformers libraries
!pip install diffusers["torch"] transformers
# clone IP-Adapter
!git clone https://github.com/tencent-ailab/IP-Adapter.git
%cd IP-Adapter
# download models
!wget https://huggingface.co/h94/IP-Adapter/raw/main/models/image_encoder/config.json
!wget https://huggingface.co/h94/IP-Adapter/resolve/main/models/image_encoder/pytorch_model.bin
!wget https://huggingface.co/h94/IP-Adapter/resolve/main/models/ip-adapter_sd15.bin
System setup to generate images with IP-Adapter and Stable Diffusion
Let’s begin with importing the required modules for this tutorial.
import torch
from diffusers import StableDiffusionControlNetPipeline, AutoencoderKL, DDIMScheduler, ControlNetModel
from PIL import Image
from ip_adapter import IPAdapter
# load the fine-tuned VAE and set up the DDIM noise scheduler
vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse").to(dtype=torch.float16)
noise_scheduler = DDIMScheduler(
    num_train_timesteps=1000,
    beta_start=0.00085,
    beta_end=0.012,
    beta_schedule="scaled_linear",
    clip_sample=False,
    set_alpha_to_one=False,
    steps_offset=1,
)
Import ControlNet and Stable Diffusion Models
Load the OpenPose ControlNet model and the Stable Diffusion pipeline.
# load the OpenPose ControlNet (trained for Stable Diffusion v1.5)
controlnet_model_path = "lllyasviel/control_v11p_sd15_openpose"
controlnet = ControlNetModel.from_pretrained(controlnet_model_path,
                                             torch_dtype=torch.float16)
# load SD pipeline (a Stable Diffusion v1.5 base model, since ip-adapter_sd15.bin
# and the OpenPose ControlNet are both built for SD 1.5)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    controlnet=controlnet,
    torch_dtype=torch.float16,
    scheduler=noise_scheduler,
    vae=vae,
    feature_extractor=None,
    safety_checker=None
).to("cuda")
Load input images
Load an image with the face of a person. This image will be used as an image prompt by IP-Adapter to generate images from the Stable Diffusion model.
image = Image.open("/content/man.JPG")
image #display image

We will also use a human pose image with the above portrait to generate the final image. Let’s load the human pose image.
openpose_image = Image.open("/content/keypoint_1.png")
openpose_image #display image

The image contains several keypoints indicating important joints in the human body. It also has colored edges connecting the keypoints with each other. In addition to the body pose, this image also has facial keypoints marked.
These body and facial keypoints will help the ControlNet model generate images with a similar pose and facial attributes.
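If you don't already have a keypoint image like this, one way to create it is to extract the pose from any reference photo with the controlnet_aux package. The snippet below is a rough sketch: it assumes controlnet_aux is installed (!pip install controlnet_aux), that its OpenposeDetector supports the include_hand/include_face flags, and that the reference photo path is just a placeholder.
# optional: derive an OpenPose keypoint image (body + face) from a reference photo
# assumes controlnet_aux is installed: !pip install controlnet_aux
from controlnet_aux import OpenposeDetector

openpose_detector = OpenposeDetector.from_pretrained("lllyasviel/Annotators")
pose_reference = Image.open("/content/pose_reference.jpg")  # placeholder reference photo
openpose_image = openpose_detector(pose_reference, include_hand=True, include_face=True)
openpose_image  # display the extracted keypoint image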
Let’s also define a function to conveniently display multiple images that will be generated by the diffusion model.
def image_grid(imgs, rows, cols):
    # arrange equally sized PIL images into a rows x cols grid
    assert len(imgs) == rows * cols
    w, h = imgs[0].size
    grid = Image.new('RGB', size=(cols * w, rows * h))
    for i, img in enumerate(imgs):
        grid.paste(img, box=(i % cols * w, i // cols * h))
    return grid
Generate Images using Stable Diffusion with IP-Adapter
We will first load the IP-Adapter model and generate a few images without any text prompt.
# load ip-adapter
# image_encoder_path is the directory holding the image encoder files
# (config.json and pytorch_model.bin) downloaded earlier
ip_model = IPAdapter(pipe,
                     image_encoder_path=".",
                     ip_ckpt="ip-adapter_sd15.bin",
                     device="cuda")
# generate images using only the face image and the pose image (no text prompt)
images = ip_model.generate(pil_image=image,
                           image=openpose_image,
                           width=704, height=1024,  # dimensions should be multiples of 8
                           num_samples=3,
                           num_inference_steps=40,
                           seed=5802258)
# display images
grid = image_grid(images, 1, 3)
grid  # display grid of images

Quite impressive! The face of the subject in the image resembles the portrait that we loaded earlier and the pose is consistent in all three images.
Now let’s generate images by adding a text prompt – “young man, ((walking in street)), sneakers, jeans, white shirt, natural light”
# generate images with the face image, pose image, and a text prompt
images = ip_model.generate(pil_image=image,
                           image=openpose_image,
                           width=704, height=1024,
                           num_samples=3,
                           num_inference_steps=40,
                           guidance_scale=8,
                           prompt="young man, ((walking in street)), sneakers, jeans, white shirt, natural light",
                           negative_prompt="out of frame, duplicate, watermark, signature, text",
                           seed=5802258)
grid = image_grid(images, 1, 3)
grid

Awesome! As you can see, we are able to generate more controlled images this time with the addition of a text prompt. So, feel free to include a text prompt along with the input image to add more details to your AI-generated images.
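One more knob worth experimenting with is the weight of the image prompt itself. In the reference IP-Adapter repository, the generate() method accepts a scale argument (1.0 by default); lower values let the text prompt dominate, while higher values pull the output closer to the face image. Here is a sketch of that idea, assuming the scale argument is available in your checkout.
# sketch: lower the image-prompt weight so the text prompt has more influence
# assumes generate() accepts a `scale` argument (defaults to 1.0 in the reference repo)
images = ip_model.generate(pil_image=image,
                           image=openpose_image,
                           width=704, height=1024,
                           num_samples=3,
                           num_inference_steps=40,
                           guidance_scale=8,
                           scale=0.6,
                           prompt="young man, ((walking in street)), sneakers, jeans, white shirt, natural light",
                           negative_prompt="out of frame, duplicate, watermark, signature, text",
                           seed=5802258)
image_grid(images, 1, 3)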