Hands-on Guide: Image Generation in Custom Poses using IP-Adapters and Diffusers Library

What is IP-Adapter?

In recent years, generative AI has seen remarkable advancements, particularly in the domain of large text-to-image diffusion models. These models have astounded us with their ability to craft incredibly lifelike images. However, the process of coaxing these models to generate precisely what we desire from mere text prompts can be a daunting task, often requiring intricate and meticulous prompt engineering.

Fortunately, there’s an innovative solution that offers a fresh perspective – the use of image prompts. Instead of relying solely on text, this approach harnesses the power of visual cues.

Traditionally, one common strategy has been to fine-tune pretrained models on specific styles or objects to enhance their generative abilities. Yet, this method often demands substantial computing resources, placing it out of reach for many users.

Figure: IP-Adapter image generation (Source: https://ip-adapter.github.io/)

This is where IP-Adapter comes in. It is an efficient, lightweight adapter that gives pretrained text-to-image diffusion models the ability to understand and respond to image prompts. In essence, it lets a reference image guide the output alongside the text prompt, making the creative potential of these models accessible to a broader audience.

Can we use IP-Adapter with Diffusers for image generation?

Yes, we can easily use IP-Adapter with the Diffusers library to generate images.

We will use IP-Adapter and Diffusers in Google Colab to generate images of a person in a specified pose. Make sure you have a GPU enabled in your Colab notebook.
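Before going further, it can help to confirm that the runtime actually sees a GPU. The following is a minimal sketch, assuming PyTorch is already available in the runtime (as it usually is on Colab); `gpu_status` is a hypothetical helper name, not part of any library.

```python
def gpu_status():
    """Return a short description of whether a CUDA GPU is visible to PyTorch."""
    try:
        # imported lazily so the check fails gracefully if PyTorch is missing
        import torch
    except ImportError:
        return "PyTorch is not installed"
    if torch.cuda.is_available():
        return "GPU available: " + torch.cuda.get_device_name(0)
    return "No GPU detected"

print(gpu_status())
```

If this reports that no GPU was detected, switch the runtime type to a GPU under Runtime > Change runtime type and run the cell again.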

How to install IP-Adapter in Colab?

First, connect the Colab notebook to a GPU runtime. Then, in the first cell of the notebook, run the following commands to install the Diffusers and Transformers libraries, clone the GitHub repository (github.com/tencent-ailab/IP-Adapter), and download the required model files.

# install Diffusers and Transformers libraries
!pip install "diffusers[torch]" transformers

# clone IP-Adapter
!git clone https://github.com/tencent-ailab/IP-Adapter.git
%cd IP-Adapter

# download models
!wget https://huggingface.co/h94/IP-Adapter/raw/main/models/image_encoder/config.json
!wget https://huggingface.co/h94/IP-Adapter/resolve/main/models/image_encoder/pytorch_model.bin
!wget https://huggingface.co/h94/IP-Adapter/resolve/main/models/ip-adapter_sd15.bin
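Later steps load these files from the current directory, so a quick check that all three downloads are present can save a confusing error further on. This is a minimal sketch using only the standard library; `missing_files` is a hypothetical helper, and the file names are the ones downloaded above.

```python
from pathlib import Path

# the three files fetched with wget in the previous cell
REQUIRED_FILES = ["config.json", "pytorch_model.bin", "ip-adapter_sd15.bin"]

def missing_files(directory=".", names=REQUIRED_FILES):
    """Return the names that are absent or empty in `directory`."""
    base = Path(directory)
    return [n for n in names
            if not (base / n).is_file() or (base / n).stat().st_size == 0]

print(missing_files() or "all model files present")
```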

System setup to generate images with IP-Adapter and Stable Diffusion

Let’s begin by importing the required modules for this tutorial and setting up the VAE and noise scheduler.

import torch
from diffusers import StableDiffusionControlNetPipeline, AutoencoderKL, DDIMScheduler, ControlNetModel
from PIL import Image

from ip_adapter import IPAdapter
vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse").to(dtype=torch.float16)

noise_scheduler = DDIMScheduler(

Import ControlNet and Stable Diffusion Models

Load the ControlNet and Stable Diffusion models.

# load controlnet
controlnet_model_path = "lllyasviel/control_v11p_sd15_openpose"
controlnet = ControlNetModel.from_pretrained(controlnet_model_path,

# load SD pipeline
pipe = StableDiffusionControlNetPipeline.from_pretrained(

Load input images

Load an image with the face of a person. IP-Adapter will use this image as the image prompt when generating images with the Stable Diffusion model.

image = Image.open("/content/man.JPG")
image #display image

We will also use a human pose image with the above portrait to generate the final image. Let’s load the human pose image.

openpose_image = Image.open("/content/keypoint_1.png")
openpose_image #display image

The image contains several keypoints indicating important joints in the human body. It also has colored edges connecting the keypoints with each other. In addition to the body pose, this image also has facial keypoints marked.

These body and facial keypoints will help the ControlNet model generate images with a similar pose and similar facial attributes.
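To make the structure of a pose image concrete, here is a simplified sketch of how keypoints and their connecting edges could be represented. The joint names, the coordinates, and the `limb_segments` helper are illustrative assumptions, not the exact OpenPose format.

```python
# Hypothetical subset of body joints, as pixel coordinates (x, y)
keypoints = {
    "neck": (350, 200),
    "right_shoulder": (290, 210),
    "right_elbow": (270, 330),
    "left_shoulder": (410, 210),
}

# Edges connect pairs of joints; a pose image draws each one as a colored line
edges = [
    ("neck", "right_shoulder"),
    ("right_shoulder", "right_elbow"),
    ("neck", "left_shoulder"),
    ("left_shoulder", "left_elbow"),  # left_elbow is missing from keypoints
]

def limb_segments(keypoints, edges):
    """Return the line segments to draw, skipping edges with a missing joint."""
    return [(keypoints[a], keypoints[b])
            for a, b in edges
            if a in keypoints and b in keypoints]

for start, end in limb_segments(keypoints, edges):
    print(start, "->", end)
```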

Let’s also define a function to conveniently display multiple images that will be generated by the diffusion model.

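def image_grid(imgs, rows, cols):
    # the number of images must fill the grid exactly
    assert len(imgs) == rows*cols

    # all images are assumed to share the size of the first one
    w, h = imgs[0].size
    grid = Image.new('RGB', size=(cols*w, rows*h))

    # paste images left to right, top to bottom
    for i, img in enumerate(imgs):
        grid.paste(img, box=(i%cols*w, i//cols*h))
    return grid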
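The paste position of each image follows from its index: `i % cols` picks the column and `i // cols` picks the row. The small sketch below reproduces only this arithmetic, without PIL; `paste_offsets` is a hypothetical helper for illustration.

```python
def paste_offsets(n_images, cols, w, h):
    """Top-left (x, y) corner for each image index in a row-major grid."""
    return [((i % cols) * w, (i // cols) * h) for i in range(n_images)]

# Three 700x1024 images in a single row
print(paste_offsets(3, cols=3, w=700, h=1024))
# [(0, 0), (700, 0), (1400, 0)]

# Four 512x512 images in a 2x2 grid
print(paste_offsets(4, cols=2, w=512, h=512))
# [(0, 0), (512, 0), (0, 512), (512, 512)]
```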

Generate Images using Stable Diffusion with IP-Adapter

We will first load the IP-Adapter model and generate a few images without any text prompt.

# load ip-adapter
ip_model = IPAdapter(pipe,
                     image_encoder_path = ".",
                     ip_ckpt = "ip-adapter_sd15.bin",
                     device = "cuda")

# generate images
images = ip_model.generate(pil_image=image,
                           width=700, height=1024,

# display images
grid = image_grid(images, 1, 3)
grid #display grid of images

Quite impressive! The face of the subject in the image resembles the portrait that we loaded earlier and the pose is consistent in all three images.
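One detail worth knowing about the `width=700, height=1024` setting: Stable Diffusion's VAE works on latents at 1/8 of the pixel resolution, so dimensions that are multiples of 8 are the safest choice, and 700 is not one. Depending on the pipeline and library version, such a value may be silently rounded or rejected. Here is a minimal sketch of snapping a size down to a multiple of 8; `snap_down` is a hypothetical helper.

```python
def snap_down(value, multiple=8):
    """Round `value` down to the nearest multiple of `multiple`."""
    if value < multiple:
        raise ValueError(f"{value} is smaller than {multiple}")
    return value - (value % multiple)

print(snap_down(700), snap_down(1024))  # 696 1024
```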

Now let’s generate images by adding a text prompt – “young man, ((walking in street)), sneakers, jeans, white shirt, natural light”

# generate images
images = ip_model.generate(pil_image=image,
                           width=700, height=1024,
                           prompt = "young man, ((walking in street)), sneakers, jeans, white shirt, natural light",
                           negative_prompt = "out of frame, duplicate, watermark, signature, text",

# display images
grid = image_grid(images, 1, 3)
grid #display grid of images

Awesome! As you can see, we are able to generate more controlled images this time with the addition of a text prompt. So, feel free to include a text prompt along with the input image to add more details to your AI-generated images.
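Because `image_grid` asserts that the number of images equals `rows*cols`, the number of samples requested from the model has to match the grid. The sketch below shows one way the keyword arguments for `ip_model.generate` might be assembled so the two stay in sync. It assumes that `generate` accepts `num_samples` and forwards extra keyword arguments such as `image` (the pose image) to the ControlNet pipeline, as in the IP-Adapter repository's demo; `build_generate_kwargs` is a hypothetical helper, not part of the library.

```python
def build_generate_kwargs(face_image, pose_image, rows, cols,
                          prompt=None, negative_prompt=None,
                          width=700, height=1024):
    """Collect keyword arguments so num_samples always matches the grid size."""
    kwargs = {
        "pil_image": face_image,   # image prompt for IP-Adapter
        "image": pose_image,       # assumed to be forwarded to ControlNet
        "width": width,
        "height": height,
        "num_samples": rows * cols,
    }
    # text prompts are optional; leave them out when not provided
    if prompt is not None:
        kwargs["prompt"] = prompt
    if negative_prompt is not None:
        kwargs["negative_prompt"] = negative_prompt
    return kwargs

# Intended use in the notebook (needs the loaded model, so it is not run here):
# images = ip_model.generate(**build_generate_kwargs(image, openpose_image, 1, 3))
# grid = image_grid(images, 1, 3)
```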
