How To Add Text Behind Moving Objects Using Python

In this step-by-step tutorial, I am going to show you an amazing way to add text behind moving objects using Python. I will use a pre-trained YOLOv8 model and OpenCV to perform this task.

My objective is to insert a text string behind a moving person in a video. However, the same method can be applied to any other object as well, as long as the YOLOv8 model recognizes it.

YOLOv8 is a state-of-the-art family of models that can perform object detection and instance segmentation. And guess what? You don’t need a GPU to use YOLOv8.

The final video will look like this.

Install Ultralytics library

The Ultralytics library is used to load YOLO models. Run the following command to install it.

!pip install ultralytics

Import YOLOv8 Model

Let’s import the required Python modules and the YOLOv8 model. I will use the YOLOv8 small segmentation model ('yolov8s-seg.pt'), but the nano, medium, and large variants of the same model can be used as well.

from ultralytics import YOLO
import cv2
import numpy as np
import matplotlib.pyplot as plt

# Load the YOLOv8 small segmentation model
model = YOLO('yolov8s-seg.pt')

Define a function to add text to image

Let’s define a function to add a text string to an image. Here OpenCV is used to render the text and to control the font face, scale, thickness, and color.

def add_text(image, 
             text, 
             font = cv2.FONT_HERSHEY_SIMPLEX, 
             font_scale = 4, 
             font_thickness = 10, 
             font_color = (50, 100, 237)):  # note: OpenCV colors are BGR, not RGB

  # Get the size of the text box
  text_size = cv2.getTextSize(text, font, font_scale, font_thickness)[0]

  # Calculate the position to center the text
  text_x = (image.shape[1] - text_size[0]) // 2
  text_y = (image.shape[0] + text_size[1]) // 2

  # Put the text on the image
  image_with_text = cv2.putText(image.copy(), text, (text_x, text_y), font, font_scale, font_color, font_thickness)

  return image_with_text
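
If you want to sanity-check the function before touching the video, you can try it on a blank canvas first. This step is optional, and the canvas size here is arbitrary:

# optional: test add_text() on a blank 1280x720 canvas
canvas = np.zeros((720, 1280, 3), dtype=np.uint8)
plt.imshow(cv2.cvtColor(add_text(canvas, "Hello World"), cv2.COLOR_BGR2RGB));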

Load first frame of input video

Let me first show you how to insert text behind a person in a single frame; the same steps will then be replicated for all the frames of the video.

I will simply load the first frame of the input video (sample_video.mp4).

# load first frame of the video 
cap = cv2.VideoCapture("sample_video.mp4")
_, frame = cap.read()
cap.release()

The frame is nothing but a three-dimensional numpy array. Let’s check out the shape of the frame.

# size of the image
frame.shape

Output: (1280, 720, 3)

The shape is (height, width, channels): the video is 1280 pixels tall, 720 pixels wide, and has 3 color channels.

# display the first frame
plt.imshow(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB));

Human body segmentation using YOLOv8

Now I will use the YOLOv8 segmentation model to perform instance segmentation on this frame. In other words, I am going to create a magic mask for the person in the image.

# get YOLO predictions on the image
results = model(frame)

Let’s find the classes of the objects segmented by the model in the image.

# classes of objects detected in the frame
results[0].boxes.cls

Output: tensor([0.])

There is only one object in the given image: a human. The class label for “person” is 0. The full list of objects recognizable by YOLOv8 is given below.

{0: 'person', 1: 'bicycle', 2: 'car', 3: 'motorcycle', 4: 'airplane', 5: 'bus', 6: 'train', 7: 'truck', 8: 'boat', 9: 'traffic light', 10: 'fire hydrant', 11: 'stop sign', 12: 'parking meter', 13: 'bench', 14: 'bird', 15: 'cat', 16: 'dog', 17: 'horse', 18: 'sheep', 19: 'cow', 20: 'elephant', 21: 'bear', 22: 'zebra', 23: 'giraffe', 24: 'backpack', 25: 'umbrella', 26: 'handbag', 27: 'tie', 28: 'suitcase', 29: 'frisbee', 30: 'skis', 31: 'snowboard', 32: 'sports ball', 33: 'kite', 34: 'baseball bat', 35: 'baseball glove', 36: 'skateboard', 37: 'surfboard', 38: 'tennis racket', 39: 'bottle', 40: 'wine glass', 41: 'cup', 42: 'fork', 43: 'knife', 44: 'spoon', 45: 'bowl', 46: 'banana', 47: 'apple', 48: 'sandwich', 49: 'orange', 50: 'broccoli', 51: 'carrot', 52: 'hot dog', 53: 'pizza', 54: 'donut', 55: 'cake', 56: 'chair', 57: 'couch', 58: 'potted plant', 59: 'bed', 60: 'dining table', 61: 'toilet', 62: 'tv', 63: 'laptop', 64: 'mouse', 65: 'remote', 66: 'keyboard', 67: 'cell phone', 68: 'microwave', 69: 'oven', 70: 'toaster', 71: 'sink', 72: 'refrigerator', 73: 'book', 74: 'clock', 75: 'vase', 76: 'scissors', 77: 'teddy bear', 78: 'hair drier', 79: 'toothbrush'}
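
By the way, you don’t have to memorize this mapping; the loaded model exposes it directly:

# class-id to class-name mapping stored in the model
print(model.names)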

Let’s try to extract the mask-array of the detected person from the predictions of YOLOv8.

# get index of person-object in the image
person_idx = np.argmax(results[0].boxes.cls == 0).item()

# get mask-array of the detected person
mask = results[0].masks.data[person_idx].numpy()

# display mask-array
plt.imshow(mask);

As you can see from the axes of the image above, its dimensions do not match the height and width (1280, 720) of the input video. This is a quirk of the YOLO models: the mask is returned at the model’s inference resolution rather than at the size of the input image.
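
You can verify the mismatch by printing the two shapes:

# mask resolution vs. frame resolution
print(mask.shape)       # resolution at which YOLO produced the mask
print(frame.shape[:2])  # (1280, 720)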

Therefore, we will have to resize the mask image using OpenCV.

mask = cv2.resize(mask, 
                  dsize=(frame.shape[1], frame.shape[0]), 
                  interpolation=cv2.INTER_AREA)

# make mask binary
threshold = 0.3
mask = np.where(mask >= threshold, 1.0, 0)

# display resized mask
plt.imshow(mask);

Add text behind moving person in an image

Now we will use the add_text() function to insert text into the same frame.

# add text to image
frame_with_text = add_text(frame, "Hello World")

plt.imshow(cv2.cvtColor(frame_with_text, cv2.COLOR_BGR2RGB));

As you can see, I have added “Hello World” to the frame. Right now the text sits on top of the person in the video. We aim to make the text appear behind the person.

To achieve this, in the frame without text, we will set the pixel values to zero wherever they do not overlap with the mask identified by the YOLOv8 segmentation model.

In the frame that contains the text string, we will do the opposite and set the pixel values to zero wherever they overlap with the mask.

frame[mask == 0] = 0
frame_with_text[mask == 1] = 0

Let’s display both frames after this image manipulation.

# frame without text
plt.imshow(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB));
# frame with text
plt.imshow(cv2.cvtColor(frame_with_text, cv2.COLOR_BGR2RGB));

Let’s add these two frames and see the magic.

final_frame = frame + frame_with_text

plt.imshow(cv2.cvtColor(final_frame, cv2.COLOR_BGR2RGB));

As you can see, the text is now behind the person. Next, I will apply the same method to the remaining frames of the video and insert the text string “Hello World” behind the moving person.
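
A side note: the two assignments above modify frame and frame_with_text in place. If you prefer a non-destructive version, the same composite can be built in a single step with np.where. This is just an equivalent sketch of the idea, not what the rest of the tutorial uses:

# re-read a clean, unmodified copy of the first frame
cap = cv2.VideoCapture("sample_video.mp4")
_, clean = cap.read()
cap.release()

# person pixels come from the clean frame, everything else from the text frame
text_frame = add_text(clean, "Hello World")
person = mask[..., None] == 1  # broadcast the 2-D mask over the 3 color channels
final_frame = np.where(person, clean, text_frame)

plt.imshow(cv2.cvtColor(final_frame, cv2.COLOR_BGR2RGB));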

Insert text behind moving person in video

Alright, I will now define a function that applies the same technique to all the frames of the input video. Before that, let’s create a video writer object that will stitch the processed frames into a new video.

# Read the input video
cap = cv2.VideoCapture('sample_video.mp4')

# Get the video properties
width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
fps = cap.get(cv2.CAP_PROP_FPS)
cap.release()

# Create a VideoWriter object for the output video
fourcc = cv2.VideoWriter_fourcc(*'mp4v')
out = cv2.VideoWriter('output.mp4', fourcc, fps, (width, height))

The output video will be saved in the current working directory as ‘output.mp4’.
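
Optionally, you can check that the writer actually opened before processing hundreds of frames. If this assertion fails on your platform, a different codec/container pair (for example 'XVID' with an .avi file) may be needed:

# optional sanity check on the video writer
assert out.isOpened(), "VideoWriter failed to open; try another fourcc"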

def create_video(video_path):
  # Open the video file
  cap = cv2.VideoCapture(video_path)

  # frame counter
  cnt = 0

  while True:
    # Read a frame from the video
    ret, frame = cap.read()

    # Break the loop if the video has ended
    if not ret:
        break

    # get YOLO predictions
    results = model(frame)

    if (results[0].boxes.cls == 0).sum().item() != 0:
      # find index of person-object
      person_idx = np.argmax(results[0].boxes.cls == 0).item()

      # get the mask of the detected person and make it binary
      try:
        mask = results[0].masks.data[person_idx].numpy()
        mask = cv2.resize(mask,
                          dsize=(frame.shape[1], frame.shape[0]),
                          interpolation=cv2.INTER_AREA)

        mask_2 = np.where(mask >= 0.3, 1.0, 0)

        frame_with_text = add_text(frame, "Hello World")

        frame[mask_2 == 0] = 0
        frame_with_text[mask_2 == 1] = 0

        final_frame = frame + frame_with_text

        out.write(final_frame)
      except IndexError:
        out.write(frame)

    else:

      out.write(frame)

    cnt = cnt+1
    print("frame no. "+ str(cnt))

  # Release video writer object and capture object
  out.release()
  cap.release()

Let’s use the create_video() function to add text behind a moving human in a video. Note that the VideoWriter object out is created outside the function and released inside it, so re-run the writer-creation code before calling create_video() again.

create_video("/content/input_video.mp4")

Conclusion

In this article, we learned how to add text behind a moving person in a video using Python and computer vision. We used the YOLOv8 segmentation model to extract a mask of the person in each frame and inserted the text string behind that mask.

I have a video tutorial as well if you want more clarity.

This method will work only for videos where there is a single person in the foreground. Try to use the same technique for objects other than humans and put the text behind those moving objects; a sketch for a different class is given below.
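
For instance, to put the text behind a dog instead of a person, only the class id in the two checks inside create_video() has to change. Here is a minimal sketch using id 16 ('dog') from the class mapping above; obj_idx is just a renamed variable and the rest of the pipeline stays the same:

# target a different class, e.g. 'dog' (id 16 in the YOLOv8 class mapping)
target_class = 16

if (results[0].boxes.cls == target_class).sum().item() != 0:
  # index and mask of the first detected object of that class
  obj_idx = np.argmax(results[0].boxes.cls == target_class).item()
  mask = results[0].masks.data[obj_idx].numpy()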

You can also check out my tutorial on creating video-inside-text effect if interested.

Feel free to download the entire code notebook and the input video from here.
