Talk To Your Videos Using Large Language Models

Ever caught yourself talking back to a YouTube video, almost wishing it could respond? Well, what if I told you that’s not a pipe dream anymore? That’s right; thanks to retrieval-augmented generation (RAG), we can chat with videos using natural language.

In this article, I will show you how to talk to a video using a large language model with the help of modern AI tools like Whisper and LangChain.

The idea is to first generate the transcript of the video using a speech-to-text model and then build a question answering system for the text content.

We will use Python and Google Colab for this project.

You can download the Python code from here for free.

Steps involved

  1. Convert the video’s audio into text with the help of the Whisper Large v3 model.
  2. Split the extracted text into chunks of roughly equal size.
  3. Use an embedding model to build an embedding vector for each text chunk.
  4. Store the embeddings in a vector database.
  5. Apply RAG (Retrieval-augmented generation) to fetch the relevant context from the vector database when the user passes a query (prompt).

Quick overview of RAG

You know how you can ask ChatGPT all sorts of questions and it tries its best to answer them. Well, RAG is like a ChatGPT cousin who takes that a step further.

When you ask RAG a question, it’s like it has a huge library of books at its disposal. Instead of just relying on what it already knows (like ChatGPT), RAG goes out and fetches specific bits of information from those books that are super relevant to your question.

So, after it finds the best bits of info, RAG puts all that together into a nice, coherent answer for you. It’s like having an assistant who’s good at research and at explaining things, all rolled into one!
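To make the retrieve-then-answer idea concrete, here is a minimal sketch in plain Python. It is only an illustration: it scores chunks by counting shared words instead of using embeddings, and the helper names (tokenize, retrieve, build_prompt) and the sample chunks are made up for this example. A real RAG system would send the final prompt to an LLM.

```python
import re

def tokenize(text):
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(question, chunks, k=2):
    """Return the k chunks that share the most words with the question."""
    q_words = tokenize(question)
    ranked = sorted(chunks, key=lambda c: len(q_words & tokenize(c)), reverse=True)
    return ranked[:k]

def build_prompt(question, context_chunks):
    """Combine the retrieved chunks and the question into one prompt string."""
    context = "\n".join(context_chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

# Hypothetical "library" of text chunks
chunks = [
    "Whisper converts speech to text.",
    "FAISS stores embedding vectors for similarity search.",
    "Colab offers free GPU runtimes.",
]
question = "What does Whisper convert speech to?"

top_chunks = retrieve(question, chunks, k=1)   # retrieval step
prompt = build_prompt(question, top_chunks)    # augmentation step
print(prompt)                                  # an LLM would receive this prompt
```

The rest of this article follows the same shape, with embeddings and a vector database doing the retrieval and GPT-4 Turbo doing the answering.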

Now that you have a basic idea of RAG, let’s get started with the implementation part.

Speech-to-text using Whisper Large v3 in Google Colab

To convert the video’s audio to text, we will use OpenAI’s Whisper Large v3, a pre-trained speech-to-text model.

Make sure the GPU is enabled in the Colab notebook.

Install Insanely Fast Whisper

Let’s install the insanely-fast-whisper library in the Colab notebook. Use the command below in a cell.

!pip install insanely-fast-whisper

Extract audio from the video

We will use the moviepy library to extract audio from our video. Make sure you have uploaded the video file to your Colab session.

The function extract_audio() will extract the audio from the video file and save it as an mp3 file in the same Colab session.

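import os
# moviepy 1.x import path; in moviepy 2.x use `from moviepy import VideoFileClip`
from moviepy.editor import VideoFileClip

def extract_audio(video_filename, output_path='.', output_format='mp3'):
    # Load video file
    video_clip = VideoFileClip(video_filename)

    # Extract audio from video
    audio_clip = video_clip.audio

    # Define the output filename for audio
    audio_filename = os.path.join(output_path, f"audio.{output_format}")

    # Write the audio file
    audio_clip.write_audiofile(audio_filename)

    # Close the clips to release resources
    audio_clip.close()
    video_clip.close()

    return audio_filename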

Convert audio to text using Whisper

Once the audio is extracted, we will convert it into text with the help of the Whisper Large v3 model.

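import torch
from transformers import pipeline
# create a Whisper model pipeline
# (arguments after the task name are assumed: Whisper Large v3 model ID, half precision, first GPU)
pipe = pipeline("automatic-speech-recognition",
                "openai/whisper-large-v3",
                torch_dtype=torch.float16,
                device="cuda:0")

# optional speed-up; requires the optimum package
pipe.model = pipe.model.to_bettertransformer()

# convert audio data to text
# (chunk_length_s and batch_size are assumed values; tune them for your GPU memory)
outputs = pipe("audio.mp3",
               chunk_length_s=30,
               batch_size=8)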

The object outputs["text"] contains the entire text of the audio. Save this text to a file named video_script.txt, because the RAG part of this article loads the transcript from that file.
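One simple way to save the transcript is sketched below. The helper name save_transcript is my own, and the sample string is a placeholder; in the notebook you would pass outputs["text"] instead.

```python
from pathlib import Path

def save_transcript(text, path="video_script.txt"):
    """Write the transcript to a UTF-8 text file and return its path."""
    out = Path(path)
    # Strip stray leading/trailing whitespace and end the file with a newline
    out.write_text(text.strip() + "\n", encoding="utf-8")
    return out

# Placeholder transcript; in the notebook this would be outputs["text"]
saved = save_transcript(" Hello and welcome to the video. ")
print(saved.read_text(encoding="utf-8"))
```

If you plan to run the RAG part in a separate Colab notebook, download this file and upload it to the new session.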

Implementation of RAG strategy with LangChain and OpenAI API

Now you can either continue in the same notebook or create a new Colab notebook. The GPU is required only to run the Whisper model. For RAG, a GPU instance is not required.

System setup

!pip install langchain
!pip install openai
!pip install sentence-transformers
!pip install faiss-cpu

Configure secret keys in Colab

In Colab notebooks, we can securely save secret keys, such as an OpenAI API key.

[Screenshot: the secrets panel (key icon) in the Colab sidebar]

As you can see in the screenshot above, there is a key icon in the sidebar. Here you can save your secret keys.

I have saved my OpenAI API key in this section under the name openai-key. It can be accessed using the userdata.get() function in Python.

import os
from google.colab import userdata

os.environ["OPENAI_API_KEY"] = userdata.get('openai-key')

Import required modules for RAG system

LangChain provides all the necessary tools to perform a wide range of experiments with LLMs. For our video-chat app, we will use some of them.

from langchain.prompts import PromptTemplate
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import FAISS
from langchain.document_loaders import TextLoader
from langchain.chat_models import ChatOpenAI
from langchain.schema.runnable import RunnablePassthrough

Load an embeddings model

I will use Jina’s embedding model to create embeddings from the text documents. The Jina embedding model can embed large documents with ease. You can use other embedding models as well.

embeddings = HuggingFaceEmbeddings(model_name = "jinaai/jina-embedding-s-en-v1")

Both the text documents and the user query will be converted to embeddings with this model. Vectors can be compared numerically, which makes it easy to search for the chunks that are closest in meaning to the query.
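To show what "comparing vectors" means, here is a small sketch of cosine similarity, a common way to measure how close two embeddings are. The three-dimensional vectors are invented for illustration; real embeddings from the Jina model have hundreds of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: near 1 = similar, near 0 = unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up embeddings: one query and two chunks
query_vec = [0.9, 0.1, 0.0]
chunk_on_topic = [0.8, 0.2, 0.1]
chunk_off_topic = [0.0, 0.1, 0.9]

sim_on = cosine_similarity(query_vec, chunk_on_topic)
sim_off = cosine_similarity(query_vec, chunk_off_topic)
print(round(sim_on, 3), round(sim_off, 3))
```

The on-topic chunk scores much higher than the off-topic one, and that difference is what the retriever relies on.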

Split documents into chunks

In this experiment, I am working with a single document: the transcript of my video. Splitting is an important step whether you have one document or many.

The idea is to split the documents into chunks of roughly the same size and then create a vector representation (embedding) of each chunk using the embedding model (Jina embeddings). Note that RecursiveCharacterTextSplitter measures chunk size in characters by default, not tokens.

First, let’s load the text from the text file using LangChain’s TextLoader class.

loader = TextLoader(file_path="video_script.txt")
documents = loader.load()

Then we can split the loaded text into multiple chunks. For each chunk, we will generate embeddings later.

# Split loaded documents into chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=300, chunk_overlap=0)
chunked_documents = text_splitter.split_documents(documents)

The right chunk size depends on the size of the documents and the type of content in them. There is no hard and fast rule for selecting it, so try different values and use the one that gives the best results for your use case.
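To get a feel for how chunk_size and chunk_overlap interact, here is a deliberately simplified splitter. It is not LangChain’s algorithm: RecursiveCharacterTextSplitter tries to break on paragraph, line, and word boundaries first, while this sketch (with a function name of my own) just cuts fixed windows of characters.

```python
def split_by_characters(text, chunk_size=300, chunk_overlap=0):
    """Cut text into windows of at most chunk_size characters.

    Consecutive windows share chunk_overlap characters.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

sample = "abcdefghij" * 5  # 50 characters of placeholder text

no_overlap = split_by_characters(sample, chunk_size=20, chunk_overlap=0)
with_overlap = split_by_characters(sample, chunk_size=20, chunk_overlap=5)
print(len(no_overlap), len(with_overlap))
```

A small overlap means more chunks to store, but it reduces the chance that a sentence relevant to the query is cut in half at a chunk boundary.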

Create a vector database using FAISS

The next step is to convert the text chunks into embeddings and store them in a vector database.

We have many options for vector databases. Some popular ones are Chroma, Weaviate, Pinecone, and Supabase. My favorite is FAISS as it is open source, easy to set up, and provides fast vector similarity search.

# Create and persist a vector database from the chunked documents
vector_database = FAISS.from_documents(
    documents=chunked_documents,
    embedding=embeddings,
)

# save vector db in a folder "db_faiss"
vector_database.save_local("db_faiss")

In the code snippet above, FAISS.from_documents() uses the embedding model to create a vector for each chunk of our text data and stores the vectors in a FAISS index. We then save the database locally.

Configure a retriever chain for RAG

Once the vector database is set up, we can easily load it again. Now let’s create a retriever that will fetch the 3 chunks whose vectors are most similar to the vector of the user query.

db = FAISS.load_local(folder_path="db_faiss", embeddings=embeddings)

# k represents the number of most similar vectors to fetch from the database
retriever = db.as_retriever(search_kwargs={"k": 3})

Let’s say you want to fetch the 5 most similar vectors instead of 3; then simply set the value of “k” to 5.
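Here is a rough sketch of what "fetch the k most similar vectors" means. It ranks made-up two-dimensional vectors by Euclidean distance to the query, which I am assuming is the default metric of LangChain’s FAISS wrapper. The function name top_k and the data are illustrative only, and FAISS itself does this far more efficiently.

```python
import math

def top_k(query_vec, chunk_vecs, k=3):
    """Return the indices of the k vectors closest to query_vec (Euclidean distance)."""
    distances = [math.dist(query_vec, v) for v in chunk_vecs]
    order = sorted(range(len(chunk_vecs)), key=lambda i: distances[i])
    return order[:k]

# Made-up embeddings for five chunks
chunk_vecs = [
    [0.0, 1.0],
    [0.9, 0.1],
    [1.0, 0.0],
    [0.5, 0.5],
    [-1.0, 0.0],
]
query_vec = [1.0, 0.0]

nearest_three = top_k(query_vec, chunk_vecs, k=3)
nearest_five = top_k(query_vec, chunk_vecs, k=5)
print(nearest_three, nearest_five)
```

A larger k gives the LLM more context to work with, but it also makes the prompt longer and can pull in less relevant chunks.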

Create Prompt Template using LangChain

Let’s create a prompt template using LangChain that will be used to retrieve answers from the video’s text script.

prompt_template = """
You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question.

If you don't know the answer, just say that you don't know. Keep the answer concise.

Question: {question}
Context: {context}
"""

rag_prompt = PromptTemplate(template=prompt_template,
                        input_variables=["question", "context"])

This prompt template contains two variables – ‘question’ and ‘context’. ‘question’ will be replaced by the query passed by the user and the ‘context’ is the content of the chunks whose embeddings are similar to the embeddings of the user query.
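The substitution itself can be illustrated with plain Python string formatting, which is roughly what PromptTemplate does for templates written with curly-brace placeholders. The shortened template and the placeholder chunks below are made up for this example.

```python
template = """
You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question.

Question: {question}
Context: {context}
"""

# Placeholder chunks standing in for what the retriever would return
retrieved_chunks = ["chunk one text", "chunk two text"]

filled = template.format(
    question="What is sequence length?",
    context="\n\n".join(retrieved_chunks),
)
print(filled)
```

The filled-in string is what the LLM actually receives, so printing it is a quick way to debug unexpected answers.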

Ask questions from your video’s text script

Finally, it is time to test our semantic question answering RAG system.

First, let’s specify the large language model and define the retriever chain. We will use the GPT-4 Turbo model for our use case. You may use other variants as well, such as GPT-3.5 Turbo or GPT-4.

llm = ChatOpenAI(model_name="gpt-4-turbo", temperature=0)

# define RAG sequence of actions
rag_chain = {"context": retriever, "question": RunnablePassthrough()} | rag_prompt | llm
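One optional refinement: the retriever returns a list of Document objects, and as far as I can tell the chain above inserts that list into the prompt as-is, metadata included. A common pattern is to join only the text of the chunks before it reaches the prompt. The sketch below uses a small stand-in class (Doc, hypothetical) so it runs without LangChain installed.

```python
from dataclasses import dataclass, field

@dataclass
class Doc:
    """Hypothetical stand-in for a LangChain Document (only the fields used here)."""
    page_content: str
    metadata: dict = field(default_factory=dict)

def format_docs(docs):
    """Join the text of the retrieved chunks, separated by blank lines."""
    return "\n\n".join(doc.page_content for doc in docs)

# Placeholder documents standing in for retriever output
docs = [
    Doc("Sequence length is the number of tokens.", {"source": "video_script.txt"}),
    Doc("Shorter sequences are padded."),
]
context = format_docs(docs)
print(context)
```

In the chain, this helper would be plugged in as {"context": retriever | format_docs, "question": RunnablePassthrough()}, which keeps the prompt free of metadata.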

Finally, we can chat with the content of our video using natural language and get relevant responses from the model.

# ask questions
rag_chain.invoke("What is sequence length and why it is important?")


AIMessage(content='Sequence length refers to the number of tokens in a sequence. It is important because it determines the amount of information that can be processed or represented in a sequence. In the given context, the maximum sequence length is set to 25, and this length is used for padding all sequences to ensure consistency in processing.')


In this article, I demonstrated the implementation of a RAG question answering system using LangChain, GPT-4 Turbo, the Jina embeddings model, and the FAISS vector database. Feel free to comment below or connect with me if you have any questions.
