In the digital age, video content is increasingly popular, and YouTube is the world's largest video-sharing platform. However, for people who are deaf or hard of hearing, or for anyone watching in a noisy environment, it can be difficult to access that content without subtitles or captions.
Transcription of YouTube videos can be a time-consuming and challenging task, but with the help of machine learning and natural language processing, it’s becoming easier to automate. OpenAI Whisper is an excellent tool for creating accurate and fast video transcriptions. It can be used to build an app that makes the transcription process even more accessible.
In this article, we’ll explore how to build an app that uses Gradio and OpenAI Whisper to produce a transcription of YouTube videos. We’ll provide a step-by-step guide to building the app. By the end of this article, you’ll have a better understanding of how to create an app that can automatically transcribe YouTube videos and make video content more accessible to everyone.
OpenAI Whisper
OpenAI Whisper is a machine learning tool developed by OpenAI, a leading research organization in artificial intelligence. Its primary function is to automatically transcribe audio content, such as speeches or conversations, into written text.
OpenAI Whisper uses state-of-the-art deep learning models that are trained on massive amounts of audio data. These models are capable of accurately identifying individual words and phrases in audio recordings and converting them into text. The technology behind OpenAI Whisper is based on cutting-edge natural language processing techniques, which allow the tool to recognize context and language nuances, improving the accuracy of its transcriptions.
One of the main advantages of OpenAI Whisper is its speed. On suitable hardware it can process audio faster than real time, which means that transcriptions can be generated quickly and efficiently. This makes OpenAI Whisper particularly useful for applications where speed and accuracy are essential, such as producing subtitles for videos or transcribing events.
Gradio

Gradio is an open-source Python library that simplifies the creation of machine learning interfaces. With Gradio, developers and data scientists can build web-based interfaces for machine learning models without needing any web-development knowledge.
Gradio provides an intuitive API for wrapping machine learning models, allowing users to build custom user interfaces for their models easily. The library provides a variety of UI elements, such as sliders, text boxes, and dropdown menus, which can be used to adjust model parameters or input data.
The resulting interface is interactive and user-friendly, making it easy to explore and understand the model’s behavior.
Building the App
The workflow of our app is as follows:
- Accept a YouTube video URL as input.
- Extract the audio of the YouTube video in MP3 format.
- Pass this MP3 file to the Whisper model to get the transcription.
We will also add a feature that summarizes the transcription and gives a summary of the video. So a new step is added to our workflow, which takes the transcription and summarizes it.
First, let’s import all the necessary libraries as follows:
import whisper
from pytube import YouTube
from transformers import pipeline
import gradio as gr
import os
pytube is a library used to download the audio from a YouTube URL. We use the transformers library to summarize the transcription, and we also import the gradio and whisper libraries.
Let’s load the Whisper model and the summarization pipeline as shown below:
model = whisper.load_model("base")
summarizer = pipeline("summarization")
We loaded the ‘base’ Whisper model and the summarization pipeline from transformers. Now we define three functions: one to download the audio from the YouTube URL, one to transcribe that audio, and one to summarize the transcription. The code is as follows:
def get_audio(url):
    yt = YouTube(url)
    # Select the first audio-only stream and download it
    video = yt.streams.filter(only_audio=True).first()
    out_file = video.download(output_path=".")
    # Rename the downloaded file with an .mp3 extension
    base, ext = os.path.splitext(out_file)
    new_file = base + '.mp3'
    os.rename(out_file, new_file)
    return new_file
def get_text(url):
    result = model.transcribe(get_audio(url))
    return result['text']
def get_summary(url):
    article = get_text(url)
    summary = summarizer(article)
    return summary[0]['summary_text']
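One caveat worth noting: summarization models accept only a limited number of input tokens, so the transcription of a long video may be truncated or rejected. A minimal sketch of one workaround is to split the transcript into word-bounded chunks and summarize each chunk separately; `chunk_text` is a helper introduced here for illustration, and the 1000-word chunk size is an arbitrary assumption:

```python
def chunk_text(text, max_words=1000):
    # Split the transcript into word-bounded chunks so each
    # piece stays within the summarizer's input limit
    words = text.split()
    return [' '.join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]
```

Each chunk can then be passed through `summarizer(...)` on its own, and the partial summaries concatenated into a single summary of the video.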
Now we create a Gradio interface for our app as follows:
with gr.Blocks() as demo:
    gr.Markdown("<h1><center>Youtube video transcription with OpenAI's Whisper</center></h1>")
    gr.Markdown("<center>Enter the link of any youtube video to get the transcription of the video and a summary of the video in the form of text.</center>")
    with gr.Tab('Get the transcription of any Youtube video'):
        with gr.Row():
            input_text_1 = gr.Textbox(placeholder='Enter the Youtube video URL', label='URL')
            output_text_1 = gr.Textbox(placeholder='Transcription of the video', label='Transcription')
        result_button_1 = gr.Button('Get Transcription')
    with gr.Tab('Summary of Youtube video'):
        with gr.Row():
            input_text = gr.Textbox(placeholder='Enter the Youtube video URL', label='URL')
            output_text = gr.Textbox(placeholder='Summary text of the Youtube Video', label='Summary')
        result_button = gr.Button('Get Summary')
    result_button.click(get_summary, inputs=input_text, outputs=output_text)
    result_button_1.click(get_text, inputs=input_text_1, outputs=output_text_1)

demo.launch(debug=True)
In the above code, we created two tabs, one for the transcription and the other for the summarized transcription. We used the previously defined functions to get the outputs and display them in the app. The input is a YouTube URL, and the output is displayed in a textbox in both tabs. You can deploy this app on Hugging Face Spaces, which offers free hosting.
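For a Spaces deployment, the app code goes in an app.py file with a requirements.txt alongside it listing the dependencies. A minimal sketch might look like the following (the exact package names are assumptions; pin versions as needed):

```text
gradio
pytube
transformers
torch
openai-whisper
```

Whisper also relies on ffmpeg being installed on the system; on Hugging Face Spaces this can be requested by listing it in a packages.txt file.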
You can find the deployed app here: Youtube Video Transcription With Whisper – a Hugging Face Space by buildingai
Conclusion
In conclusion, developing an app that automatically transcribes YouTube videos using OpenAI’s Whisper and Gradio can make a significant impact in making information more accessible to everyone. Transcriptions can benefit not only those who are hard of hearing or deaf, but also anyone who prefers to read or search for specific information in a video. With this app, we can help bridge the gap between different audiences and improve the overall quality and value of video content.
As technology continues to advance, we have the opportunity to create more inclusive and accessible ways of sharing information. By leveraging the power of AI and machine learning, we can develop tools that make a real difference in people’s lives.