
Video transcription using OpenAI: Part 1 - Using Whisper for transcription extraction

This two-part article explores how we process video transcriptions. The first part covers why accurate, reader-friendly transcriptions matter for video content and how we extract them with Whisper.

Introduction

On our GitNation Portal, we offer a large amount of video content. Recognizing the needs of users who prefer reading over watching videos, we have decided to provide high-quality transcriptions for each video.

It wasn't enough for us to simply convert videos into text. Our objective was to produce transcriptions that are not only accurate but also reader-friendly, logically segmented into paragraphs and sections.

We also aimed to automate this entire process. In this article, we will discuss how we leveraged OpenAI products, specifically Whisper and ChatGPT, to achieve this. We will also delve into the challenges and constraints we encountered and share our experiences in overcoming them.

This is the first part of the article; you can find the second part here.

In this part we will cover:

  * Downloading the video and extracting the audio
  * Using Whisper to extract a text transcription from the audio
  * The challenges and complexities of processing text transcriptions
  * Processing the segments array from Whisper

Downloading video and extracting audio: The initial processing steps

Whisper is a powerful tool for converting audio into text, so the first thing we need to do is extract the audio track from the video.

  1. First and foremost, we need to download the video from its hosting platform. In our scenario, the source of the video content is Vimeo.
import fs from 'fs';
import { Vimeo } from 'vimeo';
import fetch from 'node-fetch';

const CLIENT_ID = 'YOUR_CLIENT_ID';
const CLIENT_SECRET = 'YOUR_CLIENT_SECRET';
const ACCESS_TOKEN = 'YOUR_ACCESS_TOKEN';

const client = new Vimeo(CLIENT_ID, CLIENT_SECRET, ACCESS_TOKEN);

async function downloadVideo(videoID) {
  return new Promise((resolve, reject) => {
    // Fetch video details
    client.request({
      method: 'GET',
      path: `/videos/${videoID}`, // use the videoID argument
      query: {
        fields: 'download',
      },
    }, async function (error, body, statusCode, headers) {
      if (error) {
        console.error(error);
        reject(error);
      } else {
        // Get the file URL
        const videoURL = body.download[0].link; // You might need to select another array element depending on the video quality
        // Download the video
        const res = await fetch(videoURL);
        const fileStream = fs.createWriteStream('path_to_video.mp4');
        res.body.pipe(fileStream);
        res.body.on('error', reject);
        fileStream.on('finish', resolve);
      }
    });
  });
}

// Example usage:
const videoID = "YOUR_VIDEO_ID";
await downloadVideo(videoID);

  2. Next, we need to extract the audio track from the video.

Below is an example of how you can utilize the ffmpeg-extract-audio library to extract an audio track from a video:

import extractAudio from 'ffmpeg-extract-audio';

async function extractAudioFromVideo(videoPath, outputPath) {
  try {
    await extractAudio({
      input: videoPath,
      output: outputPath
    });
    console.log('Audio extraction successful!');
  } catch (error) {
    console.error('Error extracting audio:', error);
  }
}

// Example usage
await extractAudioFromVideo('./path_to_video.mp4', './path_to_output_audio.mp3');

After successfully extracting the audio, we should delete the video file.

fs.unlinkSync('path_to_video.mp4');

Using Whisper to extract a text transcription from the audio

After obtaining the audio from the video, the next step is to transcribe it into text. For this purpose, we'll utilize OpenAI's Whisper system, a state-of-the-art automatic speech recognition system. With Whisper, we can quickly convert spoken words in our audio file into a readable text format, making it an ideal solution for our transcription needs.
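
Before calling Whisper, we need an initialized OpenAI client and the prompt constant used below. Here is a minimal setup sketch, assuming the openai Node.js SDK v3 (the version that exposes createTranscription) and an API key stored in an environment variable; the exact contents of TEMPLATE_WHISPER_PROMPT are omitted here:

import fs from 'fs';
import { Configuration, OpenAIApi } from 'openai';

// Client setup for the openai v3 SDK (assumption: the API key lives in an env variable).
const openai = new OpenAIApi(new Configuration({ apiKey: process.env.OPENAI_API_KEY }));

// A short text hint (speaker names, domain terminology) that nudges Whisper towards
// the expected vocabulary; the exact prompt text is not shown here.
const TEMPLATE_WHISPER_PROMPT = '...';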

async function transcriptFileWithWhisper(pathToAudioFile: string): Promise<WhisperTranscriptionData> {
  const transcript = await openai.createTranscription(
    fs.createReadStream(pathToAudioFile), // the audio file
    'whisper-1',                          // model
    TEMPLATE_WHISPER_PROMPT,              // prompt hinting the expected vocabulary
    'verbose_json',                       // response format: includes segments with timestamps
    0.7,                                  // temperature
    'en',                                 // language
    {
      // Audio files can be large, so lift the axios request size limits
      maxContentLength: Infinity,
      maxBodyLength: Infinity,
    },
  );

  return transcript.data;
}

const whisperResult = await transcriptFileWithWhisper('./path_to_output_audio.mp3');

// After processing, don't forget to delete the audio file.
fs.unlinkSync('./path_to_output_audio.mp3');

The whisperResult provides us with valuable information regarding the transcription. It primarily consists of two fields that are of interest to us: text and segments.

The text field comprises the entire transcription in a continuous format. While this gives us the core content, it's essentially a block of text which might not be user-friendly for reading. To illustrate, it may look something like this:

[Screenshot: the raw transcription rendered as one continuous block of text]

Yes, this is already an accurate text transcription of the video, but it's entirely unstructured, making it difficult to read.

Challenges and complexities in processing text transcriptions

At this juncture, it seems an opportune moment to employ ChatGPT to analyze the text and organize it into logical chapters and paragraphs. This is a viable solution for shorter content pieces. However, if we task ChatGPT with processing extensive transcriptions, such as those from workshops that last several hours, we may quickly reach the token limit of the chat model. And that's the first problem.

You can learn more about tokens here and about token limits for various models here.

It's important to note that the model counts all tokens — both input and output, which essentially make up the content of a given chat session. Therefore, even using the largest available model, such as gpt-3.5-turbo-16k, won't solve the problem for longer content pieces.
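
To make this concrete, here is a rough sketch of checking whether a transcription fits into a single request. The helpers are hypothetical, and one token is approximated as about four characters of English text; exact counts require a tokenizer such as tiktoken:

// Rough heuristic: ~4 characters per token for English text (an approximation only).
const estimateTokens = (text: string) => Math.ceil(text.length / 4);

// gpt-3.5-turbo-16k has a context window of roughly 16k tokens, shared by input and output
// (see OpenAI's docs for the exact figure).
const CONTEXT_LIMIT = 16_000;

const fitsInOneRequest = (transcription: string, expectedOutputTokens: number) =>
  estimateTokens(transcription) + expectedOutputTokens < CONTEXT_LIMIT;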

In such scenarios, the logical approach is to divide the text into several parts and process them independently. In practice, this is currently the only viable solution.

However, if we start to technically segment the text into chunks, let's say arbitrarily by every 100 sentences, we run the risk of inadvertently breaking a logical flow somewhere in the middle. Consequently, we might not be able to partition the text into logically coherent and distinct chapters. And that poses the second problem.

So, we have two intertwined challenges:

  1. The token limit of the ChatGPT model.
  2. When we work around this limit by dividing the text into fixed chunks, we risk breaking the logical flow and splitting a coherent idea or thought mid-discussion.

Furthermore, when we technically partition the text into chunks and process them individually, we lose the ability to engage with ChatGPT within a unified context. Each text segment will have its own distinct context, and the chat won't know the preceding content or the most suitable context for processing the current chunk.

This becomes particularly crucial when it's essential to consider the context or ideas from previous chunks and undertake tasks beyond merely segmenting text into paragraphs.

As a solution, we've implemented a mechanism named the Logical Loop.

Processing the segments array from Whisper

To begin with, let's make the necessary preparations. Let's go back to the data that Whisper returns to us, specifically focusing on the segments field. Inside, we will find the following transcription structure:

[
  ...
  {
    id: 22,
    seek: 8482,
    start: 91.94,
    end: 98.33999999999999,
    text: ' So AI, I think it gives us developers a massive boost in productivity, code quality, as well',
    tokens: [
      50720,
      407,
      ...
    ],
    temperature: 0.7,
    avg_logprob: -0.3790066678759078,
    compression_ratio: 1.6958041958041958,
    no_speech_prob: 0.10085256397724152,
  },
  {
    id: 23,
    seek: 8482,
    start: 98.34,
    end: 101.1,
    text: " as the kind of stuff that we're able to build.",
    tokens: [51040, 382, ...],
    temperature: 0.7,
    avg_logprob: -0.3790066678759078,
    compression_ratio: 1.6958041958041958,
    no_speech_prob: 0.10085256397724152,
  },
  {
    id: 24,
    seek: 8482,
    start: 101.1,
    end: 103.17999999999999,
    text: " We're still figuring a lot of this stuff out.",
    tokens: [51178, 492, ...],
    temperature: 0.7,
    avg_logprob: -0.3790066678759078,
    compression_ratio: 1.6958041958041958,
    no_speech_prob: 0.10085256397724152,
  },
  ...
]

As you can see, Whisper returns individual chunks of text with associated time stamps. These aren't necessarily distinct sentences; a chunk might contain part of a sentence or even several incomplete sentences combined.

However, we need some atomic unit of measurement to split the large text into several chunks of fixed size. We decided that a sentence ending with a period would serve as this unit.

Of course, we could have taken the entire text returned by Whisper and performed

text.split('. ')

However, preserving time stamps was crucial for our subsequent integrations. Thus, let's use the data from the segments field and convert it into individual sentences.

export const splitToSentencesWithTimestamps = async ({
  transcription,
}: {
  transcription: TranscriptionItem[];
}) => {
  // If a chunk contains multiple separate sentences, we split them.
  // Note: clearText (not shown here) is assumed to trim and normalize whitespace.
  const transcriptionChunks = transcription.reduce((acc: TranscriptionItem[], currentTranscriptChunk) => {
    if (currentTranscriptChunk.text.includes('. ')) {
      const commonTextLength = currentTranscriptChunk.text.length;
      const chunkDuration = currentTranscriptChunk.end - currentTranscriptChunk.start;
      let currentStart = currentTranscriptChunk.start;

      currentTranscriptChunk.text.split('. ').forEach((substring, index, arr) => {
        // Approximate each sub-sentence's duration proportionally to its share of the chunk's text.
        const currentPeriod = (substring.length / commonTextLength) * chunkDuration;

        acc.push({
          start: currentStart,
          end: index === arr.length - 1 ? currentTranscriptChunk.end : (currentStart += currentPeriod),
          text: index !== arr.length - 1 ? clearText(` ${substring}.`) : clearText(` ${substring}`),
        });
      });
    } else {
      acc.push({
        start: currentTranscriptChunk.start,
        end: currentTranscriptChunk.end,
        text: currentTranscriptChunk.text,
      });
    }

    return acc;
  }, []);

  let startTime = 0;
  let endTime = 0;
  let currentSentence = '';

  // We assemble complete sentences from individual chunks.
  const sentencesWithTimestamps: any[] = transcriptionChunks.reduce(
    (sentencesData: any[], currentTimestamp, currentIndex) => {
      currentSentence += currentTimestamp.text;
      if (
        currentSentence.length &&
        currentSentence.endsWith('.') &&
        transcriptionChunks[currentIndex + 1]?.text.startsWith(' ')
      ) {
        endTime = currentTimestamp.end;
        sentencesData.push({ startTime: startTime, endTime: endTime, sentence: currentSentence });
        startTime = transcriptionChunks[currentIndex + 1]?.start;
        currentSentence = '';
      }

      return sentencesData;
    },
    [],
  );

  return sentencesWithTimestamps;
};


const sentencesWithTimestamps = await splitToSentencesWithTimestamps({ transcription });

As a result, we obtain an array of individual sentences structured as follows:

{
  startTime: "...", // sentence start time
  endTime: "...",   // sentence end time
  sentence: "..."   // sentence text
}

Here's an example of the transformed transcription, where each object represents an individual sentence with associated time stamps:

[
  {
    startTime: "91.94",
    endTime: "101.1",
    sentence: "So AI, I think it gives us developers a massive boost in productivity, code quality, as well as the kind of stuff that we're able to build."
  },
  {
    startTime: "101.1",
    endTime: "103.18",
    sentence: "We're still figuring a lot of this stuff out."
  }
]

As mentioned earlier, videos can vary in length, and consequently, transcriptions will differ in size. We've crafted a straightforward utility to help determine the number of sentences in each fixed chunk of text, based on the overall sentence count in the transcript:

export const getFixedChunkSize = (sentencesLength: number) => {
  if (sentencesLength > 1500) {
    return 90;
  }
  if (sentencesLength > 1000 && sentencesLength <= 1500) {
    return 70;
  }
  if (sentencesLength > 500 && sentencesLength <= 1000) {
    return 50;
  }
  if (sentencesLength <= 500) {
    return 30;
  }
  return 20;
};
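
To make the next step concrete, here is a small sketch of how the sentences and the fixed chunk size can be combined into the chunks that will later be sent to ChatGPT. The chunkSentences helper is hypothetical, not part of our actual codebase:

// Hypothetical helper: group the sentence objects into fixed-size chunks.
const chunkSentences = <T>(sentences: T[], chunkSize: number): T[][] => {
  const chunks: T[][] = [];
  for (let i = 0; i < sentences.length; i += chunkSize) {
    chunks.push(sentences.slice(i, i + chunkSize));
  }
  return chunks;
};

const chunkSize = getFixedChunkSize(sentencesWithTimestamps.length);
const sentenceChunks = chunkSentences(sentencesWithTimestamps, chunkSize);
// Each chunk keeps its sentences' timestamps, ready for further processing.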

So far, we have the text segmented into individual sentences and a fixed chunk size determined. Nothing groundbreaking yet: it seems we've merely created utilities to divide a large transcription into parts and process each segment separately. Exactly right! But not entirely.

In the next part of the article, we will discuss how to implement the Logical Loop and how it helps address the issues outlined above.

WRITTEN BY

Artur Nikitsin

Senior Engineer at FocusReactive