Introduction
On our portal, we offer a large amount of video content. Recognizing the needs of users who prefer reading over watching videos, we have decided to provide high-quality transcriptions for each video.
It wasn't enough for us to simply convert videos into text.
Our objective was to produce transcriptions that are not only accurate but also
reader-friendly, logically segmented into paragraphs and sections.
We also aimed to automate this entire process. In this article, we will discuss how we leveraged OpenAI products, specifically Whisper and
ChatGPT, to achieve this. We will also delve into the challenges and constraints we encountered and share our experiences in overcoming them.
This is the first part of the article; you can find the second part here.
In this part we will cover:
Downloading video and extracting audio: The initial processing steps
Whisper is a potent tool for converting audio into text. And so the first thing we need to do is extract the audio track from the video.
- First and foremost, we need to download the video from its hosting platform.
In our scenario, the source of the video content is Vimeo.
import fs from 'fs';
import { Vimeo } from 'vimeo';
import fetch from 'node-fetch';

const CLIENT_ID = 'YOUR_CLIENT_ID';
const CLIENT_SECRET = 'YOUR_CLIENT_SECRET';
const ACCESS_TOKEN = 'YOUR_ACCESS_TOKEN';

const client = new Vimeo(CLIENT_ID, CLIENT_SECRET, ACCESS_TOKEN);

async function downloadVideo(videoID) {
  return new Promise((resolve, reject) => {
    // Request only the download links for the given video
    client.request({
      method: 'GET',
      path: `/videos/${videoID}`,
      query: {
        fields: 'download',
      },
    }, async function (error, body, statusCode, headers) {
      if (error) {
        console.error(error);
        reject(error);
      } else {
        const videoURL = body.download[0].link;
        const res = await fetch(videoURL);
        // Stream the response body into a local file
        const fileStream = fs.createWriteStream('path_to_video.mp4');
        await new Promise((resolve, reject) => {
          res.body.pipe(fileStream);
          res.body.on('error', reject);
          fileStream.on('finish', resolve);
        });
        resolve();
      }
    });
  });
}

const videoID = 'YOUR_VIDEO_ID';
await downloadVideo(videoID);
- Next, we need to extract the audio track from the video.
Below is an example of how you can utilize the ffmpeg-extract-audio
library to extract an audio track from a video:
import extractAudio from 'ffmpeg-extract-audio';

async function extractAudioFromVideo(videoPath, outputPath) {
  try {
    await extractAudio({
      input: videoPath,
      output: outputPath,
    });
    console.log('Audio extraction successful!');
  } catch (error) {
    console.error('Error extracting audio:', error);
  }
}

extractAudioFromVideo('./path_to_video.mp4', './path_to_output_audio.mp3');
After successfully extracting the audio, we should delete the video file.
fs.unlinkSync('./path_to_video.mp4');
After obtaining the audio from the video, the next step is to transcribe it into text. For this purpose, we'll utilize OpenAI's Whisper system, a state-of-the-art automatic speech recognition (ASR) system. With Whisper, we can quickly convert spoken words in our audio file into a readable text format, making it an ideal solution for our transcription needs.
// `openai` is an initialized OpenAIApi client and TEMPLATE_WHISPER_PROMPT is our
// prompt constant; both are defined elsewhere in our codebase.
// WhisperTranscriptionData describes Whisper's verbose_json response (text, segments, etc.).
async function transcriptFileWithWhisper(pathToAudioFile: string): Promise<WhisperTranscriptionData> {
  const transcript = await openai.createTranscription(
    fs.createReadStream(pathToAudioFile), // audio file to transcribe
    'whisper-1',                          // model
    TEMPLATE_WHISPER_PROMPT,              // prompt
    'verbose_json',                       // response format (includes segments)
    0.7,                                  // temperature
    'en',                                 // language
    {
      maxContentLength: Infinity,
      maxBodyLength: Infinity,
    },
  );
  return transcript.data;
}
const whisperResult = await transcriptFileWithWhisper('./path_to_output_audio.mp3');
fs.unlinkSync('./path_to_output_audio.mp3');
The whisperResult provides us with valuable information regarding the transcription. It primarily consists of two fields that are of interest to us: text and segments.
The text field comprises the entire transcription in a continuous format.
While this gives us the core content, it's essentially a block of text which
might not be user-friendly for reading. To illustrate, it may look something like this:
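"...So AI, I think it gives us developers a massive boost in productivity, code quality, as well as the kind of stuff that we're able to build. We're still figuring a lot of this stuff out. ..." and so on, as one continuous string for the entire length of the video, with no paragraph breaks (this excerpt is reconstructed from the segment examples shown later in this article).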

Yes, this is already an accurate text transcription of the video, but it's entirely
unstructured, making it difficult to read.
Challenges and complexities in processing text transcriptions
At this juncture, it seems an opportune moment to employ ChatGPT to analyze the text and organize it into logical chapters and paragraphs. This is a viable solution for shorter content pieces. However, if we task ChatGPT with processing extensive transcriptions, such as those from workshops
that last several hours, we may quickly reach the token limit of the chat model. And that's the first problem.
You can learn more about tokens here
and about token limits for various models here.
It's important to note that the model counts all tokens—both input and output—which essentially make up the content of a given chat session.
Therefore, even using the largest available model, such as gpt-3.5-turbo-16k, won't solve the problem for longer content pieces.
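To get a feel for whether a given transcription will even fit into a single request, it helps to roughly estimate its token count up front. The sketch below relies on the common rule of thumb of roughly four characters per token for English text; it is only an approximation, and the reserve for the model's reply is an assumption made purely for illustration.

const CONTEXT_WINDOW = 16000; // roughly the context size of gpt-3.5-turbo-16k
const RESERVED_FOR_OUTPUT = 4000; // room we assume we need for the model's reply

// Rough estimate: ~4 characters per token for English text
// (an approximation, not a real tokenizer).
const estimateTokens = (text: string): number => Math.ceil(text.length / 4);

const fitsInOneRequest = (transcriptionText: string): boolean =>
  estimateTokens(transcriptionText) <= CONTEXT_WINDOW - RESERVED_FOR_OUTPUT;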
In such scenarios, a logical approach would be to divide the text into several parts and process them independently. This is indeed a promising approach, and at present the only viable one in such cases.
However, if we start to technically segment the text into chunks, let's say arbitrarily by every 100 sentences, we run the risk of inadvertently breaking a logical flow somewhere in the middle. Consequently, we might not be able to partition the text into logically coherent and distinct chapters. And that poses the second problem.
So, we have two intertwined challenges:
- The token limit of the ChatGPT model.
- When we attempt to work around this limit by dividing the text into fixed chunks, there's a potential risk of interrupting the logical flow or splitting a coherent idea or thought that's being discussed in the content midway.
Furthermore, when we technically partition the text into chunks and process them individually, we lose the ability to engage with ChatGPT within a unified context. Each text segment will have its own distinct context, and the chat won't know the preceding content or the most suitable context to process the current chunk.
This becomes particularly crucial when it's essential to consider the context or ideas from previous chunks and undertake tasks beyond merely segmenting text into paragraphs.
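To make the problem concrete, here is a minimal sketch (not our actual implementation) of what naive independent chunk processing looks like with the same openai v3 client used above. Every request starts a fresh messages array, so the model sees each chunk in isolation and remembers nothing about the previous ones; the prompt text here is purely illustrative.

// Naive approach: each chunk gets a brand-new conversation, so ChatGPT has no
// knowledge of the previous chunks or of the structure it produced for them.
async function processChunksIndependently(chunks: string[]): Promise<string[]> {
  const results: string[] = [];
  for (const chunk of chunks) {
    const response = await openai.createChatCompletion({
      model: 'gpt-3.5-turbo-16k',
      messages: [
        { role: 'system', content: 'Split the following transcript into logical paragraphs.' },
        { role: 'user', content: chunk },
      ],
    });
    results.push(response.data.choices[0].message?.content ?? '');
  }
  return results;
}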
As a solution, we've implemented a mechanism named the Logical Loop.
Processing the segments array from Whisper
To begin with, let's make the necessary preparations. Let's return to the data that Whisper gives us, focusing specifically on the segments field. Inside, we find the following transcription structure:
[
  ...
  {
    id: 22,
    seek: 8482,
    start: 91.94,
    end: 98.33999999999999,
    text: ' So AI, I think it gives us developers a massive boost in productivity, code quality, as well',
    tokens: [
      50720,
      407,
      ...
    ],
    temperature: 0.7,
    avg_logprob: -0.3790066678759078,
    compression_ratio: 1.6958041958041958,
    no_speech_prob: 0.10085256397724152,
  },
  {
    id: 23,
    seek: 8482,
    start: 98.34,
    end: 101.1,
    text: " as the kind of stuff that we're able to build.",
    tokens: [51040, 382, ...],
    temperature: 0.7,
    avg_logprob: -0.3790066678759078,
    compression_ratio: 1.6958041958041958,
    no_speech_prob: 0.10085256397724152,
  },
  {
    id: 24,
    seek: 8482,
    start: 101.1,
    end: 103.17999999999999,
    text: " We're still figuring a lot of this stuff out.",
    tokens: [51178, 492, ...],
    temperature: 0.7,
    avg_logprob: -0.3790066678759078,
    compression_ratio: 1.6958041958041958,
    no_speech_prob: 0.10085256397724152,
  },
  ...
]
As you can see, Whisper returns individual chunks of text with associated time stamps. These aren't necessarily distinct sentences; a chunk might contain part of a sentence or even several incomplete sentences combined.
However, we need some atomic unit of measurement to split the large text into several chunks of fixed size. We decided that a sentence ending with a period would serve as this unit.
Of course, we could have taken the entire text returned by Whisper and performed
text.split('. ')
However, preserving time stamps was crucial for our subsequent integrations.
Thus, let's utilize the data from the segments field and convert them into individual sentences.
// clearText is a small text clean-up helper defined elsewhere in our codebase.
export const splitToSentencesWithTimestamps = async ({
  transcription,
}: {
  transcription: TranscriptionItem[];
}) => {
  // Step 1: split every Whisper segment that contains a sentence boundary ('. ')
  // into smaller pieces, distributing the segment's duration proportionally
  // to the length of each piece.
  const transcriptionChunks = transcription.reduce((acc: TranscriptionItem[], currentTranscriptChunk) => {
    if (currentTranscriptChunk.text.includes('. ')) {
      const commonTextLength = currentTranscriptChunk.text.length;
      const chunkDuration = currentTranscriptChunk.end - currentTranscriptChunk.start;
      let currentStart = currentTranscriptChunk.start;
      currentTranscriptChunk.text.split('. ').forEach((substring, index, arr) => {
        const currentPeriod = (substring.length / commonTextLength) * chunkDuration;
        acc.push({
          start: currentStart,
          // The last piece keeps the segment's original end time; the others get
          // an end time proportional to their share of the text.
          end: index === arr.length - 1 ? currentTranscriptChunk.end : (currentStart += currentPeriod),
          // Re-attach the period that split() removed, except for the last piece,
          // which may continue in the next segment.
          text: index !== arr.length - 1 ? clearText(` ${substring}.`) : clearText(` ${substring}`),
        });
      });
    } else {
      acc.push({
        start: currentTranscriptChunk.start,
        end: currentTranscriptChunk.end,
        text: currentTranscriptChunk.text,
      });
    }
    return acc;
  }, []);

  // Step 2: glue the pieces back together into complete sentences, keeping the
  // start time of the first piece and the end time of the last one.
  let startTime = 0;
  let endTime = 0;
  let currentSentence = '';
  const sentencesWithTimestamps: any[] = transcriptionChunks.reduce(
    (sentencesData: any[], currentTimestamp, currentIndex) => {
      currentSentence += currentTimestamp.text;
      const isLastChunk = currentIndex === transcriptionChunks.length - 1;
      if (
        currentSentence.length &&
        // A sentence is complete when it ends with a period and the next chunk
        // starts a new one; we also flush whatever remains on the last chunk.
        (isLastChunk ||
          (currentSentence.endsWith('.') && transcriptionChunks[currentIndex + 1]?.text.startsWith(' ')))
      ) {
        endTime = currentTimestamp.end;
        sentencesData.push({ startTime: startTime, endTime: endTime, sentence: currentSentence });
        startTime = transcriptionChunks[currentIndex + 1]?.start;
        currentSentence = '';
      }
      return sentencesData;
    },
    [],
  );
  return sentencesWithTimestamps;
};
const sentencesWithTimestamps = await splitToSentencesWithTimestamps({ transcription });
As a result, we obtain an array of individual sentences structured as follows:
{
  startTime: "...",
  endTime: "...",
  sentence: "..."
}
Here's an example of the transformed transcription, where each object
represents an individual sentence with associated time stamps:
[
  {
    startTime: "91.94",
    endTime: "101.1",
    sentence: "So AI, I think it gives us developers a massive boost in productivity, code quality, as well as the kind of stuff that we're able to build."
  },
  {
    startTime: "101.1",
    endTime: "103.18",
    sentence: "We're still figuring a lot of this stuff out."
  }
]
As mentioned earlier, videos can vary in length, and consequently,
transcriptions will differ in size. We've crafted a straightforward utility to help determine the number of sentences in each fixed chunk of text, based on the overall sentence count in the transcript:
export const getFixedChunkSize = (sentencesLength: number) => {
  if (sentencesLength > 1500) {
    return 90;
  }
  if (sentencesLength > 1000) {
    return 70;
  }
  if (sentencesLength > 500) {
    return 50;
  }
  return 30;
};
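With the sentences and the chunk size in hand, grouping the sentence array into working chunks is straightforward. The helper below is a minimal sketch of that step (the chunkSentences name and shape are ours, purely for illustration); what we then do with these chunks is the subject of the next part.

// Group the sentences into fixed-size chunks using the chunk size chosen above.
const chunkSentences = <T>(sentences: T[], chunkSize: number): T[][] => {
  const chunks: T[][] = [];
  for (let i = 0; i < sentences.length; i += chunkSize) {
    chunks.push(sentences.slice(i, i + chunkSize));
  }
  return chunks;
};

const fixedChunkSize = getFixedChunkSize(sentencesWithTimestamps.length);
const sentenceChunks = chunkSentences(sentencesWithTimestamps, fixedChunkSize);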
So far, we have the text segmented into individual sentences and a fixed chunk size determined. Nothing groundbreaking yet: it seems we've merely created utilities to divide a large transcription into parts and process each segment separately.
Exactly right!
But not entirely.
In the next part of the article, we will discuss how to implement the Logical Loop and how it helps address the issues we outlined earlier.