
Video transcription using OpenAI: Part 1 - Using Whisper for transcription extraction

This two-part article explores how we process video transcriptions. The first part covers why accurate, reader-friendly transcriptions matter for video content and how we extract them with Whisper.

Introduction

On our GitNation Portal, we offer a large amount of video content. Recognizing the needs of users who prefer reading over watching videos, we have decided to provide high-quality transcriptions for each video.

It wasn't enough for us to simply convert videos into text. Our objective was to produce transcriptions that are not only accurate but also reader-friendly, logically segmented into paragraphs and sections.

We also aimed to automate this entire process. In this article, we will discuss how we leveraged OpenAI products, specifically Whisper and ChatGPT, to achieve this. We will also delve into the challenges and constraints we encountered and share our experiences in overcoming them.

This is the first part of the article; you can find the second part here.

In this part we will cover:

  * Downloading the video and extracting the audio
  * Using Whisper to extract a text transcription from the audio
  * The challenges and complexities of processing text transcriptions
  * Processing the segments array from Whisper

Downloading video and extracting audio: The initial processing steps

Whisper is a powerful tool for converting audio into text, so the first thing we need to do is extract the audio track from the video.

  1. First and foremost, we need to download the video from its hosting platform. In our scenario, the source of the video content is Vimeo.
import fs from 'fs';
import { Vimeo } from 'vimeo';
import fetch from 'node-fetch';

const CLIENT_ID = 'YOUR_CLIENT_ID';
const CLIENT_SECRET = 'YOUR_CLIENT_SECRET';
const ACCESS_TOKEN = 'YOUR_ACCESS_TOKEN';

const client = new Vimeo(CLIENT_ID, CLIENT_SECRET, ACCESS_TOKEN);

async function downloadVideo(videoID) {
  return new Promise((resolve, reject) => {
    // Fetch video details
    client.request({
      method: 'GET',
      path: `/videos/${videoID}`, // use the videoID argument
      query: {
        fields: 'download',
      },
    }, async function (error, body, statusCode, headers) {
      if (error) {
        console.error(error);
        reject(error);
      } else {
        // Get the file URL
        const videoURL = body.download[0].link; // You might need to select another array element depending on the video quality
        // Download the video
        const res = await fetch(videoURL);
        const fileStream = fs.createWriteStream('path_to_video.mp4');
        res.body.pipe(fileStream);
        res.body.on('error', reject);
        fileStream.on('finish', resolve);
      }
    });
  });
}

// Example usage:
const videoID = "YOUR_VIDEO_ID";
await downloadVideo(videoID);

  2. Next, we need to extract the audio track from the video.

Below is an example of how you can utilize the ffmpeg-extract-audio library to extract an audio track from a video:

import extractAudio from 'ffmpeg-extract-audio';

async function extractAudioFromVideo(videoPath, outputPath) {
  try {
    await extractAudio({
      input: videoPath,
      output: outputPath
    });
    console.log('Audio extraction successful!');
  } catch (error) {
    console.error('Error extracting audio:', error);
  }
}

// Example usage
await extractAudioFromVideo('./path_to_video.mp4', './path_to_output_audio.mp3');

After successfully extracting the audio, we should delete the video file.

fs.unlinkSync('path_to_video.mp4');

Using Whisper to extract a text transcription from the audio

After obtaining the audio from the video, the next step is to transcribe it into text. For this purpose, we'll utilize OpenAI's Whisper system, a state-of-the-art automatic speech recognition system. With Whisper, we can quickly convert spoken words in our audio file into a readable text format, making it an ideal solution for our transcription needs.
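
Before calling Whisper, we need an initialized OpenAI client and the prompt constant used below. Here is a minimal setup sketch, assuming the openai Node.js SDK v3 (the version that exposes createTranscription) and an API key stored in an environment variable; the exact contents of TEMPLATE_WHISPER_PROMPT are omitted here:

import fs from 'fs';
import { Configuration, OpenAIApi } from 'openai';

// Client setup for the openai v3 SDK (assumption: the API key lives in an env variable).
const openai = new OpenAIApi(new Configuration({ apiKey: process.env.OPENAI_API_KEY }));

// A short text hint (speaker names, domain terminology) that nudges Whisper towards
// the expected vocabulary; the exact prompt text is not shown here.
const TEMPLATE_WHISPER_PROMPT = '...';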

async function transcriptFileWithWhisper(pathToAudioFile: string): Promise<WhisperTranscriptionData> {
  const transcript = await openai.createTranscription(
    fs.createReadStream(pathToAudioFile), // the audio file
    'whisper-1',                          // model
    TEMPLATE_WHISPER_PROMPT,              // prompt hinting the expected vocabulary
    'verbose_json',                       // response format: includes segments with timestamps
    0.7,                                  // temperature
    'en',                                 // language
    {
      // Audio files can be large, so lift the axios request size limits
      maxContentLength: Infinity,
      maxBodyLength: Infinity,
    },
  );

  return transcript.data;
}

const whisperResult = await transcriptFileWithWhisper('./path_to_output_audio.mp3');

// After processing, don't forget to delete the audio file.
fs.unlinkSync('./path_to_output_audio.mp3');

The whisperResult provides us with valuable information regarding the transcription. It primarily consists of two fields that are of interest to us: text and segments.

The text field comprises the entire transcription in a continuous format. While this gives us the core content, it's essentially a block of text which might not be user-friendly for reading. To illustrate, it may look something like this:

[Screenshot: the raw transcription rendered as one continuous block of text]

Yes, this is already an accurate text transcription of the video, but it's entirely unstructured, making it difficult to read.

Challenges and complexities in processing text transcriptions

At this juncture, it seems an opportune moment to employ ChatGPT to analyze the text and organize it into logical chapters and paragraphs. This is a viable solution for shorter content pieces. However, if we task ChatGPT with processing extensive transcriptions, such as those from workshops that last several hours, we may quickly reach the token limit of the chat model. And that's the first problem.

You can learn more about tokens here and about token limits for various models here.

It's important to note that the model counts all tokens — both input and output, which essentially make up the content of a given chat session. Therefore, even using the largest available model, such as gpt-3.5-turbo-16k, won't solve the problem for longer content pieces.
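
To make this concrete, here is a rough sketch of checking whether a transcription fits into a single request. The helpers are hypothetical, and one token is approximated as about four characters of English text; exact counts require a tokenizer such as tiktoken:

// Rough heuristic: ~4 characters per token for English text (an approximation only).
const estimateTokens = (text: string) => Math.ceil(text.length / 4);

// gpt-3.5-turbo-16k has a context window of roughly 16k tokens, shared by input and output
// (see OpenAI's docs for the exact figure).
const CONTEXT_LIMIT = 16_000;

const fitsInOneRequest = (transcription: string, expectedOutputTokens: number) =>
  estimateTokens(transcription) + expectedOutputTokens < CONTEXT_LIMIT;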

In such scenarios, the logical approach is to divide the text into several parts and process them independently. In practice, this is currently the only viable solution.

However, if we start to technically segment the text into chunks, let's say arbitrarily by every 100 sentences, we run the risk of inadvertently breaking a logical flow somewhere in the middle. Consequently, we might not be able to partition the text into logically coherent and distinct chapters. And that poses the second problem.

So, we have two intertwined challenges:

  1. The token limit of the ChatGPT model.
  2. When we work around this limit by dividing the text into fixed chunks, we risk breaking the logical flow and splitting a coherent idea or thought mid-discussion.

Furthermore, when we technically partition the text into chunks and process them individually, we lose the ability to engage with ChatGPT within a unified context. Each text segment will have its own distinct context, and the chat won't know the preceding content or the most suitable context for processing the current chunk.

This becomes particularly crucial when it's essential to consider the context or ideas from previous chunks and undertake tasks beyond merely segmenting text into paragraphs.

As a solution, we've implemented a mechanism named the Logical Loop.

Processing the segments array from Whisper

To begin with, let's make the necessary preparations. Let's go back to the data that Whisper returns to us, specifically focusing on the segments field. Inside, we will find the following transcription structure:

[
  ...
  {
    id: 22,
    seek: 8482,
    start: 91.94,
    end: 98.33999999999999,
    text: ' So AI, I think it gives us developers a massive boost in productivity, code quality, as well',
    tokens: [
      50720,
      407,
      ...
    ],
    temperature: 0.7,
    avg_logprob: -0.3790066678759078,
    compression_ratio: 1.6958041958041958,
    no_speech_prob: 0.10085256397724152,
  },
  {
    id: 23,
    seek: 8482,
    start: 98.34,
    end: 101.1,
    text: " as the kind of stuff that we're able to build.",
    tokens: [51040, 382, ...],
    temperature: 0.7,
    avg_logprob: -0.3790066678759078,
    compression_ratio: 1.6958041958041958,
    no_speech_prob: 0.10085256397724152,
  },
  {
    id: 24,
    seek: 8482,
    start: 101.1,
    end: 103.17999999999999,
    text: " We're still figuring a lot of this stuff out.",
    tokens: [51178, 492, ...],
    temperature: 0.7,
    avg_logprob: -0.3790066678759078,
    compression_ratio: 1.6958041958041958,
    no_speech_prob: 0.10085256397724152,
  },
  ...
]

As you can see, Whisper returns individual chunks of text with associated time stamps. These aren't necessarily distinct sentences; a chunk might contain part of a sentence or even several incomplete sentences combined.

However, we need some atomic unit of measurement to split the large text into several chunks of fixed size. We decided that a sentence ending with a period would serve as this unit.

Of course, we could have taken the entire text returned by Whisper and performed

text.split('. ')

However, preserving time stamps was crucial for our subsequent integrations. Thus, let's use the data from the segments field and convert it into individual sentences.

export const splitToSentencesWithTimestamps = async ({
  transcription,
}: {
  transcription: TranscriptionItem[];
}) => {
  // If a chunk contains multiple separate sentences, we split them.
  // Note: clearText (not shown here) is assumed to trim and normalize whitespace.
  const transcriptionChunks = transcription.reduce((acc: TranscriptionItem[], currentTranscriptChunk) => {
    if (currentTranscriptChunk.text.includes('. ')) {
      const commonTextLength = currentTranscriptChunk.text.length;
      const chunkDuration = currentTranscriptChunk.end - currentTranscriptChunk.start;
      let currentStart = currentTranscriptChunk.start;

      currentTranscriptChunk.text.split('. ').forEach((substring, index, arr) => {
        // Approximate each sub-sentence's duration proportionally to its share of the chunk's text.
        const currentPeriod = (substring.length / commonTextLength) * chunkDuration;

        acc.push({
          start: currentStart,
          end: index === arr.length - 1 ? currentTranscriptChunk.end : (currentStart += currentPeriod),
          text: index !== arr.length - 1 ? clearText(` ${substring}.`) : clearText(` ${substring}`),
        });
      });
    } else {
      acc.push({
        start: currentTranscriptChunk.start,
        end: currentTranscriptChunk.end,
        text: currentTranscriptChunk.text,
      });
    }

    return acc;
  }, []);

  let startTime = 0;
  let endTime = 0;
  let currentSentence = '';

  // We assemble complete sentences from individual chunks.
  const sentencesWithTimestamps: any[] = transcriptionChunks.reduce(
    (sentencesData: any[], currentTimestamp, currentIndex) => {
      currentSentence += currentTimestamp.text;
      if (
        currentSentence.length &&
        currentSentence.endsWith('.') &&
        transcriptionChunks[currentIndex + 1]?.text.startsWith(' ')
      ) {
        endTime = currentTimestamp.end;
        sentencesData.push({ startTime: startTime, endTime: endTime, sentence: currentSentence });
        startTime = transcriptionChunks[currentIndex + 1]?.start;
        currentSentence = '';
      }

      return sentencesData;
    },
    [],
  );

  return sentencesWithTimestamps;
};


const sentencesWithTimestamps = await splitToSentencesWithTimestamps({ transcription });

As a result, we obtain an array of individual sentences structured as follows:

{
  startTime: "...", // sentence start time
  endTime: "...",   // sentence end time
  sentence: "..."   // sentence text
}

Here's an example of the transformed transcription, where each object represents an individual sentence with associated time stamps:

[
  {
    startTime: "91.94",
    endTime: "101.1",
    sentence: "So AI, I think it gives us developers a massive boost in productivity, code quality, as well as the kind of stuff that we're able to build."
  },
  {
    startTime: "101.1",
    endTime: "103.18",
    sentence: "We're still figuring a lot of this stuff out."
  }
]

As mentioned earlier, videos can vary in length, and consequently, transcriptions will differ in size. We've crafted a straightforward utility to help determine the number of sentences in each fixed chunk of text, based on the overall sentence count in the transcript:

export const getFixedChunkSize = (sentencesLength: number) => {
  if (sentencesLength > 1500) {
    return 90;
  }
  if (sentencesLength > 1000 && sentencesLength <= 1500) {
    return 70;
  }
  if (sentencesLength > 500 && sentencesLength <= 1000) {
    return 50;
  }
  if (sentencesLength <= 500) {
    return 30;
  }
  return 20;
};
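
To make the next step concrete, here is a small sketch of how the sentences and the fixed chunk size can be combined into the chunks that will later be sent to ChatGPT. The chunkSentences helper is hypothetical, not part of our actual codebase:

// Hypothetical helper: group the sentence objects into fixed-size chunks.
const chunkSentences = <T>(sentences: T[], chunkSize: number): T[][] => {
  const chunks: T[][] = [];
  for (let i = 0; i < sentences.length; i += chunkSize) {
    chunks.push(sentences.slice(i, i + chunkSize));
  }
  return chunks;
};

const chunkSize = getFixedChunkSize(sentencesWithTimestamps.length);
const sentenceChunks = chunkSentences(sentencesWithTimestamps, chunkSize);
// Each chunk keeps its sentences' timestamps, ready for further processing.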

So far, we have the text segmented into individual sentences and a fixed chunk size determined. Nothing groundbreaking yet: it seems we've merely created utilities to divide a large transcription into parts and process each segment separately. Exactly right! But not entirely.

In the next part of the article, we will discuss how to implement the Logical Loop and how it helps address the issues outlined above.

WRITTEN BY

Artur Nikitsin

Senior Engineer at FocusReactive