
Video transcription using OpenAI: Part 2 - Using ChatGPT for transcription processing and formatting

In the second part of this article, we continue our exploration of transcription processing, focusing on using ChatGPT to enhance and format the transcriptions. We introduce the concept of the "Logical Loop" and its role in addressing the challenges encountered in transcription processing.

In the first part of the article, we explored the process of using Whisper to extract text transcriptions from videos and developed the respective utility functions for it. We also highlighted the main challenges that can arise during transcription processing.

In this part, we will delve into how to utilize ChatGPT for processing and formatting these transcriptions. Specifically, we'll discuss the concept of a Logical Loop and its role in addressing the challenges we previously identified.

In this part, we will cover:

  • What exactly is the Logical Loop
  • A step-by-step guide to the Logical Loop with ChatGPT
  • Prompt engineering for iterative transcription processing

What exactly is the Logical Loop

At its core, the Logical Loop is a typical JavaScript loop. However, it stands out by incorporating a degree of flexibility driven by artificial intelligence. Instead of just iterating over data in a linear manner, it adapts and optimizes based on context and overarching objectives, leveraging the strengths of AI.

We can initiate a Logical Loop, but only ChatGPT will decide when it concludes!

With our sentences segmented and a defined chunk size, it's time to process these chunks using our Logical Loop mechanism. The idea is to keep processing the text chunks until we reach the final one.

Here's a simplified version of how this loop would look:

const chunkSize = getFixedChunkSize(sentencesWithTimestamps.length);
let isLastChunk = false;
const chapters = [];

while (!isLastChunk) {
  // AI-driven magic happens here
}
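
The getFixedChunkSize helper isn't defined in this part of the article; a minimal placeholder could simply cap the chunk at a fixed number of sentences (30 is the size we use as an example below; the exact sizing logic is an assumption):

// Minimal placeholder (an assumption for illustration): cap each chunk
// at 30 sentences; tune the size to your model's context window.
const getFixedChunkSize = (totalSentences: number): number => Math.min(30, totalSentences);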

A step-by-step guide to the Logical Loop with ChatGPT

  1. We determine a fixed chunk size for the current transcription, say, 30 sentences.
  2. We submit this fixed chunk to ChatGPT for analysis, asking it to divide the text into logical paragraphs without altering the original content. We then request it to evaluate the result and discard the final paragraphs if they begin to address a new topic or idea.
  3. ChatGPT returns the processed chunk, excluding the last paragraphs. This becomes a ready chapter, which we simply append to an array of completed chapters.
  4. We review the array of processed chapters and count the total sentences processed from the transcription. To select a chunk for the next iteration, we take the full array of sentences from the transcription, omit the ones already processed from the beginning, and select the next 30 sentences. In this manner, sentences that ChatGPT discarded in the previous iteration will be included in the next one!

Therefore, while we employ a consistent chunk size for each iteration, it's ultimately up to ChatGPT to discern where a logically coherent idea concludes and where to truncate the text.
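
To make the arithmetic concrete, here is a small worked example with illustrative numbers:

// Suppose the transcription has 100 sentences and the chunk size is 30.
const allSentences = Array.from({ length: 100 }, (_, i) => `Sentence ${i}.`);
const exampleChunkSize = 30;
let processedCount = 0;

// Iteration 1: we send sentences 0-29; suppose ChatGPT keeps only 24 of them
// as a coherent chapter and discards the rest.
processedCount += 24;

// Iteration 2: the next chunk starts where the processed text ends, so the
// 6 discarded sentences are automatically re-included (sentences 24-53).
const nextChunk = allSentences.slice(processedCount, processedCount + exampleChunkSize);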

Additionally, on each iteration, we'll ask the chat to generate a shortened version of the chapter. In subsequent iterations, all previously shortened versions will be passed back to the chat to ensure context is understood, enhancing the quality of its output. In this way, we address the issue of losing context for individual chunks when they are processed separately.

Moreover, we intend to ask the chat to generate appropriate titles for each chapter.

Sounds cool! So, let's implement all of this in code:

  1. Configuring OpenAI for chat access

To utilize the OpenAI API, you first need to configure it. This is done using the Configuration object, where you specify your API key:

import { Configuration, OpenAIApi, ChatCompletionRequestMessageRoleEnum } from 'openai';

const configuration = new Configuration({
  apiKey: OPENAI_API_KEY,
});

Then, with this configuration, you instantiate the API:

const openai = new OpenAIApi(configuration);

To handle chunks of messages, we define the asynchronous function processChunk. This function sends messages to the model and returns the response. Within the parameters, you can adjust the "temperature" and "top_p" values to control the randomness and quality of the model's response. You can read more about these settings in the OpenAI API documentation.

async function processChunk(messages) {
  return await openai.createChatCompletion({
    model: 'gpt-3.5-turbo-16k', // the model to use
    messages: messages, // the array of dialog messages
    temperature: 0.5,
    top_p: 0.9,
  });
}
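
Before wiring processChunk into the loop, you can verify the setup with a quick call. This is a hypothetical smoke test, not part of the pipeline:

// Hypothetical smoke test: send a trivial dialog and print the reply.
const testResponse = await processChunk([
  { role: ChatCompletionRequestMessageRoleEnum.System, content: 'You are a helpful assistant.' },
  { role: ChatCompletionRequestMessageRoleEnum.User, content: 'Reply with the single word "pong".' },
]);
console.log(testResponse.data.choices[0].message?.content); // expected: "pong"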
  2. Now, we'll develop several functions that will allow us to work with text and select chunks on each iteration:

// This function removes all tags and extra spaces from the text.
const clearText = (text) => {
  // Replacing tags with ' ' (rather than '') matters for cases like </p><p>,
  // so adjacent paragraphs don't get glued together; then we collapse
  // multiple whitespaces into one.
  return text.replace(/(<([^>]+)>)/gi, ' ').replace(/ +/g, ' ');
};
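
For example, tags are replaced with single spaces and runs of spaces collapse into one:

// Example input/output for clearText:
clearText('<p>First sentence.</p><p>Second   sentence.</p>');
// => ' First sentence. Second sentence. '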

// This function converts the array of processed chapters back into a complete
// text. This is useful for counting sentences and for creating a set of
// condensed versions to provide the chat with the necessary context.
const getSolidTextFromChapters = (transcriptionChapters) => {
  const context = transcriptionChapters.reduce((solidText, { partName, shortenedVersion }) => {
    return (solidText += `Part name: "${partName}"\n Shortened version: "${shortenedVersion}"\n`);
  }, '');

  const formattedText = transcriptionChapters
    .map(({ textWithParagraphs }) => textWithParagraphs)
    .join('');

  return { context, formattedText };
};
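
With two processed chapters (illustrative data), the function returns:

const { context, formattedText } = getSolidTextFromChapters([
  { partName: 'Intro', shortenedVersion: 'Why we transcribe videos.', textWithParagraphs: '<p>Welcome...</p>' },
  { partName: 'Setup', shortenedVersion: 'Installing the tools.', textWithParagraphs: '<p>First, we install...</p>' },
]);
// context:
//   Part name: "Intro"
//    Shortened version: "Why we transcribe videos."
//   Part name: "Setup"
//    Shortened version: "Installing the tools."
// formattedText: '<p>Welcome...</p><p>First, we install...</p>'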

// Finally, this function analyzes the completed chapters and the entire set
// of sentences in the transcription, selecting a fixed chunk on each iteration.
const selectNextChunk = ({
  sentencesWithTimestamps,
  transcriptionChapters,
  chunkSize,
}: {
  sentencesWithTimestamps: any[];
  transcriptionChapters: any[];
  chunkSize: number;
}) => {
  const { formattedText } = getSolidTextFromChapters(transcriptionChapters);
  const formattedTextWithoutParagraphs = clearText(formattedText);

  // splitTextToSentences is the sentence-splitting utility from part 1.
  // Counting its output tells us how many sentences are already processed.
  const formattedSentencesCount = splitTextToSentences(formattedTextWithoutParagraphs).length;
  let isFullTextSelected = false;
  let currentChunk = sentencesWithTimestamps.slice(
    formattedSentencesCount,
    formattedSentencesCount + chunkSize
  );

  if (currentChunk.length < chunkSize) {
    isFullTextSelected = true;
  }

  // If fewer than 10 sentences would remain after this chunk, take them all now
  // so the final chapter isn't left too short to form a coherent section.
  if (sentencesWithTimestamps.length - formattedSentencesCount - currentChunk.length < 10) {
    currentChunk = sentencesWithTimestamps.slice(formattedSentencesCount);
    isFullTextSelected = true;
  }

  return { currentChunk, isFullTextSelected };
};
  3. So, let's utilize all these functions within our loop:

const chunkSize = getFixedChunkSize(sentencesWithTimestamps.length);
let isLastChunk = false;
let chunkNumber = 1;
const processedChapters = [];

while (!isLastChunk) {
  const { currentChunk, isFullTextSelected } = selectNextChunk({
    sentencesWithTimestamps,
    transcriptionChapters: processedChapters,
    chunkSize,
  });
  isLastChunk = isFullTextSelected;

  const { context } = getSolidTextFromChapters(processedChapters);
  const currentChunkText = currentChunk.reduce((acc, { sentence }) => {
    return (acc += sentence);
  }, '');

  // The dialog is still empty here; we'll fill it with prompts in the next step.
  const dialog: { role: ChatCompletionRequestMessageRoleEnum; content: string }[] = [];

  const completionData = await processChunk(dialog);

  if (completionData.data.choices[0].message) {
    const responseContent = JSON.parse(completionData.data.choices[0].message.content);
    processedChapters.push({
      order: chunkNumber,
      partName: responseContent.partName,
      shortenedVersion: responseContent.shortenedVersion,
      textWithParagraphs: responseContent.textWithParagraphs,
    });
    chunkNumber += 1;
  }
}

As you can see, we are almost ready to launch our loop. However, the dialogue with the chat is currently just an empty array. It's also unclear where fields such as partName, shortenedVersion, and textWithParagraphs originate from within the chat's JSON response. Let's move on to developing this part.

Prompt engineering for iterative transcription processing

What should we consider when creating prompts in our case?

Firstly, it's evident that prompts should vary across different iteration stages. Why is that? On the first iteration, we have no context from previous stages, so we can't provide any to the chat.

Secondly, when we're processing the final chunk, there's no need to omit the last paragraphs. The entire text should be processed to the end.

Based on these considerations, we can identify three different types of prompts:

  1. For the first chunk.
  2. For all intermediate chunks.
  3. For the concluding chunk.

So, let's integrate prompt handling into our loop.

const chunkSize = getFixedChunkSize(sentencesWithTimestamps.length);
let isLastChunk = false;
let chunkNumber = 1;
const processedChapters = [];

while (!isLastChunk) {
  const { currentChunk, isFullTextSelected } = selectNextChunk({
    sentencesWithTimestamps,
    transcriptionChapters: processedChapters,
    chunkSize,
  });
  isLastChunk = isFullTextSelected;

  const { context } = getSolidTextFromChapters(processedChapters);
  const currentChunkText = currentChunk.reduce((acc, { sentence }) => {
    return (acc += sentence);
  }, '');

  const dialog: { role: ChatCompletionRequestMessageRoleEnum; content: string }[] = [];
  const COMMON_INSTRUCTION = {
    role: ChatCompletionRequestMessageRoleEnum.System,
    content: `You are an experienced article writer with a technical background.
Your task is to analyze a large text transcription that has been divided into sections,
examining each part individually.`,
  };

  // Prompt for the first chunk: no previous context is available yet.
  if (chunkNumber === 1 && !isLastChunk) {
    dialog.push(COMMON_INSTRUCTION, {
      role: ChatCompletionRequestMessageRoleEnum.User,
      content: `
This text is the first part: "${currentChunkText}".
Split this text into logical paragraphs without changing the original content and
follow the next instructions:
1. Wrap each paragraph in <p> tag.
2. Analyze the text. If the final paragraphs introduce a new topic, remove them from the final text.
3. Create a shortened version of the text that contains only the key points.
4. Create a unique name for this section based on the information within it.
5. Do not include any explanatory content. Provide an RFC8259 compliant JSON response in the following format without deviation:
{
  "textWithParagraphs": "...",
  "shortenedVersion": "...",
  "partName": "..."
}.`,
    });
  }

  // Prompt for intermediate chunks: pass the shortened versions of all
  // previous chapters so the model keeps the overall context.
  if (chunkNumber !== 1 && !isLastChunk) {
    dialog.push(
      COMMON_INSTRUCTION,
      {
        role: ChatCompletionRequestMessageRoleEnum.Assistant,
        content: `These are the shortened versions of all previous sections with their names:\n ${context}.`,
      },
      {
        role: ChatCompletionRequestMessageRoleEnum.User,
        content: `
This text is the next part: "${currentChunkText}".
Split this text into logical paragraphs without changing the original content and
follow the next instructions:
1. Wrap each paragraph in <p> tag.
2. Analyze the text. If the final paragraphs introduce a new topic, remove them from the final text.
3. Considering the context of the previous parts, create a shortened version of the text that contains only the key points.
4. Create a unique name for this section based on the information within it.
5. Do not include any explanatory content. Provide an RFC8259 compliant JSON response in the following format without deviation:
{
  "textWithParagraphs": "...",
  "shortenedVersion": "...",
  "partName": "..."
}.`,
      }
    );
  }

  // Prompt for the final chunk: the text is processed to the very end,
  // so there is no instruction to discard trailing paragraphs.
  if (isLastChunk) {
    dialog.push(
      COMMON_INSTRUCTION,
      {
        role: ChatCompletionRequestMessageRoleEnum.Assistant,
        content: `These are the shortened versions of all previous sections with their names: ${context}.`,
      },
      {
        role: ChatCompletionRequestMessageRoleEnum.User,
        content: `
This text is the last part: "${currentChunkText}".
Split this text into logical paragraphs without changing the original content and
follow the next instructions:
1. Wrap each paragraph in <p> tag.
2. Considering the context of the previous parts, create a shortened version of the text that contains only the key points.
3. Create a unique name for this section based on the information within it.
4. Do not include any explanatory content. Provide an RFC8259 compliant JSON response in the following format without deviation:
{
  "textWithParagraphs": "...",
  "shortenedVersion": "...",
  "partName": "..."
}.`,
      }
    );
  }

  const completionData = await processChunk(dialog);

  if (completionData.data.choices[0].message) {
    const responseContent = JSON.parse(completionData.data.choices[0].message.content);
    processedChapters.push({
      order: chunkNumber,
      partName: responseContent.partName,
      shortenedVersion: responseContent.shortenedVersion,
      textWithParagraphs: responseContent.textWithParagraphs,
    });
    chunkNumber += 1;
  }
}

It now looks like we have a complete loop in place. Let's delve into the details of what's happening here. Firstly, to interact with the chat, we need to create a message array. Each message is a standard object with the following properties:

{
  role: ChatCompletionRequestMessageRoleEnum.Assistant, // The role of the message's
  // author. Possible values: system, user, assistant, or function.
  content: `Message text`, // The content of the message. This field is mandatory for
  // all messages, and may be null for assistant messages with function calls.
}

You can read more about the roles and their significance in the OpenAI chat documentation.

Next, we see that we have one common system message, COMMON_INSTRUCTION, for each chunk; it helps set the overall context for interacting with the chat.

Lastly, we observe three prompts, each varying depending on which chunk the loop is processing at the moment. Take a closer look at these prompts to better grasp their distinctions.

Great! We now have a fully functional mechanism in place, which we've termed the Logical Loop.

After completing this loop, once we have an array of condensed versions of all the sections at our disposal, we can take it a step further. We can make another separate request to ask the chat to produce a single shortened version for the entire transcription.
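
A minimal sketch of such a request, reusing the helpers from above (the prompt wording is illustrative):

// Hypothetical follow-up request: condense all chapter summaries into one.
const { context } = getSolidTextFromChapters(processedChapters);
const summaryCompletion = await processChunk([
  {
    role: ChatCompletionRequestMessageRoleEnum.System,
    content: 'You are an experienced article writer with a technical background.',
  },
  {
    role: ChatCompletionRequestMessageRoleEnum.User,
    content: `These are the shortened versions of all sections with their names:\n ${context}.
Combine them into a single concise summary of the entire transcription.`,
  },
]);
const fullSummary = summaryCompletion.data.choices[0].message?.content;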

Additionally, as mentioned earlier, we utilized the timestamps of individual sentences to integrate with Vimeo.
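
For example, assuming each entry in sentencesWithTimestamps carries a start field (the exact shape comes from part 1, so treat the field name as an assumption), a chapter's start time can be recovered by counting the sentences that precede it:

// Hypothetical helper: derive each chapter's start time from sentence timestamps.
const getChapterStartTimes = (processedChapters, sentencesWithTimestamps) => {
  let sentenceOffset = 0;
  return processedChapters.map((chapter) => {
    const start = sentencesWithTimestamps[sentenceOffset]?.start; // assumed field name
    sentenceOffset += splitTextToSentences(clearText(chapter.textWithParagraphs)).length;
    return { partName: chapter.partName, start };
  });
};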

Now our transcription looks cool: [image: the processed transcription]

Dive into our portal and, using this transcription as an example, witness how this loop operates and the impressive results it can achieve.

Conclusion

The Logical Loop is a scalable solution that can process textual content of any length by dividing the work into parts while maintaining logical integrity, with the flexibility made possible by the integration with ChatGPT.

The goal is to provide accurate and reader-friendly transcriptions for video content. The overall process can be divided into two main parts. The first part involves processing video and audio files, which is necessary for the subsequent use of the Whisper tool. The second part encompasses the implementation of the Logical Loop for processing text transcriptions.

Let's summarize and highlight the key steps of this article:

  1. Downloading Video and Extracting Audio:
  • Download the video from a hosting platform (e.g., Vimeo).
  • Extract the audio track from the video using tools like ffmpeg-extract-audio.
  2. Using Whisper to Extract Text Transcription from Audio:
  • Utilize OpenAI's Whisper system to transcribe the audio into text.
  • The resulting transcription contains two important fields: text and segments.
  3. Processing the segments Array from Whisper and Selecting Fixed Chunks for Processing:
  • Break down the continuous transcription into individual sentences or logical chunks.
  • Determine a fixed chunk size for each iteration.
  • Implement the Logical Loop mechanism to process chunks with ChatGPT.
  4. Prompt Engineering for Iterative Transcription Processing:
  • Create different types of prompts for the first, intermediate, and concluding chunks.
  • Prompts guide ChatGPT to divide text into logical paragraphs, analyze content, create shortened versions, and generate unique chapter names.
  • Maintain context across iterations to improve the quality of ChatGPT's output.

Creating it isn't all that complicated. However, it's worth noting that during the development process, you'll likely have to conduct extensive analytical work to best tailor the chat specifically for your tasks.

But what if the chat starts "hallucinating" or doesn't return the desired results? We've dedicated a lot of time to addressing such issues and fine-tuning the optimal model configuration for our specific tasks.

We will delve deeper into this topic in our next article.

WRITTEN BY

Artur Nikitsin

Senior Engineer at FocusReactive