At $0.006/min and $0.002/1k tokens, OpenAI's Whisper and ChatGPT APIs are cheap enough to play with. To learn more about them, I developed a "video-to-pdf" transcription system for recorded talks. Let's do some back-of-the-envelope calculations about this hypothetical system. It has two pieces: OpenAI's Whisper for the speech-to-text, and OpenAI's ChatGPT to clean up transcription errors and break the text into paragraphs.
A fast English speaker reaches around 160 words per minute. OpenAI's rule of thumb is that one token is about 0.75 words of standard English (roughly 1.3 tokens per word), so our hypothetical fast, non-stop speaker generates around 213 tokens per minute, or about 12,800 per hour. If we pass all of those through ChatGPT (one token out for each token in), we get the following costs:
| API     | Cost               | Cost for 1 hour of speech |
|---------|--------------------|---------------------------|
| Whisper | $0.006 / min       | 36 cents                  |
| ChatGPT | $0.002 / 1k tokens | 5.1 cents                 |
ChatGPT is basically free (Whisper is roughly seven times as expensive), and the whole thing still comes out to well under $0.50 to transcribe an hour of speech.
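As a sanity check on that arithmetic, here is the same estimate as a few lines of Python; the constants are the prices quoted above and OpenAI's words-per-token rule of thumb:

```python
# Rough cost of transcribing one hour of fast, non-stop speech.
WORDS_PER_MIN = 160
TOKENS_PER_WORD = 4 / 3               # 1 token is roughly 0.75 words
WHISPER_PER_MIN = 0.006               # USD
CHATGPT_PER_1K_TOKENS = 0.002         # USD

minutes = 60
tokens_in = WORDS_PER_MIN * TOKENS_PER_WORD * minutes   # ~12,800 tokens
tokens_out = tokens_in                                   # ChatGPT echoes the text back
whisper_cost = WHISPER_PER_MIN * minutes                 # $0.36
chatgpt_cost = (tokens_in + tokens_out) / 1000 * CHATGPT_PER_1K_TOKENS  # ~$0.05
print(f"Whisper: ${whisper_cost:.2f}  ChatGPT: ${chatgpt_cost:.2f}")
```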
The high-level design is:

1. Download the video (yt-dlp) and extract the audio track (ffmpeg).
2. Split the audio on silences into chunks of at most five minutes.
3. Transcribe each chunk with the Whisper API.
4. Ask ChatGPT to break the raw transcript into paragraphs.
5. Grab a video frame for each paragraph (ffmpeg), assemble everything into markdown, and convert it to a PDF (Pandoc).
If you don't already have a local copy of a talk, consider something like yt-dlp, which can download video from most websites, including YouTube. Then I use ffmpeg to extract the audio track from the video:
```
ffmpeg -i input.mp4 -map 0:a output.mp3
```
This audio track will be provided to OpenAI's Whisper API.
OpenAI's Whisper has a 25 MiB limit (at least for the time being). Therefore, long audio tracks need to be split into chunks of at most 25 MiB.
If we're not smart about where we split the audio, we might cut a word in half, which will hurt the accuracy of the Whisper transcription for those words. We'd rather make shorter chunks, split where there is silence in the video. From a monetary cost perspective, it actually doesn't matter how short the chunks are: OpenAI bills us for each second of audio and for each token processed by ChatGPT, and the total audio length and token count are the same regardless of how the audio is chunked.
The most pressing limit is ChatGPT's context limit of around 4k tokens. For our purposes, we expect slightly more than one output token for each input token, since ChatGPT is asked to reproduce the input text with paragraph breaks added. This limits our input to around 2000 tokens per API call. At roughly 213 tokens per minute, we'd reach that limit after about nine minutes of speech. In practice, ChatGPT has a hard time faithfully reproducing 2000 tokens of text, so I use a five-minute window instead.
To (attempt to) avoid splitting words, I use pydub to detect silence. I pick an arbitrary silence threshold and relax it until no noisy region is longer than our five-minute chunk limit. That means there is (hopefully) a safe place to split the audio at least every five minutes.
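A minimal sketch of that loop with pydub's detect_nonsilent; the starting threshold, the 500 ms definition of silence, and the 2 dB step are placeholder values, not necessarily the ones the real script uses:

```python
from pydub import AudioSegment
from pydub.silence import detect_nonsilent

FIVE_MIN_MS = 5 * 60 * 1000

audio = AudioSegment.from_file("output.mp3")

# Start with a strict definition of silence and relax it (treat more of the
# audio as silent) until every noisy region fits inside a five-minute chunk.
silence_thresh = audio.dBFS - 40
noisy_regions = detect_nonsilent(audio, min_silence_len=500, silence_thresh=silence_thresh)
while any(end - start > FIVE_MIN_MS for start, end in noisy_regions):
    silence_thresh += 2   # a higher threshold counts more of the audio as silence
    noisy_regions = detect_nonsilent(audio, min_silence_len=500, silence_thresh=silence_thresh)

# noisy_regions is a list of [start_ms, end_ms] spans, each shorter than five
# minutes; the gaps between them are the candidate split points.
```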
This leaves many very short audio segments. OpenAI says Whisper does better with as much context as possible, so I greedily recombine the small chunks into segments no longer than five minutes, combining them largest-to-smallest so that the smallest chunks have the best chance of being squeezed in beside their larger neighbors (see the sketch below). These recombined segments may contain some silent regions; that's fine. The only downside is that you pay OpenAI to transcribe nothing during those silences.
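The exact greedy order isn't spelled out above, so here is one plausible reading as a sketch: merge adjacent spans from the silence detection above, preferring merges that involve the longest remaining span, until nothing more fits under five minutes. Merging only adjacent spans keeps the transcript in order, and any silence between them simply comes along for the ride. The span representation and the recombine name are mine:

```python
FIVE_MIN_MS = 5 * 60 * 1000

def recombine(spans):
    # spans: list of (start_ms, end_ms) noisy regions, in time order.
    spans = [tuple(s) for s in spans]
    while True:
        # Every adjacent pair whose combined length still fits in five minutes.
        candidates = []
        for i in range(len(spans) - 1):
            merged = (spans[i][0], spans[i + 1][1])
            if merged[1] - merged[0] <= FIVE_MIN_MS:
                candidates.append((i, merged))
        if not candidates:
            return spans
        # Prefer the merge whose larger member is the largest, so the smallest
        # spans end up squeezed in beside their bigger neighbours.
        i, merged = max(
            candidates,
            key=lambda c: max(spans[c[0]][1] - spans[c[0]][0],
                              spans[c[0] + 1][1] - spans[c[0] + 1][0]),
        )
        spans[i:i + 2] = [merged]

# Slice the original audio at the recombined boundaries and export each chunk.
for k, (start, end) in enumerate(recombine(noisy_regions)):
    audio[start:end].export(f"chunk_{k:03d}.mp3", format="mp3")
```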
Each five-minute audio file is provided to OpenAI's Whisper API. The resulting text is unformatted, with no metadata, but does have punctuation. I then pass it to ChatGPT with the following prompt:
```
System: You split text into paragraphs and correct transcription errors
User: Split the following text into paragraphs WITHOUT adding or removing anything:\n{text}
```
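In code, the two API calls look roughly like this with a recent version of the openai Python package; the client interface has changed since this system was written, so treat the exact call names as illustrative:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def transcribe(chunk_path: str) -> str:
    # Whisper: audio in, unformatted but punctuated text out.
    with open(chunk_path, "rb") as f:
        return client.audio.transcriptions.create(model="whisper-1", file=f).text

def add_paragraphs(text: str) -> str:
    # ChatGPT: reproduce the transcript with paragraph breaks added.
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system",
             "content": "You split text into paragraphs and correct transcription errors"},
            {"role": "user",
             "content": "Split the following text into paragraphs WITHOUT adding "
                        f"or removing anything:\n{text}"},
        ],
    )
    return response.choices[0].message.content
```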
ChatGPT is quite good at splitting this unformatted text into paragraphs. I also treat the breaks between the five-minute chunks as paragraph breaks, which works fine in practice since there was a silent pause there anyway.
The final ingredient is a screen capture of the video to go along with each paragraph. I know the timestamp range of each five-minute chunk, and I can look up which chunk each paragraph came from and where within that chunk it falls. The source chunk plus the position within it gives an accurate timestamp for each paragraph of text, and I use ffmpeg to extract a frame from the video at that timestamp:
```
ffmpeg -y -ss 01:23:45 -i input.webm -frames:v 1 -q:v 2 output.jpg
```
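To make the timestamp bookkeeping concrete, here is a sketch: assume speech is spread roughly evenly through a chunk, so a paragraph that starts some fraction of the way into the chunk's text starts that same fraction of the way into the chunk's audio. The helper names and the uniform-speech assumption are mine:

```python
import subprocess

def paragraph_timestamp(chunk_start_s, chunk_end_s, chunk_text, paragraph_offset):
    # paragraph_offset: character index where the paragraph begins in chunk_text.
    fraction = paragraph_offset / len(chunk_text)
    return chunk_start_s + fraction * (chunk_end_s - chunk_start_s)

def grab_frame(video_path, timestamp_s, out_path):
    # Same ffmpeg invocation as above, with the timestamp given in seconds.
    subprocess.run(
        ["ffmpeg", "-y", "-ss", f"{timestamp_s:.2f}", "-i", video_path,
         "-frames:v", "1", "-q:v", "2", out_path],
        check=True,
    )
```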
A markdown document is generated by inserting each paragraph in turn. A screenshot is inserted alongside each paragraph, unless it is too similar to the previously inserted screenshot; that happens when the speaker lingers on a slide for a while, generating a lot of text without the video changing much. Finally, I use Pandoc to convert the markdown file into a PDF.
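Putting the document together might look something like the following sketch; too_similar is the dhash comparison described in the next section, the file names are made up, and Pandoc needs a LaTeX engine installed to produce PDF output:

```python
import subprocess

def build_pdf(paragraphs, frames, out_md="talk.md", out_pdf="talk.pdf"):
    # paragraphs: list of strings; frames: matching list of screenshot paths.
    lines = []
    last_kept = None
    for text, frame in zip(paragraphs, frames):
        if last_kept is None or not too_similar(last_kept, frame):
            lines.append(f"![]({frame})\n")
            last_kept = frame
        lines.append(text + "\n")
    with open(out_md, "w") as f:
        f.write("\n".join(lines))
    subprocess.run(["pandoc", out_md, "-o", out_pdf], check=True)
```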
How do I decide whether a frame is "too similar" to a previous frame?
I experimented with a few options and settled on the dhash function in imagehash.
A description of dhash is provided here.
In short, the difference hash works like this: shrink the frame to a tiny grayscale image (9x8 pixels by default), compare each pixel to the one on its right, and record a single bit for whether it got brighter or darker. That produces a 64-bit fingerprint per frame, and two frames are compared by counting how many of those bits differ.
For my purposes, we want to call small variations between frames "the same", since many videos of talks have a small overlay of the presenter speaking. However, we don't want to be too liberal, since it's also common for slides to change only incrementally as a concept is explained. I settled on a difference of one bit as a reasonable test. If the overlay of the speaker is too large this doesn't work quite as well, but I'd rather include too many images in the output than too few.
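Here is roughly what that check looks like with imagehash; the one-bit cutoff is the one described above, and the too_similar name (matching the assembly sketch earlier) is mine:

```python
from PIL import Image
import imagehash

def too_similar(frame_a, frame_b, max_bits=1):
    # dhash produces a 64-bit fingerprint; subtracting two hashes gives the
    # Hamming distance, i.e. how many of those bits differ.
    return imagehash.dhash(Image.open(frame_a)) - imagehash.dhash(Image.open(frame_b)) <= max_bits
```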