SFSpeechRecognizer Tips for iOS 10

Did you know that there’s a new iOS 10 API for transcribing audio in your iOS apps? I didn’t know that this was possible myself until I came across the Speech Framework introduced in iOS 10. In fact, I was actually watching one of Sam Davies’ screencasts on Ray Wenderlich’s new video site called Audio File Speech Transcription (both the screencast and the new videos site are fantastic by the way) where he covered how the API can be used to transcribe an audio file. The screencast itself is a fantastic short overview of how to use the framework to create an app that transcribes 90’s rap music files. I’m not going to try and recreate a tutorial on the basics of the API. Instead, when trying it out myself I discovered a couple nuances not mentioned in Sam’s screencast. Here’s a couple SFSpeechRecognizer tips.

Requires A Device

Use of SFSpeechRecognizer, the main class that makes speech transcription possible, will not actually transcribe anything unless you are running your app on a device (as of Xcode 8 beta 6). This was a surprise to me, especially considering that speech transcription of an audio file, rather than microphone input, has nothing to do with a physical device. I’m wondering if it has something to do with the underlying implementation of Siri and something that only exists on the actual device. Regardless, you are lucky enough to have access to a Bool on SFSpeechRecognizer called isAvailable. This Bool simply indicates whether speech recognition is available at the time of usage. I was actually banging my head trying to figure out how to get Sam’s sample project to actually work within the iOS Simulator. His screencast seemed to transcribe speech no problem in the app I was viewing on screen. Finally I looked closer and noticed that he was screensharing from an iOS device through Quicktime! Mystery solved! Either way, don’t make the same mistake as me and wonder why code like this didn’t work:

guard let recognizer = SFSpeechRecognizer() else {
  return
}

if !recognizer.isAvailable {
  print("Speech recognition not available")
  return
}

Timeouts

The other interesting discovery I made when playing around with SFSpeechRecognizer is that there is an undocumented limitation on how big of a file can be transcribed at once. I’m still playing around with the details as to where the limits are, but I have discovered, that longer running SFSpeechURLRecognitionRequest will timeout. I’m not even talking that long, like I had problems with a 5 minute video. For example, I tried transcribing my video Replace Temp With Query that is 4 minutes and 45 seconds long, and this was all the text that was returned before the timeout happens:

Hey what’s up everybody Danny from clean swiffer.com I’m here to show you another we factor in this week from our valors book we factoring in improving the design of existing time this week we’re gonna take a look at the re-factoring called replaced temp with Cory please temp with berries are factoring that’s a lot like expect nothing but there’s one difference it’s a little more specific with a place template query we’re going to specifically target temporary variables within our code and extract them into reusable piece of code so

Ya, not much (and not very accurate either). Either I should create a Radar for this, or Apple only intends for this API to be used for transcription of short audio clips. Time will tell.

Partial Transcription To The Rescue

Despite the undocumented timeout putting a crimp in my plans for using the Speech Framework in a couple longer running use cases, another Bool caught my eye: shouldReportPartialResults. It turns out my setting this flag to true, the Speech Framework will periodically provide transcribed results to you as they are discovered. Just set the value to true and you’ll see results continuously reported as they are determined:

let request = SFSpeechURLRecognitionRequest(url: url)
request.shouldReportPartialResults = true
recognizer.recognitionTask(with: request) {
  (result, error) in
  guard error == nil else { print("Error: \(error)"); return }
  guard let result = result else { print("No result!"); return }

  print(result.bestTranscription.formattedString)
}

Transcribing Realtime Playback

Despite these two shortcomings of short timeouts and requiring a device (which I hope Apple will fix at some point as the API matures), speech transcription is a really cool technology. Did you notice that voicemails in iOS 10 are automatically transcribed? It’s freakin awesome that you can glance at text for a voicemail rather than needing to listen?

Anyway another really cool real-world example of use of the Speech Framework is in the GitHub open source project Speech Recognition from zats. Apparently with some help from Apple, he came up with a way to transcribe a video on the fly. There’s some gnarly AVFoundation Objective-C code that made this possible. Be sure to take a look at his project and give it a go. And in fact, I’m wondering if I can use the techniques here to work around the timeout limitation I experienced with raw use of SFSpeechRecognizer. (Update: Turns out it did!)

Back Story

If you’ve read this far about SFSpeechRecognizer tips, I might as well bore you with some details on the back story as to why the SFSpeechRecognizer API caught my interest. With the videos I create for this site, it’s important to me to provide a transcription of the videos. I realize that watching a video isn’t always possible, so I like having a text alternative. It probably also helps with SEO as well. The thing is, transcribing the videos is a tedious process. For about a 5 minute video, it takes me about 30 minutes to transcribe it, and I type fast. Additionally, that’s only raw transcription. I’d like to do so much more. For example, I think the Realm videos are really well done. Specifically, two things I like are the links to jump to specific spots in the video from the transcription, and I also like the source code samples in the transcription. For me to do this it would take more time, so in attempt to look for a quick and easy fix to buy back some time, I figured I could use the new iOS 10 Speech Framework to code a speech transcription app to automatically transcribe my videos for me. I’m still working on it and definitely leveraging these SFSpeechRecognizer tips.

They say that necessity is the mother of all invention, right?

Wrap Up

How will you be putting these SFSpeechRecognizer tips to work? Have you had a chance to try out the new Speech Framework or any of the SFSpeechRecognizer APIs? Have you run into any other tricky spots or overcome any hurdles?

Happy cleaning.