Did you know that there’s a new iOS 10 API for transcribing audio in your iOS apps? I didn’t know it was possible myself until I came across the Speech Framework introduced in iOS 10. In fact, I was watching one of Sam Davies’ screencasts, Audio File Speech Transcription, on Ray Wenderlich’s new video site (both the screencast and the new video site are fantastic, by the way), where he covered how the API can be used to transcribe an audio file. The screencast itself is a great short overview of how to use the framework to create an app that transcribes 90’s rap music files. I’m not going to try to recreate a tutorial on the basics of the API. Instead, when trying it out myself I discovered a couple of nuances not mentioned in Sam’s screencast. Here are a couple of SFSpeechRecognizer tips.
Requires A Device
Use of SFSpeechRecognizer, the main class that makes speech transcription possible, will not actually transcribe anything unless you are running your app on a device (as of Xcode 8 beta 6). This was a surprise to me, especially considering that speech transcription of an audio file, rather than microphone input, has nothing to do with a physical device. I wonder if it has something to do with the underlying implementation of Siri that only exists on an actual device. Regardless, you are lucky enough to have access to a Bool on SFSpeechRecognizer called isAvailable, which simply indicates whether speech recognition is available at the time of use. I was banging my head trying to figure out how to get Sam’s sample project to work in the iOS Simulator. His screencast seemed to transcribe speech no problem in the app I was viewing on screen. Finally I looked closer and noticed that he was screen sharing from an iOS device through QuickTime! Mystery solved! Either way, don’t make the same mistake I did and wonder why code like this doesn’t work:
import Speech

guard let recognizer = SFSpeechRecognizer() else {
    // Returns nil when the device's default locale isn't supported for recognition.
    return
}
if !recognizer.isAvailable {
    // false in the Simulator, and on a device without connectivity.
    print("Speech recognition not available")
    return
}
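One related prerequisite that the snippet above assumes you’ve already handled: recognition also won’t run until the user grants permission. Here’s a minimal sketch of the standard authorization dance (the print statements are just placeholders): add an NSSpeechRecognitionUsageDescription entry to your Info.plist, then call SFSpeechRecognizer.requestAuthorization before starting a task:

import Speech

SFSpeechRecognizer.requestAuthorization { status in
    // The handler may be called on a background queue; hop to main before touching UI.
    OperationQueue.main.addOperation {
        switch status {
        case .authorized:
            print("Ready to transcribe")
        case .denied, .restricted, .notDetermined:
            print("Speech recognition not authorized: \(status)")
        }
    }
}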
Timeouts
The other interesting discovery I made when playing around with SFSpeechRecognizer is that there is an undocumented limitation on how big a file can be transcribed at once. I’m still pinning down exactly where the limit is, but I have discovered that longer-running SFSpeechURLRecognitionRequests will time out. And I’m not even talking that long: I had problems with a 5 minute video. For example, I tried transcribing my video Replace Temp With Query, which is 4 minutes and 45 seconds long, and this was all the text that was returned before the timeout happened:
Hey what’s up everybody Danny from clean swiffer.com I’m here to show you another we factor in this week from our valors book we factoring in improving the design of existing time this week we’re gonna take a look at the re-factoring called replaced temp with Cory please temp with berries are factoring that’s a lot like expect nothing but there’s one difference it’s a little more specific with a place template query we’re going to specifically target temporary variables within our code and extract them into reusable piece of code so
Yeah, not much (and not very accurate either). Either I should file a Radar for this, or Apple only intends this API to be used for transcribing short audio clips. Time will tell.
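As a commenter points out below, the WWDC 2016 Speech session mentions a strict audio duration limit of about one minute in iOS 10, which would explain what I’m seeing. One possible workaround, which I haven’t battle-tested, is to split a long recording into sub-minute chunks with AVFoundation and transcribe each chunk separately. Here’s a rough sketch; the chunk length and file names are arbitrary choices of mine:

import AVFoundation

// Untested sketch: export `asset` as ~50 second M4A chunks, each comfortably
// under the ~1 minute transcription limit, and hand each file URL to `handler`.
func exportChunks(of asset: AVAsset, handler: @escaping (URL) -> Void) {
    let chunkSeconds = 50.0
    let totalSeconds = CMTimeGetSeconds(asset.duration)
    var start = 0.0
    var index = 0
    while start < totalSeconds {
        guard let exporter = AVAssetExportSession(asset: asset, presetName: AVAssetExportPresetAppleM4A) else { return }
        let outputURL = URL(fileURLWithPath: NSTemporaryDirectory()).appendingPathComponent("chunk-\(index).m4a")
        try? FileManager.default.removeItem(at: outputURL)
        exporter.outputURL = outputURL
        exporter.outputFileType = AVFileTypeAppleM4A
        let length = min(chunkSeconds, totalSeconds - start)
        exporter.timeRange = CMTimeRange(start: CMTime(seconds: start, preferredTimescale: 600),
                                         duration: CMTime(seconds: length, preferredTimescale: 600))
        exporter.exportAsynchronously {
            if exporter.status == .completed {
                // Each finished chunk can be fed to its own SFSpeechURLRecognitionRequest.
                handler(outputURL)
            }
        }
        start += chunkSeconds
        index += 1
    }
}

The obvious caveat is that a hard cut can land mid-word, so overlapping the chunks slightly, or cutting on silence, would probably give better results.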
Partial Transcription To The Rescue
Despite the undocumented timeout putting a crimp in my plans for using the Speech Framework in a couple of longer-running use cases, another Bool caught my eye: shouldReportPartialResults. It turns out that by setting this flag to true, the Speech Framework will periodically provide transcribed results to you as they are discovered. Just set the value to true and you’ll see results continuously reported as they are determined:
let request = SFSpeechURLRecognitionRequest(url: url)
request.shouldReportPartialResults = true
recognizer.recognitionTask(with: request) { (result, error) in
    if let error = error { print("Error: \(error)"); return }
    guard let result = result else { print("No result!"); return }
    // Each partial result contains the best transcription of the audio so far.
    print(result.bestTranscription.formattedString)
    if result.isFinal {
        print("Transcription complete.")
    }
}
Transcribing Realtime Playback
Despite these two shortcomings of short timeouts and requiring a device (which I hope Apple will fix at some point as the API matures), speech transcription is a really cool technology. Did you notice that voicemails in iOS 10 are automatically transcribed? It’s freakin’ awesome that you can glance at the text of a voicemail rather than needing to listen to it.
Anyway, another really cool real-world example of the Speech Framework in action is the open source GitHub project Speech Recognition from zats. Apparently with some help from Apple, he came up with a way to transcribe a video on the fly. There’s some gnarly AVFoundation Objective-C code that makes this possible. Be sure to take a look at his project and give it a go. In fact, I wondered whether I could use the techniques there to work around the timeout limitation I experienced with raw use of SFSpeechRecognizer. (Update: Turns out I could!)
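The core idea in his project is feeding audio buffers to an SFSpeechAudioBufferRecognitionRequest as they’re produced, instead of handing the recognizer a finished file. I won’t reproduce his gnarly MTAudioProcessingTap code here, but the buffer-driven API is easiest to see with a plain microphone tap. Here’s a minimal sketch of that standard pattern, not code from his project; it assumes the recognizer from earlier plus an NSMicrophoneUsageDescription entry in your Info.plist:

import Speech
import AVFoundation

let audioEngine = AVAudioEngine()

func startLiveTranscription(with recognizer: SFSpeechRecognizer) throws {
    let request = SFSpeechAudioBufferRecognitionRequest()
    request.shouldReportPartialResults = true

    // inputNode is optional in the iOS 10 SDK.
    guard let inputNode = audioEngine.inputNode else { return }
    let format = inputNode.outputFormat(forBus: 0)
    // Append every captured buffer to the recognition request as it arrives.
    inputNode.installTap(onBus: 0, bufferSize: 1024, format: format) { buffer, _ in
        request.append(buffer)
    }

    audioEngine.prepare()
    try audioEngine.start()

    recognizer.recognitionTask(with: request) { (result, error) in
        if let result = result {
            print(result.bestTranscription.formattedString)
        }
        // When capture ends elsewhere, call request.endAudio() to finish the task.
    }
}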
Back Story
If you’ve read this far about SFSpeechRecognizer tips, I might as well bore you with some of the back story as to why the SFSpeechRecognizer API caught my interest. With the videos I create for this site, it’s important to me to provide a transcription of each video. I realize that watching a video isn’t always possible, so I like having a text alternative. It probably helps with SEO as well. The thing is, transcribing the videos is a tedious process. For a 5 minute video, it takes me about 30 minutes to transcribe, and I type fast. And that’s only raw transcription; I’d like to do so much more. For example, I think the Realm videos are really well done. Specifically, I like the links for jumping to specific spots in the video from the transcription, and I also like the source code samples embedded in the transcription. Doing that would take me even more time, so in an attempt to find a quick and easy fix to buy back some of it, I figured I could use the new iOS 10 Speech Framework to write an app that automatically transcribes my videos for me. I’m still working on it, and definitely leveraging these SFSpeechRecognizer tips.
They say that necessity is the mother of invention, right?
Wrap Up
How will you be putting these SFSpeechRecognizer tips to work? Have you had a chance to try out the new Speech Framework or any of the SFSpeechRecognizer APIs? Have you run into any other tricky spots or overcome any hurdles?
Happy cleaning.
Comments

Awesome work, thanks. A few observations: SFSpeechRecognizer calls home, as it requires a network connection. If you abuse it for too long, like I did when I made it listen to the radio, it will ban the application for a little while and return “Quota limit reached for resource: speech_api, Error Domain=SiriSpeechErrorDomain”. So be gentle to Siri.
No problem, I’m glad you enjoyed it! Thanks for chiming in!
Hi clean Swifter,
Nice work, and nice tips for overcoming the already-known problems. I experienced another problem with this API. I tried to use speech recognition together with some system sounds (search the database for matching text, and if found, stop the recording and play a sound). As soon as I start the speech recognition, no sound output is possible anymore. If someone has an idea how to overcome this issue, please let me know.
I’m having the same problem right now.
Did you find a solution?
Thanks
Thanks a lot for the tutorial. I’m trying to use it in my app, and I have an issue where the availabilityDidChange delegate method is never called. I have the SFSpeechRecognizer delegate set in viewDidLoad. Did you run into this problem too?
Thanks for the post. I was wondering if anyone has had any luck accessing the ‘timestamp’ property of a transcription segment. This newly available timing information is what I believe to be the coolest thing about the API and it doesn’t seem to work properly!
In the 2016 WWDC video about the speech API (https://developer.apple.com/videos/play/wwdc2016/509/), the engineer says:
“For iOS 10 we’re starting with a strict audio duration limit of about one minute which is similar to that of keyboard dictation.” (https://developer.apple.com/videos/play/wwdc2016/509/?time=592)
Hello,
Informative article. For some reason we are getting a message that says “Quota limit reached for resource”. Have you figured out under what scenarios this error message is received? Our utterance duration is definitely less than 1 minute. Is there a limit to the number of times this speech engine can be invoked from a device/app on a daily basis? It would be useful to know all the limitations.
Best,
Puga