iOS Speech Framework in Practice: A Complete Guide to Speech-to-Text

I. Speech Framework Overview

Apple introduced the Speech framework in iOS 10. Centered on the SFSpeechRecognizer class, it provides powerful speech-to-text capabilities. Unlike traditional third-party SDKs, the framework is deeply integrated into iOS: it supports real-time recognition, on-device recognition, and multiple languages, and it strictly follows Apple's privacy rules.

The framework's core components are:

  • SFSpeechRecognizer: the recognizer itself; creates and manages recognition tasks
  • SFSpeechAudioBufferRecognitionRequest: recognition request fed from a live audio stream
  • SFSpeechURLRecognitionRequest: recognition request for a recorded audio file
  • SFSpeechRecognitionTask: represents a recognition task in flight
  • SFSpeechRecognitionResult: wraps the recognized transcriptions

II. Basic Implementation

1. Permission Configuration

Add two usage-description keys to Info.plist:

```xml
<key>NSSpeechRecognitionUsageDescription</key>
<string>Speech recognition is used to convert your voice to text</string>
<key>NSMicrophoneUsageDescription</key>
<string>The microphone is needed to capture your voice</string>
```
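Beyond the Info.plist keys, both permissions must also be requested at runtime before recognition starts. A minimal sketch (the helper name is ours, not part of the framework):

```swift
import Speech
import AVFoundation

/// Requests speech-recognition and microphone permission, then reports
/// whether both were granted. Call this before starting any recognition.
func requestPermissions(completion: @escaping (Bool) -> Void) {
    SFSpeechRecognizer.requestAuthorization { status in
        guard status == .authorized else {
            DispatchQueue.main.async { completion(false) }
            return
        }
        AVAudioSession.sharedInstance().requestRecordPermission { granted in
            DispatchQueue.main.async { completion(granted) }
        }
    }
}
```

The completion handlers arrive on a background queue, which is why both results are hopped back to the main queue before the caller touches UI.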

2. Core Recognition Flow

```swift
import Speech
import AVFoundation

class SpeechRecognizer {
    private var speechRecognizer: SFSpeechRecognizer?
    private var recognitionRequest: SFSpeechAudioBufferRecognitionRequest?
    private var recognitionTask: SFSpeechRecognitionTask?
    private let audioEngine = AVAudioEngine()

    func startRecognition() {
        // 1. Create the recognizer (restricted to Simplified Chinese here)
        speechRecognizer = SFSpeechRecognizer(locale: Locale(identifier: "zh-CN"))
        guard let recognizer = speechRecognizer, recognizer.isAvailable else { return }

        // 2. Create the live-audio recognition request
        recognitionRequest = SFSpeechAudioBufferRecognitionRequest()
        guard let request = recognitionRequest else { return }
        request.shouldReportPartialResults = true

        // 3. Start the recognition task (weak self avoids a retain cycle)
        recognitionTask = recognizer.recognitionTask(with: request) { [weak self] result, error in
            if let result = result {
                // Partial results arrive continuously; show them live
                let bestString = result.bestTranscription.formattedString
                print("Partial result: \(bestString)")
                if result.isFinal {
                    print("Final result: \(bestString)")
                }
            }
            if let error = error {
                print("Recognition error: \(error.localizedDescription)")
                self?.stopRecognition()
            }
        }

        // 4. Configure the audio session and feed microphone buffers to the request
        let audioSession = AVAudioSession.sharedInstance()
        try? audioSession.setCategory(.record, mode: .measurement, options: .duckOthers)
        try? audioSession.setActive(true, options: .notifyOthersOnDeactivation)

        let inputNode = audioEngine.inputNode
        let recordingFormat = inputNode.outputFormat(forBus: 0)
        inputNode.installTap(onBus: 0, bufferSize: 1024, format: recordingFormat) { [weak self] buffer, _ in
            self?.recognitionRequest?.append(buffer)
        }
        audioEngine.prepare()
        try? audioEngine.start()
    }

    func stopRecognition() {
        audioEngine.stop()
        audioEngine.inputNode.removeTap(onBus: 0) // required before a new tap can be installed
        recognitionRequest?.endAudio()
        recognitionTask?.cancel()
        recognitionTask = nil
        recognitionRequest = nil
    }
}
```

III. Advanced Features

1. On-Device (Offline) Recognition

```swift
// List which locales support on-device recognition (iOS 13+)
func checkOfflineAvailability() {
    SFSpeechRecognizer.supportedLocales().forEach { locale in
        let recognizer = SFSpeechRecognizer(locale: locale)
        print("\(locale.identifier) on-device: \(recognizer?.supportsOnDeviceRecognition ?? false)")
    }
}

// Force on-device recognition (iOS 13+)
let request = SFSpeechAudioBufferRecognitionRequest()
request.shouldReportPartialResults = true
request.requiresOnDeviceRecognition = true // fail rather than fall back to the server
```

2. Multi-Language Recognition

Each SFSpeechRecognizer is bound to a single locale, and the framework exposes no API for mixing languages within one recognizer. To change the language, create a new recognizer and restart the task:

```swift
// Switch the recognition language by recreating the recognizer
func switchLanguage(to localeIdentifier: String) {
    recognitionTask?.cancel()
    speechRecognizer = SFSpeechRecognizer(locale: Locale(identifier: localeIdentifier))
    // A new recognitionTask must then be started with the new recognizer
}
```

In practice the zh-CN model copes reasonably well with occasional embedded English words; for genuinely mixed input, run recognizers for both locales in parallel and keep the better result.

3. Audio File Recognition

```swift
// Keep a strong reference to the recognizer (e.g. in a property):
// if it is deallocated, the in-flight task may be cancelled.
func recognizeAudioFile(url: URL) {
    let request = SFSpeechURLRecognitionRequest(url: url)
    let recognizer = SFSpeechRecognizer(locale: Locale(identifier: "zh-CN"))
    recognizer?.recognitionTask(with: request) { result, error in
        if let transcription = result?.bestTranscription {
            print("File transcription: \(transcription.formattedString)")
        }
    }
}
```

IV. Performance Optimization

1. Memory Management

  • Cache recognizer instances that are reused often (e.g. in an NSCache) instead of recreating them for every session
  • Capture self weakly in the recognitionTask callback to avoid retain cycles
  • Process long audio in segments rather than in one huge request
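The retain-cycle point can be illustrated without the Speech framework at all. In the sketch below, MockTask is a stand-in for SFSpeechRecognitionTask (it retains its callback, just as the real task retains its result handler); the weak capture is what lets the owner deallocate:

```swift
// Framework-free illustration of the weak-capture advice above.
// MockTask stands in for SFSpeechRecognitionTask: it retains its callback.
final class MockTask {
    var callback: ((String) -> Void)?
}

final class Session {
    let task = MockTask()
    var lastResult = ""

    func start() {
        // [weak self] breaks the cycle Session -> task -> callback -> Session
        task.callback = { [weak self] text in
            self?.lastResult = text
        }
    }
}

var session: Session? = Session()
weak var probe = session
session?.start()
session?.task.callback?("hello")
let resultBeforeRelease = session?.lastResult
session = nil
// With [weak self] the Session deallocates; a strong capture would leak it.
print(resultBeforeRelease ?? "", probe == nil) // hello true
```

If the closure captured self strongly, `probe` would still be non-nil after `session = nil`: the task would keep the session alive, which is exactly the leak the bullet warns about.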

2. Improving Recognition Accuracy

```swift
// Configure the recognition request
let request = SFSpeechAudioBufferRecognitionRequest()
request.shouldReportPartialResults = true          // stream partial results for live feedback
request.taskHint = .dictation                      // optimize for long-form dictation
request.contextualStrings = ["SwiftUI", "Core ML"] // bias recognition toward domain vocabulary
```

Note that the request has no maximum-duration property; server-based live recognition has historically been capped at roughly one minute per session, so long dictation should restart the task periodically.

3. Error Handling

The framework reports failures as plain NSError values rather than a typed Swift error, so mapping them to app-level errors is partly heuristic:

```swift
enum RecognitionError: Error {
    case authorizationDenied
    case audioEngineFailed
    case recognitionServiceUnavailable
}

func handleRecognitionError(_ error: Error) -> RecognitionError? {
    // Authorization problems are better checked directly than parsed
    // out of the error object
    if SFSpeechRecognizer.authorizationStatus() != .authorized {
        return .authorizationDenied
    }
    // Service-level failures (busy, disabled, no network) surface as NSError
    // values in Apple's assistant error domain; the codes are undocumented
    let nsError = error as NSError
    if nsError.domain == "kAFAssistantErrorDomain" {
        return .recognitionServiceUnavailable
    }
    return nil
}
```

V. Practical Application Scenarios

1. Voice Input for Instant Messaging

```swift
// Integrating voice input into a UITextView. This assumes SpeechRecognizer
// has been given a callback-based startRecognition(_:) variant that
// delivers recognized text instead of printing it.
class VoiceInputTextView: UITextView {
    private let speechRecognizer = SpeechRecognizer()

    @objc func startVoiceInput() {
        speechRecognizer.startRecognition { [weak self] text in
            DispatchQueue.main.async {
                self?.insertText(text)
            }
        }
    }
}
```

2. Meeting Transcription

```swift
// Meeting-scenario sketch: attribute transcript text to speakers.
// Real speaker identification needs voiceprint analysis; here it is
// simplified to "a gap longer than 5 seconds means the speaker changed".
class MeetingRecorder {
    private var speakers: [String: String] = [:]
    private var currentSpeaker = "Speaker 1"
    private var lastSpeakerChangeTime: TimeInterval = 0

    func processRecognitionResult(_ result: SFSpeechRecognitionResult) {
        let transcript = result.bestTranscription
        let currentTime = Date().timeIntervalSince1970
        if currentTime - lastSpeakerChangeTime > 5 {
            currentSpeaker = determineSpeaker()
            lastSpeakerChangeTime = currentTime
        }
        speakers[currentSpeaker, default: ""] += transcript.formattedString
    }

    private func determineSpeaker() -> String {
        // Placeholder: real code would use diarization / voiceprints
        return "Speaker \(speakers.count + 1)"
    }
}
```

VI. Common Problems and Solutions

1. Reducing Recognition Latency

  • Prefer on-device recognition where supported (request.requiresOnDeviceRecognition = true): removing the network round-trip is the biggest latency win
  • Use a smaller tap buffer (inputNode.installTap(..., bufferSize: 512)) so audio reaches the recognizer more frequently
  • Restart long live sessions periodically instead of letting them run unbounded

2. Handling Audio Interruptions

```swift
func setupInterruptionHandler() {
    NotificationCenter.default.addObserver(
        forName: AVAudioSession.interruptionNotification,
        object: nil, queue: nil
    ) { [weak self] notification in
        guard let userInfo = notification.userInfo,
              let typeValue = userInfo[AVAudioSessionInterruptionTypeKey] as? UInt,
              let type = AVAudioSession.InterruptionType(rawValue: typeValue) else { return }
        switch type {
        case .began:
            self?.stopRecognition()
        case .ended:
            // Resume only if the system indicates it is appropriate
            if let optionsValue = userInfo[AVAudioSessionInterruptionOptionKey] as? UInt,
               AVAudioSession.InterruptionOptions(rawValue: optionsValue).contains(.shouldResume) {
                self?.startRecognition()
            }
        @unknown default:
            break
        }
    }
}
```

3. Dialect Support

```swift
// Run the same audio through recognizers for several Chinese locales and
// compare the results. Check supportedLocales() at runtime; not every
// locale (e.g. Cantonese "yue-CN") is available on every system.
let dialectRecognizers: [SFSpeechRecognizer] = [
    "zh-CN",  // Mandarin (Simplified)
    "zh-TW",  // Mandarin (Traditional)
    "yue-CN"  // Cantonese
].compactMap { SFSpeechRecognizer(locale: Locale(identifier: $0)) }

func recognizeWithDialects(audioBuffer: AVAudioPCMBuffer) {
    let group = DispatchGroup()
    var results: [String] = []
    for recognizer in dialectRecognizers {
        group.enter()
        // Each task needs its own request
        let request = SFSpeechAudioBufferRecognitionRequest()
        request.append(audioBuffer)
        request.endAudio() // no more audio: lets the task produce a final result
        recognizer.recognitionTask(with: request) { result, error in
            if let result = result, result.isFinal {
                results.append(result.bestTranscription.formattedString)
            }
            if result?.isFinal == true || error != nil {
                group.leave()
            }
        }
    }
    group.notify(queue: .main) {
        // Pick the best candidate, e.g. by average segment confidence
        print(results)
    }
}
```

VII. Best Practices

  1. Permission management: check speech-recognition authorization at app launch so the permission dialog never interrupts an active session
  2. Resource cleanup: implement deinit to release the audio engine, request, and task
  3. State management: keep an explicit recognition state (idle / recognizing / paused)
  4. Test coverage: focus on these scenarios:
    • mixed Chinese and English input
    • offline recognition while the network is down
    • long continuous sessions (more than 30 minutes)
    • audio input at different sample rates
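The state-management advice can be made concrete with a small state machine. The types below are illustrative, not part of the Speech API:

```swift
// Illustrative recognition state machine (not part of the Speech framework).
enum RecognitionState: Equatable {
    case idle, recognizing, paused
}

struct RecognitionStateMachine {
    private(set) var state: RecognitionState = .idle

    // Returns true if the transition is legal and was applied.
    mutating func transition(to next: RecognitionState) -> Bool {
        switch (state, next) {
        case (.idle, .recognizing),
             (.recognizing, .paused),
             (.recognizing, .idle),
             (.paused, .recognizing),
             (.paused, .idle):
            state = next
            return true
        default:
            return false // e.g. cannot pause while idle
        }
    }
}

var machine = RecognitionStateMachine()
print(machine.transition(to: .paused))      // false: cannot pause from idle
print(machine.transition(to: .recognizing)) // true
print(machine.transition(to: .paused))      // true
```

Routing every start/stop/interruption event through one transition function makes illegal sequences (such as starting the audio engine twice) impossible rather than merely unlikely.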

VIII. Future Directions

  1. Use SFSpeechRecognizer.supportedLocales() (available since iOS 10) to enumerate supported languages at runtime
  2. Combine with Core ML to optimize for custom vocabularies
  3. Filter low-quality output using the confidence property on SFTranscriptionSegment
  4. Explore multimodal input by combining the Speech and Vision frameworks
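Confidence filtering can be prototyped framework-free. The Segment type below mirrors the two SFTranscriptionSegment properties it uses (substring and confidence) but is a stand-in, not the real API:

```swift
// Stand-in for SFTranscriptionSegment: each recognized word carries a
// confidence score in 0.0...1.0 (the real API reports 0 while undetermined).
struct Segment {
    let substring: String
    let confidence: Float
}

// Keep only the words the recognizer was reasonably sure about.
func filterTranscript(_ segments: [Segment], minConfidence: Float) -> String {
    segments
        .filter { $0.confidence >= minConfidence }
        .map { $0.substring }
        .joined(separator: " ")
}

let segments = [
    Segment(substring: "open", confidence: 0.92),
    Segment(substring: "the", confidence: 0.88),
    Segment(substring: "hatch", confidence: 0.31), // likely misheard
]
print(filterTranscript(segments, minConfidence: 0.5)) // open the
```

In a real app the segments would come from `result.bestTranscription.segments`, and dropped words could be shown greyed-out for the user to confirm rather than silently discarded.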

With a solid grasp of these Speech framework capabilities, developers can build speech-to-text features that rival dedicated recognition apps. In practice, start from the basic implementation, add advanced features incrementally, and keep refining the recognition experience through performance testing.