I. Technical Background and Functional Value
Speech-to-Text (STT) is one of the core technologies of mobile human-computer interaction and is widely used in voice input, meeting transcription, intelligent customer service, and similar scenarios. Android has shipped native speech recognition interfaces (RecognizerIntent and SpeechRecognizer) since its early API levels, and also supports integrating third-party speech recognition SDKs such as iFLYTEK and Tencent Cloud. Developers can choose between the lightweight system API and the more feature-rich commercial SDKs depending on project requirements.
II. System API Implementation
1. Basic Environment Setup
After creating a project in Android Studio, add the recording permission to AndroidManifest.xml:
<uses-permission android:name="android.permission.RECORD_AUDIO" />
<uses-permission android:name="android.permission.INTERNET" /> <!-- Required for network-based recognition -->
Example of requesting the permission at runtime:
companion object {
    // Arbitrary request codes used throughout this article
    private const val AUDIO_PERMISSION_CODE = 100
    private const val VOICE_RECOGNITION_REQUEST_CODE = 101
}

private fun checkAudioPermission() {
    if (ContextCompat.checkSelfPermission(this, Manifest.permission.RECORD_AUDIO)
        != PackageManager.PERMISSION_GRANTED
    ) {
        ActivityCompat.requestPermissions(
            this,
            arrayOf(Manifest.permission.RECORD_AUDIO),
            AUDIO_PERMISSION_CODE
        )
    }
}
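The original does not show how the user's response is handled; a minimal sketch, assuming the recognition flow is started by the startVoiceRecognition() function introduced in the next section, might look like this:

override fun onRequestPermissionsResult(
    requestCode: Int,
    permissions: Array<out String>,
    grantResults: IntArray
) {
    super.onRequestPermissionsResult(requestCode, permissions, grantResults)
    if (requestCode == AUDIO_PERMISSION_CODE) {
        if (grantResults.isNotEmpty() && grantResults[0] == PackageManager.PERMISSION_GRANTED) {
            // Permission granted: safe to start recording / recognition
            startVoiceRecognition()
        } else {
            Toast.makeText(this, "Recording permission denied", Toast.LENGTH_SHORT).show()
        }
    }
}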
2. Native Speech Recognition
Launch the system speech recognition UI with RecognizerIntent:
private fun startVoiceRecognition() {
    val intent = Intent(RecognizerIntent.ACTION_RECOGNIZE_SPEECH).apply {
        putExtra(
            RecognizerIntent.EXTRA_LANGUAGE_MODEL,
            RecognizerIntent.LANGUAGE_MODEL_FREE_FORM
        )
        putExtra(RecognizerIntent.EXTRA_MAX_RESULTS, 5) // Maximum number of recognition results
        putExtra(RecognizerIntent.EXTRA_PROMPT, "Please start speaking...")
    }
    try {
        startActivityForResult(intent, VOICE_RECOGNITION_REQUEST_CODE)
    } catch (e: ActivityNotFoundException) {
        Toast.makeText(this, "Speech recognition is not available on this device", Toast.LENGTH_SHORT).show()
    }
}
Handle the recognition result:
override fun onActivityResult(requestCode: Int, resultCode: Int, data: Intent?) {
    super.onActivityResult(requestCode, resultCode, data)
    if (requestCode == VOICE_RECOGNITION_REQUEST_CODE && resultCode == RESULT_OK) {
        val results = data?.getStringArrayListExtra(RecognizerIntent.EXTRA_RESULTS)
        results?.let {
            textView.text = it[0] // Show the first (most likely) result
        }
    }
}
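Note that startActivityForResult is deprecated on recent AndroidX versions. As a hedged alternative sketch (not part of the original article), the same flow can be expressed with the Activity Result API:

private val voiceRecognitionLauncher =
    registerForActivityResult(ActivityResultContracts.StartActivityForResult()) { result ->
        if (result.resultCode == RESULT_OK) {
            val texts = result.data?.getStringArrayListExtra(RecognizerIntent.EXTRA_RESULTS)
            texts?.firstOrNull()?.let { textView.text = it }
        }
    }

// Then launch the same intent with:
// voiceRecognitionLauncher.launch(intent)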
Pros and cons of the system API:
- ✅ No extra dependencies
- ✅ Good compatibility
- ❌ UI cannot be customized
- ❌ Limited features (e.g. no real-time streaming results; a streaming alternative is sketched below)
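If a custom UI or streaming partial results are needed while staying on the system API, android.speech.SpeechRecognizer can be used directly. A minimal sketch (not from the original article; availability depends on the device's recognition service):

private val speechRecognizer = SpeechRecognizer.createSpeechRecognizer(this).apply {
    setRecognitionListener(object : RecognitionListener {
        override fun onPartialResults(partialResults: Bundle?) {
            // Streaming hypotheses while the user is still speaking
            val texts = partialResults?.getStringArrayList(SpeechRecognizer.RESULTS_RECOGNITION)
            textView.text = texts?.firstOrNull().orEmpty()
        }
        override fun onResults(results: Bundle?) {
            val texts = results?.getStringArrayList(SpeechRecognizer.RESULTS_RECOGNITION)
            textView.text = texts?.firstOrNull().orEmpty()
        }
        // Remaining callbacks are left empty for brevity
        override fun onReadyForSpeech(params: Bundle?) {}
        override fun onBeginningOfSpeech() {}
        override fun onRmsChanged(rmsdB: Float) {}
        override fun onBufferReceived(buffer: ByteArray?) {}
        override fun onEndOfSpeech() {}
        override fun onError(error: Int) {}
        override fun onEvent(eventType: Int, params: Bundle?) {}
    })
}

private fun startStreamingRecognition() {
    val intent = Intent(RecognizerIntent.ACTION_RECOGNIZE_SPEECH).apply {
        putExtra(RecognizerIntent.EXTRA_LANGUAGE_MODEL, RecognizerIntent.LANGUAGE_MODEL_FREE_FORM)
        putExtra(RecognizerIntent.EXTRA_PARTIAL_RESULTS, true)
    }
    speechRecognizer.startListening(intent)
}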
III. Third-Party SDK Integration
1. iFLYTEK SDK Integration
1.1 Preparation
- Register an iFLYTEK developer account
- Create an application to obtain an APPID
- Download the Android SDK
1.2 Integration Steps
- Copy Msc.jar and the armeabi/armeabi-v7a native libraries into the project's libs directory (a Gradle configuration sketch follows the initialization code below)
- Initialize the SDK (in the Application class):
class MyApp : Application() {
    override fun onCreate() {
        super.onCreate()
        // Replace YOUR_APPID with the APPID from the iFLYTEK console
        SpeechUtility.createUtility(this, "appid=${YOUR_APPID}")
    }
}
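The libs directory is not picked up automatically for native .so files. As a hedged sketch (module layout and paths are assumptions; check the iFLYTEK integration guide for your SDK version), a Gradle Kotlin DSL configuration could look like:

// app/build.gradle.kts (assumed module layout)
android {
    sourceSets {
        getByName("main") {
            // Point jniLibs at the folder holding the armeabi/armeabi-v7a .so files
            jniLibs.srcDirs("libs")
        }
    }
}

dependencies {
    // Pull in Msc.jar from the libs directory
    implementation(files("libs/Msc.jar"))
}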
1.3 Implementing Real-Time Speech Recognition
class VoiceRecognizer(
    private val context: Context,
    private val callback: RecognitionCallback
) {
    private var recognizer: SpeechRecognizer? = null

    fun startListening() {
        recognizer = SpeechRecognizer.createRecognizer(context)
        recognizer?.setParameter(SpeechConstant.DOMAIN, "iat") // Speech-to-text scenario
        recognizer?.setParameter(SpeechConstant.LANGUAGE, "zh_cn")
        recognizer?.setParameter(SpeechConstant.ACCENT, "mandarin")

        val recognizerListener = object : RecognizerListener {
            override fun onVolumeChanged(volume: Int, data: ByteArray?) {}
            override fun onBeginOfSpeech() {}
            override fun onEndOfSpeech() {}

            override fun onResult(results: RecognizerResult?, isLast: Boolean) {
                if (isLast) {
                    val result = results?.resultString
                    callback.onRecognitionComplete(result)
                } else {
                    // Partial results are returned in real time
                    val partialResult = JsonParser.parseIatResult(results?.resultString)
                    callback.onPartialResult(partialResult)
                }
            }

            override fun onError(error: SpeechError?) {
                callback.onError(error?.errorCode, error?.errorDescription)
            }

            override fun onEvent(eventType: Int, arg1: Int, arg2: Int, obj: Bundle?) {}
        }

        recognizer?.startListening(recognizerListener)
    }

    fun stopListening() {
        recognizer?.stopListening()
        recognizer?.cancel()
        recognizer?.destroy()
    }
}
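The RecognitionCallback interface referenced above is not shown in the original; a minimal definition consistent with the calls made by VoiceRecognizer might be:

interface RecognitionCallback {
    // Intermediate text while the user is still speaking
    fun onPartialResult(text: String?)

    // Final recognized text once the utterance is complete
    fun onRecognitionComplete(text: String?)

    // Error reported by the SDK
    fun onError(code: Int?, message: String?)
}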
2. Tencent Cloud Speech Recognition Integration
2.1 Preparation
- Create a Tencent Cloud account
- Enable the speech recognition service
- Obtain the SecretId and SecretKey
2.2 Implementation
Implementation using the Tencent Cloud REST API (requires network access):
class TencentSTT(private val secretId: String, private val secretKey: String) {
    private val client = OkHttpClient()

    fun recognizeSpeech(audioData: ByteArray, callback: (String?) -> Unit) {
        val timestamp = System.currentTimeMillis() / 1000
        val sign = generateSign(timestamp)

        val requestBody = MultipartBody.Builder()
            .setType(MultipartBody.FORM)
            .addFormDataPart("engine_type", "16k_zh")
            .addFormDataPart("channel_num", "1")
            .addFormDataPart("result_type", "0")
            .addFormDataPart("voice_format", "pcm")
            .addFormDataPart(
                "data", "audio.pcm",
                RequestBody.create(MediaType.parse("audio/pcm"), audioData)
            )
            .build()

        val request = Request.Builder()
            .url("https://recognition.tencentcloudapi.com/?Action=CreateRecTask&Timestamp=$timestamp&Signature=$sign")
            .post(requestBody)
            .addHeader("Authorization", "TC3-HMAC-SHA256 Credential=$secretId/...")
            .build()

        client.newCall(request).enqueue(object : Callback {
            override fun onResponse(call: Call, response: Response) {
                val result = response.body?.string()
                // Parse the JSON response (parseResult is a project-specific helper)
                callback(parseResult(result))
            }

            override fun onFailure(call: Call, e: IOException) {
                callback(null)
            }
        })
    }

    private fun generateSign(timestamp: Long): String {
        // Implement the Tencent Cloud signature algorithm here.
        // In real projects this must follow the official Tencent Cloud documentation.
        return "generated_signature"
    }
}
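generateSign above is only a placeholder. The full TC3-HMAC-SHA256 process (canonical request, string to sign, derived signing key) is defined by Tencent Cloud's documentation and is not reproduced here; as a hedged building-block sketch, the HMAC-SHA256 primitive it relies on could be implemented like this:

import javax.crypto.Mac
import javax.crypto.spec.SecretKeySpec

// Computes HMAC-SHA256 over a message; hex/Base64 encoding and the
// chained TC3 key derivation must follow the official documentation.
fun hmacSha256(key: ByteArray, message: String): ByteArray {
    val mac = Mac.getInstance("HmacSHA256")
    mac.init(SecretKeySpec(key, "HmacSHA256"))
    return mac.doFinal(message.toByteArray(Charsets.UTF_8))
}

fun ByteArray.toHex(): String = joinToString("") { "%02x".format(it) }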
IV. Performance Optimization
1. Audio Processing
- Sample rate: 16 kHz is recommended (required by iFLYTEK and other SDKs)
- Audio format: use AudioRecord to capture raw PCM data directly
private fun startRecording() {
    val bufferSize = AudioRecord.getMinBufferSize(
        16000, AudioFormat.CHANNEL_IN_MONO, AudioFormat.ENCODING_PCM_16BIT
    )
    val audioRecord = AudioRecord(
        MediaRecorder.AudioSource.MIC,
        16000,
        AudioFormat.CHANNEL_IN_MONO,
        AudioFormat.ENCODING_PCM_16BIT,
        bufferSize
    )
    audioRecord.startRecording()

    // Read data every 100 ms; run this loop on a background thread
    val buffer = ByteArray(3200) // 100 ms @ 16 kHz, 16-bit mono = 3200 bytes
    while (isRecording) {
        val read = audioRecord.read(buffer, 0, buffer.size)
        if (read > 0) {
            // Process / forward the audio data
        }
    }
    audioRecord.stop()
    audioRecord.release()
}
2. Handling Recognition Results
- Keyword filtering: use regular expressions to strip invalid characters
fun filterInvalidChars(input: String): String {
    return input.replace("[^\\u4e00-\\u9fa5a-zA-Z0-9,。、;:?!\"'()]".toRegex(), "")
}
- Result caching: store recognition history in a Room database (a minimal sketch follows)
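A minimal Room sketch for caching recognition history (entity, DAO, and table names here are assumptions, not from the original):

@Entity(tableName = "recognition_history")
data class RecognitionRecord(
    @PrimaryKey(autoGenerate = true) val id: Long = 0,
    val text: String,
    val createdAt: Long = System.currentTimeMillis()
)

@Dao
interface RecognitionDao {
    @Insert
    suspend fun insert(record: RecognitionRecord)

    @Query("SELECT * FROM recognition_history ORDER BY createdAt DESC")
    suspend fun loadAll(): List<RecognitionRecord>
}

@Database(entities = [RecognitionRecord::class], version = 1)
abstract class AppDatabase : RoomDatabase() {
    abstract fun recognitionDao(): RecognitionDao
}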
V. Common Problems and Solutions
1. Permission Issues
- Storage on Android 11+: the MANAGE_EXTERNAL_STORAGE permission must be requested separately (use with caution)
- Microphone contention: check whether another app is holding the microphone
fun isMicrophoneAvailable(): Boolean {
    val recorder = AudioRecord(
        MediaRecorder.AudioSource.MIC,
        44100,
        AudioFormat.CHANNEL_IN_MONO,
        AudioFormat.ENCODING_PCM_16BIT,
        AudioRecord.getMinBufferSize(44100, AudioFormat.CHANNEL_IN_MONO, AudioFormat.ENCODING_PCM_16BIT)
    )
    return try {
        recorder.startRecording()
        // If another app holds the microphone, recording never actually starts
        val available = recorder.recordingState == AudioRecord.RECORDSTATE_RECORDING
        recorder.stop()
        available
    } catch (e: IllegalStateException) {
        false
    } finally {
        recorder.release()
    }
}
2. Recognition Accuracy Issues
- Language model: for Chinese recognition, the zh_cn + mandarin combination is recommended
- Silence detection: use an energy threshold to identify active speech segments
fun isSpeechActive(audioData: ShortArray, threshold: Double = 0.1): Boolean {
    var sum = 0.0
    for (sample in audioData) {
        val normalized = sample / 32768.0 // Normalize 16-bit PCM samples to [-1, 1]
        sum += normalized * normalized
    }
    val rms = sqrt(sum / audioData.size)
    return rms > threshold
}
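The recording loop in section IV produces a ByteArray, while isSpeechActive expects a ShortArray. A small conversion helper could bridge the two, assuming AudioRecord delivers little-endian 16-bit PCM (this helper is not part of the original):

import java.nio.ByteBuffer
import java.nio.ByteOrder

// Convert little-endian 16-bit PCM bytes into samples for isSpeechActive()
fun pcmBytesToShorts(bytes: ByteArray, length: Int = bytes.size): ShortArray {
    val shorts = ShortArray(length / 2)
    ByteBuffer.wrap(bytes, 0, length)
        .order(ByteOrder.LITTLE_ENDIAN)
        .asShortBuffer()
        .get(shorts)
    return shorts
}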
VI. Advanced Features
1. Real-Time Transcription with Timestamps
data class SpeechSegment(val text: String, val startTime: Long, val endTime: Long)

class RealTimeSTT {
    private val segments = mutableListOf<SpeechSegment>()
    private var lastTimestamp = 0L

    fun processAudio(audioData: ByteArray, timestamp: Long): List<SpeechSegment> {
        // Call the SDK to obtain an intermediate result (placeholder text here)
        val partialText = "partial recognition result"
        if (lastTimestamp > 0) {
            segments.add(SpeechSegment(partialText, lastTimestamp, timestamp))
        }
        lastTimestamp = timestamp
        return segments.toList()
    }
}
2. Mixed-Language Recognition
fun setupMultilingualRecognition() {
    recognizer?.setParameter(SpeechConstant.LANGUAGE, "zh_cn+en_us")
    recognizer?.setParameter(SpeechConstant.MIXED_LANGUAGE, "1") // Enable mixed-language recognition
}
VII. Summary and Recommendations
- Lightweight needs: prefer the system API
- Commercial projects: mature SDKs such as iFLYTEK or Tencent Cloud are recommended
- Performance-critical apps: pay attention to the audio sample rate, silence detection, and result caching
- Privacy compliance: clearly inform users how their voice data is processed
A complete example project has been uploaded to GitHub and includes:
- System API implementation
- iFLYTEK SDK integration
- Audio processing utility classes
- Recognition result visualization
With the approaches described in this article, developers can quickly build reliable speech-to-text functionality in Android Studio and choose the technical route that best fits their actual requirements.