Integrating OpenAI with Spring AI for Text-to-Speech and Speech-to-Text

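1. Technical Background and Requirements

In intelligent applications, converting between speech and text in both directions has become a core requirement. For example, an intelligent customer service system needs to transcribe user speech in real time and reply with synthesized speech, and an education platform needs speech synthesis and voice interaction for course content. OpenAI's Whisper (ASR) and TTS (text-to-speech) APIs give developers high-accuracy building blocks, and Spring AI, a lightweight Java framework, simplifies the API calls and speeds up development.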

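1.1 Core Capabilities of the OpenAI API

Whisper model: multilingual speech recognition that is robust to background noise, suited to meeting transcription, voice commands, and similar scenarios. It was trained on audio in close to 100 languages, but accuracy varies considerably by language.
TTS model: generates natural, fluent speech with a choice of preset voices, adjustable speaking speed, and configurable output formats (such as MP3 and WAV).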

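1.2 Advantages of Integrating Through Spring AI

Simplified configuration: API keys, request parameters, and other settings are managed through dependency injection.
Unified interface: the details of calling the OpenAI API are abstracted behind type-safe Java methods.
Extensibility: custom interceptors can process response data or errors.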

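2. Development Environment Setup

2.1 Dependency Configuration

Add the Spring AI and OpenAI client dependencies to the Maven project's pom.xml (the version numbers shown are examples; use the releases current for your project):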

<dependencies>
    <!-- Spring AI core library -->
    <dependency>
        <groupId>org.springframework.ai</groupId>
        <artifactId>spring-ai-core</artifactId>
        <version>0.7.0</version>
    </dependency>
    <!-- OpenAI Java client (TheoKanning openai-java); the speech (TTS) API requires 0.18.0 or later -->
    <dependency>
        <groupId>com.theokanning.openai-gpt3-java</groupId>
        <artifactId>service</artifactId>
        <version>0.18.2</version>
    </dependency>
</dependencies>

2.2 Configuring the OpenAI API Key

Set the key in application.properties (injecting it through an environment variable is recommended):

openai.api-key=${OPENAI_API_KEY}
openai.organization-id=${OPENAI_ORG_ID}
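
If the key is read outside Spring's placeholder resolution (for example in a command-line tool or a test), it helps to fail fast when the variable is missing and to avoid printing the full key in logs. The following is a minimal sketch under that assumption; `ApiKeyResolver`, `requireKey`, and `mask` are hypothetical names introduced here for illustration and are not part of Spring AI or the OpenAI client.

```java
import java.util.Map;

/** Hypothetical helper (not part of any library): resolves and masks an API key. */
final class ApiKeyResolver {
    private ApiKeyResolver() {}

    /** Returns the trimmed value of the named variable, or throws if it is missing or blank. */
    static String requireKey(Map<String, String> env, String name) {
        String value = env.get(name);
        if (value == null || value.isBlank()) {
            throw new IllegalStateException("Environment variable " + name + " is not set");
        }
        return value.trim();
    }

    /** Masks a key for logging: only the last four characters are kept. */
    static String mask(String key) {
        if (key.length() <= 4) {
            return "****";
        }
        return "****" + key.substring(key.length() - 4);
    }
}
```

A typical call would be `ApiKeyResolver.requireKey(System.getenv(), "OPENAI_API_KEY")`; passing the environment as a map keeps the helper easy to test.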

3. Text-to-Speech (TTS) Implementation

3.1 Core Code

The steps for calling the OpenAI TTS API from a Spring application are as follows:

3.1.1 Creating the TTS Service Class

import com.theokanning.openai.audio.CreateSpeechRequest;
import com.theokanning.openai.service.OpenAiService;
import okhttp3.ResponseBody;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.stereotype.Service;
import java.io.IOException;
import java.io.UncheckedIOException;
@Service
public class TtsService {
    private final OpenAiService openAiService;
    public TtsService(@Value("${openai.api-key}") String apiKey) {
        this.openAiService = new OpenAiService(apiKey);
    }
    public byte[] textToSpeech(String text, String voice, String format) {
        CreateSpeechRequest request = CreateSpeechRequest.builder()
                .model("tts-1") // or "tts-1-hd" for higher audio quality
                .input(text)
                .voice(voice) // e.g. "alloy", "echo", "fable"
                .responseFormat(format) // "mp3", "opus", "aac", "flac"
                .build();
        // In openai-java 0.18.x, createSpeech returns the raw OkHttp response body.
        try (ResponseBody body = openAiService.createSpeech(request)) {
            return body.bytes();
        } catch (IOException e) {
            throw new UncheckedIOException("Failed to read TTS audio", e);
        }
    }
}

3.1.2 Calling from the Controller Layer

import org.springframework.web.bind.annotation.*;
import org.springframework.http.ResponseEntity;
import org.springframework.http.MediaType;
@RestController
@RequestMapping("/api/tts")
public class TtsController {
    private final TtsService ttsService;
    public TtsController(TtsService ttsService) {
        this.ttsService = ttsService;
    }
    // No "produces" attribute: the Content-Type depends on the requested format.
    @GetMapping("/generate")
    public ResponseEntity<byte[]> generateSpeech(
            @RequestParam String text,
            @RequestParam(defaultValue = "alloy") String voice,
            @RequestParam(defaultValue = "mp3") String format) {
        // MP3 is served as audio/mpeg; "audio/mp3" is not a registered media type.
        // parseMediaType also rejects malformed values before the API is called.
        MediaType contentType = "mp3".equals(format)
                ? MediaType.parseMediaType("audio/mpeg")
                : MediaType.parseMediaType("audio/" + format);
        byte[] audio = ttsService.textToSpeech(text, voice, format);
        return ResponseEntity.ok()
                .contentType(contentType)
                .body(audio);
    }
}

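3.2 Key Parameters

Model choice: tts-1 (standard) is optimized for lower latency, while tts-1-hd (high definition) favors audio quality at the cost of slower generation.
Voices: tts-1 launched with six preset voices (alloy, echo, fable, onyx, nova, shimmer), selected through the voice parameter.
Response format: MP3 is recommended because most devices can play it.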
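
Because the response format is chosen by the caller, the HTTP Content-Type should come from a fixed lookup rather than from string concatenation. The sketch below shows one way to do that; `AudioMimeTypes` is a hypothetical helper, and the mapping of "opus" to audio/ogg is an assumption (Opus audio delivered in an Ogg container) that should be checked against the actual response.

```java
import java.util.Locale;
import java.util.Map;

/** Hypothetical helper: maps a TTS response format to an HTTP Content-Type. */
final class AudioMimeTypes {
    private static final Map<String, String> BY_FORMAT = Map.of(
            "mp3", "audio/mpeg",
            "aac", "audio/aac",
            "flac", "audio/flac",
            "opus", "audio/ogg"); // assumption: Opus in an Ogg container

    private AudioMimeTypes() {}

    /** Returns the media type for a known format; rejects anything else. */
    static String contentTypeFor(String format) {
        String key = format == null ? "" : format.trim().toLowerCase(Locale.ROOT);
        String type = BY_FORMAT.get(key);
        if (type == null) {
            throw new IllegalArgumentException("Unsupported audio format: " + format);
        }
        return type;
    }
}
```

Rejecting unknown formats up front also keeps arbitrary request input out of the response headers.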

4. Speech-to-Text (ASR) Implementation

4.1 Core Code

4.1.1 Creating the ASR Service Class

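import com.theokanning.openai.audio.CreateTranscriptionRequest;
import com.theokanning.openai.audio.TranscriptionResult;
import com.theokanning.openai.service.OpenAiService;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.stereotype.Service;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
@Service
public class AsrService {
    private final OpenAiService openAiService;
    public AsrService(@Value("${openai.api-key}") String apiKey) {
        this.openAiService = new OpenAiService(apiKey);
    }
    public String speechToText(byte[] audioData, String language, String model) {
        // An optional .prompt(...) supplies vocabulary or style hints; it is not an instruction.
        CreateTranscriptionRequest request = CreateTranscriptionRequest.builder()
                .model(model) // "whisper-1" is recommended
                .language(language) // optional ISO-639-1 code, e.g. "zh" or "en"
                .temperature(0.0) // most deterministic output
                .build();
        // openai-java 0.18.x uploads from a File, so spool the bytes to a temp file.
        // The ".mp3" suffix is an assumption; use the upload's real extension.
        Path tmp = null;
        try {
            tmp = Files.createTempFile("asr-", ".mp3");
            Files.write(tmp, audioData);
            TranscriptionResult result = openAiService.createTranscription(request, tmp.toFile());
            return result.getText();
        } catch (IOException e) {
            throw new UncheckedIOException("Failed to buffer audio for transcription", e);
        } finally {
            if (tmp != null) tmp.toFile().delete();
        }
    }
}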

4.1.2 Calling from the Controller Layer

import org.springframework.web.bind.annotation.*;
import org.springframework.web.multipart.MultipartFile;
import java.io.IOException;
@RestController
@RequestMapping("/api/asr")
public class AsrController {
    private final AsrService asrService;
    public AsrController(AsrService asrService) {
        this.asrService = asrService;
    }
    @PostMapping("/transcribe")
    public String transcribeAudio(
            @RequestParam("file") MultipartFile file,
            @RequestParam(defaultValue = "zh") String language) {
        try {
            byte[] audioBytes = file.getBytes();
            return asrService.speechToText(audioBytes, language, "whisper-1");
        } catch (IOException e) {
            // Reading the upload failed; errors from the API call itself propagate unchanged.
            throw new RuntimeException("Failed to read the uploaded audio", e);
        }
    }
}

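4.2 Optimization Tips

File formats: Whisper accepts MP3, WAV, FLAC, and several other formats. Validate the file type on the front end and again on the server, since client-side checks can be bypassed.
Long audio: the transcription endpoint limits uploads to 25 MB, so split larger recordings into segments (ideally at pauses, so that words are not cut) or compress them before uploading.
Language detection: if the language parameter is omitted, Whisper detects the language automatically; supplying it as an ISO-639-1 code can improve accuracy and latency.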
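
The format and size checks above can be sketched as a small server-side validator. `AudioUploadValidator` is a hypothetical class; the allowed extensions are just the three formats named above, and treating 25 MB as 25 × 1024 × 1024 bytes is an assumption.

```java
import java.util.Locale;
import java.util.Set;

/** Hypothetical server-side check for audio uploads before they are sent to Whisper. */
final class AudioUploadValidator {
    /** Assumption: the 25 MB limit is interpreted as 25 * 1024 * 1024 bytes. */
    static final long MAX_BYTES = 25L * 1024 * 1024;

    /** Only the formats named in the text; extend as needed. */
    private static final Set<String> ALLOWED_EXTENSIONS = Set.of("mp3", "wav", "flac");

    private AudioUploadValidator() {}

    /** Throws IllegalArgumentException if the upload is empty, too large, or of an unknown type. */
    static void validate(String filename, long sizeBytes) {
        if (sizeBytes <= 0) {
            throw new IllegalArgumentException("Audio file is empty");
        }
        if (sizeBytes > MAX_BYTES) {
            throw new IllegalArgumentException("Audio file exceeds " + MAX_BYTES + " bytes");
        }
        int dot = filename == null ? -1 : filename.lastIndexOf('.');
        String extension = dot < 0 ? "" : filename.substring(dot + 1).toLowerCase(Locale.ROOT);
        if (!ALLOWED_EXTENSIONS.contains(extension)) {
            throw new IllegalArgumentException("Unsupported audio type: " + filename);
        }
    }
}
```

In a controller this could be called as `AudioUploadValidator.validate(file.getOriginalFilename(), file.getSize())` before reading the bytes. An extension check is only a first filter; it does not prove the content is valid audio.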

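5. Best Practices and Performance Optimization

5.1 Error Handling

Retry strategy: retry network timeouts and API rate-limit errors with exponential backoff.
Logging: record metrics such as API call latency and error codes to simplify troubleshooting.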
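
Exponential backoff can be sketched without any framework. In the example below, `Backoff` is a hypothetical helper: the delay doubles after each failed attempt up to a cap, and the caller supplies a predicate that decides which exceptions are worth retrying (for instance timeouts or HTTP 429). How those errors surface as exceptions depends on the client library, so the predicate is left to the caller.

```java
import java.util.function.Predicate;
import java.util.function.Supplier;

/** Hypothetical retry helper using capped exponential backoff. */
final class Backoff {
    private Backoff() {}

    /** Delay before the retry that follows failed attempt `attempt` (0-based): base * 2^attempt, capped. */
    static long delayMillis(int attempt, long baseMillis, long maxMillis) {
        long delay = baseMillis * (1L << Math.min(attempt, 20));
        return Math.min(delay, maxMillis);
    }

    /** Runs the call up to maxAttempts times, sleeping between retryable failures. */
    static <T> T retry(Supplier<T> call, Predicate<RuntimeException> retryable,
                       int maxAttempts, long baseMillis, long maxMillis) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        for (int attempt = 0; ; attempt++) {
            try {
                return call.get();
            } catch (RuntimeException e) {
                boolean lastAttempt = attempt == maxAttempts - 1;
                if (lastAttempt || !retryable.test(e)) {
                    throw e; // give up: out of attempts or not a transient error
                }
                sleep(delayMillis(attempt, baseMillis, maxMillis));
            }
        }
    }

    private static void sleep(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException ie) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
            throw new IllegalStateException("Interrupted during backoff", ie);
        }
    }
}
```

This sketch is deterministic for clarity; adding random jitter to each delay is a common refinement that keeps many clients from retrying at the same instant.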

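5.2 Caching Strategy

TTS cache: cache the generated audio for frequently used text (such as fixed prompts).
ASR cache: store recognition results for repeated audio clips.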
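
A TTS cache needs a key that covers everything affecting the audio: the text, the voice, and the format. The sketch below hashes those three values with SHA-256 and stores results in memory. `TtsCache` is a hypothetical class, and the unbounded in-memory map is an assumption made for brevity.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

/** Hypothetical in-memory cache for synthesized audio, keyed by text, voice, and format. */
final class TtsCache {
    private final Map<String, byte[]> entries = new ConcurrentHashMap<>();

    /** SHA-256 hex digest of voice, format, and text, joined by newlines. */
    static String key(String text, String voice, String format) {
        try {
            MessageDigest sha = MessageDigest.getInstance("SHA-256");
            // Voice and format come first; as long as they contain no newline,
            // different (voice, format, text) triples cannot produce the same input.
            String material = voice + '\n' + format + '\n' + text;
            byte[] digest = sha.digest(material.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 is not available", e);
        }
    }

    /** Returns cached audio, invoking the synthesizer only on a cache miss. */
    byte[] getOrSynthesize(String text, String voice, String format, Supplier<byte[]> synthesizer) {
        return entries.computeIfAbsent(key(text, voice, format), k -> synthesizer.get());
    }
}
```

With the service from section 3, a call might look like `cache.getOrSynthesize(text, voice, format, () -> ttsService.textToSpeech(text, voice, format))`. `computeIfAbsent` runs the synthesizer while holding a lock on part of the map, which guarantees one synthesis per key but can block other writers during a slow API call; a production cache would also need size limits and eviction.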

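5.3 Security Considerations

API key protection: manage keys through Vault or a KMS, and never hard-code them.
Input validation: check text length, audio size, and other parameters to block malicious requests.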
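
For the text side of input validation, a length check before calling the TTS API avoids paying for requests that will be rejected. The sketch below assumes the 4096-character input limit that OpenAI documents for the speech endpoint and assumes the limit is counted in Unicode code points; `TtsInputValidator` is a hypothetical name.

```java
/** Hypothetical validator for TTS input text. */
final class TtsInputValidator {
    /** Assumption: documented maximum input length for the speech endpoint. */
    static final int MAX_CHARS = 4096;

    private TtsInputValidator() {}

    /** Returns the stripped text, or throws if it is blank or too long. */
    static String validate(String text) {
        if (text == null || text.isBlank()) {
            throw new IllegalArgumentException("text must not be blank");
        }
        String stripped = text.strip();
        // Count code points so that characters outside the BMP count once (assumption).
        int length = stripped.codePointCount(0, stripped.length());
        if (length > MAX_CHARS) {
            throw new IllegalArgumentException(
                    "text has " + length + " characters; the limit is " + MAX_CHARS);
        }
        return stripped;
    }
}
```

Longer replies would need to be split into several requests, for example at sentence boundaries, and the resulting audio segments concatenated or played in sequence.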

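6. Complete Example: Intelligent Customer Service

6.1 Business Flow

The user uploads a spoken question → ASR transcribes it to text.
The system generates a text reply → TTS converts it to speech.
The spoken response is returned to the user.

6.2 Code Implementation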

import org.springframework.http.MediaType;
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.*;
import org.springframework.web.multipart.MultipartFile;
import java.io.IOException;
@RestController
@RequestMapping("/api/chat")
public class ChatController {
    private final AsrService asrService;
    private final TtsService ttsService;
    private final ChatService chatService; // hypothetical service holding the dialogue logic
    public ChatController(AsrService asrService, TtsService ttsService, ChatService chatService) {
        this.asrService = asrService;
        this.ttsService = ttsService;
        this.chatService = chatService;
    }
    @PostMapping("/voice-chat")
    public ResponseEntity<byte[]> voiceChat(
            @RequestParam("file") MultipartFile audioFile,
            @RequestParam(defaultValue = "zh") String language) throws IOException {
        // 1. Speech to text (getBytes() can throw IOException)
        String question = asrService.speechToText(
                audioFile.getBytes(), language, "whisper-1");
        // 2. Generate the reply text
        String answer = chatService.generateAnswer(question);
        // 3. Text to speech
        byte[] audioResponse = ttsService.textToSpeech(answer, "alloy", "mp3");
        return ResponseEntity.ok()
                .contentType(MediaType.parseMediaType("audio/mpeg")) // registered type for MP3
                .body(audioResponse);
    }
}
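
The three-step flow can also be expressed independently of Spring and HTTP, which makes the orchestration easy to unit test with stubs. The following is a sketch under that assumption; `VoiceChatPipeline` is a hypothetical class, and each step is passed in as a plain function rather than a concrete service.

```java
import java.util.Objects;
import java.util.function.Function;

/** Hypothetical framework-free version of the ASR -> reply -> TTS flow. */
final class VoiceChatPipeline {
    private final Function<byte[], String> transcriber;  // audio -> question text
    private final Function<String, String> responder;    // question -> answer text
    private final Function<String, byte[]> synthesizer;  // answer text -> audio

    VoiceChatPipeline(Function<byte[], String> transcriber,
                      Function<String, String> responder,
                      Function<String, byte[]> synthesizer) {
        this.transcriber = Objects.requireNonNull(transcriber);
        this.responder = Objects.requireNonNull(responder);
        this.synthesizer = Objects.requireNonNull(synthesizer);
    }

    /** Runs the three steps in order; fails early if no speech was recognized. */
    byte[] handle(byte[] questionAudio) {
        String question = transcriber.apply(questionAudio);
        if (question == null || question.isBlank()) {
            throw new IllegalStateException("No speech was recognized in the audio");
        }
        String answer = responder.apply(question);
        return synthesizer.apply(answer);
    }
}
```

Wired to the services above, the three functions might be `bytes -> asrService.speechToText(bytes, "zh", "whisper-1")`, `chatService::generateAnswer`, and `text -> ttsService.textToSpeech(text, "alloy", "mp3")`.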

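7. Summary and Outlook

By integrating OpenAI's TTS and ASR capabilities through Spring AI, developers can quickly build high-quality voice interaction applications. The key steps are:

Configure the OpenAI API dependencies and key.
Implement TTS and ASR service classes that wrap the API calls.
Design RESTful endpoints that expose the functionality.
Tune performance and security for the business scenario.

Directions worth exploring in the future include:

Real-time streaming speech processing.
Integrating multimodal large models to improve dialogue quality.
Broader support for languages and dialects.

This approach has been validated in real projects; it can significantly reduce development cost and improve the naturalness and accuracy of voice interaction.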
