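Spring AI Tutorial

Spring AI Audio Models

Tip: if you cannot reach OpenAI directly, you can register an AiCode API account and access it through that proxy.

Spring AI supports audio models for speech-to-text (automatic speech recognition, ASR) and text-to-speech (speech synthesis, TTS). This chapter uses OpenAI's Whisper (speech recognition) and TTS (text-to-speech) models as examples.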

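Introduction to the Whisper Model

Whisper is an open-source, end-to-end speech recognition system developed by OpenAI.

Whisper was open-sourced in 2022. Its goal is a general-purpose speech recognition model that performs well across languages, accents, noisy environments, and microphone quality. It has strong multilingual recognition and handles speech-to-text, speech translation, and language identification.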

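Key features of Whisper:

Multilingual support: recognizes and translates speech in close to 100 languages.
Robustness: copes with background noise, accent variation, and non-standard pronunciation.
Speech translation: translates speech in any supported language directly into English.
End-to-end Transformer architecture: built on a large Transformer model, with no need for the multi-stage pipeline of traditional speech recognition.
Timestamps: can output timestamped subtitle formats such as .srt and .vtt.
Multiple model sizes: the open-source release comes in 5 sizes, from tiny to large, to fit different resource budgets:

tiny: 39M parameters; very fast but less accurate; suited to mobile devices and quick transcription.
base: 74M parameters; fast with moderate accuracy; suited to general-purpose speech recognition.
small: 244M parameters; medium speed with above-average accuracy; suited to multilingual transcription.
medium: 769M parameters; slow but accurate; suited to high-quality transcription.
large: 1550M parameters; slow with the best accuracy; suited to multilingual recognition and translation, subtitle generation, and similar tasks.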
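
The size list above can be turned into a simple selection rule. The following is a minimal sketch, assuming you run the open-source Whisper checkpoints yourself and want the largest model that fits a parameter budget; the enum and the method name are hypothetical helpers, not part of Spring AI or Whisper.

```java
import java.util.Optional;

/** Hypothetical helper: the parameter counts are the ones listed above. */
enum WhisperSize {
    TINY(39), BASE(74), SMALL(244), MEDIUM(769), LARGE(1550);

    final int millionParams;

    WhisperSize(int millionParams) {
        this.millionParams = millionParams;
    }

    /** Returns the largest size whose parameter count fits the budget, if any. */
    static Optional<WhisperSize> largestWithin(int budgetMillionParams) {
        Optional<WhisperSize> best = Optional.empty();
        for (WhisperSize size : values()) {   // values() is declared smallest to largest
            if (size.millionParams <= budgetMillionParams) {
                best = Optional.of(size);
            }
        }
        return best;
    }
}

class WhisperSizeDemo {
    public static void main(String[] args) {
        System.out.println(WhisperSize.largestWithin(300));  // Optional[SMALL]
        System.out.println(WhisperSize.largestWithin(10));   // Optional.empty
    }
}
```

This choice only matters for self-hosted deployments; the hosted API used in the rest of this chapter is addressed through the single model name whisper-1.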

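Introduction to the TTS Model

TTS-1 is OpenAI's text-to-speech model: it converts a given text into natural-sounding speech audio.

Basic facts about TTS-1:

Variants: TTS-1 is the standard variant, optimized for low latency and suited to general text-to-speech tasks. TTS-1-HD is the high-definition variant, optimized for audio quality and suited to scenarios with stricter quality requirements.
Pricing: TTS-1 is billed at 0.015 USD per 1,000 input characters and TTS-1-HD at 0.03 USD per 1,000 input characters.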
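
To make the pricing concrete, here is a minimal cost-estimation sketch. It assumes the rates of 0.015 USD (TTS-1) and 0.03 USD (TTS-1-HD) per 1,000 input characters and counts Unicode code points as characters; the class and method names are hypothetical, and current prices should be checked before relying on the numbers.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

class TtsCostEstimator {
    // Assumed list prices in USD per 1,000 input characters.
    static final BigDecimal TTS_1_PER_1K = new BigDecimal("0.015");
    static final BigDecimal TTS_1_HD_PER_1K = new BigDecimal("0.030");

    /** Estimates the cost of synthesizing the text at the given per-1,000-character rate. */
    static BigDecimal estimateUsd(String text, BigDecimal pricePer1kChars) {
        // Count code points so that surrogate pairs are not counted twice.
        BigDecimal chars = BigDecimal.valueOf(text.codePointCount(0, text.length()));
        return chars.multiply(pricePer1kChars)
                .divide(BigDecimal.valueOf(1000), 6, RoundingMode.HALF_UP);
    }

    public static void main(String[] args) {
        String text = "a".repeat(2000);
        System.out.println(estimateUsd(text, TTS_1_PER_1K));     // 0.030000
        System.out.println(estimateUsd(text, TTS_1_HD_PER_1K));  // 0.060000
    }
}
```

BigDecimal is used instead of double so that small per-request amounts add up without floating-point drift.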

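TTS-1 comes with 6 built-in voices: Alloy, Echo, Fable, Onyx, Nova, and Shimmer. The voices are currently optimized for English; try several to find the one that best matches the tone and audience you are aiming for.

Note: the default response format is "mp3". Other formats such as opus, aac, flac, and pcm are also available for different application needs.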

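Quick Example

Add the Dependency

Add the following dependency to pom.xml. No version is given, so it is assumed to be managed by the Spring AI BOM. The examples below inject auto-configured model beans; in Spring AI 1.0.0 the auto-configuration ships with the Spring Boot starter spring-ai-starter-model-openai, so switch to that artifactId if the beans are not found: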

<dependency>
    <groupId>org.springframework.ai</groupId>
    <artifactId>spring-ai-openai</artifactId>
</dependency>

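Configure the Model

In the application.yml configuration file, set the API key, base URL, model names, and so on. No audio-specific options are set here, so the audio examples below run with Spring AI's defaults (tts-1 and whisper-1, as the request logs show):

spring:
  application:
    name: springai_demo1
  # AI configuration
  ai:
    # OpenAI settings
    openai:
      # Base URL
      # Register an account at https://api.xty.app/register?aff=pO2q to access OpenAI through this endpoint
      base-url: https://api.xty.app
      # API key
      api-key: sk-vHTHX8D3wNZBfRya****************A23e48AbB600
      # Chat model settings
      chat:
        options:
          model: gpt-4-turbo # gpt-3.5-turbo
      # Image model settings
      image:
        options:
          # Requires premium API access
          model: dall-e-3

# Logging configuration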
logging:
  charset:
    console: UTF-8
  level:
    root: info
    org.springframework.ai: debug

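Text-to-Speech (TTS Model)

This section shows how to call OpenAI's TTS model through the OpenAiAudioSpeechModel class to convert text into speech. OpenAiAudioSpeechModel is Spring AI's wrapper around the OpenAI TTS (text-to-speech) API; it simplifies calling the OpenAI TTS service (the tts-1 and tts-1-hd models) from Java. For example: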

package com.hxstrive.springai.springai_openai.example.audio_model;

import org.springframework.ai.openai.OpenAiAudioSpeechModel;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.CommandLineRunner;
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import java.io.FileOutputStream;
import java.nio.file.Path;

@SpringBootApplication
public class TextToSpeechExample implements CommandLineRunner {

    @Autowired
    private OpenAiAudioSpeechModel speechModel;

    public static void main(String[] args) {
        SpringApplication.run(TextToSpeechExample.class, args);
    }

    @Override
    public void run(String... args) throws Exception {
        // Text to convert
        String text = "Hello, this is a text-to-speech example using Spring AI and OpenAI.";

        // Call the API to generate the audio
        byte[] audioData = speechModel.call(text);

        // Save the audio to a file
        Path outputPath = Path.of("D:\\generated_speech.mp3");
        try (FileOutputStream fos = new FileOutputStream(outputPath.toFile())) {
            fos.write(audioData);
            System.out.println("Audio saved to: " + outputPath.toAbsolutePath());
        }
    }
}

Running the code produces the following output:

2025-10-11 23:23:55.902 [restartedMain] DEBUG org.springframework.web.client.DefaultRestClient - Writing [SpeechRequest[model=tts-1, input=Hello, this is a text-to-speech example using Spring AI and OpenAI., voice=alloy, responseFormat=MP3, speed=1.0]] with org.springframework.http.converter.json.MappingJackson2HttpMessageConverter
2025-10-11 23:23:58.986 [restartedMain] DEBUG org.springframework.web.client.DefaultRestClient - Reading to [[B] as "audio/mpeg"
Audio saved to: D:\generated_speech.mp3
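
The example above writes to a hard-coded Windows path. As a minimal sketch of a more portable alternative, the helper below saves the returned bytes with java.nio.file.Files and creates the target directory when it is missing. The class name, method name, and directory are hypothetical, and the byte array in main is a placeholder rather than real audio.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

class AudioFileWriter {
    /** Writes the bytes to dir/fileName, creating dir if needed, and returns the file path. */
    static Path save(byte[] audioData, Path dir, String fileName) throws IOException {
        Files.createDirectories(dir);
        Path target = dir.resolve(fileName);
        return Files.write(target, audioData);   // creates the file or overwrites an existing one
    }

    public static void main(String[] args) throws IOException {
        byte[] fakeAudio = {0x49, 0x44, 0x33};   // placeholder bytes, not a playable MP3
        Path dir = Path.of(System.getProperty("java.io.tmpdir"), "spring-ai-audio");
        Path saved = save(fakeAudio, dir, "generated_speech.mp3");
        System.out.println("Audio saved to: " + saved.toAbsolutePath());
    }
}
```

In the Spring example, the byte array returned by speechModel.call(text) would be passed in place of the placeholder bytes.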

We can also set the model, voice, response format, and speed through the OpenAiAudioSpeechOptions class:

package com.hxstrive.springai.springai_openai.example.audio_model2;

import org.springframework.ai.openai.OpenAiAudioSpeechModel;
import org.springframework.ai.openai.OpenAiAudioSpeechOptions;
import org.springframework.ai.openai.api.OpenAiAudioApi;
import org.springframework.ai.openai.audio.speech.Speech;
import org.springframework.ai.openai.audio.speech.SpeechPrompt;
import org.springframework.ai.openai.audio.speech.SpeechResponse;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.CommandLineRunner;
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;

import java.io.FileOutputStream;
import java.nio.file.Path;

@SpringBootApplication
public class TextToSpeechExample implements CommandLineRunner {

    @Autowired
    private OpenAiAudioSpeechModel speechModel;

    public static void main(String[] args) {
        SpringApplication.run(TextToSpeechExample.class, args);
    }

    @Override
    public void run(String... args) throws Exception {
        // Text to convert
        String text = "Hello, this is a text-to-speech example using Spring AI and OpenAI.";

        // Configure the speech options
        OpenAiAudioSpeechOptions speechOptions = OpenAiAudioSpeechOptions.builder()
                .model("tts-1") // TTS model (tts-1 or tts-1-hd)
                .voice("alloy") // Voice (alloy, echo, fable, onyx, nova, shimmer)
                .responseFormat(OpenAiAudioApi.SpeechRequest.AudioResponseFormat.MP3)
                .speed(1.0f)
                .build();

        // Call the API to generate the audio
        SpeechPrompt speechPrompt = new SpeechPrompt(text, speechOptions);
        SpeechResponse speechResponse = speechModel.call(speechPrompt);
        Speech speech = speechResponse.getResult();
        byte[] audioData = speech.getOutput();

        // Save the audio to a file
        Path outputPath = Path.of("D:\\generated_speech.mp3");
        try (FileOutputStream fos = new FileOutputStream(outputPath.toFile())) {
            fos.write(audioData);
            System.out.println("Audio saved to: " + outputPath.toAbsolutePath());
        }
    }
}

Running the code produces the following output:

2025-10-11 23:26:12.563 [restartedMain] DEBUG org.springframework.web.client.DefaultRestClient - Writing [SpeechRequest[model=tts-1, input=Hello, this is a text-to-speech example using Spring AI and OpenAI., voice=alloy, responseFormat=MP3, speed=1.0]] with org.springframework.http.converter.json.MappingJackson2HttpMessageConverter
2025-10-11 23:26:16.035 [restartedMain] DEBUG org.springframework.web.client.DefaultRestClient - Reading to [[B] as "audio/mpeg"
Audio saved to: D:\generated_speech.mp3
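
Invalid option values only surface as an HTTP error after the request is sent, so it can help to check them first. The sketch below is one way to do that, assuming the two model names and six voices named in this chapter and the speed range of 0.25 to 4.0 documented for the OpenAI speech API; the class and method are hypothetical and not part of Spring AI.

```java
import java.util.Set;

class SpeechSettingsCheck {
    // Models and voices as named in this chapter; extend the sets if your account offers more.
    static final Set<String> MODELS = Set.of("tts-1", "tts-1-hd");
    static final Set<String> VOICES = Set.of("alloy", "echo", "fable", "onyx", "nova", "shimmer");
    // Speed range assumed from the OpenAI speech API documentation.
    static final float MIN_SPEED = 0.25f;
    static final float MAX_SPEED = 4.0f;

    /** Throws IllegalArgumentException describing the first invalid setting found. */
    static void validate(String model, String voice, float speed) {
        if (!MODELS.contains(model)) {
            throw new IllegalArgumentException("Unsupported model: " + model);
        }
        if (!VOICES.contains(voice)) {
            throw new IllegalArgumentException("Unsupported voice: " + voice);
        }
        // Written as a negated range check so that NaN is rejected as well.
        if (!(speed >= MIN_SPEED && speed <= MAX_SPEED)) {
            throw new IllegalArgumentException("Speed out of range: " + speed);
        }
    }

    public static void main(String[] args) {
        validate("tts-1", "alloy", 1.0f);          // valid, returns silently
        try {
            validate("tts-1", "alloy", 5.0f);
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());    // Speed out of range: 5.0
        }
    }
}
```

Calling validate(...) just before building OpenAiAudioSpeechOptions keeps configuration mistakes out of the network path.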

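Speech-to-Text (Whisper Model)

This section shows how to call OpenAI's Whisper model through the OpenAiAudioTranscriptionModel class to convert speech into text.

OpenAiAudioTranscriptionModel is the Spring AI 1.0.0 wrapper around OpenAI's audio transcription service. It suits Spring projects that need to integrate speech-to-text quickly, for example subtitle generation, call quality inspection, meeting minutes, and accessibility services. Its main benefits are simple configuration and low development cost, while the broad language coverage and accuracy of the whisper-1 model meet the needs of most commercial scenarios.

For example, extract the text from the generated_speech.mp3 file produced by the text-to-speech example: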

package com.hxstrive.springai.springai_openai.example.audio_model3;

import org.springframework.ai.openai.OpenAiAudioTranscriptionModel;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.CommandLineRunner;
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.core.io.FileSystemResource;
import java.io.File;

@SpringBootApplication
public class AudioTranscriptionExample implements CommandLineRunner {

    @Autowired
    private OpenAiAudioTranscriptionModel transcriptionModel;

    public static void main(String[] args) {
        SpringApplication.run(AudioTranscriptionExample.class, args);
    }

    @Override
    public void run(String... args) throws Exception {
        // Path to the audio file (prepare a file that contains speech beforehand)
        File audioFile = new File("D:\\generated_speech.mp3");
        FileSystemResource audioResource = new FileSystemResource(audioFile);

        // Uses the whisper-1 model by default
        String text = transcriptionModel.call(audioResource);

        // Print the transcript
        System.out.println("Transcript: " + text);
    }
}

Running the code produces the following output:

2025-10-12 06:48:29.233 [restartedMain] DEBUG org.springframework.web.client.DefaultRestClient - Writing [{file=[Byte array resource [resource loaded from byte array]], model=[whisper-1], language=[null], prompt=[null], response_format=[text], temperature=[0.7]}] with org.springframework.http.converter.support.AllEncompassingFormHttpMessageConverter
2025-10-12 06:48:34.453 [restartedMain] DEBUG org.springframework.web.client.DefaultRestClient - Reading to [java.lang.String] as "text/plain;charset=utf-8"
Transcript: Hello, this is a text-to-speech example using Spring AI and OpenAI.
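
Before uploading a file it can be worth checking that the service is likely to accept it. The sketch below assumes the limits documented for the OpenAI transcription endpoint (files up to 25 MB in flac, mp3, mp4, mpeg, mpga, m4a, ogg, wav, or webm format) and judges the format by file extension only; the class and method names are hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Locale;
import java.util.Set;

class TranscriptionInputCheck {
    // Assumed upload limit and formats, taken from the OpenAI transcription documentation.
    static final long MAX_BYTES = 25L * 1024 * 1024;
    static final Set<String> EXTENSIONS =
            Set.of("flac", "mp3", "mp4", "mpeg", "mpga", "m4a", "ogg", "wav", "webm");

    /** Returns true if the file exists, is non-empty, has a known extension, and fits the size limit. */
    static boolean isAcceptable(Path file) throws IOException {
        if (!Files.isRegularFile(file)) {
            return false;
        }
        String name = file.getFileName().toString().toLowerCase(Locale.ROOT);
        int dot = name.lastIndexOf('.');
        String extension = dot < 0 ? "" : name.substring(dot + 1);
        long size = Files.size(file);
        return EXTENSIONS.contains(extension) && size > 0 && size <= MAX_BYTES;
    }

    public static void main(String[] args) throws IOException {
        Path sample = Files.createTempFile("speech", ".mp3");
        Files.write(sample, new byte[] {1, 2, 3});   // placeholder bytes, not real audio
        System.out.println(isAcceptable(sample));     // true
        Files.delete(sample);
    }
}
```

The extension check is only a heuristic; a file that passes it can still be rejected if its content is not valid audio.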

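We can also configure the model, language, response format, and other settings through the OpenAiAudioTranscriptionOptions class.

OpenAiAudioTranscriptionOptions implements the AudioTranscriptionOptions interface and holds the options for an audio transcription (speech-to-text) request sent to a service such as OpenAI's Whisper model. It lets developers tune the parameters of a transcription to suit different needs.

Its main properties are:

model: the name of the model to use, for example OpenAI's general-purpose speech-to-text model "whisper-1". Different models may support different languages and features; see the OpenAI documentation.
responseFormat: the output format of the transcript. Possible values are json, text (plain text), srt (SubRip subtitles), verbose_json (detailed JSON), and vtt (WebVTT subtitles). The OpenAI API defaults to json, whereas the request log above shows Spring AI sending text when no options are supplied.
prompt: a text hint that guides the model toward a better transcript. It can carry context such as technical terms, names, or spelling conventions; when transcribing a technical talk, for instance, you can list the vocabulary of that field.
language: the language of the audio as an ISO-639-1 code, for example "en" (English), "zh" (Chinese), or "ja" (Japanese). Specifying the language can improve accuracy, particularly when automatic language detection would otherwise guess wrong.
temperature: the sampling temperature, which controls how random the transcript is. The valid range is 0.0 to 1.0. The OpenAI API defaults to 0, but the request log above shows Spring AI sending 0.7 when no options are supplied, so set it explicitly if you need deterministic output:

Low temperature (e.g. 0.2): more deterministic and conservative output, suited to transcription that must be exact.
High temperature (e.g. 0.8): more varied output, which may drift from what was actually said.

granularityType: the timestamp granularity of the transcript. It specifies which level of timestamps to return (word level or segment level) and can be used to build subtitle files with per-word start and end times. The OpenAI API only returns these timestamps when responseFormat is verbose_json.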
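
When responseFormat is srt, the transcript comes back as SubRip text rather than plain text. The following is a minimal sketch of turning such text into timed cues with the JDK only. The parser class is hypothetical, it assumes well-formed SRT blocks (index line, timing line, one or more text lines), and the sample input in main is made up for illustration rather than captured from the API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SrtParser {
    /** One subtitle cue: start and end in milliseconds plus its text. */
    static final class Cue {
        final long startMs;
        final long endMs;
        final String text;

        Cue(long startMs, long endMs, String text) {
            this.startMs = startMs;
            this.endMs = endMs;
            this.text = text;
        }
    }

    // Matches a timing line such as "00:00:04,200 --> 00:00:06,000".
    private static final Pattern TIMING = Pattern.compile(
            "(\\d{2}):(\\d{2}):(\\d{2}),(\\d{3}) --> (\\d{2}):(\\d{2}):(\\d{2}),(\\d{3})");

    /** Parses SRT text into cues, silently skipping blocks that are not well formed. */
    static List<Cue> parse(String srt) {
        List<Cue> cues = new ArrayList<>();
        String normalized = srt.replace("\r\n", "\n").trim();
        for (String block : normalized.split("\n\\s*\n")) {   // blocks are separated by blank lines
            String[] lines = block.split("\n");
            if (lines.length < 3) {
                continue;
            }
            Matcher m = TIMING.matcher(lines[1].trim());
            if (!m.matches()) {
                continue;
            }
            String text = String.join("\n", Arrays.copyOfRange(lines, 2, lines.length));
            cues.add(new Cue(toMillis(m, 1), toMillis(m, 5), text));
        }
        return cues;
    }

    /** Converts the four groups starting at firstGroup (hh, mm, ss, ms) into milliseconds. */
    private static long toMillis(Matcher m, int firstGroup) {
        long hours = Long.parseLong(m.group(firstGroup));
        long minutes = Long.parseLong(m.group(firstGroup + 1));
        long seconds = Long.parseLong(m.group(firstGroup + 2));
        long millis = Long.parseLong(m.group(firstGroup + 3));
        return ((hours * 60 + minutes) * 60 + seconds) * 1000 + millis;
    }

    public static void main(String[] args) {
        String srt = "1\n00:00:00,000 --> 00:00:04,200\nHello from Spring AI.\n\n"
                + "2\n00:00:04,200 --> 00:00:06,000\nSecond cue.\n";
        for (Cue cue : parse(srt)) {
            System.out.println(cue.startMs + "-" + cue.endMs + " ms: " + cue.text);
        }
    }
}
```

Keeping the cues as start and end offsets makes it straightforward to shift, merge, or re-export them to another subtitle format.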

For example, extract the text from the generated_speech.mp3 file produced by the text-to-speech example:

package com.hxstrive.springai.springai_openai.example.audio_model4;

import org.springframework.ai.audio.transcription.AudioTranscription;
import org.springframework.ai.audio.transcription.AudioTranscriptionOptions;
import org.springframework.ai.audio.transcription.AudioTranscriptionPrompt;
import org.springframework.ai.audio.transcription.AudioTranscriptionResponse;
import org.springframework.ai.openai.OpenAiAudioTranscriptionModel;
import org.springframework.ai.openai.OpenAiAudioTranscriptionOptions;
import org.springframework.ai.openai.api.OpenAiAudioApi;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.CommandLineRunner;
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.core.io.FileSystemResource;
import java.io.File;
import java.util.List;

@SpringBootApplication
public class AudioTranscriptionExample implements CommandLineRunner {

    @Autowired
    private OpenAiAudioTranscriptionModel transcriptionModel;

    public static void main(String[] args) {
        SpringApplication.run(AudioTranscriptionExample.class, args);
    }

    @Override
    public void run(String... args) throws Exception {
        // Path to the audio file (prepare a file that contains speech beforehand)
        File audioFile = new File("D:\\generated_speech.mp3");
        FileSystemResource audioResource = new FileSystemResource(audioFile);

        AudioTranscriptionOptions options = OpenAiAudioTranscriptionOptions.builder()
                .model("whisper-1") // Whisper model, multilingual
                .language("en") // Optional: language of the audio (e.g. "zh" for Chinese)
                .responseFormat(OpenAiAudioApi.TranscriptResponseFormat.TEXT) // Response format
                .temperature(0.0f) // Output randomness (0 is the most deterministic)
                .build();

        AudioTranscriptionPrompt prompt = new AudioTranscriptionPrompt(audioResource, options);

        // Send the request with the options configured above
        AudioTranscriptionResponse response = transcriptionModel.call(prompt);
        List<AudioTranscription> results = response.getResults();
        for(AudioTranscription result : results) {
            // Print the transcript
            System.out.println("Transcript: " + result.getOutput());
        }
    }
}

Running the code produces the following output:

2025-10-12 07:17:15.805 [restartedMain] DEBUG org.springframework.web.client.DefaultRestClient - Writing [{file=[Byte array resource [resource loaded from byte array]], model=[whisper-1], language=[en], prompt=[null], response_format=[text], temperature=[0.0]}] with org.springframework.http.converter.support.AllEncompassingFormHttpMessageConverter
2025-10-12 07:17:19.610 [restartedMain] DEBUG org.springframework.web.client.DefaultRestClient - Reading to [java.lang.String] as "text/plain;charset=utf-8"
Transcript: Hello, this is a text-to-speech example using Spring AI and OpenAI.

For more information, see the official Spring AI documentation.
