A Complete Guide to Audio Recording and Speech Denoising in Python

1. Recording Audio in Python

1.1 Basic Recording

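The sounddevice library provides cross-platform recording on top of PortAudio. It captures audio into NumPy arrays with a configurable sample rate, channel count, and sample type. It does not read or write audio files itself, so saving to WAV or FLAC is left to a library such as scipy or soundfile. A typical recording flow has three steps: configure the parameters, record, and save the result: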

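import numpy as np
import sounddevice as sd
from scipy.io.wavfile import write

# Parameters
fs = 44100    # sample rate (Hz)
duration = 5  # recording length (seconds)

# Record: sd.rec() returns immediately and fills the array in the background
print("Recording...")
recording = sd.rec(int(duration * fs), samplerate=fs, channels=1, dtype='float32')
sd.wait()  # block until the recording has finished
print("Recording finished")

# Save as 16-bit PCM WAV; clip first so out-of-range samples cannot wrap around
pcm16 = (np.clip(recording, -1.0, 1.0) * 32767).astype(np.int16)
write('output.wav', fs, pcm16)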

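This example records with sounddevice.rec() and saves with scipy.io.wavfile.write(). WAV can also hold 32-bit float samples, and scipy will write a float32 array as-is; the conversion to 16-bit integers is done here because 16-bit PCM is the most widely supported variant and halves the file size.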
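The float-to-integer conversion is easy to get subtly wrong, because a sample slightly outside [-1.0, 1.0] overflows the 16-bit range. The following is a minimal sketch of the same conversion using only the standard library; the helper name float_to_pcm16 is an illustrative assumption, and with NumPy the equivalent is to clip before calling astype(np.int16).

```python
import struct

def float_to_pcm16(samples):
    """Convert floats in [-1.0, 1.0] to little-endian 16-bit PCM bytes."""
    ints = []
    for s in samples:
        s = max(-1.0, min(1.0, s))        # clip so the value fits in int16
        ints.append(int(round(s * 32767)))
    return struct.pack('<%dh' % len(ints), *ints)

if __name__ == "__main__":
    raw = float_to_pcm16([0.0, 0.5, -0.5, 1.5])
    print(struct.unpack('<4h', raw))
```

The out-of-range sample 1.5 is clipped to full scale instead of wrapping to a negative value.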

1.2 Advanced Recording Control

When an application needs finer control over the audio stream, the pyaudio library exposes it directly:

import pyaudio
import wave

CHUNK = 1024              # frames read per call
FORMAT = pyaudio.paInt16  # 16-bit samples
CHANNELS = 1
RATE = 44100              # sample rate (Hz)
RECORD_SECONDS = 5
WAVE_OUTPUT_FILENAME = "advanced.wav"

p = pyaudio.PyAudio()
sample_width = p.get_sample_size(FORMAT)  # bytes per sample
stream = p.open(format=FORMAT,
                channels=CHANNELS,
                rate=RATE,
                input=True,
                frames_per_buffer=CHUNK)

print("Recording...")
frames = []
for _ in range(int(RATE / CHUNK * RECORD_SECONDS)):
    frames.append(stream.read(CHUNK))
print("Recording finished")

stream.stop_stream()
stream.close()
p.terminate()

# Write the captured chunks as a single WAV file
with wave.open(WAVE_OUTPUT_FILENAME, 'wb') as wf:
    wf.setnchannels(CHANNELS)
    wf.setsampwidth(sample_width)
    wf.setframerate(RATE)
    wf.writeframes(b''.join(frames))

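This approach reads the stream in fixed-size chunks, which keeps latency low and suits real-time speech processing. CHUNK sets the number of frames per read: a smaller value lowers latency but raises the risk of input overflow if the loop cannot keep up, while a larger value is more robust at the cost of added delay.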
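The trade-off can be made concrete with a little arithmetic. The sketch below uses two hypothetical helper names (chunk_latency_ms and reads_needed) to compute the buffering delay contributed by one chunk and the number of reads the loop above performs.

```python
def chunk_latency_ms(chunk, rate):
    """Duration of one chunk in milliseconds (minimum buffering delay)."""
    return 1000.0 * chunk / rate

def reads_needed(chunk, rate, seconds):
    """Number of whole chunks read by the recording loop."""
    return int(rate / chunk * seconds)

if __name__ == "__main__":
    for chunk in (256, 1024, 4096):
        print(chunk, round(chunk_latency_ms(chunk, 44100), 2), "ms")
    print(reads_needed(1024, 44100, 5), "reads for 5 s")
```

Because the loop count is truncated to a whole number of chunks, the recording can come out slightly shorter than RECORD_SECONDS (215 chunks of 1024 frames is about 4.99 s at 44.1 kHz).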

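2. Speech Denoising Techniques

2.1 Classical Denoising Algorithms

2.1.1 Spectral Subtraction

Spectral subtraction estimates the noise magnitude spectrum and subtracts it from the spectrum of the noisy speech: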

import numpy as np
import librosa
import soundfile as sf

def spectral_subtraction(input_file, output_file, n_fft=512, hop_length=256):
    # Load the audio at its native sample rate
    y, sr = librosa.load(input_file, sr=None)
    # Short-time Fourier transform
    S = librosa.stft(y, n_fft=n_fft, hop_length=hop_length)
    magnitude = np.abs(S)
    phase = np.angle(S)
    # Noise estimate (assumes the first 0.5 s contains only noise)
    noise_frame = max(1, int(0.5 * sr / hop_length))
    noise_mag = np.mean(magnitude[:, :noise_frame], axis=1, keepdims=True)
    # Spectral subtraction
    alpha = 2.0   # over-subtraction factor
    beta = 0.002  # spectral floor
    processed_mag = np.maximum(magnitude - alpha * noise_mag, beta * noise_mag)
    # Inverse transform, reusing the noisy phase
    processed_S = processed_mag * np.exp(1j * phase)
    y_processed = librosa.istft(processed_S, hop_length=hop_length, length=len(y))
    # Save the result
    sf.write(output_file, y_processed, sr)

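Two parameters matter most here: alpha sets how aggressively the noise estimate is subtracted, and beta sets a spectral floor that masks the isolated residual peaks heard as musical noise. In practice both should be adjusted to the signal-to-noise ratio rather than fixed.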
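One way to make alpha depend on the SNR is a clipped linear rule of the kind usually attributed to Berouti et al.: subtract more when the estimated SNR is low and less when it is high. The sketch below is an illustration under that assumption; the function name and the constants (4.0 at 0 dB, falling by 3/20 per dB, clipped to the range -5 dB to 20 dB) are example defaults, not values taken from this article.

```python
def oversubtraction_factor(snr_db, alpha0=4.0, slope=3.0 / 20.0,
                           snr_lo=-5.0, snr_hi=20.0):
    """Return an over-subtraction factor that decreases as SNR increases."""
    snr = max(snr_lo, min(snr_hi, snr_db))  # clip to the supported range
    return alpha0 - slope * snr

if __name__ == "__main__":
    for snr_db in (-10, 0, 10, 20, 30):
        print(snr_db, "dB ->", round(oversubtraction_factor(snr_db), 2))
```

The returned value could replace the fixed alpha = 2.0, computed once per file or once per frame from a running SNR estimate.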

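2.1.2 Wiener Filtering

The Wiener filter applies to each time-frequency bin the gain that minimizes the mean squared error between the estimate and the clean speech: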

import numpy as np
import librosa
import soundfile as sf

def wiener_filter(input_file, output_file, n_fft=512, hop_length=256):
    y, sr = librosa.load(input_file, sr=None)
    S = librosa.stft(y, n_fft=n_fft, hop_length=hop_length)
    power = np.abs(S) ** 2
    # Noise estimate (assumes the first 0.5 s contains only noise)
    noise_frame = max(1, int(0.5 * sr / hop_length))
    noise_power = np.mean(power[:, :noise_frame], axis=1, keepdims=True) + 1e-12
    # power / noise_power is the a posteriori SNR (speech plus noise over noise);
    # subtracting 1 gives a simple estimate of the a priori SNR the gain needs
    snr = np.maximum(power / noise_power - 1.0, 1e-6)
    wiener_gain = snr / (snr + 1)
    # A real-valued gain scales the magnitude and keeps the noisy phase
    filtered_S = S * wiener_gain
    y_filtered = librosa.istft(filtered_S, hop_length=hop_length, length=len(y))
    sf.write(output_file, y_filtered, sr)

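The Wiener filter tends to preserve the naturalness of speech, but it depends on an accurate noise power estimate. In practice it is often combined with voice activity detection (VAD) so that the noise estimate is updated only during non-speech frames.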
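To illustrate how VAD feeds the noise estimate, here is a deliberately crude energy-based sketch: frames whose power lies within a margin of the quietest frame are treated as noise, and only those frames contribute to the estimate. The function name, the 6 dB margin, and the per-frame power input are all assumptions for illustration; production systems use considerably more robust detectors.

```python
def estimate_noise_power(frame_powers, margin_db=6.0):
    """Return (noise_power, speech_flags) from a list of per-frame powers.

    A frame counts as noise if its power is within margin_db of the
    quietest frame; everything louder is flagged as speech.
    """
    floor = max(min(frame_powers), 1e-12)
    threshold = floor * 10 ** (margin_db / 10.0)
    noise = [p for p in frame_powers if p <= threshold]
    flags = [p > threshold for p in frame_powers]
    return sum(noise) / len(noise), flags

if __name__ == "__main__":
    powers = [1.0, 1.2, 100.0, 80.0, 0.9]
    noise_power, flags = estimate_noise_power(powers)
    print(round(noise_power, 3), flags)
```

Unlike the fixed "first 0.5 s" assumption, this keeps working when speech starts immediately, as long as some quiet frames exist somewhere in the signal.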

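2.2 Deep Learning Denoising

2.2.1 RNNoise

RNNoise is a lightweight neural-network noise suppression library developed by Jean-Marc Valin at Mozilla and hosted by Xiph.Org: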

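import subprocess

def rnnoise_process(input_file, output_file):
    # Decode the input to raw 16-bit, 48 kHz, mono PCM on stdout
    cmd = [
        'ffmpeg',
        '-i', input_file,
        '-f', 's16le',
        '-ar', '48000',
        '-ac', '1',
        '-'
    ]
    # Assumption: `rnnoise` is a locally built wrapper that reads raw PCM from
    # stdin and writes denoised PCM to stdout; upstream RNNoise does not ship it
    rnnoise_cmd = ['rnnoise', '-']
    p1 = subprocess.Popen(cmd, stdout=subprocess.PIPE)
    p2 = subprocess.Popen(rnnoise_cmd, stdin=p1.stdout, stdout=subprocess.PIPE)
    p1.stdout.close()  # lets ffmpeg receive SIGPIPE if rnnoise exits early
    # Save the result (headerless raw PCM, not a WAV file)
    with open(output_file, 'wb') as f:
        while True:
            data = p2.stdout.read(1024)
            if not data:
                break
            f.write(data)
    p2.stdout.close()
    p2.wait()
    p1.wait()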

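This approach requires RNNoise to be built and installed beforehand. The "rnnoise -" command in the listing stands for a wrapper that filters raw PCM from stdin to stdout; the upstream repository instead builds an rnnoise_demo example that takes an input and an output file path. Either way the data is headerless 16-bit, 48 kHz, mono PCM, so the output needs a WAV header before most players can open it. The main advantage of RNNoise is its low resource use (reportedly about 2 MB of memory), which makes it suitable for embedded devices.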
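Because the pipeline produces raw PCM, a small wrapper is needed to turn it into a playable file. The sketch below uses the standard wave module and assumes the 16-bit, 48 kHz, mono layout requested from ffmpeg; the function name pcm16_to_wav_bytes is illustrative.

```python
import io
import wave

def pcm16_to_wav_bytes(pcm_bytes, rate=48000, channels=1):
    """Wrap headerless 16-bit PCM in a WAV container and return the bytes."""
    buf = io.BytesIO()
    with wave.open(buf, 'wb') as wf:
        wf.setnchannels(channels)
        wf.setsampwidth(2)       # 2 bytes per sample = 16-bit
        wf.setframerate(rate)
        wf.writeframes(pcm_bytes)
    return buf.getvalue()

if __name__ == "__main__":
    silence = b'\x00\x00' * 480  # 10 ms of silence at 48 kHz
    wav_data = pcm16_to_wav_bytes(silence)
    print(len(wav_data), wav_data[:4])
```

Writing the returned bytes to disk yields a file that ordinary players and libraries such as soundfile can open.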

2.2.2 A CRN Model in TensorFlow

The convolutional recurrent network (CRN) is a widely used architecture for neural speech denoising:

import tensorflow as tf
from tensorflow.keras.layers import (Input, Conv2D, LSTM, Dense,
                                     Reshape, UpSampling2D)

def build_crn_model(input_shape=(256, 128, 1)):
    # Axis 1 is treated as the sequence axis; both spatial sizes must be even
    inputs = Input(shape=input_shape)
    # Encoder: the strided convolution halves both axes
    x = Conv2D(64, (3, 3), activation='relu', padding='same')(inputs)
    x = Conv2D(64, (3, 3), activation='relu', padding='same', strides=2)(x)
    t, f, c = x.shape[1], x.shape[2], x.shape[3]
    # LSTM layer: flatten the last two axes into one feature vector per step
    x = Reshape((t, f * c))(x)
    x = LSTM(128, return_sequences=True)(x)
    # Project back to the encoder feature size so the tensor can be unflattened
    x = Dense(f * c, activation='relu')(x)
    x = Reshape((t, f, c))(x)
    # Decoder: upsample back to the input resolution
    x = UpSampling2D((2, 2))(x)
    x = Conv2D(64, (3, 3), activation='relu', padding='same')(x)
    # Sigmoid output in [0, 1] with the same shape as the input
    outputs = Conv2D(1, (3, 3), activation='sigmoid', padding='same')(x)
    return tf.keras.Model(inputs=inputs, outputs=outputs)

# Training example (requires a prepared dataset)
# model = build_crn_model()
# model.compile(optimizer='adam', loss='mse')
# model.fit(train_data, train_labels, epochs=50)

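A CRN combines the feature extraction of convolutional layers with the temporal modeling of recurrent layers, and CRN-style models have performed well in benchmarks such as the DNS Challenge. In practice, data augmentation during training and model compression for deployment both deserve attention.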
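A common form of data augmentation for denoising models is to mix clean speech with noise at a chosen SNR, so that one clean recording yields many training pairs. The following is a minimal standard-library sketch; the function name mix_at_snr is illustrative, and it assumes both inputs are equal-length sequences of floats with non-zero power.

```python
import math

def mix_at_snr(clean, noise, snr_db):
    """Scale noise so the clean-to-noise power ratio is snr_db, then add it."""
    p_clean = sum(x * x for x in clean) / len(clean)
    p_noise = sum(x * x for x in noise) / len(noise)
    # Required noise power is p_clean / 10^(snr_db / 10)
    scale = math.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10.0)))
    return [c + scale * n for c, n in zip(clean, noise)]

if __name__ == "__main__":
    clean = [math.sin(0.1 * i) for i in range(1000)]
    noise = [0.3 * (-1) ** i for i in range(1000)]
    noisy = mix_at_snr(clean, noise, snr_db=5.0)
    print(len(noisy))
```

Drawing snr_db at random for each training example, for instance from a range such as 0 to 20 dB, exposes the model to a spread of noise conditions.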

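3. Engineering Recommendations

3.1 Evaluating Denoising Quality

The following objective metrics are recommended in combination:

PESQ: Perceptual Evaluation of Speech Quality (score from -0.5 to 4.5)
STOI: Short-Time Objective Intelligibility (0 to 1)
SegSNR: segmental signal-to-noise ratio, averaged over short frames (dB)

Example implementation: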

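from pesq import pesq  # the `pesq` package; its signature is pesq(fs, ref, deg, mode)
import soundfile as sf

def evaluate_pesq(clean_file, processed_file, sr=16000):
    clean, sr_clean = sf.read(clean_file)
    processed, sr_processed = sf.read(processed_file)
    # Wideband PESQ expects both signals at 16 kHz
    if sr_clean != sr or sr_processed != sr:
        raise ValueError(f"both files must be sampled at {sr} Hz")
    n = min(len(clean), len(processed))  # compare equal-length signals
    return pesq(sr, clean[:n], processed[:n], 'wb')  # wideband mode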
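Alongside PESQ, a segmental SNR is simple enough to compute directly and needs no external package. The sketch below averages per-frame SNR values and clamps each to the range -10 dB to 35 dB, a commonly used convention; the function name, the 320-sample frame (20 ms at 16 kHz), and the clamp limits are assumptions for illustration.

```python
import math

def segmental_snr(clean, processed, frame_len=320, lo_db=-10.0, hi_db=35.0):
    """Mean of per-frame SNR in dB, each frame clamped to [lo_db, hi_db]."""
    n = min(len(clean), len(processed))
    eps = 1e-12  # avoids division by zero and log of zero
    values = []
    for start in range(0, n - frame_len + 1, frame_len):
        ref = clean[start:start + frame_len]
        est = processed[start:start + frame_len]
        signal = sum(r * r for r in ref)
        error = sum((r - e) ** 2 for r, e in zip(ref, est))
        snr_db = 10.0 * math.log10((signal + eps) / (error + eps))
        values.append(max(lo_db, min(hi_db, snr_db)))
    if not values:
        raise ValueError("signals are shorter than one frame")
    return sum(values) / len(values)

if __name__ == "__main__":
    clean = [math.sin(0.1 * i) for i in range(1600)]
    processed = [0.9 * x for x in clean]
    print(round(segmental_snr(clean, processed), 2), "dB")
```

Clamping keeps silent or perfectly reconstructed frames from dominating the average.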

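3.2 Real-Time Processing Optimization

For real-time applications, the following strategies are recommended:

Chunked processing: split the audio stream into frames of 20-50 ms
Asynchronous processing: use a producer-consumer model
Model quantization: convert FP32 models to INT8
Hardware acceleration: use a GPU or DSP for parallel computation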
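The first two strategies can be sketched together with the standard library: a producer thread slices the signal into fixed-duration frames and a consumer thread processes them through a bounded queue. The names frame_size and run_pipeline, the queue size of 8, and the identity default for process are illustrative assumptions; in a real system the producer would be the audio callback and process would be the denoiser.

```python
import queue
import threading

def frame_size(rate, frame_ms):
    """Number of samples in one frame of frame_ms milliseconds."""
    return int(rate * frame_ms / 1000)

def run_pipeline(samples, rate=16000, frame_ms=20, process=lambda frame: frame):
    size = frame_size(rate, frame_ms)
    q = queue.Queue(maxsize=8)   # bounded queue applies back-pressure
    out = []

    def producer():
        for start in range(0, len(samples), size):
            q.put(samples[start:start + size])
        q.put(None)              # sentinel: no more frames

    def consumer():
        while True:
            frame = q.get()
            if frame is None:
                break
            out.append(process(frame))

    threads = [threading.Thread(target=producer),
               threading.Thread(target=consumer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return out

if __name__ == "__main__":
    frames = run_pipeline(list(range(1000)))
    print([len(f) for f in frames])
```

With 20 ms frames at 16 kHz each frame holds 320 samples, so the processing step has roughly 20 ms per frame before it starts falling behind the input.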

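3.3 Typical Application Scenarios

Scenario	Recommended approach	Key requirements
Video conferencing	RNNoise cascaded with spectral subtraction	Latency < 50 ms, MOS > 3.5
Voice assistant	CRN model + endpoint detection	Wake-up rate > 95%, false alarm rate < 3%
Voice recorder	Wiener filtering + automatic gain control	SNR improvement > 10 dB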

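4. Technology Trends

Current research concentrates on three areas:

Low-resource denoising: real-time processing within a 100 mW power budget
Personalized denoising: solutions tailored to the user's voiceprint
Multimodal fusion: using visual information to improve denoising

Recent work on time-domain networks points in the same direction. Demucs, for example, began as a time-domain U-Net with a recurrent bottleneck, and its later Hybrid Transformer variant adds self-attention to capture long-range dependencies; it has reported state-of-the-art results in music source separation.

The code examples and engineering recommendations in this article can help developers build a speech processing pipeline from basic recording through advanced denoising. The right combination of algorithms depends on the deployment target (for example an embedded device versus a server cluster), and balancing computational cost against output quality takes iterative tuning.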