Python CAPTCHA Recognition: Recognizing Simple Image CAPTCHAs with pytesseract

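CAPTCHA recognition is a recurring technical challenge in web automation testing, crawler development, and data collection. For low-complexity image CAPTCHAs (digits only, simple letter combinations, characters without interference lines), the open-source OCR tool pytesseract, a Python wrapper around Tesseract OCR, is often sufficient. This article walks through the full process from environment setup to code, then covers strategies for improving recognition accuracy.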

1. Technical Principles and Applicable Scenarios

1.1 How pytesseract works

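pytesseract is a Python wrapper for Google's Tesseract OCR engine. It hands an image to the Tesseract executable, which converts the text in the image into an editable string. The overall pipeline has three steps: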

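Image preprocessing: binarization, denoising, and interference removal
Character segmentation: locate character boundaries and split them
Character recognition: match each character against a trained model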

1.2 Suitable CAPTCHA types

This approach works best on CAPTCHAs with the following characteristics:

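Clear, undistorted characters (for example, a standard font)
Simple background (solid color or gradient)
No complex interference (such as warped lines or noise dots)
Moderate character spacing (no touching characters)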

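Where it does not apply: heavily obfuscated CAPTCHAs (distorted characters, overlapping characters, animated backgrounds) call for a deep learning model such as a CNN.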

2. Environment Setup and Dependency Installation

2.1 Installing the base dependencies

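# Install pytesseract and Pillow
pip install pytesseract pillow
# Install the Tesseract OCR engine (Ubuntu shown here)
sudo apt install tesseract-ocr  # base package (English only)
sudo apt install libtesseract-dev  # development headers (optional; pytesseract only calls the tesseract binary)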

2.2 Additional language packs (optional)

To recognize Chinese or other languages, install the corresponding trained data:

# Install the Chinese language pack (Ubuntu shown here)
sudo apt install tesseract-ocr-chi-sim  # Simplified Chinese

Windows users need to download the .traineddata file manually and place it in the Tesseract-OCR\tessdata directory.

2.3 Verifying the installation

import pytesseract
from PIL import Image
# Point pytesseract at the Tesseract executable (needed on Windows if it is not on PATH)
# pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'
# Run a test recognition
image = Image.open('test.png')
text = pytesseract.image_to_string(image, lang='eng')
print(text)
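
Before running the test above, it can help to confirm that the Tesseract executable is actually reachable, since a missing binary is a common cause of a failed first run. The following is a minimal sketch using only the standard library; the helper names `find_tesseract` and `tesseract_version` are hypothetical and not part of pytesseract.

```python
import shutil
import subprocess


def find_tesseract(cmd="tesseract"):
    """Return the full path of the Tesseract executable, or None if it is not on PATH."""
    return shutil.which(cmd)


def tesseract_version(cmd="tesseract"):
    """Return the first line of `tesseract --version`, or None if the binary is missing."""
    path = find_tesseract(cmd)
    if path is None:
        return None
    proc = subprocess.run([path, "--version"], capture_output=True, text=True)
    # Some releases print the version banner to stderr, so check both streams.
    output = (proc.stdout or proc.stderr).strip()
    return output.splitlines()[0] if output else None
```

If `find_tesseract()` returns None on Windows, setting `pytesseract.pytesseract.tesseract_cmd` to the installed path, as in the commented line above, is the usual remedy.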

3. Image Preprocessing Optimization

3.1 Basic preprocessing pipeline

from PIL import Image, ImageFilter
import numpy as np  # used by auto_threshold in section 5.2
def preprocess_image(image_path):
    # 1. Convert to grayscale
    img = Image.open(image_path).convert('L')
    # 2. Binarize (tune the threshold for your images)
    threshold = 140
    img = img.point(lambda x: 0 if x < threshold else 255)
    # 3. Denoise with a median filter
    img = img.filter(ImageFilter.MedianFilter(size=3))
    # 4. Edge enhancement (optional)
    # img = img.filter(ImageFilter.EDGE_ENHANCE)
    return img
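
To make the effect of steps 2 and 3 concrete, here is a small pure-Python sketch that applies the same two operations to a tiny grid of grayscale values. It illustrates the idea only and is not how Pillow implements them; in particular, the border handling (reusing the nearest edge pixel) is an assumption of this sketch.

```python
from statistics import median


def binarize(pixels, threshold=140):
    """Map each grayscale value to 0 (dark) or 255 (light)."""
    return [[0 if v < threshold else 255 for v in row] for row in pixels]


def median_filter_3x3(pixels):
    """Replace each pixel with the median of its 3x3 neighbourhood.

    Border pixels reuse the nearest edge value (an assumption of this sketch).
    """
    h, w = len(pixels), len(pixels[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            window = [
                pixels[min(max(y + dy, 0), h - 1)][min(max(x + dx, 0), w - 1)]
                for dy in (-1, 0, 1)
                for dx in (-1, 0, 1)
            ]
            row.append(int(median(window)))
        out.append(row)
    return out


# A light 5x5 patch with one isolated dark speck in the middle.
patch = [[200] * 5 for _ in range(5)]
patch[2][2] = 30

binary = binarize(patch)             # the speck becomes a single 0 among 255s
cleaned = median_filter_3x3(binary)  # the isolated speck is voted out by its neighbours
```

The same mechanism explains why a median filter can also erase thin strokes: any dark feature that never fills most of a 3x3 window is treated as noise.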

3.2 Advanced preprocessing techniques

Adaptive thresholding: for CAPTCHAs with uneven lighting

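import cv2  # pip install opencv-python
from PIL import Image
def adaptive_threshold(image_path):
    # Read as grayscale; adaptiveThreshold needs a single-channel 8-bit image
    img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    # Gaussian-weighted 11x11 neighborhood, constant 2 subtracted from the local mean
    img = cv2.adaptiveThreshold(img, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                               cv2.THRESH_BINARY, 11, 2)
    return Image.fromarray(img)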

Character segmentation: splitting touching characters apart

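def split_characters(image_path, char_width=20):
    img = preprocess_image(image_path)
    width, height = img.size
    # Naive fixed-width column split (adjust char_width to the actual CAPTCHA)
    chars = []
    for x in range(0, width, char_width):
        # Clamp the right edge so the last slice stays inside the image
        char = img.crop((x, 0, min(x + char_width, width), height))
        chars.append(char)
    return chars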
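
Fixed-width slicing only works when every character occupies the same horizontal slot. A common alternative is a vertical projection: scan the binarized image column by column and cut wherever a run of inked columns ends. The sketch below works on a plain list of rows so it runs without any imaging library; the function name `find_character_spans` and its parameters are hypothetical.

```python
def find_character_spans(binary, ink=0, min_width=2):
    """Return (start, end) column ranges that contain ink, end exclusive.

    `binary` is a list of rows holding 0/255 values, dark text on a light background.
    """
    width = len(binary[0])
    # A column is "inked" if any pixel in it has the ink value.
    inked = [any(row[x] == ink for row in binary) for x in range(width)]

    spans, start = [], None
    for x, has_ink in enumerate(inked):
        if has_ink and start is None:
            start = x  # a new run of inked columns begins
        elif not has_ink and start is not None:
            if x - start >= min_width:  # ignore runs too narrow to be a character
                spans.append((start, x))
            start = None
    if start is not None and width - start >= min_width:
        spans.append((start, width))  # run that reaches the right edge
    return spans


# Three rows, twelve columns; ink in columns 1-3 and 6-9 of the middle row.
grid = [[255] * 12 for _ in range(3)]
for col in (1, 2, 3, 6, 7, 8, 9):
    grid[1][col] = 0

spans = find_character_spans(grid)
```

Each span can then be passed to `img.crop((start, 0, end, height))`. Note that characters that actually touch show up as one wide span, so this approach separates spaced characters reliably but still needs an extra rule (for example, splitting spans that are much wider than average) for glued ones.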

4. Complete Implementation

4.1 Basic recognition code

import pytesseract
from PIL import Image
def recognize_captcha(image_path, lang='eng'):
    try:
        # Preprocess (preprocess_image is defined in section 3.1)
        img = preprocess_image(image_path)
        # Recognition settings
        custom_config = r'--oem 3 --psm 6'  # oem 3: default engine; psm 6: assume a single uniform block of text
        # Run recognition
        text = pytesseract.image_to_string(img, config=custom_config, lang=lang)
        # Clean the result (strip spaces and line breaks)
        return ''.join(text.split())
    except Exception as e:
        print(f"Recognition failed: {e}")
        return None
# Usage example
result = recognize_captcha('captcha.png')
print(f"Recognition result: {result}")

4.2 Batch recognition and result checking

import os
def batch_recognize(folder_path):
    results = []
    for filename in os.listdir(folder_path):
        if filename.endswith(('.png', '.jpg', '.jpeg')):
            filepath = os.path.join(folder_path, filename)
            text = recognize_captcha(filepath)
            results.append((filename, text))
    return results
# Usage example
for file, text in batch_recognize('captcha_samples'):
    print(f"{file}: {text}")
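
The batch function above collects results but does not check them. One simple way to measure accuracy is to name each sample file after its correct answer and compare. The sketch below assumes that naming convention (for example, `a7k2.png` contains the text `a7k2`); the convention and the helper names are assumptions for illustration, not something the batch code requires.

```python
import os


def label_from_filename(filename):
    """Assume the file is named after its answer, e.g. 'a7k2.png' -> 'a7k2'."""
    return os.path.splitext(filename)[0]


def accuracy(results, case_sensitive=False):
    """Fraction of (filename, text) pairs whose text matches the filename label."""
    if not results:
        return 0.0
    hits = 0
    for filename, text in results:
        expected = label_from_filename(filename)
        actual = text or ""  # a failed recognition returns None
        if not case_sensitive:
            expected, actual = expected.lower(), actual.lower()
        hits += expected == actual
    return hits / len(results)


# Hand-written sample in the same (filename, text) shape that batch_recognize returns.
sample = [("a7k2.png", "A7K2"), ("9f3d.png", "9f8d"), ("x1y2.jpg", None), ("4455.png", "4455")]
print(f"accuracy: {accuracy(sample):.0%}")
```

With real data, `accuracy(batch_recognize('captcha_samples'))` gives a single number to compare before and after each preprocessing or parameter change.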

5. Strategies for Improving Recognition Accuracy

5.1 Parameter tuning

Choosing a PSM mode:

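# Commonly used PSM (page segmentation mode) values
# 6: assume a single uniform block of text (used in section 4.1; Tesseract's own default is 3)
# 7: treat the image as a single text line
# 11: sparse text (suits scattered characters)
config = r'--oem 3 --psm 11'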

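Matching the language pack: make sure the lang parameter matches the language of the CAPTCHA (for example, chi_sim for Simplified Chinese).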
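
Beyond PSM and language, restricting the output alphabet often helps when the CAPTCHA is known to contain only digits or only a fixed character set. Tesseract exposes this through the `tessedit_char_whitelist` config variable. The helper below is a hypothetical convenience for assembling the config string; whether the whitelist is honored depends on the Tesseract version and engine in use, so treat it as something to verify on your installation.

```python
def build_config(psm=7, oem=3, whitelist=None):
    """Assemble a Tesseract config string for pytesseract's `config` argument."""
    parts = [f"--oem {oem}", f"--psm {psm}"]
    if whitelist:
        # tessedit_char_whitelist restricts output to the listed characters.
        parts.append(f"-c tessedit_char_whitelist={whitelist}")
    return " ".join(parts)


digits_only = build_config(psm=7, whitelist="0123456789")
print(digits_only)
```

The resulting string is passed as `pytesseract.image_to_string(img, config=digits_only)`. PSM 7 is chosen here on the assumption that the CAPTCHA is a single line of text.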

5.2 Dynamic threshold adjustment

import numpy as np
from PIL import Image
def auto_threshold(image_path):
    img = Image.open(image_path).convert('L')
    pixels = np.array(img)
    # Compute the mean brightness
    avg = np.mean(pixels)
    # Dynamic threshold (tune the coefficient on real data)
    threshold = int(avg * 0.9)
    return img.point(lambda x: 0 if x < threshold else 255)
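
Scaling the mean brightness needs a hand-tuned coefficient. A widely used alternative that needs no coefficient is Otsu's method, which picks the threshold that best separates the histogram into a dark and a light class. This is a swapped-in technique rather than the approach used above, shown as a pure-Python sketch operating on a 256-bin histogram; the function name `otsu_threshold` is hypothetical.

```python
def otsu_threshold(histogram):
    """Return the threshold t (0-255) that maximises between-class variance.

    `histogram[i]` is the number of pixels with grayscale value i.
    Pixels with value < t form the dark class, matching an `x < threshold` test.
    """
    total = sum(histogram)
    total_sum = sum(i * count for i, count in enumerate(histogram))

    best_t, best_var = 0, -1.0
    weight_dark, sum_dark = 0, 0
    for t in range(1, 256):
        # Move grayscale value t-1 into the dark class.
        weight_dark += histogram[t - 1]
        sum_dark += (t - 1) * histogram[t - 1]
        weight_light = total - weight_dark
        if weight_dark == 0 or weight_light == 0:
            continue  # one class is empty, so there is nothing to separate
        mean_dark = sum_dark / weight_dark
        mean_light = (total_sum - sum_dark) / weight_light
        between = weight_dark * weight_light * (mean_dark - mean_light) ** 2
        if between > best_var:
            best_t, best_var = t, between
    return best_t


# Synthetic histogram: dark text pixels around 40, light background around 210.
hist = [0] * 256
hist[40] = 100
hist[210] = 300
t = otsu_threshold(hist)
```

For a Pillow image in mode 'L', `img.histogram()` returns the 256 counts this function expects, so the result can be plugged into the same `img.point(...)` call as above.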

5.3 Combining multiple engines (advanced)

For harder CAPTCHAs, several OCR engines can be combined:

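def hybrid_recognize(image_path):
    from easyocr import Reader  # requires easyocr to be installed
    reader = Reader(['en'])  # loading the model is slow; reuse the reader in real code
    # pytesseract result (None on failure, so fall back to an empty string)
    pyt_result = recognize_captcha(image_path) or ''
    # easyocr result
    easy_result = reader.readtext(image_path, detail=0)
    easy_text = ''.join(easy_result)
    # Voting (simple example): accept only if more than half the characters agree
    if pyt_result and len(pyt_result) == len(easy_text):
        matches = sum(c == d for c, d in zip(pyt_result, easy_text))
        return pyt_result if matches > len(pyt_result) / 2 else None
    return pyt_result or easy_text or None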
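
With only two engines, the vote above can confirm agreement but cannot break a disagreement. If three or more candidate strings are available (for example, from different engines or from the same engine under different preprocessing), a per-character majority vote becomes possible. The following is a minimal sketch under that assumption; `vote_per_character` is a hypothetical helper.

```python
from collections import Counter


def vote_per_character(candidates):
    """Combine equal-length OCR outputs by majority vote at each position.

    Returns None when fewer than two usable candidates share a length
    or when any position has no strict majority.
    """
    usable = [c for c in candidates if c]  # drop None and empty strings
    if len(usable) < 2:
        return None
    # Keep only candidates with the most common length.
    target_len, _ = Counter(len(c) for c in usable).most_common(1)[0]
    same_len = [c for c in usable if len(c) == target_len]
    if len(same_len) < 2:
        return None

    voted = []
    for chars in zip(*same_len):
        char, count = Counter(chars).most_common(1)[0]
        if count * 2 <= len(same_len):
            return None  # tie or no majority at this position
        voted.append(char)
    return "".join(voted)
```

Returning None on a tie is a deliberate choice: for CAPTCHAs it is usually cheaper to request a fresh image than to submit a low-confidence guess.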

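6. Practical Recommendations

Handle CAPTCHAs by category:
- Use pytesseract directly for simple CAPTCHAs
- Call a deep learning model (such as CRNN) for complex ones

Result validation: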

def validate_result(text, expected_length=4):
    return text is not None and len(text) == expected_length and text.isalnum()

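Performance optimization:
- Use multiple threads for batch processing
- Cache results for repeated CAPTCHAs

Legal compliance:
- Use only in lawfully authorized testing scenarios
- Avoid disrupting production systems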
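
The two performance suggestions above (threads for batches, a cache for repeats) can be sketched together as follows. The recognizer is injected as a callable taking raw image bytes, which is an assumption of this sketch; `CachedRecognizer` and `recognize_many` are hypothetical names. Threads are a reasonable fit here because pytesseract spends most of its time waiting on the external tesseract process.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


class CachedRecognizer:
    """Wrap a recognizer so byte-identical images are only processed once."""

    def __init__(self, recognize_bytes):
        self._recognize = recognize_bytes  # callable: bytes -> str or None
        self._cache = {}

    def __call__(self, image_bytes):
        # Identical files hash to the same key, whatever their filename.
        key = hashlib.sha256(image_bytes).hexdigest()
        if key not in self._cache:
            # Two threads may miss on the same key at once; both results are
            # identical, so the only cost is duplicated work.
            self._cache[key] = self._recognize(image_bytes)
        return self._cache[key]


def recognize_many(recognizer, images, max_workers=4):
    """Run the recognizer over many images in a thread pool, preserving order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(recognizer, images))
```

The cache only helps when a site serves byte-identical images more than once; a CAPTCHA that is re-rendered per request will never produce a hit.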

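7. Summary and Extensions

This article used pytesseract to recognize simple image CAPTCHAs. The core steps are environment setup, image preprocessing, parameter tuning, and result validation. In real projects, keep in mind: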

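Preprocessing is the key to better accuracy
Complex scenarios require deep learning
Always comply with laws, regulations, and website terms of service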

Directions for extension:

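Train a custom Tesseract model (for a specific font)
Integrate into a Scrapy crawler or a Selenium automation workflow
Build a visual debugging tool (for example, one that displays intermediate preprocessing results)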

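With systematic tuning, pytesseract can reach recognition accuracy above 90% on simple CAPTCHAs, although the actual figure depends on the CAPTCHA style and should be measured on a labeled sample set. That makes it a practical option for automated testing and data collection.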
