A Complete Guide to Text Recognition for Images and Scanned PDFs in Python

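As organizations digitize, companies and developers often need to extract text from images or scanned PDFs. Manual entry is slow and error-prone, whereas Python-based OCR (optical character recognition) automates the work and is considerably faster. This article explains how to use Python libraries such as Pillow, PyMuPDF, and pytesseract to recognize text in images and scanned PDFs, with working code examples and optimization tips.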

1. Technology Selection and Tool Setup

1.1 Core Libraries

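Pillow: a Python imaging library used to read and preprocess images (for example, resizing and binarization).
PyMuPDF (fitz): a lightweight PDF parsing library that can extract the image data embedded in scanned PDFs.
pytesseract: a Python wrapper for Tesseract OCR with multi-language support (requires the Tesseract engine to be installed).
OpenCV: an optional library for more involved image preprocessing (for example, denoising and edge detection).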

1.2 Environment Setup

Install the dependencies:

pip install pillow pymupdf pytesseract opencv-python

Install the Tesseract OCR engine:
- Windows: download the installer from UB Mannheim.
- macOS: brew install tesseract.
- Linux: sudo apt install tesseract-ocr (add a language package such as tesseract-ocr-chi-sim as needed).
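
Before running any OCR code, it can help to confirm that the Tesseract executable and the required language packs are actually installed. The sketch below is one way to do that with only the standard library. It assumes the executable is named tesseract and is on PATH, and that `tesseract --list-langs` prints a header line followed by one language code per line; both stdout and stderr are checked in case a version writes the list to stderr. The helper names are illustrative.

```python
import shutil
import subprocess


def parse_langs(listing):
    """Parse the text printed by `tesseract --list-langs` into a set of codes."""
    lines = [line.strip() for line in listing.splitlines() if line.strip()]
    # Assumption: the first non-empty line is a header such as
    # "List of available languages (3):", so it is skipped.
    return set(lines[1:])


def installed_languages():
    """Return the installed language codes, or None if tesseract is not on PATH."""
    exe = shutil.which("tesseract")
    if exe is None:
        return None
    result = subprocess.run([exe, "--list-langs"], capture_output=True, text=True)
    return parse_langs(result.stdout or result.stderr)


def missing_languages(required, installed):
    """Return the codes in `required` (e.g. "chi_sim+eng") that are not installed."""
    return [code for code in required.split("+") if code not in installed]


if __name__ == "__main__":
    langs = installed_languages()
    if langs is None:
        print("tesseract not found on PATH")
    else:
        print("missing:", missing_languages("chi_sim+eng", langs))
```

If the check reports a missing code, installing the matching language package (for example tesseract-ocr-chi-sim on Debian-based systems) should resolve it.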

2. Recognizing Text in Images

2.1 Basic Recognition Flow

from PIL import Image
import pytesseract
# Load the image
image = Image.open("example.png")
# Recognize text (the default language is English)
text = pytesseract.image_to_string(image)
print(text)
# Recognize Simplified Chinese (requires the chi_sim language pack)
text_chinese = pytesseract.image_to_string(image, lang="chi_sim")
print(text_chinese)

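2.2 Image Preprocessing

Scans and low-quality images usually need preprocessing to improve recognition accuracy:

import cv2
import pytesseract
from PIL import Image

def preprocess_image(image_path):
    # Read the image; cv2.imread returns None if the file cannot be read
    img = cv2.imread(image_path)
    if img is None:
        raise FileNotFoundError(image_path)
    # Convert to grayscale
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    # Denoise (optional) before thresholding, while gray levels are still available
    denoised = cv2.fastNlMeansDenoising(gray, None, 10, 7, 21)
    # Binarize with a fixed threshold of 150
    _, binary = cv2.threshold(denoised, 150, 255, cv2.THRESH_BINARY)
    return binary

processed_img = preprocess_image("low_quality.png")
text = pytesseract.image_to_string(Image.fromarray(processed_img))
print(text)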
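
The fixed threshold of 150 used above may not suit every scan, since brightness and contrast vary between documents. One common alternative is Otsu's method, which picks the threshold from the image's own histogram; OpenCV exposes it through the cv2.THRESH_OTSU flag. The pure-Python sketch below only illustrates what that computation does on a 256-bin grayscale histogram. The function names are illustrative, and it is not a replacement for the OpenCV implementation.

```python
def histogram(pixels):
    """Count how many pixels have each gray level from 0 to 255."""
    hist = [0] * 256
    for value in pixels:
        hist[value] += 1
    return hist


def otsu_threshold(hist):
    """Return the threshold t that maximizes the between-class variance.

    Pixels with value <= t form one class, pixels with value > t the other.
    """
    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))
    best_t, best_var = 0, -1.0
    w0 = 0      # number of pixels in the dark class
    sum0 = 0    # sum of gray levels in the dark class
    for t, count in enumerate(hist):
        w0 += count
        sum0 += t * count
        w1 = total - w0
        if w0 == 0:
            continue    # dark class is still empty
        if w1 == 0:
            break       # bright class is empty; no later t can split the pixels
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        between_var = w0 * w1 * (mean0 - mean1) ** 2
        if between_var > best_var:
            best_t, best_var = t, between_var
    return best_t


def binarize(pixels, threshold):
    """Map each pixel to 0 (at or below the threshold) or 255 (above it)."""
    return [255 if value > threshold else 0 for value in pixels]


# Toy example: dark text pixels near 40, light background near 210
pixels = [40] * 30 + [45] * 20 + [205] * 60 + [210] * 90
t = otsu_threshold(histogram(pixels))
binary = binarize(pixels, t)
```

With OpenCV, the equivalent is to pass cv2.THRESH_BINARY + cv2.THRESH_OTSU as the threshold type to cv2.threshold; the threshold argument is then ignored and the chosen value is returned as the first element of the result.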

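2.3 Key Parameters

config: sets the recognition mode (for example, --psm 6 assumes a single uniform block of text, and --oem 3 selects the default OCR engine mode).
lang: selects the language packs (for example, eng+chi_sim for mixed English and Chinese text).

Example:

custom_config = r'--oem 3 --psm 6'
text = pytesseract.image_to_string(image, config=custom_config)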
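
Because the lang and config values are plain strings, they are easy to mistype. The following sketch builds and validates them in one place. build_ocr_args is a hypothetical helper, not part of pytesseract, and the accepted ranges (page segmentation modes 0 to 13, engine modes 0 to 3) reflect the options listed for Tesseract 4 and later.

```python
def build_ocr_args(languages, psm=6, oem=3):
    """Return keyword arguments for pytesseract.image_to_string.

    languages: list of Tesseract language codes, e.g. ["eng", "chi_sim"]
    psm: page segmentation mode (6 assumes a single uniform block of text)
    oem: OCR engine mode (3 lets Tesseract pick the default engine)
    """
    if not languages:
        raise ValueError("at least one language code is required")
    if not 0 <= psm <= 13:
        raise ValueError(f"psm must be between 0 and 13, got {psm}")
    if not 0 <= oem <= 3:
        raise ValueError(f"oem must be between 0 and 3, got {oem}")
    return {
        "lang": "+".join(languages),
        "config": f"--oem {oem} --psm {psm}",
    }


args = build_ocr_args(["eng", "chi_sim"], psm=6)
print(args)
```

The result can be unpacked into the call, for example pytesseract.image_to_string(image, **args).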

3. Recognizing Text in Scanned PDFs

3.1 Extracting Images from a PDF

A scanned PDF is essentially a collection of page images, so extract the images first and then run OCR on them:

import fitz  # PyMuPDF

def extract_images_from_pdf(pdf_path):
    # Return (image bytes, file extension) for every embedded image
    images = []
    with fitz.open(pdf_path) as doc:
        for page in doc:
            for img_info in page.get_images(full=True):
                xref = img_info[0]  # the first item is the image's xref number
                base_image = doc.extract_image(xref)
                images.append((base_image["image"], base_image["ext"]))
    return images

# Extract and save every image, keeping each image's native format
pdf_path = "scanned.pdf"
for i, (image_bytes, ext) in enumerate(extract_images_from_pdf(pdf_path)):
    with open(f"image_{i}.{ext}", "wb") as f:
        f.write(image_bytes)

3.2 Batch OCR for PDFs

Combine PyMuPDF and pytesseract to process a whole document:

import io
import fitz  # PyMuPDF
import pytesseract
from PIL import Image

def pdf_to_text(pdf_path, output_txt):
    full_text = []
    with fitz.open(pdf_path) as doc:
        for page in doc:
            for img_info in page.get_images(full=True):
                xref = img_info[0]
                base_image = doc.extract_image(xref)
                image = Image.open(io.BytesIO(base_image["image"]))
                text = pytesseract.image_to_string(image, lang="chi_sim+eng")
                full_text.append(text)
    with open(output_txt, "w", encoding="utf-8") as f:
        f.write("\n".join(full_text))

pdf_to_text("scanned.pdf", "output.txt")
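
Embedded images do not always correspond one-to-one with pages: a scanner may store a page as several strips, or a page may carry no extractable image at all. An alternative, sketched below, is to render each page to a bitmap at a chosen resolution and run OCR on the rendering. This is a minimal sketch rather than the approach used above. It assumes PyMuPDF's Page.get_pixmap and Pixmap.tobytes methods as found in recent releases, and the function names and the 300 DPI default are illustrative choices.

```python
import io


def dpi_to_zoom(dpi):
    """PDF user space is 72 units per inch, so 300 DPI means a zoom of 300/72."""
    return dpi / 72.0


def ocr_pdf_pages(pdf_path, lang="chi_sim+eng", dpi=300):
    """Render every page at the given DPI and return one OCR string per page."""
    # Third-party imports are kept local so the helper above works without them.
    import fitz  # PyMuPDF
    import pytesseract
    from PIL import Image

    zoom = dpi_to_zoom(dpi)
    texts = []
    with fitz.open(pdf_path) as doc:
        for page in doc:
            # Render the whole page, whatever images it is composed of
            pix = page.get_pixmap(matrix=fitz.Matrix(zoom, zoom))
            image = Image.open(io.BytesIO(pix.tobytes("png")))
            texts.append(pytesseract.image_to_string(image, lang=lang))
    return texts
```

Calling ocr_pdf_pages("scanned.pdf") would return a list with one string per page, which can then be joined and written to a file in the same way as in pdf_to_text.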

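3.3 Performance Tips

Tiling: split large images into regions and recognize each region separately.
Multithreading: use concurrent.futures to process the pages of a multi-page PDF in parallel.
Caching intermediate results: avoid processing the same image twice.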
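
The multithreading and caching ideas above can be combined in a short sketch. It hashes each image's bytes, runs OCR once per distinct image on a thread pool, and returns the results in the original order. The OCR step is passed in as a function (ocr_fn, a hypothetical parameter), so the same code works with pytesseract or any other engine. Threads are a reasonable fit on the assumption that the OCR call spends its time in an external process or native code rather than in Python.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


def ocr_images_parallel(images, ocr_fn, max_workers=4, cache=None):
    """OCR a list of image byte strings in parallel, skipping duplicates.

    images: list of bytes objects (e.g. encoded PNG or JPEG data)
    ocr_fn: callable taking image bytes and returning recognized text
    cache: optional dict mapping SHA-256 hex digests to text, reused across calls
    """
    if cache is None:
        cache = {}
    digests = [hashlib.sha256(data).hexdigest() for data in images]

    # Collect one representative per digest that is not cached yet
    pending = {}
    for digest, data in zip(digests, images):
        if digest not in cache and digest not in pending:
            pending[digest] = data

    # Run OCR once per distinct uncached image
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {d: pool.submit(ocr_fn, data) for d, data in pending.items()}
        for digest, future in futures.items():
            cache[digest] = future.result()

    # Map the results back to the original order
    return [cache[digest] for digest in digests]


if __name__ == "__main__":
    # Stand-in OCR function for demonstration; a real one might wrap pytesseract
    def fake_ocr(data):
        return data.decode("ascii").upper()

    print(ocr_images_parallel([b"page one", b"page two", b"page one"], fake_ocr))
```

Passing the same cache dict to later calls lets repeated images (for example, an identical cover page in several files) be recognized only once.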

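4. Common Problems and Solutions

4.1 Low Recognition Accuracy

Causes: blurry images, complex fonts, missing language packs.
Solutions:
- Preprocess the image (binarization, denoising).
- Train a custom Tesseract model (requires annotated data).
- Fall back to another OCR engine, such as the open-source EasyOCR or PaddleOCR, or to a commercial OCR API.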

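4.2 Slow Processing

Optimizations:
- Lower the image resolution (for example, from 300 DPI to 150 DPI), checking that accuracy remains acceptable, since small text degrades at lower resolutions.
- Limit the recognition languages to those actually needed, so that no extra language packs are loaded.
- Use GPU acceleration (for example, PaddleOCR supports CUDA).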

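4.3 Messy Formatting

Layout cleanup:
- Use regular expressions to remove extra spaces and line breaks.
- Use NLTK for text post-processing such as sentence splitting and error correction.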
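
As an illustration of the regular-expression cleanup, the sketch below normalizes whitespace in OCR output. It makes a few assumptions that may not hold for every document: that spaces between two CJK characters are recognition artifacts, that a single line break inside a paragraph is a soft wrap, and that a blank line marks a paragraph boundary. The function name and the CJK character range (the basic CJK Unified Ideographs block only) are illustrative.

```python
import re

# Basic CJK Unified Ideographs block; extend the range if other scripts are needed
CJK = r"\u4e00-\u9fff"


def clean_ocr_text(text):
    """Normalize whitespace in OCR output while keeping paragraph breaks."""
    # Normalize Windows line endings and strip trailing spaces on each line
    text = text.replace("\r\n", "\n")
    text = re.sub(r"[ \t]+\n", "\n", text)
    # Collapse three or more line breaks into one paragraph break
    text = re.sub(r"\n{3,}", "\n\n", text)
    # Drop spaces and single line breaks that sit between two CJK characters
    text = re.sub(rf"(?<=[{CJK}])[ \t]+(?=[{CJK}])", "", text)
    text = re.sub(rf"(?<=[{CJK}])\n(?=[{CJK}])", "", text)
    # Turn remaining single line breaks into spaces (soft wraps in Latin text)
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    # Collapse runs of spaces and tabs
    text = re.sub(r"[ \t]{2,}", " ", text)
    return text.strip()


sample = "文 字 识\n别 结果\n\n\n\nHello   world,\nthis is OCR.  \n"
print(clean_ocr_text(sample))
```

For this sample the function returns two paragraphs, 文字识别结果 and "Hello world, this is OCR.", separated by one blank line. Documents that rely on line breaks for meaning, such as tables or addresses, would need gentler rules.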

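5. Advanced Use Cases

5.1 Table Recognition

Use camelot or pdfplumber to extract table structure. Both read the PDF's text layer, so a scanned PDF must first be run through OCR to produce a searchable PDF:

import pdfplumber
with pdfplumber.open("table.pdf") as pdf:
    first_page = pdf.pages[0]
    tables = first_page.extract_tables()
    for table in tables:
        print(table)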

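5.2 Real-Time Camera Recognition

Use OpenCV to capture camera frames and recognize them in real time:

import cv2
import pytesseract

cap = cv2.VideoCapture(0)
while True:
    ret, frame = cap.read()
    if not ret:
        break
    # Preprocess
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    _, binary = cv2.threshold(gray, 150, 255, cv2.THRESH_BINARY)
    # Recognize
    text = pytesseract.image_to_string(binary)
    print(text)
    # Show the frame; cv2.waitKey only receives key presses while a window is open
    cv2.imshow("OCR", binary)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break
cap.release()
cv2.destroyAllWindows()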

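6. Summary and Recommendations

OCR in Python is mature: with well-chosen libraries and a tuned pipeline, text can be extracted from images and scanned PDFs efficiently. Keep the following in mind:

Preprocess first: many recognition problems can be resolved by image preprocessing.
Manage language packs: load only the languages you need to reduce resource usage.
Combine approaches: for complex scenarios, pair OCR with a rule engine or a deep learning model.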

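As tools such as Tesseract 5.0+ and PaddleOCR continue to evolve, Python's advantages for OCR should become more pronounced. Follow updates from the open-source community, and build up annotated data to meet custom requirements.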