Say Goodbye to Manual PDF Sorting! A Complete Guide to AI-Automated Batch Extraction of PDF Data into Excel

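1. Technical Architecture Design

This solution uses a hybrid "AI + Python" architecture whose core workflow has three stages:

Document parsing layer: uses a PDF parsing library (pdfplumber) that copes with Chinese-language documents
Intelligent extraction layer: uses regular expressions to locate the key fields
Data persistence layer: writes the structured data to Excel and saves it to a specified path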
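
The three stages can be read as three pluggable functions. The following is a minimal sketch of that shape rather than the article's implementation: `run_pipeline` and its `parse`, `extract`, and `persist` parameters are hypothetical names introduced here, and the stand-in lambdas replace real PDF parsing and Excel output so that the example runs without third-party libraries.

```python
from typing import Callable, Dict, Iterable, List


def run_pipeline(
    paths: Iterable[str],
    parse: Callable[[str], str],
    extract: Callable[[str], str],
    persist: Callable[[List[Dict[str, str]]], None],
) -> List[Dict[str, str]]:
    """Chain the three stages: parse each file, extract a field, persist all rows."""
    rows = []
    for path in paths:
        text = parse(path)      # document parsing layer
        value = extract(text)   # extraction layer
        rows.append({"file": path, "value": value})
    persist(rows)               # persistence layer
    return rows


# Stand-ins so the sketch runs without pdfplumber or Excel (assumptions).
fake_docs = {"a.pdf": "合同号：1001", "b.pdf": "合同号：1002"}
saved: List[Dict[str, str]] = []

rows = run_pipeline(
    fake_docs,
    parse=lambda path: fake_docs[path],
    extract=lambda text: text.split("：")[-1],
    persist=saved.extend,
)
```

Keeping the stages separate in this way means each one can be swapped (for example, a different parser or a CSV writer) without touching the other two.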

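1.1 Environment Setup

The development environment needs the following:

Python 3.8+ (a virtual environment is recommended)
A mainstream IDE (such as PyCharm or VSCode)
Base dependencies: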
```
pip install pdfplumber pandas openpyxl
```
Tip: pin dependency versions in a requirements.txt file to avoid environment conflicts.
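
Before running the extraction, it can help to confirm that the dependencies are actually importable in the active environment. The snippet below is a small sketch of such a check; `missing_packages` is a hypothetical helper name, and it only tests whether a module can be found, not which version is installed.

```python
import importlib.util
from typing import Iterable, List


def missing_packages(names: Iterable[str]) -> List[str]:
    """Return the import names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(["pdfplumber", "pandas", "openpyxl"])
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All dependencies are importable.")
```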

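2. Core Extraction Implementation

2.1 Traversing the Document Directory

import os

def get_pdf_files(folder_path):
    """Recursively collect every PDF file under the given directory."""
    pdf_files = []
    for root, _, files in os.walk(folder_path):
        for file in files:
            # Compare in lower case so that .PDF and .Pdf are also picked up
            if file.lower().endswith('.pdf'):
                pdf_files.append(os.path.join(root, file))
    return pdf_files

The function walks the tree with os.walk(), so PDF files in nested subdirectories are included.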
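
The same traversal can also be written with `pathlib`. This is an alternative sketch, not a replacement for the function above; `find_pdfs` is a hypothetical name, and the temporary directory exists only to make the example self-checking.

```python
import tempfile
from pathlib import Path
from typing import List


def find_pdfs(folder: str) -> List[Path]:
    """Recursively collect PDF files, matching the extension case-insensitively."""
    return sorted(
        p for p in Path(folder).rglob("*")
        if p.is_file() and p.suffix.lower() == ".pdf"
    )


# Build a throwaway directory tree to try the function on.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "sub").mkdir()
    (Path(tmp) / "a.PDF").touch()
    (Path(tmp) / "sub" / "b.pdf").touch()
    (Path(tmp) / "notes.txt").touch()
    found = [p.name for p in find_pdfs(tmp)]
```

Sorting the paths gives a stable processing order, which makes repeated runs easier to compare.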

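2.2 Key Field Extraction

Chinese contract documents vary in format, so extraction tries three levels of pattern in order:

Exact match: look for the "合同号：" (contract number) prefix
Variant match: handle variants such as "编号：" (number) and "No."
Fallback match: when no prefix is found, take a standalone run of 10 to 20 digits at the end of a line

import re

def extract_contract_no(pdf_text):
    """Try several patterns in order and return the first contract number found."""
    patterns = [
        r'合同号[:：]\s*(\d+)',          # standard format
        r'编号[:：]\s*(\d+)',            # common variant
        r'No\.\s*(\d+)',                 # "No." variant
        r'(\d[\d ]{8,18}\d)(?=\n|$)'     # bare digit run at end of line (fallback)
    ]
    for pattern in patterns:
        match = re.search(pattern, pdf_text)
        if match:
            # Every pattern has exactly one capture group, so group(1) is safe
            return match.group(1).replace(' ', '')
    return "未找到合同号"  # sentinel value meaning "contract number not found"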
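
When several patterns are tried in order, it is often useful to know which one produced a value, because a fallback hit deserves more scrutiny than an exact one. The sketch below illustrates that idea under stated assumptions: `LABELED_PATTERNS` and `extract_with_source` are hypothetical names, the patterns mirror the three levels described above, and `None` is returned instead of a sentinel string.

```python
import re
from typing import Optional, Tuple

# (label, pattern) pairs tried in order; the labels are illustrative names.
LABELED_PATTERNS = [
    ("exact", re.compile(r"合同号[:：]\s*(\d+)")),
    ("variant", re.compile(r"(?:编号[:：]|No\.)\s*(\d+)")),
    ("fallback", re.compile(r"(\d[\d ]{8,18}\d)(?=\n|$)")),
]


def extract_with_source(text: str) -> Tuple[Optional[str], Optional[str]]:
    """Return (contract_no, label of the pattern that matched), or (None, None)."""
    for label, pattern in LABELED_PATTERNS:
        match = pattern.search(text)
        if match:
            return match.group(1).replace(" ", ""), label
    return None, None
```

Rows labelled "fallback" could then be written to a separate column and reviewed by hand.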

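2.3 Structuring the Data

A Pandas DataFrame serves as the standard data model:

import pandas as pd

def create_dataframe(results):
    """Convert the extraction results into structured data."""
    df = pd.DataFrame(results, columns=['文件名', '合同号'])  # file name, contract number
    # Data cleaning: trim stray whitespace around the contract number
    df['合同号'] = df['合同号'].str.strip()
    # Remove rows that are exact duplicates
    return df.drop_duplicates()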

3. Complete Implementation

3.1 Main Processing Flow

import os
import pdfplumber

def main_process(input_folder, output_path):
    # 1. Collect all PDF files
    pdf_files = get_pdf_files(input_folder)
    # 2. Initialise the result container
    results = []
    # 3. Process the files one at a time
    for pdf_path in pdf_files:
        try:
            # Parse the PDF with pdfplumber; the context manager closes the file
            with pdfplumber.open(pdf_path) as pdf:
                # extract_text() can be None or empty for a page without text
                text = '\n'.join(p.extract_text() or '' for p in pdf.pages)
            # Extract the contract number
            contract_no = extract_contract_no(text)
            # Record the result
            results.append({
                '文件名': os.path.basename(pdf_path),
                '合同号': contract_no
            })
        except Exception as e:
            print(f"Error while processing {pdf_path}: {e}")
    # 4. Write the Excel file
    if results:
        df = create_dataframe(results)
        df.to_excel(output_path, index=False, engine='openpyxl')
        print(f"Done! Results saved to: {output_path}")
    else:
        print("No PDF files were found, or nothing could be extracted")

3.2 Example Run Script

if __name__ == "__main__":
    # Configuration
    INPUT_FOLDER = r'G:\pdf\合同'  # directory that holds the PDFs
    OUTPUT_FILE = r'G:\pdf\合同\合同号.xlsx'  # output path
    # Run the extraction
    main_process(INPUT_FOLDER, OUTPUT_FILE)
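
Hard-coded paths are fine for a one-off run, but a command-line interface makes the script reusable. The following is a sketch using the standard `argparse` module; `parse_args`, the `-o/--output` option, and the default file name are assumptions introduced here and are not part of the original script.

```python
import argparse
from typing import List, Optional


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse the input folder and output file from the command line."""
    parser = argparse.ArgumentParser(
        description="Extract contract numbers from PDFs into an Excel file."
    )
    parser.add_argument("input_folder", help="folder that contains the PDF files")
    parser.add_argument(
        "-o", "--output",
        default="contract_numbers.xlsx",  # hypothetical default file name
        help="path of the Excel file to write",
    )
    return parser.parse_args(argv)


args = parse_args(["contracts", "-o", "out.xlsx"])
# main_process(args.input_folder, args.output) would then run the extraction.
```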

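4. Performance Optimisation

4.1 Speeding Up with Multiple Threads

For very large document sets (1,000+ files), a thread pool is suggested. Because PDF parsing is largely CPU-bound Python code, threads may bring only a modest speed-up; ProcessPoolExecutor offers the same interface if parsing turns out to be the bottleneck.

import os
from concurrent.futures import ThreadPoolExecutor
import pdfplumber

def process_single_file(pdf_path):
    """Return one result row for one PDF; uses extract_contract_no() from 2.2."""
    with pdfplumber.open(pdf_path) as pdf:
        text = '\n'.join(p.extract_text() or '' for p in pdf.pages)
    return {'文件名': os.path.basename(pdf_path), '合同号': extract_contract_no(text)}

def parallel_process(pdf_files, max_workers=4):
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = [executor.submit(process_single_file, p) for p in pdf_files]
        for pdf_path, future in zip(pdf_files, futures):
            try:
                results.append(future.result())  # one dict per file
            except Exception as e:
                print(f"Error while processing {pdf_path}: {e}")
    return results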
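
The pool logic can be tried out without any PDFs by injecting the worker function. The sketch below does that: `run_in_pool` and `fake_worker` are hypothetical names, the worker is a stand-in for real PDF processing, and paths are assumed to be unique. It collects failures separately instead of printing them, so that one unreadable file does not hide among the results.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, List, Tuple


def run_in_pool(
    paths: List[str],
    worker: Callable[[str], Dict[str, str]],
    max_workers: int = 4,
) -> Tuple[List[Dict[str, str]], List[Tuple[str, str]]]:
    """Run worker on every path; return (rows in input order, failures)."""
    rows_by_path: Dict[str, Dict[str, str]] = {}
    failures: List[Tuple[str, str]] = []
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        future_to_path = {executor.submit(worker, p): p for p in paths}
        for future in as_completed(future_to_path):
            path = future_to_path[future]
            try:
                rows_by_path[path] = future.result()
            except Exception as exc:  # keep going when one file fails
                failures.append((path, str(exc)))
    # Restore the input order, which as_completed() does not preserve.
    rows = [rows_by_path[p] for p in paths if p in rows_by_path]
    return rows, failures


def fake_worker(path: str) -> Dict[str, str]:
    """Stand-in for real PDF processing (assumption for the demo)."""
    if path.endswith("broken.pdf"):
        raise ValueError("cannot parse")
    return {"file": path, "contract_no": str(len(path))}


rows, failures = run_in_pool(["a.pdf", "broken.pdf", "ccc.pdf"], fake_worker)
```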

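4.2 Stronger Error Handling

Add a retry mechanism and detailed logging. The retry decorator comes from the third-party tenacity package (pip install tenacity):

import logging
import pdfplumber
from tenacity import retry, stop_after_attempt, wait_exponential

logging.basicConfig(
    filename='pdf_processor.log',
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s'
)

# Up to 3 attempts, with an exponentially growing wait between them
@retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1))
def safe_extract_text(pdf_path):
    try:
        # The context manager closes the PDF even if extraction fails
        with pdfplumber.open(pdf_path) as pdf:
            text = '\n'.join(p.extract_text() or '' for p in pdf.pages)
        return text
    except Exception as e:
        logging.error(f"Failed to extract text from {pdf_path}: {e}")
        raise  # re-raise so that tenacity can retry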
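
To make the retry behaviour concrete, here is a standard-library sketch of retry with exponential backoff. It is an illustration of the technique only and does not reproduce tenacity's implementation; `retry_with_backoff` and `flaky_read` are hypothetical names, and the delay is set to zero in the demo so that it finishes immediately.

```python
import functools
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(attempts: int = 3, base_delay: float = 1.0):
    """Retry the wrapped function, doubling the delay after each failure."""
    def decorator(func: Callable[..., T]) -> Callable[..., T]:
        @functools.wraps(func)
        def wrapper(*args, **kwargs) -> T:
            for attempt in range(1, attempts + 1):
                try:
                    return func(*args, **kwargs)
                except Exception:
                    if attempt == attempts:
                        raise  # out of attempts: surface the last error
                    time.sleep(base_delay * 2 ** (attempt - 1))
            raise RuntimeError("attempts must be at least 1")
        return wrapper
    return decorator


calls = {"count": 0}


@retry_with_backoff(attempts=3, base_delay=0)
def flaky_read() -> str:
    """Fails twice, then succeeds (stand-in for a transient I/O error)."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise OSError("temporarily unavailable")
    return "text"


result = flaky_read()
```

Retries help with transient problems such as a busy network share; a PDF that is genuinely corrupt will fail on every attempt, so the final error still needs to be handled by the caller.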

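5. Troubleshooting Common Problems

5.1 Garbled Chinese Text

Use pdfplumber rather than PyPDF2, since the latter tends to handle Chinese text less well
Check that Chinese fonts (such as simsun.ttc) are installed on the system; this mainly affects how the PDF is displayed, so if the extracted text is still garbled, the file may lack a usable Unicode mapping or be a scanned image that needs OCR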
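
Garbled output can be flagged automatically so that affected files are set aside for manual review. The heuristic below is a rough sketch: `looks_garbled` and its `threshold` are hypothetical, and it assumes that unmapped glyphs appear in the extracted text either as `(cid:NNN)` markers or as the Unicode replacement character U+FFFD.

```python
import re

CID_MARKER = re.compile(r"\(cid:\d+\)")


def looks_garbled(text: str, threshold: float = 0.1) -> bool:
    """Heuristic: flag text dominated by unmapped-glyph markers or U+FFFD."""
    if not text.strip():
        return True  # nothing extracted, e.g. a scanned page
    bad = len(CID_MARKER.findall(text)) + text.count("\ufffd")
    # Remove the markers before measuring length so each counts once.
    cleaned = CID_MARKER.sub("", text)
    units = bad + len(cleaned.replace("\ufffd", "").strip())
    return bad / units > threshold
```

The threshold is a starting point only and would need tuning against real documents.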

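5.2 Contract Number Not Recognised

Extend the regular expression patterns to cover more variants

While debugging, print a slice of the raw text to help with analysis:

import re

def debug_extract(pdf_text, patterns):
    """Print the start of the text and what each pattern matches."""
    print("=== Text snippet ===")
    print(pdf_text[:500])  # first 500 characters
    # Try each pattern in turn
    for pattern in patterns:
        print(f"Pattern {pattern}: {re.search(pattern, pdf_text)}")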
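
When a pattern fails, the cause is often visible in the characters around the expected keyword, such as a stray space or a hyphen inside the number. The sketch below prints that surrounding context; `keyword_context` is a hypothetical helper, and the sample text is invented for illustration.

```python
import re
from typing import List


def keyword_context(text: str, keywords: List[str], width: int = 20) -> List[str]:
    """Return the text surrounding each keyword hit, with newlines made visible."""
    snippets = []
    for keyword in keywords:
        for match in re.finditer(re.escape(keyword), text):
            start = max(match.start() - width, 0)
            end = min(match.end() + width, len(text))
            snippets.append(text[start:end].replace("\n", "\\n"))
    return snippets


# Invented sample: note the space inside "合同 号" and the hyphen in the number.
sample = "甲方：某公司\n合同 号：2024-0001\n"
hits = keyword_context(sample, ["合同", "编号"], width=5)
```

In this sample the snippet shows why `合同号[:：]\s*(\d+)` would not match: the extracted text contains a space between the characters of the prefix.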

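5.3 Handling Large Files

For PDFs of more than 50 pages, process the pages in batches and stop at the first match:

import pdfplumber

def extract_large_pdf(pdf_path):
    """Scan the pages in batches of 10 and stop at the first contract number."""
    with pdfplumber.open(pdf_path) as pdf:  # closed even on an early return
        full_text = ""
        for i, page in enumerate(pdf.pages):
            if i > 0 and i % 10 == 0:  # check after every 10 pages
                contract_no = extract_contract_no(full_text)
                if contract_no != "未找到合同号":
                    return contract_no
                full_text = ""  # discard the batch that was already searched
            full_text += (page.extract_text() or "") + "\n"
        return extract_contract_no(full_text)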

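6. Further Use Cases

The approach extends readily to the following scenarios:

Invoice extraction: change the regular expressions to match fields such as tax ID and amount
Report data mining: extract key metrics from specific sections
Document archiving: generate a metadata index for documents automatically

Adjusting the regular expression patterns in extract_contract_no() is enough to adapt the tool to documents in other formats. Wrapping the core logic in a class is recommended to improve reuse: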

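import re

class PDFDataExtractor:
    """Extract a single field using an ordered list of regex patterns."""

    def __init__(self, patterns):
        # Each pattern must contain one capture group for the value
        self.patterns = patterns

    def extract(self, text):
        for pattern in self.patterns:
            match = re.search(pattern, text)
            if match:
                return match.group(1)
        return None  # no pattern matched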
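
The same idea generalises to several fields per document, which is what the invoice scenario needs. The class below is a sketch along those lines: `FieldExtractor` is a hypothetical name, and the tax ID and amount patterns are illustrative assumptions that real invoices would almost certainly require adjusting.

```python
import re
from typing import Dict, List, Optional


class FieldExtractor:
    """Apply an ordered list of patterns per field; the first match wins."""

    def __init__(self, field_patterns: Dict[str, List[str]]):
        self.field_patterns = {
            name: [re.compile(p) for p in patterns]
            for name, patterns in field_patterns.items()
        }

    def extract(self, text: str) -> Dict[str, Optional[str]]:
        result: Dict[str, Optional[str]] = {}
        for name, patterns in self.field_patterns.items():
            result[name] = None  # stays None when no pattern matches
            for pattern in patterns:
                match = pattern.search(text)
                if match:
                    result[name] = match.group(1)
                    break
        return result


# Hypothetical invoice patterns; real invoices will need their own.
invoice = FieldExtractor({
    "tax_id": [r"税号[:：]\s*([0-9A-Z]{15,20})"],
    "amount": [r"金额[:：]\s*[¥￥]?\s*(\d+(?:\.\d{1,2})?)"],
})
fields = invoice.extract("税号：91310000MA1FL0XX1A\n金额：¥1280.50")
```

Each returned dictionary maps directly onto one row of the output spreadsheet, with one column per field.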

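The modular design keeps the solution extensible, and each component can be adjusted to suit the task at hand. For enterprise use, consider adding a database persistence layer and a web interface to build a complete document-processing workflow.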