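1. Key Decisions Before Choosing a Stack

Before starting work on an HTTPS crawler, answer three questions to establish a technical baseline:

Target type: determine whether the target is a dynamically rendered web page (JavaScript must be executed) or an API endpoint that returns data directly. Dynamic pages usually call for a browser automation tool such as Selenium or Playwright, while API endpoints are better served by a plain HTTP client library.
Protocol requirements: assess whether the target needs HTTP/2 or a protocol built on it, such as gRPC. For example, a financial API may require HTTP/2, and IoT devices may use mutual TLS (mTLS) authentication.
Security mechanisms: check whether the target uses self-signed certificates, client certificate verification (mTLS), dynamic tokens, or other anti-crawling measures. One e-commerce platform, for instance, has used TLS fingerprinting to block non-browser requests; cases like this need special handling.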
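
The protocol question can often be answered empirically before any client library is chosen. The following is a minimal sketch, using only the Python standard library, that offers HTTP/2 and HTTP/1.1 through ALPN during the TLS handshake and reports which one the server selects. The names build_probe_context and negotiated_protocol are illustrative and not part of any library, and the probe says nothing about gRPC or mTLS requirements.

```python
import socket
import ssl


def build_probe_context():
    """Create a verifying TLS context that offers HTTP/2 and HTTP/1.1 via ALPN."""
    context = ssl.create_default_context()
    context.set_alpn_protocols(["h2", "http/1.1"])
    return context


def negotiated_protocol(host, port=443, timeout=5.0):
    """Return the ALPN protocol the server picked, or None if it picked none."""
    context = build_probe_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        # server_hostname sets SNI and is also used for hostname verification
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.selected_alpn_protocol()
```

A return value of "h2" indicates that the server accepts HTTP/2; "http/1.1" or None suggests that a plain HTTP/1.1 client is sufficient for that host.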

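2. Synchronous Crawling with requests

requests is the most mature HTTP client in the Python ecosystem and remains the simplest choice for straightforward synchronous work. A typical implementation should pay attention to the following details:

Certificate Verification Best Practices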

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# Configure the retry policy; timeouts are passed per request
session = requests.Session()
retries = Retry(total=3, backoff_factor=1, status_forcelist=[502, 503, 504])
session.mount('https://', HTTPAdapter(max_retries=retries))

# Certificate handling in production
try:
    # Prefer the default trust chain (requests uses the certifi bundle unless verify is set)
    response = session.get('https://api.example.com', timeout=10)
    # Self-signed certificate scenario (e.g. a test environment)
    # Option 1: point verify at a CA bundle that includes the signing CA (recommended)
    response = session.get(
        'https://test.local',
        verify='/etc/ssl/certs/ca-bundle.crt',
        timeout=10
    )
    # Option 2: temporarily disable verification (debugging only)
    # response = session.get('https://test.local', verify=False, timeout=10)
except requests.exceptions.SSLError as e:
    print(f"SSL verification failed: {e}")

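Performance Tips

Connection reuse: keep connections alive through a Session object to cut TLS handshake overhead.
Connection pool sizing: HTTPAdapter's pool_connections sets how many per-host connection pools are cached, and pool_maxsize sets how many connections each pool keeps. Neither is a DNS cache; requests relies on the operating system resolver for name lookups.
Timeouts: requests has no separate connect_timeout or read_timeout arguments. Pass timeout=(connect, read) as a tuple so that a stalled connection or a slow response cannot block a thread indefinitely.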
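
The pool and timeout settings above can be combined in one place. This is a minimal sketch, assuming requests and urllib3 are installed; build_session, fetch, and DEFAULT_TIMEOUT are hypothetical names, and the numeric values are placeholders to tune for the actual target.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# (connect timeout, read timeout) in seconds; illustrative values
DEFAULT_TIMEOUT = (3.05, 10)


def build_session(pool_connections=10, pool_maxsize=20):
    """Return a Session with retries and an explicitly sized connection pool."""
    session = requests.Session()
    retries = Retry(total=3, backoff_factor=1, status_forcelist=[502, 503, 504])
    adapter = HTTPAdapter(
        pool_connections=pool_connections,  # number of per-host pools to cache
        pool_maxsize=pool_maxsize,          # connections kept in each pool
        max_retries=retries,
    )
    session.mount("https://", adapter)
    return session


def fetch(session, url):
    """GET a URL with separate connect and read timeouts."""
    response = session.get(url, timeout=DEFAULT_TIMEOUT)
    response.raise_for_status()  # turn 4xx/5xx responses into exceptions
    return response
```

Reusing the session returned by build_session across requests is what makes the pool effective; creating a new Session per request discards the pooled connections.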

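3. High-Concurrency Crawling with Async Clients

When a crawler has to sustain concurrency on the order of thousands of requests, an asynchronous programming model can significantly reduce resource consumption. The two mainstream options today are httpx and aiohttp.

HTTP/2 with httpx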

import asyncio
import httpx

async def fetch_one(client, url):
    resp = await client.get(url)
    resp.raise_for_status()  # HTTPStatusError is raised only by this call
    return resp

async def fetch_with_http2(urls):
    # http2=True requires the optional extra: pip install "httpx[http2]"
    async with httpx.AsyncClient(
        http2=True,
        timeout=20.0,
        limits=httpx.Limits(max_connections=100)
    ) as client:
        tasks = [fetch_one(client, url) for url in urls]
        results = await asyncio.gather(*tasks, return_exceptions=True)
        for result in results:
            if isinstance(result, httpx.HTTPStatusError):
                print(f"Request failed: {result.response.status_code}")
            elif isinstance(result, Exception):
                print(f"Exception raised: {result!r}")
            else:
                print(f"Fetched: {result.status_code} ({result.http_version})")

# Example call
urls = ["https://api.example.com/data/1", "https://api.example.com/data/2"]
asyncio.run(fetch_with_http2(urls))

Fine-Grained Control with aiohttp

When TLS parameters need fine-grained control, aiohttp exposes a lower-level interface:

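import asyncio
import ssl
import aiohttp

def build_mtls_context(cert_path, key_path):
    # Verify the server as usual and present a client certificate (mTLS)
    ssl_context = ssl.create_default_context(ssl.Purpose.SERVER_AUTH)
    ssl_context.load_cert_chain(cert_path, key_path)
    return ssl_context

async def fetch_with_mtls(session, semaphore, url, ssl_context):
    # session and semaphore are shared by all calls; a semaphore created inside each call limits nothing
    async with semaphore:
        async with session.get(
            url,
            ssl=ssl_context,
            timeout=aiohttp.ClientTimeout(total=30)
        ) as response:
            response.raise_for_status()
            return await response.json()

async def main(urls, cert_path, key_path):
    ssl_context = build_mtls_context(cert_path, key_path)
    semaphore = asyncio.Semaphore(50)  # concurrency limit
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(
            *(fetch_with_mtls(session, semaphore, url, ssl_context) for url in urls)
        )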
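
A concurrency limit only works when every task acquires the same semaphore. The sketch below checks that behavior with the standard library alone: fake_fetch is a stand-in for a real HTTP call, and run_bounded and demo are hypothetical helper names rather than aiohttp APIs.

```python
import asyncio


async def run_bounded(worker, items, limit):
    """Run worker(item) for every item with at most `limit` running at once."""
    semaphore = asyncio.Semaphore(limit)  # created once, shared by all tasks

    async def guarded(item):
        async with semaphore:
            return await worker(item)

    return await asyncio.gather(
        *(guarded(item) for item in items), return_exceptions=True
    )


async def demo(n_items=20, limit=5):
    """Return the results and the highest number of overlapping calls seen."""
    active = 0
    peak = 0

    async def fake_fetch(item):
        # Stand-in for an HTTP request; tracks how many calls overlap
        nonlocal active, peak
        active += 1
        peak = max(peak, active)
        await asyncio.sleep(0.01)
        active -= 1
        return item * 2

    results = await run_bounded(fake_fetch, range(n_items), limit)
    return results, peak
```

Running asyncio.run(demo()) returns the results in input order together with the observed peak, which should never exceed the limit. Replacing fake_fetch with a coroutine that performs a real request keeps the same bound.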

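4. Handling HTTPS Errors

Certificate Problems

CERTIFICATE_VERIFY_FAILED:
- Check that the system clock is synchronized (NTP service)
- Upgrade the certifi package: pip install --upgrade certifi
- Specify the CA certificate path explicitly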
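
When diagnosing a verification failure, it helps to see which trust store Python is actually using and to build a context from an explicit CA bundle. This is a minimal standard-library sketch; describe_trust_store and build_verify_context are hypothetical names, and the cafile argument is whatever PEM bundle applies in your environment.

```python
import ssl


def describe_trust_store():
    """Report where this Python build looks for CA certificates by default."""
    paths = ssl.get_default_verify_paths()
    return {
        "cafile": paths.cafile,
        "capath": paths.capath,
        "openssl_cafile": paths.openssl_cafile,
        "openssl_capath": paths.openssl_capath,
    }


def build_verify_context(cafile=None):
    """Return a verifying TLS context.

    With cafile=None the system default trust store is loaded; otherwise only
    the given PEM bundle is trusted.
    """
    return ssl.create_default_context(cafile=cafile)
```

The resulting context can be handed to clients that accept an SSLContext, for example through the verify argument of httpx or the ssl argument of aiohttp; requests instead takes the bundle path directly in verify.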

SNI mismatch:

# Diagnose SNI problems with openssl
openssl s_client -connect example.com:443 -servername example.com

Protocol Behavior Differences

HTTP/2 specifics:

Some servers handle redirects differently under HTTP/2.

A side-by-side comparison test is recommended:

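import httpx

async def test_protocol_compatibility(url):
    # Fetch the same URL over HTTP/1.1 and HTTP/2 and compare the outcomes
    async with httpx.AsyncClient(http2=False) as client1:
        resp1 = await client1.get(url)
    async with httpx.AsyncClient(http2=True) as client2:
        resp2 = await client2.get(url)
    # http_version shows what was actually negotiated; the server may still pick HTTP/1.1
    if resp1.status_code != resp2.status_code:
        print(f"Protocol difference: {resp1.http_version}({resp1.status_code}) vs {resp2.http_version}({resp2.status_code})")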

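Compression and encoding:
- Mainstream clients decompress gzip and deflate automatically; brotli is decoded only when an optional brotli package is installed
- Other encodings, such as protobuf, must be decoded by hand:
```python
from google.protobuf.json_format import MessageToDict

async def parse_protobuf_response(response, message_cls):
    # message_cls is the generated message class for this endpoint (from a *_pb2 module)
    raw_data = await response.read()  # aiohttp-style body read
    message = message_cls()
    message.ParseFromString(raw_data)
    return MessageToDict(message)
```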

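5. Debugging and Monitoring

Packet capture:
- Use Wireshark or mitmproxy for low-level protocol analysis
- Run mitmproxy as the intercepting proxy (confdir is the directory that holds its CA certificates):
```
mitmproxy --set confdir=/path/to/certs
```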

Logging integration:

import logging
import httpx

class LoggingTransport(httpx.HTTPTransport):
    def handle_request(self, request):
        # Log every outgoing request before delegating to the real transport
        logging.info(f"Request: {request.method} {request.url}")
        return super().handle_request(request)

# Use the custom transport (call logging.basicConfig(level=logging.INFO) to see the output)
client = httpx.Client(transport=LoggingTransport())

Performance metrics:
- Request success rate
- Response latency (mean plus P95/P99 percentiles)
- TLS handshake time, which includes certificate verification
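
The first two metrics can be computed from raw samples with the standard library. This is a minimal sketch, assuming latencies are collected in milliseconds and each request outcome is recorded as a boolean; summarize is a hypothetical helper, and a production system would more likely export these values to a metrics backend.

```python
import statistics


def summarize(latencies_ms, successes):
    """Summarize latency samples and request outcomes.

    latencies_ms: iterable of response times in milliseconds (at least two).
    successes: iterable of booleans, True for a successful request.
    """
    latencies = list(latencies_ms)
    outcomes = list(successes)
    # quantiles(n=100) returns 99 cut points; index 94 is P95, index 98 is P99
    cuts = statistics.quantiles(latencies, n=100, method="inclusive")
    return {
        "success_rate": sum(outcomes) / len(outcomes),
        "mean_ms": statistics.fmean(latencies),
        "p95_ms": cuts[94],
        "p99_ms": cuts[98],
    }
```

Reporting the mean alongside the percentiles matters because a handful of slow requests can leave the mean almost unchanged while P99 degrades sharply.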

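6. Production Deployment Recommendations

Certificate rotation:
- Automate detection of certificate updates
- Manage certificates with Kubernetes Secrets or object storage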
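
One input an automated rotation check might use is the time left before a certificate expires. The sketch below, built on the standard library only, reads the notAfter field of a server certificate and converts it to days remaining. The names days_until_expiry and fetch_not_after are illustrative, and how the result feeds into alerting or renewal is left open.

```python
import socket
import ssl
import time


def days_until_expiry(not_after, now=None):
    """Days between `now` (epoch seconds) and a certificate notAfter string.

    not_after uses the format returned by getpeercert(), for example
    "Jan  5 09:34:43 2018 GMT". A negative result means already expired.
    """
    expires_at = ssl.cert_time_to_seconds(not_after)
    current = time.time() if now is None else now
    return (expires_at - current) / 86400


def fetch_not_after(host, port=443, timeout=5.0):
    """Connect to host and return the notAfter field of its certificate."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]
```

A scheduled job could call days_until_expiry(fetch_not_after(host)) for each target and raise an alert when the value drops below a chosen threshold.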

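Graceful degradation:

import httpx
import requests

def get_with_fallback(url, timeout=10.0):
    # http2 is a Client option; the top-level httpx.get() does not accept it
    for use_http2 in (True, False):
        try:
            with httpx.Client(http2=use_http2, timeout=timeout) as client:
                return client.get(url)
        except httpx.TransportError:
            continue  # connection or protocol failure: try the next option
    # Last resort: a different client stack (note: returns a requests.Response)
    return requests.get(url, timeout=timeout)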

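Resource isolation:
- Use cgroups to cap the resources a single crawler instance can consume
- Deploy in containers to isolate environments

With systematic technology selection, rigorous error handling, and thorough monitoring, developers can build a stable and efficient HTTPS crawler. In practice, balance development speed, runtime performance, and system stability against the specific business scenario, and validate competing approaches with A/B tests.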

HTTPS Crawler Development Guide: From Tool Selection to Error Handling in Practice