I. Technical Background and Compliance Notes
Merchant information collected from e-commerce platforms is an important data source for market research and competitive analysis. Automating collection with Python can significantly improve efficiency, but it must strictly comply with the Cybersecurity Law and the platform's terms of service: limit collection to publicly available data and never use it for commercial marketing. Conduct a compliance review before any deployment, and prefer obtaining data through the platform's official API.
II. Technical Implementation Path
1. Environment Setup and Dependency Installation
pip install requests beautifulsoup4 selenium lxml fake-useragent
Core library overview:
- requests: basic HTTP request library
- BeautifulSoup: HTML parsing tool
- Selenium: dynamic page rendering engine
- fake-useragent: request header spoofing tool
2. Static Page Parsing
2.1 Request Header Configuration
from fake_useragent import UserAgent

headers = {
    'User-Agent': UserAgent().random,
    'Referer': 'https://www.example-marketplace.com/',
    'Accept-Language': 'zh-CN,zh;q=0.9'
}
Key parameters:
- User-Agent should mimic a real browser signature
- Referer should match the target platform's domain
- Accept-Language gives Chinese content priority
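A quick usage sketch with these headers (the merchant URL is a placeholder on the example domain used throughout):

import requests

url = 'https://www.example-marketplace.com/merchant/12345'  # placeholder URL
response = requests.get(url, headers=headers, timeout=10)
print(response.status_code)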
2.2 Page Structure Analysis
Example HTML structure of a typical merchant info page:
<div class="contact-module">
  <span class="contact-name">张三</span>
  <div class="contact-phone" data-phone="138****1234">显示号码</div>
  <a href="mailto:zhangsan@example.com" class="contact-email">联系邮箱</a>
</div>
Parsing logic implementation:
from bs4 import BeautifulSoup

def parse_contact_info(html_content):
    soup = BeautifulSoup(html_content, 'lxml')
    contact_module = soup.find('div', class_='contact-module')
    if not contact_module:
        return None
    return {
        'name': contact_module.find('span', class_='contact-name').text,
        'phone': contact_module.find('div', class_='contact-phone')['data-phone'],
        'email': contact_module.find('a', class_='contact-email')['href'].replace('mailto:', '')
    }
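A usage sketch against the sample HTML above:

sample_html = '''
<div class="contact-module">
  <span class="contact-name">张三</span>
  <div class="contact-phone" data-phone="138****1234">显示号码</div>
  <a href="mailto:zhangsan@example.com" class="contact-email">联系邮箱</a>
</div>
'''
print(parse_contact_info(sample_html))
# {'name': '张三', 'phone': '138****1234', 'email': 'zhangsan@example.com'}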
3. Dynamic Page Handling
3.1 Selenium Automation Configuration
from fake_useragent import UserAgent
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

chrome_options = Options()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--disable-gpu')
chrome_options.add_argument(f'user-agent={UserAgent().random}')
driver = webdriver.Chrome(options=chrome_options)
driver.get('https://www.example-marketplace.com/merchant/12345')
Key configuration options:
- headless mode improves execution efficiency
- disabling GPU acceleration reduces resource consumption
- the User-Agent is set dynamically per session
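The block above never releases the browser; a minimal teardown sketch that frees the browser process even when an error occurs:

try:
    driver.get('https://www.example-marketplace.com/merchant/12345')
    # ... page interactions ...
finally:
    driver.quit()  # release the browser process and temporary resources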
3.2 Handling Dynamically Loaded Elements
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def get_dynamic_contact(driver):
    try:
        contact_div = WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.CLASS_NAME, 'contact-module'))
        )
        # Simulate clicking the "show number" button (if present)
        show_btn = contact_div.find_element(By.CLASS_NAME, 'show-phone-btn')
        show_btn.click()
        phone = contact_div.find_element(By.CLASS_NAME, 'contact-phone').text
        return phone
    except Exception as e:
        print(f"Failed to load dynamic element: {str(e)}")
        return None
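If the button itself loads asynchronously, waiting only for the container may not be enough. A hedged variant (assuming the same show-phone-btn class) that waits until the button is actually clickable:

def get_dynamic_contact_clickable(driver):
    try:
        show_btn = WebDriverWait(driver, 10).until(
            EC.element_to_be_clickable((By.CLASS_NAME, 'show-phone-btn'))
        )
        show_btn.click()
        return driver.find_element(By.CLASS_NAME, 'contact-phone').text
    except Exception as e:
        print(f"Failed to reveal phone number: {str(e)}")
        return None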
4. Anti-Scraping Countermeasures
4.1 IP Proxy Pool Configuration
import random

PROXY_POOL = [
    {'http': 'http://10.10.1.10:3128'},
    {'http': 'http://20.20.2.20:8080'}
]

def get_random_proxy():
    return random.choice(PROXY_POOL)
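The pool above is static; a minimal health-check sketch (httpbin.org/ip is just an illustrative test endpoint) can weed out dead proxies before use:

import requests

def validate_proxy(proxy, test_url='https://httpbin.org/ip'):
    # Return True only if the proxy answers within the timeout
    try:
        response = requests.get(test_url, proxies=proxy, timeout=5)
        return response.status_code == 200
    except Exception:
        return False

PROXY_POOL = [p for p in PROXY_POOL if validate_proxy(p)]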
4.2 Request Rate Control
import time
import random
import requests

def request_with_delay(url, headers, proxy=None):
    time.sleep(random.uniform(1, 3))  # Random delay of 1-3 seconds
    try:
        # proxies=None is equivalent to a direct connection
        response = requests.get(url, headers=headers, proxies=proxy, timeout=10)
        return response
    except Exception as e:
        print(f"Request failed: {str(e)}")
        return None
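Building on request_with_delay, a sketch of a retry wrapper with exponential backoff (the retry count and backoff base are arbitrary choices):

def request_with_retry(url, headers, max_retries=3):
    for attempt in range(max_retries):
        response = request_with_delay(url, headers, proxy=get_random_proxy())
        if response and response.status_code == 200:
            return response
        time.sleep(2 ** (attempt + 1))  # Back off 2, 4, 8 seconds between attempts
    return None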
III. Complete Collection Workflow Design
1. Merchant List Page Collection
def crawl_merchant_list(base_url, max_pages=5):
    merchants = []
    for page in range(1, max_pages + 1):
        url = f"{base_url}?page={page}"
        response = request_with_delay(url, headers)
        if response and response.status_code == 200:
            soup = BeautifulSoup(response.text, 'lxml')
            items = soup.select('.merchant-item')
            for item in items:
                merchant_id = item['data-id']
                merchant_url = f"https://www.example-marketplace.com/merchant/{merchant_id}"
                merchants.append(merchant_url)
    return merchants
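A usage sketch (the listing URL is a placeholder on the example domain):

merchant_urls = crawl_merchant_list('https://www.example-marketplace.com/merchants', max_pages=3)
print(f"Collected {len(merchant_urls)} merchant detail URLs")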
2. Detail Page Information Collection
def crawl_merchant_details(merchant_urls):
    results = []
    for url in merchant_urls:
        # Try the static page first
        response = request_with_delay(url, headers)
        if response and response.status_code == 200:
            contact_info = parse_contact_info(response.text)
            if contact_info:
                results.append(contact_info)
                continue
        # Fall back to dynamic rendering
        driver.get(url)
        phone = get_dynamic_contact(driver)
        if phone:
            # Supplement with other statically available fields as needed
            results.append({'phone': phone})
    return results
IV. Performance Optimization and Exception Handling
1. Concurrent Collection Design
from concurrent.futures import ThreadPoolExecutor

def concurrent_crawl(urls, max_workers=5):
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        results = list(executor.map(crawl_single_page, urls))
    return [r for r in results if r]
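crawl_single_page is referenced above but never defined in this article; a minimal sketch, assuming it combines the static fetch and parse steps from earlier sections:

def crawl_single_page(url):
    # Fetch one detail page and parse its contact info; None on failure
    response = request_with_delay(url, headers)
    if response and response.status_code == 200:
        return parse_contact_info(response.text)
    return None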
2. Data Persistence
import csv
import json

def save_to_csv(data, filename='contacts.csv'):
    with open(filename, 'w', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=['name', 'phone', 'email'])
        writer.writeheader()
        writer.writerows(data)

def save_to_json(data, filename='contacts.json'):
    with open(filename, 'w', encoding='utf-8') as f:
        json.dump(data, f, ensure_ascii=False, indent=2)
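Tying the pieces together, a hedged end-to-end sketch (the listing URL is a placeholder and error handling is kept minimal):

def main():
    list_url = 'https://www.example-marketplace.com/merchants'  # placeholder listing URL
    merchant_urls = crawl_merchant_list(list_url, max_pages=3)
    contacts = crawl_merchant_details(merchant_urls)
    save_to_csv(contacts)
    save_to_json(contacts)

if __name__ == '__main__':
    main()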
V. Practical Recommendations
- Compliance first: always check robots.txt for crawl permissions (see the sketch after this list) and keep the request rate at no more than 1 request per second
- Adapt dynamically: check periodically for page structure changes and prefer resilient element-location strategies over hard-coded XPath expressions
- Manage resources: reuse HTTP sessions via connection pooling to cut connection overhead and reduce the risk of resource leaks
- Monitor exceptions: build a retry mechanism for failed requests and log failed URLs for later analysis
- Clean the data: validate collected results with regular expressions and filter out invalid contact details (e.g., mobile numbers that are not 11 digits)
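For the compliance check, a minimal robots.txt sketch using the standard library (the robots.txt URL assumes the example domain):

from urllib import robotparser

def is_allowed(url, user_agent='*'):
    # Check whether robots.txt permits fetching the target URL
    rp = robotparser.RobotFileParser()
    rp.set_url('https://www.example-marketplace.com/robots.txt')
    rp.read()
    return rp.can_fetch(user_agent, url)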
By combining static parsing with dynamic rendering and layering in anti-scraping countermeasures, this approach enables stable collection of merchant information from e-commerce platforms. For actual deployment, validate the pipeline in a small-scale test environment first, then scale gradually toward production.