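I. Building a Weather Lookup Command-Line Tool (Basics)

1.1 Project Goals and Architecture

The goal of this project is to build a weather lookup system driven from the command line: the user enters a city name and gets back current weather data such as temperature and humidity. The system uses a layered architecture:

Data layer: integrates a third-party weather API
Business layer: converts city names to coordinates and parses the weather data
Interaction layer: handles command-line argument parsing and error reporting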

1.2 Core Functionality

1.2.1 Fetching Weather Data

Open-Meteo is recommended because it requires no API key and exposes a simple RESTful interface:

import requests


def fetch_weather_data(lat, lon):
    """Fetch current weather data for the given coordinates."""
    base_url = "https://api.open-meteo.com/v1/forecast"
    params = {
        "latitude": lat,
        "longitude": lon,
        "current_weather": True,
        "timezone": "auto"
    }
    try:
        response = requests.get(base_url, params=params, timeout=5)
        response.raise_for_status()
        return response.json()
    except requests.exceptions.RequestException as e:
        print(f"Network request failed: {e}")
        return None
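
Because fetch_weather_data can return None and a response may lack individual fields, it can help to read the payload defensively. The following is a minimal sketch, assuming the response carries a current_weather object with temperature and windspeed keys; the helper name summarize_current_weather and the sample values are made up for illustration.

```python
def summarize_current_weather(payload):
    """Return a one-line summary from an Open-Meteo style response dict."""
    # Tolerate a failed request (None) and a response without current data.
    current = (payload or {}).get("current_weather")
    if not current:
        return "no current weather available"
    parts = [f"{current.get('temperature', '?')}°C"]
    if "windspeed" in current:
        parts.append(f"wind {current['windspeed']} km/h")
    return ", ".join(parts)


# Illustrative payload only; the values are invented for this example.
sample = {"current_weather": {"temperature": 21.5, "windspeed": 8.0}}
print(summarize_current_weather(sample))
print(summarize_current_weather(None))
```

Keeping the formatting in one small function means the command-line layer never has to index into the raw response directly.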

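1.2.2 Converting City Names to Coordinates

A three-tier lookup strategy (two caches with an API fallback) keeps lookups fast:

Memory cache: built-in coordinates for frequently queried cities
File cache: a persistent, extensible coordinate store
API query: a geocoding service (such as Nominatim)

# Base coordinate table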
CITY_COORDINATES = {
    "Beijing": (39.9042, 116.4074),
    "Shanghai": (31.2304, 121.4737),
    "Guangzhou": (23.1291, 113.2644)
}
def get_coordinates(city_name):
    """Return (lat, lon) for a city (can be extended with further caches)."""
    # 1. Check the in-memory cache
    if city_name in CITY_COORDINATES:
        return CITY_COORDINATES[city_name]
    # 2. A real project could check a file cache here
    # with open('city_coords.json') as f:
    #     coords_db = json.load(f)
    # 3. Call a geocoding API (pseudocode example)
    # geocode_url = "https://nominatim.openstreetmap.org/search"
    # ...
    raise ValueError(f"No coordinates found for city: {city_name}")
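
The three tiers described above can be sketched as follows. This is only one possible arrangement, not a definitive implementation: the function name lookup_coordinates, the JSON layout {"City": [lat, lon]}, and the geocode callback are assumptions made for illustration. The geocoder is passed in as a plain function so the example stays offline; a real project might wrap a service such as Nominatim behind it.

```python
import json
import tempfile
from pathlib import Path

MEMORY_CACHE = {"Beijing": (39.9042, 116.4074)}


def lookup_coordinates(city, cache_path, geocode=None):
    """Resolve a city through memory, then a JSON file, then a geocoder."""
    # Tier 1: in-memory dictionary.
    if city in MEMORY_CACHE:
        return MEMORY_CACHE[city]

    # Tier 2: JSON file of the assumed form {"City": [lat, lon], ...}.
    path = Path(cache_path)
    file_cache = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
    if city in file_cache:
        coords = tuple(file_cache[city])
        MEMORY_CACHE[city] = coords
        return coords

    # Tier 3: caller-supplied geocoder (hypothetical hook returning (lat, lon) or None).
    if geocode is not None:
        coords = geocode(city)
        if coords is not None:
            coords = tuple(coords)
            file_cache[city] = list(coords)
            path.write_text(json.dumps(file_cache, ensure_ascii=False), encoding="utf-8")
            MEMORY_CACHE[city] = coords
            return coords

    raise ValueError(f"No coordinates found for city: {city}")


with tempfile.TemporaryDirectory() as tmp:
    cache_file = Path(tmp) / "city_coords.json"
    # Stand-in geocoder with a hard-coded answer; a real one would call a service.
    stub = lambda name: (30.5728, 104.0668) if name == "Chengdu" else None
    first = lookup_coordinates("Chengdu", cache_file, stub)  # resolved by the geocoder
    MEMORY_CACHE.pop("Chengdu")                              # simulate a restart
    second = lookup_coordinates("Chengdu", cache_file)       # resolved from the file
    print(first, second)
```

Each successful lookup is written back to the faster tiers, so the geocoder is only consulted once per city.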

1.2.3 Command-Line Interface

The argparse module provides argument parsing and generated help text:

import argparse


def main():
    parser = argparse.ArgumentParser(
        description="Real-time weather lookup tool",
        epilog="Example: python weather.py Beijing"
    )
    parser.add_argument("city", help="name of the city to look up")
    parser.add_argument("-v", "--verbose", action="store_true", help="show detailed output")
    args = parser.parse_args()
    try:
        lat, lon = get_coordinates(args.city)
        weather_data = fetch_weather_data(lat, lon)
        if not weather_data:
            return  # fetch_weather_data has already reported the failure
        current = weather_data['current_weather']
        if args.verbose:
            # The current_weather block carries no humidity field, so wind
            # direction is shown here; humidity must be requested separately.
            print(f"""
City: {args.city}
Temperature: {current['temperature']}°C
Wind direction: {current['winddirection']}°
Wind speed: {current['windspeed']} km/h
""")
        else:
            print(f"{args.city}: {current['temperature']}°C")
    except Exception as e:
        print(f"Lookup failed: {e}")


if __name__ == "__main__":
    main()

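1.3 Performance Optimization Suggestions

Asynchronous requests: use aiohttp to query several cities concurrently
Data caching: add a Redis cache layer (TTL set to 10 minutes)
Error retries: implement retries with exponential backoff
Logging: use the logging module to record runtime state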
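
The retry and logging suggestions above can be sketched together. This is a minimal illustration under stated assumptions: the helper name retry_with_backoff, the logger name, and the delay values are hypothetical, and the broad except clause would normally be narrowed to the network errors worth retrying (for example requests.exceptions.RequestException).

```python
import logging
import random
import time

logger = logging.getLogger("weather")  # hypothetical logger name


def retry_with_backoff(func, max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Call func(); after a failure wait base_delay * 2**attempt plus jitter, then retry."""
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception as exc:  # narrow this to retryable errors in real code
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.1)
            logger.warning("attempt %d failed (%s); retrying in %.2fs",
                           attempt + 1, exc, delay)
            sleep(delay)


# Demo: a function that fails twice and then succeeds.
attempts = {"count": 0}


def flaky():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"


delays = []  # record the waits instead of actually sleeping
result = retry_with_backoff(flaky, sleep=delays.append)
print(result, len(delays))
```

Passing the sleep function as a parameter keeps the helper easy to test; the small random jitter spreads out retries when many clients fail at the same moment.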

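II. Scraping and Analyzing Douban Movie Data (Advanced)

2.1 Project Architecture

The system follows a typical scrape, store, analyze pipeline:

Collection layer → Cleaning layer → Storage layer → Analysis layer
       │                 │                │               │
   requests        BeautifulSoup       pandas        matplotlib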

2.2 Core Modules

2.2.1 Coping with Anti-Scraping Measures

import requests
from fake_useragent import UserAgent
import time
import random
class AntiScrapingClient:
    def __init__(self):
        self.ua = UserAgent()
        self.session = requests.Session()
    def get(self, url):
        headers = {
            "User-Agent": self.ua.random,
            "Accept-Language": "zh-CN,zh;q=0.9",
            "Referer": "https://www.baidu.com"
        }
        time.sleep(random.uniform(0.5, 1.5))  # random delay between requests
        try:
            response = self.session.get(url, headers=headers, timeout=10)
            if response.status_code == 429:
                # Not a RequestException, so this propagates to the caller
                raise RuntimeError("Too many requests (HTTP 429)")
            return response
        except requests.exceptions.RequestException as e:
            print(f"Request error: {e}")
            return None

2.2.2 Parsing Module

CSS selectors locate the elements of interest:

from bs4 import BeautifulSoup
def parse_movie_page(html):
    soup = BeautifulSoup(html, 'lxml')
    movies = []
    for item in soup.select('.item'):
        try:
            title = item.select_one('.title').text.strip()
            rating = float(item.select_one('.rating_num').text)
            # The info paragraph is assumed to hold two lines: credits first,
            # then "year / country / genre". Strip first so that a leading
            # newline does not shift the indexes.
            info_lines = item.select_one('.bd p').text.strip().split('\n')
            director_info = info_lines[0].strip()
            year = info_lines[1].strip().split('/')[0].strip()
            # Credits look like "导演: A / B   主演: C"; keep only the director names
            directors = [d.strip() for d in director_info.split('导演:')[-1].split('主演:')[0].split('/')] if '导演:' in director_info else []
            movies.append({
                'title': title,
                'rating': rating,
                'year': year,
                'directors': directors
            })
        except Exception as e:
            print(f"Parse error: {e}")
            continue
    return movies

2.2.3 Storage

A pandas DataFrame is a convenient structure for this data:

import pandas as pd
def save_to_csv(movies, filename='douban_top250.csv'):
    df = pd.DataFrame(movies)
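    # Clean the data: drop rows whose year is not numeric, store years as integers
    df['year'] = pd.to_numeric(df['year'], errors='coerce')
    df = df.dropna(subset=['year']).astype({'year': int})
    # Write the file (utf_8_sig adds a BOM so that Excel detects the encoding)
    df.to_csv(filename, index=False, encoding='utf_8_sig')
    print(f"Data saved to {filename}")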

2.3 Analysis and Visualization

2.3.1 Basic Statistics

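import ast


def analyze_movies(filename):
    # The directors column was written to CSV as the text form of a list;
    # parse it back so that explode() yields one row per director.
    df = pd.read_csv(filename, converters={'directors': ast.literal_eval})
    # Basic statistics
    stats = {
        'Total movies': len(df),
        'Average rating': round(df['rating'].mean(), 2),
        'Highest rating': df['rating'].max(),
        'Lowest rating': df['rating'].min(),
        'Year range': f"{int(df['year'].min())}-{int(df['year'].max())}"
    }
    # Number of listed films per director
    director_counts = df['directors'].explode().value_counts().head(10)
    return {
        'basic_stats': stats,
        'top_directors': director_counts.to_dict()
    }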

2.3.2 Visualization

matplotlib generates the charts:

import matplotlib.pyplot as plt
def visualize_data(df):
    plt.figure(figsize=(12, 6))
    # Distribution by year
    plt.subplot(1, 2, 1)
    year_counts = df['year'].value_counts().sort_index()
    year_counts.plot(kind='bar', color='skyblue')
    plt.title('Movies by Year')
    plt.xlabel('Year')
    plt.ylabel('Count')
    # Distribution of ratings
    plt.subplot(1, 2, 2)
    df['rating'].plot(kind='hist', bins=10, edgecolor='black', color='salmon')
    plt.title('Rating Distribution')
    plt.xlabel('Rating')
    plt.ylabel('Count')
    plt.tight_layout()
    plt.savefig('movie_analysis.png')
    plt.close()

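2.4 Advanced Extensions

Distributed crawling: use Scrapy-Redis to implement a distributed task queue
Incremental updates: record the URLs already crawled and fetch only new ones
Error monitoring: integrate Sentry to track errors
Persistence: store the cleaned data in a relational database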
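
The incremental-update and persistence suggestions above can be sketched with the standard-library sqlite3 module. This is a minimal sketch, not a prescribed design: the table names, column layout, and helper names are assumptions chosen for illustration, and SQLite stands in for whichever relational database a project actually uses.

```python
import sqlite3


def open_store(db_path=":memory:"):
    """Open the database and create the two illustrative tables if missing."""
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS crawled_urls (url TEXT PRIMARY KEY)")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS movies ("
        "title TEXT PRIMARY KEY, rating REAL, year INTEGER)"
    )
    return conn


def is_new_url(conn, url):
    """Return True and record the URL the first time it is seen, else False."""
    seen = conn.execute("SELECT 1 FROM crawled_urls WHERE url = ?", (url,)).fetchone()
    if seen:
        return False
    conn.execute("INSERT INTO crawled_urls (url) VALUES (?)", (url,))
    conn.commit()
    return True


def save_movies(conn, movies):
    """Insert or update all movies in one batch."""
    rows = [(m["title"], m["rating"], m["year"]) for m in movies]
    conn.executemany(
        "INSERT OR REPLACE INTO movies (title, rating, year) VALUES (?, ?, ?)", rows
    )
    conn.commit()


conn = open_store()
print(is_new_url(conn, "https://example.com/page/1"))  # first visit
print(is_new_url(conn, "https://example.com/page/1"))  # already recorded
save_movies(conn, [{"title": "Example Movie", "rating": 9.0, "year": 1994}])
print(conn.execute("SELECT COUNT(*) FROM movies").fetchone()[0])
```

A crawler would call is_new_url before each request and skip pages that return False, so a rerun only fetches what it has not seen.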

III. Development Best Practices

3.1 Code Organization

project/
├── config/            # configuration
│   └── settings.py
├── core/              # core logic
│   ├── api_client.py
│   ├── parser.py
│   └── analyzer.py
├── utils/             # utility functions
│   ├── cache.py
│   └── logger.py
├── tests/             # unit tests
│   └── test_parser.py
└── main.py            # entry point

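3.2 Performance Strategies

Connection pooling: use requests.Session to reuse TCP connections
Batch operations: prefer batch inserts for database writes
Memory usage: process large datasets with generators rather than lists
Parallelism: run independent tasks in multiple processes or threads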
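
Two of the strategies above, generators for memory and threads for independent tasks, can be sketched with the standard library alone. The names iter_batches, process_page, and run_parallel are hypothetical, and process_page is only a placeholder for real I/O-bound work such as fetching one page.

```python
from concurrent.futures import ThreadPoolExecutor


def iter_batches(items, batch_size):
    """Yield lists of at most batch_size items, one batch at a time."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # emit the final, possibly shorter batch
        yield batch


def process_page(page_number):
    # Placeholder for an independent I/O-bound task.
    return page_number * page_number


def run_parallel(page_numbers, max_workers=4):
    """Run process_page over all inputs in a thread pool."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() returns results in the same order as the inputs.
        return list(pool.map(process_page, page_numbers))


print(list(iter_batches(range(7), 3)))  # [[0, 1, 2], [3, 4, 5], [6]]
print(run_parallel(range(5)))           # [0, 1, 4, 9, 16]
```

Threads suit network-bound work like page fetches; CPU-bound steps are usually better served by a process pool.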

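3.3 Security Considerations

Sensitive data: keep API keys and similar secrets in environment variables
Input validation: validate all user input strictly
Rate limiting: throttle requests and honor the target site's robots.txt rules
Exception handling: distinguish expected errors from unexpected ones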
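
The first three points above can be sketched with the standard library. Everything named here is an assumption made for illustration: the variable name WEATHER_API_KEY, the city-name pattern (which only accepts Latin letters, matching the English keys used in the coordinate table), and the sample robots.txt rules.

```python
import os
import re
from urllib.robotparser import RobotFileParser

# Assumed rule: a letter followed by up to 63 letters, spaces, dots, apostrophes or hyphens.
CITY_PATTERN = re.compile(r"[A-Za-z][A-Za-z .'-]{0,63}")


def validate_city(raw):
    """Return the trimmed city name or raise ValueError for suspicious input."""
    name = raw.strip()
    if not CITY_PATTERN.fullmatch(name):
        raise ValueError(f"invalid city name: {raw!r}")
    return name


def load_api_key(var_name="WEATHER_API_KEY"):
    """Read a secret from the environment (the variable name is hypothetical)."""
    key = os.environ.get(var_name)
    if not key:
        raise RuntimeError(f"environment variable {var_name} is not set")
    return key


def is_allowed(robots_lines, user_agent, url):
    """Check a URL against robots.txt rules supplied as a list of lines."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch(user_agent, url)


# Sample rules for the demo; a crawler would download the site's real robots.txt.
rules = ["User-agent: *", "Disallow: /private/"]
print(validate_city("  New York "))
print(is_allowed(rules, "my-crawler", "https://example.com/top250"))
print(is_allowed(rules, "my-crawler", "https://example.com/private/data"))
```

Rejecting unexpected characters up front keeps malformed input away from URLs, file paths, and log lines further down the pipeline.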

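Working through both projects end to end gives developers systematic practice with Python's core skills in data collection, processing, and analysis. In real projects, adapt the modules to the business requirements at hand and gradually build up a library of reusable components.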

Advanced Python in Practice: A Complete Development Guide from Command-Line Tools to Data Scraping