From Series & DataFrame to Time Series Analysis
Master the core skills of data science
Understanding the fundamental building blocks
Core Data Structures
import pandas as pd
import numpy as np

# Create a Series
s = pd.Series([1, 3, 5, np.nan, 6, 8])
print(s)

A one-dimensional labeled array, similar to a single column in Excel
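To make the "labeled array" idea concrete, here is a minimal sketch with a hypothetical price list — values can be retrieved by label as well as by position:

```python
import pandas as pd

# A Series with an explicit string index behaves like a labeled column.
prices = pd.Series([3.5, 2.0, 4.25], index=['apple', 'banana', 'cherry'])

by_label = prices['banana']    # label-based access
by_position = prices.iloc[2]   # position-based access
```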
# Create a DataFrame
dates = pd.date_range('20230101', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))

A two-dimensional tabular structure, similar to an Excel worksheet or a SQL table
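Besides random data, a DataFrame is often built from a dict of columns — a quick sketch with made-up names and scores:

```python
import pandas as pd

# Each dict key becomes a column, like columns of a small spreadsheet.
df = pd.DataFrame({
    'name': ['Alice', 'Bob'],
    'score': [85, 92],
})
```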
Data Input/Output Operations
df = pd.read_csv('data.csv')
df.to_csv('output.csv', index=False)

df = pd.read_excel('data.xlsx', sheet_name='Sheet1')
df.to_excel('output.xlsx', index=False)
df = pd.read_sql('SELECT * FROM table', connection)
df.to_sql('table_name', connection, if_exists='replace')
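The SQL round trip can be tried without any external database by using the standard-library sqlite3 module with an in-memory connection (table name `users` is illustrative):

```python
import sqlite3
import pandas as pd

# An in-memory SQLite database needs no driver or server.
conn = sqlite3.connect(':memory:')

df = pd.DataFrame({'id': [1, 2], 'name': ['a', 'b']})
df.to_sql('users', conn, index=False)          # write the table
back = pd.read_sql('SELECT * FROM users', conn)  # read it back
conn.close()
```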
Data Cleaning & Preprocessing
# Check for missing values
df.isnull().sum()

# Fill missing values with column means (numeric columns only)
df.fillna(df.mean(numeric_only=True), inplace=True)

# Drop rows with missing values
df.dropna(inplace=True)

# Check for duplicate rows
df.duplicated().sum()

# Drop duplicate rows
df.drop_duplicates(inplace=True)
# Detect outliers with the IQR method
Q1 = df.quantile(0.25)
Q3 = df.quantile(0.75)
IQR = Q3 - Q1
df_clean = df[~((df < (Q1 - 1.5 * IQR)) | (df > (Q3 + 1.5 * IQR))).any(axis=1)]
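A worked example of the IQR filter on a hypothetical column with one obvious outlier (100) shows which rows survive:

```python
import pandas as pd

# Five ordinary values and one outlier.
df = pd.DataFrame({'value': [10, 12, 11, 13, 12, 100]})

Q1 = df.quantile(0.25)
Q3 = df.quantile(0.75)
IQR = Q3 - Q1

# A row is an outlier if any of its values fall outside [Q1-1.5*IQR, Q3+1.5*IQR].
mask = ((df < (Q1 - 1.5 * IQR)) | (df > (Q3 + 1.5 * IQR))).any(axis=1)
df_clean = df[~mask]
```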
Transform your data for better analysis
Split-Apply-Combine Strategy
# Group by a single column
grouped = df.groupby('column_name')

# Aggregation functions
result = df.groupby('category').agg({
    'sales': ['sum', 'mean', 'count'],
    'profit': 'sum'
})

# Group by multiple columns
multi_group = df.groupby(['region', 'category']).sum()

# Transform - keeps the original shape
df['sales_pct'] = df.groupby('category')['sales'].transform(
    lambda x: x / x.sum() * 100
)

# Aggregate - reduces the data to one row per group
summary = df.groupby('category').agg({
    'sales': ['sum', 'mean'],
    'quantity': 'count'
})
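The transform-versus-aggregate contrast is easiest to see on a tiny hypothetical sales table: transform returns one value per original row, while aggregation collapses each group to a single row:

```python
import pandas as pd

df = pd.DataFrame({
    'category': ['A', 'A', 'B'],
    'sales': [30, 70, 50],
})

# transform: same length as df, each row gets its group's share.
df['sales_pct'] = df.groupby('category')['sales'].transform(
    lambda x: x / x.sum() * 100
)

# agg: one row per category.
summary = df.groupby('category')['sales'].sum()
```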
Merge, Join & Concatenate
# Inner join
merged = pd.merge(df1, df2, on='key', how='inner')

# Left join
left_merged = pd.merge(df1, df2, on='key', how='left')

# Vertical concatenation (stack rows)
result = pd.concat([df1, df2], axis=0)

# Horizontal concatenation (stack columns)
result = pd.concat([df1, df2], axis=1)
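The difference between the join types is clearest with two hypothetical tables whose keys only partially overlap: an inner join keeps only matching keys, a left join keeps every row of the left table and fills missing matches with NaN:

```python
import pandas as pd

df1 = pd.DataFrame({'key': [1, 2, 3], 'x': ['a', 'b', 'c']})
df2 = pd.DataFrame({'key': [2, 3, 4], 'y': [20, 30, 40]})

inner = pd.merge(df1, df2, on='key', how='inner')  # keys 2 and 3 only
left = pd.merge(df1, df2, on='key', how='left')    # all of df1; y is NaN for key 1
```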
Pivot, Melt & Reshape
pivot_df = df.pivot(index='date', columns='category', values='sales')
melted_df = pd.melt(df, id_vars=['id'], value_vars=['A', 'B', 'C'])
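A small round trip on a hypothetical wide table shows that melt and pivot are inverses: melt turns columns A/B into rows (long form), and pivot reshapes them back:

```python
import pandas as pd

wide = pd.DataFrame({'id': [1, 2], 'A': [10, 30], 'B': [20, 40]})

# Long form: one row per (id, variable) pair.
long = pd.melt(wide, id_vars=['id'], value_vars=['A', 'B'])

# Back to wide form.
back = long.pivot(index='id', columns='variable', values='value')
```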
Master time-based data analysis
Time-based Data Manipulation
# Create a date range
dates = pd.date_range('2023-01-01', periods=100, freq='D')

# Convert to datetime
df['date'] = pd.to_datetime(df['date'])

# Set the date column as the index
df.set_index('date', inplace=True)

# Time-based slicing (use .loc for label slices on a DatetimeIndex)
recent_data = df.loc['2023-01-01':'2023-12-31']

# Resample to monthly totals
monthly = df.resample('M').sum()

# Weekly aggregation
weekly = df.resample('W').agg({
    'sales': 'sum',
    'customers': 'mean'
})

# Rolling window (7-day moving average)
rolling_avg = df['sales'].rolling(window=7).mean()
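Resampling and rolling windows can be checked on a hypothetical two-week daily series of ones: weekly resampling sums 7 daily values per bucket, and a 7-day rolling mean produces NaN until it has 7 observations:

```python
import pandas as pd
import numpy as np

# 14 daily values of 1.0, starting on a Monday (2023-01-02).
idx = pd.date_range('2023-01-02', periods=14, freq='D')
daily = pd.Series(np.ones(14), index=idx)

weekly = daily.resample('W').sum()            # two full Mon-Sun weeks
rolling = daily.rolling(window=7).mean()      # NaN for the first 6 days
```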
Titanic Dataset Analysis
# Load the Titanic dataset
titanic = pd.read_csv('titanic.csv')

# Basic information
print(titanic.head())
print(titanic.info())
print(titanic.describe())

# Handle missing values (column assignment avoids
# chained-assignment warnings in recent pandas)
titanic['Age'] = titanic['Age'].fillna(titanic['Age'].median())
titanic['Embarked'] = titanic['Embarked'].fillna('S')

# Create a new feature
titanic['Family_Size'] = titanic['SibSp'] + titanic['Parch'] + 1

# Survival-rate analysis
survival_by_class = titanic.groupby('Pclass')['Survived'].mean()
survival_by_gender = titanic.groupby('Sex')['Survived'].mean()
Professional tips for efficient data processing
Performance Optimization
# Optimize data types
df['category'] = df['category'].astype('category')
df['int_col'] = pd.to_numeric(df['int_col'], downcast='integer')

# Clean method chaining
result = (df
    .dropna()
    .groupby('category')
    .agg({'sales': 'sum'})
    .sort_values('sales', ascending=False)
)

# Use the query method
filtered = df.query('age > 25 and salary > 50000')

# Optimized boolean indexing
mask = (df['age'] > 25) & (df['salary'] > 50000)
result = df.loc[mask]
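The memory benefit of the `category` dtype can be measured directly on a hypothetical low-cardinality string column: each distinct value is stored once, with small integer codes per row:

```python
import pandas as pd

# 3000 rows but only 3 distinct strings.
s = pd.Series(['red', 'green', 'blue'] * 1000)
cat = s.astype('category')

mem_obj = s.memory_usage(deep=True)    # repeated Python strings
mem_cat = cat.memory_usage(deep=True)  # 3 strings + compact codes
```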
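On a hypothetical staff table, `query()` and boolean indexing select exactly the same rows; `query` is often more readable for compound conditions:

```python
import pandas as pd

df = pd.DataFrame({
    'age': [22, 30, 45],
    'salary': [40000, 60000, 80000],
})

via_query = df.query('age > 25 and salary > 50000')

mask = (df['age'] > 25) & (df['salary'] > 50000)
via_mask = df.loc[mask]
```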
Clean Code Practices
# Clear variable naming
sales_by_region = df.groupby('region')['sales'].sum()

# Appropriate comments
# Compute each region's share of total sales
region_percentage = sales_by_region / sales_by_region.sum() * 100

try:
    df = pd.read_csv('data.csv')
except FileNotFoundError:
    print("File not found; check the file path")
except pd.errors.EmptyDataError:
    print("The file is empty")
Structured Learning Journey
Curated resources for continuous learning