Today is day 990 of 生信星球 keeping you company.
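Contents: 1. Aside: running code line by line, as in RStudio 2. Sample data 3. Exercise: reading multiple CSV files 4. Exercise: exploring the data 5. Exercise: visualization 6. Exercise: groupby and aggregates 7. Combining plots 8. Dummy variables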
1. Aside: running code line by line, as in RStudio
2. Sample data
3. Exercise: reading multiple CSV files
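The glob.glob function returns a list of file names that match a given pattern. A list comprehension can then read all of those files into a list of DataFrames, from which you pick out the one you need.
import glob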
import pandas as pd
# Get a list of all the csv files
csv_files = glob.____('*.csv')
# List comprehension that loads all the files
dfs = [pd.read_csv(____) for ____ in ____]
# List comprehension that looks at the shape of all DataFrames
print(____)
import glob
import pandas as pd
# Get a list of all the csv files
csv_files = glob.glob('*.csv')
csv_files
## ['airlines.csv', 'flights.csv', 'planes.csv']
# List comprehension that loads all the files
dfs = [pd.read_csv(x) for x in csv_files]
# List comprehension that looks at the shape of all DataFrames
print([x.shape for x in dfs])
## [(16, 2), (336776, 20), (3322, 9)]
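Picking a DataFrame by position (for example dfs[2] for planes) relies on the order in which glob happens to return the files, and that order is not guaranteed. One possible alternative, sketched below, is to key each DataFrame by its file name. The temporary directory and the two tiny CSV files are hypothetical stand-ins for the real data, used only to keep the example self-contained.

```python
import glob
import os
import tempfile

import pandas as pd

# Hypothetical stand-ins for the real airlines.csv and planes.csv
tmp_dir = tempfile.mkdtemp()
pd.DataFrame({"carrier": ["AA", "UA"], "name": ["American", "United"]}).to_csv(
    os.path.join(tmp_dir, "airlines.csv"), index=False
)
pd.DataFrame({"tailnum": ["N1", "N2", "N3"], "engines": [2, 2, 4]}).to_csv(
    os.path.join(tmp_dir, "planes.csv"), index=False
)

# sorted() makes the order reproducible; glob itself does not promise one
csv_paths = sorted(glob.glob(os.path.join(tmp_dir, "*.csv")))

# Key each DataFrame by its file name without the .csv extension
frames = {
    os.path.splitext(os.path.basename(path))[0]: pd.read_csv(path)
    for path in csv_paths
}

print({name: df.shape for name, df in frames.items()})
# {'airlines': (2, 2), 'planes': (3, 2)}
```

With this layout, frames["planes"] keeps working even if more CSV files are added to the directory later.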
4. Exercise: exploring the data
Get the planes DataFrame from dfs, then count the frequency of each value in engines.
# Get the planes DataFrame
planes = dfs[____]
# Count the frequency of engines in our data
print(____)
# Look at all planes with >= 3 engines
print(____[____])
# Look at all planes with >= 3 engines and <= 100 seats
print(____[____])
# Get the planes DataFrame
planes = dfs[2]
# Count the frequency of engines in our data
print(planes['engines'].value_counts())
# Look at all planes with >= 3 engines
print(planes.loc[planes['engines'] >= 3])
# Look at all planes with >= 3 engines and <= 100 seats
print(planes.loc[(planes.engines >=3) & (planes.seats <=100)])
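The filtering pattern used in this exercise can be tried on a small table first. The sketch below uses a made-up DataFrame (planes_toy and its values are hypothetical, not taken from planes.csv). Each comparison is wrapped in parentheses because & binds more tightly than >= and <=. DataFrame.query is shown as an equivalent spelling.

```python
import pandas as pd

# Hypothetical toy data with the same column names as planes
planes_toy = pd.DataFrame(
    {
        "tailnum": ["N1", "N2", "N3", "N4", "N5"],
        "engines": [2, 2, 4, 3, 1],
        "seats": [150, 55, 450, 80, 4],
    }
)

# Frequency of each engine count, most common first
engine_counts = planes_toy["engines"].value_counts()

# Combine two conditions; each one needs its own parentheses
mask = (planes_toy["engines"] >= 3) & (planes_toy["seats"] <= 100)
small_multi_engine = planes_toy.loc[mask]

# The same filter written as a query string
same_result = planes_toy.query("engines >= 3 and seats <= 100")

print(engine_counts)
print(small_multi_engine)
```

Only the row with tailnum N4 satisfies both conditions in this toy table.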
5. Exercise: visualization
import matplotlib.pyplot as plt
# Scatter plot of engines and seats
planes.____(x=____, y=____, kind=____)
plt.show()
# Histogram of seats
____(kind=____)
plt.show()
# Boxplot of seats by engine
____(column=____, by=____)
plt.xticks(rotation=45)
plt.show()
import matplotlib.pyplot as plt
# Scatter plot of engines and seats
planes.plot(x='engines', y='seats', kind='scatter')
plt.show()
# Histogram of seats
planes.seats.plot(kind='hist')
plt.show()
# Boxplot of seats by engine
planes.boxplot(column='seats', by='engines')
plt.xticks(rotation=45)
plt.show()
6. Exercise: groupby and aggregates
from datetime import datetime
def get_season(date_str):
    # Define the start date of each season (2013 serves as a template year)
    spring_start = datetime.strptime("2013-03-20", "%Y-%m-%d")
    summer_start = datetime.strptime("2013-06-21", "%Y-%m-%d")
    fall_start = datetime.strptime("2013-09-22", "%Y-%m-%d")
    winter_start = datetime.strptime("2013-12-21", "%Y-%m-%d")
    # Parse the input date
    date = datetime.strptime(date_str, "%Y-%m-%d %H:%M:%S")
    # Move the season boundaries to the year of the input date
    year = date.year
    if date < spring_start.replace(year=year):
        return "Winter"
    elif date < summer_start.replace(year=year):
        return "Spring"
    elif date < fall_start.replace(year=year):
        return "Summer"
    elif date < winter_start.replace(year=year):
        return "Fall"
    else:
        return "Winter"
# Test the function
test_date = "2013-01-01 05:00:00"
print(get_season(test_date))  # Output: Winter
test_date = "2013-04-01 05:00:00"
print(get_season(test_date))  # Output: Spring
test_date = "2013-07-01 05:00:00"
print(get_season(test_date))  # Output: Summer
test_date = "2013-10-01 05:00:00"
print(get_season(test_date))  # Output: Fall
test_date = "2013-12-25 05:00:00"
print(get_season(test_date))  # Output: Winter
# Infer the season from the time_hour column and add it to flights
flights = dfs[1]
flights['season'] = [get_season(x) for x in flights['time_hour']]
flights['season'].value_counts()
## season
## Summer    87341
## Spring    87089
## Fall      83190
## Winter    79156
## Name: count, dtype: int64
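Calling a Python function once per row works, but on a few hundred thousand rows a vectorized version may be faster. The sketch below is one possible approach, not the method used in this post: it assumes the same season boundaries as get_season (March 20, June 21, September 22, December 21), encodes each date as month * 100 + day, and labels it with numpy.select. The function name season_from_timestamps and the sample timestamps are hypothetical.

```python
import numpy as np
import pandas as pd


def season_from_timestamps(times: pd.Series) -> pd.Series:
    """Label each timestamp with a season, using fixed month/day boundaries."""
    dates = pd.to_datetime(times)
    # Encode month and day as one integer, e.g. March 20 -> 320
    month_day = (dates.dt.month * 100 + dates.dt.day).to_numpy()
    # np.select takes the first condition that is true for each element
    conditions = [
        month_day < 320,   # before March 20
        month_day < 621,   # before June 21
        month_day < 922,   # before September 22
        month_day < 1221,  # before December 21
    ]
    choices = ["Winter", "Spring", "Summer", "Fall"]
    labels = np.select(conditions, choices, default="Winter")
    return pd.Series(labels, index=times.index, name="season")


# Hypothetical sample in the same format as the time_hour column
sample_times = pd.Series(
    [
        "2013-01-01 05:00:00",
        "2013-04-01 05:00:00",
        "2013-07-01 05:00:00",
        "2013-10-01 05:00:00",
        "2013-12-25 05:00:00",
    ]
)
seasons = season_from_timestamps(sample_times)
print(seasons.tolist())
# ['Winter', 'Spring', 'Summer', 'Fall', 'Winter']
```

Under these assumptions it could be applied as flights['season'] = season_from_timestamps(flights['time_hour']).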
Add the dep_delay and arr_delay columns to get the total delay.
# Calculate total_delay
flights['total_delay'] = ____ + ____
# Mean total_delay by carrier
tdel_car = ____.____(____)[____].____().reset_index()
print(tdel_car)
# Mean dep_delay and arr_delay for each season
dadel_season = ____.____(____)[____, ____].____().reset_index()
print(dadel_season)
# Mean and std delays by origin
del_ori = flights.groupby('origin')['total_delay', 'dep_delay', 'arr_delay'].____([____, 'std'])
print(del_ori)
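Note: with recent versions of pandas, the last step of the template above, which selects the columns with single brackets, fails with:
ValueError: Cannot subset columns with a tuple with more than one element. Use a list instead.
To select several columns from a groupby, pass a list, that is, use double brackets. Reference: https://wenku.csdn.net/answer/ae36c7adc06c404aba632d1e9fa6561c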
# Calculate total_delay
flights['total_delay'] = flights['dep_delay'] + flights['arr_delay']
# Mean total_delay by carrier
tdel_car = flights.groupby('carrier')['total_delay'].mean().reset_index()
print(tdel_car)
# Mean dep_delay and arr_delay for each season
dadel_season = flights.groupby('season')[['dep_delay', 'arr_delay']].mean().reset_index()
print(dadel_season)
# Mean and std delays by origin
del_ori = flights.groupby('origin')[['total_delay', 'dep_delay', 'arr_delay']].agg(['mean', 'std'])
print(del_ori)
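To see what the aggregation produces, here is a minimal sketch on a made-up table (flights_toy and its numbers are hypothetical). It selects the columns with a list, applies two aggregations, and then reads a single value out of the result, whose columns form a two-level index of (column, statistic).

```python
import pandas as pd

# Hypothetical toy data with the same column names as flights
flights_toy = pd.DataFrame(
    {
        "origin": ["EWR", "EWR", "JFK", "JFK"],
        "dep_delay": [2.0, 4.0, 10.0, 20.0],
        "arr_delay": [1.0, 3.0, 5.0, 15.0],
    }
)
flights_toy["total_delay"] = flights_toy["dep_delay"] + flights_toy["arr_delay"]

# Double brackets: a list of column names, not a tuple
by_origin = flights_toy.groupby("origin")[["dep_delay", "total_delay"]].agg(["mean", "std"])
print(by_origin)

# The result has two-level columns; address one cell with a (column, statistic) tuple
ewr_mean_dep = by_origin.loc["EWR", ("dep_delay", "mean")]
print(ewr_mean_dep)
# 3.0
```

Calling reset_index() on by_origin would turn origin back into an ordinary column, as the earlier steps of the exercise do.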
7. Combining plots
Draw a box plot with the origin column on the x-axis and the dep_delay column on the y-axis.
# Create a figure
fig, (ax1, ax2) = plt.subplots(____)
# Boxplot and barplot in the axes
sns.____(x=____, y=____, data=flights, ax=____)
sns.____(x=____, y=____, data=tdel_car, ax=____)
# Label axes
ax1.set_title(____)
# Use tight_layout() so the plots don't overlap
fig.tight_layout()
plt.show()
import seaborn as sns
# Create a figure
fig, (ax1, ax2) = plt.subplots(2,1)
# Boxplot and barplot in the axes
sns.boxplot(x='origin', y='dep_delay', data=flights, ax=ax1)
sns.barplot(x='carrier', y='total_delay', data=tdel_car, ax=ax2)
# Label axes
ax1.set_title('Originating airport and the departure delay')
# Use tight_layout() so the plots don't overlap
fig.tight_layout()
plt.show()
8. Dummy variables
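Handling categorical variables: many machine learning algorithms require numeric input features. Converting a categorical variable into dummy variables makes it usable by those algorithms.
Avoiding the dummy variable trap: for a variable with n categories, usually only n-1 dummy variables are created, to avoid multicollinearity (strong correlation among the predictors). With all n columns, the dummies always sum to 1 and duplicate the intercept; this problem is called the dummy variable trap.
Improving interpretability: with dummy variables, each category gets its own coefficient, so its effect on the outcome variable can be read off directly.
Improving model performance: encoding categorical features as dummy variables can improve the predictive accuracy of some machine learning models, because the model can pick up differences between categories.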
flights_sub = flights[['year', 'month', 'day', 'dep_time', 'dep_delay', 'origin']]
# Look at the head of flights_sub
print(____)
# Create dummy variables
flights_dummies = ____
# Look at the head of flights_dummies
print(____)
# Look at the head of flights_sub
print(flights_sub.head())
##    year  month  day  dep_time  dep_delay origin
## 0  2013      1    1     517.0        2.0    EWR
## 1  2013      1    1     533.0        4.0    LGA
## 2  2013      1    1     542.0        2.0    JFK
## 3  2013      1    1     544.0       -1.0    JFK
## 4  2013      1    1     554.0       -6.0    LGA
# Create dummy variables
flights_dummies = pd.get_dummies(flights_sub)
# Look at the head of flights_dummies
print(flights_dummies.head())
##    year  month  day  dep_time  dep_delay  origin_EWR  origin_JFK  origin_LGA
## 0  2013      1    1     517.0        2.0        True       False       False
## 1  2013      1    1     533.0        4.0       False       False        True
## 2  2013      1    1     542.0        2.0       False        True       False
## 3  2013      1    1     544.0       -1.0       False        True       False
## 4  2013      1    1     554.0       -6.0       False       False        True
# Recent pandas versions return booleans by default; pass dtype=int to get 0/1 columns
flights_dummies = pd.get_dummies(flights_sub, dtype=int)
# Look at the head of flights_dummies
print(flights_dummies.head())
##    year  month  day  dep_time  dep_delay  origin_EWR  origin_JFK  origin_LGA
## 0  2013      1    1     517.0        2.0           1           0           0
## 1  2013      1    1     533.0        4.0           0           0           1
## 2  2013      1    1     542.0        2.0           0           1           0
## 3  2013      1    1     544.0       -1.0           0           1           0
## 4  2013      1    1     554.0       -6.0           0           0           1
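The outputs above keep one column per category, so all three origin columns are present. The n-1 rule described at the start of this section can be sketched with the drop_first parameter of pd.get_dummies. The small origins_toy table below is hypothetical and only illustrates the column layout.

```python
import pandas as pd

# Hypothetical toy data with one numeric and one categorical column
origins_toy = pd.DataFrame(
    {
        "dep_delay": [2.0, 4.0, 2.0],
        "origin": ["EWR", "LGA", "JFK"],
    }
)

# Default behaviour: one dummy column per category (n columns)
full = pd.get_dummies(origins_toy, dtype=int)

# drop_first=True removes the first category, leaving n-1 columns
reduced = pd.get_dummies(origins_toy, drop_first=True, dtype=int)

print(list(full.columns))
# ['dep_delay', 'origin_EWR', 'origin_JFK', 'origin_LGA']
print(list(reduced.columns))
# ['dep_delay', 'origin_JFK', 'origin_LGA']
```

In the reduced table, a row with 0 in both origin_JFK and origin_LGA is an EWR flight, so no information is lost by dropping that column.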