1. Missing values
2. Handling missing values
Exercise: handling missing values
3. Apply
4. Tidy data
Resetting the index
Exercise
5. groupby
Exercise: groupby
1. Missing values
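My note: in pandas, None, NaN, and NA (pd.NA) all count as missing values; Python itself has no NULL. In R, by contrast, these terms mean different things:
NaN
Not a Number. It is the result of computing 0/0 or the square root of a negative number.
NULL
Means nothing is there; the object does not exist.
NA
A missing value, specifically a value that exists but is unknown.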
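Datasets with missing values are very common. In code, write a missing value as None or np.nan. The older aliases np.NaN and np.NAN were removed in NumPy 2.0, so np.nan is the spelling that works across versions.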
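A quick check of the point above, as a minimal sketch: in a numeric column pandas converts None to NaN, so both spellings end up as the same missing value. The names s_none, s_nan, and s_obj are made up for this illustration.

```python
import numpy as np
import pandas as pd

# Two numeric series that differ only in how the missing value is written
s_none = pd.Series([None, 16.0, 3.0])
s_nan = pd.Series([np.nan, 16.0, 3.0])

print(s_none.dtype)          # None was converted to NaN in a float column
print(s_none.equals(s_nan))  # equals() treats NaN in the same position as equal

# In a column of strings, None is still detected as missing
s_obj = pd.Series([None, 'x'])
print(s_obj.isna().tolist())
```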
import pandas as pd
df = pd.DataFrame({'name':['John Smith','Jane Doe','Mary Johnson'],
'treatment_a':[None,16.0,3.0],
'treatment_b':[2,11,1]})
df
## name treatment_a treatment_b
## 0 John Smith NaN 2
## 1 Jane Doe 16.0 11
## 2 Mary Johnson 3.0 1
Check for missing values
pd.isna(df.treatment_a) # pd.isnull does the same thing
## 0 True
## 1 False
## 2 False
## Name: treatment_a, dtype: bool
pd.notnull(df.treatment_a)
## 0 False
## 1 True
## 2 True
## Name: treatment_a, dtype: bool
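My note:
Check whether the column contains any missing value (a single True/False rather than a whole series of booleans)
Count how many missing values there are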
any(pd.isna(df.treatment_a))
## True
sum(pd.isna(df.treatment_a))
## 1
df.treatment_a.isna().value_counts()
## treatment_a
## False 2
## True 1
## Name: count, dtype: int64
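The built-in any() and sum() work here, but the same checks can also be written with pandas methods, which extend naturally to a whole DataFrame. This is just a sketch on the same toy data; the variable names has_missing, n_missing, and missing_per_column are my own.

```python
import pandas as pd

df = pd.DataFrame({'name': ['John Smith', 'Jane Doe', 'Mary Johnson'],
                   'treatment_a': [None, 16.0, 3.0],
                   'treatment_b': [2, 11, 1]})

has_missing = df['treatment_a'].isna().any()  # one boolean for the column
n_missing = df['treatment_a'].isna().sum()    # number of missing values in the column
missing_per_column = df.isna().sum()          # one count per column of the DataFrame

print(has_missing)
print(n_missing)
print(missing_per_column)
```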
2. Handling missing values
.mean()
computes the mean and ignores missing values by default.
a_mean = df['treatment_a'].mean()
a_mean
## np.float64(9.5)
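To see what "ignores by default" means, here is a small sketch comparing the default with skipna=False, which lets the missing value propagate. The variable names are made up for this illustration.

```python
import pandas as pd

treatment_a = pd.Series([None, 16.0, 3.0], name='treatment_a')

mean_skip = treatment_a.mean()                # NaN ignored: (16.0 + 3.0) / 2
mean_strict = treatment_a.mean(skipna=False)  # NaN propagates into the result

print(mean_skip)
print(mean_strict)
```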
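.fillna()
replaces every missing value in a column with the value you provide.
Fill the NA in the treatment_a column with that column's mean and store the result in a new a_fill column (a new column has to be created with bracket syntax; df.a_fill = ... would only set an attribute):
df['a_fill'] = df.treatment_a.fillna(a_mean)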
df
## name treatment_a treatment_b a_fill
## 0 John Smith NaN 2 9.5
## 1 Jane Doe 16.0 11 16.0
## 2 Mary Johnson 3.0 1 3.0
Exercise: handling missing values
The course uses the tips example dataset, which comes from the seaborn package:
import seaborn as sns
tips = sns.load_dataset('tips')
tips.head()
## total_bill tip sex smoker day time size
## 0 16.99 1.01 Female No Sun Dinner 2
## 1 10.34 1.66 Male No Sun Dinner 3
## 2 21.01 3.50 Male No Sun Dinner 3
## 3 23.68 3.31 Male No Sun Dinner 2
## 4 24.59 3.61 Female No Sun Dinner 4
tips.dtypes
## total_bill float64
## tip object
## sex object
## smoker object
## day category
## time category
## size object
## dtype: object
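(When I tried this, though, the course's data turned out to differ from the original tips: the total_bill column used in the example has no missing values in the seaborn version. Never mind, the code below just shows the steps.)
1. Print the rows of the tips DataFrame where total_bill is missing
2. Compute the mean of the total_bill column
3. Use this value to fill the missing values in the 'total_bill' column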
# Print the rows where total_bill is missing
print(tips.loc[____(____)])
# Mean of the total_bill column
tbill_mean = tips['total_bill']____
# Fill in missing total_bill
print(tips['total_bill']____(____))
Answer:
# Print the rows where total_bill is missing
print(tips.loc[pd.isnull(tips['total_bill'])])
# Mean of the total_bill column
tbill_mean = tips['total_bill'].mean()
# Fill in missing total_bill
print(tips['total_bill'].fillna(tbill_mean))
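Since the seaborn version of tips has no missing total_bill values, the answer above prints nothing interesting there. As a rough sketch, the same three steps can be tried on a tiny hand-made frame; tips_demo and its values are invented for illustration and are not the course data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the course's tips data (values are made up)
tips_demo = pd.DataFrame({'total_bill': [16.99, np.nan, 21.01, np.nan],
                          'tip': [1.01, 1.66, 3.50, 3.31]})

# Rows where total_bill is missing
missing_rows = tips_demo.loc[tips_demo['total_bill'].isna()]
print(missing_rows)

# Mean of the total_bill column (missing values are skipped)
tbill_mean = tips_demo['total_bill'].mean()

# Fill in missing total_bill with the mean
filled = tips_demo['total_bill'].fillna(tbill_mean)
print(filled)
```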
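3. Apply
Applies a function to every row or every column, for example to compute the mean.
In R, apply uses MARGIN=1 for rows and MARGIN=2 for columns.
In pandas, DataFrame.apply uses axis=0 (the default) to pass each column to the function, and axis=1 to pass each row.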
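The axis argument is easy to mix up, so here is a minimal sketch on a made-up DataFrame (scores is a hypothetical name). With axis=0 the function receives one column at a time, so there is one result per column; with axis=1 it receives one row at a time.

```python
import pandas as pd

# Hypothetical toy data
scores = pd.DataFrame({'a': [1.0, 2.0, 3.0],
                       'b': [10.0, 20.0, 30.0]})

# axis=0: the function is called once per column
col_means = scores.apply(lambda col: col.mean(), axis=0)

# axis=1: the function is called once per row
row_means = scores.apply(lambda row: row.mean(), axis=1)

print(col_means)
print(row_means)
```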
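4. Tidy data
A very familiar recipe. This is the concept proposed by Hadley Wickham: each variable is a column and each observation is a row.
R has several reshaping functions; the newest are pivot_longer (wide to long) and pivot_wider (long to wide). (There is also melt, which Hadley himself grew unhappy with before writing the newer functions.)
In pandas:
melt: wide to long
pivot_table: long to wide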
import pandas as pd
import numpy as np
df = pd.DataFrame({'name':['John Smith','Jane Doe','Mary Johnson'],
'treatment_a':[None,16.0,3.0],
'treatment_b':[2,11,1]})
df
## name treatment_a treatment_b
## 0 John Smith NaN 2
## 1 Jane Doe 16.0 11
## 2 Mary Johnson 3.0 1
df_melt = pd.melt(df,id_vars='name')
df_melt
## name variable value
## 0 John Smith treatment_a NaN
## 1 Jane Doe treatment_a 16.0
## 2 Mary Johnson treatment_a 3.0
## 3 John Smith treatment_b 2.0
## 4 Jane Doe treatment_b 11.0
## 5 Mary Johnson treatment_b 1.0
df_melt_pivot = pd.pivot_table(df_melt,
index='name',
columns='variable',
values = 'value')
df_melt_pivot
## variable treatment_a treatment_b
## name
## Jane Doe 16.0 11.0
## John Smith NaN 2.0
## Mary Johnson 3.0 1.0
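The main parameters of pivot_table:
index: the column of the old DataFrame that supplies the row labels of the new one
columns: the column of the old DataFrame that supplies the column names of the new one
values: the column of the old DataFrame that supplies the cell values of the new one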
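One detail worth knowing beyond these three parameters: pivot_table aggregates, and its aggfunc defaults to the mean. When an index/columns combination appears more than once, the cell holds the aggregated value rather than either original. A small sketch with invented data (long_df is a hypothetical name):

```python
import pandas as pd

# Hypothetical long data in which ('Jane Doe', 'treatment_a') appears twice
long_df = pd.DataFrame({'name': ['Jane Doe', 'Jane Doe', 'Jane Doe'],
                        'variable': ['treatment_a', 'treatment_a', 'treatment_b'],
                        'value': [16.0, 4.0, 11.0]})

# Default aggfunc is the mean, so the duplicated cell becomes (16.0 + 4.0) / 2
wide_mean = pd.pivot_table(long_df, index='name',
                           columns='variable', values='value')

# A different aggregation can be requested explicitly
wide_max = pd.pivot_table(long_df, index='name',
                          columns='variable', values='value', aggfunc='max')

print(wide_mean)
print(wide_max)
```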
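Resetting the index
This gives back a regular DataFrame: the row labels are replaced by a default integer index, and the former row labels become the first column.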
df_melt_pivot.reset_index()
## variable name treatment_a treatment_b
## 0 Jane Doe 16.0 11.0
## 1 John Smith NaN 2.0
## 2 Mary Johnson 3.0 1.0
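The leftover "variable" in the top-left corner of that output is the name of the columns axis, carried over from the pivot. If it gets in the way, one option (a sketch, not the only way) is to clear it with rename_axis; flat is a made-up variable name.

```python
import pandas as pd

df = pd.DataFrame({'name': ['John Smith', 'Jane Doe', 'Mary Johnson'],
                   'treatment_a': [None, 16.0, 3.0],
                   'treatment_b': [2, 11, 1]})
df_melt = pd.melt(df, id_vars='name')
df_melt_pivot = pd.pivot_table(df_melt, index='name',
                               columns='variable', values='value')

# Reset the index, then drop the name of the columns axis
flat = df_melt_pivot.reset_index().rename_axis(None, axis=1)
print(flat)
```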
Exercise
The airquality DataFrame looks like this:
Ozone Solar.R Wind Temp Month Day
0 41.0 190.0 7.4 67 5 1
1 36.0 118.0 8.0 72 5 2
2 12.0 149.0 12.6 74 5 3
3 18.0 313.0 11.5 62 5 4
4 NaN NaN 14.3 56 5 5
.. ... ... ... ... ... ...
148 30.0 193.0 6.9 70 9 26
149 NaN 145.0 13.2 77 9 27
150 14.0 191.0 14.3 75 9 28
151 18.0 131.0 8.0 76 9 29
152 20.0 223.0 11.5 68 9 30
[153 rows x 6 columns]
1. Melt this DataFrame
2. Pivot the melted DataFrame back to wide format
3. Reset the index
# Melt the airquality DataFrame
airquality_melted = ____(____, id_vars=['Day', 'Month'])
print(airquality_melted)
# Pivot the molten DataFrame
airquality_pivoted = ____(index=['Month', 'Day'], columns='variable', values='value')
print(airquality_pivoted)
# Reset the index
print(airquality_pivoted____)
Answer
# Melt the airquality DataFrame
airquality_melted = pd.melt(airquality, id_vars=['Day', 'Month'])
print(airquality_melted)
# Pivot the molten DataFrame
airquality_pivoted = airquality_melted.pivot_table(index=['Month', 'Day'], columns='variable', values='value')
print(airquality_pivoted)
# Reset the index
print(airquality_pivoted.reset_index())
5. groupby
Used for grouped computations; it corresponds to dplyr::group_by in R.
df_melt
## name variable value
## 0 John Smith treatment_a NaN
## 1 Jane Doe treatment_a 16.0
## 2 Mary Johnson treatment_a 3.0
## 3 John Smith treatment_b 2.0
## 4 Jane Doe treatment_b 11.0
## 5 Mary Johnson treatment_b 1.0
df_melt.groupby('name')['value'].mean()
## name
## Jane Doe 13.5
## John Smith 2.0
## Mary Johnson 2.0
## Name: value, dtype: float64
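Note that John Smith's mean is 2.0 because the NaN is skipped within his group. To make that visible, a sketch like the following computes several statistics at once: count only counts non-missing values, while size counts all rows in the group. The name summary is my own.

```python
import pandas as pd

df = pd.DataFrame({'name': ['John Smith', 'Jane Doe', 'Mary Johnson'],
                   'treatment_a': [None, 16.0, 3.0],
                   'treatment_b': [2, 11, 1]})
df_melt = pd.melt(df, id_vars='name')

# Several aggregations per group in one call
summary = df_melt.groupby('name')['value'].agg(['mean', 'count', 'size'])
print(summary)
```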
Exercise: groupby
1. Compute the mean 'tip' for each 'sex'
2. Compute the mean 'tip' for each combination of 'sex' and 'time'
# Mean tip by sex
print(tips____(____)[____].____)
# Mean tip by sex and time
print(tips____([____, ____])[____].____)
Answer
# Mean tip by sex
print(tips.groupby('sex')['tip'].mean())
# Mean tip by sex and time
print(tips.groupby(['sex', 'time'])['tip'].mean())