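Contents:
1. .info() and .dtypes: check each column's data type
2. .astype(): convert data types
   Exercise: data type conversion
3. .str.strip(): remove leading and trailing whitespace
4. .upper() and .lower(): change case
   Exercise: case conversion
5. category: categorical data
   Exercise: category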
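type() returns the data type of an object.
Calling type() on a DataFrame only tells you that it is a DataFrame; it does not show the data type of each column.
1. .info() and .dtypes: check each column's data type
To find the data type of every column in a DataFrame, use the .info() method or the .dtypes attribute. They play a role similar to R's str() function.
Columns that contain strings are shown in pandas as the object type.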
import pandas as pd
df = pd.DataFrame({
'A' : [1,2,3],
'B' : [4,5,6]})
df
## A B
## 0 1 4
## 1 2 5
## 2 3 6
type(df)
## pandas.core.frame.DataFrame
df.dtypes
## A int64
## B int64
## dtype: object
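The example above only has integer columns. As a minimal sketch, the hypothetical DataFrame below (the column names gene, count and score are made up for illustration) mixes text, integer and float columns so that .dtypes and .info() report a different type per column. The exact label printed for the text column may depend on the pandas version, so the checks use the helpers in pd.api.types rather than a fixed type name.

```python
import pandas as pd

# Hypothetical data: one text column, one integer column, one float column
mixed = pd.DataFrame({
    'gene': ['TP53', 'BRCA1', 'EGFR'],
    'count': [10, 25, 7],
    'score': [0.5, 1.2, 3.3],
})

# .dtypes is an attribute: a Series mapping column name to data type
col_types = mixed.dtypes
print(col_types)

# .info() is a method: it prints types plus non-null counts and memory usage
mixed.info()

# Version-independent checks of what kind of data each column holds
gene_is_text = pd.api.types.is_string_dtype(mixed['gene'])
count_is_int = pd.api.types.is_integer_dtype(mixed['count'])
score_is_float = pd.api.types.is_float_dtype(mixed['score'])
```

Note that .dtypes is written without parentheses because it is an attribute, while .info() needs them because it is a method.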
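2. .astype(): convert data types
Convert to string
To change a column's data type, call the .astype() method on the column and pass the new type. For example, to convert column 'A' to strings:
df['A'] = df['A'].astype('str')
After the conversion, .info() reports the column as object, which here means string.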
df.info()
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 3 entries, 0 to 2
## Data columns (total 2 columns):
## # Column Non-Null Count Dtype
## --- ------ -------------- -----
## 0 A 3 non-null object
## 1 B 3 non-null int64
## dtypes: int64(1), object(1)
## memory usage: 180.0+ bytes
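Two details of .astype() are easy to miss: it returns a new Series instead of changing the column in place, and it raises an error when a value cannot be converted. The sketch below illustrates both with a small hypothetical DataFrame (the values are made up for illustration).

```python
import pandas as pd

# Hypothetical data: numbers stored as text, with one bad value in column B
df = pd.DataFrame({'A': ['1', '2', '3'], 'B': ['4.5', 'x', '6']})

# .astype() returns a new Series; the original column is untouched
converted = df['A'].astype(int)
unchanged = df['A'].tolist()   # still the original strings

# Assign the result back to actually change the column
df['A'] = converted

# 'x' cannot be parsed as a number, so the conversion fails with ValueError
try:
    df['B'].astype(float)
    conversion_failed = False
except ValueError:
    conversion_failed = True

print(df.dtypes)
print('B conversion failed:', conversion_failed)
```

This is why the examples in this article always write df['A'] = df['A'].astype(...) rather than calling .astype() on its own.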
Exercise: data type conversion
The course uses the tips example dataset from the seaborn package. It looks like this:
import seaborn as sns
tips = sns.load_dataset('tips')
tips.head()
## total_bill tip sex smoker day time size
## 0 16.99 1.01 Female No Sun Dinner 2
## 1 10.34 1.66 Male No Sun Dinner 3
## 2 21.01 3.50 Male No Sun Dinner 3
## 3 23.68 3.31 Male No Sun Dinner 2
## 4 24.59 3.61 Female No Sun Dinner 4
tips.dtypes
## total_bill float64
## tip object
## sex object
## smoker object
## day category
## time category
## size object
## dtype: object
Convert the size column to int type.
Convert the tip column to float type.
Look at .dtypes again.
# Convert the size column
tips['size'] = tips['size']____
# Convert the tip column
____ = ____
# Look at the types
print(____)
Answer
# Convert the size column
tips['size'] = tips['size'].astype(int)
# Convert the tip column
tips['tip'] = tips['tip'].astype(float)
# Look at the types
print(tips.dtypes)
3. .str.strip(): remove leading and trailing whitespace
Similar to R's trimws() function.
df = pd.DataFrame({'name':['Daniel ',' Eric',' Julia ']})
df
## name
## 0 Daniel
## 1 Eric
## 2 Julia
df['names_strip'] = df['name'].str.strip()
df
## name names_strip
## 0 Daniel Daniel
## 1 Eric Eric
## 2 Julia Julia
df.names_strip
## 0 Daniel
## 1 Eric
## 2 Julia
## Name: names_strip, dtype: object
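Because whitespace is invisible in printed output, it can help to compare string lengths before and after stripping. The sketch below reuses the same three names and also shows .str.lstrip() and .str.rstrip(), which remove whitespace from one side only.

```python
import pandas as pd

names = pd.Series(['Daniel ', ' Eric', ' Julia '])

both = names.str.strip()     # remove whitespace on both sides
left = names.str.lstrip()    # remove leading whitespace only
right = names.str.rstrip()   # remove trailing whitespace only

# String lengths make the removed spaces visible
lengths_before = names.str.len().tolist()
lengths_after = both.str.len().tolist()

print(lengths_before)  # [7, 5, 7]
print(lengths_after)   # [6, 4, 5]
```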
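4. .upper() and .lower(): change case
pandas lets you apply Python's built-in string methods to a whole column through the .str accessor. There are many such string methods; two of them are .upper() and .lower().
They convert strings to upper case and lower case, respectively.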
# Converts 'col_a' to lower case
df['col_a'].str.lower()
# Converts 'col_b' to upper case
df['col_b'].str.upper()
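The snippet above assumes a DataFrame that already has columns col_a and col_b. A self-contained sketch, using a small hypothetical DataFrame with those column names, could look like this. It also shows that the .str accessor is required: calling .lower() directly on the column raises AttributeError.

```python
import pandas as pd

# Hypothetical data with the column names used in the snippet above
df = pd.DataFrame({'col_a': ['Female', 'Male'], 'col_b': ['No', 'Yes']})

lower_a = df['col_a'].str.lower()
upper_b = df['col_b'].str.upper()

# Without .str, the string method is not available on a Series
try:
    df['col_a'].lower()
    needs_accessor = False
except AttributeError:
    needs_accessor = True

print(lower_a.tolist())  # ['female', 'male']
print(upper_b.tolist())  # ['NO', 'YES']
```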
Exercise: case conversion
The 'sex' and 'smoker' columns look like this:
sex smoker
0 Female No
1 Male No
2 Male No
3 Male No
4 Female No
.. ... ...
239 Male No
240 Female Yes
241 Male Yes
242 Male No
243 Female No
[244 rows x 2 columns]
Convert the 'sex' column to lower case.
Convert the 'smoker' column to upper case.
Print 'sex' and 'smoker' to confirm the conversion worked.
# Convert sex to lower case
tips____ = tips____
# Convert smoker to upper case
tips____ = tips____
# Print the sex and smoker columns
print(tips[['sex', 'smoker']])
Answer
# Convert sex to lower case
tips['sex'] = tips['sex'].str.lower()
# Convert smoker to upper case
tips['smoker'] = tips['smoker'].str.upper()
# Print the sex and smoker columns
print(tips[['sex', 'smoker']])
5. category: categorical data
Similar to factors in R, the category type represents categorical data.
df = pd.DataFrame({'name':['Daniel','Eric','Julia'],
'gender':['Male','Male','Female']})
df.dtypes
## name object
## gender object
## dtype: object
df['gender_cat'] = df['gender'].astype('category')
df.dtypes
## name object
## gender object
## gender_cat category
## dtype: object
Look at which categories exist and the integer code stored for each one. Here Female is stored as 0 and Male as 1.
df['gender_cat'].cat.categories
## Index(['Female', 'Male'], dtype='object')
df.gender_cat.cat.codes
## 0 1
## 1 1
## 2 0
## dtype: int8
Passing 'category' to .astype() converts a column to the category type. Once you have a category column, you can use the .cat.categories attribute to see its categories (called levels in R).
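To make the link between categories and integer codes explicit, the sketch below builds a code-to-label lookup from .cat.categories. The variable name code_to_label is just an illustrative choice. When the categories are inferred from string data, pandas sorts them, which is why Female gets code 0 and Male gets code 1.

```python
import pandas as pd

gender = pd.Series(['Male', 'Male', 'Female']).astype('category')

# The position of each label in .cat.categories is its integer code
code_to_label = dict(enumerate(gender.cat.categories))
codes = gender.cat.codes.tolist()

print(code_to_label)  # {0: 'Female', 1: 'Male'}
print(codes)          # [1, 1, 0]
```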
Another use of category is to keep a meaningful order in the data. For example, it makes sense for "low" to come before "high". Use reorder_categories() to give the column an order.
# Reorder categorical levels
df['column_name'].cat.reorder_categories(['low', 'high'], ordered=True)
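As a rough illustration of what the ordering buys you, the sketch below uses a hypothetical Series of 'low' and 'high' values. By default the categories are sorted alphabetically ('high' before 'low'); after reorder_categories() with ordered=True, sorting, .min() and comparisons follow the declared order instead.

```python
import pandas as pd

# Hypothetical data for illustration
level = pd.Series(['high', 'low', 'high', 'low']).astype('category')
default_order = level.cat.categories.tolist()   # alphabetical

# Declare that 'low' comes before 'high'
ordered = level.cat.reorder_categories(['low', 'high'], ordered=True)

sorted_vals = ordered.sort_values().tolist()    # follows the declared order
smallest = ordered.min()                        # only defined for ordered categories
is_high = (ordered > 'low').tolist()            # comparisons use the order too

print(default_order)  # ['high', 'low']
print(sorted_vals)    # ['low', 'low', 'high', 'high']
print(smallest)       # low
print(is_high)        # [True, False, True, False]
```

Like .astype(), reorder_categories() returns a new Series, so the result has to be assigned to a variable or column to be kept.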
Exercise: category
# Convert the type of time column
tips['time'] = ____
# Use the cat accessor to print the categories in the time column
print(____)
# Convert the type of time column
tips['time'] = tips['time'].astype('category')
# Use the cat accessor to print the categories in the time column
print(tips['time'].cat.categories)
# Order the time category so lunch is before dinner
tips['time2'] = tips____([____, ____], ordered=True)
# Use the cat accessor to print the categories in the time2 column
print(____)
Answer
# Convert the type of time column
tips['time'] = tips['time'].astype('category')
# Use the cat accessor to print the categories in the time column
print(tips.time.cat.categories)
# Order the time category so lunch is before dinner
tips['time2'] = tips['time'].cat.reorder_categories(['Lunch', 'Dinner'], ordered=True)
# Use the cat accessor to print the categories in the time2 column
print(tips['time2'].cat.categories)