💡 Focused on using R in 🩺 biomedicine
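To make learning easier, I have recorded companion videos and posted them on Bilibili (my Bilibili account: 阿越就是我). They are free to watch; copy the following URL into your browser: https://space.bilibili.com/42460432/channel/collectiondetail?sid=3740949
Today I will introduce some commonly used functions for numeric processing and string processing. The demo data can be downloaded for free from the files of the follower QQ group.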
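In this issue:
Numeric processing
Calculation functions
Probability functions (optional)
String processing
Numeric processing
Calculation functions
Common calculation functions: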
x <- c(1,2,3,4,5)
sum(x) # sum
## [1] 15
mean(x) # mean
## [1] 3
median(x) # median
## [1] 3
sd(x) # standard deviation
## [1] 1.581139
var(x) # variance
## [1] 2.5
mad(x) # median absolute deviation (scaled by the constant 1.4826 by default)
## [1] 1.4826
quantile(x,probs = c(0.05,0.95)) # quantiles
## 5% 95%
## 1.2 4.8
range(x) # range (minimum and maximum)
## [1] 1 5
min(x) # minimum
## [1] 1
max(x) # maximum
## [1] 5
scale(x) # center and scale (standardize)
## [,1]
## [1,] -1.2649111
## [2,] -0.6324555
## [3,] 0.0000000
## [4,] 0.6324555
## [5,] 1.2649111
## attr(,"scaled:center")
## [1] 3
## attr(,"scaled:scale")
## [1] 1.581139
# ?scale
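Two details about these functions are easy to miss: most of them return NA as soon as the input contains a missing value, and scale() with its default arguments is simply "subtract the mean, divide by the standard deviation". The sketch below illustrates both. The vector days is typed in by hand for illustration (it uses the same six values as the first rows of days_to_last_follow_up in the TCGA data used later in this article), and the names z_manual and z_scale are my own.

```r
# Hand-typed example vector containing missing values (NA)
days <- c(NA, NA, 1644, 501, 660, 3247)

mean(days)                 # NA, because missing values propagate
mean(days, na.rm = TRUE)   # 1513, NA values are dropped before averaging
sum(is.na(days))           # 2, the number of missing values

# scale() with default arguments matches manual standardization
x <- c(1, 2, 3, 4, 5)
z_manual <- (x - mean(x)) / sd(x)
z_scale <- as.numeric(scale(x))   # drop the matrix shape and the attributes
all.equal(z_manual, z_scale)      # TRUE
```

The same na.rm = TRUE argument is accepted by sum(), median(), sd(), var(), mad(), quantile(), range(), min() and max().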
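Probability functions (optional)
The name of each probability function has two parts: a prefix that says what to compute, followed by an abbreviated distribution name (for example norm for the normal distribution, unif for the uniform distribution). The prefixes are:
d: density function (density)
p: distribution function (distribution)
q: quantile function (quantile)
r: random number generation (random)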
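To make the naming scheme concrete, here is a minimal sketch that applies all four prefixes to the standard normal distribution, plus one call for the uniform distribution. The input values (0, 1.96, 0.975, 1.5, 45) are arbitrary choices for illustration.

```r
dnorm(0, mean = 0, sd = 1)       # density at x = 0, about 0.3989
pnorm(1.96, mean = 0, sd = 1)    # P(X <= 1.96), about 0.975
qnorm(0.975, mean = 0, sd = 1)   # the 97.5% quantile, about 1.96
rnorm(3, mean = 0, sd = 1)       # three random draws, different on every run

# p and q are inverses of each other
p <- pnorm(1.5)
back <- qnorm(p)
all.equal(back, 1.5)             # TRUE

# The same prefixes work with other distributions, e.g. the uniform one
u <- punif(45, min = 10, max = 80)   # (45 - 10) / (80 - 10) = 0.5
u
```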
Random draws from a normal distribution:
rnorm(20, mean = 0, sd = 1)
## [1] 1.0596805 -0.6468369 -0.1423821 2.2077962 -0.7620366 -0.4443315
## [7] -1.0475048 0.1295400 0.3159857 0.3337168 0.9895305 -2.0335491
## [13] -0.2772392 -0.1436860 -2.0125462 0.1289014 0.1686231 0.3285903
## [19] 0.5307388 -1.0927376
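Density of the normal distribution (note that the first argument here is the point x = 20 at which the density is evaluated, not a sample size):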
dnorm(20, mean = 0, sd = 1)
## [1] 5.520948e-88
Random draws from a uniform distribution:
runif(20, min = 10, max = 80)
## [1] 62.73452 17.97619 42.46156 73.27239 44.05626 41.89920 61.33796 10.41209
## [9] 53.93925 21.08192 37.55603 67.29889 70.35501 30.61270 49.70592 65.78216
## [17] 40.11610 47.11659 47.45472 39.35435
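Random results normally cannot be reproduced, but setting a random seed makes them reproducible (which is why randomness on a computer is really pseudo-randomness):
# Set the random seed so that your results match mine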
set.seed(123)
rnorm(20, mean = 0, sd = 1)
## [1] -0.56047565 -0.23017749 1.55870831 0.07050839 0.12928774 1.71506499
## [7] 0.46091621 -1.26506123 -0.68685285 -0.44566197 1.22408180 0.35981383
## [13] 0.40077145 0.11068272 -0.55584113 1.78691314 0.49785048 -1.96661716
## [19] 0.70135590 -0.47279141
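A quick way to convince yourself that the seed controls the result is to draw twice with the same seed and compare. This is a small sketch; the seed value 2024 and the names a, b and later are arbitrary choices of mine.

```r
set.seed(2024)                      # any integer works as a seed
a <- runif(5, min = 10, max = 80)

set.seed(2024)                      # reset to the same seed
b <- runif(5, min = 10, max = 80)

identical(a, b)                     # TRUE: same seed, same numbers

# Drawing again without resetting the seed continues the sequence
later <- runif(5, min = 10, max = 80)
identical(a, later)                 # FALSE
```

The seed has to be set immediately before the random call you want to reproduce, because every random draw advances the generator.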
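String processing
Common string-processing functions:
We use the TCGA breast cancer data imported in Chapter 5 as an example. First read in the data:
df <- read.csv("datasets/brca_clin.csv", header = TRUE)
# Check the basic structure of the data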
dim(df)
## [1] 20 9
str(df)
## 'data.frame': 20 obs. of 9 variables:
## $ barcode : chr "TCGA-BH-A1FC-11A-32R-A13Q-07" "TCGA-AC-A2FM-11B-32R-A19W-07" "TCGA-BH-A0DO-11A-22R-A12D-07" "TCGA-E2-A1BC-11A-32R-A12P-07" ...
## $ patient : chr "TCGA-BH-A1FC" "TCGA-AC-A2FM" "TCGA-BH-A0DO" "TCGA-E2-A1BC" ...
## $ sample : chr "TCGA-BH-A1FC-11A" "TCGA-AC-A2FM-11B" "TCGA-BH-A0DO-11A" "TCGA-E2-A1BC-11A" ...
## $ sample_type : chr "Solid Tissue Normal" "Solid Tissue Normal" "Solid Tissue Normal" "Solid Tissue Normal" ...
## $ initial_weight : int 260 220 130 260 200 60 320 310 100 250 ...
## $ ajcc_pathologic_stage : chr "Stage IIA" "Stage IIB" "Stage I" "Stage IA" ...
## $ days_to_last_follow_up: int NA NA 1644 501 660 3247 NA NA 1876 707 ...
## $ gender : chr "female" "female" "female" "female" ...
## $ age_at_index : int 78 87 78 63 41 59 60 39 54 51 ...
head(df)
## barcode patient sample
## 1 TCGA-BH-A1FC-11A-32R-A13Q-07 TCGA-BH-A1FC TCGA-BH-A1FC-11A
## 2 TCGA-AC-A2FM-11B-32R-A19W-07 TCGA-AC-A2FM TCGA-AC-A2FM-11B
## 3 TCGA-BH-A0DO-11A-22R-A12D-07 TCGA-BH-A0DO TCGA-BH-A0DO-11A
## 4 TCGA-E2-A1BC-11A-32R-A12P-07 TCGA-E2-A1BC TCGA-E2-A1BC-11A
## 5 TCGA-BH-A0BJ-11A-23R-A089-07 TCGA-BH-A0BJ TCGA-BH-A0BJ-11A
## 6 TCGA-E2-A1LH-11A-22R-A14D-07 TCGA-E2-A1LH TCGA-E2-A1LH-11A
## sample_type initial_weight ajcc_pathologic_stage
## 1 Solid Tissue Normal 260 Stage IIA
## 2 Solid Tissue Normal 220 Stage IIB
## 3 Solid Tissue Normal 130 Stage I
## 4 Solid Tissue Normal 260 Stage IA
## 5 Solid Tissue Normal 200 Stage IIB
## 6 Solid Tissue Normal 60 Stage I
## days_to_last_follow_up gender age_at_index
## 1 NA female 78
## 2 NA female 87
## 3 1644 female 78
## 4 501 female 63
## 5 660 female 41
## 6 3247 female 59
Counting the number of characters:
x <- df$barcode[1:3]
x
## [1] "TCGA-BH-A1FC-11A-32R-A13Q-07" "TCGA-AC-A2FM-11B-32R-A19W-07"
## [3] "TCGA-BH-A0DO-11A-22R-A12D-07"
nchar(x)
## [1] 28 28 28
Extracting and replacing substrings:
x <- df$barcode[1]
x
## [1] "TCGA-BH-A1FC-11A-32R-A13Q-07"
substr(x, start = 1, stop = 15)
## [1] "TCGA-BH-A1FC-11"
substr(x, start = 1, stop = 3) <- "ggg"
x
## [1] "gggA-BH-A1FC-11A-32R-A13Q-07"
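substr() is vectorized, so it can cut several barcodes at once. As a sketch, the code below rebuilds the patient and sample columns shown in the data above from the barcode alone. The three barcodes are typed in by hand so the example runs without the CSV file; the variable names and the character positions (12, 16, 14 to 15) are read off these example barcodes, not taken from an official specification.

```r
barcode <- c("TCGA-BH-A1FC-11A-32R-A13Q-07",
             "TCGA-AC-A2FM-11B-32R-A19W-07",
             "TCGA-BH-A0DO-11A-22R-A12D-07")

# The first 12 characters match the patient column
patient <- substr(barcode, start = 1, stop = 12)
patient     # "TCGA-BH-A1FC" "TCGA-AC-A2FM" "TCGA-BH-A0DO"

# The first 16 characters match the sample column
sample_id <- substr(barcode, start = 1, stop = 16)
sample_id   # "TCGA-BH-A1FC-11A" "TCGA-AC-A2FM-11B" "TCGA-BH-A0DO-11A"

# Characters 14 to 15 hold a two-digit code; all three rows here are
# "Solid Tissue Normal" samples in the sample_type column
type_code <- substr(barcode, start = 14, stop = 15)
type_code   # "11" "11" "11"
```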
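Finding strings (grep() returns the positions of the matching elements, grepl() returns a logical vector):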
x <- c(df$barcode[1:3], "hahahaha")
x
## [1] "TCGA-BH-A1FC-11A-32R-A13Q-07" "TCGA-AC-A2FM-11B-32R-A19W-07"
## [3] "TCGA-BH-A0DO-11A-22R-A12D-07" "hahahaha"
grep("TCGA", x)
## [1] 1 2 3
grepl("TCGA", x)
## [1] TRUE TRUE TRUE FALSE
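In practice you usually want the matching elements themselves rather than their positions. Here is a minimal sketch of the common patterns, using the same four strings typed in by hand; the names hits, kept, dropped, n_bh and any_case are mine.

```r
x <- c("TCGA-BH-A1FC-11A-32R-A13Q-07",
       "TCGA-AC-A2FM-11B-32R-A19W-07",
       "TCGA-BH-A0DO-11A-22R-A12D-07",
       "hahahaha")

hits <- grep("TCGA", x, value = TRUE)   # return the matching strings
kept <- x[grepl("TCGA", x)]             # same result via logical indexing
dropped <- x[!grepl("TCGA", x)]         # "hahahaha"

n_bh <- sum(grepl("-BH-", x))           # 2, counts the matching elements

# Matching is case-sensitive unless ignore.case = TRUE is given
any_case <- grepl("tcga", x, ignore.case = TRUE)
any_case                                # TRUE TRUE TRUE FALSE
```

The logical vector from grepl() is handy for filtering data frame rows, for example df[grepl("-BH-", df$barcode), ].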
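Search and replace, turning hyphens into underscores (sub() replaces only the first match in each string, gsub() replaces all of them):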
x <- df$barcode[1:5]
x
## [1] "TCGA-BH-A1FC-11A-32R-A13Q-07" "TCGA-AC-A2FM-11B-32R-A19W-07"
## [3] "TCGA-BH-A0DO-11A-22R-A12D-07" "TCGA-E2-A1BC-11A-32R-A12P-07"
## [5] "TCGA-BH-A0BJ-11A-23R-A089-07"
sub("-","_",x)
## [1] "TCGA_BH-A1FC-11A-32R-A13Q-07" "TCGA_AC-A2FM-11B-32R-A19W-07"
## [3] "TCGA_BH-A0DO-11A-22R-A12D-07" "TCGA_E2-A1BC-11A-32R-A12P-07"
## [5] "TCGA_BH-A0BJ-11A-23R-A089-07"
gsub("-","_",x)
## [1] "TCGA_BH_A1FC_11A_32R_A13Q_07" "TCGA_AC_A2FM_11B_32R_A19W_07"
## [3] "TCGA_BH_A0DO_11A_22R_A12D_07" "TCGA_E2_A1BC_11A_32R_A12P_07"
## [5] "TCGA_BH_A0BJ_11A_23R_A089_07"
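One thing to keep in mind: the first argument of sub() and gsub() is a regular expression, not plain text. The sketch below shows where that helps and where it bites. The stage vector is typed in by hand from the ajcc_pathologic_stage values shown above; the string "a.b.c" and the variable names are made up for illustration.

```r
stage <- c("Stage IIA", "Stage IIB", "Stage I", "Stage IA")

# "^" anchors the pattern to the start of the string
stage_short <- sub("^Stage ", "", stage)
stage_short                                    # "IIA" "IIB" "I" "IA"

# In a regular expression "." means "any character" ...
wrong <- gsub(".", "_", "a.b.c")
wrong                                          # "_____"

# ... so use fixed = TRUE to treat the pattern as literal text
right <- gsub(".", "_", "a.b.c", fixed = TRUE)
right                                          # "a_b_c"
```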
Splitting strings (the result is a list with one element per input string):
x <- df$barcode[1]
x
## [1] "TCGA-BH-A1FC-11A-32R-A13Q-07"
strsplit(x, split = "-")
## [[1]]
## [1] "TCGA" "BH" "A1FC" "11A" "32R" "A13Q" "07"
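Because strsplit() returns a list, a common follow-up is to pull the same field out of every element. Here is a small sketch with sapply(), again with hand-typed barcodes; the names parts, field4 and n_fields are mine.

```r
barcode <- c("TCGA-BH-A1FC-11A-32R-A13Q-07",
             "TCGA-AC-A2FM-11B-32R-A19W-07",
             "TCGA-BH-A0DO-11A-22R-A12D-07")

parts <- strsplit(barcode, split = "-")   # a list of three character vectors

# Take the 4th field of every barcode; `[` is the indexing function
field4 <- sapply(parts, `[`, 4)
field4                                    # "11A" "11B" "11A"

# Number of fields in each barcode
n_fields <- sapply(parts, length)
n_fields                                  # 7 7 7
```

Like the pattern in sub(), the split argument is a regular expression, so add fixed = TRUE when splitting on characters such as "." or "|".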
Joining strings (paste0() is shorthand for paste() with sep = ""):
paste("haha",1:3,sep = "")
## [1] "haha1" "haha2" "haha3"
paste("haha",1:3,sep = " ")
## [1] "haha 1" "haha 2" "haha 3"
paste("haha",1:3,sep = "OOO")
## [1] "hahaOOO1" "hahaOOO2" "hahaOOO3"
paste("Today is",date())
## [1] "Today is Sat Sep 21 16:12:47 2024"
paste0("haha",1:3)
## [1] "haha1" "haha2" "haha3"
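The calls above all join element by element and return a vector. To glue the elements of one vector into a single string, paste() also has a collapse argument. A minimal sketch, with hand-typed input and variable names of my own:

```r
barcode <- "TCGA-BH-A1FC-11A-32R-A13Q-07"
fields <- strsplit(barcode, split = "-")[[1]]

# sep joins across arguments, collapse joins within one vector
patient <- paste(fields[1:3], collapse = "-")
patient                                   # "TCGA-BH-A1FC"

# Both can be combined: build the pieces first, then collapse them
label <- paste0("sample_", 1:3, collapse = ", ")
label                                     # "sample_1, sample_2, sample_3"
```

This makes paste(..., collapse = ) the natural counterpart of strsplit().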
Converting between upper and lower case:
x <- c("asdf","asdf","ghb")
toupper(x)
## [1] "ASDF" "ASDF" "GHB"
x <- c("SADFf","FAFFaa")
tolower(x)
## [1] "sadff" "faffaa"
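Case conversion is mostly useful for cleaning up inconsistently typed categories before comparing or counting them. The sketch below uses a made-up gender vector (the real column in this dataset is already consistently lower case); the names is_female and counts are mine.

```r
# Hypothetical, inconsistently typed input
gender <- c("female", "Female", "FEMALE", "male")

# Normalize the case first, then compare
is_female <- tolower(gender) == "female"
is_female                       # TRUE TRUE TRUE FALSE

# Counting after normalization merges the variants into one category
counts <- table(tolower(gender))
counts[["female"]]              # 3
```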
Note
For more advanced string processing, have a look at the R package stringr and at regular expressions. Both are very powerful!
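As a small taste of regular expressions, here is a sketch that uses only base R, so nothing needs to be installed. The pattern is my own rough description of how the example barcodes in this article look; it is an assumption for illustration, not an official TCGA format definition.

```r
x <- c("TCGA-BH-A1FC-11A-32R-A13Q-07",
       "TCGA-AC-A2FM-11B-32R-A19W-07",
       "TCGA-BH-A0DO-11A-22R-A12D-07",
       "hahahaha")

# "TCGA", then a 2-character field, then a 4-character field
# [A-Z0-9] means one upper-case letter or digit, {n} repeats it n times
pattern <- "^TCGA-[A-Z0-9]{2}-[A-Z0-9]{4}"

looks_ok <- grepl(pattern, x)
looks_ok                                   # TRUE TRUE TRUE FALSE

# regexpr() locates the match, regmatches() extracts the matched text;
# strings without a match are dropped from the result
ids <- regmatches(x, regexpr(pattern, x))
ids                                        # "TCGA-BH-A1FC" "TCGA-AC-A2FM" "TCGA-BH-A0DO"
```

stringr wraps the same ideas in more consistently named functions, which is one reason it is worth learning.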
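Contact us, follow us
Free QQ discussion group 1: 613637742
Free QQ discussion group 2: 608720452
Contact details are available under "About the author" in the official account's message interface
Same account name on Zhihu, CSDN and Jianshu
Bilibili: 阿越就是我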