Simple Linear Regression (SLR) is a statistical method used to model the linear relationship between a dependent variable (response) and a single independent variable (predictor). The goal is to predict the value of the dependent variable based on the independent variable using a straight line.
=====================
Contents:
I. How to calculate the regression coefficients manually?
II. How to interpret the output of lm()?
III. How to simulate lm() in R?
IV. Terminology
V. References
=====================
I. How to calculate the regression coefficients manually?
Note: the formulas below apply only to simple linear regression.
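For observations (x_1, y_1), ..., (x_n, y_n), the least-squares estimates of the slope and intercept are the standard closed-form expressions (with \bar{x} and \bar{y} denoting the sample means):

\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2} = \frac{S_{xy}}{S_{xx}}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}

The fitted regression line is then \hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x.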
=====================
II. How to interpret the output of lm()?
1. R example for simple linear regression
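As an illustration (a minimal sketch using R's built-in cars data set rather than any particular example; any y ~ x fit produces output of the same form):

""""""
# Fit a simple linear regression of stopping distance on speed
# using the built-in cars data set, then print the summary table
# whose parts are interpreted below.
model <- lm(dist ~ speed, data = cars)
summary(model)
""""""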
2. Interpretation
2.1 Residuals are the differences between the observed and fitted values in a regression model, and examining their distribution is a key check of the model's fit. Ideally, residuals are centered around zero, with mean and median close to zero, signifying that the errors are spread evenly. Rough symmetry in their minimum, maximum, 1st quartile (1Q), and 3rd quartile (3Q) values further supports normality; for normally distributed residuals, 1Q and 3Q should be of similar magnitude, each lying at roughly 0.67 residual standard errors from zero. Under approximate normality, large prediction errors become increasingly unlikely the farther the residuals fall from zero, and overestimates and underestimates occur with about equal probability, which supports unbiased and stable predictions.
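Continuing with the model fitted above, these checks can be reproduced directly (a sketch; the comparison against 0.67 times the residual standard error is only a rough rule of thumb, not a formal normality test):

""""""
res <- residuals(model)
summary(res)                      # min, 1Q, median, mean, 3Q, max of the residuals
mean(res)                         # essentially zero by construction in least squares
c(-1, 1) * 0.674 * sigma(model)   # rough 1Q/3Q expected if the residuals are normal
""""""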
2.2 The standard error of an estimated coefficient represents the variability or uncertainty in that estimate. A smaller standard error means the coefficient is estimated with greater precision, but standard errors should always be judged relative to the magnitude of the coefficients themselves. Each t-value is the coefficient estimate divided by its standard error and is used to compute the corresponding p-value. The p-value is the probability of observing a t-statistic at least as extreme as the one obtained, assuming the true coefficient is zero; a small p-value is therefore evidence that the variable helps explain the outcome. In the regression output, significance codes (stars) summarize these p-values, with more stars indicating a smaller p-value for the variable in question.
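For example, the t-values and p-values in the coefficient table can be recomputed from the estimates and standard errors (a sketch, assuming the model object fitted in the example above):

""""""
coefs <- summary(model)$coefficients
t_values <- coefs[, "Estimate"] / coefs[, "Std. Error"]    # reproduces the "t value" column
p_values <- 2 * pt(abs(t_values), df = df.residual(model),
                   lower.tail = FALSE)                      # reproduces the "Pr(>|t|)" column
""""""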
2.3 RSE, R-squared, and the F-statistic
2.3.1 The Residual Standard Error (RSE) represents the standard deviation of the residuals, i.e., the unexplained variation in the model. It indicates how much the actual values deviate from the predicted values on average; the smaller the RSE, the better the model fits the data. (A worked computation of the RSE, R-squared, and F-statistic appears after this list.)
2.3.2 In simple linear regression, the degrees of freedom are n - 2, where n is the number of observations and 2 accounts for the intercept and slope.
2.3.3 The R-squared (R²) value measures the proportion of the variance in the dependent variable that is explained by the model. It ranges from 0 to 1, with higher values indicating a better fit.
2.3.4 Adjusted R-squared is more useful in multiple linear regression as it adjusts for the number of predictors, preventing inflation when unnecessary variables are added.
2.3.5 The F-statistic assesses the overall significance of the model. It compares the explained variance to the unexplained variance, providing a measure of the model's performance.
2.3.6 The p-value corresponding to the F-statistic in the last line tests the null hypothesis that all coefficients (except the intercept) are zero. In simple linear regression, the p-value for the F-statistic is equivalent to the p-value for the slope. In multiple regression, the F-statistic tests the combined significance of all predictors.
2.3.7 If the p-value is below a given threshold (e.g., 0.05), we reject the null hypothesis, indicating that at least one predictor is significantly related to the outcome.
2.3.8 The calculation for the F-statistic's p-value in R can be done as follows:
p_value <- 1 - pf(F_statistic, df1 = k, df2 = n - k - 1)
where k is the number of predictors and n is the number of observations; in simple linear regression, k = 1, so df2 = n - 2.
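Putting 2.3.1 through 2.3.7 together, the quantities reported at the bottom of summary() can be recomputed by hand. A sketch, assuming the simple linear model fitted on the cars data in the example above (one predictor, so k = 1):

""""""
y   <- model$model$dist                    # observed response stored in the fitted model
res <- residuals(model)
n   <- length(y)
k   <- 1                                   # one predictor in simple linear regression

rss <- sum(res^2)                          # residual sum of squares (unexplained variation)
tss <- sum((y - mean(y))^2)                # total sum of squares
rse <- sqrt(rss / (n - k - 1))             # residual standard error on n - 2 degrees of freedom
r_squared   <- 1 - rss / tss               # proportion of variance explained
f_statistic <- ((tss - rss) / k) / (rss / (n - k - 1))
p_value     <- pf(f_statistic, df1 = k, df2 = n - k - 1,
                  lower.tail = FALSE)      # equivalent to 1 - pf(...)
""""""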
=====================
III. How to simulate lm() in R?
1. my_lm() and my_summary() code:
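A minimal sketch of how my_lm() and my_summary() could be implemented for the one-predictor case, using the closed-form least-squares formulas from Section I, so that the demonstration below runs end to end; the internal variable names and the fields of the returned list are illustrative choices, not a standard API.

""""""
# Minimal, illustrative my_lm(): fits y ~ x by ordinary least squares
# using the closed-form formulas for simple linear regression.
my_lm <- function(formula, data) {
  mf <- model.frame(formula, data)
  y  <- mf[[1]]                                  # response (first column of the model frame)
  x  <- mf[[2]]                                  # single predictor
  beta1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)  # slope
  beta0 <- mean(y) - beta1 * mean(x)                                  # intercept
  coefs <- c(beta0, beta1)
  names(coefs) <- c("(Intercept)", names(mf)[2])
  fitted <- beta0 + beta1 * x
  structure(list(coefficients = coefs, residuals = y - fitted,
                 fitted.values = fitted, x = x, y = y),
            class = "my_lm")
}

# Minimal, illustrative my_summary(): reports the main quantities that
# summary.lm() prints, computed from a fitted my_lm object.
my_summary <- function(object) {
  n   <- length(object$y)
  df  <- n - 2                                   # intercept and slope use 2 degrees of freedom
  rss <- sum(object$residuals^2)                 # residual sum of squares
  tss <- sum((object$y - mean(object$y))^2)      # total sum of squares
  rse <- sqrt(rss / df)                          # residual standard error
  sxx <- sum((object$x - mean(object$x))^2)
  se  <- c(rse * sqrt(1 / n + mean(object$x)^2 / sxx),  # std. error of the intercept
           rse / sqrt(sxx))                             # std. error of the slope
  t_values <- object$coefficients / se
  p_values <- 2 * pt(abs(t_values), df = df, lower.tail = FALSE)
  cat("Residuals:\n"); print(summary(object$residuals))
  cat("Coefficients:\n")
  print(cbind(Estimate = object$coefficients, `Std. Error` = se,
              `t value` = t_values, `Pr(>|t|)` = p_values))
  f_stat <- ((tss - rss) / 1) / (rss / df)
  cat("Residual standard error:", rse, "on", df, "degrees of freedom\n")
  cat("Multiple R-squared:", 1 - rss / tss, "\n")
  cat("F-statistic:", f_stat, "on 1 and", df, "DF, p-value:",
      pf(f_stat, df1 = 1, df2 = df, lower.tail = FALSE), "\n")
  invisible(object)
}
""""""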
2. Demonstration of the Entire Calculation Process
""""""
# Sample data: 10 weights and 10 blood pressure readings
weights <- c(60, 65, 70, 72, 75, 80, 85, 88, 90, 95)
blood_pressure <- c(120, 122, 125, 130, 136, 140, 145, 150, 155, 160)
data <- data.frame(weights, blood_pressure)
# Create the linear model
model <- my_lm(blood_pressure ~ weights, data)
# Summarize the model
my_summary(model)
""""""""
=====================
IV. Terminology
1. The Central Limit Theorem (CLT): states that, as the sample size increases, the sampling distribution of the sample mean approaches a normal distribution, regardless of the population's original distribution. Relatedly, with more observations the variance of parameter estimates shrinks.
2. Confidence Intervals: provide a range of values that is likely to contain the true parameter of the model, at a given level of confidence (e.g., 95%).
3. Covariance: measures the direction of the linear relationship between two variables. A positive covariance means that the variables tend to increase together, while a negative covariance means that as one increases, the other decreases.
4. Correlation: measures both the strength and direction of a linear relationship between two variables, normalized between -1 and 1, making it easier to interpret than covariance.
5. Distributions: describe the possible values that a random variable can take and the likelihood of those values occurring. Common distributions include the normal, binomial, and Poisson distributions.
6. Expectation: the expectation of a random variable is its long-run average or mean value, representing the expected outcome of a random process.
7. Explained Sum of Squares (ESS): measures the amount of variation in the dependent variable that is explained by the independent variables in the model.
8. Hypothesis Tests: evaluate whether model components (e.g., coefficients) are statistically significant for prediction. They help compare models and decide whether to reject a null hypothesis.
9. Inference: the process of estimating the parameters of a model from sample data and drawing conclusions about the population.
10. Least Squares Estimation (LSE): a method for estimating the coefficients of a regression model by minimizing the sum of the squared residuals (the differences between observed and predicted values).
11. Maximum Likelihood Estimation (MLE): a method for estimating the parameters of a statistical model by maximizing the likelihood function, which measures how likely the observed data are under the model.
12. Mean Squared Error (MSE): the average of the squared differences between the observed and predicted values. It quantifies the error of a model's predictions.
13. Ordinary Least Squares (OLS): a form of linear regression that estimates the coefficients by minimizing the sum of squared residuals.
14. Probability: quantifies the likelihood of an event occurring and is essential for modeling uncertainty and dealing with noisy data.
15. p-value: the probability of obtaining test results at least as extreme as the observed results, assuming the null hypothesis is true. It helps assess statistical significance.
16. Regression: a statistical method used to predict a continuous dependent variable from one or more independent variables (either continuous or categorical).
17. Residual Standard Error (RSE): a measure of the typical size of the residuals in a regression model, i.e., how far the observed data points deviate from the predicted values on average.
18. Residual Sum of Squares (RSS): the sum of the squared differences between observed and predicted values, representing the total unexplained variation in the model.
19. Sample Statistics: estimates calculated from sample data, used to infer the parameters of the population.
20. Sum of Squared Errors (SSE): also known as RSS, the sum of squared differences between observed and predicted values. It quantifies the error in the model's predictions.
21. Total Sum of Squares (TSS): the total variation in the dependent variable, i.e., the sum of squared differences between the observed values and their mean. It comprises both the explained and unexplained variation in the model (see the decomposition after this list).
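These three sums of squares are linked by a single decomposition (for a least-squares fit with an intercept), which also gives R-squared its interpretation:

TSS = \sum_{i=1}^{n}(y_i - \bar{y})^2 = \underbrace{\sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2}_{ESS} + \underbrace{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}_{RSS}, \qquad R^2 = \frac{ESS}{TSS} = 1 - \frac{RSS}{TSS}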
=====================
V. References
1. Simple linear regression example
https://stattrek.com/regression/regression-example.aspx
2. Stat Trek: AP Statistics Tutorial
https://stattrek.com/tutorials/ap-statistics-tutorial
3. Applied Linear Regression, 3rd Ed., Using R
http://users.stat.umn.edu/~sandy/alr3ed/website/R.html
4. Statistics 36-350: Data Mining
https://www.stat.cmu.edu/~cshalizi/350/