Simple Linear Regression (SLR) is a statistical method used to model the linear relationship between a dependent variable (response) and a single independent variable (predictor). The goal is to predict the value of the dependent variable based on the independent variable using a straight line.
=====================
Contents:
I. How to calculate the regression coefficients manually?
II. How to interpret the output of lm()?
III. How to simulate lm() in R?
IV. Terminology
V. References
=====================
I. How to calculate the regression coefficients manually?
Note: the formulas below apply only to simple linear regression.
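For observations (x_1, y_1), ..., (x_n, y_n), the least-squares estimates of the slope and intercept are the standard closed-form expressions (with \bar{x} and \bar{y} denoting the sample means):

\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2} = \frac{S_{xy}}{S_{xx}}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}

The fitted regression line is then \hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x.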
=====================
II. How to interpret the output of lm()?
1. R example for simple linear regression
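As an illustration (a minimal sketch using R's built-in cars data set rather than any particular example; any y ~ x fit produces output of the same form):

""""""
# Fit a simple linear regression of stopping distance on speed
# using the built-in cars data set, then print the summary table
# whose parts are interpreted below.
model <- lm(dist ~ speed, data = cars)
summary(model)
""""""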
2. Interpretation
2.1 Residuals are the differences between the observed and fitted values in a regression model, and examining their distribution is a key check of the model's fit. Ideally, residuals are centered around zero, with mean and median close to zero, signifying that the errors are spread evenly. Rough symmetry in their minimum, maximum, 1st quartile (1Q), and 3rd quartile (3Q) values further supports normality; for normally distributed residuals, 1Q and 3Q should be of similar magnitude, each lying at roughly 0.67 residual standard errors from zero. Under approximate normality, large prediction errors become increasingly unlikely the farther the residuals fall from zero, and overestimates and underestimates occur with about equal probability, which supports unbiased and stable predictions.
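Continuing with the model fitted above, these checks can be reproduced directly (a sketch; the comparison against 0.67 times the residual standard error is only a rough rule of thumb, not a formal normality test):

""""""
res <- residuals(model)
summary(res)                      # min, 1Q, median, mean, 3Q, max of the residuals
mean(res)                         # essentially zero by construction in least squares
c(-1, 1) * 0.674 * sigma(model)   # rough 1Q/3Q expected if the residuals are normal
""""""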
2.2 The standard error of an estimated coefficient represents the variability or uncertainty in that estimate. A smaller standard error means the coefficient is estimated with greater precision, but standard errors should always be judged relative to the magnitude of the coefficients themselves. Each t-value is the coefficient estimate divided by its standard error and is used to compute the corresponding p-value. The p-value is the probability of observing a t-statistic at least as extreme as the one obtained, assuming the true coefficient is zero; a small p-value is therefore evidence that the variable helps explain the outcome. In the regression output, significance codes (stars) summarize these p-values, with more stars indicating a smaller p-value for the variable in question.
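For example, the t-values and p-values in the coefficient table can be recomputed from the estimates and standard errors (a sketch, assuming the model object fitted in the example above):

""""""
coefs <- summary(model)$coefficients
t_values <- coefs[, "Estimate"] / coefs[, "Std. Error"]    # reproduces the "t value" column
p_values <- 2 * pt(abs(t_values), df = df.residual(model),
                   lower.tail = FALSE)                      # reproduces the "Pr(>|t|)" column
""""""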
2.3 RSE, R-squared, and the F-statistic
2.3.1 The Residual Standard Error (RSE) represents the standard deviation of the residuals, i.e., the unexplained variation in the model. It indicates how much the actual values deviate from the predicted values on average; the smaller the RSE, the better the model fits the data. (A worked computation of the RSE, R-squared, and F-statistic appears after this list.)
2.3.2 In simple linear regression, the degrees of freedom are n - 2, where n is the number of observations and 2 accounts for the intercept and slope.
2.3.3 The R-squared (R²) value measures the proportion of the variance in the dependent variable that is explained by the model. It ranges from 0 to 1, with higher values indicating a better fit.
2.3.4 Adjusted R-squared is more useful in multiple linear regression as it adjusts for the number of predictors, preventing inflation when unnecessary variables are added.
2.3.5 The F-statistic assesses the overall significance of the model. It compares the explained variance to the unexplained variance, providing a measure of the model's performance.
2.3.6 The p-value corresponding to the F-statistic in the last line tests the null hypothesis that all coefficients (except the intercept) are zero. In simple linear regression, the p-value for the F-statistic is equivalent to the p-value for the slope. In multiple regression, the F-statistic tests the combined significance of all predictors.
2.3.7 If the p-value is below a given threshold (e.g., 0.05), we reject the null hypothesis, indicating that at least one predictor is significantly related to the outcome.
2.3.8 The calculation for the F-statistic's p-value in R can be done as follows:
p_value <- 1 - pf(F_statistic, df1 = k, df2 = n - k - 1)
where k is the number of predictors and n is the number of observations; in simple linear regression, k = 1, so df2 = n - 2.
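Putting 2.3.1 through 2.3.7 together, the quantities reported at the bottom of summary() can be recomputed by hand. A sketch, assuming the simple linear model fitted on the cars data in the example above (one predictor, so k = 1):

""""""
y   <- model$model$dist                    # observed response stored in the fitted model
res <- residuals(model)
n   <- length(y)
k   <- 1                                   # one predictor in simple linear regression

rss <- sum(res^2)                          # residual sum of squares (unexplained variation)
tss <- sum((y - mean(y))^2)                # total sum of squares
rse <- sqrt(rss / (n - k - 1))             # residual standard error on n - 2 degrees of freedom
r_squared   <- 1 - rss / tss               # proportion of variance explained
f_statistic <- ((tss - rss) / k) / (rss / (n - k - 1))
p_value     <- pf(f_statistic, df1 = k, df2 = n - k - 1,
                  lower.tail = FALSE)      # equivalent to 1 - pf(...)
""""""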
=====================
III. How to simulate lm() in R?
1. my_lm() and my_summary() code:
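A minimal sketch of how my_lm() and my_summary() could be implemented for the one-predictor case, using the closed-form least-squares formulas from Section I, so that the demonstration below runs end to end; the internal variable names and the fields of the returned list are illustrative choices, not a standard API.

""""""
# Minimal, illustrative my_lm(): fits y ~ x by ordinary least squares
# using the closed-form formulas for simple linear regression.
my_lm <- function(formula, data) {
  mf <- model.frame(formula, data)
  y  <- mf[[1]]                                  # response (first column of the model frame)
  x  <- mf[[2]]                                  # single predictor
  beta1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)  # slope
  beta0 <- mean(y) - beta1 * mean(x)                                  # intercept
  coefs <- c(beta0, beta1)
  names(coefs) <- c("(Intercept)", names(mf)[2])
  fitted <- beta0 + beta1 * x
  structure(list(coefficients = coefs, residuals = y - fitted,
                 fitted.values = fitted, x = x, y = y),
            class = "my_lm")
}

# Minimal, illustrative my_summary(): reports the main quantities that
# summary.lm() prints, computed from a fitted my_lm object.
my_summary <- function(object) {
  n   <- length(object$y)
  df  <- n - 2                                   # intercept and slope use 2 degrees of freedom
  rss <- sum(object$residuals^2)                 # residual sum of squares
  tss <- sum((object$y - mean(object$y))^2)      # total sum of squares
  rse <- sqrt(rss / df)                          # residual standard error
  sxx <- sum((object$x - mean(object$x))^2)
  se  <- c(rse * sqrt(1 / n + mean(object$x)^2 / sxx),  # std. error of the intercept
           rse / sqrt(sxx))                             # std. error of the slope
  t_values <- object$coefficients / se
  p_values <- 2 * pt(abs(t_values), df = df, lower.tail = FALSE)
  cat("Residuals:\n"); print(summary(object$residuals))
  cat("Coefficients:\n")
  print(cbind(Estimate = object$coefficients, `Std. Error` = se,
              `t value` = t_values, `Pr(>|t|)` = p_values))
  f_stat <- ((tss - rss) / 1) / (rss / df)
  cat("Residual standard error:", rse, "on", df, "degrees of freedom\n")
  cat("Multiple R-squared:", 1 - rss / tss, "\n")
  cat("F-statistic:", f_stat, "on 1 and", df, "DF, p-value:",
      pf(f_stat, df1 = 1, df2 = df, lower.tail = FALSE), "\n")
  invisible(object)
}
""""""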
2. Demonstration of the Entire Calculation Process
""""""
# Sample data: 10 weights and 10 blood pressure readings
weights <- c(60, 65, 70, 72, 75, 80, 85, 88, 90, 95)
blood_pressure <- c(120, 122, 125, 130, 136, 140, 145, 150, 155, 160)
data <- data.frame(weights, blood_pressure)
# Create the linear model
model <- my_lm(blood_pressure ~ weights, data)
# Summarize the model
my_summary(model)
""""""""
=====================
IV. Terminology
1. The Central Limit Theorem (CLT): states that, as the sample size increases, the sampling distribution of the sample mean approaches a normal distribution, regardless of the population's original distribution. Relatedly, with more observations the variance of parameter estimates shrinks.
2. Confidence Intervals: provide a range of values that is likely to contain the true parameter of the model, at a given level of confidence (e.g., 95%).
3. Covariance: measures the direction of the linear relationship between two variables. A positive covariance means that the variables tend to increase together, while a negative covariance means that as one increases, the other decreases.
4. Correlation: measures both the strength and direction of a linear relationship between two variables, normalized between -1 and 1, making it easier to interpret than covariance.
5. Distributions: describe the possible values that a random variable can take and the likelihood of those values occurring. Common distributions include the normal, binomial, and Poisson distributions.
6. Expectation: the expectation of a random variable is its long-run average or mean value, representing the expected outcome of a random process.
7. Explained Sum of Squares (ESS): measures the amount of variation in the dependent variable that is explained by the independent variables in the model.
8. Hypothesis Tests: evaluate whether model components (e.g., coefficients) are statistically significant for prediction. They help compare models and decide whether to reject a null hypothesis.
9. Inference: the process of estimating the parameters of a model from sample data and drawing conclusions about the population.
10. Least Squares Estimation (LSE): a method for estimating the coefficients of a regression model by minimizing the sum of the squared residuals (the differences between observed and predicted values).
11. Maximum Likelihood Estimation (MLE): a method for estimating the parameters of a statistical model by maximizing the likelihood function, which measures how likely the observed data are under the model.
12. Mean Squared Error (MSE): the average of the squared differences between the observed and predicted values. It quantifies the error of a model's predictions.
13. Ordinary Least Squares (OLS): a form of linear regression that estimates the coefficients by minimizing the sum of squared residuals.
14. Probability: quantifies the likelihood of an event occurring and is essential for modeling uncertainty and dealing with noisy data.
15. p-value: the probability of obtaining test results at least as extreme as the observed results, assuming the null hypothesis is true. It helps assess statistical significance.
16. Regression: a statistical method used to predict a continuous dependent variable from one or more independent variables (either continuous or categorical).
17. Residual Standard Error (RSE): a measure of the typical size of the residuals in a regression model, i.e., how far the observed data points deviate from the predicted values on average.
18. Residual Sum of Squares (RSS): the sum of the squared differences between observed and predicted values, representing the total unexplained variation in the model.
19. Sample Statistics: estimates calculated from sample data, used to infer the parameters of the population.
20. Sum of Squared Errors (SSE): also known as RSS, the sum of squared differences between observed and predicted values. It quantifies the error in the model's predictions.
21. Total Sum of Squares (TSS): the total variation in the dependent variable, i.e., the sum of squared differences between the observed values and their mean. It comprises both the explained and unexplained variation in the model (see the decomposition after this list).
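These three sums of squares are linked by a single decomposition (for a least-squares fit with an intercept), which also gives R-squared its interpretation:

TSS = \sum_{i=1}^{n}(y_i - \bar{y})^2 = \underbrace{\sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2}_{ESS} + \underbrace{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}_{RSS}, \qquad R^2 = \frac{ESS}{TSS} = 1 - \frac{RSS}{TSS}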
=====================
V. References
1. Simple linear regression example
https://stattrek.com/regression/regression-example.aspx
2. Stat Trek: AP Statistics Tutorial
https://stattrek.com/tutorials/ap-statistics-tutorial
3. Applied Linear Regression, 3rd Ed., Using R
http://users.stat.umn.edu/~sandy/alr3ed/website/R.html
4. Statistics 36-350: Data Mining
https://www.stat.cmu.edu/~cshalizi/350/