####################3
## A Simple linear regression model and plotting a graphic
# Regression is used to analyze relationships between continuous variables. In statistics, simple linear regression is the least squares estimator of a linear regression model with a single predictor variable. It describes the linear relationship between a predictor variable, plotted on the x-axis, and a response variable, plotted on the y-axis. Simple linear regression fits a straight line through the set of n points. The sum of squared residuals of the model is the smallest possible.
# The fitted line has the slope equal to the correlation between y and x corrected by the ratio of standard deviations of these variables. The intercept of the fitted line is such that it passes through the center of mass (x, y) of the data points.
# Suppose there are n data points {yi, xi}, where i = 1, 2, …, n.
# The goal is to find the equation of the straight line y = α + β x
# which would provide a “best” fit for the data points: such a line that minimizes the sum of squared residuals of the linear regression model.
## Data
# As a first example of linear model we use the Orange data set. This data includes measurements of the growth of a sample of five orange trees in a location in California.
# The response is the circumference of the tree at a particular height from the ground (often converted to “diameter at breast height”).
data(Orange)
str(Orange)
# R include many datasets in memory. Different library carry different datasets as well. To know more about your data availability type
data()
## Plotting data
# To see how age and circumference data are related use the plot(x,y) function
plot(Orange$circumference,Orange$age) # plotting the default layout
# Now add some graphical parameters
plot(Orange$circumference, Orange$age , xlab="Age", ylab="Circumference", main="Growth of five orange trees in California" )
## See the differences by the tree sampled
points(Orange$circumference[Orange$Tree==1], Orange$age[Orange$Tree==1], pch=1, col="black")
points(Orange$circumference[Orange$Tree==2], Orange$age[Orange$Tree==2], pch=2, col="blue")
points(Orange$circumference[Orange$Tree==3], Orange$age[Orange$Tree==3], pch=3, col="red")
points(Orange$circumference[Orange$Tree==4], Orange$age[Orange$Tree==4], pch=4, col="magenta")
points(Orange$circumference[Orange$Tree==5], Orange$age[Orange$Tree==5], pch=5, col="green")
lines(Orange$circumference[Orange$Tree==1], Orange$age[Orange$Tree==1], pch=1, col="black")
lines(Orange$circumference[Orange$Tree==2], Orange$age[Orange$Tree==2], pch=2, col="blue")
lines(Orange$circumference[Orange$Tree==3], Orange$age[Orange$Tree==3], pch=3, col="red")
lines(Orange$circumference[Orange$Tree==4], Orange$age[Orange$Tree==4], pch=4, col="magenta")
lines(Orange$circumference[Orange$Tree==5], Orange$age[Orange$Tree==5], pch=5, col="green")
### Simple Linear Regression model
# Now we will fit a simple linear model to circumference as a function of age. To perform linear regression we create a linear model using the lm (linear model) function: Note that the variables are in the opposite order from the plot command, which is plot(x,y), whereas the linear model lm(Y~x) is read as model Age as a function of circumference
orange.lm=lm(Orange$age~Orange$circumference)
# The lm command produces no output at all, but it creates orange.lm as an object
# i.e. a data structure consisting of multiple parts, holding the results of a regression analysis with Age being modeled as a function of Circumference. R needs additional commands to extract desired information about the model or display results graphically.
# To get a summary of the results, enter the command
summary(orange.lm)
# You will vizualize the following informations:
# Call:
# lm(formula = Orange$age ~ Orange$circumference)
#
# Residuals:
# Min 1Q Median 3Q Max
# -317.88 -140.90 -17.20 96.54 471.16
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 16.6036 78.1406 0.212 0.833
# Orange$circumference 7.8160 0.6059 12.900 1.93e-14 ***
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# Residual standard error: 203.1 on 33 degrees of freedom
# Multiple R-squared: 0.8345, Adjusted R-squared: 0.8295
# F-statistic: 166.4 on 1 and 33 DF, p-value: 1.931e-14
#
# R sets up model objects so that the function summary “knows” that orange.lm was created by lm, and produces an appropriate summary of results for an lm object.
# The table of coefficients gives the estimated regression line as age = 16.6036 + 7.8160 × circumference, and associated with each coefficient is the standard error of the estimate the t-statistic value for testing whether the coefficient is nonzero, and the p-value corresponding to the t-statistic. Below the table, the adjusted R-squared gives the estimated fraction of the variance explained by the regression line, and the p-value in the last line is an overall test for significance of the model against the null hypothesis that the response variable is independent of the predictors.
#
# To add the regression line in the previously plotted graphic:
abline(orange.lm, cex=3, col="brown")
# You can get the coefficients by using the coef function:
coef(orange.lm)
# (Intercept) Orange$circumference
# 16.603609 7.815998
# You can also “interrogate” orange.lm directly. Type names(orange.lm) to get a list of the components of orange.lm, and then use the $ symbol to extract components according to their names.
names(orange.lm)
# [1] "coefficients" "residuals" "effects" "rank"
# [5] "fitted.values" "assign" "qr" "df.residual"
# [9] "xlevels" "call" "terms" "model"
# For more information (perhaps more than you want) about orange.lm, use str(orange.lm) (for structure). You can get the regression coefficients this way:
orange.lm$coefficients
# (Intercept) Orange$circumference
# 16.603609 7.815998