Linear Regression Techniques

Univariate Regression

In [1]:
hubble <- read.table("../data/hubble.dat", header=TRUE)
str(hubble)
'data.frame':	24 obs. of  2 variables:
 $ velocity: int  133 664 1794 1594 1473 278 714 882 80 772 ...
 $ distance: num  2 9.16 16.14 17.95 21.88 ...
In [2]:
with(hubble, plot(velocity ~ distance))
In [3]:
lm_hubble <- lm(velocity ~ distance, data=hubble)

lm_hubble  # print(lm_hubble)
summary(lm_hubble)

# str(lm_hubble)
Out[3]:
Call:
lm(formula = velocity ~ distance, data = hubble)

Coefficients:
(Intercept)     distance  
      6.696       76.127  
Out[3]:
Call:
lm(formula = velocity ~ distance, data = hubble)

Residuals:
    Min      1Q  Median      3Q     Max 
-735.14 -132.92  -22.57  171.74  558.61 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)    6.696    126.557   0.053    0.958    
distance      76.127      9.493   8.019 5.68e-08 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 264.7 on 22 degrees of freedom
Multiple R-squared:  0.7451,	Adjusted R-squared:  0.7335 
F-statistic:  64.3 on 1 and 22 DF,  p-value: 5.677e-08
In [4]:
par(mfrow=c(2,2))
plot(lm_hubble)  # Tuckey's diagnostic plots

Multivariate Regression

In [5]:
ozdata = read.table("../data/ozone.csv", header=TRUE, sep=",")
head(ozdata)
Out[5]:
OzoneSolar.RWindTempMonthDay
1411907.46751
23611887252
31214912.67453
41831311.56254
5NANA14.35655
628NA14.96656
In [6]:
lm_oz = lm(Ozone ~ Wind + Temp, data=ozdata)
summary(lm_oz)
Out[6]:
Call:
lm(formula = Ozone ~ Wind + Temp, data = ozdata)

Residuals:
    Min      1Q  Median      3Q     Max 
-41.251 -13.695  -2.856  11.390 100.367 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -71.0332    23.5780  -3.013   0.0032 ** 
Wind         -3.0555     0.6633  -4.607 1.08e-05 ***
Temp          1.8402     0.2500   7.362 3.15e-11 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 21.85 on 113 degrees of freedom
  (37 observations deleted due to missingness)
Multiple R-squared:  0.5687,	Adjusted R-squared:  0.5611 
F-statistic:  74.5 on 2 and 113 DF,  p-value: < 2.2e-16
In [7]:
lm_oz = lm(Ozone ~ Wind + Temp + Solar.R, data=ozdata)
summary(lm_oz)
Out[7]:
Call:
lm(formula = Ozone ~ Wind + Temp + Solar.R, data = ozdata)

Residuals:
    Min      1Q  Median      3Q     Max 
-40.485 -14.219  -3.551  10.097  95.619 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -64.34208   23.05472  -2.791  0.00623 ** 
Wind         -3.33359    0.65441  -5.094 1.52e-06 ***
Temp          1.65209    0.25353   6.516 2.42e-09 ***
Solar.R       0.05982    0.02319   2.580  0.01124 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 21.18 on 107 degrees of freedom
  (42 observations deleted due to missingness)
Multiple R-squared:  0.6059,	Adjusted R-squared:  0.5948 
F-statistic: 54.83 on 3 and 107 DF,  p-value: < 2.2e-16
In [8]:
plot(lm_oz$residuals, type="h")
points(lm_oz$residuals)
grid()

Prediction

All statistical models in R can be utilized for prediction, using the predict() function.

In [9]:
predict(lm_oz, newdata = data.frame(Wind=c(10,11,12), Temp=c(60,61,62), Solar.R=rep(100, 3)))
Out[9]:
1
7.42964167269357
2
5.74814327817356
3
4.06664488365352

Formula Interface

Model formulae will b accepted by many functions in R and R packages.

y ~ .
y ~ x
y ~ x - 1
y ~ x1 + x2 + ...
y ~ x1 * x2
y ~ x + I(x^2)
log(y) ~ x         # y = exp(a + b x)
...

Logistic Regression

In [10]:
plasma <- read.table("../data/plasma.dat", header=TRUE)
str(plasma)
'data.frame':	32 obs. of  3 variables:
 $ fibrinogen: num  2.52 2.56 2.19 2.18 3.41 2.46 3.22 2.21 3.15 2.6 ...
 $ globulin  : int  38 31 33 31 37 36 38 37 39 41 ...
 $ ESR       : Factor w/ 2 levels "ESR < 20","ESR > 20": 1 1 1 1 1 1 1 1 1 1 ...
In [11]:
plot(plasma$ESR)
In [12]:
plasma_glm <- glm(ESR ~ fibrinogen + globulin, data=plasma, family="binomial")
summary(plasma_glm)
Out[12]:
Call:
glm(formula = ESR ~ fibrinogen + globulin, family = "binomial", 
    data = plasma)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-0.9683  -0.6122  -0.3458  -0.2116   2.2636  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)  
(Intercept) -12.7921     5.7963  -2.207   0.0273 *
fibrinogen    1.9104     0.9710   1.967   0.0491 *
globulin      0.1558     0.1195   1.303   0.1925  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 30.885  on 31  degrees of freedom
Residual deviance: 22.971  on 29  degrees of freedom
AIC: 28.971

Number of Fisher Scoring iterations: 5
In [13]:
prob <- predict(plasma_glm, plasma[, 1:2], type="response")
plot(prob ~ plasma$ESR)
In [14]:
plot(plasma$fibri, plasma$globu, pch='x', col=plasma$ESR,
    xlim=c(2,6), ylim=c(25,55))
symbols(plasma$fibri, plasma$globu, circles=prob, add=TRUE)

Generalized Logistic Regression (GLM)

Generalized Additive Regression (GAM)

In [15]:
...
Error in eval(expr, envir, enclos): '...' used in an incorrect context