# W7L1RegressionandCorrelation

By Micheal Price, 2014-05-17 16:14

Week 7 Lesson 1 (lesson 13) – Ch 14 pg 629

Linear Regression and Correlation – how two variables are related, e.g. the number of sunny days and the sales volume of ice cream

Linear equation with one variable – straight line

$y = b_0 + b_1 x$, where $x$ is the independent variable, $b_0$ is the y-intercept (a constant) and $b_1$ is the slope (gradient)

Slope – an increase of 1 unit in $x$ results in a change of $b_1$ in $y$

When $b_1 > 0$, the line slopes upwards; when $b_1 < 0$, it slopes downwards; when $b_1 = 0$, it is horizontal
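As a small illustration of the slope interpretation, the sketch below evaluates a straight line at two x values one unit apart. The function name `line` and the coefficient values are made up for this example and do not come from the lesson.

```python
def line(x, b0, b1):
    """Return y on the straight line y = b0 + b1 * x."""
    return b0 + b1 * x


# Hypothetical coefficients, chosen only for illustration.
b0, b1 = 2.0, 0.5

# Moving one unit along x changes y by the slope b1.
rise = line(4, b0, b1) - line(3, b0, b1)
print(rise)
```

With a negative `b1` the same difference would be negative, which is the downward-sloping case.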

Scatter Plot – a graph of two variables, x and y, e.g. age and price of cars

Errors occur when we draw a straight line through a scatter plot. Each error is measured by the vertical distance from the data point to the line.

We can draw many lines through a scatter plot, so we use the least-squares method to choose the ‘best’ line: the one with the smallest sum of squared errors.

This line is called the Regression Line and its equation the Regression Equation
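To make the least-squares criterion concrete, the sketch below scores two candidate lines by their sum of squared vertical errors. The data points, the candidate coefficients and the helper name `sum_squared_error` are all assumptions made up for illustration.

```python
def sum_squared_error(xs, ys, b0, b1):
    """Sum of squared vertical distances from each point to the line b0 + b1*x."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical data, invented for this example.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

# Two hypothetical candidate lines, given as (intercept, slope).
error_a = sum_squared_error(xs, ys, 2.2, 0.6)
error_b = sum_squared_error(xs, ys, 1.0, 1.0)

# The candidate with the smaller squared error is the better fit.
better = "A" if error_a < error_b else "B"
print(error_a, error_b, better)
```

Under the least-squares method, the regression line is the one line for which this score is as small as possible.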

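$S_{xy} = \sum (x_i - \bar{x})(y_i - \bar{y}) = \sum x_i y_i - \frac{(\sum x_i)(\sum y_i)}{n}$

$S_{xx} = \sum (x_i - \bar{x})^2 = \sum x_i^2 - \frac{(\sum x_i)^2}{n}$ and $S_{yy} = \sum (y_i - \bar{y})^2 = \sum y_i^2 - \frac{(\sum y_i)^2}{n}$

$b_1 = S_{xy} / S_{xx}$

$b_0 = \bar{y} - b_1 \bar{x} = \frac{1}{n}\left(\sum y_i - b_1 \sum x_i\right)$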
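The formulas for the slope and intercept can be sketched in a few lines of plain Python, using the computing forms built from sums. The function name `least_squares` and the sample data are assumptions made up for illustration.

```python
def least_squares(xs, ys):
    """Return (b0, b1) for the regression line y = b0 + b1 * x."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    # Computing forms of Sxy and Sxx (sums of products minus a correction term).
    s_xy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
    s_xx = sum(x * x for x in xs) - sum_x ** 2 / n
    b1 = s_xy / s_xx                 # slope
    b0 = (sum_y - b1 * sum_x) / n    # intercept, equal to y_bar - b1 * x_bar
    return b0, b1


# Hypothetical data, invented for this example.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

b0, b1 = least_squares(xs, ys)
print(b0, b1)
```

For this data, Sxy = 6 and Sxx = 10, so the slope is 0.6 and the intercept is 4 - 0.6 × 3 = 2.2.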

Using the Regression equation, we can predict the value of y for a given value of x. x is also called the predictor variable and y is called the response variable

Extrapolation (predicting y for x values outside the range of the observed data) can only be made within acceptable limits; beyond that it becomes inaccurate
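One way to guard against unreasonable extrapolation is to refuse predictions outside a chosen range of x. The sketch below treats the observed x range as the acceptable limit, which is an assumption for illustration. The function name `predict` and the coefficient values are also hypothetical.

```python
def predict(x, b0, b1, x_min, x_max):
    """Predict y from the regression equation, but only inside [x_min, x_max]."""
    if not (x_min <= x <= x_max):
        # Assumption: anything outside the observed x range counts as unsafe.
        raise ValueError(f"x={x} is outside the range [{x_min}, {x_max}]")
    return b0 + b1 * x


# Hypothetical fitted coefficients and observed x range.
b0, b1 = 2.2, 0.6
x_min, x_max = 1, 5

y_hat = predict(3, b0, b1, x_min, x_max)
print(y_hat)
```

Calling `predict(10, b0, b1, x_min, x_max)` raises a `ValueError` instead of silently returning an unreliable value.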

Outliers can distort the Regression line. An outlier that affects the regression is called an influential observation

An influential observation may not be an outlier, though
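A simple way to see whether a point is influential is to fit the line with and without it and compare the slopes. The sketch below does this with made-up data; the helper name `slope` and the extra point (15, 2) are assumptions for illustration.

```python
def slope(xs, ys):
    """Least-squares slope b1 = Sxy / Sxx."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    return s_xy / s_xx


# Hypothetical data; the last point lies far from the others.
xs = [1, 2, 3, 4, 5, 15]
ys = [2, 4, 5, 4, 5, 2]

slope_all = slope(xs, ys)
slope_without = slope(xs[:-1], ys[:-1])  # drop the suspect point

# A large change in slope suggests the dropped point is influential.
print(slope_all, slope_without)
```

Here removing the single far-away point flips the slope from negative to positive, so that point has a strong influence on the regression line.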

Coefficient of Determination

How accurately does x predict y? Two approaches:

Measure the total variation in the observed values of the response variable:

Total Sum of Squares, $SST = \sum (y_i - \bar{y})^2$, and the Sample Variance is SST divided by $n - 1$

Amount of variation of the response variable explained by the regression (distance between the mean and the predicted values of the response variable)

Regression Sum of Squares, $SSR = \sum (\hat{y}_i - \bar{y})^2$, where $\hat{y}_i$ is the predicted value for each observed response $y_i$

If we divide SSR by SST, we get the proportion of variation in the response variable that is explained by the regression

Coefficient of Determination, $r^2 = SSR/SST = \sum (\hat{y}_i - \bar{y})^2 / \sum (y_i - \bar{y})^2$

(always between 0 and 1, where a value towards zero means x is a poor predictor)

Amount of variation of the response variable NOT explained by the regression (distance between the observed and the predicted values of the response variable)

Error Sum of Squares, $SSE = \sum (y_i - \hat{y}_i)^2$, where $\hat{y}_i$ is the predicted value for each observed response $y_i$

Therefore $SST = SSR + SSE$, also known as the Regression Identity

Since $r^2 = SSR/SST$, then $r^2 = (SST - SSE)/SST$ or $1 - SSE/SST$

$SST = S_{yy}$ and $SSR = S_{xy}^2 / S_{xx}$ and $SSE = S_{yy} - S_{xy}^2 / S_{xx}$

$SST = \sum y_i^2 - \frac{(\sum y_i)^2}{n}$ and $SSR = \left[\sum (x_i - \bar{x})(y_i - \bar{y})\right]^2 / \sum (x_i - \bar{x})^2$
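The three sums of squares and the Regression Identity can be checked numerically. The sketch below computes them from the defining formulas. The function name `sums_of_squares` and the data are assumptions made up for illustration.

```python
def sums_of_squares(xs, ys):
    """Return (SST, SSR, SSE) for the least-squares line through the data."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    y_hat = [b0 + b1 * x for x in xs]                     # predicted values
    sst = sum((y - y_bar) ** 2 for y in ys)               # total variation
    ssr = sum((yh - y_bar) ** 2 for yh in y_hat)          # explained variation
    sse = sum((y - yh) ** 2 for y, yh in zip(ys, y_hat))  # unexplained variation
    return sst, ssr, sse


# Hypothetical data, invented for this example.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

sst, ssr, sse = sums_of_squares(xs, ys)
r_squared = ssr / sst
print(sst, ssr, sse, r_squared)
```

For this data, SST = 6, SSR = 3.6 and SSE = 2.4 (up to floating-point rounding), so SST = SSR + SSE and the coefficient of determination is 0.6.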

Linear Correlation Coefficient - denoted by r

$r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}}$, where r always lies between -1 and 1

r is positive when the slope is positive: the deviations $(x_i - \bar{x})$ and $(y_i - \bar{y})$ tend to have the same sign (both positive or both negative), so their products sum to a positive $S_{xy}$

r is negative when the slope is negative: one of $(x_i - \bar{x})$ or $(y_i - \bar{y})$ tends to be negative while the other is positive, so their products sum to a negative $S_{xy}$

r close to -1 or 1 shows a strong linear relationship, with x a good predictor of y

A negative r means x and y are negatively linearly correlated; a positive r means they are positively linearly correlated

Correlation does not mean Causation
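The linear correlation coefficient can be sketched directly from Sxy, Sxx and Syy. The function name `correlation` and the data below are assumptions made up for illustration.

```python
import math


def correlation(xs, ys):
    """Linear correlation coefficient r = Sxy / sqrt(Sxx * Syy)."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)


# Hypothetical data, invented for this example.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

r = correlation(xs, ys)
# Negating every y value reverses the direction of the relationship.
r_flipped = correlation(xs, [-y for y in ys])
print(r, r_flipped)
```

For this data Sxy = 6, Sxx = 10 and Syy = 6, so r = 6/√60, roughly 0.77. Negating the y values gives the same magnitude with a negative sign, and squaring r gives the coefficient of determination.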
