PROPORTIONAL REDUCTION OF ERROR (P-R-E) MEASURES OF ASSOCIATION:
GENERAL CONSIDERATIONS
Measures of association: are two variables related to one another? We need a summary measure; we can't just reproduce the table in our articles and reports.
General principle of PRE measures: does knowing the value of a case on one variable help you to predict its value on the other, that is, help you as compared to not knowing its value?
General PRE Formula: (error before - error after) / (error before)
So: each specific PRE formula has three elements:
Notice that this measure always varies between 0 and 1. It equals 0 when error before = error after, that is, when knowing the independent variable doesn't help us predict: 0 means no association. It equals 1 when error after = 0, that is, when knowing the independent variable enables us to make a perfect prediction of the dependent variable: 1 means perfect association.
Can there ever be a negative measure? No, because you can't predict worse than by not knowing anything: at worst, you ignore the independent variable and fall back on your original guess.
Can there ever be a measure greater than 100%? No, because that would mean errors after would have to be negative, and there's no such thing as a negative error.
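As a quick illustration, the general formula can be written as a small function. This is only a sketch: the name `pre` and the error counts passed to it below are made up for the example, not taken from any data set.

```python
def pre(error_before, error_after):
    """Proportional reduction of error: (before - after) / before."""
    if error_before <= 0:
        # With no errors to begin with, there is nothing to reduce.
        raise ValueError("error_before must be positive")
    return (error_before - error_after) / error_before


# Hypothetical error counts, chosen only to show the two endpoints.
print(pre(10, 10))  # knowing the independent variable did not help: 0.0
print(pre(10, 0))   # knowing the independent variable removed all error: 1.0
```

The check on `error_before` is a design choice for the sketch: the formula is undefined when there is no error before.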
LAMBDA: A PRE MEASURE FOR NOMINAL VARIABLES
For the specific example of nominal variables, the elements of this formula come out as follows:
This measure is called lambda. There are other (and better) measures of association for nominal variables, but this is the simplest.
Let's apply this to the table I showed last time:
| | Parents lean Democrat | Parents lean Republican | Total |
| Children lean Democrat | 11 (79%) | 7 (26%) | 18 (44%) |
| Children lean Republican | 3 (21%) | 20 (74%) | 23 (56%) |
| Total | 14 (100%) | 27 (100%) | 41 (100%) |
We want to measure to what extent knowing the parents' party leaning (the independent variable) helps us predict the child's party leaning (the dependent variable).
Error before: What's our best guess as to the children's party leaning? Guess the mode (Republican, 23 of the 41 children). How many errors will we make? (18, the children who lean Democrat.)
Error after: Let me take a case. Which party do your parents lean towards? So what should I guess for this person, not looking at him or her specifically, but just guessing generally? (Guess "Democrat" for children of Democrats, and "Republican" for children of Republicans.) How many errors will I make among children of Democrats? (3) How many errors among children of Republicans? (7) Total errors after knowing the independent variable = 10.
Errors before = 18
Errors after = 10
So knowing parents' party leanings has helped me guess the children's leanings.
How much has it helped? Reduction in error = errors before - errors after = 8
Compare this to the original number of errors: 8/18 = .44 = 44%
It has reduced my error by 44% of what it was originally.
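The same count can be done mechanically. The sketch below assumes the table is stored as a nested dictionary indexed as `table[child][parent]`; the function name `lambda_pre` and that layout are my own choices for illustration.

```python
# Counts from the parents/children table: table[child_leaning][parent_leaning]
table = {
    "Democrat":   {"Democrat": 11, "Republican": 7},
    "Republican": {"Democrat": 3,  "Republican": 20},
}


def lambda_pre(table):
    """Return (error_before, error_after, lambda) for a two-way table."""
    parent_categories = list(next(iter(table.values())).keys())
    n = sum(sum(row.values()) for row in table.values())

    # Error before: guess the modal category of the dependent variable
    # for everyone; every case outside the mode is an error.
    row_totals = [sum(row.values()) for row in table.values()]
    error_before = n - max(row_totals)

    # Error after: within each category of the independent variable,
    # guess that column's modal category.
    error_after = 0
    for parent in parent_categories:
        column = [table[child][parent] for child in table]
        error_after += sum(column) - max(column)

    return error_before, error_after, (error_before - error_after) / error_before


before, after, lam = lambda_pre(table)
print(before, after, round(lam, 2))  # 18 10 0.44
```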
What does the figure of 0% mean? (Error before = error after, which implies no help in guessing, which means no association.)
What does the figure of 100% mean? (Error after = 0, which implies that one becomes a perfect predictor, which means perfect association.)
So PRE measures range from 0% to 100%: 0 means no association; 100 means perfect association.
To review: A PRE measure depends on three elements:
We then predict both before and after knowing the independent variable's value, and apply the formula.
Pearson's r² uses linear prediction and defines error as total squared error (variance).
PEARSON'S R-SQUARED EXAMPLE: REGIME CHANGES AND CONFLICT
Sample of 4 countries. Dependent variable: # of wars in last 25 years. Data: 5, 7, 2, 2. Independent variable: average number of government changes per decade. Data: 2, 3, 1, 0.
| Case | X | Y |
| A | 2 | 5 |
| B | 3 | 7 |
| C | 1 | 2 |
| D | 0 | 2 |
| TOTAL | 6 | 16 |
| AVG | 1.5 | 4.0 |
(Source: Hypothetical data)
How to measure error? For Pearson's r² we use the squared deviation of the actual value of the case from the value we predict from that case--the same as we used in computing the standard deviation.
For example, for case C (where Y = 2), if we guess Y as 3 (i.e., predicted Y = 3), what is the error? (Y - predicted Y)² = (2 - 3)² = (-1)² = 1.
Now for the "error before": we don't know X at all, so we have to make the same guess for every case: predicted Y = (best guess).
If we don't know what case we were talking about, what would be our best guess as to the number of wars, where the error is the square of the distance? Answer: the mean is the best guess. Let's show that by computing error for three reasonable guesses: the mode, the median, and the mean.
2 (the mode): 3×3+5×5+0×0+0×0 = 9+25+0+0 = 34
3.5 (the median): 1.5×1.5+3.5×3.5+1.5×1.5+1.5×1.5 = 2.25+12.25+2.25+2.25 = 19 [less error, but still not the best guess]
4 (the mean): 1×1+3×3+2×2+2×2 = 1+9+4+4 = 18 [least error, because the mean is the best guess]. This is the "error before", loosely called "the variance" or "the variance to be explained". (Strictly, it is the sum of squared deviations; dividing it by the number of cases gives the variance.)
So as I said above, this shows that the mean is the best guess. (For this case, at least, but it can be shown mathematically that this applies in all cases.)
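The comparison of the three guesses is easy to reproduce. This is a minimal sketch using Python's standard `statistics` module; the helper name `squared_error` is mine.

```python
from statistics import mean, median, mode

wars = [5, 7, 2, 2]  # Y values for cases A, B, C, D


def squared_error(guess, values):
    """Total squared error if we guess the same number for every case."""
    return sum((v - guess) ** 2 for v in values)


for name, guess in [("mode", mode(wars)),
                    ("median", median(wars)),
                    ("mean", mean(wars))]:
    print(name, guess, squared_error(guess, wars))

# A coarse search over other guesses (0.0, 0.1, ..., 9.0) finds nothing
# that beats the mean on these data.
grid = [g / 10 for g in range(0, 91)]
best = min(grid, key=lambda g: squared_error(g, wars))
print(best)  # 4.0
```

The grid search only checks guesses in steps of 0.1, so it illustrates the claim for this sample rather than proving it in general.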
Let's put this computation into a table, where "predicted Y" is the predicted value of Y. Because the mean is the best predictor, we use the formula predicted Y = mean.
| CASE | X | Y | predicted Y | Y - predicted Y | (Y - predicted Y)² |
| A | 2 | 5 | 4 | 1 | 1 |
| B | 3 | 7 | 4 | 3 | 9 |
| C | 1 | 2 | 4 | -2 | 4 |
| D | 0 | 2 | 4 | -2 | 4 |
| TOTAL | 6 | 16 | 16 | 0 | 18 |
| AVG | 1½ | 4 | 4 |
(Source: Hypothetical data)
Note that the sum of the (Y - predicted Y) column will always be 0; this is a good check. Remember: the mean is the balance point at which the positive and negative deviations just balance out, meaning that their total has to be 0.
Now we try to predict Y by a straight line of the form, Y = a + b×X. This is the so-called "regression line". We want to choose a and b to make our predictions as accurate as possible, and there are straightforward formulas for doing this:
b = SUM[(y-yavg)(x-xavg)] / SUM[(x-xavg)²]
a = yavg-b×xavg
Plugging our data into these formulas, we calculate that the best straight line equation for predicting Y is
predicted Y = 1.3 + 1.8×X.
Here the "1.3" is the predicted number of wars when there are no government changes (i.e., X=0). "1.8" is the predicted number of additional wars for each additional government change per decade.
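For anyone who wants to check the arithmetic, here is a sketch that plugs the four cases into the slope and intercept formulas, with sums taken over all cases in both the numerator and the denominator of b. The function name `fit_line` is an illustrative choice.

```python
x = [2, 3, 1, 0]  # government changes per decade, cases A-D
y = [5, 7, 2, 2]  # wars in the last 25 years, cases A-D


def fit_line(x, y):
    """Least-squares intercept a and slope b for predicted Y = a + b*X."""
    x_avg = sum(x) / len(x)
    y_avg = sum(y) / len(y)
    # Numerator: sum of cross-products of the deviations from the means.
    sum_xy = sum((xi - x_avg) * (yi - y_avg) for xi, yi in zip(x, y))
    # Denominator: sum of squared deviations of X from its mean.
    sum_xx = sum((xi - x_avg) ** 2 for xi in x)
    b = sum_xy / sum_xx
    a = y_avg - b * x_avg
    return a, b


a, b = fit_line(x, y)
print(round(a, 2), round(b, 2))  # 1.3 1.8
```

The results are rounded before printing because floating-point arithmetic can leave a tiny discrepancy in the last digits.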
Using this regression line as our predictor, we get the following table:
| CASE | X | Y | predicted Y | Y - predicted Y | (Y - predicted Y )² |
| A | 2 | 5 | 4.9 | .1 | .01 |
| B | 3 | 7 | 6.7 | .3 | .09 |
| C | 1 | 2 | 3.1 | -1.1 | 1.21 |
| D | 0 | 2 | 1.3 | .7 | .49 |
| TOTAL | 6 | 16 | 16.0 | 0.0 | 1.80 |
| AVG | 1½ | 4 | 4.0 |
(Source: Hypothetical data)
We've done a good job of predicting: we've reduced our error from 18 to 1.8, so our PRE measure of association, known as "Pearson's r-squared" or as "the linear correlation (squared)", is equal to (18-1.8)/18 = .9. (This is an extraordinarily good correlation.)
Sometimes correlations are given as r instead of r²; this is generally deceptive. For example, where r²=.01, r=.10. The r figure looks ten times as good as the r² figure, but it is r² that tells us how much the relationship actually reduces our error.
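Putting the pieces together, r² can be computed directly as a PRE measure: error before uses the mean as the guess, error after uses the regression line. The sketch below refits the line so that it runs on its own; the variable names are illustrative.

```python
x = [2, 3, 1, 0]  # government changes per decade, cases A-D
y = [5, 7, 2, 2]  # wars in the last 25 years, cases A-D

x_avg = sum(x) / len(x)
y_avg = sum(y) / len(y)

# Regression line: predicted Y = a + b*X
b = (sum((xi - x_avg) * (yi - y_avg) for xi, yi in zip(x, y))
     / sum((xi - x_avg) ** 2 for xi in x))
a = y_avg - b * x_avg

# Error before: squared error when we guess the mean for every case.
error_before = sum((yi - y_avg) ** 2 for yi in y)

# Error after: squared error when we guess from the regression line.
error_after = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))

r_squared = (error_before - error_after) / error_before
print(error_before, round(error_after, 2), round(r_squared, 2))  # 18.0 1.8 0.9
```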
Copyright © 2006 Regents of the University of Minnesota. All rights reserved.