Seeing History
Correlation
Correlation has a common meaning: how well things match up with each other. In statistics, the concept acquires greater precision, and also greater interpretive power. It works like this: Of two measurable quantities associated with n different objects, correlation tells us how closely changes in one quantity are predictable from changes in the other quantity. If we measure the height and the annual income of 18 people, we can calculate how far income is a function of height in that sample. Take another sample:
You are the Mayor of Islington. The Mothers of Islington come to you in a body, demanding that more open space be created, since out of 18 districts of London, Islington ranks at the bottom for percentage of open space (variable x, Sp), but near the top for percentage of accidents that are accidents to children (variable y, Ac). They thrust in your face these figures, for Islington and the other 17 districts.
1                2       3
District         x (Sp)  y (Ac)
Bermondsey       0.050   0.463
Camberwell       0.052   0.336
Deptford         0.022   0.434
Finsbury         0.020   0.388
Fulham           0.042   0.422
Hammersmith      0.122   0.283
Hampstead        0.148   0.171
Islington        0.013   0.429
Marylebone       0.236   0.178
Paddington       0.072   0.336
Poplar           0.045   0.370
Shoreditch       0.014   0.400
Southwark        0.031   0.333
Stepney          0.025   0.374
Stoke Newington  0.063   0.308
Wandsworth       0.146   0.238
Westminster      0.275   0.108
Woolwich         0.070   0.382
How to make sense of them? You mentally calculate the average value, or mean, of the x column, and then subtract that average (mx, the mean of x) from each individual value of x to make the high and low values more obvious. The result is a column of (x-mx) figures. You then do the same to y. You have:
1                2       3       4        5
District         x (Sp)  y (Ac)  (x-mx)   (y-my)
Bermondsey       0.050   0.463   -0.030   +0.132
Camberwell       0.052   0.336   -0.028   +0.005
Deptford         0.022   0.434   -0.058   +0.103
Finsbury         0.020   0.388   -0.060   +0.057
Fulham           0.042   0.422   -0.038   +0.091
Hammersmith      0.122   0.283   +0.042   -0.048
Hampstead        0.148   0.171   +0.068   -0.160
Islington        0.013   0.429   -0.067   +0.098
Marylebone       0.236   0.178   +0.156   -0.153
Paddington       0.072   0.336   -0.008   +0.005
Poplar           0.045   0.370   -0.035   +0.039
Shoreditch       0.014   0.400   -0.066   +0.069
Southwark        0.031   0.333   -0.049   +0.002
Stepney          0.025   0.374   -0.055   +0.043
Stoke Newington  0.063   0.308   -0.017   -0.023
Wandsworth       0.146   0.238   +0.066   -0.093
Westminster      0.275   0.108   +0.195   -0.223
Woolwich         0.070   0.382   -0.010   +0.051
SUM (S)          1.446   5.953
mean (S/18)      0.080   0.331
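For readers who would rather not do this mentally, the subtraction step can be sketched in a few lines of Python. This is only an illustration: the names `districts`, `mean`, `dx`, `dy`, and `opposite` are invented here, and the sketch uses the unrounded means rather than the three-decimal means shown in the table, which makes no difference at three decimal places for these figures.

```python
# Figures copied from the table above:
# district -> (open space x, child-accident share y).
districts = {
    "Bermondsey": (0.050, 0.463),
    "Camberwell": (0.052, 0.336),
    "Deptford": (0.022, 0.434),
    "Finsbury": (0.020, 0.388),
    "Fulham": (0.042, 0.422),
    "Hammersmith": (0.122, 0.283),
    "Hampstead": (0.148, 0.171),
    "Islington": (0.013, 0.429),
    "Marylebone": (0.236, 0.178),
    "Paddington": (0.072, 0.336),
    "Poplar": (0.045, 0.370),
    "Shoreditch": (0.014, 0.400),
    "Southwark": (0.031, 0.333),
    "Stepney": (0.025, 0.374),
    "Stoke Newington": (0.063, 0.308),
    "Wandsworth": (0.146, 0.238),
    "Westminster": (0.275, 0.108),
    "Woolwich": (0.070, 0.382),
}


def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)


xs = [x for x, _ in districts.values()]
ys = [y for _, y in districts.values()]
mx, my = mean(xs), mean(ys)

# Deviation of each value from its column mean: (x-mx) and (y-my).
dx = [x - mx for x in xs]
dy = [y - my for y in ys]

for name, ddx, ddy in zip(districts, dx, dy):
    print(f"{name:<16} {ddx:+.3f} {ddy:+.3f}")

# Count the districts where the two deviations have opposite signs.
opposite = sum(1 for ddx, ddy in zip(dx, dy) if ddx * ddy < 0)
print(f"{opposite} of {len(dx)} districts have deviations of opposite sign")
```

The final count puts a number on the zigzag: the more districts whose two deviations point in opposite directions, the stronger the hint of an inverse relationship.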
Hmm, you say to yourself, look at that zigzag pattern. The minus or below-average values of x do seem to match the plus or above-average values of y. There is clearly something going on here.
We now want to define the relationship more exactly. To do this, we could calculate what is called the correlation coefficient, which is not intuitively meaningful, or we could go for the coefficient of determination (D), which is intuitively meaningful. Let's go for D. To get it, we first multiply together each pair of (x-mx) and (y-my) terms, and put the result in column 6. Then we square each (x-mx) term and each (y-my) term separately, and put the results in columns 7 and 8 respectively. The result, accomplished by an unobtrusive secretary while you distract the Mothers of Islington with small talk, would look like this:
1                2       3       4        5        6              7             8
District         x (Sp)  y (Ac)  (x-mx)   (y-my)   (x-mx)(y-my)   (x-mx)²       (y-my)²
Bermondsey       0.050   0.463   -0.030   +0.132   -0.003960      0.000900      0.017424
Camberwell       0.052   0.336   -0.028   +0.005   -0.000140      0.000784      0.000025
Deptford         0.022   0.434   -0.058   +0.103   -0.005974      0.003364      0.010609
Finsbury         0.020   0.388   -0.060   +0.057   -0.003420      0.003600      0.003249
Fulham           0.042   0.422   -0.038   +0.091   -0.003458      0.001444      0.008281
Hammersmith      0.122   0.283   +0.042   -0.048   -0.002016      0.001764      0.002304
Hampstead        0.148   0.171   +0.068   -0.160   -0.010880      0.004624      0.025600
Islington        0.013   0.429   -0.067   +0.098   -0.006566      0.004489      0.009604
Marylebone       0.236   0.178   +0.156   -0.153   -0.023868      0.024336      0.023409
Paddington       0.072   0.336   -0.008   +0.005   -0.000040      0.000064      0.000025
Poplar           0.045   0.370   -0.035   +0.039   -0.001365      0.001225      0.001521
Shoreditch       0.014   0.400   -0.066   +0.069   -0.004554      0.004356      0.004761
Southwark        0.031   0.333   -0.049   +0.002   -0.000098      0.002401      0.000004
Stepney          0.025   0.374   -0.055   +0.043   -0.002365      0.003025      0.001849
Stoke Newington  0.063   0.308   -0.017   -0.023   +0.000391      0.000289      0.000529
Wandsworth       0.146   0.238   +0.066   -0.093   -0.006138      0.004356      0.008649
Westminster      0.275   0.108   +0.195   -0.223   -0.043485      0.038025      0.049729
Woolwich         0.070   0.382   -0.010   +0.051   -0.000510      0.000100      0.002601
SUM (S)          1.446   5.953                     a = -0.118446  b = 0.099146  c = 0.170173
mean (S/18)      0.080   0.331
The boxes are now all filled in (note that mx means "the mean of the x values" and so on), and we can proceed to calculate our answer. It's very easy. All we need are the three sums here called a, b, and c. We substitute them into the following formula:
D = a² / (bc)
getting
D = 0.014029454 / (0.099146)(0.170173) = 0.014029454 / 0.016871972 = 0.8315
And we can then prefix the sign of "a" (which here is minus) to the final result, to remind us that the correlation is inverse: as one quantity goes up, the other tends to go down. Thus, finally,
D = -0.8315
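The whole calculation, from the raw columns to D, can be sketched in Python as follows. This is a minimal illustration, not anyone's official method: the function name `determination` and its `digits` parameter are invented here, and rounding the means to three decimal places is an assumption made so that the sums agree with the worked table.

```python
# x = open space, y = child-accident share, in the table's district order.
xs = [0.050, 0.052, 0.022, 0.020, 0.042, 0.122, 0.148, 0.013, 0.236,
      0.072, 0.045, 0.014, 0.031, 0.025, 0.063, 0.146, 0.275, 0.070]
ys = [0.463, 0.336, 0.434, 0.388, 0.422, 0.283, 0.171, 0.429, 0.178,
      0.336, 0.370, 0.400, 0.333, 0.374, 0.308, 0.238, 0.108, 0.382]


def determination(xs, ys, digits=3):
    """Return the sums a, b, c and D = a^2 / (b * c).

    The means are rounded to `digits` places first, as in the worked
    table; that rounding is an assumption made to match its figures.
    """
    mx = round(sum(xs) / len(xs), digits)
    my = round(sum(ys) / len(ys), digits)
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys))  # column 6 total
    b = sum((x - mx) ** 2 for x in xs)                    # column 7 total
    c = sum((y - my) ** 2 for y in ys)                    # column 8 total
    return a, b, c, a * a / (b * c)


a, b, c, D = determination(xs, ys)
print(f"a = {a:.6f}  b = {b:.6f}  c = {c:.6f}")
print(f"D = {D:.4f}, sign of a is {'minus' if a < 0 else 'plus'}")
```

Run as written, this should reproduce the three sums from the table and a D of about 0.8315. The sign is reported separately because a² is never negative; the minus in front of D is only a reminder of the direction of the relationship.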
The D value tells us, as a fraction, how much of the variation in one of our quantities can be predicted from the other. In this case, it is 83%: across these 18 districts, 83% of the variation in the proportion of accidents which are accidents to children is associated with the amount of open space. It is a pretty high level, but still, it leaves some room for politics. Here come the politics:
"Ladies," you then say, "I congratulate you on gathering these figures; they do indeed show a relationship between open space and proportion of accidents to children. About five-sixths of accidents to children can be accounted for by the proportion of open space. But one-sixth cannot be so explained, and surely (eyeing briefly every sixth mother in the room) we would not wish to abandon one-sixth of our children to their fate. Perhaps we should look at it even more closely. If you will ask the Police Commander for a map of the borough showing where accidents to children have occurred, and I will have my secretary request his cooperation, we can pinpoint the exact locations of these lamentable accidents, and take precise and effective steps to prevent them."
And as they head for the doorway, you add silently, Or you could move to Westminster.
Just for the record, if you want to get instead the classical correlation coefficient C in the example above, it is the square root of D (taken without its reminder sign), with the sign of "a" then put back in front, which in this case would be:
C = -0.9119
The standard textbooks will tell you, by means of a series of tables in the back, how to interpret this.
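As a cross-check, the correlation coefficient can also be computed directly from the three sums, as a divided by the square root of b times c, which keeps the sign of a automatically. A minimal sketch, using the sums from the worked table as literal values (the variable names simply follow the text):

```python
import math

# Sums taken from the worked table in the text.
a = -0.118446   # sum of (x-mx)(y-my)
b = 0.099146    # sum of (x-mx) squared
c = 0.170173    # sum of (y-my) squared

C = a / math.sqrt(b * c)   # correlation coefficient, sign included
D = C * C                  # coefficient of determination, never negative

print(f"C = {C:.4f}")
print(f"D = {D:.4f}")
```

Squaring C gives back D, which is why the two measures always agree about the strength of a relationship and differ only in how they are read.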
Readings
- M J Moroney. Facts From Figures. The Islington problem (with its out-of-date data) is at p. 288. It there gets a correlation coefficient solution; the coefficient of determination is not mentioned. Moroney reaches a correlation coefficient of -0.92. This is a little high, due to rounding error.
- Islington Police Station. Note the program for safety in the schools.
17 Dec 2006