After my talk at Oxford Geek Night I was happy to receive a couple of suggestions for getting better results out of the algorithm. One was to remove retweets from the search, which makes sense: as we all know from many Twitter bios, “a RT does not imply endorsement”. That was easy to implement, as the basic Twitter search API returns retweets ‘old-style’ with “RT” at the head.
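For illustration only, here is a minimal sketch of that kind of filter in Python. It assumes the search results have already been reduced to a plain list of tweet texts; the function names are hypothetical and this is not the code actually running behind the site.

```python
def is_old_style_retweet(text):
    """Return True if the tweet text starts with the old-style "RT" marker."""
    words = text.split()
    # Assumption: an old-style retweet begins with the bare token "RT",
    # as in "RT @someone: original text".
    return bool(words) and words[0] == "RT"


def drop_retweets(tweets):
    """Keep only the tweets that are not old-style retweets."""
    return [t for t in tweets if not is_old_style_retweet(t)]
```

Comparing the whole first word, rather than just checking whether the text starts with the letters "RT", avoids throwing away tweets that merely begin with a word like "RTFM".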

The other was more complex, so I’m going to quote Owen who emailed me directly:

“This morning I thought up an analogy. Suppose you have weather readings for the last 100 days. For each day you have temperature (T), humidity (H) and mm of precipitation (P). What you’re doing is multiplying these all together, presumably because you want to get one number out. Unfortunately this number is meaningless. If you wanted to combine these quantities in some way you should really be thinking about what meaning you’re attaching to the number you get out. I’m ignoring here the fact that you multiplied them all together, when in all likelihood adding them would make more sense. I suggest it would be more meaningful to keep track of them separately, and plot three graphs instead of one. Indeed, this is what is done with weather data.

You spoke about wanting to get a measure of how much spread a set of data has. What you want is the variance, or something like it. The average (more properly called the mean) of a set of numbers is obtained by adding them all up and dividing by the total number. This tells you something very useful, but it loses all information about how spread out the information was. The variance captures that. It’s a bit tricky to calculate. I’ll try to explain it here, but you can always google for more details. Suppose you have numbers a1 up to a100. The average is M = (a1 + a2 + … + a100) / 100. To calculate the variance we have to calculate some intermediate numbers. First, you have to calculate the average. Then you have to calculate the average of each number squared: Z = (a1^2 + … + a100^2) / 100. Now the variance is V = Z – M^2. I know that doesn’t seem to make much sense. There is a way of calculating the variance which makes it clearer why it’s any use, but it’s a bit harder to actually implement.

You might want to square root the variance to get the standard deviation. This is measured on the same scale as the original numbers you had, so it makes a bit more sense to use that instead.”
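As a rough sketch, Owen’s recipe could look like this in Python. It uses the standard identity that the variance is the mean of the squares minus the square of the mean (V = Z – M^2). The function names are illustrative, not taken from the code behind @IsOxfordHappy.

```python
from math import sqrt


def mean(values):
    """M: add the numbers up and divide by how many there are."""
    return sum(values) / len(values)


def variance(values):
    """V = Z - M^2, where Z is the mean of the squared values."""
    m = mean(values)
    z = mean([v * v for v in values])
    return z - m * m


def standard_deviation(values):
    """Square root of the variance, on the same scale as the input."""
    # Rounding error can push a near-zero variance slightly negative,
    # so clamp at zero before taking the square root.
    return sqrt(max(variance(values), 0.0))


scores = [2, 4, 4, 4, 5, 5, 7, 9]
print(mean(scores), variance(scores), standard_deviation(scores))
```

For anything beyond a quick experiment, Python’s built-in statistics module offers pvariance and pstdev, which compute the same population quantities with more numerical care.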

So, @IsOxfordHappy and the location-sensitive page now do both of those. I’ve removed the ‘word scale’ for the time being, until I can see roughly what the numbers are. Thanks, everyone, for your suggestions.