Coronavirus data: all skewed up.
Let’s expand on my last post a wee bit.
We talked about the importance of putting all of your data on the same playing field with standardization. Instead of looking at total cases – which doesn’t account for each country’s total population – we looked at the number of cases per million people.
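If you want to see that standardization as arithmetic, here's a minimal Python sketch. Heads up: the two countries and all of their numbers are completely made up for illustration (they are not real data), and the function name `cases_per_million` is just my own label.

```python
def cases_per_million(total_cases, population):
    """Scale a raw case count to cases per one million people."""
    return total_cases / population * 1_000_000

# Hypothetical countries with invented numbers, purely for illustration.
bigland = cases_per_million(total_cases=100_000, population=300_000_000)
smallville = cases_per_million(total_cases=20_000, population=10_000_000)

print(f"Bigland:    {bigland:,.0f} cases per mil")
print(f"Smallville: {smallville:,.0f} cases per mil")
```

Bigland has five times the total cases, but once you account for population, Smallville is the one with more cases per million people.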
Let’s start with the same approach and update the numbers. This is the distribution of cases per mil as of May 7th:
Cool. What headline could we write using the data this way?
UK BEHIND THE US, SPAIN, AND ITALY IN NUMBER OF CASES PER MILLION PEOPLE. HUZZAH!
But, here’s where things get a little skewy.
What does it mean when the data is skewed? It means there’s a factor in the data that hasn’t been considered. That sneaky little factor (Russia) distorts the data (election results).
Okay, okay, how about another (less controversial) example: There are waaaaay fewer shark attacks in Kansas than in California. Crazy, right?!
Nope. Last time I checked, Kansas had no coastline. The amount of coastline in each state is an unaccounted-for factor in my statement about shark attacks.
Let’s translate that to this situation. Just like you can’t have shark attacks without coastline, you can’t have positive test results without…TESTS.
So how many tests per mil did each country administer?
Looks like Spain and Italy tie for the most testing - around 10% more than the UK or USA. More testing means more chances for positive tests.
Now we need a new question to ask of our data. Let’s look at the whole picture first:
What if we looked at cases per mil as a percentage of total tests per mil in each country? In other words, for each country, what percentage of the tests given yielded positive results?
For the US, that’s 3,940/25,499, which comes to about 0.15, or 15%. Do the same thing for each country, and we get this:
Looks like we need to change our headline!
UK HAS SECOND HIGHEST PERCENTAGE OF POSITIVE TESTS. BOLLOCKS!
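For anyone who wants to check the math behind that US number, here's a minimal Python sketch of the division. The function name `positive_rate` is my own invention, and the only figures used are the US ones quoted above (3,940 cases per mil and 25,499 tests per mil).

```python
def positive_rate(cases_per_mil, tests_per_mil):
    """Share of tests that came back positive (both inputs per million people)."""
    return cases_per_mil / tests_per_mil

# US figures quoted in the post.
us_rate = positive_rate(3_940, 25_499)

print(f"US positive test rate: {us_rate:.1%}")  # prints "US positive test rate: 15.5%"
```

Because both numbers are per million people, the "per million" cancels out, and what's left is just positives divided by tests.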
But wait…what if the UK only tests people they think are infected? Let’s pretend we have our own country with a grand total of 4 citizens – 2 are showing signs of being sick. We only test the 2 people we think are sick and their results are positive. That’s a 100% positive test rate. If we test all 4 people and get the same 2 positive results, we now have a 50% positive test rate. Biased testing skews the data. But, that’s another detour we won’t take right now…
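The 4-citizen country is small enough to run by hand, but here it is as a quick Python sketch anyway. The function and variable names are mine, made up for illustration; the numbers come straight from the example above.

```python
def positive_rate(positives, tests_given):
    """Fraction of the tests given that came back positive."""
    return positives / tests_given

# Scenario 1: only test the 2 citizens who look sick; both are positive.
targeted = positive_rate(positives=2, tests_given=2)

# Scenario 2: test all 4 citizens; the same 2 are positive.
everyone = positive_rate(positives=2, tests_given=4)

print(f"Test only the sick: {targeted:.0%}")  # prints "Test only the sick: 100%"
print(f"Test everyone:      {everyone:.0%}")  # prints "Test everyone:      50%"
```

Same two sick people, two very different positive test rates. The only thing that changed was who got tested.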
The point is - the questions you ask of the data make a difference.
Side note: here comes the media. They love to play with data. It doesn’t take much to tell the story you want to tell as long as you think your audience won’t think for themselves. So, thank you for making their jobs harder by being here, reading this, and arming yourself against skewed up media practices.
Stay safe and healthy, y’all.
kdoh