Good Math/Bad Math

Wednesday, April 05, 2006

Good vs Bad in Math of Autism Studies

Ok, so I've been asked to explain the difference between the good math in Shattuck's recent paper about diagnostic substitution, and the Geiers' paper on the change in autism rates as the use of thimerosal-based vaccines has declined.
It's a bit hard to do, because while I have the full text of the Geier paper, I do not have the full text of Shattuck - so wrt Shattuck, I'm working on a combination of the parts of the paper that I've been able to find online, and various interviews with the author and/or discussions of the paper. (If it's legal for someone with access to send me a copy, I'd be delighted to see it; but I'm not willing to shell out $12 to obtain a copy through the journal directly just so that I can write a blog post.)

So - here's what I consider to be a few key differences between the studies:

  1. Shattuck is very up-front about the limitations of his data. He's arguing that children who are currently classified as autistic by public school special education programs were formerly classified differently; a key part of that argument is based on the fact that until 1994, there was no uniform reporting of autism as a special education classification. In presenting his argument, Shattuck states: "However, no reliable data exist that would indicate how the enrollment of children with autism was actually distributed among other enrollment categories before the 1990's." In other words, he admits that the data he's working with is incomplete - that some of the information that would be useful in supporting his analysis was never collected.

    In contrast, the Geiers use extremely poor data sources, but never own up to the weaknesses of that data. They present it as if it were unblemished, carefully gathered data.

  2. Shattuck uses multiple analyses; the Geiers only used one simple linear regression. This is a big deal. When you use one analysis - particularly one simple analysis - you may find the appearance of a trend or correlation that is really an artifact of the analysis on that particular data. When you're working with real world data, you always try to perform multiple independent analyses to try to eliminate the possibility of false correlation. The Geiers do not do this.

  3. Shattuck performs careful correlation analyses to show that his results are not the result of combinations of other diagnostic substitutions; that is, unlike the Geiers, who find the data that supports their conclusion and stop looking any further, Shattuck does the math to show that what he found is not the result of some other trend in the data.

  4. Shattuck presents corroborative data from different sources to support his hypothesis about the reporting rates of autism. His data has a known fundamental weakness: autism wasn't uniformly reported before 1994, so he shows statistical data to support the idea that autism was under-reported before it became a distinct required reporting category in 1994. The Geiers do not admit to any weaknesses in their data sources, much less show corroborative data to support their hypothesis where their data is weak.

  5. Shattuck explains the trend behaviour of categorization data when new categories are introduced, showing how the autism data follows the standard trend, and demonstrating that the same trend occurs in other diagnoses after they became required categories.

  6. Shattuck analyzes his data in a statistically valid way. He does not massage it, divide it, alter it, re-organize it: he treats it as a consistent data set over time, and analyzes it as such. The Geiers look at their data, and pick one arbitrary point to break their data into two independent data sets. That is absolutely not a mathematically valid thing to do. Shattuck is using math right; the Geiers are doing something that is unambiguously wrong.

  7. Finally, Shattuck's work was reviewed and analyzed by another researcher, who points out some potential problems in his methodology; Shattuck admits to those weaknesses, but makes a reasonable argument for why the data he used is the best available.
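
To make point 6 a bit more concrete, here's a minimal sketch of why splitting a data set at a point you picked by looking at it is so dangerous. It uses purely synthetic data (a single straight line plus noise, with no real change anywhere), not the Geiers' data, and every name and number in it (`slope_gap`, `simulate`, 24 points, the noise level) is an assumption made up for illustration. It compares the slope difference you get from a break point fixed in advance against the one you get when you choose the break that makes the two fitted lines look most different.

```python
import numpy as np


def slope(x, y):
    """Least-squares slope of y regressed on x."""
    return np.polyfit(x, y, 1)[0]


def slope_gap(x, y, k):
    """Absolute difference between the slopes fitted separately
    to the first k points and to the remaining points."""
    return abs(slope(x[:k], y[:k]) - slope(x[k:], y[k:]))


def simulate(n_series=500, n_points=24, seed=0):
    """Return (mean gap with a pre-fixed break, mean gap with a data-picked break).

    Every simulated series is one straight line plus noise, so any
    'change in trend' that shows up is an artifact of the split.
    """
    rng = np.random.default_rng(seed)
    x = np.arange(n_points, dtype=float)
    candidates = range(4, n_points - 3)  # keep at least 4 points per segment
    fixed_k = n_points // 2              # break chosen before seeing the data
    fixed, picked = [], []
    for _ in range(n_series):
        y = 2.0 * x + rng.normal(0.0, 5.0, n_points)
        fixed.append(slope_gap(x, y, fixed_k))
        # Break chosen after the fact: whichever split looks most dramatic.
        picked.append(max(slope_gap(x, y, k) for k in candidates))
    return float(np.mean(fixed)), float(np.mean(picked))


if __name__ == "__main__":
    fixed_mean, picked_mean = simulate()
    print(f"mean slope gap, break fixed in advance: {fixed_mean:.2f}")
    print(f"mean slope gap, break chosen from data: {picked_mean:.2f}")
```

Since the pre-fixed break is one of the candidates, the data-picked gap can never be smaller than the fixed one, and in practice it comes out noticeably larger even though nothing in the underlying data ever changed. That's the sense in which two regressions around a hand-picked pivot can manufacture a "change in trend" out of noise.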


  • The points you have highlighted (especially their use of a linear regression) illustrate a single underlying point, which I think needs to be stated: you cannot do reliable statistics on the basis of high-school maths.

    By Blogger Thomas Winwood, at 2:00 PM  

  • Thomas:

    I think there's more to it than that. I don't believe that the Geiers' paper uses such dreadful math because they don't know enough to do it better. To me at least, when it comes to math, it looks a lot more like the Geiers are dishonest rather than incompetent.

    By Blogger MarkCC, at 2:23 PM  

  • Judging from your description, the data should never have been used to test the Thimerosal/autism link, regardless of the analysis. However, I don't agree that using linear regression on these data sets is a sin. More complex analysis types might be better, but throwing a complex analysis at a bad data set doesn't lead to more reliable results. I'm not trying to defend the paper, but you probably wouldn't have too much problem finding regression lines fit through time series data in the mainstream epidemiological literature.

    It seems that the biggest problem with the analysis in the Geier and Geier paper, which you mention, is that only one explanation is considered. Epidemiological data is always more messy than clinical studies, but when you're talking about the effects of a toxin on kids there isn't another way. But, to do correlational studies well it's crucial to consider as many plausible alternative hypotheses as possible. Even without the problems with the data set, and the problems with changes in diagnoses discussed in the Shattuck paper, the point you made in your previous post that we wouldn't expect a big decline in autism rates until several years after vaccinations stopped is really important, because it means that there is probably a better explanation for the dramatic dropoff in reports after 2003.

    I don't see anything wrong generally, however, with picking a date at which a change is expected and then testing whether the data support the hypothesis that a change has occurred. Basing the date on an inspection of the data is a no-no, but if the date is based on something like a change in treatment then it's perfectly fine. If nothing else the change in the Geier and Geier scatter plots is so pronounced that I can't imagine that any of the common analysis types would have missed it. Again, the data should probably not have been used at all, and alternative explanations certainly should have been tested, but I don't see how the choice of linear regression makes the situation any better or worse.

    By Anonymous Anonymous, at 4:59 PM  

  • anonymous:

    While you are correct that the data that the Geiers used should never have been used for testing a thimerosal/autism link, there's more to it than that.

    Using simple linear regression is not a valid mathematical technique for discovering how a correlation changes over time. Using two regressions, before and after a single pivot point, is not a valid technique for showing that there was a change at a particular time.

    The Shattuck paper is a good contrast to show what's wrong with the Geiers'. One of the things that Shattuck did was do a secondary analysis to correct for the possibility that diagnostic substitution was taking place simultaneously among several diagnostic categories. He didn't just do the simplest analysis that looked like it supported his case - he did multiple analyses to ensure that what he was finding was statistically meaningful, and to eliminate other causes.

    By Blogger MarkCC, at 6:14 PM  

  • Depressing article on the front page of the Atlanta Journal-Constitution today (also online, but it might be behind the registration wall). The anti-vaccine and autism-activism crowds have raised such a stink, with the help of RFK Jr., that the credibility of the Centers for Disease Control is being called into question by Congress. The bitter irony is that the less work is done by outfits like the CDC, the more we'll wind up relying on industry-funded studies for actual scientific information - not that the activists care about the science.

    By Anonymous jackd, at 3:20 PM  

  • Math aside, the Geiers' definition of "New Cases" is totally flawed. This alone will hopefully get the paper retracted in its totality.

    By Blogger Joseph, at 1:25 PM  
