Good Math/Bad Math: Slandering with Statistics

Tuesday, May 02, 2006

Slandering with Statistics

One thing that I keep getting mail about is various folks misusing statistics. There are a lot of examples of this, from young earth creationists, to the intelligent design bozos that I've spent too much time on, to various racist assholes, to HIV/AIDS denialists.

A comment on another blog, posted by a racist asshole trying to argue that the accuser in the Duke Rape case is probably lying, is an example of a kind of misuse of statistics that I haven't talked about yet, so I thought it was worth taking a moment to look at it.

So here's what the asshole had to say:

Unfortunately, statistically a black women is significantly more likely to make a false accusation of rape than to have been raped by a white man. According to the National Crime Victimization Survey ( http://www.ojp.usdoj.gov/bjs/pub/pdf/cvus/current/cv0342.pdf ), less than .0004% of black rape victims were raped by whites. (The NCVS reports the percentage as 0% because there were less than 10 reported cases. I assumed 9 cases, to come up with an actual percentage) Even with the most conservative figure of 2% of rape allegations being false, this means in the case of the Duke Rape Case, the victim is 5000 times more likely to have made a false accucation than to have actually been raped.

Before I hit the real point, I'll point out that there's a really dumb math error in there: it's not "0.0004%"; it's "0.04%". Yeah, sure, we're still talking about small numbers, but two orders of magnitude is nothing to sneeze at, and it either shows how little concern the writer has for the facts, or it's a deliberate attempt to dishonestly skew the numbers by adding a percent sign.
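To make the size of that slip concrete, here is a minimal sketch of the arithmetic in Python. It uses only the numbers quoted above (the 0.0004 figure and the 2% false-allegation rate) and takes them at face value purely to show the calculation; the variable names are mine.

```python
# Figures taken at face value from the quoted comment; names are illustrative.
fraction = 0.0004             # the commenter's number, read as a plain fraction
as_percent = fraction * 100   # a fraction becomes a percentage when multiplied by 100

false_rate_percent = 2.0      # the "2% of allegations are false" figure from the comment

# Ratio the commenter reported, using the mislabeled "0.0004%"
ratio_claimed = false_rate_percent / 0.0004
# Ratio once the percentage is written correctly as "0.04%"
ratio_corrected = false_rate_percent / as_percent

print(f"{fraction} as a percentage: {as_percent:.2f}%")
print(f"claimed ratio: {ratio_claimed:.0f}, corrected ratio: {ratio_corrected:.0f}")
```

Even on the commenter's own terms, the "5000 times" shrinks to 50 once the percent sign is handled correctly.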

But what's really important are the errors in how he uses statistics. He's making two major errors. One is a sampling error in producing a meaningful statistic; and one is an error in how you can apply a statistical result.

Statistics are a way of analyzing large quantities of data to understand patterns and trends, and they're very useful for that. But you have to understand what a statistic means and how it was gathered in order to know when it can be applied. In general, statistics are useful when they are generated from a large enough pool of correctly sampled data; statistics can be meaningfully applied to data in the aggregate (applying broad statistical trends to individual cases is rarely enlightening).

Now, in the above quoted gunk, the writer admits that the sample from which he is extrapolating is vanishingly small. You cannot generate anything meaningful from such a small sample: that "0.04%" number is basically just made up - it has no validity, no statistical meaning.
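A rough way to see why is to look at how uncertain a count that small is. The sketch below is my own illustration, not anything taken from the survey: it assumes the count behaves like a Poisson count and uses the rule of thumb that the relative standard error of a count k is about 1/sqrt(k). The function name is hypothetical.

```python
import math

def relative_standard_error(count: int) -> float:
    """Approximate relative standard error of a Poisson count: 1/sqrt(count)."""
    if count <= 0:
        raise ValueError("count must be positive")
    return 1.0 / math.sqrt(count)

# "Fewer than 10" could be anything from 0 to 9; 9 is only the commenter's guess.
for k in (1, 4, 9):
    print(f"count={k}: relative standard error ~ {relative_standard_error(k):.0%}")
```

Even at the assumed maximum of 9 cases, the estimate is uncertain by roughly a third before any reporting bias is considered, and the true count could just as well be 0.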

That's not the only sampling error that he makes. Even if there were enough data in the survey that he cited to draw some conclusion, you have to recognize the properties of the data that you're working with - the nature of the sample, and what, if any, data skew you would expect to see.

Whites raping blacks is something with a long history in America. We know that historically, it was extremely common. We also know that there is a significant social factor to this kind of crime: the usual dynamic of it is a poor black woman raped by a comparatively wealthy white man. In this dynamic, the victim is, at best, going to be socially stigmatized by coming forward. (For example, do you think that the black woman who had Jesse Helms' illegitimate daughter could have come forward and said he raped her? Do you think that the poor black "servant" in Helms' childhood home actually wanted to have sex with him?) What this means is that because of the social environment, crimes like white-on-black rape are very likely to be under-reported.

This is a property of the sampling of the data. It's based on reported crimes; and we know that rape is a chronically under-reported crime, and that there are good reasons to suspect that the under-reporting is especially pronounced in the cases of black victims with white perps. So when we use that data, we need to recognize the sampling bias in it - and consider that when we use the data to draw conclusions.

You cannot use crime report statistics without recognizing the biases that are a natural part of the reporting process that generates those statistics. We know that the data are very likely to be flawed when it comes to this subject; to then use a microscopic subset of the flawed data and try to apply it to an individual case to purportedly say something meaningful - that's just nonsense.

But that's exactly what our little racist buddy does: he creates a meaningless number from a meaninglessly small sample of data with a known bias, and tries to combine it with other statistics to come up with a number that he can use in an argument.

Then he tries to apply the statistic that he generated. One of the very important properties of statistics is that they talk about populations, not individuals. The average American family has often been cited as having something around 2.7 children. You can't use that statistic to say that a particular family probably has two and two-thirds children.

But again - that's exactly what he does: he creates a nonsense statistic, and then applies it to a specific individual to argue that we should believe that she is lying.

37 Comments:

Nicely written and analyzed. Would you mind if I used this in my intro statistics class next fall? I think it would be a good example for them to see.

By Qalmlea, at 4:14 PM
You might also want to look on the Volokh Conspiracy for some cogent discussions about rape statistics in general (not just the Duke case). For example, he does a good job of discussing an apparent (not actual) contradiction between the numbers of falsely reported rapes and claims that such reports are exceedingly rare.

In any case, a point I think you might emphasize even more strongly is that even if the statistics were valid they would say nothing at all about the specific individual. An average person has an IQ of 100--but most people you meet do not. I won't even get into the 2.7 kids ;)

What I think you should avoid, however, is any resort to unproven statistics of your own... if you're going to attack the fool (and rightly so) for using bad statistics which don't show anything, you should not then start talking about the "good reasons to suspect that the under-reporting is especially pronounced in the cases of black victims with white perps", or relating anecdotal evidence about Helms. Your examples are just as relevant as his, or as the comparisons to Tawana Brawley--which is to say, they're not relevant at all.

By sailorman, at 5:13 PM
The racist asshole you're referring to may have been manipulating his numbers, but are you saying that prior information like that should play no role at all in weighing the evidence in individual cases? For example, if it were known that all people from Crete are liars, wouldn't this affect how you would analyze the testimony of a person from Crete?

By Anonymous, at 7:34 PM
Historically speaking, the idea that the rape of blacks by whites was rare is quite simply preposterous. Genetic studies such as those done by Mark Shriver of Penn State suggest that self-identified African-Americans have 17-18% white ancestry, while the typical measure for self-identified white adults averaged around 0.7%, with 70% of those having no trace of African ancestry in their genes.

It's rather hard to maintain a straight face while claiming that rapes of blacks by whites were rare in light of this disparity.

By Anonymous, at 8:17 PM
The racist asshole you're referring to may have been manipulating his numbers, but are you saying that prior information like that should play no role at all in weighing the evidence in individual cases?

If I could be so bold, I believe what is being offered is that it is not "prior information," but rather "nonsense," and hence, yes, "nonsense like that" should play no role at all in weighing the evidence in individual cases.

By Anonymous, at 8:24 PM
Another way of saying it is that he took two independent variables about rape cases in general and tried to apply them to a specific case where one of the variables was already fixed. In other words, (let's just imagine that his percentages are valid for a moment) all he could say is that for black victim rape cases, any particular case is 5000 times (2 vs. .0004) more likely to be fake than to have been a white perp. But given a case where the accused has already been identified as white, then the odds that it's fake (assuming independent variables) are ... 2%!!!
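The same point as a small calculation. This is only a sketch under the assumptions stated in the paragraph above: the two percentages are taken at face value, and the two properties (the allegation is false, the accused is white) are treated as independent. The function and variable names are illustrative.

```python
def p_false_given_white(p_false: float, p_white: float) -> float:
    """P(false | white accused) when the two properties are independent."""
    p_both = p_false * p_white   # independence: the joint probability is the product
    return p_both / p_white      # definition of conditional probability

p_false = 0.02        # 2% of allegations false (figure from the quoted comment)
p_white = 0.000004    # 0.0004% written as a fraction (also from the quoted comment)

print(f"P(false) / P(white accused) = {p_false / p_white:.0f}")
print(f"P(false | white accused)    = {p_false_given_white(p_false, p_white):.2%}")
```

The ratio of the two base rates is 5000, but that ratio is not the conditional probability. Once the accused is known to be white, the tiny base rate cancels out and the answer is just the 2% that was assumed to begin with.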

Let me make another example just for fun. Suppose that 0.0004% of cancer cases are bladder cancer. And let's also suppose that 2% of all cancer diagnoses are misdiagnosed. Does this mean that a patient who has been diagnosed with bladder cancer is 5000 times more likely to have been misdiagnosed than to have cancer?

What a maroon!

By The Science Pundit, at 8:40 PM
qalmlea:

Yes, go ahead. As I've said before, folks are welcome to reuse any of my posts for non-commercial purposes as long as it's attributed.

By MarkCC, at 9:04 PM
reverent bayes:

Yes, that is exactly what I'm saying. "Prior information" in the form of statistical generalities is absolutely irrelevant to the truth or falsehood of a particular specific case.

Your example is great: suppose we know that all people from Crete are liars - should we then never allow anyone from Crete to report a crime? Can we safely assume that no one from Crete will ever be robbed?

My answer, of course, is that in any individual case, you consider the evidence in that case, for the individuals involved in that case. Statistics are for analyzing trends, not individual cases.

By MarkCC, at 9:11 PM
Another statistic to research: Of all the alleged rapes that occurred in the last ten years and received wide national coverage, how many resulted in convictions?

By Anonymous, at 9:34 PM
Mark, did you mean Jesse Helms or Strom Thurmond? Thurmond's daughter is Essie Mae Washington-Williams.

By Zeno, at 9:37 PM
Uhh...what in that post makes the poster a "racist asshole", or were you referring to commenters?

By Steve, at 4:27 PM
steve:

The person that I'm calling a racist asshole is the guy who wrote the comment that I quoted in my post.

By MarkCC, at 4:31 PM
zeno:

Yes, you're right; I confused Strom Thurmond and Jesse Helms. I'm not sure why, but I frequently confuse those two.

By MarkCC, at 4:36 PM
sailorman:

I disagree very strongly with you about the issue of whether or not things like social situations should be considered in statistical discussions.

Statistics are only as valid as the data from which they are generated. When using data, you must understand how and why that information was gathered, and what factors might influence it.

I would not argue for a specific modification in a statistical value on the basis of a social factor - but I *would* argue for the applicability or inapplicability of a statistic computed from a particular data set based on the known properties of how that data set was produced.

By MarkCC, at 4:46 PM
Mark, you seem to advocate a somewhat irrational estimation procedure (e.g. estimating the probability that X is guilty of Y), in the sense that it doesn't take into account relevant information. Many statistical models contain parameters that cannot be estimated from the data collected in the individual case you're interested in, but that can be estimated from previous studies. That's standard scientific practice. If the speed of light had to be measured over and over again in every experiment...

Therefore I stick to this point: if it is known from previous studies that 99% of people from Crete are habitual liars, then you should arrive at a lower estimate of the probability that a new guy from Crete is telling the truth compared to when earlier studies had shown that only 1% are habitual liars. How could you not?

By Anonymous, at 6:29 PM
reverent bayes,

I assume that your Cretan liars example was meant to challenge something Mark said, because I cannot for the life of me figure out its significance to the original "asshole" post. Let's look at what he said:

Unfortunately, statistically a black women is significantly more likely to make a false accusation of rape than to have been raped by a white man.

Even if that were true, so what? I could say "Unfortunately, statistically a New Yorker is significantly more likely to lie on his resume than to have graduated from the University of Hawaii."
Unless you could show that there was a particular incentive for New Yorkers to say they graduated from UH, these are two independent statistics. Big deal! All this says is that among New Yorkers, you'll find more fudged resumes than ones that say UH graduate. But it says nothing about the odds of a particular resume, given that it says "UH graduate," being fudged.

Then after some nonsense, he concludes:
Even with the most conservative figure of 2% of rape allegations being false, this means in the case of the Duke Rape Case, the victim is 5000 times more likely to have made a false accucation than to have actually been raped.

This is a total non sequitur! This would be like holding up a resume that says "graduated from University of Hawaii" and claiming that it's much more likely to be fudged than genuine. In fact, it is no more likely to be fudged than any other resume. The same applies to the Duke rape case example. Unless this guy can show--and he appears to have made no effort to do so--that false rape charges by black victims are more likely to be towards white men, he's got nothing! (Didn't some statistician have a theorem about this kind of stuff?)

By The Science Pundit, at 9:50 PM
Science Pundit,

I assume that your Cretan liars example was meant to challenge something Mark said, because I cannot for the life of me figure out its significance to the original "asshole" post.

I think that what the Reverent Bayes is objecting to is the idea of not conditioning on all the data, which is sort of what Mark was implying.

For example, if people from Crete lie with a probability of 0.99 then shouldn't that be something we condition on when we try to evaluate the veracity of claims made by those from Crete? Seems to me that the 0.99 should go into the prior probability as to whether the guy is lying or not.

Mark's objection seems to be that we can't trust the 0.99 for other factors that are related to how the data is collected (e.g. it is only reported crimes) and sociological factors (e.g. the racial issues for the past century).

By saying we can't use past experience, Mark seems to be arguing against the use of Bayes' Theorem in a wide range of situations where it should be usable (i.e. who is at fault in an auto accident). For example, suppose our guy from Crete is in an auto accident. If he is telling the truth he will accept fault with probability 0.5 and blame the other driver with probability 0.5. If lying the guy from Crete will blame the other driver with probability 0.85 and accept blame with probability 0.15. Now with the given priors if we observe that the guy from Crete accepting blame our prior for lying goes down, and our prior for telling the truth goes up.
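The update described in this example can be worked through directly. A minimal sketch, using only the hypothetical numbers given above (a 0.99 prior that he is lying, and the 0.15 and 0.5 chances of accepting blame); the function name is illustrative.

```python
def posterior_lying(prior_lying: float,
                    p_accept_if_lying: float,
                    p_accept_if_truthful: float) -> float:
    """Bayes' theorem: P(lying | he accepts blame)."""
    joint_lying = prior_lying * p_accept_if_lying
    joint_truthful = (1.0 - prior_lying) * p_accept_if_truthful
    return joint_lying / (joint_lying + joint_truthful)

post = posterior_lying(prior_lying=0.99,
                       p_accept_if_lying=0.15,
                       p_accept_if_truthful=0.5)
print(f"P(lying) before: 0.990, after he accepts blame: {post:.3f}")
```

The probability of lying does drop once he accepts blame, but only slightly, because a 0.99 starting point dominates a single observation.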

But Mark says we shouldn't use statistical information that says 99% of the time our guy from Crete is lying. Seems kind of weird to me.

By Steve, at 2:04 PM
By the way, just want to add that I think Mark is right that the data in official crime reports has to be taken with a shovel of salt. There are indeed problems with under-reporting and social factors.

But the Reverent Bayes is also correct: we do want to condition on all the relevant data.

By Steve, at 2:07 PM
Dang it...

When I wrote,

Now with the given priors if we observe that the guy from Crete accepting blame our prior for lying goes down, and our prior for telling the truth goes up.

Replace prior with posterior.

Also, Mark's point about using statistics should be modified, IMO. We should be careful when applying statistics to individuals. Just because we know 99% of men from Crete are liars (hypothetically speaking), we shouldn't automatically conclude that whenever a man from Crete says something he is lying.

By Steve, at 2:12 PM
The way to determine whether a Cretan has been in an auto accident is to examine him, his car, any witnesses, and the scene of the purported accident, not to do arcane analysis on whether or not he has a passing acquaintance with the truth. If reality depended on what the man said, we'd have him over to James Randi to collect his million bucks.

The boy who cried "Wolf!" might have been a proven 100% liar based on previous experience, but there surely was a wolf there when there was a wolf there.

By Anonymous, at 4:02 PM
The way to determine whether a Cretan has been in an auto accident is to examine him, his car, any witnesses, and the scene of the purported accident, not to do arcane analysis on whether or not he has a passing acquaintance with the truth.

1. First, the issue wasn't whether or not there was an accident, but who was at fault. A somewhat different question. Second, obviously you haven't been in an auto accident. The last three that happened to me (rear ended by others), there were witnesses, but none stopped. While being rear ended almost always means the other guy is at fault, it isn't 100%. Inspection of the cars simply tells us that an accident happened, not necessarily who is at fault.

2. As for "arcane analysis" what do you think Mark does for a living? Don't be a dimbulb.

The boy who cried "Wolf!" might have been a proven 100% liar based on previous experience, but there surely was a wolf there when there was a wolf there.

Mark, please do a post on conditional probability. Some of your readers clearly need some help on this.

By Steve, at 4:13 PM
" As for "arcane analysis" what do you think Mark does for a living? Don't be a dimbulb."

I know what Mark does for a living. I would also tell a successful stockbroker that market analysis wouldn't tell him whether or not the Cretan had been in an auto accident, either. Don't be a paladin.

"Mark, please do a post on conditional probability. Some or your readers clearly need some help on this."

If conditional probability can tell an investigator the truth status of a claim without resort to the actual evidence, I'll sit in the front row of class, thank you.

By Anonymous, at 4:36 PM
If conditional probability can tell an investigator the truth status of a claim without resort to the actual evidence, I'll sit in the front row of class, thank you.

And what part about "conditioning on all the data" don't you understand?

Idiot.

By Steve, at 4:46 PM
None of it, without being a mathematician myself, which is why I wanted a seat up front in class.

Is math too pristine and pure for the unwashed hands of amateurs? Do you cane the plebes too? Asshole.

By Anonymous, at 4:54 PM
I don't give a shit about the case where a black woman accused some white guys of raping her. It's very American that this is such a big deal, just like the girl that has gone missing on Aruba (I'm Dutch, so Aruba is officially part of my country, yeah I know that sucks, but that story has hardly been in the news at all in Holland).

My point was that information about the likelihood that black women falsely report rape by a white guy is relevant information.

I agree with Mark that in practice the estimates of this probability are probably biased. But this is simply additional information that should also be taken into account (technically, it would increase the variance of the prior distribution).

I guess that Mark's tendency to be PC overwhelmed his otherwise admirable analytical abilities.

By Anonymous, at 7:15 PM
bayes:

Dealing with group statistics, it is not valid to apply them to specific individuals. This is just a general statement of fact. It *doesn't work*. It isn't valid use of statistics.

As has been repeated multiple times in the comments here: the average person has an IQ of 100; the average family has 2.7 children. That doesn't mean that you can use those numbers to give valid probabilities of *my* IQ or the number of children that *I* have.

If you took a truly random pool of 1000 people, you can infer that it's likely that the average of their IQs is 100. If you took a pool of 1000 families, you could reasonably infer that there's a high probability that the average number of children will be 2.7. But you can't point at an individual, and say "His IQ is probably 100".

That's not PC - that's valid use of statistics. Statistics are about groups.

assumptions about how many children a specific married couple will

By MarkCC, at 7:28 PM
bayes:

One other point: there is no valid statistic about the likelihood of a black woman falsely reporting a rape by a white man. The "statistic" that the jackass in question used was a phony manufactured one, which could not be validly applied in *any* situation.

By MarkCC, at 7:30 PM
None of it, without being a mathematician myself, which is why I wanted a seat up front in class.

So in other words your comments are based on ignorance. Thanks for that important bit of information.

Mark,

I'm still not seeing the issue here with trying to determine if somebody is lying and using previous statistical information. You keep saying we can't use the 2.7 children estimate of the mean, but you keep blowing off things like the distribution. For example, there is a distribution of number of children where the mean is 2.7. While nobody can indeed have 2.7 children, it seems that we would want to use that information in certain circumstances.

Seriously, what is your experience with Bayesian analysis? Using it at the individual level strikes me as eminently reasonable. Basically what we have is a situation with evaluating the honesty of an individual (in terms of probabilities). If we have prior information on the honesty of people in similar circumstances shouldn't that be factored into the analysis?

My additional problem is you simply state things as fact. You can't do this. You provide little additional support. For example, the 2.7 children figure, while true, strikes me as beside the point in terms of a distribution.

Second, you are making a strawman argument when you say, "We know people in situation X often do Y, and this person is in situation X, therefore Y." You are taking a probabilistic statement and turning it into a statement of certainty. Nobody here is arguing that the woman must be lying, but it is a possibility.

By Steve, at 1:03 PM
steve:

I don't know how else to say it. Statistics are measurements in the aggregate. Properties of individuals taken out of the aggregate cannot meaningfully be predicted, except by how they compare to the aggregate - and to do that comparison, you need to first identify the properties of the individual.

You can meaningfully ask, "If I pick a random person, what's the probability that their intelligence is between IQ 90 and 110?". Because that's really a question about the aggregate.
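That question has a concrete answer once a distribution is chosen. As a sketch only: the code below assumes IQ scores follow a normal distribution with mean 100 and standard deviation 15. The standard deviation is my assumption (it is the common convention for IQ scales), not a number from this discussion.

```python
import math

def normal_cdf(x: float, mean: float, sd: float) -> float:
    """Cumulative distribution function of a normal distribution."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))

MEAN_IQ = 100.0  # mean used in the discussion above
SD_IQ = 15.0     # assumption: the common convention for IQ scales

p = normal_cdf(110, MEAN_IQ, SD_IQ) - normal_cdf(90, MEAN_IQ, SD_IQ)
print(f"P(90 <= IQ <= 110) for a randomly chosen person: {p:.3f}")
```

Under those assumptions the answer comes out a little under one half. It is a statement about drawing from the population, and says nothing about any particular person whose score is already known.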

The reason I keep bringing in the "2.7 or whatever number children" is because it's a very tactile example of what happens when aggregate properties are applied to individuals. The number only has a meaning when applied in the context of an aggregate.

What's the probability that a random person in America is Jewish? Around 2%. What's the probability that *I* am Jewish? 100%. Once you take *me* out of the aggregate, the statistics of the aggregate no longer have meaning.

By MarkCC, at 4:19 PM
Seriously, what is your experience with Bayesian analysis? Using it at the individual level strikes me as eminently reasonable. Basically what we have is a situation with evaluating the honesty of an individual (in terms of probabilities). If we have prior information on the honesty of people in similar circumstances shouldn't that be factored into the analysis?

Actually this is done all the time--it's called profiling. How valid or moral (or legal, for that matter) profiling is, should probably be left to its own discussion.

By The Science Pundit, at 6:26 PM
"What's the probability that a random person in America is Jewish? Around 2%. What's the probability that *I* am Jewish? 100%. Once you take *me* out of the aggregate, the statistics of the aggregate no longer have meaning."

You really have been got at by those evil frequentists. :-)

From the Bayesian perspective, we can apply probabilities to individuals: not knowing you from Adam (OK, perhaps you're better dressed than he is), I can use that as the probability that you're Jewish: I can set up odds on that basis such that I would accept either side of the bet.

Of course, more information will change that. For example, your statement that you're Jewish raises the probability somewhat (not to 1, of course: you could be Cretan).

You're right that once we know you're Jewish, then the statistics have no meaning, but that's because we're conditioning on new information (i.e. we now have P(X|X)).

In the rape case under discussion, the probability that the lady was lying would be relevant (if it could be calculated correctly!), but then so would a lot of other information.

For an introduction to how the legal community views these matters, take a look at
Taroni et al. (2006) Bayesian Networks and Probabilistic Inference in Forensic Science.

Bob
P.S. (and if you understand this, you're really beyond hope) Is it ironic that "the reverand bayes" is Dutch? I'm not sure whether I can expect his probability statements to be true or not.

By Anonymous, at 5:24 AM
Bob,
Did you know that Ireland was once ruled by a Dutch (Protestant) king? The Protestants in Northern Ireland still have yearly "orange marches", typically causing lots of violent clashes between Catholics and Protestants.

By Anonymous, at 5:40 PM
"Did you know that ireland was once ruled by a dutch (protestant) king? "

It's a bit more complicated than that (as they always are with Ireland), and you might lose a few friends if you say that in the wrong place!

As I'm here, I might as well suggest some further reading. For anyone wanting to know more about the philosophy behind probability (and why we can/cannot talk about the probability associated with a single event), this is a good book:

Hacking, I. (2001) An Introduction to Probability and Inductive Logic. CUP.

It's aimed at undergrad philosophers, so the maths is at a low level, but it's very clear and understandable.

Bob

By Anonymous, at 1:23 AM
Mark,

You can meaningfully ask, "If I pick a random person, what's the probability that their intelligence is between IQ 90 and 110?". Because that's really a question about the aggregate.

Okay, I'm with ya. What I want to know, though, is why we can't do this with lying. Why can't we ask, "What is the probability this person is lying, given all the evidence we have?" Where all the evidence also includes, if we have decent numbers on it, the number of people in that situation who lie.

What's the probability that a random person in America is Jewish? Around 2%. What's the probability that *I* am Jewish? 100%. Once you take *me* out of the aggregate, the statistics of the aggregate no longer have meaning.

Again, I agree. However, I don't see the problem. When we condition on all the data then what is the big deal? Presumably, that 2% thingy would get swamped by some other factor (e.g. Mark goes to Synagogue, or Mark is Jewish, etc.).

Science Pundit Guy,

Actually this is done all the time--it's called profiling. How valid or moral (or legal, for that matter) profiling is, should probably be left to its own discussion.

And sometimes it works. This is how things like Bayesian e-mail filters work. You get lots of "ham" and lots of "spam" and then the filter can do a decent job of "profiling" the e-mail.
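For readers who have not seen one, here is a toy version of such a filter. It is a sketch only: a naive Bayes classifier with add-one smoothing, trained on four made-up messages. The training data, function names, and smoothing choice are all assumptions for illustration, not a description of any real filter.

```python
import math
from collections import Counter

def train(messages):
    """Count word occurrences per label from (label, text) pairs."""
    word_counts = {"spam": Counter(), "ham": Counter()}
    label_counts = Counter()
    for label, text in messages:
        label_counts[label] += 1
        word_counts[label].update(text.lower().split())
    return word_counts, label_counts

def p_spam(text, word_counts, label_counts):
    """Posterior P(spam | words), treating words as independent given the label."""
    vocab = set(word_counts["spam"]) | set(word_counts["ham"])
    total = sum(label_counts.values())
    log_score = {}
    for label in ("spam", "ham"):
        n_words = sum(word_counts[label].values())
        score = math.log(label_counts[label] / total)  # prior from aggregate counts
        for word in text.lower().split():
            # Add-one smoothing so an unseen word does not zero out the product
            score += math.log((word_counts[label][word] + 1) / (n_words + len(vocab)))
        log_score[label] = score
    # Turn the two log scores into a normalised probability
    top = max(log_score.values())
    weights = {k: math.exp(v - top) for k, v in log_score.items()}
    return weights["spam"] / (weights["spam"] + weights["ham"])

# Tiny made-up training set, purely for illustration
training = [
    ("spam", "cheap pills buy now"),
    ("spam", "buy cheap watches now"),
    ("ham", "lunch meeting moved to noon"),
    ("ham", "notes from the statistics meeting"),
]
counts, labels = train(training)
print(p_spam("buy cheap pills", counts, labels))           # high
print(p_spam("statistics meeting notes", counts, labels))  # low
```

The filter never sees the new message during training. Everything it says about that one message comes from aggregate word counts, updated by the words the message actually contains.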

This strikes me as a case of taking aggregate information and applying it to the individual.

I think Mark is both right and wrong here. Right in that the data that the commenter is using is very problematic, especially how the commenter uses it. However, the idea of using aggregate data to help us make probability statements about individuals strikes me as quite in line with Bayesian reasoning.

By Steve, at 2:35 PM
So if the DoJ says (pdf, pg.2): "For those cases in which the victim-offender relationship is known, husbands or boyfriends killed 26% of female murder victims, whereas wives or girlfriends killed 3% of the male victims", does that mean Scott Peterson's wife was 3 times more likely to have been killed by someone else? Does that tell us anything about the probability of Peterson's guilt or innocence? (OJ would presumably be even less likely to have killed Nicole. I'd assume that the number of murders by a black male ex-spouse of a white female would be a very small percentage of total murders in the US.)

And that's not even going into the legal issues of fairness or presumption of innocence. Especially if you take the racist asshole's figure of 2% false allegations, (as the science pundit pointed out) that would mean there's a 98% "probability" that the accuser is telling the truth. Is that admissible as evidence for the prosecution? They admit DNA evidence and that's only a little less than 2% more likely.

Me, I think that statistics != probability, but I don't quite have the math chops to back up that statement.

By dorkafork, at 2:09 AM
Well, it seems that part of the problem here is a confusion of past frequencies (which is what the rape statistics give us) with subjective probabilities (which is what the Bayesian requires).

One way of framing the question is whether (or better: under what circumstances) we ought to align our subjective probabilities with frequencies. It's clear that we don't *always* want to do so. Suppose I flip a particular coin once, and it comes up heads. After the first flip, what should my subjective probability be for the proposition that the second flip will turn up heads? If we always aligned subjective probabilities with frequencies, we would give the proposition probability 1--but that doesn't seem right.
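One standard way to make the coin example concrete is Laplace's rule of succession, which is what you get if you start from a uniform prior over the coin's bias. That choice of prior is an assumption brought in here for illustration; nothing in the example above commits to it. The function names are hypothetical.

```python
from fractions import Fraction

def frequency_estimate(heads: int, flips: int) -> Fraction:
    """Probability of heads set equal to the observed frequency."""
    return Fraction(heads, flips)

def rule_of_succession(heads: int, flips: int) -> Fraction:
    """Laplace's rule: posterior mean of P(heads) under a uniform prior."""
    return Fraction(heads + 1, flips + 2)

print(frequency_estimate(1, 1))   # 1
print(rule_of_succession(1, 1))   # 2/3
```

After one flip the raw frequency says heads is certain, while the uniform-prior answer is 2/3. The two only converge as the number of flips grows.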

I think the point Mark is making is that in this case--and perhaps many cases in which we apply statistics--we have good reasons for not aligning our subjective probabilities with past frequencies.

By Anonymous, at 12:35 AM
Ok... I am the anonymous poster referred to in this post. First let me apologize for the obvious. There was no malicious attempt on my part to skew the percentage. I simply flubbed it. The rest of my argument is also quite poorly made. The statistics I used were for single offender rapes, not multiple offender rapes. I'm a little hurt that I am viewed as a "racist asshole". My argument was faulty, but not racist. I can understand the heated emotions surrounding this case, but am always surprised when people revert to name calling instead of just attacking the argument. An ad hominem attack automatically reduces the credibility of the person making the accusation.

p.s. I really was quite surprised that my argument caused such a stir.

By Anonymous, at 10:20 AM
