Friday, May 12, 2006

On Applying Sampled Statistics

After my post last week about probabilities and the Duke rape case, there's been a fairly heated discussion in the comments about whether/how it's appropriate to apply statistics to compute probabilities relating to a particular selected instance. I decided it was worth taking the time to post a little bit about what statistics mean, and what that implies about how they can be applied.

Statistical analysis is primarily about measurements of groups. Broadly speaking, there are two kinds of statistics: full-population statistics, and sampled statistics.

Full-population statistics are less common. In full-population statistics, you have data about every member of the group being analyzed, and you're trying to analyze the properties of that group. Many types of manufacturing work this way: if you're Intel, and you're manufacturing chips, then after each manufacturing run you test every chip you produced. You gather a lot of information about how many have no detectable defects and, of the defective units, how many had each kind of defect.

Sampled statistics are the more common ones. They're what you use when you do not have data about every member of the group. What you have is a selected subgroup, called your sample, which is (ideally) representative of the group as a whole. If the sample is truly representative, then probabilistic descriptions of measurements performed on the sample should be very close to what you would find by measuring every member of the group. To continue with the manufacturing example: if you were building processors for use in satellites, you would want to know how much radiation exposure the processors could survive before they failed. So you might take a sample of the units and expose them to increasing levels of radiation, recording the spectrum of failure points, so that you can make a reasonable guess at how much shielding you would need to achieve a specified level of reliability.

The purpose of sampled statistics is that if the sample is representative, then measuring the sample will allow you to make probabilistic statements about the population the sample was drawn from. That's a key point: statistics are about groups, and sampled statistics are carefully done in a way that allows you to reason from samples to populations. (Put more formally, Wikipedia defines it as "Sampling is that part of statistical practice concerned with the selection of individual observations intended to yield some knowledge about a population of concern, especially for the purposes of statistical inference. In particular, results from probability theory and statistical theory are employed to guide practice.")

Because a representative sample should show roughly the same characteristics as the population as a whole, if you take a second, independent representative sample of the population, then the statistics drawn from the first sample should, with high probability, apply to the second sample. So, again with the manufacturing example: suppose I've produced 20,000 processors for satellites. From those, I've taken a testing sample of 1000 units, exposed them to various levels of radiation until they failed, and computed the mean and the standard deviation of the amount of radiation that caused them to fail. Now, I want to deliver 1000 units to a customer. If my testing sample was representative, and the set of 1000 units that I deliver to my customer is equally representative, then the data I generated from my radiation exposure tests should describe the failure properties of the set of units I delivered.
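To make the 20,000-unit example concrete, here is a minimal simulation sketch. Everything numeric in it is an assumption made for illustration: the failure doses are drawn from a normal distribution with a made-up mean and spread, and the function name simulate is hypothetical. It only shows that two disjoint random samples of 1000 units tend to have very similar summary statistics.

```python
import random
import statistics


def simulate(seed=0, population_size=20_000, sample_size=1_000):
    """Compare a testing sample and a delivered sample drawn from one production run."""
    rng = random.Random(seed)
    # Hypothetical failure doses in arbitrary units; the normal shape,
    # the mean of 100 and the spread of 15 are assumptions, not real data.
    population = [rng.gauss(100.0, 15.0) for _ in range(population_size)]
    rng.shuffle(population)
    # Two disjoint random samples: the tested units are destroyed,
    # so the delivered units must come from the remainder.
    test_sample = population[:sample_size]
    delivered = population[sample_size:2 * sample_size]
    return {
        "population": (statistics.mean(population), statistics.pstdev(population)),
        "test_sample": (statistics.mean(test_sample), statistics.stdev(test_sample)),
        "delivered": (statistics.mean(delivered), statistics.stdev(delivered)),
    }


if __name__ == "__main__":
    for name, (mean, sd) in simulate().items():
        print(f"{name:12s} mean={mean:7.2f}  sd={sd:6.2f}")
```

The agreement depends on both samples being chosen at random from the same run. If the delivered units were picked in some systematic way (say, all from the last hour of production), nothing in this sketch would justify carrying the test results over to them.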

But the key there is the idea of the representative sample. A representative sample should have roughly the same distribution of properties as the population. If the sample is not representative, then we cannot reason from or about it.

Now, the issue that came up in comments is: can we use statistics to reason about a specific member of a population? The answer is always no. Because a specific, individual member of a population can never be a representative sample. And you can only reason to and from representative samples of the population.

• Interesting point. How does the prohibition apply in this case: The Body Mass Index, a rough ratio of weight to height correlates to various health problems. If my doctor tells me "your BMI is too high, lose some weight," isn't he applying sampled statistics to an individual case?

This seems pretty common in the medical world.

By  ArtK, at 12:04 PM

• Can we use statistics to reason about a specific member of a population? The answer is always no.

It's true that you can't make any assertions about a given member that you can be certain of, but you can make probabilistic statements about them.

For instance, let's say that you have determined that your chip manufacturing process produces faulty chips at a rate of one per thousand chips manufactured. Also assume that a random sample of the chips produces a representative sample.

Now pick a single random chip. You can say that this chip has a chance of being faulty, and in fact, that it has a 0.1% chance of being faulty.

So it's not really correct to say that you can't use statistics to say anything about individual members of a population. It would be truer to say that you can only make probabilistic statements about them.

It gets a bit more complicated when you introduce the idea that the failure rate of the manufacture process is only determined to a particular level of certainty, and so any statements you make based on that error rate need to reflect that, but even so, it's possible to take those things into account and make real and meaningful statements about individual items in a population.
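As a rough sketch of what "determined to a particular level of certainty" can look like in practice, the snippet below computes an approximate 95% Wilson score interval for a defect rate. The choice of the Wilson interval, the function name wilson_interval, and the counts used are assumptions for illustration only.

```python
import math


def wilson_interval(defects, n, z=1.96):
    """Approximate Wilson score interval for a defect rate (z=1.96 gives about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = defects / n
    denom = 1.0 + z * z / n
    centre = (p_hat + z * z / (2.0 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1.0 - p_hat) / n + z * z / (4.0 * n * n)) / denom
    return max(0.0, centre - half_width), min(1.0, centre + half_width)


if __name__ == "__main__":
    # Both hypothetical test runs observe a 0.1% defect rate,
    # but the larger run pins that rate down much more tightly.
    for defects, n in [(10, 10_000), (1_000, 1_000_000)]:
        low, high = wilson_interval(defects, n)
        print(f"{defects}/{n}: between {low:.5f} and {high:.5f}")
```

The point of the comparison is that "0.1% chance of being faulty" is a stronger statement when it rests on a million tested chips than when it rests on ten thousand.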

By  TWAndrews, at 12:07 PM

• How does the prohibition apply in this case: The Body Mass Index, a rough ratio of weight to height correlates to various health problems. If my doctor tells me "your BMI is too high, lose some weight," isn't he applying sampled statistics to an individual case?

This seems pretty common in the medical world.

It's quite common, and it's often wrong, or at least misleading.

Take the case of BMI. Assume that BMI is correlated with, for instance, heart disease, and we have good biological reasons (though not necessarily statistical ones) for believing that the relationship is causative.

Even making that assumption, it is not possible to say "Your BMI is too high, you're going to have a heart attack." What is true to say is that people with a high BMI have a higher risk of heart disease, and that by lowering your BMI, you might lower your risk. But there is no way to know what the actual effect for any individual really is.

By  TWAndrews, at 12:56 PM

• artk:

There is some ability to make probabilistic statements - but it's strongly based on the assumption that the individual is a typical member of the population - that is, that they are a representative sample of size one. It's a terrible assumption.

There are probabilistic statements that you can make - but you need to attach uncertainty measurements to them. For example, you can say that there is a 70% probability that a randomly selected individual has properties within 1 standard deviation of the mean.
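For reference, the "within 1 standard deviation" figure can be checked directly if one assumes the property is normally distributed. That assumption is an addition here, and the helper name prob_within is hypothetical. Under it, the probability comes out at about 68%, which is the source of the rough 70% figure.

```python
from statistics import NormalDist


def prob_within(k_sigma=1.0):
    """Probability that a normally distributed value lies within k standard deviations of the mean."""
    standard_normal = NormalDist(mu=0.0, sigma=1.0)
    return standard_normal.cdf(k_sigma) - standard_normal.cdf(-k_sigma)


if __name__ == "__main__":
    print(f"within 1 sd: {prob_within(1.0):.4f}")  # about 0.68 for a normal distribution
```

For a distribution that is skewed or heavy-tailed, the figure can be noticeably different, so the number is only as good as the normality assumption.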

To give a specific personal example: I'm built tall and skinny. My current BMI is 24 - which is supposed to be healthy. But I'm at least 20 pounds overweight for my height, even though for a "normal" person, that weight would give me a BMI in the "unhealthily underweight" range.

Now - you *can* use the normal BMI calculations to say that someone with my weight and height is healthy with roughly 70% certainty.

By  MarkCC, at 1:41 PM

• Mark,
It's funny, I was talking about how you have to be careful with sampling statistics as well here.
Excellent point, as usual.

By  franky, at 1:59 PM

• Now, the issue that came up in comments is: can we use statistics to reason about a specific member of a population? The answer is always no.

Bayesians would disagree with you. Convenient of you to leave that out of your post. We can make probabilistic statements about individuals and use the statistics from the population to help inform such statements and revise them.

There is some ability to make probabilistic statements - but it's strongly based on the assumption that the individual is a typical member of the population - that is, that they are a representative sample of size one. It's a terrible assumption.

This is simply false. For example in the Duke rape case, it isn't just assuming that she is a typical woman who makes a rape accusation, we'd also look at all the evidence and condition on that. This would not necessarily lead to the same conclusion that she is "representative".

There are probabilistic statements that you can make - but you need to attach uncertainty measurements to them. For example, you can say that there is a 70% probability that a randomly selected individual has properties within 1 standard deviation of the mean.

This contradicts the final conclusion of your initial post. I think you need to revise it. Heck replace it with the paragraph above.

By  Steve, at 2:55 PM

• This post totally misses the point, deliberately I think, of the “fairly heated discussion in the comments about whether/how it's appropriate to apply statistics to compute probabilities relating to a particular selected instance.”

The original discussion had a racial dimension, which tends to overwhelm common sense, so let’s instead consider a more abstract example. Say the discussion concerned a trait x in a population with mean trait value mu. As it happens, the population can be subdivided into a number of discrete subpopulations, each subpopulation i having a different mean value mu_i of the trait. Of course, averaged over all subpopulations i the mean trait value is still mu. Now suppose we are confronted with an individual z and we want to estimate z’s value of x. If we don’t know what subpopulation z is from, the best estimate would be mu. But suppose we have the additional piece of information that z happens to be a member of subpopulation i. What would now be the best estimate of the z’s x-value? Clearly it would be mu_i. Mark would have you disregard this piece of information if I understand his argument correctly.
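A small sketch of the mu versus mu_i argument, using made-up trait values for two hypothetical subpopulations. The numbers, the names, and the use of mean squared error as the measure of a "best" estimate are all assumptions for illustration.

```python
import statistics


def mean_squared_error(values, guess):
    """Average squared distance between each value and a single guess."""
    return statistics.fmean([(v - guess) ** 2 for v in values])


# Made-up trait values for two hypothetical subpopulations.
SUBPOPULATIONS = {
    "A": [1.0, 2.0, 3.0],
    "B": [7.0, 8.0, 9.0],
}


def compare_guesses(subpopulations):
    """For each subpopulation, return (error when guessing mu, error when guessing mu_i)."""
    everyone = [v for values in subpopulations.values() for v in values]
    mu = statistics.fmean(everyone)
    result = {}
    for name, values in subpopulations.items():
        mu_i = statistics.fmean(values)
        result[name] = (
            mean_squared_error(values, mu),
            mean_squared_error(values, mu_i),
        )
    return result


if __name__ == "__main__":
    for name, (err_mu, err_mu_i) in compare_guesses(SUBPOPULATIONS).items():
        print(f"{name}: error guessing mu = {err_mu:.2f}, guessing mu_i = {err_mu_i:.2f}")
```

In this toy setup the subpopulation means are known exactly. When mu_i has to be estimated from only a handful of observations, the estimate is itself noisy, and the advantage shown here can shrink or disappear.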

A less abstract example: what is the probability that a random individual has a dick? If you don’t know that individual’s gender, and if there are equal numbers of females and males, the answer is 50%. But if you have the additional information that the individual is male, you would adjust your estimate to 100% (I am here assuming that all males have dicks).

By  The Reverent Bayes, at 6:42 PM

• bayes:

You *cannot* assign a specific probability to a property of an individual based on the properties of a sub-population within a sample.

You *cannot* assign specific properties to a member of a subgroup *at all*. If you want to try to reason downward from properties of the population to properties of an individual, the best you can do is to assign a probability with an associated uncertainty value. That is an incredibly important key point. If you don't want to mess with uncertainty values (which are, frankly, misery to compute), then you can't assign a probability at all. But *any* bare probability number assigned to an individual is invalid, period.

In the Duke case - you *cannot* assign a meaningful uncertainty value. Normally, you start with the uncertainty value associated with the full population; and then you need to add in an uncertainty value associated with the sub-sampling of data about the sub-population. When you compute the uncertainty factor for the sub-population, one of the things you need to do is determine whether you have a large enough sub-population for the data associated with it to be meaningful.

In the phony numbers generated by the asshole in the post about the rape case, the "sub-sample" that he used - black women who reported rapes by white men - was so small compared to the full population (9 out of a population of 131,000) that it cannot be used to generate meaningful measurements about members of the subgroup.

By  MarkCC, at 7:30 PM

• "Statistical analysis is primarily about measurements of groups."

Nooooo! It's about measurements, and extracting information from them. We need some assumption about the errors having a common property: i.e. that they are random (more formally, that they are iid, or exchangeable). But beyond that, the measurements can be from anything: a population, an individual, several individuals, etc. As long as we can estimate the magnitude of the error term, we can then make predictions at whatever level is in the model.

For some practical examples of when this is done, look at the econometric literature, where time series are analysed. Or the animal breeding literature, where one of the main aims is to estimate breeding values: i.e. the genetic quality of individuals from data on the quality of a large number of animals in the breeding design.

If you don't believe me, then there is this paper (pdf) which describes some of the modern tools that we use, and also includes this sentence:

"In our example, we can predict the radon levels for new houses in an existing county, or for a new county."

This is frustrating: I know you're horribly wrong, but I don't want to pull rank (I am a statistician), and I enjoy the blog too much to start a fight.

Bob

By  Bob O'H, at 3:38 AM

• Bob O'H,

Are you the Bob O'H that used to work with metapop-Hanski?

By  The Reverent Bayes, at 8:33 AM

• " The Reverent Bayes said...
Bob O'H,
Are you the Bob O'H that used to work with metapop-Hanski? "

The same. I even witnessed Ilkka dancing on a table with his secretary to ABBA.

Hmm, Dutch, Bayesian, knows the MRG. You didn't have several cartoons in your thesis, including one with an Abbey Road theme, did you?

Bob

By  Bob O'H, at 3:08 PM

• It troubles me when people argue against prejudice based upon stereotypes by insisting that the stereotype is based on incorrect statistics (e.g. "Cretans are not liars") because that carries the unspoken assumption that prejudice based upon accurate statistics is acceptable. I have a somewhat similar concern regarding your mathematically more sophisticated objection.

While it is true that you cannot draw conclusions about an individual member of a population based upon sampled statistics, this doesn't resolve the original problem, because you can circumvent that objection by using sampled data to develop a policy that achieves increased reliability on the average by taking statistical information into account.

Let's stick with the "99% of Cretans are liars" example, and assume for the sake of argument that it is statistically accurate. It is true that this statistic does not allow one to determine whether a particular Cretan is telling the truth. Nevertheless, a policy that says "In the absence of other evidence, always take the word of a non-Cretan over the word of a Cretan" will yield the correct result more often than a policy that says "in the absence of other evidence give conflicting testimony equal weight regardless of national origin."

I'd argue nevertheless that automatically rejecting the word of Cretans is wrong, not because it is mathematically incorrect, but because it is unjust. In other words, if there is a conflict between individual justice and statistical justice, individual justice should prevail. But this is not a mathematical conclusion but an ethical or a practical one: ethical if you think that it is inherently wrong to penalize an individual for the actions of others, practical if you think that basing judgments on sampled statistics discourages individuals from distinguishing themselves, and/or discourages desirable change averaged over the group (i.e. there is less incentive for Cretans to become more honest if nobody will believe them, anyway).

By  trrll, at 6:46 PM

• Mark,

Are you doing this on purpose here? You write,

You *cannot* assign specific properties to a member of a subgroup *at all*. If you want to try to reason downward from properties of the population to properties of an individual, the best you can do is to assign a probability with an associated uncertainty value.

Those of us here advocating the Bayesian view point are making precisely this argument, but you pretend we aren't. You may know math, but your grasp of logic is stunningly bad. You keep trotting out this strawman argument as if it will impress us.

Further, such probability arguments are a form of reasoning from the population (sample or otherwise) to the individual. Now as a strict Frequentist you may have some justification for your position, but Frequentist statistics is not the entire universe of statistics.

But *any* bare probability number assigned to an individual is invalid, period.

This is also wrong, IMO. Unless I'm mistaken, I could select a prior for lying to be 50/50. Then, as I learn new information, I update that prior using Bayes' theorem.

By  Steve, at 12:02 PM

• Sorry I've been lagging on following the comments here; I've been fighting off a miserable sinus infection.

steve: I think we're talking at cross-purposes here. I'm trying to say "You can't take a population statistic and apply it to an individual to get an accurate prediction of the properties of the individual"; you're saying "You can use a population statistic to give you an initial estimate of the properties of an individual, which you can then revise as you get more data about the individual".

My point is: the initial estimate that you get from the population statistic *is wrong*. If it wasn't wrong, it wouldn't need to be revised in light of additional information. In sampled statistics, if we discover new information that shows that our sample didn't represent the population correctly, we conclude that the sample was invalid.

I also stand by the statement that we can not create a *bare* probability for an individual from a population statistic: by a bare probability, I mean a probability without an uncertainty measure. In your example, you say we can make an initial estimate of the probability of someone lying as being 50%. Then update that figure as you get more data. What I'm saying is that the bare number probability .5 is incomplete - because it *must* have an uncertainty measure attached. The uncertainty measure is what allows you to revise that probability: the probability of them lying *doesn't change*: your data about them changes, allowing you to update the probability and reduce the uncertainty. Just stating the bare probability is wrong.
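One way to picture "a probability with an uncertainty measure attached" is a Beta distribution over an unknown rate. The sketch below is only an illustration and not something proposed in this thread: the Beta model, the function names, and the observation counts are all assumptions. It starts from a flat prior with mean 0.5 and shows the mean moving and the spread shrinking as observations arrive.

```python
import math


def beta_summary(a, b):
    """Mean and standard deviation of a Beta(a, b) distribution."""
    mean = a / (a + b)
    variance = (a * b) / ((a + b) ** 2 * (a + b + 1))
    return mean, math.sqrt(variance)


def update(a, b, lies, truths):
    """Conjugate Bayesian update: add the observed counts to the Beta parameters."""
    return a + lies, b + truths


if __name__ == "__main__":
    prior = (1, 1)  # flat prior: mean 0.5 with a wide spread
    # Hypothetical observations: 3 statements found false, 7 found true.
    posterior = update(*prior, lies=3, truths=7)
    for label, (a, b) in [("prior", prior), ("posterior", posterior)]:
        mean, sd = beta_summary(a, b)
        print(f"{label:9s} mean={mean:.3f}  sd={sd:.3f}")
```

Read this way, the bare number 0.5 is the mean of the prior, and the standard deviation is the attached uncertainty that the new data reduces.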

By  MarkCC, at 9:00 AM

• My point is: the initial estimate that you get from the population statistic *is wrong*. If it wasn't wrong, it wouldn't need to be revised in light of additional information. In sampled statistics, if we discover new information that shows that our sample didn't represent the population correctly, we conclude that the sample was invalid.

Yes and no. The initial probability assessment is an estimate, and like many estimates it can be wrong. Even an estimate where you did your sampling perfectly in terms of procedure can be wrong. You could have just gotten a strange sample. In fact, the Bayesian method shows some of the silliness with the Frequentist approach. Still, this chance of "being wrong" doesn't invalidate statistics or reasoning from the population to the individual.

I also stand by the statement that we can not create a *bare* probability for an individual from a population statistic: by a bare probability, I mean a probability without an uncertainty measure.

I am not sure what you mean by an uncertainty measure here. We are talking about a probability which itself tells us something about how uncertain we are.

In your example, you say we can make an initial estimate of the probability of someone lying as being 50%. Then update that figure as you get more data. What I'm saying is that the bare number probability .5 is incomplete - because it *must* have an uncertainty measure attached.

I'm sorry you are simply wrong here. Both of the other Bayesians here would tell you as much. With the Bayesian approach you are stating your uncertainty about whatever you are interested by making such probability assessments. The 50/50 probability is considered an uninformative prior--i.e. it is used when you have no other information.

The uncertainty measure is what allows you to revise that probability: the probability of them lying *doesn't change*: your data about them changes, allowing you to update the probability and reduce the uncertainty. Just stating the bare probability is wrong.

Uhhhmmm no. I'm becoming more and more convinced you are not at all familiar with Bayes' Theorem and how Bayesians use it. The probability does change. Seriously, try getting a book on Bayesian statistics and reading about it. It will, literally, give you a whole new perspective on statistics. It sure did for me.

By  Steve, at 9:58 AM

• Mark, you might like "Probability Theory. The Logic of Science", by the late ET Jaynes. Wonderful book, and most of it is for free on the web (just Google it). After reading that, you're going to be a Bayesian, with probability 100%.

By  The Reverent Bayes, at 12:21 PM

• I left this comment on the original post last night, but it seems relevant here:

Well, it seems that part of the problem here is a conflation of past frequencies (which is what the rape statistics give us) with subjective probabilities (which is what the Bayesian requires).

One way of framing the question is whether (or better: under what circumstances) we ought to align our subjective probabilities with frequencies. It's clear that we don't *always* want to do so. Suppose I flip a particular coin once, and it comes up heads. After the first flip, what should my subjective probability be for the proposition that the second flip will turn up heads? If we always aligned subjective probabilities with frequencies, we would give the proposition probability 1--but that doesn't seem right.
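The coin example can be worked through in a few lines. As an illustration of one alternative to the raw frequency, the sketch uses Laplace's rule of succession, which corresponds to starting from a uniform prior over the coin's bias. Choosing that rule, and the function names, are assumptions made here.

```python
from fractions import Fraction


def frequency_estimate(heads, flips):
    """Raw observed frequency of heads."""
    return Fraction(heads, flips)


def rule_of_succession(heads, flips):
    """Laplace's rule of succession: (heads + 1) / (flips + 2), from a uniform prior."""
    return Fraction(heads + 1, flips + 2)


if __name__ == "__main__":
    heads, flips = 1, 1  # one flip, which came up heads
    print("frequency:         ", frequency_estimate(heads, flips))   # 1
    print("rule of succession:", rule_of_succession(heads, flips))   # 2/3
```

After a single head the raw frequency says 1, while the rule of succession says 2/3. With many flips the two estimates converge, which is one way to see when aligning a subjective probability with a frequency becomes reasonable.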

I think the point Mark is making is that in this case--and perhaps many cases in which we apply statistics--we have good reasons for not aligning our subjective probabilities with past frequencies.

By  jdkbrown, at 3:20 PM