TCS Daily

Information Awareness

By Arnold Kling - February 25, 2003 12:00 AM

Thomas Bayes, an 18th-century pioneer in statistics, has been in the news quite a bit lately. One recent story, for example, describes Microsoft's plan to incorporate Bayesian analysis into its software.

"The technology will be embedded in future Microsoft software and is intended to let computers and cell phones automatically filter messages, schedule meetings without their owners' help and derive strategies for getting in touch with other people."

As someone who teaches statistics, I get very excited by Bayes' Theorem. For me, it is one of the high points of the course.

How Bayes' Theorem Works

Suppose that your doctor says that you have microscopic hematuria (blood in your urine that is visible under a microscope), and the doctor recommends that you undergo a very uncomfortable procedure to test for kidney cancer. He says that 80 percent of people with kidney cancer also have blood in their urine.

What you are interested in is not the 80 percent probability that the doctor gave you for someone with kidney cancer having hematuria. You want the inverse probability, which is the probability that someone with hematuria will have kidney cancer. That is where Bayes can help. Bayes' Theorem is also known as the Inverse Probability Law.

Bayes would reason as follows. Let A be the event "you have kidney cancer." The probability of A in the entire population is about .00005, or one in 20,000. Let B be the event "you have microscopic hematuria." The probability of B in the entire population is about .10, or one in ten.

The joint probability of A and B is the probability of A (.00005) times the probability of B given A (.80), or .00004. The probability of A given B is equal to the probability of A and B (.00004) divided by the probability of B (.10), or .0004. So, you have a 4 in 10,000 probability of kidney cancer, which might not be a significant enough chance to warrant going through an uncomfortable procedure.
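For readers who like to check the arithmetic, here is a minimal Python sketch of the calculation above. The variable names are my own; the three input figures are the ones quoted in the text, not independent medical data.

```python
# Inputs quoted in the text above.
p_cancer = 0.00005               # P(A): kidney cancer in the whole population
p_hematuria = 0.10               # P(B): microscopic hematuria in the whole population
p_hematuria_given_cancer = 0.80  # P(B given A): the doctor's 80 percent

# Joint probability of A and B: P(A) times P(B given A).
p_both = p_cancer * p_hematuria_given_cancer

# Bayes' Theorem: P(A given B) = P(A and B) / P(B).
p_cancer_given_hematuria = p_both / p_hematuria

print(f"P(A and B)   = {p_both:.5f}")                    # 0.00004
print(f"P(A given B) = {p_cancer_given_hematuria:.4f}")  # 0.0004
```

Note how much the answer depends on the two population-wide rates. The doctor's 80 percent figure by itself says very little about your own risk.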

How Bayes Fights Spam

At a recent conference on spam, several presenters reported progress using Bayesian spam filters. I have tested one of the products, called POPFile, and the filtering is excellent. However, it is an Open Source product, so non-techies may have trouble getting it running.

The most common non-Bayesian spam filters use rules. A rule might say that an email with the phrase "Enlarge your penis!" is spam.

By contrast, a Bayesian filter takes into account not just the phrases that are commonly found in spam but also the phrases that are commonly found in legitimate email. The filter calculates two scores for each email - one based on how often its phrases appear in spam and one based on how often they appear in legitimate email. It then compares the two scores to decide whether the email more closely resembles spam or legitimate email. The result is a very powerful filter.
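The two-score comparison can be sketched in a few lines of Python. This is a toy illustration only, assuming a simple "naive Bayes" word-count model with add-one smoothing; it is not a description of how POPFile or any other product is actually implemented. The function names (train, score, classify) and the four training messages are invented for the example.

```python
import math
from collections import Counter

def train(messages):
    """Count how often each word appears in a list of messages."""
    counts = Counter()
    for text in messages:
        counts.update(text.lower().split())
    return counts

def score(text, counts, vocab_size):
    """Sum of log word probabilities for one class, with add-one smoothing."""
    total = sum(counts.values())
    return sum(
        math.log((counts[word] + 1) / (total + vocab_size))
        for word in text.lower().split()
    )

def classify(text, spam_counts, ham_counts, p_spam=0.5):
    """Compute one score per class and return the label with the higher score."""
    vocab_size = len(set(spam_counts) | set(ham_counts))
    spam_score = math.log(p_spam) + score(text, spam_counts, vocab_size)
    ham_score = math.log(1 - p_spam) + score(text, ham_counts, vocab_size)
    return "spam" if spam_score > ham_score else "legitimate"

# Invented toy training data (hypothetical, for illustration only).
spam_counts = train(["cheap pills buy now", "buy cheap watches now"])
ham_counts = train(["meeting agenda for monday", "lunch on monday at noon"])

print(classify("buy cheap pills", spam_counts, ham_counts))           # spam
print(classify("agenda for lunch meeting", spam_counts, ham_counts))  # legitimate
```

The point to notice is that the legitimate email is doing half the work. Without it there is only one score and nothing to compare it against, which is why a filter trained on your own mail tends to do better than a fixed list of rules.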

How Bayes Could Fight Terrorism

One of the reasons that I am in favor of using a national data mining project - such as the much-reviled Total Information Awareness program - to fight terrorism is that I want the power of Bayes' theorem on our side. I believe that just as Bayesian filters are able to defeat spam, Bayesian data mining will be able to identify terrorists.

One complaint about a national database is that it will include information on decent, law-abiding citizens. From the standpoint of Bayes' theorem, this is a feature and not a bug. A Bayesian spam filter works by comparing how closely an email resembles spam against how closely it resembles legitimate email. In the same way, a terrorism-prevention application will work best if it can compare how closely an individual's behavior resembles that of a terrorist against how closely it resembles that of a decent, law-abiding citizen. With Bayesian data modeling, maybe the fact that someone is a scholar at the Brookings Institution means that we do not have to sweep him up and detain him at an immigration facility.

In fact, the relationship between national origin and terrorism provides another excellent classroom illustration of inverse probability. Although the September 11 hijackers were of Arab national origin, one can estimate that the probability that someone of Arab national origin is a terrorist is quite low. Such a calculation reinforces the opposition to racial profiling of Arab-Americans that George Bush expressed as a candidate for President.

I believe that the government should be collecting data on behavior and should stop collecting data on ethnicity. Unfortunately, Congress and misguided civil libertarians want it the other way around.

Coping with Information Overload

Progress in computers is increasing both the demand for and the supply of Bayesian statistical analysis. On the supply side, computers are able to store and process large amounts of information, so that Bayesian computation has become much more practical. On the demand side, computer-enhanced communication threatens us with information overload. As companies like Microsoft are finding, Bayesian filters provide a helpful tool in filtering and prioritizing messages.

As consumers, we are going to encounter products and services that use statistical filters. (Already, they are widely used in the decision to grant consumer credit.) I believe that to be an informed citizen in the modern world, it is important to have some grasp of Bayes' Theorem.
