“An approximate answer to the right problem is worth a good deal more than an exact answer to an approximate problem,” John Wilder Tukey once observed. Used carelessly, data and statistics can obscure a problem rather than illuminate it.

And as Douglas Adams observed in The Hitchhiker’s Guide to the Galaxy, “Forty-Two” isn’t much of an answer to the Great Question of Life, the Universe, and Everything, if one doesn’t know what the question actually means.

How do you know whether the numerical evidence presented in a news report, an article, or a study really supports the arguments the author is proposing? That’s the beauty and the pitfall of numerical data: it’s exact, but exactly what does it tell you? As consumers of data in all kinds of information, policy, and decision-making settings, it’s sometimes nearly impossible to know how to interpret the numerical and statistical evidence we see. Statistics themselves don’t lie, but they can be used to mislead. It is their very precision that makes statistics so seductive and convincing.

The following are just a few common problems that arise in dealing with data and statistical measures. One is the ill-defined question. Another is the quality of the raw data—skewed, poorly selected, or unsuitably defined for the question at hand. Yet another is the problem that arises when the scale used to present the data is a poor match for the type of data used. A few words on these below.

First let’s consider scale—the units used to report the data. They affect our impression of how significant the data is, in real but sometimes subtle ways.

In competitive swimming, for example, races are won or lost by hundredths of a second, as Olympic swimming results will attest. To evaluate the skill and power of each swimmer from swim times, we need results reported in hundredths of a second, the relevant interval for this sport. We can create a very different impression of an Olympic swim by switching the reported results from hundredths of a second to tenths of a second, to whole seconds, or to five-second intervals. As the unit of time grows, the drama of the competition—legitimate in this case—begins to disappear. Over five-second intervals, all Olympic swimmers will appear pretty much interchangeable, and we may wonder what the fuss over such a ho-hum sport is about. But Olympic competitors are very closely matched, so differences between them must be sifted through a very fine ‘sieve.’
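A few lines of Python make the coarsening concrete. The times below are invented, not real Olympic results; the only thing that changes from line to line is the reporting interval:

```python
# Hypothetical race finishes (in seconds); the point is the reporting
# scale, not the data. Rounding to a coarser interval erases the race.
times = [102.31, 102.34, 102.47]

for step in (0.01, 0.1, 1, 5):
    coarse = [round(t / step) * step for t in times]
    print(f"interval {step:>4} s: " + ", ".join(f"{c:.2f}" for c in coarse))
# At 0.01 s all three swimmers are distinct; rounded to whole seconds
# they already tie, and at 5 s the race vanishes entirely.
```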

By comparison, a much coarser measurement is sufficient for logging time worked in the average American workweek. Hours, days, or some fraction of these usually tell us all we need to know. A to-the-minute (or to-the-second) report adds little information and may cost more to obtain than any benefit we could derive from it.

It’s logical. Swimmers work in literally ‘split’ seconds; generally, employees work in ‘split’ hours. If we move away from the most appropriate scale in a presentation, we begin either to exaggerate or to downplay the significance of changes in whatever data we have.

Aggregation, or the way we define categories of data, is another factor to consider. The kinds of questions we can answer with numerical research and analysis are affected by how we assemble our facts. Consider a data set designed to measure the number of uninsured non-elderly males in North Carolina, in which all 19-64 year-old uninsured men in the state have been grouped together into one statistic. That’s not necessarily a problem, unless you want to propose an insurance access or health care policy to bring the total number of uninsured males down. The issue? ‘Nineteen to sixty-four year olds’ is a highly heterogeneous group. So massive a category is surely too coarse a sieve to differentiate the insurance wants and needs of men whose ages differ by as much as forty-five years. As a result, little useful insight can be garnered, especially for policy, either on the adequacy of health and insurance delivery systems or on other questions we might like to address.
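To see why the bin width matters, here is a minimal sketch in Python. The ages are invented purely for illustration and reflect no actual North Carolina data:

```python
from collections import Counter

# Invented ages of uninsured men, for illustration only (not real data).
ages = [19, 22, 24, 25, 27, 31, 38, 44, 52, 58, 61, 63]

# One coarse bin: a single number, and every distinction is lost.
print("uninsured, 19-64:", sum(19 <= a <= 64 for a in ages))

# Decade-wide bands begin to show where the uninsured actually cluster.
bands = Counter(10 * (a // 10) for a in ages)
for start in sorted(bands):
    print(f"ages {start}-{start + 9}: {bands[start]}")
```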

And what about basic statistical measures, like averages? When we don’t have good individual data, or want a ‘typical’ picture of a group, we sometimes use the arithmetic mean, or ‘average.’ This can work well to give an overall picture, except when there are ‘outliers,’ atypical elements that distort our calculated average. In that case, we don’t get a very good general picture at all. For example, if three people in a room have $999.00, $5.00, and $20.00 respectively in cash in their pockets, the arithmetic average (mean) of cash on hand is $341.33. That figure is exact but typical of no one: nobody is holding that amount (or anything close to it), and one of the three is carrying an amount far out of line with the other two. As a result, the mean (average cash) value tells us little about what is most typical for this group.
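For anyone who wants to check the arithmetic, here is the same example in a few lines of Python, using the standard-library statistics module:

```python
from statistics import mean, median

cash = [999.00, 5.00, 20.00]           # the three pocketfuls from the example
print(f"mean:   ${mean(cash):.2f}")    # $341.33, held by no one in the room
print(f"median: ${median(cash):.2f}")  # $20.00, far closer to 'typical'
```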

Other kinds of averages, the median and the mode, convey different information from identical data. Information on the ages of all N.C. public school children in April of their first-grade year would yield a mode—whatever age appears most frequently in the data set. The median is that age for which half of the children are older—say six years and nine months—and the other half of the first graders are younger.
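Here is a small illustration in Python of how the three measures summarize the same data differently. The ages are invented; 81 months corresponds to the six years, nine months in the text:

```python
from statistics import mean, median, mode

# Invented first-grader ages, in months (81 months = six years, nine months).
ages_months = [78, 80, 81, 81, 81, 82, 84, 85, 96]

print("mean:  ", round(mean(ages_months), 1))  # pulled up by the oldest child
print("median:", median(ages_months))          # half older, half younger: 81
print("mode:  ", mode(ages_months))            # the most frequent age: 81
```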

Can I lie with statistics? In a way. By choosing carefully, I can pick which results I allow to emerge from my investigations. Here are a few (completely unscientific but computationally correct) examples I created. According to data I collected on age at death, reported in a Raleigh newspaper during August of 2007, the mean age at death for one and the same day in August could truthfully be reported as 70.73 years, 69.93 years, or 76.68 years. How did these different results come about? It’s what’s left out, and what we don’t know behind the calculation, that makes the difference. If I wanted to support a particular result, ‘cherry picking’ my data set would allow me to report the results that most closely match my own views, a well-known problem that occurs in many scientific and social disciplines.
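My newspaper data set isn’t reproduced here, so here is a stand-in sketch in Python with invented ages. It shows the mechanics of cherry picking, not my actual numbers:

```python
from statistics import mean

# Invented ages at death from one day's obituaries (not the article's data).
ages = [45, 58, 62, 71, 74, 79, 83, 85, 88, 91]

print("all listings:      ", round(mean(ages), 2))
print("dropping under-60s:", round(mean([a for a in ages if a >= 60]), 2))
print("first five only:   ", round(mean(ages[:5]), 2))
# Every figure is computationally correct; each supports a different story.
```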

I did a deliberately poor job in my example, of course. I arbitrarily truncated the data set, and drew numbers from one sample only (a single day) that cannot be assumed to be representative of either the city or the state. I totally ignored relevant information that may well have affected the data and the outcome, such as occupation, circumstantial and environmental factors, personal history, dispersion of age in the sample, and so on.

So what can I conclude from these numbers? Not very much. Do 50-plus-year-olds really have a better chance of reaching age 70 (and beyond) than under-50s in North Carolina? It could look that way from one of my (undisclosed) data sets, but there is far too little information here to claim that, much less to construct, say, an insurance/health care access policy on such shaky, albeit very exact, numbers.

I’m no statistician, but I think I can appreciate a good practitioner (and a good quote) when I see one. Done wrong, statistics can lead us astray. Done right, they are tools that illuminate problems and lend additional understanding to questions that vex and intrigue us. That avenue is entirely worthwhile.