The Princeton Review, long known for preparing students for college and graduate admissions tests such as the SAT, the LSAT, and the MCAT, has ranked North Carolina at the top in its first annual evaluation of state testing and accountability.

In a talk delivered to educators, education leaders, policy makers, testing experts, and others at the September 2002 Education Leaders Conference in Denver, Colorado, John Katzman of the Review explained the purpose and process of grading each state’s tests, as well as what those grades mean.

Katzman argued that the measure of a test’s worthiness changes depending upon the role the test will play in state education policy. A test that functions as a snapshot of what schools and students have been doing, without any prescriptive or policy import, is judged differently than a test that sets goals to affect behavior — the so-called high-stakes test. In the era of No Child Left Behind, all states will be engaged in some high-stakes testing.

As Katzman describes it, the purpose of a high-stakes test is to encourage people to improve and to do good things. The accuracy and precision of the test itself are less important than the incentive effects the results produce. According to Katzman, good accountability should map to good outcomes. The question is, does it?

One way to gauge how well state tests correlate with accountability measures is to compare state results to an outside, visible standard. Princeton chose the National Assessment of Educational Progress (NAEP). Rankings of states with good accountability in testing do track NAEP results, they found. In weighting the ranking index, therefore, they included correlation to improvement in NAEP scores over an eight-year period.

Not a test of rigor

The Princeton Review study is self-consciously different from other accountability studies. Unlike other measures, it does not gauge the rigor of academic standards, nor does it assess the rigor of the tests that measure those standards. This means that a state with high academic standards, using a test capable of measuring whether those standards are met, could be ranked identically to a state with low academic standards, as long as the latter also uses a test capable of measuring whether its standards are met. The fit between a test and what it claims to measure is the only relevant issue.

North Carolina garnered the number-one ranking for the 2000-01 year in the Princeton study. According to Princeton, it had the best fit between its test and the accountability criteria the researchers chose.

The four criteria used to rank a state’s accountability are academic alignment, test quality, sunshine, and policy. These areas were chosen because they reveal different aspects of the state’s testing and reporting procedures. Researchers also felt these areas would provide tools that educators could eventually use to help align classroom practice with state standards. The Princeton Review takes the position that it is a test expert, not a test supplier, with clients at the district and school level. This, in its view, enables it to fairly assess state testing practices that may fall short of good accountability standards.

Since test-based accountability is to some degree the future of education, states should be encouraged to do the best possible job of providing it, the study suggests.

It argues that an accountability program should produce the fewest unintended consequences and provide a means for improving classroom instruction. Openness, or “sunshine,” is valued because disclosure leads to stronger tests as well as a more stable political environment for education. The Princeton Review firmly believes that “those who design and implement accountability should themselves be open to scrutiny.”

The first criterion, alignment, evaluates how well state tests are aligned to academic content, knowledge, and skills, as specified by the state’s curriculum standards. There are three sub-items in this category. One looks at the number of test questions required to measure mastery, another at the degree of overlap between published curriculum standards and those actually tested, and the third at schools’ ability to choose among tests with equated standards.

North Carolina received the highest score on the first two items, and the lowest score for test choice. Each separate item carries its own weight in the scoring, and alignment as a whole represents 20 percent of the state’s ranking in the Princeton Review. The team gave North Carolina a “B+” for test alignment.

Test quality, also 20 percent of the state ranking, examines whether the tests administered are capable of determining that stated standards have been met. The reviewers determined that North Carolina had met all of the criteria in this category, which includes complete scoring, multiple types of items, an independent review, validation of items, pre-established controls, and a consistent curve.

In looking at test quality, reviewers were asking how well written items appeared to be, as well as whether they had been scored accurately and completely. States received higher scores if their tests included a variety of question types, including open-ended, performance, multiple choice, and computation. Reviewers wanted to know whether anyone other than the test writers was reviewing the test before it was placed in front of students. If scoring or achievement cutoffs for the test were not established before it was given, the state scored lower on that component.

Finally, the research team tried to determine whether scoring curves and cutoff points were consistent on a year-to-year basis, as well as across subjects. On all of these points North Carolina satisfied the research criteria for the highest score. The state received an “A” in this category.

Sunshine was somewhat more problematic for North Carolina’s accountability standards. This category represents 30 percent of the state’s weighted score. It examines how open policies and procedures surrounding the test have been, and whether they are conducive to ongoing improvement.

This is a large category in the study, covering questions about how many students are tested (level of inclusion), whether all scores are included in a school’s profile, the security of testing and scoring procedures, test score and test item release, and disaggregated information about the performance of different groups.

Because contract terms with the agencies responsible for constructing the N.C. test were not open to examination, few of the test specifications were easily available, and the release of test scores to the public took too long, North Carolina received a “B” for sunshine.

A final criterion is policy, or how accountability systems affect education in a way that is consistent with state goals. This is the largest category of the four and, like sunshine, represents 30 percent of the state’s weighted score in the rankings.

In the policy category, investigators wanted to know whether indicators besides test scores played a role in the state’s school quality measure, and how many alternate measures were used. In some states, high-stakes tests have different consequences for students than for schools, and the study reflected this. State scores were also higher if detailed data following each test were available, in a format that would help align tests with curriculum standards.

Several items in policy address questions of flexibility and public access to data. A final item looks at testing costs. For policy accountability, North Carolina earned an “A.”

Even at the top of the rankings, North Carolina scored only 178.5 out of 200 possible points. Retaining flexibility and innovation is the acknowledged challenge.

Palasek is assistant editor of Carolina Journal.