By Alexander Lane,2014-05-07 21:01

    INSERT A for Step 1



    by Royal Van Horn

    Phi Delta Kappan

    Since I teach assessment classes at the university and write this Technology column, it makes sense that I should write a column on the intersection of these two topics. I wonder why I didn’t think about doing such a column before?

    Before I get to the intersection of assessment and technology, though, I wanted to discuss a few assessment fundamentals and clarify a few terms. Assessment, as a discipline, has always been concerned with both measurement and evaluation. Measurement is the easy part; evaluation is a bit tricky. Before you evaluate a student, school, or district, you have to consider the “compared to what” issue, and therein lie the tricks.

    Classically, assessment texts have described two approaches to evaluation. In “norm-referenced evaluation,” you compare a student to a representative sample of similar students across the U.S., which is known as the norm group. Such tests as the CTBS (Comprehensive Tests of Basic Skills) and SAT (Stanford Achievement Tests) are examples of tests designed for this purpose. Criterion-referenced evaluation compares a student to a set of objectives, competencies, or standards, usually state standards measured by state tests. Unfortunately, most assessment texts forget to mention the third approach to evaluation, “improvement-referenced” evaluation. To use improvement-referenced evaluation, you have to accurately track a student’s progress and take measurements at least three times a year. Most tests designed to provide information for norm- and criterion-referenced evaluation do not work well for improvement-referenced uses, since, for a variety of reasons, they cannot be given reliably (at least in a paper-and-pencil format) three times a year.

    One of the biggest issues today is “grade-level testing.” On the surface, it seems logical to give fourth-graders a fourth-grade test. The problem is that some fourth-graders are performing at the second- or third-grade level, and giving them a fourth-grade test yields little useful information. Besides, these below-grade-level students get frustrated by tests that are much too hard for them. Grade-level testing is also frustrating for students performing above grade level. This issue prompts people to advocate or discuss OOLT (Out Of grade-Level Testing). Why not give students performing above or below grade level the tests that are appropriate for students at their level? The obvious solution to this quandary is to get the computer to custom-design tests to fit individuals. This is called Computer Adaptive Testing (CAT), and such tests fit nicely with an improvement-referenced approach to evaluation.
A CAT test is simply a test that makes continuous adjustments in the difficulty of items so that they match a student’s performance level. If a student misses an item, a slightly easier one is given. If a student gets an item correct, a slightly more difficult one is given. Since time is not wasted on items or questions that are above or below a student’s ability, relatively few items need to be answered. Testing often takes as little as 10 minutes. Obviously, a computer is necessary to quickly check an item and offer the next one, and a large item bank of questions matched to various levels is required to support such an approach.
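
    The up-and-down rule described above can be sketched in a few lines of Python. This is only an illustration, not any test-maker's algorithm: the function name `run_staircase_test`, the integer difficulty levels, the item bank, and the simulated student are all assumptions made for the example. Commercial CAT products select items with statistical models rather than a fixed one-level step.

```python
def run_staircase_test(item_bank, answer_fn, start_level, max_items=10):
    """Give one item at a time, moving one level up or down after each answer.

    item_bank: dict mapping an integer difficulty level to a list of items
               (levels are assumed to be consecutive integers)
    answer_fn: callable taking an item and returning True if answered correctly
    Returns the final level and the list of (item, level, correct) records.
    """
    lowest, highest = min(item_bank), max(item_bank)
    level = start_level
    used = set()
    history = []
    for _ in range(max_items):
        # Choose the first item at this level that has not been shown yet.
        unused = [item for item in item_bank[level] if item not in used]
        if not unused:
            break  # item bank exhausted at this level
        item = unused[0]
        used.add(item)
        correct = answer_fn(item)
        history.append((item, level, correct))
        # Slightly harder after a correct answer, slightly easier after a miss.
        level += 1 if correct else -1
        level = max(lowest, min(highest, level))
    return level, history


# Hypothetical bank: levels 1-9, five items per level, each item is (level, n).
bank = {lvl: [(lvl, n) for n in range(5)] for lvl in range(1, 10)}

# Simulated student who answers every item at level 6 or below correctly.
final_level, history = run_staircase_test(
    bank, lambda item: item[0] <= 6, start_level=5
)
```

    Under these assumptions the administered levels climb from 5 and then alternate between 6 and 7, bracketing the simulated student's performance level after only a handful of items.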

    Computer adaptive tests have numerous advantages. First, every student receives a unique test, adjusted to his or her performance level. This makes cheating virtually impossible. Second, test results can be immediately obtained, and a wide variety of reports can be generated. Third, a CAT test can be administered one-on-one in a classroom setting or to many students at once in a computer lab. Fourth, with a large item pool, a CAT test can be given regularly, for example in August, January, and May. When a CAT test is given regularly, individual student progress can be charted and evaluated. That is, you do not have to wait until you give the state achievement test in late spring to find out if a student is making progress.

    Michigan Department of Education Office of School Improvement

    MI-Map 9:4 Web-enhanced Technology to Assess Student AchievementPage 1


    The Northwest Evaluation Association, in Portland, Oregon, is a nonprofit education organization that is supported by member districts. Its revenue is generated by its development efforts, and it has a computerized item bank of more than 15,000 calibrated test items that it continually refines. Items must be tested with 300 students and must pass all the qualifications and statistical tests before being added to the item bank. This continuously refined item bank allows NWEA to design a variety of tests, including computer adaptive tests.

    One of NWEA’s main products is the computer adaptive Measures of Academic Progress (MAP) test. MAP is a set of tests in math, reading, and language usage designed around the most common set of goals from around the country. “MAP test items are referenced to the Rasch Unit Scale. This scale is the most important difference between MAP and other tests. This scale is an equal interval scale that measures a student’s academic growth similar to the way a yardstick measures physical growth.” The Rasch scale offers many advantages over the percentile scale, which rank-orders students only.

    The MAP test has been administered to more than 800,000 students and is widely used in northwestern states, such as Idaho and Oregon. NWEA will design customized tests that are aligned to state standards, which it has listed by state on its website. NWEA is also developing a Science MAP test in the areas of “Concepts and Processes” and “General Science Topics” and is expanding its coverage to include items on high school subjects, such as biology and American history.
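
    The equal-interval property claimed for the Rasch scale in the quotation above can be illustrated with the standard Rasch (one-parameter logistic) model. The sketch below is an assumption-laden illustration: abilities and difficulties are expressed in logits, the function name `rasch_p_correct` is invented for the example, and nothing here reflects how MAP scores are actually computed or reported.

```python
import math


def rasch_p_correct(ability, difficulty):
    """Probability of a correct answer under the Rasch model.

    Both arguments are on the same scale (logits, an assumption here).
    Only the gap between ability and difficulty affects the result.
    """
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


# A student facing an item exactly at his or her level has an even chance.
even_chance = rasch_p_correct(1.5, 1.5)

# A one-unit gap means the same thing low on the scale and high on the scale.
low_pair = rasch_p_correct(1.0, 0.0)
high_pair = rasch_p_correct(3.0, 2.0)
```

    Because the probability depends only on the difference between the two values, a one-unit gain represents the same amount of growth anywhere on the scale, which is the sense in which the scale works like a yardstick. A percentile rank has no such property.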

    For matters related to testing, one of the best references is the Mental Measurements Yearbook from the Buros Institute at the University of Nebraska. The Yearbook publishes reviews of hundreds of tests written by measurement professionals. Our campus library subscribes to the on-line version, so I searched it for reviews of CAT tests. The search yielded information only on STAR Reading and STAR Math from Advantage Learning Systems, the company that makes the Accelerated Reader software used in many elementary schools. There were two reviews of each test, one positive and one guarded. The reading test items consist of sentences with a missing word that the student must supply. According to the reviewers, this type of item heavily emphasizes vocabulary development and does not measure other important aspects of reading. Also, according to the reviewers, the items on STAR Math heavily emphasize computation. Schools now using Accelerated Reader software may be more interested in the STAR tests than schools that do not use the software. Advantage Learning also makes the Advantage STAR Early Literacy test. This test is not in the Mental Measurements Yearbook, and I did not have time to find much information on it. What I did find indicates that it has 2,000 items covering a wide range of readiness and early-literacy skills.

    In my study of CAT tests, I came upon the Lexia Comprehensive Reading Test (CRT), which is appropriate for kindergarten-readiness screening and use with primary-grade students. I am not sure that this is a computer adaptive test. I think it is more appropriately classified as a “computer-based test,” but that is probably not a detriment at this level. One of the members of our reading faculty happened to have a copy of the software, so I was able to load and run the program. The CRT is really four tests in one: kindergarten readiness, phonics and decoding, Dolch sight words, and a Burns and Roe informal reading inventory. To administer the test to pre-kindergarteners, a teacher or other trained person uses the computer keyboard to record students’ responses to each test item. For example, the computer will place large colored squares on the screen and ask students to point to the green square. If a student does a task successfully, the teacher pushes one key; if not, the teacher pushes a different key. The voice used to present items and tasks is that of a female with excellent enunciation. The computer screens are large, colorful, simple, and well laid out.

    The readiness portion of the CRT includes the following skills: giving first and last names, giving the names of the letters in first name, giving age, giving the names of the colors of eight crayons, writing first name with pencil, demonstrating phonemic awareness with pictures, and demonstrating phonemic awareness without pictures. The next three subtests measure additional early skills. Detailed individual and class-level reports are immediately available for print out. Lexia makes other CAT tests, and I encourage readers to visit their website. Frankly, I was impressed with how well the computer was used to deliver the CRT, but I am certainly not an early childhood expert. I recommend that early childhood educators review the Lexia, Advantage Learning, and similar tests.

    Readiness and achievement are obviously not the only things that can be assessed using computer adaptive tests. I uncovered a paper presented at the 1991 meeting of the American Educational Research Association that discussed a test that was developed in Singapore to measure eighth-grade attitudes toward science (note 2). CAT versions are also available for the Graduate Record Exam, for various college entrance exams, and for teacher certification tests. I suspect that most of the major test-makers have, or soon will have, computer adaptive versions of all their major tests.

    Indeed, I predict that computer adaptive tests will quickly become routine in schools, especially since they provide the ability to carefully monitor student progress from month to month and item to item.

    1. Kay Woodfield, “Getting On Board with Online Testing,” T.H.E. Journal, January 2003, p. 36.

    2. Yoke-Yeen and Tit-Loong Lam, “The Use of the Graded Response Model in Computerized Adaptive Testing of the Attitudes to Science Scale,” paper presented at the annual meeting of the American Educational Research Association, Chicago, 1991.

     INSERT B for Step 1



    Education and technology forces have converged this year to vault computer-based testing into the headlines, raising important questions about whether this new mode of assessment is more useful than traditional paper-and-pencil exams.

    To begin with, the increased testing requirements imposed by the "No Child Left Behind" Act of 2001—a far-reaching overhaul of federal education policy signed into law by President Bush in January 2002—have set schools scrambling to find more efficient ways to assess academic skills and get children ready for high-stakes state exams. Unlike traditional standardized tests on paper, which can take weeks or even months to score and return to schools, computer-based assessments can provide almost immediate feedback. That is arguably one of the biggest draws for educators.

    Already, 12 states and the District of Columbia have a computerized exam or a pilot project under way to evaluate the effectiveness of computer-based testing, according to a new Education Week survey of state departments of education ("State Initiatives: A Survey of State Departments of Education"). All of these tests, except for one in North Carolina and the District of Columbia exam, are administered via the Internet. In five states, officials report that computerized testing was designed to partially meet requirements under the new federal law.

    Eventually, experts predict, technology could change the face of testing itself, enabling states to mesh the use of tests for instructional and accountability purposes.

    "You've got the potential that technology could be a solution," says Wesley D. Bruce, the director of school assessment for the Indiana Department of Education, "but there is, right now, just a huge set of issues."

    Chief among them is a simple question: Do schools have enough computers to test children in this new manner? The answer in many places is no. And with most states struggling with budget deficits, it's unlikely they are going to use their limited resources to fill that void.

    Yet researchers point out that computer-based testing has the potential to be far cheaper than its printed counterpart.

    Richard Swartz, a senior research director at the Educational Testing Service, in Princeton, N.J., estimates that the actual costs of putting a test on-line and building a customized scoring model are comparable to those of developing a good paper-and-pencil exam. "Once the tests are implemented," he adds, "the difference in scoring costs is enormously in favor of the computer."

    Still, other problems with computerized assessment have emerged. One prickly issue involves the use of what is called adaptive testing, in which the computer adjusts the level of difficulty of questions based on how well a student is answering them. Proponents of this form of testing argue that it provides a more individualized and accurate assessment of a student's ability.

    But the No Child Left Behind law, a revision of the Elementary and Secondary Education Act that puts a higher premium than ever on schools' accountability for student achievement, continues to mandate that states measure student performance against the expectations for a student's grade level.

    With adaptive testing, a 7th grader, for instance, might be bumped up to questions at the 8th grade level or dropped down to the 6th grade level. As a consequence, debate is growing about whether adaptive testing can meet the purposes of the federal law, and if it doesn't, how the technology should be modified to meet the requirements.

    To give educators a head start on understanding computer-based testing, Technology Counts 2003, the sixth edition of Education Week's annual report on educational technology in the 50 states and the District of Columbia, examines these new developments from a host of angles, beginning with an analysis of the impact of the No Child Left Behind law ("Legal Twists, Digital Turns"). Surprisingly, perhaps, the story points out that the law is having the effect of both encouraging and discouraging the use of computerized testing.


    As another part of this year's focus on computer-based testing, Technology Counts 2003 takes a close look at adaptive testing, with analysis from proponents and critics, and a description of how it works ("A Question of Direction"). The upshot of the adaptive-testing debate is that developers of such assessments are worried that they may be left out of what could be the greatest pre-collegiate testing boom in history.

    Computerized assessment may turn out to have its biggest impact in the area of on-line test preparation, observers of the field say. Last year, for instance, more than 200,000 students in 60 countries signed up for the Princeton Review's on-line demonstrations of such tests as the SAT and state exit exams. Technology Counts 2003 tracks the online test prep trend ("Prepping for the Big Test").

    As educators face the new federal requirement to test all 3rd through 8th graders annually in reading and mathematics, states are experimenting with new ways of using technology to evaluate the abilities of special education students. Testing experts say that what educators learn from how to tailor assessments to the needs of special education students could also shape how they test other students who may have more subtle individual needs. This year's report examines those lessons ("Spec. Ed. Tech Sparks Ideas").

    Technology Counts 2003 also includes a story about teachers who are using computer-based testing to give classroom quizzes and tests ("The Teacher's New Test"), an examination of the benefits and drawbacks of essay-grading software ("Essay Grading Goes Digital"), an analysis of the growing business of computer-based testing ("Marketing to the Test"), and a look at national trends in educational technology.


    Snapshots of the steps each state has taken to use computer-based testing (or simply to use educational technology more effectively) are also included in the report ("State Profiles"), as are data tables with state-by-state statistics on technology use in schools ("Tracking Tech Trends").

    We hope you'll find information here that will help you understand computer-based testing and its evolving role in education.

    The Editors


    INSERT C for Step 1



    by Andrew Trotter

    Computer adaptive testing is used to test recruits to the U.S. military, for licensing nurses and computer technicians, for entrance tests to graduate school, and for a popular placement test used by community colleges, but not for academic testing in all but a handful of K-12 schools.

    Most notably, computer adaptive testing has been left out of nearly all the large-scale testing programs that states are ramping up to meet the requirements of the federal "No Child Left Behind Act" of 2001. A prime reason: The U.S. Department of Education interprets the law's test-driven accountability rules as excluding so-called "out-of-level" testing. Federal officials have said the adaptive tests are not "grade-level tests," a requirement of the law.

    "Psychometricians regard that decision as humorous," Robert Dolan, a testing expert at the nonprofit Center for Applied Special Technology in Wakefield, Mass., says of the department's stance. Adaptive tests deliver harder or easier items, depending on how well the individual test-taker is doing. They are considered out-of-level because the difficulty range could include skills and content offered in higher and lower grades.

    Dolan and other test experts concede states may have reason to say no to adaptive testing, because of cost, uneven technology levels in schools, and even educators' unfamiliarity with the method, but not because of grade-level testing.

    "The span of [test item] difficulty from easiest to hardest is entirely under the control of the test developer," says Tim Davey, the senior research director of the Educational Testing Service, based in Princeton, N.J.

    Some experts say adaptive tests give schools a better return on the time and money devoted to testing—including more accurate measurement of the proficiency of students who are above and below average, and speedier access to the test results.

    But Education Department officials say their hands are tied. "The regulations are very clear in saying all students have to be held to the same standard as the foundation for school accountability," says Sue Rigney, an education specialist in the department. "The focus here is very explicitly on the grade level the state has defined."

    Federal officials worry that out-of-level testing might lead to lower expectations for below-average students.


    They also note that states are free to use computer-adaptive tests outside the accountability purposes of the No Child Left Behind law, which requires yearly assessments in reading and mathematics of students in grades 3-8.

    But the upshot, for now, is that computer adaptive tests are left out of the federal law, along with the public attention and federal money for test development that come with it. And the developers of adaptive tests feel they are missing out on what may be the greatest pre-collegiate testing boom in history.

    'Made Us a Pariah'

    "[The Education Department's] decision made us a pariah," says Allan L. Olson, the president of the Northwest Evaluation Association, a nonprofit testing organization in Portland, Ore. The group was developing a computer adaptive test for Idaho's assessment when the department ruled its method out just over a year ago.

    Federal officials gave the same message to South Dakota and Oregon. South Dakota subsequently made voluntary its once-required computer adaptive test, and has adopted a conventional paper-and-pencil test for its statewide program. Oregon has postponed for a year the addition of a computer adaptive feature to its on-line test.

    "I think the [department's] interpretation in the case of South Dakota was based on a sort of misunderstanding of what adaptive testing does," says Davey of the ETS. He says computer adaptive tests typically span more than a single grade level—a diagnostic benefit—but they don't have to, and in any case, grade-level information is recorded for each test item.

    Researchers express puzzlement because the federal government has been deeply involved in the development of computer adaptive testing, starting with seminal research at the U.S. Office of Naval Research in the 1970s and 1980s. A decade later, Education Department grants paid for new computer adaptive reading tests in foreign languages, and department officials lauded the method's potential for school-based testing.

    David J. Weiss, one of the original leaders of the Navy research, says there is "no reason" why computer adaptive testing is not appropriate for K-12.

    Now the director of the psychometric-methods program at the University of Minnesota, Twin Cities, Weiss notes that a study of children who took such tests in Oregon for several years produced "beautiful data" on improvements in math and reading.

    Federal officials say they would consider the use of a computer adaptive test if it tested within the grade level.

    But other test experts say the federal government is right to be wary of computer adaptive testing. "The technology is not ready for prime time," contends Robert A. Schaeffer, the public education director for the National Center for Fair & Open Testing, or FairTest, a Cambridge, Mass.-based advocacy group that opposed the No Child Left Behind Act because of its testing mandates.

    He says the computer adaptive version of the Graduate Record Examination launched at ETS testing centers in 1994 was initially flawed because it had a pool of test items that was too small, and there were insufficient facilities for the number of test-takers.

    ETS spokesman Tom Ewing acknowledges those problems occurred but says they were quickly resolved through enlarging the pool of questions and improving test scheduling. But Schaeffer warns that schools could face a rougher transition, considering their budget limitations and the high stakes involved in testing.

    W. James Popham, a professor emeritus and educational testing authority at the University of California, Los Angeles, says the theoretical accuracy of computer adaptive testing does not necessarily translate into reality: "Even though [such testing] makes measurement types quite merry, they can play games with numbers and it doesn't help kids."

    Popham, a former president of the American Educational Research Association, contends that the testing technology is "opaque" to the public and policymakers.

    He says federal officials may believe the testing method could introduce loopholes into the education law. "They fear educational con artists who have historically frustrated congressional attempts to safeguard disadvantaged youngsters," Popham says, referring to educators who wish to avoid accountability. "The fear is, they'll pull a fast one and downgrade expectations."

    Zeroing In on Skills

    But proponents of adaptive, computer-based testing fear that schools may wait decades for access to a major improvement over conventional, "linear" standardized tests, which present each student with the same set of test items.

    The logic of the new tests is that of a coach pitching to a young batter: If the youngster is missing, the coach eases up a little; if not, he increases the challenge. Sooner or later, the coach zeroes in on the batter's skill level.

    Some testing experts argue that the adjustment improves test accuracy.

    "In paper-and-pencil tests, items tend to be grouped around average kids. Those in the tails of distribution, we don't get as much information about those kids," says Michael L. Nering, the senior psychometrician at Measured Progress, a testing company in Dover, N.H.

    "The great thing about adaptive testing is that it has equal precision," meaning the results are accurate at all proficiency levels, says Nering, who helped design two state assessments and developed computer adaptive tests for ACT Inc. "No matter what your ability is, whether you're really smart or not, the test will stop administering items when equal precision is [reached]."

    By contrast, most of the items on conventional tests—on paper or computer—are aimed at the "average" student in the target population.

    "If I'm a very low-performing student, there may be only two or three items on the [conventional] test that are appropriate to my level of performance," Davey of the ETS says, adding that the same is true for high-performing students.

    Inside the IRT

    Computer adaptive tests often use the same types of questions as conventional tests, though with adjustments for display on a screen. Other features are distinctive, such as the order of items being irreversible. Students are not allowed to recheck or change answers.

    This one-way street is necessary because of the process that takes place after each answer: A central computer recalculates the test-taker's ability level, then selects the next item, based on the individual's success to that point.

    As the student completes more items, the computer tracks the statistical accuracy of the score until a set accuracy level is reached. Then the test moves to another skill or content area. Reaching that level may require fewer items if the student answers with consistent proficiency, or many more items if the student answers inconsistently.

    "Adaptive testing doesn't waste the examinees' time by asking questions that we're already pretty sure we know how the student is going to answer," says Davey.

    To make the crucial decisions about which items to present, the test is outfitted with an "item response theory" model, essentially its brains and the part of the system that some critics consider opaque.
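
    The accuracy-based stopping rule described above can be sketched with the simplest IRT model. Everything specific here is an assumption for illustration: the Rasch model stands in for whatever model a given test uses, the standard error is taken as one over the square root of the accumulated item information, the student's ability is held fixed rather than re-estimated after each answer, and the function names and the 0.5 target are invented.

```python
import math


def p_correct(ability, difficulty):
    """Rasch-model probability of a correct answer (values in logits)."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def standard_error(ability, difficulties):
    """Standard error of the ability estimate after the given items."""
    # Each item contributes information p * (1 - p) under the Rasch model.
    info = 0.0
    for b in difficulties:
        p = p_correct(ability, b)
        info += p * (1.0 - p)
    return float("inf") if info == 0 else 1.0 / math.sqrt(info)


def items_needed(ability, difficulties, target_se):
    """Return how many items, taken in order, reach the target accuracy.

    Returns None if the whole sequence is used without reaching it.
    """
    for n in range(1, len(difficulties) + 1):
        if standard_error(ability, difficulties[:n]) <= target_se:
            return n
    return None


# Items pitched exactly at the student's level versus items two logits too hard.
well_targeted = items_needed(0.0, [0.0] * 50, target_se=0.5)
poorly_targeted = items_needed(0.0, [2.0] * 50, target_se=0.5)
```

    In this sketch the well-targeted sequence reaches the accuracy target with far fewer items than the poorly targeted one, which is the arithmetic behind Davey's point about not wasting the examinee's time.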

    The IRT model governs the interaction between the test-taker and the test items. It weighs the student's record of right and wrong answers against several known characteristics of the test items, such as difficulty, the ability to discriminate between higher- and lower-ability students, the degree to which guessing may succeed, and coverage of academic content.

    By solving the complex algorithm written into the IRT model, the computer determines which test item should be presented to the student next.
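
    One common way to make that choice, offered here as a hedged sketch rather than a description of any particular test, is to pick the unused item that is most informative at the student's current estimated ability. The example uses the three-parameter logistic model, whose parameters correspond to the discrimination, difficulty, and guessing characteristics mentioned above. The `Item` class, the function names, and the three-item bank are hypothetical, and content coverage is left out entirely.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    item_id: str
    a: float  # discrimination between higher- and lower-ability students
    b: float  # difficulty
    c: float  # chance of succeeding by guessing


def p_correct(theta, item):
    """Three-parameter logistic probability of a correct answer."""
    return item.c + (1 - item.c) / (1 + math.exp(-item.a * (theta - item.b)))


def information(theta, item):
    """Fisher information the item provides at ability theta."""
    p = p_correct(theta, item)
    return item.a ** 2 * ((1 - p) / p) * ((p - item.c) / (1 - item.c)) ** 2


def next_item(theta, bank, administered):
    """Return the most informative item not yet given, or None if none remain."""
    remaining = [it for it in bank if it.item_id not in administered]
    if not remaining:
        return None
    return max(remaining, key=lambda it: information(theta, it))


# Hypothetical three-item bank with identical discrimination and guessing.
bank = [
    Item("easy", a=1.0, b=-2.0, c=0.2),
    Item("medium", a=1.0, b=0.0, c=0.2),
    Item("hard", a=1.0, b=2.0, c=0.2),
]

first_for_average = next_item(0.0, bank, administered=set())
first_for_strong = next_item(2.0, bank, administered=set())
```

    With this bank an average student is given the medium item first and a strong student the hard one. An operational test would also re-estimate ability after each answer and balance content areas, neither of which is shown.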

    Test developers concede that IRT models are unfathomable to lay people and even challenge the intellects of experts unfamiliar with a given test.

    Schaeffer of FairTest calls the IRT model the "pig in a poke" that makes computer adaptive testing hard for policymakers to accept.

    "Who knows what the algorithm is for test delivery?" he asks. "You have to accept the test manufacturer's claims about whether the test is equivalent for each student."


    Scott Elliot, the chief executive officer of Vantage Learning, a major maker of computer-based tests located in Yardley, Pa., says, "There are many technical nuances under the IRT; some differences [between IRTs] are sort of like religion."

    Davey of the ETS agrees that the IRT resists attempts to explain it, but adds that the apparent simplicity of conventional testing is "based largely on oversimplification of how paper testing typically is done." In fact, he says, virtually identical IRT models are used with some conventional state tests to ensure that the same score in different years represents approximately the same proficiency level on the test—a vital issue for accountability.

    Breaking With the Past

    Because of technology hurdles and spotty acceptance of computer adaptive testing, experts generally predict that the field will struggle for the next five or 10 years, but that schools will eventually turn to the approach.

    Davey believes educators will be persuaded by the greater amount of diagnostic information the tests produce from fewer school days spent testing.

    That's not to overlook other formidable problems that computer-based testing poses for schools, notably the difficulty of providing technology that is reliable and consistent for all students, so the playing field is kept level. The tests must be delivered over a robust infrastructure to avoid processing and communications delays that would leave students waiting for their next test items.

    Computer adaptive tests also require larger banks of test items than conventional tests do. Yet the adaptive method gives items a longer useful life because it's harder for test-takers to predict which items they will see.


    Finally, adaptive tests are subject to some of the same well-documented problems as other standardized tests, such as cultural biases, says FairTest's Schaeffer. "Automating test items that are used inappropriately, in many ways makes matters worse—you add technical problems and dissemination-of-information problems," he says.

    Referring to the ETS adaptive Graduate Record Examination, he adds, "The GRE, in spite of all the hoopla, is the same lame questions put out using a hidden algorithm, instead of linearly on a sheet of paper."

    Ewing of the ETS counters that its test items are "what the graduate deans have said are the math and verbal skills that they want students to be able to handle."

    Meanwhile, researchers are working on new kinds of adaptations that could be applied in computer adaptive tests—including presenting items using multimedia or computer simulations and catering to an individual's preferred learning style. Already, some tests present items in different languages. Those changes highlight another potential pitfall. Today, policymakers insist on having new tests demonstrate "comparability" with old tests, a task that Davey says becomes vastly more difficult as testing methods change.
