Glossary of Assessment Terms

Academic Aptitude Test
An aptitude test that predicts achievement in academic pursuits. Ideally, in constructing this type of test, the developer tries to minimize the effect of exposure to specific materials or courses of study on the examinee's score.

Achievement Test
An assessment that measures a student's acquired knowledge and skills in one or more common content areas (for example, reading, mathematics, or language).

Adult Accountability Test
An assessment intended primarily for individuals 18 years old or older who are no longer attending elementary or secondary school.

Alternative Assessment
An assessment that differs from traditional achievement tests. For example, an alternative assessment may require a student to generate or produce responses or products rather than answer only selected-response items. This type of assessment may include constructed-response activities, essays, portfolios, interviews, teacher observations, work samples, and/or group projects.

Analytic Scoring
A scoring procedure in which a student's work is evaluated for selected traits or dimensions, with each dimension receiving a separate score.

Aptitude Test
A test consisting of items selected and standardized so that the test predicts a person's future performance on tasks not obviously similar to those in the test. Aptitude tests may or may not differ in content from achievement tests, but they do differ in purpose. Aptitude tests consist of items that predict future learning or performance; achievement tests consist of items that sample the adequacy of past learning.

Authentic Assessment
An assessment that measures a student's performance on tasks and situations that occur in real life. This type of assessment is closely aligned with, and models, what students do in the classroom.

Battery
A test battery is a set of several tests designed to be administered as a unit. Individual subject-area tests measure different areas of content and may be scored separately; scores from the subtests may also be combined into a single score.

Bias
A situation that occurs in testing when items systematically measure differently for different ethnic, gender, or age groups. Test developers reduce bias by analyzing item data separately for each group, then identifying and discarding items that appear to be biased.

Ceiling
The upper limit of performance that can be measured effectively by a test. Individuals are said to have reached the ceiling of a test when they perform at the top of the range in which the test can make reliable discriminations. If an individual or group scores at the ceiling of a test, the next higher level of the test should be administered, if available.

Checklist
An assessment that is based on the examiner observing an individual or group and indicating whether or not the assessed behavior is demonstrated.

Composite Score
A single score used to express the combination, by averaging or summation, of the scores on several different tests.

Comprehensive Equal-Interval Scale
See "Equal-Interval Scale".

Constructed-Response Item
An assessment unit with directions, a question, or a problem that elicits a written, pictorial, or graphic response from a student. Sometimes called an "open-ended" item.

Content Validity
Content validity indicates the extent to which the content of the test samples the subject matter or situation about which conclusions are to be drawn. Methods used in determining content validity are textbook analysis, description of the universe of items, adequacy of the sample, representativeness of the test content, inter-correlations of subtest scores, and opinions of a jury of experts.

Conversion Tables
Tables used to convert a student's test scores from scale score units to grade equivalents, percentile ranks, and stanines.

Criterion
A standard or judgment used as a basis for quantitative and qualitative comparison; that variable to which a test is compared to constitute a measure of the test's validity. For example, grade-point average and attainment of curricular objectives are often used as criteria for judging the validity of an academic aptitude test.

Criterion-Referenced Test
A test in which every item is directly identified with an explicitly stated educational behavioral objective. The test is designed to determine which of these objectives have been mastered by the examinee.

Culture-Fair Test
A test devised to exclude specific cultural stimuli so that persons from a particular culture will not be penalized or rewarded on the basis of differential familiarity with the stimuli.

Derived Score
A test score pertaining to a norm group (such as a percentile, stanine, or grade equivalent) that is an outgrowth of the scale scores. Derived scores are useful descriptors; however, they are not calibrated on an equal-interval scale, so they cannot be added, subtracted, or averaged across test levels the way scale scores can.

Diagnostic Test
A test intended to locate learning difficulties or patterns of error. Such tests yield measures of specific knowledge, skills, or abilities underlying achievement within a broad subject. Thus, they provide a basis for remedial instruction.

Discrimination Parameter
The property that indicates how accurately an item distinguishes between examinees of high ability and those of low ability on the trait being measured. An item that can be answered equally well by examinees of low and high ability does not discriminate well and does not give any information about relative levels of performance.
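The effect of the discrimination parameter can be sketched with a three-parameter logistic item response function. The glossary does not give a formula, so the logistic form, the scaling constant D = 1.7, and the item values below are assumptions chosen only for illustration; they are not taken from TABE.

```python
import math


def p_correct(theta, a, b, c, D=1.7):
    """Probability of a correct answer under an assumed 3PL model.

    theta: examinee ability
    a: discrimination, b: location (difficulty), c: guessing (lower asymptote)
    D: conventional scaling constant (an assumption, not from the glossary)
    """
    exponent = -D * a * (theta - b)
    return c + (1.0 - c) / (1.0 + math.exp(exponent))


# Two hypothetical items that differ only in discrimination.
LOW_ABILITY, HIGH_ABILITY = -1.0, 1.0
weak_item = dict(a=0.3, b=0.0, c=0.2)
strong_item = dict(a=1.5, b=0.0, c=0.2)


def ability_gap(item):
    """Difference in success probability between a high- and a low-ability examinee."""
    return p_correct(HIGH_ABILITY, **item) - p_correct(LOW_ABILITY, **item)


if __name__ == "__main__":
    for name, item in (("weak", weak_item), ("strong", strong_item)):
        print(f"{name} item: gap = {ability_gap(item):.3f}")
```

The item with the larger discrimination value separates the two examinees by a wider margin, which is the sense in which it "discriminates well."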
Early Childhood Test
An assessment intended for students in kindergarten and grades 1 through 3.

Educational (Instructional) Objective
A statement that defines an intended outcome of instruction. It describes what a successful learner is able to do at the end of the lesson or course, defines the conditions under which the behavior is to occur, and often specifies the criterion or standard of acceptable performance.

Equal-Interval Scale
A scale marked off in units of equal size that is applied to all groups taking a given test, regardless of group characteristics or time of year. Each test yields its own scale. On TABE, for example, scale scores are expressed in numbers ranging from 0 to 999. The continuity of the scale among levels comes from administering special test forms containing items from adjacent test levels to random groups of students. This allows the TABE scales to be calibrated so that a given adult learner is expected to obtain the same scale score regardless of the form or level of the test he or she takes. However, the standard error of measurement associated with that student's score will vary systematically from level to level.

Face Validity
An evaluation of a test based on inspection only.

Floor
The opposite of ceiling, it is the lowest limit of performance that can be measured effectively by a test. Individuals are said to have reached the floor of a test when they perform at the bottom of the range in which the test can make reliable discriminations. If an individual or group scores at the floor of a test, the next lower level of the test, if available, should be administered.
Frequency Distribution
An ordered tabulation of individual scores (or groups of scores) showing the number of persons who obtained each score or placed within each range of scores.

Grade Equivalent
A score on a scale developed to indicate the school grade (usually measured in months) that corresponds to an average chronological age, mental age, test score, or other characteristic of students. A grade equivalent of 6.4 is interpreted as a score that is average for a group in the fourth month of Grade 6. Grade equivalents do not compose a scale of equal intervals and cannot be added, subtracted, or averaged across test levels the way scale scores can. Here is an excerpt from the "Scale Score to Grade Equivalent Table for TABE 7 & 8."

              Grade Equivalent
Scale Score   Reading   Applied Mathematics   Language
800           12.9+     12.9+                 12.9+
700           12.0+     12.9+                 12.9+
600           11.2      11.6                  12.1
500           5.3       5.8                   4.4
400           2.3       2.7                   2.1
300           1.2       1.4                   1.1
200           0.0       0.0                   0.0

Grade Norm
The average test score obtained by students classified at a given grade placement.

Guessing Parameter
The probability that a student with very low ability on the trait being measured will answer the item correctly. There is always some chance of guessing the answer to a multiple-choice item, and this probability can vary among items. The guessing parameter enables a model to account for these factors.

Holistic Scoring
A scoring procedure yielding a single score based on overall student performance rather than on an accumulation of points. Holistic scoring uses rubrics to evaluate student performance.

Intelligence Test
A test that measures the higher intellectual capacities of a person, such as the ability to perceive and understand relationships and the ability to recall associated meaning; in other words, it measures the ability to learn.

Interpretation
The act of explaining test scores to students so they understand exactly what each type of score means. For example, a percentile rank refers to the percentage of students in the norm group who fall below a particular point, not the percentage of items answered correctly.

Item
A question or problem on a test.

Item Bias
An item is biased when it systematically measures differently for different ethnic, cultural, regional, or gender groups.

Item Response Theory
The basis of various statistical models for analyzing item and test data. In TABE, the three-parameter model was used in the selection and scaling of items. This model takes into account discrimination, difficulty, and chance level of success (guessing) to describe each item's statistical characteristics.

K-12 Assessment
An assessment intended primarily for students in elementary and secondary schools. CTB assessments may assess students in the entire K-12 range or just in selected grades, e.g., Grades 2-12.

Local Norms
Norms that have been obtained from data collected in a limited locale, such as a school system, county, or state. They may be used instead of, or along with, national norms to evaluate student performance.

Location Parameter
A statistic from item response theory that pinpoints the ability level at which an item discriminates, or measures, best.

Mean
The quotient obtained by dividing the sum of a set of scores by the number of scores; also called "average." Mathematicians call it "arithmetic mean."

Median
The middle score in a set of ranked scores. Equal numbers of ranked scores lie above and below the median. It corresponds to the 50th percentile and the 5th decile.

Mode
The score or value that occurs most frequently in a distribution.

Multiple Measures
Assessments that measure student performance in a variety of ways. Multiple measures may include standardized tests, teacher observations, classroom performance assessments, and portfolios.

Multiple-Choice Item
A question, problem, or statement (called a "stem") that appears on a test, followed by two or more answer choices, called alternatives or response choices. The incorrect choices, called distractors, usually reflect common errors. The examinee's task is to choose, from among the alternatives provided, the best answer to the question posed in the stem. These are also called "selected-response items."

Normal Distribution Curve
A bell-shaped curve representing a theoretical distribution of measurements that is often approximated by a wide variety of actual data. It is often used as a basis for scaling and statistical hypothesis testing and estimation in psychology and education because it approximates the frequency distributions of sets of measurements of human characteristics.

Norm-Referenced Test
A standardized assessment in which all students perform under the same conditions. This type of test compares a student or group of students with a specified reference group, usually others of the same grade and age for K-12 students, or for adults, those with similar characteristics, such as those in an adult basic education class.

Norms
The average or typical scores on a test for members of a specified group. They are usually presented in tabular form for a series of different homogeneous groups.

Objective
A desired educational outcome such as "constructing meaning" or "adding whole numbers." Usually several different objectives are measured in one subtest.

Objective Test
A test for which a list of correct answers, one for each test item, can be provided so that subjective opinion or judgment is eliminated from the scoring procedure. Multiple-choice, true/false, and matching-item tests are purely objective, while short answer and completion-item tests are less so.
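The Mean, Median, and Mode entries above can be illustrated with a short sketch using Python's standard `statistics` module. The score list is invented for the example and does not come from any real test.

```python
import statistics

# Hypothetical raw scores for seven examinees (illustrative only).
scores = [12, 15, 15, 18, 20, 22, 24]

mean_score = statistics.mean(scores)      # sum of scores divided by their count
median_score = statistics.median(scores)  # middle score once the scores are ranked
mode_score = statistics.mode(scores)      # most frequently occurring score

if __name__ == "__main__":
    print("mean:", mean_score)
    print("median:", median_score)
    print("mode:", mode_score)
```

With an odd number of scores the median is the single middle value; with an even number, `statistics.median` averages the two middle values.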
Percentile
One of the 99 point scores that divide a ranked distribution into groups, each of which contains 1/100 of the scores. The 73rd percentile denotes the score or point below which 73 percent of the scores fall in a particular distribution of scores. (See also the table under "stanine.")

Performance Assessment
An assessment activity that requires students to construct a response, create a product, or perform a demonstration. Usually there are multiple ways that an examinee can approach a performance assessment and more than one correct answer.

Performance Standard
A level of performance on a test, established by education experts, as a goal of student attainment.

Power Test
A test that samples the range of an examinee's capacity in particular skills or abilities and that places minimal emphasis on time limits. A "pure" power test is sometimes defined as one in which every examinee has sufficient time to complete the test.

Predictive Validity
The ability of a score on one test to forecast a student's probable performance on another test of similar skills. Predictive validity is determined by mathematically relating scores on the two different tests.

Raw Score
The first score obtained in scoring a test, which is often the number of correct answers. Sometimes it is the number right minus a fraction of the number wrong, the time required to complete the test, the number of errors, or some other number obtained directly from the test's administration.

Readiness Test
A test of ability to engage in a new type of specific learning. Level of maturity, previous experience, and emotional and mental set are important determinants of readiness.

Reliability
The consistency of test scores obtained by the same individuals on different occasions or with different sets of equivalent items; accuracy of scores.

Rubric
A scoring tool, or set of criteria, used to evaluate a student's test performance.

Scale
An organized set of measurements, all of which measure one property or characteristic. Different types of test-score scales use different units, for example, number correct, percentiles, or IRT scale scores.

Scale Scores
Scores on a single scale with intervals of equal size. The scale can be applied to all groups taking a given test, regardless of group characteristics or time of year, making it possible to compare scores from different groups of examinees. Scale scores are appropriate for various statistical purposes; for example, they can be added, subtracted, and averaged across test levels. Such computations permit educators to make direct comparisons among examinees, compare individual scores to groups, or compare an individual's pre-test and post-test scores in a way that is statistically valid. This cannot be done with percentiles or grade level equivalents.

Selected-Response Item
A question or incomplete statement that is followed by answer choices, one of which is the correct or best answer. Also referred to as a "multiple-choice" item.

Special Admissions Test
A test of a student's ability to participate in special programs or advanced learning situations. For example, an honors-level class or a magnet school may require the attainment of high scores on an assessment for admission.

Speed Test
A test in which one aspect of performance is measured by the number of tasks performed in a given time. A "pure" speed test is one in which examinees make no errors and that cannot be completed by any examinee in the allotted time.
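The distinction the glossary draws between a percentile rank and the percentage of items answered correctly can be sketched as follows. The norm-group scores, the 50-item test length, and the function names are hypothetical, and the "percentage of scores strictly below" rule is one common convention rather than a method stated in the glossary.

```python
def percentile_rank(score, norm_scores):
    """Percentage of norm-group scores that fall strictly below `score`."""
    below = sum(1 for s in norm_scores if s < score)
    return 100.0 * below / len(norm_scores)


def percent_correct(raw_score, n_items):
    """Percentage of test items answered correctly."""
    return 100.0 * raw_score / n_items


# Hypothetical norm group of 20 raw scores on a hypothetical 50-item test.
NORM_GROUP = [10, 14, 18, 20, 22, 24, 25, 26, 27, 28,
              29, 31, 33, 35, 36, 38, 40, 42, 45, 48]
N_ITEMS = 50

if __name__ == "__main__":
    raw = 30
    print("percent correct:", percent_correct(raw, N_ITEMS))
    print("percentile rank:", percentile_rank(raw, NORM_GROUP))
```

The same raw score yields two different numbers because one describes the examinee's standing relative to the norm group and the other describes the share of items answered correctly.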
Standard Deviation
A statistic used to express the extent of the divergence of a set of scores from the average of all the scores in the group. In a normal distribution, approximately two-thirds (68.3%) of the scores lie within the limits of one standard deviation above and one standard deviation below the mean. About one-sixth of the scores lie more than one standard deviation above the mean, and about one-sixth lie more than one standard deviation below the mean.

Standard Error of Measurement
A measure of the amount of error to be expected in a score from a particular test. The smaller the standard error of measurement, the greater the accuracy of the test score. The standard error of measurement is the standard deviation of a theoretical distribution of a set of variations, each of which is the difference between the obtained score and true score. Thus, if a standard error of measurement is 5, the chances are about two to one that an obtained score lies within five units of the true score.

Standard Score
A derived score scaled to produce an arbitrarily assigned mean and standard deviation. For example, deviation IQs are standard scores with a mean of 100 and a standard deviation of 15 or 16, depending on the test.

Standardization
The process of administering a test to a nationally representative sample of examinees using carefully defined directions, time limits, materials, and scoring procedures. The results produce norms to which the performance of other examinees can be compared, provided they took the test under the same conditions.

Standardization Sample
That part of the population that is used in the norming of a test, i.e., the reference population. The sample should represent the population in essential characteristics, some of which may be geographical location, age, or grade for K-12 students, or, for adults, participation in a specific type of program (for example, adult basic education).

Standardized Test
A test constructed of items that are appropriate in level of difficulty and discriminating power for the intended examinees, and that fit the pre-planned table of content specifications. The test is administered in accordance with explicit directions for uniform administration and is interpreted using a manual that contains reliable norms for the defined reference groups.

Stanine
A unit of a standard score scale that divides the norm population into nine groups with the mean at stanine 5. The word stanine draws its name from the fact that it is a STAndard score on a scale of NINE units.

Comparison Table of Stanines and Percentiles

Stanine   Description              Approximate Percentiles   Percentage of Examinees
9         Highest Level            96-99                     4%
8         High Level               90-95                     7%
7         Well above average       78-89                     12%
6         Slightly above average   60-77                     17%
5         Average                  41-59                     20%
4         Slightly below average   23-40                     17%
3         Well below average       11-22                     12%
2         Low Level                5-10                      7%
1         Lowest Level             1-4                       4%

Test Battery
See "Battery".

Test Item
See "Item".

Test Objective
See "Objective".

Validity
The capability of a test to measure what its authors or users intend it to measure.
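The Comparison Table of Stanines and Percentiles above can be read as a simple lookup. The sketch below encodes only the approximate percentile ranges from that table; the function name and the use of a binary search are illustrative choices, not part of any published scoring procedure.

```python
from bisect import bisect_right

# Lowest percentile in each stanine band, taken from the comparison table:
# stanine 1 starts at 1, stanine 2 at 5, ..., stanine 9 at 96.
STANINE_LOWER_BOUNDS = [1, 5, 11, 23, 41, 60, 78, 90, 96]


def stanine_from_percentile(percentile):
    """Map a percentile rank (1-99) to its approximate stanine (1-9)."""
    if not 1 <= percentile <= 99:
        raise ValueError("percentile must be between 1 and 99")
    # The number of lower bounds at or below the percentile is the stanine.
    return bisect_right(STANINE_LOWER_BOUNDS, percentile)


if __name__ == "__main__":
    for p in (4, 5, 50, 77, 96):
        print(p, "->", stanine_from_percentile(p))
```

Because the table's ranges are approximate, a percentile near a band edge may be assigned a different stanine by a publisher's exact conversion tables.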