Foreign Language Assessment
Compiled by Po-Sen Liao
Why do we test at all? Is testing the only way to get students to learn? Do we test out of habit, or as a punitive measure? Are tests a discouraging barrier for students to face, a hurdle they have to jump at prescribed points, or can they be a positive challenge?
(questionable and promising testing procedures, see p.4 )
The assessment tasks should be nonthreatening and developmental in nature, allowing the learners ample opportunities to demonstrate what they know and do not know, and providing useful feedback both for the learners and for their teachers.
Advances in language testing over the last decade have included:
1. The development of a theoretical view that sees language ability as
multi-componential and recognizes the influence of test-method and
test-taker characteristics on test performance.
2. The use of more sophisticated measurement and statistical tools.
3. The development of communicative language tests commensurate with
the increased teaching of communicative skills.
Assessment: a variety of ways of collecting information on a learner’s language ability or
achievement. An umbrella term encompassing tests, observations, and project work.
Proficiency assessment: The assessment of general language abilities acquired by the learner independent of a course of study (e.g. TOEFL, TOEIC).
Achievement assessment: To establish what a student has learned in relation to a particular course (e.g. tests carried out by the teacher and based on the specific content of the course). Its purpose is to determine the acquisition of course objectives at the end of instruction.
Diagnostic assessment: It is designed to diagnose a particular aspect of a
language. A diagnostic test in pronunciation might have the purpose of determining which phonological features of English are difficult for learners and should therefore become a part of a curriculum. Such assessment may offer a checklist of features for the teacher to use in pinpointing difficulties.
Placement assessment: Its purpose is to place a student into an appropriate level or section of a language curriculum. Certain proficiency tests and diagnostic tests can act in the role of placement assessments.
Aptitude assessment: It is designed to measure a person’s capacity or general
ability to learn a foreign language and to be successful. It is considered to be independent of a particular language. This test usually requires learners to perform such tasks as memorizing numbers and vocabulary, listening to foreign words, and detecting spelling clues and grammatical patterns.
Formative assessment: It is often closely related to the instructional program and may take forms of quizzes and chapter tests. Its results are often used in a diagnostic manner by teachers to modify instruction.
Summative assessment: The type of assessment that occurs at the end of a period of study. It goes beyond the material of specific lessons and focuses on evaluating general course outcomes.
Norm-referenced assessment: to evaluate ability against a standard or normative
performance of a group. It provides a broad indication of relative standing. (e.g. a score in an exam reports a learner’s standing compared to other students).
Criterion-referenced assessment: to assess achievement or performance against
a cut-off score that is determined as a reflection of mastery or attainment of specified objectives. This approach is used to see whether a respondent has met certain instructional objectives or criteria. Focus is on ability to perform tasks rather than group ranking. (e.g. a learner can give basic personal information).
Evaluation: refers to the overall language program, not just to what individual students
have achieved. Assessment of individual students’ progress or achievement is an important
component of evaluation. Evaluation goes beyond student achievement to
consider all aspects of teaching and learning, and to look at how educational
decisions can be informed by the results of alternative forms of assessment.
; How to evaluate the assessment instrument?
1. Validity: A test is valid when it measures effectively what it is intended to
measure. A test must be reliable in order to be valid.
Types of validity:
A. Content validity: Checking all test items to make certain that they
correspond to the instructional objectives of the course.
B. Criterion-related validity: Determining how closely learners’ performance
on a given new test parallels their performance on another instrument, or
criterion. If the instrument to be validated is correlated with another criterion
instrument at the same time, this is referred to as concurrent validity.
If the correlation takes place at some future time, it is referred to as
predictive validity.
C. Construct validity: the degree to which scores on an assessment
instrument permit inferences about an underlying trait. It examines whether the
instrument is a true reflection of the theory of the trait being measured.
D. System validity: the effects of instructional changes brought about by the
introduction of the test into an educational system. (p.41)
Washback effect: how assessment instruments affect educational practices
E. Face validity/ perceived validity: whether the test looks as if it is measuring
what it is supposed to measure.
2. Reliability: the degree to which a test can be trusted to produce the same result upon repeated administrations. A language test must produce consistent results and give consistent information.
Types of reliability:
A. Test-retest reliability: the degree of consistency of scores for the same test
given to the same students on different occasions.
B. Alternate-forms reliability: the consistency of scores for the same students
on different occasions on different but comparable forms of the test.
C. Split-half reliability: a special case of alternate-forms reliability. The same
individuals are tested on one occasion with a single test. A score is calculated
for each half of the test for each individual and the consistency of the two
halves is compared.
D. Scorer reliability: the degree of consistency of scores from different scorers
for the same individuals on the same test (interrater reliability), or from the
same scorer for the same individuals on the same test but on different
occasions (intrarater reliability). Scorer reliability is an issue when scores are
based on subjective judgments.
(e.g., rater unreliability: two teachers observe the same conversation between two ESL
students; both assess it and report different interpretations. Unreliability may also be instrument-related or person-related.)
The reliability index is a number ranging from .00 to 1.00 that indicates what proportion of the measurement is reliable. An index of .80 means that your measurement
is 80% reliable and 20% error. For the purposes of classroom testing, a reliability coefficient of at least .70 is good. Higher reliability coefficients would be expected of standardized tests used for large-scale administration (.80 or better).
Rater reliability: to use more than one well-trained and experienced observer, interviewer, or composition reader.
Person-related reliability: to assess on several occasions.
Instrument-related reliability: to use a variety of methods of information collection.
(A test which is reliable is not necessarily valid: a test may have maximum consistency but may not be measuring what it is specifically intended to measure. An instrument, however, can only be as valid as it is reliable; inconsistency in a measurement reduces validity.)
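As a rough numeric illustration (not from the original text), test-retest reliability can be estimated as the Pearson correlation between scores from two administrations of the same test; the score lists below are invented:

```python
from statistics import mean, pstdev

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)
    return cov / (pstdev(xs) * pstdev(ys))

# Hypothetical scores from two administrations of the same test
# to the same seven students.
first = [72, 85, 60, 90, 78, 66, 81]
second = [70, 88, 58, 93, 75, 64, 84]

r = pearson_r(first, second)
print(f"test-retest reliability: {r:.2f}")  # well above the .70 classroom benchmark
```

The same correlation would serve for alternate-forms reliability (scores on two comparable forms) or split-half reliability (scores on the two halves of a single test).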
3. Practicality: practical considerations like cost, administration time, administrator
qualifications, and acceptability.
Historically, language-testing trends and practices have followed the changing winds of teaching methodology.
In the 1950s and 1960s, under the influence of behaviorism and structural
linguistics, language tests were designed to assess learners’ mastery of different areas
of the linguistic system such as phoneme discrimination, grammatical knowledge and vocabulary. Tests often used objective testing formats such as multiple choice.
However, such discrete item tests provided no information on learners’ ability to use
language for communicative purposes.
In the 1970s and early 1980s this led to an upsurge of integrative tests such as
cloze and dictation, which required learners to use linguistic and contextual knowledge to reconstitute the meaning of written or spoken texts.
Since the early 1980s, with the widespread adoption of Communicative Language
Teaching (CLT), assessment has become increasingly direct. Many language tests often contain tasks which resemble the kinds of language-use situations that test takers would encounter in using the language for communicative purposes in everyday life. The tasks typically include activities such as oral interviews, listening
to and reading extracts from the media, and various kinds of authentic writing tasks
which reflect real-life demands. Today, test designers are still challenged in their quest for more authentic, content-valid instruments that simulate real-world interaction
while still meeting reliability and practicality criteria.
The best way to evaluate students’ performance in a second language is still a matter of debate. Given the wide variety of assessment methods available and the lack of consensus on the most appropriate means to use, the best way to assess language performance in the classroom may be through a multifaceted or eclectic approach,
whereby a variety of methods are used.
Discrete-point/ integrative assessment: (p.161)
Since the 1960s, the notion of discrete-point assessment, that is, assessing one and only one point at a time, has met with some disfavor among theorists. They feel that such a method provides little information on the student’s ability to function in
actual language-use situations. They also contend that it is difficult to determine which points are being assessed. In the past, testing points were determined in part by a contrastive analysis of differences between the target and native languages, but this contrastive analysis was criticized for being too limiting. About 20 years ago an integrative approach emerged, with the emphasis on testing more than one point at a time.
There is actually a continuum from the most discrete-point on the one hand to the most integrative items or procedures on the other. Most items fall somewhere in between.
Direct/ indirect assessment:
A direct measure samples explicitly from the behavior being evaluated, while an indirect measure is contrived to the extent that the task differs from a normal language-use task. There is increasing concern that assessments need to be developed that directly reflect the traits they are supposed to measure.
; Traditional test items:
1. Multiple-choice (show your items to your colleagues)
The multiple-choice item, like other paper-and-pencil tests (e.g. true-false items, matching items, short questions), measures whether the student knows or understands what to do when confronted with a problem situation. Multiple-choice items are favored because their scoring can be reliable, rapid, and economical. However, they cannot determine how the student actually will perform in that situation. Furthermore, they are not well adapted to measuring some problem-solving skills, or the ability to organize and present ideas.
Suggestions for constructing multiple-choice items:
1. The correct answer must not be dubious.
(e.g.) Which is the odd one out?
2. Items should be presented in context.
(e.g.) Fill in the blank with the most suitable option:
Visitor: Thank you very much for such a wonderful visit.
Hostess: We were so glad you could come. Come back______.
3. All distracters should be plausible.
(e.g.) What is the major purpose of the United Nations?
a. To develop a new system of international law.
b. To provide military control of nations that have recently attained their
independence. (vs. To provide military control).
c. To maintain peace among the peoples of the world.
d. To establish and maintain democratic forms of government in newly formed
nations (vs. To form new governments).
4. For young students, 3-choice items may be preferable in order to reduce the
amount of reading. For other learners, 4 or 5 choices are favored to reduce the
chances of guessing the correct answer.
2. Essay questions:
Learning outcomes concerned with the abilities to select, organize, integrate, relate, and evaluate ideas require the freedom of response provided by essay questions. Essay questions emphasize the integration and application of thinking and problem-solving skills. However, their most serious limitation is the unreliability of the scoring. Another
closely related limitation is the amount of time required for scoring the answers. A series of studies has shown that answers to essay questions are scored differently by different teachers, and that even the same teachers score the answers differently at
different times. One teacher stresses factual content, another the organization of ideas, and another writing skills. With each teacher evaluating the degree to which different learning outcomes are achieved, it is not surprising that their scoring diverges so widely. Scoring reliability can be increased by clearly defining the outcomes to be measured, properly framing the questions, carefully following scoring rules, and obtaining practice in scoring.
Suggestions for constructing essay questions:
A prompt for the essay is presented in the form of a mini-text that the respondents need to understand and operationalize. We need to give careful consideration to the instructions the respondents attend to in accomplishing the testing tasks.
1. Instructions should be brief, but explicit.
2. Specific about the form the answers are to take—if possible, presenting a
sample question and answer.
3. Informative as to the value of each item and section of the assessment
instrument, the time allowed for the test, and whether speed is a factor.
4. Formulate questions that will call forth the behavior specified in the
intended learning outcomes.
5. Phrase each question so that the students’ task is clearly defined.
(e.g.) (incorrect answers may be due to misinterpretation or lack of knowledge):
Write a one-page statement defending the importance of conserving our
natural resources. Your answer will be evaluated in terms of its organization,
comprehensiveness, and the relevance of the arguments presented. (30%, 30
Describe the similarities and differences between --- (comparing)
What are major causes of --- (cause and effect)
Briefly summarize the contents of --- (summarizing)
Describe the strengths and weaknesses of the following --- (evaluating)
; Appraising tests:
Instead of discarding the test after a classroom test has been administered and the students have discussed the results, a better approach is to appraise the effectiveness of the test items and to build a file of high-quality items for future use.
Scoring tests is not the final step in the evaluation process. Scores by themselves are arbitrary; the main concern is the interpretation of the scores: (p.98)
Raw score: the score obtained directly by tallying up all the items answered correctly; on its own it is usually not easy to interpret.
Percentage score: the number of items that students answered correctly divided by the total items on the test.
Percentile: a number that tells what percent of individuals within the specified norm group scored lower than the raw score of a given student.
Mean score: the average score of a given group of students, obtained by dividing the sum of the students’ scores by the number of scores involved.
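A minimal sketch, with invented scores, of how these score interpretations can be computed:

```python
# Hypothetical results for a class of ten students.
scores = [55, 62, 68, 68, 71, 75, 80, 84, 90, 95]

def percentage_score(raw, total_items):
    """Raw score converted to a percentage of the total items."""
    return raw / total_items * 100

def percentile(score, norm_group):
    """Percent of the norm group scoring strictly below the given raw score."""
    below = sum(1 for s in norm_group if s < score)
    return below / len(norm_group) * 100

mean_score = sum(scores) / len(scores)
print(percentage_score(40, 50))  # 80.0: 40 of 50 items correct
print(percentile(80, scores))    # 60.0: six of ten classmates scored lower
print(mean_score)                # 74.8
```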
Item difficulty: the ratio of correct responses to total responses for a given test item. A norm-referenced assessment (which aims to differentiate between high and low achievers) should have items that approximately 60% to 80% of the respondents answer correctly.
A criterion-referenced assessment (which aims to determine whether students have achieved the objectives of a course) may aim for item difficulties of 90% or better.
Formula: P = R/N x 100
(P = item difficulty, R = the number of students who got the item right, N = the total number of students who tried the item)
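The formula translates directly into code; the counts below (21 of 30 students answering correctly) are invented for illustration:

```python
def item_difficulty(right, total):
    """P = R / N x 100: the percent of students who answered the item correctly."""
    return right / total * 100

p = item_difficulty(21, 30)
print(p)  # 70.0, within the 60-80% band suited to a norm-referenced test
```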
Item discrimination/ item discriminating power: how well an item performs in
separating the better students from the weaker ones.
(While the item-difficulty index focuses on how the items fare, item discrimination
looks at how the respondents fare from item to item.)
An item-discrimination level of .30 or above is generally agreed to be desirable.
Formula: D = (Ru – Rl) / (½T)
(D = item discriminating power, Ru = the number of students in the upper group who get
the item right, Rl = the number of students in the lower group who get the item right, T = the total number of students included in the item analysis)
An item with maximum positive discriminating power is one in which all students in the upper group get the item right and all the students in the lower group get it wrong (with 10 students per group, D = (10 – 0)/10 = 1). An item with no discriminating power is one in which an equal number of students in both groups gets the item right (D = (10 – 10)/10 = 0). It is also possible to obtain negative discriminating power, where more students in the lower group than the upper group get the item right. Such items should be revised so that they discriminate positively, or they should be discarded.
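The discrimination formula as a sketch, using the two worked cases from the text plus one invented middle case:

```python
def item_discrimination(right_upper, right_lower, total):
    """D = (Ru - Rl) / (T/2), comparing the upper and lower scoring groups."""
    return (right_upper - right_lower) / (total / 2)

# Ten students in each group (T = 20), as in the worked examples.
print(item_discrimination(10, 0, 20))   # 1.0: maximum positive discrimination
print(item_discrimination(10, 10, 20))  # 0.0: no discrimination
print(item_discrimination(8, 3, 20))    # 0.5: comfortably above the .30 level
```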
Piloting of assessment instruments
Ideally, an assessment instrument that is intended to perform an important function in an institution would undergo piloting on a small sample of respondents
similar to those for whom it is designed. The pilot administration provides the assessor feedback on the items and procedures. The assessor can obtain some valuable
insights about what part of the instrument needs to be revised before it is administered to a larger group.
; Assessing the speaking skills
If students are asked to participate in communicative, open-ended activities in the classroom, then it is hypocritical to assess their progress with discrete point grammar tests. The test should be designed to give students a real-life, culturally authentic task.
1. Interviews: Greeting, warm-up chat, close-up.
2. Pair discourse
3. Group oral
4. Tape recording
A. Oral descriptions of visuals: visuals provide appropriate conversation stimuli at the
novice level, not only because they offer a psychological prop, but also because
they facilitate listing and identifying tasks for the students. There may be many
possibilities for appropriate answers.
Sources of visuals: the teacher’s or students’ personal slides and photos,
yearbook pictures, magazine pictures.
B. Role-play: to function in a "survival situation" that students might encounter
in real life.
(e.g.) Situation cards, on which are listed role-playing instructions for the
students:
1. When you see two of your friends at the mall, you decide to invite them to your
birthday party. Tell them when and where it is, what you will do at the party, how
many people will be there, and any other details you think they would be
interested in knowing.
2. Leave a message on the answering machine with the following information:
- Leave your name and the time you called
- Tell the person where you are going tonight.
- Tell the person you’ll see him/her tomorrow at a particular place and time.)
; Assessing reading comprehension (p.211)
When readers approach text on the basis of the prior content, language, and textual schemata that they may have with regard to that particular text, this is referred to as top-down reading. When readers focus exclusively on what is present in the text
itself, and especially on the words and sentences of the text, this is referred to as bottom-up reading. Successful learners usually display a combination of top-down and bottom-up reading.
Test constructors and users of assessment instruments should be aware of the skills tested by reading comprehension questions. There are numerous taxonomies of such skills:
1. The recognition of words and phrases of similar or opposing meaning.
2. The identifying or locating of information.
3. The discriminating of elements or features within context; the analysis of
elements within a structure and of the relationships among them --- e.g.
causal, sequential, chronological, hierarchical.
4. The interpreting of complex ideas, actions, events, and relationships.
5. Inferring --- the deriving of conclusions, and predicting the continuation.
1. Fixed-response format: multiple-choice
2. Structured-response format:
Cloze: The cloze test is extensively used as a completion measure, ideally
aimed at tapping reading skills interactively, with respondents using cues
from the text in a bottom-up fashion as well as bringing their background
knowledge to bear on the task.
I am one of ______ people who simply cannot ______ up. To me,
the ______ of an orderly living ______ is as remote and ______
as trying to climb ______ Fuji.
The cloze has been used as a measure of readability, global reading skills,
grammar, and writing. It can be scored according to an exact-word,
acceptable-word, or multiple-choice approach. (p.139)
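A sketch of the exact-word versus acceptable-word scoring distinction; the deleted words, responses, and acceptable alternatives below are invented:

```python
def score_cloze(responses, keys, acceptable=None):
    """Count correct cloze blanks. With no `acceptable` map this is exact-word
    scoring; entries in `acceptable` (blank index -> alternatives) extend it
    to acceptable-word scoring."""
    acceptable = acceptable or {}
    correct = 0
    for i, (resp, key) in enumerate(zip(responses, keys)):
        allowed = {key.lower()} | {w.lower() for w in acceptable.get(i, [])}
        if resp.lower() in allowed:
            correct += 1
    return correct

keys = ["those", "tidy", "idea", "room", "unlikely"]          # deleted words
responses = ["those", "clean", "idea", "room", "impossible"]  # one student's answers
print(score_cloze(responses, keys))                           # 3 (exact-word)
print(score_cloze(responses, keys, {1: ["clean", "neat"]}))   # 4 (acceptable-word)
```

A multiple-choice cloze would instead present each blank with a fixed set of options, reducing scoring to the fixed-response format above.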
; Assessing listening skills
The following are examples of listening comprehension items and
procedures, ranging from the most discrete-point items to more integrative
assessment tasks. The teacher must decide when it is appropriate to use any of
these approaches for assessing listening in the classroom.
1. Discrimination of sounds: sound-discrimination items are of particular