February 2005 DGIV/EDU/LANG (2005) 1
Seminar to calibrate examples of spoken performances
in line with the scales of the
Common European Framework of Reference for Languages
CIEP, Sèvres, 2 - 4 December 2004
Brian NORTH (Eurocentres)
Sylvie LEPAGE (CIEP)
Language Policy Division
The seminar was organised by the Centre International d’Etudes Pédagogiques (CIEP) and Eurocentres as part of a project to provide samples of oral performance for French illustrating the levels of the Common European Framework of Reference (CEFR) (Council of Europe 2001). The aim of the seminar was to calibrate performances that could be presented and documented on a DVD illustrating performances at levels A1 to C2 of the CEFR. The programme is given as Appendix 1.
Examination institutes, language schools and university departments concerned with the teaching and testing of the French language are currently in the process of relating their curricula to the CEFR. A problem that arises in this regard is the question of assuring a consistent interpretation of the levels in different contexts.
In July 2002 the Finnish authorities organised a seminar for the Council of Europe (DG IV/EDU/LANG - 2002) to discuss this issue, as a result of which an authoring group was established to produce a preliminary version of a Manual to help institutions to relate their examinations to the CEFR (DG IV/EDU/LANG – 2003/5). That Manual, published for piloting in September 2003, envisages the process of linking an exam to the CEFR in three stages:
• Specification: define the coverage of the examination in terms of the CEFR;
• Standardisation: ensure an interpretation of the CEFR levels consistent with the interpretation elsewhere, exploiting illustrative samples of performances already calibrated to the CEFR in this process;
• Empirical Validation: check that the results given by the examination relate to the levels of the CEFR in the manner foreseen.
The primary aim of the French DVD project was to calibrate the oral performances that could be used on a standardisation DVD to facilitate a consistent interpretation of the CEFR levels for French. The project was an initiative that arose from the meeting in Strasbourg in April 2004 concerning piloting and case studies in relation to the draft Manual. At this meeting Sylvie Lepage presented the CIEP case study with regard to DELF DALF. Two videos illustrating the levels of the CEFR for English had also been distributed before the meeting. These consisted of samples from the CEFR Swiss project put together by Brian North and Gareth Hughes, plus a cassette illustrating performances at Cambridge ESOL examinations at the different CEFR levels. The Eurocentres Foundation and the CIEP decided at the meeting to produce the DVD in order to provide illustrative samples for the second official language of the Council of Europe: French. A further meeting of a group of experts concerned with the provision of illustrative samples held in Strasbourg in October 2004 was exploited as an opportunity to discuss in depth both the organisation of the programme of the planned seminar and the analysis of the resultant data.
Aims of the Seminar
The notion of a standardisation video exemplifying performances at different levels on a scale of language proficiency is in fact an innovation both in France and in the world of French as a foreign language. As a result there is an interest in such a tool not only from examination boards but also from language schools (e.g. Eurocentres), accreditation associations for language schools (e.g. EAQUALS) and from both teachers and teacher trainers who work with the CEFR and with the European Language Portfolio (ELP).
Because of this wide interest in such a DVD, and because this was the first international
benchmarking seminar in the Council of Europe’s projects related to the CEFR and the Manual,
the seminar had significant process aims in addition to the main objective of producing the actual
product – the DVD of calibrated samples. The full aims could be summarised as follows:
• Establish a consensus in the interpretation of the CEFR levels in relation to learner performances in French as a foreign language (Français Langue Étrangère = FLE) in France and abroad.
• Give participants practical experience of such a seminar so that they would be better able to organise one themselves. This applied both to those institutes involved in the world of FLE in France and to those invited institutes that planned to produce DVDs illustrating CEFR levels during 2005 (Goethe-Institut for German; Instituto Cervantes for Spanish; University of Perugia for Italian; University of Lisbon for Portuguese).
• Pilot a methodology for running such a seminar, giving practical input for a Guide to organising such an event.
• Calibrate examples of spoken performance by young adults from a wide range of countries.
• Identify those examples most suitable for inclusion on a DVD illustrating CEFR levels for French.
• Document the selected examples based on discussion at the seminar, the criteria of CEFR Table 3 (Oral Assessment Grid) and statistical characteristics.
Essentially, in an international benchmarking seminar of this type there are two different levels of
aims that are to some extent incompatible:
1. Calibrate and document the performances
In a language testing context, such a task is normally carried out by a relatively small
group of expert raters trained and experienced in the test context concerned. Essentially
the examination institute gives the experts the authority to dictate the way the criteria are
to be interpreted and the experts justify their decision by documenting the relationship of
the performances to the criteria and by demonstrating the degree of consistency (intra-
rater reliability) and agreement (inter-rater reliability). In the case of a seminar aiming to
calibrate to a common (not institute-specific) framework, it was not so obvious whether
calibrations based on the opinions of a closed group of experts would be representative of
different perspectives in the wider world of FLE.
2. Establish a consensus in interpreting the CEFR levels in the pedagogic culture(s)
The alternative is to assemble a wider group of experts and establish a consensus. The
establishment of such a consensus cannot be separated from the process of training. A
larger number of raters from different backgrounds will need training in order to follow
the same defined procedure and to make judgments in relation to the same CEFR criterion
descriptors. After training, “consensus” may be interpreted to mean:
a. an averaging of differing opinions from individual judgments;
b. convergence to agreement through discussion.
Ideally both approaches should produce the same result.
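These two readings of "consensus" can be sketched in code. The sketch below is illustrative only: the nine-band numeric mapping and the use of the statistical mode as a stand-in for convergence-through-discussion are assumptions for the example, not part of the seminar's actual procedure.

```python
from statistics import mean, mode

# Illustrative mapping of the nine bands (including "plus" levels) to numbers.
BANDS = ["A1", "A2", "A2+", "B1", "B1+", "B2", "B2+", "C1", "C2"]

def consensus(votes):
    """Two readings of "consensus" for one performance:
    (a) the average of independent individual judgements, rounded to a band;
    (b) the most frequent band, a crude proxy for convergence in discussion.
    """
    nums = [BANDS.index(v) for v in votes]
    averaged = BANDS[round(mean(nums))]
    most_frequent = mode(votes)
    return averaged, most_frequent
```

Ideally, as noted above, the two values coincide for a given set of votes.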
In the inclusive tradition of the Council of Europe’s modern languages projects, the decision was
taken to invite a wide range of participants representing different perspectives and seek consensus. It was decided to collect data both on individual judgments representing such different
perspectives and on the consensus formed after discussion.
Efforts were therefore made to ensure attendance by a large group representative of different perspectives on the interpretation of CEFR levels for French. The 38 participants listed in
Appendix 2 represented four different groups:
• experts from the French examination boards for the French language: CIEP and Alliance Française (10 persons);
• teachers from French language schools in French-speaking countries (10 persons);
• experts in the French language from other European educational systems (11 persons);
• experts in the CEFR levels who were not experts in the French language (7 persons).
The decision to invite such a large group representing radically different perspectives was taken for several reasons. First, because little standardisation discussion had taken place in relation to French as a foreign language, it was felt politically important to include as many people from the world of French as a foreign language as possible. Secondly, the examination
institutes and French language schools in France were simultaneously starting to standardise on the CEFR, and it was felt better to seek convergence rather than to impose the view of one
institution on others. Thirdly, since use of the CEFR was advanced in other countries in Europe, it was felt wise to include approximately 50% of participants from contexts abroad.
In addition to conventional analysis, more sophisticated statistical analyses in which the four groups were separately identified were undertaken. These would show whether there were significantly different interpretations among the different groups of raters represented. The organisers are grateful to Cambridge ESOL and ALTE (Association of Language Testers in
Europe), and in particular to Neil Jones, for the support offered in carrying out these analyses.
The video recordings were made between May and September 2004 at Eurocentres Paris, the
CIEP studio at Sèvres, the Centre International d’Etudes de Langue at Brest and the Collège
International de Cannes. Other recordings were also made in DELF DALF test centres at Prague
and Madrid, but these had to be abandoned because of technical problems with sound and vision quality. The recordings shown at the seminar had performances from young adults from 15
countries: Belgium, Brazil, China, Colombia, Germany, Great Britain, Italy, Mexico, Peru, Serbia-Montenegro, Sweden, Switzerland, Ukraine, the United Arab Emirates and the United
States of America.
Each learner had signed a form authorising the transmission of their video image on such a standardisation DVD and on the internet. The learners filmed were selected in a systematic
fashion and their language level in relation to the CEFR was documented with teachers’
evaluations, with questionnaires and with test results – especially for the new CEFR-based DELF DALF.
The filmed performances were then viewed at a workshop on 14th September by an expert group
consisting of Béatrice Dupoux (CIEP), Gareth Hughes (Migros Club Schools – and member of
the Portfolio Validation Committee), Sylvie Lepage (CIEP), Marie-Claude Moyer (Eurocentres
France) and Brian North (Eurocentres – and coordinator of the CEFR Manual authoring group). At this workshop the performances were rated onto the CEFR levels and discussed before a final
selection of performances for the seminar was made. The performances are listed in Appendix 3
from level C2 to A1.
The tasks filmed were a further development of those developed for standardisation videos in the
Swiss research project that had produced the CEFR levels and descriptors. These tasks had been
used for the English video circulated in April 2004 and were recommended in the Council of
Europe’s “Brief for Recording.”
Each recording shows two learners, with no native-speaker examiner/interlocutor. There are three phases:
• a production phase by the first learner with a sustained monologue, which may be followed by questions from the other learner;
• a similar production phase by the other learner;
• a phase of spontaneous interaction between the two learners.
This same test format was used at all levels. The production phase is “semi-prepared” in that the
learners can choose a topic and have about 10 minutes in which to reflect on what they want to
say. The interaction phase is completely spontaneous, elicited by cards (or strips of paper)
containing discussion questions. As with the production phase there is an element of learner
choice in the topics for the interaction phase in that the two learners could discard topics that did
not interest them.
Recordings with a total duration of up to 12 minutes were shown in their entirety. Longer recordings (a couple lasted 20 minutes) were shortened to extracts of 3-4 minutes for each of the three phases.
Rating Instruments and Procedures
The main rating instrument was the criteria grid presented as CEFR Table 3 (Appendix 4), which
defines for each of the 6 CEFR levels the requirements in terms of Range (Étendue), Accuracy
(Correction), Fluency (Aisance), Interaction (Interaction) and Coherence (Cohérence). This grid
had been distributed to participants before the seminar with the request to study it.
Participants were also provided with a supplementary grid giving descriptors for the “plus” levels
in the middle of the CEFR scale A2+, B1+ and B2+ (Appendix 5). The addition of these 3 “plus”
levels provided 9 levels. These 9 levels reflect the linear scale produced in the Swiss research
project. In the CEFR, descriptors on the scales that are at these “plus” levels are presented above a horizontal line, with the descriptors for the criterion level underneath this horizontal line.
It was stressed several times that, before giving the final Global judgement, participants should consult the other relevant CEFR scales with which they had also been provided (Appendix 6). These
consisted of a selection of scales for Spoken Interaction and for Spoken Production from Chapter
4 as well as the scale for Phonological Control from Chapter 5, since this qualitative category
was not included in CEFR Table 3.
Rating was first done on paper using Form B2 (analytic rating form) from the Manual given as
Appendix 7. Votes were recorded electronically with CEFR levels corresponding to buttons on
the keypad as shown in Appendix 8.
The same basic pattern of rating was followed once the initial familiarisation and training phases, described later, had been completed.
1. Watch the sequence (Production: Learner A; Production: Learner B; Interaction between
Learners A & B).
2. Consult criteria grids and record on Form B2 the ratings for both learners for the 5
categories: Range (Étendue), Accuracy (Correction), Fluency (Aisance), Interaction
(Interaction) and Coherence (Cohérence).
3. Consult other scales provided, reflect and rate the Global Level for both learners on Form B2.
4. Electronic Voting: individual votes for Range (Étendue), Accuracy (Correction), Fluency (Aisance), Interaction (Interaction), Coherence (Cohérence) and Global (Note Globale).
5. View Histogram of the individual global judgements.
6. Group discussion in groups of 6-7 mixed by background and with a CIEP person as
chair/rapporteur. Normally 10 minutes discussion.
7. Reports from Group leaders.
8. Plenary discussion.
9. Electronic Vote: Global (Note Globale).
10. View Histogram of global judgements after discussion.
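The histogram steps (5 and 10 above) amount to a simple tally of the anonymous keypad votes per band. A minimal sketch, taking the nine-band keypad layout as an assumption:

```python
from collections import Counter

# The nine bands assumed to be available on the voting keypads.
BANDS = ["A1", "A2", "A2+", "B1", "B1+", "B2", "B2+", "C1", "C2"]

def vote_histogram(votes):
    """Tally anonymous keypad votes into per-band counts: the data behind
    the histogram displayed to participants between the two voting rounds."""
    counts = Counter(votes)
    return {band: counts.get(band, 0) for band in BANDS}
```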
The rationale was to:
a) capture the full data of participants’ individual judgements before any conferring;
b) attempt to establish consensus through (i) seeing one’s own anonymous vote in the
context of the total pattern of votes, and (ii) discussion.
As mentioned before, the desired result for sequences that would be included on the DVD was
that the product of the objective analysis of the individual data (a) would coincide with the
consensus arrived at in the discussion (b).
At the beginning of the seminar, however, before moving on to rate performances, participants
completed two familiarisation exercises along the lines suggested in the Manual, modified by suggestions made at the October seminar in Strasbourg and adapted again to the time available.
The first exercise involved identifying the CEFR level of descriptors from the two CEFR scales “Overall Spoken Interaction” and “Overall Spoken Production” when these were presented in a
mixed list in random order. A version of the worksheet with the answers added is given as
Appendix 9. The follow-up was an explanation of why each descriptor was at the level to which it had been assigned, drawing attention to the salient features characteristic of the level concerned. This exercise was effective and drew attention to the fact that some participants were misleading themselves by taking one word out of context (e.g. “can interact…” in a descriptor for A1, despite the fact that this descriptor is qualified with provisos and conditions).
The second familiarisation exercise involved the grid to be used as the rating criteria in the seminar: CEFR Table 3. Participants were asked to work in pairs with their neighbour. The exercise was an ambitious one in which participants were given a blank grid and 24 pieces of paper to place in the correct cells, with the warning that 6 of the 30 cells were missing. Some participants completed the task astonishingly quickly and appeared to enjoy it thoroughly. Many, however, had difficulty seeing which descriptors defined which category. This was particularly the case with Range (Étendue) and Accuracy (Correction). This may have been at least partly caused by the fact that, judging by later discussion in the seminar, it appears to be customary in French as a foreign language to interpret the former as “Vocabulary” – rather than as the range of language resources available: structures, turns of phrase, words, knowledge of collocation and colligation, etc.
The first evaluation sequence was also a form of familiarisation task. Participants were asked to watch a first short sequence showing two learners considered to be B1, consult the criteria and rate just the overall, global level of each learner. This task aimed to get participants accustomed to the rating instruments and to the process of first recording judgments on paper and then voting electronically, but it also served to show the degree of agreement in the group.
Training in Procedures and Criteria
There then followed a phase of training in which the production phases of three sequences were rated separately on each of three criteria.
Two such sequences (Margarida and Mariana; Valérie and Sophie) were rated on the Thursday.
The third sequence (Deborah and Iryna) was evaluated on the morning of the main rating day (Friday), after the explanations of the use of the criteria.
The same procedure was followed for each of the three sequences:
• First the production phase of Learner A was viewed and rated individually for RANGE (with no discussion); then it was repeated and rated for ACCURACY (again no discussion); then it was repeated a third time and rated for FLUENCY – without discussion.
• Exactly the same procedure was then applied to Learner B.
• Finally, the Interaction phase – showing both learners – was played once, and participants were asked to finalise their decisions on the three criteria used and to make a global judgement on the overall level.
Discussion in groups of 5-6 followed for approximately 15 minutes, followed by a second
viewing of the Interaction phase, and a final judgement (the vote after discussion).
Participants found discussing learner proficiency and justifying their opinions in relation to
specific criteria during the seminar a very positive experience, as shown in their feedback
comments in Appendix 11. Several also commented that they much appreciated the explanations
of the Table 3 criteria and other CEFR descriptors during the seminar. However, the process of
focusing on detailed formulations for one aspect of performance rather than starting with the
question: “What level is this person in the general scheme of things?” did appear to cause some
participants difficulty. This point is returned to at the end of the report.
The statistical analysis includes the provision of conventional descriptive summaries of ratings (mean, mode). However, the main analysis employed a multi-faceted Rasch model scaling
analysis, using the program FACETS (Linacre 1989). The illustrative video for English showing
Swiss adult learners contained examples calibrated with FACETS in the original CEFR research
project (100 raters: North 2000) and confirmed at a final conference of the Swiss project (circa 25 raters again with a FACETS analysis).
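For readers unfamiliar with the model, a many-facet Rasch rating-scale analysis of this kind models the log-odds of a rating falling in one band rather than the band below as learner ability minus criterion difficulty minus rater severity minus a band threshold, all expressed in logits. The sketch below illustrates the model with invented parameter values; it is not FACETS itself.

```python
import math

def mfrm_category_probs(ability, criterion_difficulty, rater_severity, thresholds):
    """Category probabilities under a many-facet Rasch rating-scale model:
    log(P_k / P_{k-1}) = ability - criterion_difficulty - rater_severity - thresholds[k-1].
    All parameters are in logits; `thresholds` are the step difficulties
    between adjacent rating bands.
    """
    # The cumulative sum of step log-odds gives the (unnormalised)
    # log-probability of each band relative to the lowest band.
    logits = [0.0]
    for tau in thresholds:
        step = ability - criterion_difficulty - rater_severity - tau
        logits.append(logits[-1] + step)
    total = sum(math.exp(l) for l in logits)
    return [math.exp(l) / total for l in logits]
```

With a severe rater (positive severity), the probability mass shifts towards the lower bands for the same learner ability, which is exactly the effect FACETS estimates and compensates for.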
Although not exactly user-friendly, FACETS has the following advantages:
? Rater Consistency
Raters who are inconsistent in their judgements (sometimes lenient, sometimes strict) can be
identified and removed from the analysis.
In initial analysis runs it appeared that 3 or 4 raters in Sèvres were close to or outside the
conventionally acceptable criteria for rater consistency. FACETS, like all Rasch programs,
produces “fit statistics” that quantify the extent to which something or someone “fits” the
expectations and assumptions of the statistical model. A rater who “misfits” is being
inconsistent: rating good people low, and weak people high – or in some other way changing
their standard for judgments. The quality and accuracy of the measurement is improved if
data from such raters is removed from the analysis. The discovery that 5-10% of raters show
inconsistency is not unusual. There can be many reasons. Sometimes the “misfit” is
concentrated in just one or two judgements, and just that unreliable data can be removed.
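The fit statistics referred to here are typically the outfit and infit mean-squares, computed from the residuals between a rater's observed ratings and the ratings the model expects. A minimal sketch (values near 1.0 indicate acceptable fit; the common rule-of-thumb flag threshold of about 1.3 is a general convention, not a figure from this report):

```python
def fit_mean_squares(observed, expected, variance):
    """Outfit and infit mean-square statistics for one rater.

    observed/expected: observed rating and model-expected rating per judgement;
    variance: model variance of each rating. Values near 1.0 indicate the
    rater fits the model; values well above ~1.3 flag inconsistency (misfit).
    """
    squared_resid = [(x - e) ** 2 for x, e in zip(observed, expected)]
    z2 = [r / v for r, v in zip(squared_resid, variance)]
    outfit = sum(z2) / len(z2)                     # unweighted mean-square
    infit = sum(squared_resid) / sum(variance)     # information-weighted
    return outfit, infit
```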
? Rater Severity
Differing severity (leniency/strictness) of the raters can be taken into account in calculating
the “ability estimate” for the learners. Extreme cases can be eliminated if desired, though
FACETS actually compensates for severity, thus objectifying the judgements. Provided
lenient or severe raters are being consistent, therefore, they can be safely retained in the analysis.
In the Sèvres data, once the raters had tuned in, the range of severity they showed was not
remarkable. There were some differences between individuals, but the program FACETS is
designed precisely to take this into account in arriving at an objective result.
? Difficult Performances
Learners who prove difficult to assess because of some extraneous factor can also be
identified. Lack of “fit” with certain learners can be due to poor audio quality, lack of familiarity with speakers of the mother tongue concerned, bad pronunciation, or some
other factor in the performance confusing to raters.
Among the samples shown, Luis (a Peruvian paired with Aleksandar) misfitted seriously, reflecting raters’ extreme difficulty in concentrating on what was an idiosyncratic, nervy performance with poor enunciation. The only other performances with any noticeable level of misfit were in the sequence with Ambrogio and Silvia. This interaction was somewhat more
intimate than the others and slight misfit on both learners may have been caused by fast
diction from Ambrogio and some embarrassment from Silvia.
? Identifying Significantly Differing Interpretation Among Groups
Since the 4 different groups to which participants belong are defined by context and were
identified as a “facet” in the analysis, it is possible to measure whether any group:
o tends to rate the performance of any learner significantly differently to the other groups;
o as a group is significantly more lenient or more strict than the other groups.
In the event there were no significant differences in these two respects in the interpretations of the four different groups of experts. Indeed, the only noticeable distinction was that the
group “French language experts in European projects” tended to be more severe on the
criterion Correction (accuracy) and more lenient on Interaction, whereas the group “French
language schools in France” rated more evenly across all the criteria.
? Mathematical scaling to CEF Levels
The cut-off points on the mathematical scale dividing the CEF levels (CEF Manual Table 6.2
given as Appendix 10) can be used to “anchor” the analysis to the (logit) scale created in the
original CEF research project so that a mathematical “ability estimate” is made for each
learner. This can be useful when discussing whether a learner is, for example, a strong B1, a
typical B1, a borderline B1, etc.
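The anchoring step can be pictured as a lookup of each learner's ability estimate against the cut-off points. The logit values below are invented placeholders, not the actual figures from CEF Manual Table 6.2:

```python
# Hypothetical cut-off points (in logits) between adjacent CEF levels.
# The real values are those of CEF Manual Table 6.2 (Appendix 10).
CUTOFFS = [(-1.5, "A1"), (-0.5, "A2"), (0.5, "B1"), (1.5, "B2"), (2.5, "C1")]

def level_for(ability_logit):
    """Map a FACETS ability estimate onto the CEF level band it falls in."""
    for upper_bound, level in CUTOFFS:
        if ability_logit < upper_bound:
            return level
    return "C2"
```

The distance of the estimate from the nearest cut-off is what supports talk of a “strong”, “typical” or “borderline” B1.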
There are issues with regard to the best interpretation of the results from a FACETS analysis
of rating scale data and those who are interested are referred to the report available in English
by Neil Jones, the ALTE data analyst. The FACETS scaling of the independent judgements
before discussion produced an illuminating result that is interesting to compare with the result
based purely on aggregating raters’ opinions.
In making a final decision for each sample, the FACETS result from individual judgements
was compared to the consensus opinion established after discussion.
Inter-rater reliability: A first point to be made is that the ratings show an impressive level of
inter-rater reliability. The mean inter-rater correlation for the independent judgements before
discussion was 0.886, and that for the more consensual view after discussion was 0.967. In
interpreting these high coefficients, it must be borne in mind that the full range of language
proficiency was being assessed, from below Level A1 to Level C2. It is easier to achieve such
high correlations in these circumstances. Nevertheless, such inter-rater reliability coefficients
compare well with those reported in the literature for ratings by expert, trained judges across the
range of proficiency.
Halo Effect: A second point is that there was an extremely high inter-correlation of 0.992 to
0.998 between the five criteria: Range (Étendue), Accuracy (Correction), Fluency (Aisance),
Interaction (Interaction) and Coherence (Cohérence). The criteria were clearly not being used in
an independent fashion. To quote the analysis report: “On this evidence either the rating criteria
are not distinct, or the raters were not able to identify that they were distinct (=halo effect) or the
subjects do not display distinct profiles of skill.” This result is not unusual: such criteria give
raters a focus to help them to arrive at a considered judgement and may not actually represent
truly independent variables.
Training Effect: The analysis suggests that during the course of the seminar the participants as
individuals became more consistent in their interpretation of the criteria (CEF descriptors). This
can be seen in the way that the “fit” statistics for the facet “Day” improved over the three days as
the participants became more familiar with the criteria and the procedures. Such an effect is to be
expected in a seminar which had the aim of developing a consensus view.
CEF Levels of Samples: The placement of the samples at CEF levels is best considered by
comparing the FACETS analysis of the independent judgements before discussion with the
consensus reached after discussion. The latter cannot properly be subjected to a FACETS
analysis because the judgements in the data set are not independent and because such an analysis
cannot handle 100% scores - as happened for Josue and Rachel, for example, who were
unanimously considered to be C2.
The following table compares the conclusions reached from the two sets of votes, and makes a
final recommendation in the right hand column.
No. | Name            | Judgements (FACETS) | Judgements after Discussion | Definitive CEF Level
33  | Josue           | C2                  | C2                          | C2
32  | Rachel          | C2                  | C2                          | C2
31  | Aleksandar DALF | C1                  | C1                          | C1
15  | Ambrogio        | C1                  | C1                          | C1
25  | Aleksandar      | C1                  | C1                          | C1
8   | Xi              | B2+                 | B2+                         | B2+
26  | Luis            | B2+                 | (B2+)                       | B2+
16  | Silvia          | B2+                 | B2+                         | B2+
7   | Nataliya        | B2                  | B2                          | B2
37  | Gu Jung         | B1+                 | B1+                         | B1+
4   | Sophie          | B1+                 | B1+                         | B1+
3   | Valérie         | B1+                 | B1                          | B1
13  | Evelyne         | B1                  | B1                          | B1
1   | Margarida       | B1                  | B1                          | B1
2   | Mariana         | B1                  | B1                          | B1
14  | Andrea          | B1                  | A2+                         | (A2+) / B1