Seminar to calibrate examples of spoken performance

By Zachary Lee,2014-01-20 03:31

February 2005 DGIV/EDU/LANG (2005) 1

Seminar to calibrate examples of spoken performances

    in line with the scales of the

    Common European Framework of Reference for Languages

    CIEP, Sèvres, 2 - 4 December 2004



    Brian NORTH (Eurocentres)

    Sylvie LEPAGE (CIEP)

    Language Policy Division


    The seminar was organised by the Centre International d’Etudes Pédagogiques (CIEP) and Eurocentres as part of a project to provide samples of oral performance for French illustrating the levels of the Common European Framework of Reference (CEFR) (Council of Europe 2001). The aim of the seminar was to calibrate performances that could be presented and documented on a DVD illustrating performances at levels A1 to C2 of the CEFR. The programme is given as Appendix 1.


    Examination institutes, language schools and university departments concerned with the teaching and testing of the French language are currently in the process of relating their curricula to the CEFR. A problem that arises in this regard is the question of assuring a consistent interpretation of the levels in different contexts.

    In July 2002 the Finnish authorities organised a seminar for the Council of Europe (DG IV/EDU/LANG - 2002) to discuss this issue, as a result of which an authoring group was established to produce a preliminary version of a Manual to help institutions to relate their examinations to the CEFR (DG IV/EDU/LANG 2003/5). That Manual, published for piloting in September 2003, envisages the process of linking an exam to the CEFR in three stages:

    - Specification: define the coverage of the examination in terms of the CEFR;

    - Standardisation: ensure a consistent interpretation of the CEFR levels related to the interpretation elsewhere, exploiting illustrative samples of performances already calibrated to the CEFR in this process;

    - Empirical Validation: check that the results given by the examination relate to the levels of the CEFR in the manner foreseen.

    The primary aim of the French DVD project was to calibrate the oral performances that could be used on a standardisation DVD to facilitate a consistent interpretation of the CEFR levels for French. The project was an initiative that arose from the meeting in Strasbourg in April 2004 concerning piloting and case studies in relation to the draft Manual. At this meeting Sylvie Lepage presented the CIEP case study with regard to DELF DALF. Two videos illustrating the levels of the CEFR for English had also been distributed before the meeting. These consisted of samples from the CEFR Swiss project put together by Brian North and Gareth Hughes, plus a cassette illustrating performances at Cambridge ESOL examinations at the different CEFR levels. The Eurocentres Foundation and the CIEP decided at the meeting to produce the DVD in order to provide illustrative samples for the second official language of the Council of Europe: French. A further meeting of a group of experts concerned with the provision of illustrative samples held in Strasbourg in October 2004 was exploited as an opportunity to discuss in depth both the organisation of the programme of the planned seminar and the analysis of the resultant data.

Aims of the Seminar

    The notion of a standardisation video exemplifying performances at different levels on a scale of language proficiency is in fact an innovation both in France and in the world of French as a foreign language. As a result there is an interest in such a tool not only from examination boards but also from language schools (e.g. Eurocentres), accreditation associations for language schools (e.g. EAQUALS) and from both teachers and teacher trainers who work with the CEFR and with the European Language Portfolio (ELP).


Because of this wide interest in such a DVD, and because this was the first international benchmarking seminar in the Council of Europe’s projects related to the CEFR and the Manual, the seminar had significant process aims in addition to the main objective of producing the actual product: the DVD of calibrated samples. The full aims could be summarised as follows:

Process aims

    - Establish a consensus in the interpretation of the CEFR levels in relation to learner performances in French as a foreign language (Français Langue Étrangère = FLE in France and abroad).

    - Give participants practical experience of such a seminar so that they would be better able to organise one themselves. This applied both to those institutes involved in the world of FLE in France and also to those institutes invited who planned to produce DVDs illustrating CEFR levels during 2005 (Goethe-Institut for German; Instituto Cervantes for Spanish; University of Perugia for Italian; University of Lisbon for Portuguese).

    - Pilot a methodology for running such a seminar, giving practical input for a Guide to organising such an event.

Product aims

    - Calibrate examples of spoken performance for young adults from a wide range of

    - Identify those examples most suitable for inclusion on a DVD illustrating CEFR levels for

    - Document the selected examples based on discussion at the seminar, the criteria of CEFR Table 3 (Oral Assessment Grid) and statistical characteristics.

Essentially, in an international benchmarking seminar of this type there are two different levels of

    aims that are to some extent incompatible:

1. Calibrate and document the performances

    In a language testing context, such a task is normally carried out by a relatively small

    group of expert raters trained and experienced in the test context concerned. Essentially

    the examination institute gives the experts the authority to dictate the way the criteria are

    to be interpreted and the experts justify their decision by documenting the relationship of

    the performances to the criteria and by demonstrating the degree of consistency (intra-

    rater reliability) and agreement (inter-rater reliability). In the case of a seminar aiming to

    calibrate to a common (not institute-specific) framework, it was not so obvious whether

    calibrations based on the opinions of a closed group of experts would be representative of

    different perspectives in the wider world of FLE.

    2. Establish a consensus in interpreting the CEFR levels in the pedagogic culture(s)


    The alternative is to assemble a wider group of experts and establish a consensus. The

    establishment of such a consensus cannot be separated from the process of training. A

    larger number of raters from different backgrounds will need training in order to follow

    the same defined procedure and to make judgments in relation to the same CEFR criterion

    descriptors. After training, “consensus” may be interpreted to mean:

    a. an averaging of differing opinions from individual judgments;


    b. convergence to agreement through discussion.

    Ideally both approaches should produce the same result.

In the inclusive tradition of the Council of Europe’s modern languages projects, the decision was

    taken to invite a wide range of participants representing different perspectives and seek consensus. It was decided to collect data both on individual judgments representing such different

    perspectives and on the consensus formed after discussion.


    Efforts were therefore made to ensure attendance by a large group representative of different perspectives on the interpretation of CEFR levels for French. The 38 participants listed in

    Appendix 2 represented four different groups:

    - experts from the French examinations boards for the French language: CIEP and Alliance Française (10 persons);

    - teachers from French language schools in French-speaking countries (10 persons);

    - experts in the French language from other European educational systems (11 persons);

    - experts in the CEFR levels who were not experts in the French language (7 persons).

    The decision to invite such a large group representing radically different perspectives was taken for several reasons. Firstly, because little standardisation discussion had taken place in relation to French as a foreign language, it was felt politically important to include as many people from the world of French as a foreign language as possible. Secondly, the examination institutes and French language schools in France were simultaneously starting to standardise on the CEFR, and it was felt better to seek convergence rather than to impose the view of one institution on others. Thirdly, since use of the CEFR was advanced in other countries in Europe, it was felt wise to include approximately 50% of participants from contexts abroad.

    In addition to conventional analysis, more sophisticated statistical analyses in which the four groups are separately identified were undertaken, in order to show whether there were significantly different interpretations between the different groups of raters represented. The organisers are grateful to Cambridge ESOL and ALTE (Association of Language Testers in Europe), and in particular to Neil Jones, for the support offered in carrying out these analyses.


    The video recordings were made between May and September 2004 at Eurocentres Paris, the

    CIEP studio at Sèvres, the Centre International d’Etudes de Langue at Brest and the Collège

    International de Cannes. Other recordings were also made in DELF DALF test centres at Prague

    and Madrid, but these had to be abandoned because of technical problems with sound and vision quality. The recordings shown at the seminar had performances from young adults from 15

    countries: Belgium, Brazil, China, Colombia, Germany, Great Britain, Italy, Mexico, Peru,

    Serbia-Montenegro, Sweden, Switzerland, the Ukraine, the United Arab Emirates and the United

    States of America.

Each learner had signed a form authorising the transmission of their video image on such a standardisation DVD and on the internet. The learners filmed were selected in a systematic fashion and their language level in relation to the CEFR was documented with teachers’ evaluations, with questionnaires and with test results, especially for the new CEFR-based DELF DALF.

    The filmed performances were then viewed at a workshop on 14th September by an expert group

    consisting of Béatrice Dupoux (CIEP), Gareth Hughes (Migros Club Schools and member of

    the Portfolio Validation Committee), Sylvie Lepage (CIEP), Marie-Claude Moyer (Eurocentres

    France) and Brian North (Eurocentres and coordinator of the CEFR Manual authoring group). At this workshop the performances were rated onto the CEFR levels and discussed before a final

    selection of performances for the seminar was made. The performances are listed in Appendix 3

    from level C2 to A1.


    The tasks filmed were a further development of those created for standardisation videos in the Swiss research project that had produced the CEFR levels and descriptors. These tasks had been used for the English video circulated in April 2004 and were recommended in the Council of Europe’s “Brief for Recording.”

Each recording shows two learners, with no native-speaker examiner/interlocutor. There are three phases:

    - a production phase by the first learner with a sustained monologue, which may be followed by questions from the other learner;

    - a similar production phase by the other learner;

    - a phase of spontaneous interaction between the two learners.

This same test format was used at all levels. The production phase is “semi-prepared” in that the

    learners can choose a topic and have about 10 minutes in which to reflect on what they want to

    say. The interaction phase is completely spontaneous, elicited by cards (or strips of paper)

    containing discussion questions. As with the production phase there is an element of learner

    choice in the topics for the interaction phase in that the two learners could discard topics that did

    not interest them.

Recordings with a total duration of up to 12 minutes were shown in their entirety. Longer

    recordings (a couple lasted 20 minutes) were shortened to extracts of 3-4 minutes for each of the

    three phases.

Rating Instruments and Procedures

    The main rating instrument was the criteria grid presented as CEFR Table 3 (Appendix 4), which

    defines for each of the 6 CEFR levels the requirements in terms of Range (Étendue), Accuracy

    (Correction), Fluency (Aisance), Interaction (Interaction) and Coherence (Cohérence). This grid

    had been distributed to participants before the seminar with the request to study it.

Participants were also provided with a supplementary grid giving descriptors for the “plus” levels

    in the middle of the CEFR scale A2+, B1+ and B2+ (Appendix 5). The addition of these 3 “plus”

    levels provided 9 levels. These 9 levels reflect the linear scale produced in the Swiss research

    project. In the CEFR, descriptors on the scales that are at these “plus” levels are presented above a horizontal line, with the descriptors for the criterion level underneath this horizontal line.


It was stressed several times that, before giving the final Global judgement, participants should consult other relevant CEFR scales with which they had also been provided (Appendix 6). These

    consisted of a selection of scales for Spoken Interaction and for Spoken Production from Chapter

    4 as well as the scale for Phonological Control from Chapter 5, since this qualitative category

    was not included in CEFR Table 3.

Rating was first done on paper using Form B2 (analytic rating form) from the Manual given as

    Appendix 7. Votes were recorded electronically with CEFR levels corresponding to buttons on

    the keypad as shown in Appendix 8.

The same basic pattern of rating was followed once the initial familiarisation and training phases described below had been completed:

    1. Watch the sequence (Production: Learner A; Production: Learner B; Interaction between

    Learners A & B).

    2. Consult criteria grids and record on Form B2 the ratings for both learners for the 5

    categories: Range (Étendue), Accuracy (Correction), Fluency (Aisance), Interaction

    (Interaction) and Coherence (Cohérence).

    3. Consult other scales provided, reflect and rate the Global Level for both learners on

    Form B2.

    4. Electronic Voting: individual votes: Range (Étendue), Accuracy (Correction), Fluency

    (Aisance), Interaction (Interaction) and Coherence (Cohérence), Global (Note Globale)

    5. View Histogram of the individual global judgements.

    6. Group discussion in groups of 6-7 mixed by background and with a CIEP person as

    chair/rapporteur. Normally 10 minutes discussion.

    7. Reports from Group leaders.

    8. Plenary discussion.

    9. Electronic Vote: Global (Note Globale).

    10. View Histogram of global judgements after discussion.

The rationale was to:

    a) capture the full data of participants’ individual judgements before any conferring;

    b) attempt to establish consensus through (i) seeing one’s own anonymous vote in the

    context of the total pattern of votes, and (ii) discussion.

As mentioned before, the desired result for sequences that would be included on the DVD was

    that the product of the objective analysis of the individual data (a) would coincide with the

    consensus arrived at in the discussion (b).
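
The comparison between individual votes (a) and the votes after discussion (b) can be illustrated with a minimal Python sketch. It is not the software used at the seminar: the votes are invented, the function names are hypothetical, and only the nine levels named in this report are modelled. It tallies a histogram per round and checks whether the modal level is the same before and after discussion.

```python
from collections import Counter
from statistics import median_low

# The nine levels used at the seminar (six CEFR levels plus three "plus" levels),
# ordered from lowest to highest.
LEVELS = ["A1", "A2", "A2+", "B1", "B1+", "B2", "B2+", "C1", "C2"]


def histogram(votes):
    """Count the votes per level, in scale order, keeping empty levels."""
    counts = Counter(votes)
    return {level: counts.get(level, 0) for level in LEVELS}


def modal_level(votes):
    """Most frequent level. Ties go to the lower level (an arbitrary choice)."""
    counts = histogram(votes)
    best = max(counts.values())
    return next(level for level in LEVELS if counts[level] == best)


def median_level(votes):
    """Middle vote on the ordered scale (lower middle for an even count)."""
    ranks = sorted(LEVELS.index(vote) for vote in votes)
    return LEVELS[median_low(ranks)]


# Invented global votes for one learner, for illustration only.
before = ["B1", "B1+", "B1", "B1+", "B1+", "A2+", "B1"]
after = ["B1", "B1", "B1", "B1+", "B1", "B1", "B1"]

for label, votes in (("before discussion", before), ("after discussion", after)):
    print(label, histogram(votes), "mode:", modal_level(votes))

print("same modal level:", modal_level(before) == modal_level(after))
```

A simple tally of this kind only aggregates opinions. It does not adjust for rater severity or consistency, which is what the FACETS analysis described later in the report is for.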

Familiarisation Exercises

    At the beginning of the seminar, however, before moving on to rate performances, participants

    completed two familiarisation exercises along the lines suggested in the Manual, modified by

    suggestions made at the October seminar in Strasbourg, modified again to meet the requirements

    of the time available.


    The first exercise involved identifying the CEFR level of descriptors from the two CEFR scales “Overall Spoken Interaction” and “Overall Spoken Production” when these were presented in a

    mixed list in random order. A version of the worksheet with the answers added is given as

    Appendix 9. The follow-up was an explanation of why each descriptor belonged at the level to which it was assigned, drawing attention to the salient features characteristic of the level concerned. This exercise was effective and drew attention to the fact that some participants were misleading themselves by taking one word out of context (e.g. “can interact…” in a descriptor for A1, despite the fact that this descriptor is qualified with provisos and conditions).

    The second familiarisation exercise involved the grid to be used as rating criteria in the seminar: CEFR Table 3. Participants were asked to work with their neighbour in pairs. The exercise was an ambitious one in which participants were given a blank grid and 24 pieces of paper to place in the correct cells, with the warning that 6 of the 30 cells were missing. Some participants

    completed the task astonishingly quickly and appeared to thoroughly enjoy it. Many, however, had difficulty seeing which descriptors were defining which category. This was particularly the case with Range (Étendue) and Accuracy (Correction). This may have been at least partly caused by the fact that, judging by later discussion in the seminar, it appears to be customary in French as a foreign language to interpret the former as “Vocabulary” rather than as the range of language resources available: structures, turns of phrase, words, knowledge of collocation and colligation, etc.

    The first evaluation sequence was also a form of familiarisation task. Participants were asked to watch a first short sequence showing two learners considered to be B1, consult the criteria and rate just the overall, global level of each learner. This task aimed to get participants accustomed to the rating instruments and to the process of first recording judgments on paper and then voting electronically, but it also served to show the degree of agreement in the group.

Training in Procedures and Criteria

    There then followed a phase of training in which the production phases of three sequences were rated separately on each of three criteria.

Two such sequences (Margarida and Mariana; Valérie and Sophie) were rated on the Thursday.

    The third sequence (Deborah and Iryna) was evaluated on the morning of the main rating day

    (Friday) after the explanations on the use of the criteria.

The same procedure was followed for each of the three sequences:

    - First the production phase of Learner A was viewed and rated individually for RANGE (with no discussion), then it was repeated and rated for ACCURACY (again no discussion), then it was repeated a third time and rated for FLUENCY without discussion.

    - Exactly the same was then applied to Learner B.

    - Finally, the Interaction phase showing both learners was played once, and participants were asked to finalise their decisions on the three criteria used and to make a global judgement on the overall level.

Discussion in groups of 5-6 followed for approximately 15 minutes, followed by a second

    viewing of the Interaction phase, and a final judgement (the vote after discussion).


Participants found discussing learner proficiency and justifying their opinions in relation to

    specific criteria during the seminar a very positive experience, as shown in their feedback

    comments in Appendix 11. Several also commented that they much appreciated the explanations

    of the Table 3 criteria and other CEFR descriptors during the seminar. However, the process of

    focusing on detailed formulations for one aspect of performance rather than starting with the

    question: “What level is this person in the general scheme of things?” did appear to cause some

    participants difficulty. This point is returned to at the end of the report.

Analysis Method

    The statistical analysis includes the provision of conventional descriptive summaries of ratings

    (mean, mode). However, the main analysis employed a multi-faceted Rasch model scaling

    analysis, using the program FACETS (Linacre 1989). The illustrative video for English showing

    Swiss adult learners contained examples calibrated with FACETS in the original CEFR research

    project (100 raters: North 2000) and confirmed at a final conference of the Swiss project (circa 25 raters again with a FACETS analysis).
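
For readers unfamiliar with the approach, the multi-faceted Rasch model can be written in its usual rating scale form as below. The report does not give the exact specification used at Sèvres, so the choice of facets here (learner, criterion, rater) is an assumption for illustration; the actual analysis also identified rater group and day as facets.

```latex
% Many-facet Rasch model, rating scale form.
% P_{nijk}: probability that rater j awards learner n category k on criterion i.
\[
  \ln\!\left(\frac{P_{nijk}}{P_{nij(k-1)}}\right) = B_n - D_i - C_j - F_k
\]
% B_n : ability of learner n (in logits)
% D_i : difficulty of criterion i
% C_j : severity of rater j
% F_k : difficulty of the step from category k-1 to category k
```

Because rater severity enters the model as its own term, a consistently strict or lenient rater shifts the expected ratings without distorting the ability estimate for the learner.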

Although not exactly user-friendly, FACETS has the following advantages:

- Rater Consistency

    Raters who are inconsistent in their judgements (sometimes lenient, sometimes strict) can be

    identified and removed from the analysis.

    In initial analysis runs it appeared that 3 or 4 raters in Sèvres were close to or outside the

    conventionally acceptable criteria for rater consistency. FACETS, like all Rasch programs,

    produces “fit statistics” that quantify the extent to which something or someone “fits” the

    expectations and assumptions of the statistical model. A rater who “misfits” is being

    inconsistent: rating good people low, and weak people high or in some other way changing

    their standard for judgments. The quality and accuracy of the measurement is improved if

    data from such raters is removed from the analysis. The discovery that 5-10% of raters show

    inconsistency is not unusual. There can be many reasons. Sometimes the “misfit” is

    concentrated in just one or two judgements, and just that unreliable data can be removed.

- Rater Severity

    Differing severity (leniency/strictness) of the raters can be taken into account in calculating

    the “ability estimate” for the learners. Extreme cases can be eliminated if desired, though

    FACETS actually compensates for severity, thus objectifying the judgements. Provided

    lenient or severe raters are being consistent, therefore, they can be safely retained in the analysis.

    In the Sèvres data, once the raters had tuned in, the range of severity they showed was not

    remarkable. There were some differences between individuals, but the program FACETS is

    designed precisely to take this into account in arriving at an objective result.

- Difficult Performances

    Learners who prove difficult to assess because of some extraneous factor can also be

    identified. Lack of “fit” with certain learners can be due to poor audio quality, lack of


    familiarity with learners speaking the mother tongue concerned, bad pronunciation, or some

    other factor in the performance confusing to raters.

    Among the samples shown Luis (a Peruvian paired with Aleksandar) misfitted seriously,

    reflecting people’s extreme difficulty concentrating on what was an idiosyncratic, nervy

    performance with poor elucidation. The only other performances with any noticeable level of

    misfit were in the sequence with Ambrogio and Silvia. This interaction was somewhat more

    intimate than the others and slight misfit on both learners may have been caused by fast

    diction from Ambrogio and some embarrassment from Silvia.

- Identifying Significantly Differing Interpretation Among Groups

    Since the 4 different groups to which participants belong are defined by context and were

    identified as a “facet” in the analysis, it is possible to measure whether any group:

    o tends to rate the performance of any learner significantly differently to the other

    groups, or

    o as a group is significantly more lenient or more strict than the other groups.

    In the event there were no significant differences in these two respects in the interpretations

    of the four different groups of experts. Indeed the only noticeable distinction was that the

    group “French language experts in European projects” tended to be more severe on the

    criterion Correction (accuracy) and more lenient on Interaction, whereas the group “French

    language schools in France” rated more evenly across all the criteria.

- Mathematical scaling to CEF Levels

    The cut-off points on the mathematical scale dividing the CEF levels (CEF Manual Table 6.2

    given as Appendix 10) can be used to “anchor” the analysis to the (logit) scale created in the

    original CEF research project so that a mathematical “ability estimate” is made for each

    learner. This can be useful when discussing whether a learner is, for example, a strong B1, a

    typical B1, a borderline B1, etc.

    There are issues with regard to the best interpretation of the results from a FACETS analysis

    of rating scale data and those who are interested are referred to the report available in English

    by Neil Jones, the ALTE data analyst. The FACETS scaling of the independent judgements

    before discussion produced an illuminating result that is interesting to compare with the result

    based purely on aggregating raters’ opinions.

    In making a final decision for each sample, the FACETS result from individual judgements

    was compared to the consensus opinion established after discussion.
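
The use of cut-off points to turn a logit “ability estimate” into a level such as “strong B1” or “borderline B1” can be sketched as follows. This is an illustration only: the cut-off values below are invented placeholders, not the values from Manual Table 6.2 (Appendix 10), and the rule dividing each band into thirds is an assumption made for this sketch rather than a procedure described in the report.

```python
from bisect import bisect_right

# HYPOTHETICAL lower cut-offs in logits, one per level. These are evenly spaced
# placeholders, NOT the values from the Manual (Appendix 10).
CUTOFFS = [
    (-4.0, "A1"), (-3.0, "A2"), (-2.0, "A2+"), (-1.0, "B1"), (0.0, "B1+"),
    (1.0, "B2"), (2.0, "B2+"), (3.0, "C1"), (4.0, "C2"),
]
THRESHOLDS = [cut for cut, _ in CUTOFFS]


def level_for(logit):
    """Return the level whose band contains the ability estimate."""
    index = bisect_right(THRESHOLDS, logit)
    return "below A1" if index == 0 else CUTOFFS[index - 1][1]


def describe(logit):
    """Qualify the level as borderline, typical or strong.

    Assumed rule: the lowest third of a band is 'borderline', the highest
    third is 'strong'. Open-ended bands (below A1, C2) are not qualified.
    """
    index = bisect_right(THRESHOLDS, logit)
    if index == 0 or index == len(CUTOFFS):
        return level_for(logit)
    lower, level = CUTOFFS[index - 1]
    upper = CUTOFFS[index][0]
    position = (logit - lower) / (upper - lower)
    if position < 1 / 3:
        return "borderline " + level
    if position > 2 / 3:
        return "strong " + level
    return "typical " + level


for estimate in (-0.9, -0.5, -0.1):
    print(estimate, "->", describe(estimate))
```

With real cut-offs the bands need not be of equal width, which is why the position is computed relative to each band rather than as a fixed offset.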

Analysis Result

Inter-rater reliability: A first point to be made is that the ratings show an impressive level of

    inter-rater reliability. The mean inter-rater correlation for the independent judgements before

    discussion was 0.886, and that for the more consensual view after discussion was 0.967. In

    interpreting these high coefficients, it must be borne in mind that the full range of language

    proficiency was being assessed, from below Level A1 to Level C2. It is easier to achieve such

    high correlations in these circumstances. Nevertheless, such inter-rater reliability coefficients


compare well with those reported in the literature for ratings by expert, trained judges across the

    range of proficiency.
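
A mean inter-rater correlation of the kind reported above can be computed as the average Pearson correlation over all pairs of raters. The sketch below shows one way of doing this. The ratings are invented, levels are coded 1 to 9 as an assumption, and the report does not state exactly how its coefficient was calculated.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length lists of ratings."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sum((a - mean_x) ** 2 for a in x)
    spread_y = sum((b - mean_y) ** 2 for b in y)
    return covariance / sqrt(spread_x * spread_y)


def mean_inter_rater_correlation(ratings):
    """Average the Pearson correlation over every pair of raters."""
    pairs = list(combinations(ratings.values(), 2))
    return sum(pearson(x, y) for x, y in pairs) / len(pairs)


# Invented ratings: each list holds one rater's level (coded 1-9, an assumed
# coding) for the same six learners, spanning the whole scale.
ratings = {
    "rater_1": [1, 3, 4, 6, 8, 9],
    "rater_2": [1, 4, 4, 7, 8, 9],
    "rater_3": [2, 3, 5, 6, 7, 9],
}

print(round(mean_inter_rater_correlation(ratings), 3))
```

As the text notes, learners spread across the whole scale make high coefficients easier to reach: raters who disagree by one level still rank the learners in almost the same order.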

Halo Effect: A second point is that there was an extremely high inter-correlation of 0.992 to

    0.998 between the five criteria: Range (Étendue), Accuracy (Correction), Fluency (Aisance),

    Interaction (Interaction) and Coherence (Cohérence). The criteria were clearly not being used in

    an independent fashion. To quote the analysis report: “On this evidence either the rating criteria

    are not distinct, or the raters were not able to identify that they were distinct (=halo effect) or the

    subjects do not display distinct profiles of skill.” This result is not unusual: such criteria give

    raters a focus to help them to arrive at a considered judgement and may not actually represent

    truly independent variables.

Training Effect: The analysis suggests that during the course of the seminar the participants as

    individuals became more consistent in their interpretation of the criteria (CEF descriptors). This

    can be seen in the way that the “fit” statistics for the facet “Day” improved over the three days as

    the participants became more familiar with the criteria and the procedures. Such an effect is to be

    expected in a seminar which had the aim of developing a consensus view.

CEF Levels of Samples: The placement of the samples at CEF levels is best considered by

    comparing the FACETS analysis of the independent judgements before discussion with the

    consensus reached after discussion. The latter cannot properly be subjected to a FACETS

    analysis because the judgements in the data set are not independent and because such an analysis

    cannot handle 100% scores - as happened for Josue and Rachel, for example, who were

    unanimously considered to be C2.

The following table compares the conclusions reached from the two sets of votes, and makes a final recommendation in the column headed “Definitive CEF Level”.


    Definitive CEF Level   No.   Learner           Judgements (FACETS)   Judgements after Discussion

    C2                     33    Josue             C2                    C2
    C2                     32    Rachel            C2                    C2
    C1                     31    Aleksandar DALF   C1                    C1
    C1                     15    Ambrogio          C1                    C1
    C1                     25    Aleksandar        C1                    C1
    B2+                    8     Xi                B2+                   B2+
    B2+                    26    Luis              B2+                   (B2+)
    B2+                    16    Silvia            B2+                   B2+
    B2                     7     Nataliya          B2                    B2
    B1+                    37    Gu Jung           B1+                   B1+
    B1+                    4     Sophie            B1+                   B1+
    B1                     3     Valérie           B1+                   B1
    B1                     13    Evelyne           B1                    B1
    B1                     1     Margarida         B1                    B1
    B1                     2     Mariana           B1                    B1
    (A2+) / B1             14    Andrea            B1                    A2+

