Diatopic, diamesic and diaphasic variations in spoken Italian
Renata Savy*, Francesco Cutugno+
*Department of Linguistics and Literary Studies – University of Salerno
+Department of Physics – NLP Group – University of Naples Federico II
1.1 The framework of Italian spoken language corpora
In recent years Italian linguistics has dedicated an increasing amount of resources to the study of spoken communication, reducing the historical lack of available data for research.
Nevertheless, linguistic research still noticeably lacks basic methodological instruments and specific data to support the study of human languages, particularly as far as the spoken dimension is concerned (McEnery & Wilson, 1996). Among these instruments, speech corpora recorded in many different conditions are of fundamental importance from two main points of view:
a) for describing and understanding how spoken language operates in all its conditions of use;
b) for building tools to be used as a reference base for the development of systems for robust speech recognition and good-quality speech synthesis (Albano Leoni, 2006).
To reach these two closely related aims, an integrated strategy is necessary, one able to satisfy both the needs of basic knowledge and those of application development. One of the basic resources required to carry out this integrated strategy is the production of calibrated and stratified speech corpora, in which different varieties of spoken language along the diamesic, diaphasic, diastratic and diatopic dimensions are present, each in the right proportion to the others. As a matter of fact, natural languages are characterised by a high degree of variability in all their conditions of use (Sobrero, 1993; Berruto, 1995), and it is widely known that this variability appears most clearly in spoken language (Brown, 1990).
Many initiatives to collect spoken Italian corpora of various sizes have started since the early 1980s: Sornicola (1981), Berretta (1985), Voghera (1993), Bazzanella (1994) and many others began collecting their own small to medium-sized datasets and building their own analytic tools, strictly oriented to a specific, and sometimes limited, line of linguistic research. The resulting studies were partial, heterogeneous, occasional and related to limited geographic areas and/or to single, rare or non-representative linguistic phenomena (Sobrero, 1985).
Until the mid-1990s, however, the objective of having a large corpus available that would allow global analyses of the complex reality of spoken Italian and, above all, be representative of its variational aspects remained unmet. Such a corpus must cover a wide and significant range of communicative situations, with regard to phonology, prosody, morphology, syntax and basic lexicon, in order to constitute the starting point for describing the concrete modalities in which communication takes place (Albano Leoni, 2006).
The CLIPS project was funded by the Italian Ministry of Education, University and Research (L.488/92) and coordinated by Prof. Federico Albano Leoni. The corpus is freely available for research and is carefully described in a set of public documents (website: http://www.clips.unina.it).
After 2000, new and larger corpora of spoken Italian were produced, some aimed at specific purposes, such as CiT (Corpus di Italiano Trasmesso, see Spina, 2005) and Lir (Lessico di italiano Radiofonico), and others aimed at representing Italian in a wider perspective (Lablita, see Cresti, 2000; C-ORAL-ROM, see Cresti et al., 2002). Some of them take into account only a few aspects of linguistic variability, mainly diaphasic and in some cases diamesic. In these corpora, with the sole exception of the LIP (Lessico di frequenza dell'italiano parlato, De Mauro et al., 1993), no attention is paid to the dimension of diatopic variation, which is fundamental in the study of any natural language and of Italian in particular.
1.2. The variability problem
A corpus that aims to be truly representative of spoken Italian necessarily has to face the peculiar sociolinguistic situation observed in Italy. Among the various sources of variability naturally encountered in human languages along different dimensions of expression, diatopic variation is, for historical reasons, particularly relevant in Italian: it cannot be neglected and is difficult to represent.
Standard 'Italian' is thus an abstraction built by mixing and combining all the regional varieties (Cortelazzo & Mioni, 1990; Telmon, 1993; 2008; Bruni, 1992; Cortelazzo, 2001), each derived from one or more local Romance dialects, which all together gave rise, through a succession of historical combinations, to the national language (De Mauro, 1972; Lepschy & Lepschy, 1977; Bruni, 1992; Marazzini, 1994; Harris & Vincent, 2001).
On the prescriptive plane, we can consider the Florentine variant of Italian as representative of the linguistic unification as far as the written form (especially literary) and the most formal varieties of spoken language are concerned (De Mauro, 1972; Lepschy & Lepschy, 1977; Bruni, 1992; Marazzini, 1994; Harris & Vincent, 2001). However, the language of everyday communication used in Italy is far from being fully standardised. Moreover, 'Italiano comune' (lit. 'Common Italian', Serianni, 1988) is more stable at some levels of the linguistic structure (morphology and, in part, lexicon) than at others (phonetics, prosody, syntax).
Diatopic variation is, obviously, interwoven with diastratic variation and also with diaphasic variation, while the latter is partly related to the communication medium.
Many studies describe regional varieties of Italian, mainly at the phonetic level, but few of them are based on the systematic analysis of data from spontaneous speech corpora. The need for a corpus of spoken language with a high level of stratification is therefore evident. With this resource available, we could finally count on a reference dataset to be used, as stated above, both for studies on the global description of Italian and its varieties and for studies on speech technologies.
1.3 CLIPS assumptions and goals
The corpus of spoken Italian presented in this paper derives from a project which started in 2000 and concluded at the end of 2006. The project, as its acronym indicates (Corpora e Lessici di Italiano Parlato e Scritto – lit. Corpora and Lexicons of Spoken and Written Italian), aimed at producing linguistic resources for the study and the automatic processing of Italian in both its written and spoken forms. The production of vocabularies extracted from written texts followed specific procedures and criteria significantly different from the ones used to build the corpus of spoken language.
The collection of speech recordings has been driven, since the early stages of the project, by the need to make the corpus as stratified as possible on the diamesic, diaphasic and, moreover, diatopic planes. Diastratic variation, by contrast, is not considered in CLIPS, as it raises issues not taken into account during the development of the project. Similar earlier experiences preceded the collection of CLIPS (see Crocco et al., 2003), constituting a test-bed mainly for data collection and coding (see §§3 and 4). However, these pilot attempts were conducted on a smaller scale and their representativeness with respect to all the dimensions of variation was rather limited. In this view, then, CLIPS represents the first and most complete stratified corpus of spoken Italian, as will be shown in the next sections.
It is important to stress that, among the main aims of this corpus, particular relevance is given to the study and description of the phonetic and phonological levels of the varieties of Italian (and of the related applicative implications). Only later was an attempt made to extend it to other levels of analysis (see §5). Some of the balancing choices described further on (§3.1) must be interpreted in this light, as they can limit the use of the corpus for some specific research aims (e.g. lexical statistics, studies on morpho-syntax, etc.).
2. CLIPS stratification
An overall portrait of the layered structure of the CLIPS corpus is depicted in Table 1.
Diaphasic/diamesic | Dialogic (elicited) | Read speech | Radio and TV | Telephonic | Ortho-phonic
Diatopic | 15 regional varieties | 15 regional varieties | 15 regional varieties | 15 regional varieties | standard
Textual | map-task; spot the difference | read sentences; word list | broadcast; talk show; commercials; culture | Auto; WoZ | read sentences
Table 1. Corpus stratification.
In the following sections a more detailed explanation of these structures will be given, but a complete description of all the project aspects can be found in the website documentation.
2.1 Diamesic/diaphasic stratification
We discuss these two dimensions together as they are, as already said above, closely related and partly interdependent. As far as the diamesic dimension is concerned, CLIPS is articulated into four varieties:
a) free field recordings;
b) radio recordings;
c) television recordings;
d) telephonic conversations.
Diaphasic variation determines a sort of internal articulation in every diamesically determined sub-corpus.
The 'free-field' corpus consists of a collection of elicited and (semi-)spontaneous dialogues (with a low level of formality) and of read speech (further subdivided into readings of lists of isolated words and lists of sentences).
Both the radio and television speech sub-corpora present a wide differentiation in their textual typologies (see next sections) which can lead, in some cases, to a further internal diaphasic articulation.
Radio and television spoken language presents traces of a textual organisation recalling that of written language (as can frequently be seen in news reading); however, informal conversations are not rare, especially in live programs, even in comparison to other media. Consequently, a wide range of styles is available in radio and TV speech, from read or read/acted speech and interview-based dialogues to multi-speaker talk shows and debates without control of turn-taking. A parallel comparison by textual typology shows that the radio and television corpora present the same diaphasic varieties.
The two recording types available in the telephonic sub-corpus (see §2.4 for their description) cannot be properly situated along the diaphasic continuum. In the first type, speakers produce a sort of guided, non-read monologue: this kind of speech is characterised by a low degree of spontaneity and by a fairly high level of formality. In the second, a quasi-natural dialogue takes place, in which the speaker interacts with a synthetic voice that answers slowly and not always coherently. We can probably consider this condition more spontaneous than the former, and somewhat less formal, but a precise distinction is problematic.
Finally, it is very difficult to position the ortho-phonic corpus along the variational continuum: in principle it should be considered a diaphasic (read) variety of 'free field' speech, obtained in highly controlled laboratory conditions (anechoic chamber, high-quality recording devices) with highly skilled speakers (actors or professional operators). However, these factors strongly determine the nature and type of the speech produced, resulting in the emergence of a peculiar diamesic variety.
2.2 Diatopic stratification
Collection sites were chosen according to the results of detailed socio-economic, geo-linguistic and sociolinguistic analyses, which led to the choice of 15 locations representative of 15 diatopic varieties of Italian.
Many socioeconomic criteria could have been used to perform this choice; we selected the following ones as the most pertinent for our aims:
a) development indexes (average income, unemployment rates,
b) availability of infrastructural endowment (public transports, communications, energy,
c) demographic dynamics;
d) social organisation of the cities, in relation to the number of inhabitants per site.
We carried out a preliminary selection of the most representative sites in the Italian territory. This procedure led to about 30 main Italian towns, 15 of which, mainly located in the north of the country, showed the highest levels of socioeconomic welfare, even if in many cases these towns had lower rates of demographic dynamics and fewer inhabitants.
At the same time, some important geo-linguistic constraints were taken into consideration to respect the complex Italian situation. We guaranteed the representation of the seven variants of Italian normally encountered in our country, assigning a number of sites to each linguistic area in proportion to the economic criteria listed above.
This led us to the following final selection of cities, listed by geo-linguistic area:
1) gallo-italica (Gallo-italic, Torino, Genova, Milano, Bergamo, Parma);
2) veneta (Veneto, Venezia);
3) toscana (Tuscan, Firenze);
4) mediana (median, Roma, Perugia);
5) meridionale (southern, Napoli, Bari);
6) meridionale estrema (extreme southern, Catanzaro, Lecce, Palermo);
7) sarda (Sardinian, Cagliari).
Figure 1: Map of the Italian geo-linguistic areas with the indication of the chosen collection sites.
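The site selection listed above can be written down as a small lookup structure, which is convenient when recordings have to be grouped by geo-linguistic area. The following is only a minimal sketch: the names `SITES_BY_AREA` and `area_of` are illustrative and are not part of the CLIPS distribution.

```python
# Geo-linguistic areas and collection sites, exactly as listed in the text.
# The dictionary and function names are hypothetical, chosen for illustration.
SITES_BY_AREA = {
    "gallo-italica": ["Torino", "Genova", "Milano", "Bergamo", "Parma"],
    "veneta": ["Venezia"],
    "toscana": ["Firenze"],
    "mediana": ["Roma", "Perugia"],
    "meridionale": ["Napoli", "Bari"],
    "meridionale estrema": ["Catanzaro", "Lecce", "Palermo"],
    "sarda": ["Cagliari"],
}


def area_of(site):
    """Return the geo-linguistic area a collection site belongs to."""
    for area, sites in SITES_BY_AREA.items():
        if site in sites:
            return area
    raise KeyError(f"unknown collection site: {site}")


# Flat list of all collection sites, in the order of the list above.
all_sites = [site for sites in SITES_BY_AREA.values() for site in sites]

print(len(SITES_BY_AREA), "areas,", len(all_sites), "sites")
print("Lecce ->", area_of("Lecce"))
```

The check that seven areas yield fifteen distinct sites mirrors the figures given in the text.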
All the corpus sub-sections, with the exception of the ortho-phonic one, were collected in the localities listed above. Dialogues and read speech recordings were produced directly on site, usually with the collaboration of universities and research centres. For telephonic speech, a service company hired speakers in the 15 cities and asked them to phone, over a conventional analogue line, a single call centre where all the calls were stored. The radio and TV corpus section is structured on the diatopic plane through the selection of local and regional broadcast services. In this case we chose to add, as a reference and in an appropriately proportioned size, a quota of recordings from national (both public and private) networks.
2.3 Speaker selection
As is well known, within a given city many sociolinguistic factors can influence the structure of the spoken variant, such as: the size of the city and its number of inhabitants; the intensity of migration flows and the movements of outliers, from and to other linguistic areas; the number of disadvantaged suburbs; the number of foreign people living in the sites. In some cases an indirect measure of sociolinguistic variability can be derived from the analysis of specific indicators such as the quality and coverage of public transportation, the number of school and university sites available, the number of private cars and data on urban car traffic, and data on micro-economic development given by tertiary activities. The analysis of publicly available data on these aspects for the 15 chosen sites showed a very complex situation, with a high degree of differentiation both within and between localities. It was really difficult to define the criteria for selecting speaker characteristics. Consequently, in order to minimise the risk of interference introduced by these (and other, unlisted) uncontrolled variables and, at the same time, to ensure that the collection of all the recordings would proceed without [...], we decided to select a sample of speakers that would be homogeneous with respect to some fundamental variables such as average age, socio-economic status, education level, residence in town, etc. On this basis, for the dialogue, read speech and telephonic sections of the corpus, we chose undergraduate students aged between 18 and 30 who, like their parents, had always lived in the urban area of the selected sites. Males and females are, on average, equally distributed in the overall corpus population.
2.4 Textual typologies
Each sub-corpus of CLIPS presents a variety of textual typologies, chosen with the aim of differentiating as much as possible the communicative contexts and, consequently, the type of productions in terms of stylistic profile and linguistic register. This differentiation increases the corpus stratification on the diaphasic plane, adding, at the same time, some particular features to the linguistic structures of the produced texts. This is particularly evident in dialogic, radio-television and telephonic speech.
The dialogic corpus contains two types of texts, elicited using two different techniques.
The first one, the map task (mt), developed at the HCRC in Edinburgh (Anderson et al., 1992; Carletta et al., 1996), has been widely used in many international projects (e.g., AEMT, DCIEM, DMTC, JMTD, ANDOLS, IViE, SMTC, AVIP-API). Utterances produced with this method present a certain degree of spontaneity and little control as far as phonetics and prosody are concerned; on the pragmatic plane, by contrast, they follow rigid schemas imposed by the nature of the task to be accomplished and by the pre-assigned speaker roles (Giver/Follower).
The second type of dialogic text is obtained using the 'Spot-the-difference' task (sd), which allows greater freedom of interaction and, consequently, greater variability in the conversational schemas (Pean et al., 1993): speakers alternate in turn-taking more freely than in the former type of dialogue, producing almost spontaneous conversations (even if with some limitations) comparable to those of everyday life.
Both types of text are, however, limited as far as syntactic structure is concerned: syntactic variety is reduced by the predominance of the question/answer structure, by the lexical choices imposed by the task features (paths on the map route, map structure, drawings) and by the pre-defined referents (objects in the maps and in the drawings).
The radio and television corpus is articulated into four typologies:
a) dissemination and culture (dc), including mainly documentaries or educational and scientific programs, mostly structured as monologues, very often, but not always, with texts read by the speaker;
b) information/news and service (is), consisting of bulletins, including sport news, delivered by speakers and professional anchormen, which can also contain spontaneous interviews with members of the public;
c) commercials (pb), mainly consisting of acted utterances;
d) entertainment (it), including a wide variety of live shows with guests and audience, ranging from talk shows to quizzes and debates, all with almost free turn-taking.
Finally, two different modalities were chosen for the telephonic corpus: in the first, speakers were instructed to call the desk and act as a customer complaining about a specific service or making a particular request in a single utterance, without any interaction with the desk; in the second, interaction between the two parties was guaranteed by means of the Wizard of Oz method (WoZ), in which a synthetic voice producing sentences manually chosen by a human operator induces the speaker/customer to interact with the concierge.
3. CLIPS collection
So far, we have stated that our corpus was built so that a precise set of variational dimensions would be appropriately represented in the dataset. Many details of the corpus collection must be carefully evaluated in order to avoid misjudged workloads and unsustainable goals. Furthermore, the collected corpus must be internally coherent as far as dimensional balancing is concerned, and must be correctly subdivided into sub-parts whose relative sizes are representative of the different sources of variation. In this section we describe how balancing and sizing were achieved. Although it will not appear explicitly, it is important to stress that this control was performed twice: during planning and at the end of the dataset creation phase, that is, in practice, immediately before the conversion of the dataset into a proper relational database.
3.1. Corpus representativeness and balance
The complex articulation of CLIPS is reflected even in the symmetry and in the accurate balancing of the data in relation to a set of variables. Different solutions were found to optimise the balancing of the corpus depending on the main characteristics of the various sections composing the dataset.
As already stated above, dialogues were elicited by means of two different techniques, the map task (mt) and the spot-the-difference (sd) task. For the two tasks we prepared two different sets of maps (A-B) and two of drawings (A-B), respectively. At each collection site we have 4 dialogues for each set, leading to a total of 16 dialogues per site (total = 240 dialogues).
Speaker sex is, on average, equally distributed at each site, yielding a final distribution of 51.2% female and 48.8% male speakers.
Map A | Map B | Drawing A | Drawing B | Females | Males
60 | 60 | 60 | 60 | 124 | 116
Table 2: dialogic sub-corpus, number of speakers for: mt map type, sd drawing type and sex subdivision.
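The dialogue counts given above follow directly from the design (two tasks, two stimulus sets per task, four dialogues per set, fifteen sites). The arithmetic can be sketched as follows; the variable names are illustrative assumptions, not identifiers used by the project.

```python
# Design parameters as stated in the text; names are hypothetical.
N_SITES = 15
TASKS = {
    "mt": ["Map A", "Map B"],          # map task
    "sd": ["Drawing A", "Drawing B"],  # spot-the-difference task
}
DIALOGUES_PER_SET = 4

# One entry per (task, stimulus set) pair: 4 pairs in total.
stimulus_sets = [(task, s) for task, sets in TASKS.items() for s in sets]

per_site = len(stimulus_sets) * DIALOGUES_PER_SET   # dialogues at one site
total = per_site * N_SITES                          # dialogues in the sub-corpus
per_set_total = DIALOGUES_PER_SET * N_SITES         # dialogues for one stimulus set

print(per_site, total, per_set_total)
```

The per-set figure corresponds to the value reported for each map and drawing type in Table 2.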
Both maps and spot-the-difference drawings contain objects frequently named during the dialogues: all the words denoting the named entities in the drawings were chosen among the top-ranking words in the most important Italian frequency lexicons: LIP (Lessico di frequenza dell'italiano parlato, De Mauro et al., 1993); LIF (Lessico di frequenza della lingua italiana contemporanea, Bortolini et al., 1972); VELI (Vocabolario Elettronico della Lingua Italiana, various authors, 1989); Lessico elementare (Marconi et al., 1994).
The duration of each map-task dialogue was not fixed in advance; we selected only recordings in the range of 10-18 minutes. In the spot-the-difference task, speakers were asked to conclude the interaction after exactly 10 minutes.
3.1.3. Read speech
The same speakers involved in the dialogues read word lists (Wl) and sentence lists (Sl). Word lists reproduced the named entities in the two types of drawings.
Similarly to the selection of object words in the previous case, sentences were obtained by selecting nouns and verbs with high frequency scores in Italian. The list of high-frequency words was again obtained by consulting the Italian frequency lexicons cited above. Words for lists and sentences were chosen among the 240 most used Italian lemmas; dispersion measurements and usage index were considered.
Syntactical structures present a certain degree of variability (mono- and multi- clausal sentences, verbal vs. nominal clauses, dislocations etcetera) as typically happens in conversational speech and this leads to a correspondent variety of intonational patterns.
Speaker sex balancing has been provided only for dialogues, read and ortho-phonic speech, i.e. only for the corpus sections in which speakers were selected a priori.
3.1.4. Radio and TV broadcasting
In this case, balancing involves the relative amounts of recordings with respect to the ratio of radio vs. TV and of national vs. local broadcasting, the evaluation of the most representative typologies of programs, and the resulting internal articulation of the corpus along this dimension. We decided to assign an equal amount of recording time to radio and television. The multidimensional balancing process for RTV material was based on the analysis of audit data and socioeconomic indexes similar to those used for the selection of collection sites.
The distribution between national and local recordings was fixed at about 20% national and 80% local, for both radio and TV. Local recordings are equally distributed over the 15 collection sites, so that, with respect to each single site, the ratio between national and local broadcasting is around 3:1 in favour of the national programs. As shown in Table 3, a correct proportion among the various typologies is preserved within the internal articulation.
Typologies | Local radio recs. for 15 sites | Local TV recs. for 15 sites | National radio recs. | National TV recs.
Entertainment | 15' | 15' | 50' | 50'
Information/Service | 5' | 5' | 15' | 15'
Dissemination/Culture | 2' | 2' | 15' | 15'
Commercials | 3' | 3' | 10' | 10'
TOTAL | 6h 15' | 6h 15' | 1h 30' | 1h 30'
Table 3. Subdivisions of the radio and TV corpus with reference to the different textual typologies
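The totals in Table 3 can be cross-checked with a few lines of arithmetic. The sketch below assumes that the local figures are durations per site, to be multiplied by the 15 sites; this reading is an assumption, although it is the one under which the TOTAL row adds up. All names are illustrative.

```python
N_SITES = 15

# Minutes per typology, in the column order of Table 3:
# (local radio, local TV, national radio, national TV).
# Assumption: the two local columns are per-site durations.
MINUTES = {
    "Entertainment": (15, 15, 50, 50),
    "Information/Service": (5, 5, 15, 15),
    "Dissemination/Culture": (2, 2, 15, 15),
    "Commercials": (3, 3, 10, 10),
}


def fmt(minutes):
    """Format a duration in minutes in the style used by the tables."""
    hours, mins = divmod(minutes, 60)
    return f"{hours}h {mins:02d}'"


local_radio_per_site = sum(row[0] for row in MINUTES.values())
local_tv_per_site = sum(row[1] for row in MINUTES.values())
national_radio = sum(row[2] for row in MINUTES.values())
national_tv = sum(row[3] for row in MINUTES.values())

local_radio = N_SITES * local_radio_per_site
local_tv = N_SITES * local_tv_per_site

# Share of local material over the whole radio and TV sub-corpus.
local_share = (local_radio + local_tv) / (
    local_radio + local_tv + national_radio + national_tv
)

print(fmt(local_radio), fmt(local_tv), fmt(national_radio), fmt(national_tv))
print(f"local share: {local_share:.1%}")
print(f"national vs. single-site local: {national_radio / local_radio_per_site:.1f}:1")
```

Under this reading, local material accounts for roughly four fifths of the sub-corpus, while for any single site the national recordings outweigh the local ones.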
3.1.5. Telephonic speech
As already stated (in §2.4), the semantic domain chosen for the telephone recordings was the simulation of two types of interaction between a hotel concierge and a customer in his/her room (single utterance and WoZ).
We initially decided to distribute the recordings across the two types of interaction according to a subdivision of 15% for short monologues vs. 85% for WoZ, while the final balance on the effectively collected data results in 13% vs. 87%.
Speakers called a multiline telephone call centre at Naples University from all the 15 sites. Each site provided 20 speakers and each speaker made 10 calls, for a total of 3000 calls.
3.1.6. Ortho-phonic speech
This corpus section has two main aims: the first is to provide a reference for Italian pronunciation, and the second is to build a dataset that could, in the future, be used for diphone or corpus-based automatic speech synthesis.
Five male and five female speakers, professionally qualified, recorded a list of 120 sentences (Bl) plus three repetitions of the same 20 sentences used in the read speech section (Sl). In this case, balancing consisted in ensuring that all the two- and three-phone clusters available in Italian occur at least once in the list.
3.1.7. Corpus dimension
CLIPS consists of about 100 hours of audio recordings, partitioned as shown in table 4:
The dialogue section is the most substantial one and forms approximately 50% of the entire corpus; the remaining sections cover about 16 hours of recording each, with the exception of the ortho-phonic corpus, which consists of less than 4 hours.
Corpus | Dialogic | Read speech | Radio and TV | Telephonic | Ortho-phonic | TOTAL
sub-parts | mt + sd | Sl + Wl | RD + TV | Auto + WoZ | Bl + Sl |
units | 120+120 (n. dialogs) | 90+180 (n. lists) | 333+240 (n. recordings) | 1077+7628 (n. turns) | 1200+600 (n. lists) | various
time | 48h 14' | 16h 21' | 16h 38' | 16h 42' | 3h 42' | 101h 37'
Table 4. Size and duration of the 6 main sub-parts of CLIPS.
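The durations in Table 4 can be summed to check the overall total and the weight of the dialogic section. This is a minimal sketch; the dictionary and helper names are illustrative.

```python
# Durations from Table 4 as (hours, minutes); names are hypothetical.
DURATIONS = {
    "Dialogic": (48, 14),
    "Read speech": (16, 21),
    "Radio and TV": (16, 38),
    "Telephonic": (16, 42),
    "Ortho-phonic": (3, 42),
}


def to_minutes(hours, minutes):
    """Convert an (hours, minutes) pair to a number of minutes."""
    return 60 * hours + minutes


total_minutes = sum(to_minutes(h, m) for h, m in DURATIONS.values())
total_h, total_m = divmod(total_minutes, 60)

# Fraction of the whole corpus covered by the dialogic section.
dialogic_share = to_minutes(*DURATIONS["Dialogic"]) / total_minutes

print(f"total: {total_h}h {total_m:02d}'")
print(f"dialogic share: {dialogic_share:.1%}")
```

The sum reproduces the TOTAL column of the table, and the dialogic share comes out slightly below one half, in line with the "approximately 50%" stated in the text.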
Two years of intensive work by a group of about 15 phoneticians led to the transcription of a portion of the corpus ranging from 30% of the recordings in the dialogue section to 100% of the telephonic speech recordings. In the same period the staff performed an accurate manual phonetic labelling of part of the transcribed material (for details about labelling see §4.3). Table 5 summarises these results.
 | Dialogic | Read speech | Radio and TV | Telephonic | Ortho-phonic
Transcribed (%) | 30% | 30% | 30% | 100% | 100%
Transcribed (time) | 15h 30' | 5h 20' | 4h 30' | 16h 40' | 3h 40'
Labelled (%) | 10% | 10% | 10% | 3.5% | 16%
Labelled (time) | 5h 30' | 1h 40' | 1h 10' | 35' | 35'
Table 5. Percentages of transcribed and labelled material for CLIPS sub-parts
An a posteriori count of the number of words, speakers and data files, plus a final evaluation of the mass storage size, was performed (see Table 6).
n. of words | n. of speakers | n. of files | Storage
~1M words | ~1000 | ~130,000 | ~22 Gb
Table 6. Other CLIPS dimensions
With reference to table 6, further considerations are required.
As already stated, a significant part of the recordings available on the CLIPS website has never been transcribed. This means that we cannot estimate the actual number of words contained in the whole recording set; the cited value only estimates what was processed in some way during the project.
The speaker count gives an exact figure of 550 for dialogues, read, ortho-phonic and telephonic speech, added to an approximate estimate of 450 participants in the radio and television sub-part.
4. CLIPS coding
The coding phase, as always happens in the process of corpus construction, is the most problematic, as it requires a careful evaluation of at least three factors (see Sinclair 2005):
1) compliance with accepted international standards that can ensure interchange and mutual usability with other similar corpora;
2) the analysis of the main linguistic and applicative aims that the corpus should consent to
3) the capacity to respond to constraints and particular features that the project imposes.
These considerations influenced the definition of directives and coding specifications that are clearly described in all of their phases in the project documentation available on the website. The adopted norms attempt to satisfy two fundamental requirements: homogeneity (or compatibility) with other speech corpora (mainly in the European environment) and adequacy with the objectives proposed during the planning of the project for the CLIPS corpus.
As we shall see, variability, particularly the diatopic one, was constantly taken into account during the coding procedure definition.
4.1. Standards and protocols
During the realisation of CLIPS, each planned phase was developed taking into account the available directives and the suggestions of international standardisation initiatives. In particular, after a detailed survey of the main international standards, we chose to fully adhere to the EAGLES SLWG recommendations (Gibbon et al., 1997), which provide a wide set of directives for corpus-based design in natural language processing applications. EAGLES provides proposals and procedures for linguistic corpora intended for use in language engineering and basic research and, furthermore, a widely accepted set of procedures and workflows for corpus-based work in basic linguistics. It also provides encoding conventions for linguistic annotation, as well as general suggestions on the structure of the datasets used to represent corpora annotated for linguistic purposes.
4.1.1. Levels of coding
As far as the construction of datasets oriented to the phonetic/linguistic study of spoken language is concerned, many proposals have been formulated in various European research projects regarding the possible list of levels of transcription and labelling. A first proposal, made within the ESPRIT-SAM project (Speech Assessment Methods, Fourcin et al., 1989), indicates five possible levels of phonetic transcription/labelling (see Barry & Fourcin, 1992):
1) physical level, where acoustic features in the speech signal are labelled;
2) acoustic/phonetic level, where speech segments are associated with phonetic categories like occlusion, friction, voicing, nasalisation;