A negative selection approach for automatic speaker identification

By Tommy Alexander,2014-11-25 18:26

    K. M. Faraoun 1*, A. Boukelif 2

    1. Evolutionary Engineering and Distributed Information Systems Laboratory (EEDIS), Département d’informatique, Djillali Liabès University, SBA, Algeria

    2. Communication Networks, Architectures and Multimedia Laboratory, Département d’électronique, Djillali Liabès University, University of S.B.A., Algeria

    Abstract. This paper explores the potential of artificial immune systems, in particular the negative selection algorithm, applied to the problem of speaker recognition. Two representations are analysed: a binary representation of the original signal, and a real-number representation of its Fast Fourier Transform. A number of experiments are performed on different datasets to examine how performance evolves with respect to the different system parameters. Substantial enhancements of the system's capabilities are found to be possible by exploiting the Fast Fourier Transform. The results show that the proposed approach gives acceptable results compared to existing techniques, while using a simple detection algorithm.

Keywords: Speaker recognition, Artificial immune system, Negative selection

    1. Introduction

    Speaker recognition is a biometric-based technology (technology that verifies or identifies individuals by analyzing a facet of their physiology and/or behavior) that refers to automatic voice detection technologies, including speaker identification and speaker verification.

    Speaker identification is the process of finding the identity of an unknown speaker by comparing the voice of that unknown speaker with voices in a database of speakers. It entails a one-to-many comparison. Speaker verification is the process of determining whether a person is who she/he claims to be. It entails a one-to-one comparison between a newly input voiceprint (by the claimant) and the voiceprint for the claimed identity that is stored in the system.

    ___________________________________________________________________________
    * Corresponding author: Faraoun Kamel Mohamed
    Tel: (+213) 75 32 36 50
    Fax: (+213) 48 57 77 50
    Address: 15, rue Oualhaci Mokhtar - Sidi bel abbès -22000 -Algérie

    Speaker recognition systems have a large set of applications in everyday life: Time and Attendance Systems, Access Control Systems, Telephone-Banking, Biometric Login to telephone aided shopping systems, Information and Reservation Services, Security control for confidential information and Forensic purposes.

    All speaker recognition systems contain two main modules: feature extraction and feature matching. Feature extraction is the process that extracts a small amount of data from the voice signal that can later be used to represent each speaker [1]. Feature matching involves the actual procedure to identify the unknown speaker by comparing extracted features from his/her voice input with the ones from a set of known speakers. Speaker recognition is a difficult task and is still an active research area. In practice, the task is made challenging by the high variability of input speech signals: a speech signal reflects the presence and type of speech pathologies and the physical and emotional state of the speaker, and it can also be contaminated by the acoustic noise and environment in which the recording is made. Humans are often able to extract the identity information when the speech comes from a speaker they are acquainted with. In automatic verification, however, a routine must be developed to accommodate this kind of parasitic alteration.

    As a result, different kinds of speaker recognition systems, tools and methods have been built on techniques such as:

    - Neural network learning [3];

    - The Bayesian Maximum A Posteriori (MAP) adaptation method [4];

    - Statistical analysis and vector quantization [5];

    - Gaussian mixture models (GMM) [6];

    - Hidden Markov models (HMM) [7].

    In this work, an attempt is made to show how the negative selection algorithm (an algorithm based on artificial immunology) can be used to build an efficient speaker recognition system. Details of the method developed in the present work and descriptions of the experiments conducted are presented in the following sections. The paper concludes with a summary of the most important points.

    2. The speaker recognition

    A speech signal is a very complex function of the speaker and his/her environment that can be captured easily with a standard microphone. Each voice signal is represented as a waveform. After its acquisition by a microphone, a sound is converted to electrical current. Continuous oscillations of air pressure become continuous oscillations of voltage in an electrical circuit. This fast-changing voltage is then converted into a series of numbers by a digitizer. A digitizer acts like a very fast digital voltmeter. It makes thousands of measurements per second. Each measurement results in a number that can be stored digitally (that is, only a finite number of significant digits of this number are recorded). This number is called a sample, and the whole conversion of sound to a series of numbers is called sampling. The result of the sampling operation is a vector of numbers, which represents the voice signal waveform. The range of the numbers depends on the sample resolution (16-bit or 8-bit in our work). Any speaker recognition system uses the obtained vector to extract different features and characteristics of the voice signal, and to perform any kind of analysis [2].
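    The digitization step described above can be illustrated with a minimal Python sketch. It is only an illustration: the function name `sample_tone`, the pure sine tone and the 8000 Hz rate are assumptions chosen for the example, not values taken from this work.

```python
import math

def sample_tone(freq_hz, rate_hz, n_samples, bits):
    """Sample a sine tone and quantize each measurement to a signed integer.

    Hypothetical helper: a real recording would come from a microphone
    and an analog-to-digital converter rather than from math.sin.
    """
    max_level = 2 ** (bits - 1) - 1           # 127 for 8-bit, 32767 for 16-bit
    samples = []
    for k in range(n_samples):
        t = k / rate_hz                        # time of the k-th measurement
        value = math.sin(2 * math.pi * freq_hz * t)   # voltage in [-1, 1]
        samples.append(round(value * max_level))      # keep finite precision
    return samples

tone8 = sample_tone(440, 8000, 16, bits=8)     # 8-bit samples
tone16 = sample_tone(440, 8000, 16, bits=16)   # 16-bit samples
```

    The same tone yields the same number of samples in both cases; only the range of the stored integers changes with the resolution.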

    Speaker recognition systems are classified as text-dependent (fixed-text) or text-independent (free-text). Text-dependent systems require the user to re-pronounce some specified utterances, usually containing the same text as the training data, such as passwords, card numbers or PIN codes. There is no such constraint in text-independent systems, where speaker models capture characteristics of somebody’s speech that show up irrespective of what is being said.

    In a text-dependent system, prior knowledge of the words or word sequence can be exploited to improve performance.

    The representation used is also a very important element of the recognition system. Some techniques use the initial form of the voice signal (obtained directly from the sampling phase). Others employ a transformed form of the signal, mainly via the Fast Fourier Transform (FFT). The FFT makes it possible to work in the frequency domain, and hence to use the frequency spectrum of the voice signal instead of the waveform. This form gives more information about the signal structure and can be more characteristic of each speaker.

    Some other applications use MFCCs (Mel-Frequency Cepstrum Coefficients) [8]. MFCCs are based on the known variation of the human ear’s critical bandwidths with frequency. Their main role is to construct a voiceprint for each speaker, based on the characteristics of his/her voice signal. This voiceprint is then used by the identification process. Only voiceprints are stored in the database, instead of the whole voice signal. When a new voice is recorded, its voiceprint is extracted and then compared to all the voiceprints in the database until the speaker is identified or rejected. MFCC voiceprints combined with neural network techniques have been used to build a robust and flexible speaker recognition system [9]. In the present work, a new form of voiceprint is created for each speaker, using the negative selection algorithm. This algorithm is one of the main components of artificial immune systems.

    3. The artificial immune systems

    Artificial immune systems (AIS) are adaptive systems inspired by theoretical immunology and observed immune functions, principles and models. They form the basis of solutions for various real world problems, in particular, intrusion detection.

    The natural immune system is a network of cells, tissues, and organs that work together to defend the body against attacks by “foreign” invaders. Any artificial immune system must therefore provide a model for each element and each mechanism it draws on. Mechanisms are implemented as algorithms, while elements (cells, tissues...) are represented by binary strings or real vectors (depending on the problem definition). There are various mechanisms in the artificial immune system, such as clonal selection, affinity maturation, somatic hyper-mutation, receptor editing and negative selection.

    4. The negative selection (NS) algorithm

    A number of works have used the negative selection process to build artificial immune systems for virus detection [10], computer security [11], hardware fault-tolerant systems [12] and time series analysis [13]. The original work by Forrest, Perelson et al. in 1994 [14], in which the negative selection algorithm was proposed, has been a starting point for almost all AIS research related to computer security. The negative selection algorithm is inspired by the maturation of T-cells in the thymus gland [15]. The algorithm consists of two stages: censoring and monitoring. The censoring phase caters for the generation of change-detectors. Subsequently, the system being protected is monitored for changes using the detectors generated in the censoring stage. The basic principle of a negative selection algorithm is as follows:

    - Define self as a multi-set N_S of strings of length l over a finite alphabet, a collection that we wish to protect or monitor. For example, N_S may be a segmented file, or a normal pattern of activity of some system or process.

    - Generate a set N_R of detectors, each of which fails to match any string in N_S. A partial matching rule is used to compare the strings.

    - Monitor N_S for changes by continually matching the detectors against N_S. If any detector ever matches, a change (or deviation) must have occurred.

    Matching between detectors and self-strings is done via a matching rule which indicates, for any two strings of the same length l over the same alphabet, whether they match or not. Obvious approximate matching rules include the Hamming distance and the Euclidean distance, but the rule most widely adopted at present, and the most plausible one from an immunological point of view, is the so-called r-contiguous bits rule [15]: two strings match if they agree in at least r contiguous positions. The parameter r is the threshold of the matching rule that determines the specificity of the detector. It is an indication of the size of the subset of strings that a single detector can match. If r = l, the matching is completely specific, that is, the detector will match only a single string (itself); but if r = 0, the matching is completely general, that is, the detector will match every string of length l.
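    The r-contiguous bits rule can be sketched in a few lines of Python. This is a minimal illustration of the rule as stated above; the function name `r_contiguous_match` and the representation of strings as Python `str` values of '0' and '1' characters are assumptions made for the example.

```python
def r_contiguous_match(a, b, r):
    """Return True if bit-strings a and b agree in at least r contiguous positions."""
    if len(a) != len(b):
        raise ValueError("strings must have the same length")
    if r <= 0:
        return True               # r = 0: completely general, everything matches
    run = 0                       # length of the current run of agreeing positions
    for x, y in zip(a, b):
        run = run + 1 if x == y else 0
        if run >= r:
            return True
    return False

# The two strings below agree on positions 1 to 5 (a run of five positions).
print(r_contiguous_match("10110100", "00110111", 5))   # True
print(r_contiguous_match("10110100", "00110111", 6))   # False
```

    With r equal to the string length, only an identical string matches, which corresponds to the completely specific case described above.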

    5. Proposed approaches: NS algorithm for speaker recognition

    The main goal in almost all applications of the negative selection algorithm is to detect abnormal deviations from normal behaviour. The algorithm generates detectors from a segmented version of the original data (the self set N_S), whose representation differs from one application to another. In general, a binary representation is used to encode the self-elements. The elements all have the same fixed length (which can be a parameter of the algorithm).

    Accordingly, the negative selection algorithm is used in the present work to construct a set of detectors for a given speaker's voice signal. The generated detectors are then used as a voiceprint to monitor newly acquired voice signals (identification phase). If the signal is produced by the same speaker, its form and data distribution must be very similar to those of the original signal (used to generate the detectors), so a very low anomaly rate will be obtained (zero in the ideal case).
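    The censoring and monitoring stages, applied to binary self-elements, can be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation used in this work: the names `generate_detectors` and `count_anomalies`, the `max_tries` safeguard, the fixed random seed and the toy self set are all hypothetical.

```python
import random

def matches(a, b, r):
    """r-contiguous rule: True if a and b agree in at least r contiguous positions."""
    run = 0
    for x, y in zip(a, b):
        run = run + 1 if x == y else 0
        if run >= r:
            return True
    return False

def generate_detectors(self_set, n_detectors, length, r, rng, max_tries=100000):
    """Censoring: keep random candidate strings that match no self string."""
    detectors = []
    tries = 0
    while len(detectors) < n_detectors and tries < max_tries:
        tries += 1
        candidate = "".join(rng.choice("01") for _ in range(length))
        if not any(matches(candidate, s, r) for s in self_set):
            detectors.append(candidate)
    return detectors

def count_anomalies(detectors, strings, r):
    """Monitoring: count the strings matched by at least one detector."""
    return sum(1 for s in strings if any(matches(d, s, r) for d in detectors))

rng = random.Random(0)                       # fixed seed, for reproducibility only
self_set = ["0000000000000000", "0000111100001111", "0101010101010101"]
voiceprint = generate_detectors(self_set, 20, 16, 8, rng)

# By construction, the self strings themselves raise no anomaly.
print(count_anomalies(voiceprint, self_set, 8))   # 0
```

    The detector set plays the role of the voiceprint: a signal close to the one used during censoring should trigger few detectors, while a different signal is expected to trigger more.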

    According to the obtained anomaly rate, the automatic speaker recognition system decides on the identity of the new voice's speaker. To achieve that, a database of voiceprints of different speakers is used. A voiceprint is given by the set of detectors obtained when applying the negative selection algorithm to the corresponding voice signal (the learning phase). If the lowest obtained anomaly rate is higher than a fixed detection threshold (a parameter of the system that determines the highest accepted value of the detected anomaly rate), the system decides that this voice signal does not belong to any speaker in the database, and so the speaker is not identified. Otherwise, the speaker is identified as the one with the lowest anomaly rate (lower than the threshold). Fig. 1 summarizes how the recognition system operates. The anomaly rate of each new speaker voice with respect to a given detector set (voiceprint) is computed using the following expression:

    $$\text{Anomaly rate} = \frac{\text{number of detected anomalies}}{\text{number of self elements}} \qquad (1)$$

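    [Figure 1 (flowchart). Learning phase: the voice signals of the known speakers (Voice 1, Voice 2, Voice 3) go through binary/float codification and detector generation, and the resulting detectors are stored in the detectors database. Identification phase: the detectors are used as voiceprints to check a new speaker voice; if the anomaly rate is below the detection threshold, the speaker is identified, otherwise the speaker is not identified.]

    Figure 1. Speaker recognition system based on the negative selection algorithm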
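    The decision rule of the identification phase can be sketched as follows. The sketch assumes that the anomaly rate of the new signal has already been computed against each stored voiceprint; the function names, the dictionary layout and the speaker labels are hypothetical.

```python
def anomaly_rate(n_detected, n_elements):
    """Equation (1): detected anomalies divided by the number of elements checked."""
    return n_detected / n_elements

def identify(rates, threshold):
    """Pick the speaker with the lowest anomaly rate, or None if it exceeds the threshold.

    rates: dict mapping a speaker label to the anomaly rate obtained when the
    new signal is monitored with that speaker's voiceprint.
    """
    best = min(rates, key=rates.get)
    if rates[best] > threshold:
        return None               # the voice belongs to no speaker in the database
    return best

rates = {
    "speaker_1": anomaly_rate(5, 50),     # 0.1
    "speaker_2": anomaly_rate(30, 50),    # 0.6
}
print(identify(rates, threshold=0.4))                  # speaker_1
print(identify({"speaker_2": 0.6}, threshold=0.4))     # None
```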

    Two approaches were used to build the automatic speaker recognition system. The first uses the voice signal waveform as input, while the second uses its Fast Fourier Transform (FFT). In the first approach, the result of sampling the voice signal is a vector that contains either 16-bit integer values or byte values, for 16-bit or 8-bit sampling respectively. In both cases, a binary representation can be used. In our implementation, the sampled vector is decomposed into bit-strings of length l, where l can be varied from 8 to 64 bits (a parameter of the system). The negative selection algorithm is then applied to the resulting set of strings to generate detectors of the same length l. The number of generated detectors is also a variable parameter of the system; it can be varied to study system performance. The following figure shows the different steps performed to generate the detector set for the two methods used.
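    The decomposition of the sampled vector into bit-strings of length l might look like the sketch below. The text does not specify how samples are encoded as bits or what happens to leftover bits, so the two's-complement encoding and the dropping of a trailing incomplete segment are assumptions, as are the function names.

```python
def samples_to_bits(samples, bits_per_sample):
    """Concatenate the fixed-width binary codes of all samples into one bit stream.

    Assumption: negative samples are encoded in two's complement.
    """
    mask = (1 << bits_per_sample) - 1
    width = "0{}b".format(bits_per_sample)
    return "".join(format(s & mask, width) for s in samples)

def segment(bitstream, length):
    """Cut the bit stream into self-elements of the given length.

    Assumption: a trailing segment shorter than `length` is dropped.
    """
    usable = len(bitstream) - len(bitstream) % length
    return [bitstream[i:i + length] for i in range(0, usable, length)]

stream = samples_to_bits([1, 255, 16, 3], bits_per_sample=8)
self_set = segment(stream, 16)
print(self_set)   # ['0000000111111111', '0001000000000011']
```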

    To compute the performance of the system for a given combination of parameters, some performance measures were defined. They are computed using two different sample sets for each speaker: a positive sample set (signals belonging to the same speaker but acquired in different situations) and a negative sample set (samples belonging to other speakers). Let N_g and N_p be the number of negative and positive samples respectively, and AN(i) the anomaly rate obtained for the i-th sample, computed using equation (1). Three performance measures can be computed for each speaker:

    Negative detection rate:

    $$NGR = \frac{\sum_{i=1}^{N_g} AN(i)}{N_g} \qquad (2)$$

    Positive detection rate:

    $$PSR = \frac{\sum_{i=1}^{N_p} AN(i)}{N_p} \qquad (3)$$

    Global detection rate:

    $$GR = \frac{NGR}{PSR} \qquad (4)$$

    According to these equations, a good configuration of the system is achieved when NGR is maximized and PSR is minimized, so that the global detection rate reaches its maximum. To give a better appreciation of the system, the average of each performance measure is taken over all the speakers used (50 in our experiments):

    $$AV_{ng} = \frac{1}{50} \sum_{i=1}^{50} NGR_i \qquad (5)$$

    $$AV_{np} = \frac{1}{50} \sum_{i=1}^{50} PSR_i \qquad (6)$$

    $$AV_{gl} = \frac{1}{50} \sum_{i=1}^{50} GR_i \qquad (7)$$
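    The performance measures of equations (2) to (7) amount to simple averages and a ratio, as the following sketch shows. The function names and the sample values are hypothetical, and returning infinity when PSR is zero is an assumption made to avoid a division by zero; the text does not discuss that case.

```python
def mean(values):
    return sum(values) / len(values)

def speaker_measures(negative_rates, positive_rates):
    """Return (NGR, PSR, GR) for one speaker from lists of anomaly rates."""
    ngr = mean(negative_rates)        # average anomaly rate on other speakers' samples
    psr = mean(positive_rates)        # average anomaly rate on the speaker's own samples
    gr = ngr / psr if psr > 0 else float("inf")   # assumption for the PSR = 0 case
    return ngr, psr, gr

def averaged_measures(per_speaker):
    """Average each measure over all speakers (equations (5) to (7))."""
    ngrs, psrs, grs = zip(*per_speaker)
    return mean(ngrs), mean(psrs), mean(grs)

results = [
    speaker_measures([0.75, 0.25], [0.125, 0.125]),   # (0.5, 0.125, 4.0)
    speaker_measures([1.0, 0.5], [0.25, 0.25]),       # (0.75, 0.25, 3.0)
]
print(averaged_measures(results))   # (0.625, 0.1875, 3.5)
```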

    In the second approach, which employs the Fast Fourier Transform (FFT) of the signal, a real-number codification is used. The resulting vector is normalized to lie in the interval [0,1], and its components are directly considered as self-elements. For each voice signal sample, the FFT is first computed and then passed to the negative selection algorithm (see Fig. 2).
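    The frequency-domain codification can be illustrated with the sketch below. Several points are assumptions, since the text does not specify them: the magnitude of the spectrum is used, the normalization is a min-max scaling, and a naive O(n^2) discrete Fourier transform stands in for a true FFT routine so that the example needs only the standard library. The function names are hypothetical.

```python
import cmath
import math

def magnitude_spectrum(samples):
    """Magnitudes of the DFT for the non-negative frequencies.

    Naive O(n^2) transform used as a stand-in; an FFT gives the same values faster.
    """
    n = len(samples)
    mags = []
    for k in range(n // 2 + 1):
        acc = sum(samples[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        mags.append(abs(acc))
    return mags

def normalize(values):
    """Min-max scaling to [0, 1] (assumed normalization)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# A cosine completing two cycles over 16 samples: its energy sits in frequency bin 2.
signal = [math.cos(2 * math.pi * 2 * t / 16) for t in range(16)]
spectrum = normalize(magnitude_spectrum(signal))
print(spectrum.index(max(spectrum)))   # 2
```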

    Negative_Selection (In: S, m, r_s; Out: D)

    S: set of self samples
    m: number of detectors
    r_s: detection radius

    1: D ← ∅
    2: Repeat
    3:     x ← random vector from [0,1]^n
    4:     Repeat for every s_i in S = {s_i, i = 1, 2, …}
    5:         d ← Euclidean distance between s_i and x
    6:         if d < r_s then go to 2
    7:     D ← D ∪ {x}
    8: Until |D| = m
    9: Return D

    Figure 2: Real-valued negative selection algorithm

    The detectors are then generated as real-number vectors, using an adapted negative selection algorithm [17] with the Euclidean distance as the matching rule. The generated set of detectors is used to monitor each positive and negative voice sample from the dataset.

    The same performance measures (NGR, PSR and GR) are used to test the system's performance.
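    The real-valued algorithm of Figure 2 translates almost line by line into Python. The sketch below is one possible reading of the pseudocode, not the authors' code: the `max_tries` safeguard, the use of the same radius r_s during monitoring, the fixed seed and the toy self samples are assumptions, and the function names are hypothetical. It relies on `math.dist`, available from Python 3.8.

```python
import math
import random

def real_negative_selection(self_samples, m, r_s, dim, rng, max_tries=100000):
    """Generate up to m detectors in [0,1]^dim lying at least r_s away from every self sample."""
    detectors = []
    tries = 0
    while len(detectors) < m and tries < max_tries:
        tries += 1
        x = [rng.random() for _ in range(dim)]            # step 3: random candidate
        if all(math.dist(x, s) >= r_s for s in self_samples):
            detectors.append(x)                           # step 7: keep non-matching candidate
    return detectors

def is_anomalous(sample, detectors, r_s):
    """Monitoring: a sample is anomalous if it falls within r_s of some detector."""
    return any(math.dist(sample, d) < r_s for d in detectors)

rng = random.Random(1)                                    # fixed seed, for reproducibility only
self_samples = [[0.2, 0.2], [0.25, 0.2]]
detectors = real_negative_selection(self_samples, 30, 0.1, 2, rng)

# No self sample is flagged, since every detector is at least r_s away from it.
print(any(is_anomalous(s, detectors, 0.1) for s in self_samples))   # False
```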

    7. Dataset and Experiment

    As explained above, our system is text-dependent. Accordingly, the dataset used must contain several voice signals of the same speaker pronouncing the same text. This can be achieved by collecting voice signals at different periods, or in different situations (in a noisy environment, when the speaker has a cold, etc.).

    For our experiments, we used the YOHO CD-ROM Voice Verification Corpus, available from the Linguistic Data Consortium (LDC) [18]. The dataset contains a total of 138 speakers. The data is transformed and represented in the format detailed above.

    To enable a comparison of performance between the proposed approach and existing ones, we also used the DARPA TIMIT speech database, which was designed to provide acoustic-phonetic speech data for the development and evaluation of automatic speech recognition systems [19]. Different combinations of parameters were tested to examine the system's performance with respect to each one. All tests were performed on an Intel Pentium 4 CPU at 2.66 GHz with 256 MB of RAM.

    8. Results and discussion

    8.1 Binary representation

    The length of each element of the self set after segmentation (parameter l) was first set to 16 bits. The number of detectors was set to 100, the matching threshold r to 8 and the detection threshold to 0.4. The r-contiguous matching rule is used to match detectors against signal elements (self set).

    Once a voiceprint of a speaker is constructed, the identification system can be tested in two ways: (i) using the voice signal of other speakers (user discrimination), (ii) using the voice of
