A negative selection approach for automatic speaker identification
K. M. Faraoun 1*, A. Boukelif 2
1 Evolutionary Engineering and Distributed Information Systems Laboratory (EEDIS), Département d’informatique, Djillali Liabès University, S.B.A., Algeria
firstname.lastname@example.org
2 Communication Networks, Architectures and Multimedia Laboratory, Département d’électronique, Djillali Liabès University, S.B.A., Algeria
Abstract. This paper shows the potential of applying artificial immune systems (in particular, the negative selection algorithm) to the problem of speaker recognition. Two representations are analysed: a binary representation of the original signal and a real-number representation of its Fast Fourier Transform. A number of experiments are performed on different datasets to examine how performance evolves with respect to the different system parameters. It is found that exploiting the Fast Fourier Transform substantially enhances the system's capabilities. The results show that the proposed approach gives acceptable results, compared to existing techniques, while using a simple detection algorithm.
Keywords: Speaker recognition, Artificial immune system, Negative selection
1. Introduction
Speaker recognition is a biometric-based technology (technology that verifies or identifies individuals by analyzing a facet of their physiology and/or behavior) that refers to automatic voice detection technologies, including speaker identification and speaker verification.
Speaker identification is the process of finding the identity of an unknown speaker by comparing the voice of that unknown speaker with voices in a database of speakers. It entails a one-to-many comparison. Speaker verification is the process of determining whether a person is who she/he claims to be. It entails a one-to-one comparison between a newly input voiceprint (by the claimant) and the voiceprint for the claimed identity that is stored in the system.
* Corresponding author: Faraoun Kamel Mohamed. Tel: (+213) 75 32 36 50. Fax: (+213) 48 57 77 50. Address: 15, rue Oualhaci Mokhtar - Sidi bel abbès - 22000 - Algérie
Speaker recognition systems have a wide range of applications in everyday life: time and attendance systems, access control systems, telephone banking, biometric login to telephone-aided shopping systems, information and reservation services, security control for confidential information, and forensic purposes.
All speaker recognition systems contain two main modules: feature extraction and feature matching. Feature extraction is the process that extracts a small amount of data from the voice signal that can later be used to represent each speaker. Feature matching involves the actual procedure to identify the unknown speaker by comparing extracted features from his/her voice input with the ones from a set of known speakers. Speaker recognition is a difficult task and is still an active research area. In practice, the task is made challenging by the high variability of input speech signals: a speech signal reflects the presence and type of speech pathologies and the physical and emotional state of the speaker, and it can also be contaminated by acoustic noise from the environment where the recording is made. Humans are often able to extract the identity information when the speech comes from a speaker they are acquainted with. In automatic verification, however, a routine must be developed to accommodate this kind of parasitic alteration.
As a result, various speaker recognition systems and tools have been built on different methods, such as:
• Neural network learning;
• The Bayesian Maximum A Posteriori (MAP) adaptation method;
• Statistical analysis and vector quantization;
• Gaussian mixture models (GMM);
• Hidden Markov models (HMM).
In this work, an attempt is made to show the use of the negative selection algorithm (an algorithm based on artificial immunology) to build an efficient speaker recognition system. Details on the method developed in the present work and descriptions of the experiments conducted are presented in the following sections. The paper is finally concluded with a summary of the most important points.
2. The speaker recognition
A speech signal is a very complex function of the speaker and his environment that can be captured easily with a standard microphone. Each voice signal is represented as a waveform. When acquired by a microphone, a sound is converted to an electrical current: continuous oscillations of air pressure become continuous oscillations of voltage in an electrical circuit. This fast-changing voltage is then converted into a series of numbers by a digitizer. A digitizer acts like a very fast digital voltmeter that makes thousands of measurements per second. Each measurement results in a number that can be stored digitally (that is, only a finite number of significant digits of this number are recorded). This number is called a sample, and the whole conversion of sound to a series of numbers is called sampling. The result of the sampling operation is a vector of numbers, which represents the voice signal waveform. The range of the numbers depends on the sample bit depth (16-bit or 8-bit in our work). Any speaker recognition system relies on the obtained vector to extract different features and characteristics of the voice signal and to perform any kind of analysis.
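The sampling step described above can be illustrated with a minimal sketch. It assumes a signal normalised to [-1, 1] and a signed integer encoding; the function name `sample_and_quantize`, the 440 Hz test tone and the 8000 Hz sampling rate are illustrative choices, not values taken from this work.

```python
import math

def sample_and_quantize(signal, duration_s, sample_rate_hz, bits):
    """Sample a continuous signal with values in [-1, 1] and quantize it.

    Assumption: samples are stored as signed integers, so the largest
    magnitude is 2**(bits - 1) - 1 (127 for 8-bit, 32767 for 16-bit).
    """
    max_level = 2 ** (bits - 1) - 1
    n_samples = round(duration_s * sample_rate_hz)
    samples = []
    for n in range(n_samples):
        t = n / sample_rate_hz                   # time of the n-th measurement
        value = max(-1.0, min(1.0, signal(t)))   # clip to the valid range
        samples.append(round(value * max_level)) # keep a finite number of digits
    return samples

def tone(t):
    """Hypothetical test signal: a pure 440 Hz sine wave."""
    return math.sin(2 * math.pi * 440.0 * t)

samples_8 = sample_and_quantize(tone, 0.01, 8000, 8)
samples_16 = sample_and_quantize(tone, 0.01, 8000, 16)
```

Both vectors describe the same 10 ms of signal; only the range of the stored numbers differs with the bit depth.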
Speaker recognition systems are classified as text-dependent (fixed-text) or text-independent (free-text). Text-dependent systems require the user to pronounce specified utterances, usually containing the same text as the training data, such as passwords, card numbers or PIN codes. There is no such constraint in text-independent systems, where speaker models capture characteristics of a person's speech that show up irrespective of what is being said.
In a text-dependent system, prior knowledge of the words or word sequence can be exploited to improve performance.
The representation used is also a very important element of the recognition system. Some techniques use the initial form of the voice signal (obtained directly from the sampling phase). Others employ a transformed form of the signal, mainly via the Fast Fourier Transform (FFT). The FFT makes it possible to work in the frequency domain, and hence to use the frequency spectrum of the voice signal instead of the waveform. This form gives more information about the signal structure and can be more characteristic of each speaker.
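As an illustration of the frequency-domain representation, the following sketch computes the magnitude spectrum of one frame with a textbook recursive radix-2 FFT written in pure Python. The frame length (256), the sampling rate (8000 Hz) and the 1000 Hz test tone are assumptions made for the example; a real system would more likely call an optimised FFT library.

```python
import cmath
import math

def fft(x):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    if n == 0 or n % 2:
        raise ValueError("length must be a non-zero power of two")
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out

def magnitude_spectrum(frame):
    """Magnitudes of the non-redundant half of the spectrum of a real frame."""
    spectrum = fft(frame)
    return [abs(c) for c in spectrum[: len(frame) // 2 + 1]]

sample_rate = 8000   # Hz, illustrative
n = 256              # frame length, illustrative
frame = [math.sin(2 * math.pi * 1000 * i / sample_rate) for i in range(n)]
mags = magnitude_spectrum(frame)
peak_bin = max(range(len(mags)), key=mags.__getitem__)
peak_hz = peak_bin * sample_rate / n   # frequency of the strongest component
```

For this test tone the energy concentrates in a single bin, which is the kind of structure that is hard to read directly from the waveform.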
Some other applications use MFCCs (Mel-Frequency Cepstrum Coefficients). MFCCs are based on the known variation of the human ear’s critical bandwidths with frequency. Their main role is to construct a voiceprint for each speaker, based on the characteristics of that speaker's voice signal. This voiceprint is then used by the identification process. Only voiceprints are stored in the database, instead of the whole voice signal. When a new voice is recorded, its voiceprint is extracted and then compared to all voiceprints in the database until identification or rejection. MFCC voiceprints combined with neural network techniques have been used to build a robust and flexible speaker recognition system. In the present work, a new form of voiceprint is created for each speaker using the negative selection algorithm. This algorithm is one of the main components of artificial immune systems.
3. The artificial immune systems
Artificial immune systems (AIS) are adaptive systems inspired by theoretical immunology and observed immune functions, principles and models. They form the basis of solutions for various real world problems, in particular, intrusion detection.
The natural immune system is a network of cells, tissues, and organs that work together to defend the body against attacks by “foreign” invaders. An artificial immune system must therefore provide a model for each element and each mechanism it borrows. Mechanisms are implemented as algorithms, while elements (cells, tissues...) are represented by binary strings or real vectors (depending on the problem definition). There are various mechanisms in the artificial immune system, such as clonal selection, affinity maturation, somatic hyper-mutation, receptor editing and negative selection.
4. The negative selection (NS) algorithm
A number of works have used the negative selection process to build artificial immune systems for virus detection, computer security, hardware fault-tolerant systems and time series analysis. The original work by Forrest, Perelson et al. in 1994, in which the negative selection algorithm was proposed, has been a starting point for almost all AIS research related to computer security. The negative selection algorithm is inspired by the maturation of T-cells in the thymus gland. The algorithm consists of two stages: censoring and monitoring. The censoring phase generates the change-detectors. Subsequently, the system being protected is monitored for changes using the detectors generated in the censoring stage. The basic principle of a negative selection algorithm is as follows:
• Define self as a multi-set Ns of strings of length l over a finite alphabet, a collection that we wish to protect or monitor. For example, Ns may be a segmented file, or a normal pattern of activity of some system or process.
• Generate a set Nr of detectors, each of which fails to match any string in Ns. A partial matching rule is used to compare the strings.
• Monitor Ns for changes by continually matching the detectors against Ns. If any detector ever matches, a change (or deviation) must have occurred.
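The three steps above can be sketched as follows. This is a minimal illustration, not the implementation used in this work: it assumes detectors are drawn at random and censored against the self set, and it uses a simple Hamming-distance rule as the partial matching rule. The names `hamming_match`, `generate_detectors` and `monitor`, the toy self set and all parameter values are hypothetical.

```python
import random

def hamming_match(a, b, max_distance=1):
    """Illustrative partial rule: match when at most max_distance positions differ."""
    return sum(x != y for x, y in zip(a, b)) <= max_distance

def generate_detectors(self_set, length, n_detectors, matches, rng, max_tries=100000):
    """Censoring: keep random candidate strings that match no self string."""
    detectors = set()
    tries = 0
    while len(detectors) < n_detectors and tries < max_tries:
        tries += 1
        candidate = "".join(rng.choice("01") for _ in range(length))
        if not any(matches(candidate, s) for s in self_set):
            detectors.add(candidate)
    return sorted(detectors)

def monitor(strings, detectors, matches):
    """Monitoring: return the strings matched by at least one detector."""
    return [s for s in strings if any(matches(d, s) for d in detectors)]

rng = random.Random(0)                       # fixed seed for repeatability
self_set = ["00000000", "00001111", "11110000"]
detectors = generate_detectors(self_set, 8, 20, hamming_match, rng)

unchanged = monitor(self_set, detectors, hamming_match)            # no change expected
observed = monitor(["10101010", "11111111"], detectors, hamming_match)
```

By construction no detector matches the unmodified self set, so `unchanged` is empty; whether a given non-self string is flagged depends on how well the detector set covers the non-self space.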
Matching between detectors and self-strings is done via a matching rule, which indicates, for any two strings of the same length l over the same alphabet, whether they match or not. Obvious approximate matching rules include the Hamming distance and the Euclidean distance, but the most widely adopted rule, and the most plausible one from an immunological point of view, is the so-called r-contiguous bits rule: two strings match if they have r contiguous bits in common. The parameter r is the threshold of the matching rule and determines the specificity of the detector. It is an indication of the size of the subset of strings that a single detector can match. If r = l, the matching is completely specific, that is, the detector will match only a single string (itself); but if r = 0, the matching is completely general, that is, the detector will match every string of length l.
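A minimal sketch of the r-contiguous bits rule is given below, assuming strings are compared position by position and that "r contiguous bits in common" means agreement at r or more consecutive positions. The function name `r_contiguous_match` is illustrative.

```python
def r_contiguous_match(a, b, r):
    """Return True when a and b agree in at least r contiguous positions."""
    if len(a) != len(b):
        raise ValueError("strings must have the same length")
    if r <= 0:
        return True          # completely general: every string matches
    run = 0                  # length of the current run of agreeing positions
    for x, y in zip(a, b):
        run = run + 1 if x == y else 0
        if run >= r:
            return True
    return False

# The two strings below agree at positions 1 to 5 (a run of five).
a = "10110100"
b = "00110111"
match_r5 = r_contiguous_match(a, b, 5)
match_r6 = r_contiguous_match(a, b, 6)
```

With r equal to the string length the rule only accepts identical strings, and lowering r makes each detector cover a larger subset of strings.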
5. Proposed approaches: NS algorithm for speaker recognition
The main goal in almost all applications of the negative selection algorithm is to detect abnormal deviations from normal behaviour. The algorithm generates detectors from a segmented version of the original data (the self set Ns), whose representation differs from one application to another. In general, a binary representation is used to encode the self-elements. The elements all have the same fixed length (which can be a parameter of the algorithm).
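One possible way to obtain such fixed-length binary self-elements from a sampled signal is sketched below. The encoding is an assumption made for illustration only (unsigned 8-bit samples written as fixed-width binary and cut into non-overlapping windows); the helper names `to_bits` and `segment` and the element length 12 are hypothetical.

```python
def to_bits(samples, bits=8):
    """Concatenate unsigned integer samples as fixed-width binary strings."""
    limit = 1 << bits
    for s in samples:
        if not 0 <= s < limit:
            raise ValueError(f"sample {s} does not fit in {bits} unsigned bits")
    return "".join(format(s, f"0{bits}b") for s in samples)

def segment(bitstring, length):
    """Split into non-overlapping strings of the given length.

    Trailing bits that do not fill a whole element are dropped.
    """
    usable = len(bitstring) - len(bitstring) % length
    return [bitstring[i:i + length] for i in range(0, usable, length)]

samples = [0, 255, 128, 3]          # toy 8-bit samples
bitstring = to_bits(samples)        # 32 bits in total
self_set = segment(bitstring, 12)   # two elements of length 12, 8 bits dropped
```

The element length plays the role of the string length l of the algorithm and would be tuned together with the matching threshold.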
Accordingly, the negative selection algorithm is used in the present work to construct a set of detectors for a given speaker's voice signal. The generated detectors are then used as a voiceprint to monitor newly acquired voice signals (identification phase). If the signal is produced by the same speaker, its form and data distribution must be very similar to those of the original signal (used to generate the detectors), so a very low anomaly rate will be obtained (zero in the ideal case).
According to the obtained anomaly rate, the automatic speaker recognition system decides the identity of the speaker of the new voice. To achieve that, a database of voiceprints of different speakers is used. A voiceprint is given by the set of detectors obtained when applying the negative selection algorithm to the corresponding voice signal (the learning phase). If the lowest obtained anomaly rate is higher than a fixed threshold (a system parameter that determines the highest accepted value of the detected anomaly rate), the system decides that the voice signal does not belong to any speaker in the database, and the speaker is not identified. Otherwise, the speaker is identified as the one with the lowest anomaly rate (lower than the threshold). Fig. 1 summarizes how the recognition system operates. The anomaly rate of each new speaker voice with respect to a given detector set (voiceprint) is computed using the following expression:

$$\text{Anomaly rate} = \frac{\text{number of detected anomalies}}{\text{number of self elements}} \tag{1}$$
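The decision rule described above can be sketched as follows. The sketch assumes that the denominator of expression (1) counts the elements of the signal being monitored, and it uses exact string equality as a stand-in matching rule to keep the example short. The speaker names, the detector sets and the threshold value 0.1 are hypothetical.

```python
def anomaly_rate(elements, detectors, matches):
    """Fraction of elements matched by at least one detector (expression (1))."""
    if not elements:
        raise ValueError("no elements to monitor")
    anomalies = sum(1 for e in elements if any(matches(d, e) for d in detectors))
    return anomalies / len(elements)

def identify(elements, voiceprints, matches, threshold):
    """Return (speaker, rate) for the lowest anomaly rate.

    The speaker is None when even the lowest rate exceeds the threshold.
    """
    rates = {name: anomaly_rate(elements, dets, matches)
             for name, dets in voiceprints.items()}
    best = min(rates, key=rates.get)
    if rates[best] > threshold:
        return None, rates[best]
    return best, rates[best]

def exact(d, e):
    """Stand-in matching rule (assumption): detector equals element."""
    return d == e

# Hypothetical voiceprints: one detector set per enrolled speaker.
voiceprints = {"alice": ["1111", "1110"], "bob": ["0000", "0001"]}

accepted = identify(["0000", "0001", "0011", "0000"], voiceprints, exact, 0.1)
rejected = identify(["1111", "0000"], voiceprints, exact, 0.1)
```

In the first call the detectors of "alice" flag nothing while those of "bob" flag three of the four elements, so the signal is attributed to "alice"; in the second call both rates are 0.5, above the threshold, and no speaker is returned.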
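[Figure 1 (flowchart). Learning phase: voice signals 1, 2 and 3 are encoded (binary/float codification) and the resulting detectors are stored as voiceprints in the detectors database. Identification phase: the new speaker voice is encoded, the detectors are applied, and the anomaly rate is compared with the threshold; if it is below the threshold the speaker is identified, otherwise the speaker is not identified.]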
Figure 1. Speaker recognition system based on the negative selection