Monday, 8 October 2012

Voiceprint Recognition Systems for Remote Authentication-A Survey

A Voiceprint Recognition System, also known as a Speaker Recognition System (SRS), is the
best-known commercialized form of voice biometrics. Automated speaker recognition is the
computing task of validating a user's claimed identity using characteristics extracted from
their voice. In contrast to other biometric technologies, which are mostly image based and
require expensive proprietary hardware such as a vendor’s fingerprint sensor or iris-scanning
equipment, speaker recognition systems are designed for use with virtually any standard
telephone or on public telephone networks. The ability to work with standard telephone
equipment makes it possible to support broad-based deployments of voice biometrics
applications in a variety of settings. In automated speaker recognition, the speech signal is
processed to extract speaker-specific information. This information is used to generate a
voiceprint which cannot be replicated by any source except the original speaker. This makes
speaker recognition a secure method for authenticating an individual since, unlike passwords
or tokens, it cannot be stolen, duplicated or forgotten. This literature survey paper gives a
brief introduction to SRS, and then discusses the general architecture of SRS, biometric
standards relevant to voice/speech, typical applications of SRS, and current research in
Speaker Recognition Systems. We have also surveyed various approaches for SRS.

Keywords: Voiceprint, SRS, Speaker Recognition Systems, Voice Biometrics, Speech.
  1. Introduction
1.1. Brief Overview of Speaker Recognition
Voice biometrics was first developed in 1970, and although it has become a sophisticated
security tool only in the past few years, it has been seen as a technology with great
potential for much longer. The most significant difference between voice biometrics and
other biometrics is that voice biometrics is the only commercial biometric that processes
acoustic information. Most other biometrics are image-based. Another important difference is
that most commercial voice biometrics systems are designed for use with virtually any
standard telephone or on public telephone networks. The ability to work with standard
telephone equipment makes it possible to support broad-based deployments of voice biometrics
applications in a variety of settings. In contrast, most other biometrics require proprietary
hardware, such as the vendor’s fingerprint sensor or iris-scanning equipment. By definition,
voice biometrics is always linked to a particular speaker. The best-known commercialized form
of voice biometrics is speaker recognition. Speaker recognition is the computing task of
validating a user's claimed identity using characteristics extracted from their voice.

International Journal of Hybrid Information Technology
Vol. 4, No. 2, April, 2011

Table 1. Typical applications of speaker recognition systems

Areas                        Specific applications
Authentication               Remote Identification & Verification, Mobile Banking, ATM
                             Transaction, Access Control
Information Security         Personal Device Logon, Desktop Logon, Application Security,
                             Database Security, Medical Records, Security Control for
                             Confidential Information
Law Enforcement              Forensic Investigation, Surveillance Applications
Interactive Voice Response   Banking over a telephone network, Information and Reservation
                             Services, Telephone Shopping, Voice Dialing, Voice Mail

A speaker’s voice is extremely difficult to forge for biometric comparison purposes, since a
myriad of qualities are measured, ranging from dialect and speaking style to pitch, spectral
magnitudes, and formant frequencies. The vibration of a user's vocal cords and the patterns
created by the physical components that produce human speech are as distinctive as
fingerprints. Voice recognition captures the unique characteristics associated with an
individual’s voice, such as speed, tone, pitch and dialect, and creates a non-replicable
voiceprint, which is also known as a speaker model or template. This voiceprint, which is
derived through mathematical modeling of multiple voice features, is nearly impossible to
replicate. A voiceprint is a secure method for authenticating an individual’s identity that,
unlike passwords or tokens, cannot be stolen, duplicated or forgotten.

1.2. Voice Production Mechanism
The origin of differences in the voices of different speakers lies in the construction of
their articulatory organs, such as the length of the vocal tract and the characteristics of
the vocal cords, and in the differences in their speaking habits. An adult vocal tract is
approximately 17 cm long and is considered the part of the speech production organs above the
vocal folds (earlier called the vocal cords). As shown in Figure 1, the speech production
organs include the laryngeal pharynx (below the epiglottis), oral pharynx (behind the tongue,
between the epiglottis and velum), oral cavity (forward of the velum and bounded by the lips,
tongue, and palate), nasal pharynx (above the velum, rear end of nasal cavity) and the nasal
cavity (above the palate and extending from the pharynx to the nostrils). The larynx comprises
the vocal folds, the top of the cricoid cartilage, the arytenoid cartilages and the thyroid
cartilage. The area between the vocal folds is called the glottis. The resonance of the vocal
tract alters the spectrum of the acoustic wave as it passes through the vocal tract. Vocal
tract resonances are called formants. Therefore the vocal tract shape can be estimated from
the spectral shape (e.g., formant location and spectral tilt) of the voice signal. Speaker
recognition systems use features generally derived only from the vocal tract. The excitation
source of the human voice also contains speaker-specific information. The excitation is
generated by the airflow from the lungs, which thereafter passes through the trachea and then
through the vocal folds. The excitation is classified as phonation, whispering, frication,
compression, vibration or a combination of these. Phonation excitation is caused when airflow
is modulated by the vocal folds. When the vocal folds are closed, pressure builds up
underneath them until they blow apart. The folds are drawn back together again by their
tension, elasticity and the Bernoulli effect. The oscillation of the vocal folds causes pulsed
stream excitation of the vocal tract. The frequency of oscillation is called the fundamental
frequency and it depends upon the length, mass and tension of the vocal folds. The fundamental
frequency therefore is another distinguishing characteristic for a given speaker.
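
To make the notion of fundamental frequency concrete, the following is a minimal sketch of
one common way to estimate it from a short voiced frame, using autocorrelation. It is not a
method described in this survey; the function name estimate_f0, the 60 to 400 Hz search
range, and the synthetic test tone are all assumptions chosen for illustration.

```python
import math

def estimate_f0(samples, sample_rate, f_min=60.0, f_max=400.0):
    """Estimate the fundamental frequency (Hz) of a voiced frame by autocorrelation."""
    # Candidate pitch periods, in samples, for the assumed search range.
    min_lag = int(sample_rate / f_max)
    max_lag = int(sample_rate / f_min)
    best_lag, best_score = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        # Correlate the frame with a copy of itself delayed by `lag` samples.
        score = sum(samples[i] * samples[i + lag] for i in range(len(samples) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    # The lag with the strongest self-similarity is taken as the pitch period.
    return sample_rate / best_lag

# Synthetic stand-in for a voiced frame (an assumption, not real speech):
# a 120 Hz tone plus one harmonic, sampled at 8 kHz.
sample_rate = 8000
frame = [math.sin(2 * math.pi * 120 * n / sample_rate)
         + 0.5 * math.sin(2 * math.pi * 240 * n / sample_rate)
         for n in range(800)]
f0 = estimate_f0(frame, sample_rate)
print(f"estimated F0: {f0:.1f} Hz")
```

Because the lag is an integer number of samples, the estimate is only as fine as the sampling
grid allows; practical systems typically refine it by interpolation.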

Figure 1. The speech production mechanism [41]
1.3. How the Technology Works
The underlying premise for speaker recognition is that each person’s voice differs in pitch,
tone, and volume enough to make it uniquely distinguishable. Several factors contribute to
this uniqueness: the size and shape of the mouth, throat, nose, and teeth, which are called
the articulators, and the size, shape, and tension of the vocal cords. The chance that all of
these are exactly the same in any two people is low. The manner of vocalizing further
distinguishes a person’s speech: how the muscles are used in the lips, tongue and jaw. Speech
is produced by air passing from the lungs through the throat and vocal cords, then through the
articulators. Different positions of the articulators create different sounds. This produces
a vocal pattern that is used in the analysis.

A visual representation of the voice can be made to help the analysis. This is called a
spectrogram, also known as a voiceprint, voicegram, spectral waterfall, or sonogram. A
spectrogram displays time, the frequency of vibration of the vocal cords (pitch), and
amplitude (volume). Pitch is generally higher for females than for males.
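
As a rough illustration of how such a time-frequency picture can be computed, the sketch
below splits a signal into overlapping frames and takes the magnitude of a discrete Fourier
transform of each frame. The frame length, hop size, Hann window and the synthetic 1 kHz tone
are assumptions made for the example; a practical system would use an FFT library rather than
this direct transform.

```python
import cmath
import math

def spectrogram(samples, frame_len=64, hop=32):
    """Return one list of DFT magnitudes (bins 0..frame_len/2) per analysis frame."""
    # Hann window to reduce spectral leakage at the frame edges.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / (frame_len - 1))
              for n in range(frame_len)]
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = [samples[start + n] * window[n] for n in range(frame_len)]
        magnitudes = []
        for k in range(frame_len // 2 + 1):
            # Direct DFT of the windowed frame at frequency bin k.
            coeff = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                        for n in range(frame_len))
            magnitudes.append(abs(coeff))
        frames.append(magnitudes)
    return frames

# Synthetic input (an assumption): a pure 1 kHz tone sampled at 8 kHz.
sample_rate = 8000
tone = [math.sin(2 * math.pi * 1000 * n / sample_rate) for n in range(512)]
spec = spectrogram(tone)

# Each row is a time step, each column a frequency bin, each value an amplitude.
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
peak_hz = peak_bin * sample_rate / 64
print(f"{len(spec)} frames, strongest component near {peak_hz:.0f} Hz")
```

Plotting the rows against time, with brightness set by the magnitude, gives the kind of
image described above.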

Figure 2. These voiceprints are a visual representation of two different
speakers saying “RENRAKU” [1]

1.4. Methodology
Each speaker recognition system has two phases: enrollment and verification. During
enrollment, the speaker's voice is recorded and typically a number of features are extracted
to form a voiceprint, template, or model. In the verification phase, a speech sample or
"utterance" is compared against a previously created voiceprint.
Speaker recognition systems fall into two categories: text-dependent and text-independent.
In a text-dependent system, the text is the same during the enrollment and verification
phases. In text-independent systems, the text during enrollment and test is different. In
fact, the enrollment may happen without the user's knowledge, as is the case for many
forensic applications.
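
The two phases can be sketched in a few lines. This is only an illustrative outline under
stated assumptions: feature extraction is taken as already done, the voiceprint is simply the
mean of the enrollment feature vectors, and the comparison uses cosine similarity with an
arbitrary threshold of 0.95. None of these choices is prescribed by the survey, and the
feature values are made up.

```python
import math

def enroll(feature_vectors):
    """Average several enrollment feature vectors into one template (voiceprint)."""
    count = len(feature_vectors)
    return [sum(column) / count for column in zip(*feature_vectors)]

def cosine_similarity(a, b):
    """Similarity in [-1, 1]; 1 means the two vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def verify(template, utterance_features, threshold=0.95):
    """Accept the utterance if it is close enough to the enrolled template."""
    return cosine_similarity(template, utterance_features) >= threshold

# Hypothetical 4-dimensional feature vectors from three enrollment recordings.
enrollment = [[1.0, 2.1, 0.4, 3.0], [1.2, 1.9, 0.5, 2.8], [0.9, 2.0, 0.6, 3.1]]
voiceprint = enroll(enrollment)

genuine = [1.1, 2.0, 0.5, 2.9]    # a later utterance from the same speaker
impostor = [3.0, 0.2, 2.5, 0.4]   # an utterance from someone else
print(verify(voiceprint, genuine), verify(voiceprint, impostor))
```

The threshold trades false acceptances against false rejections, so in practice it is tuned
on held-out recordings rather than fixed in advance.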

2. General Speaker Recognition System Architecture
There are two major commercialized applications of speaker recognition technologies and
methodologies: Speaker Identification and Speaker Verification.

2.1. SIS (Speaker Identification System)
Speaker Identification can be thought of as the task of finding who is talking from a set of
known voices of speakers. It is the process of determining who has provided a given utterance
based on the information contained in speech waves. Speaker identification is a 1:N match
where the voice is compared against N templates.

2.2. SVS (Speaker Verification System)
Speaker Verification, on the other hand, is the process of accepting or rejecting the
identity claimed by a speaker. Speaker verification is a 1:1 match where one speaker's voice
is matched to one template.
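
The difference between the 1:N and 1:1 cases can be shown with a small sketch. The speaker
names, the three-dimensional templates, the Euclidean distance measure and the acceptance
distance of 0.5 are all hypothetical and chosen only to keep the example short.

```python
import math

def distance(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def identify(templates, sample):
    """1:N match: return the enrolled speaker whose template is closest to the sample."""
    return min(templates, key=lambda name: distance(templates[name], sample))

def verify(templates, claimed_name, sample, max_distance=0.5):
    """1:1 match: compare the sample only with the claimed speaker's template."""
    return distance(templates[claimed_name], sample) <= max_distance

# Hypothetical enrolled templates (N = 3).
templates = {
    "alice": [1.0, 2.0, 0.5],
    "bob": [2.5, 0.3, 1.8],
    "carol": [0.2, 1.1, 2.9],
}
sample = [2.4, 0.4, 1.7]

print(identify(templates, sample))                 # who is speaking?
print(verify(templates, "bob", sample))            # is this really bob?
print(verify(templates, "alice", sample))          # is this really alice?
```

Note that identification as written always returns some enrolled speaker; rejecting unknown
voices would need an additional distance threshold, as verification has.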

Figure 3. General SIS and SVS Architecture [2]
Figure 4. Speaker Identification System [42]
Figure 5. Speaker Verification System [42]

3. Voice Biometric Standards
Standards play an important role in the development and sustainability of technology, and
work in the international and national standards arena will facilitate the improvement of bio-
metrics. The major standards work in the area of speaker recognition involves:

Speaker Verification Application Program Interface (SVAPI)
Biometric Application Program Interface (BioAPI)
Media Resource Control Protocol (MRCP)
Voice Extensible Markup Language (VoiceXML)
Voice Browser (W3C)

Of these, BioAPI has been cited as the one truly organic standard stemming from the BioAPI
Consortium, founded by over 120 companies and organizations with a common interest in
promoting the growth of the biometrics market.

4. Commercial Applications of SRS
The applications of speaker recognition technology are quite varied and continually growing.
Voice biometric systems are mostly used for telephony-based applications. Voice verification
is used for government, healthcare, call centers, electronic commerce, financial services, and
customer authentication for service calls, and for house arrest and probation-related
authentication.

Table 2. Broad areas where speaker recognition technology has been or is
currently used.

S.No.  Areas/Company using speaker recognition technology

1. Authentication

Union Pacific Railroad: Union Pacific moves railcars back and forth across the United States
every day. The railcars travel loaded in one direction and empty on the way back. When the
loaded railcar arrives, the customer is notified to come and pick up the contents. Once
emptied, the customer needs to alert Union Pacific to put the railcar back to work. Union
Pacific now has an automated system that utilizes voice authentication to allow a customer to
release empty railcars. Customers enroll in the voice authentication system over the phone.
When they call back to release an empty railcar, the system authenticates them and allows
them to release their railcars. In this case, voice authentication has allowed customers to
get off the phone faster, and Union Pacific to guarantee that a customer is not releasing a
railcar that doesn’t belong to him.

New York Town Manor: New York Town Manor is a residential community in Pennsylvania designed
for senior citizens with technologically advanced features. The residents no longer have to
remember passwords. They do carry ID cards that are used in conjunction with voice
authentication to allow access to the complex. To enter their apartments, they speak for a
few seconds while the system authenticates them. With this approach, voice authentication
provides an extra measure of security.

Bell Canada: Technicians for Bell Canada used to have to carry laptops on the job with them.
A technician would dial up using a modem to report the current job as finished and to get the
next job. Bell Canada has rolled out a new system that uses voice authentication to verify
the identity of the technician through a phone call and give him access to the data. This
eliminates the need for a laptop.

Password Journal: Anyone who has ever had a diary has probably worried that someone would
read it without permission. One company has solved this problem by adding voice
authentication as a privacy measure to their Password Journal product. The journal has its
own speaker, raises an alarm if an unauthorized person attempts to access it, and keeps track
of how many failed attempts there have been.

Password Reset: Some companies are allowing users to reset passwords themselves. Users dial
an automated system. The system asks questions. When the user answers, the system
authenticates his voice and allows him to reset his own password. This saves companies time
and money in support costs, and users need not spend time on hold waiting for the next
available support person.

US Social Security Administration: The United States Social Security Administration is using
voice authentication to allow employers to report W-2 wages online. Used in combination with
a PIN, the voice authentication provides system security and user convenience.

Banking: Reducing crime at Automated Teller Machines is an ongoing struggle. Banks have
started using biometrics to authenticate users before allowing ATM transactions. Users
generally must provide a PIN and a voice sample to be allowed access. Royal Canadian Bank is
using voice authentication to allow access to telephone banking.

2. Law Enforcement: In Louisiana, criminals are kept on a short leash with voice biometrics.
This inexpensive approach allows law enforcement to check in with offenders at random times
of the day. The offender must answer the phone and speak a phrase that is used for
authentication. This system guarantees that they are where they are supposed to be. Voice
authentication has also been used in criminal cases, such as rape and murder cases, to verify
the identity of an individual in a recorded conversation. There is a terrorism application
also. Voice authentication is frequently used to validate the identity of terrorists such as
Osama Bin Laden on recorded conversations. Hopefully these clues will one day assist in his
capture.

3. AHM (Australia Health Management): Since 2007, Australian private health insurer AHM has
successfully managed one of the largest public-facing deployments of speaker verification.
With more than 400,000 yearly calls into its main contact center, AHM has implemented an
automated voice verification system to provide quick, accurate authentication of callers,
enhancing member security and improving the customer experience.

4. VoiceCash: Based in Germany, VoiceCash, an enabler of mobile payment solutions, is
targeting consumers interested in cross-border money transfers, offering pre-paid payment
cards that can be managed online or via SMS communications. The transfers can be
authenticated utilizing voice verification technology supplied by VoiceTrust.

5. SIMAH: The Saudi Arabia Credit Bureau is deploying a voice biometric solution provided by
Agnitio and IST, a contact center system integrator. The technology is part of IST's iSecure
product and will be deployed through SIMAH's new Cisco contact center.

6. Vodafone Turkey: Vodafone Turkey has integrated PerSay VocalPassword with Avaya Voice
Portal Platform to enable secure self-service applications such as GSM Personal Unlocking Key
reset and access to Vodafone Call Centers.

5. Leading Vendors of Speaker Recognition Systems
Table 3. List of vendors

S.No.  Vendor                                 Website
1.     Persay (NY, USA)                       www.persay.com
2.     Agnitio (Spain)                        www.agnitio.es
3.     TAB Systems Inc. (Slovenia, Europe)    www.tab-systems.com
4.     DAON (Washington DC)                   www.daon.com
5.     Smartmatic (USA)                       www.smartmatic.com
6.     Speech Technology Center (Russia)      www.speechpro.com
7.     Loquendo (Italy)                       www.loquendo.com/en/
8.     SeMarket (Barcelona)                   www.semarket.com
9.     Recognition Technologies Ltd. (NY)     www.speakeridentification.com
