2006-12-08: Human SCINT Seminar (21)
Posted by Hiroshi Fukagawa, 2006-11-12 17:23

Date: 2006.12.08 (Fri) 16:30-18:30
Place: Kashiwa Campus, Transdisciplinary Sciences Bldg., Room 2D8
- connect to: Hongo Campus, Faculty of Engineering Bldg.2, Room 101B1
- connect to: Komaba Campus, Information Education Bldg., Distance Learning Room
Speaker: Nobuaki Minematsu
Title: Speech as music -- human cognition of sounds based on the structural invariance --
Keywords: structural organization, invariant structure, speech and music, speech perception, language disorder, language acquisition

Affiliation: Department of Frontier Informatics, Graduate School of Frontier Sciences
Position: Associate Professor
Disciplines: speech science and engineering
Societies and Conferences: IEICE(The Institute of Electronics, Information and Communication Engineers), ASJ(The Acoustical Society of Japan), PSJ(Phonetic Society of Japan), JSAI(The Japanese Society for Artificial Intelligence), IPSJ(Information Processing Society of Japan), ISCA(International Speech Communication Association), IPA(International Phonetic Association)


Bibliography: Nobuaki Minematsu, Speech as music -- human cognition of sounds based on the structural invariance --, No. 21, pp. 1, 2006.

Abstract:
Infants acquire a spoken language mainly through interactions with their mothers and fathers. Acoustically speaking, they learn how to deal with speech variability by hearing a strongly biased speech corpus. Speech science and engineering, by contrast, have handled that variability by collecting speech samples from millions of speakers. What, then, is the intrinsic difference between humans and machines? When people describe a musical tune as a sequence of syllable names such as do, re, and mi, the sequence does not change even after the tune is transposed. People with relative pitch perceive mentally the same sound quality for physically different sounds: they first capture the musical scale of the tune and then, within that scale structure, identify the individual sounds. Put another way, they first capture the holistic pattern and then identify the constituent elements within it. With this strategy, variability in music cannot change the syllable sequence produced by listeners with relative pitch. Developmental psychology claims that infants likewise first capture the holistic pattern of a word and only later learn its individual segmental sounds. If this claim is correct, is the holistic pattern of a word acoustically invariant with respect to speakers, microphones, transmission lines, and so on? In this talk, a speaker-invariant representation of speech is introduced mathematically through the structural organization of speech. Some experimental results on speech recognition by humans and machines are then shown. Finally, the talk discusses how speech science and engineering should proceed.
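The transposition argument in the abstract can be sketched numerically: if a melody is represented by its pairwise pitch intervals rather than by absolute pitches, shifting every note by the same amount leaves that interval structure untouched. The sketch below is only an illustrative analogy for the "invariant structure" idea, not the speaker's actual representation of speech; the melody and the transposition offset are invented for this example.

```python
# Relative-pitch analogy: a melody's interval structure (the matrix of
# pairwise pitch differences, in semitones) is invariant under
# transposition, even though every absolute pitch changes.

def interval_structure(pitches):
    """Return the matrix of pairwise differences p[j] - p[i] (semitones)."""
    return [[q - p for q in pitches] for p in pitches]

# "Do, re, mi, do" starting at C4 (MIDI note 60) -- example values only.
melody = [60, 62, 64, 60]

# The same tune transposed up a perfect fourth (5 semitones).
transposed = [p + 5 for p in melody]

# Absolute pitches differ, but the holistic interval pattern is identical.
assert melody != transposed
assert interval_structure(melody) == interval_structure(transposed)
```

In this analogy, identifying a note by its place in the interval matrix corresponds to identifying a speech sound by its place in the holistic pattern rather than by its raw acoustics.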

References:
●Theoretical backgrounds
[1] N. Minematsu, "Yet another acoustic representation of speech", Proc. Int. Conf. Acoustics, Speech & Signal Processing (ICASSP), pp. 585-588 (2004-5)
[2] N. Minematsu, "Mathematical evidence of the acoustic universal structure", Proc. Int. Conf. Acoustics, Speech & Signal Processing (ICASSP), pp. 889-892 (2005-3)
[3] N. Minematsu, T. Nishimura, K. Nishinari, and K. Sakuraba, "Theorem of the invariant structure and its derivation of speech Gestalt", Proc. Int. Workshop on Speech Recognition and Intrinsic Variations (SRIV), pp. 47-52 (2006-5)
●Application to speech recognition
[4] T. Murakami, K. Maruyama, N. Minematsu, and K. Hirose, "Japanese vowel recognition based on structural representation of speech", Proc. European Conf. Speech Communication and Technology (EUROSPEECH), pp. 1261-1264 (2005-9)
[5] T. Murakami, K. Maruyama, N. Minematsu, and K. Hirose, "Japanese vowel recognition using external structure of speech", Proc. Int. Workshop on Automatic Speech Recognition and Understanding (ASRU), pp. 203-208 (2005-11)
[6] N. Minematsu, T. Nishimura, T. Murakami, and K. Hirose, "Speech recognition only with supra-segmental features -- hearing speech as music --", Proc. Int. Conf. Speech Prosody, pp. 589-594 (2006-5)
●Application to pronunciation training
[7] N. Minematsu, "Pronunciation assessment based upon the phonological distortions observed in language learners' utterances", Proc. Int. Conf. Spoken Language Processing (ICSLP), pp. 1669-1672 (2004-10)
[8] S. Asakawa, N. Minematsu, T. I. Jaakkola, and K. Hirose, "Structural representation of the non-native pronunciations", Proc. European Conf. Speech Communication and Technology (EUROSPEECH), pp. 165-168 (2005-9)
[9] N. Minematsu, S. Asakawa, and K. Hirose, "Structural representation of the pronunciation and its use for CALL", Int. Workshop on Spoken Language Technology (SLT) (2006-12, to appear)
Copyright (C) 2005-6, Human Science Integration Program - Humans. All rights reserved.