Automatic Segmentation of Spontaneous Speech / Segmentação automática da fala espontânea

Brigitte Bigi, Christine Meunier

Abstract


Most of the time, analyzing the phonetic entities of speech requires the alignment of the speech recording with its phonetic transcription. However, studies on automatic segmentation have predominantly been carried out on read or prepared speech, whereas spontaneous speech refers to a more informal activity, produced without any preparation. As a consequence, spontaneous speech exhibits numerous phenomena such as hesitations, repetitions, feedback, backchannels, non-standard elisions, reduction phenomena, truncated words and, more generally, non-standard pronunciations. Events like laughter, noises and filled pauses are also very frequent in spontaneous speech. This paper compares read speech and spontaneous speech in order to evaluate the impact of speech style on a speech segmentation task. It describes the solution implemented in the SPPAS software tool to automatically perform speech segmentation of both read and spontaneous speech. This solution consists mainly of two components: support for an Enriched Orthographic Transcription, which optimizes the grapheme-to-phoneme conversion, and forced-alignment of the following events: filled pauses, laughter and noises. These events represent less than 1 % of the tokens in read speech and about 6 % in spontaneous speech. They occur in at most 3 % of the Inter-Pausal Units of a read speech corpus and in 20 % to 36 % of the Inter-Pausal Units of the spontaneous speech corpora. The Unit Boundary Positioning Accuracy (UBPA) of the proposed forced-alignment system is 96.09 % for read speech and 96.48 % for spontaneous speech, with a delta range of 40 ms.
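The UBPA scores above count the proportion of automatically placed boundaries that fall within a given tolerance (the delta range, here 40 ms) of the manually placed reference boundaries. A minimal sketch of such a measure is given below; `ubpa` is a hypothetical helper written for illustration, not the authors' implementation, and it assumes the two boundary lists are already paired one-to-one, with times in seconds.

```python
def ubpa(reference, hypothesis, delta=0.040):
    """Percentage of hypothesis boundaries lying within +/- delta seconds
    of the corresponding reference boundary (all times in seconds)."""
    if len(reference) != len(hypothesis):
        raise ValueError("boundary lists must be paired one-to-one")
    # Count boundary pairs whose time difference is within the tolerance.
    hits = sum(abs(r - h) <= delta for r, h in zip(reference, hypothesis))
    return 100.0 * hits / len(reference)

# Example: the third boundary is off by 70 ms, so 3 of 4 pairs are hits.
ref = [0.120, 0.480, 0.910, 1.350]
hyp = [0.135, 0.470, 0.980, 1.360]
print(ubpa(ref, hyp))  # → 75.0
```

Reporting the score together with its delta, as the paper does, matters: the same alignment yields a higher accuracy under a looser tolerance and a lower one under a stricter tolerance.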

Keywords: spontaneous speech; forced-alignment; paralinguistic events.





DOI: http://dx.doi.org/10.17851/2237-2083.26.4.1489-1530


Copyright (c) 2018 Brigitte Bigi, Christine Meunier

Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 International License.

e-ISSN 2237-2083


FAPEMIG

Grant #APL-00427-17 (2018-2019)