Exploring content selection strategies for Multilingual Multi-Document Summarization based on the Universal Network Language (UNL)

Matheus Rigobelo Chaud, Ariani Di Felippo

Abstract


Abstract: Multilingual Multi-Document Summarization aims to rank the sentences of a cluster of at least two news texts (one in the user’s language and one in a foreign language) and to select the top-ranked sentences for a summary in the user’s language. We explored three concept-based statistics and one superficial strategy for sentence ranking. We used a bilingual corpus (Brazilian Portuguese-English) encoded in UNL (Universal Network Language), in which source and summary sentences are aligned based on content overlap. Our experiment shows that “concept frequency normalized by the number of concepts in the sentence” is the concept-based measure that best ranks the sentences selected by humans. However, it does not outperform the superficial strategy based on the position of the sentences in the texts. This indicates that the most frequent concepts are not always contained in the first sentences of the source texts, which humans usually select to build the summaries because they convey the main information of the collection.
Keywords: content selection; concept; statistical measure; multilingual corpus; multi-document summarization.
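
As an illustration only, the measure named in the abstract (“concept frequency normalized by the number of concepts in the sentence”) could be sketched as follows. The abstract does not give the exact formula, so this sketch assumes that a sentence's score is the sum of the cluster-wide frequencies of its concepts divided by the number of concepts in the sentence. The function name, the sentence identifiers, and the concept labels are hypothetical; the labels are simplified strings, not actual UNL annotations.

```python
from collections import Counter


def rank_by_normalized_concept_frequency(sentences):
    """Rank sentences by an assumed normalized concept-frequency score.

    `sentences` is a list of (sentence_id, concepts) pairs, where `concepts`
    is the list of concept labels annotated for that sentence.
    Returns (sentence_id, score) pairs, highest score first.
    """
    # Frequency of each concept over the whole cluster (all texts, both languages).
    cluster_freq = Counter(c for _, concepts in sentences for c in concepts)

    scored = []
    for sentence_id, concepts in sentences:
        if concepts:
            # Assumption: sum of cluster frequencies divided by sentence length.
            score = sum(cluster_freq[c] for c in concepts) / len(concepts)
        else:
            score = 0.0  # a sentence without annotated concepts gets no credit
        scored.append((sentence_id, score))

    # sorted() is stable, so tied sentences keep their original order.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy cluster: two sentences per text, one text per language.
cluster = [
    ("pt-1", ["earthquake", "japan", "victim"]),
    ("pt-2", ["government", "aid"]),
    ("en-1", ["earthquake", "japan", "tsunami", "victim"]),
    ("en-2", ["aid", "weather"]),
]

ranking = rank_by_normalized_concept_frequency(cluster)
for sentence_id, score in ranking:
    print(sentence_id, round(score, 2))
```

Because concepts rather than words are counted, sentences from the Portuguese and English texts contribute to the same frequency table, which is what lets a single ranking cover both languages. Dividing by the number of concepts keeps long sentences from scoring higher simply because they contain more concepts.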


Resumo (translated from Portuguese): Multilingual Multi-Document Automatic Summarization aims to rank the sentences of a collection of at least two news texts (one in the user’s language and one in a foreign language) and to select the highest-scoring ones to compose a summary in the user’s language. We explored three concept-based statistics and one superficial strategy for ranking the sentences by relevance. To this end, we used a bilingual corpus (Portuguese-English) annotated in UNL (Universal Network Language), with source texts and summaries aligned at the sentence level. The evaluation indicates that the statistic called “concept frequency normalized by the number of concepts in the sentence” is the one that best reproduces the human ranking. This measure, however, does not outperform the superficial strategy based on sentence position. This indicates that the most frequent concepts in the cluster are not always contained in the first sentences of the source texts, which humans usually select to compose the summaries because they convey the main information of the collection.


 








DOI: http://dx.doi.org/10.17851/2237-2083.26.1.45-71




Copyright (c) 2017 Matheus Rigobelo Chaud, Ariani Di Felippo

Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 International License.

e - ISSN 2237-2083 
