Topic Modeling for Keyword Extraction: using Natural Language Processing methods for keyword extraction in Portal Min@s

Arnaldo Candido Junior; Célia Magalhães; Helena Caseli; Régis Zangirolami

doi:10.17851/2237-2083.23.3.695-726

Abstract: This article evaluates two efficient automatic keyword extraction methods, used by the Corpus Linguistics and Natural Language Processing communities, when applied to literary texts: WordSmith Tools and Latent Dirichlet Allocation (LDA). The two tools have their own specificities and rely on different extraction techniques, which called for an analysis focused on their performance. We therefore seek to understand how each method works and to evaluate it when applied to extracting keywords from literary works. To this end, we relied on a human analyst with knowledge of the domain of the texts. The LDA method was used to extract keywords through its integration with Portal Min@s: Corpora de Fala e Escrita, a general corpus-processing system designed for different lines of research in corpus linguistics. The experimental results confirm the effectiveness of WordSmith Tools and LDA in extracting keywords from a literary corpus. They also show that human analysis of the lists is required at a stage prior to the experiments, to complement the automatically generated list by crossing the WordSmith Tools and LDA results, and that the linguistic intuition of the human analyst about the lists generated separately by the two methods favored the WordSmith Tools keyword list.

Keywords: Keyword Extraction, Natural Language Processing, Corpus Analysis, WordSmith Tools, Latent Dirichlet Allocation, Portal Min@s.

Resumo: This article aims to evaluate the application of two efficient automatic keyword extraction methods, used by the Corpus Linguistics and Natural Language Processing communities to generate keywords from literary texts: WordSmith Tools and Latent Dirichlet Allocation (LDA). The two tools chosen for this work have their own specificities and different extraction techniques, which led us to an analysis oriented toward their performance. We thus aim to understand how each method works and to evaluate its application to literary texts. To that end, we used human analysis, with knowledge of the field of the texts used. The LDA method was used to extract keywords through its integration with Portal Min@s: Corpora de Fala e Escrita, a general corpus-processing system designed for different kinds of Corpus Linguistics research. The results of the experiment confirm the effectiveness of WordSmith Tools and LDA in extracting keywords from a literary corpus, and also indicate that human analysis of the lists is needed at a stage prior to the experiments to complement the automatically generated list, crossing the WordSmith Tools and LDA results. They further indicate that the human analyst's linguistic intuition about the lists generated separately by the two methods used in this study was more favorable to the use of the WordSmith Tools keyword list.

Palavras-chave: keyword extraction; natural language processing; corpus analysis; WordSmith Tools; Latent Dirichlet Allocation; Portal Min@s.
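
To make the idea of reference-corpus keyword extraction more concrete, the sketch below scores words by Dunning's log-likelihood statistic, one of the statistics commonly used to compare a study corpus against a reference corpus. It is a minimal illustration under stated assumptions, not the authors' procedure and not the WordSmith Tools implementation: the function names `log_likelihood` and `keyness`, the whitespace tokenization, and the toy texts are all hypothetical.

```python
import math
from collections import Counter


def log_likelihood(a, b, c, d):
    """Dunning's log-likelihood for a word occurring a times in a study
    corpus of c tokens and b times in a reference corpus of d tokens."""
    e1 = c * (a + b) / (c + d)  # expected count in the study corpus
    e2 = d * (a + b) / (c + d)  # expected count in the reference corpus
    ll = 0.0
    if a > 0:
        ll += a * math.log(a / e1)
    if b > 0:
        ll += b * math.log(b / e2)
    return 2 * ll


def keyness(study_tokens, reference_tokens, top_n=5):
    """Rank words that are relatively more frequent in the study corpus."""
    study, ref = Counter(study_tokens), Counter(reference_tokens)
    c, d = len(study_tokens), len(reference_tokens)
    scores = {}
    for word, a in study.items():
        b = ref.get(word, 0)
        # Keep only positive keywords (higher relative frequency in the study corpus).
        if a / c > b / d:
            scores[word] = log_likelihood(a, b, c, d)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]


# Toy, made-up texts (assumption) used only to exercise the functions.
study_text = "the river the darkness the river the ivory the darkness the river".split()
reference_text = "the house the street the garden the house the river the market".split()

ranked = keyness(study_text, reference_text)
for word, score in ranked:
    print(f"{word}\t{score:.2f}")
```

A word such as "the", which has the same relative frequency in both toy texts, receives no keyness score, whereas words concentrated in the study text rise to the top. A topic-model approach such as LDA works differently, inferring groups of co-occurring words rather than comparing against a reference corpus, which is one reason the two kinds of list can usefully be crossed.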
