Using lexical information automatically extracted from corpora in the computational parsing of Portuguese

Leonel Figueiredo de Alencar

doi:10.17851/2237-2083.19.1.7-85

Abstract

In developing deep parsers for unrestricted text, the main difficulty to overcome is modeling the lexicon. Traditionally, two strategies have been used to handle lexical information in automatic parsing: compiling thousands of lexical entries or formulating hundreds of morphological rules. Because of productive word-formation processes, proper names, and non-standard spellings, the first strategy, which underlies the Brazilian Portuguese (BP) parsers freely downloadable from the Internet, is not robust. The second strategy, in turn, is a non-trivial and very time-consuming knowledge engineering task. At present, BP has no wide-coverage parser licensed as free software. To fill this gap as quickly as possible, we argue in this paper that a far less costly and much more efficient solution to the lexical bottleneck is simply to reuse freely available part-of-speech taggers as the lexical component of deep syntactic processing. Moreover, thanks to the wide and free availability of POS-annotated BP corpora and of efficient machine learning packages, building additional high-accuracy taggers has become a task that demands almost no effort. To make it easy to integrate the output of taggers of different architectures into chart parsers for context-free grammars compiled with the Natural Language Toolkit (NLTK), we developed a Python module named ALEXP. To our knowledge, ALEXP is the first free software specially optimized for processing Portuguese that performs this task. The tool's functionality is described by means of prototype BP grammars applied to the parsing of real-world sentences, with quite promising results.
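The central idea, using tagger output in place of a hand-built lexicon, can be illustrated with a minimal sketch. It is not ALEXP and does not use NLTK: it is a plain-Python CKY recognizer that builds trees, and the grammar, the tag names (`DET`, `N`, `V`) and the function `parse_tagged` are all hypothetical, chosen only for illustration. The tagged sentence is written by hand and stands in for the output of a part-of-speech tagger.

```python
from collections import defaultdict

# Hypothetical toy grammar in Chomsky normal form. Its terminals are POS tags,
# so the grammar contains no lexical entries at all.
BINARY_RULES = {
    ("DET", "N"): ["NP"],
    ("V", "NP"): ["VP"],
    ("NP", "VP"): ["S"],
}


def parse_tagged(tagged, start="S"):
    """CKY chart parsing over (word, tag) pairs supplied by a tagger.

    Returns one parse tree as nested tuples, or None if there is no parse.
    """
    n = len(tagged)
    chart = defaultdict(dict)  # (i, j) -> {category: tree}
    # The tagger's output fills the lexical cells of the chart directly.
    for i, (word, tag) in enumerate(tagged):
        chart[i, i + 1][tag] = (tag, word)
    # Combine adjacent spans bottom-up, from short spans to long ones.
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):
                for left, ltree in chart[i, k].items():
                    for right, rtree in chart[k, j].items():
                        for parent in BINARY_RULES.get((left, right), []):
                            # Keep the first tree found for each category.
                            chart[i, j].setdefault(parent, (parent, ltree, rtree))
    return chart[0, n].get(start)


# Hand-written stand-in for tagger output; "tuitou" is a recent coinage that
# a fixed lexicon would probably lack, but a tag is all the parser needs.
sentence = [("o", "DET"), ("deputado", "N"), ("tuitou", "V"),
            ("a", "DET"), ("notícia", "N")]
print(parse_tagged(sentence))
```

Because the grammar refers only to tags, covering a new word requires no change to the grammar or to any word list; the burden shifts entirely to the tagger's accuracy.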

Keywords

Computational linguistics; Natural language processing; Part-of-speech tagging; Part-of-speech tagger; Automatic parsing; Context-free grammar; Computational processing of Portuguese; Machine learning
