ISCL Hauptseminar (Wintersemester 2014/15, Meurers)
Corpus Annotation: Linguistic Foundations and Computational Linguistic Analysis
Abstract:
Language data collected in electronic corpora can in principle provide important empirical insights for theoretical and computational linguistics. For theoretical linguistics, corpus examples can be used to validate or falsify linguistic generalizations. In computational linguistics, language models and classifiers can be trained on corpus data to learn how to predict or classify previously unseen data on that basis.
Effective querying of corpora for specific phenomena and the development of computational tools for the automatic analysis of language often require reference to annotations. Annotations essentially function as an index to classes of data which cannot easily be identified based on the surface form alone. For example, finding all sentences containing modal verbs using only the surface forms is possible, but would require a long list of all forms of the modal verbs. Even so, sentences where, for example, “can” is not actually a modal verb (as in “Pass me a can of beer” or “I can tuna for a living”) would be wrongly identified. Other search patterns, such as a query for all sentences containing past participle verbs, cannot even be specified in finite form using the surface string alone. The annotation of corpora thus serves an important function in providing abstractions which make it possible to access or generalize over large sets of examples.
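To make the contrast between surface-form and annotation-based queries concrete, here is a minimal Python sketch. The four-sentence corpus, its hand-assigned tags (loosely following Penn Treebank conventions), the modal word list, and all function names are assumptions made up for illustration. They are not part of the course materials or of any particular corpus tool.

```python
# Hypothetical hand-tagged mini-corpus: each sentence is a list of (token, tag) pairs.
# Tags loosely follow Penn Treebank conventions:
# MD = modal, NN = noun, VBP = present-tense verb, VBN = past participle.
CORPUS = [
    [("I", "PRP"), ("can", "MD"), ("swim", "VB"), (".", ".")],
    [("Pass", "VB"), ("me", "PRP"), ("a", "DT"), ("can", "NN"),
     ("of", "IN"), ("beer", "NN"), (".", ".")],
    [("I", "PRP"), ("can", "VBP"), ("tuna", "NN"), ("for", "IN"),
     ("a", "DT"), ("living", "NN"), (".", ".")],
    [("She", "PRP"), ("has", "VBZ"), ("written", "VBN"),
     ("a", "DT"), ("book", "NN"), (".", ".")],
]

# Illustrative (not exhaustive) list of English modal verb forms.
MODAL_FORMS = {"can", "could", "may", "might", "must",
               "shall", "should", "will", "would"}


def sentence_text(sentence):
    """Join the tokens of a tagged sentence into a plain string."""
    return " ".join(token for token, _ in sentence)


def find_by_surface(corpus, forms):
    """Return sentences containing any of the given word forms (case-insensitive)."""
    return [sentence_text(s) for s in corpus
            if any(token.lower() in forms for token, _ in s)]


def find_by_tag(corpus, tag):
    """Return sentences containing at least one token annotated with the given tag."""
    return [sentence_text(s) for s in corpus
            if any(t == tag for _, t in s)]


if __name__ == "__main__":
    # Surface search over-generates: it also matches the noun and main-verb uses of "can".
    print(find_by_surface(CORPUS, MODAL_FORMS))
    # The annotation-based search returns only the true modal use.
    print(find_by_tag(CORPUS, "MD"))
    # Past participles need no word list at all once the annotation exists.
    print(find_by_tag(CORPUS, "VBN"))
```

In this toy setting the surface query returns three sentences, two of them false hits, while the tag query for MD returns only the modal use. The past participle query needs only the tag name and no list of word forms.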
This seminar will provide an overview of the creation and use of linguistically annotated corpora in theoretical and computational linguistics. It will include basic questions such as how to tokenize or sentence segment a corpus as well as conceptual considerations relevant to the creation of annotation schemes, and will then explore different types of corpora (from newspaper to learner corpora) and different types of annotations (morphological, constituency, dependency, semantic and formal pragmatic).
Instructor: Prof. Dr. Detmar Meurers
Course meets: 4 SWS in Seminarraum 1.13, Blochbau (Wilhelmstr. 19)
Credit Points:
Online syllabus: http://purl.org/dm/14/ws/hs
Moodle page: https://moodle02.zdv.uni-tuebingen.de/course/view.php?id=980
If you have not already used this Moodle installation for another course, please log in as soon as possible, create an account using your regular ZDV university login, and then enroll in our course.
Nature of course and my expectations: This is a Hauptseminar intended to provide an overview of the key issues and annotation schemes in this active research area. Each participant is expected to
Note: According to the rules of the Fakultät, missing more than two meetings unexcused automatically results in failing the class.
Academic conduct and misconduct: Research is driven by discussion and free exchange of ideas, motivations, and perspectives. So you are encouraged to work in groups, discuss, and exchange ideas. At the same time, the foundation of the free exchange of ideas is that everyone is open about where they obtained which information. Concretely, this means you are expected to always make explicit when you’ve worked on something as a team – and keep in mind that being part of a team always means sharing the work.
For text you write, you always have to provide explicit references for any ideas or passages you reuse from elsewhere. Note that this includes text “found” on the web, where you should cite the URL of the website if no more official publication is available.
Class etiquette: Please do not read or work on materials for other classes in our seminar. Come to class on time and do not pack up early. When our seminar meets in the computer lab, only use the computers when you are asked to do a specific activity – do not read email or browse the web. All portable electronic devices such as cell phones should be switched off for the entire length of the flight – oops – class. If you must leave early or have to miss class for an important reason, please let me know before class.
Session plan:
Topics we can choose from
We focus on the conceptual issues, in particular questions relating to linguistic modeling. Which properties and insights can be and have been identified and annotated in corpora?
Abeillé, A. (ed.) (2003). Treebanks: Building and using syntactically annotated corpora. Dordrecht: Kluwer.
Abeillé, A., T. Brants & H. Uszkoreit (eds.) (2000). Proceedings of the Second Workshop on Linguistically Interpreted Corpora (LINC-00). Luxembourg. Workshop information at http://www.coli.uni-sb.de/linc2000/.
Abeillé, A., L. Clément & F. Toussenel (2003). Building a Treebank for French. In Abeillé (2003).
Artstein, R. & M. Poesio (2009). Survey Article: Inter-Coder Agreement for Computational Linguistics. Computational Linguistics 34(4), 1–42. URL http://www.mitpressjournals.org/doi/abs/10.1162/coli.07-034-R2.
Atalay, N., K. Oflazer & B. Say (2003). The annotation process in the Turkish treebank. In Proceedings of the 4th International Workshop on Linguistically Interpreted Corpora (LINC).
Atwell, E., G. Demetriou, J. Hughes, A. Schiffrin, C. Souter & S. Wilcock (2000a). A comparative evaluation of modern English corpus grammatical annotation schemes. International Computer Archive of Modern and Medieval English (ICAME). Issue on Computers in English Linguistics 24, 7–23. URL http://www.hit.uib.no/icame/ij24/atwell.pdf.
Atwell, E., G. Demetriou, J. Hughes, A. Schiffrin, C. Souter & S. Wilcock (2000b). Comparing linguistic interpretation schemes for English corpora. In Abeillé et al. (2000). URL http://www.comp.leeds.ac.uk/eric/coling2000linc.ps. Workshop information at http://www.coli.uni-sb.de/linc2000/.
Atwell, E., J. Hughes & C. Souter (1994). AMALGAM: Automatic Mapping Among Lexico-Grammatical Annotation Models. In J. Klavans & P. Resnik (eds.), Proceedings of The Balancing Act - Combining Symbolic and Statistical Approaches to Language, Workshop in conjunction with the 32nd Annual Meeting of the Association for Computational Linguistics. New Mexico State University, Las Cruces, New Mexico, USA. URL http://www.scs.leeds.ac.uk/nlp/papers/atwell+hughes+souter94acl.ps.Z.
Bies, A., M. Ferguson, K. Katz & R. MacIntyre (1995). Bracketing Guidelines for Treebank II Style Penn Treebank Project. University of Pennsylvania. URL ftp://ftp.cis.upenn.edu/pub/treebank/doc/manual/root.ps.gz.
Boyd, A., M. Dickinson & D. Meurers (2008). On Detecting Errors in Dependency Treebanks. Research on Language and Computation 6(2), 113–137. URL http://purl.org/dm/papers/boyd-et-al-08.html.
Brants, T. (1995). Tagset reduction without information loss. In Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics (ACL 95). Cambridge, MA: MIT. URL http://www.coli.uni-sb.de/~thorsten/publications/Brants-ACL95.ps.gz.
Brants, T. & W. Skut (1998). Automation of Treebank Annotation. In Proceedings of New Methods in Language Processing (NeMLaP-98). Sydney. URL http://www.coli.uni-sb.de/~thorsten/publications/Brants-Skut-NeMLaP98.ps.gz.
Brill, E. (2000). Part-of-Speech Tagging. In R. Dale, H. Moisl & H. Somers (eds.), Handbook of Natural Language Processing, New York: Marcel Dekker. URL http://www.netLibrary.com/ebook_info.asp?product_id=47610.
Cheung, J. C. K. & G. Penn (2009). Topological Field Parsing of German. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP. Suntec, Singapore: Association for Computational Linguistics, pp. 64–72. URL http://www.aclweb.org/anthology/P/P09/P09-1008.
Cloeren, J. (1999). Tagsets. In van Halteren (1999), chap. 4, pp. 37–54.
Déjean, H. (2000). How to Evaluate and Compare Tagsets? A Proposal. In Gavrilidou et al. (2000). URL http://lcg-www.uia.ac.be/lcg/ps/dejean.lrec2000.ps.gz.
Dickinson, M. & W. D. Meurers (2003a). Detecting Errors in Part-of-Speech Annotation. In Proceedings of the 10th Conference of the European Chapter of the Association for Computational Linguistics (EACL-03). Budapest, Hungary, pp. 107–114. URL http://purl.org/dm/papers/dickinson-meurers-03.html.
Dickinson, M. & W. D. Meurers (2003b). Detecting Inconsistencies in Treebanks. In Proceedings of the Second Workshop on Treebanks and Linguistic Theories (TLT-03). Växjö, Sweden, pp. 45–56. URL http://purl.org/dm/papers/dickinson-meurers-tlt03.html.
Dickinson, M. & W. D. Meurers (2005a). Detecting Annotation Errors in Spoken Language Corpora. In The Special Session on treebanks for spoken language and discourse at NODALIDA-05. Joensuu, Finland. URL http://purl.org/~dm/papers/dickinson-meurers-nodalida05.html.
Dickinson, M. & W. D. Meurers (2005b). Detecting Errors in Discontinuous Structural Annotation. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL-05). pp. 322–329. URL http://aclweb.org/anthology/P05-1040.
Dienes, P. & C. Oravecz (2000). Bottom-up tagset design from maximally reduced tagset. In Abeillé et al. (2000), pp. 42–47. URL http://www.coli.uni-sb.de/~dienes/dior2000.ps.gz. Workshop information at http://www.coli.uni-sb.de/linc2000/.
Džeroski, S., T. Erjavec & J. Zavrel (2000). Morphosyntactic Tagging of Slovene: Evaluating Taggers and Tagsets. In Gavrilidou et al. (2000), pp. 1099–1104. URL http://nl.ijs.si/et/Bib/LREC00/lrec-tag.ps.
Díaz Negrillo, A., D. Meurers, S. Valera & H. Wunsch (2010). Towards interlanguage POS annotation for effective learner corpora in SLA and FLT. Language Forum 36(1–2), 139–154. URL http://purl.org/dm/papers/diaz-negrillo-et-al-09.html.
Elworthy, D. (1995). Tagset Design and Inflected Languages. In Proceedings of the ACL-SIGDAT Workshop. Dublin. URL http://arXiv.org/abs/cmp-lg/9504002.
Forst, M., N. Bertomeu, B. Crysmann, F. Fouvry, S. Hansen-Schirra & V. Kordoni (2004). Towards a Dependency-Based Gold Standard for German Parsers. The TIGER Dependency Bank. In S. Hansen-Schirra, S. Oepen & H. Uszkoreit (eds.), 5th International Workshop on Linguistically Interpreted Corpora (LINC-04) at COLING. Geneva, Switzerland: COLING, pp. 31–38. URL http://aclweb.org/anthology/W04-1905.
Gaizauskas, R. (1995). Investigations into the grammar underlying the Penn Treebank II. Tech. Rep. Research Memorandum CS-95-25, University of Sheffield. URL citeseer.ist.psu.edu/111349.html.
Garside, R., G. Leech & T. McEnery (eds.) (1997). Corpus annotation: linguistic information from computer text corpora. Harlow, England: Addison Wesley Longman Limited.
Gavrilidou, M., G. Carayannis, S. Markantonatou, S. Piperidis & G. Steinhauer (eds.) (2000). Proceedings of the Second International Conference on Language Resources and Evaluation (LREC-00). Athens.
Grefenstette, G. (1999). Tokenization. In van Halteren (1999), chap. 9, pp. 117–133.
Grefenstette, G. & P. Tapanainen (1994). What is a word, what is a sentence? In Proceedings of the 3rd International Conference on Computational Lexicography (COMPLEX-94). pp. 79–87. URL http://purl.org/dm/lib/Grefenstette.Tapanainen-94.pdf.
Hajič, J., A. Böhmová, E. Hajičová & B. Vidová-Hladká (2003). The Prague Dependency Treebank: A Three-Level Annotation Scenario. In Abeillé (2003), chap. 7, pp. 103–127. URL http://ufal.mff.cuni.cz/pdt2.0/publications/HajicHajicovaAl2000.pdf.
Hajič, J., B. Vidová-Hladká & P. Pajas (2001). The Prague Dependency Treebank: Annotation Structure and Support. In Proceedings of the IRCS Workshop on Linguistic Databases. University of Pennsylvania, Philadelphia, pp. 105–114. URL http://ufal.mff.cuni.cz/pdt2.0/publications/HajicHladkaPajas2001.pdf.
Hajič, J. & B. Hladká (1998). Tagging Inflective Languages: Prediction of Morphological Categories for a Rich, Structured Tagset. In Proceedings of COLING-ACL Conference. Montreal, Canada, pp. 483–490.
Hajič, J., J. Panevová, E. Buráňová, Z. Urešová & A. Bémová (1999). Annotations at Analytical Layer. Instructions for Annotators. Tech. rep., ÚFAL MFF UK, Prague, Czech Republic. URL http://ufal.mff.cuni.cz/pdt2.0/doc/manuals/en/a-layer/pdf/a-man-en.pdf. English translation by Zdeněk Kirschner.
Hajičová, E., J. Panevová & P. Sgall (2000). A Manual for Tectogrammatic Tagging of the Prague Dependency Treebank. Tech. Rep. TR-2000-09, ÚFAL MFF UK, Prague, Czech Republic. In Czech.
Hana, J. & D. Zeman (2005). A Manual for Morphological Annotation, 2nd edition. Tech. Rep. 27, ÚFAL MFF UK, Prague, Czech Republic. URL http://ufal.mff.cuni.cz/pdt2.0/doc/manuals/en/m-layer/pdf/m-man-en.pdf.
King, T. H., R. Crouch, S. Riezler, M. Dalrymple & R. M. Kaplan (2003). The PARC 700 Dependency Bank. In Proceedings of the 4th International Workshop on Linguistically Interpreted Corpora, held at the 10th Conference of the European Chapter of the Association for Computational Linguistics (EACL-03). Budapest. URL http://www2.parc.com/isl/groups/nltt/fsbank/.
Kingsbury, P., M. Palmer & M. Marcus (2002). Adding Semantic Annotation to the Penn TreeBank. In Proceedings of the Human Language Technology Conference. San Diego, California. URL http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.13.5336&rep=rep1&type=pdf.
Kübler, S. & A. Wagner (2000). Evaluating POS Tagging under Sub-optimal Conditions. Or: Does Meticulousness Pay? In Proceedings of International Conference on Artificial and Computational Intelligence for Decision, Control and Automation in Engineering and Industrial Applications (ACIDCA’2000). Monastir, Tunisia. URL http://www.sfs.uni-tuebingen.de/~kuebler/papers/acidca.ps.
Leech, G. (1997). Grammatical Tagging. In Garside et al. (1997), chap. 2, pp. 19–33.
Lu, X. (2006). Hybrid Models for Chinese Unknown Word Resolution. Ph.D. thesis, The Ohio State University.
Lu, X. (2014). Computational Methods for Corpus Annotation and Analysis. Springer.
Marcus, M., G. Kim, M. Marcinkiewicz, R. MacIntyre, A. Bies, M. Ferguson, K. Katz & B. Schasberger (1994). The Penn treebank: Annotating predicate argument structure. URL ftp://ftp.cis.upenn.edu/pub/treebank/doc/arpa94.ps.gz.
Marcus, M., B. Santorini & M. A. Marcinkiewicz (1993). Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics 19(2), 313–330. URL ftp://ftp.cis.upenn.edu/pub/treebank/doc/cl93.ps.gz.
McEnery, T. & A. Wilson (1996). Corpus Linguistics. Edinburgh Textbooks in Empirical Linguistics. Edinburgh, UK: Edinburgh University Press.
Meurers, W. D. (2005). On the use of electronic corpora for theoretical linguistics. Case studies from the syntax of German. Lingua 115(11), 1619–1639. URL http://purl.org/dm/papers/meurers-03.html.
Meurers, W. D. & S. Müller (2009). Corpora and Syntax (Article 42). In A. Lüdeling & M. Kytö (eds.), Corpus linguistics, Berlin: Mouton de Gruyter, vol. 2 of Handbooks of Linguistics and Communication Science, pp. 920–933. URL http://purl.org/dm/papers/meurers-mueller-09.html.
Nivre, J., J. Nilsson & J. Hall (2006). Talbanken05: A Swedish Treebank with Phrase Structure and Dependency Annotation. In Proceedings of the fifth international conference on Language Resources and Evaluation (LREC-06). Genoa, Italy. URL http://stp.lingfil.uu.se/~nivre/docs/talbanken05.pdf.
Oflazer, K., D. Z. Hakkani-Tür & G. Tür (1999). Design for a Turkish Treebank. In Uszkoreit et al. (1999), pp. 28–34.
Oflazer, K., B. Say, D. Z. Hakkani-Tür & G. Tür (2003). Building a Turkish Treebank. In Abeillé (2003).
Palmer, D. D. (2000). Tokenisation and Sentence Segmentation. In R. Dale, H. Moisl & H. Somers (eds.), Handbook of Natural Language Processing, New York: Marcel Dekker, pp. 11–35. URL http://www.netLibrary.com/ebook_info.asp?product_id=47610.
Palmer, M., D. Gildea & P. Kingsbury (2005). The Proposition Bank: An Annotated Corpus of Semantic Roles. Computational Linguistics 31(1), 71–105. URL http://aclweb.org/anthology/J05-1004.
Sampson, G. & A. Babarczy (2003). Limits to annotation precision. In Proceedings of the 4th International Workshop on Linguistically Interpreted Corpora (LINC-03). pp. 61–68. URL http://www.grsampson.net/Alta.html.
Santorini, B. (1990). Part-Of-Speech Tagging Guidelines for the Penn Treebank Project (3rd Revision, 2nd printing). Ms., UPenn.
Schiller, A., S. Teufel & C. Thielen (1995). Guidelines für das Taggen deutscher Textcorpora mit STTS. Tech. rep., IMS-CL, Univ. Stuttgart and SfS, Univ. Tübingen. URL http://www.cogsci.ed.ac.uk/~simone/stts_guide.ps.gz.
Stegmann, R., H. Telljohann & E. W. Hinrichs (2000). Stylebook for the German Treebank in VERBMOBIL. Verbmobil-Report 239, Universität Tübingen, Tübingen, Germany. URL http://verbmobil.dfki.de/cgi-bin/verbmobil/htbin/decode.cgi/share/VM-depot/FTP-SERVER/vm-reports/report-239-00.ps.
Taylor, A., M. Marcus & B. Santorini (2003). The Penn Treebank: An Overview. In Abeillé (2003), chap. 1, pp. 5–22.
Telljohann, H., E. W. Hinrichs, S. Kübler & H. Zinsmeister (2005). Stylebook for the Tübingen Treebank of Written German (TüBa-D/Z). Tech. rep., Seminar für Sprachwissenschaft, Universität Tübingen, Germany.
Teufel, S. (1995). A Support Tool for Tagset Mapping. In Proceedings of the SIGDAT Workshop at EACL 95. Dublin. URL http://www.cogsci.ed.ac.uk/~simone/eacl95.ps.gz.
Teufel, S., H. Schmid, H. Heid & A. Schiller (1996). EAGLES Study of the relation between Tagsets and Taggers. Document eag clwg tags/v, EAGLES. URL ftp://ftp.ilc.pi.cnr.it/pub/eagles/lexicons/tags.ps.gz.
Thielen, C. & A. Schiller (1996). Ein kleines und erweitertes Tagset fürs Deutsche. In H. Feldweg & E. W. Hinrichs (eds.), Lexikon und Text: wiederverwendbare Methoden und Ressourcen zur linguistischen Erschließung des Deutschen, Tübingen: Max Niemeyer Verlag, vol. 73 of Lexicographica: Series maior, pp. 215–226.
Tufiş, D., P. Dienes, C. Oravecz & T. Váradi (2000). Principled Hidden Tagset Design for Tiered Tagging of Hungarian. In Gavrilidou et al. (2000). URL http://www.coli.uni-sb.de/~thorsten/tnt/papers/lrec2000-tufis-ea.pdf.
Uszkoreit, H., T. Brants & B. Krenn (eds.) (1999). Proceedings of the Workshop on Linguistically Interpreted Corpora (LINC-99). Bergen, Norway: Association for Computational Linguistics.
van Halteren, H. (ed.) (1999). Syntactic Wordclass Tagging. Dordrecht: Kluwer Academic Publishers.
Váradi, T. & C. Oravecz (1999). Morpho-syntactic ambiguity and tagset design for Hungarian. In Uszkoreit et al. (1999), pp. 8–12. URL http://www.inf.u-szeged.hu/~alexin/ILP/EACL99-Bergen.ps.gz.
Voutilainen, A. & T. Järvinen (1995). Specifying a shallow grammatical representation for parsing purposes. In Proceedings of the 7th Conference of the EACL. Dublin, Ireland. URL http://www.aclweb.org/anthology-new/E95-1029.
Zinsmeister, H., U. Heid & K. Beck (2013). Das Stuttgart-Tübingen Tagset – Stand und Perspektiven. Journal for Language Technology and Computational Linguistics (JLCL) 28(1). URL http://www.jlcl.org/2013_Heft1/Heft1-2013.pdf.