Linguistic Modeling and its Interfaces
Oberseminar, Detmar Meurers, Winter Semester 2012
The OS features presentations and discussions of current issues in linguistic modeling and its interfaces. This includes linguistic modeling in computational linguistics, language acquisition research, Intelligent Computer-Assisted Language Learning, as well as theoretical linguistic research with a focus on the interfaces of syntax and information structure. It is open to advanced students and anyone interested in this interdisciplinary enterprise.
Abstract: This talk will investigate the retrieval of nouns governing /that/-complement clauses in
English. First, a historical perspective traces the quest for these head nouns, noting some of
the inconsistencies in reference descriptions and the neglect of earlier research. A tribute
is paid to Bridgeman (1965), the first attempt at a comprehensive list. Building on
earlier corpus-based (Bowen 2005) and semi-automatic attempts (Ballier 2007, Ballier
2009, Kanté 2011), we move on to investigate favourable contexts from distributional
and semantic perspectives. The thesaurus method (using synonyms of head nouns)
is then discussed, taking into account some insights from Price et al. (2006). We then
outline methods for training taggers on gold-standard data to retrieve noun complement
clauses from reference corpora. This NLP approach may prove fruitful in compensating
for the tagsets currently used in corpora, which do not encode this syntactic
distinction (the BNC CLAWS6 tagset does not distinguish /that/ as a conjunction
from /that/ as a relative pronoun). Corpora tagged with the CLAWS C8 tagset, which
purportedly makes this distinction, also prove unsatisfactory. The overall analysis
contradicts the typical reference descriptions (e.g., Huddleston & Pullum 2002) of this
class of nouns: morphologically simple nouns appear to be more numerous than nouns
derived from verbs or adjectives.
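As a purely illustrative aside, a rough pattern-based candidate extractor over POS-tagged text
might look as follows. This is a hedged sketch on hypothetical toy data and simplified tags, not
the retrieval method of the talk; it deliberately over-generates because, as noted above, common
tagsets do not separate complementizer /that/ from relative /that/.

```python
# Heuristic sketch: collect nouns immediately followed by "that" plus a
# plausible clause-initial element. Toy, hand-tagged data; simplified tags.
from collections import Counter

# Hypothetical pre-tagged sentences: (token, simplified POS tag) pairs.
tagged = [
    ("the", "DET"), ("fact", "NOUN"), ("that", "THAT"), ("he", "PRON"),
    ("resigned", "VERB"), ("surprised", "VERB"), ("everyone", "PRON"), (".", "PUNCT"),
    ("she", "PRON"), ("rejected", "VERB"), ("the", "DET"), ("claim", "NOUN"),
    ("that", "THAT"), ("taxes", "NOUN"), ("would", "AUX"), ("rise", "VERB"), (".", "PUNCT"),
    ("the", "DET"), ("book", "NOUN"), ("that", "THAT"), ("he", "PRON"),
    ("wrote", "VERB"), ("was", "VERB"), ("long", "ADJ"), (".", "PUNCT"),
]

candidates = Counter()
for (w1, t1), (w2, t2), (w3, t3) in zip(tagged, tagged[1:], tagged[2:]):
    # noun + "that" + a plausible clause subject (pronoun, noun, or determiner)
    if t1 == "NOUN" and w2 == "that" and t3 in {"PRON", "NOUN", "DET"}:
        candidates[w1] += 1

# True complement-taking nouns (fact, claim) are found, but so is the relative
# clause head "book" -- exactly the ambiguity discussed in the abstract.
print(candidates.most_common())
```

In practice, such candidate lists would need the distributional, semantic, or thesaurus-based
filtering discussed above.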
Abstract: Grammatical rules have exceptions, but within the exceptions there are in turn stable islands. If gradient judgements or reaction times are taken into account, a rich continuum between rule and exception emerges. One example: in German plural formation, the only strictly regular case (the s-plural, 3% of all nouns) is in the minority. Several other morphemes divide up the vocabulary among themselves in only partially predictable ways. What counts as rule here, and what as exception?
Data-oriented linguistic models can capture these continua, but at the price that they do
not yield a small set of non-redundant grammatical rules. Central to these models is the
notion of similarity: unknown cases are treated like known similar ones. Kernel functions
(kernels) provide a very flexible, mathematically well-understood formalization of
similarity that can be applied to data of all kinds, including linguistic data in
particular. Using machine learning and analysis techniques such as principal component
analysis, one can then build models of language. In my talk, I report on my dissertation
on linguistic applications of kernel functions. Using plural formation as an example, I
show that kernel methods can not only classify but also predict concrete forms, and yet
that such a model cannot be interpreted as “the rules of German plural formation”.
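To make the setup concrete, the following is a minimal sketch (toy lexicon, assumed plural-class
labels, off-the-shelf scikit-learn components; not the dissertation's actual model) of how a
simple string kernel over character n-grams can drive a plural-class classifier:

```python
# Illustrative sketch: classify German nouns into plural classes using
# character-n-gram features, whose dot product amounts to a basic string
# (spectrum) kernel. Toy data only.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline

# Hypothetical toy lexicon: noun -> plural class (suffix label).
train = [("Hund", "-e"), ("Tisch", "-e"), ("Auto", "-s"), ("Kino", "-s"),
         ("Frau", "-en"), ("Zeitung", "-en"), ("Kind", "-er"), ("Bild", "-er")]
words, classes = zip(*train)

model = make_pipeline(
    # character n-grams of the word form serve as the feature space
    CountVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
    SVC(kernel="linear"),  # linear kernel over n-gram counts
)
model.fit(list(words), list(classes))

# Predict the plural class of an unseen noun; similarity to known nouns decides.
print(model.predict(["Hand"]))
```

A linear kernel over character-n-gram counts is one of the simplest string kernels; the point
made in the abstract, that such a model classifies and predicts well without yielding
interpretable "rules", applies to far richer kernels as well.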
Abstract: In computational semantics, there is growing interest in integrating formal and
distributional semantics to combine their complementary strengths. While formal semantics
can be used to precisely represent meaning, distributional methods have been applied
successfully to non-compositional types of semantic similarity. This talk will present work on
CoSeC, a system for evaluating answers to reading comprehension questions. Unlike
most other content assessment systems, it is based on comparing underspecified formal
semantic representations. I will present an extension of the CoSeC approach that is able to
incorporate information about semantic similarity of words or phrases obtained using
distributional methods. I will then present experiments using PMI-IR (Turney, 2001) and
vector-space-based models of semantic similarity (Landauer et al., 1998). I will also present
an approach to automatically induce variable bindings for synonymous multi-word
expressions.
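For illustration, here is a minimal, self-contained sketch (toy sentences, an assumed window
size; not the CoSeC implementation) of the kind of distributional signal such an extension can
draw on: co-occurrence vectors compared by cosine similarity.

```python
# Build simple co-occurrence vectors from a toy corpus and compare words
# by cosine similarity. Purely illustrative.
from collections import Counter
import math

corpus = [
    "the student answered the question correctly",
    "the pupil answered the query correctly",
    "a dog chased a cat across the yard",
]

def context_vector(target, sentences, window=2):
    """Count words co-occurring with `target` within a +/- `window` span."""
    counts = Counter()
    for sent in sentences:
        toks = sent.split()
        for i, tok in enumerate(toks):
            if tok == target:
                for j in range(max(0, i - window), min(len(toks), i + window + 1)):
                    if j != i:
                        counts[toks[j]] += 1
    return counts

def cosine(u, v):
    dot = sum(u[w] * v.get(w, 0) for w in u)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

# "student" and "pupil" occur in near-identical contexts here, unlike "dog".
print(cosine(context_vector("student", corpus), context_vector("pupil", corpus)))
print(cosine(context_vector("student", corpus), context_vector("dog", corpus)))
```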
Stefanie Wolf, Sarah Schulz, et al.
Detecting (non-compositional) multi-word units
Abstract: The talk addresses the question of how to find (non-compositional) multi-word units and
measure their semantic similarity using distributional semantics. Sarah Schulz will present the
topic of her Master's thesis, in which she will extract non-compositional multi-word units
in English and their synonyms. Within the CoMiC-DE system, we want to compare
student answers to target answers with respect to meaning. Examples from the Corpus of
Reading Comprehension Exercises in German (CREG) will illustrate the problem of
multi-word units. The current state of the system and ideas for its improvement will be
presented.
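One common way to operationalize non-compositionality distributionally, sketched below with
hypothetical pre-computed vectors (an assumed illustration, not necessarily the thesis approach),
is to compare the context vector of the multi-word unit as a whole with a vector composed from
its parts:

```python
# Flag a multi-word unit as a non-compositionality candidate when its own
# context vector differs strongly from the sum of its parts' vectors.
import math

# Hypothetical pre-computed context vectors (counts over context words).
vectors = {
    "kick_the_bucket": {"died": 8, "suddenly": 5, "funeral": 3},
    "kick":            {"ball": 9, "foot": 6, "hard": 4},
    "bucket":          {"water": 9, "plastic": 5, "full": 3},
}

def add(u, v):
    out = dict(u)
    for w, c in v.items():
        out[w] = out.get(w, 0) + c
    return out

def cosine(u, v):
    dot = sum(u[w] * v.get(w, 0) for w in u)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

composed = add(vectors["kick"], vectors["bucket"])
sim = cosine(vectors["kick_the_bucket"], composed)
# A low similarity score flags the unit as a candidate non-compositional MWU.
print(f"compositionality score: {sim:.2f}")
```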
Abstract: The talk addresses the question of the interface between prosody, syntax, and discourse
through the analysis of a few non-canonical syntactic structures taken from corpora of spoken
English. Clefts, extrapositions, right noun-phrase dislocations, and the insertion of auxiliary do in
an assertive context are examined closely. The prosody of these structures is compared to marked
prosodic forms of emphatic utterances with neutral syntax. The prosodic analysis is mainly based
on the number of tone units, the place of the nuclear syllable, and the pitch movement. The context
and the information structure are also taken into account for the utterances analysed
here. We show that a syntactically non-canonical utterance can be pronounced with
neutral prosody or, on the contrary, with marked prosody. The analysis of
sentences in context allows us to argue that prosody and syntax are complementary but
play a role at different levels, and that prosody has pragmatic functions in discourse:
marking the information structure, thematizing, focalising, expressing contrast and
emphasis.
Julia Hancke, Sowmya Vajjala, Detmar Meurers
Readability Classification for German using lexical, syntactic, and
morphological features
Abstract: We investigate the problem of reading level assessment for German texts on a newly
compiled corpus of freely available easy and difficult articles, targeted at adult and child readers,
respectively. We adapt a wide range of syntactic, lexical, and language model features from previous
research on English and combine them with new features that make use of the rich morphology of
German. We show that readability classification for German based on these features is highly
successful, reaching 89.7% accuracy, with the new morphological features making an important
contribution.
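As a toy illustration of the general pipeline (simplified stand-in features and made-up example
texts; not the paper's actual feature set or corpus), one can extract a handful of lexical and
surface morphological cues and feed them to an off-the-shelf classifier:

```python
# Extract a few simple readability cues and train a binary classifier.
from sklearn.linear_model import LogisticRegression

def features(text):
    tokens = text.split()
    sents = max(text.count("."), 1)
    avg_word_len = sum(len(t) for t in tokens) / len(tokens)        # lexical
    avg_sent_len = len(tokens) / sents                              # syntactic proxy
    type_token = len(set(t.lower() for t in tokens)) / len(tokens)  # lexical variety
    long_words = sum(len(t) > 12 for t in tokens) / len(tokens)     # rough proxy for compounds
    return [avg_word_len, avg_sent_len, type_token, long_words]

# Hypothetical toy corpus: (text, label) with 0 = easy, 1 = difficult.
docs = [
    ("Der Hund bellt. Die Katze schläft. Das Kind lacht.", 0),
    ("Die Bundesregierung verabschiedete ein Gesetzgebungsverfahren zur Energiewende.", 1),
    ("Wir gehen heute in den Park. Es ist schön.", 0),
    ("Die verfassungsrechtliche Zulässigkeit der Vorratsdatenspeicherung bleibt umstritten.", 1),
]
X = [features(t) for t, _ in docs]
y = [label for _, label in docs]

clf = LogisticRegression(max_iter=1000).fit(X, y)
print(clf.predict([features("Die Sonne scheint. Wir spielen Ball.")]))
```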
Serhiy Bykh and Detmar Meurers
Native Language Identification Using Recurring N-grams – Investigating
Abstraction and Domain Dependence
Abstract: Native Language Identification tackles the problem of determining the native
language of an author based on a text the author has written in a second language. In this
paper, we discuss the systematic use of recurring n-grams of any length as features for
training a native language classifier. Starting with surface n-grams, we investigate two
degrees of abstraction incorporating parts-of-speech. The approach outperforms previous
work employing a comparable data setup, reaching 89.71% accuracy for a task with
seven native languages using data from the International Corpus of Learner English
(ICLE). We then investigate the claim by Brooke and Hirst (2011) that a content bias in
ICLE results in classification by topic rather than by native language
characteristics. We show that training our model on ICLE and testing it on three other,
independently compiled learner corpora dealing with other topics still results in high-accuracy
classification.
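The core feature idea can be sketched on an invented miniature learner corpus (hypothetical
snippets and labels; not the ICLE data or the full feature abstraction described above): keep
only n-grams that recur across training texts and use their presence as binary features.

```python
# Recurring-n-gram sketch: only n-grams occurring in at least two training
# texts are kept, as binary features for a native-language classifier.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

# Hypothetical miniature learner corpus: (essay snippet, native language).
train_texts = [
    ("I am agree with this opinion because it is very important .", "es"),
    ("I am agree that the university must be free for the students .", "es"),
    ("On the one hand the people does not want to pay more taxes .", "es"),
    ("Please borrow me your book , I will give back it tomorrow .", "de"),
    ("We will make a party on the weekend , you must come also .", "de"),
    ("I become always nervous before the exam , it is normal .", "de"),
]
texts, labels = zip(*train_texts)

model = make_pipeline(
    # min_df=2 keeps only n-grams recurring in at least two texts; binary
    # presence rather than counts, over unigrams up to trigrams.
    CountVectorizer(ngram_range=(1, 3), min_df=2, binary=True),
    LinearSVC(),
)
model.fit(list(texts), list(labels))
print(model.predict(["I am agree that the taxes are too high ."]))
```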
Abstract: In this talk, I present two approaches to analysing discourses in Spanish:
multidimensional analysis (MDA) and supervised classification (SC) of specialized texts.
Concerning MDA, I present two studies based on the written academic PUCV-2006 Corpus of
Spanish. Both studies employ the five dimensions (i.e. Contextual and Interactive Focus, Narrative
Focus, Commitment Focus, Modalizing Focus, and Informational Focus) identified by Parodi
(2005). The main assumption is that the dimensions determined by a previous multidimensional
analysis can be used to characterize a new corpus of university genres. In the first study, I calculate
linguistic density across the five dimensions to describe the nine academic genres of the corpus. In
the second one, I compare the PUCV-2006 Corpus with four corpora from different registers.
The findings confirm the specialized nature of the genres in the PUCV-2006 Corpus,
where both a strong lexico-grammatical compactness of meanings and a modalization of
certainty are expressed in the texts. Concerning SC, I will present three classification
experiments based on specialized texts. In the first one, I compare naïve Bayes and SVM
methods, based on shared lexical-semantic content words, to classify the disciplines of 160
academic texts. In the second one, the informational density scores obtained in a previous
multidimensional analysis are used to classify the four disciplines corresponding to 353 theses of
the TFGPUCV-2010 corpus, using discriminant analysis and naïve Bayes. In the
last classification experiment, naïve Bayes is used to classify disciplines and genres,
based on part-of-speech trigrams calculated from a sample of theses and other academic
genres. According to my findings, it is possible to argue that the lexico-grammatical level
allows the texts to be classified by discipline and genre with high accuracy.
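The general setup of the last experiment can be sketched as follows (invented tag sequences and
discipline labels; not the actual corpus or tagset): a naive Bayes classifier over part-of-speech
trigrams, with each document represented by its tag sequence.

```python
# Naive Bayes over POS trigrams for discipline classification. Toy data only.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical POS-tagged documents (tag sequences) with discipline labels.
pos_docs = [
    ("DET NOUN ADP DET NOUN VERB ADP NOUN ADJ", "law"),
    ("DET NOUN ADJ VERB ADP DET NOUN ADP NOUN", "law"),
    ("NOUN VERB NUM NOUN ADP NOUN ADJ PUNCT NUM", "engineering"),
    ("NUM NOUN VERB ADP NUM NOUN ADJ PUNCT NOUN", "engineering"),
]
tags, disciplines = zip(*pos_docs)

model = make_pipeline(
    # treat each whitespace-separated tag as a token; use trigrams of tags
    CountVectorizer(ngram_range=(3, 3), token_pattern=r"\S+"),
    MultinomialNB(),
)
model.fit(list(tags), list(disciplines))
print(model.predict(["DET NOUN ADJ VERB ADP DET NOUN ADJ PUNCT"]))
```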
Abstract: Frederking et al. developed a competence model for literary-aesthetic judging within the
framework of the DFG Priority Programme 1293, “Competence Models for Assessing Individual
Learning Outcomes and Evaluating Educational Processes”. Literary-aesthetic judging is modelled
as a theoretically and empirically grounded three-dimensional construct that can be differentiated
from general reading ability. In his presentation, Prof. Dr. Frederking will speak about the
underlying competence model as well as the development of the test instrument for measuring
literary-aesthetic judging competency, with which the competence model was validated.
Abstract: The CARE databases are currently under construction. They bring together
information on church buildings with construction phases before the year 1000 in various
European countries. A project launch for Germany, Austria, and Switzerland
(DACH, http://care-dach.net) is in concrete planning. As a first step, the construction
phases for each church are entered there from a catalogue from the 1960s and 1980s. The
information is formally structured but varies in length and scope. To make the contents
accessible to the public, a prototype app has been developed
(App Store: Frühchristliches Köln), which uses the example of Cologne and its 10
buildings to show what the end result could look like. The plan is to automatically
generate the contents of this app from the information held in the database in the
future.
Abstract: Textual Entailment captures a common-sense notion of entailment between two natural language texts, P (premise) and H (hypothesis). The relevance of Textual Entailment lies in its promise to provide a generic notion of semantic inference that a wide range of natural language processing applications can build on.
The first half of this talk will introduce the notion of Textual Entailment and provide an overview
of recent work on the topic, including a typology of the major algorithmic approaches, relevant
linguistic phenomena, and applications. Unfortunately, it has turned out that the agnosticism of
Textual Entailment with regard to processing has led to a fragmentation of research. The second
half will cover ongoing work on the development of a generalized model of Textual
Entailment that subsumes the various proposed algorithms and the implementation of
this model in the form of a multilingual, reusable, open-sourced platform for semantic
processing.
[Two recent manuscripts are available from https://moodle02.zdv.uni-tuebingen.de/course/view.php?id=380
(access restricted to logins from the University of Tübingen)]
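By way of illustration only (a deliberately naive baseline with made-up examples, not the
generalized model or platform discussed in the talk), a minimal Textual Entailment decision can
be approximated by lexical overlap between premise and hypothesis:

```python
# Lexical-overlap baseline: judge entailment by how much of the hypothesis'
# content vocabulary is covered by the premise. Illustrative only.
STOPWORDS = {"the", "a", "an", "is", "was", "were", "of", "in", "on", "to", "and"}

def content_words(text):
    return {w for w in text.lower().split() if w not in STOPWORDS}

def entails(premise, hypothesis, threshold=0.8):
    """Return True if most content words of H are covered by P."""
    h = content_words(hypothesis)
    if not h:
        return True
    overlap = len(h & content_words(premise)) / len(h)
    return overlap >= threshold

print(entails("A dog is sleeping on the porch", "A dog is sleeping"))  # True
print(entails("A dog is sleeping on the porch", "A cat is sleeping"))  # False
```

Real systems replace this overlap heuristic with the richer algorithmic approaches surveyed in
the first half of the talk.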
Abstract: In the context of the Kobalt-DaF network, whose members investigate different
aspects of learner texts, we took a look at topological fields in essays of Chinese and
Belarusian learners of German. The texts were parsed according to the TüBa-D/Z
annotation scheme and then manually corrected using the tool Synpathy. The talk will
provide some insight into the results of the automatic parsing process and the problems
that arise there. Furthermore, I will walk through the application of the tools in
use.
Abstract: In my MA thesis, I explore POS analysis for learner language. Tagsets for native
language are often insufficient for describing the linguistic phenomena occurring in learner
language. In the sentence “He was choiced for the job”, the word “choiced” cannot be accurately
tagged: if it is only analyzed as a finite verb, the information on how the word was
formed (out of a noun/adjective stem) is lost, which would be of interest for both SLA
and SLT research. Forming new (error) categories is often also not desirable when
learner language needs to be compared with native language. Díaz-Negrillo et al. (2009)
suggest in their publication “Towards interlanguage POS annotation for effective learner
corpora in SLA and FLT” to split POS analysis into three dimensions to avoid this
conflict. The words are analyzed with a native-language tagset from a distributional,
morphological, and lexical perspective. Mismatches on these levels are expected to expose
errors or misuse of the language. In my talk, I will discuss these issues and present an
implementation of the tripartite POS tagging for German. I will show what other
theoretical and practical issues were revealed during the implementation and testing
process.
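The core idea of the tripartite annotation can be sketched as follows (an illustrative data
structure using the abstract's own example; the tag names and the mismatch check are assumptions,
not the thesis implementation):

```python
# Each learner token receives three POS values: distributional, morphological,
# and lexical. Disagreement between them flags potentially non-native forms.
from dataclasses import dataclass

@dataclass
class TripartiteTag:
    token: str
    distributional: str  # POS suggested by the syntactic context
    morphological: str   # POS suggested by inflectional marking
    lexical: str         # POS of the stem as listed in the lexicon

    def mismatch(self):
        return len({self.distributional, self.morphological, self.lexical}) > 1

# "He was choiced for the job": context and the -ed marking point to a verb,
# but the stem "choice" is a noun in the lexicon.
example = TripartiteTag("choiced", distributional="VERB",
                        morphological="VERB", lexical="NOUN")
print(example.mismatch())  # True: a candidate learner innovation
```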
Abstract: We have known for a long time that discourse connectors (like “therefore”, “however”, “but”) facilitate human sentence processing when used appropriately. However, we know much less about the time course of processing such connectors. In particular, we are interested in whether discourse connectors are processed quickly enough to affect expectations about upcoming discourse content. In this talk, I will present recent experiments on the processing of causals vs. concessives, which indicate that connectors are integrated incrementally into the discourse representation, and that concessives, similar to negation, give rise to a search for alternatives. However, we also found evidence that concessives take longer to process than causals.
I will then go on to talk about expectations which people may have about upcoming discourse relations /before/ encountering a connective, and how these expectations affect the explicit vs. implicit realization of discourse cues. Both studies can shed some light on the causes of processing difficulty at the discourse level.
In a final part of my talk, I will give an overview of our recent efforts in evaluating models of
linguistic processing difficulty in real-world scenarios, where we use a dual-task setting with a
simultaneous language comprehension task and a well-controlled, continuous
simulated driving task. Cognitive load in this setting is measured in terms of a
novel form of pupillometry, in addition to task-related measures such as steering
accuracy.
_________________________________________________________________________________
Last updated: February 5, 2013