Linguistic Modeling and its Interfaces
Oberseminar, Detmar Meurers, Winter Semester 2012
The OS features presentations and discussions of current issues in linguistic modeling and its interfaces. This includes linguistic modeling in computational linguistics, language acquisition research, Intelligent Computer-Assisted Language Learning, as well as theoretical linguistic research with a focus on the interfaces of syntax and information structure. It is open to advanced students and anyone interested in this interdisciplinary enterprise.
Abstract: This talk will investigate the retrieval of nouns governing /that/-complement clauses in
English. First, a historical perspective traces the quest for these head nouns, noting some of
the inconsistencies in reference descriptions and the neglect of earlier research. A tribute
is paid to Bridgeman (1965), the first attempt at a comprehensive list. Building on
earlier corpus-based (Bowen 2005) and semi-automatic attempts (Ballier 2007, Ballier
2009, Kanté 2011), we move on to investigate favourable contexts from distributional
and semantic perspectives. The thesaurus method (using synonyms of head nouns)
is then discussed, taking into account some insights from Price et al. (2006). We then
outline methods for training taggers on gold-standard data to retrieve noun complement
clauses from reference corpora. This NLP approach may prove fruitful in compensating
for the tagsets currently used in corpora, which do not encode this syntactic
distinction (the BNC CLAWS6 tagset does not distinguish /that/ as a conjunction
from /that/ as a relative pronoun). Corpora tagged with the CLAWS C8 tagset, which
purportedly makes this distinction, also prove unsatisfactory. The overall analysis
contradicts the typical reference descriptions (e.g., Huddleston & Pullum 2002) of this
class of nouns: morphologically simple nouns appear to be more numerous than nouns
derived from verbs or adjectives.
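As a purely illustrative aside, a rough pattern-based candidate extractor over POS-tagged text
might look as follows. This is a hedged sketch on hypothetical toy data and simplified tags, not
the retrieval method of the talk; it deliberately over-generates because, as noted above, common
tagsets do not separate complementizer /that/ from relative /that/.

```python
# Heuristic sketch: collect nouns immediately followed by "that" plus a
# plausible clause-initial element. Toy, hand-tagged data; simplified tags.
from collections import Counter

# Hypothetical pre-tagged sentences: (token, simplified POS tag) pairs.
tagged = [
    ("the", "DET"), ("fact", "NOUN"), ("that", "THAT"), ("he", "PRON"),
    ("resigned", "VERB"), ("surprised", "VERB"), ("everyone", "PRON"), (".", "PUNCT"),
    ("she", "PRON"), ("rejected", "VERB"), ("the", "DET"), ("claim", "NOUN"),
    ("that", "THAT"), ("taxes", "NOUN"), ("would", "AUX"), ("rise", "VERB"), (".", "PUNCT"),
    ("the", "DET"), ("book", "NOUN"), ("that", "THAT"), ("he", "PRON"),
    ("wrote", "VERB"), ("was", "VERB"), ("long", "ADJ"), (".", "PUNCT"),
]

candidates = Counter()
for (w1, t1), (w2, t2), (w3, t3) in zip(tagged, tagged[1:], tagged[2:]):
    # noun + "that" + a plausible clause subject (pronoun, noun, or determiner)
    if t1 == "NOUN" and w2 == "that" and t3 in {"PRON", "NOUN", "DET"}:
        candidates[w1] += 1

# True complement-taking nouns (fact, claim) are found, but so is the relative
# clause head "book" -- exactly the ambiguity discussed in the abstract.
print(candidates.most_common())
```

In practice, such candidate lists would need the distributional, semantic, or thesaurus-based
filtering discussed above.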
Abstract: Grammatical rules have exceptions, but within the exceptions there are in turn stable islands. If gradient judgements or reaction times are taken into account, a rich continuum between rule and exception emerges. One example: in German plural formation, the only strictly regular case (the s-plural, 3% of all nouns) is in the minority. Several other morphemes divide up the vocabulary among themselves in only partially predictable ways. What counts as rule here, and what as exception?
Data-oriented linguistic models can capture these continua, but at the price that they do
not yield a small set of non-redundant grammatical rules. Central to these models is the
notion of similarity: unknown cases are treated like known similar ones. Kernel functions
(kernels) provide a very flexible, mathematically well-understood formalization of
similarity that can be applied to data of all kinds, including linguistic data in
particular. Using machine learning and analysis techniques such as principal component
analysis, one can then build models of language. In my talk, I report on my dissertation
on linguistic applications of kernel functions. Using plural formation as an example, I
show that kernel methods can not only classify but also predict concrete forms, and yet
that such a model cannot be interpreted as “the rules of German plural formation”.
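To make the setup concrete, the following is a minimal sketch (toy lexicon, assumed plural-class
labels, off-the-shelf scikit-learn components; not the dissertation's actual model) of how a
simple string kernel over character n-grams can drive a plural-class classifier:

```python
# Illustrative sketch: classify German nouns into plural classes using
# character-n-gram features, whose dot product amounts to a basic string
# (spectrum) kernel. Toy data only.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline

# Hypothetical toy lexicon: noun -> plural class (suffix label).
train = [("Hund", "-e"), ("Tisch", "-e"), ("Auto", "-s"), ("Kino", "-s"),
         ("Frau", "-en"), ("Zeitung", "-en"), ("Kind", "-er"), ("Bild", "-er")]
words, classes = zip(*train)

model = make_pipeline(
    # character n-grams of the word form serve as the feature space
    CountVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
    SVC(kernel="linear"),  # linear kernel over n-gram counts
)
model.fit(list(words), list(classes))

# Predict the plural class of an unseen noun; similarity to known nouns decides.
print(model.predict(["Hand"]))
```

A linear kernel over character-n-gram counts is one of the simplest string kernels; the point
made in the abstract, that such a model classifies and predicts well without yielding
interpretable "rules", applies to far richer kernels as well.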
Abstract: In computational semantics, there is growing interest in integrating formal and
distributional semantics to combine their complementary strengths. While formal semantics
can be used to precisely represent meaning, distributional methods have been applied
successfully to non-compositional types of semantic similarity. This talk will present work on
CoSeC, a system for evaluating answers to reading comprehension questions. Unlike
most other content assessment systems, it is based on comparing underspecified formal
semantic representations. I will present an extension of the CoSeC approach that is able to
incorporate information about semantic similarity of words or phrases obtained using
distributional methods. I will then present experiments using PMI-IR (Turney, 2001) and
vector-space-based models of semantic similarity (Landauer et al., 1998). I will also present
an approach to automatically induce variable bindings for synonymous multi-word
expressions.
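For illustration, here is a minimal, self-contained sketch (toy sentences, an assumed window
size; not the CoSeC implementation) of the kind of distributional signal such an extension can
draw on: co-occurrence vectors compared by cosine similarity.

```python
# Build simple co-occurrence vectors from a toy corpus and compare words
# by cosine similarity. Purely illustrative.
from collections import Counter
import math

corpus = [
    "the student answered the question correctly",
    "the pupil answered the query correctly",
    "a dog chased a cat across the yard",
]

def context_vector(target, sentences, window=2):
    """Count words co-occurring with `target` within a +/- `window` span."""
    counts = Counter()
    for sent in sentences:
        toks = sent.split()
        for i, tok in enumerate(toks):
            if tok == target:
                for j in range(max(0, i - window), min(len(toks), i + window + 1)):
                    if j != i:
                        counts[toks[j]] += 1
    return counts

def cosine(u, v):
    dot = sum(u[w] * v.get(w, 0) for w in u)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

# "student" and "pupil" occur in near-identical contexts here, unlike "dog".
print(cosine(context_vector("student", corpus), context_vector("pupil", corpus)))
print(cosine(context_vector("student", corpus), context_vector("dog", corpus)))
```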
Stefanie Wolf, Sarah Schulz, et al.
Detecting (non-compositional) multi-word units
Abstract: The talk addresses the question of how to find (non-compositional) multi-word units and
measure their semantic similarity using distributional semantics. Sarah Schulz will present the
topic of her Master's thesis, in which she will extract non-compositional multi-word units
in English and their synonyms. Within the CoMiC-DE system, we want to compare
student answers to target answers with respect to meaning. Examples from the Corpus of
Reading Comprehension Exercises in German (CREG) will illustrate the problem of
multi-word units. The current state of the system and ideas for its improvement will be
presented.
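One common way to operationalize non-compositionality distributionally, sketched below with
hypothetical pre-computed vectors (an assumed illustration, not necessarily the thesis approach),
is to compare the context vector of the multi-word unit as a whole with a vector composed from
its parts:

```python
# Flag a multi-word unit as a non-compositionality candidate when its own
# context vector differs strongly from the sum of its parts' vectors.
import math

# Hypothetical pre-computed context vectors (counts over context words).
vectors = {
    "kick_the_bucket": {"died": 8, "suddenly": 5, "funeral": 3},
    "kick":            {"ball": 9, "foot": 6, "hard": 4},
    "bucket":          {"water": 9, "plastic": 5, "full": 3},
}

def add(u, v):
    out = dict(u)
    for w, c in v.items():
        out[w] = out.get(w, 0) + c
    return out

def cosine(u, v):
    dot = sum(u[w] * v.get(w, 0) for w in u)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

composed = add(vectors["kick"], vectors["bucket"])
sim = cosine(vectors["kick_the_bucket"], composed)
# A low similarity score flags the unit as a candidate non-compositional MWU.
print(f"compositionality score: {sim:.2f}")
```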
Abstract: The talk addresses the question of the interface between prosody, syntax, and discourse
through the analysis of a few non-canonical syntactic structures taken from corpora of spoken
English. Clefts, extrapositions, right noun-phrase dislocations, and the insertion of auxiliary do in
an assertive context are examined closely. The prosody of these structures is compared to marked
prosodic forms of emphatic utterances with neutral syntax. The prosodic analysis is mainly based
on the number of tone units, the place of the nuclear syllable, and the pitch movement. The context
and the information structure are also taken into account for the utterances analysed
here. We show that a syntactically non-canonical utterance can be pronounced with
neutral prosody or, on the contrary, with marked prosody. The analysis of
sentences in context allows us to argue that prosody and syntax are complementary but
play a role at different levels, and that prosody has pragmatic functions in discourse:
marking the information structure, thematizing, focalising, expressing contrast and
emphasis.
Julia Hancke, Sowmya Vajjala, Detmar Meurers
Readability Classification for German using lexical, syntactic, and
morphological features
Abstract: We investigate the problem of reading level assessment for German texts on a newly
compiled corpus of freely available easy and difficult articles, targeted at adult and child readers,
respectively. We adapt a wide range of syntactic, lexical, and language model features from previous
research on English and combine them with new features that make use of the rich morphology of
German. We show that readability classification for German based on these features is highly
successful, reaching 89.7% accuracy, with the new morphological features making an important
contribution.
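As a toy illustration of the general pipeline (simplified stand-in features and made-up example
texts; not the paper's actual feature set or corpus), one can extract a handful of lexical and
surface morphological cues and feed them to an off-the-shelf classifier:

```python
# Extract a few simple readability cues and train a binary classifier.
from sklearn.linear_model import LogisticRegression

def features(text):
    tokens = text.split()
    sents = max(text.count("."), 1)
    avg_word_len = sum(len(t) for t in tokens) / len(tokens)        # lexical
    avg_sent_len = len(tokens) / sents                              # syntactic proxy
    type_token = len(set(t.lower() for t in tokens)) / len(tokens)  # lexical variety
    long_words = sum(len(t) > 12 for t in tokens) / len(tokens)     # rough proxy for compounds
    return [avg_word_len, avg_sent_len, type_token, long_words]

# Hypothetical toy corpus: (text, label) with 0 = easy, 1 = difficult.
docs = [
    ("Der Hund bellt. Die Katze schläft. Das Kind lacht.", 0),
    ("Die Bundesregierung verabschiedete ein Gesetzgebungsverfahren zur Energiewende.", 1),
    ("Wir gehen heute in den Park. Es ist schön.", 0),
    ("Die verfassungsrechtliche Zulässigkeit der Vorratsdatenspeicherung bleibt umstritten.", 1),
]
X = [features(t) for t, _ in docs]
y = [label for _, label in docs]

clf = LogisticRegression(max_iter=1000).fit(X, y)
print(clf.predict([features("Die Sonne scheint. Wir spielen Ball.")]))
```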
Serhiy Bykh and Detmar Meurers
Native Language Identification Using Recurring N-grams – Investigating
Abstraction and Domain Dependence
Abstract: Native Language Identification tackles the problem of determining the native
language of an author based on a text the author has written in a second language. In this
paper, we discuss the systematic use of recurring n-grams of any length as features for
training a native language classifier. Starting with surface n-grams, we investigate two
degrees of abstraction incorporating parts-of-speech. The approach outperforms previous
work employing a comparable data setup, reaching 89.71% accuracy for a task with
seven native languages using data from the International Corpus of Learner English
(ICLE). We then investigate the claim by Brooke and Hirst (2011) that a content bias in
ICLE results in classification by topic rather than by native language
characteristics. We show that training our model on ICLE and testing it on three other,
independently compiled learner corpora dealing with other topics still results in high-accuracy
classification.
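The core feature idea can be sketched on an invented miniature learner corpus (hypothetical
snippets and labels; not the ICLE data or the full feature abstraction described above): keep
only n-grams that recur across training texts and use their presence as binary features.

```python
# Recurring-n-gram sketch: only n-grams occurring in at least two training
# texts are kept, as binary features for a native-language classifier.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

# Hypothetical miniature learner corpus: (essay snippet, native language).
train_texts = [
    ("I am agree with this opinion because it is very important .", "es"),
    ("I am agree that the university must be free for the students .", "es"),
    ("On the one hand the people does not want to pay more taxes .", "es"),
    ("Please borrow me your book , I will give back it tomorrow .", "de"),
    ("We will make a party on the weekend , you must come also .", "de"),
    ("I become always nervous before the exam , it is normal .", "de"),
]
texts, labels = zip(*train_texts)

model = make_pipeline(
    # min_df=2 keeps only n-grams recurring in at least two texts; binary
    # presence rather than counts, over unigrams up to trigrams.
    CountVectorizer(ngram_range=(1, 3), min_df=2, binary=True),
    LinearSVC(),
)
model.fit(list(texts), list(labels))
print(model.predict(["I am agree that the taxes are too high ."]))
```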
Abstract: In this talk, I present two approaches to analysing discourses in Spanish:
multidimensional analysis (MDA) and supervised classification (SC) of specialized texts.
Concerning MDA, I present two studies based on the written academic PUCV-2006 Corpus of
Spanish. Both studies employ the five dimensions (i.e. Contextual and Interactive Focus, Narrative
Focus, Commitment Focus, Modalizing Focus, and Informational Focus) identified by Parodi
(2005). The main assumption is that the dimensions determined by a previous multidimensional
analysis can be used to characterize a new corpus of university genres. In the first study, I calculate
linguistic density across the five dimensions to describe the nine academic genres of the corpus. In
the second one, I compare the PUCV-2006 Corpus with four corpora from different registers.
The findings confirm the specialized nature of the genres in the PUCV-2006 Corpus,
where both a strong lexico-grammatical compactness of meanings and a modalization of
certainty are expressed in the texts. Concerning SC, I will present three classification
experiments based on specialized texts. In the first one, I compare naïve Bayes and SVM
methods, based on shared lexical-semantic content words, to classify the disciplines of 160
academic texts. In the second one, the informational density scores obtained in a previous
multidimensional analysis are used to classify the four disciplines corresponding to 353 theses of
the TFGPUCV-2010 corpus, using discriminant analysis and naïve Bayes. In the
last classification experiment, naïve Bayes is used to classify disciplines and genres,
based on part-of-speech trigrams calculated from a sample of theses and other academic
genres. According to my findings, it is possible to argue that the lexico-grammatical level
allows the texts to be classified by discipline and genre with high accuracy.
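The general setup of the last experiment can be sketched as follows (invented tag sequences and
discipline labels; not the actual corpus or tagset): a naive Bayes classifier over part-of-speech
trigrams, with each document represented by its tag sequence.

```python
# Naive Bayes over POS trigrams for discipline classification. Toy data only.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical POS-tagged documents (tag sequences) with discipline labels.
pos_docs = [
    ("DET NOUN ADP DET NOUN VERB ADP NOUN ADJ", "law"),
    ("DET NOUN ADJ VERB ADP DET NOUN ADP NOUN", "law"),
    ("NOUN VERB NUM NOUN ADP NOUN ADJ PUNCT NUM", "engineering"),
    ("NUM NOUN VERB ADP NUM NOUN ADJ PUNCT NOUN", "engineering"),
]
tags, disciplines = zip(*pos_docs)

model = make_pipeline(
    # treat each whitespace-separated tag as a token; use trigrams of tags
    CountVectorizer(ngram_range=(3, 3), token_pattern=r"\S+"),
    MultinomialNB(),
)
model.fit(list(tags), list(disciplines))
print(model.predict(["DET NOUN ADJ VERB ADP DET NOUN ADJ PUNCT"]))
```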
Abstract: Frederking et al. developed a competence model for literary-aesthetic judging within the
framework of the DFG Priority Programme 1293, “Competence Models for Assessing Individual
Learning Outcomes and Evaluating Educational Processes”. Literary-aesthetic judging is modelled
as a theoretically and empirically grounded three-dimensional construct that can be differentiated
from general reading ability. In his presentation, Prof. Dr. Frederking will speak about the
underlying competence model as well as the development of the test instrument for measuring
literary-aesthetic judging competency, with which the competence model was validated.
Abstract: The CARE databases are currently under construction. They bring together
information on church buildings with construction phases before the year 1000 in various
European countries. A project launch for Germany, Austria, and Switzerland
(DACH, http://care-dach.net) is in concrete planning. As a first step, the construction
phases for each church are entered there from a catalogue from the 1960s and 1980s. The
information is formally structured but varies in length and scope. To make the contents
accessible to the public, a prototype app has been developed
(App Store: Frühchristliches Köln), which uses the example of Cologne and its 10
buildings to show what the end result could look like. The plan is to automatically
generate the contents of this app from the information held in the database in the
future.
Abstract: Textual Entailment captures a common-sense notion of entailment between two natural language texts, P (premise) and H (hypothesis). The relevance of Textual Entailment lies in its promise to provide a generic notion of semantic inference that a wide range of natural language processing applications can build on.
The first half of this talk will introduce the notion of Textual Entailment and provide an overview
of recent work on the topic, including a typology of the major algorithmic approaches, relevant
linguistic phenomena, and applications. Unfortunately, it has turned out that the agnosticism of
Textual Entailment with regard to processing has led to a fragmentation of research. The second
half will cover ongoing work on the development of a generalized model of Textual
Entailment that subsumes the various proposed algorithms and the implementation of
this model in the form of a multilingual, reusable, open-sourced platform for semantic
processing.
[Two recent manuscripts are available from https://moodle02.zdv.uni-tuebingen.de/course/view.php?id=380
(access restricted to logins from the University of Tübingen)]
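By way of illustration only (a deliberately naive baseline with made-up examples, not the
generalized model or platform discussed in the talk), a minimal Textual Entailment decision can
be approximated by lexical overlap between premise and hypothesis:

```python
# Lexical-overlap baseline: judge entailment by how much of the hypothesis'
# content vocabulary is covered by the premise. Illustrative only.
STOPWORDS = {"the", "a", "an", "is", "was", "were", "of", "in", "on", "to", "and"}

def content_words(text):
    return {w for w in text.lower().split() if w not in STOPWORDS}

def entails(premise, hypothesis, threshold=0.8):
    """Return True if most content words of H are covered by P."""
    h = content_words(hypothesis)
    if not h:
        return True
    overlap = len(h & content_words(premise)) / len(h)
    return overlap >= threshold

print(entails("A dog is sleeping on the porch", "A dog is sleeping"))  # True
print(entails("A dog is sleeping on the porch", "A cat is sleeping"))  # False
```

Real systems replace this overlap heuristic with the richer algorithmic approaches surveyed in
the first half of the talk.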
Abstract: In the context of the Kobalt-DaF network, whose members investigate different
aspects of learner texts, we took a look at topological fields in essays of Chinese and
Belarusian learners of German. The texts were parsed according to the TüBa-D/Z
annotation scheme and then manually corrected using the tool Synpathy. The talk will
provide some insight into the results of the automatic parsing process and the problems
that arise there. Furthermore, I will walk through the application of the tools in
use.
Abstract: In my MA thesis, I explore POS analysis for learner language. Tagsets for native
language are often insufficient for describing the linguistic phenomena occurring in learner
language. In the sentence “He was choiced for the job”, the word “choiced” cannot be accurately
tagged: if it is only analyzed as a finite verb, the information on how the word was
formed (out of a noun/adjective stem) is lost, which would be of interest for both SLA
and SLT research. Forming new (error) categories is often also not desirable when
learner language needs to be compared with native language. Díaz-Negrillo et al. (2009)
suggest in their publication “Towards interlanguage POS annotation for effective learner
corpora in SLA and FLT” to split POS analysis into three dimensions to avoid this
conflict. The words are analyzed with a native-language tagset from a distributional,
morphological, and lexical perspective. Mismatches on these levels are expected to expose
errors or misuse of the language. In my talk, I will discuss these issues and present an
implementation of the tripartite POS tagging for German. I will show what other
theoretical and practical issues were revealed during the implementation and testing
process.
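The core idea of the tripartite annotation can be sketched as follows (an illustrative data
structure using the abstract's own example; the tag names and the mismatch check are assumptions,
not the thesis implementation):

```python
# Each learner token receives three POS values: distributional, morphological,
# and lexical. Disagreement between them flags potentially non-native forms.
from dataclasses import dataclass

@dataclass
class TripartiteTag:
    token: str
    distributional: str  # POS suggested by the syntactic context
    morphological: str   # POS suggested by inflectional marking
    lexical: str         # POS of the stem as listed in the lexicon

    def mismatch(self):
        return len({self.distributional, self.morphological, self.lexical}) > 1

# "He was choiced for the job": context and the -ed marking point to a verb,
# but the stem "choice" is a noun in the lexicon.
example = TripartiteTag("choiced", distributional="VERB",
                        morphological="VERB", lexical="NOUN")
print(example.mismatch())  # True: a candidate learner innovation
```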
Abstract: We have known for a long time that discourse connectors (like “therefore”, “however”, “but”) facilitate human sentence processing when used appropriately. However, we know much less about the time course of processing such connectors. In particular, we are interested in whether discourse connectors are processed quickly enough to affect expectations about upcoming discourse content. In this talk, I will present recent experiments on the processing of causals vs. concessives, which indicate that connectors are integrated incrementally into the discourse representation, and that concessives, similar to negation, give rise to a search for alternatives. However, we also found evidence that concessives take longer to process than causals.
I will then go on to talk about expectations which people may have about upcoming discourse relations /before/ encountering a connective, and how these expectations affect the explicit vs. implicit realization of discourse cues. Both studies can shed some light on the causes of processing difficulty at the discourse level.
In a final part of my talk, I will give an overview of our recent efforts in evaluating models of
linguistic processing difficulty in real-world scenarios, where we use a dual-task setting with a
simultaneous language comprehension task and a well-controlled, continuous
simulated driving task. Cognitive load in this setting is measured in terms of a
novel form of pupillometry, in addition to task-related measures such as steering
accuracy.
_________________________________________________________________________________
Last updated: February 5, 2013