Linguistic Modeling and its Interfaces
Oberseminar, Detmar Meurers, Winter Semester 2013/2014
This series features presentations and discussions of current issues in linguistic modeling and its interfaces. This includes linguistic modeling in computational linguistics, language acquisition research, Intelligent Computer-Assisted Language Learning, and education, as well as theoretical linguistic research with a focus on the interfaces of syntax and information structure. It is open to anyone interested in this interdisciplinary enterprise.
A list of talks in previous semesters can be found here: Summer 13, Winter 12/13, Summer 12, Winter 11/12, Summer 11, Summer 10, Winter 09/10, Summer 09
Abstract: In this talk I will present a new approach to focus, and more generally information structure, that I developed in my Master’s thesis. Focus has traditionally been seen as a grammatical notion (although most people would be hard pressed to say what they actually mean by the term “grammatical”). Recent research indicates that language processing is influenced by unpredictability as measured through surprisal (a.k.a. Shannon information). In short, it appears that if a word is unpredictable, it is pronounced with longer duration and is also harder to react to. Focus, on the other hand, has the useful property of lengthening focused words, thus allowing more processing time. I therefore attempt to (partially) tie focus to unpredictability. I do so under the assumption that language is a form of human behaviour, which in turn implies that it has social impact. In a large-scale online study I tested how people rate social variables when focus placement is manipulated in question-answer dialogues. The results indicate that focus does indeed affect the dimensions of friendliness and sincerity. Interestingly, if we take information theory seriously and interpret information as a means of uncertainty reduction, we can integrate the social component of focus with its predictability component. In my talk I will outline how this can be achieved.
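Surprisal, as used in the abstract, is simply the negative log-probability of a word given its context. As a rough illustration only (the bigram estimate, the add-one smoothing, and the toy corpus below are minimal stand-ins, not anything from the thesis):

```python
import math
from collections import Counter

def bigram_surprisal(corpus_tokens, context, word):
    """Surprisal (in bits) of `word` given the preceding `context` word,
    estimated from raw bigram counts with add-one smoothing."""
    bigrams = Counter(zip(corpus_tokens, corpus_tokens[1:]))
    unigrams = Counter(corpus_tokens)
    vocab = len(set(corpus_tokens))
    p = (bigrams[(context, word)] + 1) / (unigrams[context] + vocab)
    return -math.log2(p)

tokens = "the cat sat on the mat the cat ran".split()
# The frequent continuation "the cat" carries fewer bits of surprisal
# than the unseen continuation "the ran":
assert bigram_surprisal(tokens, "the", "cat") < bigram_surprisal(tokens, "the", "ran")
```

On this view, a focused (unpredictable) word would simply be one whose surprisal value is high relative to its neighbours.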
Andrea Horbach
Reducing supervision in scoring short answer exercises
Abstract: In this talk I will present our work on Computer-Assisted Language Learning with the aim of reducing teacher effort in grading short answer exercises. We present two strands of work: one using the CREG corpus and the other using data that has been collected during placement tests for learners of German as a Foreign Language at Saarland University.
Short answer exercises such as reading or listening comprehension questions are a common assessment strategy in foreign language learning, and it seems intuitively likely that students would make use of the reading text in constructing answers. First, we discuss an annotation study which explores the question of whether this intuition is reflected in data from the CREG corpus. We find that instructor-supplied target answers as well as correct student answers often link to the same portion of the text, while incorrect student answers often refer to passages of the text that have nothing to do with the correct answer.
Next, we evaluate whether these findings can be leveraged for automatic short answer scoring. We build a very simple classifier that relies solely on whether the student answer relates to the same passage of the reading text as the target answer. This classifier performs below the state of the art, but it suggests possibilities for developing automatic answer scoring systems that need less supervision from instructors.
Third, I will discuss ongoing joint work with Magdalena Wolska in which we explore the effectiveness of clustering student answers for reducing teachers’ effort in a manual grading scenario. Using data from Saarland University placement tests, we simulate a grading scenario which assumes that a teacher only labels one answer per cluster. We find that labeling on average 40% of the student answer types is enough to reach an accuracy of 90%.
In future work we will consider how to best integrate these theoretical findings into teacher interfaces and real grading scenarios.
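The label-one-answer-per-cluster scenario can be sketched as follows; clustering by normalized string (i.e. treating exact answer types as clusters) and the toy answers are illustrative stand-ins, not the actual clustering or data from the placement tests:

```python
from collections import defaultdict

def simulate_cluster_grading(answers, gold_labels, cluster_of):
    """Simulate a grading scenario in which the teacher labels one answer
    per cluster and that label is propagated to all cluster members."""
    clusters = defaultdict(list)
    for i, ans in enumerate(answers):
        clusters[cluster_of(ans)].append(i)
    correct = 0
    for members in clusters.values():
        teacher_label = gold_labels[members[0]]  # teacher grades one member
        correct += sum(gold_labels[i] == teacher_label for i in members)
    effort = len(clusters) / len(answers)    # fraction of answers labeled
    accuracy = correct / len(answers)        # agreement with gold labels
    return effort, accuracy

answers = ["der Hund", "Der Hund", "die Katze", "der  Hund", "eine Katze"]
gold    = ["correct",  "correct",  "wrong",     "correct",   "wrong"]
effort, acc = simulate_cluster_grading(
    answers, gold, lambda a: " ".join(a.lower().split()))
```

Here effort is the fraction of answers the teacher must grade by hand (one per cluster), and accuracy is the fraction of answers whose propagated label matches the gold label; better clusterings trade these two quantities off.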
Alexis Palmer (Universität des Saarlandes)
Active learning in the real world
Abstract: Following on the previous talk, I will discuss our next approach to reducing the amount of supervision needed for short answer scoring: namely, active learning. Active learning (AL) is a specialized machine learning scenario in which the (machine) learner guides the selection of examples to be annotated. The aim is to maximize the usefulness of human annotation effort while achieving sufficient classification accuracy. Though there is a significant literature on the use of AL for natural language processing, relatively few studies have considered what happens when AL is applied in real-world annotation settings.
In this talk, I will discuss results from a study of the effectiveness of AL in the real-world context of documenting the Mayan language Uspanteko, in which we find that the most appropriate way of combining machine and human resources depends to some extent on the expertise of the human annotator. I will then show our first steps toward implementing AL in the short answer scoring context.
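The basic pool-based AL loop with uncertainty sampling can be sketched with a deliberately tiny one-dimensional learner; the threshold classifier, the numeric "documents," and the oracle below are hypothetical illustrations, not the setup used in either study:

```python
# Pool-based active learning with uncertainty sampling: at each round the
# learner asks the human (oracle) to label the pool item it is least
# certain about, i.e. the item closest to its current decision boundary.

def train(labeled):
    pos = [x for x, y in labeled if y == 1]
    neg = [x for x, y in labeled if y == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2  # threshold

def uncertainty(x, threshold):
    return -abs(x - threshold)  # closer to the boundary = more uncertain

def active_learning(pool, oracle, seed, rounds):
    labeled = [(x, oracle(x)) for x in seed]
    pool = [x for x in pool if x not in seed]
    for _ in range(rounds):
        t = train(labeled)
        query = max(pool, key=lambda x: uncertainty(x, t))  # most uncertain
        labeled.append((query, oracle(query)))              # ask the human
        pool.remove(query)
    return train(labeled)

# Toy task: classify numbers as >= 5 (label 1) or < 5 (label 0).
oracle = lambda x: int(x >= 5)
threshold = active_learning(list(range(10)), oracle, seed=[0, 9], rounds=4)
```

The point of the loop is that the queried items cluster around the decision boundary, so a few targeted annotations pin it down; the expertise of the annotator matters precisely because these boundary cases are the hardest to label.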
Michael Hahn (Universität Tübingen)
Distributional Semantics of Phrasal Units
Abstract: The availability of large corpora of written and spoken language has significantly enriched
the empirical foundation of linguistic research. At the same time, it arguably is refocusing
language-related research towards questions which can readily be addressed by observing surface
evidence, such as which words (co)occur, with which frequencies, in which contexts. To step from
an investigation of past language use towards predictions generalizing across language tasks and
domains, the annotation of corpora with abstract linguistic properties serves an important role.
The talk explores the role and relevance of systematic corpus annotation using case studies
from the analysis of learner corpora, records of language produced by second language
learners.
(dry run of invited talk at Herrenhausen Conference: “(Digital) Humanities Revisited – Challenges
and Opportunities in the Digital Age”)
Abstract: Word formation is one of the major mechanisms for the expansion of the vocabulary in a language. Knowledge of lexical morphology (and derivation/affixation in particular) includes information about a word’s morphological complexity, the meaning and syntactic function of affixes, and the restrictions that govern the attachment of affixes to bases. In L2 acquisition, this knowledge is important for both the decoding of unknown words and the production of new words that have not yet been acquired. In addition, it can also be beneficial for the ad-hoc formation of words, e.g. when coping with problems in lexical search. Thus, knowledge of L2 derivational morphology is likely to have a positive effect on the size of both receptive and productive vocabulary.
Surprisingly, though, there is comparatively little empirical research on L2 learners’ productive use of derivational morphology (Lessard and Levison 2001, Schmitt and Zimmerman 2002, González Álvarez 2004; see Plag 2009 for review). Other research has focused on learners’ grammatical knowledge of individual affixes (Schmitt and Meara 1997, Mochizuki and Aizawa 2000), and on the usefulness of productive word formation as a strategy to facilitate vocabulary acquisition (e.g. Morin 2003) and lexical search (Zimmermann 2002).
This talk discusses the potential of learner corpora for the investigation of advanced learners’ knowledge and use of productive derivational morphology in their written L2 English, with a focus on questions of cross-linguistic influence (CLI).
Ulla König-Cardanobile (Universität Tübingen)
An Information-Theoretic Approach to Quantifying Complexity: The Case of German Noun Inflection
Ramon Ziai (Universität Tübingen)
Update on Advancing Content Assessment in Context: The Role of
Information Structure and Answer Typing
Abstract: Despite today’s digital tools and writing aids, the production of well-formed, linguistically correct, stylistically adequate, and target- and audience-tailored documents remains a challenge for writers; in a study of error types in native-language student writing, Lunsford and Lunsford (2008) found errors similar to those Connors and Lunsford (1988) had identified in a comparable study 20 years before. The number of spelling errors had decreased dramatically; however, the texts contained similar numbers of “subject-verb agreement errors,” “missing words,” “unnecessary shifts in verb tense,” or “fused sentences,” a clear indication that such errors cannot be detected and corrected by automatic checkers.
When produced by skilled writers, these errors can be considered performance errors, typically introduced while revising and editing text, rather than competence errors. These errors should therefore be prevented by offering appropriate editing functions for writers. However, for developing such functions, we first need a clear understanding of the causes of such errors.
The concept of action slips proposed by Norman (1981) offers a very strong theoretical framework that considers both the process and the product: some failure in a procedure results in an error in the product. However, error analysis (in writing research, second language acquisition, and natural-language processing) has traditionally focused on the product, i.e., the errors visible in the finished text, but has not addressed the writing process, i.e., the editing operations that preceded the error.
I present an approach to systematically analyze complex writing errors to distinguish typos from
revision errors and identify the areas where writers could benefit most from better
tools.
Abstract: In this talk, I will present the results of my investigation of the effect of different
instructional parameters within an interactive application for computer-assisted language
learning (CALL). More specifically, I examined different forms of CALL interaction and
their effect on language learning. My research is motivated by existing work on two
widely discussed issues within the discipline of second language acquisition. One is
the debate that pits form against meaning and leads to a discussion of the extent to
which language instruction should focus on linguistic forms and formal correctness as
opposed to emphasizing communicative skills and the ability to use the language to
make meaning in the real world. Related to that is the second controversial issue which
concerns the dichotomy between implicit and explicit knowledge and learning: How
explicit or implicit should instruction be, how does the degree of explicitness affect
the development of explicit and implicit knowledge, and how do these two types of
knowledge contribute to language skills? I will report on experiments with learners of
German who practiced with one of three versions of a text-based dialog system, each of
which realised a different degree of explicitness and a different weight on form versus
meaning.
Abstract: In this talk, I will present an overview of two strands of my research: the first half (joint
work with Amber Smith) will cover some methods for detecting errors in manually- and
automatically-annotated syntactic corpora (i.e., parser errors). These rather simple
anomaly-finding methods tend to work well for low-resource situations; are independent of parser,
language, or annotation scheme; and seem to be leading towards a connection with parse revision.
The second half (joint work with Marwa Ragheb) focuses on a project of syntactically annotating
English as a Second Language (ESL) data, the decisions we have had to make, and our first
steps in automatically parsing this data. Preliminary results suggest that anomaly
detection methods can help clean up our training data. Additionally, the improvement
with hand-written post-processing of parse results is an encouragement to develop
a grammar-based parse revision system, of a kind the first half of the talk connects
with.
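One simple family of anomaly-finding methods flags items that receive inconsistent labels across the corpus, in the spirit of variation-n-gram error detection. The POS-tag version below is a minimal sketch (the tiny corpus is invented, and a realistic detector would also condition on the surrounding context rather than on the word alone):

```python
from collections import defaultdict

def variation_nuclei(tagged_corpus):
    """Return words that receive more than one tag across the corpus:
    candidates for annotation (or parser) errors to inspect by hand."""
    tags_seen = defaultdict(set)
    for sentence in tagged_corpus:
        for word, tag in sentence:
            tags_seen[word].add(tag)
    return {w: tags for w, tags in tags_seen.items() if len(tags) > 1}

corpus = [
    [("the", "DT"), ("dog", "NN"), ("runs", "VBZ")],
    [("the", "DT"), ("dog", "VB"), ("barks", "VBZ")],  # "dog" mis-tagged
]
flagged = variation_nuclei(corpus)  # only "dog" varies in its tag
```

Note that such a detector needs no parser, language model, or annotation-scheme-specific machinery, which is why methods of this kind transfer well to low-resource settings.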
Abstract: One motivation for the task of Native Language Identification (NLI) – attempting to detect the native language (L1) of a writer writing in a second language – is to identify which particular characteristics of the writing show effects of the L1: whether a particular pattern of article use might indicate a Chinese L1 speaker, and so on. In this talk (on joint work with Jojo Wong) I’ll be discussing two types of feature we’ve looked at in the NLI classification task, one a representation of syntactic structure, the other a collocation ‘topic model’ learnt by Bayesian inference; and I’ll look at what kinds of information these sorts of features might give us about cross-linguistic effects.
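As a schematic illustration of the NLI setup, here is a minimal generative classifier over surface word profiles. The use of raw word counts is a deliberately simple stand-in for the syntactic and topic-model features actually discussed, and the training data is fabricated for the example:

```python
import math
from collections import Counter

def train_nli(docs_by_l1):
    """Per-L1 word counts (used later with add-one smoothing).
    `docs_by_l1` maps an L1 label to a list of documents in the L2."""
    models = {}
    for l1, docs in docs_by_l1.items():
        counts = Counter(w for d in docs for w in d.split())
        models[l1] = (counts, sum(counts.values()))
    return models

def predict_l1(models, doc, vocab_size=1000):
    """Pick the L1 whose smoothed unigram model gives `doc` the
    highest log-likelihood (a naive-Bayes-style decision)."""
    def score(l1):
        counts, total = models[l1]
        return sum(math.log((counts[w] + 1) / (total + vocab_size))
                   for w in doc.split())
    return max(models, key=score)

# Toy data: "A"-L1 writers overuse "the"; "B"-L1 writers omit it.
models = train_nli({"A": ["the the the cat"], "B": ["cat cat cat cat"]})
```

Inspecting which features most strongly separate the L1 classes is then exactly the step that connects classification accuracy back to claims about cross-linguistic effects.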
_________________________________________________________________________________
Last updated: June 2, 2014