Linguistic Modeling and its Interfaces
Oberseminar, Detmar Meurers, Summer Semester 2012
The Oberseminar features presentations and discussions of current issues in linguistic modeling and its interfaces. This includes linguistic modeling in computational linguistics, language acquisition research, and Intelligent Computer-Assisted Language Learning, as well as theoretical linguistic research with a focus on the interface of syntax and information structure. It is open to advanced students and anyone interested in this interdisciplinary enterprise.
Abstract: Learner corpora, collections of language produced by language learners, have been systematically compiled since the 1990s. With readily available collections such as the ICLE (Granger et al. 2002) for English and FALKO (Lüdeling et al. 2005) for German, there is a growing empirical basis for informing theories about second language acquisition and the linguistic system and for testing applications.
While most research on learner corpora has analyzed the (co)occurrence of (sequences of) words or relied on manual error annotation, tools for automatically analyzing large corpora in terms of linguistic abstractions such as parts of speech, syntactic constituency, or dependency are increasingly available. Similar to the discussion about the role of exemplars vs. prototypes in language, this situation raises the question of when to consider surface forms as such and when linguistic categories that abstract and generalize over surface forms are useful in a corpus-based analysis. In this talk, I want to illustrate the issue with some experiments from our current research, mostly from the domain of L1 identification, the automatic identification of the native language of a non-native writer.
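To make the surface-forms-vs-abstractions question concrete, the following sketch trains the same L1 classifier once on word n-grams (surface forms) and once on part-of-speech n-grams (linguistic abstractions). The corpus, labels, and feature choices are illustrative assumptions, not the actual experimental setup of the talk.

```python
# Hypothetical sketch: compare surface-form features (word n-grams) with
# abstracted features (POS n-grams) for L1 identification.
import nltk  # assumes the punkt tokenizer and POS tagger models are installed
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def pos_sequence(text):
    """Replace each token by its part-of-speech tag (an abstraction over surface forms)."""
    tokens = nltk.word_tokenize(text)
    return " ".join(tag for _, tag in nltk.pos_tag(tokens))

def evaluate(docs, l1_labels, use_pos=False):
    """5-fold cross-validated accuracy using word or POS n-gram features."""
    data = [pos_sequence(d) for d in docs] if use_pos else docs
    features = CountVectorizer(ngram_range=(1, 2)).fit_transform(data)
    clf = LogisticRegression(max_iter=1000)
    return cross_val_score(clf, features, l1_labels, cv=5).mean()

# docs: learner essays; l1_labels: the writers' native languages (hypothetical data)
# print(evaluate(docs, l1_labels, use_pos=False))  # surface forms
# print(evaluate(docs, l1_labels, use_pos=True))   # POS abstractions
```

Comparing the two accuracies is one simple way to ask, empirically, where abstraction over surface forms pays off.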
Abstract: For us humans, it is easy to distinguish the main text on a web page from surrounding elements like headlines and navigation bars based on their visual appearance. A computer only sees the underlying HTML code and the textual content of the page. Yet, as many web pages have a very similar structure, it is possible to make these distinctions automatically. I take a look at what information needs to be extracted from the web page in order to do so.
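As a rough illustration of the kind of information such a system might extract, the sketch below computes two simple cues per HTML block, text length and link density, which often separate main text from navigation and headlines. The block segmentation and thresholds are simplifying assumptions, not the method presented in the talk.

```python
# Hypothetical sketch: per-block features (text length, link density) for
# separating main text from boilerplate, using only the standard library.
from html.parser import HTMLParser

BLOCK_TAGS = {"p", "div", "li", "td", "h1", "h2", "h3"}

class BlockFeatureExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.blocks = []                  # (block text, characters inside links)
        self._text, self._link_chars = [], 0
        self._in_link = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_link = True
        if tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_link = False
        if tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        self._text.append(data)
        if self._in_link:
            self._link_chars += len(data)

    def _flush(self):
        text = "".join(self._text).strip()
        if text:
            self.blocks.append((text, self._link_chars))
        self._text, self._link_chars = [], 0

    def close(self):
        super().close()
        self._flush()

def main_text_blocks(html, min_len=80, max_link_density=0.3):
    """Keep blocks that are long and contain few link characters (assumed thresholds)."""
    parser = BlockFeatureExtractor()
    parser.feed(html)
    parser.close()
    return [text for text, link_chars in parser.blocks
            if len(text) >= min_len and link_chars / len(text) <= max_link_density]
```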
Abstract: This talk presents a new corpus-based approach to L1-classification of written texts, where the linguistically motivated syntactic alternations of Levin (1993) play the role of L1-characteristic features. The study is an example of how linguistic knowledge may contribute to the statistical analysis of corpus data. Conversely, we show how computational, data-driven methods may complement and extend a linguistic theory.
The classification experiments I carried out on English texts written by native speakers and English learners with four different mother tongues support the hypothesis about the distinctive nature of some syntactic alternations as L1-classification features: the choice of a syntactic frame within an alternation often depends on the L1 of the speaker.
Building on these experiments, an alternative approach was developed that automatically creates alternations and uses them as L1-classification features. The alternations automatically extracted from the corpus proved capable of distinguishing native from learner English texts, as shown by the high accuracy achieved in the L1-classification experiments with automatic alternation features.
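The following schematic sketch (not the study's actual pipeline) shows how alternation preferences could serve as classification features: for each text, the relative frequency of one syntactic frame versus its alternant (e.g. the double-object versus the prepositional variant of the dative alternation) becomes one feature dimension. The frame counts are assumed to come from a parser; the alternation inventory shown is an illustrative subset.

```python
# Hypothetical sketch: turn alternation-frame counts into L1-classification features.
from sklearn.linear_model import LogisticRegression

ALTERNATIONS = ["dative", "locative", "causative_inchoative"]  # illustrative subset

def alternation_features(frame_counts):
    """frame_counts: {alternation: (count_frame_A, count_frame_B)} for one text.
    Returns one preference value in [0, 1] per alternation (0.5 if unattested)."""
    features = []
    for alt in ALTERNATIONS:
        a, b = frame_counts.get(alt, (0, 0))
        features.append(a / (a + b) if (a + b) > 0 else 0.5)
    return features

def train_l1_classifier(texts_frame_counts, l1_labels):
    """texts_frame_counts: one frame-count dict per text; l1_labels: the writers' L1s."""
    X = [alternation_features(fc) for fc in texts_frame_counts]
    return LogisticRegression(max_iter=1000).fit(X, l1_labels)
```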
Abstract: In this talk, I will describe an experiment that contributes to the ongoing debate on
learning without awareness (see Williams, 2005; Hama & Leow, 2010; Faretta-Stutenberg &
Morgan-Short, 2011; Leung & Williams, 2011) by comparing three measures of awareness:
retrospective verbal reports, think-aloud protocols, and subjective measures (confidence ratings and
source attributions). The experiment was based on a widely-cited study on the implicit learning of
form-meaning connections (Williams, 2005). Our results showed a clear learning effect in
experimental subjects but not in controls, i.e. the study provided further evidence for the rapid,
incidental learning of form-meaning connections. The three measures of awareness further indicated
that experimental subjects acquired both implicit and explicit knowledge of form-meaning
connections, a finding that confirms the possibility of learning without awareness (Williams,
2005).
Sowmya Vajjala and Detmar Meurers
On Improving the Accuracy of Readability Classification using Insights
from Second Language Acquisition
Adriane Boyd, Marion Zepf and Detmar Meurers
Informing Determiner and Preposition Error Correction with
Hierarchical Word Clustering
Abstract: I will present a learner corpus of Czech, currently in the final stages of development. The corpus captures Czech as used by non-native speakers with various L1 backgrounds and at all proficiency levels. I will discuss its annotation scheme, which consists of three interlinked levels to cope with the wide range of error types present in the input. Each level corrects different types of errors; links between the levels make it possible to capture errors in word order and in complex discontinuous expressions. Errors are not only corrected, but
also classified. The annotation scheme was tested on a doubly-annotated sample of
approx. 10,000 words with fair inter-annotator agreement results. Currently the corpus
contains about 2M words with 300K words doubly annotated. I will also discuss the
practical aspects of the project and the possibility of (semi)automatic annotation in the
future.
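The abstract reports agreement on a doubly-annotated sample; a common way to quantify such agreement is Cohen's kappa. The sketch below is a generic version of that computation over two annotators' error labels, not the project's own evaluation code, and the example labels are hypothetical.

```python
# Hypothetical sketch: chance-corrected inter-annotator agreement (Cohen's kappa).
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Agreement between two annotators' label sequences, corrected for chance."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[l] * counts_b[l] for l in counts_a) / (n * n)
    return 1.0 if expected == 1 else (observed - expected) / (1 - expected)

# Example with made-up error-type labels from two annotators:
# print(cohens_kappa(["agr", "wo", "lex", "agr"], ["agr", "lex", "lex", "agr"]))
```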
Abstract: Empirically grounded readability classification with machine-learning techniques requires
texts for which the reading level is known. For English, the Weekly Reader is often used as a gold
standard. For German, there has so far been little research on readability classification, and there is no established gold standard. I created a German corpus for readability classification from the
websites of the monthly magazines GEO and GEOlino. GEOlino is similar to GEO but targeted at
children (age 8-14). On this data, I experiment with a variety of lexical, syntactic and
language model features that have previously been used for readability assessment on
English.
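As a minimal illustration of the lexical side of such a feature set, the sketch below computes a few surface features (average sentence length, average word length, type-token ratio) that tend to correlate with reading level. The actual features used in the talk go well beyond this; everything here is an illustrative assumption.

```python
# Hypothetical sketch: simple lexical features for readability classification.
import re

def lexical_features(text):
    """Surface features that tend to correlate with reading level."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"\w+", text.lower())
    return {
        "avg_sentence_length": len(words) / max(len(sentences), 1),
        "avg_word_length": sum(len(w) for w in words) / max(len(words), 1),
        "type_token_ratio": len(set(words)) / max(len(words), 1),
    }

# Each GEO (adult) and GEOlino (children's) article would be mapped to such a
# feature vector and passed to a standard classifier.
```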
Abstract: Natural language processing (NLP) methods continue to be a driving force behind the
success of applications serving the education community. The most widely-known application is
automated essay scoring which is used in assessment and instructional settings. In this talk, design
considerations that are critical to the development of educational applications will be discussed.
Specifically, NLP researchers need to consider the core aspects of the educational infrastructure:
curriculum and assessment development, curriculum delivery, and outcomes reporting. Researchers
also need to be aware of current pedagogical influences as they develop educational applications. Three specific applications developed at ETS will be described and discussed in the context of their use in the educational infrastructure: e-rater®, c-rater®, and Language Muse℠.
Abstract: This talk deals with issues of annotation in historical corpora and the use of overuse/underuse statistics as a means to detect patterns of change and variation. The focus of this talk is methodological.
In overuse/underuse statistics the frequency of a given category is compared in two or more corpora. One of the corpora is defined as a ‘standard’ and the other corpora are evaluated with respect to the standard corpus. A significantly lower frequency of a given category in one corpus than in the standard corpus is called underuse; a significantly higher frequency is called overuse. Overuse or underuse can be computed for any given category or combination of categories that is coded in a corpus. It can be used in diachronic corpora to detect candidates for patterns of change: a steady change towards the ‘standard’ (in our case Modern German) might signal a continuous development. We will illustrate this method with two case studies in a very small but deeply annotated diachronic treebank of German (Deutsche Diachrone Baumbank, DDB).
The first case study is concerned with the emergence and development of auxiliaries in German, and the second case study deals with relative clauses. By tracing the development of auxiliaries, we show how a semantic change of the same element takes place, while the relative clause study is more concerned with the interplay of the syntactic and semantic levels. Relative clauses are interesting because formally they seem to be very stable (Schmidt 2004: 235 claims that they have not changed since Old High German times), but there might be a functional change.
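As a rough illustration of the overuse/underuse computation described above, the sketch below compares the frequency of one category in a corpus against the ‘standard’ corpus using a log-likelihood ratio. The talk does not specify its test statistic, and the counts and corpus sizes here are purely hypothetical.

```python
# Hypothetical sketch: overuse/underuse of a category via the log-likelihood ratio (G^2).
import math

def log_likelihood_g2(freq_corpus, size_corpus, freq_standard, size_standard):
    """G^2 for one category; the direction of the frequency difference says whether
    the category is overused or underused relative to the standard corpus."""
    total = size_corpus + size_standard
    expected_corpus = size_corpus * (freq_corpus + freq_standard) / total
    expected_standard = size_standard * (freq_corpus + freq_standard) / total
    g2 = 0.0
    for observed, expected in ((freq_corpus, expected_corpus),
                               (freq_standard, expected_standard)):
        if observed > 0:
            g2 += 2 * observed * math.log(observed / expected)
    direction = ("overuse" if freq_corpus / size_corpus > freq_standard / size_standard
                 else "underuse")
    return g2, direction

# Example: a category occurring 120 times in a 10,000-word historical corpus
# vs. 60 times in a 10,000-word Modern German standard corpus:
# print(log_likelihood_g2(120, 10_000, 60, 10_000))  # high G^2 -> candidate overuse
```

A high G² for a category in successive time slices, with the frequency moving steadily toward the Modern German value, would flag that category as a candidate for a pattern of change.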
_________________________________________________________________________________
Last updated: July 14, 2012