Serhiy Bykh, Sowmya Vajjala, Julia Krivanek, and Detmar Meurers
Proceedings of the 8th Workshop on Innovative Use of NLP for Building Educational Applications, Atlanta, GA, USA.
We explore a range of features and ensembles for the task of Native Language Identification as part of the NLI Shared Task (Tetreault et al., 2013). Starting with recurring word-based n-grams (Bykh and Meurers, 2012), we tested different linguistic abstractions such as part-of-speech, dependencies, and syntactic trees as features for NLI. We also experimented with features encoding morphological properties, the nature of the realizations of particular lemmas, and several measures of complexity developed for proficiency and readability classification (Vajjala and Meurers, 2012). Employing an ensemble classifier incorporating all of our features, we achieved an accuracy of 82.2% (rank 5) in the closed task and 83.5% (rank 1) in the open-2 task. In the open-1 task, the word-based recurring n-grams outperformed the ensemble, yielding 38.5% (rank 2). Overall, across all three tasks, our best accuracy of 83.5% for the standard TOEFL11 test set came in second place.
BibTeX entry:
@InProceedings{Bykh.Vajjala.ea-13,
  author    = {Serhiy Bykh and Sowmya Vajjala and Julia Krivanek and
               Detmar Meurers},
  title     = {Combining Shallow and Linguistically Motivated Features
               in Native Language Identification},
  booktitle = {Proceedings of the 8th Workshop on Innovative Use of NLP
               for Building Educational Applications (BEA)},
  year      = {2013},
  address   = {Atlanta, GA, USA},
  pages     = {197--206},
  pdf       = {http://www.aclweb.org/anthology/W13-1726},
  url       = {http://purl.org/dm/papers/Bykh.Vajjala.ea-13.html},
}