Clippers: A computational linguistics discussion group
(Meurers, 795Y, Winter 2005)
Clippers is our forum for informal discussion of all issues related
to computational linguistics: from work in progress of visitors and
people in the department, over presentation of new papers, to
practical concerns such as hints on the use of CL related software
tools.
Everyone with an interest in computational linguistics is most
welcome!
To see what happened in previous quarters of Clippers, check out the
pages for earlier quarters:
Autumn 04,
Spring 04,
Winter 04,
Autumn 03,
Spring 03,
Winter 03,
Autumn 02,
Spring 02,
Autumn 01
When and where: Tuesdays at 1730-1848 in 340
Central Classrooms.
Important: Please be sure to subscribe to our local
computational linguistics mailing list on which all Clippers
sessions and talks are announced.
The plan, as usual, is to start each session with 5-10
minutes on whatever someone wants to bring up and then to continue
with the following topics:
- Tue, 4. Jan.: Organization
- Tue, 11. Jan.: no meeting, in support of ACL paper writing
;-)
- Tue, 18. Jan.: no meeting
- Tue, 25. Jan.: Laura Stoia on Populating Semantic Classes
using Large Scale Corpora
- Tue, 1. Feb.: Markus Dickinson and Detmar Meurers on
Detecting errors in discontinuous structural corpus annotation
- Tue, 8. Feb.:
Tianfang Xu on Finding Landmarks
- Tue, 15. Feb.: Yang Shao and Soundararajan Srinivasan on
On building large-vocabulary speech recognition systems
Abstract:
CSE 888R04 was started in Spring 2004 to provide experience in
building large-vocabulary continuous speech recognition systems. In
this talk we will describe the motivation for and the challenges in
building such systems and then detail our preliminary work on the
Wall Street Journal corpus. We will conclude with some thoughts on
future systems/approaches that we intend to pursue.
- Tue, 22. Feb.:
Mona Diab (Computational Linguistics Center for Computational
Learning Systems (CCLS), Columbia University)
on Bootstrapping an Arabic WordNet: Issues of scale and
representation
NOTE the unusual time and place: 3:30 in 122 Oxley Hall
Abstract:
I propose the automatic bootstrapping of a Modern Standard Arabic
WordNet on the lexeme level using Arabic-English parallel corpora
and an English WordNet. I address the feasibility of such an
endeavor and present a qualitative evaluation of the meaning
correspondences cross-linguistically between Arabic and English. I
further present an automatic means of performing this task using an
unsupervised Word Sense Disambiguation System. I test the
feasibility of the bootstrapping by qualitatively evaluating the
meaning definition projection of English words onto their Arabic
translations. I manually evaluate 447 word instances of the Arabic
words that correspond to correctly sense-tagged English words using
English WordNet 1.7 from the SENSEVAL 3 data. The words evaluated
correspond to nouns, verbs, and adjectives in English. I find that,
across Arabic verbs, adjectives and nouns, for on average 52.3% of all
the words examined the corresponding English WordNet set of definitions
is sufficient as a definition for the Arabic translation word;
39.96% of the Arabic words correspond to specific subsets of the
WordNet definitions; and finally, 7.8% of the Arabic words comprise
supersets of their corresponding English WordNet translation
definitions. These results are very encouraging as they are similar
to those obtained by researchers building EuroWordNet. Moreover, we
present an evaluation of the automatic creation of an Arabic WordNet
using our unsupervised system SALAAM utilizing Senseval 2 data.
Finally, I will discuss the appropriateness of the lexeme level as
the granularity of representation for an Arabic WordNet.
- Tue, 1. Mar.: Wilbert Heeringa (University of Groningen) on
Measuring Norwegian Dialect Distances using Word Pronunciation
Transcriptions and Acoustic Word Samples
Abstract: The term "dialectometry" was coined by Jean Séguy, who was director
of the Atlas linguistique de la Gascogne. On the basis of the
material of this atlas he measured linguistic distances between
dialects, defining distance as the number of items on which two
dialects disagree. A more refined approach was introduced by Kessler
in 1995. He used Levenshtein distance as a tool for measuring
distances between Irish Gaelic dialects. Levenshtein distance is equal to the
minimum cost of changing one word pronunciation into another. The
minimum cost is based on the weights of insertions, deletions and
substitutions, the three operations allowed to be used for changing
the word pronunciation.
In this paper we apply Levenshtein distance to 15 Norwegian
dialects, using a data set compiled by Jørn Almberg. First, we apply
Levenshtein distance to word pronunciation transcriptions, just as
Kessler did. However, a novel element here is the weighting of
insertions, deletions and substitutions by using acoustic segment
distances. We examine different acoustic segment distances: the
Barkfilter, the cochleagram and the formant-track representation.
Second, we apply Levenshtein distance directly to acoustic
representations of the word samples, nearly without using
information provided by the transcription. The minimum cost is based
on the weights of insertions, deletions and substitutions of spectra
or formant bundles, rather than phonetic transcription segments.
Again, the Barkfilter, the cochleagram and the formant-track
representation are examined.
Finally, we validate our results by comparing them with the results
of a perception experiment, carried out by Charlotte Gooskens in the
spring of 2000. She measured perceptual distances between the 15
Norwegian dialects by presenting recordings of the same dialects to
a class of school children in each of the 15 dialect locations. The
pupils rated the distances of the dialects to their own dialect on a
scale of 1 (=similar) to 10 (=completely different). We found
significant correlations between the Levenshtein distance
measurements and the perceptual measurements.
- Tue, 8. Mar.:
- Stacey Bailey on On (not quite) recognizing textual
entailment
- Ilana Bromberg on An update on Systematicity in the Arabic
Lexicon
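For readers new to the measure used in the 1 March abstract, here is a minimal Python sketch of Levenshtein distance as defined there: the minimum total cost of insertions, deletions and substitutions needed to change one word pronunciation into another. The function name `levenshtein`, its parameters and the unit default costs are assumptions for illustration only; the acoustic segment distances used in the talk are not reproduced here, but a caller could supply its own `sub_cost` function to weight substitutions.

```python
from typing import Callable, Optional, Sequence


def levenshtein(
    source: Sequence[str],
    target: Sequence[str],
    ins_cost: float = 1.0,
    del_cost: float = 1.0,
    sub_cost: Optional[Callable[[str, str], float]] = None,
) -> float:
    """Return the minimum total cost of changing `source` into `target`.

    Costs are illustrative assumptions: by default every insertion,
    deletion and substitution of a differing segment costs 1.
    """
    if sub_cost is None:
        # Default assumption: identical segments are free, others cost 1.
        sub_cost = lambda a, b: 0.0 if a == b else 1.0

    # prev[j] holds the cost of turning the source prefix processed so
    # far into target[:j]; for the empty prefix that is j insertions.
    prev = [j * ins_cost for j in range(len(target) + 1)]
    for i, s in enumerate(source, start=1):
        # Turning source[:i] into the empty string takes i deletions.
        curr = [i * del_cost]
        for j, t in enumerate(target, start=1):
            curr.append(min(
                prev[j] + del_cost,          # delete s
                curr[j - 1] + ins_cost,      # insert t
                prev[j - 1] + sub_cost(s, t) # substitute s by t (or match)
            ))
        prev = curr
    return prev[-1]


print(levenshtein("kitten", "sitting"))  # 3.0
```

Passing a graded `sub_cost` (for example, a smaller cost for two similar vowels than for a vowel and a consonant) gives the kind of weighted variant the abstract describes, with the weights themselves left to the caller.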
Last modified: Mon Jan 3 22:44:57 EST 2005
For questions or comments regarding this page, please contact: Detmar Meurers