Should we rely on numbers? Arguments for and against the use of quantitative methods in linguistics

  • Speaker: Professor Diana Santos (University of Oslo)
  • Abstract: In this presentation I share my insecurities with the audience by describing various arguments that have been put forward for and against the use of quantitative methods in the study of language, and by looking at how language itself functions in order to clarify the issue. I will try to provide an overview of the various positions and will contrast two scientists, Yule and Zipf. I will also discuss some issues associated with the design of corpus-based grammars.

The Manually Annotated Sub-Corpus: An Experiment in Collaborative Language Resource Development

  • Speaker: Professor Nancy Ide (Vassar College)
  • Abstract: The Manually Annotated Sub-Corpus (MASC) is the first broad-genre corpus that includes a wide range of linguistic annotations represented in a common format and that is completely free for anyone to use for any purpose. MASC, and the larger Open American National Corpus (OANC) from which it is drawn, are built upon ISO standards for the representation of linguistic resources and are supported by tools and services that make it possible to obtain any choice of sub-corpora and annotations in an array of formats usable in other tools and frameworks. We see MASC, OANC, their supporting services, and the open data philosophy as a prototype for future language resource development and delivery. Both the OANC and MASC are intended to serve as a base for continued community development, wherein additional data and annotations are contributed by members of the computational linguistics community. To this end, we must address several issues: how best to support the collaborative annotation of the corpus, e.g., via web-based technologies, including web services; how to ensure that the community is engaged in both the creation and delivery of this linguistically enriched data; and how to ensure that access to data is unencumbered by licensing concerns. The presentation will review these issues, potential solutions, and the viability of collaborative resource development for linguistically enriched language data.