CLS-INFRA Training School on Data and Annotation

This training school is the first held by the CLS INFRA project. It took place at the Faculty of Mathematics and Physics of Univerzita Karlova (Charles University) in Prague, Czech Republic, from 7 to 9 June 2022.
Learning Outcomes
By following this event resource, you should be able to:
- Edit, annotate and query your own data using CQL and Universal Dependencies.
- Understand and apply discipline-wide standards such as TEITOK, KonText and UDPipe.
Attendees
The event was attended by 40 interdisciplinary and international researchers in a hybrid format.
Organisation Team
- Lisanne van Rossum
- Silvie Cinkova
- Bartlomiej Kunda
- Ciara Lynn Murphy
- Justin Tonra
- Salvador Ros
Website
CLS INFRA Training School – Prague Summer School June 2022 https://clsinfra.io/events/training-school/
Acknowledgements
The event was organized as part of Work Package 4 of the CLS INFRA project, funded by the European Commission (Horizon 2020, Grant ID 101004984).
What do I bring to prepare for the training?
Corpus building is an integral part of this workshop. You should therefore bring a few text files of your own choosing, so that you can get individual hands-on experience with building a corpus. These files do not have to be in English.
Who is this course for?
This is an elementary course for genuine beginners. If you have any NLP or programming skills, if you have ever built your own corpus or queried one beyond looking for a single word, or if you are well-versed in TEI-XML, this course will be too slow for you.
If you do have prior experience with the skills offered in this training school, you might still enjoy scrolling through the teaching materials, or you could apply for a fellowship to pursue your individual research or teaching project with on-site support from our CLS INFRA project partners.
1. Intro - Information extraction from the Shakespeare Drama Corpus
We use the Shakespeare Drama Corpus to show you how to extract information with corpus-linguistic methods - which questions you can ask and what answers you can expect. This is a general overview; you will learn details of the individual tools and methods in the following sessions.
Session 1 - Introduction
2. XML, TEI, and TEITOK I
This session introduces the basic principles of XML markup and presents the Text Encoding Initiative (TEI-XML) guidelines as an encoding standard for digital editions and textual corpora. The students produce a valid TEI-XML document and upload it to TEITOK, our web-based platform for viewing, creating, and editing corpora with both rich textual mark-up and linguistic annotation.
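To give a feel for what a minimal TEI-XML document looks like, here is a small Python sketch that holds one in a string and checks that it is well-formed. The document content is invented for illustration, and the check covers XML well-formedness only; validating against the TEI schema would need a separate schema validator, and nothing here reflects how TEITOK itself processes uploads.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"

# Minimal TEI skeleton: a teiHeader whose fileDesc has a titleStmt,
# a publicationStmt and a sourceDesc, followed by a text body.
# All content is invented for illustration.
MINIMAL_TEI = f"""<?xml version="1.0" encoding="UTF-8"?>
<TEI xmlns="{TEI_NS}">
  <teiHeader>
    <fileDesc>
      <titleStmt><title>Sample document</title></titleStmt>
      <publicationStmt><p>Unpublished teaching example.</p></publicationStmt>
      <sourceDesc><p>Born digital.</p></sourceDesc>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <p>To be, or not to be, that is the question.</p>
    </body>
  </text>
</TEI>
"""


def is_well_formed(xml_text: str) -> bool:
    """Return True if the string parses as XML (well-formedness only)."""
    try:
        ET.fromstring(xml_text.encode("utf-8"))
        return True
    except ET.ParseError:
        return False


print(is_well_formed(MINIMAL_TEI))          # the skeleton above
print(is_well_formed("<TEI><text></TEI>"))  # mismatched tags
```

A document can be well-formed without being valid TEI, so this check is only a first sanity test before uploading a file anywhere.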
Session 2 - XML, TEI and TEITOK
3. Universal Dependencies – Morphology
This session presents UDPipe, an NLP tool for analysing texts in more than seventy languages. UDPipe works with Universal Dependencies, a framework for consistent grammatical annotation across human languages. We focus in particular on its morphological annotation scheme, explaining the individual part-of-speech labels as well as the more fine-grained morphological features, using English examples. A separate session deals with the syntactic markup.
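Universal Dependencies annotation is commonly exchanged in the tab-separated CoNLL-U format. As a rough illustration of the part-of-speech labels and morphological features mentioned above, the following Python sketch reads a hand-written CoNLL-U sentence and looks up its features. The sentence and its annotation are an assumed example written for this page, not actual UDPipe output.

```python
# Rows are written with spaces for readability and converted to tabs below,
# because real CoNLL-U files separate the ten columns with tab characters.
ROWS = [
    "1 She she PRON PRP Case=Nom|Gender=Fem|Number=Sing|Person=3|PronType=Prs 2 nsubj _ _",
    "2 loves love VERB VBZ Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin 0 root _ _",
    "3 old old ADJ JJ Degree=Pos 4 amod _ _",
    "4 books book NOUN NNS Number=Plur 2 obj _ SpaceAfter=No",
    "5 . . PUNCT . _ 2 punct _ _",
]
CONLLU = "# text = She loves old books.\n" + "\n".join(
    row.replace(" ", "\t") for row in ROWS
) + "\n"


def parse_morphology(text):
    """Return one dict per word with its form, lemma, UPOS tag and features."""
    tokens = []
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        cols = line.split("\t")
        if not cols[0].isdigit():
            continue  # skip multiword-token ranges (1-2) and empty nodes (1.1)
        feats = {}
        if cols[5] != "_":
            feats = dict(pair.split("=", 1) for pair in cols[5].split("|"))
        tokens.append(
            {"form": cols[1], "lemma": cols[2], "upos": cols[3], "feats": feats}
        )
    return tokens


tokens = parse_morphology(CONLLU)
plural_nouns = [
    t["form"]
    for t in tokens
    if t["upos"] == "NOUN" and t["feats"].get("Number") == "Plur"
]
print(plural_nouns)
```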
Session 3 - Universal Dependencies – Morphology
4. Base CQL
CQL (Corpus Query Language, developed in the 1990s) is the de facto standard in the field, used by most current corpus query tools. We start by explaining regular expressions to the students and then gradually introduce diverse restrictions within a single-token query. We demonstrate the searches in KonText, the corpus manager developed and maintained by the Institute of the Czech National Corpus.
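The idea of a single-token query, where each attribute of a token is restricted by a regular expression, can be imitated in a few lines of Python. This is a toy sketch and not how KonText evaluates queries; the token list and the attribute names `word`, `lemma` and `tag` are assumptions, since attribute names and tagsets differ from corpus to corpus.

```python
import re

# Toy tokens; in a real corpus these attributes come from the annotation.
TOKENS = [
    {"word": "Loves", "lemma": "love", "tag": "VERB"},
    {"word": "love", "lemma": "love", "tag": "NOUN"},
    {"word": "lovely", "lemma": "lovely", "tag": "ADJ"},
    {"word": "unloved", "lemma": "unloved", "tag": "ADJ"},
]


def matches(token, conditions):
    """True if every listed attribute fully matches its regular expression.

    The pattern has to cover the whole attribute value, which is the usual
    convention for values in CQL token queries.
    """
    return all(
        re.fullmatch(pattern, token[attr]) for attr, pattern in conditions.items()
    )


def search(tokens, conditions):
    """Return the word forms of all tokens satisfying the conditions."""
    return [t["word"] for t in tokens if matches(t, conditions)]


# Roughly corresponds to the CQL query [word="love.*"]
print(search(TOKENS, {"word": "love.*"}))
# Roughly corresponds to [lemma="love" & tag="VERB"]
print(search(TOKENS, {"lemma": "love", "tag": "VERB"}))
```

Note that the first query does not find "Loves" (the match is case-sensitive) or "unloved" (the pattern must match from the start of the word).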
Session 4 - Base CQL
5. Metadata
This session explains the TEI-XML header structure, its relation to the document body, good practice and its implementation in TEITOK. It prepares the students for setting up their own corpus of philological texts with complex headers and genre-specific text metadata. The students receive guidance to create their own document headers.
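As a small illustration of how header metadata relates to the document body, the Python sketch below reads a title, an author and a genre from the `teiHeader` of an invented sample document. The choice of fields and their location in the header are assumptions made for this example; a real TEITOK corpus defines its own header conventions.

```python
import xml.etree.ElementTree as ET

NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Invented sample: metadata lives in teiHeader, the text itself in text/body.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>Hamlet</title>
        <author>William Shakespeare</author>
      </titleStmt>
      <publicationStmt><p>Teaching example.</p></publicationStmt>
      <sourceDesc><p>Invented for illustration.</p></sourceDesc>
    </fileDesc>
    <profileDesc>
      <textClass>
        <keywords><term type="genre">tragedy</term></keywords>
      </textClass>
    </profileDesc>
  </teiHeader>
  <text><body><p>Who's there?</p></body></text>
</TEI>"""


def header_metadata(xml_text):
    """Collect a few header fields; a missing field is returned as None."""
    header = ET.fromstring(xml_text).find("tei:teiHeader", NS)

    def text_of(path):
        node = header.find(path, NS)
        return node.text if node is not None else None

    return {
        "title": text_of(".//tei:titleStmt/tei:title"),
        "author": text_of(".//tei:titleStmt/tei:author"),
        "genre": text_of(
            ".//tei:textClass/tei:keywords/tei:term[@type='genre']"
        ),
    }


print(header_metadata(SAMPLE))
```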
Session 5 - Metadata
6. Advanced CQL
This session is a continuation of the Base CQL session. It builds on queries about individual tokens and proceeds to queries on a sequence of tokens, introducing the concept of group referencing and metadata scope (e.g. within a single sentence). It also introduces the aggregation and filtering functionalities of KonText, the corpus manager used in the demonstration.
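To make the notion of a token sequence restricted to a single sentence more concrete, here is a toy Python sketch. It is an illustration only, not KonText's query engine: the sentences, the `word` and `tag` attributes and the tag values are all invented for this example.

```python
import re

# Three toy "sentences" of (word, tag) pairs; the second ends in an adjective
# and the third starts with a noun, to show the effect of sentence boundaries.
SENTENCES = [
    [("my", "PRON"), ("dear", "ADJ"), ("lord", "NOUN"), (".", "PUNCT")],
    [("O", "INTJ"), ("gentle", "ADJ")],
    [("lady", "NOUN"), ("speaks", "VERB")],
]


def token_matches(token, condition):
    """Check one token against attribute/regex pairs (whole-value match)."""
    word, tag = token
    attrs = {"word": word, "tag": tag}
    return all(re.fullmatch(p, attrs[a]) for a, p in condition.items())


def find_sequences(sentences, query):
    """Find adjacent tokens matching the query, never crossing a sentence."""
    hits = []
    n = len(query)
    for sentence in sentences:
        for start in range(len(sentence) - n + 1):
            window = sentence[start:start + n]
            if all(token_matches(t, c) for t, c in zip(window, query)):
                hits.append(" ".join(word for word, _ in window))
    return hits


# Roughly corresponds to the CQL query [tag="ADJ"] [tag="NOUN"] within <s/>
print(find_sequences(SENTENCES, [{"tag": "ADJ"}, {"tag": "NOUN"}]))
```

Because the search never crosses a sentence boundary, "gentle" followed by "lady" is not reported even though the two words are adjacent in the token stream.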
Session 6 - Advanced CQL
7. Statistics
This session explains statistical considerations on frequency in corpus linguistics, mainly the statistical significance and effect size of a difference between two frequency counts. It also introduces several quantitative stylistics metrics, such as different flavors of lexical richness, descriptivity vs. narrativity, thematic concentration, and thematic weights of individual words. Students learn about the online tools Calc and QuitaUp, which calculate these metrics automatically.
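As a rough sketch of what comparing two frequency counts can involve, the Python code below computes a log-likelihood statistic as a significance measure, the ratio of relative frequencies as a simple effect size, and the type-token ratio as one basic lexical richness measure. These are common textbook choices and only an assumption here; they are not necessarily the formulas that Calc or QuitaUp apply, and the counts in the example are invented.

```python
import math


def log_likelihood(freq1, size1, freq2, size2):
    """Two-cell log-likelihood (G2) for one word observed in two corpora."""
    total = freq1 + freq2
    expected1 = size1 * total / (size1 + size2)
    expected2 = size2 * total / (size1 + size2)
    g2 = 0.0
    for observed, expected in ((freq1, expected1), (freq2, expected2)):
        if observed > 0:  # a zero count contributes nothing to the sum
            g2 += observed * math.log(observed / expected)
    return 2 * g2


def relative_frequency_ratio(freq1, size1, freq2, size2):
    """Effect size: how many times more frequent the word is in corpus 1."""
    return (freq1 / size1) / (freq2 / size2)


def type_token_ratio(tokens):
    """Lexical richness as distinct word forms divided by running words."""
    return len(set(tokens)) / len(tokens)


# Invented counts: 120 hits in 100,000 tokens vs. 60 hits in 100,000 tokens.
g2 = log_likelihood(120, 100_000, 60, 100_000)
ratio = relative_frequency_ratio(120, 100_000, 60, 100_000)
print(round(g2, 2), ratio)
# 3.84 is the chi-squared critical value for p < 0.05 with 1 degree of freedom.
print(g2 > 3.84)
print(type_token_ratio("to be or not to be".split()))
```

The type-token ratio depends strongly on text length, which is one reason why several alternative measures of lexical richness exist.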
Session 7 - Statistics
8. Universal Dependencies – Syntax
This is a continuation of the Universal Dependencies – Morphology session. It explains the principles of dependency grammar and its UD flavor, touching upon the interplay between linguistic form and function, as well as ambiguity and vagueness in linguistic annotation. We also elaborate on the most common syntactic labels and their typical usage.
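In a dependency analysis every word points to its head and carries a label for the relation between the two. The Python sketch below lists these relations for a hand-annotated CoNLL-U sentence; the sentence and its annotation are an assumed example, not parser output, and the code handles plain integer word IDs only.

```python
# Rows use spaces for readability and are converted to the tab-separated
# CoNLL-U layout below. Column 7 is the head ID, column 8 the relation label.
ROWS = [
    "1 She she PRON PRP _ 2 nsubj _ _",
    "2 loves love VERB VBZ _ 0 root _ _",
    "3 old old ADJ JJ _ 4 amod _ _",
    "4 books book NOUN NNS _ 2 obj _ _",
    "5 . . PUNCT . _ 2 punct _ _",
]
CONLLU = "\n".join(row.replace(" ", "\t") for row in ROWS) + "\n"


def dependencies(text):
    """Return (dependent, relation, head) triples for one sentence."""
    rows = [
        line.split("\t")
        for line in text.splitlines()
        if line and not line.startswith("#")
    ]
    forms = {int(cols[0]): cols[1] for cols in rows}
    forms[0] = "ROOT"  # head ID 0 marks the root of the sentence
    return [(cols[1], cols[7], forms[int(cols[6])]) for cols in rows]


for dependent, relation, head in dependencies(CONLLU):
    print(f"{dependent} --{relation}--> {head}")
```

Here "loves" is the root of the sentence, "She" is its subject (`nsubj`), "books" its object (`obj`), and "old" modifies "books" (`amod`).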
Session 8 - Universal Dependencies – Syntax
9. Named-Entity Recognition and bulk editing
In this session, students gain a basic overview of the state of the art in Named-Entity Recognition and Entity Linking (referring from a linguistic entity to an external knowledge base). They learn about the main Entity Linking authorities, such as Wikidata and VIAF. They get hands-on experience with TEITOK’s manual entity annotation module. Finally, they get acquainted with TEITOK’s bulk editing module.
Session 9 - Named-Entity Recognition and bulk editing
10. Tree queries - Grew
This session teaches the students the foundations of Grew, a declarative tree query language, working with its implementation in TEITOK. We demonstrate the power of tree querying by searching the syntactically parsed Shakespeare corpus for the salient semantic participants of a verb, showing which grammatical constructions to look for and how to implement the search in Grew.
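The kind of question a tree query answers, for example "which subject and object does this verb have?", can be imitated in plain Python over a toy parsed sentence. This sketch is an illustration only and does not use Grew or TEITOK; the sentence is invented, and the Grew pattern quoted in the comment is an assumed rough counterpart of what the function does.

```python
# Toy parsed sentence; a real corpus would supply this from its annotation.
TOKENS = [
    {"id": 1, "form": "She", "upos": "PRON", "head": 2, "deprel": "nsubj"},
    {"id": 2, "form": "loves", "upos": "VERB", "head": 0, "deprel": "root"},
    {"id": 3, "form": "old", "upos": "ADJ", "head": 4, "deprel": "amod"},
    {"id": 4, "form": "books", "upos": "NOUN", "head": 2, "deprel": "obj"},
    {"id": 5, "form": ".", "upos": "PUNCT", "head": 2, "deprel": "punct"},
]


def verb_participants(tokens):
    """Find verbs that have both a subject and an object dependent.

    Assumed rough Grew counterpart:
        pattern { V [upos=VERB]; V -[nsubj]-> S; V -[obj]-> O }
    """
    by_id = {t["id"]: t for t in tokens}
    found = {}
    for t in tokens:
        head = by_id.get(t["head"])
        if head is None or head["upos"] != "VERB":
            continue  # only dependents of verbs are of interest
        if t["deprel"] in ("nsubj", "obj"):
            found.setdefault(head["id"], {})[t["deprel"]] = t["form"]
    # Keep only verbs for which both participants were found.
    return [
        (by_id[verb_id]["form"], roles["nsubj"], roles["obj"])
        for verb_id, roles in found.items()
        if "nsubj" in roles and "obj" in roles
    ]


print(verb_participants(TOKENS))
```

A declarative query language lets you state such a pattern directly instead of writing the loop by hand, which is what makes it convenient for exploring large parsed corpora.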
Session 10 - Tree queries - Grew