Inventory of linguistic entities

Content

The inventory of linguistic entities contains the unique results of analyzing every whitespace-delimited token in a text.

Tokens are first normalized by removing punctuation and second accents. Every token is then assigned to one of five categories: labels, quoted tokens, named entities, lexical entities, or numbers.

For labels, quoted tokens, and named entities, the inventory simply records the unique occurrences of normalized surface forms. Tokens classified as lexical entities or numbers are parsed further.
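The normalization and de-duplication of surface forms can be sketched as follows. This is a minimal illustration, not the project's own code. It assumes that "second accents" means the additional accent a word receives before an enclitic, and that punctuation can be identified by Unicode general category; the function names normalize and unique_forms are hypothetical.

```python
import unicodedata

# Combining grave, acute, and Greek perispomeni (circumflex).
ACCENTS = {"\u0300", "\u0301", "\u0342"}

def normalize(token: str) -> str:
    """Strip punctuation, then keep only the first accent in the token.

    Assumption: a "second accent" is the extra accent added before an enclitic.
    """
    # Drop every character whose Unicode category is a punctuation class (P*).
    stripped = "".join(
        ch for ch in token if not unicodedata.category(ch).startswith("P")
    )
    # Decompose so that accents become separate combining characters.
    decomposed = unicodedata.normalize("NFD", stripped)
    kept = []
    seen_accent = False
    for ch in decomposed:
        if ch in ACCENTS:
            if seen_accent:
                continue  # skip any accent after the first
            seen_accent = True
        kept.append(ch)
    return unicodedata.normalize("NFC", "".join(kept))

def unique_forms(text: str) -> list[str]:
    """Return the sorted set of normalized, non-empty surface forms."""
    forms = {normalize(token) for token in text.split()}
    forms.discard("")  # tokens that consisted only of punctuation
    return sorted(forms)
```

Note that this simple rule also strips an apostrophe used to mark elision or a numeral, so a real pipeline would presumably need to classify such tokens before or during normalization.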

Tokens classified as lexical entities are analyzed morphologically and resolved to one or more entities, each with a unique identifier. E.g., Μῆνιν, the first token in the Iliad, is identified as a form of the lexical entity with ID n67485 (lemma form μῆνις), while the token ἀποδεχθέντα is identified as a form of either the entity with ID n12409 (lemma form ἀποδείκνυμι) or the entity with ID n12449 (lemma form ἀποδέχομαι). The inventory of linguistic entities records each unique identifier together with a lemma form and, when a descriptive phrase can be automatically associated with the entity from the Perseus project's "short translations" derived from Liddell-Scott, an English label.
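One way to picture the resulting records is the sketch below. The identifiers and lemma forms are the ones cited above; everything else (the LexicalEntity class, the ANALYSES lookup table standing in for a morphological parser, and the build_inventory function) is hypothetical, and the English labels are left empty rather than guessed.

```python
import unicodedata
from dataclasses import dataclass
from typing import Optional

def nfc(s: str) -> str:
    """Normalize to NFC so visually identical strings compare equal."""
    return unicodedata.normalize("NFC", s)

@dataclass(frozen=True)
class LexicalEntity:
    entity_id: str
    lemma: str
    label: Optional[str] = None  # English label, when one is available

# Entities cited in the text; labels are omitted in this sketch.
ENTITIES = {
    "n67485": LexicalEntity("n67485", "μῆνις"),
    "n12409": LexicalEntity("n12409", "ἀποδείκνυμι"),
    "n12449": LexicalEntity("n12449", "ἀποδέχομαι"),
}

# Hypothetical stand-in for a morphological parser: token -> candidate entity IDs.
ANALYSES = {
    nfc("Μῆνιν"): ["n67485"],
    nfc("ἀποδεχθέντα"): ["n12409", "n12449"],  # ambiguous between two entities
}

def build_inventory(tokens: list[str]) -> list[LexicalEntity]:
    """Collect each entity once, however many tokens resolve to it."""
    found = {}
    for token in tokens:
        for entity_id in ANALYSES.get(nfc(token), []):
            found[entity_id] = ENTITIES[entity_id]
    return [found[entity_id] for entity_id in sorted(found)]
```

An ambiguous token such as ἀποδεχθέντα contributes both candidate entities in this sketch, since nothing here chooses between them.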

Numbers are parsed numerically, and their values are computed as either integers or ratios of integers. E.g., γ' is parsed as the integer 3. The inventory records unique numeric values.
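For the integer case, a minimal sketch of the parsing might look like this. It assumes the standard alphabetic (Milesian) letter values, lowercase input, and an optional trailing numeral sign; it does not handle thousands or ratios, and it does not check that the letters appear in a valid order. The names VALUES, NUMERAL_SIGNS, and parse_integer are hypothetical.

```python
# Standard alphabetic numeral values, including the three archaic letters.
VALUES = {
    "α": 1, "β": 2, "γ": 3, "δ": 4, "ε": 5, "\u03DB": 6,  # U+03DB stigma
    "ζ": 7, "η": 8, "θ": 9,
    "ι": 10, "κ": 20, "λ": 30, "μ": 40, "ν": 50, "ξ": 60,
    "ο": 70, "π": 80, "\u03D9": 90,                        # U+03D9 koppa
    "ρ": 100, "σ": 200, "τ": 300, "υ": 400, "φ": 500,
    "χ": 600, "ψ": 700, "ω": 800, "\u03E1": 900,           # U+03E1 sampi
}

# Characters commonly used as the trailing numeral sign:
# apostrophe, modifier prime, Greek numeral sign, right single quote.
NUMERAL_SIGNS = "'\u02B9\u0374\u2019"

def parse_integer(token: str) -> int:
    """Sum the letter values of a numeric token, ignoring a trailing sign."""
    body = token.rstrip(NUMERAL_SIGNS)
    if not body:
        raise ValueError(f"no numeral letters in {token!r}")
    total = 0
    for ch in body:
        if ch not in VALUES:
            raise ValueError(f"{ch!r} is not a numeral letter")
        total += VALUES[ch]
    return total

def unique_values(tokens: list[str]) -> list[int]:
    """Return the sorted set of distinct numeric values."""
    return sorted({parse_integer(t) for t in tokens})
```

With these assumptions, parse_integer("γ'") returns 3, matching the example above. Ratios could be represented with Python's fractions.Fraction, but how they are written in the source texts is not covered here.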

Structure

The analytical inventory is stored in a simple tabular format with structured metadata expressing the coverage of the inventory as a series of CTS URNs. This information can be readily imported by any kind of software. A full description of the tabular format will be available from this page.
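Until the format description is published, the following sketch only illustrates the general idea of reading such a file. The layout is entirely hypothetical: tab-delimited rows, a header line with the column names id, lemma, and label, and metadata lines beginning with "#coverage" that carry a CTS URN. The URN shown is an illustrative work-level URN for the Iliad, not a statement about the inventory's actual coverage.

```python
import csv

# Hypothetical sample in an assumed layout; the real format may differ.
SAMPLE = "\n".join([
    "\t".join(["#coverage", "urn:cts:greekLit:tlg0012.tlg001:"]),
    "\t".join(["id", "lemma", "label"]),
    "\t".join(["n67485", "μῆνις", ""]),
    "\t".join(["n12409", "ἀποδείκνυμι", ""]),
])

def read_inventory(text: str) -> tuple[list[str], list[dict[str, str]]]:
    """Split metadata lines from data rows and parse the rows as a table."""
    coverage = []
    data_lines = []
    for line in text.splitlines():
        if line.startswith("#coverage\t"):
            # Everything after the first tab is the CTS URN.
            coverage.append(line.split("\t", 1)[1])
        elif line.strip():
            data_lines.append(line)
    # The first remaining line is treated as the header row.
    rows = list(csv.DictReader(data_lines, delimiter="\t"))
    return coverage, rows

coverage, rows = read_inventory(SAMPLE)
```

Keeping the coverage metadata in the same file as the rows means a consumer can tell which texts an inventory describes without consulting a separate document.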

Current state

We have begun analyzing Greek texts in verse down to about 300 C.E. When the first inventory of extant texts has been released on this site, we will simultaneously begin releasing current versions of the inventory of linguistic entities, and of the associated index of tokens.