The “First thousand years of Greek” documentation
Creating the CHS Canon of Greek Literature
The necessary first step is to identify extant texts, how they are cited, and how their canonical citation scheme maps onto a TEI P5-compliant schema. The project currently manages this information in a relational database system.
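As a rough illustration of what such a mapping might record, the sketch below pairs a work's citation levels with TEI elements and turns a dotted citation into an XPath expression. The `CanonEntry` class, its field names, and the choice of `div` and `l` elements are assumptions made for this example; they are not the project's actual database schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CanonEntry:
    """One hypothetical canon record (not the project's real schema)."""
    urn: str                # CTS URN of the work
    citation_levels: tuple  # names of the citation units, outermost first
    tei_elements: tuple     # TEI element assumed for each level, same order


def citation_to_xpath(entry, citation):
    """Turn a dotted citation such as '1.5' into an XPath over a TEI body."""
    values = citation.split(".")
    if len(values) > len(entry.tei_elements):
        raise ValueError(f"{citation!r} is deeper than the citation scheme")
    # A shorter citation (e.g. '1') simply addresses the outer unit.
    steps = [
        f"tei:{element}[@n='{value}']"
        for element, value in zip(entry.tei_elements, values)
    ]
    return "/tei:TEI/tei:text/tei:body/" + "/".join(steps)


# Illustrative record: the Iliad, cited by book and line.
iliad = CanonEntry(
    urn="urn:cts:greekLit:tlg0012.tlg001",
    citation_levels=("book", "line"),
    tei_elements=("div", "l"),
)

print(citation_to_xpath(iliad, "1.5"))
```

Keeping the citation levels and the TEI elements side by side in one record is one way to let later steps work from a canonical citation alone, without knowing how a particular document is marked up.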
Creating valid TEI P5 texts
Once the entries in the CHS Canon of Greek Literature have been constructed for a corpus of texts, a variety of utility programs work with that data and a source of public-domain readings of ancient texts, such as the TLG, to create TEI-conformant texts.
- dio-getText.pl: a Perl script that works with a modified version of Diogenes 3 to catalog each citation node of a text on the TLG E disk.
- Scripts in Groovy that consult the CHS Canon database and work with the output of dio-getText.pl to create documents that validate against one of several TEI P5-compliant RELAX NG schemas.
Creating a database of strings and lexical identifiers
Lemmatized indexing involves mapping a surface string (an inflected form) to a lexical entity identified by a unique identifier. Before we can create a lemmatized index, we need to establish the set of identifiers for lexical entities.
- Script to format data from Peter Heslin's "expert-data" package as a tabular mapping of surface strings (inflected forms) to the lemma strings used by the Perseus project's Morpheus system to label a lexical entity.
- Script to extract from the Perseus project's electronic LSJ the unique identifier for each article, together with the lemma used by LSJ to label that lexical entity.
- Output from the previous two steps has to be reviewed and unified so that morphological identifications can be expressed with unique identifiers rather than with labelling lemmas.
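The unification step can be pictured as a join between the two tables, with anything that does not resolve to exactly one identifier set aside for human review. The sketch below is only an illustration under assumed inputs: the function names, the table shapes, and the `lsj-…` identifiers are invented here and do not reflect the actual output of either script.

```python
import unicodedata


def nfc(text):
    """Normalize Greek strings so visually identical keys compare equal."""
    return unicodedata.normalize("NFC", text)


def unify(form_to_lemma, lemma_to_ids):
    """Join (form, lemma) pairs with a lemma -> [identifier] table.

    Returns (resolved, needs_review): forms whose lemma matches exactly
    one identifier, and the rows a person still has to look at.
    """
    table = {nfc(lemma): ids for lemma, ids in lemma_to_ids.items()}
    resolved = {}
    needs_review = []
    for form, lemma in form_to_lemma:
        ids = table.get(nfc(lemma), [])
        if len(ids) == 1:
            # A form may belong to several lemmas, so collect a set.
            resolved.setdefault(nfc(form), set()).add(ids[0])
        else:
            # No match, or more than one article shares the label.
            needs_review.append((nfc(form), nfc(lemma), list(ids)))
    return resolved, needs_review


# Made-up sample rows; identifiers are placeholders, not real LSJ ids.
form_to_lemma = [
    ("μῆνιν", "μῆνις"),
    ("ἄειδε", "ἀείδω"),
    ("θεὰ", "θεά"),
    ("Ἀχιλῆος", "Ἀχιλλεύς"),
]
lemma_to_ids = {
    "μῆνις": ["lsj-0001"],
    "ἀείδω": ["lsj-0002"],
    "θεά": ["lsj-0003", "lsj-0004"],  # invented ambiguity, illustration only
}

resolved, needs_review = unify(form_to_lemma, lemma_to_ids)
for form, lemma, ids in needs_review:
    print(f"review: {form} ({lemma}) -> {ids}")
```

Separating the automatically resolved rows from the review queue mirrors the point made above: a lemma string is only a label, so any row where the label does not pick out a single entity needs a decision before an identifier can stand in for it.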
Creating a lemmatized index of P5-compliant texts
- Scripts in Groovy that consult both the CHS Canon of Greek Literature and the database of lexical entities to create lemmatized word indices of the TEI P5 documents, referred to by CTS URN.
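To make the shape of such an index concrete, the following sketch maps each lexical identifier to the CTS URNs of the passages where one of its forms occurs. It is an assumption-laden illustration: passages are supplied as plain (URN, text) pairs rather than read from TEI documents, tokenization is a naive whitespace split, and the `lex-…` identifiers and function name are invented for the example.

```python
import unicodedata
from collections import defaultdict

PUNCTUATION = ".,;·"  # stripped from token edges; a deliberately naive rule


def build_lemma_index(passages, form_to_id):
    """Map each lexical identifier to the URNs of passages containing it.

    passages:   iterable of (cts_urn, text) pairs
    form_to_id: dict from surface form to lexical identifier
    """
    lookup = {
        unicodedata.normalize("NFC", form): lexical_id
        for form, lexical_id in form_to_id.items()
    }
    index = defaultdict(list)
    for urn, text in passages:
        for raw in unicodedata.normalize("NFC", text).split():
            token = raw.strip(PUNCTUATION)
            lexical_id = lookup.get(token)
            # Record each passage once per identifier, in reading order.
            if lexical_id is not None and urn not in index[lexical_id]:
                index[lexical_id].append(urn)
    return dict(index)


# Iliad 1.1-1.2 as sample passages; identifiers are placeholders.
passages = [
    ("urn:cts:greekLit:tlg0012.tlg001:1.1",
     "μῆνιν ἄειδε θεὰ Πηληϊάδεω Ἀχιλῆος"),
    ("urn:cts:greekLit:tlg0012.tlg001:1.2",
     "οὐλομένην, ἣ μυρί᾽ Ἀχαιοῖς ἄλγε᾽ ἔθηκε,"),
]
form_to_id = {
    "μῆνιν": "lex-0001",
    "ἄειδε": "lex-0002",
    "ἔθηκε": "lex-0003",
}

for lexical_id, urns in sorted(build_lemma_index(passages, form_to_id).items()):
    print(lexical_id, urns)
```

Forms that are missing from the lookup table are silently skipped here; a production index would more likely log them so the table of lexical entities can be extended.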