Goal: sort the words of a Latin text into those lemmatized, those ambiguously lemmatized, and those unrecognized, and store the results in a local database. For each word, record the original form, the lemma, the stem (for large-scale queries), the text provenance, the lemmatization category (LEMMA, AMBIGUOUS, NOT RECOGNISED), and, for unrecognized forms, the reason the form was not recognized.
The source texts are encoded in TEI XML.
Tasks / strategies:
grep '^[[:lower:]]' filename > filename-lower
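A minimal sketch of the split step, assuming `filename` is a one-word-per-line list extracted from the TEI text and that the complementary uppercase-initial file is called `filename-upper` (as used in the iconv pipeline below); the sample words are illustrative only:

```shell
#!/bin/sh
# Hypothetical sample word list; in practice this is extracted from the TEI XML.
printf 'amor\nRoma\nvirtus\nCaesar\n' > filename
# Words starting with a lowercase letter (ordinary forms):
grep '^[[:lower:]]' filename > filename-lower
# Words starting with an uppercase letter (mostly names, sentence-initial forms):
grep '^[[:upper:]]' filename > filename-upper
```

The uppercase-initial file then feeds the normalisation pipeline below.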
Uppercase-initial forms are transliterated to ASCII, lowercased, and piped through the local ambiguity checker:
iconv -f UTF-8 -t ascii//TRANSLIT filename-upper | tr '[:upper:]' '[:lower:]' | perl ambig-local.pl
We have built a local database containing words already identified; those need not be sent to the lemmatizing service, which saves time.
Check: write the results of the local checks into the CSV, and report the total number of words identified locally.
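A sketch of the local check, assuming the database can be exported as a `form,lemma` CSV (`known.csv`, `words.txt`, and the file names below are hypothetical sample data, not the project's actual files). Locally identified words go into `results.csv` with category LEMMA; the rest are collected for the remote service, and a line count gives the "identified locally" report:

```shell
#!/bin/sh
# Hypothetical sample data standing in for the local database export and word list.
printf 'amor,amor\nvirtus,virtus\n' > known.csv
printf 'amor\nvirtus\nbellum\n' > words.txt
: > results.csv
: > tosend.txt
while read -r w; do
  # Look the form up in the local database export.
  hit=$(grep -m1 "^$w," known.csv || true)
  if [ -n "$hit" ]; then
    # Found locally: record form, lemma, category, and source in the CSV.
    printf '%s,LEMMA,local\n' "$hit" >> results.csv
  else
    # Not found: queue for the remote lemmatizing service.
    printf '%s\n' "$w" >> tosend.txt
  fi
done < words.txt
echo "identified locally: $(grep -c '' results.csv)"
```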
Use the Bamboo Morpheus service; the expected return format is JSON.
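A sketch of building the service request. The endpoint below follows the pattern of the public Perseids/Bamboo morphology service, but the exact path and parameters are assumptions and should be checked against the service documentation; the actual fetch is shown commented out to avoid a live network call:

```shell
#!/bin/sh
# Build the query URL for one word (endpoint pattern is an assumption).
morpheus_url() {
  printf 'http://services.perseids.org/bsp/morphologyservice/analysis/word?word=%s&lang=lat&engine=morpheuslat' "$1"
}
# A real fetch would request JSON and extract the lemma from the reply,
# e.g. (jq assumed available; JSON structure is an assumption):
# curl -s -H 'Accept: application/json' "$(morpheus_url amor)"
morpheus_url amor
```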
The first three points are achieved by the jsonu3.sh bash script.
Expected categories of unrecognized words: non-Latin words (Greek, modern languages), numbers, abbreviations, orthographic variants, errors, names, common Latin words missing from the service database, uncommon Latin words, and any combination of the above.
When the local database has been updated, query it for the words from our text that begin with an uppercase letter; the queries should be normalised (lowercased).
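A sketch of the normalised lookup for capitalised forms, reusing the transliterate-and-lowercase step from the pipeline above; `localdb.txt` is a hypothetical one-word-per-line export of the local database, and the sample words are illustrative:

```shell
#!/bin/sh
# Hypothetical local database export, one normalised form per line.
printf 'roma\ncaesar\n' > localdb.txt
# Normalise the query the same way as the iconv pipeline (ASCII + lowercase),
# then check for an exact whole-line match in the local list.
lookup() {
  q=$(printf '%s' "$1" | iconv -f UTF-8 -t ascii//TRANSLIT | tr '[:upper:]' '[:lower:]')
  grep -qx "$q" localdb.txt && echo FOUND || echo MISSING
}
lookup Roma
lookup Atticus
```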