Task: find out computationally which words in a Latin text are rare or strange.
Preprocess by excluding frequent words; several lists are available. Claude Pavur's is here (18653 wordforms). James H. Dee's database is here. Anne Mahoney's 200 essential Latin words are here (currently). A list of words in which the ending -que is not the enclitic conjunction is here (among other useful things).
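The preprocessing step can be sketched as follows. This is only a minimal illustration, not the script we actually use: every file name is a hypothetical stand-in, the word lists are tiny invented samples, and stripping -que from every word outside the exception list is an assumption about how such a list would be applied.

```shell
#!/bin/bash
# Sketch: strip enclitic -que, then drop frequent wordforms.
# All file names and sample words below are hypothetical.

# Stand-in for a list of words whose -que is NOT a conjunction.
printf '%s\n' atque quoque > demo-que-exceptions.txt
# Stand-in for a list of frequent wordforms (stop list).
printf '%s\n' et in est non atque > demo-stoplist.txt
# Stand-in for the wordforms of the text under study, one per line.
printf '%s\n' et arma virumque atque in cano > demo-words.txt

# Pass 1 (first file): remember the exceptions.
# Pass 2 (second file): remove a final "que" unless the word is an exception.
awk 'NR == FNR { keep[$0] = 1; next }
     !($0 in keep) && length($0) > 3 { sub(/que$/, "") }
     { print }' demo-que-exceptions.txt demo-words.txt > demo-noque.txt

# -v invert, -x match whole lines, -F fixed strings, -f read patterns from file
grep -v -x -F -f demo-stoplist.txt demo-noque.txt > demo-filtered.txt

cat demo-filtered.txt   # prints arma, virum, cano (one per line)
```

Unlike comm, grep -v -x -F -f does not need sorted input, which is convenient when the original word order should be kept.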
Lemmatizing services: Archimedes (XML-RPC), LemLat, PrePro2010 (XML-API). All results require postprocessing, cleaning etc.
Compare list2 (lemmatized words) with list1 (the primary wordlist) using comm; both lists must be sorted in the same order for comm to work.
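A minimal sketch of that comparison, assuming list2 holds the wordforms the lemmatizer recognized and list1 holds all candidate wordforms. The file names and sample words (including the invented form xyzzum) are hypothetical.

```shell
#!/bin/bash
# Sketch: list the wordforms that are in list1 but missing from list2.
# File names and sample data are hypothetical.

printf '%s\n' cano arma virum xyzzum > demo-list1.txt   # primary wordlist
printf '%s\n' virum arma cano > demo-list2.txt          # recognized forms

# comm expects both inputs sorted with the same collation;
# LC_ALL=C forces plain byte order for sort and comm alike.
LC_ALL=C sort -u demo-list1.txt > demo-list1.sorted
LC_ALL=C sort -u demo-list2.txt > demo-list2.sorted

# -2 suppresses lines found only in the second file,
# -3 suppresses lines common to both: what remains is unique to list1.
LC_ALL=C comm -23 demo-list1.sorted demo-list2.sorted > demo-unrecognized.txt

cat demo-unrecognized.txt   # prints xyzzum
```

Fixing the locale for both sort and comm avoids the "file is not in sorted order" complaints that appear when the two tools disagree about collation.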
Our bash script, which serves as a wrapper and pre-processor for the Archimedes Project XML-RPC call, looks like this:
#!/bin/bash
# Jovanovic, 2011-10, lemmatization of words
# usage: ./comm-lemm.sh antconc-result-filename
# requires: vlist.sed, latstop.txt, rpc3.py, 11lemclean.sed, 11lemclean2.sed

# step 0: clean up an AntConc wordlist
# step 1: remove the frequent Latin words. Ensure the unix format, remove whitespace.
tr -d '\011' < "$1" | sed -f vlist.sed - | sort - | comm -23 - latstop.txt > c"$1"
# vlist.sed holds cleaning commands
# latstop.txt holds frequent Latin words

# let the result of step 1 become a 'file' variable
FILE=c"$1"

# step 2: send rarer words to lemmatizer
# clean up the results
# sort and save the lemmata
python rpc3.py "${FILE}" | sed -f 11lemclean.sed - | sort - | sed -f 11lemclean2.sed - > lem2"${FILE}"
# 11lemclean.sed holds cleaning commands for lemmatizer
# 11lemclean2.sed holds another set of cleaning commands
# keep only the forms which were lemmatized

# then comm, but watch for spaces and tabs!
comm -23 "${FILE}" lem2"${FILE}" > r-"${FILE}"
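The warning about spaces and tabs can be illustrated with a small sketch. It is not part of the script above. The file names and sample data are hypothetical, and the normalization shown (dropping carriage returns and trailing whitespace, then re-sorting) is one possible cleanup, assumed here for illustration.

```shell
#!/bin/bash
# Sketch: invisible whitespace makes comm treat identical words as different.
# File names and sample data are hypothetical.

# A word list with a trailing tab, a trailing space and a DOS line ending.
printf 'arma\t\ncano \nvirum\r\n' > demo-dirty.txt
# The same three words, clean.
printf '%s\n' arma cano virum > demo-clean.txt

# Without cleanup, all three lines are reported as unique to the first file.
LC_ALL=C comm -23 demo-dirty.txt demo-clean.txt | wc -l

# Drop carriage returns, strip trailing whitespace, then sort again.
tr -d '\r' < demo-dirty.txt | sed 's/[[:space:]]*$//' | LC_ALL=C sort -u > demo-normalized.txt

# After cleanup the lists agree, so nothing is left over.
LC_ALL=C comm -23 demo-normalized.txt demo-clean.txt > demo-diff.txt
wc -l < demo-diff.txt
```

Re-sorting after the cleanup matters because any edit to the lines can change their order, and comm gives unreliable results on unsorted input.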