See also Transforming an index into a list of CroALa searches.
If you deal with language and literature, lists are very interesting instruments. Not to read, of course, but to do something with them. If we take a list of words and compare it to another, there is a lot to learn and discover, provided the lists are sufficiently long. However, comparing long lists is a job best left to others; in our case, to computers.
But what is interesting in comparing lists? Let's say you have a list of mythological names and a list of words from a (long) Latin epic. The intersection of these lists tells us which mythological names (from our list) occur in the epic; what is left of our list tells us which names don't occur there.
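To make the idea concrete, here is a minimal sketch with two tiny, made-up lists (not the real data). It assumes one word per line and uses the standard tool comm, which is not the tool used later on this page; comm simply shows intersection and difference very directly.

```shell
# Two tiny, made-up lists standing in for the real ones (hypothetical data).
names=$(mktemp)
epic=$(mktemp)
printf '%s\n' abaris abas abderus > "$names"
printf '%s\n' abaris arma abderus cano virum > "$epic"

# comm expects sorted input.
sort -o "$names" "$names"
sort -o "$epic" "$epic"

# -12 keeps only lines common to both files (the intersection);
# -23 keeps lines found only in the first file (names absent from the epic).
occurring=$(comm -12 "$names" "$epic")
missing=$(comm -23 "$names" "$epic")

echo "occur in the epic:"
echo "$occurring"
echo "do not occur:"
echo "$missing"

rm -f "$names" "$epic"
```

Here abaris and abderus end up in the intersection, and abas is the name that does not occur.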
Take ten more lists from ten more long epics, and apply the same procedure. See what happens then, think about what it all means, and notice what the comparison makes you think about.
However, before we come to this part of the action, we have to learn where to find lists, and how to prepare and adapt them.
One list of mythological names — and other names from antiquity — is freely available on Wikisource.1) It's the list of lemmata from the Dictionary of Greek and Roman Biography and Mythology (1867), an encyclopedia by William Smith. Smith's dictionary on Wikisource is incomplete — a lot of articles are missing — but, for list-forging, this is irrelevant.
So, we take (that is, copy and paste) Smith's list of lemmata. In its Wikisource format it looks like this:
*[[../Abaeus|Abaeus]]
*[[../Abammon Magister|Abammon Magister]]
*[[../Abantiades|Abantiades]]
*[[../Abantias|Abantias]]
*[[../Abantidas|Abantidas]]
*[[../Abarbarea|Abarbarea]]
*[[../Abaris|Abaris]]
*[[../Abas (mythology) 1.|Abas 1.]]
*[[../Abas (mythology) 2.|Abas 2.]]
and with some searching and replacing we make it much simpler2):
Abaeus
Abantiades
Abantias
Abantidas
Abarbarea
Abaris
Abascantus
Abderus
Abdias
Abellio
Abgarus
Abia
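The searching and replacing can itself be scripted. The following is only a sketch of one possible way to do it, not necessarily the replacements actually used: it assumes one Wikisource entry per line, keeps the display text after the "|", and (an assumption based on the sample above) drops multi-word and numbered entries such as "Abammon Magister" or "Abas 1.". The function name simplify_lemmata is made up for this example.

```shell
# Hypothetical helper: turn Wikisource list lines into bare names.
simplify_lemmata() {
  # keep only the display text between "|" and the closing "]]"
  sed -n 's/^\*\[\[[^|]*|\([^]]*\)]]$/\1/p' |
  # assumption: discard entries containing a space or a digit
  grep -v '[ 0-9]'
}

printf '%s\n' \
  '*[[../Abaeus|Abaeus]]' \
  '*[[../Abammon Magister|Abammon Magister]]' \
  '*[[../Abaris|Abaris]]' \
  '*[[../Abas (mythology) 1.|Abas 1.]]' |
  simplify_lemmata
```

With these four sample lines only Abaeus and Abaris survive.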
Now we can order the computer to compare this list to another: in our case, to the list of all words occurring in CroALa (called words.R in the PhiloLogic system), where the interesting part goes like this (it is actually preceded by a lot of less interesting numbers, which also occur in the texts):
aaron
aarone
aaronem
aaroni
aaronis
aathor
ab
ab2
ab3
aba
abac
abachuc
abachuch
abacos
abacosque
abacta
abactam
abacti
abactis
abacto
abactor
abactorem
abactos
abactum
abactus
abactę
abacuc
abacuch
abacum
abacus
abaddir
abadon
abaffy
abagare
abagari
It is important to realize that the list of words in CroALa has 271,281 potentially interesting entries, and the (simplified) list of Smith's names beginning with A has 1145 words. That means that a computer has to compare each of Smith's 1145 names with each of CroALa's 271,281 words; it has to make 1145 × 271,281 = 310,616,745 comparisons. This could take a long time: should each comparison last a millisecond (it lasts less, I hope), the whole job would require about 86 hours. On my old laptop it takes about 86 seconds, but that is enough to make me impatient. That's how spoiled we have become.
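The arithmetic is easy to check in the shell itself. This is just a sketch of the calculation; the variable names are mine, and the one-millisecond figure is the pessimistic assumption made in the text, not a measurement.

```shell
names=1145      # Smith's names beginning with A
words=271281    # potentially interesting entries in the CroALa word list

# every name against every word
comparisons=$((names * words))
echo "$comparisons comparisons"

# assumed cost of one millisecond per comparison:
# milliseconds -> seconds -> whole hours (integer division)
hours=$((comparisons / 1000 / 3600))
echo "about $hours hours at 1 ms per comparison"
```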
How do you make a computer compare lists? I know of three possible ways.
First, there are the Linux tools (commands) grep and sort, which can be combined in a simple bash script to find the intersection of the two lists (called ${file1} and ${file2} in the script):
#!/bin/bash
# comparing two lists of words (separated by newlines)
# usage: ./cpr-lists.sh filename1 filename2
# calls tools: grep, sort

# take arguments, compare with grep, sort alphabetically
file1=$1
file2=$2
# -i: ignore case; -f: read the patterns from file1; -w: match whole words only
grep -if "${file1}" -w "${file2}" \
  | sort -d - \
  > "${file1}"-"${file2}"-zajedno
With this script, a search for 100 of Smith's A-words among CroALa's 271,281 took 53.337 seconds. A search for 1000 words would require, I guess, ten times more: about 8.9 minutes. This is still less than a coffee break.
You understand already that our grep search for Abaeus will find only Abaeus. We had to use the grep -i option to make it find abaeus as well.
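A small experiment shows the difference. The two files here are temporary, made-up stand-ins for the real lists; the options are the same -w and -f used in the script, with and without -i.

```shell
# made-up miniature versions of the two lists
names=$(mktemp)
words=$(mktemp)
printf '%s\n' Abaeus > "$names"
printf '%s\n' abaeus Abaeus abaeusque > "$words"

# without -i only the exact capitalisation is found
case_sensitive=$(grep -w -f "$names" "$words")
# with -i the lower-case form is found too
case_insensitive=$(grep -i -w -f "$names" "$words")

echo "without -i:"
echo "$case_sensitive"
echo "with -i:"
echo "$case_insensitive"

rm -f "$names" "$words"
```

Note that abaeusque is found in neither case: -w insists on whole words, so a longer form does not count as a match.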
However, in CroALa, which is a collection of natural language, we can expect not only Abaei, Abaeum, Abaeo, but also Abaeusque, Abaeive, Abaeumne, and even Abęi, Abeus.
What to do?
The solution is called regular expressions.
In grep's dialect of regex notation we would use:3)
grep -i '\baba\?e[iuoe].*'
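(\b means “word boundary”, here the beginning of a word; \? means “zero or one of the preceding letter”; [iuoe] means “either i, u, o, or e”; and .* means “everything and nothing”, that is, any run of characters, including none. The final .* is not necessary in my version of grep, which matches anywhere in a line and prints the whole line by default.)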
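To see what this expression catches, we can feed it a handful of test words. This is a sketch assuming GNU grep (the \b and \? notation is a GNU extension to basic regular expressions); the word list is made up for the test, and the pattern is quoted so that the shell leaves the backslashes alone.

```shell
# a made-up test list: wanted forms, a spelling variant,
# a different name, and a word containing "abaei" in the middle
hits=$(printf '%s\n' abaeus Abaei abaeumne abeus abeo abaris scabaei |
  grep -i '\baba\?e[iuoe]')

echo "$hits"
```

The expression finds abaeus, Abaei, abaeumne and abeus, skips abaris and scabaei, but also returns abeo, a perfectly ordinary Latin verb form that has nothing to do with Abaeus.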
The orthography in neo-Latin is so uncontrolled, though not unpredictable, that we have to do a lot of regex transformations to cover all variants. With each transformation the chances grow that you'll find what you didn't look for.
On top of that come all the endings of Latin inflection.
However, when you use computer scripts, you have to think hard only once; later you reuse what you've written. Here's the hack I made today to search CroALa for names from an index (remember regex and PhiloLogic?):
# use on an already partially processed file
# we use sed, the Linux search-and-replace command-line tool
# (each pipe stands at the end of its line, so comment lines can sit between the steps)
# 1. replace all x-s at the end of Latin words
sed 's/x$/[xc]*/g' inputfilename |
# 2. diphthong oE can be written as E; 3. remove the ending -Us
sed 's/[oO]E/o?E/g' | sed 's/U[sm]$/*/g' |
# 4. remove the ending -a; 5. let anything come after endings -o, -r etc.
sed 's/a$/*/g' | sed 's/[ornlm]$/&*/g' |
# 6. remove the ending -as, leaving the a; 7. let anything come after -os
sed 's/as$/a*/g' | sed 's/os$/os*/g' |
# 8. -ns and -rs have -tis genitive etc.; 9. remove the ending -Es
sed 's/\([nr]\)s$/\1[st]*/g' | sed 's/Es$/*/g' |
# 10. -is in nominative produces -em, -es etc.; 11. nom. -e may produce -ae, -is etc.
sed 's/Is$/[EI]*/g' | sed 's/E$/[aEI]*/g' |
# 12. -i is usually nom. pl.; 13. -ti- can be written as -ci-
sed 's/I$/[IoU]*/g' | sed 's/tI\([aEIoU]\)/[tc]I\1/g' |
# 14. aspiration everywhere
sed 's/t\([aEIOUr]\)/th?\1/g' > outputfilename
# should be improved further for beginnings such as Ae-, Ho-
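To watch a few of these rules at work, here is a sketch that applies four of them (the ones for -a, for -o/-r/-n/-l/-m, for -as, and for -ns/-rs) to made-up sample names. The names are plain lower case and chosen for illustration only; the real input is a partially processed file whose conventions (such as the capital E, I, U) are not reproduced here. The function name to_patterns is mine, and the four rules are collected into one sed call instead of a chain of pipes.

```shell
# Hypothetical helper applying a subset of the rules above, in the same order.
to_patterns() {
  sed -e 's/a$/*/' \
      -e 's/[ornlm]$/&*/' \
      -e 's/as$/a*/' \
      -e 's/\([nr]\)s$/\1[st]*/'
}

# made-up sample names, one per rule
printf '%s\n' agrippa cicero aeneas mars | to_patterns
```

The four names come out as agripp*, cicero*, aenea* and mar[st]*. Once a rule has appended its asterisk, the later rules anchored at the end of the line no longer apply to that name, which is why the order of the steps matters.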