====== Exploring CroALa through Janus Pannonius ======
A lab diary.
===== What we already have =====
* Access Janus Pannonius' works in CroALa (as listed in the [[z:croala-index-auctorum|Index auctorum]])
* Study [[http://ramminger.userweb.mwn.de/search/searchresults.htm?searchField=IAN+PAN&Submit.x=16&Submit.y=15&srcriteria=phrase&aUngleichA=1|Pannonius' words]] excerpted for Johann Ramminger's Neulateinische Wortliste (as listed among the [[z:croala-nlw|Auctores Croatici (et vicini) in NLW]])
* A procedure for [[z:croala-large-scale|large-scale querying in CroALa]], looking for names Pannonius used
====== Querying CroALa with a list of Pannonius' words ======
===== The algorithm =====
- Produce a list of words from a text by Pannonius. Normalize orthographically if necessary (ę = ae, etc.)
- Send the list to a Latin lemmatizing service, e.g. the Morpheus lemmatizer offered by the Perseus Digital Library
- Collect the words **not** recognized by the lemmatizer. They are interesting because they are either names or words unknown to the service (the current version of Morpheus does not recognize [[http://services-qa.projectbamboo.org/bsp/morphologyservice/analysis/word?word=accipiter&lang=lat&engine=morpheus|accipiter]]).
- The unrecognized common words can be added to the service, or explored as possible neo-Latin contributions to the vocabulary
- The unrecognized names are considered as prominently thematic, and worthy of further comparison: what do authors in CroALa have to say about Autolycus or Demophoont? Do they say the same things, or something different?
- The list of unrecognized words is then sent to CroALa for automatic querying. A script notes the number of occurrences found
- Obviously, there will always //be// some occurrences, unless we exclude Pannonius from the queried set through a [[http://www.ffzg.unizg.hr/klafil/croala/cgi-bin/search3t?dbname=croala&author=NOT+panonije|NOT panonije]] search
- Query results are formatted into [[http://www.ffzg.unizg.hr/klafil/croala/xpr/2012-10-jp-unk.html|a third list]], with links to searches in CroALa and data on frequency in the collection. This list could also include links to other searches (in CAMENA / Termini, in Poeti d'Italia in lingua latina, in Google Books, in archive.org).
- [[http://www.ffzg.unizg.hr/klafil/croala/xpr/2012-10-jp-unk100.html|A subset of this list, including only queries with fewer than 100 results in CroALa]], seems better suited for close reading experiments
- A [[https://www.google.com/fusiontables/DataSource?docid=1YJCACjkiwYb5JD7KfmrnxPSv-OG2PC8EfOoae5Q|Google Fusion Table of this list]] enables not only sharing, but also (public) manipulation of data, such as sorting and filtering
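The first three steps of the list can be sketched in Perl roughly as follows. This is a minimal sketch, not the script we actually ran: the lemmatizer is replaced by a hypothetical hash of known forms (''%known''), the sample sentence is invented for illustration, and only the ę = ae rule is implemented; further normalizations would go into ''normalize_word''. The names ''normalize_word'', ''word_list'' and ''unrecognized'' are ours.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;                                  # this source contains a literal "ę"
binmode(STDOUT, ':encoding(UTF-8)');

# Normalize one word: lowercase it and expand e caudata (ę) to "ae".
sub normalize_word {
    my ($word) = @_;
    $word = lc $word;
    $word =~ s/ę/ae/g;
    return $word;
}

# Split a text into a sorted list of unique normalized words.
sub word_list {
    my ($text) = @_;
    my %seen;
    $seen{ normalize_word($_) } = 1 for grep { length } split /[^\p{L}]+/, $text;
    return sort keys %seen;
}

# Keep only the words that the (hypothetical) lemmatizer did not recognize.
sub unrecognized {
    my ($known, @words) = @_;
    return grep { !$known->{$_} } @words;
}

# Assumption: %known stands in for the answers of a real lemmatizing service.
my %known = map { $_ => 1 } qw(et in uersibus laudat);
my @words = word_list('Autolycus et Demophoon in uersibus; musę laudat.');
print join("\n", unrecognized(\%known, @words)), "\n";
```

Whatever ''unrecognized'' returns is the list that would then be sent to CroALa for automatic querying.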
===== A problem: regular expression homonyms =====
Because we are searching a corpus in an inflected language, and because orthography varies across texts in CroALa, we want to use regular expressions in searches. Fortunately, PhiloLogic offers not only regex, but also ways to control it.
There is a problem. A search for //d*// will find many "homonyms". And yet, when we try to stem words -- to take only the roots as query basis -- //deus// will end up as //d*//. In extreme cases, homonymy can be removed "by hand" --- but our hypothesis is that computers can serve for "triage", for finding interesting passages in a corpus. How to make them do this effectively?
We had to experiment. Using the regex capability to match //any single character// (through the operator ''.''), we built several searches for the same stem (followed by two, three, four, or any number of characters). Here are the query strings, together with the number of occurrences found in CroALa:
"ETH?RUSC..","19"
"ETH?RUSC...",""
"ETH?RUSC....","2"
"ETH?RUSC.*","39"
The idea is that a comparison of these slightly different query strings will help us categorize the cases -- find out what is useful where.
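How such a family of query strings can be generated, and what each variant catches, is sketched below. The sketch does not talk to CroALa: it counts matches in a small invented word list, and it assumes that a query has to match a whole word, ignoring case (this may differ from how PhiloLogic actually applies a pattern). ''query_variants'' and ''count_matches'' are illustrative names.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Build the four query strings for one stem: the stem followed by
# exactly two, three or four characters, or by any number of them.
sub query_variants {
    my ($stem) = @_;
    return map { $stem . $_ } ('..', '...', '....', '.*');
}

# Count the words that a query matches as a whole word, ignoring case.
sub count_matches {
    my ($query, @words) = @_;
    my $count = grep { /\A(?:$query)\z/i } @words;
    return $count;
}

# Invented sample words, standing in for the corpus.
my @sample = qw(etrusci etruscis ethrusco etruscorum ethruscorum etruria);
for my $query (query_variants('ETH?RUSC')) {
    printf qq{"%s","%d"\n}, $query, count_matches($query, @sample);
}
```

The output has the same two-field shape as the list above, so counts from different variants of one stem can be compared side by side.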
A starting point for such comparisons can be seen here, as a table of names from Latin poems by Marko Marulić (Marcus Marulus): [[http://www.ffzg.unizg.hr/klafil/croala/xpr/marul-radices.html|X]].
See [[z:croala-regex-homonyms|the algorithm and the scripts]] for achieving all that's described above.
====== Querying Pannonius' texts in CroALa with a set of words found elsewhere ======
* Headings of Ravisius Textor's ([[http://www.uni-mannheim.de/mateo/camenaref/tixier.html|Jean Tixier's]]) Epitheta, presented as starting points for parallel searches in Pannonius and in CroALa (letter A): [[http://www.ffzg.unizg.hr/klafil/croala/xpr/tix-a-q.html|X]]
Note that Textor's headings with problematic words --- very short ones, or n-gram phrases --- are also included there.
===== The algorithm =====
- Produce a list of words found or compiled elsewhere (Marulić's names, Ravisius Textor's headings and epitheta)
- Prepare the list for orthographic variation:
  - stem it (keeping the original word as a CSV field)
  - treat separately stems of 1--3 characters, of 4--5, of 6--7, everything else, and PHRASES (recognizable by a space in the field)
- Query all of CroALa with items from the list (we use a [[z:croala-query-all-bash|Bash script]])
- Produce report:
  - Keep query and number of occurrences, strip everything else
  - Join (Linux ''paste'') the two CSV fields ("QUERY","NUMBER OF OCCURRENCES") to the original CSV list
- Query Pannonius' texts in CroALa with the list
- Produce report, as above
- Paste it all together (in the order: original, Pannonius, CroALa total)
- Transform CSV into HTML rows (with a [[z:croala-csv-rows|Perl script]])
- Produce an HTML report from a report template (with a [[z:croala-report-template|Perl script]])
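The step that sorts stems into separately treated groups can be sketched as below. It is only an illustration under two assumptions: the length is counted on the plain stem, before any regex operators such as ''?'' are added, and the group labels are our own. The sample stems are taken from the queries in our reports.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Assign a stem to one of the groups that are treated separately.
# A field containing a space is a phrase, whatever its length.
sub stem_class {
    my ($stem) = @_;
    return 'phrase' if $stem =~ / /;
    my $n = length $stem;
    return '1-3' if $n <= 3;
    return '4-5' if $n <= 5;
    return '6-7' if $n <= 7;
    return 'other';
}

# Collect the sample stems under their group label and print each group.
my %groups;
push @{ $groups{ stem_class($_) } }, $_
    for ('AN', 'ACIE', 'ARGENT', 'ARISTOTEL', 'ARCUS COELESTIS');
for my $class (sort keys %groups) {
    print "$class: @{ $groups{$class} }\n";
}
```

Each group can then get its own query pattern, for example a fixed number of trailing characters for the short stems and ''.*'' for the long ones.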
===== Problems and fine-tuning =====
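We also produce a simple report on the ratio of the CroALa total to the number of occurrences in Pannonius (n = CroALa / Pannonius, truncated to an integer). The lower the ratio, the more potentially interesting the result, because Pannonius then accounts for a larger share of all occurrences.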
The report:
abyssus (ABIS?S.*), linea 11: 124
accipiter (AC?CIPITE.*), linea 16: 66
acerra (ACER?R.*), linea 18: 74
achaia (ACH?A.*), linea 22: 266
achates (ACH?AT.*), linea 23: 18
achelous (ACH?ELO.*), linea 24: 29
acheron (ACH?ERON.*), linea 25: 59
achiui (ACH?IU.*), linea 27: 12
ACIES (ACIE.), linea 28: 183
ACIES (ACIE*), linea 29: 163
acron (ACRON.*), linea 37: 3
actaeon (ACTA?O?EON.*), linea 38: 10
acus (ACU), linea 42: 14
adamas (ADAM.*), linea 44: 216
addua (ADDU.*), linea 46: 685
adonis (ADON.*), linea 49: 123
aduena (ADUEN.*), linea 53: 221
adulter (ADULTE.*), linea 56: 137
adytum (ADIT.*), linea 58: 238
aeacus (A?O?EAC.*), linea 59: 12
aedes (A?O?ED.*), linea 61: 119
aeetes (A?O?EET.*), linea 63: 7
aegeria (A?O?EGER.*), linea 65: 60
aegis (A?O?EG.*), linea 67: 58
aello (A?O?EL?L.*), linea 73: 235
aeneas (A?O?ENE.*), linea 74: 48
aeneis (A?O?ENE.*), linea 75: 48
aeolus (A?O?EOL.*), linea 77: 57
aequor (A?O?EQUO.*), linea 78: 49
aerumna (A?O?ERUMN.*), linea 82: 179
aeschylus (A?O?ESCH?IL.*), linea 83: 3
aestas (A?O?EST.*), linea 86: 170
AESTUS PRO CALORE (A?ESTU*), linea 87: 102
aether (A?O?ETH?E.*), linea 90: 82
aethiope (A?O?ETH?IOP.*), linea 91: 62
aethra (A?O?ETH?R.*), linea 94: 52
aetna (A?O?ETN.*), linea 95: 12
aeua (A?O?EU.*), linea 96: 17
aeuum (A?O?EU.*), linea 97: 17
afri (AFR*), linea 99: 198
africa (AFRIC.*), linea 100: 153
africus (AFRIC.*), linea 101: 153
agamemnon (AGAMEMNON.*), linea 102: 24
agenor (AGENO.*), linea 107: 4
ager (AGER), linea 108: 55
ager (AGR*), linea 109: 127
agger (AGGE.*), linea 110: 67
agna (AGN.), linea 112: 83
agna (AGN..), linea 113: 155
agnus (AGN..), linea 114: 155
agrestes (AGREST.*), linea 115: 130
agricola (AGRICOL.*), linea 116: 74
ahenum (AH?EN.*), linea 117: 45
ala (AL.), linea 120: 170
ala (AL..), linea 121: 86
alba (ALB..), linea 125: 214
alba (ALB.), linea 126: 27
alcides (ALCID.*), linea 132: 11
alcinous (ALCINO.*), linea 133: 6
alcmaeon (ALCMA?O?EON.*), linea 134: 6
alcmena (ALCMEN.*), linea 135: 5
alea (ALE), linea 137: 24
ales (AL..), linea 140: 86
ales (ALIT...), linea 141: 19
ales (ALIT.), linea 142: 17
alexander (ALEXANDE.*), linea 144: 245
alexandria (ALEXANDR.*), linea 145: 254
alga (ALG..), linea 147: 14
alloquium (AL?LOQUI.*), linea 150: 60
alnus (ALN*), linea 153: 20
aloe (ALOE*), linea 154: 12
aloeus (ALO?A?E.*), linea 155: 127
aloidae (ALOID.*), linea 156: 2
alpes (ALP*), linea 157: 71
alpheus (ALPH?E.*), linea 158: 44
altare (ALTAR.*), linea 159: 169
alueus (ALUE.*), linea 160: 58
aluus (ALU.), linea 162: 83
aluus (ALU..), linea 163: 71
amantes (AMANT.*), linea 165: 39
amaracus (AMARAC.*), linea 166: 5
amaranthus (AMARANTH?.*), linea 167: 12
amaror (AMARO.*), linea 168: 23
amator (AMATO.*), linea 171: 113
amazones (AMAZON.*), linea 172: 39
ambitio (AMBITI.*), linea 174: 102
ambrosia (AMBROS.*), linea 175: 26
amica (AMIC.*), linea 178: 76
amicitia (AMICIT.*), linea 179: 44
amicus (AMIC.*), linea 180: 76
amnis (AMN.), linea 182: 33
amnis (AMN..), linea 183: 55
amphion (AMPH?ION.*), linea 189: 11
amphitheatrum (AMPH?ITH?EATR.*), linea 191: 10
amplexus (AMPLEX.*), linea 195: 259
amyclae (AMICL.*), linea 198: 11
amygdalus (AMIGDAL.*), linea 199: 5
anas (ANAS), linea 201: 7
anchora (ANCH?OR.*), linea 207: 45
ancus (ANC..), linea 210: 14
andromeda (ANDROMED.*), linea 211: 19
anethum (ANETH?.*), linea 212: 4
angelus (ANGEL.*), linea 214: 735
anglia (ANGL.*), linea 215: 946
angli (ANGL.*), linea 216: 946
anguilla (ANGUIL?L.*), linea 218: 7
anguis (ANGU.*), linea 219: 66
angulus (ANGUL.*), linea 220: 244
anima (ANIM.*), linea 223: 193
animus (ANIM.*), linea 226: 193
annales (AN?NAL.*), linea 227: 149
ANNA soror Didus (ANNA?[AE]|ANN[AE][QNU]*), linea 228: 369
annus (AN?N.*), linea 233: 112
anser (ANSE.*), linea 236: 19
antenor (ANTENO.*), linea 237: 20
anteus (ANTE.*), linea 238: 83
antrum (ANTR.*), linea 244: 35
anus (AN), linea 246: 56
apelles (APEL?L.*), linea 247: 118
aper (APER), linea 248: 16
aper (APR..), linea 249: 44
aper (APR.), linea 250: 12
apes (APE.), linea 252: 30
apes (APU.), linea 253: 281
apex (APEX.*), linea 254: 72
apicius (APIC.*), linea 255: 92
apis (API.), linea 258: 60
apollo (APOL?L.*), linea 262: 20
apostoli (APOSTOL.*), linea 264: 310
aqua (AQU..), linea 268: 70
aqua (AQU.), linea 269: 142
aquila (AQUIL.*), linea 273: 374
aquilo (AQUIL.*), linea 274: 374
ara (AR..), linea 275: 48
ara (AR.), linea 276: 111
arabes (ARAB.*), linea 277: 41
ARAR SIUE ARARIS (ARAR.), linea 282: 21
arator (ARATO.*), linea 283: 30
aratrum (ARATR.*), linea 284: 86
aratus (ARAT.*), linea 285: 40
araxes (ARAX.*), linea 286: 12
arbor (ARBI.*), linea 287: 194
arbustum (ARBUST.*), linea 288: 36
arca (ARC..), linea 290: 71
arca (ARC.), linea 291: 71
arcades (ARCAD.*), linea 292: 116
arces (ARCE.), linea 293: 33
arces (ARCI...), linea 295: 26
arctos (ARCT.*), linea 299: 49
ARCUS COELESTIS (ARCU.+CO?A?EL*), linea 301: 1
ardor (ARDO.*), linea 302: 76
area (ARE.), linea 303: 143
area (ARE..), linea 304: 33
arena (AREN.*), linea 305: 53
argentum (ARGENT.*), linea 307: 268
argi (ARG.), linea 309: 16
argi (ARG..), linea 310: 16
argilla (ARGIL?L.*), linea 312: 15
arion (ARION.*), linea 324: 29
ARION EQUUS (ARION*), linea 325: 29
arista (ARIST.*), linea 327: 201
aristoteles (ARISTOTEL.*), linea 331: 343
arma (ARM.), linea 332: 50
arma (ARMI.), linea 333: 56
arma (ARMOR*), linea 334: 182
armenia (ARMEN.*), linea 335: 205
armentum (ARMENT.*), linea 337: 130
arnus (ARN.), linea 340: 12
arrius (AR?R.*), linea 345: 65
artemisia (ARTEMIS.*), linea 346: 12
arundo (ARUND.*), linea 352: 106
ARUNDO PRO IACULO (ARUND*), linea 353: 106
aruum (ARU..), linea 356: 44
aruum (ARU.), linea 357: 31
aruum (ARUI.), linea 358: 41
ascra (ASCR.*), linea 361: 119
asia (ASIA*), linea 363: 288
asinus (ASIN.*), linea 366: 136
aspectus (ASPECT.*), linea 369: 267
aspis (ASPID*), linea 372: 55
astra (ASTR.*), linea 377: 36
astraea (ASTRA?O?E.*), linea 378: 63
astus (AST.), linea 380: 34
astutia (ASTUT.*), linea 381: 28
athenae (ATH?EN.*), linea 388: 258
athesis (ATH?ES.*), linea 390: 10
athos (ATH?.*), linea 392: 142
atlas (ATH?LANT*), linea 395: 13
atlas (ATH?LAS*), linea 396: 36
atomi (ATOM.*), linea 397: 22
atria (ATRI.), linea 399: 208
atria (ATRI..), linea 400: 52
atridae (ATRID.*), linea 402: 21
atropos (ATROP.*), linea 403: 5
auceps (AUCEP.*), linea 409: 8
aucupium (AUCUPI.*), linea 410: 12
audacia (AUDAC.*), linea 411: 154
auena (AUEN.*), linea 413: 81
auernus (AUERN.*), linea 414: 40
augur (AUGU.*), linea 416: 221
augustus (AUGUST.*), linea 419: 200
aula (AUL..), linea 424: 233
aula (AUL.), linea 425: 78
aulaea (AULA?O?E.*), linea 427: 188
aulis (AULID*), linea 428: 3
aura (AUR.), linea 431: 76
aura (AUR..), linea 432: 76
aures (AURE.), linea 433: 67
aures (AURIB..), linea 435: 98
aurora (AUROR.*), linea 438: 254
ausonia (AUSON.*), linea 440: 9
ausonii (AUSONI.*), linea 441: 9
auspicium (AUSPICI.*), linea 443: 58
auster (AUSTE.*), linea 444: 68
ausus (AUS.), linea 445: 36
ausus (AUS....), linea 446: 55
ausus (AUS..), linea 447: 61
autolycus (AUTOLIC.*), linea 449: 1
autumnus (AUTUMN.*), linea 450: 21
auus (AU..), linea 451: 52
auus (AU.), linea 452: 257
axis (AXE), linea 455: 53
axis (AXI.), linea 456: 50
This is done via the following Perl code, applied to a CSV file:
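  #!/usr/bin/perl
  # csv-2col.pl - report on columns and ratio col5/col3
  use strict;
  use warnings;
  use Text::CSV;

  my $file = shift or die "filename!\n";
  # binary => 1 lets the parser accept non-ASCII characters in the fields
  my $csv = Text::CSV->new({ binary => 1 })
      or die 'Cannot use Text::CSV: ' . Text::CSV->error_diag();
  open(my $fh, '<', $file) or die "Could not open '$file': $!\n";
  while (my $line = <$fh>) {
      chomp $line;
      if ($csv->parse($line)) {
          my @columns = $csv->fields();
          # skip rows whose column 3 (the divisor) is empty, zero or not a number
          if (defined $columns[3] && $columns[3] =~ /^\d+$/ && $columns[3] > 0) {
              # word, query, input line number, truncated ratio col5/col3
              print $columns[0], " (", $columns[1], "), linea ", $., ": ",
                  int(($columns[5] || 0) / $columns[3]), "\n";
          }
      } else {
          my $err = $csv->error_input;
          warn "Failed to parse line $.: $err\n";
      }
  }
  close($fh);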
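
Since the lower ratios are the more interesting ones, the report lines can also be ranked by ratio. The following is a sketch of one way to do that, assuming the line shape shown in the report above (word, query in parentheses, ''linea'' number, ratio); ''parse_report_line'' and ''rank_by_ratio'' are illustrative names, and the four sample lines are copied from the report.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Parse one report line into word, query, line number and ratio.
# Returns nothing if the line does not have the expected shape.
sub parse_report_line {
    my ($line) = @_;
    return unless $line =~ /\A(.+?) \((.*)\), linea (\d+): (\d+)\s*\z/;
    return +{ word => $1, query => $2, line => $3, ratio => $4 };
}

# Sort parsed entries by ascending ratio, then alphabetically by word.
sub rank_by_ratio {
    my @entries = @_;
    return sort { $a->{ratio} <=> $b->{ratio} or $a->{word} cmp $b->{word} } @entries;
}

my @report = (
    'anglia (ANGL.*), linea 215: 946',
    'autolycus (AUTOLIC.*), linea 449: 1',
    'aloidae (ALOID.*), linea 156: 2',
    'acron (ACRON.*), linea 37: 3',
);
my @ranked = rank_by_ratio(grep { defined } map { parse_report_line($_) } @report);
print "$_->{ratio}\t$_->{word}\t$_->{query}\n" for @ranked;
```

Lines that do not parse are silently dropped here; a stricter version could warn about them instead.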