This is done with a Bash script and a Perl script; the Perl script uses the Lingua::LA::Stemmer module.
Here is the zacroala.sh script:
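#!/bin/bash
# Jovanovic, 2012-10, format a list of words for CroALa orthographic search
# usage: ./zacroala.sh filename
# Pipeline: stem the words, sort them by length, strip spaces, uppercase,
# map J and Y to I and V to U, turn AE/OE into [AO]?E, make the second of
# two doubled consonants optional, make H optional, allow an optional H
# after T, and append the * wildcard to every line.
# The result is appended to filename-zacroala.
# take argument, find file
file=$1
./lastem.pl "${file}" \
| awk '{ print length(), $0}' \
| sort -n \
| awk '{$1=""; print $0}' \
| sed 's/ //g' \
| tr '[:lower:]' '[:upper:]' \
| tr "JY" "II" \
| tr "V" "U" \
| sed 's/\([AO]\)E/[AO]?E/g' \
| sed 's/\([BCDFGHLMNPRST]\)\1/\1?\1/g' \
| sed 's/H/H?/g' \
| sed 's/T\([^TH?]\)/TH?\1/g' \
| sed 's/\(.*\)/\1*/g' - >> "${file}-zacroala"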
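To see what the orthographic stages of zacroala.sh do to a single word, the sketch below applies only the sed and tr steps to words read from standard input. It skips the stemming and the length sort, so it does not need lastem.pl. The function name croala_pattern and the sample words are made up for this illustration, and J and Y are mapped with an explicit "II" so that the result does not depend on how a given tr pads a short second set.

```shell
#!/bin/bash
# croala_pattern is a hypothetical helper, not part of the original scripts.
# It applies the orthographic stages of zacroala.sh to words on stdin,
# leaving out the stemming and length-sorting steps.
croala_pattern() {
  sed 's/ //g' \
  | tr '[:lower:]' '[:upper:]' \
  | tr 'JY' 'II' \
  | tr 'V' 'U' \
  | sed 's/\([AO]\)E/[AO]?E/g' \
  | sed 's/\([BCDFGHLMNPRST]\)\1/\1?\1/g' \
  | sed 's/H/H?/g' \
  | sed 's/T\([^TH?]\)/TH?\1/g' \
  | sed 's/\(.*\)/\1*/g'
}

# Sample (unstemmed) words, chosen only to trigger each rule once.
printf '%s\n' caelum littera thesaurus juvenis | croala_pattern
# C[AO]?ELUM*
# LIT?TH?ERA*
# TH?ESAURUS*
# IUUENIS*
```

Note that the doubled T in littera first becomes T?T, and only the second T then receives the optional H, because the T rule skips any T already followed by T, H or a question mark.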
Here is the lastem.pl script, called from zacroala.sh:
#!/usr/bin/perl
# lastem.pl - read in a file of latin words, turn it into an array, print out the stems
# synopsis: lastem.pl somefile
# Jovanovic, 5/10/2012
use strict;
use warnings;
use Lingua::LA::Stemmer;
use File::Slurp 'read_file';
# give us a file:
my $fname = shift or die 'filename!';
# turn it into an array with slurp:
my @words = read_file $fname;
# stem the array (hard reference...):
my $stems = Lingua::LA::Stemmer::stem(\@words);
# print the result:
print "$_ \n" for (@$stems);
Here is the zacr-rez.sh script:
#!/bin/bash
# Jovanovic, 2012-10, transforms a list of results into live links for CroALa
# usage: ./zacr-rez.sh filename
# take argument, find file
file=$1
sort ${file} \
| sed 's#^\(.*\)\( =.*\)#<li><a href="http://www.ffzg.unizg.hr/klafil/croala/cgi-bin/search3t?dbname=croala\&word=\1\&OUTPUT=TF">\1</a>\2#g' > ${file}.html
# end of script
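As a rough illustration of the link-building step in zacr-rez.sh, the sketch below runs the same sort and sed substitution on standard input instead of a file. The function name results_to_links is invented here, and the input line format (a search pattern, then " = " and a count) is an assumption based only on the " =" that the sed expression looks for.

```shell
#!/bin/bash
# results_to_links is a hypothetical helper, not part of the original scripts.
# It wraps the text before the last " =" of each line in a CroALa search link,
# using the same substitution as zacr-rez.sh.
results_to_links() {
  sort \
  | sed 's#^\(.*\)\( =.*\)#<li><a href="http://www.ffzg.unizg.hr/klafil/croala/cgi-bin/search3t?dbname=croala\&word=\1\&OUTPUT=TF">\1</a>\2#g'
}

# Assumed input format: "PATTERN = count"
printf '%s\n' 'AMAT* = 7' | results_to_links
# <li><a href="http://www.ffzg.unizg.hr/klafil/croala/cgi-bin/search3t?dbname=croala&word=AMAT*&OUTPUT=TF">AMAT*</a> = 7
```

The ampersands are escaped as \& in the replacement because a bare & in a sed replacement stands for the whole matched text. Lines without " =" pass through unchanged.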