How to internationalize strings that are built up from multiple parts? - language-agnostic

Say I want to ask the user to confirm an action. The action consists of three parts. Each of the three parts can be formatted in one of two ways. In a human-language-specific way, I might do something like this (pseudocode):
res = ""
res = "Do you want to %1 at %2, %3 at time %4%5?" % (
"foo" if fooing else "bar", foobar_data,
"expiring" if expiring else "starting", the_time,
", without foobing" if no_foob else (", foobing at %1" % when_foob))
Even if I wrap all translatable strings with the translation function (e.g. tr("Do you want to %1 at %2 ...")), this would probably only work for English, since other languages are unlikely to have the same syntactic structure.
But if I write out the whole sentences then I get a combinatorial explosion:
if fooing and expiring and no_foob:
    res = "Do you want to foo at %1, expiring at time %2, without foobing?"
elif fooing and expiring and not no_foob:
    res = "Do you want to foo at %1, expiring at time %2, foobing at %3?"
elif fooing and not expiring and no_foob:
    res = "Do you want to foo at %1, starting at time %2, without foobing?"
# etc ...
res = res % (foobar_data, the_time, when_foob) # account for sometimes not having when_foob somehow
What's the standard way to deal with this situation?

I think it's best to avoid such complex sentences altogether. Instead, provide one simple sentence and add further details in a table-like way (tables are easier to translate).
This has the added bonus that you can hide the details by default (progressive disclosure) and keep the main message short. This increases the chance that your users read the confirmation dialog at all.
Mock-up:
Do you really want to <Foo/Bar>?
[ Cancel ] [ <Foo/Bar> ] [ Details... ]
---------------------------------------------
Time: <Expiring at %1 / Starting at %1>
Foobing: <Yes/No>

Even if I wrap all translatable strings with the translation function, this would probably only work for English since other languages are unlikely to have the same syntactic structure.
Yes and no. You can obviously switch the sentence and parameters around to fit the target language:
Wollen Sie am %2 %1, %3 %4%5?
The tricky part obviously is to get the declensions and such right; while the words themselves might not change at all in English when swapping them around, they may have to be altered heavily in other languages.
For this it's important to translate all the necessary variations, and be able to annotate them with context in your translation system:
tr('foo', context='infinitive') if fooing else tr('bar', context='infinitive')
tr('expiring', context='verb before date') if expiring else tr('starting', context='verb before date')
The PO file format for instance has this notion built in:
msgctxt "verb before date"
msgid "expiring"
msgstr "wird auslaufen am"
It does take some linguistic knowledge to break up sentences correctly and to classify and annotate each term so it can be translated correctly into all languages as necessary, and it takes a lot of QA to ensure it's being done correctly. You want to find the right balance between snippets small enough that translations can be reused and breaking things down enough that every part can be translated correctly.
You can also use message ids, if that becomes too complex:
res = "question.confirmActivty" % (..)
msgid "question.confirmActivty"
msgstr "Do you want to %1 at %2, %3 at time %4%5?"

COBOL .csv File IO into Table Not Working

I am trying to learn COBOL as I have heard of it and thought it would be fun to take a look at. I came across Micro Focus COBOL, not really sure if that is pertinent to this post though, and since I like to write in Visual Studio it was enough incentive to try and learn it.
I've been reading a lot about it and trying to follow documentation and examples. So far I've gotten user input and output to the console working, so then I decided to try file IO. That went OK when I was just reading in a 'record' at a time; I realize that 'record' may be incorrect jargon. Although I've been programming for a while, I am an extreme noob with COBOL.
I have a C++ program that I have written before that simply takes a .csv file, parses it, and then sorts the data by whatever column the user wants. I figured it wouldn't be too hard to do the same in COBOL. Well, apparently I have misjudged in this regard.
I have a file, edited in windows using notepad++, called test.csv which contains:
4001942600,140,4
4001942700,141,3
4001944000,142,2
This data is from the US census, which has column headers titled: GEOID, SUMLEV, STATE. I removed the header row since I couldn't figure out how to read it in at the time, and then read in the other data. Anywho...
In Visual Studio 2015, on Windows 7 Pro 64-bit, using Micro Focus, and step debugging, I can see in-record containing the first row of data. The unstring works fine for that run, but the next time the program 'loops' I can step-debug and view in-record and see it contains the new data; however, the watch display, when I expand the watch elements, looks like the following:
REC-COUNTER 002 PIC 9(3)
+ IN-RECORD {Length = 42} : "40019427004001942700 000 " GROUP
- GEOID {Length = 3} PIC 9(10)
GEOID(1) 4001942700 PIC 9(10)
GEOID(2) 4001942700 PIC 9(10)
GEOID(3) <Illegal data in numeric field> PIC 9(10)
- SUMLEV {Length = 3} PIC 9(3)
SUMLEV(1) <Illegal data in numeric field> PIC 9(3)
SUMLEV(2) 000 PIC 9(3)
SUMLEV(3) <Illegal data in numeric field> PIC 9(3)
- STATE {Length = 3} PIC X
STATE(1) PIC X
STATE(2) PIC X
STATE(3) PIC X
So I'm not sure why, just before the Unstring operation the second time around, I can see the proper data, but after the unstring happens incorrect data is stored in the 'table'. What is also interesting is that if I continue on, the third time around the correct data is stored in the 'table'.
identification division.
program-id.endat.
environment division.
input-output section.
file-control.
select in-file assign to "C:/Users/Shittin Kitten/Google Drive/Embry-Riddle/Spring 2017/CS332/group_project/cobol1/cobol1/test.csv"
organization is line sequential.
data division.
file section.
fd in-file.
01 in-record.
05 record-table.
10 geoid occurs 3 times pic 9(10).
10 sumlev occurs 3 times pic 9(3).
10 state occurs 3 times pic X(1).
working-storage section.
01 switches.
05 eof-switch pic X value "N".
* declaring a local variable for counting
01 rec-counter pic 9(3).
* Defining constants for new line and carriage return. \n \r DNE in cobol!
78 NL value X"0A".
78 CR value X"0D".
78 TAB value X"09".
******** Start of Program ******
000-main.
open input in-file.
perform
perform 200-process-records
until eof-switch = "Y".
close in-file;
stop run.
*********** End of Program ************
******** Start of Paragraph 2 *********
200-process-records.
read in-file into in-record
at end move "Y" to eof-switch
not at end compute rec-counter = rec-counter + 1;
end-read.
Unstring in-record delimited by "," into
geoid in record-table(rec-counter),
sumlev in record-table(rec-counter),
state in record-table(rec-counter).
display "GEOID " & TAB &">> " & TAB & geoid of record-table(rec-counter).
display "SUMLEV >> " & TAB & sumlev of record-table(rec-counter).
display "STATE " & TAB &">> " & TAB & state of record-table(rec-counter) & NL.
************* End of Paragraph 2 **************
I'm very confused about why I can actually see the data after the read operation, but it isn't stored in the table. I have tried changing the declarations of the table to pic 9(some length) as well and the result changes but I can't seem to pinpoint what I'm not getting about this.
I think there are a few things you've not grasped yet, and which you need to.
In the DATA DIVISION, there are a number of SECTIONs, each of which has a specific purpose.
The FILE SECTION is where you define data structures which represent data on files (input, output or input-output). Each file has an FD, and subordinate to an FD will be one or more 01-level structures, which can be extremely simple, or complex.
Some of the exact behaviour is down to the particular implementation of a compiler, but you should treat things this way, for your own "minimal surprise" and for the sake of anyone who has to later amend your programs: for an input file, don't change the data after a READ, unless you are going to update the record (or if you are using a keyed READ, perhaps). You can regard the "input area" as a "window" on your data-file. The next READ, and the window is pointed to a different position. Alternatively, you can regard it as "the next record arrives, obliterating what was there previously". You have put the "result" of your UNSTRING into the record-area. The result will for sure disappear on the next read. You have the possibility (if the window is true for your compiler, and depending on the mechanism it uses for IO) of squishing the "following" data as well.
Your result should be in the WORKING-STORAGE, where it will remain undisturbed by new records being read.
READ filename INTO data-description is an implicit MOVE of the data from the record-area to data-description. If, as you have specified, data-description is the record-area, the result is "undefined". If you only want the data in the record-area, just a plain READ filename is all that is needed.
You have a similar issue with your original UNSTRING. You have the source and target fields referencing the same storage. "Undefined" and not the result you want. This is why the unnecessary UNSTRING "worked".
You have a redundant inline PERFORM. You process "something" after end-of-file. You make things more convoluted by using unnecessary "punctuation" in the PROCEDURE DIVISION (which you've apparently omitted to paste). Try using ADD instead of COMPUTE there. Look at the use of FILE STATUS, and of 88-level condition-names.
You don't need a "new line" for DISPLAY, because you get one for free unless you use NO ADVANCING.
You don't need to "concatenate" in the DISPLAY, because you get that for free as well.
DISPLAY and its cousin, ACCEPT, are the verbs (only intrinsic functions are functions in COBOL (except where your compiler supports user-defined functions)) which vary the most from compiler to compiler. If your compiler supports SCREEN SECTION in the DATA DIVISION you can format and process user-input in "screens". If you were to use IBM's Enterprise COBOL you'd have very basic DISPLAY/ACCEPT.
You "declare a local variable". Do you? In what sense? Local to the program.
You can pick up quite a lot of tips by looking at COBOL questions here from the last few years.
Well I figured it out. While step debugging again, and hovering the mouse over record-table, I noticed 26 white spaces present after the last data field. Now, earlier tonight I attempted to change this data on the 'fly', as it were, because normally Visual Studio allows this. I attempted to make the change but did not verify that it took; normally I don't have to, but apparently it did not take. Now I should have known better, since the icon displayed to the left of record-table shows a little closed padlock.
I normally program in C, C++, and C#, so when I see the little padlock it usually has something to do with scoping and visibility. Not knowing COBOL well enough, I overlooked this little detail.
Now I decided to add an "unstring in-record delimited by spaces into temp-string." just prior to the
Unstring temp-string delimited by "," into
geoid in record-table(rec-counter),
sumlev in record-table(rec-counter),
state in record-table(rec-counter).
The result of this was the properly formatted data, at least as I understand it, stored into the table and printed to the console screen.
Now I have read that the unstring 'function' can utilize multiple delimiters, so I may try to combine these two unstring operations into one.
Cheers!
**** Update ****
I have read Mr. Woodger's reply below, and I would like to ask for a bit more assistance with this. I have also read this post, which is similar but above my level at this time: COBOL read/store in table
That is pretty much what I'm trying to do, but I don't understand some of the things Mr. Woodger is trying to explain. Below is the code, a bit more refined, with some questions I have as comments. I would very much like some assistance with this, or maybe if I could have an offline conversation that would be fine too.
identification division.
* I do not know what 'endat' is
program-id.endat.
environment division.
input-output section.
file-control.
* assign a file path to in-file
select in-file assign to "C:/Users/Shittin Kitten/Google Drive/Embry-Riddle/Spring 2017/CS332/group_project/cobol1/cobol1/test.csv"
* Is line sequential what I need here? I think it is
organization is line sequential.
* Is the data division similar to a typedef in C?
data division.
* Does the file section belong to the data division?
file section.
* Am I doing this correctly? Should this be below?
fd in-file.
* I believe I am defining a structure at this point
01 in-record.
05 record-table.
10 geoid occurs 3 times pic A(10).
10 sumlev occurs 3 times pic A(3).
10 state occurs 3 times pic A(1).
* To me the working-storage section is similar to ADA declarative section
* is this a correct analogy?
working-storage section.
* Is this where in-record should go? Is in-record a representative name?
01 eof-switch pic X value "N".
01 rec-counter pic 9(1).
* I don't know if I need these
78 NL value X"0A".
78 TAB value X"09".
01 sort-col pic 9(1).
********************************* Start of Program ****************************
*Now the procedure division, this is a lot like Ada to me
procedure division.
* Open the file
perform 100-initialize.
* Read data
perform 200-process-records
* loop until eof
until eof-switch = "Y".
* ask user to sort by a column
display "Would which column would you like to bubble sort? " & TAB.
* get user input
accept sort-col.
* close file
perform 300-terminate.
* End program
stop run.
********************************* End of Program ****************************
******************************** Start of Paragraph 1 ************************
100-initialize.
open input in-file.
* Performing a read, what is the difference in this read and the next one
* paragraph 200? Why do I do this here instead of just opening the file?
read in-file
at end
move "Y" to eof-switch
not at end
* Should I do this addition here? Also why a semicolon?
add 1 to rec-counter;
end-read.
* Should I not be unstringing here?
Unstring in-record delimited by "," into geoid of record-table,
sumlev of record-table, state of record-table.
******************************** End of Paragraph 1 ************************
********************************* Start of Paragraph 2 **********************
200-process-records.
read in-file into in-record
at end move "Y" to eof-switch
not at end add 1 to rec-counter;
end-read.
* Should in-record be something else? I think so but don't know how to
* declare and use it
Unstring in-record delimited by "," into
geoid in record-table(rec-counter),
sumlev in record-table(rec-counter),
state in record-table(rec-counter).
* These lines seem to give the printed format that I want
display "GEOID " & TAB &">> " & TAB & geoid of record-table(rec-counter).
display "SUMLEV >> " & TAB & sumlev of record-table(rec-counter).
display "STATE " & TAB &">> " & TAB & state of record-table(rec-counter) & NL.
********************************* End of Paragraph 2 ************************
********************************* Start of Paragraph 3 ************************
300-terminate.
display "number of records >>>> " rec-counter;
close in-file;
**************************** End of Paragraph 3 *****************************

How can I make nltk.NaiveBayesClassifier.train() work with my dictionary

I'm currently making a simple spam/ham email filter using Naive Bayes.
For you to understand my algorithm logic: I have a folder with lots of files, which are examples of spam/ham emails. I also have two other files in this folder, one containing the titles of all my ham examples and another with the titles of all my spam examples. I organized it like this so I can open and read these emails properly.
I'm putting all the words I judge to be important in a dictionary structure, with a label "spam" or "ham" depending on which kind of file I extracted them from.
Then I'm using nltk.NaiveBayesClassifier.train() so I can train my classifier, but I'm getting the error:
for featureset, label in labeled_featuresets:
ValueError: too many values to unpack
I don't know why this is happening. When I looked for a solution, I found that strings are not hashable, and I was using a list to do it, so I turned it into a dictionary, which is hashable as far as I know, but I keep getting this error.
Does someone know how to solve it? Thanks!
All my code is listed below:
import nltk
import re
import random
stopwords = nltk.corpus.stopwords.words('english') #Words I should avoid since they have weak value for classification
my_file = open("spam_files.txt", "r") #my_file now has the name of each file that contains a spam email example
word = {} #a dictionary where I will store all the words and which value they have (spam or ham)
for lines in my_file: #for each name of file (which will be represented by LINES) of my_file
    with open(lines.rsplit('\n')[0]) as email: #I will open the file pointed by LINES, and then, read the email example that is inside this file
        for phrase in email: #After that, I will take every phrase of this email example I just opened
            try: #and I'll try to tokenize it
                tokens = nltk.word_tokenize(phrase)
            except:
                continue #I will ignore non-ascii elements
            for c in tokens: #for each token
                regex = re.compile('[^a-zA-Z]') #I will also exclude numbers
                c = regex.sub('', c)
                if (c): #If there is any element left
                    if (c not in stopwords): #And if this element is not a stopword
                        c.lower()
                        word.update({c: 'spam'}) #I put this element in my dictionary. Since I'm analysing spam examples, variable C is labeled "spam".
my_file.close()
email.close()
#The same logic is used for the ham emails. Since my ham emails contain only ascii elements, I don't test it with TRY
my_file = open("ham_files.txt", "r")
for lines in my_file:
    with open(lines.rsplit('\n')[0]) as email:
        for phrase in email:
            tokens = nltk.word_tokenize(phrase)
            for c in tokens:
                regex = re.compile('[^a-zA-Z]')
                c = regex.sub('', c)
                if (c):
                    if (c not in stopwords):
                        c.lower()
                        word.update({c: 'ham'})
my_file.close()
email.close()
#And here I train my classifier
classifier = nltk.NaiveBayesClassifier.train(word)
classifier.show_most_informative_features(5)
nltk.NaiveBayesClassifier.train() expects “a list of tuples (featureset, label)” (see the documentation of the train() method)
What is not mentioned there is that featureset should be a dict of feature names mapped to feature values.
So, in a typical spam/ham classification with a bag-of-words model, the labels are 'spam'/'ham' or 1/0 or True/False;
the feature names are the occurring words and the values are the number of times each word occurs.
For example, the argument to the train() method might look like this:
[({'greetings': 1, 'loan': 2, 'offer': 1}, 'spam'),
({'money': 3}, 'spam'),
...
({'dear': 1, 'meeting': 2}, 'ham'),
...
]
If your dataset is rather small, you might want to replace the actual word counts with 1, to reduce data sparsity.
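For instance, your collection loops could build such a list directly instead of filling one flat word dict. Here is a minimal sketch of that restructuring (the bag_of_words helper is just an illustrative name, not part of NLTK; the file-list names are reused from your code):
import re
from collections import Counter
import nltk

stopwords = set(nltk.corpus.stopwords.words('english'))
only_letters = re.compile('[^a-zA-Z]')

def bag_of_words(path):
    # Map each interesting word in one email file to the number of times it occurs.
    counts = Counter()
    with open(path) as email:
        for phrase in email:
            try:
                tokens = nltk.word_tokenize(phrase)
            except Exception:
                continue  # skip lines that cannot be tokenized
            for token in tokens:
                token = only_letters.sub('', token).lower()
                if token and token not in stopwords:
                    counts[token] += 1
    return dict(counts)

labeled_featuresets = []
for list_file, label in (("spam_files.txt", "spam"), ("ham_files.txt", "ham")):
    with open(list_file) as names:
        for name in names:
            labeled_featuresets.append((bag_of_words(name.strip()), label))

classifier = nltk.NaiveBayesClassifier.train(labeled_featuresets)
classifier.show_most_informative_features(5)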

if all else fails tcl script fails

I am trying to make a script to transfer a file to another device. Since I cannot account for every error that may occur, I am trying to make an if-all-else-fails situation:
spawn scp filename login#ip:filename
expect "word:"
send "password"
expect {
    "100" {
        puts "success"
    } "\*" {
        puts "Failed"
    }
}
This always returns a Failed message and does not even transfer the file, whereas this piece of code:
spawn scp filename login#ip:filename
expect "word:"
send "password"
expect "100"
puts "success"
shows the transfer of the file and prints a success message.
I can't understand what is wrong with my if-expect statement in the first piece of code.
The problem is because of \*. The backslash will be translated by Tcl, thereby turning \* into * alone, which is then passed to expect as
expect *
As you know, * matches anything. This is like saying, "I don't care what's in the input buffer. Throw it away." This pattern always matches, even if nothing is there. Remember that * matches anything, and the empty string is anything! As a corollary of this behavior, this command always returns immediately. It never waits for new data to arrive. It does not have to since it matches everything.
I don't know why you have used *. If your intention is to match a literal asterisk sign, then use \\*.
The string \\* is translated by Tcl to \*. The pattern matcher then interprets the \* as a request to match a literal *.
expect "*" ;# matches * and? and X and abc
expect "\*" ;# matches * and? and X and abc
expect "\\*" ;# matches * but not? or X or abc
Just remember two rules:
Tcl translates backslash sequences.
The pattern matcher treats backslashed characters as literals.
Note: Apart from the question, one observation. You are referring to your expect block as an if-else block. It is not the same as an if-else block.
The reason is, in a traditional if-else block, we know for sure that at least one of the blocks will be executed. But, in expect, that is not the case. It is more like multiple independent if blocks.

Any difference of double quote a variable?

For the following code:
set str "a bb ccc"
if {[string first bb "$str"] >= 0} {
puts "yes"
}
My colleague said I should not double-quote $str because there is a performance difference, something like Tcl making a new object internally for "$str".
I cannot find a convincing document on this. Do you know if the claim is accurate?
Your colleague is actually wrong, as Tcl's parser is smart enough to know that "$str" is identical to $str. Let's look at the bytecode generated (this is with Tcl 8.6.0, but the part that we're going to look at in detail is actually the same in older versions all the way back to 8.0a1):
% tcl::unsupported::disassemble script {
set str "a bb ccc"
if {[string first bb "$str"] >= 0} {
puts "yes"
}
}
ByteCode 0x0x78710, refCt 1, epoch 15, interp 0x0x2dc10 (epoch 15)
Source "\nset str \"a bb ccc\"\nif {[string first bb \"$str\"] >= 0} "
Cmds 4, src 74, inst 37, litObjs 7, aux 0, stkDepth 2, code/src 0.00
Commands 4:
1: pc 0-5, src 1-18 2: pc 6-35, src 20-72
3: pc 15-20, src 25-46 4: pc 26-31, src 61-70
Command 1: "set str \"a bb ccc\""
(0) push1 0 # "str"
(2) push1 1 # "a bb ccc"
(4) storeScalarStk
(5) pop
Command 2: "if {[string first bb \"$str\"] >= 0} {\n puts \"yes\"\n}"
(6) startCommand +30 2 # next cmd at pc 36, 2 cmds start here
Command 3: "string first bb \"$str\""
(15) push1 2 # "bb"
(17) push1 0 # "str"
(19) loadScalarStk
(20) strfind
(21) push1 3 # "0"
(23) ge
(24) jumpFalse1 +10 # pc 34
Command 4: "puts \"yes\""
(26) push1 4 # "puts"
(28) push1 5 # "yes"
(30) invokeStk1 2
(32) jump1 +4 # pc 36
(34) push1 6 # ""
(36) done
As you can see (look at (17)–(19)), the "$str" is compiled to a push of the name of the variable and a dereference (loadScalarStk). That's the most optimal sequence given that there's no local variable table (i.e., we're not in a procedure). The compiler doesn't do non-local optimizations.
I think your colleague is correct: if Tcl sees plain $str where a word is expected, it parses out "str" as the name of a variable, looks it up in the appropriate scope, extracts the internal object representing its value from that variable and then asks that object to produce the string representation of that value. At this point that string representation will either already be available and cached in the object (as it will be in your case), or it will be transparently generated by the object, and cached.
If you put dereferencing of a variable ($str) in a double quoted string, then Tcl goes like this: when it sees the first " in a place where a word is expected, it enters a mode where it would parse the following characters, performing variable- and command substitutions as it goes until it sees the next unescaped ", at which point the substituted text accumulated since the opening " is considered to be one word and it ends up being in a (newly created) internal object representing that word's value.
As you can see, in the second (your) case the original object holding the value of the variable named "str" will be asked for its value, which is then used to construct another value, while in the first case the original value would be used right away.
Now there's a more subtle matter. For the scripts it evaluates, Tcl only guarantees that its interpreter obeys certain evaluation rules, and nothing more; everything else is an implementation detail. These details might change from version to version; for instance, in Tcl 8.6 the engine was reimplemented using non-recursive evaluation (NRE), and while those were rather radical changes to the Tcl internals, your existing scripts did not notice.
What I'm leading you to is that discussing implicit performance "hacks" such as the one at hand only makes sense when applied to a particular version of the runtime. I very much doubt Tcl currently optimizes "$str" down to just re-using the object from $str, but it could eventually start to, in theory.
The real "problem" with your approach is not performance degradation but rather an apparent self-delusion you seem to apply to yourself which leads to Tcl code of dubious style. Let me explain. Contrary to "more conventional" languages (usually influenced by C and the like), Tcl does not have special syntax for strings. This is because it does not have string literals: every value starting its life in a script from a literal is initially a string. The actual type of any value is defined at runtime by commands operating on those values. To demonstrate, set x 10; incr x will put a string "10" to a variable named "x", and then the incr command will force the value in that variable "x" to convert the string "10" it holds to an integer (of value 10); then this integer will be incremented by 1 (producing 11) invalidating the string representation as a side effect. If you later will do puts $x, the string representation will be regenerated from the integer (producing "11"), cached in the value and then printed.
Hence the code style you adopted actually tries to make Tcl code look more like Python (or Perl, or whatever your previous language was) for no real gain, and it also looks alien to seasoned Tcl developers. Both double quotes and curly braces are used in Tcl for grouping, not for producing string values and code blocks, respectively; these are just particular use cases for different ways of grouping. Consider reading this thread for more background.
Update: various types of grouping are very well explained in the tutorial which is worth reading as a whole.

Auto building an output list of possible words from OCR'ing errors based on a given population of words

I wonder if, in Perl/MySQL, it is possible to build a list of variant words, based on a given word, to which the common OCR errors have been applied (i.e. 8 instead of b)? In other words, if I have a list of words, and in that list is the word "Alphabet", then is there a way to extend or build a new list to include my original word plus the OCR-error variants of "Alphabet"? So in my output, I could have the following variants of Alphabet perhaps:
Alphabet
A1phabet
Alpha8et
A1pha8et
Of course it would be useful to code for most if not all of the common errors that appear in OCR'ed text. Things like 8 instead of b, or 1 instead of l. I'm not looking to fix the errors, because my data itself could have OCR errors, but I want to create a variant list of words as my output based on a list of words I give it as an input. So in my data, I may have Alpha8et, but if I do a simple search for Alphabet, it won't find this obvious error.
My quick and dirty MySQL approach
Select * from
(SELECT Word
FROM words
union all
-- Rule 1 (8 instead of b)
SELECT
case
when Word regexp 'b|B' = 1
then replace(replace(Word, 'B','8'),'b','8')
end as Word
FROM words
union all
-- Rule 2 (1 instead of l)
SELECT
case
when Word regexp 'l|L' = 1
then replace(replace(Word, 'L','1'),'l','1')
end as Word
FROM words) qry
where qry.Word is not null
order by qry.Word;
I'm thinking there must be a more automated and cleaner method
If you have examples of scanned texts with both the as-scanned (raw) version, and the corrected version, it should be relatively simple to generate a list of the character corrections. Gather this data from enough texts, then sort it by frequency. Decide how frequent a correction has to be for it to be "common," then leave only the common corrections in the list.
Turn the list into a map keyed by the correct letter; the value being an array of the common mis-scans for that letter. Use a recursive function to take a word and generate all of its variations.
This example, in Ruby, shows the recursive function. Gathering up the possible mis-scans is up to you:
VARIATIONS = {
  'l' => ['1'],
  'b' => ['8'],
}
def variations(word)
  return [''] if word.empty?
  first_character = word[0..0]
  remainder = word[1..-1]
  possible_first_characters =
    [first_character] | VARIATIONS.fetch(first_character, [])
  possible_remainders = variations(remainder)
  possible_first_characters.product(possible_remainders).map(&:join)
end
p variations('Alphabet')
# => ["Alphabet", "Alpha8et", "A1phabet", "A1pha8et"]
The original word is included in the list of variations. If you want only possible mis-scans, then remove the original word:
def misscans(word)
  variations(word) - [word]
end
p misscans('Alphabet')
# => ["Alpha8et", "A1phabet", "A1pha8et"]
A quick-and-dirty (and untested) version of a command-line program would couple the above functions with this "main" function:
input_path, output_path = ARGV
File.open(input_path, 'r') do |infile|
  File.open(output_path, 'w') do |outfile|
    while word = infile.gets
      outfile.puts misscans(word)
    end
  end
end
An efficient way of achieving this is to use the bitap algorithm. Perl has re::engine::TRE, a binding to libtre, which implements fuzzy string matching in regexps:
use strict;
use warnings qw(all);
use feature qw(say); # needed for "say" below
use re::engine::TRE max_cost => 1;
# match "Perl"
if ("A pearl is a hard object produced..." =~ /\(Perl\)/i) {
    say $1; # find "pearl"
}
Plus, there is agrep tool which allows you to use libtre from the command line:
$ agrep -i -E 1 peArl *
fork.pl:#!/usr/bin/env perl
geo.pl:#!/usr/bin/env perl
leak.pl:#!/usr/local/bin/perl
When you need to match several words against the OCRized text, there are two distinct approaches.
You could simply build one regexp with your entire dictionary, if it is small enough:
/(Arakanese|Nelumbium|additionary|archarios|corbeil|golee|layer|reinstill)/
Large dictionary queries can be optimized by building a trigram index.
Perl has String::Trigram for doing this in memory.
Several RDBMSes also have trigram index extensions. PostgreSQL's pg_trgm allows you to write queries like this, which are fast enough even for really big dictionaries:
SELECT DISTINCT street, similarity(street, word)
FROM address_street
JOIN (
SELECT UNNEST(ARRAY['higienopolis','lapa','morumbi']) AS word
) AS t0 ON street % word;
(this one took ~70ms on a table with ~150K rows)
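If you just want to see what such a trigram comparison does, it is easy to sketch outside the database. The following Python snippet is not part of pg_trgm or String::Trigram and only approximates pg_trgm's padding and scoring; it rates two words by the overlap of their letter trigrams:
def trigrams(word):
    padded = "  " + word.lower() + " "  # pad the word, roughly pg_trgm-style
    return {padded[i:i + 3] for i in range(len(padded) - 2)}

def similarity(a, b):
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)  # shared trigrams relative to all trigrams seen

print(similarity("Alphabet", "Alpha8et"))  # relatively high despite the OCR error
print(similarity("Alphabet", "corbeil"))   # near zero for an unrelated word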