Warning: no protos/config when trying to train Tesseract-OCR

I'm trying to train Tesseract because it mainly confuses "g" with "9" when reading my .tiff files.
After deducing that the font used in the .tiff files seemed to be "Pragmatica Book", I decided to follow this tutorial to train Tesseract on the usual characters of the Pragmatica font.
The snag is that when it comes to the command:
shapeclustering -F font_properties -U unicharset eng2.LobsterTwo.exp0.tr
it gives:
Bad properties for index n, char A: 0,255 0,255 0,0 0,0 0,0
with n going from 3 to 64,
and as many lines of
Computing shape distances...
Stopped with 0 merged, min dist 999.000000
This results in the next step giving:
Bad properties for index n, char l: 0,255 0,255 0,0 0,0 0,0
for n from 3 to 64, followed by
Warning: no protos/configs for sh0058 in CreateIntTemplates()
I found this former post dealing with that issue, but all the related answers point out that the font name differs between the .tr file and font_properties, which is not my case here, as both names are "Pragmatica".
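For reference, the font_properties file that shapeclustering reads is expected to contain one line per font in the form <fontname> <italic> <bold> <fixed> <serif> <fraktur>, and that font name has to match the middle field of the .tr file name (lang.fontname.expN.tr) exactly. A minimal entry for a regular-weight font might look like this (the five flag values are an assumption about Pragmatica, not something I have verified):
Pragmatica 0 0 0 0 0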
Thanks in advance for your help, I don't see what I did wrong!

Related

Understanding DINO (object detection) model architecture

I am trying to understand the model architecture of DINO https://arxiv.org/pdf/2203.03605.pdf
These are the last few layers I see when I execute model.children()
Question 1)
In class_embed, (0) is of dimension 256 by 91, and if it's feeding into (1) of class_embed, shouldn't the first dimension be 91?
So, I realize (0) of class_embed is not actually feeding into (1) of class_embed. Could someone explain this to me?
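For context, here is a minimal sketch (not DINO's actual code) of the pattern DETR-style detectors generally use: class_embed is a ModuleList of independent heads, one per decoder layer, and each head is applied to its own decoder layer's output in parallel rather than being chained one after the other. The sizes (256 hidden dims, 91 classes, 900 queries) come from the shapes in the question; the 6 decoder layers are an assumption.
import torch
import torch.nn as nn

hidden_dim, num_classes, num_decoder_layers = 256, 91, 6

# One independent classification head per decoder layer; head (0) never feeds into head (1).
class_embed = nn.ModuleList([nn.Linear(hidden_dim, num_classes) for _ in range(num_decoder_layers)])

# Hypothetical decoder outputs: one tensor of shape (batch, num_queries, hidden_dim) per layer.
decoder_outputs = [torch.randn(1, 900, hidden_dim) for _ in range(num_decoder_layers)]

# Each head only sees the output of "its" decoder layer.
class_logits = [head(hs) for head, hs in zip(class_embed, decoder_outputs)]
print(class_logits[0].shape)  # torch.Size([1, 900, 91])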
Question 2)
Also, the last layer (2) of the MLP (see the first picture, which says (5): MLP) has dimension 256 by 4. So shouldn't the first dimension of class_embed (0) have a size of 4?
Now, when I use a different function to print the layers, I see that the layers shown above appear grouped together. For example, there is only one layer of
Linear(in_features=256, out_features=91, bias=True)
Why does this function give me a different architecture?
Question 3)
Now, I went on to register a hook for the third-to-last layer.
When I print the size, I am getting 1 by 900 by 256. Shouldn't I be getting something like 1 by 256 by 256?
Code to find dimension:
Output:
especially since layer 4 is:
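A hedged way to read that shape: in DETR-style detectors the decoder operates on a fixed set of object queries (900 in DINO) with hidden size 256, so an intermediate layer's output is (batch, num_queries, hidden_dim) = (1, 900, 256); only the final class_embed Linear maps the last dimension from 256 to 91. A minimal, self-contained hook sketch on a hypothetical stand-in layer (not DINO's actual module path):
import torch
import torch.nn as nn

hidden_dim = 256

# Hypothetical stand-in for an intermediate decoder layer: it keeps hidden_dim.
layer = nn.Linear(hidden_dim, hidden_dim)

def print_shape(module, inputs, output):
    # A forward hook receives the module, its inputs and its output.
    print(output.shape)

handle = layer.register_forward_hook(print_shape)
layer(torch.randn(1, 900, hidden_dim))  # prints torch.Size([1, 900, 256])
handle.remove()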

Pytorch LSTM text-generator repeats same words

UPDATE: It was a mistake in the logic generating new characters. See answer below.
ORIGINAL QUESTION: I built an LSTM for character-level text generation with Pytorch. The model trains well (loss decreases reasonably etc.) but the trained model ends up outputting the last handful of words of the input repeated over and over again (e.g. Input: "She told her to come back later, but she never did"; Output: ", but she never did, but she never did, but she never did" and so on).
I have played around with the hyperparameters a bit, and the problem persists. I'm currently using:
Loss function: BCE
Optimizer: Adam
Learning rate: 0.001
Sequence length: 64
Batch size: 32
Embedding dim: 128
Hidden dim: 512
LSTM layers: 2
I also tried not always choosing the top choice, but this only introduces incorrect words and doesn't break the loop. I've been looking at countless tutorials, and I can't quite figure out what I'm doing differently/wrong.
The following is the code for training the model. training_data is one long string and I'm looping over it predicting the next character for each substring of length SEQ_LEN. I'm not sure if my mistake is here or elsewhere but any comment or direction is highly appreciated!
loss_dict = dict()
for e in range(EPOCHS):
    print("------ EPOCH {} OF {} ------".format(e+1, EPOCHS))
    lstm.reset_cell()
    for i in range(0, DATA_LEN, BATCH_SIZE):
        if i % 50000 == 0:
            print(i/float(DATA_LEN))
        optimizer.zero_grad()
        # Build one batch of input sequences of length SEQ_LEN, mapping each
        # character to its vocabulary index (unknown chars map to len(vocab)).
        input_vector = torch.tensor([[
            vocab.get(char, len(vocab))
            for char in training_data[i+b:i+b+SEQ_LEN]
        ] for b in range(BATCH_SIZE)])
        if USE_CUDA and torch.cuda.is_available():
            input_vector = input_vector.cuda()
        output_vector = lstm(input_vector)
        # One-hot target: the character immediately following each input sequence.
        target_vector = torch.zeros(output_vector.shape)
        if USE_CUDA and torch.cuda.is_available():
            target_vector = target_vector.cuda()
        for b in range(BATCH_SIZE):
            target_vector[b][vocab.get(training_data[i+b+SEQ_LEN])] = 1
        error = loss(output_vector, target_vector)
        error.backward()
        optimizer.step()
        loss_dict[(e, int(i/BATCH_SIZE))] = error.detach().item()
ANSWER: I had made a stupid mistake when producing the characters with the trained model: I got confused with the batch size and assumed that at each step the network would predict an entire batch of new characters when in fact it only predicts a single one… That's why it simply repeated the end of the input. Yikes!
Anyways, if you run into this problem, DOUBLE CHECK that you have the right logic for producing new output with the trained model (especially if you're using batches); a sketch of such a generation loop follows the list below. If it's not that and the problem persists, you can try fine-tuning the following:
sequence length
greediness (e.g. probabilistic choice vs. top choice for next character)
batch size
epochs
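For reference, here is a minimal sketch of the kind of generation loop that avoids the mistake described above: it feeds the model one window at a time and appends exactly one predicted character per step. The names lstm, vocab and seq_len mirror the training code; inv_vocab (index-to-character) and the sampling step are assumptions, not part of the original code.
import torch

def generate(lstm, vocab, inv_vocab, seed_text, seq_len, num_chars=200):
    lstm.eval()
    generated = seed_text
    with torch.no_grad():
        for _ in range(num_chars):
            # Encode only the last seq_len characters as the next input window.
            window = generated[-seq_len:]
            input_vector = torch.tensor([[vocab.get(c, len(vocab)) for c in window]])
            output_vector = lstm(input_vector)
            # The model predicts ONE next character for this sequence, not a whole batch.
            probs = torch.softmax(output_vector[0], dim=-1)
            next_idx = torch.multinomial(probs, 1).item()
            generated += inv_vocab[next_idx]
    return generated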

How to process my images to help Tesseract?

I have some images containing only digits and a colon.
Example:
You can see more here: https://imgur.com/a/54dsl6h
They seem pretty clean and straightforward to me, but Tesseract considers them as empty "pages" (Empty page!!).
I tried both with oem 1 and with oem 0 plus a character whitelist:
tesseract processed/35.0.png stdout -c tessedit_char_whitelist=0123456789: --oem 0
tesseract processed/35.0.png stdout
What can I do to get Tesseract to recognize the characters better?
Tesseract still gives me pretty bad results overall, but making the text bolder with a simple dilation algorithm helped a bit.
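If it helps, a minimal sketch of that kind of preprocessing with OpenCV (the kernel size and iteration count are guesses that would need tuning for these images):
import cv2
import numpy as np

img = cv2.imread("processed/35.0.png", cv2.IMREAD_GRAYSCALE)

# Binarise with the glyphs as white foreground, thicken them, then restore
# dark-text-on-light-background before handing the image to Tesseract.
_, binary = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
kernel = np.ones((3, 3), np.uint8)               # assumed kernel size
bold = cv2.dilate(binary, kernel, iterations=1)  # assumed iteration count
cv2.imwrite("processed/35.0-bold.png", cv2.bitwise_not(bold))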
In the end, since the font is really square, I used a trick: I defined a bunch of segments for each digit, and depending on which segments intersect, or don't intersect, with the digit, I can determine with 99% accuracy which digit it is.
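A rough sketch of that segment trick, in the style of a seven-segment decoder (the segment coordinates, the 40% fill threshold and the on/off table are illustrative assumptions, not the actual values used):
import cv2

# Seven segment regions as fractions of a digit ROI (x0, y0, x1, y1):
# top, top-left, top-right, middle, bottom-left, bottom-right, bottom.
SEGMENTS = [
    (0.1, 0.0, 0.9, 0.15), (0.0, 0.1, 0.2, 0.5), (0.8, 0.1, 1.0, 0.5),
    (0.1, 0.45, 0.9, 0.6), (0.0, 0.5, 0.2, 0.95), (0.8, 0.5, 1.0, 0.95),
    (0.1, 0.85, 0.9, 1.0),
]
DIGITS = {  # which segments are "on" for each digit
    (1, 1, 1, 0, 1, 1, 1): 0, (0, 0, 1, 0, 0, 1, 0): 1, (1, 0, 1, 1, 1, 0, 1): 2,
    (1, 0, 1, 1, 0, 1, 1): 3, (0, 1, 1, 1, 0, 1, 0): 4, (1, 1, 0, 1, 0, 1, 1): 5,
    (1, 1, 0, 1, 1, 1, 1): 6, (1, 0, 1, 0, 0, 1, 0): 7, (1, 1, 1, 1, 1, 1, 1): 8,
    (1, 1, 1, 1, 0, 1, 1): 9,
}

def classify_digit(roi):
    # roi: binary image of a single digit, white (255) pixels belong to the glyph.
    h, w = roi.shape
    on = []
    for (x0, y0, x1, y1) in SEGMENTS:
        cell = roi[int(y0 * h):int(y1 * h), int(x0 * w):int(x1 * w)]
        # A segment counts as intersected if enough of its area is filled.
        on.append(1 if cv2.countNonZero(cell) > 0.4 * cell.size else 0)
    return DIGITS.get(tuple(on))  # None if the pattern does not match any digit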

Tesseract OCR: how do I improve the result?

I am having a hard time working with Tesseract; is there a way to improve the accuracy? How do I train it myself, if needed?
The only thing I am doing is reading the following characters: XYZ:-0123456789
That's it! The pictures always look that way.
Thanks!
The output of Tesseract 4.00alpha with your image is
$ tesseract ICKcj.png - -l eng
*: 4606 Y; 4809 Z; 698
Warning. Invalid resolution 0 dpi. Using 70 instead.
Resampling the picture to 50% and setting the DPI to 300:
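One possible way to do that resampling, e.g. with Pillow (the file names are taken from the commands above; the answer does not say which tool was actually used):
from PIL import Image

img = Image.open("ICKcj.png")
# Halve the pixel dimensions and embed a 300 dpi resolution tag in the output file.
small = img.resize((img.width // 2, img.height // 2), Image.LANCZOS)
small.save("ICKcj-50.png", dpi=(300, 300))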
The output with this image is slightly better and the warning vanishes:
$ tesseract ICKcj-50.png - -l eng
X: 4606 Y: 4809 Z: 698
The only things missing are the minus signs, which are printed quite irregularly (a better resolution in the picture could help). It is also possible to restrict the output pattern in Tesseract. Alternatively, you can try to guess the minus afterwards depending on the spaces between the X, Y, Z and the numbers.
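For example, a possible way to restrict the characters with Tesseract's tessedit_char_whitelist option (whether the whitelist is honoured can depend on the Tesseract version and engine):
tesseract ICKcj-50.png - -l eng -c tessedit_char_whitelist=XYZ:-0123456789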

Error in applying an ascii to NetLogo with apply-raster

I am trying to apply an ascii (241 rows, 463 columns) to NetLogo using the following code:
set my-dataset "data/my-folder/my-file.asc"
resize-world 0 gis:width-of (gis:load-dataset my-dataset) - 1 0 gis:height-of (gis:load-dataset my-dataset) - 1
gis:set-world-envelope-ds (gis:envelope-of (gis:load-dataset my-dataset))
gis:apply-raster (gis:load-dataset my-dataset) my-variable
In the resize-world command, I added the -1 since NetLogo coordinates start at 0, while gis:width-of counts from 1. The result is a NetLogo world with min-pycor 0, max-pycor 240, min-pxcor 0 and max-pxcor 462 (a 241x463 world), which matches the size of my ascii perfectly. The gis:set-world-envelope-ds command makes sure that the extent is the same for the ascii and the NetLogo world. I checked this, and again it matches perfectly.
The problem I am facing is that although the NetLogo rows and columns match the ascii rows and columns, the applied ascii is displaced by 1 in the y direction. The top row of the NetLogo world is filled with zeros, while the top row of my ascii is filled with high values.
FIGURE: top row is red, showing 0 values, where they should not be 0.
Does anybody know what the problem is? Or how to correctly apply the ascii to the NetLogo world so that one ascii value fills the corresponding NetLogo patch?
Maybe in addition to this: can I stop NetLogo from automatically resampling, so that I know for sure that the values in NetLogo are the same as those in my ascii?
Thank you for your help.
More info:
ascii header
NCOLS 463
NROWS 241
XLLCORNER 2.54710299910375
YLLCORNER 49.4941766658013
CELLSIZE 0.00833333333339921
NODATA_value -9999
netlogo envelope:
show gis:world-envelope
observer: [2.5471029991037497 6.405436332467584 49.49417666580129 51.502509999150504]
my-file envelope:
show gis:envelope-of gis:load-dataset my-dataset
observer: [2.54710299910375 6.405436332467584 49.4941766658013 51.502509999150504]
Note that there is a slight rounding difference, which I just cannot erase no matter how I code the world-envelope. In any case, considering that it is such a tiny difference I don't think this is the problem.
Edit: I checked what actually happened by exporting the NetLogo values to a raster and comparing them in ArcGIS. It is not a simple resampling problem. Actually, the top row just has missing values, without shifted values. Furthermore, the middle column and row are duplicated, causing everything to shift outwards from the middle towards the bottom and right. I added a simple illustration, hoping that this clarifies the problem.
I investigated it further, and I think the error originates in the code behind apply-raster, similar to the problem described here.
I analyzed the Java code of apply-raster on GitHub, and it seems to refer to the world's min-pxcor and min-pycor while doing something with the GIS extent. As the real edge coordinates are not the same as the patch center coordinates, this might be causing the problem? I am not a Java expert though; it might be something to investigate further (and I might be wrong).
Anyway, to get my ascii to apply nicely to my world (which was set to the size of the ascii), I now run the following code:
file-open "data/my-folder/my-file.asc"
let temp []
while [file-at-end? = false][repeat 6 [let header file-read-line] ; skip header
set temp lput file-read temp
]
file-close
(foreach sort patches temp
[ ask ?1 [ set my-variable ?2 ] ] )