Tesseract Training - new font with only digits - ocr

Hello, I am trying to train Tesseract for a new font based on the following digits:
All digits are provided in a PNG file with a transparent background. If I create a box file from it, train on it, and so on, everything works fine!
Now the problem: same situation, but I want to train Tesseract based on the following image:
As you can see, the digits are exactly the same, as are their positions. The only difference from image 1 is that I used a yellow background, and now nothing works anymore. I created a box file and set the same positions as for the first image:
0 5 4 20 22 0
1 27 4 38 21 0
2 48 4 60 22 0
3 71 3 83 22 0
4 94 5 109 22 0
5 119 5 131 22 0
6 143 5 157 22 0
7 172 5 184 22 0
8 197 5 211 23 0
9 224 5 238 22 0
Then I trained with the box file, but the resulting .tr file is completely empty. I didn't stop there and completed all the other steps. The resulting font is unusable!
So my question is: how do I train Tesseract to recognize these digits no matter which background is used for them?
Edit 2016-04-16:
I used ImageMagick to preprocess the images and found a command that works very well for all kinds of backgrounds. So I wanted to train Tesseract on the resulting images, but it doesn't work as I thought it would.
First I generated box files, most of which were empty. So I used a website to set the character positions, and I spent a lot of time getting the cropping perfect! Afterwards I created the .tr files and did the remaining steps to train Tesseract.
Finally I got the "traineddata" file, moved it to Tesseract's "tessdata" directory, and used it as intended:
tesseract example.jpg output -l mg
(I called the new font "mg".)
However, it doesn't recognize all, or even most, of the digits! I opened this thread to find help, but so far nobody really has a clue how to do this, sadly. Please help me out.
You can find all the Tesseract training files that I used and created here:
Tesseract training directory (as no zip/not compressed -> view of all files of the directory)

You can convert any color image to a binary (black-and-white) image and then run Tesseract on that. That way the result should not depend on which colors the original image uses.
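As a rough illustration of that idea, here is a minimal pure-Python sketch of thresholding. It assumes dark digits on a lighter background, and the function name `to_binary`, the threshold of 128, and the sample pixels are all made up for the example. In practice an image tool (such as the ImageMagick command mentioned in the question) would do this step on the real file.

```python
def to_binary(pixels, threshold=128):
    """Map a grid of (R, G, B) tuples to 0 (dark) or 255 (light)."""
    binary = []
    for row in pixels:
        out_row = []
        for r, g, b in row:
            # Integer approximation of the common 0.299/0.587/0.114 luma weights.
            luma = (299 * r + 587 * g + 114 * b) // 1000
            out_row.append(255 if luma >= threshold else 0)
        binary.append(out_row)
    return binary


# Hypothetical 1x4 strips: black digit pixels on yellow and on white.
yellow, white, black = (255, 255, 0), (255, 255, 255), (0, 0, 0)
on_yellow = to_binary([[yellow, black, black, yellow]])
on_white = to_binary([[white, black, black, white]])
print(on_yellow == on_white)  # both backgrounds collapse to the same binary strip
```

A fixed threshold is the simplest choice; it would not separate digits from a background of similar brightness, so the value may need tuning per image set.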

Related

Looking for decoding algorithm for datetime in MYSQL. See examples, reward for solution

I have tried some of the online references as well as Unix time formats, etc., but none of these seem to work. See the examples below.
Running MySQL 5.5.5 on Ubuntu, InnoDB engine.
Nothing is custom. This is using a built-in datetime function.
Here are some examples with the 6-byte hex string and the decoded date/time. We are looking for the decoding algorithm, i.e. how to turn the 6-byte hex string into the correct date/time. The algorithm must work correctly on the examples below. The rightmost byte seems to track the difference in seconds correctly for small differences in time between records; for instance, we show an example with a 14-second difference.
The full records are in a nicely highlighted and formatted Word document with the examples, here:
https://www.dropbox.com/s/zsqy9o2rw1h0e09/mysql%20datetime%20examples%20.docx?dl=0
Contact frank%simrex.com re. reward.
Replace % with #.
Hex strings and decoded date/time pairs are below, pulled from a healthy file running MySQL.
12 51 72 78 B9 46 ... 2014-10-22 16:53:18
12 51 72 78 B9 54 ... 2014-10-22 16:53:32
12 51 72 78 BA 13 ... 2014-10-22 16:55:23
12 51 72 78 CC 27 ... 2014-10-22 17:01:51
Here you go. Read as a single hexadecimal number, the six bytes convert to the decimal digits YYYYMMDDhhmmss:
select str_to_date(conv(replace('12 51 72 78 CC 27',' ', ''), 16, 10), '%Y%m%d%H%i%s')
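The same conversion can be checked outside MySQL. The following Python sketch mirrors the query above (strip the spaces, parse as base 16, read the decimal digits as a timestamp); the function name `decode_datetime` is just a label chosen for this example.

```python
from datetime import datetime


def decode_datetime(hex_bytes):
    """Decode a 6-byte hex string such as '12 51 72 78 B9 46'."""
    # Parse the bytes as one big-endian hexadecimal integer ...
    digits = str(int(hex_bytes.replace(" ", ""), 16))
    # ... whose decimal digits spell out YYYYMMDDhhmmss.
    return datetime.strptime(digits, "%Y%m%d%H%M%S")


for sample in ("12 51 72 78 B9 46", "12 51 72 78 B9 54",
               "12 51 72 78 BA 13", "12 51 72 78 CC 27"):
    print(sample, "...", decode_datetime(sample))
```

This also explains the observation about the rightmost byte: B9 54 minus B9 46 is 14, which lands directly in the seconds digits as long as no carry into the minutes occurs.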

LZW Decompress: Why is first dictionary code encountered in TIFF strip 261 instead of 257, or am I misreading it?

I have a trivial RGB file saved as TIFF in Photoshop, 1000 or so pixels wide. The first row consists of 3 pixels all of which are hex 4B red, B0 green, 78 blue, and the rest of the row white.
The strip is LZW-encoded and the initial bytes of the strip are:
80 12 D6 07 80 04 16 0C B4 27 A1 E0 D0 B8 64 36 ... (actually only the first 7 or so bytes are significant to my question.)
In 9-bit segments this is:
100000000 001001011 010110000 001111000 000000000 100000101 100000110 ...
(0x100) (0x4B) (0xB0) (0x78) (0x00) (0x105) (0x106)
From what I understand 256 (0x100) is a reset code, but why is the first extended code after that 261 (0x105) instead of 257? I would expect whatever dictionary entry this points to to be the 4B/B0 pair for the second pixel (which it may well be), but how would the decompression algorithm know to place 4B/B0 at 261 instead of 257? Can someone explain what I'm missing here? Might there be something elsewhere in the .tif file that would indicate this? Thanks very much.
~
Let's see
256 (100h) is Clear
257 (101h) is EOF
in your case, then
4Bh B0h is 258 (102h)
B0h 78h is 259 (103h)
78h 00h is 260 (104h)
00h 00h is 261 (105h)
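Looks good to me. LZW can actually emit a code one step ahead of what the decoder has added to its table: when the decoder sees 261 it has only built entries up to 260, so it constructs the missing entry as the previous string (00h) plus that string's first byte, giving 00h 00h.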
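To make the table bookkeeping concrete, here is a small Python sketch that unpacks the first eight bytes from the question into 9-bit codes and decodes them. It is a simplified illustration, not a full TIFF decoder: it assumes codes stay 9 bits wide (true only near the start of a strip) and it does not undo any TIFF predictor. The names `unpack_codes` and `lzw_decode` are made up for the example.

```python
CLEAR, EOI = 256, 257


def unpack_codes(data, width=9):
    """Split bytes into fixed-width codes, most significant bit first."""
    bits = "".join(f"{byte:08b}" for byte in data)
    usable = len(bits) - len(bits) % width  # drop the incomplete trailing code
    return [int(bits[i:i + width], 2) for i in range(0, usable, width)]


def lzw_decode(codes):
    """Decode 9-bit LZW codes; returns (output bytes, final table)."""
    table = {i: bytes([i]) for i in range(256)}
    next_code, prev, out = 258, None, bytearray()
    for code in codes:
        if code == CLEAR:
            table = {i: bytes([i]) for i in range(256)}
            next_code, prev = 258, None
            continue
        if code == EOI:
            break
        if code in table:
            entry = table[code]
        elif prev is not None and code == next_code:
            # Code not in the table yet: previous string plus its first byte.
            entry = prev + prev[:1]
        else:
            raise ValueError(f"unexpected code {code}")
        out += entry
        if prev is not None:
            # New entries start at 258 because 256 and 257 are reserved.
            table[next_code] = prev + entry[:1]
            next_code += 1
        prev = entry
    return bytes(out), table


codes = unpack_codes(bytes.fromhex("80 12 D6 07 80 04 16 0C"))
decoded, table = lzw_decode(codes)
print([hex(c) for c in codes])
print(table[258].hex(), table[259].hex(), table[260].hex(), table[261].hex())
```

The printed table entries match the list in the answer: 258 is 4B B0, 259 is B0 78, 260 is 78 00, and 261 is 00 00.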

Weka Decision Tree

I am trying to use Weka to analyze some data. I've got a dataset with 3 variables and 1000+ instances.
The dataset references movie remakes and
how similar they are (0.0-1.0),
the difference in years between the movie and the remake,
and lastly whether they were made by the same studio (yes or no).
I am trying to make a decision tree to analyze the data. Using J48 (because that's all I have ever used), I only get one leaf. I'm assuming I'm doing something wrong. Any help is appreciated.
Here is a snippet from the data set:
Similarity YearDifference STUDIO TYPE
0.5 36 No
0.5 9 No
0.85 18 No
0.4 10 No
0.5 15 No
0.7 6 No
0.8 11 No
0.8 0 Yes
...
If interested the data can be downloaded as a csv here http://s000.tinyupload.com/?file_id=77863432352576044943
Your data set is not balanced, because there are almost 5 times more "No" than "Yes" values for the class attribute. That's why J48 produces a tree that is actually just one leaf, which classifies everything as "No". You can do one of these things:
Sample your data set so you have an equal number of "No" and "Yes" instances.
Try a better classification algorithm, e.g. Random Forest (it's located a few entries below J48 in the Weka Explorer GUI).
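The first option (sampling to equal class counts) can be sketched in a few lines of Python before loading the CSV into Weka. This is only an illustration of random undersampling; the function name `undersample`, the fixed seed, and the sample rows are assumptions made for the example, not part of the original data.

```python
import random


def undersample(rows, label_of, seed=0):
    """Randomly drop rows so every class has as many rows as the rarest one."""
    groups = {}
    for row in rows:
        groups.setdefault(label_of(row), []).append(row)
    smallest = min(len(group) for group in groups.values())
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    balanced = []
    for group in groups.values():
        balanced.extend(rng.sample(group, smallest))
    rng.shuffle(balanced)
    return balanced


# Hypothetical rows with roughly the 5:1 imbalance described above.
rows = [(0.5, 36, "No")] * 50 + [(0.8, 0, "Yes")] * 10
balanced = undersample(rows, label_of=lambda row: row[2])
print(len(balanced))
```

Undersampling throws away majority-class rows, so with only 1000+ instances it trades some information for balance.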

binary conversion using 3 figures system 0,1,2

Suppose a system is evolved by extraterrestrial creatures having only 3 figures, and they use the figures 0, 1, 2 (with 2 > 1 > 0). How do you represent the binary equivalent of 222 using this?
I calculated it to be 22020, but the book's answer is 11010. How is this? Shouldn't I use the same method as for decimal-to-binary conversion, except using 3 here?
I think you meant the base 3 (not binary) equivalent of decimal 222.
22020 in base 3 is 222 in decimal.
220202 (your answer) in base 3 is 668 in decimal.
11010 (according to the book) in base 3 is 111 in decimal.
222 in binary is 11011110.
Maybe I will be able to tell where you went wrong if you tell me the method you used to calculate the base 3 equivalent of 222.
Edit:
Sorry, I could not understand the problem until you provided the link. It asks for the binary equivalent of 222 (remember, 222 is in base 3).
222 in base 3 = 26 in decimal (base 10)
26 in decimal = 11010 in binary
Mark it as accepted if it solved your problem.
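For a quick check of this reading of the problem, Python's built-in `int` accepts a base argument, so the two steps can be verified directly. The helper name `base3_to_binary` is just for this example.

```python
def base3_to_binary(digits):
    """Interpret a string of base-3 digits and return (decimal value, binary string)."""
    value = int(digits, 3)           # "222" in base 3
    return value, format(value, "b")  # the same value written in base 2


print(base3_to_binary("222"))
```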
Assuming the start is decimal 222.
Well, without knowing the system used in the book, I would decompose it by hand in the following way:
3^4 = 81,
3^3 = 27,
3^2 = 9,
3^1 = 3,
3^0 = 1.
81 fits twice into 222, so the leading "bit" (the 3^4 place) has the value 2.
The remainder is 60. 27 fits twice into 60, so the next bit is 2 again.
The remainder is 6. 9 does not fit into 6, so the next bit is 0.
The remainder is 6. 3 fits twice into 6, so the next bit is 2.
The remainder is 0, so the last bit is 0.
This gives 22020 as the result.
One quick sanity check on how many "bits" are needed to represent decimal 222 in a number system with 3 figures: floor(log(222)/log(3)) + 1 = floor(4.92) + 1 = 5, so 5 "bits" are needed, which agrees with the result 22020.
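The hand decomposition by powers of 3 can be written as a short Python sketch. The function name `to_base` is made up for the example, and it assumes a base of at most 10 so that each digit is a single character.

```python
def to_base(n, base):
    """Write a non-negative integer in the given base (2..10), largest power first."""
    if n == 0:
        return "0"
    power = 1
    while power * base <= n:  # find the largest power of base that fits into n
        power *= base
    digits = []
    while power >= 1:
        count, n = divmod(n, power)  # how often this power fits, and what remains
        digits.append(str(count))
        power //= base
    return "".join(digits)


print(to_base(222, 3))
```

Converting the result back with `int(to_base(222, 3), 3)` should give 222 again, which is an easy round-trip check.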
First see how many figures you have. Here we have 3, so 222 is a base-3 number.
We have to convert 222 to binary, so first convert it to decimal:
2×3^2 + 2×3^1 + 2×3^0 (if the number were 121, it would be
1×3^2 + 2×3^1 + 1×3^0)
which gives 26. Then divide this by 2 repeatedly until the quotient reaches 0.
At each step, write 1 if the remainder is 1 and 0 if it is 0.
This gives 01011. Just reverse it and we have the answer:
11010

Need to access the API to share folder with variable lists of people

I am working in VB 2010 to create some data acquisition programs, and one of the requests is that we use Google Drive to store and share the Excel files once they are created. I have been able to save and sync folders into My Drive; I now need to be able to share this folder with a variable list of email addresses.
The emails would all be "name"#mix.wvu.edu, and the students would be inputting those at the beginning of the experiment, so that when they save, the files will automatically be sent to My Drive and then emailed to the students.
Basically, I need some help writing code to select a folder of files from My Drive, click "share", add people who have the ability to edit the files, and send them when the student clicks the button below. Any help would be much appreciated! Thank you!
Sample of what I have so far, as a .xlsx file:
Sample data set
Tension Test
Date: Apr-06-2013
Time: 15:07
Specimen Material: Plexiglass
Specimen #: 14
Width: 34
Thickness: 3
Area: 102
Gage Length: 3
Data Rate: 10

Load    Extension    Stress    Strain
6       4.5
7       6
8       7.5
9       9
10      10.5
11      12
12      13.5
13      15
14      16.5
15      18
16      19.5
17      21
18      22.5
19      24
20      25.5
21      27
22      28.5
23      30
24      31.5
25      33
For sharing files, please see: https://developers.google.com/drive/manage-sharing and the permissions resource: https://developers.google.com/drive/v2/reference/permissions
If you have any specific questions, please ask them.
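As a starting point, the variable list of addresses can be turned into one permission body per student. The Python sketch below only builds those bodies; it assumes the `role`, `type`, and `value` fields described in the v2 permissions reference linked above, and the function name `build_permissions` and the sample addresses are hypothetical. Authenticating and actually sending each body to the permissions insert method for the folder's file ID is not shown.

```python
def build_permissions(emails, role="writer"):
    """Build one Drive v2-style permission body per non-empty email address."""
    bodies = []
    for email in emails:
        email = email.strip()
        if not email:
            continue  # skip blank entries from the input form
        # Field names assumed from the v2 permissions resource: role, type, value.
        bodies.append({"type": "user", "role": role, "value": email})
    return bodies


# Hypothetical addresses entered at the start of an experiment.
students = ["student1@example.edu", " student2@example.edu ", ""]
for body in build_permissions(students):
    print(body)
```

The "writer" role corresponds to the "can edit" ability asked for in the question; "reader" would give view-only access.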