Tesseract OCR with numeric tables

I need to OCR old statistical tables that contain numerical values for each town in a given area. I use Tesseract 4.0.0-beta.3, and in most cases I get acceptable results, but in some others the software fails to recognise the structure of the table and skips rows or entire columns.
I tried to find a more suitable configuration by checking --help-psm, but honestly I couldn't figure out which mode would improve my results. I also tried slicing the tables into individual columns, but the results were even worse. I suspect the issue is that some cells contain one- or two-digit numbers, and rows that short get discarded, which is usually desirable but is a problem here. What settings would you use to optimise the results?

In a similar situation I was using
tesseract image test --psm 6 --oem 0 digits
I even deleted the text on the left, to be processed separately.
Number recognition was OK, but my problem was that I have ~10 columns, some of which are blank in some rows, and Tesseract sometimes ignores the vertical lines and sometimes renders them as "1", unpredictably.
I tried several settings, and even deleted the vertical lines, but I couldn't get Tesseract to keep the table structure for subsequent machine processing.
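In case you end up scripting it rather than calling the CLI, the same configuration looks roughly like this with the tess4j Java wrapper (a sketch only; the data path and image name are made up, and the whitelist variable is just the programmatic equivalent of the digits config file, which in Tesseract 4 is only honoured by the legacy engine, i.e. --oem 0):

import java.io.File;
import net.sourceforge.tess4j.Tesseract;

public class DigitTableOcr {
    public static void main(String[] args) throws Exception {
        Tesseract tesseract = new Tesseract();
        tesseract.setDatapath("/usr/share/tesseract-ocr/4.00/tessdata"); // adjust to your tessdata location
        tesseract.setLanguage("eng");
        tesseract.setPageSegMode(6);   // --psm 6: assume a single uniform block of text
        tesseract.setOcrEngineMode(0); // --oem 0: legacy engine
        // restrict recognition to digits and separators; only the legacy engine honours this in 4.x
        tesseract.setTessVariable("tessedit_char_whitelist", "0123456789.,-");

        String text = tesseract.doOCR(new File("table.png")); // made-up file name
        System.out.println(text);
    }
}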
Hope it helps.

Related

store text of character length ~300,000 in mysql database

I have a column of data I would like to add to a mysql database table. The column is raw text and the longest piece of text contains approximately 300,000 characters. Is it possible to store this in the table? How?
I have been reading that even LONGTEXT columns are limited somewhat.
Presumably you have ruled out the alternative of storing these items of text in files, and storing their pathnames in your table. If you have not considered that choice, please do. It's often the most practical way to handle this sort of application. That's especially true if you're using a web server to deliver your information to your users: by putting those objects in your file system you avoid a very serious production bottleneck (fetching the objects from the DBMS and then sending them to the user).
MySQL's LOBs (large objects) will take 300k characters without problems. MEDIUMTEXT handles 16 megabytes. But the programming work necessary to load those objects into the DBMS and get them out again can be a bit challenging. You haven't mentioned your application stack, so it's hard to give you specific advice about that. Where to start? Read about the MySQL server parameter max_allowed_packet.
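For what it's worth, if your stack turns out to be Java/JDBC, loading one of these values looks roughly like this (a sketch only, with made-up table and column names; the whole statement still has to fit within max_allowed_packet):

import java.io.Reader;
import java.io.StringReader;
import java.sql.Connection;
import java.sql.PreparedStatement;

public class ArticleLoader {
    // insert one large text value into a MEDIUMTEXT column, streaming it as a
    // bound parameter rather than concatenating it into the SQL string
    static void insertArticle(Connection conn, int textId, String bigText) throws Exception {
        String sql = "INSERT INTO articles (textid, textval) VALUES (?, ?)";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setInt(1, textId);
            Reader reader = new StringReader(bigText);
            ps.setCharacterStream(2, reader, bigText.length());
            ps.executeUpdate();
        }
    }
}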
If this were my project, and for some reason using the file system was out of the question, I would store the large textual articles as segments in shorter rows. For example, instead of
textid   textval
(int)    (MEDIUMTEXT)
number   lots and lots and lots of text.
I'd make a table like this:
textid   segmentid   textval
(int)    (int)       (VARCHAR(250))
number   1           Lots and
number   2           lots and
number   3           lots of
number   4           text.
The segment lengths should probably be around 250 characters each. I think you'd be smart to break the segments on word boundaries if you can; it will make things like FULLTEXT search easier. This will end up with many shorter rows for your big text items, but it will make your programming, your backups, and everything else about your system easier to handle all around.
There is an upfront cost, but it's probably worth it.
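If it helps, the segmenting itself is only a few lines of code. A rough Java/JDBC sketch (the column names follow the example above, the table name is made up, and the word-boundary logic is deliberately simplistic):

import java.sql.Connection;
import java.sql.PreparedStatement;

public class TextSegmenter {
    // break one big text into ~250-character segments on word boundaries
    // and insert them as numbered rows
    static void storeSegments(Connection conn, int textId, String text) throws Exception {
        String sql = "INSERT INTO text_segments (textid, segmentid, textval) VALUES (?, ?, ?)";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            int segmentId = 1;
            int pos = 0;
            while (pos < text.length()) {
                int end = Math.min(pos + 250, text.length());
                if (end < text.length()) {
                    int cut = text.lastIndexOf(' ', end);
                    if (cut > pos) end = cut; // prefer breaking at a space
                }
                ps.setInt(1, textId);
                ps.setInt(2, segmentId++);
                ps.setString(3, text.substring(pos, end));
                ps.addBatch();
                // skip the space we broke on, if any
                pos = (end < text.length() && text.charAt(end) == ' ') ? end + 1 : end;
            }
            ps.executeBatch();
        }
    }
}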

Database Design: How should I store 'word difficulty' in MySQL?

I made a vocabulary app for Android that has a list of ~5000 words stored in a local database (SQLite), and I want to find out which words are more difficult than others.
To find out, I'm thinking of adding a very simple feature that puts two random words on the screen and asks the user to choose the more difficult one. Then another pair of random words is shown, and the process can be repeated for as long as the user wants. The more users participate in this 'which word is more difficult' game, the better the app should in theory be able to distinguish difficult words from easy ones.
Since the difficulty would be based on input from all users, I know I need to keep track of it online, so that every copy of the app can fetch the values from the database on my website (which is MySQL). I'm not sure what the most efficient way to keep track of the difficulty would be, but I came up with two possible solutions:
1) Add a difficulty column holding integer values to the words table. Then, for every pair of words that a user looks at and ranks, the word that he/she chooses as more difficult would have its difficulty increased by one, and the word not chosen would have its difficulty decreased by one. I could simply order by that integer value to get the most difficult words.
2) Create a difficulty table with two columns, more and less, that hold words (or IDs of the words, to save space) based on the result of each selection a user makes. I'm still unsure how I would get the most difficult words - some combination of GROUP BY and ORDER BY? (I've sketched what I have in mind below.)
The benefit of my second solution is that I can know how many times each word has been seen (the number of rows where the word appears in the more column plus the number of rows where it appears in the less column). That helps with statistics, for example if I wanted to find out which word has the highest more/less ratio. But it would also take up much more space than my first solution, and I don't know how it would scale.
Which do you think is the better solution, or what other ones should I consider?
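To make solution 2 concrete, here is a rough sketch of the table and the ranking query I have in mind (shown with plain JDBC just to make the SQL concrete; I haven't tested it):

import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;

public class WordDifficulty {

    // one row per comparison: the id of the word picked as harder ("more")
    // and the id of the word it was compared against ("less")
    static void createTable(Connection conn) throws Exception {
        try (Statement st = conn.createStatement()) {
            st.execute("CREATE TABLE IF NOT EXISTS difficulty ("
                    + "  more INT NOT NULL,"
                    + "  less INT NOT NULL)");
        }
    }

    // rank words by how often they were chosen as the harder one
    static void printHardestWords(Connection conn) throws Exception {
        String sql = "SELECT more AS word_id, COUNT(*) AS wins"
                + " FROM difficulty GROUP BY more ORDER BY wins DESC LIMIT 20";
        try (Statement st = conn.createStatement();
             ResultSet rs = st.executeQuery(sql)) {
            while (rs.next()) {
                System.out.println(rs.getInt("word_id") + " picked as harder " + rs.getInt("wins") + " times");
            }
        }
    }
}

The ratio version would just compare these win counts against the total number of times each word appears in either column.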
Did you try Sphinx for this? I'd guess a full-text search engine like Sphinx would handle it with great performance.

Apache Camel problems aggregating large (1mil record) CSV files

My questions are: (1) is there a better strategy for solving my problem; (2) is it possible to tweak/improve my solution so that it reliably avoids splitting the aggregation; and (3, the less important one) how can I debug it more intelligently? Figuring out what the aggregator is doing is difficult, because it only fails on giant batches that are hard to debug precisely because of their size. Answers to any of these would be very useful, most importantly the first two.
I think the problem is that I'm not expressing to Camel correctly that I need it to treat the incoming CSV file as a single lump, and that I don't want the aggregator to stop until all the records have been aggregated.
I'm writing a route to digest a million-line CSV file, split and then aggregate the data on some key primary fields, and then write the aggregated records to a table.
Unfortunately the primary key constraints of the table (which also correspond to the aggregation keys) are getting violated, implying that the aggregator is not waiting for the whole input to finish.
It works fine for small files of a few thousand records, but at the sizes it will actually face in production (1,000,000 records) it fails.
First it fails with a Java heap memory error on the split after the CSV unmarshal. I fixed that with .streaming(), but this affects the aggregator, which now 'completes' too early.
to illustrate:
A 1
A 2
B 2
--- aggregator split ---
B 1
A 2
--> A(3),B(2) ... A(2),B(1) = constraint violation because 2 lots of A's etc.
when what I want is A(5),B(3)
With examples of 100, 1,000, etc. records it works fine and correctly, but when it processes 1,000,000 records, which is the real size it needs to handle, the split() first gets an OutOfMemoryError (Java heap space).
I felt that simply increasing the heap size would be a short-term fix that only pushes the problem back until the next upper limit of records comes through, so I worked around it by using .streaming() on the split.
Unfortunately, the aggregator is now being drip-fed the records rather than getting them in one big clump, and it seems to complete early and start another aggregation, which violates my primary key constraint:
from("file://inbox")
    .unmarshal().bindy(BindyType.Csv, ...)   // ... = my bindy CSV record class
    .split().body().streaming()
    .setHeader("X", ...)                     // ... = expression building a string of the primary-key fields
    .aggregate(header("X"), ...).completionTimeout(15000)
    // etc.
I think part of the problem is that I'm depending on the streaming split never stalling for longer than a fixed amount of time, which just isn't foolproof - e.g. a system task could reasonably cause a longer pause. Also, every time I increase this timeout it makes debugging and testing take longer and longer.
Probably a better solution would be to read the number of records in the incoming CSV file and not allow the aggregator to complete until every record has been processed; I have no idea how I'd express this in Camel, however.
Very possibly I just have a fundamental misunderstanding of the strategy I should be using to approach/describe this problem. There may be a much better (simpler) approach that I don't know about.
There's also such a large number of records going in that I can't realistically debug them by hand to get an idea of what's happening (and I suspect that when I do, I'm also tripping the aggregator's timeout).
You can split the file line by line first, and then convert each line to CSV. That way you can run the splitter in streaming mode, keep memory consumption low, and read a file with a million records.
There are some blog links on this page, http://camel.apache.org/articles, about splitting big files in Camel. They cover XML, but the techniques apply to splitting big CSV files as well.
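A rough sketch of the line-by-line split in a RouteBuilder (a sketch only: it assumes camel-csv is on the classpath, and the direct: endpoint name is just a placeholder for wherever your aggregation happens):

import org.apache.camel.builder.RouteBuilder;

public class CsvLineSplitRoute extends RouteBuilder {
    @Override
    public void configure() {
        from("file://inbox")
            // stream the file line by line so the whole CSV never has to sit in memory at once
            .split(body().tokenize("\n")).streaming()
                // each exchange body is now a single line; unmarshal just that line as CSV
                .unmarshal().csv()
                // hand the parsed record off to whatever does the aggregation
                .to("direct:aggregateRecord");
    }
}

From there the aggregation step is unchanged; only the way records arrive at it differs.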

MySQL: how to search for a number that might not be unique

I'll try to make this easy by explaining with an example.
So, I consolidate data from two sources, call them 1 and 2. Each source has a column "number" whose values are unique within that source, but once 1 and 2 are consolidated (and they have to be), the numbers are no longer guaranteed to be unique. To handle that, when consolidating 1 and 2 I created a column named "source" and tagged each row with its source name (1 or 2). So if I want to look up a specific "number", I submit a query that filters on the desired number AND source.
Is there a better way to do this? It is working just fine because my database is small, but will it keep working well (i.e. fast, efficiently, etc.) as the DB grows? I mean, it won't have a million entries in the next few years, but I'd still like to do this in an optimal manner.
The only other way I can think of is to keep separate "number" columns for the different sources and query the appropriate column, but that would require adding more columns as I get additional sources. Hmm, what to do?
Your method should work just fine, without causing any perceivable slowdown.
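If lookups ever do start to feel slow as the table grows, the usual fix is a composite index over the two columns you filter on. A minimal JDBC sketch (the table and column names are made up to match your description):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

public class NumberLookup {
    static void addIndex(Connection conn) throws Exception {
        try (Statement st = conn.createStatement()) {
            // one index covering both filter columns; make it UNIQUE if (source, number)
            // really is unique after consolidation
            st.execute("CREATE INDEX idx_source_number ON consolidated (source, number)");
        }
    }

    static void findRows(Connection conn, String source, long number) throws Exception {
        String sql = "SELECT * FROM consolidated WHERE source = ? AND number = ?";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setString(1, source);
            ps.setLong(2, number);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    System.out.println(rs.getLong("number") + " from source " + rs.getString("source"));
                }
            }
        }
    }
}

Because MySQL can use the leftmost prefix of a composite index, the column order matters if you ever query by source alone.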

mysql: using one field to contain many "fields" to save on fields

I have a project which needs an Excel GUI (client's request) with a backend mysql db/table requiring almost 90 fields.
(almost 60 fields are duplications of 6 fields.)
After giving it some thought, I ended up creating a table with 11 fields: 10 searchable fields, and one big field which can contain up to 60 "fields" together, separated by ":".
So a record in that big field would look something like this:
charge1:100:200:200::usd:charge2:1000:2000:2000::usd:charge3:150:200:200:250:USD, and so on
As you can see, these are blocks of 6 fields, and there can be up to 10 of these "blocks", but never more than 255 characters altogether.
None of these "fields" needs to be indexed or searched (that's done on the other 10 fields).
What I am doing is a "SELECT *" query (from the Excel GUI) on the 11 fields, and then (with VBA) I separate those values into columns (this takes less than 1 second).
With VBA I display the data on certain fields within the Excel "form".
This is working fine and I am very happy with the results, as I was looking for a light, simple and super fast solution, and it is.
Is there a "technical" reason for not doing this ?
Perhaps fields with too many characters might give problems ????
I understand there are many ways of handling this, however this is a small project and I am looking for a simple solution that works, not a complex one (with too many tables and/or fields)
Since the GUI is an excel interface I don't want to make it too complex if there isn't need for that.
Thanks in advance for your input.
I think you already have a pretty good idea of the problems that may arise.
Indexing doesn't work well on a field like that, and updating or reading individual values requires extra work in your application.
Also, you're storing what looks mostly like numbers in a string-type column, so that means some extra storage space (though you'd have to weigh that against a bit of overhead for separate columns).
It might turn into a nightmare when the structure of those columns changes.
All of that might be manageable effort for you, but it's entirely possible that the dev after you will hate you. :p
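Just to illustrate the 'extra work' point: every consumer of that packed column ends up carrying parsing logic along these lines (shown here in Java purely as an illustration, since your real code is VBA; the field layout follows your example):

import java.util.Arrays;

public class PackedChargesParser {
    public static void main(String[] args) {
        // layout from the question: blocks of 6 sub-fields, colon-separated
        String packed = "charge1:100:200:200::usd:charge2:1000:2000:2000::usd";

        // -1 keeps trailing empty strings so blank sub-fields stay aligned
        String[] parts = packed.split(":", -1);
        if (parts.length % 6 != 0) {
            throw new IllegalArgumentException("not a whole number of 6-part blocks");
        }
        for (int i = 0; i < parts.length; i += 6) {
            String[] block = Arrays.copyOfRange(parts, i, i + 6);
            System.out.println("block " + (i / 6 + 1) + ": " + Arrays.toString(block));
        }
    }
}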