Training Tesseract: how to handle multiple whitespace characters in training images

I am fine-tuning Tesseract using tesstrain. When creating the ground-truth files, how should I handle multiple whitespace characters?
a.tiff
In the image there are two spaces between "(f)" and "any", and I have no idea how many there are between "30" and "(f)". Should I:
Attempt to provide the correct number of spaces (in a.gt.txt)?
Just always use a single space between words?
Or does it not make a difference?
I've seen on other, tangentially related questions that at inference time there is an option to preserve inter-word spaces. Maybe that makes a difference.
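For reference, that option is Tesseract's preserve_interword_spaces config variable. A minimal sketch of switching it on at inference time, assuming the pytesseract wrapper is acceptable (the file name is just the example image above):

    import pytesseract
    from PIL import Image

    # Ask Tesseract to keep runs of spaces instead of collapsing them.
    # preserve_interword_spaces is a Tesseract config variable; pytesseract
    # passes it through via the config string.
    text = pytesseract.image_to_string(
        Image.open("a.tiff"),
        config="-c preserve_interword_spaces=1",
    )
    print(text)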

Unclear note in the October 2018 release of the Custom Translator User Guide

Can anyone clarify what the following note exactly means?
NOTE: There must not be any new line characters; “\n” or “\r” at the end of sentences. If there are then the alignment of sentences will be corrupted and the training will not be effective.
The note appears on page 5, section 2.1.2.1 Parallel documents.
Does this apply to all document formats? It does not make much sense (at least to me) for .align documents, for instance...
Thank you for bringing this to our attention. We will update the documentation as this statement is inaccurate. It should read
"NOTE: There must not be any new line characters; “\n” or “\r” within a sentence. If there are then the alignment of sentences will be corrupted and the training will not be effective."
The issue we want to address here is that parallel documents should not break a single sentence across multiple lines as it makes sentence alignment much less effective.
Regarding your question about .align files: we do not sentence-align these files, so you could break sentences across multiple lines as long as you did it consistently. That is to say, if a sentence is broken into three lines on the source side, it should be broken into three lines on the target side. Since the sentence aligner is not used, even one unmatched split would misalign all of the following sentences. There is no advantage to splitting sentences, so I strongly urge you not to do it.

Does 1 English letter = 1 Chinese character?

I am a UX designer and we are working on a product where there needs to be a text input field for the user to insert their note. There needs to be a word limit indication, whether they're typing in Traditional Chinese or English.
So my question is:
If the character limit is 15, am I correct to say:
I am in Sweden (11/15 characters)
我在瑞典 (4/15 characters)
I was told that 1 Chinese character counts as 2 bytes and 1 English letter counts as 1 byte. How does this affect the character limit? I want to make sure my design is as clear as possible for the developers.
So it’s about display size, right? Counting words won’t be useful in that case because a word can be as long as you want.
Counting characters is marginally more useful, but also doesn’t guarantee that the message will fit in the end because different characters have different widths. Just as an example, these four strings all consist of five characters each:
“​​​​​”
“     ”
“WWWWW”
“﷽﷽﷽﷽﷽”
There really is no elegant way to solve this. You’d need to know the precise metrics of the font you’re using and then calculate the visual width of each input.
If you’re fine with a “close enough” solution, you can just use the <input> element’s maxlength attribute. HTML and JavaScript count UTF-16 code units, however, which means that characters in the so-called Basic Multilingual Plane count as 1 and everything else counts as 2.
The Basic Multilingual Plane contains 99% of all characters in common, present-day use, so the vast majority of users probably won’t notice anything wrong. You could do something fancier with JavaScript, but I reckon it’s not really necessary for this kind of task.
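As a rough illustration of that counting rule (a Python sketch rather than front-end code, just to show the arithmetic behind maxlength):

    def utf16_units(s: str) -> int:
        # HTML's maxlength and JavaScript's String.length count UTF-16
        # code units: BMP characters count as 1, everything else as 2.
        return len(s.encode("utf-16-le")) // 2

    print(utf16_units("Sweden"))    # 6
    print(utf16_units("我在瑞典"))    # 4  (CJK characters are in the BMP)
    print(utf16_units("😀"))        # 2  (outside the BMP, so two code units)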
Just keep in mind that this approach still won’t guarantee that the user’s input will fit visually on the print-out unless you leave a lot of empty room just in case. Definitely play around with some narrow and wide characters to see how much space they really take up when printed.

Hyphenating arbitrary text automatically

What kinds of challenges does automatic hyphenation face? It seems that you could just lay out the text word by word, breaking when the length of the line exceeds the width of the viewport (or whatever we're wrapping our text in), placing a hyphen after as many characters as fit (provided at least two characters fit and the word is at least four characters long), and skipping words that already contain a hyphen (there's no requirement that every word has to be hyphenated).
But I note that Firefox and IE need a dictionary to be able to hyphenate with CSS's hyphens property. This seems to imply that there are further issues regarding where hyphens can be placed.
What kinds of issues are these? Do any exist in the English language or do they only exist in other languages?
These issues exist in all languages. As has already been pointed out, you can only place a hyphen where the split produces meaningful units; you don't want to split a word like "wr-ong", for example.
Such a unit may or may not be a syllable, though in most languages (including English) it is. The main point is that you cannot pin this down with a few simple rules: you would need to take a lot of phonology into account to get a highly accurate result, and those rules vary from language to language.
With this background, I can see why one would take a dictionary instead, and frankly, being a computational linguist myself, this is also what I would probably opt for.
If you DO want to go for an automatic solution, I would recommend doing some research into the phonology of English syllables, i.e. syllabification. You might want to start with this Wikipedia article:
Wikipedia - Syllabification
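If you do go the dictionary route instead, a minimal sketch using the pyphen library (my own choice of tool; the answer doesn't name one) could look like this:

    import pyphen

    # pyphen wraps the hyphenation pattern dictionaries used by LibreOffice,
    # so break points are language-specific rather than rule-of-thumb guesses.
    dic = pyphen.Pyphen(lang="en_US")

    print(dic.inserted("hyphenation"))   # hy-phen-ation
    print(dic.inserted("wrong"))         # wrong (no acceptable break point)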

what are the disadvantages of having tons of &nbsp; entities?

I've been writing a source-to-display converter for a small project. Basically, it takes an input and transforms the input into an output that is displayable by the browser (think Wikipedia-like).
The idea is there, but it isn't like the MediaWiki style, nor the Markdown style; it has a few innovations of its own. For example, when the user types in a chain of spaces, I presume he wants the spaces preserved. Since HTML collapses runs of spaces by default, I was thinking of converting these chains of spaces into &nbsp; entities (for example, 3 spaces in a row converted into 1 normal space plus 2 &nbsp; entities).
So what happens is that I can foresee the possibility of a ton of &nbsp; tags per post (and a single page may have multiple posts).
I've been hearing a lot of anti-&nbsp; sentiment on the web, but most of it boils down to readability headaches (in this case the input is supplied by the user; if he decides to make his post unreadable, he can do so with any of the other formatting options supplied) or maintenance headaches (which do not apply here, since it's converted output).
I'm wondering: what are the disadvantages of having tons of &nbsp; tags on a webpage?
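For reference, the kind of conversion described above might look roughly like this (a sketch; the helper name and the exact space-to-entity mapping are my own choices):

    import re

    def preserve_space_runs(text: str) -> str:
        # Replace runs of 2+ spaces with alternating "&nbsp; " pairs so the
        # width is kept but the browser can still break the line somewhere.
        def repl(match):
            n = len(match.group(0))
            return "&nbsp; " * (n // 2) + "&nbsp;" * (n % 2)
        return re.sub(r" {2,}", repl, text)

    print(preserve_space_runs("a   b"))   # a&nbsp; &nbsp;b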
You are rendering every space as &nbsp;?
Besides wasting bandwidth, this will not allow dynamic line breaking, as "nbsp" means "non-breaking space". This will most probably cause a lot of trouble.
If it's just being dumped to a client, it's just a matter of size, and if it's gzipped, it barely matters in terms of network traffic.
It'll slow down rendering, I'm sure, and take up DOM space, but whether or not that matters depends on stuff I don't know about your use case(s). You might be able to achieve the same result in other ways, too; not sure.
&nbsp;s aren't tags; they are character entities, like &copy;, &lt;, and &gt;.
I'd say that the disadvantages would be readability. When I see a word, I expect the spacing to be constant (unless it is in a block of justified text).
Can you show me a case where you'd need &nbsp;s?
Have you considered trying to figure out what the user is really trying to achieve by inserting those spaces? Rather than the how (they want to insert spaces), focus on the what (if the spaces are at the beginning of a line, they want to indent the text in question).
An example of this is that many programming sites convert 4 spaces at the start of a line into a pre+code block.
For your purposes, maybe it should be a <block> block.
The end goal is to convert the spaces not into what the user (with their limited means) managed to put there, but into what they actually meant to convey.
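As a sketch of that intent-based approach (the 4-space rule is just the convention mentioned above; the helper name is mine):

    import html

    def indented_lines_to_code_blocks(text: str) -> str:
        # Treat any run of lines indented by 4+ spaces as a code block,
        # escaping it and wrapping it in <pre><code> instead of using &nbsp;.
        out, block = [], []

        def flush():
            if block:
                out.append("<pre><code>" + html.escape("\n".join(block)) + "</code></pre>")
                block.clear()

        for line in text.splitlines():
            if line.startswith("    "):
                block.append(line[4:])
            else:
                flush()
                out.append(line)
        flush()
        return "\n".join(out)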

PHP/MySQL: store formatting of text properly?

I'm writing note-taking software in PHP (to store notes), and the notes most often include code. When I fetch a note from the database, it seems to collapse all whitespace, so any code blocks look ugly. (I already run it through nl2br(); I mean horizontal whitespace.)
What would be the most efficient way to deal with this? I think the database entry keeps the spaces, so would replacing all spaces with &nbsp; be the only solution on the PHP display side? (That seems ugly for long entries.) What are your thoughts on how I can accomplish this, keeping in mind the code may be 1-16M characters long?
It shouldn't be collapsing all whitespace. Try outputting it inside <pre> tags so you can see that whitespace.
What code are you storing in the database? HTML? PHP? This will determine the best solution to your problem.
Different column types will or won't preserve characters like newlines, carriage returns, or tabs. I use TEXT with a UTF-8 collation.
At a very basic level look at nl2br() - http://php.net/manual/en/function.nl2br.php
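To illustrate the <pre> suggestion above, here is a language-neutral sketch (in Python purely for illustration; the PHP equivalent is htmlspecialchars() wrapped in <pre> tags):

    import html

    def render_note(note: str) -> str:
        # Escape first so stored code can't inject markup, then let <pre>
        # preserve spaces, tabs and newlines exactly as stored.
        return "<pre>" + html.escape(note) + "</pre>"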