Tesseract/gImageReader OCR: older texts are missing spaces between words

I'm working with some older texts, where the letters are sometimes a little smudgy.
The letters and words get recognized near-perfectly, and when I look at the hOCR or html file, the text looks perfect.
But when I export to PDF with an invisible text layer the spaces between words frequently go missing, for paragraphs at a time. This is annoying when trying to highlight parts of the text and then copy-paste those excerpts.
Any advice?
Other than these old texts, gImageReader is absolutely amazing and does exactly what I want. I did try the "Middle English" language, but that had the same result.
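
Since the hOCR output reportedly looks correct, one possible workaround is to pull the plain text out of the hOCR file, putting a space between every recognized word. This is only a sketch: it assumes the usual Tesseract hOCR layout, where each word is a span with class ocrx_word inside a span with class ocr_line, and the names HocrTextExtractor and hocr_to_text are made up for this example. It does not repair the PDF text layer itself.

```python
from html.parser import HTMLParser


class HocrTextExtractor(HTMLParser):
    """Collect the text of ocrx_word spans, grouped by ocr_line (assumed layout)."""

    def __init__(self):
        super().__init__()
        self.lines = []        # one list of words per ocr_line
        self._word_depth = 0   # > 0 while inside an ocrx_word span
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if self._word_depth:
            # Track spans nested inside a word so the right </span> closes it.
            if tag == "span":
                self._word_depth += 1
            return
        classes = (dict(attrs).get("class") or "").split()
        if "ocr_line" in classes:
            self.lines.append([])
        elif "ocrx_word" in classes:
            self._word_depth = 1
            self._buffer = []

    def handle_data(self, data):
        if self._word_depth:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if self._word_depth and tag == "span":
            self._word_depth -= 1
            if self._word_depth == 0:
                word = "".join(self._buffer).strip()
                if word:
                    if not self.lines:
                        self.lines.append([])
                    self.lines[-1].append(word)


def hocr_to_text(hocr):
    """Return the recognized words, space-separated, one output line per ocr_line."""
    parser = HocrTextExtractor()
    parser.feed(hocr)
    parser.close()
    return "\n".join(" ".join(words) for words in parser.lines if words)


# Tiny hand-written hOCR fragment, not real Tesseract output.
sample = (
    "<span class='ocr_line' title='bbox 0 0 100 20'>"
    "<span class='ocrx_word' title='bbox 0 0 40 20'>older</span>"
    "<span class='ocrx_word' title='bbox 50 0 100 20'>texts</span>"
    "</span>"
)
print(hocr_to_text(sample))
```

The spaces here come from the word boundaries in the markup, not from any whitespace in the file, so the result should not depend on how the PDF export spaces the invisible text.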

Related

How do word-breaking hyphens work with copy-paste (in LaTeX / PDFs)

I just copied a piece of text from a PDF, the text contained a hyphenated word broken across two lines. The pasted text didn't contain the hyphen character and I'm wondering if someone can explain why that's the case.
I checked by opening the PDF in Chrome, Firefox, and Acrobat Reader DC and saw the same behaviour in all three, so I'm assuming it's either a feature of the Windows copy-paste mechanism or something to do with the underlying encoding of the text. I would have guessed that the PDF file format would hard-code the breaks into the text rather than dynamically (but consistently) calculating them?
If it's useful to know, I believe the document was generated in some form of TeX program, probably in LaTeX.
How does this work? Is it related to "Soft Hyphens"?

How to translate text/HTML that has stylistic line breaks?

The general question here is how do you mark text up for translation on an HTML page when the position of the line breaks has to look eye pleasing (as opposed to the line break always happening after a specific word)?
I have a web page I want to translate into 5 different languages. In some places, I have text like "Enjoyed by 10,000 happy users" under a small icon that needs to be displayed in an eye pleasing way. This looks good as the noun phrase is on its own line and each line has about the same number of letters:
<icon>
Enjoyed by
10,000 happy users
Do I send this text to be translated as this?
Enjoyed by <br> 10,000 happy users
Problems:
By adding markup to the text it makes it unlikely I can reuse the string elsewhere but I can't see any other options.
How do I cope with where I place the <br> in the translated text, given the translated text will have a different number of letters (e.g. "Genossen von 10.000 glückliche Benutzer" in German)? Just review how each one renders on the page manually and adjust the <br> myself after the translations come back?
I can't see any clean way to do this. I could remove the markup and try to write some server code that will add the break in a nice place but I can't see how it's possible to automate (e.g. putting noun phrases on their own line if possible when the previous line has enough letters). CSS has even fewer options to do this.

Your question is somewhat subjective, but I think your choices are to either trust your translators to format the HTML, or trust them to come up with copy that fits your design. Trying to engineer your way to a "clean" solution with server code sounds like it will achieve the exact opposite.
Make sure your design is good enough to cope with a reasonable range of word lengths. If your layout lives and dies by the text being exactly X characters long, then it isn't well designed. You can always ask your translators to try and write a translation in less than a maximum number of characters. This is why we still have human translators - they are also copywriters :)
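
If you do give translators a character budget, a small check can flag the strings that come back too long. This is just a sketch: the function name over_limit and the 32-character budget are made up for illustration, and the two strings are the ones from the question.

```python
def over_limit(translations, max_chars):
    """Return {language: length} for each translation longer than max_chars."""
    return {
        lang: len(text)
        for lang, text in translations.items()
        if len(text) > max_chars
    }


# Strings taken from the question; the 32-character budget is a made-up example.
translations = {
    "en": "Enjoyed by 10,000 happy users",
    "de": "Genossen von 10.000 glückliche Benutzer",
}
for lang, length in over_limit(translations, max_chars=32).items():
    print(f"{lang}: {length} characters, ask the translator for a shorter version")
```

Character count is only a rough proxy for rendered width, so a manual look at the page is still worthwhile.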

Is there a New Line equivalent of the No-Break Space Unicode character?

I am looking for a way to force a new line / break line in a HTML text box. It's a "description" box and I need to convey a bunch of information (main description, multiple download links, social media links, ...), and would like to give it a slightly nicer, more readable spacing.
The box's line breaking works the same as it does here on Stack Overflow:
Breaking the line once in the editor results in the lines of text staying together in the rendering.
Breaking the line twice results in a new paragraph with a small whitespace in front.
Breaking the line three or more times does the same as doing it twice.
And I can only input flat text, but on my Mac, I can press Alt+Space to type the Unicode character for a No-Break Space (U+00A0), with the same result as if I added &nbsp; into the HTML file. Together with "•" (Shift+Alt+.), this works perfectly to force the appearance of an unordered list (even nested ones, if I want to).
So now I'm looking for a way to do the same with a New Line, so I can better space out my blocks of information. Any ideas?
PS: I know this will make a lot of web developers cringe, but because I'm just using Unicode characters, it won't break anything. If the site expects you to post download links and contact information, why didn't they just provide segments for those, to do it correctly, instead of just a single "Description" box? There are separate boxes for "Technology used" and "Help / Controls", so why not for "Download", "Contact Info", ...? :/
PPS: I guess forcing a horizontal rule would be fine too (and maybe look even better)
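
For reference, Unicode does define dedicated separator characters, U+2028 LINE SEPARATOR and U+2029 PARAGRAPH SEPARATOR. The sketch below only prints their names and categories next to the No-Break Space; whether the description box actually renders them as a break is an open question, since many HTML renderers treat them like ordinary whitespace, so this would have to be tried on the site itself. The helper name describe is made up for the example.

```python
import unicodedata


def describe(ch):
    """Return code point, Unicode name and general category of one character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)} ({unicodedata.category(ch)})"


for ch in ("\u00a0", "\u2028", "\u2029"):
    print(describe(ch))

# Python's own splitlines() treats U+2028 as a line boundary;
# a browser or a site's text renderer may not.
print("first block\u2028second block".splitlines())
```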

Arabic with diacritics renders weird - html iOS with webfonts

I have been debugging this for days, and no luck. The html page I am developing, has Arabic text. When using a webfont, Arabic text with diacritics does not render well. Words are too close to one another, often overlapping, and paragraphs have leading space, and text is clipped at the end. Things render well when diacritics are not present. Things render well without webfonts. The problem is just with iOS. I tried multiple webfonts. I tried many different things. Please help.

In HTML and CSS, how do I make japanese text break lines correctly?

I'm writing a simple paragraph in both English and Japanese, using only HTML and CSS. The English text breaks lines normally (when a word doesn't fit on a line anymore, it's pushed to the next one).
With Japanese, though, only part of a word is pushed to the next line rather than the whole word. I've tried setting word-wrap to break-word and normal, but nothing changes (with the Japanese text).
How do I make whole words in Japanese jump to the next line like it happens in English?

English separates words with spaces, Japanese doesn't.
Whether characters in Japanese form a word or not depends on context. In many cases, looking for certain grammatical (Kana) particles could be used to separate words - but this wouldn't even be close to being reliable.
Essentially, you'd need a Japanese dictionary / understanding of the language to identify where the words start and end - a browser won't know how to do this.
Alternatively, if you know the start and end of the words, you could perhaps wrap each one in a span - then use CSS to ensure each span wraps to a new line as a whole when it doesn't fit.
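
If the word boundaries are already known, generating the markup can be scripted instead of typed by hand. A minimal sketch, assuming the text arrives as a pre-segmented list of chunks; the function name wrap_chunks and the class name "word" are made up for this example, and the chunks still have to come from someone who can read the text.

```python
from html import escape


def wrap_chunks(chunks, css_class="word"):
    """Wrap each pre-segmented chunk in its own span, escaping HTML characters."""
    return "".join(
        f'<span class="{escape(css_class)}">{escape(chunk)}</span>'
        for chunk in chunks
    )


# Rule that keeps each span on one line; the class name must match css_class.
CSS_RULE = ".word { display: inline-block; }"

# Example sentence, split by hand into units that should stay together.
chunks = ["私は", "キッチンで", "パンを", "食べました。"]
print(wrap_chunks(chunks))
print(CSS_RULE)
```

The spans are emitted without whitespace between them so that no visible gaps are added to the Japanese text.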

Japanese has specific rules that are followed when breaking text. They are called 禁則処理 (kinsoku shori). Here is a link explaining the rules. The rules are mostly concerned with special characters. Have a look at any popular Japanese webpage and you will see that multi-character (kana and kanji) words are often split. I often see です split between lines.
Update:
I stumbled across this tool recently. I haven't tried it out yet, but the theory is solid. If someone is looking to improve the line breaks with Japanese text this could be a good solution.
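
To give a feel for what the kinsoku rules do, here is a rough sketch of a fixed-width wrapper that refuses to start a line with closing punctuation or small kana, and refuses to end a line with an opening bracket. The character sets are a small, non-exhaustive sample chosen for illustration, and the function names are made up; this is not how any particular browser implements it.

```python
# Small, non-exhaustive sample sets; real kinsoku tables are longer.
NO_LINE_START = set("、。，．・：；？！）」』】〕ーぁぃぅぇぉっゃゅょァィゥェォッャュョ")
NO_LINE_END = set("（「『【〔")


def can_break_between(prev_char, next_char):
    """True if a line may end after prev_char and the next line start with next_char."""
    return prev_char not in NO_LINE_END and next_char not in NO_LINE_START


def wrap(text, width):
    """Split text into lines of at most `width` characters, respecting the sets above."""
    if width < 1:
        raise ValueError("width must be at least 1")
    lines = []
    start = 0
    while len(text) - start > width:
        end = start + width
        # Move the break left until it no longer violates a rule
        # (or until only one character would be left on the line).
        while end > start + 1 and not can_break_between(text[end - 1], text[end]):
            end -= 1
        lines.append(text[start:end])
        start = end
    lines.append(text[start:])
    return lines


for line in wrap("私はキッチンで、パンを食べました。", 7):
    print(line)
```

With a width of 7, a naive split would put the comma 、 at the start of the second line; the sketch breaks one character earlier instead. Note that it still splits ordinary words freely, which matches the behaviour described above.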

I'm not an expert with Japanese specifically so it's hard for me to tell if things are wrapping correctly, but I just had to solve this problem myself and both word-break: keep-all and white-space: nowrap seemed to solve the issue for me, so those might be worth trying out.

Until browsers are smart enough to do on-the-fly semantic analysis of the language, there are only a couple of options:
1/ Understand enough of the language to be able to group semantic elements in their own, unbreakable DOM elements. Something like (without the line breaks):
<span class="el">私は</span>
<span class="el">キッチンで</span>
<span class="el">パンを</span>
<span class="el">食べました。</span>
Then in CSS, use something like .el { display: inline-block; }. You probably want to do this only for headings and important pieces of text, since it could impact accessibility (i.e. how screen readers interpret the text). The other drawbacks are that 1/ you need to understand the text to know where to add the blocks, and 2/ this obviously only works for static text (and even then, it's still a manual, painstaking process).
2/ Use a tool that does the grouping for you. It could be something on the client side, like TinySegmenter (which segments a bit too much for my taste), or on the server side, with things like Budou, which uses the Google Cloud Natural Language API and ML to analyze your sentences. The downsides (at least for Budou) are that 1/ you need Python (I think I saw a Node.js port somewhere), and 2/ it's not free.
Hope this helps!

Try setting the CSS property
line-break:strict;
Check it out here.
