pytesseract | Difference between image_to_string and image_to_boxes - ocr

I'm using pytesseract to perform OCR. My application only performs OCR on PNGs with a specific font, so I'm in the process of training tesseract on that specific font.
Consider the following test image (test_1.png):
This code:
from PIL import Image
import pytesseract
img = Image.open('test_1.png')
pytesseract.image_to_string(image=img)
will produce this result:
Lorem ipsum dolor sit amet, consectetm
elit. Fusce tcmpus dignissim diam. Null
dapibus cu, dignissim nec, vulputate egt
Curabitur aliquam, augue eget posuere z
lacus varius augue, sit amet lacinia uma
I want to produce a .box file so I can train tesseract.
I'm using the following code to do that (exactly the same image):
boxes = pytesseract.image_to_boxes(image=img)
This produces a completely different result:
Question: Why is there such a big difference between the results from image_to_string and image_to_boxes?

Related

Is it possible to create a variable text string in HTML?

I have a document where there's a component name that's repeated hundreds of times. I'm trying to make this document into a template where the component name will change from report to report. Instead of having hundreds of iterations of "Example123" in plain text, I'd like to define a variable text string, for example "&ComponentName;" and use that throughout the template so that any change to that variable changes each instance of the component name. Thereby, creating a situation where anyone can create a new document for different components with one change instead of hundreds. Is something like that possible when just using HTML?
I've tried looking up every element in HTML on W3Schools to see if there's something like this, but to no avail. I've also tried searching Stack Overflow for this, but I think I might be using the wrong terminology and I'm not sure how else to describe what I'm looking for. When I think of "variable" I'm thinking of "X" which can be defined by the user, but when I look up "variable text in html" I tend to get results about <var>, which doesn't help in this use case.
I tried to use
<script>
const string = "The revolution will not be televised.";
console.log(string);
</script>
to see whether the text string would appear in the output, but nothing appeared, and I'm unfamiliar with JavaScript.
Use some placeholder text inside your HTML, and then replace every instance of that text with your component name in JavaScript:
const placeholder = '#CMPNT#'
const componentName = 'MY COMPONENT NAME'
document.querySelector('.container').innerHTML = document.querySelector('.container').innerHTML.replaceAll(placeholder, componentName)
<div class="container">
<p>Lorem #CMPNT# dolor sit amet, consectetur adipiscing
elit. Vivamus ac scelerisque augue. Donec eu mattis libero.
Quisque gravida sit amet tellus id #CMPNT#. Orci varius
natoque penatibus et magnis dis parturient montes,
nascetur ridiculus mus. Suspendisse #CMPNT# et magna vel
congue. In non maximus diam. Suspendisse tristique est
vitae nibh sollicitudin varius. Integer at dolor vitae felis
placerat fringilla eu sit amet ipsum. #CMPNT# euismod ipsum
eget neque rhoncus sodales. Cras velit dui, tempus at pulvinar
eget, varius sed #CMPNT#. Donec egestas, erat nec luctus
suscipit, libero quam maximus mauris, id sagittis ligula
quam condimentum lacus.</p>
</div>
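The same idea can be sketched as a small pure function, which makes it easy to check outside the browser. This is only a sketch: the helper name fillPlaceholders and the shortened template string are made up for illustration, and it assumes the placeholder token never appears in the text for any other reason.

```javascript
// Hypothetical helper (not part of any library): swap each placeholder
// token in the template for its value.
function fillPlaceholders(template, values) {
  let result = template;
  for (const [token, value] of Object.entries(values)) {
    // split/join replaces every occurrence and treats both the token
    // and the value as literal text (no regex or "$&" special cases).
    result = result.split(token).join(value);
  }
  return result;
}

// Shortened stand-in for the container's HTML.
const template = "<p>Lorem #CMPNT# dolor sit amet. Quisque gravida #CMPNT#.</p>";
const filled = fillPlaceholders(template, { "#CMPNT#": "Example123" });
console.log(filled);
```

In a page, the template would be the container's innerHTML and the result would be assigned back to it. The script has to run after the container element exists, for example by placing the script tag below the div.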

How to format HTML in Sublime Text 3

I’m using Sublime 3 to prepare HTML files that will eventually be turned into an epub in Sigil. This is working very well except that the formatting isn’t helping the readability.
I have HTMLbeautify and HTML/CSS/JSPrettify. They do a great job with the indentation but I would also like a method of putting the opening and closing paragraph tags on new lines, something like
<p>
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Suspendisse rutrum dolor in lacus efficitur consequat. Cras turpis dolor, pretium sit amet tincidunt sed, porta iaculis lectus. Morbi consectetur vitae justo eu pretium.
</p>
Can anybody help?
I've read all the other Sublime/HTML formatting queries and I can't find anything that quite covers this.
Just select all lines (Ctrl+A) and then from the menu select Edit → Line → Reindent. This will work.

Is it possible to capture a block of nested HTML with PCRE RegEx?

Before you slate me, yes I know that you shouldn't parse HTML with regex, you should use a dedicated parser. I don't have that option in the language I'm using (Xojo) and for various reasons, I need to use RegEx.
I'm trying to capture an entire block of HTML that may or may not contain nested HTML elements. Examples:
<blockquote> This is a blockquote with two paragraphs. Lorem ipsum dolor sit amet,
consectetuer adipiscing elit. Aliquam hendrerit mi posuere lectus.
Vestibulum enim wisi, viverra nec, fringilla in, laoreet vitae, risus.
Donec sit amet nisl. Aliquam semper ipsum sit amet velit. Suspendisse
id sem consectetuer libero luctus adipiscing.</blockquote>
-----------------
<blockquote> This is the first level of quoting.
<blockquote> This is nested blockquote.</blockquote>
Back to the first level.</blockquote>
-----------------
<div>
Not nested
</div>
-----------------
<div>
Top level
<div>Nested</div>
</div>
I came up with this pattern: <(\w*)>([\S\s]*?)<\/\1> but whilst it works for simple blocks of HTML, it fails if the block contains a nested element with the same tag as the parent block, because the lazy match stops at the first closing tag.
I'm using the PCRE variant of RegEx and coding in Xojo.
Does anyone have any useful advice on how to solve this problem? Thank you.
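One way to handle same-tag nesting without a full parser is to use a simple regex only to find the opening and closing tags, and keep a depth counter while scanning. The sketch below illustrates that logic in JavaScript rather than Xojo, so it is not a pure-regex answer; the function name captureBlocks is made up, and it assumes the tag name is a plain word and that the tags are balanced. PCRE also supports recursive subpatterns, which are not shown here.

```javascript
// Hypothetical helper: return every outermost <tagName>...</tagName> block,
// including any nested blocks of the same tag inside it.
function captureBlocks(html, tagName) {
  // Matches an opening tag (attributes allowed) or a closing tag.
  // Group 1 is "/" for a closing tag and "" for an opening tag.
  const tagPattern = new RegExp(`<(/?)${tagName}\\b[^>]*>`, "gi");
  const blocks = [];
  let depth = 0;
  let start = -1;
  let match;
  while ((match = tagPattern.exec(html)) !== null) {
    if (match[1] === "") {
      // Opening tag: remember where the outermost block begins.
      if (depth === 0) start = match.index;
      depth += 1;
    } else if (depth > 0) {
      // Closing tag: the block is complete when depth returns to zero.
      depth -= 1;
      if (depth === 0) blocks.push(html.slice(start, tagPattern.lastIndex));
    }
  }
  return blocks;
}

const sample =
  "<blockquote> This is the first level of quoting.\n" +
  "<blockquote> This is nested blockquote.</blockquote>\n" +
  "Back to the first level.</blockquote>";
const blocks = captureBlocks(sample, "blockquote");
console.log(blocks.length, blocks[0] === sample);
```

The same loop can be written in any language with a regex search that reports match positions, so it should port to Xojo's RegEx class, though that port is untested here.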

MultiColumn Text

Is there a way to specify that text must flow into multiple columns, with the column width defined as a percentage?
Something like:
<div style="width:20%; max-height:100px;" >Lorem ipsum dolor sit amet, consectetur adipiscing elit. Aliquam sodales urna non odio egestas tempor. Nunc vel vehicula ante. Etiam bibendum iaculis libero, eget molestie nisl pharetra in. In semper consequat est, eu porta velit mollis nec Lorem ipsum dolor sit amet, consectetur adipiscing elit. Aliquam sodales urna non odio egestas tempor. Nunc vel vehicula ante.
</div>
If the text overflows the bounds of the div, a new column is displayed.
I'd probably advise having a look at CSS3's Multi Column functionality if you don't have to support older browsers:
http://www.w3.org/TR/css3-multicol/
This is not currently supported using native HTML. Currently JavaScript must be used to obtain this feature.
See: http://www.htmlgoodies.com/html5/tutorials/how-to-create-multi-columns-in-css3-and-javascript.html#fbid=uRKNCpHfmWY
This is a JavaScript Solution:
I built an iPad web app in my last term at uni which was supposed to be a newspaper app. To get the newspaper-like rows we used this jQuery plugin:
http://archive.plugins.jquery.com/project/Columnizer
or: http://welcome.totheinter.net/columnizer-jquery-plugin/
You can specify the width of your columns and the number of columns, and it has quite a few features which might be useful for your purpose (though we didn't actually need them).

Replace single words/lines with text from multiple inputs

What would be the best method for replacing variables/words/lines of text in a larger "paragraph" of code?
Example:
Lorem ipsum dolor $SIT amet, consectetur adipiscing elit. Aliquam condimentum dolor ut est faucibus dapibus. Donec molestie dictum nisi, eu euismod $SAPIEN gravida in. Aliquam dictum, tellus eu facilisis laoreet, sapien nunc placerat turpis, eu pretium augue eros vel lectus. Quisque condimentum lorem $EROS, vel pharetra tortor.
I want to be able to enter text in a textbox/prompt to replace the "Variables" $SIT, $SAPIEN, $EROS with actual values automatically.
I trust I've made myself obscure? :P
I'm n00b at any sort of coding. I only know some basic HTML, PHP, and Java. But please give me a clear solution with an example or link or more help.
Thanks so much!
You must use JavaScript if you want to do it client-side, or any server-side language (PHP, Python, Ruby) if you want to do it that way. All of these languages have an equivalent of a "string replace" function that takes the string(s) to search for, the replacement string(s), and the subject to work on. References for PHP and JS:
http://php.net/manual/en/function.str-replace.php
http://www.w3schools.com/jsref/jsref_replace.asp
The way that you'll do it is up to you.
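As a rough client-side illustration, the sketch below fills $NAME variables from an object of values. The function name fillVariables and the sample values are made up, and it assumes variable names consist only of upper-case letters and underscores, as in the question's example.

```javascript
// Hypothetical helper: replace each $NAME in the text with values[NAME].
// Names with no entry in `values` are left untouched.
function fillVariables(text, values) {
  // \$([A-Z_]+) matches a dollar sign followed by an upper-case name.
  return text.replace(/\$([A-Z_]+)/g, (whole, name) =>
    Object.prototype.hasOwnProperty.call(values, name) ? values[name] : whole
  );
}

const text =
  "Lorem ipsum dolor $SIT amet. Donec $SAPIEN gravida in $EROS, vel pharetra.";
const out = fillVariables(text, { SIT: "sit", SAPIEN: "sapien" });
console.log(out);
```

In a browser, the values object could be filled from text inputs or from prompt() before calling the function.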