Markup font-style (italic) in tesseract OCR

Markup font-style (italic) in tesseract OCR - html

Have tesseract-ocr v3.02.02 installed on Windows 7, and have used it via the command line:
1) Output png text to a text file: tesseract image.png txtfile
2) Output png text to a html file: tesseract image.png htmlfile hocr
I need it to be able to markup any italic text in the output text or html file. How do I do this (preferably on the command line - never used it in API mode)?

The hocr output by Tesseract includes only the word coordinates and confidence values, not font-related information. As such, you will need to modify the source code to output what you want for the command-line mode, or use its API.

Related

Cannot convert DOCX to HTML with Python

I've tried it by using mammoth:
import mammoth
result = mammoth.convert_to_html("MyDocument.docx")
print (result.value)
I don't get an HTML, but this strange code:
kbW7yqZoo4h9pYM6yBxX1QFx2pCoPYflXfieIPbtqpT913Vk7OzcZdEk3eO7TbWjvZNTGilsfmRrPwDvB[...]
I've also tried to use docx2html, but I can't install it. When I run pip install docx2html I get this error:
SyntaxError: Missing parentheses in call to 'print'

Mammoth .docx to HTML converter
Mammoth is designed to convert .docx documents, such as those created by Microsoft Word, and convert them to HTML. Mammoth aims to produce simple and clean HTML by using semantic information in the document, and ignoring other details. For instance, Mammoth converts any paragraph with the style Heading 1 to h1 elements, rather than attempting to exactly copy the styling (font, text size, colour, etc.) of the heading.
There's a large mismatch between the structure used by .docx and the structure of HTML, meaning that the conversion is unlikely to be perfect for more complicated documents. Mammoth works best if you only use styles to semantically mark up your document.
The following features are currently supported:
Headings.
Lists.
Customisable mapping from your own docx styles to HTML. For instance, you could convert WarningHeading to h1.warning by providing an appropriate style mapping.
Tables. The formatting of the table itself, such as borders, is currently ignored, but the formatting of the text is treated the same as in the rest of the document.
Footnotes and endnotes.
Images.
Bold, italics, underlines, strikethrough, superscript and subscript.
Links.
Line breaks.
Text boxes. The contents of the text box are treated as a separate paragraph that appears after the paragraph containing the text box.
Comments.
Installation
pip install mammoth
Basic conversion
To convert an existing .docx file to HTML, pass a file-like object to mammoth.convert_to_html. The file should be opened in binary mode. For instance:
import mammoth
with open("document.docx", "rb") as docx_file:
result = mammoth.convert_to_html(docx_file)
html = result.value # The generated HTML
messages = result.messages # Any messages, such as warnings during conversion
You can also extract the raw text of the document by using mammoth.extract_raw_text. This will ignore all formatting in the document. Each paragraph is followed by two newlines.
with open("document.docx", "rb") as docx_file:
result = mammoth.extract_raw_text(docx_file)
text = result.value # The raw text
messages = result.messages # Any messages

You can use pypandoc module for that purpose. See below code
import pypandoc
output = pypandoc.convert_file('file.docx', 'docx', outputfile="file_converted.html")

The issue you're having is probably that mammoth doesn't create legit HTML files, just HTML snippets. Meaning it's missing the and tags.
Some browsers can still render the content from the file since they're advanced enough to do so, but I ran into a similar problem when trying to use the raw output.
A nifty workaround for this is to add this to your code to convert it to proper HTML files:
import mammoth
with open("test.docx", "rb") as docx_file:
result = mammoth.convert_to_html(docx_file)
html = result.value # The generated HTML
messages = result.messages # Any messages,
full_html = (
'<!DOCTYPE html><html><head><meta charset="utf-8"/></head><body>'
+ html
+ "</body></html>"
)
with open("test.html", "w", encoding="utf-8") as f:
f.write(full_html)
Where test.html is whatever the title you gave to your document.
I'm not taking credit for this, I found it here as well, but can't find the source post.

As stated in the documentation:
To convert an existing .docx file to HTML, pass a file-like object to
mammoth.convert_to_html. The file should be opened in binary mode. For
instance:
import mammoth
with open("document.docx", "rb") as docx_file:
result = mammoth.convert_to_html(docx_file)
html = result.value # The generated HTML
messages = result.messages # Any messages, such as warnings during conversion

Detect subtitle from an image

I need a customized idea for detect only subtitle in an image. Maybe some steps of image processing such that to be able to extract (with tesseract for example) characters from the processed image correctly.

Why wouldn't you cut the bottom of the image and then apply tesseract on this ?
In bash on linux I would put the following in a bash script and apply it to all image (with xargs for example):
# filenames
input="$1"
extension=$(echo $(echo "$input"|sed 's/.*\.//g'))
nomfich=$(basename $input .$extension)
interm="$nomfich.tiff"
# convert to tiff and crop
convert -gravity South -crop 100%x15%+0+0 -density 300 $input $interm
# ocr
tesseract $interm $nomfich

Color output in Sublime Text 2 console?

Is there any way to output coloful text to the sublime text console? I tried this:
"\033[0;32mTest\033[0m"
and the console displays something similar to this:
ESC[0;32mTestESC[0m"

Unfortunately, the Sublime Text console is essentially monochrome. Its foreground and background colors can be changed via the Packages/Theme - Default/Widgets.stTheme file (or your theme's equivalent), but you can't colorize output using terminal escape codes, like you are using.

Html to ansi colored terminal text

I am under Linux and I want to fetch an html page from the web and then output it on terminal. I found out that html2text essentially does the job, but it converts my html to a plain text whereas I would better convert it into ansi colored text in the spirit of ls --color=auto. Any ideas?

The elinks browser can do that. Other text browsers such as lynx or w3m might be able to do that as well.
elinks -dump -dump-color-mode 1 http://example.com/
the above example provides a text version of http://example.com/ using 16 colors. The output format can be customized further depending on need.
The -dump option enables the dump mode, which just prints the whole page as text, with the link destinations printed out in a kind of "email-style".
-dump-color-mode 1 enables the coloring of the output using the 16 basic terminal colors. Depending on the value and the capabilities of the terminal emulator this can be up to ~16 million (True Color). The values are documented in elinks.conf(5).
The colors used for output can be configured as well, which is documented in elinks.conf(5) as well.

The w3m browser supports coloring the output text.

You can use the lynx browser to output the text using this command.
lynx -dump http://example.com

How to render an HTML file offline?

I have a collection of html files that I gathered from a website using wget. Each file name is of the form details.php?id=100419&cid=13%0D, where the id and cid varies. Portions of the html files contain articles in Asian language (Unicode text). My intention is to extract the Asian-language text only. Dumping the rendered html using a command-line browser is the first step that I have thought of. It will eliminate some of the frills.
The problem is, I cannot dump the rendered html to a file (using, say, w3m -dump ). The dumping works if only I direct the browser (at the command-line) to the properly formed URL : http://<blah-blah>/<filename>. But this is way I will have to spend the time to download the files once again from the web. How do I get around this, what other tools could I use?
w3m -dump <filename> complains saying:
w3m: Can't load details.php?id=100419&cid=13%0D.
file <filname> shows:
details.php?id=100419&cid=13%0D: Non-ISO extended-ASCII HTML document text, with very long lines, with CRLF, CR, LF, NEL line terminators

We Keep Coding

html mysql json google-apps-script actionscript-3 ms-access google-chrome google-maps reporting-services sql-server-2008