How do I segment a document using Tesseract then output the resulting bounding boxes and labels - ocr

I'm trying to get Tesseract to output a file with labelled bounding boxes that result from page segmentation (pre OCR). I know it must be capable of doing this 'out of the box' because of the results shown at the ICDAR competitions where contestants had to segment and various documents (academic paper here). Here's an example from that paper illustrating what I want to create:
I have built the latest version of tesseract using brew, brew install tesseract --HEAD, and have been trying to edit config files located in /usr/local/Cellar/tesseract/HEAD/share/tessdata/configs/ to output labelled boxes. The output received using hocr as the config, i.e.
tesseract infile.tiff outfile_stem -l eng -psm 1 hocr
gives a bounding box for everything and has some labelling in class tags e.g.
<p class='ocr_par' dir='ltr' id='par_5_82' title="bbox 2194 4490 3842 4589">
<span class='ocr_line' id='line_5_142' ...
but I can't visualise this. Is there a standard tool to visualize hOCR files, or is the facility to create an output file with bounding boxes built into Tesseract?
The current head version details:
tesseract 3.04.00
leptonica-1.71
libjpeg 8d : libpng 1.6.16 : libtiff 4.0.3 : zlib 1.2.5
Edit
I'm really looking to achieve this using the command line tool (as in examples above). #nguyenq has pointed me to the API reference, unfortunately I have no c++ experience. If the only solution is to use the API, please can you provide a quick python example?

Success. Many thanks to the people at the Pattern Recognition and Image Analysis Research Lab (PRImA) for producing tools to handle this. You can obtain them freely on their website or github.
Below I give the full solution for a Mac running 10.10 and using the homebrew package manager. I use wine to run windows executables.
Overview
Download tools: Tesseract OCR to Page (TPT) and Page Viewer (PVT)
Use the TPT to run tesseract on your document and convert the HOCR xml to a PAGE xml
Use the PVT to view the original image with the PAGE xml information overlaid
Code
brew install wine # takes a little while >10m
brew install gs # only for generating a tif example. Not required, you can use Preview
brew install wget # only for downloading example paper. Not required, you can do so manually!
cd ~/Downloads
wget -O paper.pdf "http://www.prima.cse.salford.ac.uk/www/assets/papers/ICDAR2013_Antonacopoulos_HNLA2013.pdf"
# This command can be ommitted and you can do the conversion to tiff with Preview
gs \
-o paper-%d.tif \
-sDEVICE=tiff24nc \
-r300x300 \
paper.pdf
cd ~/Downloads
# ttptool is the location you downloaded the Tesseract to PAGE tool to
ttptool="/Users/Me/Project/tools/TesseractToPAGE 1.3"
# sudo chmod 777 "$ttptool/bin/PRImA_Tesseract-1-3-78.exe"
touch "$ttptool/log.txt"
wine "$ttptool/bin/PRImA_Tesseract-1-3-78.exe" \
-inp-img "$dl/Downloads/paper-3.tif" \
-out-xml "$dl/Downloads/paper-3-tool.xml" \
-rec-mode layout>>log.txt
# pvtool is the location you downloaded the PAGE Viewer tool to
pvtool="/Users/Me/Project/tools/PAGEViewerMacOS_1.1/JPageViewer 1.1 (Mac OS, 64 bit)"
cd "$pvtool"
dl=~
java -XstartOnFirstThread -jar JPageViewer.jar "$dl/Downloads/paper-3-tool.xml" "$dl/Downloads/paper-3.tif"
Results
Document with overlays (rollover to see text and type)
Overlays alone (use GUI buttons to toggle)
Appendix
You can run tesseract yourself and use another tool to convert its output to PAGE format. I was unable to get this to work but I'm sure you'll be fine!
# Note that the pvtool does take as input HOCR xml but it ignores the region type
brew install tesseract --devel # installs v 3.03 at time of writing
tesseract ~/Downloads/paper-3.tif ~/Downloads/paper-3 hocr
mv paper-3.hocr paper-3.xml # The page viewer will only open XML files
java -XstartOnFirstThread -jar JPageViewer.jar "$dl/Downloads/paper-3.xml"
At this point you need to use the PAGE Converter Java Tool to convert the HOCR xml into a PAGE xml. It should go a little something like this:
pctool="/Users/Me/Project/tools/JPageConverter 1.0"
java -jar "$pctool/PageConverter.jar" -source-xml paper-3.xml -target-xml paper-3-hocrconvert.xml -convert-to LATEST
Unfortunately, I kept getting null pointers.
Could not convert to target XML schema format.
java.lang.NullPointerException
at org.primaresearch.dla.page.converter.PageConverter.run(PageConverter.java:126)
at org.primaresearch.dla.page.converter.PageConverter.main(PageConverter.java:65)
Could not save target PAGE XML file: paper-3-hocrconvert.xml
java.lang.NullPointerException
at org.primaresearch.dla.page.io.xml.XmlInputOutput.writePage(XmlInputOutput.java:144)
at org.primaresearch.dla.page.converter.PageConverter.run(PageConverter.java:135)
at org.primaresearch.dla.page.converter.PageConverter.main(PageConverter.java:65)

You can use its API to obtain the bounding boxes at various levels (character/word/line/para) -- see API Example. You have to draw the labels yourself.

If you are python familiar, you can directly use tesserocr library which is a nice python wrapper around the C++ API. Here is a code snippet to draw polygons at block level using PIL:
from PIL import Image, ImageDraw
from tesserocr import PyTessBaseAPI, RIL, iterate_level, PSM
img = Image.open(filename)
results = []
with PyTessBaseAPI() as api:
api.SetImage(img)
api.SetPageSegMode(PSM.AUTO_ONLY)
iterator = api.AnalyseLayout()
for w in iterate_level(iterator, RIL.BLOCK):
if w is not None:
results.append((w.BlockType(), w.BlockPolygon()))
print('Found {} block elements.'.format(len(results)))
draw = ImageDraw.Draw(img)
for block_type, poly in results:
# you can define a color per block type (see tesserocr.PT for block types list)
draw.line(poly + [poly[0]], fill=(0, 255, 0), width=2)

With Tesseract 4.0.0, a command like tesseract source/dir/myimage.tiff target/directory/basefilename hocr will create a basefilename.hocr file with block-, paragraph-, line-, and word-level bounding boxes for the OCR'ed text. Even the command without the hocr config creates a text file with newlines between block-level text, but the hocr format is more explicit.
More config options here: https://github.com/tesseract-ocr/tesseract/tree/master/tessdata/configs

Shortcut
It is also possible to open HOCR files directly with the PageViewer tool. The file extension has to be .xml, however.

The HOCR individual character step is now available in Tesseract since 4.1.
Once the installation check, use :
tesseract {image file} {output name} -c tessedit_create_hocr=1 -c hocr_char_boxes=1

Related

visualise html video with matplotlib animation

In my notebook I get some data from URL, perform some analysis and do some plotting.
I want also create a html animation using FuncAnimation of matplotlib.animation.
So in the preamble I do
import matplotlib.animation as manim
plt.rcParams["animation.html"] = "html5"
%matplotlib inline
(something else... def init()..., def animate(i)...) then
anima = manim.FuncAnimation(fig,
animate,
init_func=init,
frames=len(ypos)-d0,
interval=200,
repeat=False,
blit=True)
To visualise, I then call
FFMpegWriter = manim.writers['ffmpeg']
writer = FFMpegWriter(fps=15)
link = anima.to_html5_video()
from IPython.core.display import display, HTML
display(HTML(link))
because I want the clip to show up as a neat html video in the notebook
Whereas this works well on my machine, on Watson-Studio I get the following error:
RuntimeError: Requested MovieWriter (ffmpeg) not available
I've checked that ffmpeg is available in the form of a Python package
(!pip freeze --isolated | grep ffmpeg gives ffmpeg-python==0.2.0)
The question is: how can I tell matplotlib.animation.writers to use the codec in ffmpeg-python?
Many thanks to all responders and supporters
We currently don't have ffmpeg pre-installed in Watson Studio on Cloud. The package ffmpeg-python that you mention is just a Python wrapper, but it won't work without the actual ffmpeg.
You can install ffmpeg from conda:
!conda install ffmpeg
Once you have the full list of additional packages that your notebook needs, I recommend to create a custom environment. Then you don't have to put install commands into the actual notebook.
The customization might look like this:
dependencies:
- ffmpeg=4.2.2
- pip
- pip:
- ffmpeg-python==0.2.0

Is it possible to directly 'convert' a google drive image url in Imagemagick?

I have had a good search around for this, but cannot find a concrete answer. I have been trying to run the convert command against a url for an image I have stored on google drive, in a public folder.
I have been able to get this to work if I wget the url with -qO- and pipe this to convert- e.g.
wget 'https://drive.google.com/uc?export=download&id=1ydsWevwDxARqrabo5yZEYozez0eZK4K1' -qO- | convert - -resize 100x100 MGFIN01.png
Ideally I would prefer to be able to directly run the url through convert:
convert https://drive.google.com/uc?export=download&id=1ydsWevwDxARqrabo5yZEYozez0eZK4K1 -resize 100x100 MGFIN01.png
With the ultimate intention of creating an html image map like: http://www.imagemagick.org/Usage/montage/#html therefore requiring a list of urls and names (I can probably work this out once I have resolved the url part)
I am on Ubuntu with imagemagick 6.9. I see in delegates.xml that I have this:
<delegate decode="https" command=""curl" -s -k -L -o "%o" "https:%M""/>
Also tried the download with curl and options and that also worked.
This works for me on ImageMagick 6.9.10.97 Q16 Mac OSX. Put the URL in double quotes. You may also have to edit your policy.xml file to give permission for HTTPS. See policy.xml at https://imagemagick.org/script/resources.php
convert "https://drive.google.com/uc?export=download&id=1ydsWevwDxARqrabo5yZEYozez0eZK4K1" -resize 100x100 MGFIN01.png
Just to give a tidier response than possible in comments:
Open policy.xml
sudo nano /etc/ImageMagick-6/policy.xml
Scroll down to find:
<policy domain="delegate" rights="none" pattern="HTTPS" />
Edit this to show:
<policy domain="delegate" rights="read" pattern="https" />
Save (CTRL+X, Y)
Run convert command again. Tada.

OCR detecting E as £

I am using pytesseract (version 5 of tesseract) to scan an image. I have changed image to black and white to remove the noise but still E is being detected as £196893 .
Also tried setting the language, dpi and psm values which has been suggested by most of people. Below are the settings I am using now. Please suggest.
pytesseract.image_to_string(Image.open(impath), config=" --dpi 120 --psm 6 -l eng")
Once of sample picture is shown below. For some samples it is working fine but for some samples it is giving such strange characters.
A solution to overcome this issue is to limit the characters that Tesseract looks for. To do so you must:
Create a file with arbitrary name (i.e. "whitelist") in tesseract config directory. In linux that directory is usually placed in /usr/share/tesseract/tessdata/configs.
Adding a line in that file containing only the characters that are you want to search in text:
tessedit_char_whitelist *list_of_characters*
Then call your script using the whitelist vocabulary:
tesseract input.tif output nobatch whitelist
In this case the parameters must be setted in your Python script as:
pytesseract.image_to_string(Image.open(impath), config=" --dpi 120 --psm 6 -l nobatch whitelist")

getting HTML source or rich text from the X clipboard

How can rich text or HTML source code be obtained from the X clipboard? For example, if you copy some text from a web browser and paste it into kompozer, it pastes as HTML, with links etc. preserved. However, xclip -o for the same selection just outputs plain text, reformatted in a way similar to that of elinks -dump. I'd like to pull the HTML out and into a text editor (specifically vim).
I asked the same question on superuser.com, because I was hoping there was a utility to do this, but I didn't get any informative responses. The X clipboard API is to me yet a mysterious beast; any tips on hacking something up to pull this information are most welcome. My language of choice these days is Python, but pretty much anything is okay.
To complement #rkhayrov's answer, there exists a command for that already: xclip. Or more exactly, there's a patch to xclip which was added to xclip later on in 2010, but hasn't been released yet that does that. So, assuming your OS like Debian ships with the subversion head of xclip (2019 edit: version 0.13 with those changes was eventually released in 2016 (and pulled into Debian in January 2019)):
To list the targets for the CLIPBOARD selection:
$ xclip -selection clipboard -o -t TARGETS
TIMESTAMP
TARGETS
MULTIPLE
SAVE_TARGETS
text/html
text/_moz_htmlcontext
text/_moz_htmlinfo
UTF8_STRING
COMPOUND_TEXT
TEXT
STRING
text/x-moz-url-priv
To select a particular target:
$ xclip -selection clipboard -o -t text/html
 rkhayrov
$ xclip -selection clipboard -o -t UTF8_STRING
rkhayrov
$ xclip -selection clipboard -o -t TIMESTAMP
684176350
And xclip can also set and own a selection (-i instead of -o).
In X11 you have to communicate with the selection owner, ask about supported formats, and then request data in the specific format. I think the easiest way to do this is using existing windowing toolkits. E,g. with Python and GTK:
#!/usr/bin/python
import glib, gtk
def test_clipboard():
clipboard = gtk.Clipboard()
targets = clipboard.wait_for_targets()
print "Targets available:", ", ".join(map(str, targets))
for target in targets:
print "Trying '%s'..." % str(target)
contents = clipboard.wait_for_contents(target)
if contents:
print contents.data
def main():
mainloop = glib.MainLoop()
def cb():
test_clipboard()
mainloop.quit()
glib.idle_add(cb)
mainloop.run()
if __name__ == "__main__":
main()
Output will look like this:
$ ./clipboard.py
Targets available: TIMESTAMP, TARGETS, MULTIPLE, text/html, text/_moz_htmlcontext, text/_moz_htmlinfo, UTF8_STRING, COMPOUND_TEXT, TEXT, STRING, text/x-moz-url-priv
...
Trying 'text/html'...
I asked the same question on superuser.com, because I was hoping there was a utility to do this, but I didn't get any informative responses.
Trying 'text/_moz_htmlcontext'...
<html><body class="question-page"><div class="container"><div id="content"><div id="mainbar"><div id="question"><table><tbody><tr><td class="postcell"><div><div class="post-text"><p></p></div></div></td></tr></tbody></table></div></div></div></div></body></html>
...
Trying 'STRING'...
I asked the same question on superuser.com, because I was hoping there was a utility to do this, but I didn't get any informative responses.
Trying 'text/x-moz-url-priv'...
http://stackoverflow.com/questions/3261379/getting-html-source-or-rich-text-from-the-x-clipboard
Extending the ideas from Stephane Chazelas, you can:
Copy from the formatted source.
Run this command to extract from the clipboard, convert to HTML, and then (with a pipe |) put that HTML back in the clipboard, again using the same xclip:
xclip -selection clipboard -o -t text/html | xclip -selection clipboard
Next, when you paste with Ctrl+v, it will paste the HTML source.
Going further, you can make it a shortcut, so that you don't have to open the terminal and run the exact command each time. ✨
To do that:
Open the settings for your OS (in my case it's Ubuntu)
Find the section for the Keyboard
Then find the section for shortcuts
Create a new shortcut
Set a Name, e.g.: Copy as HTML
Then as the command for the shortcut, put:
bash -c "xclip -selection clipboard -o -t text/html | xclip -selection clipboard"
Note: notice that it's the same command as above, but put inside of an inline Bash script. This is necessary to be able to use the | (pipe) to send the output from one command as input to the next.
Set the shortcut to whatever combination you want, preferably not overwriting another shortcut you use. In my case, I set it to: Ctrl+Shift+c
After this, you can copy some formatted text as normally with: Ctrl+c
And then, before pasting it, convert it to HTML with: Ctrl+Shift+c
Next, when you paste it with: Ctrl+v, it will paste the contents as HTML. 🧙✨

Fraktur recognition with OCRopus/Tesseract on Linux

I am trying to perform recognition of a german text with fraktur typeface with ocropus but It doesn't seem to be using deu-f package.
Here are the steps I performed.
Compiled and installed tesseract and ocropus.
Downloaded http://tesseract-ocr.googlecode.com/files/tesseract-2.01.deu-f.tar.gz, unpacked it to tessdata/.
But when I call
$ ocroscript recognize --tessLanguage=deu-f --output-mode=text image.png
the results are the same as when I call
$ ocroscript recognize --tessLanguage=eng --output-mode=text image.png
Any ideas what the problem is?
The problem is described in http://code.google.com/p/ocropus/issues/detail?id=87. Just need to apply the patch to ocropus and rebuild it.