Detect subtitle from an image - ocr

I need a tailored approach for detecting only the subtitle in an image: perhaps some image-processing steps that would let me extract the characters correctly from the processed image (with Tesseract, for example).

Why not crop the bottom of the image and then apply Tesseract to that region?
In bash on Linux I would put the following in a script and apply it to all images (with xargs, for example):
# filenames
input="$1"
extension="${input##*.}"
nomfich=$(basename "$input" ".$extension")
interm="$nomfich.tiff"
# convert to tiff and crop
convert -gravity South -crop 100%x15%+0+0 -density 300 "$input" "$interm"
# ocr
tesseract "$interm" "$nomfich"
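To run the script over a whole directory, a find/xargs driver along these lines should work (the script name crop_ocr.sh is an assumption; save the script above under that name and make it executable):

```shell
# Run the crop-and-OCR script on every PNG in the current directory,
# one file per invocation; -print0/-0 keeps filenames with spaces safe.
find . -maxdepth 1 -name '*.png' -print0 | xargs -0 -n1 ./crop_ocr.sh
```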

Related

Get wkhtmltopdf to use high resolution image from srcset

I'm using wkhtmltopdf to translate a webpage to a PDF document, but some of the images in the result file are a bit low resolution.
The source webpage actually has the srcset attribute used to provide higher resolution versions for higher pixel density displays. However wkhtmltopdf doesn't appear to be using them.
I figured as this is a WebKit based tool, and WebKit definitely supports this attribute, there might be something I could set to make WebKit use the highest resolution version available.
Edit: while I'm waiting to figure out a better way of doing this, I have managed to preprocess the HTML successfully with xmlstarlet, by stripping all but the last URL out of the srcset, renaming that attribute to src, and removing the last-but-one src attribute (the HTML is generated, so all images follow an identical format in their use of src / srcset).
xmlstarlet ed -P \
--update "//img/@srcset" \
-x "substring-before(substring-after(.,', '),' ')" \
--rename "//img/@srcset" -v "src" \
--delete "//img/@src[position()=last()-1]" file.html
A useful workaround until I find a more elegant solution!
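If xmlstarlet isn't available, the same "keep the last srcset candidate" idea can be sketched in plain Python with the standard library; the function names and the regex-based rewriting here are illustrative, not a general HTML parser:

```python
import re

def last_srcset_url(srcset):
    """Return the URL of the last (usually highest-resolution) candidate
    in a srcset value like 'small.jpg 1x, large.jpg 2x'."""
    candidates = [c.strip() for c in srcset.split(",") if c.strip()]
    # Each candidate is "URL [descriptor]"; keep only the URL of the last one.
    return candidates[-1].split()[0]

def promote_srcset(html):
    """Rewrite <img ... srcset=...> so that src points at the last
    srcset candidate and the srcset attribute is dropped."""
    def repl(match):
        url = last_srcset_url(match.group("srcset"))
        return '<img src="%s">' % url
    return re.sub(r'<img[^>]*srcset="(?P<srcset>[^"]*)"[^>]*>', repl, html)

print(promote_srcset('<img src="small.jpg" srcset="small.jpg 1x, large.jpg 2x">'))
# prints: <img src="large.jpg">
```

This only works because the generated HTML is uniform, as noted above; for arbitrary markup a real parser would be needed.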

Markup font-style (italic) in tesseract OCR

Have tesseract-ocr v3.02.02 installed on Windows 7, and have used it via the command line:
1) Output png text to a text file: tesseract image.png txtfile
2) Output png text to a html file: tesseract image.png htmlfile hocr
I need it to be able to mark up any italic text in the output text or HTML file. How do I do this (preferably on the command line; I've never used it in API mode)?
The hOCR output by Tesseract includes only the word coordinates and confidence values, not font-related information. As such, you will need to modify the source code to output what you want in command-line mode, or use its API.
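For illustration, a word in Tesseract 3.x hOCR output looks roughly like the fragment below (the coordinates and confidence are invented). Parsing it with the standard library shows that the title attribute carries only a bounding box and a word confidence, nothing about the font:

```python
from html.parser import HTMLParser

# Made-up hOCR fragment in the shape Tesseract 3.x emits: the title
# attribute holds the bounding box and word confidence, no font style.
HOCR = "<span class='ocrx_word' id='word_1' title='bbox 10 5 80 30; x_wconf 91'>Hello</span>"

class TitleGrabber(HTMLParser):
    """Collect the title attribute of every tag that has one."""
    def __init__(self):
        super().__init__()
        self.titles = []
    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "title":
                self.titles.append(value)

parser = TitleGrabber()
parser.feed(HOCR)
print(parser.titles[0])  # -> bbox 10 5 80 30; x_wconf 91
```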

HTML file to screenshot as .jpg or other image

This has nothing to do with rendering an individual image on a webpage. The goal is to render the entire webpage and save that as a screenshot, so I can show a thumbnail of an HTML file to the user. The HTML file I will be screenshotting is the HTML part of a MIME email message - ideally I would like to snapshot the entire MIME file, but if I can do this to an HTML file I'll be in good shape.
An API would be ideal but an executable is also good.
You need html2ps, and convert from the ImageMagick package:
html2ps index.html index.ps
convert index.ps index.png
The second program produces one PNG per page for a long HTML page - the page layout was done by html2ps.
I found a program evince-thumbnailer, which was reported as:
apropos postscript | grep -i png
evince-thumbnailer (1) - create png thumbnails from PostScript and PDF documents
but it didn't work on a simple first test.
If you like to combine multiple pages to a larger image, convert will help you surely.
Now I see that convert operates on HTML directly, so
convert index.html index.png
should work too. I don't see a difference in the output, and the sizes of the images are nearly identical.
If you have a multipart MIME email, you typically have a mail header, maybe some pre-HTML text, the HTML, and maybe attachments.
You can extract the HTML and format it separately - but rendering it embedded might not be that easy.
Here is a file I tested, which was from Apr. 14, so I extract the one mail from the mailfolder:
sed -n "/From - Sat Apr 14/,/From -/p" /home/stefan/.mozilla-thunderbird/k2jbztqu.default/Mail/Local\ Folders-1/Archives.sbd/sample | \
sed -n '/<html>/,/<\/html>/p' | wkhtmltopdf - - > sample.pdf
The second sed then extracts just the HTML part of it.
wkhtmltopdf needs the two dashes - - to read from stdin and write to stdout. The PDF is rendered, but I don't know how to integrate it into your workflow.
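A less fragile alternative to the sed pipeline is Python's standard library email module, which walks the MIME parts properly instead of pattern-matching on tags. A minimal sketch, with the message built inline to stand in for the real mail file:

```python
import email
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

# Build a small multipart message standing in for the real mail file.
msg = MIMEMultipart("alternative")
msg["Subject"] = "sample"
msg.attach(MIMEText("plain body", "plain"))
msg.attach(MIMEText("<html><body><p>Hi</p></body></html>", "html"))

# Parse it back and pull out the first text/html part, which is what
# the sed range /<html>/,/<\/html>/ was approximating.
parsed = email.message_from_string(msg.as_string())
html = next(
    part.get_payload(decode=True).decode(part.get_content_charset() or "utf-8")
    for part in parsed.walk()
    if part.get_content_type() == "text/html")
print(html)  # contains <p>Hi</p>
```

The extracted string can then be piped into wkhtmltopdf or convert exactly as above.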
You can replace wkhtml ... with
convert - sample.jpg
I'm going with wkhtmltoimage. This worked once xvfb was correctly set up. The PostScript suggestion did not render correctly, and we need an image, not a PDF.

Tesseract does not recognize single characters

How to reproduce:
Create new image with paint (any size)
Add letter A to this image
Try to recognize -> tesseract will not find any letters
Copy-paste this letter 5-6 times to this image
Try to recognize -> tesseract will find all the letters
Why?
You must set the "page segmentation mode" to "single char".
For example, in Android you do the following:
api.setPageSegMode(TessBaseAPI.PageSegMode.PSM_SINGLE_CHAR);
Python code to do that configuration looks like this:
import cv2
import pytesseract

img = cv2.imread("path to some image")
# --psm 10: treat the image as a single character
text = pytesseract.image_to_string(
    img,
    config=("--psm 10"
            " -c tessedit_char_whitelist="
            "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"))
The --psm flag defines the page segmentation mode. According to the Tesseract documentation, 10 means:
Treat the image as a single character.
So to recognize a single character you just need to use the --psm 10 flag.
You need to set Tesseract's page segmentation mode to "single character."
Have you seen this?
https://code.google.com/p/tesseract-ocr/issues/detail?id=581
The bug list shows it as "no longer an issue".
Be sure to have high resolution images.
If you are resizing the image, be sure to keep a high DPI and don't resize too small
Be sure to train your tesseract system
Call baseApi.setVariable("tessedit_char_whitelist", "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"); before initializing Tesseract
Also, you may look into which font to use with OCR

Html to ansi colored terminal text

I am on Linux and I want to fetch an HTML page from the web and then output it on the terminal. I found that html2text essentially does the job, but it converts my HTML to plain text, whereas I would rather convert it into ANSI-colored text in the spirit of ls --color=auto. Any ideas?
The elinks browser can do that. Other text browsers such as lynx or w3m might be able to do that as well.
elinks -dump -dump-color-mode 1 http://example.com/
The above example renders a text version of http://example.com/ using 16 colors. The output format can be customized further as needed.
The -dump option enables the dump mode, which just prints the whole page as text, with the link destinations printed out in a kind of "email-style".
-dump-color-mode 1 enables coloring of the output using the 16 basic terminal colors. Depending on the value and the capabilities of the terminal emulator, this can go up to ~16 million colors (true color). The values are documented in elinks.conf(5).
The colors used for output can be configured as well, which is documented in elinks.conf(5) as well.
The w3m browser supports coloring the output text.
You can use the lynx browser to output the text with this command:
lynx -dump http://example.com
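If the text browsers don't produce the colors you want, a tiny DIY converter can be sketched with Python's standard library html.parser, mapping a few tags to ANSI escape codes; the tag-to-color mapping below is invented for illustration:

```python
from html.parser import HTMLParser

# Minimal HTML -> ANSI converter: bold for <b>/<strong>, blue for <a>.
ANSI = {"b": "\033[1m", "strong": "\033[1m", "a": "\033[34m"}
RESET = "\033[0m"

class AnsiRenderer(HTMLParser):
    """Emit text content, wrapping mapped tags in ANSI escape codes."""
    def __init__(self):
        super().__init__()
        self.out = []
    def handle_starttag(self, tag, attrs):
        if tag in ANSI:
            self.out.append(ANSI[tag])
    def handle_endtag(self, tag):
        if tag in ANSI:
            self.out.append(RESET)
    def handle_data(self, data):
        self.out.append(data)

def html_to_ansi(html):
    renderer = AnsiRenderer()
    renderer.feed(html)
    return "".join(renderer.out)

print(html_to_ansi("visit <a href='http://example.com/'>example</a>"))
```

This obviously handles none of the layout work a real browser does; it is only meant to show where the escape codes would go.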