How to convert TIFF to searchable PDF using Alfresco and Tesseract? - integration

I want to convert *.PDF files to searchable *.PDF files using Alfresco and Tesseract OCR.
Tesseract version 3.03 needs to be compiled from source, and I would need to build an installer from that source code. Is there any other solution?
Can anyone help with this?

You'll need Tesseract 3.03 or later for the searchable PDF output feature:
tesseract yourimage.tif out pdf
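If the conversion has to be triggered from a script rather than typed by hand, the command above can be wrapped as in the following minimal Python sketch. The function names are illustrative assumptions and are not part of Alfresco or Tesseract; the only Tesseract behavior relied on is the `tesseract <image> <outputbase> pdf` form shown above, where Tesseract itself appends `.pdf` to the output base.

```python
import subprocess
from pathlib import Path


def build_tesseract_pdf_command(image_path, output_base=None):
    """Return the argument list for `tesseract <image> <outputbase> pdf`."""
    image = Path(image_path)
    if output_base is None:
        # Assumption: write the PDF next to the image, same name minus suffix.
        output_base = image.with_suffix("")
    return ["tesseract", str(image), str(output_base), "pdf"]


def convert_to_searchable_pdf(image_path, output_base=None):
    """Run Tesseract (3.03 or later must be on PATH) and return the PDF path."""
    command = build_tesseract_pdf_command(image_path, output_base)
    # check=True raises CalledProcessError if Tesseract exits with an error.
    subprocess.run(command, check=True)
    return Path(command[2] + ".pdf")
```

Calling `convert_to_searchable_pdf("yourimage.tif", "out")` runs the same command as above and returns the path `out.pdf`.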

You can use another tool that converts a PDF directly to a searchable PDF. It uses Tesseract internally for the conversion. You can find more details at the link below and configure it for Alfresco.
http://ubuntuforums.org/showthread.php?t=1456756
Command:
pdfocr -i input.pdf -o output.pdf

Related

Convert and parse digitally created PDF file to tesseract tsv file directly from PDF source code

I am trying to convert and parse a digitally created PDF file into a Tesseract TSV file, and to use this as ground truth for testing Tesseract OCR performance on my PDF to TIF to TSV pipeline.
Any idea how I can achieve this task, i.e. convert and parse a digitally created PDF file into a Tesseract TSV file, without OCR or anything?
So far, I can use packages such as fitz (PyMuPDF) and pdfminer to extract text. PyMuPDF can give me sentences with their bounding-box location in the PDF file (x_1,y_1,x_2,y_2). However, I don't yet see any way to turn this into Tesseract-like TSV output. See https://blog.tomrochette.com/tesseract-tsv-format.
Any advice would be greatly appreciated. Thanks! :)
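One possible direction, offered as a rough sketch rather than a verified answer: PyMuPDF can return word-level tuples of the form (x0, y0, x1, y1, text, block_no, line_no, word_no), and those can be mapped onto Tesseract's TSV columns. The code below only emits word rows (level 5) and makes several assumptions that are labeled in the comments: the rendering DPI, a fixed paragraph number, and a confidence of 100 for ground truth. The function name `words_to_tsv` is hypothetical.

```python
TSV_HEADER = ["level", "page_num", "block_num", "par_num", "line_num",
              "word_num", "left", "top", "width", "height", "conf", "text"]


def words_to_tsv(words, page_num=1, dpi=300):
    """Convert PyMuPDF-style word tuples into Tesseract-like TSV text.

    words: iterable of (x0, y0, x1, y1, text, block_no, line_no, word_no),
    with coordinates in PDF points (1/72 inch).
    """
    rows = ["\t".join(TSV_HEADER)]
    for x0, y0, x1, y1, text, block_no, line_no, word_no in words:
        # Assumption: the TIF was rendered at `dpi`, so points scale by dpi/72.
        left = round(x0 * dpi / 72)
        top = round(y0 * dpi / 72)
        width = round((x1 - x0) * dpi / 72)
        height = round((y1 - y0) * dpi / 72)
        fields = [
            5,             # level 5 = word; page/block/line rows are not emitted
            page_num,
            block_no + 1,  # PyMuPDF counts from 0, Tesseract from 1
            1,             # assumption: no paragraph info, so always paragraph 1
            line_no + 1,
            word_no + 1,
            left, top, width, height,
            100,           # assumption: ground truth gets full confidence
            text,
        ]
        rows.append("\t".join(str(field) for field in fields))
    return "\n".join(rows) + "\n"
```

With PyMuPDF installed, the word tuples for a page would come from `page.get_text("words")` on a page of a document opened with `fitz.open(...)`. Block and line numbering will generally not match Tesseract's own layout analysis, so a comparison probably has to match words by text and box overlap rather than by these indices.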

Convert a Markdown text file into a Google Document using Appscript?

I am trying to migrate a load of documentation which was written in markdown into a Google Doc so it can be used by our marketing department.
Is there a mechanism using Apps Script or the Google Docs API by which I can import a file and convert it to a Google Doc using a predefined template?
E.g. H1s will map to Title, etc.
One suggestion: use Pandoc to convert Markdown to docx, then import to Google Docs using the Google Drive API.
You can also accomplish this using the Google Drive web interface:
Convert markdown to ODT (or some other intermediate) using pandoc: pandoc MyFile.md -f markdown -t odt -s -o MyFile.odt
Move the ODT file into your Google Drive folder.
Right-click the ODT file (in the web version of Drive) and press "Open With -> Google Docs".
Google Drive will now create a new Google Doc, with contents matching the ODT file, and open it for you.
No add-ons needed
I just stumbled upon a really simple approach that may suit your needs.
On GitHub, I opened a ReadMe.md file, which the browser rendered as rich text. I copied it from the browser and pasted it into a new Google Doc.
Voila: Google Docs preserved the headings, bullets, etc. I can't vouch for what it does with links and other fancier markdown ops, but it was a quick way to get started.
Easiest way that most likely requires no new tooling for many developers (if you're willing to do some manual pasting):
Just use Visual Studio Code.
Paste text into the editor.
At bottom right, select markdown as language.
Top right, click the preview button.
This will split the screen and show the rendered markdown, which you can paste into Google Docs.
It's pretty quick at this point to just keep pasting markdown, copying the result, and repeating, as long as you don't have hundreds of docs.
Of course, if it's a ton of docs, you'll want something more automated than this.
One variation of the suggestion to use pandoc: try using the docx format instead of odt. Google Docs handles MS Office files natively so I found formatting was preserved somewhat better using this approach.
Revised steps:
Convert markdown to DOCX using pandoc: pandoc MyFile.md -f markdown -t docx -s -o MyFile.docx
Upload MyFile.docx into your Google Drive folder
Double-click MyFile.docx in the web version of Drive
Google Drive will open MyFile.docx in a new tab
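For more than a handful of files, the pandoc step above can be scripted. The following is a minimal Python sketch that builds the same `pandoc ... -f markdown -t docx -s -o ...` command for each file; the function names and the `converted` output directory are assumptions made for illustration, and the upload to Google Drive is still done by hand as described in the steps.

```python
import subprocess
from pathlib import Path


def build_pandoc_command(markdown_path, output_dir="converted"):
    """Return the pandoc argument list that converts one Markdown file to DOCX."""
    source = Path(markdown_path)
    target = Path(output_dir) / source.with_suffix(".docx").name
    return ["pandoc", str(source), "-f", "markdown", "-t", "docx",
            "-s", "-o", str(target)]


def convert_directory(root=".", output_dir="converted"):
    """Convert every *.md file directly inside `root`; pandoc must be on PATH."""
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    written = []
    for source in sorted(Path(root).glob("*.md")):
        command = build_pandoc_command(source, output_dir)
        subprocess.run(command, check=True)  # raises if pandoc fails
        written.append(Path(command[-1]))
    return written
```

The sketch deliberately handles only one directory level, so two files with the same name in different subdirectories cannot overwrite each other.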
I don't know of a tool or library that allows for a direct conversion from markdown to a google doc.
Maybe you can convert your markdown to an intermediary format compatible with Google Docs (viable formats include .docx, .docm, .dot, .dotx, .dotm, .html, plain text (.txt), .rtf and .odt) and then go from there.
You just need to find a tool that can convert markdown to one of those formats and also process your files in bulk (maybe some command-line utility could help with that).
There is the Google Docs Addon Markdown to Docs ... which converts Markdown to Google Docs.
It has its limitations due to the Gdoc format but works otherwise very well.
#!/bin/bash
# Create the "converted" directory if it doesn't already exist
mkdir -p "converted"
# Find all markdown files in the current directory and its subdirectories,
# skipping anything that is already under "converted"
find . -name "*.md" ! -path "./converted/*" | while IFS= read -r filename; do
    # Create the same directory structure under "converted" as the original file
    dir=$(dirname "$filename")
    mkdir -p "converted/$dir"
    # Use pandoc to convert the file to a .docx file inside "converted"
    base=$(basename "${filename%.*}")
    pandoc "$filename" -o "converted/$dir/$base.docx"
done

How to create a uzn file for tesseract

I need to build an OCR application that scans passports, so I have chosen Tesseract to start with. From what I have read, there should be a .uzn file that I define, but I can't find any documentation on it. How can I create such a template for Tesseract to use?
You can either use a uzn file or let Tesseract do the segmentation itself.
Check out the following link if you need more information about the uzn file format:
https://github.com/OpenGreekAndLatin/greek-dev/wiki/uzn-format
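As a rough illustration of what such a template can look like: to my understanding of the format described at that link, a uzn file is plain text with one zone per line, written as `left top width height label` in pixels, and it is stored next to the image under the same base name with a `.uzn` extension. The Python sketch below writes such a file. The function names, zone labels, and coordinates are hypothetical and would have to be measured on real passport scans.

```python
from pathlib import Path


def format_uzn(zones):
    """Format zones as uzn text, one `left top width height label` per line.

    zones: iterable of (left, top, width, height, label) in image pixels.
    """
    lines = []
    for left, top, width, height, label in zones:
        # Fields are space-separated, so spaces inside a label are replaced.
        safe_label = str(label).replace(" ", "_")
        lines.append(f"{int(left)} {int(top)} {int(width)} {int(height)} {safe_label}")
    return "\n".join(lines) + "\n"


def write_uzn_for_image(image_path, zones):
    """Write the zones to a .uzn file that shares the image's base name."""
    uzn_path = Path(image_path).with_suffix(".uzn")
    uzn_path.write_text(format_uzn(zones), encoding="ascii")
    return uzn_path
```

For example, `write_uzn_for_image("passport.tif", [(40, 520, 920, 110, "MRZ")])` would create `passport.uzn` with a single zone. Tesseract is commonly reported to read the zones only when automatic page segmentation is off (for example page segmentation mode 4), so treat that detail as something to verify against your Tesseract version.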

How to convert html string(or file) to pdf file using wkhtmltopdf c library- steps?

I am looking for a C/C++ library that can convert an HTML string or file into a PDF. Researching StackOverflow led me to the wkhtmltopdf C library. I downloaded the zip from http://wkhtmltopdf.org/downloads.html. When I run wkhtmltopdf from the command line, it works fine and converts the HTML file into a PDF output file. But my requirement is to convert an HTML file or, preferably, an HTML string into a PDF file programmatically from a C++ program. I do not want to use the command line (or mimic it with something like the UNIX "system" call). I am using Windows. Could anyone please explain how to achieve this using wkhtmltopdf? Thank you in advance.

Use text from CSV in Imagemagick

I have a CSV with information relating to images, such as name etc. Is there a way I can use the CSV file information with ImageMagick (via PHP on Windows) to add it to the images?
Sure. Check out http://php.net/manual/en/function.fgetcsv.php to get the CSV information into the script. I'd assume you are using the PHP ImageMagick extension, so just run the commands against the given image, loading each image from the path stored in the CSV. If you are not using the extension in PHP, you could try running ImageMagick through the exec PHP function: http://php.net/manual/en/function.exec.php
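The same read-the-CSV-then-call-ImageMagick flow can be sketched outside PHP. The Python example below only builds the command lines, one per CSV row, so they can be inspected before anything is executed. The column names `filename` and `name`, the `annotated_` output prefix, and the text placement options are assumptions for illustration, and `magick` assumes the ImageMagick 7 command-line tool is installed.

```python
import csv
import io


def build_annotate_commands(csv_text, file_column="filename", text_column="name"):
    """Return one ImageMagick argument list per CSV row.

    Each command draws the row's text near the bottom edge of the image.
    """
    commands = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        source = row[file_column]
        target = "annotated_" + source  # hypothetical output naming scheme
        commands.append([
            "magick", source,
            "-gravity", "south",     # anchor the text at the bottom centre
            "-pointsize", "24",
            "-fill", "white",
            "-annotate", "+0+10", row[text_column],
            target,
        ])
    return commands
```

Each returned list can be passed to `subprocess.run(command, check=True)`, which is roughly what the exec approach does from PHP.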