I am trying to convert a digitally created PDF file into a Tesseract TSV file, so that I can use it as ground truth when testing Tesseract OCR performance on my PDF to TIF to TSV pipeline.
Any idea how I can achieve this, i.e. convert and parse a digitally created PDF file into a Tesseract TSV file without running OCR?
So far, I can use packages such as PyMuPDF (fitz) or pdfminer to extract text. PyMuPDF can give me sentences with their bounding boxes in the PDF file (x_1, y_1, x_2, y_2). However, I don't yet see any way to turn that into Tesseract-like TSV output. See https://blog.tomrochette.com/tesseract-tsv-format.
Any advice would be greatly appreciated. Thanks! :)
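One possible direction, as a minimal sketch rather than a definitive answer: PyMuPDF's page.get_text("words") returns one tuple per word, (x0, y0, x1, y1, text, block_no, line_no, word_no), in PDF points with a top-left origin. The function below maps those tuples onto the twelve Tesseract TSV columns. Several things here are assumptions: the names words_to_tsv and pdf_to_tsv are made up for illustration, only word-level rows (level 5) are written, par_num is fixed at 1 because PyMuPDF does not report paragraphs, conf is set to 100 as a ground-truth placeholder, and dpi must match the resolution used when rendering the TIF.

```python
TSV_COLUMNS = ["level", "page_num", "block_num", "par_num", "line_num",
               "word_num", "left", "top", "width", "height", "conf", "text"]


def words_to_tsv(words, page_num=1, dpi=300):
    """Turn PyMuPDF-style word tuples into word-level, Tesseract-like TSV text.

    Each item in `words` is (x0, y0, x1, y1, text, block_no, line_no, word_no)
    with coordinates in PDF points (1/72 inch).
    """
    scale = dpi / 72.0  # points -> pixels at the DPI used for the TIF
    lines = ["\t".join(TSV_COLUMNS)]
    for x0, y0, x1, y1, text, block_no, line_no, word_no in words:
        left, top = round(x0 * scale), round(y0 * scale)
        width, height = round((x1 - x0) * scale), round((y1 - y0) * scale)
        # Assumptions: level 5 = word row, par_num fixed at 1, conf fixed at 100.
        # PyMuPDF numbers blocks/lines/words from 0, Tesseract from 1.
        row = [5, page_num, block_no + 1, 1, line_no + 1, word_no + 1,
               left, top, width, height, 100, text]
        lines.append("\t".join(str(value) for value in row))
    return "\n".join(lines) + "\n"


def pdf_to_tsv(pdf_path, dpi=300):
    """Return one TSV string per page of the PDF (requires PyMuPDF)."""
    import fitz  # PyMuPDF; imported here so words_to_tsv works without it

    pages = []
    with fitz.open(pdf_path) as doc:
        for page_index, page in enumerate(doc):
            words = page.get_text("words")
            pages.append(words_to_tsv(words, page_index + 1, dpi))
    return pages
```

Real Tesseract output also contains page, block, paragraph and line rows (levels 1 to 4), so a comparison against OCR output would probably need to filter both sides down to the level 5 rows first.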
I am trying to migrate a load of documentation which was written in markdown into a Google Doc so it can be used by our marketing department.
Is there a mechanism, using Apps Script or the Google Docs API, that lets me import a file and convert it to a Google Doc using a predefined template?
E.g. H1s would map to Title, etc.
One suggestion: use Pandoc to convert Markdown to docx, then import to Google Docs using the Google Drive API.
You can also accomplish this using the Google Drive web interface:
Convert markdown to ODT (or some other intermediate) using pandoc: pandoc MyFile.md -f markdown -t odt -s -o MyFile.odt
Move the ODT file into your Google Drive folder.
Right-click the ODT file (in the web version of Drive) and press "Open With -> Google Docs".
Google Drive will now create a new Google Doc, with contents matching the ODT file, and open it for you.
No add-ons are needed.
I just stumbled upon a really simple approach that may suit your needs.
On GitHub, I opened a ReadMe.md file, which the browser rendered as rich text. I copied it from the browser and pasted it into a new Google Doc.
Voila: Google Docs preserved the headings, bullets, etc. I can't vouch for what it does with links and other fancier Markdown features, but it was a quick way to get started.
Easiest way that most likely requires no new tooling for many developers (if you're willing to do some manual pasting):
Just use Visual Studio Code.
Paste text into the editor.
At bottom right, select markdown as language.
Top right, click the preview button.
This will split the screen and show the rendered markdown, which you can copy and paste into Google Docs.
From here it's pretty quick to repeat the cycle (paste markdown, copy the rendered result), as long as you don't have hundreds of docs.
Of course, if it's a ton of docs, you'll want something more automated than this.
One variation of the suggestion to use pandoc: try using the docx format instead of odt. Google Docs handles MS Office files natively so I found formatting was preserved somewhat better using this approach.
Revised steps:
Convert markdown to DOCX using pandoc: pandoc MyFile.md -f markdown -t docx -s -o MyFile.docx
Upload MyFile.docx into your Google Drive folder
Double-click MyFile.docx in the web version of Drive
Google Drive will open MyFile.docx in a new tab
I don't know of a tool or library that allows for a direct conversion from markdown to a google doc.
Maybe you can convert your markdown to an intermediary format compatible with Google Docs (viable formats include .docx, .docm, .dot, .dotx, .dotm, .html, plain text (.txt), .rtf and .odt) and then go from there.
You just need to find a tool that can convert markdown to one of those formats and also process your files in bulk (maybe some command-line utility could help with that).
There is the Google Docs Addon Markdown to Docs ... which converts Markdown to Google Docs.
It has some limitations due to the Google Docs format, but otherwise works very well.
#!/bin/bash
# Create the "converted" directory if it doesn't already exist
mkdir -p "converted"
# Find all markdown files in the current directory and its subdirectories,
# skipping anything already under "converted" so reruns don't nest output.
# -print0 with read -d '' keeps filenames containing spaces intact.
find . -name "*.md" -not -path "./converted/*" -print0 | while IFS= read -r -d '' filename; do
    # Use pandoc to convert the file to a .docx file
    pandoc "$filename" -o "${filename%.*}.docx"
    # Create the same directory structure under "converted" as the original file
    dir=$(dirname "$filename")
    mkdir -p "converted/$dir"
    # Move the converted file to the "converted" directory
    mv "${filename%.*}.docx" "converted/$dir"
done
I need to build an OCR application that scans passports, so I have chosen Tesseract to start with. From what I have read, I should define a .uzn file, but I can't find any documentation on it. How can I create such a template for Tesseract to use?
You can either use a uzn file or let Tesseract do the segmentation itself.
Check out the following link if you need more information about the uzn file format:
https://github.com/OpenGreekAndLatin/greek-dev/wiki/uzn-format
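As a rough sketch of what such a template could look like: as far as I understand the format, each line of a uzn file describes one zone as "left top width height label" in pixels, and Tesseract looks for a .uzn file with the same base name as the image when run with a page segmentation mode that does no automatic layout analysis (for example --psm 4). Treat both points as assumptions to verify against the page linked above. The helper names and the example zone are hypothetical.

```python
from pathlib import Path


def format_uzn(zones):
    """Build uzn text from (left, top, width, height, label) tuples in pixels."""
    lines = []
    for left, top, width, height, label in zones:
        if width <= 0 or height <= 0:
            raise ValueError(f"zone {label!r} must have positive width and height")
        # Assumption: labels should not contain spaces, so replace them.
        safe_label = str(label).replace(" ", "_")
        lines.append(f"{left} {top} {width} {height} {safe_label}")
    return "\n".join(lines) + "\n"


def write_uzn(image_path, zones):
    """Write the zones next to the image, e.g. passport.tif -> passport.uzn."""
    uzn_path = Path(image_path).with_suffix(".uzn")
    uzn_path.write_text(format_uzn(zones), encoding="ascii")
    return uzn_path
```

For a passport, a call might look like write_uzn("passport.tif", [(40, 620, 900, 120, "MRZ")]), where the coordinates are placeholders that would need to be measured on your own scans.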
I am looking for a C/C++ library that can convert an HTML string or file into a PDF. Researching Stack Overflow led me to the wkhtmltopdf C library. I downloaded the zip from http://wkhtmltopdf.org/downloads.html. When I run wkhtmltopdf from the command line, it works fine and converts the HTML file into a PDF output file. But my requirement is to convert an HTML file, or preferably an HTML string, into a PDF file programmatically from a C++ program. I do not want to use it as a command-line tool (or mimic that with something like the UNIX "system" call). I am using Windows. Could anyone please help me achieve this using wkhtmltopdf? Thank you in advance.
I have a CSV with information relating to images, such as their names. Is there a way to pass the CSV information to ImageMagick (via PHP on Windows) so that it is added to the images?
Sure. Use fgetcsv (http://php.net/manual/en/function.fgetcsv.php) to read the CSV rows into your script. Assuming you are using the PHP ImageMagick extension, you can then run the commands against each image named in the file. If you are not using the extension, you could invoke ImageMagick through PHP's exec function: http://php.net/manual/en/function.exec.php
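The same read-then-run flow can be sketched outside PHP. The example below uses Python rather than PHP purely for illustration, and it makes several assumptions: the CSV has columns called filename and name (hypothetical), "add to the images" means drawing the name as a caption, and ImageMagick 7 is installed so the executable is magick (on ImageMagick 6 it would be convert). It only builds the command lines, so the result can be inspected before anything is run.

```python
import csv
import io


def build_annotate_commands(csv_text, out_dir="annotated"):
    """Return one ImageMagick command (as an argument list) per CSV row."""
    commands = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        source = row["filename"]  # hypothetical column name
        caption = row["name"]     # hypothetical column name
        commands.append([
            "magick", source,
            "-gravity", "South",          # anchor the text at the bottom edge
            "-pointsize", "24",
            "-annotate", "+0+10", caption,
            f"{out_dir}/{source}",        # output path; out_dir must exist
        ])
    return commands
```

Each command could then be executed with subprocess.run(command, check=True), which plays the same role as exec in the PHP version.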