My context
I'm using tesseract to extract text from an image.
I'm generating a .tsv to retrieve the extracted text and run some regexes on it, and a .pdf to get a searchable PDF.
The way I do it is by calling tesseract twice:
once asking for the .tsv,
once asking for the .pdf.
But I feel this is not very efficient (the same computations are performed twice).
What I wish
I want to make this faster, and my idea is to call tesseract only once, specifying two output formats.
Is that possible? If so, how?
You can try the command:
tesseract yourimage.tif out pdf tsv
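Both "pdf" and "tsv" are config names shipped with tesseract, so recognition should run only once, with each output renderer consuming the same result. As a minimal sketch, wrapped in Python via subprocess (the file names are placeholders), a single call produces both out.pdf and out.tsv:

import subprocess

# One tesseract invocation, two output configs: writes out.pdf and out.tsv
subprocess.run(
    ["tesseract", "yourimage.tif", "out", "pdf", "tsv"],
    check=True,
)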
Related
I am trying to extract data (price, information, and number) from PDFs. I have more than 10,000 PDFs, so the free trial of the website won't work.
Here is one example of the PDFs I get:
I tried it in Python (I'm a beginner at this kind of task and at Python too) with several packages like PyPDF2, pdfx, and so on, but with PyPDF2 I only get the raw text.
So it is possible to extract the price, the number, and the information, but the PDFs come in different formats, so extracting the information from the raw text with a few hand-written rules is not workable.
What I want to do must be possible, because a lot of websites do it and charge people for it: read the document vertically and convert the extracted data into XML/JSON, or simply a dataset.
I want to read the document by column, not by line.
Is there a way to do this in Python or another language?
First, let me tell you that this is not an easy problem to solve, since PDF files in the wild tend to be quite diverse in layout. I can suggest trying an open-source project that works really well for extracting information from tables in PDF files. It is called Tabula; you can get it at https://tabula.technology.
Tabula detects tables on each page and exports the content as CSV. Once you have it in CSV, it should be easier to get the information using Python. Please note that the CSV layout depends on the table layout in the PDF, meaning you may need to write several functions to extract the information correctly.
Tabula is not perfect, but it should work with most PDF files; for those that do not work, you may need to extract the information manually.
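If you want to stay in Python, the tabula-py package wraps Tabula (it needs a Java runtime installed); a minimal sketch, with the file name as a placeholder:

import tabula

# Detect tables on every page; each table comes back as a pandas DataFrame
tables = tabula.read_pdf("invoice.pdf", pages="all")
for df in tables:
    print(df.head())

# Or export all detected tables straight to a single CSV file
tabula.convert_into("invoice.pdf", "invoice.csv", output_format="csv", pages="all")

From the DataFrames you can then pick out the price, number, and information fields per layout, as suggested above.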
I am running tesseract to extract text from PDF files in a context where it is important to distinguish between semicolons and commas.
I find that semicolons often show up as commas after OCR. The accuracy is otherwise pretty good.
I am looking for suggestions on how to improve accuracy in semicolon-versus-comma detection. Following this suggestion, my procedure is to first convert a multipage PDF file to a .ppm file using pdftoppm from Xpdf, then convert that to .tif using ImageMagick, then run tesseract on the .tif file.
I have set the resolution of the .ppm file to 1000 DPI and used the -sharpen option in ImageMagick in an effort to improve image quality, but neither seems to improve the semicolon recognition.
Any suggestions for pre-processing the image files, or is this just a tough hill to climb?
Here are links to the original PDF, the .ppm and .tif files, and the .txt output.
Note that this is copyrighted material which I do not own.
You can always custom-train Tesseract on your own dataset. You can check the article "How to custom train tesseract".
It will certainly be a long process to train a new model, starting with collecting a dataset, but it is a way to improve the OCR.
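Before committing to retraining, it can help to script the pre-processing pipeline from the question so DPI, sharpening, and page-segmentation parameters are easy to vary. A rough sketch driving the same tools from Python (file names and parameter values are placeholders, and pdftoppm's exact output naming depends on the page count):

import subprocess

# 1. Rasterize the PDF at high resolution
subprocess.run(["pdftoppm", "-r", "1000", "input.pdf", "page"], check=True)

# 2. Convert to TIFF and sharpen with ImageMagick
subprocess.run(["convert", "page-1.ppm", "-sharpen", "0x1", "page-1.tif"], check=True)

# 3. OCR; --psm 6 (assume a uniform block of text) is one setting worth trying
subprocess.run(["tesseract", "page-1.tif", "page-1", "--psm", "6"], check=True)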
For a project I need to convert a .pdf, .docx, or .jpg file into a binary file consisting of 0s and 1s, the way a computer stores data on the hard disk, for example. I also need to be able to turn the 0/1 data back into the original file type. Can anyone guide me on how to do this? Any scripting language is okay, but I prefer Python or C.
A .pdf, .docx, or .jpg file is already represented in binary (ones and zeros) when stored on your hard disc, so no conversion is necessary.
If you mean something different (for example, conversion to a form that displays on screen as ones and zeros), then you will need to articulate your question more clearly.
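If the goal really is a textual string of 0s and 1s, a minimal Python sketch of the round trip (file names are placeholders) would be:

# Read any file as raw bytes and render it as a '0'/'1' string
with open("input.pdf", "rb") as f:
    data = f.read()
bits = "".join(format(byte, "08b") for byte in data)

# Reverse: regroup the string into bytes and write the file back out
restored = bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))
with open("restored.pdf", "wb") as f:
    f.write(restored)
assert restored == data

Note that the '0'/'1' text form is eight times the size of the original file, which is why files are not normally handled this way.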
I use MATLAB for calculation and plotting. I want to write a plot as an image file like PNG or JPG into a MySQL database (so that I can retrieve it later for a web browser). In other words, I want to write a blob to the database that is a PNG or JPG file.
If I search for that I get http://www.mathworks.com/matlabcentral/answers/97768-how-do-i-insert-an-image-or-figure-into-a-database-using-the-database-toolbox-in-matlab, but there a MATLAB matrix is written to the database as an array. That is much bigger than a compressed PNG file, does not preserve subplots and other figure elements, and cannot be displayed by a web browser.
A workaround would be to write the plot to a file and use MATLAB (or an external script based on Python or the like) to read that file as a blob and write it as a blob to the database.
Do you know a way to write a plot as PNG or JPG directly to a database, without the detour of a file?
I also asked MATLAB support, and they gave me a positive answer. A solution is the figToImStream function of the MATLAB Compiler toolbox: "Stream out figure as byte array encoded in format specified, creating signed byte array in .png format". The downside is that the MATLAB Compiler toolbox is quite expensive...
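The file-based workaround from the question is easy to script outside MATLAB. A sketch in Python with the mysql-connector-python package (connection details, table, and column names are made up for illustration; the column should be a BLOB or MEDIUMBLOB):

import mysql.connector

# Read the PNG that MATLAB wrote to disk
with open("plot.png", "rb") as f:
    blob = f.read()

conn = mysql.connector.connect(
    host="localhost", user="user", password="secret", database="plots_db"
)
cur = conn.cursor()
# A parameterized insert keeps the binary data intact
cur.execute("INSERT INTO plots (name, image) VALUES (%s, %s)", ("myplot", blob))
conn.commit()
cur.close()
conn.close()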
I want to be able to compare the results I get from running OCR on the same document three times. Are there any tools out there that I can use to make this happen?
I would like to compare the three documents and, based on which characters are the same 3/3 times or 2/3 times, create a fourth document with the output of that decision. I am using ABBYY FineReader, which has given me great results, but I am trying to do everything I can to get to 100%.
I know Microsoft Word has a "compare documents" function, and I would like to be able to do this kind of analysis on a larger scale with a robust algorithm.
Any ideas?
Thanks for your time!
If the output is a simple text file, you could use the diff command and a simple shell script to compare the files. You could probably then use a slightly more complicated script to parse the diff output and create the final document.
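As a sketch of the 2-out-of-3 voting idea in Python, under the simplifying assumption that the three outputs are already aligned character by character (a real implementation would first need sequence alignment, e.g. with difflib, since OCR runs can insert or drop characters):

from collections import Counter

def majority_vote(a: str, b: str, c: str) -> str:
    """Per character position, keep the value seen in at least 2 of the 3 runs."""
    out = []
    for chars in zip(a, b, c):  # note: zip stops at the shortest input
        winner, count = Counter(chars).most_common(1)[0]
        # No 2/3 agreement: fall back to the first run's character
        out.append(winner if count >= 2 else chars[0])
    return "".join(out)

texts = [open(f"run{i}.txt").read() for i in (1, 2, 3)]
print(majority_vote(*texts))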