Extract text from PDF into JSON or XML or whatever?

I am trying to extract data (price, information, and number) from PDFs. I have more than 10,000 of them, so the free trial of a website won't work.
Here is one example of a PDF I get:
I tried it in Python (I am a beginner at this kind of task, and at Python too) with several packages such as PyPDF2 and pdfx, but I only get the text, like this
with PyPDF2:
So it is possible to extract the price, the number, and the information, but my PDFs come in different formats, so I cannot extract the information from the plain text with a few algorithms alone.
What I want to do must be possible, because a lot of websites do it and charge for it: I want to read the document vertically and convert the extracted data to XML/JSON or simply a dataset.
I want to read the document by column, not by line.
Is there a way to do it in Python or another language?

First, let me tell you that this is not an easy problem to solve, since PDF files in the wild tend to be quite diverse in layout. I can suggest trying an open source project that works really well for extracting information from tables in PDF files. It is called Tabula, and you can get it at https://tabula.technology.
Tabula detects tables on each page and exports the content as CSV. Once you have it in CSV, it should be easier to get the information using Python. Please note that the CSV layout depends on the table layout in the PDF, meaning that you may need to write several functions to extract the information correctly.
Tabula is not perfect, but it should work with most PDF files. For those that do not work, you may need to extract the information manually.
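As a minimal sketch of the "CSV to Python" step, the snippet below turns CSV text into JSON records using only the standard library. The column names (number, information, price) and the sample rows are made up for illustration; a real Tabula export will follow whatever layout the table has in the PDF.

```python
import csv
import io
import json


def csv_to_records(csv_text):
    """Turn CSV text (header row first) into a list of dicts, one per row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    records = []
    for row in reader:
        # Strip stray whitespace that often surrounds extracted cell text.
        # Keys of None (cells beyond the header width) are skipped.
        records.append(
            {key.strip(): (value or "").strip()
             for key, value in row.items() if key is not None}
        )
    return records


# Hypothetical CSV, standing in for what a table export might look like.
sample_csv = "number,information,price\n1001, Widget ,9.99\n1002,Gadget,24.50\n"

records = csv_to_records(sample_csv)
print(json.dumps(records, indent=2))
```

With differing PDF layouts, one such function per layout (each mapping its own header names onto a common set of keys) is probably the simplest way to keep the output uniform.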

Related

Merge JSON data inside dynamic PDF templates

My problem is rather simple:
I need a tool to merge some medium complex JSON input data inside a template PDF.
Then based on the data:
Some sections of the template could be replicated.
Some sections may be deleted and the gap created should disappear.
Tables could be filled by N elements without messing the formatting.
Tables could have "merged cells inside"
Also templates should be easily adjustable without re-writing the code, this means they could be: editable PDF, Word files, Spreadsheets, some tool's templates... (..html?)
In all the tools I tried (a lot! :(( ), one or more of the points above was always a nightmare.
So far I have tried:
latex -> pdflatex -> pdf: this probably reduced my life expectancy by 1, maybe 1 and a 1/2 years. Unfortunately it is the most powerful tool because... LaTeX? Anyway, not maintainable at all.
pdfminer\pdfjs\whatever npm pkg: coding always ends up in a low-level mess or huge workarounds.
google sheets -> pdf: the APIs are kind of hard, and they are cell\row based anyway, so it's difficult to manage dynamic sections.
pdfgeneratorapi.com: basically what I needed, with a rich editor in it, but formatting and aligning tables is buggy, so results are always ugly. Also, things like "merged cells" are not possible.
So the question would be: is there a tool or package out there in 2022 capable of handling all these requirements at once?
I would recommend using LaTeX for your PDF generation requirements. LaTeX comes with a huge number of available packages and chances are that someone has already solved your issue in LaTeX.
Specifically, DynamicDocs API by ADVICEment might be a good option for your requirements. DynamicDocs is a JSON to PDF API based on LaTeX. Here are some features:
Ready-made JSON to PDF templates (no need to understand LaTeX)
Ability to write your own template in LaTeX and merge JSON to PDF using the R language layer, making the templates and their content dynamic
Excel to PDF Add-in (currently in Beta) to generate PDFs based on data in Excel
Each account is given a FREE plan with a limited number of monthly API calls
Disclaimer: I am involved in developing DynamicDocs API.
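To illustrate the general idea of merging JSON into a LaTeX template (this is not the DynamicDocs API, just a plain-Python sketch), the snippet below renders a table with N rows and drops the whole section when there is no data. The field names "name" and "qty" and the two-column layout are assumptions made for the example, and the escaping covers only a few special characters.

```python
import json

# A few LaTeX special characters; this is not an exhaustive list.
LATEX_SPECIALS = {"&": r"\&", "%": r"\%", "$": r"\$", "#": r"\#", "_": r"\_"}


def escape(text):
    """Escape a handful of LaTeX special characters in a value."""
    return "".join(LATEX_SPECIALS.get(ch, ch) for ch in str(text))


def render_table(rows):
    """Render a list of {"name", "qty"} dicts as a LaTeX tabular, or "" if empty."""
    if not rows:
        return ""  # Section omitted entirely, so no gap is left behind.
    lines = [r"\begin{tabular}{lr}", r"Name & Qty \\", r"\hline"]
    for row in rows:
        lines.append(f"{escape(row['name'])} & {escape(row['qty'])} " + r"\\")
    lines.append(r"\end{tabular}")
    return "\n".join(lines)


# Hypothetical input data.
data = json.loads('{"items": [{"name": "Bolt M4", "qty": 12}, {"name": "R&D fee", "qty": 1}]}')
latex = render_table(data["items"])
print(latex)
```

The resulting string would still need to be placed inside a full document and compiled (for example with pdflatex), which is where the maintainability concerns from the question come back in.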

How should a "project" file be written?

With popular software packages, like Microsoft Word or Photoshop, we often have an option to save our progress as a "project" file and can later open that file to edit our work further. This file often contains all the options and the progress that the user has made (e.g. the essay you typed in Word).
So my question is: if I am writing an application that needs a similar "project" file, how should I go about it? My application is a scientific application, which means it requires a lot of (multidimensional) arrays. I understand there are a lot of ways to do this, but I would like to know the de facto way.
Here are some of the options I have outlined:
XML: Human readable. The size is too big and it's too much work to deal with arrays.
JSON: More popular/modern. Good with arrays.
Protocol Buffers: Created by Google. Probably faster.
Database: Probably not a good use case, since "project" files are most likely "temporary". Also, working with arrays is not very straightforward.
Creating your own binary format: Might be the most difficult solution for an inexperienced programmer like myself.
???
I would like to get some advice from you guys. Thank you :).
(Good question. :) Only some thoughts.) I'd prefer a text format for the main project file. You can make diffs, and open, read, and modify it easily.
Large ASCII or binary data can be stored as serialized data in external files or in a database like SQLite, from where the application can easily access and process it. The main project file holds links to the external data store.
My advice for the main project file is a simple XML format that can easily be transformed to JSON. A list of key/value pairs (a dict) is good for a start. A value can be of a basic datatype or be an array or dict. A complicated XML tree is not good. The key name can also help to describe and structure the data, so I'd prefer key="rect.4711.pos.x" value="500" and not <rect id="4711"><pos><x>500</x>...</pos>....
An important aspect is that the project data is portable and self-contained, and that the user sees the project as a single unit even if it is a directory on the file system. For this purpose, supporting some kind of zipped format for the project data is good.
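A minimal sketch of that layout, assuming JSON rather than XML for brevity: a flat key/value main file plus separately stored arrays, all packed into one zip archive. The member names (project.json, data/...) and the helper functions are invented for this example.

```python
import io
import json
import zipfile


def save_project(settings, arrays):
    """Pack flat settings and named arrays into one zip archive (as bytes)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        # Main project file: a flat, diff-friendly list of key/value pairs.
        archive.writestr("project.json", json.dumps(settings, indent=2, sort_keys=True))
        # Bulky data lives in separate members that the main file refers to by name.
        for name, data in arrays.items():
            archive.writestr(f"data/{name}.json", json.dumps(data))
    return buffer.getvalue()


def load_project(blob):
    """Read back the settings and every array stored under data/."""
    with zipfile.ZipFile(io.BytesIO(blob)) as archive:
        settings = json.loads(archive.read("project.json"))
        arrays = {}
        for member in archive.namelist():
            if member.startswith("data/") and member.endswith(".json"):
                name = member[len("data/"):-len(".json")]
                arrays[name] = json.loads(archive.read(member))
    return settings, arrays


settings = {"rect.4711.pos.x": 500, "rect.4711.pos.y": 120, "data.grid": "data/grid.json"}
arrays = {"grid": [[1.0, 2.0], [3.0, 4.0]]}

blob = save_project(settings, arrays)
loaded_settings, loaded_arrays = load_project(blob)
```

For large numeric arrays a binary member format would likely be a better fit than JSON; the archive structure stays the same either way.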

Importing pptx slides into docx document

I have literally hundreds of slides created with python-pptx. Many of these slides have charts I would like to use in a docx file. So what I would like to do is use python-docx to import these slides/charts into a docx file. Is that possible?
No, not with the current python-pptx or python-docx APIs.
Such a thing is possible of course, since the Word application will allow you to "paste" charts from PowerPoint and in fact the charts themselves are specified in DrawingML, an XML vocabulary that is shared between PowerPoint, Word, and Excel.
But to make this work with Python, you'd have to dig quite deep into the internals of both python-pptx and python-docx (although their architectures are much the same). You would probably also need to learn more about the respective XML vocabularies than you really wanted to know. So you might want to consider alternate approaches such as using win32com support for this sort of thing, especially if you are running on Windows and this is a one-time job and does not need to be hosted on a server for ongoing use.
If you thought you did want to tackle it, a good first step might be to inspect the XML related to a PowerPoint chart (located in both the slide and the chart-parts of the PPTX package) and also inspect the corresponding XML that appears in a Word (.docx) file that includes a chart. That will give you an idea of what needs to come over from the PPTX package, what transformations it may need to undergo (namespace changes perhaps) and where it would need to be added into the DOCX package, including updating relationship files and perhaps updating certain ID values to make them unique in the target package.
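As a starting point for that inspection, a .pptx or .docx file is a zip package, so the standard library can list its parts. The sketch below builds a tiny stand-in package in memory so that it runs without a real file; the part names mimic the usual layout (charts typically sit under ppt/charts/ in a PPTX and word/charts/ in a DOCX), and both helper functions are invented for this example.

```python
import io
import zipfile


def list_parts(package_bytes, prefix):
    """Return the sorted names of package parts whose path starts with prefix."""
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as package:
        return sorted(name for name in package.namelist() if name.startswith(prefix))


def build_fake_package(part_names):
    """Build a stand-in package in memory so the sketch runs without a real file."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        for name in part_names:
            package.writestr(name, "<placeholder/>")
    return buffer.getvalue()


fake_pptx = build_fake_package([
    "ppt/slides/slide1.xml",
    "ppt/slides/_rels/slide1.xml.rels",
    "ppt/charts/chart1.xml",
])
chart_parts = list_parts(fake_pptx, "ppt/charts/")
print(chart_parts)
```

For a real presentation, pass the bytes read from the file (opened in "rb" mode) instead of the fake package, then read the listed parts and the matching .rels files to see how a slide refers to its chart.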

Programmatically convert XPS to Excel

I am trying to get XPS files recognised by my tool. My tool runs on VBA code (Excel), so I am trying to convert the XPS format to Excel.
Has anyone tried to do this? If so, could you please help me out?
An XPS file is very similar to a PDF in that it's a representation of a printed page and thus doesn't normally retain any semantic information such as what data constitutes a table and what relation, if any, they have to each other.
I'm not familiar with XPS's internals. If there is a structure for representing a grid of cells, then it should be easy to extract the information. If it's just a bunch of text arranged in a grid, you'll have to devise an algorithm that can detect patterns of regularly placed text (i.e. a grid layout) and extract the text from there.
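One simple way to sketch such an algorithm, assuming the text fragments have already been extracted together with their positions: cluster the x coordinates into columns and the y coordinates into rows, then drop each fragment into its cell. The (x, y, text) tuple format, the tolerance value, and the sample data are assumptions for illustration, and clustering on the left edge only suits left-aligned columns.

```python
def cluster(values, tolerance):
    """Group coordinates so that neighbours within tolerance share one cluster."""
    groups = []
    for value in sorted(values):
        if groups and value - groups[-1][-1] <= tolerance:
            groups[-1].append(value)
        else:
            groups.append([value])
    return groups


def to_grid(fragments, tolerance=3.0):
    """Arrange (x, y, text) fragments into rows and columns by position."""
    col_groups = cluster({x for x, _, _ in fragments}, tolerance)
    row_groups = cluster({y for _, y, _ in fragments}, tolerance)
    col_index = {x: i for i, group in enumerate(col_groups) for x in group}
    row_index = {y: i for i, group in enumerate(row_groups) for y in group}
    grid = [["" for _ in col_groups] for _ in row_groups]
    for x, y, text in fragments:
        grid[row_index[y]][col_index[x]] = text
    return grid


# Hypothetical fragments; y is assumed to grow downwards from the top of the page.
fragments = [
    (10.0, 20.0, "Item"), (120.5, 20.0, "Price"),
    (10.2, 40.0, "Widget"), (120.0, 40.1, "9.99"),
    (10.0, 60.0, "Gadget"), (121.0, 59.8, "24.50"),
]

grid = to_grid(fragments)
columns = list(zip(*grid))  # Transpose to read the table column by column.
```

Here `grid` holds the table row by row and `columns` holds the same cells column by column, which is convenient when each column carries one kind of value.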

Extracting data from PDF or Word using PHP, Java

I need help on this...
Especially since I don't know where to start..
I am an IT undergraduate and, along with my groupmates, am now undergoing on-the-job training at a company.
SCENARIO:
The company asked us to create a program that will generate a report and store it in a database.
The database that will be used is MySQL.
As for what language to use, we are considering VB.Net, Java, PHP.
The program must be able to :
generate a report that will be sent through email to an office
store in a database
collect all reports, collate those reports
generate a new report which will then be sent to their main office
then store it in their own database...
For now,
we are still trying to determine how the program will run and what language will be used that has the capability of reading and extracting data from a text file (can either be a word document or a PDF file).
The company also wants the program to be online-ready for future expansion.
Now, our problem is:
Is there a way to extract data from a PDF or Word file using Java, PHP, or VB and then store it in the MySQL DB?
If there is, can it be implemented without using any 3rd party software?
The reason we chose a PDF or Word file is that the file should be printable for archive purposes.
Which programming language would make it easiest to solve the problem above?
I would like to apologize if the info I am giving is a bit messed up. I will give additional information once we are able to talk with the company this week.
If there is a problem with the way I posted this, please forgive me. I am just trying my best to provide you with the information the best I could.
I'll answer for Java as it is what I use at work.
You can easily extract text from Word files or build a new Word file with Apache POI.
As for PDF, iText and PDFBox both do a pretty nice job.
Why can't you use 3rd party software? If you could, I would recommend something like the approach in "How to read PDF files using Java?".
Or, to read a .doc file: http://www.roseindia.net/tutorial/java/poi/readDocFile.html
Anyway, if you can't use 3rd party tools, why not read the specifications and figure out how to extract the text from PDF, DOC, and DOCX files?
Here you can find DOC specifications: http://msdn.microsoft.com/en-us/library/cc313118.aspx
Here you can find the PDF format specification: http://www.adobe.com/devnet/pdf/pdf_reference.html
Good luck!