I am trying to get my tool to recognise XPS files. The tool runs on VBA (Excel), so I am trying to convert XPS files to Excel.
Has anyone tried this? If so, could you please help me out?
An XPS file is very similar to a PDF in that it's a representation of a printed page and thus doesn't normally retain any semantic information such as what data constitutes a table and what relation, if any, they have to each other.
I'm not familiar with XPS's internals. If there is a structure for representing a grid of cells, it should be easy to extract the information. If it's just a bunch of text arranged in a grid, you'll have to devise an algorithm that detects patterns of regularly placed text (i.e. a grid layout) and extracts the text from there.
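As a rough illustration of the second case, here is a minimal Python sketch of grid detection. It assumes you have already pulled out each text fragment together with its x/y position (how you get those out of the XPS file is not covered here), and the function names and the tolerance value are made up for the example. It groups nearby coordinates into columns and rows, then drops each fragment into the nearest cell.

```python
def cluster(values, tol):
    """Group coordinates that lie within `tol` of their neighbour; return group means."""
    groups = []
    for v in sorted(values):
        if groups and v - groups[-1][-1] <= tol:
            groups[-1].append(v)
        else:
            groups.append([v])
    return [sum(g) / len(g) for g in groups]


def nearest(centers, v):
    """Index of the cluster center closest to v."""
    return min(range(len(centers)), key=lambda i: abs(centers[i] - v))


def text_to_grid(items, tol=3.0):
    """items: iterable of (x, y, text) tuples, y growing downwards.

    Returns a list of rows, each a list of cell strings.
    The tolerance of 3.0 units is an assumption; tune it to your documents.
    """
    items = list(items)
    xs = cluster([x for x, _, _ in items], tol)  # column positions
    ys = cluster([y for _, y, _ in items], tol)  # row positions
    grid = [["" for _ in xs] for _ in ys]
    for x, y, text in items:
        grid[nearest(ys, y)][nearest(xs, x)] = text
    return grid


if __name__ == "__main__":
    fragments = [(10, 100, "Name"), (120.5, 100.4, "Price"),
                 (10.2, 120, "Apple"), (120, 120.3, "1.50")]
    for row in text_to_grid(fragments):
        print(row)
```

This only works when the cells are left-aligned at roughly the same x position; right-aligned or centred columns would need the fragment width as well.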
My problem is rather simple:
I need a tool to merge some moderately complex JSON input data into a PDF template.
Then based on the data:
Some sections of the template could be replicated.
Some sections may be deleted and the gap created should disappear.
Tables could be filled by N elements without messing the formatting.
Tables could have "merged cells" inside.
Also, templates should be easily adjustable without rewriting the code. This means they could be: editable PDF, Word files, spreadsheets, some tool's templates... (..html?)
In all the tools I tried (a lot! :(( ), one or more of the points above was always a nightmare.
So far I have tried:
latex -> pdflatex -> pdf: this probably reduced my life expectancy by 1 to 1.5 years. Unfortunately it is the most powerful tool (because.. LaTeX?), but it is not maintainable at all.
pdfminer\pdfjs\whatever npm pkg..: coding always ends up in a low-level mess or huge workarounds.
google sheets -> pdf: the APIs are kind of hard, and they are cell\row based anyway, so it's difficult to manage dynamic sections.
pdfgeneratorapi.com: basically what I needed, with a rich editor in it, but formatting and aligning tables is buggy, so results are always ugly. Also, things like "merged cells" are not possible.
So the question is: is there a tool or package out there in 2022 capable of handling all these requirements at once?
I would recommend using LaTeX for your PDF generation requirements. LaTeX comes with a huge number of available packages and chances are that someone has already solved your issue in LaTeX.
Specifically, DynamicDocs API by ADVICEment might be a good option for your requirements. DynamicDocs is a JSON to PDF API based on LaTeX. Here are some features:
Ready-made JSON to PDF templates (no need to understand LaTeX)
Ability to write your own template in LaTeX and merge JSON to PDF using the R language layer, making the templates and their content dynamic
Excel to PDF Add-in (currently in Beta) to generate PDFs based on data in Excel
Each account is given a FREE plan with a limited number of monthly API calls
Disclaimer: I am involved in developing DynamicDocs API.
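For the plain LaTeX route mentioned at the start of this answer (not the DynamicDocs API itself), a minimal Python sketch of the merge step could look like the following. The JSON shape (`sections`, `title`, `columns`, `rows`) and the function names are assumptions made up for this example. It shows two of the requirements: a section is simply skipped when it is absent from the data, and a table grows to N rows. The resulting string would then be compiled with pdflatex.

```python
import json

# Characters that need escaping in LaTeX text. Backslash, ~ and ^ are not
# handled in this sketch.
_LATEX_SPECIALS = {"&": r"\&", "%": r"\%", "$": r"\$", "#": r"\#",
                   "_": r"\_", "{": r"\{", "}": r"\}"}


def escape(value):
    """Escape a value so it can be placed in LaTeX running text."""
    return "".join(_LATEX_SPECIALS.get(ch, ch) for ch in str(value))


def render_table(rows, columns):
    """Render a list of dicts as a tabular with one line per row."""
    lines = [r"\begin{tabular}{" + "l" * len(columns) + "}"]
    lines.append(" & ".join(escape(c) for c in columns) + r" \\ \hline")
    for row in rows:
        lines.append(" & ".join(escape(row.get(c, "")) for c in columns) + r" \\")
    lines.append(r"\end{tabular}")
    return "\n".join(lines)


def render_document(data):
    """Build a complete LaTeX document; sections missing from the data leave no gap."""
    parts = [r"\documentclass{article}", r"\begin{document}"]
    for section in data.get("sections", []):
        parts.append(r"\section{" + escape(section["title"]) + "}")
        if section.get("rows"):
            parts.append(render_table(section["rows"], section["columns"]))
    parts.append(r"\end{document}")
    return "\n".join(parts)


if __name__ == "__main__":
    payload = json.loads(
        '{"sections": [{"title": "Items", "columns": ["name", "price"],'
        ' "rows": [{"name": "A&B", "price": "5%"}]}]}'
    )
    print(render_document(payload))
```

Merged cells are not covered here; in LaTeX they would typically be expressed with `\multicolumn` in the generated row.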
I have literally hundreds of slides created with python-pptx. Many of these slides have charts I would like to use in a docx file. So what I would like to do is use python-docx to import these slides/charts into a docx file. Is that possible?
No, not with the current python-pptx or python-docx APIs.
Such a thing is possible of course, since the Word application will allow you to "paste" charts from PowerPoint and in fact the charts themselves are specified in DrawingML, an XML vocabulary that is shared between PowerPoint, Word, and Excel.
But to make this work with Python, you'd have to dig quite deep into the internals of both python-pptx and python-docx (although their architectures are much the same). You would probably also need to learn more about the respective XML vocabularies than you really wanted to know. So you might want to consider alternate approaches such as using win32com support for this sort of thing, especially if you are running on Windows and this is a one-time job and does not need to be hosted on a server for ongoing use.
If you thought you did want to tackle it, a good first step might be to inspect the XML related to a PowerPoint chart (located in both the slide and the chart-parts of the PPTX package) and also inspect the corresponding XML that appears in a Word (.docx) file that includes a chart. That will give you an idea of what needs to come over from the PPTX package, what transformations it may need to undergo (namespace changes perhaps) and where it would need to be added into the DOCX package, including updating relationship files and perhaps updating certain ID values to make them unique in the target package.
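As a starting point for that inspection, here is a minimal sketch using only the Python standard library. It relies on the fact that .pptx and .docx files are ZIP packages, and it assumes the conventional part layout in which chart parts live in a `charts/` folder (for example `ppt/charts/chart1.xml`); the function names are made up for this example.

```python
import zipfile


def chart_related_parts(package_path):
    """Return (chart part names, relationship file names) found in the package."""
    with zipfile.ZipFile(package_path) as pkg:
        names = pkg.namelist()
    # Chart XML parts; the folder name is a convention, not a guarantee.
    charts = sorted(n for n in names if "/charts/" in n and n.endswith(".xml"))
    # Relationship files tie slides (or the document) to their chart parts.
    rels = sorted(n for n in names if n.endswith(".rels"))
    return charts, rels


def read_part(package_path, part_name):
    """Return the raw XML text of one part for side-by-side comparison."""
    with zipfile.ZipFile(package_path) as pkg:
        return pkg.read(part_name).decode("utf-8")
```

Running `chart_related_parts` on a .pptx with a chart and on a .docx with a chart, then diffing the output of `read_part` for the two chart parts, gives a first picture of what differs between the packages.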
I am trying to extract data (price, information, and number) from PDFs. I have more than 10,000 PDFs, so the free trial of the website won't work.
Here is one example of the PDFs I get:
I tried it in Python (I am a beginner at this kind of task and at Python too) with several packages such as PyPDF2, pdfx and so on, but I only get the text, like this
with PyPDF2:
So it's possible to extract the price, the number, and the information, but the PDFs come in different formats, so I cannot extract the information from the text alone with a few algorithms.
What I want to do is possible, because a lot of websites do it and charge for it: I want to read the document vertically and convert the extracted data to XML/JSON or simply a dataset.
I want to read the document by column, not by line.
Is there a way to do it in Python or other languages?
First, let me tell you that this is not an easy problem to solve, since PDF files in the wild tend to be quite diverse in layout. I can suggest trying an open source project that works really well for extracting information from tables in PDF files. It is called Tabula, and you can get it at https://tabula.technology.
Tabula detects tables on each page and exports the content as CSV. Once you have it in CSV, it should be easier to get the information using Python. Please note that the CSV layout depends on the table layout in the PDF, meaning that you may need to create several functions to extract the information correctly.
Tabula is not perfect, but it should work with most PDF files. For those that do not work, you may need to extract the information manually.
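Once Tabula has exported a CSV, reading it by column instead of by line takes only a few lines of Python. This is a minimal sketch using the standard `csv` module; the function name is made up for the example, and it assumes a UTF-8 file with one table per CSV.

```python
import csv


def read_columns(csv_path):
    """Read a CSV file and return its content column by column."""
    with open(csv_path, newline="", encoding="utf-8") as fh:
        # Skip rows that are completely empty.
        rows = [row for row in csv.reader(fh) if any(cell.strip() for cell in row)]
    # Pad short rows so every row has the same number of cells.
    width = max((len(row) for row in rows), default=0)
    padded = [row + [""] * (width - len(row)) for row in rows]
    # Transpose: one list per column.
    return [list(col) for col in zip(*padded)]
```

Each returned list is one column, top to bottom, so a per-layout function can then pick out the price, the number, or the information column by position or by header text.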
I need help on this...
Especially since I don't know where to start..
I am an IT undergraduate and, along with my groupmates, am now undergoing on-the-job training in a company.
SCENARIO:
The company asked us to create a program that will generate a report and store it in a database.
The database that will be used is MySQL.
As for what language to use, we are considering VB.Net, Java, PHP.
The program must be able to :
generate a report that will be sent through email to an office
store in a database
collect all reports, collate those reports
generate a new report which will then be sent to their main office
then store it in their own database...
For now,
we are still trying to determine how the program will run and what language will be used that has the capability of reading and extracting data from a text file (can either be a word document or a PDF file).
The company also wants the program to be online-ready for future expansion.
Now, our problem is
Is there a way to extract data from a PDF or Word file using either Java, PHP, VB then store it in the MySQL DB?
If there is, can it be implemented without using any 3rd party software?
The reason we chose to use either a PDF or Word file type is that the file should be printable for archive purposes.
What programming language can we most easily use to solve our problem above?
I would like to apologize if the info I am giving is a bit messed up. I will be giving additional information once we are able to talk with the company this week.
If there is a problem with the way I posted this, please forgive me. I am just trying my best to provide you with the information the best I could.
I'll answer for Java as it is what I use at work.
You can easily extract text from Word files or build a new Word file with Apache POI.
As for PDF, iText and PDFBox both do a pretty good job.
Why can't you use 3rd party software? If you could, I would recommend something like the question "How to read PDF files using Java?".
Or, to read a .doc file: http://www.roseindia.net/tutorial/java/poi/readDocFile.html
Anyway, if you can't use 3rd party tools, why not read the specifications and figure out how to extract the text from PDF, DOC, and DOCX files?
Here you can find DOC specifications: http://msdn.microsoft.com/en-us/library/cc313118.aspx
Here you can find the PDF format specification: http://www.adobe.com/devnet/pdf/pdf_reference.html
Good luck!
I have a set of about 400 Word documents that are part of a Quality Management System. Word is causing me a lot of grief because a) it handles images in large documents poorly, b) the layout sometimes gets broken, and c) it is cumbersome to configure the documentation for different clients.
I can convert single documents by saving them as XML/HTML or text and converting them manually into LaTeX, but that is not feasible for 400 documents. I know that I can print Word documents directly to PDF with tools like PrimoPDF, but that is not flexible enough because I need to modify the content.
Is there a way to keep the structure of the document (plain text, headings, tables, images) and transform it into XML? Afterwards I would like to transform the XML into HTML, LaTeX and PDF according to the choices of our clients, and also modify the content. Is XSLT the way to go for transforming the XML to the other formats?
Thanks for any advice.
You could convert your documents to Word 2007. Office 2007 documents (.docx) are ZIP packages of XML files: just change the file extension to .zip and unzip. Also, Microsoft publishes an API for working with Office 2007 documents that is higher-level than working with the XML tags.
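To give an idea of what working with the raw XML looks like, here is a minimal Python sketch that opens a .docx as a ZIP archive and collects the text of each paragraph. It assumes the main part is at the conventional location `word/document.xml`; the function name is made up for the example, and headings, tables and images would need additional handling.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used by .docx documents.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_paragraphs(path):
    """Return the plain text of every paragraph (w:p) in the main document part."""
    with zipfile.ZipFile(path) as pkg:
        root = ET.fromstring(pkg.read("word/document.xml"))
    paragraphs = []
    for p in root.iter(f"{{{W_NS}}}p"):
        # A paragraph's text is split across runs; join all w:t elements.
        # Paragraphs nested inside others (e.g. text boxes) are not treated specially.
        paragraphs.append("".join(t.text or "" for t in p.iter(f"{{{W_NS}}}t")))
    return paragraphs
```

Looping this over a folder of 400 converted files gives you the text content; the paragraph style (which marks headings) is stored in the `w:pPr` child of each paragraph.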
For batch converting MS Word to something else you might have a look at OpenOffice.org.
OpenOffice has a (command line) batch mode for mass conversions. You can also have a look at JodConverter which converts documents using just that mechanism.
That way you could mass convert Microsoft Word to some other format OpenOffice.org supports. Perhaps text, perhaps RTF, perhaps OpenOffice XML.
You then have a hopefully easier format to convert to LaTeX.
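A batch run could be driven from a small script like the one below. Note the assumptions: this sketch uses the `--headless --convert-to` options of LibreOffice (the successor of OpenOffice.org), assumes the `soffice` executable is on the PATH, and uses made-up function names. An older OpenOffice.org installation may not accept these options, in which case JodConverter is the safer route.

```python
import os
import subprocess
from pathlib import Path


def build_command(source, out_dir, target_format):
    """Build the soffice command line for converting one file (LibreOffice options)."""
    return ["soffice", "--headless", "--convert-to", target_format,
            "--outdir", os.fspath(out_dir), os.fspath(source)]


def convert_folder(folder, out_dir, target_format="odt"):
    """Convert every .doc/.docx file in `folder`, one soffice call per file."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for source in sorted(Path(folder).glob("*.doc*")):
        subprocess.run(build_command(source, out_dir, target_format), check=True)
```

Calling `convert_folder("qms_docs", "converted", "odt")` (folder names are placeholders) would write one converted file per Word document into the output folder.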
Have a search for Word and OpenOffice right here at Stack Overflow; you'll find results like this one about Word to HTML conversion.
There is advice on Word <--> LaTeX conversions at TUG (TeX User Group):
http://www.tug.org/utilities/texconv/pctotex.html
that may be worth having a look at to see if any of the suggestions and methods meet your requirements.
Not sure how well it works, but there is Word2tex.