I want to use IcePDF or PDFBox to extract content from a PDF, but I don't know how to go on to generate HTML web pages from the extracted text and images.
You can convert pdf to html with PDFBox. Try this link.
If you pass -html to PDFBox's ExtractText command-line tool, the output is HTML instead of plain text. It will not contain any images, graphics, or other details; it is only the text extracted from the PDF, wrapped in HTML.
If you want to reproduce the exact look and feel of the PDF, there is no single-step method in PDFBox. To my knowledge, no library can produce an exact HTML copy of a PDF. But with PDFBox you can extract the images and the text along with its details (position, font, size), and from those details you have to write your own logic to produce the HTML. We did a project converting PDF to HTML for azzist.com, and we accomplished the conversion using PDFBox. In azzist we convert résumés to HTML format (some font issues still remain).
Scribd, Google, Dropbox, Zoho, and others have accomplished this conversion in a better way. You can look at any of these sites to see their results. (You will not get their logic; you have to work it out yourself.)
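To give an idea of what "create a logic to produce the html" can look like, here is a minimal sketch in plain Java. It is not the azzist code and it does not call PDFBox at all: `TextFragment` and `PositionedHtml` are hypothetical names, and the sketch assumes you have already pulled text runs out of the PDF with their positions and font sizes, converted to a top-left origin in points. Each run becomes an absolutely positioned span inside a page-sized div.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical holder for one run of text and the position an extractor reported for it. */
final class TextFragment {
    final String text;
    final double x;        // left edge, in points from the left side of the page
    final double y;        // top edge, in points from the top of the page
    final double fontSize; // in points

    TextFragment(String text, double x, double y, double fontSize) {
        this.text = text;
        this.x = x;
        this.y = y;
        this.fontSize = fontSize;
    }
}

final class PositionedHtml {
    /** Escapes the characters that are significant in HTML text content. */
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    /** Renders one page as a relatively positioned div holding absolutely positioned spans. */
    static String renderPage(List<TextFragment> fragments, double widthPt, double heightPt) {
        StringBuilder html = new StringBuilder();
        html.append(String.format(Locale.ROOT,
                "<div style=\"position:relative;width:%.1fpt;height:%.1fpt\">\n",
                widthPt, heightPt));
        for (TextFragment f : fragments) {
            html.append(String.format(Locale.ROOT,
                    "  <span style=\"position:absolute;left:%.1fpt;top:%.1fpt;"
                            + "font-size:%.1fpt;white-space:pre\">%s</span>\n",
                    f.x, f.y, f.fontSize, escape(f.text)));
        }
        html.append("</div>\n");
        return html.toString();
    }

    public static void main(String[] args) {
        // Made-up sample data standing in for what an extractor would report.
        List<TextFragment> page = new ArrayList<>();
        page.add(new TextFragment("Curriculum Vitae", 72, 72, 18));
        page.add(new TextFragment("Skills: Java & <HTML>", 72, 110, 11));
        System.out.print(renderPage(page, 595, 842)); // roughly A4 in points
    }
}
```

Absolute positioning keeps the layout close to the PDF but the text no longer reflows, and fonts still have to be mapped separately, which is probably where font issues like the ones mentioned above come from.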
I have a small question: I want to embed a PDF as if it were part of the website. I could rewrite the PDF in HTML, but that would be a lot of work. What would be the best-looking option to embed it?
The PDF has pictures, and text with various fonts laid out in columns.
Thanks.
I would go with one of two approaches. One is native rendering of the PDF content in a DOM node using PDF.js, an open source library that is used as the default PDF viewer in Firefox.
The other is to emulate the look by converting the PDF pages to images when the PDF is uploaded. You can use ImageMagick to rasterize the pages, and display the resulting images in a slideshow/gallery widget.
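For the second approach, the upload handler could shell out to ImageMagick. The following Java sketch only shows the idea: the class name `PdfPagesToImages`, the file names, and the 150 dpi value are assumptions, and it presumes an ImageMagick 6 style `convert` binary (ImageMagick 7 calls it `magick`) with Ghostscript installed for PDF input.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

final class PdfPagesToImages {
    /** Builds an ImageMagick command that writes one numbered PNG per PDF page. */
    static List<String> buildCommand(String pdfPath, String outputDir, int dpi) {
        return Arrays.asList(
                "convert",                        // ImageMagick 6 binary name (assumption)
                "-density", String.valueOf(dpi),  // rasterization resolution, given before the input
                pdfPath,
                outputDir + "/page-%03d.png");    // %03d is filled in by ImageMagick per page
    }

    /** Runs the command and returns the process exit code (0 means success). */
    static int run(List<String> command) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command).inheritIO().start();
        return process.waitFor();
    }

    public static void main(String[] args) throws Exception {
        List<String> command = buildCommand("upload.pdf", "pages", 150);
        System.out.println(String.join(" ", command));
        // Uncomment on a machine that has ImageMagick and Ghostscript installed:
        // System.exit(run(command));
    }
}
```

Passing the arguments as a list rather than one concatenated string avoids shell quoting problems with uploaded file names.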
I am creating an iPad application for reading PDFs, where the PDF is generated from an HTML file. I have seen some sample code for converting HTML to PDF, and I think that part will be fine for me to implement.
I have seen some apps from the App Store for reading PDF files that have options to increase the font size (not zooming), change the color style, and so on. When the font size is increased, the text automatically wraps onto the next line. How can I implement this in my app? Any idea how they have done it?
I have read in some posts that it is not possible to edit a PDF file, so are they actually using a PDF file or some other format?
When rendering your PDF to show it to the user, you can convert it to another format such as HTML and then let the user change the font size and font style. That way your PDF remains unchanged. HTML is also easier to manipulate than a binary document format like PDF.
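To illustrate why HTML makes this easy, here is a small sketch. It is written in Java only to show the idea, not as iOS code; `ReflowHtml` is a hypothetical name, and it assumes the text is already available as a list of paragraphs. Because the output uses normal flow layout, regenerating it with a different font size (or changing that one CSS value in the web view) makes the viewer re-wrap the text by itself.

```java
import java.util.Arrays;
import java.util.List;

final class ReflowHtml {
    /** Escapes the characters that are significant in HTML text content. */
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    /**
     * Wraps paragraphs in a flow-layout HTML page. The font family is written
     * into the stylesheet as-is, so it must come from a trusted list of choices.
     */
    static String render(List<String> paragraphs, int fontSizePx, String fontFamily) {
        StringBuilder html = new StringBuilder();
        html.append("<!DOCTYPE html>\n<html><head><meta charset=\"utf-8\">\n");
        html.append("<style>body{font-size:").append(fontSizePx)
            .append("px;font-family:").append(fontFamily)
            .append(";line-height:1.4}</style>\n</head><body>\n");
        for (String p : paragraphs) {
            html.append("<p>").append(escape(p)).append("</p>\n");
        }
        html.append("</body></html>\n");
        return html.toString();
    }

    public static void main(String[] args) {
        List<String> text = Arrays.asList("First paragraph.", "Second paragraph with a < sign.");
        System.out.print(render(text, 14, "serif")); // default reading size
        System.out.print(render(text, 22, "serif")); // same content after "increase font size"
    }
}
```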
I know that Word has an option to save a document as a web page, but what it produces doesn't suit my needs.
I want my document to be saved as a web page with a separate folder that includes the images and .css files. But Word only includes the images, not a CSS file.
How can I achieve this with Word or some other tool (it needs to be free)?
Given that Java is your preferred language, you could try docx4j [disclaimer: I manage it].
See the CreateHtml sample.
Note: docx4j supports docx only (as opposed to legacy binary .doc; those can be converted using LibreOffice).
I currently have a "PrintingWebService" that I call from an AJAX page with all the information needed to construct a highly customized PDF printout. It uses PdfSharp in its GDI+ mode, which takes DrawString and other commands that work basically like GDI+, except that they draw to the PDF.
I then save the PDF file to a location on the webserver and return the file name from the web service, and the AJAX page opens a new window with the pdf file.
So far it works well. However, there is one part of my AJAX page that I want to print and haven't found a solution for yet: I have a string containing the HTML content of a TinyMCE editor, and I want to display it in the bottom part of the PDF page.
I'm looking for some sort of tool I could use for this purpose. Even something open source that renders to GDI+ would do, since I could take the source code and translate it to use PdfSharp's GDI+ equivalents (the class names are like XGraphics, with each class having an X before the GDI+ name).
If I have to I will limit what HTML can be generated by TinyMCE and write my own renderer, but that will be a big challenge, so I'm looking for other solutions first.
I've stayed away from a printer-friendly page approach because I wanted to construct a page that is nearly identical to an existing WinForms printout, using my existing code. With PdfSharp I was able to convert all the code except the text area part (which used the RichTextBox and RTF in the WinForms version).
Tony,
I personally have used WebSupergoo's ABCPdf library with much success. You can actually render HTML directly to the PDF and it does fairly well in regards to accuracy.
Another option, which is free and which I have also used in the past with much success for writing HTML to PDF, is iTextSharp.
Otherwise, I think you'll have to write something to render HTML to GDI.
Either way, you may want to consider generating the PDF file in an HttpHandler that you map in your web.config. This will allow you to render the PDF to a byte stream and send it directly to the user (as opposed to having to save each PDF receipt on the web server). It will also allow you to use the .pdf extension in the URL that returns the receipt (PurchaseReceipt.pdf could be mapped to an HttpHandler), making it more cross-browser friendly. Older versions of Adobe Reader and older browsers will not display the document correctly if you start sending a PDF byte stream from an ASPX page.
Hope this helps.
Is there an application out there that generates an HTML form from a PDF file? It would also need to generate the HTML form in such a way that it can submit to the PDF and fill out the fields inside it.
PDF is not so good at validation and it's just a kluge interface to begin with.
Not sure of your full requirements but docudesk does this conversion as well as a number of others.
Adobe also have an online conversion tool that does this.