Paragraph alignment changes in some files when Adobe Acrobat converts PDF to HTML

I'm trying to extract text from PDFs by converting them to HTML with the Adobe Acrobat SDK and Python, since Acrobat is the only tool that preserves the proper structure of the actual PDF. Some files are okay, but in some files one or two paragraphs come out broken, even though the exact same paragraphs look perfect in the PDF. It would be great if someone could shed light on this.
My Python code to convert:
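from win32com.client import Dispatch  # pywin32

src = 'location to pdf file'
filename = src  # assumed: output goes next to the source (20.pdf -> 20.pdf.html)
AvDoc = Dispatch("AcroExch.AVDoc")
if AvDoc.Open(src, ""):
    pdDoc = AvDoc.GetPDDoc()
    jsObject = pdDoc.GetJSObject()
    # Export through Acrobat's JavaScript bridge using the HTML conversion ID
    jsObject.SaveAs(filename + ".html", "com.adobe.acrobat.html")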
Sample PDF file:
20.pdf
Respective HTML file:
20.pdf.html
It doesn't happen in all PDFs. If you think it might be caused by an empty signature widget: all of the PDFs have one.
If you look at point '2.' in the HTML, it is completely collapsed and sits outside the 'ol' tag that contains the other points in a perfect structure.
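To find which converted files are affected, something like the following could flag numbered points that ended up outside any 'ol'. This is only a rough sketch: it assumes the stray point is emitted as a 'p' element whose text starts with a number and a dot, which is a guess about Acrobat's output, and the names StrayPointFinder and find_stray_points are made up for illustration.

```python
import re
from html.parser import HTMLParser


class StrayPointFinder(HTMLParser):
    """Collect <p> text that looks like a numbered point but is outside any <ol>."""

    def __init__(self):
        super().__init__()
        self.ol_depth = 0   # how many <ol> elements are currently open
        self.in_p = False   # True while inside a <p> that started outside <ol>
        self.buf = []
        self.stray = []

    def handle_starttag(self, tag, attrs):
        if tag == "ol":
            self.ol_depth += 1
        elif tag == "p" and self.ol_depth == 0:
            self.in_p = True
            self.buf = []

    def handle_endtag(self, tag):
        if tag == "ol" and self.ol_depth > 0:
            self.ol_depth -= 1
        elif tag == "p" and self.in_p:
            # Normalize whitespace, then test for a leading "N. " marker
            # (assumption: stray points keep their number in the text).
            text = " ".join("".join(self.buf).split())
            if re.match(r"\d+\.\s", text):
                self.stray.append(text)
            self.in_p = False

    def handle_data(self, data):
        if self.in_p:
            self.buf.append(data)


def find_stray_points(html):
    finder = StrayPointFinder()
    finder.feed(html)
    return finder.stray
```

Running find_stray_points over each generated .html file would at least show which PDFs need a closer look.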
Please help.

Related

OCR of PDF files with images

I’ve got Tika working with Tesseract on PDF files, but it seems that if I give it a PDF file that has both searchable text and images, the text is OCRed twice. Is there a way to avoid this? Even if it has to make two passes, one for the straight text and then another for just the images.
There are two important flags that Tika uses to extract text:
X-Tika-PDFextractInlineImages (true/false).
When false, all images are ignored. This works fine for native PDFs: the text is extracted from the native PDF.
When true, the images are also used for text extraction.
X-Tika-PDFocrStrategy: https://tika.apache.org/1.24/api/org/apache/tika/parser/pdf/PDFParserConfig.OCR_STRATEGY.html
NO_OCR - extract the text without OCR; works for native PDFs
OCR_ONLY - only OCR is used, so the text from a "native PDF" is also sent to OCR
OCR_AND_TEXT_EXTRACTION - runs both NO_OCR and OCR_ONLY
So when you have a fully native PDF, the combination X-Tika-PDFextractInlineImages: false, X-Tika-PDFocrStrategy: NO_OCR seems to be the best.
For fully scanned PDFs you can use X-Tika-PDFextractInlineImages: true, X-Tika-PDFocrStrategy: OCR_ONLY.
But your document is probably a hybrid: it contains native parts (where you only need to extract the text) and images (which you need to OCR). In my opinion there is no way to handle a hybrid PDF in Tika.
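For the two non-hybrid cases, the header combinations above could be sent to a Tika server roughly like this. It is a minimal sketch, assuming a Tika server listening on the default http://localhost:9998/tika endpoint; the helper names tika_headers and build_request are made up here, and the value spelling (NO_OCR / OCR_ONLY) is copied from the list above, so check what your Tika version accepts.

```python
import urllib.request


def tika_headers(scanned):
    """Return the two PDF flags for a fully scanned or a fully native PDF."""
    if scanned:
        return {
            "X-Tika-PDFextractInlineImages": "true",
            "X-Tika-PDFocrStrategy": "OCR_ONLY",
        }
    return {
        "X-Tika-PDFextractInlineImages": "false",
        "X-Tika-PDFocrStrategy": "NO_OCR",
    }


def build_request(pdf_bytes, scanned, url="http://localhost:9998/tika"):
    """Build (but do not send) a PUT request carrying the PDF and the flags."""
    headers = tika_headers(scanned)
    headers["Accept"] = "text/plain"  # ask for plain-text output
    return urllib.request.Request(url, data=pdf_bytes, headers=headers, method="PUT")


# Usage (needs a running Tika server; "doc.pdf" is a placeholder):
# with open("doc.pdf", "rb") as f:
#     req = build_request(f.read(), scanned=False)
# text = urllib.request.urlopen(req).read().decode("utf-8")
```

This does not solve the hybrid case; it only makes it easy to switch between the two combinations per file.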

Ckeditor / WYSIWYG copy from pdf and retain style/images?

I searched a lot but couldn't get the answer.
I want to retain the styling of text copied from a PDF into a WYSIWYG editor (CKEditor).
I can retain the style when copying from Word files, but it does not work the same way when copying from a PDF.
The original PDF looks like this (I can't post images as my reputation is < 10, please refer to the links):
PDF text
It shows the following output after copy and paste:
After copy paste in WYSIWYG editor
Please suggest a plugin or code snippet for PDF to RTF conversion.
Thanks
CKEditor can only paste the data it gets from the browser. This means that if the browser does not provide more than plain text, there is nothing CKEditor can do.
Since version 4.5, CKEditor provides a facade for the Clipboard API, so you can get all the pasted data directly in the paste event. Every browser provides different data, and you can easily check what is available:
editor.on( 'paste', function( evt ) {
    var types = evt.data.dataTransfer.$.types;
    console.log( types );
    for ( var i = 0; i < types.length; i++ ) {
        console.log( evt.data.dataTransfer.getData( types[ i ] ) );
    }
    // Additionally you can get information about pasted files.
    console.log( evt.data.dataTransfer.getFilesCount() );
} );
Note that Internet Explorer does not provide the types array and supports only the Text and URL types.
To learn more about Clipboard Integration see this guide, especially the "Handling Various Data Types with Clipboard API" chapter, which describes how to integrate a data converter with the paste event. If the PDF data are available in any browser, you can use them during pasting.
If this is a common case in your system then, imho, the best thing you can do is to let users upload the PDF file, run server-side software to transform the PDF into HTML, and then automatically insert the result into CKEditor.
I have no recommendations though on which application to use.
The problem is that PDF files work in a different way than other text documents, so even if you paste their contents into a native word processor you won't get the same formatting.
This will vary depending on your PDF reader, but usually images aren't pasted, tables are converted to plain text lines, etc.
If that happens in a native program that has full access to the clipboard, you can't expect anything better in a JavaScript application that depends on the data the browser provides. Even after that you have to be careful with CKEditor, because by default it includes filters that remove any formatting it doesn't recognize, so even more information can be lost at this last step.

How can I import html content to pdf template?

I created a PDF template with OpenOffice Draw. It has text boxes and I can set their values through AcroFields, but I can't import HTML content into the template.
I can convert HTML content to a PDF file, but how can I do it for a template?
My problem is with the template; my HTML content also has to be positioned on the page, for example at the center of the page.
Thanks
I am not quite sure if I understand your question, but it seems like you need some kind of template where you will enter your content.
My thinking goes to OpenXML as the best fit. But since it is rather complex you can save some time by using third party tools.
From my experience, Docentric gives you good value for the money. You can prepare a template in Word and then merge it with data from any source that can fit into a .NET object. Your document can be converted to PDF or XPS if required.
Templates are created in MS Word (2007 or newer) using the special Docentric Add-in for template generation. All MS Word formatting can be applied here. Placeholders for data are set where the data will appear at runtime.
The process is straightforward, so even end users can design reports. Developers can then focus on bringing data in from various sources (database, XML). Check the product documentation for ideas on how to use it.

HTML file to screenshot as .jpg or other image

This has nothing to do with rendering an individual image on a webpage. The goal is to render the entire webpage and save that as a screenshot, because I want to show the user a thumbnail of an HTML file. The HTML file I will be screenshotting is an HTML part in a MIME email message. Ideally I would like to snapshot the entire MIME file, but if I can do this for an HTML file I'll be in good shape.
An API would be ideal but an executable is also good.
You need html2ps, and convert from the ImageMagick package:
html2ps index.html > index.ps
convert index.ps index.png
The second program produces one PNG per page for a long HTML page; the page layout is done by html2ps.
I found a program evince-thumbnailer, which was reported as:
apropos postscript | grep -i png
evince-thumbnailer (1) - create png thumbnails from PostScript and PDF documents
but it didn't work on a simple first test.
If you would like to combine multiple pages into one larger image, convert will surely help you.
Now I see that convert operates on HTML directly, so
convert index.html index.png
should work too. I don't see a difference in the output, and the size of the images is nearly identical.
If you have a multipart MIME email, you typically have a mail header, maybe some pre-HTML text, the HTML, and maybe attachments.
You can extract the HTML and format it separately, but rendering it embedded in the mail might not be that easy.
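If you would rather not cut the HTML out with text tools, Python's standard email package can pull the HTML part out of a MIME message. This is a minimal sketch, assuming you already have the raw bytes of a single message; extract_html_part is a name made up for this example.

```python
from email import message_from_bytes, policy


def extract_html_part(raw_message):
    """Return the HTML body of a MIME message as a string, or None if there is none."""
    # policy.default gives an EmailMessage, which provides get_body()
    msg = message_from_bytes(raw_message, policy=policy.default)
    body = msg.get_body(preferencelist=("html",))
    return body.get_content() if body is not None else None


# Usage ("sample.eml" is a placeholder file name):
# with open("sample.eml", "rb") as f:
#     html = extract_html_part(f.read())
```

The returned string can then be written to a file and handed to whichever renderer you choose.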
Here is a file I tested with. The mail was from Apr. 14, so the first sed extracts that one mail from the mail folder, and the second sed extracts just the HTML part of it:
sed -n "/From - Sat Apr 14/,/From -/p" /home/stefan/.mozilla-thunderbird/k2jbztqu.default/Mail/Local\ Folders-1/Archives.sbd/sample | \
sed -n '/<html>/,/<\/html>/p' | wkhtmltopdf - - > sample.pdf
wkhtmltopdf needs - - to read from stdin and write to stdout. The PDF is rendered, but I don't know how to integrate it into your workflow.
You can replace wkhtml ... with
convert - sample.jpg
I'm going with wkhtmltoimage. It worked once xvfb was set up correctly. The PostScript suggestion did not render correctly, and we need an image, not a PDF.
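For anyone wiring this into a script, the call could look roughly like the sketch below. It assumes wkhtmltoimage and xvfb-run are installed and on the PATH; the helper names and file names are placeholders, not part of either tool.

```python
import subprocess


def wkhtmltoimage_command(html_path, image_path, use_xvfb=True):
    """Build the command line; xvfb-run -a picks a free virtual display number."""
    cmd = ["wkhtmltoimage", html_path, image_path]
    if use_xvfb:
        cmd = ["xvfb-run", "-a"] + cmd
    return cmd


def render(html_path, image_path, use_xvfb=True):
    """Run the conversion and raise CalledProcessError if it fails."""
    subprocess.run(wkhtmltoimage_command(html_path, image_path, use_xvfb), check=True)


# Usage (placeholder file names):
# render("mail_part.html", "mail_part.jpg")
```

Keeping the command builder separate from the call makes it easy to drop xvfb-run on machines that have a real display.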

Paste Image from Microsoft Office to AIR Application?

I'm in a situation where I need to accept images copied from a Word (.doc / .docx) document into a spark image in an AIR application. I tried with a sample document with an image embedded inside. When I open it in Pages on a Mac, the copied image pastes perfectly onto the spark image object via the code below:
var clipboardImage:Bitmap = new Bitmap(Clipboard.generalClipboard.getData(ClipboardFormats.BITMAP_FORMAT) as BitmapData);
clipboardImage.width = fldPicture.width;
clipboardImage.height = fldPicture.height;
fldPicture.source = clipboardImage;
fldPicture is the spark image. That would have been fine, but when I sent the AIR application and the same Word document to a friend who runs Windows and has Microsoft Office 2010, it didn't work. It only seems to work if the image copied from the Word document is first pasted into MS Paint and then copied again from MS Paint.
Sorry if this seems rather confusing, I tried to explain it as much as I could. If anyone can shed some light on this issue, it would really be appreciated.
Mmh, I'm afraid it has to do with the way Word handles file formats.
Word uses a lot of headers and internal codes/tags that only it uses to recognize objects, text formats, images...
I suppose the clipboard content coming from Word has to be stripped of these headers before it can be used, which Paint does automatically (that could explain why it works when the image goes through Paint before being pasted into your app).
Maybe you could try to put the pasted data into a byte array and remove the headers manually before turning it into a Bitmap?