OCR of PDF files with images

I’ve got Tika working with Tesseract on PDF files, but it seems that if I give it a PDF file that has both searchable text and images, the text is OCRed twice. Is there a way to avoid this, even if it has to make two passes, one for the straight text and then another for just the images?

There are two important flags that Tika uses to extract text:
X-Tika-PDFextractInlineImages (true/false).
When false, all images are ignored. This works fine for native PDFs, because the text is extracted directly from the PDF.
When true, the images are also used for text extraction.
X-Tika-PDFocrStrategy: https://tika.apache.org/1.24/api/org/apache/tika/parser/pdf/PDFParserConfig.OCR_STRATEGY.html
NO_OCR - extracts the text without OCR; works for native PDFs
OCR_ONLY - only OCR is used, so the text of a native PDF is also sent to OCR
OCR_AND_TEXT_EXTRACTION - runs both NO_OCR and OCR_ONLY
So for a fully native PDF, the combination X-Tika-PDFextractInlineImages: false, X-Tika-PDFocrStrategy: NO_OCR seems to be the best.
For fully scanned PDFs, you can use X-Tika-PDFextractInlineImages: true, X-Tika-PDFocrStrategy: OCR_ONLY.
But your document is probably a hybrid: it contains native parts (where you only need to extract the text) and images (which you need to OCR). In my opinion, there is no way to handle a hybrid PDF in Tika.
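The two header combinations above could be sent to a running Tika server roughly as follows. This is a minimal sketch, not a definitive client: the server address http://localhost:9998/tika is an assumption about your setup, and the helper names tika_headers and build_request are hypothetical. Only the two X-Tika-* header names and their values come from the answer.

```python
import urllib.request

# Header combinations suggested in the answer, keyed by the kind of PDF.
STRATEGIES = {
    "native": {
        "X-Tika-PDFextractInlineImages": "false",
        "X-Tika-PDFocrStrategy": "NO_OCR",
    },
    "scanned": {
        "X-Tika-PDFextractInlineImages": "true",
        "X-Tika-PDFocrStrategy": "OCR_ONLY",
    },
}


def tika_headers(kind):
    """Return request headers for a 'native' or 'scanned' PDF (hypothetical helper)."""
    if kind not in STRATEGIES:
        raise ValueError(f"unknown PDF kind: {kind!r}")
    headers = dict(STRATEGIES[kind])
    headers["Accept"] = "text/plain"  # ask the server for plain text
    return headers


def build_request(pdf_bytes, kind, url="http://localhost:9998/tika"):
    """Build (but do not send) a PUT request; the URL is an assumed local server."""
    return urllib.request.Request(
        url, data=pdf_bytes, headers=tika_headers(kind), method="PUT"
    )


# To actually send it (requires a running server):
# with open("doc.pdf", "rb") as f:
#     request = build_request(f.read(), "native")
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```

Keeping the header choice in one table makes it easy to try each combination against the same file and compare the extracted text.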

Related

Paragraph alignment changes on some files Adobe Acrobat convert

I'm trying to extract text from PDFs by converting them to HTML using the Adobe Acrobat SDK and Python, as Acrobat is the only tool that reproduces the proper structure of the actual PDF. Some files are fine, but in others one or two paragraphs somehow end up out of place, even though the same paragraph looks perfect in the PDF. It would be great if someone could shed light on this.
My Python code to convert:
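from win32com.client import Dispatch  # provided by pywin32

src = 'location to pdf file'
AvDoc = Dispatch("AcroExch.AVDoc")
if AvDoc.Open(src, ""):
    pdDoc = AvDoc.GetPDDoc()
    jsObject = pdDoc.GetJSObject()
    # Save next to the source file, e.g. 20.pdf -> 20.pdf.html
    jsObject.SaveAs(src + ".html", "com.adobe.acrobat.html")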
Sample PDF file:
20.pdf
Respective HTML file:
20.pdf.html
It's not happening in all PDFs. If you think it might be caused by an empty signature widget, note that all the PDFs have them.
If you look at point '2.' in the HTML, it is completely collapsed and sits outside the 'ol' tag, which holds the other points in a proper structure.
Please help.

Ckeditor / WYSIWYG copy from pdf and retain style/images?

I searched a lot but couldn't get the answer.
I want to retain the styling of text copied from a PDF into a WYSIWYG editor (CKEditor).
I can retain the style when copying from Word files, but it does not work the same way when copying from a PDF.
The original PDF looks like this (I can't post images as my reputation is < 10, so please refer to the links):
PDF text
It shows following output after copy paste:
After copy paste in WYSIWYG editor
Please suggest plugin or code snippet for PDF to RTF conversion.
Thanks
CKEditor can paste only the data it gets from the browser. This means that if the browser does not provide more data than the plain text, there is nothing CKEditor can do.
Since version 4.5, CKEditor provides a facade for the Clipboard API that gives you all the pasted data directly in the paste event. Every browser provides different data, and you can easily check what you get:
editor.on( 'paste', function( evt ) {
    var types = evt.data.dataTransfer.$.types;
    console.log( types );
    for ( var i = 0; i < types.length; i++ ) {
        console.log( evt.data.dataTransfer.getData( types[ i ] ) );
    }
    // Additionally you can get information about pasted files.
    console.log( evt.data.dataTransfer.getFilesCount() );
} );
Note that Internet Explorer does not provide the types array and supports only the Text and URL types.
To learn more about Clipboard Integration, see this guide, especially the "Handling Various Data Types with Clipboard API" chapter, which describes how to integrate a data converter with the paste event. If the PDF data is available in any browser, you can use it during pasting.
If it's a common case in your system then imho the best thing you can do is to allow users to upload the PDF file, run server side software to transform PDF into HTML and then automatically insert it into CKEditor.
I have no recommendations though on which application to use.
The problem is that PDF files work differently from other text documents, so even if you paste their contents into a native word processor you won't get the same formatting.
This will vary depending on your PDF reader, but it's usual that images aren't pasted, tables are converted to plain text lines, etc...
If that happens in a native program that has full access to the clipboard, you can't expect anything better in a JavaScript application that depends on the data the browser provides. Even after that, you have to be careful with CKEditor: by default it includes filters that remove any formatting it doesn't recognize, so even more information can be lost at this last step.

Change the format of data a mobile client receives to display text strings

In a smartphone app I receive HTML text from a server, which I have to parse using regexes because I can't display it as HTML (I can't use a webview). The regexes are expensive (there are many of them), and because the original text is inserted by users who cut and paste from any source (PDF, RTF, etc.), the results are not always as good as on the website. So I want to suggest to my boss that we change the format the mobile client receives, so that I don't have to parse HTML. The question is: what could this format be?

Text heavy iOS App. Store text in HTML, Plist, or Other?

I'm writing a relatively complex iOS app that is very text heavy.
The text is also heavily formatted. It has lots of color, size, font, and spacing changes, as well bulleted lists and other text features you'd expect to see in a very rich website.
The text is displayed on about 40 different views. Some of which display a lot of text, others a little. There is no one template that all the pages follow. (There are some that are similar, but that's not the point.)
Lastly, the text is constantly being changed and updated by an editorial team during development, not so much after release. The text has to be stored on the device, downloading files is not an option.
My question is, what is the best way to store and then render all this text in an iOS App?
My approach
Store all the text content and formatting info in an html file and use
[[NSAttributedString alloc] initWithFileURL:htmlDoc
                                    options:@{NSDocumentTypeDocumentAttribute: NSHTMLTextDocumentType}
                         documentAttributes:&attrDict
                                      error:&error];
to create an NSAttributedString and use that to populate UITextViews.*
*Note: I would do some more work before creating the UITextViews. First I would parse it to find the appropriate page number [[Page:1.3]] and then parse the elements in that section [[header]], [[side_scroller]], etc...
I like this approach for two main reasons:
It creates a separate copy document that contains all the text and formatting info.
I'm the only iOS developer, but we have a couple of front-end developers. So when we get slammed with changes that need to be done in 3.45 minutes, I could have some of the guys help me make the changes without having to know all the nuances of UIFont and related classes. Occasionally, the editors could even make the changes themselves :)
Minor reasons for liking this approach:
The text can vary so much per page, that creating a new UIFont + Plist entry to store the formatting info seems like a bigger pain than having everything in a .html document. (I could be wrong about this.)
Project managers will inevitably say: "Make this word a little bigger," "This word looks strange, add italics," and "Make everything purple!" HTML/CSS seems like a more flexible solution for quickly implementing these requests.
Downsides of this approach:
NSAttributedString picks up 99% of the HTML attributes I threw at it. It did not pick up bullet spacing changes in unordered lists <ul>.
Plists are more performant.
Here are some other approaches I considered:
Plist + UIFont
RTF Document - Originally started with this, but found it hid a lot of what was going on and NSAttributedString wouldn't pick up some of the changes.
XML
Any advice or input would be very appreciated.
Notes:
iPad app,
iOS 7,
No Internet Connectivity,
Xcode 5
What I did to store styled text in an iOS app was to write a Mac OS command line tool that opens RTF files and converts them to attributed strings (It's a 1-line call in Mac OS, but not supported in iOS for some reason.) I then use NSCoding to save the attributed strings as binary data, with a special .DATA filetype.
I created a custom UITextView category with a method that knows how to load the text view's attributed text from my custom filetype.
I created a build rule in my project that treats RTF files as source files in a build step and the .DATA filetype as the output, and copies the .DATA files into the build project.
Now, all I have to do is add an RTF file to my project, and the build process inserts the .DATA version of the styled text into the executable.
The Xcode editor knows how to edit RTF files, so you can edit them right in place in the IDE, OR you can edit them in TextEdit or any editor that supports RTF files.
There are a few things you can put in an RTF that aren't supported in UITextViews. (I don't remember what those are offhand. Sorry.)
I find styled WYSIWYG text much easier to deal with than HTML. You just edit the text, and the build process picks up the changes.
It worked beautifully. Plus, binary NSCoding output is a whole lot more compact than HTML.
I would recommend using a web view. It can open files in the resource bundle.
You can disable all the links in the HTML by implementing the delegate method shouldStartLoadWithRequest to return NO.
You might also want to set dataDetectorTypes to UIDataDetectorTypeNone.
That will disable automatic link detection in the web view.

html idml viewer

I am trying to implement a c# idml to html converter. I've managed to produce a single flat html file similar to the one produced by the indesign export.
What I would like to do is to produce html that will be as similar as possible to the indesign view like an html idml viewer. To do this, I need to find the text that can fit into a textframe, I can extract the story text content but I can't really find a way to split this content into frames/pages.
Is there any way I can achieve that?
Just extracting the text from a story isn't enough. The way the text is laid out is controlled by TextFrames in the Spread documents. Each TextFrame has a ParentStory attribute, showing which story it loads text from, and each frame has dimensions which determine the layout. For unthreaded text frames (i.e. one story <> one frame), that's all you need.
For threaded frames, you need to use the PreviousTextFrame and NextTextFrame attributes to create the chain. There is nothing in the IDML to tell you how much text fits in each frame of a threaded chain; you need to do the calculation yourself based on the calculated text dimensions (or use brute-force trial and error).
You can find the spreads in the main designmap.xml:
<idPkg:Spread src="Spreads/Spread_udd.xml" />
And the spread will contain one or more TextFrame nodes:
<Spread Self="udd" ...>
    <TextFrame Self="uf7" ParentStory="ue5" PreviousTextFrame="n" NextTextFrame="n" ContentType="TextType">...</TextFrame>
    ...
</Spread>
Which will in turn link to a specific story:
<Story Self="ue5" AppliedTOCStyle="n" TrackChanges="false" StoryTitle="$ID/" AppliedNamedGrid="n">...</Story>
(In this example the frames are not threaded, hence the 'n' values.)
All this is in the IDML documentation, which you can find with the other InDesign developer docs here: http://www.adobe.com/devnet/indesign/documentation.html
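Building the chain from PreviousTextFrame and NextTextFrame could look roughly like the sketch below. It is written in Python rather than C# purely for illustration. The sample spread, its frame ids and the frame_chains helper are made up, and the sketch assumes all frames of a chain live in the same spread, which real documents do not guarantee. It only orders the frames; working out how much text fits in each one is still up to you.

```python
import xml.etree.ElementTree as ET

# Made-up spread for illustration: two threaded frames and one unthreaded frame.
SAMPLE_SPREAD = """
<Spread Self="udd">
  <TextFrame Self="uf8" ParentStory="ue5" PreviousTextFrame="uf7" NextTextFrame="n"/>
  <TextFrame Self="uf7" ParentStory="ue5" PreviousTextFrame="n" NextTextFrame="uf8"/>
  <TextFrame Self="ufa" ParentStory="ue9" PreviousTextFrame="n" NextTextFrame="n"/>
</Spread>
"""


def frame_chains(spread_xml):
    """Map each ParentStory id to its frame ids in threading order."""
    root = ET.fromstring(spread_xml)
    frames = {f.get("Self"): f for f in root.iter("TextFrame")}
    chains = {}
    for frame_id, frame in frames.items():
        # A chain starts at the frame that has no previous frame ("n").
        if frame.get("PreviousTextFrame") != "n":
            continue
        chain, current, seen = [], frame_id, set()
        # Follow NextTextFrame until "n" (or an unknown id) ends the chain.
        while current in frames and current not in seen:
            seen.add(current)
            chain.append(current)
            current = frames[current].get("NextTextFrame")
        chains[frame.get("ParentStory")] = chain
    return chains


print(frame_chains(SAMPLE_SPREAD))
# prints {'ue5': ['uf7', 'uf8'], 'ue9': ['ufa']}
```

With the frames in order, you can pour the story text into the first frame, measure what fits, and carry the remainder over to the next frame in the chain.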
Microsoft and Adobe have proposed a new CSS module named Regions, which allows you to flow text into multiple containers. Keep in mind that you will never be able to create an HTML page that looks exactly like an InDesign document.
http://www.w3.org/TR/css3-regions/
For now only IE10 and webkit nightly support it: http://caniuse.com/#feat=css-regions