How could I extract text and Images from DOCX using tika version 2.4.1?

With Tika 1.27, I used AutoDetector and AutoParser to extract text and images from DOCX files. With Tika 2.4.1, however, AutoDetector detects the DOCX file as "application/zip", and the parser used is PackageParser. How can I correctly extract text and images from DOCX?

Related

OCR of PDF files with images

I’ve got Tika working with Tesseract on PDF files, but it seems that if I give it a PDF file that has both searchable text and images, the text is OCRed twice. Is there a way to avoid this, even if it has to make two passes: one for the straight text and then another for just the images?

Tika uses two important flags when extracting text:
X-Tika-PDFextractInlineImages (true/false).
When false, all images are ignored. This works well for native PDFs, because the text is extracted directly from the PDF.
When true, the images are also used for text extraction.
X-Tika-PDFocrStrategy: https://tika.apache.org/1.24/api/org/apache/tika/parser/pdf/PDFParserConfig.OCR_STRATEGY.html
NO_OCR - extracts the text without OCR; works for native PDFs.
OCR_ONLY - only OCR is used, so the text of a native PDF is also sent to OCR.
OCR_AND_TEXT_EXTRACTION - runs both NO_OCR and OCR_ONLY.
So for a fully native PDF, the combination X-Tika-PDFextractInlineImages: false, X-Tika-PDFocrStrategy: NO_OCR seems to be the best.
For fully scanned PDFs, you can use X-Tika-PDFextractInlineImages: true, X-Tika-PDFocrStrategy: OCR_ONLY.
But your document is probably a hybrid: it contains native parts (where you only need to extract the text) and images (which need OCR). In my opinion, there is no way to handle a hybrid PDF in Tika.
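The two header combinations above can be sketched as a small Python client. This is only an illustration: it assumes a tika-server instance listening on localhost:9998 and accepting PUT requests on /tika, and the function names tika_pdf_headers and extract_text are made up for this example. http.client is used because it sends header names exactly as written.

```python
import http.client


def tika_pdf_headers(scanned: bool) -> dict:
    """Return the header combination suggested above for scanned vs. native PDFs."""
    if scanned:
        # Fully scanned PDF: use the images and rely on OCR only.
        return {
            "X-Tika-PDFextractInlineImages": "true",
            "X-Tika-PDFocrStrategy": "OCR_ONLY",
        }
    # Fully native PDF: ignore images and skip OCR.
    return {
        "X-Tika-PDFextractInlineImages": "false",
        "X-Tika-PDFocrStrategy": "NO_OCR",
    }


def extract_text(pdf_bytes: bytes, scanned: bool,
                 host: str = "localhost", port: int = 9998) -> str:
    """Send a PDF to the assumed tika-server and return the extracted plain text."""
    headers = tika_pdf_headers(scanned)
    headers["Content-Type"] = "application/pdf"
    headers["Accept"] = "text/plain"
    conn = http.client.HTTPConnection(host, port, timeout=300)
    try:
        conn.request("PUT", "/tika", body=pdf_bytes, headers=headers)
        return conn.getresponse().read().decode("utf-8")
    finally:
        conn.close()
```

For example, extract_text(open("native.pdf", "rb").read(), scanned=False) would request text extraction without OCR, where native.pdf is a placeholder file name.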

Can one extract images from pandoc's self-contained HTML files?

I have used pandoc with the option --self-contained to create HTML documents where images are embedded in the HTML code as base64.
The image is included in the IMG tag like this (where I have replaced the long string of base64 characters with a placeholder):
<IMG src="data:image/png;base64,<<base64-coded characters here>>" width="672">
Now, I'd like to extract such images, i.e. do the reverse where base64-coded data are replaced by references to files and the data converted to ordinary PNG or JPEG files that are saved on disk.
I was hoping to use pandoc to do this conversion, but I could not find an option for this in pandoc, nor have I found any other software that does it. Ideally, the solution should be shell/script-type that can easily be included in a longer toolchain.

You can use pandoc with the --extract-media option. The images will be written to the supplied directory and the base64 URLs will be replaced with references to those files.
E.g.
pandoc --from=html YOUR_FILE.html --extract-media=images
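If pandoc is not available in the toolchain, the same idea can be approximated with a short script. The sketch below is not pandoc's implementation; it is a minimal standard-library Python version that assumes the images are PNG, JPEG, or GIF data URIs, and the function name extract_images and the image1, image2, ... file naming are invented for this example.

```python
import base64
import re
from pathlib import Path

# Matches data URIs such as data:image/png;base64,iVBORw0...
DATA_URI = re.compile(r"data:image/(png|jpeg|gif);base64,([A-Za-z0-9+/=]+)")


def extract_images(html: str, out_dir: str) -> str:
    """Write each embedded image to out_dir and return HTML that references the files."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    counter = 0

    def save(match: re.Match) -> str:
        nonlocal counter
        counter += 1
        ext = "jpg" if match.group(1) == "jpeg" else match.group(1)
        path = out / f"image{counter}.{ext}"
        # Decode the base64 payload and store it as an ordinary image file.
        path.write_bytes(base64.b64decode(match.group(2)))
        # The data URI is replaced by the path of the file just written.
        return path.as_posix()

    return DATA_URI.sub(save, html)
```

Typical use would be to read the HTML file, call extract_images(html, "images"), and write the returned string back to a new HTML file.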

rtf (rich text format) base64 data display on angular 6 html

I have the base64 encoding of an RTF file and I want to display it in my HTML.
My Angular ts file contains the code below:
this.display = this._sanitizer.bypassSecurityTrustResourceUrl('data:application/rtf;base64,' + 'mybase64code');
and my HTML is:
<object [data]="display"></object>
When I run the app, it shows "this Plug-in not supported". Please help me understand what is wrong with this approach.

Markup font-style (italic) in tesseract OCR

I have tesseract-ocr v3.02.02 installed on Windows 7 and have used it via the command line:
1) Output png text to a text file: tesseract image.png txtfile
2) Output png text to an HTML file: tesseract image.png htmlfile hocr
I need it to mark up any italic text in the output text or HTML file. How do I do this (preferably on the command line, since I have never used it in API mode)?

The hocr output by Tesseract includes only the word coordinates and confidence values, not font-related information. As such, you will need to modify the source code to output what you want for the command-line mode, or use its API.

html to pdf from bash with unicode (UTF-8) support

Is there a good way to convert html to PDF from bash with Unicode (UTF-8) support?
I would expect the same result as if I were to use a PDF printer and print a page from Firefox.
Usage Example:
curl http://www.wikipedia.org/ | html2pdf_bash_command > /tmp/wikipedia.org.pdf

You can do something with xvfb and a browser, or use wkhtmltopdf, a small Qt-based tool. If you have a full GNOME installation in your environment, you can also use gnome-web-print.
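As a rough illustration of the wkhtmltopdf route, the sketch below pipes HTML through the tool from Python. It assumes wkhtmltopdf is installed and on the PATH, and that it accepts "-" as both the input and the output argument to read from stdin and write to stdout; the helper names wkhtmltopdf_command and html_to_pdf are made up for this example.

```python
import subprocess


def wkhtmltopdf_command(encoding: str = "utf-8") -> list:
    """Build the command line: the first "-" reads HTML from stdin, the second writes the PDF to stdout."""
    return ["wkhtmltopdf", "--quiet", "--encoding", encoding, "-", "-"]


def html_to_pdf(html: str) -> bytes:
    """Convert an HTML string to PDF bytes by piping it through wkhtmltopdf."""
    result = subprocess.run(
        wkhtmltopdf_command(),
        input=html.encode("utf-8"),
        stdout=subprocess.PIPE,
        check=True,  # raise CalledProcessError if wkhtmltopdf fails
    )
    return result.stdout
```

The equivalent shell pipeline under the same assumptions would be curl piped into wkhtmltopdf with the two "-" arguments, redirected to the target PDF file.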
