Is there a way to check which pdf strategy FSCrawler will use? - fscrawler

I am using FSCrawler's REST feature to scan PDFs as they are uploaded. I'm currently using the ocr_and_text pdf strategy; however, OCR takes too long for the user to wait for a response. I would like to send the PDF to FSCrawler synchronously for text extraction and, if that doesn't work, hand it to an asynchronous background task for OCR.
Is there a way to do this with FSCrawler? Or is there a way to have multiple pdf strategies?

You should try to change the pdf_strategy to auto.

Related

How can I force the user to only upload a xlsx, csv, or xls file that has two columns and has the same exact name for header as I need in frontend?

So far, I can only check if the user has uploaded an xlsx, xls, or csv file. How can I make sure the user is actually uploading a file with those extensions that also is required to have two columns, with 'Example', and 'Label' headers in the frontend, otherwise, keep asking them to upload the correct file?
<div class="form-group"><input type="file" name="annotatedsamplefile" id="annotatedsamplefile" accept=".xlsx, .xls, .csv" class=""></div>
It sounds like you need to upload the file to the back-end immediately, which would process the file and return whether the file is of the correct format. Then, if it is valid, that same response would have a way to identify the already uploaded file so it doesn't have to be uploaded again on a possible form submission.
I'll leave it to you to figure out how to accomplish this, since there are multiple ways to do it in jQuery, vanilla JavaScript, and probably Bootstrap. It also depends on your server-side language, as well as whether you are putting the file into a DB or just a file system.
There may be a client-side way to do this, but do you really want to have a potentially large library loaded when your site loads? Also, relying on client-side information isn't generally a good idea when talking about business logic. Business logic should generally be done server-side, so it can be more reliable. It's much harder to alter a server-side response to user data than client-side results based on user data. I'm not saying server-side results are infallible, but rather that it can be easy to manipulate client-side results, even remotely. And even if you did validate on the client-side, you'd still want to validate the file on the server-side anyway, so why not avoid duplication of effort and just do it server-side?
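As a rough illustration of the server-side check, here is a minimal Python sketch. It assumes the upload arrives as raw bytes and only handles the .csv case; the function name validate_upload and its return shape are made up for this example, and .xlsx/.xls files would need a spreadsheet library on top of this.

```python
import csv
import io

# Header names taken from the question; everything else here is illustrative.
REQUIRED_HEADERS = ["Example", "Label"]


def validate_upload(filename, data):
    """Return (ok, message) for an uploaded CSV file given as bytes."""
    if not filename.lower().endswith(".csv"):
        return False, "only .csv is handled in this sketch"
    try:
        # utf-8-sig tolerates the byte order mark that Excel often writes.
        text = data.decode("utf-8-sig")
    except UnicodeDecodeError:
        return False, "file is not valid UTF-8 text"

    rows = csv.reader(io.StringIO(text))
    header = next(rows, None)
    if header is None:
        return False, "file is empty"
    header = [name.strip() for name in header]
    if header != REQUIRED_HEADERS:
        return False, "expected headers %s, got %s" % (REQUIRED_HEADERS, header)

    for line_no, row in enumerate(rows, start=2):
        if not row:
            continue  # ignore completely blank lines
        if len(row) != 2:
            return False, "row %d has %d columns, expected 2" % (line_no, len(row))
    return True, "ok"
```

The front end can then keep showing the upload prompt until the response reports ok as true.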

Alfresco simple OCR. Extract text from PDF file and use it to start workflow

I'm using alfresco-simple-ocr with pdfsandwich and Tesseract OCR. I want to get the text from a document inserted into a folder and then use the text and the PDF file in a new workflow. I've managed to do the OCR extraction and to start a workflow with a file inserted into the catalogue, but I can't get the text from the file and use it in the workflow. Is it possible to do this? Where can I start implementing that function? Greetings, Rafał
You don't need any extension for that. Alfresco already integrates PDFBox, which will do that for you. After that, it depends on your PDF: whether it contains images (scanned documents) or already contains text.
If you want to OCR some images, you have as well this module:
https://github.com/bchevallereau/alfresco-tesseract
When you know what you want to transform, you can look at this page where you have a javascript sample on how to call transformers:
http://docs.alfresco.com/5.2/references/dev-extension-points-content-transformer.html
You can do that as well in Java if you need.

Get Json From Azure Storage Blob

I want to get the json file in Azure Storage Blob through the browser.
I used Stream Analytics, which writes a JSON file to the Blob container. Now I need to get the information inside the JSON file in order to show the IoT device status in real time.
I tried to use JSONP, but I don't know how to add the callback method to the JSON file without downloading it. Is there any way to add the callback method? Or is there another way to get the information inside the container?
For this particular scenario, I'd recommend PowerBI. Stream Analytics now has direct output to PowerBI, and you can pretty much customize the dashboard for your real-time IoT needs.
You can refer to this article for step by step Stream Analytics + PowerBI.
Coming back to your question, you need to download the blob to access the content. Stream Analytics to BLOB is usually for archiving or later predictive analysis scenarios.
If you still prefer not to use PowerBI, I'd either send the SA output to an event hub and read the data from there in real time, or save the data into a NoSQL DB like DocumentDB on Azure and read from there. I can recommend Highcharts if you want to use custom gauges etc. to visualize the data.
Hope this helps.

How can I create PDF output from rrdcgi?

I have created an rrdcgi script to display information about the system performance with graphs. Now I would like to add an option for users to create a PDF on the fly with the details of the current page (images and information) plus a header and footer. I also want the generated PDF files to be saved in some location so that they can be easily accessed next time. Is this possible with rrdcgi? Any Perl code would be really appreciated.
I need these options.
You need to consider what you want to put in the PDF: do you want an exact replica of the web page the user is viewing (somewhere between very hard and impossible without having the user's browser installed on your side and using its print output), or do you want the same information in a roughly similar layout?
An important issue is how you are generating the HTML: I did something similar once to generate PDF receipts for experiment participants (now, I just output HTML with print styles).
The HTML is generated using HTML::Template although Template.pm would be just as fine.
It is then trivial to write another template, one that generates a LaTeX document which can be processed using pdflatex. If you save the data at the time the snapshot is requested, you can add the snapshot to a queue that generates documents asynchronously so that requests do not tie up the web server.
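To make the "second template" idea concrete, here is a small Python sketch (the answer itself uses Perl's HTML::Template; Python's string.Template stands in for it here). The template text, the function names and the placeholder names title, timestamp and graph_file are all invented for illustration.

```python
from string import Template

# A LaTeX template filled from the same data that feeds the HTML page.
LATEX_TEMPLATE = Template(r"""\documentclass{article}
\usepackage{graphicx}
\begin{document}
\section*{$title}
Snapshot taken at $timestamp.

\includegraphics[width=\linewidth]{$graph_file}
\end{document}
""")

# Characters that have a special meaning in LaTeX and must be escaped in text.
_LATEX_SPECIALS = {
    "\\": r"\textbackslash{}",
    "&": r"\&", "%": r"\%", "$": r"\$", "#": r"\#",
    "_": r"\_", "{": r"\{", "}": r"\}",
    "~": r"\textasciitilde{}", "^": r"\textasciicircum{}",
}


def latex_escape(text):
    """Escape LaTeX special characters in user-visible text."""
    return "".join(_LATEX_SPECIALS.get(ch, ch) for ch in text)


def render_report(title, timestamp, graph_file):
    """Return LaTeX source for one snapshot; graph_file is used as a path, unescaped."""
    return LATEX_TEMPLATE.substitute(
        title=latex_escape(title),
        timestamp=latex_escape(timestamp),
        graph_file=graph_file,
    )
```

The returned string would be written to a .tex file and passed to pdflatex, ideally by a background worker rather than inside the web request.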
Update: Looking at rrdcgi, I now realize that it already does use a template. That is perfect: Instead of putting HTML in the template, put LaTeX code in the template and run rrdcgi with the --filter option to create a LaTeX source file which you can run through pdflatex. I guess the problem to solve there is to be able to use the exact same data that was used to generate the page the user is looking at.
If it is not possible to re-run rrdcgi with the exact same data, consider adding some JavaScript that submits the HTML source of the page the user is reviewing (or some JSON representation thereof) to a CGI script that parses the HTML and outputs LaTeX. Writing clean HTML in the original template and judicious use of class and id attributes would help there.
I do not have time to test any of these ideas right now, but I will take a look again within the next couple of days.
Is it worth the effort?
Why don't you add a FAQ explaining how to set up a PDF printer on Windows/Mac/Linux and provide a 'clean' page that can then be printed?
Since you apparently have to create the PDF, take a look at this (what-is-the-best-perl-module-to-use-for-creating-a-pdf-from-scratch) post here on SO.
There is also this post, that could combine the 'clean' HTML page and a server-side print.
Regarding the LaTeX route, if you have rrdcgi generate graphs in PDF format, pdflatex will be able to integrate them directly into the document, producing a high-quality PDF with graphs ... very slick. Sorry, no code.

How to extract data from a PDF?

My company receives data from an external company via Excel. We export this into SQL Server to run reports on the data. They are now changing to PDF format, is there a way to reliably port the data from the PDF and insert it into our SQL Server 2008 database?
Would this require writing an app or is there an automated way of doing this?
As already mentioned - you will have to write an app to do this, but ideally you would be able to get the raw data from the external company rather than having to process the PDF.
However, if you do want to extract the data from the PDF, I've used iText and found it to be very powerful, reliable and most importantly - free. It comes in Java and .NET flavours - iTextSharp is the .NET version. It allows you to programmatically manipulate PDF documents and it will expose the contents of the PDF to the application that you write.
It all depends on how they've included the data within the PDF. Generally speaking, there are two possible scenarios here:
1. The data is just a text object within a PDF. You'll need to use a tool to extract the text from the PDF then insert it into your database.
2. The data is contained within form fields in a PDF. You'll need to use a tool to extract data from the form fields and insert it into your database.
Hopefully scenario #2 applies to you because this is precisely what PDF forms are designed for. Scenario #1 is really just a hack that you'd only use if you didn't have any other options. Extracting plain text from a PDF isn't as easy or accurate as you might expect.
If you're receiving a PDF form then all you need to do is match up the right fields in the PDF form with the corresponding fields in your database and then suck in the data. This process could be entirely automated if you wrote your own application.
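The field-matching step could look roughly like the Python sketch below. It assumes the form fields have already been read into a plain dictionary by whatever PDF library you choose, and it uses sqlite3 from the standard library as a stand-in for SQL Server; the table name invoices and the field and column names are hypothetical.

```python
import sqlite3

# Hypothetical mapping from PDF form field names to database columns.
FIELD_TO_COLUMN = {
    "CustomerName": "customer_name",
    "InvoiceTotal": "invoice_total",
}


def insert_form_data(conn, fields):
    """Insert one row built from a dict of PDF form field values."""
    missing = [name for name in FIELD_TO_COLUMN if name not in fields]
    if missing:
        raise ValueError("PDF form is missing fields: %s" % ", ".join(missing))

    columns = list(FIELD_TO_COLUMN.values())
    values = [fields[name] for name in FIELD_TO_COLUMN]
    # Column names come from our own mapping; values go through placeholders.
    sql = "INSERT INTO invoices (%s) VALUES (%s)" % (
        ", ".join(columns),
        ", ".join("?" for _ in columns),
    )
    conn.execute(sql, values)
    conn.commit()
```

Fields in the PDF that are not in the mapping are ignored, and a form that lacks a mapped field is rejected before anything is written.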
"Would this require writing an app or is there an automated way of doing this?"
Yes, both of these options would require writing an app or buying an app. If you write your own app then you'll need to find a third-party PDF library that supports retrieving data from form fields or extracting text from a PDF.
Disclaimer: I am affiliated with the makers of ByteScout PDF Extractor SDK tool
Just wanted to share some additional real-life scenarios for text data extraction from PDF:
Scanned image with no searchable text: should be processed by OCR engine (like free Tesseract from Google)
XFA forms: it is the subset of PDF which is supported mostly by Adobe tools. But the data can be extracted as XML data with low level PDF processing tools like iTextSharp or similar tools.
ZUGFeRD PDF files which are just PDF documents with the copy of a form data attached as XML file (which can be extracted with tools like this)
Text incorrectly encoded by some PDF generators (can be restored via OCR engine with some acceptable error rate though).
Using iTextSharp, do the following:
using System;
using System.Configuration;
using System.Data.SqlClient;
using System.IO;
using System.Text;
using iTextSharp.text.pdf;

protected void BtnSubmit_Click(object sender, EventArgs e)
{
    // Placeholder: replace with the real path to the PDF.
    String FilePath = @"GetFilePath";
    StringBuilder sb = new StringBuilder();
    PdfReader reader = new PdfReader(FilePath);
    // The stamper writes a temporary copy next to the original file.
    PdfStamper myStamp = new PdfStamper(reader, new FileStream(FilePath + "_TMP", FileMode.Create));
    AcroFields form = myStamp.AcroFields;
    // Placeholder: replace with the name of the form field to read.
    if (form.GetField("GetFieldIdFromPDF") != null)
        sb.Append(form.GetField("GetFieldIdFromPDF"));
    // Release the file handles once the value has been read.
    myStamp.Close();
    reader.Close();
}
I think you will have to write an application for this. This question talks about extracting data from PDF. After this you can export the data to Excel format so that you can preserve the existing import format.
Look for information on "Scraping" the data from the PDF. I believe Adobe has some tools that allow you to do this for simple text but I've not used them.
Honestly though, I would try to do anything you can to get this data in a raw format from your vendor.