How to extract data from a PDF? - sql-server-2008

My company receives data from an external company via Excel. We import this into SQL Server to run reports on the data. They are now changing to PDF format. Is there a way to reliably extract the data from the PDF and insert it into our SQL Server 2008 database?
Would this require writing an app or is there an automated way of doing this?

As already mentioned - you will have to write an app to do this, but ideally you would be able to get the raw data from the external company rather than having to process the PDF.
However, if you do want to extract the data from the PDF, I've used iText and found it to be very powerful, reliable and most importantly - free. It comes in Java and .NET flavours - iTextSharp is the .NET version. It allows you to programmatically manipulate PDF documents and it will expose the contents of the PDF to the application that you write.

It all depends on how they've included the data within the PDF. Generally speaking, there are two possible scenarios here:
1) The data is just a text object within a PDF. You'll need to use a tool to extract the text from the PDF, then insert it into your database.
2) The data is contained within form fields in a PDF. You'll need to use a tool to extract data from the form fields and insert it into your database.
Hopefully scenario #2 applies to you because this is precisely what PDF forms are designed for. Scenario #1 is really just a hack that you'd only use if you didn't have any other options. Extracting plain text from a PDF isn't as easy or accurate as you might expect.
If you're receiving a PDF form then all you need to do is match up the right fields in the PDF form with the corresponding fields in your database and then suck in the data. This process could be entirely automated if you wrote your own application.
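The field-matching step above can be sketched in Java (iText also comes in a Java flavour). This is only a minimal sketch: the field names, column names and table name are hypothetical, and reading the values out of the PDF is assumed to have happened already. It shows one way to keep the mapping in a single place and build a parameterized INSERT from it; the ordered values would then be bound to a JDBC PreparedStatement.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class FormFieldMapper {
    // Hypothetical mapping: PDF form field name -> database column name.
    // LinkedHashMap keeps insertion order, so columns and values line up.
    static final Map<String, String> FIELD_TO_COLUMN = new LinkedHashMap<>();
    static {
        FIELD_TO_COLUMN.put("CustomerName", "customer_name");
        FIELD_TO_COLUMN.put("InvoiceTotal", "invoice_total");
    }

    // Builds "INSERT INTO <table> (c1, c2) VALUES (?, ?)" for the mapped columns.
    static String buildInsertSql(String table) {
        String columns = String.join(", ", FIELD_TO_COLUMN.values());
        List<String> marks = new ArrayList<>();
        for (int i = 0; i < FIELD_TO_COLUMN.size(); i++) {
            marks.add("?");
        }
        return "INSERT INTO " + table + " (" + columns + ") VALUES ("
                + String.join(", ", marks) + ")";
    }

    // Orders the extracted form values so they match the placeholders.
    static List<String> orderedValues(Map<String, String> formValues) {
        List<String> values = new ArrayList<>();
        for (String field : FIELD_TO_COLUMN.keySet()) {
            values.add(formValues.get(field)); // null if the PDF lacks the field
        }
        return values;
    }

    public static void main(String[] args) {
        // Stand-in for values read from the PDF form by a PDF library.
        Map<String, String> formValues = new LinkedHashMap<>();
        formValues.put("InvoiceTotal", "129.50");
        formValues.put("CustomerName", "Acme Ltd");
        System.out.println(buildInsertSql("invoices"));
        System.out.println(orderedValues(formValues));
    }
}
```

Using placeholders rather than concatenating the values into the SQL string keeps the import safe against odd characters in the form data.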
Would this require writing an app or is there an automated way of doing this?
Yes, both of these options would require writing an app or buying an app. If you write your own app then you'll need to find a third-party PDF library that supports retrieving data from form fields or extracting text from a PDF.

Disclaimer: I am affiliated with the makers of ByteScout PDF Extractor SDK tool
Just wanted to share some additional real-life scenarios for text data extraction from PDF:
Scanned images with no searchable text: these should be processed by an OCR engine (like the free Tesseract from Google).
XFA forms: a subset of PDF that is supported mostly by Adobe tools. The data can still be extracted as XML with low-level PDF processing tools like iTextSharp or similar.
ZUGFeRD PDF files: these are just PDF documents with a copy of the form data attached as an XML file (which can be extracted with tools like this).
Text incorrectly encoded by some PDF generators: this can be restored via an OCR engine, though with some error rate.

Using iTextSharp, do the following:
using System;
using System.Configuration;   // for the connection string, once you insert into SQL Server
using System.Data.SqlClient;  // for the insert itself (not shown here)
using System.Text;
using iTextSharp.text.pdf;

protected void BtnSubmit_Click(object sender, EventArgs e)
{
    // Placeholder: replace with the real path to the PDF file.
    string filePath = @"GetFilePath";
    StringBuilder sb = new StringBuilder();

    // A PdfReader is enough to read form fields; no PdfStamper or temp file is needed.
    PdfReader reader = new PdfReader(filePath);
    try
    {
        AcroFields form = reader.AcroFields;
        // Placeholder: replace with the name of the field in your PDF form.
        string value = form.GetField("GetFieldIdFromPDF");
        if (value != null)
            sb.Append(value);
    }
    finally
    {
        reader.Close(); // release the file handle
    }
}

I think you will have to write an application for this. This question talks about extracting data from PDF. After this you can export the data to Excel format so that you can preserve the existing import process.

Look for information on "Scraping" the data from the PDF. I believe Adobe has some tools that allow you to do this for simple text but I've not used them.
Honestly though, I would try to do anything you can to get this data in a raw format from your vendor.

Related

Saving Unity GUI fields

I'm new to Unity's native GUI (used to use NGUI / iGUI) and was wondering whether you can have fields or input from the native UI save to a CSV file? If that is possible, can you have multiple iterations of a build open and have OnButtonDown, an entry saved per user to the same csv file?
Literally just wanting to know whether Unity UI has that capability.
Cheers!
That wouldn't be a Unity feature or even a UI feature, but if you design your program correctly (single access to the file at one time, append instead of overwrite), then yes, you can get the functionality that you want.

MarkLogic Java API batch upload files (.csv)

I'm trying out the MarkLogic Java API and want to bulk upload some files with the extension .csv
I'm not sure what to use, since the Java API only supports JSON, XML, and TXT files.
How do I batch upload files using the MarkLogic Java API? Do I convert everything to JSON?
Do I convert everything to JSON?
Yes, that is a common way to do it.
If you would like additional examples of how you can wrangle CSV with the Java Client API, check out OpenCSVBatcherExample and JacksonDatabindTest.testDatabindingThirdPartyPojoWithMixinAnnotations. The first demonstrates converting the csv to XML and using a custom REST extension. The second example (well, unit test...) demonstrates converting the csv to JSON and using the batch upload (Bulk Writes) capabilities Justin linked to.
If you have CSV files on your filesystem, I’d start with mlcp, as suggested above. It will handle all of the parsing and splitting into multiple transactions/batches for you. Take a look at the mlcp documentation for more details and some example configurations.
If you’d like more control over the parsing and splitting logic than mlcp gives you out-of-the-box or you’re getting CSV from some other source (i.e. not files on the filesystem), you can use the Java Client API. The Java Client API allows you to efficiently write batches using a WriteSet. Take a look at the “Bulk Writes” example.
According to your reply to Justin, you cannot use MLCP because it is command line and you need to integrate it into a web portal.
Well, MLCP is released as open source software under the Apache 2 licence. So if you are happy with this licence, then you have the source to integrate.
But what I see as your main problem statement is more specific:
How can I create multiple XML or JSON documents from a CSV file [allowing the use of the Java API to then upload them as documents in MarkLogic]
With that specific problem statement:
1) have a look at SplitDelimitedTextReader.java from the mlcp source
2) try some java libraries for this purpose such as http://jsefa.sourceforge.net/quick-tutorial.html
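As a rough illustration of that problem statement, the sketch below turns each CSV row into one JSON document using only the Java standard library. It is not taken from mlcp or JSefa, and it assumes very simple CSV: a header row, comma separators, and no quoted fields, embedded commas or control characters. The class and method names are hypothetical. Each resulting string could then be handed to the Java Client API's bulk write as a separate document.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class CsvToJsonDocs {
    // Escapes backslashes and double quotes for use inside a JSON string.
    // Control characters are not handled in this sketch.
    static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    // Turns each data row into one JSON object keyed by the header row.
    // All values are emitted as JSON strings.
    static List<String> toJsonDocs(List<String> lines) {
        List<String> docs = new ArrayList<>();
        if (lines.isEmpty()) {
            return docs;
        }
        String[] headers = lines.get(0).split(",", -1);
        for (String line : lines.subList(1, lines.size())) {
            if (line.isEmpty()) {
                continue; // skip blank lines
            }
            String[] cells = line.split(",", -1);
            StringBuilder json = new StringBuilder("{");
            for (int i = 0; i < headers.length; i++) {
                if (i > 0) {
                    json.append(",");
                }
                // Short rows get an empty string for the missing cells.
                String value = i < cells.length ? cells[i] : "";
                json.append('"').append(escape(headers[i].trim())).append("\":\"")
                    .append(escape(value.trim())).append('"');
            }
            docs.add(json.append("}").toString());
        }
        return docs;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("id,name", "1,Ann", "2,Bob");
        for (String doc : toJsonDocs(lines)) {
            System.out.println(doc);
        }
    }
}
```

For real-world CSV with quoting, a proper CSV parser or a JSON library is the safer choice; the sketch only shows the one-row-per-document split.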

How do I download contents of an html table generated by play 1.2.7 backend on java in xls

I've generated a table using play's #{list} tag and get pretty decent results. Now I need to be able to generate and download an xls version of the table and have no idea what to do. Any pointers at all will be much appreciated
Well you have various options.
Excel will open HTML files. So instead of rendering your table as a normal HTML page, you can stream it to the browser and set the content type to XLS.
While Excel will open this, it will still be an HTML file rather than an XLS(X) document.
You can generate as CSV from your data model and stream this to the browser. Again this will be a CSV rather than a proper XLS(X) document.
There also seem to be some solutions around which can do it using JavaScript. See as a starting point: Generate excel sheet from html tables using jquery
Finally you can use something like Apache POI or JXLS to generate a 'proper' xls(x) document and stream this to the browser. I have some code here that will export HTML to a 'proper' xlsx file if this is the route you wish to go. The workflow is then to create some HTML from your data model and convert that to Excel, rather than having to programmatically build the Excel document using POI. https://github.com/alanhay/html-exporter
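For the CSV option mentioned above, a minimal sketch of the generation step might look like this. It uses only the Java standard library and RFC 4180 style quoting; the class name and the sample rows are hypothetical, and the Play-specific part (returning the result from a controller with a CSV content type) is deliberately left out.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.Writer;
import java.util.Arrays;
import java.util.List;

class CsvExport {
    // Quotes a cell when it contains a comma, a double quote or a line break,
    // doubling any embedded double quotes (RFC 4180 style).
    static String quote(String cell) {
        if (cell == null) {
            return "";
        }
        boolean needsQuotes = cell.contains(",") || cell.contains("\"")
                || cell.contains("\n") || cell.contains("\r");
        return needsQuotes ? "\"" + cell.replace("\"", "\"\"") + "\"" : cell;
    }

    // Writes one CSV line per row to the given writer.
    static void writeCsv(List<List<String>> rows, Writer out) throws IOException {
        for (List<String> row : rows) {
            for (int i = 0; i < row.size(); i++) {
                if (i > 0) {
                    out.write(',');
                }
                out.write(quote(row.get(i)));
            }
            out.write("\r\n");
        }
    }

    public static void main(String[] args) throws IOException {
        StringWriter out = new StringWriter();
        writeCsv(Arrays.asList(
                Arrays.asList("Name", "Note"),
                Arrays.asList("Ann", "said \"hi\", then left")), out);
        System.out.print(out);
    }
}
```

Because writeCsv takes a Writer, the same method can write to a string for testing or to a response stream in the web application.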

XPages: Generate CSV file and attach to mail

I'm trying to generate a csv file (specifically an .ics) and attach this to an email.
The Email is composed via an SSJS-Function.
One option would be to generate the csv file, save it to a document and attach this to the email.
I tried to generate the csv file via an XAgent in an XPage (like this http://www.wissel.net/blog/d6plinks/SHWL-8248MT) and get a handle on the output, but with no success.
Do you know a way to do this?
Any help is greatly appreciated!
Thanks in advance!
You are looking at 2 tasks:
1) Create a csv / ics file
2) Send this as attachment
For #1 you can use a StringBuilder or a PrintWriter or whatever. However an ics file is actually not a CSV file, but an iCalendar format. To generate it I highly recommend ical4j. In any case, whatever you write -> don't create a file. Use a PrintWriter (for the CSV) that wraps a ByteArrayOutputStream (or write to the stream directly for ical4j), so the result is a byte array in memory.
For #2 The ONE mental step you must make is AWAY from "the Notes way" trying to deal with embedded objects etc. You create a MIME message (there are snippets on OpenNTF) and create a mimepart. There you can use setContentFromBytes and you have your attachment.
Pro tip (to make your life easier): create a Java class with a function that takes an outputstream as parameter that generates the file for you. This way you can test it in Eclipse (or the Domino Designer Java view) without having to run preview and with full debugging support (you simply provide a file-output stream for testing and write to a file - or to System.out)
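A minimal sketch of that pro tip, under some assumptions: the class name and the two hard-coded CSV rows are hypothetical, and the MIME step is not shown. The generator only knows about an OutputStream, so it can write to System.out during testing and to a ByteArrayOutputStream when the bytes are needed for the attachment.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.nio.charset.StandardCharsets;

class AttachmentGenerator {
    // Writes the content to whatever stream the caller supplies.
    static void generate(OutputStream out) {
        PrintWriter writer = new PrintWriter(
                new OutputStreamWriter(out, StandardCharsets.UTF_8));
        writer.print("name,date\r\n");
        writer.print("Team meeting,2015-01-31\r\n");
        // Flush, but leave closing to the caller, who owns the stream.
        writer.flush();
    }

    // Captures the generated content in memory instead of creating a file.
    static byte[] generateToBytes() {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        generate(buffer);
        return buffer.toByteArray();
    }

    public static void main(String[] args) {
        generate(System.out); // quick check outside Domino
        System.out.println(generateToBytes().length + " bytes in memory");
    }
}
```

The byte array returned by generateToBytes is what would be passed on to the MIME part as the attachment content.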

Ways to export Tables/Views from mySQL Database to printer friendly format (other than phpMyAdmin)

I've created a bunch of views in a database and I'd like to export them to PDF. However, phpMyAdmin only lets me put a title on each page and is very limited in how I can lay out the output.
Does anybody have some recommendations of software/scripts they have used?
TCPDF is a PHP class for generating PDF documents. They have many example scripts.
There are a fantastillion ways to do this, some ideas:
export csv, import it to your favorite spreadsheet editor, format it, get the pdf using a pdf printer.
export xml, process it using xsl-fo to produce the output you want ( hacking required, fun )
export html ( should work? ), put a css on top optimized for print layout, pdf-printer.
Usually, I write up a script to pull info from a database, then generate a .csv, attach it to an email and send it on its way. Most scripting languages with support for MySQL can do that, and they can also go as far as generating a .pdf file with the appropriate formatting (in my case, I use Ruby, so I could have used Prawn to generate a PDF - I just choose not to as of this time).