Extracting data from PDF or Word using PHP, Java

I need help with this, especially since I don't know where to start.
I am an IT undergraduate and, along with my groupmates, am now undergoing on-the-job training in a company.
SCENARIO:
The company asked us to create a program that will generate a report and store it in a database.
The database that will be used is MySQL.
As for what language to use, we are considering VB.Net, Java, PHP.
The program must be able to:
generate a report that will be sent through email to an office
store it in a database
collect all reports and collate them
generate a new report, which will then be sent to their main office
then store it in their own database
For now, we are still trying to determine how the program will run and which language to use. The language must be able to read and extract data from a text file (either a Word document or a PDF file).
The company also wants the program to be online-ready for future expansion.
Now, our problems are:
Is there a way to extract data from a PDF or Word file using Java, PHP, or VB.Net and then store it in the MySQL database?
If there is, can it be implemented without using any third-party software?
The reason we chose either a PDF or a Word file is that the file should be printable for archive purposes.
Which programming language would make it easiest to solve the problem above?
I apologize if the information I am giving is a bit messy. I will give additional information once we are able to talk with the company this week.
If there is a problem with the way I posted this, please forgive me. I am just trying to provide the information as best I can.

I'll answer for Java, as it is what I use at work.
You can easily extract text from Word files, or build a new Word file, with Apache POI.
As for PDF, iText and PDFBox both do a pretty nice job.

Why can't you use third-party software? If you could, I would recommend something like the approach in "How to read PDF files using Java?".
Or, to read a .doc file: http://www.roseindia.net/tutorial/java/poi/readDocFile.html
Anyway, if you can't use 3rd party tools, why not read the specifications and figure out how to extract the text from PDF, DOC, and DOCX files?
Here you can find DOC specifications: http://msdn.microsoft.com/en-us/library/cc313118.aspx
Here you can find the PDF format specification: http://www.adobe.com/devnet/pdf/pdf_reference.html
Good luck!
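
To give a feel for the "read the format yourself" route: a .docx file (unlike the older binary .doc) is a ZIP archive whose main text lives in word/document.xml, so a rough extraction is possible with a standard library alone. The sketch below uses Python for brevity; the function name is made up for illustration, and it ignores headers, footnotes, and anything beyond plain paragraph text. PDF and legacy .doc are far harder to handle by hand.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the main document part of a .docx file.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def extract_docx_text(source):
    """Return the text of a .docx file, one line per paragraph.

    `source` may be a file path or a binary file-like object.
    (Hypothetical helper, not part of any library.)
    """
    with zipfile.ZipFile(source) as archive:
        xml_bytes = archive.read("word/document.xml")
    root = ET.fromstring(xml_bytes)
    paragraphs = []
    for paragraph in root.iter(W_NS + "p"):
        # A paragraph stores its text in one or more <w:t> runs.
        runs = [node.text or "" for node in paragraph.iter(W_NS + "t")]
        paragraphs.append("".join(runs))
    return "\n".join(paragraphs)
```

The returned string could then be parsed and written to MySQL; that last step needs a database driver, which this sketch leaves out.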

Related

Extract text from PDF into JSON or XML or whatever?

I am trying to extract data (price, information, and number) from PDFs. I have more than 10,000 PDFs, so the free trial of the website won't work.
Here is one example of the PDFs I get:
I tried it in Python (I am a beginner at this kind of task and at Python too) with several packages such as PyPDF2 and pdfx, but I only get the text, like this
with PyPDF2:
So it is possible to extract the price, the number, and the information, but my PDFs come in different formats, so the plain text plus a few parsing rules is not enough to extract the information reliably.
What I want to do must be possible, because a lot of websites do it and charge for it: I want to read the document vertically and convert the extracted data to XML, JSON, or simply a dataset.
I want to read the document by column, not by line.
Is there a way to do this in Python or another language?

First, let me tell you that this is not an easy problem to solve, since PDF files in the wild tend to be quite diverse in layout. I can suggest trying an open source project that works really well for extracting information from tables in PDF files. It is called Tabula, and you can get it at https://tabula.technology.
Tabula detects tables on each page and exports the content in CSV format. Once you have it in CSV, it should be easier to get the information using Python. Please note that the CSV layout depends on the table layout in the PDF, meaning that you may need to write several functions to extract the information correctly.
Tabula is not perfect, but it should work with most PDF files; for those that do not work, you may need to extract the information manually.
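
As a rough illustration of the "CSV first, then Python" step, here is a minimal sketch that reads CSV text such as Tabula might export and keeps only the columns of interest. The column names (number, information, price) and the function name are assumptions for the example; real exports will have whatever headers the PDF tables have, which is why one function per layout may be needed.

```python
import csv
import io


def rows_from_tabula_csv(text, wanted=("number", "information", "price")):
    """Yield one dict per data row, keeping only the wanted columns.

    Header names are matched case-insensitively. The names are
    hypothetical and depend on the table layout of each PDF.
    """
    reader = csv.reader(io.StringIO(text))
    header_row = next(reader, None)
    if header_row is None:
        return  # empty input: nothing to yield
    header = [cell.strip().lower() for cell in header_row]
    index = {name: header.index(name) for name in wanted if name in header}
    for row in reader:
        if not any(cell.strip() for cell in row):
            continue  # skip rows that contain no data
        yield {name: row[i].strip() for name, i in index.items() if i < len(row)}
```

Each yielded dict can be dumped to JSON with the json module or collected into a dataset.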

Choice of database for a dictionary which can be edited as plain text

I am creating a dictionary app, but I am not sure what kind of database to use.
My requirements:
Support for Unicode characters.
Reasonably Fast search.
In case of an installed version of the app, the database file can be edited using a text editor (such as notepad).
I want to implement a web version of the app (WHICH USES THE SAME DATABASE FILE USED BY THE INSTALLED VERSION) and viewers can add/modify entries so it requires concurrent access via network.
Easy to parse in any programming language.
Expecting a maximum file size of 100 MB (May not reach anywhere near that but just to make it future proof.)
So, with my limited knowledge, I ended up with three options: CSV, XML, or JSON. I prefer CSV for easy editing.
I understand that they are not as good as an RDBMS, but for this specific scenario, is it possible to use any of these, and how well can they perform?
Any alternative ideas are also welcome!
Thanks in advance.

I think that using a single file as a "database" is not such a great choice if you think/plan to extend your application further.
My recommendation is SQLite (https://www.sqlite.org/about.html). As described on its website, it suits almost all of your requirements except being editable with a text editor. But I think there are easy enough solutions for content management as well, such as SqliteBrowser (http://sqlitebrowser.org/).
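
A minimal sketch of what a SQLite-backed dictionary could look like, using Python's built-in sqlite3 module. The table and function names are made up for illustration. Text is stored as Unicode, and the primary key on the word gives an indexed lookup.

```python
import sqlite3


def open_dictionary(path=":memory:"):
    """Open (or create) the dictionary database; the schema is hypothetical."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS entries ("
        " word TEXT PRIMARY KEY,"
        " definition TEXT NOT NULL)"
    )
    return conn


def add_entry(conn, word, definition):
    # INSERT OR REPLACE lets one call add a new word or update an existing one.
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO entries (word, definition) VALUES (?, ?)",
            (word, definition),
        )


def lookup(conn, word):
    """Return the definition for `word`, or None if it is not present."""
    row = conn.execute(
        "SELECT definition FROM entries WHERE word = ?", (word,)
    ).fetchone()
    return row[0] if row else None
```

Passing a file path instead of ":memory:" keeps the data in a single file on disk.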

Objective-C - Parsing a .csv, extracting and inserting information, then displaying the .csv as an interface for editing

This question has been troubling me for the past week. Below, I will list my issue, and the research I have put into it.
The scenario: I was given a .csv file with 5000 rows and three columns. The three columns are defined as:
Site ID|Site Name|Site URL
My task: To create an HTML interface for the designers of the company to rate each site on a scale of 1-5.
My plan of action: I am a new hire. I am getting accustomed to the language I was hired for, which was Objective-C.
My algorithm for the project was to:
Parse the .csv
Remove the "Site Name" variable
Create a new .csv that contains the below variables: Site ID|Site URL|Rating|Image
Display the new .csv (with all aforementioned items) as an HTML page where there are toggles for "Ratings", which when pressed, will log the rating into the .csv which it was imported (or loaded) from.
The "Image" section I will be using a piece of software by the name of Paparazzi (on the Mac OS X operating system) which takes a fully formatted screenshot of the main page and saves it as a PNG file. I plan on using the file extension URL (which is stored locally) and load it into the "Image" column, thus when the designer clicks on the image, he is able to load the image that is stored locally.
My issue: As Objective-C is not entirely a scripting language, I am confused about which libraries I may need and how I can implement this. I have the algorithm, but I am wholly unsure about the implementation.
My questions: If you have done a project similar to this before with Objective-C, what tips can you provide for me? How does one load the .csv as an HTML interface where, upon edit, it will save the edit into the .csv? Will I need any servers for this, or is everything executable from just one machine? How do you grab an image (stored locally), extract its file extension, and load it into the .csv?
The most important question: Is this achievable through Objective-C? My reasoning behind it is, I want to advance my knowledge of OC through a task like this. Yes, using Python is easier, but is it possible to do this with Objective-C?
Thank you.

It certainly is achievable, but I doubt you'd really want to go this way. If I understand correctly, you want to serve the HTML page to others via a web browser. That would mean either writing a (simple) HTTP daemon that would run on the server, or writing a CGI script that would communicate with a standard HTTP daemon. Python, PHP, and Ruby do this for you readily, so there is much less room for error.
As for
"As Objective-C is not entirely a scripting language"
I would perhaps rephrase it as
"As Objective-C is entirely not a scripting language"
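
To show how little code the CSV-reshaping part takes in a scripting language, here is a rough Python sketch of steps 1 to 3 of the question's plan: parse the input, drop "Site Name", and emit "Site ID, Site URL, Rating, Image". The function names and the screenshots/<id>.png path pattern are assumptions made up for the example; the HTML interface and saving ratings back would still need a small web server on top.

```python
import csv
import io

OUTPUT_FIELDS = ["Site ID", "Site URL", "Rating", "Image"]


def build_rating_rows(source_text, delimiter=","):
    """Drop 'Site Name' and add empty 'Rating' plus a local 'Image' path."""
    reader = csv.DictReader(io.StringIO(source_text), delimiter=delimiter)
    rows = []
    for row in reader:
        rows.append({
            "Site ID": row["Site ID"],
            "Site URL": row["Site URL"],
            "Rating": "",  # filled in later by the designers
            # Hypothetical location of the locally stored screenshot.
            "Image": "screenshots/{}.png".format(row["Site ID"]),
        })
    return rows


def to_csv(rows, delimiter=","):
    """Serialize the reshaped rows back to CSV text."""
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=OUTPUT_FIELDS, delimiter=delimiter, lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()
```

If the real file is pipe-separated, as the column listing in the question suggests it might be, pass delimiter="|" to both functions.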

Activating HTML with Haskell

I have a large pile of lecture notes in raw HTML format. I would like to add interactive content to these notes, in particular incorporating online exercises. I have some experience implementing online exercises as cgi-bin executables compiled from Haskell code running on the server, interacting with a student record file and sending suitable HTML back to the browser, using Text.XHtml to generate the content. Now I plan to integrate the notes and the exercises.
The trouble is that I don't want to spend ages manually transforming my raw HTML into Haskell code to generate exactly the raw HTML I started with. Instead, I'd like to put my Haskell code and my HTML in the same source file, with placeholders in the latter for content generated by the former. A suitable tool should then transform this file into Haskell source code for (e.g.) a cgi-bin executable which generates the corresponding page.
Before I go hacking up such a piece of kit, I thought I'd ask if there's better technology out there already. The fixed points are the large legacy lump of HTML, the need to implement the assessment of the exercises in Haskell, and the need to interact with student records on the server. The handicap is that I need to use the departmental web server and I can't reconfigure it (ok, maybe I could ask nicely): that's one of the reasons I currently use cgi-bin executables, which are just fine on our server already, but I'm open to other possibilities.
My current plan is to write a (I mean adapt an existing) preprocessor to support a special syntax for defining functions of type
Html -> ... -> Html -> Html
that looks a lot like raw HTML with splice points. Then what I do with my existing raw HTML is indent it a bit and mark the holes.
But would that be a waste of time? Please, please tell me that this question is a duplicate!

There are Haskell frameworks like Yesod and Happstack which use templating engines like you describe.
Have you looked at the Haskell wiki at http://www.haskell.org/haskellwiki/HSP or
http://www.haskell.org/haskellwiki/Web/Libraries/Templating?
They may do what you need.
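
The "raw HTML with splice points" idea that such templating engines implement can be shown in a few lines. This is only a language-neutral illustration written in Python, not how Yesod, Happstack, or HSP do it; the template text and the render_page name are made up for the example.

```python
import html
from string import Template

# Legacy HTML kept as-is, with $name placeholders marking the splice points.
# A literal dollar sign in the legacy HTML would have to be written as $$.
PAGE = Template(
    "<h1>$title</h1>\n"
    "<p>Notes stay as plain HTML.</p>\n"
    '<div class="exercise">$exercise</div>\n'
)


def render_page(title, exercise_html):
    """Fill the splice points of the page template.

    The title is plain text, so it is escaped; the exercise is assumed
    to be HTML already generated by trusted code, so it is inserted raw.
    """
    return PAGE.substitute(title=html.escape(title), exercise=exercise_html)
```

The Haskell templating libraries linked above give the same shape with type checking on the spliced values.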

You might find something to do the job here: Templating packages for Haskell.
And you should probably look into Snap, Yesod or Happstack for serving the content.

"I have a large pile of lecture notes in raw HTML format. I would like to add interactive content to these notes, in particular incorporating online exercises."
There is already a system called ActiveHs, written in Haskell, that lets you put lecture notes and interactive exercises in one file.
See:
http://pnyf.inf.elte.hu/fp/UsersGuide_en.xml
http://pnyf.inf.elte.hu/fp/Constructive_en.xml
I can really say that it is very well written code and completely open source!

How can I create a well-formatted PDF?

I'm working on automating our company invoicing system. Currently all data is stored in our local MySQL database and someone manually updates an excel spreadsheet and then merges this data into a MS Word template. The goal is to automate this process so that the invoice can be generated from our intranet website as a PDF.
My original plan was to create a template in HTML/CSS and use wkhtmltopdf to generate the PDF but I ran into problems with getting a repeatable header and footer on each page. thead and tfoot aren't supported by Webkit and the fix suggested in this other question does not seem to work either.
So I then stumbled on using XML and XSL-FO, the latter I know nothing about. Is this the best path to take? Are there any libraries or utilities out there that will make converting my HTML+CSS into XML+XSL-FO easier? Are there any other alternatives I'm overlooking?
EDIT
Currently the server is CentOS Linux with a MySQL database. All other code is currently in PHP, but that may change as the whole system is being revamped. Linux and MySQL will almost certainly remain, though.

For your requirement, XSL-FO might just do the trick. It is much cleaner to produce the PDFs directly from the data than to go the cumbersome HTML path. If you need to display the HTML as well, you might consider converting from HTML to PDF, but it will always be messy.
You can get XML results from MySQL quite easily (mysql --xml) and then write one (or several) XSL-FO stylesheets for the data. Then you can produce not only PDFs but, with some processors, also PostScript or RTF files.
XSL-FO has its limitations, though, but for your situation it should suffice.
I admit the learning curve can be steep, and maintaining XSLT stylesheets can get very tiring, but as you learn more about it, you end up writing less code.
Another possibility is to do the whole thing in, e.g., Java or C#: send SELECT statements, loop over the results, and iteratively build the PDF using a library like iText.
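
Whichever route is taken, the first step is getting rows out of MySQL in a structured form. As a small sketch of the mysql --xml route mentioned above, the Python function below turns that XML into a list of dicts. It assumes the resultset/row/field layout with a name attribute on each field, and the invoices table in the usage example is hypothetical.

```python
import xml.etree.ElementTree as ET


def rows_from_mysql_xml(xml_text):
    """Convert `mysql --xml` style output into a list of dicts.

    Assumes <resultset><row><field name="...">value</field></row></resultset>.
    Empty or nil fields come back as None.
    """
    root = ET.fromstring(xml_text)
    rows = []
    for row in root.iter("row"):
        rows.append({field.get("name"): field.text for field in row.iter("field")})
    return rows
```

For example, the output of something like mysql --xml -e "SELECT id, total FROM invoices" could be fed to this function, and the resulting rows passed to whatever builds the PDF.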

You could try JODReports or Docmosis as less code-intensive options. You supply Word or OpenOffice Writer documents to act as templates and use these engines to populate the templates, then output the documents in the format(s) you require. This may mean your existing Word templates can be used directly, which should save you some effort and time.
iText is another library that will let you build and pump out PDFs from code. It's pretty good.

If you could use ASP.NET for the web side, you can use the free ReportViewer library and designer to automate publishing PDFs.
Here are some references:
http://gotreportviewer.com
http://weblogs.asp.net/srkirkland/archive/2007/10/29/exporting-a-sql-server-reporting-services-2005-report-directly-to-pdf-or-excel.aspx

If you're OK using .NET and C#, you could use DotPdf from Atalasoft (obligatory disclaimer: I work for Atalasoft and wrote most of DotPdf). The Generating namespace is geared for exactly what you're trying to do: automate report generation. From the very basics, you could just create docs directly with the toolkit or you can create template documents that have unpopulated text fields that you can reload and fill later (see here and here for examples).
