How to read data from an OCR file

I am building a tool that can read an OCR file. I am using idolondemand (idolondemand.com), but I have not found it very promising: it does not read files properly (e.g. spelling mistakes, mangled special characters).
I can move to any other language; for me this problem has become language-independent.
I need help building one.

Related

Activating HTML with Haskell

I have a large pile of lecture notes in raw HTML format. I would like to add interactive content to these notes, in particular incorporating online exercises. I have some experience implementing online exercises as cgi-bin executables compiled from Haskell code running on the server, interacting with a student record file and sending suitable HTML back to the browser, using Text.Xhtml to generate the content. Now I plan to integrate the notes and the exercises.
The trouble is that I don't want to spend ages manually transforming my raw HTML into Haskell code to generate exactly the raw HTML I started with. Instead, I'd like to put my Haskell code and my HTML in the same source file, with placeholders in the latter for content generated by the former. A suitable tool should then transform this file into Haskell source code for (e.g.) a cgi-bin executable which generates the corresponding page.
Before I go hacking up such a piece of kit, I thought I'd ask if there's better technology out there already. The fixed points are the large legacy lump of HTML, the need to implement the assessment of the exercises in Haskell, and the need to interact with student records on the server. The handicap is that I need to use the departmental web server and I can't reconfigure it (ok, maybe I could ask nicely): that's one of the reasons I currently use cgi-bin executables, which are just fine on our server already, but I'm open to other possibilities.
My current plan is to write a (I mean adapt an existing) preprocessor to support a special syntax for defining functions of type
Html -> ... -> Html -> Html
that looks a lot like raw HTML with splice points. Then what I do with my existing raw HTML is indent it a bit and mark the holes.
But would that be a waste of time? Please, please tell me that this question is a duplicate!
There are Haskell frameworks like Yesod and Happstack which use templating engines like you describe.
Have you looked at the haskell wiki at http://www.haskell.org/haskellwiki/HSP or
http://www.haskell.org/haskellwiki/Web/Libraries/Templating ?
They may do what you need.
You might find something to do the job here: Templating packages for Haskell.
And you should probably look into Snap, Yesod or Happstack for serving the content.
"I have a large pile of lecture notes in raw HTML format. I would like to add interactive content to these notes, in particular incorporating online exercises."
There is already a system (called "ActiveHs"), written in Haskell, that lets you put lecture notes and interactive exercises in one file.
See:
http://pnyf.inf.elte.hu/fp/UsersGuide_en.xml
http://pnyf.inf.elte.hu/fp/Constructive_en.xml
I can really say that it is very well written code and completely open source!

Extracting data from PDF or Word using PHP, Java

I need help with this, especially since I don't know where to start.
I am an IT undergraduate and, along with my groupmates, am now undergoing on-the-job training at a company.
SCENARIO:
The company asked us to create a program that will generate a report and store it in a database.
The database that will be used is MySQL.
As for what language to use, we are considering VB.Net, Java, PHP.
The program must be able to:
generate a report that will be sent by email to an office
store it in a database
collect all reports and collate them
generate a new report, which will then be sent to their main office
then store it in their own database
For now, we are still trying to determine how the program will run and which language to use. The language must be capable of reading and extracting data from a text file (either a Word document or a PDF file).
The company also wants the program to be online-ready for future expansion.
Now, our questions are:
Is there a way to extract data from a PDF or Word file using Java, PHP, or VB and then store it in the MySQL database?
If there is, can it be implemented without using any third-party software?
The reason we chose a PDF or Word file is that the file should be printable for archive purposes.
Which programming language would make it easiest to solve the problem above?
I apologize if the information I am giving is a bit messy. I will provide additional information once we are able to talk with the company this week.
If there is a problem with the way I posted this, please forgive me. I am just trying to provide the information as best I can.
I'll answer for Java as it is what I use at work.
You can easily extract text from Word files or build a new Word file with Apache POI.
As for PDF, iText and PDFBox both do a pretty nice job.
Why can't you use third-party software? If you could, I would recommend something like the approach in "How to read PDF files using Java?"
Or, to read a .doc file: http://www.roseindia.net/tutorial/java/poi/readDocFile.html
Anyway, if you can't use 3rd party tools, why not read the specifications and figure out how to extract the text from PDF, DOC, and DOCX files?
Here you can find DOC specifications: http://msdn.microsoft.com/en-us/library/cc313118.aspx
Here you can find the PDF format specification: http://www.adobe.com/devnet/pdf/pdf_reference.html
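To give an idea of what working from the specification looks like, here is a minimal Python sketch for the simplest case, DOCX. It relies on the fact that a .docx file is a ZIP archive whose main text lives in word/document.xml. The function name docx_paragraphs is made up for this example, and it ignores headers, footnotes and anything else outside the main document part. Legacy .doc and PDF are binary formats and need far more work than this.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the main document part of a .docx file.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_paragraphs(source):
    """Return the text of each paragraph in a .docx file (path or file object).

    Hypothetical helper for illustration; only word/document.xml is read.
    """
    with zipfile.ZipFile(source) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # A paragraph's visible text is split across one or more <w:t> runs.
        text = "".join(node.text or "" for node in para.iter(W_NS + "t"))
        paragraphs.append(text)
    return paragraphs
```

The returned list of strings can then be inserted into MySQL with whatever database driver you end up using.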
Good luck!

Decipher binary format of file

I have a binary file that I'm trying to write to. However, I don't have the file format specification and haven't found it using Google. I've been looking at the file in a hex editor, but so far that has only given me a headache. Is there a better way to decipher the format of the file so that I can append data to it?
File carving tools such as scalpel won't really help here. They're made for extracting files with known header and/or footer signatures from a memory dump or some larger, composite file.
For your scenario, I would recommend a hex editor with templating capability, like the 010 Editor. This will allow you to name and annotate "fields" in the binary as you learn more about what each part of the file does. Unfortunately, the process of finding out what each field does is mostly manual. As a methodology, just start playing with it. Change some values in your current binary and see what happens. Expect to spend significant time on it, but also enjoy the process!
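If you don't have a templating hex editor at hand, you can keep the same kind of notes in a small script. This is only a sketch: the header layout below (a 4-byte magic, a 16-bit version, a 32-bit record count, all little-endian) is invented for illustration, and the names HEADER_FORMAT, Header and parse_header are hypothetical. You would replace the layout with whatever you actually learn about your file.

```python
import struct
from collections import namedtuple

# Hypothetical header layout, invented for illustration only:
# 4-byte magic, uint16 version, uint32 record count, all little-endian.
HEADER_FORMAT = "<4sHI"
Header = namedtuple("Header", "magic version record_count")


def parse_header(data):
    """Decode the guessed header fields from the start of `data`."""
    size = struct.calcsize(HEADER_FORMAT)
    if len(data) < size:
        raise ValueError("file is shorter than the guessed header")
    return Header._make(struct.unpack_from(HEADER_FORMAT, data, 0))


if __name__ == "__main__":
    # Build a sample that matches the guessed layout and decode it again.
    sample = b"DEMO" + struct.pack("<HI", 2, 17)
    print(parse_header(sample))
```

Each time you work out another field, extend the format string and the tuple, then rerun the script over all your sample files to see whether the guess holds.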
You may want to examine it with an open source forensic application like foremost or scalpel. They will do most of the grunt work for you; you just likely won't learn anything.

Programmatically generate high quality PDFs

Note: I realize this question has already been asked (with a ruby slant) here: Creating on-demand, print-quality PDFs (preferably in Ruby if feasible). BUT there was no decent answer IMHO.
So as you may have guessed, I am looking for the best approach to producing HIGH QUALITY, print-ready PDF documents programmatically. Our requirements call for design documents that define placeholders for dynamic content like images and text, i.e. some kind of template mechanism.
The suggestion has been to use Adobe's InDesign server, but this seems like an expensive solution not to mention a little overkill for our need.
Are there any alternative, cheaper and more fitting solutions out there? The language of the solution doesn't really matter, just as long as it can be executed on a Windows box.
My suggestion would be to look at XSL-FO or thereabouts...
You create an XML doc that describes what you want and there are various libraries and toolkits (I've used XEP from RenderX) that will convert said XML into PDF.
In real terms what we did was take a large lump of data in XML format and use XSLT (templates, in effect) to convert the data to formatting objects, which XEP renders into something (a 500 page hotel directory with auto-generated TOC and index) that has been consumed quite happily by at least three different commercial printers. We did some other smaller documents too from time to time.
The downside is that it's not even remotely a WYSIWYG solution: you're effectively compiling "source code" to get PDF out the back. The upside is that the base technologies are reasonably generic, even if the specific toolkits may be a bit less so.
You can convert XML templates to PDFs with Prince.
"Prince is a computer program that converts XML and HTML into PDF documents. Prince can read many XML formats, including XHTML and SVG. Prince formats documents according to style sheets written in CSS."
I have had a lot of success with ReportLab, an open source Python PDF library (http://www.reportlab.org/rl_toolkit.html), and I know many other people who have too.
It's extremely easy to use and very quick to get started with, so it's worth trying out.
I don't know why no one has suggested using LaTeX for this. It's an extremely popular open format for document design, and it's not hard to set up a template that you can fill with text or image content. While the reference implementation of LaTeX runs as a standalone program, if that sounds like too many moving parts for you, there are wrapper libraries for Python and other languages you can call via an API.
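As a rough illustration of the template idea, here is a minimal Python sketch that fills placeholders in a LaTeX source string. The template text, the "@" placeholder syntax and the names LatexTemplate and render are all assumptions made up for this example. It only produces the .tex source; compiling it to PDF is left to whichever LaTeX tool you run, and the inserted values are not escaped for LaTeX special characters.

```python
import string


class LatexTemplate(string.Template):
    # Use "@" for placeholders because "$" already delimits math in LaTeX.
    delimiter = "@"


# Hypothetical template; a real one would live in its own .tex file.
TEMPLATE = LatexTemplate(r"""\documentclass{article}
\usepackage{graphicx}
\begin{document}
\section*{@title}
@body

\includegraphics[width=\linewidth]{@image}
\end{document}
""")


def render(title, body, image):
    """Fill the placeholders and return LaTeX source ready for a compiler."""
    return TEMPLATE.substitute(title=title, body=body, image=image)
```

Using string.Template rather than str.format avoids clashes with the braces that appear all over LaTeX source.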
Java language and JasperReports
Java: iText
C#: iTextSharp
It depends on what you want to publish, but take a look at Pentaho Reporting:
http://reporting.pentaho.org/
rinohtype is an open-source document processor that is capable of producing high-quality print-ready PDF documents. You can use one of the built-in document templates (book, article) or define your own template. The look of document elements can be configured by means of CSS-like style sheets. The contents of your document can be parsed from reStructuredText or CommonMark files, or you can build the document tree programmatically.
Full disclosure: I am the author of rinohtype.

How to analyze binary file?

I have a binary file. I don't know how it's formatted; I only know it comes from Delphi code.
Is there any way to analyze a binary file?
Is there any "pattern" for analyzing and deserializing the binary content of a file with an unknown format?
Try these:
Deserialize the data: find out how the exe was compiled (try File Analyzer). Try to deserialize the binary data with the language you discover. Then serialize it to XML, a language-independent format that every programming language can read.
Analyze the binary data: save several versions of the file with small variations, then use a diff program together with a hex editor to work out the meaning of each bit. Use this in conjunction with binary hacking techniques (like How to crack a Binary File Format by Frans Faase).
Reverse engineer the application: try to recover the code using reverse engineering tools for the programming language used to build the app (found with File Analyzer). Otherwise use a disassembler such as IDA Pro.
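The "save several versions and diff them" step can be sketched in a few lines of Python. The function name diff_ranges is made up for this example. It compares the two versions position by position, so it is only useful when the change does not insert or remove bytes in the middle of the file.

```python
def diff_ranges(old, new):
    """Return (start, end) offset pairs where two byte strings differ.

    Only the overlapping prefix is compared byte by byte; a difference in
    length is reported as one final range.
    """
    ranges = []
    start = None
    common = min(len(old), len(new))
    for offset in range(common):
        if old[offset] != new[offset]:
            if start is None:
                start = offset  # a new run of differing bytes begins here
        elif start is not None:
            ranges.append((start, offset))
            start = None
    if start is not None:
        ranges.append((start, common))
    if len(old) != len(new):
        ranges.append((common, max(len(old), len(new))))
    return ranges
```

Read both files with open(path, "rb").read() and pass the results in; the ranges tell you where to look in the hex editor.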
For my hobby project I had to reverse engineer some old game files. My approaches were:
Have a good hex editor.
Look for readable words in the binary file. Note how they are distributed. If the distance between them is constant, you know it is a listing.
Look for 2-3 consecutive zeros. They might indicate an int32 value.
Some dwords might be pointers into the file.
Try to identify reoccurring patterns in the file.
Seeing lots of C0-CF might indicate RLE compressed data.
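The "dwords might be pointers" check from the list above can be automated roughly like this in Python. The function name candidate_pointers is made up, and the sketch assumes 4-byte alignment and little-endian storage. It is only a heuristic: small counters and sizes pass the same test, so treat the hits as places to look, not as answers.

```python
import struct


def candidate_pointers(data, alignment=4):
    """Yield (offset, value) for aligned little-endian uint32 values that
    could be offsets into the same file.

    Assumption: pointers are 4 bytes wide and stored at aligned offsets.
    """
    for offset in range(0, len(data) - 3, alignment):
        (value,) = struct.unpack_from("<I", data, offset)
        # A plausible file pointer is non-zero and lands inside the file.
        if 0 < value < len(data):
            yield offset, value
```

Jumping to each reported value in the hex editor quickly shows whether something recognizable starts there.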
I've developed Hexinator (Windows & Linux) and Synalyze It! (macOS) exactly for this purpose. These applications allow you to see the binary files like in other hex editors but additionally you can create a "grammar" with the specifics of a binary file format. The grammar contains all the building blocks and is used to parse the file automatically.
Thus you can keep the knowledge you gain in the analysis and apply it to multiple files simultaneously. You can also color-code the bits and pieces of file formats for a quick overview in the hex editor.
The parsing results are displayed in a tree view where you can also modify the files easily (applying endianness et cetera).
Reverse engineering a binary file when you have some idea of what it represents is a very time consuming process. If you have no idea what it is then it will be even harder.
It is possible though, but you have to have a pretty good reason for doing so.
The first step would be to open it up in a hex editor of your choice and see if you can find any English text to point you in the direction of what the file is even supposed to represent. From there, search Google for "Reverse Engineering binary files"; people much more knowledgeable than me have written guides about it.
The "strings" program from GNU binutils is very useful. It will print the strings of printable characters in a file, quite often giving a clue to what a file contains or a program does.
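If you can't install binutils, a rough equivalent is easy to write. This Python sketch (the name find_strings is made up) reports runs of printable ASCII bytes together with their offsets. It uses a minimum length of 4, which is also the default of "strings", but it does not try to match every detail of the real tool.

```python
import re


def find_strings(data, min_length=4):
    """Return (offset, text) for each run of printable ASCII bytes."""
    # Bytes 0x20-0x7e are the printable ASCII characters, space included.
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_length)
    return [(m.start(), m.group().decode("ascii")) for m in pattern.finditer(data)]
```

Having the offsets also lets you check whether the strings are evenly spaced, which would hint at fixed-size records.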
If the data represents serialized Delphi objects, you should start reading about the Delphi serialization process. If that's the case, I think your best bet would be to load it using Delphi and continue your analysis from the IDE. Some information about Delphi serialization can be found here.
EDIT: if the file does contain serialized Delphi objects, then you should write a small Delphi program that loads it and "convert" the data yourself to something neutral, like XML. If you manage to do this, you should check whether Delphi supports serializing to XML. Then you could access those objects from any language.
The Unix "file" command is really useful. I don't know if there is anything like it on Windows. You run it like this:
file myfile.ext
And it spits out a text description based on the magic numbers and data contained therein.
It is probably included in Cygwin.
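The magic-number idea behind "file" is simple enough to sketch in Python. The table below holds only a handful of well-known signatures, and the names SIGNATURES and guess_type are made up for this example; the real "file" database is far larger and also looks at more than the first few bytes.

```python
# A few well-known signatures; the real "file" database is far larger.
SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "PNG image"),
    (b"%PDF-", "PDF document"),
    (b"PK\x03\x04", "ZIP archive (also DOCX, JAR, ...)"),
    (b"GIF87a", "GIF image"),
    (b"GIF89a", "GIF image"),
]


def guess_type(data):
    """Return a description if `data` starts with a known signature, else None."""
    for magic, description in SIGNATURES:
        if data.startswith(magic):
            return description
    return None
```

Reading the first 16 bytes of the file is enough for this table. A None result just means the format is not in the list.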
If you have access to the application that creates the file, you can apply changes to the application, then save the file and see the effects (Keep in mind that numbers are probably stored in little endian):
First, create the file repeatedly. If the files are not byte-for-byte identical, the current date/time is probably stored in the area where the differences occur.
Maybe you want to repeat that with the software running under different environments, to see if OS version etc are stored, but this is rather unusual.
Next you can try to change single variables and create several files that only differ in the value of this variable. This helps you identify where this variable is stored.
That way you can also exclude variables that are not stored in the file: If you change them, but the files created are identical, they are not stored.
In order to test the hypotheses you worked out with the steps above, edit one of the files and have the application read it.
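For the "change a single variable" step, it helps to search the file for the value you just set. This Python sketch (find_int32 is a made-up name) looks for a number encoded as a 4-byte signed integer in both byte orders. That the value is stored as a 32-bit integer at all is an assumption; it could just as well be 16-bit, a float, or text.

```python
import struct


def find_int32(data, value):
    """Return the offsets where `value` appears as a 4-byte signed integer,
    as a dict keyed by byte order ("little" or "big")."""
    hits = {}
    for name, fmt in (("little", "<i"), ("big", ">i")):
        needle = struct.pack(fmt, value)
        offsets = []
        pos = data.find(needle)
        while pos != -1:
            offsets.append(pos)
            # Continue one byte later so overlapping matches are not skipped.
            pos = data.find(needle, pos + 1)
        hits[name] = offsets
    return hits
```

Pick distinctive values such as 1234 instead of 0 or 1, otherwise you will get hits all over the file.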
If you don't have access to the application itself, I suggest that you forget about it and find another way to solve your problem. There is a very high probability that it will be faster...
If file does not give a meaningful answer, you may want to try TRiD by Marco Pontello to determine whether your data is stored in a known format.
Get the Delphi application, open it in the freeware version of IDA Pro, find where it writes the file, and decode how it writes the file that way.
Unless it's plain text.
Do you know the program that uses it? If so, you can hook that program's write-to-file function and get an idea of what data it's writing, the size of the data, and where.
More Info: http://www.codeproject.com/KB/DLL/Win32APIHooking_Trouble.aspx
Unlike traditional hex editors which only display the raw hex bytes of a file, 010 Editor can also parse a file into a hierarchical structure using a Binary Template. The results of running a Binary Template are much easier to understand and edit than using just the raw hex bytes.
http://www.sweetscape.com/010editor/
Try to open it in a hex editor and analyse.