Closed. This question is off-topic. It is not currently accepting answers.
Closed 10 years ago.
I need OCR software that can read a variety of invoice types and extract data from them. The exported data should be presented in a tabular format, preferably with a link to the source document. It must be able to read documents in a variety of formats (.pdf, .jpg, .gif, .tiff, etc.).
Wikipedia has a list of OCR software vendors. Professional OCR vendors such as ExperVision, ABBYY or Nuance will be a better choice. As far as I know, invoice formats are complex and variable, so a template has to be designed for each kind of invoice. Standard OCR is therefore not the best choice; this generally needs custom development, so you'd better choose an OCR vendor that provides customization services.
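To make the "one template per kind of invoice" idea concrete, here is a minimal sketch in Python. It assumes the OCR engine has already produced plain text; the vendor names, field names and regular expressions are all invented for illustration and are not taken from any of the products mentioned.

```python
import csv
import io
import re

# Hypothetical templates: one set of field patterns per invoice layout.
# Vendor names and patterns are invented for this example.
TEMPLATES = {
    "acme": {
        "detect": re.compile(r"ACME Corp", re.IGNORECASE),
        "fields": {
            "invoice_no": re.compile(r"Invoice\s*#\s*(\w+)"),
            "total": re.compile(r"Total\s*Due[:\s]+\$?([\d,]+\.\d{2})"),
        },
    },
    "globex": {
        "detect": re.compile(r"Globex", re.IGNORECASE),
        "fields": {
            "invoice_no": re.compile(r"Inv\.\s*No\.\s*(\w+)"),
            "total": re.compile(r"Amount\s+payable[:\s]+([\d,]+\.\d{2})"),
        },
    },
}


def extract_fields(ocr_text, source_path):
    """Return one row (dict) for the first template that matches, else None."""
    for name, template in TEMPLATES.items():
        if not template["detect"].search(ocr_text):
            continue
        # Keep a link back to the scanned document next to the extracted data.
        row = {"template": name, "source": source_path}
        for field, pattern in template["fields"].items():
            match = pattern.search(ocr_text)
            row[field] = match.group(1) if match else ""
        return row
    return None


def rows_to_csv(rows):
    """Serialize extracted rows to CSV text, one line per invoice."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["template", "source", "invoice_no", "total"]
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


sample_text = "ACME Corp\nInvoice # A1001\nTotal Due: $1,234.50\n"
row = extract_fields(sample_text, "scans/acme_0001.pdf")
print(rows_to_csv([row]))
```

Every new invoice layout needs its own entry in the table, which is why the effort grows with the number of layouts you have to support.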
How many forms do you want to process per day?
How many different types of invoices and layouts do you want to process?
How many different paper sizes?
How accurate do you need it to be?
How many people will be using the system for data correction / validation?
What system are you exporting the data to?
How many fields do you want to extract automatically?
Expect to pay for such a solution if you want it to read any and every type of invoice. The solutions listed below are high-end and not cheap.
http://www.documation.co.uk/emc_captiva.html
http://www.captaris-dt.com/product/dokustar-capturesuite/en/
http://www.abbyy.com/data_capture_software/
http://www.kofax.com/forms-processing/
http://www.readsoft.com/
The cost will depend on how many invoices you want to process. My best guess is that the ABBYY product will probably be the cheapest option.
If you have a limited number of document types to read, you may get away with a simpler fixed-form OCR solution as opposed to the free-form solutions above.
Also, the scanning solution you need will depend very much on your volumes.
KnowledgeLake makes one that integrates with SharePoint. It is also made to integrate with other kinds of systems.
http://www.knowledgelake.com/
Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
Closed 4 years ago.
I have a large number of English OCRed documents from the 19th century and want to clean up some of the OCR errors by using a contextual spell-checker such as the one proposed by Peter Norvig at http://norvig.com/spell-correct.html. My main goal is to use a probabilistic model (together with the OCRed text data and an appropriate, large dictionary) to correct words that are misspelled.
I am happy to use the code that Norvig gives on his website and improve it, but before I do so, I would like to ask if there is an open-source solution for this. Norvig himself suggests looking at aspell, but I don't think that aspell is a contextual spell-checker, and I'm worried it might not work so well for OCR error correction.
So, you're looking for a spell checker that will substitute the most probable choice whenever there is a phrase or word it doesn't understand? That seems like it would be a bad idea on 19c texts unless you have a large corpus of such texts that have already been spell-checked by hand. Words that were commonplace then but rare now will be replaced without your knowledge. I daresay, you may find a contextual spell-checker trained on modern locution to be tetotaciously exflunctified by your 19c phraseology. ☺
If you have such a corpus, or you're up for creating one, there is a powerful Python-based tool for OCR and analysis called OCRopus. It uses natural language processing, neural networks and many other buzzwords; I think I saw "deep learning" on the to-do list. It does not appear easy to use, though I admit I've never tried it myself. It seems to require skill at the command line and programming in Python. If you're still not daunted, it may be exactly what you're looking for.
On the other hand, if you are looking for something simpler, consider using a program with a standard spell checker. For example, gImageReader can read in your PDF files, OCR them, and let you correct and add the words it doesn't know. I suggest at least trying a simple spell checker before searching for something more complicated.
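To show why the training corpus matters, here is a minimal sketch in the spirit of Norvig's corrector, written in Python. It is a simplified variant (single-edit candidates only, word frequencies only, no context), and the one-sentence "period corpus" is a made-up stand-in for a real hand-checked 19th-century corpus.

```python
import re
from collections import Counter

LETTERS = "abcdefghijklmnopqrstuvwxyz"


def tokenize(text):
    """Lower-case the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def edits1(word):
    """All strings one delete, transpose, replace or insert away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [
        left + right[1] + right[0] + right[2:]
        for left, right in splits
        if len(right) > 1
    ]
    replaces = [left + c + right[1:] for left, right in splits if right for c in LETTERS]
    inserts = [left + c + right for left, right in splits for c in LETTERS]
    return set(deletes + transposes + replaces + inserts)


def correct(word, counts):
    """Keep known words; otherwise pick the most frequent known word one edit away."""
    if word in counts:
        return word
    candidates = [w for w in edits1(word) if w in counts]
    if not candidates:
        return word  # nothing plausible: leave the token alone
    return max(candidates, key=lambda w: counts[w])


# Stand-in for a large, hand-checked corpus of period text (an assumption).
period_corpus = "I daresay the carriage hath arrived and the horses are weary"
counts = Counter(tokenize(period_corpus))

print(correct("tbe", counts))       # the
print(correct("carriagc", counts))  # carriage
print(correct("hath", counts))      # hath (kept, because the corpus knows it)
```

Because "hath" and "daresay" occur in the period corpus they are left untouched; with counts built from modern text only, they would be candidates for "correction".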
Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 8 years ago.
I was just thinking about possible ways to build temporary login systems. One idea is to have a set of standard images, each showing a jumbled-up word, and have users type in the word. I would have a MySQL table where every image has a unique id, a link and an answer key. That way the page just has to choose a random number, then GET the photo where id = random number, then compare what the user types in to the answer key of that photo.
I'm not currently trying to build this system. It seems very simple, and I was just wondering whether it is a secure system that would work.
So my question really is: would there be security risks with this? Is it robust enough to keep out bots, or would my site be destroyed 10 seconds after implementing it?
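For reference, the scheme described above can be sketched in a few lines of Python. This is only an illustration: an in-memory SQLite database stands in for the MySQL table, and the table name, column names and sample rows are assumptions.

```python
import hmac
import random
import sqlite3

# In-memory SQLite stands in for the MySQL table described above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE captcha (id INTEGER PRIMARY KEY, link TEXT, answer TEXT)")
conn.executemany(
    "INSERT INTO captcha (id, link, answer) VALUES (?, ?, ?)",
    [
        (1, "/img/a1.png", "orchid"),
        (2, "/img/b7.png", "lantern"),
        (3, "/img/c3.png", "harbor"),
    ],
)


def pick_challenge(conn, rng=random):
    """Choose a random row and return only what the browser may see."""
    ids = [row[0] for row in conn.execute("SELECT id FROM captcha")]
    chosen = rng.choice(ids)
    (link,) = conn.execute(
        "SELECT link FROM captcha WHERE id = ?", (chosen,)
    ).fetchone()
    return chosen, link


def check_answer(conn, challenge_id, typed):
    """Compare the typed word with the stored answer key for that image."""
    row = conn.execute(
        "SELECT answer FROM captcha WHERE id = ?", (challenge_id,)
    ).fetchone()
    if row is None:
        return False
    # compare_digest avoids leaking how many leading characters matched.
    return hmac.compare_digest(row[0].encode(), typed.strip().lower().encode())


challenge_id, link = pick_challenge(conn)
print(challenge_id, link)
```

Note that the answer never leaves the server, but the pool of images is fixed, which matters for the security question.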
What you're describing sounds exactly like a CAPTCHA system. These are used widely to prevent bots from issuing automated requests against an interface. The problem is that it's hard to make images that a bot can't just interpret anyway.
"Outsmarted: Captcha security not much of a gotcha" is an article about some Stanford researchers who developed an image-recognition tool (which is not publicly available) to test CAPTCHA implementations:
Decaptcha was able to decode 66 percent of the Captchas used by Visa's Authorize.net payment site, 70 percent of Blizzard Entertainment's Captchas -- the company's games include World of Warcraft and Diablo -- and 25 percent of Wikipedia's. About one-fifth of Digg.com's Captchas and almost that many of CNN.com's were decodable.
The researchers recommended Google's reCAPTCHA as a much more effective system. You can add a reCAPTCHA widget to your own website. This would be safer and easier than developing your own system and then finding it to be too weak.
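If you go the reCAPTCHA route, the server still has to verify the token that the widget posts with the form. A rough sketch in Python, assuming Google's `siteverify` endpoint and its `secret` / `response` form fields; the helper names are my own, and the HTTP call is injectable so the logic can be exercised without network access:

```python
import json
import urllib.parse
import urllib.request

VERIFY_URL = "https://www.google.com/recaptcha/api/siteverify"


def http_post_form(url, fields):
    """POST form-encoded fields and return the response body as text."""
    data = urllib.parse.urlencode(fields).encode("ascii")
    with urllib.request.urlopen(url, data=data, timeout=10) as response:
        return response.read().decode("utf-8")


def verify_recaptcha(secret_key, client_token, post=http_post_form):
    """Return True only if the verification service reports success."""
    body = post(VERIFY_URL, {"secret": secret_key, "response": client_token})
    return bool(json.loads(body).get("success"))


# Usage inside a form handler (secret key and token come from your own setup):
# if not verify_recaptcha(MY_SECRET_KEY, form["g-recaptcha-response"]):
#     reject the request
```

The secret key must stay on the server; only the site key belongs in the page.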
Short answer: No, it's not secure. If someone really wants to hack your system, they can build their own database of image-word pairs.
The key is to spend less on security than a compromise of your system would cost you, so I wouldn't invest too much in a security system (it sounds like you don't really have sensitive information to hide).
BUT you have an easy and free solution. You can use reCAPTCHA: not only is it much more secure, you'll also help digitize some useful information.
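The image-word database weakness is easy to see in a small simulation. This Python sketch is purely illustrative: the pool of five images is invented, and the "attacker" simply remembers each image after it has been solved by hand once.

```python
import random

# Invented pool: image id -> answer, as the site would store it.
POOL = {1: "orchid", 2: "lantern", 3: "harbor", 4: "meadow", 5: "copper"}


def simulate(pool, rounds, seed=0):
    """Return (images labelled by hand, challenges answered from memory)."""
    rng = random.Random(seed)
    learned = {}  # the attacker's own image-id -> word table
    automated = 0
    for _ in range(rounds):
        image_id = rng.choice(sorted(pool))
        if image_id in learned:
            automated += 1  # answered from the table, no human needed
        else:
            learned[image_id] = pool[image_id]  # solved by hand once, then remembered
    return len(learned), automated


labelled, automated = simulate(POOL, rounds=1000)
print(labelled, automated)
```

With a fixed pool, the manual effort is bounded by the number of images, while the number of automated requests is unbounded. That is the core reason a static image table does not keep bots out for long.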
Closed. This question is off-topic. It is not currently accepting answers.
Closed 9 years ago.
This is not something I want to do myself, but it's a question/problem I can't get out of my head.
If you distribute open-source programs, classes or libraries, how can you make sure the user has purchased a license? Would it not be very easy for programmers to just remove the licensing part of the product and distribute it, or to use a pirated version?
Take Invision Power Board, for instance. It is written in PHP (i.e., completely open and editable), and you have to buy a license to be able to use it. How can they enforce this? Do they authenticate the forum against their servers? If they do, would it not be easy to simply remove this function?
Another example that I have even more trouble understanding is Highcharts, a JS library for drawing graphs. They offer a free version with their name on each graph. If you purchase the product, the label is gone. How do they do this?
I know this question is a bit broad and open-ended, but I am just asking how to prevent people from simply editing out the license check. What is the essence of this?
There are no license purchases for true "open source" libraries or programs, because the essence of open source is that the code is free and you can build/deploy it yourself at will.
What you're talking about is commercial software that might use a codebase that is easily visible/editable. It's not marketed as "open source," but the source code is easily accessible and potentially easily modified.
There are various mechanisms for obfuscating or hiding the content of the code that some products would choose to use, which make modifying the code more difficult. For example, there are various ways of pre-compiling PHP code rather than distributing the raw files (see this question for examples).
However, the biggest thing that you lose out on with pirated software of this sort is support. If you're a serious user of a complex piece of software, especially a business user, you would typically want to know that you have a commercial support plan in place for any critical software. The kind of users who would crack or pirate such software (that is, individuals or small companies) aren't likely to be as significant to the vendor.
On the internet there's a further obvious avenue: if a significant public site were using an unlicensed copy of Invision Power Board, the vendor would soon notice and could demand a suitable license (or take legal action).
Ultimately, this kind of abuse is very difficult to prevent if someone is determined enough: you are very much at the whim of your users.
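For illustration only, here is one way a vendor-side license check could look, sketched in Python. This is a hypothetical scheme and not a description of how Invision Power Board or Highcharts actually work: the vendor signs the licensed domain with a secret that stays on its own server, and the product calls home to have the key verified.

```python
import hashlib
import hmac

# Hypothetical secret that never leaves the vendor's server.
VENDOR_SECRET = b"demo-secret-kept-on-the-vendor-server"


def _sign(domain):
    """HMAC-SHA256 signature of the licensed domain."""
    return hmac.new(VENDOR_SECRET, domain.encode("utf-8"), hashlib.sha256).hexdigest()


def issue_license(domain):
    """Create a license key bound to one domain (done at purchase time)."""
    return f"{domain}:{_sign(domain)}"


def verify_license(license_key, requesting_domain):
    """Server-side check: the key must be genuine and match the caller's domain."""
    domain, _, signature = license_key.rpartition(":")
    genuine = hmac.compare_digest(signature.encode(), _sign(domain).encode())
    return genuine and domain == requesting_domain


key = issue_license("forum.example.com")
print(verify_license(key, "forum.example.com"))   # True
print(verify_license(key, "other.example.com"))   # False
```

A key cannot be forged without the secret, but whoever holds the source can still delete the call that performs the check, which is exactly the limitation described above.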
Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
Closed 5 years ago.
I had an idea for my first mobile application and I was thinking of making it in HTML5 + jQuery Mobile. The core functionality is:
take a picture of a receipt
digitize all the information on it
I've never made a mobile app before and I'm not sure if this is possible. If there is no API available, how would I go about rolling my own receipt reader? Thanks! Please let me know if I am being stupid.
Edit: I found a service that lets me use their application to take a picture of the receipt (or e-mail the picture) and have it extract the necessary information: http://www.proongo.com/b/receipt-reading.php. I'm not exactly sure how to use this service, but I will do more research tomorrow and share what I find.
I found an OCR API service called OCRAPISERVICE with a number of different pay-per business models. They have a number of examples hosted on GitHub for various mobile OSs through PhoneGap. They also have a free-trial model that lets you submit 100 requests.
I guess you need an OCR-based software solution that can recognize supermarket receipts. There are many open-source OCR solutions, such as Tesseract. However, they are aimed at general-purpose OCR, so you will need some additional tools to recognize receipts in a mobile app.
We recently worked on a web-based app for receipt recognition. You can find some details of the research here: http://rnd.azoft.com/applying-ocr-technology-receipt-recognition/
Besides Tesseract, all the big players (Google, Microsoft and IBM) now have their own OCR API offerings. These APIs provide simple image-to-text OCR scans with varying degrees of accuracy. I find Google Vision to be the most accurate for pictures of receipts. You would still need to extract the data from the half-garbage text, though.
If you want an API that returns field metadata such as total amount, tax amount, date and merchant information, which your apps can consume directly, check out https://www.taggun.io. I built the APIs specifically for this purpose.
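As a rough idea of what pulling data out of the raw OCR text involves if you do it yourself, here is a small Python sketch. The sample receipt, the patterns and the MM/DD/YYYY date convention are all assumptions; real receipts need far more tolerant rules.

```python
import re
from datetime import date

AMOUNT = re.compile(r"(\d{1,6}[.,]\d{2})\b")
# Assumes US-style MM/DD/YYYY dates; adjust for other locales.
DATE = re.compile(r"\b(\d{1,2})[/.-](\d{1,2})[/.-](\d{4})\b")
# OCR often confuses the letter O with the digit 0, so accept both.
TOTAL_LABEL = re.compile(r"\bT[O0]TAL\b", re.IGNORECASE)


def parse_receipt(ocr_text):
    """Pull the total amount and purchase date out of noisy OCR text."""
    total = None
    purchase_date = None
    for line in ocr_text.splitlines():
        if total is None and TOTAL_LABEL.search(line):
            amount = AMOUNT.search(line)
            if amount:
                total = float(amount.group(1).replace(",", "."))
        if purchase_date is None:
            found = DATE.search(line)
            if found:
                month, day, year = (int(part) for part in found.groups())
                try:
                    purchase_date = date(year, month, day).isoformat()
                except ValueError:
                    pass  # digits that do not form a real date: keep looking
    return {"total": total, "date": purchase_date}


sample = (
    "SUPER MART\n"
    "03/14/2019 18:22\n"
    "MILK 2.49\n"
    "BREAD 1.99\n"
    "SUBTOTAL 4.48\n"
    "TAX 0.36\n"
    "T0TAL 4.84\n"
    "~~ THANK Y0U ~~\n"
)
print(parse_receipt(sample))  # {'total': 4.84, 'date': '2019-03-14'}
```

The word-boundary check keeps "SUBTOTAL" from being mistaken for the total line.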
Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
Closed 7 years ago.
As part of a recent programming project I compiled a database, the contents of which may conceivably be of use to someone else one day. I'm looking for the best way to 'open source' the data.
I could (and probably will) upload the SQL onto GitHub, but was wondering if anyone had found a more 'data-centric' way of sharing - maybe a website that makes it easy for users to browse/query/visualise/improve data sets, rather than just giving them a big lump of SQL.
To clarify, I'm looking for a place where I can share the data, rather than a format in which to share it - ideally a data-set equivalent of GitHub/Sourceforge.
The data is relatively small (a few thousand lines of SQL) so the volume should not be an obstacle.
I'm a big fan of Amazon's S3 for stuff like this. And if your data set is interesting enough, maybe you could publish it with InfoChimps.
I have worked with a lot of data from different companies. Most often this data has been in a delimited text format, the most popular being comma-separated or tab-separated. Using commas is often a good choice because MySQL can also export and import CSV. Here is an example:
id, first_name, last_name, address
1, John, Smith, 11222 Street Name
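A file like the example above can be generated straight from the database. Here is a minimal sketch in Python; an in-memory SQLite database stands in for the real one, and the table and column names simply mirror the example.

```python
import csv
import io
import sqlite3

# In-memory SQLite stands in for the real database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE people (id INTEGER PRIMARY KEY, first_name TEXT, "
    "last_name TEXT, address TEXT)"
)
conn.execute(
    "INSERT INTO people VALUES (1, 'John', 'Smith', '11222 Street Name, Apt 4')"
)


def query_to_csv(conn, query):
    """Run a query and return its result as CSV text with a header row."""
    cursor = conn.execute(query)
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow([column[0] for column in cursor.description])
    writer.writerows(cursor)
    return buffer.getvalue()


csv_text = query_to_csv(
    conn, "SELECT id, first_name, last_name, address FROM people ORDER BY id"
)
print(csv_text)
```

Letting the `csv` module do the writing means values that themselves contain commas (like the address here) are quoted automatically, which hand-built comma-joined strings tend to get wrong.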
Google Fusion Tables ticks some of these boxes, although the emphasis seems to be on visualisation (I haven't used it, so this may be unfair). I am also reluctant to commit too heavily to any second-tier Google products these days, since they have a habit of disappearing.
You could export it to XML, which is probably the most widely compatible data format, although it is rather verbose. Another solution is OData, but this implies hosting both the data and the platform that serves it, which may not be desirable.
SparkFun is another possibility. It seems to be mainly targeted at real-time data sources, but they offer free storage, and the platform is open source, so you can host your own server.