Invoice automatic data extraction OCR or PDF [closed]

I am looking for a solution to extract data from my invoices to send a summary to my accountant.
There are some companies out there which provide such services for around 20€ a month, and invoices are usually recognised very well. But the services I tried don't extract all the data I'd like, or are missing some functionality, like an Excel export to send the data to my accountant. And paying 20€ a month and having to manage yet another service for 5 invoices per month hasn't appealed to me yet.
I did a little research and found this Stack Overflow question:
Can anyone recommend OCR software to process invoices?
It's a bit outdated and I hope to find some more up-to-date recommendations. I tried the Ephesoft community edition and it looked very promising at first. But the software has a learning step and a review step. Inside the review step the data doesn't seem to be fed back to the learning step. Plus it feels more cumbersome than just doing it by hand. I assume it's made for big businesses.
I am looking for a simple data extraction software, which learns with each step I show it.
I also had a look at Apache Tika, but it doesn't seem ready to use with a simple web-interface.
Do you have any recommendations for paid OCR services? Flexible enough to extract total VAT amount / VAT % / total amount / total amount currency / VAT currency / which account it was paid with / company name, with an export to Excel?
Do you have some recommendations for open source software?
Do you have some general advice on how you handle your few (fewer than 50 a year) invoices?

Besides raw OCR and regexes on top of that (which may work fine for some very limited use cases; a minimal sketch appears at the end of this answer), there are several other options which offer API access. These you can actually start using without any demo or sales process:
TagGun - specializes in receipts, can extract line items too, free for 50 receipts monthly
Elis - specializes in invoices, supports a wide variety of templates automatically (a pre-trained machine learning model), free for under 300 invoices monthly
If you are willing to go through the sales process (and they actually seem to be real and live):
LucidTech and Itemize (not sure what their accuracy is or which fields they extract, as their API details are non-public)
FlexiCapture Engine - based on templates, if you are willing to define one for each specific invoice format
(disclaimer: I'm affiliated with Rossum, the vendor of Elis. Feel free to suggest edits adding other APIs!)
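For completeness, here is a minimal sketch of the raw OCR + regex route mentioned at the top of this answer. The invoice text and the patterns are made up for illustration; this approach only holds up while a supplier keeps exactly the same wording and layout:
import re

# Hypothetical text as an OCR engine (e.g. tesseract) might return it
ocr_text = """
ACME GmbH
Invoice No. 2018-0042
Total amount: 119.00 EUR
VAT 19%: 19.00 EUR
"""

# Very naive patterns -- they break as soon as the supplier changes the wording
patterns = {
    "total_amount": r"Total amount:\s*([\d.,]+)\s*([A-Z]{3})",
    "vat": r"VAT\s*([\d.,]+)\s*%:\s*([\d.,]+)",
}

for field, pattern in patterns.items():
    match = re.search(pattern, ocr_text)
    if match:
        print(field, match.groups())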

Sypht provides an API to do this: http://www.sypht.com.
Python client: https://github.com/sypht-team/sypht-python-client
Step 1
pip install sypht
Step 2
from sypht.client import SyphtClient, Fieldset

sc = SyphtClient('<client_id>', '<client_secret>')

# Upload the invoice and request the "document" and "invoice" fieldsets
with open('invoice.png', 'rb') as f:
    fid = sc.upload(f, fieldsets=["document", "invoice"])

print(sc.fetch_results(fid))
Disclaimer: I am affiliated with the vendor

Check out Veryfi
It extracts 50+ fields from receipts and invoices, including line items, in 3-5 seconds.
It's ready to use out of the box (i.e. no need to train it), with high-accuracy results, and supports over 30 languages/regions.
> pip install veryfi
from veryfi import Client

veryfi_client = Client(client_id, client_secret, username, api_key)
categories = ['Grocery', 'Utilities', 'Travel']  # list of your categories
file_path = '/tmp/invoice.jpg'

# Submit the document for processing and print the extracted fields
response = veryfi_client.process_document(file_path, categories=categories)
print(response)
Here is a detailed overview of how to use it:
https://www.veryfi.com/engineering/invoice-data-capture-api/
I'm a co-founder of Veryfi, so do not hesitate to ask any questions.

Extracting data from invoices is a complex problem. I haven't seen any open-source solutions yet. OCR is just one part of the data extraction process. You need image preprocessing, an AI engine for data recognition, etc.
There are many solutions to this problem, and every one of them is a bit different. @Peter Baudis already mentioned some of them.
They go from very simple:
OCR SPACE Receipt scanning - extracts data in a table format, but you still need to parse it and determine which part of the text is, e.g., the invoice number (see the sketch at the end of this answer)
To more advanced:
Nanonets - machine learning API with many solutions (invoices, tax forms, ...)
typless - single call API for any document (invoices, purchase orders, ...), free for 50 invoices per month
Parascript - templating system, similar to ABBYY FlexiCapture
It is important to know what your use case is. There is no one-size-fits-all solution. It depends on what you are trying to achieve:
Data mining - It must be cheap and fast. Missing or incorrect data are not mission-critical. You can clean it in data analysis.
Automation in an enterprise - Trained repeating invoices must work almost 100%. Speed and new invoices are not mission-critical.
Automation in e.g., customs - It is essential that as much data as possible is returned. Accuracy on the whole set is vital, but probably every document will be reviewed anyway.
Therefore, you should test them and see how they fit into your process/needs.
Disclaimer: I am one of the creators of typless. Feel free to suggest edits.
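To illustrate the simpler end of the spectrum, here is a rough sketch of the OCR SPACE step from the list above followed by a naive parsing pass. The endpoint and response fields are taken from OCR.space's public REST API as I remember it, and the regex is made up, so treat this as a starting point rather than a tested integration:
import re
import requests

OCR_URL = "https://api.ocr.space/parse/image"

# Send the invoice to the OCR service; check the current docs for the exact options
with open("invoice.pdf", "rb") as f:
    resp = requests.post(OCR_URL, files={"file": f}, data={"apikey": "<your_api_key>"})

text = resp.json()["ParsedResults"][0]["ParsedText"]

# The parsing step is still up to you, e.g. a naive search for the invoice number
match = re.search(r"Invoice\s*(?:No\.?|Number)[:\s]*([A-Z0-9-]+)", text, re.IGNORECASE)
print(match.group(1) if match else "invoice number not found")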

You can try out Nanonets; there is a sample in this GitHub repo:
https://github.com/NanoNets/invoice-processing-with-python-nanonets
import requests, json

model_id = "Your Model Id"    # from your Nanonets dashboard
api_key = "Your API Key"
image_path = "Invoice Path"   # path to the invoice image

url = 'https://app.nanonets.com/api/v2/ObjectDetection/Model/' + model_id + '/LabelFile/'

# Upload the file and let the trained model label it
data = {'file': open(image_path, 'rb'), 'modelId': ('', model_id)}
response = requests.post(url, auth=requests.auth.HTTPBasicAuth(api_key, ''), files=data)

# Pretty-print the JSON result
print(json.dumps(json.loads(response.text), indent=2))

Related

Barcode using Ms Access [closed]

I am planning to use an HID barcode reader to spool data, which will then be read as a data source for Microsoft Access. Is this possible? Can I do it in the background? Thanks.
To answer your question, if you have a barcode reader that creates a .CSV or .TXT file with a list of barcodes, yes, you should be able to import the list into Access. (Any valid .CSV file, and most well-structured .TXT files.)
This Stack Overflow post shows how to load a CSV file using VBA.
And here's how to do it manually.
Questions about the specific model are off-topic for this site but since I was the one that asked for that information, I did look into it quickly...
Symcode MJ2090
It looks like this product is made specifically for sale on Amazon/eBay, and every page I clicked has the identical copy/pasted description.
It raises an alarm for me that the "standard description" doesn't specify how the data is output to the computer other than "USB, No Driver Required".
Also, the Chinese manufacturer's sketchy site gave me a browser security warning, and then doesn't even list this product in their list of BCRs. Perhaps it was a failed product that they unloaded cheap to resellers.
I've bought cheap USB electronics in the past (recent example: a SIM card reader/writer) where one would assume the necessary software is included, but that's not always the case; since the description didn't actually say it includes software, they didn't break any rules, and the item is now non-returnable due to the delay, etc.
Technically, if I were so inclined (and skilled in the correct areas) I could write software to communicate with my device, but that would be the equivalent of writing a printer driver from scratch.
My point is, be 100% sure how the device sends the data to the computer before purchasing, or else shell out a few extra bucks for a known brand name instead of a no-name product.
I didn't look very closely, but when I searched eBay for "USB barcode reader", sorted by "lowest price + shipping", the first result was this one at $18 USD (free shipping), which specifically says:
Supported Interfaces: RS232 / PS2 keyboard / USB
...although it's wired.
Or this one at $25 USD (free shipping), which is wireless and says it:
Supports instant upload mode and storage mode (store 200 barcodes).
...which sounds promising, but "supports" doesn't mean it "does it"... however, it's easy to contact the seller and find out.
Price aside, looking at a reputable store, I think this $80 USD model would work for you, but you'll need to look into the documentation from the [reputable] manufacturer (Motorola) further to confirm. (I've never bought one.)
Or, I betcha this $10000 model will work too. :-)

Machine-learning friendly data organization [closed]

There are a lot of tutorials online about different machine learning tools (neural networks and various related techniques like DL, ID trees, SVMs). When I do small-scale machine learning, in Python or MATLAB or equivalent, I usually have a CSV file with features and a CSV file with labels. I load these files into memory and then organize them as demanded by the program (e.g., a matrix for MATLAB).
I am collecting data on the performance of a system in real time. Every few minutes, I collect a lot of data, and currently I store it in a JSON format of {key: value} pairs, etc. I usually collect this data and store it for just an hour to see how my system is doing. What I want to do instead is keep it and try to do some machine learning on it. I am wondering what the rules of thumb are for organizing datasets for machine learning, especially because I am not sure what kind of ML I want to do (this is an exploration project, so I am trying to figure out a way to enable myself to do the most exploration).
I read this blog article: https://www.altexsoft.com/blog/datascience/preparing-your-dataset-for-machine-learning-8-basic-techniques-that-make-your-data-better/
The companies that started data collection with paper ledgers and ended with .xlsx and .csv files will likely have a harder time with data preparation than those who have a small but proud ML-friendly dataset.
It said that .csv data sets are not friendly for ML. Are there some ways to save the data that are considered to be more optimal for ML?
Here are a few use cases that I am thinking about:
Classification using point-in-time data paired with a label
Classification using time series (organized in a single matrix) paired with a label
Regression: predict the value of X given a matrix of its time-series values
I do not have a particular problem in mind. Rather, I want to start to set up this data set in a way that enables Machine Learning in the future.
My question is: what are the more popular ways to store data so as to enable machine learning?
Some options:
CSV organized by time:
Time_stamp, feature1, feature2, feature3,...,featureN
Time_stamp, feature1, feature2, feature3,...,featureN
Time_stamp, feature1, feature2, feature3,...,featureN
...
And some starter labels (that may or may not be augmented later)
Time_stamp, label1, label2....labelN
Time_stamp, label1, label2....labelN
Time_stamp, label1, label2....labelN
JSON-style key-value pairs:
{
time_stamp: _,
feature1: _,
feature2: _,
...,
featureN:_,
label1:_,
label2:_,
label3:_,
...
}
Say I decide that I want to use time series to predict labels... Then I would have to get time-series data all into one feature set for labels.
I understand that there are many ways to tackle this (one being: forget about organization - just write an API and, when you figure out a problem to solve, produce a nicely organized data set for that problem), but really, I wonder what the rules of thumb are for designing the data-side infrastructure for machine learning in industry and academia.
Some issues that arise:
What if you want to add a new feature?
What if you have a new label?
What if you do not want to consider just single point time features, but use time-series of features in analysis?
I do not know much about databases, so wisdom is appreciated, and so are feature storage related online resources. Most of the ones I find have to do with the models, or the ML infrastructure - not the enablement or the data organization piece I am interested in.
For most of the machine learning libraries I have worked with (tensorflow, keras, scikit-learn, R), data is usually worked with in a tabular format (like CSV), because under the hood many machine learning algorithms are implemented using fast linear algebra code. So I am not sure what the article is on about, but storing data in the CSV format is fine.
Data cleaning, organisation and storage are big topics. Your data cleaning pipeline (and your whole training process) should be reproducible, this paper has some nice principles to keep in mind. This article by Hadley Wickham has some nice thoughts about how to organise data in a tabular format. If your dataset is complicated or you are going to be frequently reusing it, it's probably worth storing in a database and I recommend picking up a guide to SQL and also data warehousing.
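As a concrete illustration of the tabular point, here is a minimal sketch of one common pattern: append each JSON record to a file as it arrives (one record per line), then flatten everything into a DataFrame when you want to explore or train. The file name, field names, and the rolling feature are all made up for this example:
import json
import pandas as pd

# Hypothetical "JSON Lines" file, one record appended every few minutes, e.g.:
#   {"time_stamp": "2019-01-01T10:00:00", "feature1": 0.3, "feature2": 7, "label1": 1}
records = []
with open("metrics.jsonl") as f:
    for line in f:
        records.append(json.loads(line))

# Flatten into the tabular shape most ML libraries expect
df = pd.DataFrame(records)
df["time_stamp"] = pd.to_datetime(df["time_stamp"])
df = df.sort_values("time_stamp").set_index("time_stamp")

# Adding a new feature (or label) later is just a new column; rows without it become NaN
df["feature1_15min_mean"] = df["feature1"].rolling("15min").mean()

# Point-in-time classification: X holds the feature columns, y a label column
X = df[["feature1", "feature2", "feature1_15min_mean"]]
y = df["label1"]
print(X.shape, y.shape)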

What "other features" could be incorporated into a train database?

This is a mini project for DBMS course. My task is to develop a Database for management of passenger trains.
I'm designing tables for Customers, Trains, Ticket Booking (via Telephone & Internet), Origins and Destinations.
He said, we are free to incorporate other features in our Database Model. Some of the features that we can include are as listed:
Ad-hoc Querying
Data Mining
Demographic Passenger Mapping
Origin and Destination Mapping
I have no clue what these features mean. I know about data mining, but I'm unable to apply it in this context. Can anyone kindly expand on these features or suggest new ideas?
EDIT: What is Ad-hoc Querying? Give an example in this context.
Data mining would incorporate extracting useful facts/figures out of the data obtained by your system & stored in the database. For example, data mining might discover that trains between city x and y are always 5 minutes late, or are never at more than 50% capacity, etc. So you may wish to develop some tools or scripts that automatically run and generate statistics (graphs are best) which display this information and highlight unusual trends. In the given example, the schedulers could then analyse why the trains are always late (e.g., maybe the train speedos are wrong?).
Both points 3. and 4. are a subset of data mining imo. There is a huge amount of metrics you could try to measure, it is just really whatever you can think of. If you specify what type of data you are going to collect, that will make making suggestions easier.
Basically, data mining just means "sort the data to find interesting facts".
Based on the comment below, you could look for things like the following (a rough sketch of an ad-hoc query appears after this list):
% of internet vs. phone sales
popular destinations & origins
customers age/sex/location
usage vs. time of day
...
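To make the "ad-hoc querying" item from the question concrete, here is a small sketch using SQLite from Python. The bookings table and its columns are entirely made up; the point is just that an ad-hoc query is any one-off question you ask of the data, as opposed to a canned report:
import sqlite3

conn = sqlite3.connect("trains.db")  # hypothetical database file

# Ad-hoc query: % of internet vs. phone sales, assuming a bookings table with a 'channel' column
query = """
SELECT channel,
       COUNT(*) * 100.0 / (SELECT COUNT(*) FROM bookings) AS pct_of_sales
FROM bookings
GROUP BY channel;
"""
for channel, pct in conn.execute(query):
    print(channel, round(pct, 1), "%")

# Another one: the five most popular origin/destination pairs
top_routes = conn.execute("""
    SELECT origin, destination, COUNT(*) AS trips
    FROM bookings
    GROUP BY origin, destination
    ORDER BY trips DESC
    LIMIT 5;
""").fetchall()
print(top_routes)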

Is Data Dynamics Reports appropriate for my needs? [closed]

We currently use ActiveReports (by Data Dynamics, now Grape City) for canned reports, but are considering moving up to their Reports package. If you've used it, I'd love to hear your take on:
Performance - do you feel it will scale well for a web based app (particularly compared with ActiveReports)
Export to Excel - it appears to provide a much cleaner export to Excel (ActiveReports' Excel export is awful, our biggest reason for considering a switch)
Other pros/cons (my company is pretty small, the $3,000 for 2 licenses is a lot for us)
Here is some additional information for you to consider about ActiveReports & Data Dynamics Reports:
ActiveReports Licensing:
Their license is per developer. There are no royalties. You can write as many applications as you want and deploy your application to as many users or as many servers as you want without any additional costs. Read the ActiveReports License agreement here.
Reporting to Excel:
First of all, schooner is absolutely correct that all the other reporting tools handle exporting to Excel poorly. We recognized the same after many years of experience with ActiveReports. Frankly, it is a very hard problem to take reports designed to be paginated or deployed on the web and put them into the cell-based layout of a spreadsheet.
However, with Data Dynamics Reports, we took a completely different approach. Instead of creating just another "export to Excel", where we look at "paginated" report output and try to fit it into a spreadsheet somehow, we generate the Excel output based on two things: a template and the actual data in the report.
By using a template, which is actually a specially formatted Excel sheet (cells have some special placeholders in them), the reporting engine can output the report's content to an Excel sheet completely independently of how the report is laid out when paginated. We call this concept a "Transformation Extension" for Excel, since it takes the report's content and transforms it to Excel based on a template.
By default, DDReports will generate a default template that, more often than not, has pretty good output. However, if the Excel output is not what you want, you can instruct DDReports to save the template so you can customize the output in Excel.
The best way to get an introduction to this is to watch the screencast for the Excel Transformation Extension in Data Dynamics Reports here. Jump to about 1:20 in the screencast if you get impatient and want to see an example of a simple template. Keep in mind this is a very simple template and the possibilities are much more sophisticated. Unfortunately, we haven't published very good documentation on the Excel transformation extension template syntax yet, but let me know if you have questions and I'll help you out! Just comment on this post or send an email to our support team.
Scott Willeke
Data Dynamics / GrapeCity
I've used it and it rocks! It has a Report Designer control that allows your users to build their own reports on the fly, and it supports multiple data sources in a single report. Best reporting tool on the market, bar none.
We use both products and they are quite different from each other. I have been a long-time user of ActiveReports and have loved it. But when it came time to select a .NET reporting tool, we did not want to spend a bunch of $$, so we decided to get their DDR product. It took me a couple of weeks to get used to it, as I kept trying to use it like ActiveReports. Not a good idea. Anyway, once you get used to it, it does a decent job. There are some things they need to do to improve the product. Here are the things that stand out.
You cannot access the control collection in the code area. This is a huge problem if you want to change anything like data binding inside the report.
The database connection has to be refreshed if you reopen the report in the designer. This took a while to figure out, and we wondered why our fields would not show up in preview mode when we reloaded the report.
Their new tech support is terrible. They were bought out recently, and now when you call tech support you get someone with no knowledge who always tells you that someone will call you back. 80% of the time you get no call back. The other 20% of the time you get a sample emailed to you that has nothing to do with your issue. This is across the board with both products. They used to have great tech support. I hope they fix this.
Those are the main problems, and I know they are working to solve the issues. Like I said, we use both DDR and ActiveReports. If you need to do complicated reports, stick with ActiveReports. If they are simple and you do not want to spend a lot of money, then DDR works fine. I see DDR getting better with each release, but it will take a while to get the kinks worked out.
Just my opinion
I've only used ActiveReports as well, but their web licensing model is a bit expensive in general, in my view, especially if you need to develop multiple apps on multiple servers. Then there are the per-developer costs as well.
I use DevXpress XtraReports and have been fairly happy with it so far and it has some fairly decent export functionality and a much better licensing model.
Regarding export to Excel, I've not seen any reporting tool do it well, mainly due to the formatting issues with the report itself. What we typically do is provide the formatted report to the user, along with an additional link for an Excel export which is a similar but different query with the raw data the report uses.
Another option over formatted printable reports is using grids such as Infragistics which allow you to do sorting, grouping, summaries, and which has excellent Excel export features.
This is to give more information to Bill's response in this thread. I tried to post a comment, but ran out of room :)
Bill, thanks for your honest assessment. Let me give some comments for you from the inside on the issues you mentioned:
1: Admittedly it is not quite as intuitive to access the controls collection as it was with AR, but you /can/ do it. You need to do it outside of the report (not in the script/code embedded into the report). To do it you can load the rdlx file in a ReportDefinition object. For example:
// Load the report definition from the .rdlx file
var rpt = new DataDynamics.Reports.ReportDefinition(new FileInfo("myfile...rdlx"));
// Look up the list and text box by name, then change the text box's binding expression
var list = (DataDynamics.Reports.ReportObjectModel.List)rpt.Report.Body.ReportItems["myList"];
var txt = (DataDynamics.Reports.ReportObjectModel.TextBox)list.ReportItems["myTextBox"];
txt.Value = "=Fields!MyField.Value";
However, depending on the scenario you're after there may be a better way to handle this than changing the binding on the control/reportItem itself. It is difficult to say more without knowing more about your particular scenario/goals.
2: There was recently some discussion I was involved in on how to improve this in the very near future. The dev team was gathering use cases and doing some investigation on various caching strategies to keep hitting the database to an absolutely minimum in the designer. So look for improvements in this area in an upcoming build.
3: Unfortunately, we're working through some challenges with our new technical support team. However, we are improving constantly and we're working hard to bring up the new guys as quickly as possible. If you have a problem with one of your incidents with support feel free to email me personally with your case number and I'll work to try get your case escalated or help out in any way I can (scott dot willeke at grapecity dot com).
Thanks again for your feedback, my next letter is an internal one based on your feedback to help us improve!
Scott Willeke
Program Manager
Data Dynamics / GrapeCity inc.
I have used this product since 2004. Great performance, and licensing was great. The migration from earlier versions was great. It had its flaws, like ghost images in high-speed, high-volume production environments, missing some of the goodies you get with Crystal, and barcode issues. But the engine was fast. Then came version 7. What a mess!! Rendering a 4 x 4 label went from 320 ms to 800 ms. Try getting a patch... good luck with that. Getting someone on the phone suddenly became like winning the lottery. If performance is not a factor and you need just simple reports, go for it. Otherwise, think twice. As for us, this is the last version, if our QA can pass it. We're shopping for a replacement product.
They are good and I am not trying to frighten you, but below are the facts, from my perspective:
Pros
Active Community ... you can expect responses overnight.
Good stuff to get you started - walk-throughs, tutorials, examples, videos, etc.
Internal builds - just like Linux kernel patches, you can get "hot fixes" for the problems their developer team was able to solve
Web report viewer is available and also works within Visual Studio - just like other reporting tools.
Cons
Weak rendering engine - you cannot expect reports to be exported to Word/Excel without issues if you use a sub-report in a table row.
Poor bug fixes - it can take over a year to fix a bug. I have been following one since 11-11-2011, and they still keep saying "we will let you know as soon as we fix this bug".
Not very active in releasing stable versions - it sometimes takes a year for them to release the next stable version.
Low control over rendering - you may not use events if you wish to embed some code, but Data Dynamics does provide VB.NET (and only VB.NET!) Custom Code support, which you may use for validation-type stuff.
I am sharing some links for your reference:
forums | How to section | Walkthrough(s) | Useful resources | drill throughs | videos | Convert Crystal reports (Remember: vice versa is not possible) | online help / Documentation - User Guide | Web Report Viewer

Is there a business proven cloud store / Key=>Value Database? (Open Source) [closed]

I have been looking for cloud computing / storage solutions for a long time (inspired by Google Bigtable). But I can't find an easy-to-use, business-ready solution.
I'm searching for a simple, fault-tolerant, distributed Key=>Value DB like SimpleDB from Amazon.
I've seen things like:
The CouchDB Project : Simple and distributed, fault-tolerant Database. But it understands only JSON. No XML connectors etc.
Eucalyptus : Nice Amazon EC2 interfaces. Open Standards & XML. But less distributed and less fault-tolerant? There are also a lot of open tickets with XEN/VMWare issues.
Cloudstore / Kosmosfs : Nice distributed, fault tolerant fs. But it's hard to configure. Are there any java connectors?
Apache Hadoop : Nice system which offers much more than the ability to store data. Uses its own Hadoop Distributed File System and has been tested on clusters with 2000 nodes.
Amazon SimpleDB : Can't find an open-source alternative! It's a nice but expensive system for huge amounts of data. And you're tied to Amazon.
Are there other, better solutions out there? Which one is the best to choose? Which one has the fewest single points of failure (SPOF)?
How about memcached?
The High Scalability blog covers this issue; if there's an open source solution for what you're after, it'll surely be there.
Other projects include:
Project Voldemort
Lightcloud - Key-Value Database
Ringo - Distributed key-value storage for immutable data
Another good list: Anti-RDBMS: A list of distributed key-value stores
MongoDB is another option which is very similar to CouchDB, but it uses a query language very similar to SQL instead of map/reduce in JavaScript. It also supports indexes, query profiling, replication, and storage of binary data.
It has a huge amount of documentation, which might be overwhelming at first, so I would suggest starting with the Developer's tour
Wikipedia says that Yahoo both contributes to Hadoop and uses it in production (article linked from Wikipedia). So I'd say it counts as business-proven, although I'm not sure whether it counts as a K/V database.
Not on your list is the Friendfeed system of using MySQL as a simple schema-less key/value store.
It's hard for me to understand your priorities. CouchDB is simple, fault-tolerant, and distributed, but somehow you exclude it because it doesn't have XML. Are XML and Java connectors an unstated requirement?
(Anyway, CouchDB should in fact be excluded because it's young, its API isn't stable, and it's not a key-value store.)
I use Google's Google Base API; it's XML, free, documented, cloud-based, and has connectors for many languages. I think it will fit the bill if you want free hosting too.
Now, if you want to host your own servers, Tokyo Cabinet is your answer. It's key=>value based, uses flat files, and is the fastest database out there right now (very barebones compared to, say, Oracle, but incredibly good at storing and accessing data: about 1 million records per second, with about 10 bytes of overhead, depending on the storage engine). As for being business-ready, Tokyo Cabinet is the heart of a service called Mixi, which is the Japanese equivalent of Facebook+MyPage, with several million heavy users, so it's actually very battle-proven.
If you want something like Bigtable, you can't go past HBase or Hypertable - they're both open-source Bigtable clones. One thing to consider, though, is if your requirements really are 'big enough' for Bigtable. It scales up to thousands of tablet servers, and as such, has quite a bit of infrastructure under it to enable that (for example, handling the expectation of regular node failures).
If you don't anticipate growing to, at the very least, tens of tablet servers, you might want to consider one of the proposed alternatives: you can't beat BerkeleyDB for simplicity, or MySQL for ubiquity. If all you need is a key/value datastore, you can put a simple 'dict' wrapper around your database interface and switch out your backend if you outgrow one.
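For illustration, here is a minimal sketch of such a 'dict' wrapper in Python, backed by SQLite purely as an example (the class and table names are made up); the same interface could later sit in front of BerkeleyDB, MySQL, or a Bigtable clone:
import sqlite3

class KVStore:
    """Dict-like wrapper so the backend can be swapped out later."""

    def __init__(self, path="kv.db"):
        self.db = sqlite3.connect(path)
        self.db.execute("CREATE TABLE IF NOT EXISTS kv (k TEXT PRIMARY KEY, v TEXT)")

    def __setitem__(self, key, value):
        self.db.execute("INSERT OR REPLACE INTO kv (k, v) VALUES (?, ?)", (key, value))
        self.db.commit()

    def __getitem__(self, key):
        row = self.db.execute("SELECT v FROM kv WHERE k = ?", (key,)).fetchone()
        if row is None:
            raise KeyError(key)
        return row[0]

store = KVStore()
store["user:42"] = "Alice"
print(store["user:42"])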
You might want to look at hypertable which is modeled after google's bigtable.
Use The CouchDB
What's wrong with JSON?
JSON to XML is trivial
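To back up the "trivial" claim, here is a toy Python conversion for flat documents (nested structures need a bit more care; the element names are just examples):
import json
import xml.etree.ElementTree as ET

doc = json.loads('{"id": "42", "name": "Ada", "city": "London"}')

# Turn each top-level key into a child element
root = ET.Element("document")
for key, value in doc.items():
    ET.SubElement(root, key).text = str(value)

print(ET.tostring(root, encoding="unicode"))
# <document><id>42</id><name>Ada</name><city>London</city></document>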
You might want to take a look at this (using MySQL as key-value store):
http://bret.appspot.com/entry/how-friendfeed-uses-mysql
Cloudera is a company that commercializes Apache Hadoop, with some value-add of course, like productization, configuration, training & support services.
Instead of looking for something inspired by Google's Bigtable, why not just use Bigtable directly? You could write a front-end on Google App Engine.
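For the flavour of it, here is a rough sketch of a key=>value front-end on the App Engine datastore, using the Python google.appengine.ext.db API from around the time of this answer. The model and function names are made up, and this only runs inside the App Engine SDK/runtime:
from google.appengine.ext import db

class KVEntry(db.Model):
    value = db.TextProperty()

def put_value(key, value):
    # key_name gives direct lookups by key, Bigtable-style
    KVEntry(key_name=key, value=value).put()

def get_value(key):
    entry = KVEntry.get_by_key_name(key)
    return entry.value if entry else None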
Good compilation of storage tools for your question :
http://www.metabrew.com/article/anti-rdbms-a-list-of-distributed-key-value-stores/
Tokyo Cabinet has also received some attention, as it supports table schemas, key-value pairs, and hash tables. It uses Lua as an embedded scripting platform and HTTP as its communication protocol. Here is a great demonstration.