I want to extract particular data from my image or PDF. For example, I have an invoice bill as a scanned document and I just want to extract the invoice number. I have already used Tesseract OCR, Apache Tika OCR, and Aspose OCR, so please suggest steps to get at this particular data. Thank you in advance.
You can extract specific content from a portion of the image by using custom recognition blocks. Please note that this solution is useful in scenarios where your documents/images follow a similar structure, that is, the content to be scanned is always at the same location in each image.
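If the invoice number always sits at a fixed position, the same fixed-region idea also works with the Tesseract stack you already tried. Here is a minimal sketch using the tess4j Java wrapper; the file name and the pixel coordinates of the region are hypothetical and would need to be measured from your own scans:

```java
import java.awt.Rectangle;
import java.io.File;
import net.sourceforge.tess4j.ITesseract;
import net.sourceforge.tess4j.Tesseract;

public class InvoiceRegionOcr {
    public static void main(String[] args) throws Exception {
        ITesseract ocr = new Tesseract();
        ocr.setDatapath("/usr/share/tessdata"); // path to the traineddata files; adjust for your install

        // Hypothetical coordinates: the pixel region where the invoice number
        // always appears on this particular document layout.
        Rectangle invoiceNumberRegion = new Rectangle(1200, 150, 500, 80);

        String text = ocr.doOCR(new File("invoice-scan.png"), invoiceNumberRegion);
        System.out.println("Invoice number region text: " + text.trim());
    }
}
```

Because the region should contain little besides the invoice number, a simple regex over this small output is usually enough to pull out the value itself.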
Furthermore, you can perform OCR on a PDF file using Aspose.OCR in combination with Aspose.Pdf. Visit the link Performing OCR on PDF Documents for details.
I work with Aspose as a Developer Evangelist.
Have you looked at using ABBYY FlexiCapture? That function is one of the primary aspects of what it does. In using products like FlexiCapture the issue becomes whether your document is of fixed or semi-structured design. For documents like invoices the answer is almost always semi-structured because the information moves around on the page. Also, there are usually many different layouts of invoices. ABBYY solved that challenge through their FlexiCapture for Invoice product.
As an alternative, if you just need to extract something like an invoice number from a region, there are ways in lower-priced products like ABBYY Recognition Server in which you could use what they call an area template, or you could extract all of the OCR text and develop an application that applies a regular expression to locate the field value adjacent to the field label. Problems can arise when the field label and the field value do not fall in proximity to each other in the OCR result text. This can happen when, after the line break of the field label ("invoice no" or "invoice #"), another value immediately follows in the first position of the next line. Then the OCR text could become something like "Invoice No. Bob's Bargain Barn 66422." A regex looking for the value immediately following the search phrase "Invoice No." would then produce the adjacent text "Bob's Bargain Barn." Worse, oftentimes the label text and invoice number will be within a table, complicating matters, as some OCR engines ignore tables altogether (not Recognition Server, though). It is for these reasons we researched FlexiCapture: it eliminated the fancy coding required for data extraction. It is expensive but worth it.
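To make the pitfall concrete, here is a small Java sketch (the sample text and both patterns are illustrative, not from any ABBYY product). The naive pattern grabs whatever token follows the label, while a pattern anchored to the expected shape of the value, digits in this example, recovers the number:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class InvoiceNumberRegex {
    public static void main(String[] args) {
        // OCR output where an unrelated value lands right after the label:
        String ocrText = "Invoice No. Bob's Bargain Barn 66422";

        // Naive: grab whatever token follows the label -- picks up "Bob's".
        Matcher naive = Pattern.compile("Invoice\\s*(?:No\\.?|#)\\s*(\\S+)")
                .matcher(ocrText);
        if (naive.find()) System.out.println("Naive match: " + naive.group(1));

        // Tighter: require a plausible invoice-number shape (4+ digits here),
        // searched anywhere after the label rather than immediately adjacent.
        Matcher shaped = Pattern.compile("Invoice\\s*(?:No\\.?|#).*?(\\d{4,})")
                .matcher(ocrText);
        if (shaped.find()) System.out.println("Shape match: " + shaped.group(1));
    }
}
```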
Disclosure: we are an ABBYY Partner.
I am doing some volunteer work for a charity that is using a couple of online systems to store their donors and related data. I would like to find a way to store a URL in a custom field so that they can put corresponding links between donors in one system in order to quickly find the same donor in the other system. The only built-in method in the products being used is to store a single value in a field labeled "website", which is originally intended to hold any website associated with the donor. I would like to avoid using this field if possible and instead create a custom field.
However, the rub is that the custom fields only have a handful of type options (clear text, date, currency, etc.); there is no option to store a URL or anything like rich text. I've thought of a couple of less-than-optimal ways to make the values stored in those fields clickable (a browser plugin or a proxy), however both of those have obvious drawbacks that I would like to avoid.
What I am wondering, and hoping someone has a possible answer for, is whether there are any ways of storing a value in a clear-text field that might disrupt or escape the underlying HTML encoding such that the displayed link is clickable. I already control the values being put into these fields (users cannot enter their own values; they are essentially read-only), so security isn't much of a concern.
I have very limited access or influence when it comes to system-level changes, but I would like to make this possible, as it would help them a great deal (their users are all volunteers with limited time and education). I've tried a few tricks but haven't found anything that doesn't get converted to Unicode or escaped (it could be that it's completely controlled at output; I simply don't know).
My current attempts have been limited to using the built-in form submission. I may explore their import and/or API methods, on the theory that those might allow better low-level access for storing the actual values in the system, but I'm still not certain what to try other than adding a raw anchor tag.
I have also tried an inline script to add the corresponding tag, however that seems to break the form submission method (perhaps it will work via CSV import or via the API).
Does anyone have suggestions for other things I could try before I go any further? I'm a bit of a novice and feel like there may be something else obvious I haven't tried.
How can I check whether a PDF already exists, or is about 80% the same, in MySQL?
Users want to upload PDFs.
But the problem is re-uploads.
My idea is to convert the PDF to binary,
so I would have a string "X" (the binary of that PDF) to save in MySQL,
and then SELECT ... LIKE '%...%' on a slice of it (from 1/3 of length(X) to 2/3 of length(X)).
Could that work?
I'm using Laravel.
Thanks for reading.
This cannot be done reasonably in MySQL. Since you are also using a PHP environment, it may be possible to do via PHP, but a general solution will require substantial effort.
PDF files are composed of (possibly compressed) streams of images and text. Several libraries can attempt to extract the text, and they will work reasonably well if the PDF was generated in a straightforward way; however, they will typically fail if some text was rendered as images of its characters, or if other obfuscation has been applied. In those cases, you will need to use OCR to recover the actual text as it appears when the PDF is displayed. Note also that tables and images are out of scope for these tools.
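As a sketch of that extraction step, this is what it could look like in Java using Apache PDFBox 2.x (an assumed choice of library; in a Laravel stack you could instead shell out to poppler's pdftotext utility):

```java
import java.io.File;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class PdfText {
    // Extracts the text layer of a PDF; returns an empty string for
    // image-only scans, which is the cue to fall back to OCR instead.
    public static String extract(File pdf) throws Exception {
        try (PDDocument doc = PDDocument.load(pdf)) {
            return new PDFTextStripper().getText(doc);
        }
    }
}
```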
Once you have two text files, finding overlaps becomes much easier, although there are several techniques. "Same 80%" can be interpreted in several ways, but let us assume that copying a contiguous 79% of the text from a file and saving it again should not trigger alarms, while copying 81% of that same text should trigger them. Any diff tool can provide information on duplicate chunks, and may be enough for your purposes. A more sophisticated approach, which however does not provide exact percentages, is to use the normalized compression distance.
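Here is a rough sketch of the normalized compression distance in plain Java, using the built-in Deflater as the compressor. Note the result is a distance roughly in [0, 1], not an exact overlap percentage, so the threshold corresponding to your "80%" rule would have to be calibrated empirically against your own documents:

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;

public class Ncd {
    // Compressed size of the input, used as an approximation of
    // Kolmogorov complexity in the normalized compression distance.
    static int compressedSize(byte[] data) {
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        deflater.setInput(data);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        while (!deflater.finished()) {
            out.write(buf, 0, deflater.deflate(buf));
        }
        deflater.end();
        return out.size();
    }

    // NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y));
    // close to 0 for near-duplicates, approaching 1 for unrelated texts.
    static double distance(String x, String y) {
        int cx = compressedSize(x.getBytes(StandardCharsets.UTF_8));
        int cy = compressedSize(y.getBytes(StandardCharsets.UTF_8));
        int cxy = compressedSize((x + y).getBytes(StandardCharsets.UTF_8));
        return (double) (cxy - Math.min(cx, cy)) / Math.max(cx, cy);
    }

    public static void main(String[] args) {
        String a = "The quick brown fox jumps over the lazy dog. ".repeat(50);
        String b = a.substring(0, a.length() / 2) + "Completely different tail text. ".repeat(25);
        System.out.printf("NCD: %.3f%n", distance(a, b)); // smaller = more similar
    }
}
```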
Say I have a collection of websites for accountants, like this:
http://www.johnvanderlyn.com
http://www.rubinassociatespa.com
http://www.taxestaxestaxes.com
http://janus-curran.com
http://ricksarassociates.com
http://www.condoaudits.com
http://www.krco-cpa.com
http://ci.boca-raton.fl.us
What I want to do is crawl each and get the names & emails of the partners. How should I approach this problem, at a high-level?
Assume I know how to actually crawl each site (and all subpages) & parse the HTML elements -- I am using Oga.
What I am struggling with is how to make sense of data that is presented in a wide variety of ways. For instance, the email address for the firm (and or partner) can be found in one of these ways:
On the About Us page, under the name of the partner.
On the About Us page, as a generic catch-all email.
On the Team page, under the name of the partner.
On the Contact Us page, as a generic catch-all email.
On a Partner's page, under the name of the partner.
Or it could be any other way.
One way I was thinking of approaching the email problem is just to search for all mailto anchor tags and filter from there.
The obvious downside for this is that there is no guarantee that the email will be for the partner and not some other employee.
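As a starting point, here is a rough sketch of the mailto idea; it is written in Java with jsoup purely for illustration (Oga offers equivalent CSS/XPath queries), and the URL is hypothetical:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class MailtoScan {
    public static void main(String[] args) throws Exception {
        // Hypothetical page to scan.
        Document doc = Jsoup.connect("http://www.example.com/about-us").get();

        // Every anchor whose href starts with "mailto:".
        for (Element a : doc.select("a[href^=mailto:]")) {
            String email = a.attr("href").substring("mailto:".length());
            // Keep the anchor's surrounding text too; it often names the person.
            System.out.println(email + "  (context: " + a.parent().text() + ")");
        }
    }
}
```

Keeping the surrounding text of each anchor gives you raw material for the name-matching step discussed in the answers below.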
Another, more obvious issue is detecting the partner(s)' names just from the markup. I was initially thinking I could just pull all the header tags and the text in them, but I have stumbled across a few sites that have the partner names in span tags.
I know SO is usually for specific programming questions, but I am not sure how to approach this and where to ask this. Is there another StackExchange site that this question is more appropriate for?
Any advice on specific direction you can give me would be great.
I looked at the http://ricksarassociates.com/ website and I can't find any partners listed at all, so in my opinion, first make sure you actually stand to gain from this; if not, you'd be better off looking for some other venture.
I have done similar data scraping from time to time, and in Norway we have laws - or should I say "laws" - saying that you are not allowed to email people, though you are allowed to email the company - so in a way it's the same problem from another angle.
I wish I knew maths and algorithms by heart, because I am sure there is a fascinating solution hidden in AI and machine learning, but in my mind the only solution I can see is building a rule set that over time probably gets quite complex. Maybe you could apply some Bayesian filtering - it works very well for email.
But - to be a little more productive here - one thing I know is important: you could start by creating the crawler environment and building the dataset. Have a database for URLs so you can add more at any time, and start crawling what you already have, so that you do your testing by querying your own 100% local copy of the data. This will save you enormous time compared to live scraping while tweaking.
I built my own search engine some years ago, scraping all .no domains, though I only needed the index file at the time. It took over a week just to scrape it down, I think it was 8 GB of data for that single file alone, and I had to use several proxy servers as well to make it work, due to problems with too much DNS traffic. Lots of problems needed taking care of. I guess I am only saying: if you are crawling at a large scale, you might as well start getting the data down now if you want to work efficiently with the parsing later.
Good luck, and do post if you find a solution. I do not think it is possible without an algorithm or AI though - people design websites the way they like and pull templates out of their arse, so there are no rules to follow. You will end up with bad data.
Do you have funding for this? If so, it's simpler: you could just crawl each site and build a profile for each one. You could employ someone cheap to manually go through the parsed data and remove all the errors. This is probably how most people do it, unless someone has already done it and the database is for sale / available as a web service so it can be queried.
The links you provide are mainly US sites, so I guess you are focusing on English names. In that case, instead of parsing HTML tags, I would just search the whole webpage for names (there are free databases of first names and last names). This may also work if you are doing this for other European companies, but it would be a problem for companies from some countries. Take Chinese as an example: while there is a fixed set of last names, one may use basically any combination of Chinese characters as a first name, so this solution won't work for Chinese sites.
It is easy to find an email in a webpage, as there is a fixed format of (username)@(domain name) with no spaces in between. Again, I would not treat it as an HTML tag but just as a normal string, so that the email can be found whether it is in a mailto tag or in plain text. Then, to determine what kind of email it is (a sketch of this logic follows the list below):
Only one email on the page?
Yes -> catch-all email.
No -> Is a name found on that page as well?
No -> catch-all email (a page can have more than one catch-all email, maybe for different purposes, like info + employment).
Yes -> each email should be attached to the name found right before it; it is normal for the name to appear before the email.
Then it should be safe to assume that the name appearing first belongs to a more important member, e.g. chairman or partner.
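Here is a rough sketch of this decision logic in Java; the NAME pattern is a tiny stand-in for a lookup against a real first-name database, and the sample text is invented:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EmailClassifier {
    private static final Pattern EMAIL =
            Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}");

    // Tiny stand-in for a real first-name database.
    private static final Pattern NAME =
            Pattern.compile("\\b(?:John|Mary|Rick|Susan)\\s+[A-Z][a-z]+\\b");

    public static void classify(String pageText) {
        List<int[]> spans = new ArrayList<>();
        Matcher m = EMAIL.matcher(pageText);
        while (m.find()) spans.add(new int[]{m.start(), m.end()});

        int prevEnd = 0;
        for (int[] span : spans) {
            String email = pageText.substring(span[0], span[1]);
            if (spans.size() == 1) {
                System.out.println(email + " -> catch-all (only email on the page)");
                break;
            }
            // Look for the nearest name between the previous email and this one.
            Matcher n = NAME.matcher(pageText.substring(prevEnd, span[0]));
            String owner = null;
            while (n.find()) owner = n.group(); // keep the last, i.e. nearest, match
            System.out.println(email + " -> " + (owner != null ? owner : "catch-all"));
            prevEnd = span[1];
        }
    }

    public static void main(String[] args) {
        classify("Rick Sarris, partner: rick@example.com. General inquiries: info@example.com");
    }
}
```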
I have done similar scraping for these types of pages, and it varies wildly from site to site. If you are trying to make one crawler that automatically finds the information, it will be difficult. However, at a high level it looks something like this.
For each site you check, look for element patterns. Divs will often have labels, IDs, and classes which will easily let you grab information. Perhaps you find that many divs share a particular class name; check for this first.
It is often better to grab too much data from a particular page and boil it down on your side afterwards. You could, perhaps, look for formatted text by type (is it a link?) or by regex (is it an email?). Names and occupations will be harder to find by this method, but they might be related positionally on many pages to other well-formatted items.
Names will often be prefixed with honorifics (Mrs., Mr., Dr.) or suffixed with credentials (JD, MD, CPA, etc.). You could come up with a bank of those and check any page you end up on against it.
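Here is a minimal sketch of such a check; the honorifics and credentials listed are just a small starting bank to extend:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HonorificNames {
    // Honorific before the name, or a credential suffix after it.
    private static final Pattern CANDIDATE = Pattern.compile(
            "\\b(?:Mr|Mrs|Ms|Dr)\\.?\\s+([A-Z][a-z]+(?:\\s+[A-Z][a-z]+)+)"
          + "|([A-Z][a-z]+(?:\\s+[A-Z][a-z]+)+),\\s*(?:CPA|JD|MD|Esq\\.?)");

    public static void main(String[] args) {
        String text = "Our partners: Mr. John Vanderlyn and Susan Rubin, CPA, lead the firm.";
        Matcher m = CANDIDATE.matcher(text);
        while (m.find()) {
            // Whichever alternative matched holds the name.
            System.out.println(m.group(1) != null ? m.group(1) : m.group(2));
        }
    }
}
```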
Finally, if you really wanted to make this process general-purpose, you could apply some heuristics to improve your methods based on expected information; names, for example, most often appear within a particular kind of list. If it were worth your time, you could check certain text against a list of the more common names.
What you mentioned in your initial question suggests that you would benefit a lot from a general-purpose regular-expression crawler, and you could make improvements to it as you learn more about the sites you interact with.
There are excellent posts on this topic with a lot of useful links throughout these webpages:
https://www.quora.com/What-is-a-good-web-scraper-for-pulling-emails-names-etc-even-if-the-contact-info-is-another-page-deep-a-browser-add-on-is-a-plus
http://www.hongkiat.com/blog/web-scraping-tools/
http://www.garethjames.net/a-guide-to-web-scraping-tools/
http://www.butleranalytics.com/15-web-scraping-tools/
Some of the reviewed applications also work on macOS.
I'm developing a small project using Access, SQL, Word and PowerPoint. The goal of this project is to allow the user to add their personal information in Access through user forms, register that information in the SQL database, and pass it to templates in Word and PowerPoint.
The templates are loaded, and the dynamic data in them filled in, when a button (Word or PowerPoint) is pressed in an Access form.
So, with Word it's all working well through the use of bookmarks, but in PowerPoint things get tricky: there are no bookmarks (damn you, Microsoft).
So I've been searching SO and Google, but I can't seem to find an example that works for me. All of the examples talk about creating a new PowerPoint presentation, which is not what I'm looking for.
The goal is just to add the dynamic data, in this case textboxes and a photo, and leave the static data in the template.
So the flow would be something like this:
SQL (database) -> Access (user interface) -> Word and PowerPoint (final templates).
If anyone has an example of just how to start this, I would be very grateful for your help.
Thank you all for your time in this matter.
I'm not clear on what the problem is exactly. Do you need to know how to start a new presentation based on your existing template? If so, you'd open your template file as though it were a presentation, then immediately save it AS a presentation (rather than starting a new presentation and applying the template, which will give you only the design but none of the contents, i.e. the slides, of the template file).
I also don't understand why you'd expect all applications, Word, Excel, PowerPoint and the rest, to be identical. If Word did everything that PowerPoint does, there'd be no need for one or the other of them. It stands to reason that they'll have different feature sets, as appropriate to their usage.
In any case, to add, say, user name as needed throughout a presentation, it's quite quick to look at each shape to see if it contains text, and if so, to replace, say %%USERNAME%% with the actual user name you're working with.
You can similarly look for special text, say PICTURE:portrait, within a shape and if found, insert a picture, size it to match the shape then delete or hide the original shape.
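In the asker's Access setup, this loop would run in VBA over each slide's Shapes collection (testing Shape.HasTextFrame). Purely to illustrate the same shape-iteration idea, here is a sketch in Java using Apache POI; the file names and the %%USERNAME%% replacement value are placeholders:

```java
import java.io.FileInputStream;
import java.io.FileOutputStream;
import org.apache.poi.xslf.usermodel.XMLSlideShow;
import org.apache.poi.xslf.usermodel.XSLFShape;
import org.apache.poi.xslf.usermodel.XSLFSlide;
import org.apache.poi.xslf.usermodel.XSLFTextParagraph;
import org.apache.poi.xslf.usermodel.XSLFTextRun;
import org.apache.poi.xslf.usermodel.XSLFTextShape;

public class FillTemplate {
    public static void main(String[] args) throws Exception {
        try (XMLSlideShow ppt = new XMLSlideShow(new FileInputStream("template.pptx"))) {
            for (XSLFSlide slide : ppt.getSlides()) {
                for (XSLFShape shape : slide.getShapes()) {
                    if (!(shape instanceof XSLFTextShape)) continue;
                    for (XSLFTextParagraph para : ((XSLFTextShape) shape).getTextParagraphs()) {
                        for (XSLFTextRun run : para.getTextRuns()) {
                            // Replace the %%USERNAME%% marker, keeping the run's formatting.
                            String text = run.getRawText();
                            if (text != null && text.contains("%%USERNAME%%")) {
                                run.setText(text.replace("%%USERNAME%%", "Jane Doe"));
                            }
                        }
                    }
                }
            }
            try (FileOutputStream out = new FileOutputStream("filled.pptx")) {
                ppt.write(out);
            }
        }
    }
}
```

One caveat of replacing at the run level: if the editor happened to split %%USERNAME%% across several formatting runs, the marker will not be found; replacing at the paragraph level avoids that, at the cost of flattening per-run formatting.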
I have to pull two pre-printed (not hand-written) fields out of a paper form, such that it can be automatically routed after being scanned. The fields contain batch and item identifiers, like "GG-9192" or "EPN/245G".
I've tried the following software:
Tesseract-OCR
Cuneiform
Canon ImageRunner built-in OCR
Asprise OCR Java API (demo)
I've tried the following settings:
Scanning at resolutions of 300dpi and 600dpi
Tried different fonts, including OCR-A and OCR-B.
In all cases output was pretty much all over the place. I can kick back documents for which I can't properly extract the necessary information, but I'm thinking it's going to be at least half of them. I considered some sort of fuzzy logic based on known values in a database, but sometimes these identifiers can differ by a single character, like "123G" and "123C".
Is this a lost cause? Perhaps OCR just isn't mature enough to handle a requirement of this nature? What other techniques might you recommend? Barcodes?
Edit: the containing application is in Java, so any recommendations for which there are free or cheap Java-based APIs would help.
Edit 2: if anyone is interested... without any special tuning, Cuneiform for Linux and the Canon ImageRunner worked best, with Tesseract-OCR and the Asprise Java API producing the worst results... none of the four was acceptable for anything but standard document-search-grade OCR. I'm beginning to think that this isn't going to work out.
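For what it's worth, the fuzzy-lookup idea from the question can still help if ambiguity is treated as a signal to kick the document back for manual review rather than guessing. Here is a sketch, assuming you can load the list of known identifiers from the database (the sample values are taken from the question):

```java
import java.util.ArrayList;
import java.util.List;

public class FuzzyMatch {
    // Classic dynamic-programming edit distance.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }

    // Returns the unique known identifier within maxDistance of the OCR output,
    // or null when zero or several candidates qualify (route those for review).
    static String resolve(String ocrValue, List<String> knownIds, int maxDistance) {
        List<String> hits = new ArrayList<>();
        for (String id : knownIds) {
            if (levenshtein(ocrValue, id) <= maxDistance) hits.add(id);
        }
        return hits.size() == 1 ? hits.get(0) : null;
    }

    public static void main(String[] args) {
        List<String> known = List.of("GG-9192", "EPN/245G", "123G", "123C");
        System.out.println(resolve("GG-9l92", known, 1)); // OCR read 'l' for '1' -> GG-9192
        System.out.println(resolve("123O", known, 1));    // ambiguous: 123G and 123C -> null
    }
}
```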
If you have control over the fields, why use a human-readable format in the first place? For scanning, it seems like a QR Code, or something similar would be best. It is marked for orientation, and has some built-in error correction.
http://en.wikipedia.org/wiki/QR_Code
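Here is a minimal sketch of the round trip with the ZXing Java library; the file names and the payload format (batch and item identifiers joined with a separator) are placeholders:

```java
import java.nio.file.Path;
import javax.imageio.ImageIO;
import com.google.zxing.BarcodeFormat;
import com.google.zxing.BinaryBitmap;
import com.google.zxing.MultiFormatReader;
import com.google.zxing.MultiFormatWriter;
import com.google.zxing.client.j2se.BufferedImageLuminanceSource;
import com.google.zxing.client.j2se.MatrixToImageWriter;
import com.google.zxing.common.BitMatrix;
import com.google.zxing.common.HybridBinarizer;

public class QrRouting {
    public static void main(String[] args) throws Exception {
        // Encode the batch/item identifiers at print time...
        BitMatrix matrix = new MultiFormatWriter()
                .encode("GG-9192|EPN/245G", BarcodeFormat.QR_CODE, 200, 200);
        MatrixToImageWriter.writeToPath(matrix, "PNG", Path.of("routing-code.png"));

        // ...and decode them from the scan; the built-in error correction
        // absorbs a fair amount of scanner noise.
        BinaryBitmap bitmap = new BinaryBitmap(new HybridBinarizer(
                new BufferedImageLuminanceSource(
                        ImageIO.read(Path.of("routing-code.png").toFile()))));
        System.out.println(new MultiFormatReader().decode(bitmap).getText());
    }
}
```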
I started digging for products starting with Tomato's suggestion. I tried ABBYY and CVISION. Both have products that can automate OCR:
CVISION Maestro Recognition Server 4.0
ABBYY Recognition Server 2.0
In addition, ABBYY has SDKs for various platforms, and CVISION has an SDK that appears to work with at least VB/VC++.
I haven't tried either SDK yet, and am not sure it's necessary for my project. All I need is PDFs coming in that I can extract the text from. I did however try CVISION's server product and with the OCR on its most accurate settings, it worked really well. I haven't tried ABBYY's server product yet because I have to go through a reseller to get a trial. I'm in the process of doing so, but if it starts getting annoying I'm probably going to go with CVISION. I did try ABBYY's FineReader standalone product, and it worked very well, so I assume that their server product would also.