I have a small problem while trying to do OCR using the tool gocr. It sometimes recognizes an o as a zero and vice versa. To solve this, I tried to make it use a user-specified database path, but doing that would require me to create a map for all possible characters. Is there any way I could tell gocr to use the manual database for only these two characters?
Thanks
I'd suggest you just put those two characters in the db.lst file, and use it along with the ocr-engine...
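For what it's worth, gocr's database is just a directory of small glyph images plus a db.lst index, where each line maps an image file to the string it stands for. So the whole database for your case could be as small as two entries (file names here are hypothetical; you would clip the samples from your own scans):

o_sample.pbm o
zero_sample.pbm 0

You would then point gocr at that directory with -p and enable database recognition with -m 2; double-check the mode bits against your gocr man page, since I may be misremembering them.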
so long
hank
I would like to see if I can scan a sign-in sheet for a class. The good news is I know 90% of the names that might be written.
My idea was to use Tesseract to parse an image of the names, and then use Levenshtein distance to compare each line with the list of names in my database; if I get a reasonably close match, then that name is right.
Does this approach sound like a good one? If not, other ideas?
I tried using tesseract on a sample sheet (see below)
I used:
tesseract simple.png -psm 4 outtxt
Tesseract Open Source OCR Engine v3.05.01 with Leptonica
Warning. Invalid resolution 0 dpi. Using 70 instead.
Error in boxClipToRectangle: box outside rectangle
Error in pixScanForForeground: invalid box
I am assuming it didn't like line 2 because I went below the line.
The results I got were:
1.. AM: (harm;
l. ’E (J 22 a 00k
2‘ wau \\) [HQ
4. KIM TAYLOE
5. LN] Davis
6‘ Mzflé! Ha K
Obviously not the greatest. My guess is the distance matches for 4 and 5 would work, but the rest are not even close.
I have control of my sign-in sheet, but not the handwriting of the folks coming in, so if there are any changes I can make to the sheet to help, please let me know.
Since your goal is to get names only, I would suggest reducing tessedit_char_whitelist to the characters you actually expect ("ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789.") so that you will not get unexpected characters like \\) [ in the output.
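On recent Tesseract 3.x builds you can pass the whitelist straight on the command line with -c (older builds only accept it through a config file, so verify against your build):

tesseract simple.png outtxt -psm 4 -c tessedit_char_whitelist=ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789.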
Your initial approach of computing the Levenshtein distance is fine, provided you succeed in extracting text from the handwritten image (which is a hard task for Tesseract).
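For that matching step, here is a minimal sketch in plain Java (the class, method names, threshold, and roster entries are mine, purely illustrative, not from your setup):

import java.util.Arrays;
import java.util.List;

public class NameMatcher {

    // Classic dynamic-programming Levenshtein distance, two-row variant.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }

    // Return the roster name closest to an OCR'd line, or null when nothing is close enough.
    static String bestMatch(String ocrLine, List<String> roster, int maxDistance) {
        String best = null;
        int bestDist = Integer.MAX_VALUE;
        for (String name : roster) {
            int d = levenshtein(ocrLine.toUpperCase(), name.toUpperCase());
            if (d < bestDist) { bestDist = d; best = name; }
        }
        return bestDist <= maxDistance ? best : null;
    }

    public static void main(String[] args) {
        List<String> roster = Arrays.asList("KIM TAYLOR", "LIZ DAVIS", "MARK HALL");
        System.out.println(bestMatch("KIM TAYLOE", roster, 3)); // prints KIM TAYLOR
    }
}

Tuning maxDistance relative to the name length (say, a third of it) keeps short names from matching everything.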
I would also suggest running some preprocessing on your image. For example, you could remove the horizontal lines and extract the text ROIs around them. In the best case you will be able to extract separated characters, but even if you can't, you will get better results and will be able to distinguish the resulting names line by line.
You should also try the other recommended output-quality improvement steps, which you can find in the Tesseract OCR wiki (link).
Let's say I have a text file that looks like this:
<number> <name> <type> <inputs...>
1 XOR1 XOR A B
2 SUM XOR 1 C
What would be the best approach to generate the truth table for this circuit?
That depends on what you have available, and how big your file is.
Perl is optimized for reading files and generating simple text output. It doesn't have a library of boolean operators, but they're easy enough to write. I'd use that if I just wanted text-in, text-out.
If I wanted to display the data online AND generate a results file, I'd use PHP to read the data and write the table to a CSV file that could either be opened in Excel, or posted online in an HTML table.
If your data is in a REALLY BIG data file, I'd use SQL.
If your data is in a really huge file that you want to be accessible to authorized users online, and you want THEM to be able to create truth tables, I'd use Oracle's APEX to create an easy interface for them to build their own truth tables and play around with the data without altering it.
If you're in an electrical engineering environment, use the tools designed for your problem -- Verilog or similar.
Whatcha got? Whatcha wanna do with it?
-- Ada
I prefer using C#. I already have the code to 'parse' the input text file. I just don't know where to start in terms of actually 'simulating' it. The output can simply be a text file with inputs and output values – Don 12 mins ago
How many inputs and how many outputs are in the circuit you want to simulate?
The size of the simulation determines how it can most easily be run. If the circuit is small(ish), you can enter the inputs and circuit values into vector arrays, then cross them to get the output matrix.
Matlab is ideal for this, as it was written for processing arrays.
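If you'd rather stay in a general-purpose language (you mention C#, which translates almost line for line), here is a rough Java sketch of the same brute-force idea. It assumes every operand is defined before it is used and that gate outputs are referenced by their line number, as in your sample; treat it as a starting point, not a full simulator:

import java.util.*;

public class TruthTable {
    public static void main(String[] args) {
        // Netlist in the question's format: <number> <name> <type> <inputs...>
        String[] netlist = { "1 XOR1 XOR A B", "2 SUM XOR 1 C" };

        // Gate ids are the first tokens; primary inputs are operands that are not gate ids.
        Set<String> gateIds = new HashSet<>();
        for (String line : netlist) gateIds.add(line.split("\\s+")[0]);
        LinkedHashSet<String> inputSet = new LinkedHashSet<>();
        for (String line : netlist) {
            String[] tok = line.split("\\s+");
            for (int i = 3; i < tok.length; i++)
                if (!gateIds.contains(tok[i])) inputSet.add(tok[i]);
        }
        List<String> inputs = new ArrayList<>(inputSet);

        // Enumerate all 2^n input combinations; evaluate gates in file order.
        for (int mask = 0; mask < (1 << inputs.size()); mask++) {
            Map<String, Boolean> val = new HashMap<>();
            for (int i = 0; i < inputs.size(); i++)
                val.put(inputs.get(i), ((mask >> i) & 1) == 1);
            String last = null;
            for (String line : netlist) {
                String[] tok = line.split("\\s+");
                boolean out = val.get(tok[3]);
                for (int i = 4; i < tok.length; i++) {
                    boolean v = val.get(tok[i]);
                    if (tok[2].equals("XOR")) out ^= v;
                    else if (tok[2].equals("AND")) out &= v;
                    else if (tok[2].equals("OR")) out |= v;
                }
                val.put(tok[0], out); // gate output, addressable by its number
                last = tok[0];
            }
            StringBuilder row = new StringBuilder();
            for (String in : inputs)
                row.append(in).append('=').append(val.get(in) ? 1 : 0).append(' ');
            System.out.println(row + "-> " + val.get(last));
        }
    }
}

For a handful of inputs the 2^n enumeration is instant; it only becomes a problem past a few dozen inputs.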
Again: Whatcha got, and whatcha wanna do with it?
-- Ada
I am looking for an open-source OCR (maybe Tesseract) that matches words against a dictionary. For example, I know that this OCR will only be used to search for certain names. Imagine I have a written master guest list, and I want to scan it in under a second with the OCR and check it against a database of names.
I understand that a traditional OCR can attempt to read every letter, and then I could just cross-reference the results with the 100 names, but this takes too long. If the OCR were focusing on just those 100 words and nothing else, it should be able to do all this in a split second. That is, there is no point in guessing that a word might be "Jach", since "Jach" isn't a name in my database. The OCR should be able to infer that it is "Jack", since that is an actual name in the database.
Is this possible?
It should be possible. Think of it this way: instead of having your OCR look for 'J', it could look for 'Jack' directly, treating the whole word as an individual symbol.
So when you train / calibrate your OCR, train it with images of whole words, just as you would for an individual symbol.
(If this feature is not directly available in your OCR, then first map images of whole words to a unique symbol and later transform that symbol into the final word string.)
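If the OCR ends up being Tesseract, note that it can also bias its dictionary toward a fixed word list through a user-words file, one word per line. Assuming your hundred names live in a names.txt you maintain, the invocation would look roughly like this (flag availability varies across versions, so treat it as a sketch and check tesseract --help):

tesseract guestlist.png out --user-words names.txt

This doesn't hard-constrain the output to the list the way whole-word symbols would, but it raises the odds that 'Jach' comes out as 'Jack'.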
I'm working on a UTF-8 Persian website with an integrated MySQL database. All the content on the website is entered through an admin panel, and it's all Persian.
As you might know, the Arabic alphabet shares most of its letters with Persian, with a few exceptions.
The problem is that when a person types on a keyboard with an Arabic layout, it produces "ي" as the character, while a keyboard with a Persian layout produces "ی".
So if a person searches for 'بازی', MySQL won't return rows containing 'بازي'.
Important note: 'ی' is not the only character with this property; there are lots of them, and they are very similar.
How can I fix this issue?
One simple, naive solution seems to be replacing all "ي" with "ی" before importing the data into the database, but I'm searching for a more robust solution than this.
Dear EBAG, we have a single Arabic block in Unicode which contains both Arabic and Persian characters.
U+06CC is the Persian ی and U+064A is the Arabic ي.
The default Windows keyboard uses code page 1256 for Arabic characters, which makes U+064A the default ي for both Persian and Arabic users, because there are far more Arabic users than Persian ones.
ISIRI defined a standard keyboard layout, ISIRI 9147, which carries both the Arabic and the Persian Yeh, with the Persian ی as the default. Persian users on the standard layout will type (and store) the standard Persian ی, while the rest will type the Arabic ي.
As you said, the usual approach is to change the Arabic ي to the Persian ی while saving data to the database; when reading, we just query for the Persian form, so everything stays consistent.
The second approach is to use a JavaScript file in the web application to control user input. Most Persian websites use this approach to save characters to the database. With this method the user doesn't need to install any Persian or Arabic keyboard layout: he or she just sets the keyboard to English, and in the JavaScript file the developer checks which character each keystroke is equivalent to. You can find an ISIRI 9147 JavaScript for web applications and a Persian guide to using it.
The third approach is to use an on-screen keyboard that works just like the previous one, but with a user interface; it is usually good for those who are not familiar with the Persian keyboard.
The fourth approach is to search for both forms. As you know, when you install MySQL or SQL Server you can set the collation, and you also have options for accent (and case) sensitivity. If you enable an Arabic collation that is accent-insensitive, you can get results for both forms; this usually works fine in SQL Server, though I haven't tested it in MySQL. This is the best solution yet.
But if I were you, I would implement a simple SQL function that takes an nvarchar and returns an nvarchar, call it whenever writing data, and then, whenever reading, just query for the standard form.
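For SQL Server, such a function might look like the following sketch (the function name and the two mapped pairs are just examples; extend the nesting with the other look-alike characters you care about):

CREATE FUNCTION dbo.NormalizePersian (@s NVARCHAR(MAX))
RETURNS NVARCHAR(MAX)
AS
BEGIN
    -- NCHAR(1610) = Arabic Yeh (U+064A), NCHAR(1740) = Persian Yeh (U+06CC)
    -- NCHAR(1603) = Arabic Kaf (U+0643), NCHAR(1705) = Persian Keheh (U+06A9)
    RETURN REPLACE(REPLACE(@s, NCHAR(1610), NCHAR(1740)), NCHAR(1603), NCHAR(1705));
END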
Sorry for the long tail.
update TABLENAME set COLUMNNAME=REPLACE(COLUMNNAME,NCHAR(1610),NCHAR(1740))
or
update TABLENAME set COLUMNNAME=REPLACE(COLUMNNAME,'ي',N'ی')
This is called a collation. It's what MySQL uses to compare two different characters. I'm afraid I don't know anything about Persian or Arabic, but the concept is the same. Essentially you've got two characters which map to the same base value. You need to find a collation which maps ي to ی. I'm afraid that's as helpful as I can be without knowing more about the language.
The first letter (ي) is yāʾ in the Arabic alphabet.
The second letter (ی) is ye in the Perso-Arabic alphabet.
More on the Perso-Arabic alphabet here:
http://en.wikipedia.org/wiki/Perso-Arabic_alphabet
"Two dots are removed in the final ye (ی). Arabic differentiates the final yāʾ with the two dots and the alif maqsura (except in Egyptian Arabic), which is written like a final yāʾ without two dots.
Because Persian drops the two dots in the final ye, the alif maqsura cannot be differentiated from the normal final ye. For example, the name Musâ (Moses) is written موسی. In the final letter in Musâ, Persian does not differentiate between ye or an alif maqsura."
Seems to be an interesting problem...
I was struggling with a similar situation 5 or 6 years ago, when Lucene was not an option for MySQL and there was no Sphinx (I have never tried Sphinx on this). What I did was collect pretty much all of the possible alternations and put them in an array in PHP.
So if the input keyword contained any of those characters, I generated all the possible alternates of it.
So for the input 'بازی' I would have generated {'بازي', 'بازی'} and then queried MySQL for both, with the simplest query being something like:
SELECT title, Description FROM Games WHERE Description LIKE '%بازي%' OR Description LIKE '%بازی%'
The primary list of alternatives is not very long, though.
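The same trick ports easily; here is a sketch of the expansion in Java (the character table is abbreviated to the Yeh and Kaf pairs; extend it with the other look-alikes you collect):

import java.util.*;

public class VariantExpander {

    // Each ambiguous character maps to all of its interchangeable forms.
    static final Map<Character, char[]> ALTS = new HashMap<>();
    static {
        char arabicYeh = '\u064A', persianYeh = '\u06CC';     // ي / ی
        char arabicKaf = '\u0643', persianKeheh = '\u06A9';   // ك / ک
        ALTS.put(arabicYeh,    new char[] { arabicYeh, persianYeh });
        ALTS.put(persianYeh,   new char[] { arabicYeh, persianYeh });
        ALTS.put(arabicKaf,    new char[] { arabicKaf, persianKeheh });
        ALTS.put(persianKeheh, new char[] { arabicKaf, persianKeheh });
    }

    // Generate every spelling of the keyword.
    static List<String> expand(String keyword) {
        List<String> out = new ArrayList<>();
        expand(keyword.toCharArray(), 0, out);
        return out;
    }

    private static void expand(char[] buf, int pos, List<String> out) {
        if (pos == buf.length) { out.add(new String(buf)); return; }
        char[] alts = ALTS.getOrDefault(buf[pos], new char[] { buf[pos] });
        for (char c : alts) {
            buf[pos] = c;
            expand(buf, pos + 1, out);
        }
    }
}

Each string in the result then feeds one (ideally parameterized) LIKE clause, exactly as in the query above.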
If you have the possibility to switch DB engines, you might want to look into the full-text search functionality of PostgreSQL:
http://www.postgresql.org/docs/9.0/static/textsearch.html
Among other things, you can configure it so that it indexes/searches unaccented characters, and you can define all sorts of additional dictionaries (e.g. stop words, thesaurus, synonyms, etc.).
If not, consider using Sphinx or Lucene instead of LIKE statements for your searches.
I know answering this topic is like digging a corpse from its grave, since it's really old, but I'd like to share my experience. IMHO, the best way is to wrap your request and apply your replacement there; it's more portable than the other ways. Here is a Java sample:
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletRequestWrapper;

public class FarsiRequestWrapper extends HttpServletRequestWrapper {
    public FarsiRequestWrapper(HttpServletRequest request) {
        super(request);
    }

    @Override
    public String getParameter(String name) {
        String value = super.getParameter(name);
        if (value == null) return null;
        // Normalize to the Persian forms (either direction works if applied consistently).
        value = value.replace("ي", "ی");       // Arabic Yeh (U+064A) -> Persian Yeh (U+06CC)
        value = value.replace("ك", "ک");       // Arabic Kaf (U+0643) -> Persian Keheh (U+06A9)
        value = value.replaceAll("\\s+", " "); // collapse whitespace; replace() does not take a regex
        return value.trim();
    }
}
Then you only need to set up a filter servlet:
import java.io.IOException;
import javax.servlet.*;
import javax.servlet.http.HttpServletRequest;

public class FarsiFilter implements Filter {
    public void init(FilterConfig config) {}

    public void doFilter(ServletRequest request, ServletResponse response,
                         FilterChain chain) throws IOException, ServletException {
        // Wrap the request so every parameter read downstream is normalized.
        HttpServletRequest req = (HttpServletRequest) request;
        chain.doFilter(new FarsiRequestWrapper(req), response);
    }

    public void destroy() {}
}
Although this approach only works in Java, I found it simpler and better.
You must use the N prefix (marking the literal as uNicode, i.e. nvarchar) before non-English characters, for example:
REPLACE(COLUMNNAME, N'ي', N'ی')
According to Adobe's manual on PDF Open Parameters, PDF files can be opened with certain parameters from the command line or from a link in HTML.
These open parameters include page=pagenum, zoom=scale, comment=commentID, and others (the first parameter should be preceded by a # and each subsequent one by a &).
The official PDF Open Parameters from adobe gives this example:
#page=1&comment=452fde0e-fd22-457c-84aa-2cf5bed5a349
but the comment part doesn't work for me!
page=pagenum and zoom=scale work well for me, but comment=commentID does not. I tried Adobe Reader 6.0.0 and Adobe Pro Extended 9.0.0: I can't get to the specified comment.
Also, I get the comment ID by exporting the comments in XFDF format; in the resulting file there is a name attribute on every comment that I hope corresponds to the ID (at least it looks like the example in the manual).
I thought maybe there is a setting I should first enable (or maybe disable) in Adobe, or maybe I am getting the comment IDs wrong, or maybe something else?
Any help would be extremely appreciated
According to the docs, you must include a page=X along with your comment=foo. Your sample has it, but it's copied from the docs, not something you did yourself.
Are you missing a page= when setting comment?
BASTARDS!
From the last page of the manual you linked:
URL Limitations
● Only one digit following a decimal point is retained for float values.
● Individual parameters, together with their values (separated by & or #), can be no greater than 32 characters in length.
Emphasis added.
The comment ID is a 16-byte value expressed as hex, with four hyphens thrown in to break up the monotony. That's 36 characters right there... starting with "comment=" adds another 8 characters. 44 characters total.
According to that, a comment ID can NEVER WORK, including the samples they show in their docs.
Are you just trying it on the command line, or have you tried via a web browser too? I wonder if that makes a difference. If not, we're looking at a feature that CANNOT WORK. EVER... and probably never has.