Is it possible to train Tesseract v5 to OCR Egyptian licence plates? - ocr

I'm working on a project to OCR Egyptian licence plates, which are written in the Arabic alphabet with Arabic-Indic numerals. The traineddata from https://github.com/Shreeshrii/tessdata_arabic gives an accuracy of 60% for letters and 70% for numbers. I'm guessing the poor accuracy is partly because the font on the plates is different. Also, the letters are written separately on the plates (أ هـ ج)(ل ل ص), while they are usually connected in ordinary text (أهج)(للص). On top of that, the detected plates have varying lighting conditions, and the letters may not be clear because a plate can be dirty or distorted.
Here's a sample that is recognised with an extra apostrophe at the beginning ('ل ل ص ٦٢٩) after converting the image to grayscale and then to black and white. The correct characters are (ل ل ص ٦٢٩).
Here's another sample of the plates I am trying to recognise, with the same black-and-white preprocessing. This one fails: it's recognised as (ط ئ ؤ د ١٢), while the characters on the plate are (ط ج د ١٢٦٤).
Should I try different preprocessing? Or should I retrain the existing traineddata for the different font (I searched for the font name but couldn't find it)? Or should I train from scratch, since the plate images have a lot of noise and differ in brightness/contrast?
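One cheap thing to try before any retraining is to post-filter the recognised text so that only characters that can appear on a plate survive. Below is a minimal sketch in plain Python. It assumes plates contain only Arabic letters and Arabic-Indic digits; the function name `clean_plate_text` and the chosen character ranges are illustrative assumptions, not part of Tesseract.

```python
import re

# Assumption: a plate contains only Arabic letters (U+0621-U+064A) and
# Arabic-Indic digits (U+0660-U+0669). Anything else that is not
# whitespace is treated as OCR noise.
NOISE = re.compile(r"[^\u0621-\u064A\u0660-\u0669\s]")


def clean_plate_text(ocr_text: str) -> str:
    """Remove stray symbols and collapse runs of whitespace."""
    without_noise = NOISE.sub("", ocr_text)
    return " ".join(without_noise.split())


# The first sample from the question, with the spurious leading apostrophe.
print(clean_plate_text("'\u0644 \u0644 \u0635 \u0666\u0662\u0669"))
```

This only removes stray symbols such as the leading apostrophe in the first sample. It cannot repair misread or missing characters like those in the second sample, which would still need better preprocessing or training.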

Related

Some math symbols rendering incorrectly in SVG generated by TikZJax

I've started playing with TikZJax, a system for converting TikZ images to SVG and embedding them in HTML pages.
I do like this tool a lot, but I'm having some trouble with mathematical formulas (LaTeX) in TikZJax nodes. The problem is that some basic mathematical symbols do not render correctly.
Here you can see a webpage with two MWEs.
The first one is derived from an image I was preparing, where I first noticed the issue.
The negative y axis labels have a "times" sign instead of the expected minus.
The two texts roughly at (3,3) and (6,6) are two "experiments". One should be $-2$ and the other has a bunch of random symbols $\hbar \pi \times \otimes \sum$. The third and the fourth won't render correctly.
The second MWE is the same you can find in the TikZJax demo here, but with $-\y$ in place of $\y$. Again, the minus sign won't render correctly.
The compilation log I see from the console does not complain about anything.
I tried inspecting the "times" symbol that appears in place of the minus sign. I am no expert at all, but from what I understand, this should correspond to the £ symbol in the font family cmsy10:
<text alignment-baseline="baseline" y="61.57359313964842" x="-51.62625122070311" style="font-family: cmsy10; font-size: 12;">£</text>.
I looked for a table of cmsy10 fonts and I found one in this TeX StackExchange thread.
It seems to me that the minus sign is two cells to the right from the times sign. I am not sure how to read the "coordinates" in the table. Anyway from this site £ corresponds to 163, and 161 corresponds to an inverted exclamation mark "¡".
I created an html file with the svg code for (a version of) the first MWE, and changed the pound symbols into inverted exclamation marks. This produces the expected minus signs.
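For reference, the manual substitution described above can be scripted. This is only a Python sketch of that workaround, not a fix for TikZJax itself. The function name `swap_pound_for_minus` is made up, and it assumes the SVG selects the font through an inline `style` attribute, as in the `<text>` element quoted earlier.

```python
import re

# Matches <text> elements whose inline style selects the cmsy10 font and
# captures the opening tag, the character data, and the closing tag.
CMSY10_TEXT = re.compile(
    r"(<text\b[^>]*font-family:\s*cmsy10[^>]*>)([^<]*)(</text>)"
)


def swap_pound_for_minus(svg: str) -> str:
    """Replace U+00A3 (code 163) with U+00A1 (code 161) in cmsy10 text only."""

    def fix(match: re.Match) -> str:
        opening, content, closing = match.groups()
        return opening + content.replace("\u00a3", "\u00a1") + closing

    return CMSY10_TEXT.sub(fix, svg)


sample = '<text style="font-family: cmsy10; font-size: 12;">\u00a3</text>'
print(swap_pound_for_minus(sample))
```

Note that a blanket replacement like this would also rewrite any £ that really is meant to be a times sign, so it is useful for diagnosing the mapping rather than as a permanent fix.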
Could it be that the author of TikZJax got the mapping of some math symbols wrong?
Or am I getting it all wrong?
In the latter case, can something be done to get the correct minus signs (and other symbols)?
Thanks a lot for your help
Francesco
PS By the way, I noticed that the TikZJax demo here does not actually map the fonts to the correct font family. I think this is because something goes wrong with the link to the CSS file in the header of the webpage. I'm guessing this because I get a similar rendering if I comment out the link to the CSS file, which contains a list of font families.

Does text from a rich text editor not inherit styles when rendered in an HTML document?

To be clear, I use a rich text editor (RTE) in the backend to store a description. Later I receive that description, along with other details, in an API response. Up to that point the styles are intact, for example bold headings. But when I render it in the HTML document using the innerHTML property, all I see is unformatted text: the headings are no longer bold.
Here's a part of response:
</p>\r\n\n\r\n<p><span style=\"font-weight: bold;\">Features</span> \n </p>\r\n\n\r\n<p>Gives even skin tone, smoother complexion and sculpted facial features.
Clearly, font-weight: bold can be seen here. But the rendered version does not contain those styles.
Here's the full response:
{
"cart_count":2,
"images":[
],
"success":true,
"message":"Sucessfully",
"data":{
"product_id":1,
"name":"Dr G Butterfly Gua Sha",
"category_id":1,
"category":"Skin Tool",
"description":"<p>Dr G Butterfly Rose Quartz Gua Sha is a beauty and wellness tool designed to heal and enhance natural beauty. It lifts and sculpts your face, drains the lymph node, which reduces puffy eyes and face. By scraping with repeated strokes on the surface of the skin, this tool helps stimulate muscles and increases the blood flow. \n </p>\r\n\n\r\n<p><span style=\"font-weight: bold;\">Features</span> \n </p>\r\n\n\r\n<p>Gives even skin tone, smoother complexion and sculpted facial features. Reduces the signs of ageing and gives younger-looking skin. Increases lymphatic function. Stimulates blood circulation. Improves the appearance of dark circles and reduces under-eye puffiness. </p>\r\n\n\r\n<p><span style=\"font-weight: bold;\">How To Use \n</span></p>\r\n\n\r\n<p>Apply Dr G oil or Dr G gel as per your skin type covering the face and neck. </p>\r\n<p>Hold the butterfly gua sha tool firmly and sweep across gently up and out, starting with the neck, cheeks, jawline, chin, around the mouth, and slowly glide under the eyes, across your eyebrows and from your forehead up to your hairline. </p>\r\n<p>You can sweep it 3-5 times per area. </p>\r\n<p>Recommended at least a few times a week for best results. </p>\r\n\n\r\n<p><span style=\"font-weight: bold;\">About Dr G</span> \n </p>\r\n\n\r\n<p>Dr G offers luxury skincare products, backed by over a decade of dermatology expertise and on-ground practice. Made for Indian weather conditions, with variants for different skin types, including sensitive skin, and to address specific skin concerns - these innovative products are a perfect balance of nature and science. Drawing from ancient Ayurveda and combining natural extracts with skin-safe science, Dr G's range of products bridge modern skincare with holistic science.</p>",
"short_description":"Sculpts, Tones, Reduces Puffiness, Lifts",
"max_quantity":500,
"status":1,
"in_stock":1,
"measurement":[
{
"is_cart":true,
"ordered_quantity":2,
"is_wish":false,
"discounted_price":1400.0,
"weight":"200 Gram",
"price":1400.0,
"prod_id":1,
"percentage":100,
"max_quantity":500
}
]
}
}
The HTML from your response isn't valid. You can easily test it, if you copy the HTML string from your response to a text file with .html file ending and open it with your browser (index.html for example). Or use a validator like this one: https://www.freeformatter.com/html-validator.html
Let's pick one part from the HTML string which has wrong characters and gets displayed unformatted:
<span style=\"font-weight: bold;\">Features</span> \n
If you remove the backslashes \ here, this piece gets rendered correctly:
<span style="font-weight: bold;">Features</span> \n
I would recommend encoding the HTML before sending it to the frontend. You could use Base64, which can easily be encoded in the backend and decoded on the frontend before displaying it.
If these "wrong" characters are already there when you receive this HTML (on your backend), you have to parse it first to clean it.
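Whether the backslashes ever reach the browser depends on how the response body is read. The Python sketch below makes two points. First, if the body is parsed with a JSON parser, the `\"` and `\n` escapes are decoded automatically. Second, it shows the Base64 round trip suggested above. The helper names `encode_html` and `decode_html` are illustrative, and the shortened `raw_response` string stands in for the real API response. In a real frontend the decoding step would be done in JavaScript instead.

```python
import base64
import json

# A shortened copy of the "description" field as it appears in the raw body.
raw_response = (
    r'{"description": "<p><span style=\"font-weight: bold;\">'
    r'Features</span> \n </p>"}'
)

# Parsing the JSON turns \" back into " and \n into a real newline.
description = json.loads(raw_response)["description"]
print(description)


def encode_html(html: str) -> str:
    """Backend side: Base64-encode the UTF-8 bytes of the HTML."""
    return base64.b64encode(html.encode("utf-8")).decode("ascii")


def decode_html(payload: str) -> str:
    """Receiving side: reverse encode_html before inserting the markup."""
    return base64.b64decode(payload).decode("utf-8")


assert decode_html(encode_html(description)) == description
```

If the parsed string still shows literal backslashes, the escaping is happening twice somewhere on the backend, and that is the place to fix it.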

Sublime Text 2 and Emmet and wrapping lines

I often get text from clients and want to quickly format the text with html tags and wrap each text line at a specified number of characters. I've just installed Sublime Text 2 and it's pretty nice, but one of the things I really want to do I can't quite figure out.
I want to take long paragraphs, wrap each paragraph in a paragraph p tag, and then wrap the lines so they don't run off the screen. So here's what I'm doing:
1. Copy and paste text from my client into the editor (2 paragraphs for this example).
2. Select text.
3. Using Emmet, enter "p*", which puts p tags at the beginning of each paragraph and /p at the end of each paragraph.
4. Select text.
5. Press Alt+Q to wrap text.
The text wraps, but it's corrupted: the opening angle bracket "<" from the /p tag is prepended to each wrapped line, and it is missing from the /p tag itself.
<p>Our swimming lessons run on a perpetual monthly enrollment system,
<making year-round lessons affordable and convenient. Our online
<registration system allows you to sign up at your convenience and
<monitor your account details easily./p>
<p>Our highly trained swim instructors teach our unique, proven
<curriculum in stages, encouraging swimmers to master the fundamentals
<of every important swimming skill. We continuously encourage
<progression and advancement as each swimmer becomes more confident in
<the water. Our program blends important water safety skills, buoyancy
<principles and correct stroke technique./p>
Help! What am I doing wrong?
Here's what you can do:
1. Paste content from client.
2. Select, hit Alt+Q to wrap. You'll now have two cursors, one at the end of each paragraph.
3. Choose Selection -> Expand Selection to Paragraph (I'll show you how to make a shortcut later). Both paragraphs are now selected, each as a selection region.
4. Bring up Emmet with Ctrl+Shift+G and enter p (not p*).
5. Hit Enter and you should have two wrapped paragraphs surrounded by <p></p> tags:
<p>Our swimming lessons run on a perpetual monthly enrollment system, making
year-round lessons affordable and convenient. Our online registration system
allows you to sign up at your convenience and monitor your account details
easily.</p>
<p>Our highly trained swim instructors teach our unique, proven curriculum in
stages, encouraging swimmers to master the fundamentals of every important
swimming skill. We continuously encourage progression and advancement as each
swimmer becomes more confident in the water. Our program blends important
water safety skills, buoyancy principles and correct stroke technique.</p>
To create a keyboard shortcut for Expand Selection to Paragraph, go to Preferences -> Key Bindings - User and add the following:
{ "keys": ["ctrl+alt+shift+p"], "command": "expand_selection_to_paragraph" }
If the file is empty, wrap the line above in square brackets []:
[
{ "keys": ["ctrl+alt+shift+p"], "command": "expand_selection_to_paragraph" }
]
Save the file, and you should now be able to use Ctrl+Alt+Shift+P for step 3 above. Feel free to change the key combination if you wish, but be aware that it may conflict with other built-in or plugin combos.
Note: I tested all this on Sublime Text 3, but it should work the same in ST2.
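If this has to be done often, the same result can also be scripted outside the editor. Here is a rough Python sketch using the standard `textwrap` module. It assumes paragraphs are separated by blank lines, and the function name `wrap_paragraphs` and the default width of 78 are arbitrary choices of mine, not anything Sublime Text or Emmet provides.

```python
import textwrap


def wrap_paragraphs(text: str, width: int = 78) -> str:
    """Wrap each blank-line-separated paragraph in <p></p>, hard-wrapped."""
    wrapped = []
    for paragraph in text.split("\n\n"):
        # Collapse existing line breaks and repeated spaces first.
        collapsed = " ".join(paragraph.split())
        if not collapsed:
            continue  # skip empty chunks caused by extra blank lines
        wrapped.append(textwrap.fill(f"<p>{collapsed}</p>", width=width))
    return "\n".join(wrapped)


client_text = (
    "Our swimming lessons run on a perpetual monthly enrollment system, "
    "making year-round lessons affordable and convenient.\n\n"
    "Our highly trained swim instructors teach our unique, proven "
    "curriculum in stages."
)
print(wrap_paragraphs(client_text))
```

Because the tags are added before wrapping and treated as ordinary text, no line prefix is involved and the closing tag stays intact.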

Can OpenNLP use HTML tags as part of the training?

I'm creating a training set for the TokenNameFinder using HTML documents converted into plain text, but my precision is low and I want to use the HTML tags as part of the training, for example words in bold and sentences with different margin sizes.
Will OpenNLP accept and use those tags to create rules?
Is there another way to make use of those tags to improve precision?
It is not clear what you mean by using HTML tags to train OpenNLP.
The train input is an annotated tokenized sentence:
<START:person> Pierre Vinken <END> , 61 years old , will join the board as a nonexecutive director Nov. 29 .
Mr . <START:person> Vinken <END> is chairman of <START:company> Elsevier N.V. <END> , the Dutch publishing group .
To train an OpenNLP model with the standard tooling, your annotations need to follow this convention. Note that the annotations do not follow the XML standard.
You can embed annotations directly in the HTML documents you will use for training. The extra context might even help the classifier, but I've never read any experimental results about it.
You should keep in mind that the training data should be tokenized. It means that you should include white spaces between words and punctuation, as well as between text elements and html:
<p> <i> Mr . <START:person> Vinken <END> </i> is chairman of <b> <START:company> Elsevier N.V. <END> </b> , the Dutch publishing group .
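The spacing around tags can be added mechanically. The following is a small Python sketch, not part of OpenNLP: `space_out_tags` is a hypothetical helper that pads every `<...>` token, HTML tags and `<START:...>`/`<END>` markers alike, with single spaces. It does not tokenize punctuation, which would still have to be handled by a proper tokenizer.

```python
import re

# Anything that looks like <...>, together with the whitespace around it.
TAG_RE = re.compile(r"\s*(<[^<>]+>)\s*")


def space_out_tags(line: str) -> str:
    """Surround every tag or annotation marker with single spaces."""
    spaced = TAG_RE.sub(r" \1 ", line)
    # Collapse the doubled spaces left between adjacent tags.
    return " ".join(spaced.split())


print(space_out_tags("<p><i>Mr . <START:person>Vinken<END></i> is chairman"))
```

A literal "<" in the text is left alone unless a ">" follows it with no other angle bracket in between, so the output is worth checking when the source contains comparisons or similar.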

What Unicode character do you use in your website? (instead of image icons)

I am looking for characters that could replace image icons, for example ✘ (x mark) and ✔ (tick). Is there maybe a symbol for "draft" or "new message"?
EDIT:
Fav: ❤
Draft: ✍
Message: ✉
To find useful symbols, I have two great resources:
http://shapecatcher.com
Allows you to draw a shape, which it then searches for similarly shaped unicode symbols.
https://www.fileformat.info/info/unicode/block/index.htm
Lists unicode by the character blocks (using an embedded unicode font to maximize compatibility for display) and has a "display a certain block with images" functionality that allows you to review symbol blocks.
Both are quite useful, though I often end up using shapecatcher these days just because it's a fun break to draw the shape that you want and have the site pull it up for you. At least, sometimes it will pull it up.
Misc. Symbols Blocks
http://shapecatcher.com/unicode/block/Miscellaneous_Symbols_And_Pictographs is also a great category of unicode symbols, though as with all unicode, you may have to test compatibility.
https://www.fileformat.info/info/unicode/block/miscellaneous_symbols/images.htm is the block of the miscellaneous symbols, for comparison.
⌚ U+231A WATCH
⌛ U+231B HOURGLASS
♟ U+265F BLACK CHESS PAWN
⚷ U+26B7 CHIRON
★ U+2605 BLACK STAR
✓ U+2713 CHECK MARK
☑ U+2611 BALLOT BOX WITH CHECK
✕ U+2715 MULTIPLICATION X
☒ U+2612 BALLOT BOX WITH X
⚠ U+26A0 WARNING SIGN
Are also good symbols to add to the list.
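If you want to double-check a code point or its official Unicode name (which can differ from an informal description), Python's standard `unicodedata` module can look it up. A tiny sketch follows; `describe` is just an illustrative helper name, and the names returned depend on the Unicode version bundled with your Python.

```python
import unicodedata


def describe(symbol: str) -> str:
    """Return 'U+XXXX OFFICIAL NAME' for a single character."""
    return f"U+{ord(symbol):04X} {unicodedata.name(symbol)}"


for symbol in "\u231a\u2605\u2713\u2611\u26a0":
    print(symbol, describe(symbol))
```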
Edit: In 2019 I would now recommend using a robust icon pack, either in SVG form or font-file form, because the presentation of Unicode symbols is often harder for web developers to control.
stackoverflow.com used to use "●" (U+25CF BLACK CIRCLE) for badges.
There are tons of useful characters in Unicode:
✆ U+2706 TELEPHONE LOCATION SIGN
✉ U+2709 ENVELOPE
☎ U+260E BLACK TELEPHONE and ☏ U+260F WHITE TELEPHONE
✎ U+270E LOWER RIGHT PENCIL
⌛ U+231B HOURGLASS
⌨ U+2328 KEYBOARD
←
↑
→
↓
↔
↕
↖
↗
↘
↙
just to name a few...
Why not just peruse the whole list?
I've used the block-arrows:
U+25b2 ▲, U+25ba ►, U+25bc ▼, U+25c4 ◄
Look at http://unicode.org/charts#symbols for some ideas. I'm not sure what would work for "draft" or "new message" but there is a lot to choose from there.
Some symbols might not be supported by the font selected into the browser page. Even if they are, a lot of them look really bad at small sizes. You're better off using an image if you can.
http://unicode-table.com/ is great too, but for symbols designed as web design icons, I recommend http://kudakurage.com/ligature_symbols/.
Twitter Bootstrap uses × (&times;) for close buttons.
I would suggest using a custom icon font like https://github.com/FortAwesome/Font-Awesome
You can also have svg/png version https://github.com/encharm/Font-Awesome-SVG-PNG
There are also other svg icons
https://github.com/iconic/open-iconic
https://github.com/outpunk/evil-icons
Pure css icons https://github.com/saeedalipoor/icono
For Material Design you have static svg icons https://google.github.io/material-design-icons/ and animated:
http://tympanus.net/Development/AnimatedSVGIcons/
http://tympanus.net/Development/IconHoverEffects/
http://tympanus.net/Development/AnimatedCheckboxes/
https://alexk111.github.io/SVG-Morpheus/
I am surprised no one has posted Unicode emojis yet:
Range U+1F600 - U+1F64F
Just some from the list:
😁 :U+1F601: GRINNING FACE WITH SMILING EYES &#128513;
😂 :U+1F602: FACE WITH TEARS OF JOY &#128514;
😃 :U+1F603: SMILING FACE WITH OPEN MOUTH &#128515;
😄 :U+1F604: SMILING FACE WITH OPEN MOUTH AND SMILING EYES &#128516;
😅 :U+1F605: SMILING FACE WITH OPEN MOUTH AND COLD SWEAT &#128517;
😆 :U+1F606: SMILING FACE WITH OPEN MOUTH AND TIGHTLY-CLOSED EYES &#128518;
😷 :U+1F637: FACE WITH MEDICAL MASK &#128567;
Also have a look at this list of cool icons from Supplemental list
☣ : U+2623: BIOHAZARD SIGN &#9763;
☢ : U+2622: RADIOACTIVE SIGN &#9762;
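The decimal numbers in the lists above are simply the code points written in base 10, wrapped as HTML numeric character references. A short Python sketch of the conversion in both directions follows. `numeric_reference` is an illustrative helper, while `html.unescape` is the standard-library function for decoding references.

```python
import html


def numeric_reference(symbol: str) -> str:
    """Build the decimal HTML character reference for one character."""
    return f"&#{ord(symbol)};"


print(numeric_reference("\U0001F601"))  # reference for U+1F601
print(html.unescape("&#9763;"))         # decodes back to U+2623
```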
I've used the magnifying glass icon as the body of an anchor to link to a cool interactive page for some data analysis that allowed a user to pair arbitrary data selections much like this example.
🔎
Because it is a link, the default underline somewhat obscured the Unicode glyph. That effect was negligible for our internal tool, but it might be suboptimal for something public-facing.