Is there any solution for the mixed-language problem in Tesseract 4.1.1?

I want to convert an image to text with the Tesseract engine. The input image contains two languages (Persian and English). When I use Tesseract's multi-language feature (fas+eng), the converted text has many errors.
For example, this is the output:
BERT Joo‏ و استفاده از آن
در این گزارش به تعریف مفاهیم مورد نیاز برای شناخت مدل 7۳11 می‌پردازيم و نحوه استفاده از
آن را برای تحلیل متن توضیح می‌دهیم.
Should I train a model with Persian and English text?

You should update to the latest version of Tesseract or of the fas trained data.
I use this version of Tesseract:
# tesseract.exe --version
tesseract v5.0.0-alpha.20191030
leptonica-1.78.0
libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.3) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0
Found AVX
Found SSE
Found libarchive 3.3.2 zlib/1.2.11 liblzma/5.2.3 bz2lib/1.0.6 liblz4/1.7.5
I also use this commit of the fas trained data:
https://github.com/tesseract-ocr/tessdata/blob/cdd8a9ec438fc0b9f21635466196fe1c05efca16/fas.traineddata
And I use this command:
tesseract.exe image.png out -l fas+eng
As you can see here, this produces the correct text:
مدل ‎BERT‏ و استفاده از آن
در این گزارش به تعریف مفاهیم مورد نیاز برای شناخت مدل ‎BERT‏ می‌پردازيم و نحوه استفاده از
آن را برای تحلیل متن توضیح می‌دهیم.

Related

How to convert Text(with breakline) to display properly on HTML textbox

Kemandirian spesies ialah keupayaan haiwan dan tumbuhan untuk mengekalkan spesiesnya bagi mengelakkan kepupusan.
Ciri dan tingkah laku khas haiwan untuk melindungi diri daripada musuh seperti:
(i) Memutuskan anggota badan;
(ii) Menyembur dakwat hitam;
(iii) Mempunyai mata palsu.
Galakkan penggunaan TMK untuk membuat pemerhatian pelbagai ciri dan tingkah laku khas haiwan untuk melindungi diri.
Let's say I have this text and I save it to a database. When I pull it back out of the database, everything shows on a single line. What is the best way to keep the original format?

You can change all \n characters to <br> tags and then upload the text to your DB.
When you fetch the data, you can convert all the <br> tags back to \n characters.
let x = `Kemandirian spesies ialah keupayaan haiwan dan tumbuhan untuk mengekalkan spesiesnya bagi mengelakkan kepupusan.
Ciri dan tingkah laku khas haiwan untuk melindungi diri daripada musuh seperti:
(i) Memutuskan anggota badan;
(ii) Menyembur dakwat hitam;
(iii) Mempunyai mata palsu.
Galakkan penggunaan TMK untuk membuat pemerhatian pelbagai ciri dan tingkah laku khas haiwan untuk melindungi diri.`;
// Replace every newline with a <br> tag before storing
let withBR = x.replace(/\n/g, "<br>");
console.log("Upload To DB \n\n");
console.log(withBR);
// Upload To The DB
// ----------------------------------------------------------
// On Fetch Starts
// Convert the <br> tags back to newlines after fetching
let originalText = withBR.replace(/<br>/g, "\n");
console.log("Original Text \n\n");
console.log(originalText);
Hope it helps.
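An alternative is to leave the newlines in the database untouched and convert them only when rendering. Below is a minimal TypeScript sketch, assuming the text is inserted into the page as HTML. The names `escapeHtml` and `newlinesToBr` are hypothetical helpers, not part of any library:

```typescript
// Escape characters that have special meaning in HTML so the stored text
// cannot inject markup when it is rendered.
function escapeHtml(text: string): string {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

// Convert line breaks (Windows, old Mac, or Unix style) to <br> tags,
// for display only; the stored text keeps its original newlines.
function newlinesToBr(text: string): string {
  return escapeHtml(text).replace(/\r\n|\r|\n/g, "<br>");
}

const stored = "(i) Memutuskan anggota badan;\n(ii) Menyembur dakwat hitam;";
console.log(newlinesToBr(stored));
```

If the text is shown in a <textarea>, or in an element styled with the CSS rule white-space: pre-wrap, the stored newlines are displayed as line breaks and no conversion is needed.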

How to import json to a ionic timeline project

export class HomePage {
  items = [
    {
      title: 'Courgette daikon',
      content: 'Parsley amaranth tigernut silver beet maize fennel spinach. Ricebean black-eyed pea maize scallion green bean spinach cabbage jícama bell pepper carrot onion corn plantain garbanzo. Sierra leone bologi komatsuna celery peanut swiss chard silver beet squash dandelion maize chicory burdock tatsoi dulse radish wakame beetroot.',
      icon: 'calendar',
      time: {subtitle: '4/16/2013', title: '21:30'}
    },
    {
      title: 'Courgette daikon',
      content: 'Parsley amaranth tigernut silver beet maize fennel spinach. Ricebean black-eyed pea maize scallion green bean spinach cabbage jícama bell pepper carrot onion corn plantain garbanzo. Sierra leone bologi komatsuna celery peanut swiss chard silver beet squash dandelion maize chicory burdock tatsoi dulse radish wakame beetroot.',
      icon: 'calendar',
      time: {subtitle: 'January', title: '29'}
    },
    {
      title: 'Courgette daikon',
      content: 'Parsley amaranth tigernut silver beet maize fennel spinach. Ricebean black-eyed pea maize scallion green bean spinach cabbage jícama bell pepper carrot onion corn plantain garbanzo. Sierra leone bologi komatsuna celery peanut swiss chard silver beet squash dandelion maize chicory burdock tatsoi dulse radish wakame beetroot.',
      icon: 'calendar',
      time: {title: 'Short Text'}
    }
  ]
  constructor(public navCtrl: NavController) {
  }
}
Instead of the manually entered JSON, I would like to import the items from a JSON file that looks like this:
{"items":[
{"agendaid":"1","title":"Avreise Medina","content":"Avreise til Medina er klokken 18:30, det er oppm\u00f8te p\u00e5 flyplassen 3 timer f\u00f8r avreise. Alle m\u00e5 selv komme tidsnok til \u00e5 f\u00e5 sitte sammen andre i familien. Det er kun lov til \u00e5 ha med seg 30 kg baggasje og 7 kg h\u00e5ndbaggasje","icon":"plane","TimeTitle":"21.12.2017","TimeSubtitle":"18:30","ExecuteTime":"2017-12-21 18:30:00"},
{"agendaid":"2","title":"test","content":"test","icon":"test","TimeTitle":"test","TimeSubtitle":"test","ExecuteTime":"2017-11-22 17:26:23"}
]}
I'm trying to read it in Ionic 3 with this code:
url: string = 'http://backend.mishkaat.no/app/agenda.php';
items: any = [];

constructor(public navCtrl: NavController, public navParams: NavParams, private http: Http) {}

ionViewDidEnter() {
  this.http.get(this.url)
    .map(res => res.json())
    .subscribe(data => {
      // we've got back the raw data, now generate the core schedule data
      // and save the data for later reference
      this.items = data;
    });
}
But I get this error:
Error: Error trying to diff '[object Object]'. Only arrays and iterables are allowed
at DefaultIterableDiffer.diff (http://localhost:8100/build/vendor.js:7695:19)
at NgForOf.ngDoCheck (http://localhost:8100/build/vendor.js:43776:57)
at checkAndUpdateDirectiveInline (http://localhost:8100/build/vendor.js:12451:19)
at checkAndUpdateNodeInline (http://localhost:8100/build/vendor.js:13951:20)
at checkAndUpdateNode (http://localhost:8100/build/vendor.js:13894:16)
at debugCheckAndUpdateNode (http://localhost:8100/build/vendor.js:14766:76)
at debugCheckDirectivesFn (http://localhost:8100/build/vendor.js:14707:13)
at Object.eval [as updateDirectives] (ng:///AppModule/AgendaPage.ngfactory.js:99:5)
at Object.debugUpdateDirectives [as updateDirectives] (http://localhost:8100/build/vendor.js:14692:21)
at checkAndUpdateView (http://localhost:8100/build/vendor.js:13861:14)
when displaying it in the HTML file:
<timeline endIcon="call">
  <timeline-item *ngFor="let item of items">
    <timeline-time [time]="item.time"></timeline-time>
    <ion-icon [name]="item.icon"></ion-icon>
    <ion-card>
      <ion-card-header>
        {{item.title}}
      </ion-card-header>
      <ion-card-content>
        {{item.content}}
      </ion-card-content>
    </ion-card>
  </timeline-item>
</timeline>
Any help?

I now have the original JSON file that is shown in the question, but why am I not able to read the nested JSON with this script?
@Component({
  selector: 'timeline-time',
  template: '<span>{{time.subtitle}}</span> <span>{{time.title}}</span>'
})
export class TimelineTimeComponent {
  @Input('time') time = {};
  constructor() {
  }
}
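The "Only arrays and iterables are allowed" error is consistent with `this.items = data` assigning the whole response object, because the JSON shown in the question wraps the array in an `items` property. The template also reads `item.time.title` and `item.time.subtitle`, while the JSON carries flat `TimeTitle` and `TimeSubtitle` fields. Below is a minimal sketch of a conversion step, assuming the response has exactly the shape shown in the question. `AgendaEntry`, `AgendaResponse`, `TimelineItem` and `toTimelineItems` are hypothetical names:

```typescript
// Shape of one entry in the JSON from the question (assumed from the sample).
interface AgendaEntry {
  agendaid: string;
  title: string;
  content: string;
  icon: string;
  TimeTitle: string;
  TimeSubtitle: string;
  ExecuteTime: string;
}

// The response is an object that wraps the array in an "items" property.
interface AgendaResponse {
  items: AgendaEntry[];
}

// Shape the timeline template expects: a nested "time" object.
interface TimelineItem {
  title: string;
  content: string;
  icon: string;
  time: { title: string; subtitle: string };
}

// Unwrap the array and move the flat time fields into the nested object.
function toTimelineItems(response: AgendaResponse): TimelineItem[] {
  return response.items.map((entry) => ({
    title: entry.title,
    content: entry.content,
    icon: entry.icon,
    time: { title: entry.TimeTitle, subtitle: entry.TimeSubtitle },
  }));
}

const sample: AgendaResponse = {
  items: [
    {
      agendaid: "2",
      title: "test",
      content: "test",
      icon: "test",
      TimeTitle: "test",
      TimeSubtitle: "test",
      ExecuteTime: "2017-11-22 17:26:23",
    },
  ],
};
console.log(toTimelineItems(sample));
```

In the page from the question, the subscribe callback could then assign `this.items = toTimelineItems(data);` so that `*ngFor` receives an array.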

How can i make rtl text to continue the left part below

I have this really long right-to-left Arabic paragraph:
https://fiddle.jshell.net/09b6xoaa/4/
I would expect that when the line is too long, it continues with the left part on the line below ("end of the line this should be on the second line"), rather than breaking at the right end of the line ("start of the rtl line"), which is supposed to be the start of the line.
I can't find an answer anywhere for this behaviour, which seems to me a very basic thing to expect.
What am I doing wrong?
Thank you!
17-01-28 12:16 Updated description and fiddle link
Update: it seems my problem lay in a Python library (WeasyPrint) for rendering HTML into PDF, which does not support RTL.
Thanks everyone!

For right-to-left scripts such as Arabic, long text breaks at the correct position by default (just as it does for left-to-right scripts).
Please see the following Persian (similar to Arabic) text, in which I embedded the equivalent English word beside each number so that you can follow the words both as they are written in the HTML code and as they are presented in the rendered page:
<p dir="rtl" >
کلمه اول - کلمه دوم - کلمه سوم - کلمه چهارم - کلمه پنج - کلمه شش - کلمه هفت - کلمه هشت - کلمه نه - و انتهای جمله.
</p>
<p dir="rtl" >
کلمه اول first- کلمه دوم second - کلمه سوم third - کلمه چهارم fourth - کلمه پنج five - کلمه شش six - کلمه هفت seven - کلمه هشت eight - کلمه نه nine - و انتهای end جمله.
</p>
<p dir="rtl" >
کلمه اول firstWord-
کلمه دوم secondW -
کلمه سوم thirdW -
کلمه چهارم fourthW -
کلمه پنج fiveW -
کلمه شش sixW -
کلمه هفت sevenW -
کلمه هشت eightW -
کلمه نه nineW -
و انتهای end جمله.
</p>
Please note that right-to-left text is right-aligned by default.

I think this is what you are looking for:
#rtl{
word-wrap: break-word;
}
<p id="rtl" >arabic valuearabic valuearabic valuearabic valuearabic valuearabic valuearabic valuearabic value2<b> arabic label2 </b> ar value1 <b>ar_label1</b></p>

replace space in text without affecting html tags

I need to replace the spaces inside the HTML with &nbsp;, but without affecting the spaces inside the tags.
So that something like this: Hello <font color="red"> How Are <font color="black"> You?
would become this: Hello&nbsp;<font color="red">&nbsp;How&nbsp;Are&nbsp;<font color="black">&nbsp;You?
It changes the spaces outside of the tags, but the spaces inside the tags aren't affected.
I have tried this sample code that was suggested by someone:
NSString *string = originalHTMLString;
NSError *error = nil;
NSRegularExpression *regex = [NSRegularExpression regularExpressionWithPattern:@"(?i)(<script(?:[^>\"']|\"[^\"]*\"]|'[^']*')*>)\s+</script\\s*>|<style(?:[^>\"']|\"[^\"]*\"]|'[^']*')*>)\s+</style\\s*>|<textarea(?:[^>\"']|\"[^\"]*\"]|'[^']*')*>)\s+</textarea\\s*>|</?[a-z](?:[^>\"']|\"[^\"]*\"]|'[^']*')*>|\\S+)|\\s+" options:NSRegularExpressionCaseInsensitive error:&error];
NSString *modifiedString = [regex stringByReplacingMatchesInString:string options:0 range:NSMakeRange(0, [string length]) withTemplate:@"&nbsp;"];
finalHTMLString = modifiedString;
But it didn't work; it just returned null. I think the regex pattern is wrong.
This is some sample html I have to convert:
<samp class="s22">من مشاكل جرّأء العثّ والفيروسات منذ سنوات. إلاّ أنّ أمرًا ما حدث في الأعوام الماضية وسبّب المشكلة".</samp></p> <p class="mytext-19" dir="RTL"><samp class="s20">ويعتقد هاكينبرغ أنّ الأمر بدأ منذ عام </samp><samp class="s21">2004</samp><samp class="s22">. ففي أيار ذلك العام، اشتكى مزارعو العنبيّة في ماين من أنّ نحلهم الذي يلقّح محصولهم كان يُنتج طرودًا ويغادر الخلايا. كما أنّ نحل الخلايا الأخرى لا يسرق العسل الموجود في الخليّة المتروكة.</samp></p> <p class="mytext-19" dir="RTL"><samp class="s20">وحين بحث عن تفسير لهذا السلوك الغريب، اكتشف أنّ مزارعي التفاح في واشنطن استعملوا مبيدًا جديدًا يحتوي على النيونيكوتينوييد يُدعى </samp><samp dir="LTR">Assail</samp><samp class="s22"> لأشجارهم. وكان نحله يلقّح تلك الأشجار في الربيع.</samp></p> <p class="mytext-19" dir="RTL"><samp class="s20">ذاك الشتاء (</samp><samp class="s21">2004</samp><samp class="s22">-</samp><samp class="s21">2005</samp><samp class="s22">)، خسر ثلثَ نحله تقريبًا، وهي نسبة أعلى بكثير من المعتاد. وفي العام التالي نفق النصف كما أُبلغ عن خسائر في مختلف أنحاء البلاد.</samp></p> <p class="mytext-19" dir="RTL"><samp class="s20">يقول هاكنبيرغ: "لقد ساءت الأمور جدًّا، ولكنّ أحدًا لم يتمكّن <samp class="s37">من معرفة السبب". لهذا، ففي صيف عام </samp></samp><samp class="s61">2006</samp><samp class="s38"> عقد اجتماعًا مع علماء في نبراسكا ليحاول إيجاد سبب للارتفاع السريع في معدّل نفوق </samp><samp class="s26">النحل. "قيل إنّ الاجتماع ضمّ أذكى العقول ولكنّنا جلسنا ليومين نتباحث من دون التوصّل لشيء". وبعد بضعة أشهر هلك ثلثا ما تبقّى من نحله.</samp></p> <p class="mytext-19" dir="RTL"><samp class="s20">أعطى النحّالون الذين نقلوا </samp><samp class="s21">1</samp><samp class="s22">.</samp><samp class="s21">2</samp><samp class="s22"> مليون قفير إلى بساتين اللوز في كاليفورنيا في شباط أوّل مؤشر على صحة النحل عام </samp><samp class="s21">2008</samp><samp class="s22">. لم تكن الإشارات جيّدة. 
<a class="MyAppHighlight1" style="background-color:pink; color:black;" name="M10">فمن بين الاثني عشر نحّالاً تقريبًا الذين تحدثنا إليهم</a>، اثنان منهم فقط دخلوا الشتاء سالمين نسبيًّا. أمّا الباقون فخسروا ما يتراوح بين </samp><samp class="s21">30</samp><samp class="s22"> بالمئة و</samp><samp class="s23">60</samp><samp class="s22"> بالمئة من قفرلوا الشتاء سالمين نسبيًّا. أمّا الباقون فخسروا ما يتراوح بين </samp><samp class="s21">30</samp><samp class="s22"> بالمئة و</samp><samp class="s23">60</samp><samp class="s22"> بالمئة من قفر\330انهم بما بدا شبيهًا بداء <samp class="s37">انهيار الخليّة. ومن بين عمليات الهجرة الاثنتي عشرة التي تابعتها وزارة </samp>الزراعة الأميركيّة من أيلول <samp class="s37">العام </samp></samp><samp class="s21">2007</samp><samp class="s22"> وحتّى ربيع <samp class="s37">العام </samp></samp><samp class="s21">2008</samp><samp class="s22">، ظهر في خمس </samp><samp dir="LTR" class="s2"><span style="display:none;">00002</span> </samp><a style="color:transparent;" name="00003"></a><samp><span style="display:none;">00003</span></samp></p> <p class="bigtitle"> </p> <p class="bigtitle"> </p> <p class="bigtitle-3" dir="RTL"><samp class="s4">عَالَمٌ بِلا نَحْل</samp></p> <p class="bigtitle-3" dir="RTL"><samp dir="LTR" class="s5">A World Without Bees</samp></p> <p class="mo2allef"> </p> <p class="mo2allef"> </p> <p class="smallertitleCxSpFirst-6" dir="RTL"><samp class="s7">تأليف</samp><samp class="s8">:</samp></p><p> </p>
Thank you for your assistance.

This isn't a regex answer, but in Objective-C this should take a string called originalHTML, replace all of the spaces outside of tags, and save the result as a string called finalHTML:
NSString *originalHTML = @"Your backslashed HTML Here";
NSMutableString *finalHTML = [NSMutableString string];
BOOL insideTag = NO;
for (NSUInteger i = 0; i < originalHTML.length; i++) {
    unichar uniCharacter = [originalHTML characterAtIndex:i];
    if (uniCharacter == '<') {
        insideTag = YES;   // entering a tag
    } else if (uniCharacter == '>') {
        insideTag = NO;    // leaving a tag
    }
    if (!insideTag && uniCharacter == ' ') {
        // Space outside of a tag: append the replacement instead
        [finalHTML appendString:@"&nbsp;"];
    } else {
        // Everything else is copied through unchanged
        [finalHTML appendFormat:@"%C", uniCharacter];
    }
}
NSLog(@"%@", finalHTML);
*Note: this will not work if your HTML body contains a bare less-than or greater-than sign that is not part of a tag. If you need to write a less-than or greater-than sign in the actual body text, please use &lt; or &gt;.
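The same character-by-character scan can be sketched in TypeScript. This is an illustration only: it assumes the replacement string is `&nbsp;`, and like the Objective-C version it treats every `<` as the start of a tag and every `>` as its end. `replaceSpacesOutsideTags` is a hypothetical name:

```typescript
// Replace each space that is outside of a tag, leaving the spaces between
// "<" and ">" (for example between attributes) untouched.
function replaceSpacesOutsideTags(html: string, replacement: string = "&nbsp;"): string {
  let insideTag = false;
  let result = "";
  for (let i = 0; i < html.length; i++) {
    const ch = html.charAt(i);
    if (ch === "<") {
      insideTag = true; // entering a tag
    } else if (ch === ">") {
      insideTag = false; // leaving a tag; the ">" itself is copied below
    }
    result += !insideTag && ch === " " ? replacement : ch;
  }
  return result;
}

console.log(replaceSpacesOutsideTags('Hello <font color="red"> How Are You?'));
```

The limitation noted above applies here as well: a bare `<` or `>` in the body text would confuse the scan.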

how parse HTML in Objective C

Could anybody help me, please?
I want to parse some HTML from the web in Objective-C. The HTML looks like the code below:
<div class="linkSummary">
<img class="video_thumbnail" width="120" height="90" src="video_thumbnails/vthumb_2309.jpg">
<div class="video_title">المپیک لندن؛ اهداء مدال کشتی فرنگی ۵۵ کیلوگرم </div>
<div class="video_league">المپیک لندن</div>
<div class="video_date">۱۵ مرداد ۱۳۹۱ (<span dir="ltr">5 August 2012</span>)</div>
<div class="send_details">
<span class="icon_holder"><img width="22" height="25" class="rollover v1" src="image_slices/icon_ball_off.jpg" hover="image_slices/icon_ball_on.jpg" vid="1" otid="4" oid="2309"><img width="22" height="25" class="rollover v2" src="image_slices/icon_yellow_off.jpg" hover="image_slices/icon_yellow_on.jpg" vid="2" otid="4" oid="2309"><img width="22" height="25" class="rollover v3" src="image_slices/icon_red_off.jpg" hover="image_slices/icon_red_on.jpg" vid="3" otid="4" oid="2309"></span>تا الان ۱۴ نظر در مورد این ویدیو داده شده است. | نظر شما چیه؟
</div>
</div>
I want the parser to extract video_title and the image associated with that title, and put them into a table. Could anybody show me some sample code to do this?

Depending on what you're trying to do, you may find that ElementParser works for you. It provides some very useful methods for getting data out of an HTML document using CSS selectors, similar to jQuery. The documentation is a little light, but an intro to using it is available.
