How to recognize text from an image with easyocr without '' symbols for each row

I am trying to recognize text from an image, but easyocr prints '' and , symbols for each row. For example, when there are lines of text in the picture, easyocr prints each row as 'example example','example example', and so on.
I want to recognize the text without these symbols.
Here is the code:
reader = easyocr.Reader(['tr'])
result = reader.readtext(IMAGE_PATH, detail=0, blocklist="-.:';,",
                         slope_ths=2.5, ycenter_ths=0.2)
print(result)
And the result
['4 ', 'Osmanlı Devleti nde Orhan Bey döneminde', 'Şehirlere kadılar atanmış ', 'Ỉznik te medrese açılmış ', 'Bursa başkent yapılmıştır', 'Buna göre', 'adlip', 'idari', ']', 'askeris', 'IV', 'eğitim', 'yönelik düzenlemeler yapıldı ', 'alanlarından hangilerine', 'savunulabilir?', 'ğı', 'C) Ill ve IV', 'B) Il ve Ill', 'II', 'A) / ve', 'E) IIp Ill ve IV', 've IV', 'D) / Il']
Can I get the result as a single string like below?
['4 Osmanlı Devleti nde Orhan Bey döneminde Şehirlere kadılar atanmış Ỉznik te medrese açılmış Bursa başkent yapılmıştır Buna göre adlip idari askeris IV eğitim yönelik düzenlemeler yapıldı alanlarından hangilerine savunulabilir? ğı C) Ill ve IV B) Il ve Ill II A) / ve E) IIp Ill ve IV ve IV D) / Il']
The image: [image of the text being recognized]

According to the official guide (https://www.jaided.ai/easyocr/tutorial/), if you provide the hyperparameter paragraph=True, EasyOCR will try to combine the raw result into an easy-to-read paragraph.
Here's the result -
result = reader.readtext('https://www.somewebsite.com/chinese_tra.jpg',detail = 0) # without paragraph hyperparameter.
result -
['高鐵左營站', 'HSR', 'Station', '汽車臨停接送區', 'Kiss', 'Car', 'and', 'Ride']
With paragraph hyperparameter -
result = reader.readtext('https://www.somewebsite.com/chinese_tra.jpg',detail = 0, paragraph = True)
result -
['高鐵左營站 HSR Station 汽車臨停接送區 Car Kiss and Ride']

Just concatenate the sentences in the list after getting the result.
Something like this:
length = len(result)
i = 0
content = ""
while i < length:
    content = content + ' ' + str(result[i])  # with detail=0, result[i] is already a string
    i += 1
print(content)
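With detail=0 each element of result is already a plain string, so the whole loop collapses to a single join (using a truncated sample of the question's output for illustration):

```python
# Truncated sample of the list returned by reader.readtext(..., detail=0)
result = ['4 ', 'Osmanlı Devleti nde Orhan Bey döneminde', 'Bursa başkent yapılmıştır']

# Join the per-row strings into one string, trimming stray spaces.
content = ' '.join(s.strip() for s in result)
```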

Related

Clear both the left 14 characters and "x" right, after a specific character

I have a MySQL table with records like:
<font size=1>ANGRY BIRDS 2<br>Thurop Van Orman, John Rice (USA, Finland)<br>5/9/2019 – Spentzos Film</font>
<font size=1>THE FAVORITE<br>Giorgos Lanthimos (USA, UK, Irland)<br>5/9/2019 - Feelgood Ent</font>
What I need is to keep only the title of the movie, clearing all the unwanted characters on the left and right:
ANGRY BIRDS 2
THE FAVORITE
I tried
SUBSTR(`Title`, 14)
and
SUBSTRING_INDEX(`Title`, '<br>', 1)
I also tried to combine Leading and Trailing in one line like
TRIM(LEADING '<font size=1>' FROM `tbl`.`Title`),
TRIM(TRAILING FROM `tbl`.`Title`, '<br>', 1)
but it doesn't work.
Is that possible to get the result I need?
SELECT SUBSTRING_INDEX(SUBSTRING_INDEX(`Title`, '<', 2), '>', -1)
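The nested SUBSTRING_INDEX call works because the inner call keeps everything before the second '<' and the outer call keeps everything after the last '>' of what remains. A quick sketch of the same logic in Python (the helper mimics MySQL's SUBSTRING_INDEX semantics) makes this easier to trace:

```python
def substring_index(s, delim, count):
    # Mimic MySQL SUBSTRING_INDEX: keep everything before the count-th
    # delimiter from the left (count > 0), or everything after the
    # count-th delimiter from the right (count < 0).
    parts = s.split(delim)
    if count > 0:
        return delim.join(parts[:count])
    return delim.join(parts[count:])

row = '<font size=1>ANGRY BIRDS 2<br>Thurop Van Orman, John Rice (USA, Finland)<br>5/9/2019 – Spentzos Film</font>'
inner = substring_index(row, '<', 2)     # keeps '<font size=1>ANGRY BIRDS 2'
title = substring_index(inner, '>', -1)  # keeps 'ANGRY BIRDS 2'
```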

NLTK Lesk issue

I am running a simple sentence disambiguation test, but the synset returned by NLTK's Lesk for the word 'cat' in the sentence "The cat likes milk" is 'kat.n.01' (synset id 3608870).
(n) kat, khat, qat, quat, cat, Arabian tea, African tea (the leaves of the shrub Catha edulis which are chewed like tobacco or used to make tea; has the effect of a euphoric stimulant) "in Yemen kat is used daily by 85% of adults"
This is a simple phrase and yet the disambiguation task fails.
And this is happening for many words in a set containing more than one sentence, for example in my test sentences, I would expect 'dog' to be disambiguated as 'domestic dog' but Lesk gives me 'pawl' (a hinged catch that fits into a notch of a ratchet to move a wheel forward or prevent it from moving backward)
Is it related to the size of the training set, which in my test is only a few sentences?
Here is my test code:
from nltk.tag import PerceptronTagger
from nltk.wsd import lesk

def test_lesk():
    words = get_sample_words()
    print(words)
    tagger = PerceptronTagger()
    tags = tagger.tag(words)
    print(tags[:5])
    for word, tag in tags:
        pos = get_wordnet_pos(tag)
        if pos is None:
            continue
        print("word=%s, tag=%s, pos=%s" % (word, tag, pos))
        synset = lesk(words, word, pos)
        if synset is None:
            print('No synsetid for word=%s' % word)
        else:
            print('word=%s, synsetname=%s, synsetid=%d' % (word, synset.name(), synset.offset()))
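For what it's worth, Lesk picks the synset whose dictionary gloss shares the most words with the context, so with a context as short as "The cat likes milk" the overlap signal is almost pure noise. A minimal sketch of the scoring (the glosses here are abbreviated and illustrative, not the real WordNet entries) shows how a wrong sense can win on a stop-word overlap alone:

```python
def overlap_score(context, gloss):
    # Simplified Lesk: count distinct words shared by context and gloss.
    return len(set(context.lower().split()) & set(gloss.lower().split()))

context = "The cat likes milk"
glosses = {  # abbreviated, illustrative glosses
    "cat.n.01": "feline mammal usually having thick soft fur",
    "kat.n.01": "the leaves of the shrub Catha edulis which are chewed like tobacco",
}
best = max(glosses, key=lambda name: overlap_score(context, glosses[name]))
# Only the stop word "the" overlaps, yet that is enough to pick kat.n.01.
```

Filtering stop words from the context before scoring, or giving Lesk a larger context window, usually helps more here than adding sentences to the test set.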

JSON file parsing

So I am trying to open some JSON files to look for a publication year and sort them accordingly. But before doing this, I decided to experiment on a single file. I am having trouble though, because although I can get the files and the strings, when I try to print one word, it starts printing individual characters.
For example:
print data2[1] #prints
THE BRIDES ORNAMENTS, Viz. Fiue MEDITATIONS, Morall and Diuine. #results
but now
print data2[1][0] #should print THE
T #prints T
This is my code right now:
import json

json_data = open(path)
data = json.load(json_data)
data2 = []
for x in range(0, len(data)):
    data2.append(data[x]['section'])
    if len(data[x]['content']) > 0:
        for i in range(0, len(data[x]['content'])):
            data2.append(data[x]['content'][i])
I probably need to look at your json file to be absolutely sure, but it seems to me that the data2 list is a list of strings. Thus, data2[1] is a string. When you do data2[1][0], the expected result is what you are getting - the character at the 0th index in the string.
>>> data2[1]
'THE BRIDES ORNAMENTS, Viz. Fiue MEDITATIONS, Morall and Diuine.'
>>> data2[1][0]
'T'
To get the first word, naively, you can split the string by spaces
>>> data2[1].split()
['THE', 'BRIDES', 'ORNAMENTS,', 'Viz.', 'Fiue', 'MEDITATIONS,', 'Morall', 'and', 'Diuine.']
>>> data2[1].split()[0]
'THE'
However, this will cause issues with punctuation, so you probably need to tokenize the text. This link should help - http://www.nltk.org/_modules/nltk/tokenize.html
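If pulling in the full nltk tokenizer is overkill, a regex gets most of the way there for this kind of text (punctuation is simply dropped, which may or may not be what you want):

```python
import re

sentence = 'THE BRIDES ORNAMENTS, Viz. Fiue MEDITATIONS, Morall and Diuine.'
# \w+ grabs runs of word characters, so trailing commas and periods fall away.
words = re.findall(r'\w+', sentence)
```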

R: Extracting elements between characters in a web page

I have two lines of info from a web page that I want to parse into a data.frame.
[104] " $1775 / 2br - 1112ft² - Wonderful two bedroom two bathroom with balcony! (14001 NE 183rd Street )"
[269] " var pID = \"4619136687\";"
I'd like it to look like this.
postID |rent|type|size|description |location
4619136687|1775|2br |1112|Wonderful two bedroom...|14001 NE 183rd Street
I was able to use the sub() command to get the ID, but I'm not familiar enough with regex in the sub() command to parse out what I need when there are spaces, such as in line [104].
sub(".*pID = \"(.*)\";.*","\\1", " var pID = \"4619136687\";")
Any help would be wonderful, Thanks!
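The same capture-group idea used for the pID extraction carries over to line [104]: one pattern with a group per field. A sketch in Python of the parse (the pattern assumes the price/type/size/description/location layout shown above and is easy to translate back to R's sub() backreferences):

```python
import re

line = ' $1775 / 2br - 1112ft² - Wonderful two bedroom two bathroom with balcony! (14001 NE 183rd Street )'

# One group per wanted column: rent, type, size, description, location.
pattern = r'\$(\d+) / (\w+) - (\d+)ft² - (.*) \((.*?)\s*\)'
m = re.search(pattern, line)
rent, unit_type, size, description, location = m.groups()
```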

Word frequency count based on two words using python

There are many resources online that show how to do a word count for a single word,
like this and this and this and others...
But I was not able to find a concrete example of counting the frequency of two-word pairs.
I have a csv file that has some strings in it.
FileList = "I love TV show makes me happy, I love also comedy show makes me feel like flying"
So I want the output to be like :
wordscount = {"I love": 2, "show makes": 2, "makes me" : 2 }
Of course I will have to strip all the punctuation: ! , " ' ? . ( ) [ ] ^ % # & * - _ ; / \ |
I will also remove some stop words which I found here, just to get more concrete data from the text.
How can I achieve this result using Python?
Thanks!
>>> from collections import Counter
>>> import re
>>>
>>> sentence = "I love TV show makes me happy, I love also comedy show makes me feel like flying"
>>> words = re.findall(r'\w+', sentence)
>>> two_words = [' '.join(ws) for ws in zip(words, words[1:])]
>>> wordscount = {w:f for w, f in Counter(two_words).most_common() if f > 1}
>>> wordscount
{'show makes': 2, 'makes me': 2, 'I love': 2}
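The zip trick in the answer generalizes to any window size; a small helper (the names are mine, not from the question) for arbitrary n-grams:

```python
import re
from collections import Counter

def ngram_counts(text, n):
    # Tokenize on word characters, then count sliding windows of n words.
    words = re.findall(r'\w+', text)
    grams = (' '.join(words[i:i + n]) for i in range(len(words) - n + 1))
    return Counter(grams)

sentence = "I love TV show makes me happy, I love also comedy show makes me feel like flying"
repeated = {g: c for g, c in ngram_counts(sentence, 2).items() if c > 1}
```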