Clear both the left 14 characters and "x" right, after a specific character - mysql

I have a MySQL table with records like:
<font size=1>ANGRY BIRDS 2<br>Thurop Van Orman, John Rice (USA, Finland)<br>5/9/2019 – Spentzos Film</font>
<font size=1>THE FAVORITE<br>Giorgos Lanthimos (USA, UK, Irland)<br>5/9/2019 - Feelgood Ent</font>
What I need is to keep only the title of the movie, clearing all the unwanted characters to its left and right:
ANGRY BIRDS 2
THE FAVORITE
I tried
SUBSTR(`Title`, 14)
and
SUBSTRING_INDEX(`Title`, '<br>', 1)
I also tried to combine LEADING and TRAILING trims in one statement, like
TRIM(Leading '<font size=1>' FROM `tbl.`Title`),
trim(trailing from `tbl`.`Title`, '<br>', 1)
but it doesn't work.
Is it possible to get the result I need?

SELECT SUBSTRING_INDEX(SUBSTRING_INDEX(`Title`, '<', 2), '>', -1)
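To see why the nested call works, here is a minimal Python sketch that mimics the two steps. The helper `substring_index` is a hypothetical stand-in written for illustration (it is not a MySQL or Python built-in) and only approximates MySQL's behaviour for a non-empty delimiter: the inner call keeps everything before the second `<`, and the outer call keeps everything after the last `>`.

```python
def substring_index(text, delim, count):
    """Approximate MySQL SUBSTRING_INDEX (hypothetical helper, non-empty delim)."""
    if count == 0:
        return ""
    parts = text.split(delim)
    if count > 0:
        # Everything to the left of the count-th delimiter.
        return delim.join(parts[:count])
    # Negative count: everything to the right of the count-th delimiter from the end.
    return delim.join(parts[count:])


def extract_title(raw):
    # Step 1: cut at the second '<', leaving '<font size=1>TITLE'.
    left = substring_index(raw, "<", 2)
    # Step 2: keep what follows the last '>', leaving 'TITLE'.
    return substring_index(left, ">", -1)


record = ("<font size=1>THE FAVORITE<br>Giorgos Lanthimos (USA, UK, Irland)"
          "<br>5/9/2019 - Feelgood Ent</font>")
print(extract_title(record))
```

This relies on the title being the first text between the opening tag and the next tag, as in the sample rows.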

Related

Find records that contain characters other than alphabets, space and period in 7 million records stored in MYSQL VER 8

I have special characters stored in a MySQL database, like the samples below, mostly in the first name and last name columns.
1.  BALPAI SAB
2. à¦à¦¿à¦•à§Âরমাদিতà§Âয
Valid cases:
Saurabh Shree
S.shree
T.M.Anthony
Charles Babbage Senior
Length is variable. All names are case-insensitive with no trailing spaces. Only spaces and periods are allowed between two consecutive words.
I have gone through the posts on REGEXP, changed the collation of the column as well as the table to utf8mb4_unicode_ci, and applied REGEXP, but with no luck.
I have to find even a single occurrence of such characters in around 7 million records.
SELECT FARMER_BRANCH_NAME, HEX(FARMER_BRANCH_NAME) FROM BSBY.PROPOSAL
OUTPUT
Farmer Branch Name Hex(Farmer Branch Name)
SME œ•œBRANCH JASDANœ•œ 534D45209C959C4252414E4348204A415344414E9C959C
নলহাটি E0A6A8E0A6B2E0A6B9E0A6BEE0A69FE0A6BF
নলহাটি E0A6A8E0A6B2E0A6B9E0A6BEE0A69FE0A6BF
নলহাটি E0A6A8E0A6B2E0A6B9E0A6BEE0A69FE0A6BF
SME œ•œBRANCH JASDANœ•œ 534D45209C959C4252414E4348204A415344414E9C959C
Mumbai - Chembur  4D756D626169202D204368656D627572C2A0
New Delhi - Connaught Place - II  4E65772044656C6869202D20436F6E6E617567687420506C616365202D204949C2A0
Mumbai - Malad  4D756D626169202D204D616C6164C2A0
Bangalore - Cantonment  42616E67616C6F7265202D2043616E746F6E6D656E74C2A0
Ahmedabad-BOPAL  41686D6564616261642D424F50414CC2A0
SME œ•œBRANCH JASDANœ•œ 534D45209C959C4252414E4348204A415344414E9C959C
SELECT FARMER_NAME,HEX(FARMER_NAME) FROM BSBY_UAT.PROPOSAL where FARMER_NAME NOT REGEXP '[A-Za-z0-9.() ]$'
OUTPUT
FARMER NAME HEX(FARMER NAME)
RAHIM BISWAS 524148494D2042495357415309
ESARUL GAZI 45534152554C2047415A4909
GOLAM NABI MANDAL 474F4C414D204E414249204D414E44414C09
LATIF MANDAL 4C41544946204D414E44414C09
NILKAMAL MANDAL 4E494C4B414D414C204D414E44414C09
SHUKUR ALI MONDAL 5348554B555220414C49204D4F4E44414C09
¦€ Â¦€° Â¦€º Â§Â Â¦€¢  Â¦€  Â¦Â² Â¦Â¿ A0C2A680A0C2A680B0A0C2A680BAA0C2A7C281A0C2A680A220A0C2A680A0A0C2A6C2B2A0C2A6C2BF
HASINA KHATUN 484153494E41204B484154554E09
KSHETRAGOPAL GHOSH 4B534845545241474F50414C2047484F534809
SUKUMAR DAS HALDAR 53554B554D4152204441532048414C44415209
Yasin Hossain 596173696E20486F737361696E09
SHAH HOSSAIN MOLLA 5348414820484F535341494E204D4F4C4C4109
RAMJAN SEKH 52414D4A414E2053454B4809
Nibaran Ch. Mahato 4E69626172616E2043682E204D616861746F09
PRAKASH KUMAR MONDAL 5052414B415348204B554D4152204D4F4E44414C2009
UNFERA BEWA 554E4645524120424557410909
BODRUL HOQUE 424F4452554C20484F5155450909
à¦à¦¾à¦¦à¦² চনà§à¦¦à§à¦° সরকার E0A6E0A6BEE0A6A6E0A6B220E0A69AE0A6A8E0A78DE0A6A6E0A78DE0A6B020E0A6B8E0A6B0E0A695E0A6BEE0A6B0
à¦à¦¾à¦¦à¦² চনà§à¦¦à§à¦° সরকার E0A6E0A6BEE0A6A6E0A6B220E0A69AE0A6A8E0A78DE0A6A6E0A78DE0A6B020E0A6B8E0A6B0E0A695E0A6BEE0A6B0
মিনতি সিংহ E0A6AEE0A6BFE0A6A8E0A6A4E0A6BF20E0A6B8E0A6BFE0A682E0A6B9
রেখা সরকার E0A6B0E0A787E0A696E0A6BE20E0A6B8E0A6B0E0A695E0A6BEE0A6B0
রেখা সরকার E0A6B0E0A787E0A696E0A6BE20E0A6B8E0A6B0E0A695E0A6BEE0A6B0
SUKDEB SARKARপ 53554B444542205341524B4152E0A6AA
KEYAMUL SEKH 4B4559414D554C2053454B480909
घोष पारà¥à¤µà¤¤à¥€ E0A498E0A58BE0A4B720E0A4AAE0A4BEE0A4B0E0A58DE0A4B5E0A4A4E0A580
à¦à¦¨à§à¦Ÿà§ সরকার E0A69DE0A6A8E0A78DE0A69FE0A78120E0A6B8E0A6B0E0A695E0A6BEE0A6B0
à¦à¦²à¦°à¦¾à¦® সরকার E0A6E0A6B2E0A6B0E0A6BEE0A6AE20E0A6B8E0A6B0E0A695E0A6BEE0A6B0
মনোতোষ সরকার E0A6AEE0A6A8E0A78BE0A6A4E0A78BE0A6B720E0A6B8E0A6B0E0A695E0A6BEE0A6B0
Here is my code:
SELECT distinct(FARMER_APPLICATION_ID) as FARMER_APPLICATION_ID,FARMER_AADHAR_NO,FARMER_EPIC_NO,FARMER_NAME,FARMER_GUARDIAN_NAME,FARMER_CROP_NAME,FARMER_L3_NAME,FARMER_L4_NAME,FARMER_L5_NAME,FARMER_L6_NAME,FARMER_BANK_NAME,FARMER_BANK_IFSC,PARTY_NAME,PARTY_CODE,FARMER_BRANCH_NAME
FROM BSBY_UAT.PROPOSAL
where FARMER_AADHAR_NO NOT regexp '^[2-9]{1}[0-9]{3}[0-9]{4}[0-9]{4}$'
OR FARMER_BANK_IFSC not regexp '^[A-Z]{4}0[A-Z0-9]{6}$'
OR FARMER_NAME NOT REGEXP '[A-Za-z.() ]$'
OR FARMER_GUARDIAN_NAME NOT REGEXP '[A-Za-z.() ]$'
or FARMER_EPIC_NO NOT REGEXP'[A-Za-z0-9\\/]$'
or FARMER_BANK_NAME NOT REGEXP'[A-Za-z.\\-() ]$'
or FARMER_BRANCH_NAME NOT REGEXP'[A-Za-z0-9.,()\\[\\]\\-]$'
This is taken from sample "2" in the question; I wonder if it gives any clues:
à ¦
à ¦¿
à ¦•
à §Â
à ¦°
à ¦®
à ¦¾
à ¦¦
à ¦¿
à ¦¤
à §Â
à ¦Â
For one of the hex strings, I see that
CONVERT(UNHEX( 'E0A69DE0A6A8E0A78DE0A69FE0A78120E0A6B8E0A6B0E0A695E0A6BEE0A6B0') USING utf8mb4)
yields ঝন্টু সরকার
That does not necessarily lead to a solution, but it may give a clue that there was an encoding problem during Insertion.
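The same check can be reproduced outside MySQL. This is a minimal Python sketch, assuming the HEX() output really is UTF-8 (the `utf8mb4` conversion above suggests so): it turns the hex string back into bytes and decodes them.

```python
# Hex string copied from the HEX(FARMER_NAME) output in the question.
hex_value = "E0A69DE0A6A8E0A78DE0A69FE0A78120E0A6B8E0A6B0E0A695E0A6BEE0A6B0"

# bytes.fromhex plays the role of UNHEX; decode("utf-8") plays the role of
# CONVERT(... USING utf8mb4).
decoded = bytes.fromhex(hex_value).decode("utf-8")
print(decoded)
```

If the bytes decode cleanly to readable Bengali, the stored bytes are fine and the garbling happens when they are displayed or read back with the wrong character set.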
As for a regexp, consider something like
HEX(column) REGEXP '^(..)*[89ABCDEF]'
That will discover whether any byte in the string is an 8-bit code outside ASCII (that is, has its high bit set).
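As a rough illustration, the hex pattern can be tried in Python on two values from the question's output. The function names below are made up for this sketch. It also includes a whitelist check, which is only my reading of the asker's rule (letters, spaces and periods): note that the pattern in the question, `'[A-Za-z.() ]$'`, tests only the last character, so a whole-string match is assumed here instead.

```python
import re

# Same idea as HEX(column) REGEXP '^(..)*[89ABCDEF]': skip whole bytes (two hex
# digits each) and look for a byte whose first hex digit is 8..F.
HIGH_BIT = re.compile(r"^(..)*[89ABCDEF]")

# Assumed rule: the whole name consists of letters, spaces and periods only.
ALLOWED_NAME = re.compile(r"[A-Za-z. ]+")


def has_non_ascii(hex_value):
    return HIGH_BIT.search(hex_value) is not None


def has_non_ascii_bytes(hex_value):
    # Equivalent check done directly on the bytes.
    return any(b >= 0x80 for b in bytes.fromhex(hex_value))


def is_valid_name(name):
    return ALLOWED_NAME.fullmatch(name) is not None


ascii_with_tab = "524148494D2042495357415309"        # RAHIM BISWAS + tab
ascii_plus_bengali = "53554B444542205341524B4152E0A6AA"  # SUKDEB SARKAR + Bengali PA

for value in (ascii_with_tab, ascii_plus_bengali):
    print(value, has_non_ascii(value), has_non_ascii_bytes(value))
```

The first value passes the high-bit test because a trailing tab (09) is still ASCII, yet it fails the whitelist, which is why both kinds of check may be worth running.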
This has a mixture:
CONVERT(UNHEX('53554B444542205341524B4152E0A6AA') USING utf8mb4) --> 'SUKDEB SARKARপ'
That is, it is ASCII, but with a Bengali 'PA' on the end. The fact that you are seeing 'SUKDEB SARKARপ' is a sign of Mojibake. See this for a discussion of Mojibake (and other common messes): Trouble with UTF-8 characters; what I see is not what I stored

How to recognize text from image on easyocr without '' symbols for each row

I am trying to recognize text from an image, but easyocr wraps each row in '' quotes and separates the rows with commas. For example, when the picture contains several lines of text, easyocr prints them as 'example example', 'example example', and so on.
I want to recognize the text without these symbols.
Here is the code:
import easyocr

reader = easyocr.Reader(['tr'])
result = reader.readtext(IMAGE_PATH, detail=0, blocklist="-.:';,",
                         slope_ths=2.5, ycenter_ths=0.2)
print(result)
And the result
['4 ', 'Osmanlı Devleti nde Orhan Bey döneminde', 'Şehirlere kadılar atanmış ', 'Ỉznik te medrese açılmış ', 'Bursa başkent yapılmıştır', 'Buna göre', 'adlip', 'idari', ']', 'askeris', 'IV', 'eğitim', 'yönelik düzenlemeler yapıldı ', 'alanlarından hangilerine', 'savunulabilir?', 'ğı', 'C) Ill ve IV', 'B) Il ve Ill', 'II', 'A) / ve', 'E) IIp Ill ve IV', 've IV', 'D) / Il']
Can I get the result as a single string, like below?
['4 Osmanlı Devleti nde Orhan Bey döneminde Şehirlere kadılar atanmış Ỉznik te medrese açılmış Bursa başkent yapılmıştır Buna göre adlip idari askeris IV eğitim yönelik düzenlemeler yapıldı alanlarından hangilerine savunulabilir? ğı C) Ill ve IV B) Il ve Ill II A) / ve E) IIp Ill ve IV ve IV D) / Il']
According to their official guide (https://www.jaided.ai/easyocr/tutorial/), if you pass the hyperparameter paragraph = True, EasyOCR will try to combine the raw results into an easy-to-read paragraph.
Here's the result -
result = reader.readtext('https://www.somewebsite.com/chinese_tra.jpg',detail = 0) # without paragraph hyperparameter.
result -
['高鐵左營站', 'HSR', 'Station', '汽車臨停接送區', 'Kiss', 'Car', 'and', 'Ride']
With paragraph hyperparameter -
result = reader.readtext('https://www.somewebsite.com/chinese_tra.jpg',detail = 0, paragraph = True)
result -
['高鐵左營站 HSR Station 汽車臨停接送區 Car Kiss and Ride']
Just concatenate the sentences in the list after getting the result.
Something like this:
# With detail=0, result is a plain list of strings, so join them directly.
content = ' '.join(text.strip() for text in result)
print(content)

Nltk lesk issue

I am running a simple sentence disambiguation test. But the synset returned by nltk Lesk for the word 'cat' in the sentence "The cat likes milk" is 'kat.n.01', synsetid=3608870.
(n) kat, khat, qat, quat, cat, Arabian tea, African tea (the leaves of the shrub Catha edulis which are chewed like tobacco or used to make tea; has the effect of a euphoric stimulant) "in Yemen kat is used daily by 85% of adults"
This is a simple phrase and yet the disambiguation task fails.
And this is happening for many words in a set containing more than one sentence, for example in my test sentences, I would expect 'dog' to be disambiguated as 'domestic dog' but Lesk gives me 'pawl' (a hinged catch that fits into a notch of a ratchet to move a wheel forward or prevent it from moving backward)
Is this related to the size of the training set, which in my test is only a few sentences?
Here is my test code:
from nltk.tag.perceptron import PerceptronTagger
from nltk.wsd import lesk

# get_sample_words() and get_wordnet_pos() are helper functions not shown here.
def test_lesk():
    words = get_sample_words()
    print(words)
    tagger = PerceptronTagger()
    tags = tagger.tag(words)
    print(tags[:5])
    for word, tag in tags:
        pos = get_wordnet_pos(tag)
        if pos is None:
            continue
        print("word=%s,tag=%s,pos=%s" % (word, tag, pos))
        synset = lesk(words, word, pos)
        if synset is None:
            print('No synsetid for word=%s' % word)
        else:
            print('word=%s, synsetname=%s, synsetid=%d' % (word, synset.name(), synset.offset()))
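For context on the training-set question: as far as I know, Lesk-style disambiguation is not trained at all. It scores each candidate sense by how many words its dictionary gloss shares with the context. The sketch below is a toy version of that idea, not NLTK's implementation; the sense names and glosses are invented for illustration and are not WordNet's.

```python
def simplified_lesk(context_words, senses):
    """Pick the sense whose gloss shares the most words with the context.

    `senses` maps a sense name to its gloss (hypothetical data, not WordNet).
    """
    context = set(context_words)
    best_sense, best_overlap = None, -1
    for name, gloss in senses.items():
        overlap = len(context & set(gloss.split()))
        if overlap > best_overlap:
            best_sense, best_overlap = name, overlap
    return best_sense, best_overlap


# Invented glosses, for illustration only.
TOY_SENSES = {
    "cat.drug": "leaves of a shrub chewed as a stimulant",
    "cat.animal": "feline mammal that likes milk and mice",
}

print(simplified_lesk(["The", "cat", "likes", "milk"], TOY_SENSES))
print(simplified_lesk(["The", "cat", "sleeps"], TOY_SENSES))
```

In the second call no gloss shares a word with the context, so every sense scores zero and the winner is decided by the tie-break alone (here, simply the first entry). Short sentences give very little to overlap with, which may be one reason a short context yields surprising senses.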

Substring and Substring index to split into two integers

I'm trying to split an hours value into two different columns, "hrs" and "mins".
For example, if the user enters 0.50 --- 0 should go to the "hrs" column, while 50 should go to the mins column.
I tried to use SUBSTRING_INDEX and SUBSTRING and it worked
SUBSTRING_INDEX(SUBSTRING("0.50", 3), ".", 2)
SUBSTRING_INDEX(SUBSTRING("0.50", 1), ".", 1)
and I got the output 0 and 50, so that's working. But if the user enters 10.5 or 2.5, it all gets messed up.
How can I properly split it into 2 separate integers?
Thank you
Why not explode?
$pizza = "05.500";
$pieces = explode(".", $pizza);
echo $pieces[0];
echo '/';
echo $pieces[1];
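The same split-on-the-dot idea can be sketched in Python. The function name `split_hours` is made up for this example, and it assumes the digits after the dot are taken literally as the minutes value (so "10.5" gives 5, not 50); that interpretation is a guess, since the question does not say which is intended.

```python
def split_hours(value):
    """Split 'H.M' text into (hrs, mins) integers (hypothetical helper)."""
    # partition splits at the first '.', whatever the number of digits on each side.
    hrs, _, mins = value.partition(".")
    # Missing parts (e.g. "3" or ".5") are treated as 0.
    return int(hrs or 0), int(mins or 0)


for entry in ("0.50", "10.5", "2.5", "05.500"):
    print(entry, split_hours(entry))
```

Because the split is on the delimiter rather than on fixed character positions, the length of the hours part no longer matters.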

Word frequency count based on two words using python

There are many resources online that show how to do a word count for a single word,
like this and this and this and others...
But I was not able to find a concrete example of a frequency count for two-word pairs.
I have a csv file that has some strings in it.
FileList = "I love TV show makes me happy, I love also comedy show makes me feel like flying"
So I want the output to be like :
wordscount = {"I love": 2, "show makes": 2, "makes me" : 2 }
Of course I will have to strip all the commas, question marks and other punctuation: {!, , ", ', ?, ., (,), [, ], ^, %, #, #, &, *, -, _, ;, /, \, |, }
I will also remove some stop words which I found here just to get more concrete data from the text.
How can I achieve this result using Python?
Thanks!
>>> from collections import Counter
>>> import re
>>>
>>> sentence = "I love TV show makes me happy, I love also comedy show makes me feel like flying"
>>> words = re.findall(r'\w+', sentence)
>>> two_words = [' '.join(ws) for ws in zip(words, words[1:])]
>>> wordscount = {w:f for w, f in Counter(two_words).most_common() if f > 1}
>>> wordscount
{'show makes': 2, 'makes me': 2, 'I love': 2}
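The session above can be wrapped into a function that also drops stop words, which the question mentions. This is a sketch under assumptions: the name `bigram_counts`, its parameters, and the one-word stop list are invented for illustration, and the real stop list would come from wherever the asker found theirs.

```python
from collections import Counter
import re


def bigram_counts(text, stop_words=frozenset(), min_count=2):
    """Count adjacent word pairs, ignoring punctuation and stop words."""
    # \w+ keeps runs of letters, digits and underscores, so punctuation is dropped.
    words = [w for w in re.findall(r"\w+", text) if w.lower() not in stop_words]
    # zip pairs each word with its successor.
    pairs = Counter(zip(words, words[1:]))
    return {" ".join(pair): n for pair, n in pairs.items() if n >= min_count}


sentence = ("I love TV show makes me happy, I love also comedy show "
            "makes me feel like flying")
EXAMPLE_STOP_WORDS = {"also"}  # hypothetical stop list

print(bigram_counts(sentence))
print(bigram_counts(sentence, stop_words=EXAMPLE_STOP_WORDS))
```

Note that removing stop words before pairing makes the words on either side of a removed word count as adjacent, and that pairs are also formed across commas; whether that is wanted depends on the data.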