When cleaning tokenized data, how to use .isalpha() in a list of lists to return values, not Booleans? - nltk

I'm practicing NLP with the nltk library and want to build my own dataset for it. I combine several documents into a list of lists and then preprocess them: first I tokenize, then I lowercase, and then I want to remove punctuation. This works for a single flat list, but not for a list of lists.
Example for a flat list:
a = 'This is a Testsentence and it is beautiful times 10!**!.'
b = word_tokenize(a)
c = [x.lower() for x in b]
['this', 'is', 'a', 'testsentence', 'and', 'it', 'is', 'beautiful', 'times', '10', '.']
d = [x for x in c if x.isalpha()]
['this', 'is', 'a', 'testsentence', 'and', 'it', 'is', 'beautiful', 'times']
Now I want to do it in a list of lists, but I fail to write the list comprehension at the end:
aa = 'This is a Testsentence and it is beautiful times 10.'
bb = 'It is a beautiful Testsentence?'
cc = 'Testsentence beautiful!'
dd = [aa, bb, cc]
ee = [word_tokenize(x) for x in dd]
ff = [[x.lower() for x in y] for y in ee]
[['this', 'is', 'a', 'testsentence', 'and', 'it', 'is', 'beautiful', 'times', '10', '.'], ['it', 'is', 'a', 'beautiful', 'testsentence', '?'], ['testsentence', 'beautiful', '!']]
This is where my problems start, since I can't figure out how to write the list comprehension correctly.
gg = [[j.isalpha() for j in i] for i in ff]
This is the result:
[[True, True, True, True, True, True, True, True, True, False, False], [True, True, True, True, True, False], [True, True, False]]
But I want something like this, with the same nesting but only the alphabetic tokens kept:
[['this', 'is', 'a', 'testsentence', 'and', 'it', 'is', 'beautiful', 'times'], ['it', 'is', 'a', 'beautiful', 'testsentence'], ['testsentence', 'beautiful']]
Thanks :)

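Try the following. It moves .isalpha() out of the output expression and into the if condition, so each token is kept or dropped instead of being replaced by a Boolean: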
gg = [[j for j in i if j.isalpha()] for i in ff]
This returns the expected answer
[['this', 'is', 'a', 'testsentence', 'and', 'it', 'is', 'beautiful', 'times'],
['it', 'is', 'a', 'beautiful', 'testsentence'],
['testsentence', 'beautiful']]
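To make the difference concrete, here is a minimal sketch that contrasts the mapping and filtering forms on the ff list from the question. The tokens are hard-coded so the snippet runs without nltk. The names flags, gg_loop and with_numbers are only illustrative, and the .isalnum() variant is an assumption about what you might want if numbers such as '10' should survive.

```python
# Lowercased tokens copied from the question (no nltk needed to run this).
ff = [
    ['this', 'is', 'a', 'testsentence', 'and', 'it', 'is', 'beautiful', 'times', '10', '.'],
    ['it', 'is', 'a', 'beautiful', 'testsentence', '?'],
    ['testsentence', 'beautiful', '!'],
]

# .isalpha() in the expression position MAPS each token to a Boolean.
flags = [[tok.isalpha() for tok in sent] for sent in ff]

# .isalpha() in the condition position FILTERS tokens and keeps the strings.
gg = [[tok for tok in sent if tok.isalpha()] for sent in ff]

# The same filter written with explicit loops, for comparison.
gg_loop = []
for sent in ff:
    kept = []
    for tok in sent:
        if tok.isalpha():
            kept.append(tok)
    gg_loop.append(kept)

# Variant (assumption): keep purely numeric tokens too by testing .isalnum().
with_numbers = [[tok for tok in sent if tok.isalnum()] for sent in ff]

print(gg)
```

Note that .isalpha() also drops '10', which may or may not be what you want for your dataset.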

Related

Merge multiple values for same key to one dict/json (Pandas, Python, Dataframe)?

I have the following dataframe:
pd.DataFrame({'id':[1,1,1,2,2], 'key': ['a', 'a', 'b', 'a', 'b'], 'value': ['kkk', 'aaa', '5', 'kkk','8']})
I want to convert it to the following data frame:
id value
1 {'a':['kkk', 'aaa'], 'b': 5}
2 {'a':['kkk'], 'b': 8}
I am trying to do this with the .to_dict() method, but the output is not what I want:
df.groupby(['id','key']).aggregate(list).groupby('id').aggregate(list).to_dict()
{'value': {1: [['kkk', 'aaa'], ['5']], 2: [['kkk'], ['8']]}}
Should I use a dict comprehension, or is there a more efficient way to build such a generic JSON/dict?
After you groupby(['id', 'key']) and agg(list), you can group by the first level of the index and for each group thereof, use droplevel + to_dict:
new_df = df.groupby(['id', 'key']).agg(list).groupby(level=0).apply(lambda x: x['value'].droplevel(0).to_dict()).reset_index(name='value')
Output:
>>> new_df
id value
0 1 {'a': ['kkk', 'aaa'], 'b': ['5']}
1 2 {'a': ['kkk'], 'b': ['8']}
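Or, simpler (this returns a Series indexed by id; chain .reset_index(name='value') to get the same DataFrame as above):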
new_df = df.groupby('id').apply(lambda x: x.groupby('key')['value'].agg(list).to_dict())
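As for the dict-comprehension alternative raised in the question, the following is a rough plain-Python sketch of the same grouping. It assumes the rows are available as (id, key, value) tuples; with a DataFrame they could come from something like zip(df['id'], df['key'], df['value']). The names rows, nested and result are illustrative and not part of the answers above.

```python
from collections import defaultdict

# The rows of the example DataFrame as (id, key, value) tuples.
rows = [
    (1, 'a', 'kkk'),
    (1, 'a', 'aaa'),
    (1, 'b', '5'),
    (2, 'a', 'kkk'),
    (2, 'b', '8'),
]

# Outer level: id -> inner dict; inner level: key -> list of values.
nested = defaultdict(lambda: defaultdict(list))
for id_, key, value in rows:
    nested[id_][key].append(value)

# Convert the defaultdicts to plain dicts so the result prints cleanly.
result = {id_: dict(inner) for id_, inner in nested.items()}

print(result)
```

As in the pandas answers, every value ends up in a list, so 'b' maps to ['5'] rather than the bare 5 shown in the question.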

BLEU - Error N-gram overlaps of lower order

I ran the code below
a = ['dog', 'in', 'plants', 'crouches', 'to', 'look', 'at', 'camera']
b = ['a', 'brown', 'dog', 'in', 'the', 'grass', ' ', ' ']
from nltk.translate.bleu_score import corpus_bleu
bleu1 = corpus_bleu(a, b, weights=(1.0, 0, 0, 0))
print(bleu1)
This is the warning I get:
The hypothesis contains 0 counts of 3-gram overlaps. Therefore the
BLEU score evaluates to 0, independently of how many N-gram overlaps
of lower order it contains. Consider using lower n-gram order or use
SmoothingFunction() warnings.warn(_msg)
Can someone tell me what the problem is here? I cannot find a solution on Google. Thank you.
Best,
DD
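I found the solution. Basically, the inputs need more nesting: corpus_bleu takes a list of reference sets (one list of tokenized references per hypothesis) and a list of tokenized hypotheses. So list 'a' has to be a list inside a list, and both arguments are wrapped once more in the call; otherwise each string is treated as a sequence of characters. The code below passes the inputs in the expected shape.
a = [['dog', 'in', 'plants', 'crouches', 'to', 'look', 'at', 'camera']]  # all references for one hypothesis
b = ['a', 'brown', 'dog', 'in', 'the', 'grass', ' ', ' ']  # one tokenized hypothesis
from nltk.translate.bleu_score import corpus_bleu
# one entry per sentence pair: [a] is the list of reference sets, [b] the list of hypotheses
bleu1 = corpus_bleu([a], [b], weights=(1.0, 0, 0, 0))
print(bleu1)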
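To see why the nesting matters, here is a small standalone sketch of clipped unigram precision, the quantity that the weights (1.0, 0, 0, 0) put all the emphasis on. It is not NLTK's implementation and it ignores the brevity penalty. The function name clipped_unigram_precision is hypothetical and exists only for this illustration.

```python
from collections import Counter


def clipped_unigram_precision(references, hypothesis):
    """references: list of token lists; hypothesis: a single token list."""
    hyp_counts = Counter(hypothesis)
    # For each token, the largest count seen in any single reference.
    max_ref_counts = Counter()
    for ref in references:
        for token, count in Counter(ref).items():
            max_ref_counts[token] = max(max_ref_counts[token], count)
    # A hypothesis token is credited at most as often as it occurs in a reference.
    clipped = sum(min(count, max_ref_counts[token])
                  for token, count in hyp_counts.items())
    total = sum(hyp_counts.values())
    return clipped / total if total else 0.0


reference = ['dog', 'in', 'plants', 'crouches', 'to', 'look', 'at', 'camera']
hypothesis = ['a', 'brown', 'dog', 'in', 'the', 'grass', ' ', ' ']

# Only 'dog' and 'in' match, out of 8 hypothesis tokens.
precision = clipped_unigram_precision([reference], hypothesis)
print(precision)

# A bare string where a token list is expected is iterated character by character.
chars = list('dog')
print(chars)
```

The last two lines show what happens when a flat list of strings is passed where a list of token lists is expected: each word is silently split into characters, so the n-gram counts no longer refer to words.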

Python list to multi-level json

I'm a Python beginner. I have a list that needs to be converted to JSON format.
I hope to get some help.
raw data:
result = [('A', 'a1', '1'),
('A', 'a2', '2'),
('B', 'b1', '1'),
('B', 'b2', '2')]
The result I want:
[{'type':'A',
'data':[{'name':'a1','url':'1'},
{'name':'a2','url':'2'}]
},
{'type': 'B',
'data': [{'name':'b1', 'url': '1'},
{'name':'b2','url':'2'}]
}]
Each element, such as ('A', 'a1', '1'), is a tuple, so you can loop over the list and index into each tuple:
result = [('A', 'a1', '1'),
          ('A', 'a2', '2'),
          ('B', 'b1', '1'),
          ('B', 'b2', '2')]
type_list = []
for tup in result:
    if len(type_list) == 0:
        # first tuple: start the first group
        type_list.append({'type': tup[0], 'data': [{ 'name': tup[1], 'url': tup[2] }]})
    else:
        # look for an existing group with the same type
        for type_obj in type_list:
            found = False
            if type_obj['type'] == tup[0]:
                type_obj['data'].append({ 'name': tup[1], 'url': tup[2] })
                found = True
                break
        # no group matched: start a new one
        if not found:
            type_list.append({'type': tup[0], 'data': [{ 'name': tup[1], 'url': tup[2] }]})
print(type_list)
which prints:
[{'type': 'A', 'data': [{'name': 'a1', 'url': '1'}, {'name': 'a2', 'url': '2'}]}, {'type': 'B', 'data': [{'name': 'b1', 'url': '1'}, {'name': 'b2', 'url': '2'}]}]
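A shorter variant of the same grouping, sketched under the assumption that every tuple has exactly three fields (type, name, url) and that Python 3.7+ is used so dicts keep insertion order. It groups through an intermediate dict and then serializes with json.dumps. The names grouped and json_text are illustrative.

```python
import json

result = [('A', 'a1', '1'),
          ('A', 'a2', '2'),
          ('B', 'b1', '1'),
          ('B', 'b2', '2')]

# Group by type; setdefault creates the list the first time a type is seen.
grouped = {}
for type_name, name, url in result:
    grouped.setdefault(type_name, []).append({'name': name, 'url': url})

# Reshape into the list-of-objects layout requested in the question.
type_list = [{'type': type_name, 'data': data}
             for type_name, data in grouped.items()]

# Serialize to an actual JSON string.
json_text = json.dumps(type_list, indent=2)
print(json_text)
```

The dict lookup avoids rescanning type_list for every tuple, which may matter once the input grows beyond a handful of rows.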

Print list to csv file in Python3

I have a list like this.
all_chords = [['C', 'C', 'E', 'G'],
['CM7', 'C', 'E', 'G', 'B'],
['C7', 'C', 'E', 'G', 'Bb'],
['Cm7', 'C', 'Eb', 'G', 'Bb'],
['Cm7b5', 'C', 'Eb', 'Gb', 'Bb'],
['Cdim7', 'C', 'Eb', 'Gb', 'Bbb(A)'],
['Caug7', 'C', 'E', 'G#', 'Bb'],
['C6', 'C', 'E', 'G', 'A'],
['Cm6', 'C', 'Eb', 'G', 'A'],
]
I want to print out to a csv file, something like this.
C_chords.csv
C;C,E,G
CM7;C,E,G,B
C7;C,E,G,Bb
Cm7;C,Eb,G,Bb
Cm7b5;C,Eb,Gb,Bb
Cdim7;C,Eb,Gb,Bbb(A)
Caug7;C,E,G#,Bb
C6;C,E,G,A
Cm6;C,Eb,G,A
It has two fields, which are separated by a semicolon (not by a comma).
I used the csv module, like this:
myfile = open('C_chords.csv','w')
wr = csv.writer(myfile, quotechar=None)
wr.writerows(all_chords)
myfile.close()
The result is..
C,C,E,G
CM7,C,E,G,B
C7,C,E,G,Bb
Cm7,C,Eb,G,Bb
Cm7b5,C,Eb,Gb,Bb
Cdim7,C,Eb,Gb,Bbb(A)
Caug7,C,E,G#,Bb
C6,C,E,G,A
Cm6,C,Eb,G,A
Should I modify the list, something like this?
[['C',';', 'C', 'E', 'G'],.......]
Or do you have any other ideas?
Thanks in advance.
You're writing four or five columns per row, not two. If you want everything after the first element to be one single column, you need to join those elements yourself first.
And to get a semicolon-separated CSV, you need to change the delimiter, not the quote character:
import csv
all_chords = [['C', 'C', 'E', 'G'],
['CM7', 'C', 'E', 'G', 'B'],
['C7', 'C', 'E', 'G', 'Bb'],
['Cm7', 'C', 'Eb', 'G', 'Bb'],
['Cm7b5', 'C', 'Eb', 'Gb', 'Bb'],
['Cdim7', 'C', 'Eb', 'Gb', 'Bbb(A)'],
['Caug7', 'C', 'E', 'G#', 'Bb'],
['C6', 'C', 'E', 'G', 'A'],
['Cm6', 'C', 'Eb', 'G', 'A'],
]
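# newline='' is what the csv docs recommend, so the writer controls line endings
with open('C_chords.csv', 'w', newline='') as myfile:
    wr = csv.writer(myfile, delimiter=';')
    # first element is the chord name; the remaining notes are joined into one field
    wr.writerows([c[0], ','.join(c[1:])] for c in all_chords)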
I think it's easier to do it without the csv module:
with open('C_chords.csv','w') as out_file:
    for row in all_chords:
        print('{};{}'.format(row[0], ','.join(row[1:])), file=out_file)
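To check the two-field layout without touching the disk, the sketch below writes a shortened copy of all_chords to an in-memory buffer and reads it back. Using io.StringIO and lineterminator='\n' is a choice made only for this illustration; the answers above write to C_chords.csv directly.

```python
import csv
import io

# A shortened copy of the chord list from the question.
all_chords = [
    ['C', 'C', 'E', 'G'],
    ['CM7', 'C', 'E', 'G', 'B'],
    ['Cdim7', 'C', 'Eb', 'Gb', 'Bbb(A)'],
]

# Write: chord name in field 1, comma-joined notes in field 2.
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter=';', lineterminator='\n')
writer.writerows([row[0], ','.join(row[1:])] for row in all_chords)
output = buffer.getvalue()
print(output)

# Read back: split on ';' via csv.reader, then split the notes on ','.
parsed = [[name] + notes.split(',')
          for name, notes in csv.reader(io.StringIO(output), delimiter=';')]
print(parsed == all_chords)
```

Because the delimiter is ';', the commas inside the second field need no quoting, so the round trip reproduces the original lists.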

Compare two strings according to defined values in a query

I have to compare two string fields containing letters but not alphabetically.
I want to compare them according to this order :
"J" "L" "M" "N" "P" "Q" "R" "S" "T" "H" "V" "W" "Y" "Z"
So if I compare H with T, H will be greater than T (unlike alphabetically)
And if I test if a value is greater than 'H' (> 'H') I will get all the entries containing the values ("V" "W" "Y" "Z") (again, unlike alphabetical order)
How can I achieve this in one SQL query?
Thanks
SELECT *
FROM yourtable
WHERE
  FIELD(col, 'J', 'L', 'M', 'N', 'P', 'Q', 'R', 'S', 'T', 'H', 'V', 'W', 'Y', 'Z') >
  FIELD('H', 'J', 'L', 'M', 'N', 'P', 'Q', 'R', 'S', 'T', 'H', 'V', 'W', 'Y', 'Z')
  -- the first argument 'H' on the line above is your value
Or also:
SELECT *
FROM yourtable
WHERE
LOCATE(col, 'JLMNPQRSTHVWYZ')>
LOCATE('H', 'JLMNPQRSTHVWYZ')
Please see fiddle here.
You can do
SELECT ... FROM ... ORDER BY yourletterfield='J' DESC, yourletterfield='L' DESC, yourletterfield='M' DESC, ...
The equality operator will evaluate to "1" when it's true, "0" when false, so this should give you the desired order.
There's actually a FIELD() function that will make this a bit less verbose. See this article for details.
SELECT ... FROM ... ORDER BY FIELD(yourletterfield, 'J', 'L', 'M', ...)
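The position-based comparison from the answers can be tried out locally with the sketch below. It uses Python's built-in sqlite3 module as a stand-in database, which is an assumption on my part: FIELD() and LOCATE() are MySQL functions, and SQLite's closest equivalent is instr(string, substring), whose arguments are in the opposite order from LOCATE. The table name, column name and sample rows are made up for the demonstration.

```python
import sqlite3

# The custom order from the question, as one string of single letters.
ORDER = 'JLMNPQRSTHVWYZ'

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE yourtable (col TEXT)')
conn.executemany('INSERT INTO yourtable (col) VALUES (?)',
                 [(letter,) for letter in 'ZHJTVLWY'])

# instr(ORDER, col) gives the 1-based rank of col in the custom order.
rows = conn.execute(
    'SELECT col FROM yourtable '
    'WHERE instr(?, col) > instr(?, ?) '
    'ORDER BY instr(?, col)',
    (ORDER, ORDER, 'H', ORDER),
).fetchall()
greater_than_h = [row[0] for row in rows]
conn.close()

print(greater_than_h)
```

This only works as written because every value is a single character; for multi-character values a lookup table or a FIELD()-style list would be the safer design.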