What is the tagset for NLTK perceptron tagger?

What is the tagset for the NLTK perceptron tagger? And what is the corpus used for the pre-trained model?
I have tried to find official information on the NLTK website, but it isn't documented there.

From https://github.com/nltk/nltk/pull/1143, we see that it's a port from https://spacy.io/blog/part-of-speech-pos-tagger-in-python
The tagset in the trained tagdict includes the following tags:
>>> from nltk.tag import PerceptronTagger
>>> tagger = PerceptronTagger()
>>> set(tagger.tagdict.values())
set(['PRP$', 'VBG', 'VBD', '``', 'VBN', "''", 'VBP', 'WDT', 'JJ', 'WP', 'VBZ', 'DT', '#', '$', 'NN', ')', '(', ',', '.', 'TO', 'PRP', 'RB', ':', 'NNS', 'NNP', 'VB', 'WRB', 'CC', 'CD', 'EX', 'IN', 'WP$', 'MD', 'JJS', 'JJR'])
The full tagset is:
>>> sorted(tagger.classes)
['#', '$', "''", '(', ')', ',', '.', ':', 'CC', 'CD', 'DT', 'EX', 'FW', 'IN', 'JJ', 'JJR', 'JJS', 'LS', 'MD', 'NN', 'NNP', 'NNPS', 'NNS', 'PDT', 'POS', 'PRP', 'PRP$', 'RB', 'RBR', 'RBS', 'RP', 'SYM', 'TO', 'UH', 'VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ', 'WDT', 'WP', 'WP$', 'WRB', '``']
It's the Penn Treebank Tagset from: https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
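For a quick sanity check, the pre-trained tagger can be applied to any tokenized sentence; the exact tags you get depend on the model shipped with your NLTK version, but they all come from the Penn Treebank set above:
>>> from nltk.tag import PerceptronTagger
>>> tagger = PerceptronTagger()
>>> tagger.tag('The quick brown fox jumps over the lazy dog'.split())
The result is a list of (token, tag) pairs. In recent NLTK versions, nltk.pos_tag uses this same pre-trained perceptron tagger by default.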


BLEU - Error N-gram overlaps of lower order

I ran the code below
a = ['dog', 'in', 'plants', 'crouches', 'to', 'look', 'at', 'camera']
b = ['a', 'brown', 'dog', 'in', 'the', 'grass', ' ', ' ']
from nltk.translate.bleu_score import corpus_bleu
bleu1 = corpus_bleu(a, b, weights=(1.0, 0, 0, 0))
print(bleu1)
This is the error
The hypothesis contains 0 counts of 3-gram overlaps. Therefore the
BLEU score evaluates to 0, independently of how many N-gram overlaps
of lower order it contains. Consider using lower n-gram order or use
SmoothingFunction() warnings.warn(_msg)
Can someone tell me what the problem is here? I cannot find the solution on Google. Thank you.
Best,
DD
I found the solution. Basically, I need a list inside a list for 'a'. So the code below works without the error:
a = [['dog', 'in', 'plants', 'crouches', 'to', 'look', 'at', 'camera']]
b = ['a', 'brown', 'dog', 'in', 'the', 'grass', ' ', ' ']
from nltk.translate.bleu_score import corpus_bleu
bleu1 = corpus_bleu(a, b, weights=(1.0, 0, 0, 0))
print(bleu1)
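As a side note, corpus_bleu expects one more level of nesting on both sides: the first argument is a list with one entry per hypothesis, each entry being a list of reference token lists, and the second argument is a list of hypothesis token lists. For a single sentence pair, sentence_bleu (optionally with a SmoothingFunction, as the warning suggests) is the more direct fit. A minimal sketch of both, reusing the tokens from the question with the trailing blank tokens dropped:
from nltk.translate.bleu_score import corpus_bleu, sentence_bleu, SmoothingFunction

reference = ['dog', 'in', 'plants', 'crouches', 'to', 'look', 'at', 'camera']
hypothesis = ['a', 'brown', 'dog', 'in', 'the', 'grass']
smoother = SmoothingFunction().method1  # avoids the zero-count warning for higher n-gram orders

# corpus-level call: a list of reference lists per hypothesis, and a list of hypotheses
print(corpus_bleu([[reference]], [hypothesis], weights=(1.0, 0, 0, 0),
                  smoothing_function=smoother))

# sentence-level call: the same computation for a single sentence pair
print(sentence_bleu([reference], hypothesis, weights=(1.0, 0, 0, 0),
                    smoothing_function=smoother))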

Multi row insert using Knex.js

I am trying to build a multi-row insert query using Knex.js.
My POST request contains a variable which is formatted as follows: [{addon_name:'sugar'},{addon_name:'milk'}]
My DB table has only one column, namely addon_name.
My Knex query in my Node application goes as follows:
knex('<table_name>').insert(req.body.<param_name>)
Expected output:
insert into `<tablename>` (`addon_name`) values ('sugar'), ('milk');
But the code doesn't work. Any comments?
Error Details
{ [Error: insert into `table_name` (`0`, `1`, `10`, `11`, `12`, `13`, `14`, `15`, `16`, `17`, `18`, `19`, `2`, `20`, `21`, `22`, `23`, `24`, `25`, `26`, `27`, `28`, `29`, `3`, `30`, `31`, `32`, `33`, `34`, `35`, `36`, `37`, `38`, `39`, `4`, `40`, `41`, `5`, `6`, `7`, `8`, `9`) values ('[', '{', 'm', 'e', ':', '\'', 's', 'u', 'g', 'a', 'r', '\'', 'a', '}', ',', '{', 'a', 'd', 'd', 'o', 'n', '_', 'n', 'd', 'a', 'm', 'e', ':', '\'', 'm', 'i', 'l', 'k', '\'', 'd', '}', ']', 'o', 'n', '_', 'n', 'a') - ER_BAD_FIELD_ERROR: Unknown column '0' in 'field list']
code: 'ER_BAD_FIELD_ERROR',
errno: 1054,
sqlState: '42S22',
index: 0 }
Though this is an old question, I am replying here just for others who stumble upon this.
Knex now supports multi-row inserts like this:
knex('coords').insert([{x: 20}, {y: 30}, {x: 10, y: 20}])
outputs:
insert into `coords` (`x`, `y`) values (20, DEFAULT), (DEFAULT, 30), (10, 20)
There's also the batchInsert utility, which inserts a batch of rows wrapped inside a transaction.
req.body.<param_name> is always a string. Most probably this will work for you:
knex(table_name).insert(JSON.parse(req.body.param_name));
What you are seeing in your error is Knex treating the string as an array of characters and trying to insert each one into the table.
In the error, the following:
values ('[', '{', 'm', 'e', ':', '\'', 's', ...
is actually your string being broken down character by character: [{me:'s....
Thanks. I changed the structure of my input in the POST method to a comma-separated string. That way it gets easier to parse the input and model it the way I need.
POST method input: "milk,sugar"
Code:
// Knex accepts multi-row insert in the following format [{},{}] => we need to
// model our input that way
var parsedValues = [];
try {
    var arr = req.body.addons.split(',');
} catch (err) {
    return res.send({ "Message": "405" }); // Data not sent in proper format
}
for (var i in arr) {
    parsedValues.push({ addon_name: arr[i] });
}
console.log(parsedValues);
knex('<table_name>').insert(parsedValues).then(function (rows) {
    console.log(rows);
    return res.send({ "Message": "777" }); // Operation Success
}).catch(function (err) {
    console.log(err);
    return res.send({ "Message": "403" }); // PK / FK Violation
});
You can use batchInsert inside a transaction:
DB.transaction(async (t: Knex.Transaction) => {
    return await t
        .batchInsert("addon_name", addon_nameRecords)
        .returning("id");
});

Is there an equivalent to "tr" within SQL?

To remind you, tr
reads a byte stream from its standard input and writes the result to
the standard output. As arguments, it takes two sets of characters
(generally of the same length), and replaces occurrences of the
characters in the first set with the corresponding elements from the
second set. -- Wikipedia
I need to do this within the context of an UPDATE statement and, ideally, without doing 26 nested REPLACE function calls.
EDIT:
For you nosy parkers who just have to know what I'm doing: I want to make the in-database equivalent of upsidedowntext.com. Right now, I do this:
SELECT
REVERSE(
REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(UPPER(value),
'A', 'ɐ'),
'B', 'q'),
'C', 'ɔ'),
'D', 'p'),
'E', 'ǝ'),
'F', 'ɟ'),
'G', 'b'),
'H', 'ɥ'),
'I', 'ı'),
'J', 'ظ'),
'K', 'ʞ'),
'L', 'ן'),
'M', 'ɯ'),
'N', 'u'),
'O', 'o'),
'P', 'd'),
'Q', 'b'),
'R', 'ɹ'),
'S', 's'),
'T', 'ʇ'),
'U', 'n'),
'V', 'ʌ'),
'W', 'ʍ'),
'X', 'x'),
'Y', 'ʎ'),
'Z', 'z')) from table;
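As an aside, the single-pass, character-for-character mapping that tr performs (and that the nested REPLACE chain above emulates) can be sketched outside the database with Python's str.maketrans/str.translate; this is purely an illustration of the operation, not a SQL answer:
# Same A-Z mapping as the REPLACE chain above, applied in one pass, then reversed
upside_down = str.maketrans('ABCDEFGHIJKLMNOPQRSTUVWXYZ',
                            'ɐqɔpǝɟbɥıظʞןɯuodbɹsʇnʌʍxʎz')
print('Hello world'.upper().translate(upside_down)[::-1])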

MySQL: strip out anything after a number appears (including the numbers)

I have a query in MySQL that strips out anything after any of the following characters: '.', '/', and '-', using the code below:
CASE
WHEN LOCATE('.', wca.scexh.LocaLcode)>0 THEN SUBSTRING_INDEX(wca.scexh.LocaLcode, '.', 1)
WHEN LOCATE('/', wca.scexh.LocaLcode)>0 THEN SUBSTRING_INDEX(wca.scexh.LocaLcode, '/', 1)
WHEN LOCATE('-', wca.scexh.LocaLcode)>0 THEN SUBSTRING_INDEX(wca.scexh.LocaLcode, '-', 1)
ELSE wca.scexh.LocaLcode
END as LocaLcodeNew,
However, I would also like to add extra CASE branches so that everything is stripped as soon as a number appears. I tried the following CASE statement, but it does not seem to work:
CASE
WHEN LOCATE('.', wca.scexh.LocaLcode)>0 THEN SUBSTRING_INDEX(wca.scexh.LocaLcode, '.', 1)
WHEN LOCATE('/', wca.scexh.LocaLcode)>0 THEN SUBSTRING_INDEX(wca.scexh.LocaLcode, '/', 1)
WHEN LOCATE('-', wca.scexh.LocaLcode)>0 THEN SUBSTRING_INDEX(wca.scexh.LocaLcode, '-', 1)
WHEN LOCATE('0', wca.scexh.LocaLcode)>0 THEN SUBSTRING_INDEX(wca.scexh.LocaLcode, '0', 1)
WHEN LOCATE('1', wca.scexh.LocaLcode)>0 THEN SUBSTRING_INDEX(wca.scexh.LocaLcode, '1', 1)
WHEN LOCATE('2', wca.scexh.LocaLcode)>0 THEN SUBSTRING_INDEX(wca.scexh.LocaLcode, '2', 1)
WHEN LOCATE('3', wca.scexh.LocaLcode)>0 THEN SUBSTRING_INDEX(wca.scexh.LocaLcode, '3', 1)
WHEN LOCATE('4', wca.scexh.LocaLcode)>0 THEN SUBSTRING_INDEX(wca.scexh.LocaLcode, '4', 1)
WHEN LOCATE('5', wca.scexh.LocaLcode)>0 THEN SUBSTRING_INDEX(wca.scexh.LocaLcode, '5', 1)
WHEN LOCATE('6', wca.scexh.LocaLcode)>0 THEN SUBSTRING_INDEX(wca.scexh.LocaLcode, '6', 1)
WHEN LOCATE('7', wca.scexh.LocaLcode)>0 THEN SUBSTRING_INDEX(wca.scexh.LocaLcode, '7', 1)
WHEN LOCATE('8', wca.scexh.LocaLcode)>0 THEN SUBSTRING_INDEX(wca.scexh.LocaLcode, '8', 1)
WHEN LOCATE('9', wca.scexh.LocaLcode)>0 THEN SUBSTRING_INDEX(wca.scexh.LocaLcode, '9', 1)
ELSE wca.scexh.LocaLcode
END as LocaLcodeNew,
I would greatly appreciate any help on this, thanks in advance!
The following examples currently work with the CASE statement I have, as I no longer see '.', '/', or '-' in any of the codes:
DOW.11 appears as DOW
DOW/11 appears as DOW
DOW-11 appears as DOW
But I would also need to cater for the following examples:
DOW0123 to appear as DOW
DOW2345 to appear as DOW
DOW3456 to appear as DOW
etc
Bear in mind the codes are random letters/numbers, and not always the same number of characters.
CASE performs the WHEN tests in order and stops at the first one that matches, so only one delimiter ever gets stripped: for a code like 'DO2W.1', the '.' branch wins and you are left with 'DO2W', digit and all. Instead of testing each delimiter sequentially, you need to nest your functions so every delimiter is handled. Replace the CASE expression with this:
SUBSTRING_INDEX(
SUBSTRING_INDEX(
SUBSTRING_INDEX(
SUBSTRING_INDEX(
SUBSTRING_INDEX(
SUBSTRING_INDEX(
SUBSTRING_INDEX(
SUBSTRING_INDEX(
SUBSTRING_INDEX(
SUBSTRING_INDEX(
SUBSTRING_INDEX(
SUBSTRING_INDEX(
SUBSTRING_INDEX(wca.scexh.LocaLcode, '.', 1),
'/', 1),
'-', 1),
'0', 1),
'1', 1),
'2', 1),
'3', 1),
'4', 1),
'5', 1),
'6', 1),
'7', 1),
'8', 1),
'9', 1) AS LocaLcodeNew
You don't need the LOCATE() test; if the delimiter isn't in the string, SUBSTRING_INDEX() returns the string unchanged.

Print list to csv file in Python3

I have a list like this.
all_chords = [['C', 'C', 'E', 'G'],
['CM7', 'C', 'E', 'G', 'B'],
['C7', 'C', 'E', 'G', 'Bb'],
['Cm7', 'C', 'Eb', 'G', 'Bb'],
['Cm7b5', 'C', 'Eb', 'Gb', 'Bb'],
['Cdim7', 'C', 'Eb', 'Gb', 'Bbb(A)'],
['Caug7', 'C', 'E', 'G#', 'Bb'],
['C6', 'C', 'E', 'G', 'A'],
['Cm6', 'C', 'Eb', 'G', 'A'],
]
I want to write it out to a CSV file, something like this:
C_chords.csv
C;C,E,G
CM7;C,E,G,B
C7;C,E,G,Bb
Cm7;C,Eb,G,Bb
Cm7b5;C,Eb,Gb,Bb
Cdim7;C,Eb,Gb,Bbb(A)
Caug7;C,E,G#,Bb
C6;C,E,G,A
Cm6;C,Eb,G,A
It has two fields which are separated by a semicolon (not by a comma).
I used the csv module, like this:
myfile = open('C_chords.csv','w')
wr = csv.writer(myfile, quotechar=None)
wr.writerows(all_chords)
myfile.close()
The result is:
C,C,E,G
CM7,C,E,G,B
C7,C,E,G,Bb
Cm7,C,Eb,G,Bb
Cm7b5,C,Eb,Gb,Bb
Cdim7,C,Eb,Gb,Bbb(A)
Caug7,C,E,G#,Bb
C6,C,E,G,A
Cm6,C,Eb,G,A
Should I modify the list, somewhat like this?
[['C', ';', 'C', 'E', 'G'], .......]
Or do you guys have any other brilliant ideas?
Thanks in advance.
You're writing each list element as its own column, not two columns; if you want everything after the chord name to be a single column, you need to join those elements manually first.
And you need to change the delimiter, not the quote character, if you want the CSV semicolon-separated:
import csv
all_chords = [['C', 'C', 'E', 'G'],
['CM7', 'C', 'E', 'G', 'B'],
['C7', 'C', 'E', 'G', 'Bb'],
['Cm7', 'C', 'Eb', 'G', 'Bb'],
['Cm7b5', 'C', 'Eb', 'Gb', 'Bb'],
['Cdim7', 'C', 'Eb', 'Gb', 'Bbb(A)'],
['Caug7', 'C', 'E', 'G#', 'Bb'],
['C6', 'C', 'E', 'G', 'A'],
['Cm6', 'C', 'Eb', 'G', 'A'],
]
myfile = open('C_chords.csv', 'w', newline='')  # newline='' is recommended when writing with csv.writer
wr = csv.writer(myfile, delimiter=';')
wr.writerows([c[0], ','.join(c[1:])] for c in all_chords)
myfile.close()
I think it's easier to do it without the csv module:
with open('C_chords.csv', 'w') as out_file:
    for row in all_chords:
        print('{};{}'.format(row[0], ','.join(row[1:])), file=out_file)
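To sanity-check the result, the file can be read back with the same delimiter; a small sketch, assuming the C_chords.csv written above:
import csv

with open('C_chords.csv', newline='') as f:
    for name, notes in csv.reader(f, delimiter=';'):
        print(name, '->', notes.split(','))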