AllenNLP - dataset_reader config for transformers

I would like to use BERT for tokenization and also for indexing in a seq2seq model, and this is what my config file looks like so far:
{
    "dataset_reader": {
        "type": "seq2seq",
        "end_symbol": "[SEP]",
        "quoting": 3,
        "source_token_indexers": {
            "tokens": {
                "type": "pretrained_transformer",
                "model_name": "bert-base-german-cased"
            }
        },
        "source_tokenizer": {
            "type": "pretrained_transformer",
            "model_name": "bert-base-german-cased"
        },
        "start_symbol": "[CLS]",
        "target_token_indexers": {
            "tokens": {
                "namespace": "tokens"
            }
        },
        "target_tokenizer": {
            "type": "pretrained_transformer",
            "add_special_tokens": true,
            "model_name": "bert-base-german-cased"
        }
    },
and later, when I load the model and use predictor.predict_json to predict sentences, the output looks like this:
'predicted_tokens': ['[CLS]', 'Die', 'meisten', 'Universitäts',
'##abs', '##ch', '##lüsse', 'sind', 'nicht', 'p', '##raxis',
'##orient', '##iert', 'und', 'bereit', '##en', 'die', 'Studenten',
'nicht', 'auf', 'die', 'wirklich', '##e', 'Welt', 'vor', '.', '[SEP]',
'[SEP]', '[SEP]', '[SEP]', '[SEP]', '[SEP]', '[SEP]', '[SEP]',
'[SEP]', '[SEP]', '[SEP]', '[SEP]', '[SEP]', '[SEP]']
I have 2 questions:
Is this normal output (considering all the '[SEP]' tokens at the end), or am I doing something wrong in the config file?
Is there any function that would convert these tokens to a human-readable sentence?
Thanks

Please set add_special_tokens = False in the target tokenizer.
To get a human-readable sentence, use tokenizer.convert_tokens_to_string (which takes the list of subword tokens as input), where tokenizer refers to the tokenizer used by your DatasetReader.
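For illustration, a minimal sketch of that conversion, assuming the underlying Hugging Face tokenizer for the same model (the token list below is shortened from the output above, with the [CLS]/[SEP] tokens stripped first):
from transformers import AutoTokenizer

# Load the tokenizer for the same pretrained model named in the config.
tokenizer = AutoTokenizer.from_pretrained("bert-base-german-cased")

# A shortened version of the predicted subword tokens above.
tokens = ['Die', 'meisten', 'Universitäts', '##abs', '##ch', '##lüsse',
          'sind', 'nicht', 'p', '##raxis', '##orient', '##iert']

# convert_tokens_to_string merges the WordPiece pieces back into text.
print(tokenizer.convert_tokens_to_string(tokens))
# Die meisten Universitätsabschlüsse sind nicht praxisorientiert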
Please let us know if you have further questions!

Python Regex: How to match the string and then modify that string by adding something at the end

UPDATED CODE: It is working, but the problem now is that the code attaches the same random_value to every Path.
Following is my code with a sample chunk of text. I want to read each Path and its value, then add a unique random alphanumeric combination at the end of every Path value without changing the existing value. For example, I want the Path to end up like
"Path": "already existing value/1A"
I am unable to work out the exact regex pattern for the replacement.
Any help would be appreciated.
It could be done with a JSON parser, but the requirement of the task is to do it via regex.
from io import StringIO
import re
import string
import random

reader = StringIO("""{
    "Bounds": [
        {
            "HasClip": true,
            "Lang": "no",
            "Page": 0,
            "Path": "//Document/Sect[2]/Aside/P",
            "Text": "Potsdam, den 9. Juni 2021 ",
            "TextSize": 12.0
        }
    ],
},
{
    "Bounds": [
        {
            "HasClip": true,
            "Lang": "de",
            "Page": 0,
            "Path": "//Document/Sect[3]/P[4]",
            "Text": "this is some text ",
            "TextSize": 9.0,
        }
    ],
}""")

def id_generator(size=3, chars=string.ascii_uppercase + string.digits):
    return ''.join(random.choice(chars) for _ in range(size))

text = reader.read()
random_value = id_generator()
pattern = r'"Path": "(.*?)"'
replacement = '"Path": "\\1/' + random_value + '"'
text = re.sub(pattern, replacement, text)
# This works, but it attaches the same random_value to every Path
print(text)
Use group 1 in the replacement:
replacement = '"Path": "\\1/1A"'
The replacement regex \1 puts back what was captured in group 1 of the match via (.*?).
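That replacement is still a fixed string, though, so every Path gets the same suffix. To attach a different random value to each Path, one option (a sketch reusing the question's id_generator) is to pass a function as the replacement, since re.sub calls it once per match:
# A callable replacement runs once per match, so each Path
# gets its own freshly generated suffix.
text = re.sub(r'"Path": "(.*?)"',
              lambda m: '"Path": "{}/{}"'.format(m.group(1), id_generator()),
              text)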
Since you already have a JSON structure, maybe it would help to use the json module to parse it.
import json
myDict = json.loads("your json string / variable here")
# now myDict is a dictionary that you can use to loop/read/edit/modify and you can then export myDict as json.
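A sketch of that route, assuming the input has first been cleaned up into valid JSON (the sample above has trailing commas and two top-level objects) and reusing the question's id_generator:
import json

# Hypothetical cleaned-up input: one valid JSON object from the sample.
valid_json_string = '{"Bounds": [{"Path": "//Document/Sect[2]/Aside/P"}]}'

data = json.loads(valid_json_string)
for bound in data["Bounds"]:
    bound["Path"] += "/" + id_generator()  # fresh random suffix per Path
print(json.dumps(data, indent=2))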

JMeter: How can I randomize post body data several times?

I have POST body data like this:
"My data": [{
"Data": {
"var1": 6.66,
"var2": 8.88
},
"var3": 9
}],
Here, if I post these details in the POST data body, "My data" is sent just once. I want to repeat it a random number of times, from 1 to 10; if the random value is 2, then "My data" should appear twice.
Help appreciated!
If you need to generate more blocks like this one:
{
    "Data": {
        "var1": 6.66,
        "var2": 8.88
    },
    "var3": 9
}
It can be done using JSR223 PreProcessor and the following code:
def myData = []
1.upto(2, {
    def entry = [:]
    entry.put('Data', [var1: 6.66, var2: 8.88])
    entry.put('var3', '9')
    myData.add(entry)
})
vars.put('myData', new groovy.json.JsonBuilder(myData).toPrettyString())
log.info(vars.get('myData'))
The above example will generate 2 blocks.
If you want 10, change the 2 in the 1.upto(2, { line to 10. Since you want a random count from 1 to 10, you can also make the bound itself random, e.g. 1.upto(new java.util.Random().nextInt(10) + 1, {.
The generated data can be accessed as ${myData} where needed.
More information:
Apache Groovy - Parsing and producing JSON
Apache Groovy - Why and How You Should Use It

How to run JSONiq from JSON with try.zorba.io

I need to write a JSONiq expression that lists only the names of the products that cost at least 3. This is my JSON file, which I had typed in the XQuery section:
{ "supermarket_visit":{
"date":"08032019",
"bought":[
"item",{
"type":"confectionary",
"item_name":"Kit_Kat",
"number": 3,
"individual_price": 3.5
},
"item",{
"type":"drinks",
"item_name":"Coca_Cola",
"number": 2,
"individual_price": 3
},
"item",{
"type":"fruits",
"item_name":"apples",
"number": "some"
}
], 
"next_visit":[
"item",{
"type":"stationary",
"item_name":"A4_paper",
"number": 1
},
"item",{
"type":"stationary",
"item_name":"pen",
"number": 2
}
]
}
}
and this is my JSONiq query, which I don't really know where to type in try.zorba.io:
let $x := find("supermarket_visit")
for $x in $supermarket.bought let $i := $x.item
where $i.individual_price <=3
return $i.item_name
I am getting many errors in try.zorba.io, and I'm really new to JSONiq and JSON. Is something wrong with my JSON or my JSONiq part?
The following selection works for me at the site you linked to:
jsoniq version "1.0";
{ "supermarket_visit":{
"date":"08032019",
"bought":[
"item",{
"type":"confectionary",
"item_name":"Kit_Kat",
"number": 3,
"individual_price": 3.5
},
"item",{
"type":"drinks",
"item_name":"Coca_Cola",
"number": 2,
"individual_price": 3
},
"item",{
"type":"fruits",
"item_name":"apples",
"number": "some"
}
],
"next_visit":[
"item",{
"type":"stationary",
"item_name":"A4_paper",
"number": 1
},
"item",{
"type":"stationary",
"item_name":"pen",
"number": 2
}
]
}
}.supermarket_visit.bought()[$$ instance of object and $$.individual_price le 3].item_name
The original query can be slightly modified to (in order to keep a FLWOR expression):
jsoniq version "1.0";
let $document := { (: put the document here :) }
for $x in $document.supermarket_visit.bought[]
where $x instance of object and $x.individual_price le 3
return $x.item_name
Note that try.zorba.io is an older version of Zorba (2.9) that does not implement the latest, stable JSONiq version. This is why () must be used instead of [] on this specific page. If you download the latest version of Zorba, the above query should work.
Also, the original document provided in the question is not well-formed JSON, because it contains a special em space character (Unicode 2003) on the line above "next_visit". This character must be removed for this JSON to be parsed successfully.

Python: JSON to Dictionary

Here are two examples of a JSON request. Both should have correct JSON syntax, yet only the second version seems to be translatable to a dictionary.
#doesn't work
string_js3 = """{"employees": [
{
"FNAME":"FTestA",
"LNAME":"LTestA",
"SSN":6668844441
},
{
"FNAME":"FTestB",
"LNAME":"LTestB",
"SSN":6668844442
}
]}
"""
#works
string_js4 = """[
{
"FNAME":"FTestA",
"LNAME":"LTestA",
"SSN":6668844441
},
{
"FNAME":"FTestB",
"LNAME":"LTestB",
"SSN":6668844442
}]
"""
This gives an error, while the same code with string_js4 works:
L1 = json.loads(string_js3)
print(L1[0]['FNAME'])
So I have 2 questions:
1) Why doesn't the first version work?
2) Is there a simple way to make the first version also work?
Both of these strings are valid JSON. Where you are getting stuck is in how you are accessing the resulting data structures.
L1 (from string_js3) is a (nested) dict;
L2 (from string_js4) is a list of dicts.
Walkthrough:
import json
string_js3 = """{
"employees": [{
"FNAME": "FTestA",
"LNAME": "LTestA",
"SSN": 6668844441
},
{
"FNAME": "FTestB",
"LNAME": "LTestB",
"SSN": 6668844442
}
]
}"""
string_js4 = """[{
"FNAME": "FTestA",
"LNAME": "LTestA",
"SSN": 6668844441
},
{
"FNAME": "FTestB",
"LNAME": "LTestB",
"SSN": 6668844442
}
]"""
L1 = json.loads(string_js3)
L2 = json.loads(string_js4)
The resulting objects:
L1
{'employees': [{'FNAME': 'FTestA', 'LNAME': 'LTestA', 'SSN': 6668844441},
{'FNAME': 'FTestB', 'LNAME': 'LTestB', 'SSN': 6668844442}]}
L2
[{'FNAME': 'FTestA', 'LNAME': 'LTestA', 'SSN': 6668844441},
{'FNAME': 'FTestB', 'LNAME': 'LTestB', 'SSN': 6668844442}]
type(L1), type(L2)
(dict, list)
1) Why doesn't the first version work?
Because calling L1[0] is trying to return the value from the key 0, and that key doesn't exist. From the docs, "It is an error to extract a value using a non-existent key." L1 is a dictionary with just one key:
L1.keys()
dict_keys(['employees'])
2) Is there a simple way to make the first version also work?
There are several ways, but it ultimately depends on what your larger problem looks like. I'm going to assume you want to modify the Python code rather than the JSON files/strings themselves. You could do:
L3 = L1['employees'].copy()
You now have a list of dictionaries that resembles L2:
L3
[{'FNAME': 'FTestA', 'LNAME': 'LTestA', 'SSN': 6668844441},
{'FNAME': 'FTestB', 'LNAME': 'LTestB', 'SSN': 6668844442}]

Load a log file with multiple JSON datasets into MongoDB

Warning: I'm new to MongoDB and JSON.
I have a log file which contains JSON datasets. A single file has multiple JSON formats, as it is capturing clickstream data. Here is an example of one log file.
[
{
"username":"",
"event_source":"server",
"name":"course.activated",
"accept_language":"",
"time":"2016-10-12T01:02:07.443767+00:00",
"agent":"python-requests/2.9.1",
"page":null,
"host":"courses.org",
"session":"",
"referer":"",
"context":{
"user_id":null,
"org_id":"X",
"course_id":"3T2016",
"path":"/api/enrollment"
},
"ip":"160.0.0.1",
"event":{
"course_id":"3T2016",
"user_id":11,
"mode":"audit"
},
"event_type":"activated"
},
{
"username":"VTG",
"event_type":"/api/courses/3T2016/",
"ip":"161.0.0.1",
"agent":"Mozilla/5.0",
"host":"courses.org",
"referer":"http://courses.org/16773734",
"accept_language":"en-AU,en;q=0.8,en-US;q=0.6,en;q=0.4",
"event":"{\"POST\": {}, \"GET\": {}}",
"event_source":"server",
"context":{
"course_user_tags":{
},
"user_id":122,
"org_id":"X",
"course_id":"3T2016",
"path":"/api/courses/3T2016/"
},
"time":"2016-10-12T00:51:57.756468+00:00",
"page":null
}
]
Now I want to store this data in MongoDB. So here are my novice questions:
Do I need to parse the file and then split it into 2 datasets before storing it in MongoDB? If yes, is there a simple way to do this, as my file has multiple dataset formats?
Is there some magic in MongoDB that can split the various datasets when we upload it?
First of all, you have invalid JSON; make sure your JSON is formatted as cited below. Once your JSON is valid, you can insert it into the database with mongoimport (mongorestore is for BSON dumps, not JSON files):
mongoimport --host hostname --port 27017 --db <database_name> --collection <collection_name> --file pathtojsonfile --jsonArray
For more information, refer to https://docs.mongodb.com/manual/reference/program/mongoimport/
Formatted JSON:
[
    {
        "username": "",
        "event_source": "server",
        "name": "course.activated",
        "accept_language": "",
        "time": "2016-10-12T01:02:07.443767+00:00",
        "agent": "python-requests/2.9.1",
        "page": null,
        "host": "courses.org",
        "session": "",
        "referer": "",
        "context": {
            "user_id": null,
            "org_id": "X",
            "course_id": "3T2016",
            "path": "/api/enrollment"
        },
        "ip": "160.0.0.1",
        "event": {
            "course_id": "3T2016",
            "user_id": 11,
            "mode": "audit"
        },
        "event_type": "activated"
    },
    {
        "username": "VTG",
        "event_type": "/api/courses/3T2016/",
        "ip": "161.0.0.1",
        "agent": "Mozilla/5.0",
        "host": "courses.org",
        "referer": "http://courses.org/16773734",
        "accept_language": "en-AU,en;q=0.8,en-US;q=0.6,en;q=0.4",
        "event": "{\"POST\": {}, \"GET\": {}}",
        "event_source": "server",
        "context": {
            "course_user_tags": {},
            "user_id": 122,
            "org_id": "X",
            "course_id": "3T2016",
            "path": "/api/courses/3T2016/"
        },
        "time": "2016-10-12T00:51:57.756468+00:00",
        "page": null
    }
]
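For illustration, a minimal sketch of doing the same import from Python instead (assuming pymongo is installed; the file, database, and collection names here are hypothetical). Because MongoDB collections are schemaless, the heterogeneous events can go into one collection without being split first:
import json
from pymongo import MongoClient

# The log file is a single JSON array of event objects whose
# fields differ from event to event.
with open("clickstream.log") as f:
    events = json.load(f)

client = MongoClient("mongodb://localhost:27017")
collection = client["logs"]["clickstream"]

# Collections are schemaless, so heterogeneous documents can be
# inserted together; no need to split the datasets beforehand.
collection.insert_many(events)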