Loading a JSON file using torchtext

I'm working with the DailyDialog dataset, which I've converted into a
JSON file that looks something like this:
[{"response": "You know that is tempting but is really not good for our fitness.", "message": "Say, Jim, how about going for a few beers after dinner?"}, {"response": "Do you really think so? I don't. It will just make us fat and act silly. Remember last time?", "message": "What do you mean? It will help us to relax."}, {"response": "I suggest a walk over to the gym where we can play singsong and meet some of our friends.", "message": "I guess you are right. But what shall we do? I don't feel like sitting at home."}, {"response": "Sounds great to me! If they are willing, we could ask them to go dancing with us.That is excellent exercise and fun, too.", "message": "That's a good idea. I hear Mary and Sally often go there to play pingpong.Perhaps we can make a foursome with them."}, {"response": "All right.", "message": "Please lie down over there."}]
So, each item has two keys - response and message.
This is my first time using PyTorch, so I have been following a few resources available online. These are the relevant snippets of my code:
def tokenize_en(text):
    return [tok.text for tok in spacy_en.tokenizer(text)]

src = Field(tokenize = tokenize_en,
            init_token = '<sos>',
            eos_token = '<eos>',
            lower = True)
fields = {'response': ('r', src)}
train_data, validation_data, test_data = TabularDataset.splits(
    path = 'FilePath',
    train = 'trainset.json',
    validation = 'validationset.json',
    test = 'testset.json',
    format = 'json',
    fields = fields
)
Although no errors are raised, the train, test, and validation datasets strangely contain only 1 example each, despite the many items in my JSON file, as seen in this screenshot:
(Image showing the lengths of train_data, test_data and validation_data, all equal to 1.)
I'd be really grateful if someone could point out the error to me.
Edit: I found out that the whole file is being treated as a single text string because there are no line breaks in it. But if I indent (pretty-print) the JSON file, the TabularDataset loader throws a JSONDecodeError, suggesting it can no longer decode the file. How can I fix this?

I think the code is alright; the issue is with your JSON file. torchtext's json format expects JSON Lines, i.e. one complete JSON object per line, with no enclosing square brackets ("[]") and no commas between objects. Because your whole array sits on a single line, it is read as one single example; and pretty-printing the array doesn't help either, since each individual line is then not a valid JSON object on its own. Convert the file so that every {"response": ..., "message": ...} object occupies exactly one line.
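A minimal conversion sketch, assuming the file names from the question (it rewrites each file in place, so back up your data first):

import json

# Rewrite each JSON-array file as JSON Lines, which torchtext's
# format='json' expects: one object per line, no enclosing array.
for name in ["trainset.json", "testset.json", "validationset.json"]:
    with open(name) as f:
        items = json.load(f)  # the original list of dicts
    with open(name, "w") as f:
        for item in items:
            f.write(json.dumps(item) + "\n")  # one JSON object per line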

Related

Detect inquiry sentence in Wav2Vec 2.0 result

I am studying ASR (Automatic Speech Recognition) using Wav2Vec 2.0.
When I run Wav2Vec 2.0, the result comes out without any punctuation, such as periods (".") or question marks ("?"). Therefore the result comes out as one long unpunctuated sentence.
I know that punctuation was stripped out by a regex when the tokenizer was built.
Is there any way to convert the output into a proper sentence that contains the punctuation?
Original Text from wav file = "So what which one is better?"
Wav2Vec 2.0 Result = "SO WHAT WHICH ONE IS BETTER" (Question mark missing)
Expected Result = "SO WHAT WHICH ONE IS BETTER?"
Most ASR models are trained on open-source datasets, and almost all of those datasets have had every kind of punctuation removed. If you would like punctuation in the final output, try passing the ASR output through the following code.
from transformers import T5Tokenizer, T5ForConditionalGeneration

# A small T5 model fine-tuned to restore punctuation, casing and grammar
# in Wav2Vec 2.0-style ASR output.
model_name = "flexudy/t5-small-wav2vec2-grammar-fixer"
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

sent = """WHEN ARE YOU COMING TOMORROW I AM ASKING BECAUSE OF THE MONEY YOU OWE ME PLEASE GIVE IT TO ME I AM WAITING YOU HAVE BEEN AVOIDING ME SINCE TWO THOUSAND AND THREE"""

# The model expects its input wrapped in a "fix: { ... }" prompt.
input_text = "fix: { " + sent + " } </s>"
input_ids = tokenizer.encode(input_text, return_tensors="pt", max_length=256, truncation=True, add_special_tokens=True)

outputs = model.generate(
    input_ids=input_ids,
    max_length=256,
    num_beams=4,
    repetition_penalty=1.0,
    length_penalty=1.0,
    early_stopping=True
)

sentence = tokenizer.decode(outputs[0], skip_special_tokens=True, clean_up_tokenization_spaces=True)
print(f"{sentence}")
You should see the following result as output:
When are you coming tomorrow? I am asking because of the money you owe me, please give it to me. I am waiting. You have been avoiding me since 2003.
For a better understanding, check this model's card on Hugging Face:
https://huggingface.co/flexudy/t5-small-wav2vec2-grammar-fixer
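Applied to the example from the question (reusing the tokenizer and model loaded above), the same pattern looks like this; the exact output depends on the model, but it should restore the question mark:

# Run the question's Wav2Vec 2.0 output through the same fixer.
asr_output = "SO WHAT WHICH ONE IS BETTER"
ids = tokenizer.encode("fix: { " + asr_output + " } </s>",
                       return_tensors="pt", max_length=64, truncation=True)
out = model.generate(input_ids=ids, max_length=64, num_beams=4, early_stopping=True)
print(tokenizer.decode(out[0], skip_special_tokens=True,
                       clean_up_tokenization_spaces=True))
# Something like: "So what, which one is better?" (output not verified here)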

Deserialize JSON without knowing full structure

I'm redoing the backend of a very basic framework that connects to a completely customizable frontend. It was originally in PHP, but for the refactor I have been plodding away in F#, although it seems like PHP might be the better-suited language. People keep telling me you can do everything in F#, I like the syntax, and I need to learn it, but this seemingly simple project has me stumped when it comes to JSON. This is a further fleshed-out version of my question yesterday, but it got a lot more complex than I thought.
Here goes.
The frontend is basically a collection of HTML files, which are simply loaded in PHP; preg_replace() is used to replace things like [var: varName] or [var: array|key], or the troublesome one: [lang: hello]. That one needs to be replaced by a variable defined in a translation dictionary, which is stored as JSON and is also editable by a non-programmer.
I can't change the frontend or the JSON files, and both are designed to be edited by non-programmers so it is very likely that there will be errors, calls to language variables that don't exist etc.
So we might have 2 json files, english.json and french.json
english.json contains:
{
  "hello": "Hello",
  "bye": "Goodbye"
}
french.json:
{
  "hello": "Bonjour",
  "duck": "Canard"
  // Plus users can add whatever else they want here and expect to be able to use it in a template
}
There is a template that contains
<b>[lang: hello]</b>
<span>Favourite Animal: [lang:duck]</span>
In this case, if the language is set to "english" and english.json is being loaded, that should read:
<b>Hello</b>
<span>Favourite Animal: </span>
Or in French:
<b>Bonjour</b>
<span>Favourite Animal: Canard</span>
We can assume that the JSON key: value format is always string: string, but ideally I'd like to handle string: 'T as well, though that might be beyond the scope of this question.
So I need to convert a JSON file (whose name is chosen dynamically, which gave FSharp.Data an issue I couldn't solve last night: the type provider only accepts a static filename as its sample, and since these two files can differ from the sample and from each other, the type provider doesn't work) into a dictionary or some other collection.
Now inside the template parsing function I need to replace [lang: hello] with something like
let key = "duck"
(*Magic function to convert JSON to usable collection*)
let languageString = convertedJSONCollection.[key] (*And obviously check if containsKey first*)
Which means I need to call the key dynamically, and I couldn't figure out how to do that with the type that FSharp.Data provided.
I have played around with Thoth as well, with some promising results that ended up going nowhere. I avoided JSON.NET because I thought it was paid, but I just realised I was mistaken there, so that might be an avenue to explore.
For comparison, the PHP function looks something like this:
function loadLanguage($lang = 'english') {
    $json = file_get_contents("$lang.json");
    return json_decode($json, true);
}

$key = 'duck';
$langVars = loadLanguage();
$duck = $langVars[$key] ?? ""; // default to "" when the key is missing
Is there a clean way to do this in F#/.NET? JSON seems really painful to work with compared to PHP/JavaScript, and I'm starting to lose my mind. Am I going to have to write my own parser (which probably means going back to PHP)?
Cheers to all you F# geniuses who know the answer :p
open Thoth.Json.Net

let deserialiseDictionary (s: string) =
    s
    |> Decode.unsafeFromString (Decode.keyValuePairs Decode.string)
    |> Map.ofList

let printDictionary json =
    json
    |> deserialiseDictionary
    |> fun m -> printfn "%s" m.["hello"] // Hello
For the question about 'T, the question becomes: what can 'T actually be? For JSON it is very limited; a value can only be a string, a JSON object, a number, a bool, or a JSON array. What should happen if it is a bool or a number?
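For what it's worth, once you have a plain dictionary, the [lang: key] substitution itself is language-agnostic. Here is a minimal Python sketch of the lookup-with-empty-default behaviour the question describes (the file names and the tag regex are assumptions based on the examples above):

import json
import re

def load_language(lang="english"):
    # Mirrors the PHP loadLanguage(): read "<lang>.json" into a dict.
    with open(f"{lang}.json") as f:
        return json.load(f)

def render(template, lang_vars):
    # Replace every [lang: key] tag; missing keys render as "".
    return re.sub(r"\[lang:\s*(\w+)\]",
                  lambda m: lang_vars.get(m.group(1), ""),
                  template)

lang_vars = load_language("english")
print(render("<b>[lang: hello]</b> <span>[lang: duck]</span>", lang_vars))
# english.json has no "duck" key, so this prints: <b>Hello</b> <span></span>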

Accessing JSON data after 'loading' it

With a lot of help from people on this site, I managed to get some JSON data from an Amazon page. The data looks like this, for example:
https://jsoneditoronline.org/?id=9ea92643044f4ac88bcc3e76d98425fc
First I have a list of strings, which is joined into a single string.
script = response.xpath('//script/text()').extract()
#For example, I need the variationValues data
variationValues = re.findall(r'variationValues\" : ({.*?})', ' '.join(script))[0]
Then, in my code, I have this (not a great name, will be changed later)
variationValuesJson = json.loads(variationValues)
variationValuesJson is in fact a dictionary, so doing something like this
variationValues["size_name"][3]
Should return "5.5 M US"
My issue is that, when running the program, I get a string indices must be integers error. Does anyone know what's wrong?
Note: I have tried using 'size_name' instead of "size_name", same error
variationValues["size_name"][3] #this is the raw string which you have converted to variationValuesjson
I think this is not what you actually want.
Your code should be this.
variationValuesJson['size_name'][3] #use variationValuesjson ;)
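To see why the original line fails, here is a minimal reproduction (the data is made up, but the shapes match the question):

import json

# Hypothetical raw string, standing in for the regex capture from the page.
variationValues = '{"size_name": ["4 M US", "4.5 M US", "5 M US", "5.5 M US"]}'
variationValuesJson = json.loads(variationValues)

print(variationValuesJson["size_name"][3])  # 5.5 M US

# Indexing the raw *string* with a string key is what raises the error:
# variationValues["size_name"]  ->  TypeError: string indices must be integers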

R - Twitter Extraction - Error in .subset2(x, i, exact=exact)

I am making an R script to get all of the mentions (#username) of a specific set of users.
My first issue isn't a big deal. I work at home as well as at the office. At the office, the code works fine. At home, I get Error 32 - Could not authenticate you from OAuth, using the exact same code, key, secret, and token. I have tried resetting my secret key/token; same thing. It's not a blocker, since I can log in remotely, but it's frustrating.
The REAL issue here...
I construct a URL (ex: final_url = "https://api.twitter.com/1.1/search/tweets.json?q=#JimFKenney&until=2015-10-25&result_type=recent&count=100")
Then I search Twitter with my query for the desired #username to get all the comments where they were mentioned.
mentions = GET(final_url, sig)
This works fine, but then I want my data in a usable format so I do...
library(rjson)
#install.packages("jsonlite", repos="http://cran.rstudio.com/")
library(jsonlite)
#install.packages("bit64", repos="http://cran.rstudio.com/")
json = content(mentions)
I then get the following error -
$statuses
Error in .subset2(x, i, exact = exact) : subscript out of bounds
I don't have even the first idea of what can be causing this.
Any help is greatly appreciated.
EDIT 1: For clarity, I get the error when trying to see what is in json. The line json = content(mentions) executes fine. I then type json to see what is in the variable, and I get the above error that starts with $statuses.

No word morphology in NLTK FUF?

I'm trying to output the simple sentence "The man eats the meal." using the NLTK FUF linearize() function here.
All I do is pass a unified input/grammar to linearize(). But when I do that, I get the output "The man eat the meal". I took a look at the morphology.py module, and it seems to have all the appropriate functions to inflect the words in a sentence. The default tense for verbs in FUF is the present tense, so "eat" should be converted to "eats".
I did notice, however, that linearizer.py isn't using any of the functions in morphology.py, and I'm thinking that perhaps that is why nothing is being inflected. I tried adding [number = 'plural'] to the feature structure for the direct object in the input sentence, so that it would output "The man eats the meals", but that doesn't work either. I have a feeling that the linearizer.py module is incomplete, but first I want to rule out that I'm doing something wrong. You can try the test for yourself at the bottom of the linearizer.py module.
Please let me know what result you get, and please advise.
Thanks,
Mika