Detect inquiry sentence in Wav2Vec 2.0 result - deep-learning

I am studying ASR (Automatic Speech Recognition) using Wav2Vec 2.0.
When I run Wav2Vec 2.0, the result comes back without any punctuation such as a period (".") or question mark ("?"), so everything is returned as one long unpunctuated sentence.
I know the punctuation was stripped out with a regex while building the tokenizer.
Is there any way to convert the output back into a proper sentence with punctuation?
Original Text from wav file = "So what which one is better?"
Wav2Vec 2.0 Result = "SO WHAT WHICH ONE IS BETTER" (Question mark missing)
Expected Result = "SO WHAT WHICH ONE IS BETTER?"

Most ASR models are trained on open-source datasets, and nearly all of those datasets have had their punctuation removed. If you would like punctuation in the final output, try passing the ASR output through the following code.
from transformers import T5Tokenizer, T5ForConditionalGeneration

# T5 model fine-tuned to restore punctuation and casing in raw ASR transcripts
model_name = "flexudy/t5-small-wav2vec2-grammar-fixer"
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

sent = """WHEN ARE YOU COMING TOMORROW I AM ASKING BECAUSE OF THE MONEY YOU OWE ME PLEASE GIVE IT TO ME I AM WAITING YOU HAVE BEEN AVOIDING ME SINCE TWO THOUSAND AND THREE"""

# The model expects the transcript wrapped in "fix: { ... }"
input_text = "fix: { " + sent + " } </s>"
input_ids = tokenizer.encode(input_text, return_tensors="pt", max_length=256, truncation=True, add_special_tokens=True)

outputs = model.generate(
    input_ids=input_ids,
    max_length=256,
    num_beams=4,
    repetition_penalty=1.0,
    length_penalty=1.0,
    early_stopping=True
)

sentence = tokenizer.decode(outputs[0], skip_special_tokens=True, clean_up_tokenization_spaces=True)
print(f"{sentence}")
You should see the following result as output:
When are you coming tomorrow? I am asking because of the money you owe me, please give it to me. I am waiting. You have been avoiding me since 2003.
For a better understanding, check this model on Hugging Face:
https://huggingface.co/flexudy/t5-small-wav2vec2-grammar-fixer
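Putting the two steps together, here is a rough end-to-end sketch. The Wav2Vec 2.0 checkpoint name, the soundfile dependency, and the file name are assumptions for illustration; swap in the model you actually fine-tuned.
import torch
import soundfile as sf
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

# 1. Transcribe the wav file with Wav2Vec 2.0 (the output has no punctuation).
asr_name = "facebook/wav2vec2-base-960h"  # assumed checkpoint; use your own
processor = Wav2Vec2Processor.from_pretrained(asr_name)
asr_model = Wav2Vec2ForCTC.from_pretrained(asr_name)

speech, sample_rate = sf.read("question.wav")  # 16 kHz mono audio expected
inputs = processor(speech, sampling_rate=sample_rate, return_tensors="pt")
with torch.no_grad():
    logits = asr_model(inputs.input_values).logits
transcript = processor.batch_decode(torch.argmax(logits, dim=-1))[0]

# 2. Feed the raw transcript to the grammar fixer above to restore punctuation.
input_text = "fix: { " + transcript + " } </s>"
The rest is identical to the snippet above: encode input_text, call model.generate, and decode the result.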

Related

What is the right way to generate long sequence using PyTorch-Transformers?

I am trying to generate a long sequence of text from a sample text using PyTorch-Transformers. I am following this tutorial for that purpose. Because the original article only predicts one word from a given text, I modified that script to generate a long sequence instead of a single word. This is the modified part of the code:
# Encode a text input
text = """An examination can be defined as a detailed inspection or analysis
of an object or person. For example, an engineer will examine a structure,
like a bridge, to see if it is safe. A doctor may conduct"""
indexed_tokens = tokenizer.encode(text)

# Convert indexed tokens into a PyTorch tensor
tokens_tensor = torch.tensor([indexed_tokens])
seq_len = tokens_tensor.shape[1]
tokens_tensor = tokens_tensor.to('cuda')

with torch.no_grad():
    for i in range(50):
        outputs = model(tokens_tensor[:, -seq_len:])
        predictions = outputs[0]
        predicted_index = torch.argmax(predictions[0, -1, :])
        tokens_tensor = torch.cat((tokens_tensor, predicted_index.reshape(1, 1)), 1)

pred = tokens_tensor.detach().cpu().numpy().tolist()
predicted_text = tokenizer.decode(pred[0])
print(predicted_text)
Output
An examination can be defined as a detailed inspection or analysis
of an object or person. For example, an engineer will examine a
structure, like a bridge, to see if it is safe. A doctor may conduct
an examination of a patient's body to see if it is safe.
The doctor may also examine a patient's body to see if it is safe. A
doctor may conduct an examination of a patient's body to see if it is
safe.
As you can see, the generated text is not a unique continuation; it repeats the same sentence over and over again with minor changes.
How should we generate a long sequence using PyTorch-Transformers?
There is usually no such thing as generating a complete sentence or a complete text in a single step. There has been some research in that direction, but almost all state-of-the-art models generate text word by word. The word generated at time t-1 is then used as input (together with the other already generated or given words) when generating the next word at time t. So it is normal that it generates word by word. I do not understand what you mean by this.
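That said, if the problem is the output repeating itself, one thing worth trying is to sample the next token instead of always taking the argmax; greedy decoding tends to loop. A rough sketch, reusing the model, tokenizer, and tokens_tensor from your snippet (the top-k value of 40 is just an arbitrary choice, not something from the tutorial):
with torch.no_grad():
    for i in range(50):
        outputs = model(tokens_tensor)
        logits = outputs[0][0, -1, :]
        # Keep only the 40 most likely tokens and sample one of them
        top_logits, top_indices = torch.topk(logits, k=40)
        probs = torch.softmax(top_logits, dim=-1)
        next_token = top_indices[torch.multinomial(probs, 1)]
        tokens_tensor = torch.cat((tokens_tensor, next_token.reshape(1, 1)), 1)

print(tokenizer.decode(tokens_tensor[0].tolist()))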
Which model are you using?

Loading json file using torchtext

I'm working on the dailydialog dataset, which I've converted into a
JSON file which looks something like this:
[{"response": "You know that is tempting but is really not good for our fitness.", "message": "Say, Jim, how about going for a few beers after dinner?"}, {"response": "Do you really think so? I don't. It will just make us fat and act silly. Remember last time?", "message": "What do you mean? It will help us to relax."}, {"response": "I suggest a walk over to the gym where we can play singsong and meet some of our friends.", "message": "I guess you are right. But what shall we do? I don't feel like sitting at home."}, {"response": "Sounds great to me! If they are willing, we could ask them to go dancing with us.That is excellent exercise and fun, too.", "message": "That's a good idea. I hear Mary and Sally often go there to play pingpong.Perhaps we can make a foursome with them."}, {"response": "All right.", "message": "Please lie down over there."}]
So, each item has two keys - response and message.
This is my first time using PyTorch, so I was following a few online available resources. These are the relevant snippets of my code:
def tokenize_en(text):
    return [tok.text for tok in spacy_en.tokenizer(text)]

src = Field(tokenize = tokenize_en,
            init_token = '<sos>',
            eos_token = '<eos>',
            lower = True)

fields = {'response': ('r', src)}

train_data, test_data, validation_data = TabularDataset.splits(
    path = 'FilePath',
    train = 'trainset.json',
    test = 'testset.json',
    validation = 'validationset.json',
    format = 'json',
    fields = fields
)
Although no errors are raised, despite having many items in my JSON file, the train, test and validation datasets strangely have only 1 example each, as seen in this image:
Image Showing the length of train_data, test_data and validation_data
I'd be really grateful if someone could point out the error to me.
Edit: I found out that the whole file is being treated as a single text string due to the lack of indentation in the file. But if I indent the JSON file, the TabularDataset function throws a JSONDecodeError, suggesting it can no longer decode the file. How can I fix this?
I think the code is alright, but the issue is with your JSON file. With format = 'json', torchtext's TabularDataset expects line-delimited JSON: one object per line, with no enclosing array. Can you try removing the square brackets ("[]") at the beginning and the end of the file and putting each object on its own line?
The enclosing array is probably the reason the whole file is being read as one single example.
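If editing the file by hand is awkward, a small conversion script does the same thing. A rough sketch, assuming the file names from your snippet:
import json

# Rewrite the JSON array as line-delimited JSON: one object per line,
# no surrounding brackets and no commas between objects.
with open('trainset.json') as f:
    records = json.load(f)

with open('trainset_fixed.json', 'w') as f:
    for record in records:
        f.write(json.dumps(record) + '\n')
Repeat for testset.json and validationset.json, then point TabularDataset.splits at the rewritten files.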

Linear function with live data

I am an absolute newbie at programming and really desperate. It seems I have picked a rather ambitious task to solve...
I know there are tons of explanations for solving y = mx + b with Python, but they all deal with "static" data. I am trying to do it with live data.
So far, I have two data streams, which I have successfully directed into two lists - please see the code below.
for graph in basis_graph:
    high_1 = float(graph.high)
    low_1 = float(graph.low)
    if high_1 > 0:
        graph_high.append([high_1])
    if low_1 > 0:
        graph_low.append([low_1])
Now comes the tricky part, and I DON'T GET IT. I need a function that calculates "m" for me. Something like this:
def function_signal():
    if graph_high[-1] < graph_high[-2]:
        # please, mr. computer, calculate me "m"
I tried something like
def signal():
    if graph_low[-1] < graph_low[-2]:
        print("a")
        ay1 = graph_low[-1]
        by1 = graph_low[-2]
        m = ay1 - by1
        return m

print(m(ay1, ay2))
For two days I tried EVERYTHING I know so far, but the only thing I earned was a cascade of tracebacks. From "I can't divide two list objects" to ""m" is not defined" and so on and so on...
In the case above, for example, NOTHING happens. Sometimes it says "m is not defined"...
Please, if there's someone out there who is willing to help, I would really appreciate it.
Thanks in advance.
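For what it's worth, here is a minimal sketch of how the slope could be computed from the last two live values. It assumes that consecutive samples are one x-step apart (so dx = 1) and it unwraps the one-element lists that graph_low stores; both are assumptions for illustration, not part of the original question.
def slope(points):
    # points is a list of one-element lists, e.g. [[1.2], [1.1], ...]
    y2 = points[-1][0]
    y1 = points[-2][0]
    dx = 1.0  # assumed spacing between consecutive samples
    return (y2 - y1) / dx

def signal():
    if graph_low[-1][0] < graph_low[-2][0]:
        m = slope(graph_low)
        print("m =", m)
        return m
Note that m is a number here, so it is printed directly rather than called like a function.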

How to make a complex list into a dataframe in R?

I have a complex list which I got from a JSON file.
The JSON file comes from a map service API in China.
I searched for a solution but couldn't find one that fits my question, so I am asking it here and hope it can be solved.
If I missed something that has already been answered elsewhere, I apologize for that.
The code to get the list is as follows:
library(rjson)
library(RCurl)
key<-"fd5a14632c36aecd2e759a0cc91a3b4a"
origin<-"大润发东环店"
urlorigin <- paste("http://restapi.amap.com/v3/geocode/geo?key=",key,"&address=",origin,"&city=苏州",sep = "")
dataorigin<-readLines(urlorigin,encoding="UTF-8")
origininfo<-fromJSON(dataorigin)
originpoi<-origininfo$geocodes[[1]]$location
destination<-"苏州大学本部北门"
urldest <- paste("http://restapi.amap.com/v3/geocode/geo?key=",key,"&address=",destination,"&city=苏州",sep = "")
datadest<-readLines(urldest,encoding="UTF-8")
destinfo<-fromJSON(datadest)
destpoi<-destinfo$geocodes[[1]]$location
urlpath <- paste("http://restapi.amap.com/v3/direction/driving?key=",key,"&origin=",originpoi,"&destination=",destpoi, "&originid=&destinationid=&extensions=all&strategy=0&waypoints=&avoidpolygons=&avoidroad=",sep = "")
pathjson<-paste(readLines(urlpath,encoding = "UTF-8"),collapse = "")
pathinfo<-fromJSON(pathjson)
pathinfo is the list I end up with, and I want to convert it into a data frame that I can work with.
Thank you for your time.
I'm from China and my English is not that good; I apologize for that.
My Chinese is very limited as well. But your code to get the data is working (with some warnings).
pathinfo_df <- as.data.frame(lapply(pathinfo,rbind))
pathinfo_df is now a data frame.
summary(pathinfo_df)
 status  info  infocode  count
 1:1     OK:1  10000:1   1:1

 route.origin.Length  route.origin.Class  route.origin.Mode
 1                    -none-              character

 route.destination.Length  route.destination.Class  route.destination.Mode
 1                         -none-                   character

 route.taxi_cost.Length  route.taxi_cost.Class  route.taxi_cost.Mode
 1                       -none-                 character

 route.paths.Length  route.paths.Class  route.paths.Mode
 1                   -none-             list
So there's plenty to select and play with. Read up on selecting elements from lists; see also:
str(pathinfo_df)
Then map it on Google Earth. Looks like the taxi might be costly. Have a good trip!

Saving HTML tables to a Database

I am trying to scrape an HTML table and save its data in a database. What strategies/solutions have you found helpful in approaching this problem?
I'm most comfortable with Java and PHP but really a solution in any language would be helpful.
EDIT: For more detail, the UTA (Salt Lake's Bus system) provides bus schedules on its website. Each schedule appears in a table that has stations in the header and times of departure in the rows. I would like to go through the schedules and save the information in the table in a form that I can then query.
Here's the starting point for the schedules
It all depends on how well-formed the HTML you want to scrape is. If it's valid XHTML, you can simply run some XPath queries on it to get whatever you want.
Example of xpath in php: http://blogoscoped.com/archive/2004_06_23_index.html#108802750834787821
A helper class to scrape a table into an array: http://www.tgreer.com/class_http_php.html
There is a nice book about this topic: Spidering Hacks by Kevin Hemenway and Tara Calishain.
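For example, the same XPath idea in Python with lxml. This is only a sketch: the URL and the table/row structure below are placeholders, not the actual UTA schedule markup.
from lxml import html
import requests  # assumes the requests package is available

# Fetch the schedule page and pull every row out of the first table.
page = requests.get('http://www.example.com/schedules/route-1')
tree = html.fromstring(page.content)

rows = []
for tr in tree.xpath('//table[1]//tr'):
    cells = [cell.text_content().strip() for cell in tr.xpath('./th | ./td')]
    if cells:
        rows.append(cells)

print(rows[:5])  # header row plus the first few departure rows
Each entry in rows is then ready to be inserted into whatever database table you design.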
I've found that scripting languages are generally better suited for doing such tasks. I personally prefer Python, but PHP will work as well. Chopping, mincing and parsing strings in Java is just too much work.
I have tried screen-scraping before, but I found it to be very brittle, especially with dynamically-generated code.
I found a third-party DOM-parser and used it to navigate the source code with Regex-like matching patterns in order to find the data I needed.
I suggest trying to find out whether the owners of the site have a published API (often web services) for retrieving data from their system. If not, then good luck to you.
If what you want is a CSV table, then you can do it like this, using Python.
For example, imagine you want to scrape forex quotes in CSV form from a site like fxoanda. Then:
from BeautifulSoup import BeautifulSoup
import urllib, string, csv, sys, os
from string import replace

# Build the query URL for the historical quotes (date range and currency pair)
date_s = '&date1=01/01/08'
date_f = '&date=11/10/08'
fx_url = 'http://www.oanda.com/convert/fxhistory?date_fmt=us'
fx_url_end = '&lang=en&margin_fixed=0&format=CSV&redirected=1'
cur1, cur2 = 'USD', 'AUD'
fx_url = fx_url + date_f + date_s + '&exch=' + cur1 + '&exch2=' + cur1
fx_url = fx_url + '&expr=' + cur2 + '&expr2=' + cur2 + fx_url_end

# Fetch the page and pull the CSV data out of the <pre> block
data = urllib.urlopen(fx_url).read()
soup = BeautifulSoup(data)
data = str(soup.findAll('pre', limit=1))
data = replace(data, '[<pre>', '')
data = replace(data, '</pre>]', '')

# Write the raw CSV to disk
file_location = '/Users/location_edit_this'
file_name = file_location + 'usd_aus.csv'
file = open(file_name, "w")
file.write(data)
file.close()
Once you have it in this form, you can convert the data to any format you like.
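And to get from the CSV into a database, something like the following would do. The sqlite3 module and the two-column table schema here are illustrative choices of mine, not something the site provides.
import csv
import sqlite3

# Load the scraped CSV into a SQLite table so it can be queried later.
conn = sqlite3.connect('rates.db')
conn.execute('CREATE TABLE IF NOT EXISTS rates (day TEXT, rate REAL)')

with open('usd_aus.csv') as f:
    for row in csv.reader(f):
        if len(row) == 2:  # skip blank or malformed lines
            conn.execute('INSERT INTO rates (day, rate) VALUES (?, ?)', (row[0], row[1]))

conn.commit()
conn.close()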
At the risk of starting a shitstorm here on SO, I'd suggest that if the format of the table never changes, you could just about get away with using regular expressions to parse and capture the content you need.
pianohacker overlooked the HTML::TableExtract module, which was designed for exactly this sort of thing. You'd still need LWP to retrieve the table.
This would be by far the easiest with Perl, and the following CPAN modules:
http://metacpan.org/pod/HTML::Parser
http://metacpan.org/pod/LWP
http://metacpan.org/pod/DBD/mysql
http://metacpan.org/pod/DBI.pm
CPAN is the main distribution mechanism for Perl modules; you can install a module by running the following shell command, for example:
# cpan HTML::Parser
If you're on Windows, things will be more interesting, but you can still do it: http://www.perlmonks.org/?node_id=583586