Sequence labeling for sentences and not tokens - deep-learning

I have sentences that belong to a paragraph. Each sentence has a label.
[s1,s2,s3,…], [l1,l2,l3,…]
I understand that I have to encode each sentence using an encoder, and then apply sequence labeling. Could you guide me on how to combine the two?

If I understand your question correctly, you are looking for a way to encode your sentences into a numeric representation.
Let's say you have data like:
data = ["Sarah, is that you? Hahahahahaha Todd give you another black eye??"
"Well, being slick comes with the job of being a propagandist, Andi..."
"Sad to lose a young person who was earnestly working for the common good and public safety when so many are in the basement smoking pot and playing computer games."]
labels = [0,1,0]
Now you want to build a classifier. For training, the data must be in numeric format, so we first transform the text into a numeric structure; for that we use a TF-IDF vectorizer, which creates a feature matrix from the text data, and then we apply an algorithm to it.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline

vectorizerPipe = Pipeline([
    ('tfidf', TfidfVectorizer(lowercase=True, stop_words='english')),
    ('classification', LinearSVC(penalty='l2', loss='hinge'))])
trained_model = vectorizerPipe.fit(data, labels)
Here a pipeline is constructed where the first step is feature-vector extraction (converting the text data into numeric format) and the next step applies the algorithm to it. There are lots of parameters you can experiment with in both steps.
Later we fit the pipeline with the .fit method, passing the data and labels.
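Once fitted, the same pipeline can label unseen sentences; a minimal usage sketch (the example sentence is invented):

new_sentences = ["Another day, another propaganda piece, Andi..."]
print(trained_model.predict(new_sentences))  # one predicted label per sentence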


What is the right way to generate long sequence using PyTorch-Transformers?

I am trying to generate a long sequence of text using PyTorch-Transformers from a sample text. I am following this tutorial for this purpose. Because the original article only predicts one word from a given text, I modified that script to generate a long sequence instead of a single word. This is the modified part of the code:
# Encode a text input
text = """An examination can be defined as a detailed inspection or analysis
of an object or person. For example, an engineer will examine a structure,
like a bridge, to see if it is safe. A doctor may conduct"""
indexed_tokens = tokenizer.encode(text)

# Convert the indexed tokens into a PyTorch tensor
tokens_tensor = torch.tensor([indexed_tokens])
seq_len = tokens_tensor.shape[1]
tokens_tensor = tokens_tensor.to('cuda')

# Greedily append the most likely next token, 50 times
with torch.no_grad():
    for i in range(50):
        outputs = model(tokens_tensor[:, -seq_len:])
        predictions = outputs[0]
        predicted_index = torch.argmax(predictions[0, -1, :])
        tokens_tensor = torch.cat((tokens_tensor, predicted_index.reshape(1, 1)), 1)

pred = tokens_tensor.detach().cpu().numpy().tolist()
predicted_text = tokenizer.decode(pred[0])
print(predicted_text)
Output
An examination can be defined as a detailed inspection or analysis
of an object or person. For example, an engineer will examine a
structure, like a bridge, to see if it is safe. A doctor may conduct
an examination of a patient's body to see if it is safe.
The doctor may also examine a patient's body to see if it is safe. A
doctor may conduct an examination of a patient's body to see if it is
safe.
As you can see, the generated text does not contain any unique text sequence; it generates the same sentence over and over again with minor changes.
How should we generate long sequences using PyTorch-Transformers?
There is usually no such thing as generating a complete sentence or a complete text at once. There have been some research approaches toward that, but almost all state-of-the-art models generate text word by word. The word generated at time t-1 is then used as input (together with the other already generated or given words) when generating the next word at time t. So it is normal that it generates word by word; I do not understand what you mean by this.
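As a side note that goes beyond the answer above: the repetition in the question's output is typical of always taking the argmax. A minimal sketch of one common mitigation, top-k sampling, assuming model and tokens_tensor are already set up as in the question:

import torch

with torch.no_grad():
    for _ in range(50):
        outputs = model(tokens_tensor)
        logits = outputs[0][0, -1, :]                  # scores for the next token
        top_logits, top_idx = torch.topk(logits, 50)   # keep the 50 best candidates
        probs = torch.softmax(top_logits, dim=-1)
        choice = top_idx[torch.multinomial(probs, 1)]  # sample instead of argmax
        tokens_tensor = torch.cat((tokens_tensor, choice.reshape(1, 1)), 1)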
Which model are you using?

How do I get molecular structural information from SMILES

My question is: is there any algorithm that can convert a SMILES structure into a topological fingerprint? For example, if glycerol is the input, the answer would be 3x -OH, 2x -CH2 and 1x -CH.
I'm trying to build a Python script that can predict the density of a mixture using an artificial neural network. As input I want to have the structure/fingerprint of my molecules, starting from the SMILES structure.
I'm already familiar with rdkit and the Morgan fingerprint, but that is not what I'm looking for. I'm also aware that I can use the 'matching substructure' search in rdkit, but then I would have to define all the different subgroups. Is there any more convenient/shorter way?
For most structures, there's no existing option to find the fragments. However, there's a module in rdkit that can give you the number of fragments, especially when it's a functional group. Check it out here. As an example, let's say you want to find the number of aliphatic -OH groups in your molecule. You can simply call the following function to do that:
from rdkit.Chem.Fragments import fr_Al_OH
fr_Al_OH(mol)
or the following would return the number of aromatic -OH groups:
from rdkit.Chem.Fragments import fr_Ar_OH
fr_Ar_OH(mol)
Similarly, there are 83 more such functions available, some of which should be useful for your task. For the ones where you don't get a pre-written function, you can always go to the source code of these rdkit modules, figure out how they did it, and then implement the same for your own features. But, as you already mentioned, the other way is to define a SMARTS string and then do fragment matching. The fragment matching module can be found here.
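A runnable sketch of the above (assumptions: RDKit is installed, and glycerol is written as the SMILES OCC(O)CO):

from rdkit import Chem
from rdkit.Chem.Fragments import fr_Al_OH

mol = Chem.MolFromSmiles('OCC(O)CO')  # glycerol
print(fr_Al_OH(mol))                  # number of aliphatic -OH groups, 3 here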
If you want to predict densities of pure components before predicting the mixtures I recommend the following paper:
https://pubs.acs.org/doi/abs/10.1021/acs.iecr.6b03809
You can use the fragments specified by rdkit, as mnis proposes. Or you could specify the groups as SMARTS patterns and look for them yourself using GetSubstructMatches, as you proposed.
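A minimal sketch of that SMARTS route (the [CH2] pattern and the glycerol SMILES are illustrative assumptions):

from rdkit import Chem

mol = Chem.MolFromSmiles('OCC(O)CO')   # glycerol
pattern = Chem.MolFromSmarts('[CH2]')  # aliphatic CH2 group
print(len(mol.GetSubstructMatches(pattern)))  # 2 matches for glycerol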
Dissecting a molecule into specific groups is not as straightforward as it might appear at first. You could also use an algorithm I published a while ago:
https://jcheminf.biomedcentral.com/articles/10.1186/s13321-019-0382-3
It includes a list of SMARTS for the UNIFAC model, but you could also use them for other things, like density prediction.

NLTK (or other) Part of speech tagger that returns n-best tag sequences

I need a part of speech tagger that does not just return the optimal tag sequence for a given sentence, but that returns the n-best tag sequences. So for 'time flies like an arrow', it could return both NN VBZ IN DT NN and NN NNS VBP DT NN for example, ordered in terms of their probability. I need to train the tagger using my own tag set and sentence examples, and I would like a tagger that allows different features of the sentence to be engineered. If one of the nltk taggers had this functionality, that would be great, but any tagger that I can interface with my Python code would do. Thanks in advance for any suggestions.
I would recommend having a look at spaCy. From what I have seen, it doesn't by default allow you to return the top-n tags, but it supports creating custom pipeline components.
There is also an issue on Github where exactly this is discussed, and there are some suggestions on how to implement it relatively quickly.

Generating truth tables for basic logic circuits

Let's say I have a text file that looks like this:
<number> <name> <type> <inputs...>
1 XOR1 XOR A B
2 SUM XOR 1 C
What would be the best approach to generate the truth table for this circuit?
That depends on what you have available, and how big your file is.
Perl is optimized for reading files and generating simple text output. It doesn't have a library of boolean operators, but they're easy enough to write. I'd use that if I just wanted text-in, text-out.
If I wanted to display the data online AND generate a results file, I'd use PHP to read the data and write the table to a CSV file that could either be opened in Excel, or posted online in an HTML table.
If your data is in a REALLY BIG data file, I'd use SQL.
If your data is in a really huge file that you want to be accessible to authorized users online, and you want THEM to be able to create truth tables, I'd use Oracle's APEX to create an easy interface for them to build their own truth tables and play around with the data without altering it.
If you're in an electrical engineering environment, use the tools designed for your problem -- Verilog or similar.
Whatcha got? Whatcha wanna do with it?
-- Ada
I prefer using C#. I already have the code to 'parse' the input text file. I just don't know where to start in terms of actually 'simulating' it. The output can simply be a text file with inputs and output values. – Don
How many inputs and how many outputs are in the circuit you want to simulate?
The size of the simulation determines how it can most easily be run. If the circuit is small(ish), you can enter the inputs and circuit values into vector arrays, then cross them to get the output matrix.
Matlab is ideal for this, as it was written for processing arrays.
Again: Whatcha got, and whatcha wanna do with it?
-- Ada
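Since Don already has the netlist parsed, here is a minimal sketch in Python rather than C# (the gate table and the primary inputs A, B, C are assumptions based on the example file): evaluate the gates in file order for every combination of inputs.

from itertools import product

# Parsed form of the example file: (name, type, input names)
netlist = [('XOR1', 'XOR', ['A', 'B']),
           ('SUM', 'XOR', ['XOR1', 'C'])]
gates = {'XOR': lambda a, b: a ^ b,
         'AND': lambda a, b: a & b,
         'OR': lambda a, b: a | b}

print('A B C | SUM')
for a, b, c in product([0, 1], repeat=3):
    values = {'A': a, 'B': b, 'C': c}
    for name, gtype, inputs in netlist:
        values[name] = gates[gtype](*(values[i] for i in inputs))
    print(a, b, c, '|', values['SUM'])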

How can I program a simple chat bot AI?

I want to build a bot that asks someone a few simple questions and branches based on the answer. I realize parsing meaning from the human responses will be challenging, but how do you set up the program to deal with the "state" of the conversation?
It will be a one-to-one conversation between a human and the bot.
You probably want to look into Markov Chains as the basis for the bot AI. I wrote something a long time ago (the code of which I'm not proud of at all, and which needs some mods to run on Python > 1.5) that may be a useful starting place for you: http://sourceforge.net/projects/benzo/
EDIT: Here's a minimal example in Python of a Markov Chain that accepts input from stdin and outputs text based on the probabilities of words succeeding one another in the input. It's optimized for IRC-style chat logs, but running any decent-sized text through it should demonstrate the concepts:
import random, sys

NONWORD = "\n"
STARTKEY = NONWORD, NONWORD
MAXGEN = 1000

class MarkovChainer(object):
    def __init__(self):
        self.state = dict()

    def input(self, input):
        word1, word2 = STARTKEY
        for word3 in input.split():
            self.state.setdefault((word1, word2), list()).append(word3)
            word1, word2 = word2, word3
        self.state.setdefault((word1, word2), list()).append(NONWORD)

    def output(self):
        output = list()
        word1, word2 = STARTKEY
        for i in range(MAXGEN):
            word3 = random.choice(self.state[(word1, word2)])
            if word3 == NONWORD:
                break
            output.append(word3)
            word1, word2 = word2, word3
        return " ".join(output)

if __name__ == "__main__":
    c = MarkovChainer()
    c.input(sys.stdin.read())
    print c.output()
It's pretty easy from here to plug in persistence and an IRC library and have the basis of the type of bot you're talking about.
Folks have mentioned already that statefulness isn't a big component of typical chatbots:
a pure Markov implementation may express a very loose sort of state if it is growing its lexicon and table in real time—earlier utterances by the human interlocutor may get regurgitated by chance later in the conversation—but the Markov model doesn't have any inherent mechanism for selecting or producing such responses.
a parsing-based bot (e.g. ELIZA) generally attempts to respond to (some of the) semantic content of the most recent input from the user without significant regard for prior exchanges.
That said, you certainly can add some amount of state to a chatbot, regardless of the input-parsing and statement-synthesis model you're using. How to do that depends a lot on what you want to accomplish with your statefulness, and that's not really clear from your question. A couple general ideas, however:
Create a keyword stack. As your human offers input, parse out keywords from their statements/questions and throw those keywords onto a stack of some sort. When your chatbot fails to come up with something compelling to respond to in the most recent input—or, perhaps, just at random, to mix things up—go back to your stack, grab a previous keyword, and use that to seed your next synthesis. For bonus points, have the bot explicitly acknowledge that it's going back to a previous subject, e.g. "Wait, HUMAN, earlier you mentioned foo. [Sentence seeded by foo]". (A small sketch of this idea follows this answer.)
Build RPG-like dialogue logic into the bot. As you're parsing human input, toggle flags for specific conversational prompts or content from the user and conditionally alter what the chatbot can talk about, or how it communicates. For example, a chatbot bristling (or scolding, or laughing) at foul language is fairly common; a chatbot that will get het up, and conditionally remain so until apologized to, would be an interesting stateful variation on this. Switch output to ALL CAPS, throw in confrontational rhetoric or demands or sobbing, etc.
Can you clarify a little what you want the state to help you accomplish?
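A minimal sketch of the keyword-stack idea above (the stopword list and canned replies are invented):

STOPWORDS = {'the', 'a', 'an', 'is', 'are', 'you', 'i', 'to', 'of'}
keyword_stack = []

def remember_keywords(utterance):
    # Push anything that looks like a content word onto the stack.
    for word in utterance.lower().split():
        if word.isalpha() and word not in STOPWORDS:
            keyword_stack.append(word)

def fallback_response():
    # When nothing in the latest input is compelling, revive an old topic.
    if keyword_stack:
        topic = keyword_stack.pop()
        return "Wait, earlier you mentioned %s. Tell me more about that." % topic
    return "Tell me more."

remember_keywords('my dog ate my homework yesterday')
print(fallback_response())  # -> "Wait, earlier you mentioned yesterday. ..."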
Imagine a neural network with parsing capabilities in each node or neuron. Depending on rules and parsing results, neurons fire. If certain neurons fire, you get a good idea about the topic and semantics of the question and can therefore give a good answer.
Memory is implemented by keeping the topics talked about in a session, adding them to the firing for the next question, and thereby guiding the selection of possible answers at the end.
Keep your rules and patterns in a knowledge base, but compile them into memory at start time, with a neuron per rule. You can engineer synapses using something like listeners or event functions.
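A minimal sketch of that firing idea (the rules, weights, and replies are all invented):

# Each "neuron" is a keyword rule; session topics add carry-over activation.
rules = {
    'weather': ({'rain', 'sun', 'weather'}, "Let's talk about the weather."),
    'health': ({'doctor', 'sick', 'pain'}, 'How are you feeling?'),
}
session_topics = {}  # topic -> activation carried over from earlier questions

def answer(question):
    words = set(question.lower().split())
    scores = {topic: len(words & keywords) + 0.5 * session_topics.get(topic, 0)
              for topic, (keywords, reply) in rules.items()}
    best = max(scores, key=scores.get)
    session_topics[best] = session_topics.get(best, 0) + 1  # remember the topic
    return rules[best][1]

print(answer('i feel pain when i see the doctor'))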
I think you can look at the code for Kooky, and IIRC it also uses Markov Chains.
Also check out the kooky quotes, they were featured on Coding Horror not long ago and some are hilarious.
I think to start this project, it would be good to have a database of questions, organized as a tree with one or more questions at every node.
These questions should be answered with "yes" or "no".
If the bot starts the questioning, it can begin with any question from your database that is marked as a start question. The answer determines the path to the next node in the tree.
Edit: Here is a simple one written in Ruby you can start with: rubyBOT
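A minimal sketch of that tree idea, in Python rather than Ruby (the nodes and questions are invented):

# Each node holds a yes/no question plus the name of the next node per answer.
tree = {
    'start': {'question': 'Are you feeling well?', 'yes': 'hobby', 'no': 'doctor'},
    'hobby': {'question': 'Do you enjoy games?', 'yes': None, 'no': None},
    'doctor': {'question': 'Have you seen a doctor?', 'yes': None, 'no': None},
}

node = 'start'
while node is not None:
    reply = input(tree[node]['question'] + ' (yes/no) ').strip().lower()
    node = tree[node].get(reply)  # any other answer simply ends the dialogue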
A naive chatbot program: no parsing, no cleverness, just a training file and output.
It first trains itself on a text and then uses the data from that training to generate responses to the interlocutor's input. The training process creates a dictionary where each key is a word and the value is a list of all the words that follow that word anywhere in the training text. If a word features more than once in this list, that is reflected in the list and it is more likely to be chosen by the bot; no need for probabilistic machinery, just do it with a list.
The bot chooses a random word from your input and generates a response by choosing another random word that has been seen to be a successor to its held word. It then repeats the process by finding a successor to that word in turn, carrying on iteratively until it thinks it has said enough. It reaches that conclusion by stopping at a word that was prior to a punctuation mark in the training text. It then returns to input mode again to let you respond, and so on.
It isn't very realistic, but I hereby challenge anyone to do better in 71 lines of code!! This is a great challenge for any budding Pythonists, and I just wish I could open the challenge to a wider audience than the small number of visitors I get to this blog. To code a bot that is always guaranteed to be grammatical must surely be closer to several hundred lines; I simplified hugely by just trying to think of the simplest rule that gives the computer a mere stab at having something to say.
Its responses are rather impressionistic, to say the least! Also, you have to put what you say in single quotes.
I used War and Peace for my "corpus", which took a couple of hours for the training run; use a shorter file if you are impatient…
Here is the trainer:
#lukebot-trainer.py
import pickle

# Read the training text word by word
b = open('war&peace.txt')
text = []
for line in b:
    for word in line.split():
        text.append(word)
b.close()

# For each distinct word, collect every word that follows it,
# skipping successors of words that end in punctuation
textset = list(set(text))
follow = {}
for l in range(len(textset)):
    working = []
    check = textset[l]
    for w in range(len(text) - 1):
        if check == text[w] and text[w][-1] not in '(),.?!':
            working.append(str(text[w + 1]))
    follow[check] = working

# Save the successor table for the bot
a = open('lexicon-luke', 'wb')
pickle.dump(follow, a, 2)
a.close()
Here is the bot:
#lukebot.py
import pickle, random

# Load the successor table built by the trainer
a = open('lexicon-luke', 'rb')
successorlist = pickle.load(a)
a.close()

def nextword(a):
    if a in successorlist:
        return random.choice(successorlist[a])
    else:
        return 'the'

# Chain successor words until one ends in punctuation, then reply
speech = ''
while speech != 'quit':
    speech = raw_input('>')
    s = random.choice(speech.split())
    response = ''
    while True:
        neword = nextword(s)
        response += ' ' + neword
        s = neword
        if neword[-1] in ',?!.':
            break
    print response
You tend to get an uncanny feeling when it says something that seems partially to make sense.
I would suggest looking at Bayesian probabilities. Then just monitor the chat room for a period of time to create your probability tree.
I'm not sure this is what you're looking for, but there's an old program called ELIZA which could hold a conversation by taking what you said and spitting it back at you after performing some simple textual transformations.
If I remember correctly, many people were convinced that they were "talking" to a real person and had long elaborate conversations with it.
If you're just dabbling, I believe Pidgin allows you to script chat-style behavior. Part of the framework probably tracks the state of who sent the message when, and you'd want to keep a log of your bot's internal state for each of the last N messages. Future state decisions could be hardcoded based on inspection of previous states and the content of the most recent few messages. Or you could do something like the Markov chains discussed and use them both for parsing and generating.
If you do not require a learning bot, using AIML (http://www.aiml.net/) will most likely produce the result you want, at least with respect to the bot parsing input and answering based on it.
You would reuse or create "brains" made of XML (in the AIML-format) and parse/run them in a program (parser). There are parsers made in several different languages to choose from, and as far as I can tell the code seems to be open source in most cases.
You can use "ChatterBot", and host it locally using - 'flask-chatterbot-master"
Links:
[ChatterBot Installation]
https://chatterbot.readthedocs.io/en/stable/setup.html
[Host Locally using - flask-chatterbot-master]: https://github.com/chamkank/flask-chatterbot
Cheers,
Ratnakar