How to build a deep learning model that picks words from several distinct bags and forms a meaningful sentence [closed] - deep-learning

[Image: bags and how to choose from them]
Imagine I have 10 bags, ordered one after the other, i.e. Bag 1, Bag 2, ..., Bag n.
Each bag has a distinct set of words.
To understand what a bag is, consider a vocabulary of 10,000 words.
The first bag contains the words Hello, India, Manager.
That is, Bag 1 will have 1's at the indices of the words present in the bag.
Example: Bag 1 will be of size 10000*1.
If Hello's index was 1, India's index was 2 and Manager's was 4, it will be
[0, 1, 1, 0, 1, 0, 0, 0, 0, .........]
* I don't have a model yet.
* I'm thinking of using story books, but it's still kind of abstract for me.
A word has to be chosen from each bag and assigned a number: word 1 (word from bag 1), word 2 (word from bag 2), and they must form a MEANINGFUL sentence in their numerical order!

First, we need a way for the computer to recognise a word, otherwise it cannot pick the correct one. That means that at this stage we need to decide what we are teaching the computer to begin with (i.e. what a verb is, what a noun is, what grammar is), but I will assume we dump a dictionary into it and give no information except the words themselves.
So that the computer can compute what sentences are, we need to convert them to numbers (one way would be to work alphabetically starting at 1, using the numbers as keys for a dictionary (a digital one this time(!)) and the words as the values). Now we can apply the same linear algebra techniques to this problem as to any other problem.
So we need to make generations of matrices of weights to multiply into the keys of the dictionary, then remove all the outputs beyond the range of dictionary keys; the rest can be used to look up values in the dictionary and make a sentence. Optionally, you can also use a threshold value to subtract from all the outputs of the matrix multiplication.
Now for the hard part: learning. Once you have a few (say 100) matrices, we need to "breed" the best ones (this is where human intervention is needed): you pick the 50 most meaningful sentences (might be hard at first) and use them as the basis for your next 100 (the easiest way would be to weight the 50 surviving matrices randomly into a weighted mean, 100 times).
And the boring bit: keep running the generations over and over until you get to a point where your sentences are meaningful most of the time (of course there is no guarantee that they will always be meaningful, but that's the nature of ANNs).
If you find it doesn't work, you can use more layers (more matrices), and/or I recently heard of a different technique that dynamically changes the network, but I can't really help with that.
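A minimal sketch of the "breeding" step described above, assuming the candidates are plain weight matrices and the human scoring step is a hypothetical pick_best function (the dictionary size is also an assumption):
import numpy as np

dict_size = 1000   # assumed dictionary size for the sketch
population = [np.random.randn(dict_size, dict_size) for _ in range(100)]

def breed(survivors, n_children=100):
    # recombine surviving matrices by random weighted means
    children = []
    for _ in range(n_children):
        w = np.random.rand(len(survivors))
        w /= w.sum()
        children.append(sum(wi * m for wi, m in zip(w, survivors)))
    return children

# each generation (pick_best is the hypothetical human-in-the-loop scoring step):
# survivors = pick_best(population, keep=50)
# population = breed(survivors)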

Have a database with thousands/millions of valid sentences.
Create a dictionary where each word represents a number (reserve 0 for "nothing", 1 for "start of sentence" and 2 for "end of sentence").
word_dic = { "_nothing_": 0, "_start_": 1, "_end_": 2, "word1": 3, "word2": 4, ...}
reverse_dic = {v:k for k,v in word_dic.items()}
Remember to add "_start_" and "_end_" at the beginning and end of all sentences in the database, and "_nothing_" after the end to pad up to the desired length capable of containing all sentences. (Ideally, work with sentences of 10 or fewer words, so your model won't try to create bigger sentences.)
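For example, a small sketch of that padding step (the fixed length of 10 and the example sentence are assumptions):
max_len = 10  # assumed fixed length, long enough to contain all sentences

sentence = ["word1", "word2"]                      # example sentence
padded = ["_start_"] + sentence + ["_end_"]
padded += ["_nothing_"] * (max_len - len(padded))  # pad up to the fixed length
# -> ['_start_', 'word1', 'word2', '_end_', '_nothing_', ...]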
Transform all your sentences into sequences of indices:
#supposing you have an array of shape (sentences, length) as strings:
import numpy as np

indices = []
for word in database.reshape((-1,)):
    indices.append(word_dic[word])
indices = np.array(indices).reshape((sentences, length))
Transform this into categorical (one-hot) words with the keras function to_categorical():
from keras.utils import to_categorical
cat_sentences = to_categorical(indices) #shape (sentences,length,dictionary_size)
Hint: keras has lots of useful text preprocessing functions here.
Separate training input and output data:
#input is the sentences except for the last word
x_train = cat_sentences[:,:-1,:]
#output is the sentences except for the first word (shifted by one position)
y_train = cat_sentences[:,1:,:]
Let's create an LSTM based model that will predict the next words from the previous words:
model = Sequential()
model.add(LSTM(dontKnow,return_sequences=True,input_shape=(None,dictionary_size)))
model.add(.....)
model.add(LSTM(dictionary_size,return_sequences=True,activation='sigmoid'))
#or a Dense(dictionary_size,activation='sigmoid')
Compile and fit this model with x_train and y_train:
model.compile(....)
model.fit(x_train,y_train,....)
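One reasonable way to fill in the elided calls (the loss, optimizer, batch size and epoch count here are assumptions, not the only valid choices):
model.compile(loss='categorical_crossentropy', optimizer='adam')
model.fit(x_train, y_train, batch_size=32, epochs=100)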
Create an identical model using stateful=True in all LSTM layers:
newModel = ......
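For example, a hedged sketch of such a copy (the layer sizes must mirror the trained model; any extra hidden layers are omitted here, and the batch size is fixed to 1):
newModel = Sequential()
newModel.add(LSTM(dontKnow, return_sequences=True, stateful=True,
                  batch_input_shape=(1, None, dictionary_size)))
newModel.add(LSTM(dictionary_size, return_sequences=True,
                  activation='sigmoid', stateful=True))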
Transfer the weights from the trained model:
newModel.set_weights(model.get_weights())
Create your bags in a categorical way, shape (10, dictionary_size).
Use the model to predict one word from the _start_ word.
#reset the states of the stateful model before you start a 10 word prediction:
newModel.reset_states()
firstWord = newModel.predict(startWord) #startword is shaped as (1,1,dictionary_size)
The firstWord will be a vector with size dictionary_size telling (sort of) the probabilities of each existing word. Compare to the words in the bag. You can choose the highest probability, or use some random selecting if the probabilities of other words in the bag are also good.
#example taking the most probable word:
firstWord = np.array(firstWord == firstWord.max(), dtype=np.float32)
Do the same again, but now input firstWord in the model:
secondWord = newModel.predict(firstWord) #respect the shapes
Repeat the process until you get a sentence. Notice that you may find _end_ before the 10 words in the bag are satisfied. You may decide to finish the process with a shorter sentence then, especially if other word probabilities are low.
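Putting the loop together, a hedged sketch of the whole generation step: it assumes bags is an array of shape (10, dictionary_size) with 1's marking the words allowed in each bag, and reuses word_dic/reverse_dic/startWord from above.
import numpy as np

newModel.reset_states()
current = startWord        # the _start_ word, shaped (1, 1, dictionary_size)
sentence_indices = []

for bag in bags:
    raw = newModel.predict(current)[0, -1]   # probabilities for every word in the dictionary
    if raw.argmax() == word_dic["_end_"]:
        break                                # the model wants to end the sentence early
    masked = raw * bag                       # keep only the words present in this bag
    chosen = int(masked.argmax())            # or sample randomly, weighted by masked

    sentence_indices.append(chosen)
    current = np.zeros((1, 1, masked.shape[0]))
    current[0, 0, chosen] = 1.0              # feed the chosen word back into the model

print(" ".join(reverse_dic[i] for i in sentence_indices))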

Related

Data PreProcessing for BERT (base-german)

I am working on a sentiment analysis solution with BERT to analyze tweets in German. My training dataset is a class of 1000 tweets, which have been manually annotated into the classes neutral, positive and negative.
The dataset with 10,000 tweets is quite unevenly distributed:
approx.
3000 positive
2000 negative
5000 neutral
the tweets contain formulations with #names, https links, numbers, punctuation marks, smileys like :3 :D :) etc..
The interesting thing is, if I remove them with the following code during Data Cleaning, the F1 score gets worse. Only the removal of https links (if I do it alone) leads to a small improvement.
# removing the punctuation and numbers
import re
import string

def remove_punct(text):
    text = re.sub(r'http\S+', '', text)  # removing links
    text = re.sub(r'#\S+', '', text)     # removing references to usernames with #
    text = re.sub(r':\S+', '', text)     # removing smileys starting with : (like :), :D, :( etc.)
    text = "".join([char for char in text if char not in string.punctuation])
    text = re.sub('[0-9]+', '', text)
    return text

data['Tweet_clean'] = data['Tweet'].apply(lambda x: remove_punct(x))  # extending the dataset with the column Tweet_clean
data.head(40)
Also, steps like stop-word removal or lemmatization lead to further deterioration. Is this because I am doing something wrong, or can the BERT model actually handle such values?
A second question is:
I found other records that were also manually annotated, but these are not tweets, and the structure of the sentences and the language use is different. Would you still recommend adding these records to my original dataset?
There are about 3000 records in German.
My last question:
Should I reduce the class sizes to the size of the smallest class and thus balance them?
BERT can handle punctuation, smileys etc. Of course, smileys contribute a lot to sentiment analysis. So, don't remove them. Next, it would be fair to replace #mentions and links with some special tokens, because the model will probably never see them again in the future.
If your model is designed for tweets, I suggest that you fine-tune BERT on the additional corpus first and then fine-tune on the Twitter corpus, or do both simultaneously. More training samples is generally better.
No, it is better to use class weights instead of downsampling.
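As a rough sketch of the class-weight idea (using scikit-learn's helper; how the weights are plugged into the loss depends on your training setup):
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

labels = np.array([0] * 5000 + [1] * 3000 + [2] * 2000)   # 0=neutral, 1=positive, 2=negative
classes = np.unique(labels)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=labels)
class_weights = dict(zip(classes, weights))
# {0: 0.67, 1: 1.11, 2: 1.67} -- rarer classes get larger weights in the loss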
Based on this paper (By Adam Ek, Jean-Philippe Bernardy and Stergios Chatzikyriakidis), BERT models outperform BiLSTM in terms of better generalizing to punctuations. Looking at the experiments' results in the paper, I say keep the punctuations.
I couldn't find anything solid about smiley faces; however, after doing some experiments with the HuggingFace API, I didn't notice much difference with/without smiley faces.

Uses of Embedding/ Embedding layer in deep learning

I am exploring deep learning methods especially LSTM to predict next word. Suppose, My data set is like this: Each data point consists of 7 features (7 different words)(A-G here) of different length.
Group1 Group2............ Group 38
A B F
E C A
B E G
C D G
C F F
D G G
. . .
. . .
I used one hot encoding as an Input layer. Here is the model
from keras.layers import Input, LSTM, Dropout, Dense, Activation
from keras.models import Model

main_input = Input(shape=(None, action_count), name='main_input')
lstm_out = LSTM(units=64, activation='tanh')(main_input)
lstm_out = Dropout(0.2)(lstm_out)
lstm_out = Dense(action_count)(lstm_out)
main_output = Activation('softmax')(lstm_out)
model = Model(inputs=[main_input], outputs=main_output)
print(model.summary())
Using this model, I got an accuracy of about 60%.
My question is how can I use an embedding layer for my problem? Actually, I do not know much about embeddings (why, when and how they work) [I only know that a one-hot vector does not carry much information]. I am wondering if an embedding can improve accuracy. If someone can provide me guidance in this regard, it will be greatly beneficial for me (at least whether the use of an embedding is logical or not for my case).
What are Embedding layers?
They are layers which convert positive integers (maybe word counts) into fixed-size dense vectors. They learn the so-called embeddings for a particular text dataset (in NLP tasks).
Why are they useful?
Embedding layers slowly learn the relationships between words. Hence, if you have a large enough corpus (which probably contains all possible English words), then vectors for words like "king" and "queen" will show some similarity in the multidimensional space of the embedding.
How are they used in Keras?
The keras.layers.Embedding has the following configurations:
keras.layers.Embedding(input_dim, output_dim, embeddings_initializer='uniform', embeddings_regularizer=None, activity_regularizer=None, embeddings_constraint=None, mask_zero=False, input_length=None)
Turns positive integers (indexes) into dense vectors of fixed size. eg. [[4], [20]] -> [[0.25, 0.1], [0.6, -0.2]]
This layer can only be used as the first layer in a model.
The input_dim is the vocabulary size + 1. The vocabulary is the corpus of all the words used in the dataset. The input_length is the length of the input sequences, whereas output_dim is the dimensionality of the output vectors (the dimensions of the vector for a particular word).
The layer can also be used with pretrained word embeddings like Word2Vec or GloVe.
Are they suitable for my use case?
Absolutely, yes. For a task like yours, if we could generate a context (embedding) for a particular word, then we could definitely improve the model's performance.
How can I use them in my use case?
Follow the steps:
You need to tokenize the sentences. Maybe with keras.preprocessing.text.Tokenizer.
Pad the sequences to a fixed length using keras.preprocessing.sequence.pad_sequences. This will be the input_length parameter for the Embedding layer.
Initialize the model with Embedding layer as the first layer.
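A minimal sketch of those steps (the example sentences and the layer sizes are placeholders, not tuned values):
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense

texts = ["A B F", "E C A", "B E G"]            # placeholder sentences

tokenizer = Tokenizer()
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)
maxlen = max(len(s) for s in sequences)
x = pad_sequences(sequences, maxlen=maxlen)    # shape (samples, maxlen)

vocab_size = len(tokenizer.word_index) + 1     # +1 because index 0 is reserved for padding

model = Sequential()
model.add(Embedding(input_dim=vocab_size, output_dim=32, input_length=maxlen))
model.add(LSTM(64, activation='tanh'))
model.add(Dense(vocab_size, activation='softmax'))
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam')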
Hope this helps.

Understanding stateful LSTM [closed]

I'm going through this tutorial on RNNs/LSTMs and I'm having quite a hard time understanding stateful LSTMs. My questions are as follows :
1. Training batching size
In the Keras docs on RNNs, I found out that the hidden state of the sample in i-th position within the batch will be fed as input hidden state for the sample in i-th position in the next batch. Does that mean that if we want to pass the hidden state from sample to sample we have to use batches of size 1 and therefore perform online gradient descent? Is there a way to pass the hidden state within a batch of size >1 and perform gradient descent on that batch ?
2. One-Char Mapping Problems
In the tutorial's paragraph 'Stateful LSTM for a One-Char to One-Char Mapping' we're given code that uses batch_size = 1 and stateful = True to learn to predict the next letter of the alphabet given a letter of the alphabet. In the last part of the code (line 53 to the end of the complete code), the model is tested starting with a random letter ('K') and predicts 'B'; then given 'B' it predicts 'C', etc. It seems to work well except for 'K'. However, I tried the following tweak to the code (the last part only; I kept lines 52 and above):
# demonstrate a random starting point
letter1 = "M"
seed1 = [char_to_int[letter1]]
x = numpy.reshape(seed1, (1, len(seed1), 1))
x = x / float(len(alphabet))
prediction = model.predict(x, verbose=0)
index = numpy.argmax(prediction)
print(int_to_char[seed1[0]], "->", int_to_char[index])
letter2 = "E"
seed2 = [char_to_int[letter2]]
seed = seed2
print("New start: ", letter1, letter2)
for i in range(0, 5):
    x = numpy.reshape(seed, (1, len(seed), 1))
    x = x / float(len(alphabet))
    prediction = model.predict(x, verbose=0)
    index = numpy.argmax(prediction)
    print(int_to_char[seed[0]], "->", int_to_char[index])
    seed = [index]
model.reset_states()
and these outputs:
M -> B
New start: M E
E -> C
C -> D
D -> E
E -> F
It looks like the LSTM did not learn the alphabet but just the positions of the letters, and that regardless of the first letter we feed in, the LSTM will always predict B since it's the second letter, then C and so on.
Therefore, how does keeping the previous hidden state as the initial hidden state of the current batch help with learning, given that at test time, if we start with the letter 'K' for example, letters A to J will not have been fed in before and the initial hidden state won't be the same as during training?
3. Training an LSTM on a book for sentence generation
I want to train my LSTM on a whole book to learn how to generate sentences and perhaps learn the author's style too. How can I naturally train my LSTM on that text (input the whole text and let the LSTM figure out the dependencies between the words) instead of having to 'artificially' create batches of sentences from that book myself to train my LSTM on? I believe stateful LSTMs could help, but I'm not sure how.
Having a stateful LSTM in Keras means that a Keras variable will be used to store and update the state, and in fact you could check the value of the state vector(s) at any time (that is, until you call reset_states()). A non-stateful model, on the other hand, will use an initial zero state every time it processes a batch, so it is as if you always called reset_states() after train_on_batch, test_on_batch and predict_on_batch. The explanation about the state being reused for the next batch on stateful models is just about that difference with non-stateful; of course the state will always flow within each sequence in the batch and you do not need to have batches of size 1 for that to happen. I see two scenarios where stateful models are useful:
You want to train on split sequences of data because these are very long and it would not be practical to train on their whole length.
On prediction time, you want to retrieve the output for each time point in the sequence, not just at the end (either because you want to feed it back into the network or because your application needs it). I personally do that in the models that I export for later integration (which are "copies" of the training model with batch size of 1).
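As an illustration of the second scenario, a minimal sketch (the layer size, feature count and chunking are arbitrary assumptions): with stateful=True the state persists across predict calls until you reset it, so a long sequence can be fed in pieces.
import numpy as np
from keras.models import Sequential
from keras.layers import LSTM, Dense

pred_model = Sequential()
pred_model.add(LSTM(32, stateful=True, batch_input_shape=(1, None, 5)))
pred_model.add(Dense(1))
# pred_model.set_weights(trained_model.get_weights())   # hypothetical trained model

pred_model.reset_states()
for chunk in np.split(np.random.rand(1, 100, 5), 10, axis=1):
    out = pred_model.predict(chunk)   # the state carries over from one chunk to the next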
I agree that the example of an RNN for the alphabet does not really seem very useful in practice; it will only work when you start with the letter A. If you want to learn to reproduce the alphabet starting at any letter, you would need to train the network with that kind of example (subsequences or rotations of the alphabet). But I think a regular feed-forward network could learn to predict the next letter of the alphabet training on pairs like (A, B), (B, C), etc. I think the example is meant for demonstrative purposes more than anything else.
You have probably already read it, but the popular post The Unreasonable Effectiveness of Recurrent Neural Networks shows some interesting results along the lines of what you want to do (although it does not really dive into implementation specifics). I don't have personal experience training RNNs with textual data, but there are a number of approaches you can research. You can build character-based models (like the ones in the post), where you input and receive one character at a time. A more advanced approach is to do some preprocessing on the texts and transform them into sequences of numbers; Keras includes some text preprocessing functions to do that. Having one single number as the feature space is probably not going to work all that well, so you could simply turn each word into a vector with one-hot encoding or, more interestingly, have the network learn the best vector representation for each word, which is what they call an embedding. You can go even further with the preprocessing and look into something like NLTK, especially if you want to remove stop words, punctuation and things like that. Finally, if you have sequences of different sizes (e.g. you are using full texts instead of excerpts of a fixed size, which may or may not be important for you) you will need to be a bit more careful and use masking and/or sample weighting. Depending on the exact problem, you can set up the training accordingly. If you want to learn to generate similar text, the "Y" would be similar to the "X" (one-hot encoded), only shifted by one (or more) positions (in this case you may need to use return_sequences=True and TimeDistributed layers). If you want to determine the author, your output could be a softmax Dense layer.
Hope that helps.

What is out of bag error in Random Forests? [closed]

What is out of bag error in Random Forests?
Is it the optimal parameter for finding the right number of trees in a Random Forest?
I will take an attempt to explain:
Suppose our training data set is represented by T, and suppose the data set has M features (or attributes or variables).
T = {(X1,y1), (X2,y2), ... (Xn, yn)}
and
Xi is input vector {xi1, xi2, ... xiM}
yi is the label (or output or class).
Summary of RF:
The Random Forests algorithm is a classifier based primarily on two methods -
Bagging
Random subspace method.
Suppose we decide to have S trees in our forest; then we first create S datasets of the "same size as the original", created by random resampling of the data in T with replacement (n times for each dataset). This will result in {T1, T2, ... TS} datasets. Each of these is called a bootstrap dataset. Due to resampling "with replacement", every dataset Ti can have duplicate data records, and Ti can be missing several data records from the original dataset. This is called Bootstrapping. (en.wikipedia.org/wiki/Bootstrapping_(statistics))
Bagging is the process of taking bootstraps & then aggregating the models learned on each bootstrap.
Now, RF creates S trees and uses m (= sqrt(M) or = floor(ln M + 1)) random subfeatures out of M possible features to create each tree. This is called the random subspace method.
So for each Ti bootstrap dataset you create a tree Ki. If you want to classify some input data D = {x1, x2, ..., xM} you let it pass through each tree and produce S outputs (one for each tree) which can be denoted by Y = {y1, y2, ..., ys}. Final prediction is a majority vote on this set.
Out-of-bag error:
After creating the classifiers (S trees), for each (Xi,yi) in the original training set T, select all Tk which do not include (Xi,yi). This subset, pay attention, is a set of bootstrap datasets which do not contain a particular record from the original dataset. This set is called the out-of-bag examples. There are n such subsets (one for each data record in the original dataset T). The OOB classifier is the aggregation of votes ONLY over trees Tk such that Tk does not contain (Xi,yi).
The out-of-bag estimate for the generalization error is the error rate of the out-of-bag classifier on the training set (comparing it with the known yi's).
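You rarely have to compute this by hand; for example, scikit-learn exposes the OOB estimate directly (a minimal sketch):
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
forest = RandomForestClassifier(n_estimators=100, oob_score=True, bootstrap=True)
forest.fit(X, y)
print(forest.oob_score_)   # accuracy of the OOB predictions; OOB error = 1 - oob_score_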
Why is it important?
The study of error estimates for bagged classifiers in Breiman [1996b] gives empirical evidence to show that the out-of-bag estimate is as accurate as using a test set of the same size as the training set. Therefore, using the out-of-bag error estimate removes the need for a set-aside test set.
(Thanks #Rudolf for corrections. His comments below.)
In Breiman's original implementation of the random forest algorithm, each tree is trained on about 2/3 of the total training data. As the forest is built, each tree can thus be tested (similar to leave one out cross validation) on the samples not used in building that tree. This is the out of bag error estimate - an internal error estimate of a random forest as it is being constructed.

Of Ways to Count the Limitless Primes [closed]

Alright, so maybe I shouldn't have shrunk this question sooo much... I have seen the post on the most efficient way to find the first 10000 primes. I'm looking for all possible ways. The goal is to have a one stop shop for primality tests. Any and all tests people know for finding prime numbers are welcome.
And so:
What are all the different ways of finding primes?
Some prime tests only work with certain numbers, for instance, the Lucas–Lehmer test only works for Mersenne numbers.
Most prime tests used for big numbers can only tell you that a certain number is "probably prime" (or, if the number fails the test, it is definitely not prime). Usually you can continue the algorithm until you have a very high probability of a number being prime.
Have a look at this page and especially its "See Also" section.
The Miller-Rabin test is, I think, one of the best tests. In its standard form it gives you probable primes - though it has been shown that if you apply the test to a number below 3.4*10^14 and it passes the test for each of the bases 2, 3, 5, 7, 11, 13 and 17, it is definitely prime.
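A hedged Python sketch of that deterministic variant (the bound and the base set are the ones quoted above):
def is_probable_prime(n):
    # deterministic for n < 3.4 * 10**14 with the bases below
    if n < 2:
        return False
    small_primes = (2, 3, 5, 7, 11, 13, 17)
    for p in small_primes:
        if n % p == 0:
            return n == p
    d, r = n - 1, 0
    while d % 2 == 0:            # write n - 1 as d * 2**r with d odd
        d //= 2
        r += 1
    for a in small_primes:
        x = pow(a, d, n)
        if x in (1, n - 1):
            continue
        for _ in range(r - 1):
            x = pow(x, 2, n)
            if x == n - 1:
                break
        else:
            return False         # a witnesses that n is composite
    return True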
The AKS test was the first deterministic, proven, general, polynomial-time test. However, to the best of my knowledge, its best implementation turns out to be slower than other tests unless the input is ridiculously large.
For a given integer, the fastest primality check I know is:
Take a list of 2 to the square root of the integer.
Loop through the list, taking the remainder of the integer / current number
If the remainder is zero for any number in the list, then the integer is not prime.
If the remainder was non-zero for all numbers in the list, then the integer is prime.
It uses significantly less memory than The Sieve of Eratosthenes and is generally faster for individual numbers.
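For example, a direct Python version of that check:
import math

def is_prime(n):
    # trial division: test every candidate divisor from 2 up to sqrt(n)
    if n < 2:
        return False
    for divisor in range(2, math.isqrt(n) + 1):
        if n % divisor == 0:
            return False         # a zero remainder means n is composite
    return True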
The Sieve of Eratosthenes is a decent algorithm:
Take the list of positive integers 2 to any given Ceiling.
Take the next item in the list (2 in the first iteration) and remove all multiples of it (beyond the first) from the list.
Repeat step two until you reach the given Ceiling.
Your list is now composed purely of primes.
There is a functional limit to this algorithm in that it exchanges speed for memory. When generating very large lists of primes the memory capacity needed skyrockets.
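A small Python sketch of the sieve:
def sieve_of_eratosthenes(ceiling):
    # returns all primes up to and including ceiling
    is_prime = [True] * (ceiling + 1)
    is_prime[0:2] = [False, False]
    for n in range(2, int(ceiling ** 0.5) + 1):
        if is_prime[n]:
            for multiple in range(n * n, ceiling + 1, n):
                is_prime[multiple] = False   # cross off multiples of n beyond n itself
    return [n for n, prime in enumerate(is_prime) if prime]

print(sieve_of_eratosthenes(30))   # [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]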
#akdom's question to me:
Looping would work fine on my previous suggestion, and you don't need to do any calculations to determine if a number is even; in your loop, simply skip every even number, as shown below:
//Assuming theInteger is the number to be tested for primality.
//First check whether theInteger is divisible by 2. If not, run this loop.
//This loop skips all even divisors.
for( int i = 3; i <= sqrt(theInteger); i += 2)
{
    if( theInteger % i == 0)
    {
        //getting here means that theInteger is not prime
        //somehow indicate that some number, i, divides it and break
        break;
    }
}
A Rutgers grad student recently found a recurrence relation that generates primes. The differences between its successive terms are either primes or 1's.
a(1) = 7
a(n) = a(n-1) + gcd(n,a(n-1)).
It makes a lot of crap that needs to be filtered out. Benoit Cloitre also has this recurrence that does a similar task:
b(1) = 1
b(n) = b(n-1) + lcm(n,b(n-1))
then the ratio of successive numbers, minus one [b(n)/b(n-1)-1] is prime. A full account of all this can be read at Recursivity.
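A small sketch of the first recurrence, filtering the 1's out of the differences as described:
from math import gcd

a = [7]                                    # a(1) = 7
for n in range(2, 200):
    a.append(a[-1] + gcd(n, a[-1]))        # a(n) = a(n-1) + gcd(n, a(n-1))

diffs = [a[i] - a[i - 1] for i in range(1, len(a))]
primes = [d for d in diffs if d != 1]      # each difference is either 1 or a prime
print(primes[:10])                         # [5, 3, 11, 3, 23, 3, 47, 3, 5, 3]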
For the sieve, you can do better by using a wheel instead of incrementing by one each time; check out the Improved Incremental Prime Number Sieves. Here is an example of a wheel: if we skip the multiples of 2 and 5, the gaps between the remaining candidates repeat as [2, 4, 2, 2].
In your algorithm using the list from 2 to the root of the integer, you can improve performance by only testing odd numbers after 2. That is, your list only needs to contain 2 and all odd numbers from 3 to the square root of the integer. This cuts the number of times you loop in half without introducing any more complexity.
#theprise
If I were wanting to use an incrementing loop instead of an instantiated list (problems with memory for massive numbers...), what would be a good way to do that without building the list?
It doesn't seem like it would be cheaper to do a divisibility check for the given integer (X % 3) than just the check for the normal number (N % X).
If you're wanting to find a way of generating prime numbers, this has been covered in a previous question.