I have trained a deep neural network for speaker recognition (trained on 64 different speakers). Next, I want to add or delete a speaker from the model. Can anyone help me with the coding part of how to do this, as I am new to voice recognition? Any research paper that someone knows of would also be helpful.
P.S. If I use a new dataset with the pre-trained model, then I would need to train the model again on a new set of 64 speakers. Considering I just want to add or delete 1 or 2 speakers, how can that be achieved?
One way you can achieve this is to measure similarity (as is often done in speaker verification) instead of using the logits layer you trained on the initial dataset (with 64 speakers).
When feeding an input audio clip to the speaker recognition model, you can take the hidden-layer values right before the logits layer and use them as an utterance-level feature (let's call this the speaker embedding vector).
Let us say that you have a new dataset with M utterances and N speakers (disjoint from the initial training set).
From this dataset, you can extract M embedding vectors using your pre-trained network.
By averaging the embedding vectors belonging to the same speaker, you will get N speaker-specific embedding vectors. We will call these the enrolled vectors.
Then, to test a new speech sample, you simply have to extract the embedding vector from the test speech and compare its similarity with the N enrolled vectors (cosine similarity is usually used for speaker verification).
For P test utterances, this will give you a [P x N] matrix. For each test utterance, you can select the speaker with the highest similarity to perform speaker identification.
By doing this, you can perform speaker identification for speakers not included in the original training set, without re-training the network you have trained.
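To make the enrolment and scoring concrete, here is a minimal NumPy sketch. It assumes a hypothetical embed() helper (not from any particular library) that runs your pretrained network on one utterance and returns the pre-logits embedding:

import numpy as np

# Hypothetical helper: embed(utterance) should run the pretrained 64-speaker
# network and return the hidden-layer values right before the logits layer
# as a 1-D NumPy array.

def enroll(utterances, speaker_ids, embed):
    # Average the embeddings of each speaker's utterances -> one enrolled vector per speaker.
    enrolled = {}
    for spk in set(speaker_ids):
        vecs = [embed(u) for u, s in zip(utterances, speaker_ids) if s == spk]
        enrolled[spk] = np.mean(vecs, axis=0)
    return enrolled

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def identify(test_utterance, enrolled, embed):
    # Return the enrolled speaker whose vector is most similar to the test embedding.
    e = embed(test_utterance)
    return max(enrolled, key=lambda spk: cosine(e, enrolled[spk]))

With this setup, adding a speaker just means enrolling one more averaged vector, and deleting a speaker means dropping their entry; the trained network itself stays untouched.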
If you wish to learn some classical/popular methods used for speaker recognition, you can check the following paper out:
J. H. L. Hansen and T. Hasan, "Speaker recognition by machines and humans: a tutorial review," IEEE Signal Processing Magazine, vol. 32, no. 6, 2015.
I am reading this paper - "Review Spam Detection Using Word Embeddings and Deep Neural Networks" (paywall link) - and here they talk about how they combined n-gram and skip-gram features of the text before feeding them to a feed-forward network. Here is the architecture (figure from the paper).
A brief description of the dataset:
no. of documents = 1600
dimension of skip-gram model = 500
no. of n-gram features (uni-, bi-, trigram) = 2000
The picture shows that the skip-gram and n-gram features were combined before they were sent as input to the feed-forward network.
Let's suppose the shape of the skip-gram features is (no. of documents, skip-gram dimension) and the shape of the n-gram features is (no. of documents, no. of n-gram features).
My question is: what does it mean to combine two different kinds of features like skip-gram and n-gram? Does it mean concatenation, and if so, along which axis do you combine those features?
The size of the word vectors (embeddings) was set to 500 and context size c = 5 [7]
to generate a complex representation. The average values of the vector were used to
represent each review. Thus, the input attributes (features) for the subsequent supervised
learning included 2000 n-grams and 500 embeddings.
Deep feedforward neural network (DNN) was used to classify reviews into
spam/legitimate categories.
I hope I have explained it well this time.
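For what it's worth, if "combined" does mean concatenation, a minimal NumPy sketch with the shapes from the question (the arrays are random placeholders for the real features) would be:

import numpy as np

# Placeholder arrays with the shapes described in the question.
skipgram_features = np.random.rand(1600, 500)   # (no. of documents, skip-gram dimension)
ngram_features = np.random.rand(1600, 2000)     # (no. of documents, no. of n-gram features)

# Concatenating along the feature axis (axis=1) gives each document
# 500 + 2000 = 2500 input attributes for the feed-forward network.
combined = np.concatenate([skipgram_features, ngram_features], axis=1)
print(combined.shape)  # (1600, 2500)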
I'm trying to create vectors for categorical information that I have at hand. This information is intended to be used to aid a seq2seq network for NLP purposes (like summarization).
To get the idea, maybe an example would be of help:
Sample Text: shark attacks off Florida in a 1-hour span
And suppose that we have this hypothetical categorical information:
1. [animal, shark, sea, ocean]
2. [animal, tiger, jungle, mountains]
...
19. [animal, eagle, sky, mountains]
I want to feed the sample text to an LSTM network token by token (as in seq2seq networks). I'm using pre-trained GloVe embeddings as the original embeddings fed into the network, but I also want to concatenate a dense vector to each token denoting its category.
For now, I know that I can simply use one-hot encodings (0-1 binary vectors). So, for example, the first input (for shark) to the RNN network would be:
# GloVe embeddings of shark + one-hot encoding for shark, + means concatenation
[-0.323 0.213 ... -0.134 0.934 0.031 ] + [1 0 0 0 0 ... 0 0 1]
The problem is that I have an extreme number of categories (around 20,000). After searching the internet, it seems that people suggest using word2vec instead of one-hot vectors. But I can't grasp the underlying idea of how word2vec can represent the categorical features in this case. Does anybody have a clearer idea?
Word2Vec on its own can't be used for classification; it is just the underlying algorithm.
For classification you can use Doc2Vec or something similar.
It basically takes a list of documents, each with a unique id assigned to it. After training, it has built relations between the documents similar to those which word2vec builds between words. Then, when you give it an unknown document, it will tell you the top n most similar documents, and if your documents have previously defined tags you can assume that the unknown document can be labelled the same way.
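As a rough sketch of that idea (assuming gensim's Doc2Vec, with each category's word list from the question treated as a tiny "document"; the data here are placeholders for the real ~20,000 categories):

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Toy corpus: one "document" per category, tagged with a unique id.
categories = [
    ["animal", "shark", "sea", "ocean"],
    ["animal", "tiger", "jungle", "mountains"],
    ["animal", "eagle", "sky", "mountains"],
]
docs = [TaggedDocument(words=words, tags=[str(i)]) for i, words in enumerate(categories)]

model = Doc2Vec(vector_size=50, min_count=1, epochs=40)
model.build_vocab(docs)
model.train(docs, total_examples=model.corpus_count, epochs=model.epochs)

# Infer a vector for an unseen word list and find the most similar known categories.
vec = model.infer_vector(["shark", "attacks", "ocean"])
print(model.dv.most_similar([vec], topn=2))  # use model.docvecs in gensim < 4.0

The inferred category vectors could then be concatenated to the GloVe token embeddings in place of the 20,000-dimensional one-hot vectors.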
I am exploring deep learning methods, especially LSTMs, to predict the next word. Suppose my dataset is like this: each data point consists of 7 features (7 different words, A-G here) of different lengths.
Group1 Group2............ Group 38
A B F
E C A
B E G
C D G
C F F
D G G
. . .
. . .
I used one-hot encoding for the input layer. Here is the model:
from keras.layers import Input, LSTM, Dropout, Dense, Activation
from keras.models import Model

# action_count = size of the one-hot vocabulary
main_input = Input(shape=(None, action_count), name='main_input')
lstm_out = LSTM(units=64, activation='tanh')(main_input)
lstm_out = Dropout(0.2)(lstm_out)
lstm_out = Dense(action_count)(lstm_out)
main_output = Activation('softmax')(lstm_out)
model = Model(inputs=[main_input], outputs=main_output)
print(model.summary())
Using this model, I got an accuracy of about 60%.
My question is: how can I use an embedding layer for my problem? Actually, I do not know much about embeddings (why, when, and how they work); I only know that a one-hot vector does not carry much information. I am wondering if an embedding can improve accuracy. If someone can provide guidance on this, it would be greatly beneficial for me (at least whether using an embedding is logical or not for my case).
What are Embedding layers?
They are layers which convert positive integers (such as word indices) into fixed-size dense vectors. They learn the so-called embeddings for a particular text dataset (in NLP tasks).
Why are they useful?
Embedding layers gradually learn the relationships between words. Hence, if you have a large enough corpus (which probably contains all possible English words), then vectors for words like "king" and "queen" will show some similarity in the multidimensional space of the embedding.
How are they used in Keras?
The keras.layers.Embedding layer has the following configuration:
keras.layers.Embedding(input_dim, output_dim, embeddings_initializer='uniform', embeddings_regularizer=None, activity_regularizer=None, embeddings_constraint=None, mask_zero=False, input_length=None)
Turns positive integers (indexes) into dense vectors of fixed size. eg. [[4], [20]] -> [[0.25, 0.1], [0.6, -0.2]]
This layer can only be used as the first layer in a model.
The input_dim is the vocabulary size + 1, where the vocabulary is the set of all words used in the dataset. The input_length is the length of the input sequences, whereas output_dim is the dimensionality of the output vectors (the dimensions of the vector for a particular word).
The layer can also be used with pretrained word embeddings like Word2Vec or GloVe.
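For example, one common pattern (a sketch with toy placeholder data, not the only way to do it) is to build a weight matrix from the pretrained vectors and hand it to the layer:

import numpy as np
from keras.layers import Embedding

# Toy placeholders: word_index would come from a fitted Tokenizer, and
# glove_vectors from a real GloVe file loaded into a dict.
word_index = {"king": 1, "queen": 2}
embedding_dim = 100
glove_vectors = {w: np.random.rand(embedding_dim) for w in word_index}

embedding_matrix = np.zeros((len(word_index) + 1, embedding_dim))
for word, i in word_index.items():
    if word in glove_vectors:
        embedding_matrix[i] = glove_vectors[word]

embedding_layer = Embedding(input_dim=len(word_index) + 1,
                            output_dim=embedding_dim,
                            weights=[embedding_matrix],  # initialise with the pretrained vectors
                            trainable=False)             # freeze them if desired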
Are they suitable for my use case?
Absolutely, yes. For your next-word prediction task, if we can generate a context (embedding) for a particular word, then we can definitely increase the model's efficiency.
How can I use them in my use case?
Follow the steps:
You need to tokenize the sentences. Maybe with keras.preprocessing.text.Tokenizer.
Pad the sequences to a fixed length using keras.preprocessing.sequence.pad_sequences. This fixed length will be the input_length parameter for the Embedding layer.
Initialize the model with an Embedding layer as the first layer, as in the sketch below.
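Putting these steps together for your data, a minimal sketch might look like the following (the toy sequences, embedding size, and other numbers are placeholders, not tuned values):

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Input, Embedding, LSTM, Dropout, Dense
from keras.models import Model

# Toy stand-ins for your word sequences (one string per sequence).
texts = ["A E B C C D", "B C E D F G", "F A G G F G"]

tokenizer = Tokenizer()
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)    # words -> integer indices

max_len = max(len(s) for s in sequences)
padded = pad_sequences(sequences, maxlen=max_len)  # this would be the x input to model.fit
vocab_size = len(tokenizer.word_index) + 1         # input_dim for the Embedding layer

# The input is now a sequence of integer indices rather than one-hot vectors.
main_input = Input(shape=(max_len,), name='main_input')
x = Embedding(input_dim=vocab_size, output_dim=32, input_length=max_len)(main_input)
x = LSTM(units=64, activation='tanh')(x)
x = Dropout(0.2)(x)
main_output = Dense(vocab_size, activation='softmax')(x)

model = Model(inputs=[main_input], outputs=main_output)
print(model.summary())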
Hope this helps.
I am trying to construct an RNN to predict the probability of a player playing a match, along with the runs scored and wickets taken by the player. I would use an LSTM so that performance in the current match influences the player's future selection.
Architecture summary:
Input features: Match details - Venue, teams involved, team batting first
Input samples: Player roster of both teams.
Output:
Discrete: Binary: Did the player play.
Discrete: Wickets taken.
Continuous: Runs scored.
Continuous: Balls bowled.
Question:
Most often an RNN uses "Softmax" or "MSE" in the final layers to process the activation "a" from the LSTM, providing only a single variable "Y" as output. But here there are four dependent variables (2 discrete and 2 continuous). Is it possible to stitch together all four as output variables?
If yes, how do we handle the mix of continuous and discrete outputs in the loss function?
(Though the output "a" from the LSTM has multiple features and carries information to the next time step, we need multiple features at the output for training against the ground truth.)
You just do it. Without more detail on the software (if any) in use, it is hard to be more specific.
The output of the LSTM unit at every time step is one of the hidden layers of your network.
You can then feed it into 4 output layers:
1. Sigmoid (for the binary did-the-player-play output).
2. I'd experiment with this a bit. Maybe 4x sigmoid (4 wickets to an innings, right?), or ReLU.
3, 4. Linear (squaring is also an option, or ReLU).
For training purposes your loss function is the sum of your 4 individual losses.
If they were all MSE, you could concatenate your 4 outputs before calculating the loss.
But since the first is cross-entropy (for a decision sigmoid), you would calculate the losses separately and sum them.
You can still concatenate them afterwards to have one output vector.
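For example, in Keras this could look roughly like the sketch below (layer sizes, names, and input dimensions are placeholders; Keras sums the per-output losses during training):

from keras.layers import Input, LSTM, Dense
from keras.models import Model

timesteps, n_features = 10, 16   # placeholder input dimensions

match_input = Input(shape=(timesteps, n_features))
h = LSTM(64)(match_input)

played = Dense(1, activation='sigmoid', name='played')(h)  # binary: did the player play
wickets = Dense(1, activation='relu', name='wickets')(h)    # non-negative wicket count
runs = Dense(1, activation='linear', name='runs')(h)        # continuous: runs scored
balls = Dense(1, activation='linear', name='balls')(h)      # continuous: balls bowled

model = Model(inputs=match_input, outputs=[played, wickets, runs, balls])
model.compile(optimizer='adam',
              loss={'played': 'binary_crossentropy',
                    'wickets': 'mse', 'runs': 'mse', 'balls': 'mse'})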
After going through the Caffe tutorial here: http://caffe.berkeleyvision.org/gathered/examples/mnist.html
I am really confused about the different (and efficient) model used in this tutorial, which is defined here: https://github.com/BVLC/caffe/blob/master/examples/mnist/lenet_train_test.prototxt
As I understand it, a Convolution layer in Caffe simply computes Wx+b for each input, without applying any activation function. If we would like to add an activation function, we should add another layer immediately after that convolution layer, such as a Sigmoid, TanH, or ReLU layer. Every paper/tutorial I have read on the internet applies an activation function to the neuron units.
This leaves me with a big question mark, as we can only see Convolution and Pooling layers interleaved in the model. I hope someone can give me an explanation.
As a side note, another doubt of mine is the max_iter in this solver:
https://github.com/BVLC/caffe/blob/master/examples/mnist/lenet_solver.prototxt
We have 60,000 images for training and 10,000 images for testing. So why is max_iter here only 10,000 (and it can still get a >99% accuracy rate)? What does Caffe do in each iteration?
Actually, I'm not so sure whether the accuracy rate is the total number of correct predictions divided by the test set size.
I'm very amazed by this example, as I haven't found any other example or framework that can achieve such a high accuracy rate in such a short time (only 5 minutes to get a >99% accuracy rate). Hence, I suspect there is something I have misunderstood.
Thanks.
Caffe uses batch processing. max_iter is 10,000 because batch_size is 64, and each iteration processes one batch. Number of epochs = (batch_size x max_iter) / number of training samples = (64 x 10,000) / 60,000 ≈ 10.7, so the number of epochs is nearly 10. The accuracy is calculated on the test data. And yes, the accuracy of the model is indeed >99%, as the dataset is not very complicated.
For your question about the missing activation layers, you are correct. The model in the tutorial is missing activation layers. This seems to be an oversight of the tutorial. For the real LeNet-5 model, there should be activation functions following the convolution layers. For MNIST, the model still works surprisingly well without the additional activation layers.
For reference, LeCun's 1998 LeNet-5 paper states:
As in classical neural networks, units in layers up to F6 compute a dot product between their input vector and their weight vector, to which a bias is added. This weighted sum, denoted a_i, for unit i, is then passed through a sigmoid squashing function to produce the state of unit i ...
F6 is the "blob" between the two fully connected layers. Hence the first fully connected layers should have an activation function applied (the tutorial uses ReLU activation functions instead of sigmoid).
MNIST is the "hello world" example for neural networks. It is very simple by today's standards. A single fully connected layer can solve the problem with an accuracy of about 92%. LeNet-5 is a big improvement over that simple baseline.