Where to start with speech synthesis - speech-synthesis

You may be familiar with Google's text-to-speech (TTS) engine.
I have a basic understanding of how something like that is able to analyze the input and pick out different syllables/parts of speech, but where would I start if I wanted to create a "voice" for a TTS system?

That's a question that I spent nearly a semester in college learning the answer to, and a year (or more) of classes beforehand to learn the underlying signal processing required to understand the process. Whole classes are devoted to speech synthesis, and whole curriculums to signal processing.
One can think of the human vocal tract as a filter, and the glottis as an impulse generator—that is, speech is actually the result of an impulse train filtered by the vocal tract, mouth, and nasal cavity.
For every phoneme, the "filter" will be different, so you will need a library of phonemes to generate "filters" for. Theoretically, inverse filtering could be used on a library of phoneme sound clips to find "filter" coefficients. The Levinson-Durbin recursion is often used to find LPC coefficients.
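To make that inverse-filtering step concrete, here is a minimal sketch of the Levinson-Durbin recursion applied to the autocorrelation of a short phoneme clip; the file name 'phoneme.wav', the 30 ms frame, and the model order are assumptions for illustration only, not part of any particular TTS system.

# Minimal sketch: estimate LPC "filter" coefficients for one phoneme clip
# via the Levinson-Durbin recursion (file name and order are assumptions).
import numpy as np
from scipy.io import wavfile

def levinson_durbin(r, order):
    """Solve the Yule-Walker equations for the prediction-error filter A(z)."""
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i]
        for j in range(1, i):
            acc += a[j] * r[i - j]
        k = -acc / err                    # reflection coefficient
        new_a = a.copy()
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= (1.0 - k * k)              # prediction error shrinks each step
    return a, err

sr, clip = wavfile.read('phoneme.wav')    # hypothetical phoneme recording
clip = clip.astype(float)
if clip.ndim > 1:
    clip = clip[:, 0]                     # take one channel if the file is stereo
frame = clip[: int(0.03 * sr)]            # analyze a single 30 ms frame
order = 12                                # typical LPC order for 8-16 kHz speech
r = np.correlate(frame, frame, mode='full')[len(frame) - 1: len(frame) + order]
a, gain = levinson_durbin(r, order)       # a[0] == 1; the synthesis filter is 1/A(z)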
A glottal pulse train must be created. A simple way to do this is to convolve a pulse train with a positive half-sine wave.
Finally, filter the glottal pulse train with the "filter" coefficients associated with the phoneme you wish to create.
But that's only for voiced speech. In order to generate unvoiced speech, a simple solution is to filter a random noise signal with "filter" coefficients associated with unvoiced speech phonemes.
One layer of abstraction above that, create a list of phonemes needed, and concatenate. Simple as pie!
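Putting the pieces together, here is a minimal sketch of the whole voiced/unvoiced pipeline. The "filter" coefficients below are stand-in formant resonators rather than real phoneme filters, and the pitch, durations, and phoneme list are arbitrary assumptions.

# Minimal sketch of the source-filter pipeline described above.
# The coefficient arrays below are placeholder resonators, NOT real phoneme filters.
import numpy as np
from scipy.signal import lfilter

SR = 16000  # assumed sample rate

def glottal_source(f0, duration):
    """Impulse train convolved with a positive half-sine pulse."""
    n = int(SR * duration)
    period = int(SR / f0)
    impulses = np.zeros(n)
    impulses[::period] = 1.0
    pulse_len = max(2, period // 3)                 # assumed pulse width
    half_sine = np.sin(np.pi * np.arange(pulse_len) / pulse_len)
    return np.convolve(impulses, half_sine)[:n]

def resonator(freq_hz, bandwidth_hz=100.0):
    """Stable all-pole placeholder standing in for one phoneme's LPC filter."""
    r = np.exp(-np.pi * bandwidth_hz / SR)
    w = 2.0 * np.pi * freq_hz / SR
    return np.array([1.0, -2.0 * r * np.cos(w), r * r])

# Hypothetical phoneme "library": (coefficients, voiced?, duration in seconds)
phonemes = [
    (resonator(700.0), True, 0.15),    # stand-in for a vowel-like sound
    (resonator(2500.0), False, 0.10),  # stand-in for a fricative-like sound
    (resonator(400.0), True, 0.20),
]

pieces = []
for a, voiced, dur in phonemes:
    if voiced:
        excitation = glottal_source(f0=120.0, duration=dur)   # assumed pitch
    else:
        excitation = np.random.randn(int(SR * dur))           # noise for unvoiced speech
    pieces.append(lfilter([1.0], a, excitation))              # all-pole filter 1/A(z)

speech = np.concatenate(pieces)
speech /= np.max(np.abs(speech))   # normalize before writing to a wav file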
UPDATE:
A friend pointed out Festival, a "black box" to input text and get speech out: http://festvox.org/festival/

Related

Multi-attention based supervised Feature Selection in Multivariate time series

I have been working on a multivariate time series problem. The dataset has at least 40 different factors. I tried to select only the appropriate features before training the model. I came across a paper called "A Multiattention-Based Supervised Feature Selection Method for Multivariate Time Series". The link to the paper: https://www.hindawi.com/journals/cin/2021/6911192/
The paper looks promising; however, I could not find an implementation of it. I would like to know if anyone has come across a similar paper and knows how to implement the architecture described in it.
If not, I want to know alternate methods to find only the appropriate features for my multivariate time series before training the model.

How to input audio data into deep learning algorithm?

I'm very new to deep learning, and I'm aiming to use a GAN (Generative Adversarial Network) to recognize emotional speech. I've only seen images used as inputs to most deep learning algorithms, such as GANs, but I'm curious how audio data can be used as input, besides using images of spectrograms. Also, I'd appreciate it if you can explain it in layman's terms.
Audio data can be represented in the form of numpy arrays, but before moving to that you must understand what audio really is. Audio is nothing but a wave-like signal, where the amplitude changes with respect to time.
Assuming that our audio is represented in the time domain, we extract amplitude values at regular intervals; the number of values taken per second is the sampling rate.
Converting the data into the frequency domain can reduce the amount of computation required.
Now, let's load the data. We'll use a library called librosa, which can be installed using pip.
import librosa
import librosa.display

data, sampling_rate = librosa.load('audio.wav')
Now, you have both the data and the sampling rate. We can plot the waveform now.
librosa.display.waveplot(data, sr=sampling_rate)
Now, you have the audio data in form of numpy array. You can now study the features of the data and extract the ones you find interesting to train your models.
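For instance, MFCCs are a common starting point; here is a minimal sketch (the file name and the number of coefficients are just assumptions):

import librosa
import numpy as np

data, sampling_rate = librosa.load('audio.wav')                     # same call as above
mfcc = librosa.feature.mfcc(y=data, sr=sampling_rate, n_mfcc=13)    # shape (13, n_frames)
# Summarize each coefficient over time, or feed the full matrix to a sequence model.
mfcc_mean = np.mean(mfcc, axis=1)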
Further to Ayush’s discussion, for information on the challenges and workarounds of dealing with large amounts of data at different time scales in audio, I suggest this post on WaveNet: https://deepmind.com/blog/article/wavenet-generative-model-raw-audio
After that, it sounds like you want to do classification. In that case a GAN on its own is not suitable. If you have plenty of data you could use a plain LSTM (or another type of RNN), which is designed to model time series, or you can take fixed-size chunks of input and use a 1D CNN (similar to WaveNet). If you have lots of unlabelled data from the same or similar domain and limited training data, you could use a GAN to learn to generate new samples, then use the discriminator from the GAN as pre-trained weights for a CNN classifier.
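For the fixed-size-chunk route, a minimal 1D CNN classifier sketch in Keras could look like the following; the chunk length, number of emotion classes, and layer sizes are assumptions rather than a recommended architecture.

import tensorflow as tf

CHUNK = 16000      # assumed: 1 second of audio at 16 kHz
N_CLASSES = 6      # assumed number of emotion classes

model = tf.keras.Sequential([
    tf.keras.Input(shape=(CHUNK, 1)),
    tf.keras.layers.Conv1D(32, kernel_size=9, strides=4, activation='relu'),
    tf.keras.layers.Conv1D(64, kernel_size=9, strides=4, activation='relu'),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(N_CLASSES, activation='softmax'),
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
# model.fit(x_train, y_train, ...) once chunks of shape (CHUNK, 1) are prepared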
Since you are trying to perform Speech Emotion Recognition (SER) using deep learning, you can go for a recurrent architecture (LSTM or GRU) or a combination of CNN and recurrent network architecture (CRNN) instead of GANs since GANs are complicated and difficult to train.
In a CRNN, the CNN layers will extract features of varying details and complexity, whereas the recurrent layers will take care of the temporal dependencies. You can then finally use a fully connected layer for regression or classification output, depending on whether your output label is discrete (for categorical emotions like angry, sad, neutral etc) or continuous (arousal and valence space).
Regarding the choice of input, you can use either a spectrogram input (2D) or raw speech signal (1D) as input. For spectrogram input, you have to use a 2D CNN whereas for a raw speech signal you can use a 1D CNN. Mel scale spectrograms are usually preferred over linear spectrograms since our ears hear frequencies in log scale and not linearly.
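A minimal Keras sketch of such a CRNN on mel spectrogram input might look like this; the input shape, layer sizes, and class count are assumptions (see the paper and code linked below for an architecture actually used in practice).

import tensorflow as tf

N_MELS, N_FRAMES, N_CLASSES = 128, 300, 6      # assumed input shape and class count

model = tf.keras.Sequential([
    tf.keras.Input(shape=(N_MELS, N_FRAMES, 1)),              # mel spectrogram as an "image"
    tf.keras.layers.Conv2D(16, (3, 3), activation='relu', padding='same'),
    tf.keras.layers.MaxPooling2D((2, 2)),
    tf.keras.layers.Conv2D(32, (3, 3), activation='relu', padding='same'),
    tf.keras.layers.MaxPooling2D((2, 2)),
    # Collapse the frequency axis and keep time as the sequence dimension
    # (the 32 below matches the filter count of the last Conv2D layer).
    tf.keras.layers.Permute((2, 1, 3)),
    tf.keras.layers.Reshape((N_FRAMES // 4, (N_MELS // 4) * 32)),
    tf.keras.layers.GRU(64),                                   # temporal dependencies
    tf.keras.layers.Dense(N_CLASSES, activation='softmax'),    # or Dense(1) for regression
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')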
I have used a CRNN architecture to estimate the level of verbal conflict arising from conversational speech. Even though it is not SER, it is a very similar task.
You can find more details in the paper
http://www.eecs.qmul.ac.uk/~andrea/papers/2019_SPL_ConflictNET_Rajan_Brutti_Cavallaro.pdf
Also, check my github code for the same paper
https://github.com/smartcameras/ConflictNET
and a SER paper whose code I reproduced in Python
https://github.com/vandana-rajan/1D-Speech-Emotion-Recognition
And finally as Ayush mentioned, Librosa is one of the best Python libraries for audio processing. You have functions to create spectrograms in Librosa.
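For example, a log-scaled mel spectrogram takes only a couple of calls (the file name and n_mels value are just placeholders):

import librosa
import numpy as np

y, sr = librosa.load('audio.wav')
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)
mel_db = librosa.power_to_db(mel, ref=np.max)   # log-scaled, ready for a 2D CNN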

Using HMM for offline character recognition

I have extracted features from many images of isolated characters (such as gradient, neighbouring pixel weight, and geometric properties). How can I use HMMs as a classifier trained on this data? All the literature I read about HMMs refers to states and state transitions, but I can't connect it to features and class labelling. The example on JAHMM's home page doesn't relate to my problem.
I need to use HMM not because it will work better than other approaches for this problem but because of constraints on project topic.
There was an answer to this question for online recognition, but I want the same for offline recognition, and in a little more detail.
EDIT: I partitioned each character into a grid with fixed number of squares. Now I am planning to perform feature extraction on each grid block and thus obtain a sequence of features for each sample by moving from left to right and top to bottom.
Would this represent an adequate "sequence" for an HMM i.e. would an HMM be able to guess the temporal variation of the data, even though the character is not drawn from left to right and top to bottom? If not suggest an alternate way.
Should I feed in a lot of features or start with a few? How do I know if the HMM is underperforming or if the features are bad? I am using JAHMM.
Extracting stroke features is difficult; can they be logically combined with grid features (since an HMM expects a sequence generated by some random process)?
I've usually seen neural networks used for this sort of recognition task, i.e. here, here, here, and here. Since a simple Google search turns up so many hits for neural networks in OCR, I'll assume you are set on using HMMs (a project limitation, correct?). Regardless, those links can offer some insight into gridding the image and obtaining image features.
Your approach for turning a grid into a sequence of observations is reasonable. In this case, be sure you do not confuse observations and states. The features you extract from one block should be collected into one observation, i.e. a feature vector. (In comparison to speech recognition, your block's feature vector is analogous to the feature vector associated with a speech phoneme.) You don't really have much information regarding the underlying states. This is the hidden aspect of HMMs, and the training process should inform the model how likely one feature vector is to follow another for a character (i.e. transition probabilities).
Since this is an off-line process, don't be concerned with the temporal aspects of how characters are actually drawn. For the purposes of your task, you've imposed a temporal order on the sequence of observations with your left-to-right, top-to-bottom block sequence. This should work fine.
As for HMM performance: choose a reasonable vector of salient features. In speech recognition, the dimensionality of a feature vector can be high (>10). (This is also where the cited literature can assist.) Set aside a percentage of the training data so that you can properly test the model. First, train the model, and then evaluate it on the training dataset. How well does it classify your characters? If it does poorly, re-evaluate the feature vector. If it does well on the training data, test the generality of the classifier by running it on the reserved test data.
As for the number of states, I would start with a heuristically derived number. Assuming your character images are scaled and normalized, perhaps something like 40%(?) of the blocks are occupied? This is a crude guess on my part since a source image was not provided. For an 8x8 grid, this would imply that about 25 blocks are occupied. We could then start with 25 states, but that's probably naive: empty blocks can convey information (meaning the number of states might increase), but some feature sets may be observed in similar states (meaning the number of states might decrease). If it were me, I would probably pick something like 20 states. Having said that, be careful not to confuse features and states. Your feature vector is a representation of things observed in a particular state. If the tests described above show your model is performing poorly, tweak the number of states up or down and try again.
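The question uses JAHMM (Java), but to illustrate the observation/state setup compactly, here is a sketch of the same idea in Python with hmmlearn: one Gaussian HMM per character class trained on block-feature sequences, and classification by comparing log-likelihoods. The feature dimension, number of states, and stand-in data are assumptions.

# Sketch: one HMM per character class, observations = per-block feature vectors.
import numpy as np
from hmmlearn import hmm

N_STATES = 20      # heuristic starting point, as discussed above
N_FEATURES = 8     # assumed per-block feature dimension

rng = np.random.default_rng(0)
train_sequences = {                     # stand-in data; replace with real block features,
    label: [rng.normal(size=(64, N_FEATURES)) for _ in range(5)]   # 64 blocks = 8x8 grid
    for label in 'ABC'
}

def train_class_hmm(sequences):
    X = np.vstack(sequences)                     # stack all sequences for this class
    lengths = [len(s) for s in sequences]        # so hmmlearn knows the sequence boundaries
    model = hmm.GaussianHMM(n_components=N_STATES, covariance_type='diag', n_iter=50)
    model.fit(X, lengths)
    return model

models = {label: train_class_hmm(seqs) for label, seqs in train_sequences.items()}

def classify(sequence):
    # Pick the character whose HMM assigns the highest log-likelihood.
    return max(models, key=lambda label: models[label].score(sequence))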
Good luck.

Any visualizations of neural network decision process when recognizing images?

I'm enrolled in Coursera ML class and I just started learning about neural networks.
One thing that truly mystifies me is how recognizing something so “human”, like a handwritten digit, becomes easy once you find the good weights for linear combinations.
It is even crazier when you understand that something seemingly abstract (like a car) can be recognized just by finding some really good parameters for linear combinations, and combining them, and feeding them to each other.
Combinations of linear combinations are much more expressive than I once thought.
This led me to wonder whether it is possible to visualize an NN's decision process, at least in simple cases.
For example, if my input is a 20x20 greyscale image (i.e. 400 features in total) and the output is one of 10 classes corresponding to recognized digits, I would love to see some kind of visual explanation of which cascades of linear combinations led the NN to its conclusion.
I naïvely imagine that this might be implemented as a visual cue over the image being recognized, maybe a temperature map showing “pixels that affected the decision the most”, or anything that helps to understand how the neural network worked in a particular case.
Is there some neural network demo that does just that?
This is not a direct answer to your question. I would suggest you take a look at convolutional neural networks (CNN). In CNNs you can almost see the concept that is learned. You should read this publication:
Y. LeCun, L. Bottou, Y. Bengio and P. Haffner: Gradient-Based Learning Applied to Document Recognition, Proceedings of the IEEE, 86(11):2278-2324, November 1998
CNNs are often called "trainable feature extractors". In fact, CNNs implement 2D filters with trainable coefficients. This is why the activations of the first layers are usually shown as 2D images (see Fig. 13). In this paper the authors use another trick to make the network even more transparent: the last layer is a radial basis function layer (with Gaussian functions), i.e. the distance to an (adjustable) prototype for each class is calculated. You can really see the learned concepts by looking at the parameters of the last layer (see Fig. 3).
CNNs are still artificial neural networks; the difference is that the layers are not fully connected and some neurons share the same weights.
Maybe it doesn't answer the question directly, but I found this interesting passage in the paper by Andrew Ng, Jeff Dean, Quoc Le, Marc’Aurelio Ranzato, Rajat Monga, Matthieu Devin, Kai Chen and Greg Corrado (emphasis mine):
In this section, we will present two visualization techniques to verify if the optimal stimulus of the neuron is indeed a face. The first method is visualizing the most responsive stimuli in the test set. Since the test set is large, this method can reliably detect near optimal stimuli of the tested neuron. The second approach is to perform numerical optimization to find the optimal stimulus
...
These visualization methods have complementary strengths and weaknesses. For instance, visualizing the most responsive stimuli may suffer from fitting to noise. On the other hand, the numerical optimization approach can be susceptible to local minima. Results, shown [below], confirm that the tested neuron indeed learns the concept of faces.
In other words, they take a neuron that performs best at recognizing faces and
select the images from the dataset that cause it to output the highest confidence;
mathematically find an image (not in the dataset) that would get the highest confidence (sketched below).
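A rough sketch of that second approach, gradient ascent on the input to maximize one unit's response: the tiny untrained Keras model, the unit index, the image size, and the step count are all stand-ins for illustration, not the network from the paper.

# Sketch: "activation maximization" - optimize an input image so that one
# chosen unit of a network responds as strongly as possible.
import tensorflow as tf

# Stand-in model (untrained); in practice you would load the trained network.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(20, 20, 1)),
    tf.keras.layers.Conv2D(8, (3, 3), activation='relu'),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10),
])

unit = 3                                            # assumed index of the unit to maximize
x = tf.Variable(tf.random.normal([1, 20, 20, 1]))   # start from a random image
opt = tf.keras.optimizers.Adam(learning_rate=0.1)

for _ in range(200):
    with tf.GradientTape() as tape:
        activation = model(x)[0, unit]
        loss = -activation                          # ascend by minimizing the negative
    grads = tape.gradient(loss, [x])
    opt.apply_gradients(zip(grads, [x]))

optimal_stimulus = x.numpy()[0, :, :, 0]            # the "optimal input" image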
It's fun to see that it actually “captures” features of the human face.
The learning is unsupervised, i.e. input data didn't say whether an image is a face or not.
Interestingly, the paper also shows generated “optimal input” images for cat heads and human bodies.

What techniques exist for the software-driven locomotion of a bipedal robot?

I'm programming a software agent to control a robot player in a simulated game of soccer. Ultimately I hope to enter it in the RoboCup competition.
Amongst the various challenges involved in creating such an agent, the motion of its body is one of the first I'm facing. The simulation I'm targeting uses a Nao robot body with 22 hinges to control: six in each leg, four in each arm, and two in the neck.
(An image of the Nao robot appeared here; source: sourceforge.net)
I have an interest in machine learning and believe there must be some techniques available to control this guy.
At any point in time, it is known:
The angle of all 22 hinges
The X,Y,Z output of an accelerometer located in the robot's chest
The X,Y,Z output of a gyroscope located in the robot's chest
The location of certain landmarks (corners, goals) via a camera in the robot's head
A vector for the force applied to the bottom of each foot, along with a vector giving the position of the force on the foot's sole
The types of tasks I'd like to achieve are:
Running in a straight line as fast as possible
Moving at a defined speed (that is, one function that handles fast and slow walking depending upon an additional input)
Walking backwards
Turning on the spot
Running along a simple curve
Stepping sideways
Jumping as high as possible and landing without falling over
Kicking a ball that's in front of your feet
Making 'subconscious' stabilising movements when subjected to unexpected forces (hit by ball or another player), ideally in tandem with one of the above
For each of these tasks I believe I could come up with a suitable fitness function, but not a set of training inputs with expected outputs. That is, any machine learning approach would need to offer unsupervised learning.
I've seen some examples in open-source projects of circular functions (sine waves) wired into each hinge's angle with differing amplitudes and phases. These seem to walk in straight lines ok, but they all look a bit clunky. It's not an approach that would work for all of the tasks I mention above though.
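For reference, that sine-wave approach is simple to express; below is a minimal sketch that turns a per-hinge amplitude/phase/offset table into joint-angle targets over time (the hinge names and values are made up, not actual Nao gait parameters).

# Sketch of the "sine wave per hinge" open-loop gait mentioned above.
# Amplitudes, phases, and offsets below are illustrative, not a working Nao gait.
import math

GAIT_FREQUENCY_HZ = 1.0   # assumed step frequency

# hinge name -> (amplitude_deg, phase_rad, offset_deg)
gait_params = {
    'left_hip_pitch':   (20.0, 0.0,           -10.0),
    'right_hip_pitch':  (20.0, math.pi,       -10.0),
    'left_knee_pitch':  (30.0, 0.5,            20.0),
    'right_knee_pitch': (30.0, math.pi + 0.5,  20.0),
}

def joint_targets(t):
    """Joint-angle targets (degrees) for all hinges at time t (seconds)."""
    omega = 2.0 * math.pi * GAIT_FREQUENCY_HZ
    return {name: offset + amp * math.sin(omega * t + phase)
            for name, (amp, phase, offset) in gait_params.items()}

# Each control cycle, send joint_targets(t) to the hinge controllers.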
Some teams apparently use inverse kinematics, though I don't know much about that.
So, what approaches are there for robot biped locomotion/ambulation?
As an aside, I wrote and published a .NET library called TinMan that provides basic interaction with the soccer simulation server. It has a simple programming model for the sensors and actuators of the robot's 22 hinges.
You can read more about RoboCup's 3D Simulated Soccer League:
http://en.wikipedia.org/wiki/RoboCup_3D_Soccer_Simulation_League
http://simspark.sourceforge.net/wiki/index.php/Main_Page
http://code.google.com/p/tin-man/
There is a significant body of research literature on robot motion planning and robot locomotion.
General Robot Locomotion Control
For bipedal robots, there are at least two major approaches to robot design and control (whether the robot is simulated or physically real):
Zero Moment Point - a dynamics-based approach to locomotion stability and control.
Biologically-inspired locomotion - a control approach modeled after biological neural networks in mammals, insects, etc., that focuses on use of central pattern generators modified by other motor control programs/loops to control overall walking and maintain stability.
Motion Control for Bipedal Soccer Robot
There are really two aspects to handling the control issues for your simulated biped robot:
Basic walking and locomotion control
Task-oriented motion planning
The first part is just about handling the basic control issues for maintaining robot stability (assuming you are using some physics-based model with gravity), walking in a straight-line, turning, etc. The second part is focused on getting your robot to accomplish specific tasks as a soccer player, e.g., run toward the ball, kick the ball, block an opposing player, etc. It is probably easiest to solve these separately and link the second part as a higher-level controller that sends trajectory and goal directives to the first part.
There are a lot of relevant papers and books which could be suggested, but I've listed some potentially useful ones below that you may wish to include in whatever research you have already done.
Reading Suggestions
LaValle, Steven Michael (2006). Planning Algorithms, Cambridge University Press.
Raibert, Marc (1986). Legged Robots that Balance. MIT Press.
Vukobratovic, Miomir and Borovac, Branislav (2004). "Zero-Moment Point - Thirty Five Years of its Life", International Journal of Humanoid Robotics, Vol. 1, No. 1, pp 157–173.
Hirose, Masato and Takenaka, T (2001). "Development of the humanoid robot ASIMO", Honda R&D Technical Review, vol 13, no. 1.
Wu, QiDi and Liu, ChengJu and Zhang, JiaQi and Chen, QiJun (2009). "Survey of locomotion control of legged robots inspired by biological concept", Science in China Series F: Information Sciences, vol 52, no. 10, pp 1715-1729, Springer.
Wahde, Mattias and Pettersson, Jimmy (2002) "A brief review of bipedal robotics research", Proceedings of the 8th Mechatronics Forum International Conference, pp 480-488.
Shan, J., Junshi, C. and Jiapin, C. (2000). "Design of central pattern generator for humanoid robot walking based on multi-objective GA", In: Proc. of the IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 1930–1935.
Chestnutt, J., Lau, M., Cheung, G., Kuffner, J., Hodgins, J., and Kanade, T. (2005). "Footstep planning for the Honda ASIMO humanoid", Proceedings of the 2005 IEEE International Conference on Robotics and Automation (ICRA 2005), pp 629-634.
I was working on a project not that dissimilar from this (making a robotic tuna), and one of the methods we were exploring was using a genetic algorithm to tune the performance of an artificial central pattern generator (in our case the pattern was a number of sine waves operating on each joint of the tail). It might be worth giving it a shot; genetic algorithms are another one of those tools that can be incredibly powerful if you are careful about selecting a fitness function.
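As a rough sketch of that idea, here is a tiny genetic algorithm that tunes per-joint (amplitude, phase) parameters against a fitness function; the population size, mutation scale, and the evaluate_gait stub are assumptions, and in practice the fitness would come from running the gait in the simulator.

# Sketch: genetic algorithm tuning sine-wave (CPG-like) parameters per joint.
# evaluate_gait is a stub; in practice it would run the simulator and score the gait.
import numpy as np

N_JOINTS = 4
GENOME_SIZE = 2 * N_JOINTS        # amplitude and phase for each joint
POP_SIZE, GENERATIONS = 30, 50

def evaluate_gait(genome):
    # Placeholder fitness: in reality, run the simulation with these parameters
    # and return e.g. the distance walked before falling.
    return -np.sum((genome - 0.5) ** 2)

rng = np.random.default_rng(0)
population = rng.uniform(-1.0, 1.0, size=(POP_SIZE, GENOME_SIZE))

for _ in range(GENERATIONS):
    fitness = np.array([evaluate_gait(g) for g in population])
    order = np.argsort(fitness)[::-1]                              # best first
    parents = population[order[:POP_SIZE // 2]]                    # truncation selection
    children = parents + rng.normal(0.0, 0.1, size=parents.shape)  # mutation
    population = np.vstack([parents, children])

best = population[np.argmax([evaluate_gait(g) for g in population])]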
Here's a great paper from 1999 by Peter Nordin and Mats G. Nordahl that outlines an evolutionary approach to controlling a humanoid robot, based on their experience building the ELVIS robot:
An Evolutionary Architecture for a Humanoid Robot
I've been thinking about this for quite some time now and I realized that you need at least two intelligent "agents" to make this work properly. The basic idea is that you have two types of intelligent activity here:
Subconscious Motor Control (SMC).
Conscious Decision Making (CDM).
Training for the SMC could be done on-line... if you really think about it: defining success within motor control basically works like this: you provide a signal to your robot, it evaluates that signal, and it either accepts it or rejects it. If your robot accepts a signal and it results in a "failure", then your robot goes "offline" and it can't accept any more signals. Defining "failure" and "offline" could be tricky, but I was thinking that it would be a failure if, for example, a sensor on the robot indicates that the robot is immobile (lying on the ground).
So your fitness function for the SMC might be something of the sort: numAcceptedSignals/numGivenSignals + numFailure
The CDM is another AI agent that generates signals and the fitness function for it could be: (numSignalsAccepted/numSignalsGenerated)/(numWinGoals/numLossGoals)
So what you do is run the CDM, and all the output that comes out of it goes to the SMC... at the end of a game you run your fitness functions. Alternatively, you can combine the SMC and the CDM into a single agent and make a composite fitness function based on the other two fitness functions. I don't know how else you could do it...
Finally, you have to determine what constitutes a learning session: is it half a game, full game, just a few moves, etc. If a game lasts 1 minute and you have a total of 8 players on the field, then the process of training could be VERY slow!
Update
Here is a quick reference to a paper that used genetic programming to create "softbots" that play soccer: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.36.136&rep=rep1&type=pdf
With regards to your comments: I was thinking that for the subconscious motor control (SMC), the signals would come from the conscious decision maker (CDM). This way you're evolving your SMC agent to properly handle the CDM agent's commands (signals). You want to maximize the up-time of the SMC agent regardless of what the CDM agent says.
The SMC agent receives an input, for example a vector force on a joint, and it then runs it through its processing unit to determine whether it should execute that input or reject it. The SMC should only execute inputs that it "thinks" it can recover from, and it should reject inputs that it "thinks" would lead to a "catastrophic failure".
Now the SMC agent has an output: accept or reject a signal (1 or 0). The CDM can use that signal for its own training... the CDM wants to maximize the number of signals that the SMC accepts, and it also wants to satisfy a goal: a high score for its own team and a low score for the opposing team. So the CDM has its own processing unit that is being evolved to satisfy both of those needs. Your reference provided a 3-layer design, while mine is only a 2-layer one... I think mine was a step in the right direction towards the 3-layer design.
One more thing to note here: is falling really a "catastrophic failure"? What if your robot falls, but the CDM makes it stand up again? I think that would be a valid behavior, so you shouldn't penalize the robot for falling... perhaps a better approach is to penalize it for the amount of time it takes to achieve a goal (not necessarily a soccer goal).
There is this tutorial on humanoid locomotion control that describes the software stack used on the HRP-4 humanoid (which can walk or climb stairs). It consists mainly of:
Linear inverted pendulum: a simplified model for balancing. It involves only the center of mass (COM) and the ZMP already mentioned in other answers (a minimal numerical sketch follows after this list).
Trajectory optimization: the robot computes what it wants to do, ideally, for the next 2 seconds or so. It keeps recomputing this trajectory as it moves, which is known as model predictive control.
Balance control: the last stage that corrects the robot's posture based on sensor measurements and the desired trajectory.
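For intuition about the linear inverted pendulum model mentioned in the first step, here is a minimal numerical sketch of its dynamics: the horizontal acceleration of the center of mass is proportional to the offset between the COM and the ZMP. The numbers below are arbitrary.

# Sketch of the linear inverted pendulum model (LIPM):
#   x_ddot = (g / h) * (x - p_zmp)
# where x is the horizontal COM position, h its (constant) height, p_zmp the ZMP.
g, h = 9.81, 0.8          # gravity, assumed constant COM height (metres)
dt, steps = 0.005, 400    # integration step and horizon (2 seconds)

x, x_dot = 0.02, 0.0      # COM starts 2 cm away from the support point
p_zmp = 0.0               # ZMP held fixed under the support foot

trajectory = []
for _ in range(steps):
    x_ddot = (g / h) * (x - p_zmp)   # diverges unless the ZMP is moved (a step is taken)
    x_dot += x_ddot * dt
    x += x_dot * dt
    trajectory.append(x)

# A walking controller chooses future ZMP positions (footsteps) so that the COM
# follows a stable reference; model predictive control re-solves this repeatedly.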
Follow links to the academic papers and source code to learn more.