Letter Speech Recognition using octave - octave

I am trying to recognise all the letters of the alphabet from audio files.
The audio files (26) are my recordings of speaking the alphabet.
So I am trying to implement a speech to text program, where the input is my voice speaking a letter which I am trying to predict what is it.
I guess I will have to do some kind of convertion of the sound like fft, spectrograms ex. Collect some info of the sound like zero-cross-rate, max frequency etc..
And then try to find the most similar one with something like euclidean distance.
I am trying doing this only using octave.
I don't want the full code, I just want to know where to start. How would you do it?

Related

Dividing Sound Into Equal Frequency Ranges with pyaudio

I'm trying to build audio reactive leds as a little project. I'm using an api to interact with leds via computer. What I'm trying to accomplish is to be able to divide the audio stream into frequency ranges. For example, if I divide by 100Hz, a 22100 Hz solo channel should provide 221 groups. Then I want to assign each group to a led or leds and do effects with the normalized mean value of the groups.How can I accomplish this? I've been reading about FFT and tried some sample codes but so far I've failed. I can't seem to get my head around this. Below is the closest project that I've found online. I've tweaked around with it and I'm able to control my leds but I just can't, for example, isolate the first 100 Hz.
https://gist.github.com/limitedmage/2245892

"Play" sounds into a .wav

I'm trying to make a program that can convert ORG files into WAV files directly. The ORG format is similar to MIDI, in the sense that it is a list of "instructions" about when and how to play specific instruments, and a program plays these instruments for it to create the song.
However, as I said, I want to generate a WAV directly, instead of just playing the ORG. So, in a sense, I want to "play" the sounds into a WAV. I do know the WAV format and have created some files from raw PCM samples, but this isn't as simple.
The sounds generated by the ORG come from a bunch of files containing WAV samples I have. They're mono, 8-bit samples should be played at 22050Hz. They're all under a second long, and the largest aren't more than 11KB. I would assume that to play them all after each other, I would simply put the samples into the WAV one after the other. It isn't that simple though, as the ORG can have up to 16 different instruments playing at once, and each note of each instrument also has a pan (i.e. a balance, allowing stereo sound). What's more, each ORG has its own tempo (i.e. milliseconds between each point a sound can be played), and some sounds may be longer than this tempo, which means that two sounds on the same instrument can overlap. For instance, a note plays on an instrument, 90 milliseconds later the same note plays on the same instrument, but the first not hasn't finished, hence the first note plays into the second.
I just thought to explain all of that to be sure the situation is clear. In any case, I'd basically like to know how I would go about converting or "playing" an ORG (or if you like, a MIDI (since they're essentially the same)) into a WAV. As I mentioned each note does have a pan/balance, so the WAV would also need to be stereo.
If it matters at all, I'll be doing this in ActionScript 3.0 in FlashDevelop. I don't need any code (as that would be asking someone to do the work for me), but I just want to know how I would go about doing this correctly. An algorithm or two may be handy as well.
First let me say AS3 is not the best language to do these kind of things. Super collider would be a better and easier choice.
But if you want to do it in AS3 here's a general approach. I haven't tested any of it, this is pure theory.
First, put all your sounds into an array, and then find a way of matching the notes from your midi file to a position in the array.
I don't know the format of midi in depth, but I know the smallest value is a tick, and the length of a tick depends on the BPM. Here's the formula to calculate a midi tick: Midi Ticks to Actual PlayBack Seconds !!! ( Midi Music)
Let's say your tick is 2ms in length. So now you have a base value. You can fill a Vector (like an Array but faster) with what happens at every tick. If nothing happens at a particular tick, then insert a null value.
Now the big problem is reading that Vector. It's a problem because the Timer class does not work at small values like 2ms. But what you can do is check the ellapsed time in ms since the app started using getTimer(). You can have some loop that will check the ellapsed time, and whenever you have 2ms more, you read the next index in the Vector. If there are notes on that index, you play the sounds. If not you wait for the next tick.
The problem with this, is that if a loop goes on for more than 15 seconds (I'm not sure of that value) Flash will think the program is not responding and will kill it. So you have to take care of that too, ending the loop and opening a new one before Flash kills your program.
Ok, so now you have sounds playing. You can record the sounds that flash is making (wavs, mp3, mic) with a library called Standing Wave 3.
https://github.com/maxl0rd/standingwave3
This is very theoretical... and I'm quite sure depending on the number of sounds you want to play you can freeze your program... but I hope it will help to get you going.

Tesseract OCR - Handwritten font

I'm trying to use Tesseract-OCR to detect the text of images with pure text in it but these text has a handwritten font called Journal.
Example:
The result is not the best:
Maxima! size` W (35)
Is there any possibility to improve the result or rather to get the exact result?
I am surprised Tesseract is doing so well. With a little bit of training you should be able to train the lower case 'l' to be recognised correctly.
The main problem you have is the top of the large T character. The horizontal line extends across 2 (possibly 3) other character cells and this would cause a problem for any OCR engine when it tries to segment the characters for recognition. Training may be able to help in this case.
The next problem is the . and : which are very light/thin and are possibly being removed with image pre-processing before the OCR even starts.
Overall the only chance to improve the results with Tesseract would be to investigate training. Here are some links which may help.
Alternative to Tesseract OCR Training?
Tesseract OCR Library learning font
Tesseract confuses two numbers
Like Andrew Cash mentioned, it'll be very hard to perform OCR for that T letter because of its intersection with a number of next characters.
For results improvement you may want to try a more accurate SDK. Have a look at ABBYY Cloud OCR SDK, it's a cloud-based OCR SDK recently launched by ABBYY. It's in beta, so for now it's totally free to use. I work # ABBYY and can provide you additional info on our products if necessary. I've sent the image you've attached to our SDK and got this response:
Maximal size: lall (35)

How can I analyze live data from webcam?

I am going to be working on self-chosen project for my college networking class and I just had a couple questions to help get me started in the right direction.
My project will involve creating a new "physical" link over which data, in the form of text, will be transmitted from one computer to another. This link will involve one computer with a webcam that reads a series of flashing colors (black/white) as binary and converts it to text. Each series of flashes will simulate a packet of data. I will be using OSX an the integrated webcam in a Macbook, the flashing computer will either be windows or osx.
So my questions are: which programming languages or API's would be best for reading live webcam data and analyzing the color of a certain area as well as programming and timing the flashes? Also, would I need to worry about matching the flash rate of the "writing" computer and the frame capture rate of the "reading" computer?
Thank you for any help you might be able to provide.
Regarding the frame capture rate, Shannon sampling theorem says that "perfect reconstruction of a signal is possible when the sampling frequency is greater than twice the maximum frequency of the signal being sampled". In other words if your flashing light switches 10 times per second, you need a camera of more than 20fps to properly capture that. So basically check your camera specs, divide by 2, lower the resulting a little and you have your maximum flashing rate.
Whatever can get the frames will work. If the light conditions in which the camera works are gonna be stable, and the position of the light on images is gonna be static then it is gonna be very very easy with checking the average pixel values of a certain area.
If you need additional image processing you should probably also find out about OpenCV (it has bindings to every programming language).
To answer your question about language choice, I would recommend java. The Java Media Framework is great and easy to use. I have used it for capturing video from webcams in the past. Be warned, however, that everyone you ask will recommend a different language - everyone has their preferences!
What are you using as the flashing device? What kind of distance are you trying to achieve? Something worth thinking about is how are you going to get the receiver to recognise where within the captured image to look for the flashes. Some kind of fiducial marker might be necessary. Longer ranges will make this problem harder to resolve.
If you're thinking about shorter ranges, have you considered using a two-dimensional transmitter? (given that you're using a two-dimensional receiver, it makes sense) and maybe have a transmitter that shows a sequence of QR codes (or similar encodings) on a monitor?
You will have to consider some kind of error-correction encoding, such as a hamming code. While encoding would increase the data footprint, it might give you overall better bandwidth given that you can crank up the speed much higher without having to worry about the odd corrupt bit.
Some 'evaluation' type material might include you discussing the obvious security risks in using such a channel - anyone with line of sight to the transmitter can eavesdrop! You could suggest in your writeup using some kind of encryption, a block cipher in CBC would do, but would require a key-exchange prior to transmission, so you could think about public key encryption.

OCR and word reviewing

I'm using Tesseract for my letter recognition project and currently the recognitions is quite good. The image processing part was done using OpenCv libraries.
The letters are hand written.But there are some problems when I used it to recognise the letter "O" and number "0". These letters are used in data areas as the fields that enter names. So names cannot have any numbers with it. And when we are using the the system of the data fields as date of birth it only contains numbers. So I'm willing to give restriction to the recognition system saying that the corresponding data fields have only numbers or the letters.
And also I'm willing to review the recognised letters with the possible words so we can improve the accuracy of the data. I'm willing to use the openCv libraries for this task. But I don't know what are the libraries that help for this task and what are the functionalities of those. So please can some one help me. Thank you.
Regards,
Thilanka.
I've never used Tesseract. However, in the FAQ it says
How do I recognise only digits?
TessBaseAPI::SetVariable("tessedit_char_whitelist", "0123456789");
Presumably you could use the pattern of the FAQ entry to set it up so it only recognises letters or just digits appropriately.
If you have already tried this, can you give more details of why it doesn't work?