I'm trying to find a library or sample code or something in the right direction that could help me change the speed of audio while maintaining normal pitch. I need this functionality in an open source application, so preferably the library is open source itself. Any ideas to get me on the right track?
If you need to play back a signal in the audio domain at a different speed but not at a different pitch:

You have to know what your signal is composed of, so that you can synthesize the right frequencies at the right time.

1. All the parameters are known, as in analog synthesis: you know which note you want to synthesize, so you tune the oscillator frequencies to that value. I guess this is not your situation; any virtual or virtual-analog synth can do this on demand.

2. You have a source sound you want to control.

You have to decompose it into elements you can control, so as to satisfy your harmonic, timing, and rhythmic constraints. There are three common solutions:

a. FFT (fast Fourier transform), which gives you the power in every frequency component of your source sound; it is then up to you to stretch the time scale of some components or others (rather ad hoc recipes, but well worth the experiment).

b. Wavelets, close to the FFT, but focusing on harmonic detail whenever it happens and however precisely it happens (think of it as an FFT that concentrates on the meaningful frequencies at each moment).

c. Granular synthesis, which I think is the easiest: it applies windows (something like a Gaussian envelope on each short fragment of sound), like clouds of windows over your original sound, decomposing it into many grains whose pitch and duration (the speed and spacing of the windows applied to the sound) can be controlled independently; see the sketch after this list.

There may be many other techniques, but these are the ones I am aware of.
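As a rough illustration of the granular approach (my own sketch, not taken from any particular library; the function name and parameter values are arbitrary), here is a minimal overlap-add time stretch in Python/NumPy. Real implementations add grain alignment and smarter windowing to reduce artifacts:

```python
import numpy as np

def granular_time_stretch(x, stretch, grain_size=2048, hop_out=512):
    """Slow down (stretch > 1) or speed up (stretch < 1) x without changing pitch.

    Grains are read from the input at a different rate than they are written
    to the output, but each grain itself is played back unchanged, so the
    pitch is preserved.
    """
    window = np.hanning(grain_size)
    hop_in = max(1, int(hop_out / stretch))      # input advance per grain
    n_grains = (len(x) - grain_size) // hop_in
    out = np.zeros(n_grains * hop_out + grain_size)
    norm = np.zeros_like(out)
    for i in range(n_grains):
        grain = x[i * hop_in : i * hop_in + grain_size] * window
        out[i * hop_out : i * hop_out + grain_size] += grain
        norm[i * hop_out : i * hop_out + grain_size] += window
    norm[norm < 1e-8] = 1.0                      # avoid division by zero at the edges
    return out / norm
```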
The Wikipedia article on Audio timescale-pitch modification may be helpful.
The basic idea is that you need to convert a signal along a time axis into a signal over time and frequency axes. Then you modify that signal appropriately, then convert back again.
Windowed fast Fourier transforms are a common approach - take a short segment of the signal, convert it to the frequency domain, and repeat for periodic steps through the signal. Modifying the signal basically means relabelling your frequency and/or time axis scaling before applying the inverse transforms. Windows will probably overlap a little, so you can blend (cross-fade) from one block to another.
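For concreteness, here is a minimal phase-vocoder-style sketch of that windowed-FFT idea in Python/NumPy (my own illustration; the function and parameter names are made up, and it omits amplitude normalization and transient handling, which real implementations need):

```python
import numpy as np

def stft_time_stretch(x, stretch, n_fft=2048, hop=512):
    """Stretch x in time by `stretch` while keeping its pitch, via windowed FFTs."""
    win = np.hanning(n_fft)
    frames = np.array([np.fft.rfft(win * x[i:i + n_fft])
                       for i in range(0, len(x) - n_fft, hop)])
    bin_freqs = 2 * np.pi * np.arange(n_fft // 2 + 1) * hop / n_fft
    steps = np.arange(0, len(frames) - 1, 1.0 / stretch)  # fractional analysis positions
    phase = np.angle(frames[0])
    out = np.zeros(len(steps) * hop + n_fft)
    for k, t in enumerate(steps):
        i = int(t)
        mag = np.abs(frames[i])
        # measured phase advance minus the expected advance, wrapped to [-pi, pi]
        dphi = np.angle(frames[i + 1]) - np.angle(frames[i]) - bin_freqs
        dphi -= 2 * np.pi * np.round(dphi / (2 * np.pi))
        phase += bin_freqs + dphi                 # accumulate phase at the new frame rate
        grain = np.fft.irfft(mag * np.exp(1j * phase))
        out[k * hop : k * hop + n_fft] += grain * win   # overlap-add (cross-fade between blocks)
    return out
```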
Another possible approach is to use wavelet transforms, filter banks, or some other closely related multi-resolution approach. The basis of these is the use of integral transforms in which each frequency is treated on an appropriate scale (relative to wavelength). A Morlet basis, for example, is very like a single-wavelength-limited variation of the sine+j.cosine combination that is the basis of the Fourier transform.
In theory, these should provide a better result. As the transforms naturally have both time and frequency axes, there is no need to generate the time axis "artificially" by windowing. This may avoid the sometimes obvious crossfade-between-blocks issues with the windowed Fourier transform approach. I'm going to guess that there may be other artefacts instead, but I don't know enough to know what they are.
Sorry if my terminology is misleading or wrong about multi-resolution stuff - I'm very far from being an expert.
Some time ago (before CuDNN introduced its own RNN/LSTM implementation), you would use a tensor of shape [B,T,D] (batch-major) or [T,B,D] (time-major) and then have a straight-forward LSTM implementation.
Straight-forward means e.g. pure Theano or pure TensorFlow.
It was (is?) common wisdom that time-major is more efficient for RNNs/LSTMs.
This might be due to the unrolling internal details of Theano/TensorFlow (e.g. in TensorFlow, you would use tf.TensorArray, and that naturally unrolls over the first axis, so it must be time-major, and otherwise it would imply a transpose to time-major; and not using tf.TensorArray but directly accessing the tensor would be extremely inefficient in the backprop phase).
But I think this is also related to memory locality, so even with your own custom native implementation where you have full control over these details (and thus could choose any format you like), time-major should be more efficient.
(Maybe someone can confirm this?)
(In a similar way, for convolutions, batch-channel major (NCHW) is also more efficient. See here.)
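To make the memory-locality point concrete, here is a small NumPy illustration (sizes are arbitrary): in the time-major layout, the per-time-step slice that an RNN step consumes is one contiguous block, while in the batch-major layout it is strided:

```python
import numpy as np

T, B, D = 50, 32, 128
time_major  = np.zeros((T, B, D), dtype=np.float32)   # [T, B, D]
batch_major = np.zeros((B, T, D), dtype=np.float32)   # [B, T, D]

# One recurrent step reads every sequence at the same time step t:
step_tm = time_major[7]       # one contiguous block of B*D floats
step_bm = batch_major[:, 7]   # B strided chunks of D floats, each T*D floats apart
print(step_tm.flags["C_CONTIGUOUS"], step_bm.flags["C_CONTIGUOUS"])  # True False
```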
Then CuDNN introduced their own RNN/LSTM implementation and they used packed tensors, i.e. with all padding removed. Also, sequences must be sorted by sequence length (longest first). This is also time-major but without the padded frames.
This caused some difficulty in adopting these kernels because padded tensors (non packed) were pretty standard in all frameworks up to that point, and thus you need to sort by seq length, pack it, then call the kernel, then unpack it and undo the sequence sorting. But slowly the frameworks adopted this.
However, Nvidia then extended the CuDNN functions (e.g. cudnnRNNForwardTrainingEx, and later cudnnRNNForward), which now support all three formats:
CUDNN_RNN_DATA_LAYOUT_SEQ_MAJOR_UNPACKED: Data layout is padded, with outer stride from one time-step to the next (time-major, or sequence-major)
CUDNN_RNN_DATA_LAYOUT_BATCH_MAJOR_UNPACKED: Data layout is padded, with outer stride from one batch to the next (batch-major)
CUDNN_RNN_DATA_LAYOUT_SEQ_MAJOR_PACKED: The sequence length is sorted and packed as in the basic RNN API (time-major without padding frames, i.e. packed)
CuDNN references:
CuDNN developer guide,
CuDNN API reference
(search for "packed", or "padded").
See for example cudnnSetRNNDataDescriptor. Some quotes:
With the unpacked layout, both sequence major (meaning, time major) and batch major are supported. For backward compatibility, the packed sequence major layout is supported.
This data structure is intended to support the unpacked (padded) layout for input and output of extended RNN inference and training functions. A packed (unpadded) layout is also supported for backward compatibility.
In TensorFlow, since CuDNN supports the padded layout, they have cleaned up the code and only support the padded layout now. I don't see that you can use the packed layout anymore. (Right?)
(I'm not sure why this decision was made. Just to have simpler code? Or is this more efficient?)
PyTorch only supports the packed layout properly (when you have sequences of different lengths) (documentation).
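For reference, this is roughly what the packed path looks like in PyTorch (sizes here are made up; pack_padded_sequence / pad_packed_sequence handle the packing and unpacking around the LSTM, with sequences sorted longest first as the packed layout requires):

```python
import torch
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

T, B, D, H = 7, 3, 5, 4                              # illustrative sizes
lstm = torch.nn.LSTM(input_size=D, hidden_size=H)    # time-major ([T, B, D]) by default
x = torch.randn(T, B, D)                             # padded, time-major input
lengths = torch.tensor([7, 5, 2])                    # sorted, longest first

packed = pack_padded_sequence(x, lengths, enforce_sorted=True)   # drop the padded frames
packed_out, (h, c) = lstm(packed)                    # LSTM consumes the packed sequence
out, out_lengths = pad_packed_sequence(packed_out)   # back to padded [T, B, H]
```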
Besides computational efficiency, there is also memory efficiency. Obviously the packed tensor is better w.r.t. memory consumption. So this is not really the question.
I mostly wonder about computational efficiency. Is the packed format the most efficient? Or is it just the same as padded time-major? And is time-major more efficient than batch-major?
(This question is not necessarily about CuDNN, but in general about any naive or optimized implementation in CUDA.)
But obviously, this question also depends on the rest of the neural network. When you mix the LSTM with other modules which might require non-packed tensors, you would get a lot of packing and unpacking if the LSTM uses the packed format. But consider that you could re-implement all the other modules to work on the packed format as well: then maybe the packed format would be better in every respect?
(Maybe the answer is that there is no clear answer. But I don't know. Maybe there is a clear answer. Last time I actually measured, the answer was pretty clear, at least for some parts of my question, namely that time-major is in general more efficient than batch-major for RNNs. Maybe the answer is that it depends on the hardware. But this should not be a guess; it should be backed either by real measurements or, even better, by some good explanation. To the best of my knowledge, this should be mostly invariant to the hardware. It would be unexpected to me if the answer varied depending on the hardware. I also assume that packed vs. padded probably should not make much of a difference, again no matter the hardware. But maybe someone really knows.)
I have found the keras-rl/examples/cem_cartpole.py example and I would like to understand it, but I can't find any documentation.
What does the line
memory = EpisodeParameterMemory(limit=1000, window_length=1)
do? What is the limit and what is the window_length? Which effect does increasing either / both parameters have?
EpisodeParameterMemory is a special class that is used for CEM. In essence it stores the parameters of a policy network that were used for an entire episode (hence the name).
Regarding your questions: The limit parameter simply specifies how many entries the memory can hold. After exceeding this limit, older entries will be replaced by newer ones.
The second parameter is not used in this specific type of memory (CEM is somewhat of an edge case in Keras-RL and mostly there as a simple baseline). Typically, however, the window_length parameter controls how many observations are concatenated to form a "state". This may be necessary if the environment is not fully observable (think of it as transforming a POMDP into an MDP, or at least approximately). DQN on Atari uses this since a single frame is clearly not enough to infer the velocity of a ball with a FF network, for example.
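To illustrate what window_length usually does in the other memory types (a simplified stand-in, not keras-rl's actual implementation), the idea is DQN-style frame stacking:

```python
import numpy as np
from collections import deque

window_length = 4                       # e.g. 4 stacked Atari frames
frames = deque(maxlen=window_length)

def observe(obs):
    """Return the stacked 'state' for the latest observation."""
    frames.append(obs)
    while len(frames) < window_length:  # zero-pad at the start of an episode
        frames.appendleft(np.zeros_like(obs))
    return np.stack(frames)             # shape: (window_length, *obs.shape)
```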
Generally, I recommend reading the relevant paper (again, CEM is somewhat of an exception). It should then become relatively clear what each parameter means. I agree that Keras-RL desperately needs documentation but I don't have time to work on it right now, unfortunately. Contributions to improve the situation are of course always welcome ;).
A little late to the party, but I feel like the answer doesn't really answer the question.
I found this description online (https://pytorch.org/tutorials/intermediate/reinforcement_q_learning.html#replay-memory):
We’ll be using experience replay memory for training our DQN. It stores the transitions that the agent observes, allowing us to reuse this data later. By sampling from it randomly, the transitions that build up a batch are decorrelated. It has been shown that this greatly stabilizes and improves the DQN training procedure.
Basically you observe and save all of your state transitions so that you can train your network on them later on (instead of having to make observations from the environment all the time).
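A minimal sketch of such a replay memory (my own simplified version, not the keras-rl or PyTorch-tutorial class) could look like this:

```python
import random
from collections import deque, namedtuple

Transition = namedtuple("Transition", "state action reward next_state done")

class ReplayMemory:
    def __init__(self, limit):
        self.buffer = deque(maxlen=limit)          # oldest entries are dropped first

    def push(self, *args):
        self.buffer.append(Transition(*args))      # store one observed transition

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)  # random batch -> decorrelated samples
```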
I would like to obtain ALL the real roots of a system of 8 polynomial equations in 8 variables, where the maximum degree of each equation is 3. Is this doable? What is the best software to do this?
You will need to use a multi-dimensional root finding method.
Newton's should be fairly easy to implement (simple algorithm) and you may find some details about this and other methods here.
As far as I understand, it's not likely to be straightforward to solve 8 simultaneous cubic equations with a very general direct method.
Depending on your problem, you may also want to check if it's analogous to the problem of fitting a cubic spline to a set of points. If it is, then a simple direct algorithm is available here.
For the initial conditions, you might need to use well-dispersed random starting points in your domain of interest. You could use Sobol sequences or some other low-discrepancy random number generator to efficiently generate points that fill out the space under consideration.
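As a sketch of that multi-start idea (the system below is a made-up stand-in for your 8 cubic equations; SciPy's root with method="hybr" uses MINPACK's Powell hybrid, Newton-like solver), you could do something like this. Note that it finds some real roots but gives no guarantee of finding all of them:

```python
import numpy as np
from scipy.optimize import root

def system(v):
    """Placeholder for the 8 polynomial equations of degree <= 3."""
    return [v[i] ** 3 + v[(i + 1) % 8] - 1.0 for i in range(8)]

rng = np.random.default_rng(0)
found = []
for _ in range(500):                               # many dispersed starting points
    x0 = rng.uniform(-5.0, 5.0, size=8)
    sol = root(system, x0, method="hybr")          # Powell hybrid (Newton-like) solver
    if sol.success and not any(np.allclose(sol.x, r, atol=1e-6) for r in found):
        found.append(sol.x)
print(f"{len(found)} distinct real roots found (completeness not guaranteed)")
```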
You may also consider moving this question to math exchange where you might have a better chance of having this answered correctly.
So I have been making a simple HTML5 tuner using the Web Audio API. I have it all set up to respond to the correct frequencies; the problem seems to be with getting the actual frequencies. Using the input, I create an array of the spectrum, where I look for the highest value and use that frequency as the one to feed into the tuner. The problem is that when creating an analyser in Web Audio, it cannot get more specific than an FFT size of 2048. With that setting, if I play a 440 Hz note, the closest value in the array is something like 430 Hz, and the next value seems to be higher than 440. Therefore the tuner thinks I am playing those notes, when in fact the loudest frequency should be 440 Hz and not 430 Hz. Since this frequency does not exist in the analyser array, I am trying to figure out a way around this, or whether I am missing something very obvious.
I am very new at this so any help would be very appreciated.
Thanks
There are a number of approaches to implementing pitch detection. This paper provides a review of them. Their conclusion is that using FFTs may not be the best way to go - however, it's unclear quite what their FFT-based algorithm actually did.
If you're simply tuning guitar strings to fixed frequencies, much simpler approaches exist. Building a fully chromatic tuner that does not know a-priori the frequency to expect is hard.
The FFT approach you're using is entirely possible (I've built a robust musical instrument tuner using this approach that is being used white-label by a number of 3rd parties). However you need a significant amount of post-processing of the FFT data.
To start, you solve the resolution problem using the Short Time FFT (STFT) - or more precisely - a succession of them. The process is described nicely in this article.
If you intend building a tuner for guitar and bass guitar (and let's face it, everyone who asks the question here is), you'll need at least a 4096-point DFT with overlapping windows in order to have enough frequency resolution to resolve the bottom E1 string at ~41 Hz.
You have a bunch of other algorithmic and usability hurdles to overcome. Not least, the perceived pitch and the spectral peak aren't always the same. Taking the spectral peak from the STFT doesn't work reliably (this is why the basic auto-correlation approach is also broken).
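One common post-processing step (my own sketch, not something claimed in the answer above) is quadratic interpolation of the spectral peak, which recovers a sub-bin frequency estimate from the coarse FFT bins:

```python
import numpy as np

def refine_peak(mag, k, sample_rate, n_fft):
    """Estimate the true peak frequency around FFT bin k via quadratic interpolation."""
    a = np.log(mag[k - 1] + 1e-12)
    b = np.log(mag[k] + 1e-12)
    c = np.log(mag[k + 1] + 1e-12)
    delta = 0.5 * (a - c) / (a - 2 * b + c)   # fractional bin offset in [-0.5, 0.5]
    return (k + delta) * sample_rate / n_fft

# With fftSize = 2048 at 44.1 kHz the bins are ~21.5 Hz apart, which is why a
# 440 Hz tone can show up nearest a ~430 Hz bin; interpolation gives a finer estimate.
```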
I'm currently implementing a raytracer. Since raytracing is extremely computation heavy and since I am going to be looking into CUDA programming anyway, I was wondering if anyone has any experience with combining the two. I can't really tell if the computational models match and I would like to know what to expect. I get the impression that it's not exactly a match made in heaven, but a decent speed increase would be better than nothing.
One thing to be very wary of in CUDA is that divergent control flow in your kernel code absolutely KILLS performance, due to the structure of the underlying GPU hardware. GPUs typically have massively data-parallel workloads with highly coherent control flow: you have a couple million pixels, each of which (or at least large swaths of which) will be operated on by the exact same shader program, even taking the same direction through all the branches. This enables some hardware optimizations, like having only a single instruction cache, fetch unit, and decode logic for each group of 32 threads. In the ideal case, which is common in graphics, they can broadcast the same instruction to all 32 sets of execution units in the same cycle (this is known as SIMD, or Single-Instruction Multiple-Data). They can emulate MIMD (Multiple-Instruction) and SPMD (Single-Program), but when threads within a Streaming Multiprocessor (SM) diverge (take different code paths out of a branch), the issue logic actually switches between each code path on a cycle-by-cycle basis. You can imagine that, in the worst case, where all threads are on separate paths, your hardware utilization just went down by a factor of 32, effectively killing any benefit you would have had by running on a GPU over a CPU, particularly considering the overhead associated with marshalling the dataset from the CPU, over PCIe, to the GPU.
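A toy NumPy analogy of that masking behavior (not how the hardware is actually programmed, just an illustration of the cost model):

```python
import numpy as np

# Pretend these 32 values are one warp's worth of threads hitting a branch.
x = np.random.randn(32)
take_a = x > 0                                   # per-thread branch condition

result = np.empty_like(x)
result[take_a] = np.sqrt(x[take_a])              # "pass 1": threads that took branch A
result[~take_a] = -np.sqrt(-x[~take_a])          # "pass 2": threads that took branch B
# If every thread agreed, one pass would suffice; divergence forces both passes,
# and with 32 distinct paths the work is effectively serialized 32 ways.
```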
That said, ray-tracing, while data-parallel in some sense, has widely-diverging control flow for even modestly-complex scenes. Even if you manage to map a bunch of tightly-spaced rays that you cast out right next to each other onto the same SM, the data and instruction locality you have for the initial bounce won't hold for very long. For instance, imagine all 32 highly-coherent rays bouncing off a sphere. They will all go in fairly different directions after this bounce, and will probably hit objects made out of different materials, with different lighting conditions, and so forth. Every material and set of lighting, occlusion, etc. conditions has its own instruction stream associated with it (to compute refraction, reflection, absorption, etc.), and so it becomes quite difficult to run the same instruction stream on even a significant fraction of the threads in an SM. This problem, with the current state of the art in ray-tracing code, reduces your GPU utilization by a factor of 16-32, which may make performance unacceptable for your application, especially if it's real-time (e.g. a game). It still might be superior to a CPU for e.g. a render farm.
There is an emerging class of MIMD or SPMD accelerators being looked at now in the research community. I would look at these as the logical platforms for software real-time raytracing.
If you're interested in the algorithms involved and mapping them to code, check out POVRay. Also look into photon mapping, it's an interesting technique that even goes one step closer to representing physical reality than raytracing.
It can certainly be done, has been done, and is a hot topic currently among the raytracing and Cuda gurus. I'd start by perusing http://www.nvidia.com/object/cuda_home.html
But it's basically a research problem. People who are doing it well are getting peer-reviewed research papers out of it. But "well" at this point still means that the best GPU/Cuda results are approximately competitive with best-of-class solutions on CPU/multi-core/SSE. So I think that it's a little early to assume that using Cuda is going to accelerate a ray tracer. The problem is that although ray tracing is "embarrassingly parallel" (as they say), it is not the kind of "fixed input and output size" problem that maps straightforwardly to GPUs -- you want trees, stacks, dynamic data structures, etc. It can be done with Cuda/GPU, but it's tricky.
Your question wasn't clear about your experience level or the goals of your project. If this is your first ray tracer and you're just trying to learn, I'd avoid Cuda -- it'll take you 10x longer to develop and you probably won't get good speed. If you're a moderately experienced Cuda programmer and are looking for a challenging project and ray tracing is just a fun thing to learn, by all means, try to do it in Cuda. If you're making a commercial app and you're looking to get a competitive speed edge -- well, it's probably a crap shoot at this point... you might get a performance edge, but at the expense of more difficult development and dependence on particular hardware.
Check back in a year, the answer may be different after another generation or two of GPU speed, Cuda compiler development, and research community experience.