If I do it like this, will it delete only the pointer to the struct, or completely free all the allocated memory without any leaks?
RECT* winrect = new RECT;
winrect->top = 0;
winrect->left = 0;
winrect->right = wnd_width * glb_scale;
winrect->bottom = wnd_height * glb_scale;
delete winrect;
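delete winrect frees the entire RECT object, not just the pointer, so the snippet above does not leak as long as delete runs exactly once on every code path. A minimal sketch of two alternatives that avoid the manual delete altogether (wnd_width, wnd_height, and glb_scale are taken from the question; the function wrapper is illustrative):
#include <windows.h>
#include <memory>

void example(int wnd_width, int wnd_height, int glb_scale) {
    // stack allocation: destroyed automatically at end of scope
    RECT winrect = {};
    winrect.right = wnd_width * glb_scale;
    winrect.bottom = wnd_height * glb_scale;

    // or, if a pointer is needed, let unique_ptr call delete for you
    auto winrectPtr = std::make_unique<RECT>();
    winrectPtr->right = wnd_width * glb_scale;
    winrectPtr->bottom = wnd_height * glb_scale;
}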
It works fine when I set the delay to 20, but the WAV file plays back slowly on some Windows 10 systems.
private static void waveSourceAgent_DataAvailable(object sender, WaveInEventArgs e)
{
    int delay = 10;
    recordedMsgBuffer = RECORDING_BUFFER[0].recordedMsgBuffer;
    int iterations = recordedMsgBuffer.Length / 320;
    int remainingBytes = recordedMsgBuffer.Length % 320;
    for (int i = 0; i < iterations; i++)
    {
        byte[] tempbuff = new byte[320];
        for (int j = 0; j < 320; j++)
        {
            tempbuff[j] = recordedMsgBuffer[(i * 320) + j];
        }
        if (waveProviderReciever != null)
            waveProviderReciever.AddSamples(tempbuff, 0, tempbuff.Length);
        Thread.Sleep(delay);
    }
    // don't drop the tail end of the buffer
    if (remainingBytes > 0 && waveProviderReciever != null)
        waveProviderReciever.AddSamples(recordedMsgBuffer, iterations * 320, remainingBytes);
}
It looks like you are trying to play data as it is received by putting it into a BufferedWaveProvider (though it's not clear exactly how it arrives, since you're not using the DataAvailable args).
The secret to making this work is that the speed at which audio is placed into the buffer should match the speed at which it is read out. If the WaveFormats of record and playback match (which they must for this to work), then the buffer should never fill up unless playback has unexpectedly stopped. See the sketch below.
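For illustration, a minimal local record-and-play sketch with the formats matching on both ends (the 8 kHz, 16-bit mono format and the variable names are assumptions, not from the question):
var format = new WaveFormat(8000, 16, 1);        // must be identical for record and playback
var waveIn = new WaveInEvent { WaveFormat = format };
var provider = new BufferedWaveProvider(format); // playback reads from this buffer
var waveOut = new WaveOutEvent();
waveOut.Init(provider);
waveOut.Play();
waveIn.DataAvailable += (s, a) => provider.AddSamples(a.Buffer, 0, a.BytesRecorded);
waveIn.StartRecording();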
If you are receiving data much faster than you are playing it (e.g. you are downloading a pre-recorded audio file), then you either need to increase the buffer size, or hold off putting any new data into the buffer until there is more space available.
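A sketch of the second option, throttling on the provider's fill level (BufferedBytes and BufferLength are BufferedWaveProvider properties; the helper name and back-off delay are illustrative):
private static void AddSamplesThrottled(BufferedWaveProvider provider, byte[] data, int offset, int count)
{
    // wait until the buffer has drained enough to hold the new chunk
    while (provider.BufferedBytes + count > provider.BufferLength)
    {
        Thread.Sleep(10); // illustrative back-off; tune to your latency needs
    }
    provider.AddSamples(data, offset, count);
}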
Please have a look at this piece of code:
public static function getCharLocationById(id:int):Point {
    var lx:int = id % 16;
    var ly:int = id / 16;
    return new Point(lx, ly);
}
It works perfectly but is very slow. Does anyone know of a way to make it much faster?
Thanks in advance!
If you create the objects beforehand for all possibilities, all you have to do is look them up in an array (with the id as index).
private static const _locationLookUpTable:Array = []; // or Vector, if you like
// fill the array somewhere, maybe like this
for (var i:uint = 0; i <= maximumId; ++i)
    _locationLookUpTable.push(new Point(i % 16, int(i / 16)));
public static function getCharLocationById(id:int):Point {
    return _locationLookUpTable[id];
}
If the number of ids is not limited or is very large, you can employ an object pool instead.
This requires a little more code, as you should return the objects to the pool once they are no longer used; a sketch follows.
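A minimal sketch of such a pool (the class and method names are illustrative, not from the question):
public class PointPool {
    private static const _pool:Vector.<Point> = new Vector.<Point>();

    // reuse a pooled Point if one is available, otherwise allocate
    public static function acquire(x:Number, y:Number):Point {
        if (_pool.length > 0) {
            var p:Point = _pool.pop();
            p.x = x;
            p.y = y;
            return p;
        }
        return new Point(x, y);
    }

    // callers hand Points back here instead of dropping them
    public static function release(p:Point):void {
        _pool.push(p);
    }
}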
Drop the temporary variables: it only takes time to create them, assign to them, and then read them again just to pass the values to the Point constructor.
public static function getCharLocationById(id:int):Point
{
    return new Point(id % 16, id / 16);
}
Also, considering that your input is an integer, you can use bitshifts for the division by 16 like this:
id = id >> 1; // div. by 2 = id/2
id = id >> 1; // div. by 2 = id/2/2 = id/4
id = id >> 1; // div. by 2 = id/2/2/2 = id/8
id = id >> 1; // div. by 2 = id/2/2/2/2 = id/16
Shortening that we get
id = id >> 4; // (1+1+1+1 = 4)
Keep in mind that the result will also be an integer, so 11 >> 1 will return 5 and not 5.5. (Also note that for negative ids an arithmetic shift rounds differently than integer division, so this assumes non-negative ids.)
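Putting both tricks together (using a bit mask for the modulo as well), assuming non-negative ids:
public static function getCharLocationById(id:int):Point {
    // id & 15 equals id % 16, and id >> 4 equals id / 16, for non-negative ids
    return new Point(id & 15, id >> 4);
}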
I am trying to measure the performance difference of a GPU between allocating memory using 'malloc' in a kernel function vs. using pre-allocated storage from 'cudaMalloc' on the host. To do this, I have two kernel functions, one that uses malloc, one that uses a pre-allocated array, and I time the execution of each function repeatedly.
The problem is that the first execution of each kernel function takes between 400 - 2500 microseconds, but all subsequent runs take about 15 - 30 microseconds.
Is this behavior expected, or am I witnessing some sort of carryover effect from previous runs? If this is carryover, what can I do to prevent it?
I have tried putting in a kernel function that zeros out all memory on the GPU between each timed test run to eliminate that carryover, but nothing changed. I have also tried reversing the order in which I run the tests, and that has no effect on relative or absolute execution times.
#include <cuda_runtime.h>
#include <cstddef>
// (Timer is the question's own host-side timer class)

const int TEST_SIZE = 1000;

struct node {
    node* next;
    int data;
};

int staticTest();
int dynamicTest();
void memClear();

int main() {
    int numTests = 5;
    for (int i = 0; i < numTests; ++i) {
        memClear();
        staticTest();
        memClear();
        dynamicTest();
    }
    return 0;
}
__global__ void staticMalloc(int* sum) {
    // fixed-size local array of nodes (no dynamic allocation)
    node head[TEST_SIZE];
    // initialize nodes
    for (int j = 0; j < TEST_SIZE; j++) {
        head[j].next = NULL;
        head[j].data = j;
    }
    // verify creation by adding up values
    int total = 0;
    for (int j = 0; j < TEST_SIZE; j++) {
        total += head[j].data;
    }
    sum[0] = total;
}
/**
 * This is a test that will time execution of static allocation.
 */
int staticTest() {
    int expectedValue = 0;
    for (int i = 0; i < TEST_SIZE; ++i) {
        expectedValue += i;
    }
    // host output vector
    int* h_sum = new int[1];
    h_sum[0] = -1;
    // device output vector
    int* d_sum;
    // vector size
    size_t bytes = sizeof(int);
    // allocate memory on device
    cudaMalloc(&d_sum, bytes);
    // only use 1 CUDA thread
    dim3 blocksize(1, 1, 1), gridsize(1, 1, 1);
    Timer runTimer;
    int runTime = 0;
    // check static allocation time
    runTimer.start();
    staticMalloc<<<gridsize, blocksize>>>(d_sum);
    runTime += runTimer.lap();
    h_sum[0] = 0;
    cudaMemcpy(h_sum, d_sum, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_sum);
    delete[] h_sum; // array form to match new int[1]
    return 0;
}
__global__ void dynamicMalloc(int* sum) {
    // start a linked list
    node* headPtr = (node*) malloc(sizeof(node));
    headPtr->data = 0;
    headPtr->next = NULL;
    node* curPtr = headPtr;
    // add nodes to test in-kernel malloc
    for (int j = 1; j < TEST_SIZE; j++) {
        // allocate the node & assign values
        node* nodePtr = (node*) malloc(sizeof(node));
        nodePtr->data = j;
        nodePtr->next = NULL;
        // add it to the linked list
        curPtr->next = nodePtr;
        curPtr = nodePtr;
    }
    // verify creation by adding up values
    curPtr = headPtr;
    int total = 0;
    while (curPtr != NULL) {
        // add and increment current value
        total += curPtr->data;
        curPtr = curPtr->next;
        // clean up memory
        free(headPtr);
        headPtr = curPtr;
    }
    sum[0] = total;
}
/**
 * Host function that prepares data array and passes it to the CUDA kernel.
 */
int dynamicTest() {
    // host output vector
    int* h_sum = new int[1];
    h_sum[0] = -1;
    // device output vector
    int* d_sum;
    // vector size
    size_t bytes = sizeof(int);
    // allocate memory on device
    cudaMalloc(&d_sum, bytes);
    // only use 1 CUDA thread
    dim3 blocksize(1, 1, 1), gridsize(1, 1, 1);
    Timer runTimer;
    int runTime = 0;
    // check dynamic allocation time
    runTimer.start();
    dynamicMalloc<<<gridsize, blocksize>>>(d_sum);
    runTime += runTimer.lap();
    h_sum[0] = 0;
    cudaMemcpy(h_sum, d_sum, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_sum);
    delete[] h_sum; // array form to match new int[1]
    return 0;
}
__global__ void clearMemory(char *zeros) {
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    zeros[i] = 0;
}

void memClear() {
    char *zeros[1024]; // device pointers
    // note: this allocates 4 GB in total before freeing; add error checking in real code
    for (int i = 0; i < 1024; ++i) {
        cudaMalloc((void**) &(zeros[i]), 4 * 1024 * 1024);
        // 4096 blocks of 1024 threads = 4M threads, one per byte
        // (blocks may not exceed 1024 threads, so the block size goes second)
        clearMemory<<<4 * 1024, 1024>>>(zeros[i]);
    }
    for (int i = 0; i < 1024; ++i) {
        cudaFree(zeros[i]);
    }
}
The first execution of a kernel takes more time because the runtime has to load a lot of stuff onto the GPU (the kernel module, libraries, etc.). To prove it, you can just measure how long it takes to launch an empty kernel; you will see that it takes some time. Try something like:
#include <chrono>

__global__ void emptyKernel() {}

// first launch: includes one-time CUDA context/module initialization
auto start = std::chrono::high_resolution_clock::now();
emptyKernel<<<1, 1>>>();
cudaDeviceSynchronize();
auto firstTiming = std::chrono::high_resolution_clock::now() - start;

// second launch: only the launch and execution overhead remains
start = std::chrono::high_resolution_clock::now();
emptyKernel<<<1, 1>>>();
cudaDeviceSynchronize();
auto secondTiming = std::chrono::high_resolution_clock::now() - start;
You will see that secondTiming is significantly smaller than firstTiming.
The first CUDA (kernel) call initializes the CUDA system transparently. You can avoid this by calling an empty kernel first. In e.g. OpenCL you have to do all that init-stuff manually; CUDA does it for you in the background.
Then there are some problems with your timing: CUDA kernel calls are asynchronous, so (assuming your Timer class is a host timer like time()) you are currently measuring the kernel launch time (plus, for the first call, the init time of CUDA), not the kernel execution time.
At the very least you HAVE to do a cudaDeviceSynchronize() before starting AND before stopping the timer, as sketched below.
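Applied to the question's staticTest (keeping its own Timer class), that looks like:
cudaDeviceSynchronize();                       // drain any previously queued work
runTimer.start();
staticMalloc<<<gridsize, blocksize>>>(d_sum);
cudaDeviceSynchronize();                       // wait for the kernel to actually finish
runTime += runTimer.lap();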
You are better off using CUDA events, which measure exactly the kernel execution time and only that; with host timers you still include the launch overhead. See https://devblogs.nvidia.com/parallelforall/how-implement-performance-metrics-cuda-cc/
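A minimal sketch of event-based timing around the same kernel launch:
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
staticMalloc<<<gridsize, blocksize>>>(d_sum);
cudaEventRecord(stop);

cudaEventSynchronize(stop);              // block until the stop event has been reached
float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);  // elapsed GPU time in milliseconds

cudaEventDestroy(start);
cudaEventDestroy(stop);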
I know it was asked a thousand times before, but I still can't find a solution.
Searching SO, I indeed found the algorithm for it, but lacking the mathematical knowledge required to truly understand it, I am helplessly lost!
To start with the beginning, my goal is to compute an entire spectrogram and save it to an image in order to use it for a visualizer.
I tried using Sound.computeSpectrum, but that requires playing the sound and waiting for it to end; I want to compute the spectrogram in far less time than it would take to listen to the whole song, and I have two-hour-long MP3s.
What I am doing now is to read the bytes from a Sound object, then separate them into two Vectors (one per channel). Then, using a timer, every 100 ms I call a function (step1) containing the implementation of the algorithm, as follows:
for each vector (one per channel) I apply the Hann window function to the elements;
for each vector I nullify the imaginary part (I have a secondary vector for that);
for each vector I apply the FFT;
for each vector I find the magnitude of the first N / 2 elements;
for each vector I convert the squared magnitude to dB scale;
end.
But I get only negative values, and only about 30 percent of the results appear useful (the rest are identical).
I will post the code for only one channel, to get rid of the "for each vector" part.
private var N:Number = 512;

private function step1() : void
{
    var xReLeft:Vector.<Number> = new Vector.<Number>(N);
    var xImLeft:Vector.<Number> = new Vector.<Number>(N);
    // getting sample range
    var leftA:Vector.<Number> = this.channels.left.slice(step * N, step * N + N);
    if (leftA.length < N)
    {
        // not enough samples left for a full window: stop stepping
        stepper.removeEventListener(TimerEvent.TIMER, getFreq100ms);
        return;
    }
    var i:int;
    // Hann window function init
    m_win = new Vector.<Number>(N);
    for (i = 0; i < N; i++)
        m_win[i] = (4.0 / N) * 0.5 * (1 - Math.cos(2 * Math.PI * i / N));
    // applying Hann window function
    for (i = 0; i < N; i++)
    {
        xReLeft[i] = m_win[i] * leftA[i];
        //xReRight[i] = m_win[i] * rightA[i];
    }
    // nullify the imaginary part
    for (i = 0; i < N; i++)
    {
        xImLeft[i] = 0.0;
        //xImRight[i] = 0.0;
    }
    var magnitutel:Vector.<Number> = new Vector.<Number>(N);
    fftl.run(xReLeft, xImLeft);
    current = xReLeft;
    currf = xImLeft;
    // magnitude of the first N / 2 bins
    for (i = 0; i < N / 2; i++)
    {
        var re:Number = xReLeft[i];
        var im:Number = xImLeft[i];
        magnitutel[i] = Math.sqrt(re * re + im * im);
    }
    // convert to dB: 20 * log10(magnitude)
    const SCALE:Number = 20 / Math.LN10;
    for (i = 0; i < N / 2; i++)
    {
        magnitutel[i] = SCALE * Math.log(magnitutel[i] + Number.MIN_VALUE);
    }
    var bufferl:Vector.<Number> = new Vector.<Number>();
    for (i = 0; i < N / 2; i++)
    {
        bufferl[i] = magnitutel[i];
    }
    var complete:Vector.<Vector.<Number>> = new Vector.<Vector.<Number>>();
    complete[0] = bufferl;
    this.total[step] = complete;
    this.step++;
}
This function is executed in the event dispatched by the timer (stepper).
Obviously I am doing something wrong; as I said, I get only negative values, and furthermore the values range between 1 and 7000 (at least).
I want to thank you in advance for any help.
With respect,
Paul
Negative dB values are OK. Just add a constant (representing your volume control) until the points you want to color become positive. Values that stay negative are usually just displayed as black in a spectrogram, no matter how negative they are (they might just be the FFT's numerical noise, which can be a hugely negative dB number, or even NaN or -Inf from log(0)).
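A minimal sketch of that mapping, reusing the question's magnitutel vector (the OFFSET constant and the gray-level mapping are illustrative choices, not part of the answer):
const OFFSET:Number = 96; // acts like a volume control; tune to taste
for (i = 0; i < N / 2; i++)
{
    var db:Number = magnitutel[i] + OFFSET;
    if (isNaN(db) || db < 0)
        db = 0; // everything below the floor renders as black
    var gray:int = Math.min(255, int(db * 255 / OFFSET));
    // plot 'gray' at column 'step', row 'i' of the spectrogram bitmap
}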
I have a Matrix which I recycle and use for drawing DisplayObject instances onto a Bitmap.
At the moment, I reset the Matrix before I render each item, like this:
_matrix.a = 1;
_matrix.b = 0;
_matrix.c = 0;
_matrix.d = 1;
_matrix.tx = 0;
_matrix.ty = 0;
Would it be better to do the above, or simply to do this?
_matrix = new Matrix();
Generally I would say the former; however, I'm unsure whether, in the case of Matrix, there is some heavy (mathematical) work going on behind each of those property assignments.
I think reusing the same instance of Matrix is more efficient than creating a new one every time.
In fact, creating a new instance is a relatively heavy operation, and that's why caches are used: to create a few instances and reuse them instead of creating a large number of instances.
I ran a little benchmark and it confirms my answer:
import flash.geom.Matrix;
import flash.utils.getTimer;

var _matrix:Matrix = new Matrix();
var t:Number;
var i:int;
var N:int = 10000000;
t = getTimer();
for (i = 0; i < N; i++) {
_matrix = new Matrix();
}
trace(getTimer()-t); // 7600
t = getTimer();
for (i = 0; i < N; i++) {
_matrix.a = 1;
_matrix.b = 0;
_matrix.c = 0;
_matrix.d = 1;
_matrix.tx = 0;
_matrix.ty = 0;
}
trace(getTimer()-t); // 4162
Finally, note that the difference is not that great: creating 10000000 new instances takes only 7600 ms, so unless you are creating thousands of matrices per frame, neither approach will have a noticeable impact on performance.
EDIT:
Using the identity() method gives the advantages of both approaches (simplicity and performance):
t = getTimer();
for (i = 0; i < N; i++) {
_matrix.identity();
}
trace(getTimer()-t); // 4140