Nsight Compute - get total number of samples?

When you use Nsight Compute's Source, PTX, or SASS view, you see the number of samples taken on each line. However, you don't see (or it is difficult to find) the total number of samples taken overall, which you would like to normalize by.
Is it listed somewhere inconspicuous, or is it just missing?

There is currently no dedicated metric or place in the tool where the total number of samples is shown. We will look into adding this in the future.
As a workaround, you should be able to collapse the source view using the +/- button, which aggregates the metrics, including samples, per function/file and makes it easier to manually count, should your kernel have multiple functions/files.

Related

Web Audio Pitch Detection for Tuner

So I have been making a simple HTML5 tuner using the Web Audio API. I have it all set up to respond to the correct frequencies; the problem seems to be with getting the actual frequencies. Using the input, I create an array of the spectrum, where I look for the highest value and use that frequency as the one to feed into the tuner. The problem is that when creating an analyser in Web Audio, it cannot get more fine-grained than an FFT size of 2048. So if I play a 440 Hz note, the closest bin in the array is something like 430 Hz and the next value is higher than 440. Therefore the tuner thinks I am playing those notes, when in fact the loudest frequency should be 440 Hz and not 430 Hz. Since this frequency does not exist in the analyser array, I am trying to figure out a way around this, or whether I am missing something obvious.
I am very new at this so any help would be very appreciated.
Thanks
There are a number of approaches to implementing pitch detection. This paper provides a review of them. Their conclusion is that using FFTs may not be the best way to go - however, it's unclear quite what their FFT-based algorithm actually did.
If you're simply tuning guitar strings to fixed frequencies, much simpler approaches exist. Building a fully chromatic tuner that does not know a-priori the frequency to expect is hard.
The FFT approach you're using is entirely possible (I've built a robust musical instrument tuner using this approach that is being used white-label by a number of 3rd parties). However you need a significant amount of post-processing of the FFT data.
To start, you solve the resolution problem using the Short-Time FFT (STFT), or more precisely, a succession of them. The process is described nicely in this article.
If you intend to build a tuner for guitar and bass guitar (and let's face it, everyone who asks the question here is), you'll need at least a 4096-point DFT with overlapping windows in order to resolve the bottom E1 string at ~41 Hz.
You have a bunch of other algorithmic and usability hurdles to overcome. Not least, perceived pitch and the spectral peak aren't always the same. Taking the spectral peak straight from the STFT doesn't work reliably (this is also why the basic auto-correlation approach is broken).
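One common piece of that post-processing (not necessarily the paper's algorithm, just a widely used refinement) is to interpolate between FFT bins instead of taking the raw peak bin, which directly addresses the 430 Hz vs 440 Hz problem. A sketch in C++, assuming you already have the magnitude spectrum, the sample rate and the FFT size:

#include <cmath>
#include <cstddef>
#include <vector>

// Refine a spectral peak estimate by fitting a parabola through the peak
// bin and its two neighbours. 'mag' holds FFT magnitudes, 'peakBin' is the
// index of the largest magnitude; sampleRate and fftSize come from however
// the spectrum was produced.
double interpolatedPeakHz(const std::vector<double>& mag,
                          std::size_t peakBin,
                          double sampleRate,
                          std::size_t fftSize)
{
    if (peakBin == 0 || peakBin + 1 >= mag.size())
        return peakBin * sampleRate / fftSize;   // cannot interpolate at the edges

    // Work in log magnitude for a better parabolic fit.
    const double a = std::log(mag[peakBin - 1] + 1e-12);
    const double b = std::log(mag[peakBin]     + 1e-12);
    const double c = std::log(mag[peakBin + 1] + 1e-12);

    // Offset of the true peak from the centre bin, in bins (roughly -0.5 .. +0.5).
    const double offset = 0.5 * (a - c) / (a - 2.0 * b + c);

    return (peakBin + offset) * sampleRate / fftSize;
}

With a 2048-point spectrum at 44.1 kHz the bins are about 21.5 Hz apart, so this kind of interpolation is what lets you report 440 Hz rather than the nearest bin.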

Nvprof, metrics, the elapsed_cycles variable

I'm having some issues with the CUDA nvprof profiler. Some of the metrics on the site are named differently than in the profiler, and the variables don't seem to be explained anywhere on the site, or for that matter anywhere on the web (I wasn't able to find any valid reference).
I decoded most of those (here: calculating gst_throughput and gld_throughput with nvprof), but I'm still not sure about:
elapsed_cycles
max_warps_per_sm
Anyone knows precisely how to count those?
I'm trying to use nvprof to assess some 6000 different kernels via the command line, so it is not really viable for me to use the Visual Profiler.
Any help appreciated. Thanks very much!
EDIT:
What I'm using:
CUDA 5.0, GTX480 which is cc. 2.0.
What I've already done:
I've made a script that gets the formulas for each of the metrics from the profiler documentation site, resolves the dependencies for any given metric, extracts those through nvprof, and then computes the results from them. This involved using a (rather large) sed script that changes all occurrences of the variable names that appear on the site to the ones with the same meaning that are actually accepted by the profiler. Basically I've emulated grepping metrics via nvprof. I'm just having problems with these two:
Why there is a problem with those concrete variables:
max_warps_per_sm - Is it the bound imposed by the compute capability, or is it another metric/event that I am perhaps somehow missing and that is specific to my program? (That wouldn't be a surprise, as some of the variables in the profiler documentation have 3 (!) different names, all for the same thing.)
elapsed_cycles - I don't have elapsed_cycles in the output of nvprof --query-events. There isn't even anything containing the word "elapse", and the only event containing "cycle" is "active_cycles". Could that be it? Is there any other way to count it? Is there any harm in using "gputime" instead of this variable? I don't need absolute numbers; I'm using it to find correlations and analyze code, so if "gputime" = "elapsed_cycles" * CONSTANT, I'm perfectly okay with that.
You can use the following command that lists all the events available on each device:
nvprof --query-events
This is not very complete, but it's a good start to understand what these events/metrics are. For instance, with CUDA 5.0 and a CC 3.0 GPU, we get:
elapsed_cycles_sm: Elapsed clocks
elapsed_cycles_sm is the number of elapsed clock cycles per multiprocessor. If you want to measure this metric for your program:
nvprof --events elapsed_cycles_sm ./your_program
max_warps_per_sm is quite straightforward: it is the maximum number of resident warps per multiprocessor. This value depends on the Compute Capability (see the chart here). It is a hardware limit: no matter what your kernels are, at any given time you will never have more resident warps per multiprocessor than this value.
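If you prefer to derive the value programmatically rather than read it off the chart, the CUDA runtime exposes the underlying limits through cudaDeviceProp. A minimal sketch (assuming device 0 is the GPU you profile):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // device 0, adjust as needed

    // maxThreadsPerMultiProcessor is expressed in threads; dividing by the
    // warp size (32 on current hardware) gives the max resident warps per SM.
    int maxWarpsPerSm = prop.maxThreadsPerMultiProcessor / prop.warpSize;
    printf("max_warps_per_sm = %d\n", maxWarpsPerSm);
    return 0;
}

On a CC 2.0 device such as the GTX 480 this prints 48 (1536 resident threads / 32 threads per warp).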
Also, more information is available in the profiler's online documentation, with descriptions and formulae.
UPDATE
According to this answer:
active_cycles: Number of cycles a multiprocessor has at least one active warp.

are there existing libraries for many optimization jobs in parallel on GPU

I'm looking to perform many (thousands) of small optimization jobs on my nVidia Geforce.
By small jobs I mean 3-6 dimensions and around 1000 input data points each. Basically it's for curve fitting purposes, so the objective function to minimize is a sum of squares of a continuous (non-trivial) analytical function, of which I can compute the first derivative analytically. Each dimension is constrained between a lower and an upper bound.
The only thing these jobs have in common is the original data series, from which they each take a different set of 1000 data points.
I suspect this will be much faster on my GPU than it is now, running them one by one on my CPU, so I could use it for real-time monitoring.
However, the GPU libraries I've seen only focus on making a single function evaluation faster on the GPU.
There was a thread on my specific question on the NVIDIA CUDA forum, with more users looking for this, but the forums have been down for a while. It mentioned porting an existing C library (e.g. levmar) to CUDA, but that got lost...
Do you know of an existing library to run many optimizations in parallel on a gpu?
Thanks!
The GFOR loop is meant to tile together many small problems like this. Each body of the loop is tiled together with the other loop bodies. Many people have used it for optimization problems like the ones you describe. It is available for C/C++, Fortran, or Python as well as MATLAB code.
My disclaimer is that I work on GFOR. But I'm not aware of any other production-level GPU library that does similar optimization problems. You might be able to find some academic projects if you search around.
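If no library fits and you end up writing CUDA yourself, the usual layout for this kind of workload is one small fitting job per thread (or per block), with each job's data stored contiguously. The sketch below is only an illustration of that layout, not a real solver: it fits a straight line y = a*x + b to each job's points by closed-form least squares, and all names and sizes are made up. A constrained 3-6 dimensional fit (e.g. a ported levmar) would replace the per-thread body but keep the same structure.

// One curve-fitting job per thread: job j owns pointsPerJob (x, y) pairs,
// stored contiguously, and we fit y = a*x + b by ordinary least squares.
// A real solver (gradient descent, Levenberg-Marquardt, ...) would replace
// the closed-form maths but keep the same one-job-per-thread layout.
__global__ void fitLines(const float* x, const float* y,
                         float* a, float* b,
                         int numJobs, int pointsPerJob)
{
    int job = blockIdx.x * blockDim.x + threadIdx.x;
    if (job >= numJobs) return;

    const float* xs = x + job * pointsPerJob;
    const float* ys = y + job * pointsPerJob;

    float sx = 0.f, sy = 0.f, sxx = 0.f, sxy = 0.f;
    for (int i = 0; i < pointsPerJob; ++i) {
        sx  += xs[i];
        sy  += ys[i];
        sxx += xs[i] * xs[i];
        sxy += xs[i] * ys[i];
    }
    float n     = (float)pointsPerJob;
    float denom = n * sxx - sx * sx;   // assumes the x values are not all equal
    a[job] = (n * sxy - sx * sy) / denom;
    b[job] = (sy - a[job] * sx) / n;
}

// Launch sketch: fitLines<<<(numJobs + 255) / 256, 256>>>(dX, dY, dA, dB, numJobs, 1000);

With thousands of independent jobs of this size, the GPU stays busy even though each individual job is tiny.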

Specifying number of samples for a custom xaudio2 effect

I'm trying to write a custom XAudio2 effect that involves a Fourier transform. However, the number of samples given to the process method on each call is not a power of 2 (a precondition of the Fourier transform implementation I have).
Is there a way to force power-of-2 sized blocks of samples? Is there a technique that allows working with non-power-of-2 sizes?
Don't send samples to the FFT on every call in which you are given samples. Buffer (save) them up until you have at least a power-of-2 number of samples, then process that power-of-2 block from your intermediate buffer. Rinse and repeat.
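A minimal C++ sketch of that buffering idea (the class and processFft are placeholders I made up; XAudio2 hands the process method whatever block size it likes, and only complete power-of-2 blocks are forwarded to the FFT):

#include <cstddef>
#include <vector>

// Accumulates incoming samples and hands them to the FFT only in
// power-of-2 sized blocks.
class FftBlocker {
public:
    explicit FftBlocker(std::size_t fftSize) : fftSize_(fftSize) {}

    // Call from the effect's process method with however many samples were delivered.
    void push(const float* samples, std::size_t count) {
        pending_.insert(pending_.end(), samples, samples + count);
        while (pending_.size() >= fftSize_) {
            processFft(pending_.data(), fftSize_);
            pending_.erase(pending_.begin(), pending_.begin() + fftSize_);
        }
    }

private:
    void processFft(const float* block, std::size_t n) {
        // Placeholder: run the real power-of-2 FFT implementation on 'block' here.
        (void)block; (void)n;
    }

    std::size_t fftSize_;          // must be a power of 2, e.g. 2048
    std::vector<float> pending_;   // samples waiting for a full block
};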
Also, newer FFTs will often allow sizes with prime factors larger than 2.
If your implementation requires a power-of-2 sample size, then you can pad the block out to that size. Zero padding is the easiest/most straightforward way.
Here is an article that explains another way to do it:
The Chirp z-Transform Algorithm and Its Application

What is Cyclomatic Complexity?

A term that I see every now and then is "Cyclomatic Complexity". Here on SO I saw some questions about "how to calculate the CC of language X" or "how do I do Y with the minimum amount of CC", but I'm not sure I really understand what it is.
On the NDepend website, I saw an explanation that basically says "the number of decisions in a method: each if, for, &&, etc. adds +1 to the CC score". Is that really it? If so, why is this bad? I can see that one might want to keep the number of if-statements fairly low to keep the code easy to understand, but is that really all there is to it?
Or is there some deeper concept to it?
I'm not aware of a deeper concept. I believe it's generally considered in the context of a maintainability index. The more branches there are within a particular method, the more difficult it is to maintain a mental model of that method's operation (generally).
Methods with higher cyclomatic complexity are also more difficult to obtain full code coverage on in unit tests. (Thanks Mark W!)
That brings in all the other aspects of maintainability, of course: likelihood of errors, regressions, and so forth. The core concept is pretty straightforward, though.
Cyclomatic complexity measures the number of times you must execute a block of code with varying parameters in order to execute every path through that block. A higher count is bad because it increases the chances of logical errors escaping your testing strategy.
Cyclomatic complexity = number of decision points + 1
The decision points are conditional statements such as if, if…else, switch, for loops, while loops, etc.
The following chart describes what the value says about the application:
Cyclomatic complexity 1–10: normal application
Cyclomatic complexity 11–20: moderate application
Cyclomatic complexity 21–50: risky application
Cyclomatic complexity more than 50: unstable application
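As a concrete illustration of the decision-points rule (a made-up C++ function, not taken from any of the posts above):

// Decision points: the for loop condition, the if, and the && inside it.
// Cyclomatic complexity = 3 decision points + 1 = 4.
int countInRange(const int* values, int n, int lo, int hi)
{
    int count = 0;
    for (int i = 0; i < n; ++i) {                 // +1 (loop)
        if (values[i] >= lo && values[i] <= hi)   // +1 (if), +1 (&&)
            ++count;
    }
    return count;
}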
Wikipedia may be your friend on this one: Definition of cyclomatic complexity
Basically, you have to imagine your program as a control flow graph and then
The complexity is (...) defined as:
M = E − N + 2P
where
M = cyclomatic complexity,
E = the number of edges of the graph
N = the number of nodes of the graph
P = the number of connected components
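As a quick worked example: a function containing a single if/else has a control flow graph with 4 nodes (the decision, the two branches, and the point where they merge), 4 edges, and 1 connected component, so M = 4 − 4 + 2×1 = 2, which matches the "decision points + 1" count.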
CC is a concept that attempts to capture how complex your program is and how hard it is to test it in a single integer number.
Yep, that's really it. The more execution paths your code can take, the more things must be tested, and the higher the probability of error.
Another interesting point I've heard:
The places in your code with the biggest indents should have the highest CC. These are generally the most important areas to ensure test coverage for, because they're expected to be harder to read and maintain. As other answers note, they're also the regions of code where it's harder to achieve full coverage.
Cyclomatic Complexity really is just a scary buzzword. In fact it's a measure of code complexity used in software development to point out the more complex parts of code (which are more likely to be buggy, and therefore have to be tested very carefully and thoroughly). You can calculate it using the E − N + 2P formula, but I would suggest you have it calculated automatically by a plugin. I have heard of a rule of thumb that you should strive to keep the CC below 5 to maintain good readability and maintainability of your code.
I have just recently experimented with the Eclipse Metrics Plugin on my Java projects. It has a really nice and concise help file which integrates with the regular Eclipse help, where you can read more definitions of various complexity measures and tips and tricks for improving your code.
That's it; the idea is that a method with a low CC has fewer forks, loops, etc., all of which make a method more complex. Imagine reviewing 500,000 lines of code with an analyzer and seeing a couple of methods with an order of magnitude higher CC. That lets you focus on refactoring those methods for better understanding (it's also common for a high CC to go with a high bug rate).
Each decision point in a routine (loop, switch, if, etc.) essentially boils down to an if-statement equivalent. For each if you have 2 code paths that can be taken. So with the 1st branch there are 2 code paths, with the 2nd there are 4 possible paths, with the 3rd there are 8, and so on. There can be up to 2**N code paths, where N is the number of branches.
This makes it difficult to understand the behavior of code and to test it when N grows beyond some small number.
The answers provided so far do not mention the correlation of software quality with cyclomatic complexity. Research has shown that having a lower cyclomatic complexity metric should help develop software that is of higher quality. It can help with software quality attributes such as readability, maintainability, and portability. In general one should attempt to obtain a cyclomatic complexity metric of between 5 and 10.
One of the reasons for using metrics like cyclomatic complexity is that in general a human being can only keep track of about 7 (plus or minus 2) pieces of information simultaneously in working memory. Therefore, if your software is overly complex, with multiple decision paths (i.e. a high cyclomatic complexity metric), it is unlikely that you will be able to visualize how it will behave. This would most likely lead to developing erroneous or bug-ridden software. More information about this can be found here and also on Wikipedia.
Cyclomatic complexity is computed from the control flow graph: it is a quantitative measure of the number of linearly independent paths through a program's source code, driven by decision constructs such as if, if/else, for, and while.
Cyclomatic complexity is basically a metric for identifying areas of code that need more attention for maintainability; it is essentially an input to refactoring.
It definitely gives an indication of where code can be improved, in terms of avoiding deeply nested loops, conditions, etc.
That's sort of it. However, each branch of a "case" or "switch" statement tends to count as 1. In effect, this means CC hates case statements, and any code that requires them (command processors, state machines, etc).
Consider the control flow graph of your function, with an additional edge running from the exit to the entrance. The cyclomatic complexity is the maximum number of cuts we can make without separating the graph into two pieces.
For example:
function F:
    if condition1:
        ...
    else:
        ...
    if condition2:
        ...
    else:
        ...
[Control flow graph of F]
You can probably intuitively see why the linked graph has a cyclomatic complexity of 3.
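(Counting decision points gives the same answer: two independent if/else decisions, so 2 + 1 = 3.)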
Cyclomatic complexity is a measure of how complex a unit of software is. It measures the number of different paths a program might follow through conditional logic constructs (if, while, for, switch/case, etc.). If you would like to learn more about calculating it, here is a helpful YouTube video: https://www.youtube.com/watch?v=PlCGomvu-NM
It is important in designing test cases because it reveals the different paths or scenarios a program can take.
"To have good testability and maintainability, McCabe recommends
that no program module should exceed a cyclomatic complexity of 10"(Marsic,2012, p. 232).
Reference:
Marsic, I. (2012, September). Software Engineering. Rutgers University. Retrieved from www.ece.rutgers.edu/~marsic/books/SE/book-SE_marsic.pdf