Single Precision math slower than Double Precision in FFTW? [closed] - fft

Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 9 years ago.
I am looking at the benchmarks of an FFT library and wondering why double-precision math would be faster than single-precision (even on 32-bit hardware).

Assuming Intel CPUs, it all depends on the compiler. When compiling 32-bit applications, you can use the normal x87 floating point instructions, where single and double precision run at the same speed. Or you can select SSE for SP and SSE2 for DP, where SSE (4 words per register) is twice as fast as SSE2 (2 words per register). When compiling for 64 bits, x87 instructions are not available, so floating point is always compiled to use SSE/SSE2. Depending on the compiler or the particular program, these can be compiled as SIMD (Single Instruction Multiple Data, 4 or 2 words at a time) or SISD (Single Instruction Single Data, one word per register). In the SISD case, I suppose, SP and DP will be of similar speed and the code can be slower than 32-bit compilations.
Using data from RAM, and possibly cache, performance can be limited by bus speed, in which case SP will be faster than DP. If the code is like my FFT benchmarks, it depends on strided (skipped sequential) reading and writing. Then speed is affected by data being read in bursts of at least 64 bytes, where SP is likely to be a little faster.
Functions such as trig functions are often calculated in DP. Then SP is a bit slower due to the DP-to-SP conversion.
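Here is a minimal, self-contained sketch of the kind of measurement described above: the same multiply-add loop timed in single and double precision. The compiler flags mentioned in the comment are assumptions, and whether the compiler emits x87, scalar SSE, or packed SIMD code determines which of the cases above you are actually timing.

```c
/* Minimal sketch: time the same multiply-add loop in single and double
 * precision.  Whether this exercises x87, scalar SSE, or packed SIMD
 * depends entirely on the compiler and its flags (e.g. gcc -O2,
 * -mfpmath=387, -march=native), which is the point made above. */
#include <stdio.h>
#include <time.h>

#define N    1000000
#define REPS 100

static float  fa[N], fb[N];
static double da[N], db[N];

int main(void)
{
    for (int i = 0; i < N; i++) {
        fa[i] = fb[i] = (float)i * 0.5f;
        da[i] = db[i] = (double)i * 0.5;
    }

    clock_t t0 = clock();
    for (int r = 0; r < REPS; r++)
        for (int i = 0; i < N; i++)
            fa[i] = fa[i] * 1.000001f + fb[i];   /* single precision work */
    clock_t t1 = clock();
    for (int r = 0; r < REPS; r++)
        for (int i = 0; i < N; i++)
            da[i] = da[i] * 1.000001 + db[i];    /* double precision work */
    clock_t t2 = clock();

    printf("single precision: %.3f s\n", (double)(t1 - t0) / CLOCKS_PER_SEC);
    printf("double precision: %.3f s\n", (double)(t2 - t1) / CLOCKS_PER_SEC);
    return 0;
}
```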

Related

Safe maximum amount of nodes in the DOM? [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 5 years ago.
For a web application, given the available memory in a target mobile device [1] running a target mobile browser [2], how can one estimate the maximum number of DOM nodes, including text nodes, that can be generated via HTML or DHTML?
How can one calculate the estimate before
Failure
Crash
Significant degradation in response
Also, is there a hard per-tab limit in any browser that must not be crossed?
Regarding Prior Closure
This is not like the other questions in the comments below. It is also asking a very specific question seeking a method for estimation. There is nothing duplicated, broad, or opinion-based about it, especially now that it is rewritten for clarity without changing its author's expressed interests.
Footnotes
[1] For instance, Android or iOS mobile devices sold from 2013 to 2018 with some specific RAM capacity
[2] Firefox, Chrome, IE 11, Edge, Opera, Safari
This is a question for which only a statistical answer could be accurate and comprehensive.
Why
The appropriate equation is this, where N is the number of nodes, bytesN is the total bytes required to represent them in the DOM, and the node index n ∈ [0, N).
bytes_N = Σ_{n=0}^{N−1} (bytesContent_n + bytesOverhead_n)
The value requested in the question is the maximum value of N in the worst case handheld device, operating system, browser, and operating conditions. Solving for N for each permutation is not trivial. The equation above reveals three dependencies, each of which could drastically alter the answer.
The average size of a node is dependent on the average number of bytes used in each to hold the content, such as UTF-8 text, attribute names and values, or cached information.
The average overhead of a DOM object is dependent on the HTTP user agent that manages the DOM representation of each document. W3C's Document Object Model FAQ states, "While all DOM implementations should be interoperable, they may vary considerably in code size, memory demand, and performance of individual operations."
The memory available to use for DOM representations is dependent upon the browser used by default (which can vary depending on what browser handheld device vendors or users prefer), user override of the default browser, the operating system version, the memory capacity of the handheld device, common background tasks, and other memory consumption.
Rigorous Solution
One could run tests to determine (1) and (2) for each of the common HTTP user agents used on handheld devices. The distribution of user agents for any given site can be obtained by configuring the logging mechanism of the web server to record HTTP_USER_AGENT if it isn't there by default, and then stripping all but that field from the log and counting the instances of each value.
The number of bytes per character would need to be tested for both attribute values and UTF-8 inner text (or whatever the encoding is) to get a clear pair of factors for calculating (1).
The memory available would need to be tested too under a variety of common conditions, which would be a major research project by itself.
The particular value of N chosen would have to be ZERO to handle the actual worst case, so instead one would choose to cover a certain percentage of typical cases of content, node structures, and run time conditions. For instance, one may take a sample of cases using some form of randomized in situ (within normal environmental conditions) study and find the N that satisfies 95% of those cases, as sketched below.
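As a minimal sketch of that last step (the sample values here are hypothetical, not measured): given the largest node count each sampled case could handle, the N that satisfies 95% of the cases is simply the 5th percentile of those measurements.

```c
/* Sketch of the sampling step described above: given measured maximum
 * node counts from a set of in-situ test cases (values are hypothetical),
 * pick the N that 95% of the cases can still handle, i.e. the 5th
 * percentile of the sample. */
#include <stdio.h>
#include <stdlib.h>

static int cmp(const void *a, const void *b)
{
    long x = *(const long *)a, y = *(const long *)b;
    return (x > y) - (x < y);
}

int main(void)
{
    long samples[] = { 2100000, 350000, 900000, 4800000, 120000,
                       760000, 1500000, 640000, 280000, 3300000 };
    size_t n = sizeof samples / sizeof samples[0];

    qsort(samples, n, sizeof samples[0], cmp);
    size_t idx = (size_t)(0.05 * (double)(n - 1));   /* 5th percentile index */
    printf("N satisfying ~95%% of sampled cases: %ld nodes\n", samples[idx]);
    return 0;
}
```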
Perhaps a set of cases could be tested in the above ways and the results placed in a table. Such would represent a direct answer to your question.
I'm guessing it would take an excellent mobile software engineer with a good math background and a statistics expert working together full time with a substantial budget for about four weeks to get reasonable results.
A More Practical Estimation
One could guess the worst case scenario. With a few full days of research and a few proof-of-concept apps, this proposal could be refined. Absent the time to do that, here's a good first guess.
Consider a cell phone that permits 1 GByte for the DOM because normal operating conditions consume 3 GBytes of its 4 GBytes for the above-mentioned purposes. To get a ballpark figure, one might assume the average consumption of memory for a node to be as follows.
2 bytes per character for 40 characters of inner text per node
2 bytes per character for 4 attribute values of 10 characters each
1 byte per character for 4 attribute names of 4 characters each
160 bytes for the C/C++ node overhead
In this case N_worst_case, the worst case max nodes, is
= 1,024 X 1,024 X 1,024 / (2 X 40 + 2 X 4 X 10 + 1 X 4 X 4 + 160)
= 1,073,741,824 / 336
≈ 3,195,660 nodes.
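The same ballpark arithmetic, reproduced in a few lines of C (the per-node figures are the assumptions listed above, not measured values):

```c
/* Reproduces the ballpark estimate above: bytes per node from the
 * assumed figures, then nodes per 1 GiB of DOM budget. */
#include <stdio.h>

int main(void)
{
    long inner_text  = 2 * 40;      /* 2 bytes/char * 40 chars of inner text  */
    long attr_values = 2 * 4 * 10;  /* 2 bytes/char * 4 values * 10 chars     */
    long attr_names  = 1 * 4 * 4;   /* 1 byte/char  * 4 names  * 4 chars      */
    long overhead    = 160;         /* assumed C/C++ node overhead            */
    long per_node    = inner_text + attr_values + attr_names + overhead;

    long budget = 1024L * 1024L * 1024L;  /* assumed 1 GiB DOM budget */
    printf("bytes per node: %ld\n", per_node);                 /* 336        */
    printf("worst-case max nodes: %ld\n", budget / per_node);  /* ~3,195,660 */
    return 0;
}
```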
I would not, however, build a document in a browser with three million DOM nodes if it could be at all avoided. Consider employing the more common practice below.
Common Practice
The best solution is to stay far below what N might be and simply reduce the total number of nodes to the degree possible using standard HTTP design techniques.
Reduce the size and complexity of that which is displayed on any given page, which also improves visual and conceptual clarity.
Request minimal amounts of data from the server, deferring content that is not yet visible using windowing techniques or balancing response time with memory consumption in well-planned ways.
Use asynchronous calls to assist with the above minimalism.
There is no limit for the DOM. Instead there is a limit for the running application, called the 'browser'. Like any other 32-bit application, it has a limit of 4 GB of virtual address space. How much resident memory is used depends on the amount of physical memory. With low RAM you might end up constantly swapping in and out (assuming a sufficient amount of swap space is available). Some systems (Linux, Android) have a special kernel task that kills applications if the system runs out of memory. Also, the maximum amount of virtual memory available to an application on Linux-like systems is typically limited and can be changed with the ulimit command.

Conversion from block dimensions to warps in CUDA [duplicate]

This question already has answers here:
How are 2D / 3D CUDA blocks divided into warps?
(2 answers)
Closed 7 years ago.
I'm a little confused regarding how blocks of certain dimensions are mapped to warps of size 32.
I have read and experienced first hand that the inner dimension of a block being a multiple of 32 improves performance.
Say I create a block with dimensions 16x16.
Can a warp contain threads from two different y-dimensions, e.g. y = 1 and y = 2?
Why would having an inner dimension of 32 improve performance even though there technically are enough threads to be scheduled to a warp?
Your biggest question has already been answered in About warp and threads and How are CUDA threads divided into warps?, so I have focused this answer on the why.
The block size in CUDA is always a multiple of the warp size. The warp size is implementation-defined, and the number 32 is mainly related to shared memory organization, data access patterns and data flow control [1].
So, a block size that is a multiple of 32 does not improve performance by itself, but it means that all of the threads will be used for something. Note that what they are used for depends on what you do with the threads within the block.
A block size that is not a multiple of 32 is rounded up to the nearest multiple, even if you request fewer threads. The GPU Optimization Fundamentals presentation by Cliff Woolley of the NVIDIA Developer Technology Group has interesting hints about performance.
In addition, memory operations and instructions are executed per warp, so you can understand the importance of this number. I think the reason why it is 32 and not 16 or 64 is undocumented, so I like to remember the warp size as "The Answer to the Ultimate Question of Life, the Universe, and Everything" [2].
[1] David B. Kirk and Wen-mei W. Hwu. Programming Massively Parallel Processors: A Hands-on Approach. Elsevier, 2010.
[2] The Hitchhiker's Guide to the Galaxy.
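To make the mapping in the linked answers concrete, here is a small host-side sketch (plain C, no GPU required) that linearizes a 16x16 block the way CUDA does (threadIdx.x fastest, then threadIdx.y) and assigns warp numbers in groups of 32 consecutive linear IDs. It shows that each warp spans two consecutive y rows, which answers the first part of the question.

```c
/* Sketch: how a 16x16 block is linearized (x fastest, then y) and carved
 * into warps of 32 consecutive threads.  With blockDim.x = 16, each warp
 * covers two consecutive y rows. */
#include <stdio.h>

int main(void)
{
    const int dimX = 16, dimY = 16, WARP_SIZE = 32;

    for (int y = 0; y < dimY; y++) {
        int first = y * dimX;              /* linear id of first thread in row */
        int last  = y * dimX + (dimX - 1); /* linear id of last thread in row  */
        printf("row y=%2d -> linear ids %3d..%3d -> warp %d\n",
               y, first, last, first / WARP_SIZE);
    }
    return 0;
}
```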

Do different arithmetic operations have different processing times?

Are the basic arithmetic operations the same with respect to processor usage? For example, if I do an addition vs. a division in a loop, will the calculation time for addition be less than that for division?
I am not sure if this question belongs here or on Computer Science SE.
Yes. Here is a quick example:
http://my.safaribooksonline.com/book/hardware/9788131732465/instruction-set-and-instruction-timing-of-8086/app_c
Those are the microcode and instruction timings of a very old architecture, the 8086. It is a fairly simple point to start from.
Of relevant note, they are measured in cycles, or clocks, and everything moves at the speed of the CPU (they are synchronized to the main clock, or frequency, of the microprocessor).
If you scroll down that table you'll see a division taking anywhere from 80 to 150 cycles.
Also note that operation speed is affected by which area of memory the operands reside in.
Note that on modern processors you can have multiple instructions executed concurrently (even if the CPU is single-threaded) and some of them are executed out of order; vector instructions muddy the question even more.
For example, an SSE multiplication can multiply multiple numbers in a single instruction (which itself may take multiple cycles).
Yes. Different machine instructions are not equally expensive.
You can either do measurements yourself or use one of the references in this question to help you understand the costs for various instructions.
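A minimal sketch of doing such a measurement yourself. The timing numbers will vary by CPU, compiler and optimization level, and an aggressive optimizer may fold or vectorize these loops, so inspect the generated assembly if the numbers matter.

```c
/* Minimal sketch of measuring addition vs. division in a loop.
 * Results depend heavily on the compiler, optimization level and CPU;
 * with aggressive optimization the compiler may transform the loops. */
#include <stdio.h>
#include <time.h>

#define ITERS 100000000L

int main(void)
{
    volatile double acc = 1.0;   /* volatile discourages folding the loops away */

    clock_t t0 = clock();
    for (long i = 0; i < ITERS; i++)
        acc = acc + 1.000000001;     /* addition */
    clock_t t1 = clock();
    for (long i = 0; i < ITERS; i++)
        acc = acc / 1.000000001;     /* division */
    clock_t t2 = clock();

    printf("addition: %.3f s\n", (double)(t1 - t0) / CLOCKS_PER_SEC);
    printf("division: %.3f s\n", (double)(t2 - t1) / CLOCKS_PER_SEC);
    return 0;
}
```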

Why don't we use full 32 bits to store 136 years since the Epoch? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 9 years ago.
I've seen it many times, e.g. on UNIX, in MySQL timestamps, etc.: the Epoch starts at 1970-01-01, but the maximum recordable year is 2038. Now let me count:
2^32 / 60 / 60 / 24 / 365 + 1970
≈ 2106
So if we used full 32 bits, we would naturally get to year 2106 without any problems. But apparently the year 2038 corresponds to 31 bits only. So why do we throw the one bit out? By using full 32 bits we could hope that we won't have to solve the problem since we'll probably destroy the Earth first...
Reaction to comments: of course it's because it's signed, but why would a timestamp ever have to be signed? That's the point of this question.
It might sound crazy but people might want to represent dates prior to 1970. Switching the interpretation of the classic time_t value would cause nothing but trouble.
The 2038 problem can be side-stepped by switching to a 64-bit representation with the same specification. Exactly how this should be done is subject to debate, as being able to represent dates billions of years in the future is of dubious value when that precision could be used to represent sub-second times, but the naive solution works better than nothing.
The short answer is: We use a signed value because that's what the standard is.
This probably falls under 'why is time_t signed and not unsigned' in which case you may be interested in hearing the reason behind this here:
There was originally some controversy over whether the Unix time_t should be
signed or unsigned. If unsigned, its range in the future would be doubled,
postponing the 32-bit overflow (by 68 years). However, it would then be
incapable of representing times prior to 1970. Dennis Ritchie, when asked about
this issue, said that he hadn't thought very deeply about it, but was of the
opinion that the ability to represent all times within his lifetime would be
nice. (Ritchie's birth, in 1941, is around Unix time −893 400 000.) The
consensus is for time_t to be signed, and this is the usual practice. The
software development platform for version 6 of the QNX operating system has an
unsigned 32-bit time_t, though older releases used a signed type.
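A small sketch that makes the signedness point concrete, assuming a platform with a 64-bit signed time_t and a gmtime() that accepts negative values (true of glibc on 64-bit Linux): negative values reach back before 1970, while the same 32-bit pattern read as unsigned would mean a date far in the future.

```c
/* Sketch: with a signed time_t, negative values represent dates before
 * the 1970 epoch.  Assumes a 64-bit signed time_t and a gmtime() that
 * accepts negative values (true of glibc on 64-bit Linux). */
#include <stdio.h>
#include <time.h>

int main(void)
{
    time_t t = -893400000;            /* roughly Ritchie's birth, per the quote */
    struct tm *utc = gmtime(&t);

    if (utc)
        printf("time_t %ld -> %04d-%02d-%02d UTC\n",
               (long)t, utc->tm_year + 1900, utc->tm_mon + 1, utc->tm_mday);

    /* Reinterpreted as an unsigned 32-bit value, the same bit pattern
     * would instead mean a date far in the future. */
    unsigned int u = (unsigned int)-893400000;
    printf("same 32-bit pattern as unsigned: %u seconds after the epoch\n", u);
    return 0;
}
```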

High Performance Computing Terminology: What's a GF/s? [closed]

This question is unlikely to help any future visitors; it is only relevant to a small geographic area, a specific moment in time, or an extraordinarily narrow situation that is not generally applicable to the worldwide audience of the internet.
Closed 11 years ago.
I'm reading this Dr. Dobb's article on CUDA:
In my system, the global memory bandwidth is slightly over 60 GB/s.
This is excellent until you consider that this bandwidth must service
128 hardware threads -- each of which can deliver a large number of
floating-point operations. Since a 32-bit floating-point value
occupies four (4) bytes, global memory bandwidth limited applications
on this hardware will only be able to deliver around 15 GF/s -- or
only a small percentage of the available performance capability.
Question: does GF/s mean gigaflops per second?
Giga flops per second would be it!
GF/s or GFLOPS is gigaflops, or 10^9 FLoating point Operations Per Second. (GF/s is a slightly unusual abbreviation of GigaFLOP/S = GigaFLOPS; see e.g. here, "Gigaflops (GF/s) = 10^9 flops", or here, "gigaflops per second (GF/s)".)
And it is clear to me that GF/s is not GFLOPS/s (i.e. not an acceleration).
You should remember that floating point operations on CPUs and on GPUs are usually counted in different ways. For most CPUs, 64-bit floating point operations are usually counted. For GPUs it is 32-bit operations, because GPUs have much more performance in 32-bit floating point.
What types of operations are counted? Addition, subtraction and multiplication are. Loading and storing data are not. But loading and storing data is necessary to move data to/from memory, and sometimes it will limit the FLOPS achieved in a real application (the article you cited talks about this case, a "memory bandwidth limited application", where the CPU/GPU could deliver lots of FLOPS but the memory can't supply the needed data fast enough).
How are FLOPS counted for a given chip or computer? There are two different metrics. One is the theoretical upper limit of FLOPS for the chip. It is computed by multiplying the number of cores, the clock frequency of the chip, and the floating point operations per CPU tick (4 for Core 2 and 8 for Sandy Bridge CPUs).
The other metric is something like real-world FLOPS, counted by running the LINPACK benchmark (solving a huge linear system of equations). This benchmark uses matrix-matrix multiplication a lot and is a kind of approximation of real-world FLOPS. The Top500 list of supercomputers is measured with a parallel version of the LINPACK benchmark, HPL. For a single CPU, LINPACK can reach up to 90-95% of theoretical FLOPS, and for huge clusters it is in the 50-85% range.
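To make the first (theoretical-peak) metric concrete, here is a tiny sketch with illustrative numbers (4 cores at 3 GHz with 8 floating point operations per tick; these are example figures, not the spec of any particular chip):

```c
/* Sketch of the theoretical-peak metric described above:
 * peak FLOPS = cores * clock frequency * floating point ops per tick.
 * The numbers below are illustrative examples only. */
#include <stdio.h>

int main(void)
{
    double cores          = 4.0;
    double clock_hz       = 3.0e9;   /* 3 GHz                                   */
    double flops_per_tick = 8.0;     /* e.g. the Sandy Bridge figure cited above */

    double peak = cores * clock_hz * flops_per_tick;
    printf("theoretical peak: %.1f GFLOPS (GF/s)\n", peak / 1e9);   /* 96.0 */
    return 0;
}
```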
GF in this case is GigaFLOPS, but FLOPS is already "floating point operations per second". I'm fairly certain the author does not mean F/s to be "floating point operations per second per second", so GF/s is actually an error. (Unless you are talking about a computer that increases its performance at runtime, I guess.) The author probably means GFLOPS.