I am modeling algorithm-to-hardware mapping with Gecode, and the standard Gecode::Int::Limits range is too small, not least because I want to target systems with more than 2^32 bytes of memory.
Is there a way to use arbitrary-precision arithmetic with Gecode, or at least 64-bit integers?
I know that Gecode can be built with MPIR or GMP support, but it seems those are used only for trigonometric operations?
If I understand the Gecode documentation properly:
The totally available number of bits for all variable implementation types used by Gecode is 32
So it seems there is no way to model with values bigger than 2147483646, but I still think I'm fundamentally wrong about something, since it's almost obligatory for a modeling toolkit or library to be able to model with values bigger than that. In particular, Wikipedia says that:
ECLiPSe interfaces to external solvers, in particular ... and the Gecode solver library
but the ECLiPSe tutorial states that
Numbers in ECLiPSe come in several flavors:
Integers can be as large as fits into memory, e.g.:
123 0 -27 393423874981724
I cannot understand how a mere interface can handle numbers bigger than the underlying library supports.
I notice there is a macro uint4korr in the MySQL/MariaDB source code.
include/byte_order_generic.h
I only understand that this macro has something to do with byte order. I looked for comments about it but found nothing. I don't know the meaning of the suffix korr. What does the abbreviation stand for?
I want to know why the code is implemented like this. What are the effects on different platforms?
"korr" is an abbreviation of "korrekt", which sounds like and means the same as the English word "correct".
The purpose of the code is to provide a uniform byte order for the storage and communication components, so that storage files are portable between architectures of different endianness without conversion, and so that neither side of the client/server communication needs to know the endianness of the other.
I believe that the related Swedish verb is korrigera, to correct. uint4korr() is kind of the opposite of ntohl(), because it will swap the bytes on a big-endian architecture and not little-endian.
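To make the byte-order point concrete, here is a minimal sketch of the general technique. It is not the actual MySQL/MariaDB macro; read_le32 and write_le32 are hypothetical names chosen for illustration. It assumes the stored format is least-significant byte first and assembles the value one byte at a time, so the result does not depend on the host's byte order or on pointer alignment.

```c
#include <stdint.h>

/* Hypothetical helper (not the MySQL macro): build a 32-bit value from
   four bytes stored least-significant byte first. Working byte by byte
   gives the same result on little-endian and big-endian hosts and
   avoids unaligned 32-bit loads. */
uint32_t read_le32(const unsigned char *p)
{
    return (uint32_t)p[0]
         | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16)
         | ((uint32_t)p[3] << 24);
}

/* Hypothetical counterpart: store v least-significant byte first. */
void write_le32(unsigned char *p, uint32_t v)
{
    p[0] = (unsigned char)(v & 0xFFu);
    p[1] = (unsigned char)((v >> 8) & 0xFFu);
    p[2] = (unsigned char)((v >> 16) & 0xFFu);
    p[3] = (unsigned char)((v >> 24) & 0xFFu);
}
```

On a little-endian host a compiler may reduce read_le32 to a plain load, while on a big-endian host it has to swap the bytes, which matches the behaviour described above.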
Somewhat related to this, the InnoDB storage engine stores its data in big-endian byte order, so that a simple memcmp() can be used for comparing keys. (It also inverts the sign bit of signed integers due to this.) The InnoDB function mach_read_from_4() is basically ntohl() combined with a 32-bit load via an unaligned pointer. Recent versions of GCC and clang impress me by translating that into the IA-32 or AMD64 instructions mov and bswap or simply movbe.
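The memcmp() trick can be sketched as follows. This is an illustration of the idea only, not InnoDB's code; encode_key_i32 and compare_keys_i32 are hypothetical names. Storing the most significant byte first makes byte-wise comparison agree with unsigned numeric order, and inverting the sign bit moves negative values below non-negative ones.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical encoder: write a signed 32-bit value big-endian with the
   sign bit inverted, so that memcmp() on the bytes orders keys the same
   way as comparing the integers. */
void encode_key_i32(unsigned char out[4], int32_t v)
{
    uint32_t u = (uint32_t)v ^ 0x80000000u;  /* invert the sign bit */
    out[0] = (unsigned char)(u >> 24);       /* most significant byte first */
    out[1] = (unsigned char)(u >> 16);
    out[2] = (unsigned char)(u >> 8);
    out[3] = (unsigned char)u;
}

/* Compare two integers through their encoded keys.
   Returns a negative, zero, or positive value like memcmp(). */
int compare_keys_i32(int32_t a, int32_t b)
{
    unsigned char ka[4], kb[4];
    encode_key_i32(ka, a);
    encode_key_i32(kb, b);
    return memcmp(ka, kb, sizeof ka);
}
```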
Just out of curiosity: cuBLAS is a library for basic matrix computations. But these computations, in general, can also be written in normal CUDA code easily, without using cuBLAS. So what is the major difference between the cuBLAS library and your own CUDA program for the matrix computations?
We highly recommend developers use cuBLAS (or cuFFT, cuRAND, cuSPARSE, thrust, NPP) when suitable for many reasons:
We validate correctness across every supported hardware platform, including those which we know are coming up but which maybe haven't been released yet. For complex routines, it is entirely possible to have bugs which show up on one architecture (or even one chip) but not on others. This can even happen with changes to the compiler, the runtime, etc.
We test our libraries for performance regressions across the same wide range of platforms.
We can fix bugs in our code if you find them. Hard for us to do this with your code :)
We are always looking for which reusable and useful bits of functionality can be pulled into a library - this saves you a ton of development time, and makes your code easier to read by coding to a higher level API.
Honestly, at this point, I can probably count on one hand the number of developers out there who actually implement their own dense linear algebra routines rather than calling cuBLAS. It's a good exercise when you're learning CUDA, but for production code it's usually best to use a library.
(Disclosure: I run the CUDA Library team)
There are several reasons you'd choose to use a library instead of writing your own implementation. Three, off the top of my head:
You don't have to write it. Why do work when somebody else has done it for you?
It will be optimised. NVIDIA supported libraries such as cuBLAS are likely to be optimised for all current GPU generations, and later releases will be optimised for later generations. While most BLAS operations may seem fairly simple to implement, to get peak performance you have to optimise for hardware (this is not unique to GPUs). A simple implementation of SGEMM, for example, may be many times slower than an optimised version.
They tend to work. There's probably less chance you'll run up against a bug in a library than that you'll create a bug in your own implementation which bites you when you change some parameter or other in the future.
The above isn't just relevant to cuBLAS: if you have a method that's in a well supported library you'll probably save a lot of time and gain a lot of performance using it relative to using your own implementation.
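For reference, the kind of "simple implementation" meant here might look like the sketch below. It is plain C rather than CUDA, the name naive_sgemm is hypothetical, and it assumes row-major storage and computes only C = A * B, whereas the BLAS SGEMM routine also handles scaling factors and transposition. It is correct but makes no attempt at blocking, vectorisation, or cache reuse, which is where tuned libraries gain their speed.

```c
#include <stddef.h>

/* Naive single-precision matrix multiply, C = A * B, row-major storage.
   A is m x k, B is k x n, C is m x n. Hypothetical helper for
   illustration; no tiling or other hardware-specific optimisation. */
void naive_sgemm(size_t m, size_t n, size_t k,
                 const float *A, const float *B, float *C)
{
    for (size_t i = 0; i < m; ++i) {
        for (size_t j = 0; j < n; ++j) {
            float acc = 0.0f;
            for (size_t p = 0; p < k; ++p) {
                acc += A[i * k + p] * B[p * n + j];
            }
            C[i * n + j] = acc;
        }
    }
}
```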
Recently, I have been writing a program (an FDTD computation) using the CUDA development environment. The OS is Windows Server 2008, the graphics card is a Tesla C2070, and the compiler is VS2010. The program calculates using single- and double-precision floating point.
I was reading the CUDA programming guide, versions 3.2 and 4.0. In the appendix, the guide says that sin() and cos() have a maximum error of 2 ULP. My original CPU program produces results that differ from those of the CUDA version.
I want the results to be exactly the same. Is that possible?
To quote Goldberg (a paper that every Computer Scientist, Computational Scientist, and possibly even every scientist who programs, should read):
Due to roundoff errors, the associative laws of algebra do not necessarily hold for floating-point numbers.
This means that when you change the order of operations—even when using ostensibly associative arithmetic—you are likely to get slightly different answers.
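A minimal sketch of the effect, assuming IEEE 754 double precision; sum_left and sum_right are hypothetical names. The same three values are added in two groupings, and the volatile temporaries only keep the compiler from regrouping the additions.

```c
/* (a + b) + c */
double sum_left(double a, double b, double c)
{
    volatile double t = a + b;
    return t + c;
}

/* a + (b + c) */
double sum_right(double a, double b, double c)
{
    volatile double t = b + c;
    return a + t;
}
```

With a = 1e20, b = -1e20 and c = 1.0, sum_left gives 1.0, because the two large values cancel first. sum_right gives 0.0, because 1.0 is far below the rounding granularity of 1e20 and is lost when added to -1e20.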
Parallelism, by definition, results in different ordering of operations relative to serial arithmetic. "Embarrassingly parallel" computations, that is, computations where each output element is computed independently from all others, sometimes do not have to worry about this. But collective operations, like reductions or scans, and spatial neighborhood computations, such as stencils (as in FDTD), do experience this effect.
In practice, even using a different compiler (and even different compiler options) can change the result of floating point computation, even when compiling the same code, with or without parallelism.
I was reading the CURAND Library API, and as a newbie in CUDA I wanted to see if someone could show me some simple code that uses the CURAND library to generate random numbers. I am looking into generating a large quantity of numbers to use with discrete event simulation. My task is just to develop the algorithms to use GPGPUs to speed up the random number generation. I have implemented the LCG, multiplicative, and Fibonacci methods in standard C. However, I want to "port" that code to CUDA and take advantage of threads and blocks to speed up the process of generating random numbers.
Link 1: http://adnanboz.wordpress.com/tag/nvidia-curand/
That person has two of the methods I will need (LCG and Mersenne Twister) but the codes do not provide much detail. I was wondering if anyone could expand on those initial implementations to actually point me in the right direction on how to use them properly.
Thanks!
Your question is misleading - you say "Use the cuRAND Library for Dummies" but you don't actually want to use cuRAND. If I understand correctly, you actually want to implement your own RNG from scratch rather than use the optimised RNGs available in cuRAND.
My first recommendation is to revisit your decision to use your own RNG: why not use cuRAND? If the statistical properties are suitable for your application then you would be much better off using cuRAND in the knowledge that it is tuned for all generations of the GPU. It includes Marsaglia's XORWOW, l'Ecuyer's MRG32k3a, and the MTGP32 Mersenne Twister (as well as Sobol' for Quasi-RNG).
You could also look at Thrust, which has some simple RNGs; for an example, see the Monte Carlo sample.
If you really need to create your own generator, then there are some useful techniques in GPU Computing Gems (Emerald Edition, Chapter 16: Parallelization Techniques for Random Number Generators).
As a side note, remember that while a simple LCG is fast and easy to skip ahead, LCGs typically have fairly poor statistical properties, especially when drawing large quantities of numbers. When you say you will need "Mersenne Twister" I assume you mean MT19937. The referenced Gems book talks about parallelising MT19937, but the original developers created the MTGP generators (also referenced above) since skip-ahead is fairly complex to implement for MT19937.
As another side note, just using a different seed per thread to achieve parallelisation is usually a bad idea, because statistically you are not assured of independence. You either need to skip ahead or leap-frog, or else use some other technique (e.g. DCMT) for ensuring there is no correlation between sequences.
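To illustrate what skip-ahead means for an LCG, here is a minimal sketch in plain C. The names lcg_next and lcg_skip are hypothetical, and the multiplier and increment are assumed constants picked for illustration (a widely published 32-bit pair), not values taken from cuRAND or from the question. One LCG step is the affine map x -> a*x + c (mod 2^32). Composing that map with itself by repeated squaring gives the state n steps ahead in O(log n) operations.

```c
#include <stdint.h>

/* Illustrative constants (assumed); the modulus is 2^32 via uint32_t
   wraparound. */
#define LCG_A 1664525u
#define LCG_C 1013904223u

/* One ordinary LCG step. */
uint32_t lcg_next(uint32_t x)
{
    return LCG_A * x + LCG_C;
}

/* Return the state n steps after x without iterating n times. */
uint32_t lcg_skip(uint32_t x, uint64_t n)
{
    uint32_t acc_a = 1u, acc_c = 0u;        /* accumulated map, starts as identity */
    uint32_t cur_a = LCG_A, cur_c = LCG_C;  /* map for 2^i steps, starts at one step */

    while (n > 0) {
        if (n & 1u) {
            /* apply the current 2^i-step map after the accumulated one */
            acc_c = cur_a * acc_c + cur_c;
            acc_a = cur_a * acc_a;
        }
        /* square the current map: from 2^i steps to 2^(i+1) steps */
        cur_c = (cur_a + 1u) * cur_c;
        cur_a = cur_a * cur_a;
        n >>= 1;
    }
    return acc_a * x + acc_c;
}
```

In a parallel setting, each thread t could start from lcg_skip(seed, t * draws_per_thread), so the threads consume disjoint blocks of one sequence instead of relying on unrelated seeds. That only addresses overlap between threads; it does not improve the statistical quality of the LCG itself.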
What considerations do I need to make if I want my code to run correctly on both 32-bit and 64-bit platforms?
EDIT: What kind of areas do I need to take care in, e.g. printing strings/characters or using structures?
Options:
Code it in some language with a Virtual Machine (such as Java)
Code it in .NET and don't target any specific architecture. The .NET JIT compiler will compile it for you to the right architecture before running it.
One solution would be to target a virtual environment that runs on both platforms (I'm thinking Java or .NET here).
Or pick an interpreted language.
Do you have other requirements, such as calling existing code or libraries?
The same things you should have been doing all along to ensure you write portable code :)
The Mozilla guidelines and the C FAQ are good starting points.
I assume you are still talking about compiling them separately for each individual platform? Running them on both is completely doable by just creating a 32-bit binary.
The biggest one is making sure you don't put pointers into 32-bit storage locations.
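A minimal C sketch of that pitfall, assuming a C99 toolchain that provides uintptr_t in <stdint.h> (the type is optional in the standard but present on mainstream platforms). The function names are hypothetical.

```c
#include <stdint.h>

/* Safe: uintptr_t is wide enough to hold a pointer converted to an
   integer, so the round trip gives back the original pointer. */
void *roundtrip_pointer(void *p)
{
    uintptr_t bits = (uintptr_t)p;
    return (void *)bits;
}

/* Unsafe pattern, shown only to demonstrate the loss: squeezing the
   pointer into 32 bits discards the upper half on a 64-bit platform.
   Returns 1 if the value survived the narrowing, 0 if bits were lost. */
int survives_32bit_storage(void *p)
{
    uint32_t narrow = (uint32_t)(uintptr_t)p;
    return (uintptr_t)narrow == (uintptr_t)p;
}
```

On a 32-bit build survives_32bit_storage returns 1 for every pointer, which is why this class of bug tends to stay hidden until the first 64-bit build.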
But there's no proper 'language-agnostic' answer to this question, really. You couldn't even get a particularly firm answer if you restricted yourself to something like standard 'C' or 'C++': the size of data storage, pointers, etc. is all terribly implementation-dependent.
It honestly depends on the language. Managed languages like C# and Java, and scripting languages like JavaScript, Python, or PHP, hide most of the platform from you, so unless you are doing fairly advanced things there is not much to worry about.
But my guess is that you are asking about languages like C++, C, and other lower level languages.
The biggest thing you have to worry about is the size of things: in the 32-bit world you are limited to 2^32, whereas in the 64-bit world the limit grows to 2^64.
With 64-bit you have a larger address space for memory, and you can compute with larger numbers natively. However, if you know you are compiling for both 32 and 64 bits, you need to limit your expectations of the system to the 32-bit world and its limits on buffers and numbers.
In C (and maybe C++), always remember to use the sizeof operator when calculating buffer sizes for malloc. This way you will write more portable code anyway, and this will automatically take 64-bit data types into account.
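A short sketch of that advice; struct node and alloc_nodes are hypothetical names. Writing sizeof *nodes instead of a hard-coded byte count means the allocation follows whatever size the compiler gives the type, which changes between 32-bit and 64-bit builds because of the pointer member.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical record type: its size depends on the pointer width. */
struct node {
    struct node *next;
    long value;
};

/* Allocate an array of `count` nodes. Returns NULL if the size
   computation would overflow or if malloc fails. */
struct node *alloc_nodes(size_t count)
{
    struct node *nodes;

    if (count > SIZE_MAX / sizeof *nodes) {
        return NULL;  /* count * sizeof *nodes would overflow size_t */
    }
    nodes = malloc(count * sizeof *nodes);  /* no hard-coded element size */
    return nodes;
}
```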
In most cases the only thing you have to do is just compile your code for both platforms. (And that's assuming that you're using a compiled language; if it's not, then you probably don't need to worry about anything.)
The only thing I can think of that might cause problems is assuming the size of data types, which is something you probably shouldn't be doing anyway. And of course anything written in assembly is going to cause problems.
Keep in mind that many compilers choose the size of integer based on the underlying architecture, given that the "int" should be the fastest number manipulator in the system (according to some theories).
This is why so many programmers use typedefs for their most portable programs - if you want your code to work on everything from 8 bit processors up to 64 bit processors you need to recognize that, in C anyway, int is not rigidly defined.
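On compilers that support C99, <stdint.h> and <inttypes.h> already supply such typedefs, together with matching printf format macros, which also covers the printing concern raised in the question. The sketch below assumes such a compiler; format_values is a hypothetical helper.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Format a 64-bit integer and a size_t without assuming how wide int
   or long are on the target. Returns the number of characters that
   snprintf would have written. */
int format_values(char *buf, size_t buflen, int64_t big, size_t count)
{
    /* PRId64 expands to the correct conversion for int64_t on this
       platform; %zu is the C99 conversion for size_t. */
    return snprintf(buf, buflen, "%" PRId64 " %zu", big, count);
}
```

Using "%ld" or "%d" here instead would work on some platforms and misbehave on others, because long is 32 bits on some 64-bit systems and 64 bits on others.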
Pointers are another area to be careful with: don't use a long, or long long, or any specific type if you are fiddling with the numeric value of the pointer. Use the proper construct, which, unfortunately, varies from compiler to compiler (which is why you have a separate typedef.h file for each compiler you use).
-Adam Davis