Many programming languages and operating systems provide means of obtaining dynamic memory: new, malloc, etc.
What if a program is restricted to using only stacks with predefined (i.e. compile-time) sizes, so the stacks cannot grow or shrink dynamically?
All data is "inside", or "on", these stacks. The program consists of stack manipulations, given a reasonable number of registers and operations (let's say Intel IA-32, for the sake of specificity).
What are the limitations of using this hypothetical system? (e.g. what algorithmic problems can be solved, and which ones can't?)
For instance: can this program practically do networking, I/O, cryptography or (G)UI?
In the theory of computation, such computers would still be Turing equivalent (at least in the sense that real RAM computers are... even they have a limit to the amount of memory). A machine with two stacks of arbitrary (though perhaps not infinite) capacity would suffice.
Note: this is for expressing algorithms, not doing I/O. Really, with finite memory, you can't accept non-regular languages (i.e. solve all algorithmic problems)... meaning that your hypothetical computer, and real computers, can only accept regular languages (and only certain ones of those!) if finite memory is assumed. Since we have lots of memory, this problem is normally ignored.
If your program's memory is limited to a fixed set of stacks (and, I'm assuming, registers), then the behavior of the program can be determined solely from the PC register and the tops of all of these stacks. Since these stacks have a predetermined fixed size, you could simulate the behavior of the entire system as a finite automaton. In particular:
The automaton has a single state for every possible configuration of the bits of these stacks plus the registers. This might make the automaton exponentially huge, but it's still finite.
The automaton has a transition between two different states if, when the program is in the first state, it would execute an instruction that changes memory so that it looks like the memory configuration of the second state.
Consequently, your program could be no stronger than a DFA. The sequence of transitions through its states could thus be described using a regular language, so your program could not, for example, print out balanced series of parentheses, or print out all the prime numbers, etc.
However, it is substantially weaker than a DFA. If all memory has to be stored in finitely many stacks, then you can't run the program on inputs any larger than all of the stacks put together (since you wouldn't have space to store the input). Consequently, your program would essentially work by being a DFA that begins in one of many possible states corresponding to the initial configuration of the stacks. Thus your program could have only finitely many possible sequences of behaviors.
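To make that construction concrete, here is a minimal C++ sketch of the idea (a hypothetical toy machine, not IA-32, with a made-up step function): one configuration bundles the program counter, a register, and the full contents of two fixed-capacity stacks, and a breadth-first search over single-step transitions enumerates the reachable configurations. Since every field is bounded, the search finishes, which is exactly the "exponentially huge but still finite" automaton described above.

```cpp
// Sketch only: a toy machine whose two stacks are capped at 4 entries.
// Because every field of a configuration is bounded, breadth-first search
// over the step function visits only finitely many states -- each one a
// state of the DFA described above.
#include <cstdint>
#include <cstdio>
#include <queue>
#include <set>
#include <tuple>
#include <vector>

struct Config {
    std::uint16_t pc = 0;                        // program counter
    std::uint8_t r0 = 0;                         // a register
    std::vector<std::uint8_t> stack_a, stack_b;  // fixed capacity: <= 4 each
    bool operator<(const Config& o) const {
        return std::tie(pc, r0, stack_a, stack_b) <
               std::tie(o.pc, o.r0, o.stack_a, o.stack_b);
    }
};

// Made-up one-step semantics standing in for the real program.
Config step(const Config& c) {
    Config n = c;
    switch (c.pc % 4) {
        case 0: if (n.stack_a.size() < 4) n.stack_a.push_back(n.r0); break;
        case 1: n.r0 = std::uint8_t(n.r0 + 1); break;
        case 2: if (!n.stack_a.empty()) {
                    n.stack_b.push_back(n.stack_a.back());
                    n.stack_a.pop_back();
                } break;
        case 3: if (n.stack_b.size() >= 4) n.stack_b.clear(); break;
    }
    n.pc = std::uint16_t((c.pc + 1) % 4);
    return n;
}

int main() {
    std::set<Config> seen;                       // the DFA's states
    std::queue<Config> todo;
    seen.insert(Config{});
    todo.push(Config{});
    while (!todo.empty()) {
        Config c = todo.front(); todo.pop();
        Config n = step(c);                      // each step is a DFA transition
        if (seen.insert(n).second) todo.push(n);
    }
    std::printf("reachable configurations: %zu\n", seen.size());
    return 0;
}
```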
I'm curious as to why we are not allowed to use registers as offsets in MIPS. I know that you can't use registers as offsets like this: lw $t3, $t1($t4). I'm just curious why that is the case.
Is it a hardware restriction? Or simply just part of the ISA?
PS: if you're looking for what to do instead, see Load Word in MIPS, using register instead of immediate offset from another register or look at compiler output for a C function like int foo(int *arr, int idx){ return arr[idx]; } - https://godbolt.org/z/PhxG57ox1
I'm curious as to why we are not allowed to use registers as offsets in MIPS.
I'm not sure if you mean "why does MIPS assembly not permit you to write it in this form" or "why does the underlying ISA not offer this form".
If it's the former, then the answer is that the base ISA doesn't have any machine instruction that offers that functionality, and apparently the designers didn't decide to offer any pseudo-instruction that would implement it behind the scenes.2
If you're asking why the ISA doesn't offer it in the first place, it's just a design choice. By offering fewer or simpler addressing modes, you get the following advantages:
Less room is needed to encode a more limited set of possibilities, so you save encoding space for more opcodes, shorter instructions, etc.
The hardware can be simpler, or faster. For example, allowing two registers in address calculation may result in:
The need for an additional read port in the register file1.
Additional connections between the register file and the AGU to get both registers values there.
The need to do a full-width (32- or 64-bit) addition, rather than a simpler base + 16-bit addition for the immediate offset.
The need for a three-input ALU if you still want to support immediate offsets together with 2-register addresses (and they are less useful if you don't).
Additional complexity in instruction decoding and address-generation since you may need to support two quite different paths for address generation.
Of course, all of those trade-offs may very well pay off in some contexts that could make good use of 2-reg addressing with smaller or faster code, but the original design which was heavily inspired by the RISC philosophy didn't include it. As Peter points out in the comments, new addressing modes have been subsequently added for some cases, although apparently not a general 2-reg addressing mode for load or store.
Is it a hardware restriction? Or simply just part of the ISA?
There's a bit of a false dichotomy there. It's certainly not a hardware restriction in the sense that hardware could have supported this, even when MIPS was designed. The question seems to imply that some existing hardware had that restriction and the MIPS ISA somehow inherited it. I would suspect it was much the other way around: the ISA was defined this way, based on analysis of how the hardware would likely be implemented, and then it became a hardware simplification since MIPS hardware doesn't need to support anything outside of what's in the MIPS ISA.
1 E.g., to support store instructions which would need to read from 3 registers.
2 It's certainly worth asking whether such a pseudo-instruction is a good idea or not: it would probably expand to an add of the two registers into a temporary register and then a lw with the result. There is always a danger that this hides "too much" work. Since it partly glosses over the difference between a true load that maps 1:1 to a hardware load and a version that does extra arithmetic behind the covers, it is easy to imagine it leading to sub-optimal decisions.
Take the classic example of linearly accessing two arrays of equal element size in a loop. With 2-reg addressing, it is natural to write this loop as two 2-reg accesses (each with a different base register and a common offset register). The only "overhead" for the offset maintenance is the single offset increment. This hides the fact that internally two additional adds are required to support the addressing mode: it would simply have been better to increment each base directly and not use the offset. Furthermore, once the overhead is clear, you can see that unrolling the loop and using immediate offsets can further reduce the overhead.
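As a C-level sketch of that loop (illustration only, not compiler output): the "natural" shared-offset form relies on base+index addressing with its hidden per-access adds, while the pointer-increment form keeps every access a plain register-plus-immediate load, which is what the paragraph above recommends on an ISA without 2-register addressing.

```cpp
// Two ways to walk a[] and b[] of the same length (illustration only).
#include <cstddef>
#include <cstdio>

// Form 1: one shared index. With a 2-register addressing mode, a[i] and b[i]
// each need a hidden base+index add on every iteration, plus the i++ itself.
float sum_indexed(const float* a, const float* b, std::size_t n) {
    float s = 0.0f;
    for (std::size_t i = 0; i < n; ++i)
        s += a[i] + b[i];
    return s;
}

// Form 2: increment each base pointer directly. Every access is a plain load
// through a register (offset 0), so the only address arithmetic is the two
// explicit pointer increments -- the approach the answer above recommends
// when there is no 2-register addressing mode.
float sum_pointers(const float* a, const float* b, std::size_t n) {
    float s = 0.0f;
    for (const float* end = a + n; a != end; ++a, ++b)
        s += *a + *b;
    return s;
}

int main() {
    float a[4] = {1, 2, 3, 4}, b[4] = {10, 20, 30, 40};
    std::printf("%f %f\n", sum_indexed(a, b, 4), sum_pointers(a, b, 4));
    return 0;
}
```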
Some time ago (before CuDNN introduced its own RNN/LSTM implementation), you would use a tensor of shape [B,T,D] (batch-major) or [T,B,D] (time-major) and then have a straightforward LSTM implementation.
Straightforward means e.g. pure Theano or pure TensorFlow.
It was (is?) common wisdom that time-major is more efficient for RNNs/LSTMs.
This might be due to internal unrolling details of Theano/TensorFlow (e.g. in TensorFlow, you would use tf.TensorArray, which naturally unrolls over the first axis, so it must be time-major, and anything else would imply a transpose to time-major; not using tf.TensorArray but directly indexing the tensor would be extremely inefficient in the backprop phase).
But I think this is also related to memory locality, so even with your own custom native implementation where you have full control over these details (and thus could choose any format you like), time-major should be more efficient.
(Maybe someone can confirm this?)
(In a similar way, for convolutions, batch-channel major (NCHW) is also more efficient. See here.)
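To make the memory-locality point concrete, here is a small C++ sketch (with made-up sizes) of where the per-timestep slice lives in the two padded layouts: in time-major [T,B,D] the step-t input is one contiguous block of B*D values, whereas in batch-major [B,T,D] it is B strided chunks of D values.

```cpp
// Offsets of the per-timestep slice in the two padded layouts (sketch only).
#include <cstdio>

int main() {
    const int T = 5, B = 3, D = 4;   // made-up sizes
    const int t = 2;                 // the timestep the RNN cell is about to process

    // Time-major [T,B,D]: the whole step-t slice starts here and is B*D
    // contiguous values -- one streaming read per step.
    int time_major_offset = t * B * D;
    std::printf("time-major:  one block at %d, length %d\n",
                time_major_offset, B * D);

    // Batch-major [B,T,D]: the step-t slice is B separate chunks of D values,
    // each T*D apart -- strided access, worse locality.
    for (int b = 0; b < B; ++b)
        std::printf("batch-major: chunk %d at %d, length %d\n",
                    b, b * T * D + t * D, D);
    return 0;
}
```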
Then CuDNN introduced their own RNN/LSTM implementation and they used packed tensors, i.e. with all padding removed. Also, sequences must be sorted by sequence length (longest first). This is also time-major but without the padded frames.
This caused some difficulty in adopting these kernels, because padded (non-packed) tensors were pretty standard in all frameworks up to that point, and thus you need to sort by sequence length, pack the tensor, call the kernel, then unpack it and undo the sequence sorting. But slowly the frameworks adopted this.
However, Nvidia then extended the CuDNN functions (e.g. cudnnRNNForwardTrainingEx, and later cudnnRNNForward), which now support all three formats:
CUDNN_RNN_DATA_LAYOUT_SEQ_MAJOR_UNPACKED: Data layout is padded, with outer stride from one time-step to the next (time-major, or sequence-major)
CUDNN_RNN_DATA_LAYOUT_BATCH_MAJOR_UNPACKED: Data layout is padded, with outer stride from one batch to the next (batch-major)
CUDNN_RNN_DATA_LAYOUT_SEQ_MAJOR_PACKED: The sequence length is sorted and packed as in the basic RNN API (time-major without padding frames, i.e. packed)
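Here is a small C++ sketch of what "packed" means (made-up lengths and feature size; this mimics the layout, it is not the CuDNN API itself): sequences are sorted longest-first and, for each timestep, only the still-active sequences are stored back to back, so the padding frames disappear and a kernel just walks over per-step batch sizes.

```cpp
// Build a packed (padding-free, time-major, length-sorted) buffer from a
// padded time-major [T,B,D] buffer. Sketch only; not the CuDNN API itself.
#include <algorithm>
#include <cstdio>
#include <numeric>
#include <vector>

int main() {
    const int T = 4, B = 3, D = 2;
    std::vector<int> lengths = {2, 4, 3};        // made-up sequence lengths
    std::vector<float> padded(T * B * D, 0.0f);  // padded[t*B*D + b*D + d]

    // 1. Sort batch indices by length, longest first (the packed layout
    //    requires this ordering).
    std::vector<int> order(B);
    std::iota(order.begin(), order.end(), 0);
    std::sort(order.begin(), order.end(),
              [&](int x, int y) { return lengths[x] > lengths[y]; });

    // 2. For each timestep, copy only the sequences that are still "alive";
    //    their count is the per-step batch size.
    std::vector<float> packed;
    std::vector<int> step_batch_sizes;
    for (int t = 0; t < T; ++t) {
        int alive = 0;
        for (int b : order) {
            if (t >= lengths[b]) break;          // sorted, so we can stop early
            const float* src = &padded[(t * B + b) * D];
            packed.insert(packed.end(), src, src + D);
            ++alive;
        }
        if (alive == 0) break;
        step_batch_sizes.push_back(alive);
    }

    std::printf("per-step batch sizes:");
    for (int n : step_batch_sizes) std::printf(" %d", n);
    std::printf("\npacked elements: %zu (padded had %zu)\n",
                packed.size(), padded.size());
    return 0;
}
```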
CuDNN references:
CuDNN developer guide,
CuDNN API reference
(search for "packed", or "padded").
See for example cudnnSetRNNDataDescriptor. Some quotes:
With the unpacked layout, both sequence major (meaning, time major) and batch major are supported. For backward compatibility, the packed sequence major layout is supported.
This data structure is intended to support the unpacked (padded) layout for input and output of extended RNN inference and training functions. A packed (unpadded) layout is also supported for backward compatibility.
In TensorFlow, since CuDNN supports the padded layout, they have cleaned up the code and only support the padded layout now. I don't see that you can use the packed layout anymore. (Right?)
(I'm not sure why this decision was made. Just to have simpler code? Or is this more efficient?)
PyTorch only supports the packed layout properly (when you have sequences of different lengths) (documentation).
Besides computational efficiency, there is also memory efficiency. Obviously the packed tensor is better w.r.t. memory consumption. So this is not really the question.
I mostly wonder about computational efficiency. Is the packed format the most efficient? Or just the same as padded time-major? Is time-major more efficient than batch-major?
(This question is not necessarily about CuDNN, but in general about any naive or optimized implementation in CUDA.)
But obviously, this question also depends on the remaining neural network. When you mix the LSTM with other modules which might require non-packed tensors, you would have a lot of packing and unpacking if the LSTM uses the packed format. But consider that you could re-implement all other modules as well to work on the packed format: then maybe the packed format would be better in every aspect?
(Maybe the answer is that there is no clear answer. But I don't know. Maybe there is also a clear answer. The last time I actually measured, the answer was pretty clear, at least for some parts of my question, namely that time-major is in general more efficient than batch-major for RNNs. Maybe the answer is that it depends on the hardware. But this should not be a guess; it should come with real measurements, or even better with some good explanation. To the best of my knowledge, this should be mostly invariant to the hardware. It would be unexpected to me if the answer varies depending on the hardware. I also assume that packed vs. padded probably does not make much of a difference, again no matter the hardware. But maybe someone really knows.)
I know almost nothing about GPU computing. I have already seen articles written about GPU computing, say Fast minimum spanning tree for large graphs on the GPU or All-pairs shortest-paths for large graphs on the GPU. It sounds like the GPU has some restrictions in computing that the CPU doesn't have. I need to know what kinds of computations a GPU can do.
Thanks.
Well, I'm a CUDA rookie with some experience, so I think I may help with a response from one beginner to another.
A very short answer to your question is:
It can do the very same things as a CPU, but it has different features which can make it deliver the desired result faster or slower (if you take into account the same cost in hardware).
The CPU, even a multicore one, seeks lower latency, and that leads to a set of demands on its construction. In the opposite direction, the GPU assumes that you have so much independent data to process that the result from one data entry does not need to be available for the next instruction before the current instruction has been applied to all the data (this is kind of hard to achieve, and a considerable amount of experience in parallel development is required). Thus, the GPU's construction does not take processing latency into account with the same intensity as the CPU does, because latency can be "hidden" by the bulk processing; also, it does not worry that much about clock frequency, since that can be compensated by the number of processors.
So, I would not dare to say that the GPU has restrictions compared to the CPU. I would say that it has a more specific processing purpose, like a sound card for example, and its construction takes advantage of this specificity. Comparing the two is like comparing a snowmobile to a bike; it does not make real sense.
But one thing is possible to state: if a highly parallel approach is possible, the GPU can provide more efficiency for a lower cost than the CPU. Just remember that CPU stands for Central Processing Unit, and "central" can be understood to mean that it must be more general than the peripheral ones.
First of all, your code should consist of so many loops that the scheduler can switch between them when it can't find enough resources to complete a loop. After that, you should make sure that your code doesn't run into one of the following limitations (see the sketch after this list):
1. Divergence: If your code has long if statements, then it is likely to be divergent on the GPU. Every 32 threads are grouped together and one instruction is issued to all of them at once. So while the if branch is executed on some threads, the others that fall into the else branch have to wait, and vice versa, which drops performance.
2. Uncoalesced memory access: Another thing is the memory access pattern. If you access global memory in an orderly (coalesced) fashion, you can utilize the maximum memory bandwidth, but if your accesses to global memory are scattered, you will find memory access to be a bottleneck. So if your code is very cache-friendly, don't go for the GPU, as the ratio of cache to ALUs on the GPU is much lower than on the CPU.
3. Low occupancy: If your code consumes many registers, a lot of shared memory, many loads/stores, and special math functions (like trigonometric ones), then it is likely that you will run into a shortage of resources that prevents you from exploiting the full computational capacity of the GPU.
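A plain C++ sketch (not actual GPU code; names and sizes are made up) of points 1 and 2: it spells out, thread index by thread index, which addresses a coalesced warp would touch versus an uncoalesced one, and where a divergent branch would force the two paths to be executed one after the other.

```cpp
// Mimics what the 32 threads of one warp would touch (illustration only).
#include <cstdio>
#include <vector>

int main() {
    const int WARP = 32;
    std::vector<float> global_mem(WARP * WARP, 1.0f);
    float sum = 0.0f;

    // 2. Coalesced: thread t reads global_mem[base + t]. The 32 reads form
    //    one contiguous block, which the memory system serves in very few
    //    transactions.
    const int base = 0;
    for (int t = 0; t < WARP; ++t)
        sum += global_mem[base + t];

    //    Uncoalesced: thread t reads global_mem[t * WARP]. The 32 reads are
    //    scattered across many cache lines, so effective bandwidth collapses.
    for (int t = 0; t < WARP; ++t)
        sum += global_mem[t * WARP];

    // 1. Divergence: if threads of one warp take different branches, the GPU
    //    executes both paths one after the other, masking off the threads
    //    that did not take the current path.
    for (int t = 0; t < WARP; ++t) {
        if (t % 2 == 0)
            sum += 1.0f;   // odd-numbered threads would sit idle here
        else
            sum -= 1.0f;   // even-numbered threads would sit idle here
    }

    std::printf("sum = %f\n", sum);
    return 0;
}
```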
I've been looking for the answer on Google and can't seem to find it. Binary is represented in bytes/octets, 8 bits. So the character a is 01100001, and the word hey is
01101000
01100101
01111001
So my question is, why 8? Is this just a good number for the computer to work with? And I've noticed how 32-bit/64-bit computers are all multiples of eight... so does this all have to do with how the first computers were made?
Sorry if this question doesn't meet the Q/A standards... it's not code-related, but I can't think of anywhere else to ask it.
The answer is really "historical reasons".
Computer memory must be addressable at some level. When you ask your RAM for information, you need to specify which information you want - and it will return that to you. In theory, one could produce bit-addressable memory: you ask for one bit, you get one bit back.
But that wouldn't be very efficient, since the interface connecting the processor to the memory needs to be able to convey enough information to specify which address it wants. The smaller the granularity of access, the more wires you need (or the more pushes along the same number of wires) before you've given an accurate enough address for retrieval. Also, returning one bit multiple times is less efficient than returning multiple bits one time (side note: true in general. This is a serial-vs-parallel debate, and due to reduced system complexity and physics, serial interfaces can generally run faster. But overall, more bits at once is more efficient).
Secondly, the total amount of memory in the system is limited in part by the size of the smallest addressable block, since unless you used variably-sized memory addresses, you only have a finite number of addresses to work with - but each address represents a number of bits which you get to choose. So a system with logically byte-addressable memory can hold eight times the RAM of one with logically bit-addressable memory.
So, we use memory logically addressable at less fine levels (although physically no RAM chip will return just one byte). Only powers of two really make sense for this, and historically the level of access has been a byte. It could just as easily be a nibble or a two-byte word, and in fact older systems did have smaller chunks than eight bits.
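A tiny C++ illustration of that capacity argument (32 address bits chosen purely as an example): with the same number of address lines, byte-addressable memory reaches eight times as much storage as bit-addressable memory would.

```cpp
// With k address bits you get 2^k distinct addresses; what each address
// names (a bit vs. an 8-bit byte) decides how much total storage is reachable.
#include <cstdint>
#include <cstdio>

int main() {
    const int k = 32;                               // example address width
    const std::uint64_t addresses = 1ULL << k;      // 2^32 distinct addresses

    std::uint64_t bit_addressable_bytes  = addresses / 8;  // each address names 1 bit
    std::uint64_t byte_addressable_bytes = addresses;      // each address names 1 byte

    std::printf("bit-addressable:  %llu MiB reachable\n",
                (unsigned long long)(bit_addressable_bytes >> 20));
    std::printf("byte-addressable: %llu MiB reachable (8x more)\n",
                (unsigned long long)(byte_addressable_bytes >> 20));
    return 0;
}
```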
Now, of course, modern processors mostly eat memory in cache-line-sized increments, but our means of expressing groupings and dividing the now-virtual address space remained, and the smallest amount of memory which a CPU instruction can access directly is still an eight-bit chunk. The machine code for the CPU instructions (and/or the paths going into the processor) would have to grow the same way the number of wires connecting to the memory controller would in order for the registers to be addressable - it's the same problem as with the system memory accessibility I was talking about earlier.
"In the early 1960s, AT&T introduced digital telephony first on long-distance trunk lines. These used the 8-bit ยต-law encoding. This large investment promised to reduce transmission costs for 8-bit data. The use of 8-bit codes for digital telephony also caused 8-bit data octets to be adopted as the basic data unit of the early Internet"
http://en.wikipedia.org/wiki/Byte
Not sure how true that is. It seems that that's just the symbol and style adopted by the IEEE, though.
One reason why we use 8-bit bytes is because the complexity of the world around us has a definitive structure. On the scale of human beings, observed physical world has finite number of distinctive states and patterns. Our innate restricted abilities to classify information, to distinguish order from chaos, finite amount of memory in our brains - these all are the reasons why we choose [2^8...2^64] states to be enough to satisfy our everyday basic computational needs.
I'm currently implementing a raytracer. Since raytracing is extremely computation heavy and since I am going to be looking into CUDA programming anyway, I was wondering if anyone has any experience with combining the two. I can't really tell if the computational models match and I would like to know what to expect. I get the impression that it's not exactly a match made in heaven, but a decent speed increase would be better than nothing.
One thing to be very wary of in CUDA is that divergent control flow in your kernel code absolutely KILLS performance, due to the structure of the underlying GPU hardware. GPUs typically have massively data-parallel workloads with highly coherent control flow (i.e. you have a couple million pixels, each of which (or at least large swaths of which) will be operated on by the exact same shader program, even taking the same direction through all the branches). This enables them to make some hardware optimizations, like having only a single instruction cache, fetch unit, and decode logic for each group of 32 threads. In the ideal case, which is common in graphics, they can broadcast the same instruction to all 32 sets of execution units in the same cycle (this is known as SIMD, or Single-Instruction Multiple-Data). They can emulate MIMD (Multiple-Instruction) and SPMD (Single-Program), but when threads within a Streaming Multiprocessor (SM) diverge (take different code paths out of a branch), the issue logic actually switches between each code path on a cycle-by-cycle basis. You can imagine that, in the worst case, where all threads are on separate paths, your hardware utilization just went down by a factor of 32, effectively killing any benefit you would've had by running on a GPU over a CPU, particularly considering the overhead associated with marshalling the dataset from the CPU, over PCIe, to the GPU.
That said, ray-tracing, while data-parallel in some sense, has widely-diverging control flow for even modestly-complex scenes. Even if you manage to map a bunch of tightly-spaced rays that you cast out right next to each other onto the same SM, the data and instruction locality you have for the initial bounce won't hold for very long. For instance, imagine all 32 highly-coherent rays bouncing off a sphere. They will all go in fairly different directions after this bounce, and will probably hit objects made out of different materials, with different lighting conditions, and so forth. Every material and set of lighting, occlusion, etc. conditions has its own instruction stream associated with it (to compute refraction, reflection, absorption, etc.), and so it becomes quite difficult to run the same instruction stream on even a significant fraction of the threads in an SM. This problem, with the current state of the art in ray-tracing code, reduces your GPU utilization by a factor of 16-32, which may make performance unacceptable for your application, especially if it's real-time (e.g. a game). It still might be superior to a CPU for e.g. a render farm.
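To illustrate the divergence described above, here is a plain C++ sketch (hypothetical materials and shading stubs, not a real renderer): once neighbouring rays hit different materials, each wants a different instruction stream, and on a GPU the 32 rays sharing a warp would have those branches serialized.

```cpp
// Per-ray material branching: the source of divergence described above.
#include <cstdio>

enum class Material { Diffuse, Mirror, Glass };

struct Ray { Material hit; };

float shadeDiffuse(const Ray&) { /* lighting loop          */ return 0.5f; }
float shadeMirror (const Ray&) { /* recursive reflection   */ return 0.8f; }
float shadeGlass  (const Ray&) { /* refraction + Fresnel   */ return 0.9f; }

float shade(const Ray& r) {
    // On a GPU, the 32 rays of one warp would enter this switch together.
    // If they hit different materials, the hardware runs each case serially
    // with the other threads masked off -- the divergence penalty above.
    switch (r.hit) {
        case Material::Diffuse: return shadeDiffuse(r);
        case Material::Mirror:  return shadeMirror(r);
        case Material::Glass:   return shadeGlass(r);
    }
    return 0.0f;
}

int main() {
    Ray rays[32];
    for (int i = 0; i < 32; ++i) rays[i].hit = static_cast<Material>(i % 3);
    float sum = 0.0f;
    for (const Ray& r : rays)
        sum += shade(r);   // a CPU loop; a GPU warp would diverge here
    std::printf("%f\n", sum);
    return 0;
}
```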
There is an emerging class of MIMD or SPMD accelerators being looked at now in the research community. I would look at these as logical platforms for software, real-time raytracing.
If you're interested in the algorithms involved and mapping them to code, check out POVRay. Also look into photon mapping, it's an interesting technique that even goes one step closer to representing physical reality than raytracing.
It can certainly be done, has been done, and is a hot topic currently among the raytracing and Cuda gurus. I'd start by perusing http://www.nvidia.com/object/cuda_home.html
But it's basically a research problem. People who are doing it well are getting peer-reviewed research papers out of it. But "well" at this point still means that the best GPU/Cuda results are approximately competitive with best-of-class solutions on CPU/multi-core/SSE. So I think it's a little early to assume that using Cuda is going to accelerate a ray tracer. The problem is that although ray tracing is "embarrassingly parallel" (as they say), it is not the kind of "fixed input and output size" problem that maps straightforwardly to GPUs -- you want trees, stacks, dynamic data structures, etc. It can be done with Cuda/GPU, but it's tricky.
Your question wasn't clear about your experience level or the goals of your project. If this is your first ray tracer and you're just trying to learn, I'd avoid Cuda -- it'll take you 10x longer to develop and you probably won't get good speed. If you're a moderately experienced Cuda programmer and are looking for a challenging project and ray tracing is just a fun thing to learn, by all means, try to do it in Cuda. If you're making a commercial app and you're looking to get a competitive speed edge -- well, it's probably a crap shoot at this point... you might get a performance edge, but at the expense of more difficult development and dependence on particular hardware.
Check back in a year, the answer may be different after another generation or two of GPU speed, Cuda compiler development, and research community experience.