What language(s) for very large lists? - language-agnostic

Java (and maybe the underlying C-ish code) has a maximum capacity of Integer.MAX_VALUE (~2 billion) for arrays and for the containers in java.util. Are there other languages that feature containers with larger capacities?

You don't want languages, you want databases.

You can write your own containers in both languages that support long indices.

If you're starting to hit the 32-bit limit on the number of elements you can store in a list/array/collection, then I would seriously start looking for a new way to implement your algorithm.
Otherwise you're going to end up with lots of "we need this specialized hardware in order to execute our program" type requirements.

STL containers in C++ use size_t indices, which are 64-bit on a 64-bit machine.
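For illustration, a minimal check of that (assuming a 64-bit build; the exact width of size_type is implementation-defined, and max_size() is only a theoretical cap, not what will fit in RAM):

    #include <cstddef>
    #include <iostream>
    #include <vector>

    int main() {
        // size_type is typically std::size_t, i.e. 64 bits on a 64-bit build.
        std::cout << "index width: "
                  << sizeof(std::vector<char>::size_type) * 8 << " bits\n";

        // max_size() is only a theoretical upper bound; available RAM is the
        // real limit long before you reach it.
        std::vector<char> v;
        std::cout << "max_size(): " << v.max_size() << "\n";
        return 0;
    }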

Do you have a machine with enough RAM to use more? O_o If you do, I'd say you need your own collection, because the performance of the built-in ones is unlikely to scale that far...

There is no limit if you write your own container.
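As a rough sketch of what such a hand-rolled container could look like in C++ (purely illustrative: the BigVector name, the chunk size, and the missing bounds checks are all arbitrary choices here; a production version would need iterators, sparse allocation, etc.):

    #include <cstddef>
    #include <cstdint>
    #include <memory>
    #include <vector>

    // A 64-bit-indexed container assembled from fixed-size chunks, so no
    // single allocation has to be huge.
    template <typename T, std::size_t ChunkSize = 1 << 20>
    class BigVector {
    public:
        void push_back(const T& value) {
            if (size_ % ChunkSize == 0)
                chunks_.push_back(std::make_unique<T[]>(ChunkSize));
            chunks_.back()[size_ % ChunkSize] = value;
            ++size_;
        }
        T& operator[](std::uint64_t i) {
            return chunks_[i / ChunkSize][i % ChunkSize];
        }
        std::uint64_t size() const { return size_; }

    private:
        std::vector<std::unique_ptr<T[]>> chunks_;
        std::uint64_t size_ = 0;
    };

    int main() {
        BigVector<int> v;
        for (int i = 0; i < 10; ++i) v.push_back(i);
        return v[9] == 9 ? 0 : 1;
    }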

I normally use a database to store such large amounts of data. It solves a lot of scaling problems.

Nobody wants to cope with such large quantities of data in memory at once. I don't know what you're trying to do, but if you need to maximize the amount of resident data, you must take into account:
First use dynamic memory allocation.
You may try (in your OS) to maximize the user-mode virtual address space, if this is an option; e.g. 32-bit Windows can use the typical 2GB/2GB split between user mode and the kernel, or 3GB/1GB.
Always monitor your commit memory/OS commit limit so you don't force the OS to thrash, slowing down the entire system.
You can lock down a minimal amount of memory for the exclusive use of your application. This varies from OS to OS, but e.g. you can let the user reserve 256MB, 512MB, or 1GB of memory space for your application.
If you need to surpass the available usermode address space, there are 64-bit systems at your disposal, or 32-bit with extensions such as PAE.
Well, there are a lot of things to research, but just my 2c.

An object database may suit your purposes better.
For example, db4o.
Or, for arrays of fixed size objects, it might be worth experimenting with a memory mapped file, but you will need a language interface to the OS API for that.
edit: or just use an ORM to map your collection to a standard SQL database. These exist for most languages; for example, Ruby has ActiveRecord and Java has Hibernate.

Related

is it possible to force cudaMallocManaged to allocate on a specific gpu id (e.g. via cudaSetDevice)

I want to use cudaMallocManaged, but is it possible to force it to allocate memory on a specific GPU id (e.g. via cudaSetDevice) on a multi-GPU system?
The reason is that I need to allocate several arrays on the GPU, and I know which sets of these arrays need to work together, so I want to manually make sure they are on the same GPU.
I searched the CUDA documentation, but didn't find any info related to this. Can someone help? Thanks!
No you can't do this directly via cudaMallocManaged. The idea behind managed memory is that the allocation migrates to whatever processor it is needed on.
If you want to manually make sure a managed allocation is "present" on (migrated to) a particular GPU, you would typically use cudaMemPrefetchAsync. Some examples are here and here. This is generally recommended for good performance if you know which GPU the data will be needed on, rather than using "on-demand" migration.
Some blogs on managed memory/unified memory usage are here and here, and some recorded training is available here, session 6.
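A minimal host-side sketch of the prefetch approach described above, assuming a multi-GPU system where the data should end up on device 1 (the device id, sizes, and use of the default stream are illustrative; error checking is omitted):

    #include <cuda_runtime.h>

    int main() {
        const size_t N = 1 << 20;
        const int gpu = 1;                        // GPU the data should live on

        float* a = nullptr;
        float* b = nullptr;
        cudaMallocManaged(&a, N * sizeof(float));
        cudaMallocManaged(&b, N * sizeof(float));

        cudaSetDevice(gpu);                       // subsequent work targets this GPU
        // Migrate both arrays to that GPU up front instead of relying on
        // on-demand page-fault migration:
        cudaMemPrefetchAsync(a, N * sizeof(float), gpu /*dstDevice*/, 0 /*stream*/);
        cudaMemPrefetchAsync(b, N * sizeof(float), gpu, 0);
        cudaDeviceSynchronize();

        // ... launch kernels on 'gpu' that use a and b ...

        cudaFree(a);
        cudaFree(b);
        return 0;
    }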
From N.2.1.1. Explicit Allocation Using cudaMallocManaged() (emphasis mine):
By default, the devices of compute capability lower than 6.x allocate managed memory directly on the GPU. However, the devices of compute capability 6.x and greater do not allocate physical memory when calling cudaMallocManaged(): in this case physical memory is populated on first touch and may be resident on the CPU or the GPU.
So for any recent architecture it works like NUMA nodes on the CPU: Allocation says nothing about where the memory will be physically allocated. This instead is decided on "first touch", i.e. initialization. So as long as the first write to these locations comes from the GPU where you want it to be resident, you are fine.
Therefore I also don't think a feature request will find support. In this memory model allocation and placement just are completely independent operations.
In addition to explicit prefetching as Robert Crovella described it, you can give more information about which devices will access which memory locations in which way (reading/writing) by using cudaMemAdvise (See N.3.2. Data Usage Hints).
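A short sketch of those hints, under the assumption of two GPUs where device 0 should preferably hold the pages and device 1 mostly reads them (the device ids and size are illustrative; error checking is omitted):

    #include <cuda_runtime.h>

    int main() {
        const size_t bytes = (1 << 20) * sizeof(float);
        float* data = nullptr;
        cudaMallocManaged(&data, bytes);

        // Prefer to keep the physical pages on GPU 0 ...
        cudaMemAdvise(data, bytes, cudaMemAdviseSetPreferredLocation, 0);
        // ... while letting GPU 1 access the data without forcing migration.
        cudaMemAdvise(data, bytes, cudaMemAdviseSetAccessedBy, 1);

        cudaFree(data);
        return 0;
    }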
The idea behind all this is that you can start off by just using cudaMallocManaged and not caring about placement, etc. during fast prototyping. Later you profile your code and then optimize the parts that are slow using hints and prefetching to get (almost) the same performance as with explicit memory management and copies. The final code may not be that much easier to read / less complex than with explicit management (e.g. cudaMemcpy gets replaced with cudaMemPrefetchAsync), but the big difference is that you pay for certain mistakes with worse performance instead of a buggy application with e.g. corrupted data that might be overlooked.
In Multi-GPU applications this idea of not caring about placement at the start is probably not applicable, but NVIDIA seems to want cudaMallocManaged to be as uncomplicated as possible for this type of workflow.

Is there a Tcl extension to deal with huge lists, to work around the 2GB allocation limit?

Based on a previous question on this site, TCL max size of array,
it would appear that Tcl cannot handle lists of more than 256M elements. Is there an extension or a future plan to overcome this limitation? Otherwise, I would assume that for the time being, and for the foreseeable future, if one needs to handle indexed arrays and/or dictionaries larger than that, one must resort to a different language.
Is that true?
I plan to fix the limitation as part of Tcl 9.0; I've done a few dry runs, so I know that it's a large but mostly mechanical change.
If you're dealing with very large amounts of data, consider putting it in a database. SQLite is recommended; it has an excellent Tcl API, and should be shipped as part of Tcl 8.6 (though that does depend on the packager; Linux distributions might make it be separate).

What are the advantages of memory-mapped files?

I've been researching memory mapped files for a project and would appreciate any thoughts from people who have either used them before, or decided against using them, and why?
In particular, I am concerned about the following, in order of importance:
concurrency
random access
performance
ease of use
portability
I think the advantage is really that you reduce the amount of data copying required over traditional methods of reading a file.
If your application can use the data "in place" in a memory-mapped file, it can come in without being copied; if you use a system call (e.g. Linux's pread()) then that typically involves the kernel copying the data from its own buffers into user space. This extra copying not only takes time, but also decreases the effectiveness of the CPU's caches by touching an extra copy of the data.
If the data actually have to be read from the disk (as in physical I/O), then the OS still has to read them in, and a page fault probably isn't any better performance-wise than a system call; but if they don't (i.e. they are already in the OS cache), performance should in theory be much better.
On the downside, there's no asynchronous interface to memory-mapped files - if you attempt to access a page which isn't mapped in, it generates a page fault then makes the thread wait for the I/O.
The obvious disadvantage to memory mapped files is on a 32-bit OS - you can easily run out of address space.
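To make the copy-avoidance point concrete, here is a minimal POSIX sketch contrasting the two paths (the file name "data.bin" is just a placeholder):

    #include <cstdio>
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main() {
        int fd = open("data.bin", O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }
        struct stat st;
        fstat(fd, &st);

        // Traditional read path: the kernel copies from the page cache into our buffer.
        char buf[4096];
        ssize_t n = pread(fd, buf, sizeof buf, 0);

        // Memory-mapped path: we access the page-cache pages directly, no extra copy.
        void* p = mmap(nullptr, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }
        const char* data = static_cast<const char*>(p);
        // ... use data[0 .. st.st_size) in place ...

        munmap(p, st.st_size);
        close(fd);
        (void)n; (void)data;
        return 0;
    }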
I have used a memory mapped file to implement an 'auto complete' feature while the user is typing. I have well over 1 million product part numbers stored in a single index file. The file has some typical header information but the bulk of the file is a giant array of fixed size records sorted on the key field.
At runtime the file is memory mapped, cast to a C-style struct array, and we do a binary search to find matching part numbers as the user types. Only a few memory pages of the file are actually read from disk -- whichever pages are hit during the binary search.
Concurrency - I had an implementation problem where it would sometimes memory-map the file multiple times in the same process space. This was a problem, as I recall, because sometimes the system couldn't find a large enough free block of virtual memory to map the file to. The solution was to only map the file once and thunk all calls to it. In retrospect, using a full-blown Windows service would have been cool.
Random Access - The binary search is certainly random access and lightning fast.
Performance - The lookup is extremely fast. As users type a popup window displays a list of matching product part numbers, the list shrinks as they continue to type. There is no noticeable lag while typing.
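A rough sketch of that layout and lookup (the PartRecord layout, the "parts.idx" file name, and the typed prefix are invented for illustration; the real index format was not given):

    #include <algorithm>
    #include <cstring>
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    // Fixed-size record, sorted ascending by key in the index file.
    struct PartRecord {
        char key[32];          // part number
        char description[96];
    };

    int main() {
        int fd = open("parts.idx", O_RDONLY);
        if (fd < 0) return 1;
        struct stat st;
        fstat(fd, &st);
        void* p = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (p == MAP_FAILED) return 1;

        const PartRecord* begin = static_cast<const PartRecord*>(p);
        const PartRecord* end   = begin + st.st_size / sizeof(PartRecord);

        const char* typed = "ABC-12";   // what the user has typed so far
        // lower_bound touches only O(log n) pages; the rest of the file stays on disk.
        const PartRecord* hit = std::lower_bound(begin, end, typed,
            [](const PartRecord& r, const char* prefix) {
                return std::strncmp(r.key, prefix, std::strlen(prefix)) < 0;
            });
        // Records starting at 'hit' whose key begins with 'typed' are the matches.

        munmap(p, st.st_size);
        close(fd);
        (void)hit;
        return 0;
    }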
Memory mapped files can be used to either replace read/write access, or to support concurrent sharing. When you use them for one mechanism, you get the other as well.
Rather than lseeking and writing and reading around in a file, you map it into memory and simply access the bits where you expect them to be.
This can be very handy, and depending on the virtual memory interface can improve performance. The performance improvement can occur because the operating system now gets to manage this former "file I/O" along with all your other programmatic memory access, and can (in theory) leverage the paging algorithms and so forth that it is already using to support virtual memory for the rest of your program. It does, however, depend on the quality of your underlying virtual memory system. Anecdotes I have heard say that the Solaris and *BSD virtual memory systems may show better performance improvements than the VM system of Linux--but I have no empirical data to back this up. YMMV.
Concurrency comes into the picture when you consider the possibility of multiple processes using the same "file" through mapped memory. In the read/write model, if two processes wrote to the same area of the file, you could be pretty much assured that one process's data would arrive in the file, overwriting the other process's data. You'd get one, or the other--but not some weird intermingling. I have to admit I am not sure whether this is behavior mandated by any standard, but it is something you could pretty much rely on. (It's actually a good follow-up question!)
In the mapped world, in contrast, imagine two processes both "writing". They do so by doing "memory stores", which result in the O/S paging the data out to disk--eventually. But in the meantime, overlapping writes can be expected to occur.
Here's an example. Say I have two processes both writing 8 bytes at offset 1024. Process 1 is writing '11111111' and process 2 is writing '22222222'. If they use file I/O, then you can imagine, deep down in the O/S, there is a buffer full of 1s, and a buffer full of 2s, both headed to the same place on disk. One of them is going to get there first, and the other one second. In this case, the second one wins. However, if I am using the memory-mapped file approach, process 1 is going to do a memory store of 4 bytes, followed by another memory store of 4 bytes (let's assume that's the maximum memory store size). Process 2 will be doing the same thing. Based on when the processes run, you can expect to see any of the following:
11111111
22222222
11112222
22221111
The solution to this is to use explicit mutual exclusion--which is probably a good idea in any event. You were sort of relying on the O/S to do "the right thing" in the read/write file I/O case, anyway.
The classic mutual exclusion primitive is the mutex. For memory-mapped files, I'd suggest you look at a memory-mapped mutex, available using (e.g.) pthread_mutex_init().
Edit with one gotcha: When you are using mapped files, there is a temptation to embed pointers to the data in the file, in the file itself (think linked list stored in the mapped file). You don't want to do that, as the file may be mapped at different absolute addresses at different times, or in different processes. Instead, use offsets within the mapped file.
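A minimal sketch of a process-shared mutex living inside a shared mapping, along the lines suggested above (the shm name and the SharedHeader layout are made up, error checking is omitted, and a real program would initialize the mutex only once, not on every run; note the shared data here is a plain counter rather than pointers, per the gotcha):

    #include <fcntl.h>
    #include <pthread.h>
    #include <sys/mman.h>
    #include <unistd.h>

    struct SharedHeader {
        pthread_mutex_t lock;
        long counter;            // example shared data guarded by the lock
    };

    int main() {
        int fd = shm_open("/demo_shm", O_CREAT | O_RDWR, 0600);
        ftruncate(fd, sizeof(SharedHeader));
        auto* hdr = static_cast<SharedHeader*>(
            mmap(nullptr, sizeof(SharedHeader),
                 PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0));

        // Only the process that creates the segment should do this, and only once:
        pthread_mutexattr_t attr;
        pthread_mutexattr_init(&attr);
        pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
        pthread_mutex_init(&hdr->lock, &attr);

        pthread_mutex_lock(&hdr->lock);
        hdr->counter++;           // store offsets/values, not pointers, in mapped data
        pthread_mutex_unlock(&hdr->lock);

        munmap(hdr, sizeof(SharedHeader));
        close(fd);
        return 0;
    }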
Concurrency - would be an issue.
Random access - easier.
Performance - good to great.
Ease of use - not as good.
Portability - not so hot.
I used them on a Sun system a long time ago, and those are my thoughts.

registers vs stacks

What exactly are the advantages and disadvantages to using a register-based virtual machine versus using a stack-based virtual machine?
To me, it would seem as though a register-based machine would be more straightforward to program and more efficient. So why is it that the JVM, the CLR, and the Python VM are all stack-based?
Implemented in hardware, a register-based machine is going to be more efficient simply because there are fewer accesses to the slower RAM. In software, however, even a register based architecture will most likely have the "registers" in RAM. A stack based machine is going to be just as efficient in that case.
In addition a stack-based VM is going to make it a lot easier to write compilers. You don't have to deal with register allocation strategies. You have, essentially, an unlimited number of registers to work with.
Update: I wrote this answer assuming an interpreted VM. It may not hold true for a JIT compiled VM. I ran across this paper which seems to indicate that a JIT compiled VM may be more efficient using a register architecture.
This has already been answered, to a certain level, in the Parrot VM's FAQ and associated documents:
A Parrot Overview
The relevant text from that doc is this:
the Parrot VM will have a register architecture, rather than a stack architecture. It will also have extremely low-level operations, more similar to Java's than the medium-level ops of Perl and Python and the like.
The reasoning for this decision is primarily that by resembling the underlying hardware to some extent, it's possible to compile down Parrot bytecode to efficient native machine language.
Moreover, many programs in high-level languages consist of nested function and method calls, sometimes with lexical variables to hold intermediate results. Under non-JIT settings, a stack-based VM will be popping and then pushing the same operands many times, while a register-based VM will simply allocate the right amount of registers and operate on them, which can significantly reduce the amount of operations and CPU time.
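To illustrate that last point with a toy, hand-simulated example (not any real VM's bytecode): evaluating (a + b) * c on a stack machine versus a register machine.

    #include <cstdio>
    #include <vector>

    int main() {
        int a = 2, b = 3, c = 4;

        // Stack machine: operands are implicit, so opcodes carry no operand fields,
        // but every value is pushed to and popped from the stack.
        std::vector<int> stack;
        stack.push_back(a);                        // PUSH a
        stack.push_back(b);                        // PUSH b
        int r = stack.back(); stack.pop_back();
        int l = stack.back(); stack.pop_back();
        stack.push_back(l + r);                    // ADD
        stack.push_back(c);                        // PUSH c
        r = stack.back(); stack.pop_back();
        l = stack.back(); stack.pop_back();
        stack.push_back(l * r);                    // MUL
        int x_stack = stack.back();                // result

        // Register machine: fewer instructions, but each one must name its
        // operands, so the encoded instructions are wider.
        int reg[4] = {a, b, c, 0};
        reg[3] = reg[0] + reg[1];                  // ADD r3, r0, r1
        reg[3] = reg[3] * reg[2];                  // MUL r3, r3, r2
        int x_reg = reg[3];

        std::printf("%d %d\n", x_stack, x_reg);    // both print 20
        return 0;
    }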
You may also want to read this: Registers vs stacks for interpreter design
Quoting it a bit:
There is no real doubt, it's easier to generate code for a stack machine. Most freshman compiler students can do that. Generating code for a register machine is a bit tougher, unless you're treating it as a stack machine with an accumulator. (Which is doable, albeit somewhat less than ideal from a performance standpoint) Simplicity of targeting isn't that big a deal, at least not for me, in part because so few people are actually going to directly target it--I mean, come on, how many people do you know who actually try to write a compiler for something anyone would ever care about? The numbers are small. The other issue there is that many of the folks with compiler knowledge already are comfortable targeting register machines, as that's what all hardware CPUs in common use are.
Traditionally, virtual machine implementors have favored stack-based architectures over register-based ones because of the simplicity of VM implementation, the ease of writing a compiler back-end (most VMs are originally designed to host a single language), and code density: executables for a stack architecture are invariably smaller than executables for register architectures. That simplicity and code density come at a cost in performance.
Studies have shown that a register-based architecture requires an average of 47% fewer executed VM instructions than a stack-based architecture. The register code is 25% larger than the corresponding stack code, but the cost of fetching more VM instructions due to the larger code size amounts to only 1.07% extra real machine loads per VM instruction, which is negligible. Overall, a register-based VM takes, on average, 32.3% less time to execute standard benchmarks.
One reason for building stack-based VMs is that the actual VM opcodes can be smaller and simpler (no need to encode/decode operands). This makes the generated code smaller, and also makes the VM code simpler.
How many registers do you need?
I'll probably need at least one more than that.
Stack-based VMs are simpler, and the code is much more compact. As a real-world example, a friend built (about 30 years ago) a data-logging system with a homebrew Forth VM on a Cosmac. The Forth VM was 30 bytes of code on a machine with 2K of ROM and 256 bytes of RAM.
It is not obvious to me that a "register-based" virtual machine would be "more straightforward to program" or "more efficient". Perhaps you are thinking that the virtual registers would provide a shortcut during the JIT compilation phase? This would certainly not be the case, since the real processor may have more or fewer registers than the VM, and those registers may be used in different ways. (Example: values that are going to be decremented are best placed in the ECX register on x86 processors.) If the real machine has more registers than the VM, then you're wasting resources; if it has fewer, you've gained nothing by using "register-based" programming.
Stack based VMs are easier to generate code for.
Register based VMs are easier to create fast implementations for, and easier to generate highly optimized code for.
For your first attempt, I recommend starting with a stack based VM.

How do you write code that is both 32 bit and 64 bit compatible?

What considerations do I need to make if I want my code to run correctly on both 32-bit and 64-bit platforms?
EDIT: What kinds of areas do I need to take care in, e.g. printing strings/characters or using structures?
Options:
Code it in some language with a Virtual Machine (such as Java)
Code it in .NET and don't target any specific architecture. The .NET JIT compiler will compile it for you to the right architecture before running it.
One solution would be to target a virtual environment that runs on both platforms (I'm thinking Java, or .Net here).
Or pick an interpreted language.
Do you have other requirements, such as calling existing code or libraries?
The same things you should have been doing all along to ensure you write portable code :)
The Mozilla guidelines and the C FAQ are good starting points.
I assume you are still talking about compiling separately for each individual platform, since running on both is completely doable by just creating a 32-bit binary.
The biggest one is making sure you don't put pointers into 32-bit storage locations.
But there's no proper 'language-agnostic' answer to this question, really. You couldn't even get a particularly firm answer if you restricted yourself to something like standard C or C++ - the size of data storage, pointers, etc. is all terribly implementation-dependent.
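A tiny illustration of that implementation dependence; the numbers printed differ between 32-bit and 64-bit builds, and between platforms (e.g. long is commonly 8 bytes on 64-bit Linux but 4 bytes on 64-bit Windows):

    #include <cstddef>
    #include <cstdio>

    int main() {
        std::printf("int:       %zu bytes\n", sizeof(int));
        std::printf("long:      %zu bytes\n", sizeof(long));
        std::printf("long long: %zu bytes\n", sizeof(long long));
        std::printf("void*:     %zu bytes\n", sizeof(void*));
        std::printf("size_t:    %zu bytes\n", sizeof(std::size_t));
        return 0;
    }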
It honestly depends on the language: managed languages like C# and Java, or scripting languages like JavaScript, Python, or PHP, are locked into their runtime's way of doing things, so unless you are doing something fairly advanced there is not much to worry about.
But my guess is that you are asking about languages like C++, C, and other lower level languages.
The biggest thing you have to worry about is the size of things: in the 32-bit world you are limited to 2^32, whereas in the 64-bit world things get bigger, up to 2^64.
With 64-bit you have a larger address space for memory and storage, and you can compute with larger numbers. However, if you know you are compiling for both 32- and 64-bit, you need to make sure to limit your expectations of the system to the 32-bit world and its limitations on buffers and numbers.
In C (and maybe C++), always remember to use the sizeof operator when calculating buffer sizes for malloc. This way you will write more portable code anyway, and it will automatically take 64-bit data types into account.
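For example (the Item struct is just illustrative):

    #include <cstddef>
    #include <cstdlib>

    struct Item {
        long id;      // 4 bytes on some platforms, 8 on others
        void* data;   // 4 or 8 bytes depending on the architecture
    };

    int main() {
        size_t count = 1000;
        // sizeof picks up the real, platform-specific element size, so the
        // same line is correct on both 32-bit and 64-bit builds.
        Item* items = static_cast<Item*>(std::malloc(count * sizeof *items));
        if (!items) return 1;
        // ... use items[0..count) ...
        std::free(items);
        return 0;
    }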
In most cases the only thing you have to do is just compile your code for both platforms. (And that's assuming that you're using a compiled language; if it's not, then you probably don't need to worry about anything.)
The only thing I can think of that might cause problems is assuming the size of data types, which is something you probably shouldn't be doing anyway. And of course anything written in assembly is going to cause problems.
Keep in mind that many compilers choose the size of int based on the underlying architecture, given that int should be the fastest integer type on the system (according to some theories).
This is why so many programmers use typedefs for their most portable programs - if you want your code to work on everything from 8 bit processors up to 64 bit processors you need to recognize that, in C anyway, int is not rigidly defined.
Pointers are another area to be careful with - don't use long, long long, or any other specific type if you are fiddling with the numeric value of a pointer; use the proper construct, which, unfortunately, varies from compiler to compiler (which is why you have a separate typedef.h file for each compiler you use).
-Adam Davis
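As a present-day footnote to the pointer advice above: C99 and C++11 standardized intptr_t/uintptr_t in <stdint.h>/<cstdint> as the "proper construct" for holding a pointer's numeric value, so on modern toolchains a per-compiler typedef is often unnecessary. A small sketch (the masking is just an example operation):

    #include <cstdint>
    #include <cstdio>

    int main() {
        int value = 42;
        int* p = &value;

        // uintptr_t is wide enough for a pointer on both 32- and 64-bit targets;
        // long is not a safe choice (it stays 32-bit on 64-bit Windows).
        std::uintptr_t bits = reinterpret_cast<std::uintptr_t>(p);
        bits &= ~static_cast<std::uintptr_t>(0xF);   // e.g. round down to a 16-byte boundary

        std::printf("aligned-down address: %p\n", reinterpret_cast<void*>(bits));
        return 0;
    }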