Should I be worried about parquet files being 48MB? - palantir-foundry

I set a transform to use 2000 shuffle partitions and found that the output had gone from 200 files (of about 442MB each) to 2000 files (of about 48MB each). Is this something to be worried about?

Short answer: No, this is probably fine and likely won't cause issues.
Reducing the number of output files, however, is a fairly cheap operation, which you can achieve by adding .coalesce(200) at the end of the transform. This will collapse partitions together without causing a shuffle. Depending on the uniformity of your data, there may be some discrepancy in file sizes; if that ever becomes an issue, you can use .repartition(200) instead (this will require a shuffle, increasing the compute cost of your job).
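For illustration, here's a minimal sketch of what that looks like in a Python transform (the dataset paths are placeholders, and the body stands in for whatever logic produced the 2000 shuffle partitions):

```python
from transforms.api import transform_df, Input, Output


@transform_df(
    Output("/Project/datasets/output"),           # placeholder output path
    source_df=Input("/Project/datasets/input"),   # placeholder input path
)
def compute(source_df):
    # ... existing transform logic that ends up with 2000 shuffle partitions ...
    result = source_df  # stand-in for the real logic

    # Collapse the 2000 partitions into ~200 output files without another
    # shuffle. Swap in .repartition(200) if the coalesced files end up
    # too uneven in size.
    return result.coalesce(200)
```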

Related

How is it possible to save a file which contains only one or a few bytes (e.g. 20 bytes), without occupying 4 KB of disk space for that file?

I'm trying to save log data, and each log entry (for example, a transaction number in a financial system) is only a few bytes. I do not want to use a database. I already know about hard-disk clusters/sectors/inodes and how operating systems allocate them.
However, I think there should be a way of saving a file which is only 20 bytes while it occupies only 20 bytes (or 20 + n% bytes, e.g. 25 bytes) on disk, and not 4 KB. Yes, I know the problems which may arise if we use millions of one-byte or very small files, particularly with indexing and searching speeds. In my case, the benefits of saving such small files outweigh the problems they might cause. So I'm wondering if there's any practical way to do so (even if it requires special hardware or a particular kind of hard disk which I don't know about). I appreciate any kind of information and help.
I've run many tests on Windows and macOS, but they were all limited to just saving some one-byte or few-byte files, nothing more. I have no information about a practical way to do what I'm looking for.
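For what it's worth, the usual way around the per-file block overhead is to pack many small records into a single container file rather than storing one file per record. A minimal sketch (the length-prefixed record format and the file name are assumptions for illustration, not anything from the question):

```python
import struct

LOG_PATH = "transactions.log"  # single container file (hypothetical name)


def append_record(payload: bytes) -> None:
    """Append one small record, prefixed with its 2-byte length."""
    with open(LOG_PATH, "ab") as f:
        f.write(struct.pack("<H", len(payload)) + payload)


def read_records():
    """Yield the records back in insertion order."""
    with open(LOG_PATH, "rb") as f:
        while header := f.read(2):
            (length,) = struct.unpack("<H", header)
            yield f.read(length)


append_record(b"TXN-0001;20 bytes!!")  # each record costs len + 2 bytes,
append_record(b"TXN-0002;more data")   # not a 4 KB filesystem block
```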

Speeding up the puts operation to a file in Tcl

I have to puts large amounts of data to a file in Tcl, and it takes a very long time. I tried increasing the buffer capacity from 4KB to 1MB using fconfigure, but noticed no improvement whatsoever.
I am not sure whether I can flush my buffer at intervals, as I was guessing some of my data would be lost if I did so.
Is there some way I could increase the speed of puts without losing any data?
Generally the output speed is going to be limited by your disk drive's speed and your system's I/O bandwidth.
Increasing the buffer size is probably the only thing you can do to help.
flush will slow down the write, as it force-pushes the write buffer to the operating system.
If your incoming data stream ever pauses, or comes in one big chunk that fits into memory, you can buffer the incoming data internally and let the write catch up later.
If your data is coming from another channel (file, socket, whatever) then you can use fcopy to move it across. The fcopy command is careful to work as efficiently as possible, and if you configure both sides (incoming and outgoing) to use binary data transfer — so no encoding conversion or EOL/EOF character processing — then it can do it with minimal data copies; it's as efficient as a user-process level system can copy data (and you'd have to do hackery to move the copy into the OS kernel to do better). Obviously, having to process encoding conversion and transformation of end-of-line markers will slow things down.
Otherwise, the main bottleneck will still (probably) be the device to which the output is being written. If it is going to a file, moving to an SSD is the simplest option (but not necessarily the cheapest!). When writing over the network, better networking will make a gigantic difference. You really have to identify what the bottleneck actually is; if Tcl is spending most of its time waiting for the hardware, there's very little point in working hard to make Tcl faster, as you'll see virtually no results for that work. Fixing hardware bottlenecks is out of the scope of Stack Overflow, though some sister sites might be able to assist.
puts will not lose data unless you do something really evil like doing a force kill (kill -9) on the process, or doing a reset on the location of the file pointer from C code.

How to correctly calculate RAM requirement for a bucket in Couchbase

We have a bucket of about 34 million items in a Couchbase cluster setup of 6 AWS nodes. The bucket has been allocated 32.1GB of RAM (5482MB per node) and is currently using 29.1GB. If I use the formula provided in the Couchbase documentation (http://docs.couchbase.com/admin/admin/Concepts/bp-sizingGuidelines.html) it should use approx. 8.94GB of RAM.
Am I calculating it incorrectly? Below is link to google spreadsheet with all the details.
https://docs.google.com/spreadsheets/d/1b9XQn030TBCurUjv3bkhiHJ_aahepaBmFg_lJQj-EzQ/edit?usp=sharing
Assuming that you indeed have a working set of 0.5%, which, as Kirk pointed out in his comment, is odd but not impossible, then you are calculating the result of the memory sizing formula correctly. However, it's important to understand that the formula is not a hard and fast rule that fits all situations. Rather, it's a general guideline and serves as a good starting point for you to go and begin your performance tests. Also, keep in mind that RAM sizing isn't the only consideration for deciding on cluster size, because you also have to consider data safety, total disk write throughput, network bandwidth, CPU, how much a single node failure affects the rest of the cluster, and more.
Using the result of the RAM sizing formula as a starting point, you should now actually test whether your working assumptions were correct. That means putting real (or close to representative) load on the bucket and seeing whether the percentage of cache misses is low enough and the operation latency is within your acceptable limits. There is no general rule for this; what's acceptable to some applications might be too slow for others.
Just as an example, if you see that under load your cache miss ratio is 5% and, while the average read latency is 3ms, the top 1% latency is 100ms, then you have to consider whether having one out of every 100 reads take that much longer is acceptable in your application. If it is, great; if not, you need to start increasing the RAM size until it matches your actual working set. Similarly, you should keep an eye on disk throughput, CPU usage, etc.
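As a rough illustration of the arithmetic in that sizing guideline (only the 34 million items and 6 nodes come from the question; every other number below is an assumed placeholder you would replace with your own measurements):

```python
# Sketch of the Couchbase RAM sizing arithmetic from the linked guideline.
# Only the document count and node count come from the question; all other
# values are illustrative assumptions.
documents        = 34_000_000
nodes            = 6
replicas         = 1            # copies kept = 1 + replicas
key_size         = 40           # bytes per document ID (assumed)
value_size       = 1_000        # bytes per document value (assumed)
metadata_per_doc = 56           # bytes of per-document metadata (assumed)
working_set_pct  = 0.005        # the 0.5% working set from the question
headroom         = 0.30         # overhead allowance (assumed)
high_water_mark  = 0.85         # ejection threshold (assumed default)

copies         = 1 + replicas
total_metadata = documents * (metadata_per_doc + key_size) * copies
total_dataset  = documents * value_size * copies
working_set    = total_dataset * working_set_pct

cluster_ram = (total_metadata + working_set) * (1 + headroom) / high_water_mark
print(f"cluster RAM quota: {cluster_ram / 2**30:.2f} GiB")
print(f"per node:          {cluster_ram / nodes / 2**30:.2f} GiB")
```

Plugging in your own key size, value size, and working set percentage will show whether your spreadsheet and the formula actually agree.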

What is the intuition behind cache oblivious data structures?

I understand what the expression cache oblivious means. But I was wondering if there is any easy explanation of how data structures can be designed to use the cache optimally without knowing the sizes of the cache.
Can you please provide such an explanation, preferably with an (easy) example?
Even an algorithm as familiar as quicksort is somewhat cache oblivious (but not optimal). Recall that it works by partitioning the array, then recursing on each side of the partition. Eventually, it is operating on a sub-array which fits in cache, and so there will be no more cache misses until it finishes that sub-array and moves on to another one. That's the property we're looking for.
Contrast this with insertion sort, which (to use a technical term) leaps all over the place all the time. So quite aside from insertion sort's need to move O(n^2) items around, it also misses cache a lot when used on large arrays.
Quicksort is some way from optimal, though. Each individual partition phase doesn't divide and recurse - it does a long sequential run through memory churning the cache. Potentially this will happen several times before the sub-array size is small enough that we start winning, so we're not minimising the number of cache misses.
The primary intuition is that if you recursively split the dataset you are working with, at some point (usually pretty quickly) you'll reach a size that (1) fits in the cache and (2) fills at least half the cache (assuming each split divides the dataset at least approximately in half).
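As a small illustration of that recursive-splitting idea (a standard textbook example, not something from the question), here is a divide-and-conquer matrix transpose: it keeps halving the larger dimension until each block is tiny, so at some recursion depth a block and its destination fit together in whatever cache the machine has, and the code never names a cache size.

```python
def transpose(src, dst, r0, r1, c0, c1):
    """Transpose src[r0:r1, c0:c1] into dst by splitting the larger
    dimension in half until the block is small. No cache size appears
    anywhere, yet each leaf block touches only a small region of both
    matrices -- the cache-oblivious pattern."""
    if (r1 - r0) * (c1 - c0) <= 16:           # small base case
        for r in range(r0, r1):
            for c in range(c0, c1):
                dst[c][r] = src[r][c]
    elif r1 - r0 >= c1 - c0:                  # split the taller dimension
        mid = (r0 + r1) // 2
        transpose(src, dst, r0, mid, c0, c1)
        transpose(src, dst, mid, r1, c0, c1)
    else:                                     # split the wider dimension
        mid = (c0 + c1) // 2
        transpose(src, dst, r0, r1, c0, mid)
        transpose(src, dst, r0, r1, mid, c1)


n = 8
a = [[r * n + c for c in range(n)] for r in range(n)]
b = [[0] * n for _ in range(n)]
transpose(a, b, 0, n, 0, n)
assert all(b[c][r] == a[r][c] for r in range(n) for c in range(n))
```

(Python won't show the cache effects itself; the point is the access pattern, which carries over directly to a low-level implementation.)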

What are the advantages of memory-mapped files?

I've been researching memory-mapped files for a project and would appreciate any thoughts from people who have either used them before or decided against using them, and why.
In particular, I am concerned about the following, in order of importance:
concurrency
random access
performance
ease of use
portability
I think the advantage is really that you reduce the amount of data copying required over traditional methods of reading a file.
If your application can use the data "in place" in a memory-mapped file, the data can come in without being copied; if you use a system call (e.g. Linux's pread()), that typically involves the kernel copying the data from its own buffers into user space. This extra copying not only takes time, but also decreases the effectiveness of the CPU's caches by touching an extra copy of the data.
If the data actually have to be read from the disk (physical I/O), the OS still has to read them in, and a page fault probably isn't any better performance-wise than a system call. But if the data are already in the OS cache, performance should in theory be much better.
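As a minimal sketch of what "in place" access looks like (Python here; the file name is assumed and the file must already exist with at least 4 bytes): the mapped region is just a buffer, so random access touches pages directly instead of issuing a read() that copies into a user buffer first.

```python
import mmap
import struct

# Hypothetical data file; assumed to exist and hold at least 4 bytes.
with open("data.bin", "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        # Touching mm faults pages in from the page cache on demand;
        # there is no read() call copying data into a separate buffer.
        (first_u32,) = struct.unpack_from("<I", mm, 0)
        print(len(mm), first_u32)
```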
On the downside, there's no asynchronous interface to memory-mapped files: if you attempt to access a page which isn't mapped in, it generates a page fault and then makes the thread wait for the I/O.
The obvious disadvantage to memory mapped files is on a 32-bit OS - you can easily run out of address space.
I have used a memory mapped file to implement an 'auto complete' feature while the user is typing. I have well over 1 million product part numbers stored in a single index file. The file has some typical header information but the bulk of the file is a giant array of fixed size records sorted on the key field.
At runtime the file is memory mapped, cast to a C-style struct array, and we do a binary search to find matching part numbers as the user types. Only a few memory pages of the file are actually read from disk -- whichever pages are hit during the binary search.
Concurrency - I had an implementation problem where the file would sometimes get memory-mapped multiple times in the same process space. This was a problem, as I recall, because sometimes the system couldn't find a large enough free block of virtual memory to map the file into. The solution was to map the file only once and thunk all calls to it. In retrospect, using a full-blown Windows service would have been cool.
Random Access - The binary search is certainly random access and lightning fast.
Performance - The lookup is extremely fast. As users type a popup window displays a list of matching product part numbers, the list shrinks as they continue to type. There is no noticeable lag while typing.
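Here's a rough sketch of that pattern (the 128-byte record layout and file name are assumptions for illustration, not the actual format): map the sorted index, binary-search it, and only the pages the search touches ever get faulted in.

```python
import mmap
import struct

RECORD = struct.Struct("32s96s")  # assumed layout: 32-byte key + 96-byte payload
INDEX_FILE = "parts.idx"          # hypothetical index file, assumed non-empty


def find_first_ge(mm, prefix: bytes) -> int:
    """Binary search for the first record whose key is >= prefix."""
    lo, hi = 0, len(mm) // RECORD.size
    while lo < hi:
        mid = (lo + hi) // 2
        key, _ = RECORD.unpack_from(mm, mid * RECORD.size)
        if key.rstrip(b"\0") < prefix:
            lo = mid + 1
        else:
            hi = mid
    return lo


def autocomplete(prefix: bytes, limit: int = 10):
    """Return up to `limit` records whose keys start with `prefix`."""
    with open(INDEX_FILE, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            i = find_first_ge(mm, prefix)
            matches = []
            while i * RECORD.size < len(mm) and len(matches) < limit:
                key, payload = RECORD.unpack_from(mm, i * RECORD.size)
                if not key.startswith(prefix):
                    break
                matches.append((key.rstrip(b"\0"), payload.rstrip(b"\0")))
                i += 1
            return matches
```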
Memory mapped files can be used to either replace read/write access, or to support concurrent sharing. When you use them for one mechanism, you get the other as well.
Rather than lseeking and writing and reading around in a file, you map it into memory and simply access the bits where you expect them to be.
This can be very handy, and depending on the virtual memory interface can improve performance. The performance improvement can occur because the operating system now gets to manage this former "file I/O" along with all your other programmatic memory access, and can (in theory) leverage the paging algorithms and so forth that it is already using to support virtual memory for the rest of your program. It does, however, depend on the quality of your underlying virtual memory system. Anecdotes I have heard say that the Solaris and *BSD virtual memory systems may show better performance improvements than the VM system of Linux--but I have no empirical data to back this up. YMMV.
Concurrency comes into the picture when you consider the possibility of multiple processes using the same "file" through mapped memory. In the read/write model, if two processes wrote to the same area of the file, you could be pretty much assured that one process's data would arrive in the file, overwriting the other process's data. You'd get one or the other, but not some weird intermingling. I have to admit I am not sure whether this behavior is mandated by any standard, but it is something you could pretty much rely on. (It's actually a good follow-up question!)
In the mapped world, in contrast, imagine two processes both "writing". They do so by doing "memory stores", which result in the O/S paging the data out to disk--eventually. But in the meantime, overlapping writes can be expected to occur.
Here's an example. Say I have two processes both writing 8 bytes at offset 1024. Process 1 is writing '11111111' and process 2 is writing '22222222'. If they use file I/O, then you can imagine, deep down in the O/S, there is a buffer full of 1s and a buffer full of 2s, both headed to the same place on disk. One of them is going to get there first, and the other one second. In this case, the second one wins. However, if I am using the memory-mapped file approach, process 1 is going to do a memory store of 4 bytes, followed by another memory store of 4 bytes (let's assume that's the maximum memory store size). Process 2 will be doing the same thing. Based on when the processes run, you can expect to see any of the following:
11111111
22222222
11112222
22221111
The solution to this is to use explicit mutual exclusion--which is probably a good idea in any event. You were sort of relying on the O/S to do "the right thing" in the read/write file I/O case, anyway.
The classic mutual exclusion primitive is the mutex. For memory-mapped files, I'd suggest you look at a memory-mapped mutex, available using (e.g.) pthread_mutex_init().
Edit with one gotcha: When you are using mapped files, there is a temptation to embed pointers to the data in the file, in the file itself (think linked list stored in the mapped file). You don't want to do that, as the file may be mapped at different absolute addresses at different times, or in different processes. Instead, use offsets within the mapped file.
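A small sketch of the offset-based approach (the node layout and file name are made up for illustration): each node stores the byte offset of the next node rather than a pointer, so the list stays valid no matter what base address the file gets mapped at next time.

```python
import mmap
import struct

NODE = struct.Struct("<q16s")  # assumed layout: 8-byte next-offset + 16-byte value
NO_NEXT = -1


def write_node(mm, offset, value: bytes, next_offset: int = NO_NEXT):
    NODE.pack_into(mm, offset, next_offset, value)


def walk(mm, offset):
    """Follow next-offsets through the mapping; offsets stay valid even if
    the file is mapped at a different address in another process."""
    while offset != NO_NEXT:
        next_offset, value = NODE.unpack_from(mm, offset)
        yield value.rstrip(b"\0")
        offset = next_offset


with open("list.bin", "w+b") as f:
    f.truncate(4096)  # hypothetical one-page file
    with mmap.mmap(f.fileno(), 0) as mm:
        write_node(mm, 0, b"head", next_offset=NODE.size)
        write_node(mm, NODE.size, b"tail")
        print(list(walk(mm, 0)))  # [b'head', b'tail']
```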
Concurrency would be an issue.
Random access is easier.
Performance is good to great.
Ease of use - not as good.
Portability - not so hot.
I've used them on a Sun system a long time ago, and those are my thoughts.