How is it possible to save a file which contains only one or a few bytes (e.g. 20 bytes) without occupying 4 KB of disk space for that file? - binary

I'm trying to save log data, and each log entry (for example, a transaction number in a financial system) is only a few bytes. I do not want to use a database, and I already know about clusters/sectors/inodes on hard disks and in operating systems.
However, I think there should be a way of saving a file which is only 20 bytes while it occupies only 20 bytes (or 20 + n% bytes, e.g. 25 bytes), and not 4 KB on disk. Yes, I know the problems which may arise if we use millions of one-byte or very small files, particularly with indexing and search speeds. In my case, the benefits of saving such small files outweigh those problems. So I'm wondering if there's any practical way to do so (even if there's special hardware for it, or a particular hard disk made for it, which I don't know about). I appreciate any kind of information and help.
I've run many tests on Windows and macOS, but they were all limited to just saving some one-byte or few-byte files, and nothing more. I have no information about a practical way to do what I'm looking for.

Related

Should I be worried about parquet files being 48MB?

I set a transform to use 2000 shuffle partitions and found that the output had gone from 200 files (of about 442MB each) to 2000 files (of about 48MB each). Is this something to be worried about?
Short answer: No, this is probably fine and likely won't cause issues.
Reducing the number of output files, however, is a fairly cheap operation, which you can achieve by adding .coalesce(200) at the end of the transform. This will collapse files together without causing a shuffle. Depending on the uniformity of your data, there may be some discrepancy in file sizes; if that ever becomes an issue, you can use .repartition(200) instead (this will require a shuffle, increasing the compute cost of your job).
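As a rough illustration (not the original transform; the paths and column name here are hypothetical), the placement of coalesce in a PySpark job looks like this:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.read.parquet("/data/input")   # hypothetical input path

result = (
    df.groupBy("account_id")             # hypothetical wide transformation
      .count()
      .coalesce(200)                     # merge the output down to ~200 files without an extra shuffle
)

# .repartition(200) would also yield 200 files, but at the cost of a full shuffle,
# which evens out file sizes.
result.write.mode("overwrite").parquet("/data/output")   # hypothetical output path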

Browser Memory consumption / leak issues

I need help in understanding this Testing process.
Our Quality Assurance (QA) team is using Performance Monitor (from Microsoft) to test browser memory consumption and leaks.
Steps the QA team follows:
Open the web browser and log in to our webapp.
Note down the initial virtual bytes from the tool (shown in screenshot).
Perform some operation (let's say a search) a couple of times.
Note down the virtual bytes from the tool.
Calculate the difference between the last and first virtual bytes allocated (after converting virtual bytes to MB).
Divide this difference by the total number of clicks performed by the user.
Note down the result.
Now, this result should be less than 1 (this threshold is decided by them).
If it's greater than 1, they say our webapp has memory leaks.
For Firefox and Chrome, this result is less than 1 for us. But for IE 10 and 11 (both 32-bit and 64-bit) it is more than 1.
Questions:
Is this some standard practice they are following?
How correct is their analysis process?
How can I convince them otherwise, if their analysis is not right?
How should I go about fixing this problem?
P.S. I'm not able to get more information from our QA team.
P.S. We use AngularJS for the client.
Note down the initial virtual bytes from the tool (shown in screenshot).
Virtual bytes are nearly meaningless on 64-bit systems because large chunks of address space can be reserved ahead of time without actually being backed by RAM or swap. Of course the amount is somewhat correlated with actual memory use, but only "somewhat".
Calculate the difference between the last and first virtual bytes allocated (after converting virtual bytes to MB).
This calculation can be meaningless for a different reason. Browsers use complex memory management systems (custom allocators and garbage collectors) which may not immediately release memory back to the operating system after they are done with it, which means that for some time their memory usage may only appear to grow, not shrink, even when you close tabs.
How should I go about fixing this problem?
Use the browsers' built-in memory tracking tools, e.g. about:memory in Firefox.

Why is binary represented in octets?

I've been looking for the answer on Google and can't seem to find it. But binary is represented in bytes/octets, 8 bits. So the character a (I think) is 01100001, and the word hey is
01101000
01100101
01111001
So my question is, why 8? Is this just a good number for the computer to work with? And I've noticed how 32-bit / 64-bit computers are all multiples of eight... so does this all have to do with how the first computers were made?
Sorry if this question doesn't meet the Q/A standards... it's not code-related, but I can't think of anywhere else to ask it.
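A quick way to check these bit patterns is a couple of lines of Python:

for ch in "ahey":
    print(ch, format(ord(ch), "08b"))   # ord gives the character code, "08b" pads it to 8 bits

# prints:
# a 01100001
# h 01101000
# e 01100101
# y 01111001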
The answer is really "historical reasons".
Computer memory must be addressable at some level. When you ask your RAM for information, you need to specify which information you want - and it will return that to you. In theory, one could produce bit-addressable memory: you ask for one bit, you get one bit back.
But that wouldn't be very efficient, since the interface connecting the processor to the memory needs to be able to convey enough information to specify which address it wants. The smaller the granularity of access, the more wires you need (or the more pushes along the same number of wires) before you've given an accurate enough address for retrieval. Also, returning one bit multiple times is less efficient than returning multiple bits one time (side note: true in general. This is a serial-vs-parallel debate, and due to reduced system complexity and physics, serial interfaces can generally run faster. But overall, more bits at once is more efficient).
Secondly, the total amount of memory in the system is limited in part by the size of the smallest addressable block, since unless you used variably-sized memory addresses, you only have a finite number of addresses to work with - but each address represents a number of bits which you get to choose. So a system with logically byte-addressable memory can hold eight times the RAM of one with logically bit-addressable memory.
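To make that factor of eight concrete, here is a small back-of-the-envelope calculation in Python, assuming a fixed 32-bit address width:

addresses = 2 ** 32                  # number of distinct addresses a 32-bit address can express

byte_addressable = addresses         # each address names one byte
bit_addressable = addresses // 8     # each address names one bit, i.e. 1/8 of a byte

print(byte_addressable // 2**30, "GiB addressable")   # 4 GiB
print(bit_addressable // 2**20, "MiB addressable")    # 512 MiB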
So, we use memory logically addressable at less fine levels (although physically no RAM chip will return just one byte). Only powers of two really make sense for this, and historically the level of access has been a byte. It could just as easily be a nibble or a two-byte word, and in fact older systems did have smaller chunks than eight bits.
Now, of course, modern processors mostly eat memory in cache-line-sized increments, but our means of expressing groupings and dividing the now-virtual address space remained, and the smallest amount of memory which a CPU instruction can access directly is still an eight-bit chunk. The machine code for the CPU instructions (and/or the paths going into the processor) would have to grow the same way the number of wires connecting to the memory controller would in order for the registers to be addressable - it's the same problem as with the system memory addressability I was talking about earlier.
"In the early 1960s, AT&T introduced digital telephony first on long-distance trunk lines. These used the 8-bit ยต-law encoding. This large investment promised to reduce transmission costs for 8-bit data. The use of 8-bit codes for digital telephony also caused 8-bit data octets to be adopted as the basic data unit of the early Internet"
http://en.wikipedia.org/wiki/Byte
Not sure how true that is; it seems that it's just the symbol and style adopted by the IEEE, though.
One reason why we use 8-bit bytes is because the complexity of the world around us has a definite structure. On the scale of human beings, the observed physical world has a finite number of distinctive states and patterns. Our innately limited ability to classify information and to distinguish order from chaos, and the finite amount of memory in our brains - these are all reasons why we choose [2^8 ... 2^64] states to be enough to satisfy our everyday basic computational needs.

Are 0 bytes files really 0 bytes?

I have a simple question.
When we create a file, let's say "abc.txt" and leave it blank.
The OS will show that the file is 0 bytes in size and takes 0 bytes on disk.
If we save 100 of these 0-byte files in a folder, the OS will also say that the folder's total size is 0 bytes.
This may sound logical because there is nothing in the file, but shouldn't these files take up at least a few bytes on the storage device?
After all, we save it somewhere and name it something. Shouldn't the file's name, and possibly some other headers, at least take up some space?
No, they still occupy a few bytes on the file system. Otherwise I would implement a magic filesystem that stored everything encoded in the filenames on empty files.
It actually boils down to a matter of definition. Either the "size of a file" refers to the size of the content of the file, or it refers to the "difference" it makes in terms of free bytes on the underlying file system (that is, the size of the content, rounded up to the closest block or cluster size, plus the bytes used for its inode).
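A small sketch of those two notions of "size", using Python on a POSIX-style file system (the file name is the one from the question; exact block counts depend on the file system):

import os

path = "abc.txt"
open(path, "w").close()                            # create the file and leave it blank

st = os.stat(path)
print("logical size:", st.st_size, "bytes")        # 0 - the size of the content
print("blocks allocated on disk:", st.st_blocks)   # often 0 for an empty file
# The directory entry, the file name and the inode still consume file-system
# metadata even when both numbers above are 0.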
Traditionally, these details are stored in what is known as the File Allocation Table (speaking in a Windows FAT context). It is created when we format the hard drive, and some predefined space is allocated for it; I don't think its size changes.
For example, when you format a 100 GB hard drive, only 90-odd GB is available for you to use. The remaining space is used by the file system to manage/remember each file and folder saved on the drive and where it is saved.
The answer to this question is file-system dependent.
For example, on NTFS an empty file takes up a cluster, and a cluster has a size that depends on your hard disk size.
Here you can read about some common cluster sizes for Windows file systems.
The fact that they are present on disk means that a record has been created for them, which of course requires some amount of storage. The 0 bytes simply corresponds to the logical size of the file, rounded to the granularity displayed in the UI; even then, a file of a given format will likely contain a format-specific header.

What are the advantages of memory-mapped files?

I've been researching memory mapped files for a project and would appreciate any thoughts from people who have either used them before, or decided against using them, and why?
In particular, I am concerned about the following, in order of importance:
concurrency
random access
performance
ease of use
portability
I think the advantage is really that you reduce the amount of data copying required over traditional methods of reading a file.
If your application can use the data "in place" in a memory-mapped file, it can come in without being copied; if you use a system call (e.g. Linux's pread()) then that typically involves the kernel copying the data from its own buffers into user space. This extra copying not only takes time, but decreases the effectiveness of the CPU's caches by accessing this extra copy of the data.
If the data actually has to be read from the disk (physical I/O), then the OS still has to read it in, and a page fault probably isn't any better performance-wise than a system call; but if it doesn't (i.e. the data is already in the OS cache), performance should in theory be much better.
On the downside, there's no asynchronous interface to memory-mapped files - if you attempt to access a page which isn't mapped in, it generates a page fault then makes the thread wait for the I/O.
The obvious disadvantage to memory mapped files is on a 32-bit OS - you can easily run out of address space.
I have used a memory mapped file to implement an 'auto complete' feature while the user is typing. I have well over 1 million product part numbers stored in a single index file. The file has some typical header information but the bulk of the file is a giant array of fixed size records sorted on the key field.
At runtime the file is memory mapped, cast to a C-style struct array, and we do a binary search to find matching part numbers as the user types. Only a few memory pages of the file are actually read from disk -- whichever pages are hit during the binary search.
Concurrency - I had an implementation problem where it would sometimes memory-map the file multiple times in the same process space. This was a problem, as I recall, because sometimes the system couldn't find a large enough free block of virtual memory to map the file into. The solution was to map the file only once and thunk all calls to it. In retrospect, using a full-blown Windows service would have been cool.
Random Access - The binary search is certainly random access and lightning fast
Performance - The lookup is extremely fast. As users type a popup window displays a list of matching product part numbers, the list shrinks as they continue to type. There is no noticeable lag while typing.
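A rough Python sketch of the same idea - fixed-size records sorted by key, memory-mapped and binary-searched (the record layout and file name here are made up, and the header the real file has is ignored):

import mmap, struct

RECORD = struct.Struct("16s48s")   # hypothetical layout: 16-byte part number + 48-byte description
INDEX_FILE = "parts.idx"           # hypothetical index file, records pre-sorted by part number

def first_match(prefix: bytes) -> int:
    """Return the index of the first record whose part number is >= prefix."""
    with open(INDEX_FILE, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        lo, hi = 0, len(mm) // RECORD.size
        while lo < hi:             # classic binary search; only the pages it touches get faulted in
            mid = (lo + hi) // 2
            key, _ = RECORD.unpack_from(mm, mid * RECORD.size)
            if key < prefix:
                lo = mid + 1
            else:
                hi = mid
        return lo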
Memory mapped files can be used to either replace read/write access, or to support concurrent sharing. When you use them for one mechanism, you get the other as well.
Rather than lseeking and writing and reading around in a file, you map it into memory and simply access the bits where you expect them to be.
This can be very handy, and depending on the virtual memory interface can improve performance. The performance improvement can occur because the operating system now gets to manage this former "file I/O" along with all your other programmatic memory access, and can (in theory) leverage the paging algorithms and so forth that it is already using to support virtual memory for the rest of your program. It does, however, depend on the quality of your underlying virtual memory system. Anecdotes I have heard say that the Solaris and *BSD virtual memory systems may show better performance improvements than the VM system of Linux--but I have no empirical data to back this up. YMMV.
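As a minimal sketch of "access the bits where you expect them to be" (Python here; the file name and offsets are only examples):

import mmap

with open("data.bin", "r+b") as f:        # a hypothetical existing file, large enough for the offsets below
    with mmap.mmap(f.fileno(), 0) as mm:
        header = mm[:16]                  # read 16 bytes at offset 0 - no seek() needed
        mm[1024:1032] = b"11111111"       # overwrite 8 bytes at offset 1024 in place
        mm.flush()                        # the OS would page this out eventually; flush forces it now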
Concurrency comes into the picture when you consider the possibility of multiple processes using the same "file" through mapped memory. In the read/write model, if two processes wrote to the same area of the file, you could be pretty much assured that one process's data would arrive in the file, overwriting the other process's data. You'd get one, or the other--but not some weird intermingling. I have to admit I am not sure whether this behavior is mandated by any standard, but it is something you could pretty much rely on. (It's actually a good follow-up question!)
In the mapped world, in contrast, imagine two processes both "writing". They do so by doing "memory stores", which result in the O/S paging the data out to disk--eventually. But in the meantime, overlapping writes can be expected to occur.
Here's an example. Say I have two processes both writing 8 bytes at offset 1024. Process 1 is writing '11111111' and process 2 is writing '22222222'. If they use file I/O, then you can imagine that, deep down in the O/S, there is a buffer full of 1s and a buffer full of 2s, both headed to the same place on disk. One of them is going to get there first, and the other one second. In this case, the second one wins. However, if I am using the memory-mapped file approach, process 1 is going to do a memory store of 4 bytes, followed by another memory store of 4 bytes (let's assume that's the maximum memory store size). Process 2 will be doing the same thing. Based on when the processes run, you can expect to see any of the following:
11111111
22222222
11112222
22221111
The solution to this is to use explicit mutual exclusion--which is probably a good idea in any event. You were sort of relying on the O/S to do "the right thing" in the read/write file I/O case, anyway.
The classic mutual exclusion primitive is the mutex. For memory-mapped files, I'd suggest you look at a memory-mapped mutex, available using (e.g.) pthread_mutex_init().
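The pthread route is C; as a sketch of the same idea in Python, a multiprocessing.Lock can serialise the two writers from the example above (the file name and offsets are again just the hypothetical ones used there):

import mmap
from multiprocessing import Lock, Process

def writer(lock, payload):
    with open("data.bin", "r+b") as f, mmap.mmap(f.fileno(), 0) as mm:
        with lock:                     # explicit mutual exclusion around the store
            mm[1024:1032] = payload
            mm.flush()

if __name__ == "__main__":
    lock = Lock()                      # passed to both processes so they share one lock
    procs = [Process(target=writer, args=(lock, b"11111111")),
             Process(target=writer, args=(lock, b"22222222"))]
    for p in procs:
        p.start()
    for p in procs:
        p.join()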
Edit with one gotcha: When you are using mapped files, there is a temptation to embed pointers to the data in the file, in the file itself (think linked list stored in the mapped file). You don't want to do that, as the file may be mapped at different absolute addresses at different times, or in different processes. Instead, use offsets within the mapped file.
Concurrency would be an issue.
Random access is easier.
Performance is good to great.
Ease of use: not as good.
Portability: not so hot.
I've used them on a Sun system a long time ago, and those are my thoughts.