Are "65k" and "65KB" the same?
From xkcd:
65KB normally means 66,560 bytes. 65k means 65,000, and says nothing about what it is 65,000 of. If someone says 65k bytes, they might mean 65KB... but they're misspeaking if so. Some people argue for the use of KiB for the 1024-based unit (so 65 KiB = 66,560 bytes), since k means 1000 in the metric system. Everyone ignores them, though.
Note: a lowercase b would mean bits, rather than bytes. 8Kb = 1KB. When talking about transmission rates, bits are usually used.
Edit: As Joel mentions, hard drive manufacturers often treat the K as meaning 1000. So hard disk space of 65KB would often mean 65,000 bytes. Thumb drives and the like tend to use K as meaning 1024, though.
Probably.
Technically 65k just means 65 thousand (monkeys perhaps?). You would have to take into account the context.
65kB can be interpreted to mean either 65 * 1000 = 65,000 bytes or 65 * 2^10 = 66,560 bytes.
You can read about all this and kibibytes at Wikipedia.
65k is 65,000 of something
65KB is 66,560 bytes (65*1024)
Like most have said, 65KB is 66560, 65k is 65000. 65KB means 66560 BYTES, and 65k is ambiguous. So they're not the same.
Additionally, since there are a few people equating "8 bits = 1 byte", I thought I'd add a little bit about that.
Transmission rates are usually in bits per second, because the grouping into bytes might not be directly related to the actual transmission clock rate.
Take for instance 9600 baud with RS232 serial ports. There are always exactly 9600 bits going out per second (+/- maybe a 5% clock tolerance). However, if those bits are grouped as N-8-1, meaning "no parity, 8 bits, 1 stop bit", then there are 10 bits per byte and so the byte rate is 960 bytes/second maximum. However, if you have something odd like E-8-2, or "even parity, 8 bits, 2 stop bits" then it's 12 bits per byte, or 800 bytes/second. The actual bits are going out at exactly the same rate, so it only makes sense to talk about the bits/second rate.
So 1 byte might be 8 bits, 9 bits (e.g. with parity), 10 bits (e.g. N81, E71, N72), 11 bits (e.g. E81), 12 bits (e.g. E82), or whatever. There are lots of combinations of ways with just RS232-style transmission to get very odd byte rates. If you throw in Reed-Solomon or other ECC, you could have even more bits per byte. Then there's 8b/10b, 6b/8b, Hamming codes, etc...
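To make that arithmetic concrete, here's a small sketch of the byte-rate calculation (illustrative only, not a serial-port API; the framing overhead counts follow the RS232 description above):

    # Effective byte rate for various RS232 framings at a given baud rate.
    # Each frame = 1 start bit + data bits + parity bit (if any) + stop bits.
    def byte_rate(baud, data_bits=8, parity=False, stop_bits=1):
        bits_per_frame = 1 + data_bits + (1 if parity else 0) + stop_bits
        return baud / bits_per_frame

    print(byte_rate(9600))                            # N-8-1: 10 bits/frame -> 960.0 bytes/s
    print(byte_rate(9600, parity=True, stop_bits=2))  # E-8-2: 12 bits/frame -> 800.0 bytes/s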
In terms of data transfer rates, 65k implies 65 kilobits and 65KB implies 65 kilobytes.
Check this http://en.wikipedia.org/wiki/Data_rate_units
cheers
From Wikipedia for Kilobyte:
It is abbreviated in a number of ways: KB, kB, K and Kbyte.
In other words, they could both be abbreviations for Kilobyte. However, using only a lowercase 'k' is not a standard abbreviation, but most people will know what you mean.
There you go:
kB = kiloByte
KB = KelvinByte
kb = kilobit
Kb = Kelvinbit
Use the bold ones! But be aware that some people use 1024 instead of 1000 for k (kilo).
My opinion on this: kilo = 1000, so the first ones who decided to use 1024 made the mistake. If I am not mistaken, 1024 was used first by IT engineers. Later someone (probably some marketing genius) found out that they could label things using 1000 as kilo and make things look bigger than they actually are. Since then, you can't be sure which value is used for kilo.
In general, yes, they're both 65 kilobytes (66,560 bytes).
Sometimes the abbreviations are tricky with casing. If it had been "65Kb", it would have correctly meant kilobits.
A kilobyte (KB) is 1024 bytes.
Kilo stands for 1000.
So, going purely by notation: (65k = 65,000) != (65KB = 66,560).
However, if you're talking about memory you're probably always going to see KB (even if it's written as k).
Generally, KB = k. It's all very confusing really.
Strictly speaking, the former is not specifying the unit: 65,000 What? So, the two can't really be compared.
However, in general speech most people use 65K (note it's normally uppercase) to mean 65 kilobytes (65 * 1024 bytes).
Note 65Kb usually denotes kilobits.
"Officially", 65k is 65,000; however people say 65k all the time, even if the real number is something like 65,123.
Typically 65k means anywhere from 64.00001 to 65.99999998 KiB, or sometimes anywhere between 64,500 and 65,499 bytes... i.e., we aren't all that precise most of the time with sizes of things. When someone cares to be precise, they will be explicit, or the meaning will be clear from context.
65 KiB means 65 * 1024 bytes... unless the person was rounding. Never trust a number unless you measure it yourself! :)
Hope that helps,
--- Dave
65k may be the same as 65KB, but remember, 65KB is larger than 65Kb.
Case is important, as are units.
Psto, you're right. This is an absolute minefield!
As many said, K is technically kilo, meaning thousand (of anything), and comes from Greek.
But you can assume different units depending on the context.
As data transfer rates are most often measured in bits, K in this context could be assumed to mean kilobits.
When talking about data storage, a file's size, etc., K can be assumed to mean kilobytes.
Related
I'm trying to work out an answer to a question about measuring pressures.
The measurements are supposed to be stored in binary floating point format, and my task is to determine the minimum number of bits required to do so given some constraints:
Maximum pressure is 1e+07 Pa
Minimum pressure is 10 Pa
Accuracy of measurements is 0.001 %
So if I understand it correctly, I could happen to measure
1e+07 + 0.00001 * 1e+07 = 10000100 Pa
and would want to store it precisely. This would mean I would need 24 bits, since
2^23 < 10,000,100 < 2^24 - 1.
Is this including the 1 bit for a negative sign? Since we don't measure negative pressures, would a more accurate answer be 23 bits?
And also for the minimum pressure, I could happen to measure 9.9999 Pa and would want to store this correctly, so 4 decimals.
Since I could do the same type of calculation and end up with
2^13 < 9999 < 2^14 - 1
Is this already covered in the 23-24 bits I chose first?
I'm very new to this so any help or just clarification would be appreciated.
Unless you are asking this question out of (i) academic interest or (ii) because you are really short on storage - neither of which I am going to address - I would strongly advocate that you don't worry about the number of bits and instead use a standard float (4 bytes) or double (8 bytes). Databases and software packages understand floats and doubles. If you try to craft your own floating point format using the minimum number of bits, you are adding a lot of work for yourself.
A float will hold 7 significant digits, whereas a double will hold 16. If you want to get technical a float (single precision) uses 8 bits for the exponent, 1 bit for the sign and the remaining 23 bits for the significand, whilst a double (double precision) uses 11 bits for the exponent, 1 bit for the sign and 52 bits for the significand.
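To see that 7-vs-16 digit difference concretely, here is a minimal Python sketch using the struct module ('f' packs a 4-byte float, 'd' an 8-byte double):

    import struct

    def roundtrip(fmt, x):
        # Pack x into the given format and unpack it again.
        return struct.unpack(fmt, struct.pack(fmt, x))[0]

    x = 10000100.5  # near the question's maximum pressure, with a fractional part
    print(roundtrip('f', x))  # 10000100.0 -- the .5 is lost; ~7 significant digits
    print(roundtrip('d', x))  # 10000100.5 -- preserved; ~16 significant digits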
So I suggest you focus instead on whether a float or a double best meets your needs.
I know this doesn't answer your question, but I hope it addresses what you need.
In Deflate algorithm there are two ways to encode a length of 258:
Code 284 + 5 extra bits of all 1's
Code 285 + 0 extra bits;
At first glance this is not optimal, because the proper use of code 285 would allow a length of 259 to be encoded.
Is this duality a specification mistake, left unfixed for compatibility reasons, or are there arguments for it - for example, that a length of 258 must be encoded with the shorter code (0 extra bits) for some reason?
We may never know. The developer of the deflate format, Phil Katz, passed away many years ago at a young age.
My theory is that a match length was limited to 258 so that a match length in the range 3..258 could fit in a byte, encoded as 0..255. This format was developed around 1990, when this might make a difference in an assembler implementation.
Adding a second answer here to underscore Mark's guess that allowing the length to be encoded in a byte is helpful to assembler implementations. At the time 8086 level assembler was still common and using the 8 bit form of registers gave you more of them to work with than using them in 16 bit size.
The benefit is even more pronounced on 8 bit processors such as the 6502. It starts with the length decoding. Symbols 257 .. 264 represent a match length of 3 .. 10 respectively. If you take the low byte of those symbols (1 .. 8) you get exactly 2 less than the match length.
A more complicated yet fairly easy to compute formula gives 2 less than the match length of symbols 265 through 284. 2 less than the match length of symbol 285 is 256. That doesn't fit in a byte but we can store 0 which turns out to be equivalent.
zlib6502 uses this for considerable advantage. It calculates the match length in inflateCodes_lengthMinus2. And once the back pointer into the window has been determined it copies the data like so:
jsr copyByte                   ; copy the first byte explicitly
jsr copyByte                   ; copy the second byte explicitly
inflateCodes_copyByte
jsr copyByte                   ; copy one byte of the match
dec inflateCodes_lengthMinus2  ; decrement the 8-bit counter (0 wraps to 255)
bne inflateCodes_copyByte      ; loop until the counter reaches zero
It makes two explicit calls to copy a byte and then loops over the length less 2, which works as you would expect for lengths 1 to 255. For length 0 it will actually iterate 256 times, as we desire: the first time through the loop the length of 0 is decremented to 255, which is non-zero, so the loop continues 255 more times for a total of 256.
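A quick Python sketch of the same loop, with the 8-bit counter simulated by masking with 0xFF, shows why a stored 0 yields 256 copies (a hypothetical helper, just to illustrate the wrap-around):

    def copies_made(length_minus_2):
        # Two explicit copies, then a dec/bne loop on an 8-bit counter.
        copies = 2
        counter = length_minus_2            # 0..255, where 0 stands in for 256
        while True:
            copies += 1                     # the jsr copyByte inside the loop
            counter = (counter - 1) & 0xFF  # dec: 0 wraps around to 255
            if counter == 0:                # bne falls through at zero
                return copies

    assert copies_made(1) == 3      # symbol 257: match length 3
    assert copies_made(255) == 257  # match length 257
    assert copies_made(0) == 258    # symbol 285: 0 acts as 256, match length 258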
I'd have to think that Phil Katz understood intuitively if not explicitly the benefits of keeping the length of matches within 8 bits.
Until now I believed that 1024 bytes equals 1 KB (kilobyte), but I was reading on the internet about the decimal and binary systems.
So is 1024 bytes = 1 KB actually the correct definition, or is there simply general confusion?
What you are seeing is a marketing stunt.
Since non-technical people don't know the difference between the metric meg, gig, etc. and the binary meg, gig, etc., marketers for storage will use the metric calculation, thus 1000 bytes == 1 kilobyte.
This can cause issues with development or highly technical people, so you get the idea of a binary meg, gig, etc., which is designated with a "bi" in the name (e.g. mebibyte vs megabyte, or gibibyte vs gigabyte).
There are two ways to represent big numbers: you could either display them in multiples of 1000 (base 10) or 1024 (base 2). If you divide by 1000, you probably use the SI prefix names; if you divide by 1024, you probably use the IEC prefix names. The problem starts with dividing by 1024. Many applications use the SI prefix names for it and some use the IEC prefix names. But it is important how it is written:
Using IEC standard:
1 KiB = 1,024 bytes (Note: big K)
1 MiB = 1,024 KiB = 1,048,576 bytes
Using SI standard:
1 kB = 1,000 bytes (Note: small k)
1 MB = 1,000 kB = 1,000,000 bytes
Source: Ubuntu units policy: https://wiki.ubuntu.com/UnitsPolicy
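A small illustration of the two conventions (a hypothetical helper, not something from the Ubuntu policy page):

    def format_bytes(n, binary=False):
        # SI prefixes (k, M, G; base 1000) or IEC prefixes (Ki, Mi, Gi; base 1024).
        base = 1024 if binary else 1000
        prefixes = ["", "Ki", "Mi", "Gi", "Ti"] if binary else ["", "k", "M", "G", "T"]
        i = 0
        while n >= base and i < len(prefixes) - 1:
            n /= base
            i += 1
        return f"{n:g} {prefixes[i]}B"

    print(format_bytes(1_048_576))               # 1.04858 MB  (SI)
    print(format_bytes(1_048_576, binary=True))  # 1 MiB       (IEC)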
In the normal world, most things go by the power of 10. This would include electricity, for example.
But, in the computer world, it is about half binary. For example, when they sell a hard drive, they sell it by powers of 10, so if it is a 1KB drive, then it is 1000 B. But when the computer reads it, the OS usually reads by the value of 1024. This is why, when you read the size of space available on a drive, it reads much less than what was advertised. A 500 GB drive will read as only about 466 GB, because the computer is reading the drive by the binary 1024 version, not the power of 10 that it was sold and advertised by. The same goes for flash drives. But RAM is sold, and read by the computer, by the binary 1024 version.
One thing to note: it is "B", not "b". There are 8 bits "b" in a byte "B". The reason I bring this up is that when you get internet service, they usually advertise the speed in bits, not bytes. But the download box on the computer reads the speed in bytes. Say you have a 50Mb internet connection; it is actually a 6.25MB connection in the download speed box, because you have to divide the 50 by 8 since there are 8 bits in a byte. That is how the computer reads it. Another marketing strategy, too. After all, 50Mb sounds much faster than 6.25MB. Other than speeds through a network, most things are read in bytes "B". Some people do not realize that there is a difference between the "B" and "b".
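Both effects above are plain unit conversions; in Python:

    # Advertised vs. reported drive size: sellers use base 1000, the OS often base 1024.
    advertised = 500 * 1000**3        # a "500 GB" drive
    print(advertised / 1024**3)       # ~465.66 -- what the OS reports as "GB"

    # Advertised vs. observed connection speed: ISPs quote bits, download boxes show bytes.
    print(50 / 8)                     # a 50Mb line shows up as 6.25MB/s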
Quite simple...
The word 'byte' is a computing reference for which the letter 'B' is used as an abbreviation.
It must follow, then, that any reference to bytes, e.g. KB, MB, etc., must be based on the well-known and widely accepted 1024 base.
Therefore 1KB must equal 1024 bytes, 1MB must equal 1048576 bytes (1024x1024), etc.
Any non-computing reference to kilo/mega etc. is based on the decimal 1000 base, e.g. 1 kW or 1 kilowatt, which is 1000 watts.
I extracted this from a tutorial:
Little-Endian order is the one we will be using in this document, and unless stated specifically you should assume that Little-Endian order is used in any file. The alternative is Big-Endian ordering. So let's see an example. Take the following stream of 8 bits: 10001110. If you have been following the document so far, you would quickly calculate the value of this 8-bit number as 1x2^7 + 0x2^6 + ... + 1x2^1 + 0x2^0 = 142. This is an example of Little-Endian ordering. However, in Big-Endian ordering we need to read the number in the opposite direction: 1x2^0 + 0x2^1 + ... + 1x2^6 + 0x2^7 = 113.
Is this correct?
I used to think that endianess has to do with order that the BYTES (not the bits) are read.
Yes, in the context of memory/storage, endianness indeed refers to byte ordering (typically). What would it mean to say that e.g. the least-significant bit "comes first"?
Bit endianness is relevant in some situations, for instance when sending data over a serial bus.
You are correct - that quote you have there is rubbish, IMHO.
It wouldn't make sense to reorder bits, and it would be pretty confusing to boot. CPUs don't read single bits; they read bytes, or combinations of bytes, at one time, so that's the ordering that's important.
When they store a number made up of multiple bytes, they can either store it from left to right, making the high-order byte lowest in memory, or right to left, with the low-order byte lowest in memory.
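Byte ordering is easy to see with Python's struct module: pack the same 32-bit integer both ways and look at the bytes.

    import struct

    n = 0x12345678
    print(struct.pack('<I', n).hex())  # '78563412' -- little-endian: low-order byte first
    print(struct.pack('>I', n).hex())  # '12345678' -- big-endian: high-order byte first
    # The bits within each byte are untouched; only the byte order differs.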
We were asked to find a way to compress a square binary matrix as much as possible, and if possible, to add redundancy bits to check and maybe correct errors.
The redundancy thing is easy to implement, in my opinion. The complicated part is compressing the matrix. I thought about using run-length encoding after reshaping the matrix into a vector, because there will be more zeros than ones, but I only achieved 40 bits of compression (we are working on small sizes), although I thought it'd be better.
Also, after run-length an idea was Huffman coding the matrix, but a dictionary must be sent in order to recover the original information.
I'd like to know what would be the best way to compress a binary matrix?
After reading some comments: yes @Adam, you're right, the 14x14 matrix should be compressed into 128 bits, so if I only use the coordinates (rows & cols) for each non-zero element, it would still be 160 bits (since there are twenty ones). I'm not looking for an exact solution but for a useful idea.
You can only talk about compressing something if you have a distribution and a representation. That's the issue of the dictionary you have to send along: you always need some sort of dictionary or protocol to uncompress something. It just so happens that things like .zip and .mpeg already have those dictionaries/codecs. Even something as simple as Huffman encoding is an algorithm; on the other side of the communication channel (you can think of compression as communication), the other person already has a bit of code (the dictionary) to perform the Huffman decompression scheme.
Thus you cannot even begin to talk about compressing something without first thinking "what kinds of matrices do I expect to see?", "is the data truly random, or is there order?", and if so "how can I represent the matrices to take advantage of order in the data?".
You cannot compress some matrices without increasing the size of other objects (by at least 1 bit). This is bad news if all matrices are equally probable, and you care equally about them all.
Addenda:
The answer to use sparse matrix machinery is not necessarily the right answer. The matrix could for example be represented in python as [[(r+c)%2 for c in range(cols)] for r in range(rows)] (a checkerboard pattern), and a sparse matrix wouldn't compress it at all, but the Kolmogorov complexity of the matrix is the above program's length.
Well, I know every matrix will have the same number of ones, so this is kind of deterministic. The only thing I don't know is where the 1's will be. Also, if I transmit the matrix with a dictionary and there are burst errors, maybe the dictionary gets affected, so... wouldn't the resulting information be corrupted? That's why I was trying to use lossless data compression such as run-length encoding; the decoder just doesn't need a dictionary. --original poster
How many 1s does the matrix have as a fraction of its size, and what is its size (NxN -- what is N)?
Furthermore, this is an incorrect assertion and should not be used as a reason to desire run-length encoding (which still requires a program); when you transmit data over a channel, you can always add error-correction to this data. "Data" is just a blob of bits. You can transmit both the data and any required dictionaries over the channel. The error-correcting machinery does not care at all what the bits you transmit are for.
Addendum 2:
There are (14*14) choose 20 possible arrangements, which I assume are randomly chosen. If this number were larger than 2^128, what you're trying to do would be impossible. Fortunately log2((14*14) choose 20) ≈ 90 bits < 128 bits, so it's possible.
The simple solution of writing down 20 numbers like 32,2,67,175,52,...,168 won't work because 20 * log2(14*14) ≈ 153 bits > 128 bits. This would be equivalent to run-length encoding. We want to do something like this, but we are on a very strict budget and cannot afford to be "wasteful" with bits.
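You can check both counts directly (math.comb requires Python 3.8+):

    import math

    n, k = 14 * 14, 20
    print(math.log2(math.comb(n, k)))  # ~90 bits: an ideal encoding fits in 128
    print(k * math.log2(n))            # ~152.3 bits: a naive list of 20 positions does not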
Because you care about each possibility equally, your "dictionary"/"program" will simulate a giant lookup table. Matlab's sparse matrix implementation may work but is not guaranteed to work and is thus not a correct solution.
If you can create a bijection between subsets of size 20 and the number range [0, C(196,20)) -- which sits inside [0, 2^128) -- you're good to go. This corresponds to enumerating ways to descend the pyramid in http://en.wikipedia.org/wiki/Binomial_coefficient to the 20th element of row 196. This is the same as enumerating all "k-combinations". See http://en.wikipedia.org/wiki/Combination#Enumerating_k-combinations
Fortunately I know that Mathematica and Sage and other CAS software can apparently generate the "5th" or "12th" or arbitrarily numbered k-subset. Looking through their documentation, we come upon a function called "rank", e.g. http://www.sagemath.org/doc/reference/sage/combinat/subset.html
So then we do some more searching, and come across some arcane Fortran code like http://people.sc.fsu.edu/~jburkardt/m_src/subset/ksub_rank.m and http://people.sc.fsu.edu/~jburkardt/m_src/subset/ksub_unrank.m
We could reverse-engineer it, but it's kind of dense. But now we have enough information to search for "k-subset rank unrank", which leads us to http://www.site.uottawa.ca/~lucia/courses/5165-09/GenCombObj.pdf -- see the section "Generating k-subsets (of an n-set): Lexicographical Ordering" and the rank and unrank algorithms on the next few pages.
In order to achieve the exact theoretically optimal compression, in the case of a uniformly random distribution of 1s, we must thus use this technique to biject our matrices to our output number of range <2^128. It just so happens that combinations have a natural ordering, known as ranking and unranking of combinations. You assign a number to each combination (ranking), and if you know the number you automatically know the combination (unranking). Googling k-subset rank unrank will probably yield other algorithms.
Thus your solution would look like this:
serialize the matrix into a list
e.g. [[0,0,1],[0,1,1],[1,0,0]] -> [0,0,1,0,1,1,1,0,0]
take the indices of the 1s:
e.g. [0,0,1,0,1,1,1,0,0] -> [3,5,6,7]
(positions numbered 1..9; this is a k=4-subset of an n=9 set)
take the rank
e.g. compressed = rank([3,5,6,7], n=9)
compressed==412 (or something, I made that up)
you're done!
e.g. 412 -binary-> 110011100 (at most n=9 bits, since rank < 2^9 = 512; in fact rank < C(9,4) = 126, so 7 bits would do)
to uncompress, unrank it
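Here is a minimal Python sketch of the whole pipeline above. It uses the combinadic (combinatorial number system) bijection rather than the lexicographic rank/unrank from the PDF; either works, since any bijection between k-subsets and [0, C(n,k)) will do:

    import math

    def rank(indices):
        # Map a set of k distinct 0-based indices to a number in [0, C(n,k)).
        return sum(math.comb(c, i + 1) for i, c in enumerate(sorted(indices)))

    def unrank(r, k):
        # Inverse of rank: recover the k indices from the number r.
        indices = []
        for i in range(k, 0, -1):
            c = i - 1
            while math.comb(c + 1, i) <= r:  # find the largest c with comb(c, i) <= r
                c += 1
            indices.append(c)
            r -= math.comb(c, i)
        return sorted(indices)

    def compress(matrix):
        # Flatten, take the 0-based positions of the 1s, and rank them.
        flat = [bit for row in matrix for bit in row]
        return rank(i for i, bit in enumerate(flat) if bit)

    def decompress(r, rows=14, cols=14, k=20):
        ones = set(unrank(r, k))
        return [[1 if row * cols + col in ones else 0 for col in range(cols)]
                for row in range(rows)]

For the 14x14 case the rank is always below C(196,20), so it fits in 90 bits, comfortably inside the 128-bit budget.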
I'll get to 128 bits in a sec, first here's how you fit a 14x14 boolean matrix with exactly 20 nonzeros into 136 bits. It's based on the CSC sparse matrix format.
You have an array c with 14 4-bit counters that tell you how many nonzeros are in each column.
You have another array r with 20 4-bit row indices.
56 bits (c) + 80 bits (r) = 136 bits.
Let's squeeze 8 bits out of c:
Instead of 4-bit counters, use 2-bit. c is now 2*14 = 28 bits, but can't support more than 3 nonzeros per column. This leaves us with 128-80-28 = 20 bits. Use that space for an array a4c with 5 4-bit elements that each "add 4 to an element of c" specified by the 4-bit element. So, if a4c = {2, 2, 10, 15, 15} that means c[2] += 4; c[2] += 4 (again); c[10] += 4; (15 is not a valid column index for a 14-column matrix, so it marks an unused slot).
The "most wasteful" distribution of nonzeros is one where the column count will require an add-4 to support 1 extra nonzero: so 5 columns with 4 nonzeros each. Luckily we have exactly 5 add-4s available.
Total space = 28 bits (c) + 20 bits (a4c) + 80 bits (r) = 128 bits.
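A hedged sketch of the encode side in Python, packing everything into one big integer (the field layout is my own choice; 15 marks an unused a4c slot since columns only run 0-13):

    def pack128(matrix):
        # matrix: 14x14 list of 0/1 with exactly 20 ones.
        counts, rows = [0] * 14, []
        for c in range(14):              # CSC order: walk column by column
            for r in range(14):
                if matrix[r][c]:
                    counts[c] += 1
                    rows.append(r)       # 20 4-bit row indices
        a4c = []
        for c in range(14):              # peel off add-4s until each count fits in 2 bits
            while counts[c] > 3:
                counts[c] -= 4
                a4c.append(c)
        a4c += [15] * (5 - len(a4c))     # at most 5 add-4s are ever needed (20 // 4)

        word = 0
        for v in counts:                 # 14 x 2 bits = 28
            word = (word << 2) | v
        for v in a4c:                    # 5 x 4 bits = 20
            word = (word << 4) | v
        for v in rows:                   # 20 x 4 bits = 80, for 128 bits total
            word = (word << 4) | v
        return word                      # decoding just reverses the shifts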
Your input is a perfect candidate for a sparse matrix. You said you're using Matlab, so you already have a good sparse matrix built for you.
spm = sparse(dense_matrix)  % convert to Matlab's sparse (CSC) storage
Matlab's sparse matrix implementation uses Compressed Sparse Columns, which has memory usage on the order of 2*(# of nonzeros) + (# of columns), which should be pretty good in your case of 20 nonzeros and 14 columns. Storing 20 values sure is better than storing 196...
Also remember that all matrices in Matlab are going to be composed of doubles. Just because your matrix can be stored as a 1-bit boolean doesn't mean Matlab won't stick it into a 64-bit floating point value... If you do need it as a boolean you're going to have to make your own type in C and use .mex files to interface with Matlab.
After thinking about this again, if all your matrices are going to be this small and they're all binary, then just store them as a binary vector (bitmask). Going off your 14x14 example, that requires 196 bits or 25 bytes (plus n, m if your dimensions are not constant). That same vector in Matlab would use 64 bits per element, or 1568 bytes. So storing the matrix as a bitmask takes as much space as 4 elements of the original matrix in Matlab, for a compression ratio of 62x.
Unfortunately I don't know if Matlab supports bitmasks natively or if you have to resort to .mex files. If you do get into C++ you can use STL's vector<bool> which implements a bitmask for you.
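And for completeness, the bitmask idea from the previous answer as a Python sketch (hypothetical helpers; outside Matlab no .mex is needed):

    def to_bitmask(matrix):
        # One bit per element: a 14x14 matrix packs into ceil(196/8) = 25 bytes.
        flat = [bit for row in matrix for bit in row]
        out = bytearray((len(flat) + 7) // 8)
        for i, bit in enumerate(flat):
            if bit:
                out[i // 8] |= 1 << (i % 8)
        return bytes(out)

    def from_bitmask(data, rows, cols):
        return [[(data[(r * cols + c) // 8] >> ((r * cols + c) % 8)) & 1
                 for c in range(cols)] for r in range(rows)]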