Using large numbers in sql query

Using large numbers in sql query - mysql

I have a field called size which is a BIGINT storing the number of bytes in a file. To get a file that is larger than 1GB I am currently doing:
size > (1024*1024*1024)
But this looks a bit hairy. Is there another way to write this that makes it more clear that the result of 1024*1024*1024 is 1GiB?
Additionally is the exponent operator built into mysql? I've used
select power(2, 30)
But I was wondering if there was a shortform to do that directly in the query, such as 2^30.

^ is the bitwise xor operator.
Either POW(2,30) or POWER(2,30) (or POWER(1024,3)) will work; I believe of the two POWER is the more standard. There is no typographic operator for exponentiation.
I would just leave it as 1024*1024*1024; to me that provides the best readability (and makes it clear it is 1 GiB, not 1 GB).

Related

is it possible to speed up writing numbers in SQL?

I was wondering, in cases when you have to write big numbers such as 50.000.000.000, is it possible to write something like.. BILLION(50) or any shortcut?

Yes, you can use 5e10 (meaning 5 x 1010). See https://dev.mysql.com/doc/refman/8.0/en/number-literals.html
But numbers expressed in scientific notation like that are interpreted as floating point constants, with a more limited precision, and calculations using them will continue to use floating point. For example:
select 5e16-1, 50000000000000000-1;
returns
5e16
49999999999999999

Finding the set of wildcards for representing a binary interval

In a switch flow table there is a field called match field, where a list of matching condition is maintained. In the match fields, binary sequences with wildcards (*, a bit of wildcard character means that it could be either 1 or 0 in this bit) are used to represent some matching conditions.
For example, we can use the following binary sequences with wildcards to represent the matching condition '38798 <= Port <= 56637':
100101111000111*
100101111**1****
100101111*1*****
1001011111******
10011***********
101*************
110111010011110*
1101110100***0**
1101110100**0***
1101110100*0****
11011101000*****
11011100********
110**0**********
110*0***********
1100************
Does anyone know (or suggests) a way to obtain such sequence? So far, I used a brute force strategy (getting all the possible combinations for the intervals), but it is not computationally feasible (memory explosion) and has a problem of redundancy (wildcards representing the same interval). So now I'm trying to use range splitting and grid-of-tries to get a solution, but without success yet.

Give an unique 6 or 9 digit number to each row

Is it possible to assign an unique 6 or 9 digit number to each new row only with MySQL.
Example :
id1 : 928524
id2 : 124952
id3 : 485920
...
...
P.S : I can do that with php's rand() function, but I want a better way.

MySQL can assign unique continuous keys by itself. If you don't want to use rand(), maybe this is what you meant?

I suggest you manually set the ID of the first row to 100000, then tell the database to auto increment. Next row should then be 100001, then 100002 and so on. Each unique.

Don't know why you would ever want to do this but you will have to use php's rand function, see if its already in the database, if it is start from the beginning again, if its not then use it for the id.

Essentially you want a cryptographic hash that's guaranteed not to have a collision for your range of inputs. Nobody seems to know the collision behavior of MD5, so here's an algorithm that's guaranteed not to have any: Choose two large numbers M and N that have no common divisors-- they can be two very large primes, or 2**64 and 3**50, or whatever. You will be generating numbers in the range 0..M-1. Use the following hashing function:
H(k) = k*N (mod M)
Basic number theory guarantees that the sequence has no collisions in the range 0..M-1. So as long as the IDs in your table are less than M, you can just hash them with this function and you'll have distinct hashes. If you use unsigned 64-bit integer arithmetic, you can let M = 2**64. N can then be any odd number (I'd choose something large enough to ensure that k*N > M), and you get the modulo operation for free as arithmetic overflow!
I wrote the following in comments but I'd better repeat it here: This is not a good way to implement access protection. But it does prevent people from slurping all your content, if M is sufficiently large.

what is the meaning of "<<" in TCL?

I know the "<<" is a bit operation. but I do not understand what it exactly functions in TCL, and when should we use it?
can anyone help me on this?

The << operator in Tcl's expressions is an arithmetic bit shift left. It's exceptionally similar to the equivalent in C and many other languages, and would be used in all the same places (it's logically equivalent to a multiply by a suitable power of 2, but it's usually advisable to use a shift when thinking about bits and a multiply when thinking about numbers).
Note that one key difference with many other languages (from Tcl 8.5 onwards) is that it does not “drop bits off the front”; the language implementation automatically uses wider number representations as necessary so that information is never lost. Bits are dropped by using a separate binary mask operation (e.g., & ((1 << $numBits) - 1)).

There are a number of uses for the << shift left operator. Some that come to my mind are :
Bit by bit processing. Shift a number and observe highest order bit etc. It comes in more handy than you might think.
If you add a zero to a number in the decimal number system you effectively multiply it by 10. shifting bits effectively means multiplying by 2. This actually translated into a low level assembly command of bit shifting which has lower compute cycles than multiplication by 2. This is used for efficiency in the gaming industry. Shift if twice (<< 2) to multiply it by 4 and so on.
I am sure there are many others.

The << operation is not much different from C's, for instance. And it's used when you need to shift bits of an integer value to the left. This can be occasionally useful when doing subtle number crunching like implemening a hash function or deserialising something from an input bytestream (but note that [binary scan] covers almost all of what's needed for this sort of thing). For a more general info refer to this Wikipedia article or something like this, this is not really Tcl-related.

The '<<' is a left bit shift. You must apply it to an integer. This arithmetic operator will shift the bits to left.
For example, if you want to shifted the number 1 twice to the left in the Tcl interpreter tclsh, type:
expr { 1 << 2 }
The command will return 4.
Pay special attention to the maximum integer the interpreter hold on your platform.

Compressing a binary matrix

We were asked to find a way to compress a square binary matrix as much as possible, and if possible, to add redundancy bits to check and maybe correct errors.
The redundancy thing is easy to implement in my opinion. The complicated part is compressing the matrix. I thought about using run-length after reshaping the matrix to a vector because there will be more zeros than ones, but I only achieved a 40bits compression (we are working on small sizes) although I thought it'd be better.
Also, after run-length an idea was Huffman coding the matrix, but a dictionary must be sent in order to recover the original information.
I'd like to know what would be the best way to compress a binary matrix?
After reading some comments, yes #Adam you're right, the 14x14 matrix should be compressed in 128bits, so if I only use the coordinates (rows&cols) for each non-zero element, still it would be 160bits (since there are twenty ones). I'm not looking for an exact solution but for a useful idea.

You can only talk about compressing something if you have a distribution and a representation. That's the issue of the dictionary you have to send along: you always need some sort of dictionary of protocol to uncompress something. It just so happens that things like .zip and .mpeg already have those dictionaries/codecs. Even something as simple as Huffman-encoding is an algorithm; on the other side of the communication channel (you can think of compression as communication), the other person already has a bit of code (the dictionary) to perform the Huffman decompression scheme.
Thus you cannot even begin to talk about compressing something without first thinking "what kinds of matrices do I expect to see?", "is the data truly random, or is there order?", and if so "how can I represent the matrices to take advantage of order in the data?".
You cannot compress some matrices without increasing the size of other objects (by at least 1 bit). This is bad news if all matrices are equally probable, and you care equally about them all.
Addenda:
The answer to use sparse matrix machinery is not necessarily the right answer. The matrix could for example be represented in python as [[(r+c)%2 for c in range (cols)] for r in range(rows)] (a checkerboard pattern), and a sparse matrix wouldn't compress it at all, but the Kolmogorov complexity of the matrix is the above program's length.
Well, I know every matrix will have the same number of ones, so this is kind of deterministic. The only think I don't know is where the 1's will be. Also, if I transmit the matrix with a dictionary and there are burst errors, maybe the dictionary gets affected so... wouldnt be the resulting information corrupted? That's why I was trying to use lossless data compression such as run-length, the decoder just doesnt need a dictionary. --original poster
How many 1s does the matrix have as a fraction of its size, and what is its size (NxN -- what is N)?
Furthermore, this is an incorrect assertion and should not be used as a reason to desire run-length encoding (which still requires a program); when you transmit data over a channel, you can always add error-correction to this data. "Data" is just a blob of bits. You can transmit both the data and any required dictionaries over the channel. The error-correcting machinery does not care at all what the bits you transmit are for.
Addendum 2:
There are (14*14) choose 20 possible arrangements, which I assume are randomly chosen. If this number was larger than 128^2 what you're trying to do would be impossible. Fortunately log_2((14*14) choose 20) ~= 90bits < 128bits so it's possible.
The simple solution of writing down 20 numbers like 32,2,67,175,52,...,168 won't work because log_2(14*14)*20 ~= 153bits > 128bits. This would be equivalent to run-length encoding. We want to do something like this but we are on a very strict budget and cannot afford to be "wasteful" with bits.
Because you care about each possibility equally, your "dictionary"/"program" will simulate a giant lookup table. Matlab's sparse matrix implementation may work but is not guaranteed to work and is thus not a correct solution.
If you can create a bijection between the number range [0,2^128) and subsets of size 20, you're good to go. This corresponds to enumerating ways to descend the pyramid in http://en.wikipedia.org/wiki/Binomial_coefficient to the 20th element of row 196. This is the same as enumerating all "k-combinations". See http://en.wikipedia.org/wiki/Combination#Enumerating_k-combinations
Fortunately I know that Mathematica and Sage and other CAS software can apparently generate the "5th" or "12th" or arbitrarily numbered k-subset. Looking through their documentation, we come upon a function called "rank", e.g. http://www.sagemath.org/doc/reference/sage/combinat/subset.html
So then we do some more searching, and come across some arcane Fortran code like http://people.sc.fsu.edu/~jburkardt/m_src/subset/ksub_rank.m and http://people.sc.fsu.edu/~jburkardt/m_src/subset/ksub_unrank.m
We could reverse-engineer it, but it's kind of dense. But now we have enough information to search for k-subset rank unrank, which leads us to http://www.site.uottawa.ca/~lucia/courses/5165-09/GenCombObj.pdf -- see the section
"Generating k-subsets (of an n-set): Lexicographical
Ordering" and the rank and unrank algorithms on the next few pages.
In order to achieve the exact theoretically optimal compression, in the case of a uniformly random distribution of 1s, we must thus use this technique to biject our matrices to our output number of range <2^128. It just so happens that combinations have a natural ordering, known as ranking and unranking of combinations. You assign a number to each combination (ranking), and if you know the number you automatically know the combination (unranking). Googling k-subset rank unrank will probably yield other algorithms.
Thus your solution would look like this:
serialize the matrix into a list
e.g. [[0,0,1][0,1,1][1,0,0]] -> [0,0,1,0,1,1,1,0,0]
take the indices of the 1s:
e.g. [0,0,1,0,1,1,1,0,0] -> [3,5,6,7]
1 2 3 4 5 6 7 8 9 a k=4-subset of an n=9 set
take the rank
e.g. compressed = rank([3,5,6,7], n=9)
compressed==412 (or something, I made that up)
you're done!
e.g. 412 -binary-> 110011100 (at most n=9bits, less than 2^n=2^9=512)
to uncompress, unrank it

I'll get to 128 bits in a sec, first here's how you fit a 14x14 boolean matrix with exactly 20 nonzeros into 136 bits. It's based on the CSC sparse matrix format.
You have an array c with 14 4-bit counters that tell you how many nonzeros are in each column.
You have another array r with 20 4-bit row indices.
56 bits (c) + 80 bits (r) = 136 bits.
Let's squeeze 8 bits out of c:
Instead of 4-bit counters, use 2-bit. c is now 2*14 = 28 bits, but can't support more than 3 nonzeros per column. This leaves us with 128-80-28 = 20 bits. Use that space for array a4c with 5 4-bit elements that "add 4 to an element of c" specified by the 4-bit element. So, if a4c={2,2,10,15, 15} that means c[2] += 4; c[2] += 4 (again); c[10] += 4;.
The "most wasteful" distribution of nonzeros is one where the column count will require an add-4 to support 1 extra nonzero: so 5 columns with 4 nonzeros each. Luckily we have exactly 5 add-4s available.
Total space = 28 bits (c) + 20 bits
(a4c) + 80 bits (r) = 128 bits.

Your input is a perfect candidate for a sparse matrix. You said you're using Matlab, so you already have a good sparse matrix built for you.
spm = sparse(dense_matrix)
Matlab's sparse matrix implementation uses Compressed Sparse Columns, which has memory usage on the order of 2*(# of nonzeros) + (# of columns), which should be pretty good in your case of 20 nonzeros and 14 columns. Storing 20 values sure is better than storing 196...
Also remember that all matrices in Matlab are going to be composed of doubles. Just because your matrix can be stored as a 1-bit boolean doesn't mean Matlab won't stick it into a 64-bit floating point value... If you do need it as a boolean you're going to have to make your own type in C and use .mex files to interface with Matlab.

After thinking about this again, if all your matrices are going to be this small and they're all binary, then just store them as a binary vector (bitmask). Going off your 14x14 example, that requires 196 bits or 25 bytes (plus n, m if your dimensions are not constant). That same vector in Matlab would use 64 bits per element, or 1568 bytes. So storing the matrix as a bitmask takes as much space as 4 elements of the original matrix in Matlab, for a compression ratio of 62x.
Unfortunately I don't know if Matlab supports bitmasks natively or if you have to resort to .mex files. If you do get into C++ you can use STL's vector<bool> which implements a bitmask for you.

We Keep Coding

html mysql json google-apps-script actionscript-3 ms-access google-chrome google-maps reporting-services sql-server-2008

Using large numbers in sql query - mysql

Related

is it possible to speed up writing numbers in SQL?

Finding the set of wildcards for representing a binary interval

Give an unique 6 or 9 digit number to each row

what is the meaning of "<<" in TCL?

Compressing a binary matrix

Categories

Resources