Using bitwise on a large number in MySQL

I have a need to store a provider name and the county(ies) they are able to provide services in. There are 92 counties. Rather than store up to 92 rows, I'd like to store a bitmask, so if they provide only in county 1 and county 2, I'd store 3.
The problem I'm having is that the largest bit value, 2^91 = 2475880078570760549798248448, is way too big for the largest BIGINT.
In the past when I've had smaller numbers of options I've been able to do something like....
SELECT * FROM tblWhatever WHERE my_col & 2;
If my_col had 2, 3, 6, etc. stored (anything with the 2 bit set) it would be found.
I guess I'm not sure of 2 things... how to store AND how to query if stored in a way other than an INT.

You could use BINARY(12) as a datatype to store up to 96 bits. But MySQL bitwise operators only support BIGINT, which is 64 bits.
So if you want to use bitwise operators, you'd have to store your county bitfield in two BIGINT columns (or perhaps one BIGINT and one INT), and in application code work out which of the two columns to search. That seems like it's awkward.
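For illustration, a minimal sketch of that two-column layout (the new column names are made up): counties 1-64 map to bits of the low word, counties 65-92 to bits of the high word, and the application picks the column and the bit value.

ALTER TABLE tblWhatever
  ADD COLUMN counties_lo BIGINT UNSIGNED NOT NULL DEFAULT 0,  -- counties 1..64
  ADD COLUMN counties_hi BIGINT UNSIGNED NOT NULL DEFAULT 0;  -- counties 65..92

-- county 2 -> bit value 2 in the low word
SELECT * FROM tblWhatever WHERE counties_lo & 2;

-- county 70 -> bit index 70-65 = 5 in the high word, i.e. bit value 32
SELECT * FROM tblWhatever WHERE counties_hi & 32;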
However, I'll point out a performance consideration: using bitwise operators for searching isn't very efficient. You can't make use of an index to do the search, so every query is forced to perform a table scan. As your data grows, this will become more and more costly.
It's the same problem as searching for a substring with LIKE '%word%', or searching for all dates with a given day of the month.
So I'd suggest storing each county in a separate row after all. You don't have to store 92 rows for each service provider -- you only have to store as many rows as the number of counties they service. The absence of a row indicates no service in that county.
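A minimal sketch of that row-per-county design (table and column names are hypothetical); the lookup becomes an indexed equality search instead of a table scan:

CREATE TABLE provider_county (
  provider_id INT NOT NULL,
  county_id   TINYINT UNSIGNED NOT NULL,  -- 1..92
  PRIMARY KEY (provider_id, county_id),
  KEY ix_county (county_id)
);

-- which providers service county 2?
SELECT provider_id FROM provider_county WHERE county_id = 2;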

Related

Is there an indexable way to store several bitfields in MySQL?

I have a MySQL table which needs to store several bitfields...
notification.id -- autonumber int
association.id -- BIT FIELD 1 -- stores one or more association ids (which are obtained from another table)
type.id -- BIT FIELD 2 -- stores one or more types that apply to this notification (again, obtained from another table)
notification.day_of_week -- BIT FIELD 3 -- stores one or more days of the week
notification.target -- where to send the notification -- data type is irrelevant, as we'll never index or sort on this field, but will probably store an email address.
My users will be able to configure their notifications to trigger on one or more days, in one or more associations, for one or more types. I need a quick, indexable way to store this data.
Bit fields 1 and 2 can expand to have more values than they do presently. Currently 1 has values as high as 125, and 2 has values as high as 7, but both are expected to go higher.
Bit field 3 stores days of the week, and as such, will always have only 7 possible values.
I'll need to run a script frequently (every few minutes) that scans this table based on type, association, and day, to determine if a given notification should be sent. Queries need to be fast, and the simpler it is to add new data, the better. I'm not above using joins, subqueries, etc as needed, but I can't imagine these being faster.
One last requirement: if I have 1000 different notifications stored in here, with 125 association possibilities, 7 types, and 7 days of the week, then storing one row per combination of plain integer values, instead of using bit fields, gives a record count too high for my taste, so it seems like using bit fields is a requirement.
However, from what I've heard, if I wanted to select everything from a particular day of the week, say Tuesday (b0000100 in a bit field, perhaps), bit fields are not indexed such that I can do...
SELECT * FROM `mydb`.`mytable` WHERE `notification.day_of_week` & 4 = 4;
This, from my understanding, would not use an index at all.
Any suggestions on how I can do this, or something similar, in an indexable fashion?
(I work on a pretty standard LAMP stack, and I'm looking for specifics on how the MySQL indexing works on this or a similar alternative.)
Thanks!
There's no "good" way (that I know of) to accomplish what you want to.
Note that the BIT datatype is limited to a size of 64 bits.
For bits that can be statically defined, MySQL provides the SET datatype, which, like BIT, packs multiple flags into a single column, but differs in that each bit position is given a name.
For days of the week, for example, you could define a column
dow SET('SUN','MON','TUE','WED','THU','FRI','SAT')
There's no builtin way (that I know of) of getting the internal bit representation back out, but you can add a 0 to the column, or cast it to unsigned, to get a decimal representation.
SELECT dow+0, CONVERT(dow,UNSIGNED), dow, ...

dow+0   CONVERT(dow,UNSIGNED)   dow
1       1                       SUN
2       2                       MON
3       3                       SUN,MON
4       4                       TUE
5       5                       SUN,TUE
6       6                       MON,TUE
7       7                       SUN,MON,TUE
It is possible for MySQL to use a "covering index" to satisfy a query with a predicate on a SET column, when the SET column is the leading column in the index. (i.e. EXPLAIN shows 'Using where; Using index') But MySQL may be performing a full scan of the index, rather than doing a range scan. (And there may be differences between the MyISAM engine and the InnoDB engine.)
SELECT id FROM notification WHERE FIND_IN_SET('SUN',dow)
SELECT id FROM notification WHERE (dow+0) MOD 2 = 1
BUT... this usage is non-standard, and can't really be recommended. For one thing, this behavior is not guaranteed, and MySQL may change this behavior in a future release.
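For concreteness, a sketch of that experiment, assuming the notification table from the question (the index name is made up):

ALTER TABLE notification ADD INDEX ix_dow_id (dow, id);

EXPLAIN SELECT id FROM notification WHERE FIND_IN_SET('SUN', dow);
-- 'Using where; Using index' means the query is satisfied from the index,
-- but the whole index may still be scanned rather than range-scanned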
I've done a bit more research on this, and realized there's no way to get the indexing to work as I outlined above. So, I've created an auxiliary table (somewhat like the WordPress meta table format) which stores entries for day of week, etc. I'll just join these tables as needed. Fortunately, I don't anticipate having more than ~10,000 entries at present, so it should join quickly enough.
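A minimal sketch of that auxiliary-table approach (table and column names are hypothetical):

CREATE TABLE notification_day (
  notification_id INT NOT NULL,
  day_of_week     TINYINT NOT NULL,  -- 0=SUN .. 6=SAT
  PRIMARY KEY (notification_id, day_of_week),
  KEY ix_day (day_of_week)
);

-- all notifications that fire on Tuesday
SELECT n.*
FROM notification AS n
JOIN notification_day AS d ON d.notification_id = n.id
WHERE d.day_of_week = 2;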
I'm still interested in a better answer if anyone has one!

Sorting and Indexing Rank as Bigint or Float

Using MySQL. I have a table of users and need to store the user's rank in one of the columns. The application determines the rank, and it happens to be a signed float. Examples of ranks would be:
7732.5366604
7733.0252967
7740.8813879
-7736.5642667
-42.0223467
81121.3647382
So basically I always have at least seven places after "." and the number is signed.
I need to create index on this field and ORDER BY it.
From what I understand, I have 2 choices:
1. Store the number as-is in the most appropriate MySQL type (FLOAT?)
2. Format the value to remove the dot (in the application layer) and store it as a signed BIGINT, so that -7736.5642667 would become -77365642667. If for some reason I did not have 7 places after the dot, I would append zeros, so that 234.34565 would become 234.3456500 and then 2343456500.
I'm wondering which of the two would be faster in terms of storing, indexing and sorting.
Thanks and appreciate any input.
I don't know the index internals well enough to say how indexing an INT differs from indexing a FLOAT, but I would run a benchmark on a big data set. If there is no difference in performance, then I would store the number as-is, because those are your actual values and you don't have to do any formatting.
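For what it's worth, a minimal sketch of both layouts for such a benchmark (table and column names are made up). One caveat worth checking: the sample values carry more significant digits than a 4-byte FLOAT preserves (about 7), so the store-as-is variant below uses DOUBLE:

CREATE TABLE user_rank_double (
  user_id  INT PRIMARY KEY,
  rank_val DOUBLE NOT NULL,        -- e.g. -7736.5642667 stored as-is
  KEY ix_rank (rank_val)
);

CREATE TABLE user_rank_bigint (
  user_id     INT PRIMARY KEY,
  rank_scaled BIGINT NOT NULL,     -- e.g. -7736.5642667 stored as -77365642667
  KEY ix_rank (rank_scaled)
);

-- the ORDER BY can be satisfied by walking the index in either layout
SELECT user_id FROM user_rank_double ORDER BY rank_val DESC LIMIT 10;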

MySQL PRIMARY KEYs: UUID / GUID vs BIGINT (timestamp+random)

tl;dr: Is assigning rows IDs of {unixtimestamp}{randomdigits} (such as 1308022796123456) as a BIGINT a good idea if I don't want to deal with UUIDs?
Just wondering if anyone has some insight into any performance or other technical considerations / limitations in regards to IDs / PRIMARY KEYs assigned to database records across multiple servers.
My PHP+MySQL application runs on multiple servers, and the data needs to be able to be merged. So I've outgrown the standard sequential / auto_increment integer method of identifying rows.
My research into a solution brought me to the concept of using UUIDs / GUIDs. However the need to alter my code to deal with converting UUID strings to binary values in MySQL seems like a bit of a pain/work. I don't want to store the UUIDs as VARCHAR for storage and performance reasons.
Another possible annoyance of UUIDs stored in a binary column is the fact that rows IDs aren't obvious when looking at the data in PhpMyAdmin - I could be wrong about this though - but straight numbers seem a lot simpler overall anyway and are universal across any kind of database system with no conversion required.
As a middle ground I came up with the idea of making my ID columns a BIGINT, and assigning IDs using the current unix timestamp followed by 6 random digits. So let's say my random number came out to be 123456; my generated ID today would come out as: 1308022796123456
A one-in-a-million chance of a conflict between two rows created within the same second is fine with me. I'm not doing any sort of mass row creation quickly.
One issue I've read about with randomly generated UUIDs is that they're bad for indexes, as the values are not sequential (they're spread out all over the place). The UUID() function in MySQL addresses this by generating the first part of the UUID from the current timestamp. Therefore I've copied that idea of having the unix timestamp at the start of my BIGINT. Will my indexes be slow?
Pros of my BIGINT idea:
Gives me the multi-server/merging advantages of UUIDs
Requires very little change to my application code (everything is already programmed to handle integers for IDs)
Half the storage of a UUID (8 bytes vs 16 bytes)
Cons:
??? - Please let me know if you can think of any.
Some follow up questions to go along with this:
Should I use more or less than 6 random digits at the end? Will it make a difference to index performance?
Is one of these methods any "randomer" ?: Getting PHP to generate 6 digits and concatenating them together -VS- getting PHP to generate a number in the 1 - 999999 range and then zerofilling to ensure 6 digits.
Thanks for any tips. Sorry about the wall of text.
I have run into this very problem in my professional life. We used timestamp + random number and ran into serious issues when our applications scaled up (more clients, more servers, more requests). Granted, we (stupidly) used only 4 digits and then changed to 6, but you would be surprised how often the errors still happened.
Over a long enough period of time, you are guaranteed to get duplicate key errors. Our application is mission critical, and therefore even the smallest chance it could fail due to inherently random behavior was unacceptable. We started using UUIDs to avoid this issue, and carefully managed their creation.
Using UUIDs, your index size will increase, and a larger index will result in poorer performance (perhaps unnoticeable, but poorer nonetheless). However, MySQL can generate UUIDs natively with the UUID() function, and a UUID stored in a binary column (never use VARCHAR as a primary key!!) can be indexed and searched pretty damn efficiently, even compared to BIGINT. The biggest performance hit to your index is almost always the number of rows indexed, rather than the size of the item being indexed (unless you want to index a LONGTEXT or something ridiculous like that).
To answer your question: BIGINT (with random numbers attached) will be OK if you do not plan on scaling your application/service significantly. If your code can handle the change without much alteration and your application will not explode if a duplicate key error occurs, go with it. Otherwise, bite the bullet and go for the more substantial option.
You can always implement a larger change later, like switching to an entirely different backend (which we are now facing... :P)
You can manually change the autonumber starting number.
ALTER TABLE foo AUTO_INCREMENT = ####
An unsigned INT can store up to 4,294,967,295; let's round it down to 4,290,000,000.
Use the first 3 digits for the server serial number, and the final 7 digits for the row id.
This gives you up to 430 servers (including 000), and up to 10 million IDs for each server.
So for server #172 you manually change the autonumber to start at 1,720,000,000, then let it assign IDs sequentially.
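Concretely, for server #172:

ALTER TABLE foo AUTO_INCREMENT = 1720000000;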
If you think you might have more servers, but less IDs per server, then adjust it to 4 digits per server and 6 for the ID (i.e. up to 1 million IDs).
You can also split the number using binary digits instead of decimal digits (perhaps 10 binary digits per server, and 22 for the ID. So, for example, server 76 starts at 2^22*76 = 318,767,104 and ends at 322,961,407).
For that matter you don't even need a clear split. Take 4,294,967,295 divide it by the maximum number of servers you think you will ever have, and that's your spacing.
You could use a bigint if you think you need more identifiers, but that's a seriously huge number.
Use the GUID as a unique index, but also calculate a 64-bit (BIGINT) hash of the GUID, store that in a separate NOT UNIQUE column, and index it. To retrieve, query for a match to both columns - the 64-bit index should make this efficient.
What's good about this is that the hash:
a. Doesn't have to be unique.
b. Is likely to be well-distributed.
The cost: extra 8-byte column and its index.
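A minimal sketch of that layout (table, column, and index names are made up); the application computes the 64-bit hash of the GUID and supplies both values at query time:

CREATE TABLE records (
  guid      BINARY(16) NOT NULL,
  guid_hash BIGINT NOT NULL,       -- 64-bit hash of guid, computed in the application
  UNIQUE KEY ux_guid (guid),
  KEY ix_guid_hash (guid_hash)
);

-- the narrow BIGINT index finds the candidate rows; matching the full GUID confirms
SELECT * FROM records
WHERE guid_hash = 1234567890123456
  AND guid = UNHEX('0123456789ABCDEF0123456789ABCDEF');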
If you want to use the timestamp method then do this:
Give each server a number; to that, append the process ID of the application that is doing the insert (or the thread ID) (in PHP it's getmypid()); then append how long that process has been alive/active (in PHP it's getrusage()); and finally add a counter that starts at 0 at the start of each script invocation (i.e. each insert within the same script adds one to it).
Also, you don't need to store the full unix timestamp - most of those digits are for saying it's year 2011, and not year 1970. So if you can't get a number saying how long the process has been alive, then at least subtract a fixed timestamp representing today - that way you'll need far fewer digits.

Which algorithm does MySQL use to search for a row in the table?

If we give a query:
select name from employee where id=23102 and sir_name="raj";
I want to know using which algorithm this search will happen?
Assuming you have indexed the id field and it is unique.
The algorithm is a binary search (there are optimizations and improvements, but below is the general theory behind it).
Let's say you have the following ordered list of numbers:
1,45,87,111,405,568,620,945,1100,5000,5102,5238,5349,5520
Say you want to search for the number 5000. There are two ways:
1. Scan the entire list, in which case you will have to check 10 numbers (counting from the start until you reach 5000).
2. Binary search. Here are the steps:
2a. Go to the middle number (620). Since 5000 is bigger than that ->
2b. Do the same on the numbers 945-5520; the median is 5102. Since 5000 is smaller than that ->
2c. Go to the median of the part 945-5000, which is 1100. Since 1100 is lower than 5000, go to the part after 1100.
2d. Found it!
That was 4 operations against 10. Binary search takes log2(n) steps while a full scan takes n, so binary search's cost only grows at the full scan's rate if the data grows exponentially.
Indexes are stored as B-trees in MySQL. Indexes on spatial data types use R-trees, and MEMORY tables also support hash indexes.
You can use EXPLAIN and PROCEDURE ANALYSE to find out how your query is being run by MySQL.
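For the query from the question, that looks like this; assuming id is an indexed unique column, the type column of the EXPLAIN output should show const (a single index lookup) rather than ALL (a full table scan):

EXPLAIN SELECT name FROM employee WHERE id = 23102 AND sir_name = 'raj';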
If you want to know what kind of algorithm it uses internally to find the result set, I'd suggest you read up on how DBMSs work.

Storing large prime numbers in a database

This problem struck me as a bit odd. I'm curious how you could represent a list of prime numbers in a database. I do not know of a single datatype that would be able to accurately and consistently store a large amount of prime numbers. My concern is that when the prime numbers start to contain 1000s of digits, it might be a bit difficult to reference them from the database. Is there a way to represent a large set of primes in a DB? I'm quite sure that this topic has been approached before.
One of the issues about this that makes it difficult is that prime numbers can not be broken down into factors. If they could this problem would be much easier.
If you really want to store primes as numbers, and one of the things stopping you is that "prime numbers can not be broken down into factors", there is another way: store the number as a list of remainders modulo some base, ordered by position.
Small example:
2831781 == 2*100^3 + 83*100^2 + 17*100^1 + 81*100^0
List is:
81, 17, 83, 2
In a real application it is useful to split by a modulus of 2^32 (32-bit integers), especially if the processing application stores the prime numbers as byte arrays.
Storage in DB:
create table PRIMES
(
PRIME_ID NUMBER not null,
PART_ORDER NUMBER(20) not null,
PRIME_PART_VALUE NUMBER not null
);
alter table PRIMES
add constraint PRIMES_PK primary key (PRIME_ID, PART_ORDER) using index;
Inserts for the example above (the id 1647 is for example only):
insert into primes(PRIME_ID, PART_ORDER, PRIME_PART_VALUE) values (1647, 0, 81);
insert into primes(PRIME_ID, PART_ORDER, PRIME_PART_VALUE) values (1647, 1, 17);
insert into primes(PRIME_ID, PART_ORDER, PRIME_PART_VALUE) values (1647, 2, 83);
insert into primes(PRIME_ID, PART_ORDER, PRIME_PART_VALUE) values (1647, 3, 2);
The prime_id value can be assigned from an Oracle sequence:
create sequence seq_primes start with 1 increment by 1;
Get ID of next prime number to insert:
select seq_primes.nextval from dual;
Select the stored parts of the prime number with a specified id:
select PART_ORDER, PRIME_PART_VALUE
from primes where prime_id = 1647
order by part_order
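As a sanity check, when the reassembled value still fits in a native numeric type, the parts can be recombined in SQL (for genuinely large primes this reassembly has to happen in the application instead):

select sum(PRIME_PART_VALUE * power(100, PART_ORDER)) as prime_value
from primes
where prime_id = 1647;
-- 81*1 + 17*100 + 83*10000 + 2*1000000 = 2831781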
You could store them as binary data. They won't be human readable straight from the database, but that shouldn't be a problem.
Databases (depending on which) can routinely store numbers up to 38-39 digits accurately. That gets you reasonably far.
Beyond that you won't be doing arithmetic operations on them (accurately) in databases (barring arbitrary-precision modules that may exist for your particular database). But numbers can be stored as text up to several thousand digits. Beyond that you can use CLOB type fields to store millions of digits.
Also, it's worth noting that if you're storing sequences of prime numbers and your interest is in space-compression of that sequence, you could start by storing the difference between one number and the next rather than the whole number.
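A minimal sketch of that delta encoding (table and column names are made up); gaps between consecutive primes stay tiny relative to the primes themselves, so each row stores a small number:

create table PRIME_GAPS
(
  SEQ_NO NUMBER not null primary key,  -- position in the sequence
  GAP    NUMBER not null               -- difference from the previous prime; row 1 holds the first prime itself
);
-- primes 2, 3, 5, 7, 11 become rows (1,2), (2,1), (3,2), (4,2), (5,4)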
This is a bit inefficient, but you could store them as strings.
If you are not going to use database-side calculations with these numbers, just store them as bit sequences of their binary representation (BLOB, VARBINARY etc.)
Here's my 2 cents worth. If you want to store them as numbers in a database then you'll be constrained by the maximum size of integer that your database can handle. You'd probably want a 2-column table, with the prime number in one column and its sequence number in the other. Then you'd want some indexes to make finding the stored values quick.
But you don't really want to do that, do you; you want to store humongous (sp?) primes way beyond any integer datatype you've even thought of yet. And you say that you are averse to strings so it's binary data for you. (It would be for me too.) Yes, you could store them in a BLOB in a database but what sort of facilities will the DBMS offer you for finding the n-th prime or checking the primeness of a candidate integer?
How to design a suitable file structure ? This is the best I could come up with after about 5 minutes thinking:
1. Set a counter to 2.
2. Write the two bits which represent the first prime number.
3. Write them again, to mark the end of the section containing the 2-bit primes.
4. Set the counter to counter+1.
5. Write the 3-bit primes in order. (I think there are two: 5 and 7.)
6. Write the last of the 3-bit primes again to mark the end of the section containing the 3-bit primes.
7. Go back to 4 and carry on, mutatis mutandis.
The point about writing the last n-bit prime twice is to provide you with a means to identify the end of the part of the file with n-bit primes in it when you come to read the file.
As you write the file, you'll probably also want to make note of the offsets into it at various points, perhaps the start of each section containing n-bit primes.
I think this would work, and it would handle primes up to 2^(the largest unsigned integer you can represent). I guess it would be easy enough to find code for translating a 325467-bit (say) value into a big integer.
Sure, you could store this file as a BLOB but I'm not sure why you'd bother.
It all depends on what kinds of operations you want to do with the numbers. If just store and lookup, then just use strings and use a check constraint / domain datatype to enforce that they are numbers. If you want more control, then PostgreSQL will let you define custom datatypes and functions. You can for instance interface with the GMP library to have correct ordering and arithmetic for arbitrary precision integers. Using such a library will even let you implement a check constraint that uses the probabilistic primality test to check if the numbers really are prime.
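For instance, a minimal sketch of the store-and-lookup variant in PostgreSQL (names are made up); the domain's check constraint guarantees only digit strings get stored:

create domain prime_text as text
  check (VALUE ~ '^[0-9]+$');  -- digits only, arbitrary length

create table prime_list
(
  seq_no      integer primary key,   -- n-th prime
  prime_value prime_text not null
);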
The real question is actually whether a relational database is the correct tool for the job.
I think you're best off using a BLOB. How the data is stored in your BLOB depends on your intended use of the numbers. If you want to use them in calculations I think you'll need to create a class or type to store the values as some variety of ordered binary value and allow them to be treated as numbers, etc. If you just need to display them then storing them as a sequence of characters would be sufficient, and would eliminate the need to convert your calculatable values to something displayable, which can be very time consuming for large values.
Share and enjoy.
Probably not brilliant, but what if you stored them in some recursive data structure? You could store it as an int, its exponent, and a reference to the lower-bit numbers.
Like the string idea, it probably wouldn't be very good for memory considerations. And query time would be increased due to the recursive nature of the query.