Sorting and Indexing Rank as Bigint or Float - MySQL

Using MySQL. I have a table of users and need to store each user's rank in one of the columns. The application determines the rank, and it happens to be a signed float. Examples of ranks would be:
7732.5366604
7733.0252967
7740.8813879
-7736.5642667
-42.0223467
81121.3647382
So basically I always have at least seven places after the "." and the number is signed.
I need to create index on this field and ORDER BY it.
From what I understand, I have 2 choices:
Store the number as-is in the most appropriate MySQL type (FLOAT?)
Format the value to remove the dot (in the application layer) and store it as a signed BIGINT, so that -7736.5642667 would become -77365642667. If for some reason I did not have 7 places after the dot, I would append zeros, so that 234.34565 would become 234.3456500 and then 2343456500.
I'm wondering which of the two would be faster in terms of storing, indexing and sorting.
Thanks, and I appreciate any input.

I don't know the internals of how indexing an INT differs from indexing a FLOAT, but I would run a benchmark on a big data set. If there is no difference in performance, I would use FLOAT, because that is the natural shape of your numbers and you don't have to do any formatting.
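For illustration, here is a minimal sketch of both options (table and column names are made up). One caveat worth checking first: a 4-byte FLOAT only holds about 7 significant digits, so for values like 81121.3647382 you would need DOUBLE or DECIMAL to keep all the digits.

-- Option 1: store the value as-is; DECIMAL(14,7) keeps exact decimal precision.
create table user_ranks_decimal (
  user_id int primary key,
  rank_value decimal(14,7) not null,   -- signed, 7 digits after the dot
  key idx_rank (rank_value)
);

-- Option 2: scale by 10^7 and store a signed BIGINT,
-- e.g. -7736.5642667 becomes -77365642667.
create table user_ranks_bigint (
  user_id int primary key,
  rank_scaled bigint not null,
  key idx_rank (rank_scaled)
);

-- Both support an indexed ORDER BY:
select user_id from user_ranks_decimal order by rank_value desc limit 10;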

Related

Performance benefit in using correct data types

Is there any performance benefit in using the exact data types needed for a column? Or is it just storage optimisation?
For example, I'm creating a users table and I know for certain that there will only be 200 users in total. When I'm manipulating the data on the server, doing some select/update/insert/delete, is there any performance difference between using TINYINT UNSIGNED for the users_id column or just using INT?
The same applies to the user's name. I know, for now, that the longest name is 48 characters, but I can't rule out that a future user will be inserted with a name 65 characters long. Is there any performance benefit in reserving only the length needed for now, using VARCHAR(48), or can I avoid constantly checking the column's allowed length for each new user and just use VARCHAR(255)?
There is little advantage in either case.
For the number, you do gain a slight performance advantage. Typically, an integer is 4 bytes and a TINYINT is 1 byte, so if you have multiple smaller fields, your records will be smaller. Smaller records imply fewer data pages and ultimately slightly faster queries. This shows up when you start to have lots of records.
For the VARCHAR, you don't even have that advantage. Both VARCHAR(48) and VARCHAR(255) occupy the same amount of space for the same value (there is one additional length byte for columns that can hold more than 255 bytes). For this data type, the stored values determine the space used.
In other cases, it can make a big difference. In particular, storing dates as the native format is usually important, both to take advantage of date/time functions and to make better use of indexes.
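As a quick sketch of the date point (hypothetical table and column names): a native DATE column lets the optimizer do an indexed range scan, and the date functions apply directly without conversion.

create table events (
  event_id int primary key,
  event_date date not null,
  key idx_date (event_date)
);

-- Range scan on the index:
select event_id from events
where event_date between '2010-01-01' and '2010-12-31';

-- Date functions work directly on the native type:
select event_id, datediff(curdate(), event_date) as age_days from events;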

Roughly how much faster is int compared to bigint when indexing and used in ORDER BY?

First, I know there are many variables, but I just need something on which to base my decision between INT and BIGINT.
I have a rank field that is something like 9 to 10 digits long, so I either have to decrease the precision to ensure it's never bigger than 9 digits, or use BIGINT. My question is whether there is any information approximating the speed of reads/writes, indexing, and INT versus BIGINT in an ORDER BY clause.
I'm asking for any benchmarks you may have done, or in your own experience.
Edit: my ranking algorithm creates a rank that's a float and grows by the day; by 2020 it will be 7745.9570846. I convert this number to an integer by deciding how precise I want to be. For example, if I decide to be precise up to 5 digits after the dot, the converted number would be 774595708, which fits in an INT even though I lose some precision. I could have more precision by using BIGINT, and also ensure I won't run out of space shortly after 2020, but because I am not sure what the speed tradeoffs are, I'm not sure if I should.
INTs may be slightly faster than BIGINTs (though unlikely in a noticeable way) because there is less data to store, compare, and so on.
The bottom line is to use the data type that fits your data, not based on some premature optimization. Build your app first, then optimize if there is a problem.
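As a sketch of the scaling the question describes (values taken from the edit above; whether you do this in SQL or the application layer is up to you), the conversion is just multiply-and-truncate:

-- 5 digits after the dot: 7745.9570846 -> 774595708, which fits in a signed INT.
select cast(truncate(7745.9570846 * 100000, 0) as signed) as rank_int;

-- Keeping all 7 digits needs a BIGINT: 81121.3647382 -> 811213647382.
select cast(truncate(81121.3647382 * 10000000, 0) as signed) as rank_bigint;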

MySQL indexing phone numbers

I'm working with tons of phone numbers, and many are international.
I've changed my phone numbers table structure to have 5 columns:
`phonenumbers`.`phoneID`
`phonenumbers`.`countrycode`
`phonenumbers`.`areacode`
`phonenumbers`.`phonenumber`
`phonenumbers`.`ext`
At the moment the phoneID is the only column that's an INT, since it's the primary key.
Should I change the other columns to integers? I've heard indexes work best with numeric values, and I'm only storing numbers in each of the columns (no dashes, parentheses, spaces, etc.)
I'm still learning how MySQL works with indexes, so I'm curious how others work with searching for numbers. In this case, I'm sure I'll be searching for numbers that start with a certain known areacode and part of a known phonenumber, or an entire phonenumber.
The part that gets me with indexing table columns like phone numbers is that I don't always know how long a phone number will be, since countries have different lengths for area codes and phone numbers.
In summary, INT vs VARCHAR indexing with numbers.
Phone numbers are not integers, so don't store them as one, it'll just cause you trouble. The obvious cases are when you have to handle phone numbers too big to fit in an int, or phone numbers starting with a 0.
Moreover, as you want to do prefix matches (phonenumber like '800%'), MySQL will be able to use indexes if you're using varchar columns.
You have to figure out how you're querying this data. If you're frequently doing queries like where countrycode='1' and areacode='123' and phonenumber like '2%', you'd want a compound index on (countrycode, areacode, phonenumber), and if you're also often querying on only the phonenumber, you'd want an additional index on just the phonenumber column. But this is something you have to work out depending on the amount of data you have and the queries you run - work with EXPLAIN to learn how your indexes are used and where they are needed.
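A minimal sketch of that advice against the table from the question (index names are made up):

alter table phonenumbers
  add index idx_full (countrycode, areacode, phonenumber),
  add index idx_number (phonenumber);

-- Check that the compound index is actually chosen:
explain select phoneID from phonenumbers
where countrycode = '1' and areacode = '123' and phonenumber like '2%';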
Use varchar for representing phone numbers NOT integers. Otherwise you will find your design decision will come back to bite you.
Also: "I've heard indexes work best with numeric values" - well, that's not strictly accurate: yes the index will take up less space, and more rows will fit per page etc, but an index on a varchar column works perfectly well.
Worry about index size and performance when (1) you have a huge amount of data and (2) when you have measured a performance problem.
In my opinion you have a lot of attributes that you don't need. For phone numbers I usually use an auto-increment key for the id and store the phone number as a varchar; this makes validation easier from a programming language. It's just my opinion...
Use a BIGINT UNSIGNED simply because this forces you to normalize your data. Force your user to store the phone number at root level - that means at country level. You could store the country prefix in a separate column to ease usage.
Everybody types phone numbers in different ways, and this makes it almost impossible to search the data.
E.g. %020123456% will not match 02 0123456. Are you going to search all combinations, or just parse the input?
This I know from experience: we had to manually fix about 1,000 phone numbers that we could not script our way around when installing an auto-dialer.
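Whichever type you pick, the normalization point stands. A hedged sketch of cleaning up existing varchar data so searches behave predictably (extend the stripped characters as needed):

-- Strip spaces and dashes so '02 0123456' and '020123456' compare equal:
update phonenumbers
set phonenumber = replace(replace(phonenumber, ' ', ''), '-', '');

-- Prefix search on the normalized values:
select phoneID from phonenumbers where phonenumber like '020123456%';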

Best way to store 'extra' user data in MySQL?

I'm adding a new feature to my user module for my CMS and I've hit a road block... Or I guess, a fork in the road, and I wanted to get some opinions from stackoverflow before I commit to anything.
Basically I want to allow admins to add new, 'extra' user fields that users can fill out on registration, edit in their profile, and/or be controlled by other modules. An example of this would be a birthday field, a lengthy description of themselves, or maybe points the user has earned on the site. Needless to say, the data stored will be varied and can range from large amounts of text, to a small integer value. To make matters worse - I want there to be the option to search this data.
With that out of the way - what would be the best way to do this? Right now I'm leaning towards having a table with the following columns.
userid, refFieldID, varchar, tinyint, smallint, int, text, date, datetime, etc.
I would prefer this as it would make searching significantly faster, and the reference table (Which holds all of the field's data, such as the name of the field, whether it's searchable or not, etc.) can reference which column should be used when storing data for that field.
The other idea, which was suggested to me and I've seen used in other solutions (vBulletin being one, although I have seen others whose names escape me at the moment), is where you just have the userid, reference id, and a MEDIUMTEXT field. I don't know enough about MySQL to say this with any certainty, but this method seems like it would be slower to search, and possibly have a larger overhead.
So which method would be 'best'? Is there another method I'm missing? Whichever method I end up using, it needs to be fast to search, not massive (a tiny bit of overhead is fine), and preferably allow complex queries against the data.
I agree that a key-value table is probably the best solution. My first inclination would be to just store a text column, like vBulletin did. But, if you wanted to add the ability for the data store to be a bit more extensible and searchable like you've laid out, I might suggest:
1 medium/longtext or medium/longblob field for arbitrary text/binary storage (whatever is stored, plus 3-4 bytes of overhead for the string length). The only reason to choose medium over long is to limit what can be stored to 2^24 bytes (16.7 MB) versus 2^32 bytes (4 GB).
1 integer (4 bytes) or bigint (8 bytes)
1 datetime (8 bytes)
Perhaps 1 float or double (4-8 bytes) for floating point storage
These fields will allow you to store nearly any type of data in the table without inflating the width of the table** (like a varchar would) and without redundant storage (like having tinyint and mediumint etc). The text stored in the longtext field can still be reasonably searched using a fulltext index or a regular limited-length index (e.g. index longtext_storage(8)).
** all blob values, such as longtext, are stored independently from the main table.
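Put together, that layout might look like the following sketch (all names are hypothetical):

create table user_field_values (
  userid int not null,
  refFieldID int not null,                   -- points at the field-definition table
  text_value mediumtext null,                -- arbitrary text
  int_value bigint null,
  float_value double null,
  datetime_value datetime null,
  primary key (userid, refFieldID),
  key idx_int (refFieldID, int_value),
  key idx_text (refFieldID, text_value(8))   -- limited-length index, as described above
);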
One technique that might work for you is to store this arbitrary data as text, in some notation like JSON, XML, or YAML. This decision depends on how you'll need to access the data: if you only look up each user's full chunk of user data, it could be ideal. If you need to run SQL queries on specific fields in the user data, you'll need to use a pure SQL or a hybrid approach.
Many of the newer, highly scalable "NoSQL" systems seem to favor JSON data (e.g., MongoDB, CouchDB, and Project Voldemort). It's nice and terse, and you can create arbitrarily complex structures including maps (JSON objects) and lists (JSON arrays).
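A sketch of the serialized approach (column names are made up; MySQL 5.7+ also offers a native JSON column type, while on older versions a plain TEXT column stores the same payload):

create table user_extra (
  userid int primary key,
  extra_data text not null   -- e.g. '{"birthday":"1990-04-01","points":1200}'
);

The trade-off is exactly the one described above: the whole chunk comes back in one read, but individual fields inside it are not directly indexable by MySQL.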

Storing large prime numbers in a database

This problem struck me as a bit odd. I'm curious how you could represent a list of prime numbers in a database. I do not know of a single datatype that would be able to accurately and consistently store a large number of primes. My concern is that when the prime numbers start to contain thousands of digits, it might be a bit difficult to reference them from the database. Is there a way to represent a large set of primes in a DB? I'm quite sure this topic has been approached before.
One of the issues that makes this difficult is that prime numbers cannot be broken down into factors. If they could, this problem would be much easier.
If you really want to store primes as numbers, and one of the things stopping you is that "prime numbers can not be broken down into factors", there is another option: split the number into a list of remainders modulo some base, ordered by position (i.e., its digits in that base).
Small example:
2831781 == 2*100^3 + 83*100^2 + 17*100^1 + 81*100^0
List is:
81, 17, 83, 2
In a real application it is useful to split by a modulus of 2^32 (32-bit integers), especially if the processing application stores the prime numbers as byte arrays.
Storage in DB:
create table PRIMES
(
PRIME_ID NUMBER not null,
PART_ORDER NUMBER(20) not null,
PRIME_PART_VALUE NUMBER not null
);
alter table PRIMES
add constraint PRIMES_PK primary key (PRIME_ID, PART_ORDER) using index;
Inserts for the example above (the id 1647 is for example only):
insert into primes(PRIME_ID, PART_ORDER, PRIME_PART_VALUE) values (1647, 0, 81);
insert into primes(PRIME_ID, PART_ORDER, PRIME_PART_VALUE) values (1647, 1, 17);
insert into primes(PRIME_ID, PART_ORDER, PRIME_PART_VALUE) values (1647, 2, 83);
insert into primes(PRIME_ID, PART_ORDER, PRIME_PART_VALUE) values (1647, 3, 2);
The prime_id value can be assigned from an Oracle sequence:
create sequence seq_primes start with 1 increment by 1;
Get ID of next prime number to insert:
select seq_primes.nextval from dual;
Select the prime number content for a specified id:
select PART_ORDER, PRIME_PART_VALUE
from primes where prime_id = 1647
order by part_order
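To sanity-check the decomposition, the original value can be rebuilt from its parts (this works in both Oracle and MySQL):
select sum(PRIME_PART_VALUE * power(100, PART_ORDER)) as reconstructed
from primes
where prime_id = 1647;
-- 81 + 17*100 + 83*100^2 + 2*100^3 = 2831781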
You could store them as binary data. They won't be human readable straight from the database, but that shouldn't be a problem.
Databases (depending on which) can routinely store numbers up to 38-39 digits accurately. That gets you reasonably far.
Beyond that you won't be doing arithmetic operations on them (accurately) in databases (barring arbitrary-precision modules that may exist for your particular database). But numbers can be stored as text up to several thousand digits. Beyond that you can use CLOB type fields to store millions of digits.
Also, it's worth noting that if you're storing sequences of prime numbers and your interest is in space-compression of that sequence, you could start by storing the difference between one number and the next rather than the whole number.
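A sketch of that gap-based storage (names are made up; the running-sum query assumes MySQL 8+ for window functions):

create table prime_gaps (
  seq int primary key,   -- position in the sequence (1 = the prime 2)
  gap int not null       -- difference from the previous prime; the first row stores 2 itself
);

insert into prime_gaps values (1, 2), (2, 1), (3, 2), (4, 2), (5, 4);  -- 2, 3, 5, 7, 11

-- Reconstruct the primes with a running sum over the gaps:
select seq, sum(gap) over (order by seq) as prime
from prime_gaps;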
This is a bit inefficient, but you could store them as strings.
If you are not going to use database-side calculations with these numbers, just store them as bit sequences of their binary representation (BLOB, VARBINARY etc.)
Here's my 2 cents' worth. If you want to store them as numbers in a database, then you'll be constrained by the maximum size of integer your database can handle. You'd probably want a 2-column table, with the prime number in one column and its sequence number in the other. Then you'd want some indexes to make finding the stored values quick.
But you don't really want to do that, do you; you want to store humongous primes way beyond any integer datatype you've even thought of yet. And you say that you are averse to strings, so it's binary data for you. (It would be for me too.) Yes, you could store them in a BLOB in a database, but what sort of facilities will the DBMS offer you for finding the n-th prime or checking the primality of a candidate integer?
How to design a suitable file structure? This is the best I could come up with after about 5 minutes' thinking:
1. Set a counter to 2.
2. Write the two bits which represent the first prime number.
3. Write them again, to mark the end of the section containing the 2-bit primes.
4. Set the counter to counter+1.
5. Write the 3-bit primes in order. (I think there are two: 5 and 7.)
6. Write the last of the 3-bit primes again to mark the end of the section containing the 3-bit primes.
7. Go back to step 4 and carry on, mutatis mutandis.
The point about writing the last n-bit prime twice is to provide you with a means to identify the end of the part of the file with n-bit primes in it when you come to read the file.
As you write the file, you'll probably also want to make note of the offsets into the files at various points, perhaps the start of each section containing n-bit primes.
I think this would work, and it would handle primes up to 2^(the largest unsigned integer you can represent). I guess it would be easy enough to find code for translating a 325467-bit (say) value into a big integer.
Sure, you could store this file as a BLOB but I'm not sure why you'd bother.
It all depends on what kinds of operations you want to do with the numbers. If you just store and look them up, then just use strings, with a check constraint / domain datatype to enforce that they are numbers. If you want more control, then PostgreSQL will let you define custom datatypes and functions. You can, for instance, interface with the GMP library to get correct ordering and arithmetic for arbitrary-precision integers. Using such a library will even let you implement a check constraint that uses a probabilistic primality test to check that the numbers really are prime.
The real question is actually whether a relational database is the correct tool for the job.
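A minimal sketch of the plain-strings variant in PostgreSQL (all names are hypothetical; the GMP-backed custom type is beyond a few lines):

-- A domain that only admits digit strings:
create domain digit_text as text
  check (value ~ '^[0-9]+$');

create table primes (
  seq integer primary key,   -- position in the sequence
  val digit_text not null    -- the prime itself, arbitrarily long
);

Note that ordering on val is lexicographic, not numeric, unless you also store a length column or pad the strings.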
I think you're best off using a BLOB. How the data is stored in your BLOB depends on your intended use of the numbers. If you want to use them in calculations, I think you'll need to create a class or type to store the values as some variety of ordered binary value and allow them to be treated as numbers, etc. If you just need to display them, then storing them as a sequence of characters would be sufficient, and would eliminate the need to convert your calculable values into something displayable, which can be very time-consuming for large values.
Share and enjoy.
Probably not brilliant, but what if you stored them in some recursive data structure? You could store each as an int, its exponent, and a reference to the lower-bit numbers.
Like the string idea, it probably wouldn't be very good memory-wise, and query time would increase due to the recursive nature of the query.