I have several questions in mind,
1) Is searching on an int key faster than searching on a string key?
The relevance of my second question depends entirely on the answer to the first.
If yes,
2) I have a table with 1-5 billion records and a unique email column. I am planning to add one more column that stores a hashcode (int) of the email (string). Whenever I want the record for a given email, I will search for records with the email's hashcode and then match the exact email.
How effective will the second approach be? Please suggest any better alternative.
A CPU can compare two integers faster than two strings. An integer comparison is a single operation, whereas a string is stored in memory as a sequence of encoded characters (ASCII, for example), so the program must compare it character by character until it finds a difference or reaches the end. In MySQL, if the column is declared UNIQUE, the search will be very fast because the engine builds an index (a B-tree) on it and uses that to search for the key; keeping the VARCHAR reasonably short helps keep that index small. Without an index, the engine must compare every row's email to the search criteria. MySQL has advanced through the years, and there are lots of built-in mechanisms that can be leveraged to make database management extremely fast.
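The hash-column idea from the question can be sketched as follows. This is only an illustration under stated assumptions: it uses Python's built-in sqlite3 module in place of MySQL, CRC32 as the integer hash, and hypothetical names (`users`, `email_hash`, `find_user`) that do not come from the original post.

```python
import sqlite3
import zlib


def email_hash(email: str) -> int:
    # CRC32 gives an unsigned 32-bit integer; with billions of rows,
    # many emails will share a hash, so it can never be the only filter.
    return zlib.crc32(email.encode("utf-8"))


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users ("
    "id INTEGER PRIMARY KEY, "
    "email TEXT NOT NULL UNIQUE, "
    "email_hash INTEGER NOT NULL)"
)
conn.execute("CREATE INDEX idx_users_email_hash ON users (email_hash)")


def add_user(email: str) -> None:
    conn.execute(
        "INSERT INTO users (email, email_hash) VALUES (?, ?)",
        (email, email_hash(email)),
    )


def find_user(email: str):
    # Narrow by the integer hash, then confirm the exact email
    # to rule out hash collisions.
    return conn.execute(
        "SELECT id, email FROM users WHERE email_hash = ? AND email = ?",
        (email_hash(email), email),
    ).fetchone()


for address in ("alice@example.com", "bob@example.com"):
    add_user(address)

print(find_user("bob@example.com"))
```

Because the email column is already UNIQUE, it already has its own index, so the extra hash column mostly costs storage and a second index. Whether it pays off is something to measure on real data rather than assume.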
Related
I was just wondering about the efficiency of storing a large amount of boolean values inside of a CHAR or VARCHAR
data
"TFTFTTF"
vs
isFoo isBar isText
false true false
Would it be worth switching to storing these values in this manner, even if performance is worse? I figured it would be easier to set a single value rather than having all of those other fields.
thanks
Don't do it. MySQL offers types such as char(1) and tinyint that occupy the same space as a single character. In addition, MySQL offers enumerated types, if you want your flags to have more than one value -- and for the values to be recognizable.
That last point is the critical point. You want your code to make sense. The string 'FTF' does not make sense. The columns isFoo, isBar, and isText do make sense.
There is no need to obfuscate your data model.
This would be a bad idea. Not only does it offer no advantage in terms of space used, it also hurts query performance and the comprehensibility of your data model.
Disk Space
In terms of storage usage, it makes no real difference whether the data is stored in a single varchar(n) or char(n) column or in multiple tinyint, char(1), or bit(1) columns. Only varchar needs an extra 1 to 2 bytes of disk space per entry, for its length prefix.
For more information about the storage requirements of the different data types, see the MySQL documentation.
Query Performance
If boolean values were stored in a VarChar, the search for all entries where a specific value is True would take much longer, since string operations would be necessary to find the correct entries. Even when searching for a combination of Boolean values such as "TFTFTFTFTT", the query would still take longer than if the boolean values were stored in individual columns. Furthermore you can assign indexes to single columns like isFoo or isBar, which has a great positive effect on query performance.
Data Model
A data model should be as comprehensible as possible and if possible independent of any kind of implementation considerations.
Realistically, a database field should only contain one atomic value, that is to say: a value that can't be subdivided into separate parts.
Columns that do not contain atomic values:
cannot be sorted
cannot be grouped
cannot be indexed
So let's say you want to find all rows where isFoo is true. You wouldn't be able to do it without string operations like "find the character at isFoo's position in this string and see if it's equal to 'T'". This would imply a full table scan with every query, which would degrade performance quite dramatically.
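The difference between the two layouts can be seen in a small sketch. It assumes Python's built-in sqlite3 module as a stand-in for MySQL, and the table names `packed` and `split` are made up for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Layout 1: all flags packed into one string such as "FTF".
conn.execute("CREATE TABLE packed (id INTEGER PRIMARY KEY, flags TEXT NOT NULL)")
# Layout 2: one column per flag, which can be indexed individually.
conn.execute(
    "CREATE TABLE split ("
    "id INTEGER PRIMARY KEY, "
    "isFoo INTEGER NOT NULL, "
    "isBar INTEGER NOT NULL, "
    "isText INTEGER NOT NULL)"
)
conn.execute("CREATE INDEX idx_split_isbar ON split (isBar)")

rows = [(1, False, True, False), (2, True, True, False), (3, False, False, True)]
for row_id, foo, bar, text in rows:
    flags = "".join("T" if value else "F" for value in (foo, bar, text))
    conn.execute("INSERT INTO packed VALUES (?, ?)", (row_id, flags))
    conn.execute("INSERT INTO split VALUES (?, ?, ?, ?)", (row_id, foo, bar, text))

# Packed layout: a string function must run on every row to test one flag.
packed_ids = [
    r[0]
    for r in conn.execute(
        "SELECT id FROM packed WHERE substr(flags, 2, 1) = 'T' ORDER BY id"
    )
]
# Split layout: a plain equality test that the index on isBar can serve.
split_ids = [
    r[0] for r in conn.execute("SELECT id FROM split WHERE isBar = 1 ORDER BY id")
]
print(packed_ids, split_ids)
```

Both queries return the same rows, but only the second one names the flag it is testing and can use an index to find it.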
It depends on what you want to do after storing the data in this format.
After retrieving a record you will have to do further processing on the server side, which worsens performance if you want to load data by checking specific conditions. The logic on the server would also become complex.
The columns isFoo, isBar, and isText would help you to write queries better.
I have a website which stores links like website.com/picture?id=12345
I'm considering obfuscating the numeric ID and converting it into something like "Af3Gh2" so that people find it harder to iterate over and scrape all the links.
Would a query like select * from table where row_id=12345 be faster to compute than select ... where row_id="Af3Gh2"?
The row_id column is indexed already
Here is the performance ranking for primary keys, from fastest to slowest: auto-increment integer > random integer > random char > random varchar.
There is enough material regarding why this is so. In short: Data on disk is spread in order of primary key (aka clustering). Hence random is slower than sequential. With sequential indexing when you insert a record, on disk it goes after the last record. But with random indexing, each insert needs to wedge-in between two records. On disk things take time to actually move around.
char fields are faster than varchar because chars can be read as is. To read varchar data you need to (1) read the first byte to get the actual length, then (2) read a number of chars equal to the now-known length.
Character types (char/varchar) are slower than integers because an integer-to-integer comparison is easy. To compare two character values, they must be put into lexical (dictionary) order, which is mostly done by matching both strings character by character. Thus slow.
I have a table that has, among its primary keys, a VARCHAR(16) column that always contains 16 characters. I'm currently searching for various substrings at specific positions within this column using "LIKE CONCAT('_______________', ?)", "LIKE CONCAT('______________', ?, '_')", etc. That is a 1-char example, but the substring is not necessarily always one char. The parameter ? varies with each query, and there are often many of these LIKEs ORed together. While automatically generating that query is no big deal, it still isn't fast enough. I was considering splitting the column into 16 VARCHAR(1) columns and doing = ? queries, as they appear to go much faster in simple tests, but this is getting ridiculous.
Is there any way to make mysql index a certain string column by every character in it? Because that is basically what I need. Or is the best way to do it separating it all up into 1 char fields?
Is there any way to make mysql index a certain string column by every character in it?
Some databases support functional indexes, which would allow you to do this. Unfortunately, MySQL has not traditionally been one of them (functional key parts only arrived in MySQL 8.0.13).
Or is the best way to do it separating it all up into 1 char fields?
I'd go with this. You may also want to consider denormalizing and storing both representations if you also want to be able to perform a lookup on the entire key.
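A minimal sketch of that split, keeping the full key alongside the per-character columns. It assumes Python's built-in sqlite3 module in place of MySQL, and the names `items`, `full_key`, and `c1` to `c16` are hypothetical.

```python
import sqlite3

KEY_LENGTH = 16
columns = [f"c{i}" for i in range(1, KEY_LENGTH + 1)]

conn = sqlite3.connect(":memory:")
column_defs = ", ".join(f"{name} TEXT NOT NULL" for name in columns)
# full_key is kept as well (the denormalized copy) for whole-key lookups.
conn.execute(f"CREATE TABLE items (full_key TEXT PRIMARY KEY, {column_defs})")
for name in columns:
    conn.execute(f"CREATE INDEX idx_items_{name} ON items ({name})")


def insert_key(key: str) -> None:
    if len(key) != KEY_LENGTH:
        raise ValueError("key must be exactly 16 characters")
    placeholders = ", ".join("?" for _ in range(KEY_LENGTH + 1))
    # One value for full_key, then one value per character column.
    conn.execute(f"INSERT INTO items VALUES ({placeholders})", (key, *key))


def find_by_substring(position: int, fragment: str) -> list:
    # position is 1-based; each character of the fragment becomes
    # an equality test on its own column.
    conditions = " AND ".join(
        f"c{position + offset} = ?" for offset in range(len(fragment))
    )
    sql = f"SELECT full_key FROM items WHERE {conditions} ORDER BY full_key"
    return [row[0] for row in conn.execute(sql, tuple(fragment))]


for key in ("ABCDEFGHIJKLMNOP", "ABCDEFGHIJKLMNOX", "ZZCD" + "Z" * 11 + "P"):
    insert_key(key)

print(find_by_substring(16, "P"))
```

Sixteen single-column indexes are shown only for simplicity. Which columns actually deserve an index depends on which positions the real queries hit most often.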
tl;dr: Is assigning rows IDs of {unixtimestamp}{randomdigits} (such as 1308022796123456) as a BIGINT a good idea if I don't want to deal with UUIDs?
Just wondering if anyone has some insight into any performance or other technical considerations / limitations in regards to IDs / PRIMARY KEYs assigned to database records across multiple servers.
My PHP+MySQL application runs on multiple servers, and the data needs to be able to be merged. So I've outgrown the standard sequential / auto_increment integer method of identifying rows.
My research into a solution brought me to the concept of using UUIDs / GUIDs. However the need to alter my code to deal with converting UUID strings to binary values in MySQL seems like a bit of a pain/work. I don't want to store the UUIDs as VARCHAR for storage and performance reasons.
Another possible annoyance of UUIDs stored in a binary column is the fact that rows IDs aren't obvious when looking at the data in PhpMyAdmin - I could be wrong about this though - but straight numbers seem a lot simpler overall anyway and are universal across any kind of database system with no conversion required.
As a middle ground I came up with the idea of making my ID columns a BIGINT, and assigning IDs using the current unix timestamp followed by 6 random digits. So let's say my random number came out as 123456; my generated ID today would be: 1308022796123456
A one in a million chance of a conflict between two rows created within the same second (6 random digits give 1,000,000 possible suffixes) is fine with me. I'm not doing any sort of mass row creation quickly.
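The scheme described above can be sketched like this. The application in question is PHP, so this Python version is only an illustration, and `make_id` and `RANDOM_DIGITS` are names invented for it.

```python
import secrets
import time

RANDOM_DIGITS = 6


def make_id(now=None) -> int:
    # Unix timestamp in whole seconds, followed by a zero-filled random suffix.
    seconds = int(time.time()) if now is None else int(now)
    # randbelow(10**6) draws from 0..999999, so the suffix 000000 is possible.
    suffix = secrets.randbelow(10 ** RANDOM_DIGITS)
    return seconds * 10 ** RANDOM_DIGITS + suffix


example = make_id(now=1308022796)
print(example)
```

Drawing one number from 0 to 999999 and zero-filling it covers the same million suffixes as concatenating six independent random digits. Drawing from 1 to 999999 instead would leave out 000000 and so give one fewer possible suffix.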
One issue I've read about with randomly generated UUIDs is that they're bad for indexes, as the values are not sequential (they're spread out all over the place). The UUID() function in MySQL addresses this by generating the first part of the UUID from the current timestamp. Therefore I've copied that idea of having the unix timestamp at the start of my BIGINT. Will my indexes be slow?
Pros of my BIGINT idea:
Gives me the multi-server/merging advantages of UUIDs
Requires very little change to my application code (everything is already programmed to handle integers for IDs)
Half the storage of a UUID (8 bytes vs 16 bytes)
Cons:
??? - Please let me know if you can think of any.
Some follow up questions to go along with this:
Should I use more or less than 6 random digits at the end? Will it make a difference to index performance?
Is one of these methods any "randomer" ?: Getting PHP to generate 6 digits and concatenating them together -VS- getting PHP to generate a number in the 1 - 999999 range and then zerofilling to ensure 6 digits.
Thanks for any tips. Sorry about the wall of text.
I have run into this very problem in my professional life. We used timestamp + random number and ran into serious issues when our applications scaled up (more clients, more servers, more requests). Granted, we (stupidly) used only 4 digits, and then changed to 6, but you would be surprised how often the errors still happen.
Over a long enough period of time, you are guaranteed to get duplicate key errors. Our application is mission critical, and therefore even the smallest chance it could fail due to inherently random behavior was unacceptable. We started using UUIDs to avoid this issue, and carefully managed their creation.
Using UUIDs, your index size will increase, and a larger index will result in poorer performance (perhaps unnoticeable, but poorer nonetheless). However, MySQL can store a UUID compactly as BINARY(16) (never use varchar as a primary key!!), and can handle indexing, searching, etc. pretty damn efficiently even compared to bigint. The biggest performance hit to your index is almost always the number of rows indexed, rather than the size of the item being indexed (unless you want to index on a longtext or something ridiculous like that).
To answer your question: bigint (with random numbers attached) will be OK if you do not plan on scaling your application/service significantly. If your code can handle the change without much alteration and your application will not explode if a duplicate key error occurs, go with it. Otherwise, bite the bullet and go for the more substantial option.
You can always implement a larger change later, like switching to an entirely different backend (which we are now facing... :P)
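To put rough numbers on that risk, here is a small birthday-problem calculation. It is a sketch that assumes every suffix is drawn uniformly and independently, and the function name is made up for the illustration.

```python
def collision_probability(rows_in_same_second: int, suffix_space: int = 10 ** 6) -> float:
    """Chance that at least two rows created in one second share a suffix."""
    if rows_in_same_second > suffix_space:
        return 1.0
    p_all_distinct = 1.0
    for already_taken in range(rows_in_same_second):
        # The next row must avoid every suffix used so far in this second.
        p_all_distinct *= (suffix_space - already_taken) / suffix_space
    return 1.0 - p_all_distinct


for rows in (2, 10, 100, 1200):
    print(rows, collision_probability(rows))
```

With 6 digits, two rows in the same second collide about once in a million, but the probability grows roughly with the square of the number of rows, and it passes one half at around 1,200 rows in a single second. Passing `suffix_space=10 ** 4` shows why 4 digits failed much sooner.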
You can manually change the autonumber starting number.
ALTER TABLE foo AUTO_INCREMENT = ####
An unsigned int can store up to 4,294,967,295; let's round it down to 4,290,000,000.
Use the first 3 digits for the server serial number, and the final 7 digits for the row id.
This gives you up to 429 servers (000 through 428), and up to 10 million IDs for each server.
So for server #172 you manually change the autonumber to start at 1,720,000,000, then let it assign IDs sequentially.
If you think you might have more servers, but less IDs per server, then adjust it to 4 digits per server and 6 for the ID (i.e. up to 1 million IDs).
You can also split the number using binary digits instead of decimal digits (perhaps 10 binary digits per server, and 22 for the ID. So, for example, server 76 starts at 2^22*76 = 318,767,104 and ends at 322,961,407).
For that matter you don't even need a clear split. Take 4,294,967,295 divide it by the maximum number of servers you think you will ever have, and that's your spacing.
You could use a bigint if you think you need more identifiers, but that's a seriously huge number.
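The arithmetic behind these ranges can be checked with a short sketch. The function names are invented for the illustration, and the value each one returns is what you would plug into ALTER TABLE ... AUTO_INCREMENT on the corresponding server.

```python
UNSIGNED_INT_MAX = 2 ** 32 - 1  # 4,294,967,295


def decimal_start(server: int, id_digits: int = 7) -> int:
    # Server number in the leading decimal digits, row id in the trailing ones.
    return server * 10 ** id_digits


def binary_range(server: int, id_bits: int = 22) -> tuple:
    # Server number in the high bits, row id in the low id_bits bits.
    start = server << id_bits
    end = start + (1 << id_bits) - 1
    return start, end


def even_spacing(max_servers: int) -> int:
    # No digit or bit boundary at all: just divide the range evenly.
    return UNSIGNED_INT_MAX // max_servers


print(decimal_start(172))   # 1720000000
print(binary_range(76))     # (318767104, 322961407)
print(even_spacing(1000))   # 4294967
```

With the binary split, 10 bits for the server and 22 for the ID use the unsigned range exactly: server 1023 ends at 4,294,967,295.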
Use the GUID as a unique index, but also calculate a 64-bit (BIGINT) hash of the GUID, store that in a separate NOT UNIQUE column, and index it. To retrieve, query for a match to both columns - the 64-bit index should make this efficient.
What's good about this is that the hash:
a. Doesn't have to be unique.
b. Is likely to be well-distributed.
The cost: extra 8-byte column and its index.
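A minimal sketch of the GUID-plus-hash layout, under several assumptions: Python's built-in sqlite3 module stands in for MySQL, BLAKE2b truncated to 8 bytes is just one possible 64-bit hash, and the table and function names are hypothetical.

```python
import hashlib
import sqlite3
import uuid


def guid_hash64(guid: uuid.UUID) -> int:
    # Truncate BLAKE2b to 8 bytes and read it as a signed integer
    # so that it fits a signed 64-bit (BIGINT-style) column.
    digest = hashlib.blake2b(guid.bytes, digest_size=8).digest()
    return int.from_bytes(digest, "big", signed=True)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE records ("
    "guid BLOB NOT NULL UNIQUE, "
    "guid_hash INTEGER NOT NULL, "
    "payload TEXT)"
)
# The hash index is deliberately NOT unique.
conn.execute("CREATE INDEX idx_records_guid_hash ON records (guid_hash)")


def insert_record(guid: uuid.UUID, payload: str) -> None:
    conn.execute(
        "INSERT INTO records VALUES (?, ?, ?)",
        (guid.bytes, guid_hash64(guid), payload),
    )


def fetch_payload(guid: uuid.UUID):
    # Match on both columns: the hash narrows the search,
    # the GUID confirms the exact row.
    row = conn.execute(
        "SELECT payload FROM records WHERE guid_hash = ? AND guid = ?",
        (guid_hash64(guid), guid.bytes),
    ).fetchone()
    return row[0] if row else None


known = uuid.uuid4()
insert_record(known, "hello")
print(fetch_payload(known))
```

Because the query always checks the GUID as well, a hash collision only costs an extra row comparison and never returns the wrong record.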
If you want to use the timestamp method then do this:
Give each server a number, to that append the process ID of the application that is doing the insert (or the thread ID) (in PHP it's getmypid()), then to that append how long that process has been alive/active for (in PHP it's getrusage()), and finally add a counter that starts at 0 at the start of each script invocation (i.e. each insert within the same script adds one to it).
Also, you don't need to store the full unix timestamp - most of those digits are for saying it's year 2011, and not year 1970. So if you can't get a number saying how long the process was alive for, then at least subtract a fixed timestamp representing today - that way you'll need far fewer digits.
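The composite ID described above might look like the following sketch. It is written in Python rather than PHP, uses the fallback of a timestamp measured from a fixed recent epoch instead of process uptime, and every name and field width in it (`SERVER_NUMBER`, `CUSTOM_EPOCH`, the digit counts) is an assumption chosen for the example.

```python
import itertools
import os
import time

SERVER_NUMBER = 7          # hypothetical: assigned by hand to each server
CUSTOM_EPOCH = 1293840000  # 2011-01-01 00:00:00 UTC, subtracted to save digits
_counter = itertools.count()  # starts at 0 for each script invocation


def make_id(now=None) -> str:
    seconds = int(time.time() if now is None else now) - CUSTOM_EPOCH
    # server (3 digits) + process ID (7) + seconds since epoch (10) + counter (6)
    return (
        f"{SERVER_NUMBER:03d}"
        f"{os.getpid():07d}"
        f"{seconds:010d}"
        f"{next(_counter):06d}"
    )


print(make_id())
```

At these widths the result is 26 digits, which is longer than a BIGINT can hold, so a real scheme would need narrower fields or a different column type. That trade-off is not settled by the answer above.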
I have a feeling that searching domain names takes longer than usual in MySQL. The domain name column has a unique index, yet the query seems slow.
My question is: do I need to convert it to a binary form, say an MD5 hash or something?
Normally, keeping the domain names in a "VARCHAR" column with a UNIQUE index defined on that field is the simplest and most efficient way of managing your data.
Don't add complexity (like using binary mode or the "BLOB" data type) for the sake of one or two fields, as it will further degrade your MySQL performance.
Hope it helps.