MySQL - query by number or letter?

I need to store values in a "Yes or No" column named STATUS, and I'm considering two methods.
method 1 (use letter): set value Y/N then find all rows that have value Y in field STATUS by a query like:
SELECT * FROM post WHERE status="Y"
method 2 (use number): set value 1/0 then find all rows that have value 1 in field STATUS by a query like:
SELECT * FROM post WHERE status=1
Should I use method 1 or method 2? Which one is faster? Which one is better?

The two are essentially equivalent, so this becomes a question of which is better for your application.
If you are concerned about space, the smallest storage for one character is char(1), which uses 8 bits in a single-byte character set. With a number, you can use the bit or set types to pack multiple flags together. But this only makes a difference if you have lots of flags.
The store-it-as-a-number approach has a slight advantage, where you can count the "Yes" values by doing:
select sum(status)
(Of course, in MySQL, this is only a marginal improvement on sum(status = 'Y').)
The store-it-as-a-letter approach has a slight advantage if you decide to include "Maybe" or other values at some point in the future.
Finally, any difference in performance in different ways of representing these values is going to be very, very minimal. You would need a table with millions and millions of rows to start to notice a problem. So, use the mechanism that works best for your application and way of representing the value.
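As a rough illustration of the two counting styles mentioned above, here is a minimal sketch. It uses Python's built-in sqlite3 module with an in-memory database purely as a stand-in for MySQL (an assumption; engines differ in details), and the column names status_num and status_chr are made up for the example.

```python
import sqlite3

# In-memory SQLite database, used only as a stand-in for MySQL (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE post (id INTEGER PRIMARY KEY, status_num INTEGER, status_chr TEXT)"
)
conn.executemany(
    "INSERT INTO post (status_num, status_chr) VALUES (?, ?)",
    [(1, "Y"), (0, "N"), (1, "Y"), (1, "Y")],
)

# Numeric flag: summing the 1/0 values gives the number of "Yes" rows.
yes_by_number = conn.execute("SELECT SUM(status_num) FROM post").fetchone()[0]

# Letter flag: the comparison evaluates to 1 or 0 per row, so it can be summed too.
yes_by_letter = conn.execute("SELECT SUM(status_chr = 'Y') FROM post").fetchone()[0]

print(yes_by_number, yes_by_letter)
```

Both queries count the same rows; the numeric version just skips the comparison.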

The second one is faster, primarily because anything within quotes is just a string to SQL and has to be compared as one. It would be better to use non-string types in order to get better performance. I would suggest using method 2.

The fastest way would be:
SELECT * FROM post WHERE `status` = FIND_IN_SET(`status`,'y');
I think you should create the column as ENUM('n','y'). MySQL stores this type compactly, and it also ensures that only the allowed values can be stored in the field.
You can also make it more human-friendly with ENUM('no','yes') without affecting performance, because the strings 'no' and 'yes' are stored only once, in the ENUM definition. MySQL stores only the index of the value in each row.

I think method 1 is better if you are concerned about storage.
Storing an integer (1/0) in an INT column takes 4 bytes, whereas a single character takes only 1 byte, so it is better to use method 1 (although a TINYINT column would also take only 1 byte).
This may improve performance slightly.

Related

Difference in performance between two similar sql queries

What is the difference between doing:
SELECT * FROM table WHERE column IS NULL
or -
SELECT * FROM table WHERE column = 0
Is doing IS NULL significantly worse off than equating to a constant?
The use case comes up where I have something like:
SELECT * FROM users WHERE paying IS NULL
(or adding an additional column)
SELECT * FROM users WHERE is_paying = 0
If I understand your question correctly, you are asking about the relative benefits/problems with the two situations:
where is_paying = 0
where paying is null
Given that both are in the data table, I cannot think of why one would perform better than the other. I do think the first is clearer on what the query is doing, so that is the version I would prefer. But from a performance perspective, they should be the same.
Someone else mentioned -- and I'm sure you are aware -- that NULL and 0 are different beasts. They can also behave differently in the optimization of joins and other elements. But, for simple filtering, I would expect them to have the same performance.
Well, there is one technicality. The comparison to "0" is probably built into the CPU. The comparison to NULL is probably a bit operation that requires something like a mask, shift, and comparison -- which might take an iota of time longer. However, this performance difference is negligible when compared to the fact that you are reading the data from disk to begin with.
Comparing to NULL and comparing to zero are two different things. Zero is a known value, while NULL is UNKNOWN. Zero specifically means that the value was set to zero; NULL means that the value was not set, or was set to NULL.
You'll get entirely different results using these queries, it's not simply a matter of performance.
Suppose you have a variety of users. Some have non-zero values for the "paying" column, some have 0, and some don't have a value whatsoever. The last case is what "null" more or less represents.
As for performance, do you have an index on the "paying" column? If you only have a few hundred rows in the table, this is probably irrelevant. If you have many thousands of rows, you are basically telling the query to iterate over every row of the table unless you have some indexing in place. This is true regardless of whether you are searching for "paying = 0" or "paying is null".
But again, just to reemphasize, the two queries will give you completely different results.
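To make the difference in results concrete, here is a small sketch. It runs against an in-memory SQLite database through Python's sqlite3 module, as a stand-in for MySQL (an assumption), and the sample users are invented for the example.

```python
import sqlite3

# In-memory SQLite database standing in for MySQL (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, paying INTEGER)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [("ann", 5), ("bob", 0), ("cat", None), ("dan", None)],
)

def names(where_clause):
    """Return the sorted names matched by a trusted, hard-coded WHERE clause."""
    rows = conn.execute(f"SELECT name FROM users WHERE {where_clause}")
    return sorted(row[0] for row in rows)

zero_rows = names("paying = 0")            # only rows explicitly set to 0
null_rows = names("paying IS NULL")        # only rows with no value
equals_null_rows = names("paying = NULL")  # matches nothing: the comparison is never true

print(zero_rows, null_rows, equals_null_rows)
```

The two filters select disjoint sets of rows, so the choice between them is about meaning first and performance second.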
As far as I know comparing to NULL is as fast as comparing to 0, so you should choose based on:
Simplicity - use the option which makes your code simpler
Minimal size - use the option which makes your table smaller
In this case making the paying column NULL-able will probably be better.
You should also check out these questions:
NULL in MySQL (Performance & Storage)
MySQL: NULL vs “”

Getting an Unique Identifier without Inserting

I'm looking for the best way to basically get a unique number (guaranteed, not some random string or current time in milliseconds & of a reasonable length about 8 characters) using MySQL (or other ways suggestions welcome).
I just basically want to run some SELECT ... statement and have it always return a unique number without any inserting into the database. Just something that increments some stored value and returns the value and can handle a lot of requests concurrently, without heavy blocking of the application.
I know that I can make something with combinations of random numbers with higher bases (for shorter length), that could make it very unlikely that they overlap, but won't guarantee it.
It just feels like there should be some easy way to get this.
To clarify...
I need this number to be short as it will be part of a URL, and it is ok for the query to lock a row for a short period of time. What I was looking for is maybe some command that under the hood does something like this ...
LOCK VALUE
RETURN VALUE++
UNLOCK VALUE
Where the VALUE is stored in the database, a MySQL database maybe.
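The lock / increment / unlock idea can be sketched in a single process as follows. This is only an illustration of the pattern, not a MySQL feature: the Counter class and to_base36 helper are hypothetical names, the counter lives in memory rather than in a database, and it only guarantees uniqueness within one running process.

```python
import threading

ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyz"

def to_base36(n):
    """Encode a non-negative integer as a short base-36 string for use in a URL."""
    if n == 0:
        return "0"
    digits = []
    while n:
        n, rem = divmod(n, 36)
        digits.append(ALPHABET[rem])
    return "".join(reversed(digits))

class Counter:
    """Hypothetical in-process stand-in for the stored VALUE."""

    def __init__(self, start=0):
        self._value = start
        self._lock = threading.Lock()

    def next_id(self):
        with self._lock:        # LOCK VALUE
            self._value += 1    # increment VALUE
            return self._value  # return it; the lock is released on exit (UNLOCK VALUE)

counter = Counter()
ids = []

def worker():
    for _ in range(1000):
        ids.append(counter.next_id())

threads = [threading.Thread(target=worker) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(ids), len(set(ids)), to_base36(max(ids)))
```

Because the increment happens under the lock, no two callers ever receive the same number, and base 36 keeps the result short.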
You seek UUID().
http://dev.mysql.com/doc/refman/5.0/en/miscellaneous-functions.html#function_uuid
mysql> SELECT UUID();
-> '6ccd780c-baba-1026-9564-0040f4311e29'
It returns a 128-bit number formatted as a 36-character hexadecimal string. You can munge it as necessary.
Is the unique number to be associated with a particular row in a table? If not, why not call rand(): select rand(); The value returned is between zero and one, so scale as desired.
Great question.
Shortest answer - that is simply not possible according to your specifications.
Long answer - the closest approach to this is MySQL's UUID, but that is neither short nor sortable (i.e. a later UUID value is not guaranteed to be greater or smaller than an earlier one).
"To UUID or not to UUID?" is a nice article describing the pros and cons of their usage, and it also touches on some of the reasons why you can't have what you need.
I am not sure I understand exactly, maybe something like this:
SELECT ROUND(RAND() * 123456789) as id
The larger you make the number, the larger your id.
No guarantees about uniqueness of course, this is a quick hack after all and you should implement a check in code to handle the off chance a duplicate is inserted, but maybe this would serve your purpose?
Of course, there are many other approaches possible to do this.
You can easily use most any scripting language to generate this for you, php example here:
//Generates a 32 character identifier that is extremely difficult to predict.
$id = md5(uniqid(rand(), true));
Then use $id in your query or whatever you need your unique id in. In my opinion, the advantage of doing this in a scripting language when interacting with a DB is that it is easier to validate for application / usage purposes and act accordingly. For instance, in your example, whatever method you use, if you wanted to be 100% always sure of data integrity, you have to make sure there are no duplicates of that id elsewhere. This is easier to do in a script than in SQL.
Hope that helps my friend, good-luck!

MySQL: storing several boolean values in one column. One tinyint(4) -vs- several tinyint(4)

I need to store 5 boolean values in 1 table.
Each value could be stored as tinyint(4). So, there are 5 tinyint(4).
I'm thinking of putting 5 boolean values in one tinyint(4).
I believe, everybody knows even better than me, 5 bits could be saved in 1 byte with no problem:)
The first value could be stored as 0(false) or 1(true), the second as 0(false) or 2(true), the third as 0 or 4, fourth as 0 or 8, fifth as 0 or 16.
So, if we store the sum of those values in one tinyint(4), we know all 5 Boolean values exactly.
For example, a stored 21 = 16 + 4 + 1.
So, if 21 is stored, we know that:
Fifth=true
Fourth=false
Third=true
Second=false
First=true.
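The packing and unpacking described above can be sketched like this. The function and flag names are made up for the example; the first flag is assumed to live in the lowest bit, as in the 1/2/4/8/16 scheme.

```python
FLAG_NAMES = ["first", "second", "third", "fourth", "fifth"]

def pack_flags(flags):
    """Pack a list of booleans into one integer, first flag in bit 0 (value 1)."""
    return sum(1 << i for i, on in enumerate(flags) if on)

def unpack_flags(value, count=len(FLAG_NAMES)):
    """Recover the booleans by testing each bit of the stored sum with a mask."""
    return [bool(value & (1 << i)) for i in range(count)]

packed = pack_flags([True, False, True, False, True])  # 1 + 4 + 16
decoded = dict(zip(FLAG_NAMES, unpack_flags(21)))

print(packed, decoded)
```

The extraction is a single bitwise AND per flag, so the cost of decoding is small; the harder part is that the database can no longer filter on an individual flag without the same bit arithmetic.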
My question is:
Does it make sense to keep only 1 variable? We save db volume (bytes) and perhaps gain performance (4 fewer columns, though that is only 4 bytes, whereas the same table already has a varchar(1000)), but every time we have to "extract" the proper Boolean value from the "sum" using a PHP function, and that happens often (let's say whenever a user presses a button).
Does it make sense to store the Boolean values as a sum in 1 column, so that you have 7 columns instead of 11?
Clearly those values are not keys (the table has many more rows than just 2).
Thank you.
Don't do this. The one exception is when the value is a single "opaque" external data type, such as an Enum of flags. If the columns will ever be used in a query, or will ever be used outside of said "opaque" type, use discrete/separate fields. (Of the correct type, as jmucchiello and MarkR noted in their answers.)
Trying "for performance" here will just make you loathe databases -- and this one in particular -- when you have to "fix" it or work around the ugly scheme later. (If you have a performance problem you'll know it ... and know enough to run a performance analysis before asking.) Donald Knuth was right when he pointed out that 97% of stuff Just Doesn't Matter. So make it pretty and let the database do what it feels like doing.
Happy coding.
If I sound animated above, it's because I'm trying to help others avoid the same mistakes I've run into :-)
No, it does not make sense.
Either store each in its own column, OR use the MySQL-specific SET type, which internally uses bitfields, but is more human-readable.
Worrying about a few bytes per row is really a bad idea. It is a case of incredibly premature optimisation.
Why don't you use the BIT Type and let MySQL worry about optimizing for space?

Disadvantages of quoting integers in a Mysql query?

I am curious about the disadvantage of quoting integers in MYSQL queries
For example
SELECT col1,col2,col3 FROM table WHERE col1='3';
VS
SELECT col1,col2,col3 FROM table WHERE col1= 3;
If there is a performance cost, what is the size of it and why does it occur? Are there any other disadvantages other than performance?
Thanks
Andrew
Edit: The reason for this question
1. Because I want to learn the difference because I am curious
2. I am experimenting with a way of passing composite keys from my database around in my php code as pseudo-Id-keys (PIK). These PIKs are then used to target the record.
For example, given a primary key (AreaCode,Category,RecordDtm)
My PIK in the url would look like this:
index.php?action=hello&Id=20001,trvl,2010:10:10 17:10:45
And I would select this record like this:
$Id = $_GET['Id']; // equals 20001,trvl,2010:10:10 17:10:45 (taken from the URL query string)
$sql = "SELECT AreaCode,Category,RecordDtm,OtherColumns.... FROM table WHERE (AreaCode,Category,RecordDtm) = ({$Id})";
$mysqli->query($sql);
......and so on.
At this point the query won't work because of the datetime (which must be quoted), and it is open to SQL injection because I haven't escaped those values. Given that I won't always know how my PIKs are constructed, I would write a function that splits the Id PIK at the commas, cleans each part with real_escape_string and puts it back together with the values quoted. For example:
$Id = "'20001','trvl','2010:10:10 17:10:45'";
Of course, in this function that is breaking apart and cleaning the Id I could check whether each value is a number or not. If it is a number, don't quote it. If it is anything else, quote it.
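One alternative to quoting by hand is to split the PIK and let the database driver bind the parts as parameters, which sidesteps the quote-or-not decision. The sketch below only builds the query text and the parameter tuple; build_pik_query, the records table name, and the %s placeholder style are assumptions for illustration (mysqli prepared statements in PHP use ? instead), and it assumes no key part contains a comma.

```python
def build_pik_query(pik, key_columns, table="records"):
    """Split a comma-separated PIK and build a placeholder-based query.

    The values are returned separately so that the driver can bind them,
    instead of this code escaping and quoting them itself.
    """
    parts = [p.strip() for p in pik.split(",")]
    if len(parts) != len(key_columns):
        raise ValueError("PIK does not match the key definition")
    columns = ", ".join(key_columns)
    placeholders = ", ".join(["%s"] * len(parts))
    sql = f"SELECT * FROM {table} WHERE ({columns}) = ({placeholders})"
    return sql, tuple(parts)

sql, params = build_pik_query(
    "20001,trvl,2010:10:10 17:10:45",
    ["AreaCode", "Category", "RecordDtm"],
)
print(sql)
print(params)
```

The column names come from trusted code, not from the URL; only the values travel as parameters.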
The performance cost arises whenever MySQL needs to do a type conversion from whatever you give it to the datatype of the column. So with your query
SELECT col1,col2,col3 FROM table WHERE col1='3';
If col1 is not a string type, MySQL needs to convert '3' to that type. This type of query isn't really a big deal, as the performance overhead of that conversion is negligible.
However, the cost matters when you do the same thing while, say, joining 2 tables that have several million rows each. If the columns in the ON clause are not the same datatype, then MySQL will have to convert several million rows every single time you run your query, and that is where the performance overhead comes in.
Strings also have a different sort order from numbers.
Compare:
SELECT 312 < 41
(yields 0, because 312 numerically comes after 41)
to:
SELECT '312' < '41'
(yields 1, because '312' lexicographically comes before '41')
Depending on the way your query is built, using quotes might give wrong results or none at all.
Numbers should be used as such, so never use quotes unless you have a special reason to do so.
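The same ordering difference shows up in most languages, not just SQL. A minimal Python sketch of the two comparisons above, with a sort added to show how the effect spreads to ORDER BY-style operations:

```python
# Compared as numbers: 312 is larger than 41.
numeric = 312 < 41

# Compared as strings: character by character, and '3' sorts before '4'.
lexical = "312" < "41"

sorted_as_numbers = sorted([312, 41, 5])
sorted_as_strings = sorted(["312", "41", "5"])

print(numeric, lexical, sorted_as_numbers, sorted_as_strings)
```

The numeric sort gives 5, 41, 312, while the string sort gives '312', '41', '5'.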
I think there is no performance/size cost in the case you have mentioned. Even if there is, it is negligible and won't affect your application as such.
It gives the wrong impression about the data type for the column. As an outsider, I assume the column in question is CHAR/VARCHAR & choose operations accordingly.
Otherwise MySQL, like most other databases, will implicitly convert the value to whatever the column data type is. There's no performance issue with this that I'm aware of but there's a risk that supplying a value that requires explicit conversion (using CAST or CONVERT) will trigger an error.

How does a hash table work? Is it faster than "SELECT * from .."

Let's say, I have :
Key | Indexes | Key-values
----+---------+------------
001 | 100001  | Alex
002 | 100002  | Micheal
003 | 100003  | Daniel
Lets say, we want to search 001, how to do the fast searching process using hash table?
Isn't it the same as when we use "SELECT * from .. " in mysql? I read a lot, and they say that "SELECT *" searches from beginning to end, but a hash table does not. Why and how?
By using hash table, are we reducing the records we are searching? How?
Can anyone demonstrate how to insert and retrieve hash table process in mysql query code? e.g.,
SELECT * from table1 where hash_value="bla" ...
Another scenario:
If the indexes are like S0001, S0002, T0001, T0002, etc. In mysql i could use:
SELECT * from table WHERE value LIKE 'S%'
isn't it the same and faster?
A simple hash table works by keeping the items on several lists, instead of just one. It uses a very fast and repeatable (i.e. non-random) method to choose which list to keep each item on. So when it is time to find the item again, it repeats that method to discover which list to look in, and then does a normal (slow) linear search in that list.
By dividing the items up into 17 lists, the search becomes roughly 17 times faster, which is a good improvement.
Although of course this is only true if the lists are roughly the same length, so it is important to choose a good method of distributing the items between the lists.
In your example table, the first column is the key, the thing we need to find the item. And lets suppose we will maintain 17 lists. To insert something, we perform an operation on the key called hashing. This just turns the key into a number. It doesn't return a random number, because it must always return the same number for the same key. But at the same time, the numbers must be "spread out" widely.
Then we take the resulting number and use modulus to shrink it down to the size of our list:
Hash(key) % 17
This all happens extremely fast. Our lists are in an array, so:
_lists[Hash(key) % 17].Add(record);
And then later, to find the item using that key:
Record found = _lists[Hash(key) % 17].Find(key);
Note that each list can just be any container type, or a linked list class that you write by hand. When we execute a Find in that list, it works the slow way (examine the key of each record).
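Here is a minimal runnable sketch of the scheme just described, using 17 lists and Python's built-in hash() as the hashing function. The BucketTable class and its method names are invented for the example; a real hash table would also grow the number of lists as it fills up.

```python
class BucketTable:
    """A simple hash table: several short lists instead of one long one."""

    def __init__(self, bucket_count=17):
        self._lists = [[] for _ in range(bucket_count)]

    def _bucket(self, key):
        # Hash the key to a number, then use modulus to pick one of the lists.
        return self._lists[hash(key) % len(self._lists)]

    def add(self, key, record):
        self._bucket(key).append((key, record))

    def find(self, key):
        # Slow linear search, but only within the one list the key maps to.
        for stored_key, record in self._bucket(key):
            if stored_key == key:
                return record
        return None

table = BucketTable()
for key, name in [("001", "Alex"), ("002", "Micheal"), ("003", "Daniel")]:
    table.add(key, name)

print(table.find("001"), table.find("999"))
```

Inserting and finding both repeat the same hash-then-modulus step, which is what guarantees that a lookup examines the list the record was stored in.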
Do not worry about what MySQL is doing internally to locate records quickly. The job of a database is to do that sort of thing for you. Just run a SELECT [columns] FROM table WHERE [condition]; query and let the database generate a query plan for you. Note that you don't want to use SELECT *, since if you ever add a column to the table that will break all your old queries that relied on there being a certain number of columns in a certain order.
If you really want to know what's going on under the hood (it's good to know, but do not implement it yourself: that is the purpose of a database!), you need to know what indexes are and how they work. If a table has no index on the columns involved in the WHERE clause, then, as you say, the database will have to search through every row in the table to find the ones matching your condition. But if there is an index, the database will search the index to find the exact location of the rows you want, and jump directly to them. Indexes are usually implemented as B+-trees, a type of search tree that uses very few comparisons to locate a specific element. Searching a B-tree for a specific key is very fast. MySQL is also capable of using hash indexes, but these tend to be slower for database uses. Hash indexes usually only perform well on long keys (character strings especially), since they reduce the size of the key to a fixed hash size. For data types like integers and real numbers, which have a well-defined ordering and fixed length, the easy searchability of a B-tree usually provides better performance.
You might like to look at the chapters in the MySQL manual and PostgreSQL manual on indexing.
http://en.wikipedia.org/wiki/Hash_table
Hash tables may be used as in-memory data structures. Hash tables may also be adopted for use with persistent data structures; database indices sometimes use disk-based data structures based on hash tables, although balanced trees are more popular.
I guess you could use a hash function to get the ID you want to select from. Like
SELECT * FROM table WHERE value = hash_fn(whatever_input_you_build_your_hash_value_from)
Then you don't need to know the id of the row you want to select and can do an exact query, since the row will always have the same id: it is derived from the input you build the hash value from, and you can always recreate it through the hash function.
However, this isn't always true, depending on the size of the table and the maximum number of hash values (you often have "X mod hash-table-size" somewhere in your hash). To take care of this you should have a deterministic strategy that you use each time you get two values with the same id. You should check Wikipedia for more info on this strategy; it's called collision handling and should be mentioned in the same article as hash tables.
MySQL probably uses hashtables somewhere because of the O(1) feature norheim.se (up) mentioned.
Hash tables are great for locating entries at O(1) cost where the key (that is used for hashing) is already known. They are in widespread use both in collection libraries and in database engines. You should be able to find plenty of information about them on the internet. Why don't you start with Wikipedia or just do a Google search?
I don't know the details of mysql. If there is a structure in there called "hash table", that would probably be a kind of table that uses hashing for locating the keys. I'm sure someone else will tell you about that. =)
EDIT: (in response to comment)
Ok. I'll try to make a grossly simplified explanation: A hash table is a table where the entries are located based on a function of the key. For instance, say that you want to store info about a set of persons. If you store it in a plain unsorted array, you would need to iterate over the elements in sequence in order to find the entry you are looking for. On average, this will need N/2 comparisons.
If, instead, you put all entries at indexes based on the first character of the person's first name (A=0, B=1, C=2 etc), you will immediately be able to find the correct entry as long as you know the first name. This is the basic idea. You probably realize that some special handling (rehashing, or allowing lists of entries) is required in order to support multiple entries having the same first letter. If you have a well-dimensioned hash table, you should be able to get straight to the item you are searching for. This means approximately one comparison, with the disclaimer of the special handling I just mentioned.
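The first-letter idea can be sketched in a few lines. This is a toy illustration only: it assumes names start with an ASCII letter, the helper name first_letter_slot is made up, and collisions are handled by letting each slot hold a list of entries.

```python
def first_letter_slot(name):
    """Map A -> 0, B -> 1, C -> 2, ... using the first character of the name."""
    return ord(name[0].upper()) - ord("A")

# 26 slots, each a list so that names sharing a first letter can coexist.
slots = [[] for _ in range(26)]
for person in ["Alex", "Micheal", "Daniel", "Anna"]:
    slots[first_letter_slot(person)].append(person)

# Looking up "Daniel" jumps straight to slot 3 instead of scanning everything.
candidates = slots[first_letter_slot("Daniel")]
print(candidates, slots[0])
```

"Alex" and "Anna" land in the same slot, which is the collision case the special handling is for.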