I have a mysql table that gets populated from a flat file with 20 million rows a daily. I have one field 'app_value' that is a varchar(24) that is a mix of text and numbers. I want to run a batch task every day to normalize all the values that are equivalent of zeros. Checking the database I have seen at least the following zero equivalent values but I think there are others.
0
0.0
0.000000
My plan was to cast to decimal and check if that that cast was equal to 0. If so I will update the value to '0'. To test my theory I ran
SELECT d.id, d.pp_value, CAST('d.app_value' as DECIMAL(20,6))
It seemed to work okay on the zero equivalent numbers however I was not that surprised to see that when 'app_value' is a character it is also cast to 0.000000. Is there a better way to do this? I need to protect against null, blanks and characters. I also need to be concerned about efficiency as I have to do this against 20 million rows every day.
You could match against a regular expression:
WHERE d.app_value RLIKE '^0+(\\.0*)?$'
Of course, this would not be particularly efficient (as it will require a full table scan on every invocation: indexes are of no help). If at all possible, I'd suggest checking for zeroes when loading data into the table (either directly within LOAD DATA itself, or using triggers, or else through some external preprocessing).
Second way to do it is with an NOT IN subquery.
This may scale better vs the regex engine on the larger datasets.. Regex engine startup/matching is relatively costly for the CPU compared to normal string matching..
note this one is bit hacky because we "trust" on MySQL auto cast conversion
select
*
from
data
where
number not in (
select
number
from
data
where
number >= 'a'
) and number = 0
see demo http://sqlfiddle.com/#!2/7118e7/2
Related
I save very large numbers in my database (usually with 50+ digits) and then need to run queries like (id is the label of column where the numbers are saved):
WHERE id % 2 = 0
I tried to use varchar data type for this column and although no error is generated while running the query, the returned result is mathematically wrong (the returned ids are not even).
does MySQL convert varchar to int while running my query and so is overflow the reason of the mistake in results?
what is the best choice for saving such large numbers on which I can do arithmetic operation latter? if decimal is the best candidate then what if I need to save the numbers with 100 digits?
Your only choice is to cast to a numeric/decimal value explicitly. In MySQL, that supports up to 65 digits of precision.
Here is a db<>fiddle with an example. Or an example using %.
I need to set values to a "Yes or No" column name STATUS. And I'm thinking about 2 methods.
method 1 (use letter): set value Y/N then find all rows that have value Y in field STATUS by a query like:
SELECT * FROM post WHERE status="Y"
method 2 (use number): set value 1/0 then find all rows that have value 1 in field STATUS by a query like:
SELECT * FROM post WHERE status=1
Should I use method 1 or method 2? Which one is faster? Which one is better?
The two are essentially equivalent, so this becomes a question of which is better for your application.
If you are concerned about space, then the smallest space for one character is char(1), using 8 bits. With a number, you can use bit or set types for pack multiple flags. But, this only makes a difference if you have lots of flags.
The store-it-as-a-number approach has a slight advantage, where you can count the "Yes" values by doing:
select sum(status)
(Of course, in MySQL, this is only a marginal improvement on sum(status = 'Y').
The store-it-as-a-letter approach has a slight advantage if you decide to include "Maybe" or other values at some point in the future.
Finally, any difference in performance in different ways of representing these values is going to be very, very minimal. You would need a table with millions and millions of rows to start to notice a problem. So, use the mechanism that works best for your application and way of representing the value.
Second one is definitely faster primarily because whenever you involve something within quotes , it is meaningless to SQL. It would be better to use types that are non string in order to get better performance. I would suggest using METHOD 2.
Fastest way would be ;
SELECT * FROM post WHERE `status` = FIND_IN_SET(`status`,'y');
I think you should create column with ENUM('n','y'). Mysql stores this type in optimal way. It also will help you to store only allowed values in the field.
You can also make it more human friendly ENUM('no','yes') without affect to performance. Because strings 'no' and 'yes' are stored only once per ENUM definition. Mysql stores only index of the value per row.
I think the method 1 is better if you are concerned with the storage prospective .
As storing an integer i.e 1/2 takes 4 bytes of memory where as a character takes only 1 byte of memory. So its better to use method 1.
This may increase some performance .
Please explain, which one will be faster in Mysql for the following query?
SELECT * FROM `userstatus` where BINARY Name = 'Raja'
[OR]
SELECT * FROM `userstatus` where Name = 'raja'
Db entry for Name field is 'Raja'
I have 10000 records in my db, i tried with "explain" query but both saying same execution time.
Your question does not make sense.
The collation of a row determines the layout of the index and whether tests wil be case-sensitive or not.
If you cast a row, the cast will take time.
So logically the uncasted operation should be faster....
However, if the cast makes it to find fewer rows than the casted operation will be faster or the other way round.
This of course changes the whole problem and makes the comparison invalid.
A cast to BINARY makes the comparison to be case-sensitive, changing the nature of the test and very probably the number of hits.
My advice
Never worry about speed of collations, the percentages are so small it is never worth bothering about.
The speed penalty from using select * (a big no no) will far outweigh the collation issues.
Start with putting in an index. That's a factor 10,000 speedup with a million rows.
Assuming that the Names field is a simple latin-1 text type, and there's no index on it, then the BINARY version of the query will be faster. By default, MySQL does case-insensitive comparisons, which means the field values and the value you're comparing against both get smashed into a single case (either all-upper or all-lower) and then compared. Doing a binary comparison skips the case conversion and does a raw 1:1 numeric comparison of each character value, making it a case-sensitive comparison.
Of course, that's just one very specific scenario, and it's unlikely to be met in your case. Too many other factors affect this, especially the presence of an index.
tl;dr: Is assigning rows IDs of {unixtimestamp}{randomdigits} (such as 1308022796123456) as a BIGINT a good idea if I don't want to deal with UUIDs?
Just wondering if anyone has some insight into any performance or other technical considerations / limitations in regards to IDs / PRIMARY KEYs assigned to database records across multiple servers.
My PHP+MySQL application runs on multiple servers, and the data needs to be able to be merged. So I've outgrown the standard sequential / auto_increment integer method of identifying rows.
My research into a solution brought me to the concept of using UUIDs / GUIDs. However the need to alter my code to deal with converting UUID strings to binary values in MySQL seems like a bit of a pain/work. I don't want to store the UUIDs as VARCHAR for storage and performance reasons.
Another possible annoyance of UUIDs stored in a binary column is the fact that rows IDs aren't obvious when looking at the data in PhpMyAdmin - I could be wrong about this though - but straight numbers seem a lot simpler overall anyway and are universal across any kind of database system with no conversion required.
As a middle ground I came up with the idea of making my ID columns a BIGINT, and assigning IDs using the current unix timestamp followed by 6 random digits. So lets say my random number came about to be 123456, my generated ID today would come out as: 1308022796123456
A one in 10 million chance of a conflict for rows created within the same second is fine with me. I'm not doing any sort of mass row creation quickly.
One issue I've read about with randomly generated UUIDs is that they're bad for indexes, as the values are not sequential (they're spread out all over the place). The UUID() function in MySQL addresses this by generating the first part of the UUID from the current timestamp. Therefore I've copied that idea of having the unix timestamp at the start of my BIGINT. Will my indexes be slow?
Pros of my BIGINT idea:
Gives me the multi-server/merging advantages of UUIDs
Requires very little change to my application code (everything is already programmed to handle integers for IDs)
Half the storage of a UUID (8 bytes vs 16 bytes)
Cons:
??? - Please let me know if you can think of any.
Some follow up questions to go along with this:
Should I use more or less than 6 random digits at the end? Will it make a difference to index performance?
Is one of these methods any "randomer" ?: Getting PHP to generate 6 digits and concatenating them together -VS- getting PHP to generate a number in the 1 - 999999 range and then zerofilling to ensure 6 digits.
Thanks for any tips. Sorry about the wall of text.
I have run into this very problem in my professional life. We used timestamp + random number and ran into serious issues when our applications scaled up (more clients, more servers, more requests). Granted, we (stupidly) used only 4 digits, and then change to 6, but you would be surprised how often that the errors still happen.
Over a long enough period of time, you are guaranteed to get duplicate key errors. Our application is mission critical, and therefore even the smallest chance it could fail to due inherently random behavior was unacceptable. We started using UUIDs to avoid this issue, and carefully managed their creation.
Using UUIDs, your index size will increase, and a larger index will result in poorer performance (perhaps unnoticeable, but poorer none-the-less). However MySQL supports a native UUID type (never use varchar as a primary key!!), and can handle indexing, searching,etc pretty damn efficiently even compared to bigint. The biggest performance hit to your index is almost always the number of rows indexed, rather than the size of the item being index (unless you want to index on a longtext or something ridiculous like that).
To answer you question: Bigint (with random numbers attached) will be ok if you do not plan on scaling your application/service significantly. If your code can handle the change without much alteration and your application will not explode if a duplicate key error occurs, go with it. Otherwise, bite-the-bullet and go for the more substantial option.
You can always implement a larger change later, like switching to an entirely different backend (which we are now facing... :P)
You can manually change the autonumber starting number.
ALTER TABLE foo AUTO_INCREMENT = ####
An unsigned int can store up to 4,294,967,295, lets round it down to 4,290,000,000.
Use the first 3 digits for the server serial number, and the final 7 digits for the row id.
This gives you up to 430 servers (including 000), and up to 10 million IDs for each server.
So for server #172 you manually change the autonumber to start at 1,720,000,000, then let it assign IDs sequentially.
If you think you might have more servers, but less IDs per server, then adjust it to 4 digits per server and 6 for the ID (i.e. up to 1 million IDs).
You can also split the number using binary digits instead of decimal digits (perhaps 10 binary digits per server, and 22 for the ID. So, for example, server 76 starts at 2^22*76 = 318,767,104 and ends at 322,961,407).
For that matter you don't even need a clear split. Take 4,294,967,295 divide it by the maximum number of servers you think you will ever have, and that's your spacing.
You could use a bigint if you think you need more identifiers, but that's a seriously huge number.
Use the GUID as a unique index, but also calculate a 64-bit (BIGINT) hash of the GUID, store that in a separate NOT UNIQUE column, and index it. To retrieve, query for a match to both columns - the 64-bit index should make this efficient.
What's good about this is that the hash:
a. Doesn't have to be unique.
b. Is likely to be well-distributed.
The cost: extra 8-byte column and its index.
If you want to use the timestamp method then do this:
Give each server a number, to that append the proccess ID of the application that is doing the insert (or the thread ID) (in PHP it's getmypid()), then to that append how long that process has been alive/active for (in PHP it's getrusage()), and finally add a counter that starts at 0 at the start of each script invocation (i.e. each insert within the same script adds one to it).
Also, you don't need to store the full unix timestamp - most of those digits are for saying it's year 2011, and not year 1970. So if you can't get a number saying how long the process was alive for, then at least subtract a fixed timestamp representing today - that way you'll need far less digits.
I am curious about the disadvantage of quoting integers in MYSQL queries
For example
SELECT col1,col2,col3 FROM table WHERE col1='3';
VS
SELECT col1,col2,col3 FROM table WHERE col1= 3;
If there is a performance cost, what is the size of it and why does it occur? Are there any other disavantages other that performance?
Thanks
Andrew
Edit: The reason for this question
1. Because I want to learn the difference because I am curious
2. I am experimenting with a way of passing composite keys from my database around in my php code as psudo-Id-keys(PIK). These PIK's are the used to target the record.
For example, given a primary key (AreaCode,Category,RecordDtm)
My PIK in the url would look like this:
index.php?action=hello&Id=20001,trvl,2010:10:10 17:10:45
And I would select this record like this:
$Id = $_POST['Id'];//equals 20001,trvl,2010:10:10 17:10:45
$sql = "SELECT AreaCode,Category,RecordDtm,OtherColumns.... FROM table WHERE (AreaCode,Category,RecordDtm) = ({$Id});
$mysqli->query($sql):
......and so on.
At this point the query won't work because of the datetime(which must be quoted) and it is open to sql injection because I haven't escaped those values. Given the fact that I won't always know how my PIK's are constructed I would write a function splits the Id PIK at the commas, cleans each part with real_escape_string and puts It back together with the values quoted. For Example:
$Id = "'20001','trvl','2010:10:10 17:10:45'"
Of course, in this function that is breaking apart and cleaning the Id I could check if the value is a number or not. If it is a number, don't quote it. If it is anything but a string then quote it.
The performance cost is that whenever mysql needs to do a type conversion from whatever you give it to datatype of the column. So with your query
SELECT col1,col2,col3 FROM table WHERE col1='3';
If col1 is not a string type, MySQL needs to convert '3' to that type. This type of query isn't really a big deal, as the performance overhead of that conversion is negligible.
However, when you try to do the same thing when, say, joining 2 table that have several million rows each. If the columns in the ON clause are not the same datatype, then MySQL will have to convert several million rows every single time you run your query, and that is where the performance overhead comes in.
Strings also have a different sort order from numbers.
Compare:
SELECT 312 < 41
(yields 0, because 312 numerically comes after 41)
to:
SELECT '312' < '41'
(yields 1, because '312' lexicographically comes before '41')
Depending on the way your query is built using quotes might give wrong results or none at all.
Numbers should be used as such, so never use quotes unless you have a special reason to do so.
According to me, I think there is no performance/size cost in the case you have mentioned. Even if there is, then it is very much negligible and wont affect your application as such.
It gives the wrong impression about the data type for the column. As an outsider, I assume the column in question is CHAR/VARCHAR & choose operations accordingly.
Otherwise MySQL, like most other databases, will implicitly convert the value to whatever the column data type is. There's no performance issue with this that I'm aware of but there's a risk that supplying a value that requires explicit conversion (using CAST or CONVERT) will trigger an error.