With MySQL I often overlook some options like 'signed/unsigned' ints and 'allow null' but I'm wondering if these details could slow a web application down.
Are there any notable performance differences in these situations?
using a low/high range of Integer primary key
5000 rows with ids from 1 to 5000
5000 rows with ids from 20001 to 25000
Integer PK incrementing uniformly vs non-uniformly.
5000 rows with ids from 1 to 5000
5000 rows with ids scattered from 1 to 30000
Setting an Integer PK as unsigned vs. signed
example: where the gain in range of unsigned isn't actually needed
Setting a default value for a field (any type) vs. no default
example: update a row and all field data is given
Allow Null vs deny Null
example: updating a row and all field data is given
I'm using MySQL, but this is more of a general question.
From my understanding of B-trees (that's how relational databases are usually implemented, right?), these things should not make any difference. All you need is a fast comparison function on your key, and it usually doesn't matter what range of integers you use (unless you get out of the machine word size).
Of course, for keys, a uniform default value or allowing null doesn't make much sense. In all non-key fields, allowing null or providing default values should not have any significant impact.
5000 rows is almost nothing for a database. They normally use large B-trees for indexes, so they don't care much about the distribution of primary keys.
Generally, whether to use the other options should be based on what you need from the database application; they won't significantly affect performance. So use a default value when you want a default value, and use a NOT NULL constraint when you don't want the column to be NULL.
If you have database performance issues, look for more important problems first: missing indexes, slow queries that could be rewritten more efficiently, and making sure the database has accurate statistics about the data so it can use indexes the right way (although that last one is an admin task).
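For example, a quick, hedged way to spot a missing index in MySQL is EXPLAIN; the table and column names here are hypothetical:
EXPLAIN SELECT * FROM orders WHERE customer_id = 42;
If the key column in the output is NULL and rows is close to the table size, the query is scanning the whole table and probably needs an index on customer_id.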
using a low/high range of Integer primary key
* 5000 rows with ids from 1 to 5000
* 5000 rows with ids from 20001 to 25000
Does not make any difference.
Integer PK incrementing uniformly vs non-uniformly.
* 5000 rows with ids from 1 to 5000
* 5000 rows with ids scattered from 1 to 30000
If the distribution is uniform, this makes no difference.
A uniform distribution may help you build a more efficient random sampling query, as described in this article on my blog:
PostgreSQL 8.4: sampling random rows
It's the distribution that matters, not the bounds: 1, 11, 21, 31 is OK; 1, 2, 3, 31 is not.
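For instance, with ids scattered fairly uniformly between 1 and 30000 (the bounds from the question), a rough sketch of picking one approximately random row in MySQL could look like this (mytable is a hypothetical name, and the trick is only as uniform as the id distribution):
SELECT *
FROM mytable
WHERE id >= FLOOR(1 + RAND() * 30000)
ORDER BY id
LIMIT 1;
With a badly skewed distribution such as 1, 2, 3, 31, the row just after the big gap gets picked far more often than the others.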
Setting an Integer PK as unsigned vs. signed
* example: where the gain in range of unsigned isn't actually needed
If you declare PRIMARY KEY as UNSIGNED, MySQL can optimize out predicates like id >= -1
Setting a default value for a field (any type) vs. no default
* example: update a row and all field data is given
No difference.
Allow Null vs deny Null
* example: updating a row and all field data is given
Nullable columns are one byte larger: the index key for an INT NOT NULL is 4 bytes long, while that for a nullable INT is 5 bytes long.
Related
I have a table storing phone numbers, with 800M rows.
column          type                   null  key  extra
region_code_id  smallint(4) unsigned   YES
local_number    mediumint(7) unsigned  YES
region_id       smallint(4) unsigned   YES
operator_id     smallint(4) unsigned   YES
id              int(10) unsigned       NO    PRI  auto_increment
I need to find numbers.id where region_code_id = 119 and local_number = 1234567:
select * from numbers where numbers.region_code_id = 119 and numbers.local_number = 1234567;
This query takes over 600 seconds to execute.
How can I improve it?
UPDATE
Thanks for the answers. I understand I need an index on these columns; I will try this as soon as I get a server with more SSD space (right now I only have 1GB of free SSD space). How can I find out how much space the index will occupy?
Consider adding an INDEX on the columns you use in the WHERE clause.
Start with:
ALTER TABLE `numbers`
ADD INDEX `region_code_id_local_number`
(`region_code_id`, `local_number`);
Note: it can take some time for the index to build.
Before and after the change, execute the explain plan to compare:
EXPLAIN EXTENDED select * from numbers where numbers.region_code_id = 119 and numbers.local_number = 1234567;
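As for the update asking how much space the index will occupy, one hedged way to check the current table and index sizes (figures are in bytes) is the standard information_schema view:
SELECT table_name, data_length, index_length
FROM information_schema.TABLES
WHERE table_schema = DATABASE()
  AND table_name = 'numbers';
That at least shows how big the table and its existing indexes are now, which is a useful baseline for the size estimates further down.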
References:
How MySQL uses indexes
For this query:
select *
from numbers
where numbers.region_code_id = 119 and
numbers.local_number = 1234567;
You want an index on numbers(region_code_id, local_number) or numbers(local_number, region_code_id). The order of the columns doesn't matter because the conditions are equality for both columns.
create index idx_numbers_region_local on numbers(region_code_id, local_number);
I agree that INDEX(region_code_id, local_number) (in either order) is mandatory for this problem, but I am sticking my nose in to carry it a step further. Isn't that pair "unique"? Or do you have duplicate numbers in the table? If it is unique, then get rid of id and make that pair PRIMARY KEY(region_code_id, local_number). The table will possibly be smaller after the change.
Back to your question of "how big". How big is the table now? Perhaps 40GB? A secondary index (as originally proposed) would probably add about 20GB. And you would need 20-60GB of free disk space to perform the ALTER. This depends on whether adding the index can be done "inplace" in that version.
Changing the PK (as I suggest) would result in a little less than 40GB for the table. It will take 40GB of free space to perform the ALTER.
In general (and pessimistically), plan on an ALTER needing both the original table and the new table sitting on disk at the same time. That includes full copies of the data and index(es).
(A side question: Are you sure local_number is limited to 7 digits everywhere?)
Another approach to the question... For calculating the size of a table or index in InnoDB, add up the datatype sizes (3 bytes for MEDIUMINT, some average for VARCHAR, etc). Then multiply by the number of rows. Then multiply by 4; this will give you the approximate disk space needed. (Usually 2-3 is sufficient for the last multiplier.)
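Applying that to the proposed secondary index on this table (a rough estimate, assuming the 800M rows from the question): region_code_id SMALLINT is 2 bytes, local_number MEDIUMINT is 3 bytes, plus the implicit copy of the INT primary key at 4 bytes, so about 9 bytes per entry; 9 bytes × 800M rows ≈ 7.2GB raw, and multiplying by 2-3 gives roughly 15-22GB, consistent with the ~20GB figure above.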
When changing the PK, do it in one step:
ALTER TABLE foo
DROP PRIMARY KEY,
ADD PRIMARY KEY(region_code_id, local_number);
Changing the PK cannot be done "inplace".
Edit (mostly for other readers)
#berap points out that id is needed for other purposes. Hence, dropping id and switching the PK is not an option.
However, this is sometimes an option (perhaps not in this case):
ALTER TABLE foo
DROP PRIMARY KEY,
ADD PRIMARY KEY(region_code_id, local_number),
ADD INDEX(id);
Notes:
The id AUTO_INCREMENT will continue to work even with just an INDEX on it.
The SELECT in question will be more efficient because it is the PK.
SELECT .. WHERE id = ... will be less efficient because id is a secondary key.
The table will be the same size either way; the secondary key would also be the same size either way -- because every secondary key contains the PK columns, too. (This note is InnoDB-specific.)
I am looking into storing a "large" amount of data and am not sure what the best solution is, so any help would be most appreciated. The structure of the data is:
450,000 rows
11,000 columns
My requirements are:
1) Need as fast access as possible to a small subset of the data e.g. rows (1,2,3) and columns (5,10,1000)
2) Needs to be scalable: columns will be added every month, but the number of rows is fixed.
My understanding is that it's often best to store it as:
id| row_number| column_number| value
but this would create 4,950,000,000 entries. I have tried storing the data as plain rows and columns in MySQL, but it is very slow at subsetting the data.
Thanks!
Build the giant matrix table
As N.B. said in comments, there's no cleaner way than using one mysql row for each matrix value.
You can do it without the id column:
CREATE TABLE `stackoverflow`.`matrix` (
`rowNum` MEDIUMINT NOT NULL ,
`colNum` MEDIUMINT NOT NULL ,
`value` INT NOT NULL ,
PRIMARY KEY ( `rowNum`, `colNum` )
) ENGINE = MYISAM ;
You may add a UNIQUE INDEX on (colNum, rowNum), or only a non-unique INDEX on colNum, if you often access the matrix by column (because the PRIMARY KEY is on (`rowNum`, `colNum`), note the order, so it will be inefficient when it comes to selecting a whole column).
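A hedged sketch of that column-oriented index on the matrix table defined above (pick the unique or non-unique variant depending on your access pattern):
ALTER TABLE `matrix`
ADD INDEX `idx_colNum` (`colNum`, `rowNum`);
The (colNum, rowNum) order mirrors the primary key but lets MySQL find all entries of a given column without scanning the whole table.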
You'll probably need more than 200GB to store the 450,000 × 11,000 values, including indexes.
Inserting data may be slow (because there are two indexes to update, and 450,000 entries [1 per row] to add when adding a column).
Edits should be very fast, as the index wouldn't change and the value is of fixed size.
If you access same subsets (rows + cols) often, maybe you can use PARTITIONing of the table if you need something "faster" than what mysql provides by default.
After years of experience (2021 edit)
Re-reading myself years later, I would say the "cache" ideas below are totally dumb, as it's MySQL's role to handle this sort of caching (the data should actually already be in the InnoDB buffer pool).
A better idea would be, if the matrix is full of zeroes, not to store the zero values and to treat 0 as the "default" in the client code. That way you lighten the storage (if needed: MySQL should actually be pretty fast at answering queries even on such a 5-billion-row table).
Another idea, if storage is an issue, is to use a single id to identify both row and col: you say the number of rows is fixed (450,000), so you may replace (row, col) with a single value (id = 450000 * col + row), though it needs BIGINT, so it may be no better than two columns.
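A hedged sketch of that single-id encoding, using the fixed row count of 450,000 from the question (rows are numbered 0 to 449,999 here, and BIGINT is needed because 450,000 × 11,000 values exceed the unsigned INT range):
-- hypothetical row/column pair
SET @row = 123, @col = 45;
-- encode into a single id
SET @id = 450000 * @col + @row;
-- decode back into column and row (DIV and MOD are MySQL operators)
SELECT @id DIV 450000 AS colNum, @id MOD 450000 AS rowNum;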
Don't do like below: don't reinvent MySQL cache
Add a cache (actually no)
Since you said you add values and don't seem to edit existing matrix values, a cache can speed up frequently requested rows/columns.
If you often read the same rows/columns, you can cache their result in another table (same structure to make it easier):
CREATE TABLE `stackoverflow`.`cachedPartialMatrix` (
`rowNum` MEDIUMINT NOT NULL ,
`colNum` MEDIUMINT NOT NULL ,
`value` INT NOT NULL ,
PRIMARY KEY ( `rowNum`, `colNum` )
) ENGINE = MYISAM ;
That table will be empty at the beginning, and each SELECT on the matrix table will feed the cache. When you want to get a column/row:
SELECT the row/column from that caching table
If the SELECT returns an empty/partial result (no data returned, or not enough data to match the expected row/column count), then do the SELECT on the matrix table
Save the result of that SELECT into cachedPartialMatrix
If the caching matrix gets too big, clear it (the bigger the cached matrix is, the slower it becomes)
Smarter cache (actually, no)
You can make it even smarter with a third table to count how many times a selection is done:
CREATE TABLE `stackoverflow`.`requestsCounter` (
`isRowSelect` BOOLEAN NOT NULL ,
`index` INT NOT NULL ,
`count` INT NOT NULL ,
`lastDate` DATETIME NOT NULL,
PRIMARY KEY ( `isRowSelect` , `index` )
) ENGINE = MYISAM ;
When you do a request on your matrix (one may use TRIGGERS) for the Nth-row or Kth-column, increment the counter. When the counter gets big enough, feed the cache.
lastDate can be used to remove some old values from the cache (take care: if you remove the Nth column from the cache entries because its lastDate is old enough, you may break the cache for some other entries), or to regularly clear the cache and only keep the recently selected values.
What is the best way to store boolean values in a database if you want better query performance and minimal memory use in SELECT statements?
For example:
I have a table with 36 fields, 30 of which hold boolean values (zero or one), and I need to search for records where certain boolean fields are true.
SELECT * FROM `myTable`
WHERE
`field_5th` = 1
AND `field_12th` = 1
AND `field_20` = 1
AND `field_8` = 1
Is there any solution?
If you want to store boolean values or flags there are basically three options:
Individual columns
This is reflected in your example above. The advantage is that you will be able to put indexes on the flags you intend to use most often for lookups. The disadvantage is that this will take up more space (since the minimum column size that can be allocated is 1 byte.)
However, if your column names are really going to be field_20, field_21, etc., then this is absolutely NOT the way to go. Numbered columns are a sign you should use one of the other two methods.
Bitmasks
As was suggested above you can store multiple values in a single integer column. A BIGINT column would give you up to 64 possible flags.
Values would be something like:
UPDATE table SET flags=b'100';
UPDATE table SET flags=b'10000';
Then the field would look something like: 10100
That would represent having two flag values set. To query for any particular flag value set, you would do
SELECT flags FROM table WHERE flags & b'100';
The advantage of this is that your flags are very compact space-wise. The disadvantage is that you can't place indexes on the field which would help improve the performance of searching for specific flags.
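As a usage note on the bitmask approach, a hedged sketch of flipping a single flag without clobbering the others (same hypothetical table and flags column as above):
-- turn the third bit on, leaving the other flags untouched
UPDATE `table` SET flags = flags | b'100';
-- turn it off again
UPDATE `table` SET flags = flags & ~b'100';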
One-to-many relationship
This is where you create another table, and each row there would have the id of the row it's linked to, and the flag:
CREATE TABLE main (
main_id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY
);
CREATE TABLE flag (
main_id INT UNSIGNED NOT NULL,
name VARCHAR(16)
);
Then you would insert multiple rows into the flag table.
The advantage is that you can use indexes for lookups, and you can have any number of flags per row without changing your schema. This works best for sparse values, where most rows do not have a value set. If every row needs all flags defined, then this isn't very efficient.
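For lookups with this layout, a hedged example using the two tables defined above (the flag name is hypothetical), ideally together with an index such as INDEX(name, main_id) on the flag table:
SELECT m.*
FROM main m
JOIN flag f ON f.main_id = m.main_id
WHERE f.name = 'is_active';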
For a performance comparison you can read a blog post I wrote on the topic:
Set Performance Compare
Also, when you ask which is "best", that's a very subjective question. Best at what? It all really depends on what your data looks like, what your requirements are, and how you want to query it.
Keep in mind that if you want to do a query like:
SELECT * FROM table WHERE some_flag=true
Indexes will only help you if few rows have that value set. If most of the rows in the table have some_flag=true, then mysql will ignore indexes and do a full table scan instead.
How many rows of data are you querying over? You can store the boolean values in a single integer column and use bit operations to test for them. That's not indexable, but the storage is very compact. A query using TINYINT fields with indexes would pick one index to use and scan from there.
I'm currently doing a project using MySQL and am a complete beginner with it.
I made a table with the following columns:
ID // an integer column which is the primary key
Date // a DATE column
Day // a string column
Now I just want to know whether there is any method by which the ID value can be generated automatically on insert.
For example, if I insert Date = 04/10/1992 and Day = WED, the MySQL server should automatically generate an integer value, starting from 1 and checking whether each value already exists.
i.e. in a table containing the values
ID Date Day
1 01/02/1987 Sun
3 04/08/1990 Sun
If I insert the Date and Day values specified in the example into the above table, the row should be inserted as
2 04/10/1992 WED
I tried using AUTO_INCREMENT, but I'm afraid it only ever increments the ID value; it never fills the gaps.
There's a way to do this, but it's going to affect performance. Go ahead and keep auto_increment on the column, just for the first insert, or for when you want to insert more quickly.
Even with auto_increment on a column, you can specify the value, so long as it doesn't collide with an existing value.
To get the next value or first gap:
SELECT a.ID + 1 AS NextID FROM tbl a
LEFT JOIN tbl b ON b.ID = a.ID + 1
WHERE b.ID IS NULL
ORDER BY a.ID
LIMIT 1
If you get an empty set, just use 1, or let auto_increment do its thing.
For concurrency's sake, you will need to lock the table to keep other sessions from using the next ID you just found.
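A hedged sketch of the whole sequence under a table lock, using the tbl name from the query above and the ID/Date/Day columns from the question (with LOCK TABLES, every alias used in the query needs its own lock entry; if the table is empty, nothing is inserted and you fall back to ID 1 as noted above):
LOCK TABLES tbl WRITE, tbl AS a READ, tbl AS b READ;
INSERT INTO tbl (ID, `Date`, `Day`)
SELECT a.ID + 1, '1992-10-04', 'WED'
FROM tbl a
LEFT JOIN tbl b ON b.ID = a.ID + 1
WHERE b.ID IS NULL
ORDER BY a.ID
LIMIT 1;
UNLOCK TABLES;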
Well, I understand your problem: you want to generate the entries in such a way that you can keep the key within a controlled range.
I've got a solution which is quite hacky; use it if you feel like it.
Create your table with the primary key in auto-increment mode, using an unsigned INT (as everyone here suggested).
Now consider two situations:
If your table needs to be cleared every year, or after some other fixed duration (if such a situation exists):
perform an ALTER TABLE operation to disable auto-increment mode, delete all your contents,
and then enable it again.
If what you are doing is some sort of data warehousing, so the database keeps accumulating data for years:
then run an SQL query before each insert to find the largest primary key value (using the predefined key functions), and if it is more than 2^33, create a new table with the same schema; you should also maintain a separate table to track the number of tables of this type.
The trick is a bit complicated, and I'm afraid there is no simple way to do what you expect.
You really don't need to cover the gaps created by deleting values from integer primary key columns. They were especially designed to ignore those gaps.
The auto increment mechanism could have been designed to take into consideration either the gaps at the top (after you delete some products with the biggest id values) or all gaps. But it wasn't because it was designed not to save space but to save time and to ensure that different transactions don't accidentally generate the same id.
In fact, PostgreSQL implements its SEQUENCE data type / SERIAL column (their equivalent of MySQL's auto_increment) in such a way that if a transaction requests the sequence to increment a few times but ends up not using those ids, they never get used. That is also designed to avoid the possibility of transactions ever accidentally generating and using the same id.
You can't even save space because when you decide your table is going to use SMALLINT that's a fixed length 2 byte integer, it doesn't matter if the values are all 0 or maxed out. If you use a normal INTEGER that's a fixed length 4 byte integer.
If you use an UNSIGNED BIGINT, that's an 8 byte integer, which means it uses 8*8 bits = 64 bits. With an 8 byte integer you can count up to 2^64; even if your application runs continuously for years and years, it shouldn't reach a 20 digit number like 18446744070000000000 (if it does, what the hell are you counting, the molecules in the known universe?).
But, assuming you really are concerned that the ids might run out in a couple of years, perhaps you should be using UUIDs instead of integers.
Wikipedia states that "Only after generating 1 billion UUIDs every second for the next 100 years, the probability of creating just one duplicate would be about 50%".
UUIDs can be stored as BINARY(16) if you convert them into raw binary, as CHAR(32) if you strip the dashes or as CHAR(36) if you leave the dashes.
Out of the 16 bytes = 128 bits of data, UUIDs use 122 random bits and 6 version/variant bits, and some versions are constructed using information about when and where they were created. This means it is safe to create billions of UUIDs on different computers, and the likelihood of a collision is overwhelmingly minuscule (as opposed to generating auto-incremented integers on different machines).
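A hedged sketch of the BINARY(16) option in MySQL (UUID(), REPLACE(), UNHEX() and HEX() are standard MySQL functions; the items table is hypothetical, and note that MySQL's UUID() produces a version 1, time-and-node-based UUID):
CREATE TABLE items (
id BINARY(16) PRIMARY KEY,
name VARCHAR(64)
);
INSERT INTO items (id, name)
VALUES (UNHEX(REPLACE(UUID(), '-', '')), 'example');
-- read it back in readable hex form
SELECT HEX(id), name FROM items;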
A funny thing I've found about MySQL: it has a 3 byte numeric type, MEDIUMINT, whose range is from -8388608 to 8388607. That seems strange to me. I thought the sizes of numeric types were chosen for better performance, with data aligned to a machine word or double word, and that if we need restriction rules for numeric ranges, they should be external to the data type. For example:
CREATE TABLE ... (
id INT RANGE(0, 500) PRIMARY KEY
)
So, does anyone know why 3 bytes? Is there any reason?
The reason is so that if you have a number that falls within a 3 byte range, you don't waste space by storing it using 4 bytes.
When you have twenty billion rows, it matters.
The alignment issue you mentioned applies mostly to data in RAM. Nothing forces MySQL to use 3 bytes to store that type as it processes it.
This might have a small advantage in using disk cache more efficiently though.
We frequently use tinyint, smallint, and mediumint for very significant space savings. Keep in mind, it makes your indexes that much smaller.
This effect is magnified when you have really small join tables, like:
id1 smallint unsigned not null,
id2 mediumint unsigned not null,
primary key (id1, id2)
And then you have hundreds of millions or billions of records.
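As a rough back-of-the-envelope example (ignoring per-row and page overhead): with a billion rows, the (SMALLINT, MEDIUMINT) pair above needs about 2 + 3 = 5 bytes of key data per row, roughly 5GB, versus 4 + 4 = 8 bytes, roughly 8GB, if both columns were plain INTs; and since every InnoDB secondary index repeats the primary key columns, the saving is multiplied across the indexes as well.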