A funny thing I've found about MySQL: it has a 3-byte numeric type, MEDIUMINT, with a range from -8388608 to 8388607. This seems strange to me. I thought the sizes of numeric types were chosen for performance, so data should be aligned to a machine word or double word; and if we need restriction rules for numeric ranges, they should be external to the datatype. For example:
CREATE TABLE ... (
id INT RANGE(0, 500) PRIMARY KEY
)
So, does anyone know why 3 bytes? Is there any reason?
The reason is so that if you have a number that falls within a 3 byte range, you don't waste space by storing it using 4 bytes.
When you have twenty billion rows, it matters.
The alignment issue you mentioned applies mostly to data in RAM. Nothing forces MySQL to use 3 bytes to store that type as it processes it.
This might have a small advantage in using disk cache more efficiently though.
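To put rough numbers on it: saving one byte per row across twenty billion rows is about 20 GB before indexes are even counted, so MEDIUMINT instead of INT (3 bytes vs 4) saves roughly that much again for every index the column appears in.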
We frequently use tinyint, smallint, and mediumint for very significant space savings. Keep in mind that it also makes your indexes that much smaller.
This effect is magnified when you have really small join tables, like:
id1 smallint unsigned not null,
id2 mediumint unsigned not null,
primary key (id1, id2)
And then you have hundreds of millions or billions of records.
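As a back-of-the-envelope illustration (assuming InnoDB and ignoring page and row overhead): the composite key above takes 2 + 3 = 5 bytes per row instead of the 4 + 4 = 8 bytes two plain INTs would need, and because InnoDB repeats the primary-key columns in every secondary index, that 3-byte-per-row saving is multiplied by the number of indexes as well as by the row count.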
Related
I have a MySQL table (> 100k rows) that contains data imported from two different APIs. As a consequence, the UIDs from each API have significantly different lengths (35 from the first API and 19 from the second). These UIDs are alphanumeric strings (AMANR5L16PO791932BTC0014P0D1N000001 for the 1st API and AMSNT006654N00II598 for the 2nd).
These UIDs are the primary key of my table.
I wonder which is best in terms of performance: storing the primary key as VARCHAR(35), with lengths up to 35, or giving every UID a uniform length of 35 (by adding "XXX" to the end of the shorter UIDs so that all UIDs are 35 characters long) and storing them as CHAR(35).
Do not pad. Not even for 100 billion rows. VARCHAR will store only the actual data (plus a length). Shrinking space also provides some performance. But neither is worth sneezing at.
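A minimal sketch of the unpadded approach (the table and column names here are just placeholders):
CREATE TABLE api_record (
    uid VARCHAR(35) NOT NULL,
    PRIMARY KEY (uid)
) ENGINE=InnoDB;
VARCHAR(35) stores each UID at its actual length plus a 1-byte length prefix, so a 19-character UID costs about 20 bytes instead of 35.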
Is it possible to use the LOCATE() function on a TEXT column, or is there any alternative to it for TEXT fields?
The thing is, we have LARGE varchars (65 KB) that we use to track subscriptions, so we add subscription_ids inside one long string stored in a varchar.
This string can hold up to 5000 subscription_ids in one row. We use LOCATE to see if a user is subscribed,
i.e. if a subscription_id is found inside the varchar string.
The problem is that we plan to have more than 500,000 rows like this, and it seems this can have a big impact on performance.
So we decided to move to TEXT instead, but now there is a problem with indexing and with how to LOCATE sub-text inside a TEXT column.
Billions of subscriptions? Please show an abbreviated example of a TEXT value. Have you tried FIND_IN_SET()?
Is one TEXT field showing up to 5000 subscriptions for one user? Or is it the other way -- up to 5K users for one magazine?
In any case, it would be better to have a table with 2 columns:
CREATE TABLE user_sub (
user_id INT UNSIGNED NOT NULL,
sub_id INT UNSIGNED NOT NULL,
PRIMARY KEY(user_id, sub_id),
INDEX(sub_id, user_id)
) ENGINE=InnoDB;
The two composite indexes let you very efficiently find the 5K subscriptions for a user or the 500K users for a sub.
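For example, the two lookups would look roughly like this (the literal ids are placeholders):
SELECT sub_id FROM user_sub WHERE user_id = 123;   -- served by the PRIMARY KEY
SELECT user_id FROM user_sub WHERE sub_id = 456;   -- served by INDEX(sub_id, user_id)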
Shrink the less-500K id to MEDIUMINT UNSIGNED (16M limit instead of 4 billion; 3 bytes each instead of 4).
Shrink the less-5K id to SMALLINT UNSIGNED (64K limit instead of 4B; 2 bytes each instead of 4).
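If both shrinks apply, the sketch above becomes roughly the following (this assumes user_id is the id that stays under 500K, hence MEDIUMINT, and sub_id the one that stays under 5K, hence SMALLINT; swap the types if it is the other way around):
CREATE TABLE user_sub (
    user_id MEDIUMINT UNSIGNED NOT NULL,
    sub_id SMALLINT UNSIGNED NOT NULL,
    PRIMARY KEY(user_id, sub_id),
    INDEX(sub_id, user_id)
) ENGINE=InnoDB;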
If you desire, you can use GROUP_CONCAT() to reconstruct the commalist. Be sure to change group_concat_max_len to a suitably large number (default is only 1024 bytes.)
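A minimal sketch of that reconstruction (the limit value is arbitrary):
SET SESSION group_concat_max_len = 1000000;
SELECT user_id, GROUP_CONCAT(sub_id) AS sub_list
FROM user_sub
GROUP BY user_id;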
I have a table storing phone numbers with 800M rows.
Field            Type                   Null  Key  Extra
region_code_id   smallint(4) unsigned   YES
local_number     mediumint(7) unsigned  YES
region_id        smallint(4) unsigned   YES
operator_id      smallint(4) unsigned   YES
id               int(10) unsigned       NO    PRI   auto_increment
I need to find numbers.id where region_code_id = 119 and local_number = 1234567:
select * from numbers where numbers.region_code_id = 119 and numbers.local_number = 1234567;
This query executes for over 600 seconds.
How can I improve it?
UPDATE
Thanks for the answer; I understand I need an index on these columns. I will try this as soon as I get a server with more SSD space (right now I have only 1 GB of free SSD space). How can I find out how much space the index will occupy?
Consider adding an INDEX on the columns you use in the WHERE clause.
Start with:
ALTER TABLE `numbers`
ADD INDEX `region_code_id_local_number`
(`region_code_id`, `local_number`);
Note: it can take some time for the index to build.
Before and after change, execute explain plan to compare:
EXPLAIN EXTENDED select * from numbers where numbers.region_code_id = 119 and numbers.local_number = 1234567;
References:
How MySQL uses indexes
For this query:
select *
from numbers
where numbers.region_code_id = 119 and
numbers.local_number = 1234567;
You want an index on numbers(region_code_id, local_number) or numbers(local_number, region_code_id). The order of the columns doesn't matter because the conditions are equality for both columns.
create index idx_numbers_region_local on numbers(region_code_id, local_number);
I agree that INDEX(region_code_id, local_number) (in either order) is mandatory for this problem, but I am sticking my nose in to carry it a step further. Isn't that pair "unique"? Or do you have duplicate numbers in the table? If it is unique, then get rid of id and make that pair PRIMARY KEY(region_code_id, local_number). The table will possibly be smaller after the change.
Back to your question of "how big". How big is the table now? Perhaps 40GB? A secondary index (as originally proposed) would probably add about 20GB. And you would need 20-60GB of free disk space to perform the ALTER. This depends on whether adding the index can be done "inplace" in that version.
Changing the PK (as I suggest) would result in a little less than 40GB for the table. It will take 40GB of free space to perform the ALTER.
In general (and pessimistically), plan on an ALTER needing both the original table and the new table sitting on disk at one time. That includes full copies of the data and index(es).
(A side question: Are you sure local_number is limited to 7 digits everywhere?)
Another approach to the question... For calculating the size of a table or index in InnoDB, add up the datatype sizes (3 bytes for MEDIUMINT, some average for VARCHAR, etc). Then multiply by the number of rows. Then multiply by 4; this will give you the approximate disk space needed. (Usually 2-3 is sufficient for the last multiplier.)
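A rough worked example for the phone-number table above: the declared datatypes add up to 2 + 3 + 2 + 2 + 4 = 13 bytes per row, so 13 bytes × 800M rows ≈ 10 GB of raw data, times 4 ≈ 40 GB for the table. The proposed secondary index carries (region_code_id, local_number) plus an implicit copy of the PK id: 2 + 3 + 4 = 9 bytes × 800M ≈ 7 GB, times 2-3 ≈ 15-20 GB, which is where the 20GB estimate above comes from.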
When changing the PK, do it in one step:
ALTER TABLE foo
DROP PRIMARY KEY,
ADD PRIMARY KEY(region_code_id, local_number);
Changing the PK cannot be done "inplace".
Edit (mostly for other readers)
#berap points out that id is needed for other purposes. Hence, dropping id and switching the PK is not an option.
However, this is sometimes an option (perhaps not in this case):
ALTER TABLE foo
DROP PRIMARY KEY,
ADD PRIMARY KEY(region_code_id, local_number),
ADD INDEX(id);
Notes:
The id column's AUTO_INCREMENT will continue to work even with just an INDEX on it.
The SELECT in question will be more efficient because it is the PK.
SELECT .. WHERE id = ... will be less efficient because id is a secondary key.
The table will be the same size either way; the secondary key would also be the same size either way -- because every secondary key contains the PK columns, too. (This note is InnoDB-specific.)
I'm a developer with some limited database knowledge, trying to put together a scalable DB design for a new app. Any thoughts that anyone could provide on this problem would be appreciated.
Assume I currently have the following table:
Stuff
------------
ID Integer
Attr1 Integer
Attr2 Integer
Attr3 Double
Attr4 TinyInt
Attr5 Varchar(250)
Looking forward, assume we will have 500 million records in this table. However, at any given time only 5000 or so records will have anything in the Attr5 column; all other records will have a blank or null Attr5 column. The Attr5 column is populated with 100-200 characters when a record is inserted, then a nightly process will clear the data in it.
My concern is that such a large varchar field in the middle of a tablespace that otherwise contains mostly small numeric fields will decrease the efficiency of reads against the table. As such, I was wondering if it might be better to change the DB design to use two tables like this:
Stuff
------------
ID Integer
Attr1 Integer
Attr2 Integer
Attr3 Double
Attr4 TinyInt
Stuff_Text
------------
StuffID Integer
Attr5 Varchar(250)
Then just delete from Stuff_Text during the nightly process keeping it at 5,000 records, thus keeping the Stuff table minimal in size.
So my question is this: Is it necessary to break this table into two, or is the database engine intelligent enough to store and access the information efficiently? I could see the DB compressing the data efficiently and storing records without data in Attr5 as if there were no varchar column. I could also see the DB leaving an open 250 bytes in every record in anticipation of data for Attr5. I tend to expect the former, as I thought that was the purpose of varchar over char, but my DB experience is limited so I figured I'd better double check.
I am using MySQL 5.1, currently on Windows 2000AS, eventually upgrading to Windows Server 2008 family. Database is currently on a standard 7200 rpm magnetic disc, eventually to be moved to an SSD.
Stuff
------------
ID Integer
Attr1 Integer
Attr2 Integer
Attr3 Double
Attr4 TinyInt
Attr5 Integer NOT NULL DEFAULT 0 (build an index on this)
Stuff_Text
------------
Attr5_id Integer (primary key)
Attr5_text Varchar(250)
In action
desc select * from Stuff WHERE Attr5<>0;
desc select Stuff.*, Stuff_text.Attr5_text
from Stuff
inner join Stuff_text ON Stuff.Attr5=Stuff_text.Attr5_id;
Don't store NULL.
Use an integer as the foreign key.
When pulling records where Attr5 <> 0, only about 5,000 rows are scanned.
The index is much smaller.
Do a benchmark yourself.
If you're using VARCHAR and allowing NULL values, then you shouldn't have problems, because storing this kind of datatype is really efficient. This is very different from the CHAR datatype, but you already have VARCHAR.
Anyway, splitting it into two tables is not a bad idea. It could help keep the query cache alive, but it mostly depends on how these tables are used.
The last thing I can say: try to benchmark it. Insert a bulk of data and try to simulate some real use.
With MySQL I often overlook some options like 'signed/unsigned' ints and 'allow null' but I'm wondering if these details could slow a web application down.
Are there any notable performance differences in these situations?
using a low/high range of Integer primary key
5000 rows with ids from 1 to 5000
5000 rows with ids from 20001 to 25000
Integer PK incrementing uniformly vs non-uniformly.
5000 rows with ids from 1 to 5000
5000 rows with ids scattered from 1 to 30000
Setting an Integer PK as unsigned vs. signed
example: where the gain in range of unsigned isn't actually needed
Setting a default value for a field (any type) vs. no default
example: update a row and all field data is given
Allow Null vs deny Null
example: updating a row and all field data is given
I'm using MySQL, but this is more of a general question.
From my understanding of B-trees (that's how relational databases are usually implemented, right?), these things should not make any difference. All you need is a fast comparison function on your key, and it usually doesn't matter what range of integers you use (unless you get out of the machine word size).
Of course, for keys, a uniform default value or allowing null doesn't make much sense. In all non-key fields, allowing null or providing default values should not have any significant impact.
5000 rows is almost nothing for a database. They normally use large B-trees for indexes, so they don't care much about the distribution of primary keys.
Generally, whether to use the other options should be based on what you need from the database application; they can't significantly affect the performance. So use a default value when you want a default value, and use a NOT NULL constraint when you don't want the column to be NULL.
If you have database performance issues, you should look for more important problems like missing indexes, slow queries that can be rewritten efficiently, making sure that the database has accurate statistics about the data so it can use indexes the right way (although this is an admin task).
using a low/high range of Integer primary key
* 5000 rows with ids from 1 to 5000
* 5000 rows with ids from 20001 to 25000
Does not make any difference.
Integer PK incrementing uniformly vs non-uniformly.
* 5000 rows with ids from 1 to 5000
* 5000 rows with ids scattered from 1 to 30000
If the distribution is uniform, this makes no difference.
A uniform distribution may help to build a more efficient random sampling query, as described in this article on my blog:
PostgreSQL 8.4: sampling random rows
It's the distribution that matters, not the bounds: 1, 11, 21, 31 is OK; 1, 2, 3, 31 is not.
Setting an Integer PK as unsigned vs. signed
* example: where the gain in range of unsigned isn't actually needed
If you declare the PRIMARY KEY as UNSIGNED, MySQL can optimize away predicates like id >= -1, since an unsigned value can never be negative.
Setting a default value for a field (any type) vs. no default
* example: update a row and all field data is given
No difference.
Allow Null vs deny Null
* example: updating a row and all field data is given
Nullable columns are one byte larger: the index key for an INT NOT NULL is 4 bytes long, while that for an INT NULL is 5 bytes long.
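You can see this in the key_len column of EXPLAIN output; a minimal sketch, assuming a table t with an index on an INT column c:
EXPLAIN SELECT * FROM t WHERE c = 5;
-- key_len is 4 if c is declared NOT NULL, 5 if c is nullable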