I'm a developer with some limited database knowledge, trying to put together a scalable DB design for a new app. Any thoughts that anyone could provide on this problem would be appreciated.
Assume I currently have the following table:
Stuff
------------
ID Integer
Attr1 Integer
Attr2 Integer
Attr3 Double
Attr4 TinyInt
Attr5 Varchar(250)
Looking forward, assume we will have 500 million records in this table. However, at any given time only 5000 or so records will have anything in the Attr5 column; all other records will have a blank or null Attr5 column. The Attr5 column is populated with 100-200 characters when a record is inserted, then a nightly process will clear the data in it.
My concern is that such a large varchar field in the middle of a table that otherwise contains mostly small numeric fields will decrease the efficiency of reads against the table. As such, I was wondering if it might be better to change the DB design to use two tables like this:
Stuff
------------
ID Integer
Attr1 Integer
Attr2 Integer
Attr3 Double
Attr4 TinyInt
Stuff_Text
------------
StuffID Integer
Attr5 Varchar(250)
Then just delete from Stuff_Text during the nightly process, keeping it at around 5,000 records and thus keeping the Stuff rows minimal in size.
So my question is this: Is it necessary to break this table into two, or is the database engine intelligent enough to store and access the information efficiently? I could see the DB compressing the data efficiently and storing records without data in Attr5 as if there were no varchar column. I could also see the DB reserving an open 250 bytes in every record in anticipation of data for Attr5. I tend to expect the former, as I thought that was the purpose of varchar over char, but my DB experience is limited, so I figured I'd better double check.
I am using MySQL 5.1, currently on Windows 2000AS, eventually upgrading to Windows Server 2008 family. Database is currently on a standard 7200 rpm magnetic disc, eventually to be moved to an SSD.
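For reference, here is the split design written out as DDL. This is only a minimal sketch assuming InnoDB and the names above; the engine, NOT NULL choices and the wholesale nightly delete are my assumptions, not requirements:
CREATE TABLE Stuff (
    ID INT NOT NULL PRIMARY KEY,
    Attr1 INT,
    Attr2 INT,
    Attr3 DOUBLE,
    Attr4 TINYINT
) ENGINE=InnoDB;
CREATE TABLE Stuff_Text (
    StuffID INT NOT NULL PRIMARY KEY,  -- same value as Stuff.ID
    Attr5 VARCHAR(250) NOT NULL
) ENGINE=InnoDB;
-- nightly process: clear the text that has already been handled (here, simply all of it)
DELETE FROM Stuff_Text;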
I would split it into two tables, but store an integer key in Stuff that points at the text row, instead of the varchar itself (a full DDL sketch follows the notes below):
Stuff
------------
ID Integer
Attr1 Integer
Attr2 Integer
Attr3 Double
Attr4 TinyInt
Attr5 Integer NOT NULL DEFAULT 0 (build an index on this)
Stuff_Text
------------
Attr5_id Integer (primary key)
Attr5_text Varchar(250)
In action
desc select * from Stuff WHERE Attr5<>0;
desc select Stuff.*, Stuff_text.Attr5_text
from Stuff
inner join Stuff_text ON Stuff.Attr5=Stuff_text.Attr5_id;
Don't store NULLs.
Use an integer as the foreign key.
When pulling records where Attr5 <> 0, only ~5,000 rows are scanned.
The index is much smaller.
Do a benchmark yourself.
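A rough sketch of that layout in DDL, plus the nightly cleanup. Assume InnoDB; the index name and the AUTO_INCREMENT on Stuff_Text are my guesses at the intent, not something spelled out above:
CREATE TABLE Stuff (
    ID INT NOT NULL PRIMARY KEY,
    Attr1 INT,
    Attr2 INT,
    Attr3 DOUBLE,
    Attr4 TINYINT,
    Attr5 INT NOT NULL DEFAULT 0,   -- 0 = no text, otherwise points at Stuff_Text.Attr5_id
    KEY idx_attr5 (Attr5)
) ENGINE=InnoDB;
CREATE TABLE Stuff_Text (
    Attr5_id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    Attr5_text VARCHAR(250) NOT NULL
) ENGINE=InnoDB;
-- nightly process: reset the pointers, then drop the text rows
UPDATE Stuff SET Attr5 = 0 WHERE Attr5 <> 0;
DELETE FROM Stuff_Text;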
If you're using VARCHAR and allowing NULL values, then you shouldn't have problems, because storing this kind of datatype is really efficient. This is very different from the CHAR datatype, but you already have VARCHAR.
Anyway, splitting it into two tables is not a bad idea. It could help keep the query cache alive, but it mostly depends on how these tables are used.
Last thing I can say: try to benchmark it. Insert a bulk of data and try to simulate some use.
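For a quick benchmark, something like the following could generate test rows and exercise a couple of typical queries. This is only a sketch against the Stuff/Stuff_Text layout from the question; the row count and the information_schema cross-join trick for generating data are arbitrary choices:
-- generate ~1,000,000 dummy rows with sequential IDs and random attributes
INSERT INTO Stuff (ID, Attr1, Attr2, Attr3, Attr4)
SELECT @row := @row + 1,
       FLOOR(RAND() * 1000000),
       FLOOR(RAND() * 1000000),
       RAND() * 1000,
       FLOOR(RAND() * 100)
FROM information_schema.columns a,
     information_schema.columns b,
     information_schema.columns c,
     (SELECT @row := 0) init
LIMIT 1000000;
-- simulate typical reads (SQL_NO_CACHE keeps the query cache out of the timing)
SELECT SQL_NO_CACHE COUNT(*) FROM Stuff WHERE Attr2 BETWEEN 1000 AND 2000;
SELECT SQL_NO_CACHE Stuff.*, Stuff_Text.Attr5
FROM Stuff JOIN Stuff_Text ON Stuff_Text.StuffID = Stuff.ID;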
Related
Hi everyone. Here is a problem with my MySQL server.
I have a table with about 40,000,000 rows and 10 columns.
Its size is about 4 GB, and the engine is InnoDB.
It is a master database, and it only executes one kind of SQL statement, like this:
insert into mytable ... on duplicate key update ...
About 99% of the statements take the update path.
Now the server is becoming slower and slower.
I heard that splitting the table may improve performance, so I tried it on my personal computer: I split it into 10 tables, then into 100, and both attempts failed; the speed became slower instead. So I wonder why splitting the table didn't improve performance?
Thanks in advance.
more details:
CREATE TABLE my_table (
id BIGINT AUTO_INCREMENT,
user_id BIGINT,
identifier VARCHAR(64),
account_id VARCHAR(64),
top_speed INT UNSIGNED NOT NULL,
total_chars INT UNSIGNED NOT NULL,
total_time INT UNSIGNED NOT NULL,
keystrokes INT UNSIGNED NOT NULL,
avg_speed INT UNSIGNED NOT NULL,
country_code VARCHAR(16),
update_time TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
PRIMARY KEY(id), UNIQUE KEY(user_id)
);
PS:
I also tried different computers with a solid-state drive and a hard disk drive, but that didn't help either.
Splitting up a table is unlikely to help at all. Ditto for PARTITIONing.
Let's count the disk hits. I will skip counting non-leaf nodes in BTrees, since they tend to be cached; I will count leaf nodes in the data and indexes, since they tend not to be cached.
IODKU does:
Read the index block containing the entry for each UNIQUE key. In your case, that is probably user_id. (Please provide a sample SQL statement.) 1 read.
If the user_id entry is found in the index, read the record from the data as indexed by the PK(id) and do the UPDATE, and leave this second block in the buffer_pool for eventual rewrite to disk. 1 read now, 1 write later.
If the record is not found, do INSERT. The index block that needs the new row was already read, so it is ready to have a new entry inserted. Meanwhile, the "last" block in the table (due to id being AUTO_INCREMENT) is probably already cached. Add the new row to it. 0 reads now, 1 write later (UNIQUE). (Rewriting the "last" block is amortized over, say, 100 rows, so I am ignoring it.)
Eventually do the write(s).
Total, assuming essentially all take the UPDATE path: 2 reads and 1 write. Assuming the user_id follows no simple pattern, I will assume that all 3 I/Os are "random".
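To put rough numbers on that total (my assumption about the disk, not a measurement): a 7200 rpm drive handles on the order of 100 random I/Os per second, so at 3 random I/Os per statement the table tops out near 100 / 3 ≈ 33 IODKUs per second, or roughly 120,000 per hour, no matter how fast the CPU is.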
Let's consider a variation... What if you got rid of id? Do you need id anywhere else? Since you have a UNIQUE key, it could be the PK. That is, replace your two indexes with just PRIMARY KEY(user_id); a sketch of the ALTER follows below. Now the counts are:
1 read
If UPDATE, 0 read, 1 write
If INSERT, 0 read, 0 write
Total: 1 read, 1 write. 2/3 as many as before. Better, but still not great.
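If you do drop id, the change would look something like this. This is only a sketch: check the actual index name with SHOW CREATE TABLE first (an unnamed UNIQUE KEY(user_id) is usually named user_id), and expect the ALTER to rebuild the whole 40M-row table:
-- remove the surrogate key and promote user_id to the primary key
ALTER TABLE my_table
    DROP COLUMN id,                  -- takes the old PRIMARY KEY with it
    DROP INDEX user_id,              -- the old UNIQUE key
    MODIFY user_id BIGINT NOT NULL,  -- a PRIMARY KEY column must be NOT NULL
    ADD PRIMARY KEY (user_id);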
Caching
How much RAM do you have?
What is the value of innodb_buffer_pool_size?
SHOW TABLE STATUS -- What are Data_length and Index_length?
I suspect that the buffer_pool is not big enough and could possibly be raised; a sketch of how to check follows. If you have more than 4GB of RAM, make it about 70% of RAM.
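A sketch of how to check that. The 6G value below is only an example for a machine with 8 GB of RAM; in MySQL versions before 5.7 the buffer pool size is not a dynamic variable, so it has to go in the server config and needs a restart:
-- how much data and index is there, and how big is the pool right now?
SHOW TABLE STATUS LIKE 'my_table';
SHOW VARIABLES LIKE 'innodb_buffer_pool_size';
# my.cnf (or my.ini), in the [mysqld] section
innodb_buffer_pool_size = 6G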
Others
SSDs should have helped significantly, since you appear to be I/O bound. Can you tell whether you are I/O-bound or CPU-bound?
How many rows are you updating at once? How long does it take? Is it batched, or one at a time? There may be a significant improvement possible here.
Do you really need BIGINT (8 bytes)? INT UNSIGNED is only 4 bytes.
Is a transaction involved?
Is the Master having a problem? The Slave? Both? I don't want to fix the Master in such a way that it messes up the Slave.
Try splitting your database across several MySQL instances behind a proxy such as mysql-proxy or HAProxy instead of using one MySQL instance. Maybe you can get better performance that way.
I need to create a table which saves measurements consisting of a device id (int), a logdate (datetime) and a value (decimal) (SQL Server 2008). The measurements are always on the quarter hour, e.g. 00:00, 00:15, 00:30, 00:45, 01:00, 01:15... so I was thinking that an int holding the number of quarter-hours since a certain date would result in better performance than a datetime.
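For what it's worth, the conversion would hypothetically look something like this, assuming an arbitrary epoch of 2000-01-01 (the epoch and the sample value are just placeholders):
-- datetime -> quarter number (integer division snaps to the containing quarter)
DECLARE @logdate datetime;
SET @logdate = '2013-05-04 10:45:00';
SELECT DATEDIFF(minute, '2000-01-01', @logdate) / 15 AS QuarterNumber;
-- quarter number -> the datetime at the start of that quarter
SELECT DATEADD(minute, (DATEDIFF(minute, '2000-01-01', @logdate) / 15) * 15, '2000-01-01') AS QuarterStart;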
Retrieving would usually be done using the following:
-where DeviceId = x and QuarterNumber between a and b
-where DeviceId in (x, y, ...) and QuarterNumber between a and b
-where DeviceId = x and QuarterNumber = a
What would be the best design for this table?
PK DeviceId int
PK QuarterNumber int
Value int
or
PK MeasurementId int
UQ QuarterNumber int
UQ DeviceId int
Value int
(UQ=unique index)
or something totally different?
Thanks!
You might get marginally better SELECT performance by defining the number of quarter hours since a certain date if you have many millions of rows.
Personally, I don't think the marginal performance gain will be worth the reduced readability. I also wouldn't like basing the design on a quarter-hour assumption. (In my experience, that kind of requirement often changes over time.) You could include a quarter-hour CHECK constraint on a datetime column now, and drop it later if that requirement changes.
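Sketched below, assuming the datetime column is called LogDate and the table Measurement (both names are placeholders); dropping it later is a single ALTER TABLE ... DROP CONSTRAINT:
-- only allow values that sit exactly on a quarter hour
ALTER TABLE Measurement ADD CONSTRAINT CK_Measurement_QuarterHour
CHECK (DATEPART(minute, LogDate) % 15 = 0
   AND DATEPART(second, LogDate) = 0
   AND DATEPART(millisecond, LogDate) = 0);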
But there's no point in relying on opinion when you can test and measure. Build three tables, load several million rows of sample data, and study the query plans. (It's not completely impractical to load 50 million rows into each table. I've sometimes loaded 20 million rows into a test table when answering a question on SO.) Don't assume that your first try at indexing will be optimal. Consider multiple indexes, and consider a multi-column index, too.
I don't think there can be any specific guidelines for your criteria. You might need to create and test both designs (you can insert demo data into each). Since you want a performance improvement, I would suggest using an index on your table.
I'm currently doing a project using MySQL and am a complete beginner with it.
I made a table with the following columns:
ID // an integer column which is the primary key
Date // a Date column
Day // a String column
Now I just want to know whether there is any method by which the ID column's value is automatically generated on insertion.
For example, if I insert Date = 04/10/1992 and Day = WED as values, the MySQL server should automatically generate an integer value starting from 1, checking which values already exist.
That is, in a table containing the values
ID Date Day
1 01/02/1987 Sun
3 04/08/1990 Sun
If I insert the Date and Day values specified in the example into the above table, the row should be inserted as
2 04/10/1992 WED
I tried methods like AUTO_INCREMENT, but I'm afraid it just keeps incrementing the ID value; it doesn't fill in the gaps.
There's a way to do this, but it's going to affect performance. Go ahead and keep auto_increment on the column, just for the first insert, or for when you want to insert more quickly.
Even with auto_increment on a column, you can specify the value, so long as it doesn't collide with an existing value.
To get the next value or first gap:
SELECT a.ID + 1 AS NextID FROM tbl a
LEFT JOIN tbl b ON b.ID = a.ID + 1
WHERE b.ID IS NULL
ORDER BY a.ID
LIMIT 1
If you get an empty set, just use 1, or let auto_increment do its thing.
For concurrency's sake, you will need to lock the table to keep other sessions from using the next ID you just found.
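Put together, and assuming the table is called tbl with the ID/Date/Day columns from the question, it might look roughly like this (each alias has to be locked separately for the gap query to run under LOCK TABLES):
LOCK TABLES tbl WRITE, tbl AS a READ, tbl AS b READ;
-- first gap above an existing ID, falling back to 1 when the table is empty
SELECT COALESCE(
         (SELECT a.ID + 1 FROM tbl a
          LEFT JOIN tbl b ON b.ID = a.ID + 1
          WHERE b.ID IS NULL
          ORDER BY a.ID
          LIMIT 1),
         1)
INTO @next_id;
INSERT INTO tbl (ID, `Date`, `Day`) VALUES (@next_id, '1992-10-04', 'WED');
UNLOCK TABLES;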
Well, I understood your problem: you want to generate the IDs in such a way that you can control their limit.
I've got a solution which is quite whacky; you may accept it if you feel like it.
Create your table with the primary key in auto-increment mode using an unsigned int (as everyone suggested here).
Now consider two situations:
If your table needs to be cleared every year or after some other fixed duration (if such a situation exists):
perform an ALTER TABLE to disable auto-increment mode, delete all your contents,
and then enable it again.
If what you are doing is some sort of data warehousing, so the database is kept for years:
then include a SQL query to find the smallest primary key value, using the predefined key functions, before you insert; if it is more than 2^33, create a new table with the same structure, and maintain a separate table to track the number of tables of this type.
The trick is a bit complicated, I'm afraid; there isn't a simple way to do what you expected.
You really don't need to cover the gaps created by deleting values from integer primary key columns; auto-increment keys were specifically designed to ignore those gaps.
The auto-increment mechanism could have been designed to take into account either the gaps at the top (after you delete some rows with the biggest id values) or all gaps. But it wasn't, because it was designed not to save space but to save time, and to ensure that different transactions don't accidentally generate the same id.
In fact, PostgreSQL implements its SEQUENCE data type / SERIAL column (their equivalent of MySQL's auto_increment) in such a way that if a transaction requests the sequence to increment a few times but ends up not using those ids, they never get used. That is also designed to avoid the possibility of transactions ever accidentally generating and using the same id.
You can't even save space, because once you decide your table is going to use SMALLINT, that's a fixed-length 2-byte integer; it doesn't matter if the values are all 0 or maxed out. If you use a normal INTEGER, that's a fixed-length 4-byte integer.
If you use an UNSIGNED BIGINT, that's an 8-byte integer, which means it uses 8*8 bits = 64 bits. With an 8-byte unsigned integer you can count up to 2^64 - 1; even if your application works continuously for years and years, it shouldn't reach a 20-digit number like 18446744070000000000 (if it does, what the hell are you counting, the molecules in the known universe?).
But, assuming you really have a concern that the ids might run out in a couple of years, perhaps you should be using UUIDs instead of integers.
Wikipedia states that "Only after generating 1 billion UUIDs every second for the next 100 years, the probability of creating just one duplicate would be about 50%".
UUIDs can be stored as BINARY(16) if you convert them into raw binary, as CHAR(32) if you strip the dashes or as CHAR(36) if you leave the dashes.
Out of the 16 bytes = 128 bits of data, UUIDs use 122 random bits plus 6 version/variant bits, or they are constructed using information about when and where they were created. This means it is safe to create billions of UUIDs on different computers, and the likelihood of a collision is overwhelmingly minuscule (as opposed to generating auto-incremented integers on different machines).
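For instance, a minimal sketch of the BINARY(16) option in MySQL. UUID(), REPLACE(), UNHEX() and HEX() are standard functions; the table and column names are made up (and note that MySQL's UUID() generates version-1, time-based values):
CREATE TABLE widgets (
    id BINARY(16) NOT NULL PRIMARY KEY,
    name VARCHAR(100)
);
-- store: strip the dashes, then pack the 32 hex digits into 16 raw bytes
INSERT INTO widgets (id, name)
VALUES (UNHEX(REPLACE(UUID(), '-', '')), 'example');
-- read back in a human-readable form
SELECT HEX(id) AS id_hex, name FROM widgets;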
With MySQL I often overlook options like signed/unsigned ints and allow-NULL, but I'm wondering whether these details could slow a web application down.
Are there any notable performance differences in these situations?
using a low/high range of Integer primary key
5000 rows with ids from 1 to 5000
5000 rows with ids from 20001 to 25000
Integer PK incrementing uniformly vs non-uniformly.
5000 rows with ids from 1 to 5000
5000 rows with ids scattered from 1 to 30000
Setting an Integer PK as unsigned vs. signed
example: where the gain in range of unsigned isn't actually needed
Setting a default value for a field (any type) vs. no default
example: update a row and all field data is given
Allow Null vs deny Null
example: updating a row and all field data is given
I'm using MySQL, but this is more of a general question.
From my understanding of B-trees (that's how relational databases are usually implemented, right?), these things should not make any difference. All you need is a fast comparison function on your key, and it usually doesn't matter what range of integers you use (unless you get out of the machine word size).
Of course, for keys, a uniform default value or allowing null doesn't make much sense. In all non-key fields, allowing null or providing default values should not have any significant impact.
5000 rows is almost nothing for a database. They normally use large B-trees for indexes, so they don't care much about the distribution of primary keys.
Generally, whether to use the other options should be based on what you need from the database application; they can't significantly affect the performance. So use a default value when you want a default value, and use a NOT NULL constraint when you don't want the column to be NULL.
If you have database performance issues, you should look for more important problems like missing indexes, slow queries that can be rewritten efficiently, making sure that the database has accurate statistics about the data so it can use indexes the right way (although this is an admin task).
using a low/high range of Integer primary key
* 5000 rows with ids from 1 to 5000
* 5000 rows with ids from 20001 to 25000
Does not make any difference.
Integer PK incrementing uniformly vs non-uniformly.
* 5000 rows with ids from 1 to 5000
* 5000 rows with ids scattered from 1 to 30000
If the distribution is uniform, this makes no difference.
A uniform distribution may help to build a more efficient random sampling query, as described in this article on my blog:
PostgreSQL 8.4: sampling random rows
It's distribution which matters, not bounds: 1, 11, 21, 31 is OK, 1, 2, 3, 31 is not.
Setting an Integer PK as unsigned vs. signed
* example: where the gain in range of unsigned isn't actually needed
If you declare PRIMARY KEY as UNSIGNED, MySQL can optimize out predicates like id >= -1
Setting a default value for a field (any type) vs. no default
* example: update a row and all field data is given
No difference.
Allow Null vs deny Null
* example: updating a row and all field data is given
Nullable columns are one byte larger: the index key for an INT NOT NULL is 4 bytes long, while that for an INT NULL is 5 bytes long.
Here's a funny thing I've found about MySQL: it has a 3-byte numeric type, MEDIUMINT, whose range is from -8388608 to 8388607. That seems strange to me. I thought the sizes of numeric types were chosen for better performance, so data should be aligned to a machine word or double word, and if we need restriction rules for numeric ranges, they ought to be external to the datatype. For example:
CREATE TABLE ... (
id INT RANGE(0, 500) PRIMARY KEY
)
So, does anyone know why 3 bytes? Is there any reason?
The reason is so that if you have a number that falls within a 3 byte range, you don't waste space by storing it using 4 bytes.
When you have twenty billion rows, it matters.
The alignment issue you mentioned applies mostly to data in RAM. Nothing forces MySQL to use 3 bytes to store that type as it processes it.
This might have a small advantage in using disk cache more efficiently though.
We frequently use TINYINT, SMALLINT, and MEDIUMINT for very significant space savings. Keep in mind, it also makes your indexes that much smaller.
This effect is magnified when you have really small join tables, like:
id1 smallint unsigned not null,
id2 mediumint unsigned not null,
primary key (id1, id2)
And then you have hundreds of millions or billions of records.
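To put rough numbers on it (my arithmetic, assuming one billion rows in such a join table): the primary key alone is 2 + 3 = 5 bytes per row instead of 4 + 4 = 8 bytes with two INTs, i.e. roughly 5 GB versus 8 GB before row overhead, and any secondary index that carries those ids shrinks by the same amount.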