I need to create a table which saves measurements consisting of a device id (int), logdate (datetime) and a value (decimal) (SQL Server 2008). The measurements are always on the quarter e.g. 00:00, 00:15, 00:30, 00:45, 01:00, 01:15... so I was thinking that an int defining the amount of quarters since a certain date would be result in better performance than a datetime.
Retrieving would usually be done using the following:
-where DeviceId = x and QuarterNumber between a and b
-where DeviceId in (x, y, ...) and QuarterNumber between a and b
-where DeviceId = x and QuarterNumber = a
What would be the best design for this table?
PK DeviceId int
PK QuarterNumber int
Value int
or
PK MeasurementId int
UQ QuarterNumber int
UQ DeviceId int
Value int
(UQ=unique index)
or something totally different?
Thanks!
You might get marginally better SELECT performance by defining the number of quarter hours since a certain date if you have many millions of rows.
Personally, I don't think the marginal performance gain will be worth the reduced readability. I also wouldn't like basing the design on a quarter-hour assumption. (In my experience, that kind of requirement often changes over time.) You could include a quarter-hour CHECK constraint now on a datetime column now, and drop it later if that requirement changes.
But there's no point in relying on opinion when you can test and measure. Build three tables, load several million rows of sample data, and study the query plans. (It's not completely impractical to load 50 million rows into each table. I've sometimes loaded 20 million rows into a test table when answering a question on SO.) Don't assume that your first try at indexing will be optimal. Consider multiple indexes, and consider a multi-column index, too.
I dont think there can be any specific guidelines for your criteria. You might need to create and test(you can insert a demo data in each). Since you want performance improvement I would suggest the use of index in your table.
Related
We have a database table which stores browser data for visitors, broken down by multiple different subtypes. For simplicity, let's use the table schema below. The querying will basically be on any single id column, the metric column, the timestamp column (stored as seconds since epoch), and one of the device, browser, or os columns.
We are going to performance test the star vs snowflake schema (where all of the ids go into a single column, but then an additional column id_type is added to determine which type of identifier it is) for this table, but as long as the star schema (which is how it is now) is within 80% of the snowflake performance, we are going to keep it since it will make our load process much easier. Before I do that however, I want to make sure the indexes are optimized on the star schema.
create table browser_data (
id_1 int,
id_2 int,
id_3 int,
id_4 int,
metric varchar(20),
browser varchar(20),
device varchar(20),
os varchar(20),
timestamp bigint
)
Would it be better to create individual indexes on just the id columns, or also include the metric and timestamp columns in those indexes as well?
Do not normalize "continuous" values, such as DATETIME, FLOAT, INT. Do leave the values in the main table.
When you move the value to other table(s), especially "snowflake", it makes querying based on the values somewhere between a little slower and a lot slower. This especially happens when you need to filter on more than one metric that is not in the main table. Either of these perform very poorly because of "snowflake" or "over-normalization":
WHERE a.x = 123 AND b.y = 345
ORDER BY a.x, b.y
As for what indexes to create -- that depends entirely on the queries you need to perform. So, I strongly recommend you sketch out the likely SELECTs based on your tentative CREATE TABLEs.
INT is 4 bytes. TIMESTAMP is 5, FLOAT is 4, etc. That is, normalizing such things are also inefficient on space.
More
When doing JOINs, the Optimizer will almost always start with one table, then move on to the another table, etc. (See "Nested Loop Join".)
For example (building on the above 'code'), when 2 columns are normalized, and you are testing on the values, you do not have two ids in hand, you only have the two values. This makes the query execution very inefficient. For
SELECT ...
FROM main
JOIN a USING(a_id)
JOIN b USING(b_id)
WHERE a.x = 123 AND b.y = 345
The following is very likely to be the 'execution plan':
Reach into a to find the row(s) with x=123; get the id(s) for those rows. This may include many rows that are yet to be filtered by b.y. a needs INDEX(x)
Go back to the main table, looking up rows with those id(s). main needs INDEX(a_id). Again, more rows than necessary may be hauled around.
Only now, do you get to b (using b_id) to check for y=345; toss the unnecessary rows you have been hauling around. b needs INDEX(b_id)
Note my comment about "haul around". Blindly using * (in SELECT *) adds to the problem -- all the columns are being hauled around while performing the steps.
On the other hand... If x and y were in the main table, then the code works like:
WHERE main.x = 123
AND main.y = 345
only needs INDEX(x,y) (in either order). And it quickly locates exactly the rows desired.
In the case of ORDER BY a.x, b.y, it cannot use any index on any table. So the query must create a tmp table, sort it, then deliver the rows in the desired order.
But if x and y are in the same table, then INDEX(x,y) (in that order) may be useful for ORDER BY x,y and avoid the tmp table and the sort.
With a single table, the Optimizer might use an index for WHERE, or it might use an index for ORDER BY, depending on the phase of the moon. In some cases, one index can be used for both -- this is optimal.
Another note: If you also have LIMIT 10,... If the sort is avoided, then only 10 rows need to be looked at, not the entire set from the WHERE.
The application we are developing is writing around 4-5 millions rows of data every day. And, we need to save these data for the past 90 days.
The table user_data has the following structure (simplified):
id INT PRIMARY AUTOINCREMENT
dt TIMESTAMP CURRENT_TIMESTAMP
user_id varchar(20)
data varchar(20)
About the application:
Data that is older than 7 days old will not be written / updated.
Data is mostly accessed based on user_id (i.e. all queries will have WHERE user_id = XXX)
There are around 13000 users at the moment.
User can still access older data. But, in accessing the older data, we can restrict that he/she can only get the whole day data only and not a time range. (e.g. If a user attempts to get the data for 2016-10-01, he/she will get the data for the whole day and will not be able to get the data for 2016-10-01 13:00 - 2016-10-01 14:00).
At the moment, we are using MySQL InnoDB to store the latest data (i.e. 7 days and newer) and it is working fine and fits in the innodb_buffer_pool.
As for the older data, we created smaller tables in the form of user_data_YYYYMMDD. After a while, we figured that these tables cannot fit into the innodb_buffer_pool and it started to slow down.
We think that separating / sharding based on dates, sharding based on user_ids would be better (i.e. using smaller data sets based on user and dates such as user_data_[YYYYMMDD]_[USER_ID]). This will keep the table in much smaller numbers (only around 10K rows at most).
After researching around, we have found that there are a few options out there:
Using mysql tables to store per user per date (i.e. user_data_[YYYYMMDD]_[USER_ID]).
Using mongodb collection for each user_data_[YYYYMMDD]_[USER_ID]
Write the old data (json encoded) into [USER_ID]/[YYYYMMDD].txt
The biggest con I see in this is that we will have huge number of tables/collections/files when we do this (i.e. 13000 x 90 = 1.170.000). I wonder if we are approaching this the right way in terms of future scalability. Or, if there are other standardized solutions for this.
Scaling a database is an unique problem to the application. Most of the times someone else's approach cannot be used as almost all applications writes its data in its own way. So you have to figure out how you are going to manage your data.
Having said that, if your data continue to grow, best solution is the shadring where you can distribute the data across different servers. As long as bound to a single server like creating different tables you are getting hit by resource limits like memory, storage and processing power. Those cannot be increased unlimited manner.
How to distribute the data, that you have to figure out based on your business use cases. As you mentioned, if you are not getting more request on old data, the best way to distribute the data base on date. Like DB for 2016 data, DB for 2015 and so on. Later you may purge or shutdown the servers which you have more old data.
This is a big table, but not unmanageable.
If user_id + dt is UNIQUE, make it the PRIMARY KEY, and get rid if id, thereby saving space. (More in a minute...)
Normalize user_id to a SMALLINT UNSIGNED (2 bytes) or, to be safer MEDIUMINT UNSIGNED (3 bytes). This will save a significant amount of space.
Saving space is important for speed (I/O) for big tables.
PARTITION BY RANGE(TO_DAYS(dt))
with 92 partitions -- the 90 you need, plus 1 waiting to be DROPped and one being filled. See details here .
ENGINE=InnoDB
to get the PRIMARY KEY clustered.
PRIMARY KEY(user_id, dt)
If this is "unique", then it allows efficient access for any time range for a single user. Note: you can remove the "just a day" restriction. However, you must formulate the query without hiding dt in a function. I recommend:
WHERE user_id = ?
AND dt >= ?
AND dt < ? + INTERVAL 1 DAY
Furthermore,
PRIMARY KEY(user_id, dt, id),
INDEX(id)
Would also be efficient even if (user_id, dt) is not unique. The addition of id to the PK is to make it unique; the addition of INDEX(id) is to keep AUTO_INCREMENT happy. (No, UNIQUE(id) is not required.)
INT --> BIGINT UNSIGNED ??
INT (which is SIGNED) will top out at about 2 billion. That will happen in a very few years. Is that OK? If not, you may need BIGINT (8 bytes vs 4).
This partitioning design does not care about your 7-day rule. You may choose to keep the rule and enforce it in your app.
BY HASH
will not work as well.
SUBPARTITION
is generally useless.
Are there other queries? If so they must be taken into consideration at the same time.
Sharding by user_id would be useful if the traffic were too much for a single server. MySQL, itself, does not (yet) have a sharding solution.
Try TokuDB engine at https://www.percona.com/software/mysql-database/percona-tokudb
Archive data are great for TokuDB. You will need about six times less disk space to store AND memory to PROCESS your dataset compared to InnoDB or about 2-3 times less than archived myisam.
1 million+ tables sounds like a bad idea. Having sharding via dynamic table naming by the app code at runtime has also not been a favorable pattern for me. My first go-to for this type of problem would be partitioning. You probably don't want 400M+ rows in a single unpartitioned table. In MySQL 5.7 you can even subpartition (but that gets more complex). I would first range partition on your date field, with one partition per day. Index on the user_id. If you are on 5.7 and want to dabble with subpartitioning, I would suggest range partition by date, then hash subpartition by user_id. As a starting point, try 16 to 32 hash buckets. Still index the user_id field.
EDIT: Here's something to play with:
CREATE TABLE user_data (
id INT AUTO_INCREMENT
, dt TIMESTAMP DEFAULT CURRENT_TIMESTAMP
, user_id VARCHAR(20)
, data varchar(20)
, PRIMARY KEY (id, user_id, dt)
, KEY (user_id, dt)
) PARTITION BY RANGE (UNIX_TIMESTAMP(dt))
SUBPARTITION BY KEY (user_id)
SUBPARTITIONS 16 (
PARTITION p1 VALUES LESS THAN (UNIX_TIMESTAMP('2016-10-25')),
PARTITION p2 VALUES LESS THAN (UNIX_TIMESTAMP('2016-10-26')),
PARTITION p3 VALUES LESS THAN (UNIX_TIMESTAMP('2016-10-27')),
PARTITION p4 VALUES LESS THAN (UNIX_TIMESTAMP('2016-10-28')),
PARTITION pMax VALUES LESS THAN MAXVALUE
);
-- View the metadata if you're interested
SELECT * FROM information_schema.partitions WHERE table_name='user_data';
Let's consider this strange situation,
where there's a redundancy of indexes.
TableA (item_id, code_key, data01, ... data0n)
TableB (item_id, code_key, dataA1, ... dataAn)
Both item_id and code_key are unique and they could be a primary key in both tables. item_id or code_key could be removed from both tables without losing any reference/relation.
It's redundant I know, but this is not the point of the question.
Consider that, both columns are indexed.
Item_id is a INT, codeKey is a VARCHAR(100).
Someone is suggesting that's better querying:
select * from TableA INNER JOIN TableB USING(item_id)
rather than :
select * from TableA INNER JOIN TableB USING(code_key)
I don't see the point of it since both columns are indexed and the performance would be the same.... isn't it?
Is it that having a INT would be faster than having a VARCHAR in the ON clause?Even if they're both indexed?
Int comparisons are faster than varchar comparisons, for the simple
fact that ints take up much less space than varchars.
This holds true both for unindexed and indexed access. The fastest way
to go is an indexed int column.
-- #Robert Munteanu
Hope that helps. There are no much differences, but we value the speed performance. The longer is the varchar the slower it gets.
You seem to be asking about having two columns for the same information. That is almost always frowned on.
Moving on... Should you have an INT or a VARCHAR...
Fetching a row costs a lot more (even if cached) than anything to do with the individual columns. So, while VARCHAR might be more costly than INT, it is not enough more costly to warrant going out of your way to make the change just for that reason.
The same argument goes for complexity of expressions.
In a related vein, there are multiple reasons for using an ENUM instead of a VARCHAR when appropriate. (Ditto for changing a VARCHAR into a TINYINT.)
Smaller --> faster, especially if I/O-bound.
If indexed, then the index(es) are smaller, too.
Less disk space
"Normalization" is a deliberate attempt to replace a VARCHAR by some size of INT. But there are multiple reasons for that.
Only one place to change the string, not many rows in many tables. If this reason exists, then it trumps other considerations.
Space savings.
But it adds complexity (now need to JOIN). Hence speed may or may not be improved.
When picking INT, always pick the smallest flavor. INT takes 4 bytes; MEDIUMINT - 3 bytes, etc. And pick it based on the range. And use UNSIGNED usually.
I have a query of the following form:
SELECT * FROM MyTable WHERE Timestamp > [SomeTime] AND Timestamp < [SomeOtherTime]
I would like to optimize this query, and I am thinking about putting an index on timestamp, but am not sure if this would help. Ideally I would like to make timestamp a clustered index, but MySQL does not support clustered indexes, except for primary keys.
MyTable has 4 million+ rows.
Timestamp is actually of type INT.
Once a row has been inserted, it is never changed.
The number of rows with any given Timestamp is on average about 20, but could be as high as 200.
Newly inserted rows have a Timestamp that is greater than most of the existing rows, but could be less than some of the more recent rows.
Would an index on Timestamp help me to optimize this query?
No question about it. Without the index, your query has to look at every row in the table. With the index, the query will be pretty much instantaneous as far as locating the right rows goes. The price you'll pay is a slight performance decrease in inserts; but that really will be slight.
You should definitely use an index. MySQL has no clue what order those timestamps are in, and in order to find a record for a given timestamp (or timestamp range) it needs to look through every single record. And with 4 million of them, that's quite a bit of time! Indexes are your way of telling MySQL about your data -- "I'm going to look at this field quite often, so keep an list of where I can find the records for each value."
Indexes in general are a good idea for regularly queried fields. The only downside to defining indexes is that they use extra storage space, so unless you're real tight on space, you should try to use them. If they don't apply, MySQL will just ignore them anyway.
I don't disagree with the importance of indexing to improve select query times, but if you can index on other keys (and form your queries with these indexes), the need to index on timestamp may not be needed.
For example, if you have a table with timestamp, category, and userId, it may be better to create an index on userId instead. In a table with many different users this will reduce considerably the remaining set on which to search the timestamp.
...and If I'm not mistaken, the advantage of this would be to avoid the overhead of creating the timestamp index on each insertion -- in a table with high insertion rates and highly unique timestamps this could be an important consideration.
I'm struggling with the same problems of indexing based on timestamps and other keys. I still have testing to do so I can put proof behind what I say here. I'll try to postback based on my results.
A scenario for better explanation:
timestamp 99% unique
userId 80% unique
category 25% unique
Indexing on timestamp will quickly reduce query results to 1% the table size
Indexing on userId will quickly reduce query results to 20% the table size
Indexing on category will quickly reduce query results to 75% the table size
Insertion with indexes on timestamp will have high overhead **
Despite our knowledge that our insertions will respect the fact of have incrementing timestamps, I don't see any discussion of MySQL optimisation based on incremental keys.
Insertion with indexes on userId will reasonably high overhead.
Insertion with indexes on category will have reasonably low overhead.
** I'm sorry, I don't know the calculated overhead or insertion with indexing.
If your queries are mainly using this timestamp, you could test this design (enlarging the Primary Key with the timestamp as first part):
CREATE TABLE perf (
, ts INT NOT NULL
, oldPK
, ... other columns
, PRIMARY KEY(ts, oldPK)
, UNIQUE (oldPK)
) ENGINE=InnoDB ;
This will ensure that the queries like the one you posted will be using the clustered (primary) key.
Disadvantage is that your Inserts will be a bit slower. Also, If you have other indices on the table, they will be using a bit more space (as they will include the 4-bytes wider primary key).
The biggest advantage of such a clustered index is that queries with big range scans, e.g. queries that have to read large parts of the table or the whole table will find the related rows sequentially and in the wanted order (BY timestamp), which will also be useful if you want to group by day or week or month or year.
The old PK can still be used to identify rows by keeping a UNIQUE constraint on it.
You may also want to have a look at TokuDB, a MySQL (and open source) variant that allows multiple clustered indices.
Assume that I have one big table with three columns: "user_name", "user_property", "value_of_property". Lat's also assume that I have a lot of user (let say 100 000) and a lot of properties (let say 10 000). Then the table is going to be huge (1 billion rows).
When I extract information from the table I always need information about a particular user. So, I use, for example where user_name='Albert Gates'. So, every time the mysql server needs to analyze 1 billion lines to find those of them which contain "Albert Gates" as user_name.
Would it not be wise to split the big table into many small ones corresponding to fixed users?
No, I don't think that is a good idea. A better approach is to add an index on the user_name column - and perhaps another index on (user_name, user_property) for looking up a single property. Then the database does not need to scan all the rows - it just need to find the appropriate entry in the index which is stored in a B-Tree, making it easy to find a record in a very small amount of time.
If your application is still slow even after correctly indexing it can sometimes be a good idea to partition your largest tables.
One other thing you could consider is normalizing your database so that the user_name is stored in a separate table and use an integer foriegn key in its place. This can reduce storage requirements and can increase performance. The same may apply to user_property.
you should normalise your design as follows:
drop table if exists users;
create table users
(
user_id int unsigned not null auto_increment primary key,
username varbinary(32) unique not null
)
engine=innodb;
drop table if exists properties;
create table properties
(
property_id smallint unsigned not null auto_increment primary key,
name varchar(255) unique not null
)
engine=innodb;
drop table if exists user_property_values;
create table user_property_values
(
user_id int unsigned not null,
property_id smallint unsigned not null,
value varchar(255) not null,
primary key (user_id, property_id),
key (property_id)
)
engine=innodb;
insert into users (username) values ('f00'),('bar'),('alpha'),('beta');
insert into properties (name) values ('age'),('gender');
insert into user_property_values values
(1,1,'30'),(1,2,'Male'),
(2,1,'24'),(2,2,'Female'),
(3,1,'18'),
(4,1,'26'),(4,2,'Male');
From a performance perspective the innodb clustered index works wonders in this similar example (COLD run):
select count(*) from product
count(*)
========
1,000,000 (1M)
select count(*) from category
count(*)
========
250,000 (500K)
select count(*) from product_category
count(*)
========
125,431,192 (125M)
select
c.*,
p.*
from
product_category pc
inner join category c on pc.cat_id = c.cat_id
inner join product p on pc.prod_id = p.prod_id
where
pc.cat_id = 1001;
0:00:00.030: Query OK (0.03 secs)
Properly indexing your database will be the number 1 way of improving performance. I once had a query take a half an hour (on a large dataset, but none the less). Then we come to find out that the tables had no index. Once indexed the query took less than 10 seconds.
Why do you need to have this table structure. My fundemental problem is that you are going to have to cast the data in value of property every time you want to use it. That is bad in my opinion - also storing numbers as text is crazy given that its all binary anyway. For instance how are you going to have required fields? Or fields that need to have constraints based on other fields? Eg start and end date?
Why not simply have the properties as fields rather than some many to many relationship?
have 1 flat table. When your business rules begin to show that properties should be grouped then you can consider moving them out into other tables and have several 1:0-1 relationships with the users table. But this is not normalization and it will degrade performance slightly due to the extra join (however the self documenting nature of the table names will greatly aid any developers)
One way i regularly see databqase performance get totally castrated is by having a generic
Id, property Type, Property Name, Property Value table.
This is really lazy but exceptionally flexible but totally kills performance. In fact on a new job where performance is bad i actually ask if they have a table with this structure - it invariably becomes the center point of the database and is slow. The whole point of relational database design is that the relations are determined ahead of time. This is simply a technique that aims to speed up development at a huge cost to application speed. It also puts a huge reliance on business logic in the application layer to behave - which is not defensive at all. Eventually you find that you wan to use properties in a key relationsip which leads to all kinds of casting on the join which further degrades performance.
If data has a 1:1 relationship with an entity then it should be a field on the same table. If your table gets to more than 30 fields wide then consider movign them into another table but dont call it normalisation because it isnt. It is a technique to help developers group fields together at the cost of performance in an attempt to aid understanding.
I don't know if mysql has an equivalent but sqlserver 2008 has sparse columns - null values take no space.
SParse column datatypes
I'm not saying a EAV approach is always wrong, but i think using a relational database for this approach is probably not the best choice.