I have created a database with a one-to-many relationship.
The parent table, say Master, has 2 columns, NodeId and NodeName; NodeId is the primary key and is of type int, the rest are of type varchar.
The child table, say Student, has 5 columns, NodeId, B, M, F, T; NodeId is the foreign key here.
None of the columns B, M, F, T is unique and they can contain NULL values, hence none of them has been defined as a primary key.
Assume the Student table has more than 2,000,000 rows.
My fetch query is
SELECT * FROM STUDENT WHERE NODEID = 1 AND B='1-123'
I would like to improve the speed of fetching. Any suggestion regarding improving the DB structure or an alternative fetch query would be really helpful, and any suggestion that can improve overall efficiency is most welcome.
If the foreign key column is not already indexed, adding indexes to NodeId in Student and to B should improve query performance, provided insert performance is not as big of an issue.
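For example (a sketch in MySQL syntax; the index names are mine):

-- separate single-column indexes, as suggested above
ALTER TABLE Student ADD INDEX idx_nodeid (NodeId);
ALTER TABLE Student ADD INDEX idx_b (B);
-- or a single composite index matching the WHERE clause exactly
ALTER TABLE Student ADD INDEX idx_nodeid_b (NodeId, B);

The composite index also covers lookups on NodeId alone (leftmost prefix), so if you create it the separate NodeId index becomes redundant.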
Update:
An index is essentially a way to keep your data sorted so that searches/queries are faster. It is good enough to just think of it as an ordered list.
An index is quite transparent, so your query would remain exactly the same.
A plain (non-unique) index allows rows with duplicate values in the indexed columns, so it should be fine here.
It is worth mentioning that a primary key column is indexed by default; however, a PK does not allow duplicate data.
Also, since the index keeps an ordering of your data, insertion time will increase a little, but if your dataset is big, query time should become much faster.
Related
I have a table:
Orders
* id INT NN AI PK
* userid INT NN
* is_open TINYINT NN DEFAULT 1
* amount INT NN
* desc VARCHAR(255)
and the query SELECT * FROM orders WHERE userid = ? AND is_open = 1; that I run frequently. I would like to optimize the database for this query, and I currently have two options:
Move closed orders (is_open = 0) to a different table, since open orders will at any given time be relatively few compared to closed orders, thereby minimizing the rows to scan on lookup
Set a unique key constraint: ALTER TABLE orders ADD CONSTRAINT UNIQUE KEY(id, userid);
I don't know how the latter will perform and I know the former will help performance but I don't know if it's a good approach in terms of best practices.
Any other ideas would be appreciated.
The table is of orders; there can be multiple open/closed orders for each userid.
WHERE userid = ? AND is_open = 1 would benefit from either of these 'composite' indexes: INDEX(userid, is_open) or INDEX(is_open, userid). The choice of which is better depends on what other queries might benefit from one more than the other.
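For example (a sketch in MySQL syntax; the index names are invented):

ALTER TABLE orders ADD INDEX idx_user_open (userid, is_open);
-- or, if other queries filter on is_open first:
ALTER TABLE orders ADD INDEX idx_open_user (is_open, userid);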
Moving "closed" orders to another table is certainly a valid option. And it will help performance. (I usually don't recommend it, only because of the clumsy code needed to move rows and/or to search both tables in the few cases where that is needed.)
I see no advantage with UNIQUE(id, userid). Presumably id is already "unique" because of being the PRIMARY KEY? Also, in a composite index, the first column will be checked first; that is what the PK is already doing.
Another approach... The AUTO_INCREMENT PK leads to the data BTree being roughly chronological. But you usually reach into the table by userid? To make that more efficient, change PRIMARY KEY(id), INDEX(userid) to PRIMARY KEY(userid, id), INDEX(id). (However... without knowing the other queries touching this table, I can't say whether this will provide much overall improvement.)
This might be even better:
PRIMARY KEY(userid, is_open, id), -- to benefit many queries
INDEX(id) -- to keep AUTO_INCREMENT happy
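If you go that route, the change could be made in a single statement, something like this sketch (assuming MySQL/InnoDB; note that changing the primary key rebuilds the whole table):

ALTER TABLE orders
    DROP PRIMARY KEY,
    ADD PRIMARY KEY (userid, is_open, id),
    ADD INDEX (id);   -- keeps the AUTO_INCREMENT column indexed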
The cost of an additional index (on the performance of write operations) is usually more than compensated for by the speedup of Selects.
Setting a unique index on id and userid will gain you nothing, since id is already uniquely indexed as the primary key and doesn't feature in your query anyway.
Moving closed orders to a different table will give some performance improvement, but since the closed orders are probably distributed throughout the table, that performance improvement won't be as great as you might expect. It also carries an administrative overhead, requiring that orders be moved periodically, and additional complications with reporting.
Your best solution is likely to be to add an index on userid so that MySQL can go straight to the required user's rows and search only those. You might get a further boost by indexing on (userid, is_open) instead, but the additional benefit is likely to be small.
Bear in mind that each additional index incurs a performance penalty on every table update. This won't be a problem if your table is not busy.
I'm not very experienced with SQL and I need some advice about the best way to set up a table that will contain appointments.
My doubt is about the primary key.
My ideas are:
1-Use an auto-increment column for the Id of the appointment (for example unsigned integer).
My doubts about this solution are: the id could eventually overflow, even though the limit is very high, and performance may decrease as the number of records grows.
2-Create a table for every year.
Doubts: it will be complex to maintain and to run queries against.
3-Use a composite index.
Doubts: how to set it up.
4-Other?
Thanks.
Use an auto-increment primary key. The table will grow beyond what MySQL can comfortably process long before your integer overflows.
MySQL's performance will go down on a large table even if you did not have a primary key. That is when you will start thinking about partitioning (your option 2) and archiving old data. But from the beginning, an auto-increment primary key on a single table should do just fine.
1 - Do you think you will exceed 4 billion rows? Performance degrades if you don't have suitable indexes for your queries, not because of table size. (Well, there is a slight degradation, but not worth worrying about.) Based on 182K/year, MEDIUMINT UNSIGNED (16M max) will suffice.
2 - NO! This is a common question; the answer is always "do not create identical tables".
3 - What column or combination of columns is UNIQUE for the table? Simply list them inside PRIMARY KEY (...); see the sketch at the end of this answer.
Number 3 is usually preferred. If there is no unique column(s), go with Number 1.
182K rows per year does not justify PARTITIONing. Consider it if you expect more than a million rows. (Here's an easy prediction: You will re-design this schema before 182K grows to a million.)
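For option 3, a sketch of what the composite key could look like; the column names here are purely hypothetical, the point is that whatever combination you list must genuinely be unique per appointment:

CREATE TABLE appointments (
    resource_id INT UNSIGNED NOT NULL,    -- hypothetical: the room/clinician being booked
    starts_at   DATETIME     NOT NULL,
    notes       VARCHAR(255),
    PRIMARY KEY (resource_id, starts_at)  -- valid only if this pair really is unique
) ENGINE=InnoDB;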
I have a large MySql table with over 11 million rows. This is just a huge data set and my task is to be able to analyze the dataset based on certain rules.
Each row belongs to a certain category. There are 2 million different categories. I want to get all rows for a category and perform operations on that.
So currently, I do the following:
Select distinct categories from the table.
For each category: SELECT fields FROM table WHERE category = <that category>
Even though my category column is indexed, it takes a really long time to execute Step 2. This is mainly because of the huge data set.
Alternatively, I could use a GROUP BY clause, but I am not sure it would be as fast, since GROUP BY on such a huge dataset may be expensive, especially considering that I will be running my analysis several times on parts of the dataset. A way to permanently keep the table sorted would be useful.
Therefore, as an alternative, I could speed up my queries if the table were pre-sorted by category. Then I could just read the table row by row and perform the same operations in a much faster time, as all rows of one category would be fetched consecutively.
The dataset (MySQL table) is fixed; no update, delete, or insert operations will be performed on it. So I want a way to maintain a default sort order by category. Can anyone suggest a trick to ensure the default sort order of the rows?
Maybe read all rows and rewrite them to a new table or add a new primary key which ensures this order?
Even though my category column is indexed
Indexed by a secondary index? If so, you can encounter the following performance problems:
InnoDB tables are always clustered, and a secondary index on a clustered table can require a double lookup (see "Disadvantages of clustering" in this article).
Indexed rows can be scattered all over the place (the index can have a bad clustering factor; the link is for Oracle but the principle is the same). If so, an index range scan (such as WHERE category = whatever) can end up loading many table pages, even though the index is actually used and only a small subset of rows is actually selected. This can destroy range-scan performance.
As an alternative to the secondary index, consider using a natural primary key, which in InnoDB tables also acts as the clustering key. A primary/clustering key such as {category, no} will keep the rows of the same category physically close together, making both of your queries (and especially the second one) maximally efficient.
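A minimal sketch of that idea; apart from category, the column names are stand-ins for your real ones, and "no" is just a per-category sequence number that makes the key unique:

CREATE TABLE rows_by_category (
    category INT NOT NULL,
    no       INT NOT NULL,        -- per-category sequence number
    payload  VARCHAR(255),        -- stands in for the real data columns
    PRIMARY KEY (category, no)    -- InnoDB clusters the rows by this key
) ENGINE=InnoDB;

Reloading the existing data into such a table gives you the physically ordered layout you are after, since InnoDB stores rows in primary-key order.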
OTOH, if you want to keep the secondary index, consider covering all the fields that you query, so the primary B-Tree doesn't have to be touched at all.
You can do this in one step regardless of indexing by doing something like (pseudo code):
Declare #LastCategory int = Null
Declare #Category int

For Each Row In
    Select #Category = Category, ...
    From Table
    Order By Category

    If #LastCategory Is Null Or #LastCategory != #Category
        Do any "New Category Steps"
        Set #LastCategory = #Category
    End

    Process Row
End For
With the index on category I'd expect this to perform OK. Your performance issues may be down to what you are doing when processing each row.
Here's an example: http://sqlfiddle.com/#!2/e53c98/1
I have a query of the following form:
SELECT * FROM MyTable WHERE Timestamp > [SomeTime] AND Timestamp < [SomeOtherTime]
I would like to optimize this query, and I am thinking about putting an index on timestamp, but am not sure if this would help. Ideally I would like to make timestamp a clustered index, but MySQL does not support clustered indexes, except for primary keys.
MyTable has 4 million+ rows.
Timestamp is actually of type INT.
Once a row has been inserted, it is never changed.
The number of rows with any given Timestamp is on average about 20, but could be as high as 200.
Newly inserted rows have a Timestamp that is greater than most of the existing rows, but could be less than some of the more recent rows.
Would an index on Timestamp help me to optimize this query?
No question about it. Without the index, your query has to look at every row in the table. With the index, the query will be pretty much instantaneous as far as locating the right rows goes. The price you'll pay is a slight performance decrease on inserts, but that really will be slight.
You should definitely use an index. MySQL has no clue what order those timestamps are in, and in order to find a record for a given timestamp (or timestamp range) it needs to look through every single record. And with 4 million of them, that's quite a bit of time! Indexes are your way of telling MySQL about your data -- "I'm going to look at this field quite often, so keep a list of where I can find the records for each value."
Indexes in general are a good idea for regularly queried fields. The only downside to defining indexes is that they use extra storage space, so unless you're really tight on space, you should use them. If they don't apply, MySQL will just ignore them anyway.
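For example (a sketch using the names from the question; the index name is arbitrary):

ALTER TABLE MyTable ADD INDEX idx_timestamp (Timestamp);

MySQL can then satisfy the two Timestamp comparisons with an index range scan instead of a full table scan.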
I don't disagree with the importance of indexing to improve SELECT query times, but if you can index on other keys (and form your queries with those indexes), indexing on timestamp may not be necessary.
For example, if you have a table with timestamp, category, and userId, it may be better to create an index on userId instead. In a table with many different users, this will considerably reduce the remaining set on which to search for the timestamp.
...and if I'm not mistaken, the advantage of this would be avoiding the overhead of maintaining the timestamp index on each insertion; in a table with high insertion rates and highly unique timestamps this could be an important consideration.
I'm struggling with the same problems of indexing based on timestamps and other keys. I still have testing to do so I can put proof behind what I say here. I'll try to post back with my results.
A scenario for better explanation:
timestamp 99% unique
userId 80% unique
category 25% unique
Indexing on timestamp will quickly reduce query results to 1% of the table size
Indexing on userId will quickly reduce query results to 20% of the table size
Indexing on category will quickly reduce query results to 75% of the table size
Insertion with indexes on timestamp will have high overhead **
Even though we know that our insertions will come with ever-increasing timestamps, I don't see any discussion of MySQL optimisations based on incrementally increasing keys.
Insertion with indexes on userId will have reasonably high overhead.
Insertion with indexes on category will have reasonably low overhead.
** I'm sorry, I don't know the actual overhead of insertion with indexing.
If your queries are mainly using this timestamp, you could test this design (enlarging the Primary Key with the timestamp as first part):
CREATE TABLE perf (
    ts INT NOT NULL
  , oldPK ...                 -- your existing primary key column(s)
  , ... other columns
  , PRIMARY KEY (ts, oldPK)
  , UNIQUE (oldPK)
) ENGINE=InnoDB;
This will ensure that the queries like the one you posted will be using the clustered (primary) key.
Disadvantage is that your inserts will be a bit slower. Also, if you have other indices on the table, they will use a bit more space (as they will include the 4-byte-wider primary key).
The biggest advantage of such a clustered index is that queries with big range scans, e.g. queries that have to read large parts of the table or the whole table, will find the related rows sequentially and in the wanted order (BY timestamp), which is also useful if you want to group by day, week, month or year.
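For example, a monthly rollup would then read the rows in clustered order (a sketch, assuming ts holds Unix-epoch seconds):

SELECT FROM_UNIXTIME(ts, '%Y-%m') AS yr_month, COUNT(*) AS cnt
FROM perf
GROUP BY yr_month
ORDER BY yr_month;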
The old PK can still be used to identify rows by keeping a UNIQUE constraint on it.
You may also want to have a look at TokuDB, a MySQL (and open source) variant that allows multiple clustered indices.
Assume that I have one big table with three columns: "user_name", "user_property", "value_of_property". Let's also assume that I have a lot of users (say 100,000) and a lot of properties (say 10,000). Then the table is going to be huge (1 billion rows).
When I extract information from the table I always need information about a particular user. So I use, for example, WHERE user_name = 'Albert Gates'. So, every time, the MySQL server needs to scan 1 billion rows to find those of them which contain "Albert Gates" as user_name.
Would it not be wise to split the big table into many small ones corresponding to fixed users?
No, I don't think that is a good idea. A better approach is to add an index on the user_name column, and perhaps another index on (user_name, user_property) for looking up a single property. Then the database does not need to scan all the rows; it just needs to find the appropriate entry in the index, which is stored in a B-Tree, making it possible to find a record in a very small amount of time.
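A sketch of those indexes (big_table here is a placeholder for the actual table name, and the index names are invented):

ALTER TABLE big_table
    ADD INDEX idx_user (user_name),
    ADD INDEX idx_user_prop (user_name, user_property);

Note that the composite index can also serve lookups on user_name alone (leftmost prefix), so in practice the single-column index may turn out to be redundant.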
If your application is still slow even after correctly indexing it can sometimes be a good idea to partition your largest tables.
One other thing you could consider is normalizing your database so that the user_name is stored in a separate table and an integer foreign key is used in its place. This can reduce storage requirements and increase performance. The same may apply to user_property.
You should normalise your design as follows:
drop table if exists users;
create table users
(
user_id int unsigned not null auto_increment primary key,
username varbinary(32) unique not null
)
engine=innodb;
drop table if exists properties;
create table properties
(
property_id smallint unsigned not null auto_increment primary key,
name varchar(255) unique not null
)
engine=innodb;
drop table if exists user_property_values;
create table user_property_values
(
user_id int unsigned not null,
property_id smallint unsigned not null,
value varchar(255) not null,
primary key (user_id, property_id),
key (property_id)
)
engine=innodb;
insert into users (username) values ('f00'),('bar'),('alpha'),('beta');
insert into properties (name) values ('age'),('gender');
insert into user_property_values values
(1,1,'30'),(1,2,'Male'),
(2,1,'24'),(2,2,'Female'),
(3,1,'18'),
(4,1,'26'),(4,2,'Male');
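With that schema, pulling everything for one user becomes a single clustered primary-key range lookup, for example:

select u.username, p.name as property, upv.value
from user_property_values upv
inner join users u on u.user_id = upv.user_id
inner join properties p on p.property_id = upv.property_id
where upv.user_id = 1;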
From a performance perspective the innodb clustered index works wonders in this similar example (COLD run):
select count(*) from product
count(*)
========
1,000,000 (1M)
select count(*) from category
count(*)
========
250,000 (250K)
select count(*) from product_category
count(*)
========
125,431,192 (125M)
select
c.*,
p.*
from
product_category pc
inner join category c on pc.cat_id = c.cat_id
inner join product p on pc.prod_id = p.prod_id
where
pc.cat_id = 1001;
0:00:00.030: Query OK (0.03 secs)
Properly indexing your database will be the number one way of improving performance. I once had a query take half an hour (on a large dataset, but nonetheless). Then we came to find out that the tables had no indexes. Once indexed, the query took less than 10 seconds.
Why do you need to have this table structure? My fundamental problem is that you are going to have to cast the data in value_of_property every time you want to use it. That is bad in my opinion; also, storing numbers as text is crazy given that it's all binary anyway. For instance, how are you going to have required fields? Or fields that need to have constraints based on other fields, e.g. start and end dates?
Why not simply have the properties as fields rather than some many to many relationship?
Have one flat table. When your business rules begin to show that properties should be grouped, then you can consider moving them out into other tables and having several 1:0-1 relationships with the users table. But this is not normalization and it will degrade performance slightly due to the extra join (however, the self-documenting nature of the table names will greatly aid any developers).
One way I regularly see database performance get totally castrated is by having a generic
Id, Property Type, Property Name, Property Value table.
This is really lazy but exceptionally flexible, and it totally kills performance. In fact, on a new job where performance is bad, I actually ask if they have a table with this structure; it invariably becomes the center point of the database and is slow. The whole point of relational database design is that the relations are determined ahead of time. This is simply a technique that aims to speed up development at a huge cost to application speed. It also puts a huge reliance on business logic in the application layer to behave, which is not defensive at all. Eventually you find that you want to use properties in a key relationship, which leads to all kinds of casting on the join and further degrades performance.
If data has a 1:1 relationship with an entity then it should be a field on the same table. If your table gets to more than 30 fields wide then consider moving them into another table, but don't call it normalisation, because it isn't. It is a technique to help developers group fields together, at the cost of performance, in an attempt to aid understanding.
I don't know if MySQL has an equivalent, but SQL Server 2008 has sparse columns, where NULL values take no space.
Sparse column datatypes
I'm not saying an EAV approach is always wrong, but I think using a relational database for this approach is probably not the best choice.