Segregating data or using UNIQUE index for optimization - mysql

I have a table:
Orders
* id INT NN AN PK
* userid INT NN
* is_open TINYINT NN DEFAULT 1
* amount INT NN
* desc VARCHAR(255)
and the query SELECT * FROM orders WHERE userid = ? AND is_open = 1; that I run frequently. I would like to optimize the database for this query, and I currently have two options:
Move closed orders (is_open = 0) to a different table, since open orders will always be far fewer than closed orders, thereby minimizing the rows to scan on lookup
Set a unique key constraint: ALTER TABLE orders ADD CONSTRAINT UNIQUE KEY(id, userid);
I don't know how the latter will perform and I know the former will help performance but I don't know if it's a good approach in terms of best practices.
Any other ideas would be appreciated.

The table is of orders; there can be multiple open/closed orders for each userid.
WHERE userid = ? AND is_open = 1 would benefit from either of these 'composite' indexes: INDEX(userid, is_open) or INDEX(is_open, userid). The choice of which is better depends on what other queries might benefit from one more than the other.
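A minimal sketch of adding one of those indexes (using the table and column names from the question; the index name is arbitrary):
ALTER TABLE orders
  ADD INDEX idx_userid_open (userid, is_open);
With that in place, the frequent query only has to look at the index entries for the given userid that are still open, rather than scanning the whole table.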
Moving "closed" orders to another table is certainly a valid option. And it will help performance. (I usually don't recommend it, only because of the clumsy code needed to move rows and/or to search both tables in the few cases where that is needed.)
I see no advantage with UNIQUE(id, userid). Presumably id is already "unique" because of being the PRIMARY KEY? Also, in a composite index, the first column will be checked first; that is what the PK is already doing.
Another approach... The AUTO_INCREMENT PK leads to the data BTree being roughly chronological. But you usually reach into the table by userid? To make that more efficient, change PRIMARY KEY(id), INDEX(userid) to PRIMARY KEY(userid, id), INDEX(id). (However... without knowing the other queries touching this table, I can't say whether this will provide much overall improvement.)
This might be even better:
PRIMARY KEY(userid, is_open, id), -- to benefit many queries
INDEX(id) -- to keep AUTO_INCREMENT happy
The cost of an additional index (on the performance of write operations) is usually more than compensated for by the speedup of Selects.
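If you decided to adopt the restructured primary key above, a hedged sketch of applying it in a single statement might look like this (test it on a copy first, since changing the PK rebuilds the entire clustered data BTree):
ALTER TABLE orders
  DROP PRIMARY KEY,
  ADD PRIMARY KEY (userid, is_open, id),
  ADD INDEX (id);   -- keeps the AUTO_INCREMENT column covered by an index, as MySQL requires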

Setting a unique index on id and userid will gain you nothing, since the id is already uniquely indexed as the primary key and doesn't feature in your query anyway.
Moving closed orders to a different table will give some performance improvement, but since the closed orders are probably distributed throughout the table, that improvement won't be as great as you might expect. It also carries an administrative overhead, requiring that orders be moved periodically, and it complicates reporting.
Your best solution is likely to be to add an index on userid so that MySQL can go straight to the required userid and search only those rows. You might get a further boost by indexing on (userid, is_open) instead, but the additional benefit is likely to be small.
Bear in mind that each additional index incurs a performance penalty on every table update. This won't be a problem if your table is not busy.

Related

MYSQL index optimization for table that stores relationship between 2 other tables

My question is regarding database structuring for a table that links 2 other tables for storing the relationship.
For example, I have 3 tables: users, locations, and users_locations.
The users and locations tables both have an id column.
The users_locations table has the user_id and location_id from the other 2 tables.
How do you define your indexes/constraints on these tables to efficiently answer questions such as what locations does this user have, or what users belong to this location?
e.g.
select user_id from users_locations where location_id = 5;
or
select location_id from users_locations where user_id = 5;
Currently, I do not have a foreign key constraint set, which I assume I should add, but does that automatically speed up the queries or create an index?
I don't think I can create an index on each column since there will be duplicates, e.g. multiple user_id entries for each location, and vice versa.
Will adding a composite key like PRIMARY_KEY (user_id, location_id) speed up queries when most queries only have half of the key?
Is there any reason to just set an AUTO INCREMENT PRIMARY_KEY field on this table when you will never query by that id?
Do I really even need to set a PRIMARY KEY?
Basically, for any table, the decision to create an index or not depends entirely on the use cases you need to support. Indexes should always be created on a per-use basis, not as a nice-to-have.
For the particular queries you have mentioned, separate indexes on both columns are good enough — the query doesn't even need to reach into the rows to fetch the information.
Creating a foreign key on a table column automatically creates an index, so you need not create indexes yourself if you decide to set up foreign keys.
If you keep an auto-increment key as the primary key, you will still have to make the (user_id, location_id) combination unique, otherwise you will bloat your table with duplicates. So keeping a separate auto-increment key doesn't make sense in your use case. However, if you want to keep track of each visit to a location and record it every time, then an auto-increment primary key would be required.
However, I would like to point out that creating indexes does not guarantee that your queries will use them unless you specify an index explicitly. For a single query there can be many execution plans, and the most efficient one may not use an index.
The optimal indexes for a many-to-many mapping table:
PRIMARY KEY (aid, bid),
INDEX(bid, aid)
More discussion and more tips: http://mysql.rjweb.org/doc.php/index_cookbook_mysql#many_to_many_mapping_table
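As a rough sketch of that layout (assuming InnoDB and the column names from the question; foreign keys to users and locations could be added on top if referential integrity is wanted):
create table users_locations
(
user_id int unsigned not null,
location_id int unsigned not null,
primary key (user_id, location_id),
key (location_id, user_id)
)
engine=innodb;
With this, WHERE user_id = 5 uses the primary key and WHERE location_id = 5 uses the secondary index, and both example queries are covered by an index alone.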
(Comments on specific points in the Question)
FOREIGN KEYs implicitly create indexes, unless an explicit index has already been provided.
Composite indexes are better for many-to-many tables.
A FOREIGN KEY involves an integrity check, so it is inherently slower than simply having the index. (And the integrity check for this kind of table is of dubious value.)
There is no need for an AUTO_INCREMENT on a many:many table. However, ...
It is important to have a PRIMARY KEY on every table. The pair of columns is fine as a "natural" PRIMARY KEY.
A WHERE clause would like to use the first column(s) of some index; don't worry that it is not using all the columns.
In EXPLAIN you sometimes see "Using index". This means that a "covering index" was used. That means that all the columns used in the SELECT were found in that one index -- without having to reach into the data to get more columns. This is a performance boost, and it is why two two-column indexes are needed here (one is the PK, one is a plain INDEX).
With InnoDB, any 'secondary' index (INDEX or UNIQUE) implicitly includes the columns of the PK. So, given PRIMARY KEY(a,b), INDEX(b), that secondary index is effectively INDEX(b,a). I prefer to spell out the two columns to point out to the reader that I deliberately wanted those two columns in that order.
Hopefully, the above link will answer any further questions.

How to speed up this SQL index query?

Given the following SQL table :
Employee(ssn, name, dept, manager, salary)
You discover that the following query is significantly slower than
expected. There is an index on salary, and you have verified that
the query plan is using it.
SELECT *
FROM Employee
WHERE salary = 48000
Please give a possible reason why this query is slower than expected, and provide a tuning solution that
addresses that reason.
I have two ideas for why this query is slower than expected. One is that we are trying to SELECT * instead of SELECT Employee.salary which would slow down the query as we must search across all columns instead of one. Another idea is that the index on salary is non-clustered, and we want to use a clustered index, as the company could be very large and it would make sense to organize the table by the salary field.
Would either of those two solutions speed up this query? I.e. either change SELECT * to SELECT Employee.salary or explicitly set the index on salary to be clustered?
What indexes do you have now?
Is it really "slow"? What evidence do you have?
Comments on "SELECT * instead of SELECT Employee.salary" --
* is bad form because tomorrow you might add a column, thereby breaking any code that is expecting a certain number of columns in a certain order.
Dealing with * versus salary does not happen until after the row(s) is located.
Locating the row(s) is the costly part.
On the other hand, if you have INDEX(salary) and only look at salary then the index is "covering". That means that the "data" (the other columns) does not need to be fetched. Hence, faster. But this is probably beyond what your teacher has told you about yet.
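A hedged illustration, assuming the existing index is defined roughly as INDEX(salary):
-- Covering: everything the SELECT needs is in the index itself
EXPLAIN SELECT salary FROM Employee WHERE salary = 48000;   -- Extra column shows "Using index"
-- Not covering: every matching index entry requires a lookup into the row data for the other columns
EXPLAIN SELECT * FROM Employee WHERE salary = 48000;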
Comments on "the index on salary is non-clustered, and we want to use a clustered index" --
In MySQL (not necessarily in other RDBMSs), InnoDB has exactly one PRIMARY KEY and it is always UNIQUE and "clustered".
That is, "clustered" implies "unique", which seems inappropriate for "salary".
In InnoDB a "secondary key" implicitly includes the column(s) of the PK (ssn?), with which it can reach over into the data.
"verified that the query plan" -- Have you learned about EXPLAIN SELECT ...?
More Tips on creating the optimal index for a given SELECT.
I will try to be as simple as I can.
You cannot simply make salary a clustered index unless you make it unique or the primary key, which makes no sense because two people can have the same salary.
There can be only one clustered index per table according to the MySQL documentation. The database by default elects the primary key as the clustered index:
If you do not define a PRIMARY KEY for your table, MySQL locates the
first UNIQUE index where all the key columns are NOT NULL and InnoDB
uses it as the clustered index.
To speed up your query I have a few suggestions; go for secondary indexes:
If you want to search a salary by an exact value, then hash-based indexes are the better option, where the storage engine supports them (InnoDB only builds B-tree indexes; MySQL's MEMORY engine supports HASH).
If you want to search a value using greater than, less than, or some range, then B-tree indexes are the better choice.
The first option is faster than the second, but is limited to the equality operator only.
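A hedged sketch of the distinction (the salary_lookup table here is made up purely for illustration):
CREATE TABLE salary_lookup (
  ssn INT NOT NULL,
  salary INT NOT NULL,
  INDEX idx_salary_hash (salary) USING HASH   -- honoured by the MEMORY engine; InnoDB would silently use a B-tree
) ENGINE=MEMORY;
-- Equality lookups (WHERE salary = 48000) can use the hash index;
-- range lookups (WHERE salary BETWEEN 40000 AND 50000) cannot, and need a B-tree index instead.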
Hope it helps.

Should I use multiple index method if indexed fields are also foreign keys?

After adding foreign keys, MySQL created indexes on the key columns, which I had already indexed with a multiple-column (composite) index. I use InnoDB.
Here is a structure of my table:
id, company_id, student_id ...
company_id and student_id had been indexed using:
ALTER TABLE `table` ADD INDEX `multiple_index` (`company_id`,`student_id`)
Why do I use a multiple-column index? Because most of the time my query is:
SELECT * FROM `table` WHERE company_id = 1 AND student_id = 3
Sometimes I just fetch rows by student_id:
SELECT * FROM `table` WHERE student_id = 3
After adding foreign keys for company_id and student_id, MySQL indexed both of them separately. So now I have both the multiple-column index and the separately indexed fields.
My question is: should I drop the multiple-column index?
It depends. If the same student belongs to many companies, no, don't drop it. When querying for company_id = 1 AND student_id = 3, the optimizer has to pick one index, and after that, it will either have to check multiple students or multiple companies.
My gut tells me this won't be the case, though; students probably won't be associated with more than ~10 companies, so scanning over the index a bit won't be a big deal. That said, relying on that is a lot more brittle than having the index on both columns: with the composite index, the optimizer knows what the right thing to do is. When it has two single-column indexes to pick from, it might not choose well, now or in the future, so you would have to FORCE INDEX to make sure it uses the student_id index (see the sketch below).
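A hedged sketch of what that could look like (the index name student_id is an assumption; check SHOW INDEX FROM `table` for the name MySQL actually generated for the foreign key):
SELECT *
FROM `table` FORCE INDEX (student_id)
WHERE company_id = 1
  AND student_id = 3;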
The other thing to consider is how this table is used. If it's rarely written to but read frequently, there's not much of a penalty beyond space for the extra index.
TL;DR: the index is not redundant. Whether or not you should keep it is complicated.

One to Many Database

I have created a database with One to many relationship
The parent table, say Master, has 2 columns: NodeId, NodeName; NodeId is the primary key and is of type int, the rest are of type varchar.
The child table, say Student, has 5 columns: NodeId, B, M, F, T; NodeId is the foreign key here.
None of the columns B, M, F, T are unique and they can have null values, hence none of these columns have been defined as a primary key.
Assume the Student table has more than 2,000,000 rows.
My fetch query is
SELECT * FROM STUDENT WHERE NODEID = 1 AND B='1-123'
I would like to improve the speed of fetching. Any suggestion regarding improvement of the DB structure or an alternative fetch query would be really helpful; any suggestion that can improve overall efficiency is most welcome.
If NodeId is not already indexed in Student, adding indexes on NodeId and B — ideally a composite index on (NodeId, B) — would improve query performance, provided insert performance is not as big of an issue.
Update:
An index is essentially a way to keep your data sorted to speed up searches/queries. It should be good enough to just think of it as an ordered list.
An index is quite transparent, so your query would remain exactly the same.
A plain (non-unique) index does allow rows with the same indexed values, so it should be fine here.
It is worth mentioning that a primary key column is indexed by default; however, a PK does not allow duplicate data.
Also, since it keeps an ordering of your data, insertion time will increase; however, if your dataset is big, query time should become much faster.
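A hedged sketch of the suggestion above (assuming the Student table described in the question; the index name is arbitrary):
ALTER TABLE Student
  ADD INDEX idx_nodeid_b (NodeId, B);
-- The fetch query stays exactly the same and can now locate matching rows directly:
-- SELECT * FROM Student WHERE NodeId = 1 AND B = '1-123';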

Can I optimize my database by splitting one big table into many small ones?

Assume that I have one big table with three columns: "user_name", "user_property", "value_of_property". Let's also assume that I have a lot of users (say 100,000) and a lot of properties (say 10,000). Then the table is going to be huge (1 billion rows).
When I extract information from the table I always need information about a particular user, so I use, for example, WHERE user_name = 'Albert Gates'. Every time, the MySQL server then needs to scan 1 billion rows to find those that contain "Albert Gates" as user_name.
Would it not be wise to split the big table into many small ones corresponding to fixed users?
No, I don't think that is a good idea. A better approach is to add an index on the user_name column - and perhaps another index on (user_name, user_property) for looking up a single property. Then the database does not need to scan all the rows - it just needs to find the appropriate entry in the index, which is stored in a B-Tree, making it easy to find a record in a very small amount of time.
If your application is still slow even after correctly indexing it can sometimes be a good idea to partition your largest tables.
One other thing you could consider is normalizing your database so that the user_name is stored in a separate table and an integer foreign key is used in its place. This can reduce storage requirements and can increase performance. The same may apply to user_property.
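A hedged sketch of the simple indexing approach, keeping the original single table (the table name user_properties is made up, since the question does not name the table):
ALTER TABLE user_properties
  ADD INDEX idx_user_prop (user_name, user_property);
-- WHERE user_name = 'Albert Gates' can use the leftmost column of this index,
-- and WHERE user_name = ... AND user_property = ... can use both columns.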
you should normalise your design as follows:
drop table if exists users;
create table users
(
user_id int unsigned not null auto_increment primary key,
username varbinary(32) unique not null
)
engine=innodb;
drop table if exists properties;
create table properties
(
property_id smallint unsigned not null auto_increment primary key,
name varchar(255) unique not null
)
engine=innodb;
drop table if exists user_property_values;
create table user_property_values
(
user_id int unsigned not null,
property_id smallint unsigned not null,
value varchar(255) not null,
primary key (user_id, property_id),
key (property_id)
)
engine=innodb;
insert into users (username) values ('f00'),('bar'),('alpha'),('beta');
insert into properties (name) values ('age'),('gender');
insert into user_property_values values
(1,1,'30'),(1,2,'Male'),
(2,1,'24'),(2,2,'Female'),
(3,1,'18'),
(4,1,'26'),(4,2,'Male');
From a performance perspective the innodb clustered index works wonders in this similar example (COLD run):
select count(*) from product
count(*)
========
1,000,000 (1M)
select count(*) from category
count(*)
========
250,000 (250K)
select count(*) from product_category
count(*)
========
125,431,192 (125M)
select
c.*,
p.*
from
product_category pc
inner join category c on pc.cat_id = c.cat_id
inner join product p on pc.prod_id = p.prod_id
where
pc.cat_id = 1001;
0:00:00.030: Query OK (0.03 secs)
Properly indexing your database will be the number 1 way of improving performance. I once had a query take half an hour (on a large dataset, but nonetheless). Then we came to find out that the tables had no index. Once indexed, the query took less than 10 seconds.
Why do you need to have this table structure? My fundamental problem is that you are going to have to cast the data in value_of_property every time you want to use it. That is bad in my opinion - also, storing numbers as text is crazy given that it's all binary anyway. For instance, how are you going to have required fields? Or fields that need to have constraints based on other fields? E.g. start and end dates?
Why not simply have the properties as fields rather than some many to many relationship?
Have 1 flat table. When your business rules begin to show that properties should be grouped, then you can consider moving them out into other tables and having several 1:0-1 relationships with the users table. But this is not normalization, and it will degrade performance slightly due to the extra join (however, the self-documenting nature of the table names will greatly aid any developers).
One way I regularly see database performance get totally castrated is by having a generic Id, Property Type, Property Name, Property Value table.
This is really lazy but exceptionally flexible, and it totally kills performance. In fact, on a new job where performance is bad, I actually ask if they have a table with this structure - it invariably becomes the center point of the database and is slow. The whole point of relational database design is that the relations are determined ahead of time. This is simply a technique that aims to speed up development at a huge cost to application speed. It also puts a huge reliance on business logic in the application layer to behave - which is not defensive at all. Eventually you find that you want to use properties in a key relationship, which leads to all kinds of casting on the join, which further degrades performance.
If data has a 1:1 relationship with an entity, then it should be a field on the same table. If your table gets to more than 30 fields wide, then consider moving some into another table, but don't call it normalisation, because it isn't. It is a technique to help developers group fields together, at the cost of performance, in an attempt to aid understanding.
I don't know if MySQL has an equivalent, but SQL Server 2008 has sparse columns - null values take no space.
Sparse column datatypes
I'm not saying an EAV approach is always wrong, but I think using a relational database for this approach is probably not the best choice.