When/why to use (combined) indexes? [closed] - mysql

While working on a project to test the performance on a database, I came to the point of adding indexes.
Having surfed a big part of the internet, I'm still left with a couple of questions.
On what table/column is it a good idea to put an index?
I have different types of tables, for example a table full of predefined country names. I believe it is a good idea to put an index on the column country_name. I know this is good because there is little chance I'll have to add new records to this table, and queries will be faster when using country_name in the WHERE clause.
But what about more complex tables like client (or any other table that will change a lot and contains many columns)?
What about combined indexes?
When are combined indexes a good idea? Is it when I query a lot of clients by their first_name and last_name together? Or is it better to add individual indexes to both of those columns?
Paradox?
Having read this answer on Stack Overflow, I'm left with a paradox. Knowing the data will increase significantly is a reason for me to add an index, but the index will slow things down at the same time, as indexes slow down updates/inserts.
e.g. I have to keep a daily track of the weight of clients (>3M records). Adding an index will help me get my results faster. But I gain about 1000 new clients each day, so I'll have to insert them AND update their weights, which means slower performance because of the inserts/updates.
MySQL-specific addition
Is there an advantage to different storage engines, combined with indexes?
So far I've only used InnoDB.

I'm going to focus on the "Combined Indexes" part of the question, but use that to cover several other points that I think will help you better understand indexes.
What about combined indexes?
When are combined indexes a good idea? Is it when I query a lot of clients by their first_name and last_name together? Or is it better to add individual indexes to both of those columns?
Indexes are just like phone books. A phone book is a table with fields for Last_Name, First_Name, Address, and Phone_Number. This table has an index on (Last_Name, First_Name). This is what you called a combined index.
Let's say you wanted to find "John Smith" in this phone book. That would work out to a query like this:
SELECT * FROM PhoneBook WHERE First_Name = 'John' and Last_Name = 'Smith';
That's pretty easy in your phone book. Just find the section for "Smith", and then go find all the "John"s within that section.
Now imagine that instead of a combined index on (Last_Name, First_Name), you had separate indexes: one for Last_Name and one for First_Name. You try to run the same query. So you open up the Last_Name index and find the section for Smith. There are a lot of them. You go to find the Johns, but the First_Name fields aren't in the correct order. Maybe it's ordered by Address now instead. More likely in a database, it's in order by when this particular Mr or Ms Smith first moved to town. You'll have to go through all of the Smiths to find your phone number. That's not so good.
So we move to the First_Name index instead. You do the same process and find the section for "John". This isn't any better. We didn't specify to additionally order by last name, and so you have to go through all of the Johns to find your Smiths.
This is exactly how database indexes work. Each index is just a copy of the information included in the index, stored in the order specified by the index, along with a pointer back to the full record. There are some additional optimizations, like not completely filling each page of the index so that you can add new entries more efficiently without having to rebuild the whole index (you only need to rebuild that page), but in a nutshell each new index is another phone book that you have to maintain. Hopefully you can see now why things like COLUMN LIKE '%keyword%' searches are so bad.
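To make the phone book concrete, here's a minimal sketch in MySQL; the table and index names are just the hypothetical ones from the analogy:

-- The phone book, with a combined (composite) index: entries sorted
-- by last name, then by first name within each last name.
CREATE TABLE PhoneBook (
    Last_Name    VARCHAR(50),
    First_Name   VARCHAR(50),
    Address      VARCHAR(100),
    Phone_Number VARCHAR(20),
    INDEX idx_name (Last_Name, First_Name)
);

-- Uses idx_name: seek to the 'Smith' section, then scan the 'John's.
SELECT * FROM PhoneBook WHERE Last_Name = 'Smith' AND First_Name = 'John';

-- Also uses idx_name (leftmost prefix): all the 'Smith's.
SELECT * FROM PhoneBook WHERE Last_Name = 'Smith';

-- Cannot seek with idx_name: First_Name alone is not a leftmost
-- prefix of (Last_Name, First_Name), so this scans.
SELECT * FROM PhoneBook WHERE First_Name = 'John';

-- And the '%keyword%' case: a leading wildcard throws away the
-- index ordering entirely, forcing a full scan.
SELECT * FROM PhoneBook WHERE Last_Name LIKE '%mit%';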
The other thing to understand about indexes is they exist to support queries, not tables. You don't necessarily want to look at a table and think about what columns you'll key on. You want to look at your queries and think about what columns they use for each table.
For this reason, you may still need separate indexes for both First_Name and Last_Name. This would be when you need to support different queries that use different means to query the table. This is also why applications don't always just let you search by any field. Each additional searchable field requires new indexes, which adds new performance cost to the application.
This is also the reason why it's so important to have a separate and organized database layer in your application. It helps you get a handle on what queries you really have, and therefore what indexes you really need. Good tiered application design, or a well-designed service layer for the service-oriented crowd, is really a performance thing as much as anything else, because database performance often cuts to the core of your larger application performance.

OK, you need to know two things: indexes increase the speed of searches (SELECT) but will slow down your changes (INSERT/UPDATE/DELETE). If you need to keep a track log, try using one table only to collect the information, and another table to summarize your tracking info. Example:
table track ( ip, date, page, ... )
table hour_track ( page, number_visitators, date )
In table track you will only add rows, never update or delete. Table hour_track you will regenerate with a cron job (or another technique), and there you will add a combined index ( most_searched, second_most_searched, ... ). A combined index will increase your speed because your database only needs to maintain one tree, not several. More than that, if one column is used more often in your queries, you can put that column first in your index declaration. You can read more here
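Here is a rough sketch of that layout; the column types and the hourly rollup query are my own guesses at what's described above:

-- Raw collection table: INSERT only, never UPDATE or DELETE.
CREATE TABLE track (
    ip   VARCHAR(45),      -- long enough for IPv6 text form
    date DATETIME,
    page VARCHAR(255)
);

-- Summary table, regenerated periodically by the cron job.
CREATE TABLE hour_track (
    page              VARCHAR(255),
    number_visitators INT,
    date              DATETIME,
    -- Combined index, most-queried column first.
    INDEX idx_page_date (page, date)
);

-- What the cron job might run for one hour's bucket (hypothetical):
INSERT INTO hour_track (page, number_visitators, date)
SELECT page, COUNT(DISTINCT ip), '2024-01-01 13:00:00'
FROM track
WHERE date >= '2024-01-01 13:00:00'
  AND date <  '2024-01-01 14:00:00'
GROUP BY page;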

Related

MySQL self join performance: fact or just bad indexing?

As an example: I have a database to detect visitors (bots, etc.), and since not every visitor has the same set of 'credentials', I made a 'dynamic' table like so: see fiddle: http://sqlfiddle.com/#!9/ca4c8/1 (simplified version).
This returns me the profile ID that I use to gather info about each profile (in another DB). Depending on the profile type, I query the table with a different name clause (name='something') (i.e. hostname, ipAddr, userAgent, HumanId, etc.).
I'm not an expert in SQL but I'm familiar with indexes, constraints, primary, unique, foreign key etc. And from what I saw from these search results:
Mysql Self-Join Performance
How to tune self-join table in mysql like this?
Optimize MySQL self join query
JOIN Performance Issue MySQL
MySQL JOIN performance issue
Most of them have comments about bad performance on self-join but answers tend to go for the missing index cause.
So the final question is: does self-joining a table make it more prone to bad performance, assuming that everything is indexed properly?
On a side note, here is more information about the table: it might be irrelevant to the question but is well in context for my particular situation:
Column flag is used to mark records for deletion, as the user I use from PHP doesn't have DELETE permission over this database. Sorry, security is more important than performance.
I added the 'type' that will go with info I get from the user agent (i.e. if anything is (or at least seems to be) a bot, we will only search for type 5000).
Column 'name' is unfortunately a varchar indexed in the primary key (with profile and type).
I tried to use as many INTs and as much filtering (WHERE) in the SELECT query as possible to reduce eventual loss of performance (if that even matters).
I'm willing to study and tweak the thing if needed, unless someone with a strong background in MySQL tells me it's really not a good thing to do.
This is a big project I have in development, so I cannot test it with millions of records for now, but I wonder if performance will be an issue as this grows. Any input, links, references, documentation or test procedures (maybe in comments) will be appreciated.
A self-join is no different than joining two different tables. The optimizer will pick one 'table', usually based on the WHERE, then do a Nested Loop Join into the other. In your case, you have implied, via LEFT, that it should work only one way. (The Optimizer will ignore that if it sees no need for it.)
Your keys are fine for that Fiddle.
The real problem is "Entity-Attribute-Value", which is a messy way to lay out data in tables. Your query seems to be saying "find a (LIMIT 1) profile (entity) that has a certain pair of attributes (name = Googlebot AND addr = ...)".
It would be so much easier, and faster, to have two columns (name and addr) and a "composite" INDEX(name, addr).
I recommend doing that for the common "attributes", then put the rest into a single column with a JSON string. See here.
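A sketch of what that might look like; the table and column names are hypothetical, and the JSON column needs MySQL 5.7+:

-- Common attributes promoted to real columns, with a composite index;
-- the long tail of attributes goes into one JSON string.
CREATE TABLE visitor_profile (
    profile_id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    name       VARCHAR(100),  -- e.g. 'Googlebot'
    addr       VARCHAR(45),   -- IP address
    extras     JSON,
    INDEX idx_name_addr (name, addr)
);

-- The EAV-style lookup becomes a single indexed probe:
SELECT profile_id
FROM visitor_profile
WHERE name = 'Googlebot' AND addr = '66.249.66.1'
LIMIT 1;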

Optimizing a mysql query to fetch "unseen" entries per user

This title is rather mesmerizing but I couldn't come up with something clearer.
Long story short, we're creating a mobile app connected to a node.js server communicating with a mySql database. Pretty common setup. Now, we have multiple users connected that are able to upload "moments" to our servers. These moments can be only seen once by all other users.
As soon as a user x sees another user y's moment, x can never see that moment of y's again. Maybe a bit like Snapchat, except the moment goes from a single user to multiple users instead of single to single. Moments are also ordered by distance according to the current user's location.
Now, I'm looking for an intelligent way of only fetching the "unseen" moments from database. For now, we're using a relational table between Users and Moments.
Let's say a user (ID = 20) sees a moment (ID = 30320), then we insert into this table 20 and 30320. I know. This is hardly scalable and probably a terrible idea.
I thought about maybe checking the last seen date and only fetching moments that are past this date, but again, moments are ordered by distance before being ordered by date so it is possible to see a moment that is 3 minutes old followed by a moment that is 30 seconds old.
Is there a more clever way of doing this, or am I doomed to use a relationship table between Moments and Users, and join to it when querying?
Thanks a lot.
EDIT -
This logic uses in total 3 tables.
Users
Moments
MomentSeen
MomentSeen only contains what user has seen what moment, and when. Since the moments aren't ordered by date, I can't fetch all the moments that were uploaded after the last seen moment.
EDIT -
I just realized the mobile app Tinder must use similar logic for which user "liked" which other user. Since you can't go back in time and see a user twice, they probably use a very similar query to what I'm looking for.
Considering they have a lot of users, and that they're ordered by distance and some other unknown criteria, there must be a more clever way of doing things than a "UserSawUser" relational table.
EDIT
I can't provide the entire database structure so I'll just leave the important tables and some of their fields.
Users {
UserID INT UNSIGNED AUTO_INCREMENT PRIMARY KEY
}
Moments {
MomentID INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
UploaderID INT UNSIGNED, /* FK to UserID */
TimeUploaded DATE /* usually NOW() at insertion */
}
MomentSeen {
/* Both are FK to Users and Moments */
MomentID INT UNSIGNED,
UserID INT UNSIGNED
}
You can consider implementing a Bloom filter. It is widely used to reduce disk seeks and drive better performance.
Medium is using it to check whether a user has read a post already.
More details here:
https://medium.com/the-story/what-are-bloom-filters-1ec2a50c68ff
https://en.wikipedia.org/wiki/Bloom_filter
Do not use one table per user. Do have a single table for the moments.
You seem to have two conflicting orderings for "moments": 'distance' and 'unseen'; which is it?
If it is 'unseen', are the 'moments' numbered chronologically? This implies that each user has a last_moment_seen -- all Moments before then have been seen; all after that have not been seen. So...
SELECT ...
WHERE moment > ( SELECT last_moment_seen
FROM Users WHERE user_id = $user_id );
would get all the moments not yet seen for a given user.
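Here's a minimal sketch of the bookkeeping this assumes: a last_moment_seen column on Users (the name is hypothetical), advanced as the user views moments; it only works if MomentIDs are handed out chronologically:

-- Hypothetical high-water-mark column, one per user.
ALTER TABLE Users ADD COLUMN last_moment_seen INT UNSIGNED NOT NULL DEFAULT 0;

-- After the user (ID 20) views moments up through ID 30320, advance
-- the mark; GREATEST() keeps it from ever moving backwards.
UPDATE Users
SET last_moment_seen = GREATEST(last_moment_seen, 30320)
WHERE UserID = 20;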
Munch on that for a while; then come back for more suggestions.
Edit
This should give you the Moments not yet seen. You can then order them as you see fit.
SELECT m....
FROM Moments m
LEFT JOIN MomentSeen ms ON ms.MomentID = m.MomentID AND ms.UserID = $user_id
WHERE ms.MomentID IS NULL
ORDER BY ...
LIMIT 1 -- if desired
Why hesitate to use a join?
Have you tried filling your database with dummy data, millions of rows, to be able to measure the performance impact on your system?
Using joins is not such a bad idea and is often faster than a single table, if done right.
You should probably do some research on database design to provide some reference.
For instance, ordering a table is done using an index.
However, you can use more than one index on one table, and a combination of columns in each index.
This can be done by analyzing the queries often run against the table.
A handy recipe is to create an index containing the set of columns used in join keys, another index for each common combination of WHERE parameters, and an index for each ORDER BY run against that table (ascending/descending order does matter).
So don't be shy to add another column and index to suit your needs; for example:
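Applied to the tables from this question, that recipe might look like the following; the index names are made up:

-- Join-key index: the LEFT JOIN probes MomentSeen by (MomentID, UserID),
-- and a composite primary key covers it while deduplicating rows.
ALTER TABLE MomentSeen ADD PRIMARY KEY (MomentID, UserID);

-- A WHERE-combination index, e.g. for filtering moments by uploader.
CREATE INDEX idx_uploader ON Moments (UploaderID);

-- An ORDER-BY index, e.g. for chronological listings; direction
-- matters when mixing ASC and DESC (honored as of MySQL 8).
CREATE INDEX idx_time ON Moments (TimeUploaded);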
If you're talking about scalability, then you should consider tuning the database engine,
e.g. using BIGINT keys to raise the maximum possible key value.
Using a database cluster setup would also require in-depth analysis, because auto-increment keys have issues in a multi-master setup.
If you try to squeeze more performance out of your system, then you should consider designing the whole database to be table-partition friendly from the very start. That will include serious analysis of your business rules. Creating a table-partition-friendly environment requires choosing a set of columns as the partitioning key and splitting the data physically (remember to set innodb_file_per_table = 1 in the MySQL config, otherwise the benefit of table partitioning is lost).
If not done right, however, partitioning will not bring you any benefit.
https://dev.mysql.com/doc/refman/5.5/en/partitioning.html
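For illustration only, here's what a partition-friendly variant of the Moments table could look like, partitioned by year; the table name and cutoffs are made up:

-- The partitioning column must appear in every unique key, hence
-- the composite primary key on this hypothetical table.
CREATE TABLE MomentsPartitioned (
    MomentID     INT UNSIGNED NOT NULL AUTO_INCREMENT,
    UploaderID   INT UNSIGNED,
    TimeUploaded DATE NOT NULL,
    PRIMARY KEY (MomentID, TimeUploaded)
)
PARTITION BY RANGE (YEAR(TimeUploaded)) (
    PARTITION p2014 VALUES LESS THAN (2015),
    PARTITION p2015 VALUES LESS THAN (2016),
    PARTITION pmax  VALUES LESS THAN MAXVALUE
);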

Creating index on mysql table

I have a table called data in a MySQL database. The table is quite large, has about 500k records, and this number will grow to 1 million. Each record consists of about 50 columns and most of them contain varchars.
The data table is used very frequently. Actually, most queries access this table. The data is read from and written to it by ~50 users simultaneously. The system is highly loaded, with the users uploading and checking their data, so it can be stopped for an hour or two at most.
After some research, I found out that almost all the select queries that have a WHERE clause use one of four fields in the table. Those fields are: isActive, country, state, city - all are in int format. The WHERE clause can be either
where isActive = {0|1}
or
where isActive = {0|1} and {country|state|city} = {someIntValue}
or
where {country|state|city} = {someIntValue}
And the last thing is that the table does not have any indexes except for the primary id one.
After the table grew to its current size, I faced some performance issues.
So, my question is: if I create indexes on the columns isActive, country, state and city, will the performance increase?
UPD: I've just created an index on one of those fields and WOW! The queries are executed immediately. Thank you, guys.
I don't think it's a good idea to index the isActive field, because it'll add indexing overhead when adding/updating/deleting, but when reading it only splits the data into two chunks (1 and 0), so it won't really help.
Edit: found this to explain the point above:
Is there any performance gain in indexing a boolean field?
For the other three columns, I recommend you do a benchmark when most users are offline (at night, or at lunch time) and see how it affects performance, but I think it'll really help without many downsides.
Edit: ypercube has pointed out some interesting use cases where my answer about indexing a boolean field doesn't hold; check the comments.
Yes, creating an index on each of these columns will help you.
Consider and underline the word each.
A separate index for each one is what I suggest. The reason is the coexistence of different combinations of these columns in your queries.
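As a sketch, using the table and column names from the question:

-- One single-column index per frequently filtered field. (See the
-- other answer here about whether indexing a boolean such as
-- isActive is worth it.)
CREATE INDEX idx_isActive ON data (isActive);
CREATE INDEX idx_country  ON data (country);
CREATE INDEX idx_state    ON data (state);
CREATE INDEX idx_city     ON data (city);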
Yes, definitely.
You may see even better results if you include selected additional fields in each index too. Just take careful notice of the column order...
But before all else, make sure you don't use the MyISAM engine for a big table with many writes! Switch to InnoDB, for example.

what is mysql indexing and how do you create an index?

Okay, MySQL indexing. Is indexing nothing more than having a unique ID for each row that will be used in the WHERE clause?
When indexing a table does the process add any information to the table? For instance, another column or value somewhere.
Does indexing happen on the fly when retrieving values or are values placed into the table much like an insert or update function?
Any more information to clearly explain MySQL indexing would be appreciated. And please don't just place a link to the MySQL documentation; it is confusing, and it is always better to get a personal response from a professional.
Lastly, why is indexing different from telling MySQL to look for values between two values? For example: WHERE create_time >= 'AweekAgo'
I'm asking because one of my tables is 220,000+ rows and it takes more than a minute to return values with a very simple MySQL select statement, and I'm hoping indexing will speed this up.
Thanks in advance.
You were downvoted because you didn't make an effort to read or search for what you are asking. A simple Google search could have shown you the benefits and drawbacks of a database index. Here is a related question on StackOverflow. I am sure there are numerous questions like that.
To simplify the jargon: it would be easier to locate books in a library if you arranged them on shelves numbered according to their area of specialization. You could easily tell somebody to go to a specific location and pick up the book - that is what an index does.
Another example: imagine an alphabetically ordered admission list. If your name starts with Z, you will just skip A to Y and get to Z - faster? If the list were unordered, you would have to search and search, and might not even find your name if you didn't look carefully.
A database index is a data structure that improves the speed of operations in a table. Indexes can be created using one or more columns, providing the basis for both rapid random lookups and efficient ordering of access to records.
You can create an index this way:
CREATE INDEX index_name
ON table_name ( column1, column2,...);
You might be working on a more complex database, so it's good to remember a few simple rules.
Indexes slow down inserts and updates, so you want to use them carefully on columns that are FREQUENTLY updated.
Indexes speed up where clauses and order by.
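For the create_time example above, a sketch (the table name is made up):

-- Index supporting the range filter from the question.
CREATE INDEX idx_create_time ON my_table (create_time);

-- MySQL can now seek straight to the week-ago boundary instead of
-- scanning all 220,000+ rows; the same index serves the ORDER BY.
SELECT *
FROM my_table
WHERE create_time >= NOW() - INTERVAL 7 DAY
ORDER BY create_time;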
For further detail, you can read:
http://dev.mysql.com/doc/refman/5.0/en/mysql-indexes.html
http://www.tutorialspoint.com/mysql/mysql-indexes.htm
There are a lot of index types, for example a hash, a trie, or a spatial index. It depends on the values. In MySQL it's most likely a hash or a B-tree. Nothing really fancy, because the fancy things are expensive.

When is it a good idea to move columns off a main table into an auxiliary table?

Say I have a table like this:
create table users (
    user_id int not null auto_increment primary key,
    username varchar(255),
    joined_at datetime,
    bio text,
    favorite_color varchar(255),
    favorite_band varchar(255)
    ....
);
Say that over time, more and more columns -- like favorite_animal, favorite_city, etc. -- get added to this table.
Eventually, there are like 20 or more columns.
At this point, I'm feeling like I want to move columns to a separate
user_profiles table so I can do select * from users without
returning a large number of usually irrelevant columns (like
favorite_color). And when I do need to query by favorite_color, I can just do
something like this:
select * from users inner join user_profiles using (user_id) where
user_profiles.favorite_color = 'red';
Is moving columns off the main table into an "auxiliary" table a good
idea?
Or is it better to keep all the columns in the users table, and always
be explicit about the columns I want to return? E.g.
select user_id, username, last_logged_in_at, etc. etc. from users;
What performance considerations are involved here?
Don't use an auxiliary table if it's going to contain a collection of miscellaneous fields with no conceptual cohesion.
Do use a separate table if you can come up with a good conceptual grouping of a number of fields e.g. an Address table.
Of course, your application has its own performance and normalisation needs, and you should only apply this advice with proper respect to your own situation.
I would say that the best option is to have properly normalized tables, and also to only ask for the columns you need.
A user profile table might not be a bad idea, if it is structured well to provide data integrity and simple enhancement/modification later. Only you can truly know your requirements.
One thing that no one else has mentioned is that it is often a good idea to have an auxiliary table if the row size of the main table would get too large. Read about the row size limits of your specific database in the documentation. There are often performance benefits to having tables that are less wide, and moving the fields you don't use as often off to a separate table. If you choose to create an auxiliary table with a one-to-one relationship, make sure to set up the PK/FK relationship to maintain data integrity, and set a unique index or constraint on the FK field to maintain the one-to-one relationship.
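For example, a minimal sketch of that setup, reusing the users and user_profiles names from the question:

-- One-to-one auxiliary table: making the FK itself the primary key
-- both maintains integrity and guarantees the one-to-one relationship
-- (a primary key is already unique).
CREATE TABLE user_profiles (
    user_id        INT NOT NULL PRIMARY KEY,
    bio            TEXT,
    favorite_color VARCHAR(255),
    favorite_band  VARCHAR(255),
    FOREIGN KEY (user_id) REFERENCES users (user_id)
);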
And to go along with everyone else, I cannot stress too strongly how bad it is to ever use select * in production queries. You save a few seconds of development time and create a performance problem, as well as making the application less maintainable (yes, less: you should not willy-nilly return things you need in the database but don't want to show in the application; you will break insert statements that use selects, and you will show users things you don't want them to see, when you use select *).
Try not to get in the habit of using SELECT * FROM ... If your application becomes large, and you query the users table for different things in different parts of your application, then when you do add favorite_animal you are more likely to break some spot that uses SELECT *. Or at the least, that place is now fetching unused fields that slow it down.
Select the data you need specifically. It self-documents to the next person exactly what you're trying to do with that code.
Don't de-normalize unless you have good reason to.
Adding a favorite_* column every other day, every time a user has a new favorite, is a maintenance headache at best. I would seriously consider creating a table to hold favorites values in your case. I'm pretty sure I wouldn't just keep adding a new column all the time.
The general guideline that applies to this (called normalization) is that tables are grouped by distinct entities/objects/concepts, and each column (field) in a table should describe some aspect of that entity.
In your example, it seems that favorite_color describes (or belongs to) the user. Sometimes it is a good idea to move data to a second table: when it becomes clear that that data actually describes a second entity. For example: you start your database collecting user_id, name, email, and zip_code. Then at some point in time, the CEO decides he would also like to collect the street_address. At this point a new entity has been formed, and you could conceptually view your data as two tables:
user: userid, name, email
address: street_address, city, state, zip, userid (as a foreign key)
So, to sum it up: the real challenge is to decide what data describes the main entity of the table, and what, if any, other entity exists.
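Sketched as tables, with hypothetical column types filled in:

-- The second entity, address, split out once it became clear that
-- it describes something of its own.
CREATE TABLE user (
    userid INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    name   VARCHAR(100),
    email  VARCHAR(255)
);

CREATE TABLE address (
    street_address VARCHAR(255),
    city           VARCHAR(100),
    state          VARCHAR(50),
    zip            VARCHAR(10),
    userid         INT UNSIGNED,  -- foreign key back to user
    FOREIGN KEY (userid) REFERENCES user (userid)
);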
Here is a great example of normalization that helped me understand it better
When there is no other reason (e.g. the normal forms for databases), you should not do it. You don't save any space, as the data must still be stored; instead you waste more, as you need another index to access it.
It is always better (though may require more maintenance if schemas change) to fetch only the columns you need.
This will result in lower memory usage by both MySQL and your client application, and reduced query times as the amount of data transferred is reduced. You'll see a benefit whether this is over a network or not.
Here's a rule of thumb: if adding a column to an existing table would require making it nullable (after data has been migrated, etc.), then instead create a new table with all NOT NULL columns (with a foreign key reference to the original table, of course).
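For instance, a sketch under that rule; the optional-feature table and its columns are made up:

-- Instead of adding a nullable column to users, the optional data
-- gets its own all-NOT-NULL table keyed back to users.
CREATE TABLE user_newsletter_optin (
    user_id     INT NOT NULL PRIMARY KEY,
    opted_in_at DATETIME NOT NULL,
    FOREIGN KEY (user_id) REFERENCES users (user_id)
);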
You should not rely on using SELECT * for a variety of reasons (google it).