As an example: I have a database to detect visitors (bots, etc.), and since not every visitor has the same set of 'credentials', I made a 'dynamic' table like so: see fiddle: http://sqlfiddle.com/#!9/ca4c8/1 (simplified version).
This returns me the profile ID that I use to gather info about each profile (in another DB). Depending on the profile type, I query the table with a different name clause (name='something') (e.g.: hostname, ipAddr, userAgent, HumanId, etc.).
I'm not an expert in SQL but I'm familiar with indexes, constraints, primary, unique, foreign key etc. And from what I saw from these search results:
Mysql Self-Join Performance
How to tune self-join table in mysql like this?
Optimize MySQL self join query
JOIN Performance Issue MySQL
MySQL JOIN performance issue
Most of them have comments about bad performance on self-joins, but the answers tend to attribute the problem to missing indexes.
So the final question is: does self-joining a table make it more prone to bad performance, assuming that everything is indexed properly?
On a side note, here is more information about the table; it might be irrelevant to the question but gives context for my particular situation:
The column flag is used to mark records for deletion, as the user I use from PHP doesn't have DELETE permission on this database. Sorry, security is more important than performance.
I added the 'type' that will go with the info I get from the user agent (i.e. if anything is, or at least seems to be, a bot, we will only search for type 5000).
Column 'name' is unfortunately a varchar indexed in the primary key (with profile and type).
I tried to use as many INTs and as much filtering (WHERE) in the SELECT query as possible to reduce any eventual loss of performance (if that even matters).
I'm willing to study and tweak the thing if needed, unless someone with a strong background in MySQL tells me it's really not a good thing to do.
This is a big project I have in development, so I cannot test it with millions of records for now, but I wonder if performance will be an issue as this grows. Any input, links, references, documentation or test procedures (maybe in comments) will be appreciated.
A self-join is no different from joining two different tables. The optimizer will pick one 'table', usually based on the WHERE, then do a Nested Loop Join into the other. In your case, you have implied, via LEFT, that it should work only one way. (The optimizer will ignore that if it sees no need for it.)
Your keys are fine for that Fiddle.
The real problem is "Entity-Attribute-Value", which is a messy way to lay out data in tables. Your query seems to be saying "find a (limit 1) profile (entity) that has a certain pair of attributes (name = Googlebot AND addr = ...)".
It would be so much easier, and faster, to have two columns (name and addr) and a "composite" INDEX(name, addr).
I recommend doing that for the common "attributes", then put the rest into a single column with a JSON string. See here.
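To illustrate (the table layout and names below are my own sketch, not taken from the fiddle), the reshaped profile table might look something like this, with the common attributes as real columns, a composite index on them, and the rarer attributes in a JSON string (MySQL 5.7+; a plain TEXT column would do on older versions):

CREATE TABLE profile (
  profile_id INT UNSIGNED NOT NULL AUTO_INCREMENT,
  name       VARCHAR(100) NOT NULL,   -- e.g. 'Googlebot'
  addr       VARCHAR(45)  NOT NULL,   -- IP address
  extras     JSON NULL,               -- the remaining, rarer attributes
  PRIMARY KEY (profile_id),
  INDEX name_addr (name, addr)        -- the "composite" index
) ENGINE=InnoDB;

-- The lookup becomes a single indexed probe instead of a self-join:
SELECT profile_id FROM profile
WHERE name = 'Googlebot' AND addr = '66.249.66.1'
LIMIT 1;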
Related
If I have a query like:
Select EmployeeId
From Employee
Where EmployeeTypeId IN (1,2,3)
and I have an index on the EmployeeTypeId field, does SQL server still use that index?
Yeah, that's right. If your Employee table has 10,000 records, and only 5 records have EmployeeTypeId in (1,2,3), then it will most likely use the index to fetch the records. However, if it finds that 9,000 records have the EmployeeTypeId in (1,2,3), then it would most likely just do a table scan to get the corresponding EmployeeIds, as it's faster just to run through the whole table than to go to each branch of the index tree and look at the records individually.
SQL Server does a lot of stuff to try and optimize how the queries run. However, sometimes it doesn't get the right answer. If you know that SQL Server isn't using the index, by looking at the execution plan in query analyzer, you can tell the query engine to use a specific index with the following change to your query.
SELECT EmployeeId FROM Employee WITH (INDEX(Index_EmployeeTypeId)) WHERE EmployeeTypeId IN (1,2,3)
Assuming the index you have on the EmployeeTypeId field is named Index_EmployeeTypeId.
Usually it would, unless the IN clause covers too much of the table, and then it will do a table scan. Best way to find out in your specific case would be to run it in the query analyzer, and check out the execution plan.
Unless technology has improved in ways I can't imagine of late, the "IN" query shown will produce a result that's effectively the OR-ing of three result sets, one for each of the values in the "IN" list. The IN clause becomes an equality condition for each of the list and will use an index if appropriate. In the case of unique IDs and a large enough table then I'd expect the optimiser to use an index.
If the items in the list were to be non-unique however, and I guess in the example that a "TypeId" is a foreign key, then I'm more interested in the distribution. I'm wondering if the optimiser will check the stats for each value in the list? Say it checks the first value and finds it's in 20% of the rows (of a large enough table to matter). It'll probably table scan. But will the same query plan be used for the other two, even if they're unique?
It's probably moot - something like an Employee table is likely to be small enough that it will stay cached in memory and you probably wouldn't notice a difference between that and indexed retrieval anyway.
And lastly, while I'm preaching, beware the query in the IN clause: it's often a quick way to get something working and (for me at least) can be a good way to express the requirement, but it's almost always better restated as a join. Your optimiser may be smart enough to spot this, but then again it may not. If you don't currently performance-check against production data volumes, do so - in these days of cost-based optimisation you can't be certain of the query plan until you have a full load and representative statistics. If you can't, then be prepared for surprises in production...
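To make that last point concrete with a hedged sketch (the EmployeeType table and its IsActive column are invented for illustration), an IN with a subquery and its restatement as a join might look like:

-- IN with a subquery:
SELECT EmployeeId
FROM Employee
WHERE EmployeeTypeId IN (SELECT EmployeeTypeId FROM EmployeeType WHERE IsActive = 1);

-- The same requirement restated as a join:
SELECT e.EmployeeId
FROM Employee e
JOIN EmployeeType et ON et.EmployeeTypeId = e.EmployeeTypeId
WHERE et.IsActive = 1;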
So there's the potential for an "IN" clause to run a table scan, but the optimizer will try and work out the best way to deal with it?
Whether an index is used doesn't so much vary on the type of query as much of the type and distribution of data in the table(s), how up-to-date your table statistics are, and the actual datatype of the column.
The other posters are correct that an index will be used over a table scan if:
The query won't access more than a certain percentage of the rows indexed (say ~10%, but this should vary between DBMSs).
Alternatively, if there are a lot of rows, but relatively few unique values in the column, it also may be faster to do a table scan.
The other variable that might not be that obvious is making sure that the datatypes of the values being compared are the same. In PostgreSQL, I don't think that indexes will be used if you're filtering on a float but your column is made up of ints. There are also some operators that don't support index use (again, in PostgreSQL, the ILIKE operator is like this).
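A MySQL example of the datatype pitfall, with hypothetical table and column names: comparing an indexed VARCHAR column against a numeric literal forces an implicit cast on every row, so the index is skipped, while a string literal lets the index be used.

-- phone is VARCHAR and indexed:
SELECT * FROM customer WHERE phone = 5551234;     -- implicit cast per row, index not used
SELECT * FROM customer WHERE phone = '5551234';   -- index on phone can be used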
As noted though, always check the query analyser when in doubt and your DBMS's documentation is your friend.
@Mike: Thanks for the detailed analysis. There are definitely some interesting points you make there. The example I posted is somewhat trivial but the basis of the question came from using NHibernate.
With NHibernate, you can write a clause like this:
int[] employeeIds = new int[]{1, 5, 23463, 32523};
NHibernateSession.CreateCriteria(typeof(Employee))
.Add(Restrictions.InG("EmployeeId", employeeIds))
.List<Employee>();
NHibernate then generates a query which looks like
select * from employee where employeeid in (1, 5, 23463, 32523)
So as you and others have pointed out, it looks like there are going to be times where an index will be used or a table scan will happen, but you can't really determine that until runtime.
SELECT EmployeeId FROM Employee USE INDEX (EmployeeTypeId)
This query will search using the index you have created. It works for me. Please give it a try.
There are 10 tables, all with a session_id column, and a single session table. The goal is to join them all on the session table. I get the feeling that this is a major code smell. Is this good/bad practice?
What problems could occur?
Whether this is a good design or not depends deeply on what you are trying to represent with it. So, it might be OK or it might not be... there's no way to tell just from your question in its current form.
That being said, there are couple ways to speed up a join:
Use indexes.
Use covering indexes.
Under the right DBMS, you could use a materialized view to store pre-joined rows. You should be able to simulate that under MySQL by maintaining a special table via triggers (or even manually); a minimal sketch follows this list.
Don't join a table unless you actually need its fields. List only the fields you need in the SELECT list (instead of blindly using *). The fastest operation is the one you don't have to do!
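A minimal sketch of the trigger-maintained variant mentioned above, assuming a hypothetical detail table page_view and a pre-aggregated table session_summary (neither name comes from the question):

CREATE TABLE session_summary (
  session_id INT NOT NULL PRIMARY KEY,
  page_views INT NOT NULL DEFAULT 0
);

DELIMITER //
CREATE TRIGGER page_view_ai AFTER INSERT ON page_view
FOR EACH ROW
BEGIN
  -- keep the summary row in step with the detail table
  INSERT INTO session_summary (session_id, page_views)
  VALUES (NEW.session_id, 1)
  ON DUPLICATE KEY UPDATE page_views = page_views + 1;
END//
DELIMITER ;

A real version would also need matching UPDATE/DELETE triggers (or a periodic rebuild) to stay consistent.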
And above all, measure on representative amounts of data! Possible results:
It's lightning fast. Yay!
It's slow, but it doesn't matter that it's slow (i.e. rarely used / not important).
It's slow and it matters that it's slow. Strap in, you have work to do!
We need the query with 11 joins and the EXPLAIN posted in the original question when it is available, please. And be kind to your community: for every table involved, also post SHOW CREATE TABLE tblname and SHOW INDEX FROM tblname to avoid additional requests for these 11 tables. Then we will know the scope of data and the cardinality involved for each indexed column.
Of course, more joins hurt performance.
But it depends! If your data model is like that, then you can't help it here unless a complete data-model redesign happens!
1) Is it an online (real-time transaction) DB or an offline DB (data warehouse)?
If online, then it is better to maintain a single table: keep the data in one table and let the number of columns grow.
If offline, it's better to maintain separate tables, because you are not going to need all the columns all the time.
I've heard that it is faster to select columns manually ("col1, col2, col3, etc") instead of querying them all with "*".
But what if I don't even want to query all columns of a table? Would it be faster to query, for example, only "col1, col2" instead of "col1, col2, col3, col4"?
From my understanding, SQL has to read through all of the columns anyway, and just the returned result changes. I'd like to know if I can achieve a gain in performance by selecting only the right columns.
(I'm doing this anyway, but a backend API of one of my applications more often than not returns all columns, so I'm thinking about letting the user manually select the columns they want.)
In general, reducing the number of columns in the select is a minor optimization. It means that less data is being returned from the database server to the application calling the server. Less data is usually faster.
Under most circumstances, this is a minor improvement. There are some cases where the improvement can be more important:
If a covering index is available for the query, so the index satisfies the query without having to access data pages (see the sketch after this list).
If some fields are very long, so records occupy multiple pages.
If the volume of data being retrieved is a small fraction (think < 10%) of the overall data in each record.
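A hypothetical illustration of the covering-index case (table, column and index names are invented): if the SELECT lists only the indexed columns, the index alone can answer the query; SELECT * cannot be covered this way.

CREATE INDEX ix_orders_customer ON orders (customer_id, order_date);

-- Covered: both selected columns live in the index, no data-page access needed.
SELECT customer_id, order_date FROM orders WHERE customer_id = 42;

-- Not covered: every column has to be fetched from the table rows.
SELECT * FROM orders WHERE customer_id = 42;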
Listing the columns individually is a good idea, because it protects code from changes in underlying schema. For instance, if the name of a column is changed, then a query that lists columns explicitly will break with an easy-to-understand error. This is better than a query that runs and produces erroneous results.
You should try not to use select *.
Inefficiency in moving data to the consumer. When you SELECT *, you're often retrieving more columns from the database than your application really needs to function. This causes more data to move from the database server to the client, slowing access and increasing load on your machines, as well as taking more time to travel across the network. This is especially true when someone adds new columns to underlying tables that didn't exist and weren't needed when the original consumers coded their data access.
Indexing issues. Consider a scenario where you want to tune a query to a high level of performance. If you were to use *, and it returned more columns than you actually needed, the server would often have to perform more expensive methods to retrieve your data than it otherwise might. For example, you wouldn't be able to create an index which simply covered the columns in your SELECT list, and even if you did (including all columns [shudder]), the next guy who came around and added a column to the underlying table would cause the optimizer to ignore your optimized covering index, and you'd likely find that the performance of your query would drop substantially for no readily apparent reason.
Binding Problems. When you SELECT *, it's possible to retrieve two columns of the same name from two different tables. This can often crash your data consumer. Imagine a query that joins two tables, both of which contain a column called "ID". How would a consumer know which was which? SELECT * can also confuse views (at least in some versions of SQL Server) when underlying table structures change -- the view is not rebuilt, and the data which comes back can be nonsense. And the worst part of it is that you can take care to name your columns whatever you want, but the next guy who comes along might have no way of knowing that he has to worry about adding a column which will collide with your already-developed names.
I got this from this answer.
I believe this topic has already been covered here:
select * vs select column
I believe it covers your concerns as well. Please take a look.
All the column labels and values occupy some space. Sending them to the issuer of the request instead of a subset of the columns means sending more data. More data is sent slower.
If you have columns, like
id, username, password, email, bio, url
and you want to get only the username and password, then
select username, password ...
is quicker than
select * ...
because id, email, bio and url are sent as well for the latter, which makes the response larger. But the main problem with select * is different. It might be the source of inconsistencies if, for some reason, the order of the columns changed. Also, it might retrieve data you do not want to retrieve. It is always better to have a whitelist of the columns you actually want to retrieve.
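One concrete way the column-order problem shows up (table names are illustrative): INSERT ... SELECT * matches columns by position, not by name, so it silently breaks or mixes up data if the source table's column order ever changes.

-- Works only while users and users_archive have identical column order:
INSERT INTO users_archive SELECT * FROM users;

-- Safe against reordering, because both sides name their columns:
INSERT INTO users_archive (id, username, email)
SELECT id, username, email FROM users;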
So I started working for a company where they had 3 to 5 different tables that were often queried either in a complex join or through double or triple queries (I'm probably the 4th person to start working here, it's very messy).
Anyhow, I created a table so that whenever the other 3 to 5 tables are queried at the same time, that data gets inserted into my table along with whatever information normally got inserted there. It has drastically sped up page loads for many applications, and I'm wondering if I made a mistake here.
I'm hoping in the future to stop inserting into those other tables, simply insert all that information into the table I've started, and switch the applications over to that one table. It's just a lot faster.
Could someone tell me why it's much faster to group all the information into one massive table and if there is any downside to doing it this way?
If the joins are slow, it may be because the tables did not have FOREIGN KEY relationships and indexes properly defined. If the tables had been properly normalized before, it is probably not a good idea to denormalize them into a single table unless they were not performant with proper indexing. FOREIGN KEY constraints require indexing on both the PK table and the related FK column, so simply defining those constraints if they don't already exist may go a long way toward improving performance.
The first course of action is to make sure the table relationships are defined correctly and the tables are indexed, before you begin denormalizing it.
There is a concept called materialized views, which serve as a sort of cache for views or queries whose result sets are deterministic, by storing the results of a view's query into a temporary table. MySQL does not support materialized views directly, but you can implement them by occasionally selecting all rows from a multi-table query and storing the output into a table. When the data in that table is stale, you overwrite it with a new rowset. For simple SELECT queries which are used to display data that doesn't change often, you may be able to speed up your pageloads using this method. It is not advisable to use it for data which is constantly changing though.
A good use for materialized views might be constructing rows to populate your site's dropdown lists or to store the result of complicated reports which are only run once a week. A bad use for them would be to store customer order information, which requires timely access.
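A rough sketch of that periodic refresh (report_cache and the underlying query are placeholders, not from the question): overwrite the cached rowset whenever it is considered stale, e.g. from cron or a MySQL event.

TRUNCATE TABLE report_cache;
INSERT INTO report_cache (customer_id, order_total)
SELECT c.id, SUM(o.amount)
FROM customers c
JOIN orders o ON o.customer_id = c.id
GROUP BY c.id;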
Without seeing the table structures, etc it would be guesswork. But it sounds like possibly the database was over-normalized.
It is hard to say exactly what the issue is without seeing it. But you might want to look at adding indexes, and foreign keys to the tables.
If you are adding a table with all of the data in it, you might be denormalizing the database.
There are some cases where de-normalizing your tables has its advantages, but I would be more interested in finding out if the problem really lies with the table schema or with how the queries are being written. You need to know if the queries utilize indexes (or whether indexes need to be added to the table), whether the original query writer did things like using subselects when they could have been using joins to make a query more efficient, etc.
I would not just denormalize because it makes things faster unless there is a good reason for it.
Having a separate copy of the data in your newly defined table is a valid performance-enhancing practice, but on the other hand it might become a total mess when it comes to keeping the data in your table and the other ones the same. You are essentially maintaining two versions of the truth, without a good idea of how to invalidate this "cache" when updates or deletes happen.
Read more about "normalization" and read more about "EXPLAIN" in MySQL - it will tell you why the other queries are slow, and you might get away with a few proper indexes and foreign keys instead of copying the data.
For my startup, I track everything myself rather than rely on google analytics. This is nice because I can actually have ips and user ids and everything.
This worked well until my tracking table rose about 2 million rows. The table is called acts, and records:
ip
url
note
account_id
...where available.
Now, trying to do something like this:
SELECT COUNT(distinct ip)
FROM acts
JOIN users ON(users.ip = acts.ip)
WHERE acts.url LIKE '%some_marketing_page%';
Basically never finishes. I switched to this:
SELECT COUNT(distinct ip)
FROM acts
JOIN users ON(users.ip = acts.ip)
WHERE acts.note = 'some_marketing_page';
But it is still very slow, despite having an index on note.
I am obviously not pro at mysql. My question is:
How do companies with lots of data track things like funnel conversion rates? Is it possible to do in mysql and I am just missing some knowledge? If not, what books / blogs can I read about how sites do this?
While getting towards 'respectable', 2 million rows is still a relatively small size for a table (and therefore faster performance is typically possible).
As you found out, leading wildcards are particularly inefficient, and we'll have to find a solution for this if that use case is common for your application.
It could just be that you do not have the right set of indexes. Before I proceed, however, I wish to stress that while indexes will typically improve DBMS performance with SELECT statements of all kinds, they systematically have a negative effect on the performance of "CUD" operations (i.e. with the SQL CREATE/INSERT, UPDATE and DELETE verbs, i.e. the queries which write to the database rather than just read from it). In some cases the negative impact of indexes on "write" queries can be very significant.
My reason for particularly stressing the ambivalent nature of indexes is that it appears that your application does a fair amount of data collection as a normal part of its operation, and you will need to watch for possible degradation as the INSERTs queries get to be slowed down. A possible alternative is to perform the data collection into a relatively small table/database, with no or very few indexes, and to regularly import the data from this input database to a database where the actual data mining takes place. (After they are imported, the rows may be deleted from the "input database", keeping it small and fast for its INSERT function.)
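A minimal sketch of that split, assuming a lightly-indexed staging table acts_staging with an auto-increment id (both names are assumptions of mine): periodically move the collected rows into the indexed mining table, then delete exactly what was moved.

-- Capture the high-water mark so rows inserted during the copy are not lost:
SET @up_to := (SELECT MAX(id) FROM acts_staging);

INSERT INTO acts (ip, url, note, account_id)
SELECT ip, url, note, account_id FROM acts_staging WHERE id <= @up_to;

DELETE FROM acts_staging WHERE id <= @up_to;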
Another concern/question is about the width of a row in the acts table (the number of columns and the sum of the widths of these columns). Bad performance could be tied to the fact that rows are too wide, resulting in too few rows in the leaf nodes of the table, and hence a deeper-than-needed tree structure.
Back to the indexes...
In view of the few queries in the question, it appears that you could benefit from an ip + note index (an index made of at least these two keys, in this order). A full analysis of the index situation, and frankly a possible review of the database schema, cannot be done here (not enough info for that), but the general process for doing so is to make a list of the most common use cases and to see which database indexes could help with these cases. One can gather insight into how particular queries are handled, initially or after index(es) are added, with the MySQL command EXPLAIN.
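A hedged example of both suggestions, using the tables from the question (whether (ip, note) or (note, ip) wins for your workload is exactly what EXPLAIN on real data should confirm):

ALTER TABLE acts ADD INDEX ip_note (ip, note);

EXPLAIN SELECT COUNT(DISTINCT ip)
FROM acts
JOIN users ON users.ip = acts.ip
WHERE acts.note = 'some_marketing_page';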
Normalization OR denormalization (or indeed a combination of both!) is often a viable idea for improving performance during mining operations as well.
Why the JOIN? If we can assume that no IP makes it into acts without an associated record in users then you don't need the join:
SELECT COUNT(distinct ip) FROM acts
WHERE acts.url LIKE '%some_marketing_page%';
If you really do need the JOIN it might pay to first select the distinct IPs from acts, then JOIN those results to users (you'll have to look at the execution plan and experiment to see if this is faster).
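One way to write that two-step version, as a sketch (semantically equivalent to the original count, but verify the plan with EXPLAIN):

SELECT COUNT(DISTINCT hits.ip)
FROM (SELECT DISTINCT ip FROM acts WHERE note = 'some_marketing_page') AS hits
JOIN users ON users.ip = hits.ip;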
Secondly, that LIKE with a leading wild card is going to cause a full table scan of acts and also necessitate some expensive text searching. You have three choices to improve this:
Decompose the url into component parts before you store it so that the search matches a column value exactly.
Require the search term to appear at the beginning of the url field, not in the middle.
Investigate a full text search engine that will index the url field in such a way that even an internal LIKE search can be performed against indexes.
Finally, in the case of searching on acts.notes, if an index on notes doesn't provide sufficient search improvement, I'd consider calculating and storing an integer hash on notes and searching for that.
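A sketch of the hash idea using MySQL's built-in CRC32 (the note_crc column is an addition you would have to create and maintain): search on the indexed integer first, then re-check the string to rule out hash collisions.

ALTER TABLE acts ADD COLUMN note_crc INT UNSIGNED;
UPDATE acts SET note_crc = CRC32(note);
ALTER TABLE acts ADD INDEX (note_crc);

SELECT COUNT(DISTINCT ip)
FROM acts
WHERE note_crc = CRC32('some_marketing_page')
  AND note = 'some_marketing_page';   -- guard against CRC32 collisions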
Try running EXPLAIN on your query and look to see if there are any table scans.
Should this be a LEFT JOIN?
Maybe this site can help.