Optimize query with 1 join, on tables with 10+ million rows - MySQL

I am looking to make a query that joins 2 tables faster.
I have the following 2 tables:
Table "logs"
id varchar(36) PK
date timestamp(2)
more varchar fields, and one text field
That table has what the PHP Laravel framework calls a "polymorphic many-to-many" relationship with several other models, so there is a second table, "logs_pivot":
id unsigned int PK
log_id varchar(36) FOREIGN KEY (logs.id)
model_id varchar(40)
model_type varchar(50)
There are one or more entries in logs_pivot per entry in logs; logs_pivot has 20+ million rows and logs has 10+ million.
We run queries like this:
select * from logs
join logs_pivot on logs.id = logs_pivot.log_id
where model_id = 'some_id' and model_type = 'My\Class'
order by date desc
limit 50;
Obviously we have a compound index on the model_id and model_type fields, but the queries are still slow: several (dozens of) seconds every time.
We also have an index on the date field, but EXPLAIN shows that the model_id_model_type index is the one being used.
Explain statement:
+----+-------------+-------------+------------+--------+--------------------------------------------------------------------------------+-----------------------------------------------+---------+-------------------------------------------+------+----------+---------------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------------+------------+--------+--------------------------------------------------------------------------------+-----------------------------------------------+---------+-------------------------------------------+------+----------+---------------------------------+
| 1 | SIMPLE | logs_pivot | NULL | ref | logs_pivot_model_id_model_type_index,logs_pivot_log_id_index | logs_pivot_model_id_model_type_index | 364 | const,const | 1 | 100.00 | Using temporary; Using filesort |
| 1 | SIMPLE | logs | NULL | eq_ref | PRIMARY | PRIMARY | 146 | the_db_name.logs_pivot.log_id | 1 | 100.00 | NULL |
+----+-------------+-------------+------------+--------+--------------------------------------------------------------------------------+-----------------------------------------------+---------+-------------------------------------------+------+----------+---------------------------------+
On other tables, I was able to make a similar query much faster by including the date field in the index. But here the date is in a separate table from the columns being filtered.
When we access this data, it is typically a few hours or days old.
Our InnoDB buffer pool is much too small to hold all that data (plus all the other tables) in memory, so the data is most probably always read from disk.
What are all the ways we could make this query faster?
Ideally only with another index, or by changing how the query is done.
Thanks a lot!
Edit 17h05:
Thank you all for your answers so far. I will try something like O. Jones suggests, and also try to somehow include the date field in the pivot table so that I can add it to the index.
Edit 14/10 10h.
Solution:
I ended up changing how the query is done, sorting on the id field of the pivot table instead of the date, which indeed allows that column to be part of the index.
The query that counts the total number of rows was also changed to run against the pivot table alone whenever it is not filtered by date.
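For reference, here is a hypothetical sketch of that rewrite (the actual Laravel-generated query is not shown in the post, and the index name is illustrative):
-- Extend the pivot index so it covers both the filter and the sort.
-- (InnoDB secondary indexes implicitly end with the primary key, but
-- being explicit makes the intent clear.)
ALTER TABLE logs_pivot
ADD INDEX logs_pivot_model_type_id_index (model_id, model_type, id);

-- Sorting on the pivot table's auto-increment id (a good proxy for the
-- log date) lets WHERE, ORDER BY and LIMIT all be served by one index:
SELECT logs.*
FROM logs_pivot
JOIN logs ON logs.id = logs_pivot.log_id
WHERE logs_pivot.model_id = 'some_id'
AND logs_pivot.model_type = 'My\Class'
ORDER BY logs_pivot.id DESC
LIMIT 50;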
Thank you all !

Just a suggestion. Using a compound index is obviously a good thing. Another option is to pre-qualify an ID by date and extend the index on your logs_pivot table to (model_id, model_type, log_id).
If the entire history is 20+ million records, how far back does the data you actually need go, given that you only fetch 50 records per model id/type? Say 3 months, versus a log spanning 5 years? (Those numbers are not in the post, just a for-instance.) If you can find the minimum log ID whose date is within the last 3 months, that single ID can limit how much of the logs_pivot table is touched.
Something like:
select
   lp.*,
   l.date
from
   logs_pivot lp
   join logs l
      on lp.log_id = l.id
where
   model_id = 'some_id'
   and model_type = 'My\Class'
   and log_id >= ( select min( id )
                   from logs
                   where date >= date_sub( curdate(), interval 3 month ))
order by
   l.date desc
limit
   50;
So the subquery for the log_id runs once and returns a single ID reaching back just 3 months, rather than scanning the entire history of logs_pivot. You then query with the optimized two-part key of model id/type and, with the ID included in the index key, jump past all the historical entries.
Another thing you MAY want to add are some pre-aggregate tables with counts per month/year per given model type/id. Use those as a pre-query to present to users, then drill down from there for more detail. The pre-aggregate rows for historical periods can be built once, since they are static and will not change. The only period you would have to keep updating is the current month, such as on a nightly basis, or possibly better, via a trigger that inserts a record or updates a count for the given model/type per year/month aggregation every time a row is added. Again, just a suggestion, as there is no other context on how or why the data will be presented to the end user.
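A rough sketch of what such a pre-aggregate table and its maintenance trigger could look like (all names here are illustrative, and the sketch assumes counts are keyed by the month the pivot row is inserted):
-- Monthly counts per model; historical months are filled in once.
CREATE TABLE logs_pivot_monthly_counts (
   model_id   VARCHAR(40)  NOT NULL,
   model_type VARCHAR(50)  NOT NULL,
   yr_month   CHAR(7)      NOT NULL,  -- e.g. '2020-10'
   cnt        INT UNSIGNED NOT NULL DEFAULT 0,
   PRIMARY KEY (model_id, model_type, yr_month)
);

-- Keep the current month's count up to date on every insert.
DELIMITER //
CREATE TRIGGER logs_pivot_count_trg
AFTER INSERT ON logs_pivot
FOR EACH ROW
BEGIN
   INSERT INTO logs_pivot_monthly_counts (model_id, model_type, yr_month, cnt)
   VALUES (NEW.model_id, NEW.model_type, DATE_FORMAT(NOW(), '%Y-%m'), 1)
   ON DUPLICATE KEY UPDATE cnt = cnt + 1;
END//
DELIMITER ;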

I see two problems:
UUIDs are costly when tables are huge relative to RAM size.
The LIMIT cannot be handled optimally because the WHERE clauses come from one table, but the ORDER BY column comes from another table. That is, it will do all of the JOIN, then sort and finally peel off a few rows.

SELECT columns FROM big table ORDER BY something LIMIT small number is a notorious query performance antipattern. Why? The server sorts a whole mess of long rows, then discards almost all of them. It doesn't help that one of your columns is a LOB -- a TEXT column.
Here's an approach that can reduce that overhead: Figure out which rows you want by finding the set of primary keys you want, then fetch the content of only those rows.
What rows do you want? This subquery finds them.
SELECT logs.id
FROM logs
JOIN logs_pivot
ON logs.id = logs_pivot.log_id
WHERE logs_pivot.model_id = 'some_id'
AND logs_pivot.model_type = 'My\Class'
ORDER BY logs.date DESC
LIMIT 50
This does all the heavy lifting of working out the rows you want. So, this is the query you need to optimize.
It can be accelerated by this index on logs
CREATE INDEX logs_date_desc ON logs (date DESC);
and this three-column compound index on logs_pivot
CREATE INDEX logs_pivot_lookup ON logs_pivot (model_id, model_type, log_id);
This index is likely to be better, since the Optimizer will see the filtering on logs_pivot but not logs. Hence, it will look in logs_pivot first.
Or maybe
CREATE INDEX logs_pivot_lookup2 ON logs_pivot (log_id, model_id, model_type);
Try one then the other to see which yields faster results. (I'm not sure how the JOIN will use the compound index.) (Or simply add both, and use EXPLAIN to see which one it uses.)
Then, when you're happy -- or satisfied anyway -- with the subquery's performance, use it to grab the rows you need, like this
SELECT *
FROM logs
WHERE id IN (
SELECT logs.id
FROM logs
JOIN logs_pivot
ON logs.id = logs_pivot.log_id
WHERE logs_pivot.model_id = 'some_id'
AND model_type = 'My\Class'
ORDER BY logs.date DESC
LIMIT 50
)
ORDER BY date DESC
This works because it sorts less data. The covering three-column index on logs_pivot will also help.
Notice that both the subquery and the main query have ORDER BY clauses, to make sure the returned detail result set is in the order you need.
Edit: Darnit, I've been on MariaDB 10+ and MySQL 8+ so long I forgot about the old limitation (LIMIT is not allowed inside an IN subquery). Try this instead.
SELECT *
FROM logs
JOIN (
SELECT logs.id
FROM logs
JOIN logs_pivot
ON logs.id = logs_pivot.log_id
WHERE logs_pivot.model_id = 'some_id'
AND model_type = 'My\Class'
ORDER BY logs.date DESC
LIMIT 50
) id_set ON logs.id = id_set.id
ORDER BY date DESC
Finally, if you know you only care about rows newer than some certain time you can add something like this to your subquery.
AND logs.date >= NOW() - INTERVAL 5 DAY
This will help a lot if you have tonnage of historical data in your table.


MySQL indexes are not applied in GROUP BY

I have two tables for my search engine: one contains all keywords and the other contains all the possible targets for each keyword.
Table: keywords
id (int)
keyword (varchar)
Table: results
id (int)
keyword_id (int)
table_id (int)
target_id (int)
For both tables I set MyISAM as the storage engine, since 95% of the time I am just running select queries on these tables and only 5% of the time insert queries. And of course, I already compared the performance with InnoDB, and it was poor for the queries shown below.
I also added the following indexes
keywords.keyword (unique)
results.keyword_id (index)
results.table_id (index)
results.target_id (index)
In the keywords table I have about 1.2 million records, and in the results table about 9.8 million records.
Now the issue: when I run the following query, the result comes back in 0.0014 seconds
SELECT rs.table_id, rs.target_id
FROM keywords ky INNER JOIN results rs ON ky.id=rs.keyword_id
WHERE ky.keyword LIKE "x%" OR ky.keyword LIKE "y%"
But when I add GROUP BY, the result takes 0.2 seconds
SELECT rs.table_id, rs.target_id
FROM keywords ky INNER JOIN results rs ON ky.id=rs.keyword_id
WHERE ky.keyword LIKE "x%" OR ky.keyword LIKE "y%"
GROUP BY rs.table_id, rs.target_id
I tested composite indexes, single-column indexes, and even dropping the table_id and target_id indexes, but in all cases the performance is the same, and it seems that the index is not applied for the GROUP BY clause.
The explain plan shows that:
id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra
1 | SIMPLE | ky | range | PRIMARY,keyword | keyword | 767 | NULL | 3271 | Using index condition; Using where; Using temporary; Using filesort
1 | SIMPLE | rs | ref | keyword_id | keyword_id | 4 | ky.id | 3
I have the following composite key already added
ALTER TABLE results ADD INDEX `table_id` (`table_id`, `target_id`) USING BTREE;
Here's what the MySQL documentation on GROUP BY optimization says:
The most important preconditions for using indexes for GROUP BY are
that all GROUP BY columns reference attributes from the same index
So if you have separate indexes on these two columns, they won't be used by GROUP BY. You should try creating a composite index on (table_id, target_id).
Also, the query seems to be using the LIKE operator. Note that if the pattern compared with LIKE has a leading wildcard, MySQL won't be able to use any index on that column anyway. Have a look at the explain plan of the query and see which indexes are used.
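For illustration, a quick (hypothetical) way to see the difference; the exact plans will vary with your data:
-- A prefix pattern can use the index on keyword as a range scan:
EXPLAIN SELECT id FROM keywords WHERE keyword LIKE 'x%';
-- A leading wildcard cannot, and typically falls back to scanning:
EXPLAIN SELECT id FROM keywords WHERE keyword LIKE '%x';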
JOIN + GROUP BY (or DISTINCT) is what I call "explode-implode" -- First the JOIN multiplies the number of 'rows' to look at, then the GROUP BY deflates the row count.
One workaround to avoid this is to focus on the results table, then check for EXISTS in the keywords table:
SELECT rs.table_id, rs.target_id
FROM results rs
WHERE EXISTS( SELECT 1
              FROM keywords ky
              WHERE ky.id = rs.keyword_id
                AND ( ky.keyword LIKE "x%"
                   OR ky.keyword LIKE "y%" ) );
rs requires INDEX(keyword_id).
An improvement on that might be to get rid of the OR via
WHERE ky.id = rs.keyword_id
AND ky.keyword REGEXP "^[xy]"
But that is not very helpful, since a regular expression cannot use an index range scan and still has to check every keyword.
Another improvement could be to turn the OR into UNION:
( SELECT rs.table_id, rs.target_id
FROM keywords ky
INNER JOIN results rs ON ky.id=rs.keyword_id
WHERE ky.keyword LIKE "x%"
) UNION ALL
( SELECT rs.table_id, rs.target_id
FROM keywords ky
INNER JOIN results rs ON ky.id=rs.keyword_id
WHERE ky.keyword LIKE "y%"
)
ky: INDEX(keyword, id)
rs: INDEX(keyword_id)
The advantage here (other than avoiding the explode-implode) is that the index on (keyword, id) can be used for a range scan in each SELECT.
(Please provide SHOW CREATE TABLE for both tables; there may be other tips.)

using an index when SELECTing from a MySQL join

I have the following two MySQL/MariaDB tables:
CREATE TABLE requests (
request_id BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
unix_timestamp DOUBLE NOT NULL,
[...]
INDEX unix_timestamp_index (unix_timestamp)
);
CREATE TABLE served_objects (
request_id BIGINT UNSIGNED NOT NULL,
object_name VARCHAR(255) NOT NULL,
[...]
FOREIGN KEY (request_id) REFERENCES requests (request_id)
);
There are several million rows in each table. There are zero or more served_objects rows per request. I have a view that provides a complete picture of each served object by joining these two tables:
CREATE VIEW served_objects_view AS
SELECT
r.request_id AS request_id,
unix_timestamp,
object_name
FROM requests r
RIGHT JOIN served_objects so ON r.request_id=so.request_id;
This all seems pretty straightforward so far. But when I do a simple SELECT like this:
SELECT * FROM served_objects_view ORDER BY unix_timestamp LIMIT 5;
It takes a full minute or more. It's obviously not using the index. I've tried many different approaches, including flipping things around and using a LEFT or INNER join instead, but to no avail.
This is the output of the EXPLAIN for this SELECT:
+------+-------------+-------+--------+---------------+---------+---------+------------------+---------+---------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+-------------+-------+--------+---------------+---------+---------+------------------+---------+---------------------------------+
| 1 | SIMPLE | so | ALL | NULL | NULL | NULL | NULL | 5196526 | Using temporary; Using filesort |
| 1 | SIMPLE | r | eq_ref | PRIMARY | PRIMARY | 8 | db.so.request_id | 1 | |
+------+-------------+-------+--------+---------------+---------+---------+------------------+---------+---------------------------------+
Is there something fundamental here that prevents the index from being used? I understand that it needs to use a temporary table to satisfy the view, and that that's interfering with the ability to use the index. But I'm hoping some trick exists that will allow me to SELECT from the view while honouring the indexes in the requests table.
You're using a notorious performance antipattern.
SELECT * FROM served_objects_view ORDER BY unix_timestamp LIMIT 5;
You've told the query planner to make a copy of your whole view (in RAM or temp storage), sort it, and toss out all but five rows. So, it obeyed. It really didn't care how long it took.
SELECT * is generally considered harmful to query performance, and this is the kind of case why that's true.
Try this deferred-join optimization
SELECT a.*
FROM served_objects_view a
JOIN (
SELECT request_id
FROM served_objects_view
ORDER BY unix_timestamp
LIMIT 5
) b ON a.request_id = b.request_id
This sorts a smaller subset of data (just the request_id and timestamp values). It then fetches a small subset of the view's rows.
If it's still too slow for your purposes, try creating a compound index on requests (unix_timestamp, request_id). But that's probably unnecessary. If it is necessary, concentrate on optimizing the subquery.
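That would be something like the following (a sketch; the index name is illustrative, and on InnoDB the primary key is already implicitly appended to the existing unix_timestamp_index, so this may turn out to be redundant):
CREATE INDEX requests_ts_id ON requests (unix_timestamp, request_id);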
Remark: RIGHT JOIN? Really? Don't you want just JOIN?
VIEWs are not always well-optimized. Does the query run slowly when you run the underlying SELECT directly, without the view? Have you added the suggested index?
What version of MySQL/MariaDB are you using? There may have been optimization improvements in newer versions, and an upgrade might help.
My point is, you may have to abandon the VIEW.
The answer provided by O. Jones was the right approach; thanks! The big saviour here is that if the inner SELECT refers only to columns from the requests table (as is the case when SELECTing only request_id), the optimizer can satisfy the view without performing a join, making it lickety-split.
I had to make two adjustments, though, to make it produce the same results as the original SELECT. First, if non-unique request_ids are returned by the inner SELECT, the outer JOIN creates a cross product of those non-unique entries. The duplicate rows can be discarded by changing the outer SELECT into a SELECT DISTINCT.
Second, if the ORDER BY column can contain non-unique values, the result can contain irrelevant rows. These can be discarded by also SELECTing the order-by column and adding AND a.orderByCol = b.orderByCol to the JOIN condition.
So my final solution, which works well if orderByCol comes from the requests table, is the following:
SELECT DISTINCT a.*
FROM served_objects_view a
JOIN (
SELECT request_id, <orderByCol> FROM served_objects_view
<whereClause>
ORDER BY <orderByCol> LIMIT <startRow>,<nRows>
) b ON a.request_id = b.request_id AND a.<orderByCol> = b.<orderByCol>
ORDER BY <orderByCol>;
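For concreteness, here is the same template instantiated with unix_timestamp as the order column, no WHERE clause, and the first five rows, matching the original SELECT:
SELECT DISTINCT a.*
FROM served_objects_view a
JOIN (
SELECT request_id, unix_timestamp
FROM served_objects_view
ORDER BY unix_timestamp
LIMIT 0, 5
) b ON a.request_id = b.request_id AND a.unix_timestamp = b.unix_timestamp
ORDER BY unix_timestamp;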
This is a more convoluted solution than I was hoping for, but it works, so I'm happy.
One final comment. An INNER JOIN and a RIGHT JOIN are effectively the same thing here, so I originally formulated it in terms of a RIGHT JOIN because that's the way I was conceptualizing it. However, after some experimentation (after your challenge) I discovered that an INNER join is much more efficient. (It's what allows the optimizer to satisfy the view without performing a join if the inner SELECT refers only to columns from the requests table.) Thanks again!

MySQL JOIN and ORDER BY - performance issue

I have this query that has been driving me crazy for quite some time. It involves 3 tables (originally a lot more, but I isolated the performance issue): 1 base table, 1 products table that adds more data, and 1 with product types.
The product types table contains a "max age" column that indicates the maximum age of a row I want to fetch (anything older is considered "archived"), and its value differs by product type.
My poorly performing query goes like this, and it takes 50 seconds on a 250,000-row base table:
(select d_baseservices.ID
from d_baseservices
inner join d_products on d_baseservices.ServiceID = d_products.ServiceID
inner join md_prodtypes on d_products.ProdType = md_prodtypes.ProdType
where
(d_baseservices.CreationDate > (curdate() - INTERVAL md_prodtypes.MaxAge DAY))
order by CreationDate desc
limit 750);
Here is the EXPLAIN of this query:
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE md_prodtypes index PRIMARY,ProdType_UNIQUE,ID_MAX_AGE MAX_AGE 5 23 Using index; Using temporary; Using filesort
1 SIMPLE d_products ref PRIMARY,ServiceID_UNIQUE,fk_Products_BaseServices1,fk_d_products_md_prodtypes1 fk_d_products_md_prodtypes1 4 combina.md_prodtypes.ProdType 8625
1 SIMPLE d_baseservices eq_ref PRIMARY,CreationDateDesc_index,CreationDate_index PRIMARY 8 combina.d_products.ServiceID 1 Using where
I found a clue a few days back, when I determined that limiting the query to 750 records made it fast, while 751 brought back the poor performance.
I tried creating indexes of many kinds, with no success.
I tried removing the reference to MAX_AGE and the curdate function and just setting a fixed value, with little success: the query still takes 20 seconds:
(select d_baseservices.ID
from d_baseservices
inner join d_products on d_baseservices.ServiceID = d_products.ServiceID
inner join md_prodtypes on d_products.ProdType = md_prodtypes.ProdType
where
(d_baseservices.CreationDate > '2015-09-21 19:02:25')
order by CreationDate desc
limit 750);
And the EXPLAIN command output:
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE md_prodtypes index PRIMARY,ProdType_UNIQUE,ID_MAX_AGE ProdType_UNIQUE 4 23 Using index; Using temporary; Using filesort
1 SIMPLE d_products ref PRIMARY,ServiceID_UNIQUE,fk_Products_BaseServices1,fk_d_products_md_prodtypes1 fk_d_products_md_prodtypes1 4 combina.md_prodtypes.ProdType 8625
1 SIMPLE d_baseservices eq_ref PRIMARY,CreationDateDesc_index,CreationDate_index PRIMARY 8 combina.d_products.ServiceID 1 Using where
Can anyone please help? I've been stuck on this for almost a month.
It's hard to say exactly what to do without knowing more about the specific data you have (how many rows in each table, how many rows you expect the query to return, the distribution of the data values, etc), but I'll make some educated guesses and hopefully point you in the right direction.
First an explanation about why taking md_prodtypes.MaxAge out of the query greatly reduced the run time: Prior to that change the database had no ability at all to filter using indexes because in order to see if rows are candidates for inclusion it had to join the three tables in order to compare CreationDate from the first table to MaxAge in the third table. There is simply no index that you can add to correlate these two values. You're forcing the database engine to look at every single row.
As to the 750 magic number - I'm guessing that past 750 results the database has to page data or that it's hitting some other memory limit based on the values in your specific MySQL configuration file. I wouldn't read too much into that 750 number.
Lastly I'd like to point out that the EXPLAIN of your second query is a bit strange since it's showing md_prodtypes as the first table despite the fact that you took MaxAge out of the WHERE. That means the database is starting from md_prodtypes then moving up to d_products and finally to d_baseservices and only then filtering based on the date. I'm guessing that you're expecting it to first filter on the date then join only when it's decided what baseservices records to include. It's impossible to know why this is happening with the information you've provided. Perhaps you are missing an index.
Another possibility may have to do with variance in your CreationDate column. Let me explain by example: Say you had a table of users, and each user had a gender column that could be either f or m. Let's pretend that we have a 50%/50% split of females and males. Now, if you add an index on the column gender and do a query filtered by WHERE gender='f' expecting that the index will filter out half of the records, you'd be surprised to see that the database will totally ignore the index and just scan the table. The reason being is that it's cheaper to just read the whole table if you know the index isn't filtering out enough (the alternative being jumping constantly from the index to the main table data). In your case, if the WHERE on the CreationDate column doesn't filter out enough records, then even if you have an index on it, it won't be used.
With a constant date...
INDEX(CreationDate)
That will encourage the optimizer to start with the table that can be filtered. Also, since the ORDER BY is on the same field, the WHERE, ORDER BY and LIMIT can all be done at the same time.
Otherwise, it must read all the relevant records from all 3 tables, sort them, then deliver 750 (or 751) of them.
Using MAX_AGE...
Now the optimizer won't know whether it is better to do as above or find all the rows, sort them, then deliver the LIMIT.
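To make that concrete: with the fixed-date variant from the question, the existing index on CreationDate can satisfy the WHERE, ORDER BY and LIMIT together, and if the optimizer still insists on starting with md_prodtypes (as the question's EXPLAIN shows), an index hint can be tested. A sketch, using the index name from the EXPLAIN output:
SELECT d_baseservices.ID
FROM d_baseservices FORCE INDEX (CreationDate_index)
INNER JOIN d_products ON d_baseservices.ServiceID = d_products.ServiceID
INNER JOIN md_prodtypes ON d_products.ProdType = md_prodtypes.ProdType
WHERE d_baseservices.CreationDate > '2015-09-21 19:02:25'
ORDER BY d_baseservices.CreationDate DESC
LIMIT 750;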

MySQL join query performance issue

I am running the following query:
SELECT packages.id, packages.title, subcat.id, packages.weight
FROM packages, provider, packagestosubcat,
packagestocity, subcat, usertosubcat,
usertocity, usertoprovider
WHERE packages.endDate >'2011-03-11 06:00:00' AND
usertosubcat.userid = 1 AND
usertocity.userid = 1 AND
packages.providerid = provider.id AND
packages.id = packagestosubcat.packageid AND
packages.id = packagestocity.packageid AND
packagestosubcat.subcatid = subcat.id AND
usertosubcat.subcatid = packagestosubcat.subcatid AND
usertocity.cityid = packagestocity.cityid AND
(
provider.providertype = 'reg' OR
(
usertoprovider.userid = 1 AND
provider.providertype != 'reg' AND
usertoprovider.providerid = provider.ID
)
)
GROUP BY packages.title
ORDER BY subcat.id, packages.weight DESC
When I run EXPLAIN, everything looks OK except for the scan on the usertoprovider table, which doesn't seem to be using the table's keys:
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE usertocity ref user,city user 4 const 4 Using temporary; Using filesort
1 SIMPLE packagestocity ref city,packageid city 4 usertocity.cityid 419
1 SIMPLE packages eq_ref PRIMARY,enddate PRIMARY 4 packagestocity.packageid 1 Using where
1 SIMPLE provider eq_ref PRIMARY,providertype PRIMARY 4 packages.providerid 1 Using where
1 SIMPLE packagestosubcat ref subcatid,packageid packageid 4 packages.id 1 Using where
1 SIMPLE subcat eq_ref PRIMARY PRIMARY 4 packagestosubcat.subcatid 1
1 SIMPLE usertosubcat ref userid,subcatid subcatid 4 const 12 Using where
1 SIMPLE usertoprovider ALL userid,providerid NULL NULL NULL 3735 Using where
As you can see in the above query, the condition itself is:
provider.providertype = 'reg' OR
(
usertoprovider.userid = 1 AND
provider.providertype != 'reg' AND
usertoprovider.providerid = provider.ID
)
Both tables, provider and usertoprovider, are indexed. provider has indexes on id and providertype, while usertoprovider has indexes on userid and providerid.
The cardinality of the keys is:
provider.id=47, provider.type=1, usertoprovider.userid=1245, usertoprovider.providerid=6
So it's quite obvious that the indexes are not being used.
Furthermore, to test it out, I went ahead and:
Duplicated the usertoprovider table
Inserted into the cloned table a row for every provider that has providertype='reg'
Simplified the condition to (usertoprovider.userid = 1 AND usertoprovider.providerid = provider.ID)
The query execution time dropped from 8.1317 sec to 0.0387 sec.
Still, providers with providertype='reg' are valid for all users, and I would like to avoid inserting those rows into the usertoprovider table for every user, since that data is redundant.
Can someone please explain why MySQL still runs a full scan and doesn't use the keys? What can be done to avoid it?
It seems that provider.providertype != 'reg' is redundant (always true there) unless provider.providertype is nullable and you want the condition to fail on NULL.
And shouldn't != be <> to be standard SQL? (Although MySQL does allow !=.)
On cost of table scans
A full table scan is not necessarily more expensive than walking an index, because walking an index still requires multiple page accesses. In many database engines, if your table is small enough to fit in a few pages and the number of rows is small enough, it will be cheaper to do a table scan. Database engines make this type of decision based on the table's data and index statistics.
This case
However, in your case, it might also be because of the other leg in your OR clause: provider.providertype = 'reg'. If providertype is "reg", then this query joins in ALL the rows of usertoprovider (most likely not what you want) since it is a multi-table cross join.
The database engine is correct in determining that you'll likely need all the table rows in usertoprovider anyway (unless none of the providertypes is "reg", which the engine may well know!).
The query hides this fact because you are grouping on the (MASSIVE!) result set later on and just returning the package ID, so you won't see how many usertoprovider rows have been returned. But it will run very slowly. Get rid of the GROUP BY clause to find out how many rows you are actually forcing the database engine to work on!!!
The reason you see a massive speed improvement if you fill out the usertoprovider table is because then every row participates in a join, and there is no full cross join happening in the case of "reg". Before, if you have 1,000 rows in usertoprovider, every row with type="reg" expands the result set 1,000 times. Now, that row joins with only one row in usertoprovider, and the result set is not expanded.
If you really want to let through anything with providertype='reg' without putting it in your many-to-many mapping table, then the easiest way may be to use a sub-query:
Remove usertoprovider from your FROM clause
Do the following:
provider.providertype='reg' OR EXISTS (SELECT * FROM usertoprovider WHERE userid=1 AND providerid = provider.ID)
Another method is to use an OUTER JOIN on the usertoprovider -- any row with "reg" which is not in the table will come back with one row of NULL instead of expanding the result set.
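A sketch of that variant, trimmed to just the tables involved in the condition (the other joins and filters from the original query are omitted for brevity):
SELECT packages.id, packages.title
FROM packages
INNER JOIN provider ON packages.providerid = provider.id
LEFT JOIN usertoprovider
       ON usertoprovider.providerid = provider.id
      AND usertoprovider.userid = 1
WHERE provider.providertype = 'reg'
   OR usertoprovider.providerid IS NOT NULL;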
Hmm, I know that MySQL does funny things with grouping. In any other RDBMS, your query wouldn't even execute. What does this even mean:
SELECT packages.id
[...]
GROUP BY packages.title
ORDER BY subcat.id, packages.weight DESC
You want to group by title. In standard SQL syntax, this means you can only select title and aggregate functions of the other columns. MySQL magically tries to execute (and probably guess) what you may have meant to execute. So what would you expect to be selected as packages.id? The first matching package ID for every title? Or the last? And what would the ORDER BY clause mean with respect to the grouping? How can you order by columns that are not part of the result set (because only packages.title really is)?
There are two solutions, as far as I can see:
Your query is on the right track; just remove the ORDER BY clause. I don't think it will affect your result, but it may severely slow down your query.
You have a SQL problem, not a performance problem

Really slow query used to be really fast. Explain shows rows=1 on local backup but rows=2287359 on server

I check for spam every now and then using "select * from posts where post like '%http://%' order by id desc limit 10" and a few other keyword searches. Lately the select is impossibly slow.
mysql> explain select * from posts where reply like "%http://%" order by id desc limit 1;
+----+-------------+-----------+-------+---------------+---------+---------+------+---------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-----------+-------+---------------+---------+---------+------+---------+-------------+
| 1 | SIMPLE | posts | index | NULL | PRIMARY | 4 | NULL | 2287347 | Using where |
+----+-------------+-----------+-------+---------------+---------+---------+------+---------+-------------+
1 row in set (0.00 sec)
On my netbook with 1 GB of RAM, the only difference is that the "rows" column shows 1. There are only 1.3 million posts on my netbook, though. The server has about 6 GB of RAM and a fast processor. What should I optimize so it's not horribly slow? Recently I added an index to search by userId, which I'm not sure was a smart choice; I added it to both the backup and the production server a little before this issue started happening. I'd imagine it's related to not being able to sort in RAM due to a missed tweak?
It also seems to be slow when I do things like "delete from posts where threadId=X"; dunno if that's related.
With respect to
SELECT * FROM posts WHERE reply LIKE "%http://%" ORDER BY id DESC LIMIT 1
Due to the wildcards on both sides of the http://, MySQL cannot use an index on reply to quickly find what you're looking for. Moreover, since you're asking for the row with the largest id, MySQL has to pull all matching rows to make sure it has the one with the largest id.
Depending on how much of the posts table's data is made up of the reply column, it might be worthwhile to add a compound index on (id, reply) and change the query to something like
SELECT id FROM posts WHERE reply LIKE "%http://%" ORDER BY id DESC LIMIT 1
(which can have index-only execution), then join back to the posts table or retrieve the posts using the retrieved ids. If the query has index-only execution, and the index fits in memory and is already in memory (due to normal use, or by intentionally warming it up), you could potentially speed up the query execution.
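A sketch of that index. Note this involves an assumption about the schema: if reply is a short VARCHAR it can be indexed in full, but if it is a TEXT column MySQL requires a prefix length, and a prefix index cannot by itself give index-only execution for a LIKE over the whole column:
-- If reply is a VARCHAR short enough to index in full:
ALTER TABLE posts ADD INDEX id_reply (id, reply);
-- If reply is TEXT, a prefix is required (191 is arbitrary), which
-- defeats index-only execution for the full-column LIKE:
-- ALTER TABLE posts ADD INDEX id_reply (id, reply(191));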
Having said all that, if identical queries on two identical servers with identical data are giving different execution plans and execution times, it might be time to OPTIMIZE TABLE posts to refresh the index statistics and/or defragment the table. If you have recently been adding/removing indexes, things might have gone astray. Moreover, if the data is fragmented, then when pulling rows in PRIMARY KEY order the server could be jumping all over the disk to retrieve them.
With respect to DELETE FROM posts WHERE threadId=X, it should be fine as long as there is an index on threadId.
Indexes won't be used if you start your search comparison with a "%". Your problem is with
where reply like "%http://%"
As confirmed by your explain, no indexes are used. The speed difference may be due to caching.
What kind of indexes do you have on your table(s)? A good rule of thumb is to have an index on the columns that appear most often in your WHERE clause. If you do not have an index on your threadId column, your last query will be a lot slower than if you did.
Your first query (select * from posts where post like '%http://%') will be slow simply due to the LIKE in the query. I would suggest filtering your query with another WHERE clause - perhaps by date (which is hopefully indexed):
select * from posts where postdate > 'SOMEDATE' and post like '%http://%'
Can you write an after-insert trigger that examines the text looking for the substring 'http://' and either flags the current record or writes its id out to a SPAM table? As @brent said, indexes are not used for "contains substring" searches.
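Something along these lines might work; a hypothetical sketch, where the side table and trigger names are made up, and the column is assumed to be reply (the question uses both post and reply). An AFTER INSERT trigger cannot modify the row being inserted, so this version records suspect ids in a side table:
CREATE TABLE spam_candidates (
   post_id INT UNSIGNED PRIMARY KEY
);

DELIMITER //
CREATE TRIGGER posts_spam_check
AFTER INSERT ON posts
FOR EACH ROW
BEGIN
   IF NEW.reply LIKE '%http://%' THEN
      INSERT INTO spam_candidates (post_id) VALUES (NEW.id);
   END IF;
END//
DELIMITER ;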