MySQL - Large database, improve search time

I have two rather big tables (threads and posts) that contain a ton of forum posts. I really need to improve my search time. Even a plain search where COLUMN = VALUE takes 15 seconds, and doing a LIKE often crashes the entire website (timeout).
Here are the sizes of the two tables:
The threads table contains about 430,000 rows.
The posts table contains about 2,700,000 rows.
And I need to combine these in a query to get the results I want.
Don't bother about the search boxes on the website for now. Let's just start off with this query right here and start improving this one first.
SELECT p.id, t.id, t.title, t.threadstarter, t.replies, t.views, t.board, p.dateposted
FROM threads t
JOIN posts p ON t.id = p.threadid
WHERE t.title = 'sell'
GROUP BY t.id
This query takes about 15 seconds to get all threads and posts where the thread title is "sell". How would I improve this, making it take just a second or two? Is this even possible with MySQL on tables of this size?
And from there, I would have to use a LIKE (unless there is another method), because the users on the website will most likely not search for an exact match, and I want to include any title that contains the word "sell". So that would be like this:
SELECT p.id, t.id, t.title, t.threadstarter, t.replies, t.views, t.board, p.dateposted
FROM threads t
JOIN posts p ON t.id = p.threadid
WHERE t.title LIKE '%sell%'
GROUP BY t.id
Which I am not even going to bother measuring: it crashes the website (it takes far too long to execute). So this one really(!) needs improvement.
How should I even approach this? Should I even use MySQL? What options do I have? I do not want a user to sit and wait 30-300 seconds for a query to finish. At most 5 seconds.
Is this possible, with such large tables?
I've heard that using MATCH ... AGAINST could be better than COLUMN LIKE 'VALUE'. But then I'd need to add a FULLTEXT index to the columns. Are there any downsides to doing that?
If there's anyone out there that's worked with a ~3 million row MySQL database, then please let me know how you handled it (if you did).

Make use of an INDEX. Start by creating an index on whichever table has more records (or on the master table); even though it's an inner join, it will still make joining the two tables cheaper.
Also, I don't understand the use of GROUP BY without any aggregation, since the query's select list has no aggregates.
CREATE INDEX Index_NAME ON threads (title);

The correct way to express your first query is:
SELECT p.id, t.id, t.title, t.threadstarter, t.replies, t.views, t.board, p.dateposted
FROM threads t
JOIN posts p ON t.id = p.threadid
WHERE t.title = 'sell'
  AND p.dateposted = (SELECT MIN(p2.dateposted) FROM posts p2 WHERE p2.threadid = p.threadid);
This gets rid of the GROUP BY, so it might improve performance. In particular, you want indexes on:
threads(title, id)
posts(threadid, dateposted)
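In MySQL, those could be created like so (the index names are just illustrative):
-- Covers the title filter and supplies the join column
CREATE INDEX idx_threads_title_id ON threads (title, id);
-- Lets the correlated MIN() subquery be answered from the index alone
CREATE INDEX idx_posts_threadid_dateposted ON posts (threadid, dateposted);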

Give these two articles a read:
how to optimize mysql queries for speed and performance
MySQL Optimization

LIKE with a leading wildcard must scan all 430,000 rows:
WHERE t.title LIKE '%sell%'
Change to this:
WHERE MATCH(t.title) AGAINST('+sell' IN BOOLEAN MODE)
and have
FULLTEXT(title)
With that setup, the query can go directly to the few rows that contain the word "sell".
Caveat: There are restrictions on what FULLTEXT can search for -- only "words", not "stop words", only words of a certain minimum length, etc.
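Subject to those caveats, a minimal sketch of the setup (the index name is illustrative; building a FULLTEXT index on a large table takes a while):
-- One-time setup: full-text index on the thread titles
ALTER TABLE threads ADD FULLTEXT INDEX ft_title (title);
-- The original query, with LIKE replaced by MATCH ... AGAINST
SELECT p.id, t.id, t.title, t.threadstarter, t.replies, t.views, t.board, p.dateposted
FROM threads t
JOIN posts p ON t.id = p.threadid
WHERE MATCH(t.title) AGAINST('+sell' IN BOOLEAN MODE)
GROUP BY t.id;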


What is a "point-in-select" in MySQL?

I was given this query to update a report, and it was taking a long time to run on my computer.
select
c.category_type, t.categoryid, t.date, t.clicks
from transactions t
join category c
on c.category_id = t.categoryid
I asked the DBA if there were any issues with the query, and the DBA optimized the query in this manner:
select
(select category_type
from category c where c.category_id = t.categoryid) category_type,
categoryid,
date, clicks
from transactions t
He described the first subquery as a "point-in-select". I have never heard of this before. Can someone explain this concept?
I want to note that the two queries are not the same, unless the following is true:
transactions.categoryid is always present in category.
category has no duplicate values of category_id.
In practice, these are usually true (in most databases). For closer equivalence, the first query should use a LEFT JOIN:
select c.category_type, t.categoryid, t.date, t.clicks
from transactions t
left join category c on c.category_id = t.categoryid;
Still not exactly the same, but more similar.
Finally, both versions should make use of an index on category(category_id), and I would expect the performance to be very similar in MySQL.
Your DBA's query is not the same, as others noted, and, as far as I know, it is nonstandard SQL. Yours is much preferable for its simplicity alone.
It's usually not advantageous to re-write queries for performance. It can help sometimes, but the DBMS is supposed to execute logically equivalent queries equivalently. Failure to do so is a flaw in the query planner.
Performance issues are often a function of physical design. In your case, I would look for indexes on the category and transactions tables that contain categoryid as first column. If neither exist, your join is O(mn) because the category table must be scanned for each transaction row.
Not being a MySQL user, I can only advise you to get query planner output and look for indexing opportunities.
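In MySQL terms, that means prefixing the query with EXPLAIN and adding the index if it is missing (a sketch; the index name is illustrative):
-- Index so each transaction row can look up its category without a table scan
CREATE INDEX idx_category_id ON category (category_id);
-- Inspect the plan: the category lookup should now show ref (or eq_ref if unique)
EXPLAIN
SELECT c.category_type, t.categoryid, t.date, t.clicks
FROM transactions t
LEFT JOIN category c ON c.category_id = t.categoryid;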

Complex query optimization improve speed

I have the following query that I would like to optimize:
SELECT
    *, @rownum := @rownum + 1 AS rank
FROM (
    SELECT
        SUM(a.id = 1) AS KILLS,
        SUM(a.id = 2) AS DEATHS,
        SUM(a.id = 3) AS WINS,
        tb1.totalPlaytime,
        p.playerName
    FROM (
        SELECT
            player_id,
            SUM(pg.timeEnded - pg.timeStarted) AS totalPlaytime
        FROM playergame pg
        INNER JOIN player p ON pg.player_id = p.id
        WHERE pg.game_id IN (1, 2, 3)
        GROUP BY p.id
        ORDER BY p.playerName ASC
    ) tb1
    INNER JOIN playeraction pa ON pa.player_id = tb1.player_id
    INNER JOIN action a ON pa.action_id = a.id
    INNER JOIN player p ON pa.player_id = p.id
    GROUP BY p.id
    ORDER BY KILLS DESC
) tb2
WHERE tb2.playerName LIKE "%"
Somehow I have the feeling that this is not suited to MySQL. I keep a lot of actions in different tables for a good statistical approach, but this slows everything down. (Perhaps big data?)
This is my model
Now I tried the following:
Combining joins in a view
I combined the many JOINs into a VIEW. This gave me no improvements.
Indexing the tables
I indexed the frequently used keys; this did speed things up, but I still can't get the entire result set below 0.613s.
Starting from the action table and using left joins
This gave me a somewhat different approach, but the joins remain slow (the first example is still the fastest).
Any hints, tips, additions, or improvements are welcome.
I removed my previous answer, as it was wrong and did not help. Here I am just summarizing our conversation in the comments, with additional remarks of my own.
There are several ways to speed up the query.
Make sure you are not making any redundant queries.
Do as few joins as possible.
Make indexes on multiple columns if possible.
Make indexes clustered if needed/possible http://dev.mysql.com/doc/refman/5.0/en/innodb-index-types.html
Regarding the query you wrote in the question:
Remove the ORDER BY in the inner query.
Remove the INNER JOIN in the inner query and replace GROUP BY p.id with GROUP BY player_id, as sketched below.
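Applied to your query, the inner derived table would then look something like this (a sketch; with the join gone, player_id comes straight from playergame):
SELECT
    player_id,
    SUM(pg.timeEnded - pg.timeStarted) AS totalPlaytime
FROM playergame pg
WHERE pg.game_id IN (1, 2, 3)
GROUP BY player_id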
A few words on where indexes make sense and where they don't.
In your case it would not make sense to have an index on game_id in the playergame table, because it would probably match loads of rows. So that is about all you can do for the innermost query.
The joins can also be optimized a bit if you know what to expect from the tables, i.e., the amount of data they may face. Think of it as asking whether you are building the database behind an MMO or an FPS: an MMO will have millions of users per game, an FPS only a few. Different types of games may also have different actions. That implies you may be able to optimize the query by making the index more precise. If you can specify in the join on action that game_id IN (...), then creating an index on the tuple (game_id, id) might help.
Wildcards in the WHERE clause: you may try to create an index on playerName, but it will only be used if you search with a wildcard at the end of your search string; for one at the beginning you would need a separate index, and you would have to hope the query optimizer is smart enough to switch between them on each query.
Keep in mind that more indexes mean slower inserts and deletes, so keep only as few as possible.
Another thing would be redesigning the structure a bit. You may still keep the database normalized, but maybe it would be useful to have a table with summaries of some games. You may have a table summarizing games that happened before yesterday, and your query would then only aggregate the data for today, joining both tables if needed. You could optimize further by either creating an index on a timestamp or partitioning the table by day. Everything depends on the load you expect.
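As a sketch of that idea (the summary table and the refresh job are hypothetical; adjust names and the cut-off to your schema):
-- Total playtime per player for games that ended before today
CREATE TABLE playergame_summary (
    player_id INT NOT NULL PRIMARY KEY,
    totalPlaytime BIGINT NOT NULL
);
-- Refreshed by a background job, e.g. once per night
REPLACE INTO playergame_summary (player_id, totalPlaytime)
SELECT player_id, SUM(timeEnded - timeStarted)
FROM playergame
WHERE timeEnded < CURDATE()
GROUP BY player_id;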
The topic is rather deep, so everything depends on what is the story behind the data.

Optimizing the SQL Query to reduce execution time

My SQL query, with all the filters applied, is returning 10 lakh (one million) records. Getting all the records takes 76.28 seconds, which is not acceptable. How can I optimize my SQL query to take less time?
The query I am using is:
SELECT cDistName, cTlkName, cGpName, cVlgName,
       cMmbName, dSrvyOn
FROM sspk.villages
LEFT JOIN gps ON nVlgGpID = nGpID
LEFT JOIN TALUKS ON nGpTlkID = nTlkID
LEFT JOIN dists ON nTlkDistID = nDistID
LEFT JOIN HHINFO ON nHLstGpID = nGpID
LEFT JOIN MEMBERS ON nHLstID = nMmbHhiID
LEFT JOIN BNFTSTTS ON nMmbID = nBStsMmbID
LEFT JOIN STATUS ON nBStsSttsID = nSttsID
LEFT JOIN SCHEMES ON nBStsSchID = nSchID
WHERE (
    (nMmbGndrID = 1 AND nMmbAge BETWEEN 18 AND 60)
    OR (nMmbGndrID = 2 AND nMmbAge BETWEEN 18 AND 55)
)
AND cSttsDesc LIKE 'No, Eligible'
AND DATE_FORMAT(dSrvyOn, '%m-%Y') < DATE_FORMAT('2012-08-01', '%m-%Y')
GROUP BY cDistName, cTlkName, cGpName, cVlgName,
         DATE_FORMAT(dSrvyOn, '%m-%Y')
I have searched on this forum and elsewhere and applied some of the tips given, but it hardly makes any difference. The joins I have used in the above query are all LEFT JOINs on primary and foreign keys. Can anyone suggest how I can modify this SQL to reduce the execution time?
You are, sir, a very demanding user of MySQL! A million records retrieved from a massively joined result set at the speed you mentioned is 76 microseconds per record. Many would consider this to be acceptable performance. Keep in mind that your client software may be a limiting factor with a result set of that size: it has to consume the enormous result set and do something with it.
That being said, I see a couple of problems.
First, rewrite your query so every column name is qualified by a table name. You'll do this for yourself and the next person who maintains it. You can see at a glance what your WHERE criteria need to do.
Second, consider this search criterion. It requires TWO searches, because of the OR.
WHERE (
(MEMBERS.nMmbGndrID = 1 and MEMBERS.nMmbAge between 18 and 60)
or (MEMBERS.nMmbGndrID = 2 and MEMBERS.nMmbAge between 18 and 55)
)
I'm guessing that these criteria match most of your population -- females 18-60 and males 18-55 (a guess). Can you put the MEMBERS table first in your list of LEFT JOINs? Or can you put a derived column (MEMBERS.working_age = 1 or some such) in your table?
Also try a compound index on (nMmbGndrID,nMmbAge) on MEMBERS to speed this up. It may or may not work.
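In MySQL syntax (the index name is illustrative):
CREATE INDEX idx_members_gndr_age ON MEMBERS (nMmbGndrID, nMmbAge);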
Third, consider this criterion.
AND DATE_FORMAT(dSrvyOn , '%m-%Y') < DATE_FORMAT('2012-08-01' , '%m-%Y' )
You've applied a function to the dSrvyOn column. This defeats the use of an index for that search. Instead, try this.
AND dSrvyOn < '2012-08-01'
This will, if you have an index on dSrvyOn, do a range search on that index. My remark also applies to the function in your GROUP BY clause.
Finally, as somebody else mentioned, don't use LIKE to search where = will do. And NEVER use column LIKE '%something%' if you want acceptable performance.
You claim that your joins are all based on good, unique indexes, so there is little to be optimized. Maybe a few hints:
Try to optimize your table layout; maybe you can reduce the number of joins required. That probably brings more performance improvement than anything else.
Check your hardware (available memory and the like) and the server configuration.
Use MySQL's EXPLAIN feature to find bottlenecks.
Maybe you can create an auxiliary table especially for this query, filled by a background process. That way the query itself runs faster, since the heavy work has been done in the background beforehand. This usually works if the query retrieves data that does not have to be synchronous with every single change in the database.
Check whether an RDBMS is really the right kind of database. For many purposes graph databases are much more efficient and offer better performance.
Try adding an index on nMmbGndrID, nMmbAge, and cSttsDesc and see if that helps your queries.
Additionally, you can put the EXPLAIN command before your SELECT statement to get hints on what you might do better. See the MySQL Reference for more details on EXPLAIN.
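For example (a sketch against the MEMBERS table; the key and rows columns of the output show which index is used and how many rows are examined):
EXPLAIN SELECT cMmbName FROM MEMBERS
WHERE nMmbGndrID = 1 AND nMmbAge BETWEEN 18 AND 60;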
If the tables used in the joins are rarely targeted by update queries, you could consider changing the engine type from InnoDB to MyISAM.
SELECT queries in MyISAM can run around 2x faster than in InnoDB, but update and insert queries are much slower in MyISAM.
You can also create views to avoid writing out long queries every time.
Your LIKE operator could be holding you up; full-text search with LIKE is not MySQL's strong point.
Consider setting a FULLTEXT index on cSttsDesc (make sure it is a TEXT field first). Since cSttsDesc appears to come from the STATUS table, that would look like:
ALTER TABLE STATUS ADD FULLTEXT(cSttsDesc);
SELECT *
FROM STATUS
WHERE MATCH(cSttsDesc) AGAINST('No, Eligible')
Alternatively, you could store a boolean flag instead of matching cSttsDesc LIKE 'No, Eligible'.
Source: http://devzone.zend.com/26/using-mysql-full-text-searching/
This SQL has many redundancies that may not show up in an EXPLAIN.
If you require a field, it shouldn't be in a table that's in a LEFT JOIN - a left join is for when data might be in the joined table, not when it has to be.
If all the required fields are in the same table, that table should be in your first FROM.
If your text search is predictable (not from user input) and relates to a single known ID, use the ID, not the text search (props to Patricia for spotting the LIKE bottleneck).
Your query is hard to read because the column names are not qualified with table aliases, but there does seem to be a pattern to your field names.
You require nMmbGndrID and nMmbAge to have a value, but these are probably in MEMBERS, which is five left joins down. That's a redundancy.
Remember that you can do a simple join like this:
FROM sspk.villages, gps, TALUKS, dists, HHINFO, MEMBERS [...] WHERE [...] nVlgGpID = nGpID
AND nGpTlkID = nTlkID
AND nTlkDistID = nDistID
AND nHLstGpID = nGpID
AND nHLstID = nMmbHhiID
It looks like cSttsDesc comes from STATUS. But if the text 'No, Eligible' matches exactly one nBStsSttsID in BNFTSTTS, then find out that value and use it! If it is 7, take out LEFT JOIN STATUS ON nBStsSttsID = nSttsID and replace AND cSttsDesc like 'No, Eligible' with AND nBStsSttsID = 7. This should give a massive speed improvement.
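Putting those suggestions together, and assuming 7 really is the matching ID, the query would collapse to something like this sketch (it also folds in the sargable date filter from the earlier answer):
SELECT cDistName, cTlkName, cGpName, cVlgName,
       cMmbName, dSrvyOn
FROM sspk.villages
LEFT JOIN gps ON nVlgGpID = nGpID
LEFT JOIN TALUKS ON nGpTlkID = nTlkID
LEFT JOIN dists ON nTlkDistID = nDistID
LEFT JOIN HHINFO ON nHLstGpID = nGpID
LEFT JOIN MEMBERS ON nHLstID = nMmbHhiID
LEFT JOIN BNFTSTTS ON nMmbID = nBStsMmbID
LEFT JOIN SCHEMES ON nBStsSchID = nSchID
WHERE (
    (nMmbGndrID = 1 AND nMmbAge BETWEEN 18 AND 60)
    OR (nMmbGndrID = 2 AND nMmbAge BETWEEN 18 AND 55)
)
AND nBStsSttsID = 7           -- replaces the STATUS join and the LIKE
AND dSrvyOn < '2012-08-01'    -- no function on the column, so an index can be used
GROUP BY cDistName, cTlkName, cGpName, cVlgName,
         DATE_FORMAT(dSrvyOn, '%m-%Y')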

Complex MySQL Query is Slow

A program I've been working on uses a complex MySQL query to combine information from several tables that have matching item IDs. However, since I added the subqueries you see below, the query has gone from taking under 1 second to execute to over 3 seconds. Do you have any suggestions for what I might do to optimize this query to be faster? Am I wrong in my thinking that having one complex query is better than having 4 or 5 smaller queries?
SELECT uninet_articles.*,
Unix_timestamp(uninet_articles.gmt),
uninet_comments.commentcount,
uninet_comments.lastposter,
Unix_timestamp(uninet_comments.maxgmt)
FROM uninet_articles
RIGHT JOIN (SELECT aid,
(SELECT poster
FROM uninet_comments AS a
WHERE b.aid = a.aid
ORDER BY gmt DESC
LIMIT 1) AS lastposter,
Count(*) AS commentcount,
Max(gmt) AS maxgmt
FROM uninet_comments AS b
GROUP BY aid
ORDER BY maxgmt DESC
LIMIT 10) AS uninet_comments
ON uninet_articles.aid = uninet_comments.aid
LIMIT 10
Queries can be thought of as going through the data to find what matches. Sub-queries can require going through the data many times in order to find which items are needed. In this case, you probably want to rewrite it as multiple queries. Often, multiple simpler queries will be better - I think this is one of those cases.
You can also check whether your indexes are working well, if you know what that is. The reason why has to do with this: How does database indexing work?
For a specific suggestion, you can find the last poster for each AID in a different query, and simply join it afterwards.
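A sketch of that split (the second query runs once per aid returned by the first; the aid value 42 is just a placeholder):
-- Step 1: the ten most recently commented articles, with counts
SELECT aid, COUNT(*) AS commentcount, MAX(gmt) AS maxgmt
FROM uninet_comments
GROUP BY aid
ORDER BY maxgmt DESC
LIMIT 10;
-- Step 2: for each aid from step 1, fetch the last poster
SELECT poster
FROM uninet_comments
WHERE aid = 42
ORDER BY gmt DESC
LIMIT 1;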
It always depends on the data you have and the way you use it.
You should use explain on your selects to see if you are using the indexes or not.
http://dev.mysql.com/doc/refman/5.5/en/explain.html

Which is more efficient in MySQL, a big join or multiple queries of a single table?

I have a mysql database like this
Post – 500,000 rows (Postid,Userid)
Photo – 200,000 rows (Photoid,Postid)
About 50,000 posts have photos, average 4 each, most posts do not have photos.
I need to get a feed of all posts with photos for a userid, average 50 posts each.
Which approach would be more efficient?
1: Big Join
select *
from post
left join photo on post.postid=photo.postid
where post.userid=123
2: Multiple queries
select * from post where userid=123
while (loop through rows) {
select * from photo where postid=row[postid]
}
I've not tested this, but I very much suspect (at an almost cellular level) that a join would be vastly, vastly faster - what you're attempting is pretty much the reason why joins exist after all.
Additionally, there would be considerably less overhead in terms of scripting language <-> MySQL communication, etc., but I suspect that's somewhat of a moot factor.
The JOIN is always faster with proper indexing (as mentioned before), but several smaller queries may be more easily cached, provided of course that you are using the query cache. The more tables a query contains, the greater the chance of frequent invalidations.
As for parsing and optimization, I believe MySQL maintains its own statistics internally, and this usually happens once. What you lose when executing multiple queries is the round-trip time and the client buffering lag, which is small if the result set is relatively small in size.
A join will be much faster.
Each separate query will need to be parsed, optimized and executed, which takes quite a long time.
Just don't forget to create the following indexes:
post (userid)
photo (postid)
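In MySQL syntax (the index names are illustrative):
CREATE INDEX idx_post_userid ON post (userid);
CREATE INDEX idx_photo_postid ON photo (postid);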
With proper indexing on the postid columns, the join should be superior.
There's also the possibility of a sub-query:
SELECT * FROM photo WHERE postid IN (SELECT postid FROM post WHERE userid = 123);
I'd start by optimizing your queries; e.g., select * from post where userid=123 is obviously not needed, as you only use row[postid] in your loop, so don't select * if you want to split the query. Then I'd run a couple of tests to see which is faster, but JOINing just two tables is usually the fastest (don't forget to create an index where needed).
If you're planning to make your "big query" very big (by joining more tables), things can get very slow and you may need to split your query. I once joined seven tables, which took the query 30 seconds to run. Splitting the query made it run in a fraction of a second.
I'm not sure about this, but there is another option. It might be much slower or faster depending on the indexes used.
In your case, something like:
select t1.postid FROM (select postid from post where userid = 23) AS t1 JOIN photo ON t1.postid = photo.postid
If the number of rows in table t1 is going to be small compared to table post there might be a chance for considerable performance improvement. But I haven't tested it yet.
SELECT * FROM photo, post
WHERE post.userid = 123 AND photo.postid = post.postid;
If you only want posts with photos, construct your query starting with the photo table as your base table. Note, you will get the post info repeated with each result row.
If you didn't want to return all of the post info with each row, an alternative would be to
SELECT DISTINCT postid from photo, post where post.userid = 123;
Then foreach postid, you could
SELECT * from photo WHERE postid = $inpostid;