MySQL Simple Query Takes Too Long To Execute

I have a query and it seems very slow.
My problem:
select conversation_hash as search_hash
from conversation
where conversation_hash ='xxxxx'
and result_published_at between '1600064000' and '1610668799'
order by result_published_at desc
limit 5
There is a total of 773179 records when I run:
select count(*)
from conversation
where conversation_hash ='xxxxx'
After I run EXPLAIN on the query
explain select conversation_hash as search_hash
from conversation
where conversation_hash ='xxxxx'
and result_published_at between '1600064000' and '1610668799'
order by result_published_at desc
limit 5
I got this:
id,select_type,table,partitions,type,possible_keys,key,key_len,ref,rows,filtered,extra
1, SIMPLE, conversation, , range, idx_result_published_at,conversation_hash_channel_content_id_index,conversation_result_published_at_index,virtaul_ad_id_conversation_hash, idx_result_published_at, 5, , 29383288, 1.79, Using index condition;Using where
Possible Issues
Looking at the EXPLAIN output, I can see it reports more rows (29383288) than the total number of records (i.e. 773179).
key_len is 5, but result_published_at is a timestamp field and its value is definitely longer than 5, e.g. 1625836640.
What can I improve to make this query fast? Thanks in advance.
EDIT
Indexes for conversation
Table,Non_unique,Key_name,Seq_in_index,Column_name,Collation,Cardinality,Sub_part,Packed,Null,Index_type,Comment,Index_comment
conversation,0,PRIMARY,1,id,A,96901872,NULL,NULL,,BTREE,,
conversation,0,conversation_conversation_hash_id_result_id_unique,1,conversation_hash_id,A,240485,NULL,NULL,,BTREE,,
conversation,0,conversation_conversation_hash_id_result_id_unique,2,result_id,A,100693480,NULL,NULL,,BTREE,,
conversation,0,conversation_conversation_hash_id_channel_content_id_unique,1,conversation_hash_id,A,232122,NULL,NULL,,BTREE,,
conversation,0,conversation_conversation_hash_id_channel_content_id_unique,2,channel_content_id,A,100693480,NULL,NULL,,BTREE,,
conversation,1,conversation_tool_id_foreign,1,tool_id,A,7788,NULL,NULL,,BTREE,,
conversation,1,idx_result_published_at,1,result_published_at,A,38164712,NULL,NULL,YES,BTREE,,
conversation,1,idx_user_name,1,user_name,A,10896208,NULL,NULL,YES,BTREE,,
conversation,1,conversation_hash_channel_content_id_index,1,conversation_hash,A,294048,NULL,NULL,,BTREE,,
conversation,1,conversation_hash_channel_content_id_index,2,channel_content_id,A,99699696,NULL,NULL,,BTREE,,
conversation,1,idx_parent_channel_content_id,1,parent_channel_content_id,A,3550741,NULL,NULL,YES,BTREE,,
conversation,1,idx_channel_content_id,1,channel_content_id,A,90350472,NULL,NULL,,BTREE,,
conversation,1,conversation_result_published_at_index,1,result_published_at,A,37177476,NULL,NULL,YES,BTREE,,
conversation,1,virtaul_ad_id_conversation_hash,1,conversation_hash,A,238906,NULL,NULL,,BTREE,,
conversation,1,virtaul_ad_id_conversation_hash,2,virtual_ad_id,A,230779,NULL,NULL,YES,BTREE,,
conversation,1,idx_ad_story_id,1,ad_story_id,A,167269,NULL,NULL,YES,BTREE,,

The query is correct; it seems you need to update the server configuration for MySQL, which is probably not possible in a shared hosting environment. However, if you have your own server, then follow these steps:
Go to the my.cnf file; in my case it is located at /etc/mysql/my.cnf
Increase the values of query_cache_size, max_connections, innodb_buffer_pool_size and innodb_io_capacity (a sketch follows below)
Switch from MyISAM to InnoDB (if possible)
Use latest MySQL version (if possible)
You can get more help from this article https://phoenixnap.com/kb/improve-mysql-performance-tuning-optimization
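As a rough sketch, the same variables can also be adjusted at runtime for testing (the values below are placeholders, not recommendations; size them to your RAM and workload, and put the final values under [mysqld] in my.cnf so they survive a restart):
-- Illustrative values only
SET GLOBAL innodb_buffer_pool_size = 4294967296;  -- 4 GB; resizable at runtime on MySQL 5.7.5+
SET GLOBAL innodb_io_capacity = 1000;
SET GLOBAL max_connections = 300;
-- query_cache_size exists only up to MySQL 5.7; the query cache was removed in MySQL 8.0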

It's a bit hard to read the output of the Explain command because the possible_keys output is separated by commas.
Depending on the data access patterns, you might want to create a unique index on conversation_hash, in case rows are unique.
If conversation_hash is not a unique field, you can create a compound index on (conversation_hash, result_published_at) so your query will be fulfilled from the index itself.
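A minimal sketch of that compound index (the index name is just an example):
CREATE INDEX idx_hash_published_at ON conversation (conversation_hash, result_published_at);
Since the query selects only conversation_hash and filters and sorts on result_published_at, this index can act as a covering index for it.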

EXPLAIN estimates the row counts. (It has no way to get the exact number of rows without actually running the query.) That estimate may be a lot lower or higher than the real count. It is rare for the estimate to be that far off, but I would not be worried, just annoyed.
The existence of Text and Blob columns sometimes adds to the imprecision of Explain.
Key_len:
The raw length, which is 5 for TIMESTAMP (more below).
+1 if the column is nullable (+0 for NOT NULL).
Not very useful for VARCHAR.
In older versions of MySQL, a Timestamp took 4 bytes and a DATETIME took 8. When fractional seconds were added, those numbers were changed to 5 in both cases. This allowed for a "length" to indicate the number of decimal places. And Datetime was changed from packed decimal to an integer.
Suggest you run ANALYZE TABLE. This might improve the underlying statistics that feed into the estimates.
Please provide SHOW CREATE TABLE; it may give more insight.
The 'composite' INDEX(conversation_hash, result_published_at), in that order, is optimal for that query.
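For reference, those diagnostic steps look like this:
ANALYZE TABLE conversation;
SHOW CREATE TABLE conversation;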

Related

Is this strategy for fast substring search in MySQL fast enough?

I have a USER table with millions of rows. I am implementing a search function that allows someone to look for a user by typing in a username. This autocomplete feature needs to be blazingly fast. Given that, in MySQL, column indexes speed up queries using LIKE '{string}%', is the following approach performant enough to return within 200ms? (Note: memory overhead is not an issue here; usernames are at most 30 characters.)
Create a USERSEARCH table that has a foreign key to the user table and an indexed ngram username column:
USERSEARCH
user_id username_ngram
-------------------------
1 crazyguy23
1 razyguy23
1 azyguy23
1 zyguy23
...
The query would then be:
SELECT user_id FROM myapp.usersearch WHERE username_ngram LIKE '{string}%'
LIMIT 10
I am aware that third party solutions exist, but I would like to stay away from them at the moment for other reasons. Is this approach viable in terms of speed? Am I overestimating the power of indexes if the db would need to check all O(30n) rows where n is the number of users?
Probably not. The UNION DISTINCT is going to process each subquery to completion.
If you just want arbitrary rows, phrase this as:
(SELECT user_id
 FROM myapp.usersearch
 WHERE username_1 LIKE '{string}%'
 LIMIT 10
) UNION DISTINCT
(SELECT user_id
 FROM myapp.usersearch
 WHERE username_2 LIKE '{string}%'
 LIMIT 10
)
LIMIT 10;
This will at least save you lots of time for common prefixes -- say 'S'.
That said, this just returns an arbitrary list of 10 user_ids when there might be many more.
I don't know if the speed will be fast enough for your application. You have to make that judgement by testing on an appropriate set of data.
Assuming SSDs, that should be blazing fast, yes.
Here are some further optimizations:
I would add a DISTINCT to your query, since there is no point in returning the same user_id multiple times. Especially when searching for a very common prefix, such as a single letter.
Also consider searching only for at least 3 letters of input. Less tends to be meaningless (since hopefully your usernames are at least 3 characters long) and is a needless hit on your database.
If you're not adding any more columns (I hope you're not, since this table is meant for blazing fast search!), we can do better. Swap the columns. Make the primary key (username_ngram, user_id). This way, you're searching directly on the primary key. (Note the added benefit of the alphabet ordering of the results! Well... alphabetic on the matching suffixes, that is, not the full usernames.)
Make sure you have an index on user_id, to be able to replace everything for a user if you ever need to change a username. (To do so, just delete all rows for that user_id and insert brand new ones.)
Perhaps we can do even better. Since this is just for fast searching, you could use an isolation level of READ_UNCOMMITTED. That avoids placing any read locks, if I'm not mistaken, and should be even faster. It can read uncommitted data, but so what... Afterwards you'll just query any resulting user_ids in another table and perhaps not find them, if that user was still being created. You haven't lost anything. :)
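A rough sketch of the table layout suggested above (column types are assumptions based on the question, e.g. usernames of at most 30 characters):
CREATE TABLE usersearch (
  username_ngram VARCHAR(30) NOT NULL,
  user_id INT UNSIGNED NOT NULL,
  PRIMARY KEY (username_ngram, user_id),  -- prefix searches hit the primary key directly
  KEY idx_user_id (user_id)               -- lets you delete/replace all ngrams for a user
) ENGINE=InnoDB;
To change a username you would then delete all rows for that user_id and insert fresh ngram rows, as described above.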
I think you need to use a MySQL full-text index to improve performance.
You need to change your syntax to use your full-text index.
Create full text index:
CREATE FULLTEXT INDEX ix_usersearch_username_ngram ON usersearch(username_ngram);
The official MySQL documentation on how to use a full-text index: https://dev.mysql.com/doc/refman/8.0/en/fulltext-search.html
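A boolean-mode prefix query against that index might look roughly like this (a sketch; note that full-text search has its own minimum token length and stopword settings, which may not suit very short prefixes):
SELECT user_id FROM usersearch
WHERE MATCH(username_ngram) AGAINST('crazyg*' IN BOOLEAN MODE)
LIMIT 10;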

MySQL - How to optimize table for speed

First of all, I am still very new to PHP / mySQL so excuse my question if it is too simple for you :)
Here is my issue: I am currently working on storing a lot of data in a MySQL database. It's basically a directory-like archive for my son's school.
The structure is basically:
id, keyword, title, description, url, rank, hash
id is int 11
keyword is varchar 255
title is varchar 255
description is text
url is varchar 255
rank is int 2
hash is varchar 50
We plan to insert about 10 million rows containing the fields above and my mission is being able to query the database as fast as possible.
My query is always for an exact keyword.
For example:
select * from table where keyword = "keyword" limit 10
I really just need to query the keyword and not the title or description or anything else. There are a maximum of 10 results for each keyword - never more.
I have no real clue about mysql indexes and stuff but I read that it can improve speed if you have some indexes.
Now here is where I need help from a Pro. My mission is being able to run the fastest possible query, so it doesn't take too long to query the database. Since I am only looking up the keyword field, I am sure there is a way to make sure that even if you have millions of rows, that the results can be returned quickly.
What would you suggest that I should do? Should I set the keyword field to INDEX or do I have to watch anything else? Since I have no real clue about INDEXES, your help is appreciated, meaning I don't know if I should use indexes at all, or if I have to use them for everything like keyword, title, description and so on...
The database is updated frequently - in case it matters.
Do you think it's even possible to store millions of rows and doing a query in less than a second?
Any other suggestions such as custom my.cnf settings etc would be also helpful.
Your help is greatly appreciated.
Your intuition is correct - if you are only filtering on keyword in your WHERE clause, it should be indexed and you likely will see some execution speed improvement if you do.
CREATE INDEX `idx_keyword` ON `yourtable` (`keyword`)
You may be using a client like PHPMyAdmin, which makes index creation easier than executing commands, but review the MySQL documentation on CREATE INDEX. Yours is a very run-of-the-mill case, so you won't need any special options.
Although this isn't the case for you (as you said there would be up to 10 rows per keyword), if you already had a unique constraint or PRIMARY KEY or FOREIGN KEY defined on keyword, it would function as an index as well.
Add an index on the keyword column. It will increase the speed significantly. Then it should be no problem to query the data in milliseconds.
In general you should put an index on fields you are using in your WHERE clause. That way the DB can limit the data really fast and return the results.
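Once the index exists, you can confirm that MySQL actually uses it with EXPLAIN (table and index names as in the answer above):
EXPLAIN SELECT * FROM yourtable WHERE keyword = 'keyword' LIMIT 10;
The key column of the EXPLAIN output should show idx_keyword, and the estimated rows should be small rather than in the millions.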

MySQL Bottleneck

I have a table with the following structure:
ID, SourceID, EventId, Starttime, Stoptime
All of the ID columns are char(36) and the times are dates.
The problem is that querying the table is really slow. I have 7 million rows, and about 60-70 threads writing (insert or update) to the table all the time.
On the other side I have the GUI that needs to read from this table, and this is where it gets slow. If I want to select all the events where SourceID = something, it takes almost 300 seconds. SourceID has an index. When I take the same query and put the EXPLAIN keyword first, I get this:
select_type = SIMPLE
type = ref
possible_keys = sourceidnevent,sourceid
key = sourceid
key_len = 109
ref = const
rows = 84148
And the query
SELECT * FROM tabel where sourceid='28B791C7-D519-4F0C-BC03-EFB1D4AC9CEB'
However, I started to think about what I really need from the table. I want to know which event occurred on which server, and also which events occurred on servers, sorted by date. I have added indexes for all combinations of WHERE and ORDER BY that are used.
I need all the rows because I want to do some calculations on them, some grouping, averages and so on. But I'm doing that in the .NET environment instead of asking the database many questions.
However, if I add a LIMIT to the select it goes faster. So is the bottleneck the amount of data that is transferred and not actually the finding/selecting part? If so, I can rebuild my application to do the calculation on only one day at a time, save the result into another table, and later aggregate all of it.
How can I speed up the process? Would it be better to switch to MongoDB? I currently use MySQL and InnoDB.
There's a lot of information you've not provided here - some of which I've mentioned in my comment elsewhere.
NoSQL is unlikely to be much faster than MySQL on a single node. I'd be very surprised if it were faster than using the handler API on MySQL along with appropriate indexes.
You've provided part of an explain plan (but not the query being explained) - but you haven't provided any interpretation of this:
rows = 84148
Does it really need to process that many rows to provide the result you need? If so and the result is not aggregated then maybe you need to think about why you need to ship 80k rows of data to the front end. If it's only having to return a few non-aggregated rows then you really need to analyse your indexes.
I have added index for all combination
Too many indexes are just as bad for performance as too few.
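If the per-day aggregation idea from the question is the way you go, pushing the grouping into MySQL avoids shipping tens of thousands of rows to the client. A sketch, using the column and table names from the question (the aggregates are just examples):
SELECT SourceID, EventId, DATE(Starttime) AS day,
       COUNT(*) AS events,
       AVG(TIMESTAMPDIFF(SECOND, Starttime, Stoptime)) AS avg_duration_sec
FROM tabel
WHERE SourceID = '28B791C7-D519-4F0C-BC03-EFB1D4AC9CEB'
GROUP BY SourceID, EventId, DATE(Starttime);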

MySQL: low cardinality/selectivity columns = how to index?

I need to add indexes to my table (columns) and stumbled across this post:
How many database indexes is too many?
Quote:
“Having said that, you can clearly add a lot of pointless indexes to a table that won't do anything. Adding B-Tree indexes to a column with 2 distinct values will be pointless since it doesn't add anything in terms of looking the data up. The more unique the values in a column, the more it will benefit from an index.”
Is an Index really pointless if there are only two distinct values? Given a table as follows (MySQL Database, InnoDB)
Id (BIGINT)
fullname (VARCHAR)
address (VARCHAR)
status (VARCHAR)
Further conditions:
The Database contains 300 Million records
Status can only be “enabled” and “disabled”
150 million records have status = enabled and 150 million records have status = disabled
My understanding is that, without an index on status, a select with WHERE status='enabled' would result in a full table scan with 300 million records to process?
How efficient is the lookup when I use a BTREE index on status?
Should I index this column or not?
What alternatives (maybe other kinds of indexes) does MySQL InnoDB provide to efficiently look records up with the WHERE status='enabled' clause in the given example, with a very low cardinality/selectivity of the values?
The index that you describe is pretty much pointless. An index is best used when you need to select a small number of rows in comparison to the total rows.
The reason for this is related to how a database accesses a table. Tables can be accessed either by a full table scan, where each block is read and processed in turn, or by a rowid or key lookup, where the database has a key/rowid and reads the exact row it requires.
In the case where you use a where clause based on the primary key or another unique index, eg. where id = 1, the database can use the index to get an exact reference to where the row's data is stored. This is clearly more efficient than doing a full table scan and processing every block.
Now back to your example: you have a WHERE clause of status = 'enabled', so the index will return 150M rows and the database will have to read each row in turn using separate small reads, whereas accessing the table with a full table scan allows the database to make use of more efficient larger reads.
There is a point at which it is better to just do a full table scan rather than use the index. With mysql you can use FORCE INDEX (idx_name) as part of your query to allow comparisons between each table access method.
Reference:
http://dev.mysql.com/doc/refman/5.5/en/how-to-avoid-table-scan.html
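A sketch of how to compare the two access paths, assuming the table is called mytable and the index on status is called idx_status:
-- Let the optimizer choose (most likely a full table scan here)
EXPLAIN SELECT * FROM mytable WHERE status = 'enabled';
-- Force the index and compare plans and timings
EXPLAIN SELECT * FROM mytable FORCE INDEX (idx_status) WHERE status = 'enabled';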
I'm sorry to say that I do not agree with Mike. Adding an index is meant to limit the number of full record searches for MySQL, thereby limiting IO, which is usually the bottleneck.
This indexing is not free; you pay for it on inserts/updates when the index has to be updated, and in the search itself, as it now needs to load the index file (the full index for 300M records is probably not in memory). So it might well be that you get extra IO instead of limiting it.
I do agree with the statement that a binary variable is best stored as one, a bool or tinyint, as that decreases the length of a row and can thereby limit disk IO, also comparisons on numbers are faster.
If you need speed and you seldom use the disabled records, you may wish to have 2 tables, one for enabled and one for disabled records and move the records when the status changes. As it increases complexity and risk this would be my very last choice of course. Definitely do the move in 1 transaction if you happen to go for it.
It just popped into my head that you can check whether an index is actually used by using the EXPLAIN statement. That should show you how MySQL is optimizing the query. I don't really know how MySQL optimizes queries, but from PostgreSQL I do know that you should explain a query on a database approximately the same (in size and data) as the real database. So if you have a copy of the database, create an index on the table and see whether it's actually used. As I said, I doubt it, but I most definitely don't know everything :)
If the data is distributed 50:50, then a query like WHERE status='enabled' only avoids scanning half of the table.
Whether an index on such a column helps depends entirely on the distribution of the data: if 90% of the entries have status enabled and the other 10% are disabled, then a query with WHERE status='disabled' scans only 10% of the table.
So having an index on such columns depends on the distribution of the data.
#a'r answer is correct, however it needs to be pointed out that the usefulness of an index is given not only by its cardinality but also by the distribution of data and the queries run on the database.
In OP's case, with 150M records having status='enabled' and 150M having status='disabled', the index is unnecessary and a waste of resources.
In case of 299M records having status='enabled' and 1M having status='disabled', the index is useful (and will be used) in queries of type SELECT ... where status='disabled'.
Queries of type SELECT ... where status='enabled' will still run with a full table scan.
You will hardly need all 150 mln records at once, so I guess "status" will always be used in conjunction with other columns. Perhaps it'd make more sense to use a compound index like (status, fullname)
Jan, you should definitely index that column. I'm not sure of the context of the quote, but everything you said above is correct. Without an index on that column, you are most certainly doing a table scan on 300M rows, which is about the worst you can do for that data.
Jan, as asked, where your query involves simply "where status=enabled" without some other limiting factor, an index on that column apparently won't help (glad the SO community showed me what's up). If, however, there is a limiting factor, such as "limit 10", an index may help. Also, remember that indexes are also used in GROUP BY and ORDER BY optimizations. If you are doing "select count(*),status from table group by status", an index would be helpful.
You should also consider converting status to a tinyint where 0 would represent disabled and 1 would be enabled. You're wasting tons of space storing that string vs. a tinyint which only requires 1 byte per row!
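A sketch of that conversion, assuming the table is called mytable and the only values really are 'enabled'/'disabled' (on 300M rows you would want to backfill in batches):
ALTER TABLE mytable ADD COLUMN status_flag TINYINT NOT NULL DEFAULT 0;
UPDATE mytable SET status_flag = IF(status = 'enabled', 1, 0);
ALTER TABLE mytable DROP COLUMN status;
ALTER TABLE mytable CHANGE status_flag status TINYINT NOT NULL;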
I have a similar column in my MySQL database. Approximately 4 million rows, with the distribution of 90% 1 and 10% 0.
I've just discovered today that my queries (where column = 1) actually run significantly faster WITHOUT the index.
Foolishly I deleted the index. I say foolishly, because I now suspect the queries (where column = 0) may have still benefited from it. So, instead I should explicitly tell MySQL to ignore the index when I'm searching for 1, and to use it when I'm searching for 0. Maybe.

Database table with 3.5 million entries - how can we improve performance?

We have a MySQL table with about 3.5 million IP entries.
The structure:
CREATE TABLE IF NOT EXISTS `geoip_blocks` (
`uid` int(11) NOT NULL auto_increment,
`pid` int(11) NOT NULL,
`startipnum` int(12) unsigned NOT NULL,
`endipnum` int(12) unsigned NOT NULL,
`locid` int(11) NOT NULL,
PRIMARY KEY (`uid`),
KEY `startipnum` (`startipnum`),
KEY `endipnum` (`endipnum`)
) TYPE=MyISAM AUTO_INCREMENT=3538967 ;
The problem: A query takes more than 3 seconds.
SELECT uid FROM `geoip_blocks` WHERE 1406658569 BETWEEN geoip_blocks.startipnum AND geoip_blocks.endipnum LIMIT 1
- about 3 seconds
SELECT uid FROM `geoip_blocks` WHERE startipnum < 1406658569 and endipnum > 1406658569 limit 1
- no gain, about 3 seconds
How can this be improved?
The solution to this is to grab a BTREE/ISAM library and use that (like BerkeleyDB). Using ISAM this is a trivial task.
Using ISAM, you would set your start key to the number, do a "Find Next", (to find the block GREATER or equal to your number), and if it wasn't equal, you'd "findPrevious" and check that block. 3-4 disk hits, shazam, done in a blink.
Well, it's A solution.
The problem that is happening here is that SQL, without a "sufficiently smart optimizer", does horrible on this kind of query.
For example, your query:
SELECT uid FROM `geoip_blocks` WHERE startipnum < 1406658569 and endipnum > 1406658569 limit 1
It's going to "look at" ALL of the rows that are "less than" 1406658569. ALL of them, then it's going to scan them looking for ALL of the rows that match the 2nd criteria.
With a 3.5m row table, assuming "average" (i.e. it hits the middle), welcome to a 1.75m row table scan. Even worse, an index table scan. Ideally MySQL will "give up" and "just" table scan, as it's faster.
Clearly, this is not what you want.
#Andomar's solution is basically forcing you to "block" the data space via the "network" criterion, effectively breaking your table into 255 pieces. So, instead of scanning 1.75m rows, you get to scan 6800 rows, a marked improvement at the cost of breaking your blocks up on the network boundary.
There is nothing wrong with range queries in SQL.
SELECT * FROM table WHERE id between X and Y
is, typically, a fast query, as the optimizer can readily delimit the range of rows using the index.
But that's not your query, because you are not doing a range on your main ID in this case (startipnum).
If you "know" that your block sizes are within a certain range (i.e. none of your blocks, EVER, have more than, say, 1000 ips), then you can block your query by adding "WHERE startipnum between {ipnum - 1000} and {ipnum + 1000}". That's not really different than the network blocking that was proposed, but here you don't have to maintain it as much. Of course, you can learn this with:
SELECT max(endipnum - startipnum) FROM table
to get an idea what your largest range is.
Another option, which I've seen but never used, and which is actually, well, perfect for this, is to look at MySQL's Spatial Extensions, since that's what this really is.
This is designed more for GIS applications, but you ARE searching for something in ranges, and that's a lot of what GIS apps do. So, that may well be a solution for you as well.
Your startip and endip should be a combined index. MySQL can't utilize multiple indexes on the same table in one query.
I'm not sure about the syntax, but something like
KEY (startipnum, endipnum)
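For reference, the syntax for adding that combined key would be (the index name is illustrative):
ALTER TABLE geoip_blocks ADD KEY idx_start_end (startipnum, endipnum);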
It looks like you're trying to find the range that an IP address belongs to. The problem is that MySQL can't make the best use of an index for the BETWEEN operation. Indexes work better with an = operation.
One way you can add an = operation to your query is by adding the network part of the address to the table. For your example:
numeric 1406658569
ip address 83.215.232.9
class A with 8 bit network part
network part = 83
With an index on (networkpart, startipnum, endipnum, uid) a query like this will become very fast:
SELECT uid
FROM `geoip_blocks`
WHERE networkpart = 83
AND 1406658569 BETWEEN startipnum AND endipnum
In case one geoip block spans multiple network classes, you'd have to split it in one row per network class.
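A sketch of one way to materialise that network part on MySQL 5.7+ using a generated column (on older versions you would add a normal column and populate it yourself):
ALTER TABLE geoip_blocks
  ADD COLUMN networkpart TINYINT UNSIGNED AS (startipnum >> 24) STORED,
  ADD KEY idx_net_range (networkpart, startipnum, endipnum, uid);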
Based on information from your question I am assuming that what you are doing is an implementation of the GeoIP® product from MaxMind®. I downloaded the free version of the GeoIP® data, loaded it into a MySQL database and did a couple of quick experiments.
With an index on startipnum the query execution time ranged from 0.15 to 0.25 seconds. Creating a composite index on startipnum and endipnum did not change the query performance. This leads me to believe that your performance issues are due to insufficient hardware, improper MySQL tuning, or both. The server I ran my tests on had 8G of RAM which is considerably more than would be needed to get this same performance as the index file was only 28M.
My recommendation is one of the two following options.
Spend some time tuning your MySQL server. The MySQL online documentation would be a reasonable starting point. http://dev.mysql.com/doc/refman/5.0/en/optimizing-the-server.html An internet search will turn up a large volume of books, forums, articles, etc. if the MySQL documentation is not sufficient.
If my assumption is correct that you are using the GeoIP® product, then a second option would be to use the binary file format provided by MaxMind®. The custom file format has been optimized for speed, memory usages, and database size. APIs to access the data are provided for a number of languages. http://www.maxmind.com/app/api
As an aside, the two queries you presented are not equivalent. The between operator is inclusive. The second query would need to use <= >= operators to be equivalent to the query which used the between operator.
Maybe you would like to have a look at partitioning the table. This feature has been available since MySQL 5.1 - since you do not specify which version you are using, this might not work for you if you are stuck with an older release.
As the possible value range for IP addresses is known - at least for IPv4 - you could break down the table into multiple partitions of equal size (or maybe even not equal if your data is not evenly distributed). With that MySQL could very easily skip large parts of the table, speeding up the scan if it is still required.
See MySQL manual on partitioning for the available options and syntax.
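A rough sketch of range partitioning on the start address (the boundaries below are arbitrary; also note that MySQL requires every unique key, including the primary key, to contain the partitioning column, so the primary key has to be extended first):
ALTER TABLE geoip_blocks DROP PRIMARY KEY, ADD PRIMARY KEY (uid, startipnum);
ALTER TABLE geoip_blocks PARTITION BY RANGE (startipnum) (
  PARTITION p0 VALUES LESS THAN (1073741824),
  PARTITION p1 VALUES LESS THAN (2147483648),
  PARTITION p2 VALUES LESS THAN (3221225472),
  PARTITION p3 VALUES LESS THAN MAXVALUE
);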
Thanks for all your comments, I really appreciate it.
For now we ended up using a caching mechanism and we have reduced those expensive queries.