MySQL Count Distinct - Very Slow


I have a very large MySQL InnoDB table with the following structure:
CREATE TABLE `whois_records` (
`record_id` int(10) unsigned NOT NULL,
`domain_name` varchar(100) NOT NULL,
`tld_id` smallint(5) unsigned DEFAULT NULL,
`create_date` date DEFAULT NULL,
`update_date` date DEFAULT NULL,
`expiry_date` date DEFAULT NULL,
`query_time` datetime NOT NULL,
PRIMARY KEY (`record_id`),
UNIQUE KEY `domain_time` (`domain_name`,`query_time`),
KEY `tld_id` (`tld_id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
This table currently has 10 million rows.
It stores frequently updated details of domain names,
so there can be multiple records for the same domain name in the table.
tld_id is the numeric ID of the domain extension (TLD).
The problem arises when I try to count the number of distinct domain names for a particular TLD.
I have tried the following three SQL queries:
SELECT COUNT(DISTINCT(domain_name)) FROM `whois_records` WHERE tld_id=159
SELECT COUNT(*) FROM `whois_records` WHERE tld_id=159 GROUP BY domain_name
SELECT COUNT(*) FROM ( SELECT 1 FROM `whois_records` WHERE tld_id=159 GROUP BY domain_name) q
All three are very slow, taking between 5 and 10 minutes, and they use a lot of CPU while running. There is an index on the tld_id column, so these queries might be doing a full index scan, yet they are still very slow.
TLD ID 159 is ".com", which is by far the most common, so a search for 159 is the slowest. It has around 6 million records, which is 60% of the 10 million rows in the table. For an unpopular TLD with fewer than 100 domains, the same query takes around 0.10 seconds.
Is there any way to optimize the calculation?
As the table grows, the current queries will take even longer, so can anyone please help me with a future-proof solution to this problem? Is any alteration of the table required? Please help, thank you :)

Extend the index to contain domain_name as well:
INDEX `tld_id` (`tld_id`, `domain_name`)
This should make MySQL use only the index and not table data to compute the result. If the combination of both values is unique, instead add a new unique index:
UNIQUE INDEX `new_index` (`tld_id`, `domain_name`)
I doubt you can push it a lot further than that. If it is still not fast enough, think about caching the counters.
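If the count only needs to be approximately current, the caching idea could be sketched as a small summary table that is rebuilt periodically. This is only a sketch under assumptions: the table tld_domain_counts and its columns are hypothetical names, and the refresh schedule (a cron job or a MySQL event) depends on how stale a count you can tolerate.

```sql
-- Hypothetical summary table: one row per TLD holding its distinct-domain count.
CREATE TABLE tld_domain_counts (
  tld_id SMALLINT UNSIGNED NOT NULL,
  domain_count INT UNSIGNED NOT NULL,
  refreshed_at DATETIME NOT NULL,
  PRIMARY KEY (tld_id)
) ENGINE=InnoDB;

-- Rebuild every counter in one pass over whois_records.
-- With an index on (tld_id, domain_name) this can be answered from the index alone.
REPLACE INTO tld_domain_counts (tld_id, domain_count, refreshed_at)
SELECT tld_id, COUNT(DISTINCT domain_name), NOW()
FROM whois_records
WHERE tld_id IS NOT NULL
GROUP BY tld_id;

-- Reading a count is then a primary-key lookup instead of a 6-million-row scan.
SELECT domain_count, refreshed_at
FROM tld_domain_counts
WHERE tld_id = 159;
```

The trade-off is freshness: the stored count is only as recent as the last rebuild, which is why refreshed_at is kept next to it.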

Related

MySQL query is slower after index create [duplicate]

First, some information about my test table.
It is a books table with 665647 rows of data.
Below you can see how it looks.
I ran the same query 10 times, looking for books with a given price:
select * from books where price = 10
Execution time for all 10 queries was 9s 663ms.
After that I created an index on price (books_price_index in the SHOW CREATE TABLE output below)
and ran the same 10 queries one more time.
Execution time for them was 21s 996ms.
show index from books;
showed very weird data to me.
It reports just one possible value!
What did I do wrong? I was sure indexes are what make queries faster, not slower.
I found this topic: MySQL index slowing down query
but to be honest I don't really understand it, especially the Cardinality column.
In my books table there are two possible values for the price field at the moment,
10 and 30, yet show index from books; still shows 1.
#Edit1
SHOW CREATE TABLE books
Result:
CREATE TABLE `books` (
`id` bigint unsigned NOT NULL AUTO_INCREMENT,
`name` varchar(255) COLLATE utf8mb4_unicode_ci NOT NULL,
`description` text COLLATE utf8mb4_unicode_ci NOT NULL,
`isbn` bigint unsigned NOT NULL,
`price` double(8,2) unsigned NOT NULL,
`created_at` timestamp NULL DEFAULT NULL,
`updated_at` timestamp NULL DEFAULT NULL,
`author_id` bigint unsigned NOT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `books_isbn_unique` (`isbn`),
KEY `books_author_id_foreign` (`author_id`),
KEY `books_price_index` (`price`),
CONSTRAINT `books_author_id_foreign` FOREIGN KEY (`author_id`) REFERENCES `users` (`id`) ON DELETE CASCADE
) ENGINE=InnoDB AUTO_INCREMENT=665648 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci
#Edit2
I added a new index: create index nameIndex on books (name)
which has a high Cardinality value.
When I run this query: select * from books where name ='Inventore cumque quis.'
before and after creating the index, I can see the difference in execution time.
But I still don't understand how indexes work. I was sure about one thing: when I create a new index, the database creates a new data structure holding the data that fits this index.
For example, if I have rows with prices 10 and 30, I get two "tables" containing the rows with those prices.

Is it realistic to have so many rows with the same price? Is it realistic to return 444K rows from a query? I ask these because query optimization is predicated on "normal" data.
An index (eg, INDEX(price)) is useful when looking for a price that occurs a small number of times. In fact, the Optimizer shuns the index if it sees that the value being searched for occurs more than about 20% of the time. Instead, it would simply ignore the index and do what you tested first--simply scan the entire table, ignoring any rows that don't match.
You should be able to see that by doing
EXPLAIN select * from books where price = 10
with and without the index. Alternatively, you can try:
EXPLAIN select * from books IGNORE INDEX(books_price_index) where price = 10
EXPLAIN select * from books FORCE INDEX(books_price_index) where price = 10
But, ... It seems that the Optimizer did not ignore the index. I see that the "cardinality" of price is "1", which implies that there is only one distinct value in that column. This 'statistic' is either incorrect or misleading. Please run this and see what changes:
ANALYZE TABLE books;
This will recompute the stats via a few random probes, and may change that "1" to perhaps "2".
General advice: Beware of benchmarks that run against fabricated data.
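To see why the optimizer might want to skip the index, it can help to look at how skewed the column really is and then refresh the statistics. The following is a sketch that reuses the table and index names from the question; the fraction column alias is just an illustrative name.

```sql
-- How many rows does each price value cover, and what share of the table is that?
-- A value that covers a large share of the rows makes the index unattractive.
SELECT price,
       COUNT(*) AS row_count,
       COUNT(*) / (SELECT COUNT(*) FROM books) AS fraction
FROM books
GROUP BY price
ORDER BY row_count DESC;

-- Recompute the index statistics, then look at the Cardinality column again.
ANALYZE TABLE books;
SHOW INDEX FROM books WHERE Key_name = 'books_price_index';
```

Note that the Cardinality reported by SHOW INDEX is an estimate, so it may not match the exact number of distinct prices even after ANALYZE TABLE.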

Maybe this?
https://stackoverflow.com/questions/755569/why-does-the-cardinality-of-an-index-in-mysql-remain-unchanged-when-i-add-a-new
The cardinality didn't get updated after the index was created. Try running the ANALYZE TABLE command.

improving the count() performance in MySQL

I have a mysql query like below.
SELECT `indexVal`, COUNT(`indexVal`)
FROM `key_word`
WHERE `hashed_word` IN ('001','01v','0ji','0k9','0vc','0#v','0%d','13#' ,'148' ,
'1e1','1sx','1v$','1#c','1?b','1?k','226','2kl','2ue',
'2*l','2?4','36h','3au','3us','4d~')
GROUP BY `indexVal`
This query takes 5 seconds to generate the results! I even have a compound index, created with ALTER TABLE key_word ADD INDEX (hashed_word, indexVal). Please note that my query counts how many times indexVal appears in the "search", not how many times it appears in the "table".
My table has 3 columns and 28 million records; the future table will have billions of records. I am using InnoDB, which I have only just chosen. Below is the SHOW CREATE TABLE result for my table:
CREATE TABLE `key_word` (
`primary_key` bigint(20) NOT NULL AUTO_INCREMENT,
`indexVal` int(11) NOT NULL,
`hashed_word` char(3) NOT NULL,
PRIMARY KEY (`primary_key`),
KEY `hashed_word` (`hashed_word`,`indexVal`)
) ENGINE=InnoDB AUTO_INCREMENT=28570982 DEFAULT CHARSET=latin1
I ran the above SELECT query with EXPLAIN. Below is the result
So how can I speed this up? I would like to have the result in less than 1 second. I appreciate your advice.
PS: I don't need the result to be in any order.

Try an index with the columns in reversed order:
create index xx on key_word( `indexVal`,`hashed_word`);
This may keep the optimizer from using a filesort,
but I don't think it can speed up the query by 500%, from 5 seconds to less than 1 second.
You probably need faster hardware.
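Whether the reversed index actually helps can be checked by comparing the plans for both indexes. The sketch below assumes the index names from this thread (hashed_word from the table definition, xx from the statement above) and shortens the IN list for readability.

```sql
-- Plan with the original index (hashed_word, indexVal):
-- range lookups on hashed_word, then grouping of the matching rows.
EXPLAIN SELECT indexVal, COUNT(indexVal)
FROM key_word FORCE INDEX (hashed_word)
WHERE hashed_word IN ('001', '01v', '0ji')
GROUP BY indexVal;

-- Plan with the reversed index (indexVal, hashed_word):
-- rows arrive already ordered by indexVal, but the whole index has to be read.
EXPLAIN SELECT indexVal, COUNT(indexVal)
FROM key_word FORCE INDEX (xx)
WHERE hashed_word IN ('001', '01v', '0ji')
GROUP BY indexVal;
```

Comparing the rows and Extra columns of the two plans (for example, whether "Using temporary; Using filesort" appears) shows which trade-off costs less on the real data.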

Count the number of rows between unix time stamps for each ID

I'm trying to populate some data for a table. The query is being run on a table that contains ~50 million records. The query I'm currently using is below. It counts the number of rows that match the template id and are BETWEEN two unix timestamps:
SELECT COUNT(*) as count FROM `s_log`
WHERE `time_sent` BETWEEN '1346904000' AND '1346993271'
AND `template` = '1'
While the query above does work, performance is rather slow when looping through each template, of which there can be hundreds at times. The timestamps are stored as int and are properly indexed. Just to test things out, I tried running the query below, omitting the time_sent restriction:
SELECT COUNT(*) as count FROM `s_log`
WHERE `template` = '1'
As expected, it runs very fast, but it obviously does not restrict the count to the correct time frame. How can I obtain a count for a specific template AND restrict that count to rows BETWEEN two unix timestamps?
EXPLAIN:
id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra
1 | SIMPLE | s_log | ref | time_sent,template | template | 4 | const | 71925 | Using where
SHOW CREATE TABLE s_log:
CREATE TABLE `s_log` (
`id` int(255) NOT NULL AUTO_INCREMENT,
`email` varchar(255) NOT NULL,
`time_sent` int(25) NOT NULL,
`template` int(55) NOT NULL,
`key` varchar(255) NOT NULL,
`node_id` int(55) NOT NULL,
`status` varchar(55) NOT NULL,
PRIMARY KEY (`id`),
KEY `email` (`email`),
KEY `time_sent` (`time_sent`),
KEY `template` (`template`),
KEY `node_id` (`node_id`),
KEY `key` (`key`),
KEY `status` (`status`),
KEY `timestamp` (`timestamp`)
) ENGINE=MyISAM AUTO_INCREMENT=2078966 DEFAULT CHARSET=latin1

The best index you can have in this case is a composite one on template + time_sent:
CREATE INDEX template_time_sent ON s_log (template, time_sent)
PS: Also, as long as all the columns in the query are integers, DON'T enclose their values in quotes (in some cases it could lead to issues, at least with older MySQL versions).
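A quick way to check that the composite index is picked up is to run the original count through EXPLAIN with unquoted integers. This is a sketch that assumes the template_time_sent index from the statement above has already been created.

```sql
-- The equality column (template) comes first in the index and the range column
-- (time_sent) second, so both conditions can be resolved inside the index.
EXPLAIN SELECT COUNT(*) AS count
FROM s_log
WHERE template = 1
  AND time_sent BETWEEN 1346904000 AND 1346993271;
```

If the plan shows template_time_sent in the key column, the count is being resolved through the composite index rather than by filtering every row found via the single-column template index.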

First, you have to create an index that contains both of your columns together (not separately). Also check your table type; I think it would work well if your table were InnoDB.
And lastly, write your WHERE clause in this fashion:
WHERE `template` = '1' AND `time_sent` BETWEEN '1346904000' AND '1346993271'
This puts the equality test on template first, matching the column order of the composite index. Note that the optimizer chooses the evaluation order itself, so the performance edge comes from the index rather than from the order of the conditions.

If you have to run the query for each template, it may be faster to get all the information with one query by using GROUP BY:
SELECT template, COUNT(*) as count FROM `s_log`
WHERE `time_sent` BETWEEN 1346904000 AND 1346993271
GROUP BY template;
It's just a guess that this would be faster, and you would also have to redesign your code a bit.
You could also try InnoDB instead of MyISAM. InnoDB uses a clustered index, which may perform better on large tables. From the MySQL site:
Accessing a row through the clustered index is fast because the row data is on the same page where the index search leads. If a table is large, the clustered index architecture often saves a disk I/O operation when compared to storage organizations that store row data using a different page from the index record. (For example, MyISAM uses one file for data rows and another for index records.)
There are some questions on Stackoverflow which discuss the performance between InnoDB and MyISAM:
Should I use MyISAM or InnoDB Tables for my MySQL Database?
Migrating from MyISAM to InnoDB
MyISAM versus InnoDB

Optimizing MySQL table structure. Advice needed

I have these table structures and, while they work, using EXPLAIN on certain SQL queries gives 'Using temporary; Using filesort' on one of the tables. This might hamper performance once the table is populated with thousands of data. Below are the table structures and an explanation of the system.
CREATE TABLE IF NOT EXISTS `jobapp` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`fullname` varchar(50) NOT NULL,
`icno` varchar(14) NOT NULL,
`status` tinyint(1) NOT NULL DEFAULT '1',
`timestamp` int(11) NOT NULL,
PRIMARY KEY (`id`),
KEY `icno` (`icno`)
) ENGINE=MyISAM;
CREATE TABLE IF NOT EXISTS `jobapplied` (
`appid` int(11) NOT NULL,
`jid` int(11) NOT NULL,
`jobstatus` tinyint(1) NOT NULL,
`timestamp` int(10) NOT NULL,
KEY `jid` (`jid`),
KEY `appid` (`appid`)
) ENGINE=MyISAM;
The query I tried, which gives the aforementioned EXPLAIN output:
EXPLAIN SELECT japp.id, japp.fullname, japp.icno, japp.status, japped.jid, japped.jobstatus
FROM jobapp AS japp
INNER JOIN jobapplied AS japped ON japp.id = japped.appid
WHERE japped.jid = '85'
AND japped.jobstatus = '2'
AND japp.status = '2'
ORDER BY japp.`timestamp` DESC
This system is for recruiting new staff. Once registration opens, hundreds of applicants will register at the same time. Each is allowed to select 5 different jobs. Later, at the end of the registration session, the admin will go through the jobs one by one. I have used a single table (jobapplied) to store 2 items (applicant id, job id) to record who applied for what, and this is the table that causes the aforementioned output. I realize this table has no PRIMARY key, but I can't figure out any other way for the admin to later look up who has applied for a specific job.
Any advice on how I can optimize the tables?

Apart from the missing indexes and primary keys others have mentioned . . .
"This might hamper performance once the table is populated with thousands of data."
You seem to be assuming that the query optimizer will use the same execution plan on a table with thousands of rows as it will on a table with just a few rows. Optimizers don't work like that.
The only reliable way to tell how a particular vendor's optimizer will execute a query on a table with thousands of rows--which is still a small table, and will probably easily fit in memory--is to load a scratch version of the database with thousands of rows and "explain" the query you're interested in.
FWIW, the last test I ran like this involved close to a billion rows--about 50 million in each of about 20 tables. The execution plan for that query--which included about 20 left outer joins--was a lot different than it was for the sample data (just a few thousand rows).

You are ordering by jobapp.timestamp, but there is no index on timestamp, so the filesort (and probably the temporary table) will be necessary. Try adding an index on timestamp to jobapp, something like KEY timid (timestamp,id).
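The index suggestions in this thread could be tried out roughly as follows. This is a sketch under assumptions: the primary key on (appid, jid) presumes an applicant can apply for a given job only once (the statement fails if duplicate pairs already exist), and jid_status_app is a hypothetical index name.

```sql
-- Index suggested above so the ORDER BY on jobapp.timestamp can read rows in order.
ALTER TABLE jobapp
  ADD KEY timid (`timestamp`, id);

-- Assumption: one row per (applicant, job) pair, so the pair can serve as the primary key.
-- The second key lets the admin's lookup by job and status be answered from the index.
ALTER TABLE jobapplied
  ADD PRIMARY KEY (appid, jid),
  ADD KEY jid_status_app (jid, jobstatus, appid);

-- Re-check the plan once the tables hold a realistic number of rows.
EXPLAIN SELECT japp.id, japp.fullname, japp.icno, japp.status, japped.jid, japped.jobstatus
FROM jobapp AS japp
INNER JOIN jobapplied AS japped ON japp.id = japped.appid
WHERE japped.jid = 85
  AND japped.jobstatus = 2
  AND japp.status = 2
ORDER BY japp.`timestamp` DESC;
```

Whether 'Using temporary; Using filesort' disappears depends on the join order the optimizer picks, which is one more reason to test against realistic data volumes.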
