MariaDB - Should I add index to my table? - mysql

Recently I was checking my system logs and I noticed some of my queries are very slow.
I have a table that stores user activities. The table structure is id (int), user (int), type (int), object (varchar), extra (mediumtext) and date (timestamp).
Also, I only have an index on id (BTREE, unique).
I have performance issues with the following query:
SELECT DISTINCT object as usrobj
from ".MV15_PREFIX."useractivities
WHERE user='".$user_id."'
and type = '3'
limit 0,1000000"
The question is: should I also index user the same way as id? What best practice should I follow?
This table is actively used and has over 500k rows, and there are around 2k concurrent users online on the site on average.
The reason I am asking is that I am not really good at managing databases, and I also have a slow-query issue on another table which has proper indexes.
Thanks in advance for suggestions.
Side note:
Result of mysqltuner
General recommendations:
Reduce or eliminate persistent connections to reduce connection usage
Adjust your join queries to always utilize indexes
Temporary table size is already large - reduce result set size
Reduce your SELECT DISTINCT queries without LIMIT clauses
Consider installing Sys schema from https://github.com/mysql/mysql-sys
Variables to adjust:
max_connections (> 768)
wait_timeout (< 28800)
interactive_timeout (< 28800)
join_buffer_size (> 64.0M, or always use indexes with joins)
(I will set max_connections > 768. I am not really sure about the timeouts, and from what I have read on Stack Overflow I think I shouldn't increase join_buffer_size, but I'd really appreciate feedback about these variables too.)
EDIT - SHOW INDEX result:
+--------------------+------------+-----------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+--------------------+------------+-----------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| ***_useractivities | 0 | PRIMARY | 1 | id | A | 434006 | NULL | NULL | | BTREE | | |
| ***_useractivities | 1 | user_index | 1 | user | A | 13151 | NULL | NULL | | BTREE | | |
| ***_useractivities | 1 | user_type_index | 1 | user | A | 10585 | NULL | NULL | | BTREE | | |
| ***_useractivities | 1 | user_type_index | 2 | type | A | 13562 | NULL | NULL | | BTREE | | |
+--------------------+------------+-----------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+

Most of these rules of thumb for PostgreSQL indexing apply to most SQL database management systems.
https://dba.stackexchange.com/a/31517/1064
So, yes, you will probably benefit from an index on user and an index on type. You might benefit more from an index on the pair user, type.
You'll benefit from learning how to read an execution plan, too.
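For example, once an index is in place you can check whether MySQL actually chooses it with EXPLAIN. A minimal sketch, using a placeholder table name and user id (the real table carries the ***_ prefix):
EXPLAIN SELECT DISTINCT object
FROM useractivities
WHERE user = 12345
  AND type = 3;
If the key column of the output shows the new index instead of NULL, the query is using it.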

For that query, either of these is optimal:
INDEX(user, type)
INDEX(type, user)
Separate indexes (INDEX(user), INDEX(type)) are likely to be not nearly as good.
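The edited SHOW INDEX output above suggests a composite user_type_index has already been added; for reference, creating such an index looks like this (a sketch, with the table prefix omitted):
ALTER TABLE useractivities ADD INDEX user_type_index (user, type);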
MySQL's InnoDB has only BTree, not Hash. Anyway, BTree is essentially as good as Hash for 'point queries', and immensely better for 'range' queries.
Indexing tips.
Indexes help SELECTs and UPDATEs, sometimes a lot. Use them. The side effects are minor -- such as extra disk space used.

Related

Very simple AVG() aggregation query on MySQL server takes ridiculously long time

I am using a MySQL server via an Amazon cloud service, with default settings. The table involved, mytable, is of InnoDB type and has about 1 billion rows.
The query is:
select count(*), avg(`01`) from mytable where `date` = "2017-11-01";
It takes almost 10 minutes to execute. I have an index on date. The EXPLAIN of this query is:
+----+-------------+---------------+------+---------------+------+---------+-------+---------+-------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+---------------+------+---------------+------+---------+-------+---------+-------+
| 1 | SIMPLE | mytable | ref | date | date | 3 | const | 1411576 | NULL |
+----+-------------+---------------+------+---------------+------+---------+-------+---------+-------+
The indexes from this table are:
+---------------+------------+-----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+---------------+------------+-----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| mytable | 0 | PRIMARY | 1 | ESI | A | 60398679 | NULL | NULL | | BTREE | | |
| mytable | 0 | PRIMARY | 2 | date | A | 1026777555 | NULL | NULL | | BTREE | | |
| mytable | 1 | lse_cd | 1 | lse_cd | A | 1919210 | NULL | NULL | YES | BTREE | | |
| mytable | 1 | zone | 1 | zone | A | 732366 | NULL | NULL | YES | BTREE | | |
| mytable | 1 | date | 1 | date | A | 85564796 | NULL | NULL | | BTREE | | |
| mytable | 1 | ESI_index | 1 | ESI | A | 6937686 | NULL | NULL | | BTREE | | |
+---------------+------------+-----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
If I remove AVG():
select count(*) from mytable where `date` = "2017-11-01";
It only takes 0.15 sec to return the count. The count for this specific query is 692792; the counts are similar for other dates.
I don't have an index on 01. Is that the issue? Why does AVG() take so long to compute? There must be something I didn't do properly.
Any suggestion is appreciated!
To count the number of rows with a specific date, MySQL has to locate that value in the index (which is pretty fast, after all that is what indexes are made for) and then read the subsequent entries of the index until it finds the next date. Depending on the datatype of esi, this will sum up to reading some MB of data to count your 700k rows. Reading some MB does not take much time (and that data might even already be cached in the buffer pool, depending on how often you use the index).
To calculate the average for a column that is not included in the index, MySQL will, again, use the index to find all rows for that date (the same as before). But additionally, for every row it finds, it has to read the actual table data for that row, which means using the primary key to locate the row, reading some bytes, and repeating this 700k times. This "random access" is a lot slower than the sequential read in the first case. (It is made worse by the fact that "some bytes" is really a whole page of innodb_page_size (16KB by default), so you may have to read up to 700k * 16KB = 11GB, compared to "some MB" for count(*); and depending on your memory configuration, some of this data might not be cached and has to be read from disk.)
A solution to this is to include all used columns in the index (a "covering index"), e.g. create an index on date, 01. Then MySQL does not need to access the table itself, and can proceed, similar to the first method, by just reading the index. The size of the index will increase a bit, so MySQL will need to read "some more MB" (and perform the avg-operation), but it should still be a matter of seconds.
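A minimal sketch of such a covering index, using the column names from the question (the index name is arbitrary):
ALTER TABLE mytable ADD INDEX date_01 (`date`, `01`);
Afterwards, EXPLAIN should show "Using index" in the Extra column for this query, meaning the table itself is no longer touched.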
In the comments, you mentioned that you need to calculate the average over 24 columns. If you want to calculate the avg for several columns at the same time, you would need a covering index on all of them, e.g. date, 01, 02, ..., 24 to prevent table access. Be aware that an index that contains all columns requires as much storage space as the table itself (and it will take a long time to create such an index), so it might depend on how important this query is if it is worth those resources.
To avoid the MySQL-limit of 16 columns per index, you could split it into two indexes (and two queries). Create e.g. the indexes date, 01, .., 12 and date, 13, .., 24, then use
select * from (select `date`, avg(`01`), ..., avg(`12`)
from mytable where `date` = ...) as part1
cross join (select avg(`13`), ..., avg(`24`)
from mytable where `date` = ...) as part2;
Make sure to document this well, as there is no obvious reason to write the query this way, but it might be worth it.
If you only ever average over a single column, you could add 24 separate indexes (on date, 01, date, 02, ...), although in total they will require even more space, but they might be a little bit faster (as they are smaller individually). But the buffer pool might still favour the full index, depending on factors like usage patterns and memory configuration, so you may have to test it.
Since date is part of your primary key, you could also consider changing the primary key to date, esi. If you find the dates by the primary key, you would not need an additional step to access the table data (as you already access the table), so the behaviour would be similar to the covering index. But this is a significant change to your table and can affect all other queries (that e.g. use esi to locate rows), so it has to be considered carefully.
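If you did decide to go that way, the change itself is a single statement, sketched below; be aware that it rebuilds the whole billion-row table, so it needs a proper maintenance window:
ALTER TABLE mytable
  DROP PRIMARY KEY,
  ADD PRIMARY KEY (`date`, `ESI`);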
As you mentioned, another option would be to build a summary table where you store precalculated values, especially if you do not add or modify rows for past dates (or can keep them up-to-date with a trigger).
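A sketch of what such a summary table could look like (hypothetical names; it assumes rows for past dates no longer change, or that a trigger keeps it current):
CREATE TABLE mytable_daily_avg AS
  SELECT `date`, COUNT(*) AS row_count, AVG(`01`) AS avg_01
  FROM mytable
  GROUP BY `date`;
-- the slow query then becomes a single-row lookup:
SELECT row_count, avg_01 FROM mytable_daily_avg WHERE `date` = '2017-11-01';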
For MyISAM tables, COUNT(*) is optimized to return very quickly if the SELECT retrieves from one table, no other columns are retrieved, and there is no WHERE clause.
For example:
SELECT COUNT(*) FROM student;
https://dev.mysql.com/doc/refman/5.6/en/group-by-functions.html#function_count
If you add AVG() or something else, you lose this optimization.

Why does the same query explain & perform so differently on a slave MySQL server than a master?

I have a master MySQL server and a slave server. The data is replicated between them.
When I run this query on the master it's taking a number of hours; on the slave it takes seconds. The EXPLAIN plans back this up -- the slave examines far fewer rows than the master.
However, since the structure and data in these two databases are exactly the same (or should be at least), and they're both running the same version of MySQL (5.5.31 Enterprise), I don't understand what's causing this.
This is a similar symptom to this question (and others) but I don't think it's the same root cause because my two servers are in sync via MySQL replication, and the structure and data contents are (or should be) the same, and the OS & hardware resources are exactly the same on both servers -- they're VMWare and one is an image of the other.
I've verified that the number of rows in each table is exactly the same on both servers, and that their configurations are the same (except for the slave having directives pointing to the master). Short of going through the data itself to see if there are any differences I'm not sure what else I can check, and would be grateful for any advice.
The query is
SELECT COUNT(DISTINCT(cds.company_id))
FROM jobsmanager.companies c
, jobsmanager.company_jobsmanager_settings cjs
, jobsmanager.company_details_snapshot cds
, vacancies v
WHERE c.company_id = cjs.company_id
AND cds.company_id = c.company_id
AND cds.company_id = v.jobsmanager_company_id
AND cjs.is_post_a_job = 'Y'
AND cjs.can_access_jobsmanager = 'Y'
AND cjs.account_status != 'suspended'
AND v.last_live BETWEEN cds.record_date - INTERVAL 365 DAY AND cds.record_date
AND cds.record_date BETWEEN '2016-01-30' AND '2016-02-05';
The master explains it like this, 3 million rows on the driving table, no key usage, and takes over an hour to return a result:
+----+-------------+-------+--------+-------------------------+----------------+---------+---------------------------------+---------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+-------------------------+----------------+---------+---------------------------------+---------+--------------------------+
| 1 | SIMPLE | v | ALL | job_owner,last_live_idx | NULL | NULL | NULL | 3465433 | |
| 1 | SIMPLE | c | eq_ref | PRIMARY | PRIMARY | 4 | s1jobs.v.jobsmanager_company_id | 1 | Using where; Using index |
| 1 | SIMPLE | cds | ref | PRIMARY,company_id_idx | company_id_idx | 4 | jobsmanager.c.company_id | 538 | Using where |
| 1 | SIMPLE | cjs | eq_ref | PRIMARY,qidx,qidx2 | PRIMARY | 4 | jobsmanager.c.company_id | 1 | Using where |
+----+-------------+-------+--------+-------------------------+----------------+---------+---------------------------------+---------+--------------------------+
The slave uses a different driving table, uses an index, predicts more like 310,000 rows examined, and returns the result within a couple of seconds:
+----+-------------+-------+--------+-------------------------+-----------+---------+----------------------------+--------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+-------------------------+-----------+---------+----------------------------+--------+--------------------------+
| 1 | SIMPLE | cds | range | PRIMARY,company_id_idx | PRIMARY | 3 | NULL | 310381 | Using where; Using index |
| 1 | SIMPLE | c | eq_ref | PRIMARY | PRIMARY | 4 | jobsmanager.cds.company_id | 1 | Using index |
| 1 | SIMPLE | cjs | eq_ref | PRIMARY,qidx,qidx2 | PRIMARY | 4 | jobsmanager.c.company_id | 1 | Using where |
| 1 | SIMPLE | v | ref | job_owner,last_live_idx | job_owner | 2 | jobsmanager.cds.company_id | 32 | Using where |
+----+-------------+-------+--------+-------------------------+-----------+---------+----------------------------+--------+--------------------------+
I've run ANALYZE TABLE, OPTIMIZE TABLE and REPAIR TABLE ... QUICK on both servers to try to make them consistent, with no luck.
As a temporary solution I can run the queries on the slave, as they're in cron scripts and even if they take a long time on the slave they won't increase load on the master the way they do when they run on the master. However I'd be grateful for any other information on why these are different or what else I could check/revise which would explain such a drastic difference between the two. The only thing I can find is that the slave has more free memory, as it's in little use; would that alone account for this? If not what else?
$ ssh s1-mysql-01 free # master
total used free shared buffers cached
Mem: 99018464 98204624 813840 0 160752 55060632
-/+ buffers/cache: 42983240 56035224
Swap: 4095992 4095992 0
$ ssh s1-mysql-02 free # slave
total used free shared buffers cached
Mem: 99018464 80866420 18152044 0 224772 72575168
-/+ buffers/cache: 8066480 90951984
Swap: 4095992 206056 3889936
$
Thanks very much.
The only really big difference between the two EXPLAIN outputs is that on the master no index is used on the vacancies table.
You could try placing an index hint (FORCE INDEX) in the SELECT on the master to force the use of the job_owner index, as sketched below.
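The hint goes right after the table name (or alias) in the FROM clause; the rest of the query stays exactly as in the question:
FROM jobsmanager.companies c
   , jobsmanager.company_jobsmanager_settings cjs
   , jobsmanager.company_details_snapshot cds
   , vacancies v FORCE INDEX (job_owner)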
You can also try running ANALYZE TABLE on all tables involved in the above query on the master, to make sure that the table and index statistics are up to date.
I also had the same problem, but in my case the slave was not using an index.
Index hints (USE INDEX / FORCE INDEX) helped, but they are not a good solution in this case. So I tried running ANALYZE TABLE on the slave server, and it fixed the problem:
ANALYZE NO_WRITE_TO_BINLOG TABLE tbl_name
Now both servers use correct indexes.
NO_WRITE_TO_BINLOG is needed when the statement is run on a replica.
Also, ANALYZE should be executed during a period of low load or in a maintenance window; otherwise you can get many user queries stuck in the Waiting for table flush state.

Mysql query taking too much time

I have a problem related to a MySQL database. I am a Linux web server admin and I am facing a problem with a MySQL query. The database is very small. I tried to track it in the logs and found that a query is taking a minimum of 5 seconds to respond. The first page of the site is built from the database, and the client is using a CMS. When the server gets a certain number of hits, the database server starts to respond very slowly and the wait time grows from 5 seconds to much longer.
I checked slow query logs
{
Query_time: 11.480138 Lock_time: 0.003837 Rows_sent: 921 Rows_examined: 3333
SET timestamp=1346656767;
SELECT `Tender`.`id`,
`Tender`.`department_id`,
`Tender`.`title_english`,
`Tender`.`content_english`,
`Tender`.`title_hindi`,
`Tender`.`content_hindi`,
`Tender`.`file_name`,
`Tender`.`start_publish`,
`Tender`.`end_publish`,
`Tender`.`publish`,
`Tender`.`status`,
`Tender`.`createdBy`,
`Tender`.`created`,
`Tender`.`modifyBy`,
`Tender`.`modified`
FROM `mcms_tenders` AS `Tender`
WHERE `Tender`.`department_id` IN ( 31, 33, 32, 30 );
}
Every entry in the log is the same; the only difference is the query time.
Is there any way to tweak the performance?
Update: Here is the EXPLAIN result:
+----+-------------+--------+------+---------------+------+---------+------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+--------+------+---------------+------+---------+------+------+-------------+
| 1 | SIMPLE | Tender | ALL | NULL | NULL | NULL | NULL | 3542 | Using where |
+----+-------------+--------+------+---------------+------+---------+------+------+-------------+
1 row in set, 1 warning (0.00 sec)
The client is saying they are using an index, so I ran the command to check the indexing.
I got the following output. Does it mean they are using indexing?
+--------------+------------+----------+--------------+-------------+-----------+------------+----------+--------+------+------------+---------+
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment |
+--------------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
| mcms_tenders | 0 | PRIMARY | 1 | id | A | 4264 | NULL | NULL | | BTREE | |
+--------------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
The normal way to tweak the performance of a query like this is to create an index on department_id.
However, this assumes that Tenders is actually a table and not a view. You should confirm this, since the problem may be in a view.
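If it is a base table, adding the suggested index is a one-line change (a sketch; the index name is arbitrary):
ALTER TABLE mcms_tenders ADD INDEX idx_department_id (department_id);
With only a few thousand rows in the table, that should turn the full scan shown in the EXPLAIN above into a ref lookup on the new index.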
Also, from what you describe the issue may be the connection from the server to the end users. I would try running the query locally on the server (or checking the execute time strictly on the server) to see if the query is really taking that long.
"when the server gets some number of hits"
Define 'some number'. It makes sense that reading the database is slower when it is more heavily used. Also, MySQL has a query cache that is fully invalidated when changes are made to the data. So every time someone inserts, deletes or modifies a record in this table, the next queries will be slower because the table data is still uncached.
But 11 seconds for a query like this is very slow, so either the load is way too high, the hardware is insufficient or broken, or your database lacks indexes (I always forget to mention that at first, because I assume adding indexes is second nature for anyone working with databases).

Table design for temporary table accessed by multiple processes and stores 1,000,000,000+ 4-column rows

I am using MySQL for temporary storage of one billion or more results, where the results are calculated by processes executing in parallel.
Each result is calculated using a function [f] on the representations [r1] and [r2] of objects identified respectively by [o1] and [o2].
Currently, I use three tables to execute this process:
(1) A table mapping object identifiers to their representations:
mysql> describe v2_3282_fp;
+----------------+------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+----------------+------+------+-----+---------+-------+
| objid | text | YES | | NULL | |
| representation | text | YES | | NULL | |
+----------------+------+------+-----+---------+-------+
(2) A table holding jobs that each compute process should retrieve and calculate:
mysql> describe v2_3282_job;
+----------+---------------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+----------+---------------------+------+-----+---------+----------------+
| jobid | bigint(20) unsigned | NO | PRI | NULL | auto_increment |
| workerid | int(11) | YES | | NULL | |
| pairid1 | text | YES | | NULL | |
| pairid2 | text | YES | | NULL | |
+----------+---------------------+------+-----+---------+----------------+
(3) A table holding the results of compute jobs:
mysql> describe v2_3282_res;
+-----------+---------------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+-----------+---------------------+------+-----+---------+----------------+
| resultid | bigint(20) unsigned | NO | PRI | NULL | auto_increment |
| pairid1 | text | YES | | NULL | |
| pairid2 | text | YES | | NULL | |
| pairscore | double(36,18) | YES | | NULL | |
+-----------+---------------------+------+-----+---------+----------------+
(the pairscore type is dynamically determined during execution, and not fixed to (36,18).)
Once the representations have been registered, one process continually scans the result table for new results to transfer to an object existing in memory, and the remaining processes retrieve jobs to compute until they receive a job with a pair of identifiers signalling the end of computation.
During unit tests with 1,000,000 or so computations, this system works just fine.
However, as the demands to use this system have grown to 1,000,000,000+, I see that the system eventually gets bogged down in swapping back and forth between memory and disk.
When I check the system memory and swap space in use, system memory is completely used, but typically less than 20% of swap is used.
I have read that MySQL performance is best when entire tables can be read into memory, and resorting to disk I/O is the major bottleneck.
This seems to be the case for me as well, as running computations on my systems with 12 GB and 16 GB of RAM eventually requires more and more time between worker process cycles, though my lone system with 64 GB never seems to encounter this issue.
While the straightforward answer is, "Hey buddy, buy more RAM.", I think there is a more fundamental design issue that is causing my system to degrade as the demands grow. I know that MySQL is a well-engineered product widely used, and that database and table design consideration can greatly impact performance.
So without resorting to the brute force resolution of buying more memory, I am looking for suggestions on how to improve the engineering of the MySQL table design I came up with.
While I know the basics of MySQL table normalization and can create queries to implement my needs, I do not know much about each type of database engine, the details of indexing, and other database-specific design considerations.
The questions I have are:
(1) Would performance be any different if I split the result and job tables into smaller tables instead of single large tables? (I think not.)
(2) I currently issue a limit clause programmatically to retrieve a fixed number of results in each retrieval cycle. However, I don't know if this can be further optimized over the simple "SELECT ... FROM [result table] LIMIT start, size". (I think so.)
(3) Does it make sense to tell the worker processes to sleep between cycles in order to let MySQL "catch up"? (I think not.)
My appreciation in advance for any advice from those experienced in database and table design.

can mysqldump on a large database be causing my long queries to hang?

I have a large database (approx 50GB). It is on a server I have little control over, but I know they are using mysqldump to do backups nightly.
I have a query that takes hours to finish. I set it to run, but it never actually finishes.
I've noticed that after the backup time, all the tables have a lock request (SHOW OPEN TABLES WHERE in_use > 0; lists all tables).
The tables from my query have in_use = 2, all other tables have in_use = 1.
So... what is happening here?
a) my query is running normally, blocking the dump from happening. I should just wait?
b) the dump is causing the server to hang (maybe lack of memory/disk space?)
c) something else?
EDIT: using MyISAM tables
There is a server admin who is not very competent, but if I ask him specific things he does them. What should I get him to check?
EDIT: adding query
SELECT citing.article_id as citing, citing.year, r.id_when_cited, cited_issue.country
FROM isi_lac_authored_articles as citing # 1M records
JOIN isi_citation_references r ON (citing.article_id = r.article_id) # 400M records
JOIN isi_articles cited ON (cited.id_when_cited = r.id_when_cited) # 25M records
JOIN isi_issues cited_issue ON (cited.issue_id = cited_issue.issue_id) # 1M records
This is what EXPLAIN has to say:
+----+-------------+-------------+------+--------------------------------------------------------------------------+---------------------------------------+---------+-------------------------------+---------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------------+------+--------------------------------------------------------------------------+---------------------------------------+---------+-------------------------------+---------+-------------+
| 1 | SIMPLE | cited_issue | ALL | NULL | NULL | NULL | NULL | 1156856 | |
| 1 | SIMPLE | cited | ref | isi_articles_id_when_cited,isi_articles_issue_id | isi_articles_issue_id | 49 | func | 19 | Using where |
| 1 | SIMPLE | r | ref | isi_citation_references_article_id,isi_citation_references_id_when_cited | isi_citation_references_id_when_cited | 17 | mimir_dev.cited.id_when_cited | 4 | Using where |
| 1 | SIMPLE | citing | ref | isi_lac_authored_articles_article_id | isi_lac_authored_articles_article_id | 16 | mimir_dev.r.article_id | 1 | |
+----+-------------+-------------+------+--------------------------------------------------------------------------+---------------------------------------+---------+-------------------------------+---------+-------------+
I actually don't understand why it needs to look at all the records in the isi_issues table. Shouldn't it just be matching up with isi_articles (cited) on issue_id? Both fields are indexed.
For a MySQL database of that size, you may want to consider setting up replication to a slave node, and then have your nightly database backups performed on the slave.
Yes -- some options to mysqldump will have the effect of locking all MyISAM tables while the backup is in progress, so that the backup is a consistent "snapshot" of a point in time.
InnoDB supports transactions, which make this unnecessary. It's also generally faster than MyISAM. You should use it. :)
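For reference, the relevant mysqldump options look roughly like this (a sketch; the database name and output path are placeholders, and the exact invocation depends on how the admin runs the backups):
# default behaviour (part of --opt): tables are locked while they are dumped,
# which is what blocks other queries on MyISAM
mysqldump --lock-tables mydb > /backups/mydb.sql
# with InnoDB tables: a consistent snapshot inside a transaction, no long table locks
mysqldump --single-transaction mydb > /backups/mydb.sql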