Connection lost while building primary key. Fix or punt?

This question is about possible future improvements to a task I'm almost done with.
I have loaded a MySQL database with a subset of the Universal Medical Language System's Metathesaurus. I used a Java application called MetaMorphoSys, which generated a Bash wrapper, one SQL script for defining tables and importing data from text files, and another for indexing.
Loading and indexing a small UMLS subset (3.3 M rows in table MRSAT) goes to completion without errors. Loading a larger subset (39.4 M rows in MRSAT) is also successful, but then the indexing fails at this step after 1500 to 1800 seconds:
ALTER TABLE MRSAT ADD CONSTRAINT X_MRSAT_PK PRIMARY KEY BTREE (ATUI)
Error Code: 2013. Lost connection to MySQL server during query
My only use for the MySQL database is converting the relational rows to RDF triples. This conversion is performed by a single Python script, which does seem to access the MRSAT table but doesn't appear to use the ATUI column. At this point, I have extracted almost all of the data I want.
How can I tell if the absence of the primary key is detrimental to the performance of the RDF-generation queries?
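I assume I could EXPLAIN a representative query from the script to find out; a hypothetical example (the column list and filter are a guess, not the script's actual query):

EXPLAIN SELECT CUI, ATN, ATV
FROM MRSAT
WHERE CUI = 'C0000005';

If that reports type: ALL with a rows estimate in the tens of millions, the query is doing a full table scan. As far as I can tell, the missing primary key should only matter for queries that filter or join on ATUI.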
I have increased some timeouts, but haven't made all of the changes suggested in answers to related questions.
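For reference, these are the kinds of server-side settings involved (values are illustrative, not recommendations; the client can also impose its own read timeout, e.g. MySQL Workbench's "DBMS connection read timeout" preference, which is a separate setting):

SHOW VARIABLES LIKE '%timeout%';
SET GLOBAL net_read_timeout  = 3600;  -- seconds the server waits for reads
SET GLOBAL net_write_timeout = 3600;  -- seconds the server waits for writes
SET GLOBAL wait_timeout      = 28800; -- idle-connection timeout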
The documentation from the provider suggests MySQL 5.5 over 5.6 due to disk space usage issues. I am using 5.6 anyway (as I have done in the past) on a generous AWS x1e.2xlarge instance running Ubuntu 18.
The documentation provides tuning suggestions for 5.5, but I don't see equivalent setting names in the 5.6 documentation. I have applied these:
bulk_insert_buffer_size = 100M
join_buffer_size = 100M
myisam_sort_buffer_size = 200M
query_cache_limit = 3M
query_cache_size = 100M
read_buffer_size = 200M
sort_buffer_size = 500M
For the suggested key_buffer = 600M I set key_buffer_size = 600M, since key_buffer is the old name for that variable. I didn't do anything for table_cache = 300; that variable is now called table_open_cache.
The primary key is supposed to be set on the alphanumeric column ATUI:
mysql> select * from MRSAT limit 9;
+----------+----------+----------+-----------+-------+---------+-------------+-------+--------+-----+------------+----------+------+
| CUI      | LUI      | SUI      | METAUI    | STYPE | CODE    | ATUI        | SATUI | ATN    | SAB | ATV        | SUPPRESS | CVF  |
+----------+----------+----------+-----------+-------+---------+-------------+-------+--------+-----+------------+----------+------+
| C0000005 | L0000005 | S0007492 | A26634265 | AUI   | D012711 | AT212456753 | NULL  | TH     | MSH | UNK (19XX) | N        | NULL |
| C0000005 | L0000005 | S0007492 | A26634265 | AUI   | D012711 | AT212480766 | NULL  | TERMUI | MSH | T037573    | N        | NULL |
| C0000005 | L0000005 | S0007492 | A26634265 | SCUI  | D012711 | AT60774257  | NULL  | RN     | MSH | 0          | N        | NULL |
| C0000005 | L0270109 | S0007491 | A26634266 | AUI   | D012711 | AT212327137 | NULL  | TERMUI | MSH | T037574    | N        | NULL |
| C0000005 | L0270109 | S0007491 | A26634266 | AUI   | D012711 | AT212456754 | NULL  | TH     | MSH | UNK (19XX) | N        | NULL |
| C0000005 | NULL     | NULL     | NULL      | CUI   | NULL    | AT00368929  | NULL  | DA     | MTH | 19900930   | N        | NULL |
| C0000005 | NULL     | NULL     | NULL      | CUI   | NULL    | AT01344283  | NULL  | MR     | MTH | 20020910   | N        | NULL |
| C0000005 | NULL     | NULL     | NULL      | CUI   | NULL    | AT02319637  | NULL  | ST     | MTH | R          | N        | NULL |
| C0000039 | L0000035 | S0007560 | A26674543 | AUI   | D015060 | AT212481191 | NULL  | TH     | MSH | UNK (19XX) | N        | NULL |
+----------+----------+----------+-----------+-------+---------+-------------+-------+--------+-----+------------+----------+------+
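If I do retry the index build, my understanding from the MySQL 5.6 online-DDL documentation is that the rebuild can be requested in-place, so it at least fails fast if that isn't supported (a sketch, untested on this data):

ALTER TABLE MRSAT
  ADD CONSTRAINT X_MRSAT_PK PRIMARY KEY (ATUI),
  ALGORITHM=INPLACE, LOCK=NONE;

As I understand it, error 2013 can simply mean the client gave up waiting while the server kept working, so the original ALTER may even have kept running after the disconnect.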

Related

ProxySQL Query Cache doesn't always respect the query rules for some reason

I use ProxySQL (2.0.17) to cache all SELECT queries sent to MySQL. The mysql_query_rules table contains one rule (shown vertically for readability):
*************************** 1. row ***************************
              rule_id: 1
               active: 1
             username: NULL
           schemaname: NULL
               flagIN: 0
          client_addr: NULL
           proxy_addr: NULL
           proxy_port: NULL
               digest: NULL
         match_digest: ^[(]?SELECT (?!SQL_NO_CACHE)
        match_pattern: NULL
 negate_match_pattern: 0
         re_modifiers: CASELESS
              flagOUT: NULL
      replace_pattern: NULL
destination_hostgroup: NULL
            cache_ttl: 300000
   cache_empty_result: NULL
        cache_timeout: NULL
            reconnect: NULL
              timeout: NULL
              retries: NULL
                delay: NULL
    next_query_flagIN: NULL
       mirror_flagOUT: NULL
     mirror_hostgroup: NULL
            error_msg: NULL
               OK_msg: NULL
          sticky_conn: NULL
            multiplex: NULL
  gtid_from_hostgroup: NULL
                  log: NULL
                apply: 1
              comment: NULL
One simple rule (I tried ^SELECT .* as well), with cache_ttl set to 300,000 ms, i.e. 300 seconds before a cached entry is purged.
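To check whether the rule matches at all, the hits counter in the stock stats table can be consulted on the admin interface:

SELECT rule_id, hits FROM stats_mysql_query_rules;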
For some reason, about 5% of the queries that should be served from the cache are still sent to the backend. For instance, this is the most popular query:
+-----------+------------+----------+----------------+--------------------+--------------------------+------------+------------+------------+-------------+----------+----------+-------------------+---------------+
| hostgroup | schemaname | username | client_address | digest             | digest_text              | count_star | first_seen | last_seen  | sum_time    | min_time | max_time | sum_rows_affected | sum_rows_sent |
+-----------+------------+----------+----------------+--------------------+--------------------------+------------+------------+------------+-------------+----------+----------+-------------------+---------------+
| 2         | ------     | ----     |                | 0xFB50749BCFE0DA3C | SELECT * FROM `language` | 12839      | 1621445210 | 1621455115 | 45069293213 | 31321    | 82235606 | 0                 | 56960         |
| -1        | ------     | ----     |                | 0xFB50749BCFE0DA3C | SELECT * FROM `language` | 326243     | 1621445210 | 1621455116 | 0           | 0        | 0        | 0                 | 0             |
+-----------+------------+----------+----------------+--------------------+--------------------------+------------+------------+------------+-------------+----------+----------+-------------------+---------------+
I can't get my head around this peculiarity. Whenever I re-query stats_mysql_query_digest, count_star for hostgroup 2 (the backend) has incremented, without waiting the 300 seconds for the cached entry to be purged.
The query cache size is set to 512 MB. At its peak, it takes up around 100 MB.
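In case it's relevant, the cache's global counters can be watched from the admin interface to see the hit/miss ratio directly (counter names as in the ProxySQL documentation):

SELECT * FROM stats_mysql_global
WHERE variable_name LIKE 'Query_Cache%';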
Help?..
Cranking mysql-query_cache_size_MB up to 5120 MB (which is ridiculous, of course) seems to have resolved the problem to some extent: the frequency of backend requests for that query has dropped tenfold (ProxySQL's Query Logging lets you log just one query and analyze it). The cache_ttl value is still not fully respected, but I guess this workaround is better than nothing at this point.
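For anyone else trying this, the resize itself was the usual ProxySQL admin-interface sequence:

UPDATE global_variables SET variable_value = '5120'
WHERE variable_name = 'mysql-query_cache_size_MB';
LOAD MYSQL VARIABLES TO RUNTIME;
SAVE MYSQL VARIABLES TO DISK;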

Big data table MySQL query optimization

I'm trying to optimize my MySQL query to run more smoothly, but now I'm stuck.
I'm using this query:
SELECT
sr.path,
sr.keywordId,
sr.rank
FROM
serp_results sr
WHERE
sr.domain = 971
AND sr.searchEngine = 1
It pulls results from a table of approximately 544,003,737 rows. I recently added a compound index on (searchEngine, domain), but it didn't help.
This is the table structure:
| Field             | Type                 | Null | Key | Default | Extra          |
|-------------------|----------------------|------|-----|---------|----------------|
| id                | bigint(10) unsigned  | NO   | PRI |         | auto_increment |
| keywordId         | int(10) unsigned     | NO   | MUL |         |                |
| searchEngine      | tinyint(3) unsigned  | NO   | PRI |         |                |
| position          | smallint(5) unsigned | NO   |     |         |                |
| rank              | float unsigned       | NO   |     |         |                |
| path              | varchar(500)         | NO   |     |         |                |
| domain            | bigint(20) unsigned  | YES  | MUL |         |                |
| firstDomainResult | tinyint(1) unsigned  | NO   |     |         |                |
| added             | date                 | YES  | MUL |         |                |
Plus these indexes:
| index_name                         | index_algorithm | is_unique | column_name                        |
|------------------------------------|-----------------|-----------|------------------------------------|
| serp_results_searchEngine_domain   | BTREE           | FALSE     | searchEngine,domain                |
| serp_results_domain_index          | BTREE           | FALSE     | domain                             |
| serp_results_added_index           | BTREE           | FALSE     | added                              |
| keywordId_searchEngine_position    | BTREE           | TRUE      | keywordId,searchEngine,position    |
| domain_firstDomainResult_keywordId | BTREE           | FALSE     | domain,firstDomainResult,keywordId |
| PRIMARY                            | BTREE           | TRUE      | id,searchEngine                    |
EDIT: It does take around 60+ seconds when a domain has a larger number of matching rows.
Here's why it takes a long time. (I will intersperse some questions to help confirm my statements.)
Your table is InnoDB. (Yes?)
Your RAM is much smaller than the table. (How big, in GB, is the table? How much RAM do you have? What is the value of innodb_buffer_pool_size?)
The 662,733 rows are scattered around the table. (Or is there some reason why those rows might be clustered chronologically?)
You have a spinning HDD, or you have SSD drive(s). (I don't think this matters for my discussion, but which do you have?)
PRIMARY KEY(id, searchEngine) is very strange, given that id is AUTO_INCREMENT. (Can you justify it?)
Please provide SHOW CREATE TABLE; it is more descriptive than DESCRIBE.
Do you have other important queries? (Sometimes speeding up one query will slow down others. That is the direction I want to go. Show the ones you don't want me to slow down.)
Can domain be just INT? (I am looking also at shrinking the table size.)
As to why more than 60 seconds... A significant percentage of the 662,733 rows needed to be fetched from disk.
Get some of those questions answered; then I may have a concrete suggestion for speeding up the query.
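Meanwhile, here is the direction I am leaning: a covering index, so the query can be answered from the index alone instead of fetching 662,733 scattered rows. This is a sketch only; whether path fits depends on your character set and the per-column index limit (767 bytes in older InnoDB row formats):

ALTER TABLE serp_results
  ADD INDEX ix_cover (domain, searchEngine, keywordId, `rank`, path);

If path is too long to index in full, a prefix such as path(255) would not help here, since a prefix index cannot "cover" a query that selects the whole path.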

SELECT query taking too long after MySQL upgrade

I just upgraded MySQL to 5.7. A SELECT query with four INNER JOINs that previously took around 3 seconds to execute now takes so long that I can't keep track of it. Profiling shows that the 'Sending data' step is where the time goes. Can someone tell me what is going wrong? Here's the profile; note that the query was still running at this point:
+----------------------+-----------+
| Status               | Duration  |
+----------------------+-----------+
| starting             |  0.001911 |
| checking permissions |  0.000013 |
| checking permissions |  0.000003 |
| checking permissions |  0.000003 |
| checking permissions |  0.000006 |
| Opening tables       |  0.000030 |
| init                 |  0.000406 |
| System lock          |  0.000018 |
| optimizing           |  0.000019 |
| statistics           |  0.000509 |
| preparing            |  0.000052 |
| executing            |  0.000004 |
| Sending data         | 31.881794 |
| end                  |  0.000021 |
| query end            |  0.003540 |
| closing tables       |  0.000032 |
| freeing items        |  0.000214 |
| cleaning up          |  0.000028 |
+----------------------+-----------+
Here's the output of EXPLAIN:
+----+-------------+--------------------+------------+------+---------------+------------+---------+-------+---------+----------+----------------------------------------------------+
| id | select_type | table              | partitions | type | possible_keys | key        | key_len | ref   | rows    | filtered | Extra                                              |
+----+-------------+--------------------+------------+------+---------------+------------+---------+-------+---------+----------+----------------------------------------------------+
|  1 | SIMPLE      | movie_data_primary | NULL       | ref  | cinestopId    | cinestopId | 26      | const |       1 |   100.00 | NULL                                               |
|  1 | SIMPLE      | mg                 | NULL       | ALL  | NULL          | NULL       | NULL    | NULL  |  387498 |    10.00 | Using where; Using join buffer (Block Nested Loop) |
|  1 | SIMPLE      | crw                | NULL       | ALL  | NULL          | NULL       | NULL    | NULL  | 1383452 |    10.00 | Using where; Using join buffer (Block Nested Loop) |
|  1 | SIMPLE      | cst                | NULL       | ALL  | NULL          | NULL       | NULL    | NULL  | 2184556 |    10.00 | Using where; Using join buffer (Block Nested Loop) |
+----+-------------+--------------------+------------+------+---------------+------------+---------+-------+---------+----------+----------------------------------------------------+
This looks like an index problem after upgrading the MySQL version.
The documentation says:
If you perform a binary upgrade without dumping and reloading tables, you cannot upgrade directly from MySQL 4.1 to 5.1 or higher. This occurs due to an incompatible change in the MyISAM table index format in MySQL 5.0. Upgrade from MySQL 4.1 to 5.0 and repair all MyISAM tables. Then upgrade from MySQL 5.0 to 5.1 and check and repair your tables.
Modifications to the handling of character sets or collations might change the character sort order, which causes the ordering of entries in any index that uses an affected character set or collation to be incorrect. Such changes result in several possible problems:
Comparison results that differ from previous results
Inability to find some index values due to misordered index entries
Misordered ORDER BY results
Tables that CHECK TABLE reports as being in need of repair
Check these sections of the manual:
1) checking-table-incompatibilities
2) check-table
3) rebuilding-tables
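A low-effort way to act on that advice is to check each joined table for upgrade problems directly (standard statements; movie_data_primary is the one real table name visible in your EXPLAIN, so substitute the actual names behind the aliases mg, crw, and cst):

CHECK TABLE movie_data_primary FOR UPGRADE;
-- if problems are reported:
REPAIR TABLE movie_data_primary;              -- MyISAM/ARCHIVE/CSV tables only
ALTER TABLE movie_data_primary ENGINE=InnoDB; -- rebuilds an InnoDB table in place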

Index not used in query. How to improve performance?

I have this query:
SELECT
*
FROM
`av_cita`
JOIN `av_cita_cstm` ON (
(
`av_cita`.`id` = `av_cita_cstm`.`id_c`
)
)
WHERE
av_cita.deleted = 0
This query takes over 120 seconds to finish, yet I have added all indexes.
When I ask for the execution plan:
explain SELECT * FROM `av_cita`
JOIN `av_cita_cstm` ON ( ( `av_cita`.`id` = `av_cita_cstm`.`id_c` ) )
WHERE av_cita.deleted = 0;
I get this:
+----+-------------+--------------+--------+----------------------+---------+---------+---------------------------+--------+-------------+
| id | select_type | table        | type   | possible_keys        | key     | key_len | ref                       | rows   | Extra       |
+----+-------------+--------------+--------+----------------------+---------+---------+---------------------------+--------+-------------+
|  1 | SIMPLE      | av_cita      | ALL    | PRIMARY,delete_index | NULL    | NULL    | NULL                      | 192549 | Using where |
|  1 | SIMPLE      | av_cita_cstm | eq_ref | PRIMARY              | PRIMARY | 108     | rednacional_v2.av_cita.id |      1 |             |
+----+-------------+--------------+--------+----------------------+---------+---------+---------------------------+--------+-------------+
delete_index is listed in the possible_keys column, but the key is null, and it doesn't use the index.
Table and index definitions:
+------------------+--------------+------+-----+---------+-------+
| Field            | Type         | Null | Key | Default | Extra |
+------------------+--------------+------+-----+---------+-------+
| id               | char(36)     | NO   | PRI | NULL    |       |
| name             | varchar(255) | YES  | MUL | NULL    |       |
| date_entered     | datetime     | YES  | MUL | NULL    |       |
| date_modified    | datetime     | YES  |     | NULL    |       |
| modified_user_id | char(36)     | YES  |     | NULL    |       |
| created_by       | char(36)     | YES  | MUL | NULL    |       |
| description      | text         | YES  |     | NULL    |       |
| deleted          | tinyint(1)   | YES  | MUL | 0       |       |
| assigned_user_id | char(36)     | YES  | MUL | NULL    |       |
+------------------+--------------+------+-----+---------+-------+
+---------+------------+--------------------+--------------+------------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| Table   | Non_unique | Key_name           | Seq_in_index | Column_name      | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+---------+------------+--------------------+--------------+------------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| av_cita |          0 | PRIMARY            |            1 | id               | A         |      192786 | NULL     | NULL   |      | BTREE      |         |               |
| av_cita |          1 | delete_index       |            1 | deleted          | A         |           2 | NULL     | NULL   | YES  | BTREE      |         |               |
| av_cita |          1 | name_index         |            1 | name             | A         |       96393 | NULL     | NULL   | YES  | BTREE      |         |               |
| av_cita |          1 | date_entered_index |            1 | date_entered     | A         |       96393 | NULL     | NULL   | YES  | BTREE      |         |               |
| av_cita |          1 | created_by         |            1 | created_by       | A         |         123 | NULL     | NULL   | YES  | BTREE      |         |               |
| av_cita |          1 | assigned_user_id   |            1 | assigned_user_id | A         |        1276 | NULL     | NULL   | YES  | BTREE      |         |               |
| av_cita |          1 | deleted_id         |            1 | deleted          | A         |           2 | NULL     | NULL   | YES  | BTREE      |         |               |
| av_cita |          1 | deleted_id         |            2 | id               | A         |      192786 | NULL     | NULL   |      | BTREE      |         |               |
| av_cita |          1 | id                 |            1 | id               | A         |      192786 | NULL     | NULL   |      | BTREE      |         |               |
+---------+------------+--------------------+--------------+------------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
How can I improve the performance of this query?
The query is losing time on the join. I would strongly suggest creating an index on av_cita_cstm.id_c. The plan will then probably change to use that index for the av_cita_cstm table, which is much better than PRIMARY. As a consequence, PRIMARY will be used on av_cita.
I think that will bring a big improvement. You might get still more improvement if you make sure delete_index is defined with two fields, (deleted, id), and then move the WHERE condition of the SQL statement into the join condition. But I am not sure MySQL will see this as a possibility.
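In concrete terms, both suggestions look like this (the first index name is mine, not from your schema):

CREATE INDEX idx_id_c ON av_cita_cstm (id_c);

ALTER TABLE av_cita
  DROP INDEX delete_index,
  ADD INDEX delete_index (deleted, id);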
The index on deleted is not used probably because the optimizer has decided that a full table-scan is cheaper than using the index. MySQL tends to make this decision if the value you search for is found on about 20% or more of the rows in the table.
By analogy, think of the index at the back of a book. You can understand why common words like "the" aren't indexed. It would be easier to just read the book cover-to-cover than to flip back and forth to the index, which only tells you that "the" appears on a majority of pages.
If you think MySQL has made the wrong decision, you can make it pretend that a table-scan is more expensive than using a specific index:
SELECT
*
FROM
`av_cita` FORCE INDEX (delete_index)
JOIN `av_cita_cstm` ON (
(
`av_cita`.`id` = `av_cita_cstm`.`id_c`
)
)
WHERE
av_cita.deleted = 0
Read http://dev.mysql.com/doc/refman/5.7/en/index-hints.html for more information about index hints. Don't overuse index hints; they're useful only in rare cases. Most of the time the optimizer makes the right decision.
Your EXPLAIN plan shows that your join to av_cita_cstm is already using a unique index (the clue is "type: eq_ref" and also the "rows: 1"). I don't think any new index is needed in that table.
I notice the EXPLAIN shows that the table-scan on av_cita scans about an estimated 192549 rows. I'm really surprised that this takes 120 seconds. On any reasonably powerful computer, that should run much faster.
That makes me wonder if you have something else that needs tuning or configuration on this server:
What other processes are running on the server? A lot of applications, perhaps? Are the other processes also running slowly on this server? Do you need to increase the power of the server, or move applications onto their own server?
If you're on MySQL 5.7, try querying the sys schema:
select * from sys.innodb_buffer_stats_by_table
where object_name like 'av_cita%';
Are there other costly SQL queries running concurrently?
Did you under-allocate MySQL's innodb_buffer_pool_size? If it's too small, it could be furiously recycling pages in RAM as it scans your table.
SELECT @@innodb_buffer_pool_size;
Did you over-allocate innodb_buffer_pool_size? Once I helped tune a server that was running very slowly. It turned out they had a 4GB buffer pool, but only 1GB of physical RAM. The operating system was swapping like crazy, causing everything to run slowly.
Another thought: You have shown us the columns in av_cita, but not the table structure for av_cita_cstm. Why are you fetching SELECT *? Do you really need all the columns? Are there huge BLOB/TEXT columns in the latter table? If so, it could be reading a large amount of data from disk that you don't need.
When you ask SQL questions, it would help if you run
SHOW CREATE TABLE av_cita\G
SHOW TABLE STATUS LIKE 'av_cita'\G
And also run the same commands for the other table av_cita_cstm, and include the output in your question above.

Splitting database according to user's ID

I have a database of 5M rows; it keeps growing, and operations on it are getting harder and harder.
Is it a good idea to split the table into 10 tables (v0_table, v1_table, ..., v9_table), where the number (v*) is the first digit of the user's ID?
The user IDs in my case are not auto-increment, so the data would be spread evenly across those 10 tables.
The problem is that I have never done anything like this before...
Can anyone spot any disadvantages?
EDIT:
I would appreciate any help with tuning the structure or the query.
So the slowest query is the following one:
SELECT logos.user,
logos.date,
logos.level,
logos.title,
Count(guesses.id),
Sum(guesses.points)
FROM logos
LEFT JOIN guesses
ON guesses.user = '".$user['uid']."'
AND guesses.done = '1'
AND guesses.logo = logos.id
WHERE open = '1'
GROUP BY level
The guesses table:
+--------+------------+------+-----+-------------------+----------------+
| Field  | Type       | Null | Key | Default           | Extra          |
+--------+------------+------+-----+-------------------+----------------+
| id     | int(11)    | NO   | PRI | NULL              | auto_increment |
| logo   | int(11)    | NO   | MUL | NULL              |                |
| user   | int(11)    | NO   | MUL | NULL              |                |
| date   | timestamp  | NO   |     | CURRENT_TIMESTAMP |                |
| points | int(4)     | YES  | MUL | 100               |                |
| done   | tinyint(1) | NO   | MUL | 0                 |                |
+--------+------------+------+-----+-------------------+----------------+
LOGOS table:
+-------+--------------+------+-----+-------------------+----------------+
| Field | Type         | Null | Key | Default           | Extra          |
+-------+--------------+------+-----+-------------------+----------------+
| id    | int(11)      | NO   | PRI | NULL              | auto_increment |
| name  | varchar(100) | NO   |     | NULL              |                |
| img   | varchar(222) | NO   | MUL | NULL              |                |
| level | int(3)       | NO   | MUL | NULL              |                |
| date  | timestamp    | NO   | MUL | CURRENT_TIMESTAMP |                |
| user  | int(11)      | NO   | MUL | NULL              |                |
| open  | tinyint(1)   | NO   | MUL | 0                 |                |
+-------+--------------+------+-----+-------------------+----------------+
EXPLAIN:
+----+-------------+---------+------+----------------+------+---------+-------+------+----------------------------------------------+
| id | select_type | table   | type | possible_keys  | key  | key_len | ref   | rows | Extra                                        |
+----+-------------+---------+------+----------------+------+---------+-------+------+----------------------------------------------+
|  1 | SIMPLE      | logos   | ref  | open           | open | 1       | const |  521 | Using where; Using temporary; Using filesort |
|  1 | SIMPLE      | guesses | ref  | done,user,logo | user | 4       | const |   87 |                                              |
+----+-------------+---------+------+----------------+------+---------+-------+------+----------------------------------------------+
Your problem isn't that you have too much data, it's that this data is not properly indexed. Try adding an index:
CREATE INDEX open_level ON logos(open, level)
This should eliminate Using temporary; Using filesort on logos.
Basically, for this query you need an index on this table that covers two things: open (for WHERE open = '1') and level (for GROUP BY level), in this order, as MySQL will first filter by open, then group the results by level (implicitly sorting by it in the process).
Short and sweet: No. This is never a good idea. Is your table properly indexed? Is MySQL properly tuned? Are your queries efficient? Are you using any caching?
Instead of sharding your table, you may want to examine other tables in your database to see if they can be split off into other dbs. For example, tables that are never joined to are great candidates for this type of vertical partitioning.
This allows you to optimize hardware for smaller sets of data.
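For completeness: if spreading rows by user ID ever does become necessary, MySQL's native partitioning achieves what the manual v0_...v9_ split was aiming for without ten separate tables. A sketch against the guesses table (note that MySQL requires the partitioning column to appear in every unique key, so the primary key has to be widened first):

ALTER TABLE guesses
  DROP PRIMARY KEY,
  ADD PRIMARY KEY (id, user);

ALTER TABLE guesses
  PARTITION BY HASH (user)
  PARTITIONS 10;

Queries that filter on user then touch only one partition, which is the same locality benefit the manual split would have given, minus the application-side routing logic.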