Massive DB and mysql

A new project we are working on requires a lot of data analysis, but we are finding it VERY slow. We are looking for ways to change our approach with software and/or hardware.
We are currently running on an Amazon EC2 instance (Linux):
High-CPU Extra Large Instance
7 GB of memory
20 EC2 Compute Units (8 virtual cores with 2.5 EC2 Compute Units each)
1690 GB of instance storage
64-bit platform
I/O Performance: High
API name: c1.xlarge
processor : 7
vendor_id : GenuineIntel
cpu family : 6
model : 26
model name : Intel(R) Xeon(R) CPU E5506 @ 2.13GHz
stepping : 5
cpu MHz : 2133.408
cache size : 4096 KB
MemTotal: 7347752 kB
MemFree: 728860 kB
Buffers: 40196 kB
Cached: 2833572 kB
SwapCached: 0 kB
Active: 5693656 kB
Inactive: 456904 kB
SwapTotal: 0 kB
SwapFree: 0 kB
One part of the db is articles and entities and a link table for example:
mysql> DESCRIBE articles_entities;
+------------+--------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+------------+--------------+------+-----+---------+-------+
| id | char(36) | NO | PRI | NULL | |
| article_id | char(36) | NO | MUL | NULL | |
| entity_id | char(36) | NO | MUL | NULL | |
| created | datetime | YES | | NULL | |
| modified | datetime | YES | | NULL | |
| relevance | decimal(5,4) | YES | MUL | NULL | |
| analysers | text | YES | | NULL | |
| anchor | varchar(255) | NO | | NULL | |
+------------+--------------+------+-----+---------+-------+
8 rows in set (0.00 sec)
As you can see from the count below, we have a lot of associations, growing at a rate of 100,000+ a day:
mysql> SELECT count(*) FROM articles_entities;
+----------+
| count(*) |
+----------+
| 2829138 |
+----------+
1 row in set (0.00 sec)
A simple query like the one below is taking too much time (12 secs)
mysql> SELECT count(*) FROM articles_entities WHERE relevance <= .4 AND relevance > 0;
+----------+
| count(*) |
+----------+
| 357190 |
+----------+
1 row in set (11.95 sec)
What should we be considering to improve our lookup times? Different DB storage? Different hardware?

As mrorigo asked, please provide the SHOW CREATE TABLE articles_entities so we can see the actual indexes of your table.
As a note from MySQL documentation http://dev.mysql.com/doc/refman/5.0/en/mysql-indexes.html
If the table has a multiple-column index, any leftmost prefix of the index can be used by the optimizer to find rows.
For example, if you have a three-column index on (col1, col2, col3), you have indexed search capabilities on (col1), (col1, col2), and (col1, col2, col3).
MySQL cannot use an index if the columns do not form a leftmost prefix of the index.
So if relevance is part of a multi-column index, but isn't the leftmost column of that index, then the index is not used for your query.
This is a common issue that is often overlooked.
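A minimal sketch of how to check this on the table from the question. The index name idx_relevance is made up for illustration, and the ALTER is only worth trying if SHOW INDEX confirms that relevance is not the leftmost column of an existing index:

```sql
-- List the existing indexes; Seq_in_index = 1 marks the leftmost column of each index.
SHOW INDEX FROM articles_entities;

-- See which index (if any) the optimizer picks for the slow query.
EXPLAIN SELECT COUNT(*)
FROM articles_entities
WHERE relevance <= 0.4 AND relevance > 0;

-- Hypothetical fix: a single-column index so that relevance is a leftmost prefix.
ALTER TABLE articles_entities ADD INDEX idx_relevance (relevance);
```

After adding the index, rerun the EXPLAIN and check that the key column is no longer NULL.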

Using char(36) for keys is not the fastest option in MySQL. Use INT types for keys if possible. If you index CHAR columns, the indexes will be VERY large compared to a (BIG)INT index, unless they are created carefully.
However, if your key values are not numeric, you are stuck with CHAR columns (which ARE still faster than VARCHAR, but can create large indexes).
Please provide the SHOW CREATE TABLE output for your tables so we can see the key/index parameters. As the other answer said, an EXPLAIN for the queries in question would also help us give a better answer.
PS. Use SHOW TABLE STATUS LIKE '{table_name}' to see index (and data) sizes of the table.

There are three things that matter when it comes to query performance:
Indexes.
Memory.
Everything else.
The first thing to do is check your indexes. Do an EXPLAIN on your queries to find out how MySQL is processing them.
If that looks sensible, the next thing would be to check memory. How big is your total database? Memory is cheap these days, and queries that run from memory will be much, much faster than queries that have to read from disk.
After you've explored those, if performance is still slow, then it might be time to consider other options.
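To put a number on "how big is your total database", something like the following should work. This is a rough sketch: the sizes reported by information_schema are estimates, and which of the two memory settings matters depends on the storage engine you use.

```sql
-- Approximate data + index size per schema, in GB.
SELECT table_schema,
       ROUND(SUM(data_length + index_length) / 1024 / 1024 / 1024, 2) AS total_gb
FROM information_schema.TABLES
GROUP BY table_schema;

-- Memory MySQL is allowed to use for caching:
SHOW VARIABLES LIKE 'innodb_buffer_pool_size';  -- InnoDB data and indexes
SHOW VARIABLES LIKE 'key_buffer_size';          -- MyISAM indexes only
```

If total_gb is far larger than the cache sizes, most queries will end up reading from disk.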

Related

Physical disk rewrite of mysql data

I am using mysql for the first time in years to help a friend out. The issue: a mysql table that gets updated a lot with INT and CHAR values. This web app site is hosted on a large generic provider, so I have no direct control of setup/parameters/etc. The performance has gotten really, really bad for this table, to the point where processing a data page that should take a max of 10 seconds is sometimes taking 15 minutes.
I initially tried running all updates as a single transaction, rather than the 50ish statements in a php loop in the web app (written several years ago). The problem, at least what I think, is that this app is running on a giant mysql instance with many other generic websites, and the disk speed just isn't able to handle so many updates.
I am able to use cron/batch jobs on this provider. The web app is mainly used during work hours, so I could limit access to the web app during overnight hours.
I normally work with postgresql or ms sql server, so my knowledge of mysql is fairly limited.
Would performance be increased if I force the table to be dropped and rewritten overnight? Is there some mysql function like postgres's vacuum? I have tried to search for information, but unfortunately using words like rewrite table just brings up references to sql syntax helpers or performance tuning.
Alternately, I guess that I could create a new storage mechanism in mysql, as long as it could be done via a php script. Would there be a better storage mode than the default storage engine for something frequently updated?
MySQL performance depends on many factors, so there is rarely one clear answer that fits every case. The following steps can help figure out what to improve when writing data (INSERT or UPDATE) into MySQL.
Database Engine.
There are five engines you can choose from, depending on your purpose: MyISAM, Memory, InnoDB, Archive, NDB.
An engine with table-level locking granularity will be slower for writes than an engine with row-level locking, because it locks the whole table while inserting or updating a single record, whereas row-level locking locks only the rows being inserted or updated.
When you INSERT or UPDATE a record, every B-tree index on the table has to be updated as well; that is the price of faster SELECT queries. The more indexes a table has, the slower inserts and updates become.
Indexes on CHAR columns are slower than indexes on INT columns, because the keys are larger and it takes more time to find the right node in the index.
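As a sketch of how the engine could be checked and changed from a script: my_table below is a placeholder name, and whether InnoDB is available depends on what the hosting provider has enabled.

```sql
-- Which engine does each table in the current database use?
SELECT table_name, engine
FROM information_schema.TABLES
WHERE table_schema = DATABASE();

-- Hypothetical conversion of a frequently updated table-locking (e.g. MyISAM)
-- table to row-level locking. This rewrites the whole table, so try it on a copy
-- and run it outside working hours.
ALTER TABLE my_table ENGINE = InnoDB;
```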
MySQL Statement
MySQL has an estimation tool that helps you investigate how a query will be executed: add EXPLAIN before your statement.
Example
EXPLAIN SELECT SQL_NO_CACHE * FROM Table_A WHERE id = 1;
I worked on a web application where we used mysql (it's really good!) to scale to really large data.
In addition to what @Lam Nguyen said in his answer, here are a few things to consider.
Check which mysql engine you are using, to see which locks it obtains during select, insert and update. Here is a sample query to check the engine of a table:
mysql> show table status where name="<your_table_name>";
+-------+--------+---------+------------+------+----------------+-------------+-----------------+--------------+-----------+----------------+---------------------+-------------+------------+--------------------+----------+----------------+---------+
| Name | Engine | Version | Row_format | Rows | Avg_row_length | Data_length | Max_data_length | Index_length | Data_free | Auto_increment | Create_time | Update_time | Check_time | Collation | Checksum | Create_options | Comment |
+-------+--------+---------+------------+------+----------------+-------------+-----------------+--------------+-----------+----------------+---------------------+-------------+------------+--------------------+----------+----------------+---------+
| Login | InnoDB | 10 | Dynamic | 2 | 8192 | 16384 | 0 | 0 | 0 | NULL | 2019-04-28 12:16:59 | NULL | NULL | utf8mb4_general_ci | NULL | | |
+-------+--------+---------+------------+------+----------------+-------------+-----------------+--------------+-----------+----------------+---------------------+-------------+------------+--------------------+----------+----------------+---------+
The default engine in current mysql installations is InnoDB. InnoDB uses row-level locking: an INSERT locks only the row being inserted, not the whole table.
SELECT ... FROM is a consistent read, reading a snapshot of the database and setting no locks unless the transaction isolation level is set to SERIALIZABLE.
A locking read, an UPDATE, or a DELETE generally set record locks on every index record that is scanned in the processing of the SQL statement.
InnoDB lock sets
Check which columns you are indexing. Index the columns you actually query a lot. Avoid indexing char columns where you can.
To check which columns of your table are indexed, run:
mysql> show index from BookStore2;
+------------+------------+----------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment | Visible | Expression |
+------------+------------+----------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+
| Bookstore2 | 0 | PRIMARY | 1 | ISBN_NO | A | 0 | NULL | NULL | | BTREE | | | YES | NULL |
| Bookstore2 | 1 | SHORT_DESC_IND | 1 | SHORT_DESC | A | 0 | NULL | NULL | YES | BTREE | | | YES | NULL |
| Bookstore2 | 1 | SHORT_DESC_IND | 2 | PUBLISHER | A | 0 | NULL | NULL | YES | BTREE | | | YES | NULL |
+------------+------------+----------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+
3 rows in set (0.03 sec)
Do not run inner queries (subqueries) over a large data set. To see what your query actually does, run EXPLAIN on it and look at the number of rows it examines:
mysql> explain select * from login;
+----+-------------+-------+------------+------+---------------+------+---------+------+------+----------+-------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------+------------+------+---------------+------+---------+------+------+----------+-------+
| 1 | SIMPLE | login | NULL | ALL | NULL | NULL | NULL | NULL | 2 | 100.00 | NULL |
+----+-------------+-------+------------+------+---------------+------+---------+------+------+----------+-------+
1 row in set, 1 warning (0.03 sec)
Avoid joining too many tables.
Make sure you are querying with a primary key in the criteria, or at least on an indexed column.
When your table grows too big, consider splitting it across clusters.
With a few tweaks like these, you should still be able to get query results in minimal time.

MySQL indexing columns vs joining tables

I am trying to figure out the most efficient way to extract values from a database with a structure similar to this:
table test:
int id (primary, auto increment)
varchar(50) stuff,
varchar(50) important_stuff;
where I need to do a query like
select * from test where important_stuff like 'prefix%';
The entire table has approximately 10 million rows; however, there are only about 500-1000 distinct values for important_stuff. My current solution is an index on important_stuff, but the performance is not satisfactory. Would it be better to create a separate table that maps each distinct important_stuff value to an id, store that id in the 'test' table, and then do
select b.* from (select id from stuff_lookup where important_stuff like 'prefix%') a join test b on b.stuff_id = a.id
or this:
select * from test where stuff_id in (select id from stuff_lookup where important_stuff like 'prefix%')
What is the best way to optimize things like that?
How big is innodb_buffer_pool_size? How much RAM is available? The former should be about 70% of the latter. You'll see in a minute why I bring up this setting.
Of your 3 suggested SELECTs, the original one will work as well as the two complex ones. In some other cases, the complex formulation might work better.
INDEX(important_stuff) is the 'best' index for
select * from test where important_stuff like 'prefix%';
Now, let's study how that query works with that index:
Reach into the BTree index, starting at 'prefix'. (Effort: Virtually instantaneous)
Scan forward for, say, 1000 entries. That will be about 10 InnoDB blocks (16KB each). Each entry will have the PRIMARY KEY (id). (Effort: <= 10 disk hits)
For each entry, look up the row (so you can get "*"). That's 1000 PK lookups in the BTree that contains both the PK and the data. At best, they might all be in 10 blocks. At worst, they could be in 1000 separate blocks. (Effort: 10-1000 blocks)
Total Effort: ~1010 blocks (worst case).
A standard spinning disk can handle ~100 reads/second, so we are looking at about 10 seconds.
Now, run the query again. Guess what; all those blocks are now in RAM (cached in the "buffer_pool", which is hopefully big enough for all of them). And it runs in less than 1 second.
OPTIMIZE TABLE was not necessary! It was not a statistics refresh, but rather caching that sped up the query.
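One rough way to see the caching effect for yourself is to compare InnoDB's read counters before and after running the query. This is only a sketch: the counters are server-wide, so on a busy server other traffic will blur the numbers.

```sql
-- Logical page reads requested from the buffer pool (served from RAM or disk).
SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_read_requests';
-- Reads that could NOT be served from the buffer pool and had to go to disk.
SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_reads';

select * from test where important_stuff like 'prefix%';

-- Check both counters again. On a cold first run Innodb_buffer_pool_reads should
-- jump by roughly the number of blocks touched; on a second run it should barely move.
SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_read_requests';
SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_reads';
```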
I'm not a MySQL user, but I ran some tests on my local database. I added 10 million rows as you described, and the distinct values from the third column load quite fast. These are my results.
mysql> describe bigtable;
+-----------------+-------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+-----------------+-------------+------+-----+---------+----------------+
| id | int(11) | NO | PRI | NULL | auto_increment |
| stuff | varchar(50) | NO | | NULL | |
| important_stuff | varchar(50) | NO | MUL | NULL | |
+-----------------+-------------+------+-----+---------+----------------+
3 rows in set (0.03 sec)
mysql> select count(*) from bigtable;
+----------+
| count(*) |
+----------+
| 10000089 |
+----------+
1 row in set (2.87 sec)
mysql> select count(distinct important_stuff) from bigtable;
+---------------------------------+
| count(distinct important_stuff) |
+---------------------------------+
| 1000 |
+---------------------------------+
1 row in set (0.01 sec)
mysql> select distinct important_stuff from bigtable;
....
| is_987 |
| is_988 |
| is_989 |
| is_99 |
| is_990 |
| is_991 |
| is_992 |
| is_993 |
| is_994 |
| is_995 |
| is_996 |
| is_997 |
| is_998 |
| is_999 |
+-----------------+
1000 rows in set (0.15 sec)
An important detail: I refreshed the statistics on this table first (before this operation I needed ~10 seconds to load this data).
mysql> optimize table bigtable;

MySQL Performance issues / slow query with large amounts of data

MySql
I have a query that takes some time to run against a table named impression, which
has about 57 million rows. The table definition can be found below:
+-----------------+--------------+------+-----+
| Field | Type | Null | Key |
+-----------------+--------------+------+-----+
| id | int(11) | NO | PRI |
| data_type | varchar(16) | NO | MUL |
| object_id | int(11) | YES | |
| user_id | int(11) | YES | |
| posted | timestamp | NO | MUL |
| lat | float | NO | |
| lng | float | NO | |
| region_id | int(11) | NO | |
+-----------------+--------------+------+-----+
The indexes on the table are:
+------------+------------+----------+--------------+-------------+
| Table | Non_unique | Key_name | Seq_in_index | Column_name |
+------------+------------+----------+--------------+-------------+
| impression | 0 | PRIMARY | 1 | id |
| impression | 1 | posted | 1 | posted |
| impression | 1 | oi_dt | 1 | data_type |
| impression | 1 | oi_dt | 2 | object_id |
+------------+------------+----------+--------------+-------------+
A typical select statement goes something like:
SELECT COUNT(`id`)
FROM `impression`
WHERE
posted BETWEEN DATE('2014-01-04') AND DATE('2014-06-01')
AND `data_type` = 'event'
AND `object_id` IN ('1', '2', '3', '4', '5', '8', ...)
...and a typical record looks like (in order of the schema above):
'event', 1234, 81, '2014-01-02 00:00:01', 35.3, -75.2, 10
This statement takes approximately 26 seconds to run, which is where the problem
lies. Are there any solutions that can be employed here to reduce this time to well
below what it is now? Ideally it'd be < 1 second.
I'm open to switching storage solutions / etc... anything that'll help at this point.
Your assistance is most appreciated.
Other things possibly worth noting:
The table is using the InnoDB storage engine
using MySQL 5.5
Server: 8Gb RAM running CentOS 6 (Rackspace)
MySQL usually uses only one index per table in a given query. You have an index on posted and a compound index on data_type, object_id.
You should use EXPLAIN to find out which index your query is currently using. EXPLAIN will also tell you how many rows it estimates it will examine to produce the result set (it might examine many more rows than make it into the final result).
The columns in a multi-column index should be in this order:
Columns in equality conditions, for example in your query data_type = 'event'
Columns in range conditions or sorting, but you only get one such column. Subsequent columns that are in range conditions or sorting do not gain any benefit from being added to the index after the first such column. So pick the column that is the most selective, that is, your condition narrows down the search to a smaller subset of the table.
Other columns in your select-list, if you have just a few such columns and you want to get the covering index effect. It's not necessary to add your primary key column if you use InnoDB, because every secondary index automatically includes the primary key column at the right end, even if you don't declare that.
So in your case, you might be better off with an index on data_type, posted. Try it and use EXPLAIN to confirm. It depends on whether the date range you give is more selective than the list of object_id's.
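As a sketch, the suggestion could be tried like this. The index name dt_posted is made up, and the object_id list is just the sample values from the question. On a 57-million-row table the ALTER itself will take a while, so test it on a copy first.

```sql
-- Equality column first, then the single range column.
ALTER TABLE impression ADD INDEX dt_posted (data_type, posted);

-- Confirm which index is chosen and how many rows MySQL expects to examine.
EXPLAIN SELECT COUNT(`id`)
FROM `impression`
WHERE posted BETWEEN DATE('2014-01-04') AND DATE('2014-06-01')
  AND `data_type` = 'event'
  AND `object_id` IN (1, 2, 3, 4, 5, 8);
```

If EXPLAIN still prefers oi_dt, compare the rows estimates for both indexes to see which condition is actually more selective.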
See also my presentation How to Design Indexes, Really.
Not sure if this is a viable solution for you, but partitioning may speed it up. I have a similar table for impressions and found the following to help it a lot. I'm querying mostly on the current day though.
ALTER TABLE impression PARTITION BY RANGE(TO_DAYS(posted))(
PARTITION beforeToday VALUES LESS THAN(735725),
PARTITION today VALUES LESS THAN(735726),
PARTITION future VALUES LESS THAN MAXVALUE
);
This does incur some maintenance (the partitions have to be updated often to keep the benefits). If you are looking to query on a broader range, less maintenance would be required I think.
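The maintenance step might look roughly like the following, assuming the table is already partitioned as in the ALTER above. The partition name tomorrow and the boundary 735727 are illustrative and simply continue the numbering used there.

```sql
-- Split the catch-all partition so the next day gets its own partition.
ALTER TABLE impression REORGANIZE PARTITION future INTO (
    PARTITION tomorrow VALUES LESS THAN (735727),
    PARTITION future VALUES LESS THAN MAXVALUE
);

-- Inspect the current partitions and their approximate row counts.
SELECT partition_name, partition_description, table_rows
FROM information_schema.PARTITIONS
WHERE table_schema = DATABASE() AND table_name = 'impression';
```

Keep in mind that MySQL requires every unique key, including the primary key, to contain the partitioning column, so the table definition may need adjusting before it can be partitioned at all.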

How to run OPTIMIZE TABLE with the least downtime

I have a MySQL 5.5 DB of +-40GB on a 64GB RAM machine in a production environment. All tables are InnoDB. There is also a slave running as a backup.
One table - the most important one - grew to 150M rows, inserting and deleting became slow. To speed up inserting and deleting I deleted half of the table. This did not speed up as expected; inserting and deleting is still slow.
I've read that running OPTIMIZE TABLE can help in such a scenario. As I understand this operation will require a read lock on the entire table and optimizing the table might take quite a while on a big table.
What would be a good strategy to optimize this table while minimizing downtime?
EDIT The specific table to be optimized has +- 91M rows and looks like this:
+-------------+--------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+-------------+--------------+------+-----+---------+----------------+
| id | int(11) | NO | PRI | NULL | auto_increment |
| channel_key | varchar(255) | YES | MUL | NULL | |
| track_id | int(11) | YES | MUL | NULL | |
| created_at | datetime | YES | | NULL | |
| updated_at | datetime | YES | | NULL | |
| posted_at | datetime | YES | | NULL | |
| position | varchar(255) | YES | MUL | NULL | |
| dead | int(11) | YES | | 0 | |
+-------------+--------------+------+-----+---------+----------------+
Percona Toolkit's pt-online-schema-change does this for you. In this case it worked very well.
300 ms to insert seems excessive, even with slow disks. I would look into the root cause. Optimizing this table is going to take a lot of time. MySQL will create a copy of your table on disk.
Depending on the size of your innodb_buffer_pool (if the table is innodb), free memory on the host, I would try to preload the whole table in the page cache of the OS, so that at least reading the data will be sped up by a couple orders of magnitude.
If you're using innodb_file_per_table, or if it's a MyISAM table, it's easy enough to make sure the whole file is cached using "time cat /path/to/mysql/data/db/huge_table.ibd > /dev/null". When you rerun the command, and it runs in under a few seconds, you can assume the file content is sitting in the OS page cache.
You can monitor the progress while the "optimize table" is running by looking at the size of the temporary file. It's usually in the database data directory, with a temp filename starting with a hash (#) character.
This article suggests to first drop all indexes in a table, then optimize it, and then add indexes back. It claims the speed difference is 20 times compared to just optimize.
Upgrade your version of MySQL; in 8.0.x, OPTIMIZE TABLE is faster than in 5.5.
An OPTIMIZE on a table with 91 million rows could take around 3 hours on your version. You could run it early in the morning, around 3am, so as not to disturb the users of your app.
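Before scheduling a multi-hour rebuild, it may be worth checking how much space is actually reclaimable. A sketch, with big_table standing in for the real table name, which the question does not give:

```sql
-- data_free is the allocated but unused space. With innodb_file_per_table it refers
-- to this table's own tablespace; otherwise it reflects the shared tablespace.
SELECT table_name, data_length, index_length, data_free
FROM information_schema.TABLES
WHERE table_schema = DATABASE() AND table_name = 'big_table';

-- For InnoDB, OPTIMIZE TABLE is mapped to a table rebuild (ALTER TABLE) plus an
-- analyze, so expect a full copy of the table to be written.
OPTIMIZE TABLE big_table;
```

If data_free is small compared to data_length, the rebuild is unlikely to change much and the slow inserts probably have another cause.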

more records takes less time

This is almost driving me insane
I do the following query:
SELECT * FROM `photo_person` WHERE photo_person.photo_id IN (SELECT photo_id FROM photo_person WHERE `photo_person`.`person_id` ='1')
When I change the id, I get different processing time. Although it's all the same queries and tables.
By changing the person_id I get the following:
-- person_id=1 ( 3 total, Query took 0.4523 sec)
-- person_id=2 ( 99 total, Query took 0.1340 sec)
-- person_id=3 ( 470 total, Query took 0.0194 sec)
-- person_id=4 ( 1,869 total, Query took 0.0024 sec)
I do not understand how with the increase of the number of records/results the query time is lower.
The table structures are very straight forward
UPDATE: I have already disabled the mysql query cache, so every time I run the query I get the same exact time (of course it varies at the millisecond level, but this can be neglected)
UPDATE: table is MyISAM
CREATE TABLE IF NOT EXISTS `photo_person` (
`entry_id` int(11) NOT NULL AUTO_INCREMENT,
`photo_id` int(11) NOT NULL DEFAULT '0',
`person_id` int(11) NOT NULL DEFAULT '0',
PRIMARY KEY (`entry_id`),
UNIQUE KEY `PhotoID` (`photo_id`,`person_id`),
KEY `photo_id` (`photo_id`),
KEY `person_id` (`person_id`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8 AUTO_INCREMENT=182072 ;
Here is the results of the profiling
+----------+------------+-----------------------------+
| Query_ID | Duration |Query |
+----------+------------+-----------------------------+
| 1 | 0.45541200 | SELECT ...`person_id` ='1') |
| 2 | 0.44833700 | SELECT ...`person_id` ='2') |
| 3 | 0.45587800 | SELECT ...`person_id` ='3') |
| 4 | 0.45074900 | SELECT ...`person_id` ='4') |
+----------+------------+-----------------------------+
Now, since the numbers are the same, it must be the caching :(
So apparently the caching kicks in at a certain number of records or bytes
mysql> SHOW VARIABLES LIKE "%cac%";
+------------------------------+------------+
| Variable_name | Value |
+------------------------------+------------+
| binlog_cache_size | 32768 |
| have_query_cache | YES |
| key_cache_age_threshold | 300 |
| key_cache_block_size | 1024 |
| key_cache_division_limit | 100 |
| max_binlog_cache_size | 4294963200 |
| query_cache_limit | 1024 |
| query_cache_min_res_unit | 4096 |
| query_cache_size | 1024 |
| query_cache_type | ON |
| query_cache_wlock_invalidate | OFF |
| table_definition_cache | 256 |
| table_open_cache | 64 |
| thread_cache_size | 8 |
+------------------------------+------------+
14 rows in set (0.00 sec)
How are you testing the query speeds? I suspect it's not an appropriate way. The more you query the table, the more likely MySQL is to do some aggressive pre-fetching on the table, meaning further queries on the table will be faster, even though they require scanning more data. The reason is that MySQL will not have to load the pages from disk, since it has already pre-fetched them into memory.
As other people have stated, the query cache could also mess up your test results, especially if testing involved re-running the query several times in a row to get an "average" runtime.
Add SQL_NO_CACHE to your query to see if it is the cache that tricks you.
To see what is taking time try to use PROFILING like this:
mysql> SET profiling = 1;
mysql> Your select goes here;
mysql> SHOW PROFILES;
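Put together, a profiling session could look like the sketch below. It assumes a MySQL 5.x server (where SQL_NO_CACHE and profiling are available) and that the SELECT is the first profiled statement in the session, so it gets Query_ID 1.

```sql
SET profiling = 1;

-- SQL_NO_CACHE keeps the query cache out of the measurement.
SELECT SQL_NO_CACHE *
FROM photo_person
WHERE photo_id IN (SELECT photo_id FROM photo_person WHERE person_id = 1);

SHOW PROFILES;              -- total duration per statement
SHOW PROFILE FOR QUERY 1;   -- time per execution stage for Query_ID 1

-- A DEPENDENT SUBQUERY row here means the inner SELECT is re-evaluated per outer row.
EXPLAIN SELECT *
FROM photo_person
WHERE photo_id IN (SELECT photo_id FROM photo_person WHERE person_id = 1);
```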
Also, try to use the simpler query:
SELECT * FROM photo_person WHERE `photo_person`.`person_id` ='1'
I don't know whether MySQL optimises your query or not, but note that the two are only equivalent if you just need that person's own rows: your version also returns the rows of everyone else who appears in the same photos. Either way, avoid subqueries where possible.
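If you do need every row for the photos that person 1 appears in (which is what the original IN query returns), a self-join is one way to avoid the subquery. A sketch, with the aliases me and pp chosen only for illustration:

```sql
-- "me" finds the photos person 1 appears in (via the person_id index);
-- "pp" then fetches all rows for those photos (via the photo_id index).
SELECT pp.*
FROM photo_person AS me
JOIN photo_person AS pp ON pp.photo_id = me.photo_id
WHERE me.person_id = 1;
```

Because of the UNIQUE KEY on (photo_id, person_id), each photo matches at most one "me" row, so the join does not produce duplicate rows.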