How to resolve an UPDATE lock issue in MySQL

I have two MySQL UPDATE query problems on my website.
Problem 1
I run a content site that updates page views for posts when users read.
Each time I send push notifications, my server times out; when I comment out the UPDATE query that increments the page views, everything returns to normal.
I think this may be the result of hundreds of UPDATE queries trying to update the views on the same row.
**The query that updates the table**
update table set views='$newview' where id=1
Query Explain
id: 1
select_type: SIMPLE
table: new_jobs
type: range
possible_keys: PRIMARY
key: PRIMARY
key_len: 4
ref: NULL
rows: 1
Extra: Using where
**tablename create table**
CREATE TABLE `tablename` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`company_id` int(11) DEFAULT NULL,
`job_title` varchar(255) DEFAULT NULL,
`slug` varchar(255) DEFAULT NULL,
`advert_date` date DEFAULT NULL,
`expiry_date` date DEFAULT NULL,
`no_deadline` int(1) DEFAULT 0,
`source` varchar(20) DEFAULT NULL,
`featured` int(1) DEFAULT 0,
`views` int(11) DEFAULT 1,
`email_status` int(1) DEFAULT 0,
`draft` int(1) DEFAULT 0,
`created_by` int(11) DEFAULT NULL,
`show_company_name` int(1) DEFAULT 1,
`display_application_method` int(1) DEFAULT 0,
`status` int(1) DEFAULT 1,
`upload_date` datetime DEFAULT NULL,
`country` int(1) DEFAULT 1,
`followers_email_status` int(1) DEFAULT 0,
`og_img` varchar(255) DEFAULT NULL,
`old_id` int(11) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `new_jobs_company_id_index` (`company_id`),
KEY `new_jobs_views_index` (`views`),
KEY `new_jobs_draft_index` (`draft`),
KEY `new_jobs_country_index` (`country`)
) ENGINE=InnoDB AUTO_INCREMENT=151359 DEFAULT CHARSET=utf8
What is the best way of handling this?
[Scenario 2 removed on request]

Scenario 1. I would expect the update of a 'view' count (or 'click' or 'like' or whatever) to be more like
UPDATE t SET views = views + 1 WHERE id = 123;
I assume you have an index (probably the PRIMARY KEY) on id?
Since there are other things going on with that table, it may be wise to split off the rapidly incrementing counter into a separate table. This would avoid interfering with other queries. You can get other data, plus the counter, by using JOIN .. USING(id).
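A minimal sketch of that split, assuming a hypothetical counter table named job_views keyed by the same id:
CREATE TABLE job_views (
    id INT NOT NULL,                        -- same id as the main table
    views INT UNSIGNED NOT NULL DEFAULT 0,
    PRIMARY KEY (id)
) ENGINE=InnoDB;

-- the hot single-row transaction now touches only the counter table:
UPDATE job_views SET views = views + 1 WHERE id = 123;

-- other data plus the counter:
SELECT j.job_title, v.views
FROM new_jobs AS j
JOIN job_views AS v USING(id)
WHERE j.id = 123;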
Scenario 2 does not make sense. It seems to keep the latest date for each email, but what does country mean? Since it seems like more than just a counter, you might want a separate table to log those 3 columns.
Please provide SHOW CREATE TABLE.
There are many things that novices perceive as a "crash". Please describe further -- out of connections, out of disk space, sluggishness, the client gave error message, other operations taking too long, etc. Each has a different remedy.
Query
Are you currently, logically, doing
BEGIN;
$ct = SELECT views ... FOR UPDATE;
...
UPDATE ... SET views = $ct+1 WHERE ...;
COMMIT;
If so, that is much less efficient than
(with autocommit = ON)
UPDATE ... SET views = views+1 ...;
Note that the first version hangs onto the row longer. If you fail to use FOR UPDATE, you will drop some counts.
Splitting into a separate table sort of forces you to run the UPDATE as its own transaction.
Other
innodb_flush_log_at_trx_commit:
Default is 1, which is secure, but incurs at least one IOP per transaction.
2 leads to a flush once a second. During intense times, this is much more efficient, but a crash could lose up to one second's worth of updates. The inaccuracy of the "view count" due to a rare crash is, in my opinion, acceptable.
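If you decide that trade-off is acceptable, the variable is dynamic, so a sketch of the change would be:
SET GLOBAL innodb_flush_log_at_trx_commit = 2;
-- or persist it in my.cnf under [mysqld]:
--   innodb_flush_log_at_trx_commit = 2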
KEY(views) needs to be updated every time views is changed. But, thanks to the "change buffer", this is unlikely to involve any extra I/O, at least not while you are doing the UPDATE.
INT(1) takes 4 bytes; the (1) has no meaning. Suggest changing to TINYINT (1 byte), thereby saving about 27 bytes per row. (7 columns plus 2 indexes)
country INT(1) -- Is it a flag? What is the meaning? Is it normalized to another table? Using 4 bytes for an id and an extra table when standard abbreviations ('US', 'UK', 'RU', 'IN', etc) would take 2 bytes? Suggest country CHAR(2) CHARACTER SET ascii COLLATE ascii_general_ci.
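A sketch of both suggestions as a single ALTER, shown for a few of the flag columns; note that converting country from numeric ids to 2-letter codes also needs a data remapping first, which this assumes has already been done:
ALTER TABLE tablename
    MODIFY featured TINYINT NOT NULL DEFAULT 0,
    MODIFY draft TINYINT NOT NULL DEFAULT 0,
    MODIFY status TINYINT NOT NULL DEFAULT 1,
    MODIFY country CHAR(2) CHARACTER SET ascii COLLATE ascii_general_ci;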
Indexing flags rarely benefits. Let's see the queries where you think such indexes might be used. And the EXPLAIN SELECT ... for them.

Related

Speed Up A Large Insert From Select Query With Multiple Joins

I'm trying to denormalize a few MySQL tables I have into a new table that I can use to speed up some complex queries with lots of business logic. The problem that I'm having is that there are 2.3 million records I need to add to the new table and to do that I need to pull data from several tables and do a few conversions too. Here's my query (with names changed)
INSERT INTO database_name.log_set_logs
(offload_date, vehicle, jurisdiction, baselog_path, path,
baselog_index_guid, new_location, log_set_name, index_guid)
(
select STR_TO_DATE(logset_logs.offload_date, '%Y.%m.%d') as offload_date,
logset_logs.vehicle, jurisdiction, baselog_path, path,
baselog_trees.baselog_index_guid, new_location, logset_logs.log_set_name,
logset_logs.index_guid
from
(
SELECT SUBSTRING_INDEX(SUBSTRING_INDEX(path, '/', 7), '/', -1) as offload_date,
SUBSTRING_INDEX(SUBSTRING_INDEX(path, '/', 8), '/', -1) as vehicle,
SUBSTRING_INDEX(path, '/', 9) as baselog_path, index_guid,
path, log_set_name
FROM database_name.baselog_and_amendment_guid_to_path_mappings
) logset_logs
left join database_name.log_trees baselog_trees
ON baselog_trees.original_location = logset_logs.baselog_path
left join database_name.baselog_offload_location location
ON location.baselog_index_guid = baselog_trees.baselog_index_guid);
The query itself works, because I was able to run it using a filter on log_set_name; however, that filter's condition only matches less than 1% of the total records, because one of the values for log_set_name covers 2.2 million records, the majority of the table. So there is nothing else I can use to break this query up into smaller chunks, from what I can see. The problem is that the query takes too long to run on the remaining 2.2 million records: it ends up timing out after a few hours, the transaction is rolled back, and nothing is added to the new table for those 2.2 million records. Only the 0.1 million records could be processed, and that was because I could add a filter that said where log_set_name != 'value with the 2.2 million records'.
Is there a way to make this query more performant? Am I trying to do too many joins at once, and perhaps I should populate the row's columns in their own individual queries? Or is there some way I can page this type of query so that MySQL executes it in batches? I already got rid of all my indexes on the log_set_logs table because I read that those will slow down inserts. I also jacked my RDS instance up to a db.r4.4xlarge write node, and since I am using MySQL Workbench I increased all of its timeout values to their maximums, giving them all nines. All three of these steps helped and were necessary for me to get the 1% of the records into the new table, but it still wasn't enough to get the 2.2 million records in without timing out. I'd appreciate any insights, as I'm not adept at this type of bulk insert-from-select.
CREATE TABLE `log_set_logs` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`purged` tinyint(1) NOT NULL DEFAUL,
`baselog_path` text,
`baselog_index_guid` varchar(36) DEFAULT NULL,
`new_location` text,
`offload_date` date NOT NULL,
`jurisdiction` varchar(20) DEFAULT NULL,
`vehicle` varchar(20) DEFAULT NULL,
`index_guid` varchar(36) NOT NULL,
`path` text NOT NULL,
`log_set_name` varchar(60) NOT NULL,
`protected_by_retention_condition_1` tinyint(1) NOT NULL DEFAULT '1',
`protected_by_retention_condition_2` tinyint(1) NOT NULL DEFAULT '1',
`protected_by_retention_condition_3` tinyint(1) NOT NULL DEFAULT '1',
`protected_by_retention_condition_4` tinyint(1) NOT NULL DEFAULT '1',
`general_comments_about_this_log` text,
PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=1736707 DEFAULT CHARSET=latin1
CREATE TABLE `baselog_and_amendment_guid_to_path_mappings` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`path` text NOT NULL,
`index_guid` varchar(36) NOT NULL,
`log_set_name` varchar(60) NOT NULL,
PRIMARY KEY (`id`),
KEY `log_set_name_index` (`log_set_name`),
KEY `path_index` (`path`(42))
) ENGINE=InnoDB AUTO_INCREMENT=2387821 DEFAULT CHARSET=latin1
...
CREATE TABLE `baselog_offload_location` (
`baselog_index_guid` varchar(36) NOT NULL,
`jurisdiction` varchar(20) NOT NULL,
KEY `baselog_index` (`baselog_index_guid`),
KEY `jurisdiction` (`jurisdiction`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
CREATE TABLE `log_trees` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`baselog_index_guid` varchar(36) DEFAULT NULL,
`original_location` text NOT NULL, -- This is what I have to join everything on. Since it's text I cannot index it, and the largest value is above 255 characters, so I cannot change it to a varchar and index it either.
`new_location` text,
`distcp_returncode` int(11) DEFAULT NULL,
`distcp_job_id` text,
`distcp_stdout` text,
`distcp_stderr` text,
`validation_attempt` int(11) NOT NULL DEFAULT '0',
`validation_result` tinyint(1) NOT NULL DEFAULT '0',
`archived` tinyint(1) NOT NULL DEFAULT '0',
`archived_at` timestamp NULL DEFAULT NULL,
`created_at` timestamp NULL DEFAULT CURRENT_TIMESTAMP,
`updated_at` timestamp NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
`dir_exists` tinyint(1) NOT NULL DEFAULT '0',
`random_guid` tinyint(1) NOT NULL DEFAULT '0',
`offload_date` date NOT NULL,
`vehicle` varchar(20) DEFAULT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `baselog_index_guid` (`baselog_index_guid`)
) ENGINE=InnoDB AUTO_INCREMENT=1028617 DEFAULT CHARSET=latin1
baselog_offload_location has no PRIMARY KEY; what's up?
GUIDs/UUIDs can be terribly inefficient. A partial solution is to convert them to BINARY(16) to shrink them. More details here: http://mysql.rjweb.org/doc.php/uuid (MySQL 8.0 has built-in functions for this).
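As a sketch of the conversion (the _bin column name is hypothetical; UUID_TO_BIN/BIN_TO_UUID are the MySQL 8.0 built-ins, and UNHEX(REPLACE(...)) works on older versions):
-- MySQL 8.0:
SELECT UUID_TO_BIN('f47ac10b-58cc-4372-a567-0e02b2c3d479');   -- 16 bytes, fits BINARY(16)
SELECT BIN_TO_UUID(baselog_index_guid_bin) FROM log_trees;    -- back to the text form
-- Pre-8.0 equivalent when storing:
SELECT UNHEX(REPLACE('f47ac10b-58cc-4372-a567-0e02b2c3d479', '-', ''));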
It would probably be more efficient if you have a separate (optionally redundant) column for vehicle rather than needing to do
SUBSTRING_INDEX(SUBSTRING_INDEX(path, '/', 8), '/', -1) as vehicle
Why JOIN baselog_offload_location? There seems to be no reference to columns in that table. If there are, be sure to qualify them so we know what is where. Preferably use short aliases.
The lack of an index on baselog_index_guid may be critical to performance.
Please provide EXPLAIN SELECT ... for the SELECT in your INSERT and for the original (slow) query.
SELECT MAX(LENGTH(original_location)) FROM .. -- to see if it really is too big to index. What version of MySQL are you using? The limit increased recently.
For the above item, we can talk about having a 'hash'.
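One way to realize that 'hash', assuming MySQL 5.7+ generated columns (the column and index names here are hypothetical):
ALTER TABLE log_trees
    ADD COLUMN original_location_md5 BINARY(16)
        AS (UNHEX(MD5(original_location))) STORED,
    ADD INDEX ol_md5 (original_location_md5);
-- then join on the short hash and re-verify against the full TEXT column:
--   ON baselog_trees.original_location_md5 = UNHEX(MD5(logset_logs.baselog_path))
--  AND baselog_trees.original_location = logset_logs.baselog_path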
"paging the query". I call it "chunking". See http://mysql.rjweb.org/doc.php/deletebig#deleting_in_chunks . That talks about deleting, but it can be adapted to INSERT .. SELECT since you want to "chunk" the select. If you go with chunking, Javier's comment becomes moot. Your code would be chunking the selects, hence batching the inserts:
Loop:
INSERT .. SELECT .. -- of up to 1000 rows (see link)
End loop
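A sketch of one iteration, assuming the loop is driven from application code and walks the source table's AUTO_INCREMENT id in fixed-size ranges:
INSERT INTO database_name.log_set_logs
    (offload_date, vehicle, jurisdiction, baselog_path, path,
     baselog_index_guid, new_location, log_set_name, index_guid)
SELECT STR_TO_DATE(m.offload_date, '%Y.%m.%d'),
       m.vehicle, l.jurisdiction, m.baselog_path, m.path,
       t.baselog_index_guid, t.new_location, m.log_set_name, m.index_guid
FROM (
    SELECT SUBSTRING_INDEX(SUBSTRING_INDEX(path, '/', 7), '/', -1) AS offload_date,
           SUBSTRING_INDEX(SUBSTRING_INDEX(path, '/', 8), '/', -1) AS vehicle,
           SUBSTRING_INDEX(path, '/', 9) AS baselog_path,
           index_guid, path, log_set_name
    FROM database_name.baselog_and_amendment_guid_to_path_mappings
    WHERE id BETWEEN 1 AND 1000       -- the chunk; advance the range by 1000 each pass
) AS m
LEFT JOIN database_name.log_trees AS t
       ON t.original_location = m.baselog_path
LEFT JOIN database_name.baselog_offload_location AS l
       ON l.baselog_index_guid = t.baselog_index_guid;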

Add an effective index on a huge table

I have a MySQL database table with more than 34M rows (and growing).
CREATE TABLE `sensordata` (
`userID` varchar(45) DEFAULT NULL,
`instrumentID` varchar(10) DEFAULT NULL,
`utcDateTime` datetime DEFAULT NULL,
`dateTime` datetime DEFAULT NULL,
`data` varchar(200) DEFAULT NULL,
`dataState` varchar(45) NOT NULL DEFAULT 'Original',
`gps` varchar(45) DEFAULT NULL,
`location` varchar(45) DEFAULT NULL,
`speed` varchar(20) NOT NULL DEFAULT '0',
`unitID` varchar(5) NOT NULL DEFAULT '1',
`parameterID` varchar(5) NOT NULL DEFAULT '1',
`originalData` varchar(200) DEFAULT NULL,
`comments` varchar(45) DEFAULT NULL,
`channelHashcode` varchar(12) DEFAULT NULL,
`settingHashcode` varchar(12) DEFAULT NULL,
`status` varchar(7) DEFAULT 'Offline',
`id` int(11) NOT NULL AUTO_INCREMENT,
PRIMARY KEY (`id`),
UNIQUE KEY `id_UNIQUE` (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=98772 DEFAULT CHARSET=utf8
I access this table from multiple threads (at least 400 threads) every minute to insert data into the table.
As the table grew, it got slower to read and write data. One SELECT query used to take about 25 seconds, so I added a unique index:
UNIQUE INDEX idx_userInsDate ( userID,instrumentID,utcDateTime)
This reduced the read time from 25 seconds to a few milliseconds, but it increased the insert time, since the index has to be updated for each record.
Also, if I run SELECT queries from multiple threads at the same time, the queries take too long to return the data.
This is an example query
Select dateTime from sensordata WHERE userID = 'someUserID' AND instrumentID = 'someInstrumentID' AND dateTime between 'startDate' AND 'endDate' order by dateTime asc;
Can someone help me improve the table schema, or add an effective index, to improve the performance, please?
Thank you in advance
A PRIMARY KEY is a UNIQUE key. Toss the redundant UNIQUE(id)!
Is id referenced by any other tables? If not, then get rid of it altogether. Instead have just
PRIMARY KEY ( userID, instrumentID, utcDateTime)
That is, if that triple is guaranteed to be unique. You mentioned DST -- use the datatype TIMESTAMP instead of DATETIME. Doing that, you can convert to DATETIME if needed, thereby eliminating one of the columns.
That one index (the PK) takes virtually no space since it is "clustered" with the data in InnoDB.
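A sketch of the whole change, assuming the triple really is unique, the three columns contain no NULLs, and nothing references id:
ALTER TABLE sensordata
    DROP PRIMARY KEY,
    DROP INDEX id_UNIQUE,
    DROP COLUMN id,
    MODIFY userID VARCHAR(45) NOT NULL,        -- PK columns must be NOT NULL
    MODIFY instrumentID VARCHAR(10) NOT NULL,
    MODIFY utcDateTime DATETIME NOT NULL,
    ADD PRIMARY KEY (userID, instrumentID, utcDateTime);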
Your table is awfully fat with all those VARCHARs. For example, status can be reduced to a 1-byte ENUM. Others can be normalized. Things like speed can be either a 4-byte FLOAT or some smaller DECIMAL, depending on how much range and precision you need.
With 34M wide rows, you have probably recently exceeded the cacheability of the RAM you have. By making the row narrower, you will postpone that overflow.
Why attack the indexes? Every UNIQUE (including PRIMARY) index is checked before allowing the row to be inserted. By getting it down to 1 index, that minimizes the cost there. (InnoDB really needs a PRIMARY KEY.)
INT is 4 bytes. Do you have a billion instruments? Maybe instrumentID could be SMALLINT UNSIGNED, which is 2 bytes, with a max of 64K? Think about all the other IDs.
You have 400 INSERTs/minute, correct? That is not bad. If you get to 400/second, we need to have a different talk.
("Fill factor" is not tunable in MySQL because it does not make much difference.)
How much RAM do you have? What is the setting for innodb_buffer_pool_size? Optimal is somewhere around 70% of available RAM.
Let's see your main queries; there may be other issues to address.
It's not the indexes at fault here; it's your data types. As the size of the data on disk grows, the speed of all operations decreases. Indexes can certainly help speed up selects - provided your data is properly structured - but it appears that it isn't:
CREATE TABLE `sensordata` (
`userID` int, /* shouldn't this have a foreign key constraint? */
`instrumentID` int,
`utcDateTime` datetime DEFAULT NULL,
`dateTime` datetime DEFAULT NULL,
/* what exactly are you putting here? Are you sure it's not causing any redundancy? */
`data` varchar(200) DEFAULT NULL,
/* your states will be a finite number of elements. They can be represented by constants in your code or a set of values in a related table */
`dataState` int,
/* what's this? Sounds like what you are saving in location */
`gps` varchar(45) DEFAULT NULL,
`location` point,
`speed` float,
`unitID` int DEFAULT '1',
/* as above */
`parameterID` int NOT NULL DEFAULT '1',
/* are you sure this is different from data? */
`originalData` varchar(200) DEFAULT NULL,
`comments` varchar(45) DEFAULT NULL,
`channelHashcode` varchar(12) DEFAULT NULL,
`settingHashcode` varchar(12) DEFAULT NULL,
/* as above and isn't this the same as */
`status` int,
`id` int(11) NOT NULL AUTO_INCREMENT,
PRIMARY KEY (`id`),
UNIQUE KEY `id_UNIQUE` (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=98772 DEFAULT CHARSET=utf8
First of all: avoid VARCHARs for indexes and especially IDs. Each character position in the varchar generates its own index entry internally!
Second: your SELECT uses dateTime, but your index is on utcDateTime. The index will only be used for userID and instrumentID, and the utcDateTime part will be ignored.
Advice: change your data types for the IDs, and change your index to match the query (dateTime, not utcDateTime).
Using an index decreases your performance on inserts; unfortunately, there is no such thing as a fill factor for indexes in MySQL right now. So the best thing you can do is keep the indexes as small as possible.
Another approach on heavily loaded databases with random access would be: write to an unindexed table, read from an indexed one. At a given time, build the indexes and swap the tables (may require a third table for the index creation while leaving the other ones untouched in between).
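A sketch of the swap step, assuming a hypothetical staging table sensordata_in that receives the unindexed writes:
-- after building the indexes on sensordata_in, swap both tables in one atomic step:
RENAME TABLE sensordata    TO sensordata_old,
             sensordata_in TO sensordata;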

Duplicating records in a mysql table intermittently causes statements to hang and not return

We are trying to duplicate existing records in a table: make 10 records out of each one. The original table contains 75,000 records, and once the statements are done it will contain about 750,000 (10 times as many). The statements sometimes finish after 10 minutes, but many times they never return; hours later we receive a timeout. This happens about 1 out of 3 times. We are using a test database that nobody else is working on, so there is no concurrent access to the table. I don't see any way to optimise the SQL, since to me the EXPLAIN plan looks fine.
The database is mysql 5.5 hosted on AWS RDS db.m3.x-large. The CPU load goes up to 50% during the statements.
Question: What could cause this intermittent behaviour? How do I resolve it?
This is the SQL to create a temporary table, make roughly 9 new records per existing record in ct_revenue_detail in the temporary table, and then copy the data from the temporary table to ct_revenue_detail
---------------------------------------------------------------------------------------------------------
-- CREATE TEMPORARY TABLE AND COPY ROLL-UP RECORDS INTO TABLE
---------------------------------------------------------------------------------------------------------
CREATE TEMPORARY TABLE ct_revenue_detail_tmp
SELECT r.month,
r.period,
a.participant_eid,
r.employee_name,
r.employee_cc,
r.assignments_cc,
r.lob_name,
r.amount,
r.gp_run_rate,
r.unique_id,
r.product_code,
r.smart_product_name,
r.product_name,
r.assignment_type,
r.commission_pcent,
r.registered_name,
r.segment,
'Y' as account_allocation,
r.role_code,
r.product_eligibility,
r.revenue_core,
r.revenue_ict,
r.primary_account_manager_id,
r.primary_account_manager_name
FROM ct_revenue_detail r
JOIN ct_account_allocation_revenue a
ON a.period = r.period AND a.unique_id = r.unique_id
WHERE a.period = 3 AND lower(a.rollup_revenue) = 'y';
This is the second query. It copies the records from the temporary table back to the ct_revenue_detail TABLE
INSERT INTO ct_revenue_detail(month,
period,
participant_eid,
employee_name,
employee_cc,
assignments_cc,
lob_name,
amount,
gp_run_rate,
unique_id,
product_code,
smart_product_name,
product_name,
assignment_type,
commission_pcent,
registered_name,
segment,
account_allocation,
role_code,
product_eligibility,
revenue_core,
revenue_ict,
primary_account_manager_id,
primary_account_manager_name)
SELECT month,
period,
participant_eid,
employee_name,
employee_cc,
assignments_cc,
lob_name,
amount,
gp_run_rate,
unique_id,
product_code,
smart_product_name,
product_name,
assignment_type,
commission_pcent,
registered_name,
segment,
account_allocation,
role_code,
product_eligibility,
revenue_core,
revenue_ict,
primary_account_manager_id,
primary_account_manager_name
FROM ct_revenue_detail_tmp;
This is the EXPLAIN PLAN of the SELECT:
+----+-------------+-------+------+------------------------+--------------+---------+------------------------------------+-------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+------+------------------------+--------------+---------+------------------------------------+-------+-------------+
| 1 | SIMPLE | a | ref | ct_period,ct_unique_id | ct_period | 4 | const | 38828 | Using where |
| 1 | SIMPLE | r | ref | ct_period,ct_unique_id | ct_unique_id | 5 | optusbusiness_20160802.a.unique_id | 133 | Using where |
+----+-------------+-------+------+------------------------+--------------+---------+------------------------------------+-------+-------------+
This is the definition of ct_revenue_detail:
ct_revenue_detail | CREATE TABLE `ct_revenue_detail` (
`participant_eid` varchar(255) DEFAULT NULL,
`lob_name` varchar(255) DEFAULT NULL,
`amount` decimal(32,16) DEFAULT NULL,
`employee_name` varchar(255) DEFAULT NULL,
`period` int(11) NOT NULL DEFAULT '0',
`pk_id` int(11) NOT NULL AUTO_INCREMENT,
`gp_run_rate` decimal(32,16) DEFAULT NULL,
`month` int(11) DEFAULT NULL,
`assignments_cc` int(11) DEFAULT NULL,
`employee_cc` int(11) DEFAULT NULL,
`unique_id` int(11) DEFAULT NULL,
`product_code` varchar(50) DEFAULT NULL,
`smart_product_name` varchar(255) DEFAULT NULL,
`product_name` varchar(255) DEFAULT NULL,
`assignment_type` varchar(100) DEFAULT NULL,
`commission_pcent` decimal(32,16) DEFAULT NULL,
`registered_name` varchar(255) DEFAULT NULL,
`segment` varchar(100) DEFAULT NULL,
`account_allocation` varchar(25) DEFAULT NULL,
`role_code` varchar(25) DEFAULT NULL,
`product_eligibility` varchar(25) DEFAULT NULL,
`rollup` varchar(10) DEFAULT NULL,
`revised_amount` decimal(32,16) DEFAULT NULL,
`original_amount` decimal(32,16) DEFAULT NULL,
`comment` varchar(255) DEFAULT NULL,
`amount_revised_flag` varchar(255) DEFAULT NULL,
`exclude_segment` varchar(10) DEFAULT NULL,
`revenue_type` varchar(50) DEFAULT NULL,
`revenue_core` decimal(32,16) DEFAULT NULL,
`revenue_ict` decimal(32,16) DEFAULT NULL,
`primary_account_manager_id` varchar(100) DEFAULT NULL,
`primary_account_manager_name` varchar(100) DEFAULT NULL,
PRIMARY KEY (`pk_id`,`period`),
KEY `ct_participant_eid` (`participant_eid`),
KEY `ct_period` (`period`),
KEY `ct_employee_name` (`employee_name`),
KEY `ct_month` (`month`),
KEY `ct_segment` (`segment`),
KEY `ct_unique_id` (`unique_id`)
) ENGINE=InnoDB AUTO_INCREMENT=15338782 DEFAULT CHARSET=utf8
/*!50100 PARTITION BY HASH (period)
PARTITIONS 120 */ |
Edit 29.9: The intermittent behaviour was caused by the omission of a DELETE statement: the original table was not cleared before the records were duplicated again. The first time, all is fine: we started with 75,000 records and ended up with 750,000.
Because the DELETE statement was missed, the next run already had 750,000 records, and the script would make 7.5M records out of it. That would still work, but the subsequent run, trying to turn 7.5M into 75M records, would fail. Hence the 1-in-3 failures.
We would then try all the scripts manually, and of course then we would clear the table properly, and all would go well. The reason we didn't see this beforehand was that our application does not output anything when running the SQL.
The real delay would be with your second query, inserting from the temporary table back into the original table. There are several issues here.
Sheer amount of data
Looking at your table, there are several columns of varchar(255); a conservative estimate would put the average length of your rows at 2KB. That's roughly 1.5GB being copied from one table to another, and being distributed to different partitions! Partitioning makes reads more efficient, but for inserting, the engine has to figure out which partition each row should go to, so it's actually writing to lots of different files instead of sequentially to one file. For spinning disks, this is slow.
Rebuilding the indexes
One of the biggest costs of inserts is rebuilding the indexes. In your case you have many of them.
KEY `ct_participant_eid` (`participant_eid`),
KEY `ct_period` (`period`),
KEY `ct_employee_name` (`employee_name`),
KEY `ct_month` (`month`),
KEY `ct_segment` (`segment`),
KEY `ct_unique_id` (`unique_id`)
And some of these indexes, like employee_name, are on varchar(255) columns. That means pretty hefty indexes.
Solution part 1 - Normalize
Your database isn't normalized. Here is a classic example:
primary_account_manager_id varchar(100) DEFAULT NULL,
primary_account_manager_name varchar(100) DEFAULT NULL,
You should really have a table called account_manager, and these two fields should be in it. primary_account_manager_id should probably be an integer field. Only the id should be in your ct_revenue_detail table.
Similarly you really shouldn't have employee_name, registered_name etc in this table. They should be in separate tables and they should be linked to ct_revenue_detail by foreign keys.
Solution part 2 - Rethink indexes.
Do you need so many? MySQL generally uses only one index per table per WHERE clause anyway, so some of these indexes are probably never used. Is this one really needed:
KEY `ct_unique_id` (`unique_id`)
You already have a primary key; why do you even need another unique column?
Indexes for the SELECT: For
SELECT ...
FROM ct_revenue_detail r
JOIN ct_account_allocation_revenue a
ON a.period = r.period AND a.unique_id = r.unique_id
WHERE a.period = 3 AND lower(a.rollup_revenue) = 'y';
a needs INDEX(period, rollup_revenue) in either order. However, you also need to declare rollup_revenue with a ..._ci collation and avoid wrapping the column in a function. That is, change lower(a.rollup_revenue) = 'y' to a.rollup_revenue = 'y'.
r needs INDEX(period, unique_id) in either order. But, as e4c5 mentioned, if unique_id is really "unique" in this table, then take advantage of such.
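As a sketch, those two composite indexes could be added like this (the index names are hypothetical):
ALTER TABLE ct_account_allocation_revenue
    ADD INDEX period_rollup (period, rollup_revenue);
ALTER TABLE ct_revenue_detail
    ADD INDEX period_unique (period, unique_id);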
Bulkiness is a problem when shoveling data around.
decimal(32,16) takes 16 bytes and gives you precision and range that are probably unnecessary. Consider FLOAT (4 bytes, ~7 significant digits, adequate range) or DOUBLE (8 bytes, ~16 significant digits, adequate range).
month int(11) takes 4 bytes. If that is just a value 1..12, then use TINYINT UNSIGNED (1 byte).
DEFAULT NULL -- I suspect most columns will never be NULL; if so, say NOT NULL for them.
amount_revised_flag varchar(255) -- if that is really a "flag", such as "yes"/"no", then use an ENUM and save lots of space.
It is uncommon to have both an id and a name in the same table (see primary_account_manager*); that is usually relegated to a "normalization table".
"Normalize" (already mentioned by #e4c5).
HASH partitioning
Hash partitioning is virtually useless. Unless you can justify it (preferably with a benchmark), I recommend removing partitioning. More discussion.
Adding or removing partitioning usually involves changing the indexes. Please show us the main queries so we can help you build suitable indexes (especially composite indexes) for the queries.
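For reference, if a benchmark confirms that partitioning is not earning its keep, removing it is a single statement (it rebuilds the table, so allow time and disk space):
ALTER TABLE ct_revenue_detail REMOVE PARTITIONING;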

Optimize Query on mysql

I have a query that runs really slowly (15-20 seconds) when the data is not in memory, and quite fast when it is (0.6s to 2s):
select count(distinct(concat(conexiones.tMacAdres,date_format(conexiones.fFecha,'%Y%m%d')))) as Conexiones,
sum(if(conexiones.tEvento='megusta',1,0)) as MeGusta,sum(if(conexiones.tEvento='megusta',conexiones.nAmigos,0)) as ImpactosMeGusta,
sum(if(conexiones.tEvento='checkin',1,0)) as CheckIn,sum(if(conexiones.tEvento='checkin',conexiones.nAmigos,0)) as ImpactosCheckIn,
min(conexiones.fFecha) Fecha_Inicio, now() Fecha_fin,datediff(now(),min(conexiones.fFecha)) as dias
from conexiones, instalaciones
where conexiones.idInstalacion=instalaciones.idInstalacion and conexiones.idInstalacion=190
and (fFecha between '2014-01-01 00:00:00' and '2016-06-18 23:59:59')
group by instalaciones.tNombre
order by instalaciones.idCliente
This is Table SCHEMAS:
Instalaciones with 1332 rows:
CREATE TABLE `instalaciones` (
`idInstalacion` int(10) unsigned NOT NULL AUTO_INCREMENT,
`idCliente` int(10) unsigned DEFAULT NULL,
`tRouterSerial` varchar(50) DEFAULT NULL,
`tFacebookPage` varchar(256) DEFAULT NULL,
`tidFacebook` varchar(64) DEFAULT NULL,
`tNombre` varchar(128) DEFAULT NULL,
`tMensaje` varchar(128) DEFAULT NULL,
`tWebPage` varchar(128) DEFAULT NULL,
`tDireccion` varchar(128) DEFAULT NULL,
`tPoblacion` varchar(128) DEFAULT NULL,
`tProvincia` varchar(64) DEFAULT NULL,
`tCodigoPosta` varchar(8) DEFAULT NULL,
`tLatitud` decimal(15,12) DEFAULT NULL,
`tLongitud` decimal(15,12) DEFAULT NULL,
`tSSID1` varchar(40) DEFAULT NULL,
`tSSID2` varchar(40) DEFAULT NULL,
`tSSID2_Pass` varchar(40) DEFAULT NULL,
`fSincro` datetime DEFAULT NULL,
`tEstado` varchar(10) DEFAULT NULL,
`tHotspot` varchar(10) DEFAULT NULL,
`fAlta` datetime DEFAULT NULL,
PRIMARY KEY (`idInstalacion`),
UNIQUE KEY `tRouterSerial` (`tRouterSerial`),
KEY `idInstalacion` (`idInstalacion`)
) ENGINE=InnoDB AUTO_INCREMENT=1332 DEFAULT CHARSET=utf8;
Conexiones with 2370365 rows
CREATE TABLE `conexiones` (
`idConexion` int(10) unsigned NOT NULL AUTO_INCREMENT,
`idInstalacion` int(10) unsigned DEFAULT NULL,
`idUsuario` int(11) DEFAULT NULL,
`tMacAdres` varchar(64) DEFAULT NULL,
`tUsuario` varchar(128) DEFAULT NULL,
`tNombre` varchar(64) DEFAULT NULL,
`tApellido` varchar(64) DEFAULT NULL,
`tEmail` varchar(64) DEFAULT NULL,
`tSexo` varchar(20) DEFAULT NULL,
`fNacimiento` date DEFAULT NULL,
`nAmigos` int(11) DEFAULT NULL,
`tPoblacion` varchar(64) DEFAULT NULL,
`fFecha` datetime DEFAULT NULL,
`tEvento` varchar(20) DEFAULT NULL,
PRIMARY KEY (`idConexion`),
KEY `idInstalacion` (`idInstalacion`),
KEY `tMacAdress` (`tMacAdres`) USING BTREE,
KEY `fFecha` (`fFecha`),
KEY `idUsuario` (`idUsuario`),
KEY `insta_fecha` (`idInstalacion`,`fFecha`)
) ENGINE=InnoDB AUTO_INCREMENT=2370365 DEFAULT CHARSET=utf8;
This is EXPLAIN
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE instalaciones const PRIMARY,idInstalacion PRIMARY 4 const 1
1 SIMPLE conexiones ref idInstalacion,fFecha,insta_fecha idInstalacion 5 const 110234 "Using where"
Thanks !
(Edited)
SHOW TABLE STATUS LIKE 'conexiones'
Name: conexiones
Engine: InnoDB
Version: 10
Row_format: Compact
Rows: 2305296
Avg_row_length: 151
Data_length: 350060544
Max_data_length: 0
Index_length: 331661312
Data_free: 75497472
Auto_increment: 2433305
Create_time: 28/06/2016 22:26
Update_time: NULL
Check_time: NULL
Collation: utf8_general_ci
Checksum: NULL
Here's why it is so slow. And I will end with a possible speedup.
First, please do
SELECT COUNT(*) FROM conexiones
WHERE idInstalacion=190
and fFecha >= '2014-01-01'
and fFecha < '2016-06-19'
in order to see how many rows we are dealing with. The EXPLAIN suggests 110234, but that is only a crude estimate.
Assuming there are 110K rows of conexiones involved in the query, and assuming the rows were (approximately) inserted in chronological order by fFecha, then...
There are a lot of rows to work with, and
They are scattered around the table on disk, hence
The query takes a lot of I/O, unless it is cached.
Let's further check on my last claim... How much RAM do you have? What is the value of innodb_buffer_pool_size? It should be about 70% of available RAM. Use a lower percentage if you have less than 4GB of RAM.
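To check the current value (a sketch; on MySQL 5.5 the variable is not dynamic, so changing it means editing my.cnf and restarting):
SHOW VARIABLES LIKE 'innodb_buffer_pool_size';
-- my.cnf, under [mysqld]:
--   innodb_buffer_pool_size = 500M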
Assuming that conexiones is too big to be 'cached' in the 'buffer_pool', we need to find a way to decrease the I/O.
There are 1332 different values for idInstalacion. Perhaps you insert 1332 rows every few minutes/hours into conexiones? Since the PRIMARY KEY is merely an AUTO_INCREMENT, those rows will be 'appended' to the end of the table.
Now let's look at where the idInstalacion=190 rows are. A new one of them occurs every 1332 (or so) rows. That means they are spread out. It means that (probably) no two of those rows are in the same block (16KB in InnoDB). That means the 110234 rows will be in 110234 different blocks. That's about 2GB. If the buffer_pool is smaller than that, there will be I/O. Even if it is bigger, that's a lot of data to touch.
But what to do about it? If we could arrange the =190 rows to be consecutive in the table, then the 2GB might drop to, say, 20MB -- a much more manageable and cacheable size. But how can that be done? By changing the PRIMARY KEY.
PRIMARY KEY(idInstalacion, fFecha, idConexion),
INDEX(idConexion)
and DROP any other indexes starting with idInstalacion or idConexion. To explain:
Since the PK is "clustered" with the data, all idInstalacion=190 rows over any consecutive fFetcha range will be consecutive in the data. So, fetching one block will get about 100 rows -- much less I/O.
A PK must be unique. Assuming (idInstalacion, fFecha) is not unique, I tacked on idConexion to make it unique.
I added INDEX(idConexion) to make AUTO_INCREMENT happy.
Potential drawback... Since this change rearranges the order of the data, other queries, including the INSERTs, may be slowed down. The INSERTs will be scattered, but not really slowed down. 1332 "hot spots" would be accepting the new rows; that many blocks can easily be cached.
Arithmetic... If you have spinning drives, I would expect the existing structure to take about 1102 seconds (perhaps under 110 seconds for SSD) for 110234 rows. Since it is taking under 20 seconds, I suspect there is some caching (or you have SSDs) or the 110234 is grossly overestimated. My suggested change should decrease the "worst" time significantly, and slightly improve the "in memory" time. This "slight improvement" comes from being able to use the PK instead of a secondary key.
Caveat: Since 110234 * 1332 is nowhere near 2370365, much of my numerical analysis is probably nowhere near correct. For example, 2370365 rows with that schema is possibly less than 1GB. Please provide SHOW TABLE STATUS LIKE 'conexiones'.
Addenda
"server has 2GB Ram and innodb_buffer_pool_size is 5368709120" -- Either that is a typo or it is terrible. Since the buffer_pool needs to reside in RAM, do not set the buffer_pool to 5GB. 500MB might be OK for your tiny 2GB of RAM.
The SHOW TABLE STATUS confirms that it (data + indexes) won't quite fit in 500M, so you may periodically experience I/O bound queries with 500M.
Increasing your RAM and buffer_pool would temporarily (until the data gets bigger) help performance.
Before putting this into production, test the ALTER and time the various queries you use:
ALTER TABLE conexiones
DROP PRIMARY KEY,
DROP INDEX insta_fecha,
DROP INDEX idInstalacion,
PRIMARY KEY(idInstalacion, fFecha, idConexion),
INDEX(idConexion)
Caution: The ALTER will need about 1GB of free disk space.
When timing, run with the Query Cache off, and run twice -- the first may involve I/O; the second is the 'in memory' as you mentioned.
Revised analysis: Since the bigger table has 300MB of data and some amount of indexes in use, and assuming 500MB buffer pool, I suspect that blocks are bumped out of the buffer pool some of the time. This fits well with your initial comment on the query's speed. My suggested index changes should help avoid the speed variance, but may hurt the performance of other queries.
Try using a multi-column index:
CREATE INDEX idx_nn_1 ON conexiones (idInstalacion, fFecha);
You might need to have it the other way around depending on the data, so test both. This avoids reading all the records that match the idInstalacion condition just to evaluate the BETWEEN condition on fFecha, and should improve performance.
Try the following:
Either delete the idInstalacion INDEX or tell the engine to use the correct key in the from clause:
from conexiones use index (insta_fecha), instalaciones
And you don't need to JOIN, GROUP or ORDER. You are joining on a constant value (190) with one row. And you don't use any column from instalaciones.
So all you need is this:
select count(distinct(concat(conexiones.tMacAdres,date_format(conexiones.fFecha,'%Y%m%d')))) as Conexiones,
sum(if(conexiones.tEvento='megusta',1,0)) as MeGusta,sum(if(conexiones.tEvento='megusta',conexiones.nAmigos,0)) as ImpactosMeGusta,
sum(if(conexiones.tEvento='checkin',1,0)) as CheckIn,sum(if(conexiones.tEvento='checkin',conexiones.nAmigos,0)) as ImpactosCheckIn,
min(conexiones.fFecha) Fecha_Inicio, now() Fecha_fin,datediff(now(),min(conexiones.fFecha)) as dias
from conexiones -- use index (insta_fecha)
where conexiones.idInstalacion=190
and (fFecha between '2014-01-01 00:00:00' and '2016-06-18 23:59:59')
However - it doesn't mean it will be faster. MySQL will probably optimize all that stuff away.

MySql - Handle table size and performance

We have an analytics product. We give each of our customers a JavaScript snippet that they put in their websites. Whenever a user visits a customer's site, the JavaScript code hits our server, and we store that page visit on behalf of the customer. Each customer has a unique domain name.
We store these page visits in a MySQL table.
Following is the table schema.
CREATE TABLE `page_visits` (
`domain` varchar(50) DEFAULT NULL,
`guid` varchar(100) DEFAULT NULL,
`sid` varchar(100) DEFAULT NULL,
`url` varchar(2500) DEFAULT NULL,
`ip` varchar(20) DEFAULT NULL,
`is_new` varchar(20) DEFAULT NULL,
`ref` varchar(2500) DEFAULT NULL,
`user_agent` varchar(255) DEFAULT NULL,
`stats_time` datetime DEFAULT NULL,
`country` varchar(50) DEFAULT NULL,
`region` varchar(50) DEFAULT NULL,
`city` varchar(50) DEFAULT NULL,
`city_lat_long` varchar(50) DEFAULT NULL,
`email` varchar(100) DEFAULT NULL,
KEY `sid_index` (`sid`) USING BTREE,
KEY `domain_index` (`domain`),
KEY `email_index` (`email`),
KEY `stats_time_index` (`stats_time`),
KEY `domain_statstime` (`domain`,`stats_time`),
KEY `domain_email` (`domain`,`email`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
We don't have primary key for this table.
MySql server details
It is Google Cloud MySQL (version 5.6) and the storage capacity is 10TB.
As of now we have 350 million rows in our table, and the table size is 300 GB. We store all of our customers' details in the same table, even though there is no relation between one customer and another.
Problem 1: A few of our customers have a huge number of rows in the table, so the performance of queries against these customers is very slow.
Example Query 1:
SELECT count(DISTINCT sid) AS count,count(sid) AS total FROM page_views WHERE domain = 'aaa' AND stats_time BETWEEN CONVERT_TZ('2015-02-05 00:00:00','+05:30','+00:00') AND CONVERT_TZ('2016-01-01 23:59:59','+05:30','+00:00');
+---------+---------+
| count | total |
+---------+---------+
| 1056546 | 2713729 |
+---------+---------+
1 row in set (13 min 19.71 sec)
I will update with more queries here. We need results in under 5-10 seconds; will that be possible?
Problem 2: The table size is rapidly increasing; we might hit a table size of 5 TB by this year's end, so we want to shard our table. We want to keep all records related to one customer on one machine. What are the best practices for this sharding?
We are considering the following approaches to the above issues; please suggest best practices to overcome them.
Create separate table for each customer
1) What are the advantages and disadvantages if we create a separate table for each customer? As of now we have 30k customers; we might hit 100k by this year's end, which means 100k tables in the DB. We access all tables simultaneously for reads and writes.
2) We will go with the same table and create partitions based on date ranges.
UPDATE: Is a "customer" determined by the domain? Answer: yes.
Thanks
First, a critique of the excessively large datatypes:
`domain` varchar(50) DEFAULT NULL, -- normalize to MEDIUMINT UNSIGNED (3 bytes)
`guid` varchar(100) DEFAULT NULL, -- what is this for?
`sid` varchar(100) DEFAULT NULL, -- varchar?
`url` varchar(2500) DEFAULT NULL,
`ip` varchar(20) DEFAULT NULL, -- too big for IPv4, too small for IPv6; see below
`is_new` varchar(20) DEFAULT NULL, -- flag? Consider `TINYINT` or `ENUM`
`ref` varchar(2500) DEFAULT NULL,
`user_agent` varchar(255) DEFAULT NULL, -- normalize! (add new rows as new agents are created)
`stats_time` datetime DEFAULT NULL,
`country` varchar(50) DEFAULT NULL, -- use standard 2-letter code (see below)
`region` varchar(50) DEFAULT NULL, -- see below
`city` varchar(50) DEFAULT NULL, -- see below
`city_lat_long` varchar(50) DEFAULT NULL, -- unusable in current format; toss?
`email` varchar(100) DEFAULT NULL,
For IP addresses, use inet6_aton(), then store in BINARY(16).
For country, use CHAR(2) CHARACTER SET ascii -- only 2 bytes.
country + region + city + (maybe) latlng -- normalize this to a "location".
All these changes may cut the disk footprint in half. Smaller --> more cacheable --> less I/O --> faster.
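To illustrate the INET6_ATON() suggestion above (available from MySQL 5.6.3; the ip_bin column name is hypothetical, and VARBINARY(16) also works since IPv4 addresses need only 4 bytes):
ALTER TABLE page_visits ADD COLUMN ip_bin VARBINARY(16);
UPDATE page_visits SET ip_bin = INET6_ATON(ip);      -- parses both IPv4 and IPv6 text
SELECT INET6_NTOA(ip_bin) FROM page_visits LIMIT 1;  -- convert back for display
-- once verified: ALTER TABLE page_visits DROP COLUMN ip;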
Other issues...
To greatly speed up your sid counter, change
KEY `domain_statstime` (`domain`,`stats_time`),
to
KEY dss (domain_id,`stats_time`, sid),
That will be a "covering index", hence won't have to bounce between the index and the data 2713729 times -- the bouncing is what cost 13 minutes. (domain_id is discussed below.)
This is redundant with the above index, DROP it:
KEY domain_index (domain)
Is a "customer" determined by the domain?
Every InnoDB table must have a PRIMARY KEY. There are 3 ways to get a PK; you picked the 'worst' one -- a hidden 6-byte integer fabricated by the engine. I assume there is no 'natural' PK available from some combination of columns? Then, an explicit BIGINT UNSIGNED is called for. (Yes that would be 8 bytes, but various forms of maintenance need an explicit PK.)
If most queries include WHERE domain = '...', then I recommend the following. (And this will greatly improve all such queries.)
id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
domain_id MEDIUMINT UNSIGNED NOT NULL, -- normalized to `Domains`
PRIMARY KEY(domain_id, id), -- clustering on customer gives you the speedup
INDEX(id) -- this keeps AUTO_INCREMENT happy
Recommend you look into pt-online-schema-change for making all these changes. However, I don't know if it can work without an explicit PRIMARY KEY.
"Separate table for each customer"? No. This is a common question; the resounding answer is No. I won't repeat all the reasons for not having 100K tables.
Sharding
"Sharding" is splitting the data across multiple machines.
To do sharding, you need to have code somewhere that looks at domain and decides which server will handle the query, then hands it off. Sharding is advisable when you have write scaling problems. You did not mention such, so it is unclear whether sharding is advisable.
When sharding on something like domain (or domain_id), you could use (1) a hash to pick the server, (2) a dictionary lookup (of 100K rows), or (3) a hybrid.
I like the hybrid -- hash to, say, 1024 values, then look up into a 1024-row table to see which machine has the data. Since adding a new shard and migrating a user to a different shard are major undertakings, I feel that the hybrid is a reasonable compromise. The lookup table needs to be distributed to all clients that redirect actions to shards.
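A sketch of the hybrid's lookup side, with hypothetical names:
CREATE TABLE shard_map (
    bucket     SMALLINT UNSIGNED NOT NULL PRIMARY KEY,   -- 0..1023
    shard_host VARCHAR(64) NOT NULL                      -- machine that owns this bucket
);
-- the client hashes the domain to a bucket, then looks up the shard:
SELECT shard_host FROM shard_map WHERE bucket = CRC32('customer-domain.com') % 1024;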
If your 'writing' is running out of steam, see high speed ingestion for possible ways to speed that up.
PARTITIONing
PARTITIONing is splitting the data across multiple "sub-tables".
There are only a limited number of use cases where partitioning buys you any performance, and you have not indicated that any apply to your use case. Read that blog and see if you think partitioning might be useful.
You mentioned "partition by date range". Will most of the queries include a date range? If so, such partitioning may be advisable. (See the link above for best practices.) Some other options come to mind:
Plan A: PRIMARY KEY(domain_id, stats_time, id) But that is bulky and requires even more overhead on each secondary index. (Each secondary index silently includes all the columns of the PK.)
Plan B: Have stats_time include microseconds, then tweak the values to avoid having dups. Then use stats_time instead of id. But this requires some added complexity, especially if there are multiple clients inserting data. (I can elaborate if needed.)
Plan C: Have a table that maps stats_time values to ids. Look up the id range before doing the real query, then use both WHERE id BETWEEN ... AND stats_time .... (Again, messy code.)
Summary tables
Are many of the queries of the form of counting things over date ranges? Suggest having Summary Tables based perhaps on per-hour. More discussion.
COUNT(DISTINCT sid) is especially difficult to fold into summary tables. For example, the unique counts for each hour cannot be added together to get the unique count for the day. But I have a technique for that, too.
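A minimal per-hour summary sketch with hypothetical names (plain visit counts only; as noted, unique sid counts cannot be rolled up this way):
CREATE TABLE page_visits_hourly (
    domain_id MEDIUMINT UNSIGNED NOT NULL,
    hr        DATETIME NOT NULL,                 -- stats_time truncated to the hour
    visits    INT UNSIGNED NOT NULL,
    PRIMARY KEY (domain_id, hr)
);

INSERT INTO page_visits_hourly (domain_id, hr, visits)
VALUES (123, '2016-01-01 13:00:00', 1)
ON DUPLICATE KEY UPDATE visits = visits + 1;

-- a month's total becomes a cheap range scan:
SELECT SUM(visits) FROM page_visits_hourly
WHERE domain_id = 123 AND hr >= '2016-01-01' AND hr < '2016-02-01';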
I wouldn't do this if I were you. The first thing that comes to mind: on receiving a pageview message, send it to a queue so that a worker can pick it up and insert it into the database later (in bulk, perhaps); also increment a counter for siteid:date in Redis (for example). Doing the count in SQL is just a bad idea for this scenario.