SQL Update table from another - mysql

I have a table as follows:
dev=> \d statemachine_history
Table "public.statemachine_history"
Column | Type | Modifiers
---------------+--------------------------+-------------------------------------------------------------------
id | bigint | not null default nextval('statemachine_history_id_seq'::regclass)
schema_name | character varying | not null
event | character varying | not null
identifier | integer | not null
initial_state | character varying | not null
final_state | character varying | not null
triggered_at | timestamp with time zone | not null default statement_timestamp()
triggered_by | text |
command | json |
flag | json |
created_at | timestamp with time zone |
created_by | json |
updated_at | timestamp with time zone |
updated_by | json |
Indexes:
"statemachine_log_pkey" PRIMARY KEY, btree (id)
"unique_statemachine_log_id" UNIQUE, btree (id)
"statemachine_history_identifier_idx" btree (identifier)
"statemachine_history_schema_name_idx" btree (schema_name)
And the booking table:
dev=> \d booking
Table "public.booking"
Column | Type | Modifiers
----------------+--------------------------+------------------------------------------------------
id | bigint | not null default nextval('booking_id_seq'::regclass)
pin | character varying |
occurred_at | timestamp with time zone |
membership_id | bigint |
appointment_id | bigint |
created_at | timestamp with time zone |
created_by | json |
updated_at | timestamp with time zone |
updated_by | json |
customer_id | bigint |
state | character varying | not null default 'booked'::character varying
Indexes:
"booking_pkey" PRIMARY KEY, btree (id)
Foreign-key constraints:
"booking_appointment_id_fkey" FOREIGN KEY (appointment_id) REFERENCES appointment(id)
"booking_customer_id_fkey" FOREIGN KEY (customer_id) REFERENCES customer(id)
"booking_membership_id_fkey" FOREIGN KEY (membership_id) REFERENCES membership(id)
Referenced by:
TABLE "booking_decline_reason" CONSTRAINT "booking_decline_reason_booking_id_fkey" FOREIGN KEY (booking_id) REFERENCES booking(id)
I am trying to update booking.updated_at from statemachine_history.updated_at.
Note that there is a one-to-many relationship between the two tables, so I want to take the MAX(statemachine_history.updated_at).
My attempt is:
UPDATE booking SET updated_at=
(
SELECT MAX(updated_at)
FROM statemachine_history
WHERE schema_name='Booking'
AND identifier=id
GROUP BY identifier
);
However, booking.updated_at becomes null.

All you really need to do is make sure the id references booking.id by qualifying it explicitly; as written, the unqualified id inside the subquery binds to statemachine_history.id:
UPDATE booking SET updated_at=
(
SELECT MAX(updated_at)
FROM statemachine_history
WHERE schema_name='Booking'
AND identifier = booking.id
GROUP BY identifier
);
A quick SQLfiddle to test with.
If there are real-time requirements for the query, though, you'll want to look into the join in TomH's answer below.

This should do what you need:
UPDATE B
SET
updated_at = SQ.max_updated_at
FROM
Booking B
INNER JOIN
(
SELECT
identifier,
MAX(updated_at) AS max_updated_at
FROM
Statemachine_History SH
GROUP BY
identifier
) AS SQ ON SQ.identifier = B.id

[Solved] PSQL Query:
UPDATE booking SET updated_at=
(
SELECT MAX(updated_at)
FROM statemachine_history
WHERE schema_name='Booking'
AND identifier=booking.id
GROUP BY identifier
) WHERE exists(SELECT id FROM statemachine_history WHERE schema_name='Booking' AND identifier=booking.id);
This part:
WHERE exists(SELECT id FROM statemachine_history WHERE schema_name='Booking' AND identifier=booking.id);
is there to avoid overwriting booking.updated_at with NULL for bookings that have no matching rows in the statemachine_history table.
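For reference, the same update can also be written with PostgreSQL's UPDATE ... FROM syntax, which folds the existence check into a join instead of a correlated subquery. This is just a sketch assuming the same tables and the schema_name = 'Booking' filter, not part of the original answers:
-- Only bookings with a matching aggregate row are touched, so nothing is overwritten with NULL.
UPDATE booking AS b
SET updated_at = sq.max_updated_at
FROM (
    SELECT identifier, MAX(updated_at) AS max_updated_at
    FROM statemachine_history
    WHERE schema_name = 'Booking'
    GROUP BY identifier
) AS sq
WHERE sq.identifier = b.id;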


MySQL report query optimization and timezone issues

I'm faced with a MySQL database containing an events table of ~70 million rows; it has foreign keys to other tables and is used to generate reports. Constructing a performant query that selects (while counting/summing values) and groups data per day from this table is proving challenging.
The database structure is as follows:
CREATE TABLE `client` (
`id` int NOT NULL AUTO_INCREMENT,
`name` varchar(255) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `idx_client_id_name` (`id`,`name`)
) ENGINE=InnoDB AUTO_INCREMENT=66 DEFAULT CHARSET=utf8mb3
CREATE TABLE `class` (
`id` int NOT NULL AUTO_INCREMENT,
`name` varchar(255) DEFAULT NULL,
`client_id` int DEFAULT NULL,
`duration` int DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `fk_client_id_idx` (`client_id`),
CONSTRAINT `fk_client_id` FOREIGN KEY (`client_id`) REFERENCES `client` (`id`) ON DELETE SET NULL ON UPDATE CASCADE
) ENGINE=InnoDB AUTO_INCREMENT=2606 DEFAULT CHARSET=utf8mb3
CREATE TABLE `event` (
`id` int NOT NULL AUTO_INCREMENT,
`start_time` datetime DEFAULT NULL,
`class_id` int DEFAULT NULL,
`venue_id` int DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `fk_class_id_idx` (`class_id`),
KEY `fk_venue_id_idx` (`venue_id`),
KEY `idx_1` (`venue_id`,`class_id`,`start_time`),
CONSTRAINT `fk_class_id` FOREIGN KEY (`class_id`) REFERENCES `class` (`id`) ON DELETE SET NULL ON UPDATE CASCADE,
CONSTRAINT `fk_venue_id` FOREIGN KEY (`venue_id`) REFERENCES `venue` (`id`) ON DELETE SET NULL ON UPDATE CASCADE
) ENGINE=InnoDB AUTO_INCREMENT=64093231 DEFAULT CHARSET=utf8mb3
CREATE TABLE `venue` (
`id` int NOT NULL AUTO_INCREMENT,
`name` varchar(255) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `idx_venue_id_name` (`id`,`name`)
) ENGINE=InnoDB AUTO_INCREMENT=29 DEFAULT CHARSET=utf8mb3
The query, which is fine on an events table with a few thousand rows and demonstrates the desired outcome, is as follows:
SELECT
CAST(event.start_time as date) as day,
class.name,
client.name,
venue.name,
COUNT(class.name) AS occurrences,
SUM(class.duration) AS duration
FROM
class,
client,
event,
venue
WHERE
event.venue_id = venue.id
AND event.class_id = class.id
AND class.client_id = client.id
GROUP BY day, class.name, client.name, venue.name
The database isn't indexed, and although I've tried indexing with things like alter table event add index idx_test (venue_id, class_id, start_time); to improve performance, it's still incredibly slow (I tend to abort queries once they're past the 10-minute mark, so I don't know for sure how long they'd take to complete).
I figured this was a good use case for a summary table (as suggested by Rick James' guide), so that I could hold a separate set of summarized data, broken down by day, with occurrences and total duration calculated/incremented on each addition to the table (IODKU). However, I'm then also up against creating rows per day in the summary table based on what the database considers a day (UTC), which may not match the application's "day" due to the timezone offset.
Short of converting the start_time column to a timestamp type (which would then be inconsistent with all the other date types in the database), is there any way around this, or is there any other optimization I could make to the original events table to get a more responsive query? TIA
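For what it's worth, the IODKU maintenance described above would boil down to an upsert keyed on the summary row. A rough, hypothetical sketch (the summary table, its columns, and the literal values are assumed purely for illustration, and the day here is still the UTC day, which is exactly the mismatch in question):
-- Each newly recorded event bumps that day's aggregate row, inserting it if missing.
-- Table/column names and the values are placeholders.
INSERT INTO summary (tz_date, class, client, venue, occurrences, duration)
VALUES ('2022-05-23', 'Spin', 'Acme Gym', 'Main Hall', 1, 45)
ON DUPLICATE KEY UPDATE
    occurrences = occurrences + 1,
    duration    = duration + VALUES(duration);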
Update 23/05
Here's the buffer pool size:
SHOW VARIABLES LIKE 'innodb_buffer_pool_size';
+-------------------------+-----------+
| Variable_name | Value |
+-------------------------+-----------+
| innodb_buffer_pool_size | 134217728 |
+-------------------------+-----------+
I've also made a bit of progress with indexing, modifying the query and creating a summary table.
I tried various orderings of the columns to test indexes and found idx_event_venueid_classid_starttime (below) to be the most efficient for the event table:
SHOW INDEXES FROM EVENT;
+-------+------------+-------------------------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment | Visible | Expression |
+-------+------------+-------------------------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+
| event | 0 | PRIMARY | 1 | id | A | 62142912 | NULL | NULL | | BTREE | | | YES | NULL |
| event | 1 | fk_class_id_idx | 1 | class_id | A | 51286 | NULL | NULL | YES | BTREE | | | YES | NULL |
| event | 1 | fk_venue_id_idx | 1 | venue_id | A | 16275 | NULL | NULL | YES | BTREE | | | YES | NULL |
| event | 1 | idx_event_venueid_classid_starttime | 1 | venue_id | A | 13378 | NULL | NULL | YES | BTREE | | | YES | NULL |
| event | 1 | idx_event_venueid_classid_starttime | 2 | class_id | A | 81331 | NULL | NULL | YES | BTREE | | | YES | NULL |
| event | 1 | idx_event_venueid_classid_starttime | 3 | start_time | A | 63909472 | NULL | NULL | YES | BTREE | | | YES | NULL |
+-------+------------+-------------------------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+
Here's my modified version of the query, which uses JOIN syntax and CONVERT_TZ to convert from UTC to the timezone required for reporting, then groups by the date (discarding the time portion):
SELECT
DATE(CONVERT_TZ(event.start_time,
'UTC',
'Europe/London')) AS tz_date,
class.name,
client.name,
venue.name,
COUNT(class.id) AS occurrences,
SUM(class.duration) AS duration
FROM
event
JOIN
class ON class.id = event.class_id
JOIN
venue ON venue.id = event.venue_id
JOIN
client ON client.id = class.client_id
GROUP BY tz_date, class.name, client.name, venue.name;
And here's the output of explain for that query:
+----+-------------+--------+------------+--------+---------------------------------------------------------------------+-------------------------------------+---------+-------------------------+------+----------+------------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+--------+------------+--------+---------------------------------------------------------------------+-------------------------------------+---------+-------------------------+------+----------+------------------------------+
| 1 | SIMPLE | venue | NULL | index | PRIMARY,idx_venue_id_name | idx_venue_id_name | 772 | NULL | 28 | 100.00 | Using index; Using temporary |
| 1 | SIMPLE | event | NULL | ref | fk_class_id_idx,fk_venue_id_idx,idx_event_venueid_classid_starttime | idx_event_venueid_classid_starttime | 5 | example.venue.id | 4777 | 100.00 | Using where; Using index |
| 1 | SIMPLE | class | NULL | eq_ref | PRIMARY,fk_client_id_idx | PRIMARY | 4 | example.event.class_id | 1 | 100.00 | Using where |
| 1 | SIMPLE | client | NULL | eq_ref | PRIMARY,idx_client_id_name | PRIMARY | 4 | example.class.client_id | 1 | 100.00 | NULL |
+----+-------------+--------+------------+--------+---------------------------------------------------------------------+-------------------------------------+---------+-------------------------+------+----------+------------------------------+
The query now takes ~1m 20s to run, so I figured I could prepend it with an INSERT INTO to populate a summary table (with the dates being timezone-specific) and run that on a nightly basis. Summary table structure:
CREATE TABLE `summary` (
`tz_date` date NOT NULL,
`class` varchar(255) NOT NULL,
`client` varchar(255) NOT NULL,
`venue` varchar(255) NOT NULL,
`occurrences` int NOT NULL,
`duration` int NOT NULL,
PRIMARY KEY (`tz_date`,`class`,`client`,`venue`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb3
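Populating it is essentially the modified query above prepended with an INSERT, plus an upsert clause so re-runs overwrite rather than duplicate. A hedged sketch of the nightly load (the date predicate restricting the scan to recent rows is an assumption, not from the original post):
INSERT INTO summary (tz_date, class, client, venue, occurrences, duration)
SELECT
    DATE(CONVERT_TZ(event.start_time, 'UTC', 'Europe/London')) AS tz_date,
    class.name,
    client.name,
    venue.name,
    COUNT(class.id),
    SUM(class.duration)
FROM event
JOIN class ON class.id = event.class_id
JOIN venue ON venue.id = event.venue_id
JOIN client ON client.id = class.client_id
WHERE event.start_time >= NOW() - INTERVAL 2 DAY   -- illustrative window for the nightly batch
GROUP BY tz_date, class.name, client.name, venue.name
ON DUPLICATE KEY UPDATE
    occurrences = VALUES(occurrences),
    duration = VALUES(duration);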
From the original ~60m+ rows in the event table, the aggregated summary table is populated with ~66k rows.
Generating the reports from the summary table then takes a fraction of a second (shown below with the data snipped):
SELECT * FROM SUMMARY;
66989 rows in set (0.03 sec)
I haven't looked into the impact of inserting into event while the query to populate the summary table is running - is using InnoDB likely to slow that down?
No further indexes are likely to help. It needs to scan the entire events table, reaching into the other tables to get the names.
Some things for us to look at:
SHOW VARIABLES LIKE 'innodb_buffer_pool_size';
EXPLAIN SELECT ...
How much RAM do you have?
Do the aggregates (COUNT and SUM) look correct? In some situations involving JOIN, they can be over-inflated.
Please use the newer JOIN ... ON syntax. (Won't change performance.)
As you observed, a Summary Table may help -- but only if the older data is not being modified. Please provide the SHOW CREATE TABLE and the query for it.
Yes, timezone vs "definition of day" is a thorny issue. Notice how StackOverflow defines day based on UTC.
How many new rows are there per day? Are they spread out somewhat evenly throughout the day? If the average number of rows per hour is at least 20, then the Summary Table could be based on half-hour intervals. (I picked that because of India time vs most of the rest of the world.) The 20 comes from a Rule of Thumb that says that a summary table should have one-tenth as many rows as the Fact table.
Yes, TIMESTAMP instead of DATETIME may be a workaround.
Since you are talking about moderately large tables, consider whether to change INT NULL to SMALLINT UNSIGNED NOT NULL or some other sized integer.
(As for the cliff in 2038, ask yourself how many databases have been active on the same hardware and software since 2006. That may give some perspective on whether your design must survive 16 years.)
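If the half-hour-interval idea is pursued, bucketing start_time is ordinary date arithmetic; a hypothetical expression (not from the original answer) could look like:
-- 1800 seconds = one half-hour bucket; the summary table's key would then use half_hour instead of a date.
SELECT
    FROM_UNIXTIME(FLOOR(UNIX_TIMESTAMP(start_time) / 1800) * 1800) AS half_hour,
    COUNT(*) AS occurrences
FROM event
GROUP BY half_hour;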

How to assign a JSON variable with row_to_json in PostgreSQL

I'm trying to create an anonymous function in PostgreSQL to create mock data for an application. I would like to do a SELECT query first (to get data from a random charter), convert the resulting row into JSON with row_to_json, and then assign the result to a variable of type JSON.
I need this charter information so I can add it to the bookings table.
This is not working; I think I don't know how to assign the result of the SELECT to the previously declared variable. I'm getting the error that charterData is null, and I would like to know how I can achieve this.
This is the anonymous function in SQL:
BEGIN;
DO $$
DECLARE charterData JSON;
DECLARE bookingId INTEGER;
BEGIN
SELECT row_to_json(t) INTO charterData FROM (select charter_id, name from charters) t WHERE charter_id = 1;
INSERT INTO bookings (charter, yacht, email, date, guests, total, start_hour, end_hour, hotel, arrival_date) values (charterData, '{"test":1}', 'a', '12/10/1995', 8, '78', '123', '123', '123', '123')
RETURNING booking_id INTO bookingId;
END $$;
COMMIT;
The charters table:
Table "public.charters"
Column | Type | Collation | Nullable | Default
-------------+-------------------+-----------+----------+----------------------------------------------
charter_id | integer | | not null | nextval('charters_charter_id_seq'::regclass)
name | character varying | | not null |
description | character varying | | not null |
sail_hours | integer | | not null |
Indexes:
"charters_pk" PRIMARY KEY, btree (charter_id)
"name_charter" UNIQUE CONSTRAINT, btree (name)
Referenced by:
TABLE "bookings" CONSTRAINT "charters_bookings_fk" FOREIGN KEY (charter) REFERENCES charters(name) ON DELETE CASCADE
TABLE "pricing" CONSTRAINT "charters_pricing_fk" FOREIGN KEY (charter_id) REFERENCES charters(charter_id) ON DELETE CASCADE
Bookings table:
Table "public.bookings"
Column | Type | Collation | Nullable | Default
----------------+-------------------+-----------+----------+----------------------------------------------
booking_id | integer | | not null | nextval('bookings_booking_id_seq'::regclass)
charter | json | | not null |
yacht | json | | not null |
email | character varying | | not null |
date | date | | not null |
guests | integer | | not null |
total | numeric | | not null |
start_hour | character varying | | not null |
end_hour | character varying | | not null |
alcohol | character varying | | |
transportation | character varying | | |
others | character varying | | |
arrival_date | character varying | | |
hotel | character varying | | |
Indexes:
"bookings_pk" PRIMARY KEY, btree (booking_id)
"end_hour" UNIQUE CONSTRAINT, btree (end_hour)
"start_hour" UNIQUE CONSTRAINT, btree (start_hour)
Foreign-key constraints:
"charters_bookings_fk" FOREIGN KEY (charter) REFERENCES charters(name) ON DELETE CASCADE
"yachts_bookings_fk" FOREIGN KEY (yacht) REFERENCES yachts(name) ON DELETE CASCADE
Referenced by:
TABLE "bookings_extra" CONSTRAINT "bookings_extra_fk" FOREIGN KEY (booking_id) REFERENCES bookings(booking_id) ON DELETE CASCADE
Okay, I have found the answer... It was kind of silly, but maybe this answer will help someone.
BEGIN;
DO $$
DECLARE charter JSON;
DECLARE bookingId INTEGER;
BEGIN
charter := (SELECT row_to_json(t) FROM (SELECT charter_id, name FROM charters) t WHERE charter_id = 1);
INSERT INTO bookings
(charter, yacht, email, date, passengers, total, start_hour, end_hour, hotel, arrival_date, charter_price)
values (charter, '{"test":1}', 'a', '12/10/1995', 8, '78', '123', '123', '123', '123', '132')
RETURNING booking_id INTO bookingId;
END $$;
COMMIT;
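For completeness, the assignment also works with PL/pgSQL's SELECT ... INTO inside the same DO block, keeping the filter inside the derived table; a small sketch assuming a row with charter_id = 1 exists:
-- Inside the DO $$ ... $$ block: SELECT ... INTO assigns the single JSON value to the variable.
SELECT row_to_json(t)
INTO charter
FROM (SELECT charter_id, name FROM charters WHERE charter_id = 1) t;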

Need help optimizing outer join SQL query

I am hoping to get some advice on how to optimize the performance of this query I have with an outer join. First I will explain what I am trying to do and then I'll show the code and results.
I have an Accounts table that has a list of all customer accounts. And I have a datausage table which keeps track of how much data each customer is using. A backend process running on multiple servers inserts records into the datausage table each day to keep track of how much usage occurred that day for each customer on that server.
The backend process works like this - if there is no activity on that server for an account on that day, no records are written for that account. If there is activity, one record is written with a "LogDate" of that day. This is happening on multiple servers. So collectively the datausage table winds up with no rows (no activity at all for that customer that day), one row (activity was only on one server that day), or multiple rows (activity was on multiple servers that day).
We need to run a report that lists ALL customers, along with their usage for a specific date range. Some customers may have no usage at all (nothing whatsoever in the datausage table). Some customers may have no usage at all for the current period (but usage in other periods).
Regardless of whether there is any usage or not (ever, or for the selected period) we need EVERY customer in the Accounts table to be listed in the report, even if they show no usage. Therefore it seems this required an outer join.
Here is the query I am using:
SELECT
Accounts.accountID as AccountID,
IFNULL(Accounts.name,Accounts.accountID) as AccountName,
AccountPlans.plantype as AccountType,
Accounts.status as AccountStatus,
date(Accounts.created_at) as Created,
sum(IFNULL(datausage.Core,0) + (IFNULL(datausage.CoreDeluxe,0) * 3)) as 'CoreData'
FROM `Accounts`
LEFT JOIN `datausage` on `Accounts`.`accountID` = `datausage`.`accountID`
LEFT JOIN `AccountPlans` on `AccountPlans`.`PlanID` = `Accounts`.`PlanID`
WHERE
(
(`datausage`.`LogDate` >= '2014-06-01' and `datausage`.`LogDate` < '2014-07-01')
or `datausage`.`LogDate` is null
)
GROUP BY Accounts.accountID
ORDER BY `AccountName` asc
This query takes about 2 seconds to run, but only about 0.3 seconds if the "or datausage.LogDate is null" clause is removed. It seems I must have that clause in there, though, because accounts with no usage are excluded from the result set without it.
Here is the EXPLAIN output:
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+--------------+--------+---------------------------------------------------------+---------+---------+----------------------+------- +----------------------------------------------------+
| 1 | SIMPLE | Accounts | ALL | PRIMARY,accounts_planid_foreign,accounts_cardid_foreign | NULL | NULL | NULL | 57 | Using temporary; Using filesort |
| 1 | SIMPLE | datausage | ALL | NULL | NULL | NULL | NULL | 96805 | Using where; Using join buffer (Block Nested Loop) |
| 1 | SIMPLE | AccountPlans | eq_ref | PRIMARY | PRIMARY | 4 | mydb.Accounts.planID | 1 | NULL |
The indexes on Accounts table are as follows:
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+----------+------------+-------------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| Accounts | 0 | PRIMARY | 1 | accountID | A | 57 | NULL | NULL | | BTREE | | |
| Accounts | 1 | accounts_planid_foreign | 1 | planID | A | 5 | NULL | NULL | | BTREE | | |
| Accounts | 1 | accounts_cardid_foreign | 1 | cardID | A | 0 | NULL | NULL | YES | BTREE | | |
The index on the datausage table is as follows:
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+------------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| datausage | 0 | PRIMARY | 1 | UsageID | A | 96805 | NULL | NULL | | BTREE | | |
I tried creating different indexes on datausage to see if it would help, but nothing did. I tried an index on (AccountID), an index on (AccountID, LogDate), an index on (LogDate, AccountID), and an index on (LogDate). None of these made any difference.
I also tried a UNION ALL of one query with the LogDate range and another query with just WHERE LogDate IS NULL, but the result was about the same (actually a bit worse).
Can someone please help me understand what may be going on and the ways in which I can optimize the query execution time? Thank you!!
UPDATE: At Philipxy's request, here are the table definitions. Note that I removed some columns and constraints that are not related to this query to help keep things as tight and clean as possible.
CREATE TABLE `Accounts` (
`accountID` varchar(25) NOT NULL,
`name` varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL,
`status` int(11) NOT NULL,
`planID` int(10) unsigned NOT NULL DEFAULT '1',
`created_at` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00',
PRIMARY KEY (`accountID`),
KEY `accounts_planid_foreign` (`planID`),
KEY `acctname_id_ndx` (`name`,`accountID`),
CONSTRAINT `accounts_planid_foreign` FOREIGN KEY (`planID`) REFERENCES `AccountPlans` (`planID`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci
CREATE TABLE `datausage` (
`UsageID` int(11) NOT NULL AUTO_INCREMENT,
`Core` int(11) DEFAULT NULL,
`CoreDeluxe` int(11) DEFAULT NULL,
`AccountID` varchar(25) DEFAULT NULL,
`LogDate` date DEFAULT NULL,
PRIMARY KEY (`UsageID`),
KEY `acctusage` (`AccountID`,`LogDate`)
) ENGINE=MyISAM AUTO_INCREMENT=104303 DEFAULT CHARSET=latin1
CREATE TABLE `AccountPlans` (
`planID` int(10) unsigned NOT NULL AUTO_INCREMENT,
`name` varchar(150) COLLATE utf8_unicode_ci NOT NULL,
`params` text COLLATE utf8_unicode_ci NOT NULL,
`created_at` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00',
`plantype` varchar(25) COLLATE utf8_unicode_ci NOT NULL,
PRIMARY KEY (`planID`),
KEY `acctplans_id_type_ndx` (`planID`,`plantype`)
) ENGINE=InnoDB AUTO_INCREMENT=10 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci
First, you can simplify the query by moving the where clause to the on clause:
SELECT a.accountID as AccountID, coalesce(a.name, a.accountID) as AccountName,
ap.plantype as AccountType, a.status as AccountStatus,
date(a.created_at) as Created,
sum(coalesce(du.Core, 0) + (coalesce(du.CoreDeluxe, 0) * 3)) as CoreData
FROM Accounts a LEFT JOIN
datausage du
on a.accountID = du.`accountID` AND
du.`LogDate` >= '2014-06-01' and du.`LogDate` < '2014-07-01'
LEFT JOIN
AccountPlans ap
on ap.`PlanID` = a.`PlanID`
GROUP BY a.accountID
ORDER BY AccountName asc ;
(I also introduced table aliases to make the query easier to read.)
This version should make better use of indexes because it eliminates the or in the where clause. However, it still won't use an index for the outer sort. The following might be better:
SELECT a.accountID as AccountID, coalesce(a.name, a.accountID) as AccountName,
ap.plantype as AccountType, a.status as AccountStatus,
date(a.created_at) as Created,
sum(coalesce(du.Core, 0) + (coalesce(du.CoreDeluxe, 0) * 3)) as CoreData
FROM Accounts a LEFT JOIN
datausage du
on a.accountID = du.`accountID` AND
du.LogDate >= '2014-06-01' and du.LogDate < '2014-07-01'
LEFT JOIN
AccountPlans ap
on ap.PlanID = a.PlanID
GROUP BY a.accountID
ORDER BY a.name, a.accountID ;
For this, I would recommend the following indexes:
Accounts(name, AccountId)
Datausage(AccountId, LogDate)
AccountPlans(PlanId, PlanType)
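Expressed as DDL, these correspond to something like the following (the index names mirror the ones that already appear in the table definitions above):
ALTER TABLE Accounts     ADD INDEX acctname_id_ndx       (name, accountID);
ALTER TABLE datausage    ADD INDEX acctusage             (AccountID, LogDate);
ALTER TABLE AccountPlans ADD INDEX acctplans_id_type_ndx (planID, plantype);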
When you left join with datausage you should restrict the output as much as possible right there. (JOIN means AND means WHERE means ON. Put the conditions in essentially whatever order will be clear and/or optimize when necessary.) The result will be a null-extended row when there was no usage; you want to leave that row in.
When you join with AccountPlans you don't want to introduce null rows (which can't happen anyway) so that's just an inner join.
The version below has the AccountPlan join as an inner join and put first. (Indexed) Accounts FK PlanID to AccountPlan means the DBMS knows the inner join will only ever generate one row per Accounts PK. So the output has key AccountId. That row can be immediately inner joined to datausage. (An index on its AccountID should help, eg for a merge join.) For the other way around there is no PlanID key/index on the outer join result to join with AccountPlan.
SELECT
a.accountID as AccountID,
IFNULL(a.name,a.accountID) as AccountName,
ap.plantype as AccountType,
a.status as AccountStatus,
date(a.created_at) as Created,
sum(IFNULL(du.Core,0) + (IFNULL(du.CoreDeluxe,0) * 3)) as CoreData
FROM Accounts a
JOIN AccountPlans ap ON ap.PlanID = a.PlanID
LEFT JOIN datausage du ON a.accountID = du.accountID AND du.LogDate >= '2014-06-01' AND du.LogDate < '2014-07-01'
GROUP BY a.accountID

Estimate/speedup huge table self-join on mysql

I have a huge table:
CREATE TABLE `messageline` (
`id` bigint(20) NOT NULL AUTO_INCREMENT,
`hash` bigint(20) DEFAULT NULL,
`quoteLevel` int(11) DEFAULT NULL,
`messageDetails_id` bigint(20) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `FK2F5B707BF7C835B8` (`messageDetails_id`),
KEY `hash_idx` (`hash`),
KEY `quote_level_idx` (`quoteLevel`),
CONSTRAINT `FK2F5B707BF7C835B8` FOREIGN KEY (`messageDetails_id`) REFERENCES `messagedetails` (`id`) ON DELETE NO ACTION ON UPDATE NO ACTION
) ENGINE=InnoDB AUTO_INCREMENT=401798068 DEFAULT CHARSET=utf8 COLLATE=utf8_bin
I need to find duplicate lines this way:
create table foundline AS
select ml.messagedetails_id, ml.hash, ml.quotelevel
from messageline ml,
messageline ml1
where ml1.hash = ml.hash
and ml1.messagedetails_id!=ml.messagedetails_id
But this query has been running for more than a day already, which is too long; a few hours would be OK. How can I speed this up? Thanks.
Explain:
+----+-------------+-------+------+---------------+----------+---------+---------------+-----------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+------+---------------+----------+---------+---------------+-----------+-------------+
| 1 | SIMPLE | ml | ALL | hash_idx | NULL | NULL | NULL | 401798409 | |
| 1 | SIMPLE | ml1 | ref | hash_idx | hash_idx | 9 | skryb.ml.hash | 1 | Using where |
+----+-------------+-------+------+---------------+----------+---------+---------------+-----------+-------------+
You can find your duplicates like this
SELECT messagedetails_id, COUNT(*) c
FROM messageline ml
GROUP BY messagedetails_id HAVING c > 1;
If it still takes too long, add a condition to split the query on an indexed field:
WHERE messagedetails_id < 100000
Is it required to do this solely with SQL? For such a number of records you would be better off breaking this down into two steps:
First, run the following query:
CREATE TABLE duplicate_hashes
SELECT * FROM (
SELECT hash, GROUP_CONCAT(id) AS ids, COUNT(*) AS cnt,
COUNT(DISTINCT messagedetails_id) AS cnt_message_details,
GROUP_CONCAT(DISTINCT messagedetails_id) as messagedetails_ids
FROM messageline GROUP BY hash HAVING cnt > 1 ORDER BY NULL
) tmp
WHERE cnt > cnt_message_details
This will give you the duplicate IDs for each hash, and since you have an index on the hash field, the grouping will be relatively fast. By counting distinct messagedetails_id values and comparing the counts, you implicitly fulfill the requirement for different messagedetails_id:
where ml1.hash = ml.hash
and ml1.messagedetails_id!=ml.messagedetails_id
Then use a script to check each record of the duplicate_hashes table.
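For example, the full rows behind each duplicated hash can be pulled back out with a join; a sketch assuming the duplicate_hashes table created above:
-- Fetch the original messageline rows for every hash flagged as duplicated.
SELECT ml.messageDetails_id, ml.hash, ml.quoteLevel
FROM messageline ml
JOIN duplicate_hashes dh ON dh.hash = ml.hash;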

MySQL fixing index so possible keys is not null on left join

This question follows on from the problem posted here. When I run EXPLAIN on my query
SELECT u_id, SUM(counts.s_count * tablename.weighted) AS total FROM tablename
LEFT JOIN (SELECT a_id, s_count FROM tablename WHERE u_id = 1) counts
ON tablename.a_id = counts.a_id
GROUP BY u_id ORDER BY total DESC LIMIT 0,100;
I get the response
+----+-------------+--------------------+-------+---------------+-----------+---------+------+--------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+--------------------+-------+---------------+-----------+---------+------+--------+----------------------------------------------+
| 1 | PRIMARY | tablename | index | NULL | a_id | 3 | NULL | 7222350| Using index; Using temporary; Using filesort |
| 1 | PRIMARY | [derived2] | ALL | NULL | NULL | NULL | NULL | 37 | |
| 2 | DERIVED | tablename | ref | PRIMARY | PRIMARY | 4 | | 37 | Using index |
+----+-------------+--------------------+-------+---------------+-----------+---------+------+-------+----------------------------------------------+
The table is created with:
CREATE TABLE IF NOT EXISTS tablename (
u_id INT NOT NULL,
a_id MEDIUMINT NOT NULL,
s_count MEDIUMINT NOT NULL,
weighted FLOAT NOT NULL,
INDEX (a_id),
PRIMARY KEY (u_id,a_id)
)ENGINE=INNODB;
How can I change the index or the query to make better use of the key? Once the table grows to 7 million rows, the query takes about 30 seconds.
Edit: the table can be created and populated with dummy data using:
CREATE TABLE IF NOT EXISTS tablename ( u_id INT NOT NULL, a_id MEDIUMINT NOT NULL,s_count MEDIUMINT NOT NULL, weighted FLOAT NOT NULL,INDEX (a_id), PRIMARY KEY (u_id,a_id))ENGINE=INNODB;
INSERT INTO tablename (u_id,a_id,s_count,weighted ) VALUES (1,1,17,0.0521472392638),(1,2,80,0.245398773006),(1,3,2,0.00613496932515),(1,4,1,0.00306748466258),(1,5,1,0.00306748466258),(1,6,20,0.0613496932515),(1,7,3,0.00920245398773),(1,8,100,0.306748466258),(1,9,100,0.306748466258),(1,10,2,0.00613496932515),(2,1,1,0.00327868852459),(2,2,1,0.00327868852459),(2,3,100,0.327868852459),(2,4,200,0.655737704918),(2,5,1,0.00327868852459),(2,6,1,0.00327868852459),(2,7,0,0.0),(2,8,0,0.0),(2,9,0,0.0),(2,10,1,0.00327868852459),(3,1,15,0.172413793103),(3,2,40,0.459770114943),(3,3,0,0.0),(3,4,0,0.0),(3,5,0,0.0),(3,6,10,0.114942528736),(3,7,1,0.0114942528736),(3,8,20,0.229885057471),(3,9,0,0.0),(3,10,1,0.0114942528736);
You can hardly force MySQL to use an index for the join with the results of a subquery, but you can try to speed up the grouping by using a covering index (an index that contains enough data that the referenced row does not have to be fetched):
Try adding a composite index on (u_id, a_id, weighted).
And you will probably need to give MySQL a hint to use the index:
SELECT u_id, SUM(counts.s_count * tablename.weighted) AS total
FROM tablename USE INDEX(Index_3)
LEFT JOIN (SELECT a_id, s_count FROM tablename WHERE u_id = 1) counts
ON tablename.a_id = counts.a_id
GROUP BY u_id ORDER BY total DESC LIMIT 0,100;
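As DDL, the suggested covering index would be something like this (the name Index_3 simply matches the hint in the query above and is an assumption; MySQL would otherwise assign its own name):
-- Covering index so the join/aggregation can be served from the index alone.
ALTER TABLE tablename ADD INDEX Index_3 (u_id, a_id, weighted);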