I have several tables, containing (among others) the following fields:
tweets:
--------------------------------
tweet_id   ticker   created_at
--------------------------------
1          1        1298063318
2          1        1298053197

stocks:
----------------------------------------
ticker   date         close   volume
----------------------------------------
1        1313013600   12.25   40370600
1        1312927200   11.60   37281300

wiki:
-----------------------------
ticker   date         views
-----------------------------
1        1296514800   550
1        1296601200   504
I want to compile an overview of the number of tweets, close, volume and views per day (for rows identified by ticker = 1). The tweets table is leading, meaning that if there is a date on which there are no tweets, the close, volume and views for that day don't matter. In other words, I want the output of the query to look like:
------------------------------------------------
date         tweets   close   volume     views
------------------------------------------------
2011-02-13   4533     12.25   40370600   550
2011-02-14   6534     11.60   53543564   340
2011-02-16   5333     13.10   56464333   664
In this example output, there were no tweets on 2011-02-15, so there is no need for the rest of the data of that day. My query thus far goes:
SELECT
DATE_FORMAT(FROM_UNIXTIME(tweets.created_at), '%Y-%m-%d') AS date,
COUNT(tweets.tweet_id) AS tweets,
stocks.close,
stocks.volume,
wiki.views
FROM tweets
LEFT JOIN stocks ON tweets.ticker = stocks.ticker
LEFT JOIN wiki ON tweets.ticker = wiki.ticker
WHERE tweets.ticker = 1
GROUP BY date
ORDER BY date ASC
Could someone verify whether this query is correct? It doesn't run into any errors, but it freezes my PC. Perhaps I should set an index here or there, possibly on the "ticker" columns?
[edit]
As requested, the table definitions:
CREATE TABLE `stocks` (
`ticker` int(3) NOT NULL,
`date` int(10) NOT NULL,
`open` decimal(8,2) NOT NULL,
`high` decimal(8,2) NOT NULL,
`low` decimal(8,2) NOT NULL,
`close` decimal(8,2) NOT NULL,
`volume` int(8) NOT NULL
) ENGINE=MyISAM DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;
CREATE TABLE `tweets` (
`tweet_id` int(11) NOT NULL AUTO_INCREMENT,
`ticker` varchar(5) NOT NULL,
`id_str` varchar(18) NOT NULL,
`created_at` int(10) NOT NULL,
`from_user` int(11) NOT NULL,
`text` text NOT NULL,
PRIMARY KEY (`tweet_id`),
KEY `id_str` (`id_str`),
KEY `from_user` (`from_user`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
CREATE TABLE `wiki` (
`ticker` int(3) NOT NULL,
`date` int(11) NOT NULL,
`views` int(6) NOT NULL
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
I hope this helps.
You are right about indices: without an index on ticker, every lookup has to search through all rows of each table, and if the tables are big that's going to take a lot of time.
I suggest that you turn on logging of queries that run without an index, at least every now and then, to find queries that are slow now or will become slow as the data grows.
Check slow queries with EXPLAIN SELECT ... and learn how to interpret the results (not easy, but important) to understand where to put new indices.
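For example (the index name here is just an illustration):

```sql
-- Log queries that use no index, to spot problem queries before they get slow
SET GLOBAL log_queries_not_using_indexes = ON;

-- Inspect the plan of the suspect query; a missing index on ticker
-- typically shows up as type=ALL (a full table scan)
EXPLAIN SELECT * FROM tweets WHERE ticker = 1;

-- A candidate index for the WHERE filter and the joins on ticker
ALTER TABLE tweets ADD INDEX idx_ticker (ticker);
```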
I believe you should also check the joins between the tables. Your query does not specify which stocks row (or wiki row) is to be matched to the date of a tweet; based on the example data, the match is made against every stocks and wiki row that has the same ticker.
Do the stocks and wiki tables have only one row per day for one ticker? Assuming that is the case, a more logical query would look like this:
SELECT
DATE_FORMAT(FROM_UNIXTIME(t.created_at), '%Y-%m-%d') AS date,
COUNT(t.tweet_id) AS tweets,
s.close,
s.volume,
w.views
FROM tweets t
LEFT JOIN stocks s ON t.ticker = s.ticker
and FROM_UNIXTIME(t.created_at,'%Y-%m-%d')=FROM_UNIXTIME(s.date,'%Y-%m-%d')
LEFT JOIN wiki w ON t.ticker = w.ticker
and FROM_UNIXTIME(t.created_at,'%Y-%m-%d')=FROM_UNIXTIME(w.date,'%Y-%m-%d')
WHERE t.ticker = 1
GROUP BY date, s.close, s.volume, w.views
ORDER BY date ASC
If there can be more than one row per day for one ticker in stocks or wiki, then you need to apply an aggregate function to those columns as well, and change COUNT(t.tweet_id) to COUNT(DISTINCT t.tweet_id) so the extra join rows do not inflate the tweet count.
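A sketch of that variant; which aggregate (AVG, MAX, SUM) is right for close, volume and views depends on what your data means:

```sql
SELECT
    DATE_FORMAT(FROM_UNIXTIME(t.created_at), '%Y-%m-%d') AS date,
    COUNT(DISTINCT t.tweet_id) AS tweets,   -- DISTINCT: the joins may duplicate tweet rows
    AVG(s.close)  AS close,
    SUM(s.volume) AS volume,
    SUM(w.views)  AS views
FROM tweets t
LEFT JOIN stocks s ON t.ticker = s.ticker
    AND FROM_UNIXTIME(t.created_at, '%Y-%m-%d') = FROM_UNIXTIME(s.date, '%Y-%m-%d')
LEFT JOIN wiki w ON t.ticker = w.ticker
    AND FROM_UNIXTIME(t.created_at, '%Y-%m-%d') = FROM_UNIXTIME(w.date, '%Y-%m-%d')
WHERE t.ticker = 1
GROUP BY date
ORDER BY date ASC;
```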
I think one of the problems is the per-row date calculation:
DATE_FORMAT(FROM_UNIXTIME(tweets.created_at), '%Y-%m-%d') AS date
Try adding this value as a precomputed column to the tweets table, to avoid the CPU cost of converting every row on every query.
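A migration along those lines might look like this (the column name day_date is just an example):

```sql
-- Add a precomputed day column and backfill it once
ALTER TABLE tweets ADD COLUMN day_date varchar(10) NOT NULL DEFAULT '';
UPDATE tweets
SET day_date = DATE_FORMAT(FROM_UNIXTIME(created_at), '%Y-%m-%d');

-- Index it together with ticker so the joins and GROUP BY can use it
ALTER TABLE tweets ADD INDEX idx_ticker_day (ticker, day_date);
```

New rows would need to fill day_date on insert (e.g. in application code or a trigger).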
Edit:
You can use something like this:
CREATE TABLE `stocks` (
`ticker` int(3) NOT NULL,
`date` int(10) NOT NULL,
`open` decimal(8,2) NOT NULL,
`high` decimal(8,2) NOT NULL,
`low` decimal(8,2) NOT NULL,
`close` decimal(8,2) NOT NULL,
`volume` int(8) NOT NULL,
`day_date` varchar(10) NOT NULL
) ENGINE=MyISAM DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;
CREATE TABLE `tweets` (
`tweet_id` int(11) NOT NULL AUTO_INCREMENT,
`ticker` varchar(5) NOT NULL,
`id_str` varchar(18) NOT NULL,
`created_at` int(10) NOT NULL,
`from_user` int(11) NOT NULL,
`text` text NOT NULL,
`day_date` varchar(10) NOT NULL,
PRIMARY KEY (`tweet_id`),
KEY `id_str` (`id_str`),
KEY `from_user` (`from_user`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
CREATE TABLE `wiki` (
`ticker` int(3) NOT NULL,
`date` int(11) NOT NULL,
`views` int(6) NOT NULL,
`day_date` varchar(10) NOT NULL
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
SELECT
tweets.day_date AS date,
COUNT(tweets.tweet_id) AS tweets,
stocks.close as close,
stocks.volume as volume,
wiki.views as views
FROM tweets
LEFT JOIN stocks ON tweets.ticker = stocks.ticker
and tweets.day_date = stocks.day_date
LEFT JOIN wiki ON tweets.ticker = wiki.ticker
and tweets.day_date = wiki.day_date
WHERE tweets.ticker = 1
GROUP BY date, close, volume, views
ORDER BY date ASC
I have 3 simple tables
Invoices ( ~500k records )
Invoice items, one-to-many relation to invoices ( ~10 million records )
Invoice payments, one-to-many relation to invoices ( ~700k records )
Now, as simple as it sounds, I need to query for unpaid invoices.
Here is the query I am using:
select * from invoices
LEFT JOIN (SELECT invoice_id, SUM(price) as totalAmount
FROM invoice_items
GROUP BY invoice_id) AS t1
ON t1.invoice_id = invoices.id
LEFT JOIN (SELECT invoice_id, SUM(payed_amount) as totalPaid
FROM invoice_payment_transactions
GROUP BY invoice_id) AS t2
ON t2.invoice_id = invoices.id
WHERE totalAmount > totalPaid
Unfortunately, this query takes around 30 seconds, which is way too slow.
Of course I have indexes set for "invoice_id" on both payments and items.
When I "EXPLAIN" the query, I can see that mysql has to do a full table scan.
I also tried several other approaches, using EXISTS or IN with subqueries, but I never got around the full table scan.
I'm pretty sure there is not much that can be done here (except some caching approach), but maybe someone knows how to optimize this?
I need this query to run in about 2 seconds max.
EDIT:
Thanks to everybody for trying. Please just know that I absolutely know how to adopt different caching strategies here, but this question is purely about optimizing this query!
Here are the ( simplified ) table definitions
CREATE TABLE `invoices`
(
`id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
`created_at` timestamp NOT NULL DEFAULT current_timestamp(),
`date` date NOT NULL,
`title` enum ('M','F','Other') DEFAULT NULL,
`first_name` varchar(191) DEFAULT NULL,
`family_name` varchar(191) DEFAULT NULL,
`street` varchar(191) NOT NULL,
`postal_code` varchar(10) NOT NULL,
`city` varchar(191) NOT NULL,
`country` varchar(2) NOT NULL,
PRIMARY KEY (`id`),
KEY `date` (`date`)
) ENGINE = InnoDB
CREATE TABLE `invoice_items`
(
`id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
`invoice_id` bigint(20) unsigned NOT NULL,
`created_at` timestamp NOT NULL DEFAULT current_timestamp(),
`name` varchar(191) DEFAULT NULL,
`description` text DEFAULT NULL,
`reference` varchar(191) DEFAULT NULL,
`quantity` smallint(6) NOT NULL,
`price` int(11) NOT NULL,
PRIMARY KEY (`id`),
KEY `invoice_items_invoice_id_index` (`invoice_id`)
) ENGINE = InnoDB
CREATE TABLE `invoice_payment_transactions`
(
`id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
`invoice_id` bigint(20) unsigned NOT NULL,
`created_at` timestamp NOT NULL DEFAULT current_timestamp(),
`transaction_identifier` varchar(191) NOT NULL,
`payed_amount` mediumint(9) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `invoice_payment_transactions_invoice_id_index` (`invoice_id`)
) ENGINE = InnoDB
Plan A:
A summary table keyed by invoice_id and day (as Bill suggested); see the standard "Summary Tables" pattern.
Plan B:
Change the design to be "current" and "history". This is where the "payments" is a "history" of money changing hands. Meanwhile "invoices" would be "current" in that it contains a "balance_owed" column. This is a philosophy change; it could (should) be encapsulated in a client subroutine and/or a database Stored Procedure.
Plan C: This may be useful if "most" of the invoices are paid off.
Have a flag in the invoices table to indicate paid-off. That will prevent "most" of the JOINs from occurring. (Well, adding that column is just as hard as doing Plan B.)
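A rough sketch of Plans B/C combined; the column and index names are illustrative, not from your schema:

```sql
-- Keep a running balance on the invoice itself
ALTER TABLE invoices
    ADD COLUMN balance_owed INT NOT NULL DEFAULT 0,
    ADD INDEX idx_balance (balance_owed);

-- Maintain it wherever items and payments are written, e.g.:
UPDATE invoices SET balance_owed = balance_owed + 100 WHERE id = 42;  -- new item of price 100
UPDATE invoices SET balance_owed = balance_owed - 100 WHERE id = 42;  -- payment of 100

-- "Unpaid invoices" then becomes a cheap indexed range scan
-- instead of aggregating 10M item rows on every query
SELECT * FROM invoices WHERE balance_owed > 0;
```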
We have a query that performs some aggregation on one column.
The filtering of the data seems to be pretty fast, but the aggregation seems to take too much time.
This query returns ~1.5 million rows. It runs in 0.6 seconds, although returning the data to the client takes ~2 minutes. (We measured this with the pymysql Python library, using an unbuffered cursor, so we can distinguish query run time from fetch time.)
SELECT *
FROM t_data t1
WHERE (t1.to_date = '2019-03-20')
AND (t1.period = 30)
AND (label IN ('aa','bb') )
AND ( id IN (
SELECT id
FROM t_location_data
WHERE (to_date = '2019-03-20') AND (period = 30)
AND ( country = 'Narniya' ) ) )
But if we run this query:
SELECT MAX(val) val_max,
AVG(val) val_avg,
MIN(val) val_min
FROM t_data t1
WHERE (t1.to_date = '2019-03-20')
AND (t1.period = 30)
AND (label IN ('aa','bb') )
AND ( id IN (
SELECT id
FROM t_location_data
WHERE (to_date = '2019-03-20') AND (period = 30)
AND ( country = 'Narniya' ) ) )
We see that this query takes 40 seconds to run, and the time to fetch the results in this case is obviously less than a second.
Can anyone help with this terrible performance of the aggregation functions on RDS Aurora? Why does calculating MAX, MIN and AVG over 1.5 million rows take so long? (Comparing with Python on the same numbers, the calculation takes less than 1 second.)
NOTE: We added a random number to each SELECT to make sure we do not get cached values.
We use Aurora RDS:
1 instance of db.r5.large (2 vCPU + 16 GB RAM)
MySQL Engine version: 5.6.10a
Create table:
Create Table: CREATE TABLE `t_data` (
`id` varchar(256) DEFAULT NULL,
`val2` int(11) DEFAULT NULL,
`val3` int(11) DEFAULT NULL,
`val` int(11) DEFAULT NULL,
`val4` int(11) DEFAULT NULL,
`tags` varchar(256) DEFAULT NULL,
`val7` int(11) DEFAULT NULL,
`label` varchar(32) DEFAULT NULL,
`val5` varchar(64) DEFAULT NULL,
`val6` int(11) DEFAULT NULL,
`period` int(11) DEFAULT NULL,
`to_date` varchar(64) DEFAULT NULL,
`data_line_id` int(10) unsigned NOT NULL AUTO_INCREMENT,
PRIMARY KEY (`data_line_id`),
UNIQUE KEY `id_data` (`to_date`,`period`,`id`),
KEY `index1` (`to_date`,`period`,`id`),
KEY `index3` (`to_date`,`period`,`label`)
) ENGINE=InnoDB AUTO_INCREMENT=218620560 DEFAULT CHARSET=latin1
Create Table: CREATE TABLE `t_location_data` (
`id` varchar(256) DEFAULT NULL,
`country` varchar(256) DEFAULT NULL,
`state` varchar(256) DEFAULT NULL,
`city` varchar(256) DEFAULT NULL,
`latitude` float DEFAULT NULL,
`longitude` float DEFAULT NULL,
`val8` int(11) DEFAULT NULL,
`val9` tinyint(1) DEFAULT NULL,
`period` int(11) DEFAULT NULL,
`to_date` varchar(64) DEFAULT NULL,
`location_line_id` int(10) unsigned NOT NULL AUTO_INCREMENT,
PRIMARY KEY (`location_line_id`),
UNIQUE KEY `id_location_data` (`to_date`,`period`,`id`,`latitude`,`longitude`),
KEY `index1` (`to_date`,`period`,`id`,`country`),
KEY `index2` (`country`,`state`,`city`),
KEY `index3` (`to_date`,`period`,`country`,`state`)
) ENGINE=InnoDB AUTO_INCREMENT=315944737 DEFAULT CHARSET=latin1
Parameters:
##innodb_buffer_pool_size/1024/1024/1024: 7.7900
##innodb_buffer_pool_instances: 8
UPDATE:
Adding the val index (as suggested by @rick-james) improved the query dramatically (it took ~2 seconds), but only if I delete the AND ( id IN (SELECT id FROM t_location_data ... condition. If I leave it in, the query runs for about 25 seconds: better than before, but still not good.
Indexes needed:
t_data: INDEX(period, to_date, label, val)
t_data: INDEX(period, label, to_date, val)
t_location_data: INDEX(period, country, to_date, id)
Also, change from the slow IN ( SELECT ... ) to a JOIN:
FROM t_data AS d
JOIN t_location_data AS ld USING(id)
WHERE ...
Better yet, since the tables are 1:1 (is that correct?), combine the tables so as to eliminate the JOIN. If id is not the PRIMARY KEY in each table, you really need to provide SHOW CREATE TABLE and should change the name(s).
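Fleshed out against the query in the question, the JOIN rewrite would look something like this (a sketch, assuming the tables really are 1:1 on id):

```sql
SELECT MAX(d.val) AS val_max,
       AVG(d.val) AS val_avg,
       MIN(d.val) AS val_min
FROM t_data AS d
JOIN t_location_data AS ld USING (id)   -- replaces the IN ( SELECT ... )
WHERE d.to_date  = '2019-03-20'
  AND d.period   = 30
  AND d.label IN ('aa', 'bb')
  AND ld.to_date = '2019-03-20'
  AND ld.period  = 30
  AND ld.country = 'Narniya';
```

Repeating the to_date/period filters on both aliases lets each table use its own composite index.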
My problem is a slow search query with a one-to-many relationship between the tables. My tables look like this.
Table Assignment
CREATE TABLE `Assignment` (
`Id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`ProjectId` int(10) unsigned NOT NULL,
`AssignmentTypeId` smallint(5) unsigned NOT NULL,
`AssignmentNumber` varchar(30) NOT NULL,
`AssignmentNumberExternal` varchar(50) DEFAULT NULL,
`DateStart` datetime DEFAULT NULL,
`DateEnd` datetime DEFAULT NULL,
`DateDeadline` datetime DEFAULT NULL,
`DateCreated` datetime DEFAULT NULL,
`Deleted` datetime DEFAULT NULL,
`Lat` double DEFAULT NULL,
`Lon` double DEFAULT NULL,
PRIMARY KEY (`Id`),
KEY `idx_assignment_assignment_type_id` (`AssignmentTypeId`),
KEY `idx_assignment_assignment_number` (`AssignmentNumber`),
KEY `idx_assignment_assignment_number_external`
(`AssignmentNumberExternal`)
) ENGINE=InnoDB AUTO_INCREMENT=5280 DEFAULT CHARSET=utf8;
Table ExtraFields
CREATE TABLE `ExtraFields` (
`assignment_id` int(10) unsigned NOT NULL,
`name` varchar(30) NOT NULL,
`value` text,
PRIMARY KEY (`assignment_id`,`name`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
My search query
SELECT
`Assignment`.`Id`, COL_5_72, COL_5_73, COL_5_74, COL_5_75, COL_5_76,
COL_5_77 FROM (
SELECT
`Assignment`.`Id`,
`Assignment`.`AssignmentNumber` AS COL_5_72,
`Assignment`.`AssignmentNumberExternal` AS COL_5_73 ,
`AssignmentType`.`Name` AS COL_5_74,
`Assignment`.`DateStart` AS COL_5_75,
`Assignment`.`DateEnd` AS COL_5_76,
`Assignment`.`DateDeadline` AS COL_5_77,
CASE WHEN `ExtraField`.`Name` = "WorkDistrict" THEN
`ExtraField`.`Value` END AS COL_5_78 FROM `Assignment`
LEFT JOIN `ExtraFields` as `ExtraField` on
`ExtraField`.`assignment_id` = `Assignment`.`Id`
WHERE `Assignment`.`Deleted` IS NULL -- Assignment should not be removed.
AND (1=1) -- Add assignment filters.
) AS q1
GROUP BY `Assignment`.`Id`
HAVING 1 = 1
AND COL_5_78 LIKE '%Amsterdam East%'
ORDER BY COL_5_72 ASC, COL_5_73 ASC;
When the table is only around 3500 records my query takes a couple of seconds to execute and return the results.
What is a better way to search in the related data? Should I just add a JSON field to the Assignment table and use the MySQL 5.7 JSON query features? Or did I make a mistake in designing my database?
You are selecting from a subquery, which forces MySQL to create an unindexed temporary table for each execution. Remove the subquery (you really don't need it here) and it will be much faster.
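Something along these lines (a sketch: the CASE becomes a join condition, and I'm guessing the AssignmentType join column, which the question omits):

```sql
SELECT
    `Assignment`.`Id`,
    `Assignment`.`AssignmentNumber`         AS COL_5_72,
    `Assignment`.`AssignmentNumberExternal` AS COL_5_73,
    `AssignmentType`.`Name`                 AS COL_5_74,
    `Assignment`.`DateStart`                AS COL_5_75,
    `Assignment`.`DateEnd`                  AS COL_5_76,
    `Assignment`.`DateDeadline`             AS COL_5_77
FROM `Assignment`
JOIN `AssignmentType`
    ON `AssignmentType`.`Id` = `Assignment`.`AssignmentTypeId`  -- guessed join condition
JOIN `ExtraFields` AS `ExtraField`
    ON `ExtraField`.`assignment_id` = `Assignment`.`Id`
   AND `ExtraField`.`name` = 'WorkDistrict'   -- filter in the join, not in a CASE
WHERE `Assignment`.`Deleted` IS NULL
  AND `ExtraField`.`value` LIKE '%Amsterdam East%'
ORDER BY COL_5_72 ASC, COL_5_73 ASC;
```

Because the primary key (assignment_id, name) guarantees at most one WorkDistrict row per assignment, no GROUP BY is needed.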
I want to get the SUM of quantities (qty) of an item, grouped by month, but the query takes too long (15-20 seconds) to fetch.
-Total rows: 1495873
-Total Fetched rows: 9 - 12
The relation between the two tables (invoice_header and invoice_detail) is one-to-many: invoice_header is the header of an invoice, with only totals, and is linked to invoice_detail using the location ID (loc_id) and invoice number (invo_no), since each location has its own serial numbering. invoice_detail contains the lines of each invoice.
Is there a better way to enhance the performance of this query? Here it is:
SELECT SUM(invoice_detail.qty) AS qty, Month(invoice_header.date) AS month
FROM invoice_detail
JOIN invoice_header ON invoice_detail.invo_no = invoice_header.invo_no
AND invoice_detail.loc_id = invoice_header.loc_id
WHERE invoice_detail.item_id = {$itemId}
GROUP BY Month(invoice_header.date)
ORDER BY Month(invoice_header.date)
EXPLAIN:
invoice_header table structure:
CREATE TABLE `invoice_header` (
`invo_type` varchar(1) NOT NULL,
`invo_no` int(20) NOT NULL AUTO_INCREMENT,
`invo_code` varchar(50) NOT NULL,
`date` date NOT NULL,
`time` time NOT NULL,
`cust_id` int(11) NOT NULL,
`loc_id` int(3) NOT NULL,
`cash_man_id` int(11) NOT NULL,
`sales_man_id` int(11) NOT NULL,
`ref_invo_no` int(20) NOT NULL,
`total_amount` decimal(19,2) NOT NULL,
`tax` decimal(19,2) NOT NULL,
`discount_amount` decimal(19,2) NOT NULL,
`net_value` decimal(19,2) NOT NULL,
`split` decimal(19,2) NOT NULL,
`qty` int(11) NOT NULL,
`payment_type_id` varchar(20) NOT NULL,
`comments` varchar(255) NOT NULL,
PRIMARY KEY (`invo_no`,`loc_id`)
) ENGINE=InnoDB AUTO_INCREMENT=20286 DEFAULT CHARSET=utf8
invoice_detail table structure:
CREATE TABLE `invoice_detail` (
`invo_no` int(11) NOT NULL,
`loc_id` int(3) NOT NULL,
`serial` int(11) NOT NULL,
`item_id` varchar(11) NOT NULL,
`size_id` int(5) NOT NULL,
`qty` int(11) NOT NULL,
`rtp` decimal(19,2) NOT NULL,
`type` tinyint(1) NOT NULL,
PRIMARY KEY (`invo_no`,`loc_id`,`serial`),
KEY `item_id` (`item_id`),
KEY `size_id` (`size_id`),
KEY `invo_no` (`invo_no`),
KEY `serial` (`serial`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
How long does the following SQL take?
SELECT count(*)
FROM invoice_detail
WHERE invoice_detail.item_id = {$itemId}
If this SQL takes 15-20 seconds, you should add an index on the item_id field of the invoice_detail table.
The invoice_header table already has a primary key on the join columns invo_no and loc_id, so you don't need to add other indexes on invoice_header. But if you add an index on invo_no, loc_id and date, that may improve performance a little by allowing an index-only scan.
Judging by the EXPLAIN result, invoice_header cannot use any composite index.
You could create index on field invo_no on invoice_header Table.
Try creating a few indexes on some columns: invo_no, loc_id, item_id, date.
Based on your EXPLAIN, your query is scanning all rows in invoice_header. Index the columns that make the join: invo_no, loc_id.
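Putting the suggestions above together, the candidate indexes would be (the index names are illustrative):

```sql
-- Covers the WHERE item_id = ? filter plus the join columns and qty,
-- so the detail side can be answered from the index alone
ALTER TABLE invoice_detail ADD INDEX idx_item_join (item_id, invo_no, loc_id, qty);

-- Extends the (invo_no, loc_id) primary key with date, so the join probe
-- and the Month(date) grouping can also be index-only
ALTER TABLE invoice_header ADD INDEX idx_join_date (invo_no, loc_id, date);
```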
I have 2 tables that I want to join: one is rooms and the other is reservations.
Basically, I want to search for rooms which are not reserved (not in the reservation table) and get the details of those rooms from the room table.
Here are my tables structure:
CREATE TABLE `room` (
`roomID` int(11) NOT NULL AUTO_INCREMENT,
`hotelID` int(11) NOT NULL,
`roomtypeID` int(11) NOT NULL,
`roomNumber` int(11) NOT NULL,
`roomName` varchar(255) NOT NULL,
`roomName_en` varchar(255) NOT NULL,
`roomDescription` text,
`roomDescription_en` text,
`roomSorder` int(11) NOT NULL,
`roomVisible` tinyint(4) NOT NULL,
PRIMARY KEY (`roomID`)
) ENGINE=InnoDB AUTO_INCREMENT=29 DEFAULT CHARSET=utf8;
CREATE TABLE `reservation` (
`reservationID` int(11) NOT NULL AUTO_INCREMENT,
`customerID` int(11) NOT NULL,
`hotelID` int(11) NOT NULL,
`reservationCreatedOn` datetime NOT NULL,
`reservationCreatedFromIp` varchar(255) CHARACTER SET greek NOT NULL,
`reservationNumberOfAdults` tinyint(4) NOT NULL,
`reservationNumberOfChildrens` tinyint(4) NOT NULL,
`reservationArrivalDate` date NOT NULL,
`reservationDepartureDate` date NOT NULL,
`reservationCustomerComment` text CHARACTER SET greek,
PRIMARY KEY (`reservationID`)
) ENGINE=InnoDB AUTO_INCREMENT=47 DEFAULT CHARSET=utf8;
CREATE TABLE `reservationroom` (
`reservationroomID` int(11) NOT NULL AUTO_INCREMENT,
`reservationID` int(11) NOT NULL,
`hotelID` int(11) NOT NULL,
`roomID` int(11) NOT NULL,
PRIMARY KEY (`reservationroomID`)
) ENGINE=InnoDB AUTO_INCREMENT=47 DEFAULT CHARSET=utf8;
Here is the query that I have right now, which gives me wrong results:
SELECT * FROM room r
LEFT JOIN reservation re
ON r.hotelID = re.hotelID
WHERE re.hotelID = 13
AND NOT
(re.reservationArrivalDate >= '2014-07-07' AND re.reservationDepartureDate <= '2014-07-13')
I also have created a fiddle, with the data from both tables included:
http://sqlfiddle.com/#!2/4bb9ea/1
Any help will be deeply appreciated
Regards, John
I agree that the room number was missed,
but the query template should look like this:
SELECT
*
FROM
room r
LEFT JOIN reservation re
ON r.hotelID = re.hotelID
WHERE r.hotelID = 2
AND NOT (
re.hotelID IS NOT NULL
AND re.reservationArrivalDate >= '2014-07-07'
AND re.reservationDepartureDate <= '2014-09-23'
) ;
You need to change the table in the WHERE clause from reservation to room. You also need to add re.hotelID to the NOT condition, because you have to check that the joined record is not NULL and only then compare the dates.
Given the newly-added reservationroom table, consider using a NOT EXISTS sub-query to find rooms without reservations:
SELECT
*
FROM
room r
WHERE NOT EXISTS
(SELECT
*
FROM
reservationroom rr
WHERE
rr.roomID = r.roomID
)