JOIN query is far too slow. Won't use INDEX? - mysql

I have a transient table that I temporarily fill with some values before querying it and destroying it.
CREATE TABLE SearchListA (
    `pTime` int unsigned NOT NULL,
    `STD` double unsigned NOT NULL,
    `STD_Pos` int unsigned NOT NULL,
    `SearchEnd` int unsigned NOT NULL,
    UNIQUE INDEX (`pTime`, `STD` ASC) USING BTREE
) ENGINE = MEMORY;
It looks as such:
+------------+------------+---------+------------+
| pTime      | STD        | STD_Pos | SearchEnd  |
+------------+------------+---------+------------+
| 1105715400 | 1.58474499 |       0 | 1105723200 |
| 1106297700 | 2.5997839  |       0 | 1106544000 |
| 1107440400 | 2.04860375 |       0 | 1107440700 |
| 1107440700 | 1.58864998 |       0 | 1107467400 |
| 1107467400 | 1.55207218 |       0 | 1107790500 |
| 1107790500 | 2.04239417 |       0 | 1108022100 |
| 1108022100 | 1.61385678 |       0 | 1108128000 |
| 1108771500 | 1.58835083 |       0 | 1108771800 |
| 1108771800 | 1.65734727 |       0 | 1108772100 |
| 1108772100 | 2.09378189 |       0 | 1109027700 |
+------------+------------+---------+------------+
Only columns pTime and SearchEnd are relevant to my problem.
My intention is to use this table to speed up searching through a much larger, static table.
The first column, pTime, is where the search should start.
The fourth column, SearchEnd, is where the search should end.
The larger table is similar; it looks like this:
CREATE TABLE `b50d1_abs` (
`pTime` int(10) unsigned NOT NULL,
`Slope` double NOT NULL,
`STD` double NOT NULL,
`Slope_Pos` int(11) NOT NULL,
`STD_Pos` int(11) NOT NULL,
PRIMARY KEY (`pTime`),
KEY `Slope` (`Slope`) USING BTREE,
KEY `STD` (`STD`),
KEY `ID1` (`pTime`,`STD`) USING BTREE
) ENGINE=MyISAM DEFAULT CHARSET=latin1 MIN_ROWS=339331 MAX_ROWS=539331 PACK_KEYS=1 ROW_FORMAT=FIXED;
+------------+-------------+------------+-----------+---------+
| pTime      | Slope       | STD        | Slope_Pos | STD_Pos |
+------------+-------------+------------+-----------+---------+
| 1107309300 |  1.63257919 | 1.39241698 |         0 |       1 |
| 1107314400 |   6.8959276 | 0.22425643 |         1 |       1 |
| 1107323100 | 18.19909502 | 1.46854808 |         1 |       0 |
| 1107335400 |  2.50135747 |  0.4736305 |         0 |       0 |
| 1107362100 |  4.28778281 | 0.85576985 |         0 |       1 |
| 1107363300 |  6.96289593 | 1.41299044 |         0 |       0 |
| 1107363900 |  8.10316742 |  0.2859726 |         0 |       0 |
| 1107367500 | 16.62443439 | 0.61587645 |         0 |       0 |
| 1107368400 | 19.37918552 | 1.18746968 |         0 |       0 |
| 1107369300 | 21.94570136 | 0.94261744 |         0 |       0 |
| 1107371400 | 25.85701357 |  0.2741292 |         0 |       1 |
| 1107375300 | 21.98914027 | 1.59521158 |         0 |       1 |
| 1107375600 | 20.80542986 | 1.59231289 |         0 |       1 |
| 1107375900 | 19.62714932 | 1.50661679 |         0 |       1 |
| 1107381900 |  8.23167421 | 0.98048205 |         1 |       1 |
| 1107383400 | 10.68778281 | 1.41607579 |         1 |       0 |
+------------+-------------+------------+-----------+---------+
...etc (439340 rows)
Here, the columns pTime, STD, and STD_Pos are relevant to my problem.
For every element in the smaller table (SearchListA), I need to search the specified range within the larger table (b50d1_abs) and return the row with the lowest b50d1_abs.pTime that is higher than the current SearchListA.pTime and that also matches the following conditions:
SearchListA.STD < b50d1_abs.STD AND SearchListA.STD_Pos <> b50d1_abs.STD_Pos
AND
b50d1_abs.pTime < SearchListA.SearchEnd
The latter condition is simply to reduce the length of the search.
This seems to me like a pretty straightforward query that should be able to use indexes, especially since all values are unsigned numbers, but I cannot get it to execute nearly fast enough! I think it is because it re-scans the entire table for every row instead of just skipping to the relevant values.
I would be extremely grateful if someone takes a look at my code and figures out a more efficient way to go about this:
SELECT
    m.pTime AS OpenTime,
    m.STD,
    m.STD_Pos,
    mu.pTime AS CloseTime
FROM
    SearchListA m
JOIN b50d1_abs mu ON mu.pTime = (
    SELECT
        md.pTime
    FROM
        b50d1_abs AS md
    WHERE
        md.pTime > m.pTime
        AND md.pTime <= m.SearchEnd
        AND m.STD < md.STD AND m.STD_Pos <> md.STD_Pos
    LIMIT 1
);
Here is my EXPLAIN EXTENDED output:
+----+--------------------+-------+--------+-----------------+---------+---------+------+--------+----------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+--------------------+-------+--------+-----------------+---------+---------+------+--------+----------+--------------------------+
| 1 | PRIMARY | m | ALL | NULL | NULL | NULL | NULL | 365 | 100.00 | |
| 1 | PRIMARY | mu | eq_ref | PRIMARY,ID1 | PRIMARY | 4 | func | 1 | 100.00 | Using where; Using index |
| 2 | DEPENDENT SUBQUERY | md | ALL | PRIMARY,STD,ID1 | NULL | NULL | NULL | 439340 | 100.00 | Using where |
+----+--------------------+-------+--------+-----------------+---------+---------+------+--------+----------+--------------------------+
It looks like the lengthiest query (#2) doesn't use indexes at all!
If I try FORCE INDEX then it will list it under possible_keys, but still show NULL under key, and the query still takes an extremely long time (over 80 seconds).
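(For reference, the hint goes in the subquery's FROM clause, e.g. FROM b50d1_abs AS md FORCE INDEX (ID1), with ID1 being one of the candidate indexes.)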
I need to get this query under 10 seconds, and even 10 is too long.

Your subquery is a dependent subquery, so the best case is that it's going to be evaluated once for every row in table m. Since m contains few rows, that would be OK.
But if you put that subquery in a JOIN condition, it is going to be executed (rows in m)*(rows in mu) times, no matter what.
Note that your results may be incorrect, since you ask to
return the row with the lowest b50d1_abs.pTime
but you don't specify that anywhere: the subquery has no ORDER BY, so LIMIT 1 returns an arbitrary matching row.
Try this query:
SELECT
    m.pTime AS OpenTime,
    m.STD,
    m.STD_Pos,
    (
        SELECT MIN(big.pTime)
        FROM b50d1_abs AS big
        WHERE big.pTime > m.pTime
          AND big.pTime <= m.SearchEnd
          AND m.STD < big.STD AND m.STD_Pos <> big.STD_Pos
    ) AS CloseTime
FROM SearchListA m
Or this one:
SELECT
    m.pTime AS OpenTime,
    m.STD,
    m.STD_Pos,
    MIN(big.pTime) AS CloseTime
FROM
    SearchListA m
JOIN b50d1_abs AS big ON (
    big.pTime > m.pTime
    AND big.pTime <= m.SearchEnd
    AND m.STD < big.STD AND m.STD_Pos <> big.STD_Pos
)
GROUP BY m.pTime
(if you also want rows where the search was unsuccessful, make that a LEFT JOIN).
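For illustration, that LEFT JOIN variant would look like this (a sketch; CloseTime comes back NULL when no match exists):
SELECT
    m.pTime AS OpenTime,
    m.STD,
    m.STD_Pos,
    MIN(big.pTime) AS CloseTime
FROM
    SearchListA m
LEFT JOIN b50d1_abs AS big ON (
    big.pTime > m.pTime
    AND big.pTime <= m.SearchEnd
    AND m.STD < big.STD AND m.STD_Pos <> big.STD_Pos
)
GROUP BY m.pTime
And a third variant, using ORDER BY ... LIMIT 1 instead of MIN():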
SELECT
    m.pTime AS OpenTime,
    m.STD,
    m.STD_Pos,
    (
        SELECT big.pTime
        FROM b50d1_abs AS big
        WHERE big.pTime > m.pTime
          AND big.pTime <= m.SearchEnd
          AND m.STD < big.STD AND m.STD_Pos <> big.STD_Pos
        ORDER BY big.pTime LIMIT 1
    ) AS CloseTime
FROM SearchListA m
(Try an index on b50d1_abs(pTime, STD, STD_Pos).)
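In DDL form that would be something like this sketch (the name ID2 is arbitrary, though it matches the EXPLAIN output further down):
ALTER TABLE b50d1_abs ADD INDEX ID2 (pTime, STD, STD_Pos);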
FYI here are some tests using Postgres on a test data set that should look like yours (maybe remotely, lol)
CREATE TABLE small (
pTime INT PRIMARY KEY,
STD FLOAT NOT NULL,
STD_POS BOOL NOT NULL,
SearchEnd INT NOT NULL
);
CREATE TABLE big(
pTime INTEGER PRIMARY KEY,
Slope FLOAT NOT NULL,
STD FLOAT NOT NULL,
Slope_Pos BOOL NOT NULL,
STD_POS BOOL NOT NULL
);
INSERT INTO small SELECT
n*100000,
random(),
random()<0.1,
n*100000+random()*50000
FROM generate_series( 1, 365 ) n;
INSERT INTO big SELECT
n*100,
random(),
random(),
random() > 0.5,
random() > 0.5
FROM generate_series( 1, 500000 ) n;
Query 1 : 6.90 ms (yes milliseconds)
Query 2 : 48.20 ms
Query 3 : 6.46 ms

I'll start a new answer because it's starting to look like a mess ;)
With your data, using MySQL 5.1.41, I get:
Query 1 : takes forever, Ctrl-C
Query 2 : 520 ms
Query 3 : takes forever, Ctrl-C
EXPLAIN for query 2 looks good:
+----+-------------+-------+------+---------------------+------+---------+------+--------+------------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+------+---------------------+------+---------+------+--------+------------------------------------------------+
| 1 | SIMPLE | m | ALL | PRIMARY,STD,ID1,ID2 | NULL | NULL | NULL | 743 | Using temporary; Using filesort |
| 1 | SIMPLE | big | ALL | PRIMARY,ID1,ID2 | NULL | NULL | NULL | 439340 | Range checked for each record (index map: 0x7) |
+----+-------------+-------+------+---------------------+------+---------+------+--------+------------------------------------------------+
So, I loaded your data into Postgres...
Query 1 : 14.8 ms
Query 2 : 100 ms
Query 3 : 14.8 ms (same plan as 1)
In fact rewriting 2 as query 1 (or 3) fixes a little optimizer shortcoming and finds the optimal query plan for this scenario.
Would you recommend using Postgres over MySQL for this scenario?
Speed is extremely important to me.
Well, I don't know why MySQL barfs so much on queries 1 and 3 (which are pretty simple and easy); in fact it should even beat Postgres (using an index-only scan), but apparently not, eh. You should ask a MySQL specialist!
I'm more used to Postgres... got fed up with MySQL a long time ago! If you need complex queries, Postgres usually wins big time (but you'll need to re-learn how to optimize and tune your new database)...

Related

Disable MySQL to convert string to 0 [duplicate]

I came across this automatic typecasting in MySQL from string to integer, which seems weird to me.
mysql> select * from `isps` where `id` = '3ca6fb49-9749-3099-b30d-19ce56349ad6' OR `unique_id` = '3ca6fb49-9749-3099-b30d-19ce56349ad6';
+----+--------------------------------------+---------------+--------------+---------------------+---------------------+
| id | unique_id                            | name          | code         | created_at          | updated_at          |
+----+--------------------------------------+---------------+--------------+---------------------+---------------------+
|  3 | ee8db3be-1bf7-3440-8add-37232cfc4ecb | TTN           | ttn          | 2019-09-26 08:12:14 | 2019-09-26 08:12:14 |
|  7 | 3ca6fb49-9749-3099-b30d-19ce56349ad6 | ONE BROADBAND | onebroadband | 2019-09-26 08:12:14 | 2019-09-26 08:12:14 |
+----+--------------------------------------+---------------+--------------+---------------------+---------------------+
I had not expected the row with id = 3 in the result. Can anyone help with this?
Data types in my database:
id - BIGINT
unique_id - varchar(200)
This happens because MySQL compares a string to a BIGINT numerically: the string is converted to a number, and '3ca6fb49-9749-3099-b30d-19ce56349ad6' converts to 3 (the leading digits, up to the first non-numeric character), which matches id = 3. You can cast id to a string before comparing, so the comparison happens between strings.
select * from `isps` where CAST(`id` AS CHAR) = '3ca6fb49-9749-3099-b30d-19ce56349ad6' OR `unique_id` = '3ca6fb49-9749-3099-b30d-19ce56349ad6';
Note that this will slow down the query significantly, since it won't be able to use the index on the id column.
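For illustration, the implicit conversion can be reproduced directly; MySQL reads the leading digits and stops at the first non-numeric character:
mysql> SELECT '3ca6fb49-9749-3099-b30d-19ce56349ad6' + 0;
-- returns 3, with a "Truncated incorrect DOUBLE value" warning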

MySql - Innodb - Corrupt Index / Foreign Key

I have a friggin bizarre situation going on. One of my nightly queries, which usually takes 5 mins, took >12 hours. Here is the query:
SELECT Z.id,
Z.seoAlias,
GROUP_CONCAT(DISTINCT LOWER(A.include)) AS include,
GROUP_CONCAT(DISTINCT LOWER(A.exclude)) AS exclude
FROM df_productsbystore AS X
INNER JOIN df_product_variants AS Y ON Y.id = X.id_variant
INNER JOIN df_products AS Z ON Z.id = Y.id_product
INNER JOIN df_advertisers AS A ON A.id = X.id_store
WHERE X.isActive > 0
AND Z.id > 60301433
GROUP BY Z.id
ORDER BY Z.id
LIMIT 45000;
I ran an EXPLAIN and got the following:
+----+-------------+-------+--------+------------------------------------------------------------------------------------+-----------+---------+---------------------+------+---------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+------------------------------------------------------------------------------------+-----------+---------+---------------------+------+---------------------------------+
| 1 | SIMPLE | A | ALL | PRIMARY | NULL | NULL | NULL | 365 | Using temporary; Using filesort |
| 1 | SIMPLE | X | ref | UNIQUE_variantAndStore,idx_isActive,idx_store | idx_store | 4 | foenix.A.id | 600 | Using where |
| 1 | SIMPLE | Y | eq_ref | PRIMARY,UNIQUE,idx_prod | PRIMARY | 4 | foenix.X.id_variant | 1 | Using where |
| 1 | SIMPLE | Z | eq_ref | PRIMARY,UNIQUE_prods_seoAlias,idx_brand,idx_gender2,fk_df_products_id_category_idx | PRIMARY | 4 | foenix.Y.id_product | 1 | NULL |
+----+-------------+-------+--------+------------------------------------------------------------------------------------+-----------+---------+---------------------+------+---------------------------------+
Which looked different to my development environment. The df_advertisers section looked fishy to me, so I deleted and recreated the index on the X.id_store column, and now the EXPLAIN looks like this and the query is fast again:
+----+-------------+-------+--------+------------------------------------------------------------------------------------+------------------------+---------+-------------------+---------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+------------------------------------------------------------------------------------+------------------------+---------+-------------------+---------+-------------+
| 1 | SIMPLE | Z | range | PRIMARY,UNIQUE_prods_seoAlias,idx_brand,idx_gender2,fk_df_products_id_category_idx | PRIMARY | 4 | NULL | 2090691 | Using where |
| 1 | SIMPLE | Y | ref | PRIMARY,UNIQUE,idx_prod | UNIQUE | 4 | foenix.Z.id | 1 | Using index |
| 1 | SIMPLE | X | ref | UNIQUE_variantAndStore,idx_isActive,idx_id_store | UNIQUE_variantAndStore | 4 | foenix.Y.id | 1 | Using where |
| 1 | SIMPLE | A | eq_ref | PRIMARY | PRIMARY | 4 | foenix.X.id_store | 1 | NULL |
+----+-------------+-------+--------+------------------------------------------------------------------------------------+------------------------+---------+-------------------+---------+-------------+
It would appear that the index magically disappeared. Can anyone explain how this is possible? Am I meant to run a mysqlcheck command or similar on a regular basis to avoid this kind of thing? I'm stumped!
Thanks
Next time, simply do ANALYZE TABLE df_productsbystore; It will be very fast, and may solve the problem.
ANALYZE recomputes the statistics on which the Optimizer depends for deciding, in this case, which table to start with. In rare situations, the stats get out of date and need a kick in the shin.
Caveat: I am assuming you are using InnoDB on a somewhat recent version. If you are using MyISAM, then ANALYZE is needed more often.
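If you want to see the statistics the optimizer is working from, a quick check is (the Cardinality column is derived from those stats):
SHOW INDEX FROM df_productsbystore;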
Do you really need 45K rows? What will you do with so many?
A way to speed the query up (probably) is to do everything you can with X and Z in a subquery, then JOIN A to do the rest:
SELECT XYZ.id, XYZ.seoAlias,
GROUP_CONCAT(DISTINCT LOWER(A.include)) AS include,
GROUP_CONCAT(DISTINCT LOWER(A.exclude)) AS exclude
FROM
(
SELECT Z.id, Z.seoAlias, X.id_store
FROM df_productsbystore AS X
INNER JOIN df_product_variants AS Y ON Y.id = X.id_variant
INNER JOIN df_products AS Z ON Z.id = Y.id_product
WHERE X.isActive > 0
AND Z.id > 60301433
GROUP BY Z.id -- may not be necessary ??
ORDER BY Z.id
LIMIT 45000
) AS XYZ
INNER JOIN df_advertisers AS A ON A.id = XYZ.id_store
GROUP BY XYZ.id
ORDER BY XYZ.id;
Useful indexes:
Y: INDEX(id_product, id)
X: INDEX(id_variant, isActive, id_store)
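A sketch of those in DDL form (the index names here are my own, hypothetical ones):
ALTER TABLE df_product_variants ADD INDEX idx_prod_id (id_product, id);
ALTER TABLE df_productsbystore ADD INDEX idx_var_act_store (id_variant, isActive, id_store);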
In order to fix the problem I tried removing and recreating the index + FK. This did not solve the problem the first time, while the machine was under load, but it did work the second time, on a quiet machine.
It just feels like MySQL is flaky. I really don't know what else to say.
Thanks for the help though

MySQL Entity Framework Wraps query into sub-select for Order By

We support both MSSQL and MySQL for Entity Framework 6 in an MVC 5 application. The problem I am having: when using the MySQL connector and LINQ, any query with an INNER JOIN and an ORDER BY gets wrapped in a sub-select, with the ORDER BY applied on the outside. This causes a substantial performance impact. It does not happen with the MSSQL connector. Here is an example:
SELECT
`Project3`.*
FROM
(SELECT
`Extent1`.*,
`Extent2`.`Name_First`
FROM
`ResultRecord` AS `Extent1`
LEFT OUTER JOIN `ResultInputEntity` AS `Extent2` ON `Extent1`.`Id` = `Extent2`.`Id`
WHERE
`Extent1`.`DateCreated` <= '4/4/2016 6:29:59 PM'
AND `Extent1`.`DateCreated` >= '12/31/2015 6:30:00 PM'
AND 0000 = `Extent1`.`CustomerId`
AND (`Extent1`.`InUseById` IS NULL OR 0000 = `Extent1`.`InUseById` OR `Extent1`.`LockExpiration` < '4/4/2016 6:29:59 PM')
AND `Extent1`.`DivisionId` IN (0000)
AND `Extent1`.`IsDeleted` != 1
AND EXISTS( SELECT
1 AS `C1`
FROM
`ResultInputEntityIdentification` AS `Extent3`
WHERE
`Extent1`.`Id` = `Extent3`.`InputEntity_Id`
AND 0 = `Extent3`.`Type`
AND '0000' = `Extent3`.`Number`
AND NOT (`Extent3`.`Number` IS NULL)
OR LENGTH(`Extent3`.`Number`) = 0)
AND EXISTS( SELECT
1 AS `C1`
FROM
`ResultRecordAssignment` AS `Extent4`
WHERE
1 = `Extent4`.`AssignmentType`
AND `Extent4`.`AssignmentId` = 0000
OR 2 = `Extent4`.`AssignmentType`
AND `Extent4`.`AssignmentId` = 0000
AND `Extent4`.`ResultRecordId` = `Extent1`.`Id`)) AS `Project3`
ORDER BY `Project3`.`DateCreated` ASC , `Project3`.`Name_First` ASC , `Project3`.`Id` ASC
LIMIT 0 , 25
This query simply times out when run against a few million rows. This is the EXPLAIN for the above query:
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | extra |
| 1 | PRIMARY | Extent1 | ref | IX_ResultRecord_CustomerId,IX_ResultRecord_DateCreated,IX_ResultRecord_IsDeleted,IX_ResultRecord_InUseById,IX_ResultRecord_LockExpiration,IX_ResultRecord_DivisionId | IX_ResultRecord_CustomerId | 4 | const | 1 | Using where; Using temporary; Using filesort |
| 1 | PRIMARY | Extent2 | ref | PRIMARY | PRIMARY | 8 | Extent1.Id | 1 | |
| 4 | DEPENDENT SUBQUERY | Extent4 | ref | IX_RA_AT,IX_RA_A_ID,IX_RA_RR_ID | IX_RA_A_ID | 5 | const | 1 | Using where |
| 3 | DEPENDENT SUBQUERY | Extent3 | ALL | IX_InputEntity_Id,IX_InputEntityIdentification_Type,IX_InputEntityIdentification_Number | | | | 14341877 | Using where |
Now, if we write the query as it would be generated for MSSQL, i.e. we simply get rid of the sub-select and ORDER BY directly, the improvement is dramatic!
SELECT
`Extent1`.*,
`Extent2`.`Name_First`
FROM
`ResultRecord` AS `Extent1`
LEFT OUTER JOIN `ResultInputEntity` AS `Extent2` ON `Extent1`.`Id` = `Extent2`.`Id`
WHERE
`Extent1`.`DateCreated` <= '4/4/2016 6:29:59 PM'
AND `Extent1`.`DateCreated` >= '12/31/2015 6:30:00 PM'
AND 0000 = `Extent1`.`CustomerId`
AND (`Extent1`.`InUseById` IS NULL
OR 0000 = `Extent1`.`InUseById`
OR `Extent1`.`LockExpiration` < '4/4/2016 6:29:59 PM')
AND `Extent1`.`DivisionId` IN (0000)
AND `Extent1`.`IsDeleted` != 1
AND EXISTS( SELECT
1 AS `C1`
FROM
`ResultInputEntityIdentification` AS `Extent3`
WHERE
`Extent1`.`Id` = `Extent3`.`InputEntity_Id`
AND 9 = `Extent3`.`Type`
AND '0000' = `Extent3`.`Number`
AND NOT (`Extent3`.`Number` IS NULL)
OR LENGTH(`Extent3`.`Number`) = 0)
AND EXISTS( SELECT
1 AS `C1`
FROM
`ResultRecordAssignment` AS `Extent4`
WHERE
1 = `Extent4`.`AssignmentType`
AND `Extent4`.`AssignmentId` = 0000
OR 2 = `Extent4`.`AssignmentType`
AND `Extent4`.`AssignmentId` = 0000
AND `Extent4`.`ResultRecordId` = `Extent1`.`Id`)
ORDER BY `Extent1`.`DateCreated` ASC , `Extent2`.`Name_First` ASC , `Extent1`.`Id` ASC
LIMIT 0 , 25
This query now runs in 0.10 seconds! And the explain plan is now this:
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | extra |
| 1 | PRIMARY | <subquery2> | ALL | distinct_key | | | | 1 | Using temporary; Using filesort |
| 1 | PRIMARY | Extent1 | ref | PRIMARY,IX_ResultRecord_CustomerId,IX_ResultRecord_DateCreated,IX_ResultRecord_IsDeleted,IX_ResultRecord_InUseById,IX_ResultRecord_LockExpiration,IX_ResultRecord_DivisionId | PRIMARY | 8 | Extent3.InputEntity_Id | 1 | Using where |
| 1 | PRIMARY | Extent4 | ref | IX_RA_AT,IX_RA_A_ID,IX_RA_RR_ID | IX_RA_RR_ID | 8 | Extent3.InputEntity_Id | 1 | Using where; Start temporary; End temporary |
| 1 | PRIMARY | Extent2 | ref | PRIMARY | PRIMARY | 8 | Extent3.InputEntity_Id | 1 | |
| 2 | MATERIALIZED | Extent3 | ref | IX_InputEntity_Id,IX_InputEntityIdentification_Type,IX_InputEntityIdentification_Number | IX_InputEntityIdentification_Type | 4 | const | 1 | Using where |
Now, I have had this issue many times across the system, and it is clear that the MySQL EF 6 connector decides to always wrap queries in a sub-select to apply the ORDER BY, but only when there is a join in the query. This is causing major performance issues. Some answers I have seen suggest modifying the connector source code, but that can be tedious. Has anyone had this same issue, found a workaround, or already modified the connector? Any suggestion is welcome besides simply moving to SQL Server and leaving MySQL behind, as that is not an option.
Did you look at the SQL generated for SQL Server? Is it structured differently, or is only the performance different?
Usually it is not the provider that decides the structure of the query (i.e. whether to order a subquery). The provider just translates the structure of the query into the syntax of the DBMS. So in your case the problem could be the DBMS optimizer.
In issues similar to yours I used a different approach, based on mapping a query to entities, i.e. using ObjectContext.ExecuteStoreQuery.
It turns out that in order to work around this with the MySQL driver, your entire lambda must be written in one go, meaning in ONE Where(..) predicate. That way the driver knows it is all one result set. If you instead build an initial IQueryable and then keep appending Where clauses that access child tables, it will believe there are multiple result sets and therefore wrap your entire query in a sub-select in order to sort and limit it.

Optimization of Mysql Join Statement Queries

SELECT GV.ID
, XS.SymbolId
, GS.ID
, XS.SymbolExchangeId
, XS.IssueId
, GE.ID
, XD.ACTIVE
, XD.ExchangeId
FROM TB_GDS_SECURITY GS
, WSOD_Vanilla.WSOD_XrefIssueSymbols XS
, WSOD_Vanilla.WSOD_XrefIssueData XD
, TB_GDS_EXCHANGE GE
, TB_GDS_VENDOR GV
WHERE XD.CompositeIssueID = GS.ISSUE_ID_TEMP
AND XD.IssueId = XS.IssueId
AND XD.ACTIVE = 'True'
AND GV.VENDOR_SHORT_NAME = XS.SymbolsetId
AND GE.EXCHANGE_SHORT_NAME = XD.ExchangeId
AND NOT EXISTS (SELECT ID
FROM TB_GDS_VENDOR_VENUE_SYMBOL VVS
WHERE VVS.ISSUE_ID = XS.IssueId
AND (SELECT ID
FROM TB_GDS_VENDOR GV
WHERE GV.VENDOR_SHORT_NAME = XS.SymbolsetId
)
);
This particular query takes more than a minute; I want it done in less time.
The EXPLAIN output is as follows:
mysql> EXPLAIN SELECT GV.ID, XS.SymbolId, GS.ID, XS.SymbolExchangeId, XS.IssueId, GE.ID, XD.ACTIVE, XD.ExchangeId FROM TB_GDS_SECURITY GS, WSOD_Vanilla.WSOD_XrefIssueSymbols XS, WSOD_Vanilla.WSOD_XrefIssueData XD, TB_GDS_EXCHANGE GE, TB_GDS_VENDOR GV WHERE XD.CompositeIssueID = GS.ISSUE_ID_TEMP AND XD.IssueId = XS.IssueId AND XD.ACTIVE = 'True' AND GV.VENDOR_SHORT_NAME=XS.SymbolsetId AND GE.EXCHANGE_SHORT_NAME = XD.ExchangeId AND NOT exists (SELECT ID FROM TB_GDS_VENDOR_VENUE_SYMBOL VVS
-> WHERE VVS.ISSUE_ID = XS.IssueId AND (SELECT ID FROM TB_GDS_VENDOR GV WHERE GV.VENDOR_SHORT_NAME = XS.SymbolsetId));
+----+--------------------+-------+------+--------------------------------------------------------+-----------------------------+---------+----------------------------------+---------+------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+--------------------+-------+------+--------------------------------------------------------+-----------------------------+---------+----------------------------------+---------+------------------------------------+
| 1 | PRIMARY | XD | ref | WSOD_XrefIssueData_UINDX1,WSOD_XrefIssueData_INDX1 | WSOD_XrefIssueData_INDX1 | 21 | const | 3981788 | Using index condition; Using where |
| 1 | PRIMARY | GE | ref | TB_GDS_EXCHANGE_INDX1 | TB_GDS_EXCHANGE_INDX1 | 63 | WSOD_Vanilla.XD.ExchangeId | 1 | Using where; Using index |
| 1 | PRIMARY | GS | ref | TB_GDS_SECURITY_INDX1 | TB_GDS_SECURITY_INDX1 | 5 | WSOD_Vanilla.XD.CompositeIssueID | 1 | Using where; Using index |
| 1 | PRIMARY | XS | ref | PRIMARY | PRIMARY | 8 | WSOD_Vanilla.XD.IssueId | 1 | Using where |
| 1 | PRIMARY | GV | ref | TB_GDS_VENDOR_INDX1 | TB_GDS_VENDOR_INDX1 | 62 | WSOD_Vanilla.XS.SymbolsetId | 1 | Using where; Using index |
| 2 | DEPENDENT SUBQUERY | VVS | ref | TB_GDS_VVENUE_SYMBOL_UINDX1,TB_GDS_VVENUE_SYMBOL_INDX2 | TB_GDS_VVENUE_SYMBOL_UINDX1 | 5 | WSOD_Vanilla.XS.IssueId | 3 | Using where; Using index |
| 3 | DEPENDENT SUBQUERY | GV | ref | TB_GDS_VENDOR_INDX1 | TB_GDS_VENDOR_INDX1 | 62 | WSOD_Vanilla.XS.SymbolsetId | 1 | Using where; Using index |
+----+--------------------+-------+------+--------------------------------------------------------+-----------------------------+---------+----------------------------------+---------+------------------------------------+
7 rows in set (0.01 sec)
Even after optimizing, it still takes around 1.35 minutes to execute.
Please help me out.
As per the comment elsewhere, you've not shown us the index definitions.
Most likely the big cost here is the NOT EXISTS sub-clause. MySQL does not handle predicate pushdown well, and you've also got a cartesian product in there. I'm struggling to understand what this actually does. Neither MySQL nor any other DBMS I'm familiar with can effectively use an index for NOT EXISTS, <>, or NOT LIKE. And the nested subquery does not appear to serve any function, since you already have an explicit join between these tables in the outer query:
WHERE...AND GV.VENDOR_SHORT_NAME = XS.SymbolsetId...
AND NOT EXISTS ( [a condition] AND
SELECT ID FROM TB_GDS_VENDOR GV
WHERE GV.VENDOR_SHORT_NAME = XS.SymbolsetId
Your only filtering apart from the joins and this ugly NOT EXISTS is "AND XD.ACTIVE = 'True'" - if this column only holds two or three distinct values, using an index on it will likely be very inefficient.
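For illustration only: if that nested vendor lookup really is redundant, the anti-join would reduce to a sketch like this, which at least gives the optimizer a plain, indexable correlation on VVS.ISSUE_ID:
AND NOT EXISTS (SELECT 1
                FROM TB_GDS_VENDOR_VENUE_SYMBOL VVS
                WHERE VVS.ISSUE_ID = XS.IssueId)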

SQL algorithm to as near to linear time as possible and tweaking of select statement

I am using MySQL version 5.5 on Ubuntu.
My database tables are setup as follows:
DDLs:
CREATE TABLE `asx` (
    `code` char(3) NOT NULL,
    `high` decimal(9,3),
    `low` decimal(9,3),
    `close` decimal(9,3),
    `histID` int(11) NOT NULL AUTO_INCREMENT,
    PRIMARY KEY (`histID`),
    UNIQUE KEY `code` (`code`)
);
CREATE TABLE `asxhist` (
    `date` date NOT NULL,
    `average` decimal(9,3),
    `histID` int(11) NOT NULL,
    PRIMARY KEY (`date`,`histID`),
    KEY `histID` (`histID`),
    CONSTRAINT `asxhist_ibfk_1` FOREIGN KEY (`histID`) REFERENCES `asx` (`histID`)
        ON UPDATE CASCADE
);
t1:
| code | high   | low    | close  | histID (primary key) |
| asx  | 10.000 | 9.500  | 9.800  | 1                    |
| nab  | 42.000 | 41.250 | 41.350 | 2                    |
t2:
| date       | average | histID (foreign key) |
| 2013-01-01 | 10.000  | 1                    |
| 2013-01-01 | 39.000  | 2                    |
| 2013-01-02 | 9.000   | 1                    |
| 2013-01-02 | 38.000  | 2                    |
| 2013-01-03 | 9.500   | 1                    |
| 2013-01-03 | 39.500  | 2                    |
| 2013-01-04 | 11.000  | 1                    |
| 2013-01-04 | 38.500  | 2                    |
I am attempting to complete a select query that produces this as a result:
| code | high   | low    | close  | asxhist.average |
| asx  | 10.000 | 9.500  | 9.800  | 11.000,9.500    |
| nab  | 42.000 | 41.250 | 41.350 | 38.500,39.500   |
Here, the most recent values from table 2 are returned alongside each row of table 1, in CSV format.
I have managed to get this far:
SELECT code, high, low, close,
(SELECT GROUP_CONCAT(DISTINCT t2.average ORDER BY date DESC SEPARATOR ',') FROM t2
WHERE t2.histID = t1.histID)
FROM t1;
Unfortunately this returns all values associated with the histID. I'm taking a look at xaprb.com's firstleastmax-row-per-group-in-sql solution, but I have been banging my head all day and the slight wooziness seems to be dimming my ability to comprehend how I should use it to my benefit. How can I limit the results to the 5 most recent values and, considering the tables will eventually be megabytes in size, try to remain in O(n²) or less? (Or can I?)
A temporary workaround using SUBSTRING_INDEX, not a feasible solution for huge data:
SELECT code, high, low, close,
    (SELECT SUBSTRING_INDEX(GROUP_CONCAT(asxhist.average ORDER BY date DESC), ',', 3)
     FROM asxhist
     WHERE asxhist.histID = asx.histID)
FROM asx;
(Note that the ORDER BY has to live inside GROUP_CONCAT to control the concatenation order; a trailing ORDER BY in the one-row aggregate subquery is a no-op.)
From what I gather, a LIMIT option in GROUP_CONCAT is still a feature request.
There is also the GROUP_CONCAT hack discussed on Stack Overflow.
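For completeness, a hedged sketch of a per-group limit that avoids the string hack: a correlated COUNT(*) keeps only the five most recent rows per histID. It runs on MySQL 5.5 against the DDL above, but it is quadratic per group, so it is a sketch rather than the linear-time answer you are after:
SELECT a.code, a.high, a.low, a.close,
       (SELECT GROUP_CONCAT(h.average ORDER BY h.date DESC SEPARATOR ',')
        FROM asxhist h
        WHERE h.histID = a.histID
          -- keep only rows with fewer than 5 newer rows for the same histID
          AND (SELECT COUNT(*)
               FROM asxhist h2
               WHERE h2.histID = h.histID
                 AND h2.date > h.date) < 5) AS recent_averages
FROM asx a;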