I'm having issues with an SQL query that runs extremely slowly against an InnoDB MySQL table. There are only 32,000 rows, and all the conditions in the WHERE clause are indexed and are either bits or bigints. I wouldn't have expected a speed issue at all, but sometimes the queries take two minutes to complete. It's difficult to tell because of caching, but it appears that if I remove every column from the select list except the id, the query executes in milliseconds.
The rows are large because they store email HTML, so a row is often 1 MB. When the query runs, the machine shows heavy hard-drive activity, which appears to be the source of the slowdown. As far as I know, though, the database shouldn't need any additional resources when columns are added to the select list: first it finds the matching rows, then it pulls out just the columns it needs for those rows. Can someone correct me if I'm wrong, or let me know if some other setting could explain this behaviour?
As requested, here is the query:
select aplosemail0_.id as id4353_, aplosemail0_.active as active4353_, aplosemail0_.deletable as deletable4353_,
    aplosemail0_.editable as editable4353_, aplosemail0_.persistentData as persiste5_4353_, aplosemail0_.dateCreated as dateCrea6_4353_,
    aplosemail0_.dateInactivated as dateInac7_4353_, aplosemail0_.dateLastModified as dateLast8_4353_, aplosemail0_.displayId as displayId4353_,
    aplosemail0_.owner_id as owner37_4353_, aplosemail0_.userIdCreated as userIdC10_4353_, aplosemail0_.userIdInactivated as userIdI11_4353_,
    aplosemail0_.userIdLastModified as userIdL12_4353_, aplosemail0_.parentWebsite_id as parentW38_4353_, aplosemail0_.automaticSendDate as automat13_4353_,
    aplosemail0_.emailFrame_id as emailFrame39_4353_, aplosemail0_.emailGenerationType as emailGe14_4353_, aplosemail0_.email_generator_type as email15_4353_,
    aplosemail0_.email_generator_id as email16_4353_, aplosemail0_.emailReadDate as emailRe17_4353_, aplosemail0_.emailSentCount as emailSe18_4353_,
    aplosemail0_.emailSentDate as emailSe19_4353_, aplosemail0_.emailStatus as emailSt20_4353_, aplosemail0_.emailTemplate_id as emailTe40_4353_,
    aplosemail0_.emailType as emailType4353_, aplosemail0_.encryptionSalt as encrypt22_4353_, aplosemail0_.forwardedEmail_id as forward41_4353_,
    aplosemail0_.fromAddress as fromAdd23_4353_, aplosemail0_.hardDeleteDate as hardDel24_4353_, aplosemail0_.htmlBody as htmlBody4353_,
    aplosemail0_.incomingReadRetryCount as incomin26_4353_, aplosemail0_.isAskingForReceipt as isAskin27_4353_, aplosemail0_.isIncomingEmailDeleted as isIncom28_4353_,
    aplosemail0_.isSendingPlainText as isSendi29_4353_, aplosemail0_.isUsingEmailSourceAsOwner as isUsing30_4353_, aplosemail0_.mailServerSettings_id as mailSer42_4353_,
    aplosemail0_.maxSendQuantity as maxSend31_4353_, aplosemail0_.originalEmail_id as origina43_4353_, aplosemail0_.outerEmailFrame_id as outerEm44_4353_,
    aplosemail0_.plainTextBody as plainTe32_4353_, aplosemail0_.removeDuplicateToAddresses as removeD33_4353_, aplosemail0_.repliedEmail_id as replied45_4353_,
    aplosemail0_.sendStartIdx as sendSta34_4353_, aplosemail0_.subject as subject4353_, aplosemail0_.uid as uid4353_
from AplosEmail aplosemail0_
where aplosemail0_.emailType=0
  and aplosemail0_.emailStatus<>2
  and aplosemail0_.mailServerSettings_id=7
  and aplosemail0_.active=1
order by aplosemail0_.id DESC
limit 50
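One way to test whether reading the wide columns (htmlBody, plainTextBody) for examined rows is the bottleneck is a deferred join: pick the 50 winning ids with a narrow subquery, then fetch the wide columns only for those ids. A hedged sketch (hand-written rather than Hibernate-generated, with the column list simplified to ae.*):

select ae.*
from AplosEmail ae
join (
    -- narrow pass: only the indexed columns and the primary key are touched
    select id
    from AplosEmail
    where emailType = 0
      and emailStatus <> 2
      and mailServerSettings_id = 7
      and active = 1
    order by id desc
    limit 50
) winners on winners.id = ae.id
order by ae.id desc;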
I have a few queries that use geospatial conditions. These queries run surprisingly slowly. Initially I thought it was the geospatial calculation itself, but after stripping everything down to just ST_POLYGON(TO_GEOGRAPHY(...)), it is still very slow. That would make sense if each row had its own polygon, but the condition uses a static polygon in the query:
SELECT
ST_POLYGON(TO_GEOGRAPHY('LINESTRING(-95.75122850074004 28.793166796020444,-95.68622920563344 30.207416499279063,-94.5162418937178 32.56537633083211,-90.94128066286225 34.24734047810797,-88.17881062083825 36.812423897251634,-86.13133282498448 38.15341651409619,-85.28634198860107 38.66275098353796,-84.37635185711038 38.789523129087826,-82.84886842210855 38.4848923369382,-82.32887406125734 37.820427257446994,-82.26387476615074 36.96838022284757,-82.03637723327772 36.00158943485101,-80.99638851157454 35.34155096040939,-78.52641529752944 34.62260477275565,-77.51892622337955 34.005211031324734,-78.26641811710381 31.1020568651834,-80.24889661785029 29.926151366059756,-83.59636031583283 28.793166796020444,-95.75122850074004 28.793166796020444)'))
FROM TABLE(GENERATOR(ROWCOUNT=>1000000))
Snowflake should be able to figure out that it only needs to calculate this polygon once for the entire query. Yet the more rows that are added, the slower it gets. On an X-Small warehouse this query takes over a minute, whereas this query:
SELECT
'LINESTRING(-95.75122850074004 28.793166796020444,-95.68622920563344 30.207416499279063,-94.5162418937178 32.56537633083211,-90.94128066286225 34.24734047810797,-88.17881062083825 36.812423897251634,-86.13133282498448 38.15341651409619,-85.28634198860107 38.66275098353796,-84.37635185711038 38.789523129087826,-82.84886842210855 38.4848923369382,-82.32887406125734 37.820427257446994,-82.26387476615074 36.96838022284757,-82.03637723327772 36.00158943485101,-80.99638851157454 35.34155096040939,-78.52641529752944 34.62260477275565,-77.51892622337955 34.005211031324734,-78.26641811710381 31.1020568651834,-80.24889661785029 29.926151366059756,-83.59636031583283 28.793166796020444,-95.75122850074004 28.793166796020444)'
FROM TABLE(GENERATOR(ROWCOUNT=>3000000))
(with 2 million more rows added, to match the byte count)
can complete in 2 seconds.
I tried "precomputing" the polygon myself with a WITH statement but SF figures out the WITH is redundant and drops it. I also tried setting a session variable, but you can't set a complex value like this one as a variable.
I believe this is a bug.
Geospatial functions are in preview for now, and the team is working hard on all kinds of optimizations.
For this case, I want to note that making the polygon a single-row table would help, but I would still expect better performance as the team gets this feature out of beta.
Let me create a one-row table holding the polygon:
create or replace temp table poly1
as
select ST_POLYGON(TO_GEOGRAPHY('LINESTRING(-95.75122850074004 28.793166796020444,-95.68622920563344 30.207416499279063,-94.5162418937178 32.56537633083211,-90.94128066286225 34.24734047810797,-88.17881062083825 36.812423897251634,-86.13133282498448 38.15341651409619,-85.28634198860107 38.66275098353796,-84.37635185711038 38.789523129087826,-82.84886842210855 38.4848923369382,-82.32887406125734 37.820427257446994,-82.26387476615074 36.96838022284757,-82.03637723327772 36.00158943485101,-80.99638851157454 35.34155096040939,-78.52641529752944 34.62260477275565,-77.51892622337955 34.005211031324734,-78.26641811710381 31.1020568651834,-80.24889661785029 29.926151366059756,-83.59636031583283 28.793166796020444,-95.75122850074004 28.793166796020444)'
)) polygon
;
To see if this would help, I tried a one-million-row cross join:
select *
from poly1, TABLE(GENERATOR(ROWCOUNT=>1000000));
It takes 14 seconds, and in the query profiler you can see that most of the time was spent on an internal TO_OBJECT(GET_PATH(POLY1.POLYGON, '_shape')) call.
What's interesting to note is that this operation is mostly concerned with the ASCII representation of the polygon. Running computations over the polygon is much quicker:
select st_area(polygon)
from poly1, TABLE(GENERATOR(ROWCOUNT=>1000000));
This query should have taken longer (finding the area of a polygon sounds more complicated than just selecting it), but it turned out to take only 7 seconds, about half the time.
Thanks for the report, and the team will continue to optimize cases like this.
For anyone curious about the particular polygon in the question: plotted, it's a nice heart.
I have the following query:
SELECT t.res, IF(t.res=0, "zero", "more than zero")
FROM (
SELECT table.*, IF (RAND()<=0.2,1, IF (RAND()<=0.4,2, IF (RAND()<=0.6,3,0))) AS res
FROM table LIMIT 20) t
which returns res values with matching labels: "zero" when res=0 and "more than zero" otherwise.
That's exactly what you would expect. However, as soon as I remove the LIMIT 20, I get highly unexpected results: rows where the res value and its label don't match (more than 20 rows are returned; I cut the output off to make it easier to read):
SELECT t.res, IF(t.res=0, "zero", "more than zero")
FROM (
SELECT table.*, IF (RAND()<=0.2,1, IF (RAND()<=0.4,2, IF (RAND()<=0.6,3,0))) AS res
FROM table) t
Side notes:
I'm using MySQL 5.7.18-15-log, and this is a highly abstracted example (the real query is much more complicated).
I'm trying to understand what is happening. I do not need answers that offer workarounds without any explanation of why the original version is not working. Thank you.
Update:
Instead of using LIMIT, GROUP BY id also works in the first case (see the sketch below).
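For example (a sketch; this assumes the table has an id primary key, which my abstracted example doesn't show):

SELECT t.res, IF(t.res=0, "zero", "more than zero")
FROM (
  SELECT table.*, IF (RAND()<=0.2,1, IF (RAND()<=0.4,2, IF (RAND()<=0.6,3,0))) AS res
  FROM table
  GROUP BY id) t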
Update 2:
As requested by zerkms, I added t.res = 0 and t.res + 1 to the second example.
The problem is caused by a change introduced in MySQL 5.7 to how derived tables in (sub)queries are treated.
Basically, in order to optimize performance, some subqueries are executed at different times and/or multiple times, leading to unexpected results when your subquery returns non-deterministic results (as in my case with RAND()).
There are two easy (and equally ugly) workarounds to get MySQL to "materialize" these subqueries (i.e. return deterministic results): use LIMIT <high number> or GROUP BY id, both of which force MySQL to materialize the subquery and return the expected results.
The last option is to turn off derived_merge in the optimizer_switch variable: derived_merge=off (make sure to leave all the other flags as they are).
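A minimal sketch of the session-scope switch (naming a single flag this way leaves all the other optimizer_switch flags unchanged):

-- turn off derived-table merging for this session only
SET SESSION optimizer_switch = 'derived_merge=off';
-- verify the flag actually changed
SELECT @@SESSION.optimizer_switch;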
Further reading:
https://mysqlserverteam.com/derived-tables-in-mysql-5-7/
Subquery's rand() column re-evaluated for every repeated selection in MySQL 5.7/8.0 vs MySQL 5.6
A very simple question:
I wanted to select all the rows whose keys have a certain prefix in Hive, but somehow it's not working.
The queries I've tried:
select * from solr_json_history where dt='20170814' and hour='2147' and substr(`_root_`,1,9)='P10004232' limit 100;
SELECT * FROM solr_json_history where dt='20170814' and hour='2147' and `_root_` like 'P19746284%' limit 100;
My Hue editor just hangs there without returning anything.
I've checked that there is data in my table for this time range with this query:
select * from solr_json_history where dt='20170814' and hour='2147' limit 15;
It's returning 15 records as expected.
Any help please?
Thanks a lot!
Per musafir-safwan's request, I've added it as an answer here.
UPDATE:
I'm not able to provide sample data, but my problem got resolved.
Thanks for the commenters' attention.
My table does have data, no need to worry about that. Thanks for checking, though.
The problem was due to bad Hue UI behaviour: when I issued the two queries above, they took longer to return than the timeout set on the UI, so the UI simply didn't reply or give a timeout reminder. It just hung there.
Also, those two queries essentially make two RPC calls, so they timed out.
Then I changed to the query below:
select `_root_`,json, count(*) from solr_json_history where dt='20170814' and hour='2147' and substr(`_root_`,1,9)='P19746284' group by `_root_`,json;
The difference is that I added a count(*), which turns the query into a MapReduce job and thus removes the timeout limit; it then returned the result I wanted.
YMMV.
Thanks.
Can anyone please advise me on this error...
The database has 40,000 news stories, but only the field 'story' is large;
'old' is a numeric value, 0 or 1;
'title' and 'shortstory' are very short or NULL.
Any advice appreciated. This is the result of running a search query against the database:
Error: MySQL client ran out of memory
Statement: SELECT news30_access.usehtml, old, title, story, shortstory, news30_access.name AS accessname, news30_users.user AS authorname, timestamp, news30_story.id AS newsid FROM news30_story LEFT JOIN news30_users ON news30_story.author = news30_users.uid LEFT JOIN news30_access ON news30_users.uid = news30_access.uid WHERE title LIKE ? OR story LIKE ? OR shortstory LIKE ? OR news30_users.user LIKE ? ORDER BY timestamp DESC
The simple answer is: don't use story in the SELECT clause.
If you want the story, then limit the number of results being returned. Start with, say, 100 results by adding:
limit 100
to the end of the query. This will get the 100 most recent stories.
I also note that you are using LIKE on story as well as on other string columns. You probably want to be using MATCH with a full-text index. This doesn't solve your immediate problem (which is returning too much data to the client), but it will make your queries run faster.
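A sketch of what that could look like (the index name ft_news is invented here; FULLTEXT on InnoDB needs MySQL 5.6+, and the MATCH column list must exactly match an existing FULLTEXT index):

-- one-time setup: a full-text index over the searched columns
ALTER TABLE news30_story ADD FULLTEXT INDEX ft_news (title, story, shortstory);

-- then search with MATCH ... AGAINST instead of several LIKEs
SELECT news30_story.id, title, timestamp
FROM news30_story
WHERE MATCH (title, story, shortstory) AGAINST ('search terms')
ORDER BY timestamp DESC
LIMIT 100;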
To learn about full text search, start with the documentation.
I'm using the Wordnet SQL database from here: http://wnsqlbuilder.sourceforge.net
It's all built fine, and users with appropriate privileges have been set up.
I'm trying to find synonyms of words and have tried to use the two example statements at the bottom of this page: http://wnsqlbuilder.sourceforge.net/sql-links.html
SELECT synsetid,dest.lemma,SUBSTRING(src.definition FROM 1 FOR 60) FROM wordsXsensesXsynsets AS src INNER JOIN wordsXsensesXsynsets AS dest USING(synsetid) WHERE src.lemma = 'option' AND dest.lemma <> 'option'
SELECT synsetid,lemma,SUBSTRING(definition FROM 1 FOR 60) FROM wordsXsensesXsynsets WHERE synsetid IN ( SELECT synsetid FROM wordsXsensesXsynsets WHERE lemma = 'option') AND lemma <> 'option' ORDER BY synsetid
However, they never complete, at least not in any reasonable amount of time, and I have had to cancel all of the queries. All other queries seem to work fine, and when I break up the second SQL example, I can get the individual parts to work and complete in reasonable times (about 0.40 seconds).
When I try to run the full statement, however, the MySQL command-line client just hangs.
Is there a problem with this syntax? What is causing it to take so long?
EDIT:
Output of "EXPLAIN SELECT ..."
Output of "EXPLAIN EXTENDED ...; SHOW WARNINGS;"
I did more digging into the various statements and found that the problem was in the IN subquery.
MySQL re-evaluates that subquery for every single row in the table. This is the cause of the hang, as it had to run through hundreds of thousands of records.
My remedy was to split the command into two separate database calls: first getting the synset ids, then dynamically creating a bound SQL string to look for the words in those synsets.
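Roughly, the two calls look like this (the ids in the second statement are placeholders for whatever the first call returns):

-- step 1: fetch the synset ids for the word
SELECT synsetid FROM wordsXsensesXsynsets WHERE lemma = 'option';

-- step 2: build the second statement from the ids returned above
-- (102345 and 105678 are hypothetical ids standing in for the real results)
SELECT synsetid, lemma, SUBSTRING(definition FROM 1 FOR 60)
FROM wordsXsensesXsynsets
WHERE synsetid IN (102345, 105678)
  AND lemma <> 'option'
ORDER BY synsetid;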