Snowflake ST_POLYGON(TO_GEOGRAPHY(...)) Is Inefficient

I have a few queries that use geospatial conditions. These queries are running surprisingly slowly. Initially I thought it was the geospatial calculation itself, but after stripping everything down to just ST_POLYGON(TO_GEOGRAPHY(...)), it is still very slow. This would make sense if each row had its own polygon, but the condition uses a static polygon in the query:
SELECT
ST_POLYGON(TO_GEOGRAPHY('LINESTRING(-95.75122850074004 28.793166796020444,-95.68622920563344 30.207416499279063,-94.5162418937178 32.56537633083211,-90.94128066286225 34.24734047810797,-88.17881062083825 36.812423897251634,-86.13133282498448 38.15341651409619,-85.28634198860107 38.66275098353796,-84.37635185711038 38.789523129087826,-82.84886842210855 38.4848923369382,-82.32887406125734 37.820427257446994,-82.26387476615074 36.96838022284757,-82.03637723327772 36.00158943485101,-80.99638851157454 35.34155096040939,-78.52641529752944 34.62260477275565,-77.51892622337955 34.005211031324734,-78.26641811710381 31.1020568651834,-80.24889661785029 29.926151366059756,-83.59636031583283 28.793166796020444,-95.75122850074004 28.793166796020444)'))
FROM TABLE(GENERATOR(ROWCOUNT=>1000000))
Snowflake should be able to figure out that it only needs to calculate this polygon once for the entire query. Yet the more rows that are added, the slower it gets. On an x-small warehouse this query takes over a minute. Whereas this query:
SELECT
'LINESTRING(-95.75122850074004 28.793166796020444,-95.68622920563344 30.207416499279063,-94.5162418937178 32.56537633083211,-90.94128066286225 34.24734047810797,-88.17881062083825 36.812423897251634,-86.13133282498448 38.15341651409619,-85.28634198860107 38.66275098353796,-84.37635185711038 38.789523129087826,-82.84886842210855 38.4848923369382,-82.32887406125734 37.820427257446994,-82.26387476615074 36.96838022284757,-82.03637723327772 36.00158943485101,-80.99638851157454 35.34155096040939,-78.52641529752944 34.62260477275565,-77.51892622337955 34.005211031324734,-78.26641811710381 31.1020568651834,-80.24889661785029 29.926151366059756,-83.59636031583283 28.793166796020444,-95.75122850074004 28.793166796020444)'
FROM TABLE(GENERATOR(ROWCOUNT=>3000000))
(with 2 million more rows added to match the byte count)
can complete in 2 seconds.
I tried "precomputing" the polygon myself with a WITH statement but SF figures out the WITH is redundant and drops it. I also tried setting a session variable, but you can't set a complex value like this one as a variable.
I believe this is a bug.

Geospatial functions are in preview for now, and the team is working hard on all kinds of optimizations.
For this case I want to note that making the polygon a single-row table would help, but I would still expect better performance as the team gets this feature out of preview.
Let me create a table with one row, the polygon:
create or replace temp table poly1
as
select ST_POLYGON(TO_GEOGRAPHY('LINESTRING(-95.75122850074004 28.793166796020444,-95.68622920563344 30.207416499279063,-94.5162418937178 32.56537633083211,-90.94128066286225 34.24734047810797,-88.17881062083825 36.812423897251634,-86.13133282498448 38.15341651409619,-85.28634198860107 38.66275098353796,-84.37635185711038 38.789523129087826,-82.84886842210855 38.4848923369382,-82.32887406125734 37.820427257446994,-82.26387476615074 36.96838022284757,-82.03637723327772 36.00158943485101,-80.99638851157454 35.34155096040939,-78.52641529752944 34.62260477275565,-77.51892622337955 34.005211031324734,-78.26641811710381 31.1020568651834,-80.24889661785029 29.926151366059756,-83.59636031583283 28.793166796020444,-95.75122850074004 28.793166796020444)'
)) polygon
;
To see if this would help, I tried a one-million-row cross join:
select *
from poly1, TABLE(GENERATOR(ROWCOUNT=>1000000));
It takes 14 seconds, and in the query profiler you can see that most of the time was spent on an internal TO_OBJECT(GET_PATH(POLY1.POLYGON, '_shape')).
What's interesting to note is that the previous operation is mostly concerned with the ASCII representation of the polygon. Running operations over this polygon is much quicker:
select st_area(polygon)
from poly1, TABLE(GENERATOR(ROWCOUNT=>1000000));
This query should have taken longer (finding the area of a polygon sounds more complicated than just selecting it), but it turns out it only took 7 seconds (about half).
Thanks for the report, and the team will continue to optimize cases like this.
For anyone curious about the particular polygon in the question: it's a nice heart.

Related

Possible Bug in the Select InterfaceDist Command

I am new to using SQL. I was wondering whether there could be a bug in this program.
/* Insert to interface table all atoms that have diffASA>0*/
insert into NinterfaceAtom(PDB,Chain,Residue,ResId,Symbol,atom,diffASA)
select PDB,Chain,Residue,ResId,Symbol,Atom,max(ASA)-min(ASA) from perAtomASA
group by PDB,Chain,Residue,ResId,Symbol,Atom
having stddev(ASA)>0;
/* Insert to interface table all atoms that have enough distance */
insert ignore into NinterfaceAtoms (PDB,Chain,Residue,ResId,Symbol,atom)
select asa.PDB,asa.Chain,asa.Residue,asa.ResId,asa.Symbol,dist.Atom from interfaceDist dist
inner join
perAtomASA asa
on
dist.PDB=asa.PDB and
dist.Chain=asa.Chain and
dist.ResId=asa.ResId and
dist.Symbol=asa.Symbol and
Seperated=0
I am just unsure why the programmer before me put asa.PDB instead of dist.PDB in the inner join section.
I was thinking the eighth line needed to be changed from:
select asa.PDB,asa.Chain,asa.Residue,asa.ResId,asa.Symbol,dist.Atom from interfaceDist dist
to:
select dist.PDB,dist.Chain,dist.Residue,dist.ResId,dist.Symbol,dist.Atom from interfaceDist dist
Is that correct? Thanks.
You are joining asa and dist. It is perfectly logical that their values are compared so that only matching pairs end up in the result. So, unless you have a very good reason to think that in the line
dist.PDB=asa.PDB
you need dist.PDB instead of asa.PDB, the command looks correct. And if the condition were
dist.PDB=dist.PDB
it would be trivially true, and there would be no point in checking it at all. To identify bugs, you either need to see behavioral problems in the software or to understand the code you are looking at.
EDIT
In the select clause you can use either asa.PDB or dist.PDB, because the on condition guarantees that the two are equal; if they differ, the pair simply will not be in the result. So, in terms of values, it makes no difference. But if it is more intuitive to have dist.PDB in the select, you might want to change it (there is no harm in it, because the value is exactly the same) so the code is more readable, and if the code is later changed so that the two no longer have to be equal, you will not get new bugs out of the blue.

How can I select rows whose key starts with a certain prefix in Hive?

A very simple question:
I want to select all the rows whose keys have a certain prefix in Hive, but somehow it's not working.
The queries I've tried:
select * from solr_json_history where dt='20170814' and hour='2147' and substr(`_root_`,1,9)='P10004232' limit 100;
SELECT * FROM solr_json_history where dt='20170814' and hour='2147' and `_root_` like 'P19746284%' limit 100;
My Hue editor just hangs there without returning anything.
I've checked that there is data in my table for this time range with this query:
select * from solr_json_history where dt='20170814' and hour='2147' limit 15;
It's returning 15 records as expected.
Any help please?
Thanks a lot!
Per @musafir-safwan's request, I've added it as an answer here.
UPDATE:
I'm not able to provide sample data. But my problem got resolved.
Thanks for the commentator's attention.
My table does have data, no need to worry about that. Thanks for checking though.
The problem was due to a bad Hue UI design: when I issued the above two queries, it took too long (longer than the timeout set on the UI) to get a response back, so the UI simply didn't reply with anything or show a timeout reminder. It just hung there.
Also, those two queries essentially made two RPC calls, so they timed out.
Then I changed to the query below:
select `_root_`,json, count(*) from solr_json_history where dt='20170814' and hour='2147' and substr(`_root_`,1,9)='P19746284' group by `_root_`,json;
The difference is that I added a count(*), which turns this query into a MapReduce job; that has no timeout limit, and it then returns the result that I wanted.
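A possible alternative I have not tested (it relies on the standard hive.fetch.task.conversion setting): disable Hive's local fetch task so that even a plain SELECT runs as a distributed job rather than streaming through the single connection the UI times out on:
SET hive.fetch.task.conversion=none;
select * from solr_json_history
where dt='20170814' and hour='2147' and `_root_` like 'P19746284%'
limit 100;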
YMMV.
Thanks.

Slow MySQL query when additional select criteria are added

I'm having some issues with an SQL query that runs extremely slowly against an InnoDB MySQL table. There are only 32,000 rows, and all the conditions in the where clause are indexed and are either bits or bigints. I wouldn't have expected a speed issue at all, but sometimes the queries take 2 minutes to complete. It's difficult to tell because of caching, but it appears that if I remove all of the select criteria from the query except for the id, the query executes in milliseconds.
The rows are large because they store email HTML, so a row is often 1 MB. When the query runs, the machine uses a lot of hard-drive resources, which appears to be the source of the slowdown. As far as I know, though, the database shouldn't need to use any additional resources when select criteria are added: first it finds the rows, and then it pulls out the information it needs just for those rows. Can someone correct me if I'm wrong, or let me know if some other setting could explain this behaviour?
As requested, here is the query:
select aplosemail0_.id as id4353_, aplosemail0_.active as active4353_, aplosemail0_.deletable as deletable4353_, aplosemail0_.editable as editable4353_, aplosemail0_.persistentData as persiste5_4353_, aplosemail0_.dateCreated as dateCrea6_4353_, aplosemail0_.dateInactivated as dateInac7_4353_, aplosemail0_.dateLastModified as dateLast8_4353_, aplosemail0_.displayId as displayId4353_, aplosemail0_.owner_id as owner37_4353_, aplosemail0_.userIdCreated as userIdC10_4353_, aplosemail0_.userIdInactivated as userIdI11_4353_, aplosemail0_.userIdLastModified as userIdL12_4353_, aplosemail0_.parentWebsite_id as parentW38_4353_, aplosemail0_.automaticSendDate as automat13_4353_, aplosemail0_.emailFrame_id as emailFrame39_4353_, aplosemail0_.emailGenerationType as emailGe14_4353_, aplosemail0_.email_generator_type as email15_4353_, aplosemail0_.email_generator_id as email16_4353_, aplosemail0_.emailReadDate as emailRe17_4353_, aplosemail0_.emailSentCount as emailSe18_4353_, aplosemail0_.emailSentDate as emailSe19_4353_, aplosemail0_.emailStatus as emailSt20_4353_, aplosemail0_.emailTemplate_id as emailTe40_4353_, aplosemail0_.emailType as emailType4353_, aplosemail0_.encryptionSalt as encrypt22_4353_, aplosemail0_.forwardedEmail_id as forward41_4353_, aplosemail0_.fromAddress as fromAdd23_4353_, aplosemail0_.hardDeleteDate as hardDel24_4353_, aplosemail0_.htmlBody as htmlBody4353_, aplosemail0_.incomingReadRetryCount as incomin26_4353_, aplosemail0_.isAskingForReceipt as isAskin27_4353_, aplosemail0_.isIncomingEmailDeleted as isIncom28_4353_, aplosemail0_.isSendingPlainText as isSendi29_4353_, aplosemail0_.isUsingEmailSourceAsOwner as isUsing30_4353_, aplosemail0_.mailServerSettings_id as mailSer42_4353_, aplosemail0_.maxSendQuantity as maxSend31_4353_, aplosemail0_.originalEmail_id as origina43_4353_, aplosemail0_.outerEmailFrame_id as outerEm44_4353_, aplosemail0_.plainTextBody as plainTe32_4353_, aplosemail0_.removeDuplicateToAddresses as removeD33_4353_, aplosemail0_.repliedEmail_id as replied45_4353_, aplosemail0_.sendStartIdx as sendSta34_4353_, aplosemail0_.subject as subject4353_, aplosemail0_.uid as uid4353_ from AplosEmail aplosemail0_ where aplosemail0_.emailType=0 and aplosemail0_.emailStatus<>2 and aplosemail0_.mailServerSettings_id=7 and aplosemail0_.active=1 order by aplosemail0_.id DESC limit 50
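For what it's worth, here is a sketch of the pattern implied above: select only the id first (which is fast), then fetch the wide columns for just those 50 rows. This is only an illustration of the idea (sometimes called a deferred join), using the table and column names from the query:
select e.*
from AplosEmail e
join (
    select id
    from AplosEmail
    where emailType = 0
      and emailStatus <> 2
      and mailServerSettings_id = 7
      and active = 1
    order by id desc
    limit 50
) ids on ids.id = e.id
order by e.id desc;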

How can I find out how much time it takes to query items in a MySQL table?

Our website has a problem: one page takes too long to load. We have found that the page contains an n*n matrix, and for each item in the matrix it queries three tables in the MySQL database. Every item in that matrix does almost the same query.
So I wonder whether the large number of MySQL queries is what leads to the problem, and I want to try to fix it. Here is one of the things I am unsure about; the two alternatives are listed below:
1.
m = store.execute('SELECT X FROM TABLE1 WHERE I=1')
result = store.execute('SELECT Y FROM TABLE2 WHERE X in m')
2.
r = store.execute('SELECT X, Y FROM TABLE2')
result = []
for each in r:
    i = store.execute('SELECT I FROM TABLE1 WHERE X=%s', each[0])
    if i[0][0] == 1:
        result.append(each)
There are about 200 items in TABLE1 and more than 400 items in TABLE2. I don't know which part takes the most time, so I can't make a good decision about how to write my SQL statement.
How can I find out how much time it takes to do some operation in MySQL? Thank you!
Rather than installing a bunch of special tools, you could take a dead-simple approach like this (pardon my Ruby):
start = Time.new
# DB query here
puts "Query XYZ took #{Time.now - start} sec"
Hopefully you can translate that to Python. OR... pardon my Ruby again...
QUERY_TIMES = {}

def query(sql)
  start = Time.new
  result = connection.execute(sql)  # run the query as before
  elapsed = Time.new - start
  QUERY_TIMES[sql] ||= []           # collect elapsed times per SQL string
  QUERY_TIMES[sql] << elapsed
  result
end
Then run all your queries through this custom method. After doing a test run, you can make it print out the number of times each query was run, and the average/total execution times.
For the future, plan to spend some time learning about "profilers" (if you haven't already). Get a good one for your chosen platform, and spend a little time learning how to use it well.
I use MySQL Workbench for SQL development. It gives response times and can connect remotely to MySQL servers provided you have permission (which in this case will give you a more accurate reading).
http://www.mysql.com/products/workbench/
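If you'd rather get timings straight from the server, MySQL's built-in profiling can report per-query execution time (a quick sketch; SET profiling is the older session-level mechanism, and newer versions steer you toward the performance_schema instead):
SET profiling = 1;                -- enable profiling for this session
SELECT X, Y FROM TABLE2;          -- run the query you want to measure
SHOW PROFILES;                    -- lists recent queries with their durations
SHOW PROFILE FOR QUERY 1;         -- stage-by-stage breakdown of query #1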
Also, as you've realized, it appears you have an SQL statement in a for loop. That could drastically affect performance. You'll want to take a different route to retrieve that data.
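For example, the per-row loop in the question can usually be collapsed into a single join; here is a sketch using the TABLE1/TABLE2 names from the question:
-- one round trip instead of one query per row of TABLE2
select t2.X, t2.Y
from TABLE2 t2
join TABLE1 t1 on t1.X = t2.X
where t1.I = 1;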

How to tune the following MySQL query?

I am using the following MySQL query, which is working fine, I mean it gives me the desired output, but... let's first look at the query:
select
    fl.file_ID,
    length(fl.filedesc) as l,
    case
        when fl.part_no is null
             and l > 60
            then concat(fl.fileno, ' ', left(fl.filedesc, 60), '...')
        when fl.part_no is null
             and length(fl.filedesc) <= 60
            then concat(fl.fileno, ' ', fl.filedesc)
        when fl.part_no is not null
             and length(fl.filedesc) > 60
            then concat(fl.fileno, '(', fl.part_no, ')', left(fl.filedesc, 60), '...')
        when fl.part_no is not null
             and length(fl.filedesc) <= 60
            then concat(fl.fileno, '(', fl.part_no, ')', fl.filedesc)
    end as filedesc
from filelist fl
I don't want to use the length function repeatedly because I guess it would hit the database every time, causing a performance issue. Please suggest whether I can store the length once and use it several times.
Once you have accessed a given row, what you do with the columns has only a small impact on performance. So repeated use of that length function doesn't "hit the database" as much as you fear.
The analogy I would use is a postal carrier delivering mail to your house, which is miles outside of town. He drives for 20 minutes to your mailbox, and then he worries that it takes too much time to insert one letter at a time into your mailbox, instead of all the letters at once. The cost of that inefficiency is insignificant compared to the long drive.
That said, you can make the query more concise or easier to code or to look at. But this probably won't have a big benefit for performance.
select
    fl.file_ID,
    concat(fl.fileno,
           ifnull(concat('(', fl.part_no, ')'), ' '),
           left(fl.filedesc, 60),
           if(length(fl.filedesc) > 60, '...', '')
    ) as filedesc
from filelist fl
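And if you do want to spell out the length only once, a derived table is one way to do it (again just a sketch; don't expect a measurable speed difference, since the optimizer may simply merge it back in):
select
    t.file_ID,
    concat(t.fileno,
           ifnull(concat('(', t.part_no, ')'), ' '),
           left(t.filedesc, 60),
           if(t.len > 60, '...', '')
    ) as filedesc
from (
    select fl.*, length(fl.filedesc) as len
    from filelist fl
) t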