AWS Athena Timing out / not enough resources at this scale factor - mysql

I ran into a problem with Athena when I was trying to join two tables and order the result. My query looks very similar to this:
SELECT *
FROM Table_1
LEFT JOIN Table_2
       ON Table_1.id = Table_2.id
      AND Table_1.date = Table_2.date
ORDER BY Table_1.id, Table_1.date
My tables can be big depending on the dataset I am working with, around a million rows or more. After doing some research, I realize that the ORDER BY could be slowing down my query, but even when I take it out, it still times out. At the same time, I need the ORDER BY to structure my data because I will be turning the result into a CSV file. I have also read that I could split my query up to use different workers and take advantage of Athena's ability to do parallel work, but I don't know exactly how to do that in Athena, so if someone could elaborate and explain how that could possibly be done, that would be perfect. Another thing I was thinking of doing was partitioning my data based on columns, but I would like someone to explain the benefits of doing that, since I won't be selecting only a portion of my table but the whole table every time.
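For reference, the usual way to partition an existing table in Athena is a CTAS statement. This is only a hedged sketch: the target table name, S3 location, and non-key columns are all assumed, and the partition column has to be listed last in the SELECT:

CREATE TABLE table_1_partitioned
WITH (
    format = 'PARQUET',
    external_location = 's3://my-bucket/table_1_partitioned/',  -- assumed path
    partitioned_by = ARRAY['event_date']
) AS
SELECT id, payload, "date" AS event_date   -- columns assumed; partition column must come last
FROM Table_1;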
I don't know if this is relevant, but my file sizes are usually around ~100 MB or less. However, the different posts I see on here with the same problem are dealing with more than 10 GB, so I'm not sure if there's just something fundamentally wrong with my use of Athena.
Edit: I was thinking of paginating my queries to see if that could fix my issue, such as using OFFSET and LIMIT in a loop and appending the data together. Would that be a viable solution?
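If pagination turns out to be necessary, keyset pagination is often cheaper than OFFSET (whose support and cost vary by Athena engine version). A sketch, under the assumption that id is unique and orderable:

-- First page:
SELECT * FROM Table_1
ORDER BY id
LIMIT 100000;

-- Each following page starts after the last id seen on the previous page:
SELECT * FROM Table_1
WHERE id > 837261          -- last id from the previous page (hypothetical value)
ORDER BY id
LIMIT 100000;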

After doing some more tests on my code to see at what point it breaks in terms of payload size, I realized that my total number of rows was way more than it should be. It turned out that my actual SELECT statement was not what I portrayed it as in my post: it was missing the AND Table_1.date = Table_2.date part due to a bug in the construction of the query (because I am building it conditionally). From what I've seen, this multiplied the number of rows by up to 10x, which is what was breaking my query and eating up all of Athena's resources. So everything works fine now. However, I will leave this post up, mostly for learning purposes, in case any of the questions about this potential problem get answered.

Related

How to improve a simple MySQL-Query

There is a rather simple query that I have to run on a live system in order to get a count. The problem is that the table and database are rather inefficiently designed, and since it is a live system, altering it is not an option at this point.
So I have to figure out a query that runs fast and won't slow down the system too much, because for the duration of the query execution the system basically stops, which is not really what I would like a live system to do. So I need to streamline my query to make it perform in an acceptable time.
SELECT id1, COUNT(id2) AS count
FROM table
GROUP BY id1
ORDER BY count DESC;
So here is the query. Unfortunately it is so simple that I am out of ideas on how to improve it further; maybe someone else has an idea...?
Application
Get "good enough" results via application changes:
If you have access to the application, but not the database, then there are possibilities:
Periodically run that slow query and capture the results. Then use the cached results.
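For instance, a sketch only (the summary-table name is made up):

-- Rebuild periodically (e.g. from cron) and have the application read this
-- table instead of running the slow GROUP BY live:
DROP TABLE IF EXISTS id1_counts;
CREATE TABLE id1_counts AS
SELECT id1, COUNT(*) AS count
FROM `table`
GROUP BY id1;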
Do you need all of them?
What is the goal? Find a few of the most common id1's? Rank all of them?
Back to the query
COUNT(id2) checks for id2 being not NULL; this is usually unnecessary, so COUNT(*) is better. However, the speedup is insignificant.
ORDER BY NULL (the trick for suppressing GROUP BY's implicit sort in older MySQL) is irrelevant if you are picking off the rows with the highest COUNT -- the sort needs to be done somewhere. Moving it to the application does not help; at least not much.
Adding LIMIT 10 would help only by cutting down the time to send the data back to the client.
INDEX(id1) is the best index for the query (after changing to COUNT(*)). But the operation still requires:
- a full index scan to do the COUNT and GROUP BY
- a sort of the grouped results for the ORDER BY
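Putting those pieces together, a sketch (the table is just called `table` here, as in the question):

ALTER TABLE `table` ADD INDEX idx_id1 (id1);

SELECT id1, COUNT(*) AS count
FROM `table`
GROUP BY id1
ORDER BY count DESC
LIMIT 10;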
Zero or near-zero downtime
Do you have replication established? Galera Clustering?
Look into pt-online-schema-change and gh-ost.
What is the real goal?
We cannot fix the query as written. What things can we change? Better yet, what is the ultimate goal -- perhaps there is an approach that does not involve any query that looks anything like the one you are trying to speed up.
Now I have just dumped the table and imported it into a MySQL Docker container and ran the query there. It took ages, and I actually had to move my entire Docker setup because the dump was so huge, but in the end I got my results, and now I know how many id2s are associated with specific id1s.
As it was already pointed out, there wasn't much room for improvement on the query anymore.
FYI, suddenly the concern about stopping the system was gone, and now we are indexing the table. So far it has taken 6 hours, with no end in sight :D
Anyways, thanks for the help everyone.

Smart Queries That Deal With NULL Values

I recently inherited a table similar to the one in the image below. I don't have the resources to do what should be done in the allotted time, which is obviously to normalize the data: break it into a few smaller tables to eliminate redundancy, etc.
My current idea for a short-term solution is to create a query for each product type and store it in a new table based on ParentSKU. In the image below, a different query would be necessary for each of the 3 example ParentSKUs. This will work okay, but if new attributes are added to a SKU the query needs to be adjusted manually. What would be ideal in the short term (but probably not very likely) is to be able to come up with a query that would only include and display attributes where there weren't any NULL values. The desired results for each of the three ParentSKUs would be the same as they are in the examples below. If there were only 3 queries total, that would be easy enough, but there are dozens of combinations based on the products and categories of each product.
I'm certainly not the man for the job, but there are scores of people way smarter than I am who frequent this site every day and may be able to steer me in a better direction. I realize I'm probably asking for the impossible here, but as the saying goes, "There are no stupid questions, only ill-advised questions that deservedly and/or inadvertently draw the ire of StackOverflow users for various reasons." Okay, I embellished a tad, but you get my point...
I should probably add that this is currently a MySQL database.
Thanks in advance to anyone that attempts to help!
First, create SKUTypes with the result of:
SELECT ParentSKU, COUNT(Attr1) AS Attr1, ...
FROM tbl_attr
GROUP BY ParentSKU;
Then create a script that generates an SQL query for every row of SKUTypes, taking every AttrN column whose value is > 0.
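A hedged sketch of what that generation step could look like directly in MySQL with a prepared statement; the ParentSKU value and the three-attribute width are assumptions:

SET @sku = 'PARENT-SKU-1';           -- hypothetical ParentSKU value

-- Build a per-SKU SELECT that keeps only the attributes whose count
-- in SKUTypes is > 0, then run it:
SELECT CONCAT(
         'SELECT ParentSKU',
         IF(Attr1 > 0, ', Attr1', ''),
         IF(Attr2 > 0, ', Attr2', ''),
         IF(Attr3 > 0, ', Attr3', ''),
         ' FROM tbl_attr WHERE ParentSKU = ', QUOTE(@sku))
  INTO @sql
FROM SKUTypes
WHERE ParentSKU = @sku;

PREPARE stmt FROM @sql;
EXECUTE stmt;
DEALLOCATE PREPARE stmt;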

MySQL code locks up server

I have code that I have tested on the same table, but with only a few records in it.
It worked great on a handful (30) of records. Did exactly what I wanted it to do.
When I added 200 records to the table, it locks up. I have to restart Apache, and I have tried waiting forever for it to finish.
I could use some help figuring out why.
My table has the proper indexes and I am not having trouble in any other way.
Thanks in advance.
UPDATE `base_data_test_20000_rows` SET `NO_TOP_RATING` =
  (SELECT COUNT(`ID`) FROM `base_data_test_20000_rows_2`
   WHERE `base_data_test_20000_rows_2`.`ID` != `base_data_test_20000_rows`.`ID`
     AND `base_data_test_20000_rows_2`.`ANALYST` = `base_data_test_20000_rows`.`ANALYST`
     AND `base_data_test_20000_rows_2`.`IRECCD` =
       (SELECT COUNT(`ID`) FROM `base_data_test_20000_rows_2`
        WHERE `IRECCD` =
          (SELECT MIN(`IRECCD`) FROM `base_data_test_20000_rows_2`
           WHERE `base_data_test_20000_rows_2`.`ANNDATS_CONVERTED` >= DATE_SUB(`base_data_test_20000_rows`.`ANNDATS_CONVERTED`, INTERVAL 1 YEAR)
             AND `base_data_test_20000_rows_2`.`ID` != `base_data_test_20000_rows`.`ID`
             AND `base_data_test_20000_rows_2`.`ESTIMID` = `base_data_test_20000_rows`.`ESTIMID`)))
WHERE `base_data_test_20000_rows`.`ANALYST` != ''
The code is just meant to look a year back for a particular brokerage, get the lowest rating, then count the number of times that analyst had the lowest rating, and write that value to the NO_TOP_RATING column.
I'm pretty sure I was wrong with my original suggestion: changing the SELECT COUNT to its own number won't help, since you have conditions on your query.
This is merely a hackish solution. The real way to solve this would be to optimize your query. But as a workaround you could set the record count in a MySQL variable and then reference that variable in the query.
This means that you will have to set the count variable before you run the query. It also means that, should records be added between the time you set the variable and the time the query finishes running, you will not have the right count.
http://dev.mysql.com/doc/refman/5.0/en/user-variables.html
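Something along these lines (a sketch only, reusing the table names from the question):

-- Capture the count once, up front:
SET @row_count := (SELECT COUNT(`ID`) FROM `base_data_test_20000_rows_2`);
-- ...then reference @row_count in the UPDATE wherever a fixed count would do,
-- instead of re-running a correlated SELECT COUNT per row.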
further thoughts:
I took a closer look before submitting this answer. That might not actually be possible, since you have the WHERE conditions, which are individualized to each record.
It's slow because you are using a query that counts within a query that counts within a query that takes a MIN. It's like you are iterating through the whole table again for every row, three levels deep, which is roughly cubic (O(n^3)) rather than a single scan. So if the database has 10 records, you are possibly doing on the order of 10^3 row visits. At the number of rows you have, it's hellish.
I'm sure that there is a way to do what you are trying to do, but I can't actually tell what you are trying to do.
I would have to agree with DRapp that seeing dummy data could help us analyze what's really going on.
Since I can't wrap my head around it all, what I would try, without fully understanding what you are doing, would be to create a view for each of your subqueries and then query against those. http://dev.mysql.com/doc/refman/5.0/en/create-view.html
That probably won't remove the redundancy, but it might help with the speed. Then again, since I don't fully understand what you are doing, it's probably not the best answer.
Another not-so-good answer: if you aren't running this on a mission-critical DB and it can go offline while you run the query, then you could just change your MySQL settings and let this query run for those hours you quoted and hope it doesn't crash. But that seems less than ideal, as I have no idea whether it requires additional disk space or memory to perform.
So really the best answer I can give you at this point is: try to see if you can approach your problem from a different angle. Or post some dummy data of what the info in Base_data_test_20000_rows looks like and what you expect it to look like after the query runs.
Hope that helps point you in the right direction.

MySQL JOIN vs LIKE - faster selects?

Okay, so first of all let me tell you a little about what I'm trying to do. Basically, during my studies I wrote a little web service in PHP that calculates how similar movies are to each other based on some measurable attributes like length, actors, directors, writers, genres, etc. The data I used for this was basically a collection of data acquired from omdbapi.com.
I still have that database, but it is technically just a SINGLE table that contains all the information for each movie. This means that for each movie all the above-mentioned parameters are stored as comma-separated values. Therefore I have so far used a query that covers all these things with LIKE statements. The query can become quite large, as I pretty much query for every parameter within the table, sometimes 5 different LIKE statements for different actors, and the same for directors and writers. Back when I last used this, it took about 30 to 60 seconds to enter a single movie and receive a list of 15 similar ones.
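Roughly what such a query looks like (a made-up sketch; the column names and values are assumptions):

SELECT title
FROM movies
WHERE actors    LIKE '%Harrison Ford%'
   OR actors    LIKE '%Carrie Fisher%'
   OR directors LIKE '%George Lucas%'
   OR genres    LIKE '%Sci-Fi%';

Every one of those leading-wildcard LIKEs prevents ordinary index use and forces a full scan of the table, which is plausibly where the 30 to 60 seconds go.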
Now I have started my first job, and to teach myself in my free time, I want to work on my own website. Because I have no real concept of what I want to do with it, I thought I'd get out my old "movie finder" again and use it differently this time.
Now, to challenge myself, I want the whole thing to be faster. Understand that the data is NEVER changed, only read. It is also not "really" relational, as actor names and such are just strings and have no real entry anywhere else; essentially, having the same name means being treated as the same actor.
Now here comes my actual question:
Assuming I want my SELECT queries to run faster, would it make sense to run a script that splits the comma-divided strings into extra tables (these are n-to-m relations, see attempt below) and then JOIN all these tables (there will be 8 or more), or will using LIKE as I currently do be about the same speed? The ONLY thing I am trying to achieve is faster SELECT queries, as there is nothing else to really do with the data.
This is what I currently have. Keep in mind, I would still have to create tables for the relation between movies and each of these attribute tables. After doing that, I could remove the columns in the movie table and would end up having to join a lot of tables with EACH query. The only real advantage I can see here is that it would be easier to create indexes on the individual tables, rather than one (or a few) covering the one big movie table.
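For illustration, a minimal sketch of what one of those n-to-m splits might look like; every table and column name here is assumed:

CREATE TABLE actor (
  actor_id INT AUTO_INCREMENT PRIMARY KEY,
  name     VARCHAR(255) NOT NULL,
  UNIQUE KEY uq_actor_name (name)      -- same name = same actor, as stated above
);

CREATE TABLE movie_actor (
  movie_id INT NOT NULL,
  actor_id INT NOT NULL,
  PRIMARY KEY (movie_id, actor_id),
  KEY idx_actor (actor_id)
);

-- Finding movies by actor then becomes an indexed join instead of a
-- leading-wildcard LIKE over the whole movie table:
SELECT m.*
FROM movies m
JOIN movie_actor ma ON ma.movie_id = m.id
JOIN actor a        ON a.actor_id  = ma.actor_id
WHERE a.name = 'Harrison Ford';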
I hope all of this even makes sense to you. I appreciate any answer short or long, like I said this is mostly for self studies and as such, I don't have/need a real business model.
I don't understand what you currently have; it seems that you only showed the size of the tables, not their internal structure. You need to separate the data into separate tables using normalization rules and then put correct indexes on them. Indexes will make your queries very fast. What does the sizing above your query mean? Have you ever run EXPLAIN (or EXPLAIN ANALYZE, on MySQL 8.0.18+) for your queries? And please post the query; I cannot guess your query from the result. There are a lot of optimization videos on YT.

Should I split up a complex query into one to filter results and one to gather data?

I'm designing a central search function in a PHP web application. It is focused on a single table, and each result is exactly one unique ID out of that table. Unfortunately there are a few dozen tables related to this central one, most of them being 1:n relations. Even more unfortunately, I need to join quite a few of them: a couple to gather the necessary data for displaying the results, and a couple to filter according to the search criteria.
I have been mainly relying on a single query to do this. It has a lot of joins in there and, as there should be exactly one result displayed per ID, it also works with rather complex subqueries and GROUP BY uses. It also gets sorted according to a user-set sort method, and there's pagination in play as well, done by use of LIMIT.
Anyways, this query has become insanely complex and while I nicely build it up in PHP it is a PITA to change or debug. I have thus been considering another approach, and I'm wondering just how bad (or not?) this is for performance before I actually develop it. The idea is as follows:
- Run one less complex query that only filters according to the search parameters. This means fewer joins, and I can completely ignore GROUP BY and similar constructs; I will just "SELECT DISTINCT item_id" and get a list of IDs.
- Then run another query, this time joining in only the tables I need to display the results (only about 1/4 of the current total joins), using ... WHERE item_id IN (...), passing the list of "valid" IDs gathered in the first query.
Note: Obviously the IN () could actually contain the first query in full, instead of relying on PHP to build up a comma-separated list.
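Sketched out, the two-step idea would look something like this; all table and column names besides item_id are invented:

-- Step 1: filter only, no display joins, no GROUP BY
SELECT DISTINCT i.item_id
FROM items i
JOIN item_tags t ON t.item_id = i.item_id
WHERE t.tag = 'search-term';

-- Step 2: display joins only, fed with the IDs from step 1
SELECT i.item_id, i.title, c.name AS category
FROM items i
JOIN categories c ON c.category_id = i.category_id
WHERE i.item_id IN (42, 57, 91)   -- list built in PHP from step 1
ORDER BY i.title
LIMIT 20;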
How bad will the IN be performance-wise? And how much will it hurt me that I cannot LIMIT the first query at all? I'm also wondering if this is a common approach or if there are more intelligent ways to do it. I'd be thankful for any input on this :)
Note to clarify: We're not talking about a few simple joins here. There is even (simple) hierarchical data in there, where I need to compare the search parameter not only against the item's own data but also against its parent's data. In no other project I've ever worked on have I encountered a query close to this complexity. And before you even say it: yes, the data itself has this inherent complexity, which is why the data model is complex too.
My experience has shown that the WHERE IN (...) approach tends to be slower. I'd go with the joins, but make sure you're joining on the smallest dataset possible first: reduce down the simple main table, then join onto that. Save your most complex joins for last, to minimize the rows that have to be searched. Try to join on indexes wherever possible to improve speed, and ditch wildcards in JOINs where possible.
But I agree with Andomar: if you have the time, build both and measure.