I have code that I have tested on the same table, but with only a few records in it.
It worked great on a handful (30) of records and did exactly what I wanted it to do.
When I added 200 records to the table, it locks up. I have to restart Apache, and I have tried waiting forever for it to finish.
I could use some help figuring out why.
My table has the proper indexes and I am not having trouble in any other way.
Thanks in advance.
UPDATE `base_data_test_20000_rows` SET `NO_TOP_RATING` =
(SELECT COUNT(`ID`) FROM `base_data_test_20000_rows_2`
WHERE
`base_data_test_20000_rows_2`.`ID` != `base_data_test_20000_rows`.`ID`
AND
`base_data_test_20000_rows_2`.`ANALYST` = `base_data_test_20000_rows`.`ANALYST`
AND
`base_data_test_20000_rows_2`.`IRECCD` =
(SELECT COUNT(`ID`) FROM `base_data_test_20000_rows_2`
WHERE `IRECCD` =
(select MIN(`IRECCD`) FROM `base_data_test_20000_rows_2`
WHERE
`base_data_test_20000_rows_2`.`ANNDATS_CONVERTED` >= DATE_SUB(`base_data_test_20000_rows`.`ANNDATS_CONVERTED`,INTERVAL 1 YEAR)
AND
`base_data_test_20000_rows_2`.`ID` != `base_data_test_20000_rows`.`ID`
AND
`base_data_test_20000_rows_2`.`ESTIMID` = `base_data_test_20000_rows`.`ESTIMID`
)
)
)
WHERE `base_data_test_20000_rows`.`ANALYST` != ''
The code is just meant to look a year back for a particular brokerage, get the lowest rating, then count the number of times that analyst had the lowest rating and write that value to the NO_TOP_RATING column.
I'm pretty sure I was wrong with my original suggestion: changing the SELECT COUNT to its own number won't help, since you have conditions on your query.
This is merely a hackish solution. The real way to solve this would be to optimize your query. But as a workaround you could set the record count to a MySQL variable, and then reference that variable in the query.
This means you have to set the variable before you run the query. It also means that if records are added between the time you set the variable and the time the query finishes running, you will not have the right count.
http://dev.mysql.com/doc/refman/5.0/en/user-variables.html
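Purely to illustrate the mechanics of that (retracted) idea, it would look roughly like this; note that it writes one global count everywhere, which is exactly why it doesn't fit a query whose count depends on per-row conditions:
-- Illustration only: one global count stored in a user variable.
SET @cnt = (SELECT COUNT(`ID`) FROM `base_data_test_20000_rows_2`);

UPDATE `base_data_test_20000_rows`
SET `NO_TOP_RATING` = @cnt
WHERE `ANALYST` != '';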
further thoughts:
I took a closer look before submitting this answer. That might not actually be possible, since you have WHERE conditions that are individualized to each record.
It's slow because you are using a query that counts within a query that counts within a query that has a MIN. It's like you are iterating through every row again for each row you iterate through, three levels deep, which is roughly cubic. So if the database has 10 records you are possibly doing on the order of 10^3 row visits. At the number of rows you have, it's hellish.
I'm sure that there is a way to do what you are trying to do, but I can't actually tell what you are trying to do.
I would have to agree with DRapp that seeing dummy data could help us analyze what's really going on.
Since I can't wrap my head around it all, what I would try, without fully understanding what you are doing, would be to create a view for each of your subqueries and then run a query against those. http://dev.mysql.com/doc/refman/5.0/en/create-view.html
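For what it's worth, a rough sketch of one such view for the innermost MIN subquery might be (the view name is made up, and the date/ID comparisons against the outer table can't be folded into a plain view, so they would have to stay in the main query):
-- Made-up view name; this only captures the aggregate part of the innermost
-- subquery. The ANNDATS_CONVERTED / ID comparisons against the outer table
-- would still have to live in the main query.
CREATE VIEW min_ireccd_per_estimid AS
SELECT `ESTIMID`, MIN(`IRECCD`) AS min_ireccd
FROM `base_data_test_20000_rows_2`
GROUP BY `ESTIMID`;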
That probably won't escape the redundancy, but it might help with the speed. Since I don't fully understand what you are doing, though, that's probably not the best answer.
Another not-so-good answer: if you aren't running this on a mission-critical DB and it can go offline while you run the query, you could just change your MySQL settings and let this query run for those hours you quoted and hope it doesn't crash. But that seems less than ideal, and I have no idea whether it requires additional disk space or memory to perform.
So really, the best answer I can give you at this point is to try approaching your problem from a different angle, or post some dummy data showing what the info in Base_data_test_20000_rows looks like and what you expect it to look like after the query runs.
Hope that helps point you in the right direction.
Related
I ran into a problem with Athena when I was trying to join and order two tables together. My query statement looks very similar to this:
SELECT *
from Table_1
LEFT JOIN Table_2
ON Table_1.id = Table_2.id AND Table_1.date = Table_2.date
ORDER BY Table_1.id, Table_1.date
My tables are potentially big depending on the dataset I am working with, with about a million rows or more. After doing some research, I realized that the ORDER BY could potentially be slowing down my query, but even when I take it out, it still times out. At the same time, I need the ORDER BY to structure my data, because I will be turning this into a CSV file. I have also read that I could split my query up in order to use different workers and take advantage of Athena's ability to do parallel work, but I don't know exactly how to do that in Athena, so if someone could elaborate and explain how that could possibly be done, that would be perfect. Another thing I was thinking of doing was partitioning my data based on columns, but I would like someone to explain the benefits of doing that, since I won't be selecting only a portion of my table but the whole table every time.
I don't know if this is relevant, but my file sizes are usually around ~100 MB or less. However, the different posts I see on here with the same problem are dealing with more than 10 GB, so I'm not sure if there's something fundamentally wrong with my use of Athena.
Edit: I was thinking of paginating my queries to see if that could fix my issue, such as using offset and limit in a loop and just appending the data together. Would that be a viable solution?
After doing some more tests on my code to see at what point it breaks in terms of payload size, I realized that my total number of rows was way more than it should be. I found out that my SELECT statement was not actually what I portrayed it as in my post: it was missing the AND Table_1.date = Table_2.date part, due to a bug in the construction of the query (because I am building it conditionally). This multiplied the number of rows by up to 10x from what I've noticed, which is what was screwing up my query and eating up all of Athena's resources. So everything works fine now. However, I will leave this post up, mostly for learning purposes, to see whether this potential problem gets any answers.
There is this rather simple query that I have to run on a live system in order to get a count. The problem is that the table and database are rather inefficiently designed, and since it is a live system, altering it is not an option at this point.
So I have to figure out a query that runs fast and won't slow the system down too much, because for the duration of the query the system basically stops, which is not what I would like a live system to do. I need to streamline my query so it performs in an acceptable time.
SELECT id1, count(id2) AS count FROM table GROUP BY id1 ORDER BY count DESC;
So here is the query; unfortunately it is so simple that I am out of ideas on how to improve it further. Maybe someone else has an idea?
Application
Get "good enough" results via application changes:
If you have access to the application, but not the database, then there are possibilities:
Periodically run that slow query and capture the results. Then use the cached results.
Do you need all
What is the goal? Find a few of the most common id1's? Rank all of them?
Back to the query
COUNT(id2) checks for id2 being not null; this is usually unnecessary, so COUNT(*) is better. However, the speedup is insignificant.
ORDER BY NULL is irrelevant if you are picking off the rows with the highest COUNT -- the sort needs to be done somewhere. Moving it to the application does not help; at least not much.
Adding LIMIT 10 would only help because of cutting down on the time to send the data back to the client.
INDEX(id1) is the best index for the query (after changing to COUNT(*)). But the operation still requires:
a full index scan to do the COUNT and GROUP BY
a sort of the grouped results -- for the ORDER BY
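Putting those pieces together, a sketch of what I mean (the table name is the placeholder from the question, and LIMIT 10 is only an example figure):
-- Add the index once, then run the count with COUNT(*).
ALTER TABLE `table` ADD INDEX idx_id1 (id1);

SELECT id1, COUNT(*) AS count
FROM `table`
GROUP BY id1
ORDER BY count DESC
LIMIT 10;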
Zero or near-zero downtime
Do you have replication established? Galera Clustering?
Look into pt-online-schema-change and gh-ost.
What is the real goal?
We cannot fix the query as written. What things can we change? Better yet, what is the ultimate goal -- perhaps there is an approach that does not involve any query that looks the least like the one you are trying to speed up.
Now I have just dumped the table and imported it into a MySQL Docker container and ran the query there. It took ages, and I actually had to move my entire Docker setup because the dump was so huge, but in the end I got my results, and now I know how many id2s are associated with specific id1s.
As it was already pointed out, there wasn't much room for improvement on the query anymore.
FYI: suddenly the concern about stopping the system was gone, and now we are indexing the table. So far it has taken 6 hours, with no end in sight :D
Anyways, thanks for the help everyone.
I have a table with approx. 70000 entries. It holds information about brands, models and categories of goods. The user can query them using any combination of those, and the displayed counter of goods matching the criteria has to be updated according to his selection.
I have it done using a query like
SELECT model,COUNT(*) AS count FROM table$model_where
GROUP BY model
ORDER BY count DESC
where $model_where depends on what the other conditions were. But my boss asked me to redo these queries to use a special counter table, because he believes they are slowing the whole process down. A benchmark I put in place suggests otherwise; sample output:
The code took: 0 wallclock secs (0.02 usr + 0.00 sys = 0.02 CPU)
and it measures the whole routine from the start until the data is sent to the user, so you can see it's really fast.
I have done some research on this matter, but I still haven't seen a definitive answer as to when to use COUNT(*) vs counter tables. Who is right? I'm not persuaded we actually need manual tracking of this, but maybe I just know little.
Depending on your specific case, this might or might not be a case of premature optimization.
If next week you'll have 100x bigger tables, it might not be; otherwise it is.
Also, your boss should take into consideration that you and everybody else will have to make sure the counters are updated whenever an INSERT or DELETE happens on the counted records. There are frameworks which do that automatically (Ruby on Rails's ActiveRecord comes to mind), but if you're not using one of them, there are about a gazillion ways you can end up with wrong counters in the DB.
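To give a feel for what that manual maintenance involves, here is a rough MySQL-level sketch using triggers (the goods table and model column are invented stand-ins for your schema; an UPDATE that changes the model would need a third trigger, and any bulk load done with triggers disabled will silently desynchronize the counters):
-- Invented names; a DB-level equivalent of what ActiveRecord's counter
-- cache does at the application level.
CREATE TABLE model_counts (
    model VARCHAR(64) NOT NULL PRIMARY KEY,
    cnt   INT NOT NULL DEFAULT 0
);

CREATE TRIGGER goods_after_insert AFTER INSERT ON goods FOR EACH ROW
    INSERT INTO model_counts (model, cnt) VALUES (NEW.model, 1)
    ON DUPLICATE KEY UPDATE cnt = cnt + 1;

CREATE TRIGGER goods_after_delete AFTER DELETE ON goods FOR EACH ROW
    UPDATE model_counts SET cnt = cnt - 1 WHERE model = OLD.model;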
This is a super strange question, and its usefulness is probably limited to my problem; I'm going to explain what I'm asking and why I need it.
My problem:
I have a table, let's say with 2 columns; take the following table as an example:
id|value
1 A
2 B
3 C
4 A
5 A
Now, if I do a "SELECT id WHERE value = 'A'", I would get 3 results: 1, 4, 5. If I do a "SELECT id WHERE value = 'B'", I would get 1 result: 2. And so on; if there were more entries, I would get the corresponding number of rows according to the value I'm looking for in my query. It's all good.
But now, here comes my problem. Let's say I want to get every row for every query, but with the following restriction:
Do not modify the queries.
If I do "SELECT id WHERE value = 'A'", I would get every id, if I do "SELECT id WHERE value = 'B'", I would get every id, and so on.
"But if I can't modify my query, then what can I do?" You may ask, well, you can modify the table, like changing the value of the column 'value' to a value that would match every value, that's a wildcard, hence the title of the question, but I'm pretty sure if I update all 'value' values to '%', it doesn't work (I tried knowing this wouldn't work, but still, I couldn't lose anything trying).
So, you can do whatever you want, the only restriction is to not modify the queries.
I know this is kind of the inverse of how databases and tables should work, but this is a problem I've been presented with, maybe this is impossible, but maybe it's not.
Edit:
I know this makes little to no sense at all, but I'm asking this as a kind of challenge, appealing to the creative minds out there. Don't worry about vulnerabilities or anything else, just ask yourselves: "How would I do it?"
Before I present any solutions, let me make it clear that you are solving the wrong problem. You should be figuring out how to change your queries; that restriction will keep causing trouble, and any solution that works around it will be complex enough to generate more problems than it solves.
Hopefully this really is just an intellectual exercise.
I'm also going to only give sketches on how to do this, because this is just an intellectual exercise RIGHT?!
The first, and most comprehensive solution is to "just" change the source code of your MySQL database to respond to the queries however you like. It's an Open Source database. Download the source code, change it, recompile, and install.
The downside to this solution (assuming you can make it work) is that it affects every connection to the database and has to be repeated every time you want to upgrade MySQL.
Assuming this is restricted to one table, and that the set of WHERE clauses is fixed, you can duplicate every row in that table to have every value which might be queried. For example, if you have id's 1 and 2 and value is only ever A, B or C, you'd make a table like this:
id|value
1 A
1 B
1 C
2 A
2 B
2 C
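If you really wanted to build that expanded table, one hedged way to generate it in SQL (my_table, id and value are the assumed names from the example; you would then swap it in under the original table name so the unmodified queries hit it):
-- Build every (id, value) combination from the existing data.
CREATE TABLE my_table_expanded AS
SELECT ids.id, vals.value
FROM (SELECT DISTINCT id FROM my_table) AS ids
CROSS JOIN (SELECT DISTINCT value FROM my_table) AS vals;

-- Swap it in so the untouched queries read the expanded data.
RENAME TABLE my_table TO my_table_old, my_table_expanded TO my_table;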
Then there are various man-in-the-middle attacks you can do to strip off the WHERE clause. If it's a fixed set of programs which are the problem you could alter the database API library they use. In Perl this would be the DBI library. In PHP this would be mysqli or PDO. And so on.
A more comprehensive solution would be to replace the MySQL server's socket (both the TCP and Unix socket) with your own little server. This would read and parse the MySQL network protocol (you may be able to extract the code to do this from the MySQL source), alter the query to strip the WHERE clause, and send it on to the real MySQL server.
These are all terrible solutions that are horribly difficult to implement correctly. Even if you got them working 100%, you're left with a system that does strange things to database queries which is likely to cause further problems down the road.
One of the most creative solutions to a problem is to realize you're solving the wrong problem.
I encourage you to post the circumstances that led to this question, as another question, because that is the real problem. Also, the management failures that led to it will be a nice train wreck to watch.
The aim is to get the highest possible number of rows within 5 seconds, without requesting more rows than can actually be loaded in that time. The aim is not to create a timeout.
After months, I thought maybe this would work, but it didn't:
declare @d1 datetime2(7); set @d1 = getdate();
select c1, c2 from t1 where (datediff(ss, @d1, getdate()) < 5)
Although the trend in recent years for relational databases has moved more and more toward cost-based query optimization, there is no RDBMS I am aware of that inherently supports designating a maximum cost (in time or I/O) for a query.
The idea of "just let it time out and use the records collected so far" is a flawed solution. The flaw lies in the fact that a complex query may spend the first 5 seconds performing a hash on a subtree of the query plan, to generate data that will be used by a later part of the plan. So after 5 seconds, you may still have no records.
To get the most records possible in 5 seconds, you would need a query that had a known estimated execution plan, which could then be used to estimate the optimal number of records to request in order to make the query run for as close to 5 seconds as possible. In other words, knowing that the query optimizer estimates it can process 875 records per second, you could request 4,375 records. The query might run a bit longer than 5 seconds sometimes, but over time your average execution should fall close to 5 seconds.
So...how to make this happen?
In your particular situation, it's not feasible. The catch is "known estimated execution plan". To make this work reliably, you'd need a stored procedure with a known execution plan, not an ad-hoc query. Since you can't create stored procedures in your environment, that's a non-starter. For others who want to explore that solution, though, here's an academic paper by a team who implemented this concept in Oracle. I haven't read the full paper, but based on the abstract it sounds like their work could be translated to any RDBMS that has cost-based optimization (e.g. MS SQL, MySQL, etc.)
OK, So what can YOU do in your situation?
If you can't do it the "right" way, solve it with a hack.
My suggestion: keep your own "estimated cost" statistics.
Do some testing in advance and estimate how many rows you can typically get back in 4 seconds. Let's say that number is 18,000.
So you LIMIT your query to 18,000 rows. But you also track the execution time every time you run it and keep a moving average of, say, the last 50 executions. If that average is less than 4.5s, add 1% to the query size and reset the moving average. So now your app is requesting 18,180 rows every time. After 50 iterations, if the moving average is under 4.5s, add 1% again.
And if your moving average ever exceeds 4.75s, subtract 1%.
Over time, this method should converge on an optimized N-rows solution for your particular query/environment/etc., and should adjust (slowly but steadily) when conditions change (e.g. high concurrency vs. low concurrency).
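On the SQL side, the only moving part in this scheme is the row limit; all the moving-average bookkeeping lives in the application. A minimal sketch using the t1/c1/c2 names from the question and the 18,000 starting figure from above (the question's snippet looks like SQL Server, hence TOP; MySQL/Postgres would use LIMIT):
-- N (here 18000) is recalculated by the application from the moving
-- average of recent execution times.
SELECT TOP (18000) c1, c2 FROM t1;
-- MySQL equivalent: SELECT c1, c2 FROM t1 LIMIT 18000;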
Just one -- scratch that, two -- more things...
As a DBA, I have to say...it should be exceedingly rare for any query to take more than 5 seconds. In particular, if it's a query that runs frequently and is used by the front end application, then it absolutely should not ever run for 5 seconds. If you really do have a user-facing query that can't complete in 5 seconds, that's a sign that the database design needs improvement.
Jonathan VM's Law Of The Greenbar Report
I used to work for a company that still used a mainframe application that spit out reams of greenbar dot-matrix-printed reports every day. Most of these were ignored, and of the few that were used, most were never read beyond the first page. A report might have thousands of rows sorted by descending account age...and all that user needed was to see the 10 most aged. My law is this: The number of use cases that actually require seeing a vast number of rows is infinitesimally small. Think - really think - about the use case for your query, and whether having lots and lots of records is really what that user needs.
Your while loop idea won't solve the problem entirely. It is possible that the very first iteration through the loop could take longer than 5 seconds. Plus, it will likely result in retrieving far fewer rows in the allotted time than if you tried to do it with just a single query.
Personally, I wouldn't try to solve this exact problem. Instead, I would do some testing, and through trial and error identify a number of records that I am confident will load in under five seconds. Then, I would just place a LIMIT on the loading query.
Next, depending on the requirements I would either set a timeout on the DB call of five seconds or just live with the chance that some calls will exceed the time restriction.
Lastly, consider that on most modern hardware for most queries, you can return a very large number of records within five seconds. It's hard to imagine returning all of that data to the UI and still have it be usable, if that is your intention.
-Jason
I've never tried this, but if a script is running this query, you could try running an unbuffered query (in PHP, this would be something like mysql_unbuffered_query())... You could then store the rows in an array while the query is running, and set the MySQL query timeout to five minutes. When the query is killed, if you've set your while() loop to check for a timeout response, it can terminate the loop and you'll have an array with all of the records returned within five minutes. Again, I'm not sure this would work, but I'd be interested to see if it would accomplish what you're looking to do.
You could approach this problem like this, but I doubt that this logic is really what I'd recommend for real world use.
You have a 10s interval, you try one query, it gets you the row in 0.1s. That would imply you could get at least 99 similar queries still in the remaining 9.9s.
However, getting 99 queries at once should prove faster than getting them one by one (which your initial calculation would suggest). So you get the 99 queries and check the time again.
Let's say the operation performed 1.5 times as fast as the single query, because getting more rows at once is more efficient, leaving you with 100 rows at the 7.5s mark. You calculate that on average you have so far gotten 100 rows per 7.5s, work out a new number of possible queries for the rest of the time, query again, and so on. You would, however, need to set a threshold limit for this loop, something like: don't issue any new queries after 9.9s.
This solution obviously is neither the most smooth nor something I'd really use, but maybe it serves to solve the OP's problem.
Also, jmacinnes already pointed out: "It is possible that the very first iteration through the loop could take longer than 10[5] seconds."
I'd certainly be interested myself, if someone can come up with a proper solution to this problem.
To get data from the table you should do two things:
execute a query (SELECT something FROM table)
fill the table or read data
You are asking about the second one. I'm not that familiar with PHP, but I don't think it matters. We use fetching to get the first records quickly and show them to the user, then fetch more records as needed. In ADO.NET you could use IDataReader to get records one by one; in PHP I think you could use similar methods, for example mysqli_fetch_row in the mysqli extension or mysql_fetch_row in the mysql extension. In this case you could stop reading data at any moment.