I have a database with the following table structure:
id | entry | log_type | user_id | created_on
---|-------|----------|---------|---------------
 1 | a     | error    | 1       | 1433752884000
 2 | b     | warn     | 2       | 1433752884001
 3 | c     | error    | 2       | 1433752884002
 4 | d     | warn     | 4       | 1433752884003
I want to obtain the last record from the table based on the created_on field. Currently I am using the following query to obtain the result list, and then I pick the last record from it in Java:
select * from log_table l where l.user_id=2 and l.log_type = 'error' order by l.created_on desc;
I am using JPA and execute the query with .getResultList() on the Query interface. Once I get the result list, I call get(0) to obtain the desired last record.
I have a large table with a lot of data; the above query takes too long to execute and stalls the application. I cannot add an additional index on the existing data for now. Apart from adding an index, is there an alternative approach to avoid stalling on this query?
I was thinking of executing the following query instead:
select * from log_table l where l.user_id=2 and l.log_type = 'error' order by l.created_on desc limit 1;
Currently I cannot execute the second query on the database, as it might cause my application to stall. Will execution of the second query be faster than the first query?
I don't have a sufficiently large dataset on my local system to reproduce the stalling problem. I tried executing the queries against my local database, but without a large dataset I cannot determine whether the second query would actually be faster due to the added LIMIT.
If the second query isn't expected to perform better, what approach should I take to get an optimized query?
If the second query should be good enough to avoid stalling, is that because the DB fetches only one record instead of the entire set of records? Does the database handle looking up/fetching a single record differently from fetching many records (as in the first query), improving query timings?
The performance depends...
ORDER BY x LIMIT 1
is a common pattern. It may or may not be very efficient -- it depends on the query and the indexes.
In your case:
where l.user_id=2 and l.log_type = 'error' order by l.created_on desc
this would be optimal:
INDEX(user_id, log_type, created_on)
With that index, it will essentially do one probe to find the row you need. Without that index, it will scan much or all of the table, sort it descending (ORDER BY .. DESC), and deliver the first row (LIMIT 1).
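For reference, a minimal sketch of adding that index (using the log_table name from the question; the index name is my own):
-- Composite index: the equality columns (user_id, log_type) come first,
-- created_on last, so the newest matching row can be read directly.
ALTER TABLE log_table
  ADD INDEX idx_user_type_created (user_id, log_type, created_on);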
Before you do your query.getResultList(), you need to query.setMaxResults(1). This is the equivalent of LIMIT 1.
But be aware that if your Entity has a Collection of related sub-objects JOINed to it in the query, the Entity Manager may still have to do an unbounded select to get all the data it needs to build the first Entity. See this question and answer for more information about that.
In your case, as you only need one Entity, I would recommend lazy-loading any attached Entities after you have done the initial query.
Related
I have a table with a huge number of records. When I query that table, especially when using ORDER BY in the query, it takes too much execution time.
How can I optimize this table for Sorting & Searching?
Here is an example schema of my table (jobs):
+----+-----------------+---------------------+
| id | title           | created_at          |
+----+-----------------+---------------------+
| 1  | Web Developer   | 2018-04-12 10:38:00 |
| 2  | QA Engineer     | 2018-04-15 11:10:00 |
| 3  | Network Admin   | 2018-04-17 11:15:00 |
| 4  | System Analyst  | 2018-04-19 11:19:00 |
| 5  | UI/UX Developer | 2018-04-20 12:54:00 |
+----+-----------------+---------------------+
I have been searching for a while; I learned that creating an INDEX can help improve performance. Can someone please elaborate on how the performance can be increased?
Add "explain" word before ur query, and check result
explain select ....
There you can see what you need to improve; then add an index on your search and/or sorting fields and run the EXPLAIN query again.
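For instance, a minimal sketch of that workflow against the jobs table from the question (the index name is an illustrative assumption):
-- Check the current plan; look at the key, rows and Extra columns.
EXPLAIN SELECT * FROM jobs ORDER BY created_at DESC;
-- Add an index on the sort column, then re-run EXPLAIN and compare.
ALTER TABLE jobs ADD INDEX idx_created_at (created_at);
EXPLAIN SELECT * FROM jobs ORDER BY created_at DESC;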
If you want to improve the performance of your query, one way is to paginate it. You can set a LIMIT (as large as you want) and specify the page you want to display.
For example SELECT * FROM your_table LIMIT 50 OFFSET 0.
I don't know if this answer will help you in your problem but you can try it ;)
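For instance, assuming the jobs table from the question and 50 rows per page, page N would use OFFSET (N - 1) * 50:
-- Page 1: rows 1-50, newest first
SELECT * FROM jobs ORDER BY created_at DESC LIMIT 50 OFFSET 0;
-- Page 2: rows 51-100
SELECT * FROM jobs ORDER BY created_at DESC LIMIT 50 OFFSET 50;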
Indexes are the database's way of creating lookup trees (B-Trees in most cases) to more efficiently sort, filter, and find rows.
Indexes are used to find rows with specific column values quickly.
Without an index, MySQL must begin with the first row and then read
through the entire table to find the relevant rows. The larger the
table, the more this costs. If the table has an index for the columns
in question, MySQL can quickly determine the position to seek to in
the middle of the data file without having to look at all the data.
This is much faster than reading every row sequentially.
https://dev.mysql.com/doc/refman/5.5/en/mysql-indexes.html
You can use EXPLAIN to help identify how the query is currently running, and identify areas of improvement. It's important to not over-index a table, for reasons probably beyond the scope of this question, so it'd be good to do some research on efficient uses of indexes.
ALTER TABLE jobs
ADD INDEX(created_at);
(Yes, there is a CREATE INDEX syntax that does the equivalent.)
Then, in the query, do
ORDER BY created_at DESC
However, with 15M rows, it may still take a long time. Will you be filtering (WHERE)? LIMITing?
If you really want to return to the user 15M rows -- well, that is a lot of network traffic, and that will take a long time.
MySQL details
Regardless of the index declaration or version, the ASC/DESC in ORDER BY will be honored. However it may require a "filesort" instead of taking advantage of the ordering built into the index's BTree.
In some cases, the WHERE or GROUP BY is too messy for the Optimizer to make use of any index. But if it can, then...
(Before MySQL 8.0) While it is possible to declare an index DESC, the attribute is ignored. However, ORDER BY .. DESC is honored; it scans the data backwards. This also works for ORDER BY a DESC, b DESC, but not if you have a mixture of ASC and DESC in the ORDER BY.
MySQL 8.0 does create DESC indexes; that is, the BTree containing the index is stored 'backwards'.
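For illustration, a minimal sketch of a descending index on the jobs table from the question (MySQL 8.0+ only; the index name is my own):
-- In MySQL 8.0+ the DESC attribute is stored, so this index serves
-- ORDER BY created_at DESC without a backward scan or filesort.
ALTER TABLE jobs ADD INDEX idx_created_at_desc (created_at DESC);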
A MySQL database contains the following two tables (simplified):
packages (~13,000 rows)         packages_prices (~7,000,000 rows)
-------------------             ----------------------
| id (int)        |<---------->| package_id (int)    |
| state (int)     |            | variant_id (int)    |
-------------------            | for_date (date)     |
                               | price (float)       |
                               -----------------------
Each package_id/for_date combination has only a few (average 3) variants.
And state is 0 (inactive) or 1 (active). Around 4000 of the 13000 are active.
First I just want to know which packages have a price set (regardless of variant), so I add a composite key covering (1) for_date and (2) package_id, and I query:
select distinct package_id from packages_prices where for_date > date(now())
This query takes 1 second to return 3500 rows, which is too much. An EXPLAIN tells me that the composite key is used with key_len 3, and 2,000,000 rows are examined, 100% filtered, with type range: Using where; Using index; Using temporary. The distinct takes it back to 3500 rows.
If I take out distinct, Using temporary is no longer mentioned, but the query then returns 1,000,000 rows and still takes 1 second.
Question 1: Why is this query so slow, and how do I speed it up without having to add or change the columns in the table? I would expect that, given the composite key, this query should cost less than 0.01 s.
Now I want to know which active packages have a price set.
So I add a key on state and I add a new composite key just like above, but in reverse order. And I write my query like this:
select distinct packages.id from packages
inner join packages_prices on id = package_id and for_date > date(now())
where state = 1
The query now takes 2 seconds. An EXPLAIN tells me that for the packages table the key on state is used with key_len 4, examining 4000 rows, 100% filtered, with type ref: Using index; Using temporary. For the packages_prices table the new composite key is used with key_len 4, examining 1000 rows, 33.33% filtered, with type ref: Using where; Using index; Distinct. The distinct takes it back to 3000 rows.
If I take out distinct, Using temporary and Distinct are no longer mentioned, but the query returns 850,000 rows and takes 3 seconds.
Question 2: Why is the query that much slower now? Why is range no longer being used according to the EXPLAIN? And why has filtering with the new composite key dropped to 33.33%? I expected the composite key to filter 100% again.
This all seems very basic and trivial, but it has been costing me hours and hours and I still don't understand what's really going on under the hood.
Your observations are consistent with the way MySQL works. For your first query, using the index (for_date, package_id), MySQL will start at the specified date (using the index to find that position), but then has to go to the end of the index, because every next entry can reveal a yet unknown package_id. A specific package_id could e.g. have just been used on the latest for_date. That search will add up to your 2000000 examined rows. The relevant data is retrieved from the index, but it will still take time.
What to do about that?
With some creative rewriting, you can transform your query to the following code:
select package_id from packages_prices
group by package_id
having max(for_date) > date(now());
It will give you the same result as your first query: if there is at least one for_date > date(now()) (which will make it part of your resultset), that will be true for max(for_date) too. But this will only have to check one row per package_id (the one having max(for_date)), all other rows with for_date > date(now()) can be skipped.
MySQL will do that by using the index for a group-by optimization (you should see "Using index for group-by" in your EXPLAIN). It will require the index (package_id, for_date) (which you already have) and only has to examine 13000 rows: since the list is ordered, MySQL can jump directly to the last entry for each package_id, which will have the value for max(for_date), and then continue with the next package_id.
Actually, MySQL can use this method to optimize a distinct too (and will probably do that if you remove the condition on for_date), but it is not always able to find a way; a really clever optimizer could have rewritten your query the same way I did, but we are not there yet.
And depending on your data distribution, that method could have been a bad idea: if you have e.g. 7000000 package_id, but only 20 of them in the future, checking each package_id for the maximum for_date will be much slower than just checking 20 rows that you can easily find by the index on for_date. So knowledge about your data will play an important role in choosing a better (and maybe optimal) strategy.
You can rewrite your second query in the same way. Unfortunately, such optimizations are not always easy to find and often specific to a specific query and situation. If you have a different distribution (as mentioned above) or if you e.g. slightly change your query and add an end-date, that method would not work anymore and you have to come up with another idea.
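For what it's worth, a hedged sketch of how that rewrite might look for the second query (my own adaptation, untested against your data):
-- Each package_id is examined once inside the derived table, which keeps
-- only packages priced beyond today; the outer join then filters on state.
select p.id
from packages p
inner join (
    select package_id
    from packages_prices
    group by package_id
    having max(for_date) > date(now())
) pp on pp.package_id = p.id
where p.state = 1;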
My understanding is that in (My)SQL a SELECT DISTINCT should do the same thing as a GROUP BY on all columns, except that GROUP BY does implicit sorting, so these two queries should be the same:
SELECT boardID,threadID FROM posts GROUP BY boardID,threadID ORDER BY NULL LIMIT 100;
SELECT DISTINCT boardID,threadID FROM posts LIMIT 100;
They're both giving me the same results, and they're giving identical output from EXPLAIN:
+----+-------------+-------+------+---------------+------+---------+------+---------+-----------------+
| id | select_type | table | type | possible_keys | key  | key_len | ref  | rows    | Extra           |
+----+-------------+-------+------+---------------+------+---------+------+---------+-----------------+
|  1 | SIMPLE      | posts | ALL  | NULL          | NULL | NULL    | NULL | 1263320 | Using temporary |
+----+-------------+-------+------+---------------+------+---------+------+---------+-----------------+
1 row in set
But on my table the query with DISTINCT consistently returns instantly and the one with GROUP BY takes about 4 seconds. I've disabled the query cache to test this.
There's 25 columns so I've also tried creating a separate table containing only the boardID and threadID columns, but the same problem and performance difference persists.
I have to use GROUP BY instead of DISTINCT so I can include additional columns without them being included in the evaluation of DISTINCT. So now I don't know how to proceed. Why is there a difference?
First of all, your queries are not quite the same - the GROUP BY version has an ORDER BY, but the DISTINCT version does not.
Note that in either case no index is used, and that cannot be good for performance.
I would suggest creating a compound index on (boardID, threadID) - this should let both queries make use of the index, and both should start working much faster.
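A minimal sketch of that index on the posts table from the question (the index name is my own):
-- Compound index covering both columns used by DISTINCT / GROUP BY,
-- so the server can read distinct (boardID, threadID) pairs in index order.
ALTER TABLE posts ADD INDEX idx_board_thread (boardID, threadID);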
EDIT: Explanation why SELECT DISTINCT ... LIMIT 100 is faster than GROUP BY ... LIMIT 100 when you do not have indexes.
To execute the first statement (SELECT DISTINCT), the server only needs to fetch 100, maybe slightly more, rows and can stop as soon as it has 100 different rows - no more work to do.
This is because the original SQL statement did not specify any order, so the server can deliver any 100 rows as it pleases, as long as they are distinct. But if you were to impose any index-less ORDER BY on this before the LIMIT 100, the query would immediately become slow.
To execute the second statement (SELECT ... GROUP BY ... LIMIT 100), MySQL always does an implicit ORDER BY on the same columns as were used in the GROUP BY. In other words, it cannot stop after fetching the first 100+ rows; it must fetch, group and sort all records. After that, it applies the ORDER BY NULL you added (which does not do much, I guess, but dropping it may speed things up), and finally it takes the first 100 rows and throws away the remaining result. And of course, this is damn slow.
When you have compound index, all these steps can be done very quickly in either case.
I am building a front-end to a largish db (10's of millions of rows). The data is water usage for loads of different companies and the table looks something like:
id | company_id | datetime | reading | used | cost
=============================================================
1 | 1 | 2012-01-01 00:00:00 | 5000 | 5 | 0.50
2 | 1 | 2012-01-01 00:01:00 | 5015 | 15 | 1.50
....
On the front end, users can select how they want to view the data, e.g. 6-hourly increments, daily increments, monthly, etc. What would be the best way to do this quickly? Given how much the data changes and how rarely any one set of data will be viewed, caching the query data in memcached or something similar is almost pointless, and there is no way to build the data beforehand as there are too many variables.
I figured using some kind of aggregate table would work, having tables such as readings, readings_6h, readings_1d with exactly the same structure, just already aggregated.
If this is a viable solution, what is the best way to keep the aggregate tables up to date and accurate? Apart from the data coming in from the meters, the table is read-only; users never update or write to it.
A number of possible solutions include:
1) stick to doing queries with group / aggregate functions on the fly
2) doing a basic select and save
SELECT `company_id`, CONCAT_WS(' ', DATE(`datetime`), '23:59:59') AS datetime,
       MAX(`reading`) AS reading, SUM(`used`) AS used, SUM(`cost`) AS cost
FROM `readings`
WHERE `datetime` > '$lastUpdateDateTime'
GROUP BY `company_id`, DATE(`datetime`)
3) duplicate key update (not sure how the aggregation would be done here, or how to make sure the data is accurate - not counted twice or missing rows)
INSERT INTO `readings_6h` ...
SELECT FROM `readings` ....
ON DUPLICATE KEY UPDATE .. calculate...
4) other ideas / recommendations?
I am currently doing option 2, which takes around 15 minutes to aggregate roughly 100k rows into roughly 30k rows across the aggregate tables (_6h, _1d, _7d, _1m, _1y).
TL;DR What is the best way to view / store aggregate data for numerous reports that can't be cached effectively?
This functionality would be best served by a feature called materialized view, which MySQL unfortunately lacks. You could consider migrating to a different database system, such as PostgreSQL.
There are ways to emulate materialized views in MySQL using stored procedures, triggers, and events. You create a stored procedure that updates the aggregate data. If the aggregate data has to be updated on every insert you could define a trigger to call the procedure. If the data has to be updated every few hours you could define a MySQL scheduler event or a cron job to do it.
There is a combined approach, similar to your option 3, that does not depend on the dates of the input data; imagine what would happen if some new data arrives a moment too late and does not make it into the aggregation. (You might not have this problem, I don't know.) You could define a trigger that inserts new data into a "backlog," and have the procedure update the aggregate table from the backlog only.
All these methods are described in detail in this article: http://www.fromdual.com/mysql-materialized-views
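As an illustration of option 3 combined with such a scheduled refresh, here is a hedged sketch for the 6-hourly table (column names come from the question; the bucket_start column, the unique key and the refresh window are my own assumptions):
-- Assumes readings_6h has a UNIQUE KEY on (company_id, bucket_start).
-- Run from a MySQL event or cron job; it recomputes every 6-hour bucket
-- starting at or after midnight of the previous day, so each touched
-- bucket is rebuilt in full (nothing is counted twice).
INSERT INTO readings_6h (company_id, bucket_start, reading, used, cost)
SELECT company_id,
       DATE(`datetime`) + INTERVAL (HOUR(`datetime`) DIV 6) * 6 HOUR AS bucket_start,
       MAX(reading), SUM(used), SUM(cost)
FROM readings
WHERE `datetime` >= DATE(NOW() - INTERVAL 1 DAY)
GROUP BY company_id, bucket_start
ON DUPLICATE KEY UPDATE
  reading = VALUES(reading),
  used    = VALUES(used),
  cost    = VALUES(cost);
Data that arrives later than that window would be missed, which is exactly the case the backlog/trigger approach above is meant to cover.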
I have a more or less well-working query (concerning the result), but it takes about 45 seconds to process. That's definitely too long for presenting the data in a GUI.
So my goal is to find a much faster/more efficient query (something around a few milliseconds would be nice).
My data table currently has ~2,619,395 entries (up from around 3,000 originally) and is still growing.
Schema:
num | station | fetchDate           | exportValue | error
----|---------|---------------------|-------------|------
1   | PS1     | 2010-10-01 07:05:17 | 300         | 0
2   | PS2     | 2010-10-01 07:05:19 | 297         | 0
923 | PS1     | 2011-11-13 14:45:47 | 82771       | 0
Explanation
the exportValue is always incrementing
the exportValue represents the actual absolute value
in my case there are 10 stations
every ~15 minutes 10 new entries are written to the table
error is just an indicator for a proper working station
Working query:
select
  YEAR(fetchDate), station, MAX(exportValue) - MIN(exportValue)
from
  registros
where
  exportValue > 0 and error = 0
group by
  station, YEAR(fetchDate)
order by
  YEAR(fetchDate), station
Output:
Year | station | Max-Min
2008 | PS1 | 24012
2008 | PS2 | 23709
2009 | PS1 | 28102
2009 | PS2 | 25098
My thoughts on it:
writing several queries with BETWEEN statements, like 'between 2008-01-01 and 2008-01-02' to fetch the MIN(exportValue) and 'between 2008-12-30 and 2008-12-31' to grab the MAX(exportValue) - problem: a lot of queries, plus the problem of having no data in a specified time range (it's not guaranteed that there will be data)
limiting the result set to my 10 stations only by using order by MIN(fetchDate) - problem: also takes a long time to process the query
Additional Info:
I'm using the query in a Java application (JPA 2.0). That means it would be possible to do some post-processing on the result set if necessary.
Any help/approaches/ideas are very appreciated. Thanks in advance.
Adding suitable indexes will help.
2 compound indexes will speed things up significantly:
ALTER TABLE tbl_name ADD INDEX (error, exportValue);
ALTER TABLE tbl_name ADD INDEX (station, fetchDate);
This query running on 3000 records should be extremely fast.
Suggestions:
Do you have a PK set on this table? (station, fetchDate)?
Add indexes; you should experiment with indexes, as rich.okelly suggested in his answer.
Depending on your experiments with indexes, try breaking your query into multiple statements inside one stored procedure; this way you will not lose time on network traffic between multiple queries sent from the client to MySQL.
You mentioned that you tried separate queries and that there is a problem when there is no data for a particular month; that is a regular case in business applications, and you should handle it in a "master query" (stored procedure or application code).
I guess fetchDate is the current date and time at the moment of record insertion; consider keeping previous months' data in a sort of summary table with fields: year, month, station, max(exportValue), min(exportValue) - this means you would insert summary records into the summary table at the end of each month; deleting, keeping or moving the detail records to a separate table is your choice (see the sketch below).
Since your table is growing rapidly (new rows every 15 minutes), you should take the last suggestion into account. There is probably no need to keep the detailed history in one place; archiving data is a process that should be done as part of maintenance.
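A minimal sketch of that summary-table idea (the table name, column names and types are illustrative assumptions, not part of the original schema):
-- Hypothetical monthly summary table: one row per station per month.
CREATE TABLE registros_summary (
  year      SMALLINT    NOT NULL,
  month     TINYINT     NOT NULL,
  station   VARCHAR(10) NOT NULL,
  maxExport INT         NOT NULL,
  minExport INT         NOT NULL,
  PRIMARY KEY (year, month, station)
);
-- Run at the end of each month (cron job or MySQL event) to roll the
-- previous month's detail rows into the summary table.
INSERT INTO registros_summary (year, month, station, maxExport, minExport)
SELECT YEAR(fetchDate), MONTH(fetchDate), station,
       MAX(exportValue), MIN(exportValue)
FROM registros
WHERE exportValue > 0 AND error = 0
  AND fetchDate >= DATE_FORMAT(NOW() - INTERVAL 1 MONTH, '%Y-%m-01')
  AND fetchDate <  DATE_FORMAT(NOW(), '%Y-%m-01')
GROUP BY YEAR(fetchDate), MONTH(fetchDate), station;
The yearly report could then read mostly from the small summary table instead of scanning millions of detail rows.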