Optimizing a summing/grouping query with millions of rows on MySQL

I have a MySQL table with nearly 4,000,000 rows containing income transactions for more than 100,000 employees.
There are three columns relevant in it, which are:
Employee ID [VARCHAR and INDEX] (not unique since one employee gets more than one income);
Type of Income [also VARCHAR and INDEX]
Value of the Income [DECIMAL(10,2)]
What I was looking to do seems to be very simple to me. I wanted to sum all the income occurrences grouping by each employee, filtering by one type.
For that, I was using the following code:
SELECT
SUM(`value`) AS `SumofValue`,
`type`,
`EmployeeID`
FROM
`Revenue`
GROUP BY `EmployeeID`
HAVING `type` = 'X'
And the result was supposed to be something like this:
SUM TYPE EMPLOYEE ID
R$ 250,00 X 250000008377
R$ 5.000,00 X 250000004321
R$ 3.200,00 X 250000005432
R$ 1.600,00 X 250000008765
....
However, this is taking a long time. I decided to use the LIMIT clause to limit the results to just 1,000 rows, and that works, but running it for the whole table would take approximately 1 hour according to my projections. That seems to be way too much time for something that does not look so demanding to me (though I'm assuming I'm probably wrong). Not only that, but this is just the first step of an even more complex query I intend to run in the future, in which I will also group by Employer ID along with Employee ID (one person can get income from more than one employer).
Is there any way to optimize this? Is there anything wrong with my code? Is there any secret path to increase the speed of this operation? Should I index the column of the value of the income as well? If this is a MySQL limitation, is there any option that could handle this better?
I would really appreciate any help.
Thanks in advance
DISCLOSURE: This is an open government database. All this data is lawfully open to the public.

First, phrase the query using WHERE, rather than HAVING -- filter before doing the aggregation:
SELECT SUM(`value`) AS `SumofValue`,
MAX(type) as type,
EmployeeID
FROM Revenue r
WHERE `type` = 'X'
GROUP BY EmployeeID;
Next, try using this index: (type, EmployeeId, value). At the very least, this is a covering index for the query. MySQL (depending on the version) might be smart enough to use it for the aggregation as well.
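For reference, a minimal sketch of creating that index (the index name here is just an example):
-- Covering index: MySQL can filter on `type`, group on `EmployeeID`,
-- and read `value` straight from the index without touching the table rows.
ALTER TABLE `Revenue`
    ADD INDEX `idx_type_employee_value` (`type`, `EmployeeID`, `value`);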

Looking at your schema, why are you using the VARCHAR datatype for Employee ID and Type?
You can create a reference table for Type (1 --> X, 2 --> Y, ...), so the transaction table only stores an integer reference for the type.
Just create a dummy table like the one below and run the same query that was taking an hour; you should see a major change in the execution plan as well.
CREATE TABLE test_transaction
(
Employee_ID BIGINT,
Type SMALLINT,
Income DECIMAL(10,2)
)
Create separate indexes on the Employee_ID and Type columns, as in the sketch below.
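A rough sketch of that setup, assuming a hypothetical income_type reference table and example index names:
-- Reference table mapping each income type to a small integer (1 --> X, 2 --> Y, ...).
CREATE TABLE income_type
(
    Type_ID SMALLINT PRIMARY KEY,
    Type_Code VARCHAR(10) NOT NULL
);
INSERT INTO income_type (Type_ID, Type_Code) VALUES (1, 'X'), (2, 'Y');

-- Separate indexes on the dummy transaction table, as suggested above.
CREATE INDEX idx_test_employee ON test_transaction (Employee_ID);
CREATE INDEX idx_test_type ON test_transaction (Type);

-- The original query rewritten against the integer-keyed table.
SELECT SUM(Income) AS SumofValue, Employee_ID
FROM test_transaction
WHERE Type = 1 -- 1 stands for 'X' in the reference table
GROUP BY Employee_ID;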

Related

How would indexes improve aggregate lookup speeds in MySQL?

I'm building a virtual currency system, and since it is tied to real money, accurate retention of data is a key goal. I am building my currency so that user wallets do not have a fixed value (an 'amount'); rather, they are the net of all transactions to and from that wallet -- sort of like bitcoin.
So, if I'm calculating the amount of a wallet, my MySQL query looks like this.
SELECT (
SELECT IFNULL(SUM(`tx`.`amount`), 0)
FROM `transactions` AS `tx`
WHERE `tx`.`to_wallet_id` = 5
) - (
SELECT IFNULL(SUM(`tx`.`amount`), 0)
FROM `transactions` AS `tx`
WHERE `tx`.`from_wallet_id` = 5
) AS `net`
This query builds a net value using the aggregate data of a SUM() for all transactions towards a wallet subtracted by all transactions from a wallet. The final value represents how much money the wallet currently contains.
In short, I want to know how I can optimize my table so that these queries are very fast and scale as well as possible.
I would assume I should index [from_wallet_id, amount] and [to_wallet_id, amount], but I'm very uncertain.
I would suggest the following:
Make the amount field NOT NULL, with 0 as the default value.
Create the indexes you had in mind ([from_wallet_id, amount] and [to_wallet_id, amount]). That allows the queries to retrieve all the necessary data from the indexes alone; a sketch of both changes follows.
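A sketch of those two changes, using the table and column names from the question (the DECIMAL precision and index names are assumptions):
-- Make amount NOT NULL with a default of 0 (precision assumed here).
ALTER TABLE `transactions`
    MODIFY `amount` DECIMAL(10,2) NOT NULL DEFAULT 0;

-- Covering indexes so each SUM() can be answered from the index alone.
ALTER TABLE `transactions`
    ADD INDEX `idx_to_wallet_amount` (`to_wallet_id`, `amount`),
    ADD INDEX `idx_from_wallet_amount` (`from_wallet_id`, `amount`);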
If that doesn't help, you can think about the following options:
Divide the transaction table into two parts: in_transaction and out_transaction.
Keep aggregate values in a separate table and update them after any changes.
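If you go for the aggregate-table route, one possible shape (all names here are hypothetical, and it assumes an initial backfill from the existing transactions) is a per-wallet balance cache kept in sync by a trigger:
-- One row per wallet, updated on every insert into transactions.
CREATE TABLE wallet_balances
(
    wallet_id INT PRIMARY KEY,
    balance DECIMAL(10,2) NOT NULL DEFAULT 0
);

DELIMITER //
CREATE TRIGGER trg_tx_after_insert
AFTER INSERT ON transactions
FOR EACH ROW
BEGIN
    -- credit the receiving wallet
    INSERT INTO wallet_balances (wallet_id, balance)
    VALUES (NEW.to_wallet_id, NEW.amount)
    ON DUPLICATE KEY UPDATE balance = balance + NEW.amount;
    -- debit the sending wallet
    INSERT INTO wallet_balances (wallet_id, balance)
    VALUES (NEW.from_wallet_id, -NEW.amount)
    ON DUPLICATE KEY UPDATE balance = balance - NEW.amount;
END//
DELIMITER ;
Reading a balance then becomes a primary-key lookup instead of two aggregations, at the cost of a little extra work on every insert.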

Large mysql query with group by

I have a pricing history table with half a billion records. It is formatted like this:
Id, sku, vendor, price, datetime
What I want to do is get average price of all products by vendor for a certain date range. Most products are updated once every 3 days, but it varies.
So, this is the query I want to run:
SELECT
avg(price)
FROM table
WHERE
vendor='acme'
AND datetime > '12-15-2014'
AND datetime < '12-18-2014'
GROUP BY sku
This 3-day range is broad enough that I will for sure get at least one price sample, but some SKUs may have been sampled more than once, hence the GROUP BY to get only one instance of each sku.
The problem is, this query runs and runs and doesn't seem to finish (more than 15 minutes). There are around 500k unique skus.
Any ideas?
edit: corrected asin to sku
For this query to be optimized by MySQL you need to create a composite index
(vendor, datetime, asin)
IN THIS PARTICULAR ORDER (it matters).
It is also worth trying another one,
(vendor, datetime, asin, price)
since it may perform better (it's a so-called "covering index").
Indexes with another order, like (datetime, vendor) (which is suggested in another answer), are useless, since datetime is used in a range comparison.
A few notes:
The index will be helpful only if the vendor='acme' AND datetime > '12-15-2014' AND datetime < '12-18-2014' filter condition covers a small part of the whole table (say, less than 10%).
MySQL does not support mm-dd-yyyy literals (at least it's not documented, see the references below), so I assume it must be yyyy-mm-dd instead.
Your comparison does not cover the first second of December 15th, 2014, so you probably wanted datetime >= '2014-12-15' instead. (A combined sketch follows the references.)
References:
http://dev.mysql.com/doc/refman/5.6/en/range-optimization.html
http://dev.mysql.com/doc/refman/5.6/en/date-and-time-literals.html
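Putting those notes together, a sketch of the covering index and the corrected query (the table name is taken from the other answer, and the sku column from the question's edit):
ALTER TABLE pricing_history
    ADD INDEX idx_vendor_datetime_sku_price (vendor, `datetime`, sku, price);

SELECT sku, AVG(price) AS avg_price
FROM pricing_history
WHERE vendor = 'acme'
  AND `datetime` >= '2014-12-15'
  AND `datetime` < '2014-12-18'
GROUP BY sku;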
You need an index to support your query. I suggest you create an index on vendor and datetime like so:
CREATE INDEX pricing_history_date_vendor ON pricing_history (datetime, vendor);
Also, I assume you wanted to group by sku rather than undefined column asin.
Not to mention your non-standard SQL date format MM-dd-yyyy as pointed out by others in comments (should be yyyy-MM-dd).

Two queries (with a subquery) or one query to select max(date) in the WHERE clause - MySQL

I need to create a table to store the cached status of some events. I will only have to do two operations:
1) Insert the id of an event, its status, and the time when the record was stored in the db;
2) Get the last record with a certain event id.
There are several methods to get the result (status):
Method 1:
SELECT status FROM status_log a
WHERE a.event_id = 1
ORDER BY a.update_date DESC
LIMIT 1
Method 2:
SELECT status FROM status_log a
WHERE a.update_date = (
SELECT max(b.update_date) FROM status_log b
WHERE b.event_id = 1
) AND a.event_id = 1
So I have two questions:
Which query to use
Which field type to set to update_date field (int or timestamp)
Actually, your second query does not answer the question 'find the record with the greatest update date for event #1', because there could be many different events with the same latest update_date. So, in terms of semantics, you should use the first query. (After your edit this is fixed.)
The first query will be effective if you create an index on event_id and that column has good cardinality (i.e. the WHERE clause filters out enough rows using that index). It can be improved further by adding the update_date column to the index, but that only makes sense if there are many rows with the same event_id (enough for MySQL to use the second index part), and again with good cardinality inside the first index part.
But in practice my advice is just theory; you'll have to verify it with EXPLAIN and your own measurements on real data, as in the sketch below.
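For example, a quick check might look like this (the index name is just an example):
ALTER TABLE status_log ADD INDEX idx_event_date (event_id, update_date);

EXPLAIN
SELECT status FROM status_log a
WHERE a.event_id = 1
ORDER BY a.update_date DESC
LIMIT 1;
The EXPLAIN output shows whether the index is used for both the filter and the sort.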
As for the data type, common practice is to use the proper type (i.e. DATETIME/TIMESTAMP for something that represents a point in time).
Which query to use
I believe the first one should be faster. Anyway just run an EXPLAIN on them and you'll find out yourself.
The index you should be using will be:
ALTER TABLE status_log ADD INDEX(event_id, update_date)
Now... did you notice that those queries are NOT equivalent? The second one will return all statuses from all event_ids that share the maximum date.
Which field type to set to update_date field (int or timestamp)
If you have a field named update_date I just can't imagine why an int would serve the same purpose. Rephrasing the question to choose between datetime or timestamp, then the answer is up to the requirements. If you just want to know when a record in the DB was updated use a timestamp. If the update_date refers to an entity in your domain model go for a datetime. You will most likely need to perform calculations on the date (add time, remove time, extract a month, etc) so using a unix timestamp (which I'd say should be almost write-only) will result in extra calculation time because you'll have to convert the timestamp to a datetime and then perform the function over that result.
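To illustrate that extra conversion step (the integer column name here is hypothetical):
-- MONTH() works directly on a DATETIME/TIMESTAMP column...
SELECT MONTH(update_date) FROM status_log;
-- ...but an INT unix timestamp has to be converted first.
SELECT MONTH(FROM_UNIXTIME(update_date_unix)) FROM status_log;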

SQL ORDER BY performance

I have a table with more than 1 million records. The problem is that the query takes too much time, like 5 minutes. The ORDER BY is my problem, but I need the expression in the ORDER BY to get the most popular videos, and because of that expression I can't create an index on it.
How can I resolve this problem?
Thx.
SELECT DISTINCT
`v`.`id`,`v`.`url`, `v`.`title`, `v`.`hits`, `v`.`created`, ROUND((r.likes*100)/(r.likes+r.dislikes),0) AS `vote`
FROM
`videos` AS `v`
INNER JOIN
`votes` AS `r` ON v.id = r.id_video
ORDER BY
(v.hits+((r.likes-r.dislikes)*(r.likes-r.dislikes))/2*v.hits)/DATEDIFF(NOW(),v.created) DESC
Does the most popular list have to be calculated every time? I doubt the answer is yes. Some operations will take a long time to run no matter how efficient your query is.
Also bear in mind that you have 1 million rows now, but you might have 10 million in the next few months. The query might work now but not in a month; the solution needs to be scalable.
I would set up a job to run every couple of hours to calculate and store this information in a different table. This might not be the answer you are looking for, but I just had to say it.
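One possible shape for such a job, as a sketch (all names here are hypothetical, and it assumes the MySQL event scheduler is enabled): a cache table refreshed every couple of hours using the question's own popularity formula.
CREATE TABLE video_popularity
(
    video_id INT PRIMARY KEY,
    score DOUBLE,
    INDEX (score)
);

CREATE EVENT refresh_video_popularity
ON SCHEDULE EVERY 2 HOUR
DO
    REPLACE INTO video_popularity (video_id, score)
    SELECT v.id,
           (v.hits + ((r.likes - r.dislikes) * (r.likes - r.dislikes)) / 2 * v.hits)
               / DATEDIFF(NOW(), v.created)
    FROM videos v
    INNER JOIN votes r ON v.id = r.id_video;
The front end then just reads from video_popularity ordered by score DESC, which can be served from the index on score.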
What I have done in the past is to create a voting system based on Integers.
Nothing will outperform integers.
The voting system table has 2 Columns:
ProductID
VoteCount (INT)
The votecount stores all the votes that are submitted.
Like = +1
Unlike = -1
Create an index on the vote table based on the ID, as in the sketch below.
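A minimal sketch of that counter table (the names are examples, not taken from the question):
CREATE TABLE product_votes
(
    ProductID INT PRIMARY KEY, -- the index recommended above
    VoteCount INT NOT NULL DEFAULT 0
);

-- Like = +1
UPDATE product_votes SET VoteCount = VoteCount + 1 WHERE ProductID = 42;
-- Unlike = -1
UPDATE product_votes SET VoteCount = VoteCount - 1 WHERE ProductID = 42;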
You have two alternatives to improve this:
1) Create a new column with the needed value pre-calculated.
2) Create a second table that holds the videos' primary key and the result of the calculation.
In the first case this could be a computed column; otherwise, modify your app or add triggers to keep it in sync (you'd need to load it manually the first time, and later let your program keep it updated).
If you use the second option, your key could be composed of the finalRating plus the primary key of the videos table. This way your searches would be hugely improved.
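As a sketch of that second option (table and column names, and the DECIMAL precision, are just examples):
-- Small side table keyed by the pre-computed rating plus the video's primary key.
CREATE TABLE video_rating
(
    final_rating DECIMAL(12,4) NOT NULL,
    video_id INT NOT NULL,
    PRIMARY KEY (final_rating, video_id)
);

-- The "most popular" list is now a simple index scan.
SELECT video_id
FROM video_rating
ORDER BY final_rating DESC
LIMIT 20;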
Have you tried moving the arithmetic of the ORDER BY into your SELECT, and then ordering by the virtual column, such as:
SELECT (col1+col2) AS a
FROM TABLE
ORDER BY a
Arithmetic on sort is expensive.

MySql Query Grouping by Time

I am trying to create a report to understand the time of day that orders are being placed, so I need to sum and group them by time. For example, I would like a sum of all orders placed between 1:00 and 1:59, then the next row listing the sum of all orders between 2:00 and 2:59, etc. The field is a datetime variable, but for the life of me I haven't been able to find the right query to do this. Any suggestions sending me down the right path would be greatly appreciated.
Thanks
If it is indeed MySQL, and by "sum of orders" you mean the number of orders and not the total value:
select date_format(date_field, '%Y-%m-%d %H') as the_hour, count(*)
from my_table
group by the_hour
order by the_hour
This kind of grouping (using a calculated field) will certainly not scale over time. If you really need to execute this specific GROUP BY/ORDER BY frequently, you should create an extra field (an UNSIGNED TINYINT will suffice) storing the hour, and place an INDEX on that column, as in the sketch below.
That is, of course, if your table is becoming quite big; if it is small (which cannot be stated in a mere number of records, since it is also a matter of server configuration and capabilities) you probably won't notice much difference in performance.
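A sketch of that extra column, reusing the table and column names from the answer above (the new column and index names are assumptions):
ALTER TABLE my_table
    ADD COLUMN order_hour TINYINT UNSIGNED,
    ADD INDEX idx_order_hour (order_hour);

-- One-time backfill; new rows would set it on insert (e.g. in the app or a trigger).
UPDATE my_table SET order_hour = HOUR(date_field);

SELECT order_hour, COUNT(*) AS orders
FROM my_table
GROUP BY order_hour
ORDER BY order_hour;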