How would indexes improve aggregate lookup speeds in MySQL?

I'm building a virtual currency system, and since it deals with real money, accurate retention of data is a key goal. I am building my currency so that user wallets do not have a fixed value (an 'amount'); rather, a wallet's balance is the net of all transactions to and from that wallet -- sort of like Bitcoin.
So, when I calculate the balance of a wallet, my MySQL query looks like this.
SELECT (
SELECT IFNULL(SUM(`tx`.`amount`), 0)
FROM `transactions` AS `tx`
WHERE `tx`.`to_wallet_id` = 5
) - (
SELECT IFNULL(SUM(`tx`.`amount`), 0)
FROM `transactions` AS `tx`
WHERE `tx`.`from_wallet_id` = 5
) AS `net`
This query builds a net value using the aggregate data of a SUM() for all transactions towards a wallet subtracted by all transactions from a wallet. The final value represents how much money the wallet currently contains.
In short, I want to know how I can optimize my table so that these queries are very fast and scale as well as possible.
I would assume I should index [from_wallet_id, amount] and [to_wallet_id, amount], but I'm very uncertain.

I would suggest doing the following:
Make the amount field NOT NULL, with 0 as the default value.
Create the indexes you had in mind ([from_wallet_id, amount] and [to_wallet_id, amount]). That allows the queries to retrieve all the necessary data from the indexes alone (covering indexes).
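For example (a sketch only; the index names are arbitrary, and the column types are whatever your transactions table already uses):
ALTER TABLE `transactions`
    ADD INDEX `idx_to_wallet_amount`   (`to_wallet_id`, `amount`),
    ADD INDEX `idx_from_wallet_amount` (`from_wallet_id`, `amount`);
With these in place, EXPLAIN should show "Using index" for both subqueries, meaning the sums are computed from the index alone without touching the table rows.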
If that doesn't help, you can think about the following options:
Split the transactions table into two tables: in_transactions and out_transactions.
Keep aggregate values in a separate table and update them after any changes.
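As a rough sketch of that last option (the table and column names, such as wallet_balances, are illustrative and not from the question), the cached balance could be maintained with INSERT ... ON DUPLICATE KEY UPDATE in the same database transaction that records the transfer:
CREATE TABLE `wallet_balances` (
    `wallet_id` INT UNSIGNED NOT NULL PRIMARY KEY,
    `balance`   DECIMAL(20,8) NOT NULL DEFAULT 0
);

-- When recording a transfer of @amt from wallet @from_id to wallet @to_id
-- (user variables here are placeholders for the application's values):
INSERT INTO `wallet_balances` (`wallet_id`, `balance`)
VALUES (@to_id, @amt), (@from_id, -@amt)
ON DUPLICATE KEY UPDATE `balance` = `balance` + VALUES(`balance`);
Running this inside the same transaction as the insert into `transactions` keeps the cached balance consistent with the transaction log.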

Related

Optimizing MySQL table for selecting many rows in date range

I have an InnoDB table in MySQL where I have to select and sum a lot of data in date ranges. It seems I can't get to a point where it runs fast enough for the use case.
The table is as follows:
user_id: int
request_date: date
amount: int
The table has several hundred million rows.
A date range can return up to 10 million rows.
Amount is 1-10
I have a composite index on all three columns in the order: user_id, date, amount.
The query I use for selecting is:
SELECT
SUM(amount)
FROM table
WHERE user_id = ?
AND request_date <= ?
AND request_date >= ?
I hardcode the dates into the query.
Anything else I can do to speed up this query? I should be able to do the query about 20 times a second.
It's running on DI with 8gb RAM and 4 CPUs (not dedicated).
Update
The output of EXPLAIN is:
select_type: SIMPLE
type: range
possible_keys: composite
key: composite
key_len: 7
ref: null
rows: 14994440
Extra: Using where; Using index
I've used various techniques in the past to do similar stuff.
You should consider partitioning your table. That involves creating a column that contains a partition identifier, which could be a date or a year-month value.
I've had some performance increase by splitting the date and time portion. The advantage is that you can then quickly grab all data from a date by looking at the date field, without even considering the time portion.
If you know what kind of data you'll be requesting, and you can allow for some delay, you can pre-calculate. It looks like you're working with log data, so I assume that query results for anything older than today will never change. You should exploit that, for example by having a separate table with aggregated data. If you only need to calculate "today", things will be much faster. Or, if you can accept numbers that are a bit old, you can just pre-calculate periodically.
The table that I'm talking about could be something like:
CREATE TABLE aggregated_requests AS
SELECT user_id, request_date, SUM(amount) as amount
FROM table
GROUP BY user_id, request_date
After that, rewrite your query above like this, and it'll be extremely fast:
SELECT SUM(amount)
FROM aggregated_requests
WHERE user_id = ?
AND request_date <= ?
AND request_date >= ?
Plan A: INDEX(user_id, request_date, amount) -- optimal for the WHERE, also "covering". OK, you have that; so, on to plan B:
Plan B (even better): Build and maintain a Summary table of, say, daily subtotals. Then query that table instead. More: http://mysql.rjweb.org/doc.php/summarytables
Partitioning is unlikely to help more than a good index (as in Plan A).
More on B
If you need up-to-the-minute totals, there are multiple approaches to achieve it using summary tables without waiting until the next day.
Do an IODKU (INSERT ... ON DUPLICATE KEY UPDATE) against the summary table at the same time (possibly in a Trigger) that you insert the row data. This keeps the summary table up to date, but with non-trivial overhead; see the sketch after this list.
Hybrid. Reach into the summary table for whole days, then total up 'today' from the raw data and add it on.
Summarize by hour instead of by day. This either gives you only hourly resolution, or you can combine with the "hybrid" to make that run faster.
(My blog gives those 3, plus 3 more.)
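A minimal sketch of the IODKU approach, assuming the raw table's columns (user_id, request_date, amount); the summary table name daily_totals is illustrative:
CREATE TABLE daily_totals (
    user_id      INT NOT NULL,
    request_date DATE NOT NULL,
    total_amount BIGINT UNSIGNED NOT NULL DEFAULT 0,
    PRIMARY KEY (user_id, request_date)
);

-- Run alongside each raw insert (or from a trigger):
INSERT INTO daily_totals (user_id, request_date, total_amount)
VALUES (?, ?, ?)
ON DUPLICATE KEY UPDATE total_amount = total_amount + VALUES(total_amount);

-- The range query then scans at most a few hundred summary rows per user:
SELECT SUM(total_amount)
FROM daily_totals
WHERE user_id = ?
  AND request_date BETWEEN ? AND ?;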
Other
"Amount is 1-10" -- I hope you are using a 1-byte TINYINT, not a 4-byte INT. That's 300MB of difference. Perhaps user_id could be smaller than INT.

Optimizing summing/grouping query with millions of rows on MYSQL

I have a MySQL table with nearly 4,000,000 rows containing the income transactions of more than 100,000 employees.
Three columns in it are relevant:
Employee ID [VARCHAR and INDEX] (not unique since one employee gets more than one income);
Type of Income [also VARCHAR and INDEX]
Value of the Income [DECIMAL(10,2)]
What I was looking to do seems to be very simple to me. I wanted to sum all the income occurrences grouping by each employee, filtering by one type.
For that, I was using the following code:
SELECT
SUM(`value`) AS `SumofValue`,
`type`,
`EmployeeID`
FROM
`Revenue`
GROUP BY `EmployeeID`
HAVING `type` = 'X'
And the result was supposed to be something like this:
SUM TYPE EMPLOYEE ID
R$ 250,00 X 250000008377
R$ 5.000,00 X 250000004321
R$ 3.200,00 X 250000005432
R$ 1.600,00 X 250000008765
....
However, this is taking a long time. I decided to use the LIMIT clause to restrict the results to just 1,000 rows, and that works, but doing it for the whole table would take approximately one hour according to my projections. This seems to be way too much time for something that does not look so demanding to me (but I'm assuming I'm probably wrong). Not only that, but this is just the first step of an even more complex query that I intend to run in the future, in which I will also group by Employer ID along with Employee ID (one person can get income from more than one employer).
Is there any way to optimize this? Is there anything wrong with my code? Is there any secret path to increase the speed of this operation? Should I index the column of the value of the income as well? If this is a MySQL limitation, is there any option that could handle this better?
I would really appreciate any help.
Thanks in advance
DISCLOSURE: This is an open government database. All of this data is lawfully open to the public.
First, phrase the query using WHERE, rather than HAVING -- filter before doing the aggregation:
SELECT SUM(`value`) AS `SumofValue`,
MAX(type) as type,
EmployeeID
FROM Revenue r
WHERE `type` = 'X'
GROUP BY EmployeeID;
Next, try using this index: (type, EmployeeId, value). At the very least, this is a covering index for the query. MySQL (depending on the version) might be smart enough to use it for the aggregation as well.
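For example, assuming the table is named Revenue as in the query above (the index name is arbitrary):
ALTER TABLE Revenue ADD INDEX idx_type_emp_value (`type`, EmployeeID, `value`);
With this index, the WHERE clause becomes a range scan on type, the GROUP BY can follow the EmployeeID order within that range, and value makes the index covering, so the base table is never read.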
As per your schema, why are you using the VARCHAR datatype for Employee ID and Type?
You can create a reference table for Type with 1 --> X, 2 --> Y, ..., so the transaction table stores an integer reference for the type.
Just create a dummy table like the one below and execute the same query that was taking an hour; you should see a major change in the execution plan as well.
CREATE TABLE test_transaction
(
Employee_ID BIGINT,
Type SMALLINT,
Income DECIMAL(10,2)
)
Create separate indexes on the Employee_ID and Type columns.
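A rough sketch of that reference-table idea (the name income_types and the sample values are illustrative):
CREATE TABLE income_types (
    type_id   SMALLINT UNSIGNED NOT NULL PRIMARY KEY,
    type_code VARCHAR(10) NOT NULL UNIQUE
);
INSERT INTO income_types (type_id, type_code) VALUES (1, 'X'), (2, 'Y');

-- The transaction table then stores the integer type_id, and gets its own indexes:
CREATE INDEX idx_tt_employee ON test_transaction (Employee_ID);
CREATE INDEX idx_tt_type     ON test_transaction (Type);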

Is there a better way to optimise / index this query?

I have a semi-large (10,000,000+ record) credit card transaction database that I need to query regularly. I have managed to optimise most queries to be sub 0.1 seconds but I'm struggling to do the same for sub-queries.
The purpose of the following query is to obtain the number of "inactive" credit cards (credit cards that have not made a card transaction in the last x days / weeks) for both the current user's company, and all companies (so as to form a comparison).
The sub-query first obtains the last card transaction of every credit card, and then the parent query removes any expired credit cards and groups the cards based on their associated company and on whether or not they are deemed "inactive" (the (UNIX_TIMESTAMP() - (14 * 86400)) expression is used in place of a PHP time calculation).
SELECT
SUM(IF(LastActivity < (UNIX_TIMESTAMP() - (14 * 86400)), 1, 0)) AS AllInactiveCards,
SUM(IF(LastActivity >= (UNIX_TIMESTAMP() - (14 * 86400)), 1, 0)) AS AllActiveCards,
SUM(IF(LastActivity < (UNIX_TIMESTAMP() - (14 * 86400)) AND lastCardTransactions.CompanyID = 15, 1, 0)) AS CompanyInactiveCards,
SUM(IF(LastActivity >= (UNIX_TIMESTAMP() - (14 * 86400)) AND lastCardTransactions.CompanyID = 15, 1, 0)) AS CompanyActiveCards
FROM CardTransactions
JOIN
(
SELECT
CardSerialNumberID,
MAX(CardTransactions.Timestamp) AS LastActivity,
CardTransactions.CompanyID
FROM CardTransactions
GROUP BY
CardTransactions.CardSerialNumberID, CardTransactions.CompanyID
) lastCardTransactions
ON
CardTransactions.CardSerialNumberID = lastCardTransactions.CardSerialNumberID AND
CardTransactions.Timestamp = lastCardTransactions.LastActivity AND
CardTransactions.CardExpiryTimestamp > UNIX_TIMESTAMP()
The indexes in use are on CardSerialNumberID, CompanyID, Timestamp for the inner query, and CardSerialNumberID, Timestamp, CardExpiryTimestamp, CompanyID for the outer query.
The query takes around 0.4 seconds to execute when done multiple times, but the initial run can be as slow as 0.9 - 1.1 seconds, which is a big problem when loading a page with 4-5 of these types of query.
One thought I did have was to calculate the overall inactive card number in a routine separate to this, perhaps run daily. This would allow me to adjust this query to only pull records for a single company, thus reducing the dataset and bringing the query time down. However, this is only really a temporary fix, as the database will continue to grow until the same amount of data is being analysed anyway.
Note: The query above's fields have been modified to make them more generic, as the specific subject this query is used on is quite complex. As such there is no DB schema to give (and if there was, you'd need a dataset of 10,000,000+ records anyway to test the query I suppose). I'm more looking for a conceptual fix than for anyone to actually give me an adjusted query.
Any help is very much appreciated!
You're querying the transactions table twice, so the join can approach Transactions x Transactions in size, which might be big.
One idea would be to monitor all credit cards for the last x days/weeks and save them in an extra table INACTIVE_CARDS that gets updated every day (add a field with the number of days of inactivity). Then you could limit the SELECT in your subquery to just search in INACTIVE_CARDS
SELECT
CardSerialNumberID,
MAX(Transactions.Timestamp) AS LastActivity,
Transactions.CompanyID
FROM Transactions
WHERE CardSerialNumberID IN (SELECT CardSerialNumberID FROM INACTIVE_CARDS)
GROUP BY
Transactions.CardSerialNumberID, Transactions.CompanyID
Of course a card might have become active in the last hour, but you don't need to check all transactions for that.
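A rough sketch of how that extra table might be built and refreshed daily (names and types are assumptions; the 14-day threshold matches the query above):
CREATE TABLE INACTIVE_CARDS (
    CardSerialNumberID INT NOT NULL PRIMARY KEY,
    days_inactive      INT NOT NULL
);

-- Rebuild once a day, e.g. from a scheduled event or cron job:
TRUNCATE TABLE INACTIVE_CARDS;
INSERT INTO INACTIVE_CARDS (CardSerialNumberID, days_inactive)
SELECT CardSerialNumberID,
       FLOOR((UNIX_TIMESTAMP() - MAX(`Timestamp`)) / 86400)
FROM CardTransactions
GROUP BY CardSerialNumberID
HAVING MAX(`Timestamp`) < UNIX_TIMESTAMP() - (14 * 86400);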
Please use different "aliases" for the two instances of Transactions. What you have is confusing to read.
The inner GROUP BY:
SELECT card_sn, company, MAX(ts)
FROM Trans
GROUP BY card_sn, company
Now this index is good ("covering") for the inner:
INDEX(CardSerialNumberID, CompanyID, Timestamp)
Recommend testing (timing) the subquery by itself.
For the outside query:
INDEX(CardSerialNumberID, Timestamp, -- for JOINing (prefer this order)
CardExpiryTimestamp, CompanyID) -- covering (in this order)
Please move CardTransactions.CardExpiryTimestamp > UNIX_TIMESTAMP() to a WHERE clause. It is helpful to the reader that the ON clause contain only the conditions that tie the two tables together. The WHERE contains any additional filtering. (The Optimizer will run this query the same, regardless of where you put that clause.)
Oh. Can that filter be applied in the subquery? It will make the subquery run faster. (It may impact the optimal INDEX, so I await your answer.)
I have assumed that most rows have not "expired". If they have, then other techniques may be better.
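Putting the alias and WHERE suggestions together, the outer query might look roughly like this (ct and last_tx are illustrative aliases; the logic is unchanged):
SELECT
    SUM(IF(last_tx.LastActivity <  UNIX_TIMESTAMP() - (14 * 86400), 1, 0)) AS AllInactiveCards,
    SUM(IF(last_tx.LastActivity >= UNIX_TIMESTAMP() - (14 * 86400), 1, 0)) AS AllActiveCards,
    SUM(IF(last_tx.LastActivity <  UNIX_TIMESTAMP() - (14 * 86400) AND last_tx.CompanyID = 15, 1, 0)) AS CompanyInactiveCards,
    SUM(IF(last_tx.LastActivity >= UNIX_TIMESTAMP() - (14 * 86400) AND last_tx.CompanyID = 15, 1, 0)) AS CompanyActiveCards
FROM CardTransactions AS ct
JOIN (
    SELECT CardSerialNumberID, CompanyID, MAX(`Timestamp`) AS LastActivity
    FROM CardTransactions
    GROUP BY CardSerialNumberID, CompanyID
) AS last_tx
    ON  ct.CardSerialNumberID = last_tx.CardSerialNumberID
    AND ct.`Timestamp` = last_tx.LastActivity
WHERE ct.CardExpiryTimestamp > UNIX_TIMESTAMP();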
For much better performance, look into building and maintaining summary tables of the info. Or, perhaps, rebuild (daily) a table with these stats. Then reference the summary table instead of the raw data.
If that does not work, consider building a temp table with the "4-5" info at the start of the web page, then feeding the queries off that temp table.
Rather than repetitively calculating the 14-day offset and the current UNIX_TIMESTAMP(), follow the advice of
https://code.tutsplus.com/tutorials/top-20-mysql-best-practices--net-7855
and, prior to the SELECT, use code similar to:
$uts_14d = UNIX_TIMESTAMP() - (14 * 86400);
$uts = UNIX_TIMESTAMP();
then substitute the $uts_14d and $uts variables into the five lines of your query that use those expressions.

Optimizing MySQL Data Retrieval for Time Series Application

I'm working on a Web app to display some analytics data from a MYSQL database table. I expect to collect data from about 10,000 total users at the most. This table is going to have millions of records per user.
I'm considering giving each user their own table, but more importantly I want to figure out how to optimize data retrieval.
I get data from the database table using a series of SELECT COUNT queries for a particular day. An example is below:
SELECT * FROM
(SELECT COUNT(id) AS data_point_1 FROM my_table WHERE customer_id = '1' AND datetime_added LIKE '2013-01-20%' AND status_id = '1') AS col_1
CROSS JOIN
(SELECT COUNT(id) AS data_point_2 FROM my_table WHERE customer_id = '1' AND datetime_added LIKE '2013-01-20%' AND status_id = '0') AS col_2
CROSS JOIN ...
When I want to retrieve data from the last 30 days, the query will be 30 times as long as it is above; 60 days likewise, etc. The user will have the ability to select the number of days e.g. 30, 60, 90, and a custom range.
I need the data for a time series chart. Just to be clear, data for each day could range from thousands of records to millions.
My question is:
Is this the most performant way of retrieving this data, or is there a better way to get all the time series data I need in one SQL query? How is this going to work when a user needs data from the last 2 years, i.e. a MySQL query that is potentially over a thousand lines long?
Should I consider caching the retrieved data (using memcache, for example) for extended periods of time, e.g. an hour or more, to reduce server load? (Being that this is analytics data, it really should be real-time, but I'm afraid of overloading the server with queries for the same data even when there are no changes.)
Any assistance would be appreciated.
First, you should not put each user in a separate table. You have other options that are not nearly as intrusive on your application.
You should consider partitioning the data. Based on what you say, I would have one partition by time (by day, week, or month) and an index on the users. Your query should probably look more like:
select date(datetime), count(*)
from t
where userid = 1 and datetime between DATE1 and DATE2
group by date(datetime)
You can then pivot this, either in an outer query or in an application.
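For instance, one way to do the pivot directly in SQL is conditional aggregation, which also collapses the one-subquery-per-data-point pattern into a single pass (a sketch using the column names from the question; the dates are placeholders for a 30-day window):
SELECT DATE(datetime_added)   AS day,
       SUM(status_id = 1)     AS data_point_1,
       SUM(status_id = 0)     AS data_point_2
FROM my_table
WHERE customer_id = 1
  AND datetime_added >= '2013-01-01'
  AND datetime_added <  '2013-01-31'   -- end of range, exclusive
GROUP BY DATE(datetime_added);
Note that the range predicate on datetime_added can use an index on (customer_id, datetime_added), unlike the LIKE '2013-01-20%' pattern in the original query.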
I would also suggest that you summarize the data on a daily basis, so your analyses can run on the summarized tables. This will make things go much faster.
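A rough sketch of such a daily summary (the table daily_counts is illustrative and would be refreshed nightly for the previous day):
CREATE TABLE daily_counts (
    customer_id INT NOT NULL,
    report_date DATE NOT NULL,
    status_id   TINYINT NOT NULL,
    cnt         INT UNSIGNED NOT NULL,
    PRIMARY KEY (customer_id, report_date, status_id)
);

-- Nightly job: summarize yesterday's rows.
INSERT INTO daily_counts (customer_id, report_date, status_id, cnt)
SELECT customer_id, DATE(datetime_added), status_id, COUNT(*)
FROM my_table
WHERE datetime_added >= CURDATE() - INTERVAL 1 DAY
  AND datetime_added <  CURDATE()
GROUP BY customer_id, DATE(datetime_added), status_id
ON DUPLICATE KEY UPDATE cnt = VALUES(cnt);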

MySql Query Grouping by Time

I am trying to create a report to understand the time of day that orders are being placed, so I need to sum and group them by time. For example, I would like a sum of all orders placed between 1:00 and 1:59, then the next row listing the sum of all orders between 2:00 and 2:59, etc. The field is a datetime variable, but for the life of me I haven't been able to find the right query to do this. Any suggestions sending me down the right path would be greatly appreciated.
Thanks
If by luck it is MySQL, and by "sum of orders" you mean the number of orders and not the value amount:
select date_format(date_field, '%Y-%m-%d %H') as the_hour, count(*)
from my_table
group by the_hour
order by the_hour
This kind of grouping (using a calculated field) will certainly not scale over time. If you really need to execute this specific GROUP BY/ORDER BY frequently, you should create an extra field (an UNSIGNED TINYINT field will suffice) storing the hour, and place an INDEX on that column.
That is, of course, if your table is becoming quite big; if it is small (which cannot be stated in a mere number of records, because it is also a matter of server configuration and capabilities), you probably won't notice much difference in performance.
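A rough sketch of that suggestion, assuming MySQL 5.7+ where a stored generated column can keep the hour populated automatically (on older versions the column would need to be filled by the application or a trigger; order_hour is an illustrative name):
ALTER TABLE my_table
    ADD COLUMN order_hour TINYINT UNSIGNED
        GENERATED ALWAYS AS (HOUR(date_field)) STORED,
    ADD INDEX idx_order_hour (order_hour);

-- Time-of-day report across all days:
SELECT order_hour, COUNT(*) AS orders
FROM my_table
GROUP BY order_hour
ORDER BY order_hour;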