I have a MySQL table which contains the aggregate revenue of a company for two years, broken down by client. I use it to analyse current-year revenue and to forecast future revenue. It has these fields for time:
MonthID,
QuarterID and
YearID
It holds month-wise data, and also quarter-wise rows (I could derive those by aggregating the months, but I added them as rows for faster SELECTs since I show them on a graph), and likewise year-wise rows.
Now, to reduce the row count and to optimize for faster SELECTs, I have removed MonthID, QtrID and YrID, and added a new column, Frequency, with values like
Last Month
1st Qtr
2nd Qtr
and so on. This halved my row count, but it still does not feel very optimized. Expert advice on what else can be done would be highly appreciated. My table has around a million records.
Data Warehousing with millions of rows begs for "Summary Tables". With them you can get significant speedup; typically 10-fold.
I would build Summary Table(s) on a per-month basis (based on your specs), then roll them up to get Qtr and Year "reports".
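The per-month summary table idea might be sketched like this (table and column names here are illustrative, not taken from the question):

```sql
-- Sketch: one row per client per month, refreshed as new data arrives
CREATE TABLE revenue_monthly (
    client_id INT NOT NULL,
    yr        SMALLINT NOT NULL,
    mo        TINYINT NOT NULL,        -- 1..12
    revenue   DECIMAL(14,2) NOT NULL,
    PRIMARY KEY (client_id, yr, mo)
);

-- Roll the months up into quarters at report time:
SELECT client_id, yr, CEILING(mo / 3) AS qtr, SUM(revenue) AS qtr_revenue
FROM revenue_monthly
GROUP BY client_id, yr, qtr;
```

A yearly report is the same query with `GROUP BY client_id, yr`; neither needs extra pre-aggregated rows in the base table.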
I have a pair of blogs that go into more detail:
http://mysql.rjweb.org/doc.php/datawarehouse and
http://mysql.rjweb.org/doc.php/summarytables
One million rows is somewhat small as DW applications go. Those blogs are aimed at that size, plus much bigger DWs (10s, even 100s, of millions of rows). (For 1 billion rows, even more techniques need to be pulled into play.)
If you have further difficulties, please provide SHOW CREATE TABLE and some tentative SELECTs. Those would help me point out details.
Suppose I have a simple table with these columns:
| id | user_id | order_id |
About 1,000,000 rows are inserted into this table per month, and clearly the relation between user_id and order_id is 1 to M.
The records from the last month are needed for accounting; the rest are only for showing order histories to users. To archive records older than the past month, I have two options in mind:
First, create a similar table and copy the old records into it each month, so it grows every month in line with order volume.
Second, create a table like this:
| id | user_id | order_idssss |
Each month, for each row to be inserted into this table, if the user_id already exists, just update order_ids by appending the new order_id to the end.
With this solution the number of rows in the table grows only with the user count.
Suppose that for each solution we have an index on user_id.
Now the question is: which one is more optimized for SELECTing all order_ids per user under server load?
The first one has far more records than the second, but the second requires application code to split order_ids apart.
The first choice is the better choice from among the two you have shown. With respect, I should say your second choice is a terrible idea.
MySQL (like all SQL DBMSs) is excellent at handling very large numbers of rows of uniformly laid out (that is, normalized) data.
But, your best choice is to do nothing except create appropriate indexes to make it easy to look up order history by date or by user. Leave all your data in this table and optimize lookup instead.
Until this table contains at least fifty million rows (at least four years' worth of data), the time you spend reprogramming your system to allow it to be split into a current and an archive version will be far more costly than just keeping it together.
If you want help figuring out which indexes you need, you should ask another question showing your queries. It's not clear from this question how you look up orders by date.
In a 1:many relationship, don't make an extra table. Instead have the user_id be a column in the Orders table. Furthermore, this is likely to help performance:
PRIMARY KEY(user_id, order_id),
INDEX(order_id)
Is a "month" a calendar month? Or "30 days ago until now"?
If it is a calendar month, consider PARTITION BY RANGE(TO_DAYS(datetime)) and have an ever-increasing list of monthly partitions. However, do not create future months in advance; create them just before they are needed. More details: http://mysql.rjweb.org/doc.php/partitionmaint
Note: This would require adding datetime to the end of the PK.
At 4 years' worth of data (48 partitions), it will be time to rethink things. (I recommend not going much beyond that number of partitions.)
Read about "transportable tablespaces". This may become part of your "archiving" process.
Use InnoDB.
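Putting the pieces above together, a sketch of the layout might look like this (column names and types are assumptions for illustration):

```sql
-- Sketch: Orders partitioned by calendar month, datetime appended to the PK
CREATE TABLE Orders (
    user_id  INT NOT NULL,
    order_id INT NOT NULL,
    dt       DATETIME NOT NULL,
    PRIMARY KEY (user_id, order_id, dt),
    INDEX (order_id)
)
ENGINE=InnoDB
PARTITION BY RANGE (TO_DAYS(dt)) (
    PARTITION p201711 VALUES LESS THAN (TO_DAYS('2017-12-01')),
    PARTITION p201712 VALUES LESS THAN (TO_DAYS('2018-01-01'))
    -- add the next month's partition just before it is needed
);
```

Note that MySQL requires the partitioning column (dt) to appear in every unique key, which is another reason it goes into the PK; the plain INDEX(order_id) is unaffected.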
With that partitioning, either of these becomes reasonably efficient:
WHERE user_id = 123
AND datetime > CURDATE() - INTERVAL 30 DAY
WHERE user_id = 123
AND datetime >= '2017-11-01' -- or whichever start-of-month you need
Each of the above will hit at most one non-empty partition more than the number of months desired.
If you want to discuss this more, please provide SHOW CREATE TABLE (in any variation), plus some of the important SELECTs.
I want to store daily fund data for approximately 2000 funds over 20 years or more. At first I figured I would just create one giant table with one column per fund and one row per date. I ran into trouble trying to create this table and also realise that a table like that would have a lot of NULL values (almost half the values would be NULL).
Is there a more efficient way of structuring the table or database for quickly finding and fetching the data for a specific fund over hundreds (or thousands) of days?
The alternative way I've thought of doing this is with three columns (date, fund_id, fund_value). This however does not seem optimal to me since both the date and fund_id would be duplicated many times over. Having a few million data points just for the date (instead of a few thousand) seems wasteful.
Which is the better option? Or is there a better way to accomplish this?
Having the three columns you mention is fine. fund_value is the price of fund_id on fund_date, so fund_id and fund_date together form the PK of this table.
I don't understand what you mean by "having a few million data points just for the date...". If you have 2,000 funds, a particular date will appear in at most 2,000 rows, one for each fund. This is not needless duplication; it is necessary to uniquely identify the value of a particular fund on a particular date.
If you added, say, fund_name to the table, that would be needless duplication, since we assume the fund name will not change from day to day. Unchanging (static) data about each fund belongs in a separate table; the fund_id field of this table would then be an FK reference to that static table.
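As a sketch, the pair of tables described might look like this (names are illustrative):

```sql
-- Static data about each fund, one row per fund
CREATE TABLE fund (
    fund_id   INT PRIMARY KEY,
    fund_name VARCHAR(100) NOT NULL
);

-- One row per fund per date; the composite PK also serves fast lookups
CREATE TABLE fund_value_history (
    fund_id    INT NOT NULL,
    fund_date  DATE NOT NULL,
    fund_value DECIMAL(12,4) NOT NULL,
    PRIMARY KEY (fund_id, fund_date),
    FOREIGN KEY (fund_id) REFERENCES fund(fund_id)
);
```

With (fund_id, fund_date) as the PK, fetching hundreds or thousands of days for one fund is a single index range scan.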
To query the value of the funds on a particular date:
select fund_date as ValueDate, fund_id, fund_value
from fund_value_history
where fund_date = #aDate
and fund_id = #aFund -- to limit to a particular fund
To show the dates a fund increased in value from one day to the next:
select h1.fund_date, h2.fund_value as PreviousValue,
h1.fund_value PresentValue
from fund_value_history h1
join fund_value_history h2
on h2.fund_id = h1.fund_id
and h2.fund_date =(
select max( fund_date )
from fund_value_history
where fund_id = h2.fund_id
and fund_date < h2.fund_date )
where h2.fund_value < h1.fund_value
and h1.fund_id = #aFund;
This would be a sizable result set but you could modify the WHERE clause to show, for example, all funds whose values on a particular date was greater than the previous day, or the values of all funds (or particular fund) on a particular date and the previous day, or any number of interesting results.
You could then join to the static table to add fund name or any other descriptive data.
The three column approach you considered is the correct one. There would be no wasted space due to missing values, and you can add and remove funds at any time.
Have a search for "database normalisation", which is the discipline that covers this sort of design decision.
Edit: I should add that you're free to include other metrics in that table, of course. Since historical data is effectively static you can also store "change since previous day" as well, which is redundant strictly speaking, but may help to optimise certain queries such as "show me all the funds that decreased in value on this day".
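On MySQL 8.0+, the "change since previous day" can also be computed on the fly with a window function rather than stored (a sketch against the same illustrative fund_value_history table):

```sql
-- Day-over-day change per fund, computed at query time (MySQL 8.0+)
SELECT fund_id,
       fund_date,
       fund_value,
       fund_value - LAG(fund_value) OVER w AS change_since_prev
FROM fund_value_history
WINDOW w AS (PARTITION BY fund_id ORDER BY fund_date);
```

Storing the precomputed change, as suggested above, still wins when the same derived value is read far more often than it is written.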
I'm not a database specialist, so I'm coming here for a little help.
I have plenty of measured data and I want some help with data manipulation. Here is my situation:
There are ca. 10 stations, measuring every day. Each one produces ca. 3,000 rows (with ca. 15 columns) of data per day. The data has to be downloaded once a day from every station to a centralized server. That means ca. 30,000 rows inserted into the database every day. (Daily counts vary.)
I also already have data from a few past years, so for every station I have a few million rows. There are also ca. 20 "dead" stations that no longer operate but have a few years' worth of data.
Summing this all up gives ca. 50+ million rows, produced by 30 stations, with ca. 30,000 rows inserted every day. Looking ahead, let's assume 100 million rows in the database.
My question is obvious - how would you suggest to store this data?
The measured values (columns) are only numbers (INT or DOUBLE, plus DATETIME); no text or fulltext search. Basically the only index I need is on the DATETIME column.
Data will not be updated or deleted. I just need a fast SELECT over a range of data (e.g. from 2010-01-01 to 2010-02-03).
As I wrote, I want to use MySQL because that's the database I know best. I've read that it should easily handle this amount of data, but I'd still appreciate any suggestions for this specific situation.
Again:
10 stations, 3,000 rows per day each => ca. 30,000 inserts per day
ca. 40-50 million rows still to be inserted from binary files
the DB is going to grow (100+ million rows)
The only thing I need is to SELECT data as fast as possible.
As far as I know, MySQL should handle this amount of data. I also know that my only index will be the date and time, stored as DATETIME (should that be faster than other types?).
What I can't decide is whether to create one huge table with 50+ million rows (including a station id) or a separate table per station. Basically, I don't need to JOIN across stations; if I need time coincidence, I can just select the same time range from each station. Are there any dis/advantages to either approach?
Can anyone confirm or refute my thoughts? Do you think there is a better solution? I appreciate any help or discussion.
MySQL should be able to handle this pretty well. Instead of indexing just your DATETIME column, I suggest you create two compound indexes, as follows:
(datetime, station)
(station, datetime)
Having both these indexes in place will help accelerate queries that choose date ranges and group by stations or vice versa. The first index will also serve the purpose that just indexing datetime will serve.
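Concretely, assuming the measurements live in a single table called readings (the name is an assumption), the two indexes could be added like this:

```sql
-- Two compound indexes: one leading on the time, one on the station
ALTER TABLE readings
    ADD INDEX idx_dt_station (`datetime`, station),
    ADD INDEX idx_station_dt (station, `datetime`);
```

The first serves pure date-range scans (and anything a plain index on `datetime` would serve); the second serves "one station over a time range" without touching rows for other stations.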
You have not told us what your typical query is. Nor have you told us whether you plan to age out old data. Your data is an obvious candidate for range partitioning (http://dev.mysql.com/doc/refman/5.6/en/partitioning-range.html) but we'd need more information to help you design a workable partitioning criterion.
Edit after reading your comments.
A couple of things to keep in mind as you build up this system.
First, don't bother with partitions for now.
Second, I would get everything working with a single table. Don't split stuff by station or year. Get yourself the fastest disk storage system you can afford and a lot of RAM for your MySQL server and you should be fine.
Third, take some downtime once in a while to do OPTIMIZE TABLE; this will make sure your indexes are good.
Fourth, don't use SELECT * unless you know you need all the columns in the table. Why? Because
SELECT datetime, station, temp, dewpoint
FROM table
WHERE datetime >= DATE(NOW() - INTERVAL 60 DAY)
ORDER BY station, datetime
can be directly satisfied from sequential access to a compound covering index on
(station, datetime, temp, dewpoint)
whereas
SELECT *
FROM table
WHERE datetime >= DATE(NOW() - INTERVAL 60 DAY)
ORDER BY station, datetime
needs to random-access your table. You should read up on compound covering indexes.
Fifth, avoid the use of functions with column names in your WHERE clauses. Don't say
WHERE YEAR(datetime) >= 2003
or anything like that. MySQL can't use indexes for that kind of query. Instead say
WHERE datetime >= '2003-01-01'
to allow the indexes to be exploited.
I would like to run some queries in MySQL that I know will be really slow:
I have 3 tables:
Users:
id, username, email
Question:
id, date, question
Answer:
id_question, id_user, response, score
And I would like to compute some statistics, like the top X users by total score (sum of all their scores), either for all time or for a given period (last month, for example). Or it could be the users ranked between 100th and 110th.
I will have thousands of users and hundreds of questions, so the queries could be very slow, since I'll need to order by sum of scores, limit to a given range, and sometimes select only some questions depending on the date, ...
I would like to know if there are methods to optimize these queries!
If you have a lot of data there is no other choice: you can only optimize by creating a new table that summarizes the data every day, week, or month. For example, summarize each user's scores per week, stamped with that week's date, or per month. The longer the pre-summed range, the faster your query runs.
For archived statistics, you can create tables that store rankings that will never change (last year, last month, last day). Pre-calculate as many statistics as possible in such tables, and put indexes on id_user, date, type_of_ranking...
Try to limit subqueries as much as possible.
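One way to sketch the summary idea for the score rankings (table and column names assumed from the question, not confirmed):

```sql
-- Sketch: weekly score totals per user, refreshed by a periodic job
CREATE TABLE weekly_scores (
    user_id    INT NOT NULL,
    week_start DATE NOT NULL,
    score_sum  INT NOT NULL,
    PRIMARY KEY (user_id, week_start)
);

-- Top 10 users over the last month, read from the summary
-- instead of scanning the full Answer table
SELECT user_id, SUM(score_sum) AS total
FROM weekly_scores
WHERE week_start >= CURDATE() - INTERVAL 1 MONTH
GROUP BY user_id
ORDER BY total DESC
LIMIT 10;
```

The 100th-110th range is the same query with `LIMIT 100, 11`, and an all-time ranking simply drops the WHERE clause.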
I have an order table that contains dates and amounts for each order. This table is big, containing more than 1,000,000 records, and growing.
We need to create a set of queries to calculate certain milestones. Is there a way in MySQL to figure out on which date we reached an aggregate milestone of x amount?
E.g. we crossed 1M in sales on '2011-01-01'.
Currently we scan the entire table and then use logic in PHP to figure out the date, but it would be great if this could be done in MySQL without reading so many records at once.
There may be more elegant approaches, but what you can do is maintain a row in another table containing current_sales and the date it occurred. Every time you have a sale, increment the value and store the sale date. If the expected milestones (1 million, 2 million, etc.) are known in advance, you can record them as they occur (in the same or a different table).
I think using Gunner's logic with a trigger is a good option, as it reduces the effort of maintaining the row; the trigger could also send a mail notification about the milestone status.
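A minimal sketch of that trigger approach (table and column names such as orders, amount, and order_date are assumptions, not from the question):

```sql
-- Single running-total row, updated on every insert into orders
CREATE TABLE sales_running_total (
    id            TINYINT PRIMARY KEY,    -- always 1
    current_sales DECIMAL(16,2) NOT NULL,
    last_sale_on  DATE NOT NULL
);

DELIMITER //
CREATE TRIGGER orders_after_insert
AFTER INSERT ON orders
FOR EACH ROW
BEGIN
    -- NEW.amount / NEW.order_date are assumed column names
    UPDATE sales_running_total
       SET current_sales = current_sales + NEW.amount,
           last_sale_on  = NEW.order_date
     WHERE id = 1;
END//
DELIMITER ;
```

Checking the running total against the next milestone (and recording the date when it is crossed) can then happen inside the same trigger body, so no full-table scan is ever needed.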