How to structure a MySQL table to avoid extra rows - mysql

I am maintaining a record of expenses. The expenses tables look like this:
Expenses(id,name)
Expenses_data(id,amount,expense_id)
Expenses are tracked over years, let's say 10 years, and I am saving them per month, so that would be 120 months.
If I had 10 expenses, then Expenses_data would have 120 * 10 = 1200 rows.
I want to reduce that from 1200 rows to 120 rows, with the data laid out the way I enter it in Excel:
id | month   | marketing | electricity | bank charges
1  | month-1 | 100       | 200         | 300
2  | month-2 | 95.5      | 5000        | 100
Please suggest whether this is possible, and how.

I think you probably want to stick with the database structure you already have, but use a query to display the data in the format you wish.
If you think about the number of data points you're storing, there's not much difference between your sought schema and what you already have -- it's still 1200 data points of expenses. Having to upgrade your schema each time you add an expense column would be pretty invasive.
Sticking with a query for your Excel export would allow the database to keep understanding the concept of expense categories, and updating your export query to include a new category would be much easier than modifying the schema. The necessary JOINs could even be built programmatically by first running a query that asks "What expense categories are known?"
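For illustration, one common way to do that pivot is conditional aggregation. This is only a sketch: it assumes Expenses_data also carries a month column (the schema in the question doesn't show where the month is stored) and uses the category names from the example.
SELECT d.month,
       SUM(CASE WHEN e.name = 'marketing'    THEN d.amount END) AS marketing,
       SUM(CASE WHEN e.name = 'electricity'  THEN d.amount END) AS electricity,
       SUM(CASE WHEN e.name = 'bank charges' THEN d.amount END) AS `bank charges`
FROM Expenses_data d
JOIN Expenses e ON e.id = d.expense_id
GROUP BY d.month
ORDER BY d.month;
Adding a new category then means adding one CASE line to the export query (or generating those lines from SELECT name FROM Expenses) rather than altering the table.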

Related

When should I build an "averages" table in SQL instead of querying AVG(x)?

I have a table of "outcomes" of scores for a few hundred people on specific days. E.g.
Person | Date     | Score
1      | 1/1/2021 | 10
2      | 1/1/2021 | 15
1      | 2/2/2022 | 20
3      | 2/2/2022 | 17
I will need to repeatedly compare each player's average score for a specific date range, e.g. get each player's average score between 1/1/2021 and 12/31/2021.
I know that I could query their averages using the AVG(Score) aggregate function, like SELECT Person, AVG(Score) FROM outcomes WHERE date < ? GROUP BY Person;
However, since I have hundreds of players with possibly hundreds of outcomes each, I am worried that repeatedly running this query will produce a lot of row reads. I am considering creating an "averages" table or view with an entry for each player on each unique date, where the Score is the average score of that player's outcomes before that date.
Something like:
Person | EndDate  | AVG(Score)
1      | 1/2/2021 | 10
2      | 1/2/2021 | 15
3      | 1/2/2021 | 0
1      | 2/3/2022 | 15
2      | 2/3/2022 | 15
3      | 2/3/2022 | 17
I realize that this is essentially at least doubling the amount of storage required, because each outcome will also have the associated "average" entry.
How is this kind of problem often addressed in practice? At what point does creating an "averages" table make sense? When is using the AVG(x) function most appropriate? Should I just add an "average" column to my "outcomes" table?
I was able to implement my query using the aggregate AVG(x) function, but I am concerned about the number of row reads that my database quickly started requiring.
What you are describing is a form of denormalization: storing the result of an aggregation instead of running the query every time you need it.
When to implement this? When running the query cannot be done fast enough to meet your performance goals.
Be cautious about adopting denormalization too soon. It comes with a cost.
The risk is that if your underlying data changes, but your denormalized copy is not updated, then the stored averages will be outdated. You have to decide whether it's acceptable to query outdated aggregate results from the denormalized table, and how often you want to update those stored results. There isn't one answer to this — it's up to your project requirements and your judgment.
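As a minimal sketch of what that denormalized copy could look like (the table name, datatypes, and the nightly REPLACE refresh are assumptions, not something prescribed by the question):
CREATE TABLE player_averages (
    person    INT NOT NULL,
    end_date  DATE NOT NULL,
    avg_score DECIMAL(10,2),
    PRIMARY KEY (person, end_date)
);

-- Refresh on whatever schedule your staleness tolerance allows, e.g. nightly:
REPLACE INTO player_averages (person, end_date, avg_score)
SELECT person, CURDATE(), AVG(score)
FROM outcomes
WHERE date < CURDATE()
GROUP BY person;
Reports then read the small player_averages table, and the cost of staleness is bounded by how often you run the refresh.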

Correct DB design to store a huge amount of cryptocurrency stock data in a DB

I want to store a large amount of cryptocurrency data in a DB, then show nice JavaScript price graphs with historical prices on a web page.
The problem is that I am not sure what database design is best for this. I was thinking about a MySQL DB, but maybe NoSQL DBs are better in this case, I don't know.
What I need:
I need to track at least 100 cryptocurrencies with historical and current prices and other stock information like volume, etc.
I am going to insert new data every 10 minutes for each crypto ((6 records/hour * 24 h * 365 days) * 100 cryptos = 5,256,000 new records per year).
I need to query various time ranges for each coin to draw a graph on the web page.
My idea:
I came up with this solution, but I need to know if it is OK or if I am completely wrong and naive.
In this case I would have 2 tables: a parent table where I store all the necessary info about the coins, and a child table holding all the prices. But this child table would have to contain a huge amount of data, which is worrying me.
My table structure example:
tbl_coin_detail:
id | Tick_name | Name     | Algorithm | Icon
1  | BTC       | Bitcoin  | SHA256    | path/to/img
2  | ETH       | Ethereum | Ethash    | path/to/img
...
tbl_prices:
id | price_USD | price_EUR | datetime            | Volume_Day_BTC     | FK_coin
1  | 6537.2    | 5632.28   | 2018-07-01 15:00:00 | 62121.7348556964   | 1
2  | 466.89    | 401.51    | 2018-07-01 15:01:00 | 156373.79481106618 | 2
...
Another idea is to make a separate table for each coin's prices; that would mean 100 tables with all the historical and current prices and stock info instead of one huge table.
I am really not sure which is better. All prices in one table is good for simple querying, but I guess it could become a huge performance bottleneck; separate tables would be worse for querying, because I would need to write a query per table, but it might help with performance.
Can you point me in the right direction? SQL DB or NoSQL, which is better?
Thank you in advance.
MySQL recommendations...
You have Volume_Day_BTC, yet you say "6 records/hour" -- is the record daily or more fine-grained?
The volume of data is not that great, but it will be beneficial to shrink the datatypes before you get started.
id is unnecessary; use PRIMARY KEY(coin, datetime) instead.
Think carefully about the datatype for prices and volumes. At one extreme is space (hence, somewhat, speed); at the other, precision.
DOUBLE -- 8 bytes, about 16 significant digits, large range
DECIMAL(17, 11) -- 8 bytes, limited to $1M and 11 decimal places (not enough?)
DECIMAL(26, 13) -- 12 bytes, maybe big enough?
etc.
Would it be OK to summarize data over, say, one month to save space? Hourly or daily avg/hi/low, etc. This would be very useful for speeding up fetching data for graphing.
In particular, I recommend keeping a Summary table by coin+day with volume, price, etc. Consider using FLOAT (4 bytes, 7 significant digits, sufficient range) as more than good enough for graphing.
So, I am recommending 3 tables:
Coins -- 100 rows with meta info about the currencies.
Prices -- 5M rows/year of details -- unless trimmed (400MB/year)
Summary -- 36500 rows/year for graphing range more than, say, a week. (4MB/yr)
It may be worth it to have an hourly summary table for shorter-range graphs. There is no need to go with weekly or monthly summaries; they can be derived from the daily with sufficient efficiency.
Use InnoDB.
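As a concrete sketch of those three tables (the column lists are trimmed to the essentials, and the exact datatypes are the judgment call discussed above, so treat this as illustrative rather than definitive):
CREATE TABLE Coins (
    id        SMALLINT UNSIGNED NOT NULL AUTO_INCREMENT,
    tick_name VARCHAR(10)  NOT NULL,
    name      VARCHAR(50)  NOT NULL,
    algorithm VARCHAR(20),
    icon      VARCHAR(255),
    PRIMARY KEY (id)
) ENGINE=InnoDB;

CREATE TABLE Prices (
    coin      SMALLINT UNSIGNED NOT NULL,   -- references Coins.id
    datetime  DATETIME NOT NULL,
    price_usd DECIMAL(17,11) NOT NULL,
    price_eur DECIMAL(17,11) NOT NULL,
    volume    DOUBLE NOT NULL,
    PRIMARY KEY (coin, datetime)            -- no surrogate id needed
) ENGINE=InnoDB;

CREATE TABLE Summary (                      -- one row per coin per day
    coin    SMALLINT UNSIGNED NOT NULL,
    day     DATE NOT NULL,
    avg_usd FLOAT NOT NULL,
    hi_usd  FLOAT NOT NULL,
    lo_usd  FLOAT NOT NULL,
    volume  FLOAT NOT NULL,
    PRIMARY KEY (coin, day)
) ENGINE=InnoDB;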
Summary tables
To be honest, that's far from 'huge'. We aren't talking billions of records here, so any properly indexed DB will do just fine.

Dynamically creating tables based on data for year and quarter

I am trying to store approximately 11 billion records in the destination table by querying various source tables via inner join.
I would need to store the data based on Year and Quarter. There is a need to store data from year 2000 onward.
So I would have tables named, for example:
FinData2015_1
FinData2015_2
FinData2015_3
FinData2015_4
FinData2016_1
FinData2016_2
FinData2016_3
FinData2016_4
I planned to create the physical tables right from year 2000 to 50 years from now and implement a split condition component in SSIS.
So I would create 67 tables in all, and 67 split conditions.
Is there a better way to do this, i.e. creating a table dynamically only if data for that year and quarter exist?
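For what it's worth, the "create only when the data exists" idea can be sketched roughly like this (shown in MySQL syntax for consistency with the rest of this page; for the SQL Server/SSIS setup in the question you would use IF OBJECT_ID(...) IS NULL plus dynamic SQL instead, and FinDataTemplate is an assumed template table holding the shared column definitions):
-- Build the table name for the current year and quarter, then create it if missing.
SET @tbl = CONCAT('FinData', YEAR(CURDATE()), '_', QUARTER(CURDATE()));
SET @ddl = CONCAT('CREATE TABLE IF NOT EXISTS ', @tbl, ' LIKE FinDataTemplate');
PREPARE stmt FROM @ddl;
EXECUTE stmt;
DEALLOCATE PREPARE stmt;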

Consolidated Data, except under certain conditions

I have the following results (temp table):
Product | Side(buy/sell) | TotalQuantity | AverageWeightedPrice | Cost
Prod1   | 1              | 100           | 120                  | 12,000
Prod1   | 2              | -50           | 130                  | -6,500
Prod2   | ...            |               |                      |
Prod2   | ...            |               |                      |
So on and so forth for multiple products.
I consolidated it to (with a GROUP BY):
Product | Side(buy/sell) | TotalQuantity | AverageWeightedPrice | Cost
Prod1   | 1              | 50            | 110                  | 5,500
Prod2   | ...            |               |                      |
I want the consolidated results, except under certain conditions:
When Side 1 and Side 2 have the same quantity, the consolidated quantity would be 0 and I would no longer be able to calculate the AverageWeightedPrice.
Besides consolidated quantity = 0, the other condition is when TotalQuantity and Cost have opposite signs, i.e. when Quantity is positive and Cost is negative, or vice versa.
Under those conditions, I would want to return the unconsolidated data.
I am having trouble switching between consolidated and unconsolidated data in the same query.
This is not something that SQL is really good for.
It is totally possible to produce this result with pure SQL, but the solution is very complex, and you should not do it if you have any other option, since SQL does not provide much optimization for complex tasks like this and the syntax can be hard to master. Maintaining the query could also prove to be a hard task in the future.
In cases like this you should use SQL only to fetch the data, and then format (consolidate) it using whichever programming language or application is executing the query.
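To illustrate why the pure-SQL route gets complex, here is a rough sketch of just the "consolidated unless the netted quantity is zero" rule; the table name temp_results and the plain Side column are assumptions based on the question's example, and the opposite-sign condition would still have to be folded into the same HAVING clauses:
-- Consolidated rows for products whose netted quantity is non-zero...
SELECT Product, MIN(Side) AS Side, SUM(TotalQuantity) AS TotalQuantity,
       SUM(Cost) / SUM(TotalQuantity) AS AverageWeightedPrice,
       SUM(Cost) AS Cost
FROM temp_results
GROUP BY Product
HAVING SUM(TotalQuantity) <> 0
UNION ALL
-- ...and the raw, unconsolidated rows for the rest.
SELECT t.Product, t.Side, t.TotalQuantity, t.AverageWeightedPrice, t.Cost
FROM temp_results t
JOIN (SELECT Product
      FROM temp_results
      GROUP BY Product
      HAVING SUM(TotalQuantity) = 0) z ON z.Product = t.Product;
Every extra exception means another branch like this, which is exactly the maintenance burden described above.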

Should I worry about 1B+ rows in a table?

I've got a table which keeps track of article views. It has the following columns:
id, article_id, day, month, year, views_count.
Let's say I want to keep track of daily views, one row per day for every article. If I have 1,000 user-written articles, the number of rows would compute to:
365 (1 year) * 1,000 => 365,000
Which is not too bad. But let's say the number of articles grows to 1M and three years pass by. The number of rows would then compute to:
365 * 3 * 1,000,000 => 1,095,000,000
Obviously, over time, this table will keep growing, and quite fast. What problems will this cause? Or should I not worry, since RDBMSs handle situations like this quite commonly?
I plan on using the views data in our reports, broken down by month or even year. Should I worry about 1B+ rows in a table?
The question to ask yourself (or your stakeholders) is: do you really need 1-day resolution on older data?
Have a look into how products like MRTG, via RRD, do their logging. The theory is you don't store all the data at maximum resolution indefinitely, but regularly aggregate them into larger and larger summaries.
That allows you to have 1-second resolution for perhaps the last 5-minutes, then 5-minute averages for the last hour, then hourly for a day, daily for a month, and so on.
So, for example, if you have a bunch of records like this for a single article:
year | month | day | count | type
-----+-------+-----+-------+------
2011 | 12 | 1 | 5 | day
2011 | 12 | 2 | 7 | day
2011 | 12 | 3 | 10 | day
2011 | 12 | 4 | 50 | day
You would then at regular periods create a new record(s) that summarises these data, in this example just the total count for the month
year | month | day | count | type
-----+-------+-----+-------+------
2011 | 12 | 0 | 72 | month
Or the average per day:
year | month | day | count | type
-----+-------+-----+-------+------
2011 | 12 | 0 | 2.3 | month
Of course you may need some flag to indicate the "summarised" status of the data; in this case I've used a 'type' column to distinguish the "raw" records from the processed records, which lets you purge the day records as required.
INSERT INTO statistics (article_id, year, month, day, count, type)
SELECT article_id, year, month, 0, SUM(count), 'month'
FROM statistics
WHERE type = 'day'
GROUP BY article_id, year, month
(I haven't tested that query, it's just an example)
The answer is "it depends", but yes, it will probably be a lot to deal with.
However, this is generally a problem of "cross that bridge when you need to". It's a good idea to think about what you could do if this becomes a problem for you in the future, but it's probably too early to actually implement any suggestions until they're necessary.
My suggestion, if that ever happens, is to not keep the individual records for longer than X months (where you adjust X according to your needs). Instead, store the aggregated data that you currently feed into your reports. You'd run, say, a daily script that looks at your records, grabs any that are over X months old, creates a "daily_stats" record of some sort, and then deletes the originals (or better yet, archives them somewhere).
This will ensure that only X months' worth of data are ever in the db - but you still have quick access to an aggregated form of the stats for long-timeline reports.
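A minimal sketch of that daily roll-up job, assuming the table is called article_views, the aggregate lands in a monthly_stats table, and the cutoff is 6 months (all of those names and the cutoff are illustrative):
-- Roll old rows up into one row per article per month...
INSERT INTO monthly_stats (article_id, year, month, views_count)
SELECT article_id, year, month, SUM(views_count)
FROM article_views
WHERE (year, month) < (YEAR(CURDATE() - INTERVAL 6 MONTH),
                       MONTH(CURDATE() - INTERVAL 6 MONTH))
GROUP BY article_id, year, month;

-- ...then delete (or archive) the originals.
DELETE FROM article_views
WHERE (year, month) < (YEAR(CURDATE() - INTERVAL 6 MONTH),
                       MONTH(CURDATE() - INTERVAL 6 MONTH));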
It's not something you need to worry about if you can put some practices in place.
Partition the table; this should make archiving easier to do (see the sketch after this list)
Determine how much data you need at present
Determine how much data you can archive
Ensure that the table has the right build, perhaps in terms of data types and indexes
Schedule for a time when you will archive partitions that meet the aging requirements
Schedule for index checking (and other table checks)
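A minimal sketch of the partitioning idea from the first point, assuming the table is called article_views and partitioning by the existing year column (note that MySQL requires the partitioning column to be part of every unique key, so the primary key may need to become e.g. (id, year)):
ALTER TABLE article_views
    PARTITION BY RANGE (year) (
        PARTITION p2011 VALUES LESS THAN (2012),
        PARTITION p2012 VALUES LESS THAN (2013),
        PARTITION pmax  VALUES LESS THAN MAXVALUE
    );

-- Archiving a whole year later becomes a cheap metadata operation:
-- ALTER TABLE article_views DROP PARTITION p2011;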
If you have a DBA in your team, then you can discuss it with him/her, and I'm sure they'll be glad to assist.
Also, as is done in many data warehouses (and I just saw @Taryn's post, which I agree with): store aggregated data as well. What to aggregate follows naturally from the data you keep in the table involved. If you are worried about records being edited or updated later, that makes it even clearer that you will need to set restrictions, such as how much recent data to keep (that is, the data that can still be modified), and have procedures and jobs in place to ensure that the aggregated data is checked/updated daily and can be updated/checked manually whenever changes are made. This way, data integrity is maintained. Discuss with your DBA what other approaches you can take.
By the way, in case you didn't already know: aggregated data are normally needed for weekly or monthly reports, and many other reports based on an interval. Choose the granularity of your aggregation as needed, but not so fine that it becomes too tedious or seemingly exaggerated.