I have created a train schedule database in MYSQL. There are several thousand routes for each day. But with a few exceptions most of the routes are similar for every working day, but differ on weekends.
At this time I basically update my SQL tables at midnight each day, to get the departures for the next 24 hours. This is however very inconvenient. So I need a way to store dates in my tables so I don't have to do this every day.
I tried to create a separate table where I stored dates for each routenumber (routenumbers are resetted each day), but this made my query so slow that it was impossible to use. Does this mean I would have to store my departure and arrival times as datetimes? In that case the main table containing routes would have several million entries.
Or is there another way?
My routetable looks like this:
StnCode (referenced in seperate Station table)
DepTime
ArrTime
Routenumber
legNumber
How were you storing the dates? A single date/time field? That'd certainly be the most compact representation, but also the most difficult to index and scan, especially if you're doing queries of the following type:
SELECT ...
WHERE MONTH(DepTime) = 4 AND DAY(DepTime) = 19;
Such a construct would require a full table scan to tear apart each date field and extract the month/day. For such a case, it'd be better to denomalize a bit and split the datetime into seperate year/month/day/hour/minute fields and place indeces onto them. Bit more of a hassle to maintain, but would also speed up querying by specific time parts immensely.
Instead of storing Schedules in terms of dates, you can store them against day (Sun, Mon, Tue, etc). This will eliminate storing the dates for routes. You can treat the routes as predetermined, and thus they are fixed. As the number of trains are around 8000(passenger trains) and days are fixed (7), routes are (50-1000), each table like, 1, 1A, AS PUBLISHED IN RAILWAY BOOKS,
This will avoid storing huge combinations of train schedules into the db since every date is translated into one of the weekdays and we are not missing any data.
You can create a table for storing days which will have at most 7 days.
I would suggest to model the database in such a way, that each station is a touch point, and not as station id....
and you can introduce the hub concept in the design, to identify 3-4 stations, which are of the same city....
each station is a touch point, where it is supported by facilities like,boarding point,HALT POINT etc...
cause, not all the stations are boarding points for all the trains..
facilities are the ones which are available at different stations...
all facilities are not available for all the trains...,
ex: Kazipet is a station, which is also a junction...but for FEW TRAINS,ON few routes,
they pass thru the station, and it also halts at the station, but, it will not allow new passengers to board at the station(s).
But, it will allow the same on reverse routes...
Related
In order to analyze dates and times I am creating a MySQL table where I want to keep the time information. Some example analyses will be stuff like:
Items per day/week/month/year
Items per weekday
Items per hour
etc.
Now in regards to performance, what way should I record in my datatable:
date type: Unix timestamp?
date type: datetime?
or keep date information in one row each, e.g. year, month, day in separate fields?
The last one, for example, would be handy if I'm analysing by weekday; I wouldn't have to perform WEEKDAY(item.date) on MySQL but could simply use WHERE item.weekday = :w.
Based on your usage, you want to use the native datetime format. Unix formats are most useful when the major operations are (1) ordering; (2) taking differences in seconds/minutes/hours/days; and (3) adding seconds/minutes/hours/days. They need to be converted to internal date time formats to get the month or week day, for instance.
You also have a potential indexing issue. If you want to select ranges of days, hours, months and so on for your results, then you want an index on the column. For this purpose an index on a datetime is probably sufficient.
If the summaries are by hour, you might find it helpful to stored the date component in a date field and the hour in a separate column. That would be particularly helpful if you are combining hours from different days.
Whether you break out other components of the date, such as weekday and month, for indexing purposes would depend on the volume of data in the table, performance requirements, and the queries you are planning on running. I would not be inclined to do this, except as a later optimization.
The rule of thumb is: store things as they should be stored, don't do performance tweaks until you're hitting the bottleneck. If you store your date as separate fields, you'll eventually stumble upon a situation you need this date as a whole inside your database (e.g. update query for a particular range of time), and this will be like hell - condition from 3 april 2015 till 15 may 2015 would be as giant as possible.
You should keep your dates as date type. This will grant you maximum flexibility, (most probably) query readability and will keep all of your opportunities to work with them. The only thing I really can recommend is storing the same date divided into year/month/day in next columns - of course, this will bloat your database and require extreme caution on update scenarios, but this will allow you to use any variant of source data in your queries.
There is some value, x, which I am recording every 30 seconds, currently into a database with three fields:
ID
Time
Value
I am then creating a mobile app which will use that data to plot charts in views of:
Last hour
Last 24 hours.
7 Day
30 Day
Year
Obviously, saving every 30 seconds for the last year and then sending that data to a mobile device will be too much (it would mean sending 1051200 values).
My second thought was perhaps I could use the average function in MySQL, for example, collect all of the averages for every 7 days (creating 52 points for a year), and send those points. This would work, but still MySQL would be trawling through creating averages and if many users connect, it's going to be bad.
So simply put, if these are my views, then I do not need to keep track of all that data. Nobody should care what x was a year ago to the precision of every 30 seconds, this is fine. I should be able to use "triggers" to create some averages.
I'm looking for someone to check what I have below is reasonable:
Store values every 30s in a table (this will be used for the hour view, 120 points)
When there are 120 rows are in the 30s table (120 * 30s = 60 mins = 1 hour), use a trigger to store the first half an hour in a "half hour average" table, remove the first 60 entries from the 30s table. This new table will need to have an id, start time, end time and value. This half hour average will be used for the 24 hour view (48 data points).
When the half hour table has more than 24 entries (12 hours), store the first 6 as an average in a 6 hour average table and then remove from the table. This 6 hour average will be used for the 7 day view (28 data points).
When there are 8 entries in the 6 hour table, remove the first 4 and store this as an average day, to be used in the 30 day view (30 data points).
When there are 14 entries in the day view, remove the first 7 and store in a week table, this will be used for the year view.
This doesn't seem like the best way to me, as it seems to be more complicated than I would imagine it should be.
The alternative is to keep all of the data and let mysql find averages as and when needed. This will create a monstrously huge database. I have no idea about the performance yet. The id is an int, time is a datetime and value is a float. Is 1051200 records too many? Now is a good time to add, I would like to run this on a raspberry pi, but if not.. I do have my main machine which I could use.
Your proposed design looks OK. Perhaps there are more elegant ways of doing this, but your proposal should work too.
RRD (http://en.wikipedia.org/wiki/Round-Robin_Database) is a specialised database designed to do all of this automatically, and it should be much more performant than MySQL for this specialised purpose.
An alternative is the following: keep only the original table (1051200 records), but have a trigger that generates the last hour/day/year etc views every time a new record is added (e.g. every 30 seconds) and store/cache the result somewhere. Then your number-crunching workload is independent of the number of requests/clients you have to serve.
1051200 records may or may not be too many. Test in your Raspberry Pi to find out.
Let me give a suggestion on the physical layout of your table, regardless on whether you decide to keep all data or "prune" it from time to time...
Since you generate a new row "every 30 seconds", then Time can serve as a natural key without fear of exceeding the resolution of the underlying data type and causing duplicated keys. You don't need ID in this scenario1, so your table is simply:
Time (PK)
Value
And since InnoDB tables are clustered, not having secondary indexes2 means the whole table is stored in a single B-Tree, which is as efficient as it gets from storage and querying perspective. On top of that, Value is automatically covered, which may not have been the case in your original design unless you specifically designed your index(es) for that.
Using time as key can be tricky in general, but I think may be worth it in this particular case.
1 Unless there are other tables that reference it through FOREIGN KEYs, or you have already written too much code that depends on it.
2 Which would be necessary in the original design to support efficient aggregation.
I have a table for news articles, containing amongst others the author, the time posted and the word count for each article. The table is rather large, containing more than one million entries and growing with an amount of 10.000 entries each day.
Based on this data, a statistical analysis is done, to determine the total number of words a specific author has published in a specific time-window (i.e. one for each hour of each day, one for each day, one for each month) combined with an average for a time-span. Here are two examples:
Author A published 3298 words on 2011-11-04 and 943.2 words on average for each day two month prior (from 2011-09-04 to 2011-11-03)
Author B published 435 words on 2012-01-21 between 1pm and 2pm and an average of 163.94 words each day between 1pm and 2pm in the 30 days before
Current practice is to start a script at the end of each defined time-window via cron-job, which calculates the count and the averages and stores it in a separate table for each time-window (i.e. one for each hourly window, one for each daily, one for each monthly etc...).
The calculation of sums and averages can easily be done in SQL, so I think Views might be a more elegant solution to this, but I don't know about the implications on performance.
Are Views an appropriate solution to the problem described above?
I think you can use materialize views for it. It's not really implemented in MySQL, but you can implement it with tables. Look at
views will not be equivalent to your denormalization.
if you are moving aggregate numbers somewhere else, then that has a certain cost, which you are paying - in order to keep the data correct, and a certain benefit, which is much less data to look through when querying.
a view will save you from having to think too hard about the query each time you run it, but it will still need to look through the larger amount of data in the original tables.
while i'm not a fan of denormalization, since you already did it, i think the view will not help.
I want to build a MySQL database for storing the ranking of a game every 1h.
Since this database will become quite large in a short time, I figured it's important to have a proper design. Therefor some advice would be gratefully appreciated.
In order to keep it as small as possible, I decided to log only the first 1500 positions of the ranking. Every ranking of a player holds the following values:
ranking position, playername, location, coordinates, alliance, race, level1, level2, points1, points2, points3, points4, points5, points6, date/time
My approach was to simply grab all values of each top 1500 player every hour by a php script and insert them into the MySQL as one row. So every day the MySQL will grow 36,000 rows. I will have a second script that deletes every row that is older than 28 days, otherwise the database would get insanely huge. Both scripts will run as a cronjob.
The following queries will be performed on this data:
The most important one is simply the query for a certain name. It should return all stats for the player for every hour as an array.
The second is a query in which all players have to be returned that didn't gain points1 during a certain time period from the latest entry. This should return a list of players that didn't gain points (for the last 24h for example).
The third is a query in which all players should be listed that lost a certain amount or more points2 in a certain time period from the latest entry.
The queries shouldn't take a lifetime, so I thought I should probably index playernames, points1 and points2.
Is my approach to this acceptable or will I run into a performance/handling disaster? Is there maybe a better way of doing this?
Here is where you risk a performance problem:
Your indexes will speed up your reads, but will considerably slow down your writes. Especially since your DB will have over 1 million rows in that one table at any given time. Since your writes are happening via cron, you should be okay as long as you insert your 1500 rows in batches rather than one round trip to the DB for every row. I'd also look into query compiling so that you save that overhead as well.
Ranhiru Cooray is correct, you should only store data like the player name once in the DB. Create a players table and use the primary key to reference the player in your ranking table. The same will go for location, alliance and race. I'm guessing that those are more or less enumerated values that you can store in another table to normalize your design and be returned in your results with appropriates JOINs. Normalizing your data will reduce the amount of redundant information in your database which will decrease it's size and increase it's performance.
Your design may also be flawed in your ranking position. Can that not be calculated by the DB when you select your rows? If not, can it be done by PHP? It's the same as with invoice tables, you never store the invoice total because it is redundant. The items/pricing/etc can be used to calculate the order totals.
With all the adding/deleting, I'd be sure to run OPTIMIZE frequently and keep good backups. MySQL tables---if using MyISAM---can become corrupted easily in high writing/deleting scenarios. InnoDB tends to fair a little better in those situations.
Those are some things to think about. Hope it helps.
I have a database called RankHistory that is populated daily with each user's username and rank for the day (rank as in 1,2,3,...). I keep logs going back 90 days for every user, but my user base has grown to the point that the MySQL database holding these logs is now in excess of 20 million rows.
This data is recorded solely for the use of generating a graph showing how a user's rank has changed for the past 90 days. Is there a better way of doing this than having this massive database that will keep growing forever?
How great is the need for historic data in this case? My first thought would be to truncate data older than a certain threshold, or move it to an archive table that doesn't require as frequent or fast access as your current data.
You also mention keeping 90 days of data per user, but the data is only used to show a graph of changes to rank over the past 30 days. Is the extra 60 days' data used to look at changes over previous periods? If it isn't strictly necessary to keep that data (or at least not keep it in your primary data store, as per my first suggestion), you'd neatly cut the quantity of your data by two-thirds.
Do we have the full picture, though? If you have a daily record per user, and keep 90 days on hand, you must have on the order of a quarter-million users if you've generated over twenty million records. Is that so?
Update:
Based on the comments below, here are my thoughts: If you have hundreds of thousands of users, and must keep a piece of data for each of them, every day for 90 days, then you will eventually have millions of pieces of data - there's no simple way around that. What you can look into is minimizing that data. If all you need to present is a calculated rank per user per day, and assuming that rank is simply a numeric position for the given user among all users (an integer between 1 - 200000, for example), storing twenty million such records should not put unreasonable strain on your database resources.
So, what precisely is your concern? Sheer data size (i.e. hard-disk space consumed) should be relatively manageable under the scenario above. You should be able to handle performance via indexes, to a certain point, beyond which the data truncation and partitioning concepts mentioned can come into play (keep blocks of users in different tables or databases, for example, though that's not an ideal design...)
Another possibility is, though the specifics are somewhat beyond my realm of expertise, you seem to have an ideal candidate for an OLAP cube, here: you have a fact (rank) that you want to view in the context of two dimensions (user and date). There are tools out there for managing this sort of scenario efficiently, even on very large datasets.
Could you run an automated task like a cron job that checks the database every day or week and deletes entries that are more than 90 days old?
Another option, do can you create some "roll-up" aggregate per user based on whatever the criteria is... counts, sales, whatever and it is all stored based on employee + date of activity. Then you could have your pre-aggregated rollups in a much smaller table for however long in history you need. Triggers, or nightly procedures can run a query for the day and append the results to the daily summary. Then your queries and graphs can go against that without dealing with performance issues. This would also help ease moving such records to a historical database archive.
-- uh... oops... that's what it sounded like you WERE doing and STILL had 20 million+ records... is that correct? That would mean you're dealing with about 220,000+ users???
20,000,000 records / 90 days = about 222,222 users
EDIT -- from feedback.
Having 222k+ users, I would seriously consider that importance it is for "Ranking" when you have someone in the 222,222nd place. I would pair the daily ranking down to say the top 1,000. Again, I don't know the importance, but if someone doesn't make the top 1,000 does it really matter???