I have to import an availability calendar for 30,000 places into MySQL, and I am stuck on the structure design. I need something that will let me easily subquery and join check-in availability for a given date.
Each day has one of several options:
Can check in and check out
Not available
Can check in only
Can check out only
On request
Now, what would be the optimal structure for the table? Something like:
PlaceId | Day | AvailabilityCode
Then I would have 366 * 30,000 rows? I am afraid of that.
Is there any better way to do this?
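Spelled out as DDL, the table I have in mind would be roughly this (a sketch only; the names and the code type are just placeholders):

CREATE TABLE availability (
  place_id          INT NOT NULL,
  day               DATE NOT NULL,
  availability_code TINYINT NOT NULL,  -- e.g. 0 = not available, 1 = check-in and check-out, 2 = check-in only, 3 = check-out only, 4 = on request
  PRIMARY KEY (place_id, day)
);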
The XML data I have to parse looks like this:
<?xml version="1.0" encoding="utf-8" ?>
<vacancies>
<vacancy>
<code>AT1010.200.1</code>
<startday>2010-07-01</startday>
<availability>YYYNNNQQ</availability>
<changeover>CCIIOOX</changeover>
<minstay>GGGGGGGG</minstay>
<flexbooking>YYYYY</flexbooking>
</vacancy>
</vacancies>
Crucial additional information: the availability calendar is given as an XML feed, and I have to import it and repopulate my database every 10-20 minutes.
I think your problem is the XML feed, not the table structure. The easiest solution would be to ask the feed provider to deliver just a delta rather than a whole dump. But presumably there's a good reason why that is not possible.
So you will have to do it. You should store the XML feeds somehow, and compare the new file with the previous one. This will give you the delta, which you can then apply to your database table. There are several approaches you could take, and which you choose will largely depend on your programming prowess, and the capabilities of your database product.
For instance, MySQL has only had XML functionality since 5.1 and it is still pretty limited. So if you want to preprocess the XML file you will probably have to do it outside the database. An alternative approach would be to load the latest file into a staging table and use SQL to find and apply the differences.
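As a rough sketch of the staging-table route, assuming the availability table sketched in the question (all names are assumptions, and an external script would parse the XML into the staging table):

-- Staging table mirrors the live table; reload it from the parsed feed each run.
CREATE TABLE availability_staging LIKE availability;

-- Upsert new and changed rows (relies on the PRIMARY KEY over (place_id, day)).
INSERT INTO availability (place_id, day, availability_code)
SELECT place_id, day, availability_code
FROM availability_staging
ON DUPLICATE KEY UPDATE availability_code = VALUES(availability_code);

-- Remove rows that no longer appear in the feed.
DELETE a
FROM availability AS a
LEFT JOIN availability_staging AS s
  ON s.place_id = a.place_id AND s.day = a.day
WHERE s.place_id IS NULL;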
You only need to add rows when something is not available. A missing row for a given place and date can be implicitly interpreted as available.
365 * 30,000 is a little over 10 million records in a table with only small fields (an int id, a date or day, and a code, which is probably an int as well, or maybe a char(1)). This is very doable in MySQL and will only become a problem if you get many reads and frequent updates to this table. If it is only updated now and then, it will not be much of a problem to have tables with 10 or 20 million records.
But maybe there's a better solution, although it may be more complex.
It sounds to me like some sort of booking programme. If so, each place will probably have seasons in which it can be booked. You can give each place a default value, or maybe even a default value per season. For instance, a place is available from March to August and unavailable in the other months. Then, when a place is booked during the summer and becomes unavailable, you put that value in the table you suggested.
That way, you can check whether a record exists for the requested place on a given day. If it does not exist, you check the default value in the 'places' table (30,000 records), or in the 'seasons' table, where you have a record per season per place (maybe 2 to 4 records per place). That can cut the number of records down by a lot.
But remember this will not work if you get bookings for almost every day for each place. In that case, you will hardly ever need the defaults, and there will still be millions of records in the state-per-day table. Like I said before, that may not be a problem at all, but you should consider whether the more complex solution will actually reduce the data. It depends on your situation.
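A hedged sketch of that lookup, assuming a per-place default plus an exceptions table holding only the deviating days (all names are assumptions):

-- Availability of place 123 on a given date: an exception row wins, otherwise the default.
SELECT COALESCE(e.availability_code, p.default_availability_code) AS availability_code
FROM places AS p
LEFT JOIN availability_exceptions AS e
  ON e.place_id = p.place_id AND e.day = '2010-07-15'
WHERE p.place_id = 123;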
I have a database in which each entry is a business, and some of the data is opening and closing times. Originally, to support multiple opening and closing times in a day, I was going to store them as a string such as '09:00-15:00,17:00-22:00' and then split it up and convert it to TIMESTAMPs server-side. I now understand that it is "bad form" to store times as strings. But that solution requires a second table.
What exactly is the issue with using a string instead of DATE or TIMESTAMP? If I kept it all in one table it would get pulled with my current query and then just processed and converted into a TIMESTAMP. Using multiple tables causes more queries. Is the string manipulation and conversion really that much more taxing than another query searching through an entire table that will end up with thousands of entries?
There are three separate issues here. One is whether it is better to combine two elementary data items into a list, and store the list in one place, or to keep the elementary items separate, and store them in two separate places. For these purposes, I'm considering the intersection of a single row and a single column to be one place.
The second issue is whether it is better to identify a time interval as a single elementary item or better to describe a time interval as two elementary items, side by side.
The third issue is whether it is better to store points in time as strings or use the DBMS timestamp datatype.
For the first issue, it's better to store separate elementary items in separate places. In this case, in two separate rows of the same table.
For the second issue, it's better to describe a time interval as two timestamps side by side than to combine them into an interval, for the DBMS systems I have used. But there are DBMS systems that have a time interval datatype that is distinct from the timestamp datatype. Your mileage may vary.
For the third issue, it's better to use the DBMS timestamp datatype to describe a point in time than a character string. Different DBMS products have different facilities for storing a time without any date associated with it, and in your case, you may want to make use of this. It depends on how you will be searching the data. If you are going to want to find all the rows that contain a 9:00 to 15:00 time range regardless of date, you will want to make use of this. If you are going to want to find all the rows that contain a 9:00 to 15:00 range on any Tuesday, I suggest you look into data warehousing techniques.
Why?
For the first two answers, it has to do with normalization, indexing, searching strategies, and query optimization. These concepts are all related, and each of them will help you understand the other three.
For the third answer, it has to do with using the power of the DBMS to do the detailed work for you. Any good DBMS will have tools for subtracting timestamps to yield a time interval, and for ordering timestamps correctly. Getting the DBMS to do this work for you will save you a lot of work in your application programming, and generally use less computer resources in production.
Having said that, I don't know MySQL, so I don't know how good it is at date manipulation or query optimization.
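To make the first two points concrete, here is a minimal sketch (all names are assumptions, not something from the question): one row per opening period, with the start and end stored side by side as TIME values.

CREATE TABLE business_hours (
  business_id INT NOT NULL,
  day_of_week TINYINT NOT NULL,   -- 1 = Sunday through 7 = Saturday, matching MySQL's DAYOFWEEK()
  open_time   TIME NOT NULL,
  close_time  TIME NOT NULL,
  PRIMARY KEY (business_id, day_of_week, open_time)
);

-- '09:00-15:00,17:00-22:00' becomes two rows for the same business and day:
INSERT INTO business_hours VALUES (1, 3, '09:00:00', '15:00:00');
INSERT INTO business_hours VALUES (1, 3, '17:00:00', '22:00:00');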
By '09:00-15:00,17:00-22:00', you mean that it is open for two periods of time? Then 2 rows in one table. If the open times vary by day of week, then another column to specify that. If you want holidays, another column.
But... What will you be doing with the data?...
If you are going to write SQL to use the times, then what I described may work best.
If the processing will be in your application code, then a CSV file that you load every time, parse every time, etc, may be best.
Or, you could put that CSV file into a table as one big string. (However, this is not much different than having a file.)
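If you do go the SQL route, queries against something like the hypothetical business_hours sketch above (TIME columns plus a day-of-week column) stay simple; for example, "which businesses are open right now?":

SELECT business_id
FROM business_hours
WHERE day_of_week = DAYOFWEEK(CURDATE())
  AND CURTIME() BETWEEN open_time AND close_time;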
I have a database called RankHistory that is populated daily with each user's username and rank for the day (rank as in 1,2,3,...). I keep logs going back 90 days for every user, but my user base has grown to the point that the MySQL database holding these logs is now in excess of 20 million rows.
This data is recorded solely for the use of generating a graph showing how a user's rank has changed for the past 90 days. Is there a better way of doing this than having this massive database that will keep growing forever?
How great is the need for historic data in this case? My first thought would be to truncate data older than a certain threshold, or move it to an archive table that doesn't require as frequent or fast access as your current data.
You also mention keeping 90 days of data per user, but the data is only used to show a graph of changes to rank over the past 30 days. Is the extra 60 days' data used to look at changes over previous periods? If it isn't strictly necessary to keep that data (or at least not keep it in your primary data store, as per my first suggestion), you'd neatly cut the quantity of your data by two-thirds.
Do we have the full picture, though? If you have a daily record per user, and keep 90 days on hand, you must have on the order of a quarter-million users if you've generated over twenty million records. Is that so?
Update:
Based on the comments below, here are my thoughts: If you have hundreds of thousands of users, and must keep a piece of data for each of them, every day for 90 days, then you will eventually have millions of pieces of data - there's no simple way around that. What you can look into is minimizing that data. If all you need to present is a calculated rank per user per day, and assuming that rank is simply a numeric position for the given user among all users (an integer between 1 - 200000, for example), storing twenty million such records should not put unreasonable strain on your database resources.
So, what precisely is your concern? Sheer data size (i.e. hard-disk space consumed) should be relatively manageable under the scenario above. You should be able to handle performance via indexes, to a certain point, beyond which the data truncation and partitioning concepts mentioned can come into play (keep blocks of users in different tables or databases, for example, though that's not an ideal design...)
Another possibility is, though the specifics are somewhat beyond my realm of expertise, you seem to have an ideal candidate for an OLAP cube, here: you have a fact (rank) that you want to view in the context of two dimensions (user and date). There are tools out there for managing this sort of scenario efficiently, even on very large datasets.
Could you run an automated task like a cron job that checks the database every day or week and deletes entries that are more than 90 days old?
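For instance, something along these lines, either from cron or as a MySQL scheduled event (table and column names are assumptions, and the event scheduler must be enabled for the second variant):

-- One-off statement a nightly cron job could run:
DELETE FROM RankHistory
WHERE rank_date < CURDATE() - INTERVAL 90 DAY;

-- Or let MySQL itself run it once a day:
CREATE EVENT purge_old_ranks
ON SCHEDULE EVERY 1 DAY
DO
  DELETE FROM RankHistory
  WHERE rank_date < CURDATE() - INTERVAL 90 DAY;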
Another option: can you create some "roll-up" aggregate per user based on whatever the criteria are (counts, sales, whatever), stored per user + date of activity? Then you could keep your pre-aggregated roll-ups in a much smaller table for however far back in history you need. Triggers or nightly procedures can run a query for the day and append the results to the daily summary. Your queries and graphs can then go against that table without the performance issues. This would also make it easier to move such records to a historical archive database.
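As a rough sketch of such a nightly roll-up (all table and column names are assumptions):

-- Append yesterday's per-user summary to a small pre-aggregated table.
INSERT INTO daily_summary (user_id, activity_date, event_count)
SELECT user_id, DATE(created_at), COUNT(*)
FROM activity_log
WHERE created_at >= CURDATE() - INTERVAL 1 DAY
  AND created_at <  CURDATE()
GROUP BY user_id, DATE(created_at);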
-- uh... oops... that's what it sounded like you WERE doing and STILL had 20 million+ records... is that correct? That would mean you're dealing with about 220,000+ users???
20,000,000 records / 90 days = about 222,222 users
EDIT -- from feedback.
With 222k+ users, I would seriously question how important "ranking" is when someone is in 222,222nd place. I would pare the daily ranking down to, say, the top 1,000. Again, I don't know the requirements, but if someone doesn't make the top 1,000, does it really matter?
I have a web application that has a MySql database with a device_status table that looks something like this...
deviceid | ... various status cols ... | created
This table gets inserted into many times a day: 2,000+ inserts per device per day, with an estimated 100+ devices by the end of the year.
Basically this table gets a record when just about anything happens on the device.
My question is how should I deal with a table that is going to grow very large very quickly?
Should I just relax and hope the database will be fine in a few months when this table has over 10 million rows, and then in a year when it has 100 million rows? This is the simplest option, but it seems like a table that large would have terrible performance.
Should I archive older data after some time period (a month, a week) and then have the web app query the live table for recent reports, and query both the live and archive tables for reports covering a larger time span?
Should I have an hourly and/or daily aggregate table that sums up the various statuses for a device? If I do this, what's the best way to trigger the aggregation? Cron? DB Trigger? Also I would probably still need to archive.
There must be a more elegant solution to handling this type of data.
I had a similar issue in tracking the number of views seen for advertisers on my site. Initially I was inserting a new row for each view, and as you predict here, that quickly led to the table growing unreasonably large (to the point that it was indeed causing performance issues which ultimately led to my hosting company shutting down the site for a few hours until I had addressed the issue).
The solution I went with is similar to your #3 solution. Instead of inserting a new record when a new view occurs, I update the existing record for the timeframe in question. In my case, I went with daily records for each ad. What timeframe to use for your app would depend entirely on the specifics of your data and your needs.
Unless you need to specifically track each occurrence over the last hour, you might be over-doing it to even store them and aggregate later. Instead of bothering with the cron job to perform regular aggregation, you could simply check for an entry with matching specs. If you find one, then you update a count field of the matching row instead of inserting a new row.
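A hedged sketch of that pattern in MySQL (names are assumptions); the unique key over the dimensions you aggregate by is what makes the update-or-insert work:

CREATE TABLE ad_views_daily (
  ad_id      INT NOT NULL,
  view_date  DATE NOT NULL,
  view_count INT NOT NULL DEFAULT 0,
  PRIMARY KEY (ad_id, view_date)
);

-- On each view: bump the counter, creating the row if it does not exist yet.
INSERT INTO ad_views_daily (ad_id, view_date, view_count)
VALUES (42, CURDATE(), 1)
ON DUPLICATE KEY UPDATE view_count = view_count + 1;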
I was wondering if somebody knows an elegant solution to the following:
Suppose I have a table that holds orders, with a bunch of data. I'm at 1M records, and searches are beginning to take time. So I want to speed things up by archiving data that is more than 3 years old: saving it into a table called orders-archive and then purging it from the orders table. If we need to research something, or a customer wants to pull older information, they still can, but 99% of lookups are done on orders no older than a year and a half, so there is no reason to keep looking through the older data all the time. These move-and-purge operations can then be cron'd to run on a weekly basis. I have already done some tests and I know this will cut my search times by about a factor of four. So far so good, right?
However, I was thinking about how to implement the archival lookups, and the only reasonable thing I can think of is some sort of if-else: if not found in orders, do a search in orders-archive. However, I have about 20 tables that I want to archive, and god knows how many searches/finds are done throughout the code, which I don't want to modify. So I was wondering if there is an elegant Rails-way solution to this problem, perhaps by extending a model somehow? Has anyone dealt with a similar case before?
Thank you.
MySQL 5.x can handle this natively using Horizontal Partitioning.
The basic idea behind partitioning is that you tell the database to store records in a certain range in a separate file. You can still query against all the records, but as long as you're querying only current records, the database engine won't be encumbered with all of the archived records.
You can use the order_date column or something similar as the cutoff for your partitions. This is the elegant solution.
Overview of Partitioning in MySQL
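A hedged example of what that could look like (names are assumptions; note that in MySQL the partitioning column must be part of every unique key on the table, including the primary key):

CREATE TABLE orders (
  id         INT NOT NULL,
  order_date DATE NOT NULL,
  PRIMARY KEY (id, order_date)
)
PARTITION BY RANGE (YEAR(order_date)) (
  PARTITION p2008 VALUES LESS THAN (2009),
  PARTITION p2009 VALUES LESS THAN (2010),
  PARTITION p2010 VALUES LESS THAN (2011),
  PARTITION pmax  VALUES LESS THAN MAXVALUE
);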
Otherwise, your if/else idea with dynamically generated queries seems about right. You can add year numbers after the archival tables and use reflection to build a list of tables, then have at it.
Hey, does anyone know the proper way to set up a MySQL database to gather pageviews? I want to gather these pageviews to display in a graph later. I have a couple ways mapped out below.
Option A:
Would it be better to count pageviews each time someone visits the site and create a new row for every pageview with a timestamp? So, 50,000 views = 50,000 rows of data.
Option B:
Count the pageviews per day and have one row that holds the count; every time someone visits the site, the count goes up. So, 50,000 views = 1 row of data per day, with a new row created each day.
Is either of the options above the correct way of doing what I want, or is there a better, more efficient way?
Thanks.
Option C would be to parse access logs from the web server. No extra storage needed, all sorts of extra information is stored, and even requests to images and JavaScript files are stored.
However, if you just want to track visits to pages where you run your own code, I'd definitely go for Option A, unless you're expecting extreme amounts of traffic on your site.
That way you can create overviews per hour of the day, and store more information than just the timestamp (like the visited page, the user's browser, etc.). You might not need that now, but later on you might thank yourself for not losing that information.
If at some point the table grows too large, you can always think of ways on how to deal with that.
If you care about how your pageviews vary with time in a day, option A keeps that info (though you might still do some bucketing, say per-hour, to reduce overall data size -- but you might do that "later, off-line" while archiving all details). Option B takes much less space because it throws away a lot of info... which you might or might not care about. If you don't know whether you care, I think that, in doubt, you should keep more data rather than less -- it's reasonably easy to "summarize and archive" overabundant data, but it's NOT at all easy to recover data you've aggregated away;-). So, aggregating is riskier...
If you do decide to keep abundant per-day data, one strategy is to use multiple tables, say one per day; this will make it easiest to work with old data (summarize it, archive it, remove it from the live DB) without slowing down current "logging". So, say, pageviews for May 29 would be in PV20090529 -- a different table than the ones for the previous and next days (this does require dynamic generation of the table name, or creative uses of ALTER VIEW e.g. in cron-jobs, etc -- no big deal!). I've often found such "sharding approaches" to have excellent (and sometimes unexpected) returns on investment, as a DB scales up beyond initial assumptions, compared to monolithic ones...
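A hedged sketch of that table-per-day idea (names are assumptions): a nightly cron job creates the next day's table from a template, and the application logs straight into whichever table matches today's date.

-- Template that every daily table copies, indexes included.
CREATE TABLE pageviews_template (
  viewed_at  TIMESTAMP NOT NULL,
  page       VARCHAR(255) NOT NULL,
  user_agent VARCHAR(255),
  KEY idx_viewed_at (viewed_at)
);

-- Run once per day (e.g. from cron), with the date baked into the table name.
CREATE TABLE PV20090529 LIKE pageviews_template;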