MySQL - Database design for a large-scale, real-world deployment

I would love to hear some opinions or thoughts on a mysql database design.
Basically, I have a Tomcat server which receives different types of data from about 1000 systems out in the field. Each of these systems is unique, and will be reporting unique data.
The data sent can be categorized as frequent and infrequent data. The infrequent data is only sent about once a day and doesn't change much - it is basically just configuration data.
Frequent data is sent every 2-3 minutes while the system is turned on, and represents the current state of the system.
This data needs to be databased for each system, and be accessible at any given time from a php page. Essentially for any system in the field, a PHP page needs to be able to access all the data on that client system and display it. In other words, the database needs to show the state of the system.
The information itself is all text-based, and there is a lot of it. The config data (that doesn't change much) is key-value pairs and there are currently about 100 of them.
My idea for the design was to have 100+ columns, and 1 row for each system to hold the config data. But I am worried about having that many columns, mainly because it isn't too future proof if I need to add columns in the future. I am also worried about insert speed if I do it that way. This might blow out to a 2000row x 200column table that gets accessed about 100 times a second so I need to cater for this in my initial design.
I am also wondering if there are any design philosophies out there that cater for frequently changing and seldom changing data based on the engine. This would make sense as I want to keep INSERT/UPDATE time low, and I don't care too much about the SELECT time from PHP.
I would also love to know how to split up data. I.e. if frequently changing data can be categorised in a few different ways should I have a bunch of tables, representing the data and join them on selects? I am worried about this because I will probably have to make a report to show common properties between all systems (i.e. show all systems with a certain condition).
I hope I have provided enough information here for someone to point me in the right direction; any help on the matter would be great. Or if someone has done something similar and can offer advice I would be very appreciative. Thanks heaps :)
~ Dan

I've posted some questions in a comment. It's hard to give you advice about your rapidly changing data without knowing more about what you're trying to do.
For your configuration data, don't use a 100-column table. Wide tables are notoriously hard to handle in production. Instead, use a four-column table containing these columns:
SYSTEM_ID   VARCHAR    System identifier
POSTTIME    DATETIME   The time the information was posted
NAME        VARCHAR    The name of the parameter
VALUE       VARCHAR    The value of the parameter
The first three of these columns are your composite primary key.
This design has the advantage that it grows (or shrinks) as you add to (or subtract from) your configuration parameter set. It also allows for the storing of historical data. That means new data points can be INSERTed rather than UPDATEd, which is faster. You can run a daily or weekly job to delete history you're no longer interested in keeping.
(Edit: if you really don't need history, get rid of the POSTTIME column, which leaves SYSTEM_ID and NAME as the primary key, and use MySQL's nice extension INSERT ... ON DUPLICATE KEY UPDATE when you post stuff. See http://dev.mysql.com/doc/refman/5.0/en/insert-on-duplicate.html)
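A minimal sketch of the name/value layout in its no-history form, using Python's built-in sqlite3 module as a stand-in for MySQL so that it runs anywhere. The table name system_config, the helper post_config and the sample values are made up for illustration, and SQLite's ON CONFLICT ... DO UPDATE clause plays the role that INSERT ... ON DUPLICATE KEY UPDATE would play in MySQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE system_config (
        system_id TEXT NOT NULL,
        name      TEXT NOT NULL,
        value     TEXT,
        PRIMARY KEY (system_id, name)   -- no POSTTIME, so no history is kept
    )
    """
)


def post_config(conn, system_id, name, value):
    """Insert a parameter, or overwrite its value if the key already exists."""
    conn.execute(
        """
        INSERT INTO system_config (system_id, name, value)
        VALUES (?, ?, ?)
        ON CONFLICT (system_id, name) DO UPDATE SET value = excluded.value
        """,
        (system_id, name, value),
    )


post_config(conn, "sys-0001", "firmware", "1.0")
post_config(conn, "sys-0001", "timezone", "UTC")
post_config(conn, "sys-0001", "firmware", "1.1")   # same key: updates in place

# Everything known about one system, as the PHP page would fetch it.
config = dict(
    conn.execute(
        "SELECT name, value FROM system_config WHERE system_id = ?",
        ("sys-0001",),
    )
)
```

Adding a new configuration parameter is then just a new NAME value, with no ALTER TABLE involved.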
If your rapidly changing data is similar in form (name/value pairs) to your configuration data, you can use a similar schema to store it.
You may want to create a "current data" table using the MEMORY access method for this stuff. MEMORY tables are very fast to read and write because the data is all in RAM in your MySQL server. The downside is that a MySQL crash and restart will give you an empty table, with the previous contents lost. (MySQL servers crash very infrequently, but when they do they lose MEMORY table contents.)
You can run an occasional job (every few minutes or hours) to copy the contents of your MEMORY table to an on-disk table if you need to save history.
(Edit: You might consider adding memcached http://memcached.org/ to your web application system in the future to handle a high read rate, rather than constructing a database design for version 1 that handles a high read rate. That way you can see which parts of your overall app design have trouble scaling. I wish somebody had convinced me to do this in the past, rather than overdesigning for early versions. )

Related

Medium-term temporary tables - creating tables on the fly to last 15-30 days?

Context
I'm currently developing a tool for managing orders and communicating between technicians and services. The industrial context is broadcast and TV. Multiple clients expecting media files each made to their own specs imply widely varying workflows even within the restricted scope of a single client's orders.
One client can ask one day for a single SD file and the next for a full-blown HD package containing up to fourteen files... In a MySQL db I am trying to store accurate information about all the small tasks composing the workflow, in multiple forms:
DATETIME values every time a task is accomplished, for accurate tracking
paths to the newly created files in the company's file system in VARCHARs
archiving background info in TEXT values (info such as user comments, e.g. when an incident happens and prevents moving forward, they can comment about it in this feed)
Multiply that by 30 different file types and this is way too much for a single table. So I thought I'd break it up by client: one table per client, so that any order only ever requires the use of that one table, which doesn't manipulate more than 15 fields. Still, this is a pretty rigid solution when a client has 9 different transcoding specs and a particular order only requires one. I figure I'd need to add flag fields for each transcoding field to indicate which ones are required for that particular order.
Concept
I then had this crazy idea that maybe I could create a temporary table to last while the order is running (that can range from about 1 day to 1 month). We rarely have more than 25 orders running simultaneously so it wouldn't get too crowded.
The idea is to make a table tailored for each order, eliminating the need for flags and unnecessary, forever-empty fields. Once the order is complete the table would get flushed, JSON-encoded, into a TEXT or BLOB so it can be restored later if changes need to be made.
Do you have experience with DBMSs (MySQL in particular) struggling with such practices, if anyone has ever tried them? Does this sound like a viable option? I am happy to try (which I have already started to do) and I am seeking advice on whether to keep going or stop right here.
Thanks for your input!
Well, of course that is possible to do. However, you cannot use MySQL temporary tables for such long-term storage; you will have to use "normal" tables, and have some clean-up routine...
However, I do not see why that amount of data would be too much for a single table. If your queries start to run slowly because of too much data, then you should add some indexes to your database. I also think there is another con: it will be much harder to build reports later on. When you have 25 tables with the same kind of data, you will have to run 25 queries and merge the data.
I do not see the point, really. The same kinds of data should be in the same table.
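To illustrate the single-table approach, here is a minimal sketch in Python with the built-in sqlite3 module standing in for MySQL. The table order_tasks, its columns and the sample rows are all hypothetical. Each order only gets rows for the tasks it actually needs, so there are no flag columns or forever-empty fields, and a report across all running orders is one query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE order_tasks (
        order_id     INTEGER NOT NULL,
        task_name    TEXT    NOT NULL,
        completed_at TEXT,              -- NULL until the task is done
        file_path    TEXT,
        comment      TEXT,
        PRIMARY KEY (order_id, task_name)
    )
    """
)

# One row per task an order requires; orders can have different task sets.
rows = [
    (1, "transcode_sd", "2024-01-02 10:00:00", "/media/1/sd.mov", None),
    (2, "transcode_hd", None, None, "source file missing"),
    (2, "subtitles", "2024-01-03 09:30:00", "/media/2/subs.srt", None),
]
conn.executemany("INSERT INTO order_tasks VALUES (?, ?, ?, ?, ?)", rows)

# A cross-order report: every task that is still outstanding.
pending = conn.execute(
    """
    SELECT order_id, task_name
    FROM order_tasks
    WHERE completed_at IS NULL
    ORDER BY order_id
    """
).fetchall()
```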

Best database design for storing a high number columns?

Situation: We are working on a project that reads datafeeds into the database at our company. These datafeeds can contain a high number of fields. We match those fields with certain columns.
At this moment we have about 120 types of fields. Each of those needs a column. We need to be able to filter and sort on all columns.
The problem is that I'm unsure what database design would be best for this. I'm using MySQL for the job but I'm open to suggestions. At this moment I'm planning to make a table with all 120 columns, since that is the most natural way to do things.
Options: My other options are a meta table that stores key and values. Or using a document based database so I have access to a variable schema and scale it when needed.
Question:
What is the best way to store all this data? The row count could go up to 100k rows and I need a storage that can select, sort and filter really fast.
Update:
Some more information about usage. XML feeds will be generated live from this table. We are talking about 100 - 500 requests per hour, but this will grow. The fields will not change regularly, but it could happen once every 6 months. We will also be updating the datafeeds daily: checking whether items have been updated, deleting old ones and adding new ones.
120 columns at 100k rows is not enough information, that only really gives one of the metrics: size. The other is transactions. How many transactions per second are you talking about here?
Is it a nightly update with a manager running a report once a week, or a million page-requests an hour?
I don't generally need to start looking at 'clever' solutions until hitting a 10m record table, or hundreds of queries per second.
Oh, and do not use a Key-Value pair table. They are not great in a relational database, so stick to proper typed fields.
I personally would recommend sticking to a conventional one-column-per-field approach and only deviate from this if testing shows it really isn't right.
With regards to retrieval, if the INSERTS/UPDATES are only happening daily, then I think some careful indexing on the server side, and good caching wherever the XML is generated, should reduce the server hit a good amount.
For example, you say 'we will be updating the datafeeds daily', then there shouldn't be any need to query the database every time. Although, 1000 per hour is only 17 per minute. That probably rounds down to nothing.
I'm working on a similar project right now, downloading dumps from the net and loading them into the database, merging changes into the main table and properly adjusting the dictionary tables.
First, you know the data you'll be working with, so it is necessary to analyze it in advance and pick the best table/column layout. If all your 120 columns contain textual data, then a single row will take several kilobytes of disk space. In such a situation you will want to make all queries highly selective, so that indexes are used to minimize IO. Full scans might take significant time with such a design. You've said nothing about how big your 500/h requests will be: will each request extract a single row, a small bunch of rows or a big portion (up to the whole table)?
Second, looking at the data, you might outline a number of columns that will have a limited set of values. I prefer to do the following transformation for such columns:
setup a dictionary table, making an integer PK for it;
replace the actual value in a master table's column with PK from the dictionary.
The transformation is done by triggers written in C, so although it gives me upload penalty, I do have some benefits:
decreased total size of the database and master table;
better options for the database and OS to cache frequently accessed data blocks;
better query performance.
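The dictionary transformation described above can be sketched as follows, again with Python's sqlite3 module standing in for the real database. The answer does the lookup in triggers written in C; this sketch does it in application code instead, and the names color_dict, items and dictionary_id are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE color_dict (
        id    INTEGER PRIMARY KEY,      -- small integer key
        value TEXT NOT NULL UNIQUE      -- the repeated textual value
    );
    CREATE TABLE items (
        item_id  INTEGER PRIMARY KEY,
        title    TEXT NOT NULL,
        color_id INTEGER NOT NULL REFERENCES color_dict (id)
    );
    """
)


def dictionary_id(conn, value):
    """Return the dictionary key for value, creating the entry if needed."""
    conn.execute("INSERT OR IGNORE INTO color_dict (value) VALUES (?)", (value,))
    row = conn.execute(
        "SELECT id FROM color_dict WHERE value = ?", (value,)
    ).fetchone()
    return row[0]


for title, color in [("shirt", "red"), ("hat", "blue"), ("scarf", "red")]:
    conn.execute(
        "INSERT INTO items (title, color_id) VALUES (?, ?)",
        (title, dictionary_id(conn, color)),
    )

# Filtering still works on the textual value via a join on the dictionary.
red_items = [
    row[0]
    for row in conn.execute(
        """
        SELECT i.title
        FROM items i
        JOIN color_dict d ON d.id = i.color_id
        WHERE d.value = ?
        ORDER BY i.item_id
        """,
        ("red",),
    )
]
dict_size = conn.execute("SELECT COUNT(*) FROM color_dict").fetchone()[0]
```

The master table stores one small integer per row instead of the repeated string, which is where the size and caching benefits come from.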
Third, try to split data according to the extracts you'll be doing. Quite often it turns out that only 30-40% of the fields in the table are used by all the queries, while the remaining 60-70% are evenly distributed among them and used only partially. In this case I would recommend splitting the main table accordingly: extract the fields that are always used into a single "master" table, and create another one for the rest of the fields. In fact, you can have several "other" tables, logically grouping the data into separate tables.
In my practice we've had a table that contained detailed customer information: name details, address details, status details, banking details, billing details, financial details and a set of custom comments. All queries on such a table were expensive ones, as it was used in the majority of our reports (reports typically perform full scans). By splitting this table into a set of smaller ones and building a view with rules on top of them (to keep the external application happy) we managed to gain a pleasant performance boost (sorry, I don't have the numbers any longer).
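A rough sketch of that kind of vertical split, using Python's sqlite3 module as a stand-in database. The table and column names are hypothetical, and the sketch only covers reads: the view lets an external application see the old wide shape, while writing through the view (the "rules" mentioned above) is not shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Fields that nearly every query touches.
    CREATE TABLE customer_master (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL,
        status      TEXT NOT NULL
    );
    -- Rarely used fields, grouped logically in their own table.
    CREATE TABLE customer_billing (
        customer_id  INTEGER PRIMARY KEY
                     REFERENCES customer_master (customer_id),
        bank_account TEXT,
        billing_addr TEXT
    );
    -- The original wide shape, reassembled for applications that expect it.
    CREATE VIEW customer AS
        SELECT m.customer_id, m.name, m.status,
               b.bank_account, b.billing_addr
        FROM customer_master m
        LEFT JOIN customer_billing b ON b.customer_id = m.customer_id;

    INSERT INTO customer_master VALUES (1, 'Alice', 'active'), (2, 'Bob', 'closed');
    INSERT INTO customer_billing VALUES (1, 'ACC-001', '1 Main St');
    """
)

# Common reports scan only the narrow master table.
active = conn.execute(
    "SELECT name FROM customer_master WHERE status = 'active'"
).fetchall()

# The view still returns one wide row per customer.
wide = conn.execute("SELECT * FROM customer ORDER BY customer_id").fetchall()
```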
To summarize: you know the data you'll be working with and you know the queries that will be used to access your database, analyze and design accordingly.

handling/compressing large datasets in multiple tables

In an application at our company we collect statistical data from our servers (load, disk usage and so on). Since there is a huge amount of data and we don't need all of it at all times, we've had a "compression" routine that takes the raw data and calculates min, max and average for a number of data points, stores these new values in the same table and removes the old ones after some weeks.
Now I'm tasked with rewriting this compression routine, and the new routine must keep all uncompressed data we have for one year in one table and "compressed" data in another table. My main concerns now are how to handle the data that is continuously written to the database and whether or not to use a "transaction table" (my own term since I can't come up with a better one; I'm not talking about the commit/rollback transaction functionality).
As of now our data collectors insert all information into a table named ovak_result and the compressed data will end up in ovak_resultcompressed. But are there any specific benefits or drawbacks to creating a table called ovak_resultuncompressed and just use ovak_result as a "temporary storage"? ovak_result would be kept minimal which would be good for the compressing routine, but I would need to shuffle all data from one table into another continually, and there would be constant reading, writing and deleting in ovak_result.
Are there any mechanisms in MySQL to handle this kind of thing?
(Please note: We are talking about quite large datasets here (about 100 M rows in the uncompressed table and about 1-10 M rows in the compressed table). Also, I can do pretty much what I want with both software and hardware configurations so if you have any hints or ideas involving MySQL configurations or hardware set-up, just bring them on.)
Try reading about the ARCHIVE storage engine.
Re your clarification. Okay, I didn't get what you meant from your description. Reading more carefully, I see you did mention min, max, and average.
So what you want is a materialized view that updates aggregate calculations for a large dataset. Some RDBMS brands such as Oracle have this feature, but MySQL doesn't.
One experimental product that tries to solve this is called FlexViews (http://code.google.com/p/flexviews/). This is an open-source companion tool for MySQL. You define a query as a view against your raw dataset, and FlexViews continually monitors the MySQL binary logs, and when it sees relevant changes, it updates just the rows in the view that need to be updated.
It's pretty effective, but it has a few limitations in the types of queries you can use as your view, and it's also implemented in PHP code, so it's not fast enough to keep up if you have really high traffic updating your base table.
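For comparison, the min/max/average rollup from the question can also be done by hand with a periodic job. The sketch below uses Python's sqlite3 module as a stand-in for MySQL; only the table names ovak_result and ovak_resultcompressed come from the question, while the columns, the hourly bucket size and the function compress_before are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE ovak_result (
        server TEXT NOT NULL,
        metric TEXT NOT NULL,
        ts     TEXT NOT NULL,          -- 'YYYY-MM-DD HH:MM:SS'
        value  REAL NOT NULL
    );
    CREATE TABLE ovak_resultcompressed (
        server    TEXT NOT NULL,
        metric    TEXT NOT NULL,
        hour      TEXT NOT NULL,       -- 'YYYY-MM-DD HH'
        min_value REAL NOT NULL,
        max_value REAL NOT NULL,
        avg_value REAL NOT NULL,
        PRIMARY KEY (server, metric, hour)
    );
    """
)
conn.executemany(
    "INSERT INTO ovak_result VALUES (?, ?, ?, ?)",
    [
        ("web1", "load", "2024-01-01 10:05:00", 1.0),
        ("web1", "load", "2024-01-01 10:35:00", 3.0),
        ("web1", "load", "2024-01-01 11:05:00", 2.0),
    ],
)


def compress_before(conn, cutoff):
    """Roll raw rows older than cutoff up into one row per hour."""
    conn.execute(
        """
        INSERT OR REPLACE INTO ovak_resultcompressed
        SELECT server, metric, substr(ts, 1, 13),
               MIN(value), MAX(value), AVG(value)
        FROM ovak_result
        WHERE ts < ?
        GROUP BY server, metric, substr(ts, 1, 13)
        """,
        (cutoff,),
    )


compress_before(conn, "2024-01-01 11:00:00")
compressed = conn.execute("SELECT * FROM ovak_resultcompressed").fetchall()
```

Because the rollup is keyed on (server, metric, hour), re-running the job over the same period overwrites the earlier aggregate rather than duplicating it, and the raw rows stay in place for the one-year retention.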

Storing large, session-level datasets?

I'm working on building a web application that consists of users doing the following:
Browse and search against a Solr server containing millions of entries. (This part of the app is working really well.)
Select a privileged piece of this data (the results of some particular search), and temporarily save it as a "dataset". (I'd like dataset size to be limited to something really large, say half a million results.)
Perform some sundry operations on that dataset.
(The frontend's built in Rails, though I doubt that's really relevant to how to solve this particular problem.)
Step two, and how to retrieve the data for step 3, are what's giving me trouble. I need to be able to temporarily save datasets, recover them when they're needed, and expire them after a while. The problem is, my results have SHA1 checksum IDs, so each ID is 48 characters. A 500,000 record dataset, even if I only store IDs, is 22 MB of data. So I can't just have a single database table and throw a row in it for each dataset that a user constructs.
Has anybody out there ever needed something like this before? What's the best way to approach this problem? Should I generate a separate table for each dataset that a user constructs? If so, what's the best way to expire/delete these tables after a while? I can deploy a MySQL server if needed (though I don't have one up yet, all the data's in Solr), and I'd be open to some crazier software as well if something else fits the bill.
EDIT: Some more detailed info, in response to Jeff Ferland below.
The data objects are immutable, static, and reside entirely within the Solr database. It might be more efficient as files, but I would much rather (for reasons of search and browse) keep them where they are. Neither the data nor the datasets need to be distributed across multiple systems, I don't expect we'll ever get that kind of load. For now, the whole damn thing runs inside a single VM (I can cross that bridge if I get there).
By "recovering when needed," what I mean is something like this: The user runs a really carefully crafted search query, which gives them some set of objects as a result. They then decide they want to manipulate that set. When they (as a random example) click the "graph these objects by year" button, I need to be able to retrieve the full set of object IDs so I can take them back to the Solr server and run more queries. I'd rather store the object IDs (and not the search query), because the result set may change underneath the user as we add more objects.
A "while" is roughly the length of a user session. There's a complication, though, that might matter: I may wind up needing to implement a job queue so that I can defer processing, in which case the "while" would need to be "as long as it takes to process your job."
Thanks to Jeff for prodding me to provide the right kind of further detail.
First trick: don't represent your SHA1 as text, but rather as the 20 bytes it takes up. The hex value you see is a way of showing bytes in human readable form. If you store them properly, you're at 9.5MB instead of 22.
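The arithmetic behind those figures can be checked with a few lines of Python. The 48 characters per ID is the figure from the question, megabytes are taken as 2^20 bytes, and the sample input string is arbitrary.

```python
import hashlib

# A SHA-1 digest is 20 raw bytes; the familiar hex form is just a rendering.
digest_hex = hashlib.sha1(b"example record").hexdigest()
digest_raw = bytes.fromhex(digest_hex)

n_ids = 500_000
mib = 2 ** 20

text_mib = n_ids * 48 / mib               # IDs stored as 48-character text
raw_mib = n_ids * len(digest_raw) / mib   # IDs stored as raw digest bytes

print(len(digest_hex), len(digest_raw))        # 40 20
print(round(text_mib, 1), round(raw_mib, 1))   # 22.9 9.5
```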
Second, you haven't really explained the nature of what you're doing. Are your saved datasets references to immutable objects in the existing database? What do you mean by recovering them when needed? How long is "a while" when you talk about expiration? Is the underlying data that you're referencing static or dynamic? Can you save the search pattern and an offset, or do you need to save the individual reference?
Does the data related to a session need to be inserted into a database? Might it be more efficient in files? Does that need to be distributed across multiple systems?
There are a lot of questions left in my answer. For that, you need to better express or even define the requirements beyond the technical overview you've given.
Update: There are many possible solutions for this. Here are two:
Write those to a single table (saved_searches or such) that has an incrementing search id. Bonus points for inserting your keys in sorted order. (search_id bigint unsigned, item_id char(20), primary key (search_id, item_id)). That will really limit fragmentation, keep each search clustered, and free up pages in a roughly sequential order. It's almost a rolling table, and that's about the best case for doing great amounts of insertions and deletions. In that circumstance, you pay a cost for insertion, and double that cost for deletion. You must also iterate the entire search result.
If your search items have an incrementing primary id such that any new insertion to the database will have a higher value than anything that is already in the database, that is the most efficient. Alternately, inserting a datestamp would achieve the same effect with less efficiency (every row must actually be checked in a query instead of just the index entries). If you take note of that maximum id, and you don't delete records, then you can save searches that use zero space by always setting a maximum id on the saved query.
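The first of these two options might look roughly like the sketch below, written with Python's sqlite3 module in place of MySQL. The helper names are invented, the item IDs are stored as raw 20-byte digests, and SQLite's WITHOUT ROWID clause is used so that rows are physically ordered by the primary key, which approximates how InnoDB clusters a table.

```python
import hashlib
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE saved_searches (
        search_id INTEGER NOT NULL,
        item_id   BLOB    NOT NULL,     -- raw 20-byte SHA-1
        PRIMARY KEY (search_id, item_id)
    ) WITHOUT ROWID
    """
)


def save_search(conn, search_id, item_ids):
    """Store one dataset, inserting its keys in sorted order."""
    conn.executemany(
        "INSERT INTO saved_searches VALUES (?, ?)",
        ((search_id, item_id) for item_id in sorted(item_ids)),
    )


def load_search(conn, search_id):
    """Return every item ID of a dataset, in key order."""
    cursor = conn.execute(
        "SELECT item_id FROM saved_searches WHERE search_id = ? ORDER BY item_id",
        (search_id,),
    )
    return [row[0] for row in cursor]


def expire_search(conn, search_id):
    """Drop a whole dataset; its rows sit next to each other in the index."""
    conn.execute("DELETE FROM saved_searches WHERE search_id = ?", (search_id,))


ids = [hashlib.sha1(str(n).encode()).digest() for n in range(5)]
save_search(conn, 1, ids)
save_search(conn, 2, ids[:2])

loaded = load_search(conn, 1)
expire_search(conn, 1)
remaining = conn.execute("SELECT COUNT(*) FROM saved_searches").fetchone()[0]
```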

Private messaging system, large single table versus many small tables

I'm considering a design for a private messaging system and I need some input here, basically I have several questions regarding this. I've read most of the related questions and they've given me some thought already.
All of the basic messaging systems I've thus far looked into use a single table for all of the users' messages. With indexes etc this approach would seem fine.
What I wanted to know is if there would be any benefit to splitting the user messages into separate tables. So when a new user is created a new table is created (either in the same or a dedicated message database) which stores all of the messages - sent and received -for that user.
What are the pitfalls/benefits to approaching things that way?
I'm writing in PHP; would the code required be particularly more cumbersome than with the first, large-table option?
Would the eventual result, with a large number of smaller tables, be a more robust, trouble-free design than one large table?
In the event of large numbers of concurrent users, how would the performance of the server compare when dealing with one large table versus many small tables?
Any help with those questions or other input would be appreciated. I'm currently working through a smaller scale design for my test site before rewriting the PM module and would like to optimise it. My poor human brain handles separate table far more easily, but the same isn't necessarily so for a computer.
You'll just get headaches from moving to numerous small tables. Databases are made for handling lots of data; let them do their thing.
You'll likely end up using dynamic table names in queries (SELECT * FROM $username WHERE ...), making smart features like stored procedures and possibly parameterized queries a lot trickier if not outright impossible. Usually a really bad idea.
Try rewriting SELECT * FROM messages WHERE authorID = 1 ORDER BY date_posted DESC, but where "messages" is anywhere between 1 and 30,000 different tables. Keeping your table relations monogamous will keep them bidirectional, way more useful.
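A small sketch of the single-table version, with Python's sqlite3 module standing in for MySQL. Only messages, authorID and date_posted come from the query above; the other columns, the index and the sample rows are assumptions. Because the table name is fixed, the user ID can be passed as an ordinary bound parameter instead of being spliced into the SQL text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE messages (
        id          INTEGER PRIMARY KEY,
        authorID    INTEGER NOT NULL,
        recipientID INTEGER NOT NULL,
        date_posted TEXT    NOT NULL,
        body        TEXT    NOT NULL
    );
    -- Lets the per-author lookup avoid scanning other users' messages.
    CREATE INDEX idx_messages_author ON messages (authorID, date_posted);
    """
)
conn.executemany(
    "INSERT INTO messages (authorID, recipientID, date_posted, body) "
    "VALUES (?, ?, ?, ?)",
    [
        (1, 2, "2024-01-01 09:00:00", "first"),
        (2, 1, "2024-01-01 09:05:00", "reply"),
        (1, 3, "2024-01-02 12:00:00", "second"),
    ],
)


def messages_by_author(conn, author_id):
    """One parameterized query serves every user."""
    cursor = conn.execute(
        "SELECT body FROM messages WHERE authorID = ? ORDER BY date_posted DESC",
        (author_id,),
    )
    return [row[0] for row in cursor]


sent_by_1 = messages_by_author(conn, 1)
```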
If you think table size will really be a problem, set up an "archived messages" clone table and periodically move old, already-read messages there, where they won't get in the way. Also note how most forum software with private messaging allows for limiting user inbox sizes. There are a few ways to solve the problem while keeping things sane.
I'm agreeing with @MarkR here - in that initially the one table for messages is definitely the way to proceed. As time progresses, and should you end up with a very large table, you can consider how best to partition the table. That's counter to the way I'd normally advise design, but we're talking about one table which is fairly simple - not a huge enterprise system.
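One way such an archiving job could look, sketched with Python's sqlite3 module as a stand-in for MySQL. The column names, the is_read flag and the function archive_old_messages are hypothetical. The copy and the delete run in one transaction so a message is never lost or duplicated halfway through.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE messages (
        id          INTEGER PRIMARY KEY,
        recipientID INTEGER NOT NULL,
        date_posted TEXT    NOT NULL,
        is_read     INTEGER NOT NULL DEFAULT 0,
        body        TEXT    NOT NULL
    );
    -- Same structure as messages, holding only archived rows.
    CREATE TABLE messages_archive (
        id          INTEGER PRIMARY KEY,
        recipientID INTEGER NOT NULL,
        date_posted TEXT    NOT NULL,
        is_read     INTEGER NOT NULL DEFAULT 0,
        body        TEXT    NOT NULL
    );
    """
)
conn.executemany(
    "INSERT INTO messages VALUES (?, ?, ?, ?, ?)",
    [
        (1, 7, "2023-01-01 08:00:00", 1, "old and read"),
        (2, 7, "2023-01-02 08:00:00", 0, "old but unread"),
        (3, 7, "2024-06-01 08:00:00", 1, "recent and read"),
    ],
)


def archive_old_messages(conn, cutoff):
    """Move read messages posted before cutoff; return how many were moved."""
    with conn:  # commits on success, rolls back if either statement fails
        conn.execute(
            "INSERT INTO messages_archive "
            "SELECT * FROM messages WHERE is_read = 1 AND date_posted < ?",
            (cutoff,),
        )
        cursor = conn.execute(
            "DELETE FROM messages WHERE is_read = 1 AND date_posted < ?",
            (cutoff,),
        )
    return cursor.rowcount


moved = archive_old_messages(conn, "2024-01-01 00:00:00")
live_ids = [r[0] for r in conn.execute("SELECT id FROM messages ORDER BY id")]
archived_ids = [r[0] for r in conn.execute("SELECT id FROM messages_archive")]
```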
A very long time ago (pre availability of SQL databases) I built a system that stored private and public messages, and I can confirm that once you split a message base logical entity into more than one everything¹ becomes a lot more complicated; and I doubt that a user per file is the right approach - the overheads will be massive compared to the benefit.
Avoid auto-increment²; using natural keys is very important for future scalability. Designing well to ensure that you can insert and retrieve without locking will be of more benefit.
¹ Indexing, threading, searching, purging/archiving.
² Natural keys are better if you can find one for your data as the autoincremented ID does not describe the data at all and databases are good at locating based on the primary key, so a natural primary key can improve things. Autoincrement can cause problems with a distributed database; it also leaks data when presented externally (to see the number of users registered just create a new account and check your user ID). If you can't find a natural key then a UUID (or GUID) may still be a better option - providing that the database has good support for this as a primary key. See When to use an auto-incremented primary key and when not to
Creating one table per user certainly won't scale well when there are a large number of users with a small number of messages. The way MySQL handles table opening/closing, very large numbers of tables (> 10k, say) become quite inefficient, especially at server startup and shutdown, as well as trying to backup non-transactional tables.
However, the way you've worded your question sounds like a case of premature optimisation. Make it work first, then fix performance problems. This is always the right way to do things.
Partitioning / sharding will become necessary once your scale gets high enough. But there are a lot of other things to worry about in the mean time. Sort them out first :)
One table is the right way to go from an RDBMS PoV. I recommend you use it until you know better.
Splitting large amounts of data into smaller sets makes sense if you're trying to avoid locking issues, for example locking the messages table while doing big selects or updating huge amounts of data at once. In that case long-running queries could block the whole table and everyone would need to wait... You should ask yourself whether this is going to happen in your case. To me, at least, it looks like a messaging system is not going to have such problems, because all information is pushed into the table or retrieved from it in rather small sets. If this is a user-centric application, then getting all messages for a single user is quite easy and fast to do, and the same goes for creating new messages for a particular user... unless you have really huge numbers of users/messages in your system.
Splitting data into multiple tables also has some drawbacks: you will need some kind of management system or logic for how you split everything. Giving a separate table to each user could soon grow into hundreds or thousands of tables, which is, in my opinion, not that nice. Therefore you would probably need some other criterion for splitting the data. If you want the splitting logic to be dynamic and easily adjustable, you would probably also need to save it in the DB somehow. As you see, complexity grows...
An advantage of such data sharding could be scalability: you could easily put different sets of data on different machines once a single machine is not able to handle the whole load.
It depends how your message system works.
Are there concurrency issues?
Does it need to be scalable as the application accommodates more customers?
Designing one table will work perfectly well for a small, single-user system handling one message at a time.
However, if you are considering a multi-user, concurrent messaging system, the tables should be split.
The data model for a real-time application is recommended to be "normalized" (split into tables) because of "locking & latching" and data redundancy issues.
Locking policy varies by database vendor. If you have tables that are updated and selected from concurrently by the application, locking issues (page level, row level or table level, depending on the vendor) arise. Some bad DB and app designs lock the table completely, so messages never go through.
The redundancy issue is clearer. If you use only one table, some information (such as the user, since I guess one user could send multiple messages) is redundant.
Try googling "normalization" and "locking".