I developed a stats site for a game as a learning project a few years back. It's still used today and I'd like to get it cleaned up a bit.
The database is one area that needs improvement. I have a table for the game statistics with columns like GameID, PlayerID, Kills, Deaths, DamageDealt, DamageTaken, etc. In total there are about 50 fields in that single table, and many more could be added in the future. At what point are there too many fields? The table currently has 57,341 rows and is 153.6 MiB by itself.
I also have a few fields in this same table that store arrays in a BLOB. One example is player-vs-player matchups: the array stores how many times that player killed each other player in the game. These are the largest fields by size. Is storing an array in a BLOB advisable?
The array looks like:
[Killed] => Array
(
[SomeDude] => 13
[GameGuy] => 10
[AnotherPlayer] => 8
[YetAnother] => 7
[BestPlayer] => 3
[APlayer] => 9
[WorstPlayer] => 2
)
These arrays tend not to exceed 10 players.
I prefer not to have one table with an undetermined number of columns (with more to come), but rather an associated table of labels and values: each user has an id, and you use that id as a key into the table of labels and values. That way you only store the data you need per user. I believe this approach is called EAV (as per Triztian's comment), and it's how medical databases are kept, since there are SO many potential fields for an individual patient, while any given patient has actual data in only a very small number of them.
So, you'd have:
user:
id | username | some_other_required_field
user_data:
id | user_id | label | value
Now you can have as many or as few user_data rows as you need per user.
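To make the shape concrete, here's a minimal sketch of the user / user_data layout, using SQLite purely for illustration (the question is about MySQL, but the SQL is the same idea). Table and column names follow the answer above; note that values land in a single text column, which is one of EAV's trade-offs:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE user (
    id INTEGER PRIMARY KEY,
    username TEXT NOT NULL
);
CREATE TABLE user_data (
    id INTEGER PRIMARY KEY,
    user_id INTEGER NOT NULL REFERENCES user(id),
    label TEXT NOT NULL,
    value TEXT
);
""")

conn.execute("INSERT INTO user (id, username) VALUES (1, 'SomeDude')")
# Only the stats this player actually has get a row; no wide table of NULLs.
conn.executemany(
    "INSERT INTO user_data (user_id, label, value) VALUES (?, ?, ?)",
    [(1, 'Kills', '13'), (1, 'Deaths', '4')],
)

rows = conn.execute(
    "SELECT label, value FROM user_data WHERE user_id = 1 ORDER BY label"
).fetchall()
print(rows)  # [('Deaths', '4'), ('Kills', '13')]
```

Adding a new stat later means inserting rows with a new label, not altering the table.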
[Edit]
As to your array, I would treat this with a relational table as well. Something like:
player_interaction:
id | player1_id | player2_id | interaction_type
Here you would store the two players who had an interaction and what type of interaction it was.
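A sketch of that interaction table, again in SQLite for illustration. The column names killer_id / victim_id are my own (the two player columns need distinct names); the kill counts that were packed into the BLOB array then fall out of a simple GROUP BY:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE player_interaction (
    id INTEGER PRIMARY KEY,
    killer_id INTEGER NOT NULL,
    victim_id INTEGER NOT NULL,
    interaction_type TEXT NOT NULL
)
""")
# Player 1 killed player 2 three times this game; player 2 killed player 1 once.
conn.executemany(
    "INSERT INTO player_interaction (killer_id, victim_id, interaction_type)"
    " VALUES (?, ?, ?)",
    [(1, 2, 'kill'), (1, 2, 'kill'), (1, 2, 'kill'), (2, 1, 'kill')],
)
# Per-matchup totals, replacing the serialized array.
counts = conn.execute("""
    SELECT killer_id, victim_id, COUNT(*)
    FROM player_interaction
    WHERE interaction_type = 'kill'
    GROUP BY killer_id, victim_id
    ORDER BY killer_id
""").fetchall()
print(counts)  # [(1, 2, 3), (2, 1, 1)]
```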
The table design seems mostly fine, as long as none of the columns can be calculated from other columns in the same row. I.e., you're not storing SelfKills, OtherDeath, and TotalDeaths (where TotalDeaths = SelfKills + OtherDeath). That would be redundant and could be cut from your table.
I'd be curious to learn more about how you are storing those arrays in a BLOB. What purpose do they serve there? Why aren't they normalized into a table for easy data transformation and analytics? (Or are they, and they are just being stored as an array here for ease of display to end users?)
Also, I'd be curious how much space your BLOBs take up vs. the rest of the table. Generally speaking, the size of the rows isn't as big a deal as the number of rows, and ~60K rows is no big deal at all, as long as you aren't writing queries that need to check every column value (ideally you're ignoring the BLOBs when writing a WHERE clause).
With MySQL you've got a hard limit of 4,096 columns per table and roughly 65 KB of storage per row. If you need to store large strings, use a TEXT field; TEXT and BLOB values are stored separately from the rest of the row. BLOBs really should be reserved for non-textual data (if you must use them at all).
Don't worry in general about the size of your DB; think instead about the structure and how it's organized and indexed. I've seen small databases run like crap.
If you still want numbers: when your total DB gets into the GB range, or past a couple hundred thousand rows in a single table, then start worrying more. 150 MB in 60K rows isn't much, and table scans aren't going to cost you much in performance. However, now's the time to make sure you create good covering indexes for your heavily used queries.
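As a quick illustration of a covering index (shown here with SQLite; MySQL behaves the same way in principle): if the index contains every column a query touches, the engine can answer the query from the index alone without visiting the table. The table and index names below are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE game_stats (GameID INT, PlayerID INT, Kills INT, Deaths INT)"
)
# Index covers both the WHERE column and the selected column.
conn.execute("CREATE INDEX idx_player_kills ON game_stats (PlayerID, Kills)")

plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT Kills FROM game_stats WHERE PlayerID = 7"
).fetchall()
# The plan's detail string reports a COVERING INDEX search,
# e.g. "SEARCH game_stats USING COVERING INDEX idx_player_kills (PlayerID=?)"
print(plan[0][3])
```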
There's nothing wrong with adding columns to a database table as time goes on. Database designs change all the time. The thing to keep in mind is how the data is grouped. I have always treated a database table as a collection of like items.
Things I consider are as follows:
When inserting data into a row how many columns will be null?
Does this new column apply to 80% of my data that is already there?
Will I be doing several updates to a few columns in this table?
If so, do I need to keep track of what the previous values were, just in case?
By thinking about your data like this, you may discover that you need to break your table up into a handful of smaller tables linked together by foreign keys.
I have a basic question about database design.
I have a lot of files which I have to read and insert into a database. Each file has a few thousand lines, and each line has about 30 fields (of these types: smallint, int, bigint, varchar, JSON). Of course I use multiple threads along with bulk inserting to increase insert speed (in the end I'll have 30-40 million records).
After inserting I want to have some sophisticated analysis and the performance is important to me.
Now, once I have each line's fields and I'm ready to insert, I have 3 approaches:
1- One big table:
In this case I can create a big table with 30 columns and store all of the files' fields in it. So there is one table of huge size on which I want to run a lot of analysis.
2- A fairly large table (A) and some little tables (B)
In this case I can create some little tables consisting of the columns whose records are largely identical when separated from the other columns. These little tables have only a few hundred or thousand records instead of 30 million. In the fairly large table (A), I omit those columns and use a foreign key instead. In the end I have a table (A) with 20 columns and 30 million records, and some tables (B) with 2-3 columns and 100-50,000 records each. So in order to analyze table A, I have to use joins in my SELECTs.
3- Just a fairly large table
In this case I can create a fairly large table like table A above (with 20 columns), but instead of using foreign keys I use a mapping between source and destination columns (something like a foreign key, with a small difference). For example, I have 3 columns c1, c2, c3 that in case 2 I would put in another table B and reference by foreign key; now I assign a specific number to each distinct (c1, c2, c3) combination at insert time and keep the mapping between the record and its assigned value in the application code. So this table is exactly like table A in case 2, but there is no need for a join in the SELECT.
While insert time matters, the analysis time afterwards is more important to me, so I want to know your opinion on which of these cases is better; I'd also be glad to see other solutions.
From a design perspective, 30 to 40 million rows is not that bad a number. Performance depends entirely on how you design your DB.
If you are using SQL Server, you could consider putting the large table on a separate database filegroup. I have worked on a similar case where we had around 1.8 billion records in a single table.
For the analysis, if you are not going to look at the entire dataset in one shot, you could consider partitioning the data. You could use a partition scheme based on your needs; one example would be to split the data into yearly partitions, which helps if your analysis is limited to a year's worth of data (just an example).
The major thing would be denormalization/normalization based on your needs, and of course clustered/nonclustered indexing of the data. Again, this will depend on what sort of analysis queries you will be running.
A single thread can INSERT one row at a time and finish 40M rows in a day or two. With LOAD DATA, you can do it in perhaps an hour or less.
But is loading the real question? For doing grouping, summing, etc, the question is about SELECT. For "analytics", the question is not one of table structure. Have a single table for the raw data, plus one or more "Summary tables" to make the selects really fast for your typical queries.
Until you give more details about the data, I cannot give more details about a custom solution.
Partitioning (vertical or horizontal) is unlikely to help much in MySQL. (Again, details needed.)
Normalization shrinks the data, which leads to faster processing. But, it sounds like the dataset is so small that it will all fit in RAM?? (I assume your #2 is 'normalization'?)
Beware of over-normalization.
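The summary-table idea above can be sketched quickly. Using SQLite for illustration (the table and column names are hypothetical): analytic queries hit a small pre-aggregated table instead of scanning the 40M-row raw table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_data (day TEXT, category TEXT, amount INT)")
conn.executemany("INSERT INTO raw_data VALUES (?, ?, ?)", [
    ("2013-05-01", "a", 10), ("2013-05-01", "a", 5), ("2013-05-01", "b", 7),
    ("2013-05-02", "a", 3),
])

# Build the summary once (or refresh it incrementally as rows arrive).
conn.execute("""
CREATE TABLE daily_summary AS
SELECT day, category, COUNT(*) AS n, SUM(amount) AS total
FROM raw_data
GROUP BY day, category
""")

summary = conn.execute(
    "SELECT * FROM daily_summary ORDER BY day, category"
).fetchall()
print(summary)
# [('2013-05-01', 'a', 2, 15), ('2013-05-01', 'b', 1, 7), ('2013-05-02', 'a', 1, 3)]
```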
tl;dr at the bottom.
So, I have an application with roughly the following schema:
`budget`hasMany =>
`item1`
`item2`
...
`item10`
Now, these 10 items share a set of 23 fields that are identical across all 10 items. At least another 20 fields are shared by 7 or more of the items.
It came about like this; in retrospect it was idiotic, but at the time it seemed the right thing.
So, with this in mind, I thought: why the hell not make 9 tables disappear and create 1 table containing all the fields from all the items, given that so many are shared anyway?
What would I gain? Lots of code would disappear. Lots of tables would disappear. Retrieving a budget with all its items would require only a join with a single table, instead of 10 joins.
My doubts come from the fact that this new table would have around 80 columns. They're all small columns, storing mostly integers, doubles or small varchars. Still, 80 columns strikes me as a lot. Another concern is that in the future, instead of having 10 tables with 1 million records each, I would have 1 big table with 10 million records.
So, my question is: is it worth changing in order to remove some redundancy, reduce the amount of code and improve the ability to retrieve and work with the data?
tl;dr Should I combine 10 tables into 1 table, considering that the 10 tables share a lot of common fields (though the new table will have 80 columns), in order to reduce the number of tables, reduce the amount of code in the app and improve the way I retrieve data?
As far as I know, which might not be a lot, it is generally best to keep a database split up into separate pieces (as it currently is). This is called normalizing the database (https://en.wikipedia.org/wiki/Relational_database).
It limits the errors that might happen to the database and makes it less risky to change things through updates, etc. It is also better if you want to insert one item but not another: if you had only 1 table, all the other columns would be null, and you would always have to go back and fetch info, which makes the insert statements harder.
If you will always insert all the items at once and always run queries over all of them (no advanced computation on individual items), then it might be reasonable to put everything into one table. But if you want to insert only a couple of items and then make more complex computations, I would advise you to keep them separated and linked through some kind of customer_id or whatever.
@yBrodsky, as an example you should create a table for furniture that stores the furniture name, id and description, and another table that stores its attributes along with the furniture id.
The furniture table will have columns: id, furniture_title, description.
The other table will have: id, furniture_id, attribute_key, attribute_value.
Say I have a simple table of products:
Id, ProductCode, Price, Description etc.
And I have 10,000 products... but 100 of them require sound samples (e.g. they are xylophones).
I want to store in the db whether a product has a sound sample.
Therefore, is it better to store this in the products table as a "has_sound" boolean (true or false) column, or as a separate, one-column table that just lists all the product IDs with sounds?
Storing in the products table means the vast majority will just have "has_sound = false", which seems like a bit of a waste.
But storing just a list of "products with sounds" also seems a bit "wrong" to me.
Many thanks :)
You have 10,000 rows.
Even if you choose an inefficient 4 byte field size you're looking at all of ~40k on disk by adding a field to the product table. In contrast, an empty innodb table with (int, tinyint) fields is ~100k on disk (plus an additional RAM overhead to hold table metadata). Filling that table with 100 records makes no difference because everything fits within one allocation page.
Neither of these overheads even come remotely close to being a performance consideration.
Do what makes the code clearest, simplest and most maintainable for the next developer who comes along (which in this case is to store an extra field on the product table).
The new table is more correctly relational. If it were me I'd have a two column table, product ID and a BLOB with the sound sample for those products that have a sound sample. While you could have a Boolean (or NULLable BLOB) on the table, splitting it out allows for better partitioning and additional data around the sound sample (different sample formats, multiple octaves/pitches/notes or whatever) is kept in its correct place next to the sound.
As Levi said though, "the best" is the most maintainable as there will be no significant performance or waste issues at this scale.
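The flag-column approach is a one-line schema change. A minimal sketch in SQLite (column and table names are illustrative; MySQL's ALTER TABLE works the same way): existing rows pick up the default, and only the xylophones get updated.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE products (id INTEGER PRIMARY KEY, product_code TEXT, price REAL)"
)
conn.executemany(
    "INSERT INTO products (id, product_code, price) VALUES (?, ?, ?)",
    [(1, 'XYL-1', 99.0), (2, 'DRUM-1', 49.0)],
)

# Adding the flag costs one tiny column; existing rows default to 0 (false).
conn.execute("ALTER TABLE products ADD COLUMN has_sound INTEGER NOT NULL DEFAULT 0")
conn.execute("UPDATE products SET has_sound = 1 WHERE id = 1")

flags = conn.execute("SELECT id, has_sound FROM products ORDER BY id").fetchall()
print(flags)  # [(1, 1), (2, 0)]
```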
I wanted to create a table with the name of the table being a date. When I gather stock data for that day, I wanted to store it like this:
$date = date('Y-m-d');
$mysqli->query(
    "CREATE TABLE IF NOT EXISTS `$date` (ID INT PRIMARY KEY)"
);
That way I will have a database like:
2013-5-1: AAPL | 400 | 400K
MSFT | 30 | 1M
GOOG | 700 | 2M
2013-5-2: ...
I think it would be easier to store information like this, but I see that a similar question was closed:
How to add date to MySQL table name?
"Generating more and more tables is exactly the opposite of 'keeping the database clean'. A clean database is one with a sensible, normalized, fixed schema which you can run queries against."
If this is not the right way to do it, could someone suggest what would be? Many people commenting on that question stated that this was not a "clean" solution.
Do not split your data into several tables. This will become a maintenance nightmare, even though it might seem sensible to do so at first.
I suggest you create a date column that holds the information you currently want to put into the table name. Databases are pretty clever in storing dates efficiently. Just make sure to use the right datatype, not a string. By adding an index to that column you will also not get a performance penalty when querying.
What you gain is full flexibility in querying. There will be virtually no limits to the data you can extract from a table like this. You can join with other tables based on date ranges etc. This will not be possible (or at least much more complicated and cumbersome) when you split the data by date into tables. For example, it will not even be easy to just get the average of some value over a week, month or year.
If, some time in the future, the data grows dramatically (to more than several million rows, I would estimate; it depends on the real amount of data you collect), you can have a look at the data partitioning features MySQL offers out of the box. However, I would not advise using them immediately, unless you already have a clearly cut growth model for the data.
In my experience there is very seldom a real need for this technique in most cases. I have worked with tables in the 100s of gigabytes range, with tables having millions of rows. It is all a matter of good indexing and carefully crafted queries when the data gets huge.
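To show the recommended layout concretely: one table with an indexed date column instead of one table per day. A sketch in SQLite (the question uses MySQL; table and column names here are my own), reusing the stock data from the question:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE stock_quotes (
    quote_date TEXT NOT NULL,   -- one date column instead of one table per day
    symbol TEXT NOT NULL,
    price REAL,
    volume INTEGER
)
""")
# The index keeps date-filtered queries fast without splitting tables.
conn.execute("CREATE INDEX idx_quote_date ON stock_quotes (quote_date)")

conn.executemany("INSERT INTO stock_quotes VALUES (?, ?, ?, ?)", [
    ("2013-05-01", "AAPL", 400.0, 400000),
    ("2013-05-01", "GOOG", 700.0, 2000000),
    ("2013-05-02", "AAPL", 405.0, 350000),
])

# A range query that would be painful across per-day tables is now trivial.
avg = conn.execute(
    "SELECT AVG(price) FROM stock_quotes WHERE symbol = 'AAPL' "
    "AND quote_date BETWEEN '2013-05-01' AND '2013-05-07'"
).fetchone()[0]
print(avg)  # 402.5
```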
I have a membership database that I am looking to rebuild. Every member has 1 row in a main members table. From there I will use a JOIN to reference information from other tables. My question is, which of the following would be better for performance:
1 data table that specifies a data type and then the data. Example:
data_id | member_id | data_type | data
1 | 1 | email | test@domain.com
2 | 1 | phone | 1234567890
3 | 2 | email | test@domain2.com
Or
Would it be better to make a table of all the email addresses, then a table of all phone numbers, etc., and use a select statement with multiple joins?
Keep in mind, this database will start with over 75,000 rows in the member table, and will include phone, email, fax, first and last name, company name, and address (city, state, zip). Each member will have at least 1 of each of those, but can have multiple (normally 1-3 per member), so there will be in excess of 75,000 phone numbers, email addresses, etc.
So basically: join 1 table of over 750,000 rows, or join 7-10 tables of over 75,000 rows each?
Edit: performance of this database becomes an issue when we insert sales data that needs to be matched to existing data, i.e. taking a CSV file of 10K rows of sales and contact data and querying the database to find which member corresponds to which sales row in the CSV. Oh yeah, and this is done on a web server, not a local machine (not my choice).
The obvious way to structure this would be to have one table with one column for each data item (email, phone, etc) you need to keep track of. If a particular data item can occur more than once per member, then it depends on the exact nature of the relationship between that item and the member: if the item can naturally occur a variable number of times, it would make sense to put these in a separate table with a foreign key to the member table. But if the data item can occur multiple times in a limited, fixed set of roles (say, home phone number and mobile phone number) then it makes more sense to make a distinct column in the member table for each of them.
If you run into performance problems with this design (personally, I don't think 75000 is that much - it should not give problems if you have indexes to properly support your queries) then you can partition the data. Mysql supports native partitioning (http://dev.mysql.com/doc/refman/5.1/en/partitioning.html), which essentially distributes collections of rows over separate physical compartments (the partitions) while maintaining one logical compartment (the table). The obvious advantage here is that you can keep querying a logical table and do not need to manually bunch up the data from several places.
If you still don't think this is an option, you could consider vertical partitioning: that is, making groups of columns or even single columns an put those in their own table. This makes sense if you have some queries that always need one particular set of columns, and other queries that tend to use another set of columns. Only then would it make sense to apply this vertical partitioning, because the join itself will cost performance.
(If you're really running into the billions then you could consider sharding - that is, use separate database servers to keep a partition of the rows. This makes sense only if you can either quickly limit the number of shards that you need to query to find a particular member row or if you can efficiently query all shards in parallel. Personally it doesn't seem to me you are going to need this.)
I would strongly recommend against making a single "data" table. This would essentially spread out each thing that would naturally be a column to a row. This requires a whole bunch of joins and complicates writing of what otherwise would be a pretty straightforward query. Not only that, it also makes it virtually impossible to create proper, efficient indexes over your data. And on top of that it makes it very hard to apply constraints to your data (things like enforcing the data type and length of data items according to their type).
There are a few corner cases where such a design could make sense, but improving performance is not one of them. (See: entity attribute value antipattern http://karwin.blogspot.com/2009/05/eav-fail.html)
You should research scaling out vs. scaling up when it comes to databases. In addition to the aforementioned research, I would recommend that you use one table in your case if you are not expecting a great deal of data. If you are, then look up dimensions in database design.
75k is really nothing for a DB. You might not even notice the benefits of indexes with that many (index anyway :)).
The point is that, though you should be aware of "scale-out" systems, most DBs, MySQL included, can address this through partitioning, allowing your data access code to remain truly declarative rather than programmatic as to which object you're addressing/querying. It is important to note sharding vs. partitioning, but honestly those are conversations for when your record counts approach 9+ digits, not 5+.
Use neither
Although a variant of the first option is the right approach.
Create a 'lookup' table that stores the possible values of data type (email, phone, etc.). Then use the id from your lookup table in your 'data' table.
That way you actually have 3 tables instead of two.
It's best practice for a classic many-to-many relationship such as this.
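A minimal sketch of that three-table layout, in SQLite for illustration (table names data_type and member_data are my own): the lookup table keeps the type values out of the data rows, so a typo'd type string can't creep in.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE data_type (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE      -- 'email', 'phone', ...
);
CREATE TABLE member_data (
    id INTEGER PRIMARY KEY,
    member_id INTEGER NOT NULL,
    data_type_id INTEGER NOT NULL REFERENCES data_type(id),
    data TEXT
);
""")

conn.executemany("INSERT INTO data_type (id, name) VALUES (?, ?)",
                 [(1, 'email'), (2, 'phone')])
conn.executemany(
    "INSERT INTO member_data (member_id, data_type_id, data) VALUES (?, ?, ?)",
    [(1, 1, 'test@domain.com'), (1, 2, '1234567890')],
)

# One join recovers the readable type name alongside each value.
rows = conn.execute("""
    SELECT dt.name, md.data
    FROM member_data md
    JOIN data_type dt ON dt.id = md.data_type_id
    WHERE md.member_id = 1
    ORDER BY dt.name
""").fetchall()
print(rows)  # [('email', 'test@domain.com'), ('phone', '1234567890')]
```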