In MySQL, is it generally faster/more efficient/scalable to return 100 rows with 3 columns, or 1 row with 100 columns?
In other words, when storing many key => value pairs related to a record, is it better to store each key => value pair in a separate row with the record_id as a key, or to have one row per record_id with a column for each key?
Also, assume that keys will need to be added/removed fairly regularly, which I assume would affect the long-term maintainability of the many-column approach once the table gets sufficiently large.
Edit: to clarify, by "a regular basis" I mean the addition or removal of a key once a month or so.
You should never add or remove columns on a regular basis.
http://en.wikipedia.org/wiki/Entity-Attribute-Value_model
There are a lot of bad things about this model and I would not use it if there was any other alternative. If you don't know the majority (except a few user customizable fields) of data columns you need for your application, then you need to spend more time in design and figure it out.
If your keys are preset (known at design time), then yes, you should put each key into a separate column.
If they are not known at design time, then you have to return your data as a list of key-value pairs which you should later parse outside the RDBMS.
If you are storing key/value pairs, you should have a table with two columns, one for the key (make this the PK for the table) and one for the value (probably don't need this indexed at all). Remember, "The key, the whole key, and nothing but the key."
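A minimal sketch of the row-per-pair layout, assuming the pairs belong to records as in the question, so the primary key becomes the composite (record_id, key). It uses Python's built-in sqlite3 as a stand-in for MySQL so that it runs anywhere; the table name record_attr, its column names, and the helper load_attrs are hypothetical.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for MySQL (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE record_attr (
        record_id INTEGER NOT NULL,
        attr_key  TEXT    NOT NULL,
        attr_val  TEXT,
        PRIMARY KEY (record_id, attr_key)  -- one value per key per record
    )
    """
)
conn.executemany(
    "INSERT INTO record_attr VALUES (?, ?, ?)",
    [(1, "color", "red"), (1, "size", "L"), (2, "color", "blue")],
)


def load_attrs(conn, record_id):
    """Return all key/value pairs of one record as a dict (parsed outside the RDBMS)."""
    rows = conn.execute(
        "SELECT attr_key, attr_val FROM record_attr WHERE record_id = ?",
        (record_id,),
    )
    return dict(rows.fetchall())


attrs = load_attrs(conn, 1)
```

Adding or removing a key here is an INSERT or DELETE of rows, not an ALTER TABLE.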
In the multi-column approach, you will find that your table grows without bound, because removing a column will nuke all of its values and you won't want to do it. I speak from experience here, having worked on a legacy system that had one table with almost 1000 columns, most of which were bit fields. Eventually, you stop being able to make the case to delete any of the columns because someone might be using it, and the last time you did it, you had to work till 2 am rolling back to backups.
First: determine how frequently your data needs to be accessed. If the data always needs to be retrieved in one shot and most of it is used, then consider storing all the key/value pairs as a serialized value or as an XML value. If you need to do any sort of complex analysis on that data and you need the value pairs, then columns are OK, but limit them to values that you know you will need to query on. It's generally easier to design queries that use one column per parameter than one row per parameter. You will also find it easier to work with the returned values if they are all in one row rather than spread over many.
Second: separate your most frequently accessed data and put it in its own table and the other data in another. 100 columns is a lot by the way so I recommend that you split your data into smaller chunks that will be more manageable.
Lastly: if you have data that may change frequently, then you should define each key (the would-be column) as a row in one table and store every value against that key's numeric ID in another. This assumes that you will be using the same key more than once, and it should speed up your lookups.
I am currently learning about foreign keys and trying to add them as much as I can in my application to ensure data integrity. I am using InnoDB on MySQL.
My clicks table has a structure something like...
id, timestamp, link_id, user_id, ip_id, user_agent_id, ... etc for about 12 _id columns.
Obviously these all point to other tables, so should I add a foreign key on them? MySQL is creating an index automatically for every foreign key, so essentially I'll have an index on every column? Is this what I want?
FYI - this table will essentially be my most bulky table. My research basically tells me I'm sacrificing performance for integrity but doesn't suggest how harsh the performance drop will be.
Right before inserting such a row, you did 12 inserts or lookups to get the ids, correct? Then, as you do the INSERT, it will do 12 checks to verify that all of those ids have a match. Why bother; you just verified them with the code.
Sure, have FKs in development. But in production, you should have weeded out all the coding mistakes, so FKs are a waste.
A related tip -- Don't do all the work at once. Put the raw (not-yet-normalized) data into a staging table. Periodically do bulk operations to add new normalization keys and get the _id's back. Then move them into the 'real' table. This has the added advantage of decreasing the interference with reads on the table. If you are expecting more than 100 inserts/second, let's discuss further.
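The staging idea can be sketched roughly as follows. This is only an illustration under assumptions: it uses Python's built-in sqlite3 instead of MySQL, normalizes a single column (user_agent) instead of twelve, and the table names clicks_staging and user_agents and the function flush_staging are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stand-in for MySQL (assumption)
conn.executescript(
    """
    CREATE TABLE user_agents (id INTEGER PRIMARY KEY, name TEXT UNIQUE NOT NULL);
    CREATE TABLE clicks (
        id INTEGER PRIMARY KEY,
        ts TEXT NOT NULL,
        user_agent_id INTEGER NOT NULL
    );
    -- Raw, not-yet-normalized rows land here first.
    CREATE TABLE clicks_staging (ts TEXT NOT NULL, user_agent TEXT NOT NULL);
    """
)


def flush_staging(conn):
    """Normalize everything in the staging table in one bulk pass."""
    with conn:  # one transaction: commit on success, roll back on error
        # 1. Add lookup rows for values not seen before.
        #    (SQLite spells this INSERT OR IGNORE; MySQL uses INSERT IGNORE.)
        conn.execute(
            "INSERT OR IGNORE INTO user_agents (name) "
            "SELECT DISTINCT user_agent FROM clicks_staging"
        )
        # 2. Move the rows into the real table, swapping strings for ids.
        conn.execute(
            "INSERT INTO clicks (ts, user_agent_id) "
            "SELECT s.ts, ua.id FROM clicks_staging s "
            "JOIN user_agents ua ON ua.name = s.user_agent"
        )
        # 3. Empty the staging table.
        conn.execute("DELETE FROM clicks_staging")


conn.executemany(
    "INSERT INTO clicks_staging VALUES (?, ?)",
    [
        ("2024-01-01 10:00:00", "Firefox"),
        ("2024-01-01 10:00:01", "Chrome"),
        ("2024-01-01 10:00:02", "Firefox"),
    ],
)
flush_staging(conn)

n_agents = conn.execute("SELECT COUNT(*) FROM user_agents").fetchone()[0]
n_clicks = conn.execute("SELECT COUNT(*) FROM clicks").fetchone()[0]
n_staged = conn.execute("SELECT COUNT(*) FROM clicks_staging").fetchone()[0]
```

Three set-based statements replace a lookup per row, which is the point of batching the work.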
The generic answer is that if you considered a data item so important that you created a lookup table for the possible values, then you should create a foreign key relationship to ensure you are not getting any orphan records.
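As a small illustration of what such a relationship buys you, the sketch below shows an orphan insert being rejected. It assumes Python's built-in sqlite3 as a stand-in for MySQL/InnoDB; the reduced two-column schema and the helper try_insert_click are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stand-in for MySQL (assumption)
# SQLite only enforces foreign keys when asked to on each connection.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript(
    """
    CREATE TABLE links (id INTEGER PRIMARY KEY, url TEXT NOT NULL);
    CREATE TABLE clicks (
        id INTEGER PRIMARY KEY,
        link_id INTEGER NOT NULL REFERENCES links(id)
    );
    INSERT INTO links (id, url) VALUES (1, 'https://example.com');
    """
)


def try_insert_click(conn, link_id):
    """Insert a click; return False if the foreign key rejects it."""
    try:
        conn.execute("INSERT INTO clicks (link_id) VALUES (?)", (link_id,))
        return True
    except sqlite3.IntegrityError:
        return False


valid_ok = try_insert_click(conn, 1)     # link 1 exists
orphan_ok = try_insert_click(conn, 999)  # no such link
```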
However, you should reconsider, whether all data items (fields) in your clicks table need a lookup table. For example ip_id field probably represents an IP address. You can simply store the IP address directly in the clicks table, you do not really need a lookup table, since IP addresses have a wide range and the IP addresses are unique.
Based on the re-evaluation of the fields, you may be able to reduce the number of related tables, thus the number of foreign keys and indexes.
Here are three things to consider:
What is the ratio of reads to writes on this table? If you are reading much more often than writing, then more indexes could be good, but if it is the other way around then the cost of maintaining those indexes becomes harder to bear.
Are some of the foreign keys not very selective? If you have an index on the gender_id column then it is probably a waste of space. My general rule is that indexes without included columns should have about 1000 distinct values (unless values are unique) and then tweak from there.
Are some foreign keys rarely or never going to be used as a filter for a query? If you have a last_modified_user_id field but you never have any queries that will return a list of items which were last modified by a particular user then an index on that field is less useful.
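To check the second point against real data, one option is simply to count distinct values per candidate column. A rough sketch, assuming Python's built-in sqlite3 as a stand-in for MySQL and made-up sample data; the helper distinct_count is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stand-in for MySQL (assumption)
conn.execute(
    "CREATE TABLE clicks (id INTEGER PRIMARY KEY, gender_id INTEGER, user_id INTEGER)"
)
# Made-up sample data: 2 distinct genders, 500 distinct users.
conn.executemany(
    "INSERT INTO clicks (gender_id, user_id) VALUES (?, ?)",
    [(i % 2, i % 500) for i in range(2000)],
)


def distinct_count(conn, table, column):
    """Count distinct values in a column.

    Identifiers cannot be bound as parameters, so only pass trusted
    table and column names here.
    """
    row = conn.execute(f"SELECT COUNT(DISTINCT {column}) FROM {table}").fetchone()
    return row[0]


gender_values = distinct_count(conn, "clicks", "gender_id")
user_values = distinct_count(conn, "clicks", "user_id")
```

A column with only a couple of distinct values, like gender_id here, is the kind of index candidate the rule of thumb above would reject.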
A little bit of knowledge about indexes can go a long way. I recommend http://use-the-index-luke.com
Currently I have lots of columns in a MySQL table:
venue_id
venue_name
venue_location
venue_geolocation
venue_type
venue_url
venue_manager
venue_phone
venue_logo
venue_company
venue_zip
venue_vat
venue_visible
Would it be more efficient to store most of the data in one array and in one column, like venue_data? That would leave only 3 columns: venue_id, venue_data, venue_visible. Then in my application I could explode that array. Would it save time or server load?
Storing the values as array (concatenating different values into a string?) is definitely a bad idea because:
You will lose the readability,
you won't be able to easily search on concatenated columns,
you cannot index these columns properly.
Furthermore, it does not have an impact on performance; see also the question "Is there a performance decrease if there are too many columns in a table?"
If you are unhappy with the many columns, you should consider normalizing (DB Normalization) your db schema.
You must ask yourself whether the amount of time and space you 'might' save is worth the cost.
Consider:
Combining columns into one will still have a comparable length as all of them separately
More space could potentially be saved by using appropriately sized data types
Disk space is cheap
Having distinct columns gives you the power to query any of those columns
Distinct columns also allow you to easily add or remove columns at a later date without having to reconstruct every row's combined column
With distinct columns you can use $result->fetch_assoc() to immediately get your result row in an array, vs. spending processing time parsing a complex string
Parsing such a string may be prone to errors that selecting specific columns is not
You can add foreign key constraints and indexes on individual columns which would not work if you combined them
You can easily search on distinct columns, but not if you combine them
I can think of plenty more reasons why distinct columns are a better choice than trying to optimize code in a way that likely will not even save you any time. The query may be a few milliseconds faster, but you lost that time processing the string.
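To make the search argument concrete, here is a small sketch comparing the two layouts. It assumes Python's built-in sqlite3 as a stand-in for MySQL and JSON as the serialization format (the question does not say which format would be used); the table names venues_cols and venues_blob and both helper functions are hypothetical.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stand-in for MySQL (assumption)
conn.executescript(
    """
    CREATE TABLE venues_cols (
        venue_id INTEGER PRIMARY KEY, venue_name TEXT, venue_zip TEXT
    );
    CREATE INDEX idx_venue_zip ON venues_cols (venue_zip);
    CREATE TABLE venues_blob (venue_id INTEGER PRIMARY KEY, venue_data TEXT);
    """
)
venues = [(1, "Alpha Hall", "10001"), (2, "Beta Club", "94105"), (3, "Gamma Bar", "10001")]
conn.executemany("INSERT INTO venues_cols VALUES (?, ?, ?)", venues)
conn.executemany(
    "INSERT INTO venues_blob VALUES (?, ?)",
    [(vid, json.dumps({"venue_name": n, "venue_zip": z})) for vid, n, z in venues],
)


def find_by_zip_cols(conn, zip_code):
    """Distinct columns: the database filters, and can use the index."""
    rows = conn.execute(
        "SELECT venue_id FROM venues_cols WHERE venue_zip = ? ORDER BY venue_id",
        (zip_code,),
    )
    return [r[0] for r in rows]


def find_by_zip_blob(conn, zip_code):
    """Serialized column: every row is fetched and parsed in the application."""
    matches = []
    query = "SELECT venue_id, venue_data FROM venues_blob ORDER BY venue_id"
    for venue_id, data in conn.execute(query):
        if json.loads(data)["venue_zip"] == zip_code:
            matches.append(venue_id)
    return matches


from_cols = find_by_zip_cols(conn, "10001")
from_blob = find_by_zip_blob(conn, "10001")
```

Both return the same venues, but the serialized version has to read and decode the whole table to answer a question the indexed column answers directly.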
I'm going to create a table which will have an amount of rows between 1000-20000, and I'm having fields that might repeat a lot... about 60% of the rows will have this value, where about each 50-100 have a shared value.
I've been concerned about efficiency lately and I'm wondering whether it would be better to store this string on each row (it would be between 8-20 characters) or to create another table and link them with its representative ID instead... So having ~1-50 rows in this table replacing about 300-5000 strings with ints?
Is this a good approach, or even necessary at all?
Yes, it's a good approach in most circumstances. It's called normalisation, and is mainly done for two reasons:
Removing repeated data
Avoiding repeating entities
I can't tell from your question what the reason would be in your case.
The difference between the two is that the first reuses values that just happen to look the same, while the second connects values that have the same meaning. The practical difference is in what should happen if a value changes, i.e. if the value changes for one record, should the value itself change so that it changes for all other records also using it, or should that record be connected to a new value so that the other records are left unchanged.
If it's for the first reason then you will save space in the database, but it will be more complicated to update records. If it's for the second reason you will not only save space, but you will also reduce the risk of inconsistency, as a value is only stored in one place.
That is a good approach to have a look-up table for the strings. That way you can build more efficient indexes on the integer values. It wouldn't be absolutely necessary but as a good practice I would do that.
I would recommend using an int with a foreign key to a lookup table (like you describe in your second scenario). This will cause the index to be much smaller than indexing a VARCHAR so the storage required would be smaller. It should perform better, too.
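A minimal sketch of that layout, assuming Python's built-in sqlite3 as a stand-in for MySQL. The table names labels and items, the sample values, and the helper label_id are hypothetical; label_id is nullable because only part of the rows carry a value.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stand-in for MySQL (assumption)
conn.executescript(
    """
    CREATE TABLE labels (id INTEGER PRIMARY KEY, label TEXT UNIQUE NOT NULL);
    CREATE TABLE items (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        label_id INTEGER REFERENCES labels(id)  -- NULL when the row has no value
    );
    CREATE INDEX idx_items_label ON items (label_id);  -- small integer index
    """
)


def label_id(conn, label):
    """Return the id for a label string, creating the lookup row if needed."""
    row = conn.execute("SELECT id FROM labels WHERE label = ?", (label,)).fetchone()
    if row:
        return row[0]
    return conn.execute("INSERT INTO labels (label) VALUES (?)", (label,)).lastrowid


for name, label in [("a", "warehouse-7"), ("b", "warehouse-7"), ("c", None)]:
    conn.execute(
        "INSERT INTO items (name, label_id) VALUES (?, ?)",
        (name, label_id(conn, label) if label is not None else None),
    )

# Join back to the lookup table when the string itself is needed.
rows = conn.execute(
    "SELECT i.name, l.label FROM items i "
    "LEFT JOIN labels l ON l.id = i.label_id ORDER BY i.name"
).fetchall()
n_labels = conn.execute("SELECT COUNT(*) FROM labels").fetchone()[0]
```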
Avitus is right, that it's generally a good practice to create lookups.
Think about the JOINS you will use this table in. 1000-20000 rows are not a lot to be handled by MySQL. If you don't have any, I would not bother about the lookups, just index the column.
BUT as soon as you start joining the table with others (of the same size) that's where the performance loss comes in, which you can (most likely) compensate by introducing lookups.
The entities to be stored have 25+ properties (table columns). The entities are pretty diverse, meaning that, most of the columns are empty. On average, I'd say, less than 20% (<5) properties have a value in any particular item. So, I have a lot of redundant empty columns for most of the table rows. Almost all of the columns are decimal numbers.
Given this scenario, would you suggest serializing the columns instead, or perhaps, create another table named "Property", which would contain all the possible properties and then creating yet another table "EntityProperty" which would map an property to an entity using foreign keys? Or would you leave it as it is?
An example scenario where this kind of redundancy might occur could be the following:
We have an imaginary universe with lots of planets. We are creating a space mining game and each planet has 30 different mineral contents. Most of the planets have only 2-3 minerals.
The simplest solution would be to create a single table 'Planets' with 30 columns, one for each mineral. The problem here is that in most rows of the 'Planets' table, 25+ of those columns are null or zero. Thus we have a lot of redundant data. Say we would have 500k-1M records. I would guess it costs a byte at most to save a null or zero decimal value. With 25+ empty columns per row, we waste about 25 bytes per row, i.e. roughly 12.5-25 MB in total.
The other solution would be to create two additional tables. Instead of storing all the minerals in the 'Planets' table, we take them all out and create a table for the minerals called 'Minerals'. This would contain only 30 rows, one for each different mineral type. Then, we create a table called 'PlanetMineral' which contains a reference to a planet row and to a mineral row, and additionally this table would have a column telling the amount of the mineral the planet has. Apparently, in many database systems this complicates queries, since you possibly have to do several joins. I'm using SQL Server with LINQ to SQL, which scaffolds the foreign key constraint into a class object property, accessible through code (i.e. I can simply access the minerals a planet has with planet.Minerals). So, from this perspective it doesn't add complexity. The redundancy is a small portion (like 1/15) of the first solution. The reason there is still some overhead is the foreign keys we need to store.
As for the data query efficiency, I really don't know how the costs of the queries would compare between these two solutions.
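For reference, the second solution can be sketched like this. The sketch assumes Python's built-in sqlite3 instead of SQL Server with LINQ to SQL, and the column names, sample rows, and the helper minerals_of are hypothetical; only the three table names come from the description above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stand-in for SQL Server (assumption)
conn.executescript(
    """
    CREATE TABLE Planets (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE Minerals (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
    -- One row per mineral a planet actually has; absent minerals cost nothing.
    CREATE TABLE PlanetMineral (
        planet_id  INTEGER NOT NULL REFERENCES Planets(id),
        mineral_id INTEGER NOT NULL REFERENCES Minerals(id),
        amount     REAL    NOT NULL,
        PRIMARY KEY (planet_id, mineral_id)
    );
    INSERT INTO Planets VALUES (1, 'P-1'), (2, 'P-2');
    INSERT INTO Minerals VALUES (1, 'iron'), (2, 'gold'), (3, 'ice');
    INSERT INTO PlanetMineral VALUES (1, 1, 12.5), (1, 2, 0.25), (2, 3, 900.0);
    """
)


def minerals_of(conn, planet_id):
    """Return {mineral name: amount} for one planet, using a single join."""
    rows = conn.execute(
        "SELECT m.name, pm.amount FROM PlanetMineral pm "
        "JOIN Minerals m ON m.id = pm.mineral_id "
        "WHERE pm.planet_id = ?",
        (planet_id,),
    )
    return dict(rows.fetchall())


p1 = minerals_of(conn, 1)
p2 = minerals_of(conn, 2)
```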
It depends:
How many entities (rows) are you planning to have?
What kind of queries will you run against that table?
Will there be a lot of new properties in future?
How are you planning to use the properties?
You seem to be concerned about wasting space with the simple table. Try to calculate whether the space savings of the other approaches are really significant and worthwhile. Disk is (usually) cheap.
If you have a low number of rows, then the single table is probably better (it is easier to implement).
If you plan to create complex queries against the properties (eg. where property1 < 123) then the simple table is probably easier.
If you are planning to add a lot of new properties in the future, then the Property/EntityProperty approach could be useful.
I'd go with the simple one table approach because you have a rather small amount of rows (<1M), you are probably running your database with server machines and not some handheld/mobile thing (SQLServer) and your database schema is rather rigid.
For numbers, I would personally leave it as is, in 1 table. Numbers are compressed into a few bytes, and the overhead of having an EntityProperty table would far outweigh that. Serializing is an option, but it means you cannot use SQL to search or compute the properties; you have to get the data, deserialize it, and then compute.
I have a table where one of the columns is a sort of id string used to group several rows from the table. Let's say the column name is "map" and one of the values for map is e.g. "walmart". The column has an index on it, because I use it to filter those rows which belong to a certain map.
I have lots of such maps and I don't know how much space the different map values take up in the table. Does MySQL recognize that the same map value is stored for multiple rows, store it only once internally, and reference it with an internal numeric id?
Or do I have to replace the map string with a numeric id explicitly and use a different table to pair map strings to ids if I want to decrease the size of the table?
MySQL will store the whole data for every row, regardless of whether the data already exists in a different row.
If you have a limited set of options, you could use an ENUM field, else you could pull the names into another table and join on it.
I think MySQL will duplicate your content each time: it stores data row by row, unless you explicitly specify otherwise (putting the data in another table, like you suggested).
Using another table will mean you need to add a JOIN in some of your queries: you might want to think a bit about the size of your data (is it that big?), compared to the (small?) performance loss you may encounter because of that join.
Another solution would be using an ENUM datatype, at least if you know in advance which string you will have in your table, and there are only a few of those.
Finally, another solution might be to store an integer "code" corresponding to each string, and have those codes translated to strings by your application, totally outside of the database (or use some table to store the correspondences, but have that table cached by your application, instead of using joins in SQL queries).
It would not be as "clean", but might be better for performance. Still, this may be some kind of micro-optimization that is not necessary in your case...
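The cached-table variant could look roughly like this. It is a sketch under assumptions: Python's built-in sqlite3 stands in for MySQL, and the table names maps and places, the sample rows, and the names MAP_CODES, load_map_codes and places_on_map are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stand-in for MySQL (assumption)
conn.executescript(
    """
    CREATE TABLE maps (id INTEGER PRIMARY KEY, name TEXT UNIQUE NOT NULL);
    CREATE TABLE places (
        id INTEGER PRIMARY KEY,
        map_id INTEGER NOT NULL,   -- integer code instead of the map string
        title TEXT
    );
    CREATE INDEX idx_places_map ON places (map_id);
    INSERT INTO maps VALUES (1, 'walmart'), (2, 'target');
    INSERT INTO places (map_id, title)
        VALUES (1, 'Store A'), (1, 'Store B'), (2, 'Store C');
    """
)


def load_map_codes(conn):
    """Read the small correspondence table once: {map name: integer code}."""
    return {name: map_id for map_id, name in conn.execute("SELECT id, name FROM maps")}


# Cached by the application; reload it whenever the maps table changes.
MAP_CODES = load_map_codes(conn)


def places_on_map(conn, map_name):
    """Filter by integer code; the string is translated in the application, no JOIN."""
    rows = conn.execute(
        "SELECT title FROM places WHERE map_id = ? ORDER BY id",
        (MAP_CODES[map_name],),
    )
    return [r[0] for r in rows]


walmart_places = places_on_map(conn, "walmart")
```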
If you are using the same values over and over again, then there is a good functional reason to move it to a separate table, totally aside from disk space considerations: To avoid problems with inconsistent data.
Suppose you have a table of Stores, which includes a column for StoreName. Among the values in StoreName "WalMart" occurs 300 times, and then there's a "BalMart". Is that just a typo for "WalMart", or is that a different store?
Also, if there's other data associated with a store that would be constant across the chain, you should store it just once and not repeatedly.
Of course, if you're just showing locations on a map and you really don't care what they are, it's just a name to display, then this would all be irrelevant.
And if that's the case, then buying a bigger disk is probably a simpler solution than redesigning your database just to save a few bytes per record. Because if we're talking arbitrary strings for place names here, then trying to find duplicates and have look-ups for them is probably a lot of work for very little gain.