How to design the database when you need too many columns? [duplicate] - mysql

This question already has answers here:
How do you know when you need separate tables?
(9 answers)
Closed 9 years ago.
I have a table called cars, but each car has hundreds of attributes, and they keep increasing over time (horsepower, torque, a/c, electric windows, etc.). My table has each attribute as a column. Is that the right way to do it when I have thousands of rows and hundreds of columns? I made each attribute a column to facilitate advanced searching/filtering.
Using MySQL database.
Thanks

This is an interesting question IMHO, and the answer may depend on your specific data model and implementation. The most important factor in this case is data density.
How much of each row is actually filled up, on average?
If most of your fields are always present, then data scope partition may be the way to go.
If most of your fields are empty, then a metadata-like structure (like @JayC suggested) may be more attractive.
Let's use the case you mentioned, and do some simulations.
In the first case, scope partition, the idea is to implement partitions based on scope or usage. As an example of partitioning by usage, let's say that the most retrieved fields are Model, Year, Maker and Color. These fields may compose your main [CAR] table, the owner of the ID field which will uniquely identify the vehicle.
Now let's say that Engine, Horsepower, Torque and Cylinders are also used for searches from time to time, but not so frequently. These may live in a secondary table [CAR_INFO_1], which is tied to the first table by the CAR_ID field, a foreign key. Proceed by creating as many partitions as you need.
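As a rough sketch (the column types and the view are my own illustration, not from the original post), the first two partitions could look like this:

CREATE TABLE CAR (
    ID INT AUTO_INCREMENT PRIMARY KEY,
    Maker VARCHAR(50),
    Model VARCHAR(50),
    Year SMALLINT,
    Color VARCHAR(30)
);

CREATE TABLE CAR_INFO_1 (
    CAR_ID INT PRIMARY KEY,
    Engine VARCHAR(50),
    Horsepower SMALLINT,
    Torque SMALLINT,
    Cylinders TINYINT,
    FOREIGN KEY (CAR_ID) REFERENCES CAR(ID)
);

-- Coalesce both partitions back into one logical record
CREATE VIEW CAR_FULL AS
SELECT c.ID, c.Maker, c.Model, c.Year, c.Color,
       i.Engine, i.Horsepower, i.Torque, i.Cylinders
FROM CAR c
LEFT JOIN CAR_INFO_1 i ON i.CAR_ID = c.ID;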
Advantage: Simpler queries. You can coalesce all the information about a vehicle with a join query (for example inside a VIEW, as sketched above).
Downside: Maintenance. Each new field must be implemented in the model itself, and you need an up-to-date data model to locate where the field you need is actually stored (or abstract it away inside a view).
The metadata format is much more elegant, but demands more of your database engine. Check @JayC's and @Nitzan Shaked's answers for details.
Advantages: 100% data density - you'll never have empty values. Maintenance is also easy: a new attribute is created by adding a row to the metadata type table. The data structure is less complex as well.
Downside: Complex queries, together with more complex execution plans. Let's say you need all Ford cars made in 2010 that are blue. That would be trivial in the first case:
SELECT * FROM CAR WHERE Maker='Ford' AND Year=2010 AND Color='Blue'
Now the same query on a metadata-structured model:
Assume the existence of these two tables,
CAR_METADATA_TYPE
ID DESC
1 'Maker'
2 'Year'
3 'Color'
and
CAR_METADATA (CAR_ID, METADATA_TYPE_ID, VALUE)
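In rough DDL (the types and key constraints are my guesses; note that DESC is a reserved word in MySQL, hence the backticks):

CREATE TABLE CAR_METADATA_TYPE (
    ID INT PRIMARY KEY,
    `DESC` VARCHAR(50) NOT NULL
);

CREATE TABLE CAR_METADATA (
    CAR_ID INT NOT NULL,
    METADATA_TYPE_ID INT NOT NULL,
    VALUE VARCHAR(255),
    PRIMARY KEY (CAR_ID, METADATA_TYPE_ID),
    FOREIGN KEY (CAR_ID) REFERENCES CAR(ID),
    FOREIGN KEY (METADATA_TYPE_ID) REFERENCES CAR_METADATA_TYPE(ID)
);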
The query itself would look something like this:
SELECT CAR.* FROM CAR, CAR_METADATA MP1, CAR_METADATA MP2, CAR_METADATA MP3
WHERE MP1.CAR_ID = CAR.ID AND MP1.METADATA_TYPE_ID = 1 AND MP1.VALUE = 'Ford'
AND MP2.CAR_ID = CAR.ID AND MP2.METADATA_TYPE_ID = 2 AND MP2.VALUE = '2010'
AND MP3.CAR_ID = CAR.ID AND MP3.METADATA_TYPE_ID = 3 AND MP3.VALUE = 'Blue'
So, it all depends on your needs. But given your case, my suggestion would be the metadata format.
(But do a model cleanup first - no repeated fields, and 1:N data in their own tables instead of inline fields like Color1, Color2, Color3, that kind of stuff ;) )

I guess the obvious question is, then: why not have a table car_attrs(car, attr, value)? Each attribute is a row. Most queries can be re-written to use this form.

If it is all about features, create a features table: list all your features as rows and give them some sort of automatic id. Then create a car_features table, with foreign keys to both your cars table and your features table, that associates cars with features, along with any values attached to the relationship (one passenger electric seat, etc.).
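A sketch of that layout, assuming a cars table with an id primary key already exists (all other names are illustrative):

CREATE TABLE features (
    id INT AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(100) NOT NULL
);

CREATE TABLE car_features (
    car_id INT NOT NULL,
    feature_id INT NOT NULL,
    value VARCHAR(100) NULL,  -- optional detail, e.g. 'one passenger electric seat'
    PRIMARY KEY (car_id, feature_id),
    FOREIGN KEY (car_id) REFERENCES cars(id),
    FOREIGN KEY (feature_id) REFERENCES features(id)
);

-- Find every car that has a given feature
SELECT c.*
FROM cars c
JOIN car_features cf ON cf.car_id = c.id
JOIN features f ON f.id = cf.feature_id
WHERE f.name = 'electric windows';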

If you have ever-changing attributes, consider storing them in an XML blob or text structure in one column. That structure is not relational, so the most important attributes should be duplicated in additional columns, letting you craft queries on them, since the blob will not be efficiently searchable from SQL. This cuts down the number of columns in the table and allows expansion without changing the database schema.
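A minimal sketch of the blob-plus-promoted-columns idea, assuming MySQL 5.7+ so the blob can be a JSON column (a TEXT column holding XML works the same way structurally):

CREATE TABLE cars (
    id INT AUTO_INCREMENT PRIMARY KEY,
    maker VARCHAR(50),       -- important, searchable attributes promoted to real columns
    model VARCHAR(50),
    year SMALLINT,
    attrs JSON               -- everything else lives in the blob
);

-- Searches hit the promoted columns; the blob is only read back per matching row
SELECT id, attrs
FROM cars
WHERE maker = 'Ford' AND year = 2010;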
As others have suggested, if you want all the attributes in a table, use an attribute table to define them. Which way to go will depend on the requirements and needs of the application.

Related

How to use Key-Value pair in Relational database(MySql)?

I want to use a relational database (MySQL) to store my data as key-value pairs.
The number of key-value pairs I receive is dynamic.
I can create a simple table to store them in separate columns.
Values can be of type int, varchar, text or date.
The problem I am facing is this: when a key's value is an integer, I need to run greater-than or less-than queries against it, and likewise I need BETWEEN queries on date fields.
How can I achieve that?
Edit:
For greater clarity, I am providing the background for this question which I have divided into three parts:
1. Data  2. Use Case  3. Possible Designs
1. Data
Suppose I'm creating a data store for the census of a country (just an example). The fields for storing data differ for male, female, boy or girl, and also vary according to the person's profession. The number of fields depends on the requirements and can grow to 500 or more.
2. Use Case
Show a paginated list of persons whose monthly income is between $7,000 and $10,000. The user can click on any page number and the database should directly fetch the data for that page. For example, if we show 10 results per page and the user clicks on the 5th page, we should show persons 41 to 50.
Some of the values belonging to a particular group hold descriptions that can be large, so they should be stored as TEXT.
3. Possible Designs
I can create a separate table for each different type and store the data in its respective fields. The problem I see with this approach is that a MySQL table has a maximum row size limit of 65,535 bytes; storing all the data horizontally might cross that limit, since the number of fields is not fixed and can change with the requirements.
Instead of storing the data horizontally, I can store it vertically using an Entity-Attribute-Value design (key-value pairs). For now, the increase in the number of rows due to this design is not a problem, and I can store the data for all males, females or children in the same table. But this approach has two problems:
I lose the datatype of certain important fields; I cannot query for the list of persons whose income is more than 1000.
To store all fields in a single Value column, I would need to make it varchar, but some fields hold large data that requires TEXT as the type.
Considering the above problems, I thought that instead of creating only one value field, I would create multiple value fields: value_int, value_varchar, value_date and value_text.
[DB structure diagram]
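A minimal sketch of that multi-typed EAV table (the value_* column names follow the question; everything else is an assumption):

CREATE TABLE person_attribute (
    person_id INT NOT NULL,
    attribute_id INT NOT NULL,
    value_int BIGINT NULL,
    value_varchar VARCHAR(255) NULL,
    value_date DATE NULL,
    value_text TEXT NULL,
    PRIMARY KEY (person_id, attribute_id)
);

-- Range queries keep their datatype; 42 is a hypothetical id for 'monthly_income'
SELECT person_id
FROM person_attribute
WHERE attribute_id = 42
  AND value_int BETWEEN 7000 AND 10000;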
For this problem, I will be using MySQL and cannot change the DB due to certain restrictions. So I am looking for a design with MySQL only.
Is going with the key-value approach a good idea, or is there another design that could be used?
In very general terms: if you know the entities and attributes of your problem domain, and the data is relational, I'd use a relational schema (your possible design 1). If you actually run into problems with maximum row width, your problem domain might contain logical subgroupings of attributes, so you can split them into separate tables.
For instance:
Person (id, name, ...)
Person_demographics (person_id, age, location, ...)
Person_finance (person_id, income, wealth...)
If you don't know the entities and attributes in advance, I recommend using MySQL's JSON or XML support. This gives you access to much better query options than EAV.
The problem with EAV-like solutions in your scenario is that any non-trivial query ends up incredibly complicated: "find all responses where salary is between x and y, the age is z, and the location is in (a, b, c)" turns into a horrible mess of SQL, but with JSON paths or XPath it is pretty straightforward.
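For instance, with MySQL 5.7+ JSON columns, a query like that stays fairly readable (the table name and JSON keys here are hypothetical):

SELECT *
FROM census_response
WHERE CAST(attrs->>'$.salary' AS UNSIGNED) BETWEEN 7000 AND 10000
  AND attrs->>'$.age' = '35'
  AND attrs->>'$.location' IN ('a', 'b', 'c');

If a key is searched often, MySQL can also index it through a generated column, which an EAV layout does not offer as directly.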

When dealing with databases, is adding another table, when we could use a simple hash, a good thing?

For example, here's the problem I faced. I have three tables: Products, Districtrates and Deliverycharges. In my app, a product's delivery charge is calculated from a pre-defined rate stored in the Districtrates table. If we want, we can also add a custom rate overriding the pre-defined one. Each product can cover all 25 districts or only some of them. So here's my solution:
Create the three tables mentioned above. The Districtrates table will only have 25 records, one for each of the 25 districts in my country. For each product, I will add up to 25 records to the Deliverycharges table with the productID, the districtrateID and a custom rate value if available. Some products might cover fewer than 25 districts (only the ones available for that product).
I could even store this as a simple hash in one cell of the products table, like this: {district1: nil, district2: 234, district4: 543} (in Ruby syntax). Here, if the value is nil, we take the default value from the Districtrates table; this hash will also hold up to 25 districts. The table-based method above is easier to work with, but it adds nearly 25 records per product.
So my question is: is this a good thing? This is only one scenario; there are more where we could use a simple array or hash in a cell rather than creating a table. Creating a table is easy to maintain, but is it the right way?
One of the main points of using a relational database is the ability to query (and update) the data in it using SQL.
That only works if you put the data in a form that the database actually understands. Traditionally, this means defining a table schema.
There are now extensions to let the database work with "semi-structured" data (such as XML/JSON/JSONB), but you should only need to go there when the data really does not fit into the relational model, otherwise you are giving up on a lot of features/performance.
If you put a serialized Ruby hash into a text column, you will have no way to use it from SQL: no proper searching, indexing, or efficient updates of these delivery rates.
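To make the contrast concrete, here is a sketch (table and column names loosely follow the question) of computing each product's effective rate with one join, something a serialized hash in a text column cannot do:

-- Default rate per district, overridden by a per-product custom rate when present
SELECT p.id AS product_id,
       d.district,
       COALESCE(dc.custom_rate, d.rate) AS effective_rate
FROM products p
JOIN deliverycharges dc ON dc.product_id = p.id
JOIN districtrates d ON d.id = dc.districtrate_id;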

Redshift Usage - 1 row by 400 columns per user or (20-400) rows by 4 columns per user

We are building an analytics engine which has to store an attribute preference score for each user. We are expecting 400 attributes, and they may change (at what frequency is not yet known). We are planning to store this in Redshift.
My question is:
Should we store 1 row per user with 400 columns (1 column for each attribute),
or should we go for a table structure like
(uid, attribute id, attribute value, preference score), which would be (20-400) rows by 4 columns?
Which kind of storage would lead to better performance in Redshift?
Should we really consider NoSQL for this?
Note:
1. This is a backend for a real-time application with an increasing number of users.
2. For processing, the above table has to be read with the entire information of all attributes for one user, i.e. indirectly creating a 1x400 matrix at runtime.
Please help me decide which design would be ideal for such a use case. Thank you.
You can go for tables like those given in this example, and then use the bitwise functions described here: http://docs.aws.amazon.com/redshift/latest/dg/r_bitwise_examples.html
For your problem, I would suggest a two-table design. It's more pain in the beginning, but it will help in the future.
The first table would be a key-value table holding all the base data. It is future-proof in the sense that you can add or remove attributes and it will keep working.
The second table would be an N-column table (400 in your case) that you build from the first. You can start with a bare minimum set of columns, say only 50 of those 400, so that querying this table is really fast. The structure of this table can be refreshed periodically to match the current reporting requirements, and you always have the base table in case you need to backfill any data.
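A sketch of how the wide table can be rebuilt from the key-value base table (attribute ids and table names are made up):

-- Pivot the key-value base table into the narrow reporting table
CREATE TABLE user_scores_wide AS
SELECT uid,
       MAX(CASE WHEN attribute_id = 1 THEN preference_score END) AS attr_1,
       MAX(CASE WHEN attribute_id = 2 THEN preference_score END) AS attr_2,
       MAX(CASE WHEN attribute_id = 3 THEN preference_score END) AS attr_3
FROM user_scores_kv
GROUP BY uid;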

MySQL table with multiple values in one field [duplicate]

This question already has answers here:
Is storing a delimited list in a database column really that bad?
(10 answers)
Closed 4 years ago.
I'm building a database for a real estate company. The real estate properties are managed by the sales people using a CodeIgniter website.
How should I proceed with the database design for this table? Fields like address/location, price, etc. are pretty straightforward, but there are some sections, like Kitchen Appliances and Property Usage, where the sales people can check multiple values for a single field.
Furthermore, a property can have multiple people attached to it (from the people table); these can be an intermediary, owner, seller, property lawyer, etc. Should I use one field for this, or should I create an extra table and normalize the bindings?
Is the best way to proceed just using one field with serialized data, or is there a better way?
In a relational database you should never, EVER, put more than one value in a single field - this violates first normal form.
With sections like Kitchen Appliances and Property Usage, where the sales people can check multiple values for a field, it depends on whether it will always be the same set of options that can be specified:
If it is always the same set of options, you could include a boolean column on the property table for each option.
If there are likely to be a variety of different options applicable to different properties, it makes more sense to set up a separate table to hold the available options, and a link table to hold which options apply to which properties.
The first approach is simpler and can be expected to perform faster, while the latter approach is more flexible.
A similar consideration applies to the people associated with each house; however, given that a single property can have (for example) more than one owner, the second approach seems like the only viable one. I therefore suggest separate tables to hold people details (name, phone number, etc.) and people roles (owner, seller, etc.), plus a link table to tie roles to properties and people.
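A sketch of those link tables (all names illustrative; it assumes a properties table with an id key):

CREATE TABLE people (
    id INT AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(100),
    phone VARCHAR(30)
);

CREATE TABLE roles (
    id INT AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(50)   -- 'owner', 'seller', 'lawyer', ...
);

CREATE TABLE property_people (
    property_id INT NOT NULL,
    person_id INT NOT NULL,
    role_id INT NOT NULL,
    PRIMARY KEY (property_id, person_id, role_id),
    FOREIGN KEY (property_id) REFERENCES properties(id),
    FOREIGN KEY (person_id) REFERENCES people(id),
    FOREIGN KEY (role_id) REFERENCES roles(id)
);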
You should create an extra table for it.
For example, consider the scenario where one item may have many categories and one category may have many items.
Then you can create three tables like this:
(1) tblItem:
itemId (PK)
itemName
etc.
(2) tblCategory:
categoryId (PK)
categoryName
etc.
(3) tblItemCategory:
itemId (PK/FK)
categoryId (PK/FK)
So, following your data, you can create an extra table like this. I think it would be the optimal way.
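For example, listing the categories of one item then becomes a simple join (the itemId value is hypothetical):

SELECT c.categoryName
FROM tblItemCategory ic
JOIN tblCategory c ON c.categoryId = ic.categoryId
WHERE ic.itemId = 1;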
If you want your data to be in third normal form, you should think in terms of adding extra tables rather than encoding multiple values into single fields. However, it depends how far your brief goes.

Many highly similar objects in the same database table

Hello, stackoverflow community!
I am working on a rather large database-driven web application. The underlying database is growing in complexity as more components are being added, but so far I've had absolutely no trouble normalizing the data quite nicely.
However, this final component implies a table that can hold products.
Each product has a category, and depending on the category, has different fields.
Making a table for each product category doesn't seem right, as there are currently five types and they still have quite a lot of fields in common (but in odd ways: a few general fields such as description and price are common to all 5 categories, while some attributes are shared between types 1 and 2, others between 3, 4 and 5, and so on).
I'm trying to steer away from the EAV model for obvious performance reasons.
The thing is that, depending on what product type the user wants to enter into the database, there is a somewhat (but not completely) different field structure: all of them have a name and a general description, but other attributes, such as "area covered", apply only to certain categories such as seeds and pesticides, and not to fuel, which would instead have a diesel/gasoline boolean and a bunch of other fuel-related attributes.
Should I just extract the core features in a table, and make another five for each category type? That would be a bit hard to expand in the future.
My current idea would be to have the product table contain all the fields from all the possible categories, and then just have another table to describe which category from the product table has which fields.
product: id | type | name | description | price | composition | area covered | etc.
fields: id | name (contains a list of the fields in the above table)
product-fields: id | product_type | field_id (links a bunch of fields to the product table based on the product type)
I reckon this wouldn't be too slow, would be easy to search (no need to actually join the other tables; just perform the search on the main product table based on some inputs), and it would facilitate things like form generation and data validation with just one lightweight additional query/join (fetch a product from the db and join in a concatenated list of the fields actually used; split that and display the proper form fields based on what it contains, i.e. the fields actually associated with that product).
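A sketch of that proposed layout (the column types are my guesses):

CREATE TABLE product (
    id INT AUTO_INCREMENT PRIMARY KEY,
    type VARCHAR(30),
    name VARCHAR(100),
    description TEXT,
    price DECIMAL(10,2),
    composition VARCHAR(100),     -- only meaningful for some types
    area_covered DECIMAL(10,2)    -- only meaningful for some types
);

CREATE TABLE fields (
    id INT AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(50)              -- matches a column name in product
);

CREATE TABLE product_fields (
    product_type VARCHAR(30) NOT NULL,
    field_id INT NOT NULL,
    PRIMARY KEY (product_type, field_id),
    FOREIGN KEY (field_id) REFERENCES fields(id)
);

-- Which fields apply to a given product type (drives form generation)
SELECT f.name
FROM product_fields pf
JOIN fields f ON f.id = pf.field_id
WHERE pf.product_type = 'seeds';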
Thanks for your trouble!
Andrei Bârsan
EAV can actually be quite good at storing data and fetching that data back again when you know the key. It also excels in its ability to add fields without changing the schema. But where it's quite poor is when you need the equivalent of WHERE field1 = x AND field2 = y.
So while I agree the data behaviour is important (how many products share the same fields, etc.), the use of that data is also important: which fields need searching, which fields are always just data storage, and so on.
In most cases I'd suggest keeping all fields that need searching, in combination with each other, in the same table. In practice this often leads to a single-table solution:
- New fields require schema changes, new indexes, etc.
- Potential for sparsely populated data, using more space than is 'required'.
- Allows simple queries, simple indexing and often the fastest queries.
- Often, though not always, the space overhead is marginal.
Where the sparse-data overheads reach a critical point, I would then head towards additional tables grouped by what fields they contain. More specifically, I would not create tables by product. This is on the dual assumption that most/all fields will be shared across at least some products, and that those fields will need searching.
This gives a schema more like...
Main_table ( PK, Product_Type, Field1, Field2, Field3 )
Geo_table ( PK, county, longitude, latitude )
Value ( PK, cost, sale_price, tax )
etc
You may also have a meta-data table describing which product types have which fields, etc.
What this schema allows is a more densely populated set of tables, which can be easily indexed and thus quickly searched, while minimising table clutter and joins by grouping related fields.
In the end, there isn't a true answer, it's all a balancing act. My general rule of thumb is to stay with a single table until I actually have a real and pressing reason not to, not just a theoretical one.
In my experience, unless you are writing a complete framework that can render fully described fields (we are talking about a lot of metadata describing each field), it is not worth separating field definitions from the main object. Modern frameworks (like Grails) make adding a new column to a domain/Model class and table virtually painless.
If your common field overlap is about 80% between all object types, I would put them all in one table and use the Table-per-Hierarchy inheritance model, where a discriminator field helps you tell your object types apart. On the other hand, if you have only about 20% overlap of common fields, go with the Table-per-Class inheritance model, with a base class and table containing the common fields, and the other joined tables hanging off the base.
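A sketch of the Table-per-Hierarchy variant, where one table holds every type and a discriminator column tells them apart (names and types are illustrative):

CREATE TABLE product (
    id INT AUTO_INCREMENT PRIMARY KEY,
    product_type VARCHAR(20) NOT NULL,  -- discriminator: 'seed', 'fuel', ...
    name VARCHAR(100),
    description TEXT,
    price DECIMAL(10,2),
    area_covered DECIMAL(10,2) NULL,    -- seeds/pesticides only
    is_diesel BOOLEAN NULL              -- fuel only
);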
Should I just extract the core features in a table, and make another five for each category type? That would be a bit hard to expand in the future.
This is called a SuperType - SubType relationship. It works very well if most of your queries are one of two types:
If you will be querying mostly the SuperType table and only drilling down into the SubType tables infrequently.
If you will be querying the database after being filtered to a specific SubType.
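A minimal sketch of the SuperType/SubType layout for the categories mentioned in the question (names are illustrative):

CREATE TABLE product_base (             -- the SuperType: fields common to all
    id INT AUTO_INCREMENT PRIMARY KEY,
    product_type VARCHAR(20) NOT NULL,
    name VARCHAR(100),
    description TEXT,
    price DECIMAL(10,2)
);

CREATE TABLE product_seed (             -- one SubType table per category
    product_id INT PRIMARY KEY,
    area_covered DECIMAL(10,2),
    FOREIGN KEY (product_id) REFERENCES product_base(id)
);

CREATE TABLE product_fuel (
    product_id INT PRIMARY KEY,
    is_diesel BOOLEAN,
    FOREIGN KEY (product_id) REFERENCES product_base(id)
);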