I have a question about data modeling.
I have a table called "sales" where I store different levels of aggregation of customer sales. It has the following attributes:
id (integer)
period_id (integer)
customer_id (integer)
product_category_id (integer)
channel_id (integer)
value (float)
Depending on what "id" attributes are filled, I know the level of aggregation. For example:
If period_id, customer_id and product_category_id are filled, but channel_id is NULL, I know it's aggregated by all channels. If also product_category_id is NULL, I know it's aggregated by all channels and product categories.
Associated with each row of that sales table, I have a corresponding row in a performance_analysis table, which stores statistical analysis of those sales. This table has the following attributes:
sales_id (integer)
and a bunch of numerical statistical values
I believe that storing those different levels of aggregation in the same (sales) table is not a good practice, and I'm planning to make some changes. My idea is to store just the most disaggregated level, and get each level of aggregation on-the-fly, using SQL to aggregate. In that scenario, all the reference attributes of the "sales" table will be filled, and I'll just GROUP BY and SUM according to my needs.
The problem is: by doing this, I lose the 1:1 association with the performance_analysis table. I would then have to move the reference attributes to the analysis table, and the problem persists:
I would still have to use that NULL-attributes hack to know which level of aggregation each row represents.
It is important to note that aggregating that analysis data is not trivial. I can't just SUM the attributes; they're specific to the analyzed values. So it's not data duplication as it is in the "sales" case. But it still has different levels of "aggregation" in the same table.
What is the best way to store that data?
You're certainly on the right track as far as holding the sales data at its most granular. What you're describing is very much like a dimensional model's fact table, and Ralph Kimball (a key figure in dimensional modelling) would always advise that you hold your measures at the lowest grain possible. If you're not already familiar with dimensional modelling, I would suggest you do some reading into it, as you are working in a very similar way and might find some useful information, both for this particular issue and perhaps for other design decisions you need to make.
As far as your statistical values, the rules of dimensional modelling would also tell you that you simply cannot store measures which are at different grains in the same table. If you really cannot calculate them on-the-fly, then make separate tables at each aggregation level, and include the appropriate ID columns for each level.
It could be worth looking into multidimensional tools (OLAP cubes, etc.), as it's possible that rather than carrying these calculations out and then storing them in the database, you might be able to add a layer which allows those - and more - calculations to be carried out at run time. For some use cases this has obvious benefits over being restricted to only those calculations which have been defined at design time. They would certainly be an obvious fit on top of the dimensional data structure that you are creating.
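To make the "lowest grain plus on-the-fly aggregation" advice concrete, here is a minimal sketch using SQLite through Python. The table and column names follow the question; the data values are invented for illustration:

```python
import sqlite3

# Store ONLY the most granular sales rows; every reference column is NOT NULL.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE sales (
        id INTEGER PRIMARY KEY,
        period_id INTEGER NOT NULL,
        customer_id INTEGER NOT NULL,
        product_category_id INTEGER NOT NULL,
        channel_id INTEGER NOT NULL,
        value REAL NOT NULL
    )
""")
conn.executemany(
    "INSERT INTO sales (period_id, customer_id, product_category_id, channel_id, value) "
    "VALUES (?, ?, ?, ?, ?)",
    [
        (1, 10, 100, 1, 50.0),
        (1, 10, 100, 2, 30.0),  # same period/customer/category, different channel
        (1, 10, 200, 1, 20.0),  # different product category
    ],
)

# "All channels" level: aggregate channel_id away with GROUP BY.
by_category = conn.execute("""
    SELECT period_id, customer_id, product_category_id, SUM(value)
    FROM sales
    GROUP BY period_id, customer_id, product_category_id
    ORDER BY period_id, customer_id, product_category_id
""").fetchall()

# "All channels and all categories" level: aggregate both away.
by_customer = conn.execute("""
    SELECT period_id, customer_id, SUM(value)
    FROM sales
    GROUP BY period_id, customer_id
    ORDER BY period_id, customer_id
""").fetchall()

print(by_category)  # [(1, 10, 100, 80.0), (1, 10, 200, 20.0)]
print(by_customer)  # [(1, 10, 100.0)]
```

No NULL-column trick is needed: each aggregation level is just a different GROUP BY over the same granular fact table.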
I have a MySQL table like this, and I want to create indexes that make all queries to the table run fast. The difficult thing is that there are many possible combinations of WHERE conditions, and that the table is large (about 6M rows).
Table name: items
id: PKEY
item_id: int (the id of items)
category_1: int
category_2: int
.
.
.
category_10: int
release_date: date
sort_score: decimal
item_id is not unique because an item can have several category_x values.
An example query against this table is:
SELECT DISTINCT item_id FROM items WHERE category_1 IN (1, 2) AND category_5 IN (3, 4) AND release_date > '2019-01-01' ORDER BY sort_score
And another query might be:
SELECT DISTINCT item_id FROM items WHERE category_3 IN (1, 2) AND category_4 IN (3, 4) AND category_8 IN (5) ORDER BY sort_score
If I want to optimize all the combinations of WHERE conditions, do I have to make a huge number of composite indexes of the column combinations? (like ADD INDEX idx1_3_5(category_1, category_3, category_5))
Or is it good to create 10 tables which have data of category_1~10, and execute many INNER JOIN in the queries?
Or is it difficult to optimize this kind of query in MySQL, and should I use other middleware, such as Elasticsearch?
Well, the file (it is not a table) is not at all Normalised. Therefore no amount of indices on combinations of fields will help the queries.
Second, MySQL is (a) not compliant with the SQL standard, and (b) does not have a true Server Architecture or the features of one.
Such as Statistics, which a genuine Query Optimiser uses, and which the commercial SQL platforms have. The "single index" issue you raise in the comments does not apply.
Therefore, while we can fix up the table, etc, you may never obtain the performance that you seek from the freeware.
Eg. in the commercial world, 6M rows is nothing, we worry when we get to a billion rows.
Eg. Statistics is automatic, we have to tweak it only when necessary: an un-normalised table or billions of rows.
Or ... should I use other middleware, such as Elasticsearch?
It depends on the use of genuine SQL vs MySQL, and the middleware.
If you fix up the file and make a set of Relational tables, the queries are then quite simple, and fast. It does not justify a middleware search engine (that builds a data cube on the client system).
If they are not fast on MySQL, then the first recommendation would be to get a commercial SQL platform instead of the freeware.
The last option, the very last, is to stick to the freeware and add a big fat middleware search engine to compensate.
Or is it good to create 10 tables which have data of category_1~10, and execute many INNER JOIN in the queries?
Yes. JOINs are quite ordinary in SQL. Contrary to popular mythology, a normalised database, which means many more tables than an un-normalised one, causes fewer JOINs, not more JOINs.
So, yes, Normalise that beast. Ten tables is the starting perception, but that is still not at all Normalised. One table for each of the following would be a step in the direction of Normalised:
Item
Item_id will be unique.
Category
This is not category_1, etc, but each of the values that appear in category_1, etc. You must not have multiple values in a single column; that breaks 1NF. Such values will be (a) Atomic, and (b) unique. The Relational Model demands that rows are unique.
The meaning of category_1, etc in Item is not given. (If you provide some example data, I can improve the accuracy of the data model.) Obviously it is not [2].
If it is a Priority (1..10), or something similar, that the users have chosen or voted on, this table will be a table that supplies the many-to-many relationship between Item and Category, with a Priority for each row.
Let's call it Poll. The relevant Predicates would be something like:
Each Poll is 1 Item
Each Poll is 1 Priority
Each Poll is 1 Category
Likewise, sort_score is not explained. If it is even remotely what it appears to be, you will not need it, because it is a Derived Value that you should compute on the fly: once the tables are Normalised, the SQL required to compute it is straightforward. It is not something to compute-and-store every 5 minutes or every 10 seconds.
The Relational Model
The above maintains the scope of just answering your question, without pointing out the difficulties in your file. Noting the Relational Database tag, this section deals with the Relational errors.
The Record ID field (item_id or category_id is yours) is prohibited in the Relational Model. It is a physical pointer to a record, which is explicitly the very thing that the RM overcomes, and that is required to be overcome if one wishes to obtain the benefits of the RM, such as ease of queries, and simple, straight-forward SQL code.
Conversely, the Record ID is always one additional column and one additional index, and the SQL code required for navigation becomes complex (and buggy) very quickly. You will have enough difficulty with the code as it is, I doubt you would want the added complexity.
Therefore, get rid of the Record ID fields.
The Relational Model requires that the Keys are "made up from the data". That means something from the logical row, that the users use. Usually they know precisely what identifies their data, such as a short name.
It is not manufactured by the system, such as a RecordID field which is a GUID or AUTOINCREMENT, which the user does not see. Such fields are physical pointers to records, not Keys to logical rows. Such fields are pre-Relational, pre-DBMS, 1960s Record Filing Systems, the very thing that the RM superseded. But they are heavily promoted and marketed as "relational".
Relational Data Model • Initial
(The initial data model diagram is not reproduced here.)
All my data models are rendered in IDEF1X, the Standard for modelling Relational databases since 1993.
My IDEF1X Introduction is essential reading for beginners.
Relational Data Model • Improved
Ternary relations (aka three-way JOINs) are known to be a problem, indicating that further Normalisation is required. Codd teaches that every ternary relation can be reduced to two binary relations.
In your case, perhaps an Item has certain, not all, Categories. The above implements Polls of Items allowing all Categories for each Item, which is a typical error in a ternary relation, and the reason it requires further Normalisation. It is also the classic error in every RFS file.
The corrected model would therefore be to establish the Categories for each Item first, as ItemCategory, your "item can have several numbers of category_x". And then to allow Polls on that constrained ItemCategory. Note, this level of constraining data is not possible in 1960s Record Filing Systems, in which the "key" is a fabricated id field:
Each ItemCategory is 1 Item
Each ItemCategory is 1 Category
Each Poll is 1 Priority
Each Poll is 1 ItemCategory
Your indices are now simple and straight-forward, no additional indices are required.
Likewise your query code will now be simple and straight-forward, and far less prone to bugs.
Please make sure that you learn about Subqueries. The Poll table supports any type of pivoting that may be required.
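As a rough illustration of how the normalised structure answers the original category queries, here is a sketch using SQLite through Python. The table names (item, category, item_category) follow the answer's ItemCategory design; the data and category ids are invented:

```python
import sqlite3

# Normalised replacement for category_1..category_10: one row per
# (item, category) pair in item_category.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE item (item_id INTEGER PRIMARY KEY, release_date TEXT);
    CREATE TABLE category (category_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE item_category (
        item_id INTEGER REFERENCES item(item_id),
        category_id INTEGER REFERENCES category(category_id),
        PRIMARY KEY (item_id, category_id)
    );
""")
conn.executemany("INSERT INTO item VALUES (?, ?)",
                 [(1, '2019-05-01'), (2, '2018-01-01'), (3, '2020-02-01')])
conn.executemany("INSERT INTO item_category VALUES (?, ?)",
                 [(1, 1), (1, 3), (2, 1), (3, 3)])

# "Items released after 2019-01-01 that are in category 3" becomes a
# plain join instead of a test against one of ten nullable columns.
rows = conn.execute("""
    SELECT i.item_id
    FROM item AS i
    JOIN item_category AS ic ON ic.item_id = i.item_id
    WHERE ic.category_id = 3
      AND i.release_date > '2019-01-01'
    ORDER BY i.item_id
""").fetchall()
print(rows)  # [(1,), (3,)]
```

The composite primary key on (item_id, category_id) is the only index the filter needs, which is the "no additional indices are required" point above.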
It is messy to optimize such queries against such a table. Moving the categories off to other tables would only make it slower.
Here's a partial solution... Identify the categories that are likely to be tested with:
=
IN
a range, such as your example release_date > '2019-01-01'
Then devise a few indexes (perhaps no more than a dozen) that have, say, 3-4 columns. Those columns should be ones that are often tested together. Order the columns in the indexes based on the list above. It is quite fine to have multiple = columns (first), but don't include more than one 'range' (last).
Keep in mind that the order of tests in WHERE does not matter, but the order of the columns in an INDEX does.
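A small sketch of the "equality/IN columns first, one range column last" rule, using SQLite through Python for illustration (the principle carries over to MySQL). Table, column, and index names mirror the question; the rows are invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE items (
        id INTEGER PRIMARY KEY,
        item_id INTEGER,
        category_1 INTEGER,
        category_5 INTEGER,
        release_date TEXT,
        sort_score REAL
    )
""")
# IN/equality columns first, the single range column (release_date) last.
conn.execute("""
    CREATE INDEX idx_cat1_cat5_date
    ON items (category_1, category_5, release_date)
""")
conn.executemany(
    "INSERT INTO items (item_id, category_1, category_5, release_date, sort_score) "
    "VALUES (?, ?, ?, ?, ?)",
    [
        (100, 1, 3, '2019-06-01', 0.9),
        (200, 2, 4, '2018-01-01', 0.5),  # fails the release_date range test
        (300, 1, 5, '2019-06-01', 0.7),  # fails the category_5 test
    ],
)
rows = conn.execute("""
    SELECT DISTINCT item_id FROM items
    WHERE category_1 IN (1, 2)
      AND category_5 IN (3, 4)
      AND release_date > '2019-01-01'
""").fetchall()
print(rows)  # [(100,)]
```

Putting release_date anywhere but last in that index would stop the columns after it from being used for range narrowing, which is why only one range column belongs in each index.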
I'm working on a website which should be multilingual, and some products may have more fields than others (for example, in the future a product may have an extra feature which old products don't have). Because of this, I decided to have a product table with the common fields which all products share and which are the same in all languages (like width and height), and to add another three tables for storing extra fields, as below:
field (id,name)
field_name(field_id,lang_id,name)
field_value(product_id, field_id, lang_id, value)
By doing this I can fetch all the values from one table, but the problem is that values can be of different types; for example, a value could be a number or a text. I checked the open source project "Drupal", and there they create a table for each field type and retrieve a node's data by doing joins. I want to know which way will impact performance more: having a table for each extra field type, or storing all of the values in one table and converting their type on the fly by casting?
thank you in advance
Yes, but no. You are storing your data in an entity-attribute-value form (EAV). This is rather inefficient in general. Here are some issues:
As you have written it, you cannot do type checking.
You cannot set-up foreign key relationships in the database.
Fetching the results for a single row requires multiple joins or a group by.
You cannot write indexes on a specific column to speed access.
There are some work-arounds. You can get around the typing issue by having separate columns for different types. So, the data structure would have:
Name
Type
ValueString
ValueInt
ValueDecimal
Or whatever types you want to support.
There are some other "tricks" if you want to go this route. The most important is to decimal align the numbers. So, instead of storing '1' and '10', you would store ' 1' and '10'. This makes the value more amenable to ordering.
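The decimal-alignment trick can be sketched in a few lines of Python: right-justify numeric strings to a fixed width so that lexicographic order matches numeric order.

```python
# Right-justify numeric strings to a fixed width; string sort order then
# matches numeric order, which plain string comparison does not give.
width = 10
values = ["10", "2", "1"]
aligned = sorted(v.rjust(width) for v in values)
print([v.strip() for v in aligned])  # ['1', '2', '10']

# Without alignment, plain string ordering puts '10' before '2':
print(sorted(values))  # ['1', '10', '2']
```

The same idea applies when the values live in a VARCHAR column of an EAV table and are sorted or range-compared in SQL.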
When faced with such a problem, I often advocate a hybrid approach. This approach would have a fixed record with the important properties all nicely located in columns with appropriate types and indexes -- columns such as:
ProductReleaseDate
ProductDescription
ProductCode
And whatever values are most useful. An EAV table can then be used for additional properties that are optional. This generally balances the power of the relational database to handle structured data along with the flexibility of an EAV approach to support variable columns.
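A minimal sketch of that hybrid layout, using SQLite through Python. All the table and column names here are illustrative, not taken from the question: core attributes get real typed columns, and optional extras go in a typed EAV side table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE product (
        product_id INTEGER PRIMARY KEY,
        product_code TEXT NOT NULL,
        release_date TEXT
    );
    CREATE TABLE product_extra (
        product_id INTEGER REFERENCES product(product_id),
        name TEXT NOT NULL,
        value_text TEXT,      -- one column per supported type,
        value_int INTEGER,    -- so each value keeps a real type
        PRIMARY KEY (product_id, name)
    );
""")
conn.execute("INSERT INTO product VALUES (1, 'CHAIR-01', '2020-01-01')")
conn.execute("INSERT INTO product_extra (product_id, name, value_int) "
             "VALUES (1, 'weight_kg', 12)")
conn.execute("INSERT INTO product_extra (product_id, name, value_text) "
             "VALUES (1, 'colour', 'red')")

# Core columns are queried directly; extras come from one join.
row = conn.execute("""
    SELECT p.product_code, e.value_int
    FROM product p
    JOIN product_extra e ON e.product_id = p.product_id
    WHERE e.name = 'weight_kg'
""").fetchone()
print(row)  # ('CHAIR-01', 12)
```

The frequently-queried, index-worthy properties stay as ordinary columns, while the EAV table absorbs the variable ones.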
I have 2 tables which I join very often. To simplify this, the join gives back a range of IDs that I use in another (complex) query as part of an IN.
So I do this join all the time to get back specific IDs.
To be clear, the query is not horribly slow. It takes around 2 minutes. But since I call this query from a web page, the delay is noticeable.
As a concrete example, let's say that the tables I am joining are a Supplier table and a table that contains the warehouses each supplier equipped on specific dates. Essentially I get the IDs of suppliers that serviced specific warehouses at specific dates.
The query itself cannot be improved, since it is a simple join between 2 tables that are indexed, but the date range complicates things.
I had the following idea which, I am not sure if it makes sense.
Since the data I am querying (especially for previous dates) does not change, what if I created another table that has as its primary key the columns in my WHERE clause, and as a value the list of IDs (comma separated)?
This way it is a simple SELECT of 1 row.
I.e. this way I "pre-store" the supplier ids I need.
I understand that this is not even 1st normal form, but does it make sense? Is there another approach?
It makes sense as a denormalized design to speed up that specific type of query you have.
Though if your date range changes, couldn't it result in a different set of id's?
The other approach would be to really treat the denormalized entries like entries in a key/value cache like memcached or redis. Store the real data in normalized tables, and periodically update the cached, denormalized form.
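A minimal sketch of that refresh pattern, using SQLite through Python. The table and column names (supply_event, supplier_summary) are invented for illustration; instead of a comma-separated id string, the denormalized table keeps one row per key/supplier pair and is rebuilt periodically from the normalized source:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE supply_event (      -- normalized source of truth
        supplier_id INTEGER,
        warehouse_id INTEGER,
        supplied_on TEXT
    );
    CREATE TABLE supplier_summary (  -- denormalized, cache-like copy
        warehouse_id INTEGER,
        supplied_on TEXT,
        supplier_id INTEGER,
        PRIMARY KEY (warehouse_id, supplied_on, supplier_id)
    );
""")
conn.executemany("INSERT INTO supply_event VALUES (?, ?, ?)",
                 [(7, 1, '2021-03-01'), (8, 1, '2021-03-01'), (7, 2, '2021-03-02')])

# Periodic refresh: rebuild the summary from the normalized data.
conn.executescript("""
    DELETE FROM supplier_summary;
    INSERT INTO supplier_summary
    SELECT DISTINCT warehouse_id, supplied_on, supplier_id FROM supply_event;
""")

# The expensive join is replaced by a primary-key lookup.
ids = [r[0] for r in conn.execute("""
    SELECT supplier_id FROM supplier_summary
    WHERE warehouse_id = 1 AND supplied_on = '2021-03-01'
    ORDER BY supplier_id
""")]
print(ids)  # [7, 8]
```

Keeping one row per supplier (rather than a comma-separated string) means the summary stays queryable and indexable while still acting as a cache.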
Re your comments:
Yes, generally storing a list of id's in a string is against relational database design. See my answer to Is storing a delimited list in a database column really that bad?
But on the other hand, denormalization is justified in certain cases, for example as an optimization for a query you run frequently.
Just be aware of the downsides of denormalization: risk of data integrity failure, poor performance for other queries, limiting the ability to update data easily, etc.
In the absence of knowing a lot more about your application it's impossible to say whether this is the right approach - but to collect and consider that volume of information goes way beyond the scope of a question here.
Essentially I get the IDs of suppliers that serviced specific warehouses at specific dates.
While it's far from clear why you actually need 2 tables here, nor whether denormalizing the data would make the resulting query faster, one thing of note here is that your data is unlikely to change after capture, hence maintaining the current structure along with a materialized view would have minimal overhead. You first need to test the query performance by putting the sub-query results into a properly indexed table. If you get a significant performance benefit, then you need to think about how you maintain the new table - can you substitute one of the existing tables with a view on the new table, or do you keep both your original tables and populate data into the new table by batch, or by triggers?
It's not hard to try it out and see what works - and you'll get a far better answer than anyone here can give you.
I have a membership database that I am looking to rebuild. Every member has 1 row in a main members table. From there I will use a JOIN to reference information from other tables. My question is, which of the following would be better for performance:
1 data table that specifies a data type and then the data. Example:
data_id | member_id | data_type | data
1 | 1 | email | test@domain.com
2 | 1 | phone | 1234567890
3 | 2 | email | test@domain2.com
Or
Would it be better to make a table of all the email addresses, and then a table of all phone numbers, etc and then use a select statement that has multiple joins
Keep in mind, this database will start with over 75000 rows in the member table, and will actually include phone, email, fax, first and last name, company name, and address (city, state, zip). Each member will have at least 1 of each of those, but can have multiple (normally 1-3 per member), so in excess of 75000 phone numbers, email addresses, etc.
So basically, join 1 table of in excess of 750,000 rows or join 7-10 tables of in excess of 75,000 rows
edit: performance of this database becomes an issue when we are inserting sales data that needs to be matched to existing data in the database: taking a CSV file of 10k rows of sales and contact data and querying the database to try to find which member each sales row from the CSV belongs to. Oh yeah, and this is done on a web server, not a local machine (not my choice)
The obvious way to structure this would be to have one table with one column for each data item (email, phone, etc) you need to keep track of. If a particular data item can occur more than once per member, then it depends on the exact nature of the relationship between that item and the member: if the item can naturally occur a variable number of times, it would make sense to put these in a separate table with a foreign key to the member table. But if the data item can occur multiple times in a limited, fixed set of roles (say, home phone number and mobile phone number) then it makes more sense to make a distinct column in the member table for each of them.
If you run into performance problems with this design (personally, I don't think 75000 is that much - it should not give problems if you have indexes to properly support your queries) then you can partition the data. MySQL supports native partitioning (http://dev.mysql.com/doc/refman/5.1/en/partitioning.html), which essentially distributes collections of rows over separate physical compartments (the partitions) while maintaining one logical compartment (the table). The obvious advantage here is that you can keep querying a logical table and do not need to manually bunch up the data from several places.
If you still don't think this is an option, you could consider vertical partitioning: that is, making groups of columns or even single columns and putting those in their own table. This makes sense if you have some queries that always need one particular set of columns, and other queries that tend to use another set of columns; only then would it make sense to apply this vertical partitioning, because the join itself will cost performance.
(If you're really running into the billions then you could consider sharding - that is, use separate database servers to keep a partition of the rows. This makes sense only if you can either quickly limit the number of shards that you need to query to find a particular member row or if you can efficiently query all shards in parallel. Personally it doesn't seem to me you are going to need this.)
I would strongly recommend against making a single "data" table. This would essentially spread out each thing that would naturally be a column to a row. This requires a whole bunch of joins and complicates writing of what otherwise would be a pretty straightforward query. Not only that, it also makes it virtually impossible to create proper, efficient indexes over your data. And on top of that it makes it very hard to apply constraints to your data (things like enforcing the data type and length of data items according to their type).
There are a few corner cases where such a design could make sense, but improving performance is not one of them. (See: entity attribute value antipattern http://karwin.blogspot.com/2009/05/eav-fail.html)
You should research scaling out vs scaling up when it comes to databases. In addition to the aforementioned research, I would recommend that you use one table in your case if you are not expecting a great deal of data. If you are, then look up dimensions in database design.
75k is really nothing for a DB. You might not even notice the benefits of indexes with that many (index anyway :)).
The point is that, though you should be aware of "scale-out" systems, most DBs, MySQL included, can address this through partitioning, allowing your data access code to remain truly declarative vs. programmatic as to which object you're addressing/querying. It is important to note the difference between sharding and partitioning, but honestly those are conversations for when your record counts approach 9+ digits, not 5+.
Use neither
Although a variant of the first option is the right approach.
Create a 'lookup' table that will store values of data type (email, phone, etc...). Then use the id from your lookup table in your 'data' table.
That way you actually have 3 tables instead of two.
It's best practice for a classic many-to-many relationship such as this.
Our company has many different entities, but a good chunk of those database entities are people. So we have customers, and employees, and potential clients, and contractors, and providers and all of them have certain attributes in common, namely names and contact phone numbers.
I may have gone overboard with object-oriented thinking, but now I am looking at making one "Person" table that contains all of the people, with flags/subtables "extending" that model and adding role-based attributes to junction tables as necessary. If we grow to, say, 250.000 people (on MySQL and MyISAM), will this impact performance so greatly that future DBAs will curse me forever? Our single most common search is on name/surname combinations.
For, e.g. a company like Salesforce, are Clients/Leads/Employees all in a centralised table with sub-views (for want of a better term) or are they separated into different tables?
Caveat: this question is to do with "we found it better to do this in the real world" as opposed to theoretical design. I like the above solution, and am confident that with views, proper sizing and accurate indexing, that performance won't suffer. I also feel that the above doesn't count as a MUCK, just a pretty big table.
One 'person' table is the most flexible, efficient, and trouble-free approach.
It will be easy for you to do limited searches - find all people with this last name and who are customers, for example. But you may also find you have to look up someone when you don't know what they are - that will be easiest when you have one 'person' table.
However, you must consider the possibility that one person is multiple things to you - a customer because they bought something and a contractor because you hired them for a job. It would be better, therefore, to have a 'join' table that gives you a many to many relationship.
CREATE TABLE person_type (
    person_id INT UNSIGNED,
    person_type_id INT UNSIGNED,
    date_started DATETIME,
    date_ended DATETIME,
    [ ... ]
)
(You'll want to add indexes and foreign keys, of course. person_id is a FK to 'person' table; 'person_type_id' is a FK to your reference table for all possible person types. I've added two date fields so you can establish when someone was what to you.)
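Here is a rough sketch of how that join table answers the multiple-roles case, using SQLite through Python. The reference-table name (ref_person_type), the sample people, and the role ids are all invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE person (person_id INTEGER PRIMARY KEY, surname TEXT);
    CREATE TABLE ref_person_type (person_type_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE person_type (        -- the many-to-many join table
        person_id INTEGER REFERENCES person(person_id),
        person_type_id INTEGER REFERENCES ref_person_type(person_type_id),
        date_started TEXT,
        date_ended TEXT
    );
""")
conn.executemany("INSERT INTO person VALUES (?, ?)", [(1, 'Ng'), (2, 'Smith')])
conn.executemany("INSERT INTO ref_person_type VALUES (?, ?)",
                 [(1, 'customer'), (2, 'contractor')])
conn.executemany("INSERT INTO person_type VALUES (?, ?, ?, ?)",
                 [(1, 1, '2020-01-01', None),   # Ng is a customer
                  (1, 2, '2021-01-01', None),   # ...and also a contractor
                  (2, 1, '2020-06-01', None)])  # Smith is only a customer

# People who are BOTH a customer (type 1) and a contractor (type 2):
both = conn.execute("""
    SELECT p.surname
    FROM person p
    JOIN person_type pt ON pt.person_id = p.person_id
    WHERE pt.person_type_id IN (1, 2)
    GROUP BY p.person_id
    HAVING COUNT(DISTINCT pt.person_type_id) = 2
""").fetchall()
print(both)  # [('Ng',)]
```

With role flags as columns on the person table, the same question would need one column per role and an ALTER TABLE for every new role; the join table needs neither.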
Since you have many different "types" of Persons, in order to have a normalized design with proper Foreign Key constraints, it's better to use the supertype/subtype pattern: one Person table (with the attributes common to all) and many subtype tables (Employee, Contractor, Customer, etc.), all in 1:1 relationship with the main Person table, each with the necessary details for its type of Person.
Check this answer by @Branko for an example: Many-to-Many but sourced from multiple tables
250.000 records is not very much for a database. If you set your indexes appropriately, you will not run into any problems with that.
You should probably set a type for a user. Those types should be in a different table, so you can see what the type means (make it a TINYINT or similar). If you need additional fields per user type, you could indeed create a separate table for that.
This approach sounds really good to me
Theoretically it would be possible to be a customer of the company you work for.
But if that's not the case here, then you could store people in different tables depending on their role.
However, like Topener said, 250.000 isn't much. So I would personally feel safe storing every single person in one table, and then having a column for each role (employee, customer, etc.)
Even if you end up with a one table solution (for core person attributes), you are going to want to abstract it with views and put on some constraints.
The last thing you want to do is send confidential information to clients which was only supposed to go to employees because someone didn't join correctly. Or an accidental cross join which results in income being doubled on a report (but only for particular clients which also had an employee linked somehow).
It really depends on how you want the layers to look and which components are going to access which layers and how.
Also, I would think you want to revisit your choice of MyISAM over InnoDB.