How to optimize MySQL queries with many combinations of where conditions? - mysql

I have a MySQL table like this, and I want to create indexes that make all queries to the table run fast. The difficult thing is that there are many possible combinations of where conditions, and that the size of table is large (about 6M rows).
Table name: items
id: PKEY
item_id: int (the id of items)
category_1: int
category_2: int
...
category_10: int
release_date: date
sort_score: decimal
item_id is not unique because an item can have several numbers of category_x.
An example of queries to this table is:
SELECT DISTINCT(item_id) FROM items WHERE category_1 IN (1, 2) AND category_5 IN (3, 4) AND release_date > '2019-01-01' ORDER BY sort_score
And another query maybe:
SELECT DISTINCT(item_id) FROM items WHERE category_3 IN (1, 2) AND category_4 IN (3, 4) AND category_8 IN (5) ORDER BY sort_score
If I want to optimize all the combinations of where conditions, do I have to make a huge number of composite indexes of the column combinations? (like ADD INDEX idx1_3_5(category_1, category_3, category_5))
Or is it good to create 10 tables which have data of category_1~10, and execute many INNER JOIN in the queries?
Or, is it difficult to optimize this kind of query in MySQL, and should I use other middlewares, such as Elasticsearch?

Well, the file (it is not a table) is not at all Normalised. Therefore no amount of indices on combinations of fields will help the queries.
Second, MySQL is (a) not compliant with the SQL requirement, and (b) it does not have a Server Architecture or the features of one, such as Statistics, which are used by a genuine Query Optimiser, which commercial SQL platforms have. The "single index" issue you raise in the comments does not apply.
Therefore, while we can fix up the table, etc, you may never obtain the performance that you seek from the freeware.
Eg. in the commercial world, 6M rows is nothing, we worry when we get to a billion rows.
Eg. Statistics is automatic, we have to tweak it only when necessary: an un-normalised table or billions of rows.
Or ... should I use other middlewares, such as Elasticsearch?
It depends on the use of genuine SQL vs MySQL, and the middleware.
If you fix up the file and make a set of Relational tables, the queries are then quite simple, and fast. It does not justify a middleware search engine (that builds a data cube on the client system).
If they are not fast on MySQL, then the first recommendation would be to get a commercial SQL platform instead of the freeware.
The last option, the very last, is to stick to the freeware and add a big fat middleware search engine to compensate.
Or is it good to create 10 tables which have data of category_1~10, and execute many INNER JOIN in the queries?
Yes. JOINs are quite ordinary in SQL. Contrary to popular mythology, a normalised database, which means many more tables than an un-normalised one, causes fewer JOINs, not more JOINs.
So, yes, Normalise that beast. Ten tables is the starting perception, still not at all Normalised. One table for each of the following would be a step in the direction of Normalised:
Item
Item_id will be unique.
Category
This is not category-1, etc, but each of the values that are in category_1, etc. You must not have multiple values in a single column, it breaks 1NF. Such values will be (a) Atomic, and (b) unique. The Relational Model demands that the rows are unique.
The meaning of category_1, etc in Item is not given. (If you provide some example data, I can improve the accuracy of the data model.) Obviously it is not [2].
If it is a Priority (1..10), or something similar, that the users have chosen or voted on, this table will be a table that supplies the many-to-many relationship between Item and Category, with a Priority for each row.
Let's call it Poll. The relevant Predicates would be something like:
Each Poll is 1 Item
Each Poll is 1 Priority
Each Poll is 1 Category
Likewise, sort_score is not explained. If it is even remotely what it appears to be, you will not need it. Because it is a Derived Value. That you should compute on the fly: once the tables are Normalised, the SQL required to compute this is straight-forward. Not one that you compute-and-store every 5 minutes or every 10 seconds.
The Relational Model
The above maintains the scope of just answering your question, without pointing out the difficulties in your file. Noting the Relational Database tag, this section deals with the Relational errors.
The Record ID field (item_id or category_id is yours) is prohibited in the Relational Model. It is a physical pointer to a record, which is explicitly the very thing that the RM overcomes, and that is required to be overcome if one wishes to obtain the benefits of the RM, such as ease of queries, and simple, straight-forward SQL code.
Conversely, the Record ID is always one additional column and one additional index, and the SQL code required for navigation becomes complex (and buggy) very quickly. You will have enough difficulty with the code as it is, I doubt you would want the added complexity.
Therefore, get rid of the Record ID fields.
The Relational Model requires that the Keys are "made up from the data". That means something from the logical row, that the users use. Usually they know precisely what identifies their data, such as a short name.
It is not manufactured by the system, such as a RecordID field which is a GUID or AUTOINCREMENT, which the user does not see. Such fields are physical pointers to records, not Keys to logical rows. Such fields are pre-Relational, pre-DBMS, 1960's Record Filing Systems, the very thing that RM superseded. But they are heavily promoted and marketed as "relational".
Relational Data Model • Initial
Looks like this.
All my data models are rendered in IDEF1X, the Standard for modelling Relational databases since 1993
My IDEF1X Introduction is essential reading for beginners.
Relational Data Model • Improved
Ternary relations (aka three-way JOINs) are known to be a problem, indicating that further Normalisation is required. Codd teaches that every ternary relation can be reduced to two binary relations.
In your case, perhaps an Item has certain, not all, Categories. The above implements Polls of Items allowing all Categories for each Item, which is a typical error in a ternary relation, which is why it requires further Normalisation. It is also the classic error in every RFS file.
The corrected model would therefore be to establish the Categories for each Item first as ItemCategory, your "item can have several numbers of category_x". And then to allow Polls on that constrained ItemCategory. Note, this level of constraining data is not possible in 1960's Record Filing Systems, in which the "key" is a fabricated id field:
Each ItemCategory is 1 Item
Each ItemCategory is 1 Category
Each Poll is 1 Priority
Each Poll is 1 ItemCategory
Your indices are now simple and straight-forward, no additional indices are required.
Likewise your query code will now be simple and straight-forward, and far less prone to bugs.
Please make sure that you learn about Subqueries. The Poll table supports any type of pivoting that may be required.
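As a rough illustration of the ItemCategory idea above, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for MySQL. The table and column names (Item, Category, ItemCategory, item_name, category_name) and the sample rows are assumptions made up for the example, since the question does not say what the category columns mean. The Poll table is left out to keep it short. Each category filter becomes one EXISTS subquery against ItemCategory.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")

# Hypothetical schema: natural keys taken from the data, no record-ID columns.
conn.executescript("""
CREATE TABLE Item (
    item_name    TEXT PRIMARY KEY,
    release_date TEXT NOT NULL
);
CREATE TABLE Category (
    category_name TEXT PRIMARY KEY
);
CREATE TABLE ItemCategory (
    item_name     TEXT NOT NULL REFERENCES Item (item_name),
    category_name TEXT NOT NULL REFERENCES Category (category_name),
    PRIMARY KEY (item_name, category_name)
);
""")

conn.executemany("INSERT INTO Item VALUES (?, ?)", [
    ("lamp", "2019-03-01"),
    ("desk", "2018-06-15"),
    ("sofa", "2019-07-20"),
])
conn.executemany("INSERT INTO Category VALUES (?)",
                 [("lighting",), ("office",), ("living",)])
conn.executemany("INSERT INTO ItemCategory VALUES (?, ?)", [
    ("lamp", "lighting"), ("lamp", "office"),
    ("desk", "office"),
    ("sofa", "living"),
])

# Items released after 2019-01-01 that are in (lighting OR living) AND in office.
rows = conn.execute("""
SELECT i.item_name
FROM Item AS i
WHERE i.release_date > '2019-01-01'
  AND EXISTS (SELECT 1 FROM ItemCategory AS ic
              WHERE ic.item_name = i.item_name
                AND ic.category_name IN ('lighting', 'living'))
  AND EXISTS (SELECT 1 FROM ItemCategory AS ic
              WHERE ic.item_name = i.item_name
                AND ic.category_name = 'office')
ORDER BY i.item_name
""").fetchall()

items_found = [r[0] for r in rows]
print(items_found)
```

Because each item appears once in Item, no DISTINCT is needed in this form of the query.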

It is messy to optimize such queries against such a table. Moving the categories off to other tables would only make it slower.
Here's a partial solution... Identify the categories that are likely to be tested with
=
IN
a range, such as your example release_date > '2019-01-01'
Then devise a few indexes (perhaps no more than a dozen) that have, say, 3-4 columns. Those columns should be ones that are often tested together. Order the columns in the indexes based on the list above. It is quite fine to have multiple = columns (first), but don't include more than one 'range' (last).
Keep in mind that the order of tests in WHERE does not matter, but the order of the columns in an INDEX does.
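The column-ordering advice above can be tried out with a small sketch. This uses Python's built-in sqlite3 module rather than MySQL, so the planner output is SQLite's and only illustrates the principle. The index name idx_c1_c5_date and the sample rows are made up for the example. The equality columns come first in the index and the single range column comes last, while the WHERE clause deliberately lists the tests in a different order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE items (
    id           INTEGER PRIMARY KEY,
    item_id      INTEGER,
    category_1   INTEGER,
    category_5   INTEGER,
    release_date TEXT,
    sort_score   REAL
);
-- Equality-tested columns first, the one range-tested column last.
CREATE INDEX idx_c1_c5_date ON items (category_1, category_5, release_date);
""")

conn.executemany(
    "INSERT INTO items (item_id, category_1, category_5, release_date, sort_score) "
    "VALUES (?, ?, ?, ?, ?)",
    [
        (10, 1, 3, "2019-05-01", 0.9),
        (11, 1, 3, "2018-05-01", 0.5),
        (12, 2, 3, "2019-06-01", 0.7),
        (10, 1, 3, "2020-01-01", 0.4),
    ],
)

# The order of tests in WHERE differs from the index column order on purpose.
query = """
SELECT DISTINCT item_id
FROM items
WHERE release_date > '2019-01-01'
  AND category_5 = 3
  AND category_1 = 1
"""

# The last column of each EXPLAIN QUERY PLAN row is the human-readable detail.
plan = " | ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + query))
matching = sorted(r[0] for r in conn.execute(query))

print(plan)
print(matching)
```

In MySQL the equivalent check would be to run EXPLAIN on the query and look at which key is chosen.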

Related

Is the performance of join two one-to-one tables on a single-node database the same as a pure select on an equivalent denormalized table?

There are two big (millions of records) one-to-one tables:
course
prerequisite with a foreign key reference to the course table
in single-node relational MySQL database. A join is needed to list the full description of all the courses.
An alternative is to have only one single table to contain both the course and prerequisite data in the same database.
Question: is the performance of the join query still slower than that of a simple select query without a join on the single denormalized table, despite the fact that they are on the same single-node MySQL database?
It's true that denormalization is often done to shorten the work to look up one record with its associated details. This usually means the query responds in less time.
But denormalization improves one query at the expense of other queries against the same data. Making one query faster will often make other queries slower. For example, what if you want to query the set of courses that have a given prerequisite?
It's also a risk when you use denormalization that you create data anomalies. For example, if you change a course name, you would also need to update all the places where it is named as a prerequisite. If you forget one, then you'll have a weird scenario where the obsolete name for a course is still used in some places.
How will you know you found them all? How much work in the form of extra queries will you have to do to double-check that you have no anomalies? Do those types of extra queries count toward making your database slower on average?
The purpose of normalizing a database is not performance. It's avoiding data anomalies, which reduces your work in other ways.
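The renaming anomaly described above can be shown with a small sketch. It uses Python's built-in sqlite3 module as a stand-in for MySQL, and the table names (course_denorm, course, prerequisite) and course names are invented for the example. In the denormalized table the name is stored in two places, so an update that touches only one column leaves stale copies behind. In the normalized tables the name is stored once.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Denormalized: the prerequisite is stored by name next to each course.
conn.executescript("""
CREATE TABLE course_denorm (course_name TEXT, prerequisite_name TEXT);
INSERT INTO course_denorm VALUES
    ('Algebra II', 'Algebra I'),
    ('Calculus',   'Algebra II'),
    ('Statistics', 'Algebra II');
""")

# Rename the course, but forget the copies held as prerequisite names.
conn.execute(
    "UPDATE course_denorm SET course_name = 'Advanced Algebra' "
    "WHERE course_name = 'Algebra II'"
)
stale = conn.execute(
    "SELECT COUNT(*) FROM course_denorm WHERE prerequisite_name = 'Algebra II'"
).fetchone()[0]

# Normalized: the name lives in exactly one row; prerequisites refer to it by key.
conn.executescript("""
CREATE TABLE course (
    course_id INTEGER PRIMARY KEY,
    name      TEXT NOT NULL UNIQUE
);
CREATE TABLE prerequisite (
    course_id INTEGER NOT NULL REFERENCES course (course_id),
    prereq_id INTEGER NOT NULL REFERENCES course (course_id),
    PRIMARY KEY (course_id, prereq_id)
);
INSERT INTO course VALUES (1, 'Algebra I'), (2, 'Algebra II'), (3, 'Calculus');
INSERT INTO prerequisite VALUES (2, 1), (3, 2);
""")
conn.execute("UPDATE course SET name = 'Advanced Algebra' WHERE course_id = 2")

prereq_of_calculus = conn.execute("""
SELECT p.name
FROM prerequisite AS pr
JOIN course AS p ON p.course_id = pr.prereq_id
WHERE pr.course_id = 3
""").fetchone()[0]

print(stale, prereq_of_calculus)
```

The join costs something at read time, but the single update is enough to keep every reference consistent.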

Implementing efficient foreign keys in a relational database

All popular SQL databases, that I am aware of, implement foreign keys efficiently by indexing them.
Assuming a N:1 relationship Student -> School, the school id is stored in the student table with a (sometimes optional) index. For a given student you can find their school just looking up the school id in the row, and for a given school you can find its students by looking up the school id in the index over the foreign key in Students. Relational databases 101.
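For reference, the conventional layout described above looks roughly like this. The sketch uses Python's built-in sqlite3 module, so the plan output is SQLite's rather than MySQL's, and the index name idx_student_school and the sample rows are assumptions for the example. It shows the lookup in both directions: student to school by reading the row, and school to students through the index on the foreign key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE school (
    school_id INTEGER PRIMARY KEY,
    name      TEXT NOT NULL
);
CREATE TABLE student (
    student_id INTEGER PRIMARY KEY,
    name       TEXT NOT NULL,
    school_id  INTEGER NOT NULL REFERENCES school (school_id)
);
-- The index over the foreign key column, on the "many" side.
CREATE INDEX idx_student_school ON student (school_id);

INSERT INTO school VALUES (1, 'North High'), (2, 'South High');
INSERT INTO student VALUES (1, 'Ann', 1), (2, 'Bob', 2), (3, 'Cat', 1);
""")

# Student -> school: the school id is read straight from the student row.
school_of_ann = conn.execute(
    "SELECT school_id FROM student WHERE name = 'Ann'"
).fetchone()[0]

# School -> students: resolved through the index on student.school_id.
by_school = "SELECT name FROM student WHERE school_id = 1"
students_of_school_1 = sorted(r[0] for r in conn.execute(by_school))
plan = " | ".join(r[-1] for r in conn.execute("EXPLAIN QUERY PLAN " + by_school))

print(school_of_ann, students_of_school_1)
print(plan)
```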
But is that the only sensible implementation? Imagine you are the database implementer, and instead of using a btree index on the foreign key column, you add an (invisible to the user) set on the row at the other (many) end of the relation. So instead of indexing the school id column in students, you had an invisible column that was a set of student ids on the school row itself. Then fetching the students for a given school is as simple as iterating the set. Is there a reason this implementation is uncommon? Are there some queries that can't be supported efficiently this way? The two approaches seem more or less equivalent, modulo particular implementation details. It seems to me you could emulate either solution with the other.
In my opinion it's conceptually the same as splitting up the btree, which contains sorted runs of (school_id, student_row_id), and storing each run on the school row itself. Looking up a school id in the school primary key gives you the run of student ids, the same as looking up a school id in the foreign key index would have.
You seem to be suggesting storing "comma separated list of values" as a string in a character column of a table. And you say that it's "as simple as iterating the set".
But in a relational database, it turns out that "iterating the set" when it's stored as a list of values in a column is not at all simple. Nor is it efficient. Nor does it conform to the relational model.
Consider the operations required when a member needs to be added to a set, or removed from the set, or even just determining whether a member is in a set. Consider the operations that would be required to enforce integrity, to verify that every member in that "comma separated list" is valid. The relational database engine is not going to help us out with that, we'll have to code all of that ourselves.
At first blush, this idea may seem like a good approach. And it's entirely possible to do, and to get some code working. But once we move beyond the trivial demonstration, into the realm of real problems and real world data volumes, it turns out to be a really, really bad idea.
Storing comma-separated lists is an all-too-familiar SQL anti-pattern.
I strongly recommend Chapter 2 of Bill Karwin's excellent book: SQL Antipatterns: Avoiding the Pitfalls of Database Programming ISBN-13: 978-1934356555
(The discussion here relates to "relational database" and how it is designed to operate, following the relational model, the theory developed by Ted Codd and Chris Date.)
"All nonkey columns are dependent on the key, the whole key, and nothing but the key. So help me Codd."
Q: Is there a reason this implementation is uncommon?
Yes, it's uncommon because it flies in the face of relational theory. And it makes what would be a straightforward problem (for the relational model) into a confusing jumble that the relational database can't help us with. If what we're storing is just a string of characters, and the database never needs to do anything with that, other than store the string and retrieve the string, we'd be good. But we can't ask the database to decipher that as representing relationships between entities.
Q: Are there some queries that can't be supported efficiently this way?
Any query that would need to turn that "list of values" into a set of rows to be returned would be inefficient. Any query that would need to identify a "list of values" containing a particular value would be inefficient. And operations to insert or remove a value from the "list of values" would be inefficient.
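A small sketch of what those operations look like when the list is a string column. It uses Python's built-in sqlite3 module as a stand-in, and the table school_list and its contents are invented for the example. Membership needs a pattern match over the whole string, which cannot use an ordinary index on the individual ids, and a naive pattern gives false positives. Removing a member means reading the string, rewriting it in application code, and writing it back.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE school_list (
    school_id   INTEGER PRIMARY KEY,
    student_ids TEXT            -- comma-separated list, the anti-pattern
);
INSERT INTO school_list VALUES (1, '7,12,31'), (2, '3,127');
""")

# Naive membership test: '12' also matches inside '127'.
naive_hits = sorted(r[0] for r in conn.execute(
    "SELECT school_id FROM school_list WHERE student_ids LIKE '%12%'"
))

# Correct test: wrap both sides in delimiters. Still scans every row.
exact_hits = sorted(r[0] for r in conn.execute(
    "SELECT school_id FROM school_list "
    "WHERE ',' || student_ids || ',' LIKE '%,12,%'"
))

# Removing a member: fetch the whole list, rebuild it, replace the whole value.
current = conn.execute(
    "SELECT student_ids FROM school_list WHERE school_id = 1"
).fetchone()[0]
remaining = ",".join(s for s in current.split(",") if s != "12")
conn.execute(
    "UPDATE school_list SET student_ids = ? WHERE school_id = 1", (remaining,)
)

print(naive_hits, exact_hits, remaining)
```

Nothing in the schema stops an id that does not exist in the student table from being written into the string, which is the integrity problem mentioned above.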
This might buy you some small benefit in a narrow set of cases. But the drawbacks are numerous.
Such indices are useful for more than just direct joins from the parent record. A query might GROUP BY the FK column, or join it to a temp table / subquery / CTE; all of these cases might benefit from the presence of an index, but none of the queries involve the parent table.
Even direct joins from the parent often involve additional constraints on the child table. Consequently, indices defined on child tables commonly include other fields in addition to the key itself.
Even if there appear to be fewer steps involved in this algorithm, that does not necessarily equate to better performance. Databases don't read from disk a column at a time; they typically load data in fixed-size blocks. As a result, storing this information in a contiguous structure may allow it to be accessed far more efficiently than scattering it across multiple tuples.
No database that I'm aware of can inline an arbitrarily large column; either you'd have a hard limit of a few thousand children, or you'd have to push this list to some out-of-line storage (and with this extra level of indirection, you've probably lost any benefit over an index lookup).
Databases are not designed for partial reads or in-place edits of a column value. You would need to fetch the entire list whenever it's accessed, and more importantly, replace the entire list whenever it's modified.
In fact, you'd need to duplicate the entire row whenever the child list changes; the MVCC model handles concurrent modifications by maintaining multiple versions of a record. And not only are you spawning more versions of the record, but each version holds its own copy of the child list.
Probably most damning is the fact that an insert on the child table now triggers an update of the parent. This involves locking the parent record, meaning that concurrent child inserts or deletes are no longer allowed.
I could go on. There might be mitigating factors or obvious solutions in many of these cases (not to mention outright misconceptions on my part), though there are probably just as many issues that I've overlooked. In any case, I'm satisfied that they've thought this through fairly well...

Does this de-normalization make sense?

I have 2 tables which I join very often. To simplify this, the join gives back a range of IDs that I use in another (complex) query as part of an IN.
So I do this join all the time to get back specific IDs.
To be clear, the query is not horribly slow. It takes around 2 mins. But since I call this query over a web page, the delay is noticeable.
As a concrete example, let's say that the tables I am joining are a Supplier table and a table that contains the warehouses the supplier equipped on specific dates. Essentially I get the IDs of suppliers that serviced specific warehouses at specific dates.
The query itself cannot be improved, since it is a simple join between 2 tables that are indexed, but the date range complicates things.
I had the following idea which, I am not sure if it makes sense.
Since the data I am querying (especially for previous dates) do not change, what if I created another table that has as primary key, the columns in my where and as a value the list of IDs (comma separated).
This way it is a simple SELECT of 1 row.
I.e. this way I "pre-store" the supplier ids I need.
I understand that this is not even 1st normal form, but does it make sense? Is there another approach?
It makes sense as a denormalized design to speed up that specific type of query you have.
Though if your date range changes, couldn't it result in a different set of id's?
The other approach would be to really treat the denormalized entries like entries in a key/value cache like memcached or redis. Store the real data in normalized tables, and periodically update the cached, denormalized form.
Re your comments:
Yes, generally storing a list of id's in a string is against relational database design. See my answer to Is storing a delimited list in a database column really that bad?
But on the other hand, denormalization is justified in certain cases, for example as an optimization for a query you run frequently.
Just be aware of the downsides of denormalization: risk of data integrity failure, poor performance for other queries, limiting the ability to update data easily, etc.
In the absence of knowing a lot more about your application it's impossible to say whether this is the right approach - but to collect and consider that volume of information goes way beyond the scope of a question here.
Essentially I get the IDs of suppliers that serviced specific warehouses at specific dates.
While it's far from clear why you actually need 2 tables here, nor if denormalizing the data would make the resulting query faster, one thing of note here is that your data is unlikely to change after capture, hence maintaining the current structure along with a materialized view would have minimal overhead. You first need to test the query performance by putting the sub-query results into a properly indexed table. If you get a significant performance benefit, then you need to think about how you maintain the new table - can you substitute one of the existing tables with a view on the new table, or do you keep both your original tables and populate data into the new table by batch, or by triggers.
It's not hard to try it out and see what works - and you'll get a far better answer than anyone here can give you.
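One way to try that out is sketched below, using Python's built-in sqlite3 module as a stand-in for MySQL. All table and column names (supplier, warehouse_service, supplier_service_summary) and the sample rows are assumptions, since the real schema was not posted. The join result is stored once in a properly keyed table, one row per supplier rather than a comma-separated list, and the web page then queries only that table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE supplier (
    supplier_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL
);
CREATE TABLE warehouse_service (
    supplier_id  INTEGER NOT NULL REFERENCES supplier (supplier_id),
    warehouse_id INTEGER NOT NULL,
    service_date TEXT NOT NULL
);
INSERT INTO supplier VALUES (1, 'Acme'), (2, 'Bolt'), (3, 'Crate');
INSERT INTO warehouse_service VALUES
    (1, 5, '2020-01-10'),
    (2, 5, '2020-02-03'),
    (3, 5, '2020-01-20'),
    (1, 6, '2020-01-11'),
    (1, 5, '2020-01-10');

-- Precomputed result of the join, keyed by the columns used in WHERE.
CREATE TABLE supplier_service_summary (
    warehouse_id INTEGER NOT NULL,
    service_date TEXT NOT NULL,
    supplier_id  INTEGER NOT NULL,
    PRIMARY KEY (warehouse_id, service_date, supplier_id)
);
-- Batch population; for data that no longer changes this runs once.
INSERT INTO supplier_service_summary
SELECT DISTINCT ws.warehouse_id, ws.service_date, s.supplier_id
FROM supplier AS s
JOIN warehouse_service AS ws ON ws.supplier_id = s.supplier_id;
""")

supplier_ids = sorted(r[0] for r in conn.execute("""
SELECT DISTINCT supplier_id
FROM supplier_service_summary
WHERE warehouse_id = 5
  AND service_date BETWEEN '2020-01-01' AND '2020-01-31'
"""))

print(supplier_ids)
```

The result can be used directly as a subquery inside the IN of the larger query, which a comma-separated string could not be. Whether it is actually faster than the original join has to be measured on the real data.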

Which of these 2 MySQL DB Schema approaches would be most efficient for retrieval and sorting?

I'm confused as to which of the two db schema approaches I should adopt for the following situation.
I need to store multiple attributes for a website, e.g. page size, word count, category, etc. and where the number of attributes may increase in the future. The purpose is to display this table to the user and he should be able to quickly filter/sort amongst the data (so the table structure should support fast querying & sorting). I also want to keep a log of previous data to maintain a timeline of changes. So the two table structure options I've thought of are:
Option A
website_attributes
id, website_id, page_size, word_count, category_id, title_id, ...... (going up to 18 columns and have to keep in mind that there might be a few null values and may also need to add more columns in the future)
website_attributes_change_log
same table structure as above with an added column for "change_update_time"
I feel the advantage of this schema is the queries will be easy to write even when some attributes are linked to other tables and also sorting will be simple. The disadvantage I guess will be adding columns later can be problematic with ALTER TABLE taking very long to run on large data tables + there could be many rows with many null columns.
Option B
website_attribute_fields
attribute_id, attribute_name (e.g. page_size), attribute_value_type (e.g. int)
website_attributes
id, website_id, attribute_id, attribute_value, last_update_time
The advantage out here seems to be the flexibility of this approach, in that I can add columns whenever and also I save on storage space. However, as much as I'd like to adopt this approach, I feel that writing queries will be especially complex when needing to display the tables [since I will need to display records for multiple sites at a time and there will also be cross referencing of values with other tables for certain attributes] + sorting the data might be difficult [given that this is not a column based approach].
A sample output of what I'd be looking at would be:
Site-A.com, 232032 bytes, 232 words, PR 4, Real Estate [linked to category table], ..
Site-B.com, ..., ..., ... ,...
And the user needs to be able to sort by all the number based columns, in which case approach B might be difficult.
So I want to know if I'd be doing the right thing by going with Option A or whether there are other better options that I might have not even considered in the first place.
I would recommend using Option A.
You can mitigate the pain of long-running ALTER TABLE by using pt-online-schema-change.
The upcoming MySQL 5.6 supports non-blocking ALTER TABLE operations.
Option B is called Entity-Attribute-Value, or EAV. This breaks rules of relational database design, so it's bound to be awkward to write SQL queries against data in this format. You'll probably regret using it.
I have posted several times on Stack Overflow describing pitfalls of EAV.
Also in my blog: EAV FAIL.
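To make the difference concrete, here is a small sketch of the same "sort by word count" request against both layouts. It uses Python's built-in sqlite3 module as a stand-in for MySQL, with cut-down, made-up tables named wide (Option A) and eav (Option B). With Option A the sort is a plain ORDER BY on a typed column. With Option B every displayed attribute has to be pivoted back into a column and cast from text before it can be sorted numerically.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Option A: one typed column per attribute.
CREATE TABLE wide (
    website_id INTEGER PRIMARY KEY,
    page_size  INTEGER,
    word_count INTEGER
);
INSERT INTO wide VALUES (1, 232032, 232), (2, 51200, 900);

-- Option B (EAV): one row per attribute, values stored as text.
CREATE TABLE eav (
    website_id      INTEGER NOT NULL,
    attribute_name  TEXT NOT NULL,
    attribute_value TEXT,
    PRIMARY KEY (website_id, attribute_name)
);
INSERT INTO eav VALUES
    (1, 'page_size', '232032'), (1, 'word_count', '232'),
    (2, 'page_size', '51200'),  (2, 'word_count', '900');
""")

wide_sorted = conn.execute("""
SELECT website_id, page_size, word_count
FROM wide
ORDER BY word_count DESC
""").fetchall()

# One CASE expression per attribute, plus a cast, just to rebuild the row.
eav_sorted = conn.execute("""
SELECT website_id,
       CAST(MAX(CASE WHEN attribute_name = 'page_size'
                     THEN attribute_value END) AS INTEGER) AS page_size,
       CAST(MAX(CASE WHEN attribute_name = 'word_count'
                     THEN attribute_value END) AS INTEGER) AS word_count
FROM eav
GROUP BY website_id
ORDER BY word_count DESC
""").fetchall()

print(wide_sorted)
print(eav_sorted)
```

In MySQL the cast would be written CAST(... AS SIGNED). With 18 attributes the pivot query needs 18 such expressions, and the sort cannot use an index on the attribute value in any straightforward way.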
Option A is the better way. Although ALTER TABLE may take a long time when adding an extra column, querying and sorting are quicker. I have used a design like Option A before, and ALTER TABLE did not take too long even with millions of records in the table.
You should go with Option B because it is more flexible and uses less RAM. With Option A you have to fetch a lot of content into RAM, which increases the chance of page faults. If you want to improve the query speed of the database, then you should definitely index it to get fast results.
I think Option A is not a good design. When you design a good data model you should not have to change the tables in the future. If you master the SQL language, writing queries against Option B will not be difficult. It is also the solution to your real problem: "you need to store some attributes (an open number, not a final set of attributes) of some webpages; therefore, an entity exists to represent those attributes".
Use Option A as the attributes are fixed. It will be difficult to query and process data from second model as there will be query based on multiple attributes.

MySQL performance; large data table or multiple data tables?

I have a membership database that I am looking to rebuild. Every member has 1 row in a main members table. From there I will use a JOIN to reference information from other tables. My question is, what would be better for performance of the following:
1 data table that specifies a data type and then the data. Example:
data_id | member_id | data_type | data
1 | 1 | email | test#domain.com
2 | 1 | phone | 1234567890
3 | 2 | email | test#domain2.com
Or
Would it be better to make a table of all the email addresses, and then a table of all phone numbers, etc and then use a select statement that has multiple joins
Keep in mind, this database will start with over 75000 rows in the member table, and will actually include phone, email, fax, first and last name, company name, address, city, state, zip (meaning each member will have at least 1 of each of those but can have multiple, normally 1-3 per member, so in excess of 75000 phone numbers, email addresses, etc.)
So basically, join 1 table of in excess of 750,000 rows or join 7-10 tables of in excess of 75,000 rows
edit: performance of this database becomes an issue when we are inserting sales data that needs to be matched to existing data in the database, so taking a CSV file of 10k rows of sales and contact data and querying the database to try to find which member attributes to which sales row from the CSV? Oh yeah, and this is done on a web server, not a local machine (not my choice)
The obvious way to structure this would be to have one table with one column for each data item (email, phone, etc) you need to keep track of. If a particular data item can occur more than once per member, then it depends on the exact nature of the relationship between that item and the member: if the item can naturally occur a variable number of times, it would make sense to put these in a separate table with a foreign key to the member table. But if the data item can occur multiple times in a limited, fixed set of roles (say, home phone number and mobile phone number) then it makes more sense to make a distinct column in the member table for each of them.
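A minimal sketch of that layout, using Python's built-in sqlite3 module as a stand-in for MySQL. The table names (member, member_email), the columns, and the sample data are assumptions for illustration. Phone numbers in fixed roles are columns on the member row, while email addresses, which can occur a variable number of times, go in a child table with a foreign key. The loop at the end mimics matching imported sales rows to members by email with an indexed lookup.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE member (
    member_id    INTEGER PRIMARY KEY,
    first_name   TEXT NOT NULL,
    last_name    TEXT NOT NULL,
    home_phone   TEXT,           -- fixed roles become columns
    mobile_phone TEXT
);
CREATE TABLE member_email (      -- variable count becomes a child table
    member_id INTEGER NOT NULL REFERENCES member (member_id),
    email     TEXT NOT NULL,
    PRIMARY KEY (member_id, email)
);
-- Supports looking a member up by email during the import.
CREATE INDEX idx_member_email_email ON member_email (email);

INSERT INTO member VALUES
    (1, 'Ada',  'Lovelace', '5550100', NULL),
    (2, 'Alan', 'Turing',   NULL,      '5550101');
INSERT INTO member_email VALUES
    (1, 'ada@example.com'),
    (1, 'ada@work.example.com'),
    (2, 'alan@example.com');
""")

# Hypothetical rows as they might come out of the sales CSV: (email, amount).
sales_rows = [("ada@work.example.com", 120.0), ("nobody@example.com", 15.0)]

matches = []
for email, amount in sales_rows:
    row = conn.execute(
        "SELECT member_id FROM member_email WHERE email = ?", (email,)
    ).fetchone()
    # None marks a sales row that matched no existing member.
    matches.append((email, row[0] if row else None))

print(matches)
```

For a 10k-row CSV it would likely be faster to bulk-load the rows into a temporary table and join once instead of issuing one query per row, but the indexing idea is the same.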
If you run into performance problems with this design (personally, I don't think 75000 is that much - it should not give problems if you have indexes to properly support your queries) then you can partition the data. MySQL supports native partitioning (http://dev.mysql.com/doc/refman/5.1/en/partitioning.html), which essentially distributes collections of rows over separate physical compartments (the partitions) while maintaining one logical compartment (the table). The obvious advantage here is that you can keep querying a logical table and do not need to manually bunch up the data from several places.
If you still don't think this is an option, you could consider vertical partitioning: that is, making groups of columns or even single columns and putting those in their own table. This makes sense if you have some queries that always need one particular set of columns, and other queries that tend to use another set of columns. Only then would it make sense to apply this vertical partitioning, because the join itself will cost performance.
(If you're really running into the billions then you could consider sharding - that is, use separate database servers to keep a partition of the rows. This makes sense only if you can either quickly limit the number of shards that you need to query to find a particular member row or if you can efficiently query all shards in parallel. Personally it doesn't seem to me you are going to need this.)
I would strongly recommend against making a single "data" table. This would essentially spread out each thing that would naturally be a column to a row. This requires a whole bunch of joins and complicates writing of what otherwise would be a pretty straightforward query. Not only that, it also makes it virtually impossible to create proper, efficient indexes over your data. And on top of that it makes it very hard to apply constraints to your data (things like enforcing the data type and length of data items according to their type).
There are a few corner cases where such a design could make sense, but improving performance is not one of them. (See: entity attribute value antipattern http://karwin.blogspot.com/2009/05/eav-fail.html)
You should research scaling out vs scaling up when it comes to databases. In addition to the aforementioned research, I would recommend that you use one table in your case if you are not expecting a great deal of data. If you are, then look up dimensions in database design.
75k is really nothing for a DB. You might not even notice the benefits of indexes with that many (index anyway :)).
Point is that though you should be aware of "scale-out" systems, most DBs, MySQL included, can address this through partitioning, allowing your data access code to still be truly declarative vs. programmatic as to which object you're addressing/querying. It is important to note sharding vs. partitioning, but honestly those are conversations for when your record counts approach 9+ digits, not 5+.
Use neither
Although a variant of the first option is the right approach.
Create a 'lookup' table that will store values of data type (mail, phone etc...). Then use the id from your lookup table in your 'data' table.
That way you actually have 3 tables instead of two.
It's best practice for a classic many-to-many relationship such as this.