What I want to achieve:
I am developing a website with a catalog of products.
This is a normalized model (simplified) of the entities related to my question:
Some product features exist (like size and type in this example), each with a predefined set of values (e.g. sizes 1, 2 and 3 exist, and type may be 1, 2 or 3; these sets do not have to be equal, this is just an example).
The relationship between Product and each feature is "many-to-many" - different values of one feature do not exclude each other.
My task is to build a form which allows the user to filter search results based on product features. Example screenshot:
Multiple checked values of one feature are combined using "AND" logic, so if I have sizes One and Three checked, I need all products which have both sizes (they may have any other sizes too, that doesn't matter, but the selected ones must be present).
The number next to each feature value is the amount of products that would be returned if the user checked that value right now. So it is effectively the number of products satisfying the filter "current active filter + this one value applied".
When the user checks/unchecks any value, the counters must be updated to reflect the new "current filter".
Problem:
The real use case is ~200k products and ~6 features with ~5-15 values each.
My COUNT queries (especially with a decent number of selected options) are too slow, and to render the form I need as many of these counts as there are values across all filters - in total, that gives an unacceptable response time.
What I have tried:
Query to retrieve results:
select * from products p, product_size ps
where p.id = ps.product_id
and (ps.size_id IN (1, 2, 3, 5))
group by p.id
having count(p.id) = 4;
(this is to select products which have sizes 1, 2, 3 and 5 at the same time).
It completes in ~0.360 sec on 120k products, and in almost the same time with COUNT wrapped around it. This query also does not allow more than one feature (though I could place the values of all features in one table).
Another query to retrieve the same set:
SELECT ps1.product_id
FROM product_size AS ps1, (SELECT id FROM size AS s1 WHERE id IN (1, 2, 3, 5)) AS t
WHERE ps1.size_id = t.id
GROUP BY ps1.product_id
HAVING COUNT(ps1.size_id) = (SELECT COUNT(id) FROM (SELECT id FROM size AS s2 WHERE id IN (1, 2, 3, 5)) AS t2);
It completes in ~0.230 sec (the same time when wrapped in COUNT) and does not allow multiple features either.
It is a modified version of a query I found here: https://www.simple-talk.com/sql/t-sql-programming/divided-we-stand-the-sql-of-relational-division/ (the second query in the "Division with a Remainder" part).
Alternative schema:
A denormalized model, where each feature value is a boolean column in the products table.
The query is obvious here:
select * from products
where `size_1` = 1 and `size_2` = 1
and `size_3` = 1 and `size_5` = 1;
Weird and harder to maintain in the application's code, but it completes in ~0.056 sec when COUNT-ing.
None of these methods is acceptable per se, because multiplied ~30 times (to populate all the counters in the form) they give an inadequate response time.
Caching and precomputing
Data in the DB is going to be updated only a few times a day (maybe even just twice), so I could probably precompute the counts for all combinations of filters when the data is updated (I haven't measured the time this would take, to be honest). But it is not going to work anyway: the search form has fields with arbitrary values (like min/max price and text search on the product's name), which I cannot precompute for.
Load counters in form dynamically
Render the form, but fetch the numbers through AJAX, so the user would see the page first and then, after quite a long wait, the numbers. This is my last resort, but it seems like poor quality of service to me (it may even be worse than no counters at all).
I am stuck. Any hints? Maybe I am not seeing some bigger picture? I would be very glad of any advice.
UPDATE: if we forget about the counters, what is the effective and commonly used way (query) of just retrieving results with such filters (or what am I doing wrong)? It is equivalent to the "find posts with all requested tags" model. I suspect it can be faster than my 0.230 sec (query #2), considering the (small?) number of rows for MySQL.
You can:
Create one table which stores all possible combinations (product_id <> size_id <> type_id).
Update this table whenever the admin makes any change to a product from the backend (assuming there is backend management).
On the frontend, use this table for the filters instead of the product tables, and extract the product IDs once the filter query is fired.
Once you have the list of product IDs for the result, fetch the actual data using those product IDs.
I have used this approach before and it worked for me; you can create the table first and run a query against it to check the response time.
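A minimal sketch of such a table, assuming a product_type link table analogous to product_size (the product_filter table and index names are illustrative, not from the question):

CREATE TABLE product_filter (
  product_id INT NOT NULL,
  size_id    INT NOT NULL,
  type_id    INT NOT NULL,
  KEY idx_size_type (size_id, type_id, product_id)
);

INSERT INTO product_filter (product_id, size_id, type_id)
SELECT ps.product_id, ps.size_id, pt.type_id
FROM product_size ps
JOIN product_type pt ON pt.product_id = ps.product_id;

The filter query then scans this single table to extract product IDs, and the actual product rows are fetched afterwards with WHERE id IN (...).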
Hope this helps.
Related
I was poking around a TFS database today to try to run some statistics, and I came across a table called tbl_Number. This table contains one column, Number, and the values are just 1 to 500,000. None of the values differ from their respective index in the list, as you can see from the queries I ran in LINQPad:
Tbl_Numbers.Max(x => x.Number).Dump(); //max value
Tbl_Numbers.Count().Dump(); //number of entries
var asList = Tbl_Numbers.ToList();
asList.Where(x => asList[x.Number - 1].Number != x.Number).Any().Dump();
//False shows that every entry matches the value at its ordinal location in the list
My question is: what would the use of such a table be? Is this in case one of the referenced numbers needs to change for some reason? The only way to identify a number in this table is by using that same number, so I don't see what use this table could be.
I realize this question could lead to answers that are conjecture, but I'd be interested to see if there's some programming principle I'm unaware of being used here.
It can be used in OUTER JOINs to make sure that you always get a row for every number in a given range, even if there is no data related to some of those numbers.
For example, suppose I want to return the count of customers who bought 3, 4 or 5 products on their last order, but in fact there are no customers who bought 4 products. If I just ran a count query on my data, I wouldn't get a row for the customers who bought 4 products at all.
However, if I query my numbers table and LEFT JOIN it to my data, I will get the number 4 with a count of 0 or NULL, depending on how I write the query.
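A sketch of that idea; the last_orders table and its columns are made up for illustration:

SELECT n.Number             AS products_bought,
       COUNT(o.customer_id) AS customers
FROM tbl_Number n
LEFT JOIN last_orders o ON o.product_count = n.Number
WHERE n.Number BETWEEN 3 AND 5
GROUP BY n.Number;

The LEFT JOIN guarantees a row for 4 with a count of 0, even though no order matches it.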
People also often do this with Date tables, by the way.
I would like to achieve something like you see on Facebook:
- Posting status
- Comment status
- Like status (like for comments not implemented yet)
My table structure is like this:
Posts      Users      Comments   Likes
-------    -------    --------   -------
ID         ID         ID         ID
UserID     Username   PostID     PostID
Content               UserID     UserID
Date                  Content
                      Date
So at this time, when someone accesses the main page, the system shows the 10 latest posts. My query uses LEFT JOINs on these tables.
If, for example, there are 10 posts without any comments or likes, the query returns 10 records.
But each comment or like makes the query return an additional record (row), with NULL values in the corresponding columns.
In the end, simply to retrieve 10 posts, my query returns at least 50 rows (if each post has some comments and likes).
I was wondering if that will cause problems in the future, and whether I should instead use multiple queries and parse all the results into an array, like this:
1. Select the 10 last posts
2. Save the IDs into an array and all the data into a global array
3. Parse the array and make a prepared query for the comments, something like:
SELECT * FROM COMMENTS WHERE PostID IN (1, 2, 3, 4, 5, 6,...)
4. Save the result into the global array
5. Repeat for the likes table
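For illustration, step 1 might look like this (the ORDER BY Date is an assumption about how "last posts" is defined):

SELECT * FROM Posts ORDER BY Date DESC LIMIT 10;

The IDs from this result feed the IN (...) list of the comments query in step 3, and likewise the likes query in step 5.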
I hope my explanation was clear enough :) Thank you
Doing one 50-row query reduces the overhead of communicating with the server; on the other hand, it adds processing after the rows are retrieved.
It really depends on the overall solution.
However, unless the application is performance critical with the server being the bottleneck, I would go with the 10 result sets - one per post - probably using some class/widget/object to display each post on the page.
I'm not an expert, but if I understand correctly your options are:
A) The single mega-query that will return a lot of NULLs and repeated values.
B) Three queries: one for all posts, one for all comments, and one for all likes (each joined with the users table); you can then process the results into objects or structs or dictionaries in whatever language you are using to query the database. [Note: by "all" I mean all you are interested in.]
I would go with the second, because it is easier, the increase in the amount of data transferred seems benign, and it is probably even more flexible design-wise.
What I would prefer NOT to do is one query per post. That would probably become a problem sooner rather than later - at least much sooner than A or B.
I have a table of > 250k rows of 'names' (and ancillary info) which I am displaying using jQuery DataTables.
My users can choose any 'name' (row), which is then flagged as 'taken' (and timestamped).
A (very) cut down version of the table is:
Key, Name, Taken, Timestamp
I would like to display the 'taken' rows first (in timestamp order), and then the untaken records in key order [ASC].
The problem would be simple, but because of size constraints (both visual UI and data set size) my display mechanism paginates - 10 / 20 / 50 / 100 rows (user's choice).
This means that a) the total number of 'taken' rows will vary, and b) the pagination length varies.
Thus I can see no obvious method of keeping track of the pagination.
(My DataTable tells me the index of the start record and the number of displayed records.)
My SQL (MySQL) is weak at this level, and I have no idea how to return a record set that accounts for the 'taken' offset without some kind of new (or internal MySQL) numeric index to paginate against.
I thought of:
1. Creating a temporary table with the key and a new numeric index on each pagination.
2. Creating a trigger that re-ordered the table when a row was 'taken'.
3. Having a "Running order" column that was updated on each new 'taken'.
4. Some sort of cursor-based procedure (at this point my hair was ruffled as the explanations shot straight over the top of my head!).
All seem excessive.
I also thought of doing a lot of manipulation in PHP (involving separate queries, dependent on the pagination size and the number of names already taken, and keeping a running record of the pagination position).
To the human computer (the brain) the problem is untaxing - but translating it into SQL has foxed me, as has coming up with a fast alternative to 1-3 (the test case for updating the "Running order" column took almost three minutes to complete!).
It 'feels' like there should be a smart SQL answer to this, but all efforts with ORDER BY, LIMIT, and the like fall over unless I return the whole dataset and do a lot of nasty counting.
Is there a big elephant in the room that I am missing - or am I stuck with the hard slog to get what I need?
A query that displays the 'taken' rows (in timestamp order) first and then the untaken records in their key order [ASC] next:
SELECT *
FROM `table_name`
ORDER BY `taken` DESC, IF(`taken` = 1, `Timestamp`, `Key`) ASC
LIMIT 50, 10
The LIMIT values: 10 is the page size, 50 is the index of the first element on page 6.
Change the condition in IF(`taken` = 1, `Timestamp`, `Key`) to match the values you actually store in the `taken` column. I assumed you store 1 when the row is 'taken' and 0 otherwise.
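If, for example, `taken` were a NULLable column (NULL meaning 'not taken'), a sketch of the same idea would be:

SELECT *
FROM `table_name`
ORDER BY (`taken` IS NOT NULL) DESC,
         IF(`taken` IS NOT NULL, `Timestamp`, `Key`) ASC
LIMIT 50, 10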
My question is about selecting the best method to perform a job. I'll present the goal and my various solutions.
I have a list of items and a list of categories. Each item can belong to a number of categories.
items (id, name, ...other fields...)
categories (id, name, ...... )
category_items (category_id, item_id)
The items list is very large and is updated every 10 minutes (using cron). The categories list is fixed.
On my page, I'm showing a large list of items, and I have category filters. The whole filtering is done client-side in JavaScript. The reason is that the number of items available at any time is limited to roughly 1000, so all the data (items + categories) is loaded together.
This page is going to be viewed many times, so performance is an issue. I have several ideas, all of which should give good performance. In all of them, the complete list of categories will be sent. As for the items, the options are:
1. Running a single SELECT using JOIN and GROUP_CONCAT, something like this:
SELECT i.*, GROUP_CONCAT(ci.category_id SEPARATOR ",") AS category_list
FROM items AS i
LEFT JOIN category_items AS ci ON (ci.item_id = i.id)
WHERE ... GROUP BY i.id ORDER BY ...
2. Creating a view with the above.
3. Storing the GROUP_CONCAT result as an additional column, which would only be updated every several minutes under cron.
Indexing is done correctly, so all methods will work relatively fast. JOIN is a heavy operation, so my question is about (2) and (3):
Is a view updated only on CRUD operations, or is it recalculated on every SELECT? If it is only updated on CRUD, it should be about the same as storing the column.
Bear in mind that the items table will grow, and only the latest rows will be selected.
Solution 4: have a MEMORY-type table which is updated with the results of your query from solution 1, by the same cron script that updates the items table.
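A minimal sketch of solution 4 (the snapshot table name is illustrative; the CAST is needed because MEMORY tables cannot store the TEXT type that GROUP_CONCAT may otherwise produce):

CREATE TABLE items_snapshot (
  id            INT PRIMARY KEY,
  name          VARCHAR(255),
  category_list VARCHAR(255)
) ENGINE = MEMORY;

INSERT INTO items_snapshot (id, name, category_list)
SELECT i.id, i.name,
       CAST(GROUP_CONCAT(ci.category_id SEPARATOR ',') AS CHAR(255))
FROM items AS i
LEFT JOIN category_items AS ci ON ci.item_id = i.id
GROUP BY i.id;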
Other than that, (1) and (2) are equivalent: MySQL's views are not materialised, so querying the view will actually run the SELECT from point 1.
Creating a view is just saving a query; (2) will still run the query and the join. (3), of course, saves time at the expense of space.
The answer, therefore, is a question: do you and your app value time or space?
Also, instead of using cron to update the cache field (your GROUP_CONCAT), you could use a trigger on the category_items table.
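For instance, a sketch of the INSERT case (it assumes a cached items.category_list column exists; a matching AFTER DELETE trigger would also be needed):

DELIMITER //
CREATE TRIGGER category_items_after_insert
AFTER INSERT ON category_items
FOR EACH ROW
BEGIN
  UPDATE items
  SET category_list = (SELECT GROUP_CONCAT(category_id SEPARATOR ',')
                         FROM category_items
                        WHERE item_id = NEW.item_id)
  WHERE id = NEW.item_id;
END//
DELIMITER ;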
I want to design a database which is described as follows:
Each product has only one status at any point in time; however, the status of a product can change during its lifetime. How should I design the relationship between product and status so that I can easily query all products with a specific status at the current time? In addition, could anyone please give me some in-depth details about designing databases that deal with time durations, as in the problem above? Thanks for any help.
Here is a model to achieve your stated requirement.
Link to Time Series Data Model
Link to IDEF1X Notation for those who are unfamiliar with the Relational Modelling Standard.
Normalised to 5NF; no duplicate columns; no Update Anomalies, no Nulls.
When the Status of a Product changes, simply insert a row into ProductStatus, with the current DateTime. No need to touch previous rows (which were true, and remain true). No dummy values which report tools (other than your app) have to interpret.
The DateTime is the actual DateTime that the Product was placed in that Status; the "From", if you will. The "To" is easily derived: it is the DateTime of the next (DateTime > "From") row for the Product; where it does not exist, the value is the current DateTime (use ISNULL).
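As a sketch in Sybase/SQL Server syntax (GETDATE() standing in for the current DateTime):

SELECT ProductId,
       [DateFrom] = DateTime,
       [DateTo]   = ISNULL((
           SELECT MIN(DateTime)
             FROM ProductStatus ps_inner
            WHERE ps_inner.ProductId = ps.ProductId
              AND ps_inner.DateTime  > ps.DateTime
           ), GETDATE())
FROM ProductStatus ps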
The first model is complete; (ProductId, DateTime) is enough to provide uniqueness, for the Primary Key. However, since you request speed for certain query conditions, we can enhance the model at the physical level, and provide:
An Index (we already have the PK Index, so we will enhance that first, before adding a second index) to support covered queries (those based on any arrangement of { ProductId | DateTime | Status } can be supplied by the Index, without having to go to the data rows). Which changes the Status::ProductStatus relation from Non-Identifying (broken line) to Identifying type (solid line).
The PK arrangement is chosen on the basis that most queries will be Time Series, based on Product⇢DateTime⇢Status.
The second index is supplied to enhance the speed of queries based on Status.
In the Alternate Arrangement, that is reversed; ie, we mostly want the current status of all Products.
In all renditions of ProductStatus, the DateTime column in the secondary Index (not the PK) is DESCending; the most recent is first up.
I have provided the discussion you requested. Of course, you need to experiment with a data set of reasonable size, and make your own decisions. If there is anything here that you do not understand, please ask, and I will expand.
Responses to Comments
Report all Products with Current State of 2
SELECT ProductId,
Description
FROM Product p,
ProductStatus ps
WHERE p.ProductId = ps.ProductId -- Join
AND StatusCode = 2 -- Request
AND DateTime = ( -- Current Status on the left ...
SELECT MAX(DateTime) -- Current Status row for outer Product
FROM ProductStatus ps_inner
WHERE p.ProductId = ps_inner.ProductId
)
ProductId is Indexed, leading col, both sides
DateTime is Indexed, 2nd col in Covered Query Option
StatusCode is Indexed, 3rd col in Covered Query Option
Since DateTime in the Index is DESCending, only one fetch is required to satisfy the inner query
The rows are required at the same time, for the one query; they are close together (due to the Clustered Index); almost always on the same page due to the short row size
This is ordinary SQL, a subquery, using the power of the SQL engine: Relational set processing. It is the one correct method; there is nothing faster, and any other method would be slower. Any report tool will produce this code with a few clicks, no typing.
Two Dates in ProductStatus
Columns such as DateTimeFrom and DateTimeTo are gross errors. Let's take it in order of importance.
It is a gross Normalisation error. "DateTimeTo" is easily derived from the single DateTime of the next row; it is therefore redundant, a duplicate column.
The precision does not come into it: that is easily resolved by virtue of the DataType (DATE, DATETIME, SMALLDATETIME). Whether you display one less second, microsecond, or nanosecond is a business decision; it has nothing to do with the data that is stored.
Implementing a DateTo column is a 100% duplicate (of DateTime of the next row). This takes twice the disk space. For a large table, that would be significant unnecessary waste.
Given that it is a short row, you will need twice as many logical and physical I/Os to read the table, on every access.
And twice as much cache space (or put another way, only half as many rows would fit into any given cache space).
By introducing a duplicate column, you have introduced the possibility of error (the value can now be derived two ways: from the duplicate DateTimeTo column or the DateTimeFrom of the next row).
This is also an Update Anomaly. Whenever any DateTimeFrom is Updated, the DateTimeTo of the previous row has to be fetched (no big deal, as it is close) and Updated (a big deal, as it is an additional verb that could have been avoided).
"Shorter" and "coding shortcuts" are irrelevant; SQL is a cumbersome data manipulation language, but it is all we have (Just Deal With It). Anyone who cannot code a subquery really should not be coding. Anyone who duplicates a column to ease minor coding "difficulty" really should not be modelling databases.
Note well, that if the highest order rule (Normalisation) was maintained, the entire set of lower order problems are eliminated.
Think in Terms of Sets
Anyone having "difficulty" or experiencing "pain" when writing simple SQL is crippled in performing their job function. Typically the developer is not thinking in terms of sets, even though the Relational Database is a set-oriented model.
For the query above, we need the Current DateTime; since ProductStatus is a set of Product States in chronological order, we simply need the latest, or MAX(DateTime) of the set belonging to the Product.
Now let's look at something allegedly "difficult" in terms of sets. For a report of the duration that each Product has been in a particular State: the DateTimeFrom is an available column and defines the horizontal cut-off for a subset (we can exclude earlier rows); the DateTimeTo is the earliest DateTime of that subset of Product States.
SELECT ProductId,
Description,
[DateFrom] = DateTime,
[DateTo] = (
SELECT MIN(DateTime) -- earliest in subset
FROM ProductStatus ps_inner
WHERE p.ProductId = ps_inner.ProductId -- our Product
AND ps_inner.DateTime > ps.DateTime -- defines subset, cutoff
)
FROM Product p,
ProductStatus ps
WHERE p.ProductId = ps.ProductId
AND StatusCode = 2 -- Request
Thinking in terms of getting the next row is row-oriented, not set-oriented processing. Crippling, when working with a set-oriented database. Let the Optimiser do all that thinking for you. Check your SHOWPLAN, this optimises beautifully.
Inability to think in sets, thus being limited to writing only single-level queries, is not a reasonable justification for: implementing massive duplication and Update Anomalies in the database; wasting online resources and disk space; guaranteeing half the performance. Much cheaper to learn how to write simple SQL subqueries to obtain easily derived data.
"In addition, could anyone please give me some in-depth details about design database which related to time duration as problem above?"
Well, there exists a 400-page book entitled "Temporal Data and the Relational Model" that addresses your problem.
That book also addresses numerous problems that the other responders have not addressed in their responses, for lack of time or for lack of space or for lack of knowledge.
The introduction of the book also explicitly states that "this book is not about technology that is (commercially) available to any user today."
All I can observe is that users wanting temporal features from SQL systems are, to put it plain and simple, left wanting.
PS
Even if those 400 pages could be "compressed a bit", I hope you don't expect me to give a summary of the entire meaningful content within a few paragraphs here on SO ...
Use tables similar to these:
product
-----------
product_id
status_id
name
status
-----------
status_id
name
product_history
---------------
product_id
status_id
status_time
Then write a trigger on product that records the status and timestamp (sysdate) on each update where the status changes.
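A sketch of such a trigger (Oracle syntax, since sysdate was mentioned; the trigger name is illustrative):

CREATE OR REPLACE TRIGGER trg_product_status_history
AFTER UPDATE OF status_id ON product
FOR EACH ROW
WHEN (OLD.status_id <> NEW.status_id)
BEGIN
  INSERT INTO product_history (product_id, status_id, status_time)
  VALUES (:NEW.product_id, :NEW.status_id, SYSDATE);
END;
/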
Google "bi-temporal databases" and "slowly changing dimensions".
These are two names for essentially the same pattern.
You need to add two timestamp columns to your product table: "VALID_FROM" and "VALID_TO".
When your product's status changes, you add a NEW row with VALID_FROM set to now() (or some other known effective date/time) and VALID_TO set to 9999-12-31 23:59:59 or some other date ridiculously far in the future.
You also need to zap the "9999-12-31..." date on the previously current row, setting it to the new row's VALID_FROM time minus 1 microsecond.
You can then easily query the product status at any given time.
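A sketch of that point-in-time query (the timestamp literal is just an example):

SELECT *
FROM product
WHERE VALID_FROM <= '2013-06-01 12:00:00'
  AND VALID_TO   >= '2013-06-01 12:00:00';

-- and for the current status of every product:
SELECT *
FROM product
WHERE now() BETWEEN VALID_FROM AND VALID_TO;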