I need to build an application where users can search/filter products by multiple characteristics. There are 25 product groups, and each product group has around 10 characteristics. I have a few database design solutions in mind, but none of them seems appropriate enough:
Create 25 tables, one per group, with a column for each of that group's characteristics.
Create one table with all products and as many columns as there are characteristics in total (~200).
EAV: create one table for all characteristics and one table linking products to their attribute values, stored as rows rather than columns (see the sketch below). This solution means writing a lot of application code, because I won't be able to select a product with all its characteristics in one row; I'll have to group the MySQL results in application code.
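For concreteness, a minimal sketch of the EAV option (table and column names here are purely illustrative):

CREATE TABLE attributes (
  id   INT PRIMARY KEY,
  name VARCHAR(64)            -- e.g. 'length', 'diameter'
);
CREATE TABLE product_attributes (
  product_id   INT NOT NULL,
  attribute_id INT NOT NULL,
  value        VARCHAR(255),  -- every value stored as a string
  PRIMARY KEY (product_id, attribute_id)
);
-- One product's characteristics come back as N rows, not one,
-- which is exactly the application-side grouping problem described above:
SELECT a.name, pa.value
FROM product_attributes pa
JOIN attributes a ON a.id = pa.attribute_id
WHERE pa.product_id = 42;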
I believe there are already solutions for problems like mine. Thanks for any help.
EDIT:
In most cases the characteristics of the groups are entirely different. The products are starter/alternator components. Only around 25% of the characteristics overlap, such as physical ones: length, diameter, etc.
I would suggest the following:
Create 3 tables: Groups, GroupCharacteristics, Products.
Groups is linked to both of the other tables.
GroupCharacteristics holds the list of characteristics, using 3 columns: (1) GroupName, (2) CharacteristicName, (3) Mapping. (Values for Mapping could be C01, C02 ... C10.)
You will use the mapping later on.
One group has many characteristics, so it's a one-to-many link.
Products will have 12 columns: (1) ProductName/Id, (2) GroupName, (3) C01, (4) C02 ... (12) C10.
The C** columns will be filled with the values of the related characteristics in order to keep them mapped correctly.
Groups:
| GroupName |
| Vehicles  |
| Furniture |
Characteristics:
| Map | Group     | Characteristic |
| C01 | Vehicles  | Length         |
| C02 | Vehicles  | Volume         |
| C03 | Vehicles  | Type           |
| C01 | Furniture | Height         |
| C02 | Furniture | Volume         |
| C03 | Furniture | Length         |
Products:
| ProdName | Group     | C01 | C02  | C03       | ...
| Car      | Vehicles  | 2   | 50   | Hatchback |
| Jet      | Vehicles  | 10  | 70   | NULL      |
| Table    | Furniture | 1   | NULL | 1.6       |
| Cup      | Furniture | 0.1 | 0.12 | NULL      |
String col = Select Map from Characteristics where Group = 'Vehicles' and Characteristic = 'Type'
-- this returns the column name (in this case C03), then --
String sql = "Select ProdName from Products where Group = 'Vehicles' and " + col + " = 'Hatchback'"
-- this builds the query in a string; then you just execute it --
execute(sql)
-- in whatever language you're using, this is just the basic idea behind the code you have to write --
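The same two-step lookup can also be done entirely in MySQL with a prepared statement (a minimal sketch against the example tables above; note that Group is a reserved word and needs backticks):

-- Find which C** column holds 'Type' for the Vehicles group.
SET @col = (SELECT Map FROM Characteristics
            WHERE `Group` = 'Vehicles' AND Characteristic = 'Type');
-- Build the filter against that column dynamically, then run it.
SET @sql = CONCAT('SELECT ProdName FROM Products ',
                  'WHERE `Group` = ''Vehicles'' AND `', @col, '` = ''Hatchback''');
PREPARE stmt FROM @sql;
EXECUTE stmt;
DEALLOCATE PREPARE stmt;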
Suppose I had these 4 tables, consisting of various foreign key relationships (e.g. an area must belong to a location, a shop must belong to an area, an item price must belong to a shop, etc.):
----------------------------------
|Location Name | Location ID |
| | |
----------------------------------
-------------------------------------------------
|Area Name | Area ID | Location ID |
| | | |
-------------------------------------------------
-------------------------------------------------
| Shop Name | Shop ID | Area ID |
| | | |
-------------------------------------------------
----------------------------------
| Item Price | Shop ID |
| | |
----------------------------------
And I wanted the sum of 'Item Price' for a specific location id, i.e. the item price total across all the areas and shops for location id 'x'.
One way I found to do this is to join all the tables for one location and get the amount, e.g.:
SELECT SUM(items.item_price)
FROM items
LEFT JOIN shops ON (items.shop_id = shops.shop_id)
LEFT JOIN areas ON (shops.area_id = areas.area_id)
LEFT JOIN locations ON (areas.location_id = locations.location_id)
WHERE locations.location_id = 4;
However, is this the best way to do it, given that it involves retrieving the full tree of the data and filtering it? Would there be a better way if there were a million rows, or is this the best way?
You can try a subquery:
SELECT SUM(items.item_price)
FROM items
LEFT JOIN shops ON (items.shop_id = shops.shop_id)
JOIN (SELECT area_id FROM areas WHERE location_id = 4) AS ar
  ON (shops.area_id = ar.area_id)
If you define the right indexes, then the query does not read all the millions of rows for each table.
Think about a telephone book and how you look up a name. Do you read the whole book cover to cover looking for the name? No, you take advantage of the fact that the book is sorted by lastname, firstname and you go directly to the name. It takes only a few tries to find the right page. In fact, on average it takes about log₂ N tries for a book with N names in it.
The same kind of search happens for each join. If you have indexes, each comparison expression uses a similar lookup to find matching rows in the joined table. It's pretty fast.
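For the schema above, that means indexing the foreign key columns used in each join (a sketch, assuming the snake_case column names from the query):

CREATE INDEX idx_items_shop_id     ON items (shop_id);
CREATE INDEX idx_shops_area_id     ON shops (area_id);
CREATE INDEX idx_areas_location_id ON areas (location_id);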
But if that's not fast enough, you can also use denormalization, which in this case would mean storing all the data in one wide table with many columns.
----------------------------------------------------------------------
|Location Name | Area Name | Shop Name | Item Name | Item Price |
| | | | | |
----------------------------------------------------------------------
The advantage of denormalization is that it avoids certain joins. It stores the row just like one of the rows you'd get from the result set of your example joined SQL query. You just read one row from the table and you have all the information you need.
The disadvantage of denormalization is the redundant storage of data. Presumably each shop has many items. But each item is stored on a row of its own, which means that row has to repeat the names of the shop, area, and location.
By storing those data repeatedly, you create an opportunity for "anomalies" like if you change the name of a given shop, but you mistakenly change it only on a few rows instead of everywhere the shop name appears. Now you have two names for the same shop, and someone else looking at the database has no way of knowing which one is correct.
In general, maintaining multiple normalized tables is preferable, because each "fact" is stored exactly once, so there can be no anomalies.
Creating indexes to help your queries is sufficient for most applications.
You might like my presentation, How to Design Indexes, Really, and the video: https://www.youtube.com/watch?v=ELR7-RdU9XU
I'll try to explain my situation: I'm trying to create a search engine for the products on my website, so when a user searches for a product I need to show similar ones. Here's an example.
User searches:
assassins creed OR assassinscreed OR aSsAssIn's CreeD, assuming there is no misspelling of letters/numbers (those 3 queries should produce the same result)
Expected results:
Assassin's Creed AND Assassin's Creed: Unity AND Assassin's Creed: Special Edition
What have I tried so far
I have created a MySQL field for the search engine which contains a parsed name of the product (Assassin's Creed: Unity -> assassinscreedunity)
I parse the search query
I search using MySQL's INSTR()
My problem
I'm fine with using this, but I've heard it can be slow when the number of rows increases. I've created a full-text index on my table, but I don't think it will help, so I need another solution.
Thanks for any answer, and ask me anything before downvoting.
First of all, you should keep track of performance issues in your queries more precisely than 'heard it can be slow' and 'think it would help'. One starting point may be the Slow Query Log.
If you have a table which contains the same parsed name in more than one row, consider normalizing your database. In the specific case, store unique parsed names in one table, and only the id of the corresponding parsed name in the table you described in your question. This way, you only need to check the smaller table with unique names and can then quickly find all matching entries in the main table by id.
Example:
Consider the following table with your structure
id | product_name | rating
-----------------------------------
1 | assassinscreedunity | 5
2 | assassinscreedunity | 2
3 | monkeyisland | 3
4 | monkeyisland | 5
5 | assassinscreedunity | 4
6 | monkeyisland | 4
you would have to scan all six entries to find relevant rows.
In contrast, consider two tables like this:
id | p_id | rating
--------------------
1 | 1 | 5
2 | 1 | 2
3 | 2 | 3
4 | 2 | 5
5 | 1 | 4
6 | 2 | 4
id | name
--------------------------
1 | assassinscreedunity
2 | monkeyisland
In this case, you only have to scan two entries (compared to six) and can then efficiently look up relevant rows using the integer id.
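As a sketch (assuming the two tables above are called ratings and product_names), the lookup then becomes:

-- Scan the small table of unique names once, then join back by integer id.
SELECT r.id, r.rating
FROM product_names AS n
JOIN ratings AS r ON r.p_id = n.id
WHERE n.name = 'assassinscreedunity';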
To further enhance the performance, you could extend the concept of a parsed name and use hashes. For example, you could calculate the SHA1 hash of your parsed name, which is a 160-bit value. You can find entries in your database for this value very efficiently. To match substrings, you can add them to the second table as well. Since the hash only needs to be computed once, you can still use the database to match by an integer. Another technique worth looking at is fuzzy hashing.
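A minimal sketch of the hashing idea, again assuming the product_names table from above:

-- Store the 20-byte SHA1 of each parsed name and index it, so lookups
-- compare short fixed-width values instead of long strings.
ALTER TABLE product_names
  ADD COLUMN name_hash BINARY(20),
  ADD INDEX idx_name_hash (name_hash);
UPDATE product_names SET name_hash = UNHEX(SHA1(name));

SELECT id FROM product_names
WHERE name_hash = UNHEX(SHA1('assassinscreedunity'));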
In addition, you should read up on the Rabin–Karp algorithm or string searching in general.
We are building a web database system and we need to allow some products to be made of other products, for example combining 2 or more products into a new product. We are using CakePHP and MySQL.
Here is the data structure diagram of our database:
https://www.dropbox.com/s/ksv22rt45uv69k9/Data%20Structure%20Diagram.png
Would we need a self-referencing products table, or should we create a new table?
You can do either. There are pros and cons to both. Either way you will need a cross reference table.
The cross reference table can refer itself.
products in item
+---------------------+----------------------------+------------+
| product_identifier | product_identifier_child | quantity |
+---------------------+----------------------------+------------+
| 1 | 1 | 1 |
| 2 | 1 | 1 |
| 2 | 2 | 2 |
| 3 | 2 | 1 |
+---------------------+----------------------------+------------+
On the bright side, this method means you only have one table of data and only one new cross reference table, and you can add new products as you see fit (along with multiple of the same products, say, with a gift basket). On the downside, your table will be trying to do two different things at the same time. Products that have other products in them may not have a model number. Also, how will you determine whether to check the inventory table? Are you going to have inventory for products that are made out of products, or would you sooner check stock on individual products in order to see if your combo products are in stock? The latter is much more flexible, and you can still reserve inventory this way. It just allows all of your inventory to be in the same unit scale in your inventory table.
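As a sketch, the self-referencing cross reference could be declared like this (column names taken from the example above; the products table itself is assumed):

CREATE TABLE products_in_item (
  product_identifier       INT NOT NULL,  -- the combined product
  product_identifier_child INT NOT NULL,  -- a product it contains
  quantity                 INT NOT NULL DEFAULT 1,
  PRIMARY KEY (product_identifier, product_identifier_child),
  FOREIGN KEY (product_identifier)       REFERENCES products (product_identifier),
  FOREIGN KEY (product_identifier_child) REFERENCES products (product_identifier)
);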
To add more flexibility, you can create another table, base products, which has values only the building block products are going to have.
base products
+--------------------------+----------+--------------+
| base product identifier | brand | model number |
+--------------------------+----------+--------------+
You could then attach your inventories to your base products table, and your cross reference table would be products to base products.
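A sketch of that variant (again with assumed names):

CREATE TABLE product_base_products (
  product_identifier      INT NOT NULL,  -- the sellable product
  base_product_identifier INT NOT NULL,  -- a building block it is made of
  quantity                INT NOT NULL DEFAULT 1,
  PRIMARY KEY (product_identifier, base_product_identifier),
  FOREIGN KEY (base_product_identifier) REFERENCES base_products (base_product_identifier)
);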
The negative here is that now instead of two tables, you have three. However, I am a fan of more tables with fewer columns thanks to increased flexibility. Even if the table tasks are not completely different, letting each table specialize completely can make things a lot easier.
There are numerous ways to go, but the optimal design is the one that requires no data duplication and no NULL values. Without stressing yourself out about getting all the way there, try to keep your NULL values out of indexed columns and make sure your name values only show up in one place.
I have 6 tables. These are simplified for this example.
user_items
ID | user_id | item_name | version
-------------------------------------
1 | 123 | test | 1
data
ID | name | version | info
----------------------------
1 | test | 1 | info
data_emails
ID | name | version | email_id
--------------------------------
1 | test | 1 | 1
2 | test | 1 | 2
emails
ID | email
-------------------
1 | email#address.com
2 | second#email.com
data_ips
ID | name | version | ip_id
----------------------------
1 | test | 1 | 1
2 | test | 1 | 2
ips
ID | ip
--------
1 | 1.2.3.4
2 | 2.3.4.5
What I am looking to achieve is the following.
The user (123) has the item with name 'test'. This is the basic information we need for a given entry.
There is data in our 'data' table and the current version is 1, so the version in our user_items table is also 1. The two tables are linked together by the name and version. The setup is like this because a user could have an item for which we don't have data; likewise, there could be an item for which we have data but that no user owns.
For each item there are also 0 or more emails and ips associated. These can be the same for many items so rather than duplicate the actual email varchar over and over we have the data_emails and data_ips tables which link to the emails and ips table respectively based on the email_id/ip_id and the respective ID columns.
The emails and ips are associated with the data version again through the item name and version number.
My first question is: is this a good, well-optimized database setup?
My next question, and my main one, is about joining this complex data structure.
What i had was:
PHP
- get all the user items
- loop through them and get the most recent data entry (if any)
- if there is one get the respective emails
- get the respective ips
Does that count as 3 queries, or essentially an unbounded number depending on how many user items there are?
I was made to believe that the above was inefficient and as such I wanted to condense my setup into using one query to get the same data.
I have achieved that with the following code
SELECT user_items.item_name,
       GROUP_CONCAT(emails.email SEPARATOR ',') AS emails,
       x.ip
FROM user_items
JOIN data ON (data.name = user_items.item_name AND data.version = user_items.version)
LEFT JOIN data_emails ON (data_emails.name = user_items.item_name AND data_emails.version = user_items.version)
LEFT JOIN emails ON (data_emails.email_id = emails.ID)
LEFT JOIN
  (SELECT data_ips.name, data_ips.version,
          GROUP_CONCAT(the_ips.ip SEPARATOR ',') AS ip
   FROM data_ips
   LEFT JOIN ips AS the_ips ON data_ips.ip_id = the_ips.ID) x
  ON (x.name = data.name AND x.version = user_items.version)
I have done loads of reading to get to this point and worked tirelessly to get here.
This works as I require; this question seeks to clarify what the benefits of using this approach are.
I have had to use a subquery (I believe?) to get the IPs, as previously the results were being multiplied (I believe due to the complex joins). How this subquery works is, I suppose, my main confusion.
Summary of questions.
-Is my database well set up for my usage? Any improvements would be appreciated, and any useful resources to help me expand my knowledge would be great.
-How does the subquery in my SQL actually work? What is the query doing?
-Am I correct to keep using left joins? I want to return the user item, with null values to the right if applicable.
-Am I essentially replacing a potentially infinite number of queries with 2? Does this make a REAL difference? Can the above be improved?
-Given that when I update the version of an item in my data table I now have to update the version in the user_items table as well, I have a few more update queries to do. Is the tradeoff of this setup worthwhile in practice?
Thanks to anyone who contributes to helping me get a better grasp of this !!
Given your data layout, and your objective, the query is correct. If you've only got a small amount of data it shouldn't be a performance problem; that will change quickly as the amount of data grows. However, when you have a large amount of data there are very few circumstances where you should ever see all your data in one go, implying that the results will be filtered in some way. Exactly how they are filtered has a huge impact on the structure of the query.
How does the subquery in my SQL actually work
Currently it doesn't work properly - there is no GROUP BY
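A sketch of the derived table with the missing GROUP BY added, so GROUP_CONCAT aggregates per (name, version) rather than collapsing everything into a single row:

SELECT data_ips.name, data_ips.version,
       GROUP_CONCAT(the_ips.ip SEPARATOR ',') AS ip
FROM data_ips
LEFT JOIN ips AS the_ips ON data_ips.ip_id = the_ips.ID
GROUP BY data_ips.name, data_ips.version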
Is the tradeoff of this setup in practice worthwhile?
No - it implies that your schema is too normalized.
So, I have to draw upon all the powers of the greatest MySQL minds that SO has to offer. I have to summarize detail records based on the IP address in each record. Here's the scenario:
In short, we have consortiums that want to know: "Which schools within my consortium watched which videos how many times"? In SQL terms, it amounts to COUNTing the detail records, grouped by which IP range it might fall into.
We have several university Consortiums - each with a handful of different schools that are members.
Each school within a consortium uses various IP ranges to access the videos that we serve to these schools.
The IP Ranges are specified with wild cards, so each school specifies something like '100.200.35.x, 100.201.x.x, 100.202.39.50, etc.', with the average number of ranges per school being 10 or 15.
The raw text log files to summarize are already in a database (one row for each log entry), and has the actual IP address that accessed the video file.
There are hundreds of millions of detail records, so I fully expect this to be a long, slow process that runs for a considerable period.
PHP scripts exist that can "explode" the wildcards into the individual IPs that are represented, but I fear this will be the final answer and could take weeks to run.
(For simplicity sake, I'm only going to refer to the video filename that was accessed and COUNT the log entries for it, but in fact all the details such as start/stop/duration,etc. are there and will ultimately be part of this solution.)
With Consortium records something like this: (All table designs except log details open to suggestion):
| id|consortium |
| 10|Ivy League |
| 20|California |
And School/IP records something like this:
| id|school |consortium_id|
| 101|Harvard |10 |
| 102|Yale |10 |
| 103|UCLA |20 |
| 104|Berkeley |20 |
| id|school_id|ip_range |
| 1| 101 |100.200.x.x |
| 2| 101 |100.201.65.x |
| 3| 101 |100.202.39.50 |
| 4| 101 |100.202.39.51 |
And detail records something like this:
|session |ip_address |filename |
|560554790925|100.202.39.50  |history101.mp4 |
|406417611526|43.22.90.5 |newsreel.mp4 |
|650423700223|100.202.39.50 |history101.mp4 |
|650423700223|100.202.50.12 |science101.mp4 |
|513057324209|100.202.39.56 |history101.mp4 |
I like to think I'm pretty handy with MySQL, but this one is stretching it, and I'm hoping that there's a spectacular function or set of steps that someone might offer.
With your existing data structure, you could do string matching as follows (but it's not very efficient):
SELECT schools.school, detail.filename, COUNT(*)
FROM schools
JOIN ipranges ON schools.id = ipranges.school_id
JOIN detail ON detail.ip_address LIKE REPLACE(ipranges.ip_range, 'x', '%')
WHERE schools.consortium_id = ?
GROUP BY schools.school, detail.filename
A better way would be to store your IP ranges as network address and prefix length:
ALTER TABLE ipranges
ADD COLUMN network INT UNSIGNED,
ADD COLUMN prefix TINYINT;
UPDATE ipranges SET
network = INET_ATON(REPLACE(ip_range, 'x', '0')),
prefix = 32 - 8*(CHAR_LENGTH(ip_range) - CHAR_LENGTH(REPLACE(ip_range, 'x', '')));
ALTER TABLE ipranges
DROP COLUMN ip_range;
ALTER TABLE detail
ADD COLUMN ip_address_new INT UNSIGNED;
UPDATE detail SET
ip_address_new = INET_ATON(ip_address);
ALTER TABLE detail
DROP COLUMN ip_address,
CHANGE ip_address_new ip_address INT UNSIGNED;
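A quick sanity check after the conversion (a sketch; you'd expect e.g. 100.200.0.0 with prefix 16 for the old '100.200.x.x'):

-- Each range should display as its network address plus prefix length.
SELECT INET_NTOA(network) AS network, prefix
FROM ipranges
LIMIT 10;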
Then it would merely be a case of performing some bit comparisons:
SELECT schools.school, detail.filename, COUNT(*)
FROM schools
JOIN ipranges ON schools.id = ipranges.school_id
JOIN detail ON detail.ip_address & ~((1 << (32 - ipranges.prefix)) - 1)
             = ipranges.network
WHERE schools.consortium_id = ?
GROUP BY schools.school, detail.filename
SELECT D.filename, S.school, COUNT(*)
FROM detail_records AS D
INNER JOIN ip_map AS I ON D.ip_address LIKE CONCAT(SUBSTRING(I.ip_range, 1, LOCATE('x', I.ip_range)-1), '%')
INNER JOIN school AS S ON S.id = I.school_id
INNER JOIN consortium AS C ON C.id = S.consortium_id
WHERE S.consortium_id = <consortium identifier>
GROUP BY D.filename, S.school