MySQL database: Separating/clustering(?) data

I'm currently dealing with a fairly large MySQL transactional database for an e-commerce project. We obtain data from e-shops, including the products sold. Each e-shop adds information about similarities between products and lists them as groups. For instance, shop A sends this information:
Group 1: iPhone blue, iPhone black, iPhone green
Group 2: iPad blue, iPad black, iPad green, etc.
Another e-shop sends this kind of information:
Group 3: iPhone pink, iPhone black
Group 4: iPad blue, iPad pink
Each product is stored in table Products: (Important: This table has about 150 000 000 rows)
Id | Name
------------------
1 | iPhone blue
2 | iPhone black
3 | iPhone green
4 | iPhone pink
5 | iPad blue
6 | iPad black
7 | iPad green
8 | iPad pink
Also, there is a table Groups with groups stated above: (M:N relationship)
Id | Id_product | Group
--------------------------
1 | 1 | 1
2 | 2 | 1
3 | 3 | 1
4 | 5 | 2
5 | 6 | 2
6 | 7 | 2
7 | 4 | 3
8 | 1 | 3
9 | 5 | 4
10 | 8 | 4
Now, the problem is that groups 1 + 3 and groups 2 + 4 should be merged together.
The current (horrible) solution to this problem is based on obtaining all groups for a product (with the GROUP_CONCAT function in a query) and then all products from these groups, and then updating the Groups table to merge these groups into one.
The main problems with this approach are:
Very problematic computational complexity.
Groups obtained from e-shops can be wrong(!). Imagine this group:
Group 5: iPhone Black, iPad Black. If this group is taken into account, the whole separation process goes wrong: you end up with one group containing iPhones and iPads together.
So, now, finally, the question:
Any ideas how to approach this problem? Just hints/tips will be enough; I'm totally stuck because of my lack of knowledge.
I was playing around with fuzzy-hashing algorithms / k-means clustering, but it seems to me that they are not suitable for this problem. Fuzzy hashing takes the names of the products into account (that can work for iPhones, but I cannot imagine it for T-shirts: their names are not very "well-prepared", so it's hard to guess differences just from the name). Am I missing something?
So, any idea?
Anyway, for the purpose of solving this particular problem, it's possible to introduce a different database solution; that's no problem.
Thanks in advance:)
Chmelda

An idea might be to add a table "group_conversion" which translates each external group number into your own group number.
In this case the table would look like:
Group_external | NameMatch | ID_my_group
----------------------------------------
1 | null | 1
2 | null | 2
3 | null | 1
4 | null | 2
5 | "IPhone%" | 1
5 | "IPad%" | 2
When inserting new data coming from an e-shop, you should first translate the incoming group number to your own group numbering, before adding it to the Groups table.
The NameMatch field is only used if you want to separate products within an incoming group (the Group 5 you mentioned).
So if this field is null, just convert the ID. Otherwise only convert the ID if the name of the product matches NameMatch.
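The lookup rule above can be sketched in a few lines of Python. This is only an illustration under assumptions: the GROUP_CONVERSION list is a hypothetical in-memory copy of the table, translate_group and like_prefix_match are made-up names, and the matcher only handles patterns of the form 'prefix%' in a case-insensitive way (in MySQL, whether LIKE ignores case depends on the column collation).

```python
from typing import Optional

# Hypothetical in-memory copy of the group_conversion table:
# (external group, name pattern or None, internal group).
GROUP_CONVERSION = [
    (1, None, 1),
    (2, None, 2),
    (3, None, 1),
    (4, None, 2),
    (5, "IPhone%", 1),
    (5, "IPad%", 2),
]


def like_prefix_match(name: str, pattern: str) -> bool:
    """Mimic a case-insensitive LIKE for patterns of the form 'prefix%'."""
    prefix = pattern.rstrip("%")
    return name.lower().startswith(prefix.lower())


def translate_group(external_group: int, product_name: str) -> Optional[int]:
    """Return the internal group for a product, or None if no rule applies."""
    for ext, name_match, my_group in GROUP_CONVERSION:
        if ext != external_group:
            continue
        # A NULL NameMatch converts every product of the external group.
        if name_match is None or like_prefix_match(product_name, name_match):
            return my_group
    return None
```

A product from the problematic Group 5 is routed by its name, while a product that matches no rule yields None and can be held back for manual review.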
To convert your current data it might help to create a new table (e.g. Groups2) which has the same fields as Groups, with the only difference that Group is a reference to the new group numbering.
You can then fill the new table by converting each record of Groups.
After conversion is done, drop the Groups table and rename the Groups2 table.
In this way you will get a much smaller table size for Groups and the table already contains merged data, so no separate queries are needed for merging.
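For the initial conversion, the mapping from old group numbers to merged ones could be derived with a union-find (disjoint-set) pass over the existing Groups rows, treating two groups as the same whenever they share a product. This is a swapped-in technique, not something from the answer itself, and merge_groups is a hypothetical helper; a sketch only:

```python
def merge_groups(rows):
    """rows: iterable of (product_id, group_id). Returns {group_id: merged_id}."""
    parent = {}

    def find(g):
        parent.setdefault(g, g)
        while parent[g] != g:
            parent[g] = parent[parent[g]]  # path halving keeps the trees shallow
            g = parent[g]
        return g

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            # Keep the smaller id as representative so results are stable.
            parent[max(ra, rb)] = min(ra, rb)

    first_group_of = {}  # product_id -> first group seen for that product
    for product_id, group_id in rows:
        find(group_id)  # register the group even if it never merges
        if product_id in first_group_of:
            union(first_group_of[product_id], group_id)
        else:
            first_group_of[product_id] = group_id

    return {g: find(g) for g in parent}


# (Id_product, Group) pairs from the Groups table in the question.
GROUP_ROWS = [(1, 1), (2, 1), (3, 1), (5, 2), (6, 2), (7, 2),
              (4, 3), (1, 3), (5, 4), (8, 4)]
mapping = merge_groups(GROUP_ROWS)
```

On the sample data this maps groups 1 and 3 to one id and groups 2 and 4 to another. A wrong group such as Group 5 (iPhone Black, iPad Black) would still collapse everything into one group, so such groups have to be filtered or split by NameMatch before this pass.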
Hope this will help!

Related

Pagination and concurrent update

Let's suppose there is a table products with this data:
id | product | amount
---------------------
1 | keyboard| 1
2 | monitor | 2
3 | computer| 3
4 | mouse | 4
userA loads data from this table two products at a time. So he runs
SELECT * FROM products ORDER BY amount LIMIT 2 OFFSET 0 and gets the products with id 1 and 2. However, while userA is reading the first two rows, userB changes the data in the table, so it becomes
id | product | amount
---------------------
1 | keyboard| 3
2 | monitor | 1
3 | computer| 3
4 | mouse | 1
Now userA needs the rest of the data, so he runs SELECT * FROM products ORDER BY amount LIMIT 2 OFFSET 2 and gets the products with id 1 and 3 (both now have amount 3). Product 1 is shown twice and product 4 is never shown, which is not what he expected.
So there is a problem with pagination if there is more than one user and some users make updates while others are reading pages. How are such problems solved? Of course, in a real example there are thousands of rows plus joins.
In the above example, the challenge is that the value of the ORDER BY column changes between requests.
One way to solve this:
Before binding the next page's result to the list, check whether any item in the result was already fetched by a previous query. If so, update that list item with its new value and skip adding it again. This has to be done on the client side.
It is also a good idea to audit this kind of data by adding a field like "LastModificationTime" that records when the row was last modified, and to display it along with the other columns.
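The client-side merge could look roughly like the following. It is a sketch that uses an in-memory SQLite database in place of MySQL; fetch_page and merge_page are hypothetical helpers, and the extra id tie-breaker in the ORDER BY is an assumption added only to make the order deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE products (id INTEGER PRIMARY KEY, product TEXT, amount INTEGER)"
)
conn.executemany(
    "INSERT INTO products VALUES (?, ?, ?)",
    [(1, "keyboard", 1), (2, "monitor", 2), (3, "computer", 3), (4, "mouse", 4)],
)


def fetch_page(offset, limit=2):
    # The id tie-breaker is an addition here; it makes the order deterministic.
    return conn.execute(
        "SELECT id, product, amount FROM products "
        "ORDER BY amount, id LIMIT ? OFFSET ?",
        (limit, offset),
    ).fetchall()


def merge_page(shown, page):
    """Add rows to the client-side list, refreshing rows that were already shown."""
    for row_id, product, amount in page:
        shown[row_id] = (product, amount)  # overwrite if present, append otherwise
    return shown


shown = {}
merge_page(shown, fetch_page(0))  # userA reads ids 1 and 2

# userB changes the data between the two reads.
conn.executemany(
    "UPDATE products SET amount = ? WHERE id = ?",
    [(3, 1), (1, 2), (1, 4)],
)

merge_page(shown, fetch_page(2))  # id 1 comes back again and is refreshed in place
```

This removes the duplicate, but it does not recover rows that moved into a page that was already read: here the mouse (id 4) is still never shown, and the monitor keeps its stale amount until it is fetched again.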

storing dynamic attributes in mysql database

I’m creating a database design for a webshop. I want to store products with different attributes. Currently I have one table with 100+ columns, but I want to optimize this.
This is what I’ve come up with so far. I have some questions (see below) about my design so far.
Disclaimer: this is a database DESIGN. I do not have any PHP/SQL code yet because I don’t know if this is the correct way to do it. I will try to make this question as substantiated as possible.
Here we go…
I have 3 tables:
The first table is the table “products” which will store all the general information about each product (id, name, sku, images, …)
The second table is the table “attributes” which will store all the attributes (eg. color, width, height, has_bluetooth, …) but NOT the values
The third table stores the values for each attribute (table "attributes_values")
Table: products
Product_id | Name | SKU
------------------------------------------------------
1 | iPhone 7 | iphone7
2 | HTC One | htcone
3 | Galaxy S8 | galaxys8
As you can see, I have 3 products in my database
Table: attributes
Attribute_id | Name
---------------------------------------
1 | Color
2 | Weight
3 | Height
As you can see, I have 3 different attributes in my database – note that some products will not have each attribute
Table: attributes_values
Attribute_value_id | Attribute_id | Product_id | Value
-----------------------------------------------------------------------
1 | 1 | 1 | Black
2 | 2 | 1 | 0,125 kg
3 | 3 | 1 | 10 cm
4 | 1 | 2 | Gold
5 | 2 | 2 | 0,15 kg
As you can see, product 1 (the iphone) has 3 attributes, product 2 (the htc one) has 2 attributes and product 3 (the galaxy s8) has zero attributes.
My questions
First of all, is this a good approach? I want to create a “dashboard” in PHP where I can dynamically add new attributes when I add new types of products to my database. That’s why I separated the attributes name and value in 2 different tables.
Secondly, how do I fetch the information from the database? I want to select the product + all the attributes it has (and the values associated with each attribute). I think this is the way to do it. Please correct me if I’m wrong.
SELECT
p.name, -- the product name
p.sku, -- the product SKU
v.value, -- the attribute value
a.name -- the attribute name
FROM
products AS p
LEFT JOIN
attributes_values AS v
ON
p.product_id = v.product_id
LEFT JOIN
attributes AS a
ON
v.attribute_id = a.attribute_id
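To check what this query returns, here is a small sketch that loads the sample data into an in-memory SQLite database (standing in for MySQL) and folds the one-row-per-attribute result into one dictionary per product. Two assumptions: row 5 of attributes_values is taken to be a Weight value (attribute 2), which seems to be the intent, and an ORDER BY is added only to make the output stable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE products (product_id INTEGER PRIMARY KEY, name TEXT, sku TEXT);
CREATE TABLE attributes (attribute_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE attributes_values (
    attribute_value_id INTEGER PRIMARY KEY,
    attribute_id INTEGER REFERENCES attributes(attribute_id),
    product_id INTEGER REFERENCES products(product_id),
    value TEXT
);
INSERT INTO products VALUES
    (1, 'iPhone 7', 'iphone7'), (2, 'HTC One', 'htcone'), (3, 'Galaxy S8', 'galaxys8');
INSERT INTO attributes VALUES (1, 'Color'), (2, 'Weight'), (3, 'Height');
INSERT INTO attributes_values VALUES
    (1, 1, 1, 'Black'), (2, 2, 1, '0,125 kg'), (3, 3, 1, '10 cm'),
    (4, 1, 2, 'Gold'), (5, 2, 2, '0,15 kg');
""")

rows = conn.execute("""
    SELECT p.name, p.sku, a.name, v.value
    FROM products AS p
    LEFT JOIN attributes_values AS v ON p.product_id = v.product_id
    LEFT JOIN attributes AS a ON v.attribute_id = a.attribute_id
    ORDER BY p.product_id, a.attribute_id
""").fetchall()

# Fold the one-row-per-attribute result into one dict per product.
products = {}
for name, sku, attr_name, value in rows:
    attrs = products.setdefault(name, {})
    if attr_name is not None:  # a product without attributes yields one NULL row
        attrs[attr_name] = value
```

Because of the LEFT JOINs, the Galaxy S8 still appears in the result, just with an empty attribute dictionary.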
I hope my questions are as clear as possible. If not, feel free to ask. English is not my native language so excuse me for some grammar errors. Thank you all!
I have found the following links, maybe they can help.
https://dba.stackexchange.com/questions/24636/product-attribute-list-design-pattern
How to design a product table for many kinds of product where each product has many parameters
http://www.practicalecommerce.com/A-Better-Way-to-Store-Ecommerce-Product-Information
http://buysql.com/mysql/14-how-to-automate-pivot-tables.html

MySQL query to compare content of one column with title of other column

I know the title makes no sense at first glance. But here's the situation: the DB table is named 'teams'. In it, there are a bunch of columns for positions in a soccer team (gk1, def1, def2, ... , st2). Each column is type VARCHAR and contains a player's name. There is also a column named 'captain'. The content of that column (not the most fortunate solution) is not the name of the captain, but rather the position.
So if the content of 'st1' is Zlatan Ibrahimovic and he's the captain, then the content of 'captain' is the string 'st1', and not the string 'Zlatan Ibrahimovic'.
Now, I need to write a query which gets a row from the 'teams' table, but only if the captain is Zlatan Ibrahimovic. Note that at this point I don't know if he plays st1, st2 or some other position. So I need to use just the name in the query, to check if the position he plays in is set as captain. Logically, it would look like:
if(Zlatan is captain)
get row content
In MySQL, the if condition would actually be the 'where' clause. But is there a way to write it?
$query="select * from teams where ???";
The "Teams" table structure is:
-----------------------------------------------------------------
| gk1 | def1 | def2 | ... | st2 | captain |
-----------------------------------------------------------------
| player1 | player2 | player3 | ... | playerN | captainPosition |
-----------------------------------------------------------------
With all fields being of VARCHAR type.
Because the content of the captain column is the position and not the name, and you want to choose based on the position, this is trivial.
$query="select * from teams where captain='st1'";
Revised following question edit:
Your database design doesn't allow this to be done very efficiently. You are looking at a query like
SELECT * FROM teams WHERE
(gk1='Zlatan' AND captain='gk1') OR
(def1='Zlatan' AND captain='def1') OR
...
The design mandates this sort of query for many functions: how can you find the team which a particular player plays for without searching every position? [Actually you could do that by finding the name in a concatenation of all the positions, but it's still not very efficient or flexible]
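Since the OR chain has the same shape for every position column, it can be generated rather than typed out. A rough sketch, using SQLite as a stand-in for MySQL; the POSITIONS list is a shortened, hypothetical set of columns, the sample players other than the one named in the question are invented, and captain_query is a made-up helper:

```python
import sqlite3

# Hypothetical, shortened list of position columns; the real table has more.
POSITIONS = ["gk1", "def1", "def2", "st1", "st2"]


def captain_query(positions):
    """Build the OR chain: (pos = :name AND captain = 'pos') per position column."""
    # Column names cannot be bound as parameters, so they come from a fixed
    # whitelist; only the player name is passed as a bound value.
    clauses = [f"({pos} = :name AND captain = '{pos}')" for pos in positions]
    return "SELECT * FROM teams WHERE " + " OR ".join(clauses)


conn = sqlite3.connect(":memory:")
columns = ", ".join(pos + " TEXT" for pos in POSITIONS)
conn.execute(f"CREATE TABLE teams ({columns}, captain TEXT)")
conn.executemany(
    "INSERT INTO teams VALUES (?, ?, ?, ?, ?, ?)",
    [
        # He plays st1 and st1 is the captain position: should match.
        ("Keeper A", "Defender A", "Defender B", "Zlatan Ibrahimovic", "Striker A", "st1"),
        # He plays st2 but the captain is gk1: should not match.
        ("Keeper B", "Defender C", "Defender D", "Striker B", "Zlatan Ibrahimovic", "gk1"),
    ],
)
rows = conn.execute(captain_query(POSITIONS), {"name": "Zlatan Ibrahimovic"}).fetchall()
```

This keeps the existing design workable, but the query still has to test every position column for every row.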
A better solution would be to normalise your data so you had a single table showing which player was playing where:
Situation
Team | Player | Posn | Capt
-----+--------+------+------
1 | 12 | 1 | 0
1 | 11 | 2 | 1
1 | 13 | 10 | 0
...with other tables which allow you to identify the Team, Player and Position referenced here. There would need to be some referential checks to ensure that each team had only one captain, and only plays one goalkeeper, etc.
You could then easily see that the captain of Team 1 is Player 11 who plays in position 2; or find the team (if any) for which player 11 is captain.
SELECT Teams.Name FROM Teams
JOIN Situation ON Situation.Team = Teams.id
JOIN Players ON Players.id = Situation.Player
WHERE Situation.Capt = 1
AND Players.Name = 'Zlatan';
A refinement on that idea might be
Situation
Team | Player | Posn | Capt | Playing
-----+--------+------+------+--------
1 | 12 | 1 | 0 | 1
1 | 11 | 2 | 1 | 1
1 | 13 | 10 | 0 | 0
1 | 78 | 1 | 0 | 0
...so that you could have two players who are goalkeepers (for example) but only field one of them.
Redesigning the database may be a lot of work; but it's nowhere near as complicated or troublesome as using your existing design. And you will find that the performance is better if you don't need to use inefficient queries.
From what you have described, you just need to combine the two conditions and check whether the query returns one record. If it returns no records, he is not the captain:
SELECT *
FROM Teams
WHERE name = 'Zlatan Ibrahimovic' AND position = 'st1';

MySQL help, updating fields based on a calculation

I've been trying to work out this particular problem and though I can easily think of a sort of brute force PHP + MySQL solution, I want some guidance on solving this particular problem without iterating through fields with PHP.
So.. with that, here's the problem.
I want to precalculate each room's size relative to all other rooms with a single query (or 3), such that rooms that are bigger than 66% of all other rooms can have a category filled in as Large, while rooms within a 33%-66% range are given Medium and the rest are considered Small.
I have a general idea of how to complete this, but I'm hoping that someone more adept at SQL queries could at least point me in the right direction.
The hardest part for me comes with being able to simultaneously update every field that fits the criteria of falling within a certain range :(.
Here's an example of the table
Rooms
ID | Length | Width | Relative Size [Expected Values]
------------------------------------
1 | 15 | 12 | Large
2 | 15 | 12 | Large
3 | 10 | 10 | Medium
4 | 10 | 10 | Medium
5 | 8 | 9 | Small
6 | 8 | 8 | Small
7 | 8 | 7 | Small
8 | 10 | 9 | Medium
I'd be perfectly happy with any resources or clues that can assist me with this, but I've been going in circles.
Thanks for any attempts at helping me out.
Edit: I ended up going with a standard deviation approach, since it makes a lot more sense in this case, I'll post it as an answer.
I couldn't think of a way to do it in one query, but here is my attempt to do it in three, assuming size is NULL before populating. Note that TOP n PERCENT is SQL Server syntax and has no direct MySQL equivalent.
UPDATE Rooms SET size = 'Large' WHERE ID IN (SELECT TOP 33 PERCENT ID FROM Rooms ORDER BY length*width DESC)
UPDATE Rooms SET size = 'Small' WHERE ID IN (SELECT TOP 33 PERCENT ID FROM Rooms ORDER BY length*width ASC)
UPDATE Rooms SET size = 'Medium' WHERE size IS NULL
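As a cross-check of the rule stated in the question (a room is Large if it is bigger than 66% of the other rooms, Medium between 33% and 66%, Small otherwise), here is a plain Python sketch. It is not a SQL solution; classify_rooms is a hypothetical helper, "bigger" is assumed to mean a strictly larger area (length * width), and the pairwise comparison is quadratic, so it only suits small tables.

```python
def classify_rooms(rooms, large=0.66, medium=0.33):
    """rooms: {id: (length, width)} -> {id: 'Large' | 'Medium' | 'Small'}."""
    areas = {room_id: length * width for room_id, (length, width) in rooms.items()}
    others = len(areas) - 1  # every room is compared against all the other rooms
    labels = {}
    for room_id, area in areas.items():
        smaller = sum(
            1 for other_id, other_area in areas.items()
            if other_id != room_id and other_area < area
        )
        share = smaller / others if others else 0.0
        if share > large:
            labels[room_id] = "Large"
        elif share >= medium:
            labels[room_id] = "Medium"
        else:
            labels[room_id] = "Small"
    return labels


# Length and width per room ID, taken from the example table.
rooms = {1: (15, 12), 2: (15, 12), 3: (10, 10), 4: (10, 10),
         5: (8, 9), 6: (8, 8), 7: (8, 7), 8: (10, 9)}
labels = classify_rooms(rooms)
```

On the example table this reproduces the expected values column; the labels could then be written back with one UPDATE per category.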

How do you get the records from single dataset to display onto 4 corners of landscape page? [SSRS]

I am using Microsoft SQL Server Reporting Services 2005
I have a report that when printed I want to display a record in each of the 4 corners of a landscape page.
I am using a single Dataset that returns 1 to many records.
How do I accomplish this with a table or matrix?
For example if I had 6 records in my dataset:
Page 1
|---------------------|
| record 1 | record 2 |
|---------------------|
| record 3 | record 4 |
|---------------------|
Page 2
|---------------------|
| record 5 | record 6 |
|---------------------|
| [empty] | [empty] |
|---------------------|
So I have found a successful way to do this (with help from cdonner's suggestion): have 2 identical table templates, one displaying all odd records and the other displaying all even records.
This is what Design Mode looks like:
|-------------------|
| table 1 | table 2 |
|-------------------|
Then, on each table row of each table, I added the following expressions to the Visibility > Hidden property of the row:
For Odd Rows:
=RowNumber(Nothing) Mod 2 = 0
For Even Rows:
=RowNumber(Nothing) Mod 2 = 1
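The effect of the two Hidden expressions can be illustrated outside SSRS with a few lines of Python. This is only a model of the logic, not SSRS code; split_for_two_tables is a hypothetical helper, and it assumes RowNumber(Nothing) counts rows starting at 1.

```python
def split_for_two_tables(records):
    """Return (left, right): the rows that stay visible in table 1 and table 2."""
    numbered = list(enumerate(records, start=1))  # 1-based, like RowNumber(Nothing)
    # A row is shown when its Hidden expression evaluates to False.
    left = [rec for n, rec in numbered if not (n % 2 == 0)]   # table 1: odd rows
    right = [rec for n, rec in numbered if not (n % 2 == 1)]  # table 2: even rows
    return left, right


records = ["record 1", "record 2", "record 3", "record 4", "record 5", "record 6"]
left, right = split_for_two_tables(records)
```

With six records, the left table keeps records 1, 3 and 5 and the right table keeps records 2, 4 and 6, which matches the page layout in the question once the two tables sit side by side.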
The only way I can think of is by using subreports, one showing all even rows, the other one showing all odd rows.
To add Groups to Jon's answer, place tables 1 and 2 within a parent table which performs the grouping:
table-parent
group-row-header // header text..?
group-row-footer // group name is important for below
rectangle
table-child-1 | table-child-2 | etc // =RowNumber("my-group-name")
Note RowNumber must be based on the group so that it resets with each loop.