Alright, I have a relation which stores two keys, a product Id and an attribute Id. I want to figure out which product is most similar to a given product. (Attributes are actually numbers but it makes the example more confusing so they have been changed to letters to simplify the visual representation.)
Prod_att
Product | Attributes
1 | A
1 | B
1 | C
2 | A
2 | B
2 | D
3 | A
3 | E
4 | A
Initially this seems fairly simple: just select the attributes that a product has, then count the number of shared attributes per product. The result is then compared to the number of attributes a product has, and I can see how similar two products are. This works for products with a large number of attributes relative to the products they are compared with, but issues arise when products have very few attributes. For example, product 3 will tie with almost every other product (as A is very common).
SELECT Product, count(Attributes)
FROM Prod_att
WHERE Attributes IN
(SELECT Attributes
FROM prod_att
WHERE Product = 1)
GROUP BY Product
;
Any suggestions on how to fix this or improvements to my current query?
Thanks!
*edit: Product 4 will return count() = 1 for all products. I would like to show that Product 3 is more similar, as it has fewer differing attributes.
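One way to break those ties is to score candidates by Jaccard similarity (shared attributes divided by the combined attribute set) instead of the raw shared count, which naturally penalizes differing attributes. A runnable sketch with the question's sample rows (sqlite3 is used here purely for demonstration, and Jaccard is one possible interpretation of "fewer differing attributes", not something the question prescribes):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Prod_att (Product INTEGER, Attributes TEXT);
INSERT INTO Prod_att VALUES
 (1,'A'),(1,'B'),(1,'C'),
 (2,'A'),(2,'B'),(2,'D'),
 (3,'A'),(3,'E'),
 (4,'A');
""")

def most_similar(target):
    # Jaccard similarity: shared / (attrs of target + attrs of candidate - shared).
    return conn.execute("""
        SELECT b.Product,
               COUNT(*) * 1.0 /
               ((SELECT COUNT(*) FROM Prod_att WHERE Product = ?)
                + (SELECT COUNT(*) FROM Prod_att p WHERE p.Product = b.Product)
                - COUNT(*)) AS jaccard
        FROM Prod_att a
        JOIN Prod_att b ON b.Attributes = a.Attributes
                       AND b.Product <> a.Product
        WHERE a.Product = ?
        GROUP BY b.Product
        ORDER BY jaccard DESC
        LIMIT 1
    """, (target, target)).fetchone()

print(most_similar(4))   # → (3, 0.5)
```

For product 4 this ranks product 3 (score 1/2) above products 1 and 2 (score 1/3 each), because product 3 contributes only one non-shared attribute.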
Try this
SELECT
a_product_id,
COALESCE( b_product_id, 'no_matches_found' ) AS closest_product_match
FROM (
SELECT
*,
@row_num := IF(@prev_value = a_product_id, @row_num + 1, 1) AS row_num,
@prev_value := a_product_id
FROM
(SELECT @prev_value := 0, @row_num := 0) r
JOIN (
SELECT
a.product_id as a_product_id,
b.product_id as b_product_id,
count( distinct b.Attributes ) as matched_attributes,
count( distinct b2.Attributes ) as total_attributes
FROM
products a
LEFT JOIN products b ON ( a.Attributes = b.Attributes AND a.product_id <> b.product_id )
LEFT JOIN products b2 ON ( b2.product_id = b.product_id )
/*WHERE */
/* a.product_id = 3 */
GROUP BY
a.product_id,
b.product_id
ORDER BY
1, 3 desc, 4
) t
) t2
WHERE
row_num = 1
The above query gets the closest matches for all the products. You can uncomment the WHERE clause and include a product_id in the innermost query to get the results for a particular product. I have used a LEFT JOIN so that even if a product has no matches, it is still displayed.
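On MySQL 8.0+ the session-variable row-numbering trick used above can be replaced with ROW_NUMBER(), which avoids the caveats of reading and assigning user variables in the same statement. A minimal sketch of the same keep-the-best-match-per-product idea, run against sqlite3 (which also has window functions) with the question's data loaded into this answer's hypothetical products layout:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite 3.25+ for window functions
conn.executescript("""
CREATE TABLE products (product_id INTEGER, Attributes TEXT);
INSERT INTO products VALUES
 (1,'A'),(1,'B'),(1,'C'),
 (2,'A'),(2,'B'),(2,'D'),
 (3,'A'),(3,'E'),
 (4,'A');
""")

# For every product, rank candidates by number of shared attributes
# (descending) and keep only the best one.
rows = conn.execute("""
    SELECT a_product_id, b_product_id
    FROM (
        SELECT a_product_id, b_product_id,
               ROW_NUMBER() OVER (PARTITION BY a_product_id
                                  ORDER BY shared DESC) AS row_num
        FROM (
            SELECT a.product_id AS a_product_id,
                   b.product_id AS b_product_id,
                   COUNT(b.Attributes) AS shared
            FROM products a
            LEFT JOIN products b
                   ON a.Attributes = b.Attributes
                  AND a.product_id <> b.product_id
            GROUP BY a.product_id, b.product_id
        ) c
    ) t
    WHERE row_num = 1
""").fetchall()
print(rows)
```

Products 3 and 4 tie with several candidates (every shared count is 1), so the row kept for them is arbitrary unless you add a tie-breaker to the ORDER BY.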
Hope this helps
Try the "Lower bound of Wilson score confidence interval for a Bernoulli parameter". This explicitly deals with the problem of statistical confidence when you have small n. It looks like a lot of math, but actually this is about the minimum amount of math you need to do this sort of thing right. And the website explains it pretty well.
This assumes it is possible to make the step from positive / negative scoring to your problem of matching / not matching attributes.
Here's an example for positive and negative scoring and 95% CL:
SELECT widget_id, ((positive + 1.9208) / (positive + negative) -
1.96 * SQRT((positive * negative) / (positive + negative) + 0.9604) /
(positive + negative)) / (1 + 3.8416 / (positive + negative))
AS ci_lower_bound FROM widgets WHERE positive + negative > 0
ORDER BY ci_lower_bound DESC;
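For reference, here is the same formula in procedural code; the magic constants in the SQL are z²/2 = 1.9208, z²/4 = 0.9604 and z² = 3.8416 for z = 1.96 (95% confidence):

```python
from math import sqrt

def ci_lower_bound(positive, negative, z=1.96):
    """Lower bound of the Wilson score interval for positive/(positive+negative)."""
    n = positive + negative
    if n == 0:
        return 0.0
    phat = positive / n
    return (phat + z * z / (2 * n)
            - z * sqrt((phat * (1 - phat) + z * z / (4 * n)) / n)) / (1 + z * z / n)
```

For example, 1 positive out of 1 scores only about 0.21, while 100 positives out of 100 score about 0.96, which is exactly the small-n caution the Wilson bound provides.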
You could write a little view that will give you the total shared attributes between two products.
create view vw_shared_attributes as
select a.product,
b.product 'product_match',
count(*) 'shared_attributes'
from your_table a
inner join your_table b on b.attribute = a.attribute and b.product <> a.product
group by a.product, b.product
and then use that view to select the top match.
select product,
(select top 1 s.product_match from vw_shared_attributes s where t.product = s.product order by s.shared_attributes desc)
from your_table t
group by product
See http://www.sqlfiddle.com/#!6/53039/1 for an example
Related
To begin with, I should say that I can't change any structure of the database; only SELECT is allowed.
I have thought about this for about a week and can't find a solution to my problem.
Let's assume I have a table like this: https://www.db-fiddle.com/f/cEbW3ZoQBRpun3Pt5g3h3v/1
There I have products with their categories, and I'd like to make a report that shows how many products there are in each category: select the TOP 3 with their counts, and display all the remaining categories as "Others" with their combined count. But some categories should be counted together (I'd like to predefine them in the select query); for example, I'd have A counted with G, and likewise B-C and E-F.
So the result should look like:
B-C = 7
A-G = 6
D = 5
OTHERS = 6
Tell me, is it even possible with just a SELECT? If yes, please tell me how (an example would be nice); if not, I'm going to tell this to my manager, because right now he won't believe me and keeps saying "you can do it".
I would approach this with a derived table that represents the mapping between categories, that the original table can be left joined with. You can then aggregate, which gives you the count of product for each of these "real" categories. Then, you can use window functions (available in MySQL 8.0 only) and an additional level of aggregation to separate the top 3 from the rest of the "real" categories.
select
case when rn <= 3 then real_category else 'Other' end final_category,
sum(no_products) no_products
from (
select
coalesce(x.new_category, p.category) real_category,
count(*) no_products,
rank() over(order by count(*) desc) rn
from products p
left join (
select 'A' category, 'A-G' new_category
union all select 'G', 'A-G'
union all select 'B', 'B-C'
union all select 'C', 'B-C'
union all select 'E', 'E-F'
union all select 'F', 'E-F'
) x on x.category = p.category
group by real_category
) t
group by final_category
order by no_products desc
Demo on DB Fiddle:
final_category | no_products
:------------- | ----------:
A-G | 6
B-C | 6
D | 5
Other | 5
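The approach can be checked end to end with sqlite3 (3.25+ for window functions); the sample rows below are invented to reproduce the counts the question asks for, since the fiddle data isn't reproduced here:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE products (id INTEGER PRIMARY KEY, category TEXT);
INSERT INTO products (category) VALUES
 ('B'),('B'),('B'),('B'),('C'),('C'),('C'),          -- B-C = 7
 ('A'),('A'),('A'),('A'),('G'),('G'),                -- A-G = 6
 ('D'),('D'),('D'),('D'),('D'),                      -- D   = 5
 ('E'),('E'),('F'),('F'),('H'),('H');                -- Others = 6
""")

rows = conn.execute("""
    SELECT CASE WHEN rn <= 3 THEN real_category ELSE 'OTHERS' END final_category,
           SUM(no_products) no_products
    FROM (
        SELECT COALESCE(x.new_category, p.category) real_category,
               COUNT(*) no_products,
               RANK() OVER (ORDER BY COUNT(*) DESC) rn
        FROM products p
        LEFT JOIN (
            SELECT 'A' category, 'A-G' new_category
            UNION ALL SELECT 'G', 'A-G'
            UNION ALL SELECT 'B', 'B-C'
            UNION ALL SELECT 'C', 'B-C'
            UNION ALL SELECT 'E', 'E-F'
            UNION ALL SELECT 'F', 'E-F'
        ) x ON x.category = p.category
        GROUP BY real_category
    ) t
    GROUP BY final_category
    ORDER BY no_products DESC
""").fetchall()
print(dict(rows))
```

With these invented rows, dict(rows) comes out as {'B-C': 7, 'A-G': 6, 'D': 5, 'OTHERS': 6}, matching the figures in the question.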
I have a table that's something like this:
Name | Frequency
----------------
Bill | 12
Joe | 23
Hank | 1
Stew | 98
I need to figure out how many people make up each decile of total frequency. I.e. if total sum(frequency) is 10,000 then each decile will have size 1,000. I need to know how many people make up each 1000. Right now I have done:
with rankedTable as (select * from TABLE order by frequency desc limit XXXX)
select sum(frequency) from rankedTable
And I am changing the XXXX so that the sum(frequency) adds up to decile values (which I know from sum(frequency)/10). There has to be a faster way of doing this.
I think this can give the n-percentile a user belongs to. I use variables for readability, but they are not strictly necessary.
set @sum := (select sum(freq) from t);
set @n := 10; -- define the N in N-percentile
select b.name, b.freq, sum(a.freq) as cumulative_sum, floor(sum(a.freq) / @sum * @n) as percentile
from t a join t b on b.freq >= a.freq
group by b.name
From this it is easy to count the members of each percentile:
select percentile, count(*) as `count`
from
(
select b.name, b.freq, sum(a.freq) as cumulative_sum, floor(sum(a.freq) / @sum * @n) as percentile
from t a join t b on b.freq >= a.freq
group by b.name
) x
group by percentile;
I hope this helps!
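If window functions are available (MySQL 8.0+), the triangular self-join (which grows quadratically with table size) can be replaced by a cumulative SUM() OVER. A sqlite3 sketch with the question's sample rows; the default window frame includes peer rows with equal freq, matching the b.freq >= a.freq join:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite 3.25+ for window functions
conn.executescript("""
CREATE TABLE t (name TEXT, freq INTEGER);
INSERT INTO t VALUES ('Bill',12),('Joe',23),('Hank',1),('Stew',98);
""")

# Cumulative share of the total frequency decides the decile (n = 10);
# CAST truncates, which equals floor() for these non-negative values.
rows = conn.execute("""
    SELECT percentile, COUNT(*) AS members
    FROM (
        SELECT name,
               CAST(SUM(freq) OVER (ORDER BY freq) * 10.0
                    / (SELECT SUM(freq) FROM t) AS INTEGER) AS percentile
        FROM t
    ) x
    GROUP BY percentile
    ORDER BY percentile
""").fetchall()
print(rows)
```

With the four sample rows (total 134), Hank and Bill land in decile 0, Joe in decile 2, and Stew closes out decile 10, so the per-decile counts come out as {0: 2, 2: 1, 10: 1}.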
I have two tables with the following structure:
Table A
itemId categoryId orderDate
==========================================
1 23 2016-11-08
1 23 2016-11-12
1 23 2016-11-16
Table B has the structure:
categoryId stock price
==========================================
23 500 600
However, my desired output should be like:
Result C
price stock orderdate qty
600 500 2016-11-08 (first order date) 3 (appears 3 times in the first table)
Here is what I have tried so far:
select b.price,b.stock from B b, A a
where b.categoryId = (
select a.categoryId
from A
GROUP BY categoryId
HAVING COUNT(categoryId)>1
)
and (a.orderdate = (
select MIN(orderdate)
from A
where categoryId = b.categoryId)
)
I get the following result:
price stock orderdate
600 500 2016-11-08
I have no idea how to find qty, as it appears 3 times in the first table.
I think you want the records in table a grouped by item id and category id, so include these two in your GROUP BY statement. The other columns you have to aggregate using MIN, MAX, AVG, SUM, etc. I use MIN, which will give you the smallest value in the group for that particular column, although it shouldn't matter in this case whether you use MIN, MAX, or AVG: they're all the same here. Then COUNT(*) will just count the number of records in the group.
Also, joins are generally preferred over listing tables with commas.
SELECT a.itemid, a.categoryid, MIN(b.price), MIN(b.stock), min(a.orderdate), count(*) as qty
FROM a
INNER JOIN b ON a.categoryid = b.categoryid
GROUP BY a.itemid, a.categoryid
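Using the question's sample rows, the grouped join can be verified end to end (sqlite3 used here just to make the sketch runnable):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE a (itemId INTEGER, categoryId INTEGER, orderDate TEXT);
INSERT INTO a VALUES (1,23,'2016-11-08'),(1,23,'2016-11-12'),(1,23,'2016-11-16');
CREATE TABLE b (categoryId INTEGER, stock INTEGER, price INTEGER);
INSERT INTO b VALUES (23,500,600);
""")

# The join yields one row per order; grouping collapses them while
# COUNT(*) records how many orders the group contained.
row = conn.execute("""
    SELECT MIN(b.price), MIN(b.stock), MIN(a.orderDate), COUNT(*) AS qty
    FROM a
    INNER JOIN b ON a.categoryId = b.categoryId
    GROUP BY a.itemId, a.categoryId
""").fetchone()
print(row)   # → (600, 500, '2016-11-08', 3)
```

The single result row is (600, 500, '2016-11-08', 3): price, stock, first order date, and the quantity of order rows.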
You also need to select COUNT(*)
How about using the following SQL:
select min(price), min(stock), min(orderDate), COUNT(categoryId)
from A,B where A.categoryId = B.categoryId
GROUP by A.categoryId
You could create views for your subqueries and give them meaningful names, e.g. CategoriesUsedInMultipleOrders, MostRecentOrderByCategory. This would 'optimize' your query by abstracting away complexity and making it easier for the human reader to understand.
This is the query with the appropriate join method; see the result:
SELECT B.price, B.stock, MIN( A.orderDate ) AS orderdate, COUNT( * ) AS qty
FROM TableA A
INNER JOIN TableB B ON A.categoryId = B.CategoryId
GROUP BY A.categoryId, B.price, B.stock
I hope that I am able to explain the situation as much as possible :)
We need to take sum from Master and child records of MySQL tables. The current query is as follows:
select sum(
abs(ifnull(dt.N_RETAIL_PRICE,0 ) * ifnull(dt.N_QTY_NO ,0) )
+ ifnull(st.shipping_total,0 ) + ifnull(st.TaxAmount,0 ) - abs(ifnull(st.discount ,0))
) Total
FROM inv_store_transaction st
inner join inv_store_transaction_det dt
on st.OID = dt.INV_STORE_TRANSACTION
where st.INV_TRANSACTION_TYPE = 35
and st.INV_STORES = 1
The issue what we suspect is that if the detail column has more than 1 row, the columns of master will be summed that many times.
e.g if detail has say 3 rows, then the sum of its relevant master data will also be taken 3 times.
To summarize, we need to take a grand total of all Invoices that fall under the given condition.
Any help appreciated.
The solution to this problem is to pre-aggregate the detail data:
select sum(dt.dtamt) + sum(st.shipping_total) + sum(st.TaxAmount) -
       sum(abs(st.discount)) as Total
FROM inv_store_transaction st inner join
     (select dt.INV_STORE_TRANSACTION,
             sum(abs(coalesce(dt.N_RETAIL_PRICE, 0) * coalesce(dt.N_QTY_NO, 0))) as dtamt
      from inv_store_transaction_det dt
      group by dt.INV_STORE_TRANSACTION
     ) dt
on st.OID = dt.INV_STORE_TRANSACTION
where st.INV_TRANSACTION_TYPE = 35 and st.INV_STORES = 1
You don't need to test for NULL in the outer aggregates: SUM() ignores NULL values and only returns NULL when every value in the group is NULL.
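The double-counting and the fix are easy to demonstrate with invented figures (one header with shipping 10, tax 5, no discount, and three detail rows of 100 each); a sqlite3 sketch:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE inv_store_transaction (
  OID INTEGER, INV_TRANSACTION_TYPE INTEGER, INV_STORES INTEGER,
  shipping_total REAL, TaxAmount REAL, discount REAL);
CREATE TABLE inv_store_transaction_det (
  INV_STORE_TRANSACTION INTEGER, N_RETAIL_PRICE REAL, N_QTY_NO REAL);
INSERT INTO inv_store_transaction VALUES (1, 35, 1, 10, 5, 0);
INSERT INTO inv_store_transaction_det VALUES (1, 100, 1), (1, 100, 1), (1, 100, 1);
""")

# Naive join: header shipping/tax/discount repeat once per detail row.
naive = conn.execute("""
    SELECT SUM(ABS(COALESCE(dt.N_RETAIL_PRICE,0) * COALESCE(dt.N_QTY_NO,0))
               + COALESCE(st.shipping_total,0) + COALESCE(st.TaxAmount,0)
               - ABS(COALESCE(st.discount,0)))
    FROM inv_store_transaction st
    JOIN inv_store_transaction_det dt ON st.OID = dt.INV_STORE_TRANSACTION
    WHERE st.INV_TRANSACTION_TYPE = 35 AND st.INV_STORES = 1
""").fetchone()[0]

# Pre-aggregated: each header's shipping/tax/discount counted exactly once.
fixed = conn.execute("""
    SELECT SUM(dt.dtamt) + SUM(st.shipping_total) + SUM(st.TaxAmount)
           - SUM(ABS(st.discount))
    FROM inv_store_transaction st
    JOIN (SELECT INV_STORE_TRANSACTION,
                 SUM(ABS(COALESCE(N_RETAIL_PRICE,0) * COALESCE(N_QTY_NO,0))) AS dtamt
          FROM inv_store_transaction_det
          GROUP BY INV_STORE_TRANSACTION) dt
      ON st.OID = dt.INV_STORE_TRANSACTION
    WHERE st.INV_TRANSACTION_TYPE = 35 AND st.INV_STORES = 1
""").fetchone()[0]

print(naive, fixed)   # naive counts shipping+tax three times: 345.0 vs 315.0
```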
I have a many-to-many relationship between invoices and credit card transactions, which I'm trying to map sums of together. The best way to think of the problem is to imagine TransactionInvoiceMap as a bipartite graph. For each connected subgraph, find the total of all invoices and the total of all transactions within that subgraph. In my query, I want to return the values computed for each of these subgraphs along with the transaction ids they're associated with. Totals for related transactions should be identical.
More explicitly, given the following transactions/invoices
Table: TransactionInvoiceMap
TransactionID InvoiceID
1 1
2 2
3 2
3 3
Table: Transactions
TransactionID Amount
1 $100
2 $75
3 $75
Table: Invoices
InvoiceID Amount
1 $100
2 $100
3 $50
my desired output is
TransactionID TotalAsscTransactions TotalAsscInvoiced
1 $100 $100
2 $150 $150
3 $150 $150
Note that invoices 2 and 3 and transactions 2 and 3 are part of a logical group.
Here's a solution (simplified, names changed) that apparently works, but is very slow. I'm having a hard time figuring out how to optimize this, but I think it would involve eliminating the subqueries into TransactionInvoiceGrouping. Feel free to suggest something radically different.
with TransactionInvoiceGrouping as (
select
-- Need an identifier for each logical group of transactions/invoices, use
-- one of the transaction ids for this.
m.TransactionID,
m.InvoiceID,
min(m.TransactionID) over (partition by m.InvoiceID) as GroupingID
from TransactionInvoiceMap m
)
select distinct
g.TransactionID,
istat.InvoiceSum as TotalAsscInvoiced,
tstat.TransactionSum as TotalAsscTransactions
from TransactionInvoiceGrouping g
cross apply (
select sum(ii.Amount) as InvoiceSum
from (select distinct InvoiceID, GroupingID from TransactionInvoiceGrouping) ig
inner join Invoices ii on ig.InvoiceID = ii.InvoiceID
where ig.GroupingID = g.GroupingID
) as istat
cross apply (
select sum(it.Amount) as TransactionSum
from (select distinct TransactionID, GroupingID from TransactionInvoiceGrouping) ig
left join Transactions it on ig.TransactionID = it.TransactionID
where ig.GroupingID = g.GroupingID
having sum(it.Amount) > 0
) as tstat
I've implemented the solution in a recursive CTE:
;with TranGroup as (
select TransactionID
, InvoiceID as NextInvoice
, TransactionID as RelatedTransaction
, cast(TransactionID as varchar(8000)) as TransactionChain
from TransactionInvoiceMap
union all
select g.TransactionID
, m1.InvoiceID
, m.TransactionID
, g.TransactionChain + ',' + cast(m.TransactionID as varchar(11))
from TranGroup g
join TransactionInvoiceMap m on g.NextInvoice = m.InvoiceID
join TransactionInvoiceMap m1 on m.TransactionID = m1.TransactionID
where ',' + g.TransactionChain + ',' not like '%,' + cast(m.TransactionID as varchar(11)) + ',%'
)
, RelatedTrans as (
select distinct TransactionID, RelatedTransaction
from TranGroup
)
, RelatedInv as (
select distinct TransactionID, NextInvoice as RelatedInvoice
from TranGroup
)
select TransactionID
, (
select sum(Amount)
from Transactions
where TransactionID in (
select RelatedTransaction
from RelatedTrans
where TransactionID = t.TransactionID
)
) as TotalAsscTransactions
, (
select sum(Amount)
from Invoices
where InvoiceID in (
select RelatedInvoice
from RelatedInv
where TransactionID = t.TransactionID
)
) as TotalAsscInvoiced
from Transactions t
There is probably some room for optimization (including object naming on my part!) but I believe I have at least a correct solution which will gather all possible Transaction-Invoice relations to include in the calculations.
I was unable to get the existing solutions on this page to give the OP's desired output, and they got uglier as I added more test data. I'm not sure if the OP's posted "slow" solution is correct as stated. It's very possible that I'm misinterpreting the question.
Additional info:
I've often seen that recursive queries can be slow when working with large sets of data. Perhaps that can be the subject of another SO question. If that's the case, things to try on the SQL side might be to limit the range (add where clauses), index base tables, select the CTE into a temp table first, index that temp table, think of a better stop condition for the CTE...but profile first, of course.
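If pulling TransactionInvoiceMap into application code is an option, the connected-subgraph grouping can also be computed with a union-find structure in near-linear time, sidestepping the recursive CTE entirely. A sketch in Python using the question's sample data:

```python
from collections import defaultdict

def find(parent, x):
    # Union-find "find" with path halving.
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def group_totals(mapping, transactions, invoices):
    """mapping: (transaction_id, invoice_id) pairs; the other two map id -> amount.
    Returns {transaction_id: (group transaction total, group invoice total)}."""
    parent = {}
    for t, i in mapping:                       # union every mapped pair
        parent.setdefault(('t', t), ('t', t))
        parent.setdefault(('i', i), ('i', i))
        parent[find(parent, ('t', t))] = find(parent, ('i', i))
    tsum, isum = defaultdict(int), defaultdict(int)
    for t in {t for t, _ in mapping}:          # sum transactions per component
        tsum[find(parent, ('t', t))] += transactions[t]
    for i in {i for _, i in mapping}:          # sum invoices per component
        isum[find(parent, ('i', i))] += invoices[i]
    return {t: (tsum[find(parent, ('t', t))], isum[find(parent, ('t', t))])
            for t, _ in mapping}

mapping = [(1, 1), (2, 2), (3, 2), (3, 3)]
transactions = {1: 100, 2: 75, 3: 75}
invoices = {1: 100, 2: 100, 3: 50}
print(group_totals(mapping, transactions, invoices))
# → {1: (100, 100), 2: (150, 150), 3: (150, 150)}
```

Each transaction reports the transaction and invoice totals of its whole connected component, so transactions 2 and 3 both see 150/150, matching the desired output in the question.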
If I have understood the question right, I think you are trying to find the minimum transaction id for each invoice, and I have used a ranking function to do the same.
WITH TransactionInvoiceGrouping AS (
SELECT
-- Need an identifier for each logical group of transactions/invoices, use
-- one of the transaction ids for this.
m.TransactionID,
m.InvoiceID,
ROW_NUMBER() OVER (PARTITION BY m.InvoiceID ORDER BY m.TransactionID ) AS recno
FROM TransactionInvoiceMap m
)
SELECT
g.TransactionID,
istat.InvoiceSum AS TotalAsscInvoiced,
tstat.TransactionSum AS TotalAsscTransactions
FROM TransactionInvoiceGrouping g
CROSS APPLY(
SELECT SUM(ii.Amount) AS InvoiceSum
FROM TransactionInvoiceGrouping ig
inner JOIN Invoices ii ON ig.InvoiceID = ii.InvoiceID
WHERE ig.TransactionID = g.TransactionID
AND ig.recno = 1
) AS istat
CROSS APPLY(
SELECT sum(it.Amount) AS TransactionSum
FROM TransactionInvoiceGrouping ig
LEFT JOIN transactions it ON ig.TransactionID = it.TransactionID
WHERE ig.TransactionID = g.TransactionID
AND ig.recno = 1
HAVING SUM(it.Amount) > 0
) AS tstat
WHERE g.recno = 1