Efficient SQL query to find overlap between lists - mysql

Let's say I have a MySQL table order_items (idorder, iditem, amount) that contains items that people ordered from a web shop. I want to find orders similar to an order X by finding other orders with similar items in similar amounts.
Here is my current approach:
SELECT SQL_CALC_FOUND_ROWS
SUM(GREATEST(1, LEAST(cown.amount, cother.amount))) hits,
cother.`idorder`
FROM order_items cown
LEFT JOIN order_items cother ON (
cother.`idorder` != 1
AND cown.iditem = cother.iditem
)
WHERE cown.`idorder` = 1 AND cother.idorder IS NOT NULL
GROUP BY cother.idorder ASC
ORDER BY hits DESC
This selects all items from a given order and left joins them with items from other orders. Then I group by the other order ID and sum up the amount of overlap between them.
Is there a more efficient way to do this?

It looks like you need a recommendation engine. That would be tricky to implement in plain SQL, and I'm not sure how reliable it would be. For starters, have a look at the Apache Mahout project.
There is a nice example using Mahout and MySQL that you can try for yourself, available on GitHub: https://github.com/jasebell/RecommenderDemo (it looks like the thing you want).
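For reference, the overlap score described in the question can be reproduced on a toy data set. This is only a sketch under stated assumptions: it uses Python's built-in sqlite3 module rather than MySQL, so SQLite's scalar MAX(a, b) and MIN(a, b) stand in for GREATEST and LEAST, and the sample rows are invented. It also writes the join as a plain inner join, since a LEFT JOIN followed by "cother.idorder IS NOT NULL" keeps exactly the same rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE order_items (idorder INTEGER, iditem INTEGER, amount INTEGER)")
conn.executemany(
    "INSERT INTO order_items VALUES (?, ?, ?)",
    [
        (1, 10, 2), (1, 11, 1),  # order X (invented sample data)
        (2, 10, 5), (2, 11, 1),  # shares both items with order 1
        (3, 10, 1),              # shares one item
        (4, 99, 3),              # shares nothing
    ],
)

def similar_orders(conn, order_id):
    """Return (idorder, hits) for orders sharing items with order_id."""
    sql = """
        SELECT cother.idorder,
               -- SQLite's two-argument MAX/MIN play the role of GREATEST/LEAST
               SUM(MAX(1, MIN(cown.amount, cother.amount))) AS hits
        FROM order_items AS cown
        JOIN order_items AS cother
          ON cother.iditem = cown.iditem
         AND cother.idorder <> cown.idorder
        WHERE cown.idorder = ?
        GROUP BY cother.idorder
        ORDER BY hits DESC, cother.idorder
    """
    return conn.execute(sql, (order_id,)).fetchall()

overlap = similar_orders(conn, 1)
print(overlap)
```

Orders with no item in common simply do not appear in the result, which is why the outer join is not needed here.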

Related

Hadoop Hive query select and group by from separate tables

Below are the avg_mileage table and the trucks table.
What I'm trying to do is compile a query that lets me select (or create a table with) avg_mileage.avgmpg grouped by trucks.model, ordered from highest to lowest avg_mileage.avgmpg.
Something like this:
Isn't this a simple join rather than a group by? (Sorry, I can't comment because I don't have enough rep yet.)
OK, I think I get your question. You've already done this:
SELECT truckid, avg(mpg) avgmpg FROM truck_mileage GROUP BY truckid;
Now you want truck.model rather than truckid, AND you want it sorted?
SELECT model, avgmpg FROM avg_mileage JOIN trucks ON (avg_mileage.truckid = trucks.truckid) ORDER BY avgmpg DESC;
Try something like that.
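To see the join and the ordering together, here is a small sketch. It runs on SQLite through Python's sqlite3 module instead of Hive, the sample trucks and mileage figures are invented, and avg_mileage is modelled as a view over truck_mileage, which is an assumption about how that table was produced.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE trucks (truckid TEXT PRIMARY KEY, model TEXT);
    CREATE TABLE truck_mileage (truckid TEXT, mpg REAL);
    -- invented sample data
    INSERT INTO trucks VALUES ('A1', 'Ford'), ('A2', 'Peterbilt');
    INSERT INTO truck_mileage VALUES ('A1', 4.0), ('A1', 6.0), ('A2', 7.0), ('A2', 9.0);
    -- assumption: avg_mileage is the per-truck average of truck_mileage
    CREATE VIEW avg_mileage AS
        SELECT truckid, AVG(mpg) AS avgmpg FROM truck_mileage GROUP BY truckid;
""")

# Join the averages to the trucks table and sort from highest to lowest.
mileage_by_model = conn.execute("""
    SELECT model, avgmpg
    FROM avg_mileage
    JOIN trucks ON avg_mileage.truckid = trucks.truckid
    ORDER BY avgmpg DESC
""").fetchall()
print(mileage_by_model)
```

If several trucks share a model and you want one row per model, you would additionally GROUP BY model and average avgmpg; the query above returns one row per truck.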

Multitable counting and multiplying in same query

I have got a somewhat complicated problem. This is my situation (ERD).
For a dashboard I need to create a pivot table that shows the total number of competences used by the vacancies. Therefore I need to:
Count the number of vacancies per template
Count the number of templates per competence
And last: multiply these numbers to get the total number of competences used.
I have the first query:
SELECT vacancytemplate_id, count(id)
FROM vacancies
group by vacancytemplate_id;
And the second query isn't that difficult either, but I don't know what the right solution will be. I'm literally brainstuck. My mind can't comprehend how I can achieve the next step and put it down in a query. Please kind stranger, help me out :)
EDIT: my desired result is something like this
NameOfComp, NrOfTimesUsed
Leading, 17
Inspiring, 2
EDIT2: the meta query should look like this:
SELECT NameOfComp, (count of the competences used by templates) * (number of vacancies per template)
EDIT3: http://sqlfiddle.com/#!9/2773ca SQLFiddle
Thanks a lot!
If I am understanding your request correctly, you want a count of competences per vacancy. This can be done very simply thanks to your table structure:
Select v.ID, count(*) from vacancy as v inner join CompTemplate_Table as CT
on v.Template_ID = CT.Template_ID group by v.ID;
The reason you can do only one join is because there will be a record in the CompTemplate_Table for every competency in each template. Additionally, the same key is used to join vacancy to templates as is used to join templates to CompTemplate_Table, so they represent the same key value (and you can skip joining the Templates table if you don't need data from there).
If you are wanting to add this data to a pivot table, I will leave that exercise to you. There are a number of tutorials available if you do a quick google search and it should not be that hard.
UPDATE: For the second query you are looking at something like:
Select cp.NameOfComp, count(*) from vacancy as v inner join CompTemplate_Table as CT
on v.Template_ID = CT.Template_ID inner join competencies as CP
on CP.ID = CT.Comp_ID
group by CP.NameOfComp
The differences here are that you are adding in the competencies table, as you need data from it, and grouping by CP.NameOfComp instead of the vacancy ID. You can also restrict this to specific templates, competencies, or vacancies by adding search conditions (e.g. where CP.ID = 12345).
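A quick way to check that the two-join query really yields "templates per competence times vacancies per template" is to run it on a tiny data set. The sketch below uses Python's sqlite3 module; the table and column names follow the answer above, while the rows are invented for illustration and are not taken from the SQLFiddle.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE vacancy (ID INTEGER PRIMARY KEY, Template_ID INTEGER);
    CREATE TABLE CompTemplate_Table (Template_ID INTEGER, Comp_ID INTEGER);
    CREATE TABLE competencies (ID INTEGER PRIMARY KEY, NameOfComp TEXT);

    -- invented data: 3 vacancies use template 1, 2 vacancies use template 2
    INSERT INTO vacancy VALUES (1, 1), (2, 1), (3, 1), (4, 2), (5, 2);
    INSERT INTO competencies VALUES (1, 'Leading'), (2, 'Inspiring');
    -- 'Leading' is in both templates, 'Inspiring' only in template 2
    INSERT INTO CompTemplate_Table VALUES (1, 1), (2, 1), (2, 2);
""")

# Each vacancy row is repeated once per competence of its template,
# so counting rows per competence gives the number of times it is used.
comp_usage = conn.execute("""
    SELECT CP.NameOfComp, COUNT(*) AS NrOfTimesUsed
    FROM vacancy AS v
    INNER JOIN CompTemplate_Table AS CT ON v.Template_ID = CT.Template_ID
    INNER JOIN competencies AS CP ON CP.ID = CT.Comp_ID
    GROUP BY CP.NameOfComp
    ORDER BY NrOfTimesUsed DESC
""").fetchall()
print(comp_usage)
```

With this data, 'Leading' is counted 3 + 2 = 5 times and 'Inspiring' 2 times, so no explicit multiplication step is needed; the join does it.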

MySQL: Count then sort by the count total

I know other posts talk about this, but I haven't been able to apply anything to this situation.
This is what I have so far.
SELECT *
FROM ccParts, ccChild, ccFamily
WHERE parGC = '26' AND
parChild = chiId AND
chiFamily = famId
ORDER BY famName, chiName
What I need to do is see the total number of ccParts with the same ccFamily in the results. Then, sort by the total.
It looks like this is close to what you want:
SELECT f.famId, f.famName, pc.parCount
FROM (
SELECT c.chiFamily AS famId, count(*) AS parCount
FROM
ccParts p
JOIN ccChild c ON p.parChild = c.chiId
WHERE p.parGC ='26'
GROUP BY c.chiFamily
) pc
JOIN ccFamily f ON f.famId = pc.famId
ORDER BY pc.parCount
The inline view (between the parentheses) is the headliner: it does your grouping and counting. Note that you do not need to join table ccFamily there to group by family, as table ccChild already carries the family information. If you don't need the family name (i.e. if its ID were sufficient), then you can stick with the inline view alone, and there ORDER BY count(*). The outer query just associates family name with the results.
Additionally, MySQL provides a non-standard mechanism by which you could combine the outer query with the inline view, but in this case it doesn't help much with either clarity or concision. The query I provided should be accepted by any SQL implementation, and it's to your advantage to learn such syntax and approaches first.
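The inline-view approach can be tried on a few rows as follows. This is a sketch using Python's sqlite3 module; only the columns named in the question are created, the sample rows are invented, and DESC is added to the final ORDER BY so the largest family comes first (drop it for ascending order, as in the query above).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ccFamily (famId INTEGER PRIMARY KEY, famName TEXT);
    CREATE TABLE ccChild (chiId INTEGER PRIMARY KEY, chiName TEXT, chiFamily INTEGER);
    CREATE TABLE ccParts (parGC TEXT, parChild INTEGER);

    -- invented sample data
    INSERT INTO ccFamily VALUES (1, 'Bolts'), (2, 'Nuts');
    INSERT INTO ccChild VALUES (1, 'M4', 1), (2, 'M6', 1), (3, 'Hex', 2);
    INSERT INTO ccParts VALUES ('26', 1), ('26', 1), ('26', 2), ('26', 3), ('99', 3);
""")

family_counts = conn.execute("""
    SELECT f.famId, f.famName, pc.parCount
    FROM (
        -- inline view: group and count parts per family
        SELECT c.chiFamily AS famId, COUNT(*) AS parCount
        FROM ccParts p
        JOIN ccChild c ON p.parChild = c.chiId
        WHERE p.parGC = '26'
        GROUP BY c.chiFamily
    ) pc
    JOIN ccFamily f ON f.famId = pc.famId
    ORDER BY pc.parCount DESC, f.famId
""").fetchall()
print(family_counts)
```

The part with parGC = '99' is filtered out before grouping, so it does not contribute to the 'Nuts' total.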
In the SELECT, add something like count(ccParts) as count then ORDER BY count instead? Not sure about the structure of your tables so you might need to improvise.

Join on 3 tables insanely slow on giant tables

I have a query which goes like this:
SELECT insanlyBigTable.description_short,
insanlyBigTable.id AS insanlyBigTable,
insanlyBigTable.type AS insanlyBigTableLol,
catalogpartner.id AS catalogpartner_id
FROM insanlyBigTable
INNER JOIN smallerTable ON smallerTable.id = insanlyBigTable.catalog_id
INNER JOIN smallerTable1 ON smallerTable1.catalog_id = smallerTable.id
AND smallerTable1.buyer_id = 'xxx'
WHERE smallerTable1.cont = 'Y' AND insanlyBigTable.type IN ('111','222','33')
GROUP BY smallerTable.id;
Now, when I run the query first time it copies the giant table into a temp table... I want to know how I can prevent that? I am considering a nested query, or even to reverse the join (not sure the effect would be to run faster), but that is well, not nice. Any other suggestions?
To figure out how to optimize your query, we first have to boil down exactly what it is selecting so that we can preserve that information while we change things around.
What your query does
So, it looks like we need the following
The GROUP BY clause limits the results to at most one row per catalog_id
smallerTable1.cont = 'Y', insanlyBigTable.type IN ('111','222','33'), and buyer_id = 'xxx' appear to be the filters on the query.
And we want data from insanlyBigTable and ... catalogpartner? I would guess that catalogpartner is smallerTable1, due to the id of smallerTable being linked to the catalog_id of the other tables.
I'm not sure what the purpose of putting the buyer_id filter in the ON clause was, but unless you tell me differently, I'll assume its placement in the ON clause is unimportant.
The point of the query
I am unsure about the intent of the query, based on that GROUP BY clause. You will obtain just one row per catalog_id in the insanlyBigTable, but you don't appear to care which row it is. Indeed, the fact that you can run this query at all is due to a special non-standard feature in MySQL that lets you SELECT columns that do not appear in the GROUP BY clause... however, you don't get to choose WHICH row those values come from. This means you could have information from 4 different rows for each of your selected items.
My best guess, based on column names, is that you are trying to bring back a list of items that are in the same catalog as something that was purchased by a given buyer, but without any more than one item per catalog. In addition, you want something to connect back to the purchased item in that catalog, via the catalogpartner table's id.
So, something probably akin to amazon's "You may like these items because you purchased these other items" feature.
The new query
We want 1 row per insanlyBigTable.catalog_id, based on which catalog_id exists in smallerTable1, after filtering.
SELECT
ibt.description_short,
ibt.id AS insanlyBigTable,
ibt.type AS insanlyBigTableLol,
(
SELECT st.id FROM smallerTable1 st
WHERE st.buyer_id = 'xxx'
AND st.cont = 'Y'
AND st.catalog_id = ibt.catalog_id
LIMIT 1
) AS catalogpartner_id
FROM insanlyBigTable ibt
WHERE ibt.id IN (
SELECT (
SELECT ibt.id AS ibt_id
FROM insanlyBigTable ibt
WHERE ibt.catalog_id = sti.catalog_id
LIMIT 1
) AS ibt_id
FROM (
SELECT DISTINCT(catalog_id) FROM smallerTable1 st
WHERE st.buyer_id = 'xxx'
AND st.cont = 'Y'
AND EXISTS (
SELECT * FROM insanlyBigTable ibt
WHERE ibt.type IN ('111','222','33')
AND ibt.catalog_id = st.catalog_id
)
) AS sti
)
This query should generate the same result as your original query, but it breaks things down into smaller queries to avoid the use (and abuse) of the GROUP BY clause on the insanlyBigTable.
Give it a try and let me know if you run into problems.

order by count from another table

I hope it's OK to post this, as I'm aware of other similar posts on this topic. But I'm not able to get the other solutions to work, so I'm posting my own scenario. In the other examples, like the one below, I'm unsure how they use the table names and columns. Is it through the dot notation?
SELECT bloggers.*, COUNT(post_id) AS post_count
FROM bloggers LEFT JOIN blogger_posts
ON bloggers.blogger_id = blogger_posts.blogger_id
GROUP BY bloggers.blogger_id
ORDER BY post_count
I have a table with articles, and a statistics table that gets a new record every time an article is read. I am trying to write a query that sorts my article table by the number of records for each article id in the statistics table, like a "sort by views" function.
my 2 tables:
article
id
statistics
pid <- same as article id
Looking at other examples, I'm missing the left join. I just can't wrap my head around how to make that work. My query at the moment looks like this:
$query = "SELECT *, COUNT(pid) AS views FROM statistics GROUP BY pid ORDER BY views DESC";
Any help is greatly appreciated!
SELECT article.*, COUNT(statistics.pid) AS views
FROM article LEFT JOIN statistics ON article.id = statistics.pid
GROUP BY article.id
ORDER BY views DESC
Ideas:
Combine both tables using a join
If an article has no statistics, fill up with NULL, i.e. use a left join
COUNT only counts non-NULL values, so count by right table to give correct zero results
GROUP BY to obtain exactly one result row for every article, i.e. to count statistics for each article individually
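The points above can be checked with a small sketch. It uses Python's sqlite3 module, the title column and the sample rows are invented, and an extra COUNT(*) column is included only to show why counting the right-hand table's column matters for articles that were never read.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE article (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE statistics (pid INTEGER);  -- pid is the article id

    -- invented data: article 3 has never been read
    INSERT INTO article VALUES (1, 'first'), (2, 'second'), (3, 'unread');
    INSERT INTO statistics VALUES (1), (1), (1), (2);
""")

view_counts = conn.execute("""
    SELECT article.id,
           COUNT(statistics.pid) AS views,       -- ignores the NULL from the left join
           COUNT(*)              AS joined_rows  -- counts the NULL-padded row too
    FROM article
    LEFT JOIN statistics ON article.id = statistics.pid
    GROUP BY article.id
    ORDER BY views DESC, article.id
""").fetchall()
print(view_counts)
```

For the unread article, COUNT(statistics.pid) gives 0 while COUNT(*) gives 1, because the left join still produces one row for it with pid set to NULL.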