So I want to do the following for a project.
I have 3 tables. First two concern us now (the third is for your better understanding):
author {id, name}
authorship {id, id1, id2}
paper {id, title}
authorship connects author with paper and authorship.id1 refers to author.id, authorship.id2 refers to paper.id.
I want to build a graph with a node for each author and an edge between two authors whose weight is determined by the number of papers they have in common:
w = 1 - intersection_of_papers/union_of_papers
So I have built (with some help from Stack Overflow) an SQL script that returns all pairs of co-authors plus the counts needed for the union and intersection of their papers. After that I will process the data in Java. It's the following:
SELECT DISTINCT a1.name, a2.name, (
SELECT concat(count(a.id2), ',', count(DISTINCT a.id2))
FROM authorship a
WHERE a.id1=a1.id or a.id1=a2.id) as weight
FROM authorship au1
INNER JOIN authorship au2 ON au1.id2 = au2.id2 AND au1.id1 <> au2.id1
INNER JOIN author a1 ON au1.id1 = a1.id
INNER JOIN author a2 ON au2.id1 = a2.id;
This does the job and returns a list like:
+-----------------+---------------------+---------+
| name | name | weight |
+-----------------+---------------------+---------+
| Kurt | Michael | 161,157 |
| Kurt | Miron | 138,134 |
| Kurt | Manish | 19,18 |
| Roy | Gregory | 21,20 |
| Roy | Richard | 74,71 |
....
In weight I can see two numbers a,b: a counts all authorship rows of the two authors, b is the number of distinct papers (the union), so a-b is the number of common papers (the intersection).
But this takes a lot of time.
All the overhead comes from this extra subselect:
(SELECT concat(count(a.id2), ',', count(DISTINCT a.id2))
FROM authorship a
WHERE a.id1=a1.id or a.id1=a2.id) as weight
Without this subselect, all records (1M+) were returned in less than 2 minutes.
With it, 50 records need more than 1.5 minutes.
I use MySQL on Linux from the command line.
Any ideas how I can optimize it?
author has ~130,000 records
authorship ~1,300,000 records
query should return ~1,200,000 records
This is what EXPLAIN returns for this query. I don't know how to use it.
+----+--------------------+-------+--------+---------------------+-----------+---------+--------------+---------+-----------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+--------------------+-------+--------+---------------------+-----------+---------+--------------+---------+-----------------+
| 1 | PRIMARY | a1 | ALL | PRIMARY | NULL | NULL | NULL | 124768 | Using temporary |
| 1 | PRIMARY | au1 | ref | NewIndex1,NewIndex2 | NewIndex1 | 5 | dblp.a1.ID | 4 | Using where |
| 1 | PRIMARY | au2 | ref | NewIndex1,NewIndex2 | NewIndex2 | 5 | dblp.au1.id2 | 1 | Using where |
| 1 | PRIMARY | a2 | eq_ref | PRIMARY | PRIMARY | 4 | dblp.au2.id1 | 1 | |
| 2 | DEPENDENT SUBQUERY | a | ALL | NewIndex1 | NULL | NULL | NULL | 1268557 | Using where |
+----+--------------------+-------+--------+---------------------+-----------+---------+--------------+---------+-----------------+
You should be able to get your data directly from the joins in the outer query.
You can count the number of papers in common by counting the distinct id2 where that is the same for both authors.
You can count the total number of papers as the number of distinct papers for each author minus the ones in common (because otherwise, these would be counted twice):
SELECT a1.name, a2.name,
COUNT(distinct case when au1.id2 = au2.id2 then au1.id2 end) as CommonPapers,
COUNT(distinct au1.id2) + COUNT(distinct au2.id2) - COUNT(distinct case when au1.id2 = au2.id2 then au1.id2 end) as TotalPapers
FROM authorship au1 INNER JOIN
authorship au2
ON au1.id2 = au2.id2 AND au1.id1 <> au2.id1 INNER JOIN
author a1
ON au1.id1 = a1.id INNER JOIN
author a2
ON au2.id1 = a2.id
group by a1.name, a2.name;
In your data structure, id1 and id2 are lousy names. Have you considered something like idauthor and idpaper or something like that?
The above query counts the intersection correctly, but not the total, because the initial inner join only keeps the papers both authors share. One way around this is a full outer join, but MySQL does not support that. We can do this with additional subqueries instead:
SELECT a1.name, a2.name,
COUNT(distinct case when au1.id2 = au2.id2 then au1.id2 end) as CommonPapers,
(ap1.NumPapers + ap2.NumPapers - COUNT(distinct case when au1.id2 = au2.id2 then au1.id2 end)
) as TotalPapers
FROM authorship au1 INNER JOIN
authorship au2
ON au1.id2 = au2.id2 AND au1.id1 <> au2.id1 INNER JOIN
author a1
ON au1.id1 = a1.id INNER JOIN
author a2
ON au2.id1 = a2.id inner join
(select au.id1, count(*) as numpapers
from authorship au
group by au.id1
) ap1
on ap1.id1 = au1.id1 inner join
(select au.id1, count(*) as numpapers
from authorship au
group by au.id1
) ap2
on ap2.id1 = au2.id1
group by a1.name, a2.name, ap1.numpapers, ap2.numpapers;
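The approach with per-author paper counts can be tried out on a tiny data set. This is a minimal sketch run against SQLite from Python rather than MySQL, so that it is self-contained. The sample rows, the `weights` dictionary and the `au1.id1 < au2.id1` condition (which keeps each pair once instead of twice) are assumptions made for illustration; the table and column names follow the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE author (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE authorship (id INTEGER PRIMARY KEY, id1 INTEGER, id2 INTEGER);
INSERT INTO author VALUES (1, 'Kurt'), (2, 'Michael'), (3, 'Roy');
-- made-up data: Kurt wrote papers 1,2,3; Michael 2,3,4; Roy 3
INSERT INTO authorship (id1, id2) VALUES
  (1, 1), (1, 2), (1, 3), (2, 2), (2, 3), (2, 4), (3, 3);
""")

query = """
SELECT a1.name, a2.name,
       COUNT(DISTINCT au1.id2) AS common_papers,
       ap1.num_papers + ap2.num_papers - COUNT(DISTINCT au1.id2) AS total_papers
FROM authorship au1
JOIN authorship au2 ON au1.id2 = au2.id2 AND au1.id1 < au2.id1
JOIN author a1 ON a1.id = au1.id1
JOIN author a2 ON a2.id = au2.id1
-- papers per author, computed once instead of once per output row
JOIN (SELECT id1, COUNT(*) AS num_papers FROM authorship GROUP BY id1) ap1
  ON ap1.id1 = au1.id1
JOIN (SELECT id1, COUNT(*) AS num_papers FROM authorship GROUP BY id1) ap2
  ON ap2.id1 = au2.id1
GROUP BY a1.id, a2.id, a1.name, a2.name, ap1.num_papers, ap2.num_papers
ORDER BY a1.id, a2.id
"""
rows = conn.execute(query).fetchall()

# edge weight: 1 - |intersection| / |union|
weights = {(n1, n2): 1 - common / total for n1, n2, common, total in rows}
```

Grouping by the author ids rather than only by name avoids merging two different authors who happen to share a name.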
What is the right way to select films whose labels include both 'Action' AND 'Drama' using INNER JOIN?
I've tried this query. The result should be 'Taken, The Godfather', but no rows are returned.
SELECT
f.film_guid,
f.film_name
FROM
films as f
INNER JOIN
film_labels as l ON l.film_guid = f.film_guid
WHERE
l.label = 'Action' AND l.label = 'Drama'
Table: films
+------------+----------------+
| film_guid | film_name |
+------------+----------------+
| filmguid_1 | Taken |
| filmguid_2 | Matrix |
| filmguid_3 | The Godfather |
+------------+----------------+
Table: film_labels
+------------+----------------+
| film_guid | label |
+------------+----------------+
| filmguid_1 | Action |
| filmguid_1 | Drama |
| filmguid_1 | Family |
| filmguid_2 | Action |
| filmguid_3 | Action |
| filmguid_3 | Drama |
+------------+----------------+
You are looking for a single row in film_labels whose label is both Action and Drama, which cannot happen. You need to search across all the labels that belong to a given film, which suggests aggregation:
SELECT f.film_guid, f.film_name
FROM films as f
INNER JOIN film_labels as l ON l.film_guid = f.film_guid
WHERE l.label IN ('Action', 'Drama') -- either one, or the other
GROUP BY f.film_guid, f.film_name
HAVING COUNT(*) = 2 -- both match
Note that you could also use exists with correlated subquery. It is a bit longer to type but could be more efficient (with the right indexes in place), since it avoids the need for aggregation:
SELECT f.*
FROM films as f
WHERE
EXISTS (SELECT 1 FROM film_labels l WHERE l.film_guid = f.film_guid AND l.label = 'Action')
AND EXISTS (SELECT 1 FROM film_labels l WHERE l.film_guid = f.film_guid AND l.label = 'Drama')
For performance with the second query, you want an index on film_labels(film_guid , label).
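Both variants can be checked against the sample data from the question. The following is a sketch using SQLite from Python instead of MySQL; the index name `idx_film_labels` and the `ORDER BY` clauses (added only to make the output order predictable) are assumptions, not part of the original answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE films (film_guid TEXT PRIMARY KEY, film_name TEXT);
CREATE TABLE film_labels (film_guid TEXT, label TEXT);
-- composite index that supports the EXISTS lookups
CREATE INDEX idx_film_labels ON film_labels (film_guid, label);
INSERT INTO films VALUES
  ('filmguid_1', 'Taken'), ('filmguid_2', 'Matrix'), ('filmguid_3', 'The Godfather');
INSERT INTO film_labels VALUES
  ('filmguid_1', 'Action'), ('filmguid_1', 'Drama'), ('filmguid_1', 'Family'),
  ('filmguid_2', 'Action'),
  ('filmguid_3', 'Action'), ('filmguid_3', 'Drama');
""")

# Variant 1: filter to the wanted labels, then require that both were found.
# COUNT(*) = 2 relies on (film_guid, label) being unique.
agg_rows = conn.execute("""
SELECT f.film_guid, f.film_name
FROM films AS f
INNER JOIN film_labels AS l ON l.film_guid = f.film_guid
WHERE l.label IN ('Action', 'Drama')
GROUP BY f.film_guid, f.film_name
HAVING COUNT(*) = 2
ORDER BY f.film_guid
""").fetchall()

# Variant 2: one EXISTS probe per required label, no aggregation.
exists_rows = conn.execute("""
SELECT f.film_guid, f.film_name
FROM films AS f
WHERE EXISTS (SELECT 1 FROM film_labels l
              WHERE l.film_guid = f.film_guid AND l.label = 'Action')
  AND EXISTS (SELECT 1 FROM film_labels l
              WHERE l.film_guid = f.film_guid AND l.label = 'Drama')
ORDER BY f.film_guid
""").fetchall()
```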
I have two tables:
person
+-----+------------+---------------+
| id | name | address |
+-----+------------+---------------+
| 1 | John Smith | 123 North St. |
| 2 | Joe Dirt | 456 South St. |
+-----+------------+---------------+
person_fields
+-----+------------+-----------+-------+
| id | type | person_id | value |
+-----+------------+-----------+-------+
| 1 | isHappy | 1 | 1 |
| 2 | hasFriends | 1 | 1 |
| 3 | hasFriends | 2 | 1 |
I want to select all the people from person for whom isHappy AND hasFriends is TRUE. Here's what I have tried:
SELECT person.*
FROM person
INNER JOIN person_fields
ON person.id = person_fields.person_id
WHERE
(person_fields.type = 'isHappy' AND person_fields.value IS TRUE)
AND
(person_fields.type = 'hasFriends' AND person_fields.value IS TRUE)
Unfortunately, this does not work because you can't have a single record in person_fields that has type = 'isHappy' AND type = 'hasFriends'. I can't OR these two conditions because that would return both John Smith and Joe Dirt, but I only want John Smith because he is the only one who is happy and has friends at the same time.
Any suggestions? Thanks in advance!
The standard solution looks like this:
SELECT person_id
FROM person_fields
WHERE type IN ('isHappy','hasFriends')
GROUP BY person_id
HAVING COUNT(1) = 2;
...where '2' is equal to the number of arguments in IN().
Note that this assumes that (person_id,type) is UNIQUE.
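The query above returns only person_id and does not look at value. A possible extension, sketched here with SQLite from Python, joins back to person and also requires value = 1. Storing value as an integer 0/1 and using COUNT(DISTINCT type) (which drops the uniqueness assumption) are choices made for this sketch, not something the answer prescribes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT, address TEXT);
CREATE TABLE person_fields (id INTEGER PRIMARY KEY, type TEXT,
                            person_id INTEGER, value INTEGER);
INSERT INTO person VALUES
  (1, 'John Smith', '123 North St.'), (2, 'Joe Dirt', '456 South St.');
INSERT INTO person_fields VALUES
  (1, 'isHappy', 1, 1), (2, 'hasFriends', 1, 1), (3, 'hasFriends', 2, 1);
""")

# Keep only the two wanted field types that are true, then require
# that each person has both of them.
happy_with_friends = conn.execute("""
SELECT p.id, p.name
FROM person p
INNER JOIN person_fields f ON f.person_id = p.id
WHERE f.type IN ('isHappy', 'hasFriends') AND f.value = 1
GROUP BY p.id, p.name
HAVING COUNT(DISTINCT f.type) = 2
""").fetchall()
```

With the sample rows only John Smith qualifies, because Joe Dirt has no isHappy row.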
By joining twice:
SELECT person.*
FROM person
INNER JOIN person_fields happy
ON person.id = happy.person_id AND happy.type='isHappy' AND happy.value
INNER JOIN person_fields friends
ON person.id = friends.person_id AND friends.type='hasFriends' AND friends.value
You can join person_fields twice, once for isHappy and once for hasFriends.
SELECT p.*
FROM person p
INNER JOIN person_fields f1 ON p.id = f1.person_id
INNER JOIN person_fields f2 ON f1.person_id = f2.person_id
WHERE f1.type = 'isHappy' AND f2.type = 'hasFriends'
I'm not sure where the value field comes into this, but you can add an extra condition or two if you need it:
AND f1.value = 1 AND f2.value = 1
If you always know how many conditions you want to check for, and you only want the person records that meet all of them, you can count the matching types in a subquery and filter on that count in the HAVING clause:
SELECT
p.id,
p.name,
(SELECT COUNT(DISTINCT pf.type) FROM person_fields AS pf WHERE pf.person_id = p.id and pf.value = true) AS types
FROM
person AS p
HAVING types = 2
Result:
id name types
1 John Smith 2
I have 2 tables: Equipment, and Equipment_Type:
+-------------------------------+
| EqId [PK] | ETId [FK] | EqNum |
+-------------------------------+
| 1 | 1 | ABC |
| 2 | 1 | DEF |
| 3 | 3 | GHI |
+-------------------------------+
| ETId [PK] | Code | Discipline |
+-------------------------------+
| 1 | MOT | ELEC |
| 2 | MOT | MECH |
| 3 | SW | ELEC |
So from this example, we can see that both of our equipment are electrical motors.
However, due to a misunderstanding in the initial population, all of the equipment types were identified as ELEC disciplines. Since then, the MECH equipment has been identified, and I have to find all of the equipment whose type is duplicated in the Equipment_Type table, and change it to reference the MECH equipment types instead.
I tried this:
SELECT * FROM Equipment EQ
INNER JOIN Equipment_Type ET on ET.ETId = EQ.ETId
WHERE ET.Discipline = 'MECH';
Which (obviously) returns no results - as with all the other JOIN queries.
What I want to achieve is a search that will select only the Equipment that has an ELEC Equipment Type that is also a MECH equipment type. I realise this requires a nested query, but I'm not sure where to place it.
So the search should return:
+---------------------------+
| EqNum | ETId | Discipline |
+---------------------------+
| DEF | 1 | ELEC |
Because that entry needs to be changed to the MECH discipline (i.e. ETId = 2 instead of 1)
Here is one method that aggregates the types to get the codes that have both disciplines:
select e.*
from equipment e join
equipment_type et
on e.etid = et.etid join
(select et.code
from equipment_type et
group by et.code
having sum(discipline = 'MECH') > 0 and sum(discipline = 'ELEC') > 0
) ett
on ett.code = et.code;
Another method would use two joins:
select e.*
from equipment e join
equipment_type ete
on e.etid = ete.etid and ete.discipline = 'ELEC' join
equipment_type etm
on ete.code = etm.code and etm.discipline = 'MECH';
This version might be faster with the right indexes.
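The two-join version can be tried on the sample tables. This is a sketch using SQLite from Python; the output aliases `current_etid` and `mech_etid` are assumptions added so that the replacement type id is visible in the result.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Equipment_Type (ETId INTEGER PRIMARY KEY, Code TEXT, Discipline TEXT);
CREATE TABLE Equipment (EqId INTEGER PRIMARY KEY,
                        ETId INTEGER REFERENCES Equipment_Type (ETId),
                        EqNum TEXT);
INSERT INTO Equipment_Type VALUES (1, 'MOT', 'ELEC'), (2, 'MOT', 'MECH'), (3, 'SW', 'ELEC');
INSERT INTO Equipment VALUES (1, 1, 'ABC'), (2, 1, 'DEF'), (3, 3, 'GHI');
""")

# ete: the ELEC type the equipment points at today
# etm: a MECH type with the same code, i.e. the candidate replacement
candidates = conn.execute("""
SELECT e.EqNum, ete.ETId AS current_etid, etm.ETId AS mech_etid
FROM Equipment e
JOIN Equipment_Type ete ON e.ETId = ete.ETId AND ete.Discipline = 'ELEC'
JOIN Equipment_Type etm ON ete.Code = etm.Code AND etm.Discipline = 'MECH'
ORDER BY e.EqId
""").fetchall()
```

With the sample rows both ABC and DEF come back, since both reference ETId 1. The query can only list candidates; which of them really is mechanical has to come from elsewhere.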
Do it like this:
select a.EqId, a.EqNum, b.ETId, b.Discipline from
(select EqId, EqNum, ETId from Equipment) as a
left join
(select ETId, Discipline from Equipment_Type) as b
on a.ETId = b.ETId where b.Discipline = 'ELEC';
If you want to include 'MECH':
select a.EqId, a.EqNum, b.ETId, b.Discipline from
(select EqId, EqNum, ETId from Equipment) as a
left join
(select ETId, Discipline from Equipment_Type) as b
on a.ETId = b.ETId where b.Discipline = 'ELEC' or b.Discipline = 'MECH';
To change 'ELEC' to 'MECH' in the output:
select a.EqId, a.EqNum, b.ETId,
case when b.Discipline = 'ELEC' then 'MECH' else b.Discipline end as Discipline,
'MECH' as MECH_FIELD from
(select EqId, EqNum, ETId from Equipment) as a
left join
(select ETId, Discipline from Equipment_Type) as b
on a.ETId = b.ETId where b.Discipline = 'ELEC' or b.Discipline = 'MECH';
Suppose I have two tables, people and emails. emails has a person_id, an address, and an is_primary:
people:
id
emails:
person_id
address
is_primary
To get all email addresses per person, I can do a simple join:
select * from people join emails on people.id = emails.person_id
What if I only want (at most) one row from the right table for each row in the left table? And, if a particular person has multiple emails and one is marked as is_primary, is there a way to prefer which row to use when joining?
So, if I have
people: emails:
------ -----------------------------------------
| id | | id | person_id | address | is_primary |
------ -----------------------------------------
| 1 | | 1 | 1 | a#b.c | true |
| 2 | | 2 | 1 | b#b.c | false |
| 3 | | 3 | 2 | c#b.c | true |
| 4 | | 4 | 4 | d#b.c | false |
------ -----------------------------------------
is there a way to get this result:
------------------------------------------------
| people.id | emails.id | address | is_primary |
------------------------------------------------
| 1 | 1 | a#b.c | true | // chosen over b#b.c because it's primary
| 2 | 3 | c#b.c | true |
| 3 | null | null | null | // no email for person 3
| 4 | 4 | d#b.c | false | // no primary email for person 4
------------------------------------------------
You have misunderstood a bit how left/right joins work.
This join
select * from people join emails on people.id = emails.person_id
will get you every column from both tables for all records that match your ON condition.
The left join
select * from people left join emails on people.id = emails.person_id
will give you every record from people, regardless if there's a corresponding record in emails or not. When there's not, the columns from the emails table will just be NULL.
If a person has multiple emails, the result will contain multiple records for this person. Beginners often wonder why the data appears duplicated.
If you want to restrict the data to the rows where is_primary has the value 1, you can do so in the WHERE clause when you're doing an inner join (your first query, although you omitted the inner keyword).
When you have a left/right join query, you have to put this filter in the ON clause. If you would put it in the WHERE clause, you would turn the left/right join into an inner join implicitly, because the WHERE clause would filter the NULL rows that I mentioned above. Or you could write the query like this:
select * from people left join emails on people.id = emails.person_id
where (emails.is_primary = 1 or emails.is_primary is null)
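The difference between filtering in ON and filtering in WHERE can be seen on the sample data from the question. This sketch uses SQLite from Python and stores is_primary as 0/1; the variable names are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE people (id INTEGER PRIMARY KEY);
CREATE TABLE emails (id INTEGER PRIMARY KEY, person_id INTEGER,
                     address TEXT, is_primary INTEGER);
INSERT INTO people VALUES (1), (2), (3), (4);
INSERT INTO emails VALUES
  (1, 1, 'a#b.c', 1), (2, 1, 'b#b.c', 0), (3, 2, 'c#b.c', 1), (4, 4, 'd#b.c', 0);
""")

# Filter in ON: every person is kept, non-primary emails become NULL.
on_rows = conn.execute("""
SELECT p.id, e.address
FROM people p LEFT JOIN emails e ON e.person_id = p.id AND e.is_primary = 1
ORDER BY p.id
""").fetchall()

# Filter in WHERE: the NULL rows are removed, so this acts like an inner join.
where_rows = conn.execute("""
SELECT p.id, e.address
FROM people p LEFT JOIN emails e ON e.person_id = p.id
WHERE e.is_primary = 1
ORDER BY p.id
""").fetchall()

# WHERE with "or ... is null": keeps people without any email,
# but still drops people who only have non-primary emails.
or_null_rows = conn.execute("""
SELECT p.id, e.address
FROM people p LEFT JOIN emails e ON e.person_id = p.id
WHERE e.is_primary = 1 OR e.is_primary IS NULL
ORDER BY p.id
""").fetchall()
```

Note that in the last variant person 4 disappears, because their only email row exists but is not primary, so the three forms are not interchangeable.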
EDIT after clarification:
Paul Spiegel's answer is good, therefore my upvote, but I'm not sure if it performs well, since it has a dependent subquery. So I created this query. It may depend on your data though. Try both answers.
select
p.*,
coalesce(e1.address, e2.address) AS address
from people p
left join emails e1 on p.id = e1.person_id and e1.is_primary = 1
left join (
select person_id, address
from emails e
where id = (select min(id) from emails where emails.is_primary = 0 and emails.person_id = e.person_id)
) e2 on p.id = e2.person_id
Use a correlated subquery with LIMIT 1 in the ON clause of the LEFT JOIN:
select *
from people p
left join emails e
on e.person_id = p.id
and e.id = (
select e1.id
from emails e1
where e1.person_id = e.person_id
order by e1.is_primary desc, -- true first
e1.id -- If e1.is_primary is ambiguous
limit 1
)
order by p.id
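A sketch of the same query run on the question's sample data, using SQLite from Python instead of MySQL. is_primary is stored as 0/1 here, which is an assumption about the column type.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE people (id INTEGER PRIMARY KEY);
CREATE TABLE emails (id INTEGER PRIMARY KEY, person_id INTEGER,
                     address TEXT, is_primary INTEGER);
INSERT INTO people VALUES (1), (2), (3), (4);
INSERT INTO emails VALUES
  (1, 1, 'a#b.c', 1), (2, 1, 'b#b.c', 0), (3, 2, 'c#b.c', 1), (4, 4, 'd#b.c', 0);
""")

# For each person, the subquery picks exactly one email id:
# a primary one if it exists, otherwise the lowest id.
one_email_per_person = conn.execute("""
SELECT p.id, e.id, e.address, e.is_primary
FROM people p
LEFT JOIN emails e
  ON e.person_id = p.id
 AND e.id = (
       SELECT e1.id
       FROM emails e1
       WHERE e1.person_id = e.person_id
       ORDER BY e1.is_primary DESC, e1.id
       LIMIT 1
     )
ORDER BY p.id
""").fetchall()
```

An index on emails(person_id, is_primary, id) would likely help the subquery on larger tables, though that depends on the data and the engine.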
I am trying to run a count to get the total number of results for pagination, but the query is too slow (2.12 s):
+-------+
| size |
+-------+
| 50000 |
+-------+
1 row in set (2.12 sec)
My count query:
select count(appeloffre0_.ID_APPEL_OFFRE) as size
from ao.appel_offre appeloffre0_
inner join ao.acheteur acheteur1_
on appeloffre0_.ID_ACHETEUR=acheteur1_.ID_ACHETEUR
where
(exists (select 1 from ao.lot lot2_ where lot2_.ID_APPEL_OFFRE=appeloffre0_.ID_APPEL_OFFRE and lot2_.ESTIMATION_COUT>=1))
and (exists (select 1 from ao.lieu_execution lieuexecut3_ where lieuexecut3_.appel_offre=appeloffre0_.ID_APPEL_OFFRE and lieuexecut3_.region=1))
and (exists (select 1 from ao.ao_activite aoactivite4_ where aoactivite4_.ID_APPEL_OFFRE=appeloffre0_.ID_APPEL_OFFRE and (aoactivite4_.ID_ACTIVITE=1)))
and appeloffre0_.DATE_OUVERTURE_PLIS>'2015-01-01'
and (appeloffre0_.CATEGORIE='fournitures' or appeloffre0_.CATEGORIE='travaux' or appeloffre0_.CATEGORIE='services')
and acheteur1_.ID_ENTITE_MERE=2
EXPLAIN output:
+----+--------------------+--------------+------+---------------------------------------------+--------------------+---------+--------------------------------+-------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+--------------------+--------------+------+---------------------------------------------+--------------------+---------+--------------------------------+-------+--------------------------+
| 1 | PRIMARY | acheteur1_ | ref | PRIMARY,acheteur_ibfk_1 | acheteur_ibfk_1 | 5 | const | 3 | Using where; Using index |
| 1 | PRIMARY | appeloffre0_ | ref | appel_offre_ibfk_2 | appel_offre_ibfk_2 | 4 | ao.acheteur1_.ID_ACHETEUR | 31061 | Using where |
| 4 | DEPENDENT SUBQUERY | aoactivite4_ | ref | ao_activites_activite_fk,ao_activites_ao_fk | ao_activites_ao_fk | 4 | ao.appeloffre0_.ID_APPEL_OFFRE | 3 | Using where |
| 3 | DEPENDENT SUBQUERY | lieuexecut3_ | ref | fk_ao_lieuex,fk_region_lieuex | fk_ao_lieuex | 4 | ao.appeloffre0_.ID_APPEL_OFFRE | 1 | Using where |
| 2 | DEPENDENT SUBQUERY | lot2_ | ref | FK_LOT_AO | FK_LOT_AO | 4 | ao.appeloffre0_.ID_APPEL_OFFRE | 5 | Using where |
+----+--------------------+--------------+------+---------------------------------------------+--------------------+---------+--------------------------------+-------+--------------------------+
The index acheteur_ibfk_1 is a foreign key referencing the table ENTITE_MERE; it is used because I have acheteur1_.ID_ENTITE_MERE=2 in the WHERE clause.
You can have multiple conditions on your joins by using ON condition1 AND condition2 etc.
SELECT COUNT(appeloffre0_.ID_APPEL_OFFRE) as size
FROM ao.appel_offre appeloffre0_
JOIN ao.acheteur acheteur1_ ON appeloffre0_.ID_ACHETEUR=acheteur1_.ID_ACHETEUR
JOIN ao.lot lot2_ ON appeloffre0_.ID_APPEL_OFFRE=lot2_.ID_APPEL_OFFRE AND lot2_.ESTIMATION_COUT>=1
JOIN ao.lieu_execution lieuexecut3_ ON appeloffre0_.ID_APPEL_OFFRE=lieuexecut3_.appel_offre AND lieuexecut3_.region=1
JOIN ao.ao_activite aoactivite4_ ON appeloffre0_.ID_APPEL_OFFRE=aoactivite4_.ID_APPEL_OFFRE AND aoactivite4_.ID_ACTIVITE=1
WHERE appeloffre0_.DATE_OUVERTURE_PLIS>'2015-01-01'
AND (appeloffre0_.CATEGORIE='fournitures' OR appeloffre0_.CATEGORIE='travaux' OR appeloffre0_.CATEGORIE='services')
AND acheteur1_.ID_ENTITE_MERE=2;
You can try:
select count(aa.ID_APPEL_OFFRE) as size
from (
select ID_APPEL_OFFRE, ID_ACHETEUR from ao.appel_offre appeloffre0_
inner join ao.acheteur acheteur1_
on appeloffre0_.ID_ACHETEUR=acheteur1_.ID_ACHETEUR
where appeloffre0_.DATE_OUVERTURE_PLIS>'2015-01-01'
and (appeloffre0_.CATEGORIE in ('fournitures','travaux','services'))
and (acheteur1_.ID_ENTITE_MERE=2)) aa
inner join ao.lot lot2_ on lot2_.ID_APPEL_OFFRE=aa.ID_APPEL_OFFRE
inner join ao.lieu_execution lieuexecut3_ on lieuexecut3_.appel_offre=aa.ID_APPEL_OFFRE
inner join ao.ao_activite aoactivite4_ on aoactivite4_.ID_APPEL_OFFRE=aa.ID_APPEL_OFFRE
where
aoactivite4_.ID_ACTIVITE=1
and lot2_.ESTIMATION_COUT>=1
and lieuexecut3_.region=1;
But I haven't seen your tables so I am not 100% sure that you won't get duplicates because of joins.
A couple of low-hanging fruits might also be found by ensuring that your appeloffre0_.CATEGORIE and appeloffre0_.DATE_OUVERTURE_PLIS have indexes on them.
Other fields which should have indexes on them are ao.lot.ID_APPEL_OFFRE, ao.lieu_execution.appel_offre, ao.ao_activite.ID_APPEL_OFFRE and ao.appel_offre.ID_ACHETEUR (all the joined fields).
I would have the following indexes on your tables if not already... These are covering indexes for your query meaning the index has the applicable column to get your results without having to go to the actual raw data pages.
table index
appel_offre ( DATE_OUVERTURE_PLIS, CATEGORIE, ID_APPEL_OFFRE, ID_ACHETEUR )
lot ( ID_APPEL_OFFRE, ESTIMATION_COUT )
lieu_execution ( appel_offre, region )
ao_activite ( ID_APPEL_OFFRE, ID_ACTIVITE )
Having indexes on just individual columns won't really help optimize what you are looking for. Also, I am counting DISTINCT ID_APPEL_OFFRE values so that, if any of the JOINed tables has more than one matching record, the multiplied join rows do not inflate the count:
select
count(distinct AOF.ID_APPEL_OFFRE) as size
from
ao.appel_offre AOF
JOIN ao.acheteur ACH
on AOF.ID_ACHETEUR = ACH.ID_ACHETEUR
and ACH.ID_ENTITE_MERE = 2
JOIN ao.lot
ON AOF.ID_APPEL_OFFRE = lot.ID_APPEL_OFFRE
and lot.ESTIMATION_COUT >= 1
JOIN ao.lieu_execution EX
ON AOF.ID_APPEL_OFFRE = EX.appel_offre
and EX.region = 1
JOIN ao.ao_activite ACT
ON AOF.ID_APPEL_OFFRE = ACT.ID_APPEL_OFFRE
and ACT.ID_ACTIVITE = 1
where
AOF.DATE_OUVERTURE_PLIS > '2015-01-01'
and ( AOF.CATEGORIE = 'fournitures'
or AOF.CATEGORIE = 'travaux'
or AOF.CATEGORIE = 'services')
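Why the DISTINCT matters once EXISTS is replaced by a JOIN can be shown on a toy schema. The tables `tender` and `lot` and their rows are hypothetical stand-ins for appel_offre and lot, and the sketch runs on SQLite from Python rather than MySQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tender (id INTEGER PRIMARY KEY);
CREATE TABLE lot (id INTEGER PRIMARY KEY, tender_id INTEGER, cost INTEGER);
INSERT INTO tender VALUES (1), (2), (3);
-- tender 1 has two qualifying lots, tender 2 has one, tender 3 has none
INSERT INTO lot (tender_id, cost) VALUES (1, 10), (1, 20), (2, 5), (3, 0);
""")

# Plain JOIN: tender 1 is counted once per matching lot.
join_count = conn.execute("""
SELECT COUNT(t.id) FROM tender t
JOIN lot l ON l.tender_id = t.id AND l.cost >= 1
""").fetchone()[0]

# JOIN with DISTINCT: each tender is counted once.
distinct_count = conn.execute("""
SELECT COUNT(DISTINCT t.id) FROM tender t
JOIN lot l ON l.tender_id = t.id AND l.cost >= 1
""").fetchone()[0]

# EXISTS: each tender is tested once, so no duplicates arise.
exists_count = conn.execute("""
SELECT COUNT(t.id) FROM tender t
WHERE EXISTS (SELECT 1 FROM lot l WHERE l.tender_id = t.id AND l.cost >= 1)
""").fetchone()[0]
```

The plain join reports 3 here while the other two report 2, so a join-based rewrite of the original count needs the DISTINCT to return the same number as the EXISTS version.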
As @FuzzyTree said in his comment, EXISTS is faster than an inner join if it's not a 1:1 relationship, because it terminates as soon as it finds one match, whereas the join fetches every matching row.
But the solution in our case was to use IN instead of EXISTS:
where ( appeloffre0_.ID_APPEL_OFFRE IN (select lot2_.ID_APPEL_OFFRE from ao.lot lot2_
where lot2_.ESTIMATION_COUT>=1)
)
With that change the query runs much faster than with EXISTS or joins:
select count(appeloffre0_.ID_APPEL_OFFRE) as size
from ao.appel_offre appeloffre0_
inner join ao.acheteur acheteur1_
on appeloffre0_.ID_ACHETEUR=acheteur1_.ID_ACHETEUR
where
( appeloffre0_.ID_APPEL_OFFRE IN (select lot2_.ID_APPEL_OFFRE from ao.lot lot2_ where lot2_.ESTIMATION_COUT>=1))
and (appeloffre0_.ID_APPEL_OFFRE IN (select lieuexecut3_.appel_offre from ao.lieu_execution lieuexecut3_ where lieuexecut3_.region=1))
and (appeloffre0_.ID_APPEL_OFFRE IN (select aoactivite4_.ID_APPEL_OFFRE from ao.ao_activite aoactivite4_ where aoactivite4_.ID_ACTIVITE=1 ))
and appeloffre0_.DATE_OUVERTURE_PLIS>'2015-01-01'
and (appeloffre0_.CATEGORIE='fournitures' or appeloffre0_.CATEGORIE='travaux' or appeloffre0_.CATEGORIE='services')
and acheteur1_.ID_ENTITE_MERE=2