Getting stuck doing a complicated SQL query for patent research purposes - mysql

I am trying to gather data for a research study for my university thesis. Unfortunately I am not a computer science or programming expert and do not have any SQL experience.
For my thesis I need to write a SQL query answering the question: "Give me all patents of a company X where there is more than one applicant (other company) in a specific time span". The data I want to extract is stored in a database called PATSTAT (for which I have a one-month trial), and it uses, don't be surprised, SQL.
I tried a lot of queries but all the time I am getting different syntax errors.
This is what the interface looks like:
http://www10.pic-upload.de/07.07.13/7u5bqf7jsow.png
I think I have a really good understanding of what (also from an SQL POV) needs to be done but I cannot execute it.
My idea: As result I want the names of the companies (with reference to the company entered below)
SELECT person_name from tls206_person table
Now because I need a criteria like
WHERE nb_applicants > 1 from tls201_appln table
I need to join these two tables, tls206 and tls201. I read a brief introductory guide to SQL (provided by the European Patent Office), and because the two tables have no common "reference key" we need to use the table tls207_pers_appln as an "intermediate", so to speak. That's the point where I am getting stuck. I tried the following, but it is not working:
SELECT person_name, tls201_appln.nb_applicants
FROM tls206_person
INNER JOIN tls207_pers_appln ON tls206_person.person_id= tls207_pers_appln.person_id
INNER JOIN tls207_pers_appln ON tls201_appln.appln_id=tls201_appln.appln_id
WHERE person_name = "%Samsung%"
AND tls201_appln.nb_applicants > 1
AND tls201_appln.ipr_type = "PI"
I get the following error: "0:37:11 [SELECT - 0 row(s), 0 secs] [Error Code: 1064, SQL State: 0] Not unique table/alias: 'tls207_pers_appln'"
I think for just 4 hours of SQL my approach is not too bad, but I really need some guidance on how to proceed because I am not making any progress.
Ideally I would like to count (for every company) and for every row respectively how many "nb_applicants" were found.
If you need further information for giving me guidance, just let me know.
Looking forward to your answers.
Best regards
Kendels

Another way of doing the same thing, which you might find easier to understand (if you are new to SQL, it is impressive you have got so far), is:
SELECT tls206_person.person_name, tls201_appln.nb_applicants
FROM tls206_person, tls207_pers_appln, tls201_appln
WHERE tls206_person.person_id = tls207_pers_appln.person_id
AND tls207_pers_appln.appln_id = tls201_appln.appln_id
AND tls206_person.person_name LIKE "%Samsung%"
AND tls201_appln.nb_applicants > 1
AND tls201_appln.ipr_type = "PI"
(It's equivalent to the other answer, but instead of using the JOIN syntax, you list all the tables in the FROM clause and write the join conditions in the WHERE clause, and the database works out the joins. This is often called the implicit or comma join syntax, the older style that predates the explicit JOIN keyword, if you want to google for more info. It is widely supported, including by MySQL.)

You are referencing the table tls201_appln, but it is not in the from clause. I am guessing that the second reference to tls207_pers_appln should be to the other table:
SELECT person_name, tls201_appln.nb_applicants
FROM tls206_person
INNER JOIN tls207_pers_appln ON tls206_person.person_id = tls207_pers_appln.person_id
INNER JOIN tls201_appln ON tls201_appln.appln_id = tls207_pers_appln.appln_id
WHERE person_name LIKE '%Samsung%'
AND tls201_appln.nb_applicants > 1
AND tls201_appln.ipr_type = "PI"
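To see the corrected join through the link table in action, here is a minimal sketch. It uses Python's built-in sqlite3 module rather than MySQL, and the three tables are toy stand-ins with invented rows (the 'UM' value and the extra appln_id output column are assumptions for illustration only), so treat it as a demonstration of the join shape, not of real PATSTAT data.

```python
import sqlite3

# Toy stand-ins for the three PATSTAT tables; all rows are invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tls206_person (person_id INTEGER PRIMARY KEY, person_name TEXT);
CREATE TABLE tls201_appln (appln_id INTEGER PRIMARY KEY, nb_applicants INTEGER, ipr_type TEXT);
CREATE TABLE tls207_pers_appln (person_id INTEGER, appln_id INTEGER);
INSERT INTO tls206_person VALUES (1, 'Samsung Electronics'), (2, 'Other Corp');
INSERT INTO tls201_appln VALUES (10, 2, 'PI'), (11, 1, 'PI'), (12, 3, 'UM');
INSERT INTO tls207_pers_appln VALUES (1, 10), (2, 10), (1, 11), (1, 12);
""")

# person -> link table -> application: each table appears exactly once.
rows = conn.execute("""
SELECT p.person_name, a.appln_id, a.nb_applicants
FROM tls206_person p
INNER JOIN tls207_pers_appln pa ON p.person_id = pa.person_id
INNER JOIN tls201_appln a ON a.appln_id = pa.appln_id
WHERE p.person_name LIKE '%Samsung%'
  AND a.nb_applicants > 1
  AND a.ipr_type = 'PI'
ORDER BY a.appln_id
""").fetchall()
print(rows)
```

Only application 10 survives the filters: application 11 has a single applicant and application 12 has a different ipr_type.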

For my thesis I need to do a SQL query answering the question: "Give me all patents of a company X where there is more than one applicant (other company) in a specific time span".
Let me rephrase that for you :
SELECT * FROM patents p -- : "Give me all patents
WHERE p.company = 'X' -- of a company X
AND EXISTS ( -- where there is
SELECT *
FROM applicants x1
WHERE x1.patent_id = p.patent_id
AND x1.company <> 'X' -- another company:: exclude ourselves
AND x1.application_date >= $begin_date -- in a specific time span
AND x1.application_date < $end_date
-- more than one applicant (other company)
-- To avoid aggregation: Just repeat the same subquery
AND EXISTS ( -- where there is
SELECT *
FROM applicants x2
WHERE x2.patent_id = p.patent_id
AND x2.company <> 'X' -- another company:: exclude ourselves
AND x2.company <> x1.company -- :: exclude other other company, too
AND x2.application_date >= $begin_date -- in a specific time span
AND x2.application_date < $end_date
)
)
;
[Note: Since the OP did not give any table definitions, I had to invent these]
This is not the perfect query, but it does express your intentions. Given sane keys/indexes it will perform reasonably, too.
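A small runnable sketch of the nested EXISTS idea, assuming the same invented patents and applicants tables. It runs on Python's built-in sqlite3 module instead of MySQL, replaces $begin_date and $end_date with named parameters, and uses made-up sample rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE patents (patent_id INTEGER PRIMARY KEY, company TEXT);
CREATE TABLE applicants (patent_id INTEGER, company TEXT, application_date TEXT);
INSERT INTO patents VALUES (1, 'X'), (2, 'X'), (3, 'Y');
INSERT INTO applicants VALUES (1, 'A', '2010-03-01'), (1, 'B', '2010-04-01');
INSERT INTO applicants VALUES (2, 'A', '2010-03-01');
INSERT INTO applicants VALUES (3, 'A', '2010-03-01'), (3, 'B', '2010-04-01');
""")

# The inner EXISTS demands a second applicant that differs from the first one.
query = """
SELECT p.patent_id FROM patents p
WHERE p.company = 'X'
  AND EXISTS (
    SELECT 1 FROM applicants x1
    WHERE x1.patent_id = p.patent_id
      AND x1.company <> 'X'
      AND x1.application_date >= :begin AND x1.application_date < :end
      AND EXISTS (
        SELECT 1 FROM applicants x2
        WHERE x2.patent_id = p.patent_id
          AND x2.company <> 'X'
          AND x2.company <> x1.company
          AND x2.application_date >= :begin AND x2.application_date < :end
      )
  )
ORDER BY p.patent_id
"""
params = {"begin": "2010-01-01", "end": "2011-01-01"}
patent_ids = [row[0] for row in conn.execute(query, params)]
print(patent_ids)
```

Patent 1 qualifies because it has two distinct other applicants in the window; patent 2 has only one, and patent 3 does not belong to company X.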

Related

Join to tables and String Compare (large data set)

I am very new to SQL and don't really know much about what I'm doing. I'm trying to figure out how to get a list of leads and owners whose corresponding campaign record types start with "inter".
So far I have tried joining the two tables and running a string compare I found on a different stack overflow page. Separately they work fine but together everything breaks... I only get the error "You have an error in your SQL syntax; check the manual"
select a.LeadId, b.OwnerId from
(select * from CampaignMember as a
join
select * from Campaign as b
on b.id = a.CampaignId)
where b.RecordTypeId like "inter%"
Schema:
Campaign CampaignMember
------------- ----------------
Id CampaignId
OwnerId LeadId
RecordTypeId ContactId
The string compare is also very slow. I am looking at a table of 600M values. Is there a faster alternative?
Is there also a way to get more specific errors in MySQL?
If you format your code properly, it will be very easy to see why it's not working.
select a.LeadId, b.OwnerId
from (
select *
from CampaignMember as a
join select *
from Campaign as b on b.id = a.CampaignId
)
where b.RecordTypeId like "inter%"
That is not a valid JOIN format. Also, in the last part, standard SQL uses single quotes (') rather than double quotes (") for string literals.
Probably what you want is something like this
SELECT a.LeadId, b.OwnerId
FROM CampaignMember a
JOIN Campaign b ON b.id = a.CampaignId
WHERE b.RecordTypeId LIKE 'inter%'
Try this:
select CampaignMember.LeadId, Campaign.OwnerId from
Campaign
inner join
CampaignMember
on CampaignMember.CampaignId= Campaign.id
where Campaign.RecordTypeId like "inter%"
MySQL is generally pretty poor at handling sub-selects, so you should avoid them when possible. Also, your sub-select isn't filtering any rows, so it has to evaluate every row before applying the LIKE filter. This is sometimes "intelligently" handled by the query engine, but you should try to minimize reliance on the engine to optimize the query.
Additionally, you really should only return the columns that you care about; SELECT * is ok for confirming things, but slows queries down.
Therefore, the query posted by Eric (above) is actually the best choice.

MySQL query and compare two different tables

I'm very new to SQL queries, so forgive me if this is a really easy question.
I have 2 database tables HWC and collection, HWC.id is referenced in collection.col
HWC
- id (PRIMARY)
- stuff
- more stuff
- lots more stuff
- Year
collection
- id (PRIMARY)
- userId
- col
Question:
I want to query the collection table for a specific user to see what entries from HWC they are missing.
I don't even know where to start logically, I don't expect anyone to build the query for me, but pointing me in the correct direction would be very much appreciated.
You want the items from HWC that the user is missing from their collection. This suggests a left outer join. In particular, you want to keep everything in the HWC table and find those rows that have no match in the user's collection:
select hwc.*
from hwc left join
collection c
on hwc.id = c.col and c.userId = @UserId
where c.col is null;
When learning SQL, students often learn this syntax:
select hwc.*
from hwc
where hwc.id not in (select c.col from collection c where c.userId = @UserId);
This is perfectly good SQL. Some databases don't do a great job optimizing not in. And, it can behave unexpectedly when c.col is NULL. For these reasons, this is often rewritten as a not exists query:
select hwc.*
from hwc
where not exists (select 1
from collection c
where c.col = hwc.id and c.userId = @UserId
);
I offer you these different alternatives because you are learning SQL. It is worth learning how all three work. In the future, you should find each of these mechanisms (left join, not in, and not exists) useful.
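The three approaches can be compared side by side with a minimal sketch. It assumes Python's built-in sqlite3 module in place of MySQL, a trimmed-down version of the two tables, invented rows, and a hypothetical user id of 7. It also shows the NULL pitfall of NOT IN mentioned above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE HWC (id INTEGER PRIMARY KEY, stuff TEXT);
CREATE TABLE collection (id INTEGER PRIMARY KEY, userId INTEGER, col INTEGER);
INSERT INTO HWC (id, stuff) VALUES (1, 'a'), (2, 'b'), (3, 'c');
INSERT INTO collection (userId, col) VALUES (7, 1), (7, 3), (8, 2);
""")

QUERIES = {
    "left_join": """
        SELECT h.id FROM HWC h
        LEFT JOIN collection c ON h.id = c.col AND c.userId = :uid
        WHERE c.col IS NULL ORDER BY h.id""",
    "not_in": """
        SELECT h.id FROM HWC h
        WHERE h.id NOT IN (SELECT c.col FROM collection c WHERE c.userId = :uid)
        ORDER BY h.id""",
    "not_exists": """
        SELECT h.id FROM HWC h
        WHERE NOT EXISTS (SELECT 1 FROM collection c
                          WHERE c.col = h.id AND c.userId = :uid)
        ORDER BY h.id""",
}

def missing_ids(name, uid=7):
    """Return the HWC ids that the given user does not have yet."""
    return [row[0] for row in conn.execute(QUERIES[name], {"uid": uid})]

before = {name: missing_ids(name) for name in QUERIES}

# One NULL in the subquery result makes NOT IN unknown for every non-matching row.
conn.execute("INSERT INTO collection (userId, col) VALUES (7, NULL)")
after = {name: missing_ids(name) for name in QUERIES}
print(before)
print(after)
```

With clean data all three forms agree. Once a NULL col sneaks in, the NOT IN version returns nothing, while the left join and NOT EXISTS versions still report the missing item.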
It sounds like you mean SQL JOINS.
SQL Joins Tutorial:
Let's say you want to query your collection like so:
SELECT collection.userId, HWC.stuff
FROM collection
INNER JOIN HWC ON collection.col = HWC.id
This will pick userId from collection and stuff from HWC for every pair of rows where collection.col matches HWC.id.
Hope I helped, good luck!

SQL query to select based on many-to-many relationship

This is really a two-part question, but in order not to mix things up, I'll divide into two actual questions. This one is about creating the correct SQL statement for selecting a row based on values in a many-to-many related table:
Now, the question is: what is the absolute simplest way of getting all resources where e.g metadata.category = subject AND where that category's corresponding metadata.value ='introduction'?
I'm sure this could be done in a lot of different ways, but I'm a novice in SQL, so please provide the simplest way possible... (If you could describe briefly what the statement means in plain English that would be great too. I have looked at introductions to SQL, but none of those I have found (for beginners) go into these many-to-many selections.)
The easiest way is to use the EXISTS clause. I'm more familiar with MSSQL but this should be close
SELECT *
FROM resources r
WHERE EXISTS (
SELECT *
FROM metadata_resources mr
INNER JOIN metadata m ON (mr.metadata_id = m.id)
WHERE mr.resource_id = r.id AND m.category = 'subject' AND m.value = 'introduction'
)
Translated into English, it's "return me all records where this subquery returns one or more rows, without returning the data for those rows". The subquery is correlated to the outer query by the predicate mr.resource_id = r.id, which uses the current outer row as the predicate value.
I'm sure you can google around for more examples of the EXISTS clause.
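Here is a short runnable sketch of the correlated EXISTS query. It assumes Python's built-in sqlite3 module instead of MySQL or MSSQL, a hypothetical title column on resources, and invented rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE resources (id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE metadata (id INTEGER PRIMARY KEY, category TEXT, value TEXT);
CREATE TABLE metadata_resources (resource_id INTEGER, metadata_id INTEGER);
INSERT INTO resources VALUES (1, 'Intro guide'), (2, 'Advanced guide');
INSERT INTO metadata VALUES
  (1, 'subject', 'introduction'),
  (2, 'subject', 'advanced'),
  (3, 'author', 'introduction');
INSERT INTO metadata_resources VALUES (1, 1), (2, 2), (2, 3);
""")

# A resource is kept if at least one linked metadata row has BOTH the
# wanted category and the wanted value.
matching = conn.execute("""
SELECT r.id, r.title
FROM resources r
WHERE EXISTS (
    SELECT 1
    FROM metadata_resources mr
    INNER JOIN metadata m ON mr.metadata_id = m.id
    WHERE mr.resource_id = r.id
      AND m.category = 'subject'
      AND m.value = 'introduction'
)
ORDER BY r.id
""").fetchall()
print(matching)
```

Resource 2 is linked to a 'subject' row and to an 'introduction' row, but never to a single row carrying both, so it is correctly excluded.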

How to retrieve "dynamic" attributes stored in multiple rows as normal records?

I have a system built on a relational MySQL database that allows people to store details of "leads". In addition, people can create their own columns under which to store data and then when adding new accounts can add data under them. The table structure looks like this:
LEADS -
id,
email,
user_id
ATTRIBUTES -
id,
attr_name,
user_id
ATTR_VALUES -
lead_id,
attr_id,
value,
user_id
Obviously in these tables "user_id" refers to a "Users" table that just contains people that can log into the system.
I am writing a function to output lead details and currently am just pulling through the basic lead details as a query, and then pulling through every attribute value associated with that lead (joining on the attributes table to get the name) and then joining the arrays in PHP. This is a little messy, and I was wondering if there was a way to do this in one SQL query. I have read a little about something called a "pivot table", but am struggling to understand how it works.
Any help would be greatly appreciated. Thanks!
You could do the pivoting in a single query like the following:
select l.id lead_id,
l.email,
group_concat(distinct case when a.attr_name = 'Home Phone' then v.value end) HomePhone,
...
from leads l
left join attr_values v on l.id = v.lead_id
left join attributes a on v.attr_id = a.id
group by l.id
You will need to include a separate group_concat-derived field for each attribute you want to display.
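As a rough sketch of this kind of pivot, the following uses Python's built-in sqlite3 module rather than MySQL and swaps group_concat(distinct ...) for MAX(CASE ...), a closely related conditional aggregation that keeps a single value per attribute. The 'City' attribute and all rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE leads (id INTEGER PRIMARY KEY, email TEXT, user_id INTEGER);
CREATE TABLE attributes (id INTEGER PRIMARY KEY, attr_name TEXT, user_id INTEGER);
CREATE TABLE attr_values (lead_id INTEGER, attr_id INTEGER, value TEXT, user_id INTEGER);
INSERT INTO leads VALUES (1, 'a@example.com', 100), (2, 'b@example.com', 100);
INSERT INTO attributes VALUES (1, 'Home Phone', 100), (2, 'City', 100);
INSERT INTO attr_values VALUES (1, 1, '555-0100', 100), (1, 2, 'Oslo', 100);
""")

# One output column per attribute; leads without values keep NULLs
# because of the LEFT JOINs.
pivoted = conn.execute("""
SELECT l.id AS lead_id,
       l.email,
       MAX(CASE WHEN a.attr_name = 'Home Phone' THEN v.value END) AS HomePhone,
       MAX(CASE WHEN a.attr_name = 'City' THEN v.value END) AS City
FROM leads l
LEFT JOIN attr_values v ON l.id = v.lead_id
LEFT JOIN attributes a ON v.attr_id = a.id
GROUP BY l.id, l.email
ORDER BY l.id
""").fetchall()
print(pivoted)
```

The column list is fixed in the query text, so every new attribute needs another MAX(CASE ...) line, which is the main limitation of pivoting in SQL.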
I would have a look at this link. It explains the fundamentals of a pivot:
"pivot table" or a "crosstab report" SQL Characteristic Functions: Do it without "if", "case", or "GROUP_CONCAT". Yes, there is use for this..."if" statements sometimes cause problems when used in combination. The simple secret, and it's also why they work in almost all databases, is the following functions:
sign(x) returns -1, 0, +1 for values x < 0, x = 0, x > 0 respectively
abs(sign(x)) returns 0 if x = 0, else 1 if x > 0 or x < 0
1 - abs(sign(x)) is the complement of the above, since this returns 1 only if x = 0
It also explains a simpler way of pivoting exams. Maybe this can shed some light on it?
What you probably want from MySQL is to turn an SQL value (attr_name in your case) into a column. This principle is called a pivot table (sometimes also cross table or crosstab query) and is not supported by MySQL. Not because MySQL is insufficient, but because the pivot operation is not a database operation: the result is not a normal database table and is not designed for further database operations. The only purpose of a pivot operation is presentation, which is why it belongs to the presentation layer, not the database.
Thus, every solution of trying to get a pivot table from mysql will always be hacky. What I recommend is to get the data from database in normal format, by simply doing something like:
select *
from attr_values join attributes on attr_id = attributes.id
join leads on leads.id = lead_id
and then transform the database output in the presentation language (PHP, JSP, Python or whatever you use).
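A minimal sketch of that presentation-layer transformation, assuming Python with its built-in sqlite3 module standing in for PHP and MySQL, and invented sample rows. Unlike the query above, it starts from leads with LEFT JOINs so that leads without any attribute values still show up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE leads (id INTEGER PRIMARY KEY, email TEXT, user_id INTEGER);
CREATE TABLE attributes (id INTEGER PRIMARY KEY, attr_name TEXT, user_id INTEGER);
CREATE TABLE attr_values (lead_id INTEGER, attr_id INTEGER, value TEXT, user_id INTEGER);
INSERT INTO leads VALUES (1, 'a@example.com', 100), (2, 'b@example.com', 100);
INSERT INTO attributes VALUES (1, 'Home Phone', 100), (2, 'City', 100);
INSERT INTO attr_values VALUES (1, 1, '555-0100', 100), (1, 2, 'Oslo', 100);
""")

# Fetch the data in its normal "long" shape: one row per (lead, attribute).
long_rows = conn.execute("""
SELECT leads.id, leads.email, attributes.attr_name, attr_values.value
FROM leads
LEFT JOIN attr_values ON attr_values.lead_id = leads.id
LEFT JOIN attributes ON attributes.id = attr_values.attr_id
ORDER BY leads.id
""").fetchall()

# Pivot in application code: one dict per lead, one key per attribute name.
records = {}
for lead_id, email, attr_name, value in long_rows:
    record = records.setdefault(lead_id, {"email": email})
    if attr_name is not None:  # leads without attributes yield NULL names
        record[attr_name] = value
print(records)
```

Because the attribute names become dictionary keys at run time, users can add new attributes without any change to the query.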
I'd be careful about assuming that a pivot will achieve your simplification goal. A pivot will only work if your attr_name values are consistent. Since you tied a user_id to them, I assume they aren't. In addition, you will have multiple values for one attr_name. I'm afraid a pivot table wouldn't produce the result you are looking for.
I would suggest that you keep your transactional and reporting tables separate. Have an ETL routine that will clean the data (i.e. make the attr_name and attr_value consistent) through translation. This will make your reports more meaningful.
In summary, for immediate output to the end user, PHP is the best you can do. For reporting, transform the EAV data into a row/column layout first before attempting to report on it.

At least one X but no Ys Query

I come across this pattern occasionally and I haven't found a terribly satisfactory way to solve it.
Say I have an employee table and a review table. Each employee can have more than one review. I want to find all the employees who have at least one "good" review but no "bad" reviews.
I haven't figured out how to make subselects work without knowing the employee ID beforehand, and I haven't figured out the right combination of joins to make this happen.
Is there a way to do this WITHOUT stored procedures, functions or bringing the data server side? I've gotten it to work with those but I'm sure there's another way.
Since you haven't posted your DB structure, I made some assumptions and simplifications (regarding the rating column, which is probably a number and not a character field). Adjust accordingly.
Solution 1: Using Joins
select distinct e.EmployeeId, e.Name
from employee e
left join reviews r1 on e.EmployeeId = r1.EmployeeId and r1.rating = 'good'
left join reviews r2 on e.EmployeeId = r2.EmployeeId and r2.rating = 'bad'
where r1.ReviewId is not null -- meaning there's at least one good review
and r2.ReviewId is null -- meaning there's no bad review
Solution 2: Grouping By and Filtering with Conditional Count
select e.EmployeeId, max(e.Name) Name
from employee e
left join reviews r on e.EmployeeId = r.EmployeeId
group by e.EmployeeId
having count(case r.rating when 'good' then 1 else null end) > 0
and count(case r.rating when 'bad' then 1 else null end) = 0
Both solutions are SQL ANSI compatible, which means both work with any RDBMS flavor that fully support SQL ANSI standards (which is true for most RDBMS).
As pointed out by @onedaywhen, the code will not work in MS Access (I have not tested this; I'm trusting his expertise on the subject).
But I have one saying on this (which might make some people upset): I hardly consider MS Access a RDBMS. I have worked with it in the past. Once you move on (Oracle, SQL Server, Firebird, PostGreSQL, MySQL, you name it), you do not ever want to come back. Seriously.
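Both solutions can be checked against a tiny data set. The sketch below assumes Python's built-in sqlite3 module instead of any particular server, the same guessed employee and reviews columns, and invented rows; the second join filters on r2.rating = 'bad'.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employee (EmployeeId INTEGER PRIMARY KEY, Name TEXT);
CREATE TABLE reviews (ReviewId INTEGER PRIMARY KEY, EmployeeId INTEGER, rating TEXT);
INSERT INTO employee VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cy'), (4, 'Dee');
INSERT INTO reviews (EmployeeId, rating) VALUES
  (1, 'good'), (1, 'good'), (2, 'good'), (2, 'bad'), (4, 'bad');
""")

# Solution 1: two left joins, one per rating; DISTINCT removes the
# duplicates caused by several good reviews.
with_joins = conn.execute("""
SELECT DISTINCT e.EmployeeId, e.Name
FROM employee e
LEFT JOIN reviews r1 ON e.EmployeeId = r1.EmployeeId AND r1.rating = 'good'
LEFT JOIN reviews r2 ON e.EmployeeId = r2.EmployeeId AND r2.rating = 'bad'
WHERE r1.ReviewId IS NOT NULL
  AND r2.ReviewId IS NULL
ORDER BY e.EmployeeId
""").fetchall()

# Solution 2: one join, then conditional counts per employee.
with_group_by = conn.execute("""
SELECT e.EmployeeId, MAX(e.Name) AS Name
FROM employee e
LEFT JOIN reviews r ON e.EmployeeId = r.EmployeeId
GROUP BY e.EmployeeId
HAVING COUNT(CASE r.rating WHEN 'good' THEN 1 END) > 0
   AND COUNT(CASE r.rating WHEN 'bad' THEN 1 END) = 0
ORDER BY e.EmployeeId
""").fetchall()
print(with_joins, with_group_by)
```

Ann has only good reviews and is returned by both queries. Bob has a bad review, Cy has no reviews at all, and Dee has only a bad one, so all three are filtered out.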
The question -- return rows on side A based on nonexistence of a match in B -- (employees with No "Bad" reviews) describes an "anti-semi join". There are numerous ways to accomplish this kind of query, at least 5 I've discovered in MS Sql 2005 and above.
I know this solution works in MSSQL 2000 and above, and is the most efficient out of the 5 ways I've tried in MS Sql 2005 and 2008. I am not sure if it will work in MySQL, but it should, as it reflects a rather common set operation.
Note, the IN clause gives the subquery access to the employee table in the outer scope.
SELECT EE.*
FROM employee EE
WHERE
EE.EmpKey IN (
SELECT RR.EmpKey
FROM review RR
WHERE RR.EmpKey = EE.EmpKey
AND RR.ScoreCategory = 'good'
)
AND
EE.EmpKey NOT IN (
SELECT RR.EmpKey
FROM review RR
WHERE RR.EmpKey = EE.EmpKey
AND RR.ScoreCategory = 'bad'
)
It's possible. The particular syntax depends on how you store 'good' and 'bad' reviews.
Suppose you had a classification column in review that had values 'good' and 'bad'.
Then you could do:
SELECT employee.*
FROM employee
JOIN review
ON employee.id=review.employee_id
GROUP BY employee.id
HAVING SUM(IF(classification='good',1,0))>0 -- count up #good reviews, > 0
AND SUM(IF(classification='bad',1,0))=0 -- count up #bad reviews, = 0.
SELECT ???? FROM employee, review
WHERE employee.id = review.id
GROUP BY employee.id
HAVING SUM(IF(review='good',1,0)) > 0 AND SUM(IF(review='bad',1,0)) = 0