Selecting IDs linked with CPC codes in the same column - mysql

I am using the PATSTAT database to select the APPLN_ID of patent applications that have one CPC classification symbol but not another. I need to do this in order to retrieve a control dataset of patents to verify my hypothesis.
PATSTAT is a relational database where each patent application has a set of attributes. The TLS224 table contains multiple rows with the same APPLN_ID and different CPC symbols. I want to retrieve the APPLN_IDs that have a set of symbols A but that do not have a set of symbols B.
From this example data
| APPLN_ID | CPC_CLASS_SYMBOL |
| 2345 | C07K 16/26 |
| 2345 | C07K2317/34 |
| 2345 | C07K2317/76 |
| 2345 | G01N 33/74 |
| 2345 | B01L 9/527 |
| 1000 | C07K2317/34 |
| 1000 | C07K 16/26 |
| 1000 | C07K2317/76 |
| 1000 | B01L 3/5025 |
| 9999 | B01L 3/5025 |
| 9999 | G01N2333/47 |
| 9999 | G01N2333/4727 |
I want to obtain this as a result.
| APPLN_ID |
| 1000 |
Here, the set of values A that must be included are 'C07K 16/26' ,'C07K2317/34', 'C07K2317/76', while the value B that must NOT be present is G01N 33/74.
How can I do that?
This is what I have come up with so far (I know that the WHERE IN and NOT IN clauses nullify each other, but it is just to show an example).
SELECT DISTINCT p2.APPLN_ID
FROM (SELECT p1.APPLN_ID, p1.PUBLN_AUTH, YEAR(p1.PUBLN_DATE)
FROM TLS211_PAT_PUBLN p1
WHERE YEAR(p1.PUBLN_DATE) = 2008
AND PUBLN_AUTH = 'WO') p2
JOIN (SELECT DISTINCT cpc3.APPLN_ID
FROM TLS224_APPLN_CPC cpc3
WHERE cpc3.APPLN_ID IN
(SELECT APPLN_ID
FROM TLS224_APPLN_CPC
WHERE CPC_CLASS_SYMBOL NOT IN ('G01N 33/74'))
AND cpc3.APPLN_ID IN
(SELECT APPLN_ID
FROM TLS224_APPLN_CPC
WHERE CPC_CLASS_SYMBOL IN ('C07K 16/26', 'C07K2317/34', 'C07K2317/76'))
) cpc1
ON cpc1.APPLN_ID = p2.APPLN_ID
I am still a newbie to SQL so any help is appreciated!
Thank you

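Your IN and NOT IN conditions don't make sense together.
If a CPC_CLASS_SYMBOL is in the first group, it is automatically NOT IN your second.
Both subqueries test one row at a time, so an application that has 'G01N 33/74' still passes the NOT IN test through any of its other rows. Your WHERE clause would therefore only give you the APPLN_IDs that have at least one of these symbols (possibly along with others); everything else is excluded.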
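One common way to express "has all of set A and none of set B" is to group by APPLN_ID and check both conditions in a HAVING clause. The following is a minimal sketch of that idea, not the asker's query: it loads the sample rows from the question into an in-memory SQLite database through Python's sqlite3 module (an assumption made so the example can run on its own; the SQL itself is standard and should carry over to MySQL), and it leaves out the join to TLS211_PAT_PUBLN.

```python
import sqlite3

# In-memory SQLite stands in for MySQL here (assumption for runnability).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE TLS224_APPLN_CPC (APPLN_ID INTEGER, CPC_CLASS_SYMBOL TEXT)"
)
sample_rows = [
    (2345, "C07K 16/26"), (2345, "C07K2317/34"), (2345, "C07K2317/76"),
    (2345, "G01N 33/74"), (2345, "B01L 9/527"),
    (1000, "C07K2317/34"), (1000, "C07K 16/26"), (1000, "C07K2317/76"),
    (1000, "B01L 3/5025"),
    (9999, "B01L 3/5025"), (9999, "G01N2333/47"), (9999, "G01N2333/4727"),
]
conn.executemany("INSERT INTO TLS224_APPLN_CPC VALUES (?, ?)", sample_rows)

query = """
SELECT APPLN_ID
FROM TLS224_APPLN_CPC
GROUP BY APPLN_ID
HAVING COUNT(DISTINCT CASE
           WHEN CPC_CLASS_SYMBOL IN ('C07K 16/26', 'C07K2317/34', 'C07K2317/76')
           THEN CPC_CLASS_SYMBOL END) = 3   -- all three symbols of set A present
   AND SUM(CASE WHEN CPC_CLASS_SYMBOL = 'G01N 33/74'
           THEN 1 ELSE 0 END) = 0           -- no symbol of set B present
"""
matching_ids = [row[0] for row in conn.execute(query)]
print(matching_ids)
```

With the sample data this keeps only application 1000: 2345 is dropped because it carries 'G01N 33/74', and 9999 because it has none of the required symbols. The resulting IDs could then be joined to the publication table as in the original attempt.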

Optimizing a conditional join in MySQL that depends on the character length of the source table

I'm using MySQL 5.7 and I'm trying to do a join with one of my source tables to a reference table in order to get the appropriate corresponding values. However, I'd like the join to be conditional so it can match according to the length of the value found in the source column.
Source Table
|---------------------|------------------|
| Company_Name | NAICS_Code |
|---------------------|------------------|
| Chem Inc | 325 |
|---------------------|------------------|
| Joe's Farming | 1112 |
|---------------------|------------------|
Reference Table
|--------------------|---------------------------|--------------------|---------------------------|
| NAICS_Code_3_Digit | NAICS_Code_3D_Description | NAICS_Code_4_Digit | NAICS_Code_4D_Description |
|--------------------|---------------------------|--------------------|---------------------------|
| 325                | Chemicals                 | 3252               | Resin and Rubber          |
|--------------------|---------------------------|--------------------|---------------------------|
| 111                | Crop Production           | 1112               | Fruit and Nuts            |
|--------------------|---------------------------|--------------------|---------------------------|
Final Table
|---------------|------------|---------------------------|---------------------------|
| Company_Name  | NAICS_Code | NAICS_Code_3D_Description | NAICS_Code_4D_Description |
|---------------|------------|---------------------------|---------------------------|
| Chem Inc      | 325        | Chemicals                 | NULL                      |
|---------------|------------|---------------------------|---------------------------|
| Joe's Farming | 1112       | Crop Production           | Fruit and Nuts            |
|---------------|------------|---------------------------|---------------------------|
While I'm able to write a query that works, it takes an extremely long time and I'm curious whether there is a better way. Here's what I've got so far:
SELECT src.Company_Name,
src.NAICS_Code,
CASE
WHEN LENGTH(src.NAICS_Code) < 3 THEN NULL
ELSE ref.NAICS_Code_3D_Description
END AS NAICS_Code_3D_Description,
CASE
WHEN LENGTH(src.NAICS_Code) < 4 THEN NULL
ELSE ref.NAICS_Code_4D_Description
END AS NAICS_Code_4D_Description
FROM source_table AS src
LEFT JOIN reference_table AS ref ON CASE
WHEN LENGTH(src.NAICS_Code) = 4
AND src.NAICS_Code = ref.NAICS_Code_4_Digit THEN 1
WHEN LENGTH(src.NAICS_Code) = 3
AND src.NAICS_Code = ref.NAICS_Code_3_Digit THEN 1
ELSE 0
END = 1;
It might be more efficient to left join twice:
this avoids the need for the complicated logic in the on clause of the join
conditions are exclusive so it will not generate duplicates in the resultset
then you can use coalesce() in the select clause
So:
select
s.company_name,
s.naics_code,
coalesce(r1.naics_code_3d_description, r2.naics_code_3d_description) naics_code_3d_description,
r2.naics_code_4d_description
from source_table s
left join reference_table r1 on r1.naics_code_3_digit = s.naics_code
left join reference_table r2 on r2.naics_code_4_digit = s.naics_code
If you want to exclude source rows that did not match in the reference table, you can add a where clause, like:
where r1.naics_code_3_digit is not null or r2.naics_code_4_digit is not null
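To see the double left join at work, here is a small sketch that runs it against the two sample rows from the question. It uses Python's sqlite3 module with an in-memory database as a stand-in for MySQL 5.7 (an assumption, chosen only so the example is self-contained); the column types are guesses, since the question does not state them.

```python
import sqlite3

# In-memory SQLite stands in for MySQL here (assumption for runnability).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE source_table (company_name TEXT, naics_code INTEGER);
CREATE TABLE reference_table (
    naics_code_3_digit INTEGER, naics_code_3d_description TEXT,
    naics_code_4_digit INTEGER, naics_code_4d_description TEXT
);
INSERT INTO source_table VALUES ('Chem Inc', 325), ('Joe''s Farming', 1112);
INSERT INTO reference_table VALUES
    (325, 'Chemicals', 3252, 'Resin and Rubber'),
    (111, 'Crop Production', 1112, 'Fruit and Nuts');
""")

query = """
SELECT s.company_name,
       s.naics_code,
       -- r1 matched on the 3-digit code, r2 on the 4-digit code
       COALESCE(r1.naics_code_3d_description, r2.naics_code_3d_description)
           AS naics_code_3d_description,
       r2.naics_code_4d_description
FROM source_table s
LEFT JOIN reference_table r1 ON r1.naics_code_3_digit = s.naics_code
LEFT JOIN reference_table r2 ON r2.naics_code_4_digit = s.naics_code
ORDER BY s.company_name
"""
result = conn.execute(query).fetchall()
for row in result:
    print(row)
```

Each join condition is a plain equality on one column, so an index on naics_code_3_digit and another on naics_code_4_digit can be used, which the CASE expression in the original ON clause prevents.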

99% working behavior from mysql statement needs to be 100%

I have inherited a DB that I've been tasked to mine for data.
There are 2 tables that are loosely associated - atm and dslams.
The atm table contains "remotename", "rst", and "CardNumber" fields that relate to the dslams "hostname" field.
The atm table contains the port information for the dslam cards and the dslams table contains the information about the dslam card itself.
I've been tasked with printing out all the locations (dslams.name) that have a certain type of card (dslams.model="6256") and a count of all the ports on that card that have a certain level of service (atm.speed LIKE "RI_%_%09" OR atm.speed LIKE "RI_%_%1_%").
I've crafted the following statement which almost works...
SELECT distinct(dslams.name) AS Remote, Count(atm.speed) AS Customers, dslams.model
FROM dslams
LEFT JOIN atm
ON (dslams.hostname = CONCAT(atm.remotename,'-',atm.rst,'-S',atm.CardNumber)) AND (atm.speed LIKE "RI_%_%09" OR atm.speed LIKE "RI_%_%1_%")
GROUP BY dslams.name
HAVING dslams.model="6256"
ORDER BY dslams.name;
This prints out exactly what I need for all but 1 of the locations.
i.e.
MariaDB [dsl]> SELECT distinct(dslams.name) AS Remote, Count(atm.speed) AS Customers, dslams.model
-> FROM dslams
-> LEFT JOIN atm
-> ON (dslams.hostname = CONCAT(atm.remotename,'-',atm.rst,'-S',atm.CardNumber)) AND (atm.speed LIKE "RI_%_%09" OR atm.speed LIKE "RI_%_%1_%")
-> GROUP BY dslams.name
-> HAVING dslams.model="6256"
-> ORDER BY dslams.name;
+---------+-----------+-------+
| Remote | Customers | model |
+---------+-----------+-------+
| ANTH-C2 | 1 | 6256 |
| BETY-C2 | 1 | 6256 |
| BHOT-C2 | 6 | 6256 |
| BNSH-C2 | 1 | 6256 |
| BUG2-C2 | 1 | 6256 |
| CCRK-C2 | 0 | 6256 |
...
| STLN-C2 | 1 | 6256 |
| SUMR-C2 | 2 | 6256 |
...
| WGRV-C2 | 0 | 6256 |
+---------+-----------+-------+
63 rows in set (0.34 sec)
For some reason there's one location that's not getting counted - STWL-C2.
MariaDB [dsl]> SELECT distinct(name), model FROM dslams WHERE model="6256" order by name;
+---------+-------+
| name | model |
+---------+-------+
| ANTH-C2 | 6256 |
| BETY-C2 | 6256 |
| BHOT-C2 | 6256 |
| BNSH-C2 | 6256 |
| BUG2-C2 | 6256 |
| CCRK-C2 | 6256 |
...
| STWL-C2 | 6256 |
...
| WGRV-C2 | 6256 |
+---------+-------+
64 rows in set (0.00 sec)
There's no difference in the tables between the STWL-C2 location and the other locations so it should print out with a count of 0.
Can anyone help me figure out why that 1 location is being missed?
Any help or direction would be appreciated as I am a rookie SQL programmer trying to understand this as best I can.
Best Regards,
Joe
Don't use HAVING dslams.model = '6256', put that in the WHERE clause. When you use HAVING, it filters after grouping. When you group by name, the result can contain the model from any row in the group, and it won't necessarily choose model = '6256'.
SELECT dslams.name AS Remote, Count(atm.speed) AS Customers, dslams.model
FROM dslams
LEFT JOIN atm
ON (dslams.hostname = CONCAT(atm.remotename,'-',atm.rst,'-S',atm.CardNumber)) AND (atm.speed LIKE "RI_%_%09" OR atm.speed LIKE "RI_%_%1_%")
WHERE dslams.model = '6256'
GROUP BY dslams.name
ORDER BY dslams.name;
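The effect of filtering before grouping can be reproduced with a few made-up rows. In this sketch every hostname, model and speed value is hypothetical (the real tables are not shown in the question), SQLite's `||` operator replaces MySQL's CONCAT, and Python's sqlite3 module is used only to make the example runnable. The location STWL-C2 has two cards with different models, which is the kind of group where HAVING can see the wrong model.

```python
import sqlite3

# In-memory SQLite stands in for MariaDB here (assumption for runnability).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dslams (name TEXT, hostname TEXT, model TEXT);
CREATE TABLE atm (remotename TEXT, rst TEXT, CardNumber TEXT, speed TEXT);
-- Hypothetical rows: STWL-C2 has one 6256 card and one card of another model.
INSERT INTO dslams VALUES
    ('ANTH-C2', 'ANTH-1-S1', '6256'),
    ('STWL-C2', 'STWL-1-S1', '6256'),
    ('STWL-C2', 'STWL-1-S2', '5000'),
    ('OTHR-C2', 'OTHR-1-S1', '5000');
INSERT INTO atm VALUES
    ('ANTH', '1', '1', 'RI-768-09'),
    ('ANTH', '1', '1', 'XX-100-00'),
    ('STWL', '1', '2', 'RI-768-09');
""")

query = """
SELECT dslams.name AS Remote, COUNT(atm.speed) AS Customers
FROM dslams
LEFT JOIN atm
  ON dslams.hostname = atm.remotename || '-' || atm.rst || '-S' || atm.CardNumber
 AND (atm.speed LIKE 'RI_%_%09' OR atm.speed LIKE 'RI_%_%1_%')
WHERE dslams.model = '6256'   -- filter rows before they are grouped
GROUP BY dslams.name
ORDER BY dslams.name
"""
counts = conn.execute(query).fetchall()
print(counts)
```

Because the WHERE clause removes the non-6256 rows first, STWL-C2 is still listed, with a count of 0, instead of depending on which row's model the group happens to expose.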

mysql search query for 2 columns with single parameter

I am new to databases. In a MySQL database I have one table, course. My question is: how can I search for a word in both columns, the course name (cname) and the course description (cdesc), and get all the matches from both columns? Can anyone tell me the SQL query for it? I have tried to write a query, but I am getting some syntax errors.
+----------+-----------+-----------------+------------+------------+
| courseId | cname | cdesc | sdate | edate |
+----------+-----------+-----------------+------------+------------+
| 301 | physics | science | 2013-01-03 | 2013-01-06 |
| 303 | chemistry | science | 2013-01-09 | 2013-01-09 |
| 402 | afm | finanace | 2013-01-18 | 2013-01-25 |
| 403 | English | language | 2013-01-17 | 2013-01-24 |
| 404 | Telugu | spoken language | 2013-01-10 | 2013-01-22 |
+----------+-----------+-----------------+------------+------------+
SELECT * from course WHERE cname='%searchtermhere%' AND cdesc='%searchtermhere%'
Adding the percent sign % on both sides of the term makes the search match anywhere within each value, not just at the beginning. Note that % only acts as a wildcard with LIKE; with = it is compared literally.
If you want to search for the exact word:
SELECT * FROM course WHERE cname = 'word' AND cdesc = 'word'
Or you can match the term anywhere in each value, not just at the beginning:
SELECT * FROM course WHERE cname LIKE '%searchtermhere%' AND cdesc LIKE '%searchtermhere%'
Since you say single parameter, I guess you will get either 'science' or 'physics' as input. Then you could simply use OR:
select * from course where cname = (Input) or cdesc = (Input)
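Combining the two suggestions, a single search term can be matched against both columns with LIKE and OR. The sketch below is one possible way to do it, using Python's sqlite3 module with an in-memory database in place of MySQL (an assumption for runnability); the helper name search_courses is hypothetical, and the term is passed as a bound parameter rather than pasted into the SQL string.

```python
import sqlite3

# In-memory SQLite stands in for MySQL here (assumption for runnability).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE course (courseId INTEGER, cname TEXT, cdesc TEXT)")
conn.executemany(
    "INSERT INTO course VALUES (?, ?, ?)",
    [
        (301, "physics", "science"),
        (303, "chemistry", "science"),
        (402, "afm", "finanace"),
        (403, "English", "language"),
        (404, "Telugu", "spoken language"),
    ],
)

def search_courses(term):
    """Return the ids of courses whose name or description contains term."""
    pattern = f"%{term}%"  # wildcard on both sides: match anywhere in the value
    rows = conn.execute(
        "SELECT courseId FROM course "
        "WHERE cname LIKE ? OR cdesc LIKE ? "
        "ORDER BY courseId",
        (pattern, pattern),
    )
    return [row[0] for row in rows]

print(search_courses("science"))
print(search_courses("lang"))
```

Using OR returns a row when either column matches; switching to AND would require the term to appear in both columns of the same row, which none of the sample rows satisfy for 'science'.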

MySQL Multi Duplicate Record Merging

A previous DBA managed a non-relational table with 2.4M entries, all with unique IDs. However, there are duplicate records with different data in each record, for example:
+----+---------+--------------+-------+------------+-------------+
| id | Name    | Address      | Phone | Email      | LastVisited |
+----+---------+--------------+-------+------------+-------------+
| 1  | bob     | 12 Some Road | 02456 |            |             |
| 2  | bobby   |              | 02456 | bob#domain |             |
| 3  | bob     | 12 Some Rd   | 02456 |            | 2010-07-13  |
| 4  | sir bob |              | 02456 |            |             |
| 5  | bob     | 12SomeRoad   | 02456 |            |             |
| 6  | mr bob  |              | 02456 |            |             |
| 7  | robert  |              | 02456 |            |             |
+----+---------+--------------+-------+------------+-------------+
This isn't the exact table - the real table has 32 columns - this is just to illustrate.
I know how to identify the duplicates; in this case I'm using the phone number. I've extracted the duplicates into a separate table - there are 730k entries in total.
What would be the most efficient way of merging these records (and flagging the unneeded records for deletion)?
I've looked at using UPDATE with INNER JOINs, but there are several WHERE clauses needed, because I want to update the first record with data from subsequent records, where that subsequent record has additional data the former record does not.
I've looked at third-party software such as Fuzzy Dups, but I'd like a pure MySQL option if possible.
The end goal then is that I'd be left with something like:
+----+---------+--------------+-------+------------+-------------+
| id | Name    | Address      | Phone | Email      | LastVisited |
+----+---------+--------------+-------+------------+-------------+
| 1  | bob     | 12 Some Road | 02456 | bob#domain | 2010-07-13  |
+----+---------+--------------+-------+------------+-------------+
Should I be looking at looping in a stored procedure / function, or is there some really easy thing I've missed?
You have to create a PROCEDURE, but before that, create your own temp_table, like:
INSERT INTO temp_table (column1, column2, ...) SELECT column1, column2, ... FROM myTable GROUP BY phoneNumber
You have to create this as a physical table so that you can run a cursor on it.
The procedure (call it myPROC) then does the following:
Create a cursor on temp_table.
Fetch the phoneNumber and id of the current row from temp_table into local variables (L_id, L_phoneNum).
Fill a second table, similar_tempTable, with all the rows that share that phone number:
INSERT INTO similar_tempTable (column1, column2, ...) SELECT column1, column2, ... FROM myTable WHERE phoneNumber = L_phoneNum
Extract the value you want for each column from similar_tempTable, update the row of myTable where id = L_id with it, and delete the remaining duplicate rows from myTable.
Truncate similar_tempTable after every iteration of the cursor.
Hope this helps.
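As a possible alternative to the cursor loop, the merged row can also be computed in a set-based way. The sketch below is not the procedure described above but a different technique: for each phone number it keeps the lowest id and, per column, takes the first non-empty value in id order through a correlated subquery. The table name contacts, the lower-case column names and the assumption that missing values are stored as empty strings are all hypothetical, and Python's sqlite3 module stands in for MySQL so the example can run on its own.

```python
import sqlite3

# In-memory SQLite stands in for MySQL here (assumption for runnability).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE contacts (id INTEGER PRIMARY KEY, name TEXT, address TEXT, "
    "phone TEXT, email TEXT, last_visited TEXT)"
)
conn.executemany(
    "INSERT INTO contacts VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, "bob", "12 Some Road", "02456", "", ""),
        (2, "bobby", "", "02456", "bob#domain", ""),
        (3, "bob", "12 Some Rd", "02456", "", "2010-07-13"),
        (4, "sir bob", "", "02456", "", ""),
        (5, "bob", "12SomeRoad", "02456", "", ""),
        (6, "mr bob", "", "02456", "", ""),
        (7, "robert", "", "02456", "", ""),
    ],
)

def first_non_empty(column):
    """SQL for the first non-empty value of column among rows sharing k.phone."""
    return (
        f"(SELECT d.{column} FROM contacts d "
        f"WHERE d.phone = k.phone AND d.{column} <> '' "
        f"ORDER BY d.id LIMIT 1)"
    )

query = f"""
SELECT k.id,
       {first_non_empty('name')},
       {first_non_empty('address')},
       k.phone,
       {first_non_empty('email')},
       {first_non_empty('last_visited')}
FROM contacts k
-- keep only the lowest id of each phone number as the surviving row
WHERE k.id = (SELECT MIN(m.id) FROM contacts m WHERE m.phone = k.phone)
"""
merged = conn.execute(query).fetchall()
print(merged)
```

The rows whose id is not the minimum for their phone number are the ones to flag for deletion. With 32 columns and 730k duplicate rows, an index on the phone column would likely matter for the correlated subqueries; that has not been measured here.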

Table has pairs of matching records, need to select and update only one record

I have a table with pairs of matching records that I query like this:
select id,name,amount,type from accounting_entries
where name like "%05" and amount != 0 order by name limit 10;
Results:
+------+----------------------+--------+-------+
| id | name | amount | type |
+------+----------------------+--------+-------+
| 786 | D-1194-838HELLUJP-05 | -5800 | DEBIT |
| 785 | D-1194-838HELLUJP-05 | -5800 | DEBIT |
| 5060 | D-1195-UOK4HS5POF-05 | -5000 | DEBIT |
| 5059 | D-1195-UOK4HS5POF-05 | -5000 | DEBIT |
| 246 | D-1196-0FUCJI66BX-05 | -7000 | DEBIT |
| 245 | D-1196-0FUCJI66BX-05 | -7000 | DEBIT |
| 9720 | D-1197-W2J0EC1BOB-05 | -6500 | DEBIT |
| 9719 | D-1197-W2J0EC1BOB-05 | -6500 | DEBIT |
| 2694 | D-1198-MFKIKHGW0S-05 | -5500 | DEBIT |
| 2693 | D-1198-MFKIKHGW0S-05 | -5500 | DEBIT |
+------+----------------------+--------+-------+
10 rows in set (0.01 sec)
I need to perform an update so that the resulting data will look like this:
+------+----------------------+--------+--------+
| id | name | amount | type |
+------+----------------------+--------+--------+
| 786 | D-1194-838HELLUJP-05 | -5800 | DEBIT |
| 785 | C-1194-838HELLUJP-05 | 5800 | CREDIT |
| 5060 | D-1195-UOK4HS5POF-05 | -5000 | DEBIT |
| 5059 | C-1195-UOK4HS5POF-05 | 5000 | CREDIT |
| 246 | D-1196-0FUCJI66BX-05 | -7000 | DEBIT |
| 245 | C-1196-0FUCJI66BX-05 | 7000 | CREDIT |
| 9720 | D-1197-W2J0EC1BOB-05 | -6500 | DEBIT |
| 9719 | C-1197-W2J0EC1BOB-05 | 6500 | CREDIT |
| 2694 | D-1198-MFKIKHGW0S-05 | -5500 | DEBIT |
| 2693 | C-1198-MFKIKHGW0S-05 | 5500 | CREDIT |
+------+----------------------+--------+--------+
10 rows in set (0.01 sec)
One entry should negate the other entry. It doesn't matter if I update the first or second matching record, what matters is that one has a positive amount and the other has a negative amount. And the type and name need to be updated.
Any clues on how to do this? What would the update command look like? Maybe using a group by clause? I have some ideas on how to do it with a stored procedure, but can I do it with a simple update?
Try this:
UPDATE accounting_entries AS ae
-- replace the leading 'D' with 'C', flip the sign, and relabel the row
SET name = CONCAT('C', SUBSTRING(name, 2)),
amount = amount * -1,
type = 'CREDIT'
WHERE id IN
(SELECT min_id FROM
(SELECT MIN(id) AS min_id FROM accounting_entries
GROUP BY name) AS temp)
The key is the subquery in the WHERE section that limits the updates to the lowest ID of each name value. The assumption is that the lower ID is the one that you will always want to update. If this is not correct, then update the subquery based on whatever rule you would use.
Edit: Update to subquery based on technique found here, due to limitation on mysql defined here.
This query gives a method for updating all records at once (as it seemed like this is what the OP was looking for). However, the most efficient way to do this would be to iterate over all records in code (PHP, ASP.NET, etc.) and update the rows that need to change from there. This would avoid the performance issues inherent in running updates off of subqueries in MySQL.
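The lowest-id-per-name update can be tried out on two of the pairs from the question. This is a sketch under stated assumptions: it runs in an in-memory SQLite database through Python's sqlite3 module rather than in MySQL, so it uses `||` and substr instead of CONCAT and SUBSTRING, and SQLite does not need the extra derived table that MySQL requires when an UPDATE reads from the table it modifies.

```python
import sqlite3

# In-memory SQLite stands in for MySQL here (assumption for runnability).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounting_entries "
    "(id INTEGER PRIMARY KEY, name TEXT, amount INTEGER, type TEXT)"
)
conn.executemany(
    "INSERT INTO accounting_entries VALUES (?, ?, ?, ?)",
    [
        (786, "D-1194-838HELLUJP-05", -5800, "DEBIT"),
        (785, "D-1194-838HELLUJP-05", -5800, "DEBIT"),
        (5060, "D-1195-UOK4HS5POF-05", -5000, "DEBIT"),
        (5059, "D-1195-UOK4HS5POF-05", -5000, "DEBIT"),
    ],
)

# Turn the lower-id row of each pair into the matching credit entry.
conn.execute("""
UPDATE accounting_entries
SET name = 'C' || substr(name, 2),   -- replace the leading 'D' with 'C'
    amount = -amount,
    type = 'CREDIT'
WHERE id IN (SELECT MIN(id) FROM accounting_entries GROUP BY name)
""")

entries = conn.execute(
    "SELECT id, name, amount, type FROM accounting_entries ORDER BY id"
).fetchall()
for entry in entries:
    print(entry)
```

After the update, ids 785 and 5059 are credits with positive amounts and ids 786 and 5060 are untouched. On the real table, the filter from the question (name LIKE '%05' AND amount != 0) would presumably have to be added inside the subquery as well.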
If the IDs for a pair always match the formula x and x+1, you could say something like
WHERE MOD(`id`, 2) = 1
EDIT: I haven't tested this code, so I can't guarantee that it's possible to put a column name into a MOD like this, but it might be worth a try, and/or further investigation.
Does this constraint hold true all the time (D == -C) ?
If so, you do not need to keep redundant data in your table, store only one "amount" value (for example the debit):
786 | 1194-838HELLUJP-05 | -5800
and then, on the application level, prepend a D- to the name and use the raw amount, or prepend a C- and use the negated amount.