MySQL, compare a set of records with another set - mysql

I'm working on some kind of RNA probe data. For one probe, a set of RNA sequences is supplied, typically 20-40 sequences.
Each sequence within a set is about 30 characters long.
When populating the database with NEW probes, a new set of sequences is supplied and associated with the new probe.
We need to make sure that the new set of sequences is not the same as one that already exists in the database.
The first test would be the number of sequences (the example set has 20). That is a simple test.
If the set sizes are equal, then we need to check each item within the set. However, the order of the items within each set is irrelevant.
The question is: does MySQL have a built-in command to check for equality between two sets, where the order of the items in each set is irrelevant?

The short answer to your question is "no": there's no built-in command in MySQL to check for set equality. In some other databases, there is an INTERSECT or MINUS/EXCEPT operator that would do pretty much what you are asking for.
I've made the assumption here that sequences within a probe are unique. The SQL below can probably be adapted to do the job. I've prepared and tested a DBFiddle sample here.
Basically, it joins all the sequences in the new probe to all the sequences in the existing probes, then checks whether the number of records returned from the join for an existing probe is the same as the total count of records in that probe. If the counts match, and the two sets are the same size (your first test), then the new probe is a duplicate. The query returns the id of the existing duplicate probe. HTH.
SELECT x.probe,
       COUNT(*) AS newrecs,
       proberecs
FROM (SELECT a.probe,
             a.rnaseq
      FROM rnaprobes a
      JOIN newprobe b
        ON a.rnaseq = b.rnaseq) x
JOIN (SELECT probe,
             COUNT(*) AS proberecs
      FROM rnaprobes
      GROUP BY probe) c
  ON x.probe = c.probe
GROUP BY x.probe
HAVING COUNT(*) = proberecs
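As a rough illustration of the counting idea, here is a minimal sketch in Python using the standard sqlite3 module rather than MySQL. The table and column names follow the query above; the sample sequences are made up, and the query is a variant that also folds in the set-size test from the question, so treat it as one possible adaptation rather than the tested DBFiddle query. It still assumes that sequences within a probe are unique.

```python
import sqlite3

# Hypothetical sample data; table and column names mirror the query above.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE rnaprobes (probe INTEGER, rnaseq TEXT);
    CREATE TABLE newprobe (rnaseq TEXT);
    INSERT INTO rnaprobes VALUES
        (1, 'AAAC'), (1, 'GGGU'), (1, 'CCUA'),
        (2, 'AAAC'), (2, 'GGGU'),
        (3, 'UUUU'), (3, 'ACGU'), (3, 'CCUA');
    -- Same sequences as probe 1, listed in a different order.
    INSERT INTO newprobe VALUES ('CCUA'), ('AAAC'), ('GGGU');
""")

DUPLICATE_PROBES = """
    SELECT a.probe
    FROM rnaprobes a
    LEFT JOIN newprobe b ON a.rnaseq = b.rnaseq
    GROUP BY a.probe
    HAVING COUNT(*) = COUNT(b.rnaseq)                  -- every existing sequence was matched
       AND COUNT(*) = (SELECT COUNT(*) FROM newprobe)  -- and both sets have the same size
"""

duplicates = [row[0] for row in conn.execute(DUPLICATE_PROBES)]
print(duplicates)
```

Probe 2 is a strict subset of the new set, which is why the size comparison is included: without it, probe 2 would also be reported.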

Related

sql query not returning non-unique value in table

I have a MySQL database for an investor to track his investments:
the 'deal' table has info about the investments, including different categories for the investment (asset_class).
Another table ('updates') tracks updates on a specific investment (investment name, date, and lots of financial details.)
I want to write a query that allows the user to select all updates from 'updates' under a specific asset_class. However, as mentioned, asset_class is in the investment table. I wrote the following query:
SELECT *
FROM updates
WHERE updates.invest_name IN (SELECT deal.deal_name
                              FROM deal
                              WHERE deal.asset_class = '$asset_class'
                             );
I'm using PHP, so $asset_class is the selected variable of asset_class.
However, the query only returns unique update names, but I want to see ALL updates for the given asset_class, even if several updates are made under one investment name.
Any advice? Thanks!
Your query should do what you intend. In general, though, this type of query would be written using a JOIN. More importantly, use parameter placeholders instead of munging query strings:
SELECT u.*
FROM updates u
JOIN deal d
  ON u.invest_name = d.deal_name
WHERE d.asset_class = ?;
This can take advantage of indexes on deal(asset_class, deal_name) and updates(invest_name).
The ? represents a parameter that you pass into the query when you run it. The exact syntax depends on how you are making the call.
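To make the placeholder idea concrete, here is a minimal sketch using Python's standard sqlite3 module instead of PHP and MySQL. The update_date column, the sample rows and the function name updates_for_asset_class are assumptions for illustration; PHP's database extensions offer comparable placeholder support, with their own call syntax.

```python
import sqlite3

# Hypothetical sample data; only deal_name, asset_class and invest_name come from the question.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE deal (deal_name TEXT, asset_class TEXT);
    CREATE TABLE updates (invest_name TEXT, update_date TEXT);
    INSERT INTO deal VALUES ('Acme', 'equity'), ('Bolt', 'debt');
    INSERT INTO updates VALUES
        ('Acme', '2020-01-01'), ('Acme', '2020-02-01'), ('Bolt', '2020-01-15');
""")

def updates_for_asset_class(asset_class):
    # The value travels separately from the SQL text, so it is never
    # spliced into the query string.
    sql = """
        SELECT u.invest_name, u.update_date
        FROM updates u
        JOIN deal d ON u.invest_name = d.deal_name
        WHERE d.asset_class = ?
        ORDER BY u.update_date
    """
    return conn.execute(sql, (asset_class,)).fetchall()

rows = updates_for_asset_class("equity")
print(rows)
```

Both updates for the same investment name come back, which is the behavior the question asks for.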

MySQL LEAST() with arbitrary number of parameters; longest match in a table

I would like to create a MySQL query to find the longest match (of a given ip address in quad-dotted format) that is present in a table of subnets.
Ultimately, I'd like to create a LEFT JOIN that will display every quad-dotted ip address in one table joined with their longest matches in another table. I don't want to create any temporary tables or structure it as a nested query.
I'm somewhat of a MySQL newbie, but what I'm thinking is something like this:
SELECT `ip_address`
LEFT JOIN ON
SELECT `subnet_id`
FROM `subnets_table`
WHERE (`maximum_ip_value` - `minimum_ip_value`) =
LEAST(<list of subnet intervals>)
WHERE INET_ATON(<given ip address>) > `minimum_ip_value`
AND INET_ATON(<given ip address>) < `maximum_ip_value`;
Such that minimum_ip_value and maximum_ip_value are the lowest and highest decimal-formatted ip addresses possible in a given subnet-- e.g., for the subnet 172.16.0.0/16:
minimum_ip_value = 2886729728 (or 172.16.0.0)
maximum_ip_value = 2886795263 (or 172.16.255.255)
And <list of subnet intervals> contains all intervals in subnets_table where <given ip address> is between minimum_ip_value and maximum_ip_value
And if more than one interval contains <given ip address>, then the smallest interval (i.e., smallest subnet, or most specific and "longest" match) is joined.
Ultimately, all I really want is the subnet_id value that corresponds with that interval.
So my questions are:
1) Can I use the LEAST() function with an arbitrary number of parameters? I'd like to compare every row of subnets_table, or more specifically, every row's interval between minimum_ip_value and maximum_ip_value, and select the smallest interval.
2) Can I perform all of this computation within a LEFT JOIN query? I'm fine with any suggestions that will be fast, encapsulated, and avoid repetitive queries of the same data.
I'm wondering if this is even possible to perform in a single query (i.e., without querying the subnets table for each ip address), but I don't know enough to rule it out. Please advise if this looks like it won't work, so I can try another angle.
Thanks.
After some research and trial & error, I see that there are a few issues with the prototype query above:
The LEAST() function takes only a set number of arguments. As per my original question, I want a function that will work on an arbitrary number of arguments, or every row in a table. That is a different function in MySQL, MIN().
The aggregate function MIN() is evaluated after the tables in the FROM clause have been joined, so the ON condition of a JOIN cannot refer to it. Therefore, I can't JOIN directly on the MIN() of a set of values, because the MIN() doesn't exist yet at the time the JOIN is performed.
The only way I could see to solve this issue was to perform two separate queries: one with the MIN(), performed first, and another with the JOIN, performed on the results of the first query. This meant that for a table with n rows, I'd perform n^n queries, instead of n queries. That wasn't acceptable.
To work around the issue, I wrote a new script that modifies the database before any of these queries are ever performed. Each subnet is given its own "bucket" of ip values, and all values in that range map to that subnet. If a more specific (i.e., smaller) subnet overlaps a less specific (i.e., larger) subnet, then the more specific range is mapped only to the smaller subnet, and the larger subnet retains only the values from the less specific range. Now any given ip address falls into only one "bucket", and maps to only one subnet, which is its most specific match. I can JOIN on this match and never have to worry about the MIN() function.
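For comparison, here is a sketch of a different technique from the bucket precomputation described above: a correlated subquery that picks the smallest containing interval for each address in a single statement. It is not the approach the author settled on, and it does use a nested query. It runs on Python's standard sqlite3 module rather than MySQL; the addresses table, the ip_value column and the sample subnets are assumptions for illustration, and the ipaddress module stands in for INET_ATON(). Note that BETWEEN is inclusive, unlike the strict comparisons in the prototype query.

```python
import ipaddress
import sqlite3

def ip_to_int(dotted):
    # Plays the role of MySQL's INET_ATON() for IPv4 addresses.
    return int(ipaddress.IPv4Address(dotted))

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE subnets_table (subnet_id TEXT, minimum_ip_value INTEGER, maximum_ip_value INTEGER);
    CREATE TABLE addresses (ip_address TEXT, ip_value INTEGER);  -- hypothetical table
""")
for cidr in ("172.16.0.0/16", "172.16.5.0/24"):
    net = ipaddress.IPv4Network(cidr)
    conn.execute("INSERT INTO subnets_table VALUES (?, ?, ?)",
                 (cidr, int(net.network_address), int(net.broadcast_address)))
for ip in ("172.16.5.9", "172.16.200.1", "10.0.0.1"):
    conn.execute("INSERT INTO addresses VALUES (?, ?)", (ip, ip_to_int(ip)))

LONGEST_MATCH = """
    SELECT a.ip_address,
           (SELECT s.subnet_id
            FROM subnets_table s
            WHERE a.ip_value BETWEEN s.minimum_ip_value AND s.maximum_ip_value
            ORDER BY s.maximum_ip_value - s.minimum_ip_value  -- smallest interval first
            LIMIT 1) AS subnet_id
    FROM addresses a
    ORDER BY a.ip_value
"""
matches = conn.execute(LONGEST_MATCH).fetchall()
print(matches)
```

Addresses with no containing subnet get NULL (None in Python), much as they would in a LEFT JOIN.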

SQL newbie: execution order of subqueries?

Warning: This is a soft question, where you'll be answering to someone who has just started teaching himself SQL from the ground up. I haven't gotten my database software set up yet, so I can't provide tables to run queries against. Some patience required.
Warnings aside, I'm experimenting with basic SQL but I'm having a little bit of a rough time getting a clear answer about the inner workings of subqueries and their execution order within my query.
Let us say my query looks something like this:
SELECT * FROM someTable
WHERE someFirstValue = someSecondValue
AND EXISTS (
SELECT * FROM someOtherTable
WHERE someTable.someFirstValue = someOtherTable.someThirdValue
)
;
The reason I'm here, is because I don't think I understand fully what is going on in this query.
Now I don't want to seem lazy, so I'm not going to ask you guys to "tell me what's going on here", so instead, I'll provide my own theory first:
The first row in someTable is checked to see if someFirstValue is the same as someSecondValue in that row.
If it isn't, it goes onto the second row and checks it too. It continues like this until a row passes this little inspection.
If a row does pass, it opens up a new query. If the table produced by this query contains even a single row, it returns TRUE, but if it's empty it returns FALSE.
My theory ends here, and my confusion begins.
Will this inner query now compare only the rows that passed the first WHERE? Or will it check all the items in someTable and someOtherTable?
Rephrased; will only the rows that passed the first WHERE be compared in the someTable.someFirstValue = someOtherTable.someThirdValue subquery?
Or will the subquery compare all the elements from someTable to all the elements in someOtherTable regardless of which passed the first WHERE and which didn't?
UPDATE: Assume I'm using MySQL 5.5.32. If that matters.
The answer is that SQL is a descriptive language that describes the result set being produced from a query. It does not specify how the query is going to be run.
In your case the query has several options on how it might run, depending on the database engine, what the tables look like, and indexes. The query itself:
SELECT t.*
FROM someTable t
WHERE t.someFirstValue = t.someSecondValue AND
      EXISTS (SELECT *
              FROM someOtherTable t2
              WHERE t.someFirstValue = t2.someThirdValue
             );
Says: "Get me all columns from someTable where someFirstValue = someSecondValue and there is a corresponding row in someOtherTable where that table's column someThirdValue is the same as someFirstValue".
One possible way to approach this query would be to scan someTable and first check for the first condition. When the two columns match, then look up someFirstValue in an index on someOtherTable(someThirdValue) and keep the row if the values match. As I say, this is one approach, and there are others.
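A small sketch can show the distinction between what the query describes and how it is run. It uses Python's standard sqlite3 module rather than MySQL 5.5, and the sample rows are made up. The result depends only on the two conditions; the plan the engine reports is an implementation detail that varies by engine, version and indexes.

```python
import sqlite3

# Hypothetical sample data for the two tables named in the question.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE someTable (someFirstValue INTEGER, someSecondValue INTEGER);
    CREATE TABLE someOtherTable (someThirdValue INTEGER);
    INSERT INTO someTable VALUES (1, 1), (2, 2), (3, 4), (5, 6);
    INSERT INTO someOtherTable VALUES (1), (3), (5);
""")

QUERY = """
    SELECT t.someFirstValue, t.someSecondValue
    FROM someTable t
    WHERE t.someFirstValue = t.someSecondValue
      AND EXISTS (SELECT * FROM someOtherTable t2
                  WHERE t.someFirstValue = t2.someThirdValue)
"""
# Only rows satisfying both conditions are returned, whatever order they are checked in.
result = conn.execute(QUERY).fetchall()
print(result)

# Ask the engine how it intends to run the query; the output differs by engine and version.
plan = conn.execute("EXPLAIN QUERY PLAN " + QUERY).fetchall()
for step in plan:
    print(step)
```

MySQL has its own EXPLAIN statement for the same purpose, with a different output format.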

Intelligent Comparison based Update - Access / VBA

Need to intelligently perform updates on an access table.
Expert VBA / Intelligent Thinking would be required.
Table1 (For reference only)
CompanyCode Text
RegionCategory Number (1-99)
RegionCount Number(0 - 25000)
Table2
InvoiceNumber Number
CompanyCode Text
NumRows Number
RegionCode FourdigitNumber
ConfirmationRemark Y / N
Our objective is to put a Yes or No in the 'ConfirmationRemark' column.
Rules:
1. Select only those InvoiceNumbers which have exactly two rows in Table2 with different RegionCodes. These will have the same CompanyCode. RegionCategory is the first two digits of RegionCode.
2. For these two invoices, the difference between the two RegionCategory values must be greater than two.
3. Look up the RegionCount from Table1.
Decision making:
We are now basically comparing two invoices with different RegionCodes.
The idea is that the invoice with the higher RegionCount is the one to be marked Yes.
1. The difference between the RegionCounts must be considerable. I am still trying to determine what the right number for 'considerable' would be; let us take 500 for now.
2. The invoice with the lower RegionCount should have a RegionCount of zero (best case) or very low. If the invoice with the lower RegionCount has a high RegionCount value (> 200), then we cannot successfully conclude.
3. NumRows is preferred to be 1, or less than the other invoice's. This comparison is not mandatory, hence we shall have a provision to not check for it. Mark the other invoice as 'N'.
You have many ways to approach that type of complex update.
If you are lucky, you may be able to craft a SQL UPDATE statement that can include all the changes, but often you will have to resort to a combination of SELECT queries and custom VBA to filter them based on the results of calculations or lookups involving other data.
A few hints
Often, we tend to think about a problem in terms of 'what are the steps to get to the data that match the criteria'.
Sometimes, though, it's easier to turn the problem on its head and instead ask yourself 'what are the steps to get to the data that do not match the criteria'.
Because in your case the result is boolean, true or false, you could simply set the ConfirmationRemark field to True for all records and then update those that should be set to False, instead of the other way around.
Break down each step (as you did) and try to find the simplest SELECT query that will return just the data you need for that step. If a step is too complex, break it down further.
Compose your broken down SELECT statements together to slowly build a more complex query that tends toward your goal.
Once you have gone as far as you can, either construct an UPDATE Table2 SET ConfirmationRemark=True WHERE InvoiceNumber IN (SELECT InvoiceNumber ....) or use VBA to go through the recordset of results from your complex SELECT statement and do some more checks before you update the field in code.
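The hints above can be sketched roughly as follows. This is Python with the standard sqlite3 module, not Access or VBA, and Access's SQL dialect differs in places, so treat the statements as a pattern rather than something to paste in. The sample rows are made up, 'Y'/'N' text stands in for the Yes/No field, and the SELECT covers rule 1 only, not the full decision logic.

```python
import sqlite3

# Hypothetical sample rows; column names follow Table2 in the question.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Table2 (InvoiceNumber INTEGER, CompanyCode TEXT, NumRows INTEGER,
                         RegionCode INTEGER, ConfirmationRemark TEXT);
    INSERT INTO Table2 (InvoiceNumber, CompanyCode, NumRows, RegionCode) VALUES
        (100, 'A', 1, 1201), (100, 'A', 2, 4502),   -- two rows, different RegionCode
        (200, 'B', 1, 1201), (200, 'B', 1, 1201),   -- two rows, same RegionCode
        (300, 'C', 1, 3300);                        -- a single row
""")

# Step 1: the simplest SELECT that isolates the candidate invoices (rule 1 only).
CANDIDATES = """
    SELECT InvoiceNumber
    FROM Table2
    GROUP BY InvoiceNumber
    HAVING COUNT(*) = 2 AND COUNT(DISTINCT RegionCode) = 2
"""
candidates = [row[0] for row in conn.execute(CANDIDATES)]

# Step 2: turn the problem on its head. Mark everything 'Y' first,
# then reset the rows that do NOT meet the criteria.
conn.execute("UPDATE Table2 SET ConfirmationRemark = 'Y'")
conn.execute(
    f"UPDATE Table2 SET ConfirmationRemark = 'N' WHERE InvoiceNumber NOT IN ({CANDIDATES})"
)

remarks = conn.execute(
    "SELECT InvoiceNumber, ConfirmationRemark FROM Table2 ORDER BY InvoiceNumber"
).fetchall()
print(remarks)
```

Further rules would be added by refining the SELECT step by step, or by looping over its results in code when a rule is too awkward to express in SQL.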
Some issues
Unfortunately, despite your efforts to document your situation, there are not enough details for us to really help:
you do not mention which are the primary keys (from what you say, it seems that Table2 could have multiple records with identical InvoiceNumber)
the type of data you are dealing with is not obvious. You should include a sample of data and identify which ones should end-up with ConfirmationRemark set.
your problem is really too localised, meaning it is too specific to you to be of value to anyone else, although I think that with a bit more detail your question could be of interest, if only to show an example of how to approach complex data updates in Access.

remove duplicates in mysql database

I have a table with columns latitude and longitude. In most cases the value extends well past the decimal point, e.g. -81.7770051972473, but for some records it is as short as -81.77.
How do I find duplicates and remove one of the duplicates for only the records that extend beyond two decimal places?
Using some creative substring, float, and charindex logic, I came up with this:
DELETE l1
FROM latlong l1
INNER JOIN (
    -- Truncate each latitude to two decimal places as text,
    -- e.g. '-81.7770051972473' becomes '-81.77'.
    SELECT id,
           SUBSTRING(CAST(latitude AS CHAR), 1,
                     INSTR(CAST(latitude AS CHAR), '.') + 2) AS truncatedLat
    FROM latlong
) l2
    ON l1.id <> l2.id
   AND l1.latitude = CAST(l2.truncatedLat AS DECIMAL(10, 2));
Before running, try select * in lieu of delete l1 first to make sure you're deleting the right rows.
I should note that this worked on SQL Server using functions I know exist in MySQL, but I wasn't able to test it against a MySQL instance, so there may be some little tweaking that needs to be done. For example, in SQL Server, I used charindex instead of instr, but both should work similarly.
Not sure how to do that purely in SQL.
I have used scripting languages like PHP or CFML to solve similar needs by building a query to pull the records, then looping over the record set and performing some comparison. If the comparison is true, then VERY CAREFULLY call another function, passing in the record ID, and delete the record. I would probably even leave the record in the table, but mark another column as isDeleted.
If you are more ambitious than I, it looks like this thread is close to what you want
Deleting Duplicates in MySQL
finding multi column duplicates mysql
Using an external programming language (Perl, PHP, Java, Assembly...):
Select * from database
For each row, select * from database where newLat >= round(oldLat,2) and newLat < round(oldLat,2) + .01 and //same criteria for longitude
Keep one of them based on whatever criteria you choose. If lowest primary key, sort by that and skip the first result.
Delete everything else.
Repeat for the next row, skipping any records you have already deleted.
If for some reason you want to identify everything with greater than 2 digit precision:
select * from database where lat != round(lat,2) or long != round(long,2)
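The external-language steps above might look roughly like this in Python. It is a simplified sketch under stated assumptions: rows arrive as (id, latitude, longitude) tuples, duplicates are defined as rows whose coordinates round to the same two-decimal values, and the row with the lowest id in each group is kept. The function name find_rows_to_delete and the sample data are made up, and rounding can group rows differently from the truncation used in the SQL answer.

```python
def find_rows_to_delete(rows, places=2):
    """Return the ids of rows that duplicate an earlier row after rounding.

    rows: iterable of (id, latitude, longitude) tuples (an assumed shape).
    """
    keep = {}     # rounded (lat, lon) -> id of the row that is kept
    delete = []
    for row_id, lat, lon in sorted(rows):  # lowest id first, so it is the one kept
        key = (round(lat, places), round(lon, places))
        if key in keep:
            delete.append(row_id)
        else:
            keep[key] = row_id
    return delete

# Hypothetical sample rows.
rows = [
    (1, 28.54, -81.38),
    (2, 28.5412345, -81.3798765),
    (3, 40.7128, -74.006),
    (4, 28.5399, -81.3811),
]
to_delete = find_rows_to_delete(rows)
print(to_delete)
```

The returned ids can then be reviewed before anything is deleted or flagged with an isDeleted-style column.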