remove duplicates in mysql database - mysql

I have a table with columns latitude and longitude. In most cases the value extends well past the decimal point (for example -81.7770051972473), but for some records it is stored with only two decimal places (for example -81.77).
How do I find duplicates and remove one of the duplicates for only the records that extend beyond two decimal places?

Using some creative substring, float, and charindex logic, I came up with this:
delete l1
from
latlong l1
inner join (
select
id,
substring(cast(latitude as char), 1, instr(cast(latitude as char), '.') + 2) as truncatedLat
from
latlong
) l2 on
l1.id <> l2.id
and l1.latitude = cast(l2.truncatedLat as float)
Before running, try select * in lieu of delete l1 first to make sure you're deleting the right rows.
I should note that this worked on SQL Server using functions I know exist in MySQL, but I wasn't able to test it against a MySQL instance, so there may be some little tweaking that needs to be done. For example, in SQL Server, I used charindex instead of instr, but both should work similarly.

Not sure how to do that purely in SQL.
I have used scripting languages like PHP or CFML to solve similar needs by building a query to pull the records, then looping over the record set and performing some comparison. If the comparison is true, then VERY CAREFULLY call another function, passing in the record ID, and delete the record. I would probably even leave the record in the table, but mark another column such as isDeleted.
If you are more ambitious than I, it looks like this thread is close to what you want
Deleting Duplicates in MySQL
finding multi column duplicates mysql

Using an external programming language (Perl, PHP, Java, Assembly...):
Select * from database
For each row, select * from database where newLat >= round(oldLat,2) and newLat < round(oldLat,2) + .01 and //same criteria for longitude
Keep one of them based on whatever criteria you choose. If lowest primary key, sort by that and skip the first result.
Delete everything else.
Repeat for each remaining row, skipping any records you have already deleted.
If for some reason you want to identify everything with greater than 2 digit precision:
select * from database where lat != round(lat,2) or long != round(long,2)
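The steps above can be sketched roughly as follows. This is only a sketch under assumptions: it uses Python with an in-memory SQLite table as a stand-in for MySQL, the table and column names (latlong, id, latitude, longitude) are borrowed from the first answer, and "duplicate" is taken to mean "same latitude and longitude once both are truncated to two decimal places", keeping the lowest id. Truncation is used instead of round() because -81.7770051972473 rounds to -81.78, not -81.77.

```python
import sqlite3
from decimal import Decimal, ROUND_DOWN

CENT = Decimal("0.01")

def truncate2(value):
    """Truncate a float to two decimal places (toward zero) via its decimal text form."""
    return Decimal(str(value)).quantize(CENT, rounding=ROUND_DOWN)

def remove_truncated_duplicates(conn):
    """Keep the lowest id per truncated (latitude, longitude) pair and delete the rest.

    Returns the list of deleted ids.
    """
    seen = set()
    doomed = []
    rows = conn.execute("SELECT id, latitude, longitude FROM latlong ORDER BY id")
    for row_id, lat, lon in rows:
        key = (truncate2(lat), truncate2(lon))
        if key in seen:
            doomed.append((row_id,))   # a lower id already covers this location
        else:
            seen.add(key)
    conn.executemany("DELETE FROM latlong WHERE id = ?", doomed)
    conn.commit()
    return [r[0] for r in doomed]

if __name__ == "__main__":
    # Hypothetical sample data, not taken from the question.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE latlong (id INTEGER PRIMARY KEY, latitude REAL, longitude REAL)"
    )
    conn.executemany(
        "INSERT INTO latlong VALUES (?, ?, ?)",
        [(1, -81.7770051972473, 28.5), (2, -81.77, 28.5), (3, -80.12, 27.0)],
    )
    print(remove_truncated_duplicates(conn))
```

If you would rather keep the high-precision row regardless of id, sort so that those rows come first; and as suggested above, consider flagging rows instead of deleting them until you trust the result.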

Related

MySQL, compare a set of records with another set

I'm working on some kind of RNA probe data. For one probe, a set of RNA sequences are supplied, typically 20-40 sequences, e.g.
As can be seen from the picture, each sequence, within a set, is about 30 characters long.
When populating the database with NEW probes, a new set of sequences is supplied and associated with the new probe.
We need to check that the new set of sequences is not the same as one that already exists in the database.
The first test would be the number of sequences (the data above has 20). That is a simple test.
If the set sizes are equal, then we need to check each item within the set. However, the order of the items within each set is irrelevant.
The question is: does MySQL have a built-in command to check for equality between two sets, where the order of the items in each set is irrelevant?
The short answer to your question is "no": there's no built-in command in MySQL to check for set equality. In some other databases, there is an INTERSECT or MINUS/EXCEPT operator that would do pretty much what you are asking for. I've made the assumption here that sequences within a probe are unique. The SQL below can probably be adapted to do the job. I've prepared and tested a DBFiddle sample here. Basically, it joins all the sequences in the new probe to all the sequences in the existing probes, then checks whether the number of records returned from the join is the same as the total count of records in the existing probe. Combined with the set-size test described in the question, matching counts mean the new probe is a duplicate. The query returns the id of the existing duplicate probe. HTH.
SELECT x.probe,
COUNT(*) AS newrecs,
proberecs
FROM (SELECT a.probe,
a.rnaseq
FROM rnaprobes a
JOIN newprobe b
ON a.rnaseq = b.rnaseq) x
JOIN (SELECT probe,
COUNT(*) AS proberecs
FROM rnaprobes
GROUP BY probe) c
ON x.probe = c.probe
GROUP BY x.probe
HAVING COUNT(*) = proberecs
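For anyone who wants to experiment with the counting idea, here is a rough, runnable sketch. It is an adaptation rather than the query above verbatim: it runs against SQLite from Python (not MySQL), the function name find_duplicate_probes is made up for illustration, and it folds the question's set-size test into the SQL so that an existing probe that is merely a subset of the new one is not reported. Like the answer, it assumes sequences are unique within a probe.

```python
import sqlite3

# Count how many of the new probe's sequences occur in each existing probe,
# then require that count to equal both the existing probe's size and the
# new probe's size.
QUERY = """
SELECT c.probe
FROM (SELECT probe, COUNT(*) AS proberecs
      FROM rnaprobes
      GROUP BY probe) AS c
JOIN (SELECT a.probe, COUNT(*) AS matched
      FROM rnaprobes AS a
      JOIN newprobe AS b ON a.rnaseq = b.rnaseq
      GROUP BY a.probe) AS x ON x.probe = c.probe
WHERE x.matched = c.proberecs
  AND c.proberecs = (SELECT COUNT(*) FROM newprobe)
ORDER BY c.probe
"""

def find_duplicate_probes(conn, new_sequences):
    """Return ids of existing probes whose sequence set equals new_sequences."""
    conn.execute("DROP TABLE IF EXISTS temp.newprobe")
    conn.execute("CREATE TEMP TABLE newprobe (rnaseq TEXT PRIMARY KEY)")
    conn.executemany(
        "INSERT INTO newprobe VALUES (?)", [(s,) for s in set(new_sequences)]
    )
    return [row[0] for row in conn.execute(QUERY)]

if __name__ == "__main__":
    # Hypothetical, shortened sequences; real ones are about 30 characters.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE rnaprobes (probe INTEGER, rnaseq TEXT)")
    conn.executemany(
        "INSERT INTO rnaprobes VALUES (?, ?)",
        [(1, "AUG"), (1, "GCA"), (1, "CCU"), (2, "AUG"), (2, "GCA")],
    )
    print(find_duplicate_probes(conn, ["CCU", "AUG", "GCA"]))
```

The order in which the new sequences are supplied makes no difference, since only counts of matching rows are compared.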

Right way to phrase MySQL query across many (possible empty) tables

I'm trying to do what I think is a set of simple set operations on a database table: several intersections and one union. But I don't seem to be able to express that in a simple way.
I have a MySQL table called Moment, which has many millions of rows. (It happens to be a time-series table but that doesn't impact on my problem here; however, these data have a column 'source' and a column 'time', both indexed.) Queries to pull data out of this table are created dynamically (coming in from an API), and ultimately boil down to a small pile of temporary tables indicating which 'source's we care about, and maybe the 'time' ranges we care about.
Let's say we're looking for
(source in Temp1) AND (
((source in Temp2) AND (time > '2017-01-01')) OR
((source in Temp3) AND (time > '2016-11-15'))
)
Just for excitement, let's say Temp2 is empty --- that part of the API request was valid but happened to include 'no actual sources'.
If I then do
SELECT m.* from Moment as m,Temp1,Temp2,Temp3
WHERE (m.source = Temp1.source) AND (
((m.source = Temp2.source) AND (m.time > '2017-01-01')) OR
((m.source = Temp3.source) AND (m.time > '2016-11-15'))
)
... I get a heaping mound of nothing, because the empty Temp2 gives an empty Cartesian product before we get to the WHERE clause.
Okay, I can do
SELECT m.* from Moment as m
LEFT JOIN Temp1 on m.source=Temp1.source
LEFT JOIN Temp2 on m.source=Temp2.source
LEFT JOIN Temp3 on m.source=Temp3.source
WHERE (m.source = Temp1.source) AND (
((m.source = Temp2.source) AND (m.time > '2017-01-01')) OR
((m.source = Temp3.source) AND (m.time > '2016-11-15'))
)
... but this takes >70ms even on my relatively small development database.
If I manually eliminate the empty table,
SELECT m.* from Moment as m,Temp1,Temp3
WHERE (m.source = Temp1.source) AND (
((m.source = Temp3.source) AND (m.time > '2016-11-15'))
)
... it finishes in 10ms. That's the kind of time I'd expect.
I've also tried putting a single unmatchable row in the empty table and doing SELECT DISTINCT, and it splits the difference at ~40ms. Seems an odd solution though.
This really feels like I'm just conceptualizing the query wrong, that I'm asking the database to do more work than it needs to. What is the Right Way to ask the database this question?
Thanks!
--UPDATE--
I did some actual benchmarks on my actual database, and came up with some really unexpected results.
For the scenario above, all tables indexed on the columns being compared, with an empty table,
doing it with left joins took 3.5 minutes (!!!)
doing it without joins (just 'FROM...WHERE') and adding a null row to the empty table, took 3.5 seconds
even more striking, when there wasn't an empty table, but rather ~1000 rows in each of the temporary tables,
doing the whole thing in one query took 28 minutes (!!!!!), but,
doing each of the three AND clauses separately and then doing the final combination in the code took less than a second.
I still feel I'm expressing the query in some foolish way, since again, all I'm trying to do is one set union (OR) and a few set intersections. It really seems like the DB is making this gigantic Cartesian product when it seriously doesn't need to. All in all, as pointed out in the answer below, keeping some of the intelligence up in the code seems to be the better approach here.
There are various ways to tackle the problem. Needless to say it depends on
how many queries are sent to the database,
the amount of data you are processing in a time interval,
how the database backend is configured to manage it.
For your use case, a little more information would be helpful. The optimization of your query by using CASE/COUNT(*) or CASE/LIMIT combinations in queries to sort out empty tables would be one option. However, if-like queries cost more time.
You could split the SQL code to downgrade the scaling of the problem from 1*N^x to y*N^z, where z should be smaller than x.
You said that an API is involved; maybe you are able to handle the temporary "no data" tables differently, or even not store them at all?
Another option, on MySQL versions that still have it (the query cache was removed in MySQL 8.0), would be to enable query caching:
https://dev.mysql.com/doc/refman/5.5/en/query-cache-configuration.html
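One way to express the logical form from the question literally is with IN subqueries, so that each temporary table acts as a membership test instead of being part of a join. The sketch below only demonstrates correctness on a toy SQLite database driven from Python; the id column and the sample rows are made up, and nothing here says how MySQL's optimizer would handle it on millions of rows, so treat the performance side as untested.

```python
import sqlite3

# Each temporary table is only used as a membership test, so an empty Temp2
# simply makes its branch of the OR false.
IN_SUBQUERY = """
SELECT m.id FROM Moment AS m
WHERE m.source IN (SELECT source FROM Temp1)
  AND (
        (m.source IN (SELECT source FROM Temp2) AND m.time > '2017-01-01')
     OR (m.source IN (SELECT source FROM Temp3) AND m.time > '2016-11-15')
  )
ORDER BY m.id
"""

# The comma-join form from the question, for comparison.
COMMA_JOIN = """
SELECT m.id FROM Moment AS m, Temp1, Temp2, Temp3
WHERE (m.source = Temp1.source) AND (
      ((m.source = Temp2.source) AND (m.time > '2017-01-01')) OR
      ((m.source = Temp3.source) AND (m.time > '2016-11-15'))
)
ORDER BY m.id
"""

def build_demo():
    """Create a tiny database in which Temp2 is empty (hypothetical data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Moment (id INTEGER PRIMARY KEY, source TEXT, time TEXT)")
    for name in ("Temp1", "Temp2", "Temp3"):
        conn.execute(f"CREATE TABLE {name} (source TEXT)")
    conn.executemany(
        "INSERT INTO Moment VALUES (?, ?, ?)",
        [(1, "a", "2016-12-01"), (2, "a", "2016-10-01"),
         (3, "b", "2017-02-01"), (4, "c", "2017-02-01")],
    )
    conn.executemany("INSERT INTO Temp1 VALUES (?)", [("a",), ("b",)])
    conn.executemany("INSERT INTO Temp3 VALUES (?)", [("a",)])
    return conn

def ids(conn, sql):
    return [row[0] for row in conn.execute(sql)]

if __name__ == "__main__":
    conn = build_demo()
    print("IN subqueries:", ids(conn, IN_SUBQUERY))
    print("comma join:   ", ids(conn, COMMA_JOIN))
```

With the empty Temp2, the comma join returns nothing while the IN form still finds the row that qualifies through Temp3. Running EXPLAIN on your real tables is the only way to know whether this form is also faster there.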

SQL SELECT Query for Multiple Values and Ranges in Column

SQL SELECT query for multiple values and ranges in row data of a given column.
Problem description:
Server: MySQL
Database: Customer
Table: Lan
Column: Allowed VLAN (in the range 1-4096)
One row has data as below in the column Allowed VLAN:
180,181,200,250-499,550-811,826-mismatched
I need a SELECT statement WHERE the column Allowed VLAN includes a given number, for instance '600'. The given number '600' is either one of the comma-separated values, or included in any of the ranges "250-499", "550-811", or it is just the starting number of the "826-mismatched" range.
SELECT * FROM Lan WHERE `Allowed VLAN`='600' OR `Allowed VLAN` LIKE '%600%' OR (`Allowed VLAN` BETWEEN '1-1' AND '1-4096');
I could not figure out how to deal with data ranges in the WHERE clause. I have solved the problem with PHP code using explode() and similar split functions, but I think there must be some SQL SELECT solution.
I would appreciate any help.
I would highly recommend normalizing your data. Storing a comma-separated list of items in a single row is generally never a good idea.
Assuming you can make such a change, then something like this should work for you (although you could consider storing your ranges in different columns to make it even easier):
create table lan (allowedvan varchar(100));
insert into lan values
('180'),('181'),('200'),('250-499'),('550-811'),('826-mismatched');
-- compare the range bounds as numbers: as strings, '60' would fall inside '550-811'
select *
from lan
where allowedvan = '600'
or
(instr(allowedvan,'-') > 0 and
600 >= cast(left(allowedvan,instr(allowedvan,'-')-1) as unsigned) and
600 <= cast(right(allowedvan,length(allowedvan)-instr(allowedvan,'-')) as unsigned)
)
SQL Fiddle Demo
This uses INSTR to determine if the value contains a range (a hyphen) and then uses LEFT and RIGHT to get the range. Do not use LIKE because that could return inaccurate results (600 is like 1600 for example).
If you are unable to alter your database, then perhaps look into using a split function (several posts on it on SO) and then you can do something similar to the above method.
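For the case where the column cannot be changed, the split approach (what the question did with explode() in PHP) might look like the sketch below. It is written in Python purely for illustration; the function name vlan_allowed is made up, and treating "826-mismatched" as matching only its starting number is my reading of the question, not something the data format guarantees.

```python
def vlan_allowed(allowed, vlan):
    """Return True if the integer vlan appears in an 'Allowed VLAN' string.

    allowed is a comma-separated list of single values and ranges, for
    example "180,181,200,250-499,550-811,826-mismatched".
    """
    for item in allowed.split(","):
        start, sep, end = item.strip().partition("-")
        if not start.isdigit():
            continue                      # skip anything that does not start with a number
        low = int(start)
        if not sep:                       # single value such as "180"
            if vlan == low:
                return True
        elif end.isdigit():               # proper range such as "250-499"
            if low <= vlan <= int(end):
                return True
        elif vlan == low:                 # "826-mismatched": only the start counts (assumption)
            return True
    return False

if __name__ == "__main__":
    row = "180,181,200,250-499,550-811,826-mismatched"
    for candidate in (600, 1600, 826):
        print(candidate, vlan_allowed(row, candidate))
```

Because every piece is converted to an integer before comparing, 1600 and 60 do not match the 550-811 range the way a LIKE or a plain string comparison could.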

Intelligent Comparison based Update - Access / VBA

Need to intelligently perform updates on an access table.
Expert VBA / Intelligent Thinking would be required.
Table1 (For reference only)
CompanyCode Text
RegionCategory Number (1-99)
RegionCount Number(0 - 25000)
Table2
InvoiceNumber Number
CompanyCode Text
NumRows Number
RegionCode FourdigitNumber
ConfirmationRemark Y / N
Our objective is to put a Yes or No in the 'ConfirmationRemark' column.
Rules:
1. Select only those InvoiceNumbers which have exactly two rows in Table2 with different RegionCodes. These will have the same CompanyCode. RegionCategory is the first two digits of RegionCode.
2. For these two invoices, the difference between the two RegionCategory values must be greater than two.
3. Look up the RegionCount from Table1.
Decision making:
We are now basically comparing two invoices with different RegionCodes.
The idea is that the invoice with the higher RegionCount is the one to be marked Yes.
1. The difference between the RegionCounts must be considerable. I am still trying to determine what the right number for 'considerable' would be; let us take 500 for now.
2. The invoice with the lower RegionCount should have a RegionCount of zero (best case) or a very low value. If the invoice with the lower RegionCount has a high value (> 200), then we cannot reach a conclusion.
3. NumRows is preferred to be 1 or less than the other invoice's. This comparison is not mandatory, hence we shall have a provision to not check for it. Mark the other invoice as 'N'.
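To make the rules above concrete, here is one possible reading of them as a small function. It is a sketch in Python rather than VBA, and several points are assumptions: the names InvoiceRow and confirm are invented, a "considerable" difference is taken as at least 500, the NumRows preference is applied to the invoice that would be marked Yes, and region_count is assumed to have been looked up from Table1 already.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

MIN_COUNT_GAP = 500   # "considerable" difference; the placeholder value from the rules
MAX_LOW_COUNT = 200   # above this, the lower RegionCount is too high to conclude

@dataclass
class InvoiceRow:
    region_code: int    # four-digit code; the first two digits are the RegionCategory
    num_rows: int
    region_count: int   # assumed to be looked up from Table1 beforehand

def confirm(a: InvoiceRow, b: InvoiceRow,
            check_num_rows: bool = True) -> Optional[Tuple[str, str]]:
    """Return the remarks for (a, b), or None when no conclusion can be reached."""
    if a.region_code == b.region_code:
        return None
    if abs(a.region_code // 100 - b.region_code // 100) <= 2:
        return None                      # categories must differ by more than two
    high, low = (a, b) if a.region_count >= b.region_count else (b, a)
    if high.region_count - low.region_count < MIN_COUNT_GAP:
        return None                      # difference is not considerable
    if low.region_count > MAX_LOW_COUNT:
        return None                      # lower count is too high to conclude
    if check_num_rows and not (high.num_rows == 1 or high.num_rows < low.num_rows):
        return None                      # optional NumRows preference not met
    return ("Y", "N") if high is a else ("N", "Y")

if __name__ == "__main__":
    # Hypothetical rows for illustration only.
    print(confirm(InvoiceRow(1234, 1, 900), InvoiceRow(4501, 3, 50)))
```

Passing check_num_rows=False gives the "provision to not check" for the NumRows comparison.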
You have many ways to approach that type of complex update.
If you are lucky, you may be able to craft a SQL UPDATE statement that can include all the changes, but often you will have to resort to a combination of SELECT queries and custom VBA to filter them based on the results of calculations or lookups involving other data.
A few hints
Often, we tend to think about a problem in terms of 'what are the steps to get to the data that match the criteria'.
Sometimes, though, it's easier to turn the problem on its head and instead ask yourself 'what are the steps to get to the data that do not match the criteria'.
Because in your case the result is boolean, true or false, you could simply set the ConfirmationRemark field to True for all records and then update those that should be set to False, instead of the other way around.
Break down each step (as you did) and try to find the simplest SELECT query that will return just the data you need for that step. If a step is too complex, break it down further.
Compose your broken down SELECT statements together to slowly build a more complex query that tends toward your goal.
Once you have gone as far as you can, either construct an UPDATE Table2 SET ConfirmationRemark=True WHERE InvoiceNumber IN (SELECT InvoiceNumber ....) or use VBA to go through the recordset of results from your complex SELECT statement and do some more checks before you update the field in code.
Some issues
Unfortunately, despite your efforts to document your situation, there are not enough details for us to really help:
you do not mention which are the primary keys (from what you say, it seems that Table2 could have multiple records with identical InvoiceNumber)
the type of data you are dealing with is not obvious. You should include a sample of data and identify which ones should end-up with ConfirmationRemark set.
your problem is really too localised, meaning it is too specific to your situation to be of value to anyone else, although I think that with a bit more detail your question could be of interest, if only to show an example of how to approach complex data updates in Access.

SQL update query for balances using Access raises 'Operation must use an updateable query'

I have the following query (MS Access 2010) which I'm trying to use to update a table with a running balance:
UPDATE Accounts a SET a.CurrentBalance =
(SELECT sum(iif(c.categoryid = 2,t.Amount * -1, t.Amount)) +
(select a1.openingbalance
from accounts a1 where a1.accountid = a.accountid) AS TotalAmount
FROM transactions t inner join (
transactiontypes tt inner join
categories c on c.categoryid = tt.categoryid)
on t.transactiontypeid = tt.transactiontypeid);
The tables used are:
A workaround for the "Operation must use an updateable query" error is to use a temporary table and then update the final target data based on the aggregated data in the temporary table. In your case, as mwolfe suggests, you have an aggregate function in the inner select query. The workaround could provide a quick fix for this situation, as it has for me.
Ref: http://support.microsoft.com/kb/328828
This article helped me understand the specifics of the situation and provided the work around:
http://www.fmsinc.com/MicrosoftAccess/query/non-updateable/index.html
You cannot use aggregate functions (like SUM) in an update query. See Why is my query read-only? for a full list of conditions that will cause your query to be "non-updateable".
The Access db engine includes support for domain functions (DMax, DSum, DLookup, etc.). And domain functions can often allow you to circumvent non-updateable query problems.
Consider DSum() with these 3 rows of data in MyTable.
id MyNumber
1 2
2 3
3 5
Then in the Immediate window, here are 2 sample DSum() expressions.
? DSum("MyNumber", "MyTable")
10
? DSum("IIf(id=1,MyNumber * -1, MyNumber)", "MyTable")
6
I think you may be able to use something like that second expression as a replacement for the sum(iif(c.categoryid = 2,t.Amount * -1, t.Amount)) part of your query.
And perhaps you can use a DLookup() expression to get your TotalAmount value. Unfortunately I got frustrated trying to translate your current SQL to domain functions. And I realize this isn't a complete solution, but hope it will point you to something useful. If you edit your question to show us brief samples of the starting data and what you hope to achieve from your UPDATE statement based on that sample data, I would be willing to have another look at this.
Finally, consider whether you absolutely must store CurrentBalance in a table. As a rule of thumb, avoid storing derived values. Instead, use a SELECT query to compute the derived value when you need it. That approach would guarantee CurrentBalance is always up-to-date whenever you retrieve it. It would also spare you the effort to create a working UPDATE statement.
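To illustrate that last suggestion, here is a rough sketch of computing the balance with a SELECT whenever it is needed. It runs against SQLite from Python, not Access, so CASE stands in for IIf. The schema is a guess, since the table definitions are not shown in the question: in particular, I am assuming transactions has an accountid column, and the sample rows are invented.

```python
import sqlite3

# Opening balance plus signed transaction amounts; category 2 counts as negative,
# mirroring the iif(c.categoryid = 2, t.Amount * -1, t.Amount) expression.
BALANCE_QUERY = """
SELECT a.accountid,
       a.openingbalance
       + COALESCE(SUM(CASE WHEN tt.categoryid = 2 THEN -t.amount
                           ELSE t.amount END), 0) AS currentbalance
FROM accounts AS a
LEFT JOIN transactions AS t ON t.accountid = a.accountid
LEFT JOIN transactiontypes AS tt ON tt.transactiontypeid = t.transactiontypeid
GROUP BY a.accountid, a.openingbalance
ORDER BY a.accountid
"""

def build_demo():
    """Create a small database with an assumed schema and made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE accounts (accountid INTEGER PRIMARY KEY, openingbalance INTEGER);
        CREATE TABLE transactiontypes (transactiontypeid INTEGER PRIMARY KEY,
                                       categoryid INTEGER);
        CREATE TABLE transactions (transactionid INTEGER PRIMARY KEY,
                                   accountid INTEGER,
                                   transactiontypeid INTEGER,
                                   amount INTEGER);
        INSERT INTO accounts VALUES (1, 100), (2, 50);
        INSERT INTO transactiontypes VALUES (10, 1), (20, 2);
        INSERT INTO transactions VALUES (1, 1, 10, 30), (2, 1, 20, 45);
    """)
    return conn

def current_balances(conn):
    """Return (accountid, currentbalance) pairs computed on the fly."""
    return conn.execute(BALANCE_QUERY).fetchall()

if __name__ == "__main__":
    print(current_balances(build_demo()))
```

The LEFT JOINs and COALESCE keep accounts without any transactions in the result, showing just their opening balance. Since categoryid is already on transactiontypes, the sketch does not join the categories table.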