How can I sanitize my DB from these duplicates - mysql

I have a table with the following fields:
id | domainname | domain_certificate_no | keyvalue
An example for the output of a select statement can be as:
'57092', '02a1fae.netsolstores.com', '02a1fae.netsolstores.com_1', '55525772666'
'57093', '02a1fae.netsolstores.com', '02a1fae.netsolstores.com_2', '22225554186'
'57094', '02a1fae.netsolstores.com', '02a1fae.netsolstores.com_3', '22444356259'
'97168', '02aa6aa.netsolstores.com', '02aa6aa.netsolstores.com_1', '55525772666'
'97169', '02aa6aa.netsolstores.com', '02aa6aa.netsolstores.com_2', '22225554186'
'97170', '02aa6aa.netsolstores.com', '02aa6aa.netsolstores.com_3', '22444356259'
I need to sanitize my DB as follows: I want to remove the domain names whose keyvalue for the first domain_certificate_no is a repeat. In this example, I look at domain_certificate_no 02aa6aa.netsolstores.com_1; since it is number 1 and its keyvalue is a repeat, I want to remove the whole chain (02aa6aa.netsolstores.com_2 and 02aa6aa.netsolstores.com_3) by deleting the domain name the chain belongs to, which is 02aa6aa.netsolstores.com.
How can I automate this check for the whole DB? So I need a query that checks every domain name matching the pattern '%.%.%' EDIT: AND sharing a parent domain name (in this example, netsolstores.com); if it finds that certificate no. 1 belonging to that domain name has a repeated key value, it deletes the domain, otherwise not. Please note that it is OK for a domain_certificate_no to have a repeated value if it is not number 1.
EDIT: I only compare the repeated values within the same second-level domain name. For example, in this question I compare the values that share the domain name .netsolstores.com. If I have another domain name with sub-level domains, I do the same. The point is that I don't need to compare the whole DB, only the values with a shared domain name (but different subdomains).
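For illustration, the duplicates described above could be listed before deleting anything with something like the following sketch (the table name domain_certificates is a placeholder; column names are taken from the sample output):
-- List key values whose "_1" certificate appears under more than one
-- domain name within the same second-level domain.
SELECT SUBSTRING_INDEX(domainname, '.', -2) AS parent_domain,
       keyvalue,
       COUNT(DISTINCT domainname) AS domains_sharing_key
FROM domain_certificates
WHERE domain_certificate_no LIKE '%\_1'
GROUP BY parent_domain, keyvalue
HAVING COUNT(DISTINCT domainname) > 1;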

I'm not sure what happens with '02aa6aa.netsolstores.com_1' in your example.
The following keeps, for any repeated key, only the rows belonging to the smallest domain (dcn):
with tparsed as (
      -- split domain_certificate_no into the domain part (dcn) and the version;
      -- the CTE is named tparsed so it does not shadow the base table t
      select t.*,
             substr(domain_certificate_no,
                    instr(domain_certificate_no, '_') + 1, 1000) as version,
             left(domain_certificate_no, instr(domain_certificate_no, '_') - 1) as dcn
      from t
     )
select tp.*
from tparsed tp join
     (select keyvalue, min(dcn) as mindcn
      from tparsed
      group by keyvalue
     ) tsum
     on tp.keyvalue = tsum.keyvalue and
        tp.dcn = tsum.mindcn;
For the data you provide, this seems to do the trick. This will not return the "_1" version of the repeats. If that is important, the query can be pretty easily modified.
Although I prefer to be more positive (thinking about the rows to keep rather than delete), the following should delete what you want:
with tparsed as (
      select t.*,
             substr(domain_certificate_no,
                    instr(domain_certificate_no, '_') + 1, 1000) as version,
             left(domain_certificate_no, instr(domain_certificate_no, '_') - 1) as dcn
      from t
     ),
     tokeep as (
      select tp.*
      from tparsed tp join
           (select keyvalue, min(dcn) as mindcn
            from tparsed
            group by keyvalue
           ) tsum
           on tp.keyvalue = tsum.keyvalue and
              tp.dcn = tsum.mindcn
     )
delete from t
where t.id not in (select id from tokeep);
There are other ways to express this that are possibly more efficient (depending on the database). This, though, keeps the structure of the original query.
By the way, when trying new DELETE code, be sure that you stash a copy of the table. It is easy to make a mistake with DELETE (and UPDATE). For instance, if you leave out the WHERE clause, all the rows will disappear, after the long painful process of logging all of them. You might find it faster to simply select the desired results into a new table, validate them, then truncate the old table and re-insert them.
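For example, a minimal backup-first sketch, assuming the table is named t as in the queries above:
-- Copy the structure (including keys) and then the data before experimenting.
CREATE TABLE t_backup LIKE t;
INSERT INTO t_backup SELECT * FROM t;
-- Run and verify the DELETE, then drop the copy:
-- DROP TABLE t_backup;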

Query to find entries and transpose

I've got a machine log available in an SQL table. I can do a bit in SQL, but I'm not good enough to process the following:
In the data column there are entries containing "RUNPGM: Recipe name" and "RUNBRKPGM: Recipe name"
What I want is a view containing 4 columns:
TimeStamp RUNPGM
TimeStamp RUNBRKPGM
Recipe Name
Time Difference in seconds
There is a bit of a catch:
Sometimes the machine logs an empty RUNBRKPGM that should be ignored
The RUNBRKPGM is sometimes logged with an error message. This entry should also be ignored.
It's always the RUNBRKPGM entry with just the recipe name that's the actual end of the recipe.
NOTE: I understand this is not a complete answer, but with the info available in the question so far, I believe it at least gives a starting point, and it is too long (and formatted) to put in the comments.
If the recipe is everything in the DATA field except the 'RUNPGM = ' part, you can do something similar to this:
SELECT
    -- will give you a col for TimeStamp for records with RUNPGM
    CASE WHEN DATA LIKE 'RUNPGM%' THEN TS ELSE '' END AS RUNPGM_TimeStamp,
    -- will give you a col for TimeStamp for records with RUNBRKPGM
    CASE WHEN DATA LIKE 'RUNBRKPGM%' THEN TS ELSE '' END AS RUNBRKPGM_TimeStamp,
    -- will give you everything after the 'RUNPGM = ' (which I think is the recipe you are referring to)
    CASE WHEN DATA LIKE 'RUNPGM%' THEN REPLACE(DATA, 'RUNPGM = ', '') ELSE '' END AS RUNPGM_Recipe,
    -- will give you everything after the 'RUNBRKPGM = '
    CASE WHEN DATA LIKE 'RUNBRKPGM%' THEN REPLACE(DATA, 'RUNBRKPGM = ', '') ELSE '' END AS RUNBRKPGM_Recipe
FROM TableName
I'm not sure which columns you want the time difference between, though, so I don't have that column in here.
Then, if you need additional logic/formatting on the columns once they are separated, you can put the above in a subselect.
As a first swing, I'd try the following:
Create a view that uses string splitting to break the DATA column into its parts (e.g. RunType and RecipeName); a rough sketch of such a view follows the query below.
Create a simple select that outputs the recipe name and tstamp where the RunType is RUNPGM.
Then add an OUTER APPLY, essentially joining the view onto itself:
SELECT
t1.RecipeName,
t1.TimeStamp AS Start,
t2.TimeStamp AS Stop
--date func to get run time, pseudo DATEDIFF(xx,t1.TimeStamp, t2.TimeStamp) as RunTime
FROM newView t1
OUTER APPLY ( SELECT TOP ( 1 ) *
FROM newView x
WHERE x.RecipeName = t1.RecipeName
AND RunType = 'RUNBRKPGM'
ORDER BY ID DESC ) t2
WHERE t1.RunType = 'RUNPGM';
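Below is a rough sketch of what newView could look like. The log table name MachineLog and its columns ID, TS, and DATA are assumptions based on the query above, and it assumes DATA always has the form '<RunType>: <RecipeName>':
CREATE VIEW newView AS
SELECT ID,
       TS AS [TimeStamp],
       LEFT(DATA, CHARINDEX(':', DATA) - 1) AS RunType,
       LTRIM(SUBSTRING(DATA, CHARINDEX(':', DATA) + 1, LEN(DATA))) AS RecipeName
FROM MachineLog
WHERE DATA LIKE 'RUNPGM:%'
   OR DATA LIKE 'RUNBRKPGM:%';
With RunType and RecipeName split out this way, the OUTER APPLY above can match each RUNPGM row to a corresponding RUNBRKPGM row.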

SQLAlchemy foreign keys mapped to list of ids, not entities

In the usual Customer with Orders example, this kind of SQLAlchemy code...
data = db.query(Customer)\
    .join(Order, Customer.id == Order.cst_id)\
    .filter(Order.amount > 1000)
...would provide instances of the Customer model that are associated with large orders (amount > 1000). The resulting Customer instances would also include a list of their orders, since in this example we used a backref:
class Order:
    ...
    customer = relationship("customers", backref=backref('orders'))
The problem with this, is that iterating over Customer.orders means that the DB will return complete instances of Order - basically doing a 'select *' on all the columns of Order.
What if, for performance reasons, one wants to e.g. read only 1 field from Order (e.g. the id) and have the .orders field inside Customer instances be a simple list of IDs?
customers = db.query(Customer)....
...
pdb> print customers[0].orders
[2,4,7]
Is that possible with SQLAlchemy?
What you could do is make a query this way:
(
    session.query(Customer.id, Order.id)
    .select_from(Customer)
    .join(Customer.orders)
    .filter(Order.amount > 1000)
)
It doesn't produce exactly the result you asked for, but it gives you a list of tuples that looks like [(customer_id, order_id), ...].
I am not entirely sure whether you can eagerly load just the order ids into the Customer object, but I think you should be able to; you might want to look at joinedload and subqueryload, and perhaps go through the relationship-loading docs.
If that works in your case, you could write it as:
(
    session.query(Customer)
    .select_from(Customer)
    .join(Customer.orders)
    .options(db.joinedload(Customer.orders))
    .filter(Order.amount > 1000)
)
You can also look at noload and load_only to avoid loading relationships and columns you don't need.
I ended up doing this optimally - with array aggregation:
data = db.query(Customer).with_entities(
    Customer,
    func.ARRAY_AGG(
        Order.id,
        type_=ARRAY(Integer, as_tuple=True)).label('order_ids')
).outerjoin(
    Order, Customer.id == Order.cst_id
).group_by(
    Customer.id
)
This returns tuples of (CustomerEntity, list) - which is exactly what I wanted.
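For reference, the SQL this emits looks roughly like the following (PostgreSQL; the table names customers and orders are assumptions, and customers.id is assumed to be the primary key, which is what makes the bare GROUP BY legal):
SELECT customers.*,
       array_agg(orders.id) AS order_ids
FROM customers
LEFT OUTER JOIN orders ON customers.id = orders.cst_id
GROUP BY customers.id;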

Reduce MySQL Code down or combine SELECT Statements

I have made a few relations to do with a banking database system.
This is my current code; the Account table has AccountType and SortCode columns (among others):
SELECT COUNT(AccountType) AS Student_Total FROM Account
WHERE AccountType ='Student'
and SortCode = 00000001;
SELECT COUNT(AccountType) AS Student_Total FROM Account
WHERE AccountType ='Student'
and SortCode = 00000002;
SELECT COUNT(AccountType) AS Student_Total FROM Account
WHERE AccountType ='Student'
and SortCode = 00000003;
The rest of the code duplicates this part for the next AccountType, looping back through SortCodes 1-3 again.
I was wondering if there was a more elegant way of producing this. I need to count the number of student, current and saver accounts for each bank.
Or is there a way to combine lots of selects together to make a neat table?
That's what GROUP BY is for!
SELECT SortCode,COUNT(AccountType) AS Student_Total FROM Account
WHERE AccountType ='Student'
GROUP BY SortCode;
UPDATE:
You can also GROUP BY with multiple grouping fields:
SELECT SortCode,AccountType,COUNT(AccountType) AS Student_Total FROM Account
GROUP BY SortCode,AccountType;
You could also apply a PIVOT approach to this query so it always returns a single row with a fixed, known set of columns. However, the GROUP BY approach allows more flexibility in the returned rows, especially if you have a large number of individual things you are trying to tally up.
select
A.AccountType,
SUM( IF( A.SortCode = 1, 1, 0 )) as SortCode1Cnt,
SUM( IF( A.SortCode = 2, 1, 0 )) as SortCode2Cnt,
SUM( IF( A.SortCode = 3, 1, 0 )) as SortCode3Cnt
from
Account A
where
A.AccountType = 'Student'
AND A.SortCode IN ( 1, 2, 3 )
group by
A.AccountType
Note: it appears your SortCode is numeric, since you have no quotes around it indicating a character string, so all the leading zeros are irrelevant. And if you are only counting a single AccountType, you don't even need the leading AccountType column and can remove the GROUP BY too.
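That is, for the single-AccountType case the query reduces to something like:
select
    SUM( IF( A.SortCode = 1, 1, 0 )) as SortCode1Cnt,
    SUM( IF( A.SortCode = 2, 1, 0 )) as SortCode2Cnt,
    SUM( IF( A.SortCode = 3, 1, 0 )) as SortCode3Cnt
from
    Account A
where
    A.AccountType = 'Student'
    AND A.SortCode IN ( 1, 2, 3 )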

Return zero when records not found

I'm making a table generator as a school project.
In MySQL I have 3 tables, namely process, operation, and score. Everything looked fine until I tested the "ADD column" button in the web app.
Previously saved data should be read properly and also include the new column; the problem is that the previously saved data has no values for the new column, so I intended the query to return a score of 0 if no records were found. I tried IFNULL and COALESCE but nothing happens (maybe I'm just using them wrong).
process - processID, processName
operation - operationID, operationName
score - scoreID, score, processID, operationID, scoreType (score types are SELF, GL, FINAL)
ps = (PreparedStatement)dbconn.prepareStatement("SELECT score FROM score WHERE processID=? and operationID=? and type=? ORDER BY processid");
here's a pic of a small sample http://i50.tinypic.com/2yv3rf9.jpg
The reason that IFNULL doesn't work is that it only has an effect on values. A result set with no rows has no values, so it does nothing.
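To illustrate, a quick sketch (the processID = -1 filter just stands in for "no matching rows"):
-- Returns an empty result set: with no matching rows there is no value
-- for IFNULL to replace.
SELECT IFNULL(score, 0) FROM score WHERE processID = -1;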
First, it's probably better to do this on the client than on the server. But if you have to do it on the server, there are a couple of approaches I can think of.
Try this:
SELECT IFNULL(SUM(score), 0) AS score
FROM score
WHERE processID=? and operationID=? and type=?
ORDER BY processid
The SUM ensures that exactly one row will be returned.
If you need to return multiple rows when the table contains multiple matching rows then you can use this (omitting the ORDER BY for simplicity):
SELECT score
FROM score
WHERE processID = ? and operationID = ? and type = ?
UNION ALL
SELECT 0
FROM (SELECT 0) T1
WHERE NOT EXISTS
(
SELECT *
FROM score
WHERE processID = ? and operationID = ? and type = ?
)

mysql - satisfy composite primary key while using 'insert into xxx select'

I am importing data into a table structured: content_id | user_id | count - all integers, which together form the composite primary key.
The table I want to select it from is structured: content_id|user_id
For reasons quite specific to my use case, I will need to fire quite a lot of data into this regularly, often enough to want a pure MySQL solution.
insert into new_db.table
select content_id,user_id,xxx from old_db.table
I want each row to go in with xxx set to 0, unless this would create a duplicate key, in which case I wish to increment the number for the current user_id/content_id combination.
Not being a MySQL expert, I tried a few options like trying to populate xxx by selecting from the target table during insert, with no luck. Also tried using ON DUPLICATE KEY to increment counters instead of the usual UPDATE. But it all seemed a bit daft so I thought I would come here!
Any ideas anyone? I have a backup option of wrapping this in PHP, but it would drastically raise the overall running time of the script in which this would be the only non-pure MySQL part
Any help really appreciated. thanks in advance!
--edit
This may sound really awful in principle, but I'd settle for a way to do it in an UPDATE after entering random numbers (I have inserted random numbers so I can continue other work for the moment), and this is a purely dev setup.
--edit again
12|234
51|45
51|45
51|45
23|67
would ideally insert
12|234|0
51|45|0
51|45|1
51|45|2
23|67|0
INSERT INTO new_db.table (content_id, user_id, cnt)
SELECT old.content_id, old.user_id, COUNT(*) - 1 FROM old_db.table old
GROUP BY old.content_id, old.user_id
This is the way I would go: if there is one entry it puts 0 in cnt, and for more it puts 1, 2, 3, etc.
Edit:
The exact answer you want is somewhat more complicated, but I tested it and it works:
INSERT INTO newtable (user_id, content_id, cnt)
SELECT o1.user_id, o1.content_id,
       CASE
           WHEN COALESCE(@rownum, 0) = 0
           THEN @rownum := c - 1
           ELSE @rownum := @rownum - 1
       END AS cnt
FROM
    (SELECT user_id, content_id, COUNT(*) AS c
     FROM oldtable
     GROUP BY user_id, content_id) AS grpd
    LEFT JOIN
    (SELECT oldtable.* FROM oldtable) o1
        ON (o1.user_id = grpd.user_id AND o1.content_id = grpd.content_id)
;
Assuming that the old db table (the source) does not itself contain duplicate (content_id, user_id) combinations, you can import using this query:
insert newdbtable
select o.content_id, o.user_id, ifnull(max(n.`count`),-1)+1
from olddbtable o
left join newdbtable n on n.content_id=o.content_id and n.user_id=o.user_id
group by o.content_id, o.user_id;
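As an aside, and not part of the answers above: on MySQL 8.0 and later, a window function produces the 0, 1, 2, ... numbering from the question's example directly. A sketch, reusing the placeholder names from the question:
-- Rows within each (content_id, user_id) group get 0, 1, 2, ... in arbitrary order,
-- which is fine here since duplicate source rows are indistinguishable.
INSERT INTO new_db.`table` (content_id, user_id, `count`)
SELECT content_id,
       user_id,
       ROW_NUMBER() OVER (PARTITION BY content_id, user_id) - 1 AS `count`
FROM old_db.`table`;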