Duplicate removal not working on table with many NULLs - mysql

Perhaps I've been staring at the screen too long but I have the following [legacy] table I'm messing with:
describe t3_test;
+--------------------+------------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+--------------------+------------------+------+-----+---------+----------------+
| provnum | varchar(24) | YES | MUL | NULL | |
| trgt_mo | datetime | YES | | NULL | |
| mcare | varchar(2) | YES | | NULL | |
| bed2prsn_asst | varchar(2) | YES | | NULL | |
| trnsfr2prsn_asst | varchar(2) | YES | | NULL | |
| tlt2prsn_asst | varchar(2) | YES | | NULL | |
| hygn2prsn_asst | varchar(2) | YES | | NULL | |
| bath2psrn_asst | varchar(2) | YES | | NULL | |
| ampmcare2prsn_asst | varchar(2) | YES | | NULL | |
| any2prsn_asst | varchar(2) | YES | | NULL | |
| n | float | YES | | NULL | |
| pct | float | YES | | NULL | |
| trgt_qtr | varchar(12) | YES | | NULL | |
| recno | int(10) unsigned | NO | PRI | NULL | auto_increment |
| enddate | date | YES | | NULL | |
+--------------------+------------------+------+-----+---------+----------------+
15 rows in set (0.00 sec)
I have data that looks like this..
"555223","2008-10-01 00:00:00",NULL,"1",NULL,NULL,NULL,NULL,NULL,NULL,"40","93.0233","2008Q4","5767343","2008-12-31"
"555223","2008-10-01 00:00:00",NULL,"1",NULL,NULL,NULL,NULL,NULL,NULL,"40","93.0233","2008Q4","4075309","2008-12-31"
"555223","2008-10-01 00:00:00",NULL,"0",NULL,NULL,NULL,NULL,NULL,NULL,"3","6.97674","2008Q4","4075308","2008-12-31"
"555223","2008-10-01 00:00:00",NULL,"0",NULL,NULL,NULL,NULL,NULL,NULL,"3","6.97674","2008Q4","5767342","2008-12-31"
"555223","2008-10-01 00:00:00","N",NULL,"1",NULL,NULL,NULL,NULL,NULL,"36","83.7209","2008Q4","4075327","2008-12-31"
"555223","2008-10-01 00:00:00","N","1",NULL,NULL,NULL,NULL,NULL,NULL,"36","83.7209","2008Q4","4075323","2008-12-31"
"555223","2008-10-01 00:00:00","Y","1",NULL,NULL,NULL,NULL,NULL,NULL,"4","9.30233","2008Q4","4075325","2008-12-31"
"555223","2008-10-01 00:00:00",NULL,NULL,"0",NULL,NULL,NULL,NULL,NULL,"3","6.97674","2008Q4","4075310","2008-12-31"
"555223","2008-10-01 00:00:00",NULL,NULL,"1",NULL,NULL,NULL,NULL,NULL,"40","93.0233","2008Q4","4075311","2008-12-31"
The first two lines of the table clearly appear to be dupes (minus the A.I. index "recno"). I've tried a half dozen dupe-removal routines and they are not automatically removed.
At this point I am not sure what exactly is wrong? Is it possible there's an invisible character somewhere? Is it possible a letter is in a different character encoding? When I dump the data to CSV as is listed, it doesn't look any different.
Do you have a delete routine that would work on this table structure and remove anything that is a dupe (ignoring the recno field)? I have been staring at this for two days and for some reason it escapes me. (By the way, I am aware of the column name anomaly in bath2psrn_asst; that's not it.)
The original table has over 13 million records and is over 3GB in size, so I'm looking for the most efficient way to kill dupes. Any ideas?
Here's an example of one of the dupe-killing techniques I used that did not work:
DELETE a FROM t3_test as a, t3_test as b WHERE
(a.provnum=b.provnum)
AND (a.trgt_mo=b.trgt_mo OR a.trgt_mo IS NULL AND b.trgt_mo IS NULL)
AND (a.mcare=b.mcare OR a.mcare IS NULL AND b.mcare IS NULL)
AND (a.bed2prsn_asst=b.bed2prsn_asst OR a.bed2prsn_asst IS NULL AND b.bed2prsn_asst IS NULL)
AND (a.trnsfr2prsn_asst=b.trnsfr2prsn_asst OR a.trnsfr2prsn_asst IS NULL AND b.trnsfr2prsn_asst IS NULL)
AND (a.tlt2prsn_asst=b.tlt2prsn_asst OR a.tlt2prsn_asst IS NULL AND b.tlt2prsn_asst IS NULL)
AND (a.hygn2prsn_asst=b.hygn2prsn_asst OR a.hygn2prsn_asst IS NULL AND b.hygn2prsn_asst IS NULL)
AND (a.bath2psrn_asst=b.bath2psrn_asst OR a.bath2psrn_asst IS NULL AND b.bath2psrn_asst IS NULL)
AND (a.ampmcare2prsn_asst=b.ampmcare2prsn_asst OR a.ampmcare2prsn_asst IS NULL AND b.ampmcare2prsn_asst IS NULL)
AND (a.any2prsn_asst=b.any2prsn_asst OR a.any2prsn_asst IS NULL AND b.any2prsn_asst IS NULL)
AND (a.n=b.n OR a.n IS NULL AND b.n IS NULL)
AND (a.pct=b.pct OR a.pct IS NULL AND b.pct IS NULL)
AND (a.trgt_qtr=b.trgt_qtr OR a.trgt_qtr IS NULL AND b.trgt_qtr IS NULL)
AND (a.enddate=b.enddate OR a.enddate IS NULL AND b.enddate IS NULL)
AND (a.recno>b.recno);

For such a large table, delete can be quite inefficient -- all the logging needed for the deletes is very cumbersome.
I might recommend that you try the truncate/insert approach:
create table temp_t3_test as (
select provnum, trgt_mo, . . .,
min(recno) as recno,
enddate
from t3_test
group by provnum, trgt_mo, . . ., enddate);
truncate table t3_test;
insert into t3_test(provnum, trgt_mo, . . . , recno, enddate)
select *
from temp_t3_test;

Try:
CREATE TABLE t3_new AS
(
SELECT provnum,
trgt_mo,
mcare,
bed2prsn_asst,
trnsfr2prsn_asst,
tlt2prsn_asst,
hygn2prsn_asst,
bath2psrn_asst,
ampmcare2prsn_asst,
any2prsn_asst,
n,
pct,
trgt_qtr,
MIN(recno) AS recno,
enddate
FROM t3_test
GROUP BY provnum,
trgt_mo,
mcare,
bed2prsn_asst,
trnsfr2prsn_asst,
tlt2prsn_asst,
hygn2prsn_asst,
bath2psrn_asst,
ampmcare2prsn_asst,
any2prsn_asst,
n,
pct,
trgt_qtr,
enddate
)
If you use MIN(recno) without a GROUP BY, you do not pick one row per set of duplicates; you get the minimum of all recno values, and that single value is used for every row. To keep one row per distinct combination of the other columns, use DISTINCT or GROUP BY as I have done here. You could also leave recno out of the temp table and let a new auto-increment column in the recreated table assign fresh IDs, which avoids gaps.
This is meant to be used together with the method suggested by Gordon Linoff.
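As a rough illustration of the GROUP BY approach, here is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in for MySQL. The table is cut down to four columns and the sample rows are hypothetical; the point is only that GROUP BY puts NULLs in the same group, so MIN(recno) keeps one row per distinct combination.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Cut-down, hypothetical version of t3_test (four columns only).
conn.execute(
    "CREATE TABLE t3_test ("
    "provnum TEXT, mcare TEXT, n REAL, recno INTEGER PRIMARY KEY)"
)
rows = [
    ("555223", None, 40.0, 5767343),  # duplicate of the next row
    ("555223", None, 40.0, 4075309),
    ("555223", "N", 36.0, 4075323),   # distinct row
]
conn.executemany("INSERT INTO t3_test VALUES (?, ?, ?, ?)", rows)

# GROUP BY treats NULLs as belonging to the same group, so the two
# NULL-mcare rows collapse and the smallest recno survives.
conn.execute(
    "CREATE TABLE t3_new AS "
    "SELECT provnum, mcare, n, MIN(recno) AS recno "
    "FROM t3_test GROUP BY provnum, mcare, n"
)

kept = conn.execute("SELECT recno FROM t3_new ORDER BY recno").fetchall()
print(kept)
```

On the real table the GROUP BY list would name every column except recno.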

In the case of this scenario, the problem was not with the SQL statement. It was a problem with the DATA, but it was not visible.
The two fields designated type "float" held hidden decimal values that were slightly different from each other. Converting those fields to DECIMAL(a,b) type made the dupes show up and be properly deleted by conventional means.
Special thanks to Gordon Linoff for suggesting looking into this.
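The hidden-decimal effect is easy to reproduce outside the database. The sketch below is plain Python, not MySQL, and the two values are hypothetical: one is computed from counts, the other is typed in already rounded. They print identically at six significant digits, compare as unequal, and become equal once rounded to a fixed scale, which is roughly what converting to DECIMAL does.

```python
from decimal import Decimal

# Hypothetical values: both display as 93.0233 at six significant
# digits, but their stored binary values differ.
computed = 40 / 43 * 100   # derived from counts
typed = 93.0233            # loaded from already-rounded text

shown_computed = format(computed, ".6g")
shown_typed = format(typed, ".6g")

# Round both to a fixed scale of four decimal places, similar in
# spirit to storing them in a DECIMAL column with scale 4.
scale = Decimal("0.0001")
dec_computed = Decimal(computed).quantize(scale)
dec_typed = Decimal(typed).quantize(scale)

print(shown_computed, shown_typed)   # identical text
print(computed == typed)             # raw floats are not equal
print(dec_computed == dec_typed)     # equal after fixed-scale rounding
```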

Related

Make a MariaDB view that includes a boolean

I created this view:
CREATE OR REPLACE VIEW vista_metadatos AS
SELECT m.*, f.archivo IS NOT NULL AS myBooleanColumn
FROM metadatos m
LEFT JOIN facturas f ON (m.uuid = f.uuid)
However myBooleanColumn is being returned as an INT and I want it to be a Boolean which in this case should be a TINYINT:
> desc vista_metadatos;
+-----------------------+--------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+-----------------------+--------------+------+-----+---------+-------+
| uuid | varchar(40) | YES | | NULL | |
| otherBooleanColumn | tinyint(1) | YES | | NULL | |
| myBooleanColumn | int(1) | NO | | 0 | |
| ... | varchar(42) | NO | | | |
+-----------------------+--------------+------+-----+---------+-------+
From this desc I know that views can hold TINYINT, but how can I create a view that uses that condition as a TINYINT?
You should not really care. A view does not actually store the data, so there is no overhead anyway. And you can use an INT that has 0/1 values just like you use a BOOLEAN.
As far as the documentation states, BOOLEAN (or TINYINT) is not a supported target type for casting. So although that might not be satisfying from a purely intellectual point of view, you'll have to live with it.
CAST and CONVERT have no boolean target type; use an IF instead (this variant returns the strings 'TRUE' and 'FALSE'):
CREATE OR REPLACE VIEW vista_metadatos AS
SELECT
m.*
,IF(f.archivo IS NOT NULL, 'TRUE', 'FALSE') AS myBooleanColumn
FROM metadatos m
LEFT JOIN facturas f ON (m.uuid = f.uuid)

Mysql Query performance very slow

The query below takes more than 8 minutes and processes about 900,000 rows. It is very slow and affects my product. I can't identify why the query is slow; as far as I can tell, all indexes are set up correctly.
explain SELECT
COUNT(DISTINCT (cinfo.CONTACT_ID))
FROM
cinfo
INNER JOIN
LTocMapping ON cinfo.CONTACT_ID = LTocMapping.CONTACT_ID
WHERE
(((((((((cinfo.COUNTRY LIKE '%Panama%')
OR (cinfo.COUNTRY LIKE '%PANAMA%'))
AND (((cinfo.CONTACT_EMAIL NOT LIKE '%test%')
AND (cinfo.CONTACT_EMAIL NOT LIKE '%engine%'))
OR (cinfo.CONTACT_EMAIL IS NULL)))
AND ((SELECT
(GROUP_CONCAT(Temp.LIST_ID
ORDER BY Temp.LIST_ID) REGEXP ('.*,*221715000514445053,*.*$'))
FROM
LTocMapping Temp
WHERE
((LTocMapping.CONTACT_ID = Temp.CONTACT_ID)
AND (((Temp.MAPPING_ID >= 221715000000000000)
AND (Temp.MAPPING_ID <= 221715999999999999))
OR ((Temp.MAPPING_ID >= 0)
AND (Temp.MAPPING_ID <= 999999999999))))
GROUP BY Temp.CONTACT_ID) = '0'))
AND ((SELECT
(GROUP_CONCAT(Temp.LIST_ID
ORDER BY Temp.LIST_ID) REGEXP ('.*,*221715000520574130,*.*$'))
FROM
LTocMapping Temp
WHERE
((LTocMapping.CONTACT_ID = Temp.CONTACT_ID)
AND (((Temp.MAPPING_ID >= 221715000000000000)
AND (Temp.MAPPING_ID <= 221715999999999999))
OR ((Temp.MAPPING_ID >= 0)
AND (Temp.MAPPING_ID <= 999999999999))))
GROUP BY Temp.CONTACT_ID) = '0'))
AND (LTocMapping.LIST_ID IN (221715000520574130 , 221715000201569885)))
AND (LTocMapping.STATUS = BINARY 'subscribed'))
AND (((cinfo.CONTACT_STATUS = BINARY 'active')
OR (cinfo.CONTACT_STATUS = BINARY 'softbounce'))
AND (LTocMapping.STATUS = BINARY 'subscribed')))
AND (((cinfo.CONTACT_ID >= 221715000000000000)
AND (cinfo.CONTACT_ID <= 221715999999999999))
OR ((cinfo.CONTACT_ID >= 0)
AND (cinfo.CONTACT_ID <= 999999999999))))
And the answer will be
The table definitions are below, for reference.
Table 1 :
mysql> desc cinfo;
+------------------------+--------------+------+-----+-----------+-------+
| Field | Type | Null | Key | Default | Extra |
+------------------------+--------------+------+-----+-----------+-------+
| CONTACT_ID | bigint(19) | NO | PRI | NULL | |
| CONTACT_EMAIL | varchar(100) | NO | MUL | NULL | |
| TITLE | varchar(20) | YES | | NULL | |
| FIRSTNAME | varchar(100) | YES | | NULL | |
| LASTNAME | varchar(50) | YES | | NULL | |
| ADDED_BY | varchar(20) | YES | | NULL | |
| ADDED_TIME | bigint(19) | NO | | NULL | |
| LAST_UPDATED_TIME | bigint(19) | NO | | NULL | |
+------------------------+--------------+------+-----+-----------+-------+
Table 2 :
mysql> desc LTocMapping;
+---------------------+--------------+------+-----+------------+-------+
| Field | Type | Null | Key | Default | Extra |
+---------------------+--------------+------+-----+------------+-------+
| MAPPING_ID | bigint(19) | NO | PRI | NULL | |
| CONTACT_ID | bigint(19) | NO | MUL | NULL | |
| LIST_ID | bigint(19) | NO | MUL | NULL | |
| STATUS | varchar(100) | YES | | subscribed | |
| MAPPING_STATUS | varchar(20) | YES | | connected | |
| MAPPING_TIME | bigint(19) | YES | | NULL | |
+---------------------+--------------+------+-----+------------+-------+
As far as I can tell, your subqueries are the bottleneck:
The first subquery references LTocMapping.CONTACT_ID.
The second subquery references LTocMapping.CONTACT_ID as well.
These references to values from the outer query turn the inner queries into correlated subqueries (also called dependent subqueries). That means that for every row fetched from the outer tables (~970000), you fire 2 additional queries against another table.
So that is roughly 1.8 million queries, and they do not look trivial either.
Most of the time, a correlated subquery can be replaced by a proper join, although this depends on the use case. You can also join the same table twice by giving it a different alias each time.
To outline some join options, though, you need to explain why the subqueries with the condition group_concat(....) = '0' matter, or better, what you want to achieve.
(P.S.: You can also see that EXPLAIN reports them as DEPENDENT SUBQUERY.)
OR is inefficient, see if you can avoid it.
Leading wildcards in LIKE are inefficient. See if a FULLTEXT index would work for you.
With a proper COLLATION, you don't need to test both upper and lower case. Also you can avoid use of BINARY. In both cases, you might be able to use an index. (What indexes do you have?)
Try to change from
WHERE ( ( SELECT ... ) = '0' )
to
WHERE ( NOT EXISTS ( SELECT ... ) )
(The SELECT will need some modification.)
(Please get rid of some of the redundant parens; it is hard to read.)
(Please use SHOW CREATE TABLE; it is more descriptive than DESCRIBE.)
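To make the NOT EXISTS suggestion concrete, here is a minimal sketch using Python's sqlite3 module as a stand-in for MySQL. It assumes that the intent of each GROUP_CONCAT ... REGEXP ... = '0' subquery is "this contact is not on list X"; that intent, the cut-down tables, the sample rows and the list IDs 100 and 999 are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE cinfo (CONTACT_ID INTEGER PRIMARY KEY, COUNTRY TEXT);
CREATE TABLE LTocMapping (
    MAPPING_ID INTEGER PRIMARY KEY,
    CONTACT_ID INTEGER,
    LIST_ID INTEGER
);
INSERT INTO cinfo VALUES (1, 'Panama'), (2, 'Panama'), (3, 'Peru');
INSERT INTO LTocMapping VALUES
    (10, 1, 100), (11, 1, 999), (12, 2, 100), (13, 3, 100);
""")

EXCLUDED_LIST = 999  # hypothetical list whose members must be skipped

# NOT EXISTS states "no mapping row to the excluded list" directly,
# without building and pattern-matching a concatenated string.
sql = """
SELECT COUNT(DISTINCT c.CONTACT_ID)
FROM cinfo c
JOIN LTocMapping m ON m.CONTACT_ID = c.CONTACT_ID
WHERE c.COUNTRY = 'Panama'
  AND m.LIST_ID = 100
  AND NOT EXISTS (
      SELECT 1
      FROM LTocMapping x
      WHERE x.CONTACT_ID = c.CONTACT_ID
        AND x.LIST_ID = ?
  )
"""
count = conn.execute(sql, (EXCLUDED_LIST,)).fetchone()[0]
print(count)
```

Contact 1 is on the excluded list and contact 3 is not in Panama, so only contact 2 is counted. The subquery is still correlated, but it can stop at the first matching row, and an index on (CONTACT_ID, LIST_ID) would likely help it.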

Update query not working

I have an update query that should be working, but for some reason it doesn't:
String sql="UPDATE TB_EARTHORIENTATIONPARAMETER_UI SET YEAR='year1', MONTH='month1', DAY='day1', MJD='mjd1', WHERE (EOPID=1)";
It gives me the following error
Incorrect integer value 'year1' for column YEAR at row1
my table consist of the following columns and their types
| EOPID | int(11) | NO | PRI | NULL | auto_increment |
| YEAR | int(11) | YES | | NULL | |
| MONTH | int(11) | YES | | NULL | |
| DAY | int(11) | YES | | NULL | |
| MJD | int(11) | YES | | NULL | |
I retrieve the values to use in my SQL update query from a JTable in the following manner:
Object year=model.getValueAt(row, column);
years=year.toString();
year1=Integer.parseInt(years);
So I believe I am using the correct type, but I can't figure out why it won't update. Is this a MySQL version thing?
Your query should look like this:
String sql = "UPDATE TB_EARTHORIENTATIONPARAMETER_UI"
+ " SET YEAR=" + year1
+ ", MONTH=" + month1
+ ", DAY=" + day1
+ ", MJD=" + mjd1
+ " WHERE EOPID=1";
Here year1, month1, day1 and mjd1 are variables containing the appropriate values. Note also that your original statement has an extra comma before the WHERE clause.
The system is complaining that you're giving it a STRING ("year1"), not the integer value (e.g. 2012) it's expecting.
You should write this more like:
String sql="UPDATE TB_EARTHORIENTATIONPARAMETER_UI SET YEAR=" +
year1 + ", month..."
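An alternative to concatenating values into the SQL string is a parameterized statement (in Java, a PreparedStatement with ? placeholders). The sketch below shows the same idea with Python's sqlite3 module as a stand-in for the MySQL/JDBC setup; the values are hypothetical. The driver binds the integers itself, so the variable names never end up inside the SQL text as strings.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE TB_EARTHORIENTATIONPARAMETER_UI ("
    "EOPID INTEGER PRIMARY KEY, YEAR INTEGER, MONTH INTEGER, "
    "DAY INTEGER, MJD INTEGER)"
)
conn.execute("INSERT INTO TB_EARTHORIENTATIONPARAMETER_UI (EOPID) VALUES (1)")

# Hypothetical values, standing in for what was read from the JTable.
year1, month1, day1, mjd1 = 2012, 6, 15, 56093

# The ? placeholders are filled in by the driver, in order.
conn.execute(
    "UPDATE TB_EARTHORIENTATIONPARAMETER_UI "
    "SET YEAR = ?, MONTH = ?, DAY = ?, MJD = ? "
    "WHERE EOPID = ?",
    (year1, month1, day1, mjd1, 1),
)

row = conn.execute(
    "SELECT YEAR, MONTH, DAY, MJD "
    "FROM TB_EARTHORIENTATIONPARAMETER_UI WHERE EOPID = 1"
).fetchone()
print(row)
```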

Problem querying MySQL DB

I have this project I am working on, I have a table schema, see below
+--------+------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+--------+------------+------+-----+---------+----------------+
| codeId | int(15) | NO | PRI | NULL | auto_increment |
| code | varchar(9) | YES | | NULL | |
| status | varchar(5) | YES | | 0 | |
+--------+------------+------+-----+---------+----------------+
This table is used for authorizations of codes, however some people send codes like dsfffMUBBDG345qwewqe for authorization, please note the capitalized part. In the code column there is a code MUBBDG345. I need to be able to check from the table if any combination of 9 characters the codes sent matches any of the codes in the db.
I have tried using this query, but it just does not work:
select code, codeId, status from authCodes where 'dsfffMUBBDG345qwewqe' like code;
Is this even possible with a mysql query only?
The code column has to be wrapped in wildcards, so you want to use:
SELECT code, codeId, status
FROM authCodes
WHERE 'dsfffMUBBDG345qwewqe' LIKE CONCAT('%', code, '%')
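A minimal sketch of this reversed LIKE, again using Python's sqlite3 module as a stand-in for MySQL. SQLite concatenates with || where MySQL uses CONCAT, and the second stored code is a hypothetical extra row added only to show a non-match.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE authCodes ("
    "codeId INTEGER PRIMARY KEY, code TEXT, status TEXT DEFAULT '0')"
)
conn.executemany(
    "INSERT INTO authCodes (code) VALUES (?)",
    [("MUBBDG345",), ("ZZZZZZ999",)],  # second code is hypothetical
)

submitted = "dsfffMUBBDG345qwewqe"

# The submitted text is the left operand; each stored code, wrapped in
# % wildcards, is the pattern. A row matches if its code occurs
# anywhere inside the submitted text.
matches = conn.execute(
    "SELECT code FROM authCodes WHERE ? LIKE '%' || code || '%'",
    (submitted,),
).fetchall()
print(matches)
```

A stored code that itself contained % or _ would be treated as a wildcard pattern, so this assumes codes are plain alphanumerics.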

MySQL: return field for which no related entries exist in another table

First, sorry for the title, as I'm no native english-speaker, this is pretty hard to phrase. In other words, what I'm trying to achieve is this:
I'm trying to fetch all domain names from the table virtual_domains where there is no corresponding entry in the virtual_aliases table starting like "postmaster#%".
So if I have two domains:
foo.org
example.org
And they have aliases like:
info#foo.org => admin#foo.org
postmaster#foo.org => user1#foo.org
info#example.org => admin#example.org
I want the query to return only the domain "example.org", as it is the one missing the postmaster alias.
This is the table layout:
mysql> show columns from virtual_aliases;
+-------------+--------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+-------------+--------------+------+-----+---------+----------------+
| id | int(11) | NO | PRI | NULL | auto_increment |
| domain_id | int(11) | NO | MUL | NULL | |
| source | varchar(100) | NO | | NULL | |
| destination | varchar(100) | NO | | NULL | |
+-------------+--------------+------+-----+---------+----------------+
mysql> show columns from virtual_domains;
+-------+-------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+-------+-------------+------+-----+---------+----------------+
| id | int(11) | NO | PRI | NULL | auto_increment |
| name | varchar(50) | NO | | NULL | |
+-------+-------------+------+-----+---------+----------------+
I tried for many hours with IF, CASE, LIKE queries with no success. I don't need a final solution, maybe just a hint with some explanation. Thanks!
SELECT * FROM virtual_domains AS domains
LEFT JOIN virtual_aliases AS aliases
ON domains.id = aliases.domain_id
WHERE aliases.domain_id IS NULL
LEFT JOIN returns all records from the "left" table, even if they have no corresponding records in the "right" table. Those records will have the right table's fields set to NULL. Use WHERE to strip out all the others.
I guess I didn't understand you correctly the first time. You have several entries in aliases for a single domain, and you want to display only those domains that don't have an entry in the aliases table that starts with "postmaster"?
In that case you should use NOT IN, like this:
SELECT * FROM virtual_domains AS domains
WHERE domains.id NOT IN (
SELECT domain_id
FROM virtual_aliases
WHERE source LIKE "postmaster#%"
)
select id,name from virtual_domains
where id not in (select domain_id from virtual_aliases)
SELECT vd.* FROM virtual_domains vd
LEFT JOIN virtual_aliases va ON vd.id = va.domain_id
AND va.source LIKE 'postmaster#%'
WHERE va.id IS NULL;
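Assuming the goal is to list domains that have no alias whose source starts with "postmaster#", the anti-join can be checked against the example data from the question. This is a minimal sketch using Python's sqlite3 module as a stand-in for MySQL; the postmaster filter sits in the ON clause, and WHERE keeps only the domains for which no such alias row was found.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE virtual_domains (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE virtual_aliases (
    id INTEGER PRIMARY KEY,
    domain_id INTEGER NOT NULL,
    source TEXT NOT NULL,
    destination TEXT NOT NULL
);
INSERT INTO virtual_domains VALUES (1, 'foo.org'), (2, 'example.org');
INSERT INTO virtual_aliases (domain_id, source, destination) VALUES
    (1, 'info#foo.org', 'admin#foo.org'),
    (1, 'postmaster#foo.org', 'user1#foo.org'),
    (2, 'info#example.org', 'admin#example.org');
""")

# Join only the postmaster aliases; domains without one get NULLs on
# the right-hand side and are the ones kept by the WHERE clause.
missing = conn.execute("""
SELECT d.name
FROM virtual_domains d
LEFT JOIN virtual_aliases a
       ON a.domain_id = d.id
      AND a.source LIKE 'postmaster#%'
WHERE a.id IS NULL
""").fetchall()
print(missing)
```

Moving the LIKE condition into the WHERE clause instead would discard the NULL rows and defeat the anti-join.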