MySQL: Eliminating redundant elements from a table?

I have a table like this:
+-------+---------+------+-----+---------+-------+
| Field | Type    | Null | Key | Default | Extra |
+-------+---------+------+-----+---------+-------+
| v1    | int(11) | YES  | MUL | NULL    |       |
| v2    | int(11) | YES  | MUL | NULL    |       |
+-------+---------+------+-----+---------+-------+
There is a tremendous amount of duplication in this table. For instance, elements like the following:
+------+------+
| v1   | v2   |
+------+------+
|    1 |    2 |
|    1 |    3 |
|    1 |    4 |
|    1 |    5 |
|    1 |    6 |
|    1 |    7 |
|    1 |    8 |
|    1 |    9 |
|    2 |    1 |
|    4 |    1 |
|    5 |    1 |
|    6 |    1 |
|    7 |    1 |
|    8 |    1 |
|    9 |    1 |
+------+------+
The table is large, with about 1,540,000 entries. To remove the redundant entries (i.e. to keep only (1,9) and drop (9,1)), I was thinking of doing it with a subquery, but is there a better way?

Actually, @Mark's approach will work too. I just figured out another way of doing it and was wondering if I could get some feedback on this as well. I tested it and it seems to run fast.
SELECT v1,v2 FROM table WHERE v1<v2 UNION SELECT v2,v1 FROM table WHERE v1>v2;
In the case where this is right, you can always create a new table:
CREATE TABLE newtable AS SELECT v1,v2 FROM edges WHERE v1<v2 UNION SELECT v2,v1 FROM edges WHERE v1>v2;

Warning: these commands modify your database. Make sure you have a backup copy so that you can restore the data again if necessary.
You can add the constraint that v1 must be less than v2, which will roughly halve your storage requirement. Then make sure all rows in the database satisfy this condition: reorder those that don't, and where both orderings are present, delete one of the two rows.
This query will insert any missing rows where you have for example (5, 1) but not (1, 5):
INSERT INTO table1
SELECT T1.v2, T1.v1
FROM table1 T1
LEFT JOIN table1 T2
ON T1.v1 = T2.v2 AND T1.v2 = T2.v1
WHERE T1.v1 > T1.v2 AND T2.v1 IS NULL
Then this query deletes the rows you don't want, like (5, 1):
DELETE FROM table1 WHERE v1 > v2
You might need to change other places in your code that were programmed before this constraint was added.
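To sanity-check the two steps above, here is a minimal sketch using SQLite from Python. The in-memory table and the handful of sample pairs are made up for illustration; the SQL has the same shape as the MySQL statements above.

```python
import sqlite3

# Hypothetical in-memory stand-in for the (v1, v2) pairs table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE edges (v1 INTEGER, v2 INTEGER)")
conn.executemany("INSERT INTO edges VALUES (?, ?)",
                 [(1, 2), (2, 1),   # stored both ways
                  (5, 1),           # stored only reversed
                  (1, 3)])          # stored only canonically

# Step 1: insert the missing mirror row for any (v1, v2) with v1 > v2
# that has no (v2, v1) counterpart yet.
conn.execute("""
    INSERT INTO edges
    SELECT T1.v2, T1.v1
    FROM edges T1
    LEFT JOIN edges T2
      ON T1.v1 = T2.v2 AND T1.v2 = T2.v1
    WHERE T1.v1 > T1.v2 AND T2.v1 IS NULL
""")

# Step 2: delete every row that violates the v1 < v2 convention.
conn.execute("DELETE FROM edges WHERE v1 > v2")

rows = sorted(conn.execute("SELECT v1, v2 FROM edges"))
print(rows)  # -> [(1, 2), (1, 3), (1, 5)]
```

After the two statements, every pair is stored exactly once, in (smaller, larger) order.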

Related

Improve self-JOIN SQL Query performance

I am trying to improve the performance of a SQL query, using MariaDB 10.1.18 (Linux Debian Jessie).
The server has a large amount of RAM (192GB) and SSD disks.
The real table has hundreds of millions of rows but I can reproduce my performance issue on a subset of the data and a simplified layout.
Here is the (simplified) table definition:
CREATE TABLE `data` (
`uri` varchar(255) NOT NULL,
`category` tinyint(4) NOT NULL,
`value` varchar(255) NOT NULL,
PRIMARY KEY (`uri`,`category`),
KEY `cvu` (`category`,`value`,`uri`),
KEY `cu` (`category`,`uri`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
To reproduce the actual distribution of my content, I insert about 200'000 rows like this (bash script):
#!/bin/bash
for i in `seq 1 100000`;
do
mysql mydb -e "INSERT INTO data (uri, category, value) VALUES ('uri${i}', 1, 'foo');"
done
for i in `seq 99981 200000`;
do
mysql mydb -e "INSERT INTO data (uri, category, value) VALUES ('uri${i}', 2, '$(($i % 5))');"
done
So, we insert about:
100'000 rows in category 1 with a static string ("foo") as the value
100'000 rows in category 2 with a number between 0 and 4 (i % 5) as the value
20 uris that appear in both datasets (category 1 and category 2)
I always run an ANALYZE TABLE before querying.
Here is the explain output of the query I run:
MariaDB [mydb]> EXPLAIN EXTENDED
-> SELECT d2.uri, d2.value
-> FROM data as d1
-> INNER JOIN data as d2 ON d1.uri = d2.uri AND d2.category = 2
-> WHERE d1.category = 1 and d1.value = 'foo';
+------+-------------+-------+--------+----------------+---------+---------+-------------------+-------+----------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+------+-------------+-------+--------+----------------+---------+---------+-------------------+-------+----------+-------------+
| 1 | SIMPLE | d1 | ref | PRIMARY,cvu,cu | cu | 1 | const | 92964 | 100.00 | Using where |
| 1 | SIMPLE | d2 | eq_ref | PRIMARY,cvu,cu | PRIMARY | 768 | mydb.d1.uri,const | 1 | 100.00 | |
+------+-------------+-------+--------+----------------+---------+---------+-------------------+-------+----------+-------------+
2 rows in set, 1 warning (0.00 sec)
MariaDB [mydb]> SHOW WARNINGS;
+-------+------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Level | Code | Message |
+-------+------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Note | 1003 | select `mydb`.`d2`.`uri` AS `uri`,`mydb`.`d2`.`value` AS `value` from `mydb`.`data` `d1` join `mydb`.`data` `d2` where ((`mydb`.`d1`.`category` = 1) and (`mydb`.`d2`.`uri` = `mydb`.`d1`.`uri`) and (`mydb`.`d2`.`category` = 2) and (`mydb`.`d1`.`value` = 'foo')) |
+-------+------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)
MariaDB [mydb]> SELECT d2.uri, d2.value FROM data as d1 INNER JOIN data as d2 ON d1.uri = d2.uri AND d2.category = 2 WHERE d1.category = 1 and d1.value = 'foo';
+-----------+-------+
| uri | value |
+-----------+-------+
| uri100000 | 0 |
| uri99981 | 1 |
| uri99982 | 2 |
| uri99983 | 3 |
| uri99984 | 4 |
| uri99985 | 0 |
| uri99986 | 1 |
| uri99987 | 2 |
| uri99988 | 3 |
| uri99989 | 4 |
| uri99990 | 0 |
| uri99991 | 1 |
| uri99992 | 2 |
| uri99993 | 3 |
| uri99994 | 4 |
| uri99995 | 0 |
| uri99996 | 1 |
| uri99997 | 2 |
| uri99998 | 3 |
| uri99999 | 4 |
+-----------+-------+
20 rows in set (0.35 sec)
This query returns 20 rows in ~350ms.
It seems quite slow to me.
Is there a way to improve performance of such query? Any advice?
Can you try the following query?
SELECT dd.uri, max(case when dd.category=2 then dd.value end) v2
FROM data as dd
GROUP by 1
having max(case when dd.category=1 then dd.value end)='foo' and v2 is not null;
I cannot repeat your test at the moment, but my hope is that scanning the table just once will compensate for the use of the aggregate functions.
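As a quick illustration of the single-scan idea, here is a minimal sketch of the conditional-aggregation query against an in-memory SQLite table with a made-up handful of rows mirroring the simplified `data` layout:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE data (
    uri TEXT, category INTEGER, value TEXT,
    PRIMARY KEY (uri, category))""")
conn.executemany("INSERT INTO data VALUES (?, ?, ?)", [
    ("uri1", 1, "foo"),          # no category-2 row
    ("uri2", 1, "bar"),          # value != 'foo', must be excluded
    ("uri2", 2, "1"),
    ("uri3", 1, "foo"),
    ("uri3", 2, "4"),            # only uri3 matches both conditions
    ("uri4", 2, "2"),            # no category-1 row
])

# One pass over the table: pivot each uri's category-1 and category-2
# values with MAX(CASE ...) and filter in HAVING.
rows = conn.execute("""
    SELECT dd.uri,
           MAX(CASE WHEN dd.category = 2 THEN dd.value END) AS v2
    FROM data AS dd
    GROUP BY dd.uri
    HAVING MAX(CASE WHEN dd.category = 1 THEN dd.value END) = 'foo'
       AND MAX(CASE WHEN dd.category = 2 THEN dd.value END) IS NOT NULL
""").fetchall()
print(rows)  # -> [('uri3', '4')]
```

The `HAVING` repeats the `MAX(CASE ...)` expression instead of referencing the alias, which keeps the query portable across engines.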
Edited
Created a test environment and tested some hypothesis.
As of today, the best performance (for 1 million rows) has been:
1 - Adding an index on uri column
2 - Using the following query
select d2.uri, d2.value
FROM data as d2
where exists (select 1
from data d1
where d1.uri = d2.uri
AND d1.category = 1
and d1.value='foo')
and d2.category=2
and d2.uri in (select uri from data group by 1 having count(*) > 1);
The ironic thing is that in my first proposal I tried to minimize accesses to the table, and now I'm proposing three of them.
Edited: 30/10
Ok, so I've done some other experiments and I would like to summarize the outcomes.
First, I'd like to expand a bit on Aruna's answer:
What I found interesting in the OP's question is that it is an exception to a classic "rule of thumb" in database optimization: if the number of desired results is very small compared to the size of the tables involved, it should be possible, with the correct indexes, to get very good performance.
Why can't we simply add a "magic index" to get our 20 rows? Because we don't have any clear "attack vector": there is no clearly selective criterion we can apply to a record to significantly reduce the number of target rows.
Think about it: the fact that the value must be "foo" only removes 50% of the table from the equation. The category is not selective at all either: the only interesting thing is that, for 20 uris, they appear in records with both category 1 and category 2.
But here lies the issue: the condition involves comparing two rows, and unfortunately, to my knowledge, there's no way an index (not even Oracle's function-based indexes) can pre-compute a condition that depends on information from multiple rows.
The conclusion might be: if this kind of query is what you need, you should revise your data model. For example, if you have a finite and small number of categories (let's say three), your table might be written as:
uri, value_category1, value_category2, value_category3
The query would be:
select uri, value_category2
from data_pivoted  -- the remodeled table sketched above
where value_category1='foo' and value_category2 is not null;
By the way, let's go back to the original question.
I've created a slightly more efficient test data generator (http://pastebin.com/DP8Uaj2t).
I've used this table:
use mydb;
DROP TABLE IF EXISTS data2;
CREATE TABLE data2
(
uri varchar(255) NOT NULL,
category tinyint(4) NOT NULL,
value varchar(255) NOT NULL,
PRIMARY KEY (uri,category),
KEY cvu (category,value,uri),
KEY ucv (uri,category,value),
KEY u (uri),
KEY cu (category,uri)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
The outcome is:
+--------------------------+----------+----------+----------+
| query_descr | num_rows | num | num_test |
+--------------------------+----------+----------+----------+
| exists_plus_perimeter | 10000 | 0.0000 | 5 |
| exists_plus_perimeter | 50000 | 0.0000 | 5 |
| exists_plus_perimeter | 100000 | 0.0000 | 5 |
| exists_plus_perimeter | 500000 | 2.0000 | 5 |
| exists_plus_perimeter | 1000000 | 4.8000 | 5 |
| exists_plus_perimeter | 5000000 | 26.7500 | 8 |
| max_based | 10000 | 0.0000 | 5 |
| max_based | 50000 | 0.0000 | 5 |
| max_based | 100000 | 0.0000 | 5 |
| max_based | 500000 | 3.2000 | 5 |
| max_based | 1000000 | 7.0000 | 5 |
| max_based | 5000000 | 49.5000 | 8 |
| max_based_with_ucv | 10000 | 0.0000 | 5 |
| max_based_with_ucv | 50000 | 0.0000 | 5 |
| max_based_with_ucv | 100000 | 0.0000 | 5 |
| max_based_with_ucv | 500000 | 2.6000 | 5 |
| max_based_with_ucv | 1000000 | 7.0000 | 5 |
| max_based_with_ucv | 5000000 | 36.3750 | 8 |
| standard_join | 10000 | 0.0000 | 5 |
| standard_join | 50000 | 0.4000 | 5 |
| standard_join | 100000 | 2.4000 | 5 |
| standard_join | 500000 | 13.4000 | 5 |
| standard_join | 1000000 | 33.2000 | 5 |
| standard_join | 5000000 | 205.2500 | 8 |
| standard_join_plus_perim | 5000000 | 155.0000 | 2 |
+--------------------------+----------+----------+----------+
The queries used are:
- query_exists_plus_perimeter.sql
- query_max_based.sql
- query_max_based_with_ucv.sql
- query_standard_join.sql
- query_standard_join_plus_perim.sql
The best query is still the "query_exists_plus_perimeter" that I posted after creating the first test environment.
It is mainly due to the number of rows analysed. Even though your tables are indexed, the main filtering condition "WHERE d1.category = 1 AND d1.value = 'foo'" still matches a huge number of rows:
+------+-------------+-------+-.....-+-------+----------+-------------+
| id | select_type | table | | rows | filtered | Extra |
+------+-------------+-------+-.....-+-------+----------+-------------+
| 1 | SIMPLE | d1 | ..... | 92964 | 100.00 | Using where |
For each and every matching row, it has to read the table again, this time for category 2. Since that read is by primary key, it can fetch the matching row directly.
On your original table, check the cardinality of the combination of category and value. If it is close to unique, you can add an index on (category, value) and that should improve performance. If it is like the example given, you may not get any improvement.

Selecting all but a few fields from a table with a huge number of fields

I want a query to display a few fields from a table on a web page, but the table has a huge number of fields. I want a query that excludes a few fields and displays all the remaining fields' data on the page.
For example, I have a table with 50 fields:
+------+------+------+------+------+-------+
| Col1 | Col2 | Col3 | Col4 | ---- | Col50 |
+------+------+------+------+------+-------+
|      |      |      |      | ---- |       |
|      |      |      |      | ---- |       |
+------+------+------+------+------+-------+
but I want to display only 48 fields on that page. Is there a query that excludes the two field names that are not required (Col49 and Col50) and shows the remaining data? That is, instead of writing SELECT Col1, Col2, Col3, Col4, ... Col48 FROM table;, is there an alternative way of writing something like SELECT * - (Col49, Col50) FROM table;?
The best way to solve this is using a view: you can create a view with just those 48 columns and retrieve data from it.
example
mysql> SELECT * FROM calls;
+----+------------+---------+
| id | date | user_id |
+----+------------+---------+
| 1 | 2016-06-22 | 1 |
| 2 | 2016-06-22 | NULL |
| 3 | 2016-06-22 | NULL |
| 4 | 2016-06-23 | 2 |
| 5 | 2016-06-23 | 1 |
| 6 | 2016-06-23 | 1 |
| 7 | 2016-06-23 | NULL |
+----+------------+---------+
7 rows in set (0.06 sec)
mysql> CREATE VIEW C_VIEW AS
-> SELECT id,date from calls;
Query OK, 0 rows affected (0.20 sec)
mysql> select * from C_VIEW;
+----+------------+
| id | date |
+----+------------+
| 1 | 2016-06-22 |
| 2 | 2016-06-22 |
| 3 | 2016-06-22 |
| 4 | 2016-06-23 |
| 5 | 2016-06-23 |
| 6 | 2016-06-23 |
| 7 | 2016-06-23 |
+----+------------+
7 rows in set (0.00 sec)
Instead of writing select * from ..., include the column names in the query: select col1, col2, ... col48 from ...
This answer is based on your question as written, so add more details to your question to get a clearer answer.
You need to pass column names to get data for only those columns.
e.g.
table_name = dummy_tab
col_1 | col_2 | col_3 | col_4
------+-------+-------+---------
1     | John  | 20    | abc@def
2     | Doe   | 21    | def@xyz
...
You can do:
SELECT col_1, col_2 FROM dummy_tab;
This will give:
col_1 | col_2
------+------
1     | John
2     | Doe
A bit of a pain, but you can reduce effort and error by interrogating information_schema and excluding the fields you don't want by name or by position. For example:
CREATE TABLE `dates` (
`id` INT(11) NULL DEFAULT NULL,
`dte` DATE NULL DEFAULT NULL,
`CalMonth` INT(11) NULL DEFAULT NULL,
`CalMonthDescLong` VARCHAR(10) NULL DEFAULT NULL,
`CalMonthDescShort` VARCHAR(10) NULL DEFAULT NULL,
`calQtr` INT(11) NULL DEFAULT NULL
)
COLLATE='latin1_swedish_ci'
ENGINE=InnoDB
;
use information_schema;
select concat('`',REPLACE(t.NAME,'/dates','.'), '`',replace(T.NAME,'sandbox/',''),c.NAME,'`',',')
from INNODB_SYS_TABLES t
join INNODB_SYS_COLUMNS c on c.TABLE_ID = t.TABLE_ID
where t.NAME like ('sandbox%')
and t.name like ('%dates')
and pos not in(1,4)
Best run from command line with the output piped to a text file.
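The same introspection idea can be sketched outside MySQL. In the snippet below, SQLite's `PRAGMA table_info` plays the role of `information_schema` (the `dates` table and the excluded column names are just this answer's example); the point is that the column list is generated, not hand-typed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE dates (
    id INTEGER, dte TEXT, CalMonth INTEGER,
    CalMonthDescLong TEXT, CalMonthDescShort TEXT, calQtr INTEGER)""")

# Build the SELECT list from the catalog, excluding unwanted columns
# by name (you could equally filter on the position, row[0]).
excluded = {"CalMonthDescLong", "CalMonthDescShort"}
cols = [row[1] for row in conn.execute("PRAGMA table_info(dates)")
        if row[1] not in excluded]

query = "SELECT {} FROM dates".format(", ".join(cols))
print(query)  # -> SELECT id, dte, CalMonth, calQtr FROM dates
```

In MySQL you would read `information_schema.COLUMNS` (filtering on `TABLE_SCHEMA`, `TABLE_NAME`, and `ORDINAL_POSITION`) to build the same string.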

MySQL Independently select last value across multiple columns which isn't null

Here's an example dataset that I'm dealing with:
+----+-----+-----+-----+-----+
| id | a   | b   | c   | d   |
+----+-----+-----+-----+-----+
| 1  | 1   |     |     |     |
| 2  |     | 2   |     |     |
| 3  |     |     |     |     |
| 4  |     |     |     | 4   |
| 5  |     | 3   |     |     |
+----+-----+-----+-----+-----+
I want to select the bottom-most values. If this value has never been set, then I'd want "null", otherwise, I want the bottom-most result. In this case, I'd want the resultset:
+-----+-----+-----+-----+
| a   | b   | c   | d   |
+-----+-----+-----+-----+
| 1   | 3   |     | 4   |
+-----+-----+-----+-----+
I tried queries such as variations of:
SELECT DISTINCT `a`,`b`,`c`,`d`
FROM `test`
WHERE `a` IS NOT NULL
AND `b` IS NOT NULL
AND `c` IS NOT NULL
AND `d` IS NOT NULL
ORDER BY 'id' DESC LIMIT 1;
This didn't work.
Would I have to run queries for each value individually, or is there a way to do it within just that one query?
If you are OK with changing type to a char, you can do this:
SELECT substring_index(GROUP_CONCAT(a),',',1) as LastA,
substring_index(GROUP_CONCAT(b),',',1) as LastB,
substring_index(GROUP_CONCAT(c),',',1) as LastC,
substring_index(GROUP_CONCAT(d),',',1) as LastD
FROM
(
SELECT id, a, b, c, d
FROM MyTable
ORDER BY id DESC
) x;
Notes:
The intermediate derived table is needed as the input to GROUP_CONCAT needs to be ordered.
After compressing the rows with GROUP_CONCAT (using the default comma delimiter), we then scrape out the first column with substring_index. substring_index on NULL returns NULL, as required.
If you need the resultant columns to be INT, you'll need to cast each column again.
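If keeping INT types matters, one hedged alternative to the GROUP_CONCAT trick is a correlated scalar sub-select per column, folded into a single statement. This is the "query per value" shape the asker mentioned, but still one round trip; sketched here with SQLite in Python on the question's sample data (SQLite's own `group_concat` has no guaranteed order, so the MySQL trick doesn't port directly):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE test (
    id INTEGER PRIMARY KEY,
    a INTEGER, b INTEGER, c INTEGER, d INTEGER)""")
conn.executemany("INSERT INTO test VALUES (?, ?, ?, ?, ?)", [
    (1, 1,    None, None, None),
    (2, None, 2,    None, None),
    (3, None, None, None, None),
    (4, None, None, None, 4),
    (5, None, 3,    None, None),
])

# One scalar sub-select per column: the last (highest-id) non-NULL
# value, or NULL if the column was never set.
row = conn.execute("""
    SELECT (SELECT a FROM test WHERE a IS NOT NULL ORDER BY id DESC LIMIT 1),
           (SELECT b FROM test WHERE b IS NOT NULL ORDER BY id DESC LIMIT 1),
           (SELECT c FROM test WHERE c IS NOT NULL ORDER BY id DESC LIMIT 1),
           (SELECT d FROM test WHERE d IS NOT NULL ORDER BY id DESC LIMIT 1)
""").fetchone()
print(row)  # -> (1, 3, None, 4)
```

Each sub-select can use an index on (column, id), so this stays cheap even on larger tables.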

Select a column from table based on other column values

I have a table in MySql and table name is logs
+---------------+---------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+---------------+---------------+------+-----+---------+-------+
| domain | varchar(50) | YES | MUL | NULL | |
| sid | varchar(100) | YES | MUL | NULL | |
| email | varchar(100) | YES | MUL | NULL | |
+---------------+---------------+------+-----+---------+-------+
The following are sample rows from the table
+-----+------------------+--------+
| sid | email            | domain |
+-----+------------------+--------+
| 1   | xxx123@yahoo.com | xxx    |
| 2   | xxx123@yahoo.com | xxx    |
| 2   | yyy123@yahoo.com | yyy    |
| 2   | yyy123@yahoo.com | yyy    |
| 3   | zzz123@yahoo.com | zzz    |
| 4   | qqq123@yahoo.com | qqq    |
| 2   | ppp123@yahoo.com | ppp    |
+-----+------------------+--------+
I want a query something like
select * from logs
where sid IN (select sid from logs
where domain="xxx" AND email="xxx123@yahoo.com")
Desired output
+-----+------------------+--------+
| sid | email            | domain |
+-----+------------------+--------+
| 1   | xxx123@yahoo.com | xxx    |
| 2   | xxx123@yahoo.com | xxx    |
| 2   | yyy123@yahoo.com | yyy    |
| 2   | yyy123@yahoo.com | yyy    |
| 2   | ppp123@yahoo.com | ppp    |
+-----+------------------+--------+
I can do it using joins, but is there any way to get the results without using joins, or an optimized version of this query?
You can use where exists:
select l1.* from logs l1
where exists(
select 1 from logs l2
where l1.sid = l2.sid
and l2.domain = 'xxx'
and l2.email = 'xxx123@yahoo.com'
);
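A small self-contained check of this EXISTS query against the question's sample rows, using an in-memory SQLite database as a stand-in for MySQL:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE logs (sid TEXT, email TEXT, domain TEXT)")
conn.executemany("INSERT INTO logs VALUES (?, ?, ?)", [
    ("1", "xxx123@yahoo.com", "xxx"),
    ("2", "xxx123@yahoo.com", "xxx"),
    ("2", "yyy123@yahoo.com", "yyy"),
    ("2", "yyy123@yahoo.com", "yyy"),
    ("3", "zzz123@yahoo.com", "zzz"),
    ("4", "qqq123@yahoo.com", "qqq"),
    ("2", "ppp123@yahoo.com", "ppp"),
])

# Semi-join: keep every row whose sid also has a row matching the
# (domain, email) criteria, without duplicating rows as a join would.
rows = conn.execute("""
    SELECT l1.* FROM logs l1
    WHERE EXISTS (SELECT 1 FROM logs l2
                  WHERE l1.sid = l2.sid
                    AND l2.domain = 'xxx'
                    AND l2.email = 'xxx123@yahoo.com')
""").fetchall()
print(len(rows))  # -> 5 (sids 1 and 2, matching the desired output)
```

An index on (sid) and one on (domain, email) would let both sides of the EXISTS be resolved by index lookups.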
First, get a proper id on those rows. Second, have you tried it? It looks like it should work. I have no idea why you want that, though.
If it actually doesn't work, try this structure; it could be faster:
SELECT *
FROM some_table
WHERE relevant_field IN
(
SELECT * FROM
(
SELECT relevant_field
FROM some_table
WHERE conditions
) AS subquery
)
Do you want the whole table as the result, or just one column?
If I get your question right, I would simply use:
SELECT * FROM logs WHERE domain="xxx" AND email="xxx123@yahoo.com"
Or, if you want only the sid, just replace the * with sid.
And if all sids are numbers, why don't you use int or a similar column type?
It seems like you are doing something redundant; just by looking at your request, you seem to be looking for
select * from logs where domain="xxx" AND email="xxx123@yahoo.com"
I don't know why you are using the first part of the SQL string, since this is not a join with other SQL tables.
Or am I missing something?

mysql - select distinct mutually exclusive (based on another column's value) rows

First off, I would like to say that if, after reading the question, anyone has a suggestion for a more informative title, please tell me, as I think mine is somewhat lacking. Now, on to business...
Given this table structure:
+---------+-------------------------------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+---------+-------------------------------------+------+-----+---------+----------------+
| id | int(11) | NO | PRI | NULL | auto_increment |
| account | varchar(20) | YES | UNI | NULL | |
| domain | varchar(100) | YES | | NULL | |
| status | enum('FAILED','PENDING','COMPLETE') | YES | | NULL | |
+---------+-------------------------------------+------+-----+---------+----------------+
And this data:
+----+---------+------------------+----------+
| id | account | domain | status |
+----+---------+------------------+----------+
| 1 | jim | somedomain.com | COMPLETE |
| 2 | bob | somedomain.com | COMPLETE |
| 3 | joe | somedomain.com | COMPLETE |
| 4 | frank | otherdomain.com | COMPLETE |
| 5 | betty | otherdomain.com | PENDING |
| 6 | shirley | otherdomain.com | FAILED |
| 7 | tom | thirddomain.com | FAILED |
| 8 | lou | fourthdomain.com | COMPLETE |
+----+---------+------------------+----------+
I would like to select all domains which have a 'COMPLETE' status for all accounts (rows).
Any domain which has a row containing any value other than 'COMPLETE' for the status must not be returned.
So in the above example, My expected result would be:
+------------------+
| domain |
+------------------+
| somedomain.com |
| fourthdomain.com |
+------------------+
Obviously, I can achieve this by using a sub-query such as:
mysql> select distinct domain from test_table where status = 'complete' and domain not in (select distinct domain from test_table where status != 'complete');
+------------------+
| domain |
+------------------+
| somedomain.com |
| fourthdomain.com |
+------------------+
2 rows in set (0.00 sec)
This will work fine on our little mock-up test table, but in the real situation, the tables in question will be tens (or even hundreds) of thousands of rows, and I'm curious if there is some more efficient way to do this, as the sub-query is slow and intensive.
How about this:
select domain
from test_table
group by domain
having sum(case when status = 'COMPLETE'
then 0 else 1 end) = 0
I think this will work. Effectively it just joins two basic queries together, then compares their counts.
select
main.domain
from
your_table main
inner join
(
select
domain, count(id) as cnt
from
your_table
where
status = 'complete'
group by
domain
) complete
on complete.domain = main.domain
group by
main.domain
having
count(main.id) = complete.cnt
You should also ensure you have an index on domain as this relies on a join on that column.