I have a MySQL database that stores news articles with the publication date (just day information), the source, and the category. Based on these I want to generate a table that holds the article counts with respect to these 3 parameters.
Since for some combinations of these 3 parameters there might be no article, a simple GROUP BY won't do. I therefore first generate a table news_article_counts with all possible combinations of the 3 parameters and a default article_count of 0, like this:
SELECT * FROM news_article_counts;
+--------------+------------+----------+---------------+
| published_at | source | category | article_count |
+--------------+------------+----------+---------------+
| 2016-08-05 | 1826089206 | 0 | 0 |
| 2016-08-05 | 1826089206 | 1 | 0 |
| 2016-08-05 | 1826089206 | 2 | 0 |
| 2016-08-05 | 1826089206 | 3 | 0 |
| 2016-08-05 | 1826089206 | 4 | 0 |
| ... | ... | ... | ... |
+--------------+------------+----------+---------------+
For testing, I now created a temporary table tmp as the GROUP BY result from the original news article table:
SELECT * FROM tmp LIMIT 6;
+--------------+------------+----------+-----+
| published_at | source | category | cnt |
+--------------+------------+----------+-----+
| 2016-08-05 | 1826089206 | 3 | 1 |
| 2003-09-19 | 1826089206 | 4 | 1 |
| 2005-08-08 | 1826089206 | 3 | 1 |
| 2008-07-22 | 1826089206 | 4 | 1 |
| 2008-11-26 | 1826089206 | 8 | 1 |
| ... | ... | ... | ... |
+--------------+------------+----------+-----+
Given these two tables, the following query works as expected:
SELECT * FROM news_article_counts c, tmp t
WHERE c.published_at = t.published_at AND c.source = t.source AND c.category = t.category;
But now I need to update the article_count of table news_article_counts with the values in table tmp where the 3 parameters match up. For this I'm using the following query (I've tried different ways but with the same results):
UPDATE
news_article_counts c
INNER JOIN
tmp t
ON
c.published_at = t.published_at AND
c.source = t.source AND
c.category = t.category
SET
c.article_count = t.cnt;
Executing this query yields this error:
ERROR 1062 (23000): Duplicate entry '2018-04-07 14:46:17-1826089206-1' for key 'uniqueIndex'
uniqueIndex is a joint index over published_at, source, category of table news_article_counts. But this shouldn't be a problem since I do not -- as far as I can tell -- update any of those 3 values, only article_count.
What confuses me most is that the error mentions the timestamp at which I executed the query (here: 2018-04-07 14:46:17). I have absolutely no idea where this comes into play. In fact, some rows in news_article_counts now have 2018-04-07 14:46:17 as the value for published_at. While this explains the error, I cannot see why published_at gets overwritten with the current timestamp. There is no ON UPDATE CURRENT_TIMESTAMP on this column; see:
CREATE TABLE IF NOT EXISTS `test`.`news_article_counts` (
`published_at` TIMESTAMP NOT NULL,
`source` INT UNSIGNED NOT NULL,
`category` INT UNSIGNED NOT NULL,
`article_count` INT UNSIGNED NOT NULL DEFAULT 0,
UNIQUE INDEX `uniqueIndex` (`published_at` ASC, `source` ASC, `category` ASC))
ENGINE = MyISAM
DEFAULT CHARACTER SET = utf8mb4;
What am I missing here?
UPDATE 1: I actually checked the table definition of news_article_counts in the database. And there's indeed the following:
mysql> SHOW COLUMNS FROM news_article_counts;
+---------------+------------------+------+-----+-------------------+-----------------------------+
| Field | Type | Null | Key | Default | Extra |
+---------------+------------------+------+-----+-------------------+-----------------------------+
| published_at | timestamp | NO | | CURRENT_TIMESTAMP | on update CURRENT_TIMESTAMP |
| source | int(10) unsigned | NO | | NULL | |
| category | int(10) unsigned | NO | | NULL | |
| article_count | int(10) unsigned | NO | | 0 | |
+---------------+------------------+------+-----+-------------------+-----------------------------+
But why is on update CURRENT_TIMESTAMP set? I double- and triple-checked my CREATE TABLE statement. I removed the joint index and I added an artificial primary key (auto_increment). Nothing helped. I've even tried to explicitly remove these attributes from published_at with:
ALTER TABLE `news_article_counts` CHANGE `published_at` `published_at` TIMESTAMP NOT NULL;
Nothing seems to work for me.
It looks like you have the explicit_defaults_for_timestamp system variable disabled. One of the effects of this is:
The first TIMESTAMP column in a table, if not explicitly declared with the NULL attribute or an explicit DEFAULT or ON UPDATE attribute, is automatically declared with the DEFAULT CURRENT_TIMESTAMP and ON UPDATE CURRENT_TIMESTAMP attributes.
You could try enabling this system variable, but that could potentially impact other applications. I think it only takes effect when you're actually creating a table, so it shouldn't affect any existing tables.
If you don't want to make a system-level change like this, you could add an explicit DEFAULT attribute to the published_at column of this table; then it won't automatically add ON UPDATE.
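As a minimal sketch of the explicit-DEFAULT option (untested against your server; it assumes the table definition from the question, and DEFAULT CURRENT_TIMESTAMP is just one possible choice of default), the column could be redeclared like this:

```sql
-- Check whether the implicit TIMESTAMP behaviour is active (OFF means it is).
SHOW VARIABLES LIKE 'explicit_defaults_for_timestamp';

-- Give published_at an explicit DEFAULT and no ON UPDATE clause,
-- so MySQL does not add ON UPDATE CURRENT_TIMESTAMP on its own.
ALTER TABLE news_article_counts
  MODIFY published_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP;

-- The Extra column should no longer list "on update CURRENT_TIMESTAMP".
SHOW COLUMNS FROM news_article_counts LIKE 'published_at';
```

Rows whose published_at was already overwritten keep the wrong value, so the table would probably need to be repopulated afterwards. Since only the day is stored, a DATE column might also be worth considering, because DATE columns do not get this automatic behaviour.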
I need to reduce the size of a MySQL database. I recoded some information, which stripped ';' and ':' from the sources column (~10% char reduction). After doing so, the size of the table is exactly the same as before. How is that possible? I'm using the MyISAM engine.
btw: Unfortunately, I cannot compress the tables with myisampack.
mysql> INSERT INTO test SELECT protid1, protid2, CS, REPLACE(REPLACE(sources, ':', ''), ';', '') FROM homologs_9606;
Query OK, 41917131 rows affected (4 min 11.30 sec)
Records: 41917131 Duplicates: 0 Warnings: 0
mysql> select TABLE_NAME name, ROUND(TABLE_ROWS/1e6, 3) 'million rows', ROUND(DATA_LENGTH/power(2,30), 3) 'data GB', ROUND(INDEX_LENGTH/power(2,30), 3) 'index GB' from information_schema.TABLES WHERE TABLE_NAME IN ('homologs_9606', 'test') ORDER BY TABLE_ROWS DESC LIMIT 10;
+---------------+--------------+---------+----------+
| name | million rows | data GB | index GB |
+---------------+--------------+---------+----------+
| test | 41.917 | 0.857 | 1.075 |
| homologs_9606 | 41.917 | 0.887 | 1.075 |
+---------------+--------------+---------+----------+
2 rows in set (0.01 sec)
mysql> select * from homologs_9606 limit 10;
+---------+---------+-------+--------------------------------+
| protid1 | protid2 | CS | sources |
+---------+---------+-------+--------------------------------+
| 5635338 | 1028608 | 0.000 | 10:,1 |
| 5644385 | 1028611 | 0.947 | 5:1,1;8:0.943,35;10:1,1;11:1,1 |
| 5652325 | 1028611 | 0.947 | 5:1,1;8:0.943,35;10:1,1;11:1,1 |
| 5641128 | 1028612 | 1.000 | 8:1,10 |
| 5636414 | 1028616 | 0.038 | 8:0.038,104;10:,1 |
| 5636557 | 1028616 | 0.000 | 8:,4 |
| 5637419 | 1028616 | 0.011 | 5:,1;8:0.011,91;10:,1 |
| 5641196 | 1028616 | 0.080 | 5:1,1;8:0.074,94;10:,1;11:,4 |
| 5642914 | 1028616 | 0.000 | 8:,3 |
| 5643778 | 1028616 | 0.056 | 8:0.057,70;10:,1 |
+---------+---------+-------+--------------------------------+
10 rows in set (4.55 sec)
mysql> select * from test limit 10;
+---------+---------+-------+-------------------------+
| protid1 | protid2 | CS | sources |
+---------+---------+-------+-------------------------+
| 5635338 | 1028608 | 0.000 | 10,1 |
| 5644385 | 1028611 | 0.947 | 51,180.943,35101,1111,1 |
| 5652325 | 1028611 | 0.947 | 51,180.943,35101,1111,1 |
| 5641128 | 1028612 | 1.000 | 81,10 |
| 5636414 | 1028616 | 0.038 | 80.038,10410,1 |
| 5636557 | 1028616 | 0.000 | 8,4 |
| 5637419 | 1028616 | 0.011 | 5,180.011,9110,1 |
| 5641196 | 1028616 | 0.080 | 51,180.074,9410,111,4 |
| 5642914 | 1028616 | 0.000 | 8,3 |
| 5643778 | 1028616 | 0.056 | 80.057,7010,1 |
+---------+---------+-------+-------------------------+
10 rows in set (0.00 sec)
mysql> describe test;
+---------+------------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+---------+------------------+------+-----+---------+-------+
| protid1 | int(10) unsigned | YES | PRI | NULL | |
| protid2 | int(10) unsigned | YES | PRI | NULL | |
| CS | float(4,3) | YES | | NULL | |
| sources | varchar(100) | YES | | NULL | |
+---------+------------------+------+-----+---------+-------+
4 rows in set (0.00 sec)
mysql> describe homologs_9606;
+---------+------------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+---------+------------------+------+-----+---------+-------+
| protid1 | int(10) unsigned | NO | PRI | 0 | |
| protid2 | int(10) unsigned | NO | PRI | 0 | |
| CS | float(4,3) | YES | | NULL | |
| sources | varchar(100) | YES | | NULL | |
+---------+------------------+------+-----+---------+-------+
4 rows in set (0.00 sec)
EDIT1: Added average column length.
mysql> select AVG(LENGTH(sources)) from test;
+----------------------+
| AVG(LENGTH(sources)) |
+----------------------+
| 5.2177 |
+----------------------+
1 row in set (10.04 sec)
mysql> select AVG(LENGTH(sources)) from homologs_9606;
+----------------------+
| AVG(LENGTH(sources)) |
+----------------------+
| 6.8792 |
+----------------------+
1 row in set (9.95 sec)
EDIT2: I was able to strip some more MB by setting NOT NULL to all columns.
mysql> drop table test;
Query OK, 0 rows affected (0.42 sec)
mysql> CREATE table test (protid1 INT UNSIGNED NOT NULL DEFAULT '0', protid2 INT UNSIGNED NOT NULL DEFAULT '0', CS FLOAT(4,3) NOT NULL DEFAULT '0', sources VARCHAR(100) NOT NULL DEFAULT '0', PRIMARY KEY (protid1, protid2), KEY `idx_protid2` (protid2)) ENGINE=MyISAM CHARSET=ascii;
Query OK, 0 rows affected (0.06 sec)
mysql> INSERT INTO test SELECT protid1, protid2, CS, REPLACE(REPLACE(sources, ':', ''), ';', '') FROM homologs_9606;
Query OK, 41917131 rows affected (2 min 7.84 sec)
Records: 41917131 Duplicates: 0 Warnings: 0
mysql> select TABLE_NAME name, ROUND(TABLE_ROWS/1e6, 3) 'million rows', ROUND(DATA_LENGTH/power(2,30), 3) 'data GB', ROUND(INDEX_LENGTH/power(2,30), 3) 'index GB' from information_schema.TABLES WHERE TABLE_NAME IN ('homologs_9606', 'test');
+---------------+--------------+---------+----------+
| name | million rows | data GB | index GB |
+---------------+--------------+---------+----------+
| homologs_9606 | 41.917 | 0.887 | 1.075 |
| test | 41.917 | 0.842 | 1.075 |
+---------------+--------------+---------+----------+
2 rows in set (0.02 sec)
They are not exactly the same. Your query clearly shows that test is about 30 MB smaller than homologs_9606:
+---------------+--------------+---------+
| name | million rows | data GB |
+---------------+--------------+---------+
| test | 41.917 | 0.857 | <-- 0.857 < 0.887
| homologs_9606 | 41.917 | 0.887 |
+---------------+--------------+---------+
How much storage should we expect for your table? Let us check Data Type Storage Requirements:
INTEGER(10): 4 bytes
FLOAT(4): 4 bytes
VARCHAR(100): L+1
where L is the number of character bytes, which is usually one byte per character but sometimes more if you use a Unicode character set.
Your rows on average will need:
INTEGER + INTEGER + FLOAT + VARCHAR =
4 + 4 + 4 + (L + 1) = L + 13 bytes
We can infer your original average L as (0.887*1024^3 / 41917131) - 13 = 9.72. You say that you stripped 10% from sources, which means your new L is 9.72*0.9 = 8.75. That gives an expected new total storage requirement of ((8.75 + 13) * 41917131) / 1024^3 = 0.849 GB
I suspect that the difference (between 0.849 and 0.857) might be due to the fact that test has two columns set as NULLable that are NOT NULL in homologs_9606, but I do not know enough about the MyISAM engine to calculate this exactly. I can however guess! At a minimum you would need 1 bit per column per row to store a NULL state, which in your case means two bits per row, or 2*41917131 = 83834262 bits = 10 479 283 bytes = 0.010 GB. The total 0.849+0.010 = 0.859 slightly overshoots the target (by about 2 MB). But I have done some rounding and your 10% figure is also an estimate, so I am sure the rest is lost in translation.
Another reason could be a Unicode character set on sources in test, in which case some characters may use more than one byte each, but since the NULLable columns seem to account for everything I do not think this is the case for your table.
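The arithmetic above can be rechecked directly in MySQL. This is only a sketch of the same estimate: the constants are the figures quoted in the question, and the 13 bytes of fixed overhead per row and the 10% reduction are the assumptions made above, not measured values.

```sql
SELECT
  -- average L inferred from the original data size (about 9.72)
  ROUND(0.887 * POW(1024, 3) / 41917131 - 13, 2)        AS inferred_avg_l,
  -- expected data size in GB after removing 10% of L (about 0.849)
  ROUND((9.72 * 0.9 + 13) * 41917131 / POW(1024, 3), 3) AS expected_data_gb,
  -- two NULL flag bits per row, converted to GB (about 0.010)
  ROUND(2 * 41917131 / 8 / POW(1024, 3), 3)             AS null_flags_gb;
```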
Summary
Your two tables are not the same size; they differ by 30 MB.
The size of your new table is around the expected size.
You can save some more space in your new table by making protid1 and protid2 into NOT NULL columns.
The "table" is stored in a .MYD file. This file will never shrink due to UPDATEs or DELETEs. SHOW TABLE STATUS (or the equivalent query into information_schema) may show Data_length shrinking, but Data_free will increase.
You can shrink the .MYD file by doing OPTIMIZE TABLE. But that will copy the table over, thereby needing extra disk space during the process. And this action is only very rarely worth doing.
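A small sketch of how this could be checked, assuming the table is the test table from the question and lives in the currently selected database:

```sql
-- DATA_FREE is the space held by holes left behind by DELETEs/UPDATEs.
SELECT TABLE_NAME, DATA_LENGTH, DATA_FREE, INDEX_LENGTH
FROM information_schema.TABLES
WHERE TABLE_SCHEMA = DATABASE()
  AND TABLE_NAME = 'test';

-- Rebuilds the table and reclaims the holes; this needs extra disk space
-- while it runs.
OPTIMIZE TABLE test;
```

If DATA_FREE is already close to zero, as is likely for a table that was just filled by a single INSERT ... SELECT, OPTIMIZE TABLE has little left to reclaim.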
Changing to NOT NULL may not free up space if you had a lot of nulls -- "" takes 1 or 2 bytes for a VARCHAR because of the length. (And your code may need to handle '' differently than NULL.)
The space taken for each row is actually 1 byte more than previously mentioned -- this byte handles knowing whether the row exists or is the beginning of a hole.
For large text fields, I like to do this to save space. (This applies to both MyISAM and InnoDB.) Compress the text and store it into a BLOB column (instead of TEXT). For most text, that is a 3:1 shrinkage. It takes a little extra code and CPU time in the client, but it saves a lot of I/O in the server. Often the net result is "faster". I would not use it for the varchar you have; I would only do it on columns bigger than, say, 50 characters average.
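To illustrate the idea only: the sketch below uses MySQL's built-in COMPRESS() and UNCOMPRESS() on the server instead of compressing in the client as described above, so it does not save the server CPU or network traffic that client-side compression would. The table doc_store and its columns are hypothetical names, and COMPRESS() needs a MySQL build with compression support.

```sql
-- Hypothetical table: the compressed text goes into a BLOB, not TEXT.
CREATE TABLE doc_store (
  id     INT UNSIGNED NOT NULL PRIMARY KEY,
  body_z BLOB NOT NULL
) ENGINE=MyISAM;

-- Store a long, repetitive string in compressed form.
INSERT INTO doc_store (id, body_z)
VALUES (1, COMPRESS(REPEAT('some long text ', 100)));

-- Compare stored size with original size and read the text back.
SELECT id,
       LENGTH(body_z)               AS stored_bytes,
       UNCOMPRESSED_LENGTH(body_z)  AS original_bytes,
       LEFT(UNCOMPRESS(body_z), 15) AS preview
FROM doc_store;
```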
Back to the original question. It sounds like there were only about 30M colons and semicolons in the entire table. Could it be that the first 10 rows are not representative?
For some reason the rows are not being updated. Any idea why this would happen?
UPDATE hts SET assigned='1' AND Owner='ms' WHERE hid='217477'
Query OK, 0 rows affected (0.16 sec)
Rows matched: 1 Changed: 0 Warnings: 0
select assigned, Owner from hts where hid='217477';
+----------+-------+
| assigned | Owner |
+----------+-------+
| NULL | NULL |
+----------+-------+
Show columns from hts
+------------+--------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+------------+--------------+------+-----+---------+-------+
| hid | varchar(25) | YES | UNI | NULL | |
| assigned | int(11) | NO | | 0 | |
| Owner | varchar(10) | YES | | NULL | |
+------------+--------------+------+-----+---------+-------+
Two things you can try.
First, try removing the AND from the SET clause; multiple assignments are separated with a comma:
UPDATE hts SET assigned=1, Owner='ms' WHERE hid='217477'
Second, try removing the quotes from the hid value if the column is an INT and not a VARCHAR:
UPDATE hts SET assigned=1, Owner='ms' WHERE hid=217477
Not sure why you are storing integers as strings. When in doubt, you should ALWAYS store data using its intended datatype.
RECOMMENDATION: if the columns are varchar, change their datatypes to int. Your update would then look like this:
UPDATE hts SET assigned=1, Owner='ms' WHERE hid=217477
assigned should be an integer, and so should hid.
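Why the AND version matches the row but changes nothing can be seen from how MySQL parses it: the statement contains a single assignment, assigned = ('1' AND (Owner = 'ms')), so Owner is only compared, never assigned. A rough illustration with literal values standing in for the column (the sample values are made up):

```sql
-- The right-hand side of the original SET is one boolean expression.
SELECT ('1' AND ('xy' = 'ms')) AS owner_differs,  -- 0
       ('1' AND ('ms' = 'ms')) AS owner_matches,  -- 1
       ('1' AND (NULL = 'ms')) AS owner_is_null;  -- NULL

-- Intended form: two separate assignments, separated by a comma.
UPDATE hts SET assigned = 1, Owner = 'ms' WHERE hid = '217477';
```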
Here is my code to delete my first row, but no rows are affected:
mysql> select * from myt;
+--------+--------------+------+---------+
| Fname | Lname | age | phone |
+--------+--------------+------+---------+
| NULL | Jackson | NULL | NULL |
| stive | NULL | NULL | NULL |
| ghbfgf | rtrgf | 22 | 111 |
| zxas | zxa | 30 | 6547812 |
| wewew | uytree | 22 | 658478 |
+--------+--------------+------+---------+
5 rows in set (0.00 sec)
mysql> delete from myt
-> Where Fname = "NULL";
Query OK, 0 rows affected (0.00 sec)
Thanks!
Use IS NULL.
You cannot use arithmetic comparison operators such as =, <, or <> to test for NULL.
DELETE FROM myt WHERE Fname IS NULL
See "Working with NULL Values" in the MySQL manual.
NULL is not a value.
NULL means nothing is present.
So usage of FNAME = "NULL" is wrong.
delete from myt Where Fname IS NULL;
The Fname in your first row is NULL (no value), not the string "NULL".
NULL is not a value in an RDBMS; it is a marker for a missing value. When you write "NULL" in quotes, it denotes a string value. You can simply use "IS NULL". Hope this helps.
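A short sketch of the difference, using only literals and the myt table from the question:

```sql
SELECT NULL = NULL    AS eq_null,       -- NULL (unknown), so WHERE never matches
       NULL IS NULL   AS is_null,       -- 1
       'NULL' IS NULL AS string_null,   -- 0: 'NULL' is an ordinary 4-character string
       NULL <=> NULL  AS null_safe_eq;  -- 1: <=> is MySQL's NULL-safe equality

-- Deletes the rows whose Fname is missing (the "Jackson" row in the sample).
DELETE FROM myt WHERE Fname IS NULL;
```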
For a preferences module I have "system defaults", and "user preferences".
If there is no personal/user preference stored, then use the system default values instead.
Here is my system preferences table:
mysql> desc rbl;
+-------------+---------------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+-------------+---------------------+------+-----+---------+-------+
| id | varchar(3) | NO | PRI | | |
| rbl_url | varchar(100) | NO | | | |
| description | varchar(100) | NO | | | |
| is_default | tinyint(1) unsigned | YES | | 1 | |
+-------------+---------------------+------+-----+---------+-------+
4 rows in set (0.00 sec)
Example data from system prefs:
mysql> select * from rbl;
+----+----------------------+------------------------------+------------+
| id | rbl_url | description | is_default |
+----+----------------------+------------------------------+------------+
| 1 | sbl-xbl.spamhaus.org | Spamhaus SBL-XBL | 1 |
| 2 | pbl.spamhaus.org | Spamhaus PBL | 1 |
| 3 | bl.spamcop.net | Spamcop Blacklist | 1 |
| 4 | rbl.example.com | Example RBL - not functional | 0 |
+----+----------------------+------------------------------+------------+
... and Query for system defaults:
mysql> SELECT rbl_url FROM rbl WHERE is_default='1';
+----------------------+
| rbl_url |
+----------------------+
| sbl-xbl.spamhaus.org |
| pbl.spamhaus.org |
| bl.spamcop.net |
+----------------------+
3 rows in set (0.01 sec)
So far so good.
OK. Now I need a user preferences table, and I came up with this:
mysql> desc rbl_pref;
+-----------+-----------------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+-----------+-----------------------+------+-----+---------+----------------+
| id | mediumint(8) unsigned | NO | PRI | NULL | auto_increment |
| domain_id | mediumint(8) unsigned | NO | | NULL | |
| rbl_id | tinyint(1) unsigned | NO | | NULL | |
+-----------+-----------------------+------+-----+---------+----------------+
3 rows in set (0.00 sec)
(FYI: a "user" is represented by "domain_id".)
Let's view the preferences of a specific user who has personalized preferences saved:
mysql> select * from rbl_pref where domain_id='2277';
+----+-----------+--------+
| id | domain_id | rbl_id |
+----+-----------+--------+
| 4 | 2277 | 1 |
| 5 | 2277 | 2 |
| 6 | 2277 | 4 |
+----+-----------+--------+
3 rows in set (0.00 sec)
... again, but in a simpler format:
mysql> SELECT rbl.rbl_url FROM rbl_pref,rbl
WHERE rbl_pref.rbl_id=rbl.id AND domain_id='2277';
+----------------------+
| rbl_url |
+----------------------+
| sbl-xbl.spamhaus.org |
| pbl.spamhaus.org |
| rbl.example.com |
+----------------------+
3 rows in set (0.00 sec)
.. so far so good. If a user has stored a preference, a result is found.
The problem example now is, user 1999 has no custom preferences.
In place of the "Empty set" result, I want the system defaults.
mysql> SELECT rbl.rbl_url FROM rbl_pref,rbl
WHERE rbl_pref.rbl_id=rbl.id AND domain_id='1999';
Empty set (0.00 sec)
I was excited to find a very similar question:
mysql if row doesn't exist, grab default value
However after a couple of days trial and error and documentation review, I could not translate that answer over to here.
Like the above question, this must be done as a single MySQL query. I am not actually making this query from PHP, but from Exim macros (and it is a very picky language... best to feed it "one liners" as variable assignments, as I try to do here.. )
UPDATE: Tried one type of UNION query suggested by @Biff McGriff, below. The table did not display in my comment reply, so here it is again:
mysql> SELECT rbl.rbl_url FROM rbl_pref,rbl
WHERE rbl_pref.rbl_id=rbl.id AND domain_id='2277'
UNION SELECT rbl_url FROM rbl WHERE is_default='1';
+----------------------+
| rbl_url |
+----------------------+
| sbl-xbl.spamhaus.org |
| pbl.spamhaus.org |
| rbl.example.com |
| bl.spamcop.net |
+----------------------+
4 rows in set (0.00 sec)
As you can see above, user 2277 did not opt in to rbl_id 3 (bl.spamcop.net), but that's showing up anyway.
What my UNION query seems to be doing is combining the result set. So user_pref acts as "in addition to" global defaults, and I was assuming/expecting I would get a result set matching either half of the query.
So my question now is: is it better (or possible, and if so how) to solve this as "either result set" (either subquery on either side of the UNION)? Or do I really need a new field on rbl_pref, called for example "enabled"? The latter seems more correct: I need something in rbl_pref to explicitly designate opt-in or opt-out (other than the implicit "that pref is not here, no rbl_id=3, in the overridden user result set").
UPDATE: All set, thanks @Imre L, and everyone else. I learned something through this example.
You should be able to use a left join and then coalesce the user's field with the default field.
NOTE: you have to enter the domain_id in two places.
SELECT rbl.rbl_url FROM rbl
JOIN rbl_pref ON rbl_pref.rbl_id=rbl.id AND domain_id=2277
UNION
SELECT rbl.rbl_url FROM rbl
WHERE rbl.is_default
AND NOT EXISTS (SELECT 1 FROM rbl_pref WHERE domain_id=2277 LIMIT 1)
;
Now one or the other side of the UNION will be optimized away with an impossible WHERE.
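For the user without stored preferences, the same query would be run with the other domain_id in both places. With the sample data from the question, the first branch finds nothing and the NOT EXISTS branch should return the three default URLs (sbl-xbl.spamhaus.org, pbl.spamhaus.org, bl.spamcop.net):

```sql
SELECT rbl.rbl_url FROM rbl
JOIN rbl_pref ON rbl_pref.rbl_id = rbl.id AND domain_id = 1999
UNION
SELECT rbl.rbl_url FROM rbl
WHERE rbl.is_default
AND NOT EXISTS (SELECT 1 FROM rbl_pref WHERE domain_id = 1999 LIMIT 1);
```

For user 2277 it is the other way round: NOT EXISTS is false, so only the stored preferences come back and bl.spamcop.net is not included.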
You also should not use varchar(3) for rbl.id, but some sort of integer,
preferably the same type as rbl_pref.rbl_id, for which tinyint is too tiny.
And when you compare integer fields in SQL code, as in domain_id='2277', you should not put ' or " around integer constants.
You can mostly get away with it, but sometimes it may confuse the MySQL optimizer.
Also, for optimal performance and consistency, I suggest you add this index:
ALTER TABLE rbl_pref
ADD UNIQUE INDEX ux_domain_rbl (domain_id, rbl_id);