I have a table with 2 foreign keys. I'm somewhat new to MySQL; can someone tell me which is the right way to apply an INDEX to tables?
# Sample 1
CREATE TABLE IF NOT EXISTS `my_table` (
`topic_id` INT UNSIGNED NOT NULL ,
`course_id` INT UNSIGNED NOT NULL ,
PRIMARY KEY (`topic_id`, `course_id`) ,
INDEX `topic_id_idx` (`topic_id` ASC) ,
INDEX `course_id_idx` (`course_id` ASC) )
ENGINE = InnoDB
DEFAULT CHARACTER SET = utf8
COLLATE = utf8_general_ci;
# Sample 2
CREATE TABLE IF NOT EXISTS `my_table` (
`topic_id` INT UNSIGNED NOT NULL ,
`course_id` INT UNSIGNED NOT NULL ,
PRIMARY KEY (`topic_id`, `course_id`) ,
INDEX `topic_id_idx` (`topic_id`, `course_id`) )
ENGINE = InnoDB
DEFAULT CHARACTER SET = utf8
COLLATE = utf8_general_ci;
I guess what I'm really asking is: what's the difference between defining the two as separate indexes versus defining them as one combined index?
The reason you might want one of these over the other has to do with how you plan to query the data. Getting this determination right can be a bit of a trick.
Think of the combined key in terms of, for example, looking up a student's folder in a filing cabinet, first by the student's last name, and then by their first name.
Now, in the case of the two single indexes in your example, you could imagine, in the student example, having two different sets of organized folders, one with every first name in order, and another with every last name in order. In this case, you'll always have to work through the greatest number of similar records, but that doesn't matter so much if you only have one name or the other anyway. In such a case, this arrangement gives you the greatest flexibility while still only maintaining indexes over two columns.
In contrast, if given both first and last name, it's a lot easier for us as humans to look up a student first by last name, then by first name (within a smaller set of potentials). However, when the last name is not known, it becomes very difficult to find the student by first name alone, because students with the same first name are potentially interleaved with every variation of last name (a table scan). This is all true for the algorithms the computer uses to look up the information too.
So, as a rule of thumb, add the extra key to a single index if you are going to be filtering the data by both values at once. If at times you will have one and not the other, make sure whichever value that is, it's the leftmost key in the index. If the value could be either, you'll probably want both indexes (one of these could actually contain both keys, for the best of both worlds, but even that comes at a cost in terms of writes). Getting this stuff right can be pretty important, as this often amounts to an all-or-nothing game. If all the data the DBMS requires to perform the indexed lookup isn't present, it will probably resort to a table scan. MySQL's EXPLAIN feature is one tool which can be helpful in checking your configuration and identifying optimizations.
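To make that concrete, here is a rough sketch against the asker's my_table definitions (EXPLAIN output will vary with data volume and MySQL version):
-- With the composite key (topic_id, course_id) (the PRIMARY KEY in both samples,
-- or Sample 2's topic_id_idx), these can be satisfied by a single index seek:
EXPLAIN SELECT * FROM my_table WHERE topic_id = 1;
EXPLAIN SELECT * FROM my_table WHERE topic_id = 1 AND course_id = 2;
-- course_id is not the leftmost column of the composite key, so this query
-- needs Sample 1's separate course_id_idx to avoid scanning:
EXPLAIN SELECT * FROM my_table WHERE course_id = 2;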
If you create an index on just one column, then a search can use only that column.
INDEX `topic_id_idx` (`topic_id` ASC) ,
INDEX `course_id_idx` (`course_id` ASC)
In this situation, the data is searched on topic_id and course_id separately. But if you combine them into one index, the data is searched using both together.
For example, if you have some data as follows:
topic_id | course_id
---------+----------
abc      | 1
pqr      | 2
abc      | 3
If you want to search for abc - 3 and you have separate indexes, then it will search the two columns separately and combine the results to find the match.
But if you combine them, then it will look up abc + 3 directly.
Related
I'm still trying to get my head around the best way to use INDEXES in MySQL. How do you know when to merge them together and when to have them separate?
Below are the indexes from the WordPress posts table. See how post_name, post_parent and post_author are separate entries? And then they have type_status_date, which is a combination of 4 fields?
http://img215.imageshack.us/img215/5976/screenshot20120426at431.png
I don't understand the logic behind this. Can anyone enlighten me?
Going to be a bit of a long answer, but here we go. Please note I am not going to deal with the differences in database engines here (MyISAM and InnoDB have distinct ways of implementing what I am trying to describe).
The first thing you have to understand about an index is that it is a separate data structure stored on disk. Normally this is a B-tree data structure containing the column(s) that you have indexed, and it also contains a pointer to the row in the table (this pointer is normally the primary key).
The only index that is stored with the data is the primary key index. Thus a primary key index IS the table.
Let's assume you have the following table definition.
CREATE TABLE `Student` (
`StudentNumber` INT NOT NULL ,
`Name` VARCHAR(32) NULL ,
`Surname` VARCHAR(32) NULL ,
`StudentEmail` VARCHAR(32) NULL ,
PRIMARY KEY (`StudentNumber`) );
Since we have a primary key on StudentNumber, there will be an index containing the primary key and the other columns of the table. If you had to look at the data in the index, you would probably see something like this.
1, John, Doe, Jdoe@gmail.com
As you can see this is the table data once again showing you that the primary key index IS the table.
The StudentNumber column is indexed, which allows you to search on it efficiently; the rest of the data is stored with the key. Thus, if you ran the following query:
SELECT * FROM Student WHERE StudentNumber=1
MySQL would use the primary index to quickly find the row and then read the data stored with the indexed column. Since there is an index, MySQL can use it to do an efficient seek operation on the B-tree.
Also, when it comes to retrieving the data after doing the search, MySQL can read the data straight from the index, so we are using one operation in the index to retrieve the data. Now, if I ran the following query:
SELECT * FROM Student WHERE Name = 'John'
MySQL would check whether there is an index that it could use to speed the query up. However, in my case there is no index on Name, so MySQL would do a sequential read from the table, one row at a time, from the first row to the last.
At each row it would evaluate the row against the WHERE clause and return the matching rows. So basically it reads the primary key index from top to bottom. Remember, the primary key index is the table.
If I ran the following statement:
ALTER TABLE `TimLog`.`student`
ADD INDEX `ix_name` (`Name` ASC) ;
ALTER TABLE `TimLog`.`student`
ADD INDEX `ix_surname` (`Surname` ASC) ;
MySQL would create two new indexes on the Student table. These will be stored away from the table on disk, and the data inside them would look something like this:
Data in ix_Name
John, 1 <--PRIMARY KEY VALUE
Data in ix_Surname
Doe, 1 <--PRIMARY KEY VALUE
Notice that the data in the ix_Name index is the name plus the primary key value. So if I ran the previous select statement, MySQL would read the ix_Name index, get the primary key values for the matching items, and then use the primary key index to get the rest of the data.
So the number of operations to get the data from the index is 2. The matching rows are found in the index and then a lookup happens on the primary key to get the row data out.
You now have the following query:
SELECT * FROM Student WHERE Name='John' AND surname ='Doe'
Here MySQL can't use both indexes, as that would be a waste of operations. If MySQL had to use both indexes in this query, the following would happen (this should not happen):
1. Find in ix_Name the rows with the value John
2. Read the primary key that matches to get the row data
3. Store the matching results
4. Find in ix_Surname the rows with the value Doe
5. Read the primary key that matches to get the row data
6. Store the matching results
7. Take the Name results and Surname results and merge them
8. Return the query results
This is really a waste of IO, as MySQL would then read the table twice. Basically, using one index would be better than trying to use two (I will explain in a moment why). MySQL will choose one index to use in this simple query.
So how does MySQL decide on which index to use?
MySQL keeps statistics about indexes internally. These statistics tell MySQL, essentially, how unique an index is. So, for the sake of argument, let's say the surname index (ix_surname) is more selective than the name index (ix_name); MySQL would then use the surname index (ix_surname).
Thus query retrieval would be like this:
1. Use ix_surname and find the rows that match the value Doe
2. Read the primary key and apply the filter for the value John to the actual column data in the row
3. Return the matched rows
As you can see, the number of operations in this search is much smaller. I have oversimplified a lot of the technical detail. Indexing is an interesting thing to master, but you have to look at it from the perspective of: how do I get the data with the minimal amount of IO?
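If that Name-plus-Surname query is common, one option following the same logic (a sketch, not part of the original schema) is a single composite index over both columns, so the whole WHERE clause is handled by one index seek:
-- Entries are ordered by Surname, then Name, so WHERE Name = ... AND Surname = ...
-- becomes one seek on this index plus one primary-key lookup per matching row
ALTER TABLE `TimLog`.`student`
ADD INDEX `ix_surname_name` (`Surname` ASC, `Name` ASC) ;
-- EXPLAIN will typically show ix_surname_name as the chosen key for this query
EXPLAIN SELECT * FROM Student WHERE Name = 'John' AND Surname = 'Doe';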
Hope it is as clear as mud now!
MySQL cannot normally use more than one index at a time. That means, for instance, that when you have a query that filters or sorts on two fields you put them both into the same index.
WordPress likely has a common query that filters and/or sorts on post_type, post_status and post_date. Making an educated guess as to what they stand for, this would likely be the core query for WordPress's Post listing pages. So the three fields are put into the same index.
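As a sketch of the kind of query such a combined index is built for (column names are from the standard wp_posts table; the actual core query WordPress runs will differ):
-- Filters on the leftmost columns of type_status_date (post_type, post_status)
-- and sorts on the next one (post_date), so one composite index serves the whole query
SELECT *
FROM wp_posts
WHERE post_type = 'post'
  AND post_status = 'publish'
ORDER BY post_date DESC
LIMIT 10;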
I have a site with a bunch of users, and a bunch of "nodes" (content). Each node can be downloaded, and besides the particular node id in question, each download has a "license" associated with it (so a user can download node 5 for 'commercial use' or for 'personal use', etc.), as well as a price for each license.
My goal is to keep track of downloads in such a way that allows me to:
Get the number of downloads for a given node id and license id over a given time period (how many times has node 5 been downloaded in the last month for 'commercial use'?).
Get the total number of downloads for a given node id and license id.
Get the number of downloads for a given node_id regardless of license (all downloads for 'commercial use' and 'personal use' combined).
Get the node ids (and corresponding license ids) that have been downloaded by a given user that meet a given price criteria (i.e. price = 0, or price > 0).
Trivial data to store if optimization doesn't matter, but my issue is one of normalization/optimization for tables that may easily grow to millions of rows. Specifically, assume that:
Number of downloads is in the tens of millions.
Number of nodes is in the hundreds of thousands.
Number of users is in the tens of thousands.
I'm fairly new to any "real" mysql work, so I appreciate your help, and pointing out where I'm being stupid. Here's what I've got so far:
all_downloads table
+-------------+---------+------------+---------+-----------+-------+
| download_id | node_id | license_id | user_id | timestamp | price |
+-------------+---------+------------+---------+-----------+-------+
download_id is a unique key for this table. This table is a problem, because it could potentially have tens of millions of rows.
downloads_counted table
Instead of adding up the total number of downloads for a given node and license by querying the all_downloads table, the downloads are counted during a cron run, and those numbers are stored separately in a downloads_counted table:
+---------+------------+-----------------+-----------------+----------------+
| node_id | license_id | downloads_total | downloads_month | downloads_week |
+---------+------------+-----------------+-----------------+----------------+
The license id situation is new (formerly there was only one license, so licenses were not tracked in the database), so that's something I'm just trying to figure out how to work with now. In the past, node_id was a unique key for this table. I'm assuming that what I should do now is make the combination of node_id and license_id into a unique primary key. Or is it just as well to leave node_id as the only key for this table, and grab all rows for a given node_id, then parse the results in php (separating or combining downloads for each particular license)? Is it within best practice to have a table with no unique key?
In any case, I think this table is mostly okay, as it shouldn't grow to more than 1 or 2 million rows.
The question of returning downloads for a given user
This is the main area where I need help. I have considered just making the user_id a key in the all_downloads table, and simply querying for all rows that contain a given user_id. But I am concerned about querying this table in the long run, as it will be very large from the start, and could easily grow to tens of millions of rows.
I have considered creating a user_downloads table that would look something like this:
+---------+-----------+
| user_id | downloads |
+---------+-----------+
Where downloads would be a serialized array of node_ids and associated license ids and prices like so (5 is the node_id and would be the index within the top-level array of node_ids):
downloads = array('5' => array('license' => array('personal', 'commercial'), 'price' => 25))
I realize storing arrays of data in a single cell is considered bad practice, and I'm not sure that it would improve performance, since the array of downloads could easily grow into the thousands for a given user. However, I'm not sure how to create another table structure that would be more efficient than my all_downloads table at getting the downloads for a given user.
Any and all help is much appreciated!
====================================
Followup questions to Bill Karwin's answer:
timestamp is unfortunately going to be a Unix timestamp stored in an int(11), rather than a datetime (to conform to Drupal standards). I assume that doesn't really change anything from an optimization standpoint?
node_id/license_id/user_id (your idea for a clustered primary key) is not guaranteed to be unique, because users are allowed to download the same node under the same license as many times as they want. This was my primary reason for having a unique download_id for each row... is there a special reason that having a download_id would hurt performance? Or would it be acceptable to make the primary key a cluster of download_id/node_id/license_id/user_id? Or will having the download_id as the first part of the compound key throw off its usefulness?
Do you think it still makes sense to have a downloads_counted table, or would that be considered redundant? My thinking is that it would still help performance, since download counts (downloads total, this week, this month, etc.) are going to be showing up very frequently on the site, and the downloads_counted table would have one or two orders of magnitude fewer rows than the all_downloads table.
My idea for the downloads_counted table:
CREATE TABLE downloads_counted (
node_id INT UNSIGNED NOT NULL,
license_id INT UNSIGNED NOT NULL,
downloads_total INT UNSIGNED NOT NULL,
downloads_month INT UNSIGNED NOT NULL,
downloads_week INT UNSIGNED NOT NULL,
downloads_day INT UNSIGNED NOT NULL,
PRIMARY KEY (node_id, license_id),
KEY (node_id)
) ENGINE=InnoDB;
The secondary key on node_id is for getting all downloads for all licenses for a given node_id... is this key redundant, though, if node_id is already the first part of the compound primary key?
Here's how I would design the table:
CREATE TABLE all_downloads (
node_id INT UNSIGNED NOT NULL,
license_id INT UNSIGNED NOT NULL,
user_id INT UNSIGNED NOT NULL,
timestamp DATETIME NOT NULL,
price NUMERIC (9,2),
PRIMARY KEY (node_id,license_id,user_id),
KEY (price)
) ENGINE=InnoDB;
Notice I omitted the download_id.
Now you can run the queries you need to:
Get the number of downloads for a given node id and license id over a given time period (how many times has node 5 been downloaded in the last month for 'commercial use'?).
SELECT COUNT(*) FROM all_downloads WHERE (node_id,license_id) = (123,456)
AND timestamp > NOW() - INTERVAL 30 DAY
This should make good use of the clustered primary index, reducing the set of rows examined until the timestamp comparison only applies to a small subset.
Get the total number of downloads for a given node id and license id.
SELECT COUNT(*) FROM all_downloads WHERE (node_id,license_id) = (123,456);
Like the above, this makes use of the clustered primary index. Counting is accomplished by an index scan.
Get the number of downloads for a given node_id regardless of license (all downloads for 'commercial use' and 'personal use' combined).
SELECT COUNT(*) FROM all_downloads WHERE (node_id) = (123);
Ditto.
Get the node ids (and corresponding license ids) that have been downloaded by a given user that meet a given price criteria (i.e. price = 0, or price > 0).
SELECT node_id, license_id FROM all_downloads WHERE price = 0 AND user_id = 789;
This reduces the rows examined by using the secondary index on price. Then you take advantage of the fact that secondary indexes in InnoDB implicitly contain the columns of the primary key, so you don't even need to read the base data. This is called a covering index or an index-only query.
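A quick way to confirm that covering-index behavior (a sketch; the exact EXPLAIN output depends on your MySQL version and data) is to look for "Using index" in the Extra column:
-- "Using index" means the query was satisfied from the (price) secondary index alone,
-- since it implicitly carries the primary key columns node_id, license_id and user_id
EXPLAIN SELECT node_id, license_id
FROM all_downloads
WHERE price = 0 AND user_id = 789;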
As for your other questions:
No, it's not a good practice to define a table without a primary key constraint.
No, it's not a good practice to store a serialized array in a single column. See my answer for the question "Is storing a comma separated list in a database column really that bad?"
timestamp ... doesn't really change anything from an optimization standpoint?
I prefer DATETIME over TIMESTAMP here mainly because DATETIME stores the literal value you give it, while TIMESTAMP has a limited range and is converted to and from UTC using the session time zone. You can always convert a DATETIME to a UNIX timestamp integer in a query result, using the UNIX_TIMESTAMP() function.
would it be acceptable to make the primary key a cluster of download_id/node_id/license_id/user_id? Or will having the download_id as the first part of the compound key throw off its usefulness?
The benefit of a clustered key is that the rows are stored in order of the index. So if you query based on node_id frequently, there's a performance advantage to putting that first in the compound clustered index. I.e. if you are interested in the set of rows for a given node_id, it's a benefit that they're stored together because you defined the clustered index that way.
Do you think it still makes sense to have a downloads_counted table, or would that be considered redundant?
Sure, storing aggregate results in a table is a common way to reduce the work of counting up frequently-needed totals so often. But do so judiciously, because it takes some work to keep these totals in sync with the real data. The benefit is greater if you need to read the pre-calculated totals frequently, and multiple times for each time they are updated. Make sure you treat the aggregated totals as less authoritative than the real download data, and have a plan for re-generating the totals when they get out of sync.
Some people also put these aggregates into memcached keys instead of in a table, for even faster lookups. If the volatile data in memcached is lost for some reason, you can re-populate it from the download data.
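For example, a periodic refresh of the asker's downloads_counted table could be done in one statement (a sketch; column names follow the asker's downloads_counted design and the DATETIME timestamp from the table above, and the interval boundaries are illustrative):
INSERT INTO downloads_counted
  (node_id, license_id, downloads_total, downloads_month, downloads_week, downloads_day)
SELECT node_id, license_id,
       COUNT(*),
       SUM(timestamp > NOW() - INTERVAL 30 DAY),
       SUM(timestamp > NOW() - INTERVAL 7 DAY),
       SUM(timestamp > NOW() - INTERVAL 1 DAY)
FROM all_downloads
GROUP BY node_id, license_id
ON DUPLICATE KEY UPDATE
  downloads_total = VALUES(downloads_total),
  downloads_month = VALUES(downloads_month),
  downloads_week  = VALUES(downloads_week),
  downloads_day   = VALUES(downloads_day);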
PRIMARY KEY (node_id, license_id),
KEY (node_id)
) ENGINE=InnoDB;
is this key redundant, though, if node_id is already the first part of the compound primary key?
Yes. MySQL allows you to create redundant indexes, and this is an example of a redundant index. Any query that could use the secondary key on node_id could just as easily use the primary key. In fact, in this case the optimizer will never use the secondary key, because it will prefer the clustered index of the primary key.
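Dropping the redundant secondary key would look something like this (assuming MySQL auto-named the index node_id, which it does when KEY (node_id) is declared without a name):
-- Queries on node_id alone can still seek on the leftmost column
-- of PRIMARY KEY (node_id, license_id)
ALTER TABLE downloads_counted DROP INDEX node_id;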
You can use pt-duplicate-key-checker to analyze a database for redundant indexes.
I have a table with symbol names (e.g. functions) and their start and end memory addresses. Now I want to look up many addresses that fall between the start and end addresses and map each one to its symbol name (or, more simply, to the start address, as in the example below).
I do a query like this:
SELECT r.caller_addr AS caller_addr,sm.addrstart AS caller FROM rets AS r
JOIN symbolmap AS sm ON r.caller_addr BETWEEN sm.addrstart AND sm.addrend;
rets is a table that contains approximately a million caller_addr. The symbolmap table is created as:
CREATE TABLE
symbolmap
(addrstart BIGINT NOT NULL,
addrend BIGINT NOT NULL,
name VARCHAR(45),
PRIMARY KEY (addrstart),
UNIQUE INDEX (addrend)) ENGINE = InnoDB;
All the addrstart-to-addrend ranges are non-overlapping, i.e. there can only be one row hit for any requested address (r.caller_addr in the example). The symbolmap table contains 42000 rows. I have tried a few other index methods as well, but the select still takes a very long time (many tens of minutes) and has not managed to finish.
Any suggestions on better indexes or other select statements that have better performance? I'm running this on MySQL 5.1.41 and I don't need to worry about portability.
When I search for what others do, I only find results with constant boundaries, not for finding the row that has the right boundaries. But it seems to me like quite a general problem.
Try to combine the two columns in a single index:
CREATE TABLE
symbolmap
(addrstart BIGINT NOT NULL,
addrend BIGINT NOT NULL,
name VARCHAR(45),
PRIMARY KEY (addrstart, addrend)
) ENGINE = InnoDB;
Also make sure that caller_addr is a BIGINT as well.
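Applied to the existing table, the change might look like this (a sketch; dropping and re-adding the primary key rebuilds the table, so expect it to take a while on a large table):
-- Replace the separate keys with one combined index covering
-- both boundary columns used in the BETWEEN join
ALTER TABLE symbolmap
  DROP INDEX addrend,
  DROP PRIMARY KEY,
  ADD PRIMARY KEY (addrstart, addrend);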
Assume that I have one big table with three columns: "user_name", "user_property", "value_of_property". Let's also assume that I have a lot of users (say 100,000) and a lot of properties (say 10,000). Then the table is going to be huge (1 billion rows).
When I extract information from the table, I always need information about a particular user. So I use, for example, WHERE user_name = 'Albert Gates'. Every time, the MySQL server then needs to analyze 1 billion rows to find the ones that contain "Albert Gates" as user_name.
Would it not be wise to split the big table into many small ones corresponding to fixed users?
No, I don't think that is a good idea. A better approach is to add an index on the user_name column, and perhaps another index on (user_name, user_property) for looking up a single property. Then the database does not need to scan all the rows; it just needs to find the appropriate entry in the index, which is stored in a B-tree, making it easy to find a record in a very small amount of time.
If your application is still slow even after correctly indexing it can sometimes be a good idea to partition your largest tables.
One other thing you could consider is normalizing your database so that the user_name is stored in a separate table and an integer foreign key is used in its place. This can reduce storage requirements and can increase performance. The same may apply to user_property.
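A sketch of the suggested indexes (the table name user_data is hypothetical, since the question doesn't name the table):
-- Index for queries that filter on user_name alone
ALTER TABLE user_data ADD INDEX idx_user_name (user_name);
-- Composite index for looking up a single property of a user; it also covers
-- user_name-only lookups, so idx_user_name becomes redundant if you add this one
ALTER TABLE user_data ADD INDEX idx_user_property (user_name, user_property);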
you should normalise your design as follows:
drop table if exists users;
create table users
(
user_id int unsigned not null auto_increment primary key,
username varbinary(32) unique not null
)
engine=innodb;
drop table if exists properties;
create table properties
(
property_id smallint unsigned not null auto_increment primary key,
name varchar(255) unique not null
)
engine=innodb;
drop table if exists user_property_values;
create table user_property_values
(
user_id int unsigned not null,
property_id smallint unsigned not null,
value varchar(255) not null,
primary key (user_id, property_id),
key (property_id)
)
engine=innodb;
insert into users (username) values ('f00'),('bar'),('alpha'),('beta');
insert into properties (name) values ('age'),('gender');
insert into user_property_values values
(1,1,'30'),(1,2,'Male'),
(2,1,'24'),(2,2,'Female'),
(3,1,'18'),
(4,1,'26'),(4,2,'Male');
From a performance perspective the innodb clustered index works wonders in this similar example (COLD run):
select count(*) from product
count(*)
========
1,000,000 (1M)
select count(*) from category
count(*)
========
250,000 (250K)
select count(*) from product_category
count(*)
========
125,431,192 (125M)
select
c.*,
p.*
from
product_category pc
inner join category c on pc.cat_id = c.cat_id
inner join product p on pc.prod_id = p.prod_id
where
pc.cat_id = 1001;
0:00:00.030: Query OK (0.03 secs)
Properly indexing your database will be the number 1 way of improving performance. I once had a query take half an hour (on a large dataset, but nonetheless). Then we came to find out that the tables had no indexes. Once indexed, the query took less than 10 seconds.
Why do you need to have this table structure? My fundamental problem is that you are going to have to cast the data in value_of_property every time you want to use it. That is bad in my opinion; also, storing numbers as text is crazy given that it's all binary anyway. For instance, how are you going to have required fields? Or fields that need to have constraints based on other fields, e.g. start and end dates?
Why not simply have the properties as fields rather than some many to many relationship?
Have one flat table. When your business rules begin to show that properties should be grouped, you can consider moving them out into other tables and having several 1:0-1 relationships with the users table. But this is not normalization, and it will degrade performance slightly due to the extra join (however, the self-documenting nature of the table names will greatly aid any developers).
One way I regularly see database performance get totally castrated is by having a generic Id, Property Type, Property Name, Property Value table.
This is really lazy and exceptionally flexible, but it totally kills performance. In fact, on a new job where performance is bad, I actually ask if they have a table with this structure; it invariably becomes the center point of the database and is slow. The whole point of relational database design is that the relations are determined ahead of time. This is simply a technique that aims to speed up development at a huge cost to application speed. It also puts a huge reliance on business logic in the application layer to behave, which is not defensive at all. Eventually you find that you want to use properties in a key relationship, which leads to all kinds of casting on the join and further degrades performance.
If data has a 1:1 relationship with an entity, then it should be a field on the same table. If your table gets to more than 30 fields wide, then consider moving them into another table, but don't call it normalisation, because it isn't. It is a technique to help developers group fields together, at the cost of performance, in an attempt to aid understanding.
I don't know if MySQL has an equivalent, but SQL Server 2008 has sparse columns, where NULL values take no space.
Sparse column datatypes
I'm not saying an EAV approach is always wrong, but I think using a relational database for this approach is probably not the best choice.
Our MySQL web analytics database contains a summary table which is updated throughout the day as new activity is imported. We use ON DUPLICATE KEY UPDATE in order that the summarization overwrites earlier calculations, but are having difficulty because one of the columns in the summary table's UNIQUE KEY is an optional FK, and contains NULL values.
These NULLs are intended to mean "not present, and all such cases are equivalent". Of course, MySQL usually treats NULLs as meaning "unknown, and all such cases are not equivalent".
Basic structure is as follows:
An "Activity" table containing an entry for each session, each belonging to a campaign, with optional filter and transaction IDs for some entries.
CREATE TABLE `Activity` (
`session_id` INTEGER AUTO_INCREMENT
, `campaign_id` INTEGER NOT NULL
, `filter_id` INTEGER DEFAULT NULL
, `transaction_id` INTEGER DEFAULT NULL
, PRIMARY KEY (`session_id`)
);
A "Summary" table containing daily rollups of total number of sessions in activity table, an d the total number of those sessions which contain a transaction ID. These summaries are split up, with one for every combination of campaign and (optional) filter. This is a non-transactional table using MyISAM.
CREATE TABLE `Summary` (
`day` DATE NOT NULL
, `campaign_id` INTEGER NOT NULL
, `filter_id` INTEGER DEFAULT NULL
, `sessions` INTEGER UNSIGNED DEFAULT NULL
, `transactions` INTEGER UNSIGNED DEFAULT NULL
, UNIQUE KEY (`day`, `campaign_id`, `filter_id`)
) ENGINE=MyISAM;
The actual summarization query is something like the following, counting up the number of sessions and transactions, then grouping by campaign and (optional) filter.
INSERT INTO `Summary`
(`day`, `campaign_id`, `filter_id`, `sessions`, `transactions`)
SELECT `day`, `campaign_id`, `filter_id`
, COUNT(`session_id`) AS `sessions`
, COUNT(`transaction_id`) AS `transactions`
FROM Activity
GROUP BY `day`, `campaign_id`, `filter_id`
ON DUPLICATE KEY UPDATE
`sessions` = VALUES(`sessions`)
, `transactions` = VALUES(`transactions`)
;
Everything works great, except for the summary of cases where the filter_id is NULL. In these cases, the ON DUPLICATE KEY UPDATE clause does not match the existing row, and a new row is written every time. This is due to the fact that "NULL != NULL". What we need, however, is "NULL = NULL" when comparing the unique keys.
I am looking for ideas for workarounds or feedback on those we have come up with so far. Workarounds we have thought of so far follow.
Delete all summary entries containing a NULL key value prior to running the summarization. (This is what we are doing now)
This has the negative side effect of returning results with missing data if a query is executed during the summarization process.
Change the DEFAULT NULL column to DEFAULT 0, which allows the UNIQUE KEY to be matched consistently.
This has the negative side effect of overly complicating the development of queries against the summary table. It forces us to use a lot of "CASE filter_id = 0 THEN NULL ELSE filter_id END", and makes for awkward joining since all of the other tables have actual NULLs for the filter_id.
Create a view which returns "CASE filter_id = 0 THEN NULL ELSE filter_id END", and using this view instead of the table directly.
The summary table contains a few hundred thousand rows, and I've been told view performance is quite poor.
Allow the duplicate entries to be created, and delete the old entries after summarization completes.
Has similar problems to deleting them ahead of time.
Add a surrogate column which contains 0 for NULL, and use that surrogate in the UNIQUE KEY (actually we could use PRIMARY KEY if all columns are NOT NULL).
This solution seems reasonable, except that the example above is only an example; the actual database contains half a dozen summary tables, one of which contains four nullable columns in the UNIQUE KEY. There is concern by some that the overhead is too much.
Do you have a better workaround, table structure, update process or MySQL best practice which can help?
EDIT: To clarify the "meaning of null"
The data in the summary rows containing NULL columns are considered to belong together only in the sense of being a single "catch-all" row in summary reports, summarizing those items for which that data point does not exist or is unknown. So within the context of the summary table itself, the meaning is "the sum of those entries for which no value is known". Within the relational tables, on the other hand, these truly are NULL results.
The only reason for putting them into a unique key on the summary table is to allow for automatic update (by ON DUPLICATE KEY UPDATE) when re-calculating the summary reports.
Maybe a better way to describe it is by the specific example that one of the summary tables groups results geographically by the zip code prefix of the business address given by the respondent. Not all respondents provide a business address, so the relationship between the transaction and addresses table is quite correctly NULL. In the summary table for this data, a row is generated for each zip code prefix, containing the summary of data within that area. An additional row is generated to show the summary of data for which no zip code prefix is known.
Altering the rest of the data tables to have an explicit "THERE_IS_NO_ZIP_CODE" 0-value, and placing a special record in the ZipCodePrefix table representing this value, is improper--that relationship truly is NULL.
I think something along the lines of (2) is really the best bet — or, at least, it would be if you were starting from scratch. In SQL, NULL means unknown. If you want some other meaning, you really ought to use a special value for that, and 0 is certainly an OK choice.
You should do this across the entire database, not just this one table. Then you shouldn't wind up with weird special cases. In fact, you should be able to get rid of a lot of your current ones (example: currently, if you want the summary row where there is no filter, you have the special case "filter is null" as opposed to the normal case "filter = ?".)
You should also go ahead and create a "not present" entry in the referred-to table as well, to keep the FK constraint valid (and avoid special cases).
PS: Tables w/o a primary key are not relational tables and should really be avoided.
edit 1
Hmmm, in that case, do you actually need the ON DUPLICATE KEY UPDATE? If you're doing an INSERT ... SELECT, then you probably do. But if your app is supplying the data, just do it by hand: do the UPDATE (mapping zip = null to zip IS NULL), check how many rows were changed (MySQL returns this), and if it's 0, do an INSERT.
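Against the Summary example in the question, that manual upsert might look like this (a sketch with placeholder values; in a real app you would check the affected-row count from the UPDATE before deciding to INSERT):
-- Try to update the existing "no filter" summary row first
UPDATE `Summary`
SET `sessions` = 42, `transactions` = 7
WHERE `day` = '2010-05-01'
  AND `campaign_id` = 123
  AND `filter_id` IS NULL;
-- If the UPDATE reported 0 affected rows, insert the row instead
INSERT INTO `Summary` (`day`, `campaign_id`, `filter_id`, `sessions`, `transactions`)
VALUES ('2010-05-01', 123, NULL, 42, 7);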
With modern versions of MySQL and MariaDB, upserts can be done simply with INSERT ... ON DUPLICATE KEY UPDATE statements if you go with surrogate column route #5. Using MySQL's generated stored columns or MariaDB's persistent virtual columns to apply the uniqueness constraint on the nullable fields indirectly keeps nonsense data out of the database, in exchange for some bloat.
e.g.
CREATE TABLE IF NOT EXISTS bar (
id INT PRIMARY KEY AUTO_INCREMENT,
datebin DATE NOT NULL,
baz1_id INT DEFAULT NULL,
vbaz1_id INT AS (COALESCE(baz1_id, -1)) STORED,
baz2_id INT DEFAULT NULL,
vbaz2_id INT AS (COALESCE(baz2_id, -1)) STORED,
blam DOUBLE NOT NULL,
UNIQUE(datebin, vbaz1_id, vbaz2_id)
);
INSERT INTO bar (datebin, baz1_id, baz2_id, blam)
VALUES ('2016-06-01', null, null, 777)
ON DUPLICATE KEY UPDATE
blam = VALUES(blam);
For MariaDB, replace STORED with PERSISTENT; indexed virtual columns must be persistent.
MySQL Generated Columns
MariaDB Virtual Columns
Change the DEFAULT NULL column to DEFAULT 0, which allows the UNIQUE KEY to be matched consistently. This has the negative side effect of overly complicating the development of queries against the summary table. It forces us to use a lot of "CASE filter_id = 0 THEN NULL ELSE filter_id END", and makes for awkward joining since all of the other tables have actual NULLs for the filter_id.
Create a view which returns "CASE filter_id = 0 THEN NULL ELSE filter_id END", and using this view instead of the table directly. The summary table contains a few hundred thousand rows, and I've been told view performance is quite poor.
View performance in MySQL 5.x will be fine, as the view does nothing but replace a zero with a null. Unless you use aggregates/sorts in a view, most any query against the view will be re-written by the query optimizer to just hit the underlying table.
And of course, since it's an FK, you'll have to create an entry in the referred-to table with an id of zero.
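The view mentioned in that second workaround could be as simple as this (a sketch, assuming filter_id has been switched to DEFAULT 0 as in the first workaround quoted above):
CREATE VIEW Summary_view AS
SELECT `day`
     , `campaign_id`
     , CASE WHEN `filter_id` = 0 THEN NULL ELSE `filter_id` END AS `filter_id`
     , `sessions`
     , `transactions`
FROM `Summary`;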
I'm more than a decade late, but I feel my solution should be an answer on here as I had this exact same problem, and this worked for me. If you know what's got to be updated, you can update them manually just before your existing summarization query, then ignore all cases where filter_id is null in your existing query so it won't get inserted as a record again.
For your example:
UPDATE `Summary` s
JOIN (
    SELECT `day`, `campaign_id`
         , COUNT(`session_id`) AS `sessions`
         , COUNT(`transaction_id`) AS `transactions`
    FROM `Activity`
    WHERE `filter_id` IS NULL
    GROUP BY `day`, `campaign_id`
) a ON s.`day` = a.`day`
   AND s.`campaign_id` = a.`campaign_id`
SET s.`sessions` = a.`sessions`
  , s.`transactions` = a.`transactions`
WHERE s.`filter_id` IS NULL;
INSERT INTO `Summary`
(`day`, `campaign_id`, `filter_id`, `sessions`, `transactions`)
SELECT `day`, `campaign_id`, `filter_id`
, COUNT(`session_id`) AS `sessions`
, COUNT(`transaction_id`) AS `transactions`
FROM Activity
WHERE `filter_id` IS NOT NULL
GROUP BY `day`, `campaign_id`, `filter_id`
ON DUPLICATE KEY UPDATE
`sessions` = VALUES(`sessions`)
, `transactions` = VALUES(`transactions`);