In my app, there is a very large table (>40 million rows) that will have a default scope set on the model.
The default scope will look at a specific DATETIME column and check that it IS NULL. The DATETIME column will probably never be used to search for a specific date. Should I be using an index here, and if so, how?
The WHERE <column_name> IS NULL will be added to almost every single query made on this table from the app. On the one hand, since the column is essentially being treated as a boolean, I am tempted to think that it should not be indexed. However, it seems that with such a huge table, an index should provide value, especially for a query like
SELECT COUNT(*) FROM <table_name> WHERE <column_name> IS NULL
I am also a bit confused about how I should index, since the WHERE clause will be appended to every query. I do not think it would make sense to create an index on all columns of this table. This is being done in MySQL. Thanks
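A minimal sketch of the single-column index being considered, using the same placeholders as above (the index name is made up):

ALTER TABLE <table_name> ADD INDEX idx_nullable_datetime (<column_name>);

-- The default-scope query that could then use the index:
SELECT COUNT(*) FROM <table_name> WHERE <column_name> IS NULL;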
Related
I wonder why this happens. I have a simple table with a primary key Id column, an indexed column A, and some other normal columns (of datetime), and all fields have non-null values.
When I try counting the rows on the primary key column like this:
select count(Id) from my_table
It takes about 0.4 seconds to return the value (a total of about 1.1M records).
I tried the same query on a normal column (datetime, as mentioned before); it takes almost the same time (actually a bit slower).
But when I tried the same query on the indexed column A, it takes up to 1.2 seconds to return the count:
select count(A) from my_table
The A column index info (if any needed for your inspection):
type: BTREE
Allows NULL: Yes
Unique: No
Packed: (empty)
Could you give me an explanation for this issue? Can we do anything to improve it?
I cannot count on another column, because I actually have to count distinctly on that column, so the returned count has a different meaning for each column.
The fastest is COUNT(*). The * is a convention; it does not mean "all columns". COUNT(1) is equivalent.
Use COUNT(col) when you want to exclude any rows where col is NULL. (This is rarely needed.) If col is declared NOT NULL, then it is really a waste to include col. (Note: I said declared; you said something different: have non-null values.) According to Allows NULL: Yes, you declared the column NULL, hence COUNT(col) will check col in every row.
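A quick illustration of the difference, using the my_table / A names from the question (A is nullable):

SELECT COUNT(*) FROM my_table;  -- counts all rows; the optimizer may walk the smallest index
SELECT COUNT(1) FROM my_table;  -- equivalent to COUNT(*)
SELECT COUNT(A) FROM my_table;  -- counts only rows where A IS NOT NULL, so A must be checked in every row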
What happens with COUNT(*):
For a count without a WHERE, the Optimizer picks the 'smallest' index and walks through it. The PRIMARY KEY is always(?) the largest, so it is usually shunned. The reason for this algorithm is that it assumes it will have to read the entire index from disk; I/O is slow; and "smaller" means less I/O.
If there is a WHERE clause, then the optimal index is determined by what is in the WHERE clause -- that is a huge topic unto itself.
It is a good idea to say NOT NULL unless you have business logic that calls for NULLs ("not yet specified", "optional", "deleted", "Not Applicable", ...) There are several cases (most are obscure) where a NOT NULL performs slightly better than NULL.
Caveat: My Answer applies to InnoDB, but not totally to MyISAM.
I found this answer myself just by guessing and trying it out. I tried altering column A to make it NOT NULL (it was nullable before); after that, the count on that column is normal again, like the other columns (indexing is not involved here at all).
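A sketch of that change, assuming A is an INT (substitute the column's real type):

ALTER TABLE my_table MODIFY A INT NOT NULL;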
It's just that simple :)
I have a very large table (> 100 million rows), and the query I need to write has to check whether a column of type TEXT is null. I don't care about the actual text, only whether the column value is null or not.
I'm trying to create an index to make that query efficient, and I want the query to be satisfied by looking only at the index, i.e. such that the EXPLAIN output of that query shows Using index instead of Using index condition.
To be concrete, the table I'm dealing with has a structure like this:
CREATE TABLE foo (
    id INT NOT NULL AUTO_INCREMENT,
    overridden TEXT,
    rowState VARCHAR(32) NOT NULL,
    active TINYINT NOT NULL DEFAULT 0,
    PRIMARY KEY (id)
) ENGINE=InnoDB
And the query I want is this:
SELECT
    IF(overridden IS NULL, "overridden", rowState) AS state,
    COUNT(*)
FROM foo
WHERE active=1
GROUP BY state
The overridden column is a description of why the column is overridden but, as I mentioned above, I don't care about its content, I only want to use it as a boolean flag.
I've tried creating an index on (active, rowState, overridden(1)) but the explain output still shows Using index condition.
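(For reference, that attempt would have been created roughly like this; the index name is made up:)

ALTER TABLE foo ADD INDEX attempted_idx (active, rowState, overridden(1));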
Is there something better that I could do?
I suspect this would be a good application for a "prefix" index. (Most attempts at using such are a fiasco.)
The optimal order, I think, would be:
INDEX(active,         -- handle the entire WHERE first
      overridden(1),  -- IS NULL possibly works well
      rowState)       -- to make index "covering"
Using index condition means that the selection of the row(s) is handled in the Engine, not the "Handler".
EXPLAIN FORMAT=JSON SELECT ... may provide more insight into what is going on. So might the "Optimizer trace".
Or...
If the MySQL version is new enough, create a "generated column" that persists with the computed value of overridden IS NULL.
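A sketch of that approach, assuming MySQL 5.7+ generated-column syntax (the column and index names here are made up):

-- Persisted (STORED) column holding the result of overridden IS NULL:
ALTER TABLE foo
    ADD COLUMN overridden_is_null TINYINT(1)
        GENERATED ALWAYS AS (overridden IS NULL) STORED;

-- Index that can cover the whole query:
ALTER TABLE foo
    ADD INDEX covering_idx (active, overridden_is_null, rowState);

-- The query can then group on the generated column instead of touching the TEXT column:
SELECT IF(overridden_is_null, "overridden", rowState) AS state, COUNT(*)
FROM foo
WHERE active = 1
GROUP BY state;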
I've found a few questions that deal with this problem, and it appears that MySQL doesn't allow it. That's fine, I don't have to have a subquery in the FROM clause. However, I don't know how to get around it. Here's my setup:
I have a metrics table that has 3 columns I want: ControllerID, TimeStamp, and State. Basically, a data gathering engine contacts each controller in the database every 5 minutes and sticks an entry in the metrics table. The table has those three columns, plus a MetricsID that I don't care about. Maybe there is a better way to store those metrics, but I don't know it.
Regardless, I want a view that takes the most recent TimeStamp for each of the different ControllerIDs and grabs the TimeStamp, ControllerID, and State. So if there are 4 controllers, the view should always have 4 rows, each with a different controller, along with its most recent state.
I've been able to create a query that gets what I want, but it relies on a subquery in the FROM clause, something that isn't allowed in a view. Here is what I have so far:
SELECT *
FROM
    (SELECT ControllerID, TimeStamp, State
     FROM Metrics
     ORDER BY TimeStamp DESC) AS t
GROUP BY ControllerID;
Like I said, this works great. But I can't use it in a view. I've tried using the max() function, but as explained in SQL: Any straightforward way to order results FIRST, THEN group by another column?, if I want any additional columns besides the GROUP BY and ORDER BY columns, max() doesn't work. I've confirmed this limitation; it doesn't work.
I've also tried to alter the metrics table to order by TimeStamp. That doesn't work either; the wrong rows are kept.
Edit: Here is the SHOW CREATE TABLE of the Metrics table I am pulling from:
CREATE TABLE Metrics (
  MetricsID int(11) NOT NULL AUTO_INCREMENT,
  ControllerID int(11) NOT NULL,
  TimeStamp timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
  State tinyint(4) NOT NULL,
  PRIMARY KEY (MetricsID),
  KEY makeItFast (ControllerID,MetricsID),
  KEY fast (ControllerID,TimeStamp),
  KEY fast2 (MetricsID),
  KEY MetricsID (MetricsID),
  KEY TimeStamp (TimeStamp)
) ENGINE=InnoDB AUTO_INCREMENT=8958 DEFAULT CHARSET=latin1
If you want the most recent row for each controller, the following is view friendly:
SELECT ControllerID, TimeStamp, State
FROM Metrics m
WHERE NOT EXISTS (SELECT 1
                  FROM Metrics m2
                  WHERE m2.ControllerID = m.ControllerID
                    AND m2.TimeStamp > m.TimeStamp);
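Wrapped in a view, this might look like the following (the view name is arbitrary):

CREATE VIEW LatestMetrics AS
SELECT ControllerID, TimeStamp, State
FROM Metrics m
WHERE NOT EXISTS (SELECT 1
                  FROM Metrics m2
                  WHERE m2.ControllerID = m.ControllerID
                    AND m2.TimeStamp > m.TimeStamp);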
Your query is not correct anyway, because it relies on a MySQL extension whose behavior is not guaranteed. The value for State doesn't necessarily come from the row with the largest timestamp; it comes from an arbitrary row.
EDIT:
For best performance, you want an index on Metrics(ControllerId, Timestamp).
Edit: Sorry, I misunderstood your question; I thought you were trying to overcome the nested-query limitation in a view.
You're trying to display the most recent row for each distinct ControllerID. Furthermore, you're trying to do it with a view.
First, let's do it. If your MetricsID column (which I know you don't care about) is an autoincrement column, this is really easy.
SELECT ControllerId, TimeStamp, State
FROM Metrics m
WHERE MetricsID IN (
    SELECT MAX(MetricsID) MetricsID
    FROM Metrics
    GROUP BY ControllerID)
ORDER BY ControllerID
This query uses MAX ... GROUP BY to extract the highest-numbered (most recent) row for each controller. It can be made into a view.
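For example, a possible view definition (the view name is made up):

CREATE VIEW ControllerLatestState AS
SELECT ControllerID, TimeStamp, State
FROM Metrics m
WHERE MetricsID IN (
    SELECT MAX(MetricsID)
    FROM Metrics
    GROUP BY ControllerID);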
A compound index on (ControllerID, MetricsID) will be able to satisfy the subquery with a highly efficient loose index scan.
The root cause of my confusion: I didn't read your question carefully enough.
The root cause of your confusion: You're trying to take advantage of a pernicious MySQL extension to GROUP BY. Your idea of ordering the subquery may have worked. But your temporary success is an accidental side-effect of the present implementation. Read this: http://dev.mysql.com/doc/refman/5.6/en/group-by-handling.html
I have a column that is a datetime, converted_at.
I plan on making calls that check WHERE converted_at is not null very often. As such, I'm considering having a boolean field converted. Is there a significant performance difference between checking if a field is not null vs. if it is false?
Thanks.
If something is answerable with a single field, favour that over splitting the same thing into two fields. Splitting creates more infrastructure, which, in your case, is avoidable.
As to the nub of the question, I believe most database implementations, MySQL included, keep an internal boolean flag for representing the NULLability of a field anyway.
You can rely on this being done for you correctly.
As to performance, the bigger question is about profiling the typical queries that you run on your database, creating appropriate indexes, and running ANALYZE TABLE to improve execution plans and the choice of indexes used during queries. That will have a far bigger impact on performance.
Using WHERE converted_at is not null or WHERE converted = FALSE will probably perform about the same.
But if you have this additional bit field that stores whether the converted_at field is NULL or not, you'll have to somehow maintain integrity (via triggers?) whenever a new row is added and every time the column is updated. So this is a de-normalization, and it also means more complicated code. Moreover, you'll have at least one more index on the table (which means slightly slower INSERT/UPDATE/DELETE operations).
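For illustration, keeping such a flag in sync might require triggers along these lines (the table name leads is assumed here; converted mirrors converted_at IS NOT NULL):

CREATE TRIGGER leads_converted_bi BEFORE INSERT ON leads
FOR EACH ROW SET NEW.converted = (NEW.converted_at IS NOT NULL);

CREATE TRIGGER leads_converted_bu BEFORE UPDATE ON leads
FOR EACH ROW SET NEW.converted = (NEW.converted_at IS NOT NULL);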
Therefore, I don't think it's good to add this bit field.
If you can change the column in question from NULL to NOT NULL (possibly by normalizing the table), you may get some performance gain (at the cost/gain of having more tables).
I had the same question for my own usage. So I decided to put it to the test.
I created all the fields required for the 3 possibilities I imagined:
# option 1
ALTER TABLE mytable ADD deleted_at DATETIME NULL;
ALTER TABLE mytable ADD archived_at DATETIME NULL;
# option 2
ALTER TABLE mytable ADD deleted boolean NOT NULL DEFAULT 0;
ALTER TABLE mytable ADD archived boolean NOT NULL DEFAULT 0;
# option 3
ALTER TABLE mytable ADD invisibility TINYINT(1) UNSIGNED NOT NULL DEFAULT 0
COMMENT '4 values possible' ;
The last is a bitfield where 1=archived, 2=deleted, 3=deleted + archived
The first difference is that you have to create indexes for options 2 and 3.
CREATE INDEX mytable_deleted_IDX USING BTREE ON mytable (deleted) ;
CREATE INDEX mytable_archived_IDX USING BTREE ON mytable (archived) ;
CREATE INDEX mytable_invisibility_IDX USING BTREE ON mytable (invisibility) ;
Then I tried all of the options using a real-life SQL request on 13k records in the main table; here is how it looks:
SELECT *
FROM mytable
LEFT JOIN table1 ON mytable.id_qcm = table1.id_qcm
LEFT JOIN table2 ON table2.id_class = mytable.id_class
INNER JOIN user ON mytable.id_user = user.id_user
where mytable.id_user=1
and mytable.deleted_at is null and mytable.archived_at is null
# and deleted=0
# and invisibility=0
order BY id_mytable
I alternately used the commented-out filter options above.
This was on MySQL 5.7.21-1, Debian 9.
My conclusion:
The "is null" solution (option 1) is a bit faster, or at least the same performance.
The two others ("deleted=0" and "invisibility=0") seem on average a bit slower.
But the nullable-field option has decisive advantages: no index to create, easier to update, easier to query, and less storage space used.
(Additionally, inserts and updates should in principle be slightly faster as well, since MySQL does not need to update the extra indexes, but you would never be able to notice that.)
So you should use the nullable datetime fields option.
I have a table with a nullable datetime field.
I'll execute queries like this:
select * from TABLE where FIELD is not null
select * from TABLE where FIELD is null
Should I index this field, or is it not necessary? I will NOT search for a specific datetime value in that field.
It's probably not necessary.
The only edge case where an index can be used (and be of help) is when the distribution of NULL vs. non-NULL rows is heavily skewed, so that the rows you search for are rare (e.g. you have 100 NULL datetimes in a table with 100,000 rows). In that case select * from TABLE where FIELD is null would use the index and be considerably faster for it.
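In that skewed case, the index and query would look like this, using the question's placeholder names (the index name is made up):

CREATE INDEX field_idx ON TABLE (FIELD);

select * from TABLE where FIELD is null;  -- few NULL rows, so the optimizer can use the index to find them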
In short: yes.
Slightly longer: yeeees. ;-)
(From http://dev.mysql.com/doc/refman/5.0/en/mysql-indexes.html) - "A search using col_name IS NULL employs indexes if col_name is indexed."
It would depend on the number of unique values and the number of records in the table. If you're just searching on whether or not a column is null, you'll probably have one query use the index and one not, depending on the proportion of NULLs in the table overall.
For example: if you have a table where 99% of the records have the queried column as NULL, and you put an index on the column and then execute:
SELECT columnIndexed FROM blah WHERE columnIndexed is null;
The optimizer most likely won't use the index, because it will cost more to read the index and then read the associated data for the records than to just access the table directly. Index usage is based on the statistical analysis of a table, and one major factor in that is the cardinality of the values. In general, indexes work best and give the best performance when they select a small subset of the rows in the table. So if you change the above query to select where columnIndexed is not null, you're bound to use the index.
For more details check out the following: http://dev.mysql.com/doc/refman/5.1/en/myisam-index-statistics.html