telling if key exists in mysql table is taking too long

I started with this question: is my large mysql table destined for failure?
The answer that I found from that question was satisfactory. I have a table with 22 million rows that I would like to grow to about 100 million. At this time, the structure of the minute_data table is like this:
A problem that I am having is as follows. I need to execute this query:
select datediff(date,now()) from minute_data where symbol = "CSCO" order by date desc limit 1;
This query is very fast (< 1 sec) when the table contains the symbol "CSCO". The problem is that sometimes I will query for a symbol that is not in the table yet. When I execute a query like this for, say, symbol = "ABCD":
select datediff(date,now()) from minute_data where symbol = "ABCD" order by date desc limit 1;
Then the query takes a LONG TIME... like forever (180 seconds).
One way I can get around this is to make sure the table contains the symbol before I execute the query. The fastest way I found to do this is with the following query, which I just use to check whether the table minute_data contains the symbol I am looking for. Basically I just need it to return a boolean value so I know if the symbol is in the table or not:
select count(1) from minute_data where symbol = "CSCO";
This query takes over 30 seconds to return a single value, which is way too long for my liking, since the query above, which actually returns a datediff calculation, takes less than 1 second.
Since the symbol column is part of the primary key, I thought MySQL should be able to figure out whether a value exists there very quickly.
What am I doing wrong? Is there a fast way to do what I want to do? Should I change the structure of the data to optimize performance?
Thank You!
UPDATE
I think I found a good solution to this problem. From the answer below by LastCoder, I did the following:
1) Created a new table called minute_data_2 with the exact same definition as minute_data.
2)
ALTER TABLE minute_data_2 ADD PRIMARY KEY (symbol, date);
3)
INSERT IGNORE INTO minute_data_2 SELECT * FROM minute_data;
4)
DROP TABLE minute_data;
5) Rename minute_data_2 to minute_data
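Step 5 is a single RENAME TABLE statement:
RENAME TABLE minute_data_2 TO minute_data;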
Now I am seeing blindingly fast speed for the same query which I described above as taking more than 180 seconds; it now completes in 0.001 seconds. Amazing.

Did you try using EXISTS (...)?
select datediff(date,now()) from minute_data
where EXISTS(SELECT * FROM minute_data WHERE symbol = "CSCO")
AND symbol = "CSCO" order by date desc limit 1;
Even though symbol is part of the primary key, it seems you have the timestamp in the PK as well, which makes me think you are using a COMPOSITE PK, which means the ordering is by timestamp then symbol. You may want to put a separate index on symbol if all you have is a composite one where the timestamp comes first.
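Adding that separate index is one statement (a sketch; the index name is arbitrary):
ALTER TABLE minute_data ADD INDEX idx_symbol (symbol);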

I think it is better to make a table named symbols and add a reference to that table in your minute_data table:
symbols:
symbol_id (INT, Primary Key, Auto Increment)
symbol_text (VARCHAR)
minute_data:
key_col (BIGINT, Primary Key, Auto Increment)
symbol_id (INT, Index)
other_field
Use InnoDB as the storage engine so that you can add foreign key references.
Try to avoid duplicate entries in your tables.
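A minimal sketch of this layout (the VARCHAR sizes are assumptions, and other_field stands in for the remaining data columns):
CREATE TABLE symbols (
    symbol_id   INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    symbol_text VARCHAR(10) NOT NULL,
    UNIQUE KEY (symbol_text)            -- prevents duplicate symbol rows
) ENGINE = InnoDB;

CREATE TABLE minute_data (
    key_col     BIGINT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    symbol_id   INT NOT NULL,
    other_field VARCHAR(255),           -- stand-in for the remaining columns
    KEY (symbol_id),
    FOREIGN KEY (symbol_id) REFERENCES symbols (symbol_id)
) ENGINE = InnoDB;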

Related

How to optimise mysql query as Full ProcessList is showing Sending Data for over 24 hours

I have the following query that runs forever, and I am looking to see if there is any way I can optimise it. It runs on a table that has 1,406,480 rows of data in total; apart from the Filename and Ref_No columns, the ID and End_Date have both been indexed.
My Query:
INSERT INTO UniqueIDs
(
    SELECT T1.ID
    FROM master_table T1
    LEFT JOIN master_table T2
        ON  T1.Ref_No   = T2.Ref_No
        AND T1.End_Date = T2.End_Date
        AND T1.Filename = T2.Filename
        AND T1.ID > T2.ID
    WHERE T2.ID IS NULL
      AND LENGTH(T1.Ref_No) BETWEEN 5 AND 10
);
Explain Results:
The reason for not indexing the Ref_No is that this is a text column and therefore I get a BLOB/TEXT error when I try and index this column.
I would really appreciate it if somebody could advise on how I can speed up this query.
Thanks
Thanks to Bill's advice regarding multi-column indexes, I have managed to make some headway. I first ran this code:
CREATE INDEX I_DELETE_DUPS ON master_table(id, End_Date);
I then added a new column to hold the length of the Ref_No, but I had to change it from the query Bill mentioned because my version of MySQL is 5.5. So I ran it in 3 steps:
ALTER TABLE master_table
ADD COLUMN Ref_No_length SMALLINT UNSIGNED;
UPDATE master_table SET Ref_No_length = LENGTH(Ref_No);
ALTER TABLE master_table ADD INDEX (Ref_No_length);
The last step was to change the WHERE clause of my insert query to filter on the new length column. It was changed to:
AND t1.Ref_No_length between 5 and 10;
I then ran this query, and within 15 mins I had 280k IDs inserted into my UniqueIDs table. I then changed my insert script to see if I could include more lengths by doing the following:
AND t1.Ref_No_length IN (5,6,7,8,9,10,13);
This was to bring in the values where the length was also equal to 13. This query took a lot longer, 2 hr 50 mins to be precise, but the additional ask of looking for all rows with a length of 13 gave me an extra 700k unique IDs.
I am still looking at ways to optimise the query with the IN clause, but this is a big improvement over a query that kept running for over 24 hours. So thank you so much, Bill.
For the JOIN, you should have a multi-column index on (Ref_No, End_Date, Filename).
You can create a prefix index on a TEXT column like this:
ALTER TABLE master_table ADD INDEX (Ref_No(10));
But that won't help you search based on LENGTH(). Indexing only helps when you search by the value indexed, not by functions of the column.
In MySQL 5.7 or later, you can create a virtual column like this, with an index on the values calculated for the virtual column:
ALTER TABLE master_table
ADD COLUMN Ref_No_length SMALLINT UNSIGNED AS (LENGTH(Ref_No)),
ADD INDEX (Ref_No_length);
Then MySQL will recognize that your condition in your query is the same as the expression for the virtual column, and it will automatically use the index (exception: in my experience, this doesn't work for expressions using JSON functions).
But this is no guarantee that the index will help. If most of the rows match the condition of the length being between 5 and 10, the optimizer will not bother with the index. It may be more work to use the index than to do a table-scan.
the ID and End_Date have both been indexed.
You have PRIMARY KEY(id) and redundantly INDEX(id)? A PK is a unique key.
"have both been indexed" -- INDEX(a), INDEX(b) is not the same as INDEX(a,b) -- they have different uses. Read about "composite" indexes.
That query smells a lot like "group-wise" max done in a very slow way. (Alas, that may have come from the online docs.)
I have compiled the fastest ways to do that task here: http://mysql.rjweb.org/doc.php/groupwise_max (There are multiple versions, based on MySQL version and what issues your code can/cannot tolerate.)
Please provide SHOW CREATE TABLE. One important question: Is id the PRIMARY KEY?
This composite index may be useful:
INDEX(Filename, End_Date, Ref_No,  -- first, in any order
      ID)                          -- last
This, as others have noted, is unlikely to be helped by any index, hence T1 will need a full-table-scan:
AND LENGTH(T1.Ref_No) BETWEEN 5 AND 10
If Ref_No cannot be bigger than 191 characters, change it to a VARCHAR so that it can be used in an index. Oh, did I ask for SHOW CREATE TABLE? If you can't make it VARCHAR, then my recommended composite index is
INDEX(Filename, End_Date, ID)
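As runnable statements, the two variants look like this (a sketch; it assumes Filename is an indexable type such as VARCHAR, otherwise give it a prefix length as shown above):
-- if Ref_No values fit in 191 characters:
ALTER TABLE master_table
    MODIFY Ref_No VARCHAR(191),
    ADD INDEX (Filename, End_Date, Ref_No, ID);

-- otherwise:
ALTER TABLE master_table ADD INDEX (Filename, End_Date, ID);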

MySQL updates getting very slow towards end of the table

I have a table "data" which holds around 100,000,000 records.
I have added a new column to it "batch_id" (Integer).
On the application layer, I'm updating the batch_id in batches of 10,000 records for each of the 100,000,000 records (the batch_id is always the same for 10k).
I'm doing something like this (application layer pseudo code):
loop {
    $batch_id = $batch_id + 1;
    mysql.query("UPDATE data SET batch_id = '$batch_id' WHERE batch_id IS NULL LIMIT 10000");
}
I have an index on the batch_id column.
In the beginning, this update statement took ~30 seconds. I'm now halfway through the table and it's getting slower and slower. At the moment the same statement takes around 10 minutes(!). It has reached a point where this is no longer feasible, as it would take over a month to update the whole table at the current speed.
What could I do to speed it up, and why is MySQL getting slower towards the end of the table?
Could an index on the primary key help? The primary key is automatically indexed in MySQL, so what will help here is an index on the batch_id column itself.
The problem is that without an index the engine does a full table scan. At first it is easy to find 10k rows with NULL values, but as more and more records are updated, the engine has to scan much more of the table to find those NULLs.
But it would be easier to create batch_id as an auto-numbered column.
OTHER OPTION: Create a new table, add the index, and then replace the old table.
CREATE TABLE newTable AS
SELECT IF(@newID := @newID + 1,         -- increments the row counter for every row
          @newID DIV 10000,
          @newID DIV 10000) AS batch_id, -- both branches return the 10000-row bucket
       <other fields>
FROM YourTable
CROSS JOIN (SELECT @newID := 0) AS v;
See also: Insert auto increment primary key to existing table
Do you have a monotonically increasing id in the table? And all rows for a "batch" have 'consecutive' ids? Then don't add batch_id to the table; instead, create another table Batches with one row per batch: (batch_id (PK), id_start, id_end, start_time, end_time, etc.).
If you stick to exact chunks of 10K, then don't even materialize batch_id. Instead, compute it from id DIV 10000 whenever you need it.
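A sketch of both ideas (column types are assumptions):
-- one row per batch, instead of a batch_id on every data row
CREATE TABLE Batches (
    batch_id   INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    id_start   BIGINT UNSIGNED NOT NULL,
    id_end     BIGINT UNSIGNED NOT NULL,
    start_time DATETIME,
    end_time   DATETIME
);

-- or, for exact 10K chunks, derive the batch number on the fly:
SELECT id, id DIV 10000 AS batch_id FROM data;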
If you want to discuss this further, please provide SHOW CREATE TABLE for the existing table, and explain what you will be doing with the "batches".
To answer your question about "slow near the end": it is having to scan farther and farther into the table to find the NULLs. You would be better off walking through the table once, fiddling with each 10K chunk as you go. Do this using the PRIMARY KEY, whatever it is. (That is, even if it is not AUTO_INCREMENT.) More Details.
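A sketch of that one-pass walk, in the same application-layer pseudo code as the question (it assumes an integer PRIMARY KEY named id; UPDATE with ORDER BY and LIMIT is valid for single-table updates in MySQL):
$last_id = 0;
loop {
    $batch_id = $batch_id + 1;
    # update the next 10K rows in PRIMARY KEY order, starting after the previous chunk
    mysql.query("UPDATE data SET batch_id = '$batch_id'
                 WHERE id > $last_id
                 ORDER BY id LIMIT 10000");
    $last_id = mysql.query("SELECT MAX(id) FROM data WHERE batch_id = '$batch_id'");
}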

is my large mysql table destined for failure?

I have built a mysql table on my local computer to store stock market data. The table name is minute_data, and the structure is simple enough:
You can see that I made the key column a combination of date and symbol -> concat(date,symbol). This way I do an insert ignore ... query to add data to the table without duplicating a date/symbol combination.
With this table, data retrieval is very simple. Say I wanted to get all the data for the symbol CSCO, then I could simply do this query:
select * from minute_data where symbol = "CSCO" order by date;
Everything has been "working". The table now has data from over 1000 symbols, with over 22 million rows already. I think it is not even half full for all 1000 symbols yet, so I expect to keep growing the size of the table.
I am starting to see serious performance problems when querying this table. For example the following query (which I often want to do, to see the latest date for a particular symbol) takes well over 1 minute to complete, and only returns 1 row!
select * from minute_data where symbol = "CSCO" order by date desc limit 1;
This query (which is also very important) also takes over 1 minute on average:
select count(*), symbol from minute_data group by symbol;
The performance problems are making it unrealistic to keep working with the data in this way. These are the questions that I would like to ask the community:
Is it futile to continue building my data set into this table?
Is MySQL a bad choice altogether for a data set like this?
What can I do to this table to improve performance?
What kind of data structure should I use for this purpose (instead of a MySQL table)?
Thank You!
UPDATE
I am providing the output from EXPLAIN for the following 2 queries:
explain select count(*), symbol from minute_data group by symbol;
explain select * from minute_data where symbol = "CSCO" order by date desc limit 1;
UPDATE 2
Pretty simple fix. I performed this query to remove the useless key_col that I had defined above, and made a primary key on 2 columns, date and symbol:
alter table minute_data drop primary key, add primary key (date,symbol);
Now I tried the following query, and it finished in less than 1 second:
select * from minute_data where symbol = "CSCO" order by date desc limit 1;
This query still takes a long time to complete (72 seconds). I guess that's still because the query has to tabulate all 22 million rows in one pass:
select count(*), symbol from minute_data group by symbol;
Your key_col is completely useless. Did you know that you can have a primary key over multiple columns? I'd recommend that you drop that column and create a new primary key on (date, symbol), in this order, since your date column has the higher cardinality. Additionally, you can then (if there's the need for it) create another unique index on (symbol, date). Post EXPLAINs of your most important queries. And what's the cardinality of symbol?
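As a single statement, the suggested change would look like this (a sketch using the names from the question):
ALTER TABLE minute_data
    DROP PRIMARY KEY,
    DROP COLUMN key_col,
    ADD PRIMARY KEY (date, symbol),
    ADD UNIQUE INDEX (symbol, date);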
UPDATE:
What you can see in the EXPLAIN is that there is no index that can be used, so it scans the whole 22.5 million rows. Please have a try with the changes mentioned above. If you don't want to drop the key_col right now, you should at least add an index on the symbol column.

How do I reset sequence numbers to become consecutive?

I've got a mysql table where each row has its own sequence number in a "sequence" column. However, when a row gets deleted, it leaves a gap. So...
1
2
3
4
...becomes...
1
2
4
Is there a neat way to "reset" the sequencing, so it becomes consecutive again in one SQL query?
Incidentally, I'm sure there is a technical term for this process. Anyone?
UPDATED: The "sequence" column is not a primary key. It is only used for determining the order that records are displayed within the app.
If the field is your primary key...
...then, as stated elsewhere on this question, you shouldn't be changing IDs. The IDs are already unique and you neither need nor want to re-use them.
Now, that said...
Otherwise...
It's quite possible that you have a different field (that is, as well as the PK) for some application-defined ordering. As long as this ordering isn't inherent in some other field (e.g. if it's user-defined), then there is nothing wrong with this.
You could recreate the table using a (temporary) auto_increment field and then remove the auto_increment afterwards.
I'd be tempted to UPDATE in ascending order and apply an incrementing variable.
SET @i = 0;
UPDATE `table`
SET `myOrderCol` = @i := @i + 1
ORDER BY `myOrderCol` ASC;
(Query not tested.)
It does seem quite wasteful to do this every time you delete items, but unfortunately with this manual ordering approach there's not a whole lot you can do about that if you want to maintain the integrity of the column.
You could possibly reduce the load, such that after deleting the entry with myOrderCol equal to, say, 5:
SET @i = 5;
UPDATE `table`
SET `myOrderCol` = @i := @i + 1
WHERE `myOrderCol` > 5
ORDER BY `myOrderCol` ASC;
(Query not tested.)
This will "shuffle" all the following values down by one.
I'd say don't bother. Reassigning sequential values is a relatively expensive operation, and if the column value is for ordering purposes only there is no good reason to do it. The only concern you might have is if, for example, your column is UNSIGNED INT and you suspect that over the lifetime of your application you might have more than 4,294,967,296 rows (including deleted rows) and go out of range; even then, you can do the reassigning as a one-time task 10 years later when that happens.
This is a question that I often read here and in other forums. As already written by zerkms, this is a false problem. Moreover, if your table is related to other ones, you'll lose the relations.
Just for learning purposes, a simple way is to store your data in a temporary table, truncate the original one (this resets the auto_increment), and then repopulate it.
Silly example:
create table seq (
    id int not null auto_increment primary key,
    col char(1)
) engine = myisam;
insert into seq (col) values ('a'),('b'),('c'),('d');
delete from seq where id = 3;
create temporary table tmp select col from seq order by id;
truncate seq;
insert into seq (col) select * from tmp;
but it's totally useless. ;)
If this is your PK then you shouldn't change it. PKs should be (mostly) unchanging columns. If you were to change them, then you would need to change them not only in that table but also in any foreign keys where they exist.
If you do need a sequential sequence then ask yourself why. In a table there is no inherent or guaranteed order (even in the PK, although it may turn out that way because of how most RDBMSs store and retrieve the data). That's why we have the ORDER BY clause in SQL. If you want to be able to generate sequential numbers based on something else (time added into the database, etc.) then consider generating that either in your query or with your front end.
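For instance, the display order can be generated at query time with the user-variable idiom seen earlier (a sketch; the table and column names are borrowed from the answer above, and in MySQL 8.0+ ROW_NUMBER() does the same job):
SELECT t.*, (@rownum := @rownum + 1) AS display_order
FROM (SELECT * FROM `table` ORDER BY `myOrderCol`) AS t
CROSS JOIN (SELECT @rownum := 0) AS init;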
Assuming that this is an ID field and you want to re-use the first free ID, you can do this when you insert:
INSERT INTO yourTable (ID)
SELECT MIN(t1.ID) + 1
FROM yourTable t1
WHERE NOT EXISTS
    (SELECT 1 FROM yourTable t2 WHERE t2.ID = t1.ID + 1);
As others have mentioned, I don't recommend doing this. It will hold a table lock while the next ID is evaluated.

How to optimize MySQL query ‘SELECT * from table WHERE Date=CURDATE() and ID=1;’

I have the following data in a MySQL table table:
ID: int(11) [this is the primary key]
Date: date
and I run the MySQL query:
SELECT * from table WHERE Date=CURDATE() and ID=1;
This takes between 0.6 and 1.2 seconds.
Is there any way to optimize this query to get results quicker?
My objective is to find out if I already have a record for today for this ID.
Add indexes on ID and Date.
See CREATE INDEX manual.
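For example (a sketch; since ID is already the primary key it is indexed automatically, so the Date index is the one to add):
CREATE INDEX idx_date ON `table` (`Date`);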
You could add a LIMIT 1 at the end; since you are searching on a primary key, the maximum number of results is 1.
And if you only want to know whether it exists or not, you could replace * with ID to select only the ID.
Furthermore, if you haven't already, you really need to add indexes.
SET @cur_date = CURDATE()
...WHERE Date = @cur_date ...
and then create an index of Date, ID (order is important, it should match the order you query on).
In general, calling functions before you run the query and storing their results in variables lets SQL treat them like constants instead of functions, which tends to allow it to use a faster query plan.
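Putting those suggestions together (a sketch; the table and column names are as given in the question):
CREATE INDEX idx_date_id ON `table` (`Date`, `ID`);

SET @cur_date = CURDATE();
SELECT `ID` FROM `table`
WHERE `Date` = @cur_date AND `ID` = 1
LIMIT 1;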