I have built a MySQL table on my local computer to store stock market data. The table name is minute_data, and the structure is simple enough:
You can see that I made the key column a combination of date and symbol -> concat(date,symbol). This way I do an insert ignore ... query to add data to the table without duplicating a date/symbol combination.
With this table, data retrieval is very simple. Say I wanted to get all the data for the symbol CSCO, then I could simply do this query:
select * from minute_data where symbol = "CSCO" order by date;
Everything has been "working". The table now has data from over 1000 symbols, with over 22 million rows already. I think it is not even half full for all 1000 symbols yet, so I expect the table to keep growing.
I am starting to see serious performance problems when querying this table. For example the following query (which I often want to do, to see the latest date for a particular symbol) takes well over 1 minute to complete, and only returns 1 row!
select * from minute_data where symbol = "CSCO" order by date desc limit 1;
This query (which is also very important) is also taking over 1 minute on average:
select count(*), symbol from minute_data group by symbol;
The performance problems are making it unrealistic to keep working with the data in this way. These are the questions that I would like to ask the community:
Is it futile to continue building my data set into this table?
Is MySQL a bad choice altogether for a data set like this?
What can I do to this table to improve performance?
What kind of data structure should I use for this purpose (instead of a MySQL table)?
Thank You!
UPDATE
I am providing the output from the explain, the same for the following 2 queries:
explain select count(*), symbol from minute_data group by symbol;
explain select * from minute_data where symbol = "CSCO" order by date desc limit 1;
UPDATE 2
Pretty simple fix. I ran this query to remove the useless key_col that I had defined above, and made a primary key on 2 columns, date and symbol:
alter table minute_data drop primary key, add primary key (date,symbol);
Now I tried the following query, and it finished in less than 1 second:
select * from minute_data where symbol = "CSCO" order by date desc limit 1;
This query still takes a long time to complete (72 seconds). I guess that's still because the query has to tabulate all 22 million rows in one query?:
select count(*), symbol from minute_data group by symbol;
Your key_col is completely useless. Did you know that you can have a primary key over multiple columns? I'd recommend that you drop that column and create a new primary key on (date, symbol), in this order, since your date column has the higher cardinality. Additionally, you can then (if there's a need for it) create another unique index on (symbol, date). Post EXPLAINs of your most important queries. And what's the cardinality of symbol?
UPDATE:
What you can see in the EXPLAIN is that there's no index which can be used, so it scans all 22.5 million rows. Please try the change mentioned above. If you don't want to drop key_col right now, you should at least add an index on the symbol column.
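A minimal sketch of the two options mentioned in this answer, assuming the column names from the question (date, symbol, key_col). The index names idx_symbol and uq_symbol_date are made up for illustration, and most likely only one of the two ALTER statements is needed:

```sql
-- Option 1 (key_col is kept for now): a plain index on symbol, so that
-- WHERE symbol = '...' no longer has to read every row.
ALTER TABLE minute_data ADD INDEX idx_symbol (symbol);

-- Option 2 (after the primary key has been changed to (date, symbol)):
-- a second unique index with symbol first, for per-symbol lookups.
ALTER TABLE minute_data ADD UNIQUE INDEX uq_symbol_date (symbol, date);

-- Re-run EXPLAIN afterwards. Ideally the "key" column now names an index
-- instead of being NULL, and "rows" is far below 22 million.
EXPLAIN SELECT * FROM minute_data WHERE symbol = 'CSCO' ORDER BY date DESC LIMIT 1;
```

Building either index on a 22-million-row table can take a while, so it is probably worth trying on a copy first.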
I have the following query that runs forever, and I am looking to see if there is any way that I can optimise it. It runs on a table that has 1,406,480 rows in total. Apart from the Filename and Ref_No columns, the ID and End_Date have both been indexed.
My Query:
INSERT INTO UniqueIDs
(
    SELECT T1.ID
    FROM master_table T1
    LEFT JOIN master_table T2
        ON (
            T1.Ref_No = T2.Ref_No
            AND T1.End_Date = T2.End_Date
            AND T1.Filename = T2.Filename
            AND T1.ID > T2.ID
        )
    WHERE T2.ID IS NULL
      AND LENGTH(T1.Ref_No) BETWEEN 5 AND 10
);
Explain Results:
The reason for not indexing the Ref_No is that this is a text column and therefore I get a BLOB/TEXT error when I try and index this column.
Would really appreciate if somebody could advise on how I can quicken this query.
Thanks
Thanks to Bill in regards to multi column indexes I have managed to make some headway. I first ran this code:
CREATE INDEX I_DELETE_DUPS ON master_table(id, End_Date);
I then added a new column to show the length of the Ref_No but had to change it from the query Bill mentioned as my version of MySQL is 5.5. So I ran it in 3 steps:
ALTER TABLE master_table
ADD COLUMN Ref_No_length SMALLINT UNSIGNED;
UPDATE master_table SET Ref_No_length = LENGTH(Ref_No);
ALTER TABLE master_table ADD INDEX (Ref_No_length);
Last step was to change my insert query with the where clause for the length. This was changed to:
AND t1.Ref_No_length between 5 and 10;
I then ran this query and within 15 mins I had 280k ids inserted into my UniqueIDs table. I then changed my insert script to see if I could cover more length values by doing the following:
AND t1.Ref_No_length IN (5,6,7,8,9,10,13);
This was to bring in the rows where the length was also equal to 13. This query took a lot longer, 2 hr 50 mins to be precise, but the additional rows with a length of 13 gave me an extra 700k unique ids.
I am still looking at ways to optimise the query with the IN clause, but this is already a big improvement over the original query, which kept running for 24 hours. So thank you so much, Bill.
For the JOIN, you should have a multi-column index on (Ref_No, End_Date, Filename).
You can create a prefix index on a TEXT column like this:
ALTER TABLE master_table ADD INDEX (Ref_No(10));
But that won't help you search based on the LENGTH(). Indexing only helps search by value indexed, not by functions on the column.
In MySQL 5.7 or later, you can create a virtual column like this, with an index on the values calculated for the virtual column:
ALTER TABLE master_table
ADD COLUMN Ref_No_length SMALLINT UNSIGNED AS (LENGTH(Ref_No)),
ADD INDEX (Ref_No_length);
Then MySQL will recognize that your condition in your query is the same as the expression for the virtual column, and it will automatically use the index (exception: in my experience, this doesn't work for expressions using JSON functions).
But this is no guarantee that the index will help. If most of the rows match the condition of the length being between 5 and 10, the optimizer will not bother with the index. It may be more work to use the index than to do a table-scan.
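As a rough sketch of the multi-column index suggested at the top of this answer: because Ref_No is a TEXT column, it needs a prefix length inside the index. The index name idx_ref_end_file and the prefix length of 10 are assumptions. The sketch also assumes Filename is a VARCHAR short enough to be indexed whole; if it is a TEXT column too, it would need its own prefix length.

```sql
-- Composite index for the self-join; Ref_No is TEXT, so only its
-- first 10 characters are indexed (assumed prefix length).
ALTER TABLE master_table
  ADD INDEX idx_ref_end_file (Ref_No(10), End_Date, Filename);

-- Check whether the optimizer picks the new index for T2.
EXPLAIN
SELECT T1.ID
FROM master_table T1
LEFT JOIN master_table T2
  ON T1.Ref_No = T2.Ref_No
 AND T1.End_Date = T2.End_Date
 AND T1.Filename = T2.Filename
 AND T1.ID > T2.ID
WHERE T2.ID IS NULL;
```

Whether the index is actually chosen depends on the data distribution, so the EXPLAIN output is the thing to look at.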
"the ID and End_Date have both been indexed."
You have PRIMARY KEY(id) and redundantly INDEX(id)? A PK is a unique key.
"have both been indexed" -- INDEX(a), INDEX(b) is not the same as INDEX(a,b) -- they have different uses. Read about "composite" indexes.
That query smells a lot like "group-wise" max done in a very slow way. (Alas, that may have come from the online docs.)
I have compiled the fastest ways to do that task here: http://mysql.rjweb.org/doc.php/groupwise_max (There are multiple versions, based on MySQL version and what issues your code can/cannot tolerate.)
Please provide SHOW CREATE TABLE. One important question: Is id the PRIMARY KEY?
This composite index may be useful:
(Filename, End_Date, Ref_No, -- first, in any order
ID) -- last
This, as others have noted, is unlikely to be helped by any index, hence T1 will need a full-table-scan:
AND LENGTH(T1.Ref_No) BETWEEN 5 AND 10
If Ref_No cannot be bigger than 191 characters, change it to a VARCHAR so that it can be used in an index. Oh, did I ask for SHOW CREATE TABLE? If you can't make it VARCHAR, then my recommended composite index is
INDEX(Filename, End_Date, ID)
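For illustration only, here is one way the same result might be expressed as a plain aggregate instead of a self-join. This is a sketch, not one of the variants from the linked article. It assumes that UniqueIDs has a single column for the id, and that Ref_No, End_Date and Filename are never NULL (the original LEFT JOIN keeps every row that has a NULL in one of those columns, while GROUP BY puts NULLs into one group).

```sql
-- Keep the smallest ID of every (Ref_No, End_Date, Filename) group,
-- which is what the "no other row with a smaller ID" anti-join selects.
INSERT INTO UniqueIDs
SELECT MIN(ID)
FROM master_table
WHERE LENGTH(Ref_No) BETWEEN 5 AND 10
GROUP BY Ref_No, End_Date, Filename;
```

It may be worth comparing the row count against the original query on a small sample before relying on it.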
I have a table with a large number of records ( > 300,000). The most relevant fields in the table are:
CREATE_DATE
MOD_DATE
Those are updated every time a record is added or updated.
I now need to query this table to find the date of the record that was modified last. I'm currently using
SELECT mod_date FROM table ORDER BY mod_date DESC LIMIT 1;
But I'm wondering if this is the most efficient way to get the answer.
I've tried adding a where clause to limit the date to the last month, but it looks like that's actually slower (and I need the most recent date, which could be older than the last month).
I've also tried the suggestion I read elsewhere to use:
SELECT UPDATE_TIME
FROM information_schema.tables
WHERE TABLE_SCHEMA = 'db'
AND TABLE_NAME = 'table';
But since I might be working on a dump of the original, that query might return NULL. It also looks like this is actually slower than the original query.
I can't resort to last_insert_id() because I'm not updating or inserting.
I just want to make sure I have the most efficient query possible.
The most efficient way for this query would be to use an index for the column MOD_DATE.
From How MySQL Uses Indexes
8.3.1 How MySQL Uses Indexes
Indexes are used to find rows with specific column values quickly.
Without an index, MySQL must begin with the first row and then read
through the entire table to find the relevant rows. The larger the
table, the more this costs. If the table has an index for the columns
in question, MySQL can quickly determine the position to seek to in
the middle of the data file without having to look at all the data. If
a table has 1,000 rows, this is at least 100 times faster than reading
sequentially.
You can use
SHOW CREATE TABLE tbl_name;
to get the CREATE statement and see whether an index on MOD_DATE is defined.
To add an Index you can use
CREATE INDEX
CREATE [UNIQUE|FULLTEXT|SPATIAL] INDEX index_name
[index_type]
ON tbl_name (index_col_name,...)
[index_option]
[algorithm_option | lock_option] ...
see http://dev.mysql.com/doc/refman/5.6/en/create-index.html
Make sure that both of those fields are indexed.
Then I would just run -
select max(mod_date) from table
or create_date, whichever one.
Make sure to create 2 indexes, one on each date field, not a compound index on both.
As for a discussion of the difference between this and using limit, see MIN/MAX vs ORDER BY and LIMIT
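A short sketch of that advice. The table name records is a stand-in (the question only calls it "table"), and the index names are made up:

```sql
-- One single-column index per date field, not a compound index on both.
CREATE INDEX idx_mod_date ON records (mod_date);
CREATE INDEX idx_create_date ON records (create_date);

-- With an index in place, MAX() can be answered from the end of the
-- index instead of reading the table.
SELECT MAX(mod_date) FROM records;

-- To verify: with a usable index, the Extra column of EXPLAIN typically
-- reads "Select tables optimized away" for this query.
EXPLAIN SELECT MAX(mod_date) FROM records;
```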
Use EXPLAIN:
http://dev.mysql.com/doc/refman/5.0/en/explain.html
This tells you how MySQL executes a statement. With that you can work out the most efficient approach, because it depends on your database structure and there is no universal solution.
I have a table 'tbl' something like this:
ID bigint(20) - primary key, autoincrement
field1
field2
field3
That table has 600k+ rows.
Query 1:
SELECT * from tbl ORDER by ID LIMIT 600000, 1 takes 1.68 seconds
Query 2:
SELECT ID, field1 from tbl ORDER by ID LIMIT 600000, 1 takes 1.69 seconds
Query 3:
SELECT ID from tbl ORDER by ID LIMIT 600000, 1 takes 0.16 seconds
Query 4:
SELECT * from tbl WHERE ID = xxx takes 0.005 seconds
Those queries are tested in phpmyadmin.
The result is that query 3 and query 4 together return the necessary data.
Query 1 does the same job but is much slower...
This doesn't look right to me.
Could anyone give any advice?
P.S. I'm sorry for formatting.. I'm new to this site.
New test:
Q5 : CREATE TEMPORARY TABLE tmptable AS (SELECT ID FROM tbl WHERE ID LIMIT 600030, 30);
SELECT * FROM tbl WHERE ID IN (SELECT ID FROM tmptable); takes 0.38 sec
I still don't understand how it's possible. I recreated all indexes.. what else can I do with that table? Delete and refill it manually? :)
Query 1 looks at the table's primary key index, finds the correct 600,000 ids and their corresponding locations within the table, then goes to the table and fetches everything from those 600k locations.
Query 2 looks at the table's primary key index, finds the correct 600k ids and their corresponding locations within the table, then goes to the table and fetches whichever subset of fields are asked for from those 600k rows.
Query 3 looks at the table's primary key index, finds the correct 600k ids, and returns them. It doesn't need to look at the table at all.
Query 4 looks at the table's primary key index, finds the single entry requested, goes to the table, reads that single entry, and returns it.
Time-wise, let's build backwards:
(Q4) The table index allows lookup of a key (id) in O(log n) time, meaning every time the table doubles in size it only takes one extra step to find the key in the index*. If you have 1 million rows, then, it would only take ~20 steps to find it. A billion rows? 30 steps. The index entry includes data on where in the table to go to find the data for that row, so MySQL jumps to that spot in the table and reads the row. The time reported for this is almost entirely overhead.
(Q3) As I mentioned, the table index is very fast; this query finds the first entry and just traverses the tree until it has the requested number of rows. I'm sure I could calculate the precise number of steps it would take, but as a maximum we'll say 20 steps x 600k rows = 12M steps; since it's traversing a tree it would likely be more like 1M steps, but the precise number is largely irrelevant. The most important thing to realize here is that once MySQL has walked the index to pull the ids it needs, it has everything you asked for. There's no need to go look at the table. The time reported for this one is essentially the time it takes MySQL to walk the index.
(Q2) This begins with the same tree-walking as discussed for query 3, but while pulling the IDs it needs, MySQL also pulls their location within the table files. It then has to go to the table file (probably already cached/mmapped in memory), and for every entry it pulled, seek to the proper place in the table and get the fields requested out of those rows. The time reported for this query is the time it takes to walk the index (as in Q3) plus the time to visit every row specified in the index.
(Q1) This is identical to Q2 when all fields are specified. As the time is essentially identical to Q2, we can see that it doesn't really take measurably more time to pull more fields out of the database; any time spent there is dwarfed by crawling the index and seeking to the rows.
*: Most databases use an indexing data structure (B-trees for MySQL) that has a log base much higher than 2, meaning that instead of an extra step every time the table doubles, it's more like an extra step every time the table size goes up by a factor of hundreds to thousands. This means that instead of the 20-30 steps I stated in the example, it's more like 2-5.
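The observation from the question, that query 3 followed by query 4 is much faster than query 1, can be folded into a single statement, sometimes called a deferred join. This is a sketch and not something proposed in the answer above; the aliases t and picked are arbitrary:

```sql
-- Inner query: walk only the primary key index to find the one ID at
-- offset 600000 (the cheap part, as in query 3).
-- Outer query: fetch the full row for just that ID (as in query 4).
SELECT t.*
FROM tbl AS t
JOIN (
    SELECT ID
    FROM tbl
    ORDER BY ID
    LIMIT 600000, 1
) AS picked ON picked.ID = t.ID;
```

How much this helps depends on the storage engine and on how wide the rows are, so it is worth timing against the figures reported in the question.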
I started with this question: is my large mysql table destined for failure?
The answer that I found from that question was satisfactory. I have a table with 22 million rows that I would like to grow to about 100 million. At this time, the table minute_data structure is like this:
A problem that I am having is as follows. I need to execute this query:
select datediff(date,now()) from minute_data where symbol = "CSCO" order by date desc limit 1;
Which is very fast ( < 1 sec ) when the table contains the value "CSCO". The problem is, sometimes I will query for a symbol that is not in the table already. When I execute a query like this for, say, symbol = "ABCD":
select datediff(date,now()) from minute_data where symbol = "ABCD" order by date desc limit 1;
Then the query takes a LONG TIME... like forever ( 180 seconds ).
A way I can get around this is by making sure that the table contains the symbol I am looking for before I execute the query. The fastest way I found to do this is with the following query, which I only use to check whether the table minute_data contains the symbol I am looking for. Basically I just need it to return a boolean value so I know if the symbol is in the table or not:
select count(1) from minute_data where symbol = "CSCO";
This query takes over 30 seconds to return 1 value, which is way too long for my liking, since the query above, which actually returns a datediff calculation, takes less than 1 second.
The symbol column is part of the primary key, so I thought MySQL should be able to figure out very quickly whether a value exists there.
What am I doing wrong? Is there a fast way to do what I want to do? Should I change the structure of the data to optimize performance?
Thank You!
UPDATE
I think I found a good solution to this problem. From the answer below by LastCoder, I did the following:
1) Created a new table called minute_data_2 with the exact same definition as minute_data.
2)
ALTER TABLE minute_data_2 ADD PRIMARY KEY (symbol, date);
3)
INSERT IGNORE INTO minute_data_2 SELECT * FROM minute_data;
4)
DROP TABLE minute_data;
5) Rename minute_data_2 to minute_data
Now I am seeing blindingly fast speed: the same query that I described above as taking more than 180 seconds now completes in .001 seconds. Amazing.
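The five steps above might look like this when written out in full. This is a sketch under two assumptions: that the existing primary key is (date, symbol), and that CREATE TABLE ... LIKE is used for step 1 (it copies the old primary key along, so the key has to be dropped and re-added rather than just added). The name minute_data_old is made up, and the swap via RENAME TABLE replaces the immediate DROP so the old table can be checked first.

```sql
-- 1) Same definition as minute_data, including its current primary key.
CREATE TABLE minute_data_2 LIKE minute_data;

-- 2) Replace the copied key with one that leads with symbol.
ALTER TABLE minute_data_2
  DROP PRIMARY KEY,
  ADD PRIMARY KEY (symbol, date);

-- 3) Copy the rows across.
INSERT IGNORE INTO minute_data_2 SELECT * FROM minute_data;

-- 4) and 5) Swap the tables in one statement, keeping the old one around.
RENAME TABLE minute_data TO minute_data_old,
             minute_data_2 TO minute_data;

-- Once the new table has been verified:
-- DROP TABLE minute_data_old;
```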
Did you try using EXISTS (...)?
select datediff(date,now()) from minute_data
where EXISTS(SELECT * FROM minute_data WHERE symbol = "CSCO")
AND symbol = "CSCO" order by date desc limit 1;
Even though symbol is part of the primary key, it seems you have the timestamp in the PK as well, which makes me think you are using a COMPOSITE pk, meaning the ordering is by timestamp, then symbol. You may want to put a separate index on symbol if all you have is a composite one where the timestamp comes first.
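For the pure existence check, a sketch along these lines avoids counting every matching row. The names idx_symbol and has_symbol are hypothetical, and the extra index only matters while the primary key starts with the date column:

```sql
-- Separate index so lookups by symbol alone do not scan the table.
ALTER TABLE minute_data ADD INDEX idx_symbol (symbol);

-- Returns 1 if at least one row exists for the symbol, otherwise 0.
SELECT EXISTS (
    SELECT 1 FROM minute_data WHERE symbol = 'ABCD'
) AS has_symbol;
```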
I think it is better to make a table named symbols and add a reference to that table in your minute_data table:
symbols:
symbol_id (INT, Primary Key, Auto Increment)
symbol_text (VARCHAR)
minute_data:
key_col (BIGINT, Primary Key, Auto Increment)
symbol_id (INT, Index)
other_field
Use InnoDB as the table type so that you can add references (foreign keys).
Try to avoid duplicate entries in your tables.
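Written out as DDL, the layout above could look roughly like this. The VARCHAR length, the unique key on symbol_text, the date column standing in for other_field, and all index and constraint names are assumptions added for the sketch:

```sql
CREATE TABLE symbols (
    symbol_id   INT NOT NULL AUTO_INCREMENT,
    symbol_text VARCHAR(16) NOT NULL,          -- length is a guess
    PRIMARY KEY (symbol_id),
    UNIQUE KEY uq_symbol_text (symbol_text)    -- one row per symbol
) ENGINE=InnoDB;

CREATE TABLE minute_data (
    key_col   BIGINT NOT NULL AUTO_INCREMENT,
    symbol_id INT NOT NULL,
    date      DATETIME NOT NULL,               -- stands in for other_field
    PRIMARY KEY (key_col),
    KEY idx_symbol_id (symbol_id),
    CONSTRAINT fk_minute_data_symbol
        FOREIGN KEY (symbol_id) REFERENCES symbols (symbol_id)
) ENGINE=InnoDB;

-- Looking up the latest row for one symbol then goes through the join.
SELECT m.*
FROM minute_data AS m
JOIN symbols AS s ON s.symbol_id = m.symbol_id
WHERE s.symbol_text = 'CSCO'
ORDER BY m.date DESC
LIMIT 1;
```

Checking whether a symbol exists at all then only touches the small symbols table.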
I have the following data in a MySQL table table:
ID: int(11) [this is the primary key]
Date: date
and I run the MySQL query:
SELECT * from table WHERE Date=CURDATE() and ID=1;
This takes between 0.6 and 1.2 seconds.
Is there any way to optimize this query to get results quicker?
My objective is to find out if I already have a record for today for this ID.
Add indexes on ID and Date.
See CREATE INDEX manual.
You could add a limit 1 at the end, since you are searching for a primary key the max results is 1.
And if you only want to know whether it exists or not, you could replace * with ID to select only the ID.
Furthermore, if you haven't already, you really need to add indexes.
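Putting the two suggestions from this answer together, the check could be sketched as follows. The table name is written in backticks because the question calls the table "table", which is a reserved word in MySQL:

```sql
-- Select only the ID and stop after the first match; an empty result
-- means there is no record for today for this ID.
SELECT ID
FROM `table`
WHERE Date = CURDATE()
  AND ID = 1
LIMIT 1;
```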
SET @cur_date = CURDATE();
...WHERE Date = @cur_date ...
and then create an index of Date, ID (order is important, it should match the order you query on).
In general, calling functions before you do the query and storing them to variables lets SQL treat them like numbers instead of functions, which tends to allow it to use a faster query algorithm.