I'm a newbie using MySQL. I'm reviewing a table that has around 200,000
records. When I execute a simple:
SELECT * FROM X WHERE Serial=123
it takes a long time, around 15-30 seconds, to return a response (with 200,000 rows).
This table gains rows every day; right now it has 7 million rows. Before I added an index, it took around 50 seconds (at 7 million rows) to return a simple SELECT ... WHERE statement. I added the index in the following way:
ALTER TABLE `X` ADD INDEX `index_name` (`serial`)
Now it takes 109 seconds to return a response.
What initial approaches should I apply to this table to improve performance?
Is MySQL the right tool to handle big tables that will have around 5-10 million records, or should I move to another tool?
Assuming serial is some kind of numeric datatype...
You do ADD INDEX only once. Normally, you would have foreseen the need for the index and added it very cheaply when you created the table.
Now that you have the index on serial, that SELECT, with any value (not just 123), will run very fast.
If there is only one row with serial = 123, the indexed table will spit out the row in milliseconds whether it has 7 million rows or 7 billion.
If serial = 123 shows up in 1% of the table, then finding all 70M rows (out of 7B) will take much longer than finding all 70K rows (out of 7M).
Indexes are your friends!
If serial is a VARCHAR, then...
Plan A: Change serial to be a numeric type (if appropriate), or
Plan B: Put quotes around 123 so that you are comparing strings to strings!
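A minimal sketch of both plans, reusing the table and column from the question (INT UNSIGNED is just my guess at an appropriate numeric type):
-- Plan A: make the column numeric, so no per-row conversion is needed
ALTER TABLE `X` MODIFY `serial` INT UNSIGNED NOT NULL;
-- Plan B: compare string to string, so the index on the VARCHAR can be used
SELECT * FROM X WHERE Serial = '123';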
Related
I have a MySQL database which has a table with around 60 million entries, with a primary key, say 'x'. I have a data set (CSV file) which also has around 60 million entries, keyed on the same 'x'. For values of key 'x' common to both the MySQL table and the dataset, the corresponding entries in the MySQL table just get updated by incrementing a counter variable. The new ones in the dataset are to be inserted.
A simple serial execution, in which we try to update each entry if present or else insert it, takes around 8 hours to complete. What can I do to speed up this whole procedure?
Plan A: IODKU (INSERT ... ON DUPLICATE KEY UPDATE), as @Rogue suggested; see the sketch after this list.
Plan B: Two SQL statements; they might run faster because part of the 8 hours is spent gathering a huge amount of undo information in case of a crash. The normalization section comes close to those 2 queries.
Plan C: Walk through the pair of tables, using the PRIMARY KEY of one of them to do IODKU in chunks of, say, 1000 rows. See my Chunking code (and adapt it from DELETE to IODKU).
In Plans B and C, turn on autocommit so that you don't build up a huge redo log.
Plan D: Build a new table as you merge the two tables with a JOIN. Finish with an atomic
RENAME TABLE real TO old,
new TO real;
DROP TABLE old; -- when happy with the result.
Plan E: Plan D + Chunking of the INSERT ... SELECT real JOIN tmp ...
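A minimal sketch of Plan A, assuming the CSV has first been loaded into a staging table (the names tmp, real_table, x, and counter are hypothetical):
-- load the CSV into a staging table
LOAD DATA LOCAL INFILE 'data.csv' INTO TABLE tmp
    FIELDS TERMINATED BY ',' (x, counter);
-- one pass: insert new keys, increment the counter for existing ones
INSERT INTO real_table (x, counter)
    SELECT x, counter FROM tmp
    ON DUPLICATE KEY UPDATE counter = real_table.counter + 1;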
If I have a table TableA with 10k rows and I want to search all rows where id > 8000
When I use the SQL statement SELECT * FROM TableA WHERE id > 8000 to search them, what will MySQL do? Will it scan all 10k rows and return the 2k rows that match the condition, or just skip those 8k rows and return the 2k rows of data?
I also have a requirement to store a lot of data in the database per day, and I need quick searches for today's records. Is one big table still the best method, or are other solutions available?
Or would it be best to create 2 tables: 1 for all records and 1 for today's records? When new data comes in, both tables get the insert, but the next day the records in the second table are deleted.
Which method is better when comparing the speed of SELECT, or is there any other good method for this case?
Actually I don't have the real database here now, but I just worry about which way/method would be better in that case.
Updated information below (8-12-2016 11:00):
I am using InnoDB, but I will use the date as the search key, and it is not a PK.
Returning 2k rows is just an extreme case for study; the real case may return (number of users * records per user), so if I have 100 users and they each make 10 records that day, I may need to return 1k rows.
My real case is that I need to store all user records per day (maybe 10 records per user) and generate a ranking over the last day's records and the last 7 days' records. So I just worry: if I search for the last day's records in one large table, will it be slow, or should I create another table just to hold the last day's records?
Are you fetching more than about 20% of the table? (The number 20% is inexact.)
Is the PRIMARY KEY on id? Or is it a secondary key?
Are you using ENGINE=InnoDB?
Case: InnoDB and PRIMARY KEY(id): The execution will start at 8000 and go until finished. This is optimal.
Case: InnoDB, id is a secondary key, and a 'small' percentage of table is being fetched: The index will be used; it is a BTree and is scanned from 8000 to end, jumping over to the data (via the PK) to find the rows.
Case: InnoDB, id is secondary, and large percentage: The index will be ignored, and the entire table will be scanned ("table scan"), ignoring rows that don't match the WHERE clause. A table scan is likely to be faster than the previous case because of all the 'jumping over to the data'.
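You can see which case applies with EXPLAIN (a quick sketch; the table name comes from the question, and the secondary index is an assumption):
-- only needed if id is not already the PRIMARY KEY
ALTER TABLE TableA ADD INDEX idx_id (id);
EXPLAIN SELECT * FROM TableA WHERE id > 8000;
-- type = range -> the index is scanned from 8000 onward
-- type = ALL   -> the optimizer chose a full table scan instead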
Other comments:
10K rows is "small" as tables go.
Returning 2K rows is "large" as result sets go. What are you doing with them?
Is there further filtering that you could turn over to MySQL, so that you don't get all 2K back? Think COUNT or SUM with GROUP BY, FULLTEXT search index, etc.
More tips on indexes.
My question is simple: let's say that I hypothetically have 18446744073709551615 records in one table (the maximum number), but I want to select only one of those records, something like this:
SELECT * FROM TABLE1 WHERE ID = 5
1. Will the result be very slow to appear?
Or if I have another table with only five records and I do the same query:
SELECT * FROM TABLE2 WHERE ID = 5
2. Will the result appear at the same speed as in the first SELECT, or will it be much faster in this one?
Thanks.
Let's assume for simplicity that the ID column is a fixed-width primary key. It will be found in roughly 64 index lookups, since log2(2^64) = 64 (Wolfram Alpha on that). Since MySQL / InnoDB uses B-trees, it will take somewhat fewer disk seeks than that.
Searching among a million rows would take you roughly 20 index lookups. Seeking among 5 values will take 3 index lookups, and the whole index will probably fit into one block.
Most of the speed difference will come from data being read from disk. The index branching should be a relatively fast operation, and functionally you would not notice the difference once the values were cached in RAM. That is to say, the first time you select from your 2^64 rows, it will take a little while to read from a spinning disk, but repeating the query would be essentially the same speed for the 5-row and the 2^64-row tables (even ignoring the query cache).
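You can check the log2 arithmetic behind those estimates directly in MySQL (purely an illustration):
SELECT LOG2(18446744073709551615) AS max_table_lookups,   -- ~64
       CEIL(LOG2(1000000))        AS million_row_lookups, -- 20
       CEIL(LOG2(5))              AS five_row_lookups;    -- 3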
No, the first one will almost certainly be slower than the second, but probably not that much slower, provided you have an index on the ID column.
With an index, you can efficiently find the first record meeting the condition and then all the other records will be close by (in the index structures anyway, not necessarily the data area).
I'd say you're more likely to run out of disk storage with the first one before you run out of database processing power :-)
I had a table with 3 columns and 23 million rows. Each row contains a primary key, an int value, and one single word; that is it. Each word is 3 characters long. In other words, each word's "hash representation" was there. The table size was 5 GB. This table is well indexed.
Now I am going to create the same table with real words in it, no more 3-character hashes. So each word will have its normal number of letters. This table also contains 23 million rows and 3 columns. However, since the words are longer than the 3-character hash, the size of the table is 15 GB. This table is well indexed.
The only difference between these 2 tables is that in the first table, the data type of the hash is char(3), while in the second table, the data type of the "non_hashed_word" column is varchar(20).
Now please have a look at the code below, which we ran against the first table I mentioned. This code runs in 0.01 seconds.
SELECT `indexVal`, COUNT(`indexVal`) AS OverlapWords, `UniqueWordCount`,
       (COUNT(`indexVal`) / `UniqueWordCount`) AS SimScore
FROM `key_word`
WHERE `hashed_word` IN ('001','01v','0ji','0k9','0vc','0#v','0%d','13#','148',
                        '1e1','1sx','1v$','1#c','1?b','1?k','226','2kl','2ue',
                        '2*l','2?4','36h','3au','3us','4d~')
GROUP BY `indexVal` LIMIT 500
We are expecting to run the same code against our new table as well.
So my question is: even though the number of rows and the number of columns are the same, could our query be slow because the table size is much larger now? Or maybe because the datatype is varchar() now?
Definitely yes. Use EXPLAIN to get the query plan. Other reasons:
LIMIT has to compute the whole grouped result set before returning the first 500 rows -> more rows, more data
Operations (COUNT, /, etc.) need to be executed for each row
If an index exists, it is larger when there are more rows and can be fragmented on disk
etc.
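One thing worth trying (a sketch, not a guaranteed fix; the column names come from the query above, so substitute `non_hashed_word` on the new table): a covering index containing every column the query touches, so MySQL can answer it from the index alone and never has to read the wider varchar rows:
ALTER TABLE `key_word`
    ADD INDEX idx_covering (`hashed_word`, `indexVal`, `UniqueWordCount`);
If the optimizer uses it, EXPLAIN will show "Using index" in the Extra column.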
I have a table 'tbl' something like this:
ID bigint(20) - primary key, autoincrement
field1
field2
field3
That table has 600k+ rows.
Query:
SELECT * from tbl ORDER by ID LIMIT 600000, 1 takes 1.68 second
Query:
SELECT ID, field1 from tbl ORDER by ID LIMIT 600000, 1 takes 1.69 second
Query:
SELECT ID from tbl ORDER by ID LIMIT 600000, 1 takes 0.16 second
Query:
SELECT * from tbl WHERE ID = xxx takes 0.005 second
Those queries are tested in phpmyadmin.
And the result is that query 3 and query 4 together return the necessary data.
Query 1 does the same job but much slower...
This doesn't look right to me.
Could anyone give any advice?
P.S. I'm sorry for the formatting; I'm new to this site.
New test:
Q5: CREATE TEMPORARY TABLE tmptable AS (SELECT ID FROM tbl ORDER BY ID LIMIT 600030, 30);
SELECT * FROM tbl WHERE ID IN (SELECT ID FROM tmptable); takes 0.38 sec
I still don't understand how this is possible. I recreated all the indexes... what else can I do with that table? Delete and refill it manually? :)
Query 1 looks at the table's primary key index, finds the correct 600,000 ids and their corresponding locations within the table, then goes to the table and fetches everything from those 600k locations.
Query 2 looks at the table's primary key index, finds the correct 600k ids and their corresponding locations within the table, then goes to the table and fetches whichever subset of fields are asked for from those 600k rows.
Query 3 looks at the table's primary key index, finds the correct 600k ids, and returns them. It doesn't need to look at the table at all.
Query 4 looks at the table's primary key index, finds the single entry requested, goes to the table, reads that single entry, and returns it.
Time-wise, let's build backwards:
(Q4) The table index allows lookup of a key (id) in O(log n) time, meaning every time the table doubles in size it only takes one extra step to find the key in the index*. If you have 1 million rows, then, it would only take ~20 steps to find it. A billion rows? 30 steps. The index entry includes data on where in the table to go to find the data for that row, so MySQL jumps to that spot in the table and reads the row. The time reported for this is almost entirely overhead.
(Q3) As I mentioned, the table index is very fast; this query finds the first entry and just traverses the tree until it has the requested number of rows. I'm sure I could calculate the precise number of steps it would take, but as a maximum we'll say 20 steps x 600k rows = 12M steps; since it's traversing a tree it would likely be more like 1M steps, but the precise number is largely irrelevant. The most important thing to realize here is that once MySQL has walked the index to pull the ids it needs, it has everything you asked for. There's no need to go look at the table. The time reported for this one is essentially the time it takes MySQL to walk the index.
(Q2) This begins with the same tree-walking as discussed for query 3, but while pulling the IDs it needs, MySQL also pulls their location within the table files. It then has to go to the table file (probably already cached/mmapped in memory), and for every entry it pulled, seek to the proper place in the table and get the fields requested out of those rows. The time reported for this query is the time it takes to walk the index (as in Q3) plus the time to visit every row specified in the index.
(Q1) This is identical to Q2 when all fields are specified. As the time is essentially identical to Q2's, we can see that it doesn't take measurably more time to pull more fields out of the database; any extra time is dwarfed by crawling the index and seeking to the rows.
*: Most databases use an indexing data structure (B-trees for MySQL) that has a log base much higher than 2, meaning that instead of an extra step every time the table doubles, it's more like an extra step every time the table size goes up by a factor of hundreds to thousands. This means that instead of the 20-30 steps I stated in the example, it's more like 2-5.
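A common workaround that falls out of this analysis (a sketch using the table from the question, and essentially what the temporary-table test in Q5 did in two steps): pay the offset cost on the index alone, then join back for just the rows you need.
SELECT t.*
FROM tbl AS t
JOIN (SELECT ID FROM tbl ORDER BY ID LIMIT 600000, 1) AS x
     ON x.ID = t.ID;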