Can MySQL Query speed depend on the table size? - mysql

I had a table with 3 columns and 23 million rows. Each row contains a primary key, an int value, and a single word, and that is it. Each word is 3 characters long; in other words, each word's hash representation was stored there. The table size was 5 GB. This table is well indexed.
Now I am going to create the same table with real words in it, no more 3-character hashes, so each word will contain its normal number of letters. This table still contains 23 million rows and 3 columns. However, since the words are longer than the 3-character hash, the size of the table is 15 GB. This table is well indexed.
The only difference between these 2 tables is that in the first table the data type of the hash is char(3), while in the second table the data type of the "non_hashed_word" is varchar(20).
Now please have a look at the code below, which we ran against the first table I mentioned. This code runs in 0.01 seconds.
SELECT `indexVal`, COUNT(`indexVal`) AS OverlapWords, `UniqueWordCount`,
       (COUNT(`indexVal`) / `UniqueWordCount`) AS SimScore
FROM `key_word`
WHERE `hashed_word` IN ('001','01v','0ji','0k9','0vc','0#v','0%d','13#','148',
                        '1e1','1sx','1v$','1#c','1?b','1?k','226','2kl','2ue','2*l','2?4','36h','3au','3us','4d~')
GROUP BY `indexVal`
LIMIT 500
We are expecting to run the same code in our new table as well.
So my question is, even though the number of rows and the number of columns are the same, can our query be slower because the table size is much larger now? Or maybe because the data type is varchar() now?

Definitely yes. Use EXPLAIN to get the query plan. Other reasons:
LIMIT needs the whole grouped result set before it can return the first 500 rows -> more rows, more data
Operations (COUNT, /, etc.) need to be executed for each row
If an index exists, it is larger when there are more rows and can be fragmented on disk
etc.
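As a minimal sketch of how you could compare the two plans, run EXPLAIN against both tables. The column name non_hashed_word comes from the question, but the second table name key_word_full and the sample words are placeholders I made up for illustration:
EXPLAIN SELECT `indexVal`, COUNT(`indexVal`) AS OverlapWords, `UniqueWordCount`,
       (COUNT(`indexVal`) / `UniqueWordCount`) AS SimScore
FROM `key_word`
WHERE `hashed_word` IN ('001','01v','0ji')
GROUP BY `indexVal` LIMIT 500;

EXPLAIN SELECT `indexVal`, COUNT(`indexVal`) AS OverlapWords, `UniqueWordCount`,
       (COUNT(`indexVal`) / `UniqueWordCount`) AS SimScore
FROM `key_word_full`
WHERE `non_hashed_word` IN ('apple','banana','cherry')
GROUP BY `indexVal` LIMIT 500;
Comparing the key_len, rows, and Extra columns of the two plans tells you whether the wider varchar(20) index is still being used the same way, or whether the optimizer has switched to scanning.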

Related

Mariadb Explain statement estimating high number of rows that would be found during lookup

I have a table with 32 columns, of which 6 columns form the primary key and 2 more columns are indexed.
The EXPLAIN statement provides the output below.
I have observed that every time the number of rows in the EXPLAIN output increases, the select query takes seconds to retrieve data from the DB. The above select query returned only 310 rows but it had to scan 382546 rows.
The time taken was calculated by enabling MariaDB's slow query log.
Create table query
I would like to understand what is wrong in the table or the query that is considerably slowing down the select query execution.
Your row is relatively large (around 300 bytes, depending on the content of your varchar columns). Using the primary key means (for InnoDB) that MySQL will read the whole row. Assuming the estimate of 400k rows is right (which it probably isn't, but you can check by removing the and country_code = 1506 from your query to get a better count), MySQL may end up reading more than 100 MB from disk, which can reasonably take several seconds.
Adding a proper index should fix this; in your case I would suggest (country_code, lcr_run_id, tier_type) (which would, together with your primary key, actually be the same as just (country_code)).
If most of your queries have that form (i.e. use at least these three columns for lookup), you could think about changing the order of your primary key to start with those three columns; it should give you another speed boost. That operation will take some time, though.
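A minimal sketch of the suggested index; the table name your_table and the index name are placeholders (the CREATE TABLE statement isn't shown here), while the column names come from the query in the question:
ALTER TABLE your_table
  ADD INDEX idx_country_lcr_tier (country_code, lcr_run_id, tier_type);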
Hash partitioning is useless for performance, get rid of it. Ditto for subpartitioning.
Specifying which partition to use defeats the purpose of letting the Optimizer do it for you.
You simply need INDEX(tier_type, lcr_run_id, country_code) with the columns in any desired order.
Plan A: Have the PRIMARY KEY start with those 3 columns (again, the order is not important)
Plan B: Have a "secondary" index with those 3 columns, but not being the same as the start of the PK. (This index could have more columns on the end; let's see some more queries to advise further.)
Either way, it will scan only 310 rows if you also get rid of all partitioning. (Hence, resolving your "returned only 310 rows but it had to scan 382546 rows". Anyway, the '382546' may have been a poor estimate by Explain.)
The important issue here is that indexing works with the leftmost columns in the INDEX. (The PK is an index.) Your SELECT had a match on the first 2 columns, but country_code came later in the list, and the intervening columns were not tested with =.
The three 35M values make me wonder if the PK is over-specified. For example, if a "zone" is comprised of several "countries", then "zone" is irrelevant in specifying the PK.
The table has only 382K rows, but it is much fatter than it needs to be. Partitioning has a lot of overhead. Also, most columns have (I think) much bigger datatypes than needed. BIGINT takes 8 bytes; INT takes 4 bytes. For example, if there are only a small number of "zones", use TINYINT UNSIGNED, which takes only 1 byte (and allows values 0..255). (See also other 'int' variants.)
Oops, I missed something else. Since zone is not in the WHERE, it can't even get past the primary partitioning.
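If you do drop the partitioning and add the secondary index as suggested, a hedged sketch would look like this (your_table is again a placeholder for the actual table name):
ALTER TABLE your_table REMOVE PARTITIONING;
ALTER TABLE your_table
  ADD INDEX idx_tier_lcr_country (tier_type, lcr_run_id, country_code);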

MySql performance issues in queries

I'm a newbie using MySQL. I'm reviewing a table that has around 200,000 records. When I execute a simple:
SELECT * FROM X WHERE Serial=123
it takes a long time, around 15-30 secs, to return a response (with 200,000 rows).
Before adding an index it took around 50 seconds (with 7 million rows) to return a simple SELECT ... WHERE statement.
This table increases its rows every day. Right now it has 7 million rows. I added an index in the following way:
ALTER TABLE `X` ADD INDEX `index_name` (`serial`)
Now it takes 109 seconds to return a response.
Which initial approaches should I apply to this table to improve the performance?
Is MySQL the correct tool to handle big tables that will have around 5-10 million records, or should I move to another tool?
Assuming serial is some kind of numeric datatype...
You do ADD INDEX only once. Normally, you would have foreseen the need for the index and add it very cheaply when you created the table.
Now that you have the index on serial, that select, with any value other than 123, will run very fast.
If there is only one row with serial = 123, the indexed table will spit out the row in milliseconds whether it has 7 million rows or 7 billion.
If serial = 123 shows up in 1% of the table, then finding all 70M rows (out of 7B) will take much longer than finding all 70K rows (out of 7M).
Indexes are your friends!
If serial is a VARCHAR, then...
Plan A: Change serial to be a numeric type (if appropriate), or
Plan B: Put quotes around 123 so that you are comparing strings to strings!
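A minimal sketch of both plans, assuming the table is named X as in the question and Serial currently stores only digit strings (Plan A rewrites the column, so try it on a copy first):
-- Plan B: compare strings to strings so an index on Serial can be used
SELECT * FROM X WHERE Serial = '123';
-- Plan A: change the column to a numeric type (only if every existing value is numeric)
ALTER TABLE X MODIFY Serial BIGINT UNSIGNED NOT NULL;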

Which is better 1Table 150,000,000,000 rows or 5000 TABLES with 300,000 rows

Would spreading the workload vertically across many rows increase performance beyond splitting the workload horizontally across many tables?
So 1 TABLE with 150,000,000,000 rows and 6 Columns
table1(item_id,cat_id,note) The NOTE column will store a non repeating INT that will constantly be changing.
The standard way.
VS
5000 TABLES with 300,000 rows and 1,000 columns.
table1(id,cat_id,item_1,item_2,item_3,...item_999) table2(table1_id,id,cat_id,item_1000,item_1001,item_1002,...item_1998)
...
table5000(table1_id,id,cat_id,item_499000,item_499001,item_499002,...item_500000)
The column itself will define every category the item is in.
The Row will define every item in that category.
The NOTE value from above will be placed in the intersecting cell.
Which is better? Why? LINKS if possible
Is there a huge performance lag when searching multiple tables versus one single one?
Bytes per short string column: 8,000
Bytes per GROUP BY, ORDER BY: 8,060
Bytes per row: 8,060
Columns per index key: 16
Columns per foreign key: 16
Columns per primary key: 16
Columns per nonwide table: 1,024
Columns per wide table: 30,000
Columns per SELECT statement: 4,096
Columns per INSERT statement: 4,096
Columns per UPDATE statement (wide tables): 4,096
These figures are actually SQL Server's maximum capacity limits; MySQL has its own, comparable column-count limits.
When you combine varchar, nvarchar, varbinary, sql_variant, or CLR user-defined type columns that exceed 8,060 bytes per row, consider the following:
Are you building a real-time application?
Do you really have an idea of how to divide relationships into tables?
Do you have an idea of the ACID properties?
Your idea about the database design is wrong; you just need to revise the design.
I am very worried about your coding: how will you write code against this?
Follow these steps:
get your requirements properly
do some analysis
and redesign your database; I think you will really get good output.
The most columns I have ever had in a table is 100, which is already a lot from my point of view, so I divided those columns into 17 tables.
http://www.slideshare.net/ronaldbradford/top-20-design-tips-for-mysql-data-architects-presentation
Check out this link.
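For reference, a minimal sketch of the single-table layout the question calls the standard way, with a composite primary key so that each (item, category) intersection holds exactly one note; the table and index names are placeholders:
CREATE TABLE item_category_note (
  item_id INT UNSIGNED NOT NULL,
  cat_id  INT UNSIGNED NOT NULL,
  note    INT NOT NULL,              -- the constantly changing INT from the question
  PRIMARY KEY (item_id, cat_id),     -- one row per item/category intersection
  KEY idx_cat (cat_id)               -- supports lookups by category
) ENGINE=InnoDB;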

SQL Left Join. Taking too long.

Okay so here are my table schemas.
I have 2 tables. Say Table A and Table B. The primary key of Table A is PriKeyA bigint(50) and primary key of Table B is PriKeyB varchar(255). Both PriKeyA and PriKeyB contain the same type of data.
The relevant fields of Table A required for this problem are Last_login_date_in_A (date) and Table B is the primary key itself.
What I need to do is, get those PriKeyA's in A which are not there in Table B's PriKeyB column and the Last_login_date_in_A column should be greater than 30 days from the current date. Basically I need the difference of Table A and Table B along with a certain condition(which is the date in this problem)
Here is my SQL command:
SELECT A.PriKeyA FROM A
LEFT JOIN B ON A.PriKeyA = B.PriKeyB
WHERE B.PriKeyB IS NULL AND DATEDIFF(CURRENT_DATE, Last_login_date_in_A) > 30;
However, when I run this MySQL command, it takes a ridiculously long time (about 3 hours). The sizes of Table A and Table B are 2,50,000 and 42,000 records respectively. I thought this problem could arise from the fact that PriKeyA and PriKeyB are different datatypes, so I also used CAST(PriKeyB AS UNSIGNED) in the query. But that didn't work either; there was only a marginal performance improvement.
What could be the possible problems? I've used Left Joins before and they never have taken this long.
The expense of the query appears to be for these reasons:
The SQL datatype for A's PK and B's PK aren't the same.
Table A probably doesn't have an index on Last_login_date_in_A
What this means is that ALL rows in table A MUST be examined one row at a time in order to determine if the > 30 days ago criteria is true. This is especially true if A has 2,500,000 rows (as evidenced by how you placed your commas in A's row count) instead of 250,000.
Adding an index on Last_login_date_in_A might help you out here, but will also slightly slow down insert/update/delete statement times for the table due to needing to update the additional index.
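A minimal sketch of that index, using the table and column names from the question. The rewritten WHERE clause below is my addition rather than part of this answer: the index can only be range-scanned when the test is on the bare column instead of being wrapped in DATEDIFF().
ALTER TABLE A ADD INDEX idx_last_login (Last_login_date_in_A);

SELECT A.PriKeyA
FROM A
LEFT JOIN B ON A.PriKeyA = B.PriKeyB
WHERE B.PriKeyB IS NULL
  AND A.Last_login_date_in_A < CURRENT_DATE - INTERVAL 30 DAY;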
Additionally, you should consult the documentation explaining MySQL's actual chosen query plan for your query at: MySQL query plan documentation

mysql select order by primary key. Performance

I have a table 'tbl' something like that:
ID bigint(20) - primary key, autoincrement
field1
field2
field3
That table has 600k+ rows.
Query:
SELECT * from tbl ORDER by ID LIMIT 600000, 1 takes 1.68 seconds
Query:
SELECT ID, field1 from tbl ORDER by ID LIMIT 600000, 1 takes 1.69 seconds
Query:
SELECT ID from tbl ORDER by ID LIMIT 600000, 1 takes 0.16 seconds
Query:
SELECT * from tbl WHERE ID = xxx takes 0.005 seconds
Those queries are tested in phpmyadmin.
And the result is that query 3 and query 4 together return the necessary data.
Query 1 does the same job but is much slower...
This doesn't look right to me.
Could anyone give any advice?
P.S. I'm sorry for formatting.. I'm new to this site.
New test:
Q5 : CREATE TEMPORARY TABLE tmptable AS (SELECT ID FROM tbl WHERE ID LIMIT 600030, 30);
SELECT * FROM tbl WHERE ID IN (SELECT ID FROM tmptable); takes 0.38 sec
I still don't understand how this is possible. I recreated all the indexes... what else can I do with this table? Delete and refill it manually? :)
Query 1 looks at the table's primary key index, finds the correct 600,000 ids and their corresponding locations within the table, then goes to the table and fetches everything from those 600k locations.
Query 2 looks at the table's primary key index, finds the correct 600k ids and their corresponding locations within the table, then goes to the table and fetches whichever subset of fields are asked for from those 600k rows.
Query 3 looks at the table's primary key index, finds the correct 600k ids, and returns them. It doesn't need to look at the table at all.
Query 4 looks at the table's primary key index, finds the single entry requested, goes to the table, reads that single entry, and returns it.
Time-wise, let's build backwards:
(Q4) The table index allows lookup of a key (id) in O(log n) time, meaning every time the table doubles in size it only takes one extra step to find the key in the index*. If you have 1 million rows, then, it would only take ~20 steps to find it. A billion rows? 30 steps. The index entry includes data on where in the table to go to find the data for that row, so MySQL jumps to that spot in the table and reads the row. The time reported for this is almost entirely overhead.
(Q3) As I mentioned, the table index is very fast; this query finds the first entry and just traverses the tree until it has the requested number of rows. I'm sure I could calculate the precise number of steps it would take, but as a maximum we'll say 20 steps x 600k rows = 12M steps; since it's traversing a tree it would likely be more like 1M steps, but the precise number is largely irrelevant. The most important thing to realize here is that once MySQL has walked the index to pull the ids it needs, it has everything you asked for. There's no need to go look at the table. The time reported for this one is essentially the time it takes MySQL to walk the index.
(Q2) This begins with the same tree-walking as discussed for query 3, but while pulling the IDs it needs, MySQL also pulls their location within the table files. It then has to go to the table file (probably already cached/mmapped in memory), and for every entry it pulled, seek to the proper place in the table and get the fields requested out of those rows. The time reported for this query is the time it takes to walk the index (as in Q3) plus the time to visit every row specified in the index.
(Q1) This is identical to Q2 when all fields are specified. As the time is essentially identical to Q2, we can see that it doesn't really take measurably more time to pull more fields out of the database; any such time is dwarfed by crawling the index and seeking to the rows.
*: Most databases use an indexing data structure (B-trees for MySQL) that has a log base much higher than 2, meaning that instead of an extra step every time the table doubles, it's more like an extra step every time the table size goes up by a factor of hundreds to thousands. This means that instead of the 20-30 steps I stated in the example, it's more like 2-5.
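Building on that explanation, the usual workaround for a large LIMIT offset is essentially what the temporary-table test (Q5) does by hand: walk only the primary key index first, then fetch the full rows for just the IDs you keep. A hedged sketch of that as a single deferred-join query (my illustration, not part of the original answer):
-- Walk only the PK index for the offset, then look up just the matching row(s)
SELECT t.*
FROM tbl AS t
JOIN (SELECT ID FROM tbl ORDER BY ID LIMIT 600000, 1) AS picked
  ON picked.ID = t.ID;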