I already have 300 GB of data in a table.
I want to add an index to a particular column, but when I add the index, the disk usage grows (the size of the ibdata file increases).
I applied the same process to another table, and the disk usage did not increase there.
Below is the query I use to add the index:
CREATE INDEX index_name ON table (column)
In general, the amount of space needed for an index varies between about 50% and 200% of the original text volume.
Generally, the larger the total amount of text, the smaller the overhead, but many small records will use more overhead than fewer large records. Also, clean data (such as published text) will require less overhead than dirty data such as emails or discussion notes, since dirty data is likely to include many unique words from spelling mistakes and abbreviations.
So the data type and content of the column also matter.
Hope this helps a bit!
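To see how much of the growth is data versus indexes for a particular table, a query like this against information_schema can help; the schema and table names are placeholders:

SELECT TABLE_NAME,
       ROUND(DATA_LENGTH / 1024 / 1024) AS data_mb,
       ROUND(INDEX_LENGTH / 1024 / 1024) AS index_mb
FROM information_schema.TABLES
WHERE TABLE_SCHEMA = 'your_db'
  AND TABLE_NAME = 'your_table';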
I am dealing with a decently large table in a MySQL database, about 10 million rows. One of the columns is VARCHAR(50), but the rest are all fixed size (int, datetime, etc). Would changing this field to be CHAR(50) improve MySQL's ability to scan the table, given that each row would be a fixed size?
I understand the performance benefits of CHAR vs VARCHAR--that is not my concern. I had simply read that fixed size rows can speed up MySQL when dealing with large tables, and just want to know if this is true and to what degree.
Short Answer: No
Long Answer:
Fixed row size is an old wives' tale. It had a rare use case when MyISAM was the only Engine. With it dying and InnoDB dominating, it is really out of date.
If your VARCHAR values are usually short, then your table takes about twice as much disk space with CHAR(50) as it would with VARCHAR(50). That means twice the I/O. I/O is the "slow" part of the performance, not the ability to do an UPDATE "in place", which MyISAM might do and InnoDB does not (or at least not exactly).
These days, 10M rows is only medium-sized. But even for a billion-row table, I would give the same advice -- even more emphatically.
InnoDB works with 16KB "blocks". Sure, if the text in your VARCHAR grows and causes a block split, that costs a little effort. But meanwhile, each block holds twice as many rows. Etc.
Another point: Locating the row is a far bigger cost than splitting it into columns.
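If you want to check how much space CHAR(50) would waste for your own data, a quick look at the actual stored lengths settles it; the table and column names here are placeholders:

SELECT AVG(CHAR_LENGTH(name)) AS avg_len,
       MAX(CHAR_LENGTH(name)) AS max_len
FROM my_table;

-- If avg_len is well below 50, CHAR(50) would roughly double the space used by this column.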
There are 210 columns in my table, with around 10,000 rows. Each row is unique and there is a primary key on the table. The thing is, we always have to run a SELECT * query on the table to get the data for all the sites.
Currently, the problem is that it takes too much time; the data returned is around 10 MB, and it will grow larger in the future.
The table has VARCHAR, TEXT, and DATE types in it.
Is there any way I can modify the structure, or something else, to make retrieval faster? More indexing, or breaking down the table? (Although I think denormalized data is good for retrieval.)
Update: "why do wider tables slow down the query performance?"
Thanks..!
why do wider tables slow down the query performance?
InnoDB stores "wide" tables in a different way. Instead of having all the columns together in a single string (plus overhead, such as lengths, etc), it does the following:
If the total of all the columns for a given row exceeds about 8KB, it will move some of the data to another ("off-record") storage area.
Which columns are moved off-record depends on the sizes of the columns, etc.
The details depend on the ROW_FORMAT chosen.
"Off-record" is another 16KB block (or blocks).
Later, when doing SELECT * (or at least fetching the off-record column(s)), it must do another disk fetch.
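If it helps, you can check which ROW_FORMAT (and engine) a given table is using; the table name below is a placeholder:

SELECT TABLE_NAME, ENGINE, ROW_FORMAT
FROM information_schema.TABLES
WHERE TABLE_SCHEMA = DATABASE()
  AND TABLE_NAME = 'my_wide_table';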
What to do?
Rethink having so many columns.
Consider "vertical partitioning", wherein you have one or more additional tables that contain selected TEXT columns. Suggest picking groups of columns based on access patterns in your app; a sketch follows this list.
For columns that are usually quite long, consider compressing them in the client and storing into a BLOB instead of a TEXT. Most "text" shrinks 3:1. Blobs go off-record the same as Texts; however, these compressed blobs would be smaller, hence less likely to spill.
Do more processing in SQL -- to avoid returning all the rows, or to avoid returning the full text, etc. When blindly shoveling lots of text to a client, the network and client become a significant factor in the elapsed time, not just the SELECT itself.
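A minimal sketch of the vertical-partitioning idea from the list above; all table and column names are invented for illustration:

-- Narrow "core" table with the columns you read on every request.
CREATE TABLE site_core (
    site_id INT UNSIGNED NOT NULL AUTO_INCREMENT,
    name VARCHAR(100) NOT NULL,
    updated_at DATETIME NOT NULL,
    PRIMARY KEY (site_id)
) ENGINE=InnoDB;

-- Bulky TEXT columns moved to a 1:1 side table, fetched only when needed.
CREATE TABLE site_details (
    site_id INT UNSIGNED NOT NULL,
    long_description TEXT,
    notes TEXT,
    PRIMARY KEY (site_id),
    FOREIGN KEY (site_id) REFERENCES site_core (site_id)
) ENGINE=InnoDB;

-- Most queries touch only the narrow table; join only when the text is required.
SELECT c.site_id, c.name, d.long_description
FROM site_core AS c
JOIN site_details AS d USING (site_id)
WHERE c.site_id = 42;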
I know that if a table is too big, its indexes can hardly fit into the buffer_pool, so using an index may result in a large number of random disk I/Os. So a full table scan, in general, is probably much faster than an index scan, even if the index scan only reads about 1% of the rows.
What I am confused about is:
[0] If there is one big table (30 million rows) and many small tables (each of which can fit into memory, i.e., the buffer), will the big table also affect queries against the small tables?
My reasoning is:
The buffer is shared by the whole database, so the big table will take most of the buffer. So the indexes of the small tables can also hardly fit into the buffer (or are often evicted from it). Then the conclusion above (full table scan vs. index scan) can also be applied to this case.
[1] When the big table is partitioned into many small tables (on just one machine), the buffer situation should stay the same. So such partitioning cannot solve this problem (full table scan vs. index scan), right? So "big table" should not really mean "one big table", but rather "a huge database, i.e., a large total amount of data".
To sum up, is my conclusion right? If wrong, why? Please give me a hint. Thanks very much.
The buffer_pool is shared across all tables, data and indexes. But the rest of what you said needs to be reframed in terms of "blocks" instead of "tables".
Caching is performed on a block basis. A block (in InnoDB) is 16KB. Most of the innodb_buffer_pool_size is dedicated to data and index blocks.
The cache is run (approximately) as LRU (Least Recently Used) -- That is, the least recently used blocks are tossed from the cache when other blocks are needed.
No, a table or index is not "entirely" loaded into the cache. Instead, the desired blocks are loaded (and purged) when needed.
If all the data and indexes fit into the cache, then (eventually) all the blocks will 'live' there.
If the data plus indexes are too big, then blocks will come and go as needed. Usually this is nearly as good as having them all loaded. For example, if you are usually using "recent" records, then the blocks containing them will 'stay' in the cache; meanwhile "old" blocks will get bumped out.
If you are using UUIDs (GUIDs), performance can get really bad -- this is because of the random nature of such indexed values.
Full table scans (and full index scans) should be avoided whether or not things are too big to fit in cache. They are costly, and they can usually be avoided by proper indexing and/or query formulation.
When you do a full table scan on a table that is bigger than the cache, something's gotta give. You will have to do some I/O, and some blocks will be bumped out of cache. However, there is a technique built in that prevents blindly purging the entire cache for an occasional table scan. For further discussion, research innodb_old_blocks_pct. (No, I don't recommend changing it from the default 37%.)
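If you want to look at the relevant settings on your own server, they can be inspected like this (inspection only; I'm not suggesting changing them):

SHOW VARIABLES LIKE 'innodb_buffer_pool_size';
SHOW VARIABLES LIKE 'innodb_old_blocks_pct';  -- default is 37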
What do you mean by partitioning a table? If you mean the built-in PARTITION mechanism, then so what? If you scan a table, you are scanning all the partitions. Same number of blocks; same impact on the cache.
I have dealt with sets of tables that exceed the buffer_pool by a factor of 10 or more. I can discuss performance techniques, but I need a specific SHOW CREATE TABLE (with or without PARTITIONs) and some of the naughty queries (such as table scans).
The Optimizer chooses between doing a table scan and using an index based on a variety of statistics, etc. A Rule of Thumb is that, if more than 20% of the rows need to be touched, it will do a table scan instead of bouncing between the index and the data. (Note: the cutoff is much higher than the 1% you mentioned.)
An Index is structured as a BTree in 16KB blocks, so it is very efficient to start in the middle and scan a range. For example: INDEX(last_name) for WHERE last_name LIKE 'J%' would probably do a "range scan" of 10% of the index, even if that involved bouncing over to the table a lot.
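To make that concrete (the table, column, and index names here are just illustrative), the index and query would look like this, and EXPLAIN would typically report a "range" access type:

CREATE INDEX idx_last_name ON people (last_name);

EXPLAIN
SELECT id, last_name, first_name
FROM people
WHERE last_name LIKE 'J%';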
With MyISAM, having variable-length columns (VARCHAR, BLOB) in a table really slowed down queries, so much so that I came across advice on the net to move VARCHAR columns into a separate table.
Is that still an issue with InnoDB? I don't mean cases where inserting many VARCHAR rows into the table causes page splits. I just mean: should you consider, for example, moving post_text (a single BLOB field in the table) into another table, speaking performance-wise about InnoDB?
As far as I know, BLOBs (and TEXTs) are actually stored outside of the table, while VARCHARs are stored in the table.
VARCHARs are bad for read performance because each record can be of variable length and that makes it more costly to find fields in a record.
BLOBs are slow because the value has to be fetched separately and it will very likely require another read from disk or cache.
To my knowledge InnoDB doesn't do anything differently in this respect so I would assume the performance characteristics hold.
I don't think moving BLOB values really helps, other than reducing the overall table size, which has a positive influence on performance regardless.
VARCHARs are a different story. You will definitely benefit here. If all your columns are of defined length (and I guess that means you can't use BLOBs either?) the field lookup will be faster.
If you're just 'reading' the VARCHAR and BLOB fields, I'd say this is worth a shot. But if your select query needs to compare a value from a VARCHAR or a BLOB, you're pretty much out of luck.
So yes you can definitely gain performance here but make sure you test that you're actually gaining performance and that the increase is worth the aggressive denormalization.
PS.
Another way of 'optimizing' VARCHAR read performance is to simply replace them by CHAR fields (of fixed length). This could benefit read performance, so long as the increase in disk space is acceptable.
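If you try that route, the swap itself is a single ALTER; the table and column names below are placeholders, and the statement rewrites the table, so test it on a copy first:

ALTER TABLE posts MODIFY post_title CHAR(50) NOT NULL;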
InnoDB stores data completely differently than MyISAM.
In MyISAM, all indexes -- primary or otherwise -- are stored in the MYI file and contain a pointer to the data stored in the MYD file. Variable-length rows shouldn't directly affect query speed, but the MYD file does tend to get more fragmented with variable-length rows, because the hole left behind when you delete a row can't necessarily be filled in by the row you insert next. If you update a variable-length value to make it longer, you might have to move it somewhere else, which means it will tend to get out of order with respect to the indexes over time, making range queries slower (if you're running on a spinning disk where seek times matter).
InnoDB stores data clustered in pages in a B-tree on the primary key. So long as the data will fit in a page it is stored in the page whether you're using a BLOB or VARCHAR. As long as you aren't trying to insert inordinately long values on a regular basis it shouldn't matter whether your rows are fixed-length or variable-length.
I have a weekly script that moves data from our live database and puts it into our archive database, then deletes the data it just archived from the live database. Since it's a decent size delete (about 10% of the table gets trimmed), I figured I should be running OPTIMIZE TABLE after this delete.
However, I'm reading this from the mysql documentation and I don't know how to interpret it:
http://dev.mysql.com/doc/refman/5.1/en/optimize-table.html
"OPTIMIZE TABLE should be used if you have deleted a large part of a table or if you have made many changes to a table with variable-length rows (tables that have VARCHAR, VARBINARY, BLOB, or TEXT columns). Deleted rows are maintained in a linked list and subsequent INSERT operations reuse old row positions. You can use OPTIMIZE TABLE to reclaim the unused space and to defragment the data file."
The first sentence is ambiguous to me. Does it mean you should run it if:
A) you have deleted a large part of a table with variable-length rows or if you have made many changes to a table with variable-length rows
OR
B) you have deleted a large part of ANY table or if you have made many changes to a table with variable-length rows
Does that make sense? So if my table has no VAR columns, do I need to run it still?
While we're on the subject - is there any indicator that tells me that a table is ripe for an OPTIMIZE call?
Also, I read this http://www.xaprb.com/blog/2010/02/07/how-often-should-you-use-optimize-table/ which says running OPTIMIZE TABLE is only useful for the primary key. If most of my selects are from other indices, am I just wasting effort on tables that have a surrogate key?
Thanks so much!
In your scenario, I do not believe that regularly optimizing the table will make an appreciable difference.
First things first, your second interpretation (B) of the documentation is correct - "if you have deleted a large part of ANY table OR if you have made many changes to a table with variable-length rows."
If your table has no VAR columns, each record, regardless of the data it contains, takes up the exact same amount of space in the table. If a record is deleted from the table, and the DB chooses to reuse the exact area the previous record was stored, it can do so without wasting any space or fragmenting your data.
As far as whether OPTIMIZE only improves performance on a query that utilizes the primary key index, that answer would almost certainly vary based on what storage engine is in use, and I'm afraid I wouldn't be able to answer that.
However, speaking of storage engines, if you do end up using OPTIMIZE, be aware that it doesn't run natively on InnoDB tables; the command maps to ALTER and rebuilds the table, which can be a more expensive operation. Either way, the table is locked during the optimization, so be very careful about when you run it.
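Two hedged, related examples (the table name is a placeholder): SHOW TABLE STATUS exposes Data_free, a rough indicator of reclaimable space, and for InnoDB the rebuild that OPTIMIZE maps to can also be written as a plain ALTER:

-- Rough "ripeness" check: the Data_free column shows reclaimable space.
SHOW TABLE STATUS LIKE 'my_archive_table';

-- For InnoDB, OPTIMIZE is effectively a table rebuild; this is the commonly used equivalent.
ALTER TABLE my_archive_table ENGINE=InnoDB;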
There are so many differences between MyISAM and InnoDB, I am splitting this answer in two:
MyISAM
FIXED has some meaning for MyISAM.
"Deleted rows are maintained in a linked list and subsequent INSERT operations reuse old row positions" applies to MyISAM, not InnoDB. Hence, for MyISAM tables with a lot of churn, OPTIMIZE can be beneficial.
In MyISAM, VAR plus DELETE/UPDATE leads to fragmentation.
Because of the linked list and VAR, a single row can be fragmented across the data file (.MYD). (Otherwise, a MyISAM row is contiguous in the data file.)
InnoDB
FIXED has no meaning for InnoDB tables.
For VAR in InnoDB, there are "block splits", not a linked list.
In a BTree, block splits stabilize at an average of about 69% full. So, with InnoDB, almost any abuse will leave the table not too bloated. That is, DELETE/UPDATE (with or without VAR) leads only to the more limited BTree 'fragmentation'.
In InnoDB, emptied blocks (16KB each) are put on a "free list" for reuse; they are not given back to the OS.
Data in InnoDB is ordered by the PRIMARY KEY, so deleting a row in one part of the table does not provide space for a new row in another part of the table. But, when a block is freed up, it can be used elsewhere.
Two adjacent blocks that are half empty will be coalesced, thereby freeing up a block.
Both
If you are removing "old" data (your 10%), then PARTITIONing is a much better way to do it. See my blog. It involves DROP PARTITION, which is instantaneous and gives space back to the OS, plus REORGANIZE PARTITION, which can be instantaneous.
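A minimal sketch of that approach; the table, column, and partition names are made up, and note that MySQL requires the partitioning column to be part of every unique key (including the primary key):

-- Partition the live table by time so old data can be dropped instantly.
ALTER TABLE live_events
PARTITION BY RANGE (TO_DAYS(created_at)) (
    PARTITION p20240101 VALUES LESS THAN (TO_DAYS('2024-01-08')),
    PARTITION p20240108 VALUES LESS THAN (TO_DAYS('2024-01-15')),
    PARTITION pfuture   VALUES LESS THAN MAXVALUE
);

-- Weekly purge: archive the oldest partition's rows, then drop the partition.
-- DROP PARTITION is instantaneous and gives the space back to the OS.
ALTER TABLE live_events DROP PARTITION p20240101;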
OPTIMIZE TABLE is almost never worth doing.