I am using MySQL 5.5 with InnoDB. The basis of my server is Netty, JDBC and BoneCP.
I have a log table that contains user input (HTTP headers, request bodies, etc.). This table will be read only very rarely, for reasons like security and data recovery, so read performance is not something we care about.
There are five columns in this table.
Name       | Type
-----------|---------------------------------------
logID      | BIGINT (auto-increment, primary key)
userNumber | MEDIUMINT
logTime    | TIMESTAMP
header     | VARCHAR(100)
body       | VARCHAR(200)
What are some tips that will improve insert performance?
Also, is the logID necessary in this case?
If the table is never referenced, why use any key in there at all? It's not necessary; it only adds to the insert time and serves no purpose.
My suggestion would be to drop the logID and not create any indexes on the table at all.
Another optimization would be to change the storage engine to MyISAM. When you only insert and have no constraints on the table, InnoDB costs you time for the option of ACID compliance, while MyISAM doesn't care about that.
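Combined, the two suggestions might look like this minimal sketch (column types taken from the question; the table name and the TIMESTAMP default are assumptions):
-- Index-free MyISAM log table: no logID, no keys at all, so every
-- INSERT is a plain append to the data file.
CREATE TABLE request_log (
    userNumber MEDIUMINT NOT NULL,
    logTime    TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
    header     VARCHAR(100),
    body       VARCHAR(200)
) ENGINE=MyISAM;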
Explanation
I have a table which does not have a primary key (or even a composite key).
The table stores the time slots (opening hours and food-delivery hours) of food shops. Let's call the table "business_hours"; the main fields are as below.
shop_id
day (0 - 6, means Sunday - Saturday)
type (open, delivery)
start_time
end_time
As an example, if shop A is open on Monday from 9:00am to 1:00pm and from 5:00pm to 10:00pm, there will be two records in the business_hours table for this scenario.
-----------------------------------------------
| shop_id | day | type | start_time | end_time
-----------------------------------------------
| 1000    | 1   | open | 09:00:00   | 13:00:00
| 1000    | 1   | open | 17:00:00   | 22:00:00
-----------------------------------------------
When I query this table, I will always use shop_id as the first condition in the WHERE clause.
Ex:
SELECT COUNT(*) FROM business_hours WHERE shop_id = 1000 AND day = 1 AND type = 'open' AND start_time <= '13:29:00' AND end_time > '13:29:00';
Question
Is an index on "shop_id" enough, or should the "day" and "type" fields also be indexed?
It would also be great if you could explain how the indexing really works.
It depends on several factors that you should specify:
How fast will the data grow
What is the estimated table size in rows
What queries will be run against that table
How fast do you expect the queries to run
It is more about thinking it through: say some service makes thousands of inserts of new records per hour, the old records are archived nightly, and reports are created nightly from that table. In such a case you may prefer not to create many indexes, since they slow down inserts.
On the other hand if your table will grow and change slowly and many users will run queries against it, you need to have proper indexes to speed up queries.
If you can, try to create a clustered unique primary key that most queries can benefit from. If your data forms a timeline and most queries fetch ranges of data using datetime criteria (like from - to), it is better to include the datetime in the clustered index - that will give you the fastest query performance.
So something like this will give you the best performance for the mentioned SELECT (but then you cannot store duplicate business hours for one shop and type):
CREATE TABLE Business_hours
( shop_id    INT  NOT NULL
, day        INT  NOT NULL
, type       VARCHAR(10) NOT NULL  -- assumed type; an ENUM('open','delivery') would also fit
, start_time TIME NOT NULL         -- assumed type
, end_time   TIME NOT NULL         -- assumed type
-- other columns
, CONSTRAINT Business_hours_PK
  PRIMARY KEY (shop_id, day, type, start_time, end_time)  -- your clustered index
)
Just creating an index on the fields used in the SELECT (all of them, or just the most-used ones) will speed up your query too:
CREATE INDEX BusinessHours_IX ON business_hours (shop_id, day, type, start_time, end_time);
The difference between clustered and non-clustered is that a clustered index affects the physical order in which records are stored on disk.
You can use EXPLAIN to find missing indexes in your database; see this answer. For more detail, see this blog.
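For instance, running EXPLAIN on the query from the question shows whether the optimizer picks up the index created above:
EXPLAIN SELECT COUNT(*)
FROM business_hours
WHERE shop_id = 1000 AND day = 1 AND type = 'open'
  AND start_time <= '13:29:00' AND end_time > '13:29:00';
-- In the output, key = BusinessHours_IX means the index is used;
-- key = NULL means a full table scan.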
Yes, you can create a clustered index on those columns (shop_id, day, type). I have created an index like this:
CREATE CLUSTERED INDEX Ix ON business_hours (shop_id, day, type);
Then use the index in your SELECT query like this:
SELECT COUNT(*) FROM business_hours WITH (INDEX (Ix)) WHERE shop_id = 1000 AND day = 1 AND type = 'open' AND start_time <= '13:29:00' AND end_time > '13:29:00';
You will get results fast. But if the table already has a primary key, then don't create a clustered index; create a non-clustered index instead. (Note that CREATE CLUSTERED INDEX and the WITH (INDEX (Ix)) hint are SQL Server syntax; in MySQL, the primary key is the clustered index.)
It depends on your usage: if you are not updating the records, use a clustered index, for example:
CREATE CLUSTERED INDEX Saleperday ON business_hours (shop_id, day, type);
A clustered index traverses the B-tree and stores the entire row at the leaf node itself, so searching is fast. But updating records is costly, since the engine may have to move the entire row to keep it in key order.
Otherwise, if you are updating the records, use a non-clustered index.
If you are building a data warehouse, use columnstore indexes.
For a better understanding you can go to these links:
http://www.programmerinterview.com/index.php/database-sql/clustered-vs-non-clustered-index/
http://www.patrickkeisler.com/2014/04/what-is-non-clustered-columnstore-index.html
http://searchsqlserver.techtarget.com/feature/SQL-Server-2014-columnstore-index-the-good-the-bad-and-the-clustered
Having decided against a primary key means the following would be allowed:
| shop_id | day | type | start_time | end_time
+---------+-----+--------+------------+---------
| 1000 | 1 | open | 09:00:00 | 13:00:00
| 1000 | 1 | open | 09:00:00 | 13:00:00
| 1000 | 1 | open | 17:00:00 | 22:00:00
| 1000 | 1 | closed | 17:00:00 | 22:00:00
So you can have duplicate entries that may lead to strange query results, and even have a shop open and closed in the very same time range. (But well, we all know that even with a primary key you'd still need a before-insert trigger to detect range overlaps, e.g. 12:00-15:00 vs. 13:00-16:00, and throw an error in that case. - How I wish there were some built-in range detection, so we could, say, have a unique index on (shop_id, day, range(start_time, end_time)).)
As to your question: Provided your database is built well, you already have a foreign key on shop_id. You don't need any further index as long as you consider your queries fast enough.
Once you think you need to speed them up, you can add composite indexes as needed. That would usually be an index on all columns in the slow query's WHERE clause. If that still doesn't suffice, add the columns that are in the GROUP BY clause, if any. The next step would be the columns of the HAVING clause, if any, and then the columns of the ORDER BY clause. The last step would be to add all the columns in your SELECT clause, which gives you a covering index: all data needed for the query is in the index itself, so the table no longer has to be accessed at all.
But as mentioned: As long as you don't have performance issues, you don't have to add any composite indexes.
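Should you get there, here is a sketch of that progression for the query in the question (the index name is just an example):
-- Step 1: index every column in the WHERE clause.
CREATE INDEX bh_where_ix
    ON business_hours (shop_id, day, type, start_time, end_time);
-- For this SELECT COUNT(*) the index is already covering: every column
-- the query touches is in the index, so the table itself is never read.
-- If the query selected extra columns, appending them to the index
-- would restore that property.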
To decide which fields must be indexed in a database table, you need to observe the behavior of each query sent to the table. Indexes are the means of providing an efficient access path between the application and the data. The index provides the access path, so when a query asks the database for data, the database knows where to go to retrieve it.
Here is some official Microsoft documentation:
Clustered Indexes
A clustered index stores the actual table data pages at the leaf level, and the table data is ordered physically around the key. A table can have only one clustered index, and when this index is created, the following events also occur:
• Table data is rearranged.
• New index pages are created.
• All nonclustered indexes within the database are rebuilt.
As a result, there are many disk I/O operations and extensive use of system and memory resources. If you plan to create a clustered index, be sure you have free space equal to at least 1.5 times the amount of data in the table. The extra free space ensures that you have enough space to complete the operation efficiently.
Nonclustered Indexes
In a nonclustered index, pages at the leaf level contain a bookmark that tells SQL Server where to find the data row corresponding to the key in the index. If the table has a clustered index, the bookmark indicates the clustered index key. If the table does not have a clustered index, the bookmark is an actual row locator. When you create a nonclustered index, SQL Server creates the required index pages but does not rearrange table data.
The indexing method recommended by professionals comprises three phases: monitor, analyze, and then implement the index. That means you need to observe the behavior of your database when you run a query, and then work out how to get the best performance.
SQL Server uses these operations to fetch data:
Table scan: reads the entire heap and, most likely, passes all the data to a secondary filter operation.
Index scan: reads the entire leaf level (every row) of the clustered or non-clustered index. The index scan operation might filter the rows and return only those that meet the criteria, or it might pass all the rows to another filter operation, depending on the complexity of the criteria. The data may or may not be ordered.
Index seek: locates specific row(s) using the index and returns only the selected rows, in an ordered list.
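As a small illustration, assuming the composite (shop_id, day, type, ...) index from the earlier answers, these two queries land on opposite sides of that spectrum:
-- Predicate on the leading index column: the engine can seek straight to it.
SELECT COUNT(*) FROM business_hours WHERE shop_id = 1000;
-- Predicate skipping the leading column: the engine must scan instead.
SELECT COUNT(*) FROM business_hours WHERE day = 1;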
So, once you know all of that, you can run the query with the Display Estimated Execution Plan option and analyze the performance.
I recommend reading the posts SQL SERVER – Index Seek Vs. Index Scan and Optimizing Your Query Plans with the SQL
I have a table 'logging' in which we log visitor history. We get 14 million pageviews a day, so we insert 14 million records into the table per day, with traffic highest in the afternoon. For some days now we have been getting duplicate-key errors on 'id', which in my view should not happen, since id is an auto-increment field and we do not explicitly pass id in the insert query. The details follow.
logging (MyISAM)
----------------------------------------
| id | int(20) |
| virtual_user_id | varchar(1000) |
| visited_page | varchar(255) |
| /* More such columns are there */ |
----------------------------------------
Please let me know what the problem is here. Is keeping the table in MyISAM a problem?
Problem 1: size of your primary key
http://dev.mysql.com/doc/refman/5.0/en/integer-types.html
The max size of an INT, regardless of the display width you give it, is 2147483647 - roughly twice that if unsigned.
At 14 million inserts a day, that means you hit the signed limit after about 2147483647 / 14000000 ≈ 153 days.
To prevent that you might want to change the datatype to an unsigned BIGINT.
Or, for even more ridiculously large volumes, use a Unix timestamp + microtime as a composite key. Or a different DB solution altogether.
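A sketch of that change (table and column names come from the question; note that on a table this size the ALTER rebuilds the whole table and blocks writes while it runs, so schedule it off-peak):
ALTER TABLE logging
    MODIFY id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT;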
Problem 2: the actual error
It might be concurrency, even though I don't find that very plausible.
You'll have to provide the insert IDs / errors for that. Do you use transactions?
Another possibility is a corrupt table.
I don't know your MySQL version, but this might work:
CHECK TABLE tablename;
See if that has any complaints; if it does:
REPAIR TABLE tablename;
General advice:
Is this a sensible amount of data to be inserting into a database, and doesn't it slow everything down too much anyhow?
I wonder how your DB performs with locking during, for example, an ALTER TABLE.
The right way to do it totally depends on the goals and requirements of your system, which I don't know, but here's an idea:
Log lines to log files and import them in your own time. Don't bother your visitors with errors or delays when your DB is having trouble or when you need to do some big operation that locks everything.
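The import half of that idea could be a periodic bulk load, which is far faster than individual INSERTs; the file path and column list below are illustrative assumptions:
LOAD DATA INFILE '/var/log/app/visits.csv'
INTO TABLE logging
FIELDS TERMINATED BY ',' ENCLOSED BY '"'
LINES TERMINATED BY '\n'
(virtual_user_id, visited_page);  -- id keeps auto-incrementing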
I have a table with ~20 columns.
-----------------------------------------------------------------
GUID_PK | GUID_SET_ID | Col_3 | Col_4 | ... | Col_20
-----------------------------------------------------------------
There could be thousands of sets, each having tens to under a thousand records. Records within a set are all related to each other; sets are totally independent of each other. A whole set is read/written at a time in one big transaction. Once a record is written it is read-only forever: never altered, only read. Data is rarely deleted from this table, and when it is, the whole set is deleted in one go.
Only SET_ID is an incoming foreign key. PK is an outgoing foreign key to a different table; in the detail table, about 3 or 4 records (each a single blob) are kept per master record.
The question is: should I partition the tables? I think yes. My boss thinks otherwise: he wants the tables created dynamically, one master and one detail table per set. I personally am not comfortable with the dynamic-creation idea, but I fear the one-table-to-rule-them-all architecture.
The bulk insertions and bulk selects are definitely going to hit performance, and bulk deletes will again reorder the indexes. What would be an optimal structure?
Taking into account that most of the Col_x columns are populated, you can do HASH partitioning:
CREATE TABLE
....
PARTITION BY HASH(GUID_SET_ID)
PARTITIONS NO_PART;
Where NO_PART is the number of partitions that you want. This should be established taking into account:
1) the volume of data you receive daily
2) the volume of data you estimate that will be received in the future
You can also check the other partition types here.
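Fleshed out, that might look like the sketch below. Column types are assumptions: PARTITION BY HASH requires an integer expression, so this assumes the set ID is numeric; for a true GUID column, use PARTITION BY KEY(GUID_SET_ID) instead. Note also that MySQL requires every unique key, including the primary key, to contain all partitioning columns.
CREATE TABLE master_records (
    GUID_PK     BINARY(16)      NOT NULL,
    GUID_SET_ID BIGINT UNSIGNED NOT NULL,
    Col_3       VARCHAR(64),
    -- ... Col_4 through Col_20 ...
    PRIMARY KEY (GUID_SET_ID, GUID_PK)  -- set ID first, so a whole set clusters together
) ENGINE=InnoDB
PARTITION BY HASH(GUID_SET_ID)
PARTITIONS 16;  -- NO_PART: pick from current and projected data volume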
My website displays posts by DATE even though the SQL table is ordered by ID. Since the order of the ID is not always the same as the order of the DATE, I run the query with ORDER BY `DATE`.
SQL Table Example:
----------------------------
| ID | DATE |
----------------------------
| 1 | 2011-10-20 00:00:00 |
| 2 | 2012-10-20 00:00:00 |
| 3 | 2010-10-20 00:00:00 |
| 4 | 2011-09-20 00:00:00 |
----------------------------
To query I use: SELECT * FROM `table` ORDER BY `DATE`;
My questions:
Would it benefit query performance if the clustered index or primary key of the table was the DATE column?
Is it possible to have the ID column auto-increment when it is not the primary key?
What I want to do is make the query as fast as possible (which I think would be possible by making DATE the clustered index or primary key) but also allow each post to have a unique auto-increment ID. I tried to make DATE the primary key but I got an error saying "there can be only one auto column and it must be defined as a key".
I would not define the date as a primary key, but rather add an index on the field - unique, if needed. I believe it is possible to have an auto-increment on a non-primary-key field, but trying it yourself will give you the best answer!
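As a quick sketch to try yourself (the table name is illustrative; MySQL only requires the AUTO_INCREMENT column to be the leading column of some index, so a UNIQUE key on id satisfies the "must be defined as a key" error without making it the primary key):
CREATE TABLE posts (
    id     INT UNSIGNED NOT NULL AUTO_INCREMENT,
    `DATE` DATETIME     NOT NULL,
    PRIMARY KEY (`DATE`, id),  -- DATE leads, so rows cluster in date order
    UNIQUE KEY (id)            -- makes the auto column a key in its own right
) ENGINE=InnoDB;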
<-- EDIT -->
To answer your comment question: I can't say it's a BAD idea, but dates are always picky. For one, you have to decide whether to use UTC or local dates, preview how daylight saving time affects your program, foresee whether a date update could become necessary at some point in the application's life, and things like that. I'd rather forget about all that and just go with the unique auto-generated key.
If you do go for the date as PK, you can use a timestamp and avoid the second sequence column.
I found more info about dates as primary keys at techtarget.com and made2mentor.com.
It is nice for indexes if the values going into them are unordered - not mandatory, but nice. Since indexes are trees, if an index is only an auto-increment column you end up with an unbalanced tree right from the beginning: each time you rebuild the index, you are guaranteed to get it unbalanced again as new data arrives, because new entries are only ever added to one leaf of the tree (until the index page is full).
For clustered indexes on auto-increment fields (which primary keys are by default in Sybase, MS SQL and probably everything else), it is probably a good idea to do relatively frequent index rebuilds. My philosophy is to cluster on the most common scan. So I might set my primary key to the ID column but cluster on the DATE, so that queries like SELECT ... WHERE Date ... or SELECT ... ORDER BY Date scan consecutive items as they read pages off disk.
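In SQL Server/Sybase syntax, where the clustered index can be separated from the primary key, that layout might look like the sketch below (names are illustrative); in MySQL/InnoDB the primary key is always the clustered index, so there you would lead the primary key with the date instead, as in the earlier answer:
-- PK enforces uniqueness of ID; the clustered index orders rows by date.
CREATE TABLE posts (
    id     INT IDENTITY(1,1) NOT NULL,
    [date] DATETIME          NOT NULL,
    CONSTRAINT posts_pk PRIMARY KEY NONCLUSTERED (id)
);
CREATE CLUSTERED INDEX posts_date_cx ON posts ([date]);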
So I have a table that's being used basically like a NoSQL setup. The structure is:
id bigint primary key
data mediumblob
modified timestamp
It has around 350k rows. The queries that run on it are all structured as follows:
select data from table where id=XXX;
The table engine is InnoDB. I'm noticing that queries against this table are sometimes rather slow - sometimes they take 3 seconds to run. The table is 3 GB on disk, and I set innodb_buffer_pool_size to 4G.
Is there anything I'm missing here? Are there any settings I can tweak to improve performance?
Edit: as requested, the EXPLAIN output:
+----+-------------+----------+-------+---------------+---------+---------+-------+------+-------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+----------+-------+---------------+---------+---------+-------+------+-------+
| 1 | SIMPLE | cache | const | PRIMARY | PRIMARY | 8 | const | 1 | |
+----+-------------+----------+-------+---------------+---------+---------+-------+------+-------+
create table:
CREATE TABLE `cache` (
`id` bigint(20) unsigned NOT NULL DEFAULT '0',
`data` mediumblob,
`modified` timestamp NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
There are two issues that I see here initially. First, you have a query with a BLOB data type, which will cause speed issues when it comes to data retrieval. Second, you are using InnoDB, which is optimized for writing. This means that while it is probably the best choice overall, in extreme read situations it might be less performant than MyISAM. Neither of these issues is necessarily a deal-killer, but each adds a performance hit.
Beyond this, however, I'm not sure I can give you a good answer as to what you can do to optimize further without first having you do profiling. That is what I would recommend you do first: profile your query to figure out what the execution plan is, and then identify why that plan is so slow.
Here is a good "Top 10" list of MySQL optimizations. At least a couple apply in your situation directly:
http://20bits.com/articles/10-tips-for-optimizing-mysql-queries-that-dont-suck/
Here is another good optimization article that goes into server settings as well (for InnoDB specifically):
http://www.mysqlperformanceblog.com/2007/11/01/innodb-performance-optimization-basics/
Based on the CREATE TABLE statement you provided, I thought of another thing you should address (again, not a query-killer, but another performance hit). Unless there is a business case for using a BIGINT for your ID field, choose an INT instead. An INT allows 2.1 billion rows, so you shouldn't run out of numbers. Making this switch will save you disk space and improve query performance. Here is an article about it:
http://ronaldbradford.com/blog/bigint-v-int-is-there-a-big-deal-2008-07-18/
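Assuming all existing ids fit in the smaller type and nothing references the column, the switch is a one-liner:
-- Halves the key width from 8 bytes to 4.
ALTER TABLE cache MODIFY id INT UNSIGNED NOT NULL DEFAULT '0';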
Try to use the smallest id type possible. If it's a numeric key that you know will never grow larger than a few million, you could use a MEDIUMINT UNSIGNED and save yourself a byte per record over an INT, which might speed up searches a little. Still, 3 GB is an awful lot for just 350,000 rows.
It sounds like you might also get some bang for your buck by using the partitioning feature to split your table up into logical units. You might want to Google "mysql vertical partitioning" in particular; if there are large columns that you don't access frequently, it would be much more efficient to move them out into a separate table and only query it when you need it.
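A sketch of that vertical split for this table (table names are illustrative; the payoff comes when queries touch modified without needing the blob):
CREATE TABLE cache_meta (
    id       BIGINT UNSIGNED NOT NULL,
    modified TIMESTAMP NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
    PRIMARY KEY (id)
) ENGINE=InnoDB;

CREATE TABLE cache_data (
    id   BIGINT UNSIGNED NOT NULL,  -- same id as cache_meta
    data MEDIUMBLOB,
    PRIMARY KEY (id)
) ENGINE=InnoDB;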
Could you post your CREATE TABLE statement as well as the output of EXPLAIN SELECT data FROM table WHERE id=XXX? How is the I/O wait on the system?
My best guess is that you're I/O bound, and because the rows aren't all the same size, the engine has to search through the data. You have enough memory that it should be able to keep the data cached. This link describes some low-level profiling in MySQL that might be helpful:
http://dev.mysql.com/tech-resources/articles/using-new-query-profiler.html
Things I would look for:
When are the slow queries appearing?
Is it after a fresh start of the DB? Then this might be just a temporary problem - queries hitting a cold cache.
Is it during a DB dump or load? Then change your backup policies - use replication, for example, or add more disk IO capacity (more disks in RAID, switching to SSDs, repartitioning the system across multiple disks, etc.).
Is it during peak read/write times? Replication might also help here - write to the master and load-balance reads between the master and the slaves.
Also - is that mediumblob really necessary there?