I have a table containing log entries for a single week for about a thousand web servers. Each server writes about 60,000 entries per day to the table, so there are 420,000 entries per week for each server. The table is truncated weekly. Each log entry contains the servername, which is a varchar (this cannot be changed).
The main operation is to select * from table where servername = 'particular', so as to retrieve the 420,000 records for a server, and a C# program then analyzes the data from that server once selected.
Should I create a clustered index on the servername column to speed up the read operation? (It currently takes over half an hour to execute the above SQL statement.)
Would partitioning help? The computer has only two physical drives.
The query is run for each server once per week. After the query is run for all servers, the table is truncated.
The "standard" ideal clustered key is something like an INT IDENTITY that keeps increasing and is narrow.
However, if your primary use for this table is the listed query, then I think a clustered index on servername makes sense. You will see a big increase in speed if the table is wide, since you eliminate the expensive key/bookmark lookup that a SELECT * incurs when it goes through a nonclustered index (unless that index includes every column in the table).
EDIT:
KM pointed out that this will slow down inserts, which is true. For this scenario you may want to consider a two-column key on (servername, idfield), where idfield is an INT IDENTITY. This still allows seeks on servername alone in your query, but new records are inserted at the end per server, so you will still see some fragmentation and reordering.
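As a rough sketch, that composite clustered index could look like the statement below; the table and column names (dbo.WebLog, servername, idfield) are placeholders for your actual schema:
CREATE CLUSTERED INDEX IX_WebLog_Server
    ON dbo.WebLog (servername, idfield);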
based on:
The query is run for each server once per week. After the query is run
for all servers, the table is truncated.
and
for about a thousand web servers
I'd change the C# program to just run a single query, one time:
select * from table Order By servername,CreateDate
and have it handle "breaking" when the server name changes.
One table scan is better than 1,000. I would not slow down the main application's INSERTs into a log table (with a clustered index) just so your once-a-week queries run faster.
Yes, it would be a good idea to create a clustered index on the servername column, since right now the database has to do a full table scan to find the records that satisfy servername = 'particular'.
Horizontally partitioning the table by date would also help, so that at any one time the database only has to deal with one day's data for all servers.
Then make sure that you fire date-based queries:
SELECT * FROM table
WHERE date BETWEEN '20110801' AND '20110808'
AND servername = 'particular'
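For what it's worth, a rough T-SQL sketch of daily partitioning (all names here are made up, the boundary dates match the example week, and with only two physical drives everything can stay on one filegroup):
-- one partition per day of the week being loaded
CREATE PARTITION FUNCTION pfLogDay (datetime)
    AS RANGE RIGHT FOR VALUES ('20110802', '20110803', '20110804',
                               '20110805', '20110806', '20110807', '20110808');

CREATE PARTITION SCHEME psLogDay
    AS PARTITION pfLogDay ALL TO ([PRIMARY]);

-- the partitioning column must be part of the clustered index key
CREATE CLUSTERED INDEX IX_Log_Server_Date
    ON dbo.LogTable (servername, [date])
    ON psLogDay ([date]);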
I have a table with just under 50 million rows. It hit the limit for INT (2147483647). At the moment the table is not being written to.
I am planning on changing the ID column from INT to BIGINT. I am using a Rails migration to do this with the following migration:
def up
  execute('ALTER TABLE table_name MODIFY COLUMN id BIGINT(8) NOT NULL AUTO_INCREMENT')
end
I have tested this locally on a dataset of 2,000 rows and it worked OK. Running the ALTER TABLE command across the 50 million rows should be OK, since the table is not being used at the moment?
I wanted to check before I run the migration. Any input would be appreciated, thanks!
We had exactly the same scenario, but with PostgreSQL, and I know how 50M rows can use up the whole range of INT: it's the gaps in the ids, generated by deleting rows over time, incomplete transactions, and other factors.
I will explain what we ended up doing, but first, seriously, testing a data migration for 50M rows on 2k rows is not a good test.
There can be multiple solutions to this problem, depending on factors like which DB provider you are using. We were using Amazon RDS, which has limits on runtime and on what they call IOPS (input/output operations per second). If you run such an intensive query on a DB with those limits, it will exhaust its IOPS quota midway through, and when the IOPS quota runs out, the DB ends up being too slow and pretty much useless. We had to cancel our query and let the IOPS catch up, which takes about 30 minutes to an hour.
If you have no such restrictions and have your DB on premises or something like that, then there is another factor: can you afford downtime?
If you can afford downtime and have no IOPS-type restriction on your DB, you can run this query directly. It will take a lot of time (maybe half an hour or so, depending on many factors), and in the meantime the table will be locked as rows are being changed, so make sure not only that the table gets no writes, but also no reads during the process, so that it runs to the end smoothly without any deadlock-type situation.
What we did to avoid downtime and the Amazon RDS IOPS limits:
In our case, we still had about 40M ids left in the table when we realized the range was going to run out, and we wanted to avoid downtime. So we took a multi-step approach:
Create a new BIGINT column, name it new_id or something (give it a unique index from the start); it will be nullable with a default of NULL.
Write background jobs that run a few times each night and backfill the new_id column from the id column. We were backfilling about 4-5M rows each night, and a lot more over weekends (as our app had no traffic on weekends).
Once the backfill has caught up, stop all access to the table (we just took down our app for a few minutes at night), then create a new sequence starting from the max(new_id) value, or reuse the existing sequence and bind it to the new_id column with a default of nextval of that sequence (see the sketch after this list).
Now switch the primary key from id to new_id; before that, make new_id NOT NULL.
Delete id column.
Rename new_id to id.
And resume your DB operations.
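In PostgreSQL terms (which is what we were on), the steps above look roughly like the sketch below; every table, column, index, constraint, and sequence name is made up, and the batch size is just an example:
-- step 1: new nullable BIGINT column with a unique index from the start
ALTER TABLE events ADD COLUMN new_id bigint;
CREATE UNIQUE INDEX CONCURRENTLY events_new_id_idx ON events (new_id);

-- step 2: backfill in batches from a background job
UPDATE events SET new_id = id WHERE id BETWEEN 1 AND 100000 AND new_id IS NULL;

-- steps 3-6: during the short maintenance window
BEGIN;
CREATE SEQUENCE events_new_id_seq;
SELECT setval('events_new_id_seq', (SELECT max(new_id) FROM events));
ALTER TABLE events ALTER COLUMN new_id SET DEFAULT nextval('events_new_id_seq');
ALTER TABLE events ALTER COLUMN new_id SET NOT NULL;   -- scans the table to validate
ALTER TABLE events DROP CONSTRAINT events_pkey;
ALTER TABLE events ADD CONSTRAINT events_pkey PRIMARY KEY USING INDEX events_new_id_idx;
ALTER TABLE events DROP COLUMN id;
ALTER TABLE events RENAME COLUMN new_id TO id;
COMMIT;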
The above is a minimal write-up of what we did; you can google some nice articles about it, one is this. This approach is not new and is pretty common, so I am sure you will find MySQL-specific ones too, or you can just adjust a couple of things in the article above and you should be good to go.
I have this query
SELECT id, alias, parent FROM `content`
Is there a way to optimize this query so that 'type' is something other than 'ALL'?
id - primary, unique
id - index
parent - index
alias - index
....
Note that this query will almost never return more than 1500 rows.
Thank you
Your query is fetching all the rows, so by definition it's going to report "ALL" as the query type in the EXPLAIN report. The only other possibility is the "index" query type, an index-scan that visits every entry in the index. But that's virtually the same cost as a table-scan.
There's a saying that the fastest SQL query is one that you don't run at all, because you get the data some other way.
For example, if the data is in a cache of some type. If your data has no more than 1500 rows, and it doesn't change frequently, it may be a good candidate for putting in memory. Then you run the SQL query only if the cached data is missing.
There are a couple of common options:
The MySQL query cache is an in-memory cache maintained in the MySQL server, and purged automatically when the data in the table changes.
Memcached is a popular in-memory key-value store used frequently by applications that also use MySQL. It's very fast. Another option is Redis, which is similar but is also backed by disk storage.
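To see whether the query cache is even available and sized usefully (assuming MySQL 5.x, where the query cache still exists), a quick check looks like:
SHOW VARIABLES LIKE 'query_cache%';   -- enabled? how big?
SHOW STATUS LIKE 'Qcache%';           -- hit, insert and prune counters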
Turn OFF log_queries_not_using_indexes; it clutters the slow log with red herrings like what you got.
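That can be flipped at runtime (or set in my.cnf), e.g.:
SET GLOBAL log_queries_not_using_indexes = OFF;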
0.00XX seconds -- good enough not to worry.
ALL is actually optimal for fetching multiple columns from 'all' rows of a table.
We have a large MySQL 5.5 database in which many rows are inserted daily and never deleted or updated. There are also users querying the live database. Tables are MyISAM.
But it is effectively impossible to run ANALYZE TABLE because it takes way too long (15 hours, and it sometimes crashes the tables), so the query optimizer will often pick the wrong index.
We want to try switching everything to InnoDB. Will we need to run ANALYZE TABLE or not?
The MySQL docs say:
The cardinality (the number of different key values) in every index of a table
is calculated when a table is opened, at SHOW TABLE STATUS and ANALYZE TABLE and
on other circumstances (like when the table has changed too much).
But that raises the question: when is a table opened? If that means whenever it is accessed during a connection, then we need to do nothing special. But I do not think that is the case for InnoDB.
So what is the best approach? Run ANALYZE TABLE periodically? Perhaps with an increased dive count?
Or will it all happen automatically?
The querying users use apps to get the data, so each run is a separate connection. They generally do NOT expect the rows to be up to date to within just a few minutes.
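For concreteness, the kind of thing I mean by an increased dive count (MySQL 5.5 InnoDB variable names; the values are guesses to be tuned, and big_table is a placeholder):
SET GLOBAL innodb_stats_sample_pages = 64;   -- default is 8 index pages per random dive
SET GLOBAL innodb_stats_on_metadata = OFF;   -- don't re-sample on every SHOW TABLE STATUS
ANALYZE TABLE big_table;                     -- on InnoDB this only re-samples, so it is quick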
I am using MySQL 5.1 on Windows Server 2008. When I execute the query below:
SELECT * FROM tablename;
It takes too much time to fetch all the results in that table. This query is listed in the slow query log too, even though the table has a primary key as well as a few more indexes.
I executed the query below to check the execution plan:
explain extended select * from tablename;
I found the following information:
id=1
select_type=SIMPLE
table=tablename
possible_keys=null
key=null
key_len=null
ref=null
rows=85151
Extra=blank
I thought that the query should use at least the primary key by default. Again, I executed the query below and found that the filtered column has the value 100.0:
explain extended select * from tablename;
Is there any specific reason why the query is not using a key?
You are selecting all rows from the table. This is why the whole table (all rows) needs to be scanned.
A key (or index) is only used if you narrow your search (using where). An index is used in that case to pre-select the rows which you want to have without having to actually scan the whole table for the given criteria.
If you don't need to access all the rows at once, try limiting the returned rows using LIMIT.
SELECT * FROM tablename LIMIT 100;
If you want the next 100 rows, use
SELECT * FROM tablename LIMIT 100,100;
and so on.
Other than that approach (referred to as "paging"), there is not much you can do to speed up this query (other than getting a faster machine, more RAM, a faster disk, or a better network if the DBMS is accessed remotely).
If you need to do some processing, consider moving logic (such as filtering) to the DBMS. This can be achieved using the WHERE portion of a query.
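For example (hypothetical column names, assuming a matching index exists), a selective filter lets the optimizer use an index instead of reading everything:
SELECT * FROM tablename
WHERE created_at >= '2013-01-01'
  AND status = 'active';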
Why would it use a key when there is no filter and no ordering? In this single-table query there is no approach where a table scan is not going to be at least as fast.
To solve your performance issue, perhaps you have client-side processing that could be pushed to the server (after all, you're not really showing 85,151 rows to the end user at once, are you?) - or get a faster disk...
I have a site on a shared hosting plan with a MySQL database. The database has a table with ~300,000 rows. The table is ~250 MB. On every page I run this query:
select * from table order by added limit 0,30
Every row has a field with a 400-character code which I need. Basically I need all the fields.
Until a few days ago everything was OK, just slow, with 500 visitors/day. Now my site is down because I got an alert about CPU abuse (with 1,000 visitors/day). On my local server everything runs fine with no big CPU usage (~10%).
What can I do to get the best performance from my queries?
If I move to a VPS plan, will everything be OK, or is the real problem my table?
Is "added" field indexed? MySQL will scan whole table every time if the table is not indexed by the field in ORDER BY. By indexing your table
ALTER TABLE `table_name` ADD INDEX `added` (`added`);
you'll dramatically reduce CPU usage.
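With the index in place, MySQL can read the rows already sorted by added and stop after 30, so re-checking the plan should no longer show "Using filesort":
EXPLAIN SELECT * FROM `table_name` ORDER BY added LIMIT 0,30;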