I have a big table with many many tuples, however I perform some queries just in a small range of this (extracted by date range), I would like to know how to set up the table struture to optimize the db work on date range. May I add startdate and enddate fields as index? Can it helps?
Related
Let's say I have a table with 10 billion rows. Each row has a datetime column to signify when the row was added to the table. When querying against this table, 99% of the time the query will be based on a date range. Now my question:
Would it be faster to query based on the date time (e.g. where date between), or to query based on a range of the primary key (rowid)? In the second case I could have a table containing ID range values for each date so it would be simple to know what ID range to query for. Regardless of the implementation, would a primary key query like this be faster than the date range query? And if so, would it be significantly faster/take significantly less resources?
Thanks for any help. I'll be running some tests myself as well, just want to get some outside feedback on the matter to avoid confirmation bias.
I am working on designing a large database. In my application I will have many rows for example I currently have one table with 4 million records. Most of my queries use datetime clause to select data. Is it a good idea to index datetime fields in mysql database?
Select field1, field2,.....,field15
from table where field 20 between now() and now + 30 days
I am trying to keep my database working good and queries being run smoothly
More, what idea do you think I should have to create a high efficiency database?
MySQL recommends using indexes for a variety of reasons including elimination of rows between conditions: http://dev.mysql.com/doc/refman/5.0/en/mysql-indexes.html
This makes your datetime column an excellent candidate for an index if you are going to be using it in conditions frequently in queries. If your only condition is BETWEEN NOW() AND DATE_ADD(NOW(), INTERVAL 30 DAY) and you have no other index in the condition, MySQL will have to do a full table scan on every query. I'm not sure how many rows are generated in 30 days, but as long as it's less than about 1/3 of the total rows it will be more efficient to use an index on the column.
Your question about creating an efficient database is very broad. I'd say to just make sure that it's normalized and all appropriate columns are indexed (i.e. ones used in joins and where clauses).
Here author performed tests showed that integer unix timestamp is better than DateTime. Note, he used MySql. But I feel no matter what DB engine you use comparing integers are slightly faster than comparing dates so int index is better than DateTime index. Take T1 - time of comparing 2 dates, T2 - time of comparing 2 integers. Search on indexed field takes approximately O(log(rows)) time because index based on some balanced tree - it may be different for different DB engines but anyway Log(rows) is common estimation. (if you not use bitmask or r-tree based index). So difference is (T2-T1)*Log(rows) - may play role if you perform your query oftenly.
I know this question was asked years ago, but I just found my solution.
I added an index to a datetime column.
I went from 1.6 seconds of calling the last 600 records sorted by datetime, and after adding the index, it came down to 0.0028 seconds. I'd say it's a win.
ALTER TABLE `database`.`table`
ADD INDEX `name_of_index` (`datetime_field_from_table`);
I have a large table containing hourly statistical data broken down across a number of dimensions. It's now large enough that I need to start aggregating the data to make queries faster. The table looks something like:
customer INT
campaign INT
start_time TIMESTAMP
end_time TIMESTAMP
time_period ENUM(hour, day, week)
clicks INT
I was thinking that I could, for example, insert a row into the table where campaign is null, and the clicks value would be the sum of all clicks for that customer and time period. Similarly, I could set the time period to "day" and this would be the sum of all of the hours in that day.
I'm sure this is a fairly common thing to do, so I'm wondering what the best way to achieve this in MySql? I'm assuming an INSERT INTO combined with a SELECT statement (like with a materialized view) - however since new data is constantly being added to this table, how do I avoid re-calculating aggregate data that I've previously calculated?
I done something similar and here is the problems I have deal with:
You can use round(start_time/86400)*86400 in "group by" part to get summary of all entries from same day. (For week is almost the same)
The SQL will look like:
insert into the_table
( select
customer,
NULL,
round(start_time/86400)*86400,
round(start_time/86400)*86400 + 86400,
'day',
sum(clicks)
from the_table
where time_period = 'hour' and start_time between <A> and <B>
group by customer, round(start_time/86400)*86400 ) as tbl;
delete from the_table
where time_period = 'hour' and start_time between <A> and <B>;
If you going to insert summary from same table to itself - you will use temp (Which mean you copy part of data from the table aside, than it dropped - for each transaction). So you must be very careful with the indexes and size of data returned by inner select.
When you constantly inserting and deleting rows - you will get fragmentation issues sooner or later. It will slow you down dramatically. The solutions is to use partitioning & to drop old partitions from time to time. Or you can run "optimize table" statement, but it will stop you work for relatively long time (may be minutes).
To avoid mess with duplicate data - you may want to clone the table for each time aggregation period (hour_table, day_table, ...)
If you're trying to make the table smaller, you'll be deleting the detailed rows after you make the summary row, right? Transactions are your friend. Start one, compute the rollup, insert the rollup, delete the detailed rows, end the transaction.
If you happen to add more rows for an older time period (who does that??), you can run the rollup again - it will combine your previous rollup entry with your extra data into a new, more powerful, rollup entry.
I have a table with 5 million rows, and I want to get only rows that have the field date between two dates (date1 and date2). I tried to do
select column from table where date > date1 and date < date2
but the processing time is really big. Is there a smarter way to do this? Maybe access directly a row and make the query only after that row? My point is, is there a way to discard a large part of my table that does not match to the date period? Or I have to read row by row and compare the dates?
Usually you apply some kind of condition before retrieving the results. If you don't have anything to filter on you might want to use LIMIT and OFFSET:
SELECT * FROM table_name WHERE date BETWEEN ? AND ? LIMIT 1000 OFFSET 1000
Generally you will LIMIT to whatever amount of records you'd like to show on a particular page.
You can try/do a couple of things:
1.) If you don't already have one, index your date column
2.) Range partition your table on the date field
When you partition a table, the query optimizer can eliminate partitions that are not able to satisfy the query without actually processing any data.
For example, lets say you partitioned your table by the date field monthly and that you had 6 months of data in the table. If you query for a date between range of a week in OCT-2012, the query optimizer can throw out 5 of the 6 partitions and only scan the partition that has records in the month of OCT in 2012.
For more details, check the MySQL Partitioning page. It gives you all the necessary information and gives a more through example of what I described above in the "Partition Pruning" section.
Note, I would recommend creating/cloning your table in a new partitioned table and do the query in order to test the results and whether it satisfies your requirements. If you haven't already indexed the date column, that should be your first step, test, and if need be check out partitioning.
I am trying to find out how to design the indexes for my data when my query is using ranges for 2 fields.
expenses_tbl:
idx date category amount
auto-inc INT TINYINT DECIMAL(7,2)
PK
The column category defines the type of expense. Like, entertainment, clothes, education, etc. The other columns are obvious.
One of my query on this table is to find all those instances where for a given date range, the expense has been more than $50. This query will look like:
SELECT date, category, amount
FROM expenses_tbl
WHERE date > 120101 AND date < 120811
AND amount > 50.00;
How do I design the index/secondary index on this table for this particular query.
Assumption: The table is very large (It's not currently, but that gives me a scope to learn).
MySQL generally doesn't support ranges on multiple parts of a compound index. Either it will use the index for the date, or an index for the amount, but not both. It might do an index merge if you had two indexes, one on each, but I'm not sure.
I'd check the EXPLAIN before and after adding these indexes:
CREATE INDEX date_idx ON expenses_tbl (date);
CREATE INDEX amount_idx ON expenses_tbl (amount);
Compound index ranges - http://dev.mysql.com/doc/refman/5.5/en/range-access-multi-part.html
Index Merge - http://dev.mysql.com/doc/refman/5.0/en/index-merge-optimization.html
A couple more points that have not been mentioned yet:
The order of the columns in the index can make a difference. You may want to try both of these indexes:
(date, amount)
(amount, date)
Which to pick? Generally you want the most selective condition be the first column in the index.
If your date ranges are large but few expenses are over $50 then you want amount first in the index.
If you have narrow date ranges and most of the expenses are over $50 then you should put date first.
If both indexes are present then MySQL will choose the index with the lowest estimated cost.
You can try adding both indexes and then look at the output of EXPLAIN SELECT ... to see which index MySQL chooses for your query.
You may also want to consider a covering index. By including the column category in the index (as the last column) it means that all the data required for your query is available in the index, so MySQL does not need to look at the base table at all to get the results for your query.
The general answer to your question is that you want a composite index, with two keys. The first being date and the second being the amount.
Note that this index will work for queries with restrictions on the date or on the date and on the expense. It will not work for queries with restrictions on the expense only. If you have both types, you might want a second index on expense.
If the table is really, really large, then you might want to partition it by date and build indexes on expense within each partition.