I have a bunch of data ordered by date, and each table holds only one month of data. (The reason for this is to cut down query time; I'm talking about millions of rows in each monthly table.)
For example:
data_01_2010 holds data from 2010-01-01 to 2010-01-31
data_02_2010 holds data from 2010-02-01 to 2010-02-28
Sometimes I have to query these tables over a specific date range. If the range spans multiple months, for example 2010-01-01 to 2010-02-28, then I need to query both tables.
Can this be achieved with a single query?
Like for example:
SELECT *
FROM data_01_2010, data_02_2010
WHERE date BETWEEN '2010-01-01' AND '2010-02-28'
The problem with the above query is that it fails with an error saying the column date is ambiguous, which it is, because the column is present in both tables. (The tables have the same structure.)
So is this achievable with a single query or do I have to query it for each table separately?
This is a perfect example of why partitioning is so powerful. Partitioning allows you to logically store all of your records in the same table without sacrificing query performance.
In this example, you would have one table called data (hopefully you would name it better than this) and range partition it based on the value of the date column (again, hopefully you would name this column better). This means you could meet your requirement with a simple SELECT:
SELECT * FROM data WHERE date BETWEEN '2010-01-01' AND '2010-02-28';
Under the covers, the database will only access the partitions required by the WHERE clause.
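For illustration, here is a minimal sketch of what that table might look like; the column list and partition boundaries are assumptions, and RANGE COLUMNS requires MySQL 5.5+:
CREATE TABLE data (
    id   INT NOT NULL,
    date DATE NOT NULL,
    -- ... the rest of your columns ...
    PRIMARY KEY (id, date)  -- every unique key must include the partitioning column
)
PARTITION BY RANGE COLUMNS (date) (
    PARTITION p201001 VALUES LESS THAN ('2010-02-01'),
    PARTITION p201002 VALUES LESS THAN ('2010-03-01'),
    PARTITION pmax    VALUES LESS THAN (MAXVALUE)
);
With this in place, the BETWEEN query above touches only partitions p201001 and p201002.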
Reference:
http://dev.mysql.com/doc/refman/5.5/en/partitioning.html
If you do the logic elsewhere to figure out what tables you need to query, and each month has the same schema, you could just union the tables:
SELECT *
FROM data_01_2010
WHERE date BETWEEN '2010-01-01' AND '2010-02-28'
UNION ALL
SELECT *
FROM data_02_2010
WHERE date BETWEEN '2010-01-01' AND '2010-02-28'
But if you need the query itself to work out which tables to union, you are venturing into the realm of stored procedures and dynamic SQL.
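For what it's worth, a minimal sketch of that route using MySQL prepared statements; your own logic would have to assemble the table list, and only the table names here come from the question:
SET @start = '2010-01-01', @end = '2010-02-28';
-- Build the UNION from whatever tables your logic selected:
SET @sql = CONCAT(
    'SELECT * FROM data_01_2010 WHERE date BETWEEN ? AND ? ',
    'UNION ALL ',
    'SELECT * FROM data_02_2010 WHERE date BETWEEN ? AND ?');
PREPARE stmt FROM @sql;
EXECUTE stmt USING @start, @end, @start, @end;
DEALLOCATE PREPARE stmt;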
I have a bunch of financial stock data in a MySQL table. The data is stored in a one-minute-tick-per-row format (OHLC). From that data I'd like to create 30-minute, hourly, and daily aggregates. The problem is that the table is enormous, and grouping by date functions on the timestamp column yields horrible performance.
For example, the following query produces the right result but ends up taking too long.
SELECT market, max(timestamp) AS TS
FROM tbl_data
GROUP BY market, DATE(timestamp), HOUR(timestamp)
ORDER BY market, TS ASC
The table has a primary key on the (market, timestamp) columns, and I have also added a separate index on the timestamp column. However, that is not of much help, as the use of date/hour functions means a table scan regardless.
How can I improve the performance? Perhaps I should consider a different database than MySQL that provides specialized date/time indexes? If so, what would be a good option?
One thing to note is that it would suffice if I could get the LAST row of each hour/day/timeframe. The database has tens of millions of rows.
MySQL version: 5.7
Thanks in advance for the help.
Edit: Here is what EXPLAIN shows on a smaller DB of the exact same format: [EXPLAIN output was attached as an image and is not reproduced here]
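One workaround often suggested for this kind of query is to materialize the hour into an indexed column, so the GROUP BY no longer evaluates DATE()/HOUR() against every row. A sketch, assuming the timestamp column is a DATETIME and MySQL 5.7.6+ is available for generated columns; the hour_bucket and ix_market_hour names are made up:
-- Store the hour each row falls into, and index it together with market:
ALTER TABLE tbl_data
  ADD COLUMN hour_bucket DATETIME
    GENERATED ALWAYS AS (DATE_FORMAT(timestamp, '%Y-%m-%d %H:00:00')) STORED,
  ADD INDEX ix_market_hour (market, hour_bucket);

-- The aggregate can then group on the plain indexed column:
SELECT market, MAX(timestamp) AS TS
FROM tbl_data
GROUP BY market, hour_bucket
ORDER BY market, TS ASC;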
I have a big database with about 3 million records, each containing a timestamp.
Now I want to select one record per month, and it works using this query:
SELECT timestamp, id, gas_used, kwh_used1, kwh_used2 FROM energy
GROUP BY MONTH(timestamp) ORDER BY timestamp ASC
It works but it is very slow.
I have indexes on id and on timestamp.
What can I do to make this query fast?
GROUP BY MONTH(timestamp) forces the engine to look at each record individually, i.e. a sequential scan, which is obviously very slow when you have millions of records.
A common solution is to add an indexed column holding just the criterion you will want to select on. However, I highly suspect that you will actually want to select on Year-Month rather than month alone, if your db is not reset every year.
To avoid data-corruption issues, it may be best to create an insert trigger that automatically fills that column. That way the extra column doesn't interfere with your business logic.
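A minimal sketch of that suggestion; the yearmonth column, index, and trigger names are assumptions:
ALTER TABLE energy
  ADD COLUMN yearmonth CHAR(7),
  ADD INDEX ix_yearmonth (yearmonth);

-- Keep the column in sync on every insert:
CREATE TRIGGER energy_set_yearmonth
BEFORE INSERT ON energy
FOR EACH ROW
SET NEW.yearmonth = DATE_FORMAT(NEW.timestamp, '%Y-%m');

-- Backfill the existing rows once:
UPDATE energy SET yearmonth = DATE_FORMAT(timestamp, '%Y-%m');
The slow GROUP BY MONTH(timestamp) then becomes GROUP BY yearmonth, which can use the index.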
It is not good practice to SELECT columns that don't appear in the GROUP BY clause, unless they are wrapped in an aggregate function such as MIN(), MAX(), SUM(), etc.
In your query this applies to columns:
id, gas_used, kwh_used1, kwh_used2
You will not get the "earliest" (by timestamp) row for each month in this case.
More:
https://dev.mysql.com/doc/refman/5.7/en/group-by-handling.html
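If you actually need a well-defined row per month, for example the earliest by timestamp, one common pattern is a self-join against the per-month minimum. A sketch, assuming timestamps are unique within the table:
SELECT e.timestamp, e.id, e.gas_used, e.kwh_used1, e.kwh_used2
FROM energy e
JOIN (
    -- one minimum timestamp per calendar month
    SELECT MIN(timestamp) AS ts
    FROM energy
    GROUP BY YEAR(timestamp), MONTH(timestamp)
) m ON e.timestamp = m.ts
ORDER BY e.timestamp ASC;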
My MySQL database table has multiple entries with the following structure:
id, title, date, time
There are presently 30 entries in the table and some of those share a common date.
What I'm trying to accomplish is retrieving the database data in such a way that it is grouped under common dates. So, all entries that share the same date will be grouped in an array indexed by that common date.
In another post, I learnt that INDEX BY is great for what I'm trying to achieve, but it works only (or best) on unique fields.
So I am just curious whether there is anything else that could help efficiently group my database entries.
SELECT date, GROUP_CONCAT(title)
FROM tbl
GROUP BY date
ORDER BY date;
Don't worry about performance until you have thousands of rows.
I need to calculate the number of rows created on a daily basis for a huge table in MySQL. I'm currently using:
SELECT COUNT(1) FROM table_name GROUP BY Date
The query has been running for more than 2000 seconds and counting. I was wondering if there's an optimized query or a way to optimize my query.
If you're only interested in items that were created on those dates, you could calculate the count at end of day and store it in another table.
That lets you run the COUNT query on a much smaller data set (use WHERE Date = DATE(NOW()) and drop the GROUP BY).
Then query the new table when you need the data.
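A sketch of that idea; the daily_counts table and its columns are made up:
CREATE TABLE daily_counts (
    day       DATE PRIMARY KEY,
    row_count INT NOT NULL
);

-- Run once at end of day, e.g. from a cron job or MySQL event:
INSERT INTO daily_counts (day, row_count)
SELECT DATE(NOW()), COUNT(1)
FROM table_name
WHERE Date = DATE(NOW());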
Make sure that the "date" field is of type DATE, not DATETIME or TIMESTAMP.
Index that column.
If you need it for one day, add a WHERE clause, e.g. WHERE date = '2013-07-10'.
Add an index on the Date column; there's no other way to optimize this query that I can think of.
CREATE INDEX ix_date
ON table_name (Date);
Let's say we have 5 tables:
Fact_2011
Fact_2010
Fact_2009
Fact_2008
Fact_2007
each of which stores only transactions for the year indicated by the suffix of the table's name.
We then create a separate index over each of these tables with the column "Year" as the first column of the index.
Lastly, we create a view, vwFact, which is the union of all of the tables:
SELECT * FROM Fact_2011
UNION
SELECT * FROM Fact_2010
UNION
SELECT * FROM Fact_2009
UNION
SELECT * FROM Fact_2008
UNION
SELECT * FROM Fact_2007
and then perform queries like this:
SELECT * FROM vwFact WHERE YEAR = 2010
or in less likely situations,
SELECT * FROM vwFact WHERE YEAR > 2010
How efficient would these queries be compared to actually partitioning the data by Year, or is it essentially the same? Is having an index by Year over each of these pseudo-partitioned tables enough to prevent the SQL engine from wasting more than a trivial amount of time determining that a physical table containing only records outside the sought date range is not worth scanning? Or is this pseudo-partitioning approach exactly what MS partitioning (by year) is doing?
It seems to me that if the query executed is
SELECT Col1Of200 FROM vwFact WHERE YEAR = 2010
then real partitioning would have a distinct advantage, because the pseudo-partitioning first has to execute the view to pull back all of the columns from the Fact_2010 table and then filter down to the one column that the end user is selecting, while with MSSQL partitioning it would be a more direct up-front selection of only the sought column's data.
Comments?
I have implemented partitioned views on SQL Server 2000 with great success.
Make sure that you have a check constraint on each table that restricts the year column to that table's year. So on the Fact_2010 table it would be CHECK (Year = 2010).
Then also make the view use UNION ALL, not just UNION.
Now when you query the view for one year, it should access just one table; you can verify this with the execution plan.
If you don't have the check constraints in place, it will touch all the tables that are part of the view.
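A sketch of the setup in T-SQL; the constraint names are made up, and you would repeat the constraint for each year table:
ALTER TABLE Fact_2010 ADD CONSTRAINT CK_Fact_2010_Year CHECK ([Year] = 2010);
ALTER TABLE Fact_2011 ADD CONSTRAINT CK_Fact_2011_Year CHECK ([Year] = 2011);
-- ...and so on for 2007 through 2009...

CREATE VIEW vwFact AS
SELECT * FROM Fact_2011
UNION ALL
SELECT * FROM Fact_2010
UNION ALL
SELECT * FROM Fact_2009
UNION ALL
SELECT * FROM Fact_2008
UNION ALL
SELECT * FROM Fact_2007;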
"that real partitioning would have a distinct advantage, because the pseudo partitioning first has to execute the view to pull back all of the columns from the Fact_2010 table and then filter down to the one column that the end user is selecting"
If you have the constraints in place, the optimizer is smart enough to go only to the tables you need.