Let's say we have five tables:
Fact_2011
Fact_2010
Fact_2009
Fact_2008
Fact_2007
each of which stores only transactions for the year indicated by the suffix of the table's name.
We then create a separate index on each of these tables, with the "Year" column as the leading column of the index.
Lastly, we create a view, vwFact, which is the union of all of the tables:
SELECT * FROM Fact_2011
UNION
SELECT * FROM Fact_2010
UNION
SELECT * FROM Fact_2009
UNION
SELECT * FROM Fact_2008
UNION
SELECT * FROM Fact_2007
and then run queries like this:
SELECT * FROM vwFact WHERE YEAR = 2010
or in less likely situations,
SELECT * FROM vwFact WHERE YEAR > 2010
How efficient would these queries be compared to actually partitioning the data by Year, or is it essentially the same? Is an index with Year as its leading column on each of these pseudo-partitioned tables enough to keep the SQL engine from wasting more than a trivial amount of time ruling out a physical table whose records fall outside the sought date range? Or is this pseudo-partitioning approach exactly what MS partitioning (by year) is doing?
It seems to me that if the query executed is
SELECT Col1Of200 FROM vwFact WHERE YEAR = 2010
then real partitioning would have a distinct advantage: the pseudo-partitioning first has to execute the view to pull back all of the columns from the Fact_2010 table and then filter down to the one column the end user is selecting, while MSSQL partitioning would select only the sought column's data up front.
Comments?
I have implemented partitioned views on SQL Server 2000 with great success.
Make sure that you have a check constraint on each table that restricts the Year column to that table's year; on the Fact_2010 table it would be CHECK (Year = 2010).
Then make the view use UNION ALL, not just UNION.
Now, when you query the view for one year, it should access just one table; you can verify this with the execution plan. If you don't have the check constraints in place, it will touch every table that is part of the view.
"that real partitioning would have a distinct advantage, because the pseudo partitioning first has to execute the view to pull back all of the columns from the Fact_2010 table and then filter down to the one column that the end user is selecting"
If you have the constraints in place, the optimizer is smart enough to go to just the tables you need.
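A minimal sketch of that setup, using the table names from the question (the constraint name is an assumption; repeat the constraint for each year's table):

ALTER TABLE Fact_2010
    ADD CONSTRAINT CK_Fact_2010_Year CHECK (Year = 2010);

CREATE VIEW vwFact AS
SELECT * FROM Fact_2011 UNION ALL
SELECT * FROM Fact_2010 UNION ALL
SELECT * FROM Fact_2009 UNION ALL
SELECT * FROM Fact_2008 UNION ALL
SELECT * FROM Fact_2007;

With the constraints in place, SELECT * FROM vwFact WHERE Year = 2010 should show only Fact_2010 in the execution plan.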
I am having issues with MySQL query optimization.
The situation is as follows.
There is a SQL table with over 200,000 rows and many columns.
I am building filter options in the frontend for this data.
For instance, take two columns, "Year" and "Make".
The Year column holds many values like 2021, 2022, 2019, and 2010, while Make holds values like "Ford", "Chevrolet", and so on.
Example Link:
https://www.autobidmaster.com/en/carfinder-online-auto-auctions/?make=Chevrolet
These values are not unique within each column, and I want to build filter options (each unique value with its count, per column) based on the unique values in these two columns.
I thought I could group the data by the unique values in each column and merge the results using UNION ALL in a single query.
For instance, for the two columns Year and Make:
$sql1 = "
(SELECT 'Make' AS filter_option_name, Make AS filter_options_key_name, COUNT(*) AS filter_option_count
 FROM dbcopart.wprdb_copartdata " . $where_str . "
 GROUP BY filter_options_key_name
 ORDER BY filter_options_key_name)
UNION ALL
(SELECT 'Year' AS filter_option_name, Year AS filter_options_key_name, COUNT(*) AS filter_option_count
 FROM dbcopart.wprdb_copartdata " . $where_str . "
 GROUP BY filter_options_key_name
 ORDER BY filter_options_key_name)";
With two columns it was okay; it worked properly.
But there are more columns: over 20 columns to be used as filter options.
Twenty UNION ALLs over 200,000 rows was slow.
How can I improve my SQL query?
I think there should be a more effective way than my multiple UNION ALLs.
Thanks for your attention.
Your UNION ALL is probably optimal for gathering all 20 sets of counts at one time. But consider running it once an hour and storing the result in another table, then fetching from that table. (The data will be a little stale, but perhaps good enough for the use case.)
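A hedged sketch of that caching approach (the summary table name and column types are assumptions; the source table is the one from the question):

CREATE TABLE filter_option_counts (
    filter_option_name      VARCHAR(64)  NOT NULL,
    filter_options_key_name VARCHAR(255) NOT NULL,
    filter_option_count     INT          NOT NULL,
    PRIMARY KEY (filter_option_name, filter_options_key_name)
);

-- Refresh hourly from cron or a MySQL EVENT:
TRUNCATE TABLE filter_option_counts;
INSERT INTO filter_option_counts
SELECT 'Make', Make, COUNT(*) FROM dbcopart.wprdb_copartdata GROUP BY Make
UNION ALL
SELECT 'Year', Year, COUNT(*) FROM dbcopart.wprdb_copartdata GROUP BY Year;
-- ...one SELECT per filter column, as in the original query.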
Yes, once they picked 'Lamborghini', you will have to go back to the table to get the revised values for all the counts (minus make). If there is an index starting with make, then this second big UNION will be faster than the first.
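For example (a hedged sketch; the index name is an assumption):

ALTER TABLE dbcopart.wprdb_copartdata ADD INDEX idx_make (Make);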
Two layers might be worth caching; more than that will take a lot of space for minimal benefit.
Consider keeping the entire dataset in memory and using app code to do the necessary counts; it will probably be faster than using SQL. (But a lot more code.)
I have a large table with hundreds of thousands of rows. However, only about 50,000 rows are actually "active" and part of my queries, because I only select the rows that have been updated in the last 14 days, with WHERE crdate > "2014-08-10". So, to speed up queries against this table, I'm weighing the following options (or maybe you have another suggestion?):
I can delete all old entries and insert them into a "history" table with a cronjob running every day or week. However, this will still leave the history table slow if I want to run queries against it.
I can put an index on my "crdate" column. However, my dates are in the format "2014-08-10 06:32:59", so I guess that, because it stores so many distinct values, the index will be quite large (?) and potentially slow (?).
Do you have any other suggestions for speeding up queries against this table? Is it a bad idea to put an index on a date column that has so many distinct values?
First rule of databases: always have indexes on the columns you are filtering on.
So yes, put an index on crdate.
You can also keep a history table in parallel, but make sure you put an index on the crdate column in the history table too. Having the history table will allow you to keep a smaller index on the main table.
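A minimal sketch (the table names are assumptions):

ALTER TABLE mytable ADD INDEX idx_crdate (crdate);
ALTER TABLE mytable_history ADD INDEX idx_crdate (crdate);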
I wanted to add to this for future googlers: if you are querying a datetime column, a more specific predicate results in a more efficient query. For example:
SELECT * FROM MyTable WHERE MyDateTime = '01/01/2015 00:00:00'
will be faster than:
SELECT * FROM MyTable WHERE MyDateTime = '01/01/2015'
I tested this repeatedly on a view of 5 million rows indexed by datetime; the more specific query gave me a response one second quicker.
I have data that resembles stock data and is updated every hour, so there are 24 entries every day for each stock (I'm just using stocks as an example). But sometimes the data may not be updated.
For example, let's assume we have 3 stocks, A, B, C. And assume that we gather data at various intervals during the day for each stock. The data would look something like this...
row    A      B      C
1      3      4      5
2      3.5    4.1    5
3      2.9    3.8    4.3
What I want is to sum up the average value of each stock for this time period or
Avg(A) + Avg(B) + Avg(C)
In reality I have hundreds of stocks and hundreds of thousands of rows. I need this to calculate for a single day.
I tried this (stock names are in a PHP array: $stocks = array('A','B','C')):
SELECT SUM(AVG(stock_price)) FROM table WHERE date = [mydate] AND stock_name IN ('".implode("','", $stocks)."') GROUP BY stock_name
but that didn't work. Can someone provide some insight?
Thanks, in advance.
Calculate the per-stock averages in a sub-query, then sum them in the main query.
SELECT SUM(average_price) AS total_averages
FROM (SELECT AVG(price) AS average_price
      FROM table
      WHERE <conditions>
      GROUP BY stock_name) AS averages
One way to do it: use an inline view as a row source:
SELECT SUM(a.avg_stock_price) AS sum_avg_stock_price
FROM ( SELECT AVG(t.stock_price) AS avg_stock_price
FROM table t
WHERE t.date = [mydate]
AND t.stock_name IN ('a','b','c')
GROUP BY t.stock_name
) a
You can run just the query from the inline view (aliased as a) to verify the results it returns. The outer query runs against the set of rows returned by the inline view query. (MySQL refers to this inline view as a "derived table".)
The outer query is effectively like this:
SELECT SUM(a.avg_stock_price) AS sum_avg_stock_price
FROM a
The "trick" is that "a" isn't a regular table, it's a set of rows returned by a query; but in terms of relational algebra theory, it works the same... it's a set or rows. If a were a regular table, we could write:
SELECT b.col
FROM (
SELECT col FROM a
) b
We don't want to do that in MySQL when we don't have to, because of the inefficient way that MySQL processes that. MySQL first runs the inner query (the query in the inline view). MySQL creates a temporary MyISAM table, and inserts the rows returned by the query into the temporary MyISAM table. MySQL then runs the outer query, against that temporary table (which MySQL refers to as a "derived table") to return the result. Creating and populating a temporary table that's a copy of a regular table is a lot of overhead, especially with large sets.
What makes this powerful is that the inline view query can include JOINs, a WHERE clause, aggregates, GROUP BY, whatever. As long as it returns a set of rows (with appropriate column names), we can wrap the query in parentheses and reference it in another query as if it were a table.
My problem is this:
select * from
(
    select * from barcodesA
    UNION ALL
    select * from barcodesB
) as barcodesTotal, boxes
where barcodesTotal.code = boxes.code;
Table barcodesA has 4000 entries
Table barcodesB has 4000 entries
Table boxes has about 180,000 entries
It takes 30 seconds to process the query.
Another problematic query:
select * from
viewBarcodesTotal, boxes
where viewBarcodesTotal.code = boxes.code;
viewBarcodesTotal contains the UNION ALL of both barcode tables. It also takes forever.
Meanwhile,
select * from barcodesA , boxes where barcodesA.code=boxes.code
UNION ALL
select * from barcodesB , boxes where barcodesB.code=boxes.code
This one takes <1 second.
The question is obviously: WHY? Is my code bugged? Is MySQL bugged?
I have to migrate from Access to MySQL, and I would have to rewrite all my code if the first option is bugged.
Add an index on boxes.code if you don't already have one. Joining 8000 records (4K+4K) to the 180,000 will benefit from an index on the 180K side of the equation.
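For example (the index name is an assumption):

CREATE INDEX idx_boxes_code ON boxes (code);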
Also, be explicit and specify the fields you need in your SELECT statements. Using * in a production query is bad form, as it encourages not thinking about which fields you need (and how big they might be), not to mention that you are UNIONing two different tables in your example, barcodesA and barcodesB, with potentially different data types and column orders.
The REASON for the performance difference...
The first query says: first, do a complete union of EVERY record in A with EVERY record in B, THEN join the result to boxes on code. The union result has no index to optimize the join against.
In your second query, each table is joined individually and the join IS optimized (the sub-second performance suggests an index exists, but I would make sure both barcode tables have an index on the "code" column).
I have a bunch of data ordered by date, and each table holds only one month of data. (The reason for this is to cut down query time; I'm talking about millions of rows in each month table.)
For ex.
data_01_2010 holds data from 2010-01-01 to 2010-01-31
data_02_2010 holds data from 2010-02-01 to 2010-02-28
Sometimes I have to query these tables according to a specific date range. Now if the range is across multiple months for ex. 2010-01-01 to 2010-02-28 then I need to query both tables.
Can this be achieved with a single query?
Like for example:
SELECT *
FROM data_01_2010, data_02_2010
WHERE date BETWEEN '2010-01-01' AND '2010-02-28'
The problem with the above query is that it says the column date is ambiguous, which it is, because the column is present in both tables. (The tables have the same structure.)
So is this achievable with a single query or do I have to query it for each table separately?
This is a perfect example of why partitioning is so powerful. Partitioning allows you to logically store all of your records in the same table without sacrificing query performance.
In this example, you would have one table called data (hopefully you would name it better than this) and range partition it based on the value of the date column (again hopefully you would name this column better). This means that you could meet your requirement by a simple select:
SELECT * FROM data WHERE date BETWEEN '2010-01-01' AND '2010-02-28';
Under the covers, the database will only access the partitions required based on the where clause.
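A hedged sketch of such a table (column list abbreviated; MySQL requires the partitioning column to be part of every unique key, hence the composite primary key):

CREATE TABLE data (
    id   INT  NOT NULL,
    date DATE NOT NULL,
    -- ...other columns...
    PRIMARY KEY (id, date)
)
PARTITION BY RANGE (TO_DAYS(date)) (
    PARTITION p201001 VALUES LESS THAN (TO_DAYS('2010-02-01')),
    PARTITION p201002 VALUES LESS THAN (TO_DAYS('2010-03-01')),
    PARTITION pmax    VALUES LESS THAN MAXVALUE
);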
Reference:
http://dev.mysql.com/doc/refman/5.5/en/partitioning.html
If you do the logic elsewhere to figure out which tables you need to query, and each month has the same schema, you could just union the tables:
SELECT *
FROM data_01_2010
WHERE date BETWEEN '2010-01-01' AND '2010-02-28'
UNION ALL
SELECT *
FROM data_02_2010
WHERE date BETWEEN '2010-01-01' AND '2010-02-28'
But if you need the query itself to calculate which tables to union, you are venturing into the realm of stored procedures.
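For completeness, a hedged sketch of that dynamic approach using MySQL prepared statements (table names taken from the question; real code would build the list of month tables from the date range):

SET @sql = CONCAT(
    'SELECT * FROM data_01_2010 WHERE date BETWEEN ? AND ?',
    ' UNION ALL ',
    'SELECT * FROM data_02_2010 WHERE date BETWEEN ? AND ?');
PREPARE stmt FROM @sql;
SET @from = '2010-01-01', @to = '2010-02-28';
EXECUTE stmt USING @from, @to, @from, @to;
DEALLOCATE PREPARE stmt;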