I have a stored procedure in which a temporary table is created.
There are 16 different SELECT statements that insert data into the temp table, each joining 4 tables at a time.
A new requirement is to apply a few more WHERE conditions based on some input parameters.
My question is:
I have two choices now:
Apply the conditions in the WHERE clause of each SELECT statement while inserting data into the temporary table.
Do not apply any conditions while inserting the data, but at the end delete the rows that are not required from the temp table.
The second approach seems simpler, but I was thinking about performance issues, as unnecessary data would initially be inserted into the temp table; then again, with the first approach the same filters have to be applied in every statement.
Can anyone guide me on which approach should be used?
Basically, among filtering, insertion, and deletion, which takes more time?
All tables have thousands of rows in them.
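For concreteness, here is a minimal sketch of the two options. All table, column, and parameter names (tmp_results, table_a, table_b, region_id, @region_id) are invented for illustration and are not the actual schema:

-- Option 1: apply the new parameter-based condition while inserting
-- (one of the 16 INSERT ... SELECT statements, sketched)
INSERT INTO tmp_results (id, amount, region_id)
SELECT a.id, b.amount, a.region_id
FROM table_a a
JOIN table_b b ON b.a_id = a.id
WHERE a.region_id = @region_id;

-- Option 2: insert everything, then delete what the parameter excludes
INSERT INTO tmp_results (id, amount, region_id)
SELECT a.id, b.amount, a.region_id
FROM table_a a
JOIN table_b b ON b.a_id = a.id;

DELETE FROM tmp_results
WHERE region_id <> @region_id;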
It's hard to answer without the exact details, but generally speaking, the first approach sounds better.
The second approach means you'll be doing (potentially, depending on the exact conditions) twice the I/O - once to copy the data into the temp table, and again to delete it. If your dataset is large, this will be considerable.
Your first approach is better.
If the selects are taking time, you should optimize them so that they use indexes.
Secondly, filling the temp table from a large table and then keeping only a handful of records from it doesn't solve your problem.
In a Rails app, I have a table that already has hundreds of millions of records. I'm going to split the table into multiple tables, which should speed up reads and writes.
I found the octopus gem, but it handles master/slave replication; I just want to split the big table.
Or what else can I do when the table is too big?
Theoretically, a properly designed table with just the right indexes can handle very large amounts of data quite easily. As the table grows, the slowdown in queries and insertion of new records is supposed to be negligible. But in practice we find that it doesn't always work that way! However, the solution definitely isn't to split the table into two. The solution is to partition.
Partitioning takes this notion a step further, by enabling you to distribute portions of individual tables across a file system according to rules which you can set largely as needed. In effect, different portions of a table are stored as separate tables in different locations. The user-selected rule by which the division of data is accomplished is known as a partitioning function, which in MySQL can be the modulus, simple matching against a set of ranges or value lists, an internal hashing function, or a linear hashing function.
If you merely split a table, your code is going to become infinitely more complicated: each time you do an insert or a retrieval you need to figure out which split you should run that query on. When you use partitions, MySQL takes care of that detail for you, and as far as the application is concerned it's still one table.
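As a rough sketch (the table, columns, and partition count here are invented, and hash partitioning is just one of the possible partitioning functions), it could look like this:

CREATE TABLE readings (
  id BIGINT NOT NULL,
  recorded_at DATETIME NOT NULL,
  value DECIMAL(10,2),
  PRIMARY KEY (id)
)
PARTITION BY HASH(id)  -- MySQL routes each row to one of the partitions automatically
PARTITIONS 8;

-- The application still queries it as a single table:
SELECT * FROM readings WHERE id = 12345;

Queries that filter on the partitioning key can be pruned to a single partition, which is where most of the read speedup comes from.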
Do you have an ID on each row? If the answer is yes, you could do something like:
CREATE TABLE table2 AS (SELECT * FROM table1 WHERE id >= (SELECT COUNT(*) FROM table1)/2);
The above statement creates a new table with roughly half of the records from table1 (assuming the IDs are contiguous and start at 1).
I don't know if you've already tried, but an index should help in speed for a big table.
CREATE INDEX index_name ON table1 (id)
Note: if you created the column with a unique constraint or as a primary key, there's already an index on it.
This is somewhat of a conceptual question. In terms of query optimization and speed, I am wondering which route would have the best performance and be the fastest. Suppose I am using JFreeChart (this bit is somewhat irrelevant). The entire idea of using JFreeChart with a MySQL database is to query for two values, an X and a Y. Suppose the database is full of many different tables, and usually the X and the Y come from two different tables. Would it be faster, in the query for the chart, to use joins and unions to get the two values... or... first create a table with the joined/union values, and then run queries on this new table (no joins or unions needed)? This would all be in one program, mind you. So, overall: joins and unions to get the X and Y values, or create a temporary table joining the values and then query the chart data from that?
It would, of course, be faster to pre-join the data and select from a single table than to perform a join. This assumes that you're saving one lookup per row and are properly using indexes in the first place.
However, even though you can get performance improvements from denormalization in such a manner, it's not commonly done. A few of the reasons why it's not common include:
Redundant data takes up more space
With redundant data, you have to update both copies whenever something changes
JOINs are fast
JOINs on multiple rows can improve (they don't always require a lookup per row) with things such as the new Batched Key Access joins in MySQL 5.6, but that only helps with some queries, and you have to tell MySQL which join type you want; it's not automatic.
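As a sketch of the trade-off (all table and column names here are hypothetical), the two routes look like this:

-- Route 1: join at chart time
SELECT m.x_value, s.y_value
FROM measurements m
JOIN samples s ON s.measurement_id = m.id;

-- Route 2: pre-join once into a denormalized table, then read from it
CREATE TABLE chart_data AS
SELECT m.x_value, s.y_value
FROM measurements m
JOIN samples s ON s.measurement_id = m.id;

SELECT x_value, y_value FROM chart_data;

Route 2 reads faster, but it stores the data twice and goes stale as soon as measurements or samples change, which is exactly the redundancy problem listed above.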
I am building an analytics platform where users can create reports and such against a MySQL database. Some of the tables in this database are pretty huge (billions of rows), so for all of the features so far I have indexes built to speed up each query.
However, the next feature is to add the ability for a user to define their own query so that they can analyze data in ways that we haven't pre-defined. They have full read permission to the relevant database, so basically any SELECT query is a valid query for them to enter. This creates problems, however, if a query is defined that filters or joins on a column we haven't currently indexed - sometimes to the point of taking over a minute for a simple query to execute - something as basic as:
SELECT tbl1.a, tbl2.b, SUM(tbl3.c)
FROM
tbl1
JOIN tbl2 ON tbl1.id = tbl2.id
JOIN tbl3 ON tbl1.id = tbl3.id
WHERE
tbl1.d > 0
GROUP BY
tbl1.a, tbl2.b
Now, assume that we've only created indexes on columns not appearing in this query so far. Also, we don't want too many indexes slowing down inserts, updates, and deletes (otherwise the simple solution would be to build an index on every column accessible by the users).
My question is, what is the best way to handle this? Currently, I'm thinking that we should scan the query, build indexes on anything appearing in a WHERE or JOIN that isn't already indexed, execute the query, and then drop the indexes that were built afterwards. However, the main things I'm unsure about are a) is there already some best practice for this sort of use case that I don't know about? and b) would the overhead of building these indexes be enough that it would negate any performance gains the indexes provide?
If this strategy doesn't work, the next option I can see working is to collect statistics on what types of queries the users run, and have some regular job periodically check what commonly used columns are missing indexes and create them.
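In sketch form, the on-the-fly variant described above would look something like this (the index name is a placeholder):

-- Detected that tbl1.d appears in a WHERE clause and is not indexed
CREATE INDEX idx_tbl1_d_tmp ON tbl1 (d);

-- ... execute the user's query here ...

-- Clean up afterwards
DROP INDEX idx_tbl1_d_tmp ON tbl1;

On a table with billions of rows, though, each CREATE INDEX has to scan (and sort) the whole table, so as the answer below points out this can easily cost far more than the query it is meant to speed up.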
If using MyISAM, then performing an ALTER statement on tables that large (billions of rows) in order to add an index will take a considerable amount of time, probably far longer than the one minute you've quoted for the statement above (and you'll need another ALTER to drop the index afterwards). During that time, the table will be locked, meaning other users can't execute their own queries.
If your tables use the InnoDB engine and you're running MySQL 5.1+, then CREATE / DROP index statements shouldn't lock the table, but it still may take some time to execute.
There's a good rundown of the history of ALTER TABLE [here][1].
I'd also suggest that automated query analysis to identify and build indexes would be quite difficult to get right. For example, what about cases such as selecting by foo.a but ordering by foo.b? This kind of query often needs a covering index over multiple columns; otherwise you may find your server tries a filesort on a huge result set, which can cause big problems.
Giving your users an "explain query" option would be a good first step. If they know enough SQL to perform custom queries then they should be able to analyse EXPLAIN in order to best execute their query (or at least realise that a given query will take ages).
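For instance, a user could prefix their query with EXPLAIN to see how MySQL plans to execute it before running the real thing (reusing the query from the question above):

EXPLAIN
SELECT tbl1.a, tbl2.b, SUM(tbl3.c)
FROM tbl1
JOIN tbl2 ON tbl1.id = tbl2.id
JOIN tbl3 ON tbl1.id = tbl3.id
WHERE tbl1.d > 0
GROUP BY tbl1.a, tbl2.b;

-- Rows with type = ALL and NULL in possible_keys indicate full table scans on
-- unindexed columns; "Using temporary; Using filesort" in the Extra column
-- warns of expensive grouping or sorting.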
So, going further with my idea, I propose you segment your data into well-identified views. You used abstract names, so I can't reuse your business model; I'll use a made-up example instead.
Say you have 3 tables:
customer (gender, social category, date of birth, ...)
invoice (date, amount, ...)
product (price, date of creation, ...)
You would create some sort of materialized view for each specific segment. It's like adding a business layer on top of the very bottom data-representation layer.
For example, we could identify the following segments:
seniors having at least 2 invoices
invoices of 2013 with more than 1 product
How to do that? And how to do that efficiently? Regular views won't help your problem because they will have poor explain plans on random queries. What we need is a real physical representation of these segments. We could do something like this:
CREATE TABLE MV_SENIORS_WITH_2_INVOICES AS
SELECT ... /* select from the existing tables */
;
/* add indexes: */
ALTER TABLE MV_SENIORS_WITH_2_INVOICES ADD CONSTRAINT...
... etc.
So now, your users just have to query MV_SENIORS_WITH_2_INVOICES instead of the original tables. Since there are fewer records, and probably more indexes, performance will be better.
We're done! Oh wait, no :-)
We need to refresh this data, a bit like a FAST REFRESH in Oracle. MySQL does not have a similar system (not that I know of... someone correct me?), so we have to create some triggers for that.
CREATE TRIGGER ... AFTER INSERT ON `customer`
FOR EACH ROW
BEGIN
... /* insert the data into MV_SENIORS_WITH_2_INVOICES if the new row matches the segment */
END;
Now we're done!
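Filled in with concrete (invented) column names, and assuming customer has a birth_date column and invoice has a customer_id foreign key, the segment table could be built roughly like this:

CREATE TABLE MV_SENIORS_WITH_2_INVOICES AS
SELECT c.id AS customer_id, c.gender, c.social_category, c.birth_date
FROM customer c
JOIN invoice i ON i.customer_id = c.id
WHERE c.birth_date <= DATE_SUB(CURDATE(), INTERVAL 65 YEAR)  -- "senior" cutoff is an assumption
GROUP BY c.id, c.gender, c.social_category, c.birth_date
HAVING COUNT(i.id) >= 2;

ALTER TABLE MV_SENIORS_WITH_2_INVOICES ADD INDEX idx_mv_customer (customer_id);

The matching AFTER INSERT (and UPDATE/DELETE) triggers on customer and invoice then keep this table in sync as new rows arrive.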
I'm working on optimizing a MySQL query that joins 2 tables together and has a few WHERE clauses and an ORDER BY.
I noticed (using EXPLAIN) that a temporary table is created during the evaluation of the query, since I'm grouping on a field in a table that isn't the first table in the join queue.
I'd really like to know if this temp table is being written to disk or not, which the explain results don't tell me.
It would also be nice to be able to tell what exactly is going into said temporary table. Some of the restrictions in my where clause are on indexed columns and some aren't, so I think that mysql might not be optimally picking rows into the temporary table.
Specifically, my query is basically of the form: select ... from a join b where ... with restrictions on both a and b on both indexed and non-indexed columns. The problem is that the number of rows going into the temp table selected from a is more than I suspect it should be. I want to investigate this.
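One way to investigate whether that temporary table spills to disk (assuming you can run the query in isolation in your own session) is to compare MySQL's temp-table status counters before and after executing it:

SHOW SESSION STATUS LIKE 'Created_tmp%';
-- note the values, run the query in question, then check again
SHOW SESSION STATUS LIKE 'Created_tmp%';
-- If Created_tmp_disk_tables increased along with Created_tmp_tables,
-- the temporary table was written to disk rather than kept in memory.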
All databases use a memory area or work area to execute a query and will use temp tables in those memory areas depending on how you built your query. If you're joining multiple tables, it may use more than one to build the final result set. Those temp tables usually exist in memory as long as the user is logged on.
EXPLAIN illustrates the process the server is trying to optimize as it interprets your SQL. If you have a poorly indexed WHERE clause, or if you are filtering inside a join, it could be pulling an excessive amount of data into memory as it executes and builds your final result set. This is what poor performance at the DB level looks like.
Reading your pseudocode in the last paragraph, I would say you need some indexing and to rewrite your WHERE clause to join on indexed fields. Post your SQL if you really want an opinion.
Here's the scenario: the old database has this kind of design
dbo.Table1998
dbo.Table1999
dbo.Table2000
dbo.Table2001
...
dbo.Table2011
and I merged all the data from 1998 to 2011 into the table dbo.TableAllYears.
Now they're both indexed by "application number" and have the same number of columns (56 columns, actually).
Now when I tried
select * from Table1998
and
select * from TableAllYears where Year=1998
the first query returns 139,669 rows in 13 seconds,
while the second query returns the same number of rows but takes 30 seconds.
So, am I just missing something, or are multiple tables better than a single table?
You should partition the table by year, this is almost equivalent to having different tables for each year. This way when you query by year it will query against a single partition and the performance will be better.
Try adding an index on each of the columns that you're searching on (those in the WHERE clause). That should speed up querying dramatically.
So in this case, add a new index for the field Year.
I believe that you should use a single table. Inevitably, you'll need to query data across multiple years, and separating it into multiple tables is a problem. It's quite possible to optimize your query and your table structure such that you can have many millions of rows in a table and still have excellent performance. Be sure your year column is indexed, and included in your queries. If you really hit data size limitations, you can use partitioning functionality in MySQL 5 that allows it to store the table data in multiple files, as if it were multiple tables, while making it appear to be one table.
Regardless of that, 140k rows is nothing, and it's likely premature optimization to split it into multiple tables, and even a major performance detriment if you need to query data across multiple years.
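The dbo. prefixes suggest SQL Server, but to illustrate the MySQL 5 partitioning mentioned above, year-based range partitioning would look roughly like this (partition names are invented, and note that in MySQL the partitioning column must be part of every unique key on the table):

ALTER TABLE TableAllYears
PARTITION BY RANGE (Year) (
  PARTITION p1998 VALUES LESS THAN (1999),
  PARTITION p1999 VALUES LESS THAN (2000),
  -- ... one partition per year ...
  PARTITION p2011 VALUES LESS THAN (2012),
  PARTITION pmax VALUES LESS THAN MAXVALUE
);

-- A query such as SELECT * FROM TableAllYears WHERE Year = 1998
-- is then pruned to the p1998 partition only.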
If you're looking for data from 1998, then having only 1998 data in one table is the way to go. This is because the database doesn't have to "search" for the records, but knows that all of the records in this table are from 1998. Try adding the WHERE Year=1998 clause to the query against Table1998 and you should get a fairer comparison.
Personally, I would keep the data in multiple tables, especially if it is a particularly large data set and you don't have to do queries on the old data frequently. Even if you do, you might want to look at creating a view with all of the table data and running the reports on that instead of having to query several tables.
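A sketch of that idea (AllYearsView is an invented name, and the per-year tables are assumed to keep identical column lists) would be a UNION ALL view over the yearly tables:

CREATE VIEW dbo.AllYearsView AS
SELECT * FROM dbo.Table1998
UNION ALL
SELECT * FROM dbo.Table1999
UNION ALL
-- ... one SELECT per year ...
SELECT * FROM dbo.Table2011;

Reports can then query AllYearsView as if it were one table, while day-to-day inserts still go to the individual per-year tables.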