I have identified that my hanging job is indeed suffering from skew on its join.
What techniques can I use to make my job still succeed?
My code looks like the following:
from transforms.api import Input, Output, transform

@transform(
    my_output=Output("/path/to/my/output"),
    left_input=Input("/path/to/my/left_input"),
    right_input=Input("/path/to/my/right_input"),
)
def my_compute_function(my_output, left_input, right_input):
    left_df = left_input.dataframe()
    right_df = right_input.dataframe()
    output_df = left_df.join(right_df, on=["my_joint_column"])
    my_output.write_dataframe(output_df)
I can see one task in particular taking a long time:
You have a couple of options, depending on whether the distribution of your keys is actually correct.
The first thing you must verify is:
Is the distribution of keys actually correct? i.e. Are the duplicated rows per key actually valid and need to be operated upon?
It's quite common for null values or other invalid keys to be present in your data, and it's worth verifying whether these need to be filtered out or consolidated by picking just the latest version. (This is commonly called a max-row or min-row operation: for each key, pick the row that has the maximum value in some other column, such as a timestamp column.)
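For illustration, a max-row operation in PySpark might look like the sketch below, run inside your compute function against the left_df from your code. The updated_at timestamp column is an assumption here; substitute whatever column decides which row wins per key.

from pyspark.sql import Window
from pyspark.sql import functions as F

# Keep only the most recent row per join key; "updated_at" is an assumed column.
window = Window.partitionBy("my_joint_column").orderBy(F.col("updated_at").desc())
left_df = (
    left_df
    .filter(F.col("my_joint_column").isNotNull())   # drop null keys outright
    .withColumn("_rn", F.row_number().over(window))
    .filter(F.col("_rn") == 1)
    .drop("_rn")
)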
Assuming the present keys are in fact valid and need to be operated upon, you next must ask:
Is one side of the join significantly smaller than the other?
This typically means the right side of the join has roughly 1/10th the number of keys of the left side. If this is true, you can try salting the join (a sketch follows below). It's worth noting that the size difference is not a function of the total rows in each dataset (although this can be a quick-and-dirty way to estimate it); it should instead be thought of as a difference in the per-key counts between the two sides of the join. You can get the counts per key using the technique here, and the scale difference can be computed by dividing df1_COUNT by df2_COUNT instead of multiplying them.
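Here is a minimal sketch of one common salting approach, reusing the names from your code. NUM_SALTS and the _salt column are illustrative and should be tuned to the skew you observe: the smaller right side is replicated once per salt value, each row of the larger left side is tagged with a random salt, and the join key becomes (key, salt).

from pyspark.sql import functions as F

NUM_SALTS = 32  # illustrative; increase for heavier skew

# Replicate each right-side row once per salt value.
right_salted = right_df.withColumn(
    "_salt", F.explode(F.array([F.lit(i) for i in range(NUM_SALTS)]))
)

# Tag each left-side row with a random salt in [0, NUM_SALTS).
left_salted = left_df.withColumn("_salt", (F.rand() * NUM_SALTS).cast("int"))

output_df = left_salted.join(
    right_salted, on=["my_joint_column", "_salt"]
).drop("_salt")

The effect is that each hot key is spread across NUM_SALTS tasks instead of landing in a single one.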
If the right side of the join is not significantly smaller than the left, then:
You have a large join that has similar row counts on both sides. You must boost executor memory so that the rows fit into memory.
This means you must apply a profile to your Transform that increases the executor memory above its current value (which can be found on the same page where AQE is noted).
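In Foundry Python transforms, a profile is applied with the configure decorator. A minimal sketch follows, assuming a profile named EXECUTOR_MEMORY_LARGE is enabled in your environment; the profile names actually available vary by stack, so check what yours exposes.

from transforms.api import Input, Output, configure, transform

@configure(profile=["EXECUTOR_MEMORY_LARGE"])  # assumed profile name
@transform(
    my_output=Output("/path/to/my/output"),
    left_input=Input("/path/to/my/left_input"),
    right_input=Input("/path/to/my/right_input"),
)
def my_compute_function(my_output, left_input, right_input):
    ...  # same join logic as before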
I have a MariaDB InnoDB table with several million rows, but with short, fixed-width rows consisting of numbers and timestamps only.
We usually search, filter and sort the rows using any of the existing columns.
We want to add a column to store an associated "url" for each row. Ideally every row will have its URL.
We know for a fact that we won't be sorting, searching and filtering by the url column.
We don't mind truncating the URL to its first 255 bytes, so we are going to give it the VARCHAR type.
But of course that column's width would be variable. The whole record will become variable-width and the width of the original record will double in many cases.
We were considering the alternative of using a different, secondary table for storing the varchar.
We could join them when querying the data, or, probably even more efficiently, just fetch the URLs for the page we are showing.
Would this approach be advisable?
Is there a better alternative that would also allow us to preserve performance?
Update: As user Bill Karwin noted in one comment below, InnoDB does not benefit from fixed width as much as MyISAM does, so the real issue here is about the size of the row and not so much about the fixed versus variable width discussion.
Assuming you have control over how the URL is generated, you may want to make it fixed-length. YouTube video URIs, for instance, are always 11 characters long and base-64 encoded. This fixes the variable-length problem and avoids joining tables.
If changing URI generation is not an option, you have a few alternatives to make it fixed-length:
You could pad each URL with a special character to force every stored value to a full 255 bytes within the database, and strip the padding just before returning it (a sketch follows below). This is not a clean solution, but it makes DQL operations faster than joining.
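A minimal Python sketch of that pad-on-write / strip-on-read idea, assuming the column is declared with a fixed width so the row stays fixed-width, and using a space as the pad character (spaces are never valid in an unencoded URL). Note that a MySQL CHAR column effectively does this space padding and stripping for you.

PAD_CHAR = " "   # never appears in an unencoded URL
WIDTH = 255

def pad_url(url: str) -> str:
    # Right-pad on the way in so every stored value is exactly WIDTH long.
    return url.ljust(WIDTH, PAD_CHAR)

def unpad_url(stored: str) -> str:
    # Strip the padding on the way out, just before returning the URL.
    return stored.rstrip(PAD_CHAR)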
You could fetch the URL separately as you have stated, but beware that two requests may be more time-consuming than any other option that uses just one.
You could join with another table only when the user requires it, as opposed to it being the default.
Consider that having variable length may not be as big a problem, depending on your needs. The only issue might be if you're grossly oversizing fields, but it doesn't seem to be your case.
Basically, what's the difference in terms of security and speed between these two queries?
SELECT * FROM `myTable`
and
SELECT `id`, `name`, `location`, `place` etc... FROM `myTable`
Would using * make my query perform slower than selecting the columns explicitly?
There won't be much appreciable difference in performance if you also select all columns individually.
The idea is to select only the data you require and no more, which can improve performance if there are a lot of unneeded columns in your query, for example, when you join several tables.
Of course, on the other side of the coin, using * makes life easier when you make changes to the table.
Security-wise, the less you select, the less potentially sensitive data can be inadvertently dumped to the user's browser. Imagine if * included the column social_security_number and somewhere in your debug code it gets printed out as an HTML comment.
Performance-wise, in many cases your database is on another server, so requesting the entire row when you only need a small part of it means a lot more data going over the network.
There is not a single, simple answer, and your literal question cannot be fully answered without more detail of the specific table structure, but I'm going with the assumption that you aren't actually talking about a specific query against a specific table, but rather about selecting columns explicitly versus using the *.
SELECT * is always wasteful of something unless you are actually going to use every column that exists in the rows you're reading... maybe network bandwidth, or CPU resources, or disk I/O, or a combination, but something is being unnecessarily used, though that "something" may be in very small and imperceptible quantities ... or it may not ... but it can add up over time.
The two big examples that come to mind where SELECT * can be a performance killer are cases like...
...tables with long VARCHAR and BLOB columns:
Columns such as BLOB and VARCHAR that are too long to fit on a B-tree page are stored on separately allocated disk pages called overflow pages. We call such columns off-page columns. The values of these columns are stored in singly-linked lists of overflow pages, and each such column has its own list of one or more overflow pages.
— http://dev.mysql.com/doc/refman/5.6/en/innodb-row-format-overview.html
So if * includes columns that weren't stored on-page with the rest of the row data, you just took an I/O hit and/or wasted space in your buffer pool with accesses that could have been avoided had you selected only what you needed.
...also cases where SELECT * prevents the query from using a covering index:
If the index is a covering index for the queries and can be used to satisfy all data required from the table, only the index tree is scanned. In this case, the Extra column says Using index. An index-only scan usually is faster than ALL because the size of the index usually is smaller than the table data.
— http://dev.mysql.com/doc/refman/5.6/en/explain-output.html
When one or more columns are indexed, copies of the column data are stored, sorted, in the index tree, which also includes the primary key for finding the rest of the row data. When selecting from a table, if all of the columns you are selecting can be found within a single index, the optimizer will automatically choose to return the data to you by reading it directly from the index, instead of going to the time and effort of reading in all of the row data... and this, in some cases, is a very significant difference in the performance of a query, because it can mean substantially smaller resource usage.
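MySQL's EXPLAIN reports this as "Using index" in the Extra column. Just to make the effect observable with something self-contained, here is the same behaviour demonstrated with Python's built-in sqlite3 module; it is a different engine, but the covering-index principle is identical, and the table and index names are made up for the demo.

import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, a INT, b INT, c TEXT)")
con.execute("CREATE INDEX idx_a_b ON t (a, b)")

# Only indexed columns requested: answered from the index tree alone.
print(con.execute("EXPLAIN QUERY PLAN SELECT a, b FROM t WHERE a = 1").fetchall())
# -> ... USING COVERING INDEX idx_a_b ...

# SELECT * also needs column c, so the engine must go back to the table rows.
print(con.execute("EXPLAIN QUERY PLAN SELECT * FROM t WHERE a = 1").fetchall())
# -> ... USING INDEX idx_a_b ... (plus lookups into the table itself)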
If EXPLAIN SELECT does not reveal the exact same query plan when selecting the individual columns you need compared with the plan used when selecting *, then you are looking at some fairly hard evidence that you are putting the server through unnecessary work by selecting things you aren't going to use.
In additional cases, such as with the information_schema tables, the columns you select can make a particularly dramatic and obvious difference in performance. The information_schema tables are not actually tables -- they're server internal structures exposed via the SQL interface... and the columns you select can significantly change the performance of the query because the server has to do more work to calculate the values of some columns, compared to others. A similar situation is true of FEDERATED tables, which actually fetch data from a remote MySQL server to make a foreign table appear logically to be local. The columns you select are actually transferred across the network between servers.
Explicitly selecting the columns you need can also lead to fewer sneaky bugs. If a column you were using in code is later dropped from a table, the place in your code's data structure -- in some languages -- is going to contain an undefined value, which in many languages is the same thing you would see if the column still existed but was null... so the code thinks "okay, that's null, so..." and a logical error follows. Had you explicitly selected the columns you wanted, subsequent executions of the query would throw a hard error instead of quietly misbehaving.
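A toy Python illustration of that failure mode, assuming rows come back as dictionaries; the column names here are made up.

# Suppose the "location" column was dropped from the table after this
# code was written, and the query used SELECT *.
row = {"id": 7, "name": "Ada"}     # "location" is simply absent now

location = row.get("location")     # None: looks just like SQL NULL
if location is None:
    ...                            # code quietly takes the "it was NULL" branch

# Had the query been SELECT id, name, location FROM ..., the server would
# have raised an unknown-column error on the first execution instead.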
MySQL's C-client API, which some other client libraries are built on, supports two modes of fetching data, one of which is mysql_store_result, which buffers the data from the server on the client side before the application actually reads it into its internal structures... so as you are "reading from the server" you may have already implicitly allocated a lot of memory on the client side to store that incoming result-set even when you think you're fetching a row at a time. Selecting unnecessary columns means even more memory needed.
SELECT COUNT(*) is an exception. The COUNT() function counts the number of non-null values seen, and * merely means "count the rows"... it doesn't examine column data, so if you want a star there, go for it.
As a favor to your future self, unless you want to go back later and rewrite all of those queries when you're trying to get more performance out of your server, you should bite the bullet and do the extra typing, now.
As a bonus, when other people see your code, they won't accuse you of laziness or inexperience.
One of my stored procedures was taking too long to execute. Taking a look at the query execution plan, I was able to locate the operation taking too long. It was a nested loop physical operator that had an outer table (65,991 rows) and an inner table (19,223 rows). On the nested loop it showed estimated rows = 1,268,544,993 (65,991 multiplied by 19,223), as below:
I read a few articles on the physical operators used for joins and got a bit confused about whether a nested loop or a hash match would have been better for this case. From what I could gather:
Hash match is used by the optimizer when no useful indexes are available, when one table is substantially smaller than the other, or when the tables are not sorted on the join columns. A hash match might also indicate that a more efficient join method (nested loops or merge join) could be used.
Question: Would hash match be better than nested loops in this scenario?
Thanks
ABSOLUTELY. A hash match would be a huge improvement. Creating the hash on the smaller 19,223 row table then probing into it with the larger 65,991 row table is a much smaller operation than the nested loop requiring 1,268,544,993 row comparisons.
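To see why, here is a toy Python sketch of the shape of the work in each strategy. This is not how SQL Server implements the operators internally, just an illustration of the comparison counts involved.

from collections import defaultdict

def nested_loop_join(outer, inner, key):
    # Every outer row is compared with every inner row:
    # 65,991 * 19,223 = 1,268,544,993 comparisons.
    for o in outer:
        for i in inner:
            if o[key] == i[key]:
                yield {**o, **i}

def hash_join(small, large, key):
    # Build once on the small input, then one cheap probe per large row:
    # roughly 19,223 + 65,991 operations.
    buckets = defaultdict(list)
    for s in small:                       # build phase
        buckets[s[key]].append(s)
    for l in large:                       # probe phase
        for s in buckets.get(l[key], []):
            yield {**s, **l}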
The only reason the server would choose the nested loops is that it badly underestimated the number of rows involved. Do your tables have statistics on them, and if so, are they being updated regularly? Statistics are what enable the server to choose good execution plans.
If you've properly addressed statistics and are still having a problem you could force it to use a HASH join like so:
SELECT *
FROM
    TableA A                  -- the smaller table
    LEFT HASH JOIN TableB B   -- the larger table
        ON A.ID = B.ID        -- illustrative join column
Please note that the moment you do this it will also force the join order. This means you have to arrange all your tables correctly so that their join order makes sense. Generally you would examine the execution plan the server already has and alter the order of your tables in the query to match. If you're not familiar with how to do this, the basics are that each "left" input comes first, and in graphical execution plans, the left input is the lower one. A complex join involving many tables may have to group joins together inside parentheses, or use RIGHT JOIN in order to get the execution plan to be optimal (swap left and right inputs, but introduce the table at the correct point in the join order).
It is generally best to avoid using join hints and forcing join order, so do whatever else you can first! You could look into the indexes on the tables, fragmentation, reducing column sizes (such as using varchar instead of nvarchar where Unicode is not required), or splitting the query into parts (insert to a temp table first, then join to that).
I would not recommend trying to "fix" the plan by forcing hints in one direction or another. Instead, you need to look at your indexes, statistics, and the T-SQL code to understand why you have a Table Spool loading up 1.2 billion rows from 19,000.
Is it safe to assume, when reading a MySQL table line by line from an application, that the table will always be read from top to bottom, one row after the other, in perfect sequential order?
E.g., if a table is ordered by a unique ID and I read it in via C++ one line at a time, is it safe to assume I will get each line in exact unique ID order every time?
My gut feeling is that it is not a safe assumption but I have no technical reasoning for that.
My testing has always shown that it does provide table rows in order but it makes me nervous relying on it. As a result I write programs such that they do not depend on this assumption which makes them a little more complicated and a little less efficient.
Thanks
C
If you use the following query you should have no issues:
SELECT columns
FROM tables
WHERE predicates
ORDER BY column ASC/DESC;
If you are using the standard C++ MySQL connector, then according to the reference manual, "The preview version does buffer all result sets on the client to support cursors".
It has also generally been my experience that your result sets are buffered, and therefore do not change when the underlying table changes.
You are right. It is not safe to assume, when reading a MySQL table line by line from an application, that the table will always be read from top to bottom, one row after the other in perfect sequential order. Nor is it safe to assume that if a table is ordered by a unique ID and you read it in (via C++ or otherwise) one line at a time, you will get each line in exact unique ID order every time.
There is no guarantee for that, on any RDBMS. No one should rely on that assumption.
Rows have no (read: should not have) intrinsic or default order in relational tables. Tables (relations) are, by definition, unordered sets of rows.
What gives this impression is that most systems, when asked to return a result for a query like:
SELECT columns
FROM table
they retrieve all the rows from disk, reading the whole file. So they return the rows (usually) in the order they were stored in the file, or in the order of the clustered key (e.g., in InnoDB tables in MySQL), and thus they return the result in the same order every time.
If there are multiple tables in the FROM clause or if there are WHERE conditions, it's a whole different situation: the tables are not read in full, different indexes may be used, and the system may not read the table files at all but just the index files, or may read only a small part of the tables.
It's also a different story if you have partitioned tables or distributed databases.
The conclusion is that you should have an ORDER BY clause in your queries if you want to guarantee the same order every time.
I have a table. I only need to run one type of query: find a given unique value in column 1, then get, say, the first 3 columns out.
Now, how much would it affect speed if I added a few extra columns to the table for basically "data storage"? I know I should use a separate table, but let's assume I am constrained to having just one table, so the only way is to add some columns at the end.
So, if I add some columns at the end, say 10 of them at VARCHAR(30) each, will this slow down the query described in the first sentence? If so, by roughly what factor, compared to the table without the extra, redundant-yet-present columns?
Yes, extra data can slow down queries because it means fewer rows can fit into a page, and this means more disk accesses to read a certain number of rows and fewer rows can be cached in memory.
The exact factor in slow down is hard to predict. It could be negligible, but if you are near the boundary between being able to cache the entire table in memory or not, a few extra columns could make a big difference to the execution speed. The difference in the time it takes to fetch a row from a cache in memory or from disk is several orders of magnitude.
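As a back-of-envelope sketch of the rows-per-page effect: assume a 16 KB InnoDB page, a 60-byte fixed row, and ten fully used VARCHAR(30) columns added on. All of these numbers are illustrative, and row headers, page fill factor, and so on are ignored.

PAGE_BYTES = 16 * 1024           # default InnoDB page size

narrow_row = 60                  # numbers + timestamps only (assumed)
wide_row = narrow_row + 10 * 30  # plus ten fully-used VARCHAR(30) columns

print(PAGE_BYTES // narrow_row)  # ~273 rows per page
print(PAGE_BYTES // wide_row)    # ~45 rows per page, so ~6x the pages to read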
If you add a covering index the extra columns should have less of an impact as the query can use the relatively narrow index without needing to refer to the wider main table.
I don't understand the 'I know I should use a separate table' bit. What you've described is the reason you have a DB, to associate a key with some related data. Look at it another way, how else do you retrieve this information if you don't have the key?
To answer your question, the only way to know what the performance hit is going to be is empirical testing (though Mark's answer, posted just prior to mine, is one - of VERY many - factors to speed).
That depends a bit on how much data you already have in the records. The difference would normally be somewhere between almost nothing at all and not so big.
The difference comes from how much more data has to be loaded from disk to get at the data you need. The extra columns will likely mean that fewer records fit in each page, but it's possible that there happens to be enough room left in each page for most of the extra data, so that few extra blocks are needed. It depends on how well the current data lines up in the pages.