Costs in explain queries in Vertica - output

May I ask what exactly the costs in EXPLAIN output are? Are they combined or summed (or something else) from several metrics like I/O, RAM, etc., or are they one specific metric?
e.g. +-SELECT LIMIT 10 [Cost: 282K, Rows: 10]
Thank you
Martin

It's actually documented here, although maybe not as exactly as you would like.
The query optimizer chooses a query plan based on cost estimates. The query optimizer uses information from a number of sources to develop potential plans and determine their relative costs. These include:
Number of table rows
Column statistics, including: number of distinct values (cardinality), minimum/maximum values, distribution of values, and disk space usage
Access path that is likely to require fewest I/O operations, and lowest CPU, memory, and network usage
Available eligible projections
Join options: join types (merge versus hash joins), join order
Query predicates
Data segmentation across cluster nodes
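Reading that list, the cost is a combined, unitless estimate rather than any one metric: it folds the I/O, CPU, memory, and network estimates into a single relative number that is only meaningful for comparing candidate plans, not as units of time or bytes. As a minimal sketch (the customers table is hypothetical), you can see these numbers by prefixing any query with EXPLAIN:

EXPLAIN SELECT customer_name
FROM customers
ORDER BY customer_name
LIMIT 10;
-- Each step of the resulting plan carries an annotation like
-- [Cost: 282K, Rows: 10], where Cost is the relative estimate.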

Related

Is the effectiveness of a column index related to the entropy of the column data

As a consumer, and occasionally a designer, of relational databases (Postgres, MySQL), I often have to consider query speeds in the context of various queries. However, you often don't know how a database will be used or where the bottlenecks might be until it's in production.
This makes me wonder, can I use a rule of thumb about the predicted entropy of a column as a heuristic for guessing the speed increase of indexing that column?
A quick Google search turns up papers written by Computer Science graduates for Computer Science graduates. Can you sum it up in layman's terms for a self-taught programmer?
Entropy?: I'm defining entropy as the number of rows divided by the number of times a value is repeated on average (mean). If this is a poor choice of words for those with a CS vocabulary, please suggest a better word.
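In SQL terms, here is what I mean (a sketch against a hypothetical tableA and columnA); note that rows divided by average repeats simplifies to the number of distinct values:

-- average repeats per value = count(*) / count(DISTINCT columnA), so:
SELECT count(*)::float
     / (count(*)::float / count(DISTINCT columnA)) AS entropy
FROM tableA;   -- equals count(DISTINCT columnA)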
This question is really too broad to answer thoroughly, but I'll attempt to sum up the situation for PostgreSQL (I don't know enough about other RDBMS, but some of what I write will apply to most of them).
Instead of entropy as you propose above, the PostgreSQL term is the selectivity of a certain condition, which is a number between 0 and 1, defined as the number of rows that satisfy the condition divided by the total number of rows in the table (for example, if 100 of 10,000 rows match, the selectivity is 0.01). A condition with a low selectivity value is (somewhat counter-intuitively) called highly selective.
The only sure way to figure out if an index is useful or not is to compare the execution times with and without the index.
When PostgreSQL decides if using an index for a condition on a table is effective or not, it compares the estimated cost of a sequential scan of the whole table with the cost of an index scan using an applicable index.
Since sequential reads and random I/O (as used for accessing indexes) often differ in speed, there are a few parameters that influence the cost estimate and hence the decision:
seq_page_cost: Cost of a sequentially fetched disk page
random_page_cost: Cost of a non-sequentially fetched disk page
cpu_tuple_cost: Cost of processing one table row
cpu_index_tuple_cost: Cost of processing an index entry during an index scan
These costs are measured in arbitrary units; it is customary to define seq_page_cost as 1 and the others relative to it.
The database collects table statistics so that it knows how big each table is and how the column values are distributed (most common values and their frequency, histograms, correlation to physical position).
To see an example of how all these numbers are used by PostgreSQL, look at this example from the documentation.
Using the default settings, a rule of thumb might be that an index will not help much unless the selectivity is less than 0.2.
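As a minimal sketch of checking this yourself (the events table, status column, and index are hypothetical), you can compute a condition's selectivity by hand and compare it with the plan the optimizer picks:

-- Hypothetical table with an indexed column:
CREATE TABLE events (id serial PRIMARY KEY, status text);
CREATE INDEX events_status_idx ON events (status);

-- Selectivity = rows satisfying the condition / total rows:
SELECT (SELECT count(*) FROM events WHERE status = 'error')::float
     / NULLIF((SELECT count(*) FROM events), 0) AS selectivity;

-- Inspect the cost parameters and the chosen plan:
SHOW seq_page_cost;       -- 1 by convention
SHOW random_page_cost;    -- 4 by default
EXPLAIN SELECT * FROM events WHERE status = 'error';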
What I think you are asking is what the impact of an index is relative to the distribution of data in a column. There is a bunch of theory here. In GENERAL, you will find that index lookup efficiency depends on the distribution of data in the index. In other words, an index is more efficient if you are pulling 0.01% of the table than if you are pulling 5% of the table. This is because random disk I/O is always less efficient (even on SSDs, due to read-ahead caching by the OS) than sequential reads.
Now this is not the only consideration. There are always questions about the best way to retrieve a set, particularly an ordered one, using an index. Do you scan the ordering index or the filtering index and then sort? Usually there is an assumption here that data is evenly distributed between the two, but where this is a bad assumption you can get bad query plans.
So what you should do here is look up index cardinality and get experience with query plans, particularly when the planner makes a mistake so you can understand why it is in error.
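To make the ordering-versus-filtering trade-off above concrete, here is a sketch (the orders table and its columns are hypothetical) of the kind of query where the planner has to guess:

-- The planner can walk an index on created_at and filter out other
-- customers, or use an index on customer_id and then sort:
EXPLAIN
SELECT * FROM orders
WHERE customer_id = 42
ORDER BY created_at
LIMIT 10;
-- If customer 42's rows are distributed unevenly with respect to
-- created_at, either choice can turn out to be a bad plan.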

Understanding the # of buckets for my SnappyData table?

The default # of buckets is 113. Why? Why not 110? Does the bucket logic perform better with a certain "divisible by" value?
There are a lot of examples in SnappyData with fewer buckets. Why is that? What logic went into determining to use fewer buckets than the default 113?
What are the implications of choosing fewer? What about more buckets? I see a lot of logging in my Spark SQL queries looking for data at each bucket. Is it worse for the performance of a query to have more buckets?
Follow these guidelines to calculate the total number of buckets for the partitioned table:
Use a prime number. We use a hashing function internally and this provides the most even distribution. Check this post for more details: Why use a prime number in hashCode?
Make it at least four times as large as the number of data stores you expect to have for the table. The larger the ratio of buckets to data stores, the more evenly the load can be spread across the members.
Note that there is a trade-off between load balancing and overhead, however. Managing a bucket introduces significant overhead, especially with higher levels of redundancy.
We have chosen a prime number, which is most efficient at distributing data in hash-based partitioning logic. The number of buckets will have some impact on query performance: as buckets are translated to Spark tasks, there will be task scheduling overhead with a higher number of buckets.
But if your cluster has more capacity in terms of number of CPUs, you should certainly try to match the number of buckets to a nearby prime number.
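For reference, a hedged sketch of where this number is set (the trades table and its columns are hypothetical, and 127 is just an example prime): the bucket count goes in the table's OPTIONS clause:

CREATE TABLE trades (
    id INT,
    symbol STRING,
    price DECIMAL(10, 2)
) USING column OPTIONS (
    PARTITION_BY 'id',   -- column fed to the hash partitioning
    BUCKETS '127'        -- a prime, at least 4x the expected data stores
);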

UPDATE vs COUNT vs SELECT performance

Is this statement true or false:
The performance of these queries
SELECT * FROM table;
UPDATE table SET field = 1;
SELECT COUNT(*) FROM table;
is identical.
Or is there ever a case in which the performance of one will greatly differ from the other?
UPDATE
I'm more interested if there's a large difference between the SELECT and the UPDATE. You can ignore the COUNT(*) if you want
Assume the SELECT performs a full table scan. The UPDATE will also update all rows in the table.
Assume the UPDATE is only updating one field, though it will update all rows (it's an indexed field).
I know that they'll take different times and that they do different things. What I want to know is whether the difference will be significant or not. E.g., if the UPDATE takes 5 times longer than the SELECT, then it's significant. Use this as the threshold. And there's no need to be precise; just give an approximation.
There are different resource types involved:
disk I/O (this is the most costly part of every DBMS)
buffer pressure: fetching a row will cause fetching a page from disk, which will need buffer memory to be stored in
work/scratch memory for intermediate tables, structures and aggregates.
"terminal" I/O to the front-end process.
cost of locking, serialisation, versioning and journaling
CPU cost: this is negligible in most cases (compared to disk I/O)
The UPDATE query in the question is the hardest: it will cause all disk pages for the table to be fetched, put into buffers, altered into new buffers and written back to disk. In normal circumstances, it will also cause other processes to be locked out, with contention and even more buffer pressure as a result.
The SELECT * query needs all the pages, too; and it needs to convert/format them all into frontend-format and send them back to the frontend.
The SELECT COUNT(*) is the cheapest on all resources. In the worst case, all the disk pages have to be fetched. If an index is present, less disk I/O and fewer buffers are needed. The CPU cost is still negligible (IMHO) and the "terminal" output is marginal.
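A hedged way to verify this yourself (a PostgreSQL sketch; the table t and column field are stand-ins for your own) is to run all three statements under EXPLAIN ANALYZE, rolling back the UPDATE so the experiment leaves the table unchanged:

EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM t;
EXPLAIN (ANALYZE, BUFFERS) SELECT COUNT(*) FROM t;

BEGIN;
-- ANALYZE actually executes the statement, hence the transaction:
EXPLAIN (ANALYZE, BUFFERS) UPDATE t SET field = 1;
ROLLBACK;

The BUFFERS option shows how many pages each statement touched, which maps directly onto the disk-I/O and buffer-pressure costs listed above.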
When you say "performance", do you mean "how long it takes them to execute"?
One of them is returning all data in all rows.
One of them (if you remove the "FROM") is writing data to the rows.
One is counting rows and returning none of the data in the rows.
All three of those queries are doing entirely different things. Therefore, it is reasonable to conclude that all three of them will take different amounts of time to complete.
Most importantly, why are you asking this question? What are you trying to solve? I have a bad feeling that you're going down a wrong path by asking this.
I have a large (granted indexed) table here at work, and this is what I found
select * from X (limited to the first 100,000 records) (12.5 seconds)
select count(*) from X (counted millions of records) (15.57 seconds)
Update on an indexed table is very fast (less than a second)
The SELECT and UPDATE should be about the same (but they could easily vary; this depends on the database). COUNT(*) is cached in many databases, at some level, so that query could easily be O(1). Of course a lazy implementation of UPDATE could also be O(1), but I don't know of anyone doing that currently.
tl;dr: "False" or "it depends".
All three queries do vastly different things.
They each have their own performance characteristics and are not directly comparable.
Can you clarify what you are attempting to investigate?

Nvidia Cuda Program - is my problem appropriate for the Cuda architecture?

I've been reading about Nvidia Cuda and I've seen some questions on SO that people have answered where they include the comment that "your problem is not appropriate to be running on a GPU".
At my office, we have a database that has an enormous number of records that we query against, and it can take forever. We've implemented SQL queries that SELECT DISTINCT or they apply an uppercase function against a value. As an introduction to Cuda, I thought about writing a program that could take all the strings and uppercase them on the GPU.
I've been reading a book about Cuda where the author talks about trying to make the GPU cores execute as much as possible in order to hide latency of reading data across the PCI bus or putting things in global memory. Since the memory sizes are pretty small and since I have millions of distinct words, naturally I'm going to saturate the bus and starve the GPU cores.
Is this a problem that would not receive a fantastic performance boost from a graphics card as opposed to the CPU?
Thanks,
mj
We've implemented SQL queries that SELECT DISTINCT or they apply an uppercase function against a value.
Have you considered adding a column in your table with precomputed upper case versions of your strings?
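For instance, a minimal sketch in PostgreSQL (the words table and name column are hypothetical); depending on your database you can either store the uppercase value in a generated column or index the expression directly:

-- Option 1: a stored, precomputed column (PostgreSQL 12+):
ALTER TABLE words
    ADD COLUMN name_upper text GENERATED ALWAYS AS (upper(name)) STORED;
CREATE INDEX words_name_upper_idx ON words (name_upper);

-- Option 2: index the expression itself, so no extra column is needed:
CREATE INDEX words_upper_name_idx ON words (upper(name));
SELECT * FROM words WHERE upper(name) = 'FOO';  -- can use the index

Either way, the uppercasing happens once at write time instead of in every query, which removes the work you were considering offloading to the GPU.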
I'm inclined to think that if your database is entirely in RAM and queries still take "forever", your database may not be properly structured and indexed. Examine your query plans.
I think that, in the normal case, where your selects are neatly covered by indexes, you won't be able to optimize with the GPU. But maybe there are things that could be optimized for the GPU, like queries that require table scans such as LIKE queries with wildcards and queries that select rows based on calculations (value less than, etc). Maybe even things like queries with many joins when join columns have many duplicated values.
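For instance (a sketch against a hypothetical words table), the classic table-scan case is a leading-wildcard LIKE:

-- A leading wildcard defeats an ordinary B-tree index and forces a scan:
SELECT * FROM words WHERE name LIKE '%foo%';
-- An anchored pattern can use an index on name:
SELECT * FROM words WHERE name LIKE 'foo%';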
The key to such an implementation would be to keep a mirror of some of the data in your database on the GPU and keep it in sync with the database. And then run operations such as parallel reductions on that data to come up with row IDs to then use for selects against the regular database.
Before taking such a step, though, I would explore the countless possibilities for database query optimizations that use space-time tradeoffs.
You will have a pretty big bottleneck in global memory access since your operation/transfer ratio is O(1).
What would probably be more worthwhile is doing the comparisons on the GPU, as that has a much larger operation/transfer ratio.
While you load a string into shared memory to do this, you could also capitalize it, effectively including what you wanted to do before, and a bit more.
I can't help but feel a CPU based implementation would probably give you better performance. It would, at least, give you fewer headaches...

SQL Server minus MySQL Features

I know MySQL and am trying to learn SQL Server.
I am looking for features/keywords that are in SQL Server and not in MySQL.
E.g.: TOP, CLUSTERED/NONCLUSTERED indexes, etc.
Any links/pointers are appreciated.
Thanks!
I hope you don't mind me pasting from an answer I gave to someone else. The question was about performance in general, but in covering all aspects of performance I touched on most of the engine features as well. Giving yourself a thorough education in performance is going to also give you a thorough education in features.
So here are basic performance items to research. This is slanted toward MSSQL but not exclusive to it:
The basic logical data storage architecture of the system you're working with. For example, b-tree, extent, page, the sizes and configurations of these, how much data is read at once, the maximum size of a row (if that is an issue in your DBMS), what is done with out-of-row data (again if that is an issue in your DBMS).
Indexes, constraints, and basic ordering of tables and row data: heaps, clustered, nonclustered, uniqueness and non-uniqueness of these indexes, primary keys, unique constraints, included columns. In all these indexes, whether nulls are allowed, just one null is allowed, or none. Uniqueifiers. Covering indexes.
SARGability (look up SARG which is short for "Search ARGument").
Foreign keys, defaults, cascade deletes/update, their effect on inserts and deletes.
Whether NULLs require any storage space and if this is affected by column position. The number of bytes required to store each data type. When trailing spaces are stored or not stored for string data types. Packed vs. non-packed data types (e.g. float and decimal vs. integer). The concept of rows per page (or smallest unit of disk read) in both clustered and nonclustered indexes.
Fill factor, fragmentation, statistics, index selectivity, page splits, forwarding pointers.
When "batching" an operation can boost performance and why, and how to do it most efficiently.
INNER, LEFT, RIGHT, FULL, and CROSS JOINs. Semi-joins (EXISTS) and anti-semi-joins (NOT EXISTS). Any other language-specific syntax such as USING in MySQL and CROSS APPLY/OUTER APPLY in SQL Server. The effect of putting a join condition in the ON clause of an outer join vs. putting it in the WHERE clause.
Independent subqueries, correlated subqueries, derived tables, common table expressions, understanding that EXISTS and NOT EXISTS generally appear to introduce a correlated subquery, but usually are seen in the execution plan as joins (semi or anti-semi joins).
Viewing and understanding execution plans either graphically or in text. Viewing the statistics/profile of CPU, reads, writes, and duration used by whole SQL batches or individual statements. Understanding the limitations of execution plans & profiles, which practically speaking means you generally have to use both to optimize well. Caching and reuse of execution plans, expiration of plans from the cache. Parameter sniffing and parameterization. Dynamic SQL in relation to these.
The relative costs of converting data types to other data types or just working with those data types. (For example, a solid rule of thumb is that working with strings is more costly than working with numbers.)
The generally exorbitant cost of row-by-row processing as opposed to set-based. The proper use for cursors (rare, though sometimes called for). How functions can hide execution plan costs. The tempting trap of writing functions that get called for every row when the problem could be solved in sets (though this can be tricky to learn how to see, especially because traditional application programming tends to train people to think in terms of functions like this).
Seeks, scans, range scans, "skip" scans. Bookmark lookups aka an index seek followed by table seek to the same table using the value found in the index seek. Loop, merge, and hash joins. Eager & lazy spools. Join order. Estimated row count. Actual row count.
When a query is too big and should be split into more than one, using temp tables or other means.
Multi-processor capabilities and the benefits and gotchas of parallel execution.
Tempdb or other temp file usage. Lifetime and scope of temp tables, table variables (if your DB engine has such). Whether statistics are collected for these (in SQL Server temp tables use statistics and table variables do not).
Locking, lock granularity, lock types, lock escalation, blocks, deadlocks. Data access pattern (such as UPDATE first, INSERT second, DELETE last). Intent, shared, exclusive locks. Lock hints (e.g. in SQL Server UPDLOCK, HOLDLOCK, READPAST, TABLOCKX).
Transactions and transaction isolation. Read committed, read uncommitted, repeatable read, serializable, snapshot, others I can't remember now.
Data files, file groups, separate disks, transaction logs, simple recovery, full recovery, oldest open transaction aka minimum log sequence number (LSN), file growth.
Sequences, arrays, lists, identity columns, windowing functions, TOP/rownum/limiting number of rows returned.
Materialized views aka indexed views. Calculated columns.
1 to 1, 1 to 0 or 1, 1 to many, many to many.
UNION, UNION ALL, and other "vertical" joins. SQL Server has EXCEPT and INTERSECT, too.
Expansion of IN () lists to OR. Expansion of IsNull(), Coalesce(), or other null-handling mechanisms to CASE statements.
The pitfalls of using DISTINCT to "fix" a query instead of dealing with the underlying problem.
How linked servers do NOT do joins across the link well, queries to a linked server often become row-by-row, large amounts of data can be pulled across the link to perform a join locally even if this isn't sensible.
The pitfall of doing any I/O or error-prone task in a trigger. The scope of triggers (whether they fire for every row or once for each data operation).
Making the front-end, GUI, reporting tool, or other client do client-type work (such as formatting dates or numbers as strings) instead of the DB engine.
Error handling. Rolling back transactions and how this always rolls back to the first transaction no matter how deeply nested, but a COMMIT only commits one level of work.
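As one concrete illustration of that last point, a hedged T-SQL sketch (the table t and column field are stand-ins):

BEGIN TRANSACTION;
    BEGIN TRANSACTION;
        UPDATE t SET field = 1;
    COMMIT;    -- only decrements @@TRANCOUNT; nothing is durable yet
ROLLBACK;      -- rolls back to the outermost BEGIN TRANSACTION,
               -- undoing the work "committed" above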
That is not a good way to learn a new DBMS. The differences are not only in the SQL dialect, but in how the DBMS works, its behaviour, practices, etc. Each new DBMS should be learned as if it were your first.
NewID() is one different keyword too, but I think keywords are not the only thing you need to look for. Stored procedure definitions also have differences. Security is different (the dbo user, use of Windows groups/users in MSSQL), networking too, and I agree with zerkms's answer that behavior and practices are different.
-- Delete all rows beyond the first 100 when ordered by columnA
-- (Oracle-style row_number() syntax):
delete from tableA where columnA in (
    select columnA from (
        select a.*, row_number() over (order by columnA) rn
        from tableA a)
    where rn > 100);