I have a couple of questions.
I would like to know if we need to worry about distribution in Netezza while using only select statements (not creating tables).
I am basically trying to create a dataset in SAS by connecting to Netezza and selecting from a view that has a couple of joins. I am wondering how this will affect the performance of Netezza if I am creating the table directly in SAS.
I am creating a table by joining two other tables on customer_id. However, the output dataset does not contain customer_id as a column. Can I distribute this table on customer_id?
Thanks.
For your first question, you typically don't need to worry about distribution if you aren't creating a table. It does help to understand the distribution methods of the tables you are selecting from, but it's certainly not a requirement. Having a distribution method that supports the particular joins you are doing can certainly help performance during the select (e.g. if your join columns are a superset of the distribution columns then you'll get co-located joins), but if the target of the output is SAS, then there's no effect on the write of the dataset to SAS.
For your second question, a table is distributed either on a column, or columns, in the table itself, or via a RANDOM (aka round robin) distribution method. In your case, if you are storing your data set in a table on Netezza, you could not distribute the data on customer_id as that column is not included in the data set.
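To make the second point concrete, here is a minimal sketch of the two options in Netezza DDL (the table and column names are just placeholders, not anything from your environment): you can only distribute on columns that actually exist in the table, otherwise the fallback is random distribution.

    -- Only possible if customer_id is actually a column of the table:
    CREATE TABLE customer_orders_agg (
        customer_id   BIGINT,
        order_total   NUMERIC(18,2)
    )
    DISTRIBUTE ON (customer_id);

    -- If customer_id is not in the result set, round-robin is the fallback:
    CREATE TABLE orders_agg (
        order_total   NUMERIC(18,2)
    )
    DISTRIBUTE ON RANDOM;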
I have the following problem:
We have a lot of different, yet similar, types of data items that we want to record in a (MariaDB) database. All data items have some common parameters such as id, username, status, file glob, type, comments, and start and end timestamps. In addition there are many (let's say between 40 and 100) parameters that are specific to each type of data item.
We would prefer to have the different data item types in the same table because they will be displayed along with several other data, as they happen, in one single list in the web application. This will appear like an activity stream or "Facebook wall".
It seems that the normalised approach, with a top-level generic table joined to specific tables underneath, will lead to bad performance. We would have to do a lot of both joins and unions in order to display the activity stream, and the application will poll with this query frequently, so it's important that the query runs fast.
So, which of the following is the better solution in terms of performance and storage optimization?
to utilize MariaDB's dynamic columns
to just add in all the different kinds of columns we need in one table, and just accept that each data item type will only use a few of the columns, i.e. the rest will be null.
something else?
Does it matter if we use regular columns when a lot of the data in them will be null?
When should we use dynamic columns and when is it better to use regular columns?
I believe you should have separate columns for the values you are filtering by. However, you might have some unfiltered values; for those, it might be a good idea to store them in a single column as a JSON object (simple to encode/decode).
A few columns -- the main ones for use in WHERE and ORDER BY clauses (but not necessarily all the columns you might filter on).
A JSON column or MariaDB Dynamic columns.
See my blog on why not to use EAV schema. I focus on how to do it in JSON, but MariaDB's Dynamic Columns is arguably better.
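As an illustration of that hybrid layout, here is a rough sketch using MariaDB dynamic columns; the table and attribute names are made up for the example, and a JSON column would work the same way with JSON_EXTRACT in place of COLUMN_GET.

    -- Common, filterable attributes as regular columns; type-specific
    -- attributes packed into one dynamic-columns BLOB.
    CREATE TABLE data_item (
        id          BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
        username    VARCHAR(64)  NOT NULL,
        item_type   VARCHAR(32)  NOT NULL,
        status      VARCHAR(16)  NOT NULL,
        started_at  DATETIME     NOT NULL,
        ended_at    DATETIME     NULL,
        extra       BLOB,                -- dynamic columns live here
        KEY idx_stream (started_at, item_type)
    );

    -- Writing one item with its type-specific parameters:
    INSERT INTO data_item (username, item_type, status, started_at, extra)
    VALUES ('alice', 'import', 'running', NOW(),
            COLUMN_CREATE('file_glob', '*.csv', 'rows_expected', 100000));

    -- Reading a type-specific value back for the activity stream:
    SELECT id, username,
           COLUMN_GET(extra, 'rows_expected' AS INTEGER) AS rows_expected
    FROM data_item
    WHERE item_type = 'import'
    ORDER BY started_at DESC
    LIMIT 50;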
Is it possible to create a custom function in MySQL, like SUM, MAX, and so on, that accepts multiple columns and does some operation on each row?
The reason I am asking is that I tried to implement my logic in a stored procedure, but unfortunately couldn't find a way to select data from a table whose name is an input parameter.
Somebody suggested using dynamic SQL, but I cannot open a cursor over a dynamically prepared statement. So my only hope is a custom-defined function.
To make the question clearer, here is what I want to do:
I want to calculate the distance of a route where each row in the database table represents a pair of coordinates (latitude and longitude). Unfortunately the data I have is really big, and if I query the data and do the calculations in Java it takes more than half a minute just to transfer the data to the web server, so I want to do the calculations on the SQL machine.
SELECT something1, something2 FROM table_name  -- where table_name is a variable
Multiple identically-structured tables (a prerequisite for this sort of query) are contrary to the Principle of Orthogonal Design.
Don't do it. At least not without very good reason—with suitable indexes, (tens of) millions of records per table is easily enough for MySQL to handle without any need for partitioning; and even if one does need to partition the data, there are better ways than this manual kludge (which can give rise to ambiguous, potentially inconsistent data and lead to redundancy and complexity in your data manipulation code).
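Assuming the route points can live in one table with a route identifier (a hypothetical layout, not something from your question), the per-route distance can then be computed entirely on the MySQL side with no dynamic table names at all; a sketch, requiring MySQL 8.0+ for LAG and using ST_Distance_Sphere:

    -- One table for all routes instead of one table per route.
    CREATE TABLE route_point (
        route_id  BIGINT NOT NULL,
        seq       INT    NOT NULL,
        lat       DOUBLE NOT NULL,
        lon       DOUBLE NOT NULL,
        PRIMARY KEY (route_id, seq)
    );

    -- Total distance per route, summing the spherical distance between
    -- consecutive points (ST_Distance_Sphere returns metres).
    SELECT route_id,
           SUM(ST_Distance_Sphere(POINT(lon, lat),
                                  POINT(prev_lon, prev_lat))) AS distance_m
    FROM (
        SELECT route_id, lat, lon,
               LAG(lat) OVER (PARTITION BY route_id ORDER BY seq) AS prev_lat,
               LAG(lon) OVER (PARTITION BY route_id ORDER BY seq) AS prev_lon
        FROM route_point
    ) p
    WHERE prev_lat IS NOT NULL
    GROUP BY route_id;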
I am writing a server using Netty and MySQL(with JDBC connector/J).
Please note that I am very new to server programming.
Say I have an application where users enter about 20 pieces of information about themselves.
And I need to write some methods that require only specific pieces of that information.
Instead of using "select dataOne, dataTwo from tableOne where userNum=1~1000"
create a new table called tableTwo containing only dataOne and dataTwo.
Then use "select * from tableTwo where userNum=1~1000"
Is it good practice to make a table like this for every method I need?
If not, what would be a better practice?
You should not be replicating data.
SQL is made in such a way that you specify the exact columns you want right after the SELECT keyword.
There is no overhead to selecting specific columns; that is what SQL is designed for.
There is overhead to replicating your data and storing it in two different tables.
Consequences of using such a design:
In a world where we only used select *, we would need a different table for each combination of columns we wanted in the results.
Consequently, we would be storing the same data repeatedly. If you needed 10 different column combinations, this would be 10X your data.
Finally, data manipulation statements (UPDATE, INSERT) would need to update the same data in multiple tables, also multiplying the time needed to perform basic operations.
It would make databases effectively unscalable.
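A sketch of the usual alternative: keep the single table and, if the two columns are queried together very frequently, add a covering index so the server never has to read the rest of the row (table and column names follow the question; the index name is invented).

    -- Covering index for the frequent two-column lookup.
    CREATE INDEX idx_usernum_data ON tableOne (userNum, dataOne, dataTwo);

    -- Read only the columns the method needs; no second table required.
    SELECT dataOne, dataTwo
    FROM tableOne
    WHERE userNum BETWEEN 1 AND 1000;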
I'm looking to create an order history for each client at work using MySQL.
I was wondering if it is best practice to create a separate table for each client, with each row identifying an order they've placed, or to have one table with all orders and a column with an identifier for each client, which would be used to populate their order history.
We're looking at around 50-100 clients, each adding 10-20 orders a year, so I am trying to make this as efficient as I can, performance-wise.
Any help would be appreciated. Thanks.
It is never a good idea to create a separate table for specific data (e.g. per client) as this destroys relational integrity / flexibility within the RDBMS itself. You would have to have something external that adds/removes the tables, and the tables wouldn't have integrity between each other.
The answer is in your second sentence: One table for orders that has a column that points to the unique identifier for clients. This is as efficient as possible, especially for such small numbers.
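For reference, a minimal sketch of that single-orders-table design (all names are illustrative): the foreign key keeps the integrity between clients and orders inside the RDBMS, and the index makes the per-client history lookup cheap even at far larger volumes than 50-100 clients.

    CREATE TABLE client (
        client_id  INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
        name       VARCHAR(100) NOT NULL
    );

    CREATE TABLE client_order (
        order_id    INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
        client_id   INT UNSIGNED NOT NULL,
        ordered_at  DATETIME     NOT NULL,
        total       DECIMAL(10,2),
        CONSTRAINT fk_order_client FOREIGN KEY (client_id) REFERENCES client (client_id),
        KEY idx_client_history (client_id, ordered_at)
    );

    -- One client's order history:
    SELECT order_id, ordered_at, total
    FROM client_order
    WHERE client_id = 42
    ORDER BY ordered_at DESC;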
I've got a SQL Server 2008 R2 database which has a number of tables. Two of these tables contain a lot of large data, mainly because one of them is VARBINARY(MAX) and the sister table is GEOGRAPHY. (Why two tables? Read below if you're interested.***)
The data in these tables are geospatial shapes, such as zipcode boundaries.
Now, the first 70K-odd rows are for DataType = 1,
and the remaining 5 million rows are for DataType = 2.
Now, is it possible to split the table data into two files, so that all rows with DataType != 2 go into File_A and rows with DataType = 2 go into File_B?
That way, when I back up the DB, I can skip File_B so my download is waaaaay smaller. Is this possible?
I'm guessing you might be thinking: why not keep them as TWO extra tables? Mainly because in the code the data is conceptually the same .. it just happens that I want to split the storage of this model data. It really messes up my model if I now have two aggregates in my model instead of one.
***Entity Framework doesn't like tables with GEOGRAPHY, so I have to create a new table which transforms the GEOGRAPHY to VARBINARY, and then drop that into EF.
It's a bit of overkill, but you could use table partitioning to do this, as each partition can be mapped to a distinct filegroup. Some caveats:
Table partitioning is only available in Enterprise (and developer) edition
Like clustered indexes, you only get one, so be sure that this is how you'd want to partition your tables
I'm not sure how well this would play out against "selective backups" or, much more importantly, partial restores. You'd want to test a lot of oddball recovery scenarios before going to Production with this
An older-fashioned way to do it would be to set up partitioned views. This gives you two tables, true, but the partitioned view "structure" is pretty solid and fairly elegant, and you wouldn't have to worry about having your data split across multiple backup files.
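As a rough sketch of the partitioned-table route (the database, filegroup, file, and table names are invented for the example), the partition function splits on DataType and the scheme maps each range to its own filegroup:

    -- Two filegroups, added to the database ahead of time.
    ALTER DATABASE MyDb ADD FILEGROUP FG_Small;
    ALTER DATABASE MyDb ADD FILEGROUP FG_Large;
    ALTER DATABASE MyDb ADD FILE
        (NAME = 'File_A', FILENAME = 'C:\Data\File_A.ndf') TO FILEGROUP FG_Small;
    ALTER DATABASE MyDb ADD FILE
        (NAME = 'File_B', FILENAME = 'C:\Data\File_B.ndf') TO FILEGROUP FG_Large;

    -- DataType < 2 lands in the first partition, DataType >= 2 in the second.
    CREATE PARTITION FUNCTION pfDataType (int)
        AS RANGE RIGHT FOR VALUES (2);

    CREATE PARTITION SCHEME psDataType
        AS PARTITION pfDataType TO (FG_Small, FG_Large);

    -- The table (or its clustered index) is then created on the scheme;
    -- the partitioning column must be part of the clustered key.
    CREATE TABLE dbo.ShapeData (
        Id       int IDENTITY(1,1) NOT NULL,
        DataType int NOT NULL,
        Payload  varbinary(max) NULL,
        CONSTRAINT PK_ShapeData PRIMARY KEY CLUSTERED (Id, DataType)
    ) ON psDataType (DataType);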
I think you might want to look into data partitioning. You can partition your data into multiple file groups, and therefore files, based on a key value, such as your DataType column.
Data partitioning can also help with performance. So, if you need that too, you can check out partition schemes for your indexes as well.
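And, to the original motivation of a smaller download: once the big rows sit in their own filegroup, you can back up just the filegroups you care about. A hedged example, with placeholder database, filegroup, and path names, and with the earlier caveat about testing partial restores still applying:

    -- Back up everything except the large filegroup.
    BACKUP DATABASE MyDb
        FILEGROUP = 'PRIMARY',
        FILEGROUP = 'FG_Small'
    TO DISK = 'C:\Backups\MyDb_small.bak';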