dbt - how to have create table make an autoincrement column - identity

We are building a data warehouse and leaning towards dbt (data build tool). In the process of writing a model, we need an autoincrement (identity) column. How do we instruct dbt to include that in the CREATE TABLE statement it generates?

I'm pretty sure this isn't possible. I recently had this as a requirement and I was not able to solve for an auto-incrementing id. But I also pushed back hard on the necessity of this. The nice thing with modern warehouses is that calculations are fast - you'd probably be better off using the dbt surrogate_key macro to just calculate a hash on the fly (the collision probability is vanishingly low). It's deterministic (meaning it'll compute the same value every time, assuming the underlying data doesn't change) and you don't have to worry about managing state.
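For illustration, here is a minimal sketch of what that looks like in a dbt model, assuming the dbt_utils package is installed and that the model name stg_orders and the columns order_id and customer_id are placeholders for your own:

    -- models/dim_orders.sql (hypothetical model and column names)
    select
        -- hash of the natural key columns, used in place of an identity column
        {{ dbt_utils.surrogate_key(['order_id', 'customer_id']) }} as order_key,
        order_id,
        customer_id,
        order_date
    from {{ ref('stg_orders') }}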

Related

How to implement custom fields in database

I need to implement custom fields in my database so every user can add any fields he wants to his forms/entities.
The user should be able to filter and/or sort his data by any custom field.
I want to work with MySQL because the rest of my data is very suitable for SQL. So, unless you have a great idea, SQL will be preferred over NoSQL.
We thought about few solutions:
JSON field - Great for a dynamic schema. Can be filtered and sorted. The problem is that it is slower than regular columns.
Dynamic indexes could solve that, but it seems too risky to add indexes dynamically.
Key-value table - A simple solution but a really slow one. You can't index it properly and the queries are awful.
Static placeholder columns - Create N columns and hold a map of each field to its placeholder. A good solution in terms of performance, but it makes the DB unreadable and the number of columns is limited.
Any thoughts how to improve any of the solutions or any idea for a new solution?
As many of the commenters have remarked, there is no easy answer to this question. Depending on which trade-offs you're willing to make, I think the JSON solution is neatest - it's "native" to MySQL, so easiest to explain and understand.
However, given that you write that the columns are specified only at set up time, by technically proficient people, you could, of course, have the set-up process include an "alter table" statement to add new columns. Your database access code and all the associated view logic would then need to be configurable too; it's definitely non-trivial.
However...it's a proven solution. Magento and Drupal, for instance, have admin screens for adding attributes to the business entities, which in turn adds columns to the relational database.
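To make the JSON route concrete, here is a rough sketch in MySQL (5.7+), with table and field names purely illustrative; a generated column exposes one custom field so it can be indexed for filtering and sorting:

    -- store all custom fields for an entity in one JSON column (illustrative names)
    ALTER TABLE entities ADD COLUMN custom_fields JSON;

    -- expose one frequently filtered field as a stored generated column and index it
    ALTER TABLE entities
        ADD COLUMN cf_color VARCHAR(64)
            GENERATED ALWAYS AS (JSON_UNQUOTE(JSON_EXTRACT(custom_fields, '$.color'))) STORED,
        ADD INDEX idx_cf_color (cf_color);

    -- filtering and sorting then work like normal columns
    SELECT id FROM entities WHERE cf_color = 'red' ORDER BY cf_color;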

update target table given DateCreated and DateUpdated columns in source table

What is the most efficient way of updating a target table given the fact that the source table contains a DateTimeCreated and DateTimeUpdated column?
I would like to keep the source and target in sync while avoiding a
truncate. I am looking for a best-practice pattern for this situation.
I'll avoid a best practice answer but give enough detail to make an appropriate choice. There are two main methods with which you might update a table in SSIS, avoiding a TRUNCATE - LOAD:
1) Use an OLE DB Command
This method is good if:
you have a reliable DateTimeUpdated column,
there are not many rows to update,
there are not a lot of columns to update
there are not many added columns in the data flow (e.g. derived column transforms)
and the update statement is fairly straightforward.
This method performs poorly with many columns because it performs a row-by-row update. Relying on an audit date column can be a great way to reduce the number of rows to update, but it can also cause problems if rows are updated in the source system and the audit column is not changed. I recommend only trusting it if it is maintained by a trigger or you can be certain that no human can perform updates on the table.
Additionally, this component falls short when there are a lot of columns to map or a lot of transforms going on in the data flow. For example, if you are converting all string columns from unicode to non-unicode, you may have many additional columns in the mix that will make mapping and maintenance a pain. The mapping tool in this component is good for about 10 columns; it starts to get confusing very quickly after that, especially because you are mapping to numbered parameters rather than column names.
Lastly, if you are doing anything complex in the update statement, it is better suited to SQL code than to maintaining it in the component's editor, which has no IntelliSense and is generally painful to use.
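For reference, an OLE DB Command typically wraps a parameterized statement like the sketch below (table and column names are illustrative); each ? is mapped by ordinal position to a data flow column, which is exactly why wide mappings become hard to maintain:

    -- row-by-row update issued by the OLE DB Command for each incoming row
    UPDATE dbo.TargetTable
    SET    ColumnA = ?,
           ColumnB = ?,
           DateTimeUpdated = ?
    WHERE  BusinessKey = ?;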
2) Stage the data and perform the update in Execute SQL task after the data flow
This method is good for all the reasons that the OLEDB command is bad for, but has some disadvantages as well. There is more code to maintain:
a couple of t-sql tasks,
a proc
and a staging table
This also means it takes more time to set up. However, it does perform very well and the code is far easier to read and understand. Ongoing maintenance is simpler as well.
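As a rough sketch of the set-based update you would run from the Execute SQL task (typically wrapped in a proc), assuming a staging table dbo.StageSource loaded by the data flow and a target dbo.TargetTable keyed on BusinessKey - all names illustrative:

    -- update only rows whose source audit date is newer than the target's
    UPDATE t
    SET    t.ColumnA         = s.ColumnA,
           t.ColumnB         = s.ColumnB,
           t.DateTimeUpdated = s.DateTimeUpdated
    FROM   dbo.TargetTable t
    JOIN   dbo.StageSource s
           ON s.BusinessKey = t.BusinessKey
    WHERE  s.DateTimeUpdated > t.DateTimeUpdated;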
Please see my notes from this other question that I happened to answer today on the same subject: SSIS Compare tables content and update another

Database design to create tables on the fly

I need to create dynamic tables in the database on the fly. For example, in the database I will have tables named:
Table
Column
DataType
TextData
NumberData
DateTimedata
BitData
Here I can add a table to the table named Table, then I can add all the columns for that table to the Column table and associate a datatype with each column.
Basically I want to create tables without actually creating a table in the database. Is this even possible? If so, can you direct me to the right place so I can research? Also, I would prefer SQL Server or any free database software.
Thanks
What you are describing is an entity-attribute-value model (EAV). It is a very poor way to design a data model.
Although the data model is quite flexible, querying such a data model is quite complicated. You frequently end up having to self-join a table n times if you want to select or filter on n different attributes. That gets slow and becomes hard to optimize rather quickly.
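To illustrate the self-join problem (table and column names here are hypothetical, not from the question), pulling just two attributes per entity already needs two joins, and each additional attribute adds another:

    -- EAV: one join per attribute you want to select or filter on
    SELECT e.entity_id,
           a1.value AS color,
           a2.value AS size
    FROM   entities e
    JOIN   entity_values a1 ON a1.entity_id = e.entity_id AND a1.attribute = 'color'
    JOIN   entity_values a2 ON a2.entity_id = e.entity_id AND a2.attribute = 'size'
    WHERE  a1.value = 'red';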
Plus, you generally end up building a lot of functionality that the database or your ORM would provide.
I'm not sure what the real problem you're having is, but the solution you proposed is the "database within a database" antipattern which makes so many people cringe.
Depending on how you're querying your data, if you were to structure things like you're planning, you'd either need a bunch of piece-wise queries which are joined in the middleware (slow) or one monster monolithic query (either slow or creates massive index bloat), if one is even possible.
If you must create tables on the fly, learn the CREATE TABLE, ALTER TABLE, and DROP TABLE DDL statements for the particular database engine you're using. Better yet, find an ORM that will do this for you. If your real problem is that you need to store unstructured data, check out MongoDB, Redis, or some of the other NoSQL variants.
My final advice is to write up the actual problem you're trying to solve as a separate question, and you'll probably learn a lot more.
Doing this with documents might be easier. Perhaps you should look at a NoSQL solution such as MongoDB.
Or you can still create the temporary tables, but use a cron job to create the temporary tables every %% hours and rename them to the correct name after the queries are done, so your site stays up.
What you are trying to achieve is not bad, but you must use it in the correct, logical way.
I did something like this in LedgerSMB. While we use EAV modelling for a few things (where the flexibility is needed and the sort of querying we are doing is straight-forward, for example menu nodes use this in part), in general, you want to stay away from this as much as possible.
A better approach is to do all of what you are doing except for the data columns. Then you can (shock of shocks) just create the tables. This gives you a catalog of what you have added so your app knows this (and you can diff from the system catalogs if you ever have to check!) but at the same time you get actual relational modelling.
What we did in LedgerSMB was to have stored procedures that accept a table name and check whether the corresponding extension table ('extends_' || the name supplied) exists. If so, they add a column with the required datatype and write this to the application catalogs. This gives us relational modelling of extended attributes. At load time, the application loads the application catalogs and writes queries as appropriate at the appropriate points to load/save the data. It works pretty well, actually.
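A rough sketch of the idea (the table and catalog names here are illustrative, not LedgerSMB's actual schema): the attribute lives in a real column on an extension table, and an application catalog records that it exists:

    -- add the extended attribute as a real relational column
    ALTER TABLE extends_customer ADD COLUMN loyalty_tier VARCHAR(20);

    -- record it in the application's own catalog so the app can build queries at load time
    INSERT INTO app_attribute_catalog (base_table, column_name, data_type)
    VALUES ('customer', 'loyalty_tier', 'VARCHAR(20)');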

MYSQL - Database Design Large-scale real world deployment

I would love to hear some opinions or thoughts on a mysql database design.
Basically, I have a Tomcat server which receives different types of data from about 1000 systems out in the field. Each of these systems is unique and will be reporting unique data.
The data sent can be categorized as frequent and infrequent data. The infrequent data is only sent about once a day and doesn't change much - it is basically just configuration-based data.
Frequent data is sent every 2-3 minutes while the system is turned on, and represents the current state of the system.
This data needs to be stored in the database for each system and be accessible at any given time from a PHP page. Essentially, for any system in the field, a PHP page needs to be able to access all the data on that client system and display it. In other words, the database needs to show the state of the system.
The information itself is all text-based, and there is a lot of it. The config data (that doesn't change much) is key-value pairs, and there are currently about 100 of them.
My idea for the design was to have 100+ columns and 1 row for each system to hold the config data. But I am worried about having that many columns, mainly because it isn't very future-proof if I need to add columns later. I am also worried about insert speed if I do it that way. This might blow out to a 2000-row x 200-column table that gets accessed about 100 times a second, so I need to cater for this in my initial design.
I am also wondering if there are any design philosophies out there that cater for frequently changing and seldom-changing data based on the storage engine. This would make sense, as I want to keep INSERT/UPDATE time low, and I don't care too much about the SELECT time from PHP.
I would also love to know how to split up the data. I.e. if frequently changing data can be categorised in a few different ways, should I have a bunch of tables representing the data and join them on selects? I am worried about this because I will probably have to make a report showing common properties between all systems (i.e. show all systems with a certain condition).
I hope I have provided enough information here for someone to point me in the right direction; any help on the matter would be great. Or if someone has done something similar and can offer advice, I would be very appreciative. Thanks heaps :)
~ Dan
I've posted some questions in a comment. It's hard to give you advice about your rapidly changing data without knowing more about what you're trying to do.
For your configuration data, don't use a 100-column table. Wide tables are notoriously hard to handle in production. Instead, use a four-column table containing these columns:
SYSTEM_ID VARCHAR System identifier
POSTTIME DATETIME The time the information was posted
NAME VARCHAR The name of the parameter
VALUE VARCHAR The value of the parameter
The first three of these columns are your composite primary key.
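A minimal sketch of that table in MySQL DDL (the table name and column lengths are illustrative):

    CREATE TABLE system_config (
        `SYSTEM_ID` VARCHAR(64)  NOT NULL,  -- system identifier
        `POSTTIME`  DATETIME     NOT NULL,  -- when the value was posted
        `NAME`      VARCHAR(128) NOT NULL,  -- parameter name
        `VALUE`     VARCHAR(255),           -- parameter value
        PRIMARY KEY (`SYSTEM_ID`, `POSTTIME`, `NAME`)
    );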
This design has the advantage that it grows (or shrinks) as you add to (or subtract from) your configuration parameter set. It also allows for the storing of historical data. That means new data points can be INSERTed rather than UPDATEd, which is faster. You can run a daily or weekly job to delete history you're no longer interested in keeping.
(Edit: if you really don't need history, get rid of the POSTTIME column and use MySQL's nice extension feature INSERT ... ON DUPLICATE KEY UPDATE when you post stuff. See http://dev.mysql.com/doc/refman/5.0/en/insert-on-duplicate.html)
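That no-history variant would look roughly like this, with the table keyed on (SYSTEM_ID, NAME) only (the table name and values are just placeholders):

    -- insert a new parameter, or overwrite the existing value for this system/parameter pair
    INSERT INTO system_config (`SYSTEM_ID`, `NAME`, `VALUE`)
    VALUES ('sys-0042', 'sample_rate', '250')
    ON DUPLICATE KEY UPDATE `VALUE` = VALUES(`VALUE`);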
If your rapidly changing data is similar in form (name/value pairs) to your configuration data, you can use a similar schema to store it.
You may want to create a "current data" table using the MEMORY access method for this stuff. MEMORY tables are very fast to read and write because the data is all in RAM in your MySQL server. The downside is that a MySQL crash and restart will give you an empty table, with the previous contents lost. (MySQL servers crash very infrequently, but when they do they lose MEMORY table contents.)
You can run an occasional job (every few minutes or hours) to copy the contents of your MEMORY table to an on-disk table if you need to save history.
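A sketch of that arrangement, with all table names illustrative: a MEMORY table keyed per system and parameter holds the current state, and a scheduled job copies it into an ordinary on-disk history table:

    -- current state, held entirely in RAM for fast reads and writes
    CREATE TABLE current_state (
        `SYSTEM_ID` VARCHAR(64)  NOT NULL,
        `NAME`      VARCHAR(128) NOT NULL,
        `VALUE`     VARCHAR(255),
        `POSTTIME`  DATETIME     NOT NULL,
        PRIMARY KEY (`SYSTEM_ID`, `NAME`)
    ) ENGINE=MEMORY;

    -- run from a scheduled job every few minutes/hours to persist history on disk
    INSERT INTO state_history (`SYSTEM_ID`, `NAME`, `VALUE`, `POSTTIME`)
    SELECT `SYSTEM_ID`, `NAME`, `VALUE`, `POSTTIME` FROM current_state;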
(Edit: You might consider adding memcached http://memcached.org/ to your web application system in the future to handle a high read rate, rather than constructing a database design for version 1 that handles a high read rate. That way you can see which parts of your overall app design have trouble scaling. I wish somebody had convinced me to do this in the past, rather than overdesigning for early versions. )

Best practice to Load Fact table in MS SSIS

I am new to SSIS and data warehousing. I am using Microsoft Business Intelligence Studio.
I have 5 Dimensions each having some PK.
I have a fact table that contains all the PKs of the dimensions, which means a foreign key relationship exists (as in a star schema).
Now, what is the best practice to load the fact table?
What I have done is write a cross join query between the 5 dimensions, and the resultant set is dumped into the fact table. But I don't think this is good practice.
I am completely new to MS SSIS, so please describe suggestions in detail.
thanks
Take a look at the Microsoft Project Real examples. Also get a Kimball book and read up on loading fact tables -- the topic covers several chapters.
I would echo @Damir's points about Project Real and Kimball. I am a fan of both.
To give you some more thoughts and answer your question:
load your date dimension and other "static" dimensions as a one-off load
load records into all your dimensions to take care of NULL and UNKNOWN values
load your dimensions. For your dimensions, decide on a column-by-column basis which columns you want as type 1 or type 2 slowly changing dimension columns. Be cautious and choose them mostly as type 1 unless there is a good reason.
[edited] load your fact table by joining your staging transaction data (which will go into a fact table) to your new dimension tables using the business keys, looking up the dimensions' surrogate foreign keys as you go. E.g. sales transactions will have a store number (the business key), which you would look up in DimStore (already loaded in the previous step); that gives you the kStore of DimStore, and you record that kStore against the transaction in FactSalesTransaction, as sketched below.
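A rough set-based version of that lookup (the staging table name and the DimDate/kDate and measure columns are illustrative additions):

    -- resolve business keys to dimension surrogate keys while loading the fact table
    INSERT INTO FactSalesTransaction (kStore, kDate, SalesAmount)
    SELECT d.kStore,
           dd.kDate,
           s.SalesAmount
    FROM   StagingSalesTransaction s
    JOIN   DimStore d  ON d.StoreNumber = s.StoreNumber        -- business key lookup
    JOIN   DimDate  dd ON dd.DateValue  = s.TransactionDate;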
Other general things you should consider (not related to your question, but worth thinking about if you are starting out):
Data archiving. How long will you keep data online? / when will it be deleted?
Table partitioning. If you have very large fact table(s), you should consider partitioning on a date or subject area basis. Date is quite nice, as you can do some interesting things with regard to dropping old partitions when the data is too old, as part of the standard load process.
Having the DWH as a snowflaked schema, then using a set of views to flatten the snowflake into a star. This is particularly useful when putting an OLAP cube on top of a SQL DWH, as it simplifies the cube design.
How are you going to manage different environments (Dev/Test/etc/Prod)? Using one of the SQL Server configuration styles is imperative.
Build a template SSIS package with all the variables you need and the configuration/connection strings you want. It will save loads of time to do that now, rather than having to rework packages when you discover new things. Do trivial prototypes initially to prove your methodology!