Handling dynamically missing source columns in SSIS

I have a small SSIS question. I'm extracting data from a MySQL table with a varying column list to a SQL Server table with a fixed column list.
source table: Test(mysql server)
id | name | sal | deptno | loc | referby
1 | abc | 100 |10 | hyd | xyz
2 | mnc | 200 |20 |chen | pqr
First I configure the MySQL source for the source table, then I drag and drop an OLE DB Destination for the SQL Server table. I configure the target table, and after that the package works fine and the data looks like below.
Target table : Test (sql server )
id | name | sal |deptno | loc |referby
1 | abc | 100 |10 | hyd | xyz
2 | mnc | 200 |20 |chen | pqr
The second time I run the package, a column has been removed from the source table's schema, so the package fails. I open the MySQL source configuration and edit the query to return NULL for the missing column:
select id,'null' as name,sal,deptno,loc,referby from test
I rerun the package and the data looks like this.
Target table : Test (sql server )
id | name | sal |deptno | loc |referby
1 | null | 100 |10 | hyd | xyz
2 | null | 200 |20 |chen | pqr
I always truncate the target table and load data.
The target table has an unchanging column list while the source table's column list can vary. I do not want to keep editing the query to account for possible missing columns. How can I handle this at the package level?

A couple ideas:
Use dynamic SQL. Replace your straightforward SELECT ... with a query that iterates through the target table's expected column list, checks which of those columns actually exist in the source (perhaps via SHOW COLUMNS or INFORMATION_SCHEMA.COLUMNS), builds a SELECT statement that substitutes NULL for the missing columns, and then executes it via PREPARE and EXECUTE.
The query-generating query would need to produce a SELECT statement containing the fixed set of columns your target table expects to see. If an expected column doesn't exist in the source, the query-generating query should insert the placeholder NULL AS ColumnName in the query.
(I'm not a MySQL expert so I'm unsure of MySQL's exact capabilities in this regard, but in theory this approach sounds workable; a rough sketch follows below.)
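A minimal sketch of that idea, assuming the source table is named test and the six expected columns from the example above; the expected-columns derived table, the ord ordering column, and the @sql variable are all illustrative, not anything SSIS or MySQL gives you for free:

SET @sql = (
    SELECT CONCAT('SELECT ',
                  GROUP_CONCAT(
                      CASE WHEN c.COLUMN_NAME IS NULL
                           THEN CONCAT('NULL AS ', expected.col)  -- column missing in source
                           ELSE expected.col
                      END
                      ORDER BY expected.ord SEPARATOR ', '),
                  ' FROM test')
    FROM (SELECT 'id' AS col, 1 AS ord UNION ALL SELECT 'name', 2 UNION ALL SELECT 'sal', 3
          UNION ALL SELECT 'deptno', 4 UNION ALL SELECT 'loc', 5 UNION ALL SELECT 'referby', 6) AS expected
    LEFT JOIN INFORMATION_SCHEMA.COLUMNS c
           ON c.TABLE_SCHEMA = DATABASE()
          AND c.TABLE_NAME   = 'test'
          AND c.COLUMN_NAME  = expected.col
);

PREPARE stmt FROM @sql;
EXECUTE stmt;
DEALLOCATE PREPARE stmt;

Bear in mind that an SSIS data flow source generally wants a plain SELECT with stable metadata, so you would probably wrap something like this in a stored procedure or build the query string in an earlier task and feed it to the source as a SQL command from a variable.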
Use a Script Component as the data source. Configure this component with the output columns you expect. Have the component query the source database (maybe using a simple SELECT * FROM ...) and then copy only the columns that actually exist in the source into the output row buffer. With this approach, columns that don't exist in the source will automatically go into the data flow as NULL/their default value, because the Script Component won't have set them to a value.

SSIS is very rigid when it comes to dynamic sources like this. I think your best bet would be to explore BIML, which could generate a new package for you each time you need to "refresh" the schema.
http://www.sqlservercentral.com/stairway/100550/

Related

MySQL - Count rows that contain a particular CSV value

2-column MySQL Table:
| id| class |
|---|---------|
| 1 | A,B |
| 2 | B,C,D |
| 3 | C,D,A,G |
| 4 | E,F,G |
| 5 | A,F,G |
| 6 | E,F,G,B |
The requirement is to generate a report/output that tells, for each individual CSV value in the class column, how many rows contain it.
For example, A is present in 3 rows (with id 1,3,5), C is present in 2 rows (with id 2,3), and G is in 4 rows (with id 3,4,5,6), so the output report should be
A - 3
B - 3
C - 2
...
...
G - 4
Essentially, column id can be ignored.
The draft I can think of: first all the values of the class column need to be picked and split on the comma, then a distinct list of each unique value (A,B,C...) is created, and then we count how many rows contain each unique value from that distinct list.
While I know basic SQL queries, this is way too complex for me. I am unable to match it with some CSV split function in MySQL. (I am new to SQL so don't know much.)
An alternative approach I got to work: download the class column values to a file and feed them to a Perl script, which builds a distinct array of A,B,C, then re-reads the downloaded CSV file for each element of the distinct array and increases the count, and finally publishes the report. But this is Perl and runs as a separate execution, while the client needs it as a SQL report.
Help will be appreciated.
Thanks
You may try a split-string-into-rows technique to get the distinct values and use the COUNT function to find the number of occurrences.
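A rough sketch of that approach in plain MySQL, using a small numbers derived table and SUBSTRING_INDEX to split the CSV; the table name csv_table is made up, and the numbers list assumes at most 10 values per row:

SELECT SUBSTRING_INDEX(SUBSTRING_INDEX(t.class, ',', n.n), ',', -1) AS class_value,
       COUNT(DISTINCT t.id) AS row_count  -- number of rows containing the value
FROM csv_table t
JOIN (SELECT 1 AS n UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4 UNION ALL SELECT 5
      UNION ALL SELECT 6 UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL SELECT 9 UNION ALL SELECT 10) n
  ON n.n <= 1 + LENGTH(t.class) - LENGTH(REPLACE(t.class, ',', ''))  -- one joined row per CSV item
GROUP BY class_value
ORDER BY class_value;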

How to join a group of tables with the same suffix?

So I am no MySQL expert and I really need some help trying to figure this out. I currently have over 60 tables that I wish to combine into a single table; none of the data in those tables matches, so I need the rows of all the tables in a single one. They have the same schema, if that is the correct term, basically the same format. They all end in the same suffix '_dir'.
What I thought could work was something like this:
Get all tables under the same suffix,
For each table in the table list join or insert row into main_table.
I don't know how to do this in MySQL or if it's even possible. I know I can use
SELECT *
FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_NAME LIKE '%_dir%'
to get the list of all the tables, but how can I use this to iterate over every table?
Here is an example of input data:
table 1:
| NAME | INST_NAME | Drop
| data 1 | 'this is an example instance1 | 1.5
| data 1 | 'this is an example of instance2| 2.0
table 2:
| NAME | INST_NAME | DROP
| data 2 | 'this is an example instance1 | 3.0
| data 2 | 'this is an example of instance2| 4.0
Output table:
| NAME | INST_NAME | DROP
| data 1 | 'this is an example instance1 | 1.5
| data 1 | 'this is an example of instance2| 2.0
| data 2 | 'this is an example instance1 | 3.0
| data 2 | 'this is an example of instance2| 4.0
Note that I have to do this for over 60 tables, not just two. There are also other tables with different information in the same database, so I can't just join all the tables in there.
You really need to fix your data structure. You should not be storing data in tables with the same structure -- that information should all go into a single table. Then you wouldn't have this issue.
For now, you can construct a view with all the data. You can generate the code for the view with something like this:
SELECT CONCAT('CREATE VIEW vw_dir AS ',
              GROUP_CONCAT(REPLACE('SELECT NAME, INST_NAME, `DROP` FROM [T]', '[T]', TABLE_NAME)
                           SEPARATOR ' UNION ALL ')
       ) AS create_view_sql
FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_NAME LIKE '%_dir%';
-- With 60+ tables, raise group_concat_max_len first so the generated SQL is not truncated.
Then take the SQL, run it, and you'll have a view called vw_dir. The next time you add a table, you'll need to drop the view and then recreate it.
With this solved, you can now start thinking about how to get all the data into a single table, without having the intermediate tables cluttering up your database.
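If you do later want to materialize everything into one physical table, a minimal follow-up could look like this (main_table is the name from the question; adjust to taste):

-- Create the consolidated table from the view in one pass.
CREATE TABLE main_table AS
SELECT NAME, INST_NAME, `DROP`
FROM vw_dir;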

SQL: Text type with a lot of commonly used values

I have a table that basically looks like the following:
Timestamp | Service | Observation
----------+---------+------------
... | vm-1 | 15
... | vm-1 | 20
... | vm-1 | 20
... | vm-1 | 20
... | vm-1 | 20
... | vm-1 | 20
... | bvm-2 | 184
... | bvm-2 | 104
... | bvm-2 | 4
... | bvm-2 | 14
... | bvm-2 | 657
... | bvm-2 | 6
... | bvm-2 | 6
The Service column will not have a lot of different values. I don't know at table creation time what all the possible values are going to be, so I can't use an enum, but the number of distinct values is going to grow very slowly (fewer than ~10 new distinct values per month), whereas I'll have thousands of new observations per day.
Right now I'm just thinking of using a VARCHAR or MySQL's TEXT type for the Service column, but given the specifics of the situation those seem kind of wasteful.
Are databases usually smart about this sort of thing? Or is there some way I can hint to the database that this behavior is something that it can reliably exploit?
I'm using MySQL 5.7. I'd prefer something standards compliant or portable, but I'm also open to MySQL specific workarounds.
EDIT:
In other words, what I want is for the column to be treated like an enum, but have the database figure out dynamically based on the data that shows up in the table what the different enum values are.
Every time you need to use an enum, you should consider creating another table and referencing it. It's basic normalization. So create one table for the ServiceType with a name and an id field; the name can be VARCHAR and the id should be INT. The actual table then just uses the id instead of the service name.
You can write a simple stored procedure to do the inserting and the lookup of existing names, as well as a view to access the results, so that outside of the DB you barely notice how it is handled internally.
Your stored procedure needs to:
Check if the service exists and insert it if not. INSERT IGNORE ... is probably your friend here.
Get the ID of the service with SELECT id INTO @serv_id FROM ServiceType WHERE name = [service_name];
Insert into the table with the service ID instead of the service.
Don't over-optimize. The few bytes you might save with TINYINT over INT won't matter here, so just use INT and it won't overflow until you have billions of services.
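A minimal sketch of that layout and procedure for MySQL 5.7; the names ServiceType, Observation, and insert_observation are illustrative, not anything your schema already has:

CREATE TABLE ServiceType (
    id   INT AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(100) NOT NULL UNIQUE
);

CREATE TABLE Observation (
    ts          TIMESTAMP NOT NULL,
    service_id  INT NOT NULL,
    observation INT NOT NULL,
    FOREIGN KEY (service_id) REFERENCES ServiceType (id)
);

DELIMITER //
CREATE PROCEDURE insert_observation(IN p_ts TIMESTAMP,
                                    IN p_service VARCHAR(100),
                                    IN p_observation INT)
BEGIN
    -- Add the service name if it is not known yet; the UNIQUE key makes this a no-op otherwise.
    INSERT IGNORE INTO ServiceType (name) VALUES (p_service);

    -- Look up its id and store the observation against it.
    INSERT INTO Observation (ts, service_id, observation)
    SELECT p_ts, id, p_observation
    FROM ServiceType
    WHERE name = p_service;
END //
DELIMITER ;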
I think you have to create a new table to store the services, and then this table's primary key (service_id) can be used in place of the service text. The main table's service column should be an int type for storing the service id, so please change the service column type to INT.
Hope it will be helpful.

SSIS Surrogate Key incrementation

I'm using SSIS to create a star schema for a data warehouse with surrogate keys (sg).
My process goes like this:
find max sg (using SQL)
in data flow: data source -> C# script that adds +1 to the max sg -> write to destination.
Now, with fixed dimensions it works without problems. Every added row gets the sequential sg.
However, when I use the Slowly Changing Dimension and historically update a row, I get the following:
sg_key | name | city | current_row
1 | a | X | true
2 | b | Y | true
3 | c | Z | false
4 | d | H | true
7 | c | T | true
Now, correct me if I'm wrong, but I always thought SSIS pushes one row at a time through all the flow tasks, yet it looks like it first generates ALL the sg_keys for all the rows and then sends the updated row through the flow.
Do I understand how SSIS works in a wrong way? How can I fix it?
Cheers,
Mark.
If you use SQL Server as the destination, why not use an IDENTITY column instead of a C# script?
https://msdn.microsoft.com/en-us/library/ms186775.aspx
Identity will automatically increment your column when you insert a new row. If you don't update this column, the value will not change.
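A minimal sketch, assuming the dimension lives in SQL Server and reusing the column names from the example above (the table name dbo.DimExample is made up):

-- SQL Server assigns sg_key itself; the data flow only inserts the business columns.
CREATE TABLE dbo.DimExample (
    sg_key      INT IDENTITY(1,1) NOT NULL PRIMARY KEY,
    name        VARCHAR(50) NOT NULL,
    city        VARCHAR(50) NOT NULL,
    current_row BIT         NOT NULL DEFAULT 1
);

-- New and historically changed rows simply omit sg_key and receive the next value.
INSERT INTO dbo.DimExample (name, city, current_row)
VALUES ('c', 'T', 1);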
Arnaud

PDI Spoon MS Access concat

Suppose I have this table named table1:
| f1| f2 |
--------------
| 1 | str1 |
| 1 | str2 |
| 2 | str3 |
| 3 | str4 |
| 3 | str5 |
I wanted to do something like:
Select f1, group_concat(f2) from table1
(this is MySQL syntax; I am working with MS Access!) and get the result:
| 1 | str1,str2|
| 2 | str3 |
| 3 | str4,str5|
So I searched for a function in MS Access that would do the same and found it! xD
The problem is that every day I have to download some database in MS Access, create the concat function there, and then create a new table with those concatenated values.
I wanted to incorporate that process into the Pentaho Data Integration (Spoon) transformations that I use after all this work.
So what I want is a way to define an MS Access function in PDI Spoon, or some way to combine steps that would emulate the group_concat from MySQL.
Simple - query from Access, and use the "Group by" step to do your group_concat - there is an option to concatenate fields separated by a comma or any string of your choice.
Don't forget that the stream must be sorted by whatever you're grouping by, unless you use the Memory Group by step.
A simple way to make it work is to move your MS Access data to MySQL with the same structure (MySQL DB structure = MS Access DB structure), then execute your Select f1, group_concat(f2) from table1. In detail, follow these steps:
Create transformation A to move/transfer your MS Access data to MySQL
Create transformation B to execute the group_concat query (see the sketch below)
Create a job to execute transformations A and B (you must execute transformation A before B)
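A minimal sketch of the query transformation B would run once the data sits in MySQL; note the GROUP BY clause, which the shorthand in the question leaves out:

-- One output row per f1, with its f2 values concatenated.
SELECT f1, GROUP_CONCAT(f2 ORDER BY f2 SEPARATOR ',') AS f2_list
FROM table1
GROUP BY f1;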