Which data flow transform component is able to separate & extract data by specifying a delimit type such as a comma or a pipe? - ssis

IF a table field contains string value delimited by a pipe like:
"John|Lee|12-01-2015"
When SSIS reads it, is there an SSIS component that you can specify a separator and map each into next flow stage?
Say. data flow reads it("John|Lee|12-01-2015") into a variable, then anything can separate the whole string by | and map each sub-value into a set of column via configuration?
I know it is pretty simple to use C# to do it. In SSIS, script component can be used. Just wonder if there is an SSIS component for it, pretty much lile a flat file source, but read from a variable rather than a file.

Related

SSIS reading csv produced by My Sql, code page conflict

I've got a 3rd party file coming in - utf-8 encoded, 56 columns, csv export from MySql. My intent is to load it into a Sql Server 2019 instance - a table layout I do not have control over.
Sql Server Import Wizard will automatically do the code page conversions to latin 1 (and a couple string-to-int conversions) but it will not handle the MySql "\N" for null conventions, so I thought I'd try my hand at SSIS to see if I could get the data cleaned up on ingestion.
I got a number of components set up to do various filtering and transforming (like the "\N" stuff) and that was all working fine. Then I tried to save the data using an OLE DB destination, and the wheels kinda fall off the cart.
SSIS appears to drop all of the automatic conversions Import Wizard would do and force you to make the conversions explicit.
I add a Data Transformation Component into the flow and edit all 56 columns to be explicit about the various conversions - only it lets me edit the "Copy of" output column code pages it will not save them. Either in the Editor or the Advanced Editor.
I saw another article here saying "Use the Derived Column Transformation" but that seems to be on a column-by-column basis (so I'd have to add 56 of them).
It seems kinda crazy that SSIS is such a major step backwards in this regard from Import Wizard, bcp, or BULK INSERT.
Is there a way to get it to work through the code page switch in SSIS using SSIS components? All the components I've seen recommended don't seem to be working and all of the other articles say "make another table using different code pages or NVARCHAR and then copy one table to the other" which kinda defeats the purpose.
It took synthesizing a number of different posts on tangentially related issues, but I think I've finally gotten SSIS to do a lot of what Import Wizard and BULK INSERT gave for free.
It seems that to read a utf-8 csv file in with SSIS and to process it all the way through to a table that's in 1252 and not using NVARCHAR involves the following:
Create a Flat File Source component and set the incoming encoding to 65001 (utf-8). In the Advanced editor, convert all string columns from DT_STR/65001 to DT_WSTR (essentially NVARCHAR). It's easier to work with those outputs the rest of the way through your workflow, and (most importantly) a Data Conversion transform component won't let you convert from 65001 to any other code page. But it will let you convert from DT_WSTR to DT_STR in a different code page.
1a) SSIS is pretty annoying about putting a default 50 length on everything by default. And not carrying through any lengths as defaults from one component/transform to the next. So you have to go through and set the appropriate lengths on all the "Column 0" input columns from the Flat File Source and all the WSTR transforms you create in that component.
1b) If your input file contains, as mine apparently does, invalid utf-8 encoding now and then, choose "RD_RedirectRow" as the Truncation error handling for every column. Then add a Flat File Destination to your workflow, and attach the red line coming out of your Flat File Source to it. That's if you want to see what row was bad. You can just choose "RD_IgnoreError" if you don't care about bad input. But leaving the default means your whole package will blow up if it hits any bad data
Create a Script transform component, and in that script you can check each column for the MySql "\N" and change it to null.
Create a Data Conversion transformation component and add it to your workflow. Because of the DT_WSTR in step 1, you can now change that output back to a DT_STR in a different code page here. If you don't change to DT_WSTR from the get-go, the Data Conversion component will not work changing the code page at this step. 99% of the data I'm getting in just has latinate characters, utf-8 encoded (the accents). There are a smattering of kanji characters in a small subset of data, so to reproduce what Import Wizard does for you, you must change the Truncation error handling on every column here that might be impacted to RD_IgnoreError. Unlike some documentation I read, RD_IgnoreError does not put null in the column; it puts the text with the non-mapping characters replaced with "?" like we're all used to.
Add your OLE DB destination component and map all of the output columns from step 3 to the columns of your database.
So, a lot of work to get back to Import Wizard started and to start getting the extra things SSIS can do for you. And SSIS can be kind of annoying about snapping column widths back to the default 50 when you change something. If you've got a lot of columns this can get pretty tedious.

Azure | ADF | How to use a String variable to lookup a Key in an Object type Parameter and retrieve its Value

I am using Azure Data Factory. I'm trying to use a String variable to lookup a Key in a JSON array and retrieve its Value. I can't seem to figure out how to do this in ADF.
Details:
I have defined a Pipeline Parameter named "obj", type "Object" and content:
{"values":{"key1":"value1","key2":"value2"}}
Parameter definition
I need to use this pipeline to find a value named "key1" and return it as "value1"; "key2" and return it as "value2"... and so on. I'm planning to use my "obj" as a dictionary, to accomplish this.
Technically speaking, If i want to find the value for key2, I can use the code below, and it will be returned "value2":
#pipeline().parameters.obj.values.key2
What i can't figure out is how to do it using a variable (instead of hardcoded "key2").
To clear things out: I have a for-loop and, inside it, i have just a copy activity: for-each contents
The purpose of the copy activity is to copy the file named item().name, but save it in ADLS as whatever item().name translates to, according to "obj"
This is how the for-loop could be built, using Python: python-for-loop
In ADF, I tried a lot of things (using concat, replace...), but none worked. The simpliest woult be this:
#pipeline().parameters.obj.values.item().name
but it throws the following error:
{"code":"BadRequest","message":"ErrorCode=InvalidTemplate, ErrorMessage=Unable to parse expression 'pipeline().parameters.obj.values.item().name'","target":"pipeline/name_of_the_pipeline/runid/run_id","details":null,"error":null}
So, can you please give any ideas how to define my expression?
I feel this must be really obvious, but I'm not getting there.....
Thanks.
Hello fellow Pythonista!
The solution in ADF is actually to reference just as you would in Python by enclosing the 'variable' in square brackets.
I created a pipeline with a parameter obj like yours
and, as a demo, the pipeline has a single Set Variable activity that got the value for key2 into a variable.
This is documented but you need X-ray vision to spot it here.
Based on your comments, this is the output of a Filter activity. The Filter activity's output is an object that contains an array named value, so you need to iterate over the "output.value":
Inside the ForEach you reference the name of the item using "item().name":
EDIT BASED ON MORE INFORMATION:
The task is to now take the #item().name value and use it as a dynamic property name against a JSON array. This is a bit of a challenge given the limited nature of the Pipeline Expression Language (PEL). Array elements in PEL can only be referenced by their index value, so to do this kind of complex lookup you will need to loop over the array and do some string parsing. Since you are already inside a FOR loop, and nested FOR loops are not supported, you will need to execute another pipeline to handle this process AND the Copy activity. Warning: this gets ugly, but works.
Child Pipeline
Define a pipeline with two parameters, one for the values array and one for the item().name:
When you execute the child pipeline, pass #pipeline.parameters.obj.values as "valuesArray" and #item().name as "keyValue".
You will need several string parsing operations, so create some string variables in the Pipeline:
In the Child Pipeline, add a ForEach activity. Check the Sequential box and set the Items to the valuesArray parameter:
Inside the ForEach, start by cleaning up the current item and storing it as a variable to make it a little easier to consume.
Parse the object key out of the variable [this is where it starts to get a little ugly]:
Add an IF condition to test the value of the current key to the keyValue parameter:
Add an activity to the TRUE condition that parses the value into a variable [gets really ugly here]:
Meanwhile, back at the Pipeline
At this point, after the ForEach, you will have a variable (IterationValue) that contains the correct value from your original array:
Now that you have this value, you can use that variable as a DataSet parameter in the Copy activity.

Importing massive dataset in Neo4j where each entity has differing properties

I'm trying to bulk load a massive dataset into a single Neo4j instance. Each node will represent a general Entity which will have specific properties, e.g.:
label
description
date
In addition to these there are zero or more properties specific to the Entity type, so for example if the Entity is a Book, the properties will look something like this:
label
description
date
author
first published
...
And if the Entity is a Car the properties will look something like this:
label
description
date
make
model
...
I first attempted to import the dataset by streaming each Entity from the filesystem and using Cypher to insert each node (some 200M entities and 400M relationships). This was far too slow (as I had expected but worth a try).
I've therefore made use of the bulk import tool neo4j-admin import which works over a CSV file which has specified headers for each property. The problem I'm having is that I don't see a way to add the additional properties specific to each Entity. The only solution I can think of is to include a CSV column for every possible property expressed across the set of entities, however I believe I will end up with a bunch of redundant properties on all my entities.
EDIT1
Each Entity is unique, so there will be some 1M+ types (labels in Neo4j)
Any suggestions on how to accomplish this would be appreciated.
The import command of neo4j-admin supports importing from multiple node and relationship files.
Therefore, to support multiple "types" of nodes (called labels in neo4j), you can split your original CSV file into separate files, one for each Entity "type". Each file can then have data columns specific to that type.
[UPDATED]
Here is one way to support the import of nodes having arbitrary schemata from a CSV file.
The CSV file should not have a header.
Every property on a CSV line should be represented by an adjacent pair of values: 1 for the property name, and 1 for the property value.
With such a CSV file, this code (which takes advantage of the APOC function apoc.map.fromValues) should work:
LOAD CSV FROM "file:///mydata.csv" AS line
CREATE (e:Entity)
SET e = apoc.map.fromValues(line);
NOTE: the above code would use strings for all values. If you want some property values to be integers, booleans, etc., then you can do something like this instead (but this is probably only sensible if the same property occurs frequently; if the property does not exist on a line no property will be created in the node, but it will waste some time):
LOAD CSV FROM "file:///mydata.csv" AS line
WITH apoc.map.fromValues(line) AS data
WITH apoc.map.setKey(data, 'foo', TOINTEGER(data.foo)) AS data
CREATE (e:Entity)
SET e = apoc.map.fromValues(line);

Using Boomi how can I create a flat file profile which has "line" data on the same physical line in the flat file as the "header" data?

I've got parent & child data which I am trying to convert into a flat file using Dell Boomi. The flat file structure is column-based and needs a structure where the lines data is on the same line of the file as the header data.
For instance, a header record which has 4 line items needs to generate a file with a structure of:
[header][line][line][line][line]
Currently what I have been able to generate is either
[header][line]
[header][line]
[header][line]
[header][line]
or
[header]
[line]
[line]
[line]
[line]
I think using the results of the second profile and then using a data processing shape to strip [\r][\n] might be my best option but wanted to check before implementing it.
I created a user define function for each field of data. For this example, lets just speak for "FirstName."
I used a branch immediately-Branch 1 Split the documents, flow controlled them one at a time, and entered the map, then stopped. Branch two contained a message where I built the new file. I typed a static value with the header name, then used a dynamic process property as the parameter next to the header.
User defined function for "FirstName" accepts the first name as input, appends the dynamic process property (need to define a dynamic property for each field), prepends your delimiter of choice, then sets the same dynamic process property.
This was all done with NO scripting at all. I hope this helps. I can provide screenshots if you need more clarification.

Using Web Performance Tests (Visual Studio) with JSON and a data source

I have an AngularJS/WebAPI app that is a wizard style app where you go step by step and type in some information and in the end you get an answer.
I would like to automate this and use a data source but the problem is every page is passing a giant JSON object back to the server with the parameters changing as the user goes through the app.
So for example one of the parameters entered is a ZIP CODE. But to use the DATA SOURCE concept as demoed I would need to create a CSV file which would be like
, , etc..
And that Body/String Body does not have option to add a data source...
Any ideas?
The StringBody fields can contain content parameters. They can be edited via the properties panel of the relevant part of the request in Visual Studio. They can be set to text in the style
some text {{ContextParameter1}} more text {{ContextParameter2}} even more text
the items with doubled curly braces will be replaced with the named context parameters. The rest is taken from the original string body. Values from the data source are made available as context parameter and so can be included. You may need to set the "Select columns" properties of the data source to "Select all columns" to make all values available, the default is just those that are explicitly bound.
Use this method to parameterise sections of the recorded string bodies.
It is also possible to edit the ".webtest" file, it is just an XML representation of the test. However all the string bodies I have seen are 16-bit values (ie 16 bits per character) that are then base-64 encoded.
You should be able to set up an extraction rule on the first response. This value will be stored in a context variable for use in subsequent requests.
e.g.
Request1 with either no variables or data sourced variables
Response1 from which is extracted myVariable
Request2 utilising {{myVariable}} extracted previously
Response2 extract similar . . .
At each subsequent request, replace the section of the json in the string body with {{myVariable}}.