I am working on an SSIS (2017) solution to read and load data from these 3 CSV files:
message_EDF_100420202.csv
message_UltaBIO_10042020.csv
message_SEIDV_10042020.csv
What I need to do is get only EDF or UltraBIO or SEIDV as a new column (Derived Column task),
so I need some help setting up the SUBSTRING function correctly inside the Derived Column task.
Any suggestions?
It appears your pattern is message_ Stuff-I-Want _junk (the spaces are not present in the actual pattern). It's delimited by underscores, and since the starting text is constant, that makes life easier.
Create a new column called MessageLessName
Remove the message_ portion with an expression
REPLACE([SourceFile], "message_", "")
Now, we want to take the leftmost N characters, where N corresponds to the location of the underscore in our new column, MessageLessName. For ease of debugging, I propose you add a second Derived Column task to the output of the first one (where we defined MessageLessName). Here, we're going to create a FirstUnderscore column
FINDSTRING([MessageLessName], "_", 1)
Finally, we'll add a third Derived Column Task and here-in is where we'll get to the final file name.
LEFT([MessageLessName], [FirstUnderscore] - 1)
The - 1 keeps the trailing underscore out of the result. Because you can check each step along the way, you can verify that MessageLessName is exactly what you think it should be and that FirstUnderscore is N characters in from our MessageLessName column.
Alternatively, you can do this in a Script Component using Split:
Row.ColumnName.ToString().Split('_')[1];
You are taking the column value and casting to string. (current value is the whole string)
Next is splitting based on '_' (current value is an array of three strings)
Finally you are taking the second value (0 based) (current value is the string you want)
Here's a little bonus. Getting the date as well:
// breakdown[0] = "message", breakdown[1] = source system, breakdown[2] = date plus extension
string[] breakdown = Row.fileNames.Split('_');
Row.Type = breakdown[1];
// Strip the extension, leaving only the digits of the date.
string dateToFix = breakdown[2].Replace(".csv", "");
// Rebuild 10042020 as 10/04/2020 before parsing.
Row.Date = DateTime.Parse(dateToFix.Substring(0, 2) + "/"
    + dateToFix.Substring(2, 2) + "/" + dateToFix.Substring(4, 4));
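One caveat on that bonus: DateTime.Parse reads "10/04/2020" according to the server's culture, which can silently swap day and month. If the date embedded in the file name always has a fixed layout (assumed here to be ddMMyyyy), a sketch using DateTime.ParseExact is safer:
// Assumes the file-name date is ddMMyyyy; change the format string if it is really MMddyyyy.
Row.Date = DateTime.ParseExact(dateToFix, "ddMMyyyy",
    System.Globalization.CultureInfo.InvariantCulture);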
In this sheet, I have the input data below:
As seen, the courses are separated by /
I want to display the same in the format below, where each line shows one course only, with the data of the student repeated:
I know that using =split(C3," / ",true,true) can split the courses into 2 columns on the same row, but I need them in the same column, so I tried =TRANSPOSE(split(C3," / ",true,true)), which works fine for the first line only but fails when used with ARRAYFORMULA.
Any thoughts? I'm open to any potential solution: a formula, a script, or anything else.
UPDATE
I tried this trick, creating a new column showing the number of courses for each student as =ArrayFormula(LEN(REGEXREPLACE(C11:C13, "[^/]", ""))+1)
Then I used REPT to repeat each row based on the number of courses: =arrayformula({transpose(split(concatenate(rept(B11:B13 & ",",D11:D13)),",",false,true)),transpose(split(concatenate(REPT(C11:C13 & ",",D11:D13)),",",false,true))}) and ended up with:
But here the courses are still joined together; how can I split them?
I've added two sheets to your sample spreadsheet. "Sheet2" is a cleanup of your testing sheet, "Sheet1." The other sheet ("Erik Help") references Sheet2, not Sheet1, and contains the following formula in cell A1:
=ArrayFormula({"Student ID","Student Name","Course";SUBSTITUTE(SPLIT(QUERY(FLATTEN(SPLIT(FILTER(SUBSTITUTE("/ "&Sheet2!C3:C,"/","/ "&Sheet2!A3:A&"zzz~"&Sheet2!B3:B&"~"),Sheet2!A3:A<>""),"/")),"Select * WHERE Col1 Is Not Null"),"~"),"zzz","")})
This one array formula produces all headers and results.
A virtual array is formed between the curly brackets { }. Headers are introduced first followed by a semicolon, which means "bump down one row to continue." The header titles can be changed as you like.
How It Works:
An additional "/ " is concatenated to the front of every non-blank entry in Sheet2!C3:C. Then SUBSTITUTE replaces every one of these forward slashes with Col A data, "zzz~", Col B data and "~". The tildes (~) will be used later by the outer SPLIT. The "zzz" is added to make sure that ID numbers are converted to text so that they hold formatting throughout the processing and don't turn into real numbers; later, the outer SUBSTITUTE will replace those with null (i.e., get rid of the 'zzz').
Once the initial concatenations are complete, they are SPLIT at the forward slash and then FLATTENed into one column. QUERY removes any blank rows in this virtual array so far. The remaining results are again SPLIT at the tilde. Finally, that outer SUBSTITUTE removes the temporary instances of 'zzz'.
I also added a custom conditional formatting (CF) formula for the color banding on alternate rows.
You can try this one:
Formula:
=ARRAYFORMULA(TRIM(QUERY(SPLIT(FLATTEN(IF(IFERROR(SPLIT(C3:C5, "/"))="",,
A3:A5&"×"&B3:B5&"×"&SPLIT(C3:C5, "/"))), "×"),
"where Col3 is not null")))
Output:
Reference:
How to transpose & split multiple columns and repeat specific cells in a column
I am using the code below in a Derived Column in SSIS to remove titles such as Mr, Mrs, Ms and Dr from the Name column.
Ex:-
Mr ABC
MS XYZX
Mrs qwrer
DR ADCS
So I am removing the title from the name.
SUBSTRING( [Name] , 1, 3)=="Mr" && LEN( [Name] ) >2 ? RIGHT([Name],LEN([Name])-2)
But I'm getting an error: incomplete token or invalid statement.
Please help.
Any other suggestions to remove the prefixes are also welcome, but I need to use a transformation.
A different way to think about the problem is that you want to look at the first "word" in the column Name where "word" is the collection of characters from the start of the string to the first space.
String matching in SSIS is case-sensitive so you'll want to force the first word to lower/upper case - however your master list of titles is cased (and then ensure the title list is all lower/upper case).
I am an advocate of making life easier on yourself, so I'll add a Derived Column (actually, lots of Derived Columns) that will identify the position of the first space in Name. I'll call this FirstSpace.
DER GetFirstSpace
Add a new column, called FirstSpace. The expression we want to use is FINDSTRING
FINDSTRING([Name], " ", 1)
That will return the position of the first instance of a space (or zero if no space was found).
DER GetFirstWord
Add another Derived Column after DER GetFirstSpace. We need to do this so we can inspect the values we're passing along to get the first word. Do it all in a single Derived Column and, when you get something wrong, you won't be able to debug it; the real cost of development is maintenance. The new column, FirstWord, will be type DT_WSTR 4000 because that's what happens when you use the string manipulation expressions. I am going to force this to upper case as I'll need it later on.
UPPER(SUBSTRING([Name], 1, [FirstSpace]))
TODO: Verify whether that will be "DR" or "DR " with trailing space as I'm currently coding this by memory.
TODO: What happens if FirstSpace is 0 - we might need to make use of ternary operator ?:
At this point in the data flow, we have a single word in a column named FirstWord. What we need to do is compare that to our list of known titles and, if it matches, strip it from the original. And that's an interesting problem.
DER GetIsTitleMatched
Add yet another Derived column, this time to solve whether we've matched our list of titles. This will be a boolean type column named IsTitleMatched
[FirstWord] == "DR" || [FirstWord] == "MRS" || [FirstWord] == "MR" || [FirstWord] == "MS"
Following that pattern ("FirstWord is exactly equal to literal text OR..."), when this derived column evaluates, we'll know whether the first word is something to be removed (finally).
DER SetFinalName
Here we're going to add yet another column, NameFinal. The magic of stripping out the bad word is that we use the RIGHT expression, starting at the position of that opening space and going to the end of the string. You might need to add a left TRIM in there based on whether the RIGHT operation is inclusive of the starting point or not. Again, I'm free-handing at the moment, so good but no guarantee of perfection.
[IsTitleMatched] ? RIGHT([Name], LEN([Name]) - [FirstSpace]) : [Name]
I do violate my own rule here as I have a quasi complex expression there in the positive case. Feel free to insert a derived column task that computes the stripped version of the name.
At this point, you've got 4 to 5 derived columns in your data flow but you can add a data viewer between each to ensure you're getting the expected result. You're welcome to solve this in your own manner but that's the easiest approach I can think of.
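If you'd rather collapse all of those steps into a single Script Component, a minimal C# sketch of the same logic could look like the following (inside the generated ScriptMain, assuming an input column Name and an output column NameFinal; both names are illustrative):
public override void Input0_ProcessInputRow(Input0Buffer Row)
{
    // Master list of titles, upper-cased to match the forced upper-case comparison above.
    string[] titles = { "MR", "MRS", "MS", "DR" };

    string name = Row.Name;
    int firstSpace = name.IndexOf(' ');

    // No space found means there is nothing that looks like a title to strip.
    if (firstSpace > 0)
    {
        string firstWord = name.Substring(0, firstSpace).ToUpper();
        if (Array.IndexOf(titles, firstWord) >= 0)
        {
            // Drop the title plus the space that follows it.
            name = name.Substring(firstSpace + 1).TrimStart();
        }
    }

    Row.NameFinal = name;
}
You lose the step-by-step data viewers, but the matching list lives in one place.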
I'm trying to merge multiple JSON data sets into one large data set, due to a max limit of 100 on the server I'm pulling them from.
The easiest way to do this would be to eliminate the end of one set and the beginning of the next and replace it with "," so that there would be only one open and close to the entire large set. This is what appears between the last entry of one set and the first entry of the next currently:
],"version":"1.0"}{"error":"OK","limit":100,"offset":100,"number_of_page_results":100,
"number_of_total_results":20235,"status_code":1,"results":[
Again, I need that entire string replaced with just a comma, but the problem I'm encountering is that I had to change the offset between each data set to grab the next 100 entries, so the "offset":100, is different in each string ("offset":200, "offset":300, etc.). I can't seem to get wildcards to cooperate. I suspect it has something to do with all the brackets that are already in the string.
Any help would be appreciated. Thank you.
A regular expression that matches the whole input you provided (provided there are no newline characters in it) is:
\],"version":"1\.0"\}\{"error":"OK","limit":[0-9]+,"offset":[0-9]+,"number_of_page_results":[0-9]+,"number_of_total_results":[0-9]+,"status_code":[0-9]+,"results":\[
It will match any digits in place of all the numbers in your sample (except the version).
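The question doesn't say where the replace is being run, but if the merge ends up in .NET code, a minimal sketch with Regex.Replace (allText here is a stand-in for the concatenated JSON text) would be:
using System.Text.RegularExpressions;

// The pattern above, with the literal quotes doubled for a C# verbatim string.
string pattern = @"\],""version"":""1\.0""\}\{""error"":""OK"",""limit"":[0-9]+,""offset"":[0-9]+,""number_of_page_results"":[0-9]+,""number_of_total_results"":[0-9]+,""status_code"":[0-9]+,""results"":\[";

// Every set boundary collapses to a comma, leaving one open and one close for the whole set.
string merged = Regex.Replace(allText, pattern, ",");
Most editors that support regex find-and-replace (Notepad++, VS Code) will accept the same pattern.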
I have a text file with more than one table's data in it (different column counts). I import the whole row as one column. Based on a conditional split, the rows are dispersed to their correct flow. I use a script component to split the single column values (row) into the correct columns for that table and give it as output columns. All of this is working fine, and data looks fine.
My problem comes in with some numeric fields. When a numeric field has no values in it, it ends up in the table with another column's numeric value.
I have put data viewers everywhere; in not one of them is there data for the column that should be empty. Yet when I look in the table itself, there it is... data from another column.
It is not the mappings, I checked it a dozen times.
It is not the names that are the same or something like that.
There is no data according to dataviewers anywhere in the load process.
There is no hidden code anywhere.
I dropped and recreated the table.
I displayed a message box with the assigned value of the column that is supposed to be empty, and there was no data, as expected.
I used a derived column: same result, no data in the data viewers, but eventually data in the table.
I also created another test table with those numeric fields as varchar. When I do this, the column is empty (as expected). When I change it to numeric, the field is populated again. (If it were the other way around, I could understand.)
What can be the reason for this? It is driving me insane.
EDIT
Script code:
//C#
public override void Input0_ProcessInputRow(Input0Buffer Row)
{
ASCIIEncoding enc = new System.Text.ASCIIEncoding();
char[] separator = { '|' };
Byte[] ByteBlob;
String[] ColumnValue;
ByteBlob = Row.Column0.GetBlobData(0, (int)(Row.Column0.Length));
ColumnValue = enc.GetString(ByteBlob).Split(separator);
Row.OutputColumn0 = ColumnValue[0];
Row.OutputColumn1 = ColumnValue[1];
// etc.
Just to give an example of what it does, this is what a row would look like, in a sense.
Column names:
Source|Tablename|Value1|Value2|Description|Value3|Description2|Value4
Actual Data:
ABC|Revenue|123,456|729,537|MisterX||None|
Data in Table:
ABC|Revenue|123,456|729,537|MisterX|729,537|None|729,537
Try using Row.ColumnX_IsNull; for example, when a split value comes back empty, set the corresponding Row.OutputColumnX_IsNull = true instead of assigning a value to the output column.
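Applied to the script above, that could look like the sketch below for one of the affected numeric fields (OutputColumn5 and index 5 are stand-ins for whichever column is bleeding values):
// If the split produced an empty field, explicitly mark the output as NULL
// instead of leaving the buffer value to chance.
if (string.IsNullOrEmpty(ColumnValue[5]))
{
    Row.OutputColumn5_IsNull = true;
}
else
{
    Row.OutputColumn5 = ColumnValue[5];
}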
I have a couple of questions about the task on which I am stuck and any answer would be greatly appreciated.
I have to extract data from a flat file (CSV) as an input and load the data into the destination table with a specific format based on position.
For example, if I have order_id,Total_sales,Date_Ordered with some data in it, I have to extract the data and load it in a table like so:
The first field has a fixed length of 2 with numeric as a datatype.
total_sales is inserted into the column of total_sales in the table with a numeric datatype and length 10.
date as datetime in a format which would be different than that of the flat file, like ccyy-mm-dd.hh.mm.ss.xxxxxxxx (here x has to be filled up with zeros).
Maybe I don't have the right idea to solve this - any solution would be appreciated.
I have tried using the following ways:
I used a Flat File Source to get the CSV file and then gave it as an input to an OLE DB Destination, with a table of fixed data types created. The problem here is that the columns are loaded, but I have to fill them up with zeros: the date needs padding when it is loaded, and for most of the columns, if I am not using the total length, the value has to be preceded with zeros.
For example, if I have an Orderid of length 4 and in the flat file I have an order id like 201 then it has to be changed to 0201 when it is loaded in the table.
I also tried another way: using a Flat File Source, I created a variable which takes the entire row as an input and tried to separate it with Derived Columns. I was successful to an extent, but in the end the data type in the Derived Column got fixed explicitly to Boolean, which I am not able to change to the data type I want.
Please give me some suggestions on how to handle this issue...
Assuming you have a csv file in the following format
order_id,Total_sales,Date_Ordered
1,123.23,01/01/2010
2,242.20,02/01/2010
3,34.23,3/01/2010
4,9032.23,19/01/2010
I would start by creating a Flat File Source (inside a Data Flow Task), but rather than having it fixed width, set the format to Delimited. Tick the Column names in the first data row. On the column tab, make sure row delimiter is set to "{CR}{LF}" and column delimiter is set to "Comma(,)". Finally, on the Advanced tab, set the data types of each column to integer, decimal and date.
You mention that you want to pad the numeric data types with leading zeros when storing them in the database. Numeric data types in databases tend not to hold leading zeros, so you have two options: either hold the data as the types they are in the target system (int, decimal and dateTime) or use the Derived Column control to convert them to strings. If you decide to store them as strings, adding an expression like
"00000" + (DT_WSTR, 5) [order_id]
to the Derived Column control will add up to 5 leading zeros to order id (don't forget to set the data type length to 5) and would result in an order id of "00001"
Create your target within a Data Flow Destination and make the table/field mappings accordingly (or let SSIS create a new table / mappings for you).
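The question also asks for a date layout of ccyy-mm-dd.hh.mm.ss.xxxxxxxx with the x's zero-filled. The steps above keep the date as a dateTime, but if you do need that exact string, one option is a Script Component; the column names below (Date_Ordered in, DateFormatted out) are purely illustrative:
// Render the parsed date as ccyy-mm-dd.hh.mm.ss and zero-fill the fractional part.
Row.DateFormatted = Row.Date_Ordered.ToString("yyyy-MM-dd.HH.mm.ss") + ".00000000";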