Ignore last/corrupted record from flat file source in SSIS

I have the following CSV file:
col1, col2, col3
"r1", "r2", "r3"
"r11", "r22", "r33"
"totals","","",
followed by two blank lines. The import fails because of the extra comma at the end of the last data row, and it will most probably also fail because of the extra blank lines at the end.
Can I somehow skip the last row, or better yet, stop the import when I reach that row? It always has the string "totals" in "col1".
UPDATE:
As far as I understood from the answers, it is not possible to do this with the Flat File source. I currently do it with a "Script Component" as the source.

You can do it by reading each row in as a single string column.
Conditionally split out rows that are NULL or that start with "totals" (using LEFT on that single column).
In a Script Component you then use the Split function to break the row on commas,
and finally Trim('"') to strip the surrounding quotes.
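Here is a minimal sketch of what that Script Component source could look like in C#. The file path, the output name (Output0) and the column names are assumptions; adjust them to match your package:

public override void CreateNewOutputRows()
{
    bool isHeader = true;
    foreach (string line in System.IO.File.ReadLines(@"C:\data\input.csv"))  // assumed path
    {
        if (isHeader) { isHeader = false; continue; }        // skip the header row

        // Stop as soon as we hit a blank line or the trailing "totals" row.
        if (string.IsNullOrWhiteSpace(line) || line.StartsWith("\"totals\""))
            break;

        // Naive split; fine for this sample because no field contains a comma.
        string[] parts = line.Split(',');

        Output0Buffer.AddRow();
        Output0Buffer.Col1 = parts[0].Trim().Trim('"');
        Output0Buffer.Col2 = parts[1].Trim().Trim('"');
        Output0Buffer.Col3 = parts[2].Trim().Trim('"');
    }
}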

I know of nothing built into SSIS that lets you ignore the last line of a CSV.
One way to handle this is to precede your data flow with a Script Task that uses the FileSystemObject to edit the CSV and remove the last line.
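A minimal sketch of such a Script Task body in C#, using System.IO rather than the FileSystemObject (the hard-coded path is an assumption; in practice you would read it from a package variable):

string path = @"C:\data\input.csv";   // assumed path
var keep = new System.Collections.Generic.List<string>();
foreach (string line in System.IO.File.ReadAllLines(path))
{
    if (string.IsNullOrWhiteSpace(line)) continue;   // drop the trailing blank lines
    keep.Add(line);
}
if (keep.Count > 0) keep.RemoveAt(keep.Count - 1);   // drop the last data row ("totals")
System.IO.File.WriteAllLines(path, keep);
Dts.TaskResult = (int)ScriptResults.Success;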

You will need to create a custom script within SSIS that reads all lines but the last.

This is old, but it came up for me when searching this topic. My solution was to redirect rows on the destination: the last row is redirected instead of failing, and the job completes. Of course, you may also redirect rows you don't want to. It all depends on how much you can trust the data.

Related

Pipeline unable to read field of plain text file

Using the latest version of Apache Hop, I'm trying to read in a plain text file. This text file is old and basically only structured by its lines (it has no delimiter, no separator, no enclosure, etc.). I would like to read and process the lines of this file as rows in my transformation.
I use the "Text file input" transform to read the file. Apparently reading it works, but I seem to have no fields available when trying to retrieve them; it simply states that no fields were found.
When I run "preview records" I do get empty records equal to the number of lines in the file, so that is good. However, there is no data shown, as there is no field detected.
Curiously enough, when I press "Show file content" I DO get the desired content, nicely structured in rows as desired, so I know the file is being read correctly.
Does anyone know how best to read these kinds of files?
PS: The files can be anywhere from 10 to 100000 lines.
When there is no header row with field names, or Hop is not able to detect any fields, you can also create a field manually in the fields tab and Hop will put the content in there.
Since it uses a position-based approach and splits the content using the specified delimiter, everything should go into "field1" when no delimiter is found in the data.
Figured it out. The naming is a bit misleading, but you can use the "CSV file input" transform and set a TAB as the delimiter. Then use preview on your file and you should find that the lines are actually being parsed.

Remove last blank row in CSV using Logic App

I have a CSV file stored in SFTP where the last row is a blank, so the data looks like this in text:
a,b,c
d,e,f
,,
How can I use a Logic App to remove that final row and then save the file to blob storage? I have the following, but I think it will need some extra steps before the blob creation.
Considering the same sample, here is my Logic App.
Compose_2 takes the index of the last newline, i.e. the one just before the empty row. Below is the expression I used to retrieve that last index:
lastIndexOf(variables('Sample'),'\n')
Then in Compose_3 I select the part I want:
substring(variables('Sample'),0,outputs('Compose_2'))
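Assuming the text ends with the ',,' row and no trailing newline, the last newline is the one just before that blank row, so the substring keeps a,b,c and d,e,f and drops the final row.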
NOTE:
Make sure you remove the extra ' \ ' that gets attached to '\n' in the code view for Compose_2 (the designer escapes the backslash, turning '\n' into a literal '\\n').
So the final Compose_2 looks like
lastIndexOf(variables('Sample'),'
')
Updated Answer
If the received data is coming from a CSV, you can use the take() expression to retrieve the wanted rows.
Below is the expression in the compose connector
take(outputs('Split_To_Get_Rows'),sub(length(outputs('Split_To_Get_Rows')),1))
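Split_To_Get_Rows here is presumably a Compose action that splits the text on newlines, e.g. split(variables('Sample'), '\n') (with the same code-view escaping caveat as above); take() then keeps every row except the last, blank one.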

Jitterbit: target CSV file created with only a header although "do not create empty files" is checked

In Jitterbit Dataloader 10.37 I want to create CSV files from Salesforce data, but only if the query returns data.
I checked "do not create empty files" on the Local File target type, but it is still creating a CSV with just the header and no data. I do not want files created with no data in them. Leaving the header out of the files entirely is not an option - I will need it when the query does return data.
Any suggestions? What am I missing?
I've seen this happen when the write operation comes after a couple of other operations. In that case a header is written by the first operation, then another header is written by a second operation: the first row is read as the header, and the second row (another header) is read as data and written out.
I always add a condition where I check whether one of the fields equals its own name, something like this, to just skip those rows:
<trans>
if(Id == "Id",
    false,
    true
);
</trans>
The best way to do this is to send your output to a global variable first and then check the variable to see if data is present. So set your target to a global variable, then add a script after that target and do your validation there. To test your script, use DebugBreak() to pause and look at the variable's content; that way you can see what is going into it.
Then make your condition statement:
if(Length($variable) > 1, RunOperation("operation:myexport"), "novalue");

Parse tab separated text file in Google Sheets

I have a txt file available on the web which contains tab separated values (TSV/CSV) like this:
Product_Id<tab>Color<tab>Price<tab>Quantity
Item1<tab>Red<tab>$5.2<tab>5
Item2<tab>Blue<tab>$7.5<tab>10
I imported the txt file into a Google Spreadsheet using the IMPORTDATA(url) formula. The problem is that now I need to split the text into columns. I tried the following formulas without success:
Split(A1,"\t")
Split(A1," ")
Split(A1,"<tab>")
Another thing I tried was to use the SUBSTITUTE function, but I just can't figure out how to match the tab character in Google Spreadsheets.
Google Sheets strips tabs by default when you paste text using a standard paste. Tab-delimited data can be pasted and automatically parsed using:
Right Click -> Paste special -> Paste values only
IMPORTDATA(url) seems to handle tabs automatically, as others have mentioned before, if the URL ends in ".tsv".
I had trouble trying to import a file from Dropbox even though the file was named "something.tsv", because the URL was
"https://www.dropbox.com/s/xxxxxxx/something.tsv?dl=1"
I managed to solve the problem by adding a dummy query parameter to the url:
"https://www.dropbox.com/s/xxxxxxx/something.tsv?dl=1&x=.tsv"
NOTE: I know this question was asked back in 2014 and I am answering this question some 5 years later. I am posting the answer here in hopes that someone else who googles their way here will be saved the headache and can be helped by how I devised a solution.
SUMMARY OF THE ISSUE: By default the IMPORTDATA() function will properly process a tab-delimited file only if the file name ends with the extension .TSV
UPDATE Nov 14, 2019:
In a comment below, Poul shared that he found an undocumented parameter for the IMPORTDATA() function by which you can specify the delimiter used to split the data. As of this writing, the official documentation makes no reference to this parameter.
In effect the documentation should look something like the following:
IMPORTDATA("url","delimiter")
So, if you wanted to force a file to be split on the TAB character, it would look something like
IMPORTDATA("url","\t")
PRIOR ANSWER:
UPDATE: I am leaving my original answer below in case the approach above, which relies on undocumented functionality, stops working.
ORIGINAL ANSWER: After seemingly countless attempts, I figured out how to coax Google Sheets into importing a tab-delimited file regardless of the extension.
For those looking for the quick and dirty answer, copy the following into a cell of a Google Sheet to give it a try:
=ARRAYFORMULA(IFERROR(SPLIT(IMPORTDATA("https://iso639-3.sil.org/sites/iso639-3/files/downloads/iso-639-3_Latin1.tab"),CHAR(9),FALSE,FALSE)))
For those that want to know a bit more, I will try to explain how each of the nested functions are helping to create the final solution:
=ARRAYFORMULA( IFERROR( SPLIT( IMPORTDATA(URL-HERE) ,CHAR(9),FALSE,FALSE) ) )
IMPORTDATA() - the primary function that pulls in the data file from the web
SPLIT - splits each row by tab; note the use of CHAR(9) to generate the tab character, and the FALSE last parameter, which was required in my case to ensure empty cells were not collapsed together
IFERROR - used to catch situations where an import might fail; the error is trapped and not returned to the spreadsheet
ARRAYFORMULA - ensures that every line in the file is parsed; without this, only the first line of the file would be returned to the spreadsheet
It turns out that IMPORTDATA(url) can import a tab separated file, but it expects the file name to have the .tsv extension. This is inconsistent with Excel, where a tab-separated export results in *.txt.
If you can ensure that you use a .tsv extension, then your problem is solved.
You can also use the Sheets UI to import the file (into a new spreadsheet). Select File > Import..., then Upload > Select a file from your computer. When the file selection dialog opens, paste the URL into the file name field and click Open. The file will be downloaded to your PC and then uploaded to Drive through the Import dialog, which will let you choose the delimiter.
(Validated on Windows 8.1 with Chrome; I don't know how this will behave on other OSes or browsers.)
Edit: See this gist.
importFromCSV(string fileName, string sheetName)
Populates a sheet with contents read from a CSV file located in the user's GDrive. If either parameter is not provided, the function will open inputBoxes to obtain them interactively.
Automatically detects tab or comma delimited input.
I had luck using SPLIT() and indicating only a single space as the delimiter, even though the data I pasted in had tabs separating each "column": =SPLIT(A1, " ", TRUE), where A1 had data separated by one or more spaces. It seems that pasting in TSV data results in the tabs being converted to spaces.
This can be done in two steps, leveraging the fact that a pasted tab ends up as one or more spaces.
The steps are as follows:
Select the columns which have tab-separated data, then collapse the whitespace to single spaces using Data -> Data cleanup -> Trim whitespace.
Now the usual Data -> Split text to columns should work out of the box, or after selecting space as the separator.

SSIS - read a single header record from a flat file or an Excel file prior to processing

Is there a method by which one can read just the first record of a file, i.e., read the header information so that a decision can be made whether or not to process the remainder of the file?
I know that with the Conditional Split transformation one can write an expression that ignores all of the rows besides the header, based on a keyword in the header. I would rather not go that route, as that inefficiently reads every record in the file.
Specifically, is there Script Component logic that I can implement to close the file and end the data flow after the first record has been read?
See this post from Todd McDermid:

Basically, you would set up a Foreach Container to loop over the files in your directory. Inside the Foreach, you would determine the "file type" - perhaps by creating a variable with a long-winded expression on it that pulls apart your file name and assumes a "file type" value - then pass control on to one of five Data Flows via conditional connectors. (Double-click on the standard green connector, change its Evaluation Operation to Expression and Constraint, and set the expression to be "file_type_variable = ".) Then each Data Flow picks apart one "file type".
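If you do want the script approach from the question, here is a minimal sketch of a C# Script Task that reads only the first line, closes the file, and records the decision in a package variable. The path, the expected header text, and the variable name User::ProcessFile are all assumptions:

string header;
using (var reader = new System.IO.StreamReader(@"C:\data\input.csv"))  // assumed path
{
    header = reader.ReadLine();   // only the first record is ever read
}

// Assumed boolean package variable (add it to the task's ReadWriteVariables);
// test it on a precedence constraint to decide whether the Data Flow runs.
Dts.Variables["User::ProcessFile"].Value =
    header != null && header.StartsWith("ExpectedHeader");
Dts.TaskResult = (int)ScriptResults.Success;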