I use the UCI Balloons dataset as an example (https://archive.ics.uci.edu/ml/machine-learning-databases/balloons/yellow-small.data).
Load the data and rename the columns:
Combine the columns "color" and "size" with a tab character. I use the tab escape "\t". This does not work:
Do you know how to input the correct tab delimiter here?
I don't think a tab is possible in the Column Combiner node. You can use the Column Expressions node with the join() function. Example: join(col1,"\t",col2). In the output table you'll see the values concatenated, but if you write the data with CSV Writer, for example, you'll see that the tab is there. Why do you need a tab in between?
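To see why the tab only becomes visible once the data is written out, here is a minimal sketch of the same join in Python rather than KNIME. The sample rows, the `combine` helper and the `color_size` column name are illustrative assumptions, not part of the original workflow.

```python
import csv
import io

# Illustrative rows; only the two columns being combined are shown.
rows = [
    {"color": "YELLOW", "size": "SMALL"},
    {"color": "PURPLE", "size": "LARGE"},
]

def combine(row, first="color", second="size", sep="\t"):
    """Join two column values with a separator (a tab by default)."""
    return row[first] + sep + row[second]

combined = [combine(row) for row in rows]

# Write the combined column as CSV into an in-memory buffer.
buffer = io.StringIO(newline="")
writer = csv.writer(buffer)
writer.writerow(["color_size"])
for value in combined:
    writer.writerow([value])

csv_text = buffer.getvalue()
```

Printing `repr(csv_text)` shows the `\t` between the two values, even though a table viewer may render it as plain whitespace.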
Possibly give this a try:
Delimit by a space. Then replace that delimiter with the tab \t.
Node Configuration:
Image of node configuration
Resulting Output:
Image of node output
I have a flat source file with following settings.
Text Qualifier = "
Header row delimiter = |
I have a Column A which has value = This is the example of "value in the source file"
The error that I am seeing is that the column delimiter for column A is not found. I think the issue is the double quotes inside the column value. Any suggestions on how to fix this?
The file has to be changed. If the text qualifier of the file is a double-quote, then the actual data cannot contain double-quotes.
You can either talk to the person who creates the file and have them use a different text delimiter, or you can write a script task that edits the file and replaces the delimiting double-quotes with another delimiter, leaving the double-quotes in the data intact.
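A minimal sketch of the second option, written in Python rather than as an SSIS script task. It assumes that every field in the file is qualified, that the column delimiter is |, and that the sequence "|" never occurs inside the data. The function name `requalify` and the replacement qualifier ~ are hypothetical choices for illustration.

```python
def requalify(line, delimiter="|", old='"', new="~"):
    """Swap the qualifiers at field boundaries and keep embedded quotes.

    Assumes every field is wrapped in `old` and that the sequence
    old + delimiter + old never appears inside a field's data.
    """
    body = line.rstrip("\r\n")
    # Closing qualifier, delimiter, opening qualifier between two fields.
    body = body.replace(old + delimiter + old, new + delimiter + new)
    # Opening qualifier of the first field.
    if body.startswith(old):
        body = new + body[len(old):]
    # Closing qualifier of the last field.
    if body.endswith(old):
        body = body[:-len(old)] + new
    return body


sample = '"1"|"This is the example of "value in the source file""|"x"'
converted = requalify(sample)
```

After the conversion the flat file source would be configured with ~ as the text qualifier, so the double quotes left in the data no longer confuse the parser.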
For a bit more detail, see here.
I am new to Pentaho Kettle and I am trying to build a simple data transformation (filter, data conversion, etc). But I keep getting errors when reading my CSV data file (whether using CSV File Input or Text File Input).
The error is:
... couldn't convert String to number : non-numeric character found at
position 1 for value [ ]
What does this mean exactly and how do I handle it?
Thank you in advance
I have solved it. The idea is similar to what @nsousa suggested, but I didn't use the Trim option because I tried it and it didn't work in my case.
What I did is specify that if the value is a single space, it is set to null. In the Fields tab of the Text File Input step, set the "Null if" column to a single space.
That value looks like an empty space. Set the Format of the Integer field to # and set trim to both.
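The error means the field holds a single space, which cannot be parsed as a number. Both answers amount to the same rule: treat a blank value as null before converting. A minimal sketch of that rule in Python (this is not Kettle configuration, and `to_number` is a hypothetical helper):

```python
def to_number(raw):
    """Convert a raw text field to an int, mapping blank values to None."""
    text = raw.strip()      # equivalent of trimming on both sides
    if text == "":
        return None         # equivalent of "Null if" a single space
    return int(text)


values = [to_number(field) for field in ["12", " ", " 7 "]]
```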
I need to create a CSV file with a column delimiter of CTRL-A. Is that possible with the flat file destination? If it is, what's the syntax? If it isn't, is there a solution short of a custom destination?
I took a similar approach to sorrell, but my outcome was slightly different:
In SSMS, create the character and copy the output. Note that this will look like nothing if you paste it anywhere.
SELECT char(1)
EDIT:
Make sure to copy the result of this query from the results window. You can confirm you have it by pasting it into Notepad, where the cursor will move one space, or into Notepad++, where it will show a highlighted "SOH".
Here is where I found which decimal to use: http://www.unix-manuals.com/refs/misc/ascii-table.html
Paste that value into the Column Delimiter of the flat file manager:
This is what the output looks like in Notepad:
More interestingly, this is what it looks like in Notepad++ (matching the Start Of Heading text, SOH, from the ASCII table I posted a link to above):
I haven't fully tested this, but it seems doable.
Create a template flat file with the headings. I used LINQPad to create the Ctrl-A character using a Unicode string (\u0001). You could also get there the ASCII route using \x01 (same character, just pointing this out if you need to use it in code). Here's what's in my flat file.
ColumnA□ColumnB
Create a Flat File Destination, and create a New Flat File Connection. Select Delimited as the type.
Browse to your flat file template, check the Unicode box (if unicode), and if the data should contain headers, check that box too.
Copy the Ctrl-A character from your template and paste it into the Column delimiter box. Then click the Refresh button.
You should now be able to work with the delimited columns. If you need to manually recreate that character in code, you can always use \x01 or \u0001.
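As a cross-check outside SSIS, here is a minimal Python sketch that produces a Ctrl-A delimited file using the standard csv module. The function name `write_soh_delimited` and the sample rows are assumptions for illustration.

```python
import csv
import io

SOH = "\x01"  # Ctrl-A, Start Of Heading; the same character as "\u0001"

def write_soh_delimited(rows):
    """Return the rows serialized with Ctrl-A as the column delimiter."""
    buffer = io.StringIO(newline="")
    writer = csv.writer(buffer, delimiter=SOH, lineterminator="\r\n")
    writer.writerows(rows)
    return buffer.getvalue()


output = write_soh_delimited([["ColumnA", "ColumnB"], ["1", "2"]])
```

A file written this way can also serve as the template from which to copy the delimiter character.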
Take this XLS file
I then save this XLS file as CSV and then open it up with a text editor. This is what I see:
Col1,Col2,Col3,Col4,Col5,Col6,Col7
1,ABC,"AB""C","D,E",F,03,"3,2"
I see that the double quote character in column C was stored as AB""C: the column value was enclosed in quotation marks, and the double quote character in the data was replaced with two double quote characters to indicate that the quote occurs within the data and does not terminate the column value. I also see that the value for column G, 3,2, is enclosed in quotes so that it is clear that the comma occurs within the data rather than indicating a new column. So far, so good.
I am a little surprised that not all of the column values are enclosed in quotes, but even this seems reasonable if I assume that Excel only adds text qualifiers when special characters like a comma or a double quote character exist in the data.
Now I try to use SQL Server to import the csv file. Note that I specify a double quote character as the Text Qualifier character.
And a comma character as the Column delimiter character. However, note that SSIS imports column 3 incorrectly, i.e., it does not translate the two consecutive double quote characters into a single occurrence of a double quote character.
What do I have to do to get Excel and SSIS to get along?
Generally people avoid the issue by using column delimiter characters that are LESS LIKELY to occur in the data, but this is not a real solution.
I find that if I modify the file from this
Col1,Col2,Col3,Col4,Col5,Col6,Col7
1,ABC,"AB""C","D,E",F,03,"3,2"
...to this:
Col1,Col2,Col3,Col4,Col5,Col6,Col7
1,ABC,"AB"C","D,E",F,03,"3,2"
i.e., collapsing the two consecutive quotes in column C's value into one, the data is loaded properly. However, this is a little confusing to me. First of all, how does SSIS determine that the double quote between the B and the C is not terminating that column value? Is it because the following characters are not a comma column delimiter or a row delimiter (CRLF)? And why does Excel export it this way?
According to Wikipedia, here are a couple of traits of a CSV file:
Fields containing line breaks (CRLF), double quotes, and commas
should be enclosed in double-quotes. For example:
"aaa","b CRLF
bb","ccc" CRLF
zzz,yyy,xxx
If double-quotes are used to enclose fields, then a double-quote
appearing inside a field must be escaped by preceding it with
another double quote. For example:
"aaa","b""bb","ccc"
However, it looks like SSIS doesn't like it that way when importing. What can be done to get Excel to create a CSV file that could contain ANY special characters used as column delimiters, text delimiters or row delimiters in the data? There's no reason that it can't work using the approach specified in Wikipedia, which is what I thought the old MS DTS packages used to do...
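For comparison, the doubled-quote convention quoted above is what Python's standard csv module follows by default, so it can be used to check that the Excel export is well formed. A minimal sketch (the helper name `parse_line` is an assumption):

```python
import csv

def parse_line(line):
    """Parse one CSV line using the doubled-quote escaping convention."""
    return next(csv.reader([line]))


fields = parse_line('1,ABC,"AB""C","D,E",F,03,"3,2"')
```

Here the third field comes back as AB"C and the last as 3,2, which is the reading Excel intended when it exported the row.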
Update:
If I use Notepad change the input file to
Col1,Col2,Col3,Col4,Col5,Col6,Col7,Col8
"1","ABC","AB""C","D,E","F","03","3,2","AB""C"
Excel reads it just fine
but SSIS returns
The preview sample contains embedded text qualifiers ("). The flat file parser does not support embedding text qualifiers in data. Parsing columns that contain data with text qualifiers will fail at run time.
Conclusion:
Just like the error message says in your update...
The flat file parser does not support embedding text qualifiers in data. Parsing columns that contain data with text qualifiers will fail at run time.
Confirmed bug in Microsoft Connect. I encourage everyone reading this to click on this aforementioned link and place your vote to have them fix this stinker. This is in the top 10 of the most egregious bugs I have encountered.
Do you need to use a comma delimiter?
I used a pipe delimiter with no text qualifier and it worked fine. Here is my output from the text file.
1|ABC|AB"C|D,E|F|03|3,2
You have 3 options in my opinion.
Read the data into a stage table.
Run any update queries you need on the columns
Now select your data from the stage table and output it to a flat file.
OR
Use pipes as your delimiters.
OR
Do all of this in a C# application and build it in code.
You could send the row to a script in SSIS and parse and build the file you want there as well.
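A minimal sketch of the "parse and rebuild the file in code" idea, in Python rather than C#. It reads the comma-delimited export and rewrites it as pipe-delimited text with no text qualifier. The function name `to_pipe_delimited` is hypothetical, and the sketch assumes the data itself never contains a pipe or a line break.

```python
import csv
import io

def to_pipe_delimited(csv_text):
    """Convert quoted, comma-delimited text to unqualified pipe-delimited text."""
    lines = []
    for row in csv.reader(io.StringIO(csv_text, newline="")):
        for field in row:
            # Without a text qualifier these characters cannot be represented.
            if "|" in field or "\n" in field or "\r" in field:
                raise ValueError(f"field cannot be written unqualified: {field!r}")
        lines.append("|".join(row))
    return "\n".join(lines)


source = 'Col1,Col2,Col3,Col4,Col5,Col6,Col7\r\n1,ABC,"AB""C","D,E",F,03,"3,2"\r\n'
converted = to_pipe_delimited(source)
```

The second line of `converted` matches the pipe-delimited output shown earlier in this answer.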
Using text qualifiers and "character" delimited fields is problematic for sure.
Have Fun!
When I use LoadRunner, it can read data from a CSV file. As we know, the values in a CSV file are separated by commas.
The problem is that if a parameter value in the CSV file contains a comma itself, the string is split into several segments. That is not what I want.
How can we get the original data with the comma in it?
When data has a comma, use an escape character to store the data in the parameter.
For example, if the name is 'Smith, John', it can be stored as Smith\, John in the Loadrunner data file.
When you save a file in Excel that has commas in the actual cell data, the whole cell will be inside two " characters. Also it seems that cells with a space in them are inside " chars.
Example
ColA,ColB,"ColC with a , inside",ColD,ColE
More info on CSV file format: http://www.parse-o-matic.com/parse/pskb/CSV-File-Format.htm
The answer to the question is that perhaps the easiest way to deal with , separators is to change the separator to a ; character. This is also a valid separator in CSV files.
Then the example would be:
ColA;ColB;"ColC with a , inside";ColD;ColE
Maybe the right way is to use C functions to read data from the file (for example fopen/fread). Once you have read it, you can use "strchr" to find the first quote character and the second quote character. Everything in that interval is a value, and it doesn't matter if a comma is inside.
For the documentation about fopen, fread,strchr, you could refer to the HP or C function references.
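The quote-scanning logic described above can be sketched as follows. This is Python rather than the C used in LoadRunner scripts, with str.find standing in for strchr. The function name `split_quoted_line` is hypothetical, and the sketch does not handle doubled quotes inside a quoted value.

```python
def split_quoted_line(line, separator=","):
    """Split a line on the separator, keeping quoted values intact."""
    fields = []
    pos = 0
    while pos <= len(line):
        if pos < len(line) and line[pos] == '"':
            # Quoted value: everything up to the next quote belongs to it.
            closing = line.find('"', pos + 1)
            if closing == -1:
                raise ValueError("unterminated quoted value")
            fields.append(line[pos + 1:closing])
            pos = closing + 2   # skip the closing quote and the separator
        else:
            # Plain value: read up to the next separator or end of line.
            end = line.find(separator, pos)
            if end == -1:
                end = len(line)
            fields.append(line[pos:end])
            pos = end + 1
    return fields


parts = split_quoted_line('ColA,ColB,"ColC with a , inside",ColD,ColE')
```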
Hope this will help you.
Assuming you are reading from a data file for the parameters, just use a custom separator. Comma is the default, but you can define it to be whatever you want. Whenever I have a comma in the variable data I tend to use a pipe symbol, '|', as a way to distinguish the columns of data in the data file.
Examine your parameter dialog carefully and you will see where to make the change.