Nifi: Delimiters in Nifi - csv

I was working on a task and got an error. Which says that invalid char between encapsulated token and delimiter.
Here is the SS of the data
For line 08, I was using the pipe as a delimiter, and the escape character was '^'. For line 12 I was using the comma as a delimiter. The highlighted part is the issue. If I remove the cap sign from line 08 and a single quote from line 12 it runs with success.
The processor is ConverRecord and here is the screenshot of the configs of the processor.
Actually, I am using two processors of ConvertRecord. In one processor the fields separator is a comma(,) whereas in the second processor the fields separator is also comma(,) but the escape character is Cap sign(^).
assume that these are two different records.
Why it is throwing error at that point? And how can I solve this issue?
Thanks in advance.

For the first sample data (line 08), configure CSVReader as:
Quote Character: "
Escape Character: \
Value Separator(delimiter): |
For the second sample data (line 12), configure CSVReader as:
Quote Character: "
Escape Character: \
Value Separator(delimiter): ,
The reason for failure is that your data does not conform with delimited data specifications i.e. data is invalid, so you need to add upstream cleanup logic.
For line 08 data - you have used escape character as ^ and same is appeared in the data as well, so when CSVReader encountered ^" it escaped " because of this opening double quote does not have a corresponding closing double quote causing to throw the exception. So setting Escape Character: \ property will resolve the issue. \ is kind of widely used escape character, so it is very rare to get \ as a part of data.
For line 12 data - seems like single quote ' is used as Quote Character and missing a corresponding closing quote character i.e. ' causing to throw the exception. You need to devise a logic that will add the missing closing quote character wherever required. A workaround would be to use Quote Character: " so that ' will be the part of the data and then you can clean it at downstream eg. if you are putting data into a table then post-ingestion updating the column to remove '

Related

MYSQL in Bash script gives out Error 1045 [duplicate]

In Bash, what are the differences between single quotes ('') and double quotes ("")?
Single quotes won't interpolate anything, but double quotes will. For example: variables, backticks, certain \ escapes, etc.
Example:
$ echo "$(echo "upg")"
upg
$ echo '$(echo "upg")'
$(echo "upg")
The Bash manual has this to say:
3.1.2.2 Single Quotes
Enclosing characters in single quotes (') preserves the literal value of each character within the quotes. A single quote may not occur between single quotes, even when preceded by a backslash.
3.1.2.3 Double Quotes
Enclosing characters in double quotes (") preserves the literal value of all characters within the quotes, with the exception of $, `, \, and, when history expansion is enabled, !. The characters $ and ` retain their special meaning within double quotes (see Shell Expansions). The backslash retains its special meaning only when followed by one of the following characters: $, `, ", \, or newline. Within double quotes, backslashes that are followed by one of these characters are removed. Backslashes preceding characters without a special meaning are left unmodified. A double quote may be quoted within double quotes by preceding it with a backslash. If enabled, history expansion will be performed unless an ! appearing in double quotes is escaped using a backslash. The backslash preceding the ! is not removed.
The special parameters * and # have special meaning when in double quotes (see Shell Parameter Expansion).
The accepted answer is great. I am making a table that helps in quick comprehension of the topic. The explanation involves a simple variable a as well as an indexed array arr.
If we set
a=apple # a simple variable
arr=(apple) # an indexed array with a single element
and then echo the expression in the second column, we would get the result / behavior shown in the third column. The fourth column explains the behavior.
#
Expression
Result
Comments
1
"$a"
apple
variables are expanded inside ""
2
'$a'
$a
variables are not expanded inside ''
3
"'$a'"
'apple'
'' has no special meaning inside ""
4
'"$a"'
"$a"
"" is treated literally inside ''
5
'\''
invalid
can not escape a ' within ''; use "'" or $'\'' (ANSI-C quoting)
6
"red$arocks"
red
$arocks does not expand $a; use ${a}rocks to preserve $a
7
"redapple$"
redapple$
$ followed by no variable name evaluates to $
8
'\"'
\"
\ has no special meaning inside ''
9
"\'"
\'
\' is interpreted inside "" but has no significance for '
10
"\""
"
\" is interpreted inside ""
11
"*"
*
glob does not work inside "" or ''
12
"\t\n"
\t\n
\t and \n have no special meaning inside "" or ''; use ANSI-C quoting
13
"`echo hi`"
hi
`` and $() are evaluated inside "" (backquotes are retained in actual output)
14
'`echo hi`'
`echo hi`
`` and $() are not evaluated inside '' (backquotes are retained in actual output)
15
'${arr[0]}'
${arr[0]}
array access not possible inside ''
16
"${arr[0]}"
apple
array access works inside ""
17
$'$a\''
$a'
single quotes can be escaped inside ANSI-C quoting
18
"$'\t'"
$'\t'
ANSI-C quoting is not interpreted inside ""
19
'!cmd'
!cmd
history expansion character '!' is ignored inside ''
20
"!cmd"
cmd args
expands to the most recent command matching "cmd"
21
$'!cmd'
!cmd
history expansion character '!' is ignored inside ANSI-C quotes
See also:
ANSI-C quoting with $'' - GNU Bash Manual
Locale translation with $"" - GNU Bash Manual
A three-point formula for quotes
If you're referring to what happens when you echo something, the single quotes will literally echo what you have between them, while the double quotes will evaluate variables between them and output the value of the variable.
For example, this
#!/bin/sh
MYVAR=sometext
echo "double quotes gives you $MYVAR"
echo 'single quotes gives you $MYVAR'
will give this:
double quotes gives you sometext
single quotes gives you $MYVAR
Others explained it very well, and I just want to give something with simple examples.
Single quotes can be used around text to prevent the shell from interpreting any special characters. Dollar signs, spaces, ampersands, asterisks and other special characters are all ignored when enclosed within single quotes.
echo 'All sorts of things are ignored in single quotes, like $ & * ; |.'
It will give this:
All sorts of things are ignored in single quotes, like $ & * ; |.
The only thing that cannot be put within single quotes is a single quote.
Double quotes act similarly to single quotes, except double quotes still allow the shell to interpret dollar signs, back quotes and backslashes. It is already known that backslashes prevent a single special character from being interpreted. This can be useful within double quotes if a dollar sign needs to be used as text instead of for a variable. It also allows double quotes to be escaped so they are not interpreted as the end of a quoted string.
echo "Here's how we can use single ' and double \" quotes within double quotes"
It will give this:
Here's how we can use single ' and double " quotes within double quotes
It may also be noticed that the apostrophe, which would otherwise be interpreted as the beginning of a quoted string, is ignored within double quotes. Variables, however, are interpreted and substituted with their values within double quotes.
echo "The current Oracle SID is $ORACLE_SID"
It will give this:
The current Oracle SID is test
Back quotes are wholly unlike single or double quotes. Instead of being used to prevent the interpretation of special characters, back quotes actually force the execution of the commands they enclose. After the enclosed commands are executed, their output is substituted in place of the back quotes in the original line. This will be clearer with an example.
today=`date '+%A, %B %d, %Y'`
echo $today
It will give this:
Monday, September 28, 2015
Since this is the de facto answer when dealing with quotes in Bash, I'll add upon one more point missed in the answers above, when dealing with the arithmetic operators in the shell.
The Bash shell supports two ways to do arithmetic operation, one defined by the built-in let command and the other the $((..)) operator. The former evaluates an arithmetic expression while the latter is more of a compound statement.
It is important to understand that the arithmetic expression used with let undergoes word-splitting, pathname expansion just like any other shell commands. So proper quoting and escaping need to be done.
See this example when using let:
let 'foo = 2 + 1'
echo $foo
3
Using single quotes here is absolutely fine here, as there isn't any need for variable expansions here. Consider a case of
bar=1
let 'foo = $bar + 1'
It would fail miserably, as the $bar under single quotes would not expand and needs to be double-quoted as
let 'foo = '"$bar"' + 1'
This should be one of the reasons, the $((..)) should always be considered over using let. Because inside it, the contents aren't subject to word-splitting. The previous example using let can be simply written as
(( bar=1, foo = bar + 1 ))
Always remember to use $((..)) without single quotes
Though the $((..)) can be used with double quotes, there isn't any purpose to it as the result of it cannot contain content that would need the double quote. Just ensure it is not single quoted.
printf '%d\n' '$((1+1))'
-bash: printf: $((1+1)): invalid number
printf '%d\n' $((1+1))
2
printf '%d\n' "$((1+1))"
2
Maybe in some special cases of using the $((..)) operator inside a single quoted string, you need to interpolate quotes in a way that the operator either is left unquoted or under double quotes. E.g., consider a case, when you are tying to use the operator inside a curl statement to pass a counter every time a request is made, do
curl http://myurl.com --data-binary '{"requestCounter":'"$((reqcnt++))"'}'
Notice the use of nested double quotes inside, without which the literal string $((reqcnt++)) is passed to the requestCounter field.
There is a clear distinction between the usage of ' ' and " ".
When ' ' is used around anything, there is no "transformation or translation" done. It is printed as it is.
With " ", whatever it surrounds, is "translated or transformed" into its value.
By translation/ transformation I mean the following:
Anything within the single quotes will not be "translated" to their values. They will be taken as they are inside quotes. Example: a=23, then echo '$a' will produce $a on standard output. Whereas echo "$a" will produce 23 on standard output.
A minimal answer is needed for people to get going without spending a lot of time as I had to.
The following is, surprisingly (to those looking for an answer), a complete command:
$ echo '\'
whose output is:
\
Backslashes, surprisingly to even long-time users of bash, do not have any meaning inside single quotes. Nor does anything else.

How to avoid quoted commas parsing a CSV in JRuby

I'm trying to use the following line to parse a CSV into a table in JRuby.
# Parse the CSV file into a table
table = CSV.parse(File.read(tempFileName), headers: true)
The CSV file I'm using as input could have text columns that include commas. For these cases, I've included double quotation marks on the columns, to indicate that the internal commas on these columns should not be considered as delimiters.
For example, I could have the following CSV:
Address No, Alpha Name
1, Marcelo
2, "Surname, Name"
However, when I execute the code, I obtain the following error:
"Exception": "Message": "org.jruby.embed.InvokeFailedException:
(MalformedCSVError) Illegal quoting in line 2.: tat RUBY.main
Is there any way to avoid this error indicating the correct quoting character and, also, is it going to avoid considering the internal quotes as column separators?

spark csv writer - escape string without using quotes

I am trying to escape delimiter character that appears inside data. Is there a way to do it by passing option parameters? I can do it from udf, but I am hoping it is possible using options.
val df = Seq((8, "test,me\nand your", "other")).toDF("number", "test", "t")
df.coalesce(1).write.mode("overwrite").format("csv").option("quote", "\u0000").option("delimiter", ",").option("escape", "\\").save("testcsv1")
But the escape is not working. The output file is written as
8,test,me
and your,other
I want the output file to be written as.
8,test\,me\\nand your,other
I'm not certain, but I think if you had your sequence as
Seq((8, "test\\,me\\\\nand your", "other"))
and did not specify a custom escape character, it would behave as you are expecting and give you 8,test\,me\\nand your,other as the output. This is because \\ acts simply as the character '\' rather than an escape, so they are printed where you want and the n immediately after is not interpreted as part of a newline character.

Weka and CSV files

I'm currently trying to import some data into weka. Currently the data is in a CSV file, and consists of a numerical ID and then some string data(Tweets). I'm getting an error where it is reading "Wrong number of values, Read 1, expected 2 Token[EOL], line 17". I'm using quotes as my enclosure characters for the String data. I understand that something(presumably an EOL character?) is causing weka to incorrectly separate some of the String data into multiple entries on the same line, but I'm not sure how to fix the EOL token problem.
My data set can be viewed here. The current data set is on Sheet 2:
https://docs.google.com/spreadsheets/d/1Yclu0t4ITFWn6itYBsVtkGalmP9BPaWFFP6U6jAeLMU/edit?usp=sharing
The text file itself may be found here:
https://drive.google.com/file/d/0B433FqC3TscQQkRxZklQclA3Z3M/view?usp=sharing
Current error is now on the 3rd line, with the same error. The only newline character there is the one at the end of the line denoting a new entry, so I'm not sure why its having issues.
In its datasets, Weka considers a newline character as an indication of the end of instance. Your line 17 is actually a multi-line tweet which confuses Weka. You can use either
a RegEx to get rid of the newline characters in every single tweet or
during downloading the tweets, clean the tweets to get rid of any newline character in them.
Unfortunately, Weka does not have a mechanism to get rid of this problem by itself (as far as I know).
EDIT
Okay, here are some other things that need to be fixed (according to your EDITS in the question):
Replace ' with \'
Replace grave accent with \grave accent
Many tweets contain quotes inside quotes. The inside double quotes (") should be replaced by \"
If you put your tweets inside double quotes, then your header should be id, "text"
Some tweets contain two consecutive double quotes, get rid of them or replace them with \".
I cannot say exactly where, because I lost trace, but I think still some tweets contain new lines in them (or at least one tweet has it still)
These are just a few things that I noticed. There might be more. Time will tell.

Is there a way to include commas in CSV columns without breaking the formatting?

I've got a two column CSV with a name and a number. Some people's name use commas, for example Joe Blow, CFA. This comma breaks the CSV format, since it's interpreted as a new column.
I've read up and the most common prescription seems to be replacing that character, or replacing the delimiter, with a new value (e.g. this|that|the, other).
I'd really like to keep the comma separator (I know excel supports other delimiters but other interpreters may not). I'd also like to keep the comma in the name, as Joe Blow| CFA looks pretty silly.
Is there a way to include commas in CSV columns without breaking the formatting, for example by escaping them?
To encode a field containing comma (,) or double-quote (") characters, enclose the field in double-quotes:
field1,"field, 2",field3, ...
Literal double-quote characters are typically represented by a pair of double-quotes (""). For example, a field exclusively containing one double-quote character is encoded as """".
For example:
Sheet: |Hello, World!|You "matter" to us.|
CSV: "Hello, World!","You ""matter"" to us."
More examples (sheet → csv):
regular_value → regular_value
Fresh, brown "eggs" → "Fresh, brown ""eggs"""
" → """"
"," → ""","""
,,," → ",,,"""
,"", → ","""","
""" → """"""""
See wikipedia.
I found that some applications like Numbers in Mac ignore the double quote if there is space before it.
a, "b,c" doesn't work while a,"b,c" works.
The problem with the CSV format, is there's not one spec, there are several accepted methods, with no way of distinguishing which should be used (for generate/interpret). I discussed all the methods to escape characters (newlines in that case, but same basic premise) in another post. Basically it comes down to using a CSV generation/escaping process for the intended users, and hoping the rest don't mind.
Reference spec document.
If you want to make that you said, you can use quotes. Something like this
$name = "Joe Blow, CFA.";
$arr[] = "\"".$name."\"";
so now, you can use comma in your name variable.
You need to quote that values.
Here is a more detailed spec.
In addition to the points in other answers: one thing to note if you are using quotes in Excel is the placement of your spaces. If you have a line of code like this:
print '%s, "%s", "%s", "%s"' % (value_1, value_2, value_3, value_4)
Excel will treat the initial quote as a literal quote instead of using it to escape commas. Your code will need to change to
print '%s,"%s","%s","%s"' % (value_1, value_2, value_3, value_4)
It was this subtlety that brought me here.
You can use Template literals (Template strings)
e.g -
`"${item}"`
CSV files can actually be formatted using different delimiters, comma is just the default.
You can use the sep flag to specify the delimiter you want for your CSV file.
Just add the line sep=; as the very first line in your CSV file, that is if you want your delimiter to be semi-colon. You can change it to any other character.
This isn't a perfect solution, but you can just replace all uses of commas with ‚ or a lower quote. It looks very very similar to a comma and will visually serve the same purpose. No quotes are required
in JS this would be
stringVal.replaceAll(',', '‚')
You will need to be super careful of cases where you need to directly compare that data though
Depending on your language, there may be a to_json method available. That will escape many things that break CSVs.
I faced the same problem and quoting the , did not help. Eventually, I replaced the , with +, finished the processing, saved the output into an outfile and replaced the + with ,. This may seem ugly but it worked for me.
May not be what is needed here but it's a very old question and the answer may help others. A tip I find useful with importing into Excel with a different separator is to open the file in a text editor and add a first line like:
sep=|
where | is the separator you wish Excel to use.
Alternatively you can change the default separator in Windows but a bit long-winded:
Control Panel>Clock & region>Region>Formats>Additional>Numbers>List separator [change from comma to your preferred alternative]. That means Excel will also default to exporting CSVs using the chosen separator.
You could encode your values, for example in PHP base64_encode($str) / base64_decode($str)
IMO this is simpler than doubling up quotes, etc.
https://www.php.net/manual/en/function.base64-encode.php
The encoded values will never contain a comma so every comma in your CSV will be a separator.
You can use the Text_Qualifier field in your Flat file connection manager to as ". This should wrap your data in quotes and only separate by commas which are outside the quotes.
First, if item value has double quote character ("), replace with 2 double quote character ("")
item = item.ToString().Replace("""", """""")
Finally, wrap item value:
ON LEFT: With double quote character (")
ON RIGHT: With double quote character (") and comma character (,)
csv += """" & item.ToString() & ""","
Double quotes not worked for me, it worked for me \". If you want to place a double quotes as example you can set \"\".
You can build formulas, as example:
fprintf(strout, "\"=if(C3=1,\"\"\"\",B3)\"\n");
will write in csv:
=IF(C3=1,"",B3)
A C# method for escaping delimiter characters and quotes in column text. It should be all you need to ensure your csv is not mangled.
private string EscapeDelimiter(string field)
{
if (field.Contains(yourEscapeCharacter))
{
field = field.Replace("\"", "\"\"");
field = $"\"{field}\"";
}
return field;
}