Remove double quotes within double quotes - mysql

I want to import a large CSV file with MySQL's LOAD DATA INFILE. The file is delimited with pipes (|) and fields are enclosed in double quotes ("). Many text fields contain double quotes inside the enclosing double quotes, which causes all the data to land in the same column, so I need to remove the extra double quotes, but only when they appear inside a quoted field:
Example:
|"George Kastrioti "Skanderbeg""|""|""|"1926"|
Desired output:
|"George Kastrioti Skanderbeg"|"|"|"1926"|
I tried with sed but with no real success. Any ideas or tips?

sed ': again
s/\(|"[^"|]*\)"\([^"|]*"\)/\1\2/g
t again
s/""/"/g' YourFile
But I imagine that |""| is more logical than |"|, so this version should be better (just an idea; I don't know your real need, and your sample shows a single double quote only for empty values):
sed ': again
s/\(|"[^"|]*\)"\([^"|]*"\)/\1\2/g
t again' YourFile
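For anyone who'd rather prototype this outside sed, the same repeated-substitution idea can be sketched in Python (a sketch only; strip_inner_quotes is an invented helper name, and it mirrors the second script, which keeps |""| for empty fields):

```python
import re

# Same pattern as the sed script: a quote strictly inside a |"..."| field.
INNER_QUOTE = re.compile(r'(\|"[^"|]*)"([^"|]*")')

def strip_inner_quotes(line):
    """Repeat the substitution until nothing changes, like ': again ... t again'."""
    while True:
        new = INNER_QUOTE.sub(r'\1\2', line)
        if new == line:
            return new
        line = new

print(strip_inner_quotes('|"George Kastrioti "Skanderbeg""|""|""|"1926"|'))
# → |"George Kastrioti Skanderbeg"|""|""|"1926"|
```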

Related

How do I convince Splunk that a backslash inside a CSV field is not an escape character?

I have the following row in a CSV file that I am ingesting into a Splunk index:
"field1","field2","field3\","field4"
Excel and the default Python CSV reader both correctly parse that as 4 separate fields. Splunk does not. It seems to be treating the backslash as an escape character and interpreting field3","field4 as a single mangled field. It is my understanding that the standard escape character for double quotes inside a quoted CSV field is another double quote, according to RFC-4180:
"If double-quotes are used to enclose fields, then a double-quote appearing inside a field must be escaped by preceding it with another double quote."
Why is Splunk treating the backslash as an escape character, and is there any way to change that configuration via props.conf or any other way? I have set:
INDEXED_EXTRACTIONS = csv
KV_MODE = none
for this sourcetype in props.conf, and it is working fine for rows without backslashes in them.
UPDATE: Yeah so Splunk's CSV parsing is indeed not RFC-4180 compliant, and there's not really any workaround that I could find. In the end I changed the upstream data pipeline to output JSON instead of CSVs for ingestion by Splunk. Now it works fine. Let this be a cautionary tale if anyone stumbles across this question while trying to parse CSVs in Splunk!
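For anyone facing the same wall, the pipeline change described in the update is easy to sketch in Python: the csv module parses the backslash row correctly, and emitting JSON lines sidesteps Splunk's CSV quirks entirely (the field names below are invented for illustration):

```python
import csv
import io
import json

# The row that Splunk mangled; Python's csv reader treats the backslash
# as a literal character, consistent with RFC 4180.
raw = '"field1","field2","field3\\","field4"\n'

for row in csv.reader(io.StringIO(raw)):
    record = dict(zip(["col1", "col2", "col3", "col4"], row))
    print(json.dumps(record))
# → {"col1": "field1", "col2": "field2", "col3": "field3\\", "col4": "field4"}
```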

Find hidden (or not hidden) double quotes in file using linux

I need to convert a file from comma separated to pipe delimited. I plan on using sed 's/,/|/g' - I have tested on dummy data and this has worked for me.
However, before I execute the sed command, how can I find out whether my data file contains double quotes? I have been asked to deliver the file pipe-delimited without any double quotes. This is a fairly large file (4+ million rows), and I am wondering whether there is a way in Linux (perhaps with grep) to quickly identify any double quotes in the file. Since this is a CSV file, there is a chance a business-entity name may contain double quotes (e.g. "business name, inc").
Thanks in advance!
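Not the asker, but for the record: grep -c '"' file counts the lines containing a double quote, and grep -m1 -n '"' file prints the first one and stops. The same streaming check can be sketched in Python (lines_with_quotes is an invented helper; reading line by line keeps memory flat even at 4+ million rows):

```python
def lines_with_quotes(path):
    """Return the 1-based numbers of lines that contain a double quote."""
    hits = []
    with open(path, encoding="utf-8") as fh:
        for lineno, line in enumerate(fh, start=1):
            if '"' in line:
                hits.append(lineno)
    return hits
```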

Using SED to add values after the 5th field of a CSV file that is also an IP Address

I have a CSV file that I'm working to manipulate using sed. What I'm doing is inserting the current YYYY-MM-DD HH:MM:SS into the 5th field after the IP Address. As you can see below, each value is enclosed by double quotes and each CSV column is separated by a comma.
"12345","","","None","192.168.2.1","qqq","000"
"67890","ABC-1234-5678","9.9","Low","192.168.2.1","qqq","000"
Using the command: sed 'N;s/","/","YYYY-MM-DD HH:MM:SS","/5' FILENAME
I am adding the date after the 5th field. Normally this works, but certain values in the CSV file often throw off the field count, so the date is inserted in the wrong place. To remedy this, how can I not only add the date after the 5th field, but also make sure the 5th field is an IP address?
The final output should be:
"12345","","","None","192.168.2.1","YYYY-MM-DD HH:MM:SS","qqq","000"
"67890","ABC-1234-5678","9.9","Low","192.168.2.1","YYYY-MM-DD HH:MM:SS","qqq","000"
Please explain how this is done using sed, not awk, and how I can make sure the 5th field is an IP address before the date is added.
This answer currently assumes that the CSV file is beautifully consistent and simple (as in the sample data), so that:
Fields always have double quotes around them.
There are never fields like "…""…" to indicate a double quote embedded in the string.
There are never fields with commas in between the quotes ("this,that").
Given those prerequisites, this sed script does the job:
sed 's/^\("[^"]*",\)\{4\}"\([0-9]\{1,3\}\.\)\{3\}[0-9]\{1,3\}",/&"YYYY-MM-DD HH:MM:SS",/'
Let's split that search pattern into pieces:
^\("[^"]*",\)\{4\}
Match start of line followed by: 4 repeats of a double quote, a sequence of zero or more non-double-quotes, a double quote and a comma.
In other words, this identifies the first four fields.
"\([0-9]\{1,3\}\.\)\{3\}
Match a double quote, then 3 repeats of 1-3 decimal digits followed by a dot — the first three triplets of an IPv4 dotted-decimal address.
[0-9]\{1,3\}",
Match 1-3 decimal digits followed by a double quote and a comma — the last triplet of an IPv4 dotted-decimal address plus the end of a field.
Clearly, for each idiosyncrasy of CSV files that you also need to deal with, you have to modify the regular expressions. That's not trivial.
Using extended regular expressions (enabled by -E on both GNU and BSD sed), you could write:
sed -E 's/^("(([^"]*"")*[^"]*)",){4}"([0-9]{1,3}\.){3}[0-9]{1,3}",/&"YYYY-MM-DD HH:MM:SS",/'
The pattern to recognize the first 4 fields is more complex than before. It matches 4 repeats of: double quote, zero or more occurrences of { zero or more non-double-quotes followed by two double quotes } followed by zero or more non-double-quotes followed by a double quote and a comma.
You can also write that in classic sed (basic regular expressions) with a liberal sprinkling of backslashes:
sed 's/^\("\(\([^"]*""\)*[^"]*\)",\)\{4\}"\([0-9]\{1,3\}\.\)\{3\}[0-9]\{1,3\}",/&"YYYY-MM-DD HH:MM:SS",/'
Given the data file:
"12345","","","None","192.168.2.1","qqq","000"
"67890","ABC-1234-5678","9.9","Low","192.168.2.1","qqq","000"
"23456","Quaternions","2.3","Pisces","Heredotus","qqq","000"
"34567","Commas, oh commas!","3.14159","""Quotes"" quoth he","192.168.99.37","zzz","011"
"45678","Commas, oh commas!","3.14159","""Quote me"",""or not""","192.168.99.37","zzz","011"
The first script shown produces the output:
"12345","","","None","192.168.2.1","YYYY-MM-DD HH:MM:SS","qqq","000"
"67890","ABC-1234-5678","9.9","Low","192.168.2.1","YYYY-MM-DD HH:MM:SS","qqq","000"
"23456","Quaternions","2.3","Pisces","Heredotus","qqq","000"
"34567","Commas, oh commas!","3.14159","""Quotes"" quoth he","192.168.99.37","zzz","011"
"45678","Commas, oh commas!","3.14159","""Quote me"",""or not""","192.168.99.37","zzz","011"
The first two lines are correctly mapped; the third is correctly unchanged, but the last two should have been mapped and were not.
The second and third commands produce:
"12345","","","None","192.168.2.1","YYYY-MM-DD HH:MM:SS","qqq","000"
"67890","ABC-1234-5678","9.9","Low","192.168.2.1","YYYY-MM-DD HH:MM:SS","qqq","000"
"23456","Quaternions","2.3","Pisces","Heredotus","qqq","000"
"34567","Commas, oh commas!","3.14159","""Quotes"" quoth he","192.168.99.37","YYYY-MM-DD HH:MM:SS","zzz","011"
"45678","Commas, oh commas!","3.14159","""Quote me"",""or not""","192.168.99.37","YYYY-MM-DD HH:MM:SS","zzz","011"
Note that Heredotus is not modified (correctly), and the last two lines get the date string added after the IP address (also correctly).
Those last regular expressions are not for the faint-of-heart.
Clearly, if you want to insist that the IP addresses only match numbers in the range 0..255 in each component, with no leading 0, then you have to beef up the IP address matching portion of the regular expression. It can be done; it is not pretty. It is easiest to do it with extended regular expressions:
([0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])
You'd use that unit in place of each [0-9]{1,3} unit in the regexes shown before.
Note that this still does not attempt to deal with fields not surrounded by double quotes.
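To illustrate the strict-octet alternation, here is a small Python check built from exactly that unit (a sketch; is_ipv4 is an invented helper, and in BRE sed you'd need the backslashed equivalent):

```python
import re

# The strict 0..255 unit from above, with no leading zeros.
OCTET = r"(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])"
IPV4 = re.compile(rf"^{OCTET}(?:\.{OCTET}){{3}}$")

def is_ipv4(s):
    return bool(IPV4.match(s))

print(is_ipv4("192.168.2.1"))    # True
print(is_ipv4("192.168.2.256"))  # False: 256 is out of range
print(is_ipv4("192.168.02.1"))   # False: leading zero rejected
```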
It also does not determine the value to substitute from the date command. That is doable with (if not elementary then) routine shell scripting carefully managing quotes:
dt=$(date +'%Y-%m-%d %H:%M:%S')
sed -E 's/^("(([^"]*"")*[^"]*)",){4}"([0-9]{1,3}\.){3}[0-9]{1,3}",/&"'"$dt"'",/'
The '…"'"$dt"'",/' sequence is part of what starts out as a single-quoted string. The first double quote is simple data in the string; the next single quote ends the quoting, the "$dt" interpolates the value from date inside shell double quotes (so the space doesn't cause any trouble), then the single quote resumes the single-quoted notation, adding another double quote, a comma and a slash before the string (argument to sed) is terminated.
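If the regexes get too hairy, a CSV-aware tool does the quote handling for free. A hedged sketch of the same insert-after-field-5 logic using Python's csv module (insert_timestamp is an invented helper; the loose \d{1,3} IP pattern matches the first sed script, and QUOTE_ALL re-quotes every field as in the sample data):

```python
import csv
import io
import re

IP_RE = re.compile(r"^(\d{1,3}\.){3}\d{1,3}$")

def insert_timestamp(text, stamp):
    out = io.StringIO()
    writer = csv.writer(out, quoting=csv.QUOTE_ALL, lineterminator="\n")
    for row in csv.reader(io.StringIO(text)):
        # Only insert when the 5th field looks like an IPv4 address.
        if len(row) >= 5 and IP_RE.match(row[4]):
            row.insert(5, stamp)
        writer.writerow(row)
    return out.getvalue()

data = (
    '"12345","","","None","192.168.2.1","qqq","000"\n'
    '"34567","Commas, oh commas!","3.14159","""Quotes"" quoth he","192.168.99.37","zzz","011"\n'
)
print(insert_timestamp(data, "YYYY-MM-DD HH:MM:SS"), end="")
```

Rows whose 5th field is not an IP address pass through unchanged, and embedded commas and doubled quotes are parsed and re-emitted correctly.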
Try:
awk -vdate1=$(date +"%Y-%m-%d") -vdate2=$(date +"%H:%M:%S") -F, '$5 ~ /[0-9]+\.[0-9]+\.[0-9]+\.[0-9]/{$5=$5 FS date1 " " date2} 1' OFS=, Input_file
Also, if you want to edit the same Input_file in place, you could redirect the above command's output into a temp file and later rename it (with the mv command) back to Input_file.
Adding a non-one-liner (expanded) form of the solution too:
awk -vdate1=$(date +"%Y-%m-%d") -vdate2=$(date +"%H:%M:%S") -F, '
$5 ~ /[0-9]+\.[0-9]+\.[0-9]+\.[0-9]/{
$5=$5 FS date1 " " date2
}
1
' OFS=, Input_file

Is there a way to include commas in CSV columns without breaking the formatting?

I've got a two column CSV with a name and a number. Some people's name use commas, for example Joe Blow, CFA. This comma breaks the CSV format, since it's interpreted as a new column.
I've read up and the most common prescription seems to be replacing that character, or replacing the delimiter, with a new value (e.g. this|that|the, other).
I'd really like to keep the comma separator (I know excel supports other delimiters but other interpreters may not). I'd also like to keep the comma in the name, as Joe Blow| CFA looks pretty silly.
Is there a way to include commas in CSV columns without breaking the formatting, for example by escaping them?
To encode a field containing comma (,) or double-quote (") characters, enclose the field in double-quotes:
field1,"field, 2",field3, ...
Literal double-quote characters are typically represented by a pair of double-quotes (""). For example, a field exclusively containing one double-quote character is encoded as """".
For example:
Sheet: |Hello, World!|You "matter" to us.|
CSV: "Hello, World!","You ""matter"" to us."
More examples (sheet → csv):
regular_value → regular_value
Fresh, brown "eggs" → "Fresh, brown ""eggs"""
" → """"
"," → ""","""
,,," → ",,,"""
,"", → ","""","
""" → """"""""
See wikipedia.
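These rules are exactly what standard CSV libraries implement, so you rarely need to hand-roll them. A quick Python demonstration using the sheet values from above:

```python
import csv
import io

out = io.StringIO()
writer = csv.writer(out, lineterminator="\n")  # QUOTE_MINIMAL by default
writer.writerow(["Hello, World!", 'You "matter" to us.'])
print(out.getvalue(), end="")
# → "Hello, World!","You ""matter"" to us."

# Reading it back round-trips to the original values.
row = next(csv.reader(io.StringIO(out.getvalue())))
print(row)
# → ['Hello, World!', 'You "matter" to us.']
```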
I found that some applications, like Numbers on macOS, ignore the double quote if there is a space before it:
a, "b,c" doesn't work, while a,"b,c" works.
The problem with the CSV format is that there isn't one spec; there are several accepted methods, with no way of distinguishing which should be used to generate (or interpret) a given file. I discussed all the methods of escaping characters (newlines in that case, but the same basic premise applies) in another post. Basically, it comes down to using a CSV generation/escaping process suited to the intended users, and hoping the rest don't mind.
Reference spec document.
To do what you describe, you can use quotes. Something like this:
$name = "Joe Blow, CFA.";
$arr[] = "\"" . $name . "\"";
Now you can use a comma in your name variable.
You need to quote those values.
Here is a more detailed spec.
In addition to the points in other answers: one thing to note if you are using quotes in Excel is the placement of your spaces. If you have a line of code like this:
print '%s, "%s", "%s", "%s"' % (value_1, value_2, value_3, value_4)
Excel will treat the initial quote as a literal quote instead of using it to escape commas. Your code will need to change to
print '%s,"%s","%s","%s"' % (value_1, value_2, value_3, value_4)
It was this subtlety that brought me here.
You can use template literals (template strings), e.g.:
`"${item}"`
CSV files can actually be formatted using different delimiters, comma is just the default.
You can use the sep flag to specify the delimiter you want for your CSV file.
Just add the line sep=; as the very first line in your CSV file if you want your delimiter to be a semicolon. You can change it to any other character.
This isn't a perfect solution, but you can simply replace all commas with ‚ (a low quotation mark). It looks very similar to a comma and will visually serve the same purpose, and no quotes are required.
In JS this would be:
stringVal.replaceAll(',', '‚')
You will need to be very careful in cases where you need to compare that data directly, though.
Depending on your language, there may be a to_json method available. That will escape many things that break CSVs.
I faced the same problem and quoting the , did not help. Eventually, I replaced the , with +, finished the processing, saved the output into an outfile and replaced the + with ,. This may seem ugly but it worked for me.
May not be what is needed here but it's a very old question and the answer may help others. A tip I find useful with importing into Excel with a different separator is to open the file in a text editor and add a first line like:
sep=|
where | is the separator you wish Excel to use.
Alternatively, you can change the default list separator in Windows, though it's a bit long-winded:
Control Panel > Clock & Region > Region > Formats > Additional settings > Numbers > List separator [change from comma to your preferred alternative]. Excel will then also default to exporting CSVs using the chosen separator.
You could encode your values, for example in PHP base64_encode($str) / base64_decode($str)
IMO this is simpler than doubling up quotes, etc.
https://www.php.net/manual/en/function.base64-encode.php
The encoded values will never contain a comma so every comma in your CSV will be a separator.
You can set the Text_Qualifier field in your Flat File Connection Manager to ". This will wrap your data in quotes and only split on commas that fall outside the quotes.
First, if the item value contains a double quote character ("), replace it with two double quote characters (""):
item = item.ToString().Replace("""", """""")
Finally, wrap the item value:
ON LEFT: with a double quote character (")
ON RIGHT: with a double quote character (") and a comma character (,)
csv += """" & item.ToString() & ""","
Doubling the quotes did not work for me; escaping with \" did. If you want to place a literal double quote, for example, you can write \"\".
You can build formulas, for example:
fprintf(strout, "\"=if(C3=1,\"\"\"\",B3)\"\n");
will write in csv:
=IF(C3=1,"",B3)
A C# method for escaping delimiter characters and quotes in column text. It should be all you need to ensure your CSV is not mangled.
private string EscapeDelimiter(string field, char delimiter = ',')
{
    // Quote the field if it contains the delimiter, a double quote, or a newline.
    if (field.Contains(delimiter) || field.Contains('"') || field.Contains('\n'))
    {
        field = field.Replace("\"", "\"\"");
        field = $"\"{field}\"";
    }
    return field;
}

Load data infile

I need LOAD DATA INFILE to handle two possible enclosure characters: the strings in my file can be enclosed in either single quotes or double quotes. How can I manage that?
Not sure this is The Answer... but I'd run the file through sed to replace all the double quotes with single quotes (or the other way around). You need to keep escaping in mind when writing your regex, though.
Is it a CSV or some other delimited file? You could pick a delimiter and skip string enclosures entirely (so long as the strings don't contain the delimiter).