I have the following CSV where I have to replace the thousands comma separator with nothing. In the example below, when I have the amount "1,000.00" I should get 1000.00 (no comma, no quotes) instead.
I use JREPL to remove the header from my CSV:
jrepl "(?:.*\n){1,1}([\s\S]*)" "$1" /m /f "csv/Transactions.csv" /o "csv/Transactionsfeed.csv")
I was wondering if I could do the process of removing header + dealing with the thousand comma in one step.
I am also open to the option of doing it with another command in a second step...
Tnx ID,Trace ID - Gateway,Profile,Customer PIN,Customer,Ext. ID,Identifier,Amount,Chrg,Curr,Processor,Type,Status,Created By,Date Created,RejectReason
1102845,3962708,SL,John,Mohammad Alo,NA,455015*****9998,900.00,900.00,$,Un,Credit Card,Rejected,Internet,2016-05-16 06:54:10,"-330: Fail by bank, try again later(refer to acquirer)"
1102844,3962707,SL,John,Mohammad Alo,NA,455015*****9998,"1,000.00","1,000.00",$,Un,Credit Card,Rejected,Internet,2016-05-16 06:52:26,"-330: Fail by bank, try again later(refer to acquirer)"
Yes, there is a very efficient and fairly compact, straightforward solution:
jrepl "\q(\d{1,3}(?:,\d{3})*(?:\.\d*)*)\q" "$1.replace(/,/g,'')" /x /j /jendln "if (ln==1) $txt=false" /f "csv/Transactions.csv" /o "csv/Transactionsfeed.csv"
The /JENDLN JScript expression strips the header line by setting $txt to false if it is the first line.
The search string matches any quoted number that contains commas as thousand separators, and $1 is the number without the quotes.
The replace string is a JScript expression that replaces all commas in the matching $1 number with nothing.
EDIT
Note that the above will work with most CSVs you are likely to encounter. However, it would fail if you have a quoted field that contains a quoted number string literal. Something like the following would yield a corrupted CSV with the code above:
...,"some text ""123,456.78"" more text",...
This issue can be fixed with a bit more regex code. You only want to modify a quoted number if the opening quote is preceded by a comma or the beginning of the line, and the closing quote is followed by a comma or the end of the line.
A look-ahead assertion can be used for the trailing comma/EOL. But JREPL does not support look-behind, so the leading comma/BOL must be captured and preserved in the replacement:
jrepl "(^|,)\q(\d{1,3}(?:,\d{3})*(?:\.\d*)*)\q(?=$|,)" "$1+$2.replace(/,/g,'')" /x /j /jendln "if (ln==1) $txt=false" /f "csv/Transactions.csv" /o "csv/Transactionsfeed.csv"
EDIT in response to changing requirement in comment
The following will simply remove all quotes and commas from quoted CSV fields. I don't like this concept, and I suspect there is a much better way to handle this for import into MySQL, but this is what the OP is asking for.
jrepl "(^|,)(\q(?:[^\q]|\q\q)*\q)(?=$|,)" "$1+$2.replace(/,|\x22/g,'')" /x /j /jendln "if (ln==1) $txt=false" /f "csv/Transactions.csv" /o "csv/Transactionsfeed.csv"
May I suggest a different, simpler solution? The five-line Batch file below does what you want; save it with a .bat extension:
@set @a=0 /*
@cscript //nologo //E:JScript "%~F0" < "csv/Transactions.csv" > "csv/Transactionsfeed.csv"
@goto :EOF */
WScript.Stdin.ReadLine();
WScript.Stdout.Write(WScript.Stdin.ReadAll().replace(/(\"(\d{1,3}),(\d{3}\.\d{2})\")/g,"$2$3"));
JREPL.BAT is a large and complex program capable of advanced replacement tasks; however, your request is very simple. This code is also a Batch-JScript hybrid script that uses the replace method in the same way as JREPL.BAT, but it is tailored to your specific request.
The first ReadLine() reads the header line of the input file, so the subsequent ReadAll() reads and processes the rest of the lines.
The regexp (\"(\d{1,3}),(\d{3}\.\d{2})\") defines 3 submatches enclosed in parentheses: the first one is the whole number enclosed in quotes, like "1,000.00"; the second submatch is the digits before the comma, and the third submatch is the digits after the comma, including the decimal point.
The .replace method changes the match of the previous regexp, that is, the whole number enclosed in quotes, into just the second and third submatches.
Should I or should I not wrap quotes around variables in a shell script?
For example, is the following correct:
xdg-open $URL
[ $? -eq 2 ]
or
xdg-open "$URL"
[ "$?" -eq "2" ]
And if so, why?
General rule: quote it if it can either be empty or contain spaces (or any whitespace really) or special characters (wildcards). Not quoting strings with spaces often leads to the shell breaking apart a single argument into many.
$? doesn't need quotes since it's a numeric value. Whether $URL needs it depends on what you allow in there and whether you still want an argument if it's empty.
I tend to always quote strings just out of habit since it's safer that way.
In short, quote everything where you do not require the shell to perform word splitting and wildcard expansion.
Single quotes protect the text between them verbatim. It is the proper tool when you need to ensure that the shell does not touch the string at all. Typically, it is the quoting mechanism of choice when you do not require variable interpolation.
$ echo 'Nothing \t in here $will change'
Nothing \t in here $will change
$ grep -F '#&$*!!' file /dev/null
file:I can't get this #&$*!! quoting right.
Double quotes are suitable when variable interpolation is required. With suitable adaptations, they are also a good workaround when you need single quotes in the string. (There is no straightforward way to escape a single quote between single quotes, because there is no escape mechanism inside single quotes -- if there were, they would not quote completely verbatim.)
$ echo "There is no place like '$HOME'"
There is no place like '/home/me'
No quotes are suitable when you specifically require the shell to perform word splitting and/or wildcard expansion.
Word splitting (aka token splitting):
$ words="foo bar baz"
$ for word in $words; do
> echo "$word"
> done
foo
bar
baz
By contrast:
$ for word in "$words"; do echo "$word"; done
foo bar baz
(The loop only runs once, over the single, quoted string.)
$ for word in '$words'; do echo "$word"; done
$words
(The loop only runs once, over the literal single-quoted string.)
Wildcard expansion:
$ pattern='file*.txt'
$ ls $pattern
file1.txt file_other.txt
By contrast:
$ ls "$pattern"
ls: cannot access file*.txt: No such file or directory
(There is no file named literally file*.txt.)
$ ls '$pattern'
ls: cannot access $pattern: No such file or directory
(There is no file named $pattern, either!)
In more concrete terms, anything containing a filename should usually be quoted (because filenames can contain whitespace and other shell metacharacters). Anything containing a URL should usually be quoted (because many URLs contain shell metacharacters like ? and &). Anything containing a regex should usually be quoted (ditto ditto). Anything containing significant whitespace other than single spaces between non-whitespace characters needs to be quoted (because otherwise, the shell will munge the whitespace into, effectively, single spaces, and trim any leading or trailing whitespace).
When you know that a variable can only contain a value which contains no shell metacharacters, quoting is optional. Thus, an unquoted $? is basically fine, because this variable can only ever contain a single number. However, "$?" is also correct, and recommended for general consistency and correctness (though this is my personal recommendation, not a widely recognized policy).
Values which are not variables basically follow the same rules, though you could then also escape any metacharacters instead of quoting them. For a common example, a URL with a & in it will be parsed by the shell as a background command unless the metacharacter is escaped or quoted:
$ wget http://example.com/q&uack
[1] wget http://example.com/q
-bash: uack: command not found
(Of course, this also happens if the URL is in an unquoted variable.) For a static string, single quotes make the most sense, although any form of quoting or escaping works here.
wget 'http://example.com/q&uack' # Single quotes preferred for a static string
wget "http://example.com/q&uack" # Double quotes work here, too (no $ or ` in the value)
wget http://example.com/q\&uack # Backslash escape
wget http://example.com/q'&'uack # Only the metacharacter really needs quoting
The last example also suggests another useful concept, which I like to call "seesaw quoting". If you need to mix single and double quotes, you can use them adjacent to each other. For example, the following quoted strings
'$HOME '
"isn't"
' where `<3'
"' is."
can be pasted together back to back, forming a single long string after tokenization and quote removal.
$ echo '$HOME '"isn't"' where `<3'"' is."
$HOME isn't where `<3' is.
This isn't awfully legible, but it's a common technique and thus good to know.
As an aside, scripts should usually not use ls for anything. To expand a wildcard, just ... use it.
$ printf '%s\n' $pattern # not ``ls -1 $pattern''
file1.txt
file_other.txt
$ for file in $pattern; do # definitely, definitely not ``for file in $(ls $pattern)''
> printf 'Found file: %s\n' "$file"
> done
Found file: file1.txt
Found file: file_other.txt
(The loop is completely superfluous in the latter example; printf specifically works fine with multiple arguments. stat too. But looping over a wildcard match is a common problem, and frequently done incorrectly.)
A variable containing a list of tokens to loop over or a wildcard to expand is less frequently seen, so we sometimes abbreviate to "quote everything unless you know precisely what you are doing".
Here is a three-point formula for quotes in general:
Double quotes
In contexts where we want to suppress word splitting and globbing. Also in contexts where we want the literal to be treated as a string, not a regex.
Single quotes
In string literals where we want to suppress interpolation and special treatment of backslashes. In other words, situations where using double quotes would be inappropriate.
No quotes
In contexts where we are absolutely sure that there are no word splitting or globbing issues or we do want word splitting and globbing.
Examples
Double quotes
literal strings with whitespace ("StackOverflow rocks!", "Steve's Apple")
variable expansions ("$var", "${arr[@]}")
command substitutions ("$(ls)", "`ls`")
globs where directory path or file name part includes spaces ("/my dir/"*)
to protect single quotes ("single'quote'delimited'string")
Bash parameter expansion ("${filename##*/}")
Single quotes
command names and arguments that have whitespace in them
literal strings that need interpolation to be suppressed ( 'Really costs $$!', 'just a backslash followed by a t: \t')
to protect double quotes ('The "crux"')
regex literals that need interpolation to be suppressed
use shell quoting for literals involving special characters ($'\n\t')
use shell quoting where we need to protect several single and double quotes ($'{"table": "users", "where": "first_name"=\'Steve\'}')
No quotes
around standard numeric variables ($$, $?, $# etc.)
in arithmetic contexts like ((count++)), "${arr[idx]}", "${string:start:length}"
inside [[ ]] expression which is free from word splitting and globbing issues (this is a matter of style and opinions can vary widely)
where we want word splitting (for word in $words)
where we want globbing (for txtfile in *.txt; do ...)
where we want ~ to be interpreted as $HOME (~/"some dir" but not "~/some dir")
See also:
Difference between single and double quotes in Bash
What are the special dollar sign shell variables?
Quotes and escaping - Bash Hackers' Wiki
When is double quoting necessary?
I generally use quotes like "$var" to be safe, unless I am sure that $var does not contain spaces.
I do use $var as a simple way to join lines:
lines="`cat multi-lines-text-file.txt`"
echo "$lines" ## multiple lines
echo $lines ## all spaces (including newlines) are zapped
Whenever the https://www.shellcheck.net/ plugin for your editor tells you to.
I have a CSV file where I'm trying to replace two carriage returns in a row with a single carriage return using Fart.exe. First off, is this possible? If so, the text within the CSV is laid out like the below, where "CRLF" is an actual carriage return.
,CRLF
CRLF
But I want it to be just this without the extra carriage return on the second line:
,CRLF
I thought I could just do the below but it won't work:
CALL "C:\tmp\fart.exe" -C "C:\tmp\myfile.csv" ,\r\n\r\n ,\r\n
I need to know what to change ,\r\n\r\n to in order to make this work. Any ideas how I could make this happen? Thanks!
As Squashman has suggested, you are simply trying to remove empty lines.
There is no need for a 3rd party tool to do this. You can simply use FINDSTR to discard empty lines:
findstr /v "^$" myFile.txt >myFile.txt.new
move /y myFile.txt.new *. >nul
However, this will only work if all the lines end with CRLF. If you have a unix formatted file that ends each line with LF, then it will not work.
A more robust option would be to use JREPL.BAT - a regular expression command line text processing utility.
jrepl "^$" "" /r 0 /f myFile.txt /o -
Be sure to use CALL JREPL if you put the command within a batch script.
FART processes one line at a time, and the CRLF is not considered to be part of the line. So you can't use a normal FART command to remove CRLF. If you really want to use FART, then you will need to use the -B binary mode. You also need to use -C to get support for the escape sequences.
I've never used FART, so I can't be sure - but I believe the following would work
call fart -B -C myFile.txt "\r\n\r\n" "\r\n"
If you have many consecutive empty lines, then you will need to run the FART command repeatedly until there are no more changes.
I have a CSV file that I want to modify using batch to remove a string. Basically I have the following:
randomID1, randomID2, randomID3, networkinterface, othercolumn1, othercolumn2,
abc123AAB, 098189909, 999181818, net on Server123, FORCED, anotherthing,
abc2455aB, 848449388, 123131232, LocalNet on SEV1, FORCED, otherlessstuff,
The relevant values here are Server123 and SEV1, so I need to convert the above to:
randomID1, randomID2, randomID3, networkinterface, othercolumn1, othercolumn2,
abc123AAB, 098189909, 999181818, Server123, FORCED, anotherthing,
abc2455aB, 848449388, 123131232, SEV1, FORCED, otherlessstuff,
This means removing 'net on ' and 'LocalNet on ' strings.
How can I do this?
Batch language is far from ideal for this, but here's a basic script to simply line-by-line remove occurrences of "net on " and "LocalNet on " from input.txt and save the result as output.txt:
@ECHO OFF
SETLOCAL ENABLEDELAYEDEXPANSION
TYPE NUL > output.txt
FOR /F "delims=" %%L IN (input.txt) DO (
SET LINE=%%L
SET LINE=!LINE:LocalNet on =!
ECHO/!LINE:net on =!>> output.txt
)
Refinements are possible and may be needed. E.g., this won't work if the file contains reserved characters such as &, and it's not case sensitive. The latter is the reason the "LocalNet on " substitution is done before the "net on " substitution, which is a substring of it when compared case-insensitively. There's nothing CSV-specific here because, from your question, that doesn't appear to be required. But if, for example, you needed to treat different comma-separated tokens differently, that can be done with a "delims=," option and some extra code.
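If the reserved-character or case-sensitivity limitations matter, one option is to step outside batch; a minimal Python sketch of the same substitution (assuming the same input.txt and output.txt names) avoids both issues:
# Remove the two prefixes line by line; str.replace is case sensitive and has
# no trouble with characters like & that trip up the batch version.
with open("input.txt") as src, open("output.txt", "w") as dst:
    for line in src:
        dst.write(line.replace("LocalNet on ", "").replace("net on ", ""))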
I've been racking my brain with this for the past half an hour and everything I've tried so far has failed miserably!
Within an html file, there is a field within tags, but the field itself is not separated with a space from the > sign so it's hard to read with awk. I would basically like to add a single space after the opening tag, but gsub and awk are refusing to cooperate.
I've tried
awk 'gsub("class\\\'\\\'>","class\\\'\\\'>")' filename
since one backslash is needed to escape the single quote, the second to escape the backslash itself, and the third to escape the sequence \'. But Terminal (I'm working on a Mac) refuses to execute it, and instead goes to the next line awaiting some other input from me.
Please help :(
In Bash, single quotes accept absolutely no kind of escape. Suppose e.g. I write this command:
$ echo '\''
>
Bash will consider the string opened by ' closed at the second ', generating a string containing only \. The next ', then, is considered the opening of a new string, so bash expects for more input in the next line (signalled by the >).
If you are not aware of this fact, you may think that the string after the echo command below will be open yet, but it is closed:
$ echo 'will this string contain a single quote like \'
will this string contain a single quote like \
So, when you write
'gsub("class\\\'\\\'>","class\\\'\\\'> ")'
you are writing the string gsub("class\\\ concatenated with a backslash and a quote (\'); then a greater-than sign. After this, the "," is interpreted as a string containing a comma, because the single quote at the beginning of the expression was closed before. For now, the result is:
gsub("class\\\\'>,
After the comma, you have the string class, followed by a backslash and a quote, followed by another backslash and another quote, and finally by a greater than symbol and a space. This is the current string:
gsub("class\\\\'>,class\'\'>
This is not a valid awk expression! Anyway, it gets worse: the double quote " will start a string, which will contain a closing parenthesis and a single quote, but this string is never closed!
Summing up, your problem is that, if you opened a string with ' in Bash, it will be forcibly closed at the next ', no matter how many backslashes you put before it.
Solution: you can play some tricks opening and closing strings with ' and ", but it quickly becomes cumbersome. My suggested solution is to put your awk expression in a file and then use awk's -f flag, which makes awk execute the given file:
$ cat filename # The file to be changed
class''>
class>
class''>
$ cat mycode.awk # The awk script
gsub("class''>", "class''>[PSEUDOSPACE]")
$ awk -f mycode.awk filename # THE RESULT!
class''>[PSEUDOSPACE]
class''>[PSEUDOSPACE]
If you do not want to write a file, use the so called here documents:
$ awk -f- filename <<EOF
gsub("class''>", "class''>[PSEUDOSPACE]")
EOF
class''>[PSEUDOSPACE]
class''>[PSEUDOSPACE]
The problem is that you are escaping the ', so you are not finishing the command. For example:
echo \' > foo
echoes a single quote into the file named foo, and
echo \\\' > foo
writes a single backslash followed by a single quote.
In particular, you cannot escape a single quote inside a string, so
'foo\'bar'
is the string foo\ followed by the string bar followed by an unmatched open quote. It is exactly the same as writing "foo\\"bar'
I've got a two-column CSV with a name and a number. Some people's names use commas, for example Joe Blow, CFA. This comma breaks the CSV format, since it's interpreted as a new column.
I've read up and the most common prescription seems to be replacing that character, or replacing the delimiter, with a new value (e.g. this|that|the, other).
I'd really like to keep the comma separator (I know excel supports other delimiters but other interpreters may not). I'd also like to keep the comma in the name, as Joe Blow| CFA looks pretty silly.
Is there a way to include commas in CSV columns without breaking the formatting, for example by escaping them?
To encode a field containing comma (,) or double-quote (") characters, enclose the field in double-quotes:
field1,"field, 2",field3, ...
Literal double-quote characters are typically represented by a pair of double-quotes (""). For example, a field exclusively containing one double-quote character is encoded as """".
For example:
Sheet: |Hello, World!|You "matter" to us.|
CSV: "Hello, World!","You ""matter"" to us."
More examples (sheet → csv):
regular_value → regular_value
Fresh, brown "eggs" → "Fresh, brown ""eggs"""
" → """"
"," → ""","""
,,," → ",,,"""
,"", → ","""","
""" → """"""""
See Wikipedia.
I found that some applications, like Numbers on Mac, ignore the double quote if there is a space before it.
a, "b,c" doesn't work while a,"b,c" works.
The problem with the CSV format is that there isn't one spec; there are several accepted methods, with no way of distinguishing which should be used (for generating/interpreting). I discussed all the methods for escaping characters (newlines in that case, but the same basic premise) in another post. Basically it comes down to using a CSV generation/escaping process for the intended users, and hoping the rest don't mind.
Reference spec document.
If you want to do what you said, you can use quotes. Something like this:
$name = "Joe Blow, CFA.";
$arr[] = "\"".$name."\"";
So now you can use a comma in your name variable.
You need to quote those values.
Here is a more detailed spec.
In addition to the points in other answers: one thing to note if you are using quotes in Excel is the placement of your spaces. If you have a line of code like this:
print '%s, "%s", "%s", "%s"' % (value_1, value_2, value_3, value_4)
Excel will treat the initial quote as a literal quote instead of using it to escape commas. Your code will need to change to
print '%s,"%s","%s","%s"' % (value_1, value_2, value_3, value_4)
It was this subtlety that brought me here.
You can use template literals (template strings) in JavaScript, e.g.:
`"${item}"`
CSV files can actually be formatted using different delimiters, comma is just the default.
You can use the sep flag to specify the delimiter you want for your CSV file.
Just add the line sep=; as the very first line in your CSV file, that is, if you want your delimiter to be a semicolon. You can change it to any other character.
This isn't a perfect solution, but you can just replace all uses of commas with ‚ (the single low-9 quotation mark). It looks very similar to a comma and will visually serve the same purpose. No quotes are required.
In JS this would be:
stringVal.replaceAll(',', '‚')
You will need to be super careful of cases where you need to directly compare that data though
Depending on your language, there may be a to_json method available. That will escape many things that break CSVs.
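For example, in Python the analogous call would be json.dumps (a rough sketch; note the caveat in the comments):
import json

name = 'Joe Blow, CFA'
field = json.dumps(name)   # -> "Joe Blow, CFA" (wrapped in double quotes)
print(field + ",12345")    # "Joe Blow, CFA",12345

# Caveat: JSON escapes an embedded double quote as \" rather than doubling it,
# so strict RFC 4180 CSV readers may reject fields that contain quotes.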
I faced the same problem and quoting the , did not help. Eventually, I replaced the , with +, finished the processing, saved the output into an outfile and replaced the + with ,. This may seem ugly but it worked for me.
This may not be what is needed here, but it's a very old question and the answer may help others. A tip I find useful when importing into Excel with a different separator is to open the file in a text editor and add a first line like:
sep=|
where | is the separator you wish Excel to use.
Alternatively you can change the default separator in Windows, but that is a bit long-winded:
Control Panel>Clock & region>Region>Formats>Additional>Numbers>List separator [change from comma to your preferred alternative]. That means Excel will also default to exporting CSVs using the chosen separator.
You could encode your values, for example with PHP's base64_encode($str) / base64_decode($str).
IMO this is simpler than doubling up quotes, etc.
https://www.php.net/manual/en/function.base64-encode.php
The encoded values will never contain a comma so every comma in your CSV will be a separator.
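The same idea in Python, just as an illustration (assuming UTF-8 text):
import base64

name = "Joe Blow, CFA"
encoded = base64.b64encode(name.encode("utf-8")).decode("ascii")
print(encoded)                                    # only A-Z, a-z, 0-9, +, / and =, never a comma
print(base64.b64decode(encoded).decode("utf-8"))  # Joe Blow, CFA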
You can set the Text_Qualifier field in your Flat File Connection Manager to ". This should wrap your data in quotes and only separate by commas that are outside the quotes.
First, if the item value has a double quote character ("), replace it with two double quote characters ("")
item = item.ToString().Replace("""", """""")
Finally, wrap the item value:
ON LEFT: with a double quote character (")
ON RIGHT: with a double quote character (") and a comma character (,)
csv += """" & item.ToString() & ""","
Double quotes alone did not work for me; what worked was \". If you want to place a double quote, for example, you can write \"\".
You can build formulas, for example:
fprintf(strout, "\"=if(C3=1,\"\"\"\",B3)\"\n");
will write in csv:
=IF(C3=1,"",B3)
A C# method for escaping delimiter characters and quotes in column text. It should be all you need to ensure your CSV is not mangled.
private string EscapeDelimiter(string field)
{
    // yourEscapeCharacter is the delimiter you are escaping for (e.g. ',');
    // a field also needs quoting when it contains an embedded quote.
    if (field.Contains(yourEscapeCharacter) || field.Contains("\""))
    {
        // Double up embedded quotes, then wrap the whole field in quotes
        field = field.Replace("\"", "\"\"");
        field = $"\"{field}\"";
    }
    return field;
}