remove \r\n from within double quoted field in csv file - csv

I have a large csv file with about 100 double quoted "text fields" per line. Many of the lines have a \r\n embedded in a double quoted field. The \r\n pair is also used for line termination.
How can the \r\n pairs be removed from the double quoted fields and not impact the \r\n line terminations.
I have tried creating individual sed scripts to identify the particular embedded sequences. That sort of worked, but the number of scripts became unmanageable. I have also tried using a 'tr -d '\r' command, that did not work.

For this purpose you can use perl.
perl -0pe 's/"(.*?)([[:space:]]+)(.*?)"/"\1 \3"/g' input.csv > output.csv
perl -pe 's/../../g' file is analogous to sed -e 's/../../g' file but using Perl compatible regular expressions (PCRE). See this or this for more details.
With the option zero perl -0 we are considering the null character as input record separator (see xargs -0 ... or find . -print0 ...).
All versions of sed supports Basic Regular Expressions (BRE) and some versions, as GNU sed (using sed -E), supports Extended Regular Expressions (ERE), a superset of BRE. Both in BRE and ERE the quantifiers * and + are always greedy, they match as many characters as possible. In some contexts, however, we need to match as few characters as possible. With PCRE we can do the quantifiers * and + lazy or minimal or reluctant using *? and +? respectively.
In order to remove \r\n inside double quotes preserving any other line break, we use lazy matching: "(.*?)([[:space:]]+)(.*?)". We also use backreferences "\1 \3" omitting the second matched parentheses which contains any sequence of white spaces (characters \t,\r, \n, \v or \f), in particular sequences of \r\n.
This approach works in CSV files in which "lines have a \r\n embedded in a double quoted field" as is your case. If input.csv contains cells with three or more paragraphs separated by \r\n, we must modify the regular expression.

Related

how can I pass a json as a command after -x in wscat [duplicate]

Should or should I not wrap quotes around variables in a shell script?
For example, is the following correct:
xdg-open $URL
[ $? -eq 2 ]
or
xdg-open "$URL"
[ "$?" -eq "2" ]
And if so, why?
General rule: quote it if it can either be empty or contain spaces (or any whitespace really) or special characters (wildcards). Not quoting strings with spaces often leads to the shell breaking apart a single argument into many.
$? doesn't need quotes since it's a numeric value. Whether $URL needs it depends on what you allow in there and whether you still want an argument if it's empty.
I tend to always quote strings just out of habit since it's safer that way.
In short, quote everything where you do not require the shell to perform word splitting and wildcard expansion.
Single quotes protect the text between them verbatim. It is the proper tool when you need to ensure that the shell does not touch the string at all. Typically, it is the quoting mechanism of choice when you do not require variable interpolation.
$ echo 'Nothing \t in here $will change'
Nothing \t in here $will change
$ grep -F '#&$*!!' file /dev/null
file:I can't get this #&$*!! quoting right.
Double quotes are suitable when variable interpolation is required. With suitable adaptations, it is also a good workaround when you need single quotes in the string. (There is no straightforward way to escape a single quote between single quotes, because there is no escape mechanism inside single quotes -- if there was, they would not quote completely verbatim.)
$ echo "There is no place like '$HOME'"
There is no place like '/home/me'
No quotes are suitable when you specifically require the shell to perform word splitting and/or wildcard expansion.
Word splitting (aka token splitting);
$ words="foo bar baz"
$ for word in $words; do
> echo "$word"
> done
foo
bar
baz
By contrast:
$ for word in "$words"; do echo "$word"; done
foo bar baz
(The loop only runs once, over the single, quoted string.)
$ for word in '$words'; do echo "$word"; done
$words
(The loop only runs once, over the literal single-quoted string.)
Wildcard expansion:
$ pattern='file*.txt'
$ ls $pattern
file1.txt file_other.txt
By contrast:
$ ls "$pattern"
ls: cannot access file*.txt: No such file or directory
(There is no file named literally file*.txt.)
$ ls '$pattern'
ls: cannot access $pattern: No such file or directory
(There is no file named $pattern, either!)
In more concrete terms, anything containing a filename should usually be quoted (because filenames can contain whitespace and other shell metacharacters). Anything containing a URL should usually be quoted (because many URLs contain shell metacharacters like ? and &). Anything containing a regex should usually be quoted (ditto ditto). Anything containing significant whitespace other than single spaces between non-whitespace characters needs to be quoted (because otherwise, the shell will munge the whitespace into, effectively, single spaces, and trim any leading or trailing whitespace).
When you know that a variable can only contain a value which contains no shell metacharacters, quoting is optional. Thus, an unquoted $? is basically fine, because this variable can only ever contain a single number. However, "$?" is also correct, and recommended for general consistency and correctness (though this is my personal recommendation, not a widely recognized policy).
Values which are not variables basically follow the same rules, though you could then also escape any metacharacters instead of quoting them. For a common example, a URL with a & in it will be parsed by the shell as a background command unless the metacharacter is escaped or quoted:
$ wget http://example.com/q&uack
[1] wget http://example.com/q
-bash: uack: command not found
(Of course, this also happens if the URL is in an unquoted variable.) For a static string, single quotes make the most sense, although any form of quoting or escaping works here.
wget 'http://example.com/q&uack' # Single quotes preferred for a static string
wget "http://example.com/q&uack" # Double quotes work here, too (no $ or ` in the value)
wget http://example.com/q\&uack # Backslash escape
wget http://example.com/q'&'uack # Only the metacharacter really needs quoting
The last example also suggests another useful concept, which I like to call "seesaw quoting". If you need to mix single and double quotes, you can use them adjacent to each other. For example, the following quoted strings
'$HOME '
"isn't"
' where `<3'
"' is."
can be pasted together back to back, forming a single long string after tokenization and quote removal.
$ echo '$HOME '"isn't"' where `<3'"' is."
$HOME isn't where `<3' is.
This isn't awfully legible, but it's a common technique and thus good to know.
As an aside, scripts should usually not use ls for anything. To expand a wildcard, just ... use it.
$ printf '%s\n' $pattern # not ``ls -1 $pattern''
file1.txt
file_other.txt
$ for file in $pattern; do # definitely, definitely not ``for file in $(ls $pattern)''
> printf 'Found file: %s\n' "$file"
> done
Found file: file1.txt
Found file: file_other.txt
(The loop is completely superfluous in the latter example; printf specifically works fine with multiple arguments. stat too. But looping over a wildcard match is a common problem, and frequently done incorrectly.)
A variable containing a list of tokens to loop over or a wildcard to expand is less frequently seen, so we sometimes abbreviate to "quote everything unless you know precisely what you are doing".
Here is a three-point formula for quotes in general:
Double quotes
In contexts where we want to suppress word splitting and globbing. Also in contexts where we want the literal to be treated as a string, not a regex.
Single quotes
In string literals where we want to suppress interpolation and special treatment of backslashes. In other words, situations where using double quotes would be inappropriate.
No quotes
In contexts where we are absolutely sure that there are no word splitting or globbing issues or we do want word splitting and globbing.
Examples
Double quotes
literal strings with whitespace ("StackOverflow rocks!", "Steve's Apple")
variable expansions ("$var", "${arr[#]}")
command substitutions ("$(ls)", "`ls`")
globs where directory path or file name part includes spaces ("/my dir/"*)
to protect single quotes ("single'quote'delimited'string")
Bash parameter expansion ("${filename##*/}")
Single quotes
command names and arguments that have whitespace in them
literal strings that need interpolation to be suppressed ( 'Really costs $$!', 'just a backslash followed by a t: \t')
to protect double quotes ('The "crux"')
regex literals that need interpolation to be suppressed
use shell quoting for literals involving special characters ($'\n\t')
use shell quoting where we need to protect several single and double quotes ($'{"table": "users", "where": "first_name"=\'Steve\'}')
No quotes
around standard numeric variables ($$, $?, $# etc.)
in arithmetic contexts like ((count++)), "${arr[idx]}", "${string:start:length}"
inside [[ ]] expression which is free from word splitting and globbing issues (this is a matter of style and opinions can vary widely)
where we want word splitting (for word in $words)
where we want globbing (for txtfile in *.txt; do ...)
where we want ~ to be interpreted as $HOME (~/"some dir" but not "~/some dir")
See also:
Difference between single and double quotes in Bash
What are the special dollar sign shell variables?
Quotes and escaping - Bash Hackers' Wiki
When is double quoting necessary?
I generally use quoted like "$var" for safe, unless I am sure that $var does not contain space.
I do use $var as a simple way to join lines:
lines="`cat multi-lines-text-file.txt`"
echo "$lines" ## multiple lines
echo $lines ## all spaces (including newlines) are zapped
Whenever the https://www.shellcheck.net/ plugin for your editor tells you to.

MySQL search and replace specific character combinations

NOTE: This is hilarious - I've had to update this post multiple times because it doesn't properly display combinations of the \ character :)
I need to be able to search and replace a specific set of characters without compromising specific allowed combinations or the redundancy of those characters. Let's take a core escape character \ for example.
A string that needs processing may look like this:
"This is \ a test! \\r\\n Let's have \r \n \\\ Some fun!"
Now I need to address each \ for translation to JSON (each \ needs to be \\\\), but at the same time, I don't want to touch the \r or \n. I also want to be able to differentiate between \r and \n vs \r and \n. Is there a method using REGEXP I could adopt that would take the above and convert it to:
"This is \\\\ a test! \\r\\n Let's have \\r \\n \\\\\\\\\\\\ Some fun!"
Note I want each escape \ to be 4x \, and also to ensure special escaped characters are double escaped, but not compound them (the 4x only needs to be applied to stand alone \'s).
What it comes down to is I can't control the data that's coming in, but I can control scrubbing it. I could get weird data with \/////\ by somebody who was just having fun. I need to be able to scrub that as a TEXT value and prepare it for insertion into the database via a dynamically created SQL statement that's executed, which means a single \ needs to be \\\\ and the / is ignored (for example).
I'm thinking I first need to do a specific scan of special escape sequences (such as \', \b, \\, \r, etc.) but at the same time verify they aren't already double escaped. I need need to ensure the additional scans of \ don't meet any special standards (and are just escaped on their own).
I'm hoping somebody has already dealt with this and there's an existing function or SP designed to do this sort of thing so I'm not reinventing the wheel.
Thanks!

When use parameters in JSON POST data in Jmeter, another json entity is changed [duplicate]

I am tired of always trying to guess, if I should escape special characters like '()[]{}|' etc. when using many implementations of regexps.
It is different with, for example, Python, sed, grep, awk, Perl, rename, Apache, find and so on.
Is there any rule set which tells when I should, and when I should not, escape special characters? Does it depend on the regexp type, like PCRE, POSIX or extended regexps?
Which characters you must and which you mustn't escape indeed depends on the regex flavor you're working with.
For PCRE, and most other so-called Perl-compatible flavors, escape these outside character classes:
.^$*+?()[{\|
and these inside character classes:
^-]\
For POSIX extended regexes (ERE), escape these outside character classes (same as PCRE):
.^$*+?()[{\|
Escaping any other characters is an error with POSIX ERE.
Inside character classes, the backslash is a literal character in POSIX regular expressions. You cannot use it to escape anything. You have to use "clever placement" if you want to include character class metacharacters as literals. Put the ^ anywhere except at the start, the ] at the start, and the - at the start or the end of the character class to match these literally, e.g.:
[]^-]
In POSIX basic regular expressions (BRE), these are metacharacters that you need to escape to suppress their meaning:
.^$*[\
Escaping parentheses and curly brackets in BREs gives them the special meaning their unescaped versions have in EREs. Some implementations (e.g. GNU) also give special meaning to other characters when escaped, such as \? and +. Escaping a character other than .^$*(){} is normally an error with BREs.
Inside character classes, BREs follow the same rule as EREs.
If all this makes your head spin, grab a copy of RegexBuddy. On the Create tab, click Insert Token, and then Literal. RegexBuddy will add escapes as needed.
Modern RegEx Flavors (PCRE)
Includes C, C++, Delphi, EditPad, Java, JavaScript, Perl, PHP (preg), PostgreSQL, PowerGREP, PowerShell, Python, REALbasic, Real Studio, Ruby, TCL, VB.Net, VBScript, wxWidgets, XML Schema, Xojo, XRegExp.PCRE compatibility may vary
    Anywhere: . ^ $ * + - ? ( ) [ ] { } \ |
Legacy RegEx Flavors (BRE/ERE)
Includes awk, ed, egrep, emacs, GNUlib, grep, PHP (ereg), MySQL, Oracle, R, sed.PCRE support may be enabled in later versions or by using extensions
ERE/awk/egrep/emacs
    Outside a character class: . ^ $ * + ? ( ) [ { } \ |
    Inside a character class: ^ - [ ]
BRE/ed/grep/sed
    Outside a character class: . ^ $ * [ \
    Inside a character class: ^ - [ ]
    For literals, don't escape: + ? ( ) { } |
    For standard regex behavior, escape: \+ \? \( \) \{ \} \|
Notes
If unsure about a specific character, it can be escaped like \xFF
Alphanumeric characters cannot be escaped with a backslash
Arbitrary symbols can be escaped with a backslash in PCRE, but not BRE/ERE (they must only be escaped when required). For PCRE ] - only need escaping within a character class, but I kept them in a single list for simplicity
Quoted expression strings must also have the surrounding quote characters escaped, and often with backslashes doubled-up (like "(\")(/)(\\.)" versus /(")(\/)(\.)/ in JavaScript)
Aside from escapes, different regex implementations may support different modifiers, character classes, anchors, quantifiers, and other features. For more details, check out regular-expressions.info, or use regex101.com to test your expressions live
Unfortunately there really isn't a set set of escape codes since it varies based on the language you are using.
However, keeping a page like the Regular Expression Tools Page or this Regular Expression Cheatsheet can go a long way to help you quickly filter things out.
POSIX recognizes multiple variations on regular expressions - basic regular expressions (BRE) and extended regular expressions (ERE). And even then, there are quirks because of the historical implementations of the utilities standardized by POSIX.
There isn't a simple rule for when to use which notation, or even which notation a given command uses.
Check out Jeff Friedl's Mastering Regular Expressions book.
Unfortunately, the meaning of things like ( and \( are swapped between Emacs style regular expressions and most other styles. So if you try to escape these you may be doing the opposite of what you want.
So you really have to know what style you are trying to quote.
Really, there isn't. there are about a half-zillion different regex syntaxes; they seem to come down to Perl, EMACS/GNU, and AT&T in general, but I'm always getting surprised too.
Sometimes simple escaping is not possible with the characters you've listed. For example, using a backslash to escape a bracket isn't going to work in the left hand side of a substitution string in sed, namely
sed -e 's/foo\(bar/something_else/'
I tend to just use a simple character class definition instead, so the above expression becomes
sed -e 's/foo[(]bar/something_else/'
which I find works for most regexp implementations.
BTW Character classes are pretty vanilla regexp components so they tend to work in most situations where you need escaped characters in regexps.
Edit: After the comment below, just thought I'd mention the fact that you also have to consider the difference between finite state automata and non-finite state automata when looking at the behaviour of regexp evaluation.
You might like to look at "the shiny ball book" aka Effective Perl (sanitised Amazon link), specifically the chapter on regular expressions, to get a feel for then difference in regexp engine evaluation types.
Not all the world's a PCRE!
Anyway, regexp's are so clunky compared to SNOBOL! Now that was an interesting programming course! Along with the one on Simula.
Ah the joys of studying at UNSW in the late '70's! (-:
https://perldoc.perl.org/perlre.html#Quoting-metacharacters and https://perldoc.perl.org/functions/quotemeta.html
In the official documentation, such characters are called metacharacters. Example of quoting:
my $regex = quotemeta($string)
s/$regex/something/
For PHP, "it is always safe to precede a non-alphanumeric with "\" to specify that it stands for itself." - http://php.net/manual/en/regexp.reference.escape.php.
Except if it's a " or '. :/
To escape regex pattern variables (or partial variables) in PHP use preg_quote()
To know when and what to escape without attempts is necessary to understand precisely the chain of contexts the string pass through. You will specify the string from the farthest side to its final destination which is the memory handled by the regexp parsing code.
Be aware how the string in memory is processed: if can be a plain string inside the code, or a string entered to the command line, but a could be either an interactive command line or a command line stated inside a shell script file, or inside a variable in memory mentioned by the code, or an (string)argument through further evaluation, or a string containing code generated dynamically with any sort of encapsulation...
Each of this context assigned some characters with special functionality.
When you want to pass the character literally without using its special function (local to the context), than that's the case you have to escape it, for the next context... which might need some other escape characters which might additionally need to be escaped in the preceding context(s).
Furthermore there can be things like character encoding (the most insidious is utf-8 because it look like ASCII for common characters, but might be optionally interpreted even by the terminal depending on its settings so it might behave differently, then the encoding attribute of HTML/XML, it's necessary to understand the process precisely right.
E.g. A regexp in the command line starting with perl -npe, needs to be transferred to a set of exec system calls connecting as pipe the file handles, each of this exec system calls just has a list of arguments that were separated by (non escaped)spaces, and possibly pipes(|) and redirection (> N> N>&M), parenthesis, interactive expansion of * and ?, $(()) ... (all this are special characters used by the *sh which might appear to interfere with the character of the regular expression in the next context, but they are evaluated in order: before the command line. The command line is read by a program as bash/sh/csh/tcsh/zsh, essentially inside double quote or single quote the escape is simpler but it is not necessary to quote a string in the command line because mostly the space has to be prefixed with backslash and the quote are not necessary leaving available the expand functionality for characters * and ?, but this parse as different context as within quote. Then when the command line is evaluated the regexp obtained in memory (not as written in the command line) receives the same treatment as it would be in a source file.
For regexp there is character-set context within square brackets [ ], perl regular expression can be quoted by a large set of non alfa-numeric characters (E.g. m// or m:/better/for/path: ...).
You have more details about characters in other answer, which are very specific to the final regexp context. As I noted you mention that you find the regexp escape with attempts, that's probably because different context has different set of character that confused your memory of attempts (often backslash is the character used in those different context to escape a literal character instead of its function).
For Ionic (Typescript) you have to double slash in order to scape the characters.
For example (this is to match some special characters):
"^(?=.*[\\]\\[!¡\'=ªº\\-\\_ç##$%^&*(),;\\.?\":{}|<>\+\\/])"
Pay attention to this ] [ - _ . / characters. They have to be double slashed. If you don't do that, you are going to have a type error in your code.
to avoid having to worry about which regex variant and all the bespoke peculiarties, just use this generic function that covers every regex variant other than BRE (unless they have unicode multi-byte chars that are meta) :
jot -s '' -c - 32 126 |
mawk '
function ___(__,_) {
return substr(_="",
gsub("[][!-/_\140:-#{-~]","[&]",__),
gsub("["(_="\\\\")"^]",_ "&",__))__
} ($++NF = ___($!_))^_'
!"#$%&'()*+,-./0123456789:;<=>?
#ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_
`abcdefghijklmnopqrstuvwxyz{|}~
[!]["][#][$][%][&]['][(][)][*][+][,][-][.][/]
0 1 2 3 4 5 6 7 8 9 [:][;][<][=][>][?]
[#] ABCDEFGHIJKLMNOPQRSTUVWXYZ [[]\\ []]\^ [_]
[`] abcdefghijklmnopqrstuvwxyz [{][|][}][~]
square-brackets are much easier to deal with, since there's no risk of triggering warning messages about "escaping too much", e.g. :
function ____(_) {
return substr("", gsub("[[:punct:]]","\\\\&",_))_
}
\!\"\#\$\%\&\'\(\)\*\+\,\-\.\/ 0123456789\:\;\<\=\>\?
\#ABCDEFGHIJKLMNOPQRSTUVWXYZ\[\\\]\^\_\`abcdefghijklmnopqrstuvwxyz \{\|\}\~
gawk: cmd. line:1: warning: regexp escape sequence `\!' is not a known regexp operator
gawk: cmd. line:1: warning: regexp escape sequence `\"' is not a known regexp operator
gawk: cmd. line:1: warning: regexp escape sequence `\#' is not a known regexp operator
gawk: cmd. line:1: warning: regexp escape sequence `\%' is not a known regexp operator
gawk: cmd. line:1: warning: regexp escape sequence `\&' is not a known regexp operator
gawk: cmd. line:1: warning: regexp escape sequence `\,' is not a known regexp operator
gawk: cmd. line:1: warning: regexp escape sequence `\:' is not a known regexp operator
gawk: cmd. line:1: warning: regexp escape sequence `\;' is not a known regexp operator
gawk: cmd. line:1: warning: regexp escape sequence `\=' is not a known regexp operator
gawk: cmd. line:1: warning: regexp escape sequence `\#' is not a known regexp operator
gawk: cmd. line:1: warning: regexp escape sequence `\_' is not a known regexp operator
gawk: cmd. line:1: warning: regexp escape sequence `\~' is not a known regexp operator
Using Raku (formerly known as Perl_6)
Works (backslash or quote all non-alphanumeric characters except underscore):
~$ raku -e 'say $/ if "#.*?" ~~ m/ \# \. \* \? /; #works fine'
「#.*?」
There exist six flavors of Regular Expression languages, according to Damian Conway's pdf/talk "Everything You Know About Regexes Is Wrong". Raku represents a significant (~15 year) re-working of standard Perl(5)/PCRE Regular Expressions.
In those 15 years the Perl_6 / Raku language experts decided that all non-alphanumeric characters (except underscore) shall be reserved as Regex metacharacters even if no present usage exists. To denote non-alphanumeric characters (except underscore) as literals, backslash or escape them.
So the above example prints the $/ match variable if a match to a literal #.*? character sequence is found. Below is what happens if you don't: # is interpreted as the start of a comment, . dot is interpreted as any character (including whitespace), * asterisk is interpreted as a zero-or-more quantifier, and ? question mark is interpreted as either a zero-or-one quantifier or a frugal (i.e. non-greedy) quantifier-modifier (depending on context):
Errors:
~$ ~$ raku -e 'say $/ if "#.*?" ~~ m/ # . * ? /; #ERROR!'
===SORRY!===
Regex not terminated.
at -e:1
------> y $/ if "#.*?" ~~ m/ # . * ? /; #ERROR!⏏<EOL>
Regex not terminated.
at -e:1
------> y $/ if "#.*?" ~~ m/ # . * ? /; #ERROR!⏏<EOL>
Couldn't find terminator / (corresponding / was at line 1)
at -e:1
------> y $/ if "#.*?" ~~ m/ # . * ? /; #ERROR!⏏<EOL>
expecting any of:
/
https://docs.raku.org/language/regexes
https://raku.org/

Using SED to add values after the 5th field of a CSV file that is also an IP Address

I have a CSV file that I'm working to manipulate using sed. What I'm doing is inserting the current YYYY-MM-DD HH:MM:SS into the 5th field after the IP Address. As you can see below, each value is enclosed by double quotes and each CSV column is separated by a comma.
"12345","","","None","192.168.2.1","qqq","000"
"67890","ABC-1234-5678","9.9","Low","192.168.2.1","qqq","000"
Using the command: sed 'N;s/","/","YYYY-MM-DD HH:MM:SS","/5' FILENAME
I am adding in the date after the 5th field. Normally this works, but often
certain values in the CSV file mess up this count that would insert the date into the 5th field. To remedy this issue, how can I not only add the date after the 5th field, but also make sure the 5th field is an IP Address?
The final output should be:
"12345","","","None","192.168.2.1","YYYY-MM-DD HH:MM:SS","qqq","000"
"67890","ABC-1234-5678","9.9","Low","192.168.2.1","YYYY-MM-DD HH:MM:SS","qqq","000"
Please respond with how this is done using SED and not AWK. And how I can make sure the 5th field is also an IP Address before the date is added in?
This answer currently assumes that the CSV file is beautifully consistent and simple (as in the sample data), so that:
Fields always have double quotes around them.
There are never fields like "…""…" to indicate a double quote embedded in the string.
There are never fields with commas in between the quotes ("this,that").
Given those pre-requisites, this sed script does the job:
sed 's/^\("[^"]*",\)\{4\}"\([0-9]\{1,3\}\.\)\{3\}[0-9]\{1,3\}",/&"YYYY-MM-DD HH:MM:SS",/'
Let's split that search pattern into pieces:
^\("[^"]*",\)\{4\}
Match start of line followed by: 4 repeats of a double quote, a sequence of zero or more non-double-quotes, a double quote and a comma.
In other words, this identifies the first four fields.
"\([0-9]\{1,3\}\.\)\{3\}
Match a double quote, then 3 repeats of 1-3 decimal digits followed by a dot — the first three triplets of an IPv4 dotted-decimal address.
[0-9]\{1,3\}",
Match 1-3 decimal digits followed by a double quote and a comma — the last triplet of an IPv4 dotted-decimal address plus the end of a field.
Clearly, for each idiosyncrasy of CSV files that you also need to deal with, you have to modify the regular expressions. That's not trivial.
Using extended regular expressions (enabled by -E on both GNU and BSD sed), you could write:
sed -E 's/^("(([^"]*"")*[^"]*)",){4}"([0-9]{1,3}\.){3}[0-9]{1,3}",/&"YYYY-MM-DD HH:MM:SS",/'
The pattern to recognize the first 4 fields is more complex than before. It matches 4 repeats of: double quote, zero or more occurrences of { zero or more non-double-quotes followed by two double quotes } followed by zero or more non-double-quotes followed by a double quote and a comma.
You can also write that in classic sed (basic regular expressions) with a liberal sprinkling of backslashes:
sed 's/^\("\(\([^"]*""\)*[^"]*\)",\)\{4\}"\([0-9]\{1,3\}\.\)\{3\}[0-9]\{1,3\}",/&"YYYY-MM-DD HH:MM:SS",/'
Given the data file:
"12345","","","None","192.168.2.1","qqq","000"
"67890","ABC-1234-5678","9.9","Low","192.168.2.1","qqq","000"
"23456","Quaternions","2.3","Pisces","Heredotus","qqq","000"
"34567","Commas, oh commas!","3.14159","""Quotes"" quoth he","192.168.99.37","zzz","011"
"45678","Commas, oh commas!","3.14159","""Quote me"",""or not""","192.168.99.37","zzz","011"
The first script shown produces the output:
"12345","","","None","192.168.2.1","YYYY-MM-DD HH:MM:SS","qqq","000"
"67890","ABC-1234-5678","9.9","Low","192.168.2.1","YYYY-MM-DD HH:MM:SS","qqq","000"
"23456","Quaternions","2.3","Pisces","Heredotus","qqq","000"
"34567","Commas, oh commas!","3.14159","""Quotes"" quoth he","192.168.99.37","zzz","011"
"45678","Commas, oh commas!","3.14159","""Quote me"",""or not""","192.168.99.37","zzz","011"
The first two lines are correctly mapped; the third is correctly unchanged, but the last two should have been mapped and were not.
The second and third commands produce:
"12345","","","None","192.168.2.1","YYYY-MM-DD HH:MM:SS","qqq","000"
"67890","ABC-1234-5678","9.9","Low","192.168.2.1","YYYY-MM-DD HH:MM:SS","qqq","000"
"23456","Quaternions","2.3","Pisces","Heredotus","qqq","000"
"34567","Commas, oh commas!","3.14159","""Quotes"" quoth he","192.168.99.37","YYYY-MM-DD HH:MM:SS","zzz","011"
"45678","Commas, oh commas!","3.14159","""Quote me"",""or not""","192.168.99.37","YYYY-MM-DD HH:MM:SS","zzz","011"
Note that Heredotus is not modified (correctly), and the last two lines get the date string added after the IP address (also correctly).
Those last regular expressions are not for the faint-of-heart.
Clearly, if you want to insist that the IP addresses only match numbers in the range 0..255 in each component, with no leading 0, then you have to beef up the IP address matching portion of the regular expression. It can be done; it is not pretty. It is easiest to do it with extended regular expressions:
([0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])
You'd use that unit in place of each [0-9]{3} unit in the regexes shown before.
Note that this still does not attempt to deal with fields not surrounded by double quotes.
It also does not determine the value to substitute from the date command. That is doable with (if not elementary then) routine shell scripting carefully managing quotes:
dt=$(date +'%Y-%m-%d %H:%M:%S')
sed -E 's/^("(([^"]*"")*[^"]*)",){4}"([0-9]{1,3}\.){3}[0-9]{1,3}",/&"'"$dt"'",/'
The '…"'"$dt"'",/' sequence is part of what starts out as a single-quoted string. The first double quote is simple data in the string; the next single quote ends the quoting, the "$dt" interpolates the value from date inside shell double quotes (so the space doesn't cause any trouble), then the single quote resumes the single-quoted notation, adding another double quote, a comma and a slash before the string (argument to sed) is terminated.
Try:
awk -vdate1=$(date +"%Y-%m-%d") -vdate2=$(date +"%H:%M:%S") -F, '$5 ~ /[0-9]+\.[0-9]+\.[0-9]+\.[0-9]/{$5=$5 FS date1 " " date2} 1' OFS=, Input_file
Also if you want to edit the same Input_file you could take above command's output into a temp file and later rename(mv command) to the same Input_file
Adding one-liner form of solution too now.
awk -vdate1=$(date +"%Y-%m-%d") -vdate2=$(date +"%H:%M:%S") -F, '
$5 ~ /[0-9]+\.[0-9]+\.[0-9]+\.[0-9]/{
$5=$5 FS date1 " " date2
}
1
' OFS=, Input_file

Regular expression usage in command line - replace space with %20 inside href

Find/replace space with %20
I must replace all spaces in *.html files which are inside href="something something .pdf" with %20.
I found a regular expression for that task:
find : href\s*=\s*['"][^'" ]*\K\h|(?!^)\G[^'" ]*\K\h
replace : %20
That regular expression works in text editors like Notepad++ or Geany.
I want use that regular expression from the Linux command line with sed or perl.
Solution (1):
cat test002.html | perl -ne 's/href\s*=\s*['\''"][^'\''" ]*\K\h|(?!^)\G[^'\''" ]*\K\h/%20/g; print;' > Work_OK01.html
Solution (2):
cat test002.html | perl -ne 's/href\s*=\s*[\x27"][^\x27" ]*\K\h|(?!^)\G[^\x27" ]*\K\h/%20/g; print;' > Work_OK02.html
The problem is that you don't properly escape the single quotes in your program.
If your program is
...[^'"]...
The shell literal might be
'...[^'\''"]...'
'...[^'"'"'"]...'
'...[^\x27"]...' # Avoids using a single quote to avoid escaping it.
So, you were going for
perl -ne 's/href\s*=\s*['\''"][^'\''" ]*\K\h|(?!^)\G[^'\''" ]*\K\h/%20/g; print;'
Don't try do everything at once. Here are some far cleaner (i.e. far more readable) solutions:
perl -pe's{href\s*=\s*['\''"]\K([^'\''"]*)}{ $1 =~ s/ /%20/rg }eg' # 5.14+
perl -pe's{href\s*=\s*['\''"]\K([^'\''"]*)}{ (my $s = $1) =~ s/ /%20/g; $s }eg'
Note that -p is the same as -n, except that it cause a print to be performed for each line.
The above solutions make a large number of assumptions about the files that might be encountered[1]. All of these assumptions would go away if you use a proper parser.
If you have HTML files:
perl -MXML::LibXML -e'
my $doc = XML::LibXML->new->parse_file($ARGV[0]);
$_->setValue( $_->getValue() =~ s/ /%20/gr )
for $doc->findnodes(q{//#href});
binmode(STDOUT);
print($doc->toStringHTML());
' in_file.html >out_file.html
If you have XML (incl XHTML) files:
perl -MXML::LibXML -e'
my $doc = XML::LibXML->new->parse_file($ARGV[0]);
$_->setValue( $_->getValue() =~ s/ /%20/gr )
for $doc->findnodes(q{//#href});
binmode(STDOUT);
$doc->toFH(\*STDOUT);
' in_file.html >out_file.html
Assumptions made by the substitution-based solutions:
File uses an ASCII-based encoding (e.g. UTF-8, iso-latin-1, but not UTF-16le).
No newline between href and =.
No newline between = and the value.
No newline in the value of href attributes.
Nothing matching /href\s*=/ in text (incl CDATA sections).
Nothing matching /href\s*=/ in comments.
No other attributes have a name ending in href.
No single quote (') in href="...".
No double quote (") in href='...'.
No href= with an unquoted value.
Space in href attributes aren't encoded using a character entity (e.g ).
Maybe more?
(SLePort makes similar assumptions even though they didn't document them. They also assume href attributes don't contain >.)
An xml parser would be more suited for that job(eg. XMLStarlet, xmllint,...), but if you don't have newlines in your a tags, the below sed should work.
Using the t command and backreferences, it loops over and replace all spaces up to last " inside the a tags:
$ sed ':a;s/\(<a [^>]*href=[^"]*"[^ ]*\) \([^"]*">\)/\1%20\2/;ta' <<< '<a href="http://url with spaces">'
<a href="http://url%20with%20spaces">
You seem to have neglected to escape the quotes inside the string you pass to Perl. So Bash sees you giving perl the following arguments:
s/href\s*=\s*[][^', which results from the concatenation of the single-quoted string 's/href\s*=\s*[' and the double-quoted string "][^'"
]*Kh, unquoted, because \K and \h are not special characters in the shell so it just treats them as K and h respectively
Then Bash sees a pipe character |, followed by a subshell invocation (?!^), in which !^ gets substituted with the first argument of the last command invoked. (See "History Expansion > Word Designators" in the Bash man page.) For example, if your last command was echo myface then (?!^) would look for the command named ?myface and runs it in a subshell.
And finally, Bash gets to the sequence \G[^'" ]*\K\h/%20/g; print;', which is interpreted as the concatenation of G (from \G), [^, and the single-quoted string " ]*\K\h/%20/g; print;. Bash has no idea what to do with G[^" ]*\K\h/%20/g; print;, since it just finished parsing a subshell invocation and expects to see a semicolon, line break, or logical operator (or so on) before getting another arbitrary string.
Solution: properly quote the expression you give to perl. You'll need to use a combination of single and double quotes to pull it off, e.g.
perl -ne 's/href\s*=\s*['"'\"][^'\" ]*"'\K\h|(?!^)\G[^'"'\" ]*"'\K\h/%20/g; print;'