I have strings in the following pattern: <SOMETHING_1>{<JSON>}<SOMETHING_2>
I want to keep the <JSON> and remove the <SOMETHING_X> blocks. I'm trying to do it with substring removal, but instead of getting
{x:1,action:PLAYING,name:John,description:Some description rSxv9HiATMuQ4wgoV2CGxw}
I keep getting
{x:1,action:PLAYING,name:John,description:Some}
because the whitespace in the description field cuts off the substring.
Any ideas on what to change?
CODE:
string="000{x:1,action:PLAYING,name:John,description:Some description rSxv9HiATMuQ4wgoV2CGxw}401"
string=$1
string=$(echo "${string#*{}")
string=$(echo "${string%}*}")
string={$string}
echo $string
The original code works perfectly, if we accept a direct assignment of the string -- though the following is a bit more explicit:
string="000{x:1,action:PLAYING,name:John,description:Some description rSxv9HiATMuQ4wgoV2CGxw}401"
string='{'"${string#*"{"}" # trim content up to and including the first {, and replace it
string="${string%'}'*}"'}' # trim the last } and all after, and replace it again
printf '%s\n' "$string"
...properly emits:
{x:1,action:PLAYING,name:John,description:Some description rSxv9HiATMuQ4wgoV2CGxw}
I'm guessing that the string is being passed on a command line unquoted, and is thus being split into multiple arguments. If you quote your command-line arguments to prevent string-splitting by the calling shell (./yourscript "$string" instead of ./yourscript $string), this issue will be avoided.
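The splitting is easy to demonstrate with positional parameters (a minimal sketch):

```shell
# Unquoted expansion undergoes word splitting; quoted expansion does not.
s='000{x:1,description:Some description}401'
set -- $s          # unquoted: splits on the space in "Some description"
echo "$#"          # prints 2
set -- "$s"        # quoted: stays one argument
echo "$#"          # prints 1
```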
with sed:
string="000{x:1,action:PLAYING,name:John,description:Some description rSxv9HiATMuQ4wgoV2CGxw}401"
sed 's/.*\({.*}\).*/\1/' <<< "$string"
output:
{x:1,action:PLAYING,name:John,description:Some description rSxv9HiATMuQ4wgoV2CGxw}
Here you go…
string="000{x:1,action:PLAYING,name:John,description:Some description rSxv9HiATMuQ4wgoV2CGxw}401"
echo "original: ${string}"
string="${string#*\{}"
string="${string%\}*}"
echo "final: {${string}}"
By the way, JSON keys should be surrounded with double quotes.
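For reference, a JSON-valid version of the sample would quote keys and string values; this sketch checks it with python3's json module (assuming python3 is available):

```shell
# The same data as valid JSON: keys and string values double-quoted.
json='{"x":1,"action":"PLAYING","name":"John","description":"Some description rSxv9HiATMuQ4wgoV2CGxw"}'
printf '%s' "$json" | python3 -m json.tool >/dev/null && echo valid
```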
I have a CSV file that I'm working to manipulate using sed. What I'm doing is inserting the current YYYY-MM-DD HH:MM:SS into the 5th field after the IP Address. As you can see below, each value is enclosed by double quotes and each CSV column is separated by a comma.
"12345","","","None","192.168.2.1","qqq","000"
"67890","ABC-1234-5678","9.9","Low","192.168.2.1","qqq","000"
Using the command: sed 'N;s/","/","YYYY-MM-DD HH:MM:SS","/5' FILENAME
I am adding the date after the 5th field. Normally this works, but certain values in the CSV file often throw off that count, so the date is inserted in the wrong place. To remedy this, how can I not only add the date after the 5th field, but also make sure the 5th field is an IP address?
The final output should be:
"12345","","","None","192.168.2.1","YYYY-MM-DD HH:MM:SS","qqq","000"
"67890","ABC-1234-5678","9.9","Low","192.168.2.1","YYYY-MM-DD HH:MM:SS","qqq","000"
Please show how this is done using sed (not awk), and how I can make sure the 5th field is an IP address before the date is added.
This answer currently assumes that the CSV file is beautifully consistent and simple (as in the sample data), so that:
Fields always have double quotes around them.
There are never fields like "…""…" to indicate a double quote embedded in the string.
There are never fields with commas in between the quotes ("this,that").
Given those pre-requisites, this sed script does the job:
sed 's/^\("[^"]*",\)\{4\}"\([0-9]\{1,3\}\.\)\{3\}[0-9]\{1,3\}",/&"YYYY-MM-DD HH:MM:SS",/'
Let's split that search pattern into pieces:
^\("[^"]*",\)\{4\}
Match the start of line, followed by 4 repeats of: a double quote, a sequence of zero or more non-double-quotes, a double quote, and a comma.
In other words, this identifies the first four fields.
"\([0-9]\{1,3\}\.\)\{3\}
Match a double quote, then 3 repeats of 1-3 decimal digits followed by a dot — the first three triplets of an IPv4 dotted-decimal address.
[0-9]\{1,3\}",
Match 1-3 decimal digits followed by a double quote and a comma — the last triplet of an IPv4 dotted-decimal address plus the end of a field.
Clearly, for each idiosyncrasy of CSV files that you also need to deal with, you have to modify the regular expressions. That's not trivial.
Using extended regular expressions (enabled by -E on both GNU and BSD sed), you could write:
sed -E 's/^("(([^"]*"")*[^"]*)",){4}"([0-9]{1,3}\.){3}[0-9]{1,3}",/&"YYYY-MM-DD HH:MM:SS",/'
The pattern to recognize the first 4 fields is more complex than before. It matches 4 repeats of: double quote, zero or more occurrences of { zero or more non-double-quotes followed by two double quotes } followed by zero or more non-double-quotes followed by a double quote and a comma.
You can also write that in classic sed (basic regular expressions) with a liberal sprinkling of backslashes:
sed 's/^\("\(\([^"]*""\)*[^"]*\)",\)\{4\}"\([0-9]\{1,3\}\.\)\{3\}[0-9]\{1,3\}",/&"YYYY-MM-DD HH:MM:SS",/'
Given the data file:
"12345","","","None","192.168.2.1","qqq","000"
"67890","ABC-1234-5678","9.9","Low","192.168.2.1","qqq","000"
"23456","Quaternions","2.3","Pisces","Heredotus","qqq","000"
"34567","Commas, oh commas!","3.14159","""Quotes"" quoth he","192.168.99.37","zzz","011"
"45678","Commas, oh commas!","3.14159","""Quote me"",""or not""","192.168.99.37","zzz","011"
The first script shown produces the output:
"12345","","","None","192.168.2.1","YYYY-MM-DD HH:MM:SS","qqq","000"
"67890","ABC-1234-5678","9.9","Low","192.168.2.1","YYYY-MM-DD HH:MM:SS","qqq","000"
"23456","Quaternions","2.3","Pisces","Heredotus","qqq","000"
"34567","Commas, oh commas!","3.14159","""Quotes"" quoth he","192.168.99.37","zzz","011"
"45678","Commas, oh commas!","3.14159","""Quote me"",""or not""","192.168.99.37","zzz","011"
The first two lines are correctly mapped; the third is correctly unchanged, but the last two should have been mapped and were not.
The second and third commands produce:
"12345","","","None","192.168.2.1","YYYY-MM-DD HH:MM:SS","qqq","000"
"67890","ABC-1234-5678","9.9","Low","192.168.2.1","YYYY-MM-DD HH:MM:SS","qqq","000"
"23456","Quaternions","2.3","Pisces","Heredotus","qqq","000"
"34567","Commas, oh commas!","3.14159","""Quotes"" quoth he","192.168.99.37","YYYY-MM-DD HH:MM:SS","zzz","011"
"45678","Commas, oh commas!","3.14159","""Quote me"",""or not""","192.168.99.37","YYYY-MM-DD HH:MM:SS","zzz","011"
Note that Heredotus is not modified (correctly), and the last two lines get the date string added after the IP address (also correctly).
Those last regular expressions are not for the faint-of-heart.
Clearly, if you want to insist that the IP addresses only match numbers in the range 0..255 in each component, with no leading 0, then you have to beef up the IP address matching portion of the regular expression. It can be done; it is not pretty. It is easiest to do it with extended regular expressions:
([0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])
You'd use that unit in place of each [0-9]{1,3} unit in the regexes shown before.
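A quick sketch of the tightened pattern in action, using grep -E to test candidate addresses:

```shell
# Each octet restricted to 0-255, with no leading zeros.
octet='([0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])'
ip="($octet\.){3}$octet"
echo '192.168.2.1'   | grep -Eq "^$ip$" && echo match      # prints match
echo '192.168.2.999' | grep -Eq "^$ip$" || echo no-match   # prints no-match
```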
Note that this still does not attempt to deal with fields not surrounded by double quotes.
It also does not determine the value to substitute from the date command. That is doable with (if not elementary then) routine shell scripting carefully managing quotes:
dt=$(date +'%Y-%m-%d %H:%M:%S')
sed -E 's/^("(([^"]*"")*[^"]*)",){4}"([0-9]{1,3}\.){3}[0-9]{1,3}",/&"'"$dt"'",/'
The '…"'"$dt"'",/' sequence is part of what starts out as a single-quoted string. The first double quote is simple data in the string; the next single quote ends the quoting, the "$dt" interpolates the value from date inside shell double quotes (so the space doesn't cause any trouble), then the single quote resumes the single-quoted notation, adding another double quote, a comma and a slash before the string (argument to sed) is terminated.
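The splicing can be seen in isolation; printf shows the single argument the shell actually assembles (the date value below is a hypothetical stand-in for the command substitution):

```shell
dt='2024-01-02 03:04:05'                 # stand-in for $(date +'%Y-%m-%d %H:%M:%S')
printf '%s\n' 's/pattern/&"'"$dt"'",/'   # prints s/pattern/&"2024-01-02 03:04:05",/
```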
Try:
awk -v date1="$(date +"%Y-%m-%d")" -v date2="$(date +"%H:%M:%S")" -F, '$5 ~ /^"[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+"$/{$5=$5 FS "\"" date1 " " date2 "\""} 1' OFS=, Input_file
Also, if you want to edit Input_file itself, redirect the command's output to a temp file and then rename it (with mv) over Input_file.
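That temp-file dance looks like this (a sketch using a trivial awk program and throwaway file names):

```shell
# awk has no POSIX in-place option, so write to a temp file and rename it
# over the original.
printf '%s\n' 'a,b' > Input_file                 # sample input for the demo
awk -F, '{$1="X"} 1' OFS=, Input_file > Input_file.tmp &&
  mv Input_file.tmp Input_file
cat Input_file                                    # prints X,b
```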
Here is the same solution in expanded, multi-line form:
awk -v date1="$(date +"%Y-%m-%d")" -v date2="$(date +"%H:%M:%S")" -F, '
$5 ~ /^"[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+"$/{
  $5=$5 FS "\"" date1 " " date2 "\""
}
1
' OFS=, Input_file
Find/replace space with %20
I must replace with %20 all spaces in *.html files that appear inside href="something something .pdf".
I found a regular expression for that task:
find : href\s*=\s*['"][^'" ]*\K\h|(?!^)\G[^'" ]*\K\h
replace : %20
That regular expression works in text editors like Notepad++ or Geany.
I want to use that regular expression from the Linux command line with sed or perl.
Solution (1):
cat test002.html | perl -ne 's/href\s*=\s*['\''"][^'\''" ]*\K\h|(?!^)\G[^'\''" ]*\K\h/%20/g; print;' > Work_OK01.html
Solution (2):
cat test002.html | perl -ne 's/href\s*=\s*[\x27"][^\x27" ]*\K\h|(?!^)\G[^\x27" ]*\K\h/%20/g; print;' > Work_OK02.html
The problem is that you don't properly escape the single quotes in your program.
If your program is
...[^'"]...
The shell literal might be
'...[^'\''"]...'
'...[^'"'"'"]...'
'...[^\x27"]...' # Avoids using a single quote to avoid escaping it.
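printf confirms that the first two spellings hand Perl the identical argument, while the third passes \x27 through for Perl's regex engine to decode:

```shell
printf '%s\n' '...[^'\''"]...'      # prints ...[^'"]...
printf '%s\n' '...[^'"'"'"]...'     # prints ...[^'"]...
printf '%s\n' '...[^\x27"]...'      # prints ...[^\x27"]... (Perl reads \x27 as ')
```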
So, you were going for
perl -ne 's/href\s*=\s*['\''"][^'\''" ]*\K\h|(?!^)\G[^'\''" ]*\K\h/%20/g; print;'
Don't try to do everything at once. Here are some far cleaner (i.e. far more readable) solutions:
perl -pe's{href\s*=\s*['\''"]\K([^'\''"]*)}{ $1 =~ s/ /%20/rg }eg' # 5.14+
perl -pe's{href\s*=\s*['\''"]\K([^'\''"]*)}{ (my $s = $1) =~ s/ /%20/g; $s }eg'
Note that -p is the same as -n, except that it causes a print to be performed for each line.
The above solutions make a large number of assumptions about the files that might be encountered[1]. All of these assumptions would go away if you use a proper parser.
If you have HTML files:
perl -MXML::LibXML -e'
my $doc = XML::LibXML->new->parse_file($ARGV[0]);
$_->setValue( $_->getValue() =~ s/ /%20/gr )
for $doc->findnodes(q{//@href});
binmode(STDOUT);
print($doc->toStringHTML());
' in_file.html >out_file.html
If you have XML (incl XHTML) files:
perl -MXML::LibXML -e'
my $doc = XML::LibXML->new->parse_file($ARGV[0]);
$_->setValue( $_->getValue() =~ s/ /%20/gr )
for $doc->findnodes(q{//@href});
binmode(STDOUT);
$doc->toFH(\*STDOUT);
' in_file.html >out_file.html
Assumptions made by the substitution-based solutions:
File uses an ASCII-based encoding (e.g. UTF-8, iso-latin-1, but not UTF-16le).
No newline between href and =.
No newline between = and the value.
No newline in the value of href attributes.
Nothing matching /href\s*=/ in text (incl CDATA sections).
Nothing matching /href\s*=/ in comments.
No other attributes have a name ending in href.
No single quote (') in href="...".
No double quote (") in href='...'.
No href= with an unquoted value.
Spaces in href attributes aren't encoded using a character entity (e.g. &#32;).
Maybe more?
(SLePort makes similar assumptions even though they didn't document them. They also assume href attributes don't contain >.)
An XML parser would be better suited for that job (e.g. XMLStarlet, xmllint, ...), but if you don't have newlines in your a tags, the sed below should work.
Using the t command and backreferences, it loops over and replace all spaces up to last " inside the a tags:
$ sed ':a;s/\(<a [^>]*href=[^"]*"[^ ]*\) \([^"]*">\)/\1%20\2/;ta' <<< '<a href="http://url with spaces">'
<a href="http://url%20with%20spaces">
You seem to have neglected to escape the quotes inside the string you pass to Perl. So Bash sees you giving perl the following arguments:
s/href\s*=\s*[][^', which results from the concatenation of the single-quoted string 's/href\s*=\s*[' and the double-quoted string "][^'"
]*Kh, unquoted, because \K and \h are not special characters in the shell so it just treats them as K and h respectively
Then Bash sees a pipe character |, followed by a subshell invocation (?!^), in which !^ gets substituted with the first argument of the last command invoked. (See "History Expansion > Word Designators" in the Bash man page.) For example, if your last command was echo myface then (?!^) would look for the command named ?myface and runs it in a subshell.
And finally, Bash gets to the sequence \G[^'" ]*\K\h/%20/g; print;', which is interpreted as the concatenation of G (from \G), [^, and the single-quoted string " ]*\K\h/%20/g; print;. Bash has no idea what to do with G[^" ]*\K\h/%20/g; print;, since it just finished parsing a subshell invocation and expects to see a semicolon, line break, or logical operator (or so on) before getting another arbitrary string.
Solution: properly quote the expression you give to perl. You'll need to use a combination of single and double quotes to pull it off, e.g.
perl -ne 's/href\s*=\s*['"'\"][^'\" ]*"'\K\h|(?!^)\G[^'"'\" ]*"'\K\h/%20/g; print;'
I've been racking my brain with this for the past half an hour and everything I've tried so far has failed miserably!
Within an HTML file, there is a field within tags, but the field itself is not separated by a space from the > sign, so it's hard to read with awk. I would basically like to add a single space after the opening tag, but gsub and awk are refusing to cooperate.
I've tried
awk 'gsub("class\\\'\\\'>","class\\\'\\\'> ")' filename
since one backslash is needed to escape the single quote, the second to escape the backslash itself, and the third to escape the sequence \'. But Terminal (I'm working on a Mac) refuses to execute, and instead goes to the next line awaiting further input from me.
Please help :(
In Bash, single quotes accept absolutely no kind of escape. Suppose e.g. I write this command:
$ echo '\''
>
Bash will consider the string opened by ' closed at the second ', generating a string containing only \. The next ', then, is considered the opening of a new string, so bash expects for more input in the next line (signalled by the >).
If you are not aware of this fact, you may think that the string after the echo command below is still open, but it is closed:
$ echo 'will this string contain a single quote like \'
will this string contain a single quote like \
So, when you write
'gsub("class\\\'\\\'>","class\\\'\\\'> ")'
you are writing the string gsub("class\\\ concatenated with a backslash and a quote (\'); then a greater than signal. After this, the "," is interpreted as a string containing a comma, because the single quote of the beginning of the expression was closed before. For now, the result is:
gsub("class\\\\'>,
After the comma, you have the string class, followed by a backslash and a quote, followed by another backslash and another quote, and finally by a greater than symbol and a space. This is the current string:
gsub("class\\\\'>,class\'\'>
This is not a valid awk expression! Anyway, it gets worse: the double quote " will start a string, which will contain a closing parenthesis and a single quote, but this string is never closed!
Summing up, your problem is that, if you opened a string with ' in Bash, it will be forcibly closed at the next ', no matter how many backslashes you put before it.
Solution: you can play some tricks opening and closing strings with ' and ", but it becomes cumbersome quickly. My suggested solution is to put your awk expression in a file and then use awk's -f flag, which makes awk execute the script in the given file:
$ cat filename # The file to be changed
class''>
class>
class''>
$ cat mycode.awk # The awk script
gsub("class''>", "class''>[PSEUDOSPACE]")
$ awk -f mycode.awk filename # THE RESULT!
class''>[PSEUDOSPACE]
class''>[PSEUDOSPACE]
If you do not want to write a file, use a so-called here document:
$ awk -f- filename <<EOF
gsub("class''>", "class''>[PSEUDOSPACE]")
EOF
class''>[PSEUDOSPACE]
class''>[PSEUDOSPACE]
The problem is that you are escaping the ', so you are not finishing the command. For example:
echo \' > foo
echoes a single quote into the file named foo, and
echo \\\' > foo
writes a single backslash followed by a single quote.
In particular, you cannot escape a single quote inside a string, so
'foo\'bar'
is the string foo\ followed by the string bar followed by an unmatched open quote. It is exactly the same as writing "foo\\"bar'
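The matched-quote part of that equivalence is easy to verify:

```shell
# 'foo\' and "foo\\" are the same four-character argument ending in a backslash.
printf '%s\n' 'foo\'     # prints foo\
printf '%s\n' "foo\\"    # prints foo\
```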