Substitute json value with space using sed and regex - json

I have multiple JSON files that look like the sample below:
#sample json
{"urlCurrent":"https://www.website1.com/inside/377/388/408/8002.html?utm_source=source&utm_medium=Click&utm_campaign=123","id":"00001"}
{"urlCurrent":"https://127.0.0.1/inside/414/756/765/34984.html","id":"00002"}
{"urlCurrent":"https://msdn.anything.com/en-us","id":"00002"}
{"urlCurrent":"https://web.something.com/","id":"00002"}
I would like the json to become:
#result json
{"urlCurrent":"https://www.website1.com/","id":"00001"}
{"urlCurrent":"https://127.0.0.1/","id":"00002"}
{"urlCurrent":"https://msdn.anything.com/","id":"00002"}
{"urlCurrent":"https://web.something.com/","id":"00002"}
I think that with
sed -i 's/{regular expression}/\ /g' sample.json
which substitutes everything after the / with a space, the result can be achieved. However, I don't know how to write a regular expression that matches the pattern I need, nor which keywords to search for.
Is there a way to truncate the urlCurrent to become the result I need?
Thanks in advance!
12/23 Update
This works:
sed -E -i 's!(http|ftp|https)://([0-9a-zA-Z\.]+)([0-9a-zA-Z\/\.?#=_&%~+-]+)!\1://\2/!g' sample.json

sed -i -r 's/(.*:\/\/?[^\/]+\/?)[^\"]*(.*)/\1\2/' sample.json
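If sed is not a hard requirement, the same truncation can be sketched in Python. This is only an illustration, not one of the commands above: it assumes one JSON object per line, and the helper names truncate_url and truncate_line are made up for the example. It parses each URL with urllib.parse instead of matching it with a regex.

```python
import json
from urllib.parse import urlsplit


def truncate_url(url):
    """Keep only the scheme and host of a URL, followed by a single slash."""
    parts = urlsplit(url)
    return f"{parts.scheme}://{parts.netloc}/"


def truncate_line(line, key="urlCurrent"):
    """Rewrite one JSON line so that record[key] holds the truncated URL."""
    record = json.loads(line)
    record[key] = truncate_url(record[key])
    # Compact separators keep the output in the same one-line style as the input.
    return json.dumps(record, separators=(",", ":"))


sample = (
    '{"urlCurrent":"https://www.website1.com/inside/377/388/408/8002.html'
    '?utm_source=source&utm_medium=Click&utm_campaign=123","id":"00001"}'
)
print(truncate_line(sample))
```

Because the line is parsed as JSON, this does not depend on the order of the keys or on which characters appear in the path and query string.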

Related

How to insert a new line with content in front of each line in json?

I think sed should be the command to do this, but I haven't figured out the proper command yet.
My json file looks like this :
{"LAST_MODIFIED_BY":"david","LAST_MODIFIED_DATE":"2018-06-26 12:02:03.0","CLASS_NAME":"/SC/Trade/HTS_CA/1234abcd","DECISION":"AGREE","TASK_TYPE_ID":"abcd1234-832b-43b6-afa6-361253ffe1d5","NAME":"something"}
{"LAST_MODIFIED_BY":"sarah","LAST_MODIFIED_DATE":"2018-08-26 12:02:03.0","CLASS_NAME":"/SC/Import/HTS_US/9876abcd","DECISION":"DISAGREE","TASK_TYPE_ID":"abcd1234-832b-43b6-afa6-361253ffe1d5","NAME":"nicename"}
... more rows to follow
What I want to achieve is a JSON document with the contents below:
{"index":{}}
{"LAST_MODIFIED_BY":"david","LAST_MODIFIED_DATE":"2018-06-26 12:02:03.0","CLASS_NAME":"/SC/Trade/HTS_CA/1234abcd","DECISION":"AGREE","TASK_TYPE_ID":"abcd1234-832b-43b6-afa6-361253ffe1d5","NAME":"something"}
{"index":{}}
{"LAST_MODIFIED_BY":"sarah","LAST_MODIFIED_DATE":"2018-08-26 12:02:03.0","CLASS_NAME":"/SC/Import/HTS_US/9876abcd","DECISION":"DISAGREE","TASK_TYPE_ID":"abcd1234-832b-43b6-afa6-361253ffe1d5","NAME":"nicename"}
... more rows to follow
so that I could run bulk load API against Elasticsearch.
The closest existing question is "Elasticsearch Bulk JSON Data", but that approach split my JSON file into broken items instead of producing my desired format.
Any ideas how I can achieve this would be greatly appreciated!
Using sed:
sed 's/^/{"index":{}}\
/'
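The trick here is the backslash at the end of the first line: it escapes the newline, so a literal newline becomes part of the replacement text.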
Alternatively, if your shell supports it:
sed $'s/^/{"index":{}}\n/'
or (as per @sundeep's suggestion):
sed $'i\\\n{"index":{}}\n'
Using jq:
jq -nc 'inputs | {"index":{}}, .'
Here, the key is the -c option to produce JSONLines.
Using awk:
awk '{print "{\"index\":{}}"; print;}'
Etc.
This might work for you (GNU sed):
sed 'i{"index":{}}' file
Insert {"index":{}} before each line.
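For comparison, here is a minimal Python sketch of the same idea. It is not taken from the answers above: the function name with_index_lines is hypothetical, and skipping blank lines is an assumption that the sed and awk commands do not make.

```python
def with_index_lines(lines, action='{"index":{}}'):
    """Yield the bulk action line before every non-empty input line."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # assumption: blank lines carry no document, so skip them
        yield action
        yield line


rows = [
    '{"LAST_MODIFIED_BY":"david","DECISION":"AGREE"}\n',
    '{"LAST_MODIFIED_BY":"sarah","DECISION":"DISAGREE"}\n',
]
for out in with_index_lines(rows):
    print(out)
```

To process a real file, pass an open file object instead of the rows list, since iterating over a file yields its lines.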

Extract json value with sed

I have a json result and I would like to extract a string without double quotes
{"value1":5.0,"value2":2.5,"value3":"2019-10-24T15:26:00.000Z","modifier":[]}
With this regex I can extract value3 (2019-10-24T15:26:00.000Z) correctly:
sed -e 's/^.*"value3":"\([^"]*\)".*$/\1/'
How can I extract the value of "value2", which is not wrapped in double quotes?
I have to do this with sed because I can't install jq. That's my problem.
With GNU sed for -E to enable EREs:
$ sed -E 's/.*"value3":"?([^,"]*)"?.*/\1/' file
2019-10-24T15:26:00.000Z
$ sed -E 's/.*"value2":"?([^,"]*)"?.*/\1/' file
2.5
With any POSIX sed:
$ sed 's/.*"value3":"\{0,1\}\([^,"]*\)"\{0,1\}.*/\1/' file
2019-10-24T15:26:00.000Z
$ sed 's/.*"value2":"\{0,1\}\([^,"]*\)"\{0,1\}.*/\1/' file
2.5
The above assumes you never have commas inside quoted strings.
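To see what the pattern does and where that assumption bites, here is the same regex transcribed to Python's re module. This is purely illustrative: the function name extract is made up, and the last call uses an invented input with a comma inside a quoted string.

```python
import re


def extract(key, text):
    """Return the value after "key":, with optional surrounding quotes removed."""
    match = re.search(r'"%s":"?([^,"]*)"?' % re.escape(key), text)
    return match.group(1) if match else None


data = '{"value1":5.0,"value2":2.5,"value3":"2019-10-24T15:26:00.000Z","modifier":[]}'
print(extract("value2", data))  # unquoted number
print(extract("value3", data))  # quoted string
# A comma inside a quoted string cuts the value short:
print(extract("note", '{"note":"a,b","n":1}'))
```

The first two calls return 2.5 and 2019-10-24T15:26:00.000Z, while the last one returns only a, because [^,"]* stops at the first comma.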
Just use jq, a command-line JSON processor:
$ json_data='{"value1":5.0,"value2":2.5,"value3":"2019-10-24T15:26:00.000Z","modifier":[]}'
$ jq '.value2' <(echo "$json_data")
2.5
with the key .value2 to access the value you are interested in.
The following question summarizes why you should NOT use regex to parse JSON (the same goes for XML/HTML and other data structures that can, in theory, be nested infinitely deep):
Regex for parsing single key: values out of JSON in Javascript
If you do not have jq available, you can use the following GNU grep command:
$ echo '{"value1":5.0,"value2":2.5,"value3":"2019-10-24T15:26:00.000Z","modifier":[]}' | grep -zoP '"value2":\s*\K[^\s,]*(?=\s*,)'
2.5
using the regex detailed here:
"value2":\s*\K[^\s,]*(?=\s*,)
demo: https://regex101.com/r/82J6Cb/1/
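This works even if the JSON is not on a single line, because -z makes grep treat the whole input as one record. Note that the lookahead requires a comma after the value, so it will not match the last key of an object.
With Python it is also straightforward, and Python is installed by default on many machines. The script below should work with both Python 2 and Python 3: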
$ cat data.json
{"value1":5.0,"value2":2.5,"value3":"2019-10-24T15:26:00.000Z","modifier":[]}
$ cat extract_value2.py
import json
with open('data.json') as f:
    data = json.load(f)
print(data["value2"])
$ python extract_value2.py
2.5
You can try this:
creds=$(eval aws secretsmanager get-secret-value --region us-east-1 --secret-id dpi/dev/hivemetastore --query SecretString --output text )
passwd=$(/bin/echo "${creds}" | /bin/sed -n 's/.*"password":"\(.*\)",/\1/p' | awk -F"\"" '{print $1}')
It is definitely possible to remove the awk part, though.
To extract all key/value pairs, one per line, into a file using sed (Linux):
sed 's/["{}\]//g' <your_file.json> | sed 's/,/\n/g' >> <your_new_file_to_save>
sed 's/regexp/replacement/g' inputFileName > outputFileName
In some versions of sed, the expression must be preceded by -e to indicate that an expression follows.
The s stands for substitute, while the g stands for global, which means that all matching occurrences in the line are replaced.
The bracket expression [...] lists the characters to remove from the .json file.
The pipe character | connects the output of one command to the input of the next.
Finally, the second sed replaces each , with \n, a line break.
If you want to show a single value, use the command below:
sed 's/["{}\]//g' <your_file.json> | sed 's/,/\n/g' | sed -n 's/^<your_key>://p'
With -n, sed prints nothing by default, and the p flag on the s command prints only the lines where the substitution succeeded. The result is the value that belongs to <your_key>, with the key itself removed.
If your data is in a file named d, try GNU sed:
sed -E 's/[{,]"\w+":([^,"]+)/\1\n/g ;s/(.*\n).*".*\n/\1/' d

not able to store sed output to variable

I am new to bash scripting.
I get a JSON response and need only one property from it. I want to save it to a variable, but it is not working:
token=$result |sed -n -e 's/^.*access_token":"//p' | cut -d'"' -f1
echo $token
It returns a blank line.
I cannot use jq or any third party tools.
Please let me know what I am missing.
Your command should be:
token=$(echo "$result" | sed -n -e 's/^.*access_token":"//p' | cut -d'"' -f1)
You need to use echo to print the contents of the variable over standard output, and you need to use a command substitution $( ) to assign the output of the pipeline to token.
Quoting your variables is always encouraged, to avoid problems with white space and glob characters like *.
As an aside, note that you can probably obtain the output using something like:
token=$(jq -r .access_token <<<"$result")
I know you've said that you can't use jq but it's a standalone binary (no need to install it) and treats your JSON in the correct way, not as arbitrary text.
Give this a try:
token="$(sed -E -n -e 's/^.*access_token": ?"//p' <<<"$result" | cut -d'"' -f1)"
Explanation:
token="$( script here )" means that $token is set to the output of the commands run inside the subshell, through a process known as command substitution.
-E in sed enables extended regular expressions, in which ? means "zero or one of the preceding character". We want this because JSON often contains a space after the : and before the next ", so the ? after the space tells sed that the space may or may not be present.
<<<"$result" is a here-string that feeds the data to sed on stdin in place of a file.
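The matching logic of that sed and cut pipeline can be checked in isolation with a short Python sketch. This is an illustration only: the function name access_token and both sample responses are hypothetical, since the actual response is not shown in the question.

```python
import re


def access_token(result):
    """Return the text between access_token":" (optional space) and the next quote."""
    match = re.search(r'access_token": ?"([^"]*)', result)
    return match.group(1) if match else ""


# Hypothetical responses, one without and one with a space after the colon.
compact = '{"access_token":"abc123","expires_in":3600}'
spaced = '{"access_token": "abc123", "expires_in": 3600}'
print(access_token(compact))
print(access_token(spaced))
```

Both calls give abc123, which is the effect the optional space in the sed expression is meant to achieve.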

awk change datetime format

I have a huge number of files in which each line is a JSON document with an incorrect date format. The format I have now is 2011-06-02 21:43:59, and I need to insert a T in between to turn it into the ISO format 2011-06-02T21:43:59.
Can somebody please point me to a one-liner solution? I have been struggling with this for 2 hours, with no luck.
sed will come to your rescue, with a simple regex:
sed 's/\([0-9]\{4\}-[0-9]\{2\}-[0-9]\{2\}\) /\1T/g' file > file.new
or, to modify the file in place:
sed -i 's/\([0-9]\{4\}-[0-9]\{2\}-[0-9]\{2\}\) /\1T/g' file
Example
echo '2011-06-02 21:43:59' | sed 's/\([0-9]\{4\}-[0-9]\{2\}-[0-9]\{2\}\) /\1T/g'
2011-06-02T21:43:59
Read more about regexes here: Regex Tag Info
The following seems to be the working solution:
sed -i -r 's/([0-9]{4}-[0-9]{2}-[0-9]{2}) ([0-9]{2}:[0-9]{2}:[0-9]{2})/\1T\2/g' myfiles
-i edits the files in place
-r switches on extended regular expressions
([0-9]{4}-[0-9]{2}-[0-9]{2}) matches the date
the literal space matches the space between date and time in the source data
([0-9]{2}:[0-9]{2}:[0-9]{2}) matches the time
With GNU awk, you can also use capture groups via gensub:
awk '{
print gensub(/([0-9]{4}-[0-9]{2}-[0-9]{2})\s+([0-9]{2}:[0-9]{2}:[0-9]{2})/,
"\\1T\\2",
"g");
}' data.txt
echo '2011-06-02 21:43:59' | awk 'sub(/ /,"T")'
2011-06-02T21:43:59
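The date-plus-time pattern used in the sed and gensub answers can also be tried out in Python. This is a sketch for experimenting with the regex, not a replacement for the one-liners; the function name to_iso and the second sample line are made up for the example.

```python
import re

# Same two capture groups as the sed -r answer: the date, a single space, the time.
DATETIME = re.compile(r"([0-9]{4}-[0-9]{2}-[0-9]{2}) ([0-9]{2}:[0-9]{2}:[0-9]{2})")


def to_iso(line):
    """Replace the space between date and time with a T, everywhere in the line."""
    return DATETIME.sub(r"\1T\2", line)


print(to_iso("2011-06-02 21:43:59"))
print(to_iso('{"date":"2018-06-26 12:02:03.0","name":"a b"}'))
```

Matching the time as well as the date means that other spaces in the line, such as the one in "a b", are left alone. That is the difference from the shorter sub(/ /,"T") command, which replaces the first space on the line wherever it occurs.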

Help with sed regex: extract text from specific tag

First time sed'er, so be gentle.
I have the following text file, 'test_file':
<Tag1>not </Tag1><Tag2>working</Tag2>
I want to extract the text between the <Tag2> tags using a sed regex. There may be other occurrences of <Tag2>, and I would like to extract those as well.
So far I have this sed based regex:
cat test_file | grep -i "Tag2"| sed 's/<[^>]*[>]//g'
which gives the output:
not working
Anyone any idea how to get this working?
As another poster said, sed may not be the best tool for this job. You may want to use something built for XML parsing, or even a simple scripting language, such as perl.
The problem with your attempt is that it does not isolate the part of the line you want.
cat test_file is good - it prints out the contents of the file to stdout.
grep -i "Tag2" is ok - it prints out only lines with "Tag2" in them. This may not be exactly what you want. Bear in mind that it will print the whole line, not just the <Tag2> part, so you will still have to search out that part later.
sed 's/<[^>]*[>]//g' isn't what you want - it simply removes the tags, including <Tag1> and <Tag2>.
You can try something like:
cat test_file | grep -i tag2 | sed 's/.*<Tag2>\(.*\)<\/Tag2>.*/\1/'
This will produce
working
but it will only work for one tag pair per line.
For your nice, friendly example, you could use
sed -e 's/^.*<Tag2>//' -e 's!</Tag2>.*!!' test_file
but the XML out there is cruel and uncaring. You're asking for serious trouble using regular expressions to scrape XML.
You can use gawk, e.g.:
$ cat file
<Tag1>not </Tag1><Tag2>working here</Tag2>
<Tag1>not </Tag1><Tag2>
working
</Tag2>
$ awk -vRS="</Tag2>" '/<Tag2>/{gsub(/.*<Tag2>/,"");print}' file
working here
working
awk -F"Tag2" '{print $2}' test_1 | sed 's/[^a-zA-Z]//g'
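As the answers suggest, a scripting language can handle several <Tag2> pairs, including ones that span lines. Below is a minimal Python sketch, assuming the tags are never nested and carry no attributes; the function name tag_texts is hypothetical. It is still regex scraping rather than real XML parsing, so the earlier warnings apply.

```python
import re


def tag_texts(text, tag="Tag2"):
    """Return the stripped text of every <tag>...</tag> pair, in order."""
    name = re.escape(tag)
    # Non-greedy match so each pair is captured separately; DOTALL lets . span newlines.
    pattern = re.compile(r"<%s>(.*?)</%s>" % (name, name), re.DOTALL)
    return [found.strip() for found in pattern.findall(text)]


sample = (
    "<Tag1>not </Tag1><Tag2>working here</Tag2>\n"
    "<Tag1>not </Tag1><Tag2>\n"
    "working\n"
    "</Tag2>\n"
)
print(tag_texts(sample))
```

For the sample above this gives the two strings working here and working.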