awk change datetime format - json

I have a huge number of files in which each line is a JSON object with an incorrect date format. The current format is 2011-06-02 21:43:59, and I need to add a T in between to transform it to the ISO format 2011-06-02T21:43:59.
Can somebody please point me to a one-liner solution? I struggled with this for 2 hours, but no luck.

sed will come to your rescue, with a simple regex:
sed 's/\([0-9]\{4\}-[0-9]\{2\}-[0-9]\{2\}\) /\1T/g' file > file.new
or, to modify the file in place:
sed -i 's/\([0-9]\{4\}-[0-9]\{2\}-[0-9]\{2\}\) /\1T/g' file
Example
echo '2011-06-02 21:43:59' | sed 's/\([0-9]\{4\}-[0-9]\{2\}-[0-9]\{2\}\) /\1T/g'
2011-06-02T21:43:59
Read more about regexes here: Regex Tag Info
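With a large number of files, the in-place form can be applied to a whole directory tree. The following is only a sketch: it assumes the files end in .json and sit under one directory, and the helper name fix_dir is made up for illustration. The -i.bak form keeps a backup copy of every file so the change can be reverted:

```shell
# Hypothetical helper: rewrite every *.json file under directory $1 in place,
# leaving the original next to it as <name>.bak.
fix_dir() {
  find "$1" -type f -name '*.json' -exec \
    sed -i.bak 's/\([0-9]\{4\}-[0-9]\{2\}-[0-9]\{2\}\) /\1T/g' {} +
}
```

Usage would be fix_dir ./data, where ./data is a placeholder path. Once the results have been checked, the .bak files can be deleted.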

The following seems to be the working solution:
sed -i -r 's/([0-9]{4}-[0-9]{2}-[0-9]{2}) ([0-9]{2}:[0-9]{2}:[0-9]{2})/\1T\2/g' myfiles
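-i edits the files in place
-r switches on extended regular expressions (-E also works in GNU and BSD sed)
([0-9]{4}-[0-9]{2}-[0-9]{2}) - matches the date
the single space between the two groups - matches the space between date and time in the source data
([0-9]{2}:[0-9]{2}:[0-9]{2}) - matches the time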
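To try the expression before touching any file, it can be run without -i on a single line from stdin. This is a small sketch; the helper name to_iso and the sample JSON line are invented for the demonstration:

```shell
# Hypothetical helper: read lines on stdin, write them to stdout with the
# space between date and time replaced by T.
to_iso() {
  sed -E 's/([0-9]{4}-[0-9]{2}-[0-9]{2}) ([0-9]{2}:[0-9]{2}:[0-9]{2})/\1T\2/g'
}

# Made-up sample line; the space inside "a b" must stay untouched.
printf '%s\n' '{"id":1,"created":"2011-06-02 21:43:59","note":"a b"}' | to_iso
```

Because the pattern requires a full time after the space, other spaces in the line (such as the one in "a b") are left alone.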

With GNU awk, you can also use capture groups via gensub (a gawk extension):
awk '{
print gensub(/([0-9]{4}-[0-9]{2}-[0-9]{2})\s+([0-9]{2}:[0-9]{2}:[0-9]{2})/,
"\\1T\\2",
"g");
}' data.txt

echo '2011-06-02 21:43:59' | awk 'sub(/ /,"T")'
2011-06-02T21:43:59
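Note that sub(/ /,"T") replaces only the first space on each line. That is fine for a bare timestamp, but in a JSON line any earlier space would be hit instead. A possible alternative for awks without gensub is sketched below; the helper name fix_dates is an assumption, and the pattern is spelled out without interval expressions because not every awk supports {n}:

```shell
# Hypothetical helper: replace the space inside every
# "YYYY-MM-DD HH:MM:SS" occurrence with T, using only match() and substr().
fix_dates() {
  awk '
    BEGIN {
      d = "[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9] [0-9][0-9]:[0-9][0-9]:[0-9][0-9]"
    }
    {
      out = ""
      rest = $0
      while (match(rest, d)) {
        # Keep everything up to the end of the date (10 characters),
        # write T instead of the space, then copy the 8-character time.
        out = out substr(rest, 1, RSTART + 9) "T" substr(rest, RSTART + 11, RLENGTH - 11)
        rest = substr(rest, RSTART + RLENGTH)
      }
      print out rest
    }'
}
```

It reads stdin and writes stdout, for example fix_dates < data.txt > data.new.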

Related

How to insert a new line with content in front of each line in json?

I think sed should be the command to do this, but I haven't figured out the proper command yet.
My json file looks like this :
{"LAST_MODIFIED_BY":"david","LAST_MODIFIED_DATE":"2018-06-26 12:02:03.0","CLASS_NAME":"/SC/Trade/HTS_CA/1234abcd","DECISION":"AGREE","TASK_TYPE_ID":"abcd1234-832b-43b6-afa6-361253ffe1d5","NAME":"something"}
{"LAST_MODIFIED_BY":"sarah","LAST_MODIFIED_DATE":"2018-08-26 12:02:03.0","CLASS_NAME":"/SC/Import/HTS_US/9876abcd","DECISION":"DISAGREE","TASK_TYPE_ID":"abcd1234-832b-43b6-afa6-361253ffe1d5","NAME":"nicename"}
... more rows to follow
what I wanted to achieve is a json document with below contents:
{"index":{}}
{"LAST_MODIFIED_BY":"david","LAST_MODIFIED_DATE":"2018-06-26 12:02:03.0","CLASS_NAME":"/SC/Trade/HTS_CA/1234abcd","DECISION":"AGREE","TASK_TYPE_ID":"abcd1234-832b-43b6-afa6-361253ffe1d5","NAME":"something"}
{"index":{}}
{"LAST_MODIFIED_BY":"sarah","LAST_MODIFIED_DATE":"2018-08-26 12:02:03.0","CLASS_NAME":"/SC/Import/HTS_US/9876abcd","DECISION":"DISAGREE","TASK_TYPE_ID":"abcd1234-832b-43b6-afa6-361253ffe1d5","NAME":"nicename"}
... more rows to follow
so that I could run bulk load API against Elasticsearch.
The closest one is this one: Elasticsearch Bulk JSON Data, but it split my json file into broken items instead of my desired format.
Any ideas how I can achieve this would be greatly appreciated!
Using sed:
sed 's/^/{"index":{}}\
/'
The trick here is the \.
Alternatively, if your shell supports it:
sed $'s/^/{"index":{}}\n/'
or (as per @sundeep's suggestion):
sed $'i\\\n{"index":{}}\n'
Using jq:
jq -nc 'inputs | {"index":{}}, .'
Here, the key is the -c option to produce JSONLines.
Using awk:
awk '{print "{\"index\":{}}"; print;}'
Etc.
This might work for you (GNU sed):
sed 'i{"index":{}}' file
Insert {"index":{}} before each line.
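If the input may contain empty lines, each of them would also get an {"index":{}} line in front of it with the commands above. A minimal awk sketch that skips blank lines is shown here, under the assumption that blank lines should simply be dropped; the helper name add_index_lines is made up:

```shell
# Hypothetical helper: print {"index":{}} before every non-blank input line.
# NF is 0 for empty or whitespace-only lines, so those are skipped entirely.
add_index_lines() {
  awk 'NF { print "{\"index\":{}}"; print }'
}
```

Example: add_index_lines < input.json > bulk.json, where both file names are placeholders.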

How to grep specific dates from a html file

I have an HTML file that has a number of dates in the format dd/mm/yyyy spread all over it. I was looking for a way to retrieve specific dates from it.
input:
Released: 08/08/2019</td>
<td>06/26/2019</td>
Released: 03/09/2019</td>
<td>14/29/2019</td>
I found a way to retrieve all dates from the file:
grep -o "[0-9]\{2\}/[0-9]\{2\}/[0-9]\{4\}"
output:
08/08/2019
06/26/2019
03/09/2019
14/29/2019
However, I need to filter these dates and pick only those that have this format:
<td>dd/mm/yyyy</td>
So from the above input, I need this output:
06/26/2019
14/29/2019
I always recommend using an HTML/XML parser. If this is not possible try GNU grep and a Perl-compatible regular expression (PCRE):
grep -Po '(?<=<td>)[0-9]{2}/[0-9]{2}/[0-9]{4}(?=</td>)' file
Output:
06/26/2019
14/29/2019
This awk command may also work, as long as the matching lines start with <td>:
awk -F"</?td>" '/^<td>/{print $2}' file
06/26/2019
14/29/2019
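The -P option is specific to GNU grep. Where it is not available, a plain sed expression can do the same filtering. This is a sketch that assumes at most one <td>date</td> cell per line; the helper name td_dates is hypothetical:

```shell
# Hypothetical helper: print only dates that are the full content of a
# <td>...</td> cell. -n suppresses default output; p prints matching lines.
td_dates() {
  sed -n 's|^.*<td>\([0-9]\{2\}/[0-9]\{2\}/[0-9]\{4\}\)</td>.*$|\1|p'
}
```

Usage: td_dates < file. Lines such as "Released: 08/08/2019</td>" have no opening <td> directly before the date, so they do not match.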

Substitute json value with space using sed and regex

I have multiple JSON files, which look like the sample below:
#sample json
{"urlCurrent":"https://www.website1.com/inside/377/388/408/8002.html?utm_source=source&utm_medium=Click&utm_campaign=123","id":"00001"}
{"urlCurrent":"https://127.0.0.1/inside/414/756/765/34984.html","id":"00002"}
{"urlCurrent":"https://msdn.anything.com/en-us","id":"00002"}
{"urlCurrent":"https://web.something.com/","id":"00002"}
I would like the json to become:
#result json
{"urlCurrent":"https://www.website1.com/","id":"00001"}
{"urlCurrent":"https://127.0.0.1/","id":"00002"}
{"urlCurrent":"https://msdn.anything.com/","id":"00002"}
{"urlCurrent":"https://web.something.com/","id":"00002"}
I think that with
sed -i 's/{regular expression}/\ /g' sample.json
which is to substitute anything after / with space, the result can be achieved. However, I don't know how to use regular expression to match the pattern I need. Neither do I know which keyword I should search in order to achieve this.
Is there a way to truncate the urlCurrent to become the result I need?
Thanks in advance!
12/23 Update
This works:
sed -E -i 's!(http|ftp|https)://([0-9a-zA-Z\.]+)([0-9a-zA-Z\/\.?#=_&%~+-]+)!\1://\2/!g' sample.json
sed -i -r 's/(.*:\/\/?[^\/]+\/?)[^\"]*(.*)/\1\2/' sample.json
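Another way to write this is to anchor the expression on the "urlCurrent" key, so that other fields containing URLs are left alone. The sketch below assumes the value starts with a scheme followed by ://, that the host contains neither / nor ", and that the value has no escaped quotes; the helper name truncate_urls is invented:

```shell
# Hypothetical helper: keep scheme and host of the urlCurrent value and
# replace whatever follows (up to the closing quote) with a single slash.
truncate_urls() {
  sed -E 's#("urlCurrent":"[a-zA-Z]+://[^/"]+)[^"]*#\1/#'
}
```

It reads stdin and writes stdout, so it can be tested with truncate_urls < sample.json before switching to an in-place edit.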

How to convert linefeed into literal "\n"

I'm having some trouble converting my file to a properly formatted json string.
Have been fiddling with sed for ages now, but it seems to mock me.
Am working on RHEL 6, if that matters.
I'm trying to convert this file (content; the second line of the file is empty):
Hi there...
foo=bar
tomàto=tomáto
url=http://www.stackoverflow.com
Into this json string:
{"text":"Hi there...\n\nfoo=bar\ntomàto=tomáto\nurl=http://www.stackoverflow.com"}
How would I replace the actual line feeds with the literal '\n' characters? This is where I'm utterly stuck!
I've been trying to convert line feeds into ";" first and then back to a literal "\n". Tried loops for each row in the file. Can't make it work...
Some help is much appreciated!
Thanks!
sed is for simple substitutions on individual lines, that is all. Since sed works line by line your sed script doesn't see the line endings and so you can't get it to change the line endings without jumping through hoops using arcane language constructs and convoluted logic that hasn't been useful since the mid-1970s when awk was invented.
This will change all newlines in your input file to the string \n:
$ awk -v ORS='\\n' '1' file
Hi there...\n\nfoo=bar\ntomàto=tomáto\nurl=http://www.stackoverflow.com\n
and this will do the rest:
$ awk -v ORS='\\n' 'BEGIN{printf "{\"text\":\""} 1; END{printf "\"}\n"}' file
{"text":"Hi there...\n\nfoo=bar\ntomàto=tomáto\nurl=http://www.stackoverflow.com\n"}
or this if you have a newline at the end of your input file but don't want it to become a \n string in the output:
$ awk -v ORS='\\n' '{rec = (NR>1 ? rec ORS : "") $0} END{printf "{\"text\":\"%s\"}\n", rec}' file
{"text":"Hi there...\n\nfoo=bar\ntomàto=tomáto\nurl=http://www.stackoverflow.com"}
With GNU sed:
sed ':a;N;s/\n/\\n/;ta' file | sed 's/.*/{"text":"&"}/'
Output:
{"text":"Hi there...\n\nfoo=bar\ntomàto=tomáto\nurl=http://www.stackoverflow.com"}
Use awk for this :
awk -v RS=^$ '{gsub(/\n/,"\\n");sub(/^/,"{\"text\":\"");sub(/\\n$/,"\"}")}1' file
Output
{"text":"Hi there...\n\nfoo=bar\ntomàto=tomáto\nurl=http://www.stackoverflow.com"}
awk to the rescue!
$ awk -vRS='\0' '{gsub("\n","\\n");
print "{\"text\":\"" $0 "\"}"}' file
{"text":"Hi there...\n\nfoo=bar\ntomàto=tomáto\nurl=http://www.stackoverflow.com\n"}
This might work for you (GNU sed):
sed '1h;1!H;$!d;x;s/.*/{"text":"&"}/;s/\n/\\n/g' file
Slurp the file into memory and use pattern matching to manipulate the file to the desired format.
The most simple (and elegant ?) solution :) :
#!/bin/bash
in=$(perl -pe 's/\n/\\n/' "$1")
cat<<EOF
{"text":"$in"}
EOF
Usage:
./script.sh file.txt
Output :
{"text":"Hi there...\n\nfoo=bar\ntomàto=tomáto\nurl=http://www.stackoverflow.com\n"}
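None of the commands above escape double quotes or backslashes, both of which must be escaped inside a JSON string. If the input can contain them, something along these lines may help. It is a sketch only: the helper name to_json_text is made up, and tabs or other control characters are not handled:

```shell
# Hypothetical helper: read text on stdin, print {"text":"..."} on stdout.
to_json_text() {
  # Step 1: put a backslash in front of every backslash and double quote.
  # Step 2: join the lines with a literal \n (no trailing one at the end).
  sed 's/[\\"]/\\&/g' |
    awk '{ rec = (NR > 1 ? rec "\\n" : "") $0 }
         END { printf "{\"text\":\"%s\"}\n", rec }'
}
```

Usage: to_json_text < file. For input without quotes or backslashes the result is the same as the awk version that drops the trailing \n.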

Help with sed regex: extract text from specific tag

First time sed'er, so be gentle.
I have the following text file, 'test_file':
<Tag1>not </Tag1><Tag2>working</Tag2>
I want to extract the text in between <Tag2> using sed regex, there may be other occurrences of <Tag2> and I would like to extract those also.
So far I have this sed based regex:
cat test_file | grep -i "Tag2"| sed 's/<[^>]*[>]//g'
which gives the output:
not working
Anyone any idea how to get this working?
As another poster said, sed may not be the best tool for this job. You may want to use something built for XML parsing, or even a simple scripting language, such as perl.
The problem with your try, is that you aren't analyzing the string properly.
cat test_file is good - it prints out the contents of the file to stdout.
grep -i "Tag2" is ok - it prints out only lines with "Tag2" in them. This may not be exactly what you want. Bear in mind that it will print the whole line, not just the <Tag2> part, so you will still have to search out that part later.
sed 's/<[^>]*[>]//g' isn't what you want - it simply removes the tags, including <Tag1> and <Tag2>.
You can try something like:
cat test_file | grep -i tag2 | sed 's/.*<Tag2>\(.*\)<\/Tag2>.*/\1/'
This will produce
working
but it will only work for one tag pair.
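To pick up several <Tag2> pairs on the same line, one option is to let grep -o print each pair on its own line first and strip the tags afterwards. This is a sketch that assumes the tag content contains no < and does not span lines; the helper name extract_tag2 is hypothetical:

```shell
# Hypothetical helper: print the text of every <Tag2>...</Tag2> pair,
# one per output line.
extract_tag2() {
  grep -o '<Tag2>[^<]*</Tag2>' | sed -e 's/^<Tag2>//' -e 's!</Tag2>$!!'
}
```

Usage: extract_tag2 < test_file. It is still a regex-based scrape, so the usual caveats about parsing XML this way apply.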
For your nice, friendly example, you could use
sed -e 's/^.*<Tag2>//' -e 's!</Tag2>.*!!' test_file
but the XML out there is cruel and uncaring. You're asking for serious trouble using regular expressions to scrape XML.
you can use gawk, eg
$ cat file
<Tag1>not </Tag1><Tag2>working here</Tag2>
<Tag1>not </Tag1><Tag2>
working
</Tag2>
$ awk -vRS="</Tag2>" '/<Tag2>/{gsub(/.*<Tag2>/,"");print}' file
working here
working
awk -F"Tag2" '{print $2}' test_1 | sed 's/[^a-zA-Z]//g'