How to grep specific dates from an HTML file - html

I have an HTML file that has a number of dates in the format dd/mm/yyyy spread all over it. I was looking for a way to retrieve specific dates from it.
input:
Released: 08/08/2019</td>
<td>06/26/2019</td>
Released: 03/09/2019</td>
<td>14/29/2019</td>
I found a way to retrieve all dates from the file:
grep -o "[0-9]\{2\}/[0-9]\{2\}/[0-9]\{4\}" file
output:
08/08/2019
06/26/2019
03/09/2019
14/29/2019
However, I need to filter these dates and pick only those that have this format:
<td>dd/mm/yyyy</td>
So from the above input, I need this output:
06/26/2019
14/29/2019

I always recommend using an HTML/XML parser. If this is not possible, try GNU grep and a Perl-compatible regular expression (PCRE):
grep -Po '(?<=<td>)[0-9]{2}/[0-9]{2}/[0-9]{4}(?=</td>)' file
Output:
06/26/2019
14/29/2019

This GNU awk one-liner may also do the job:
awk -F"</?td>" '/^<td>/{print $2}' file
06/26/2019
14/29/2019
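To see why this works: the field separator -F"</?td>" is a regex matching both <td> and </td>, so the date between the tags becomes field 2. A minimal check:

```shell
# '</?td>' matches both <td> and </td>, so each tag acts as a field separator;
# field 1 is the (empty) text before <td>, field 2 is the date itself
echo '<td>06/26/2019</td>' | awk -F'</?td>' '{print $2}'
```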

Related

Calling ImageMagick from awk?

I have a CSV of image details I want to loop over in a bash script. awk seems like an obvious choice to loop over the data.
For each row, I want to take the values and use them to do ImageMagick stuff. The following isn't working (obviously):
awk -F, '{ magick "source.png" "$1.jpg" }' images.csv
GNU awk excels at processing structured text data. Although it can be used to invoke commands via its system() function, it is less handy for that than some other languages; Python, for example, has a standard-library module called subprocess that is more feature-rich.
If you wish to use awk for this task anyway, then I suggest preparing output to be fed into a bash command. Say you have file.txt with the following content:
file1.jpg,file1.bmp
file2.png,file2.bmp
file3.webp,file3.bmp
and the files listed in the 1st column exist in your current working directory, you wish to convert them to the files named in the 2nd column, and the convert command is available, then you might do:
awk 'BEGIN{FS=","}{print "convert \"" $1 "\" \"" $2 "\""}' file.txt | bash
which is equivalent to starting bash and running:
convert "file1.jpg" "file1.bmp"
convert "file2.png" "file2.bmp"
convert "file3.webp" "file3.bmp"
Observe that I have used literal " to enclose the filenames, so it should work with names containing spaces. Disclaimer: it might fail if a name contains a special character, e.g. ".
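If you'd rather skip generating shell code altogether, a plain while-read loop in bash sidesteps the quoting problem for spaces entirely. A dry-run sketch (the echo prints each command instead of running it; drop the echo to actually invoke ImageMagick's convert):

```shell
# dry run: echo shows each command that would be executed;
# note that filenames containing commas would still break the IFS=, split
printf '%s\n' 'file1.jpg,file1.bmp' 'my file.png,my file.bmp' |
while IFS=, read -r src dst; do
  echo convert "$src" "$dst"
done
```

Because "$src" and "$dst" are quoted shell variables rather than text pasted into a generated script, spaces in filenames need no special handling.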

Get value of json with bash

I have json file with the following structure
{ "tool_first":"1.1.1","tool_second":"2.2.2","tool_three":"3.3.3" }
And I want to retrieve the version from it with bash grep. I created something like this:
cat myjson.json | grep -Po '"tool_second":\K"[A-Za-z0-9/._]*"'
which gives me the output:
"2.2.2"
How do I use a variable instead of the string "tool_second"? I want to have something like:
cat myjson.json | grep -Po '"$x":\K"[A-Za-z0-9/._]*"'
where $x is the variable; x = "tool_second". I can't retrieve the information using a variable. How do I escape the variable properly here? I need just the version number, without the quotes.
grep is NOT the right tool for parsing JSON text. Use a syntax-aware tool like jq instead, and use the answer below only for trivial purposes.
You are not escaping the double-quotes in the search string held in variable x:
x="\"tool_second\""
grep -Po "$x:\K\"[A-Za-z0-9/._]*\"" file
"2.2.2"
and it works for other strings too:
x="\"tool_first\""
grep -Po "$x:\K\"[A-Za-z0-9/._]*\"" file
"1.1.1"

awk change datetime format

I have a huge number of files where each line is a JSON document with an incorrect date format. The current format is 2011-06-02 21:43:59, and what I need to do is add a T in between to transform it to the ISO format 2011-06-02T21:43:59.
Can somebody please point me to a one-liner solution? I was struggling with this for 2 hours, but no luck.
sed will come to your rescue, with a simple regex:
sed 's/\([0-9]\{4\}-[0-9]\{2\}-[0-9]\{2\}\) /\1T/g' file > file.new
or, to modify the file in place:
sed -i 's/\([0-9]\{4\}-[0-9]\{2\}-[0-9]\{2\}\) /\1T/g' file
Example
echo '2011-06-02 21:43:59' | sed 's/\([0-9]\{4\}-[0-9]\{2\}-[0-9]\{2\}\) /\1T/g'
2011-06-02T21:43:59
Read more about regexes here: Regex Tag Info
The following seems to be a working solution:
sed -i -r 's/([0-9]{4}-[0-9]{2}-[0-9]{2}) ([0-9]{2}:[0-9]{2}:[0-9]{2})/\1T\2/g' myfiles
-i edits the files in place
-r switches on extended regular expressions
([0-9]{4}-[0-9]{2}-[0-9]{2}) matches the date
the literal space matches the separator between date and time in the source data
([0-9]{2}:[0-9]{2}:[0-9]{2}) matches the time
Also with awk, you can do group matching with gensub:
awk '{
print gensub(/([0-9]{4}-[0-9]{2}-[0-9]{2})\s+([0-9]{2}:[0-9]{2}:[0-9]{2})/,
"\\1T\\2",
"g");
}' data.txt
echo '2011-06-02 21:43:59' | awk 'sub(/ /,"T")'
2011-06-02T21:43:59
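For a single value in pure bash, parameter expansion does the same substitution with no external tool at all; a minimal sketch:

```shell
d='2011-06-02 21:43:59'
# ${d/ /T} replaces the first space in $d with T (bash, not POSIX sh)
echo "${d/ /T}"
```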

Convert bash output to JSON

I am running the following command:
sudo clustat | grep primary | awk 'NF{print $1",""server:"$2 ",""status:"$3}'
Results are:
service:servicename,server:servername,status:started
service:servicename,server:servername,status:started
service:servicename,server:servername,status:started
service:servicename,server:servername,status:started
service:servicename,server:servername,status:started
My desired result is:
{"service":"servicename","server":"servername","status":"started"}
{"service":"servicename","server":"servername","status":"started"}
{"service":"servicename","server":"servername","status":"started"}
{"service":"servicename","server":"servername","status":"started"}
{"service":"servicename","server":"servername","status":"started"}
I can't seem to add the quotation marks without screwing up my output.
Use jq:
sudo clustat | grep primary |
jq -R 'split(" ")|{service:.[0], server:.[1], status:.[2]}'
The input is read as raw text, not JSON. Each line is split on a space (the argument to split may need to be adjusted depending on the actual input). jq ensures that values are properly quoted when constructing the output objects.
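A quick way to check the split/construct step on a fabricated input line (assuming jq is installed; -c makes jq print compact single-line objects instead of pretty-printed ones):

```shell
# -R reads each input line as a raw string; split(" ") turns it into an
# array, and the {key: .[n]} syntax builds a properly quoted JSON object
echo 'servicename servername started' |
jq -Rc 'split(" ")|{service:.[0], server:.[1], status:.[2]}'
```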
Don't do this: instead, use chepner's answer above, which is guaranteed to generate valid JSON as output with all possible inputs (or fail with a nonzero exit status if no JSON representation is possible).
The below is only tested to generate valid JSON with the specific inputs shown in the question, and will quite certainly generate output that is not valid JSON with numerous possible inputs (strings with literal quotes, strings ending in literal backslashes, etc).
sudo clustat |
awk '/primary/ {
print "{\"service\":\"" $1 "\",\"server\":\"" $2 "\",\"status\":\""$3"\"}"
}'
For JSON conversion of common shell commands, a good option is jc (JSON Convert).
There is no parser for clustat yet, though.
clustat output does look table-like, so you may be able to use the --asciitable parser with jc.

Extract dates from a specific json format with sed

I have a json file including the sample lines of code below:
[{"tarih":"20130824","tarihView":"24-08-2013"},{"tarih":"20130817","tarihView":"17-08-2013"},{"tarih":"20130810","tarihView":"10-08-2013"},{"tarih":"20130803","tarihView":"03-08-2013"},{"tarih":"20130727","tarihView":"27-07-2013"},{"tarih":"20130720","tarihView":"20-07-2013"},{"tarih":"20130713","tarihView":"13-07-2013"},{"tarih":"20130706","tarihView":"06-07-2013"}]
I need to extract all the dates in the yyyymmdd format (the "tarih" values) into a text file with proper line endings:
20130824
20130817
20130810
20130803
...
20130706
How can I do this by using sed or similar console utility?
Many thanks for your help.
This line works for your example:
grep -Po '\d{8}' file
or with BRE:
grep -o '[0-9]\{8\}' file
it outputs:
20130824
20130817
20130810
20130803
20130727
20130720
20130713
20130706
If you want to extract only the string after "tarih":", you could use:
grep -Po '"tarih":"\K\d{8}' file
It gives the same output.
Note that regex won't do date string validation.
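If jq is available, it extracts the values with no regex at all; a sketch on a two-element sample of the question's data:

```shell
# .[] iterates the array, .tarih picks the field; -r prints raw strings
# (no quotes), one per line
echo '[{"tarih":"20130824","tarihView":"24-08-2013"},{"tarih":"20130817","tarihView":"17-08-2013"}]' |
jq -r '.[].tarih'
```

Unlike the grep approach, this keys on the field name rather than "any 8 digits", so it cannot accidentally match digits elsewhere in the file.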
This is VERY easy in python:
#!/bin/bash
python3 -c "vals=$(cat jsonfile)
for curVal in vals: print(curVal['tarih'])"
If I paste your example to jsonfile I get this output
20130824
20130817
20130810
20130803
20130727
20130720
20130713
20130706
Which is exactly what you need, right?
This works because in Python [] is a list and {} is a dictionary, so it is very easy to get any data from that structure. It also won't fail if some field in your data contains { , " or any other character that sed would probably look for, and it does not depend on the field position or the number of fields. One caveat: the file's content is executed as Python code, so only use this with input you trust.