How to fetch JSON/XML data using regex

I am new to regex and Linux, and I want to know how to fetch JSON/XML data using regex from a Linux terminal. I am a Windows user, so I am currently working on the Windows Subsystem for Linux. I know I should use jq for JSON, but the requirement is to use regex; the pattern will be the same in every report, so regex can be used even though it is not really recommended.
I want to know two things:
1. How can I test my regex in the Windows Subsystem for Linux?
2. How can I add it to the shell script? As of now I am using jq.
This is how I am using jq to fetch data in the shell script:
cat abc.json | jq -r '.url'
So how can I achieve the same thing using regex?
My abc.json is as below:
{"url":"http://www.netcharles.com/orwell/essays.htm",
"domain":"netcharles.com",
"title":"Orwell Essays & Journalism Section - Charles' George Orwell Links",
"tags":["orwell","writing","literature","journalism","essays","politics","essay","reference","language","toread"],
"index":2931,
"time_created":1345419323,
"num_saves":24}
I tried this in the Windows Subsystem for Linux:
sed -E '(url|title|tags)":"((\"|[^"])*)' abc.json
and got this error:
sed: -e expression #1, char 1: unknown command: `('
Expected output:
\\"url\\":\\"http://www.netcharles.com/orwell/essays.htm\\",\\"domain\\":\\"netcharles.com\\",\\"title\\":\\"Orwell Essays & Journalism Section - Charles\\' George Orwell Links\\",\\"tags\\":[\\"orwell\\",\\"writing\\",\\"literature\\",\\"journalism\\",\\"essays\\",\\"politics\\",\\"essay\\",\\"reference\\",\\"language\\",\\"toread\\"],\\"index\\":2931,\\"time_created\\":1345419323,\\"num_saves\\":24
Or could someone please tell me what the regex would be for accessing something like this (the value of first_name will be a string):
cat user.json | jq -r '.user_data.username.first_name'
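Not an endorsement of the approach, but a minimal sketch using GNU grep's PCRE mode (-P), which the default grep on most WSL distributions supports; \K discards everything matched before it:
grep -oP '"url":"\K[^"]*' abc.json
For the nested first_name case the same idea applies, with the big assumption that the key name appears only once in the document:
grep -oP '"first_name":"\K[^"]*' user.json
This is exactly where regex is more fragile than jq: it matches the literal text "first_name" anywhere in the file rather than the .user_data.username.first_name path.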

Related

Force mongodb to output strict JSON

I want to consume the raw output of some MongoDB commands in other programs that speak JSON. When I run commands in the mongo shell, they output Extended JSON in "shell mode", with special fields like NumberLong, Date, and Timestamp. I see references in the documentation to "strict mode", but I see no way to turn it on for the shell, or a way to run commands like db.serverStatus() in things that do output strict JSON, like mongodump. How can I force Mongo to output standards-compliant JSON?
There are several other questions on this topic, but I don't find any of their answers particularly satisfactory.
The MongoDB shell speaks JavaScript, so the answer is simple: use JSON.stringify(). If your command is db.serverStatus(), then you can simply do this:
JSON.stringify(db.serverStatus())
This won't output the proper "strict mode" representation of each of the fields ({ "floatApprox": <number> } instead of { "$numberLong": "<number>" }), but if what you care about is getting standards-compliant JSON out, this'll do the trick.
I have not found a way to do this in the mongo shell, but as a workaround, mongoexport can run queries, its output uses strict mode, and it can be piped into other commands that expect JSON input (such as json_pp or jq). For example, suppose you have the following mongo shell command to run a query, and you want to create a pipeline using that data:
db.myItemsCollection.find({creationDate: {$gte: ISODate("2016-09-29")}}).pretty()
Convert that mongo shell command into this shell command, piping (for the sake of example) to json_pp:
mongoexport --jsonArray -d myDbName -c myItemsCollection -q '{"creationDate": {"$gte": {"$date": "2016-09-29T00:00Z"}}}' | json_pp
You will need to convert the query into strict mode format, and pass the database name and collection name as arguments, as well as quote properly for your shell, as shown here.
In the case of findOne:
JSON.stringify(db.Bill.findOne({'a': '123'}))
In the case of a cursor:
db.Bill.find({'a': '123'}).forEach(r=>print(JSON.stringify(r)))
or
print('[') + db.Bill.find().limit(2).forEach(r=>print(JSON.stringify(r) + ',')) + print(']')
will output:
[{a:123},{a:234},]
Note the trailing ',' after the last item; remove it to get valid JSON.
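A hedged way to automate that cleanup in a shell pipeline, assuming the whole array fits the shape shown and mydb is a placeholder database name:
mongo mydb --quiet --eval "print('['); db.Bill.find().limit(2).forEach(r=>print(JSON.stringify(r)+',')); print(']')" | tr -d '\n' | sed 's/,]$/]/'
The tr joins the printed lines and the sed strips the trailing comma before the closing bracket.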
To build on the answer from @jbyler, you can strip out the numberLongs using sed after you get your data; that is, if you're using Linux.
mongoexport --jsonArray -d dbName -c collection -q '{fieldName: {$regex: ".*turkey.*"}}' | sed -r 's/\{ "[$]numberLong" : "([0-9]+)" }/"\1"/g' | json_pp
EDIT: This will transform a given document, but will not work on a list of documents. Changed find to findOne.
Adding
.forEach(function(results){results._id=results._id.toString();printjson(results)})
to a findOne() will output valid JSON.
Example:
db.users
  .findOne()
  .forEach(function (results) {
    results._id = results._id.toString();
    printjson(results)
  })
Source: https://www.mydbaworld.com/mongodb-shell-output-valid-json/

Output UNIX environment as JSON

I'd like a Unix one-liner that will output the current execution environment as a JSON structure like: { "env-var" : "env-value", ... etc ... }
This kinda works:
(echo "{"; printenv | sed 's/\"/\\\"/g' | sed -n 's|\(.*\)=\(.*\)|"\1"="\2"|p' | grep -v '^$' | paste -s -d"," -; echo "}")
but it has some extra lines, and I think it won't work if the environment values or variables have '=' or newlines in them.
Would prefer pure bash/sh, but compact python / perl / ruby / etc one-liners would also be appreciated.
Using jq 1.5 (e.g. jq 1.5rc2 -- see http://stedolan.github.io/jq):
$ jq -n env
This works for me:
python -c 'import json, os;print(json.dumps(dict(os.environ)))'
It's pretty simple; the main complication is that os.environ is a dict-like object, but it is not actually a dict, so you have to convert it to a dict before you feed it to the json serializer.
Adding parentheses around the print statement lets it work in both Python 2 and 3, so it should work for the foreseeable future on most *nix systems (especially since Python comes by default on any major distro).
@Alexander Trauzzi asked: "Wondering if anyone knows how to do this, but only passing a subset of the current environment's variables?"
I just found the way to do this:
jq -n 'env | {USER, HOME, PS1}'
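If the subset isn't fixed in advance, a variation on the Python one-liner above can take the variable names as arguments (USER and HOME here are only examples):
python -c 'import json, os, sys; print(json.dumps({k: os.environ[k] for k in sys.argv[1:] if k in os.environ}))' USER HOME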

Retrieve Dropbox personal path from ~/.dropbox/info.json in Bash Script

Since Dropbox version 2.8, the path to your Dropbox folder can be found in the file ~/.dropbox/info.json.
In my case, I'm seeking my personal path, not the business path; my personal Dropbox is not in the typical location ~/Dropbox but on a separate volume.
My ~/.dropbox/info.json:
{"business": {"path": "/Users/ChristopherA/ReOrient Media", "host": 123456789}, "personal": {"path": "/Volumes/Cloud/Dropbox", "host": 123456789}}
I have tried using grep/awk, but can't quite reliably get just the path /Volumes/Cloud/Dropbox, as there may be only one first-level entry (i.e. no business Dropbox), and the order might differ for other users (i.e. I can't always rely on the last path being the personal one).
Some people suggested using jsawk, but I wasn't able to figure out how to make it work, and I'd prefer no dependencies as this script will be used on multiple computers.
Ideas?
-- Christopher Allen
A solution using a json-specific tool would be much more robust.
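Using jq
If a dependency is acceptable after all, a one-line sketch with jq, keyed to the structure of the sample above:
jq -r '.personal.path' ~/.dropbox/info.json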
Using sed
Using just sed, and assuming that your json data is in a file called json, try:
$ sed -n 's/.*"personal":[^}]*"path": "\([^"]*\)",.*/\1\n/p' json
/Volumes/Cloud/Dropbox
Your sample json data was all on a single line. If that is not the case in general, then it would be better to remove the newlines before passing it to sed:
$ tr '\n' ' ' <json | sed -n 's/.*"personal":[^}]*"path": "\([^"]*\)",.*/\1\n/p'
/Volumes/Cloud/Dropbox
Using awk
$ awk -F'"' -v RS='"personal"[^}]*path":' 'NR==2 {print $2}' json
/Volumes/Cloud/Dropbox
The above uses a regular expression for the record separator. GNU awk supports this. Others may or may not.
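If portability is a concern, a hedged alternative that avoids the GNU-only record separator, assuming your grep supports -o (both GNU and BSD versions do):
grep -o '"personal": *{[^}]*}' json | sed -n 's/.*"path": "\([^"]*\)".*/\1/p'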
Mac OS X Version
From Christopher Allen, the following works on a Mac:
tr '\n' ' ' <json | sed -nE 's/.*"personal":[^}]*"path": "([^"]*)",.*/\1/p'
Using bash
#!/bin/bash
data=$(cat json)
data=${data#*\"personal\":}   # drop everything up to and including "personal":
data=${data#*path\":}         # then up to and including path":
data=${data#*\"}              # then up to the value's opening quote
data=${data%%\"*}             # cut at the value's closing quote
echo "$data"

How can I extract td from html in bash?

I am querying London postcode data from geonames:
http://www.geonames.org/postalcode-search.html?q=london&country=GB
I want to turn the output into a list of just the postcode identifiers (Bethnal Green, Islington, etc.). What is the best way to extract just the names in bash?
I'm not sure if you mean this newline-delimited list (or one in brackets and comma-delimited):
html='http://www.geonames.org/postalcode-search.html?q=london&country=GB'
wget -q "$html" -O - |
w3m -dump -T 'text/html'|
sed -nr 's/^ +[0-9]+ +(.*) +[A-Z]+[0-9]+ +United Kingdom.*/\1/p'
w3m is a: "WWW browsable pager with excellent tables/frames support"
Output (first 10 lines):
London Bridge
Kilburn
Ealing
Wandsworth
Pimlico
Kensington
Leyton
Leytonstone
Plaistow
Poplar
I see the site offers (but not for free) web services with XML or JSON data... That would be the best way, since the HTML page is not meant to be parsed (easily).
Anyway, nothing is impossible, though using strictly bash commands alone would be very hard, if not impossible; often several other common tools are piped together to achieve the result. And sometimes it turns out to be more convenient to stick to a single tool like e.g. Perl, instead of combining cat, grep, awk, sed and whatever else.
Something like
sed -e 's/>/>\n/g' region.html |
egrep -i "^\s*[A-Z]+[0-9]+</td>" |
sed -e 's|</td>||g'
worked, extracting 200 lines, assuming a specific format for the postal code.
ADD
If there's no limit to the software you can use to parse the data, then you could use a line like
wget -q "http://www.geonames.org/postalcode-search.html?q=london&country=GB" -O - |
sgrep '"<table class=\"restable\"" .. "</table>"' |
sed -e 's|/tr>|/tr>\n|g; s|</td>\s*<td[^>]*>|;|g; s|</th>\s*<th[^>]*>|;|g; s|<[^>]\+>||g; s|;; .*$| |g' |
grep -v "^\s*$" |
tail -n+2 | cut -d";" -f2,3
which extracts places and postal codes separated by a ';' as in a CSV. Alternatively, with awk:
wget -q "$html" -O - |
w3m -dump -T 'text/html' |
awk '/\s*[0-9]+ / { print substr($0, 11, 16); }'
which is based on the answer by Peter.O and extracts the same data... and so on. But in these cases, since you are not limited to the minimal tools found on most Unix or GNU systems, I would stick to one single widespread tool, e.g. Perl.
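For instance, the sed extraction from the first answer could be condensed into a single perl filter; a sketch, assuming the same w3m dump layout:
wget -q "$html" -O - |
w3m -dump -T 'text/html' |
perl -ne 'print "$1\n" if /^\s*\d+\s+(.*)\s+[A-Z]+\d+\s+United Kingdom/'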
If you have access to the mojo tool from the Mojolicious project this all becomes quite a lot easier:
mojo get 'http://www.geonames.org/postalcode-search.html?q=london&country=GB' '.restable > tr > td:nth-child(2)' text | grep ^'[a-zA-Z]'
The grep at the end is just to filter out some junk results; almost (but not quite) every other line is bad, because the page structure is slightly inconsistent. Otherwise you could say tr:nth-child(even) and get nice results.

Parse ClamAV logs in Bash script using Regex to insert in MySQL

Morning/Evening all,
I've got a problem: I'm making a script for work that uses ClamAV to scan for malware and then places its results in MySQL, using grep and awk on the resulting ClamAV logs to convert the right parts of the log into variables. I've done the summary OK, but the syntax of detections makes it slightly more difficult. I'm no expert at regex by any means, and this is a bit of a learning experience, so there is probably a far better way of doing it than mine!
The lines I'm trying to parse look like these:
/net/nas/vol0/home/recep/SG4rt.exe: Worm.SomeFool.P FOUND
/net/nas/vol0/home/recep/SG4rt.exe: moved to '/srv/clamav/quarantine/SG4rt.exe'
As far as I was able to establish, I need a positive lookbehind to match what comes after and before the colon, without actually matching the colon or the space after it, and I can't see a clear way of doing it in RegExr without it thinking I'm trying to look for two colons. To make matters worse, we sometimes get these too:
WARNING: Can't open file /net/nas/vol0/home/laser/samples/sample1.avi: Permission denied
The end result is that I can build a MySQL query that inserts the path, the malware found, and where it was moved to; or, if there was an error, the path and the error encountered, converting each element to a variable in a while loop.
I've done the scan summary as follows. The summary looks like:
----------- SCAN SUMMARY -----------
Known viruses: 329
Engine version: 0.97.1
Scanned directories: 17350
Scanned files: 50342
Infected files: 3
Total errors: 1
Data scanned: 15551.73 MB
Data read: 16382.67 MB (ratio 0.95:1)
Time: 3765.236 sec (62 m 45 s)
Parsing like this:
SCANNED_DIRS=$(cat /srv/clamav/$IY-scan-$LOGTIME.log | grep "Scanned directories" | awk '{gsub("Scanned directories: ", "");print}')
SCANNED_FILES=$(cat /srv/clamav/$IY-scan-$LOGTIME.log | grep "Scanned files" | awk '{gsub("Scanned files: ", "");print}')
INFECTED=$(cat /srv/clamav/$IY-scan-$LOGTIME.log | grep "Infected files" | awk '{gsub("Infected files: ", "");print}')
DATA_SCANNED=$(cat /srv/clamav/$IY-scan-$LOGTIME.log | grep "Data scanned" | awk '{gsub("Data scanned: ", "");print}')
DATA_READ=$(cat /srv/clamav/$IY-scan-$LOGTIME.log | grep "Data read" | awk '{gsub("Data read: ", "");print}')
TIME_TAKEN=$(cat /srv/clamav/$IY-scan-$LOGTIME.log | grep "Time" | awk '{gsub("Time: ", "");print}')
END_TIME=$(date +%s)
mysql -u scanner_parser --password=removed sc_live -e "INSERT INTO bs.live.bs_jobstat VALUES (NULL, '$CURRTIME', '$PID', '$IY', '$SCANNED_DIRS', '$SCANNED_FILES', '$INFECTED', '$DATA_SCANNED', '$DATA_READ', '$TIME_TAKEN', '$END_TIME');"
rm -f /srv/clamav/$IY-scan-$LOGTIME.log
Some of those variables are from other parts of the script and can be ignored. The reason I'm doing this is to save logfile clutter and have a simple web based overview of the status of the system.
Any clues? Am I going about all this the wrong way? Thanks for help in advance, I do appreciate it!
From what I can determine from the question, it seems like you are asking how to distinguish the lines you want from the logger lines that start with WARNING, ERROR, INFO.
You can do this without getting too fancy with lookahead or lookbehind. Just grep for lines beginning with
"/net/nas/vol0/home/recep/SG4rt.exe: "
then using awk you can extract the remainder of the line. Or you can gsub the prefix out like you are doing in the summary processing section.
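A minimal sketch of that approach; scan.log stands in for your real /srv/clamav/$IY-scan-$LOGTIME.log path:
grep '^/net/nas/vol0/home/recep/SG4rt.exe: ' scan.log | awk -F': ' '{print $2}'
For the detection line this prints "Worm.SomeFool.P FOUND", and for the quarantine line it prints "moved to '/srv/clamav/quarantine/SG4rt.exe'".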
As far as the question about processing the summary goes, what strikes me most is that you are processing the entire file multiple times, each time pulling out one kind of line. For tasks like this, I would use Perl, Ruby, or Python and make one pass through the file, collecting the pieces of each line after the colon, storing them in regular programming language variables (not env variables), and forming the MySQL insert string using interpolation.
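Even without leaving the shell, the one-pass idea can be sketched in awk, collecting each value in an awk variable instead of re-reading the log per field (a sketch only; the same is cleaner still in Perl or Python):
awk -F': ' '
  /^Scanned directories/ {dirs=$2}
  /^Scanned files/       {files=$2}
  /^Infected files/      {infected=$2}
  END {printf "dirs=%s files=%s infected=%s\n", dirs, files, infected}
' "/srv/clamav/$IY-scan-$LOGTIME.log"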
Bash is great for some things but IMHO you are justified in using a more general scripting language (Perl, Python, Ruby come to mind).