Retrieve Dropbox personal path from ~/.dropbox/info.json in a Bash script

Since Dropbox version 2.8, the path to your Dropbox folder can be found in the file ~/.dropbox/info.json.
In my case, I'm after my personal path, not the business path; my personal Dropbox is not in the typical location ~/Dropbox but on a separate volume.
My ~/.dropbox/info.json:
{"business": {"path": "/Users/ChristopherA/ReOrient Media", "host": 123456789}, "personal": {"path": "/Volumes/Cloud/Dropbox", "host": 123456789}}
I have tried using grep/awk, but can't quite reliably get just the path /Volumes/Cloud/Dropbox, as there may be only one first-level entry (i.e. no business Dropbox), and the order may differ for other users (i.e. I can't rely on the personal entry always appearing last).
Some people suggested using jsawk, but I wasn't able to figure out how to make it work, and I'd prefer no dependencies as this script will be used on multiple computers.
Ideas?
-- Christopher Allen

A solution using a json-specific tool would be much more robust.
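For comparison, if installing a dependency were acceptable, jq would reduce this to a one-liner (shown only for reference, since the question asks for no dependencies); with the sample info.json above:
$ jq -r '.personal.path' ~/.dropbox/info.json
/Volumes/Cloud/Dropbox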
Using sed
Using just sed, and assuming that your json data is in a file called json, try:
$ sed -n 's/.*"personal":[^}]*"path": "\([^"]*\)",.*/\1\n/p' json
/Volumes/Cloud/Dropbox
Your sample json data was all on a single line. If that is not the case in general, then it would be better to remove the newlines before passing it to sed:
$ tr '\n' ' ' <json | sed -n 's/.*"personal":[^}]*"path": "\([^"]*\)",.*/\1\n/p'
/Volumes/Cloud/Dropbox
Using awk
$ awk -F'"' -v RS='"personal"[^}]*path":' 'NR==2 {print $2}' json
/Volumes/Cloud/Dropbox
The above uses a regular expression for the record separator. GNU awk supports this. Others may or may not.
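If gawk isn't available, a sketch that walks the quote-delimited fields should work with any POSIX awk, assuming the JSON sits on a single line as in the sample:
$ awk -F'"' '{for (i=1; i<=NF; i++) if ($i == "personal") {while (i <= NF && $i != "path") i++; print $(i+2); exit}}' json
/Volumes/Cloud/Dropbox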
Mac OSX Version
From Christopher Allen, the following works on a Mac:
tr '\n' ' ' <json | sed -n 's/.*"personal":[^}]*"path": "\([^"]*\)",.*/\1/p'
Using bash
#!/bin/bash
data=$(cat json)                 # read the whole file into a variable
data=${data#*\"personal\":}      # drop everything up to and including "personal":
data=${data#*path\":}            # drop everything up to and including path":
data=${data#*\"}                 # drop up to the opening quote of the value
data=${data%%\"*}                # drop the closing quote and everything after it
echo "$data"

Related

How to fetch json/xml data using REGEX

I am new to regex and Linux. I want to know how to fetch JSON/XML data using regex from a Linux terminal. I am a Windows user, so I am currently working in the Windows Subsystem for Linux. I know I should use jq for JSON, but the requirement is to use regex; the pattern will be the same in every report, so regex can be used even though it is not really recommended.
I want to know two things:
How can I test my regex in the Windows Subsystem for Linux?
How can I add it to the shell script? As of now I am using jq.
This is how I am using jq to fetch data in the shell script:
cat abc.json | jq -r '.url'
So how can I achieve the same thing using regex?
My abc.json is as below:
{"url":"http://www.netcharles.com/orwell/essays.htm",
"domain":"netcharles.com",
"title":"Orwell Essays & Journalism Section - Charles' George Orwell Links",
"tags":["orwell","writing","literature","journalism","essays","politics","essay","reference","language","toread"],
"index":2931,
"time_created":1345419323,
"num_saves":24}
I tried this in the Windows Subsystem for Linux:
sed -E '(url|title|tags)":"((\"|[^"])*)' abc.json
and got this error:
sed: -e expression #1, char 1: unknown command: `('
Expected output
\\"url\\":\\"http://www.netcharles.com/orwell/essays.htm\\",\\"domain\\":\\"netcharles.com\\",\\"title\\":\\"Orwell Essays & Journalism Section - Charles\\' George Orwell Links\\",\\"tags\\":[\\"orwell\\",\\"writing\\",\\"literature\\",\\"journalism\\",\\"essays\\",\\"politics\\",\\"essay\\",\\"reference\\",\\"language\\",\\"toread\\"],\\"index\\":2931,\\"time_created\\":1345419323,\\"num_saves\\":24
Or could someone please tell me what the regex would be for accessing something like this? P.S. The value of first_name will be a string.
cat user.json | jq -r '.user_data.username.first_name'
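For the flat abc.json above, a minimal sketch of the jq -r '.url' equivalent (assuming the value contains no escaped quotes) would be:
grep -o '"url":"[^"]*"' abc.json | sed 's/"url":"\(.*\)"/\1/'
or, with GNU grep's PCRE mode:
grep -oP '"url":"\K[^"]*' abc.json
The earlier sed attempt fails because sed expects a command such as s/// or p, not a bare regex. For a nested path like .user_data.username.first_name, a plain regex cannot follow the nesting reliably; at best you can match on the innermost key, e.g. grep -oP '"first_name":\s*"\K[^"]*' user.json, which only works if that key name appears once in the file.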

Parsing JSON with BusyBox tools

I'm working on a blog theme for Hugo installable on Android (BusyBox via Termux) and plan to create a BusyBox Docker image and copy my theme and the hugo binary to it for use on ARM.
Theme releases are archived and made available on NPM, and the tools available in BusyBox have allowed me to reliably parse the version from the JSON metadata:
meta=$(wget -qO - https://registry.npmjs.org/package/latest)
vers=$(echo "$meta" | egrep -o "\"version\".*[^,]*," | cut -d ',' -f1 | cut -d ':' -f2 | tr -d '" ')
Now I would like to copy the dist value from the meta into a text file for use in Hugo:
"dist": {
"integrity": "sha512-3MH2/UKYPjr+CTC85hWGg/N3GZmSlgBWXzdXHroDfJRnEmcBKkvt1oiadN8gzCCppqCQhwtmengZzg0imm1mtg==",
"shasum": "a159699b1c5fb006a84457fcdf0eb98d72c2eb75",
"tarball": "https://registry.npmjs.org/after-dark/-/after-dark-6.4.1.tgz",
"fileCount": 98,
"unpackedSize": 5338189
},
Above pretty-printed for clarity. The actual metadata is compressed.
Is there a way I can reuse the version parsing logic above to also pull the dist field value?
Proper robust parsing requires tools like jq, where it could be as simple as jq '.version' ip.txt and jq '.dist' ip.txt.
You could use sed, but use it at your own risk:
$ sed -n 's/.*"version":"\([^"]*\).*/\1/p' ip.txt
6.4.1
$ sed -n 's/.*\("dist":{[^}]*}\).*/\1/p' ip.txt
"dist":{"integrity":....
....}
The -n option disables automatic printing.
The p flag on the s command prints only when the substitution succeeds, so the output is empty (rather than the entire input line) when something goes wrong.
.*"version":"\([^"]*\).* matches the entire line, capturing the data between the double quotes after the version key; you'll have to adjust the regex if whitespace or other valid JSON formatting is allowed.
.*\("dist":{[^}]*}\).* matches the entire line, capturing the data starting at "dist":{ and ending at the first } after it, so it is not suited to cases where the value itself can contain }.

Output UNIX environment as JSON

I'd like a unix one-liner that will output the current execution environment as a JSON structure like: { "env-var" : "env-value", ... etc ... }
This kinda works:
(echo "{"; printenv | sed 's/\"/\\\"/g' | sed -n 's|\(.*\)=\(.*\)|"\1"="\2"|p' | grep -v '^$' | paste -s -d"," -; echo "}")
but has some extra lines and I think won't work if the environment values or variables have '=' or newlines in them.
Would prefer pure bash/sh, but compact python / perl / ruby / etc one-liners would also be appreciated.
Using jq 1.5 (e.g. jq 1.5rc2 -- see http://stedolan.github.io/jq):
$ jq -n env
This works for me:
python -c 'import json, os;print(json.dumps(dict(os.environ)))'
It's pretty simple; the main complication is that os.environ is a dict-like object, but it is not actually a dict, so you have to convert it to a dict before you feed it to the json serializer.
Adding parentheses around the print statement lets it work in both Python 2 and 3, so it should work for the foreseeable future on most *nix systems (especially since Python comes by default on any major distro).
@Alexander Trauzzi asked: "Wondering if anyone knows how to do this, but only passing a subset of the current environment's variables?"
I just found the way to do this:
jq -n 'env | {USER, HOME, PS1}'
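If jq 1.5 isn't available, the Python one-liner above can be restricted to a subset in the same spirit (a sketch; the variable names are only examples):
python -c 'import json, os; print(json.dumps({k: os.environ[k] for k in ("USER", "HOME", "PS1") if k in os.environ}))'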

How can I extract td from html in bash?

I am querying London postcode data from geonames:
http://www.geonames.org/postalcode-search.html?q=london&country=GB
I want to turn the output into a list of just the postcode identifiers (Bethnal Green, Islington, etc.). What is the best way to extract just the names in bash?
I'm not sure if you mean this newline-delimited list (or one in brackets and comma-delimited):
html='http://www.geonames.org/postalcode-search.html?q=london&country=GB'
wget -q "$html" -O - |
w3m -dump -T 'text/html'|
sed -nr 's/^ +[0-9]+ +(.*) +[A-Z]+[0-9]+ +United Kingdom.*/\1/p'
w3m is a: "WWW browsable pager with excellent tables/frames support"
output (first 10 lines)
London Bridge
Kilburn
Ealing
Wandsworth
Pimlico
Kensington
Leyton
Leytonstone
Plaistow
Poplar
I see the site offers (but not for free) web services with XML or JSON data... It would be the best way, since the HTML page is not meant to be parsed (easily).
Anyway, nothing is impossible; nonetheless, doing it strictly with bash commands alone would be very hard, if not impossible. Often several other common tools are piped together to achieve the result. But then, sometimes it turns out to be more convenient to stick to a single tool like e.g. Perl, instead of combining cat, grep, awk, sed and whatever else.
Something like
sed -e 's/>/>\n/g' region.html |
egrep -i "^\s*[A-Z]+[0-9]+</td>" |
sed -e 's|</td>||g'
worked, extracting 200 lines, assuming a specific format for the postcode.
ADD
If there's no limit to the software you can use to parse the data, then you could use a line like
wget -q "http://www.geonames.org/postalcode-search.html?q=london&country=GB" -O - |
sgrep '"<table class=\"restable\"" .. "</table>"' |
sed -e 's|/tr>|/tr>\n|g; s|</td>\s*<td[^>]*>|;|g; s|</th>\s*<th[^>]*>|;|g; s|<[^>]\+>||g; s|;; .*$| |g' |
grep -v "^\s*$" |
tail -n+2 | cut -d";" -f2,3
which extracts places and postal codes separated by a ";", like in a CSV. The same can also be done with awk:
wget -q "$html" -O - |
w3m -dump -T 'text/html' |
awk '/\s*[0-9]+ / { print substr($0, 11, 16); }'
which is based on the answer by Peter.O and extracts the same data... and so on. But in these cases, since you are not limited to the minimal tools found on most Unix or GNU systems, I would stick to one single widespread tool, e.g. perl.
If you have access to the mojo tool from the Mojolicious project this all becomes quite a lot easier:
mojo get 'http://www.geonames.org/postalcode-search.html?q=london&country=GB' '.restable > tr > td:nth-child(2)' text | grep ^'[a-zA-Z]'
The grep at the end is just to filter out some junk results; almost (but not quite) every other line is bad, because the page structure is slightly inconsistent. Otherwise you could say tr:nth-child(even) and get nice results.
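Spelled out, that variant would be (a sketch based on the note above; whether the even or odd rows carry the data depends on the page structure):
mojo get 'http://www.geonames.org/postalcode-search.html?q=london&country=GB' '.restable > tr:nth-child(even) > td:nth-child(2)' text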

Easiest way to extract the urls from an html page using sed or awk only

I want to extract the URL from within the anchor tags of an html file.
This needs to be done in BASH using SED/AWK. No perl please.
What is the easiest way to do this?
You could also do something like this (provided you have lynx installed)...
Lynx versions < 2.8.8
lynx -dump -listonly my.html
Lynx versions >= 2.8.8 (courtesy of @condit)
lynx -dump -hiddenlinks=listonly my.html
You asked for it:
$ wget -O - http://stackoverflow.com | \
grep -io '<a href=['"'"'"][^"'"'"']*['"'"'"]' | \
sed -e 's/^<a href=["'"'"']//i' -e 's/["'"'"']$//i'
This is a crude tool, so all the usual warnings about attempting to parse HTML with regular expressions apply.
grep "<a href=" sourcepage.html
|sed "s/<a href/\\n<a href/g"
|sed 's/\"/\"><\/a>\n/2'
|grep href
|sort |uniq
The first grep looks for lines containing URLs. You can add more elements after it if you want to look only at local pages, i.e. no http, only relative paths.
The first sed adds a newline in front of each <a href> URL tag.
The second sed shortens each URL after the second " in the line, replacing it with the </a> tag followed by a newline.
Both seds give you each URL on its own line, but there is garbage, so the second grep href cleans the mess up.
The sort and uniq give you one instance of each distinct URL present in sourcepage.html.
With the Xidel - HTML/XML data extraction tool, this can be done via:
$ xidel --extract "//a/@href" http://example.com/
With conversion to absolute URLs:
$ xidel --extract "//a/resolve-uri(@href, base-uri())" http://example.com/
I made a few changes to Greg Bacon's solution:
cat index.html | grep -o '<a .*href=.*>' | sed -e 's/<a /\n<a /g' | sed -e 's/<a .*href=['"'"'"]//' -e 's/["'"'"'].*$//' -e '/^$/ d'
This fixes two problems:
We are matching cases where the anchor doesn't start with href as first attribute
We are covering the possibility of having several anchors in the same line
An example, since you didn't provide any sample
awk 'BEGIN{
  RS="</a>"
  IGNORECASE=1
}
{
  for(o=1;o<=NF;o++){
    if ( $o ~ /href/){
      gsub(/.*href=\042/,"",$o)
      gsub(/\042.*/,"",$o)
      print $(o)
    }
  }
}' index.html
You can do it quite easily with the following regex, which is quite good at finding URLs:
\b(([\w-]+://?|www[.])[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/)))
I took it from John Gruber's article on how to find URLs in text.
That lets you find all URLs in a file f.html as follows:
cat f.html | grep -o \
-E '\b(([\w-]+://?|www[.])[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/)))'
I am assuming you want to extract a URL from some HTML text, and not parse HTML (as one of the comments suggests). Believe it or not, someone has already done this.
OT: The sed website has a lot of good information and many interesting/crazy sed scripts. You can even play Sokoban in sed!
This is my first post, so I'll do my best to explain why I'm posting this answer...
Of the 7 most-voted answers, 4 include grep even though the post explicitly says "using sed or awk only".
Even though the post requires "no Perl please", some of them, per the previous point, use Perl regex inside grep.
And because this is the simplest way (as far as I know, and as was required) to do it in bash.
So here comes the simplest script, using GNU grep 2.28:
grep -Po 'href="\K.*?(?=")'
About the \K switch: no info was found in the man and info pages, so I came here for the answer...
The \K switch discards the previously matched characters (the key itself included).
Bear in mind the advice from the man pages:
"This is highly experimental and grep -P may warn of unimplemented features."
Of course, you can modify the script to meet your tastes or needs, but I found it pretty straightforward for what was requested in the post, and also for many of us...
I hope you folks find it very useful.
thanks!!!
In bash, the following should work. Note that it doesn't use sed or awk, but uses tr and grep, both very standard and not perl ;-)
$ cat source_file.html | tr '"' '\n' | tr "'" '\n' | grep -e '^https://' -e '^http://' -e'^//' | sort | uniq
for example:
$ curl "https://www.cnn.com" | tr '"' '\n' | tr "'" '\n' | grep -e '^https://' -e '^http://' -e'^//' | sort | uniq
generates
//s3.amazonaws.com/cnn-sponsored-content
//twitter.com/cnn
https://us.cnn.com
https://www.cnn.com
https://www.cnn.com/2018/10/27/us/new-york-hudson-river-bodies-identified/index.html\
https://www.cnn.com/2018/11/01/tech/google-employee-walkout-andy-rubin/index.html\
https://www.cnn.com/election/2016/results/exit-polls\
https://www.cnn.com/profiles/frederik-pleitgen\
https://www.facebook.com/cnn
etc...
Expanding on kerkael's answer:
grep "<a href=" sourcepage.html
|sed "s/<a href/\\n<a href/g"
|sed 's/\"/\"><\/a>\n/2'
|grep href
|sort |uniq
# now adding some more
|grep -v "<a href=\"#"
|grep -v "<a href=\"../"
|grep -v "<a href=\"http"
The first grep I added removes links to local bookmarks.
The second removes relative links to upper levels.
The third removes links that don't start with http.
Pick and choose which one of these you use as per your specific requirements.
Go over it with a first pass, replacing the start of the URLs (http) with a newline (\nhttp). Then you have guaranteed that each link starts at the beginning of the line and is the only URL on it. The rest should be easy; here is an example:
sed "s/http/\nhttp/g" <(curl "http://www.cnn.com") | sed -n "s/\(^http[s]*:[a-Z0-9/.=?_-]*\)\(.*\)/\1/p"
alias lsurls='_(){ sed "s/http/\nhttp/g" "${1}" | sed -n "s/\(^http[s]*:[a-Z0-9/.=?_-]*\)\(.*\)/\1/p"; }; _'
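A usage sketch (the alias takes a filename, so a saved page or process substitution both work; page.html is an assumed local file):
lsurls page.html
lsurls <(curl -s "http://www.cnn.com")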
You can try:
curl --silent -u "<username>:<password>" http://<NAGIOS_HOST/nagios/cgi-bin/status.cgi|grep 'extinfo.cgi?type=1&host='|grep "status"|awk -F'</A>' '{print $1}'|awk -F"'>" '{print $3"\t"$1}'|sed 's/<\/a> <\/td>//g'| column -c2 -t|awk '{print $1}'
That's how I tried it for a better view: create a shell file and give the link as a parameter; it will create a temp2.txt file.
a=$1                                      # URL passed as the first argument
lynx -listonly -dump "$a" > temp          # dump the page's link list to a temp file
awk 'FNR > 2 {print $2}' temp > temp2.txt # skip the header lines, keep only the URLs
rm temp
>sh test.sh http://link.com
Eschewing the awk/sed requirement:
urlextract is made just for such a task (documentation).
urlview is an interactive CLI solution (github repo).
I scrape websites using Bash exclusively to verify the http status of client links and report back to them on errors found. I've found awk and sed to be the fastest and easiest to understand. Props to the OP.
curl -Lk https://example.com/ | sed -r 's~(href="|src=")([^"]+).*~\n\1\2~g' | awk '/^(href|src)/,//'
Because sed works on a single line, this will ensure that all URLs are formatted properly on a new line, including any relative URLs. The first sed finds all href and src attributes and puts each on a new line while simultaneously removing the rest of the line, including the closing double quote (") at the end of the link.
Notice I'm using a tilde (~) in sed as the defining separator for substitution. This is preferred over a forward slash (/). The forward slash can confuse the sed substitution when working with html.
The awk finds any line that begins with href or src and outputs it.
Once the content is properly formatted, awk or sed can be used to collect any subset of these links. For example, you may not want base64 images, instead you want all the other images. Our new code would look like:
curl -Lk https://example.com/ | sed -r 's~(href="|src=")([^"]+).*~\n\1\2~g' | awk '/^(href|src)/,//' | awk '/^src="[^d]/,//'
Once the subset is extracted, just remove the href=" or src=" prefix:
sed -r 's~(href="|src=")~~g'
This method is extremely fast and I use these in Bash functions to format the results across thousands of scraped pages for clients that want someone to review their entire site in one scrape.