parsing FireFox bookmarks using regular expression - json

I tried to parse a Firefox bookmarks file (the exported JSON version) with these attempts:
cat boo.json | grep '\"uri\"\:\"^http\://[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}\"'
cat boo.json | grep '"uri"\:"^http\://[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}'
cat boo.json | grep '"uri"\:"^http\://[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}"'
And a few others, but they all fail. The JSON bookmarks file looks like this:
.........."uri":"http://www.google.com/?"......"uri":"http://stackoverflow.com/"
So, the output should be like this:
"uri":"http://www.google.com/?"
"uri":"http://stackoverflow.com/"
What is missing from my regular expression?
UPDATE:
URLs in the bookmarks file end with one of these special characters:
/, ex: "uri":"http://stackoverflow.com/"
", ex: "uri":"http://stackoverflow.com/questions/13148794/parsing-firefox-bookmarks-using-regular-expression"
}, ex: "uri":"https://fr.add-ons.mozilla.com/fr/firefox/bookmarks/"}
With this modified regular expression:
$ egrep -o "(http|https)://([^ ]*).(*\/)" boo.json
Result:
http://fr.fxfeeds.mozilla.com/fr/firefox/headlines.xml"},{"name":"livemark/siteURI","flags":0,"expires":4,"mimeType":null,"type":3,"value":"http://www.lemonde.fr/"}],"type":"text/x-moz-place-container","children":[]}]},{"index":2,"title":"Tags","id":4,"parent":1,"dateAdded":1344432674984000,"lastModified":1344432674984000,"type":"text/
http://stackoverflow.com/questions/13148794/parsing-firefox-bookmarks-using-regular-expression","charset":"UTF-8"},{"index":29,"title":"adrusi/
http://stackoverflow.com/
...
But this still doesn't give me only the URLs.

Have you tried JSON.sh? It works great!
https://github.com/dominictarr/JSON.sh

I use this regex to extract URLs; it works great:
cat *.html | grep -Eo "(http|https)://[a-zA-Z0-9./?=_-]*" | sort | uniq

Jeff Atwood posted an article, "The Problem with URLs". With his proposed regular expression, I managed to extract all the URLs from my Firefox bookmarks:
egrep -o "\(?\bhttp://[-A-Za-z0-9+&@#/%?=~_()|!:,.;]*[-A-Za-z0-9+&@#/%=~_()|]" my-bookmark.json
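As for what the original attempts were missing: the `^` inside the pattern anchors to the start of the line, so it can never match in the middle of the JSON; `+` and `{2,3}` need `grep -E` (or backslash escapes in basic syntax); and without `-o` grep prints the whole matching line, which here is the entire file. A minimal sketch of a corrected command follows. The function name `extract_uris` is made up for illustration, and it assumes the URI values contain no escaped double quotes.

```shell
# Print every "uri":"..." pair, one per line.
# Assumption: URI values contain no escaped double quotes.
extract_uris() {
    # -E: extended regex, -o: print only the matched part, -h: no file name prefix
    grep -Eoh '"uri":"[^"]*"' "$@"
}

# Demo on a line shaped like the sample from the question.
printf '%s\n' '..."uri":"http://www.google.com/?"......"uri":"http://stackoverflow.com/"' |
    extract_uris
```

With a real export you would call it as `extract_uris boo.json`.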

Related

How do I get the latest tag value from the github API for a given repository

I can get the latest commit from the GitHub api using :
$ curl 'https://api.github.com/repos/dwkns/test/commits?per_page=1'
However the resulting JSON doesn't contain any reference to the tag I created when I did that commit.
I can get a list of tags using :
$ curl 'https://api.github.com/repos/dwkns/test/tags'
However the resulting JSON, while it contains the names of tags I want, is not in the order in which they were created - there is no way of telling which tag is the latest one.
EDIT : The latest tag created was LatestLatestLatest
My question then is what API call(s) do I need to do to get the name of the latest tag in my repository?
Semantic Versioning Example
NOTE: If you're in a hurry and don't need all the fine details explained, just jump down to "The Solution" and execute the command.
This solution uses curl and grep to match the LATEST semantically versioned release number. An example will be demonstrated using my own Github repo "pi-ap" (a pile of bash scripts which automates config of a Raspberry Pi into a wireless AP).
You can test the example I give you on the CLI and after you're satisfied it works as intended, you can tweak it to your own use-case.
Versioning Format Construction:
Since we're using grep to match the version number, I need to explain its construction: 3 pairs of integers separated by 2 dots and prefaced by a "v":
vXX.XX.XX
^ ^ ^
| | |
| | Patch
| Minor
Major
NOTE: If a field only has a single digit, I'll pad it with a zero to ensure the resulting format is predictable: always 3 pairs of integers separated by 2 dots.
The Solution:
Github Username: F1Linux
Github Repo Name: pi-ap (NOTE: exclude the ".git" suffix)
curl -s 'https://github.com/f1linux/pi-ap/tags/' | grep -Eo 'v[0-9]{1,2}\.[0-9]{1,2}\.[0-9]{1,2}' | sort -r | head -n1
Validate the Result:
In your browser, go to:
https://github.com/f1linux/pi-ap/tags
And validate that the latest tag was returned from the command.
The above is fairly extensible for most use-cases. Just need to change the user & repo names and remove/replace the "v" if you don't use this convention in tagging your repos.
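The zero-padding matters because a plain `sort -r` compares text, so an unpadded v1.10.0 would sort below v1.9.3. If your tags are not padded, a version-aware sort is one way around that. This is only a sketch: `latest_version` is a made-up name, it assumes GNU coreutils (`sort -V`), and it reads whatever text contains the tag names (page HTML or API output) on stdin.

```shell
# Pick the highest vMAJOR.MINOR.PATCH string found on stdin.
# Assumption: GNU sort, whose -V option orders version numbers numerically.
latest_version() {
    grep -Eo 'v[0-9]+\.[0-9]+\.[0-9]+' |  # pull out every version-looking token
        sort -uV |                        # unique, in ascending version order
        tail -n1                          # the last one is the highest
}

# Demo with unpadded tags; prints v1.10.0
printf '%s\n' 'tags: v1.2.0 v1.10.0' 'also v1.9.3' | latest_version
```

For a real repository you would pipe the curl output into it, e.g. `curl -s 'https://github.com/f1linux/pi-ap/tags/' | latest_version`.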
Using jq in combination with curl you can have a pretty straightforward command:
curl -s \
-H "Accept: application/vnd.github.v3+json" \
https://api.github.com/repos/dwkns/test/tags \
| jq -r '.[0].name'
Output (as of today):
v56
Explanation on jq command:
-r is for "raw", avoid json quotes on jq's output
.[0] selects the first (latest) tag object in json array we got from github
.name selects the name property in this latest json object
#!/bin/sh
curl -s https://github.com/dwkns/test/tags |
awk '/tag-name/{print $3;exit}' FS='[<>]'
Or
#!/bin/awk -f
BEGIN {
FS = "[<>]"
while ("curl -s https://github.com/dwkns/test/tags" | getline) {
if(/tag-name/){print $3;exit}
}
}

Finding all kinds of extensions referenced in a html file

Here is my problem statement :
There is a folder with many html and text files. I need to recursively go through each one of them and find all kinds of file extensions referenced in these html/text files like .jpg, .tif, .png etc
The problem is I don't have a defined list of the extensions I want to search for.
What would be the best way to achieve this using a shell script ?
Coming up with a Reg-ex which would essentially search for all occurrences of a dot followed by 3 or 4 letters, and filtering out the ones which end with a space or a comma, or a quote etc ??
Any suggestions would be helpful.
You could write a shell script that parses the file names with a regex, but the straightforward version is pretty simple:
$ cat *.{txt,html} | grep -oP '\b[A-Za-z0-9_]+\.[A-Za-z0-9]{1,4}\b' | awk -F. '{ print "." $(NF) }' | sort -u
For recursive search:
find . \( -name '*.txt' -o -name '*.html' \) -exec grep -oP '\b[A-Za-z0-9_.]+\.[A-Za-z0-9]{1,4}\b' {} \; | awk -F. '{ print "." $(NF) }' | sort -u
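Since there is no predefined list of extensions, it can help to see how often each one turns up, so that one-off false positives (version numbers, abbreviations) stand out. A rough sketch along the same lines as the commands above; `list_extensions` is a made-up name, and it assumes GNU grep (for `\b`) and GNU xargs (for `-r`).

```shell
# Count the file extensions referenced inside .txt/.html files under a directory.
# Assumptions: GNU grep and GNU xargs; "extension" means 1-4 alphanumerics after a dot.
list_extensions() {
    find "${1:-.}" -type f \( -name '*.txt' -o -name '*.html' \) -print0 |
        xargs -0 -r grep -ohE '[A-Za-z0-9_]+\.[A-Za-z0-9]{1,4}\b' |
        awk -F. '{ print "." tolower($NF) }' |  # keep only the lowercased extension
        sort | uniq -c | sort -rn               # most frequent first
}

# Usage: list_extensions /path/to/folder
```

The parentheses around the two -name tests are what make -print0 apply to both file types.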

How can I extract td from html in bash?

I am querying London postcode data from geonames:
http://www.geonames.org/postalcode-search.html?q=london&country=GB
I want to turn the output into a list of just the postcode identifiers (Bethnal Green, Islington, etc.). What is the best way to extract just the names in bash?
I'm not sure if you mean this \n delimited list (or one in brackets and comma delimited)
html='http://www.geonames.org/postalcode-search.html?q=london&country=GB'
wget -q "$html" -O - |
w3m -dump -T 'text/html'|
sed -nr 's/^ +[0-9]+ +(.*) +[A-Z]+[0-9]+ +United Kingdom.*/\1/p'
w3m is a: "WWW browsable pager with excellent tables/frames support"
output (first 10 lines)
London Bridge
Kilburn
Ealing
Wandsworth
Pimlico
Kensington
Leyton
Leytonstone
Plaistow
Poplar
I see the site offers (but not for free) web services with XML or JSON data... It would be the best way, since the HTML page is not meant to be parsed (easily).
Anyway, nothing is impossible, but using strictly only bash commands would be a lot harder, if not impossible; usually several other common tools are piped together to achieve the result. Then again, sometimes it turns out to be more convenient to stick to a single tool such as Perl instead of combining cat, grep, awk, sed and whatever else.
Something like
sed -e 's/>/>\n/g' region.html |
egrep -i "^\s*[A-Z]+[0-9]+</td>" |
sed -e 's|</td>||g'
worked extracting 200 lines, assuming a specific format for the code.
ADD
If there's no limit to the software you can use to parse the data, then you could use a line like
wget -q "http://www.geonames.org/postalcode-search.html?q=london&country=GB" -O - |
sgrep '"<table class=\"restable\"" .. "</table>"' |
sed -e 's|/tr>|/tr>\n|g; s|</td>\s*<td[^>]*>|;|g; s|</th>\s*<th[^>]*>|;|g; s|<[^>]\+>||g; s|;; .*$| |g' |
grep -v "^\s*$" |
tail -n+2 | cut -d";" -f2,3
which extracts places and postal codes separated by a ; like in a CSV, as well as awk:
wget -q "$html" -O - |
w3m -dump -T 'text/html' |
awk '/\s*[0-9]+ / { print substr($0, 11, 16); }'
which is based on the answer by Peter.O and extracts the same data... and so on. But in these cases, since you are not limited to the minimal tools found on most Unix or GNU systems, I would stick to one single widespread tool, e.g. perl.
If you have access to the mojo tool from the Mojolicious project this all becomes quite a lot easier:
mojo get 'http://www.geonames.org/postalcode-search.html?q=london&country=GB' '.restable > tr > td:nth-child(2)' text | grep ^'[a-zA-Z]'
The grep at the end is just to filter out some junk results; almost (but not quite) every other line is bad, because the page structure is slightly inconsistent. Otherwise you could say tr:nth-child(even) and get nice results.
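If you would rather not install w3m, sgrep or mojo, the same idea (pick one td per table row) can be sketched with awk alone. This is only an illustration under assumptions: `td_column` is a made-up name, the awk must accept a multi-character record separator (gawk and mawk do), and the rows are assumed to use plain `</td>` and `</tr>` closing tags, which I have not checked against the live page.

```shell
# Print the text of the Nth <td> of every table row read from stdin.
# Assumptions: an awk that supports a multi-character RS (gawk, mawk);
# rows end with </tr> and cells end with </td>.
td_column() {
    awk -v col="$1" '
        BEGIN { RS = "</tr>"; FS = "</td>" }   # one record per row, one field per cell
        NF >= col {
            cell = $col
            gsub(/<[^>]*>/, "", cell)              # drop any tags inside the cell
            gsub(/^[ \t\n]+|[ \t\n]+$/, "", cell)  # trim surrounding whitespace
            if (cell != "") print cell
        }'
}

# Usage: wget -q "$html" -O - | td_column 2
```

Header rows built from th cells contain no `</td>`, so they are skipped automatically.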

Can aspell output line number and not offset in pipe mode?

Can aspell output line numbers instead of offsets in pipe mode for html and xml files? I can't feed it the file line by line, because then aspell can't match a tag with its closing tag when the closing tag is on a later line.
This will output all occurrences of misspelt words with line numbers:
# Get aspell output...
<my_document.txt aspell pipe list -d en_GB --personal=./aspell.ignore.txt |
# Process the aspell output...
grep '[a-zA-Z]\+ [0-9]\+ [0-9]\+' -oh | \
grep '[a-zA-Z]\+' -o | \
while read word; do grep -on "\<$word\>" my_document.txt; done
Where:
my_document.txt is your original document
en_GB is your primary dictionary choice (e.g. try en_US)
aspell.ignore.txt is an aspell personal dictionary (example below)
aspell_output.txt is the output of aspell in pipe mode (ispell style)
result.txt is a final results file
aspell.ignore.txt example:
personal_ws-1.1 en 500
foo
bar
example results.txt output (for an en_GB dictionary):
238:color
302:writeable
355:backends
433:dataonly
You can also print the whole line by changing the last grep -on into grep -n.
This is just an idea, I haven't really tried it yet (I'm on a windows machine :(). But maybe you could pipe the html file through head (with byte limit) and count newlines using grep to find your line number. It's neither efficient nor pretty, but it might just work.
cat icantspell.html | head -c <offset from aspell> | egrep -Uc "$"
I use the following script to perform spell-checking and to work-around the awkward output of aspell -a / ispell. At the same time, the script also works around the problem that ordinals like 2nd aren't recognized by aspell by simply ignoring everything that aspell reports which is not a word of its own.
#!/bin/bash
set +o pipefail
if [ -t 1 ] ; then
color="--color=always"
fi
! for file in "$@" ; do
<"$file" aspell pipe list -p ./dict --mode=html |
grep '[[:alpha:]]\+ [0-9]\+ [0-9]\+' -oh |
grep '[[:alpha:]]\+' -o |
while read word ; do
grep $color -n "\<$word\>" "$file"
done
done | grep .
You even get colored output if the stdout of the script is a terminal, and you get an exit status of 1 in case the script found spelling mistakes, otherwise the exit status of the script is 0.
Also, the script protects itself from pipefail, which is a somewhat popular option to set (e.g. in a Makefile) but doesn't work for this script. Last but not least, this script explicitly uses [[:alpha:]] instead of [a-zA-Z], which is less confusing when it also matches non-ASCII characters like German äöüÄÖÜß and others. [a-zA-Z] also does, but that to some degree comes as a surprise.
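The exit status comes from the combination of the leading ! and the trailing grep . in the script: grep . succeeds when at least one non-empty line reaches it, and ! inverts that. Isolated into a tiny helper (the name `fail_if_output` is made up for this sketch), the idiom looks like this:

```shell
# Pass stdin through unchanged (minus empty lines) and
# return 1 if anything was printed, 0 if there was no output.
fail_if_output() {
    ! grep .
}

# Demo: some output means "mistakes found", so the status is 1.
printf '%s\n' '12:color' | fail_if_output
echo "exit status: $?"
```

The demo prints the line 12:color followed by exit status: 1. Under set -o pipefail a failing command earlier in the pipe would change the result, which is why the script switches that option off.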
aspell pipe / aspell -a / ispell output one empty line for each input line (after reporting the errors of the line).
Demonstration printing the line number with awk:
$ aspell pipe < testFile.txt |
awk '/^$/ { countedLine=countedLine+1; print "#L=" countedLine; next; } //'
produces this output:
@(#) International Ispell Version 3.1.20 (but really Aspell 0.60.7-20110707)
& iinternational 7 0: international, Internationale, internationally, internationals, intentional, international's, Internationale's
#L=1
*
*
*
& reelly 22 11: Reilly, really, reel, rely, rally, relay, resell, retell, Riley, rel, regally, Riel, freely, real, rill, roll, reels, reply, Greeley, cruelly, reel's, Reilly's
#L=2
*
#L=3
*
*
& sometypo 18 8: some typo, some-typo, setup, sometime, someday, smote, meetup, smarty, stupor, Smetana, somatic, symmetry, mistype, smutty, smite, Sumter, smut, steppe
#L=4
with testFile.txt
iinternational
I say this reelly.
hello
here is sometypo.
(Still not as nice as hunspell -u (https://stackoverflow.com/a/10778071/4124767). But hunspell misses some command line options I like.)
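Building on the same blank-line counting, the ispell-style report can be reduced to plain line:word pairs. A small sketch; `misspellings_with_lines` is a made-up name, and it relies on the format shown above, where a misspelling line starts with & (suggestions follow) or # (no suggestions) and the word is the second field.

```shell
# Turn "aspell pipe" output (read from stdin) into LINE:WORD pairs.
# Assumption: one empty output line per input line, as described above.
misspellings_with_lines() {
    awk '
        /^$/     { line++; next }                      # empty line: one input line is done
        /^[&#] / { printf "%d:%s\n", line + 1, $2 }    # misspelled word on the current line
    '
}

# Usage: aspell pipe < testFile.txt | misspellings_with_lines
```

For the testFile.txt above this should give 1:iinternational, 2:reelly and 4:sometypo.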
For others using aspell with one of the filter modes (tex, html, etc), here's a way to only print line numbers for misspelled words in the filtered text. So for example, it won't print misspellings in the comments.
ASPELL_ARGS="--mode=html --personal=./.aspell.en.pws"
for file in "$@"; do
for word in $(aspell $ASPELL_ARGS list < "$file" | sort -u); do
grep -no "\<$word\>" <(aspell $ASPELL_ARGS filter < "$file")
done | sort -n
done
This works because aspell filter does not delete empty lines. I realize this isn't using aspell pipe as requested by OP, but it's in the same spirit of making aspell print line numbers.

Easiest way to extract the urls from an html page using sed or awk only

I want to extract the URL from within the anchor tags of an html file.
This needs to be done in BASH using SED/AWK. No perl please.
What is the easiest way to do this?
You could also do something like this (provided you have lynx installed)...
Lynx versions < 2.8.8
lynx -dump -listonly my.html
Lynx versions >= 2.8.8 (courtesy of @condit)
lynx -dump -hiddenlinks=listonly my.html
You asked for it:
$ wget -O - http://stackoverflow.com | \
grep -io '<a href=['"'"'"][^"'"'"']*['"'"'"]' | \
sed -e 's/^<a href=["'"'"']//i' -e 's/["'"'"']$//i'
This is a crude tool, so all the usual warnings about attempting to parse HTML with regular expressions apply.
grep "<a href=" sourcepage.html |
sed "s/<a href/\\n<a href/g" |
sed 's/\"/\"><\/a>\n/2' |
grep href |
sort | uniq
The first grep looks for lines containing urls. You can add more elements
after if you want to look only on local pages, so no http, but
relative path.
The first sed will add a newline in front of each a href url tag with the \n
The second sed will shorten each url after the 2nd " in the line by replacing it with the /a tag with a newline
Both seds will give you each url on a single line, but there is garbage, so
The 2nd grep href cleans the mess up
The sort and uniq will give you one instance of each existing url present in the sourcepage.html
With the Xidel - HTML/XML data extraction tool, this can be done via:
$ xidel --extract "//a/@href" http://example.com/
With conversion to absolute URLs:
$ xidel --extract "//a/resolve-uri(@href, base-uri())" http://example.com/
I made a few changes to Greg Bacon's solution:
cat index.html | grep -o '<a .*href=.*>' | sed -e 's/<a /\n<a /g' | sed -e 's/<a .*href=['"'"'"]//' -e 's/["'"'"'].*$//' -e '/^$/ d'
This fixes two problems:
We are matching cases where the anchor doesn't start with href as first attribute
We are covering the possibility of having several anchors in the same line
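The same two cases (href not being the first attribute, several anchors on one line) can also be handled with sed alone, which is closer to what the question asked for. A minimal sketch, assuming GNU sed (for \n in the replacement), double-quoted href values and lowercase tag names; `extract_hrefs` is a made-up name.

```shell
# Print the href value of every <a ...> tag, using sed only.
# Assumptions: GNU sed, double-quoted attribute values, lowercase "<a " and "href".
extract_hrefs() {
    sed 's/<a /\n<a /g' "$@" |                          # put each anchor at the start of a line
        sed -n 's/^<a [^>]*href="\([^"]*\)".*/\1/p'     # keep only the href value inside the tag
}

# Demo: two anchors on one line, the second with href as its second attribute.
printf '%s\n' '<p><a href="http://a.example/">A</a> <a class="x" href="/b.html">B</a></p>' |
    extract_hrefs
```

The [^>]* keeps the match inside the opening tag, so an anchor without an href (for example one with only a name attribute) prints nothing.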
An example, since you didn't provide any sample
awk 'BEGIN{
RS="</a>"
IGNORECASE=1
}
{
for(o=1;o<=NF;o++){
if ( $o ~ /href/){
gsub(/.*href=\042/,"",$o)
gsub(/\042.*/,"",$o)
print $(o)
}
}
}' index.html
You can do it quite easily with the following regex, which is quite good at finding URLs:
\b(([\w-]+://?|www[.])[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/)))
I took it from John Gruber's article on how to find URLs in text.
That lets you find all URLs in a file f.html as follows:
cat f.html | grep -o \
-P '\b(([\w-]+://?|www[.])[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/)))'
I am assuming you want to extract a URL from some HTML text, and not parse HTML (as one of the comments suggests). Believe it or not, someone has already done this.
OT: The sed website has a lot of good information and many interesting/crazy sed scripts. You can even play Sokoban in sed!
This is my first post, so I'll do my best to explain why I'm posting this answer:
Of the 7 most-voted answers, 4 use grep even though the question explicitly says "using sed or awk only".
The question also says "No perl please", yet, given the previous point, Perl-compatible regexes are available inside grep.
And this is the simplest way (as far as I know, and as was requested) to do it in Bash.
So here is the simplest script, using GNU grep 2.28:
grep -Po 'href="\K.*?(?=")'
About the \K escape: I found no information on it in the man and info pages, so I came here for the answer.
\K discards everything matched before it (including the href=" key itself) from the output.
Bear in mind the following advice from the man pages:
"This is highly experimental and grep -P may warn of unimplemented features."
Of course, you can modify the script to suit your tastes or needs, but I found it pretty straightforward for what was requested in the post, and also for many of us.
I hope you find it useful.
Thanks!
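To see what \K does in isolation, here is a small sketch you can run without any HTML file. It assumes a GNU grep built with PCRE support (the -P option); `hrefs_pcre` is a made-up name, and it uses [^"]* in place of the lazy .*? plus lookahead, which should match the same text for double-quoted values.

```shell
# Print only what follows href=" up to the next double quote.
# Assumption: GNU grep with -P (PCRE) available.
hrefs_pcre() {
    # Everything matched before \K is required for the match but left out of the output.
    grep -Po 'href="\K[^"]*' "$@"
}

# Demo: prints x.html and y.html on separate lines.
printf '%s\n' '<a href="x.html">x</a> <a href="y.html">y</a>' | hrefs_pcre
```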
In bash, the following should work. Note that it doesn't use sed or awk, but uses tr and grep, both very standard and not perl ;-)
$ cat source_file.html | tr '"' '\n' | tr "'" '\n' | grep -e '^https://' -e '^http://' -e'^//' | sort | uniq
for example:
$ curl "https://www.cnn.com" | tr '"' '\n' | tr "'" '\n' | grep -e '^https://' -e '^http://' -e'^//' | sort | uniq
generates
//s3.amazonaws.com/cnn-sponsored-content
//twitter.com/cnn
https://us.cnn.com
https://www.cnn.com
https://www.cnn.com/2018/10/27/us/new-york-hudson-river-bodies-identified/index.html\
https://www.cnn.com/2018/11/01/tech/google-employee-walkout-andy-rubin/index.html\
https://www.cnn.com/election/2016/results/exit-polls\
https://www.cnn.com/profiles/frederik-pleitgen\
https://www.facebook.com/cnn
etc...
Expanding on kerkael's answer:
grep "<a href=" sourcepage.html |
sed "s/<a href/\\n<a href/g" |
sed 's/\"/\"><\/a>\n/2' |
grep href |
sort | uniq |
# now adding some more
grep -v "<a href=\"#" |
grep -v "<a href=\"../" |
grep -v "<a href=\"http"
The first grep I added removes links to local bookmarks.
The second removes relative links to upper levels.
The third removes links that start with http (i.e. absolute links).
Pick and choose which one of these you use as per your specific requirements.
Make a first pass that replaces the start of each URL (http) with a newline followed by it (\nhttp). That guarantees each link starts at the beginning of a line and is the only URL on that line. The rest should be easy; here is an example:
sed "s/http/\nhttp/g" <(curl "http://www.cnn.com") | sed -n "s/\(^http[s]*:[a-zA-Z0-9/.=?_-]*\)\(.*\)/\1/p"
alias lsurls='_(){ sed "s/http/\nhttp/g" "${1}" | sed -n "s/\(^http[s]*:[a-zA-Z0-9/.=?_-]*\)\(.*\)/\1/p"; }; _'
You can try:
curl --silent -u "<username>:<password>" http://<NAGIOS_HOST>/nagios/cgi-bin/status.cgi|grep 'extinfo.cgi?type=1&host='|grep "status"|awk -F'</A>' '{print $1}'|awk -F"'>" '{print $3"\t"$1}'|sed 's/<\/a> <\/td>//g'| column -c2 -t|awk '{print $1}'
This is how I did it to get a cleaner view: create a shell file and pass the link as a parameter; it will create a temp2.txt file.
a=$1
lynx -listonly -dump "$a" > temp
awk 'FNR > 2 {print$2}' temp > temp2.txt
rm temp
>sh test.sh http://link.com
Eschewing the awk/sed requirement:
urlextract is made just for such a task (documentation).
urlview is an interactive CLI solution (github repo).
I scrape websites using Bash exclusively to verify the http status of client links and report back to them on errors found. I've found awk and sed to be the fastest and easiest to understand. Props to the OP.
curl -Lk https://example.com/ | sed -r 's~(href="|src=")([^"]+).*~\n\1\2~g' | awk '/^(href|src)/,//'
Because sed works on one line at a time, this puts each URL it finds on a line of its own, including any relative URLs. The sed command finds href and src attributes and moves each onto a new line while removing the rest of the line, including the closing double quote (") at the end of the link.
Notice I'm using a tilde (~) in sed as the defining separator for substitution. This is preferred over a forward slash (/). The forward slash can confuse the sed substitution when working with html.
The awk finds any line that begins with href or src and outputs it.
Once the content is properly formatted, awk or sed can be used to collect any subset of these links. For example, you may not want base64 images, instead you want all the other images. Our new code would look like:
curl -Lk https://example.com/ | sed -r 's~(href="|src=")([^"]+).*~\n\1\2~g' | awk '/^(href|src)/,//' | awk '/^src="[^d]/,//'
Once the subset is extracted, just remove the href=" or src="
sed -r 's~(href="|src=")~~g'
This method is extremely fast and I use these in Bash functions to format the results across thousands of scraped pages for clients that want someone to review their entire site in one scrape.