Is there something like a "CSS selector" or XPath grep? - html

I need to find all places in a bunch of HTML files that lie in the following structure (CSS):
div.a ul.b
or XPath:
//div[@class="a"]//ul[@class="b"]
grep doesn't help me here. Is there a command-line tool that returns all files (and optionally all places therein) that match this criterion? I.e., one that returns file names if the file matches a certain HTML or XML structure.

Try this:
Install http://www.w3.org/Tools/HTML-XML-utils/.
Ubuntu: aptitude install html-xml-utils
MacOS: brew install html-xml-utils
Save a web page (call it filename.html).
Run: hxnormalize -l 240 -x filename.html | hxselect -s '\n' -c "label.black"
Where "label.black" is the CSS selector that uniquely identifies the name of the HTML element. Write a helper script named cssgrep:
#!/bin/bash
# Ignore errors, write the results to standard output.
hxnormalize -l 240 -x "$1" 2>/dev/null | hxselect -s '\n' -c "$2"
You can then run:
cssgrep filename.html "label.black"
This will generate the content for all HTML label elements of the class black.
The -l 240 argument is important to avoid unwanted line breaks in the output. For example, if <label class="black">Text to \nextract</label> is the input, then -l 240 will reformat the HTML to <label class="black">Text to extract</label>, only inserting newlines at column 240, which simplifies parsing. Extending out to 1024 or beyond is also possible.
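For the original question (matching div.a ul.b across many files), a minimal sketch built on the same helper tools could loop over the files and print the names of those containing a match (the *.html glob is just an example):
# print the name of each file that contains a ul.b inside a div.a
for f in *.html; do
  if hxnormalize -l 240 -x "$f" 2>/dev/null | hxselect "div.a ul.b" | grep -q .; then
    echo "$f"
  fi
done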
See also:
https://superuser.com/a/529024/9067 - similar question
https://gist.github.com/Boldewyn/4473790 - wrapper script

I have built a command-line tool with Node.js which does just this. You enter a CSS selector and it will search through all of the HTML files in the directory and tell you which files have matches for that selector.
You will need to install Element Finder, cd into the directory you want to search, and then run:
elfinder -s "div.a ul.b"
For more info please see http://keegan.st/2012/06/03/find-in-files-with-css-selectors/

There are two tools:
pup - Inspired by jq, pup aims to be a fast and flexible way of exploring HTML from the terminal.
htmlq - Like jq, but for HTML. Uses CSS selectors to extract bits of content from HTML files.
Examples:
$ wget http://en.wikipedia.org/wiki/Robots_exclusion_standard -O robots.html
$ pup --color 'title' < robots.html
<title>
Robots exclusion standard - Wikipedia
</title>
$ htmlq --text 'title' < robots.html
Robots exclusion standard - Wikipedia
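Applied to the structure from the original question, both tools take the same CSS selector (a sketch; the file name is just an example):
$ pup 'div.a ul.b' < filename.html
$ htmlq 'div.a ul.b' < filename.html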

Per Nat's answer here:
How to parse XML in Bash?
Command-line tools that can be called from shell scripts include:
4xpath - command-line wrapper around Python's 4Suite package
XMLStarlet (see the example after this list)
xpath - command-line wrapper around Perl's XPath library
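For instance, the original question's structure could be matched with XMLStarlet, assuming the input is well-formed XML/XHTML (run it through a tidier such as hxnormalize -x first if it isn't); -f prints the name of the matching file:
xmlstarlet sel -t -m '//div[@class="a"]//ul[@class="b"]' -f -n filename.xhtml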

How to get absolute URLs in Bash

I want to get all URLs from a specific page in Bash.
This problem is already solved here: Easiest way to extract the urls from an html page using sed or awk only
The trick, however, is to parse relative links into absolute ones. So if http://example.com/ contains links like:
<a href="/about.html">About us</a>
<script type="text/javascript" src="media/blah.js"></script>
I want the results to have the following form:
http://example.com/about.html
http://example.com/media/blah.js
How can I do so with as little dependencies as possible?
Simply put, there is no simple solution. Having few dependencies leads to unsightly code, and vice versa: code robustness leads to heavier dependency requirements.
Having this in mind, below I describe a few solutions and sum them up by providing pros and cons of each one.
Approach 1
You can use wget's -k option together with some regular expressions (read more about parsing HTML that way).
From Linux manual:
-k
--convert-links
After the download is complete, convert the links in the document to
make them suitable for local viewing.
(...)
The links to files that have not been downloaded by Wget will be
changed to include host name and absolute path of the location they
point to.
Example: if the downloaded file /foo/doc.html links to /bar/img.gif
(or to ../bar/img.gif), then the link in doc.html will be modified to
point to http://hostname/bar/img.gif.
An example script:
#wget needs a file in order for -k to work
tmpfil=$(mktemp);
#-k - convert links
#-q - suppress output
#-O - redirect output to given file
wget http://example.com -k -q -O "$tmpfil";
#-o - print only matching parts
#you could use any other popular regex here
grep -o "http://[^'\"<>]*" "$tmpfil"
#remove unnecessary file
rm "$tmpfil"
Pros:
Works out of the box on most systems, assuming you have wget installed.
In most cases, this will be a sufficient solution.
Cons:
Relies on regular expressions, which are bound to break on some exotic pages, because HTML is not a regular language and therefore cannot be parsed reliably with regular expressions.
You cannot pass a location in your local file system; you must pass a working URL.
Approach 2
You can use Python together with BeautifulSoup. An example script:
#!/usr/bin/python
import sys
import urllib
import urlparse
import BeautifulSoup

if len(sys.argv) <= 1:
    print >>sys.stderr, 'Missing URL argument'
    sys.exit(1)

content = urllib.urlopen(sys.argv[1]).read()
soup = BeautifulSoup.BeautifulSoup(content)
for anchor in soup.findAll('a', href=True):
    print urlparse.urljoin(sys.argv[1], anchor.get('href'))
And then:
dummy:~$ ./test.py http://example.com
Pros:
It's the correct way to handle HTML, since it properly uses a fully-fledged parser.
Exotic input is very likely to be handled well.
With small modifications, this approach works for files, not URLs only.
With small modifications, you might even be able to give your own base URL.
Cons:
It needs Python.
It needs Python with a custom package.
You need to manually handle tags and attributes like <img src>, <link href>, <script src>, etc. (which isn't handled in the script above).
Approach 3
You can use some features of lynx. (This one was mentioned in the answer you provided in your question.) Example:
lynx http://example.com/ -dump -listonly -nonumbers
Pros:
Very concise usage.
Works well with all kinds of HTML.
Cons:
You need Lynx.
Although you can extract links from files as well, you cannot control the base URL, and you end up with file://localhost/ links. You can fix this with ugly hacks like manually inserting a <base href=""> tag into the HTML (see the sketch below).
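A rough sketch of that workaround (the URL and file names here are just examples, and it assumes the page contains a literal lowercase <head> tag):
# inject a <base> tag so lynx resolves relative links against it
sed 's|<head>|<head><base href="http://example.com/">|' page.html > /tmp/page-with-base.html
lynx /tmp/page-with-base.html -dump -listonly -nonumbers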
Another option is my Xidel (XQuery/Webscraper):
For all normal links:
xidel http://example.com/ -e '//a/resolve-uri(@href)'
For all links and srcs:
xidel http://example.com/ -e '(//@href, //@src)/resolve-uri(.)'
With rr-'s format:
Pros:
Very concise usage.
Works well with all kinds of HTML.
It's the correct way to handle HTML, since it properly uses a fully-fledged parser.
Works for files and URLs.
You can give your own base URL (with resolve-uri(@href, "baseurl")).
No dependencies except Xidel (and OpenSSL, if you also have HTTPS URLs).
Cons:
You need Xidel, which is not contained in any standard repository
Why not simply this?
re='(src|href)='
baseurl='example.com'
wget -O- "http://$baseurl" | awk -F\" -v base="http://$baseurl" "/$re/"'{ print base "/" $2 }'
You just need wget and awk.
Feel free to improve the snippet a bit if you have both relative and absolute URLs at the same time (see the sketch below).
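A slightly more careful variant (still a sketch, with the same example host) that leaves absolute URLs alone and prefixes only relative ones:
# pull out quoted src/href values, strip the quotes, then prefix relative ones with the base
baseurl='example.com'
wget -O- "http://$baseurl" |
  grep -oE '(src|href)="[^"]*"' |
  sed -E 's/^(src|href)="//; s/"$//' |
  awk -v base="http://$baseurl" '{ if ($0 ~ /^https?:\/\//) print; else if ($0 ~ /^\//) print base $0; else print base "/" $0 }'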

Looking for a way to exclude files used by geninfo/genhtml

We are trying to use geninfo and genhtml (an alternative to gcovr, see here) to produce an HTML page from the coverage data provided by gcov.
geninfo creates lcov-tracefiles from gcov's *.gcda files
genhtml generates html files from the above tracefiles
However, the end result includes not only our code, but also files from /usr/include.
Does anyone know of a way to exclude these?
I tried looking at the man page but could not find anything: http://linux.die.net/man/1/geninfo
If you're just looking to ignore files from /usr/include, a better option is probably "--no-external", which is intended for exactly this purpose.
lcov --no-external -d $(BLD_DIR) --capture -o .coverage.run
You can use the lcov -r option to remove those files you aren't interested in.
lcov -r <input tracefile> /usr/include/\* -o <output tracefile>
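Putting the two suggestions together, a possible end-to-end workflow could look like this (a sketch; the directory and file names are placeholders):
# capture coverage, ignoring files outside the build tree (e.g. /usr/include)
lcov --capture --no-external -d build_dir -o coverage.info
# or filter an existing tracefile explicitly
lcov -r coverage.info '/usr/include/*' -o coverage.filtered.info
# generate the HTML report from the tracefile
genhtml coverage.filtered.info -o coverage-html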

Solution to programmatically generate an export script from a directory hierarchy

I often have the following scenario: in order to reproduce a bug for reporting, I create a small sample project, sometimes a Maven multi-module project. So there may be a hierarchy of directories, and it will usually contain a few small text files. Standard procedure would of course be to create a zip file and send that. But on some mailing lists attachments are not allowed, so I am looking for a way to automatically create an installation script that I can post to such mailing lists.
Basically I would be happy with a Unix-only flavor that creates mkdir statements to create directories and >> statements to write the file contents. (Actually, apart from the relative path delimiters, the Windows and Unix versions can probably be identical.)
Does such a tool exist somewhere? If not, I'll probably write one in java, but I'm happy to accept solutions in all kinds of languages.
(The tool could run under windows or unix, but the target platform for the generated scripts should be either unix or configurable)
I think you're looking for shar, which creates a shell archive (shell script that when run produces a given directory hierarchy). It is available on most systems; you can use GNU sharutils if you don't already have it.
Normal usage for packing up a directory tree would be something like:
shar `find somedirectory -print` > archive.sh
If you're using GNU sharutils, and want to create "vanilla" archives which use only the most portable of shell builtins, mkdir, and sed, then you should invoke it as shar -V. You can remove some more extra baggage from the scripts by using -xQ; -x to remove checks for existing files, and -Q to remove verbose output from the archive.
shar -VxQ `find somedir -print` > archive.sh
If you really want something even simpler, here's a dirt-simple version of shar as a shell script. It takes filenames on standard input instead of arguments for simplicity and to be a little more robust.
#!/bin/sh
# Read file and directory names from standard input and emit a shell
# archive that recreates them when run.
while read filename
do
    if test -d "$filename"
    then
        echo "mkdir -p '$filename'"
    else
        echo "sed 's/^X//' <<EOF > '$filename'"
        sed 's/^/X/' < "$filename"
        echo 'EOF'
    fi
done
Invoke as:
find somedir -print | simpleshar > archive.sh
You still need to invoke sed, as you need some way of ensuring that no lines in the here document begin with the delimiter, which would close the document and cause later lines to be interpreted as part of the script. I can't think of any really good way to solve the quoting problem using only shell builtins, so you will have to rely on sed (which is standard on any Unix-like system, and has been practically forever).
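For illustration, given a hypothetical tree containing only somedir/hello.txt with the single line "hello world", the generated archive.sh would look roughly like this:
mkdir -p 'somedir'
sed 's/^X//' <<EOF > 'somedir/hello.txt'
Xhello world
EOF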
If your problem is filters that hate non-text files:
In times long forgotten, we used uuencode to get past 8-bit-eating relays -
is that a way to get past attachment-eating mailboxes these days?
So why not zip and uuencode?
(Or base64, which is its younger cousin.)
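A minimal sketch of that idea (the archive name is just an example); the recipient reverses it with uudecode or base64 -d, followed by unzip:
zip -r sample-project.zip somedir
# either classic uuencode ...
uuencode sample-project.zip sample-project.zip > sample-project.uu
# ... or its younger cousin
base64 sample-project.zip > sample-project.b64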

getting HTML source or rich text from the X clipboard

How can rich text or HTML source code be obtained from the X clipboard? For example, if you copy some text from a web browser and paste it into kompozer, it pastes as HTML, with links etc. preserved. However, xclip -o for the same selection just outputs plain text, reformatted in a way similar to that of elinks -dump. I'd like to pull the HTML out and into a text editor (specifically vim).
I asked the same question on superuser.com, because I was hoping there was a utility to do this, but I didn't get any informative responses. The X clipboard API is still a mysterious beast to me; any tips on hacking something up to pull this information out are most welcome. My language of choice these days is Python, but pretty much anything is okay.
To complement @rkhayrov's answer, there already exists a command for that: xclip. Or more exactly, there is a patch that does this, which was added to xclip back in 2010 but for a long time went unreleased. So, assuming your OS (like Debian) ships the subversion head of xclip (2019 edit: version 0.13 with those changes was eventually released in 2016, and pulled into Debian in January 2019):
To list the targets for the CLIPBOARD selection:
$ xclip -selection clipboard -o -t TARGETS
TIMESTAMP
TARGETS
MULTIPLE
SAVE_TARGETS
text/html
text/_moz_htmlcontext
text/_moz_htmlinfo
UTF8_STRING
COMPOUND_TEXT
TEXT
STRING
text/x-moz-url-priv
To select a particular target:
$ xclip -selection clipboard -o -t text/html
<b>rkhayrov</b>
$ xclip -selection clipboard -o -t UTF8_STRING
rkhayrov
$ xclip -selection clipboard -o -t TIMESTAMP
684176350
And xclip can also set and own a selection (-i instead of -o).
In X11 you have to communicate with the selection owner, ask about supported formats, and then request data in the specific format. I think the easiest way to do this is using existing windowing toolkits. E.g., with Python and GTK:
#!/usr/bin/python
import glib, gtk

def test_clipboard():
    clipboard = gtk.Clipboard()
    targets = clipboard.wait_for_targets()
    print "Targets available:", ", ".join(map(str, targets))
    for target in targets:
        print "Trying '%s'..." % str(target)
        contents = clipboard.wait_for_contents(target)
        if contents:
            print contents.data

def main():
    mainloop = glib.MainLoop()

    def cb():
        test_clipboard()
        mainloop.quit()

    glib.idle_add(cb)
    mainloop.run()

if __name__ == "__main__":
    main()
Output will look like this:
$ ./clipboard.py
Targets available: TIMESTAMP, TARGETS, MULTIPLE, text/html, text/_moz_htmlcontext, text/_moz_htmlinfo, UTF8_STRING, COMPOUND_TEXT, TEXT, STRING, text/x-moz-url-priv
...
Trying 'text/html'...
I asked the same question on superuser.com, because I was hoping there was a utility to do this, but I didn't get any informative responses.
Trying 'text/_moz_htmlcontext'...
<html><body class="question-page"><div class="container"><div id="content"><div id="mainbar"><div id="question"><table><tbody><tr><td class="postcell"><div><div class="post-text"><p></p></div></div></td></tr></tbody></table></div></div></div></div></body></html>
...
Trying 'STRING'...
I asked the same question on superuser.com, because I was hoping there was a utility to do this, but I didn't get any informative responses.
Trying 'text/x-moz-url-priv'...
http://stackoverflow.com/questions/3261379/getting-html-source-or-rich-text-from-the-x-clipboard
Extending the ideas from Stephane Chazelas, you can:
Copy from the formatted source.
Run this command to extract from the clipboard, convert to HTML, and then (with a pipe |) put that HTML back in the clipboard, again using the same xclip:
xclip -selection clipboard -o -t text/html | xclip -selection clipboard
Next, when you paste with Ctrl+v, it will paste the HTML source.
Going further, you can make it a shortcut, so that you don't have to open the terminal and run the exact command each time. ✨
To do that:
Open the settings for your OS (in my case it's Ubuntu)
Find the section for the Keyboard
Then find the section for shortcuts
Create a new shortcut
Set a Name, e.g.: Copy as HTML
Then as the command for the shortcut, put:
bash -c "xclip -selection clipboard -o -t text/html | xclip -selection clipboard"
Note: notice that it's the same command as above, but put inside of an inline Bash script. This is necessary to be able to use the | (pipe) to send the output from one command as input to the next.
Set the shortcut to whatever combination you want, preferably not overwriting another shortcut you use. In my case, I set it to: Ctrl+Shift+c
After this, you can copy some formatted text as normally with: Ctrl+c
And then, before pasting it, convert it to HTML with: Ctrl+Shift+c
Next, when you paste it with: Ctrl+v, it will paste the contents as HTML. 🧙✨

Is it possible to email the contents of vim using HTML

I like to view the current differences in the source files I'm working on with a command like:
vim <(svn diff -dub)
What I'd really like to be able to do is to email that colorized diff. I know Vim can export HTML with the :TOhtml command, but how do I pipe this output into an HTML email? Ideally, I'd like to be able to send an HTML diff with a single shell command.
The following one-liner produces an HTML file named email.html:
diff file1 file2 | vim - +TOhtml '+w email.html' '+qall!'
You can now use Pekka’s code to send the email.
However, I believe in using the right tool for the right job – and VIM may not be the right tool here. Other highlighters exist and their use is more appropriate here.
For example, Pygments can be harnessed to produce the same result, much more efficiently and hassle-free:
diff -u report.log .report.log | pygmentize -l diff -f html > email.html
Notice that this produces only the actual text body, not the style sheet, nor the surrounding HTML scaffold. This must be added separately but that’s not difficult either. Here’s a complete bash script to produce a valid minimal HTML file:
echo '<!DOCTYPE html><html><head><title>No title</title><style>' > email.html
pygmentize -S default -f html >> email.html
echo '</style></head><body>' >> email.html
diff -u report.log .report.log | pygmentize -l diff -f html >> email.html
echo '</body></html>' >> email.html
EDIT: In case Pekka’s code didn’t work for you – as it didn’t for me – because you don’t have the required versions of mail and mutt installed, you can use sendmail as follows to send the HTML email:
( echo 'To: email-address@example.com'
echo 'Content-Type: text/html'
echo 'Subject: test'
echo ''
cat email.html ) | sendmail -t
It’s important to leave an empty line between the header and the body of the email. Also, notice that it’s of course unnecessary to create the temporary file email.html. Just paste the rest of the commands into the right place above and delete the redirects to file.
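Putting that together without the temporary file might look like this (a sketch using the same placeholder address and file names as above):
( echo 'To: email-address@example.com'
  echo 'Content-Type: text/html'
  echo 'Subject: test'
  echo ''
  echo '<!DOCTYPE html><html><head><title>No title</title><style>'
  pygmentize -S default -f html
  echo '</style></head><body>'
  diff -u report.log .report.log | pygmentize -l diff -f html
  echo '</body></html>' ) | sendmail -t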
I'm no Linux guru, but this looks like it should serve your needs; you can pipe your output into it:
Send an HTML file as email from the command line. (uses mail)
There's also a one-line mutt example here:
mutt -e "my_hdr Content-Type: text/html"
-s "my subject" you#xxxxxxxxxxx < message.html
This will generate a pure HTML e-mail with no plain-text alternative - for that you would have to build a multipart mail... But maybe it'll do for what you need.