Using xmlstarlet to extract HTML

I am trying to extract a particular section of an HTML document from a bash shell script. I have been using xmlstarlet sel, but I can't quite get it to return the actual HTML rather than just the text values from the HTML tags.
I'm trying a command line as follows:
xmlstarlet sel -t -m "//div[@id='mw-content-text']" -v "." wiki.html
But it gives text only, without any HTML/XML markup. For context, I'm trying to export this data into an HTML format outside the MediaWiki instance it came from.
If xmlstarlet is the wrong tool, any suggestions for other tools also gratefully received!

-v means --value-of, which returns only the text content of the matched nodes. Use -c or --copy-of to copy the nodes themselves, markup included.
xmlstarlet sel -t -m "//div[@id='mw-content-text']" -c "." wiki.html
Or just
xmlstarlet sel -t -c "//div[@id='mw-content-text']" wiki.html
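One caveat worth adding (an assumption on my part, since the question's wiki.html isn't shown): real-world HTML is often not well-formed XML, and xmlstarlet sel may refuse to parse it. Converting it first with xmlstarlet fo (-H parses HTML input, -R recovers from errors) usually helps. A minimal sketch, with an invented stand-in for wiki.html:

```shell
#!/bin/bash
# Stand-in for the real MediaWiki page (content invented for the example).
cat > wiki.html <<'EOF'
<html><body>
<div id="mw-content-text"><p>Hello <b>world</b></p></div>
</body></html>
EOF

# fo -H parses the file as HTML and -R recovers from malformed markup,
# emitting well-formed XML that sel can then query; -c copies the matched
# node with its markup. Guarded in case xmlstarlet is not installed.
if command -v xmlstarlet >/dev/null 2>&1; then
  xmlstarlet fo -H -R wiki.html > wiki.xml 2>/dev/null
  extracted=$(xmlstarlet sel -t -c "//div[@id='mw-content-text']" wiki.xml)
else
  extracted="(xmlstarlet not installed)"
fi
printf '%s\n' "$extracted"
```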

Related

wget command to download web-page & rename file with html title?

I would like to download an HTML web page and have the filename be the title of the page.
I have found a command to get the HTML title:
wget -qO- 'https://www.linuxinsider.com/story/Austrumi-Linux-Has-Great-Potential-if-You-Speak-Its-Language-86285.html/' | gawk -v IGNORECASE=1 -v RS='</title' 'RT{gsub(/.*<title[^>]*>/,"");print;exit}'
And it prints this: Austrumi Linux Has Great Potential if You Speak Its Language | Reviews | LinuxInsider
Found on: https://unix.stackexchange.com/questions/103252/how-do-i-get-a-websites-title-using-command-line
How could I pipe the title back into wget to use it as the filename when downloading that web page?
EDIT: in case there is no way to do this directly in wget, I found a way to simply rename the HTML files once downloaded:
Renaming HTML files using <title> tags
You can't wget a file, analyze its contents, and then make the same wget execution that downloaded the file magically go back in time and output it to a new file named after the contents you analyzed in step 2. Just do this:
wget -qO- '...' > tmp &&
name=$(gawk '...' tmp) &&
mv tmp "$name"
Add protection against / in name as necessary.
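That protection can be a one-liner; a sketch with an invented title standing in for what the gawk pipeline would return:

```shell
#!/bin/bash
# A made-up title, as the gawk pipeline in the question might return it.
title='Great Potential | Reviews / LinuxInsider'

# Replace every / (not allowed in a filename) with -, then add .html.
name=$(printf '%s' "$title" | tr '/' '-').html
echo "$name"   # prints: Great Potential | Reviews - LinuxInsider.html
```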

Pandoc fails to embed metadata from the supplied YAML file

I need to convert some .xhtml files to regular .html (html5) with pandoc, and during the conversion I would like to embed some metadata (supplied via a YAML file) in the final files.
The conversion runs smoothly, but any attempt to embed the metadata invariably fails.
I tried many variations of this command, but it should be something like:
pandoc -s -H assets/header -c css/style.css -B assets/prefix -A assets/suffix --metadata-file=metadata.yaml input_file -o output_file --to=html5
The error I get is:
pandoc: unrecognized option `--metadata-file=metadata.yaml'
Try pandoc --help for more information.
I really don't get what's wrong with this, since I found this option in the pandoc manual.
Any ideas?
Your pandoc version is too old; --metadata-file was added in pandoc 2.3. Update to pandoc 2.3 or later.
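Once on pandoc 2.3+, a minimal run might look like the following; the metadata.yaml keys and the tiny input file are my own examples, not from the question:

```shell
#!/bin/bash
# An example metadata file; title and lang are standard pandoc fields.
cat > metadata.yaml <<'EOF'
---
title: My Converted Book
lang: en
---
EOF

# Minimal stand-in for one of the real .xhtml inputs.
printf '<p>hello</p>\n' > input.xhtml

# --metadata-file needs pandoc >= 2.3, so guard the call; -f html makes the
# reader explicit rather than relying on extension detection.
if command -v pandoc >/dev/null 2>&1; then
  pandoc -s -f html --metadata-file=metadata.yaml input.xhtml \
    -o output.html --to=html5 || echo "pandoc run failed (needs >= 2.3)"
fi
```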

OC command to remove multiple old tag(s) from an Image Stream?

I know oc tag -d python:3.5 will remove only the 3.5 tag. However, I would like to remove multiple old tags from the same image stream using an oc command.
For instance, the tags python:rel-1, python:rel-2, python:rel-3.
I am trying oc tag -d python:rel-*, but I end up with the error message below.
Error from server (NotFound): imagestreamtags.image.openshift.io "rel-*" not found
I am wondering: is there any way to use wildcards to remove multiple old tags in one go?
Not fully tested, and you can't do it in one command invocation, but you can use a shell script something like this:
#!/bin/bash
# Collect the tag names defined on the python image stream.
tags=$(oc get is python --template='{{range .spec.tags}}{{" "}}{{.name}}{{end}}{{"\n"}}')
# $tags is left unquoted on purpose so it word-splits into one tag per pass.
for tag in $tags; do
  if [[ "$tag" == rel-* ]]; then
    oc tag -d "python:$tag"
  fi
done
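The filtering step can be sanity-checked without a cluster by substituting a hardcoded tag list (made up here) for the oc get output; case is the portable equivalent of the [[ ... ]] glob test:

```shell
#!/bin/bash
# Stand-in for the output of:
#   oc get is python --template='{{range .spec.tags}}{{" "}}{{.name}}{{end}}'
tags='rel-1 rel-2 rel-3 3.5 latest'

# Collect the tags matching the rel-* glob; case pattern matching here
# behaves the same as bash's [[ "$tag" == rel-* ]].
matched=''
for tag in $tags; do
  case $tag in
    rel-*) matched="$matched $tag" ;;   # real script: oc tag -d "python:$tag"
  esac
done
echo "would delete:$matched"   # prints: would delete: rel-1 rel-2 rel-3
```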

Use Pandoc to convert tex file

I am trying to use pandoc to convert my .tex file to HTML or EPUB. It is not a complex LaTeX file with math formulae; it is something like a book.
But I have a problem. When I convert the file to PDF with pdflatex, everything is fine. But when I use
pandoc book.tex -s --webtex -o book.html
or
pandoc -S book.tex -o book.epub
it is as if there was no compilation: << is not replaced by «, and each command, like \emph{something}, is just ignored, with the word deleted from the paragraph.
In fact, it is as if I had done a simple copy and paste, without the commands being interpreted.
In older versions of pandoc, I think you needed to tell it the input format, otherwise it would assume it was markdown.
pandoc -f latex book.tex -S -o book.epub
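A self-contained way to try that, with a tiny invented book.tex in place of the real one:

```shell
#!/bin/bash
# A tiny stand-in for book.tex (content invented for the example).
cat > book.tex <<'EOF'
\documentclass{book}
\begin{document}
Un livre avec \emph{quelque chose} dedans.
\end{document}
EOF

# With -f latex, pandoc interprets commands like \emph instead of assuming
# the input is markdown. (In pandoc 2.x the old -S flag is gone; smart
# punctuation is the +smart extension instead.) Guarded in case pandoc is
# not installed.
if command -v pandoc >/dev/null 2>&1; then
  pandoc -f latex book.tex -o book.html || true
fi
```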

Wget recognizes some part of my URL address as a syntax error

I am quite new to wget and I have done my research on Google, but I found no clue.
I need to save a single HTML file of a webpage:
wget yahoo.com -O test.html
and it works, but when I try to be more specific:
wget http://search.yahoo.com/404handler?src=search&p=food+delicious -O test.html
here comes the problem: &p=food+delicious gets treated as shell syntax, and it says: 'p' is not recognized as an internal or external command
How can I solve this problem? I really appreciate your suggestions.
The & has a special meaning in the shell. Escape it with \ or put the URL in quotes to avoid this problem.
wget http://search.yahoo.com/404handler?src=search\&p=food+delicious -O test.html
or
wget "http://search.yahoo.com/404handler?src=search&p=food+delicious" -O test.html
In many Unix shells, putting an & after a command causes it to be executed in the background.
Wrap your URL in single quotes to avoid this issue.
i.e.
wget 'http://search.yahoo.com/404handler?src=search&p=food+delicious' -O test.html
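The effect of the quoting is easy to verify without touching the network; this sketch just counts how many arguments the quoted URL expands to:

```shell
#!/bin/bash
# The URL from the question, kept intact inside single quotes.
url='http://search.yahoo.com/404handler?src=search&p=food+delicious'

# Unquoted, the shell would split the command line at & and run the part
# before it in the background; quoted, the whole URL reaches wget as a
# single argument. set -- shows how many arguments the quoted form makes.
set -- "$url"
echo "arguments: $#"   # prints: arguments: 1
```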
If you are using a Jupyter notebook and the wget Python module, check that you have installed it with
pip install wget
before downloading from the URL.