echoing html in a bash string - html

I am a complete bash newbie.
I want to use the bash >> operator to append some HTML to the end of a file. I would like to construct the HTML from three concatenated sections: an opening HTML tag, a variable defined earlier in the script, and a closing HTML tag.
Something like:
echo <div id="myid"> $myBashVariable </div> >> file.html
I am not sure what syntax is needed to escape the various characters needed for the HTML markup... <,>,/,".
How can I make this work?

Save yourself the trouble and use a here document:
#!/bin/sh
var="world"
cat >> file.html << EOF
<html>
<head>
<title>Hello $var</title>
</head>
<body>
<div id="whatever">Hello $var, and welcome to my page.</div>
</body>
</html>
EOF
Note that characters in $var will not be HTML escaped, so if var='<script>alert(1)</script>' you will get a JS popup.
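If the variable can hold untrusted text, one way around that is to escape it before it reaches the here document. A minimal sketch, assuming a helper called html_escape (a made-up name, not a standard command) that replaces the five characters HTML treats specially:

```shell
#!/bin/sh
# html_escape is a hypothetical helper, not a standard utility.
# It replaces & < > " ' with HTML entities; & must be handled first
# so the entities produced by the later rules are not escaped again.
html_escape() {
    printf '%s' "$1" | sed \
        -e 's/&/\&amp;/g' \
        -e 's/</\&lt;/g' \
        -e 's/>/\&gt;/g' \
        -e 's/"/\&quot;/g' \
        -e "s/'/\&#39;/g"
}

var='<script>alert(1)</script>'
safe_var=$(html_escape "$var")

# Prints to stdout here; use "cat >> file.html" to append as in the answer.
cat << EOF
<div id="whatever">Hello $safe_var, and welcome to my page.</div>
EOF
```

With this, the script tag ends up in the page as visible text instead of being run by the browser.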

\ is the escape character: put it before each special character so the shell treats that character literally.
Try this:
echo \<div id=\"myid\"\> $myBashVariable \</div\> >> file.html

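You need to escape the special characters.
On Windows (cmd.exe), the escape character is ^ and variables are written as %name%; the double quotes need no escaping:
echo ^<div id="myid"^>%myBashVariable%^</div^> >> file.html
On Linux, the escape character is \: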
echo \<div\ id=\"myid\"\>$myBashVariable\</div\> >> file.html
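Quoting the whole string is usually easier to read than escaping each character. A small sketch, assuming a helper called make_div (a name made up for this example): the single-quoted printf format keeps <, > and " literal, and the variable is passed as a separate argument so it is still expanded.

```shell
#!/bin/sh
# make_div is a hypothetical helper name used only for this example.
# Inside single quotes the shell does not interpret <, >, / or ",
# so the markup needs no backslashes; %s is filled in from "$1".
make_div() {
    printf '<div id="myid">%s</div>\n' "$1"
}

myBashVariable="hello world"
make_div "$myBashVariable"
```

To append to the file as in the question, redirect the call: make_div "$myBashVariable" >> file.html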

Related

How to copy text between 2 html tags?

I want to copy all the text in a website between tags:
<p> and </p>
using bash.
Do you have an idea how to do it?
As the comment above states: don't even try. There is no reliable way to parse HTML with Bash internals.
But when you're using a shell you may as well use third-party command line tools such as pup which are built for HTML parsing on the command line.
Yes, an HTML parser is a better choice. But if you are just trying to grab the text in between the first set of P tags quickly, you can use Perl:
perl -n0e 'if (/<p>(.*?)<\/p>/s) { print $1; }'
For example:
echo "
<p>A test
here
today</p>
<p>whatever</p>
" | perl -n0e 'if (/<p>(.*?)<\/p>/s) { print $1; }'
This will output:
A test
here
today

How do I write a ">" into a txt file with batch?

I need to write an html document from a batch file and this document contains the ">" character. When I try to write a ">" character to a file though, it cuts off and doesn't write.
Example -
Echo <HTML> > HtmlDoc.html
The output here to the file would be
<HTML
How do I fix this?
You need to escape the special characters:
echo ^<html^> > HtmlDoc.html
For more information about escapes in batch scripting, read http://www.robvanderwoude.com/escapechars.php

How do I extract content from a webpage with certain headers in bash?

So far I am using curl along with w3m and sed to extract portions of a webpage like <body>....content....</body>. I want to ignore all the other headers (e.g. <a></a>, <div></div>). The problem is that the way I am doing it right now is really slow.
curl -L "http://www.somewebpage.com" | sed -n -e '\:<article class=:,\:<div id="below">: p' > file.html
w3m -dump file.html > file2.txt
These two lines above are really slow because curl has to first save the whole webpage into a file and parse it, then w3m parses it and saves it into another file. I just want to simplify this code. I was wondering if there was a way with lynx or html2text that lets you extract webpage content with specified headers. So, for example, if I wanted to extract something from a webpage (www.badexample.com <---not actually the link) with this content:
<title>blah......blah...</title>
<body>
Some text I need to extract
</body>
more stuffs
Is there a program to which I can pass a parameter specifying which content to extract? So I would specify someprogram <body></body> www.badexample.com and it would extract only the content inside those headers?
You can use a Perl one-liner for this (single-quoted so that the shell leaves $ARGV alone; the /s flag lets . match newlines):
perl -MLWP::Simple -e 'print get($ARGV[0]) =~ /<$ARGV[1]>(.*?)<\/$ARGV[1]>/s;' http://www.example.com/ title
Instead of the html tag, you can pass the whole regex as well:
perl -MLWP::Simple -e 'print get($ARGV[0]) =~ /$ARGV[1]/s;' "http://www.example.com/" "<body>(.*?)</body>"
Must it be in bash? What about PHP and DOMDocument()?
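$dom = new DOMDocument();
$new_dom = new DOMDocument();
$url_value = 'http://www.google.com';
$html = file_get_contents($url_value);
// Real-world HTML is rarely well-formed; collect the parser warnings instead of printing them
libxml_use_internal_errors(true);
$dom->loadHTML($html);
$body = $dom->getElementsByTagName('body')->item(0);
// Copy every child node of <body> into a fresh document
foreach ($body->childNodes as $child) {
    $new_dom->appendChild($new_dom->importNode($child, true));
}
echo $new_dom->saveHTML();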

How to write the "greater than symbol" in a HTML file using a batch file

I was wondering if it is possible to use the "<" and ">" when writing to a HTML file from a batch file. I need this so I can write certain things to html files.
I tried the following and it didn't work:
ECHO </html> >>File.html
PS. Thanks in advance
It's a little messy, but you have to escape the < and > characters using ^:
echo ^<html^> >> a.html
echo ^<body^>Hi^</body^> >> a.html
echo ^</html^> >> a.html
Result:
<html>
<body>Hi</body>
</html>
You can escape the characters by placing a caret (^) in front of "<" and ">".
echo ^</html^> >>File.html

How to indent html with xmllint?

I'm outputting html that's all crushed together, and would like to convert it to have proper indentation. I've been trying to use xmllint for this, but with no joy. E.g. when this is in file.html:
<table><tr><td><b>Foo</b></td></tr></table>
<table><tr><td>Bar</td></tr></table>
I get:
$ xmllint --format file.html
file.html:2: parser error : Extra content at the end of the document
<table><tr><td>Bar</td></tr></table>
^
<<< exit status [1] >>>
But when file.html contains either of those lines alone, it works fine (removing the second line):
$ xmllint --format file.html
<?xml version="1.0"?>
<table>
<tr>
<td>
<b>Foo</b>
</td>
</tr>
</table>
When I include the --html option, it's more likely to run without errors, but then it doesn't indent.
Any suggestions? Are there any other (*nix) tools I can use for this? Thanks ...
As user 4M01 suggested: On the command line, append the pipe with a call to HTML tidy.
HTML output from xmllint will be repaired; tidy will wrap some reasonable ... around your html fragment.
xmllint --xpath "//tr[6]/td[7]" --html - | tidy -q
tidy -i sets the indent: auto config value. If instead of auto I set it to yes, I consistently got better indentation style:
tidy --indent yes
I think this is because the HTML you have supplied doesn't have a single root tag, which makes it invalid XML.
Try wrapping it in a body tag and running xmllint on it again.
<body><table><tr><td><b>Foo</b></td></tr></table>
<table><tr><td>Bar</td></tr></table></body>
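The wrapping can also be done on the fly, without editing the file. A sketch, assuming a helper called wrap_in_body (a made-up name) and that the fragment is otherwise well-formed XML:

```shell
#!/bin/sh
# wrap_in_body is a hypothetical helper: it copies stdin to stdout with a
# single <body> root element around it, so the result has exactly one root.
wrap_in_body() {
    printf '<body>\n'
    cat
    printf '</body>\n'
}

fragment='<table><tr><td><b>Foo</b></td></tr></table>
<table><tr><td>Bar</td></tr></table>'

# "-" tells xmllint to read from stdin; skip the call if xmllint is missing.
if command -v xmllint >/dev/null 2>&1; then
    printf '%s\n' "$fragment" | wrap_in_body | xmllint --format -
fi
```

For a file on disk the equivalent would be wrap_in_body < file.html | xmllint --format -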
Have you tried HTML Tidy? More information about it is available at W3 and SourceForge. There is even a GUI tool available, known as GuiTidy. These tools are great: they not only help with proper indentation but also validate the HTML code.
Hope this helps.