Do you know an HTML snippet validator?

I'm searching for a tool that would allow me to check whether a certain snippet of HTML would be valid in its proper context.
I'd input something like
<dd>
my definition
<div>
div inside <dd> is allowed
</div>
</dd>
instead of the whole document. An ordinary validator will complain about the missing dl tag, but most of the time I just want to know whether a certain element is valid inside another one or not.
I'll try to explain it in more detail. Consider the following snippet:
<form>
<label>Name: <input /></label>
</form>
This would be valid, but to check it I have two options:
Validate the whole document: Most of the time this is good enough, but when I'm working on partial HTML snippets or embedded HTML it's quite some trouble. I'd have to copy the whole thing to a new HTML document and validate that.
Copy just the snippet, validate it with the W3C validator, and ignore some of the errors.
Basically, I'd like to check whether an element contains only elements it's allowed to contain.

You can actually use the W3C validator to check a snippet.
Choose the 'Validate by Direct Input' tab and select More Options. There you will find a radio button labelled 'Validate HTML fragment'. http://validator.w3.org/#validate_by_input+with_options
It will wrap your snippet in a valid HTML document, so the errors you see are due only to your snippet.

W3C does currently (May 2020) have a fragment validator, but it seems to have bitrot (no HTML5, at least). Here's a couple of simple scripts that attach a header and a footer to your fragment, and run the result through a local copy of the Nu checker. The fragment can be anything which is valid at the top level of a <body> tag - modify the header and footer if you need something else.
validate1.sh takes a single filename argument and checks it, while validate2.sh cycles through all HTML files in a directory. It has a simple exclusion list mechanism which you'll need to change. You'll need to modify both to point to your copy of vnu.jar.
validate1.sh:
#!/bin/bash
#
# validate1.sh
#
# Run the nu validator on the HTML fragment in the supplied filename.
# This script adds a header and trailer around the fragment, and supplies
# the result to 'vnu.jar'. You'll need to modify it to correctly locate
# vnu.jar.
if test "$#" -ne 1; then
    echo "Usage: '$0 fname', where 'fname' is the HTML file to be linted"
    exit 1
fi
var="<!doctype html>
<html lang=\"en\">
<head>
<title>foo</title>
</head>
<body>
$(< "$1")
</body>
</html>"
echo "Checking '$1'... subtract 6 from any reported line numbers"
echo "$var" | java -jar vnu.jar -
validate2.sh:
#!/bin/bash
#
# validate2.sh
#
# Run the nu validator on the HTML fragments in the supplied directory. This
# script adds a header and footer around each fragment in the directory, and
# supplies the result to 'vnu.jar'. You'll need to modify it to correctly
# locate vnu.jar.
if test "$#" -ne 1; then
    echo "Usage: '$0 dirname', where 'dirname' is the HTML directory to be linted"
    exit 1
fi
# Quote "$1" so that directory names containing spaces still work.
for filename in "$1"/*.html; do
    case $filename in
        # simple exclusion list example:
        "$1/root.html" | "$1/sitedown.html")
            echo "Skipping '$filename'"
            continue
            ;;
        *)
            ;;
    esac
    var="<!doctype html>
<html lang=\"en\">
<head>
<title>foo</title>
</head>
<body>
$(< "$filename")
</body>
</html>"
    echo "Checking '$filename'... subtract 6 from any reported line numbers"
    echo "$var" | java -jar vnu.jar -
done
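The wrap-and-check idea in these scripts can also be sketched in Python. This is only an illustration under assumptions: the names wrap_fragment and validate_fragment are made up here, the header is copied from the scripts above, and vnu.jar is assumed to sit in the current directory and to accept "-" for standard input, as in the shell versions.

```python
import subprocess

# Same header and footer as in the shell scripts above.
HEADER = """<!doctype html>
<html lang="en">
<head>
<title>foo</title>
</head>
<body>
"""
FOOTER = """
</body>
</html>
"""


def wrap_fragment(fragment):
    """Return (full_document, line_offset) for a body-level fragment."""
    offset = HEADER.count("\n")  # number of lines added before the fragment
    return HEADER + fragment + FOOTER, offset


def validate_fragment(fragment, vnu_jar="vnu.jar"):
    """Pipe the wrapped fragment to a local Nu checker and return its messages.

    The jar location is an assumption; adjust it to your copy of vnu.jar.
    """
    document, _ = wrap_fragment(fragment)
    result = subprocess.run(
        ["java", "-jar", vnu_jar, "-"],
        input=document,
        capture_output=True,
        text=True,
    )
    # Return both streams so nothing is lost whichever one the checker uses.
    return result.stdout + result.stderr
```

Computing the offset from the header, rather than hard-coding 6, keeps the reported line numbers easy to correct if you change the header.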


How to remove a file if it is not used by another file

I have to clean a directory and its subdirectories by removing all unused files. (A file is considered unused if it is not linked to in any of the HTML files, or if it is not specified explicitly that this file is in use.) A file can be linked in an HTML file by either href or img src.
For example, I have I.html, 1.html, 2.html and a folder named 1. In I.html, an href uses 1.html and the 1 directory, but 2.html is not used by any other file. So, how can I remove the unused 2.html file?
use strict;
use warnings;
my($path,$regexExpression) = @ARGV;
my $fileNames = "data.txt";
my @abc= ();
if(not defined $path){
    die "File directory not given, please try again \n"
}
print "added file ";
if (not defined $regexExpression) {
    $regexExpression="*";
    print "--Taking default Regular Expression. \n"
}
if (defined $regexExpression) {
    print "The regular Expression : $regexExpression \n";
    my $directorypathx= `pwd`;
    my ($listofFileNames) = findFilesinDir($path);
    my ($listofLinks) = readallHrefInaFile();
    my ($listofImage) = readImageFile();
    print $listofLinks;
}
sub findFilesinDir{
    print "inside subroutines ", $path,"\n";
    my($pathName) = @_;
    my $fileNames =`find '$pathName' -name '$regexExpression' | sort -h -r > $fileNames ` ;
    if (-l $fileNames){
        return $fileNames;
    }
}
sub readallHrefInaFile{
    my $getAllLinks = ` grep -Eo "<a .*href=.*>" $path*.html | uniq ` ;
    push (@abc,$getAllLinks);
}
sub readImageFile{
    print "image files \n";
    my $getAllImage = ` grep -Eo "<img .*src=.*>" $path*.html | uniq `;
    push (@abc,$getAllImage);
}
print @abc;
I.html
<html>
<head>
<title>Index</title>
</head>
<body>
<h1>Index</h1>
1
<h1>Downloads</h1>
Compressed craters
<hr>
</body>
</html>
1.html
<html>
<head>
<title>1</title>
</head>
<body>
<h1>1</h1>
<img src="images/1-1.gif" />
<img src="images/1-2.gif" />
<hr>
</body>
</html>
The overall approach you show is reasonable, but there is a lot to say about the code itself. The place for that is Code Review, and I encourage you to submit your code there as well.
One overall comment is that there is no reason to reach so often for external tools; your program shells out to grep, find, sort, and pwd. Practically the whole job can be done with the abundance of tools that Perl provides.
Here is a simple example of what you need, where most of the work is done using modules.
The list of files to search for in our HTML is assembled using File::Find::Rule, recursively under $dir. Another option is the core File::Find module.
Even though HTML parsing appears simple in this case, it is much better to use a module for that as well, instead of a regex. HTML::TreeBuilder is a bit of a standard for what you need here. That module itself uses others, the workhorse being HTML::Element.
The following program works with one HTML file ($source_file), for which we need to find files under a given directory ($dir) that are not used in either an href attribute or a src attribute of an img tag. These files need to be deleted (that line is commented out).
use warnings;
use strict;
use feature 'say';
use File::Find::Rule;
use HTML::TreeBuilder;
my ($dir, $source_file) = @ARGV;
die "Usage: $0 dir-name file-name\n" if not $dir or not $source_file;
my @files = File::Find::Rule->file->in($dir);
#say for @files;
foreach my $file (@files) {
    next if $file eq $source_file; # not the file itself!
    say "Processing $file...";
    my $tree = HTML::TreeBuilder->new_from_file($source_file);
    my $esc_file = quotemeta $file;
    my @in_href = $tree->look_down( 'href', qr/$esc_file/ );
    my @in_img_src = $tree->look_down( _tag => 'img', 'src', qr/$esc_file/ );
    if (@in_href == 0 and @in_img_src == 0) {
        say "\tthis file is not used in 'href' or 'img-src' in $source_file";
        # To delete it uncomment the next line -- after all is fully tested
        #unlink $file or warn "Can't unlink $file: $!";
    }
}
The statement that actually removes files, using unlink, is of course commented out. Enable that only once you have thoroughly checked the final version of the script, and have made backups.
Notes
Refine what files you are looking for by adding "rules" with File::Find::Rule
I use quotemeta on filenames, which escapes all special characters in them; otherwise something may sneak in that would throw off the regex used by look_down
The code above simply parses twice through each file, assembling the lists of elements found for href attribute and then for src attribute (in img tag). This can be done in one pass, by using sub { } specification for criteria in look_down
The script must be invoked with the directory name and the main HTML file name. Please change that for proper command line parsing, and more sophisticated use, with Getopt::Long
A whole lot more can be fine tuned here, both with searching for files and in parsing HTML; there is a lot of information in modules' documentation, and yet more in many posts around this site.
The code is tested for simple cases; please adjust to your realistic needs.
Here is a full example of usage.
I place this script (script.pl) in a directory with a file I.html and a directory www.
The I.html file:
<!DOCTYPE html>
<html> <head> <title>Test handling of unused files</title> </head>
<body>
Used file from www
<img src="www/images/used.jpg" alt="no_image_really">
</body>
</html>
The directory www has files used.html and another.html, and a subdirectory images with files used.jpg and another.jpg in it, so altogether we have
.
├── script.pl
├── I.html
└── www
    ├── used.html
    ├── another.html
    └── images
        ├── used.jpg
        └── another.jpg
There is no need for any content in any of files in www for this test. This is only a minimal setup; I've added more files and directories, and tags to I.html, to test.
Then I run script.pl www I.html and get the expected output.
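For comparison, the same check can be sketched with Python's standard library. This is an illustration, not a port of the script above: the names LinkCollector and find_unused are made up here, and a file counts as used when its path appears anywhere inside an href value or an img src value, which mirrors the regex match in look_down. It collects both kinds of attribute in a single pass over the HTML.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect every href value and every <img src> value in one pass."""

    def __init__(self):
        super().__init__()
        self.targets = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if value is None:
                continue  # attribute without a value, nothing to compare
            if name == "href" or (tag == "img" and name == "src"):
                self.targets.append(value)


def find_unused(source_html, candidate_files):
    """Return the candidate paths that no href or img src value mentions."""
    collector = LinkCollector()
    collector.feed(source_html)
    return [
        path
        for path in candidate_files
        if not any(path in target for target in collector.targets)
    ]
```

The function only reports candidates; as with the Perl version, deleting anything should wait until the output has been checked and backups exist.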

what is the xpath syntax to grab html tag elements?

How do I print the title value for the HTML file below using xmlstarlet?
thufir@doge:~/.html$
thufir@doge:~/.html$ xmlstarlet sel -t -v "/html/header[@name='title']" -n hello.html
thufir@doge:~/.html$
thufir@doge:~/.html$ cat hello.html
<html>
<header><title>This is title</title></header>
<body>
Hello world
</body>
</html>
thufir@doge:~/.html$
Grabbing xml might be a bit different than html? Assuming garden-variety html and not xhtml.
The reason I'm using xmlstarlet is specifically to use xpath syntax which seems rather alien.
With:
"/html/header[@name='title']"
you select a header element which has an attribute name with the value "title".
What you want is to grab a title element inside a header element:
//header/title
or just use:
//title
which selects all title elements, regardless of their position in the tree.
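The difference between the attribute predicate and the element path can be tried out with Python's xml.etree.ElementTree, which supports a limited subset of XPath. This is only a sketch: it works here because the sample file happens to be well-formed XML, and real-world HTML often is not.

```python
import xml.etree.ElementTree as ET

# The hello.html content from the question.
DOC = """<html>
<header><title>This is title</title></header>
<body>
Hello world
</body>
</html>"""

root = ET.fromstring(DOC)  # root is the <html> element

# A predicate with @ tests an attribute: no header has name="title".
by_attribute = root.find("header[@name='title']")

# A path step selects a child element instead.
by_path = root.findtext("header/title")

# ".//title" finds title elements at any depth below the root.
anywhere = root.findall(".//title")

print(by_attribute)
print(by_path)
print(len(anywhere))
```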
I'd just cheat and use Chrome's Developer Tools.
Open the HTML in Chrome, open the Developer Tools, then in the Elements tab, right click and select Copy > Copy XPath.
/html/body/header/title

Pandoc HTML variables: `quotes` and `math`

Pandoc default HTML template contains these two variables:
quotes,
math.
How are they supposed to be used?
More specifically I see that quotes sets the values for the tag <q>. Is this tag used in markdown to HTML conversion?
tl;dr: they seem to be mostly obsolete legacies from previous versions of pandoc
quotes
A little archeology of pandoc commits shows that 'quotes' was added when pandoc switched from using <q> tags to directly adding quotes signs. A new option, --html-q-tags was added to keep the previous behavior: the option wraps quotes in <q> and sets quotes to true so that a piece of css code is added as explained in the html template. See this commit to pandoc and this commit to pandoc-templates. See the behavior with the following file:
"hello world"
This:
pandoc test.md -t html --smart --standalone
Produces (skipping the usual head, with no css affecting <q>)
<p>“hello world”</p>
While this
pandoc test.md -t html --standalone --html-q-tags --smart
produces (skipping the usual header)
<style type="text/css">q { quotes: "“" "”" "‘" "’"; }</style>
</head>
<body>
<p><q>hello world</q></p>
</body>
You have to use --smart though.
math
It looks like this was introduced to include math rendering scripts inside the standalone file. See this commit from 2010. I think some command-line options that pick a math rendering system other than the current default, like --mathml, set this variable to a value that actually makes sense (such as a copy of the math rendering script). Try:
pandoc -t html --mathml
For the quotes variable, see @scoa.
As regards the math variable, I found what follows.
When using MathML, that is the option --mathml, the code block:
$if(math)$
$math$
$endif$
in the default HTML conversion template adds a portability script to the HTML output.
Anyway, Chrome and Edge do not currently support MathML and Firefox seems to support it without this script.
So, for a custom template, removing the $if(math)$ ... code block will not affect MathML rendering.
When using MathJax, that is the option --mathjax, $if(math)$ ... adds to the HTML output the script block:
<script src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS_CHTML-full" type="text/javascript"></script>
This is always necessary to render the maths formulae.
When using the --latexmathml, a giant script, converting the LaTeX style math into MathML, is inserted by the $if(math)$ ... code block. Without this code block in the conversion template, the script is not inserted and the maths can't be rendered.

Automatic <a> around headings in Pandoc

This Markdown code:
# Introduction
Turns into this HTML code when compiled with Pandoc:
<h1 id="introduction"><a href="#...">Introduction</a></h1>
The way I use Markdown:
Generate HTML document
Edit it in MS Word to add page numbering
HTML version goes to blog, MS Word version goes to uni submissions
In CSS I can override link colors if they are inside H# tags, but MS Word has problems interpreting hierarchy of CSS overrides... and ends up with wrong colors anyway.
Is there a way to generate HTML without headings being wrapped in anchor tags, like below?
<h1 id="introduction">Introduction</h1>
In case there is no solution, here is a little PHP script I wrote to remove tags from headings that must be run on the resulting HTML file:
<?php
// Usage: php cleanheadings.php myhtmlfile.html
// Check that arguments were supplied
if(!isset($argv[1])) die('No input file, exiting');
// Load file
$content = file_get_contents($argv[1]);
// Cut out the <a> tag
$heading = '/(<h[1-6] id="[\w-]+">)(<a href="#[\w-]+">)(.+?)(<\/a>)(<\/h[1-6]>)/mu';
$clean = '$1$3$5';
$cleanhtml = preg_replace($heading,$clean,$content);
// Write changes back to file
file_put_contents($argv[1], $cleanhtml);
?>
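The same clean-up can be sketched in Python, in case PHP is not at hand. This assumes the heading shape that the regex above targets, namely an h1 to h6 tag with an id, immediately followed by an a tag whose href starts with "#"; the function name strip_heading_links is made up for the example.

```python
import re

# Heading open tag, the wrapping <a href="#...">, the heading text, closing tags.
HEADING_LINK = re.compile(
    r'(<h[1-6] id="[\w-]+">)<a href="#[\w-]+">(.+?)</a>(</h[1-6]>)'
)


def strip_heading_links(html):
    """Drop the <a> wrapper inside headings, keeping the heading text."""
    return HEADING_LINK.sub(r"\1\2\3", html)
```

The non-greedy `.+?` keeps the match from running across two links on the same line, and links outside headings are left alone.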

How can I parse place holder text in a HTML file which are then replaced with custom tags?

First, a bit of background information. I create HTML emails at my workplace and the whole process is very tedious. It goes a little like this:
1. Code markup for HTML using tables and some CSS
2. Parse HTML and CSS using Premailer so all CSS is inline
3. Test HTML works in all email clients
4. Create a copy of the inline version of the HTML and start adding in the proprietary variables of the email tool used for sending emails, i.e. <%=constant.first_name%>, <%=unsubscribe_link%>
5. Test in email client to see if it works and client is happy. If not, repeat steps 1 through 5 again.
So as you can see it gets really tedious after a while.
What I would like to do is create a command line script similar to Premailer which allows me to parse a HTML file with variables stored in it without destroying the example text already in the HTML. That way when you are previewing the HTML it all looks dandy.
For example...
Store the first name function as a variable for own use.
$first_name = "<%=constant.first_name%>"
Then tell the parser what word(s) to replace with the appropriate variable.
<p>My name is <!-- $first_name -->Gavin<!-- /$first_name --></p>
So that the final output looks something like:
<p>My name is <%=constant.first_name%></p>
Would such a thing be possible? Is there a better syntax I could use, such as a custom tag like <first_name>Gavin</first_name>, if the browser can handle it?
Any advice is helpful. :)
I've seen this done before using a syntax like:
{assign_variable:first_name="Jesse"}
Then, you could use it like:
{first_name}
The way you'd parse this (provided you're using PHP) would be something like:
<?php
// Our Template Code
$strHTML = <<<EOT
{assign_variable:first_name="Jesse"}
{assign_variable:last_name="Bunch"}
Hello, {first_name}!
EOT;
// Get all the variables
$arrMatches = array();
preg_match_all('/\{assign\_variable\:([a-zA-Z\_\-]*)\=\"([a-zA-Z0-9]+)\"\}/', $strHTML, $arrMatches);
// Remove the assign_variable tags
$strHTML = preg_replace('/\{assign\_variable\:([a-zA-Z\_\-]*)\=\"([a-zA-Z0-9]+)\"\}/', '', $strHTML);
// Combine them into key/values
$arrVariables = array_combine($arrMatches[1], $arrMatches[2]);
foreach($arrVariables as $key=>$value) {
// Replace the variable occurrences
$strHTML = str_replace('{' . $key . '}', $value, $strHTML);
}
// Send the parsed template
echo $strHTML;
Which outputs:
Hello, Jesse!
Note, this is a very basic example. Here are some improvements to make on this code before using it in production:
Edit the regex to allow the right characters.
Maybe implement a better replacement method than a loop
Check for parse errors
Benchmark performance
All in all, I think you get the idea. Hope this points you in the right direction.
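The comment-marker syntax from the question can be handled in a similar way. Here is a minimal Python sketch, assuming markers of the form <!-- $name --> ... <!-- /$name --> as written in the question; the function name replace_placeholders and the variables dictionary are made up for the example.

```python
import re

# Opening marker, the preview text to discard, and the matching closing marker.
MARKER = re.compile(r"<!--\s*\$(\w+)\s*-->.*?<!--\s*/\$\1\s*-->", re.DOTALL)


def replace_placeholders(html, variables):
    """Swap each marked region for the value registered under its name."""

    def substitute(match):
        name = match.group(1)
        # Unknown names are left untouched so the preview text survives.
        return variables.get(name, match.group(0))

    return MARKER.sub(substitute, html)
```

Because the markers are ordinary HTML comments, the file still previews with the example text in a browser, and only the processed copy contains the email tool's tags.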
I have a similar situation
I have created a "format template" like this:
<?php // section1 $var1/$var2 ?>
<head>
<title>$var1</title>
<meta name="description" content="$var2">
</head>
<?php // section2 $var1/$var2 ?>
<body>
hello: <p>$var1</p>
news for you: <p>$var2</p>
</body>
It is valid PHP code and valid HTML code, so you can edit it with Dreamweaver or similar, and you can also host it.
Then a PHP script replaces all occurrences of the vars in all sections.
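The replacement step itself is not shown above. As an illustration of the idea only, and not the PHP script the answer refers to, Python's string.Template understands the same $var style; the function name render and the sample values are assumptions.

```python
from string import Template

# Part of the "format template" from above, with $var1 and $var2 placeholders.
TEMPLATE = """<head>
<title>$var1</title>
<meta name="description" content="$var2">
</head>"""


def render(template_text, values):
    """Fill in known $names and leave unknown ones in place."""
    return Template(template_text).safe_substitute(values)


print(render(TEMPLATE, {"var1": "Hello", "var2": "News for you"}))
```

safe_substitute is used rather than substitute so that a placeholder with no value stays visible in the output instead of raising an error.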