bash: grep only the latest database number from HTML

curl -s http://virusradar.com/en/update/info/latest | grep -oP "(?<=<h1>Update )[0-9]+"
This worked while the database number appeared in the page's header.
Now that it has been removed, how should I change the request?
Any help is appreciated.

That URL appears to be defunct:
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>301 Moved Permanently</title>
</head><body>
<h1>Moved Permanently</h1>
<p>The document has moved here.</p>
</body></html>
What do you mean by "base"?
Please provide the original form of the command, so we could see how it was (incorrectly?) modified.
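Not an answer as such, but one way to investigate: follow the redirect with -L and look at what heading the new page actually carries, then adjust the lookbehind to the current markup. The pattern below is only a guess at the page's structure:
curl -sL http://virusradar.com/en/update/info/latest | grep -oE "<h[1-6][^>]*>[^<]*"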

Related

How can I add header metadata without adding the <h1>?

I'm writing something in Markdown and converting it to HTML with pandoc, but when I add the title variable in the YAML header, it also adds an <h1> to the top of the document, which I don't want. The pandoc documentation says to use the title-meta variable, but then it still says
[WARNING] This document format requires a nonempty <title> element.
Is there a way to set the title without adding the title block?
The command I'm using:
pandoc -s "file.md" -o "file.html"
output of pandoc --version:
pandoc 2.10.1
Compiled with pandoc-types 1.21, texmath 0.12.0.2, skylighting 0.8.5
Default user data directory: C:\Users\noah\AppData\Roaming\pandoc
Copyright (C) 2006-2020 John MacFarlane
Web: https://pandoc.org
This is free software; see the source for copying conditions.
There is no warranty, not even for merchantability or fitness
for a particular purpose.
One can set an explicit title with --metadata=title="My title" while simultaneously preventing the output of the <h1> and <header> elements by setting the template variable title to an empty string:
pandoc --metadata=title="Fancy title" --variable=title="" ...
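Putting it together with the command from the question (same file names assumed):
pandoc -s "file.md" -o "file.html" --metadata=title="Fancy title" --variable=title=""
This produces a nonempty <title> element without the <h1>/<header> block at the top of the body.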

How to remove a file if it is not used by another file

I have to clean a directory and its subdirectories by removing all unused files. (A file is considered unused if it is not linked to in any of the HTML files and it is not explicitly marked as being in use.) A file can be linked in an HTML file by either an href or an img src attribute.
For example, I have I.html, 1.html, 2.html and a folder named 1. In I.html, an href references 1.html and the 1 directory, but 2.html is not used by any other file. So, how can I remove the unused 2.html file?
use strict;
use warnings;

my ($path, $regexExpression) = @ARGV;
my $fileNames = "data.txt";
my @abc = ();

if (not defined $path) {
    die "File directory not given, please try again\n";
}
print "added file ";
if (not defined $regexExpression) {
    $regexExpression = "*";
    print "--Taking default Regular Expression.\n";
}
if (defined $regexExpression) {
    print "The regular Expression : $regexExpression\n";
    my $directorypathx = `pwd`;
    my ($listofFileNames) = findFilesinDir($path);
    my ($listofLinks)     = readallHrefInaFile();
    my ($listofImage)     = readImageFile();
    print $listofLinks;
}

sub findFilesinDir {
    print "inside subroutines ", $path, "\n";
    my ($pathName) = @_;
    my $fileNames = `find '$pathName' -name '$regexExpression' | sort -h -r > $fileNames`;
    if (-l $fileNames) {
        return $fileNames;
    }
}

sub readallHrefInaFile {
    my $getAllLinks = `grep -Eo "<a .*href=.*>" $path*.html | uniq`;
    push(@abc, $getAllLinks);
}

sub readImageFile {
    print "image files\n";
    my $getAllImage = `grep -Eo "<img .*src=.*>" $path*.html | uniq`;
    push(@abc, $getAllImage);
}

print @abc;
I.html
<html>
<head>
<title>Index</title>
</head>
<body>
<h1>Index</h1>
1
<h1>Downloads</h1>
Compressed craters
<hr>
</body>
</html>
1.html
<html>
<head>
<title>1</title>
</head>
<body>
<h1>1</h1>
<img src="images/1-1.gif" />
<img src="images/1-2.gif" />
<hr>
</body>
</html>
The overall approach you show is reasonable, but there is a lot to say about the code itself; the place for that would be Code Review, and I encourage you to submit your code there as well.
One overall comment: there is no reason to reach so often for external tools. Your program shells out to grep, find, sort, and pwd; we can practically always do the whole job with the abundance of tools that Perl itself provides.
Here is a simple example of what you need, where most of the work is done using modules.
The list of files to search for in our HTML is assembled using File::Find::Rule, recursively under $dir. Another option is the core File::Find module.
Even though the HTML parsing appears simple in this case, it is much better to use a module for that as well, instead of a regex. HTML::TreeBuilder is more or less the standard for what you need here. That module itself uses others, the workhorse being HTML::Element.
The following program works with one HTML file ($source_file), and finds the files under a given directory ($dir) that are not used in either an href attribute or the src attribute of an img tag. Those files need to be deleted (that line is commented out).
use warnings;
use strict;
use feature 'say';

use File::Find::Rule;
use HTML::TreeBuilder;

my ($dir, $source_file) = @ARGV;
die "Usage: $0 dir-name file-name\n" if not $dir or not $source_file;

my @files = File::Find::Rule->file->in($dir);
#say for @files;

foreach my $file (@files) {
    next if $file eq $source_file;  # not the file itself!
    say "Processing $file...";

    my $tree = HTML::TreeBuilder->new_from_file($source_file);

    my $esc_file = quotemeta $file;

    my @in_href    = $tree->look_down( 'href', qr/$esc_file/ );
    my @in_img_src = $tree->look_down( _tag => 'img', 'src', qr/$esc_file/ );

    if (@in_href == 0 and @in_img_src == 0) {
        say "\tthis file is not used in 'href' or 'img-src' in $source_file";
        # To delete it uncomment the next line -- after all is fully tested
        #unlink $file or warn "Can't unlink $file: $!";
    }
}
The statement that actually removes files, using unlink, is of course commented out. Enable it only once you have thoroughly tested the final version of the script, and have made backups.
Notes
Refine which files you are looking for by adding "rules" with File::Find::Rule
I use quotemeta on the filenames, which escapes all special characters in them; otherwise something may sneak in that would throw off the regex used by look_down
The code above simply makes two passes through each parsed file, assembling the list of elements found for the href attribute and then for the src attribute (in img tags). This can be done in one pass, by using a sub { } specification for the criteria in look_down; see the sketch after these notes
The script must be invoked with the directory name and the main HTML file name. For proper command-line parsing, and for more sophisticated use, switch to Getopt::Long; see the sketch at the end of this answer
A whole lot more can be fine-tuned here, both in searching for files and in parsing HTML; there is a lot of information in the modules' documentation, and yet more in many posts around this site
The code is tested for simple cases; please adjust it to your real needs.
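For the one-pass variant mentioned in the notes, here is a minimal sketch, reusing $tree and $esc_file from the program above (the code-ref is called once for every element in the tree):
my @used = $tree->look_down( sub {
    my ($e) = @_;
    my $href = $e->attr('href');                            # any element with an href
    my $src  = $e->tag eq 'img' ? $e->attr('src') : undef;  # src only on img tags
    return (defined $href && $href =~ /$esc_file/)
        || (defined $src  && $src  =~ /$esc_file/);
});
If @used is empty, the file is referenced nowhere.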
Here is a full example of usage.
I place this script (script.pl) in a directory with a file I.html and a directory www.
The I.html file:
<!DOCTYPE html>
<html> <head> <title>Test handling of unused files</title> </head>
<body>
Used file from www
<img src="www/images/used.jpg" alt="no_image_really">
</body>
</html>
The directory www has files used.html and another.html, and a subdirectory images with files used.jpg and another.jpg in it, so altogether we have
.
├── script.pl
├── I.html
└── www
├── used.html
├── another.html
└── images
├── used.jpg
└── another.jpg
There is no need for any content in any of the files under www for this test. This is only a minimal setup; I've added more files and directories, and more tags to I.html, to test.
Then I run script.pl www I.html and get the expected output.
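For the Getopt::Long change suggested in the notes, a minimal sketch (the option names here are my own, not part of the original script):
use Getopt::Long;
my ($dir, $source_file);
GetOptions(
    'dir=s'  => \$dir,          # directory to scan for candidate files
    'file=s' => \$source_file,  # the main HTML file
) or die "Usage: $0 --dir dirname --file filename\n";
With that, the script would be invoked as script.pl --dir www --file I.html instead.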

Unable to update confluence page with content from an HTML file using curl

I am trying to update a Confluence page with some HTML content. I have this HTML content in a separate file named Output.html in the same location. I cannot simply copy and paste the HTML content into this script, as it is a huge amount of data, and I also need to execute the script dynamically.
curl -u user:pass -X PUT -H 'Content-Type: application/json' -d'{"id":"2196","type":"page","title":"Main page","space":{"key":"AB"},"body":{"storage":{"value":"<p> Text </p>","representation":"storage"}},"version":{"number":2}}' https://Client.atlassian.net/wiki/rest/api/content/2196 | python -mjson.tool
For example, my HTML file content is as follows:
<!DOCTYPE html> <html> <head> <title>Page Title</title> </head> <body> <h1>My First Heading</h1> <p>My first paragraph.</p> </body> </html>
I need this to be posted to my Confluence page as HTML content, fetched directly from the HTML file into the "value":"<p> Text </p>" part of the script.
When I manually copy sample HTML content into that value, the page successfully shows the HTML content.
I got this working using Python and its requests module. See the code below:
import requests

url = 'https://Client.atlassian.net/wiki/rest/api/content/87440'
headers = {'Content-Type': "application/json", 'Accept': "application/json"}

# Read the HTML payload from the file
with open("file.html", "r") as f:
    html = f.read()

data = {}
data['id'] = "87440"
data['type'] = "page"
data['title'] = "Data Page"
data['space'] = {"key": "AB"}
data['body'] = {"storage": {"representation": "storage"}}
data['version'] = {"number": 4}
print(data)

data['body']['storage']['value'] = html
print(data)

res = requests.put(url, json=data, headers=headers, auth=('Username', 'Password'))
print(res.status_code)
res.raise_for_status()
Feel free to ask if you have any questions.
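If you would rather keep everything in the shell, here is a sketch that stays with curl and uses jq (1.6 or newer, for --rawfile) to build the JSON payload from Output.html, so the HTML is properly escaped:
# build the payload from the file, then PUT it (adjust id/title/space/version)
jq -n --rawfile html Output.html '{
  id: "2196", type: "page", title: "Main page",
  space: {key: "AB"},
  body: {storage: {value: $html, representation: "storage"}},
  version: {number: 2}
}' | curl -u user:pass -X PUT -H 'Content-Type: application/json' \
       -d @- https://Client.atlassian.net/wiki/rest/api/content/2196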

HTML-to-RTF document conversion, preserving classes as styles

I need an HTML2RTF tool, that is, software that converts the HTML format to the RTF format... but not just any conversion: I need to preserve the HTML class attributes (e.g. of paragraphs) as MS Word "styles".
My first option was a LibreOffice terminal command, something like
libreoffice --convert-to
because LibreOffice Writer has the biggest community and presumably the best conversion support... but I was disappointed: it does not preserve class attributes as styles, even when testing as a user in the graphical interface.
I need a Linux solution (AbiWord did not solve it either)... or, as a last option, a web service that is easy to plug into an intranet's Windows server.
Input sample:
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title>sample1 doc</title>
<!-- no style needed, but anything can be declared here; it doesn't matter -->
<style type="text/css">
.myStyle1 {color: #F00;} .myStyle2 {color: #880;}
.a {color: #00F;} .b {color: #088;}
</style>
</head>
<body><!-- important to preserve class names -->
<p class="myStyle1">Hello in <i>style#1</i>.
<span class="a">SPAN S1</span>.</p>
<p class="myStyle2">... Hello in style#2...</p>
<p class="myStyle1">Bye <span class="b">S2</span>.</p>
</body>
</html>
In MS Word this sample is imported and looks OK, with styles where the classes were.
In LibreOffice (and the libreoffice terminal tools) it is not.
So, is there another tool for LibreOffice? Is there a tool for Linux?
PS: as a last possibility, if there is none for Linux, a web service for Windows and MS Office.
This works for me in LibreOffice 4.3.3.2. I just opened the HTML file you provided, and I can see styles named Text.Body.myStyle1 and myStyle2.
Some clues for Debian Stable and Ubuntu LTS (64-bit)... see this How-To. Basic steps:
sudo apt-get remove libreoffice*
wget http://download.documentfoundation.org/libreoffice/stable/4.3.3/deb/x86_64/LibreOffice_4.3.3_Linux_x86-64_deb.tar.gz
tar -xzvf LibreOffice_4.3.3_Linux_x86-64_deb.tar.gz
cd LibreOffice_4.3.3*_Linux_x86-64_deb/DEBS
sudo dpkg -i *.deb
After v4.3.3, you also need to install:
sudo apt-get install libreoffice-writer
then run the command cited above:
libreoffice --headless --convert-to rtf libreTeste.html
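To check that the conversion kept the classes, a quick sketch (the exact style names LibreOffice generates may vary by version):
libreoffice --headless --convert-to rtf libreTeste.html
# the RTF stylesheet table should now mention the class names
grep -o 'myStyle[12]' libreTeste.rtf | sort -u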

Do you know an HTML snippet validator?

I'm searching for a tool that would allow me to check whether a certain snippet of HTML is valid in its proper context.
I'd input something like
<dd>
my definition
<div>
div inside <dd> is allowed
</div>
</dd>
instead of the whole document. An ordinary validator will complain about the missing dl tag, but most of the time I just want to know whether a certain element is valid inside another one or not.
I'll try to explain in more detail. Consider the following snippet:
<form>
<label>Name: <input /></label>
</form>
This would be valid, but to check it I have two options:
Validate the whole document: most of the time this is good enough, but sometimes, when I'm working on partial HTML snippets or embedded HTML, it's quite a lot of trouble. I'd have to copy the whole thing into a new HTML document and validate that.
Just copy the snippet, validate it with the W3C validator, and ignore some of the errors.
Basically, I'd like to check whether an element contains only elements it is allowed to contain.
You can actually use the W3C validator to check a snippet.
Choose the 'Validate by Direct Input' tab and select More Options. In there is a radio button for 'Validate HTML fragment': http://validator.w3.org/#validate_by_input+with_options
It will wrap your snippet in valid HTML, so any errors you see are due only to your snippet.
The W3C does currently (May 2020) have a fragment validator, but it seems to have bitrotted (no HTML5 support, at least). Here are a couple of simple scripts that attach a header and a footer to your fragment and run the result through a local copy of the Nu checker. The fragment can be anything that is valid at the top level of a <body> tag; modify the header and footer if you need something else.
validate1.sh takes a single filename argument and checks it, while validate2.sh cycles through all HTML files in a directory. The latter has a simple exclusion-list mechanism, which you'll need to change. You'll need to modify both scripts to point to your copy of vnu.jar.
validate1.sh:
#!/bin/bash
#
# validate1.sh
#
# Run the nu validator on the HTML fragment in the supplied filename.
# This script adds a header and trailer around the fragment, and supplies
# the result to 'vnu.jar'. You'll need to modify it to correctly locate
# vnu.jar.
if test "$#" -ne 1; then
echo "Usage: '$0 fname', where 'fname' is the HTML file to be linted"
exit 1
fi
var="<!doctype html>
<html lang=\"en\">
<head>
<title>foo</title>
</head>
<body>
$(< "$1")
</body>
</html>"
echo "Checking '$1'... subtract 6 from any reported line numbers"
echo "$var" | java -jar vnu.jar -
validate2.sh:
#!/bin/bash
#
# validate2.sh
#
# Run the nu validator on the HTML fragments in the supplied directory. This
# script adds a header and footer around each fragment in the directory, and
# supplies the result to 'vnu.jar'. You'll need to modify it to correctly
# locate vnu.jar.
if test "$#" -ne 1; then
echo "Usage: '$0 fname', where 'fname' is the HTML directory to be linted"
exit 1
fi
for filename in "$1"/*.html; do
case $filename in
# simple exclusion list example:
"$1/root.html" | "$1/sitedown.html")
echo "Skipping '$filename'"
continue
;;
*)
;;
esac
var="<!doctype html>
<html lang=\"en\">
<head>
<title>foo</title>
</head>
<body>
$(< "$filename")
</body>
</html>"
echo "Checking '$filename'... subtract 6 from any reported line numbers"
echo "$var" | java -jar vnu.jar -
done
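Typical invocations, assuming vnu.jar sits in the current directory:
./validate1.sh snippet.html
./validate2.sh fragments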