I have a folder on an FTP server with about a hundred subfolders, each with its own index.html.
I want to add a <link rel="stylesheet" href="https://subdomain.domain.fr/vad/client/build/iconfont.css">
in each index.html
The subdomain varies and can be captured from another stylesheet link, e.g.:
<link rel="stylesheet" href="https://subdomain.domain.fr/vad/client/build/theme.css">
I tried this :
find . -type f -name index.html -exec sed -i 's/<link rel="stylesheet" href="https:\/\/\(*\).domain.fr\/vad\/client\/build\/theme.css">/<link rel="stylesheet" href="https:\/\/\1.domain.fr\/vad\/client\/build\/theme.css"><link rel="stylesheet" href="https:\/\/\1.domain.fr\/vad\/client\/build\/iconfont.css">/g' {} \;
I used a capture group and back-references, but it's not working.
For ease and readability, change the delimiter from / to, say, #. You also have to escape literal dots in the search pattern:
sed -i 's#<link rel="stylesheet" href="https://\(*\)\.domain\.fr/vad/client/build/theme\.css">#<link rel="stylesheet" href="https://\1.domain.fr/vad/client/build/theme.css"><link rel="stylesheet" href="https://\1.domain.fr/vad/client/build/iconfont.css">#g'
From there, I can see a mistake in your capture group: you wrote \(*\), but I suspect you meant \(.*\) :) (otherwise, in a basic regular expression, a * placed right after \( is treated as a literal asterisk, so the group captured nothing useful).
Now, it looks like you are replacing one word with another in order to change the CSS file. Since it appears in a specific kind of line, you can perform a simple replacement on the lines matching that pattern ;)
sed -i '/<link rel="stylesheet" href="https:\/\/.*\.domain\.fr\/vad\/client\/build/s#theme#iconfont#'
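The corrected capture group can be checked on a single sample line before touching any files. The following is a minimal sketch, not the answerer's exact command: it assumes the subdomain contains no dots or slashes (hence `[^./]*` instead of `.*`), and it uses `&` to re-emit the matched theme link so that the iconfont link is added rather than substituted. The function name `add_iconfont` is made up for illustration.

```shell
# Sample line taken from the question; "subdomain" stands for any subdomain.
line='<link rel="stylesheet" href="https://subdomain.domain.fr/vad/client/build/theme.css">'

# \([^./]*\) captures the subdomain (assumed to contain no dot or slash).
# & re-emits the whole matched theme link, and \1 reuses the captured
# subdomain in the appended iconfont link.
add_iconfont() {
  sed 's#<link rel="stylesheet" href="https://\([^./]*\)\.domain\.fr/vad/client/build/theme\.css">#&<link rel="stylesheet" href="https://\1.domain.fr/vad/client/build/iconfont.css">#'
}

result=$(printf '%s\n' "$line" | add_iconfont)
printf '%s\n' "$result"
```

Once the output looks right, the same expression can be passed to sed -i inside the find ... -exec command from the question (with GNU sed).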
Using Perl and the Mojo::DOM HTML parser to edit your HTML:
use strict; use warnings;
use Mojo::DOM;
# Slurp the whole HTML as string
my $html = join "", <>;
my $dom = Mojo::DOM->new($html);
# Fetch domain name
$_ = $dom
->find('link[href][rel="stylesheet"]')
->map(attr => 'href')
->last;
my ($domain) = m|^https?://([^/]+)/|
or die "No match https?!\n";
# Find/append
$dom
->find('head > link[href][rel="stylesheet"]')
->last
->append(
"\n" .
'<link rel="stylesheet" href="https://' .
$domain .
'/build/iconfont.css" />'
);
# Render
print "$dom";
Output
Example of one file:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html lang="fr" xml:lang="fr" xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta content="text/html; charset=UTF-8" http-equiv="Content-Type">
<link href="https://subdomain.domain.fr/build/theme.css" rel="stylesheet">
<link href="https://subdomain.domain.fr/build/iconfont.css" rel="stylesheet">
<title></title>
</head>
<body>
POUET
</body>
</html>
Usage
First test the script against some files without sponge.
Then, if tests are satisfactory:
#!/bin/bash
shopt -s globstar # enable recursion **
for h in **/*.html; do
perl Mojo::DOM.pl "$h" | sponge "$h"
done
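If sponge (from moreutils) is not installed, a temporary file can play the same role. This is only a sketch under that assumption: `rewrite_in_place` is a hypothetical helper name, and the demo uses sed as a stand-in for the Perl script so that it runs anywhere.

```shell
# rewrite_in_place FILE CMD [ARGS...]
# Runs "CMD ARGS... FILE", captures its output in a temporary file, and
# overwrites FILE only if the command succeeded.
rewrite_in_place() {
  local file=$1 tmp
  shift
  tmp=$(mktemp) || return 1
  if "$@" "$file" > "$tmp"; then
    cat "$tmp" > "$file"   # write back into the original file, keeping its permissions
    rm -f "$tmp"
  else
    rm -f "$tmp"           # leave the original untouched on failure
    return 1
  fi
}

# Demo on a throwaway file, with sed standing in for the Perl script.
demo_dir=$(mktemp -d)
printf 'pouet\n' > "$demo_dir/index.html"
rewrite_in_place "$demo_dir/index.html" sed 's/pouet/POUET/'
cat "$demo_dir/index.html"
```

Inside the loop, the call would then be rewrite_in_place "$h" perl Mojo::DOM.pl instead of the pipe into sponge.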
Related
I have for example a bunch of HTML pages like this :
<!DOCTYPE html>
<html>
<head><title>Table des matières</title>
<meta http-equiv="Content-Type" content="text/html; charset="utf-8"" />
<meta name="generator" content="TeX4ht (http://www.tug.org/tex4ht/)" />
<meta name="originator" content="TeX4ht (http://www.tug.org/tex4ht/)" />
<!-- 3,html,xhtml,charset="utf-8" -->
<meta name="src" content="content_final.tex" />
<link rel="stylesheet" type="text/css" href="content_final.css" />
<script type="text/javascript" src="./jquery.js">
</script>
<script type="text/javascript">
$(document).ready(function() {
function capitalizeFirstLetter(string) {
return string.charAt(0).toUpperCase() + string.slice(1).toLowerCase();
}
$('div.caption span.id').each(function() { var result = $(this).text().replace(':','');
result=capitalizeFirstLetter(result);
$(this).text(result);
});
});
</script>
</head><body
>
<!--l. 125--><div class="crosslinks"><p class="noindent">[<a
href="chapter1.html" >next</a>] [tail] [<a
href="/sciences/index.html" >up</a>] </p></div>
<h2 class="likechapterHead"><a
id="x2-1000"></a>Table des matières</h2>
<div class="tableofcontents">
But I cannot get the French accents in these HTML pages converted: for example, the accented letter in "Table des matières" above
is displayed as a garbled sequence instead of "è".
I tried 2 things :
for i in $(ls *.html); do iconv -f iso-8859-1 -t utf8 $i > $i"_new"; mv -f $i"_new" $i; done
=> the accents are not converted
for i in $(ls *.html); do recode ..html $i; done
=> I have the following errors :
recode: section5.html failed: Invalid input in step `CHAR..ISO-10646-UCS-2'
recode: section6.html failed: Invalid input in step `CHAR..ISO-10646-UCS-2'
recode: section7.html failed: Invalid input in step `CHAR..ISO-10646-UCS-2'
recode: section8.html failed: Invalid input in step `CHAR..ISO-10646-UCS-2'
recode: section9.html failed: Invalid input in step `CHAR..ISO-10646-UCS-2'
recode: table_of_contents.html failed: Invalid input in step `CHAR..ISO-10646-UCS-2'
...
I don't know how to convert all these French accents.
Does anyone have an idea or suggestion for converting all possible French accents? I would like to use the iconv, recode or sed commands.
UPDATE 1: taking a basic example, here is the message I get for a single file :
$ recode ..html table_of_contents.html
recode: table_of_contents.html failed: Invalid input in step `CHAR..ISO-10646-UCS-2'
What's wrong ?
UPDATE 2: here is the output of my original HTML pages :
$ file -i index.html
index.html: text/x-tex; charset=iso-8859-1
and the head of the index.html :
<!DOCTYPE html>
<html>
<head><title>Table des matières</title>
<meta http-equiv="Content-Type" content="text/html; charset="utf-8"" />
<meta name="generator" content="TeX4ht (http://www.tug.org/tex4ht/)" />
<meta name="originator" content="TeX4ht (http://www.tug.org/tex4ht/)" />
<!-- 3,html,xhtml,charset="utf-8" -->
<meta name="src" content="content_final.tex" />
<link rel="stylesheet" type="text/css" href="content_final.css" />
<script type="text/javascript" src="./jquery.js">
</script>
<script type="text/javascript">
$(document).ready(function() {
function capitalizeFirstLetter(string) {
return string.charAt(0).toUpperCase() + string.slice(1).toLowerCase();
}
$('div.caption span.id').each(function() { var result = $(this).text().replace(':','');
result=capitalizeFirstLetter(result);
$(this).text(result);
If I apply the command :
$ recode -vfd u8..html index.html
Request: UTF-8..:libiconv:..ISO-10646-UCS-2..HTML_4.0
Shrunk to: UTF-8..ISO-10646-UCS-2..HTML_4.0
Recoding index.html... done
and the file then looks like this:
<!DOCTYPE html>
<html>
<head><title>Table des matires</title>
<meta http-equiv="Content-Type" content="text/html; charset="utf-8"" />
<meta name="generator" content="TeX4ht (http://www.tug.org/tex4ht/)" />
<meta name="originator" content="TeX4ht (http://www.tug.org/tex4ht/)" />
<!-- 3,html,xhtml,charset="utf-8" -->
<meta name="src" content="content_final.tex" />
<link rel="stylesheet" type="text/css" href="content_final.css" />
<script type="text/javascript" src="./jquery.js">
</script>
<script type="text/javascript">
$(document).ready(function() {
function capitalizeFirstLetter(string) {
return string.charAt(0).toUpperCase() + string.slice(1).toLowerCase();
}
$('div.caption span.id').each(function() { var result = $(this).text().replace(':','');
result=capitalizeFirstLetter(result);
$(this).text(result);
});
});
</script>
as you can see, the "è" has disappeared.
What can I do ?
Assuming the source file encoding is UTF-8, the following command worked in my environment:
$ recode -vfd u8..html index.html
Output:
$ locale charmap
UTF-8
$ file -i index.html
index.html: text/html; charset=utf-8
$ recode -vfd u8..html index.html
Request: UTF-8..:iconv:..ISO-10646-UCS-2..HTML_4.0
Shrunk to: UTF-8..ISO-10646-UCS-2..HTML_4.0
Recoding index.html... done
You can use the command options to debug the error:
-v Verbose output. Useful for finding the step in which the error occurred.
-f Forces completion even if an error occurs. You can compare the output file with the original to find which character or location is causing trouble.
-d For HTML, recode does not convert ASCII characters. This avoids converting HTML characters such as < > " &.
Update: if the encoding/charset is iso-8859-1, then you need to use:
$ recode -vfd iso-8859-1..html index.html
Request: ISO-8859-1..ISO-10646-UCS-2..HTML_4.0
Recoding index.html... done
# Or use the following:
$ recode -vfd lat1..html index.html
Request: ISO-8859-1..ISO-10646-UCS-2..HTML_4.0
Recoding index.html... done
ISO-8859-1 has the following aliases in recode:
l1
lat1
latin1
Latin-1
819/CR-LF
CP819/CR-LF
CSISOLATIN1
IBM819/CR-LF
ISO8859-1
iso-ir-100
ISO_8859-1
ISO_8859-1:1987
You can use any one of the above in the command.
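To see what the Latin-1 source charset means at the byte level, here is a small sketch that uses iconv (the other tool mentioned in the question) instead of recode. It converts to UTF-8 rather than to HTML entities, so it illustrates the source charset only, not the recode command above. The sample string and the helper name `latin1_sample` are made up for illustration.

```shell
# Emit "Table des matières" encoded as ISO-8859-1:
# byte 0xE8 (octal 350) is the Latin-1 code for "è".
latin1_sample() { printf 'Table des mati\350res\n'; }

# Convert from ISO-8859-1 to UTF-8; the single byte 0xE8 becomes
# the two-byte UTF-8 sequence 0xC3 0xA8.
utf8_out=$(latin1_sample | iconv -f ISO-8859-1 -t UTF-8)

printf '%s\n' "$utf8_out"
```

If the declared source charset does not match the actual bytes in the file, the conversion will typically fail or produce garbled text, which is why checking with file -i first is worthwhile.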
I would like to display a simple HTML page in a PowerShell dialog box.
This is the way to build a dialog with dialog.ps1:
[void] [System.Reflection.Assembly]::LoadWithPartialName("System.Drawing")
[void] [System.Reflection.Assembly]::LoadWithPartialName("System.Windows.Forms")
$objForm = New-Object System.Windows.Forms.Form
[void] $objForm.ShowDialog()
In this window I would like to display a webpage like index.html:
<!DOCTYPE html>
<html lang="de">
<head>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
<title>Title</title>
</head>
<body>
Hello world!
</body>
</html>
Of course, the real webpage has a few more elements, such as a picture with an image map.
If this also worked from CMD, I would like that option even more.
The following snippet - which uses PSv5+[1] syntax for convenience - demonstrates use of the WebBrowser control to display HTML text in a WinForms dialog:
# PSv5+:
# Import namespaces so that types can be referred by
# their mere name (e.g., `Form` rather than `System.Windows.Forms.Form`)
#
using namespace System.Windows.Forms
using namespace System.Drawing
# Load the WinForms assembly.
Add-Type -AssemblyName System.Windows.Forms
# Create a form.
$form = [Form] @{
ClientSize = [Point]::new(400, 400)
Text = "WebBrowser-Control Demo"
}
# Create a web-browser control, make it as large as the inside of the form,
# and assign the HTML text.
$sb = [WebBrowser] @{
ClientSize = $form.ClientSize
DocumentText = @'
<!DOCTYPE html>
<html lang="de">
<head>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
<title>Title</title>
</head>
<body>
Hello world!
</body>
</html>
'@
}
# Add the web-browser control to the form...
$form.Controls.Add($sb)
# ... and display the form as a dialog (synchronously).
$form.ShowDialog()
# Clean up.
$form.Dispose()
[1] The code also works in PowerShell [Core] v7+, but not in PowerShell Core v6.x, because the latter fundamentally did not support WinForms (and WPF).
I'm using a tool to generate some html that looks something like this:
<html>
<head>
<title>Blah</title>
<style>
/* stuff */
</style>
</head>
But I'd like a way to replace that style tag with some custom styling
<link rel="stylesheet" href="style.css">
possibly with awk or sed so that I can add it to my Makefile.
Is this possible?
awk to the rescue!
This is not xml/html aware but a basic text substitution...
$ awk '/<style>/ {f=1}
!f;
/<\/style>/ {f=0;
print "<link rel=\"stylesheet\" href=\"style.css\">"}' file
will give
<html>
<head>
<title>Blah</title>
<link rel="stylesheet" href="style.css">
</head>
If you like tricks, check also this out:
$ ht=$'<html>\n<head>\n<title>Blah</title>\n<style>\n/* stuff */\n</style>\n</head>\n'
$ st=$'<link rel="stylesheet" href="style.css">'
$ echo "$ht"
<html>
<head>
<title>Blah</title>
<style>
/* stuff */
</style>
</head>
$ echo "$ht" |perl -0777 -pe "s/\n/\0/g;s/<style>.*<\/style>/$st/g;s/\0/\n/g"
<html>
<head>
<title>Blah</title>
<link rel="stylesheet" href="style.css">
</head>
echo '<html>
<head>
<title>Blah</title>
<style>
/* stuff */
</style>
</head>'|sed -e '/<style>/{:a;/<\/style>/!{N;ba};c <link rel="stylesheet" href="style.css">' -e'}'
<html>
<head>
<title>Blah</title>
<link rel="stylesheet" href="style.css">
</head>
@mef51: try the following, adapted from karafka's nice code. The only difference is that this code also prints the <style> and </style> tags along with your new line.
awk '/<style>/ {print;f=1}
!f;
/<\/style>/ {f="";
print "<link rel=\"stylesheet\" href=\"style.css\">" ORS $0}' Input_file
Explanation: when a line contains <style>, print it and set the flag variable f to 1 (ON/TRUE). The condition !f then prints the current line only while f is unset, that is, for lines outside the style block. When a line contains </style>, reset f and print the new line followed by ORS (the output record separator, a newline by default) and the current line.
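For use from a Makefile or a script, the flag technique explained above can be wrapped in a small shell function. This is a sketch of the first variant (the one that drops the <style> block entirely); the function name `replace_style` and the variable names `link` and `skip` are made up for illustration, and it is still plain text substitution, not HTML-aware.

```shell
# replace_style: read HTML on stdin, drop everything from <style> to
# </style>, and emit the replacement link where the block ended.
replace_style() {
  awk -v link='<link rel="stylesheet" href="style.css">' '
    /<style>/   { skip = 1 }              # entering the style block
    !skip                                 # print lines outside the block
    /<\/style>/ { skip = 0; print link }  # block ended: emit the link
  '
}

# Sample input from the question.
out=$(printf '%s\n' '<html>' '<head>' '<title>Blah</title>' \
  '<style>' '/* stuff */' '</style>' '</head>' | replace_style)

printf '%s\n' "$out"
```

Passing the replacement through -v keeps the awk program free of escaped quotes, which tends to be easier to maintain than the inline print string.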
In an example script that prints HTML, it looks to me that the body tag is not closed. However I have never had experience with Perl before. Is this example incorrect? or is there something else that means body is closed?
print "Content-type: text/html\n\n";
print "<html>\n<head>\n<title>\nPerl CGI
Example\n</title>\n<body>\n<h1>Hello,
World!</h1>\nYour user agent is: <b>\n";
print $cgi_object->user_agent();
print "<b>.</html>\n";
Where there is a . on the last line it looks to me like it should be </body>
You aren't missing anything; that code simply doesn't generate an end tag for the body element. That tag (unlike the missing Doctype) is optional in HTML anyway, so the browser closes the element when it parses the end tag for the html element.
It would be better written something more like this:
#!/usr/bin/env perl
use strict;
use warnings;
use CGI;
use Template;
my $cgi = CGI->new();
print $cgi->header(-charset => 'utf-8');
my $ua = $cgi->user_agent();
my $tt = Template->new();
$tt->process(\*DATA, { ua => $ua });
__END__
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<title>Perl CGI Example</title>
</head>
<body>
<h1>Hello, World!</h1>
<p>Your user agent is: <em>[% ua | html %]</em>.</p>
</body>
</html>
And better still if you ditched CGI and used PSGI/Plack.
I want to remove everything inside the <head> tag except the <title> in an html file, and also insert a script into the <head> tag after this is done. I don't want to delete the <head> tag itself.
Is this possible using Sed?
Using regex to parse HTML is not a good choice. See this famous article for a full discussion.
I suggest using a DOM parser for this type of work, since any regex you try with sed or one of its variants will break at some point. Since you asked for an alternative in your comments, consider the following PHP code:
$content = '
<HTML>
<HEAD>
<link href="/style.css" rel="stylesheet" type="text/css">
<title>
Page Title Goes here
</title>
<script>
var str = "ZZZZZ1233#qq.edu";
</script>
</HEAD>
';
$dom = new DOMDocument();
$dom->loadHTML($content);
$head='
<head>
<script>
// your javascript goes here
var x="foo";
</script>
';
$headTag = $dom->getElementsByTagName("head")->item(0);
if ($headTag != null) {
$title = $headTag->getElementsByTagName("title")->item(0);
if ($title != null)
$head .= '<title>' . $title->textContent . '</title>
';
}
$head .= '</head>';
var_dump($head);
OUTPUT
string(118) "
<head>
<script>
// your javascript goes here
var x="foo";
</script>
<title>Page Title Goes here</title>
</head>"