I have several websites, each with multiple pages. Each of these pages contains multiple scripts for various functions. There is a specific script that I'm trying to comment out across all the sites.
The script I want to comment out contains a target word I can use as a condition to isolate it from the rest. I would like to use that word to target the script and wrap all of it (approx. 10 lines, including the <script> tags themselves) in a comment.
I have considered using regex, but it seems the multi-line and complex nature of script syntax may put this situation out of reach for a regex solution. I'm not versed in regex, so I could be wrong.
Here is a rough idea of what needs to be commented out. What I want to keep are other similar script blocks without the conditional word (in this example "oranges.com"):
<script type='text/javascript'>
window.__wtw_lucky_site_id = 15001;
(function() {
var wa = document.createElement('script'); wa.type = 'text/javascript'; wa.async = true;
wa.src = ('https:' == document.location.protocol ? 'https://ssl' : 'http://ww7632') + '.oranges.com/w.js';
var s = document.getElementsByTagName('script')[0]; s.parentNode.insertBefore(wa, s);
})();
</script>
I suppose it would also be worth mentioning that I will be accessing and manipulating these files via ssh so preferably the solution would be compatible with that in some fashion.
You could do this with Perl (where the script you want to comment has stuff in it):
$ cat test.xml
<html>
<script>
stuff
</script>
<script>
other things
</script>
<body>
<h1>Hello, world!</h1>
</body>
</html>
$ perl -0pe 's/<script([^>]*>.*?stuff.*?)<\/script>/<!-- script\1<\/script -->/smg' test.xml
<html>
<!-- script>
stuff
</script -->
<script>
other things
</script>
<body>
<h1>Hello, world!</h1>
</body>
</html>
For reference, see here. This is a pretty quick and dirty solution. You could also write a script to essentially parse the XML with any number of libraries, loop over the elements, and modify the XML.
There may be an XSLT method, but I was not able to find one that looked particularly straightforward.
Try the following Perl solution on your files:
perl -0777 -p -e 's/(<script.*?orange.*?<\/script>)/\/\*\1\*\///s' file
The Perl one-liner matches multi-line blocks of the following form:
<script ...
...
</script>
The pattern only matches when the word orange occurs somewhere between <script and </script>. The back reference \1 then replaces the match with itself; the only difference is that /* is added at the start and */ is added at the end. So the output would look like:
/*<script ...
...
</script>*/
Alternatively
You may also use a Python script to achieve the same result:
import sys
import re
path = sys.argv[1]
with open(path, 'r') as f:
    a = f.read()  # read the whole file into one string
# re.DOTALL lets "." match newlines as well
change = re.sub(r'(<script.*?orange.*?</script>)', r'/*\1*/', a, flags=re.DOTALL)
print(change)
You can run the script as so:
python script.py file > newfile
cat newfile > file
This overwrites the contents of your file with the desired output.
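One caveat with the patterns above: with the dot matching newlines, `<script.*?word` can start at an earlier, unrelated <script> tag and run forward until it reaches the target word. Below is a minimal sketch of one way around that, which matches each script block on its own and only then checks for the word. The function name comment_out_script is made up for this example, and it assumes no script body contains the text </script> or -->. It wraps the block in an HTML comment (<!-- -->), since /* */ is only a comment inside JavaScript. It is not idempotent, so run it once per file.

```python
import re

# Matches one <script ...> ... </script> block at a time (lazy up to the first close tag).
SCRIPT_BLOCK = re.compile(r"<script\b[^>]*>.*?</script>", re.DOTALL | re.IGNORECASE)

def comment_out_script(html, word):
    """Wrap every script block that contains `word` in an HTML comment."""
    def wrap(match):
        text = match.group(0)
        # Only touch blocks that contain the target word.
        if word in text:
            return "<!-- " + text + " -->"
        return text

    return SCRIPT_BLOCK.sub(wrap, html)

sample = """<script>
other things
</script>
<script type='text/javascript'>
wa.src = 'https://ssl.oranges.com/w.js';
</script>
"""
print(comment_out_script(sample, "oranges.com"))
```

Reading each file, passing its contents through the function and writing the result back would work the same way as in the script above.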
I have to clean the directory and its subdirectories by removing all unused files. (A file is considered unused if it is not linked to in any of the HTML files or if it is not specified explicitly that this file is in use). A file can be linked in an HTML file by either href or img src.
For example, I have I.html, 1.html, 2.html and a folder named 1. In the I.html file, an href uses 1.html and the 1 directory, but 2.html is not used by any other file. So, how can I remove the unused 2.html file?
use strict;
use warnings;
my($path,$regexExpression) = @ARGV;
my $fileNames = "data.txt";
my @abc= ();
if(not defined $path){
die "File directory not given, please try again \n"
}
print "added file ";
if (not defined $regexExpression) {
$regexExpression="*";
print "--Taking default Regular Expression. \n"
}
if (defined $regexExpression) {
print "The regular Expression : $regexExpression \n";
my $directorypathx= `pwd`;
my ($listofFileNames) = findFilesinDir($path);
my ($listofLinks) = readallHrefInaFile();
my ($listofImage) = readImageFile();
print $listofLinks;
}
sub findFilesinDir{
print "inside subroutines ", $path,"\n";
my($pathName) = @_;
my $fileNames =`find '$pathName' -name '$regexExpression' | sort -h -r > $fileNames ` ;
if (-l $fileNames){
return $fileNames;
}
}
sub readallHrefInaFile{
my $getAllLinks = ` grep -Eo "<a .*href=.*>" $path*.html | uniq ` ;
push (@abc,$getAllLinks);
}
sub readImageFile{
print "image files \n";
my $getAllImage = ` grep -Eo "<img .*src=.*>" $path*.html | uniq `;
push (@abc,$getAllImage);
}
print @abc;
I.html
<html>
<head>
<title>Index</title>
</head>
<body>
<h1>Index</h1>
1
<h1>Downloads</h1>
Compressed craters
<hr>
</body>
</html>
1.html
<html>
<head>
<title>1</title>
</head>
<body>
<h1>1</h1>
<img src="images/1-1.gif" />
<img src="images/1-2.gif" />
<hr>
</body>
</html>
The overall approach you show is reasonable, but there is a lot to say about the code itself. The place to do that would be code review and I encourage you to submit your code there as well.
One overall comment I'd make is that there is no reason to reach so often for external tools; your program uses external grep and find and sort and pwd. We can practically always do the whole job with an abundance of tools that Perl provides.
Here is a simple example for what you need, where most of the work is done using modules.
The list of files to search for in our HTML is assembled using File::Find::Rule, recursively under $dir. Another option is the core File::Find module.
Even though HTML parsing appears simple in this case, it is much better to use a module for that as well, instead of a regex. HTML::TreeBuilder is a bit of a standard for what you need here. That module itself uses others, the workhorse being HTML::Element.
The following program works with one HTML file ($source_file), for which we need to find files under a given directory ($dir) that are not used in either an href attribute or a src attribute in an img tag. These files need to be deleted (that line is commented out).
use warnings;
use strict;
use feature 'say';
use File::Find::Rule;
use HTML::TreeBuilder;
my ($dir, $source_file) = @ARGV;
die "Usage: $0 dir-name file-name\n" if not $dir or not $source_file;
my @files = File::Find::Rule->file->in($dir);
#say for @files;
foreach my $file (@files) {
next if $file eq $source_file; # not the file itself!
say "Processing $file...";
my $tree = HTML::TreeBuilder->new_from_file($source_file);
my $esc_file = quotemeta $file;
my @in_href = $tree->look_down( 'href', qr/$esc_file/ );
my @in_img_src = $tree->look_down( _tag => 'img', 'src', qr/$esc_file/ );
if (@in_href == 0 and @in_img_src == 0) {
say "\tthis file is not used in 'href' or 'img-src' in $source_file";
# To delete it uncomment the next line -- after all is fully tested
#unlink $file or warn "Can't unlink $file: $!";
}
}
The statement that actually removes files, using unlink, is of course commented out. Enable that only once you have thoroughly checked the final version of the script, and have made backups.
Notes
Refine what files you are looking for by adding "rules" with File::Find::Rule
I use quotemeta on filenames, which escapes all special characters in them; otherwise something may sneak in that would throw off the regex used by look_down
The code above simply parses twice through each file, assembling the lists of elements found for href attribute and then for src attribute (in img tag). This can be done in one pass, by using sub { } specification for criteria in look_down
The script must be invoked with the directory name and the main HTML file name. Please change that for proper command line parsing, and more sophisticated use, with Getopt::Long
A whole lot more can be fine tuned here, both with searching for files and in parsing HTML; there is a lot of information in modules' documentation, and yet more in many posts around this site.
The code is tested for simple cases; please adjust to your realistic needs.
Here is a full example of usage.
I place this script (script.pl) in a directory with a file I.html and a directory www.
The I.html file:
<!DOCTYPE html>
<html> <head> <title>Test handling of unused files</title> </head>
<body>
Used file from www
<img src="www/images/used.jpg" alt="no_image_really">
</body>
</html>
The directory www has files used.html and another.html, and a subdirectory images with files used.jpg and another.jpg in it, so altogether we have
.
├── script.pl
├── I.html
└── www
├── used.html
├── another.html
└── images
├── used.jpg
└── another.jpg
There is no need for any content in any of the files in www for this test. This is only a minimal setup; I've added more files and directories, and tags to I.html, to test.
Then I run script.pl www I.html and get the expected output.
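For comparison, the same overall idea (collect every href and img src value from the HTML, then list the files under the directory that none of them point to) can be sketched with the Python standard library alone. This is only an illustration under assumptions, not a port of the Perl script: the names LinkCollector and unused_files are made up here, links are compared as whole normalised paths rather than by regex as look_down does above, and nothing is deleted.

```python
import os
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect every href value and every img src value in a document."""
    def __init__(self):
        super().__init__()
        self.links = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if value is None:
                continue
            if name == "href" or (tag == "img" and name == "src"):
                self.links.add(value)

def unused_files(directory, source_html):
    """Return the files under `directory` that `source_html` never references."""
    collector = LinkCollector()
    collector.feed(source_html)
    # Normalise so that "www/./a.html" and "www/a.html" compare equal.
    used = {os.path.normpath(link) for link in collector.links}
    unused = []
    for root, _dirs, names in os.walk(directory):
        for name in names:
            path = os.path.normpath(os.path.join(root, name))
            if path not in used:
                unused.append(path)
    return sorted(unused)
```

Called as unused_files("www", open("I.html").read()) from the directory that holds I.html, it would return a list of candidate paths to review before removing anything.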
Using NodeJS, I would like to parse a variable defined in JSON, which is embedded in the HTML of a 3rd party website. What is the easiest way to get this variable from the HTML?
The chunk of HTML from which I would like to extract the JS can be seen below:
...
<footer>
<div>
<script type="application/ld+json">
{"@context":"http:\/\/schema.org","@type":"BreadcrumbList","itemListElement":[{"@type":"ListItem","position":1,"item":{"@id":"https:\/\/www.domain.com\/","image":"https:\/\/assets.domain.com\/img\/facebook\/stuf.png","name":"Home"}}]}
</script>
<script>
var API_URL = ["https:\/\/api1.domain.com\/api","https:\/\/api2.domain.com\/api","https:\/\/api3.domain.com\/api"],
</script>
</div>
</footer>
...
The HTML above is fetched from the XY website using NodeJS. I would like to avoid using eval().
I tried with JSDOM, but I didn't know how to select the <script> in question. Is regex the only solution?
In the case you provided, the selector would be: footer>div>script:nth-child(2).
Is this what you're asking for?
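Once the text of that <script> element is in hand, the array literal in the example happens to be valid JSON, so it can be parsed without eval(). Here is a minimal sketch of that step, written in Python only to illustrate the idea (the question is about NodeJS, where JSON.parse would play the role of json.loads). The function name extract_array is made up, and the pattern assumes a flat array whose strings contain no ] character.

```python
import json
import re

def extract_array(script_text, name):
    """Return the JSON array assigned by `var <name> = [...]`, or None if absent."""
    pattern = re.compile(
        r"\bvar\s+" + re.escape(name) + r"\s*=\s*(\[.*?\])", re.DOTALL
    )
    match = pattern.search(script_text)
    if match is None:
        return None
    # json.loads also undoes the \/ escapes used in the page source.
    return json.loads(match.group(1))

script = r'var API_URL = ["https:\/\/api1.domain.com\/api","https:\/\/api2.domain.com\/api"],'
print(extract_array(script, "API_URL"))
```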
Hello, I have an ASP script that I need to edit. Specifically, I need to restyle the email that it sends, so I need to edit the HTML in it.
The problem is the html (from the asp file) has on every row
HTML = HTML & ="
in it (plus some other changes). I need to take the HTML code from that ASP, get rid of the beginning html = html part, edit the double "" and convert them to a single " (I need to do that one by one, because the variables also have quotes in them).
Then I restyle the page with HTML, and after that I need to convert it back so I can integrate it in ASP (basically introduce the double '"' again and so on).
Yeah, I could edit the HTML from the ASP directly, but I don't know how it might look, because I can't run the script (it needs other files from the server, which I don't have access to).
The question:
Is there a better way of doing this?
Some way of previewing what I'm doing in ASP directly. Or maybe a tool that lets me move from ASP HTML to HTML and back faster.
I sure know that what I'm doing right now is quite dumb, so there must be a better way.
You could create a html template file with some placeholders in it, read it in, replace the placeholders and then use it in your email. Saves you having to keep messing about building up the html using variables. This previous answer has some more details about a possible solution (with code examples).
As @steve-holland mentions, creating a template is a great way to avoid all the annoying HTML strings in the code, and it makes changing layouts a breeze.
I've worked on HTML templating scripts myself in the past. I usually build a Scripting.Dictionary that contains the key/value pairs I will be replacing inside the template.
Function getHTMLTemplate(url, params)
Dim stream, keys, html, idx
Set stream = Server.CreateObject("ADODB.Stream")
With stream
.Type = adTypeText
.Charset = "utf-8"
Call .Open()
Call .LoadFromFile(url)
html = .ReadText(adReadAll)
Call .Close()
End With
Set stream = Nothing
keys = params.Keys
For idx = 0 To UBound(keys, 1)
html = Replace(html, keys(idx), params.Item(keys(idx)))
Next
Set keys = Nothing
Set params = Nothing
getHTMLTemplate = html
End Function
Usage:
Dim params, html
Set params = Server.CreateObject("Scripting.Dictionary")
With params
Call .Add("[html_title]", "Title Here")
Call .Add("[html_logo]", "/images/logo.gif")
'... and so on
End With
html = getHTMLTemplate(Server.MapPath("/templates/main.htm"), params)
Call Response.Write(html)
Example main.htm structure:
<!doctype html>
<html>
<head>
<title>[html_title]</title>
<link rel="stylesheet" type="text/css" href="/styles/main.css" />
</head>
<body>
<div class="header">
<img src="[html_logo]" alt="Company Name" />
</div>
</body>
</html>
Why use an ADODB.Stream instead of the Scripting.FileSystemObject?
You can control the Charset being returned and even convert from one to another if you need to.
If the template is particularly large, you can stream the content in using the Read() method with a specific buffer size to improve the performance of the read.
I'm having a little problem. I am using gmap in PrimeFaces and we need to load the script
<script src="http://maps.google.com/maps/api/js?sensor=true" type="text/javascript"/>
However, I need to load the script according to the language of the user's Locale.
How can I manage to do that without "hardcoding" the string?
I tried something like this :
<script src="http://maps.google.com/maps/api/js?sensor=true&language="#{template.userLocale} type="text/javascript"/>
// {template.userLocale} holds a string with the locale
Can you help, please?
You have an HTML syntax error there. What you end up getting would look like this given a language of en:
<script src="http://maps.google.com/maps/api/js?sensor=true&language="en type="text/javascript"/>
(rightclick page in browser and do View Source to see it yourself)
You need to move the doublequote to the end of the attribute value:
<script src="http://maps.google.com/maps/api/js?sensor=true&language=#{template.userLocale}" type="text/javascript"/>
So that the HTML will end up to be:
<script src="http://maps.google.com/maps/api/js?sensor=true&language=en" type="text/javascript"/>
EL can simply be used in template text. Keep in mind that JSF basically produces HTML code. The HTML <script src> attribute and the EL #{} expression don't run in sync. Instead, JSF/EL produces the HTML, and you just need to make sure that the resulting HTML is syntactically valid.
First a bit of background information. I create HTML emails at my workplace and the whole process is very tedious. It goes a little like this...
1. Code markup for HTML using tables and some CSS
2. Parse HTML and CSS using Premailer so all CSS is inline
3. Test HTML works in all email clients
4. Create a copy of the inline version of HTML and start adding in proprietary variables to email tool used for sending emails, ie <%=constant.first_name%>, <%=unsubscribe_link%>
5. Test in email client to see if it works and client is happy. If not repeat steps 1 through 5 again.
So as you can see it gets really tedious after a while.
What I would like to do is create a command line script similar to Premailer which allows me to parse an HTML file with variables stored in it without destroying the example text already in the HTML. That way, when you are previewing the HTML, it all looks dandy.
For example...
Store the first name function as a variable for own use.
$first_name = "<%=constant.first_name%>"
Then tell the parser what word(s) to replace with the appropriate variable.
<p>My name is <!-- $first_name -->Gavin<!-- /$first_name --></p>
So that the final output looks something like:
<p>My name is <%=constant.first_name%></p>
Would such a thing be possible? Is there a better syntax I could use, such as a custom tag like <first_name>Gavin</first_name>, if the browser can handle it?
Any advice is helpful. :)
I've seen this done before using a syntax like:
{assign_variable:first_name="Jesse"}
Then, you could use it like:
{first_name}
The way you'd parse this (provided you're using PHP) would be something like:
<?php
// Our Template Code
$strHTML = <<<EOT
{assign_variable:first_name="Jesse"}
{assign_variable:last_name="Bunch"}
Hello, {first_name}!
EOT;
// Get all the variables
$arrMatches = array();
preg_match_all('/\{assign\_variable\:([a-zA-Z\_\-]*)\=\"([a-zA-Z0-9]+)\"\}/', $strHTML, $arrMatches);
// Remove the assign_variable tags
$strHTML = preg_replace('/\{assign\_variable\:([a-zA-Z\_\-]*)\=\"([a-zA-Z0-9]+)\"\}/', '', $strHTML);
// Combine them into key/values
$arrVariables = array_combine($arrMatches[1], $arrMatches[2]);
foreach($arrVariables as $key=>$value) {
// Replace the variable occurrences
$strHTML = str_replace('{' . $key . '}', $value, $strHTML);
}
// Send the parsed template
echo $strHTML;
Which outputs:
Hello, Jesse!
Note, this is a very basic example. Here are some improvements to make on this code before using it in production:
Edit the regex to allow the right characters.
Maybe implement a better replacement method than a loop
Check for parse errors
Benchmark performance
All in all, I think you get the idea. Hope this points you in the right direction.
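The comment-marker syntax from the question can be handled in much the same way. Here is a minimal sketch in Python, assuming the markers look exactly like <!-- $name --> ... <!-- /$name --> and that the mapping from marker names to the email tool's variables is supplied by hand; VARIABLES and fill_markers are made-up names for this example.

```python
import re

# Hypothetical mapping from marker name to the email tool's variable syntax.
VARIABLES = {
    "first_name": "<%=constant.first_name%>",
}

# <!-- $name -->sample text<!-- /$name -->, where both names must be the same.
MARKER = re.compile(r"<!--\s*\$(\w+)\s*-->.*?<!--\s*/\$\1\s*-->", re.DOTALL)

def fill_markers(html, variables):
    """Replace each marked-up sample text with the variable named by its marker."""
    def replace(match):
        name = match.group(1)
        # Leave unknown markers untouched so mistakes stay visible.
        return variables.get(name, match.group(0))

    return MARKER.sub(replace, html)

print(fill_markers("<p>My name is <!-- $first_name -->Gavin<!-- /$first_name --></p>", VARIABLES))
```

Because the markers are ordinary HTML comments, the unprocessed file still previews with the sample text in a browser.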
I have a similar situation.
I have created a "format template" like this:
<?php // section1 $var1/$var2 ?>
<head>
<title>$var1</title>
<meta name="description" content="$var2">
</head>
<?php // section2 $var1/$var2 ?>
<body>
hello: <p>$var1</p>
news for you: <p>$var2</p>
</body>
It is valid PHP code and valid HTML code, so you can edit it with Dreamweaver or similar, and you can also host it.
Then a PHP script replaces all occurrences of the vars in all sections.
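The replacing script itself is not shown. As a rough illustration of that last step (in Python rather than PHP, so not the poster's actual script), the standard library's string.Template happens to use the same $name placeholder syntax. The function name render_section is made up for this sketch.

```python
from string import Template

def render_section(template_text, values):
    """Substitute $name placeholders; names missing from `values` are left as they are."""
    return Template(template_text).safe_substitute(values)

section = """<head>
<title>$var1</title>
<meta name="description" content="$var2">
</head>"""
print(render_section(section, {"var1": "News", "var2": "Daily news for you"}))
```

safe_substitute is used instead of substitute so that a placeholder with no value stays visible in the output rather than raising an error.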