How can I replace and multiply dimensions of img tags in Perl or Ruby? - html

I have a folder full of html files created for a Kindle ebook. The images are coded with width and height, as per the Kindle guidelines:
<img width="328" height="234" src="images/224p_fmt.jpeg" alt="224p.tif"/>
What I need to create/find is a script that will process all the image tags, multiply the width and height attributes by a specified amount (coded into the script) and write them back into the html files.
So, for the above example, say I want to multiply by 1.5, and wind up with
<img width="492" height="351" src="images/224p_fmt.jpeg" alt="224p.tif"/>
Scripts like this are not my forte, so help appreciated. I especially am unclear on how to write a script that I can run on file(s) from the command line and just input/output html.
I assume the meat of the code would be something like
s/<img width="([0-9]+)" height="([0-9]+)" src="(.*?)" alt=".*"/>/'<img width="'.$1*1.5.'" height="'.$2*1.5.'" src="'.$3.'" alt=""/>'/eg;
I realize this is incorrect (the multiplication part), which is why I'd appreciate help.

You've already got the main regex figured out; you just need to tweak it and decide on a language. Using regexes on HTML is not optimal, but since this is somewhat straightforward, it's probably OK.
perl -pi.bak -we 's/<img width="([0-9]+)" height="([0-9]+)"/q(<img width=") .
$1*1.5 . q(" height=") . $2*1.5 . q(")/eg;' yourfile.html
Note the use of the alternate quoting q(...), since using single quotes on the command line will conflict with the shell quoting.
There's no need to touch any parts you're not changing, unless you feel the need to make a stricter match. If you do, you can add a look-ahead assertion:
(?=\s*src=".*?"\s*alt=".*?"\/>)
This part will remain unchanged by the substitution.

In Python I'd do it like this.
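import sys, re

source = sys.stdin.read()

def multi(by):
    # Build a re.sub callback that scales the captured number by `by`.
    def handler(m):
        # Round to an integer so we emit "492" rather than "492.0".
        updated = int(round(int(m.group(2)) * by))
        return m.group(1) + str(updated)
    return handler

# Note: this matches every width/height attribute, not only those on img tags.
print(re.sub(r'((?:width|height)=["\'])(\d+)', multi(1.5), source))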
Then you can handle input and output on the command line using < and >.
$ python resize.py < index.html > new_file.html
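If you would rather edit several files in place (as perl -pi.bak does above), the same idea can be sketched as a standalone script. This is only a sketch under a few assumptions: the script name resize_inplace.py, the .bak backup suffix, UTF-8 input and the scale_dimensions helper are all illustrative choices, and the regex assumes there is no > inside an attribute value.

```python
import re
import sys
from pathlib import Path

FACTOR = 1.5  # scale factor, hard-coded as the question asks

# Assumption: img tags contain no ">" inside attribute values.
IMG_TAG = re.compile(r"<img\b[^>]*>", re.IGNORECASE)
DIMENSION = re.compile(r'\b(width|height)=(["\'])(\d+)\2', re.IGNORECASE)


def scale_dimensions(html, factor=FACTOR):
    """Return html with width/height scaled, but only inside img tags."""

    def scale_attr(m):
        # round() uses round-half-to-even; swap in math.ceil/floor if needed.
        value = round(int(m.group(3)) * factor)
        return f"{m.group(1)}={m.group(2)}{value}{m.group(2)}"

    def scale_tag(m):
        return DIMENSION.sub(scale_attr, m.group(0))

    return IMG_TAG.sub(scale_tag, html)


if __name__ == "__main__":
    for name in sys.argv[1:]:
        path = Path(name)
        original = path.read_text(encoding="utf-8")
        # Keep a backup next to the file before overwriting it.
        path.with_name(path.name + ".bak").write_text(original, encoding="utf-8")
        path.write_text(scale_dimensions(original), encoding="utf-8")
```

It would be run as something like python resize_inplace.py *.html, leaving an untouched .bak copy beside each file.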

I would look into using the nokogiri gem to parse the HTML, search for image tags, extract the width and height attributes and then output the changed document so you can save it.
More information at the nokogiri tutorial page.

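You're right, it can be done with a small Ruby script. It can look like this:
source = '<img width="328" height="234" src="images/224p_fmt.jpeg" alt="224p.tif"/>'
data = source.scan(/<img width="([0-9]+)" height="([0-9]+)" src="(.*?)" alt=".*?"\/>/).flatten
# Replace each attribute as a whole so other occurrences of the same number are left alone.
source.sub!("width=\"#{data[0]}\"", "width=\"#{(data[0].to_i * 1.5).round}\"")
source.sub!("height=\"#{data[1]}\"", "height=\"#{(data[1].to_i * 1.5).round}\"")
Of course, it's a quick and dirty script, far from perfect, and it has some drawbacks: it assumes the attributes always appear in exactly this order and only handles the first img tag in the string.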

Related

ruby tags for Sphinx/rst

I create HTML documents from rst-formatted text with the help of Sphinx. I need to display some Japanese words with furigana (small characters above the words), something like this:
I'd like to produce HTML that displays the furigana using the <ruby> tag.
I can't figure out how to get this result. I tried to:
insert raw HTML code with the .. raw:: html directive but it breaks my line into several paragraphs.
use the :superscript: directive but the text in furigana is written beside the text, not above.
use the :role: directive to create a link between the text and a CSS class of my own. But the :role: directive can only be applied to a segment of text, not to TWO segments as required by the furiganas (=text + text above it).
Any ideas?
As far as I know, there's no simple way to get the expected result.
For a specific project, I chose not to generate the furigana with Sphinx but to modify the .html files afterwards. See the add_ons/add_furiganas.py script and the result here. Yes, it's a quick-and-dirty trick :(

Compare two HTML documents ignoring multiple and trailing whitespaces

Is there a tool that compares an HTML document like:
<p b="1" a="0 "> a b
c </p>
(as a C string: "<p> a b\nc </p>") equal to:
<p a="0 " b="1">a b c</p>
Note how:
runs of whitespace in text were converted to a single space
newlines were converted to spaces
leading and trailing whitespace in text was stripped
attributes were put on a standard order
attribute values were unchanged, including trailing whitespaces
Why I want that
I am working on the Markdown Test Suite that aims to measure markdown engine compliance and portability.
We have markdown input, expected HTML output, and want to determine if the generated HTML output is equal to the expected one.
The problem is that Markdown is underspecified, so we cannot compare directly the two HTML strings.
The actual test code is here, just modify run-tests.py#dom_normalize if you want to try out your solution.
Things I tried
beautifulsoup. Orders the attributes, but does not deal well with whitespaces?
A function formatter regex modification might work, but I don't see a way to differentiate between the inside of nodes and attributes.
A Python only solution like this would be ideal.
looking for a JavaScript function similar to isEqualNode() (does not work because it ignores nodeValue) + some headless JS engine. Couldn't find one.
If there is nothing better, I'll just have to write my own output formatter front-end to some HTML parser.
I ended up cooking up a custom HTML renderer that normalizes things based on Python's stdlib HTMLParser.
You can see it at: https://github.com/karlcow/markdown-testsuite/blob/749ed0b812ffcb8b6cc56f93ff94c6fdfb6bd4a2/run-tests.py#L20
Usage and docstring tests at: https://github.com/karlcow/markdown-testsuite/blob/749ed0b812ffcb8b6cc56f93ff94c6fdfb6bd4a2/run-tests.py#L20
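For readers who just want the general shape of such a normalizer, here is a minimal sketch built on the stdlib HTMLParser. It is not the code linked above: the Normalizer class and normalize function are hypothetical names, and stripping every text node is a deliberate simplification that follows the rules listed in the question (it would also remove a meaningful space next to an inline tag).

```python
import re
from html import escape
from html.parser import HTMLParser


class Normalizer(HTMLParser):
    """Re-serialize HTML with sorted attributes and collapsed text whitespace."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def handle_starttag(self, tag, attrs):
        parts = [tag]
        # Sort attributes by name; values are kept verbatim (only re-escaped).
        for name, value in sorted(attrs, key=lambda item: item[0]):
            parts.append(name if value is None else f'{name}="{escape(value)}"')
        self.out.append("<" + " ".join(parts) + ">")

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        # Collapse whitespace runs (including newlines), then strip the ends.
        text = re.sub(r"\s+", " ", data).strip()
        if text:
            self.out.append(escape(text, quote=False))


def normalize(html_text):
    parser = Normalizer()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)
```

Two documents then count as equal when normalize(a) == normalize(b). Self-closing tags come out as an open/close pair, which is harmless as long as both sides go through the same function.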

Using ruby variables as html code

I would expect that the following:
<div style="padding-top:90px;"><%= u.one_line %></div>
simply pulls whatever is in u.one_line (which in my case is text from database), and puts it in the html file. The problem I'm having is that sometimes, u.one_line has text with formatted html in it (just line breaks). For example sometimes:
u.one_line is "This is < / b r > awesome"
and I would like the page to process the fact that there's a line break in there... I had to put it with spaces up ^^^ here because the browser would not display it otherwise on stackoverflow. But on my server it's typed correctly, unfortunately instead of the browser processing the line break, it prints out the "< / b r>" part...
I hope you guys understand what I mean :(?
Always remember to use raw or html_safe for HTML output in Rails, because Rails auto-escapes HTML content by default to protect against XSS attacks.
For more, see:
When to use raw() and when to use .html_safe

How to generate hash from ~200k text/html that would match/compare to similar text?

I would like to make a sort of hash key out of a text (in my case html) that would match/compare to the hash of other similar text
ex of matching texts:
"2012/10/01 This is my webpage #1"+ 100k_of_same_text + random_words_1 + ..
"2012/10/02 This is my webpage #2"+ 100k_of_same_text + random_words_2 + ..
...
"2012/10/02 This is my webpage #2"+ 100k_of_same_text + random_words_3 + ..
So far I've thought of removing numbers and tags, but that would still leave the random words.
Is there anything out there that does this?
I have root access to the server, so I can add any UDF that is necessary, and if needed I can do the processing in C or other languages.
The ideal would be a function like generateSimilarHash(text) and another function compareSimilarHashes(hash1,hash2) that would return the percentage of matching text.
A function like compare(text1,text2) would not work in my case, as I have many pages to compare (~20 million at the moment).
Any advice is welcomed!
UPDATE:
I'm referring to a hash function as it is described on Wikipedia:
A hash function is any algorithm or subroutine that maps large data
sets of variable length to smaller data sets of a fixed length.
the fixed length part is not necessary in my case.
It sounds like you need to utilize a program like diff.
If you are just trying to compare text, a hash is not the way to go, because slight differences in input cause completely different output (which is why hashes are used to store passwords and secure text). Character difference programs are pretty complicated; unless you are really interested in how they work and want to write your own, I would just use a solution like the one shown here, using sdiff to get a percentage.
Percentage value with GNU Diff
You could use some sort of Levenshtein distance algorithm. This works for small pieces of text, but I'm fairly sure that something similar can be applied to large chunks of text.
Ref: http://en.m.wikibooks.org/wiki/Algorithm_implementation/Strings/Levenshtein_distance
I've found out that tag order in webpages can create a very distinctive pattern, that remains the same even if portions of text / css / script change. So I've made a string generated by the tag order (ex: html head meta title body div table tr td span bold... => "hhmtbdttsb...") and then I just do exact matches between these strings. I can even apply the Levenshtein distance algorithm and get accurate results.
If I didn't have html, I would have used the punctuation/end-lines for splitting, or something similar.
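The tag-order idea can be sketched roughly as follows. This is an illustration only, not the code I actually run: tag_signature, levenshtein and similarity are hypothetical names, the signature uses the first letter of every opening tag, and the distance is a plain dynamic-programming version that is quadratic in signature length.

```python
from html.parser import HTMLParser


class TagOrder(HTMLParser):
    """Collect the first letter of every opening tag, in document order."""

    def __init__(self):
        super().__init__()
        self.letters = []

    def handle_starttag(self, tag, attrs):
        self.letters.append(tag[0])


def tag_signature(html_text):
    parser = TagOrder()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.letters)


def levenshtein(a, b):
    """Edit distance between two strings, keeping one row at a time."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        current = [i]
        for j, cb in enumerate(b, 1):
            current.append(min(previous[j] + 1,                # deletion
                               current[j - 1] + 1,             # insertion
                               previous[j - 1] + (ca != cb)))  # substitution
        previous = current
    return previous[-1]


def similarity(sig1, sig2):
    """Fraction between 0 and 1; 1.0 means identical signatures."""
    longest = max(len(sig1), len(sig2))
    return 1.0 if longest == 0 else 1 - levenshtein(sig1, sig2) / longest
```

The signature can be stored next to each page and compared with a plain equality test first, falling back to similarity() only for near misses.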

What would I use to remove escaped html from large sets of data

Our database is filled with articles retrieved from RSS feeds. I was unsure of what data I would be getting, and how much filtering was already setup (WP-O-Matic Wordpress plugin using the SimplePie library). This plugin does some basic encoding before insertion using Wordpress's built in post insert function which also does some filtering. Between the RSS feed's encoding, the plugin's encoding using PHP, Wordpress's encoding and SQL escaping, I'm not sure where to start.
The data is usually at the end of the field after the content I want to keep. It is all on one line, but separated out for readability:
<img src="http://feeds.feedburner.com/~ff/SoundOnTheSound?i=xFxEpT2Add0:xFbIkwGc-fk:V_sGLiPBpWU" border="0"></img>
<img src="http://feeds.feedburner.com/~ff/SoundOnTheSound?d=qj6IDK7rITs" border="0"></img>
<img src="http://feeds.feedburner.com/~ff/SoundOnTheSound?i=xFxEpT2Add0:xFbIkwGc-fk:D7DqB2pKExk"
Notice how some of the images are escaped and some aren't. I believe this has to do with the last part being cut off so as to be unrecognizable as an html tag, which then caused it to be html encoded while the actual img tags were left alone.
Another record has only this in one of the fields, which means the RSS feed gave me nothing for the item (filtered out now, but I have a bunch of records like this):
<img src="http://farm3.static.flickr.com/2183/2289902369_1d95bcdb85.jpg" alt="post_img" width="80"
All extracted samples are on one line, but broken up for readability. Otherwise, they are copied exactly from the database from the command line mysql client.
Question: What is the best way to work with the above escaped html (or portion of an html tag), so I can then remove it without affecting the content?
I want to remove it, because the images at the end of the field are usually images that have nothing to do with content. In the case of the feedburner ones, feedburner adds those to every single article in a feed. Other times, they're broken links surrounding broken images. The point is not the valid html img tags which can be removed easily. It's the mangled tags which if unencoded will not be valid html, which will not be parsable with your standard html parsers.
[EDIT]
If it was just a matter of pulling the html I wanted out and doing a strip_tags and reinserting the data, I wouldn't be asking this question.
The portion that I have a problem with is that what used to be an img tag was html encoded and the end cut off. If it's decoded it will not be an html tag, so I cannot parse it the usual way.
With all the &lt;img src=&quot; crap, I can't get my head around searching for it other than SELECT ID, post_content FROM table WHERE post_content LIKE '%&lt;img%', which at least gets me those posts. But when I get the data, I need a way to find it, remove it, but keep the rest of the content.
[/EDIT]
[EDIT 2]
<img src="http://farm4.static.flickr.com/3162/2735565872_b8a4e4bd17.jpg" alt="post_img" width="80" />Through the first two months of the year, the volume of cargo handled at Port of Portland terminals has increased 46 percent as the port?s marine cargo business shows signs of recovering from a dismal 2009.<div>
<img src="http://feeds.feedburner.com/~ff/bizj_portland?d=yIl2AUoC8zA" border="0"></img> <img src="http://feeds.feedburner.com/~ff/bizj_portland?i=YIs66yw13JE:_zirAnH6dt8:V_sGLiPBpWU" border="0"></img> <img src="http://feeds.feedburner.com/~ff/bizj_portland?i=YIs66yw13JE:_zirAnH6dt8:F7zBnMyn0Lo" border="0"></img> <a href="http://feeds.bizjournals.com/~ff/bizj_portland?a=YIs66yw13JE:_zirAnH6dt8:qj6IDK7rITs"><img src="http://feeds.feedburner.com/~ff/bizj_portland?d=qj6IDK7rITs"
The part I want to keep:
<img src="http://farm4.static.flickr.com/3162/2735565872_b8a4e4bd17.jpg" alt="post_img" width="80" />Through the first two months of the year, the volume of cargo handled at Port of Portland terminals has increased 46 percent as the port?s marine cargo business shows signs of recovering from a dismal 2009.
To reiterate: It's not about removing the valid html img tags. That's easy. I need to be able to find specifically the <img src="http://feeds.feedburner.com/~ff/bizj_portland?d=qj6IDK7rITs" if it's part of the pattern of img tag img tag mangled img tag or anchor img anchor img img mangled image etc etc, but not remove <img if it is indeed part of the article. Out of the few dozen samples I've reviewed, it's been pretty consistent that this mangled img tag is at the end of the field.
The other one is the single mangled image tag. It's consistently a mangled flickr img tag, but as above, I can't just search for <img as it could be a valid part of the content.
The problem lies in that I can't simply decode it and parse it as HTML, because it will not be valid html.
[/EDIT 2]
The best way is to:
Install HTML::Entities from CPAN and use that to unescape the URIs.
Install HTML::Parser from CPAN and use that to parse and remove the URIs after they're unescaped.
Regexes are not a suitable tool for this task.
Question updated...
To extract the data you want, you could use this approach:
use HTML::Entities qw/decode_entities/;
my $decoded = decode_entities $raw;
if ($decoded =~ s{ (<img .+? (?:>.+?</img>|/>)) } {}x) { # grab the image
    my $img = $1;
    $decoded =~ s{<.+?>} {}xg;      # strip complete tags
    $decoded =~ s{< [^>]+? $} {}x;  # strip trailing noise
    print $img.$decoded;
}
Using a regex to parse HTML is generally frowned upon, however, in this case, it is more about stripping out segments that match a pattern. After testing the regexes on a larger set of data, you should have an idea of what might need to be tweaked.
Hope this helps.
I wouldn't strip it out. It's far from unrecoverable junk.
First apply HTML::Entities::decode_entities conditionally (use the occurrence of &lt; as the first characters as a heuristic), then let HTML::Tidy::libXML->clean(…, 'UTF-8', 1) reconstruct the mark-up as intended. clean returns a whole document, but it's trivial to extract just the needed img element.
Your best bet will be to recollect all of the articles that are in the database so that they aren't truncated and corrupted. If this is not an option then...
Based on your examples above it looks like you're stripping out everything that follows the text content of each article. In your example the text content is followed by a DIV tag and a bunch of IMG tags that may or may not have been truncated and or been converted into HTML entities.
If all of your records are similar you can strip out everything after the Text content by removing the final div tag and everything that follows it using perl like this:
my $article = magic_to_get_an_article();
$article =~ s/<div>.*//s;
magic_to_store_article($article);
If your records include anything more complex than this you're better off using an HTML parsing module and reading the documentation carefully to find out how it handles invalid HTML.
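For anyone doing this outside Perl, the same cut-at-the-div idea can be sketched in Python. This is a rough sketch under the same assumption as above (the unwanted trailer always starts at a <div>, or is a single truncated tag at the very end); strip_feed_trailer is a hypothetical name, and decoding entities first also decodes any entities in the article text itself.

```python
import html
import re


def strip_feed_trailer(raw):
    """Drop the feed trailer and any truncated tag at the end of a field."""
    # Decode entities so escaped fragments (&lt;img ...) look like real tags.
    decoded = html.unescape(raw)
    # Same effect as the Perl s/<div>.*//s: cut from the first <div> onward.
    body = re.sub(r"<div>.*", "", decoded, flags=re.DOTALL)
    # Remove a dangling tag that was cut off before its closing ">".
    body = re.sub(r"<[^>]*$", "", body)
    return body.rstrip()
```

A record that consists only of one mangled img tag comes back as an empty string, which makes those rows easy to find and delete afterwards.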
How about a stupid simple Perl find and replace on the var containing your data...
foreach my $line (@lines) {
    $line =~ s/&lt;/</gi;
    $line =~ s/&gt;/>/gi;
}
Given the sample input and output you give at the end of your post, the following will get you the desired output:
#!/usr/bin/perl
use strict; use warnings;
use HTML::TokeParser::Simple;
my $parser = HTML::TokeParser::Simple->new( \*DATA );
if ( my $tag = $parser->get_tag('img') ) {
    print $tag->as_is;
    print $parser->get_text('div');
}
__DATA__
<img src="http://farm4.static.flickr.com/3162/2735565872_b8a4e4bd17.jpg" alt="post_img" width="80" />Through the first two months of the year, the volume of cargo handled at Port of Portland terminals has increased 46 percent as the port?s marine cargo business shows signs of recovering from a dismal 2009.<div> <img src="http://feeds.feedburner.com/~ff/bizj_portland?d=yIl2AUoC8zA" border="0"></img> <img src="http://feeds.feedburner.com/~ff/bizj_portland?i=YIs66yw13JE:_zirAnH6dt8:V_sGLiPBpWU" border="0"></img> <img src="http://feeds.feedburner.com/~ff/bizj_portland?i=YIs66yw13JE:_zirAnH6dt8:F7zBnMyn0Lo" border="0"></img> <a href="http://feeds.bizjournals.com/~ff/bizj_portland?a=YIs66yw13JE:_zirAnH6dt8:qj6IDK7rITs"><img src="http://feeds.feedburner.com/~ff/bizj_portland?d=qj6IDK7rITs"
Output:
<img src="http://farm4.static.flickr.com/3162/2735565872_b8a4e4bd17.jpg" alt="po
st_img" width="80" />Through the first two months of the year, the volume of car
go handled at Port of Portland terminals has increased 46 percent as the port?s
marine cargo business shows signs of recovering from a dismal 2009.
However, I am puzzled as to the size and scope of each chunk you are supposed to process.