Regex to detect all characters outside of the <img> tag - html

I don't have experience with regex. I am just trying to find a way to detect
and delete every character outside of the img tags. In other words, I want to
strip all text and tags from a given piece of HTML and keep only what is inside
the img tags. The result should show just the image tags, like this:
<img src="sourcehere">
Is there a way to do this?
UPDATE:
I need specifically a regex that goes in preg_replace.
This is what I have done, but it doesn't work:
$buffer ="<html><head></head><body><img src='image.jpg'></body></html>";
$buffer = preg_replace('(?i)<(?!img|/img).*?>', '', $buffer);
echo $buffer; /* should output <img src='image.jpg'> but it doesn't */

What are your plans: do you want to log the result to a file, display it in a console, or output it some other way? This worked for me, but actually serializing it to a string might take extra work.
This uses jQuery. From my understanding, you want to remove everything but the images from your document.
// Copy the live document.images collection into a plain array before emptying the body
var arr2 = Array.prototype.slice.call( document.images );
jQuery('body').contents().remove();
for (var i = 0; i < arr2.length; i++) {
jQuery('body').append(arr2[i]);
}

This doesn't need to be some big and fancy regex.
<img[^>]*>
This matches the text "<img", followed by any number of characters that are not ">", followed by the closing ">".
Once you have the matches you would just want to write out the matches to a string, or to the document, or however you want to represent them.
EDIT:
To complete what the OP is showing in PHP, you would want to call a match function instead of replace. You don't really need to replace all of the non-matching sections; you can just keep the results. Use preg_match_all rather than preg_match, since preg_match stops after the first image:
$buffer ="<html><head></head><body><img src='image.jpg'></body></html>";
preg_match_all("/<img[^>]*>/i", $buffer, $matches);
// $matches[0] holds every full match of the pattern
foreach ($matches[0] as $match){
echo $match;
}
prints out:
<img src='image.jpg'>
EDIT:
The problem with replacing every other tag is that any text between the tags is left behind. If you don't care about that, then here is something that works using preg_replace().
$buffer ="<html><head></head><body><img src='image.jpg'></body></html>";
$buffer = preg_replace('/(?i)<\\/*(?!img)[^>]*>/', '', $buffer);
echo $buffer; /* outputs <img src='image.jpg'> */
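For comparison, here is a minimal Python sketch of the "collect the matches instead of replacing everything else" idea. The function name extract_img_tags and the sample markup are made up for illustration, and the pattern assumes no ">" appears inside an attribute value.

```python
import re

# Case-insensitive pattern for an <img ...> tag.
# Assumption: attribute values never contain a literal ">".
IMG_TAG = re.compile(r"<img\b[^>]*>", re.IGNORECASE)


def extract_img_tags(html):
    """Return every <img> tag found in html, concatenated in document order."""
    return "".join(IMG_TAG.findall(html))


buffer = ("<html><head></head><body><p>text<img src='image.jpg'></p>"
          "<IMG src='b.png'></body></html>")
print(extract_img_tags(buffer))
```

Unlike the tag-stripping replace, this also drops the text between tags, because only the matches are kept.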

Use
preg_replace('/<img[^>]*>(*SKIP)(*FAIL)|./si', '', $buffer)
See regex proof.
EXPLANATION
--------------------------------------------------------------------------------
  <img                     '<img'
--------------------------------------------------------------------------------
  [^>]*                    any character except: '>' (0 or more times
                           (matching the most amount possible))
--------------------------------------------------------------------------------
  >                        '>'
--------------------------------------------------------------------------------
  (*SKIP)(*FAIL)           skips the match
--------------------------------------------------------------------------------
  |                        or
--------------------------------------------------------------------------------
  .                        any character
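The (*SKIP)(*FAIL) verbs are PCRE features; Python's standard re module does not support them. A rough equivalent, sketched here under that assumption, is to capture the part you want to keep in a group and let the replacement callback return either the group or nothing. The name keep_only_img is hypothetical.

```python
import re

# Alternative 1 captures an <img> tag; alternative 2 matches any single character.
# DOTALL lets "." match newlines too, like the /s modifier in the PHP pattern.
KEEP_IMG = re.compile(r"(<img\b[^>]*>)|.", re.IGNORECASE | re.DOTALL)


def keep_only_img(html):
    """Delete every character that is not part of an <img> tag."""
    # group(1) is None when the single-character alternative matched
    return KEEP_IMG.sub(lambda m: m.group(1) or "", html)


print(keep_only_img("<html><body>Hello <img src='image.jpg'> world\n</body></html>"))
```

Because alternatives are tried left to right at each position, an img tag is always consumed whole before the single-character branch gets a chance to eat its "<".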

Related

(OBJ) symbol in WordPress URL?

I have a question about a WordPress URL in Google Chrome 94.0.4606.81:
I was reading a WordPress article recently and noticed that there is an (OBJ) symbol in the URL. The symbol is also in the webpage title.
Take Ownership and Select Owner
Question:
What is the purpose of the (OBJ) symbol, and how is it possible that it has been included in a URL?
It seems this symbol ended up in the title field of the article. You can remove it from there. If you can't see it, select everything in the field with Ctrl + A and retype the title.
Honestly, I don't know what causes this copy/paste issue in WP with the "Object Replacement Character".
To keep this character from appearing, it's enough to paste into the WP post title field with the Ctrl+Shift+V shortcut, i.e. Paste Text Without Formatting.
If you want to be sure the post slug (i.e. the post URL) is protected, you can use this snippet in your functions.php:
/**
* Remove the strange [OBJ] character in the post slug
* See: https://github.com/WordPress/gutenberg/issues/38637
*/
add_filter("wp_unique_post_slug", function($slug, $post_ID, $post_status, $post_type, $post_parent, $original_slug) {
return preg_replace('/(%ef%bf%bc)|(efbfbc)|[^\w-]/', '', $slug);
}, 10, 6);
Here preg_replace searches for the string "%ef%bf%bc" or "efbfbc" (the UTF-8 bytes of the OBJ character, hex-encoded) or any character that is not a basic alphanumeric character, underscore, or dash, and deletes whatever it finds.
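To see what that pattern does outside WordPress, here is a small Python sketch of the same substitution. The function name clean_slug is made up, and re.ASCII is my assumption for mimicking PHP's \w without the /u modifier.

```python
import re

# Same three alternatives as the PHP pattern: the percent-encoded OBJ bytes,
# the bare hex form, or any character that is not [A-Za-z0-9_] or "-".
SLUG_JUNK = re.compile(r"%ef%bf%bc|efbfbc|[^\w-]", re.ASCII)


def clean_slug(slug):
    """Strip the encoded OBJ character and any non-slug characters."""
    return SLUG_JUNK.sub("", slug)


print(clean_slug("take-ownership-and-select-owner%ef%bf%bc"))
```

Note that the bare "efbfbc" alternative would also delete that letter sequence if it ever occurred legitimately in a slug.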
Since you've mentioned it also made it into the title: I use this to filter the title on save and remove these special characters.
function sbnc_filter_title($title) {
// Concatenate separate diacritics into one character if we can
if ( function_exists('normalizer_normalize') && strlen( normalizer_normalize( $title ) ) < strlen( $title ) ) {
$title = normalizer_normalize( $title );
}
// Replace no-break-space with regular space
$title = preg_replace( '/\x{00A0}/u', ' ', $title );
// Remove whitespaces from the ends
$title = trim($title);
// Remove any invisible and control characters
$title = preg_replace('/[^\x{0020}-\x{007e}\x{00a1}-\x{FFEF}]/u', '', $title);
return $title;
}
add_filter('title_save_pre', 'sbnc_filter_title');
Please note that you may need to extend the set of allowed UTF ranges in the preg_replace call based on the languages you support. The range in the example should suit most languages actively used in the world, but if your article titles may include archaic scripts like Linear B, Gothic, etc., you will need to extend the ranges.
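The same filtering steps can be tried quickly in Python. This is only a sketch of the idea, not a port of the WordPress filter: filter_title is a hypothetical name, NFC is assumed as the normalization form, and trimming is done last so that a space left in front of a removed trailing character is also dropped.

```python
import re
import unicodedata

# Anything outside printable ASCII and U+00A1..U+FFEF is removed;
# U+FFFC (the object replacement character) falls outside that range.
DISALLOWED = re.compile(r"[^\u0020-\u007e\u00a1-\uffef]")


def filter_title(title):
    title = unicodedata.normalize("NFC", title)  # combine separate diacritics
    title = title.replace("\u00a0", " ")         # no-break space -> regular space
    title = DISALLOWED.sub("", title)            # drop invisible/control characters
    return title.strip()                         # trim the ends last


print(filter_title("Take Ownership and Select Owner\ufffc"))
```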
If you copy-pasted the title from somewhere, like I did, remember to paste as text using Ctrl + Shift + V to avoid this.
Also, this [OBJ] only appears in Chromium-based browsers like Chrome and Edge; Firefox, I believe, discards it by default.

preg_replace not working

I have this function in my website.
function autolink($content) {
$pattern = "/>>[0-9]/i" ;
$replacement = ">>$0";
return preg_replace($pattern, $replacement, $content, -1);
}
This is for making certain character sequences into clickable hyperlinks.
For example, it is useful when a user on a thread types '>>4' to refer to reply number 4.
But it's not working. The characters are not converted into a hyperlink; they just remain as plain, non-clickable text.
Could someone tell me what is wrong with the function?
So the objective is to convert:
This is a reference to the >>4 reply
...into:
This is a reference to the <a href="...">&gt;&gt;4</a> reply
...where "&gt;" is the HTML entity for ">". (Remember, you don't want to create HTML issues.)
The problems: (1) you forgot to escape the quotes in the replacement; (2) since you want to isolate the number, you need to use parentheses to create a sub-pattern for later reference.
Once you do this, you arrive at:
function autolink($contents) {
// Capture the whole reply number as $1; the href value is elided here
return preg_replace( "/>>([0-9]+)/",
"<a href=\"...\">&gt;&gt;$1</a>",
$contents,
-1
);
}
Good luck
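As a quick way to check the capture-group idea, here is the same substitution sketched in Python. The "#reply-N" href format is purely an assumption for illustration; use whatever anchor scheme your thread pages actually have.

```python
import re

# One or more digits after ">>" are captured as group 1.
REPLY_REF = re.compile(r">>([0-9]+)")


def autolink(content):
    """Turn '>>N' references into links; the '#reply-N' target is hypothetical."""
    return REPLY_REF.sub(r'<a href="#reply-\1">&gt;&gt;\1</a>', content)


print(autolink("This is a reference to the >>4 reply"))
```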

Parse html using Perl works for 2 lines but not multiple

I have written the following Perl script-
use HTML::TreeBuilder;
my $html = HTML::TreeBuilder->new_from_content(<<END_HTML);
<span class=time>1 h </span>
<a href="...">User</a>: There are not enough <b>big</b>
<b>fish</b> in the lake ;
END_HTML
my $source = "foo";
my @time = "10-14-2011";
my $name = $html->find('a')->as_text;
my $comment = $html->as_text;
my @keywords = map { $_->as_text } $html->find('b');
This outputs: foo, 10-14-2011, User, 1h User: There are not enough big fish in the lake, big fish
That is perfect and exactly what I wanted from the test HTML, but
it only works when I put in the aforementioned HTML, which I did for test purposes.
However, the full HTML file contains multiple 'a' and 'b' elements, and when I print the results, these columns come out blank.
How can I account for multiple values for specific searches?
Without sight of your real HTML it is hard to help, but $html->find returns a list of <a> elements, so you could write something like
foreach my $anchor ($html->find('a')) {
print $anchor->as_text, "\n";
}
But that will find all <a> elements, and it is unlikely that that is what you want. $html->look_down() is far more flexible, and provides for searching by attribute as well as by tag name.
I cannot begin to guess about your problem with comments without seeing what data you are dealing with.
If you need to process each text element independently then you probably need to call the objectify_text method. This turns every text element in the tree into a pseudo element with a ~text tag name and a text attribute, for instance <p>paragraph text</p> would be transformed into <p><~text text="paragraph text" /></p>. These elements can be discovered using $html->find('~text') as normal. Here is some code to demonstrate
use strict;
use warnings;
use HTML::TreeBuilder;
my $html = HTML::TreeBuilder->new_from_content(<<END_HTML);
<span class=time>1 h </span>
<a href="...">User</a>: There are not enough <b>big</b>
<b>fish</b> in the lake ;
END_HTML
$html->objectify_text;
print $_->attr('text'), "\n" for $html->find('~text');
OUTPUT
1 h
User
: There are not enough
big
fish
in the lake ;
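The "one entry per text node" idea is not specific to HTML::TreeBuilder. As a rough illustration, here is a sketch with Python's standard html.parser; the class name TextCollector and the href="#" placeholder in the sample are assumptions of mine, not part of the original data.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect each non-empty text node separately, in document order."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:  # skip whitespace-only nodes between tags
            self.chunks.append(text)


def text_nodes(html):
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    return parser.chunks


sample = """<span class=time>1 h </span>
<a href="#">User</a>: There are not enough <b>big</b>
<b>fish</b> in the lake ;
"""
for chunk in text_nodes(sample):
    print(chunk)
```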

How can I reliably parse a QuakeLive player profile using Perl?

I'm currently working on a Perl script to gather data from the QuakeLive website.
Everything was going fine until I couldn't get a set of data.
I was using regexes for that and they work for everything apart from the favourite arena, weapon and game type. I just need to get the names of those three elements in a $1 for further processing.
I tried regexing up to the favorites image, but without succeeding. If it's any use, I'm already using WWW::Mechanize in the script.
I think that the problem could be related to the class name of the paragraphs where those elements are, while the previous one was classless.
You can find an example profile HERE.
Note that for the previous part of the page, it worked using code like:
$content =~ /<b>Wins:<\/b> (.*?)<br \/>/;
$wins = $1;
print "Wins: $wins\n";
The immediate problem is that you have:
<p class="prf_faves">
<img src="http://cdn.quakelive.com/web/2010092807/images/profile/none_v2010092807.0.gif"
width="17" height="17" alt="" class="fl fivepxhr" />
<b>Arena:</b> Campgrounds
<div class="cl"></div>
</p>
That is, there is no <br /> following the value for favorites such as Arena. Now, the correct way to do this would involve using a proper HTML parser. The fragile solution is to adapt your pattern (untested):
my ($favarena) = $content =~ m{<b>Arena:</b> ([^<]+)};
That should put everything up to the < of the next <div> in $favarena. Now, if all arenas are single words with no spaces in them,
my ($favarena) = $content =~ m{<b>Arena:</b> (\S+)};
would save you the trouble of having to trim whitespace afterwards.
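The "capture everything up to the next <" pattern can be tried out in isolation. This is a small Python sketch of it; the function name favourite and the trimmed-down fragment are only for illustration, and it is just as fragile as any regex over HTML.

```python
import re


def favourite(content, label):
    """Return the text after <b>label:</b> up to the next tag, or None."""
    pattern = r"<b>" + re.escape(label) + r":</b>\s*([^<]+)"
    m = re.search(pattern, content)
    return m.group(1).strip() if m else None


fragment = """<p class="prf_faves">
<b>Arena:</b> Campgrounds
<div class="cl"></div>
</p>"""
print(favourite(fragment, "Arena"))
```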
Note that it is easy for such regex based solutions to be fooled with simple things like commented out snippets in the source. E.g., if the source were to be changed to:
<p class="prf_faves">
<img src="http://cdn.quakelive.com/web/2010092807/images/profile/none_v2010092807.0.gif"
width="17" height="17" alt="" class="fl fivepxhr" />
<!-- <b>Arena: </b> here -->
<b>Arena:</b> Campgrounds
<div class="cl"></div>
</p>
your script would be in trouble, whereas a solution using an HTML parser would not.
An example using HTML::TokeParser::Simple:
#!/usr/bin/perl
use strict; use warnings;
use HTML::TokeParser::Simple;
my $p = HTML::TokeParser::Simple->new( 'martianbuddy.html' );
while ( my $tag = $p->get_tag('p') ) {
next unless $tag->is_start_tag;
next unless defined (my $class = $tag->get_attr('class'));
next unless grep { /^prf_faves\z/ } split ' ', $class;
my $fav = $p->get_tag('b');
my $type = $p->get_text('/b');
my $value = $p->get_text('/p');
$value =~ s/\s+\z//;
print "$type = $value\n";
}
Output:
Arena: Campgrounds
Game Type: Clan Arena
Weapon: Rocket Launcher
And, here is an example using HTML::TreeBuilder:
#!/usr/bin/perl
use strict; use warnings;
use HTML::TreeBuilder;
use YAML;
my $tree = HTML::TreeBuilder->new;
$tree->parse_file('martianbuddy.html');
my @p = $tree->look_down(_tag => 'p', sub {
return unless defined (my $class = $_[0]->attr('class'));
return unless grep { /^prf_faves\z/ } split ' ', $class;
return 1;
}
);
for my $p ( @p ) {
my $text = $p->as_text;
$text =~ s/^\s+//;
my ($type, $value) = split ': ', $text;
print "$type: $value\n";
}
Output:
Arena: Campgrounds
Game Type: Clan Arena
Weapon: Rocket Launcher
Given that the document is an HTML fragment rather than a full document, you will have more success with modules based on HTML::Parser rather than those that expect to operate on well-formed XML documents.
Using regular expressions for this particular task is less than ideal. There are just too many things that might change, and you're not taking advantage of inherent structure of HTML pages. Have you considered using something like HTML::TreeBuilder instead? It will allow you to say "get me the value of the 3rd table cell in the table named weapons", etc.
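To illustrate why a parser copes with the commented-out snippet, here is a sketch using Python's standard html.parser rather than the Perl modules above. The names FavesParser and parse_faves are made up; the sample is the fragment from earlier with the comment included.

```python
from html.parser import HTMLParser


class FavesParser(HTMLParser):
    """Collect 'Label: value' pairs from <p class="prf_faves"> blocks."""

    def __init__(self):
        super().__init__()
        self.in_fave = False
        self.buffer = []
        self.faves = {}

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "p" and "prf_faves" in classes:
            self.in_fave = True
            self.buffer = []

    def handle_data(self, data):
        # Comments never reach handle_data, so commented-out markup is ignored
        if self.in_fave:
            self.buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "p" and self.in_fave:
            text = " ".join("".join(self.buffer).split())
            label, _, value = text.partition(":")
            if value:
                self.faves[label.strip()] = value.strip()
            self.in_fave = False


def parse_faves(html):
    parser = FavesParser()
    parser.feed(html)
    parser.close()
    return parser.faves


fragment = """<p class="prf_faves">
<img src="none.gif" width="17" height="17" alt="" class="fl fivepxhr" />
<!-- <b>Arena: </b> here -->
<b>Arena:</b> Campgrounds
<div class="cl"></div>
</p>"""
print(parse_faves(fragment))
```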

How can I extract HTML img tags wrapped in anchors in Perl?

I am working on parsing HTML to obtain all the hrefs that match a particular url (let's call it "target url") and then get the anchor text. I have tried LinkExtractor, TokenParser, Mechanize, TreeBuilder modules. For below HTML:
<a href="target_url">
<img src=somepath/nw.gf alt="Open this result in new window">
</a>
all of them give "Open this result in new window" as the anchor text.
Ideally I would like to see blank value or a string like "image" returned so that I know there was no anchor text but the href still matched the target url (http://www.yahoo.com in this case). Is there a way to get the desired result using other module or Perl regex?
Thanks,
You should post some examples that you tried with "LinkExtractor, TokenParser, Mechanize & TreeBuilder" so that we can help you.
Here is something which works for me in pQuery:
use pQuery;
my $data = '
<html>
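<a href="...">Not yahoo anchor text</a>
<a href="http://www.yahoo.com"><img src="somepath/nw.gif" alt="Open this result in new window"></img></a>
<a href="http://www.yahoo.com">just text for yahoo</a>
<a href="http://www.yahoo.com">anchor text only<img src="blah" alt="alt text"/></a>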
</html>
';
pQuery( $data )->find( 'a' )->each(
sub {
say $_->innerHTML
if $_->getAttribute( 'href' ) eq 'http://www.yahoo.com';
}
);
# produces:
#
# => <img alt="Open this result in new window" src="somepath/nw.gif"></img>
# => just text for yahoo
# => anchor text only<img /="/" alt="alt text" src="blah"></img>
#
And if you just want the text:
pQuery( $data )->find( 'a' )->each(
sub {
return unless $_->getAttribute( 'href' ) eq 'http://www.yahoo.com';
if ( my $text = pQuery($_)->text ) { say $text }
}
);
# produces:
#
# => just text for yahoo
# => anchor text only
#
/I3az/
Use a proper parser (like HTML::Parser or HTML::TreeBuilder). Using regular expressions to parse SGML (HTML/XML included) isn't really all that effective because of funny multiline tags and attributes like the one you've run into.
If the HTML you are working with is fairly close to well formed you can usually load it into an XML module that supports HTML and use it to find and extract data from the parts of the document you are interested in.
My method of choice is XML::LibXML and XPath.
use feature 'say';
use XML::LibXML;
my $parser = XML::LibXML->new();
my $html = ...;
my $doc = $parser->parse_html_string($html);
# Select every <a> element whose href attribute equals the target URL
my @links = $doc->findnodes('//a[@href = "http://example.com"]');
for my $node (@links) {
say $node->textContent();
}
The string passed to findnodes is an XPath expression that looks for all 'a' element descendants of $doc that have an href attribute equal to "http://example.com".
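The underlying idea, taking only the text nodes inside a matching anchor and ignoring attributes such as alt, can also be sketched with Python's standard html.parser. AnchorText and anchor_texts are hypothetical names; an image-only anchor comes back as an empty string, which is the "blank value" the question asks for.

```python
from html.parser import HTMLParser


class AnchorText(HTMLParser):
    """Collect the text content of each <a> whose href equals target."""

    def __init__(self, target):
        super().__init__()
        self.target = target
        self.inside = False
        self.parts = []
        self.results = []

    def handle_starttag(self, tag, attrs):
        if tag == "a" and not self.inside and dict(attrs).get("href") == self.target:
            self.inside = True
            self.parts = []

    def handle_data(self, data):
        # Only real text nodes arrive here; an <img> alt attribute does not
        if self.inside:
            self.parts.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self.inside:
            self.results.append(" ".join("".join(self.parts).split()))
            self.inside = False


def anchor_texts(html, target):
    parser = AnchorText(target)
    parser.feed(html)
    parser.close()
    return parser.results


page = ('<a href="http://www.yahoo.com">\n'
        '<img src="somepath/nw.gif" alt="Open this result in new window">\n</a>'
        '<a href="http://www.yahoo.com">just text for yahoo</a>'
        '<a href="http://example.org">other</a>')
print(anchor_texts(page, "http://www.yahoo.com"))
```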