Extract title tag from HTML

I want to extract the contents of the title tag from an HTML string. I have searched, but so far I have not been able to find such code in VB/C# or PHP. It should also work with both upper- and lower-case tags, e.g. with both <title></title> and <TITLE></TITLE>. Thank you.

You can use regular expressions for this but it's not completely error-proof. It'll do if you just want something simple though (in PHP):
function get_title($html) {
    // "i" ignores case; "s" lets . match newlines, so a title split across lines is still found
    return preg_match('!<title>(.*?)</title>!is', $html, $matches) ? $matches[1] : '';
}

Sounds like a job for a regular expression. This depends on the HTML being well-formed: it only finds a title element inside a head element.
Regex regex = new Regex( ".*<head>.*<title>(.*?)</title>.*</head>.*",
    RegexOptions.IgnoreCase | RegexOptions.Singleline );
Match match = regex.Match( html );
// Groups[0] is the entire match; the captured title text is Groups[1].
string title = match.Groups[1].Value;
I don't have my regex cheat sheet in front of me so it may need a little tweaking. Note that there is also no error checking in the case where no title element exists.

If the title tag has any attributes (unlikely, but it can happen), update the expression as follows. Use [^>]* rather than .* for the attributes, so the match cannot run past the end of the opening tag:
$title = preg_match('!<title[^>]*>(.*?)</title>!i', $url_content, $matches) ? $matches[1] : '';
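For comparison, here is a rough sketch of the same extraction done with a parser instead of a regex, written in Python with the standard library's html.parser. The names TitleParser and get_title are made up for this example. It assumes you only want the text of the first title element, and it copes with upper-case tags, attributes on the tag, and titles that span several lines.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text of the first <title> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.done = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag names, so <TITLE> arrives as "title".
        if tag == "title" and not self.done:
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self.in_title:
            self.in_title = False
            self.done = True

    def handle_data(self, data):
        # Text may arrive in several chunks, so accumulate it.
        if self.in_title:
            self.parts.append(data)


def get_title(html):
    """Return the title text, or an empty string if there is no title."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts).strip()
```

Calling get_title on a page without a title element returns an empty string, which mirrors the PHP function above.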

Related

Ruby regex to match content between <ul> tags

I have a script to grab a page and edit it. The page HTML looks something like this:
<p>Title</p>...extra content...<ul><li>Item1</li><li>Item2</li></ul>
There are multiple titles and multiple unordered lists but I want to change each list with a regular expression that can find the list with a certain title and use .sub in Ruby to replace it.
The regex I currently have looks like this:
regex = /<p>Title1?.*<\/ul>/
The problem is that if there are any lists further down the page, the regex matches all the way to the last </ul> tag and accidentally grabs every list below it. For example, if I have this content:
content = "<p>Title1</p><ul><li>Item1</li><li>Item2</li></ul><p>Title2</p><ul><li>Item1</li><li>Item2</li><li>Item3</li></ul>"
and I want to add another list item to the section for Title 1:
content.sub(regex, "<p>Title1</p><ul><li>Item1</li><li>Item2</li><li>NEW_ITEM</li></ul>")
It will delete all items below it. How do I rewrite my regex to select only the first /ul tag to substitute?
"I want to change each list with a regular expression." No you don't. You really do not want to go down this road because it's filled with misery, sorrow, and tears. One day someone will put a list item in your list item.
There are libraries like Nokogiri that make manipulating HTML very easy. There's no excuse to not use something like it:
require 'nokogiri'
html = "<p>Title</p>...extra content...<ul><li>Item1</li><li>Item2</li></ul>"
doc = Nokogiri::HTML(html)
doc.css('ul').children.first.inner_html = 'Replaced Text'
puts doc.to_s
That serves as a simple example for "replace text from first list item". It can be easily adapted to do other things, as the css method takes a simple CSS selector, not unlike jQuery.
Use a non-greedy (lazy) quantifier .*?
See this explanation of Ruby Regexp repetition.
regex = /<p>Title1?.*?<\/ul>/
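To see the difference between the greedy and lazy forms, here is a small Python sketch using the content string from the question. It is only an illustration: the variable names are made up, the pattern is simplified to match Title1 literally, and Python's re.sub with count=1 stands in for Ruby's sub.

```python
import re

content = (
    "<p>Title1</p><ul><li>Item1</li><li>Item2</li></ul>"
    "<p>Title2</p><ul><li>Item1</li><li>Item2</li><li>Item3</li></ul>"
)
replacement = "<p>Title1</p><ul><li>Item1</li><li>Item2</li><li>NEW_ITEM</li></ul>"

# Greedy: .* runs to the LAST </ul> in the string.
greedy = re.compile(r"<p>Title1.*</ul>")
# Lazy: .*? stops at the FIRST </ul> after the title.
lazy = re.compile(r"<p>Title1.*?</ul>")

# count=1 replaces only the first match, like Ruby's String#sub.
greedy_result = greedy.sub(replacement, content, count=1)
lazy_result = lazy.sub(replacement, content, count=1)

print(greedy_result)  # the Title2 section is gone
print(lazy_result)    # the Title2 section is still there
```

The lazy version still breaks as soon as a list contains a nested list, which is the case the parser-based answer is warning about.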
...it reformats the html with newlines and changes all <br /> to <br>...
That's usually because the wrong method is used when emitting the doc as HTML or XHTML:
doc = Nokogiri::HTML::DocumentFragment.parse('<p>foo<br />bar</p>')
doc.to_xhtml # => "<p>foo<br />bar</p>"
doc.to_html # => "<p>foo<br>bar</p>"
doc = Nokogiri::HTML::DocumentFragment.parse('<p>foo<br>bar</p>')
doc.to_xhtml # => "<p>foo<br />bar</p>"
doc.to_html # => "<p>foo<br>bar</p>"
As for spuriously adding line-ends where they weren't before, I haven't seen that. It's possible to tell Nokogiri to do that if you're modifying the DOM, but from what I've seen, on its own Nokogiri is very benign.

Count html tags with Perl regex

I'm trying to parse an HTML file to count HTML tags. I'm not very familiar with regular expressions, though.
My current code counts line by line, not tag by tag, and it prints the whole line.
while(<SUB>){
while(/(<[^\/][a-z].*>)/gi){
print $_;
$count++;
}
}
Suppose that we have a line like this in the file:
<div>blahblahblah</div><h1>hello</h1><p>blah</p>
I need to extract the opening tag of every HTML element, including tags like <hr>, <br> and <img>.
Could you please point me in the right direction?
If you want to count HTML tags within a document, I suggest that you use HTML::TreeBuilder.
use strict;
use warnings;
use HTML::TreeBuilder;
use LWP::Simple;
my $ex = "http://www.google.com";
my $content = get($ex);
my $tree = HTML::TreeBuilder->new();
$tree->parse($content);
$tree->eof();    # signal end of input so the tree is complete
# look_down returns every element whose tag name is 'div'
my @a_tags = $tree->look_down( '_tag', 'div' );
my $size = @a_tags;    # an array in scalar context gives its length
print $size;
Now you can specify other tag names instead of div and count whichever tags you need. I suggest studying HTML::TreeBuilder, as it is a very useful module and you may find other methods in it that help.
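The same idea, counting every opening tag in one pass rather than one tag name at a time, can be sketched in Python with the standard library. This is an illustration only; TagCounter and count_tags are names invented for the example, and it counts tags in a string rather than fetching a URL.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Counts opening tags by name (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # Called for <div>, <hr>, <img ...> and, by default, for
        # self-closing forms such as <br/> as well. Closing tags are
        # reported separately through handle_endtag, so they are not counted.
        self.counts[tag] += 1


def count_tags(html):
    """Return a Counter mapping lowercase tag name to number of opening tags."""
    counter = TagCounter()
    counter.feed(html)
    counter.close()
    return counter.counts


line = "<div>blahblahblah</div><h1>hello</h1><p>blah</p><hr><br/><img src='x.png'>"
counts = count_tags(line)
print(sum(counts.values()), dict(counts))
```

Because the parser reports tags one at a time, several tags on one line are counted separately, which is what the line-based loop in the question could not do.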

Regex to find content, then backtrack to initial HTML tag

I'm trying to use regex to match a string that starts with a <p> tag and has some specific content. Then, I want to replace everything from that specific paragraph tag to the end of the page.
I've tried using the expression <p.*?some content.*</html>, but it grabs the first <p> tag it sees, then follows through all the way to the end. I want it to only recognize the paragraph tag immediately preceding the content, allowing for other content and tags between the paragraph tag and the content.
How can I get to some specific content with the regex, then backtrack to the first paragraph tag it sees before the content, and then select everything from there to the end?
If it helps, I'm using EditPad Pro's "Search & Replace" function (although this could apply to anything that uses regex).
For simple input use regex
<p[^<]*some content.*<\/html>
but safer would be to use regex
<p(?:[^<]*|<(?!p\b))*some content.*<\/html>
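Here is a rough Python sketch of how the second pattern behaves. It is an adaptation, not the exact expression above: it uses a single [^<] per step instead of [^<]* (to keep backtracking cheap), adds \b after <p so that <pre> is not mistaken for a paragraph, makes the loop lazy, and turns on DOTALL so the match can cross line breaks. The function name and sample input are made up.

```python
import re

# "<p", then any run of text or tags that does not start another <p,
# then the target phrase, then everything up to the final </html>.
PATTERN = re.compile(
    r"<p\b(?:[^<]|<(?!p\b))*?some content.*</html>",
    re.DOTALL | re.IGNORECASE,
)


def replace_from_paragraph(html, replacement):
    """Replace everything from the <p> that precedes the phrase to </html>."""
    # A function is used as the replacement so that backslashes in
    # `replacement` are inserted literally.
    return PATTERN.sub(lambda m: replacement, html, count=1)


html = (
    "<html><body><p>first</p>"
    "<p>intro <b>bold</b> some content here</p>"
    "<p>after</p></body></html>"
)
print(replace_from_paragraph(html, "<p>REPLACED</p></body></html>"))
```

The first paragraph is left alone because the negative lookahead refuses to step over the start of the second <p> tag, so the match can only begin at the paragraph that actually contains the phrase.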
To start, this is Java code, but it can be easily adapted to other regex engines / programming languages, I suppose.
So from what I understand, your input has a part that starts with <p> and is immediately followed by some target content/phrase, and you want to replace everything from that <p> tag onwards with some other content?
If that is correct, you could do something like this:
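String input; // holds your input text/html
String targetPhrase = "some specific content"; // some target content/phrase
String replacement; // holds the replacement value
// Pattern.quote escapes regex metacharacters in the phrase; DOTALL lets .* span line breaks
Pattern p = Pattern.compile("<p[^>]*>(" + Pattern.quote(targetPhrase) + ".*)$",
        Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
Matcher m = p.matcher(input);
// replaceFirst returns the new string; it does not modify input
String output = m.replaceFirst(replacement);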
Of course, as mentioned in above comments, you really don't want to be using regex for HTML.
Alternatively, if you know that the <p> tag is bare, with no attributes or anything else, you could use a substring search instead.
So for example, if you're looking for "<p>some specific content", you could try something like:
String input; // your input text/html
String replacement; // the replacement value(s)
int index = input.indexOf("<p>some specific content");
if (index > -1) {
String output = input.substring(0, index);
output += "<p>" + replacement;
// now output holds your modified text/html
}

Perl - split html code by "table" tag and its contents

I'm trying to split a chunk of HTML code by the "table" tag and its contents.
So, I tried
my $html = 'aaa<table>test</table>bbb<table>test2</table>ccc';
my @values = split(/<table*.*\/table>/, $html);
After this, I want the @values array to look like this:
array('aaa', 'bbb', 'ccc').
But it returns this array:
array('aaa', 'ccc').
Can anyone tell me how I can specify to the split function that each table should be parsed separately?
Thank you!
Your regex is greedy, change it to /<table.*?\/table>/ and it will do what you want. But you should really look into a proper HTML parser if you are going to be doing any serious work. A search of CPAN should find one that is suited to your needs.
Your regex .* is greedy, therefore chewing its way to the last part of the string. Change it to .*? and it should work better.
Use a ? to specify non-greedy wild-card char slurping, i.e.
my @values = split(/<table*.*?\/table>/, $html);
Using an HTML parser may be overkill for your example, but it will pay off later when your example grows. Here is a solution using HTML::TreeBuilder:
use HTML::TreeBuilder;
use Data::Dump qw(dd);
my $html = 'aaa<table>test</table>bbb<table>test2</table>ccc';
my $tree = HTML::TreeBuilder->new_from_content($html);
# remove all <table>....</table>
$_->delete for $tree->find('table');
dd($tree->guts); # ("aaa", "bbb", "ccc")
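The parser approach can also be sketched in Python with the standard library. This is a rough illustration with invented names (OutsideTables, split_on_tables). It assumes you only need the text that sits outside tables, so any other markup between the tables is dropped. A depth counter keeps nested tables from ending the skip too early, which is the case that trips up the lazy regex.

```python
from html.parser import HTMLParser


class OutsideTables(HTMLParser):
    """Collects runs of text that are not inside any <table> (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # how many <table> elements we are inside
        self.chunks = []    # finished text chunks
        self.current = []   # pieces of the chunk being built

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            if self.depth == 0:
                self.flush()  # an outermost table ends the current chunk
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "table" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0:
            self.current.append(data)

    def flush(self):
        self.chunks.append("".join(self.current))
        self.current = []


def split_on_tables(html):
    """Return the text chunks found between top-level tables."""
    parser = OutsideTables()
    parser.feed(html)
    parser.close()
    parser.flush()  # keep the text after the last table
    return parser.chunks


print(split_on_tables("aaa<table>test</table>bbb<table>test2</table>ccc"))
```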

Regular expression to find URLs not inside a hyperlink

There are many regexes out there to match a URL. However, I'm trying to match URLs that do not appear anywhere within an <a> hyperlink tag (HREF, inner value, etc.). So NONE of the URLs in these should match:
<a href="...">something</a>
<a href="...">http://www.example2.com</a>
<a href="..."><b>something</b>http://www.example.com/<span>test</span></a>
Any URL outside of <a></a> should be matched.
One approach I tried was to use a negative lookahead to see if the first <a> tag after the URL was an opening <a> or a closing </a>. If it is a closing </a> then the URL must be inside a hyperlink. I think this idea was okay, but the negative lookahead regex didn't work (or more accurately, the regex wasn't written correctly). Any tips are very appreciated.
I was looking for this answer as well, and because nothing out there really worked the way I wanted it to, this is the regex that I created. Since it's a regex, be aware that this is not a perfect solution.
/(?!<a[^>]*>[^<])(((([A-Za-z]{3,9}:(?:\/\/)?)(?:[-;:&=\+\$,\w]+@)?[A-Za-z0-9.-]+|(?:www\.|[-;:&=\+\$,\w]+@)[A-Za-z0-9.-]+)((?:\/[\+~%\/.\w-_]*)?\??(?:[-\+=&;%@.\w_]*)#?(?:[\w]*))?))(?![^<]*<\/a>)/gi
And the whole function to update html is:
function linkifyWithRegex(input) {
    let html = input;
    let regx = /(?!<a[^>]*>[^<])(((([A-Za-z]{3,9}:(?:\/\/)?)(?:[-;:&=\+\$,\w]+@)?[A-Za-z0-9.-]+|(?:www\.|[-;:&=\+\$,\w]+@)[A-Za-z0-9.-]+)((?:\/[\+~%\/.\w-_]*)?\??(?:[-\+=&;%@.\w_]*)#?(?:[\w]*))?))(?![^<]*<\/a>)/gi;
    html = html.replace(
        regx,
        function (match) {
            // Wrap each bare URL in an anchor that points at itself
            return '<a href="' + match + '">' + match + '</a>';
        }
    );
    return html;
}
You can do it in two steps instead of trying to come up with a single regular expression:
1. Strip out (replace with nothing) each HTML anchor: the opening tag, the content and the closing tag.
2. Match the URL in what remains.
In Perl it could be:
my $curLine = $_; # Work on a copy so that $_ stays intact if it is needed elsewhere.
$curLine =~ s/<a[^<]+<\/a>//g; # Remove each HTML anchor: "<a", "</a>" and everything in between.
if ( $curLine =~ /http:\/\// )
{
    print "Matched a URL outside an HTML anchor: $_\n";
}
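The two steps can be sketched in Python as follows. The function name and both patterns are placeholders for illustration. The anchor pattern is lazy and case-insensitive so that it also removes anchors containing nested tags, and the URL pattern is deliberately simple.

```python
import re

# Step 1 pattern: a whole anchor element, from <a ...> to the nearest </a>.
ANCHOR = re.compile(r"<a\b[^>]*>.*?</a>", re.IGNORECASE | re.DOTALL)
# Step 2 pattern: a very rough URL matcher, only for demonstration.
URL = re.compile(r"https?://[^\s<>\"']+")


def urls_outside_anchors(html):
    """Remove every anchor element, then collect URLs from what is left."""
    without_anchors = ANCHOR.sub("", html)
    return URL.findall(without_anchors)


sample = (
    'See <a href="http://www.example.com">something</a> and '
    'http://a.net/page plus <a href="http://b.net"><b>x</b>http://c.net</a>.'
)
print(urls_outside_anchors(sample))
```

Only the URL that sits outside both anchors survives the first step, so it is the only one the second step can find.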
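You can do that using a single regular expression that matches both anchor elements and URLs. The first alternative has to consume the whole anchor, including its content and closing tag, so that a URL inside it is never seen by the second alternative:
# Note that this is a dummy, you'll need a more sophisticated URL regex
regex = '(<a\b[^>]*>.*?</a>)|(http://\S+)'
Then loop over the results and only process matches where the second sub-pattern was found.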
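A minimal Python sketch of that loop, with an invented function name and a stand-in URL pattern. It assumes the first alternative swallows the entire anchor element, so anything found inside an anchor lands in group 1 and is skipped.

```python
import re

# Group 1: a complete anchor element. Group 2: a bare URL (rough pattern).
TOKEN = re.compile(
    r"(<a\b[^>]*>.*?</a>)|(https?://[^\s<>\"']+)",
    re.IGNORECASE | re.DOTALL,
)


def urls_outside_anchors(html):
    """Keep only the matches where the URL alternative (group 2) fired."""
    return [m.group(2) for m in TOKEN.finditer(html) if m.group(2)]


sample = (
    'See <a href="http://www.example.com">http://www.example2.com</a> '
    "and http://a.net/page"
)
print(urls_outside_anchors(sample))
```

One limitation of this sketch: a URL inside some other tag's attribute, such as an img src, is still reported, because only anchors are consumed by the first alternative.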
Peter has a great answer: first, remove anchors so that
Some text <a href="...">TeXt</a> and some more text with link http://a.net
is replaced by
Some text and some more text with link http://a.net
THEN run a regexp that finds urls:
http://a.net
Use the DOM to filter out the anchor elements, then do a simple URL regex on the rest.
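As a rough illustration in Python, the standard library's html.parser can stand in for a DOM here. It is an event-based parser rather than a real DOM, and the class name, function name and URL pattern are invented for the example. Text is only scanned for URLs while the parser is outside every anchor element, and attribute values are never scanned at all.

```python
import re
from html.parser import HTMLParser

# A deliberately simple URL pattern, only for demonstration.
URL = re.compile(r"https?://[^\s<>\"']+")


class UrlsOutsideAnchors(HTMLParser):
    """Collects URLs from text nodes that are not inside an <a> element."""

    def __init__(self):
        super().__init__()
        self.anchor_depth = 0
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.anchor_depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.anchor_depth > 0:
            self.anchor_depth -= 1

    def handle_data(self, data):
        # Only text outside anchors is searched; attributes never reach here.
        if self.anchor_depth == 0:
            self.urls.extend(URL.findall(data))


def find_unlinked_urls(html):
    parser = UrlsOutsideAnchors()
    parser.feed(html)
    parser.close()
    return parser.urls


sample = (
    '<p>Visit http://a.net or <a href="http://b.net"><b>bold</b> http://c.net</a> '
    'and <img src="http://d.net/x.png"> http://e.net</p>'
)
print(find_unlinked_urls(sample))
```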
^.*<(a|A){1,1} -> scan until <a or <A is found
.*(href|HREF){1,1}\= -> scan until href= or HREF=
\x22{1,1}.*\x22 -> accept all characters between two quotes
> -> look for >
.+(</a>|</A>){1,1} -> accept description and end anchor tag
$ -> End of string search
pattern= "^.*<(a|A){1,1}.*(href|HREF){1,1}.*\=.*\x22{0,1}.*\x22{0,1}.*>.+(</a>|</A>){1,1}$"