How to get element by xpath and inner text? - html

I am trying to locate a UI popup using XPath and inner text. The popup HTML is like this:
Hello ? {some elements} What would you like to do today {more elements} Play or Die ? {Yes button, No button}.
I want to get this element by using something like this:
//div[contains(@innerText, 'Hello ? What would you like to do today Play or Die ?')]
How do I do this? Or is there a better way to find this popup? There are no IDs or permanent classes here. Moreover, the DOM structure is variable.

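You should be able to use the XPath 2.0 matches() function, which accepts a regular expression:
//div[matches(text(), 'Hello ?\w+ What would you like to do today \w+')]
Note that this needs an XPath 2.0 processor; the XPath engines built into browsers implement XPath 1.0 only.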

Assuming the interleaved elements contain no text of their own, you can use . instead of @innerText to get the concatenation of all text nodes within the context element (the div in this case). Combine . with normalize-space() to strip leading and trailing whitespace and to collapse runs of whitespace into a single space. Because the context node is the default argument of normalize-space(), the . can be omitted:
//div[contains(normalize-space(), 'Hello ? What would you like to do today Play or Die ?')]
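As a rough illustration of what normalize-space() does to the popup's text, the sketch below mimics it in JavaScript and builds the expression above from a target string. The helper names normalizeSpace and buildPopupXPath are made up for this example, and the document.evaluate lookup only runs where a DOM is available.

```javascript
// Mimics XPath normalize-space(): collapse whitespace runs, then trim the ends.
function normalizeSpace(s) {
  return s.replace(/[ \t\r\n]+/g, " ").replace(/^ | $/g, "");
}

// Builds the expression from the answer above for a given text.
// Assumption: the text contains no single quote (no XPath escaping is done here).
function buildPopupXPath(text) {
  return "//div[contains(normalize-space(), '" + normalizeSpace(text) + "')]";
}

const xpath = buildPopupXPath(
  "Hello ?  What would you like to do today\n Play or Die ?"
);
console.log(xpath);

// Only attempt the lookup where a DOM exists (e.g. in a browser console).
if (typeof document !== "undefined") {
  const popup = document.evaluate(
    xpath, document, null, XPathResult.FIRST_ORDERED_NODE_TYPE, null
  ).singleNodeValue;
  console.log(popup);
}
```

Because every ancestor div contains the same text, the first match in document order is the outermost one; the expression may need narrowing if the innermost div is wanted.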

Related

Remove HTML tags in specific tags in MySQL

I'd like to write a SQL script to remove, for example, all <strong> and </strong> tags that are inside a heading <hX></hX> tag.
I want to replace all occurrences like <h4><strong>Some text</strong></h4> with <h4>Some text</h4>,
but only inside an H tag, and without losing content of course.
I tried many things like REGEXP_REPLACE and REGEXP_SUBSTR, but I'm stuck with something like REGEXP_REPLACE(myfield, "<h\\d>.*<strong>.*<\/strong>.*<\/h\\d>", ""), which replaces the whole match.
I use PHP to strip characters out: preg_replace('#[^A-Za-z0-9]#i', '', $_POST['username']); // removes everything but letters and numbers. It can be adapted for specific phrases and characters. I know it isn't SQL, but it is something. Also, in JavaScript you can read an element's innerText (rather than innerHTML, which includes the markup) to pull out only the text from within tags: >Text<
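If the cleanup can run in application code rather than in SQL, as this answer suggests, one possible shape is a two-step replace: match each heading element first, then remove the strong tags only inside that match. This is a sketch, not a tested migration script; the function name stripStrongInHeadings is hypothetical, and it assumes well-formed, non-nested h1 to h6 elements.

```javascript
// Removes <strong> and </strong>, but only inside <h1>..<h6> elements.
function stripStrongInHeadings(html) {
  // Group 1: tag name, group 2: attributes, group 3: heading content.
  const heading = /<(h[1-6])\b([^>]*)>([\s\S]*?)<\/\1>/gi;
  return html.replace(heading, (match, tag, attrs, inner) => {
    // Drop only the strong tags; the text between them is kept.
    const cleaned = inner.replace(/<\/?strong\b[^>]*>/gi, "");
    return "<" + tag + attrs + ">" + cleaned + "</" + tag + ">";
  });
}

console.log(stripStrongInHeadings("<h4><strong>Some text</strong></h4>"));
```

Content outside headings is left untouched, which is the part the single REGEXP_REPLACE in the question could not express.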

Extracting content of HTML tag with specific attribute

Using regular expressions, I need to extract a multiline content of a tag, which has specific id value. How can I do this?
This is what I currently have:
<div(.|\n)*?id="${value}"(.|\n)*?>(.|\n)*?<\/div>
The problem with this is this sample:
<div id="1">test</div><div id="2">test</div>
If I want to replace id="2" using this regexp (with ${value} = 2), the whole string would get matched. This is because from the tag opening to closing I match everything until id is found, which is wrong.
How can I do this?
A fairly simple way is to use
Raw: <div(?=\s)[^>]*?\sid="2"[^>]*?>([\S\s]*?)</div>
Delimited: /<div(?=\s)[^>]*?\sid="2"[^>]*?>([\S\s]*?)<\/div>/
Use the variable in place of 2.
The content will be in group 1.
Change (.|\n) to [^>] so it won't match the > that ends the tag. Then it can't match across different divs.
<div\b[^>]*\bid="${value}"[^>]*>.*?<\/div>
Also, instead of using (.|\n)* to match across multiple lines, use the s modifier to the regexp. This makes . match any character, including newlines.
However, using regular expressions to parse HTML is not very robust. You should use a DOM parser.
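For cases where a regex is still the only option, the first pattern above can be wrapped in a small JavaScript helper. This is a minimal sketch: the names escapeRegExp and extractDivById are made up for the example, and it assumes double-quoted id attributes and no div nested inside the target div.

```javascript
// Escapes characters that have a special meaning in a regular expression.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Returns the inner content of the first <div> whose id equals `id`, or null.
function extractDivById(html, id) {
  // [^>]*? cannot cross the ">" that closes the opening tag, so the id
  // must belong to this div; [\s\S]*? matches content across newlines.
  const pattern = new RegExp(
    '<div(?=\\s)[^>]*?\\sid="' + escapeRegExp(id) + '"[^>]*?>([\\s\\S]*?)</div>'
  );
  const m = pattern.exec(html);
  return m ? m[1] : null;
}

const sample = '<div id="1">one</div><div id="2">two</div>';
console.log(extractDivById(sample, "2"));
```

With the sample from the question, only the second div is matched, because the search for the id can no longer run past the end of the first opening tag.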

How do I put two spaces after every period in our HTML?

I need there to be two spaces after every period in every sentence in our entire site (don't ask).
One way to do it is to manually add a &nbsp; after every single period. This will take several hours.
We can't just find and replace every period, because we have concatenations in PHP and other cases where there is a period and then a space, but it's not in a sentence.
Is there a way to do this...and have everything still work in Internet Explorer 6?
[edit] - The tricky part is that in the code, there are lines of PHP that include dots with spaces around them like this:
<?php echo site_url('/css/' . $some_name .'.css');?>
I definitely don't want extra spaces to break lines like that, so I would be happy adding two visible spaces after each period in all P tags.
As we all know, HTML collapses white space, but it only does this for display. The extra spaces are still there. So if the source material was created with two spaces after each period, then some of the substitution methods being suggested can be made to work reliably: search for "period-space-space" and replace it with something more suitable, like period-space-&emsp14;. Please note that you shouldn't use &nbsp; because it can prevent proper wrapping at margins. (If you're using ragged right, the margin change won't be noticeable as long as you put the nbsp BEFORE the space.)
You can also wrap each sentence in a span and use the :after selector to add a space and format it to be wide with "word-spacing". Or you can wrap the space between sentences itself in a span and style that directly.
I've written a javascript solution for blogger that does this on the fly, looks for period-space-space, and replaces it with a spanned, styled space that appears wider.
If however your original material doesn't include this sort of thing then you'll have to study up on sentence boundary detection algorithms (which are not so simple), and then modify one to also not trip over PHP code.
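The period-space-space substitution described in this answer could look roughly like the following. It is only a sketch of the idea, not the author's Blogger script; the function name widenSentenceGaps is hypothetical, and it assumes the source text really does contain two literal spaces after each sentence-ending period.

```javascript
// Replaces "period, space, space" with "period, space, &emsp14;" so the
// second space survives HTML whitespace collapsing.
function widenSentenceGaps(html) {
  return html.replace(/\. {2}/g, ". &emsp14;");
}

console.log(widenSentenceGaps("First sentence.  Second sentence. Third."));
```

A period followed by a single space, such as the PHP concatenation in the question, is left alone because the pattern requires both spaces.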
You might be able to use the JavaScript split method or regex depending on the scope of the text.
Here's the split method:
var el = document.getElementById("mydiv");
if (el) {
    // \xA0 is a non-breaking space, followed by a regular space
    el.innerText = el.innerText.split(".").join(".\xA0 ");
}
Test case:
Hello world.Insert spaces after the period.Using the split method.
Result:
Hello world. Insert spaces after the period. Using the split method.
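Have you thought about using an output buffer? ob_start($callback)
Not tested, but if you stick this before any output (or better yet, offload the function):
<?php
// Callback: receives the buffered page output and returns the modified version.
// Note: this touches every period, including those in URLs and attribute values.
function processDots($buffer)
{
    return str_replace(".", ".&nbsp;", $buffer);
}
ob_start("processDots");
?>
and add this to the end of the output:
<?php ob_end_flush(); ?>
Might just work :)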
If you're not opposed to a "post processing"/"javascript" solution:
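// Collect every text node in the document
var nodes = $('*').contents().map(function(a, b) {
    return (b.nodeType === Node.TEXT_NODE ? b : null);
});
// Replace "period + whitespace" with a period and two non-breaking spaces
$.each(nodes, function(i, node) {
    node.data = node.data.replace(/(\.\s)/g, '.\u00A0\u00A0');
});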
Using jQuery for the sake of brevity, but not required.
P.S. I saw your comment that not every period followed by a space should be treated the same, but this is about as good as it gets. Otherwise, you're going to need a much more bullet-proof approach.
Incorporate something like this into your PHP file:
<?php $html = preg_replace('/\. {1,2}(?=[A-Z])/', '.&nbsp; ', $html); // $html: assumed to hold the page markup ?>
This searches for the beginning of each new sentence, as in period-space-space-capital or period-space-capital, and replaces the period and spaces with a period followed by two visible spaces. [Note: the capital letter is not replaced.]
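The same lookahead idea can be tried out in JavaScript before touching the PHP templates. This is a sketch under the assumption that a sentence boundary is a period, one or two spaces, and then a capital letter; padSentenceEnds is a name invented for the example.

```javascript
// Replaces ". " or ".  " before a capital letter with a period,
// a non-breaking space (\u00A0) and a regular space.
function padSentenceEnds(text) {
  // (?=[A-Z]) checks for the capital letter without consuming it.
  return text.replace(/\. {1,2}(?=[A-Z])/g, ".\u00A0 ");
}

console.log(padSentenceEnds("It works. Really.  Yes."));
console.log(padSentenceEnds("<?php echo site_url('/css/' . $some_name .'.css');?>"));
```

The PHP concatenation line from the question comes back unchanged, since none of its periods is followed by a space and a capital letter. Abbreviations such as "Dr. Smith" would still be padded, so this is a heuristic rather than real sentence detection.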

Surrounding text with tag and populating tag

I have several lines of text, in them there is a word or words that are capitalized like this:
Hello HOW ARE YOU good to see you
I am FINE
Is there a tool that can go through the text and surround all those capitalized words with an HTML anchor tag?
And, I guess more difficult, can it also populate the href with a lowercased version of that capitalized text with the space(s) removed?
Any help on one or both questions is appreciated.
It took me a while, but here it is in javascript: http://jsfiddle.net/RdJ4E/4/
I'm sure you will find a way to tune the code. Good luck!
Is this a beginning? Matching all uppercased words is trivial with regex, and with providing the String.replace method with a callback function instead of a string you can do whatever you want with the matched string.
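var linked = myString.replace(/(\b[A-Z\s]+\b)/g, function(result, match){
    // href value: trimmed, lowercased, URI-encoded version of the matched text
    var stripped = encodeURI(result.trim().toLowerCase());
    return ' <a href="' + stripped + '">' + result.trim() + '</a> ';
});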
http://jsfiddle.net/mwxnC/2/
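A slightly stricter variant, sketched here as an alternative rather than as what either fiddle does, only treats runs of words with at least two capital letters as link text and removes the spaces from the href, as the question asks. The function name linkUppercaseRuns is hypothetical, and using the bare lowercased text as the href is an assumption about the desired URL format.

```javascript
// Wraps each run of ALL-CAPS words in an anchor whose href is the
// lowercased run with all whitespace removed.
function linkUppercaseRuns(text) {
  // One or more words of 2+ capital letters, separated by whitespace.
  const runs = /\b[A-Z]{2,}(?:\s+[A-Z]{2,})*\b/g;
  return text.replace(runs, (run) => {
    const href = run.toLowerCase().replace(/\s+/g, "");
    return '<a href="' + href + '">' + run + "</a>";
  });
}

console.log(linkUppercaseRuns("Hello HOW ARE YOU good to see you"));
console.log(linkUppercaseRuns("I am FINE"));
```

Requiring two or more capitals per word keeps a lone "I" or the first letter of "Hello" from being linked.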

regex: selecting everything but img tag

I'm trying to select some text using regular expressions leaving all img tags intact.
I've found the following code that selects all img tags:
/<img[^>]+>/g
but actually having a text like:
This is an untagged text.
<p>this is my paragraph text</p>
<img src="http://example.com/image.png" alt=""/>
this is a link
using the code above will select the img tag only
/<img[^>]+>/g #--> using this code will result in:
<img src="http://example.com/image.png" alt=""/>
but I would like to use some regex that select everything but the image like:
/magical regex/g # --> results in:
This is an untagged text.
<p>this is my paragraph text</p>
this is a link
I've also found this code:
/<(?!img)[^>]+>/g
which selects all tags except the img one. but in some cases I will have untagged text or text between tags so this won't work for my case. :(
is there any way to do it?
Sorry but I'm really new to regular expressions so I'm really struggling for few days trying to make it work but I can't.
Thanks in advance
UPDATE:
OK, so for those thinking I want to parse it: sorry, I don't. I just want to select text.
Another thing: I'm not using any specific language. I'm using Yahoo Pipes, which only provides regex and some string tools to accomplish the job, but it doesn't involve any programming code.
for better understanding here is the way regex module works in yahoo pipes:
http://pipes.yahoo.com/pipes/docs?doc=operators#Regex
UPDATE 2
Fortunately, I was able to strip the text near the img tag, but on a step-by-step basis as @Blixt recommended, like:
<(?!img)[^>]+> , replace with "" #-> strips out every tag that is not img
(?s)^[^<]*(.*), replace with $1 #-> removes all the text before the img tag
(?s)^([^>]+>).*, replace with $1 #-> removes all the text after the img tag
The problem with this is that it only catches the first img tag; I would then have to catch the others manually by hard-coding them, so I'm still not sure this is the best solution.
The regexp you have to find the image tags can be used with a replace to get what you want.
Assuming you are using PHP:
$htmlWithoutIMG = preg_replace('/<img[^>]+>/', '', $html); // no g modifier in PHP: preg_replace replaces all matches by default
If you are using Javascript:
var htmlWithoutIMG = html.replace(/<img[^>]+>/g, '');
This takes your text, finds the <img> tags and replaces them with nothing, i.e. it deletes them from the text, leaving what you want. The < and > characters do not need escaping in these patterns.
Regular expression matches have a single start and length. This means the result you want is impossible in a single match (since you want the result to end at one point, then continue later).
The closest you can get is to use a regular expression that matches everything from start of string up to start of <img> tag, everything between <img> tags and everything from end of <img> tag to end of string. Then you could get all matches from that regular expression (in your example, there would be two matches).
The above answer is assuming you can't modify the result. If you can modify the result, simply replace the <img> tags with the empty string to get your result.
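Both options from this answer can be sketched in JavaScript: deleting the img tags when the result may be modified, and splitting around them when separate pieces are wanted. The function names are invented for the example, and the sample string is the one from the question (with the link already reduced to its text).

```javascript
// Matches any <img ...> tag.
const IMG_TAG = /<img[^>]*>/g;

// Option 1: delete the tags and keep everything else as one string.
function removeImgTags(html) {
  return html.replace(IMG_TAG, "");
}

// Option 2: return the pieces before, between and after the tags.
function splitAroundImgTags(html) {
  return html.split(IMG_TAG);
}

const html =
  "This is an untagged text.\n" +
  "<p>this is my paragraph text</p>\n" +
  '<img src="http://example.com/image.png" alt=""/>\n' +
  "this is a link";

console.log(removeImgTags(html));
console.log(splitAroundImgTags(html));
```

For the sample there are two pieces, one on each side of the single img tag, which corresponds to the two matches mentioned above.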