REGEX in mysql table containing html data - html

I have a table that stores html templates in a mysql database. Now I have to perform some text replacement on them. However my target text is also present in some of the anchor tags and I don't want that to be replaced.
EX :
<body> ... (has huge html crap)... .........(Some more html crap) ... (a bit more of html crap) ... </body>
Task is to replace the occurrences of the "KEYWORD" with "NEW KEYWORD" in the body but not the urls.
It would also be helpful if I can first find such cases where the KEYWORD is a part of a link in a given template.

MySQL is not capable of such advanced string manipulation.
However, if you were to have a one-time-use PHP script do the editing (ie. select from the table, for each row process and update), you can do this:
// foreach row as $row
$newtext = preg_replace("(<a\b.*?>(*SKIP)(*FAIL)|KEYWORD)","NEW KEYWORD",$row['data']);
What this does is look for links (very approximate Regex but should suffice in almost all cases here), then skip over them. Then, it looks for KEYWORD and replaces it with NEW KEYWORD.
You can use this to quickly and easily handle the replacement.
If that "almost all cases" thing above turns out to not be enough, you can use DOMDocument to load the HTML into a parser and process text nodes only from there.

Maybe you could find the cases where the KEYWORD is a part of a link with something like this:
SELECT * FROM tbl WHERE html REGEXP '<a[^>]*KEYWORD';

Related

Create an NCX file with Notepad++ and Regular expression

I have a HTML Table of Contents page containing list of book chapters with hyperlinks:
Multimedia Implementation<br/>
Table of Contents<br/>
About the Author<br/>
About the Technical Reviewers<br/>
Acknowledgments<br/>
Part I: Introduction and Overview<br/>
Chapter 1. Technical Overview<br/>
...
I want create NCX file for a Kindle book which must contain details as follows:
<navPoint id="n1" playOrder="1">
<navLabel>
<text>Multimedia Implementation</text>
</navLabel>
<content src="final/main.html"/>
</navPoint>
<navPoint id="n2" playOrder="2">
<navLabel>
<text>Table of Contents</text>
</navLabel>
<content src="final/toc.html"/>
</navPoint>
<navPoint id="n3" playOrder="3">
<navLabel>
<text>About the Author</text>
</navLabel>
<content src="final/pref01.html"/>
</navPoint>
...
I'm using Notepad++: is it possible automate this process with regular expression?
You cannot do everything using regex.. you can split the problem into two parts..
generate strings like <navPoint id="n1" playOrder="1"> using program logic (increment variable)
remaining you can do with regex
Use the following regex to match:
<a\shref="([^"]*)">([^<]*)<\/a><br\/>
And replace with:
(generated string)<navLabel>\n<text>\2</text>\n<content src="\1"/>\n</navPoint>
See DEMO
Yes, it is possibly to replace the links with <navpoint> tags. The only thing I found no solution for is the incremental numbering of the <navpoint> attributes id and playOrder...
The following regex will do most of the work:
/^<a[^>]*href="([^"]+)"[^>]*([^<]+).*$/gm
substitute with:
<navpoint id="n" playOrder="">\n<navLabel><text>$2</text></navLabel>\n<content src="$1" />\n</navpoint>\n
Regex details
/^<a .. only parse lines that start with an `<a` tag
.*href=" .. find the first occurance of `href="`
([^"]+) .. capture the text and stop when a " is found
"[^>]*> .. find the end of the <a> tag
([^<]+) .. capture the text and stop when a < is found (i.e. the </a> tag)
.*$/ .. continue to end of the line
gm .. search the whole string and parse each line individually
More detailled (but also more confusing) explanation is here:
https://regex101.com/r/gA0yJ2/1
This link also demonstrates how the regex is working. You can test changes there if you like

Replace Placeholder HTML comment with HTML element

I'm developing an iOS app that needs to display HTML content inside a HTML powered textview (DTCoreText).
For whatever reaso nthe client has decided to provide videos inside special HTML comments that I'm supposed to turn into a tag.
The comment format is as such
<!-- placeholder_video:url_of_video.mp4 -->
I was hoping I could write a regex to match the entire comment, extract the content and replace it with a element pointing to the correct URL through NSRegularExpression's stringByReplacingMatchesInString:options:range:withTemplate: but I can't for the life of me figure out Regular Expressions.
The best I could come up with is
(?<=<!-- placeholder_video:)(.*)(?=-->)
Which matches the comment's content (the mp4 URL), but I need it to match the entire comment instead and extract the content as a sub pattern (that I would later access through \1 if I understand correctly) so I can use a replace pattern to quickly replace the comment with the proper <video src="url_of_video.mp4"> string
Can it be done? Or am I better off trying to do it in two passes instead? (match the entire comment then run another regex on that comment to extract the URL and replace the former?
Based on the way your question looks right now (Having forgotten to paste the example of how the comment looks) it's hard to give a good answer.
But since you mention that this:
(?<=<!-- placeholder_video:)(.*)(?=-->)
will manage to fetch the content of the comment. And since you say all you want is to capture the entire comment.
Then if I understand this correctly I would say all you really need to do is add a capturing group around your entire expression and drop the lookback and lookahead.
(Maybe also avoid grabbing the leading and trailing spaces)
(<!-- placeholder_video:\s*(.*)\s*-->)
When testing with the following:
<!-- placeholder_video: url_of_video.mp4 -->
I will get 2 groups:
1: <!-- placeholder_video: url_of_video.mp4 -->
2: url_of_video.mp4
You can also give your groups names if you like, to make it easier to reference them:
(?<comment><!-- placeholder_video:\s*(?<url>.*)\s*-->)
It is also true that you can use \n to reference group n inside the regular expression.
If you plan to replace the first capturing group with the second one in a single regex, then how you do it would depend on the language. Some languages like C# will allow you to provide your own replacing method, which is one option. But I'm assuming you're not in C# here.
In Javascript you can simply use $n to reference the n'th matched group as the replaced value. (You can also provide a function, but you don't need to)
A full working example in JS (Using jQuery but not needed):
<div id="example">
<!-- placeholder_video: url_of_video.mp4 -->
</div>
<script>
var str = $("#example").html();
var str2 = str.replace(/(<!-- placeholder_video:\s*(.*)\s*-->)/g, "<video src=\"$2\">");
alert(str2);
</script>
You can see the working jsfiddle example here: http://jsfiddle.net/72WeZ/

Regex: Extracting readable (non-code) text and URLs from HTML documents

I am creating an application that will take a URL as input, retrieve the page's html content off the web and extract everything that isn't contained in a tag. In other words, the textual content of the page, as seen by the visitor to that page. That includes 'masking' out everything encapsuled in <script></script>, <style></style> and <!-- -->, since these portions contain text that is not enveloped within a tag (but is best left alone).
I have constructed this regex:
(?:<(?P<tag>script|style)[\s\S]*?</(?P=tag)>)|(?:<!--[\s\S]*?-->)|(?:<[\s\S]*?>)
It correctly selects all the content that i want to ignore, and only leaves the page's text contents. However, that means that what I want to extract won't show up in the match collection (I am using VB.Net in Visual Studio 2010).
Is there a way to "invert" the matching of a whole document like this, so that I'd get matches on all the text strings that are left out by the matching in the above regex?
So far, what I did was to add another alternative at the end, that selects "any sequence that doesn't contain < or >", which then means the leftover text. I named that last bit in a capture group, and when I iterate over the matches, I check for the presence of text in the "text" group. This works, but I was wondering if it was possible to do it all through regex and just end up with matches on the plain text.
This is supposed to work generically, without knowing any specific tags in the html. It's supposed to extract all text. Additionally, I need to preserve the original html so the page retains all its links and scripts - i only need to be able to extract the text so that I can perform searches and replacements within it, without fear of "renaming" any tags, attributes or script variables etc (so I can't just do a "replace with nothing" on all the matches I get, because even though I am then left with what I need, it's a hassle to reinsert that back into the correct places of the fully functional document).
I want to know if this is at all possible using regex (and I know about HTML Agility Pack and XPath, but don't feel like).
Any suggestions?
Update:
Here is the (regex-based) solution I ended up with: http://www.martinwardener.com/regex/, implemented in a demo web application that will show both the active regex strings along with a test engine which lets you run the parsing on any online html page, giving you parse times and extracted results (for link, url and text portions individually - as well as views where all the regex matches are highlighted in place in the complete HTML document).
what I did was to add another alternative at the end, that selects "any sequence that doesn't contain < or >", which then means the leftover text. I named that last bit in a capture group, and when I iterate over the matches, I check for the presence of text in the "text" group.
That's what one would normally do. Or even simpler, replace every match of the markup pattern with and empty string and what you've got left is the stuff you're looking for.
It kind of works, but there seems to be a string here and there that gets picked up that shouldn't be.
Well yeah, that's because your expression—and regex in general—is inadequate to parse even valid HTML, let alone the horrors that are out there on the real web. First tip to look at, if you really want to chase this futile approach: attribute values (as well as text content in general) may contain an unescaped > character.
I would like to once again suggest the benefits of HTML Agility Pack.
ETA: since you seem to want it, here's some examples of markup that looks like it'll trip up your expression.
<a href=link></a> - unquoted
<a href= link></a> - unquoted, space at front matched but then required at back
- very common URL char missing in group
- more URL chars missing in group
<a href=lïnk></a> - IRI
<a href
="link"> - newline (or tab)
<div style="background-image: url(link);"> - unquoted
<div style="background-image: url( 'link' );"> - spaced
<div style="background-image: url('link');"> - html escape
<div style="background-image: ur\l('link');"> - css escape
<div style="background-image: url('link\')link');"> - css escape
<div style="background-image: url(\
'link')"> - CSS folding
<div style="background-image: url
('link')"> - newline (or tab)
and that's just completely valid markup that won't match the right link, not any of the possible invalid markup, markup that shouldn't but does match a link, or any of the many problems with your other technique of splitting markup from text. This is the tip of the iceberg.
Regex is not reliable for retrieving textual contents of HTML documents. Regex cannot handle nested tags. Supposing a document doesn't contain any nested tag, regex still requires every tags are properly closed.
If you are using PHP, for simplicity, I strongly recommend you to use DOM (Document Object Model) to parse/extract HTML documents. DOM library usually exists in every programming language.
If you're looking to extract parts of a string not matched by a regex, you could simply replace the parts that are matched with an empty string for the same effect.
Note that the only reason this might work is because the tags you're interested in removing, <script> and <style> tags, cannot be nested.
However, it's not uncommon for one <script> tag to contain code to programmatically append another <script> tag, in which case your regex will fail. It will also fail in the case where any tag isn't properly closed.
You cannot parse HTML with regular expressions.
Parsing HTML with regular expressions leads to sadness.
I know you're just doing it for fun, but there are so many packages out there than actually do the parsing the right way, AND do it reliably, AND have been tested.
Don't go reinventing the wheel, and doing it a way that is all but guaranteed to frustrate you down the road.
OK, so here's how I'm doing it:
Using my original regex (with the added search pattern for the plain text, which happens to be any text that's left over after the tag searches are done):
(?:(?:<(?P<tag>script|style)[\s\S]*?</(?P=tag)>)|(?:<!--[\s\S]*?-->)|(?:<[\s\S]*?>))|(?P<text>[^<>]*)
Then in VB.Net:
Dim regexText As New Regex("(?:(?:<(?<tag>script|style)[\s\S]*?</\k<tag>>)|(?:<!--[\s\S]*?-->)|(?:<[\s\S]*?>))|(?<text>[^<>]*)", RegexOptions.IgnoreCase)
Dim source As String = File.ReadAllText("html.txt")
Dim evaluator As New MatchEvaluator(AddressOf MatchEvalFunction)
Dim newHtml As String = regexText.Replace(source, evaluator)
The actual replacing of text happens here:
Private Function MatchEvalFunction(ByVal match As Match) As String
Dim plainText As String = match.Groups("text").Value
If plainText IsNot Nothing AndAlso plainText <> "" Then
MatchEvalFunction = match.Value.Replace(plainText, plainText.Replace("Original word", "Replacement word"))
Else
MatchEvalFunction = match.Value
End If
End Function
Voila. newHtml now contains an exact copy of the original, except every occurrence of "Original word" in the page (as it's presented in a browser) is switched with "Replacement word", and all html and script code is preserved untouched. Of course, one could / would put in a more elaborate replacement routine, but this shows the basic principle. This is 12 lines of code, including function declaration and loading of html code etc. I'd be very interested in seeing a parallel solution, done in DOM etc for comparison (yes, I know this approach can be thrown off balance by certain occurrences of some nested tags quirks - in SCRIPT rewriting - but the damage from that will still be very limited, if any (see some of the comments above), and in general this will do the job pretty darn well).
For Your Information,
Instead of Regex, With JQuery , Its possible to extract text alone from a html markup. For that you can use the following pattern.
$("<div/>").html("#elementId").text()
You can refer this JSFIDDLE

RegEx: Link Twitter-Name Mentions to Twitter in HTML

I want to do THIS, just a little bit more complicated:
Lets say, I have an HTML input:
Don't break!
Some Twitter Users: #codinghorror, #spolsky, #jarrod_dixon and #blam4c.
You can't reach me at blam4c#example.com.
Is there a good RegEx to replace the twitter username mentions by links to twitter, but leave #example (eMail-Adress at the bottom) AND #test (in the link title, i.e. in HTML tags)?
It probably should also try to not add links inside existing links, i.e. not break this:
Hello #someone there!
My current attempt is to add ">" at the beginning of the string, then use this RegEx:
Search: '/>([^<]*\s)\#([a-z0-9_]+)([\s,.!?])/i'
Replace: '>\1#\2\3'
Then remove the ">" I added in step 1.
But that won't match anything but the "#blam4c". I know WHY it does so, that's not the problem.
I would like to find a solution that finds and replaces all twitter user name mentions without destroying the HTML. Maybe it might even be better to code this without RegEx?
First, keep the angle brackets out of your regexps.
Use a HTML parser and xpath to select the text nodes you are interested in processing, then consider a regexp for matching only #refs in those nodes.
I'll let to other people to try and give a specific answer to the regex part.
I agree with ddaa, there's almost no sane way to attack this without stripping the html links out first.
Presumably you'd be starting out with an actual Twitter message, which cannot by definition include any manually entered hyperlinks.
For example, here's how I found this question (the link resolves to this question so don't bother clicking it!)
Some Twitter Users: #codinghorror, #spolsky, #jarrod_dixon and #blam4c. http://bit.ly/2phvZ1
In this case, it's easy:
var msg = "Some Twitter Users: #codinghorror, #spolsky, #jarrod_dixon and #blam4c. http://bit.ly/2phvZ1";
var html = Regex.Replace(msg, "(?<!\w)(#(\w+))",
"$1");
(this might need some tweaking, I'd like to test it against a corpus, but it seems correct for the average Twitter message)
As for your more complicated cases (with HTML markup embedded in the tweets), I have no idea. Way too hard for me.
This regexp might work a bit better: /\B\#([\w\-]+)/gim
Here's a jsFiddle example of it in action: http://jsfiddle.net/2TQsx/4/

How can I retrieve a collection of values from nested HTML-like elements using RegExp?

I have a problem creating a regular expression for the following task:
Suppose we have HTML-like text of the kind:
<x>...<y>a</y>...<y>b</y>...</x>
I want to get a collection of values inside <y></y> tags located inside a given <x> tag, so the result of the above example would be a collection of two elements ["a","b"].
Additionally, we know that:
<y> tags cannot be enclosed in other <y> tags
... can include any text or other tags.
How can I achieve this with RegExp?
This is a job for an HTML/XML parser. You could do it with regular expressions, but it would be very messy. There are examples in the page I linked to.
I'm taking your word on this:
"y" tags cannot be enclosed in other "y" tags
input looks like: <x>...<y>a</y>...<y>b</y>...</x>
and the fact that everything else is also not nested and correctly formatted. (Disclaimer: If it is not, it's not my fault.)
First, find the contents of any X tags with a loop over the matches of this:
<x[^>]*>(.*?)</x>
Then (in the loop body) find any Y tags within match group 1 of the "outer" match from above:
<y[^>]*>(.*?)</y>
Pseudo-code:
input = "<x>...<y>a</y>...<y>b</y>...</x>"
x_re = "<x[^>]*>(.*?)</x>"
y_re = "<y[^>]*>(.*?)</y>"
for each x_match in input.match_all(x_re)
for each y_match in x_match.group(1).value.match_all(y_re)
print y_match.group(1).value
next y_match
next x_match
Pseudo-output:
a
b
Further clarification in the comments revealed that there is an arbitrary amount of Y elements within any X element. This means there can be no single regex that matches them and extracts their contents.
Short and simple: Use XPath :)
It would help if we knew what language or tool you're using; there's a great deal of variation in syntax, semantics, and capabilities. Here's one way to do it in Java:
String str = "<y>c</y>...<x>...<y>a</y>...<y>b</y>...</x>...<y>d</y>";
String regex = "<y[^>]*+>(?=(?:[^<]++|<(?!/?+x\\b))*+</x>)(.*?)</y>";
Matcher m = Pattern.compile(regex).matcher(str);
while (m.find())
{
System.out.println(m.group(1));
}
Once I've matched a <y>, I use a lookahead to affirm that there's a </x> somewhere up ahead, but there's no <x> between the current position and it. Assuming the pseudo-HTML is reasonably well-formed, that means the current match position is inside an "x" element.
I used possessive quantifiers heavily because they make things like this so much easier, but as you can see, the regex is still a bit of a monster. Aside from Java, the only regex flavors I know of that support possessive quantifiers are PHP and the JGS tools (RegexBuddy/PowerGrep/EditPad Pro). On the other hand, many languages provide a way to get all of the matches at once, but in Java I had to code my own loop for that.
So it is possible to do this job with one regex, but a very complicated one, and both the regex and the enclosing code have to be tailored to the language you're working in.