PREG_REPLACE in MySQL and PREG_REPLACE_EVAL - mysql

I'm using this UDF for the PREG_REPLACE function in MySQL and everything seems to be working fine so far. However, my goal now is to find and encode entities inside pre tags in a column. So ultimately I'd like to get this:
<pre>
<strong>Hello World!</strong>
</pre>
To look like this:
<pre>
&lt;strong&gt;Hello World!&lt;/strong&gt;
</pre>
I'm using the PREG_REPLACE function to find the contents inside the pre tags like this:
SELECT PREG_REPLACE('/<pre>(.*?)<\\/pre>/sm', '\\1', '<pre><strong>Hello World</strong></pre>');
Now I'd like to replace \\1 with something that would say "replace with ENCODE_ENTITIES('\\1')". Obviously it could be any other function, like UPPER for example, but UPPER('\\1') does nothing useful, because the function runs on the literal placeholder string before the match is substituted. I kind of like the idea of PREG_REPLACE_EVAL in PHP's implementation of preg_replace, which allows something like this:
preg_replace("/(<\/?)(\w+)([^>]*>)/e",
"'\\1'.strtoupper('\\2').'\\3'",
$html_body);
Any ideas on how to implement something similar in MySQL? Or maybe I'm heading the wrong way? Thanks!

Unless you want to modify the source code for the UDF to add that feature, there's no way to do that in MySQL. I suggest you pick your favorite programming language and write a program to make the update for you, or perhaps ensure that this modification is performed before the data hits the database.
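As a rough illustration of the "write a program to make the update" route, here is a minimal Python sketch of the transformation itself. It assumes the column value has already been read into a string; the function name encode_pre_contents is made up for this example, and reading and writing the rows with a MySQL driver is left out.

```python
import html
import re

# Matches one <pre>...</pre> element. DOTALL lets "." span newlines,
# which is what the /s flag does in the PREG_REPLACE pattern above.
PRE_BLOCK = re.compile(r"(<pre>)(.*?)(</pre>)", re.DOTALL)


def encode_pre_contents(text):
    """Entity-encode everything between <pre> and </pre>, keeping the tags."""
    def encode(match):
        # The callback plays the role of the /e modifier: it receives the
        # match and returns the computed replacement.
        inner = html.escape(match.group(2), quote=False)
        return match.group(1) + inner + match.group(3)

    return PRE_BLOCK.sub(encode, text)


if __name__ == "__main__":
    print(encode_pre_contents("<pre>\n<strong>Hello World!</strong>\n</pre>"))
```

The callback passed to sub() is the part MySQL's UDF lacks: the replacement is computed from the match rather than being a fixed string.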

How to build complex vs code snippet variable transforms?

I'm trying to write a code snippet for vs code that takes a given file name, removes a piece of the name and capitalizes the first letter. For example
Input:
example.model.js
Output:
Example
Output I'm getting:
${TM_FILENAME_BASE/(.*).[model]+$//capitalize//}
I'm able to remove the trailing half of the file name with the following string
"${TM_FILENAME_BASE/(.*)\\.[model]+$/$1/}"
I tried to take this a step further with the following but it doesn't seem to work.
"${TM_FILENAME_BASE/(.*)\\.[model]+$/${1:/capitalize/}/}"
Based on the documentation, I'm not sure where I'm going wrong.
https://code.visualstudio.com/docs/editor/userdefinedsnippets#_transform-examples
Any ideas on what I'm missing here? Also are there any tools that could help build these kinds of complex expressions?
Thanks
It looks like I was writing the grammar incorrectly by adding a trailing slash (/) after capitalize. The correct way is below:
"${TM_FILENAME_BASE/(.*)\\.[model]+$/${1:/capitalize}/}"
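With the regex (.*)\\.[model]+$, (.*) captures the whole word.
For example, it captures example in example.model.js, and the format modifier is applied to that whole capture. (/capitalize uppercases only its first character; /upcase would produce EXAMPLE.)
If you want to handle the first character explicitly, capture it on its own and keep the rest in a second group, like so:
"${TM_FILENAME_BASE/(.)(.*)\\.[model]+$/${1:/upcase}$2/}"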

Finding a specific link from a site

I'm trying to find a specific link from a web page using windows command line and tools. I think Xidel can do what I want to do.
In the page, the link is used like this:
file: 'http://link.link/index.txt'
Note: there's only one line like this. Now if I can set something like
file: '{%link}'
then I'll be able to extract the link. Also if I want to change the word index.txt to something like root.txt and then use aria2 to download the link as http://link.link/root.txt , what do I need to do?
(I don't have any experience with any of these tools or command-line scripts. I just wanted to make something that does this and only this; some alternatives are already available, but I want to do it myself. I did search for it and have an idea of how I can do it, but extracting the exact URL seems to be the hardest part, since I couldn't find anything that might help me in Xidel's docs.)
Xidel is meant to extract data from HTML/XML/json files, but it can also extract from CSV's and TXT if you know how to use the $raw variable and xidel/xquery functions, like extract(), tokenize() and replace().
Post the URL or the source (or part thereof) of the webpage and I'll see how I can help you.
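While waiting for the real page, the extract-then-replace idea can be sketched in Python rather than Xidel syntax. Everything around the file: line in page_source below is a hypothetical stand-in for the page, and the function names are made up for this example.

```python
import re

# Hypothetical page fragment; only the "file: '...'" line is from the question.
page_source = """
setup({
    file: 'http://link.link/index.txt'
});
"""


def extract_file_url(source):
    """Return the URL inside file: '...', or None if the line is absent."""
    match = re.search(r"file:\s*'([^']+)'", source)
    return match.group(1) if match else None


def swap_filename(url, new_name):
    """Replace everything after the last slash with new_name."""
    return re.sub(r"[^/]+$", new_name, url)


if __name__ == "__main__":
    url = extract_file_url(page_source)
    if url is not None:
        print(swap_filename(url, "root.txt"))
```

The printed URL is what would then be handed to aria2 for the download.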

How to store html characters in mysql and display them correctly

Not sure if I am asking this correctly, but I am using a jQuery HTML editor, CLEditor, so that users can enter HTML text. When I insert this into my DB (MySQL) and then display the result, any HTML tags it had, like <p>, <span>, and so on, are gone. So when I view it, it shows like this:
class=\"noticia_texto\">jlasdfklsfklaf
which obviously isn't readable. Help please? Should I be using anything at the time of inserting, displaying, or both? Also, my datatype is set to Blob.
MySQL does NOT strip html tags. If they're being removed upon insertion (or retrieval), then it's something in your code doing it, not MySQL.
Given that the quotes in your snippet are escaped, you've almost certainly got magic_quotes enabled, and/or a home-brew SQL escaping function run amok.
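To illustrate that the database itself leaves markup alone, here is a small sketch using Python's built-in sqlite3 as a stand-in for MySQL (an assumption made only so the example runs without a server). The table name news and the sample markup are hypothetical. The value is passed as a bound parameter, so no manual escaping is involved.

```python
import sqlite3


def round_trip(html_text):
    """Store html_text through a bound parameter and read it back."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE news (body TEXT)")
        # The driver handles quoting; the string is never escaped by hand.
        conn.execute("INSERT INTO news (body) VALUES (?)", (html_text,))
        (stored,) = conn.execute("SELECT body FROM news").fetchone()
        return stored
    finally:
        conn.close()


if __name__ == "__main__":
    sample = '<p class="noticia_texto">jlasdfklsfklaf</p>'
    print(round_trip(sample) == sample)
```

If the value that comes back differs from what went in, the change happened in application code before the insert or after the select.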

How can I extract the HREF value from an HTML link?

My text file contains 2 lines:
<IMG SRC="/icons/folder.gif" ALT="[DIR]"> yahoo.com.jp/
</PRE><HR>
In my Perl script, I have:
my $String =~ /.*(HREF=")(.*)(">)/;
print "$2";
and my output is the following:
Output 1: yahoo.com.jp
Output 2: ><HR>
What I am trying to achieve is have my Perl script automatically extract the string inside the <A Href="">
As I am very new to regex, I want to ask if my regex is a badly formed one? If so can someone provide some suggestion to make it look nicer?
Secondly, I do not know why my second output is "><HR>", I thought the expected behavior is that output2 will be skipped since it does not contain HREF=". Obviously I am very wrong.
Thanks for the help.
To answer your specific question about why your regex isn't working: you're using .*, which is "greedy", meaning that by default it will match as much as it can. Alternatives would be to use the non-greedy form, .*?, or to be a bit more exacting about what you're trying to match. For instance, [^"]* will match anything that's not a double quote, which seems to be what you're looking for.
But yes, the other posters are correct - using regular expressions to do anything non-trivial in HTML parsing is a recipe for disaster. Technically you can do it properly, especially in Perl 5.10 (which has more advanced regular expression features), but it's usually not worth the headache.
Using regular expressions to parse HTML works just often enough to lull you into a false sense of security. You can get away with it for simple cases where you control the input but you're better off using something like HTML::Parser instead.
If I may, I'd like to suggest the simplest way of doing this (it may not be the fastest or lightest-weight way): HTML::TreeBuilder::XPath
It gives you the power of XPath in non-well-formed HTML.
use HTML::TreeBuilder::XPath;
my $tree= HTML::TreeBuilder::XPath->new_from_file( 'D:\Archive\XPath.pm.htm' );
my @hrefs = $tree->findvalues( '//div[@class="noprint"]/a/@href');
print "The links are: ", join( ',', @hrefs ), "\n";
When trying to match against HTML (or XML) with a regex, you have to be careful about using .*. You rarely want a .* because the star is a greedy quantifier that will match as far as it can. As Gumbo showed, use the character class [^"]* to match all characters except a quote. This will match up to the closing quote. You may also want to use something similar for matching the angle bracket. Try this:
/HREF="([^"]*)"[^>]*>/i
That should match much more consistently.
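The difference between the greedy form and the negated character class is easy to see side by side. This is a small Python sketch, not Perl; the sample line with two links is made up for the illustration.

```python
import re

# Hypothetical input with two links on one line.
sample = '<A HREF="a.html">A</A> <A HREF="b.html">B</A>'

# Greedy: (.*) runs to the LAST '">' on the line.
GREEDY = re.compile(r'HREF="(.*)">', re.IGNORECASE)

# Negated class: [^"]* cannot cross the closing quote.
NEGATED = re.compile(r'HREF="([^"]*)"[^>]*>', re.IGNORECASE)

if __name__ == "__main__":
    print(GREEDY.search(sample).group(1))
    print(NEGATED.findall(sample))
```

The greedy pattern swallows everything between the first opening quote and the last closing one, while the negated class returns each href on its own.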

How do I match text in HTML that's not inside tags?

Given a string like this:
This is the foo link
... and a search string like "foo", I would like to highlight all occurrences of "foo" in the text of the HTML -- but not inside a tag. In other words, I want to get this:
This is the <b>foo</b> link
However, a simple search-and-replace won't work, because it will match part of the URL in the <a> tag's href.
So, to express the above in the form of a question: How do I restrict a regex so that it only matches text outside of HTML tags?
Note: I promise that the HTML in question will never be anything pathological like:
<img title="Haha! Here are some angle brackets to screw you up: ><" />
Edit: Yes, of course I'm aware that there are complex libraries in CPAN that can parse even the most heinous HTML, and thus alleviate the need for such a regex. On many occasions, that's what I would use. However, this is not one of those occasions, since keeping this script short and simple, without external dependencies, is important. I just want a one-line regex.
Edit 2: Again, I know that Template::Refine::Fragment can parse all my HTML for me. If I were writing an application I would certainly use a solution like that. But this isn't an application. It's barely more than a shell script. It's a piece of disposable code. Being a single, self-contained file that can be passed around is of great value in this case. "Hey, run this program" is a much simpler instruction than, "Hey, install a Perl module and then run this-- wait, what, you've never used CPAN before? Okay, run perl -MCPAN -e shell (preferably as root) and then it's going to ask you a bunch of questions, but you don't really need to answer them. No, don't be afraid, this isn't going to break anything. Look, you don't need to answer every question carefully -- just hit enter over and over. No, I promise, it's not going to break anything."
Now multiply the above across a great deal of users who are wondering why the simple script they've been using isn't so simple anymore, when all that's changed is to make the search term boldface.
So while Template::Refine::Fragment may be the answer to someone else's HTML parsing question, it's not the answer to this question. I just want a regular expression that works on the very limited subset of HTML that the script will actually be asked to parse.
If you can absolutely guarantee that there are no angle brackets in the HTML other than those used to open and close tags, this should work:
s%(>|\G)([^<]*?)($key)%$1$2<b>$3</b>%g
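Under the same guarantee (no stray angle brackets), the idea can also be sketched in Python. Note that this is a different technique from the \G substitution above: it splits the string on tags and only touches the pieces in between. The function name and the sample URL are made up for the example.

```python
import re

# Capturing group so re.split keeps the tags in the result list.
TAG = re.compile(r"(<[^>]*>)")


def highlight_outside_tags(html_text, key):
    """Wrap every occurrence of key in <b>...</b>, but only outside tags."""
    pieces = TAG.split(html_text)
    pattern = re.compile(re.escape(key))
    # With one capturing group, even indexes are text and odd indexes are tags.
    for i in range(0, len(pieces), 2):
        pieces[i] = pattern.sub(lambda m: "<b>" + m.group(0) + "</b>", pieces[i])
    return "".join(pieces)


if __name__ == "__main__":
    print(highlight_outside_tags(
        'This is the <a href="http://foo.example/">foo</a> link', "foo"))
```

Like the one-liner, this breaks as soon as an attribute value contains a > character.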
In general, you want to parse the HTML into a DOM, and then traverse the text nodes. I would use Template::Refine for this:
#!/usr/bin/env perl
use strict;
use warnings;
use feature ':5.10';
use Template::Refine::Fragment;
my $frag = Template::Refine::Fragment->new_from_string('<p>Hello, world. This is a test of foo finding. Here is another foo.');
say $frag->process(
    simple_replace {
        my $n = shift;
        my $text = $n->textContent;
        $text =~ s/foo/<foo>/g;
        return XML::LibXML::Text->new($text);
    } '//text()',
)->render;
This outputs:
<p>Hello, world. This is a test of <foo> finding. Here is another <foo>.</p>
Anyway, don't parse structured data with regular expressions. HTML is not "regular", it's "context-free".
Edit: finally, if you are generating the HTML inside your program, and you have to do transformations like this on strings, "UR DOIN IT WRONG". You should build a DOM, and only serialize it when everything has been transformed. (You can still use TR, however, via the new_from_dom constructor.)
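For comparison, the "visit only the text nodes" idea can be approximated without any external dependency. The sketch below is Python, not Perl, and uses the standard library's event-based html.parser instead of a real DOM, so it is a swapped-in technique rather than what Template::Refine does. The class and function names are made up, and comments and doctype declarations are simply dropped.

```python
from html.parser import HTMLParser


class TextHighlighter(HTMLParser):
    """Re-emits the markup and wraps key in <b> inside text only."""

    def __init__(self, key):
        # convert_charrefs=False keeps entities such as &amp; as written.
        super().__init__(convert_charrefs=False)
        self.key = key
        self.out = []

    def handle_starttag(self, tag, attrs):
        self.out.append(self.get_starttag_text())  # raw tag, attributes untouched

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())  # self-closing tags like <br/>

    def handle_endtag(self, tag):
        self.out.append("</" + tag + ">")

    def handle_data(self, data):
        # Only text between tags reaches this method.
        self.out.append(data.replace(self.key, "<b>" + self.key + "</b>"))

    def handle_entityref(self, name):
        self.out.append("&" + name + ";")

    def handle_charref(self, name):
        self.out.append("&#" + name + ";")


def highlight_text_nodes(html_text, key):
    parser = TextHighlighter(key)
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)


if __name__ == "__main__":
    print(highlight_text_nodes(
        '<a href="http://foo.example/">foo</a> and foo', "foo"))
```

Because the parser decides what is a tag and what is text, the search term inside the href is never touched.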
The following regex will match all text between tags or outside of tags:
<.*?>(.*?)<.*?>|>(.*?)<
Then you can operate on that as desired.
Try this one
(?=>)?(\w[^>]+?)(?=<)
it matches all words between tags
To strip off the variable size contents from even nested tags you can use this regex that is in fact a mini-regular grammar for that. (note: PCRE machine)
(?<=>)((?:\w+)(?:\s*))(?1)*