Regex has unexpected results. I am new to this, fair warning - html

The following regex does match what I am looking for, but it will also match all file extensions (just the file extensions) of anything ending with gif|jpg|png
webcomic"\ssrc="http://www\.explosm\.net/[a-zA-Z/]+\.gif|png|jpg"\s
I am using it on the source of the following page, which is a webcomic that is updated daily:
http://www.explosm.net/comics/
Today, the end goal would be the following, and only the following:
webcomic" src="http://www.explosm.net/db/files/Comics/Kris/lawyer.gif"
I'm just getting my feet wet with regex, have browsed a few websites but can't figure this one out. I don't get why just the file extensions are getting matched, when their file paths/urls do not match the rest of my pattern.
Any help appreciated

Well, the problem that jumps right out at me is the end there. gif|png|jpg should really be (gif|jpg|png) - with what you have now, the string can match webcomic"\ssrc="http://www\.explosm\.net/[a-zA-Z/]+\.gif, or it can match just png or jpg"\s. With the parentheses, it will match webcomic"\ssrc="http://www\.explosm\.net/[a-zA-Z/]+\. followed by (gif or jpg or png), and then followed by "\s.
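To see the precedence difference concretely, here is a minimal Python sketch. It assumes Python's re module treats | the same way as the engine in the question, and the sample string is made up from the URL in the question (the alt attribute is invented so that a stray "png" appears).

```python
import re

# Sample text made up for illustration, based on the URL in the question.
html = 'id="webcomic" src="http://www.explosm.net/db/files/Comics/Kris/lawyer.gif" alt="x.png" '

# Ungrouped: the alternation splits the WHOLE pattern into three branches.
ungrouped = re.compile(r'webcomic"\ssrc="http://www\.explosm\.net/[a-zA-Z/]+\.gif|png|jpg"\s')
# Grouped: only the file extension alternates.
grouped = re.compile(r'webcomic"\ssrc="http://www\.explosm\.net/[a-zA-Z/]+\.(?:gif|png|jpg)"\s')

ungrouped_matches = [m.group(0) for m in ungrouped.finditer(html)]
grouped_matches = [m.group(0) for m in grouped.finditer(html)]

print(ungrouped_matches)  # includes a bare 'png' match
print(grouped_matches)    # only the full webcomic match
```

The (?:...) form groups without capturing; plain parentheses work just as well if you don't mind the extra capture group.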

That last bit
gif|png|jpg
means "match any of the three". If you want it to match just gif, write just gif.

I'd try a regex like this:
\shttp://www.explosm.net\/[a-zA-Z]+\.(gif|png|jpg|jpeg)\s

How to build complex vs code snippet variable transforms?

I'm trying to write a code snippet for vs code that takes a given file name, removes a piece of the name and capitalizes the first letter. For example
Input:
example.model.js
Output:
Example
Output I'm getting:
${TM_FILENAME_BASE/(.*).[model]+$//capitalize//}
I'm able to remove the trailing half of the file name with the following string
"${TM_FILENAME_BASE/(.*)\\.[model]+$/$1/}"
I tried to take this a step further with the following but it doesn't seem to work.
"${TM_FILENAME_BASE/(.*)\\.[model]+$/${1:/capitalize/}/}"
Based on the documentation, I'm not sure where I'm going wrong.
https://code.visualstudio.com/docs/editor/userdefinedsnippets#_transform-examples
Any ideas on what I'm missing here? Also are there any tools that could help build these kinds of complex expressions?
Thanks
It looks like I was writing the grammar incorrectly by adding a trailing slash (/) after capitalize. The correct way is below:
"${TM_FILENAME_BASE/(.*)\\.[model]+$/${1:/capitalize}/}"
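With this regex, (.*)\\.[model]+$, (.*) captures the whole word.
For example, it captures example in example.model.js, and /capitalize then uppercases only its first character, giving Example (/upcase is the modifier that would give EXAMPLE).
If you capture only the first character instead, put the rest of the name in a second group so that it is not dropped:
"${TM_FILENAME_BASE/(.)(.*)\\.[model]+$/${1:/capitalize}$2/}"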
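On the question about tools: one rough way to test the regex half of a transform outside the editor is to emulate it in a few lines of Python. This is only a sketch. The capitalize helper is a hypothetical stand-in for the /capitalize modifier (assumed to uppercase the first character only), and Python's re is not the same engine VS Code uses, so treat the result as a sanity check.

```python
import re

def capitalize(text: str) -> str:
    # Hypothetical stand-in for the /capitalize modifier:
    # assumed to uppercase the first character and leave the rest alone.
    return text[:1].upper() + text[1:]

def transform(filename_base: str) -> str:
    # Same regex as the snippet, with JSON's doubled backslash reduced to one.
    return re.sub(r'(.*)\.[model]+$', lambda m: capitalize(m.group(1)), filename_base)

# TM_FILENAME_BASE for example.model.js is the name without the final extension.
result = transform('example.model')
print(result)
```

Note that [model]+ is a character class, so it matches any run of the letters m, o, d, e, l rather than the literal word model.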

Regular expressions to match several characters Editpad

I have a wordpress export with older posts which are going to be imported into a different installation. But I have an issue where there is a part of the content that needs to be converted into a different type of content in the other installation.
The original code is like this:
000d3c4c]]></content:encoded>
And I need to make it into this
</content:encoded><wp:postmeta>
<wp:meta_key><![CDATA[mkd_post_audio_link_meta]]></wp:meta_key>
<wp:meta_value><![CDATA[000d2aa8.mp3]]></wp:meta_value>
</wp:postmeta>
I'm using EditPad Pro to search through the file. To avoid spending several hours making this change manually, I want to use the extensive search-and-replace feature in EditPad, but I'm having trouble matching this. I want an expression like this.
The first search and replace will work no matter what, because I can change
<a href="000
into
</content:encoded>
<wp:postmeta>
<wp:meta_key>
<![CDATA[mkd_post_audio_link_meta]]></wp:meta_key><wp:meta_value>><![CDATA[000d2aa8.mp3]]></wp:meta_value>
</wp:postmeta>
But I struggle with the next part.
How can I change everything after .mp3 to this
]]></wp:meta_value>
</wp:postmeta>
To make it brief: I want to replace all occurrences of
">000d3c4c</a>]]></content:encoded>
with
]]></wp:meta_value>
</wp:postmeta>
The 000d3c4c part is different for each occurrence of the mp3 link.
Match this
<a[^>]*>([^<]+)
and replace by this
</content:encoded>
<wp:postmeta>
<wp:meta_key><![CDATA[mkd_post_audio_link_meta]]></wp:meta_key>
<wp:meta_value><![CDATA[$1.mp3]]></wp:meta_value>
</wp:postmeta>
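The same match-and-replace can be tried on a small sample before running it over the whole export. The sketch below uses Python's re.sub, where the backreference is written \1 rather than $1. The sample line is an assumption about what the export looks like (the link markup is guessed from the fragments in the question), so adjust it to the real file.

```python
import re

# Assumed sample of the export; the real file may differ.
sample = 'Some post text <a href="000d3c4c.mp3">000d3c4c</a>]]></content:encoded>'

replacement = (
    '</content:encoded>\n'
    '<wp:postmeta>\n'
    '<wp:meta_key><![CDATA[mkd_post_audio_link_meta]]></wp:meta_key>\n'
    r'<wp:meta_value><![CDATA[\1.mp3]]></wp:meta_value>' '\n'
    '</wp:postmeta>'
)

# The pattern from the answer: an opening <a ...> tag plus the link text.
converted = re.sub(r'<a[^>]*>([^<]+)', replacement, sample)
print(converted)
```

Because the pattern stops before </a>, the trailing </a>]]></content:encoded> is left in place after the new block. Removing it takes either a second search and replace or a longer pattern that also matches that tail.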

Trademark symbol is displayed as raw text

If you visit www.startwire.com, you'll see in the center of the page (in the yellow box, under the video) the following:
StartWire&trade;
In our dev and stage environments this is not an issue, but it is in production. What could possibly be causing this?
If you look at the page source, you will see &amp;trade; - you are double encoding the entity.
This should be simply &trade;.
In the HTML you have:
<h2>Sign-up now. StartWire&amp;trade; is completely FREE.</h2>
whereas the correct version would be:
<h2>Sign-up now. StartWire&trade; is completely FREE.</h2>
Notice the extraneous amp;. It looks like you are double encoding something on the server.
If you check your page source it says:
&amp;trade;
This probably means that something took &trade; and HTML-encoded it again, so the & became &amp;. This is probably due to the use of an htmlentities() function.
Make sure you do not do this conversion twice...
A possible cause of this is that you are taking the contents from a database and that you have encoded the entries before inserting them into the database and you encode them a second time when you retrieve them from this database.
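The effect is easy to reproduce. The sketch below uses Python's standard html module as a stand-in for whatever encoding function the site actually calls (that part is an assumption), and uses html.unescape to approximate what a browser would display.

```python
import html

# Already an entity, as it should appear in the page source.
stored = 'StartWire&trade;'
# Escaping it a second time turns the & into &amp;.
double_encoded = html.escape(stored)

# html.unescape approximates what a browser shows for each version.
shown_ok = html.unescape(stored)
shown_bad = html.unescape(double_encoded)

print(double_encoded)
print(shown_ok)
print(shown_bad)
```

The fix is to encode exactly once: either store raw text and encode on output, or store encoded text and output it untouched.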
Is the content being "HTML encoded" (or whatever they call it) automatically, somewhere in the script? Because this is what appears in the HTML: &amp;trade;.
My suggestion would be to just use the symbol itself in your code (™). If that doesn't work, try escaping the & of &trade; using \ (so that it becomes \&trade;).
Not sure, but I have checked your site and it looks like you have written
&amp;trade;
Simply write &trade;

Regex to find http and .html

I'm trying to find links in the following format:
http://subdomain.subdomain.domain.tld/subfolder/randomstring.html
Basically, I need a regex that looks for http:// and stops looking when it finds .html. Everything in between shouldn't matter. I.e., more/less subdomains, variable TLD and variable folder.
Is this possible?
((http://)?=(.html))
What I've got so far (not functional) is this. I'm really not familiar with the look-ahead assertion so I might be on the wrong track.
Anyways, any help is going to be greatly appreciated!
Look-ahead? You only need a non-greedy match-everything:
/http:\/\/.*?\.html/
I would use something like: /http:\/\/[^<>\s]+?\.html/
Can be enhanced, but at least won't match stuff like:
http://something.com/ has a lot of .html files
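A quick comparison of the two patterns on that kind of input, sketched in Python (the sample sentence and the example.org URL are made up for illustration):

```python
import re

text = ('See http://something.com/ it has a lot of .html files, '
        'e.g. http://a.b.example.org/dir/page.html here.')

# Non-greedy "anything": happily runs across spaces until the first .html.
lazy = re.findall(r'http://.*?\.html', text)
# Excluding whitespace and angle brackets keeps the match inside one URL.
strict = re.findall(r'http://[^<>\s]+?\.html', text)

print(lazy)
print(strict)
```

The first pattern returns a false positive that starts at http://something.com/ and ends at the unrelated ".html" later in the sentence; the second returns only the real link.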

How can I extract the HREF value from an HTML link?

My text file contains 2 lines:
<IMG SRC="/icons/folder.gif" ALT="[DIR]"> yahoo.com.jp/
</PRE><HR>
In my Perl script, I have:
my $String =~ /.*(HREF=")(.*)(">)/;
print "$2";
and my output is the following:
Output 1: yahoo.com.jp
Output 2: ><HR>
What I am trying to achieve is have my Perl script automatically extract the string inside the <A Href="">
As I am very new to regex, I want to ask if my regex is a badly formed one? If so can someone provide some suggestion to make it look nicer?
Secondly, I do not know why my second output is "><HR>", I thought the expected behavior is that output2 will be skipped since it does not contain HREF=". Obviously I am very wrong.
Thanks for the help.
To answer your specific question about why your regex isn't working: you're using .*, which is "greedy" - by default it will match as much as it can. Alternatives would be using the non-greedy form, .*?, or being a bit more exact about what you're trying to match. For instance, [^"]* will match anything that's not a double quote, which seems to be what you're looking for.
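Here is the difference on a small example, sketched in Python rather than Perl (the two engines behave the same way for these patterns). The sample line is made up, with an extra TITLE attribute added so that the greedy match has something to overrun into.

```python
import re

# Made-up line with an extra attribute after HREF.
line = '<A HREF="yahoo.com.jp/" TITLE="Yahoo">yahoo.com.jp/</A>'

# Greedy .* runs to the LAST "> on the line, swallowing the TITLE attribute.
greedy = re.search(r'HREF="(.*)">', line).group(1)
# [^"]* cannot cross a double quote, so it stops at the end of the value.
negated = re.search(r'HREF="([^"]*)"', line).group(1)

print(greedy)
print(negated)
```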
But yes, the other posters are correct - using regular expressions to do anything non-trivial in HTML parsing is a recipe for disaster. Technically you can do it properly, especially in Perl 5.10 (which has more advanced regular expression features), but it's usually not worth the headache.
Using regular expressions to parse HTML works just often enough to lull you into a false sense of security. You can get away with it for simple cases where you control the input but you're better off using something like HTML::Parser instead.
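For comparison, this is roughly what the parser-based approach looks like. The sketch uses Python's standard-library html.parser as a stand-in for Perl's HTML::Parser; the HrefCollector class is a hypothetical helper, and the input assumes the first line of the file contains an <A HREF="yahoo.com.jp/"> link.

```python
from html.parser import HTMLParser

class HrefCollector(HTMLParser):
    """Collects the href value of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased, so <A HREF=...> works too.
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value is not None:
                    self.hrefs.append(value)

collector = HrefCollector()
collector.feed('<IMG SRC="/icons/folder.gif" ALT="[DIR]"> '
               '<A HREF="yahoo.com.jp/">yahoo.com.jp/</A>\n</PRE><HR>')
collector.close()
hrefs = collector.hrefs
print(hrefs)
```

The parser takes care of quoting, attribute order, and letter case, which is exactly what a hand-written regex tends to get wrong.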
If I may, I'd like to suggest the simplest way of doing this (it may not be the fastest or lightest-weight way): HTML::TreeBuilder::XPath
It gives you the power of XPath in non-well-formed HTML.
use HTML::TreeBuilder::XPath;
my $tree= HTML::TreeBuilder::XPath->new_from_file( 'D:\Archive\XPath.pm.htm' );
my @hrefs = $tree->findvalues( '//div[@class="noprint"]/a/@href');
print "The links are: ", join( ',', @hrefs ), "\n";
When trying to match against HTML (or XML) with a regex, you have to be careful about using .* - you rarely ever want a .* because star is a greedy modifier that will match as far as it can. As Gumbo showed, use the character class specifier [^"]* to match all characters except a quote. This will match up to the end quote. You may also want to use something similar for matching the angle bracket. Try this:
/HREF="([^"]*)"[^>]*>/i
That should match much more consistently.