How to edit Firefox bookmarks with regex in Notepad++ - html

I need to combine multiple bookmarks files and reduce their size, but I don't know how to use regular expressions.
I want to:
Delete every line that starts with <DD>
Delete the following HTML attributes and the (unknown) text between the
quotes: ICON_URI="...", ICON="...", and LAST_CHARSET="..."
Replace the text between > and </A>
Delete duplicate lines
Sort lines alphabetically

Tested in Notepad++; it may not work in other tools, and some unusual cases may fail in Notepad++ as well.
1. ^<DD>.*?$ - replace with an empty string
2. ICON_URI="[^"]*" - replace with an empty string (repeat with ICON and LAST_CHARSET in place of ICON_URI)
3. (<a[^>]+>).+?</a> - replace with \1 </a>
4. This is hard to do with regex; you can use grouping and repetition, but I'm not experienced with that
5. Use Excel or a similar tool and sort there; it's much easier
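If a script is an option, the steps can also be sketched in Python rather than in the editor. This is only a minimal sketch, not a tested bookmark tool: the helper name clean_bookmarks and the sample data are made up for illustration, and it assumes each bookmark sits on its own line, as in a typical Firefox export.

```python
import re

# Attributes to strip, together with their quoted values (step 2 of the question).
ATTRS = re.compile(r'\s+(?:ICON_URI|ICON|LAST_CHARSET)="[^"]*"', re.IGNORECASE)


def clean_bookmarks(text):
    """Hypothetical helper: apply steps 1, 2, 4 and 5 to a bookmarks export."""
    lines = []
    for line in text.splitlines():
        if line.lstrip().startswith("<DD>"):   # step 1: drop description lines
            continue
        lines.append(ATTRS.sub("", line))      # step 2: drop icon/charset attributes
    unique = set(lines)                        # step 4: remove duplicate lines
    return "\n".join(sorted(unique))           # step 5: sort alphabetically


# Made-up sample input for illustration only.
sample = """<DT><A HREF="http://b.example/" ICON_URI="http://b.example/f.ico" ICON="data:abc">B</A>
<DD>some description
<DT><A HREF="http://a.example/" LAST_CHARSET="UTF-8">A</A>
<DT><A HREF="http://a.example/" LAST_CHARSET="UTF-8">A</A>"""

result = clean_bookmarks(sample)
print(result)
```

Step 3 is left out because the question does not say what the replacement text should be. Sorting every line also moves structural lines such as <DL> and <H3>, so this probably only makes sense on the bookmark lines themselves.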

Related

Regular expressions to match several characters Editpad

I have a wordpress export with older posts which are going to be imported into a different installation. But I have an issue where there is a part of the content that needs to be converted into a different type of content in the other installation.
The original code is like this:
000d3c4c]]></content:encoded>
And I need to make it into this
</content:encoded><wp:postmeta>
<wp:meta_key><![CDATA[mkd_post_audio_link_meta]]></wp:meta_key>
<wp:meta_value><![CDATA[000d2aa8.mp3]]></wp:meta_value>
</wp:postmeta>
I'm using EditPad Pro to search through the file, and to avoid spending several hours making this change manually, I want to use the extensive search-and-replace feature in EditPad, but I'm having trouble matching this. I want an expression like this.
The first search and replace will work no matter what, because I can change
<a href="000
into
</content:encoded>
<wp:postmeta>
<wp:meta_key>
<![CDATA[mkd_post_audio_link_meta]]></wp:meta_key><wp:meta_value><![CDATA[000d2aa8.mp3]]></wp:meta_value>
</wp:postmeta>
But I struggle with the next part.
How can I change everything after .mp3 to this
]]></wp:meta_value>
</wp:postmeta>
In short, I want to replace all occurrences of
">000d3c4c</a>]]></content:encoded>
with
]]></wp:meta_value>
</wp:postmeta>
The 000d3c4c part is different for each occurrence of the mp3 link.
Match this
<a[^>]*>([^<]+)
and replace by this
</content:encoded>
<wp:postmeta>
<wp:meta_key><![CDATA[mkd_post_audio_link_meta]]></wp:meta_key>
<wp:meta_value><![CDATA[$1.mp3]]></wp:meta_value>
</wp:postmeta>
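For comparison, the whole conversion can be sketched as a single substitution in Python. This is only an illustration under stated assumptions: the input shape (an audio link sitting directly before the closing ]]></content:encoded>) is guessed from the question, the sample string is made up, and \1 is Python's spelling of the $1 used above.

```python
import re

# Assumed input shape: the audio link is the last thing inside the CDATA block.
source = ('<content:encoded><![CDATA[Some text '
          '<a href="000d3c4c.mp3">000d3c4c</a>]]></content:encoded>')

# Capture the link text, then consume the closing </a>, CDATA end and tag as well.
pattern = re.compile(r'<a[^>]*>([^<]+)</a>\]\]></content:encoded>')

# \1 is the captured link text; the rest is the postmeta block from the question.
replacement = (
    ']]></content:encoded>\n'
    '<wp:postmeta>\n'
    '<wp:meta_key><![CDATA[mkd_post_audio_link_meta]]></wp:meta_key>\n'
    r'<wp:meta_value><![CDATA[\1.mp3]]></wp:meta_value>' '\n'
    '</wp:postmeta>'
)

converted = pattern.sub(replacement, source)
print(converted)
```

Matching the trailing </a>]]></content:encoded> in the same pattern avoids the separate second search-and-replace pass the question was struggling with.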

Can't figure out a regex with line break - HTML

I have written a very simple regular expression to search within an HTML document for any tag - as we are modifying 40+ templates that have been edited by a WYSIWYG editor that was horrible. Basically, it added style="font... tags everywhere - so I want to delete them all.
The problem is, some of them have line breaks between the styles (like you would typically write CSS) - and I can't figure out how to include line breaks within my expression.
Here is what I have:
style="font(.*?)"
I am using TextMate to search for it, and it works great except for styles that have hard line breaks in them.
Any help???
Use this regex: style="font([\s\S]*?)". The . does not match \n by default.
Putting (?s) at the front of your regex causes . to match newlines as well.
This is the most straightforward way to do it:
style="font([^"]*)"
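The difference between these variants is easy to check outside the editor. The sketch below uses Python's re module as a stand-in, which is an assumption: TextMate uses its own regex engine, though these particular constructs should behave the same way. The sample markup is made up.

```python
import re

# Made-up sample: a style attribute that spans two lines.
html = '<p style="font-family: Arial;\n   font-size: 12px;" class="x">text</p>'

# By default "." stops at a newline, so the multi-line attribute is missed.
assert re.search(r'style="font(.*?)"', html) is None

# The three fixes suggested in the answers.
patterns = [
    r'style="font([\s\S]*?)"',   # [\s\S] matches any character, newline included
    r'(?s)style="font(.*?)"',    # (?s) lets "." match newlines too
    r'style="font([^"]*)"',      # "anything but a quote" never needs "."
]
results = [re.sub(p, "", html) for p in patterns]
print(results[0])
```

All three patterns remove the same span here; the [^"]* version has the advantage of not depending on any single-line or dot-all option.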

Regular Expression to Retrieve text between two html tags with Visual Studio's search-replace feature

I'm trying to use Visual Studio's search-replace function to remove tags that don't do anything. The intent is to simplify some HTML before I paste it into a SharePoint page.
This is what I'm using in the Find box \<font\>{~(.*\<font\>.*)}\</font\>
And the Replace box has \1
However, the expression comes up with no matches, even though I have plenty of places like this <font> xxxx </font> within the HTML. I could move the .* outside the parentheses, but then the expression matches most of the line where I have multiple sets of font tags - some of which actually do something.
I'm thinking this would be much easier if the IDE used the same regular expression engine as the languages for which it is the primary development tool.
I just had to review the documentation for VS 2010. Using a minimal match # was all I needed: \<font\>{.#}\</font\>.
I was trying to replace all span tags with div tags, and solved a similar problem with the following regex. I had to escape both the > and < and the double quotes around the class attribute.
Find: \<span class=\"label\"\>{.#}\</span\>
Replace: <div class="label">\1</div>
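The effect of the minimal match is the same as a lazy quantifier in other regex flavors. As a rough illustration, here is the greedy versus lazy behavior in Python, where .*? plays the role that .# plays in the VS 2010 syntax; the sample string is made up.

```python
import re

# Made-up sample: two separate font elements on one line.
html = '<font> xxxx </font> and <font> yyyy </font>'

# Greedy: .* runs to the LAST </font>, so both elements are treated as one match.
greedy = re.sub(r'<font>(.*)</font>', r'\1', html)

# Lazy: .*? stops at the FIRST </font>, so each element is unwrapped on its own.
lazy = re.sub(r'<font>(.*?)</font>', r'\1', html)

print(repr(greedy))
print(repr(lazy))
```

Only the lazy version strips both pairs of tags; the greedy one leaves the inner </font> and <font> in place.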

How to remove all empty tags in X/HTML code in once?

for example :
I want to remove all highlighted tags
You could use a regular expression in any editor that supports them. For instance, I tested this one in Dreamweaver:
<(?!\!|input|br|img|meta|hr)[^/>]*?>[\s]*?</[^>]*?>
Just make a search and replace all (with the regex as search string and nothing as replacement). Note however that this may remove necessary whitespace. If you just want to remove empty tags without anything in between,
<(?!\!|input|br|img|meta|hr)[^/>]*?></[^>]*?>
would be the way to go.
Update: if you want to remove &nbsp; entities as well:
<(?!\!|input|br|img|meta|hr)[^/>]*?>(?:[\s]|&nbsp;)*?</[^>]*?>
I did not verify this one - it should be OK though, try it out :-)
If this is only about quickly editing a file, and your editor supports regular expression replacement, you can use a regex like this:
<[^>]+></[^>]+>
Search for this regex, and replace with an empty string.
Note: This isn't safe in any way - don't rely on it, as it can find more things than just valid, empty tags. (It would also find <a></b> for example.) There is no safe way to do this with regexes - but if you check each replacement manually, you should be fine. If you need real safe replacement, then either you'll have to find an editor that supports this (JEdit may be a good bet, but I haven't checked), or you'll have to parse the file yourself - e.g. using XSLT.
What you're asking for sounds like a job for regular expressions. Many editors support regular expression find/replace. Personally, I'd probably do this from the command-line with Perl (sed would also work), but that's just me.
perl -pe 's|<([^\s>]+)[^>]*></\1>||g' < file.html > new_file.html
or if you're brave, edit the file in place:
perl -pe 's|<([^\s>]+)[^>]*></\1>||g' -i file.html
This will remove:
<p></p>
<p id="foo"></p>
but not:
<p>hello world</p>
<p></a>
Warning: things like <img src="pic.png"></img> and <br></br> will also be removed. It's not obvious from your question, but I'll assume this is undesirable. Maybe you're not worried because you know all your images are declared like this <img src="pic.png"/>. Otherwise the regular expression will need to be modified to account for this, but I decided to start simple for an easier explanation...
It works by matching the opening tag: a literal < followed by the tag name (one or more characters which are not whitespace or > = [^\s>]+), any attributes (zero or more characters which aren't > = [^>]*), and then a literal >; and a closing tag with the same name: this takes advantage of the fact that we captured the tag name, so we can use a backreference = </\1>. The matches are then replaced with the empty string.
If the syntax/terminology used here is unfamiliar to you, I'm a fan of the perlre documentation page. Regular expression syntax in other languages should be very similar if not identical to this, so hopefully this will be useful even if you don't Perl :)
Oh, one more thing. If you have things like <div><p></p></div>, these will not be picked up all at once. You'll have to do multiple passes: the first will remove the <p></p>, leaving a <div></div> to be removed by the second. In Perl, the substitution operator returns the number of replacements made, so you can loop until nothing is left to remove:
perl -pe '1 while s|<([^\s>]+)[^>]*></\1>||g' < file.html > new_file.html
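The same repeat-until-stable idea can be sketched in Python, where re.subn returns the number of replacements alongside the new string. The helper name strip_empty_tags is made up for illustration, and unlike perl -p this version works on the whole string rather than line by line.

```python
import re

# Same pattern as the Perl one-liner: opening tag, optional attributes,
# then a closing tag with the same name (backreference \1).
EMPTY_TAG = re.compile(r'<([^\s>]+)[^>]*></\1>')


def strip_empty_tags(html):
    """Hypothetical helper: repeat the substitution until nothing changes."""
    while True:
        html, count = EMPTY_TAG.subn("", html)
        if count == 0:
            return html


cleaned = strip_empty_tags('<div><p id="foo"></p></div><p>hello world</p><p></a>')
print(cleaned)
```

The nested <div><p id="foo"></p></div> disappears after two passes, while <p>hello world</p> and the mismatched <p></a> are left alone. The warning about <img ...></img> and <br></br> applies here too.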

How can I extract the HREF value from an HTML link?

My text file contains 2 lines:
<IMG SRC="/icons/folder.gif" ALT="[DIR]"> yahoo.com.jp/
</PRE><HR>
In my Perl script, I have:
my $String =~ /.*(HREF=")(.*)(">)/;
print "$2";
and my output is the following:
Output 1: yahoo.com.jp
Output 2: ><HR>
What I am trying to achieve is have my Perl script automatically extract the string inside the <A Href="">
As I am very new to regex, I want to ask if my regex is a badly formed one? If so can someone provide some suggestion to make it look nicer?
Secondly, I do not know why my second output is "><HR>", I thought the expected behavior is that output2 will be skipped since it does not contain HREF=". Obviously I am very wrong.
Thanks for the help.
To answer your specific question about why your regex isn't working: you're using .*, which is "greedy" - by default it will match as much as it can. Alternatives would be using the non-greedy form, .*?, or being a bit more exact about what you're trying to match. For instance, [^"]* will match anything that's not a double quote, which seems to be what you're looking for.
But yes, the other posters are correct - using regular expressions to do anything non-trivial in HTML parsing is a recipe for disaster. Technically you can do it properly, especially in Perl 5.10 (which has more advanced regular expression features), but it's usually not worth the headache.
Using regular expressions to parse HTML works just often enough to lull you into a false sense of security. You can get away with it for simple cases where you control the input but you're better off using something like HTML::Parser instead.
If I may, I'd like to suggest the simplest way of doing this (it may not be the fastest or lightest-weight way): HTML::TreeBuilder::XPath
It gives you the power of XPath in non-well-formed HTML.
use HTML::TreeBuilder::XPath;
my $tree= HTML::TreeBuilder::XPath->new_from_file( 'D:\Archive\XPath.pm.htm' );
my @hrefs = $tree->findvalues( '//div[@class="noprint"]/a/@href');
print "The links are: ", join( ',', @hrefs ), "\n";
When trying to match against HTML (or XML) with a regex, you have to be careful about using .*. You rarely want .* because the star is a greedy quantifier that will match as far as it can. As Gumbo showed, use the character class [^"]* to match all characters except a quote; this will match up to the closing quote. You may also want to use something similar for matching the angle bracket. Try this:
/HREF="([^"]*)"[^>]*>/i
That should match much more consistently.
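To see why the character class is more reliable, here is a small comparison in Python (the pattern itself carries over from Perl unchanged). The sample line is an assumption: the A HREF elements are made up, since the link markup is not shown in the question's sample text.

```python
import re

# Assumed sample line; the two A HREF elements are hypothetical.
line = ('<IMG SRC="/icons/folder.gif" ALT="[DIR]"> '
        '<A HREF="yahoo.com.jp/">yahoo.com.jp/</A> <A HREF="other/">other/</A>')

# Greedy .* runs to the LAST "> on the line and swallows everything in between.
greedy = re.search(r'HREF="(.*)">', line, re.IGNORECASE)

# [^"]* cannot cross a quote, so each match stops at its own closing quote.
exact = re.findall(r'HREF="([^"]*)"[^>]*>', line, re.IGNORECASE)

print(greedy.group(1))
print(exact)
```

With two links on one line, the greedy pattern captures a single span reaching into the second link, while the character-class pattern yields each HREF value separately.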