Replacing content in HTML using regex

I am editing a couple of hundred HTML files and currently have to replace all the content manually, so I was wondering whether it could be done using regex. I don't think it is possible, but it might be, so please help me out.
Okay, so for example, I have many <p> tags in a file, each with a different class, e.g.:
<p class="class1">stuff here</p>
<p class="class2">more stuff here</p>
I want to replace the "stuff here" and "more stuff here" with something else, for example:
<p class="class1">[content]</p>
<p class="class2">[content]</p>
I want to know if that is possible.
I'm using Notepad++.
P.S. I'm new to regex.

I think Notepad++ is great for stuff like this. Open up Find/Replace, and select "Regular expression" in the dialog's Search Mode section.
In the "Find what" field, try this:
<p class=(.*)>(.*)</p>
and in "Replace with":
<p class=\1>[content]</p>
None of these characters needs a backslash. In fact, in Notepad++'s regex syntax \< and \> mean start of word and end of word, so escaping the angle brackets would stop the pattern from matching the literal tag.
The \1 here takes whatever the first (.*) found between class= and the angle bracket > that ends the tag, and puts it back unchanged, so the class name is preserved without your having to specify it. The second (.*) catches the current content inside the paragraph tag, which is what you want to replace. So where I wrote [content] in the "Replace with" field, that's where you'd put your new content. This does limit you to content that you can paste into the Notepad++ find/replace dialog, but I think it has a pretty generous limit.
If I'm remembering that text field's limit incorrectly, another thing you could do is adjust my "Replace with" text to swap the old text for a couple of newlines:
<p class=\1>\n\n</p>
This deletes the old text and leaves a clear line where it once was, making it easy to paste whatever you want in the normal editor pane.
The first way is probably better if your new content fits in the "Replace with" field, because this regex works once per line. You can click "Replace" a couple of times to check it, and if it's working, clicking "Replace All" will iterate through every <p> element in the file.
Note: this solution assumes that your <p> tags open and close within one line, as you typed them in your question. If they break across lines, you're going to want to enable ". matches newline" in the Replace dialog, and you'll need trickier (more precise) syntax than (.*) to catch your class name and content-to-be-replaced. Let me know if this is the case, and I'll fiddle with it and see if I can help more. The (.*) needs to change to (.*?) or something similar; the search needs to get less greedy, because if . matches newline, then .* matches as many characters as it possibly can, i.e., potentially the rest of the document.
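To see the greedy-versus-lazy difference outside the editor, here is a minimal Python sketch. Python's re module is only an assumption for illustration (Notepad++ uses a different engine), and re.DOTALL stands in for the ". matches newline" checkbox; the sample HTML is made up.

```python
import re

# Two paragraphs, the first one broken across lines.
html = '<p class="class1">stuff\nhere</p>\n<p class="class2">more stuff here</p>'
replacement = r'<p class=\1>[content]</p>'

# Greedy: with DOTALL, the first (.*) runs on to the second paragraph,
# so the whole text is consumed by a single match.
greedy_text, greedy_count = re.subn(
    r'<p class=(.*)>(.*)</p>', replacement, html, flags=re.DOTALL)

# Lazy: (.*?) stops at the first point where the rest of the pattern matches,
# so each paragraph is matched separately.
lazy_text, lazy_count = re.subn(
    r'<p class=(.*?)>(.*?)</p>', replacement, html, flags=re.DOTALL)

print(greedy_count, lazy_count)
print(lazy_text)
```

With the greedy pattern only one replacement happens and the first paragraph's content survives inside the captured group; the lazy pattern replaces both paragraphs.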

Related

Replacing only first HTML tag on a page in RegEx

I would like to search for only the first occurrence of an HTML tag (and its contents) and replace it with another. I want the search and replace to stop after it has found the first occurrence on a page.
For example, at the top of each page in a directory is:
<h3>This is my title</h3>
I want to search for the h3 tag and replace it with an h1 tag, leaving the contents of the tag the same, so that the output would be:
<h1>This is my title</h1>
The "this is my title" portion is different on each page. I will be using a Microsoft Server program on the server (called fnr.exe) that does search and replace and can handle regex.
I only want this to happen to the first instance in each document that I run this find and replace on.
I have tried
find: /h3>/g
replace: /h1>
That did not work. I'm not sure what else to do.
This is what my MS program looks like:
I've also tried another program that seems popular on Windows, called Notepad++. This is a screenshot of that attempt. It replaced all occurrences. (For testing on this one, I tried to find only the first h2 tag and replace it with h1. It replaced all the h2 tags.)
I don't have any MS programs so you're going to have to test this out on your own but I think this should do what you are after.
Search for
^([\s\S]*?)<h3>(.*?)</h3>
replace with
$1<h1>$2</h1>
Demo and explanation of the regex: https://regex101.com/r/tB6rV2/2
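As a rough cross-check of the pattern above, here is a minimal Python sketch. Python is an assumption here (the question uses fnr.exe, which may behave differently), Python writes backreferences as \1 rather than $1, and the sample page is made up. It also shows the count=1 alternative that some regex APIs offer for "first match only".

```python
import re

page = "<html>\n<h3>This is my title</h3>\n<p>Intro</p>\n<h3>Another heading</h3>\n"

# The pattern from the answer: ^ anchors at the very start of the text, so only
# the leading run up to the first <h3> can match, and later <h3> tags are left alone.
anchored = re.sub(r"^([\s\S]*?)<h3>(.*?)</h3>", r"\1<h1>\2</h1>", page)

# Alternative: a plain pattern plus a replacement limit of one.
first_only = re.sub(r"<h3>(.*?)</h3>", r"<h1>\1</h1>", page, count=1)

print(anchored)
```

Both calls change only the first heading and leave the second <h3> untouched.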
Can you not just search for /h3>/g and replace with h1>?

Is there a way to replace a whole paragraph in Notepad++

I am trying to use the Replace dialog in Notepad++ to replace the paragraph below in HTML (I have 30 HTML files and need to replace it in all of them):
<script type="text/javascript">
<!--
var slideInterval=20000;
var slideTransition=3500;
var slideArray=["/background1.jpg","background2.jpg"];
jQuery.fx.interval=33;
// -->
</script>
But Notepad++ doesn't let me replace unless it's a single line instead of a paragraph, and if I put everything on one line to replace, I will have other problems to worry about in my HTML.
I hope you have a workaround for that.
I found a good way to use a multi-line "find" or "replace". I just copy-pasted the paragraph into the Ctrl+H "find" field, then took another paragraph and pasted it into the "replace" field. Notepad++ shows a tabbed space where there is a line break. And voilà, you can "Replace in all open documents" with just a single click.
N.B.: the "copy" operation should be done within Notepad++; otherwise only the first line gets pasted into either field.
Update:
To be clearer about my answer: I found out that Notepad++ will only let me paste with line breaks once. That is, if I copy a paragraph, I can paste it WITH its line breaks into the "find" field, for example, but if I paste it a second time into the "replace" field, only the first line is pasted. Hence, no more than one "paste" operation into the Ctrl+H box keeps the line breaks.
So, to get this done: first, I select the replacement text and press Ctrl+C. Then I go to the paragraph to be found, select it, and hit Ctrl+H; Notepad++ automatically puts the selected text into the "find" field. Finally, I paste the text that's already in the clipboard into the "replace" field. And the line breaks are there!
In brief: Select text --> Ctrl+C --> Select text --> Ctrl+H --> Ctrl+V in "replace" field
I think I found a guide that describes what you're looking for. The author has examples and the results, and some multi-line replacements are included. You should be able to extrapolate what he does over multiple files by clicking "Replace All in All Opened Documents".
http://markantoniou.blogspot.com/2008/06/notepad-how-to-use-regular-expressions.html
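If the dialog proves awkward for 30 files, a literal (non-regex) multi-line replacement can also be scripted. The following is a minimal Python sketch, not something from the answers above: the function name, the folder layout (all .html files directly in one folder), the UTF-8 encoding, and the new values in NEW_BLOCK are all assumptions for illustration.

```python
from pathlib import Path

# Part of the block from the question, and a hypothetical replacement for it.
OLD_BLOCK = """var slideInterval=20000;
var slideTransition=3500;"""
NEW_BLOCK = """var slideInterval=10000;
var slideTransition=1000;"""


def replace_block_in_folder(folder, old_block, new_block):
    """Replace a literal multi-line block in every .html file in folder.

    Returns the number of files that were changed.
    """
    changed = 0
    for path in sorted(Path(folder).glob("*.html")):
        # Reading in text mode normalizes Windows line endings to "\n",
        # so the block matches regardless of how the file was saved.
        text = path.read_text(encoding="utf-8")
        if old_block in text:
            path.write_text(text.replace(old_block, new_block), encoding="utf-8")
            changed += 1
    return changed
```

A call such as replace_block_in_folder("site", OLD_BLOCK, NEW_BLOCK) would then rewrite the matching files in a folder named site; working on a copy of the files first is the safer choice.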

Truly selecting in HTML tags in Vim

In an HTML doc say I have this:
<p>
fdhjfkdj hfkjdfhkjdfhkjdh dfhdkf kjdh kjdhkjdhk
fhkdj hdjfhjkdh kjdh kjdf jkdhf d
jfdfhkdjfhkjdf
fjdj fhkd fdhfkjd hfkjdfhkjdf kdhfd
fdhjkfjk dhjdfhkjdf kjdfhdk fhdk
</p>
If I use the normal vit command in Vim, it selects the text inside, which is fine if I yank it, but if I try to do anything such as tab over or run gq, it affects the entire <p>. For example, doing vit then gq ends up looking something like
<p> fdhjfkdj hfkjdfhkjdfhkjdh dfhdkf kjdh kjdhkjdhk lkd sldj lks jlkdf
jlsdkf jlsdf jdl dlsjl fhkdj hdjfhjkdh kjdh kjdf jkdhf d jfdfhkdjfhkjdf fjdj
fhkd fdhfkjd hfkjdfhkjdf kdhfd fdhjkfjk dhjdfhkjdf kjdfhdk fhdk </p>
Indenting won't indent the text, but the whole tag. How do I truly select only the text inside so I can run commands on it like the ones above?
That's because the inner HTML of the paragraph starts immediately after the starting <p> tag, so it includes the newline character immediately after it (which you'll also see after vit). As you've recognized, reformatting and indenting are line-based, so that single character counts.
To make the text object work like you want, you need to move to the start of the selection (o), then reduce it to the next line (easiest with j; for indenting and formatting, the exact start column isn't important, anyway). So the sequence for reformatting would be:
vitojgq
If you want something quicker, you need to write your own text object. Have a look at my CountJump plugin, or the textobj-user plugin; they can help with defining one.

RegEx to substitute tag names, leaving the content and attributes intact

I would like to replace the opening and closing tags, leaving the content of the tags and their attributes intact.
Here is what I have:
<div class="QText">Text to be kept</div>
to be replaced with
<span class="QText">Text to be kept</span>
I tried this expression, which finds everything I want, but there seems to be no way to replace what it finds.
<div class="QText">(.*?)</div>
Thanks in advance.
I think @AmitJoki's answer will work well enough in certain circumstances, but if you only want to replace div elements when they have an attribute or a specific set of attributes, then you would want to use a regex replacement with backreferences. How you specify and refer to a backreference, unfortunately, depends on your chosen editor. Visual Studio has the most unusual and annoying "flavor" of regex I know of, while Dreamweaver has a fairly typical implementation (both, as well as I imagine whatever editor you're using, do regex replacement; you just have to know the menu item or keystroke to bring up the dialog).
If memory serves, Dreamweaver has replacement options when you hit Ctrl+F, while in Visual Studio you have to hit Ctrl+H, so try those.
Once you get a "Find" and "Replace" box, you would put something like what you have in your last example above: <div class="QText">(.*?)</div> or perhaps <div class="(QText|RText|SText)">(.*?)</div> into your "Find" box, then put something like <span class="QText">\1</span> or <span class="\1">\2</span> in the "Replacement" box. A few utilities might use $1 to refer to a backreference rather than \1, but you'll have to look up the help or experiment to be sure.
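For anyone running this from code rather than an editor, the two-group version can be sketched in Python as follows. Python is an assumption here (the question does not name a language), and it happens to use the \1 style of backreference; the second div in the sample is made up to show that other classes are left alone.

```python
import re

html = '<div class="QText">Text to be kept</div><div class="Other">untouched</div>'

# Group 1 captures the class name, group 2 the content between the tags.
pattern = r'<div class="(QText|RText|SText)">(.*?)</div>'
result = re.sub(pattern, r'<span class="\1">\2</span>', html)

print(result)
```

Only the div whose class is in the alternation becomes a span; the other div does not match and stays as it was.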
If you are using a language to run this expression, you need to tell us which language.
If you are using a specific editor to run this expression, you need to tell us which editor.
...and never forget the prevailing wisdom on regex and HTML
Just replace div.
var s="<div class='QText'>Text to be kept</div>";
alert(s.replace(/div/g,"span"));
Demo: http://jsfiddle.net/9sgvP/
Mark it as answer if it helps ;)
Posted as requested
If it's going to be literal like that, capture what's to be kept, then replace the rest:
Find: <div( class="QText">.*?</)div>
Replace: <span$1span>
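The same find/replace pair can be tried out in Python as a minimal sketch. Python is an assumption, and it writes the backreference as \1 where the replace string above uses $1; the sample text deliberately contains the word "div" to show that only the tag names change.

```python
import re

html = '<div class="QText">Keep this div text</div>'

# Everything between the two tag names is captured and put back unchanged.
result = re.sub(r'<div( class="QText">.*?</)div>', r'<span\1span>', html)

print(result)
```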

Regex: Extracting readable (non-code) text and URLs from HTML documents

I am creating an application that will take a URL as input, retrieve the page's HTML content off the web and extract everything that isn't contained in a tag. In other words, the textual content of the page, as seen by a visitor to that page. That includes 'masking' out everything enclosed in <script></script>, <style></style> and <!-- -->, since these portions contain text that is not enclosed within a tag (but is best left alone).
I have constructed this regex:
(?:<(?P<tag>script|style)[\s\S]*?</(?P=tag)>)|(?:<!--[\s\S]*?-->)|(?:<[\s\S]*?>)
It correctly selects all the content that I want to ignore, and only leaves the page's text contents. However, that means that what I want to extract won't show up in the match collection (I am using VB.Net in Visual Studio 2010).
Is there a way to "invert" the matching of a whole document like this, so that I'd get matches on all the text strings that are left out by the matching in the above regex?
So far, what I did was to add another alternative at the end, that selects "any sequence that doesn't contain < or >", which then means the leftover text. I named that last bit in a capture group, and when I iterate over the matches, I check for the presence of text in the "text" group. This works, but I was wondering if it was possible to do it all through regex and just end up with matches on the plain text.
This is supposed to work generically, without knowing any specific tags in the HTML. It's supposed to extract all text. Additionally, I need to preserve the original HTML so the page retains all its links and scripts. I only need to be able to extract the text so that I can perform searches and replacements within it, without fear of "renaming" any tags, attributes, script variables, etc. So I can't just do a "replace with nothing" on all the matches I get: even though I am then left with what I need, it's a hassle to reinsert it into the correct places of the fully functional document.
I want to know if this is at all possible using regex (I know about HTML Agility Pack and XPath, but don't feel like using them).
Any suggestions?
Update:
Here is the (regex-based) solution I ended up with: http://www.martinwardener.com/regex/, implemented in a demo web application that will show both the active regex strings along with a test engine which lets you run the parsing on any online html page, giving you parse times and extracted results (for link, url and text portions individually - as well as views where all the regex matches are highlighted in place in the complete HTML document).
what I did was to add another alternative at the end, that selects "any sequence that doesn't contain < or >", which then means the leftover text. I named that last bit in a capture group, and when I iterate over the matches, I check for the presence of text in the "text" group.
That's what one would normally do. Or even simpler, replace every match of the markup pattern with an empty string, and what you've got left is the stuff you're looking for.
It kind of works, but there seems to be a string here and there that gets picked up that shouldn't be.
Well yeah, that's because your expression—and regex in general—is inadequate to parse even valid HTML, let alone the horrors that are out there on the real web. First tip to look at, if you really want to chase this futile approach: attribute values (as well as text content in general) may contain an unescaped > character.
I would like to once again suggest the benefits of HTML Agility Pack.
ETA: since you seem to want it, here are some examples of markup that look like they'll trip up your expression.
<a href=link></a> - unquoted
<a href= link></a> - unquoted, space at front matched but then required at back
- very common URL char missing in group
- more URL chars missing in group
<a href=lïnk></a> - IRI
<a href
="link"> - newline (or tab)
<div style="background-image: url(link);"> - unquoted
<div style="background-image: url( 'link' );"> - spaced
<div style="background-image: url('link');"> - html escape
<div style="background-image: ur\l('link');"> - css escape
<div style="background-image: url('link\')link');"> - css escape
<div style="background-image: url(\
'link')"> - CSS folding
<div style="background-image: url
('link')"> - newline (or tab)
and that's just completely valid markup that won't match the right link, not any of the possible invalid markup, markup that shouldn't but does match a link, or any of the many problems with your other technique of splitting markup from text. This is the tip of the iceberg.
Regex is not reliable for retrieving the textual contents of HTML documents. Regex cannot handle nested tags. Even supposing a document doesn't contain any nested tags, regex still requires that every tag is properly closed.
If you are using PHP, for simplicity, I strongly recommend that you use the DOM (Document Object Model) to parse/extract HTML documents. A DOM library exists for most programming languages.
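As a rough illustration of the parser-based route, here is a minimal Python sketch. It uses the standard-library html.parser module, which is an event-driven parser rather than a full DOM, so treat it as a stand-in for the idea rather than the PHP DOM approach itself; the class and function names are made up for the example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-blank text that is not inside a skipped element.
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def extract_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.chunks


sample = ('<p title="a > b">Hello <b>world</b></p>'
          '<script>var x = "<p>no</p>";</script><!-- hidden -->')
print(extract_text(sample))
```

The parser copes with the ">" inside the quoted attribute value and ignores the markup-like string inside the script, two cases that are hard to get right with a single regex.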
If you're looking to extract parts of a string not matched by a regex, you could simply replace the parts that are matched with an empty string for the same effect.
Note that the only reason this might work is because the tags you're interested in removing, <script> and <style> tags, cannot be nested.
However, it's not uncommon for one <script> tag to contain code to programmatically append another <script> tag, in which case your regex will fail. It will also fail in the case where any tag isn't properly closed.
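The "replace the matches with an empty string" idea can be sketched in Python like this. Python is an assumption (the question uses VB.Net); the pattern is the one from the question, which already uses Python-style named groups, and the function name is made up.

```python
import re

MARKUP = re.compile(
    r"(?:<(?P<tag>script|style)[\s\S]*?</(?P=tag)>)"  # whole script/style elements
    r"|(?:<!--[\s\S]*?-->)"                           # comments
    r"|(?:<[\s\S]*?>)",                               # any other tag
    re.IGNORECASE,
)


def strip_markup(html):
    """Remove everything the markup pattern matches; what is left is the text."""
    return MARKUP.sub("", html)


clean = strip_markup('<p>Hello <b>world</b></p><script>var a = "<b>";</script><!-- note -->')

# A ">" inside an attribute value ends the tag match too early,
# so part of the attribute leaks into the "text".
leaky = strip_markup('<a title="a > b">link</a>')

print(repr(clean), repr(leaky))
```

The first call gives clean text, while the second shows the kind of stray string that regex-based stripping can leave behind.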
You cannot parse HTML with regular expressions.
Parsing HTML with regular expressions leads to sadness.
I know you're just doing it for fun, but there are so many packages out there that actually do the parsing the right way, AND do it reliably, AND have been tested.
Don't go reinventing the wheel, and doing it a way that is all but guaranteed to frustrate you down the road.
OK, so here's how I'm doing it:
Using my original regex (with the added search pattern for the plain text, which happens to be any text that's left over after the tag searches are done):
(?:(?:<(?P<tag>script|style)[\s\S]*?</(?P=tag)>)|(?:<!--[\s\S]*?-->)|(?:<[\s\S]*?>))|(?P<text>[^<>]*)
Then in VB.Net:
Dim regexText As New Regex("(?:(?:<(?<tag>script|style)[\s\S]*?</\k<tag>>)|(?:<!--[\s\S]*?-->)|(?:<[\s\S]*?>))|(?<text>[^<>]*)", RegexOptions.IgnoreCase)
Dim source As String = File.ReadAllText("html.txt")
Dim evaluator As New MatchEvaluator(AddressOf MatchEvalFunction)
Dim newHtml As String = regexText.Replace(source, evaluator)
The actual replacing of text happens here:
Private Function MatchEvalFunction(ByVal match As Match) As String
    Dim plainText As String = match.Groups("text").Value
    If plainText IsNot Nothing AndAlso plainText <> "" Then
        MatchEvalFunction = match.Value.Replace(plainText, plainText.Replace("Original word", "Replacement word"))
    Else
        MatchEvalFunction = match.Value
    End If
End Function
Voilà. newHtml now contains an exact copy of the original, except that every occurrence of "Original word" in the page (as it's presented in a browser) is switched with "Replacement word", and all HTML and script code is preserved untouched. Of course, one could put in a more elaborate replacement routine, but this shows the basic principle. This is 12 lines of code, including the function declaration and the loading of the HTML. I'd be very interested in seeing a parallel solution done with the DOM etc. for comparison. Yes, I know this approach can be thrown off balance by certain nested-tag quirks, such as script rewriting, but the damage from that will still be very limited, if any (see some of the comments above), and in general this will do the job pretty darn well.
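For readers not on .NET, the same evaluator idea can be sketched in Python, where re.sub accepts a function in place of a replacement string. This is a port made under assumptions, not the code running on the site mentioned above; the function names and the sample HTML are made up.

```python
import re

PATTERN = re.compile(
    r"(?:(?:<(?P<tag>script|style)[\s\S]*?</(?P=tag)>)|(?:<!--[\s\S]*?-->)|(?:<[\s\S]*?>))"
    r"|(?P<text>[^<>]*)",
    re.IGNORECASE,
)


def replace_in_text(html, old, new):
    """Replace old with new in text portions only, leaving markup as it is."""

    def evaluate(match):
        text = match.group("text")
        if text:  # the match was a run of plain text
            return text.replace(old, new)
        return match.group(0)  # markup (or an empty match): keep unchanged

    return PATTERN.sub(evaluate, html)


source = ('<p class="word">word <b>word</b></p>'
          '<script>var word = 1;</script><!-- word -->')
print(replace_in_text(source, "word", "term"))
```

The class attribute, the script body and the comment keep their "word", while the two visible occurrences become "term". The same limitations discussed in the other answers apply to this version.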
For your information: instead of regex, with jQuery it's possible to extract the text alone from HTML markup. For that you can use the following pattern:
$("<div/>").html($("#elementId").html()).text()
You can refer to this JSFiddle.