Replacing only first HTML tag on a page in RegEx - html

I would like to search for only the first occurrence of an HTML tag (and its contents) and replace it with another. I want the search and replace to stop after it has found the first occurrence on a page.
For example, at the top of each page in a directory is:
<h3>This is my title</h3>
I want to replace the h3 tag with an h1 tag while leaving the contents of the tag the same, so that the output would be:
<h1>This is my title</h1>
The "this is my title" portion is different on each page. I will be using a Microsoft Server program on the server (called fnr.exe) that does search and replace and can handle regex.
I only want this to occur on first instance of each document that I am running this find and replace with.
I have tried
find: /h3>/g
replace: /h1>
That did not work. I'm not sure what else to do.
This is what my MS program looks like:
I've also tried another program that seems popular on Windows, called Notepad++. This is a screenshot of that attempt. It replaced all occurrences. (For testing on this one, I tried to find only the first h2 tag and replace it with h1. It replaced all the h2 tags.)

I don't have any MS programs so you're going to have to test this out on your own but I think this should do what you are after.
Search for
^([\s\S]*?)<h3>(.*?)</h3>
replace with
$1<h1>$2</h1>
Demo and explanation of regex, https://regex101.com/r/tB6rV2/2
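If a scripting language is available instead of fnr.exe, the same "first occurrence only" idea can be sketched in Python. This is only an illustration, not what fnr.exe does internally; the function name promote_first_h3 and the sample page are made up for the example. The count=1 argument is what stops the replacement after the first match.

```python
import re


def promote_first_h3(html: str) -> str:
    """Replace only the first <h3>...</h3> with <h1>...</h1> (hypothetical helper)."""
    # count=1 stops after the first match; the lazy (.*?) keeps the match
    # from running on to a later </h3>. DOTALL lets the title span lines.
    return re.sub(r"<h3>(.*?)</h3>", r"<h1>\1</h1>", html, count=1, flags=re.DOTALL)


page = "<h3>This is my title</h3>\n<p>Body</p>\n<h3>Subsection</h3>\n"
print(promote_first_h3(page))
```

Only the first heading changes; any later h3 tags on the page are left alone.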

Can you not just search for /h3>/g and replace it with h1>?

Related

Replacing stuff of HTML using regex

I am editing a couple of hundred HTML files and I have to replace all the stuff manually, so I was wondering whether it could be done using regex. I don't think it is possible, but it might be, so please help me out.
Okay, so for example, I have many <p> tags in a file, each with a different class. eg:
<p class="class1">stuff here</p>
<p class="class2">more stuff here</p>
I wanted to replace the "stuff here" and "more stuff here" with something, for example
<p class="class1">[content]</p>
<p class="class2">[content]</p>
I wanted to know if that is possible.
I'm using notepad++.
P.S. I'm new to regex.
I think notepad++ is great for stuff like this. Open up Find/Replace, and check the regular expressions box in the dialog's Search Mode section.
In the "Find what" field, try this:
\<p\ class\=(.*)\>(.*)\<\/p\>
and in "Replace with":
\<p\ class\=\1\>[content]\<\/p\>
The \1 here takes whatever the first (.*) matched between class= and the closing angle bracket > of the tag and puts it back unchanged, so the class name is preserved without your having to specify it. The second (.*) captures the current content inside the paragraph tag, which is what you want to replace. So where I wrote [content] in the "Replace with" field, that's where you'd put your new content. This does limit you to content that you can paste into the Notepad++ find/replace dialog, but I think it has a pretty huge limit.
If I'm remembering that text field's limitations incorrectly, another thing you could do is just adjust my "Replace with" text to just replace the old text with some newlines:
\<p\ class\=\1\>\n\n\<\/p\>
This will delete the old text and leave a clear line where it once was, making it easy to paste whatever you want into the normal editor pane.
The first way is probably better, if your new content will fit the Replace With field, because this regex works once per line. And you can click "Replace" a couple times, and if it's working, clicking "Replace all" will iterate through every <p> element in the file.
Note: this solution assumes that your <p> tags open and close within one line, as you typed them in your question. If they break across lines, you're going to want to enable ". matches newline" in the Replace dialog, and you'll need more precise syntax than (.*) to catch your class name and content-to-be-replaced. Let me know if this is the case, and I'll fiddle with it and see if I can help more. The (.*) needs to change to (.*?) or something similar; the search needs to get less greedy, because if . matches newline, then .* matches as many characters as it can, i.e., potentially everything from the first <p> to the last </p> in the document.
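For the multi-line case, here is a rough sketch of the less greedy version in Python rather than Notepad++. It assumes the class attribute is double-quoted, as in the question, and the helper name replace_paragraph_content is invented for this example.

```python
import re

# [^"]* cannot run past the closing quote of the class value, and the lazy
# (.*?) stops at the first </p>, so each paragraph is matched separately
# even when DOTALL lets "." match newlines.
P_TAG = re.compile(r'<p class=("[^"]*")>(.*?)</p>', re.DOTALL)


def replace_paragraph_content(html: str, new_content: str) -> str:
    """Keep each class attribute and swap the paragraph body (hypothetical helper)."""
    # A function replacement avoids backslash processing in new_content.
    return P_TAG.sub(lambda m: f"<p class={m.group(1)}>{new_content}</p>", html)


sample = '<p class="class1">stuff\nhere</p>\n<p class="class2">more stuff here</p>'
print(replace_paragraph_content(sample, "[content]"))
```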

How to remove a div from the entire project?

I've got a project consisting of over 200 html files. There's a div repeated throughout most of these, looking like this:
<div class='foobar' id="abcdef123"></div>
I have found all uses of the class using the Find in Files function in Sublime Text 2 - now I want to remove them, i.e. completely delete any line containing that div (and its closing tag).
Is there an easy way to do it in Sublime Text 2?
EDIT: I have forgotten to mention that sometimes the div has additional classes and the ID is always different. How would I write a regexp to deal with that?
In Notepad++, open all 200 files and search for the following regular expression:
<div class='foobar[^']*' id="[^"]*"></div>
and replace it with nothing. I don't know Sublime Text 2.
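As a rough alternative to doing this in an editor, the same pattern can be applied from a script. The sketch below assumes the div is empty and sits on its own line, accepts either quote style, and removes the whole line; the names DIV_LINE and remove_foobar_divs are made up for the example.

```python
import re

# Matches a line holding only an empty div whose class starts with "foobar"
# (extra classes allowed) and whose id is anything; the trailing newline is
# consumed so the whole line disappears.
DIV_LINE = re.compile(
    r"""^[ \t]*<div\s+class=(['"])foobar[^'"]*\1\s+id=(['"])[^'"]*\2\s*>\s*</div>[ \t]*\r?\n?""",
    re.MULTILINE,
)


def remove_foobar_divs(html: str) -> str:
    """Delete every line that contains only the foobar div (hypothetical helper)."""
    return DIV_LINE.sub("", html)


text = "<p>a</p>\n  <div class='foobar extra' id=\"abc123\"></div>\n<p>b</p>\n"
print(remove_foobar_divs(text))
```

To cover the whole project you would read each HTML file, pass its text through the function, and write the result back, ideally after taking a backup.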

Find and replace a lot of <a> tags, but the url are different? Is there some way to do it?

I have a page that I need to fix..
There are thousands of <a> tags like <a href="kl1j23l123l12j3"> that I need to get rid of, but the problem is that each <a> tag has a different URL in its href attribute. So I am wondering if there is some advanced way to get rid of the whole anchor/link but keep the link text, as that would save me a whole lot of time.
Example
Input : <a href="kl1j23l123l12j3">StackOverflow.com</a>
Output: StackOverflow.com
Thanks.
Maybe this is a solution using JavaScript and jQuery. It also can be tweaked to only get the values of links that do not start with http. I wasn't quite sure whether this is relevant according to the links in the question.
// get all links within the document
var links = $('a');
// simply get all link texts
var x = links.text();
// or just get the texts of links like 'kl1j23l123l12j3', whose href does not start with 'http'
x = links.not('[href^=http]').text();
Here's a demo: http://jsfiddle.net/rg3ET/
Instead of collecting them all together into one variable ("x") you could of course loop through them and output them individually.
The following would work under the assumption that each anchor tag is on its own line.
Example:
asdf
<div>
</div>
asdf2
Notepad++ has a regex find and replace feature, which may work for your need.
Replace all </a> tags with nothing
Use a regex to find all <a href="anything"> and replace with nothing.
The following image shows what I did for step 2. You can see that I used a regex of <a .*>. For this to work properly, there should only be one > character per line. Otherwise, the regex will make the longest possible match, possibly including a bunch of other tags. This is why I said the procedure would only work for anchor tags that are on their own lines.
In case you can't see the image (again, this only works when each anchor tag is on its own line):
From Notepad++ menu: Search > Replace
Select Regular expression
In Find what box, put <a .*>
Click Replace All
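If the anchors are not one per line, a pattern that cannot cross a > character avoids the longest-match problem. Here is a small Python sketch of that variation; it is not the Notepad++ procedure itself, and the function name strip_anchors is invented for the example.

```python
import re


def strip_anchors(html: str) -> str:
    """Remove <a ...> and </a> tags but keep the link text (hypothetical helper)."""
    # [^>]* stops at the first ">", so several anchors on one line are
    # handled separately; \b keeps tags such as <abbr> from matching.
    return re.sub(r"</?a\b[^>]*>", "", html, flags=re.IGNORECASE)


line = '<p><a href="kl1j23l123l12j3">StackOverflow.com</a> and <a href="x">more</a></p>'
print(strip_anchors(line))
```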

regex: selecting everything but img tag

I'm trying to select some text using regular expressions leaving all img tags intact.
I've found the following code that selects all img tags:
/<img[^>]+>/g
but actually having a text like:
This is an untagged text.
<p>this is my paragraph text</p>
<img src="http://example.com/image.png" alt=""/>
this is a link
using the code above will select the img tag only
/<img[^>]+>/g #--> using this code will result in:
<img src="http://example.com/image.png" alt=""/>
but I would like to use some regex that select everything but the image like:
/magical regex/g # --> results in:
This is an untagged text.
<p>this is my paragraph text</p>
this is a link
I've also found this code:
/<(?!img)[^>]+>/g
which selects all tags except the img one. but in some cases I will have untagged text or text between tags so this won't work for my case. :(
is there any way to do it?
Sorry but I'm really new to regular expressions so I'm really struggling for few days trying to make it work but I can't.
Thanks in advance
UPDATE:
Ok so for the ones thinking I would like to parse it, sorry I don't want it, I just want to select text.
Another thing: I'm not using any specific language. I'm using Yahoo Pipes, which only provides regex and some string tools to accomplish the job, but it doesn't involve any programming code.
for better understanding here is the way regex module works in yahoo pipes:
http://pipes.yahoo.com/pipes/docs?doc=operators#Regex
UPDATE 2
Fortunately I'm able to strip the text near the img tag, but on a step-by-step basis as @Blixt recommended, like:
<(?!img)[^>]+> , replace with "" #-> strips out every tag that is not img
(?s)^[^<]*(.*), replace with $1 #-> removes all the text before the img tag
(?s)^([^>]+>).*, replace with $1 #-> removed all the text after the img tag
The problem with this is that it will only catch the first img tag, and then I would have to catch the others manually by hard-coding them, so I'm still not sure if this is the best solution.
The regexp you have to find the image tags can be used with a replace to get what you want.
Assuming you are using PHP:
$htmlWithoutIMG = preg_replace('/<img[^>]+>/', '', $html);
If you are using Javascript:
var htmlWithoutIMG = html.replace(/<img[^>]+>/g, '');
This takes your text, finds the <img> tags and replaces them with nothing, i.e. it deletes them from the text, leaving what you want. The < and > characters do not need escaping.
Regular expression matches have a single start and length. This means the result you want is impossible in a single match (since you want the result to end at one point, then continue later).
The closest you can get is to use a regular expression that matches everything from start of string up to start of <img> tag, everything between <img> tags and everything from end of <img> tag to end of string. Then you could get all matches from that regular expression (in your example, there would be two matches).
The above answer is assuming you can't modify the result. If you can modify the result, simply replace the <img> tags with the empty string to get your result.
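Both ideas can be sketched in Python for anyone who is not limited to Yahoo Pipes: splitting on the <img> tags gives the separate text runs between them, and substituting the tags with the empty string gives one combined result. The names IMG_TAG, text_segments and without_images are made up for this illustration.

```python
import re

IMG_TAG = re.compile(r"<img\b[^>]*>", re.IGNORECASE)


def text_segments(html: str) -> list[str]:
    """Return the non-empty pieces of text found between <img> tags."""
    return [piece for piece in IMG_TAG.split(html) if piece.strip()]


def without_images(html: str) -> str:
    """Return the text with every <img> tag removed."""
    return IMG_TAG.sub("", html)


doc = 'before<img src="a.png" alt=""/>middle<img src="b.png">after'
print(text_segments(doc))
print(without_images(doc))
```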

Visual Studio 2008 HTML formatting - does it ever work?

It's another Visual Studio 2008 HTML formatting question... I think I have either found a bug in the infamously bad VS HTML formatting, or I'm doing something wrong. Here's what I'm doing:
I remove all client side tags via:
Tools -> Options -> Text Editor -> HTML -> Format -> Tag Specific options
I then add b and span tags:
alt text http://www.xtupload.com/new/thumb-3BB0_49B92330.jpg
I press CTRL+E,CTRL+D and I get these two differing results:
1
alt text http://www.xtupload.com/new/image-CBF1_49B92330.jpg
The P before the span tag isn't formatted properly
2
alt text http://www.xtupload.com/new/image-3AB6_49B92330.jpg
The P tag is formatted correctly.
This for a .ASPX extension file.
It looks like it is a bug, and isn't dependent on the tag being SPAN or B.
The work around I found
Add an extra space before the closing P.
How it fails
<p><b>My title</b></p>
Gets re-formatted as
<p>
<b>My title</b></p>
How to get it to work
<p><b>My title</b> </p>
(NB the space after the B) gets reformatted as:
<p>
<b>My title</b>
</p>
And that extra space is removed by VS anyway. Hallelujah my HTML looks beautiful!
I followed the same method as Chris. I decided to use a RegEx find and replace to do it for the whole document. The regex finds any closing p or h* tags that aren't preceded by white space or the start of a line and inserts a newline before the closing tag. Examine the regex to get a better understanding. Here's what I used:
Find what:
{[^:b^]}{\</(p|(h:z))\>}:b*$
Replace with:
\1\n\2
It only finds p and h* because those were the only two I found had this problem. Other tags can be added.
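For readers who don't have the Visual Studio find dialog to hand, here is an approximate translation of that find and replace into Python. It is my reading of the pattern rather than an exact equivalent: (\S) stands in for "a character that is not whitespace", h\d+ for "h followed by a number", and the names CLOSING_TAG and break_before_closing_tags are invented for the example.

```python
import re

# A non-whitespace character, then a closing </p> or </hN> tag, then optional
# trailing spaces or tabs up to the end of the line.
CLOSING_TAG = re.compile(r"(\S)(</(?:p|h\d+)>)[ \t]*$", re.MULTILINE)


def break_before_closing_tags(html: str) -> str:
    """Insert a newline before a closing p/h* tag that directly follows content."""
    return CLOSING_TAG.sub(r"\1\n\2", html)


snippet = "<p>\n<b>My title</b></p>"
print(break_before_closing_tags(snippet))
```

A closing tag that already sits on its own line is left alone, because the character before it is a newline and so (\S) does not match.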
You can customize the layout per tag, if it bothers you that much. Go to the options dialog and select the formatting option under Text Editor -> HTML
Having said that, I don't like some of the inconsistencies I couldn't fix, so I stopped using it except to reformat code from someone else before I started working on it. Once the initial reformat is done, I maintain the formatting manually.