This question already has answers here:
RegEx match open tags except XHTML self-contained tags
(35 answers)
What to do Regular expression pattern doesn't match anywhere in string?
(8 answers)
Closed 8 years ago.
We have HTML code that looks like:
<h1><a name="_Toc22332223">Creating a record</a><h1>
<h1><a name="sectionB">Creating a record</a><h1>
Is there an expression we can use to find and delete the <a name=> tags and leave only the text, like this: <h1>Creating a record<h1>
We also do not want to remove other hyperlinks, such as <a href>.
I tried <a name="[0-9]*">.+</a> to no avail.
Thanks!
As suggested by others, DOM parsing is the most reliable way.
But if it has to be very simple, you can use the following regex:
<[aA]\s+name\s*=[^>]*>(.*)[^<]<\/a>
Example on http://rubular.com/r/cI2CTwUCy3
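For illustration, here is a minimal Python sketch of the same idea. It is a variant of the pattern above, not the answerer's exact regex: the group is made lazy, (.*?), because the original (.*)[^<] leaves the last character of the link text outside the capture group. The names NAMED_ANCHOR and strip_named_anchors are made up for this example.

```python
import re

# Variant of the answer's pattern (an assumption, not the original regex):
# a lazy group captures the full link text, and IGNORECASE covers <A ...> too.
NAMED_ANCHOR = re.compile(r'<a\s+name\s*=[^>]*>(.*?)</a>', re.IGNORECASE)


def strip_named_anchors(html: str) -> str:
    """Replace each <a name=...>text</a> with just its text."""
    return NAMED_ANCHOR.sub(r'\1', html)


sample = '<h1><a name="_Toc22332223">Creating a record</a><h1>'
print(strip_named_anchors(sample))  # <h1>Creating a record<h1>

# Anchors whose first attribute is not "name" are left alone.
print(strip_named_anchors('<a href="page.html">link</a>'))
```

Like any regex approach, this assumes name is the first attribute and that the tag does not span lines.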
This question already has answers here:
RegEx match open tags except XHTML self-contained tags
(35 answers)
Closed 1 year ago.
I've been searching for a solution for hours, but haven't found any examples that help.
I want to search a plain text file and remove all instances of <a id="pageXXX"></a> where XXX is the page number.
I have tried
(^<a id="page)(.*:?)("></a>)
(^<a id=\\"page)(.*:?)(\\"></a>)
(^<a id="page)([0-9]+)("></a>)
(^<a id=\\"page)([0-9]+)(\\"></a>)
What am I missing?
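This works correctly. The only functional change is dropping the leading ^, which restricted the match to a tag at the very start of the text (or of a line in multiline mode):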
(<a id=\"page)(.*:?)(\"></a>)
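As a rough sketch of the unanchored version in Python, assuming page numbers are plain digits (so [0-9]+ from the question's own attempts is used instead of .*, which keeps two tags on one line from being matched as one). PAGE_ANCHOR and remove_page_anchors are hypothetical names.

```python
import re

# No leading ^, so the tag is found anywhere in the text.
# Assumption: the page number consists of digits only.
PAGE_ANCHOR = re.compile(r'<a id="page[0-9]+"></a>')


def remove_page_anchors(text: str) -> str:
    """Delete every <a id="pageXXX"></a> marker from the text."""
    return PAGE_ANCHOR.sub('', text)


sample = 'End of page.<a id="page12"></a> Next page<a id="page13"></a> continues.'
print(remove_page_anchors(sample))  # End of page. Next page continues.
```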
This question already has answers here:
Find and replace HTML tags
(3 answers)
Closed 4 years ago.
I need to replace all HTML tags of one kind in a string with another, e.g., replace all <i> tags with <em> tags.
What's the best way to effectively change:
"<p><i>Random stuff here...</i></p>"
to the following?
"<p><em>Random stuff here...</em></p>"
There are millions of such strings, so a solution taking complexity into account would be nice.
You can use gsub with a block:
string = "<p><i>Random stuff here...</i></p>"
string.gsub(/(<\/?)i(>)/) { "#{$1}em#{$2}" }
#=> "<p><em>Random stuff here...</em></p>"
Explanation:
The pattern matches an opening or closing i tag, capturing "<" or "</" in $1 and ">" in $2, and the block rebuilds the tag with em in between.
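For comparison, the same substitution could be sketched in Python. Since the question mentions millions of strings, this version compiles the pattern once and processes the input lazily; that is an illustrative choice, not a measured optimisation. I_TAG and italic_to_em are hypothetical names.

```python
import re
from typing import Iterable, Iterator

# Compiled once and reused for every string.
# Matches exactly <i> or </i>, so tags such as <img> are not touched.
I_TAG = re.compile(r'(</?)i(>)')


def italic_to_em(strings: Iterable[str]) -> Iterator[str]:
    """Yield each string with <i>/</i> rewritten to <em>/</em>."""
    for s in strings:
        yield I_TAG.sub(r'\g<1>em\g<2>', s)


for line in italic_to_em(['<p><i>Random stuff here...</i></p>']):
    print(line)  # <p><em>Random stuff here...</em></p>
```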
This question already has answers here:
Is it guaranteed that non-numeric attribute values on every web-page HTML are always quoted?
(1 answer)
what are data-* HTML attributes?
(1 answer)
Closed 4 years ago.
In the code I am working on I found this:
<div class="icon icon2 screen-icon" data-screen-idx=1>
What puzzles me is the last "attribute" (or whatever it is).
Is this data-screen-idx=1 legal in an HTML tag?
Please note that 1 is not quoted.
If yes, where can I find info about this?
If not, why would someone write such a thing?
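Yes, this is valid HTML. These are called "data attributes", and the name can be whatever you want, as long as it begins with data-. The missing quotes are also fine: HTML allows an unquoted attribute value as long as it is non-empty and contains no whitespace, quotes, =, <, > or backticks.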
See this article for more information. MDN - Using data attributes
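One quick way to see how a parser reads that tag is Python's standard html.parser module. This is only a sketch; AttrCollector is a made-up helper class, not part of any library.

```python
from html.parser import HTMLParser


class AttrCollector(HTMLParser):
    """Hypothetical helper: remembers the attributes of the last <div> seen."""

    def __init__(self):
        super().__init__()
        self.attrs = {}

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self.attrs = dict(attrs)


collector = AttrCollector()
collector.feed('<div class="icon icon2 screen-icon" data-screen-idx=1>')
collector.close()

# The unquoted value comes back as the string "1", same as if it were quoted.
print(collector.attrs["data-screen-idx"])
```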
This question already has answers here:
Removing html tags from a string in R
(7 answers)
Closed 5 years ago.
I need to clean my dataframe by stripping the HTML tags from columns 2-4.
Does anyone know a simple way to do that?
df$col <- gsub("<[^>]+>", "", df$col)
Or
df$col <- gsub("<.*?>","",df$col)
This uses regex to strip all HTML tags, which are usually enclosed in <>.
Note: using regex to strip HTML is not always advisable. However, in your case it seems the data set mostly holds numbers, so regex would be the simplest option.
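The same pattern could be applied to several columns at once. Below is a small sketch in plain Python, with rows as lists standing in for the data frame; the sample values and the name strip_tags_in_columns are invented for the example. Note that R's columns 2-4 are indexes 1-3 here.

```python
import re

# Same pattern as the first gsub call above.
TAG = re.compile(r'<[^>]+>')


def strip_tags_in_columns(rows, columns):
    """Return new rows with tags removed from string cells in the given column indexes."""
    return [
        [TAG.sub('', cell) if i in columns and isinstance(cell, str) else cell
         for i, cell in enumerate(row)]
        for row in rows
    ]


rows = [[1, '<b>12</b>', '<td>3.5</td>', '<i>7</i>']]
cleaned = strip_tags_in_columns(rows, columns={1, 2, 3})
print(cleaned)  # [[1, '12', '3.5', '7']]
```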
This question already has answers here:
how to extract links and titles from a .html page?
(6 answers)
Closed 6 years ago.
I'm having a little problem with a VB.NET scraper. It's supposed to get all links from an HTML string that I have already downloaded. The links are there (I have checked), so it must be something with my regex string.
My regex string: <a.*?href=""(.*?)"".*?>(.*?)</a>
This works for some sites, but for others it does not.
Here are examples from the HTML source that match and don't match.
Working:
<a href="http://domain.com" rel="nofollow" onmousedown="return clk('25936','3')" target="_blank">/a>
Not working:
<a href='http://domain.com' target="_blank" ><font size=2><b>text</b></a>
Could it be because of the " and ' ?
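Check with the following RegExp:
<a.*?href=["'](.*?)["'].*?>(.*?)<\/a>
Your pattern only accepts double quotes around the href value (the "" in a VB.NET string literal is a single escaped "). Since an href value can be delimited by either single or double quotes, you have to check for both.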
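A slightly stricter variation, sketched here in Python rather than VB.NET, is to capture the opening quote and require the same character at the end with a backreference, so a value like "it's" is not cut off at the apostrophe. This is an illustration only; LINK and extract_links are hypothetical names, and a real HTML parser is still more robust.

```python
import re

# (["']) captures the opening quote; \1 requires the same quote to close the value.
LINK = re.compile(
    r'''<a\b[^>]*?\bhref\s*=\s*(["'])(.*?)\1[^>]*>(.*?)</a>''',
    re.IGNORECASE | re.DOTALL,
)


def extract_links(html: str):
    """Return (url, inner_html) pairs for every <a href=...>...</a> found."""
    return [(m.group(2), m.group(3)) for m in LINK.finditer(html)]


single = '''<a href='http://domain.com' target="_blank" ><font size=2><b>text</b></a>'''
double = '<a href="http://domain.com" rel="nofollow">site</a>'
print(extract_links(single))
print(extract_links(double))
```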