RegEx match content inside div with specific class - html

How can I match the contents of all divs that have a specific class. For example:
<div class="column-box-description paddingT05">content</div>

Generally, you shouldn't do this with regex unless you can make strong assumptions about the text you're matching; you'll do better with something that actually parses HTML.
But, if you can make these stronger assumptions, you can use:
<div class="[^"]*?paddingT05[^"]*?">(.*?)<\/div>
The key part is the reluctant quantifier *? which matches the minimal text possible (i.e. it doesn't greedily eat up the </div>.

You can do something like this:
<div.*class\s*=\s*["'].*the_class_you_require_here.*["']\s*>(.*)<\/div>
Replace "the_class_you_require_here" with a class name of your choosing. The div content is in the first group resulted from this expesion. You can read on groups here: http://www.regular-expressions.info/brackets.html

Related

Find and exclude html-tags as whole words in negative lookbehind with regex

I basically try to find all paragraphs (in javascript/jquery) in a text, that are not yet wrapped in a set of defined html-tags:
p|h1|h2|h3|h4|h5|h6|blockquote|img|table|iframe
My current regex (https://regex101.com/r/O4i2hP/1) already matches paragraphs and excludes the defined tags
(.+?(?<![</(p|h1|h2|h3|h4|h5|h6|blockquote|img|table|iframe)>]$))(\n|$)+/gm
but I just don't get, how to just match whole tags only.
The problem is:
(p|h1|h2|h3|h4|h5|h6|blockquote|img|table|iframe)> matches a single
character in the list (p|h123456blockquteimgafr)> (case sensitive)
Thus, as you can see from the example, code that is wrapped in tags such as <strong>TEXT</strong> is also excluded.
I tried different things such as word boundaries \bword\b, but didn't get it working. I hope you can help. Thx
This will do it.
^(?!<(p|h1|h2|h3|h4|h5|h6|blockquote|img|table|iframe)+?>.</\1>).$
I now found a working approach. The tags should be wrapped in groups rather than in character classes. The following works for me:
(.+?(?<!(<\/)(p|h1|h2|h3|h4|h5|h6|blockquote|img|table|iframe)(>)$))(\n|$)+/gm
see also: https://regex101.com/r/DC5msM/1

RegEx to grab an entire div with a specific ID?

I have some markup which contains the firebug hidden div.
(long story short, the YUI RTE posts content back that includes the hidden firebug div is that is active)
So, in my posted content I have the extra div which I will remove server side in PHP:
<div firebugversion="1.5.4" style="display: none;" id="_firebugConsole"></div>
I cant seem to get a handle on the regex I would need to write to match this string, bearing in mind that it won't always be that exact string (version may change).
All help welcome!
Regex is not the best tool for the job, but you can try:
<div firebugversion=[^>]*></div>
The […] is a character class. Something like [aeiou] matches one of any of the lowercase vowels. [^…] is a negated character class. [^aeiou] matches one of anything but the lowercase vowels.
The * is the zero-or-more repetition. Thus, [^>]* matches a sequence of anything but >.
If you want to target the id specifically, you can try:
<div [^>]*\bid="_firebugConsole"[^>]*></div>
The \b is the word boundary anchor.
Match this regex -
<div.*id="_firebugConsole".*?/div>
I would suggest this:
\<div firebugversion="(.+)" style="(.+)" id="(.+)"\>
Then you have three groups:
firebugversion
style
id
This one is a little more complicated, and probably not perfect, but it will:
Match any div containing the attribute firebugversion
Match the firebugversion attribute no matter which order attributes appear in the tag
Match the div, even if it contains content or spacing between it and its closing tag (i've seen the firebug tag with a &nbsp ; tag inside it before) Note: it does lazy matching so it will only match the next tag, rather than the last it finds in the document
<(div)\b([^>]*?)(firebugversion)([^>]*?)>(.*?)</div>

Regular expression to delete HTML strings

I am trying to delete part of a string that does not match my pattern. For example, in
<SYNC Start=364><P Class=KRCC>
<Font Color=lightpink>abcd
I would like to delete
<P Class=KRCC><Font Color=lightpink>
How do I do that?
Your question does not indicate that you need (or should use) regular expressions. If you want to remove a fixed string, do traditional search and replace.
Just match `your pattern' and write that to a file or update the table of a database. That way, you are deleting the rest.
If the HTML you are parsing is valid and always follows a known standard format, you can use non-greedy patterns to remove most of what you don't want.
These samples will have to be modified based on the tool/framework you're using to handle regular expressions. I am not escaping special characters for brevity.
To match any paragraph tags:
<p.*?>(.*?)</p>
You would replace these matches with $1 (or whatever your syntax requires to access groups).
It's important to use non-greedy (?) patterns to avoid accidentally matching two unrelated start/end tags. For example:
<p.*>(.*)</p>
Would behave very differently. In the case of the following example HTML, it would not correctly match two paragraphs:
<p>Lorem ipsum.</p><p>Lorem ipsum.</p>
Instead, it would match "<p>Lorem ipsum.</p><p>" as the first portion, which would result in losing content.
If you need to match paragraphs with specific classes, you could use something like this:
<p.*?class="delete".*?>(.*?)</p>
Where things get sticky is when you start working with non-standardized HTML. For example, this is all valid HTML, but the pattern to clean it up would be ugly:
<p>no class</p>
<p class=delete>no quotes</p>
<p class="delete">double quotes</p>
<p class='delete'>single quotes</p>
<p>space in closing tag</p >
<p>no closing tag

How can I remove an entire HTML tag (and its contents) by its class using a regex?

I am not very good with Regex but I am learning.
I would like to remove some html tag by the class name. This is what I have so far :
<div class="footer".*?>(.*?)</div>
The first .*? is because it might contain other attribute and the second is it might contain other html stuff.
What am I doing wrong? I have try a lot of set without success.
Update
Inside the DIV it can contain multiple line and I am playing with Perl regex.
As other people said, HTML is notoriously tricky to deal with using regexes, and a DOM approach might be better. E.g.:
use HTML::TreeBuilder::XPath;
my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse_file( 'yourdocument.html' );
for my $node ( $tree->findnodes( '//*[#class="footer"]' ) ) {
$node->replace_with_content; # delete element, but not the children
}
print $tree->as_HTML;
You will also want to allow for other things before class in the div tag
<div[^>]*class="footer"[^>]*>(.*?)</div>
Also, go case-insensitive. You may need to escape things like the quotes, or the slash in the closing tag. What context are you doing this in?
Also note that HTML parsing with regular expressions can be very nasty, depending on the input. A good point is brought up in an answer below - suppose you have a structure like:
<div>
<div class="footer">
<div>Hi!</div>
</div>
</div>
Trying to build a regex for that is a recipe for disaster. Your best bet is to load the document into a DOM, and perform manipulations on that.
Pseudocode that should map closely to XML::DOM:
document = //load document
divs = document.getElementsByTagName("div");
for(div in divs) {
if(div.getAttributes["class"] == "footer") {
parent = div.getParent();
for(child in div.getChildren()) {
// filter attribute types?
parent.insertBefore(div, child);
}
parent.removeChild(div);
}
}
Here is a perl library, HTML::DOM, and another, XML::DOM
.NET has built-in libraries to handle dom parsing.
In Perl you need the /s modifier, otherwise the dot won't match a newline.
That said, using a proper HTML or XML parser to remove unwanted parts of a HTML file is much more appropriate.
<div[^>]*class="footer"[^>]*>(.*?)</div>
Worked for me, but needed to use backslashes before special characters
<div[^>]*class=\"footer\"[^>]*>(.*?)<\/div>
Partly depends on the exact regex engine you are using - which language etc. But one possibility is that you need to escape the quotes and/or the forward slash. You might also want to make it case insensitive.
<div class=\"footer\".*?>(.*?)<\/div>
Otherwise please say what language/platform you are using - .NET, java, perl ...
Try this:
<([^\s]+).*?class="footer".*?>([.\n]*?)</([^\s]+)>
Your biggest problem is going to be nested tags. For example:
<div class="footer"><b></b></div>
The regexp given would match everything through the </b>, leaving the </div> dangling on the end. You will have to either assume that the tag you're looking for has no nested elements, or you will need to use some sort of parser from HTML to DOM and an XPath query to remove an entire sub-tree.
This will be tricky because of the greediness of regular expressions, (Note that my examples may be specific to perl, but I know that greediness is a general issue with REs.) The second .*? will match as much as possible before the </div>, so if you have the following:
<div class="SomethingElse"><div class="footer"> stuff </div></div>
The expression will match:
<div class="footer"> stuff </div></div>
which is not likely what you want.
why not <div class="footer".*?</div> I'm not a regex guru either, but I don't think you need to specify that last bracket for your open div tag

Can an HTML element have multiple ids?

I understand that an id must be unique within an HTML/XHTML page.
For a given element, can I assign multiple ids to it?
<div id="nested_element_123 task_123"></div>
I realize I have an easy solution with simply using a class. I'm just curious about using ids in this manner.
No. From the XHTML 1.0 Spec
In XML, fragment identifiers are of
type ID, and there can only be a
single attribute of type ID per
element. Therefore, in XHTML 1.0 the
id attribute is defined to be of type
ID. In order to ensure that XHTML 1.0
documents are well-structured XML
documents, XHTML 1.0 documents MUST
use the id attribute when defining
fragment identifiers on the elements
listed above. See the HTML
Compatibility Guidelines for
information on ensuring such anchors
are backward compatible when serving
XHTML documents as media type
text/html.
Contrary to what everyone else said, the correct answer is YES
The Selectors spec is very clear about this:
If an element has multiple ID attributes, all of them must be treated as IDs for that element for the purposes of the ID selector.Such a situation could be reached using mixtures of xml:id, DOM3 Core, XML DTDs, and namespace-specific knowledge.
Edit
Just to clarify: Yes, an XHTML element can have multiple ids, e.g.
<p id="foo" xml:id="bar">
but assigning multiple ids to the same id attribute using a space-separated list is not possible.
No. While the definition from W3C for HTML 4 doesn't seem to explicitly cover your question, the definition of the name and id attribute says no spaces in the identifier:
ID and NAME tokens must begin with a letter ([A-Za-z]) and may be followed by any number of letters, digits ([0-9]), hyphens ("-"), underscores ("_"), colons (":"), and periods (".").
My understanding has always been:
IDs are single use and are only applied to one element...
Each is attributed as a unique identifier to (only) one single element.
Classes can be used more than once...
They can therefore be applied to more than one element, and similarly yet different, there can be more than one class (i.e., multiple classes) per element.
No. Every DOM element, if it has an id, has a single, unique id. You could approximate it using something like:
<div id='enclosing_id_123'><span id='enclosed_id_123'></span></div>
and then use navigation to get what you really want.
If you are just looking to apply styles, class names are better.
You can only have one ID per element, but you can indeed have more than one class. But don't have multiple class attributes; put multiple class values into one attribute.
<div id="foo" class="bar baz bax">
is perfectly legal.
No, you cannot have multiple ids for a single tag, but I have seen a tag with a name attribute and an id attribute which are treated the same by some applications.
No, you should use nested DIVs if you want to head down that path. Besides, even if you could, imagine the confusion it would cause when you run document.getElementByID(). What ID is it going to grab if there are multiple ones?
On a slightly related topic, you can add multiple classes to a DIV. See Eric Myers discussion at,
http://meyerweb.com/eric/articles/webrev/199802a.html
I'd like to say technically yes, since really what gets rendered is technically always browser-dependent. Most browsers try to keep to the specifications as best they can and as far as I know there is nothing in the CSS specifications against it. I'm only going to vouch for the actual HTML, CSS, and JavaScript code that gets sent to the browser before any other interpreter steps in.
However, I also say no since every browser I typically test on doesn't actually let you.
If you need to see for yourself, save the following as a .html file and open it up in the major browsers. In all browsers I tested on, the JavaScript function will not match to an element. However, remove either "hunkojunk" from the id tag and all works fine.
Sample Code
<html>
<head>
</head>
<body>
<p id="hunkojunk1 hunkojunk2"></p>
<script type="text/javascript">
document.getElementById('hunkojunk2').innerHTML = "JUNK JUNK JUNK JUNK JUNK JUNK";
</script>
</body>
</html>
Nay.
From 3.2.3.1 The id attribute:
The value must not contain any space characters.
id="a b" <-- find the space character in that VaLuE.
That said, you can style multiple IDs. But if you're following the specification, the answer is no.
From 7.5.2 Element identifiers: the id and class attributes:
The id attribute assigns a unique identifier to an element (which may
be verified by an SGML parser).
and
ID and NAME tokens must begin with a letter ([A-Za-z]) and may be
followed by any number of letters, digits ([0-9]), hyphens ("-"),
underscores ("_"), colons (":"), and periods (".").
So "id" must be unique and can't contain a space.
No.
Having said that, there's nothing to stop you doing it. But you'll get inconsistent behaviour with the various browsers. Don't do it. One ID per element.
If you want multiple assignations to an element use classes (separated by a space).
Any ID assigned to a div element is unique.
However, you can assign multiple IDs "under", and not "to" a div element.
In that case, you have to represent those IDs as <span></span> IDs.
Say, you want two links in the same HTML page to point to the same div element in the page.
The two different links
<p>Exponential Equations</p>
<p><Logarithmic Expressions</p>
Point to the same section of the page
<!-- Exponential / Logarithmic Equations Calculator -->
<div class="w3-container w3-card white w3-margin-bottom">
<span id="exponentialEquationsCalculator"></span>
<span id="logarithmicEquationsCalculator"></span>
</div>
The simple answer is no, as others have said before me. An element can't have more than one ID and an ID can't be used more than once in a page. Try it out and you'll see how well it doesn't work.
In response to tvanfosson's answer regarding the use of the same ID in two different elements. As far as I'm aware, an ID can only be used once in a page regardless of whether it's attached to a different tag.
By definition, an element needing an ID should be unique, but if you need two ID's then it's not really unique and needs a class instead.
That's interesting, but as far as I know the answer is a firm no. I don't see why you need a nested ID, since you'll usually cross it with another element that has the same nested ID. If you don't there's no point, if you do there's still very little point.
Classes are specially made for this, and
here is the code from which you can understand it:
<html>
<head>
<style type="text/css">
.personal{
height:100px;
width: 100px;
}
.fam{
border: 2px solid #ccc;
}
.x{
background-color:#ccc;
}
</style>
</head>
<body>
<div class="personal fam x"></div>
</body>
</html>
ID's should be unique, so you should only use a particular ID once on a page. Classes may be used repeatedly.
Check HTML id Attribute (W3Schools) for more details.
I don´t think you can have two Id´s but it should be possible. Using the same id twice is a different case... like two people using the same passport. However one person could have multiple passports... Came looking for this since I have a situation where a single employee can have several functions. Say "sysadm" and "team coordinator" having the id="sysadm teamcoordinator" would let me reference them from other pages so that employees.html#sysadm and employees.html#teamcoordinator would lead to the same place... One day somebody else might take over the team coordinator function while the sysadm remains the sysadm... then I only have to change the ids on the employees.html page ... but like I said - it doesn´t work :(