We build bespoke WordPress themes, and recently have been receiving complaints regarding the sequence of headings. Most automated tools, including Google's Lighthouse, suggest that you should never skip heading levels, in order to properly communicate page structure for screen readers and other accessibility tools.
This issue is largely due to the way our clients enter content. They tend to prefer picking a visually pleasing heading, rather than the "correct" heading sequentially, so we'll often end up with pages that have an h1, then an h4, then a set of h2s, and so on. We've told these clients that they can fix this by properly entering content, but this seems to be asking too much of them, much like entering alt text for images.
To "solve" this issue, I'm trying to write a filter that will parse the_content, identify all of the headings, and replace their tags so that they become sequential, retaining classes for styling. I realize that this isn't a perfect solution, as the intended heading structure really can't be assumed programmatically, but this is the only viable solution I've been able to determine (if someone has a better idea, please, do tell).
So, for example, the code the user generates could be something like this:
<h2 class="title--h2">This is a second level heading</h2>
<p>Etiam vitae erat ullamcorper ipsum ultrices convallis ac quis nulla. Nam euismod imperdiet enim eu venenatis. Nulla non bibendum dui. Maecenas id tincidunt orci. Sed pellentesque ipsum et tempor convallis. Etiam elementum augue aliquet enim venenatis tincidunt. Praesent nunc dolor, vulputate nec aliquet consectetur, aliquet nec elit. Vivamus non eros nec nibh vestibulum lacinia. Morbi diam turpis, accumsan ac fringilla eget, fringilla vitae lorem. Ut consequat tortor orci, sed lobortis metus facilisis nec. Nulla sed enim in tortor blandit aliquet. Curabitur a finibus mi.</p>
<h4 class="title--h4">This is a fourth level heading</h4>
<p>Nullam blandit, mauris vel vestibulum aliquet, quam lectus laoreet mi, id euismod ligula augue sit amet velit. Suspendisse suscipit lacus quis mauris varius, sed cursus mi auctor. Nullam non augue in ante malesuada blandit. Nam eu purus commodo, porttitor odio commodo, tristique nunc. Suspendisse vitae vehicula turpis. Aenean turpis nibh, auctor ac mollis congue, iaculis id tortor. Morbi in est erat. Proin aliquam varius neque a sollicitudin. Vestibulum varius in urna sit amet hendrerit.</p>
<h4 class="title--h4">This is a fourth level heading</h4>
<p>Donec vitae est sapien. Nulla facilisi. Quisque sed auctor ante, sed viverra elit. Quisque justo arcu, vulputate tempor odio ac, mollis blandit justo. Morbi viverra tincidunt leo vel mattis. Aliquam erat volutpat. Nunc tortor tellus, porta sit amet tellus sed, interdum condimentum ex. </p>
And the output would be:
<h2 class="title--h2">This is a second level heading</h2>
<p>Etiam vitae erat ullamcorper ipsum ultrices convallis ac quis nulla. Nam euismod imperdiet enim eu venenatis. Nulla non bibendum dui. Maecenas id tincidunt orci. Sed pellentesque ipsum et tempor convallis. Etiam elementum augue aliquet enim venenatis tincidunt. Praesent nunc dolor, vulputate nec aliquet consectetur, aliquet nec elit. Vivamus non eros nec nibh vestibulum lacinia. Morbi diam turpis, accumsan ac fringilla eget, fringilla vitae lorem. Ut consequat tortor orci, sed lobortis metus facilisis nec. Nulla sed enim in tortor blandit aliquet. Curabitur a finibus mi.</p>
<h3 class="title--h4">This is a fourth level heading</h3>
<p>Nullam blandit, mauris vel vestibulum aliquet, quam lectus laoreet mi, id euismod ligula augue sit amet velit. Suspendisse suscipit lacus quis mauris varius, sed cursus mi auctor. Nullam non augue in ante malesuada blandit. Nam eu purus commodo, porttitor odio commodo, tristique nunc. Suspendisse vitae vehicula turpis. Aenean turpis nibh, auctor ac mollis congue, iaculis id tortor. Morbi in est erat. Proin aliquam varius neque a sollicitudin. Vestibulum varius in urna sit amet hendrerit.</p>
<h4 class="title--h4">This is a fourth level heading</h4>
<p>Donec vitae est sapien. Nulla facilisi. Quisque sed auctor ante, sed viverra elit. Quisque justo arcu, vulputate tempor odio ac, mollis blandit justo. Morbi viverra tincidunt leo vel mattis. Aliquam erat volutpat. Nunc tortor tellus, porta sit amet tellus sed, interdum condimentum ex. </p>
Again, I realize this is going to lead to unintended structure (I included an example of this in the above demonstration), but this is what my clients are asking for, so I'm giving in.
The code I have so far will track the previous heading level and determine what the new level should be, but I'm having difficulty understanding how to actually replace the tags correctly. My understanding is that modifying the DOM with $node->replaceChild() is going to result in items getting skipped, because the DOM is changing while it's being parsed. Additionally, I'd like to retain all attributes on each heading, but I've been unable to locate a method for this; everything suggests copying individual attributes manually, but because this is CMS-driven, I'm worried that custom or unexpected attributes will be missed.
Here's the filter I have so far:
/**
 * Ensure heading levels are always in sequence
 *
 * @param string $content
 * @return string
 */
function namespace_fix_title_sequence(string $content): string {
    if (! (is_admin() && ! wp_doing_ajax()) && $content) {
        $DOM = new DOMDocument();
        /**
         * Use internal errors to get around HTML5 warnings
         */
        libxml_use_internal_errors(true);
        /**
         * Load in the content, with proper encoding and an `<html>` wrapper required for parsing
         */
        $DOM->loadHTML("<?xml encoding='utf-8' ?><html>{$content}</html>", LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
        /**
         * Clear errors to get around HTML5 warnings
         */
        libxml_clear_errors();
        /**
         * Use XPath to query headings
         */
        $XPath = new DOMXPath($DOM);
        $headings = $XPath->query("//*[self::h1 or self::h2 or self::h3 or self::h4 or self::h5 or self::h6]");
        /**
         * Track previous heading level
         */
        $previous_level = 1;
        foreach ($headings as $heading) {
            /**
             * Get the current level
             */
            $current_level = intval(preg_replace("/^h/", "", $heading->nodeName));
            /**
             * Determine the target level
             */
            $target_level = ($current_level - $previous_level <= 1 ? $current_level : $previous_level + 1);
            /**
             * DEBUG
             */
            echo "<p>Previous: {$previous_level}</p>";
            echo "<p>Current: {$current_level}</p>";
            echo "<p>Target: {$target_level}</p>";
            echo "<hr />";
            /**
             * Replace current level with target level
             */
            // ?
            /**
             * Update the previous level
             */
            $previous_level = $target_level;
        }
        /**
         * Save changes, remove unneeded tags
         */
        $content = implode(array_map([$DOM->documentElement->ownerDocument, "saveHTML"], iterator_to_array($DOM->documentElement->childNodes)));
    }
    return $content;
}
add_filter("the_content", "namespace_fix_title_sequence", 100, 1);
The ideal unrealistic solution
In an ideal world, the best solution would be to totally prevent the content writer from selecting incorrect heading levels in the interface of their WYSIWYG.
Just as you might force them to provide non-empty alt text for images and labels for input fields, forbid empty links, and so on.
At any given place in the document, they would only be allowed to insert a heading of level 1 to N+1, where N is the level of the previous heading.
Consider that adjustments would also possibly have to be propagated, i.e. changing an H3 into an H2 in the middle of the text should also change all the following H4 into H3 down to the next H2, and so recursively.
This is, as you see, not as easy as we may think at first.
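The two rules above (only levels 1 to N+1 are allowed, and a level change has to be propagated through the subsection) can be sketched in a few lines. This is a minimal Python sketch, not editor code: `is_allowed` and `shift_section` are hypothetical names, and the headings are assumed to be available as a plain list of integer levels in document order.

```python
def is_allowed(previous_level, new_level):
    """A heading may use any level from 1 up to one deeper than the previous one."""
    return 1 <= new_level <= min(previous_level + 1, 6)


def shift_section(levels, index, new_level):
    """Change levels[index] to new_level and move its subsection by the same amount.

    The subsection ends at the next heading whose original level is the same
    as, or higher than, the heading being changed.
    """
    old_level = levels[index]
    delta = new_level - old_level
    result = list(levels)
    result[index] = new_level
    for i in range(index + 1, len(levels)):
        if levels[i] <= old_level:
            break  # the subsection of the changed heading ends here
        # Keep shifted levels inside the valid h1..h6 range.
        result[i] = min(6, max(1, levels[i] + delta))
    return result


# Turning the H3 at index 1 into an H2 drags the two H4s below it up to H3.
print(shift_section([2, 3, 4, 4, 2, 3], 1, 2))  # [2, 2, 3, 3, 2, 3]
```

Even in this reduced form, the editor would have to re-run both checks after every insertion, deletion, or level change, which is where the real complexity lies.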
Sadly, not only is this hard both to develop and to use, but writers are probably not ready for it either. Those who don't understand the need for correct structure will probably regard the restriction as a bug, or as a stupid software limitation on their freedom to write however they like.
Maybe you could decouple the heading level from its visual style to avoid frustration, but that quickly becomes even more complicated.
So the only things you can do are to educate content writers or, as you propose here, to try to fix the incorrect structure automatically.
Algorithm to fix heading structure
Before getting into the real task of DOM manipulation, let's talk a little about the algorithm. It's of course impossible to fix the structure the way the author intended 100% of the time, but the goal is still to choose what the author most probably wanted.
If we take your example back, the author wrote H2, H4, H3, H3. Is the simplest fix, H2, H3, H3, H3, the most appropriate?
What about H2, H3, H4, H4? That alternative rests on the idea that if two elements are visually different, they were probably meant to be at different levels, and conversely, if two elements are visually identical, they were probably meant to be at the same level.
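To make the trade-off concrete, here is a small Python sketch of two possible strategies, working only on a list of heading levels. `clamp_levels` follows the rule already used in the question's filter (never go more than one level deeper than the previous heading). `consistent_levels` is only one hypothetical reading of the "same look, same level" idea: headings that share an original level inside the same section get the same corrected level. Both function names are made up for illustration.

```python
def clamp_levels(levels, start=1):
    """Never jump more than one level deeper than the previous heading."""
    previous = start
    fixed = []
    for current in levels:
        target = current if current - previous <= 1 else previous + 1
        fixed.append(target)
        previous = target
    return fixed


def consistent_levels(levels, start=1):
    """Give headings with the same original level the same corrected level
    for as long as they stay within the same parent section."""
    stack = []  # (original level, corrected level) of the currently open sections
    fixed = []
    for current in levels:
        # Close every open section that is deeper than the current heading.
        while stack and stack[-1][0] > current:
            stack.pop()
        if stack and stack[-1][0] == current:
            target = stack[-1][1]  # sibling of the previous heading
        else:
            if stack:
                target = min(stack[-1][1] + 1, 6)  # child of the open section
            else:
                target = min(current, start + 1)   # first heading at this depth
            stack.append((current, target))
        fixed.append(target)
    return fixed


print(clamp_levels([2, 4, 4]))       # [2, 3, 4]
print(consistent_levels([2, 4, 4]))  # [2, 3, 3]
```

For the H2, H4, H4 markup shown in the question, the first strategy reproduces the "unintended structure" the asker mentions, while the second keeps the two identical-looking headings as siblings. Neither can know what the author really meant.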
DOM manipulation
As far as I know, most DOM APIs I have seen (in Java, JavaScript, PHP, C++, etc.) don't allow you to change an element's name in place. You must create a new node to do that.
You can't simply change an H4 into an H3 while retaining the inner structure untouched for example.
So, if you indeed can't change the element name in place, you need to:
1. Create a fragment F with the inner structure of the H4 you want to change into an H3.
   If fragments are unavailable, or if extracting a fragment is complicated in the DOM API you are using, you have to move or clone the child nodes one by one, for example via an array.
2. Create the H3 node and put F into it.
3. Copy the attributes of the H4 onto the H3. There is probably no other way than to do it one by one, but a loop over the attribute list covers custom or unexpected attributes too.
4. Replace the H4 with the new H3. Alternatively, if there is no replaceChild method, insert the H3 before the H4 and then remove the H4.
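The steps above can be sketched as follows. This is a Python illustration using the standard library's `xml.dom.minidom` as a stand-in for PHP's DOMDocument, not the WordPress filter itself; `rename_element` is a hypothetical helper name, and the input is assumed to be well-formed markup wrapped in a single root element, because minidom is an XML parser. Two details address the concerns in the question: the headings are copied into a plain list before anything is replaced, so no node is skipped while the tree changes, and the attributes are copied in a loop, so nothing has to be listed by name.

```python
from xml.dom import minidom


def rename_element(doc, old, new_name):
    """Replace `old` with a new element called `new_name`, keeping attributes and children."""
    new = doc.createElement(new_name)
    # Copy every attribute generically, including custom or unexpected ones.
    for name, value in old.attributes.items():
        new.setAttribute(name, value)
    # Move (not clone) the children; appendChild detaches them from `old`.
    while old.firstChild is not None:
        new.appendChild(old.firstChild)
    old.parentNode.replaceChild(new, old)
    return new


source = '<div><h4 class="title--h4" data-x="1">A <em>fourth</em> level heading</h4><p>Text</p></div>'
doc = minidom.parseString(source)

# Snapshot the matches first so that replacing nodes cannot disturb the loop.
for heading in list(doc.getElementsByTagName("h4")):
    rename_element(doc, heading, "h3")

result = doc.documentElement.toxml()
print(result)
```

In PHP the same shape should carry over with `createElement`, `setAttribute`, `appendChild` and `replaceChild` on DOMDocument and DOMNode, with `iterator_to_array()` on the XPath result playing the role of `list(...)` here.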
I have a bunch of HTML text that looks like this:
<p><strong>Pellentesque habitant morbi tristique</strong> senectus et netus et malesuada fames ac turpis egestas.
Vestibulum tortor quam, feugiat vitae, ultricies eget, tempor sit amet,
ante. Donec eu libero sit amet quam egestas semper.
<em>Aenean ultricies mi vitae est.</em> Mauris placerat
This is text that a user will post to a forum, and this formatted string is then stored on a server. I am displaying this text on another page, but would like it to be rendered as the user formatted it, e.g.:
Pellentesque habitant morbi tristique senectus et netus et malesuada fames ac turpis egestas.
Vestibulum tortor quam, feugiat vitae, ultricies eget, tempor sit amet,
ante. Donec eu libero sit amet quam egestas semper.
Aenean ultricies mi vitae est. Mauris placerat
I have tried using <pre>, <p>, and other tags, but they just print out the raw HTML instead of using the formatting given. I am currently using Angular.JS for my page.
Sample text obtained from http://html-ipsum.com/, "Kitchen Sink" example
You are likely storing the string in an encoded (HTML-escaped) form, so the markup is displayed literally instead of being rendered.
I'd double check your raw data store to verify this.
In any case THIS IS NOT A RECOMMENDED WAY TO APPROACH YOUR CODE. You are basically inviting a malicious user to inject malicious code into the pages your other users see.
When allowing users to input any html, it is best to only allow a small subset of tags (and a small subset of attributes), and even then it is very hard to get right.
See Cross-Site Scripting (XSS) Tutorial for more.
It sounds like, at some point during the process of storing your HTML, you have escaped the HTML entities (i.e. converted < to &lt;, that sort of thing).
I don't know what language you're working in, but it's possible to unescape HTML characters in pretty much any language. Here's an answer about doing it in JavaScript. The html_entity_decode() function will do the trick in PHP. For whatever language you're working in, just do research on "unescape html entities" in that language.
Warning: since you are unescaping HTML, there's the risk that the user might have written something naughty (e.g. a <script> tag with some malicious JS code). Make sure you're cleaning out any nasty HTML somewhere along the line.
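As an illustration of what "unescaping" means, here is a minimal Python sketch using the standard library's `html` module; the `stored` string is an invented example of what the database might contain. It only shows the decoding step and does no sanitising at all, so the warning above still applies to its output.

```python
import html

# What an escaped copy of the user's markup looks like when stored.
stored = "&lt;p&gt;&lt;strong&gt;Pellentesque habitant morbi tristique&lt;/strong&gt; senectus&lt;/p&gt;"

# Turn the entities back into real markup.
markup = html.unescape(stored)
print(markup)  # <p><strong>Pellentesque habitant morbi tristique</strong> senectus</p>

# Escaping again gives back the stored form, which confirms what happened to the data.
assert html.escape(markup) == stored
```

If the page still shows literal tags after decoding, the text is most likely being escaped a second time at output, which is the default behaviour of most templating systems.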
I have a fairly large WordPress .XML export file from a blog that I am going to migrate to Drupal. One glaring issue with the export file is that it's missing <p> tags for any paragraph breaks. However, the tags are present on the actual site.
From what I can see from the raw text in the XML file, there are multiple line breaks between paragraphs where there should have been a single <p> tag. I was hoping to globally add in a <p> tag where there's a line break and a capital letter using RegEx but I don't have a working knowledge of how that works. A sample XML tag in the export file that contains the text in question is:
<content:encoded><![CDATA[Lorem ipsum dolor sit amet, consectetur adipiscing elit. Curabitur gravida risus at sem interdum iaculis. Curabitur eget est tellus, quis viverra arcu.
Cras posuere turpis imperdiet odio aliquet sollicitudin. Maecenas et neque eget quam fringilla tempor. Vivamus sodales vulputate consectetur.
Sed ullamcorper elementum est, at dapibus orci fermentum vitae. Vivamus nisi turpis, pretium sed tincidunt et, dapibus at eros. Quisque neque magna, posuere eget eleifend ut.
As you can see from the above, there are multiple line breaks in between what should be paragraphs. I was thinking of the line break / capital letter combo for the RegEx so as to only put in one <p> tag and also target specifically the <content:encoded> XML tag so that I don't add tags elsewhere in the XML file. One other issue to make things more complicated is that some paragraphs already have <p> tags where the editor added in a custom class like <p class="myclass">.
This issue was discussed on StackOverflow somewhere before. The problem is that WordPress doesn't store the p tags in its database (if you use its WYSIWYG editor); these tags are created at render time by the wpautop() function, in place of the line breaks. So I edited the export.php file (running WP 3.4.1) and added the function there. You can see the result on Pastebin (changes are on lines 375 and 376).
<content:encoded><?php echo wxr_cdata( apply_filters( 'the_content_export', wpautop( $post->post_content ) ) ); ?></content:encoded>
<excerpt:encoded><?php echo wxr_cdata( apply_filters( 'the_excerpt_export', wpautop( $post->post_excerpt ) ) ); ?></excerpt:encoded>
You can copy and paste the whole code into the file [root]/wp-admin/includes/export.php and run the export again. Don't forget to back up the file first. I don't guarantee it will work with other versions, but you can get the idea of how to edit the export.
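If editing export.php is not an option, the regular-expression approach from the question can be approximated after the fact. The following is a rough Python sketch, not a reimplementation of wpautop(): `add_paragraph_tags` is a hypothetical helper that assumes paragraphs are separated by one or more blank lines, and it leaves alone any block that already starts with a `<p>` tag, such as `<p class="myclass">`. It would have to be applied to the text of each `<content:encoded>` element only.

```python
import re


def add_paragraph_tags(text):
    """Wrap blank-line-separated blocks in <p> tags, skipping blocks that already have one."""
    text = text.replace("\r\n", "\n").strip()
    # A newline, optional whitespace (including more newlines), and a newline
    # together count as a single paragraph break.
    blocks = re.split(r"\n\s*\n", text)
    wrapped = []
    for block in blocks:
        block = block.strip()
        if not block:
            continue
        if re.match(r"<p[\s>]", block, re.IGNORECASE):
            wrapped.append(block)  # the editor already added a <p>, maybe with a class
        else:
            wrapped.append("<p>" + block + "</p>")
    return "\n".join(wrapped)


sample = 'Lorem ipsum dolor sit amet.\n\n\nCras posuere turpis.\n\n<p class="myclass">Sed ullamcorper.</p>'
print(add_paragraph_tags(sample))
```

Splitting on blank lines avoids the "line break followed by a capital letter" heuristic, which would miss paragraphs that start with a digit, a quote, or a tag.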
What would be the best method for replacing variables/words/lines of text in a larger "paragraph" of code?
Example:
Lorem ipsum dolor $SIT amet, consectetur adipiscing elit. Aliquam condimentum dolor ut est faucibus dapibus. Donec molestie dictum nisi, eu euismod $SAPIEN gravida in. Aliquam dictum, tellus eu facilisis laoreet, sapien nunc placerat turpis, eu pretium augue eros vel lectus. Quisque condimentum lorem $EROS, vel pharetra tortor.
I want to be able to enter text in a textbox/prompt to replace the "Variables" $SIT, $SAPIEN, $EROS with actual values automatically.
I trust I've made myself obscure? :P
I'm n00b at any sort of coding. I only know some basic HTML, PHP, and Java. But please give me a clear solution with an example or link or more help.
Thanks so much!
You need JavaScript if you want to do it client-side, or any server-side language (PHP, Python, Ruby) if you want to do it on the server. All of these languages have some equivalent of a "string replace" function, which takes the strings to search for, the replacement strings, and the subject text to work on. For JS and PHP:
http://php.net/manual/en/function.str-replace.php
http://www.w3schools.com/jsref/jsref_replace.asp
The way that you'll do it is up to you.
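For a feel of what such a replacement looks like, here is a small server-side sketch in Python (one of the languages listed above). The `$NAME` placeholders in the question happen to match the syntax of Python's standard `string.Template`, so no custom search code is needed; the replacement values below are invented and would come from your textbox or prompt.

```python
from string import Template

text = (
    "Lorem ipsum dolor $SIT amet, consectetur adipiscing elit. "
    "Donec molestie dictum nisi, eu euismod $SAPIEN gravida in. "
    "Quisque condimentum lorem $EROS, vel pharetra tortor."
)

# Values the user typed in; the keys are the placeholder names without the "$".
values = {"SIT": "sit", "SAPIEN": "sapien", "EROS": "eros"}

# safe_substitute leaves any placeholder without a value untouched
# instead of raising an error.
filled = Template(text).safe_substitute(values)
print(filled)
```

In PHP, str_replace() with an array of placeholders and an array of values gives the same result.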
How would you programmatically abbreviate XHTML to an arbitrary number of words without leaving unclosed or corrupted tags?
i.e.
<p>
Proin tristique dapibus neque. Nam eget purus sit amet leo
tincidunt accumsan.
</p>
<p>
Proin semper, orci at mattis blandit, augue justo blandit nulla.
<span>Quisque ante congue justo</span>, ultrices aliquet, mattis eget,
hendrerit, <em>justo</em>.
</p>
Abbreviated to 25 words would be:
<p>
Proin tristique dapibus neque. Nam eget purus sit amet leo
tincidunt accumsan.
</p>
<p>
Proin semper, orci at mattis blandit, augue justo blandit nulla.
<span>Quisque ante congue...</span>
</p>
Recurse through the DOM tree, keeping a word count variable up to date. When the word count exceeds your maximum word count, insert "..." and remove all following siblings of the current node, then, as you go back up through the recursion, remove all the following siblings of each of its ancestors.
You need to think of the XHTML as a hierarchy of elements and treat it as such. This is basically the way XML is meant to be treated. Then just go through the hierarchy recursively, adding the number of words together as you go. When you hit your limit throw everything else away.
I work mainly in PHP, and I would use the DOMDocument class in PHP to help me do this; you would need to find something similar in your chosen language.
To make things clearer, here is the hierarchy for your sample:
- p
    - Proin tristique dapibus neque. Nam eget purus sit amet leo
      tincidunt accumsan.
- p
    - Proin semper, orci at mattis blandit, augue justo blandit nulla.
    - span
        - Quisque ante congue justo
    - , ultrices aliquet, mattis eget, hendrerit,
    - em
        - justo
    - .
You hit the 25 word limit inside the span element, so you remove all remaining text within the span and add the ellipsis. All other child elements (both text and tags) can be discarded, and all subsequent elements can be discarded.
This should always leave you with valid markup as far as I can see: because you are treating it as a hierarchy and not just plain text, all closing tags that are required will still be there.
Of course if the XHTML you are dealing with is invalid to begin with, don't expect the output to be valid.
Sorry for the poor hierarchy example, couldn't work out how to nest lists.
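The recursive approach described in the answers above can be sketched like this. It is a Python illustration using the standard library's `xml.dom.minidom` in place of PHP's DOMDocument; `truncate_words` is a hypothetical helper name, and the two paragraphs are wrapped in a `<div>` only because an XML parser needs a single root element.

```python
from xml.dom import minidom


def truncate_words(node, remaining):
    """Keep at most `remaining` words under `node`; return how many words are still allowed."""
    for child in list(node.childNodes):  # copy the list, since children may be removed
        if remaining <= 0:
            node.removeChild(child)  # everything after the cut-off is discarded
        elif child.nodeType == child.TEXT_NODE:
            words = child.data.split()
            if len(words) > remaining:
                child.data = " ".join(words[:remaining]) + "..."
                remaining = 0
            else:
                remaining -= len(words)
        elif child.nodeType == child.ELEMENT_NODE:
            remaining = truncate_words(child, remaining)
    return remaining


source = (
    "<div><p>Proin tristique dapibus neque. Nam eget purus sit amet leo "
    "tincidunt accumsan.</p><p>Proin semper, orci at mattis blandit, augue "
    "justo blandit nulla. <span>Quisque ante congue justo</span>, ultrices "
    "aliquet, mattis eget, hendrerit, <em>justo</em>.</p></div>"
)
doc = minidom.parseString(source)
truncate_words(doc.documentElement, 25)
result = doc.documentElement.toxml()
print(result)
```

Because nodes are removed from the tree rather than characters from a string, the serialiser writes all the closing tags itself. One simplification to be aware of: when the limit falls exactly at the end of a text node, this sketch drops the rest without adding an ellipsis.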