How would you programmatically abbreviate XHTML to an arbitrary number of words without leaving unclosed or corrupted tags?
i.e.
<p>
Proin tristique dapibus neque. Nam eget purus sit amet leo
tincidunt accumsan.
</p>
<p>
Proin semper, orci at mattis blandit, augue justo blandit nulla.
<span>Quisque ante congue justo</span>, ultrices aliquet, mattis eget,
hendrerit, <em>justo</em>.
</p>
Abbreviated to 25 words would be:
<p>
Proin tristique dapibus neque. Nam eget purus sit amet leo
tincidunt accumsan.
</p>
<p>
Proin semper, orci at mattis blandit, augue justo blandit nulla.
<span>Quisque ante congue...</span>
</p>
Recurse through the DOM tree, keeping a word count variable up to date. When the word count exceeds your maximum word count, insert "..." and remove all following siblings of the current node, then, as you go back up through the recursion, remove all the following siblings of each of its ancestors.
You need to think of the XHTML as a hierarchy of elements and treat it as such. This is basically the way XML is meant to be treated. Then just go through the hierarchy recursively, adding the number of words together as you go. When you hit your limit throw everything else away.
I work mainly in PHP, where I would use the DOMDocument class for this; you will need to find an equivalent DOM library in your chosen language.
To make things clearer, here is the hierarchy for your sample:
- p
  - Proin tristique dapibus neque. Nam eget purus sit amet leo tincidunt accumsan.
- p
  - Proin semper, orci at mattis blandit, augue justo blandit nulla.
  - span
    - Quisque ante congue justo
  - , ultrices aliquet, mattis eget, hendrerit,
  - em
    - justo
  - .
You hit the 25 word limit inside the span element, so you remove all remaining text within the span and add the ellipsis. All other child elements (both text and tags) can be discarded, and all subsequent elements can be discarded.
As far as I can see, this should always leave you with valid markup: because you are treating the document as a hierarchy rather than as plain text, all required closing tags will still be there.
Of course if the XHTML you are dealing with is invalid to begin with, don't expect the output to be valid.
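A minimal sketch of the recursive approach described above, written in Python with the standard library's xml.dom.minidom rather than PHP's DOMDocument. The function name truncate_words and the wrapping <div> root are assumptions for illustration. Words are counted by splitting on whitespace, so a text node containing only "." counts as one word, and no ellipsis is added if the limit falls exactly at the end of a text node.

```python
from xml.dom import minidom


def truncate_words(node, remaining):
    """Trim `node` in place; return how many words are still allowed."""
    # Iterate over a copy, because children are removed while walking.
    for child in list(node.childNodes):
        if remaining <= 0:
            # Limit already reached: discard every following sibling.
            node.removeChild(child)
        elif child.nodeType == child.TEXT_NODE:
            words = child.data.split()
            if len(words) > remaining:
                # Cut inside this text node and mark the cut.
                child.data = " ".join(words[:remaining]) + "..."
                remaining = 0
            else:
                remaining -= len(words)
        elif child.nodeType == child.ELEMENT_NODE:
            # Recurse; the element itself is kept, so its closing tag survives.
            remaining = truncate_words(child, remaining)
    return remaining


# Hypothetical sample, wrapped in a single root so it parses as XML.
doc = minidom.parseString(
    "<div><p>one two three</p>"
    "<p>four <span>five six seven</span> eight <em>nine</em>.</p></div>"
)
truncate_words(doc.documentElement, 5)
result = doc.documentElement.toxml()
print(result)
```

Because nodes are only ever removed whole, every element that remains is still serialized with its closing tag.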
Sorry for the poor hierarchy example, couldn't work out how to nest lists.
We build bespoke WordPress themes, and recently have been receiving complaints regarding the sequence of headings. Most automated tools, including Google's Lighthouse, suggest that you should never skip heading levels, in order to properly communicate page structure for screen readers and other accessibility tools.
This issue is largely due to the way our clients enter content. They tend to prefer picking a visually pleasing heading, rather than the "correct" heading sequentially, so we'll often end up with pages that have an h1, then an h4, then a set of h2s, and so on. We've told these clients that they can fix this by properly entering content, but this seems to be asking too much of them, much like entering alt text for images.
To "solve" this issue, I'm trying to write a filter that will parse the_content, identify all of the headings, and replace their tags so that they become sequential, retaining classes for styling. I realize that this isn't a perfect solution, as the intended heading structure really can't be assumed programmatically, but this is the only viable solution I've been able to determine (if someone has a better idea, please, do tell).
So, for example, the code the user generates could be something like this:
<h2 class="title--h2">This is a second level heading</h2>
<p>Etiam vitae erat ullamcorper ipsum ultrices convallis ac quis nulla. Nam euismod imperdiet enim eu venenatis. Nulla non bibendum dui. Maecenas id tincidunt orci. Sed pellentesque ipsum et tempor convallis. Etiam elementum augue aliquet enim venenatis tincidunt. Praesent nunc dolor, vulputate nec aliquet consectetur, aliquet nec elit. Vivamus non eros nec nibh vestibulum lacinia. Morbi diam turpis, accumsan ac fringilla eget, fringilla vitae lorem. Ut consequat tortor orci, sed lobortis metus facilisis nec. Nulla sed enim in tortor blandit aliquet. Curabitur a finibus mi.</p>
<h4 class="title--h4">This is a fourth level heading</h4>
<p>Nullam blandit, mauris vel vestibulum aliquet, quam lectus laoreet mi, id euismod ligula augue sit amet velit. Suspendisse suscipit lacus quis mauris varius, sed cursus mi auctor. Nullam non augue in ante malesuada blandit. Nam eu purus commodo, porttitor odio commodo, tristique nunc. Suspendisse vitae vehicula turpis. Aenean turpis nibh, auctor ac mollis congue, iaculis id tortor. Morbi in est erat. Proin aliquam varius neque a sollicitudin. Vestibulum varius in urna sit amet hendrerit.</p>
<h4 class="title--h4">This is a fourth level heading</h4>
<p>Donec vitae est sapien. Nulla facilisi. Quisque sed auctor ante, sed viverra elit. Quisque justo arcu, vulputate tempor odio ac, mollis blandit justo. Morbi viverra tincidunt leo vel mattis. Aliquam erat volutpat. Nunc tortor tellus, porta sit amet tellus sed, interdum condimentum ex. </p>
And the output would be:
<h2 class="title--h2">This is a second level heading</h2>
<p>Etiam vitae erat ullamcorper ipsum ultrices convallis ac quis nulla. Nam euismod imperdiet enim eu venenatis. Nulla non bibendum dui. Maecenas id tincidunt orci. Sed pellentesque ipsum et tempor convallis. Etiam elementum augue aliquet enim venenatis tincidunt. Praesent nunc dolor, vulputate nec aliquet consectetur, aliquet nec elit. Vivamus non eros nec nibh vestibulum lacinia. Morbi diam turpis, accumsan ac fringilla eget, fringilla vitae lorem. Ut consequat tortor orci, sed lobortis metus facilisis nec. Nulla sed enim in tortor blandit aliquet. Curabitur a finibus mi.</p>
<h3 class="title--h4">This is a fourth level heading</h3>
<p>Nullam blandit, mauris vel vestibulum aliquet, quam lectus laoreet mi, id euismod ligula augue sit amet velit. Suspendisse suscipit lacus quis mauris varius, sed cursus mi auctor. Nullam non augue in ante malesuada blandit. Nam eu purus commodo, porttitor odio commodo, tristique nunc. Suspendisse vitae vehicula turpis. Aenean turpis nibh, auctor ac mollis congue, iaculis id tortor. Morbi in est erat. Proin aliquam varius neque a sollicitudin. Vestibulum varius in urna sit amet hendrerit.</p>
<h4 class="title--h4">This is a fourth level heading</h4>
<p>Donec vitae est sapien. Nulla facilisi. Quisque sed auctor ante, sed viverra elit. Quisque justo arcu, vulputate tempor odio ac, mollis blandit justo. Morbi viverra tincidunt leo vel mattis. Aliquam erat volutpat. Nunc tortor tellus, porta sit amet tellus sed, interdum condimentum ex. </p>
Again, I realize this is going to lead to unintended structure (I included an example of this in the above demonstration), but this is what my clients are asking for, so I'm giving in.
The code I have so far will track the previous heading level and determine what the new level should be, but I'm having difficulty understanding how to actually replace the tags correctly. My understanding is that modifying the DOM with $node->replaceChild() is going to result in items getting skipped, because the DOM is changing while it's being traversed. Additionally, I'd like to retain all attributes on each heading, but I've been unable to locate a method for this; everything suggests copying individual attributes manually, but because this is CMS-driven, I'm worried that custom or unexpected attributes will be missed.
Here's the filter I have so far:
/**
 * Ensure heading levels are always in sequence
 *
 * @param string $content
 * @return string
 */
function namespace_fix_title_sequence(string $content): string {
    if (! (is_admin() && ! wp_doing_ajax()) && $content) {
        $DOM = new DOMDocument();

        /**
         * Use internal errors to get around HTML5 warnings
         */
        libxml_use_internal_errors(true);

        /**
         * Load in the content, with proper encoding and an `<html>` wrapper required for parsing
         */
        $DOM->loadHTML("<?xml encoding='utf-8' ?><html>{$content}</html>", LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

        /**
         * Clear errors to get around HTML5 warnings
         */
        libxml_clear_errors();

        /**
         * Use XPath to query headings
         */
        $XPath = new DOMXPath($DOM);
        $headings = $XPath->query("//*[self::h1 or self::h2 or self::h3 or self::h4 or self::h5 or self::h6]");

        /**
         * Track previous heading level
         */
        $previous_level = 1;

        foreach ($headings as $heading) {
            /**
             * Get the current level
             */
            $current_level = intval(preg_replace("/^h/", "", $heading->nodeName));

            /**
             * Determine the target level
             */
            $target_level = ($current_level - $previous_level <= 1 ? $current_level : $previous_level + 1);

            /**
             * DEBUG
             */
            echo "<p>Previous: {$previous_level}</p>";
            echo "<p>Current: {$current_level}</p>";
            echo "<p>Target: {$target_level}</p>";
            echo "<hr />";

            /**
             * Replace current level with target level
             */
            // ?

            /**
             * Update the previous level
             */
            $previous_level = $target_level;
        }

        /**
         * Save changes, remove unneeded tags
         */
        $content = implode(array_map([$DOM->documentElement->ownerDocument, "saveHTML"], iterator_to_array($DOM->documentElement->childNodes)));
    }

    return $content;
}
add_filter("the_content", "namespace_fix_title_sequence", 100, 1);
The ideal unrealistic solution
In an ideal world, the best solution would be to totally prevent the content writer from selecting incorrect heading levels in the interface of their WYSIWYG.
In the same way, you should arguably force them to provide non-empty alt text for images and labels for input fields, forbid empty links, etc.
At any given place in the document, they would only be allowed to insert a heading of level 1 to N+1, where N is the level of the previous heading.
Consider that adjustments would also possibly have to be propagated, i.e. changing an H3 into an H2 in the middle of the text should also change all the following H4 into H3 down to the next H2, and so recursively.
This is, as you see, not as easy as we may think at first.
Sadly, not only is it hard both to develop and to use, but writers are probably not ready for it either. Those who don't understand the need for correct structure will probably regard the restriction as a bug, or as a stupid software limitation on their freedom to write however they like.
Maybe you could decouple the heading level from the corresponding visual style to avoid frustration, but that quickly becomes even more complicated.
So the only things you can do are to educate content writers or, as you are proposing here, to try to fix the incorrect structure automatically.
Algorithm to fix heading structure
Before getting into the real task of DOM manipulation, let's talk a little about the algorithm. It is of course impossible to always fix the structure exactly the way the author intended, but the goal is still to choose the most probable thing the author wanted to do.
If we go back to your example, the author wrote H2, H4, H3, H3. Is the simplest fix, H2, H3, H3, H3, the most appropriate?
What about H2, H3, H4, H4? That alternative is based on the idea that if two elements are visually different, they were probably intended to be at different levels, and conversely, if two elements are visually identical, they were probably intended to be at the same level.
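For reference, here is a small sketch of the "simplest fix" only: the same rule as the $target_level line in the question's filter, written in Python. The function name clamp_heading_levels is an assumption, and the starting level of 1 mirrors $previous_level = 1 in the question. It does not attempt the visual-similarity heuristic.

```python
def clamp_heading_levels(levels, start=1):
    """Return levels where no heading is more than one level deeper than the one before it."""
    fixed = []
    previous = start
    for current in levels:
        # Keep the level unless it skips ahead; then step down only one level.
        target = current if current - previous <= 1 else previous + 1
        fixed.append(target)
        previous = target
    return fixed


simple_fix = clamp_heading_levels([2, 4, 3, 3])
question_example = clamp_heading_levels([2, 4, 4])
print(simple_fix, question_example)
```

With this rule, H2, H4, H3, H3 becomes H2, H3, H3, H3, and the question's H2, H4, H4 becomes H2, H3, H4, which is the unintended structure the question already points out.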
DOM manipulation
As far as I know, most DOM APIs I have seen (Java, JavaScript, PHP, C++, etc.) don't allow you to change an element's name in place. You must create a new node to do that.
You can't simply change an H4 into an H3 while retaining the inner structure untouched for example.
So, if you indeed can't change the element name in place, you need to:
1. Create a fragment F with the inner structure of the H4 you want to change into an H3. If fragments are unavailable, or if extracting a fragment is complicated in the DOM API you are using, you have to clone the child nodes one by one into an array.
2. Create the H3 node and put F into it.
3. Copy the attributes of the H4 onto the H3. There is probably no other way than to do it one by one.
4. Replace the H4 with the new H3. Alternatively, if there is no replaceChild method, insert the H3 before the H4 and then remove the H4.
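The steps above can be sketched as follows, using Python's standard xml.dom.minidom as a stand-in for whatever DOM API you have; the helper name rename_element and the sample markup are assumptions. Looping over the attribute map copies every attribute, including ones you did not anticipate, and children are moved rather than cloned.

```python
from xml.dom import minidom


def rename_element(old, new_name):
    """Replace `old` with a new element called `new_name`, keeping attributes and children."""
    new = old.ownerDocument.createElement(new_name)
    # Copy attributes one by one, whatever their names are.
    for name, value in old.attributes.items():
        new.setAttribute(name, value)
    # Move the children; appendChild detaches each one from `old` first.
    while old.firstChild is not None:
        new.appendChild(old.firstChild)
    old.parentNode.replaceChild(new, old)
    return new


doc = minidom.parseString(
    '<div><h2 class="title--h2">A</h2>'
    '<h4 class="title--h4" data-x="1">B <em>b</em></h4></div>'
)
# In minidom this returns a list built up front, so replacing nodes does not disturb it.
h4 = doc.getElementsByTagName("h4")[0]
h3 = rename_element(h4, "h3")
print(doc.documentElement.toxml())
```

If your DOM API returns a live node list instead, copy it into an array before looping so that replacements do not cause nodes to be skipped.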
How to wrap a text around a centered (round) image like this:
I tried this jsfiddle but the text goes behind the image and does not flow around it.
#circle {
float:positioned;
position: absolute;
top:10%;
left: 40%;
wrap-shape: circle(50%, 50%, 120px);
wrap-margin: 10px;
}
<div id="circle"><img src="http://www.guitare-rabuffetti.fr/test/circle.png"/></div>
<div>
Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Nam cursus. Morbi ut mi. Nullam enim leo, egestas id, condimentum at, laoreet mattis, massa. Sed eleifend nonummy diam. Praesent mauris ante, elementum et, bibendum at, posuere sit amet, nibh. Duis tincidunt lectus quis dui viverra vestibulum. Suspendisse vulputate aliquam dui. Nulla elementum dui ut augue. Aliquam vehicula mi at mauris. Maecenas placerat, nisl at consequat rhoncus, sem nunc gravida justo, quis eleifend arcu velit quis lacus. Morbi magna magna, tincidunt a, mattis non, imperdiet vitae, tellus. Sed odio est, auctor ac, sollicitudin in, consequat vitae, orci. Fusce id felis. Vivamus sollicitudin metus eget eros.
Pellentesque habitant morbi tristique senectus et netus et malesuada fames ac turpis egestas. In posuere felis nec tortor. Pellentesque faucibus. Ut accumsan ultricies elit. Maecenas at justo id velit placerat molestie. Donec dictum lectus non odio. Cras a ante vitae enim iaculis aliquam. Mauris nunc quam, venenatis nec, euismod sit amet, egestas placerat, est. Pellentesque habitant morbi tristique senectus et netus et malesuada fames ac turpis egestas. Cras id elit. Integer quis urna. Ut ante enim, dapibus malesuada, fringilla eu, condimentum quis, tellus. Aenean porttitor eros vel dolor. Donec convallis pede venenatis nibh. Duis quam. Nam eget lacus. Aliquam erat volutpat. Quisque dignissim congue leo.
Mauris vel lacus vitae felis vestibulum volutpat. Etiam est nunc, venenatis in, tristique eu, imperdiet ac, nisl. Cum sociis natoque penatibus et
</div>
As already noted, shape wrapping currently only works for floated elements, so this exact situation isn't doable with CSS alone, because wrapping is only permitted on one side of a shape (as expected). Once the CSS Shapes 2 and/or CSS Exclusions specs are adopted, we will be able to do this not only with shapes but also with image transparency.
I ran into this same problem while trying to figure out how shapes and CSS columns interact (spoiler: decent, but not organically). The problem seems to be that the layout algorithm looks for the farthest edge (ignoring the possibility of multiple sides), then starts content layout from that coordinate. For elements in the middle, this means you get text only on one side. For CSS columns (which is how I figured this out), the layout again starts from the farthest edge, but then continues straight down instead of wrapping to the shape on each line (see fiddle), so protrusions on shapes (like a star polygon) can actually force wrapping content to end up below the entire shape instead of squished to one side or flowing down into the protrusion.
(note there are 3 sets of 2 columns on 2nd example)
However, there are a couple options that may work for similar situations. I have adapted the following from the other answers/comments, but had to make several changes to get them working (and several of the CSS attributes were experimental and are no longer valid), so I felt this was better as a new answer than as edits/comments:
Wrap one side
Use shape-outside on a left floated div to create a wrapping circle, then use margin-left to push it away from the left side. I added a circle inside the div for illustration (your image URL is 404), but had to tweak the location as Chrome did not calculate its position the way one would expect once margins were added.
http://jsfiddle.net/brichins/50h20kxa/1/
Columns and mirrored wrapping elements
If columns are acceptable, manually (see above CSS column discussion) creating 2 containers for columns and placing a shaped element on the side of each gives the following:
http://jsfiddle.net/brichins/gvhpfccu/
The disadvantage here is that you get columns where you may have wanted a single block (not necessarily bad for readability), as well as having to compute an appropriate split for your content.
Reading
Intro and walkthrough on HTML5 Rocks: https://www.html5rocks.com/en/tutorials/shapes/getting-started/
References the amazing Alice in Wonderland example from 2013. It appears to not function completely anymore, but the entire talk is still interesting
Creating Non-Rectangular Layouts With CSS Shapes: https://sarasoueidan.com/blog/css-shapes/
CSS Tricks article: https://css-tricks.com/almanac/properties/s/shape-outside/
D3plus workaround plugin (for similar SVG solutions): https://d3plus.org/examples/utilities/a39f0c3fc52804ee859a/
Question resolved :
Actual situation :
CSS shapes only work on floated elements, so they cannot be used for centered images for now. This property works only in Chrome and Opera at the moment.
Maybe there will be a solution for non float elements in the future. Look at this W3C editor's draft : http://www.interoperabilitybridges.com/css3-floats/OriginalSubmition.html
A hand made CSS solution :
Basically, there are 2 columns (like in newspapers). The text begins in the left column and goes down. The text continues at the top of the right column and goes down. The columns are a bit taller than the image. The left column has an invisible half circle, as does the right column, at the position of the centered image. The two half circles are built from multiple invisible boxes of different lengths. (The height of the boxes is the height of the font.) The text must be justified. The text now flows around the half circles in each column. The image is positioned over the 2 invisible half circles.
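The box lengths for such a half circle can be computed rather than tuned by hand. This is only a sketch under assumptions: the function name half_circle_box_widths, the pixel units and the optional margin are mine, and each box is taken to be one line tall and as wide as the widest part of the circle within that line.

```python
import math


def half_circle_box_widths(radius, line_height, margin=0.0):
    """Widths of the stacked spacer boxes, top to bottom, for one half circle."""
    r = radius + margin
    rows = math.ceil(2 * r / line_height)
    widths = []
    for i in range(rows):
        top = -r + i * line_height           # distance from circle centre
        bottom = min(top + line_height, r)
        # The circle is widest where the row is closest to the centre line.
        if top <= 0 <= bottom:
            y = 0.0
        else:
            y = min(abs(top), abs(bottom))
        widths.append(math.sqrt(max(r * r - y * y, 0.0)))
    return widths


widths = half_circle_box_widths(radius=100, line_height=20)
print([round(w) for w in widths])
```

Each width would then be applied to one box in the stack, with the boxes floated towards the inner edge of each column so that the two stacks meet under the image.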
Another, not very technical solution is to use Libre Office and Inkscape to produce an SVG file.
Import the picture into Libre Office - wrap the text around the image - save as PDF - open Inkscape - save as SVG - import the SVG in your Webpage - done.
Thanks everybody for helping me and for your inputs !
I don't think it's possible since it relies on float and you can't float to the middle/center of the page.
Here's what I came up with:
[old fiddle]/5Lxc444p/8/
If you put the width style on the actual circle it works better than on a parent div.
Also, here's a good writeup on css shapes: http://www.html5rocks.com/en/tutorials/shapes/getting-started/
EDIT:
Here's an updated fiddle for the 2 column layout with absolute positioned circle.
http://jsfiddle.net/5Lxc444p/11/
I am generating a preview of a certain length from a text string. The text was made from an HTML string with the HTML code removed. For various reasons there are some JSON blocks within the text. These JSON blocks are placeholders used to retrieve information from a database, which replaces the JSON string on page load.
For the preview the JSON must not be in the string. Therefore I have to clean the string and remove the JSON blocks.
Here is an example of how the string may look:
Pellentesque et vulputate felis. {"bla":"blabla", "blubb":"blubablub"} Maecenas tortor ex, commodo eu massa a, vehicula cursus erat. Nam rhoncus, nunc ut lobortis pretium, libero lorem {"blurb":"blarblar", "blabb":"blabablurb", "test":"testatest"} facilisis urna, et gravida tellus turpis ut nisi. Nulla in ullamcorper metus. Sed sed blandit magna. Integer fermentum.
How do I get these two JSON blocks using regex and remove them?
{"bla":"blabla", "blubb":"blubablub"}
{"blurb":"blarblar", "blabb":"blabablurb", "test":"testatest"}
It works with Rematch() and a following cfloop over the array of JSON blocks. But is it possible with ReReplace()?
Just found the solution
ReReplace(mystring, "\{([^}]*)\}", "", "ALL")
Sry for bothering.
This solution doesn't work for JSON with nested objects but in my case it's enough.
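To illustrate both the pattern and the nesting limitation, here is the same regular expression applied with Python's re module instead of ColdFusion's ReReplace(); the function name strip_json_blocks and the sample strings are assumptions.

```python
import re

# Same idea as the ReReplace pattern: "{", anything except "}", then "}".
JSON_BLOCK = re.compile(r"\{[^}]*\}")


def strip_json_blocks(text):
    """Remove every flat {...} block from text."""
    return JSON_BLOCK.sub("", text)


flat = strip_json_blocks('felis. {"bla":"blabla", "blubb":"blubablub"} Maecenas')
nested = strip_json_blocks('a {"x": {"y": 1}} b')
print(repr(flat))
print(repr(nested))
```

For the nested input the match stops at the first closing brace, so a stray "}" is left behind, which is exactly the limitation mentioned above. Note also that the spaces on both sides of a removed block remain.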
I have a bunch of HTML text that looks like this:
<p><strong>Pellentesque habitant morbi tristique</strong> senectus et netus et malesuada fames ac turpis egestas.
Vestibulum tortor quam, feugiat vitae, ultricies eget, tempor sit amet,
ante. Donec eu libero sit amet quam egestas semper.
<em>Aenean ultricies mi vitae est.</em> Mauris placerat
This is text that a user will post on a forum, and this formatted string is then stored on a server. I am displaying this text on another page, but would like it to be rendered as the user formatted it, e.g.:
Pellentesque habitant morbi tristique senectus et netus et malesuada fames ac turpis egestas.
Vestibulum tortor quam, feugiat vitae, ultricies eget, tempor sit amet,
ante. Donec eu libero sit amet quam egestas semper.
Aenean ultricies mi vitae est. Mauris placerat
I have tried using <pre>, <p>, and other tags, but they just print out the raw HTML instead of applying the formatting. I am currently using AngularJS for my page.
Sample text obtained from http://html-ipsum.com/, "Kitchen Sink" example
You are likely storing the string HTML-encoded (with its special characters escaped as entities), so the tags are displayed literally.
I'd double check your raw data store to verify this.
In any case, THIS IS NOT A RECOMMENDED WAY TO APPROACH YOUR CODE. You are basically inviting a malicious user to inject malicious code into the pages your other users see.
When allowing users to input any html, it is best to only allow a small subset of tags (and a small subset of attributes), and even then it is very hard to get right.
See Cross-Site Scripting (XSS) Tutorial for more.
It sounds like, at some point during the process of storing your HTML, you have escaped the HTML entities (i.e. converted < to &lt;, that sort of thing).
I don't know what language you're working in, but it's possible to unescape HTML characters in pretty much any language. Here's an answer about doing it in JavaScript. The html_entity_decode() method will do the trick in PHP. For whatever language you're working in, just do research on "unescape html entities" in that language.
Warning: since you are unescaping HTML, there's the risk that the user might have written something naughty (i.e. like a <script> tag with some malicious JS code). Make sure you're cleaning out any nasty HTML somewhere along the line.
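As a small illustration of what escaping and unescaping do, here is a sketch using Python's standard html module (the counterpart of PHP's html_entity_decode()); the sample string is made up. Note that unescaping only restores the markup and does nothing to sanitize it.

```python
import html

# What the stored value probably looks like if it was escaped on the way in.
stored = "&lt;p&gt;&lt;strong&gt;Pellentesque habitant&lt;/strong&gt; senectus&lt;/p&gt;"

# Unescaping turns the entities back into real markup.
restored = html.unescape(stored)
print(restored)

# Escaping again gives back the stored form, which browsers display literally.
print(html.escape(restored) == stored)
```

Any cleaning of dangerous tags and attributes still has to happen separately before the restored markup is sent to a browser.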
What would be the best method for replacing variables/words/lines of text in a larger "paragraph" of code?
Example:
Lorem ipsum dolor $SIT amet, consectetur adipiscing elit. Aliquam condimentum dolor ut est faucibus dapibus. Donec molestie dictum nisi, eu euismod $SAPIEN gravida in. Aliquam dictum, tellus eu facilisis laoreet, sapien nunc placerat turpis, eu pretium augue eros vel lectus. Quisque condimentum lorem $EROS, vel pharetra tortor.
I want to be able to enter text in a textbox/prompt to replace the "Variables" $SIT, $SAPIEN, $EROS with actual values automatically.
I trust I've made myself obscure? :P
I'm n00b at any sort of coding. I only know some basic HTML, PHP, and Java. But please give me a clear solution with an example or link or more help.
Thanks so much!
You need JavaScript if you want to do it client-side, or any server-side language (PHP, Python, Ruby) if you want to do it that way. All of these languages have some equivalent of a "string replace" function; PHP's, for example, takes a list of strings to search for, a list of replacements, and the subject to work on. Solutions for PHP and JS:
http://php.net/manual/en/function.str-replace.php
http://www.w3schools.com/jsref/jsref_replace.asp
The way that you'll do it is up to you.
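If you go the Python route, one possible sketch uses string.Template from the standard library, which happens to use the same $NAME placeholder style as your example. The replacement values below are made up; in practice they would come from your textbox or prompt.

```python
from string import Template

text = (
    "Lorem ipsum dolor $SIT amet, consectetur adipiscing elit. "
    "Donec molestie dictum nisi, eu euismod $SAPIEN gravida in. "
    "Quisque condimentum lorem $EROS, vel pharetra tortor."
)

# Hypothetical user-supplied values, keyed by variable name without the "$".
values = {"SIT": "sit", "SAPIEN": "sapien", "EROS": "eros"}

# safe_substitute leaves unknown $NAMES untouched instead of raising an error.
filled = Template(text).safe_substitute(values)
partial = Template("dolor $SIT amet, $EROS, vel").safe_substitute({"SIT": "sit"})
print(filled)
print(partial)
```

Using safe_substitute means a variable the user has not filled in yet simply stays visible in the text.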