Shortened HTML text and malformed tags - html

In my web application I intend to shorten a lengthy string of HTML formatted text if it is more than 300 characters long and then display the 300 characters and a Read More link on the page.
The issue I came across is when the 300 character limit is reached inside an HTML tag, example: (look for HERE)
<a hreHERE="somewhere">link</a>
<a hre="somewhere">liHEREnk</a>
When this happens, the entire page could become ill-formatted because everything after the HERE in the previous example is removed and the HTML tag is kept open.
I thinking of using CSS to hide any overflow beyond a certain limit and create the "Read More" link if the text is beyond a certain number, but this would entail me including all the text on the page.
I've also thought about splitting the text at . to ensure that it's split at the end of a sentence, but that would mean I would include more characters than I needed.
Is there a better way to accomplish this?
Note: I have not specified a server side language because this is more of a general question, but I'm using ASP.NET/C# .

Extract the plaintext from the HTML, and display that. There are libraries (like the HTML Agility Pack for .NET) that make this easy, and it's not too hard to do it yourself with an XML parser. Trying to fix a truncated HTML snippet is a losing cause.

One option I can think of is to cut it off at 300 characters and make sure the last index of '<' is less than the last index of '>'. If it is, truncate the string right before the last instance of '>', then use a library like tidy html to fix tags that are orphaned (like the </a> in the example).
There are problems with this though. One thing being if there are 300 chars worth of nothing but HTML - your summary will be displayed as empty.
If you do not need the html to be displayed it's far easier to simply extract the plain text and use that instead.
EDIT: Added using something like tidy html for orphaned tags. Original answer only solved cutting thing mid-tag, rather than within an opening/closing tag.

Related

How to insert enter within the text of html code?

I am writing up a resume, and the enter, space, and other characters do not show up. It's a poorly coded website. Is there a way to include enter key without accessing the source of the html code, or including html elements? If I insert < br > element, it shows up as & lt; br &gt ;
&#10
Chava G's suggestion might work, but it really shouldn't (it would indicate a serious XSS vulnerability).
You can try something similar though, with an no-break-space unicode character.
This is exactly the same thing that he is suggesting with except I am suggesting you use the actual character itself.
I think you can just copy it from the wiki page, right between the quotes where it says non-breaking space (" ").
https://en.wikipedia.org/wiki/Non-breaking_space
This route is a bit iffy as well though, sometimes unicode explodes.
You may be better of just accepting the limitations.
HTML ignores extra white-space. Use <br /> in place of Enter, and for the space character.
You can also use the <pre> tag. Any text inside this tag will show up as formatted, including spaces and the enter key.
However, since you are using a third-party website here, Wedstrom is right that the above shouldn't work. There is no way for you to change or add HTML code on another party's site, and there shouldn't be. Try to find a way to write up your resume using the functionality that is readily provided for you (or use your own text editor...).

Regular Expressions to fix invalid HTML

I have hundreds of files (ancient ASP and HTML) filled with outdated and often completely invalid HTML code.
Between Visual Studio and ReSharper, this invalid HTML is flagged and easily visible if the editor window is scrolled to where the invalid HTML appears. However, neither tool is providing any method to quickly fix the errors across the whole project.
The first few errors on which ReSharper focuses my attention are tags that are either not closed or closed but not opened. Sometimes this occurs because the opening and closing tags overlap - for instance:
<font face=verdana size=5><b>some text</font></b>
<span><p>start of a paragraph
with multiple lines of <i><b>text/hmtl
</i> with a nice mix of junk</b>
</span></p>
Sometimes opening tags without a corresponding closing tag were allowed in older versions of HTML (or the tools which generated the HTML didn't care about the standards as some browsers usually figured out what the author meant). So the mess I'm attempting to clean up has many unclosed HTML tags that ought to be closed.
<font face = tahoma size=2>some more text<b><sup>*</sup></b>
...
...
</body>
</html>
And just for good measure, the code includes lots of closing HTML tags that have no matching start tag.
</b><p>some text that is actually within closed tags</p>
</td>
</tr>
</table>
So, other than writing a new application to parse, flag, and fix all these errors - does anyone have some .Net regular expressions that could be used to locate and preferably fix this stuff with Visual Studio 2012's Search and Replace feature?
Though a single expression that does it all would be nice, multiple expressions that each handle one of the above cases would still be very helpful.
For the case of overlapped HTML tags, I'm using this expression:
(?n)(?<t1s>(?><(?<t1>\w+)[^>]*>))(?<c1>((?!</\k<t1>>)(\n|.))*?)(?<t2s>(?><(?!\k<t1>)(?<t2>(?>\w+))[^>]*>))(?<c2>((?!(</(\k<t1>|\k<t2>)>))(\n|.))*?)(?<t1e></\k<t1>>)(?<c3>(?>(\n|.)*?))(?<t2e></\k<t2>>)
Explanation:
(?n) Ignore unnamed captures.
(?<t1s>(?><(?<t1>\w+)[^>]*>)) Get the first tag, capturing the full tag and attributes
for replacement and the name alone for further matching.
(?<c1>((?!</\k<t1>>)(\n|.))*?) Capture content between the first and second tag.
(?<t2s>(?><(?!\k<t1>)(?<t2>(?>\w+))[^>]*>)) Get the 2nd tag, capturing the full
tag and attributes for replacement, the name along for further matching, and ensuring
it does not match the 1st tag and that the first tag is still open.
(?<c2>((?!(</(\k<t1>|\k<t2>)>))(\n|.))*?) Capture content between the second tag
closing of the first tag.
(?<t1e></\k<t1>>) Capture the closing of the first tag, where the second tag is
still open.
(?<c3>(?>(\n|.)*?)) Capture content between the closing of the first tag and the closing
of the second tag.
(?<t2e></\k<t2>>) Capture the closing of the second tag.
With this replacement expression:
${t1s}${c1}${t2s}${c2}${t2e}${c3}${t1e}
The issues with this search expression is that it is painfully slow. Using . instead of (\n|.) for the three content captures is much quicker, but limits the results to just those where the overlapped tags and intervening content are on a single line.
The expression will also match valid, properly closed and properly nested HTML if the first tag appears inside the content of the second tag, like this:
<font color=green><b>hello world</b></font><span class="whatever"><font color=red>*</font></span>
So it is not safe to use the expression in a "Replace All" operation, especially across the hundreds of files in the solution.
For unclosed tags, I've successfully handled the self-closing tags: <img/>, <meta/>, <input/>, <link/>, <br/>, and <hr/>. However, I've still not attempted the generic case for all the other tags - those that may have content or should be closed with a separate closing tag.
Also, I've no idea how to match closing tags without a matching opening tag. The simple solution of </\w+> will match all closing tags regardless of whether or not they have a matched opening tag.
According to their website, Resharper has this feature:
Solution-Wide Analysis
Not only is ReSharper capable of analyzing a specific code file for errors, but it can extend its analysis skills to cover your whole solution.
...
All you have to do is explicitly switch Solution-Wide Analysis on, and then, after it analyzes the code of your solution, view the list of errors in a dedicated window:
[
Even without opening that window, you can still easily navigate through errors in your solution with Go to Next Error in Solution (Shift+Alt+PageDown) and Go to Previous Error in Solution (Shift+Alt+F12) commands.
Your current "solution" is to use regexes on a context-sensitive language (invalid HTML). Please, NO. People flip out already when people suggest parsing context-free languages with regexes.
On second thought, there might be a solution that we can use regexes for.
For this HTML:
<i><b>text/html
</i> with a nice mix of junk</b>
A better transformation would be (it's more valid, right?):
<i><\i><b><i>text/hmtl
</i> with a nice mix of junk</b>
There are many ways this could go wrong, (although it's pretty bad as-is), but I assume you have this all backed up. This regex (where i is an example of a tag you may want to do this with):
<(i(?: [^>]+)?)>([^<]*)<(\/?[^i](?: [^>]+)?)>
Might help you out. I don't know how regex replace works in whatever flavor you're using, but if you replace $0 (everything matched by the regex) with <$1>$2</$1><$3><$1>, you'll get the transformation I'm talking about.

Having the HTML of a webpage, how to obtain the visible words of that webpage?

Having the HTML of a webpage, what would be the easiest strategy to get the text that's visible on the correspondent page? I have thought of getting everything that's between the <a>..</a> and <p>...</p> but that is not working that well.
Keep in mind as that this is for a school project, I am not allowed to use any kind of external library (the idea is to have to do the parsing myself). Also, this will be implemented as the HTML of the page is downloaded, that is, I can't assume I already have the whole HTML page downloaded. It has to be showing up the extracted visible words as the HTML is being downloaded.
Also, it doesn't have to work for ALL the cases, just to be satisfatory most of the times.
I am not allowed to use any kind of external library
This is a poor requirement for a ‘software architecture’ course. Parsing HTML is extremely difficult to do correctly—certainly way outside the bounds of a course exercise. Any naïve approach you come up involving regex hacks is going to fall over badly on common web pages.
The software-architecturally correct thing to do here is use an external library that has already solved the problem of parsing HTML (such as, for .NET, the HTML Agility Pack), and then iterate over the document objects it generates looking for text nodes that aren't in ‘invisible’ elements like <script>.
If the task of grabbing data from web pages is of your own choosing, to demonstrate some other principle, then I would advise picking a different challenge, one you can usefully solve. For example, just changing the input from HTML to XML might allow you to use the built-in XML parser.
Literally all the text that is visible sounds like a big ask for a school project, as it would depend not only on the HTML itself, but also any in-page or external styling. One solution would be to simply strip the HTML tags from the input, though that wouldn't strictly meet your requirements as you have stated them.
Assuming that near enough is good enough, you could make a first pass to strip out the content of entire elements which you know won't be visible (such as script, style), and a second pass to remove the remaining tags themselves.
i'd consider writing regex to remove all html tags and you should be left with your desired text. This can be done in Javascript and doesn't require anything special.
I know this is not exactly what you asked for, but it can be done using Regular Expressions:
//javascript code
//should (could) work in C# (needs escaping for quotes) :
h = h.replace(/<(?:"[^"]*"|'[^']*'|[^'">])*>/g,'');
This RegExp will remove HTML tags, notice however that you first need to remove script,link,style,... tags.
If you decide to go this way, I can help you with the regular expressions needed.
HTML 5 includes a detailed description of how to build a parser. It is probably more complicated then you are looking for, but it is the recommended way.
You'll need to parse every DOM element for text, and then detect whether that DOM element is visible (el.style.display == 'block' or 'inline'), and then you'll need to detect whether that element is positioned in such a manner that it isn't outside of the viewable area of the page. Then you'll need to detect the z-index of each element and the background of each element in order to detect if any overlapping is hiding some text.
Basically, this is impossible to do within a month's time.

Text style affecting the whole site

I've got an input so the user can type either html or plain text. When the user copy & paste text from MS Word, for example, it generates a weird html. Then, when you view that topic, you can see the whole page's style is affected. I don't really know if the generated html has unclosed tags or something, but it looks like it does and thus, the style of the page is affected.
Does anybody know how to "isolate" the html of that div(or whatever the container be) from the whole page's style?
Short of showing the content in an IFRAME, you can't really do that. What I usually do in this situation is apply tag stripping logic to the content as it comes in. You really don't want to allow arbitrary HTML from a security perspective, but even if you don't care what your users input, you should be stripping out invalid HTML tags (Word has a habit of creating tags with weird namespace-looking things like o:p) and running something like Tidy over the result to ensure every tag is properly closed. There are a number of Tidy libraries for .NET out there; here's one.
Here's a quick cut-and-paste of how I've done this in the past. Note that the class implements an interface from the project I used it in, but you get the general idea.
Copying text from word can include <style> tags. The only sure way to isolate these styles is to put the input control in an <iframe>
You can either sanitize the input or display it in an IFrame.
It it were me I'd strip all but basic formatting (e.g., bold, italics) and use Tidy. That's what I end up doing, I strip and convert all the CSS styles of word into <strong>, <em>, etc.

Forcing ending tags in HTML segments or ignoring missing ending tags

When creating an rss feed that shows a subset of a larger html doc (first x characters) I've run into an issue where some tags begin in the "first x characters" but the ending tag is outside that range. This can cause some fun problems if the consumer of the feed is trying to render the html in the feed in that it can cause unexpected rendering issues in the page showing the feed.
I'm assuming this is a common problem that rss feed writers and readers solved long ago, but I cannot seem to figure out how to achieve it short of trying to parse the html in the feed and add missing end tags which could get messy. Any suggestions would greatly be appreciated. Thanks in advance.
Chris
If you use php, an excellent solution is HTMLPurifier. It will clean it up and make it completely safe to retransmit.
Not sure if this would work for your project, but I use HTML Tidy for this in FeedDemon.
Where does the larger document come from? If there is source text from which the HTML is generated, it's much easier to truncate that and re-generate the HTML from the truncated version than it is to deal with the problems of handling partial HTML. To do this at all properly you'd basically need to be re-parsing and serialising the HTML all over again.
HTML inside RSS is still troublesome, anyhow. You might be better off stripping all the tags and doing a simple text truncate on what's left.