PHP/MySQL: store formatting of text properly? - html

I'm writing note-taking software in PHP, and the notes often include code. When I fetch a note from the database, all horizontal whitespace appears to be collapsed, so code blocks look ugly. (I already run nl2br() on it, so line breaks are fine; I mean the horizontal spacing.)
What would be the most efficient way to deal with this? I think the database entry keeps the spaces, so would replacing all spaces with &nbsp; on the PHP display side be the only solution? That seems ugly for very long entries. How can I accomplish this, bearing in mind that the code may be 1-16M characters long?

It shouldn't be collapsing all whitespace. Try outputting it inside <pre> tags to see that white space.
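A minimal sketch of the <pre> suggestion, written in Python only to keep it language-neutral (the function name render_note is made up for illustration). It escapes the stored text and wraps it in <pre>, so the browser keeps spaces, tabs and newlines without any per-space replacement. In PHP the escaping step would typically be htmlspecialchars().

```python
import html


def render_note(note: str) -> str:
    """Escape a stored note and wrap it in <pre> so whitespace is shown as stored."""
    # html.escape turns <, >, & and quotes into entities; whitespace is untouched.
    return "<pre>" + html.escape(note) + "</pre>"


if __name__ == "__main__":
    stored = "if (x < 1) {\n    echo 'hi';\n}"
    print(render_note(stored))
```

Because the text is wrapped once rather than rewritten space by space, the cost stays linear in the note length and the output is barely larger than the input.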

What kind of code are you storing in the database? HTML? PHP? This will determine the best solution to your problem.
Different column types will or won't preserve characters like new lines, carriage returns or tabs. I use TEXT with a UTF-8 collation.
At a very basic level, look at nl2br(): http://php.net/manual/en/function.nl2br.php

Related

Using HTML entities in .Net resources

Is there a way to prevent .Net/Razor from escaping HTML entities in .Net resources? We have a web application that needs to be available in several languages. The problem is that the texts take up different amounts of space depending on the language. For example, when a TH element contains "Shipment reference" in English, the browser breaks it into two lines, which is fine. In Danish it says "Forsendelsesreference", which does not get split.
We want to fix that by inserting an HTML soft hyphen entity. However, when we do that, it gets escaped, and the page shows "Forsendelses&shy;reference".
We can see two ways to avoid that. One is to wrap the content of every label and TH element in @Html.Raw. Another is to identify those labels and headers that use a resource with a soft hyphen, and wrap only that content in @Html.Raw. Neither is very appealing.
Is there a way to just disable escaping of text from resources in general? It is acceptable to disable escaping of all text that comes from @class.property, since we use that only for resources. Anything from the user we get from the model or from Ajax.
As suggested by Sami Kuhmonen above, you can use actual soft hyphens instead of HTML entities.
You can just use Unicode soft hyphens. However, those characters are invisible, making your resource file hard to read. You can also use numeric XML character entities in the resource file, like this: Forsendelses&#173;reference
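As a small illustration of the readability point, here is a hedged Python sketch (the helper name is hypothetical) that rewrites invisible U+00AD characters in a resource value as the numeric entity &#173;, which is the same character once an XML parser reads it back.

```python
SOFT_HYPHEN = "\u00ad"  # U+00AD SOFT HYPHEN, invisible in most editors


def make_soft_hyphens_visible(value: str) -> str:
    """Replace raw soft hyphens with the numeric XML entity for the same character."""
    return value.replace(SOFT_HYPHEN, "&#173;")


if __name__ == "__main__":
    print(make_soft_hyphens_visible("Forsendelses\u00adreference"))
```

This assumes the value is pasted into the raw XML of the resource file; an editor that escapes ampersands on save would turn the entity into literal text.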

Is there an "invisible" hyphen character in Unicode / HTML?

I've found the soft hyphen character (U+00AD SHY) very useful but I am wondering if there is the same thing that will tell the browser where to break long words for wrapping without adding any character at all?
For example, let's say you have a narrow column in HTML with newspaper justification and there is a long URL explicitly in the text itself. You could add the soft hyphen I mentioned, but then when a user copies and pastes the URL it will contain those hyphen characters. The ideal would be the same visual result without a hyphen character, so that the user can copy and paste the long word(s).
Thoughts or suggestions?
I tried searching for this but most of what I come up with is non-breaking space characters and essentially I am looking for the opposite.
UPDATE: I found the ZERO-WIDTH SPACE (U+200B) but it still has the problem that the character is preserved during copy&paste into the address bar so the results are even more confusing to the end user.
You want the HTML5 tag <wbr>, which is specified to do exactly what you are asking for.
If you can't rely on HTML5, U+200B ZERO WIDTH SPACE (&#8203;) should also work.
(The effects of copying text out of an HTML document, unfortunately, are underspecified. If <wbr> doesn't do what you want upon copy-and-paste, you might want to bring it up with the WHATWG; the easiest way to do that is probably to file a GitHub issue on the spec.)
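To make the <wbr> suggestion concrete, here is a rough Python sketch that emits a long URL as HTML with a <wbr> after each separator character. The function name and the choice of separator characters are assumptions for illustration, not anything prescribed by the spec.

```python
import html


def url_with_break_hints(url: str, break_after: str = "/.?&=-_") -> str:
    """Return HTML for `url` with a <wbr> break opportunity after each separator."""
    pieces = []
    for ch in url:
        pieces.append(html.escape(ch))  # keep &, <, > safe inside HTML
        if ch in break_after:
            pieces.append("<wbr>")      # the browser may wrap here; no character is added
    return "".join(pieces)


if __name__ == "__main__":
    print(url_with_break_hints("http://example.com/some/very/long/path?a=1&b=2"))
```

Since <wbr> is an element rather than a character, the text content of the URL is unchanged, which is what should make copy and paste behave, subject to the underspecified copy behaviour mentioned above.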

Why is CSS Formatted Without Whitespace?

Sometimes when I look at the style sheets of big websites (even this one), the CSS code has all its formatting stripped (or whatever you call it), like this: http://cdn.sstatic.net/stackoverflow/all.css
Is this just the result of the style sheet being generated by a CMS?
I call it "minified", and I think that's the general term. But the reason is to reduce loading times. All those useless spaces and comments still count as bytes, and sometimes you can have more spaces and comments than actual effective characters! (It also obfuscates the stylesheets, although that's really pointless as spaces can easily be restored with whatever formatting you need.)
It's probably generated on the fly from a more scriptable/dynamic/DRY layout language, and there is simply no reason to add the whitespace, since no one should be reading the output and it would only add to the file size.
It can be generated by a CMS or manually. Removing all the tabs and spaces reduces the size of the file, so it loads faster, which in turn can make the site faster.
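For a feel of what minification does, here is a deliberately naive Python sketch (minify_css is a made-up name). It strips comments and redundant whitespace with regular expressions; real minifiers parse the CSS properly, and this version would mangle strings or unusual selectors that contain the same punctuation.

```python
import re


def minify_css(css: str) -> str:
    """Naive CSS minifier: drop comments and whitespace that carries no meaning."""
    css = re.sub(r"/\*.*?\*/", "", css, flags=re.DOTALL)  # remove /* comments */
    css = re.sub(r"\s+", " ", css)                        # collapse whitespace runs
    css = re.sub(r"\s*([{};,])\s*", r"\1", css)           # no spaces around { } ; ,
    css = re.sub(r":\s+", ":", css)                       # "color: red" -> "color:red"
    css = css.replace(";}", "}")                          # last semicolon is optional
    return css.strip()


if __name__ == "__main__":
    source = "/* header */\nbody {\n    color: red;\n    margin: 0 auto;\n}\n"
    small = minify_css(source)
    print(small)
    print(len(source), "->", len(small), "characters")
```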

what are the disadvantages of having tons of entities?

I've been writing a source-to-display converter for a small project. Basically, it takes an input and transforms the input into an output that is displayable by the browser (think Wikipedia-like).
The idea is there, but it isn't like the MediaWiki style, nor is it like the Markdown style. It has a few innovations of its own. For example, when the user types in a chain of spaces, I would presume he wants the spaces preserved. Since HTML ignores spaces by default, I was thinking of converting these chains of spaces into the respective &nbsp;s (for example 3 spaces in a row converted to 1 )
So I can foresee the possibility of a ton of &nbsp; tags per post (and a single page may have multiple posts).
I've been hearing a lot of anti-&nbsp; sentiment on the web, but most of it boils down to readability headaches (in this case, the input is supplied by the user; if he decides to make his post unreadable he can do so with any of the other formatting actions supplied) or maintenance headaches (which does not apply here, since it's converted output).
I'm wondering what the disadvantages are of having tons of &nbsp; tags on a webpage?
You are rendering every space as &nbsp;?
Besides wasting so much bandwidth, this will not allow dynamic line breaking as "nbsp" means "*n*on *b*reaking *sp*ace". This will most probably cause much trouble.
If it's just being dumped to a client, it's just a matter of size, and if it's gzipped, it barely matters in terms of network traffic.
It'll slow down rendering, I'm sure, and take up DOM space, but whether or not that matters depends on stuff I don't know about your use case(s). You might be able to achieve the same result in other ways, too; not sure.
&nbsp;s aren't tags, but are character entities like &copy;, &lt;, &gt;, etc.
I'd say that the disadvantages would be readability. When I see a word, I expect the spacing to be constant (unless it is in a block of justified text).
Can you show me a case where you'd need &nbsp;s?
Have you considered trying to figure out what the user, by inserting those spaces, is really trying to achieve? Rather than the how (they want to insert the spaces), the what (if the spaces are at the beginning of a line, they want to indent the text in question).
An example of this is many programming sites convert 4 spaces at the start of a line to a pre+code block.
For your purposes, maybe it should be a <block> block.
The end goal being that of converting the spaces not to what the user (with their limited resources) intended to show up there but, rather, what they meant to convey with it.
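The "convert intent, not spaces" idea can be sketched as follows, assuming the same rule many programming sites use: lines indented by four spaces form a code block, everything else is an ordinary paragraph. The function name and the exact rule are illustrative assumptions, not the asker's converter.

```python
import html


def render_post(text: str) -> str:
    """Turn 4-space-indented lines into <pre><code> blocks and other lines into <p>."""
    out = []
    code = []

    def flush_code():
        # Emit any pending indented lines as one preformatted block.
        if code:
            out.append("<pre><code>" + html.escape("\n".join(code)) + "</code></pre>")
            code.clear()

    for line in text.splitlines():
        if line.startswith("    "):
            code.append(line[4:])  # strip the marker indent, keep any deeper indent
        else:
            flush_code()
            if line.strip():
                out.append("<p>" + html.escape(line.strip()) + "</p>")
    flush_code()
    return "\n".join(out)


if __name__ == "__main__":
    print(render_post("Intro\n    x = 1\n        y = 2\nDone"))
```

Inside <pre> the remaining spaces are preserved by the browser, so no &nbsp; entities are needed at all.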

Shortened HTML text and malformed tags

In my web application I intend to shorten a lengthy string of HTML formatted text if it is more than 300 characters long and then display the 300 characters and a Read More link on the page.
The issue I came across is when the 300 character limit is reached inside an HTML tag, example: (look for HERE)
<a hreHEREf="somewhere">link</a>
<a href="somewhere">liHEREnk</a>
When this happens, the entire page could become ill-formatted because everything after the HERE in the previous example is removed and the HTML tag is kept open.
I'm thinking of using CSS to hide any overflow beyond a certain limit and creating the "Read More" link if the text is longer than a certain length, but this would mean including all the text on the page.
I've also thought about splitting the text at . to ensure that it's split at the end of a sentence, but that would mean I would include more characters than I needed.
Is there a better way to accomplish this?
Note: I have not specified a server side language because this is more of a general question, but I'm using ASP.NET/C# .
Extract the plaintext from the HTML, and display that. There are libraries (like the HTML Agility Pack for .NET) that make this easy, and it's not too hard to do it yourself with an XML parser. Trying to fix a truncated HTML snippet is a losing cause.
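A minimal sketch of the plain-text approach, using Python's standard html.parser instead of the HTML Agility Pack mentioned above (the class and function names are made up for illustration). It collects only text nodes, normalises whitespace, and then truncates, so a cut can never land inside a tag.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects text nodes and ignores all tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)  # decode entities such as &amp;
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def plain_text_summary(markup: str, limit: int = 300) -> str:
    """Return at most `limit` characters of the text content of `markup`."""
    collector = _TextCollector()
    collector.feed(markup)
    collector.close()
    text = " ".join("".join(collector.chunks).split())  # collapse whitespace runs
    if len(text) <= limit:
        return text
    return text[:limit].rstrip() + "..."


if __name__ == "__main__":
    print(plain_text_summary('<p>Hello <a href="x">world</a> &amp; more</p>', 11))
```

Note that this simple collector also keeps the contents of script and style elements, and the result must be HTML-escaped again before it is written into a page.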
One option I can think of is to cut it off at 300 characters and check that the last index of '<' is less than the last index of '>'. If it isn't, the cut landed inside a tag, so truncate the string right before the last instance of '<'. Then use a library like HTML Tidy to fix tags that are left without a partner (like the unclosed <a> in the example).
There are problems with this, though. One is that if there are 300 characters' worth of nothing but HTML, your summary will be displayed as empty.
If you do not need the HTML to be displayed, it's far easier to simply extract the plain text and use that instead.
EDIT: Added using something like HTML Tidy for orphaned tags. The original answer only solved cutting mid-tag, rather than between an opening and a closing tag.
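The index check from this answer can be sketched in a few lines of Python (cut_outside_tags is a hypothetical name). If the last '<' in the cut snippet comes after the last '>', the cut landed inside a tag, so the snippet is trimmed back to just before that '<'. It does not close elements that remain open; that is still a job for a tidy-style library.

```python
def cut_outside_tags(markup: str, limit: int = 300) -> str:
    """Cut `markup` to at most `limit` characters without ending inside a tag."""
    snippet = markup[:limit]
    last_open = snippet.rfind("<")
    last_close = snippet.rfind(">")
    if last_open > last_close:
        # An unterminated "<..." remains at the end: drop the partial tag.
        snippet = snippet[:last_open]
    return snippet


if __name__ == "__main__":
    html_text = 'see <a href="somewhere">link</a>'
    print(repr(cut_outside_tags(html_text, 10)))  # cut falls inside the <a ...> tag
    print(repr(cut_outside_tags(html_text, 26)))  # cut falls inside the link text
```

The check assumes that '<' and '>' only appear as tag delimiters; a literal '>' inside an attribute value would confuse it.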