Any reason not to write my own HTML? - html

In Seaside, in all those renderContentOn: methods, I can use the HTML canvas object to assemble my DOM tree.
I am writing a bunch of helpers for my components currently, because I'm using Twitter Bootstrap for the styling and don't want to write all that boilerplate code (<div>s en mas) all the time.
For the way this is setup, the easiest way for me is to simply (I want to avoid using with: aBlock in those helpers) write out the HTML for the wrapping DIVs like this:
html html: '<div class="control-group">'.
Is there any reason for me not to do this? Any downsides?

There are various advantages in using the HTML canvas:
The HTML canvas ensures valid tags, a valid tag structure, that all tags are properly closed (at compile time), and that contents is properly escaped.
The HTML canvas ensures valid attributes, that all attributes are properly closed, and that contents is properly escaped.
As a consequence of the above two the HTML canvas automatically avoids the possibility of cross-site scripting (XSS) vulnerabilities.
The HTML canvas enables better reusability by enabling composition of tags (simple function calls), presenters (renderOn: in Objects), and components (renderContentOn: of components).
The HTML canvas avoids generating unnecessary whitespaces.
The use of HTML canvas enables one to use the standard tools the Smalltalk IDE provides on HTML code: senders, implementors, refactoring engine (extract to method, extract to component, inline method, automatic rewrite, ...), etc.
I agree that in some rare cases it is not worth to use HTML canvas: For example, when large static junks coming from an external source need to be embedded into a page.

I don't think there is a real downside to render static html pieces like that.
However, you might want to check out the Seaside integration of Twitter bootstrap: http://twitterbootstrap.seasidehosting.st/

To rephrase one of Lukas' arguments: it basically is not DRY. If you're using it only once, there is no problem. If you have to use this multiple times, the canvas allows you to use all the clean reuse capabilities smalltalk offers you.

Related

make zend_form render into json

I am making a new version of an "old" software based on the Zend_Frameword 1.x,
this project use zend_form to create versatiles forms I like this aspect of the code, and i'd like to keep it.
But I need to be cross domain and to render the html with javascript, so I'd like Zend_Form to render in JSONP.
on the past I already experience the hard to understand decorators of zend_form and i don't think that I can use them to render in JSON, so, I think that the only way to go is to extend Zend_Form, to implement my own render method, but event so, I have the feeling that it won't be easy...
I'm this afraid of Zend_Form that I am thinking that maybe rendreing the form in html and parsing it on the javascript side to create a structure sufficient to then use it with handlebars (the javascript template engine) would be easier
Parsing html seems suboptimal to me.
Your initial idea - creating your own form class extending Zend_Form and adding a renderJsonp() method - seems a pretty reasonable approach. In the simplest case, your rendering could extract the form attributes and then iterate over the elements, extracting their attributes.
However, remember that a Zend_Form instance is more than just a list of form attributes and elements with their attributes. There are potentially display groups (typically representing html fieldsets) and validators attached to form elements. Also, even if you do not change the decorators used to render the form, they are part of the rendering. A json representation of the form that does not reflect those decorators would probably render into html differently than you experience when Zend_View is responsible for the rendering in a view-script.
While it may be possible to have that displaygroup and validator information reflected in the structure of your json output, it seems unlikely that you would be able to easily automate creation of client-side validation that perfectly parallels that on the server-side side.
One approach I have seen is to have your client-side front-end make an AJAX request to a specially defined route that renders the full html of the form. While I am not a fan of this approach, it does have the advantage that the form remains essentially DRY, all the information required for rendering the form resides back on the server side but is still available to the client. The only thing missing is extracting the server-side validation for use on the client-side.
Just thinking out loud, so not a great answer, but it seemed too long for a comment.

Encapsulate the rendering of HTML

Can someone explain me what it means when they say:
"HTML Helpers enables to encapsulate the rendering of HTML"
Does it mean it will restrict access to the HTML attributes of an object? What if I am using CSS, jquery, Javascript, will it display my JS/jquery and CSS in page source?
HTML helpers are bits of code that will render out the HTML, instead of you writing out the HTML for each and every piece of data directly in the view.
They encapsulate it - meaning they take care of what HTML is output.
By "encapsulating the rendering", they are referring to the encapsulation of the functionality of rendering html.
If there is a common piece of html you render often, (the rails form helpers for example), if you insert the code (or the code building functionality within a helper), each time you need to render that html (or a variant of it) , you merely call the helper method thereby keeping your code "dry" (html code in this case)

Having the HTML of a webpage, how to obtain the visible words of that webpage?

Having the HTML of a webpage, what would be the easiest strategy to get the text that's visible on the correspondent page? I have thought of getting everything that's between the <a>..</a> and <p>...</p> but that is not working that well.
Keep in mind as that this is for a school project, I am not allowed to use any kind of external library (the idea is to have to do the parsing myself). Also, this will be implemented as the HTML of the page is downloaded, that is, I can't assume I already have the whole HTML page downloaded. It has to be showing up the extracted visible words as the HTML is being downloaded.
Also, it doesn't have to work for ALL the cases, just to be satisfatory most of the times.
I am not allowed to use any kind of external library
This is a poor requirement for a ‘software architecture’ course. Parsing HTML is extremely difficult to do correctly—certainly way outside the bounds of a course exercise. Any naïve approach you come up involving regex hacks is going to fall over badly on common web pages.
The software-architecturally correct thing to do here is use an external library that has already solved the problem of parsing HTML (such as, for .NET, the HTML Agility Pack), and then iterate over the document objects it generates looking for text nodes that aren't in ‘invisible’ elements like <script>.
If the task of grabbing data from web pages is of your own choosing, to demonstrate some other principle, then I would advise picking a different challenge, one you can usefully solve. For example, just changing the input from HTML to XML might allow you to use the built-in XML parser.
Literally all the text that is visible sounds like a big ask for a school project, as it would depend not only on the HTML itself, but also any in-page or external styling. One solution would be to simply strip the HTML tags from the input, though that wouldn't strictly meet your requirements as you have stated them.
Assuming that near enough is good enough, you could make a first pass to strip out the content of entire elements which you know won't be visible (such as script, style), and a second pass to remove the remaining tags themselves.
i'd consider writing regex to remove all html tags and you should be left with your desired text. This can be done in Javascript and doesn't require anything special.
I know this is not exactly what you asked for, but it can be done using Regular Expressions:
//javascript code
//should (could) work in C# (needs escaping for quotes) :
h = h.replace(/<(?:"[^"]*"|'[^']*'|[^'">])*>/g,'');
This RegExp will remove HTML tags, notice however that you first need to remove script,link,style,... tags.
If you decide to go this way, I can help you with the regular expressions needed.
HTML 5 includes a detailed description of how to build a parser. It is probably more complicated then you are looking for, but it is the recommended way.
You'll need to parse every DOM element for text, and then detect whether that DOM element is visible (el.style.display == 'block' or 'inline'), and then you'll need to detect whether that element is positioned in such a manner that it isn't outside of the viewable area of the page. Then you'll need to detect the z-index of each element and the background of each element in order to detect if any overlapping is hiding some text.
Basically, this is impossible to do within a month's time.

Django templatetag for rendering a subset of html

I have some html (in this case created via TinyMCE) that I would like to add to a page. However, for security reason, I don't want to just print everything the user has entered.
Does anyone know of a templatetag (a filter, preferably) that will allow only a safe subset of html to be rendered?
I realize that markdown and others do this. However, they also add additional markup syntax which could be confusing for my users, since they are using a rich text editor that doesn't know about markdown.
There's removetags, but it's a blacklisting approach which fails to remove tags when they don't look exactly like the well-formed tags Django expects, and of course since it doesn't attempt to remove attributes it is totally vulnerable to the 1,000 other ways of script-injection that don't involve the <script> tag. It's a trap, offering the illusion of safety whilst actually providing no real security at all.
HTML-sanitisation approaches based on regex hacking are almost inevitably a total fail. Using a real HTML parser to get an object model for the submitted content, then filtering and re-serialising in a known-good format, is generally the most reliable approach.
If your rich text editor outputs XHTML it's easy, just use minidom or etree to parse the document then walk over it removing all but known-good elements and attributes and finally convert back to safe XML. If, on the other hand, it spits out HTML, or allows the user to input raw HTML, you may need to use something like BeautifulSoup on it. See this question for some discussion.
Filtering HTML is a large and complicated topic, which is why many people prefer the text-with-restrictive-markup languages.
Use HTML Purifier, html5lib, or another library that is built to do HTML sanitization.
You can use removetags to specify list of tags to be remove:
{{ data|removetags:"script" }}

Should HTML co-exist with code?

In a web application, is it acceptable to use HTML in your code (non-scripted languages, Java, .NET)?
There are two major sub questions:
Should you use code to print HTML, or otherwise directly create HTML that is displayed?
Should you mix code within your HTML pages?
Generally, it's better to keep presentation (HTML) separate from logic ("back-end" code). Your code is decoupled and easier to maintain this way.
As long as your HTML-writing code is separate from your application logic, and the HTML is guaranteed to be well-formed somehow, you should be okay.
The only code that should be mixed in markup-based pages (i.e, those that contain literal HTML) is the code used for formatting the HTML (e.g., a loop for writing out a list).
There are trade-offs whether you put the code in with the HTML or you use pure code to write the HTML out using quoted string literals.
No, if you want to build good and maintainable software, and to achieve loose coupling.
If I understand the question right, you're asking whether it's a good practice to mix markup with back-end code. No. While this is commonly done, it's still a bad idea.
You should read up on the MVC paradigm, as well as on existing questions on the matter, such as What is the best way to migrate an existing messy webapp to elegant MVC? and Best practices for refactoring classic ASP?
The point is to keep the display logic separate from the rest of the code. In any complex site you'll have code mixed in with your HTML, but the code should be for display purposes only. It shouldn't be doing any complex calculations.
For example, templates will contain loops and conditionals. Plus you'll probably have a library of HTML-specific routines, like printing out an <option> list based on a list object.
Imagine you were writing an application that has two output modes: HTML and something else. How would you write it, to avoid duplicating code? That will probably point you in the right direction.
The HTML that makes up the view has to get sent to the browser in some way. In .net, each server control emits its own HTML markup as part of the page lifecycle. So yes it is OK to use HTML in server side code.
Perhaps you should try following the ASP.net pattern. Create a bunch of controls that represent UI elements and make them responsible for emitting their own HTML based on their state.
Its fugly, and not type safe. But people do it without consequence. I'd prefer using a DOM or, at a minimum, classes designed to write HTML using type safe semantics. Also, its not all that good to mix UI with logic...
If I need methods that generate HTML I usually isolate them in an HtmlHelpers class. That way you keep some level of separation. The ASP.NET MVC Framework does this quite successfully.
If you mean printing out HTML in your code, then no. Unless you have a good reason not to, you should use templates
Even if you think you don't need this now, there's always a good chance you'll need it later. Maybe you want to output in a different format than HTML, or you want different presentation for the same data. You usually have the need for these things further down the road, so it's best to use one from the start.
I hate when developers print() a bunch of html. It's completely unnecessary and looks ugly in any text editor that shows print/echo strings in red.
I agree with everyone else that you should try as hard as you can to separate the HTML/XHTML markup from the application logic. However, sometimes you do need to generate HTML/XHTML in the application logic for various reasons.
In these cases what I have been trying to do is to ensure the bare minimum amount of presentation code is in mixed in with the application logic and try to migrate everything else over to the presentation code. It is worth nothing that is some cases you have situations where you could have everything moved over to the presentation layer, but it might be a bit easier to generate the markup as part of the application logic. In those cases, your best bet is likely to be to go the route that makes the most sense in terms of time.
I don't think there's any excuse for generating HTML inside your business logic. Don't even do it when it's just a "quick fix" or when you'll "go back and fix it later", because that never happens.
To reiterate my position from other questions, using some control logic (conditionals, loops) within HTML to construct it is OK. Do NOT do any data massaging or business logic in the HTML. You have to be disciplined, but it's worth it. Maintenance is much easier if your concerns (like logic and display) are separated.
Ideally you are aiming for a separation of concerns between your presentation (UI) code and your domain (business logic) code.
The reason why you should avoid coupling these two concerns (in either direction) is simple...
You will only have one reason to change a piece of code. whether this is from structural/styling changes in your html design, or from your business rules changing, you should only have to make the change in one place.
To a lesser extent, although many purists would disagree, by sprinkling HTML code through your domain code or vice versa you are creating noise for the next developer who comes along to read/maintain it.
I try to avoid using code to print HTML "directly". It is difficult to maintain, edit, add styles and etc. Some cases like generating an HTML email in the code, I create a text file or HTML file with markers like, [name], [verification code] and etc. I load this from the code and replace those markers. This way, you can edit the style of the email without re-compiling your code. Separating "presentation" and "logic" is a good practice in my opinion.
Mixing code within HTML is generally not a good practice in similar reasons as said in #1. However, I do use code in HTML for things like simple dynamic strings that are displayed multiple times on a page or pages. I think this is better than creating multiple server controls for same exact values to set. Since this is not code "logic" mixed in the HTML, I think this is ok.