Clean HTML using C# - html

How do I repair malformed HTML using C#? A great answer would be an HTML Agility Pack sample!
I'm scraping a site (for legitimate use). The site's HTML is OK but there are some annoying problems.
One way I could go would be through regular expressions. I used Expression Web to analyse the problems and the regular expressions needed to correct them. So one way would be to use a tool such as RegexBuddy to generate C# code for these regular expressions.
However, the recommended tool for processing malformed HTML in C# is the HTML Agility Pack (HAP). Moreover, I've analysed only a handful of pages and I'm afraid that future pages will contain patterns I've not yet solved, and I would hate to enter the "find the errors in the next few pages and correct them" maintenance business. So, if HAP already has a solid, always-working solution, this would be great. The problem is that except for a few mentions here at SO I could not find any how-to-use documentation for this tool, except for the object-by-object API help file.
So - before I spend $ and learning time on RegexBuddy (no free evaluation version), or break my teeth on HAP's API documentation - is there an easy way to do this? An HAP sample would help... :-)

can you tell me what kind of annoying problems are you having?
but you dont need to use regex to clean the html, HAP will let you access the elemtents of a malformed html using Xpath Queries.
and basically you need to learn Xpath to know how to get the html elements you want.
it really depends on the kind of html you are parsing using HAP.
but there is several ways to get the elements.
like by id or class or even you can get the element that follows another element that contain a given text like "name:" for example.
you can goto W3 schools Xpath Tutorial for a nice xpath tutorial

What I took from the answers here:
1) If you're scraping a website you don't control, you'll always enter a maintenance mode where you have to fix your scraper every time the layout of the page you're scraping changes.
2) If you are limited to this known site, why not write your scraper to adjust the problems
So, if I have to go into maintenance mode, it should be as easy as possible. Therefore, my process is as follows:
I use Webius's SWExplorerAutomation to detect scenes in Web pages. The idea is that a Scene is a collection of conditions you define for IE. When a web page is loaded, IE tries to see which set of conditions is met (e.g. - page title is "Account Login", the page contains a "Login" text box a "Password" text box). If a set of conditions corresponding to a scene is detected, IE reports that the scene has been detected. This model provides an abstraction layer - Some changes in the web page can translate to changes in the scene file, saving the code from having to change. Additionally, this shields me from IE's event driven model: I call "scene. I'm evaluating this product but I'm not yet sure I'll use it, mainly because the documentation is terrible. Another alternative is Watin, and one more reason I haven't yet bought SWEA is this article accusing its author of spamming against Watin.
Once the web page has been acquired, I use Expression Web to run compatibility checks and identify errors.
I use RegexMagic to remove and correct errors. I really love this tool. Sure, sometimes it make you murderously angry because it doesn't let you do things that should be really easy, but it's a sweet, sweet tool, and the documentation is amazing.
Finally, after all the errors I know have been corrected, I use HTML Agility Pack to convert to XHTML - cross the ts and dot the is, so to speak: all lower case, quotes across attributes, and so on.
Hope this helps!
Avi

Regex can't be used for HTML Cleaning.
Does http://tidy.sourceforge.net/ helps?

If you're scraping a website you don't control, you'll always enter a maintenance mode where you have to fix your scraper every time the layout of the page you're scraping changes. It doesn't matter if you're using the regex <td color="red">\d+</td> to get the big red number from a page or if you're using a DOM parser to get the 3rd cell in the 2nd row in the table with id numbers to get the same. The regex breaks if the webmaster replaces the color attribute with a class attribute. The DOM parser breaks if the webmaster adds another row to the top of the table.
If you're scraping larger parts of a web page and want to embed them in your own web page, it may be easier to get over your desire for web standards compliance and just let the browser figure out how to display things.

Since you're using Html Agility Pack and know of the problems that occur, if you are limited to this known site, why not write your scraper to adjust the problems when you've loaded the HtmlDocument.
i.e.:
If you know the element always appears after the , insert the element into the first child position of the tag.....

Related

Tagging HTML elements with the source file

Problem
So we have quite a big project with lots of different Partial Views and a client side data binding framework (Knockout.js in our case).
One of the more problemtic parts is that is getting harder and harder to figure out which partial view is rendering an element that I see on my page.
So I need to debug this particular DIV. Okay, where do I find it?
Usually I try to find a very specific class or ID close by this element and do a search through the whole platform - far from ideal.
Question
So I was thinking about the following; tagging all elements (in debug mode) with the source file where they have been generated.
Right now I'm thinking about something like a precompiler that adds a data-source="" to every element. I might refer to an ID within a dictionary to prevent repeating all the long filenames.
Before I'm reinventing the wheel:
is there already something similar?
are there better alternatives?
We're using ASP.NET MVC, but any hints to how other platforms do this are perfect too.
If you are using Visual Studio, I highly recommend the Web Essentials extension. Among many great features, it has one called "Inspect Mode", part of the larger "Browser Link" feature, that does exactly what you are looking for; it identifies the file that a particular DOM element came from. It might be worth a shot if that option is open to you.
#Dirk, as per my understanding your issue is to easily identify the element/view. Adding data-source can be an option but before that have a look at this link
Editing Styles and DOM - Chrome Dev Tools
This page has many demonstrations which might be helpful to your problem. Furthermore, I do agree with Kevin suggestion.

Approach to building a GUI for a web application

First, a short disclaimer. I have next to no knowledge about web applications. I come from an iOS background where I exclusively wrote native code, so if you write your answers like I know nothing outside the shallow parts, that would be great.
I'm interested in learning a stack to develop web applications, but I'm not sure what the right way to build the GUI is. I know that a web front end consists of html and CSS to create the display and javascript as the bridge between the back end and the GUI, but I don't know the best way to put something together.
I know in iOS, you can use the Interface Builder (part of xcode that lets you graphically create the xml that describes the display) to create GUI's without any knowledge of how iOS translates the xml to some rendering, or even what is written in your xml files. Is there any analog to the web front end?
I'm mainly just looking for a list of the accepted ways to develop the GUI for a web application. If I have to learn HTML and CSS, so be it, but I'd like to know what my options are and the tradeoffs between each of them.
I can answer shortly stating that (technically) you can design web pages without coding in HTML or CSS, or even Javascript - although, you would be somewhat limited in your creative abilities and applications.
You can read about WYSIWYG html editors on this link, or try out ckeditor (someone said it's good)...
...I think a bit of background will help you reach a correct decision...
so here goes:
The Long Answer
I would start by trying to put the world of web programming and design into concepts that correlate with iOS coding.
If we look at the whole of a web app from an MVC perspective, then the browser is the view, the server is the controller and the database is the model... although this is very simplified.
Just like in iOS, each of these can be (but doesn't have to be) broken down into sub-MVC systems.
Just like any model in MVC, the view (the browser) can talk to the model (the database), but really it shouldn't. that's just bad practice.
If we break down the main-view (the browser) to a sub-MVC system, I would consider the HTML as the model, the CSS as the view and the browser (through links and javascript) as the controller.
It's not all that clear cut, but thinking like this helps me practice better and cleaner coding.
The HTML is the view's model for the web-app - it contains the data to be displayed or used.
HTML is a variation on the XML format and it contains data organized in a similar way to an XML file.
The basic HTML file will contain:
<html>
<head>
</head>
<body>
</body>
</html>
this should look familiar to you if you read any XML.
The CSS (cascading style sheets) is the view - it states HOW the html DATA should be displayed.
if your web app does't have any CSS, it will use the browser's default CSS/styling to be applied on the data in the HTML.
This "language" makes me think more about dictionaries in iOS (I think that's what they're called in Objective C). They have properties and values (like key-value pairs) that determine how the HTML data is displayed (if it's displayed).
They could look something like this:
body
{
color: white;
background-color: black
}
The browser is the web-app's view's controller - it makes it all work together and serves it up to the screen.
Javascript and links help us tell the controller what we want it to do, but it is the browser that acts (and willfully at times).
You can have a whole web app that acts without javascript, using only the default actions offered by links - in which case the browser will usually ask the server (the main controller) what to do.
Javascript helps us move some of the legwork from the server to the client, by allowing us to have a "smarter" controller for our view - just like in iOS.
The issue of the errant main-view / browser
Not all views are created equal, and not all browsers are the same.
Because the browser is used as the controller for the web-app's view, and because some browsers act differently then others, we web coders have the problem of working around someone else's idea of how our view's controller should behave.
You might see us complaining about it quite a lot (especially complaining about Internet Explorer).
These days, this issue is not as big of a problem as it used to be... it's just that some people don't update their computers...
WYSIWYG web editors
There are website builders and editors that try to work like X-Code does, by allowing us to build the website much like we would write word documents.
But, unlike X-Code which codes only the graphical interface of the view, these website builders write the model as well and usually add javascript into the mix.
When we use these tools (which I avoid), the whole MVC model breaks apart.
We can use them as a starting point for dirty work, but then we take their code apart and adjust it to our needs - usually by taking the code we need for the view (CSS) and applying it where we need it (and discarding much of the nonsense they add to the code and the HTML).
To summarize
As you can tell, HTML and CSS (and Javascript) are only a small part of a web app - as they all relate to the main view of the web app.
To write the controller and models for web apps, we use other tools (such as Ruby, PHP, js.node, MySQL and the like).
Coding the HTML and CSS isn't as hard as you might think, although it might be harder then I present it to be.
You can avoid writing code for web apps and use applications that offer WYSIWYG (What You See Is What You Get), but that would limit your abilities and will take away from your control over what you want to create.

Alternatives to HTML for website creation?

It seems the most common aproach to web design is to use HTML/XHTML & CSS in conjunction with other technologies or languages like Javascript or PHP.
On a theoretical level, I'm interested to know what other languages or technologies could be used to build an entire site without using a single HTML tag or CSS style for styling/positioning?
Could a website be made only using XML or PHP alone, including actual styling and positioning?
Presumably Flash sites are till embedded in HTML tags?
Thanks
There are actually several solutions that allow you to nearly completely avoid CSS and HTML.
GWT: Google Web Toolkit
Written in Java and will allow you to build both server and client code in Java. Used to build Google Wave.
Cappuccino and Objective-J:
Objective-J is to JavaScript as Objective-C is to C. It extends JavaScript with many near features, including type-checking, classes and types.
Cappuccino is like Cacoa (Mac OS X GUI toolkit).
Using these two you can build incredibly rich and desktop like webapps. They run mostly on the client side and you can use whatever you want on the server.
A good example is 280slides
SproutCore is similar to Cappuccino, but it uses pure JavaScript instead. Apple is using SproutCore to make me.com.
I should also mention that knowledge to HTML, CSS, JavaScript is a good skill to know, just like understanding your compiler is a good skill.
EDIT:
As said above Adobe Flash can also be used.
You can make a website with out a single html tag. Just give folder read access to all your directories, have sensible file names. From here you user will be able to browse images , read text files, download videos and depending on the content he may or may not come back ever again, but you do achieve the goal of setting up a "website" with out a single line of html or css or any other code for that matter.
:-) :-) :-)
You can host a telnet server with anonymous access and a specialized shell that restricts the user to doing whatever it is you want the site to do. ;)
Lets make the distinction between what is required by the web browser, and what you as a developer use to create that markup.
Remember that HTML nowadays is xml. You could use any markup language you like and convert that to HTML using XML.
eg ASP.NET uses markup such as which is converted on the server to .
As long as the content going down the wire to the browser is HTML, or generates HTML through script, you can use any approach you like.
However these approaches have mostly failed as developers prefer having direct control over the markup. It makes css as well as scripting much easier when you are certain what the html is going to be.
ASP.NET MVC is a product created in response to criticisms leveled at the ASP.NET webforms model.
Also, this is another answer because it's a completely different technology, but you can write an application in XUL and it'll run in Mozilla-based browsers without any HTML.
There's also XML. You can create websites with XML only. A well known one is World Of Warcraft. Check the page source. An XSL is used as stylesheet. There exist even XML based web frameworks like OpenLaszlo. You can let it serve either DHTML or Flash on reqeust out of a single XML template.
The Wt C++ Web Toolkit.
You can write your web application in C++ using Qt-style widgets (input boxes, buttons, tabs etc) and hook up client-side events to C++ code on your server. All without writing any HTML or CSS.
A sample application from their website (you may also want to look at this excellent tutorial):
HelloApplication::HelloApplication(const WEnvironment& env)
: WApplication(env)
{
setTitle("Hello world"); // application title
root()->addWidget(new WText("Your name, please ? ")); // show some text
nameEdit_ = new WLineEdit(root()); // allow text input
nameEdit_->setFocus(); // give focus
WPushButton *b = new WPushButton("Greet me.", root()); // create a button
b->setMargin(5, Left); // add 5 pixels margin
root()->addWidget(new WBreak()); // insert a line break
greeting_ = new WText(root()); // empty text
/* when the button is clicked, call the 'greet' method */
b->clicked().connect(this, &HelloApplication::greet);
}
void HelloApplication::greet()
{
/* set the empty text object greeting_ to greet the name entered */
greeting_->setText("Hello there, " + nameEdit_->text());
}
Curl (requires a browser plugin)
Wikipedia article
A webpage looks like this:
{curl 1.7 applet}
{value
let b:int=99
let song:VBox={VBox}
{while b > 0 do
{song.add b & " bottle(s) of beer on the wall,"}
{song.add b & " bottle(s) of beer."}
{song.add "Take one down, pass it around,"}
set b = b - 1
{song.add b & " bottle(s) of beer on the wall."}
}
song
}
Source
Since browsers view HTML, I'm assuming you mean create a site without ever having to edit/write HTML/CSS. The framework/app environment/whatever taking care of everything for you - yet still allowing you control over the presentation layer.
Seems like that is certainly possible on a theoretical level.
I ran across Noloh (not one line of html) a while back. Was intrigued, but never actually tried it out.
From various places on the Noloh site:
Because NOLOH does not rely on HTML or pages, maintaining complex rich Internet applications is significantly easier than with other methods.
Developing applications with NOLOH only requires using a single, unified language: a superset of PHP that completely maintains all aspects of server-client communication for you!
I think you could build a site entirely in SVG.
The front page of emacsformacosx is almost entirely SVG, for example.
Downsides: It wouldn't be viewable in IE (at least through version 8). And last I looked, text support, like flowing and justification, was weaker in SVG. (You could embed HTML inside an SVG element when you needed sophisticated text features, but that would violate your no-HTML rule.)
You'd probably still want to use CSS with SVG, because it's a good idea there for the same reason it's a good idea with HTML, but it wouldn't be necessary.
A website is always viewed through a browser (at least always if you are human :)). Browsers understand HTML. Whatever the technology - you have to basically render HTML. Even in cases with rich technologies like flash, the flash object that is rendered by a browser plugin is embedded inside the HTML.
In theory it is possible to do it without HTML, but the question becomes how much does the product diverge from the definition of a website...
One really short, simple answer... you can't :D
Flash requires an embed tag, an image requires an embed tag etc, so you'd have to use HTML in some method or another.
PHP is an embedded language, it is used to generate HTML on which the browsers renders, with XML, well technically a browser like Ie or FireFox will render it in it's own way for readability, but I would not class that as a website.
The major developments in the world of web technologies involves the development of HTML and CSS to improve them, there isn't any need for an alternative. In fact we're pushing towards a standard, what point would there be in introducing a new language to negate these standards. The whole IE saga would simply get worse.
Like the others have suggested, you could directly load an image or a flash file, but an image is useless on it's own, and a flash interface throws up loads of problems like SEO, accessibility etc, not least it's very heavy and usually completely misused. In my mind I wouldn't even class this method as a website, it just doesn't tick any of the boxes (IMO).
I think you can have an URL pointing directly at a hosted Flash (SWF) file, I've certainly done this though I don't know if all browsers work.
Anyhow, I tested this when developing MyDinos.
e.g: http://mydinos.com/home.swf
You can use Emscripten and its SDL subset.
You could try using quickstatic. You can code HTML templates from Python3. What is super cool about it is the fact that if you put in a for-loop for a certain item, you can generate that many items (maybe even use it to print items from a directory or quickly serve thousands of links).

WYSIWYG-editor with "add custom html feature" and secure (validated) html output?

I've been looking into some of the WYSIWYG editors (TinyMCE, FCKEditor, etc.) and they all seem to offer a lot of options.
However, one vital feature that seems to lack is a simple "add custom html" option which would allow the user to input any of these embed-snippets you find all around the web these days, for example a youtube video. This is different than a "edit html/source" feature as that requires actual knowledge of html and there is the risk of the user writing invalid code.
Another issue that I couldn't find much about is the output html. How would I make sure that this output causes no security invulnerabilities? Even when the user has the ability to add his own html?
So, basically, is there an open source WYSIWYG editor which covers these 2 features?
FCKEditor achieves this via plugins. e.g. http://sourceforge.net/projects/youtubepluginfo/
For the first part, you either have the "view source" view of the editor or, if that is too complex, I'm pretty sure such plugins already exist for all major editors. If they don't, building a "insert arbitrary HTML" plugin should be easy to implement by tweaking another simple plug-in like the youTube one linked to in Martin's answer.
The second part - sanitizing the incoming HTML - is impossible to achieve in the WYSIWYG editor itself, because it acts solely on the client side, and fills content into a form input that could be manipulated anyway, even though you turn off the "custom HTML" function in the editor.
Therefore, the sanitizing of the HTML needs to take place on server side. If you can use PHP, a tool that looks very good to me from the outside - I haven't worked with it but plan to in the near future - is HTML Purifier. It claims to produce reliable HTML with minimum hassle.

Embed section of HTML from another site?

Is there a way to embed only a section of a website in another HTML page?
Example: I see an answer I want to blog about, so I grab the HTML content, and splat it in somewhere, and show only that, styled like it is on stackoverflow. Basically, I want to blockquote the section of the page with original styling, if that makes sense. Is that something the site itself has to provide, or can I use an iframe and tell it to show only a certain element or something crazy? Open to all options, but I want it to show up as HTML, not as an image (that's really a last resort).
If this is even possible, are there security concerns I need to aware of?
Don't think image should really be last resort. You have no control over the HTML/CSS of the source page, so even if you craft a solution (probably by using JavaScript to parse out the desired snippet) there is no guarantee that tomorrow the site doesn't decide to change its layout.
Even Jeff, who has control over the layout of stackoverflow.com, still prefers to screen-capture the site, rather than pull in the contents live.
Now if your goal was to have the contents auto-update, that would be a different story. But still, unless you use some agreed-upon method of sharing content, such as RSS, your solution would be very fragile.
The concept you are describing is roughly what is called a "purple include" or "transclusions". There is a library out there for it, but its not exactly actively developed. Here's a couple ajaxian articles on it.
I'd recommend using a server side solution with Python; using urllib2 to request the page, then using BeautifulSoup to parse out the bit that you need. BeautifulSoup has a very flexible selection api with which you can craft heuristics for the section you are interested in.
To illustrate:
soup = BeautifulSoup(html)
text = soup.find(text="Some text on the page that is unlikely to change")
print soup.parent.prettify()
That way if the webmaster later changes the markup on the page, your scraping script should still work.
On client side <iframe> is the only practical option. It is possible to scroll it, but it might not work in the long term, because it's technically close to clickjacking attack.
There's also cross-site XHR, but requires opt-in from destination site, and today works only in few latest browsers.
Getting HTML on server side is easy (every decent web framework has ability to download page and parse HTML and you can use XPath/XSLT or DOM to extract bit you want).
Getting styles however is going to be tricky – CSS rules may not work with HTML fragment taken out of context. You'd have to parse CSS, extract and transform rules or use browser and read currentStyle of every node.
Obviously you have to heavily filter HTML you extract to avoid XSS. It's harder than it seems.
If you don't need to automate this, a good HTML+CSS WYSIWYG editor might be able to extract content fragment with styles.
That sounds like something that IE8's Web Slices would be perfect for. However, it's only available in IE8, and the site of origin would have to implement for you to be able to take advantage of it.