I've got an input so the user can type either html or plain text. When the user copy & paste text from MS Word, for example, it generates a weird html. Then, when you view that topic, you can see the whole page's style is affected. I don't really know if the generated html has unclosed tags or something, but it looks like it does and thus, the style of the page is affected.
Does anybody know how to "isolate" the html of that div(or whatever the container be) from the whole page's style?
Short of showing the content in an IFRAME, you can't really do that. What I usually do in this situation is apply tag stripping logic to the content as it comes in. You really don't want to allow arbitrary HTML from a security perspective, but even if you don't care what your users input, you should be stripping out invalid HTML tags (Word has a habit of creating tags with weird namespace-looking things like o:p) and running something like Tidy over the result to ensure every tag is properly closed. There are a number of Tidy libraries for .NET out there; here's one.
Here's a quick cut-and-paste of how I've done this in the past. Note that the class implements an interface from the project I used it in, but you get the general idea.
Copying text from word can include <style> tags. The only sure way to isolate these styles is to put the input control in an <iframe>
You can either sanitize the input or display it in an IFrame.
It it were me I'd strip all but basic formatting (e.g., bold, italics) and use Tidy. That's what I end up doing, I strip and convert all the CSS styles of word into <strong>, <em>, etc.
Related
In normal case, I can separate the text and the style, but how should I do it, when the text is dynamic (it is editable by the admin user)? The user of course wants to use bold, italic, etc, but if I put a common html-editor (I think) I broke the rule of the separation, because there will be html elements in the text. (I can use BB codes, but it is the same.)
In a long term I think it can cause problems when I want to use the text in any non-html environment. Of course I can strip the html tags, but it is not the way I would like to use (not because it won't work, but the original theoretical issues).
In some cases I can break apart the sentences to solve this problem, but I think it's a bad way, because the parts are pointless alone, and it won't be so easily editable too.
Is there any good solution for this?
That's perfectly ok.
You give the user the oppertuniny to set some attributes for the text (BBCodes recomended).
That is content. Then it's part of the design to interpret the attributes and style it.
For example you may provide the feature to let the user define something like [headline]MyHeadline[/headline]. This is pure content.
How to replace [headline] with HTML and how to style the resulting text is up to the design.
Edit: I recommend BBCodes to provide a closed set of features. That may be easier to deal with. You could just use them in another context and interpret them, instead of stripping out HTML.
If the tags entered are semantic, ie they are using an <i> tag for italic, rather than style="font-style:italic", then your design and content are still separate.
Separating design and content is about separating a site's presentation from the readable code, rather than removing the markup altogether.
I'd advise you focus on Semantic HTML.
I was inspecting today's Google Doodle of Moog Synth, dedicated to Robert Moog,
when I came across the following piece of html code:
<w id=moogk0></w>
<s id=moogk1></s>
<w id=moogk2></w>
<s id=moogk3></s>
(You can view the source & do a Ctrl+F for , you will get it in the first search result).
I googled about s & w tags but found nothing useful except that the s tag is sometimes used to strikeout text but is now deprecated.
Obviously google will not be using deprecated tags but I guess there's a lot more behind this than plain html code. Can anybody explain me the use of this tags? & how does the browser recognise them? & if the browser does not recognise them, whats their use?
The browser doesn't recognise them.
But HTML was designed to ignore what it doesn't recognise, so these are simply not understood by the browser and get ignored. As elements without content, they will not get rendered at all either, though they will be part of the DOM.
However, these can be styled directly as elements in CSS and picked up by Javascript (getElementsByTagName and getElementById etc...).
In short, these elements provide a target for CSS and Javascript without any other impact on display.
Unknown elements are treated as block elements (like div) and can be styled accordingly and be used in scripts.
The tags you are talking about are user created XML tags.
If you need to display dynamic data in your HTML document, it will take a lot of work to edit the HTML each time
the data changes.
With XML, data can be stored in separate XML files. This way you can concentrate on using HTML/CSS for display and layout, and be sure that
changes in the underlying data will
not require any changes to the HTML.
With a few lines of JavaScript code,
you can read an external XML file and
update the data content of your web
page.
Having the HTML of a webpage, what would be the easiest strategy to get the text that's visible on the correspondent page? I have thought of getting everything that's between the <a>..</a> and <p>...</p> but that is not working that well.
Keep in mind as that this is for a school project, I am not allowed to use any kind of external library (the idea is to have to do the parsing myself). Also, this will be implemented as the HTML of the page is downloaded, that is, I can't assume I already have the whole HTML page downloaded. It has to be showing up the extracted visible words as the HTML is being downloaded.
Also, it doesn't have to work for ALL the cases, just to be satisfatory most of the times.
I am not allowed to use any kind of external library
This is a poor requirement for a ‘software architecture’ course. Parsing HTML is extremely difficult to do correctly—certainly way outside the bounds of a course exercise. Any naïve approach you come up involving regex hacks is going to fall over badly on common web pages.
The software-architecturally correct thing to do here is use an external library that has already solved the problem of parsing HTML (such as, for .NET, the HTML Agility Pack), and then iterate over the document objects it generates looking for text nodes that aren't in ‘invisible’ elements like <script>.
If the task of grabbing data from web pages is of your own choosing, to demonstrate some other principle, then I would advise picking a different challenge, one you can usefully solve. For example, just changing the input from HTML to XML might allow you to use the built-in XML parser.
Literally all the text that is visible sounds like a big ask for a school project, as it would depend not only on the HTML itself, but also any in-page or external styling. One solution would be to simply strip the HTML tags from the input, though that wouldn't strictly meet your requirements as you have stated them.
Assuming that near enough is good enough, you could make a first pass to strip out the content of entire elements which you know won't be visible (such as script, style), and a second pass to remove the remaining tags themselves.
i'd consider writing regex to remove all html tags and you should be left with your desired text. This can be done in Javascript and doesn't require anything special.
I know this is not exactly what you asked for, but it can be done using Regular Expressions:
//javascript code
//should (could) work in C# (needs escaping for quotes) :
h = h.replace(/<(?:"[^"]*"|'[^']*'|[^'">])*>/g,'');
This RegExp will remove HTML tags, notice however that you first need to remove script,link,style,... tags.
If you decide to go this way, I can help you with the regular expressions needed.
HTML 5 includes a detailed description of how to build a parser. It is probably more complicated then you are looking for, but it is the recommended way.
You'll need to parse every DOM element for text, and then detect whether that DOM element is visible (el.style.display == 'block' or 'inline'), and then you'll need to detect whether that element is positioned in such a manner that it isn't outside of the viewable area of the page. Then you'll need to detect the z-index of each element and the background of each element in order to detect if any overlapping is hiding some text.
Basically, this is impossible to do within a month's time.
I would like to convert HTML to plain text but retain the minimum structure.
All sections which contain stuff only the browser needs to see such as <script> and <style> to be stripped completely.
Convert all block tags to <div> and all inline ones to <span> or remove inlines completely without leaving whitespace and turning anything delineatd by block levels into paragraphs with two linebreaks.
The idea is to turn random web pages into something suitable for natural language text processing without artefacts left from naively removing markup artifically break words up or making unrelated blocks look like sentences.
Any binary, library, or source in any programming language is OK.
Is there a standard source preferably machine-readable with a full list of elements defining which are block, which inline, and which are like <script> and <style> above?
The list of HTML 4 Block-level elements is here: http://htmlhelp.com/reference/html40/block.html
The most popular HTML parsing libraries for Perl are HTML::Parser which is a SAX-style parser and HTML::TreeBuilder which is more DOM-like.
Beyond that, you'll have to decide which elements are important and which are not based on what you're trying do to.
You may want to do some research yourself. Then, when you run into a problem, ask a question related to the problem. This sounds more like specification for a project that you want someone to do for you.
For starters, websites use tags for all sorts of things, and the problem is very complex. You would probably want to save information in h# and p tags, but you also may want to save div tag information if they use the id tag. In short, you'd have to write rules for each website you encounter, or employ some sort of fuzzy logic.
Instead of doing it on a tag by tag basis, why not try detecting sentences and grammar, or things likely to be in headings, and choose tags that include those things while stripping out the rest?
Here's my own tool to solve this problem in Perl using HTML::Parser as a github gist: html2txt.pl
It's unfinished and perhaps slightly Windows-centric but I thought I'd share it since a few people have viewed my question here. Feel free to play with it.
I'm looking to generate a single HTML file from the content of multiple HTML and text documents (emails).
I'd like some recommendations about the best way to handle this.
For instance, at the most naive level you could extract everything within the tags of the HTML and put it inside a <div>. Plain text would go inside a <pre>. Of course this will mean that anything important in the html <head> section, such as embedded CSS, is lost...
(Dev Environment: Delphi 2007)
TIA
With exported email you can almost certainly grab everything up to the closing "body" tag in the first one, drop the body and html end tags, add your separator code, then loop through the remaining files stripping everything up to the body tag, include up to the closing body tag, then at the end drop in the end body, end html tags. The HTML header content in each is likely to be the same or very similar. It doesn't have to be, but it probably will be.
Trying to do the same for generic HTML documents is going to require the frame solutions above - wrap each document in an inline frame so that each one gets the full set to header includes and inline declarations. Even then you might run ito issues with code designed the break frames (and anything that happens to use frame-incompatible code).