Recommended way to concatenate several HTML files - html

I'm looking to generate a single HTML file from the content of multiple HTML and text documents (emails).
I'd like some recommendations about the best way to handle this.
For instance, at the most naive level you could extract everything within the <body> tags of each HTML document and put it inside a <div>. Plain text would go inside a <pre>. Of course, this means that anything important in the HTML <head> section, such as embedded CSS, is lost...
(Dev Environment: Delphi 2007)
TIA

With exported email you can almost certainly do the following: take everything from the first file up to (but not including) its closing body and html tags, then add your separator code. Then loop through the remaining files, stripping everything up to and including the opening body tag and keeping everything up to the closing body tag. At the end, add the closing body and html tags. The HTML header content in each file is likely to be the same or very similar. It doesn't have to be, but it probably will be.
Doing the same for generic HTML documents is going to require a frame-based solution: wrap each document in an inline frame so that each one gets its full set of header includes and inline declarations. Even then you might run into issues with code designed to break out of frames (and anything that happens to use frame-incompatible code).
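For what it's worth, the email case above can be sketched roughly as follows. This is only an illustration in Python rather than Delphi; the helper names (body_of, head_of, merge_documents) and the <hr> separator are made up for the example, and the regex extraction assumes each file has at most one well-formed head and body pair. It keeps the first document's head, as suggested above, and drops the others.

```python
import html
import re

# Assumption: one well-formed <head> and <body> pair per document.
# A real HTML parser would be more robust than these regexes.
BODY_RE = re.compile(r"<body\b[^>]*>(.*?)</body\s*>", re.IGNORECASE | re.DOTALL)
HEAD_RE = re.compile(r"<head\b[^>]*>(.*?)</head\s*>", re.IGNORECASE | re.DOTALL)


def body_of(markup):
    """Return the content between the body tags, or the input if none found."""
    match = BODY_RE.search(markup)
    return match.group(1) if match else markup


def head_of(markup):
    """Return the content between the head tags, or an empty string."""
    match = HEAD_RE.search(markup)
    return match.group(1) if match else ""


def merge_documents(docs, separator="<hr>"):
    """Merge (kind, text) pairs, where kind is 'html' or 'text' (hypothetical API)."""
    # Keep only the first HTML document's head; later heads are discarded.
    first_head = next((head_of(text) for kind, text in docs if kind == "html"), "")
    parts = []
    for kind, text in docs:
        if kind == "html":
            parts.append("<div>" + body_of(text) + "</div>")
        else:
            # Plain text must be escaped before it goes inside <pre>.
            parts.append("<pre>" + html.escape(text) + "</pre>")
    joined = ("\n" + separator + "\n").join(parts)
    return "<html><head>" + first_head + "</head><body>" + joined + "</body></html>"
```

If the documents carry conflicting styles, this approach will let them bleed into each other, which is where the inline-frame route comes in.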

Related

(Dis-)advantages of embedding HTML header/footer with <object> tags?

I am looking for a simple way to make my website modular and extract common parts that appear often like header and footer into separate files.
In my case, I cannot use any server-side includes with e.g. PHP. I also generally would like to avoid including big libraries like jQuery just for such a simple task.
Now I came across https://stackoverflow.com/a/691059/4464570 which suggests to simply use an HTML <object> tag to include another HTML file into a page, like this:
<object data="parts/header.html" type="text/html">header goes here</object>
I might be missing something important here, but to me this way seems to perfectly fit my needs. It is very short and precise, the <object> tag is well supported by all browsers, I don't need to include any big libraries and actually I don't even need any JavaScript, which allows users blocking that to still view the correct page structure and layout.
So are there any disadvantages I'm currently not aware of yet with this approach? The main reason for my doubts is that out of dozens of answers on how to include HTML fragments, only one recommended <object> while all others went for a PHP or JavaScript/JQuery way.
Furthermore, do I have to pay attention to anything special regarding how to put the <object> tag into my main page, or regarding the structure of the file I want to include this way? Like, may the embedded file be a complete HTML document with <!DOCTYPE>, <html>, <head> and <body> or should/must I strip all those structures and leave only the inner HTML elements? Is there anything special about using JavaScript or CSS inside HTML embedded this way?
The use of the <object> tag for HTML content is very similar to the use of an <iframe>. Both just load a webpage as a separate document inside a frame that itself is the child of your main document.
Frames used to be very popular in the early days of web development, then in the form of the <frame> tag. They are generally frowned upon, however, and you should really use them as little as possible.
Why not to use this technique for displaying your own content
The HTML content in the child frame cannot communicate with the parent. For example, you can't use a script in the parent HTML to communicate with the child and vice versa. That makes it not very useful for serving your own content when you want to display anything but static text.
Why not to use this technique for displaying someone else's content
You can't use it to serve a lot of external content either. Many websites (including e.g. SO) send an X-Frame-Options header with the value SAMEORIGIN along with their webpages. This prevents their pages from being loaded and displayed inside a frame on your site.

Non formatting tags outside html tag

I have a rather bizarre use case.
I need a tag on top of a html document that will not be used in formatting but that will contain some information for the parsing entity to act upon (flags & data).
Normally comments are used for this:
<!-- foo = bar -->
<html>
etc.
Now we can strip 'foo' from the comment when the html is parsed and act upon its value 'bar'.
However I am now in the situation that one of the intermediate systems strips all comments from the html as it sends it along.
So my question is: What other tags can go outside of the html tag without breaking the html specs (too much)?
I know of the <!doctype> tag, but you cannot really put data in there without breaking something.
NB:
yes, I know this kind of signalling is ugly, but these are not my systems, I must provide such a flag.
all comments are stripped
JS is not executed (yet), so we can't do it via the DOM
This is a less than ideal situation but you can add anything inside the doctype tag like this:
<!doctype html public 'You are free to add anything here'>
I'm not sure how much data is allowed in there but this shouldn't mess with the parsing of the page.
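On the parsing side, reading such a flag back out could look something like the sketch below. It is written in Python with the standard html.parser module purely as an illustration; the DoctypeReader class and the doctype_payload function are hypothetical names, and it assumes the payload is a single quoted string that contains no ">" character. Whether the intermediate systems leave the doctype untouched is something you would have to check.

```python
import re
from html.parser import HTMLParser


class DoctypeReader(HTMLParser):
    """Remembers the first doctype declaration seen in the document."""

    def __init__(self):
        super().__init__()
        self.doctype = None

    def handle_decl(self, decl):
        # decl is the text between "<!" and ">", e.g. "doctype html public '...'"
        if self.doctype is None and decl.lower().startswith("doctype"):
            self.doctype = decl


def doctype_payload(markup):
    """Return the first quoted string inside the doctype, or None."""
    reader = DoctypeReader()
    reader.feed(markup)
    reader.close()
    if reader.doctype is None:
        return None
    match = re.search(r"""(['"])(.*?)\1""", reader.doctype)
    return match.group(2) if match else None
```

Called on a page starting with <!doctype html public 'foo=bar'>, this returns the string foo=bar, which the parsing entity can then split into flag and value.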

Today's Google Doodle of Moog Synthesizer

I was inspecting today's Google Doodle of Moog Synth, dedicated to Robert Moog,
when I came across the following piece of html code:
<w id=moogk0></w>
<s id=moogk1></s>
<w id=moogk2></w>
<s id=moogk3></s>
(You can view the page source and search for it with Ctrl+F; it shows up in the first search result.)
I googled about s & w tags but found nothing useful except that the s tag is sometimes used to strikeout text but is now deprecated.
Obviously Google will not be using deprecated tags, but I guess there's a lot more behind this than plain HTML code. Can anybody explain the use of these tags? How does the browser recognise them? And if the browser does not recognise them, what's their use?
The browser doesn't recognise them.
But HTML was designed to ignore what it doesn't recognise, so these are simply not understood by the browser and get ignored. As elements without content, they will not get rendered at all either, though they will be part of the DOM.
However, these can be styled directly as elements in CSS and picked up by JavaScript (getElementsByTagName, getElementById, etc.).
In short, these elements provide a target for CSS and JavaScript without any other impact on display.
Unknown elements are rendered as inline elements by default (like span), but they can be styled with CSS and used in scripts like any other element.
The tags you are talking about are user created XML tags.
If you need to display dynamic data in your HTML document, it will take a lot of work to edit the HTML each time the data changes.
With XML, data can be stored in separate XML files. This way you can concentrate on using HTML/CSS for display and layout, and be sure that changes in the underlying data will not require any changes to the HTML.
With a few lines of JavaScript code, you can read an external XML file and update the data content of your web page.

Removes spaces between html tags?

To cut some page loading time, where can I find something that removes the spaces between HTML tags, so that I don't have to go through each one and remove them myself?
Like so:
<body>
<p>Lot's of space</p>
</body>
<body><p>No space</p></body>
I found this site. But it leaves one space between tags. But I don't want any.
Be careful that you have some idea of what is happening or you will corrupt the integrity of your documents. A fully minified code sample removes all comments and all white space characters not necessary for syntactical purposes.
In other words this example of HTML:
<p>Some content
<strong>is strong</strong>
and
<em>emphasized</em>
in this paragraph.</p>
When fully minified becomes:
<p>Somecontent<strong>isstrong</strong>and<em>emphasized</em>inthisparagraph.</p>
In that case the corruption to the content is obvious, as all the words are colliding into each other. What is less obvious is the effect of removing the white space between content and tags, and between tags that are not adjacent to other content. All white space characters outside of tags in an HTML document are text nodes in the DOM, and removing DOM nodes without careful consideration can be harmful.
Furthermore, you also have to ensure that your HTML minifier is not corrupting any inline JavaScript or CSS code. Investigate these conditions carefully when looking at the different options available.
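To make the point concrete, here is a small Python sketch of a more conservative approach: collapse each run of white space to a single space instead of deleting it, and leave pre, textarea, script and style elements alone. The function name collapse_whitespace is made up for this example, and the regex approach is an assumption that only holds for reasonably well-formed markup. It also collapses white space inside attribute values, so treat it as an illustration of the idea rather than a finished minifier.

```python
import re

# Elements whose content must be kept byte for byte: preformatted text
# and inline JavaScript or CSS.
PROTECTED_RE = re.compile(
    r"<(pre|textarea|script|style)\b[^>]*>.*?</\1\s*>",
    re.IGNORECASE | re.DOTALL,
)


def collapse_whitespace(markup):
    """Collapse white space runs to one space outside protected elements."""
    out = []
    pos = 0
    for match in PROTECTED_RE.finditer(markup):
        # Squeeze the markup before the protected element...
        out.append(re.sub(r"\s+", " ", markup[pos:match.start()]))
        # ...and copy the protected element through unchanged.
        out.append(match.group(0))
        pos = match.end()
    out.append(re.sub(r"\s+", " ", markup[pos:]))
    return "".join(out).strip()
```

Applied to the paragraph example above, this keeps one space between the words and the inline tags, so the text still reads "Some content is strong and emphasized in this paragraph." when rendered.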
Here is one that I wrote which may be helpful to you as it minifies markup tags in a way that is fully recursive to a beautified state using an automated pretty-print application.
http://prettydiff.com/?m=minify&html
Any other HTML minifier without this rule will work.
Google listed me this one:
http://www.willpeavy.com/minifier/
I'm developing a web application with the Smarty template engine. It has a {strip} function, which removes the new lines, tabs, and extra spaces in the template. So you can write your code with as many new lines and spaces as you like, but the output will be on a single line.

Text style affecting the whole site

I've got an input so the user can type either HTML or plain text. When the user copies and pastes text from MS Word, for example, it generates weird HTML. Then, when you view that topic, you can see that the whole page's style is affected. I don't really know if the generated HTML has unclosed tags or something, but it looks like it does and thus, the style of the page is affected.
Does anybody know how to "isolate" the HTML of that div (or whatever the container is) from the whole page's style?
Short of showing the content in an IFRAME, you can't really do that. What I usually do in this situation is apply tag stripping logic to the content as it comes in. You really don't want to allow arbitrary HTML from a security perspective, but even if you don't care what your users input, you should be stripping out invalid HTML tags (Word has a habit of creating tags with weird namespace-looking things like o:p) and running something like Tidy over the result to ensure every tag is properly closed. There are a number of Tidy libraries for .NET out there; here's one.
Here's a quick cut-and-paste of how I've done this in the past. Note that the class implements an interface from the project I used it in, but you get the general idea.
Copying text from Word can include <style> tags. The only sure way to isolate these styles is to put the input control in an <iframe>.
You can either sanitize the input or display it in an IFrame.
If it were me, I'd strip all but basic formatting (e.g., bold, italics) and use Tidy. That's what I ended up doing: I strip the Word CSS styles and convert them into <strong>, <em>, etc.
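A rough idea of what the "strip all but basic formatting" step might look like is sketched below, in Python with the standard html.parser module rather than .NET. The allowlist, the AllowlistSanitizer class and the sanitize function are all hypothetical choices for this example. It drops every attribute, removes style and script elements together with their content, and discards any tag that is not on the list while keeping its text. It does not close unbalanced tags, so a Tidy pass afterwards would still be needed, and for real security requirements a maintained sanitizer library is the safer choice.

```python
from html import escape
from html.parser import HTMLParser

# Assumed allowlist of basic formatting tags; adjust to taste.
ALLOWED = {"b", "i", "strong", "em", "p", "br"}
# Elements that are removed together with everything inside them.
DROP_WITH_CONTENT = {"style", "script"}


class AllowlistSanitizer(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # > 0 while inside a dropped element

    def handle_starttag(self, tag, attrs):
        if tag in DROP_WITH_CONTENT:
            self.skip_depth += 1
        elif tag in ALLOWED and not self.skip_depth:
            # Attributes (class, style, ...) are deliberately discarded.
            self.out.append("<%s>" % tag)

    def handle_endtag(self, tag):
        if tag in DROP_WITH_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in ALLOWED and tag != "br" and not self.skip_depth:
            self.out.append("</%s>" % tag)

    def handle_data(self, data):
        if not self.skip_depth:
            # Text is re-escaped so it cannot reintroduce markup.
            self.out.append(escape(data, quote=False))


def sanitize(markup):
    """Return markup reduced to the allowlisted tags, without attributes."""
    parser = AllowlistSanitizer()
    parser.feed(markup)
    parser.close()
    return "".join(parser.out)
```

Word-style tags such as o:p are not on the list, so they disappear while their text stays in place.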