Raw HTML - how to measure width/height on the server?

I have a web application that lets users upload entire .html files to my server. I wish to 'detect' the width/height of the uploaded html and store it in my DB.
So far, I have unsuccessfully tried using the System.Windows.Forms.WebBrowser control - by reading the file into a string, loading it into the browser.document:
_browser = new WebBrowser();
_browser.Navigate("about:Blank");
_browser.Document.OpenNew(true);
_browser.Document.Write(html);
Inspecting the various properties of the _browser object (document, window etc) seems to always default the size to 250x250.
I've tried putting various css size declarations in the .html file and still the same thing.
Is the only option to inspect the html string and regex match CSS properties?
How would you reliably determine what the rendered width/height would be of the document in question?
Remember, the .html file may or may not contain css properties. Maybe the user uses older, deprecated tags such as
<body width="500">
vs
<style>
body{ width: 400px; }
</style>
<body>
etc.

Even if you could capture the declared width through inspection of CSS and/or HTML tag specifications, you'd be unlikely to get the rendered width. Height will be even worse, since text wraps.
I think you may want to consider a different approach. Do you really need this? What requirement are you trying to satisfy? Can it be done in a different way?

As you've discovered, you won't be able to use a WebBrowser control because the height and width reported are the height and width of the control itself, not the document inside the control.
What you'd really need to do is write your own HTML parsing engine to calculate this out on your own. You would need to calculate out all of the lines, figure out the line height, etc.
Is this really worth the effort? You would need to make so many assumptions that such a calculation would be pretty much worthless... Differences in rendering by different browsers, customers that have their text size set to something other than the default, and probably dozens of others. Even the screen resolution would matter because, as you can see in this paragraph, text tends to wrap. You need to calculate where the text will wrap in order to calculate how many lines of text will show up. You need to factor in font sizes...
All of that said, in theory this should be doable, and the mechanics for calculating this all out would be the same concepts you would use for printing to a printer. Calculating the page height, and figuring out where you are on the page is all standard operating procedure when printing manually.
Here's an article that explains the basics. It'll be up to you to see if it's worth the effort.
http://msdn.microsoft.com/en-us/magazine/cc188767.aspx
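To make the wrapping point concrete, here is a rough sketch of how the estimated height of a block of text changes with the width you assume. The function name and the default character width and line height are made up for illustration; real browsers use actual font metrics, so treat the numbers as ballpark only.

```javascript
// Rough estimate only: assumes every character is avgCharWidthPx wide and every
// line is lineHeightPx tall. Real fonts, zoom levels and CSS break both assumptions.
function estimateTextHeight(text, containerWidthPx, avgCharWidthPx = 8, lineHeightPx = 18) {
  const charsPerLine = Math.max(1, Math.floor(containerWidthPx / avgCharWidthPx));
  const words = text.split(/\s+/).filter(Boolean);
  let lines = words.length > 0 ? 1 : 0;
  let used = 0; // characters already placed on the current line

  for (const word of words) {
    // A word needs a separating space unless it starts the line.
    const needed = used === 0 ? word.length : used + 1 + word.length;
    if (needed <= charsPerLine || used === 0) {
      used = needed; // fits (or is an overlong word that gets a line to itself)
    } else {
      lines += 1; // wrap to a new line
      used = word.length;
    }
  }
  return lines * lineHeightPx;
}
```

The same text gives a different height for every width you pass in, which is why a single "rendered height" cannot be stored without also fixing the width.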

You will not be able to find the dimensions using regular expressions - remember that there might not be any, in which case you'd have to manually measure the elements in the document, requiring a complete HTML renderer.
Doing it with Internet Explorer raises security concerns; make sure that IE is always kept up to date on your server, and that its security settings in the ASP.NET account are as tight as possible. (I'm not sure how to do that.)
Try _browser.Document.Body.OffsetRectangle.Size.
EDIT: Note that, as other people have pointed out, the height will also depend on the width, because of text wrapping, etc., so you should set the width of the IE control to an appropriate value.

Related

Command line tool to interpret HTML/CSS and get element styles

Let's say I download an html page along with all its css files (e.g. with curl)
So I have some html code, some css in head, in tags, and some css from files.
Is there a tool I can use to, e.g. get the color and font-size of that character at position 2957 in page, or the height of this tag starting at position 3917?
I am looking for either Linux command line without X, or perl modules.
Of course the tool would know how properties are inherited from parents, get overridden by CSS rules depending on their order, etc.
Thanks!
EDIT: height was a dangerous example that can confuse the reader. I do not mean the rendered height when auto e.g. I meant the string "auto". So no rendering necessary.
The standard headless browser is PhantomJS: http://phantomjs.org/ (and there are other similar ones like https://slimerjs.org/).
I'm not sure how pixel-perfect it's going to be (but that's true even with different versions of desktop browsers on a mix of OSs etc.), but would do the full DOM and CSS parsing that you can script and get results from.
Essentially you are asking for browser that can work without any graphic subsystem. Just to measure some element ( "height of this tag starting at position 3917" ) you need fonts on that machine and code that does rasterization of fonts.
I don't think any browser vendor is even looking in the rendering-on-a-headless-device direction.
So the answer is: there is almost no chance of finding such a tool.
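Setting rendering aside, the cascade part of the question (which declaration wins for a given property) can be sketched in a few lines. This is a toy, not a tool: it assumes something else has already matched selectors to the element and computed each rule's specificity as [ids, classes, types], and it ignores inheritance, inline styles and style sheet origin. The function and field names are invented for this example.

```javascript
// Compare two specificity triples [ids, classes, types], most significant first.
function compareSpecificity(a, b) {
  for (let i = 0; i < 3; i++) {
    if (a[i] !== b[i]) return a[i] - b[i];
  }
  return 0;
}

// Pick the winning value for ONE property on ONE element.
// `declarations` must be in source order; each item looks like
// { value: 'red', specificity: [0, 1, 0], important: false }.
function winningDeclaration(declarations) {
  let winner = null;
  for (const candidate of declarations) {
    if (winner === null) {
      winner = candidate;
      continue;
    }
    const byImportance =
      Number(Boolean(candidate.important)) - Number(Boolean(winner.important));
    const bySpecificity = compareSpecificity(candidate.specificity, winner.specificity);
    // !important beats normal; then higher specificity; then later source order,
    // which is why a tie (0) still replaces the current winner.
    if (byImportance > 0 || (byImportance === 0 && bySpecificity >= 0)) {
      winner = candidate;
    }
  }
  return winner ? winner.value : undefined;
}
```

The hard part a real tool has to do is everything this sketch assumes away: parsing the CSS, matching selectors against the DOM, and handling inherited properties.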

Loading font-face before img src

Given a simple HTML page made up of text and several image tags, with CSS, but without any Javascript, is there a way to tell the browser to load font-face URLs before the image sources?
It seems that many browsers will wait until the first occurrence of a tag that requires the font-family before requesting the font (source).
However, even if I place a tag with style="font-family: 'libre_baskerville' !important" at the very top of the body, it doesn't trigger the request until after the image tags sources have been requested, as seen here:
This causes issues due to browsers' (and HTTP spec itself) maximum concurrent connections to the same domain. Since the images are triggered first, the browser has to load images before it can draw text.
The images, being larger files, can take longer to download than the font-face. However, the text is typically more important (and certainly, the text in the first few lines is more important than an image that is below the fold).
A couple of the potential solutions I've considered:
Possible Solution #1:
Avoid using <img> tags, and use another tag with a CSS background-image instead. This has the disadvantage of losing the semantic meaning that the image tags provide. This also requires rules to set the width and height of the tag to match the image; these dimensions may not both be known, and if they are it's still more to maintain. It also will not work if CSS is not enabled (though this probably isn't a big concern).
Swapping out the images with tags that each have a background-image set allows the following order for network connections:
Possible Solution #2:
Host the font (or, potentially, the images) on a separate domain. While this won't change the order in which files are requested, it will prevent the "maximum concurrent connections to the same domain" issue.
This has the disadvantage of adding dependencies (increasing chances of down-time, latency, etc.), as well as having to manage multiple domains simply for fonts. This also cheats by providing a means of avoiding the question, rather than an answer - though it provides the practical results.

Is it better to not render HTML at all, or add display:none?

As far as I understand, not rendering the HTML for an element at all, or adding display:none, seem to have exactly the same behavior: both make the element disappear and not interact with the HTML.
I am trying to disable and hide a checkbox. So the total amount of HTML is small; I can't imagine performance could be an issue.
As far as writing server code goes, the coding work is about the same.
Given these two options, is one better practice than the other? Or does it not matter which I use at all?
"As far as I understand, not rendering the HTML for an element at all, or adding display:none, seem to have exactly the same behavior: both make the element disappear and not interact with the HTML."
No, these two options don’t have "exactly the same behavior".
If you hide an element with CSS (display:none), it will still be rendered for user agents that don’t support CSS (e.g., text browsers), and for user agents that override your CSS (e.g., user style sheets).
So if you don’t need it, don’t include it.
If, for whatever reason, you have to include the element, but it’s not relevant for your document/users (no matter in which presentation), then use the hidden attribute. By using this attribute, you give the information on the HTML level, hence CSS support is not needed/relevant.
You might want to use display:none in addition (this is what many CSS supporting user agents do anyway, but it’s useful for CSS-capable user agents that don’t support the hidden attribute).
You could also use the aria-hidden state in addition, which could be useful for user agents that support WAI-ARIA but not the hidden attribute.
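As a small sketch of the options described in this answer, a server-side helper could either emit nothing or emit the checkbox with the hidden attribute plus the display:none and aria-hidden fallbacks. The helper name and its options are made up for this example, and it assumes `name` is a trusted value that needs no escaping.

```javascript
// Hypothetical helper: builds the checkbox markup on the server.
// Assumes `name` is trusted; escape it if it can come from user input.
function renderCheckbox({ name, include = true, hidden = false }) {
  if (!include) {
    return ''; // not needed at all: emit no markup
  }
  // hidden carries the meaning at the HTML level; display:none and
  // aria-hidden are fallbacks for user agents that ignore the attribute.
  const hiddenAttrs = hidden ? ' hidden aria-hidden="true" style="display:none"' : '';
  return `<input type="checkbox" name="${name}"${hiddenAttrs}>`;
}
```

For example, renderCheckbox({ name: 'newsletter', include: false }) returns an empty string, so nothing reaches text browsers or user style sheets either.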
I mean do you need that checkbox? If not then .hide() is just brushing things under the carpet. You are making your HTML cluttered as well as your CSS. However, if it needs to be there then sure, but if you can do without the checkbox then I would not have it in the HTML.
Keep it simple and readable.
The only positive thing I see in hiding it is in the case where you might want to add it back in later as a result of a button being clicked or something else activating it in the page. Otherwise it is just making your code needlessly longer.
For such a tiny scenario the result would be practically the same. But hiding the controls with CSS is IMO not something that you want to make a habit of.
It is always a good idea to make both the code and its output efficient to the point that is practical. So if it's easy for you to not include some controls in the output by adding a little condition everything can be managed tidily, try to do so. Of course this would not extend to the part of your code that receives input, because there you should always be ready to handle any arbitrary data (at least for a public app).
On the other hand, in some cases the code that produces the output is hard to modify; in particular, giving it the capability to determine what to do could involve doing damage in the form of following bad practices: perhaps add a global variable, or else modify/override several functions so that the condition can be transferred through. It's not unreasonable in that case to just add a little CSS in order to again, achieve the solution in a short and localized manner.
It's also interesting to note that in some cases the decision can turn out to be based on hard external factors. For example, a pretty basic mechanism of detecting spambots is to include a field that appears no different in HTML than the others but is made invisible with CSS. In this situation a spambot might fill in the invisible field and thus give itself away.
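A minimal sketch of the server-side half of that honeypot check, assuming the form has already been parsed into a plain object and that the CSS-hidden field is called "website" (both the function and the field name are invented here):

```javascript
// Returns true when the CSS-hidden honeypot field came back filled in.
// A human never sees the field, so any non-blank value is suspicious.
function looksLikeSpambot(formFields, honeypotName = 'website') {
  const value = formFields[honeypotName];
  return typeof value === 'string' && value.trim() !== '';
}
```

This only works because the field stays in the HTML and is hidden with CSS alone, which is exactly the case where not rendering the element would defeat the purpose.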
The point of confusion here is this: why would you ever use display: none instead of simply not rendering something?
To which the answer is: because you're doing it client side!
"display: none" is better practice when you're doing client side manipulations where the element might need to disappear or reappear without an additional trip to the server. In that case, it is still part of the logical structure of the page and easier to access and manipulate it than remove (and then store in memory in Javascript) and insert it.
However if you're using a server-side heavy framework and always have the liberty of not rendering it, yes, display:none is rather pointless.
Go with "display:none" if the client has to do the work, and manage its relation to the DOM
Go with not rendering it if every time the rendered/not rendered decision changes, the server is generating fresh (and fairly immutable) HTML each time.
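For the client-side case, the toggle can be as small as this. It is only a sketch: `setVisible` is a made-up helper, and it assumes the element was hidden with an inline display:none rather than by a style sheet rule.

```javascript
// Show or hide an element without removing it from the DOM,
// so it can come back later with no extra trip to the server.
function setVisible(element, visible) {
  // Clearing the inline value lets the element fall back to its normal display.
  element.style.display = visible ? '' : 'none';
  return element;
}
```

In a page this would be called with something like setVisible(document.getElementById('extra-option'), false), where the id is whatever your markup uses.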
I'm not a fan of adding markup to your HTML that cannot be seen and serves no purpose. You didn't provide a single benefit of doing that in your question and so the simple answer is: If you don't need a checkbox to be part of the page, then don't include it in your markup.
I suspect that a hidden checkbox will not add any noticeable time to the download or work by the server. So I agree it's not really a consideration. However, many pages do have extra content (comments, viewstate, etc.) and it can all add up. So anyone with the attitude that they will go ahead and add content that is not needed and never seen by the user, I would expect them to create pages that are noticeably slower overall.
Now, you haven't provided any information about why you might want to include markup that is not needed. Although you said nothing about client script, the one case where I might leave elements in a page that are hidden is when I'm writing client script to remove them. In this case, I may hide() it and leave in the markup. One reason for that is that I could easily show it again if needed.
That's my answer, but I think you'd get a much better answer if you described what considerations you had for including markup on the page that no one will see. Surely, it must offer some benefit that you haven't disclosed or you would have no reason to do it.

Avoid HTML table cells being cut when printed

I have an HTML document with many tables which I want to print. The problem is that sometimes the end of the page is reached in the middle of a row, so half of it is printed on one page and the rest on the next page, even cutting a single line of text in two.
Is there any way to avoid this?
NOTE: I have already read this question, but I need a solution that does not involve CSS, because CSS is not working on the target computer, and I can't change that.
Even with CSS, the issue is difficult due to limited browser support to CSS pagination (as can be seen from the answers to the question you refer to).
Over the years this problem has persisted, and I don't think anyone has come up with an HTML trick for the purpose. There have been some tricks for trying to prevent page breaks inside a paragraph or list by placing it in a one-cell table, but this has worked only occasionally, and besides, in your case you already have a table.
So I’m afraid there is no solution, apart from using elements that cause extra vertical spacing, like a pre element containing empty lines (to push the entire table to next page—this may of course make things much worse when the parameters of the situation, like page formatting and paper size, differ from your expectations) or splitting a table into two tables, possibly with extra space between them (even more problematic).
If the target computer doesn't support (enough of) CSS, then you can create a PDF document on the server. If you set the Content-Type correctly, the browser will download the document and start the PDF reader of the system.
If this isn't possible, then there is no solution.
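For the PDF route, the part that matters to the browser is the response headers. Here is a sketch of what they might look like, assuming the PDF bytes have already been produced by some PDF library of your choice; the helper name and the default file name are made up.

```javascript
// Headers for sending an already-generated PDF to the browser.
// "inline" asks the browser to show it in its PDF viewer;
// use "attachment" instead to force a download.
function pdfResponseHeaders(byteLength, filename = 'tables.pdf') {
  return {
    'Content-Type': 'application/pdf',
    'Content-Length': String(byteLength),
    'Content-Disposition': `inline; filename="${filename}"`,
  };
}
```

With Node's built-in http module this could be used as res.writeHead(200, pdfResponseHeaders(pdfBuffer.length)) followed by res.end(pdfBuffer), where pdfBuffer is the generated document.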

programmatically figure out the height and width at which an html will render in the browser with C#

I have almost 200,000 html pages and I need to figure out the height and width at which each html will render in any browser. I only need approximate numbers. How I can programmatically do this with C#?
C# does not execute inside of a browser, and should not be used to try and determine the width and height a given HTML page will render at in a browser. Moreover, there is no answer for "any browser", as different browsers may support different fonts, may render the same content slightly differently, and may be configured with different display-related settings (most browsers allow the user to arbitrarily scale the default font size up or down as desired, which would of course impact the final render size).
In general, however, I would suggest you do something like:
1. Come up with a JavaScript snippet that can compute the current size of the document.
2. Write a C# (or Java, C, bash, etc.) program to append your snippet to each of your 200,000 pages.
3. Use a browser-based test harness like Selenium or Webdriver to load up each of your 200,000 pages, extract the result from your JavaScript snippet, and log it out to somewhere convenient.
4. Optionally, you can repeat step 3 with different browsers to get the width/height for all the different browsers that you care about.
Edit: Apparently Webdriver and Selenium are the same thing now. When did that happen?
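The first step in the answer above (a JavaScript snippet that computes the document size) might look something like this. It is a sketch, not the one true way: it takes the largest of the usual size properties on body and the root element, and the function name is made up. Note that the root element's client size reflects the browser window, so the result still depends on the window size the harness uses.

```javascript
// Best-effort document size: the largest value reported by the usual
// scroll/offset/client properties on <body> and the root element.
function measureDocument(doc) {
  const body = doc.body || {};
  const root = doc.documentElement || {};
  // Ignore properties that are missing and never return less than 0.
  const largest = (...values) =>
    Math.max(0, ...values.filter((v) => typeof v === 'number'));

  return {
    width: largest(
      body.scrollWidth, body.offsetWidth,
      root.scrollWidth, root.offsetWidth, root.clientWidth
    ),
    height: largest(
      body.scrollHeight, body.offsetHeight,
      root.scrollHeight, root.offsetHeight, root.clientHeight
    ),
  };
}
```

In the browser you would call measureDocument(document). From a Selenium script the same logic can be run through the driver's script execution call (ExecuteScript in the C# bindings) so the numbers come back to your own program for logging.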
It's pretty straightforward. Just write an HTML parser and enough of a rendering engine to at least know the height and width of any HTML element (for any screen size, font setting?). Obviously you will need a CSS parser and engine. Since you want to know for any browser, you will need to have modes of emulating each. If you can't directly get the DOM of the HTML pages you are trying to measure you will need a JavaScript engine to get the values as they appear on the page.
Or you could run the HTML in a browser and use JavaScript to get the values. This won't be in .NET, though. You could have the JavaScript post the data to an ASP.NET page if you like.
Or you could use one of the tools recommended in answer to your earlier question.