Command line tool to interpret HTML/CSS and get element styles - html

Let's say I download an HTML page along with all its CSS files (e.g. with curl).
So I have some HTML code, some CSS in the head, some in tags, and some CSS from files.
Is there a tool I can use to get, e.g., the color and font-size of the character at position 2957 in the page, or the height of the tag starting at position 3917?
I am looking for either Linux command line without X, or perl modules.
Of course the tool would need to know how properties are inherited from parents, get overridden by CSS rules depending on their order, etc.
Thanks!
EDIT: height was a dangerous example that can confuse the reader. I do not mean the rendered height when the value is auto; I mean the string "auto" itself. So no rendering is necessary.
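The lookup described in the question needs no rendering: for each property it is a matter of picking the winning declaration among those that match the element. Here is a minimal sketch of that ordering rule, assuming the matching declarations have already been collected. The `Declaration` tuple and the `cascaded_value` function are hypothetical names made up for illustration, and real CSS adds origins, `!important`, inline styles and more on top of this.

```python
from typing import NamedTuple

class Declaration(NamedTuple):
    prop: str
    value: str
    specificity: tuple  # (ids, classes, element names) of the selector
    order: int          # position in the combined source; later wins ties

def cascaded_value(prop, declarations):
    """Return the winning declared value for prop, or None if nothing declares it."""
    matching = [d for d in declarations if d.prop == prop]
    if not matching:
        return None
    # Higher specificity wins; among equals, the later declaration wins.
    return max(matching, key=lambda d: (d.specificity, d.order)).value

# Hypothetical declarations that match one <p class="note"> element.
decls = [
    Declaration("color", "black", (0, 0, 1), 0),  # p { color: black }
    Declaration("color", "red", (0, 1, 0), 1),    # .note { color: red }
    Declaration("height", "auto", (0, 0, 1), 2),  # p { height: auto }
    Declaration("color", "blue", (0, 0, 1), 3),   # p { color: blue }, later but weaker
]

print(cascaded_value("color", decls))
print(cascaded_value("height", decls))
```

Note that the height lookup returns the declared string, not a measured size, which is what the EDIT above asks for.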

The standard headless browser is PhantomJS: http://phantomjs.org/ (and there are other similar ones like https://slimerjs.org/).
I'm not sure how pixel-perfect it's going to be (but that's true even with different versions of desktop browsers on a mix of OSs etc.), but it would do the full DOM and CSS parsing, which you can script and get results from.

Essentially you are asking for a browser that can work without any graphics subsystem. Just to measure some element ("height of this tag starting at position 3917") you need fonts on that machine and code that does rasterization of fonts.
I don't think that any of the browser vendors are even looking in the rendering-on-headless-device direction.
So the answer is: there is almost no chance of finding such a tool.

Related

Low asterisk in HTML

There are a number of asterisk (*) types as you can see here:
http://www.eki.ee/letter/chardata.cgi?search=asterisk
Even now, some of these characters, like the one with the code "204E", also known as "low asterisk", are not rendered in HTML (at least in Chrome).
You can see the character here:
&#x204E; -> ⁎
Other similar types work however:
&#x2722; -> ✢
&#x2723; -> ✣
&#x2724; -> ✤
Of course out of all the possible types, the authors of my input data have chosen ⁎ to work with.
It makes me think it should be somewhat general, because I saw solutions where a tiny image was used instead of this character in the entire HTML document. Needless to say I do not like that approach even a bit.
Is there a way to make this work in HTML? Is this possibly a browser specific issue?
UPDATE:
Internet Explorer 9 and Google Chrome also fail to render this special character.
Firefox and Chromium (Ubuntu) seem to be able to render it.
Please note, that I would like to find a general solution if possible.
My HTML code:
<html>
it works in chromium though ⁎
</html>
I use Chromium in Ubuntu, and it works fine in it.
and here goes screen shot of IE, in case you want to see it.
Furthermore: for those looking for a middle asterisk &#x2217; -> ∗
This primarily depends on the fonts installed in the user’s system, not on the browser (though some browsers, most notably IE, may fail to use all the fonts in the system as they should). Regarding e.g. U+204E, font support is relatively limited: no font shipped with Windows contains it, whereas Linux systems probably have some font that contains it.
Using @font-face for some suitable free font, you could make the character display on most computers (excluding basically just those that have font loading disabled). See my Guide to using special characters in HTML.
In this case, that would probably be overkill, if you just need e.g. a low asterisk. A normal asterisk, in a lowered position, should be sufficient – at least compared with the overall typographic quality of HTML documents. Example:
<style>
.low {
position: relative;
top: 0.55ex;
}
</style>
Compare: *⁎<span class=low>*</span>
(Using relative positioning is safer than using vertical-align, directly or via the sub element, since vertical-align usually messes up line spacing.)

HTML5 Embedded Fonts render differently across browsers?

I want to make this page look the same across all browsers. Specifically, I want the wrapping point of the text to be exactly the same on all browsers so I can create a PDF version with 100% accuracy. Check this out in FF vs. Chrome, for example.
http://santaspencil.com/desktop/embedded-test/embedded-fonts-test.php
Questions:
- Can it be done?
- Are there alternatives that don't require the user to download a plugin?
You should consider embedding the font file into your CSS. But as usual, stone-age IE cannot do this, as you will need to include an EOT font file on your server.
http://base64fonts.com will convert your font files to base64 and then produce CSS code for you to copy and paste into your HTML. This will help ensure your font loads across browsers (except IE).
Good luck
... I want the wrapping point of the text to be exactly the same on all browsers ...
Bang head here (sign on brick wall). Web technology doesn't even try to do this. If you figure out a way to provide your own font -such as embedded webfonts- you can SORTA make it work. But if 100% is your goal, you might as well give up sleeping.
One of the neat things about browsers is their "liquid layout" capability, automatically rendering a page differently on a tablet than on a desktop to fill the different screen sizes, for example. One of the prices you pay for this infinite rerenderability, though, is the inability to specify the appearance exactly. Besides, edge cases will always arise and bite. For example, if the available line is 0-73 units and the text you want to put in it is 74 units long, does it "fit" or not? (I.e., does zero count, and is using up the very last unit a "fit" or an indication of the need to "wrap"?)
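To make that edge case concrete, here is a small sketch of greedy line breaking in which the same three words fit on one line or wrap, depending on whether the line is counted as 74 or 73 units wide. The word widths and the `wrap_greedy` function are made up for illustration; no browser is claimed to break lines exactly this way.

```python
def wrap_greedy(word_widths, space_width, line_width):
    """Greedy line breaking; returns how many words end up on each line."""
    lines, count, used = [], 0, 0
    for w in word_widths:
        needed = w if count == 0 else used + space_width + w
        if count > 0 and needed > line_width:
            # The word does not fit: close the current line, start a new one.
            lines.append(count)
            count, used = 1, w
        else:
            count, used = count + 1, needed
    if count:
        lines.append(count)
    return lines

widths = [30, 20, 22]  # hypothetical word widths; with two spaces the text is 74 units
print(wrap_greedy(widths, 1, 74))  # line counted as 74 units wide
print(wrap_greedy(widths, 1, 73))  # line counted as 73 units wide
```

A one-unit disagreement between two engines about the available width, or about a glyph's width, is enough to move the wrap point.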
The only way to have browsers render your exact appearance is to give them what appears to them to be an image. Displaying the text on your screen, taking a screenshot of it, and making that screenshot into a *.GIF is one way.
A PDF file works too, as it appears to a browser to be a "funny" image with its own rendering engine. Most rendering engines are probably the same (i.e. the ones from Adobe) even if the browsers aren't the same, so it's much more likely to work. Providing PDF documents on the web works pretty well and is pretty widely supported. If a URL looks like http://yoursys.yourdomain/yourpath/yourfile.pdf most browsers will fetch it and start their PDF rendering tool and display it directly ...usually INside the browser window so the user isn't even aware of a different application having been used.
As to the last part of your question, it's the wrong question. It should be "solutions that don't require a plugin THE USER DOESN'T ALREADY HAVE". The advantage of a PDF plugin is the vast majority of users already have it. Not all plugins are evil/inconvenient ...just the less common ones (or the Flash plugin if your target is iPhones where users aren't even allowed to download it:-).
good luck!
This is probably way too late, but I did not know this until today. There is something called a non-breaking space, represented by &nbsp; in HTML, which you can use to prevent unwanted line breaks and the like. Wikipedia has a pretty good writeup on it.
http://en.wikipedia.org/wiki/Non-breaking_space

programmatically figure out the height and width at which an html will render in the browser with C#

I have almost 200,000 html pages and I need to figure out the height and width at which each html will render in any browser. I only need approximate numbers. How I can programmatically do this with C#?
C# does not execute inside of a browser, and should not be used to try and determine the width and height a given HTML page will render at in a browser. Moreover, there is no answer for "any browser", as different browsers may support different fonts, may render the same content slightly differently, and may be configured with different display-related settings (most browsers allow the user to arbitrarily scale the default font size up or down as desired, which would of course impact the final render size).
In general, however, I would suggest you do something like:
1. Come up with a JavaScript snippet that can compute the current size of the document.
2. Write a C# (or Java, C, bash, etc.) program to append your snippet to each of your 200,000 pages.
3. Use a browser-based test-harness like Selenium or Webdriver to load up each of your 200,000 pages, extract the result from your JavaScript snippet, and log it out to somewhere convenient.
4. Optionally, you can repeat step 3 with different browsers to get the width/height for all the different browsers that you care about.
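The first two steps above could look roughly like the following sketch, written in Python rather than C#. The snippet records the document size in a `data-size` attribute on the root element once the page has loaded; that attribute name and the `append_snippet` helper are assumptions made for illustration, and the browser harness would then read the attribute back.

```python
import re

# JavaScript to inject: once the page has loaded, store "WIDTHxHEIGHT"
# on the root element (hypothetical attribute name).
SNIPPET = """<script>
window.addEventListener('load', function () {
  var el = document.documentElement;
  el.setAttribute('data-size', el.scrollWidth + 'x' + el.scrollHeight);
});
</script>"""

def append_snippet(html, snippet=SNIPPET):
    """Insert snippet before the last </body>, or at the very end if there is none."""
    matches = list(re.finditer(r"</body\s*>", html, flags=re.IGNORECASE))
    if not matches:
        return html + snippet
    pos = matches[-1].start()
    return html[:pos] + snippet + html[pos:]

page = "<html><body><p>Hello</p></BODY></html>"
print(append_snippet(page))
```

Matching the closing tag case-insensitively matters for a large batch of hand-written pages, and falling back to appending at the end keeps pages without a body tag measurable.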
Edit: Apparently Webdriver and Selenium are the same thing now. When did that happen?
It's pretty straightforward. Just write an HTML parser and enough of a rendering engine to at least know the height and width of any HTML element (for any screen size, font setting?). Obviously you will need a CSS parser and engine. Since you want to know for any browser, you will need to have modes emulating each. If you can't directly get the DOM of the HTML pages you are trying to measure, you will need a JavaScript engine to get the values as they appear on the page.
Or you could run the HTML in a browser and use JavaScript to get the values. This won't be in .NET, though. You could have the JavaScript post the data to an ASP.NET page if you like.
Or you could use one of the tools recommended in answer to your earlier question.

Raw HTML - how to measure width/height on the server?

I have a web application that lets users upload entire .html files to my server. I wish to 'detect' the width/height of the uploaded html and store it in my DB.
So far, I have unsuccessfully tried using the System.Windows.Forms.WebBrowser control - by reading the file into a string, loading it into the browser.document:
_browser = new WebBrowser();
_browser.Navigate("about:Blank");
_browser.Document.OpenNew(true);
_browser.Document.Write(html);
Inspecting the various properties of the _browser object (document, window etc) seems to always default the size to 250x250.
I've tried putting various css size declarations in the .html file and still the same thing.
Is the only option to inspect the html string and regex match CSS properties?
How would you reliably determine what the rendered width/height would be of the document in question?
Remember, the .html file may or may not contain css properties. Maybe the user uses older, deprecated tags such as
<body width="500">
vs
<style>
body{ width: 400px; }
</style>
<body>
etc.
Even if you could capture the declared width through inspection of CSS and/or HTML tag specifications, you'd be unlikely to get the rendered width. Height will be even worse, since text wraps.
I think you may want to consider a different approach. Do you really need this? What requirement are you trying to satisfy? Can it be done in a different way?
As you've discovered, you won't be able to use a WebBrowser control because the height and width reported are the height and width of the control itself, not the document inside the control.
What you'd really need to do is write your own HTML parsing engine to calculate this out on your own. You would need to calculate out all of the lines, figure out the line height, etc.
Is this really worth the effort? You would need to make so many assumptions that such a calculation would be pretty much worthless... Differences in rendering by different browsers, customers that have their text size set to something other than the default, and probably dozens of others. Even the screen resolution would matter because, as you can see in this paragraph, text tends to wrap. You need to calculate where the text will wrap in order to calculate how many lines of text will show up. You need to factor in font sizes...
All of that said, in theory this should be doable, and the mechanics for calculating this all out would be the same concepts you would use for printing to a printer. Calculating the page height, and figuring out where you are on the page is all standard operating procedure when printing manually.
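A rough sketch of that kind of calculation, under loudly stated assumptions: every character is taken to be a fixed fraction of the font size wide, and every line a fixed multiple of the font size tall. Both ratios and the `estimated_height` function are made-up defaults for illustration, not values measured from any font or browser.

```python
import textwrap

def estimated_height(text, box_width_px, font_size_px,
                     char_width_ratio=0.5, line_height_ratio=1.2):
    """Estimate the height in px of text wrapped into a box of the given width."""
    # Assumed average character width: font size times a fixed ratio.
    chars_per_line = max(1, int(box_width_px / (font_size_px * char_width_ratio)))
    lines = textwrap.wrap(text, width=chars_per_line) or [""]
    return len(lines) * font_size_px * line_height_ratio

text = " ".join(["word"] * 40)
for width in (400, 200):
    print(width, estimated_height(text, width, 16))
```

Changing the box width or the font size changes the number of lines and therefore the result, which is why each of the assumptions listed above feeds directly into the answer.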
Here's an article that explains the basics. It'll be up to you to see if it's worth the effort.
http://msdn.microsoft.com/en-us/magazine/cc188767.aspx
You will not be able to find the dimensions using regular expressions - remember that there might not be any, in which case you'd have to manually measure the elements in the document, requiring a complete HTML renderer.
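To illustrate the limits of inspecting declarations, here is a minimal sketch that looks only for a width declared on the body, either as the old-style attribute or in a simple `body { ... }` rule inside a style element. The `BodyWidthFinder` class and `declared_body_width` function are hypothetical names for illustration; the sketch ignores external stylesheets, other selectors and the cascade, and it returns None when nothing is declared.

```python
import re
from html.parser import HTMLParser

class BodyWidthFinder(HTMLParser):
    """Collects a width declared on <body> by attribute or by a body{} rule."""

    def __init__(self):
        super().__init__()
        self.attr_width = None
        self.css_width = None
        self._in_style = False

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.attr_width = dict(attrs).get("width")
        elif tag == "style":
            self._in_style = True

    def handle_endtag(self, tag):
        if tag == "style":
            self._in_style = False

    def handle_data(self, data):
        if self._in_style:
            # Only a plain "body { ... width: X ... }" rule; skips max-width etc.
            m = re.search(r"\bbody\s*\{[^}]*?(?<![\w-])width\s*:\s*([^;}]+)", data)
            if m:
                self.css_width = m.group(1).strip()

def declared_body_width(html):
    """Return the declared body width as a string, or None if none is declared."""
    finder = BodyWidthFinder()
    finder.feed(html)
    finder.close()
    return finder.css_width or finder.attr_width

print(declared_body_width('<html><body width="500"><p>x</p></body></html>'))
print(declared_body_width('<html><body><p>x</p></body></html>'))
```

Even when a value comes back, it is only what the author declared, not what a browser would end up rendering.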
Doing it with Internet Explorer raises security concerns; make sure that IE is always kept up to date on your server, and that its security settings in the ASP.NET account are as tight as possible. (I'm not sure how to do that.)
Try _browser.Document.Body.OffsetRectangle.Size.
EDIT: Note that, as other people have pointed out, the height will also depend on the width, because of text wrapping, etc., so you should set the width of the IE control to an appropriate value.

Determining font and size for text from HTML

I need to determine what font and size will be used for each HTML element. They may be set in various css, div, span, or on the element itself.
If I were to do this manually I would start by looking at the element and work backwards until I came to span, div, or css that had a font and/or size. That is the value I want. The browser can obviously do this because it displays the text using a font and size. I want to print a list with two columns, one with the text and the other with the font/size.
If you are looking for a non-programmatic way, I would suggest the Firebug plugin for Mozilla. Firebug will not only show you all attributes, but also allow you to turn them on and off in the client.
You want to reproduce the functionality of parsing HTML into DOM objects, parsing CSS into rules over those objects, and applying those rules across the objects to end up with the associated computedStyle values? That sounds pretty much like a web browser to me.
You could try scripting Firefox or an IE WebBrowser control. There's also an open-source native Java browser/toolkit being developed, though I don't know how practical that is yet.
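For a feel of the "work backwards through the parents" part alone, here is a toy sketch that prints each piece of text with its inherited font family and size. It is nowhere near a browser: it assumes fonts are set only through inline style attributes, that every non-void tag is properly closed, and that the starting defaults are serif at 16px. The `FontLister` class and `list_fonts` function are hypothetical names, and sizes are reported as declared (for example "2em"), not computed.

```python
from html.parser import HTMLParser

INHERITED = ("font-family", "font-size")
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}

def parse_inline_style(style):
    """Turn 'a: b; c: d' into {'a': 'b', 'c': 'd'}."""
    pairs = (item.split(":", 1) for item in style.split(";") if ":" in item)
    return {k.strip().lower(): v.strip() for k, v in pairs}

class FontLister(HTMLParser):
    def __init__(self, defaults):
        super().__init__()
        self.stack = [dict(defaults)]  # fonts of the innermost open element are last
        self.rows = []                 # (text, font-family, font-size)

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        style = parse_inline_style(dict(attrs).get("style") or "")
        current = dict(self.stack[-1])  # start from the parent's values
        current.update({p: style[p] for p in INHERITED if p in style})
        self.stack.append(current)

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and len(self.stack) > 1:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text:
            fonts = self.stack[-1]
            self.rows.append((text, fonts["font-family"], fonts["font-size"]))

def list_fonts(html, defaults=None):
    lister = FontLister(defaults or {"font-family": "serif", "font-size": "16px"})
    lister.feed(html)
    lister.close()
    return lister.rows

sample = ('<div style="font-family: Arial; font-size: 12px">Hello '
          '<span style="font-size: 20px">big</span> world</div>')
for text, family, size in list_fonts(sample):
    print(f"{text}\t{family} / {size}")
```

Handling stylesheets, selectors and specificity on top of this is exactly the work a browser engine already does, which is why scripting a real browser is the more practical route.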