When writing filters for the Firefox Add-On 'Adblock Plus' you can write rules to completely remove certain HTML elements from the page, but the filtering criteria are in fact limited to a handful of things, like class and id names and attribute values.
What I was hoping for is, say, a Firefox add-on which would pass the HTML for a page to some arbitrary process you specify, where this process could reconstitute the HTML for the entire page in any arbitrary way, and the browser would then display the result. Is there a Firefox add-on that allows this, or is this sort of operation commonly accomplished by some entirely different but well-known means (perhaps not even browser-specific)?
Wouldn't this allow you to augment pages coming from some website to your browser with arbitrary new features, maybe from an entirely different website?
You are looking for Greasemonkey.
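For illustration, here is a minimal user-script sketch; the match pattern and the rewriting logic are just placeholders, but it shows the idea: Greasemonkey runs your JavaScript against the loaded page, where you can rewrite the DOM however you like, including injecting content from elsewhere.

// ==UserScript==
// @name   Example page rewriter
// @match  *://example.com/*
// @grant  none
// ==/UserScript==
(function () {
  'use strict';
  // Placeholder: remove every element carrying a hypothetical "ad" class.
  document.querySelectorAll('.ad').forEach(function (el) {
    el.remove();
  });
  // Placeholder: inject a new feature into the page.
  var banner = document.createElement('div');
  banner.textContent = 'Injected by my user script';
  document.body.prepend(banner);
})();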
Related
My question is similar to what this poster is asking:
What are the concrete risks of using custom HTML attributes?
but I want to know what can happen if I use custom elements and custom attributes with the current HTML specs (HTML5).
Example
<x a="5"> abc </x>
Visually I see no issues in any browser. js works:
x = document.getElementsByTagName('x');
alert('x has attribute a=' + x[0].getAttribute('a'));
css works too:
x{
color: red;
}
x[a]{
text-decoration:underline;
}
Possible Risks include
Backward compatibility. In particular, IE8 and below won't understand your tag, and you'll have to remember to call document.createElement('x') for each of your new elements (see the sketch after this list).
Semantics - having your html machine-readable may not be your goal, but there may come a time when it needs to be parsed in a moderately useful fashion.
Portability & maintenance - there are plenty of current html tags that almost certainly do what you want them to do. At some point, someone else may have to look after your code. Is there anything to be gained from having them spend time learning what all your new tags are for?
SEO - don't take the risk of a penalty just because it's something you can do.
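For the backward-compatibility point above, the usual workaround (the same trick the HTML5 shiv relies on) is to create each unknown element once via script before it is used; a minimal sketch, assuming the <x> tag from the example:

// Minimal sketch for IE8 and below: creating the unknown element once via
// script lets the old parser recognise <x> so it can be styled and scripted.
// This must run before any <x> tags appear in the markup.
document.createElement('x');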
For completeness, there are justified reasons to do it though. If you can demonstrate your new tag improves the semantics of your page (your example of 'x' obviously doesn't) and you can think of some use-case where your page will be machine-parsed by your own process, then go for it.
The only issue I can think of is that other applications, including search engines, won't recognize your custom elements and attributes, so they won't know what to look for or how to use them, which is a decided disadvantage for SEO. Other applications trying to access your content, including RESTful apps, will not know either unless you tell the app developers.
This was always listed as one of the disadvantages of XML/XHTML, but here we are again, having come full circle back to where we should have been in the first place: using XML on the web... but I digress.
The main reason custom elements were frowned upon in the past is because browsers don't know what to do with them and there was no standardised way of telling them what they are.
What are the risks of using custom HTML elements in HTML5 without following standardisation?
Browsers will handle them differently:
Some browsers may ignore the elements and pretend they're not there; <x>, I don't know what <x> is, let's get rid of that.
Some browsers may attempt to convert the element into something else; define a <tab> element and a browser may think you've mis-spelled <table>, for instance.
You'd have to handle what the element is supposed to do across a large range of devices; just because it works on your PC doesn't mean it works on your phone, or your TV, or your e-reader... or your WiFi-powered fridge...
The good news is that there is some new documentation being written up to allow developers to define their own custom elements in a standardised way. Custom Elements, as it's titled, gives both developers and browser vendors the know-how to allow developers to implement and script custom elements in a way which will work across all supporting browsers... or that's the idea, anyway.
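As a rough sketch of what that ended up looking like in browsers that support it (the tag name here is hypothetical, and note that the spec requires custom element names to contain a hyphen, so a bare <x> would not qualify):

// Minimal sketch of registering a custom element (modern browsers only).
// The tag name is made up; custom names must contain a hyphen.
class MyTab extends HTMLElement {
  connectedCallback() {
    // Runs when the element is inserted into the document.
    this.textContent = 'Rendered by <my-tab>';
  }
}
customElements.define('my-tab', MyTab);
// Usage in markup: <my-tab></my-tab>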
I'm trying to write a crawler that gets raw HTML data and finds the title, price, update date, photo, etc. fields and writes them to a database. This is the classic, old way to crawl data.
I think that I can do this job in another way.
If I crawl all the pages on the website (maybe more than 1000) and compare them all, I can find the specific areas.
I mean the HTML tags will always be the same; only specific areas will change, like the title, image, etc.
So, what is the best way to determine changed areas?
compare them all, I can find the specific areas
what is the best way to determine changed areas?
In your question you describe a scraping/crawling approach of comparing parts of pages and extracting the data from specific areas. This smells of a regex approach. Do not use it; it is very inefficient. Rather, use XPath, operating on XML structures.
So, be simple:
Get html
Make it DOM
Make DOM a valid XML
Apply XPath queries to the XML (a minimal sketch follows below)
Believe me, XML libraries are well able to handle huge structures (including stray HTML tags) and traverse them. A classic example of using XPath is in this post of mine.
To determine the data node paths, just use the web inspector tools (F12 in Chrome and IE, Ctrl+Shift+I in Firefox) to see the HTML tags containing the useful info.
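As a sketch of step 4, here is what such a query looks like when run against the DOM in a browser console using document.evaluate; in a crawler you would use whichever XPath library your language provides against the parsed document. The expression itself is only an example and assumes a hypothetical class name on the title node:

// Minimal sketch: run an XPath query against the parsed document.
// The expression assumes a hypothetical class="product-title" on the title node.
var result = document.evaluate(
  "//h1[@class='product-title']/text()",
  document,
  null,
  XPathResult.ORDERED_NODE_SNAPSHOT_TYPE,
  null
);
for (var i = 0; i < result.snapshotLength; i++) {
  console.log(result.snapshotItem(i).nodeValue);
}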
When creating a website, why should you care about HTML with no style?
Is there any device which will render HTML only (no CSS or JavaScript)?
Do you usually care how your website will display without CSS?
Why is it important?
There are several cases in which websites may be used without styling. As mentioned in the comments, screen readers (such as those used by visually-impaired people) read only content, not styling.
Perhaps more importantly, many search engine spiders (think: Google) read your site without styling. When you view your site without CSS, you will gain a better understanding of how search engines view your content.
And if you are lucky, or your content is particularly geeky, you may get the occasional guru who browses your site via Lynx.
There seem to be a few misconceptions here. First of all, and most importantly, screen readers do take CSS and JavaScript into account. Why? Simply because, unlike in the past, they are not running on their own; rather, they work as add-ons for existing browsers or embed the rendering engines in their own systems.
Does that mean you don't need to concern yourself with screen readers at all? Sadly, that's not the case either. For example, if you add display:table to an element just because you want to vertically align something, some screen readers will actually treat it like a real table (which makes no practical sense). The good part, though, is that pages are read top-to-bottom, header and menu first (if found), and that adding display:none to an element through JavaScript will hide the element from the screen reader as well.

Now, the following is going to sound really harsh, but unless you're making a real high-profile website I wouldn't advise you to concern yourself with this too much. On the one hand, screen readers are becoming better and better (try, for example, the one included on your Android device if you have a recent version of Android), and on the other hand, blind people are used to websites being a 'bit' messy. That doesn't mean you should start using Flash or otherwise crazy stuff, but it does mean that if you just write a proper website, make your menu a list, make your divisions divs and not tables, etc., you should in general be fine. And if you are making a high-profile website, then you should check out WAI-ARIA.
Now, getting to the search engine part: that's not true either, for the big search engines at least. Google does take styling into account. Not all of it (much styling is simply unimportant to Google), but it will actually work out which content is hidden and analyse your JavaScript to see whether hidden content will ever be shown (as part of its anti-SEO work); it will search for links in your JavaScript, and probably does lots more I am not aware of. Bing does this to a large extent as well, though duckduckgo, for example, does not do this much, or at all. Either way, once again, the notion that Google sees your site the way Lynx does was true in the distant past, but by now it is invalid.
And if you check your server logs you will see that nobody accessed your site through Lynx. That's just the reality of life nowadays. In the past (again) people would occasionally use Lynx if they only had access to a console, but nowadays it's far easier to pull your phone from your pocket, which runs a full web browser.
First part of the answer: 'text-based browsers'
Text-based browser list
Alynx
ELinks (active version of Links)
Emacs/W3
Line Mode Browser
Links
Lynx
Net-Tamer
w3m
WebbIE
Second part: 'search engines'
List of semantic search engines
List of search engines
Third part: 'web accessibility', where software helps people with disabilities get access to the web.
It's important to note that for the third part, accessibility, it is sometimes a legal requirement. For example, in the UK it is illegal to have a website that is not accessible to blind people. There are similar requirements for US government services. – slebetman
It's also an applicable law in Canada.
See this list of tools from the W3C for a complete list of Web Accessibility Evaluation Tools.
CSS isn’t an on/off thing. Although CSS may be completely disabled, it is much more common that some of your CSS settings get ignored or overridden. Here is an incomplete list of cases (see my CSS Caveats for some additional details):
Speech-based browsers generally ignore most of CSS, largely because most of CSS is directed towards visual rendering.
So do “text-only browsers” (more accurately, character cell browsers, which render in plain text only using a monospace font but may be able to use colors and bolding).
Search engines generally don’t care about CSS at all.
CSS support varies. The more advanced CSS features you use, the more probable it is that many browsers don’t implement them.
CSS support may be disabled by the user, completely or partially.
User style sheets may interfere with your CSS code or even override them.
Browser settings, e.g. a minimum font size, may make some of your CSS settings ineffective.
Browsers have bugs. The more complicated CSS techniques you use, the more probable it is that you trigger a bug in some browsers.
An external CSS file (the recommended way of using CSS) may get lost, e.g. a browser may need to wait (perhaps in vain) for a server, or an archiving system may archive an HTML file but fail to archive the CSS file.
Styling may get lost in transfer, e.g. when copying content from a web page to MS Word or Excel or Notepad or email.
I have almost 200,000 HTML pages and I need to figure out the height and width at which each one will render in any browser. I only need approximate numbers. How can I programmatically do this with C#?
C# Does not execute inside of a browser, and should not be used to try and determine the width and height a given HTML page will render at in a browser. Moreover, there is no answer for "any browser", as different browsers may support different fonts, may render the same content slightly different, and may be configured with different display-related settings (most browsers allow the user to arbitrarily scale the default font size up or down as desired, which would of course impact the final render size).
In general, however, I would suggest you do something like:
Come up with a JavaScript snippet that can compute the current size of the document.
Write a C# (or Java, C, bash, etc.) program to append your snippet to each of your 200,000 pages.
Use a browser-based test harness like Selenium or WebDriver to load up each of your 200,000 pages, extract the result from your JavaScript snippet, and log it out to somewhere convenient (see the sketch below).
Optionally, you can repeat step 3 with different browsers to get the width/height for all the different browsers that you care about.
Edit: Apparently Webdriver and Selenium are the same thing now. When did that happen?
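As a rough sketch of steps 1 and 3: the measuring snippet is plain JavaScript, and a harness can inject it into each loaded page with an executeScript call. The example below uses the Node selenium-webdriver bindings purely for illustration (the C# bindings expose an equivalent ExecuteScript method); the URL list and browser choice are placeholders.

// Minimal sketch using the Node selenium-webdriver bindings.
const { Builder } = require('selenium-webdriver');

// Injected into each page to report the rendered document size.
const MEASURE = `
  return {
    width:  Math.max(document.documentElement.scrollWidth,  document.body.scrollWidth),
    height: Math.max(document.documentElement.scrollHeight, document.body.scrollHeight)
  };
`;

(async function () {
  const driver = await new Builder().forBrowser('chrome').build();
  try {
    // Placeholder for your list of 200,000 page URLs.
    const urls = ['http://example.com/page1.html'];
    for (const url of urls) {
      await driver.get(url);
      const size = await driver.executeScript(MEASURE);
      console.log(url, size.width, size.height); // log somewhere convenient
    }
  } finally {
    await driver.quit();
  }
})();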
It's pretty straightforward. Just write an HTML parser and enough of a rendering engine to at least know the height and width of any HTML element (for any screen size and font setting?). Obviously you will need a CSS parser and engine. Since you want to know for any browser, you will need to emulate each of them. If you can't directly get the DOM of the HTML pages you are trying to measure, you will need a JavaScript engine to get the values as they appear on the page.
Or you could run the HTML in a browser and use JavaScript to get the values. This won't be in .NET, though; you could have the JavaScript post the data to an ASP.NET page if you like.
Or you could use one of the tools recommended in answer to your earlier question.
For styling purposes I want to insert some dummy text on the page, but it shouldn't get linked to the actual content. Is there a way to block it from search engines, or do I have to use good old images for that?
Or would it be possible to load it dynamically via JavaScript? Because I heard that Google will read a certain amount of JavaScript.
Can you show the content in a borderless iframe, and block the iframe's src (a completely separate "page") from the search engines?
Alternatively, add the content with JavaScript, storing the JavaScript in a .js file that you block from the engines?
If you load that text via AJAX it probably won't be indexed - last time I checked, GoogleBot doesn't actually execute JS (nor do the other spiders (but some spambots apparently can and do)).
Caveat: the AJAX response should probably contain an X-Robots-Tag: noindex header, in case its URL is actually linked somewhere.
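A minimal sketch of such a response, assuming (purely for illustration) a Node/Express server and a hypothetical /decorative-text endpoint:

// Hypothetical endpoint serving the dummy text, marked noindex for crawlers.
const express = require('express');
const app = express();

app.get('/decorative-text', function (req, res) {
  // Tell crawlers not to index this response even if its URL gets linked somewhere.
  res.set('X-Robots-Tag', 'noindex');
  res.type('text/plain').send('Dummy decorative text');
});

app.listen(3000);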
I'd be extremely careful with whatever trick you decide on. Odds are Google will think you're trying to display different content to the user than to it.
I've always believed that Google actually works by rendering the page (possibly using some server-side version of the Chrome rendering engine) and then reading the result back with OCR software to confirm that the text in the source matches what the user would see with JS and frames enabled. Google has always openly warned webmasters not to try serving robots different content than users see; OCR would be the perfect way to find out (especially if the 'verifier' used IE's user-agent string and crawled from IP ranges not registered to Google).
Short answer then, serve the decoration as either:
an iframe
an object
an SVG image
Since you're clearly linking the document into your page, Google will probably consider it a separate resource and rate things accordingly, especially if the same text appears on every page. Which brings me to:
Are you going to use the same text decor on all/most pages? If so Google will almost certainly treat it as "window dressing" and ignore it (it apparently does this with menus and such).
I'd guess that loading in the content after the page has finished loading (when the document.ready event fires, for example) would be a fairly safe way to do what you're talking about. Not 100% sure about this, though.
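A minimal sketch of that approach, assuming a hypothetical /decorative-text URL serving the dummy text and a container element with id="decor":

// Fetch the decorative text only after the page has loaded, then append it.
document.addEventListener('DOMContentLoaded', function () {
  var xhr = new XMLHttpRequest();
  xhr.open('GET', '/decorative-text');
  xhr.onload = function () {
    if (xhr.status === 200) {
      document.getElementById('decor').textContent = xhr.responseText;
    }
  };
  xhr.send();
});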