Does the CSS property "text-transform" affect SEO results? - html

I am building a site with a ton of 1999-style capitalization of navigation and headings. I have been simply adding in the text content as it appears (capitalized), but the other designer on the project insists on using lowercase text in his HTML and capitalizing it with an applied style:
.tedious { text-transform: uppercase; }
I understand the argument for separating style from content, but in this case it really doesn't matter: I personally will not maintain the site, nor do I imagine that the client will ever need to un-capitalize all of this text. The questions are:
1. Will search engines pay any attention at all to the capitalization of text in a document?
2. Would a crawler go so far as to read my style sheet and look for such things? (Methinks not.)
I know that BOLD, STRONG, EM, etc. have a (diminishing) effect on SEO, so I can imagine a scenario where CAPS would too, but I have never heard anyone actually claim this, let alone confirm it.
Digging this site the last few months. First post.

It will only affect what is shown in the search results: your colleague's work will show as lowercase in the results.
You mentioned separation of style from content, but I'm not convinced that text-transform is really a style; it's a change of content. I'm sure some people would argue the other side, though.

If I were a search engine, I wouldn't care about casing; I would care about the content.
From a human-readability standpoint, uppercase isn't as easy to read.

Well, I was taught at school that all proper nouns (e.g. names and names of places) should begin with capital letters.
How would Google know whether I was talking about reading (as in a book) or Reading (as in the town of Reading, Berkshire), without taking into account the capitalisation? I would argue that capitalisation is definitely a semantic indicator rather than simply a case of aesthetics, and is therefore one factor that could be used for SEO.
As noted elsewhere, Google clearly does have knowledge of the CSS being used to render a page (e.g. Google can spot black-hat techniques such as white text on a white background).
So if capitalisation (or lack of) is a relevant SEO factor, can the CSS text-transform (or lack of) value also be an SEO factor?
Yes, because Google considers page speed an important factor, and text that doesn't need to be transformed by CSS will display faster.

Answer from Google's John Mueller:
I don't think we'd do anything special with all-caps headings, but it feels like the kind of thing you'd want to do in CSS instead of in the content, since it's more about styling.
https://mobile.twitter.com/JohnMu/status/1438159561391751170?s=19

Related

Reasons Against Empty Paragraphs in HTML

EDIT: Rephrased question.
Other than being bad practice, what reasons are there against empty paragraphs in HTML?
ORIGINAL:
Background
Currently, to add a nicely spaced paragraph in our CMS, you press the Enter key twice. I don't like empty paragraphs because they seem unnecessary to me. If you want a new paragraph, just press Enter and space it with CSS. If you want to write just below some text (e.g. to display code), then do a line break with Shift+Enter.
Question
Is there any very good reason for not allowing empty paragraphs? Is there a standard here? It seems like I just have a philosophical issue right now -- i.e. using empty paragraphs probably won't make page viewing faster or save that much space.
One thing I've learned the hard way is that any time you have a WYSIWYG editor for a web page, you run the risk of ending up with poor-quality HTML.
It doesn't matter how good the editor is, or how well trained your people are to use it, you will end up with bad code.
They'll click the 'bold' button instead of selecting your sub-title class. They'll create spurious paragraph tags rather than line breaks. And I've had to explain to one person several times why it's a bad idea to use multiple spaces to indent stuff.
Even when people are very good at using the editor and understand the implications, you'll still get things like stray markup setting styles and then unsetting them without any content, because if you (for example) make a word bold and then delete it, it generally doesn't delete the bold tags, and no-one thinks to switch to the HTML view to check.
The basic problem is that when you make it easy to use like a word processor, people will treat it like a word processor, and the underlying code becomes completely irrelevant to them. Their job is to produce content that looks good, and as long as they can achieve that, they don't generally care how the code looks.
The good thing is that there is a solution. In general, the people generating the content are the same people who care the most about SEO. If you emphasise that there might be SEO consequences to poor quality HTML, I find that they suddenly care a lot more about the code they're generating. They still don't generally have the skills to fix it when they've broken it, but it does seem to make people take more care to follow the rules.
To directly answer your question, I don't think it's a disaster to have empty paragraph tags like that. It's preferable not to though, and you need to consider how the content would look semantically to a search engine - it may cause the search engine to see the two paragraphs of content as being less connected to each other than they should be. This may affect how it weights the content of each paragraph when it comes to deciding its page rank. In truth, it's unlikely to be a huge difference; in fact, I'd say it's probably very tiny, but in a competitive world, it could be enough to push you down a few places. There are probably other more important SEO issues for you to deal with, but as they say, every little helps.
There are also times when you have CSS styling a particular element, in your case a paragraph. If you use empty paragraphs, they will unnecessarily pick up that styling, which might not be what you want.
By styling paragraphs with CSS, you can change the way paragraphs are styled easily in future.
For example, you might want to style paragraphs differently if the user is browsing on a mobile device, or you might just decide that you want more or less space between paragraphs (using properties like margin-top and margin-bottom on the p tag) because it looks better that way. If the spacing is done with extra p tags, it'd be a lot harder to change.
I expect that things like screen readers for the visually impaired would deal with CSS-styled paragraphs better than if the structure of the page is changed by adding empty paragraphs.

Split HTML text in an SEO-friendly manner

I have some HTML text like
<h1>GreenWhiteRed</h1>
Is it SEO-friendly to split this text into something like
<h1><span class="green">Green</span><span class="white">White</span><span class="red">Red</span></h1>
Does the text still rank well, and is it interpreted as the single word 'GreenWhiteRed'?
It does not matter at all. No worries. What wheresrhys said is a common myth.
Start focusing on building strong backlinks. That's the key to the castle.
This would greatly harm your SEO ranking. One of the major factors used in calculating PageRank is (probably -- nobody knows for sure) a low code-to-text ratio; in other words, that your page is mostly useful, informative text rather than a load of tags. Even if your extra tags contain relevant information in attributes, unless it's part of a recognised standard (e.g. hCards), it will probably not count in your favour.
Most search engine spiders see your site as a text browser would, so they would still see the text as GreenWhiteRed.
Source:
http://www.google.com/support/webmasters/bin/answer.py?answer=35769#2
I believe it really depends on the subject of the page. If you have a website about dogs and only dogs, then putting in the word cat does nothing; searches for 'cats' will never reach you. And colors, just like Andy said, mean nothing to a spider. Real SEO is about what a user would read, not see. Maintain your alt attributes and don't overuse the same word.

What is the benefit to using <acronym> and <abbr>?

Should I spend my time changing terms and abbreviations to <acronym> and <abbr>? Is it worth using them? What are the pros of using both tags? Are they useful for SEO and screen readers?
See W3C specs.
An acronym is a kind of abbreviation but not vice versa.
E.g. <acronym lang="en" title="Radio Detection And Ranging">Radar</acronym> or <abbr lang="en" title="Abbreviation">abbr.</abbr>.
There is likely to be no or infinitesimally small SEO benefit from using these tags unless the abbreviation is not well known or something you made up or there is some ambiguity. For example, in an article about LILO the Linux Loader, you may want to specify <acronym title="Linux Loader">LILO</acronym> to avoid confusion with Last In, Last Out.
Any accessibility benefit would exist only for those acronyms and abbreviations that are not well known by the target audience. For instance, it makes very little to no sense to have <abbr lang="en" title="Mister">Mr.</abbr> (WCAG checkpoint 4.2 disagrees with me on this. Note also that I did not provide an expansion of WCAG in my post).
On the other hand, if you are not using IMF to refer to the International Monetary Fund, it might make sense to use <acronym lang="en" title="Impossible Mission Force">IMF</acronym>.
Now, what happens if you also want to use IMF to mean International Monetary Fund in the same document?
The article The Accessibility Hat Trick: Getting Abbreviations Right might also be useful.
Interesting nuggets:
The assertion that abbr is structural is misguided, as the point of the tag is the content of its title attribute.
...
In [XHTML] version 2, the acronym element has been deprecated, so we're now using the abbr element for all shortened forms.
The first time you use an acronym or an abbreviation in a part of your site, you should mark it with abbr. Here's an example:
I visit <abbr title="Stack Overflow">SO</abbr>, and so should you.
This is useful for a number of reasons:
Screen readers can read the unabbreviated term
A user hovering the cursor over that term can see the unabbreviated term
This can be coupled with CSS styling to hint that the term is an abbreviation (some browsers do this automatically)
Search engines are more likely to understand the context of the term
Should you use abbr?
I would recommend using abbr for long-lived documents, such as help pages. Here, clarity is important, and it's worth the extra few minutes peppering your content with abbr tags.
For periodicals like blog posts, you can probably skip abbr. Chances are that if you use an obscure abbreviation, you'll explain it in-text anyway. There's no sense grinding your creative process to a halt by typing HTML tags.
Avoid acronym
If you are going to use acronym or abbr at all though, you may consider using only abbr. Acronyms are a type of abbreviation, and the acronym tag is being dropped in HTML 5.
Before asking "what is the benefit", normally the question you need to answer first is "what is the alternative?"
CSS tooltips?
JavaScript tooltips?
Spelling out the entire word every time?
Putting the abbreviation in parentheses just once?
The first two put you at risk of the various CSS and JS browser incompatibilities. And the third is going to be pretty irritating for both you and your readers when you have the phrase "National Technology Transfer and Advancement Act" repeated 500 times on the page.
And the last of those... well, it's pretty much the same as using the <ACRONYM> or <ABBR> tag, except using the tags lets the browser decide how to render it (usually with a nice tooltip).
SEO and accessibility... maybe there's some benefit, but I think you should use these tags because they are the right tags, just like <p> is the right tag for a paragraph and <em> is the right tag for emphasized text. Say what you mean!
<acronym> and <abbr> are actually quite useful, precisely for the reasons you mentioned yourself: accessibility and SEO. And it's not that much work, either, because it suffices to mark up only the first occurrence of an acronym or abbreviation on any given page, not all occurrences. In fact, that's precisely what W3C recommends in its accessibility guidelines:
Specify the expansion of each abbreviation or acronym in a document where it first occurs.
Reading the comments on Sinan's answer I think I understand what the question is getting at...
I'd say it completely depends on the circumstances of your text. Using <abbr> tags on everything is pure madness, but you can use it to enhance understanding.
Traditionally, a text that uses abbreviations explains the abbreviation when it's first used. Long names or words can be shortened for the remainder of an article, like so:
The Agency for Awesomeness (AfA) announced [...] An AfA representative said ...
Alternatively, if something is already widely known by its acronym, it is usually briefly clarified the first time it is used like so:
The IMF (International Monetary Fund) has ...
The problem with the web is that you may have a long text split over several pages, and a user can jump to any page without having to read the previous pages. For stylistic purposes you may not want to repeat the definition of every used abbreviation and acronym on every new page. On the other hand, you also don't want to force the user to read your text from the very beginning. This is where the <abbr> and <acronym> tags come in handy. They allow you to (re-)define something without having to break up the flow of the text.
I would add another reason: Style.
One might give more letter-spacing inside an abbr to improve its readability. In the same way, in order not to break the visual balance of the text, it is sometimes preferable to use small caps instead of normal caps.

Programmatically detecting "most important content" on a page

What work, if any, has been done to automatically determine the most important data within an HTML document? As an example, think of your standard news/blog/magazine-style website, containing navigation (possibly with submenus), ads, comments, and the prize -- our article/blog/news body.
How would you determine what information on a news/blog/magazine is the primary data in an automated fashion?
Note: Ideally, the method would work with both well-formed and terrible markup -- whether somebody uses paragraph tags to make paragraphs or just a series of breaks.
Readability does a decent job of exactly this.
It's open source and posted on Google Code.
UPDATE: I see (via HN) that someone has used Readability to mangle RSS feeds into a more useful format, automagically.
think of your standard news/blog/magazine-style website, containing navigation (possibly with submenus), ads, comments, and the prize -- our article/blog/news body.
How would you determine what information on a news/blog/magazine is the primary data in an automated fashion?
I would probably try something like this:
open URL
read in all links to same website from that page
follow all links and build a DOM tree for each URL (HTML file)
this should help you come up with redundant contents (included templates and such)
compare DOM trees for all documents on same site (tree walking)
strip all redundant nodes (i.e. repeated, navigational markup, ads and such things)
try to identify similar nodes and strip if possible
find largest unique text blocks that are not to be found in other DOMs on that website (i.e. unique content)
add as candidate for further processing
This approach seems pretty promising because it would be fairly simple to implement, yet still have good potential to adapt, even to complex Web 2.0 pages that make excessive use of templates, because it would identify similar HTML nodes across all pages on the same website.
This could probably be further improved by simply using a scoring system to keep track of DOM nodes that were previously identified as containing unique content, so that these nodes are prioritized for other pages.
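In case it helps to make the idea concrete, here is a rough Python sketch of the cross-page comparison step (my own, assuming the requests and beautifulsoup4 packages); block extraction is simplified to the text of a few common container tags, and anything repeated across pages is treated as template markup:

import requests
from bs4 import BeautifulSoup
from collections import Counter

def text_blocks(url):
    # Yield the text of common container tags on one page.
    soup = BeautifulSoup(requests.get(url).text, "html.parser")
    for tag in soup.find_all(["p", "div", "li", "td"]):
        text = tag.get_text(" ", strip=True)
        if text:
            yield text

def unique_content(urls):
    pages = {url: set(text_blocks(url)) for url in urls}
    counts = Counter(b for blocks in pages.values() for b in blocks)
    # Blocks that appear on exactly one page are candidates for unique
    # content; repeated blocks are probably navigation, ads or templates.
    return {url: [b for b in blocks if counts[b] == 1]
            for url, blocks in pages.items()}

A real implementation would compare DOM subtrees rather than flat strings, but the redundancy test is the same.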
Sometimes there's a CSS media section defined as 'print'. Its intended use is for 'Click here to print this page' links. Usually people use it to strip a lot of the fluff and leave only the meat of the information.
http://www.w3.org/TR/CSS2/media.html
I would try to read this style, and then scrape whatever is left visible.
You can use support vector machines to do text classification. One idea is to break pages into different sections (say, treat each structural element like a div as a document), gather some properties of each, and convert it to a vector. (As other people suggested, this could be the number of words, number of links, number of images -- the more the better.)
First start with a large set of documents (100-1000) for which you have already chosen which part is the main part. Then use this set to train your SVM.
And for each new document you just need to convert it to vector and pass it to SVM.
This vector model is actually quite useful in text classification, and you do not necessarily need to use an SVM. You can use a simpler Bayesian model as well.
And if you are interested, you can find more details in Introduction to Information Retrieval. (Freely available online)
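As a concrete illustration, here is a minimal scikit-learn sketch of that pipeline (an assumption on my part -- the answer doesn't prescribe a library, and this uses TF-IDF text features instead of the hand-picked counts mentioned above; the toy training data is made up):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

# Hand-labelled page sections: 1 = main content, 0 = boilerplate.
sections = [
    "Home About Contact Archives",
    "The committee announced today that the proposal, after months of debate, was finally approved.",
    "Copyright 2009. All rights reserved.",
    "In this article we look at three ways to configure the server.",
]
labels = [0, 1, 0, 1]

model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(sections, labels)

# Each section of a new page is vectorised the same way and classified.
print(model.predict(["Terms of service Privacy policy"]))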
I think the most straightforward way would be to look for the largest block of text without markup. Then, once it's found, figure out the bounds of it and extract it. You'd probably want to exclude certain tags from "not markup" like links and images, depending on what you're targeting. If this will have an interface, maybe include a checkbox list of tags to exclude from the search.
You might also look at the lowest level of the DOM tree and figure out which of those elements is the largest, but that wouldn't work well on poorly written pages, as the DOM tree is often broken on such pages. If you end up using this, I'd come up with some way to check whether the browser has entered quirks mode before trying it.
You might also try using several of these checks, then coming up with a metric for deciding which is best. For example, still try my second option above, but give its result a lower "rating" if the browser would normally enter quirks mode. Going with this would obviously impact performance.
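Here is a minimal sketch of the "largest block of text without markup" idea (my own, assuming BeautifulSoup); it scores each element by the length of its direct, non-nested text so that big wrapper divs don't win automatically:

from bs4 import BeautifulSoup, NavigableString

def largest_text_block(html):
    soup = BeautifulSoup(html, "html.parser")

    def direct_text_len(tag):
        # Count only text that sits directly inside this tag.
        return sum(len(c) for c in tag.children
                   if isinstance(c, NavigableString))

    candidates = [t for t in soup.find_all(True)
                  if t.name not in ("script", "style")]
    return max(candidates, key=direct_text_len, default=None)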
I think a very effective algorithm for this might be, "Which DIV has the most text in it that contains few links?"
Seldom do ads have more than two or three sentences of text. Look at the right side of this page, for example.
The content area is almost always the area with the greatest width on the page.
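That heuristic is easy to prototype. The sketch below (my own, assuming BeautifulSoup) scores each div by its text length discounted by its link density, i.e. the share of its text that sits inside <a> tags:

from bs4 import BeautifulSoup

def best_div(html):
    soup = BeautifulSoup(html, "html.parser")

    def score(div):
        text = div.get_text(" ", strip=True)
        link_text = "".join(a.get_text(" ", strip=True)
                            for a in div.find_all("a"))
        link_density = len(link_text) / max(len(text), 1)
        # Lots of text, few links -> high score.
        return len(text) * (1 - link_density)

    return max(soup.find_all("div"), key=score, default=None)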
I would probably start with the title and anything else in the head tag, then filter down through heading tags in order (i.e. h1, h2, h3, etc.). Beyond that, I guess I would go in order, from top to bottom. Depending on how it's styled, it may be a safe bet to assume the page title has an ID or a unique class.
I would look for sentences with punctuation. Menus, headers, footers, etc. usually contain separate words, but not sentences containing commas and ending in a period or equivalent punctuation.
You could look for the first and last elements containing sentences with punctuation and take everything in between. Headers are a special case since they usually don't have punctuation either, but you can typically recognize them as Hn elements immediately before sentences.
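For illustration, a quick Python sketch of that heuristic (my own; it assumes the page's text nodes are already available in document order as a list of strings):

def looks_like_sentence(text):
    # Crude test: several words plus terminal punctuation.
    return len(text.split()) >= 4 and text.rstrip().endswith((".", "!", "?"))

def content_span(text_nodes):
    hits = [i for i, t in enumerate(text_nodes) if looks_like_sentence(t)]
    if not hits:
        return []
    # Keep everything between the first and last sentence-like node, so
    # headings sandwiched between real paragraphs are swept in too.
    return text_nodes[hits[0]:hits[-1] + 1]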
While this is obviously not the answer, I would assume that the important content is located near the center of the styled page and usually consists of several blocks interrupted by headlines and such. The structure itself may be a give-away in the markup, too.
A diff between articles / posts / threads would be a good filter to find out what content distinguishes a particular page (obviously this would have to be augmented to filter out random crap like ads, "quote of the day"s or banners). The structure of the content may be very similar for multiple pages, so don't rely on structural differences too much.
Instapaper does a good job with this. You might want to check Marco Arment's blog for hints about how he did it.
Today, most news/blog websites are built on a blogging platform.
So I would create a set of rules by which to search for content.
For example, two of the most popular blogging platforms are WordPress and Google Blogspot.
WordPress posts are marked by:
<div class="entry">
...
</div>
Blogspot posts are marked by:
<div class="post-body">
...
</div>
If the search by CSS classes fails, you could fall back to the other solutions: identifying the biggest chunk of text, and so on.
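A small sketch of that rule-based approach (my own, assuming BeautifulSoup); the class names are the WordPress/Blogspot conventions quoted above, with the biggest-chunk-of-text heuristic as the fallback:

from bs4 import BeautifulSoup

# (tag, class) pairs for known blogging platforms.
PLATFORM_RULES = [("div", "entry"), ("div", "post-body")]

def extract_post(html):
    soup = BeautifulSoup(html, "html.parser")
    for name, cls in PLATFORM_RULES:
        node = soup.find(name, class_=cls)
        if node:
            return node.get_text(" ", strip=True)
    # Fallback: the div with the most text.
    divs = soup.find_all("div")
    best = max(divs, key=lambda d: len(d.get_text(strip=True)), default=None)
    return best.get_text(" ", strip=True) if best else ""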
As Readability is not available anymore:
If you're only interested in the outcome, you can use Readability's successor Mercury, a web service.
If you're interested in some code how this can be done and prefer JavaScript, then there is Mozilla's Readability.js, which is used for Firefox's Reader View.
If you prefer Java, you can take a look at Crux, which also does a pretty good job.
Or if Kotlin is more your language, then you can take a look at Readability4J, a port of the above Readability.js.

Apart from <script> tags, what should I strip to make sure user-entered HTML is safe?

I have an app that reprocesses HTML in order to do nice typography. Now, I want to put it up on the web to let users type in their text. So here's the question: I'm pretty sure that I want to remove the SCRIPT tag, plus closing tags like </form>. But what else should I remove to make it totally safe?
Oh good lord you're screwed.
Take a look at this
Basically, there are so many things you would want to strip out. Plus, there's stuff that's valid but could be used in malicious ways. What if the user wants to set their font size smaller on a footnote? Do you care if that gets applied to your entire page? How about setting colors? Now all the words on your page are white on a white background.
I would look into the requirements phase again.
Is a markdown-like alternative possible?
Can you restrict access to the final content, reducing risk of exposure? (meaning, can you set it up so the user only screws themselves, and can't harm other people?)
You should take the white-list rather than the black-list approach: Decide which features are desired, rather than try to block any unwanted feature.
Make a list of desired typographic features that match your application. Note that there is probably no one-size-fits-all list: It depends both on the nature of the site (programming questions? teenagers' blog?) and the nature of the text box (are you leaving a comment or writing an article?). You can take a look at some good and useful text boxes in open source CMSs.
Now you have to choose between your own markup language and HTML. I would choose a markup language. The pros are better security; the cons are the inability to embed arbitrary internet content, like YouTube videos. A good way to prevent users' rage is adding an "HTML to my-site" feature that translates the corresponding HTML tags to your markup language and deletes all other tags.
The pros for HTML are consistency with standards, extendability to new contents types and simplicity. The big con is code injection security issues. Should you pick HTML tags, try to adopt some working system for filtering HTML (I think Drupal is doing quite a good job in this case).
Instead of blacklisting some tags, it's always safer to whitelist. See what stackoverflow does: What HTML tags are allowed on Stack Overflow?
There are just too many ways to embed scripts in the markup. javascript: URLs (encoded of course)? CSS behaviors? I don't think you want to go there.
There are plenty of ways that code could be sneaked in. Especially watch for situations like <img src="http://nasty/exploit/here.php">, which can feed a <script> tag to your clients. I've seen <script> blocked on sites before while the <img> tag got right through, which resulted in 30-40 stolen passwords.
<iframe>
<style>
<form>
<object>
<embed>
<bgsound>
Is what I can think of. But to be sure, use a whitelist instead - things like <a>, <img>† that are (mostly) harmless.
† Just make sure that any javascript:... / on*=... are filtered out too... as you can see, it can get quite complicated.
I disagree with person-b. You're forgetting about javascript attributes, like this:
<img src="xyz.jpg" onload="javascript:alert('evil');"/>
Attackers will always be more creative than you when it comes to this. Definitely go with the whitelist approach.
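To make the whitelist approach concrete, here is a minimal sketch using the Python bleach library (one sanitizer among several; the tag and attribute lists below are just examples). Everything not explicitly allowed is dropped, including event-handler attributes like onload, and javascript: URLs are rejected by the protocol whitelist:

import bleach

ALLOWED_TAGS = ["a", "img", "p", "em", "strong", "blockquote"]
ALLOWED_ATTRS = {"a": ["href", "title"], "img": ["src", "alt"]}

dirty = '<img src="xyz.jpg" onload="alert(\'evil\')"><script>steal()</script>'
clean = bleach.clean(
    dirty,
    tags=ALLOWED_TAGS,
    attributes=ALLOWED_ATTRS,
    protocols=["http", "https"],  # rejects javascript: hrefs
    strip=True,                   # drop disallowed tags instead of escaping them
)
print(clean)  # the onload handler and the script tag are gone

The point is the same as in the answers above: you enumerate what is allowed and let the library remove everything else, rather than chasing an ever-growing blacklist.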
MediaWiki is more permissive than this site; yes, it accepts setting colors (even white on white), margins, indents and absolute positioning (including positions that would put the text completely off screen), clipping and display:none, font sizes (even if they are ridiculously small or excessively large) and font names (even if it's a legacy non-Unicode Symbol font name that will not render text successfully), as opposed to this site, which strips out almost everything.
But MediaWiki successfully strips the dangerous active scripts out of CSS (i.e. behaviors, onEvent handlers, active filters and javascript: link targets) without filtering out the style attribute completely, and bans a few other active elements like object, embed and bgsound.
Both sites ban marquees as well (not standard HTML, and needlessly distracting).
But MediaWiki sites are patrolled by lots of users, and there are policy rules to ban users that abuse them repeatedly.
It also supports animated images, and provides support for active extensions, such as rendering TeX maths expressions, other approved active extensions (like timeline), or creating and customizing a few forms.