Does SVG text generated by JavaScript get indexed by search engines?

I am planning an animated product presentation in SVG. The planned animation is a little too complex to achieve with regular (non-SVG) DOM manipulation, and canvas is not an alternative since the content has to be indexed by search engines. The animation is already mocked up and follows a typographic style.
My concern comes from the fact that I don't know whether dynamically generated text injected into SVG will be indexed by search engines with the same richness as other DOM elements, or whether it will be indexed at all.
It would be good to know if somebody here has already handled this situation in practice, and whether the indexing happened as expected (although any well-documented hypothesis would help too). If the answer is no, alternative solutions are welcome.

No: content injected by JavaScript won't be in your page when it is crawled. If you want content to be indexed, serve it up straight from the server.
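If you still want the SVG animation, one hedged pattern (a sketch only; the element IDs and markup here are made up for illustration) is to serve the copy as ordinary HTML that crawlers can read, then let JavaScript lift that same text into the SVG:

<div id="copy">
  <h1>Product name</h1>
  <p>The pitch text you want indexed.</p>
</div>
<svg id="stage" width="800" height="400"></svg>
<script>
// progressive enhancement: read the crawlable HTML, animate it in SVG
var copy = document.getElementById('copy');
var svg = document.getElementById('stage');
var text = document.createElementNS('http://www.w3.org/2000/svg', 'text');
text.setAttribute('x', 40);
text.setAttribute('y', 60);
text.textContent = copy.querySelector('h1').textContent;
svg.appendChild(text);
// hiding indexed text carries its own SEO risk; keep both versions identical
copy.style.display = 'none';
</script>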

Related

SEO for content inside canvas element

I recently found this project https://github.com/flipboard/react-canvas which uses canvas to render the whole page on mobile. The result is astonishing and performs really well. However, as far as I know, content inside canvas elements is not seen by search engines. What is the correct way to index this content?
Traditionally (with Flash/Silverlight sites) this has been done by feeding search engines alternative content that contains indexable text. I'm sure you can do the same with the canvas element.
The easiest approach being:
<canvas>This is what search engines actually see</canvas>

Designing entire webpages as SVG files

Disclaimer
I realize that given the absurdity of the title, this sounds like a troll. However, it's a genuine question. My background involves OpenGL / x86 assembly. I've recently started learning web programming. I really like SVG + CSS, and was wondering -- why do people not design entire webpages in SVG?
Context
SVG provides beautiful primitives: quadratic and cubic Bézier curves, lines, and fills -- all as vector graphics
SVG provides text
SVG provides affine transformations
Questions
Are there examples of people designing entire websites as a giant SVG file?
If not, what are the limitations?
Are there performance hits when using SVG primitives as opposed to divs/tables?
It is possible; for example you can embed HTML fragments in SVG documents in order to get things like hyperlinks.
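For instance, a minimal sketch using <foreignObject> (browser support varies, so treat this as something to verify rather than a guarantee):

<svg xmlns="http://www.w3.org/2000/svg" width="400" height="120">
  <rect width="400" height="120" fill="#eee"/>
  <foreignObject x="10" y="10" width="380" height="100">
    <div xmlns="http://www.w3.org/1999/xhtml">
      A real HTML fragment with a <a href="https://example.com">hyperlink</a>.
    </div>
  </foreignObject>
</svg>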
However, there are some significant disadvantages, at least at present:
Current web browsers treat SVGs as images, and may not present as good a UI to users. For example, I think Firefox doesn't allow the user to select text in SVG files.
You lose separation of content and presentation. While SVG does use CSS, and you can in principle maintain the separation if you edit by hand, you are probably designing the layout together with the content. This has several drawbacks:
As a corollary, it's harder to adapt the resulting page to other formats. Particularly:
What's the behavior when the text size is changed, the document is printed, or the window is resized? It's hard to design a complex drawing that supports reflow nicely (and if your drawing isn't complex, you may as well just use HTML+CSS).
Screen reader support: since the order is not clear (see below), screen readers may give incomprehensible, scrambled output. More basically, screen readers may assume the SVG is an image and not even try to read the text.
SVG is exclusively XML-based, and hence requires pretty strict adherence to the rules. With (X)HTML, you have the option of using the plain HTML serialization; many of those rules are then relaxed, and browsers are more robust if you feed them bogus input (as opposed to an XML parsing error if you have a single misplaced >).
Current search engines probably won't index your pages (they'll just treat them as monolithic images). Never mind: as the struck-out point below shows, Google does pick up SVG text, though it may come out scrambled.
Order of the content is not clear. Tools like Inkscape don't need to care about the order elements are output in, as long as they are positioned correctly in the output and the z-order is correct. But if you're making a web page, this does matter, because screen readers don't know for sure which element is semantically first. This isn't an issue if you're only editing the SVG by hand, but the usual SVG tools may scramble your order. With HTML, it's generally clear.
It's difficult to implement fragment identifiers (#id_of_some_element at the end of the URL) well. The presentation program below uses them, but I think it depends on JavaScript (bad for search engines and people with JavaScript disabled). (I'm not sure about this one.)
[Struck out in the original answer:] It's difficult to convert to text (for search engines, screen readers, copy-and-paste, etc.), particularly when generated with graphical tools. For example, http://www.afrokadans.com/Page_01.svg shows up in Google as "DANSE ET MISE EN FORME BIENVENUE D a n s e r po u r cé l éb r e r l a v ie. E n ve l opp é e d e t a m b ou r s in do c il e s e t i n ce ss a nts". In contrast, even crappy WYSIWYG HTML generators won't split up runs of text like that.
Something to consider is that it is possible to embed SVG elements in XHTML and HTML 5, so you get some of the benefits without throwing off browsers/search engines.
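A minimal sketch of that hybrid: the SVG sits inline in a normal HTML5 document, so its text stays in the ordinary DOM where browsers and crawlers already look:

<!DOCTYPE html>
<html>
<head><title>Mostly-SVG page</title></head>
<body>
  <h1>Indexable heading in plain HTML</h1>
  <svg width="600" height="200" role="img" aria-label="decorative stage">
    <text x="20" y="50">This text is part of the HTML DOM, not an external image.</text>
  </svg>
</body>
</html>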
Existing usage
I've heard of its use for presentations (which are closer to drawings in some respects, so some of the above drawbacks don't apply):
"Jessyink is a JavaScript that can be incorporated into an Inkscape SVG image containing several layers. Each layer will be converted into one slide of a presentation."
Another implementation; info and an example
It is really cool to think that an .svg image file can be used to create a webpage without knowing any HTML. Considering how messy HTML standards are, being able to use an Inkscape .svg file might be a lot easier for people who are naturally artistic. When you think about Inkscape and Openclipart being mostly used by people doing desktop publishing, scrapbooking, etc., being able to export a webpage directly from Inkscape would be a powerful tool.
As for 600 years of typesetting: we are in the 21st century, and media publishing doesn't have to conform to medieval ideas of formatting. Typesetting does have its place, but we are talking about a powerful experiment that could help 'average Joe' users turn their art into webpages without knowing CSS or HTML.
For reference, the Opera team has done quite a lot of work with using SVG for web pages - http://dev.opera.com/articles/svg/
There is a platform designed specifically to create SVG-based websites: Svija (disclaimer: it is my project).
The advantage of an SVG website is that you can literally do anything you want; you're not limited to rows of rectangles as you are with HTML.
The big issue is accessibility. Right now SVG text is crawled by Google, but it can easily be out of order depending on the source document. Also, semantic information that would normally be conveyed by HTML tags (an H1 is more important than a P) is nonexistent.
The main things to realize are that:
you can use external fonts and images in an SVG file
the width and height have to be specified correctly for Microsoft's browsers to display the SVG at the correct size
if you combine different SVG files, conflicting IDs can cause CSS styles to be misapplied
The only program that I have found that can correctly link to images and fonts is Adobe Illustrator. There is a list of all the programs I have found that can create SVG files here, with some information about which programs support which features.
Actually, there are pages that rely heavily on SVG; using a JavaScript library such as Raphaël is a common way to do it. Paper.js is also worth a peek, although they chose not to use SVG.
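A small hedged sketch of the Raphaël approach (it assumes the library is loaded and an element with id="holder" exists; check the current docs for the exact API):

// Raphaël draws SVG where available (VML in old IE)
var paper = Raphael('holder', 640, 200);
var label = paper.text(100, 100, 'Hello SVG').attr({ 'font-size': 24 });
var dot = paper.circle(100, 140, 8).attr({ fill: '#c00' });
dot.animate({ cx: 500 }, 1000); // slide the dot across in one second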
You can indeed make entire websites with SVG; I've been doing it for years. If you're willing to change how you look at designing pages (think cards), then it's just a matter of trial and error.

We hear so much about "semantic HTML". Where/what are the algorithms reading our semantic HTML?

I keep making attempts at properly using HTML5 but I feel like it's still not even close to anything semantically valuable.
My attempts:
HTML5 Article node Architecture
HTML5 Blog Page Architecture
But there's such subtleties in every single tag!
My question is, what specific software out there on the web is actually doing things like processing our HTML DOM, calculating and comparing elements to say "oh, this is a <header>, and it's just after <section>, and it has <time> in it, so the <time> tag must be "metadata" in relation to the <header>...", and saying "The content within the <time> tag not only is the "published time", but also relates to the author's birthday, so it must be a special post (say because there was also a <cite> or <address class='vcard'> tag in there too)".
I mean, what benefit am I ever going to get in using HTML5 if I don't know the algorithms that are interpreting it? If I just stuck with the basic div, ol, ul, li, p, a, h[1-6] tags, I could do everything with half the number of DOM elements.
Looking forward to some specific algorithms that I can use to shape how I structure the DOM from here on out.
I'm at the point where I don't even think we should be using HTML5 tags at all. For example, on the iPhone especially, the goal should be to minimize DOM elements to decrease load time. Plus, if the iPhone site is a mirror of the traditional browser version, the search engines won't even see the iPhone site (ideally), so there's no real point in making the DOM semantic. If I can use half the number of <div> tags to achieve the same layout as a somewhat "semantic HTML5" rendition, and that's a good thing for the iPhone, why don't I do that for the regular browser too? That's where I'm coming from.
Articles like this are basically saying it's pointless to worry about semantic HTML.
What algorithms are reading your semantic HTML? Google, that's who. Their algorithm tries to extract every bit of meaning from pages that it can, because that helps Google construct smart, relevant search results. For one example, Google tries to determine the dates of things by reading the HTML and gives headers extra consideration in determining the overall topic of a page.
Also, your assertion that we shouldn't use HTML5 tags on the iPhone "to minimize DOM elements" has no technical basis. HTML5 doesn't dictate that we use more DOM elements, and in fact it can let us leave out tags that would be required by XHTML. You should use HTML5 on the iPhone more than anywhere else. For example, the new input types like number and email don't do much on the desktop, but that extra information can really make things nicer on the iPhone by allowing it to present an appropriate interface.
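A quick sketch of what that buys you; browsers that don't know a type fall back to a plain text input, so this degrades safely:

<form>
  <input type="email" name="email">  <!-- iPhone shows an @-friendly keyboard -->
  <input type="number" name="qty">   <!-- numeric keypad -->
  <input type="tel" name="phone">    <!-- phone keypad -->
</form>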
Whenever a "machine" tries to make sense of your content.
In addition to search engines (→ SEO), screen readers (→ Accessibility) interpret the markup. They get better from version to version.
Also, think of all the tools that might come one day. The great thing about the Web is that all the web pages could still exist 5, 10, 100 … years from now. Imagine the user agents, algorithms, and search tools that might exist then, and how they could extract the meaning of your old documents.
Search engines can and will interpret your pages better, which, combined with other factors, will result in better rankings for your pages.
Moreover if you use the tags consistently and semantically, you could build your own reusable widgets and libraries that derive knowledge from the HTML structure independent of how the data is stored in the backend.
Consider this sample Google search where you can filter results by date. By using semantic HTML for, let's say, <article> and <time>, you can write a simple crawler that recreates this functionality or lets users specify a timespan within which to search the articles on your own site(s).
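A rough sketch of that crawler step in browser JavaScript (the cutoff date, and the assumption that every article carries a <time datetime="..."> element, are mine):

// collect articles published after a cutoff, relying only on semantic markup
var cutoff = new Date('2020-01-01');
var recent = Array.prototype.filter.call(
  document.querySelectorAll('article'),
  function (article) {
    var time = article.querySelector('time[datetime]');
    return time && new Date(time.getAttribute('datetime')) > cutoff;
  }
);
recent.forEach(function (article) {
  var heading = article.querySelector('h1, h2');
  console.log(heading ? heading.textContent : article.textContent.slice(0, 60));
});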
Off the top of my head, I don’t know of any algorithms making use of the new semantic tags in HTML5. (Obviously, that doesn’t mean there aren’t any.)
But the idea that you should tailor your HTML to specific algorithms is, I think, a bit contrary to how the web works. The web is worldwide, and will hopefully be around for a long time. We can’t know what uses our HTML will be put to, and useful algorithms can’t be written until there’s a good amount of actual content out there.
The <a> tag wasn’t designed with Google’s PageRank algorithm in mind. Some people thought links would be useless if they weren’t inherently two-way, because you’d get too many broken links when one end went away.
Of course, if the vague possibility of undefined future benefits makes it not worth using some or all HTML5 tags for whatever project you’re working on, don’t use them.
For me, the benefit of using them is that there's a well-known, public, non-proprietary specification that tells you, and anyone else working on the code, what we've agreed the tags mean. Future developers don't just get a <div> with a class name that I made up in a coffee-fuelled 7 p.m. code sprint; they get a tag designed and documented by people smarter and more experienced than me. There's also the chance that the code will become more useful in future if people use the meaning contained in HTML5 tags in algorithms, whereas there's less chance of that if it's all just a bunch of <div>s.
I don’t think the size increase of our pages from HTML5 tags is particularly worth worrying about though. After gzipping, the size increases aren’t enough to worry about, especially as mobile performance is as much hampered by the latency (which you can’t do much about) as the bandwidth. Plus mobile bandwidth is likely to trend up, rather than down.

Does the CSS property "text-transform" affect SEO results?

I am building a site with a ton of 1999-style capitalization in its navigation and headings. I have been simply adding in the text content as it appears (capitalized), but the other designer on the project insists on using lower-case text in his HTML and capitalizing it with an applied style:
.tedious { text-transform: uppercase; }
I understand the argument for separation of style from content, but in this case it really doesn't matter, because I personally will not maintain the site, nor do I ever imagine that the client will need to un-capitalize all of this text. The question is: 1. will search engines pay any attention at all to capitalization of text in a document, and 2. would a crawler go so far as to read my style sheet and look for such things (methinks not)? I know that BOLD, STRONG, EM, etc. have a (diminishing) effect on SEO, so I can imagine a scenario where CAPS would, but I have never heard of anyone actually claiming, let alone confirming, this.
Digging this site the last few months. First post.
It will only affect what is shown in the search results: your colleague's work will show as lower case in the results.
You mentioned separation of style from content, but I'm not convinced that text-transform is really a style; it's a change of content. I'm sure some people would argue the other side, though.
If I were a search engine, I wouldn't care about casing; I would care about the content.
From a human-readability standpoint, upper case isn't as easy to read.
Well, I was taught at school that all proper nouns (e.g. names of people and places) should begin with capital letters.
How would Google know whether I was talking about reading (as in a book) or Reading (as in the town of Reading, Berkshire), without taking into account the capitalisation? I would argue that capitalisation is definitely a semantic indicator rather than simply a case of aesthetics, and is therefore one factor that could be used for SEO.
As noted elsewhere, Google clearly does have knowledge of the CSS being used to render a page (e.g. Google can spot black-hat techniques such as white text on a white background).
So if capitalisation (or lack of) is a relevant SEO factor, can the CSS text-transform (or lack of) value also be an SEO factor?
Yes - because Google considers page speed to be an important factor. Text that doesn't need to be transformed by CSS will display faster.
Answer from Google:
I don't think we'd do anything special with all-caps headings, but it feels like the kind of thing you'd want to do in CSS instead of in the content, since it's more about styling.
https://mobile.twitter.com/JohnMu/status/1438159561391751170?s=19

Programmatically detecting "most important content" on a page

What work, if any, has been done to automatically determine the most important data within an HTML document? As an example, think of your standard news/blog/magazine-style website, containing navigation (possibly with submenus), ads, comments, and the prize: our article/blog/news body.
How would you determine what information on a news/blog/magazine is the primary data in an automated fashion?
Note: ideally, the method would work with well-formed markup and with terrible markup, whether somebody uses paragraph tags to make paragraphs or just a series of breaks.
Readability does a decent job of exactly this.
It's open source and posted on Google Code.
UPDATE: I see (via HN) that someone has used Readability to mangle RSS feeds into a more useful format, automagically.
think of your standard news/blog/magazine-style website, containing navigation (possibly with submenus), ads, comments, and the prize: our article/blog/news body
How would you determine what information on a news/blog/magazine is the primary data in an automated fashion?
I would probably try something like this (a rough code sketch follows the list):
open URL
read in all links to same website from that page
follow all links and build a DOM tree for each URL (HTML file)
this should help you identify redundant content (included templates and such)
compare DOM trees for all documents on same site (tree walking)
strip all redundant nodes (i.e. repeated, navigational markup, ads and such things)
try to identify similar nodes and strip if possible
find largest unique text blocks that are not to be found in other DOMs on that website (i.e. unique content)
add as candidate for further processing
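Here is a toy version of the core step, stripping text blocks that repeat across pages of the same site, in browser JavaScript. The fetching and link-walking are omitted, and treating each p/li/heading/cell as one block is a simplifying assumption of mine:

function textBlocks(html) {
  // parse one already-fetched page and collect candidate text blocks
  var doc = new DOMParser().parseFromString(html, 'text/html');
  var nodes = doc.querySelectorAll('p, li, h1, h2, h3, td');
  return Array.prototype.map.call(nodes, function (n) {
    return n.textContent.trim();
  }).filter(function (t) { return t.length > 0; });
}

function uniqueContent(pageHtml, otherPagesHtml) {
  // blocks that also appear on sibling pages are template/navigation; drop them
  var seenElsewhere = new Set();
  otherPagesHtml.forEach(function (html) {
    textBlocks(html).forEach(function (t) { seenElsewhere.add(t); });
  });
  return textBlocks(pageHtml).filter(function (t) {
    return !seenElsewhere.has(t);
  });
}

// usage: uniqueContent(htmlOfTargetPage, [htmlOfSiblingPage1, htmlOfSiblingPage2])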
This approach seems pretty promising because it would be fairly simple to do, yet still has good potential to be adaptive, even to complex Web 2.0 pages that make excessive use of templates, because it would identify similar HTML nodes between all pages on the same website.
This could probably be further improved by simply using a scoring system to keep track of DOM nodes that were previously identified to contain unique content, so that these nodes are prioritized for other pages.
Sometimes there's a CSS media section defined as 'print'. Its intended use is for 'Click here to print this page' links. People usually use it to strip a lot of the fluff and leave only the meat of the information.
http://www.w3.org/TR/CSS2/media.html
I would try to read this style, and then scrape whatever is left visible.
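A sketch of what such a print stylesheet typically looks like (the class names here are hypothetical; every site differs):

@media print {
  /* strip the fluff; only the meat of the page remains visible */
  nav, aside, footer, .ads, .comments { display: none; }
}

"Reading this style" then means applying the print rules and scraping only the elements left visible.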
You can use support vector machines to do text classification. One idea is to break pages into different sections (say, treat each structural element like a div as a document), gather some properties of each, and convert it to a vector. (As other people suggested, these could be the number of words, number of links, number of images, and so on; the more the better.)
First, start with a large set of documents (100-1000) for which you have already chosen which part is the main content. Then use this set to train your SVM.
And for each new document you just need to convert it to vector and pass it to SVM.
This vector model is actually quite useful in text classification, and you do not necessarily need to use an SVM; you can use a simpler Bayesian model as well.
And if you are interested, you can find more details in Introduction to Information Retrieval. (Freely available online)
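A sketch of the vectorization step in browser JavaScript (the feature set is illustrative only; the training itself would happen offline in whatever SVM or Bayesian toolkit you prefer):

// turn one candidate element into a numeric feature vector
function featurize(el) {
  var text = el.textContent || '';
  var words = text.split(/\s+/).filter(Boolean).length;
  var links = el.querySelectorAll('a').length;
  var images = el.querySelectorAll('img').length;
  var linkWords = 0;
  Array.prototype.forEach.call(el.querySelectorAll('a'), function (a) {
    linkWords += a.textContent.split(/\s+/).filter(Boolean).length;
  });
  // [word count, link count, image count, fraction of words inside links]
  return [words, links, images, words ? linkWords / words : 0];
}

// one vector per div, ready to label for training or to classify
var vectors = Array.prototype.map.call(document.querySelectorAll('div'), featurize);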
I think the most straightforward way would be to look for the largest block of text without markup. Then, once it's found, figure out the bounds of it and extract it. You'd probably want to exclude certain tags from "not markup" like links and images, depending on what you're targeting. If this will have an interface, maybe include a checkbox list of tags to exclude from the search.
You might also look for the lowest level in the DOM tree and figure out which of those elements is the largest, but that wouldn't work well on poorly written pages, as the DOM tree is often broken on such pages. If you end up using this, I'd come up with some way to see if the browser has entered quirks mode before trying it.
You might also try using several of these checks, then coming up with a metric for deciding which is best. For example, still try my second option above, but give its result a lower "rating" if the browser would normally enter quirks mode. Going this route would obviously impact performance.
I think a very effective algorithm for this might be, "Which DIV has the most text in it that contains few links?"
Seldom do ads have more than two or three sentences of text. Look at the right side of this page, for example.
The content area is almost always the area with the greatest width on the page.
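That heuristic translates almost directly into code. A sketch, with an arbitrary penalty factor you would need to tune:

// score each div by how much text it holds outside of links
function score(div) {
  var linkText = 0;
  Array.prototype.forEach.call(div.querySelectorAll('a'), function (a) {
    linkText += a.textContent.length;
  });
  return div.textContent.length - 3 * linkText; // 3 is a guess; tune it
}

var best = Array.prototype.slice.call(document.querySelectorAll('div'))
  .sort(function (a, b) { return score(b) - score(a); })[0];

One caveat: nested divs share text with their ancestors, so without extra filtering the outermost wrapper tends to win.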
I would probably start with the title and anything else in a head tag, then filter down through heading tags in order (i.e. h1, h2, h3, etc.)... beyond that, I guess I would go in order, from top to bottom. Depending on how it's styled, it may be a safe bet to assume the page title has an ID or a unique class.
I would look for sentences with punctuation. Menus, headers, footers, etc. usually contain separate words, but not sentences containing commas and ending in a period or equivalent punctuation.
You could look for the first and last element containing sentences with punctuation, and take everything in between. Headers are a special case since they usually don't have punctuation either, but you can typically recognize them as Hn elements immediately before sentences.
While this is obviously not the answer, I would assume that the important content is located near the center of the styled page and usually consists of several blocks interrupted by headlines and such. The structure itself may be a give-away in the markup, too.
A diff between articles / posts / threads would be a good filter to find out what content distinguishes a particular page (obviously this would have to be augmented to filter out random crap like ads, "quote of the day"s or banners). The structure of the content may be very similar for multiple pages, so don't rely on structural differences too much.
Instapaper does a good job with this. You might want to check Marco Arment's blog for hints about how he did it.
Today most news/blog websites are built on a blogging platform.
So I would create a set of rules by which to search for the content.
For example, two of the most popular blogging platforms are WordPress and Google's Blogspot.
Wordpress posts are marked by:
<div class="entry">
...
</div>
Blogspot posts are marked by:
<div class="post-body">
...
</div>
If the search by CSS classes fails, you could fall back to the other solutions: identifying the biggest chunk of text, and so on.
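A minimal sketch of that rule set in JavaScript (the selector list covers only the two platforms mentioned; extend it per platform):

// try known blogging-platform content containers, in order
var selectors = ['.entry', '.post-body'];
var post = null;
for (var i = 0; i < selectors.length && !post; i++) {
  post = document.querySelector(selectors[i]);
}
if (post) {
  console.log(post.textContent.trim());
}
// else: fall back to the biggest-chunk-of-text heuristics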
As Readability is not available anymore:
If you're only interested in the outcome, you can use Readability's successor, Mercury, a web service.
If you're interested in how this can be done in code and prefer JavaScript, there is Mozilla's Readability.js, which is used for Firefox's Reader View (a usage sketch follows this list).
If you prefer Java, you can take a look at Crux, which also does a pretty good job.
Or if Kotlin is more your language, you can take a look at Readability4J, a port of the aforementioned Readability.js.
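For the Readability.js route, a minimal usage sketch (it follows the project's README; verify against the current version):

// Readability mutates the document it is given, so hand it a clone
var article = new Readability(document.cloneNode(true)).parse();
if (article) {
  console.log(article.title);
  console.log(article.textContent); // plain text of the extracted main content
}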