Flash/HTML Architecture: SEO Implications?

A client of mine has a full-Flash site and an HTML site (WordPress). Currently, the HTML site lives at http://www.domain.com, while the Flash site lives at http://www.domain.com/flash (SWFObject detection at http://www.domain.com redirects Flash users to the Flash URL). The client isn't entirely pleased with this arrangement in terms of SEO, as links to their site sometimes point to http://www.domain.com and sometimes to http://www.domain.com/flash.
In a few weeks, the client will be rolling out a new version of their Flash site, which features deep linking, among other things. Instead of living in its own folder off the domain, the full-Flash site will be a "progressively enhanced" version of the HTML site, so if a user's browser supports Flash, all HTML content will be replaced by Flash content.
Once the new site is launched, each page/URL in the Flash site will have a corresponding HTML page/URL; for example, the Flash content at http://www.domain.com/#/about/clients corresponds to the HTML content at http://www.domain.com/about/clients.
We're going to implement a 301 redirect so the old /flash path points to the domain itself, but we're not sure how to proceed in terms of redirects between the HTML and Flash versions of the site. One possibility would be to simply do client-side detection of capabilities and redirect the user to the appropriate version; under that scenario, a non-Flash-capable client that attempts to visit http://www.domain.com/#/about/clients would be JS-redirected to http://www.domain.com/about/clients, and a Flash-capable client visiting http://www.domain.com/about/clients would be JS-redirected to http://www.domain.com/#/about/clients.
Is this a reasonable approach? Are there any potential SEO red flags that we should be aware of before proceeding?
Thanks for your consideration!

The redirect from /#/about/clients to /about/clients sounds reasonable, but applying the reverse could cause problems: if your Flash detection doesn't work correctly (perhaps Flash is blocked, etc.), you may send the user into an infinite redirect loop.
Personally, I would recommend that non-hash links always load their content as expected, in a static manner. If the user then navigates, you may either end up with a URL like /about/clients#/ if they went to the home page (this shouldn't be an issue, as crawlers will never end up visiting URLs in that form), or you can redirect them to / the next time they navigate.
IMHO, a pure JavaScript solution to the hash problem would be easier to manage, as there are already many good examples of this.
Also consider using #! instead of # - this 'hash-bang' technique is being pushed by Google as a way of identifying to search engines that your hash is important and that its contents differ from what you would see without the hash part. Google can already point to specific parts of a page using # and if you follow the hash-bang technique on the client and server-side, it will be able to index your AJAX/Flash links just like regular links (see the implementation details and the requirements you need to fulfill).
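For illustration, here is a minimal sketch of the loop-safe approach described above (assuming SWFObject is loaded on the page; the redirect only ever goes away from the hash URL, so a failed or blocked Flash check can never bounce the user back and forth):

(function () {
    var hash = window.location.hash; // e.g. "#/about/clients" or "#!/about/clients"
    var hasFlash = swfobject.getFlashPlayerVersion().major > 0;
    if (!hasFlash && hash.length > 1) {
        // strip the leading "#" or "#!" and load the static HTML equivalent
        window.location.replace(hash.replace(/^#!?/, ""));
    }
}());

Flash-capable clients that land on a plain URL can then be enhanced in place (swapping the HTML for Flash) rather than redirected, which avoids the reverse redirect entirely.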

Related

How to convert JavaScript, CSS, and HTML content into an interactive PDF or .h5p page

I have a webapp that lets users place dots on a sitemap and link them to images.
The web app uses Javascript, CSS, and HTML.
Phase 1
While the user is subscribed he uses a rich set of functionalities to:
add dots on the sitemap and link them to images
edit the dots: move, delete, link multiple images, and so on
This is done via the website that hosts the webapp.
Phase 2
When the user ends the subscription, he gets a .zip file with the information that he created (sitemap, images, links between the sitemap and the images, etc..).
The user can then connect to the website that hosts the webapp, without signing in and get a subset of the functionalities (e.g. he can only click on the dots and see the linked images, but he can no longer edit the dots or add images).
I want to change phase2.
Instead of interacting with the webapp on the website, I want to "freeze" the webapp into an interactive PDF or .h5p page that can be played independently, without the webapp.
There are multiple reasons that motivate this:
the webapp is complex, so interacting with it is more error-prone.
If the small subset of functionality over the final data, which boils down to showing an image when its dot is clicked, can be done by browsing an .h5p file, then the risk of runtime errors is greatly reduced.
the interactive PDF or .h5p file can be browsed by a variety of tools, potentially even offline.
the end product can be redesigned to appear simpler.
My questions:
is it possible to programmatically convert the JavaScript, CSS, and HTML content into an interactive PDF or .h5p page?
Every end product will be different (e.g. in the number of dots and their locations on the sitemap), so having to manually create the .h5p page every time is not practical.
are there mobile apps (e.g. on Apple Store, or Google Play) that can read .h5p content locally, e.g. when the device is offline?
Thanks
EDIT:
Oliver Tacke, thank you for replying.
Until a few days ago, while looking for a solution to my problem, I had not heard of H5P at all.
When looking into H5P, I see that:
many comments related to H5P are somewhat old, from roughly five or six years ago
H5P is frequently discussed in the context of education (e.g. Moodle)
when I filed the question I could not even find a tag for 'h5p'
I could not find forums for H5P in mainstream channels like Discourse or Slack
So I want to know if I'm in the right direction at all.
Is H5P a new thing that just takes time to pick up, or is it something that started a while ago and dwindled?
Or maybe I'm wrong and it is currently more active than I think (I'm aware of h5p.org and I do see activity there).
Basically, I want to create interactive content that can work
ideally offline, or
online but with a mainstream browser/tool/website (i.e. without needing my special website)
In the design industry, I know there are interactive catalogues.
But I don't know if the user can download them and somehow read them (e.g. with an EPUB reader).
Thanks
I don't know anything about creating PDFs programmatically, so I can only offer a partial answer for the H5P-related part. Given the broad scope of your question, this may be acceptable even though it is closer to a comment than a full answer.
H5P content follows a specification that is documented at https://h5p.org/documentation/developers/h5p-specification.
You would basically have to implement an H5P content type library (file) from the files that you are given by the service. I assume that the JavaScript and CSS files are always the same; those could then be reused directly (but potentially not legally). You would also have to add some more JavaScript that takes parameters and generates the HTML output that you get from the service. You would then have to model semantics.json to suit the parameters, and at that point you essentially have an H5P content type. You don't have to use the form-based editor that then becomes available (which probably wouldn't make sense here), but you could create the content.json file programmatically and put it into the H5P content file archive.
To create that file programmatically, you'd have to write a converter that identifies the parameters in the HTML file generated by that service and transforms them into the H5P semantics/content format. I am not sure whether it would make more sense to instead create an editor widget for H5P, so you wouldn't have to depend on the other service at all.
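As a very rough sketch of that converter idea (Node.js; the file names and the content structure are hypothetical and would have to match whatever semantics.json you define):

const fs = require('fs');

// data exported by the webapp, e.g. [{ x: 10, y: 20, image: 'door.jpg' }, ...]
const dots = JSON.parse(fs.readFileSync('export/dots.json', 'utf8'));

const content = {
  sitemap: { path: 'images/sitemap.png' },
  hotspots: dots.map(function (dot) {
    return {
      x: dot.x, // dot position on the sitemap
      y: dot.y,
      image: { path: 'images/' + dot.image } // the image the dot links to
    };
  })
};

// content.json belongs in the content/ folder of the .h5p (zip) archive
fs.writeFileSync('content/content.json', JSON.stringify(content, null, 2));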
There are currently no known mobile apps that allow you to load and run H5P content. They are on the roadmap of the H5P core team, but I wouldn't expect them to work on those any time soon. There's the Moodle app for the Moodle LMS that allows you to use H5P content offline, but the content needs to be fetched from a Moodle instance first. There's Lumi, which allows you to run H5P content locally on Windows, macOS and Linux, but not on Android or iOS. However, Lumi can also create single standalone HTML files from H5P content, containing all the content and logic ready to play, so that would allow offline use on Android and iOS.

Can images from another website create cookies on my site?

I have a static website; it only contains HTML and CSS. No JavaScript, no PHP, no databases. On this site I'm using images, which I get from image-hosting websites (like Imgur).
I've noticed that when I visit my website (on Google Chrome at least), if I click the information button next to the URL, it says there are cookies on this site. If I click on the cookies button, it says "The following cookies were set when you viewed this page" and shows a list of cookies, including some from the sites that I use for image hosting.
If I delete them, they come back after a while, but not immediately. I'm trying to avoid cookies as the site is very simple. Are they considered part of my site? If so, is there anything I can do, except hosting the images myself?
I always thought that if you link to an image directly (as in a link ending in .png, for example) it would be the same as if you were hosting the image yourself, and there would be no JavaScript being run (to save cookies).
Are they considered part of my site?
That depends on your perspective.
The browser doesn't consider them to be part of your site. Cookies are stored on a per-domain basis, so a cookie received in response to a request for an image from http://example.com will belong to http://example.com and not to your site.
However, for the purposes of privacy laws (such as the GDPR), they are considered part of your site, and if they are used by the third party to track personally identifiable information, you are required to jump through the usual GDPR hoops.
If so, is there anything I can do, except hosting the images myself?
Not really.
I always thought that if you link to an image directly (as in a link ending in .png, for example) it would be the same as if you were hosting the image yourself, and there would be no JavaScript being run (to save cookies).
Cookies are generally set with HTTP response headers, not with JavaScript.
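To make that concrete, here is a minimal sketch (Node.js; the file name is hypothetical) of how an image host can set a cookie using nothing but a response header, with no JavaScript ever running on the embedding page:

const http = require('http');
const fs = require('fs');

http.createServer(function (req, res) {
  res.writeHead(200, {
    'Content-Type': 'image/png',
    // the cookie rides along with the image response itself
    'Set-Cookie': 'tracker=whateverValue; Max-Age=3600'
  });
  fs.createReadStream('123xyz.png').pipe(res); // send the actual image bytes
}).listen(8080);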
Whenever a browser requests a file from a server, it automatically sends any cookie data along with the request. Image-hosting services may use that for various purposes.
I always thought that if you link to an image directly (as in a link ending in .png, for example) it would be the same as if you were hosting the image yourself, and there would be no JavaScript being run (to save cookies).
So the question is: how do they set these cookies?
Let's say you use a simple img tag to load an image from a hoster:
<img src="imageHoster.tld/123xyz.png">
The site imageHoster.tld can handle that request by redirecting all requests to e.g. requestHandler.php, and that file can set the cookie before sending the image with something as simple as
<?php
// set the cookie first: all headers must go out before any image data
setcookie("cookieName", "whateverValue", time() + 3600);
header('Content-Type: image/png');
// ... then output the actual image bytes, e.g. with readfile()
?>
What happens there is effectively the same as if you had set the image source like this:
<img src="imageHoster.tld/requestHandler.php?img=123xyz">
Are they considered part of my site?
Since these so-called third-party cookies are set when someone visits your site, one could consider them part of your site. To be on the safe side, I would at least mention the use of third-party services in the data privacy statement.
If so, is there anything I can do, except hosting the images myself?
Third-party cookies can be disabled in the client's browser, but you can't disable them on behalf of your visitors. So no: to prevent third parties from setting cookies on browsers visiting your site, you can only avoid using their services.

Difference between href="#!" and href="#" [duplicate]

I've just noticed that the long, convoluted Facebook URLs that we're used to now look like this:
http://www.facebook.com/example.profile#!/pages/Another-Page/123456789012345
As far as I can recall, earlier this year it was just a normal URL-fragment-like string (starting with #), without the exclamation mark. But now it's a shebang or hashbang (#!), which I've previously only seen in shell scripts and Perl scripts.
The new Twitter URLs now also feature the #! symbols. A Twitter profile URL, for example, now looks like this:
http://twitter.com/#!/BoltClock
Does #! now play some special role in URLs, like for a certain Ajax framework or something since the new Facebook and Twitter interfaces are now largely Ajaxified?
Would using this in my URLs benefit my Web application in any way?
This technique is now deprecated.
This used to tell Google how to index the page.
https://developers.google.com/webmasters/ajax-crawling/
This technique has mostly been supplanted by the ability to use the JavaScript History API that was introduced alongside HTML5. For a URL like www.example.com/ajax.html#!key=value, Google will check the URL www.example.com/ajax.html?_escaped_fragment_=key=value to fetch a non-AJAX version of the contents.
The octothorpe/number-sign/hash mark has a special significance in a URL: it normally identifies the name of a section of a document. The precise term is that the text following the hash is the anchor portion of a URL. If you use Wikipedia, you will see that most pages have a table of contents and you can jump to sections within the document with an anchor, such as:
https://en.wikipedia.org/wiki/Alan_Turing#Early_computers_and_the_Turing_test
https://en.wikipedia.org/wiki/Alan_Turing identifies the page and Early_computers_and_the_Turing_test is the anchor. The reason that Facebook and other Javascript-driven applications (like my own Wood & Stones) use anchors is that they want to make pages bookmarkable (as suggested by a comment on that answer) or support the back button without reloading the entire page from the server.
In order to support bookmarking and the back button, you need to change the URL. However, if you change the page portion (with something like window.location = 'http://raganwald.com';) to a different URL, or don't specify an anchor, the browser will load the entire page from the URL. Try this in Firebug or Safari's JavaScript console. Load http://minimal-github.gilesb.com/raganwald. Now in the JavaScript console, type:
window.location = 'http://minimal-github.gilesb.com/raganwald';
You will see the page refresh from the server. Now type:
window.location = 'http://minimal-github.gilesb.com/raganwald#try_this';
Aha! No page refresh! Type:
window.location = 'http://minimal-github.gilesb.com/raganwald#and_this';
Still no refresh. Use the back button to see that these URLs are in the browser history. The browser notices that we are on the same page but just changing the anchor, so it doesn't reload. Thanks to this behaviour, we can have a single Javascript application that appears to the browser to be on one 'page' but to have many bookmarkable sections that respect the back button. The application must change the anchor when a user enters different 'states', and likewise if a user uses the back button or a bookmark or a link to load the application with an anchor included, the application must restore the appropriate state.
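A small sketch of that pattern (renderState() is a hypothetical function standing in for whatever draws a given application state):

function renderState(state) {
  // hypothetical: draw whatever UI corresponds to this state name
  document.title = 'State: ' + state;
}

// restore state when the page loads from a bookmark or a link with an anchor
window.addEventListener('load', function () {
  renderState(window.location.hash.slice(1) || 'home');
});

// the back button (and any other anchor change) fires hashchange
window.addEventListener('hashchange', function () {
  renderState(window.location.hash.slice(1));
});

// entering a new 'state' just changes the anchor; no page reload occurs
function navigateTo(state) {
  window.location.hash = state;
}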
So there you have it: Anchors provide Javascript programmers with a mechanism for making bookmarkable, indexable, and back-button-friendly applications. This technique has a name: It is a Single Page Interface.
p.s. There is a fourth benefit to this technique: Loading page content through AJAX and then injecting it into the current DOM can be much faster than loading a new page. In addition to the speed increase, further tricks like loading certain portions in the background can be performed under the programmer's control.
p.p.s. Given all of that, the 'bang' or exclamation mark is a further hint to Google's web crawler that the exact same page can be loaded from the server at a slightly different URL. See Ajax Crawling. Another technique is to make each link point to a server-accessible URL and then use unobtrusive Javascript to change it into an SPI with an anchor.
Here's the key link again: The Single Page Interface Manifesto
First of all: I'm the author of The Single Page Interface Manifesto cited by raganwald.
As raganwald has explained very well, the most important aspect of the Single Page Interface (SPI) approach used in Facebook and Twitter is the use of the hash # in URLs.
The ! character is added only for Google's purposes; this notation is a Google "standard" for crawling AJAX-intensive web sites (in the extreme, Single Page Interface web sites). When Google's crawler finds a URL with #!, it knows that an alternative conventional URL exists that provides the same page "state", but in this case resolved at load time.
Although the #! combination is very interesting for SEO, it is only supported by Google (as far as I know); with some JavaScript tricks you can build SPI web sites that are SEO-compatible with any web crawler (Yahoo, Bing...).
The SPI Manifesto and demos do not use Google's format of ! in hashes, but this notation could easily be added, and SPI crawling could be even easier (UPDATE: the ! notation is now used and remains compatible with other search engines).
Take a look at this tutorial; it is an example of a simple ItsNat SPI site, but you can pick up some ideas for other frameworks. This example is SEO-compatible with any web crawler.
The hard problem is generating any (or selected) "AJAX page state" as plain HTML for SEO; in ItsNat this is very easy and automatic: the same site is at the same time SPI and page-based for SEO (or for when JavaScript is disabled, for accessibility). With other web frameworks you can always follow the double-site approach: one site is SPI-based and another is page-based for SEO; for instance, Twitter uses this "double site" technique.
I would be very careful if you are considering adopting this hashbang convention.
Once you hashbang, you can’t go back. This is probably the stickiest issue. Ben’s post put forward the point that when pushState is more widely adopted then we can leave hashbangs behind and return to traditional URLs. Well, fact is, you can’t. Earlier I stated that URLs are forever, they get indexed and archived and generally kept around. To add to that, cool URLs don’t change. We don’t want to disconnect ourselves from all the valuable links to our content. If you’ve implemented hashbang URLs at any point then want to change them without breaking links the only way you can do it is by running some JavaScript on the root document of your domain. Forever. It’s in no way temporary, you are stuck with it.
You really want to use pushState instead of hashbangs, because making your URLs ugly and possibly broken -- forever -- is a colossal and permanent downside to hashbangs.
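A bare-bones sketch of the pushState alternative (loadContent() is a hypothetical function that fetches a page fragment via AJAX and injects it into the DOM):

function loadContent(path) {
  // hypothetical: fetch the fragment for `path` and inject it into the page
}

function goTo(path) {
  history.pushState({ path: path }, '', path); // a real URL: no hash, no reload
  loadContent(path);
}

// back/forward fire popstate; re-render whatever state the URL represents
window.addEventListener('popstate', function (event) {
  loadContent(event.state ? event.state.path : window.location.pathname);
});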
As a good follow-up to all this: Twitter, one of the pioneers of hashbang URLs and the single-page interface, admitted that the hashbang system was slow in the long run and that they have actually started reversing the decision and returning to old-school links.
Article about this is here.
I always assumed the ! just indicated that the hash fragment that followed corresponded to a URL, with ! taking the place of the site root or domain. It could be anything, in theory, but it seems the Google AJAX Crawling API likes it this way.
The hash, of course, just indicates that no real page reload is occurring, so yes, it’s for AJAX purposes. Edit: Raganwald does a lovely job explaining this in more detail.

Google bot crawling on AngularJS site with HTML5 Mode routes

We have an AngularJS site using HTML5 routes. I just did some test "Fetch as Google" runs. The results are a bit confusing:
On the fetching tab, I see our site as it looks on view source, with all the front end bindings {{ }}, and not all the HTML rendered
On the rendering tab, our site looks perfectly fine, no {{ }} variables, it seems like Google bot fetched and rendered the site fine, which is maybe in line with this, http://googlewebmastercentral.blogspot.ae/2014/05/rendering-pages-with-fetch-as-google.html.
However, we were already prepared for Google not being able to crawl our site, so we have already added <meta name="fragment" content="!">, which makes the Google bot revisit our page with ?_escaped_fragment_=. We followed this, https://developers.google.com/webmasters/ajax-crawling/docs/getting-started (section "3. Handle pages without hash fragments"). In our Nginx config we have something like this:
if ($args ~ "_escaped_fragment_=") {
    # serve the static HTML snapshots
}
and indeed it works fine if we pass _escaped_fragment_= ourselves. However, the Google bot never tried to crawl our site with this param, so it never crawled the snapshot. Are we missing something? Should we also add user-agent detection for the Google bot in our Nginx conf? Something like this?
if ($http_user_agent ~* "googlebot|yahoo|bingbot|baiduspider|yandex|yeti|yodaobot|gigabot|ia_archiver|facebookexternalhit|twitterbot|developers\.google\.com") {
    # serve from snapshots
}
It would be great if we can understand this better, thank you so much in advance!
UPDATE:
I just read this, http://scotch.io/tutorials/javascript/angularjs-seo-with-prerender-io?_escaped_fragment_=tag#caveats. It seems that when using the manual tools (Fetch as Google), we should ourselves pass either #! or ?_escaped_fragment_= in the right place. Indeed, if I pass ?_escaped_fragment_= in our case, I do see the HTML snapshot that we have created.
Is that true? Is this how it works indeed?
UPDATE 2
On the bottom of this thread, a Google employee verifies that for Google Webmasters "Fetch as Google", you need to manually pass the _escaped_fragment_= param yourself, https://productforums.google.com/forum/#!msg/webmasters/fZjdyjq0n98/PZ-nlq_2RjcJ
Cheers,
Iraklis
I will try to answer your questions based on our experiences in the last month of developing a SPA with HTML5 mode.
How do I get Googlebot to use ?_escaped_fragment_= instead of the direct links?
This is actually quite simple but easy to overlook. In fact, there are two different ways to get Googlebot to try the escaped fragment. The first method is to run your site in non-HTML5 mode. This means that your URLs will be of the form:
http://my.domain.com/base/#!some/path/on/website
Googlebot recognizes the #! and makes a second call to your server with an altered URL:
http://my.domain.com/base/?_escaped_fragment_=some/path/on/website
Which you can then handle as you wish. The second way to get Googlebot to try _escaped_fragment_ mode is to include the following meta tag on the index page you supply to the bot:
<meta name="fragment" content="!">
This will make Googlebot check the other version of the web page every time it sees the tag. Interestingly, you can use both these techniques together, or you can do what we ended up doing, which is running in HTML5 mode with the meta tag. This means that your URLs will be escaped as follows:
http://my.domain.com/base/some/path/on/website?_escaped_fragment_=
Interestingly, the bot will not put anything after the fragment. But depending on what web server you are running, you can easily map this with a pattern matching the "_escaped_fragment_" text to your alternate bot page. For more information on the escaped fragment, go here.
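For example, one way that mapping could look in Node/Express (a sketch; the snapshots/ directory and its naming scheme are assumptions):

const express = require('express');
const path = require('path');
const app = express();

app.use(function (req, res, next) {
  if ('_escaped_fragment_' in req.query) {
    // hand the bot a pre-rendered snapshot instead of the JS app shell
    return res.sendFile(path.join(__dirname, 'snapshots', req.path + '.html'));
  }
  next(); // everyone else gets the normal application
});

app.listen(3000);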
"Fetch as Googlebot" returns two different versions of my page, the source with {{}} and the rendered page looking correct. What does that mean?
Google's bots have actually been able to interpret JavaScript to a limited extent since early 2014. For more information, read the official blog entry on Google Webmasters here. However, as is made clear in the blog entry, this comes with a lot of caveats. For instance:
Googlebot does not guarantee that it will execute all JavaScript code.
Googlebot will attempt to find links in the JavaScript to follow and use them to help find more pages.
Googlebot will render the preview in Webmaster Tools by executing as much of the JavaScript as it can (hence the lack of {{}} in the rendered version).
Googlebot will not necessarily use the rendered version to build the meta information about your site for its index.
As of 18/12/2014, we are still unsure whether Googlebot can actually extract any information from an SPA in rendered mode for its index, beyond finding links to follow in the JavaScript. In our experience, Googlebot will include {{}} in its index listing, so that when you try to use {{}} to fill meta information (description, keywords, title, etc.) your site looks like this in Google search results:
{{meta.siteTitle}}
http://my.domain.com/base/some/path/on/website
{{meta.description}}
rather than what you expect which might look like this:
Domain
http://my.domain.com/base/some/path/on/website
This is a random page on my domain. An excellent example page to be sure!
Googlebot uses _escaped_fragment_ for Search, but we cannot be sure about other services
Google recommends serving an HTML snapshot of an AJAX website by using the hashbang (#!) and the _escaped_fragment_ parameter.
But as is often the case with new Google features, not all Google services support it from the beginning.
For now, from experience, we are sure that Googlebot uses the HTML snapshot and _escaped_fragment_ when indexing web pages. You can check your server access logs to confirm that Google did this on your application.
From experience (nothing official from Google so far), for other services like PageSpeed Insights, the Webmaster Tools parser, rich-snippet testing tools, etc., the hashbang (#!) is not supported: you have to use _escaped_fragment_.
Should you use user-agent detection to serve the HTML snapshot?
No. Just don't, for several reasons:
You just do not know which services/bots on the web might want to parse your content, and you cannot be exhaustive (for instance, think of all the social networks using bots to create a snippet of your content: you cannot handle them one by one).
This can be considered cloaking: serving a different version depending on the type of user at the same URL, which is basically wrong for SEO.
Google looks for #! in your site's URLs, takes everything after the #!, and passes it in the _escaped_fragment_ query parameter. Some developers create basic HTML pages with real data and serve these pages from the server side at crawl time. So why not render the same pages with PhantomJS on the server side when the request carries _escaped_fragment_?
For more detail, please read this blog.
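A sketch of that idea as a PhantomJS script (the URL is the hypothetical example used above; the fixed timeout is a crude but common way to let the JavaScript rendering settle before capturing the DOM):

var page = require('webpage').create();
var url = 'http://my.domain.com/base/some/path/on/website';

page.open(url, function (status) {
  if (status !== 'success') {
    phantom.exit(1);
  }
  window.setTimeout(function () {
    console.log(page.content); // the fully rendered HTML for the bot
    phantom.exit();
  }, 1000);
});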
Maybe a bit outdated, but for completeness:
According to the statement from May 23, 2014, Googlebot is now able to "see your content more like modern Web browsers".
According to their statement from October 14, 2015, Google deprecated the AJAX crawling scheme.
So using the HTML5 History API (html5mode in Angular) should be no problem for Google.

HTML5 - Do users get to see all my client code?

If I am building an HTML5 web app, and all the rendering, UI events, etc. are handled on the client, then the client gets to see the source code, correct?
I am working on an enterprise HTML5 application, but I'd like the source code to be hidden. Are there any options?
Is it also possible to somehow hide UI graphic elements (buttons, backgrounds, sounds, etc.)?
What are the options here?
Thank you
My ready answer is no: your JavaScript code, as well as links to jQuery UI code, is visible to any client who asks to "view the source".
The question is: is it possible for your code to be applied/run by the client's browser without being shown as "source"? Is there a way
to prevent the client from seeing the "source"; or
to destroy the incoming code as soon as it has been run and displayed once?
The second possibility seems excluded unless there are no further JavaScript actions on the client's side(?)
Danquest
Quick answer: No.
Why? Well, your browser (the client) effectively downloads assets like HTML, JS and CSS (along with images and other media objects) to render on the user's machine.
Because all client code is downloaded to the client, the user can essentially do whatever they wish with it.
Server-side code does not get to the client, because it is processed on the server, which then produces client-translatable output: again, HTML etc. You only see the end result, with the source that produced it locked away on your guarded server.
Your best bet is to simply minify and compress your JS assets. This won't do much against a savvy developer, but it may be off-putting to the casual thief.
In any case, theft is theft, and if your code is found to be used by someone else's company, I guess you have a case to file a lawsuit against them, even though in a way it's public code.
Make sure you put a license statement with all of your code, so that you're legally covered.