Indexing an Angular site by search robots with $locationProvider.html5Mode(true) (no #!)

Hi friends,
I have a site written in Angular. How do Google and other search robots index a site like this? What techniques can be used to let robots see the page content?
The internet explains how to optimise an Angular site for indexing when $locationProvider.html5Mode(false); is used, i.e. there are methods for indexing a site whose page URLs look like http://site.com/#!audio/1234.
But my site uses $locationProvider.html5Mode(true);, so page URLs have no #!:
http://site.com/audio/1234
All of the page content is still loaded with JavaScript, though. I need help. Thank you!

Your server will receive a request at http://site.com/audio/1234, so somehow you will need to be able to return some content for it.
For now the only strategy I know is to render that page on the server side as well, which involves some (or a lot of) work.
It's also good for browsers that don't support pushState yet; those users can still access the data.
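To make that concrete, here is a minimal sketch (my assumption of a setup, not the asker's actual stack) of a PHP front controller that answers deep links such as /audio/1234 with pre-rendered HTML; the audio.json data file and the markup are made up:
<?php
// Sketch only: serve pre-rendered HTML for deep links so crawlers and
// browsers without pushState get real content at the html5Mode URLs.
// audio.json and the route pattern are assumptions.
$path = parse_url($_SERVER['REQUEST_URI'], PHP_URL_PATH);

if (preg_match('#^/audio/(\d+)$#', $path, $m)) {
    $items = json_decode(file_get_contents('audio.json'), true);
    $id    = $m[1];

    if (isset($items[$id])) {
        echo '<h1>' . htmlspecialchars($items[$id]['title']) . '</h1>';
        echo '<p>'  . htmlspecialchars($items[$id]['description']) . '</p>';
        exit;
    }
}

// Anything else gets the normal Angular shell; the client-side app
// takes over routing from there.
readfile('index.html');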
You can also take a look at http://docs.meteor.com/#spiderable. They render the page on the server side by executing the client-side JavaScript there.
Here's a similar question: pushState and SEO

SaaS CMS API possible and/or useful?

Everything is increasingly distributed. Outsource your authentication with Facebook or OpenID, comments with Disqus, file uploading with filepicker.io and storage with AWS. Perhaps in the future websites will merely be interfaces that link all these services together in a meaningful way to the user. Another part of a website that might be outsourced is the content. Imagine retrieving JSON with page content, menu structures and lists of blog posts. The content would be edited on a different website than your own.
A problem a CMS like this might face is the automatic creation of pages. Requiring clients to install an .htaccess file could be an option. Maybe clients only want to allow automatic loading of pages under specific subdirectories like domain.com/blog. Probably the routing should be left entirely to the client, and the API could be invoked from their content-page.php.
I think it should be very minimal and not enforce the use of particular template engines on clients. It should just load the HTML of the content.
I'm not sure if this would be useful, considering that you might want your CMS to also handle routing to content pages, and I don't think that could be done through an API, but please correct me if you see a way.
My question is: would you use something like this? Do you foresee any problems that I'm not seeing? Any suggestions?
To illustrate more clearly what I mean, here's an example content-page.php of a client:
$cms = new SaaSCMS($apiKey);                  // client object for the hosted CMS API
$content = $cms->getContent($_GET['page']);   // fetch the HTML for the requested page

if ($content) {
    echo $content;                            // display content
} else {
    header('HTTP/1.1 404 Not Found');         // 404
}
Not sure, but it reminds me a lot of Jekyll, which achieves most of the goals you stated above.

Getting HTML content from one page and adding it to my website

I am affiliated with Expedia and am using their API. One of their requirements for launching the site is adding their terms and agreements to my page, and they give us this page: http://travel.ian.com/index.jsp?pageName=userAgreement&locale=en_US&cid=xxx. I do not want to send users to a different site, and I cannot just copy and paste the information because it gets updated. I also prefer not to use an iframe. Does anyone have any ideas on how to do this? Here is a webpage using this on their own domain: http://www.helloweekends.com/terms.htm. Does anyone know how they did this? Any help would be greatly appreciated!
Since it originates from another domain, it wouldn't be possible to use JavaScript, due to the same-origin policy. Also, relying on JavaScript for the update would be trouble for users who have JavaScript disabled, as they wouldn't see the terms. Since you don't want to use an iframe or copy the content, I guess your best shot would be to scrape their page with a server-side language of your choice and then display it on your page.
Scraping can be a bit tricky, though, if you rely on their markup. If they change their markup, there is a chance that your script will break and stop updating the terms.
There are various tutorials available on how to scrape sites. Here are a few PHP examples:
Web scrape with PHP
PHP Screen Scraping Tutorial
Note: Make sure that they allow you to scrape the page before implementing it, so that you don't violate their rules.
Do you know if their API serves this content as JSON? A JSONP call could get the values to you, but it would make your page rely on JavaScript for users to see the updated terms.
Another option is to use PHP or any other server-side language to fetch the contents of the URL, process it, and return the block you require.
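For instance, a rough PHP sketch of that approach could look like the following; the element id "main_content" is a guess (inspect the real page for the right selector), and again, check their terms before scraping:
<?php
// Fetch the remote terms page and print just one element of it.
$html = file_get_contents('http://travel.ian.com/index.jsp?pageName=userAgreement&locale=en_US&cid=xxx');

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // the remote markup may not validate
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$nodes = $xpath->query('//div[@id="main_content"]');   // assumed id

if ($nodes->length > 0) {
    echo $doc->saveHTML($nodes->item(0));   // drop the block into your page
}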
I would suggest the load() function offered by jQuery. It makes a simple AJAX call to retrieve a file, and you can even use a selector to grab only part of the page. For example, to load the contents of an HTML page into a div:
$('#div_id').load('my_file.html');
Or just load a part of the page:
$('#div_id').load('my_file.html #main_text_id');

ExtJs and Sencha Touch Search Engine Optimization

I've started learning ExtJS 4 and Sencha Touch 2, and I really like them.
The main difference between the Sencha products and jQuery (and others) is that instead of enhancing pre-existing HTML, they generate their own DOM based on objects created in JavaScript.
Apps developed like this are great as intranet apps, but can you create a consumer-oriented website (like an online store) using Sencha?
I see that you don't write any HTML in ExtJS or Sencha Touch, so I am wondering how a fully JavaScript-generated page can be indexed by search engines like Google. As far as I know, the Google bot only sees the plain HTML code.
Is there any way to SEO a Sencha web app?
Kind Regards,
Dan Cearnau
Nothing is impossible. You just need to do some work.
1. Generate a standard static page using PHP or something else. The page should look like the page of your ExtJS app, but all links must carry GET params in the URL. PHP should also read those incoming GET params.
2. Add your ExtJS app to the page. In the app you have to take the GET params into account and make the proper request.
2a. If a real user opens your page: PHP generates the output, then the ExtJS app starts, hides the static page and generates the dynamic output.
2b. If a crawler opens your page, JS is effectively disabled, so PHP assembles the response according to the GET params and generates the output (a rough sketch follows below).
You can add params to the URL like #param1&param2&param3 in ExtJS when clicking on links, so real users will still be able to share their links. Just teach the router on the PHP side to understand URLs like this.
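A loose PHP sketch of steps 1 and 2b (all names, including catalog.json, are made up) might look like this:
<?php
// Render plain HTML from the GET params so a crawler without JavaScript
// sees real content; for normal visitors the ExtJS app loads afterwards,
// hides this block and builds its own DOM.
$page = isset($_GET['page']) ? $_GET['page'] : 'home';
$sort = isset($_GET['sort']) ? $_GET['sort'] : 'name';

$items = json_decode(file_get_contents('catalog.json'), true);   // stand-in data source
usort($items, function ($a, $b) use ($sort) {
    return strcmp($a[$sort], $b[$sort]);
});

echo '<div id="static-content">';
echo '<h1>' . htmlspecialchars($page) . '</h1>';
foreach ($items as $item) {
    echo '<p>' . htmlspecialchars($item['name']) . '</p>';
}
echo '</div>';
// The ExtJS app is included after this and replaces #static-content.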
There is no way to make SEO-friendly pages using JavaScript only.
With a full-blown app it would be close to impossible to do SEO; such apps are far too dynamic. Search engines work by indexing pages. They can deal with some AJAX by supporting pages with #s, but imagine how many pages a fully functional app would have: every view has hundreds of options that would each constitute a new page, which in turn has hundreds of options. All these virtual pages would most likely be just slight variations of other pages: a different sort order, a different filter, a moved panel, a search option.
If you use ExtJS to enhance a website the way jQuery is often used, then that's a different story. You will have HTML for the spiders to read, and then you enhance how the content works via JavaScript (see progressive enhancement).
Actually, in Touch 2 you can define routes and use the history support. This will treat sections of your app as actual pages in the browser, with standard functionality like going back in the browser; this will be your best bet when working on mobile SEO.
Getting any kind of SEO out of a Sencha app is impossible since it builds everything on the fly. Even if you use the history support in Sencha Touch, that's also done on the fly and has no effect on SEO.
For consumer-facing websites, Sencha is not the answer. For the back end (maybe for managing the shopping cart), it's a different story.

Prevent site configuration info from showing up on Google

I have a site that's running WordPress.
The main page has an embedded Flash player and an embedded iframe, and for some reason all the configuration info from the Flash player is showing up on Google for my site, and nothing else.
How can I have the main site information show up on Google without that Flash player config info showing up?
And can I customise what shows up at all?
If there's some way to tag the info I don't want to show up, or tag the info I do want to show up, I can probably do most of the edits myself; I just don't know where to start...
EDIT: I tried most of the suggestions below, and I didn't get anywhere...
Any other ideas?
Thanks a lot!
If you don't want Google, or other crawlers, to access certain parts of your website, you should use a robots.txt file. In it you specify which parts are accessible and which aren't; when crawlers reach your website, they will always look for this file for instructions.
You can check some documentation on how to do it here and here
In order to influence what text is used in the Google search result, try putting this within your head tags:
<meta name="description" content="WHATEVER YOU WANT DISPLAYED ON GOOGLE">
Source: http://static.googleusercontent.com/external_content/untrusted_dlcp/www.google.com/en/us/webmasters/docs/search-engine-optimization-starter-guide.pdf
Some more information from Google on controlling which parts of a page are indexed: apparently there are googleon/googleoff tags.
http://perishablepress.com/press/2009/08/23/tell-google-to-not-index-certain-parts-of-your-page/
Hope this helps.
If you want Google to index only part of your pages, you can't follow the normal SEO routines. You would have to provide a mechanism for working out whether the current client (requester) is a robot or not and, if it is, not render that part. This is the only way; otherwise a robot either gets the whole rendered content or, based on the robots.txt file (Robots Exclusion Protocol), doesn't get access at all.
Another way (which is not really smart and can't be guaranteed to work) is to dynamically inject your content into the page via JavaScript, because AFAIK robots don't run JavaScript.
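For illustration only, a crude PHP sketch of that robot check could look like the following; user-agent sniffing is unreliable, serving robots different content can be seen as cloaking, and the bot list here is deliberately incomplete:
<?php
// Skip the Flash player markup for requests that look like crawlers.
function looks_like_crawler($userAgent)
{
    return (bool) preg_match('/googlebot|bingbot|slurp|baiduspider/i', $userAgent);
}

$ua    = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
$isBot = looks_like_crawler($ua);

if (!$isBot) {
    // Only real visitors get the Flash player; crawlers get the rest of
    // the page without it.
    echo '<object data="player.swf" type="application/x-shockwave-flash"></object>';
}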
As search spiders won't render JavaScript-generated markup (JS is not run, since it is client-side, in the browser), a quick fix would be to not output any of the Flash markup in the initial HTML document and then use JS to add the Flash stuff on load.
Note: as far as I'm aware, Google is currently testing a JS-reading spider, so this may not work long term.
Google is returning this data because it simply can't find any content where it normally would. Search engines require content; they're not advanced enough to process your multimedia and work out what it's all about.
Google will IGNORE your meta description if it doesn't feel that it reflects your page content (which here is nothing but iframes and JS).
Use SWFObject to provide alternate content for users without Flash (including search engines). Make sure it's not some dinky text like "download flash here", but a lengthy, descriptive piece about your site or the media those visitors would normally experience.
Use robots.txt or <meta name="robots" content="noindex,follow"> for the iframe content to prevent it from being indexed.
For the love of all things holy, please look at reducing the number of JS files and the amount of inline JS on your site (I'd recommend WP-Minify, since it's obvious that you love plugins).

HTML Snapshot for crawler - Understanding how it works

I'm reading this article today. To be honest, I'm really interested in point "2. Much of your content is created by a server-side technology such as PHP or ASP.NET".
I want to check whether I have understood it :)
I create a PHP script (gethtmlsnapshot.php) in which I include the server-side AJAX page (getdata.php) and escape the parameters (for security). Then I add it at the end of the static HTML page (index-movies.html). Right? Now...
1 - Where do I put that gethtmlsnapshot.php? In other words, the crawler needs to call that page, but if there is no link to it on the main page, the crawler can't call it :O How can the crawler call the page with the _escaped_fragment_ parameters? It can't know them if I don't specify them somewhere :)
2 - How can the crawler call that page with the parameters? As before, I need links to that script with the parameters, so that the crawler browses each page and saves the content of the dynamic result.
Can you help me? And what do you think about this technique? Wouldn't it be better if the developers of crawlers built their bots to work in some other way? :)
Let me know what you think. Cheers
I think you got something wrong, so I'll try to explain what's going on here, including the background and alternatives, as this is indeed a very important topic that most of us have stumbled upon (or at least something similar) from time to time.
Using AJAX, or rather asynchronous incremental page updating (most pages actually don't use XML but JSON), has enriched the web and provided a great user experience.
It has, however, also come at a price.
The main problem was clients that didn't support the XMLHttpRequest object, or JavaScript at all.
In the beginning you had to provide backwards compatibility.
This was usually done by providing links, capturing the onclick event and firing an AJAX call instead of reloading the page (if the client supported it).
Today almost every client supports the necessary functions.
So the problem today is search engines, because they don't support these functions. Well, that's not entirely true, because they partly do (especially Google), but for other purposes.
Google evaluates certain JavaScript code to prevent black-hat SEO (for example a link pointing somewhere, but with JavaScript opening some completely different webpage, or HTML keyword text that is invisible to the client because it is removed by JavaScript, or the other way round).
But keeping it simple, it's best to think of a search engine crawler as a very basic browser with no CSS or JS support (it's the same with CSS: it is partly parsed, for special reasons).
So if you have "AJAX links" on your website, and the Webcrawler doesn't support following them using JavaScript, they just don't get crawled. Or do they?
Well the answer is JavaScript links (like document.location whatever) get followed. Google is often intelligent enough to guess the target.
But ajax calls are not made. simple because they return partial content and no senseful whole page can be constructed from it as the context is unknown and the unique URI doesn't represent the location of the content.
So there are basically 3 strategies to work around that:
1. have an onclick event on the links with a normal href attribute as a fallback (IMO the best option, as it solves the problem for clients as well as for search engines)
2. submitting the content pages via your sitemap so they get indexed, but completely apart from your site links (usually pages provide a permalink to these URLs so that external pages can link to them for the PageRank)
3. the AJAX crawling scheme
The idea is to have your JavaScript XMLHttpRequest calls tied to corresponding href attributes that look like this:
www.example.com/ajax.php#!key=value
so the link looks something like:
<a href="http://www.example.com/ajax.php#!page=imprint" onclick="handleajax(); return false;">go to my imprint</a>
The function handleajax could evaluate the document.location variable to fire the incremental asynchronous page update. It's also possible to pass an id, a URL or whatever.
The crawler, however, recognises the AJAX crawling scheme format and automatically fetches http://www.example.com/ajax.php?_escaped_fragment_=page=imprint instead of http://www.example.com/ajax.php#!page=imprint,
so the query string then contains the hash fragment, from which you can tell which partial content has been requested.
You then just have to make sure that http://www.example.com/ajax.php?_escaped_fragment_=page=imprint returns a full website that looks exactly the way the page should look to the user after the XMLHttpRequest update has been made.
A very elegant solution is also to pass the <a> object itself to the handler function, which then fetches the same URL as the crawler would have fetched, using AJAX but with additional parameters. Your server-side script then decides whether to deliver the whole page or just the partial content.
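A simplified sketch of such an ajax.php (names and markup are my assumptions, not the article's code) could look like this:
<?php
// The crawler turns  ajax.php#!page=imprint  into
// ajax.php?_escaped_fragment_=page=imprint, so the script can decide
// whether to return a complete HTML snapshot or only the partial content
// used by the asynchronous update.
if (isset($_GET['_escaped_fragment_'])) {
    parse_str($_GET['_escaped_fragment_'], $params);            // crawler request
    $page     = isset($params['page']) ? $params['page'] : 'home';
    $fullPage = true;
} else {
    $page     = isset($_GET['page']) ? $_GET['page'] : 'home';  // normal AJAX call
    $fullPage = false;
}

// Stand-in for your real content lookup.
$content = '<div id="content">Content for the ' . htmlspecialchars($page) . ' section</div>';

if ($fullPage) {
    // Full snapshot: what the user would see after the AJAX update.
    echo '<!DOCTYPE html><html><head><title>' . htmlspecialchars($page) . '</title></head><body>'
        . $content . '</body></html>';
} else {
    echo $content;   // the client-side handler injects this into the page
}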
It's a very creative approach indeed, and here comes my personal pro/con analysis:
pro:
partially updated pages receive a unique identifier, at which point they are fully qualified resources in the semantic web
partially updated pages receive a unique identifier that can be presented by search engines
con:
it's just a fallback solution for search engines, not for clients without JavaScript
it provides opportunities for black-hat SEO, so Google for sure won't adopt it fully, or rank pages using this technique highly, without proper verification of the content.
conclusion:
Usual links with legacy, working href attributes as a fallback, plus an onclick handler, are a better approach because they also provide functionality for old browsers.
The main advantage of the AJAX crawling scheme is that partially updated pages get a unique URI, and you don't have to create duplicate content that somehow serves as the indexable and linkable counterpart.
You could argue that an AJAX crawling scheme implementation is more consistent and easier to implement. I think this is a question of your application design.