How to make MediaWiki sitemap URLs match the canonical URLs? - mediawiki

From my homepage, links look like "/index.php?title=My_Page_Name". I turned on $wgEnableCanonicalServerLink, so my pages contain canonical meta data, and the URL is the same. So far so good!
Unfortunately, generateSitemap.php is making entries that look like "/index.php/My_Page_Name", i.e. without the "title=".
Google's indexing is mad about this discrepancy. What's the magic incantation to make them all contain "title="?

Related

Is there a way to make search bots ignore certain text? [closed]

I have a blog (you can find it through my profile if you want). It is new, and so are the results of Google's robots parsing it.
The results were alarming to me. Apparently the two most common words on my site are "rss" and "feed", because I use text for links like "Comments RSS", "Post Feed", etc. These two words are present in every post, while other words are rarer.
Is there a way to make these links disappear from Google's parsing? I don't want technical links getting indexed. I only want content, titles, descriptions to get indexed. I am looking for something other than replacing this text with images.
I found some old discussions on Google from 2007 (I think many things could have changed in 3 years, hopefully this too).
This question is not about robots.txt and how to make Google ignore pages. It is about making it ignore small parts of the page, or transforming the parts in such a way that it will be seen by humans and invisible to robots.
There is a simple way to tell Google not to index parts of your documents, namely the googleon and googleoff comments:
<p>This is normal (X)HTML content that will be indexed by Google.</p>
<!--googleoff: index-->
<p>This (X)HTML content will NOT be indexed by Google.</p>
<!--googleon: index-->
In this example, the second paragraph will not be indexed by Google. Notice the “index” parameter, which may be set to any of the following:
index: content surrounded by “googleoff: index” will not be indexed by Google
anchor: anchor text for any links within a “googleoff: anchor” area will not be associated with the target page
snippet: content surrounded by “googleoff: snippet” will not be used to create snippets for search results
all: content surrounded by “googleoff: all” is treated with all of the above attributes
source
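To make the semantics of these comments concrete, here is a minimal Python sketch of how an indexer that honours them might drop the fenced region before indexing. It is an illustration only, not Google's implementation; the function name strip_googleoff and the sample page are made up for this example, and a later answer in this thread says the comments are only honoured by the Google Search Appliance.

```python
import re

# Matches a region opened by <!--googleoff: TAG--> and closed by the
# matching <!--googleon: TAG--> comment (same TAG on both sides).
GOOGLEOFF_REGION = re.compile(
    r"<!--\s*googleoff:\s*(?P<tag>\w+)\s*-->.*?<!--\s*googleon:\s*(?P=tag)\s*-->",
    re.DOTALL | re.IGNORECASE,
)


def strip_googleoff(html: str, tag: str = "index") -> str:
    """Remove regions fenced by googleoff/googleon comments for one tag."""

    def drop(match):
        # Only drop regions fenced with the requested tag; keep the others.
        return "" if match.group("tag").lower() == tag else match.group(0)

    return GOOGLEOFF_REGION.sub(drop, html)


# Hypothetical sample page, modelled on the example above.
page = (
    "<p>This content will be indexed.</p>\n"
    "<!--googleoff: index-->\n"
    "<p>Comments RSS</p>\n"
    "<!--googleon: index-->\n"
)
print(strip_googleoff(page))
```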
Google leaves HTML elements that carry the data-nosnippet attribute out of the snippets it shows in search results:
<p>
This text can be included in a snippet
<span data-nosnippet>and this part would not be shown</span>.
</p>
Source: Special tags that Google understands - Inline directives
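As a rough illustration of what data-nosnippet means for snippet generation, the following Python sketch collects the text of a page while skipping everything inside an element that carries the attribute. It only mimics the documented effect; the class and function names are invented for this example and nothing here reflects how Google actually builds snippets.

```python
from html.parser import HTMLParser

# Void elements never get an end tag, so they must not change nesting depth.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class SnippetText(HTMLParser):
    """Collect text, skipping anything inside a data-nosnippet element."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0  # greater than 0 while inside a data-nosnippet element

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        # Start skipping at a data-nosnippet element and track nested elements.
        if self.skip_depth or any(name == "data-nosnippet" for name, _ in attrs):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)


def snippet_text(html: str) -> str:
    """Return the snippet-eligible text with whitespace collapsed."""
    parser = SnippetText()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.parts).split())


example = (
    "<p>\nThis text can be included in a snippet\n"
    "<span data-nosnippet>and this part would not be shown</span>.\n</p>"
)
print(snippet_text(example))
```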
I work on a site with top-3 google ranking for thousands of school names in the US, and we do a lot of work to protect our SEO. There are 3 main things you could do (which are all probably a waste of time, keep reading):
Move the stuff you want to downplay to the bottom of your HTML and use CSS to place it where you want readers to see it. This won't hide it from crawlers, but they'll value it lower.
Replace those links with images (you say you don't want to do that, but don't explain why not)
Serve a different page to crawlers, with those links stripped. There's nothing black hat about this, as long as the content is fundamentally the same as a browser sees. Search engines will ding you if you serve up a page that's significantly different from what users see, but if you stripped RSS links from the version of the page crawlers index, you would not have a problem.
That said, crawlers are smart, and you're not the only site filled with permalink and rss links. They care about context, and look for terms and phrases in your headings and body text. They know how to determine that your blog is about technology and not RSS. I highly doubt those links have any negative effect on your SEO. What problem are you actually trying to solve?
If you want to build SEO, figure out what value you provide to readers and write about that. Say interesting things that will lead others to link to your blog, and crawlers will understand that you're an information source that people value. Think more about what your readers see and understand, and less about what you think a crawler sees.
Firstly, think about the issue. If Google thinks "RSS" is the main keyword, that may suggest the rest of your content is a bit shallow and needs expanding. Perhaps this should be the focus of your attention. If the rest of your content is rich I wouldn't worry about the issue, as a search engine should know what the page is about from the title and headings. Just make sure RSS etc. is not in a heading or a bold or strong tag.
Secondly, as you rightly mention, you probably don't want to use images, as they are not accessible to screen readers without alt text, and if they have alt text or supporting text then you add the keyword back in. However, aria-live may help you get around this issue, but I'm not an expert on accessibility.
Options:
Use JavaScript to write that bit of content (maybe load it via Ajax after the page loads). Search engines like Google can execute JavaScript, but I would guess they won't value any JS-written content very highly.
Re-word the content or remove duplicates of it, one prominent RSS feed link may be better than several smaller ones dotted around the page.
Use the CSS content property with the :before or :after pseudo-elements to add your content. I'm not sure whether bots index words in CSS content properties or know that content's value in relation to each page, but it seems unlikely. Putting words like RSS in the CSS basically says it's a style thing, not an HTML thing, so even if engines do index it they won't attach much, if any, value to it. For example, the CSS could be:
.add-text:after { content:'View my RSS feed'; }
Note the above will not work in older versions of IE, so you may need some IE conditional comments if you care about that.
"googleon" and "googleoff" are only supported by the Google Search Appliance (when you host your own search results, usually for your own internal website).
They are not supported by Google's web search at all. So please refrain from using them; I also think that answer should not be marked as correct, as it might cause confusion.
Now, to get Google to exclude part of a page, you will need to place that content in a separate file, such as excluded.html, and use an iframe to display that content in the host page.
The iframe tag grabs content from another file and inserts it into the host page. I think there is no other available method so far.
The only control that you have over the indexing robots is the robots.txt file. See this documentation, linked by Google on their page explaining the usage of the file.
You can basically prohibit certain links and URLs, but not necessarily keywords.
Other than black-hat server-side methods, there is nothing you can do. You may want to look at why you have those words so often and remove some of them from the site.
It used to be that you could use JS to "hide" things from googlebot, but you can't now that it parses JS. ( http://www.webmasterworld.com/google/4159807.htm )
Google's crawlers are smart, but the people who program them are smarter. Humans always see what is sensible on a page; they will spend time on a blog that has good content that is rare and unique.
It is all about common sense: how people visit your blog and how much time they spend there. Google measures search results in the same way. Your page ranking also increases as daily visits increase and as the site content gets better and is updated every day.
This page has the word "Answer" repeated multiple times. That doesn't mean it will not get indexed. What matters is how useful it is to everyone.
I hope this gives you some idea.
You have to manually detect Googlebot from the request's user agent and serve it slightly different content from what you normally serve to your users.
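A minimal Python sketch of that user-agent check could look like the following. Everything in it is an assumption made for illustration: the token list, the feed-link class used to mark technical links, and the function names. User agents can be spoofed, and other answers here warn that serving crawlers substantially different content is penalised, so treat it as a sketch of the idea rather than a recommendation.

```python
import re

# Assumption: crawler user agents contain one of these tokens.
CRAWLER_TOKENS = ("googlebot", "bingbot")

# Hypothetical markup convention: technical links carry class="feed-link".
FEED_LINK = re.compile(r'<a\b[^>]*class="feed-link"[^>]*>.*?</a>', re.DOTALL)


def is_crawler(user_agent: str) -> bool:
    """Return True if the user agent string looks like a known crawler."""
    ua = user_agent.lower()
    return any(token in ua for token in CRAWLER_TOKENS)


def render_for(user_agent: str, html: str) -> str:
    """Strip the marked technical links for crawlers; leave the page as-is otherwise."""
    return FEED_LINK.sub("", html) if is_crawler(user_agent) else html


page = '<p>Post body</p><a class="feed-link" href="/feed">Comments RSS</a>'
bot_ua = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
print(render_for(bot_ua, page))
```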

About META Tags: Can not Find Them in Page Source!

I have encountered many sites, including stackoverflow.com, whose page source does not show META tags like keywords and description.
I am just wondering whether they block them with some sort of technology or simply drop them since, as far as I know, those tags are not as valuable as they used to be.
If they block them, what kind of software or technology do they need? If not, how does Google extract a description from those sites when it displays search results?
Lot of dumb questions, thanks for your time and reply!
Any input is appreciated!
They're not MATA tags, they're META tags. They are not as important as the actual content of your site and the other sites that link to yours, since it's well known that meta tag content is easier to abuse and misrepresent. Meta elements are more useful in the areas where there is no benefit from such abuse, eg. content encoding or language, but some of this data can be sent by the web server in the HTTP headers anyway. So you rarely, if ever, need any meta elements.
You don't need any sort of technology to 'block' meta tags. Every tag is just a bit of text you insert into your HTML. If you don't want to send out a meta tag, you just don't write it into the HTML.
If you want specific information on how Google views your site then you could start with their webmasters page.
I just had a look around on Google; maybe the following will help you.
Avoid the META keyword tag!
Do not use the meta keywords tag. Many people still think of this as a quick fix for SEO. It’s not. Google no longer uses it. In fact, it is likely that Google penalizes sites that do employ the meta keywords tag. Yahoo is perhaps the only search engine that still uses the meta keywords tag but places very little weight on it.
Death of META Tag
pretty old link though
"In the past we have indexed the meta keywords tag but have found that the high incidence of keyword repetition and spam made it an unreliable indication of site content and quality. We do continue to look at this issue, and may re-include them if the perceived quality improves over time," said Jon Glick, AltaVista's director of internet search.

How to request certain page elements not be indexed

Essentially I would like to specify an element to be ignored by search engines. If I reference pornography from an academic standpoint, I don't want Google to list my site under porn searches, for instance, but would like it to index the rest of the page.
Is this possible? I'm sure I have come across a method of including meta data into one's html to achieve this.
I have tried to find this on the web, but have been unsuccessful.
I can't tell whether this page applies: it is a draft specification, so I don't know whether crawl bots recognise it.
Use a robots.txt file in the root directory of your website:
User-agent: *
Disallow: /myreference_dir/
Disallow: /myreference_dir/myarticle.html
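To check what such rules actually block, Python's standard urllib.robotparser module can evaluate them against sample URLs. This is a small sketch; the example.com URLs are placeholders. Note that robots.txt rules apply to whole URLs, not to individual elements within a page.

```python
from urllib.robotparser import RobotFileParser

# The rules from the answer above.
rules = """\
User-agent: *
Disallow: /myreference_dir/
Disallow: /myreference_dir/myarticle.html
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())


def allowed(url: str) -> bool:
    """Return True if a crawler matching 'User-agent: *' may fetch the URL."""
    return parser.can_fetch("*", url)


# Placeholder URLs, used only to exercise the rules.
for url in (
    "https://example.com/index.html",
    "https://example.com/myreference_dir/myarticle.html",
):
    print(url, allowed(url))
```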
Wikipedia

Do html entities in meta tags influence indexing?

I was wondering if using HTML entities in meta tags (like keywords and description) is the best way to go?
Does it influence the indexing from search engines?
I'd put the meta tag contents in without entities, as long as my charset allows the characters. I researched a bit and found this on Google Webmasters/Site owners help, and the example contains £9.24, not &pound;9.24 nor &#163;9.24.
While it is true that meta tags aren't a big factor for success, they can be a factor for failure. Indexing robots may detect an attempt to cheat them with invalid keywords or descriptions. From Wikipedia:
Early versions of search algorithms relied on webmaster-provided information such as the keyword meta tag, or index files in engines like ALIWEB. Meta tags provide a guide to each page's content. But using meta data to index pages was found to be less than reliable because the webmaster's choice of keywords in the meta tag could potentially be an inaccurate representation of the site's actual content. Inaccurate, incomplete, and inconsistent data in meta tags could and did cause pages to rank for irrelevant searches. Web content providers also manipulated a number of attributes within the HTML source of a page in an attempt to rank well in search engines.
The meta description can be used as the default snippet.
The meta keywords are pretty much completely ignored, but everyone still uses them anyway.
Neither will have much (if any) effect on your ranking, but a good meta description could boost your clickthrough.
Entities make a difference only to amateur HTML "parsers" built with regular expressions. They aren't a problem for Google.
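The point about proper parsers can be checked with Python's standard html module: a literal character, a named entity and a numeric character reference all decode to the same text, so a conforming parser sees no difference between them. The price string is just sample data.

```python
from html import unescape

# The same price written literally, as a named entity, and as decimal
# and hexadecimal numeric character references.
variants = ["£9.24", "&pound;9.24", "&#163;9.24", "&#xA3;9.24"]

# Decode each variant the way an HTML parser would.
decoded = {unescape(value) for value in variants}
print(decoded)
```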
Meta tags are not ignored. They are still read by Google, so I think they should be used properly. Google loves pages that are built properly, but remember that a meta tag is only one of hundreds of things that robots take into consideration.
If there are umlauts, don't use entities.
I think Google indexes the word "bremsbeläge" as both "bremsbelaege" and "bremsbeläge".
The meta tag "description" does have an effect on the ranking. It is the description that Google gives in the listing, so this is the most important part that influences people to click on your link. When more people click on your link, Google assumes it has more worth for users in the searches and moves you up.

HTML meta keyword/description element, useful or not?

Does filling out HTML meta description/keyword tags matter for SEO?
This article has some info on it.
A quick summary for keywords is:
Google and Microsoft: No
Yahoo and Ask: Yes
Edit: As noted below, the meta description is used by Google to describe your site to potential visitors (although may not be used for ranking).
Google will use meta tags, but only the description, to better summarize your site. They won't help to increase your PageRank.
See:
http://www.google.com/support/webmasters/bin/answer.py?hl=en&answer=79812
EDIT: @Petr, are you sure that meta tags influence PageRank? I am pretty sure that they don't, but if you have some references, I'd love to learn more about this. I have seen this, from the Official Google Webmaster Central Blog, which is what leads me to believe that they don't:
Even though we sometimes use the description meta tag for the snippets we show, we still don't use the description meta tag in our ranking.
Keywords: Useless
None of the major search engines use them at all.
Description: Useful!
Replaces the default text in search engines if there isn't anything better. Use this to describe the page properly. Not perhaps useful for SEO, but it makes your results look more useful, and will hopefully increase click through rates by users.
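The fallback described here (use the meta description if there is one, otherwise take text from the page) can be sketched in Python as follows. It is a simplified illustration, not how any search engine actually chooses a snippet; the class name, the function name and the 160-character limit are assumptions made for this example.

```python
from html.parser import HTMLParser


class PageSummary(HTMLParser):
    """Record the meta description and the text found inside <body>."""

    def __init__(self):
        super().__init__()
        self.description = None
        self.body_text = []
        self.in_body = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "description":
            self.description = attrs.get("content")
        elif tag == "body":
            self.in_body = True

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False

    def handle_data(self, data):
        if self.in_body:
            self.body_text.append(data)


def summary(html: str, limit: int = 160) -> str:
    """Prefer the meta description; otherwise fall back to leading body text."""
    parser = PageSummary()
    parser.feed(html)
    parser.close()
    if parser.description:
        return parser.description.strip()
    # No description: collapse whitespace and truncate (limit is an assumption).
    text = " ".join("".join(parser.body_text).split())
    return text[:limit]


with_meta = (
    '<html><head><meta name="description" content="A blog about trains.">'
    "</head><body><p>Hello world.</p></body></html>"
)
without_meta = "<html><head></head><body><p>Hello   world.</p></body></html>"
print(summary(with_meta))
print(summary(without_meta))
```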
If you want your users to share your content on Facebook, the meta tags actually come in handy, as Facebook will use this information when styling the post.
See Facebook Share Partners for more information.
If your pages are part of an intranet then both the keywords and description meta tags can be very useful. If you have access to the search engine crawling your pages (and thus can look specifically for particular tags or markup), they can add tremendous value without costing you much time, and they are easy to change.
For pages outside of an intranet, you may have less success with keywords for reasons mentioned above.
The description meta tag is important because it is displayed verbatim in Google search results below your site title. In its absence, Google pulls the first few lines of content and shows them on SERPs. The description tag lets you control what search engine users see as a page summary before clicking. This helps increase your CTR from search.
The usefulness of the keywords meta tag is still inconclusive, but SEOers continue to use it. Avoid using more than 5-6 keywords in the tag per page, so that Google does not detect and penalise suspected keyword dumping.
The problem with keyword meta tags is they are a completely unreliable source of information for search engines. The temptation for people to alter search results in their favour with misleading keywords is just too great.
Those are two of the things that are used by search engines. The exact weight of each changes frequently; however, they are generally regarded as being fairly important.
One thing to note: care should be taken when entering values. The more relevant the keywords and description are to the textual content of the site, the more weight may be given to them. Of course there are no guarantees, as nobody outside of the search engine companies really knows what algorithms are being used.
This post talks a bit more about some aspects.