Determine whether each URL in a bulk list is dead, live, or parked

There's a list of hundreds of URLs that need to be checked to determine whether each site is live (someone has put up their own content, even if it's just a landing page), unreachable, or parked.
Unreachable is self-explanatory, but distinguishing actual user content from a parked domain is trickier. What I mean is someone who hosts a domain through GoDaddy and leaves its default landing page in place, versus a hosted site with unique content on its landing page.
Using HTTP status codes (2xx, 3xx, 4xx, etc.) isn't reliable. Does anyone know of a solution? It doesn't need to be 100% accurate in all instances, just accurate when it claims to be confident, in order to minimise manual checking.
The best solution I can come up with is looking up who the site is registered with and comparing its HTML against other sites registered there, calling it parked where the match is above 0.9 or something to that effect. This is clunky.
Are there any ready-made solutions for this problem? If not, is there a more efficient methodology?
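For illustration, this is the kind of rough heuristic I have in mind - a minimal Python sketch. It assumes the Python requests package, and the parking-page phrases and thresholds are made up and would need tuning against a hand-checked sample:

    # Rough sketch only: anything it is unsure about still goes to manual checking.
    import requests

    PARKED_MARKERS = [
        "this domain is for sale",
        "domain parking",
        "parked free, courtesy of godaddy",
        "buy this domain",
    ]

    def classify(url, timeout=10):
        """Return 'dead', 'parked', 'live', or 'unsure' for one URL."""
        try:
            resp = requests.get(url, timeout=timeout, allow_redirects=True)
        except requests.RequestException:
            return "dead"
        html = resp.text.lower()
        hits = sum(marker in html for marker in PARKED_MARKERS)
        if hits >= 2:
            return "parked"    # several parking phrases matched, call it confidently
        if hits == 1 or len(html) < 2000:
            return "unsure"    # thin or ambiguous page; check by hand
        return "live"

    if __name__ == "__main__":
        for url in ["http://example.com", "http://example.org"]:
            print(url, classify(url))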

Related

a/b testing a major html/css redesign

At my company we are redesigning our e-commerce website. The HTML and CSS are being rewritten from the ground up to make the website responsive / mobile friendly.
Since it concerns one of our biggest websites, responsible for generating over 80% of our revenue, it is very important that nothing goes "wrong".
Our application is running on a LAMP stack.
What are the best practices for testing a major redesign?
Some issues I am thinking of:
- When A/B testing a whole design (if that is even possible), I guess you definitely don't want Google to come by and index your new design (since it's still in the test phase). How should this be handled?
- Should you redirect a percentage of the users to a new URL (or perhaps a subdomain)? Or is it better to serve the new content from the existing indexed URLs based on session?
- How do you compare statistics from a Google Analytics point of view?
- How do you hint to Google about a new design? Should I, for example, create a new UA code?
A solution might be to set a cookie only for customers who enter the website via the homepage. That way you exclude AdWords traffic and returning visitors, who might be expecting the old design; serve them the original website and leave their experience untouched.
Start the test with homepage traffic only: set the cookie and redirect a percentage of visitors to a subdomain. Measure conversion rate with a dimension in Google Analytics, within the same Analytics account. Use robots.txt to disallow the subdomain so search engines don't crawl it.
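Note that robots.txt rules apply per host, so the disallow would live in the robots.txt served by the test subdomain itself. A minimal example, with a made-up subdomain name:

    # robots.txt served at https://beta.example.com/robots.txt (test subdomain only)
    User-agent: *
    Disallow: /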
Marc, you're mixing a few different concerns here:
Instrumentation. If your changes can be expressed via HTML/CSS/JavaScript only, i.e. they are optimizations in nature, you may be able to instrument them using tools like VWO or Optimizely. If there are server-side changes too, then a tool like Sitespect (any server stack) or Variant (Java only) might be in order. The advantage of using a commercial product is that it provides a number of important features out of the box, e.g. collecting experiment data and experience stability (a returning user sees the same experience). You may be able to instrument on your own, but unless you're looking at a handful of pages, that is typically hard, particularly if you want to do it outside of the app, via DevOps mechanisms.
SEO. If you get your instrumentation right, this shouldn’t be an issue. Public URIs should not differ for the control and variant of the same resource.
Traffic routing. Another reason to consider a commercial tool. They factor that out of your app and let you set percentages. Some tools, like Variant, will allow you to write custom targeters, e.g. “value” users always see control.
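To illustrate the routing and experience-stability point, here is a minimal Python sketch of deterministic percentage bucketing. This is not any particular tool's API; the visitor id would come from a cookie, and the names and the 10% split are illustrative:

    # Hash a stable visitor id into a bucket so the same visitor always gets the
    # same experience. Illustrative only; not tied to any specific product.
    import hashlib

    def assign_variant(visitor_id, percent_in_test=10):
        digest = hashlib.sha256(visitor_id.encode("utf-8")).hexdigest()
        bucket = int(digest[:8], 16) % 100      # stable value in 0..99
        return "redesign" if bucket < percent_in_test else "control"

    # The same id always lands in the same bucket.
    print(assign_variant("visitor-cookie-1234"))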

Google not listing site, but no errors and content appears rich

We have inherited a website of around 30,000 pages (actual content), each with a unique title and rich content. Whatever we try, Google seems not to want to list the new site, and visits have dropped by 80% (vs. the old site and domain).
The website was redeveloped recently and changed domain at that point too, which hasn't helped us work out what is happening; this is when the drop in visitors started. The old domain was registered in Oct 2005, the new one in 2009, so both have some age to them. In Webmaster Tools we recently submitted a notice that the site address had changed (7th Dec), possibly too soon to see any effect yet.
The older CMS was hard to redirect from, so we have a very large .htaccess file (1MB). Is there a limit to how large this file should be for redirects? I could perhaps code something in PHP to handle the 30,000 redirects programmatically, but the URLs on the old site were pretty strange, using comma separation and other symbols. I have used header checkers and the correct 301s are being returned.
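For reference, the kind of replacement I am considering for the giant .htaccess is an Apache RewriteMap rather than PHP. A rough sketch, assuming we can edit the main server/vhost config (RewriteMap is not allowed in .htaccess) and with a made-up file path:

    # old-to-new.txt holds one "old-path new-path" pair per line (keys include the
    # leading slash). A dbm map built with httxt2dbm would be faster for 30,000 entries.
    RewriteEngine On
    RewriteMap oldurls "txt:/etc/apache2/old-to-new.txt"
    RewriteCond "${oldurls:$1|NOT_FOUND}" "!=NOT_FOUND"
    RewriteRule "^(/.*)" "${oldurls:$1}" [R=301,L]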
We also submitted a sitemap with 25,000 pages via Webmaster tools, of which it listed 11! There were no errors and as I say, the page content is rich with descriptive titles.
Google can see 68,000 pages in Webmaster Tools, but the number actually listed is only 175, so the problem seems quite significant; the rest remain 'unselected'. The curve of the 'unselected' count seems to mirror the efforts we have put in to get the site listed, yet those pages are not being indexed.
Site: http://bit.ly/VKYClf
(The older site was the same name but hyphenated)
I have researched a lot, but all steps so far have been fruitless and the number of pages listed hangs around the 170 mark.
Can you think of any specific steps worth taking to identify the factors preventing the site from being listed?
Thanks in advance and happy to provide more information on anything.
EDIT: In case it helps anyone else, the website is built on WordPress but uses custom vars to generate lots of pages on the fly... since WP 2.9 a canonical tag is added automatically, but the two were not playing well together and the canonical tags were pointing at whatever WP could find with that ID... I have now removed them and hopefully things are moving forward.
It's going to take a while to get fully indexed if you have just submitted 25,000 URLs; it may take months.
I would recommend you log into Google Webmaster Tools and go to Health > Fetch as Google, where you can submit individual URLs to crawl; Google should respond within 24 hours. If a fetched page still does not show up in the index, you have major problems that I would not be able to assist with. You get 500 fetches a week. If you want your site indexed right away, it may be your only shot.

Remote images displaying only sometimes

I maintain a local intranet site that, among other things, displays movie poster images from IMDB.com. Until recently, I simply had a Perl script download the images I needed and save them to the local server. But that became a HUGE space hog, so I thought I could simply point my site directly at the IMDB servers, since my traffic is very minimal.
The result was that some images would display while others wouldn't, and images that did display would sometimes disappear after a few refreshes. The images existed on the IMDB servers; they just wouldn't display on my page.
It seems unlikely to me that IMDB would somehow block this kind of access, but is that possible? Is there something that needs to be configured on my end?
I'm out of ideas - it just doesn't make sense to me.
I'm serving my pages with mod_perl and HTML::Mason, if that's relevant.
Thanks,
Ryan
Apache/2.2.14 (Unix) mod_ssl/2.2.14 OpenSSL/0.9.8l DAV/2 mod_perl/2.0.4 Perl/v5.10.0
Absolutely they would block that kind of access. You're using their bandwidth, which they have to pay for, for your website. Sites will often look at the referrer, see that the request isn't coming from their own site, and either block or throttle access. You're likely seeing this as an intermittent problem because IMDB is allowing you some amount of use of their images.
To find out more, look at the HTTP traffic on the client side, either with a browser plugin or by scripting it. Look at the HTTP response codes and you'll probably see some 4xx or 5xx responses.
I would suggest either caching the images in a cache that expires unused images, which will balance accesses against space, or perhaps getting a paid IMDB account. You may be able to get an API key to use when fetching images that indicates you are a paying customer.
IMDB could certainly be preventing this 'bandwidth theft' by checking the "Referer" header. More info here: http://www.thesitewizard.com/archive/bandwidththeft.shtml
Why is it intermittent? Maybe they only implement this on some of the servers in their web farm.
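To confirm whether the Referer header is what triggers it, a quick check along these lines would help. This is only a sketch: it assumes the Python requests package, and the image URL and referrer below are placeholders:

    # Fetch the same image with and without a foreign Referer header and compare
    # status codes. A 403 only on the second request points at referer blocking.
    import requests

    url = "http://example.com/images/poster.jpg"   # substitute a real image URL

    plain = requests.get(url, timeout=10)
    foreign = requests.get(url, timeout=10,
                           headers={"Referer": "http://intranet.example.local/"})

    print("no Referer:     ", plain.status_code)
    print("foreign Referer:", foreign.status_code)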
Just to add to the existing answers, what you're doing is called "hotlinking", and people who run websites don't like it very much. Google for "hotlink blocking".

Are there any tips for minimising access to a public page without a login?

I have a page that is just a non-interactive display for a shop window.
Obviously, I don't link to it, and I'd also like to avoid people stumbling across it (by Google etc).
It will always be powered by Chrome.
I have thought of...
- Checking the User-Agent for Chrome
- Ensuring the resolution is 1920 x 1080 (not that useful, as it is a client-side check)
- Blocking it in robots.txt to keep Google out of it
Do you have any more suggestions?
Should I not really worry about it?
Not that I would EVER recommend what I'm about to suggest, but how about filtering by IP address? Since your provider's IP is rarely going to change, you could use JavaScript to kick out or deny requests from IP addresses other than yours, maybe with a clean redirect to http://www.google.com or something silly like that. Although I would still suggest locking it down with a login and password and just having it write a never-expiring cookie. That's still not a great idea, but a shade better than the road you're trucking down right now.
You could always limit connections by IP address (if you know it ahead of time and it's reliable):
Apache's access control
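For example, with Apache 2.4 syntax (the path and address below are placeholders; on Apache 2.2 the equivalent is the Order/Deny/Allow directives):

    # Only the shop's own address can reach the display page; everyone else gets a 403.
    <Location "/shop-window">
        Require ip 203.0.113.50
    </Location>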
If it is just for a shop window, do you even need access to a web page?
You can host the file locally.
Personally, I wouldn't worry about it; if no one is linking to it externally it is unlikely ever to be found by search engines.

SSL Encryption and an external image server

I have an ASP.NET website platform that I use for scores of clients. Each client gets their own website (a copy of the core site that can then be customized). The site includes a fair amount of content - articles on health and wellness - that is loaded from a central content server: I copy the HTML for each article from the content server and insert the text into the page as it is produced.
Easy so far.
However, these articles contain image references that point back to the central server. The problem is that these sites are always accessed (every page) over an SSL link. When a page with an external image reference is loaded, the visitor receives a message that the page "contains both secure and insecure elements" (or something similar) because the images come from the (unsecured) server. There is really no way around this.
So, in your judgment, is it better to:
A) just put a cert on the content server so I can get the images over SSL? Are there problems with a page pulling content from two servers with different certs? Any other thoughts?
B) change the links to the article presentation pages so they don't use SSL? Those pages don't need SSL, but the left side of the page contains lots of links to pages that do need it - all of which are currently relative links. Making them all absolute links is grody because each client's site has its own URL, so all the links would need to be generated in code (blech).
C) Something else that I haven't thought of? This is where I am hoping that someone with experience in the area will offer something brilliant!
NOTE: I know that I cannot get rid of the warning about insecure elements - it is there for a reason. I am just wondering if anyone else has experience in this area and has a reasonable compromise or some new insight.
Not sure how feasible this is, but it may be possible to use a rewrite or proxy module to mirror the image directory structure of the central server on each clone. With such a rule in place you could use relative image URLs instead, and internally rewrite all requests for those images over to the central server, silently.
e.g.:
https://cloneA/banner.jpg -> http://central/static/banner.jpg
https://cloneB/topic7/img/header.jpg -> http://central/static/topic7/header.jpg
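A rough illustration of what that could look like on the IIS side, assuming the URL Rewrite module is installed (plus Application Request Routing to proxy to another host); the host names and paths are made up:

    <!-- web.config fragment on each clone: proxy local /img/ requests to the
         central content server, so the browser only ever sees same-origin HTTPS URLs. -->
    <system.webServer>
      <rewrite>
        <rules>
          <rule name="ProxyCentralImages" stopProcessing="true">
            <match url="^img/(.*)" />
            <action type="Rewrite" url="http://central.example.com/static/{R:1}" />
          </rule>
        </rules>
      </rewrite>
    </system.webServer>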
I'd go with B.
Sadly, I think you'll find this is a fact of life with SSL. Even if you were to put a cert on the other server, I think the browser may still complain because the content comes from different sites (I can't confirm nor deny that, though), and regardless, you don't want to waste your media server's time encrypting images.
I figured out a completely different way to import the images late last night after asking this question. In IIS, at least, you can set up "Virtual Directories" that can point essentially anywhere (I'm now evaluating whether to use a dedicated directory on each web server or a URL). If I use a dedicated directory on each server I will have three directories to keep up to date, but at least I won't have 70+.
Since each site will pull the images from resource locations on the local site, I don't have to worry about changing the SSL status of any page.