How to deal with redirects in a Wikipedia dump? - mysql

I have successfully imported the enwiki-latest-pages-articles-multistream XML dump into MySQL using this guide.
When I look up the text for a page (process described here), it will often be just #REDIRECT [[some_page_name]]. The only way I know of to follow this redirect is to search through all page titles for some_page_name. Not only is this time-consuming, but sometimes there are multiple articles under exactly the same title!
I'm considering just removing all redirect pages from the database.
But before I do, is there a better way to handle these redirects?

As I understand it, you want to determine the target of the redirect, right? If so, you can get it using this query:
select rd_title from redirect
inner join page
on page_id = rd_from
where page_title like "some_page_name"
The rd_title is the target page of the redirect.
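A slightly tightened version of the same idea (a sketch, assuming the standard MediaWiki schema). Note that page_title stores underscores rather than spaces, and that titles are only unique per namespace, which is why identical titles appear more than once:

select rd_namespace, rd_title
from redirect
inner join page on page_id = rd_from
where page_namespace = 0              -- main/article namespace only
  and page_title = 'Some_page_name';

Filtering on page_namespace avoids matching identically titled pages in other namespaces (Talk:, User:, and so on).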
Please correct me if I'm wrong.

Related

How do I direct all traffic or searches going to duplicate (or similar URLs) to one URL on our website?

I'll try to keep this as simple as possible, as I don't quite understand how to frame the question entirely correctly myself.
We have a report on our website indicating duplicate meta titles and descriptions, which look very much (almost exactly) like the following, although I have used an example domain below:
http://example.com/green
https://example.com/green
http://www.example.com/green
https://www.example.com/green
But only one of these actually exists as an HTML file on our server, which is:
https://www.example.com/green
As I understand it, I need to somehow tell Google and other search engines which of these URLs is correct, and this should be done by specifying a 'canonical' link or URL.
My problem is that the canonical reference apparently must be added to any duplicate pages that exist, and not to the actual main canonical page. But we don't actually have any other pages beyond the one mentioned just above, so there is nowhere to set these canonical rel references?
I'm sure there must be a simple explanation for this that I am completely missing?
So it turns out that these duplicate URLs occur because our website is served from a subdomain of our domain. Any traffic that arrives at example.com (the bare domain) needs a permanent (301) redirect to https://www.example.com, by way of a rewrite rule in the .htaccess file.
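For reference, both pieces look roughly like this (a sketch; the host names come from the example above). Since every duplicate URL is served by the same HTML file, a single self-referencing canonical tag in that file's head covers all the variants:

<link rel="canonical" href="https://www.example.com/green" />

And an .htaccess rule that permanently redirects every non-canonical host/scheme combination (assumes Apache with mod_rewrite enabled):

RewriteEngine On
RewriteCond %{HTTP_HOST} !^www\.example\.com$ [NC,OR]
RewriteCond %{HTTPS} off
RewriteRule ^(.*)$ https://www.example.com/$1 [R=301,L]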

Category products not found

WHOOPS, OUR BAD... The page you requested was not found, and we have a fine guess why. If you typed the URL directly, please make sure the spelling is correct. If you clicked on a link to get here, the link is outdated.
What can you do?
Have no fear, help is near! There are many ways you can get back on track with Magento Store.
Go back to the previous page.
Use the search bar at the top of the page to search for your products.
Follow these links to get you back on track!
Store Home | My Account.
I get these errors in Magento. How should I solve this?
Check in the URL rewrite management grid that the created URLs are correct, specifically the 'URL requested' column, because it sometimes appends extensions like .htm or .html, or the URL may contain a special character.
You might be facing this problem for one of the following reasons:
Your category is not active in the admin panel.
If it is active, no products are assigned to it.
If products are assigned, indexing has not been done.
If indexing is done, please ensure you are using the default category URL.
If the category URL is not the default, please make sure the catalog_url_rewrite indexing ran properly.
Working through the cases above should resolve your problem; if reindexing is the culprit, a sketch of the commands follows.
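Assuming a Magento 1.x install (indexer codes vary between versions, so treat these as illustrative), the indexes can be rebuilt from the shell:

php shell/indexer.php info                  # list the available indexer codes
php shell/indexer.php --reindex catalog_url # rebuild only the URL rewrite index
php shell/indexer.php --reindexall          # or rebuild everything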

Web-scraping only a specific domain

I am trying to make a web scraper that, for this example, scrapes news articles from Reuters.com. I want to get the title and date. I know I will ultimately just have to pull the source code from each address and then parse the HTML using something like JSoup.
My question is: how do I ensure I do this for each news article on Reuters.com? How do I know I have hit all the Reuters.com addresses? Are there any APIs that can help me with this?
What you are referring to is called web scraping plus web crawling. What you have to do is visit every link matching some criteria (crawling) and then scrape the content (scraping). I've never used them, but here are two Java frameworks for the job:
http://wiki.apache.org/nutch/NutchTutorial
https://code.google.com/p/crawler4j/
Of course, you will have to use JSoup (or similar) to parse the content after you've collected the URLs.
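Once the crawler has handed you an article URL, the JSoup side is short. A minimal sketch (the URL and the date selector are illustrative; Reuters' markup changes over time, so inspect the page source for the real selectors):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ArticleScraper {
    public static void main(String[] args) throws Exception {
        // Fetch and parse the article page
        Document doc = Jsoup.connect("http://www.reuters.com/article/example")
                .userAgent("Mozilla/5.0")
                .get();

        // The <title> element is always present
        String title = doc.title();

        // Hypothetical selector for the publication date
        Element date = doc.select("span.timestamp").first();

        System.out.println("Title: " + title);
        System.out.println("Date:  " + (date != null ? date.text() : "not found"));
    }
}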
Update
Check out Sending cookies in request with crawler4j? for a better list of crawlers. Nutch is pretty good, but very complicated if the only thing you want is to crawl one site. crawler4j is very simple, but I don't know whether it supports cookies (and if that matters to you, it's a deal breaker).
Try this website: http://scrape4me.com/
I was able to generate this URL for the headline: http://scrape4me.com/api?url=http%3A%2F%2Fwww.reuters.com%2F&head=head&elm=&item[][DIV.topStory]=0&ch=ch

Need to have many different URLs resolve to a single web page

And I don't want to use GET params.
Here are the details:
the user has a bunch of photos, and each photo must be shown by itself and have a unique URL of the form
www.mysite.com/theuser/photoNumber-N
I create a unique URL for a user each time they add a new photo to their gallery
the web page that displays the user's photo is the same code for every user and every photo -- only the photo itself is different.
the user gives a URL to Person-A, but then Person-A has one URL to that one photo and cannot see the user's other photos (because each photo has a unique URL and Person-A was given only one URL for one photo)
I want the following URLS to (somehow) end up loading only one web page with only the photo contents being different:
www.mysite/user-Terry/terryPhoto1
www.mysite/user-Terry/terryPhoto2
www.mysite/user-Jackie/JackiesWeddingPhoto
www.mysite/user-Jackie/JackiesDogPhoto
What I'm trying to avoid is this: having many copies of the same web page on my server, with the only difference being the .jpeg filename.
If I have 200 users and each has 10 photos, and I fulfill my requirement that each photo is on a page by itself with a distinct URL, then right now I've got 2,000 web pages, each displaying a unique photo and taking up space on my web server; every page is identical, redundant, disk-space-wasting HTML code, the only difference being the .jpeg file name of the photo to display.
Is there something I can do to avoid wasting diskspace and still meet my requirement that each photo has a unique URL?
Again I cannot use GET with parameters.
If you are on an Apache server, you can use Apache's mod_rewrite to accomplish just that. While the script you are writing will ultimately still be fetching GET variables (www.mysite.com/photos.php?id=photo-id), mod_rewrite will convert all the URLs served into the format you choose (www.mysite.com/user-name/photo-id).
Some ways you can implement it can be found here and here, while the actual documentation on the Apache module itself can be found here.
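A minimal sketch of such a rule in an .htaccess file (the script name and query parameters are illustrative; the rewrite happens internally, so the visitor never sees them):

RewriteEngine On
# Don't rewrite requests for real files (images, CSS, the script itself)
RewriteCond %{REQUEST_FILENAME} !-f
# Map www.mysite.com/user-name/photo-id onto a single script
RewriteRule ^([^/]+)/([^/]+)/?$ photos.php?user=$1&photo=$2 [L,QSA]

A single photos.php then serves every photo page; only the user and photo parameters differ, so there is exactly one copy of the HTML on disk.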
Go to IIS Manager, open the site hosted in IIS, and add an additional binding for each URL.
This will route all requests to the same location.
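If you prefer the command line, the same binding can be added with appcmd (the site and host names are illustrative; adjust for your setup):

%windir%\system32\inetsrv\appcmd set site /site.name:"MySite" /+bindings.[protocol='http',bindingInformation='*:80:example.com']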

Protect a webpage with a question

I want to share something with a specific group of people.
The way I want to do it is: before the page is loaded, I prompt the viewer with a question. If the answer is right, the page is loaded; if the answer is wrong, the user is sent to a warning page. (I want to avoid a registration process; a specific question is fine.)
But there is a problem with this: every time the page is reloaded, the user has to type the answer again.
Is there any way I can avoid this?
(I assume you don't know how sessions work, since you look new to Stack Overflow.) No: PHP (like other modern server technologies such as ASP) has a session system that remembers each user across requests, even with multiple users online at the same time. The server stores the session variables in files, one per user. See http://ca2.php.net/manual/en/intro.session.php
You might also be interested in using Apache's .htaccess files to control access: http://httpd.apache.org/docs/2.0/howto/auth.html
(for questions about using .htaccess, check ServerFault)
First, I wouldn't recommend your approach for anything more than a trivial scenario. That being said, you would want to write a page that serves as your security page. On postback, validate the answer, set a session variable, and redirect to the protected page. During its load, the protected page should check that same session variable and redirect back to the security page if the user has not answered the security question.
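A minimal sketch of that flow in PHP (the file names, session key, and expected answer are all illustrative):

<?php
// question.php -- shows the question and validates the answer on postback
session_start();
if ($_SERVER['REQUEST_METHOD'] === 'POST') {
    if (isset($_POST['answer']) && trim($_POST['answer']) === 'expected answer') {
        $_SESSION['answered_correctly'] = true;  // remembered across reloads
        header('Location: protected.php');
    } else {
        header('Location: warning.php');
    }
    exit;
}
?>
<form method="post">
    <label>What is the answer? <input name="answer"></label>
    <button type="submit">Submit</button>
</form>

<?php
// protected.php -- the page you want to guard
session_start();
if (empty($_SESSION['answered_correctly'])) {
    header('Location: question.php');  // not answered yet: back to the question
    exit;
}
// ...protected content goes here...

Because the flag lives in the session, the user answers once per browser session rather than on every page load.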