I'm querying a wiki that has URLs structured like:
wiki.xxxxxxx.com/index.php?title=titleofarticlehere
How could I get a list of all pages (the "titleofarticlehere" part above^)?
What do you want it for? There is a special page, "Special:AllPages", to view all pages in a wiki, or you can get a machine-readable list using the API.
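If you go the API route, the standard list=allpages query returns every title. Here is a minimal sketch in Python; the API URL is a guess based on the question's (redacted) domain, so point it at wherever your wiki's api.php actually lives:

    # A minimal sketch, assuming the wiki's api.php sits next to index.php
    # (adjust API_URL; the domain below is the redacted one from the question).
    import requests

    API_URL = "https://wiki.xxxxxxx.com/api.php"

    def all_page_titles():
        params = {
            "action": "query",
            "list": "allpages",
            "aplimit": "500",
            "format": "json",
        }
        while True:
            data = requests.get(API_URL, params=params).json()
            for page in data["query"]["allpages"]:
                yield page["title"]
            if "continue" in data:
                params.update(data["continue"])  # follow the continuation token
            else:
                break

    for title in all_page_titles():
        print(title)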
I'm nearing the end of my first web development project and I'm looking to build a sitemap for our website as part of Search Engine Optimisation. If I understand correctly, a sitemap, when done properly, is a file that shows a content tree (similar to paths in Windows Explorer) of all the public pages of my website.
For the purpose of my question you're going to need some background information on the site and how it works. The site is about bird migration: a user enters the site on a homepage that holds a search box, searches for a bird species and, if we have data on it, can go to a separate page with information on that bird. From there the user can access statistical data about this species. The page will look something like below, filled with content that we get from a database.
The URL will look something like http://domain.com/searchbird.html?bird=Sedge%20Warbler?lang=1 for the informational page, and http://domain.com/statistics.html?bird=Sedge%20Warbler?lang=1 for the statistical page.
Every bird species uses the same base HTML file (searchbird.html) that is filled with data based on the ?bird= parameter. I have four HTML files in my webroot (let's call them index.html, searchbird.html, statistics.html and about.html).
So when I go to create a sitemap using some sort of sitemap generation tool, I get a sitemap that contains those 4 .html files, which is great! Yet I'm missing the 500 bird species that users are going to be able to find.
Is there a way for me to include every possible URL in the sitemap automatically, and how would I go about doing such a thing? I've used HTML, CSS and JavaScript in the past, but I'm only a beginner. If an executable tool exists for this, that'd be great, but my Google searches haven't been successful yet.
You have to generate the list of URLs for your existing pages.
So dig into your data source (database or whatever you use), find all existing bird species, and generate the two URLs per species.
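A rough sketch of that in Python, assuming a SQLite database with a "birds" table and a "name" column (swap in your real data source), and assuming the second query-string separator should be & rather than the second ? shown in the question:

    # Generates sitemap.xml with the two URLs per species.
    # Table/column names and the database file are assumptions for illustration.
    import sqlite3
    from urllib.parse import quote
    from xml.sax.saxutils import escape

    BASE = "http://domain.com"

    conn = sqlite3.connect("birds.db")
    species = [row[0] for row in conn.execute("SELECT name FROM birds")]

    urls = []
    for name in species:
        urls.append(f"{BASE}/searchbird.html?bird={quote(name)}&lang=1")
        urls.append(f"{BASE}/statistics.html?bird={quote(name)}&lang=1")

    with open("sitemap.xml", "w", encoding="utf-8") as f:
        f.write('<?xml version="1.0" encoding="UTF-8"?>\n')
        f.write('<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n')
        for url in urls:
            f.write(f"  <url><loc>{escape(url)}</loc></url>\n")  # escape & as &amp;
        f.write("</urlset>\n")

Re-run it (or schedule it) whenever species are added, so the sitemap stays in sync with the database.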
Directory for users/bots
It would probably be a good idea (for visitors as well as for bots) to output these links on your website, too. Visitors would have two ways to find a species (search for it or browse the directory), and as most bots don’t use search functions, they wouldn’t be able to find the links on your site otherwise (they would have to use your sitemap, which not all bots do, or they would have to hope to find the links from some other external website).
(If you do this, you could also use a sitemap generator service; but it's usually better to generate it yourself.)
URL design
By the way, you might want to consider changing your URL design to a more human-friendly one. Instead of
http://example.com/searchbird.html?bird=Sedge%20Warbler?lang=1
http://example.com/statistics.html?bird=Sedge%20Warbler?lang=1
you could use something like
http://example.com/en/birds/sedge-warbler
http://example.com/en/birds/sedge-warbler/statistics
where en is the language code for "English" (these are standardized, and users have a chance to understand them, unlike lang=1), and where http://example.com/en/birds could lead to the page listing all species. For other languages, you would of course ideally translate "birds" and "statistics".
Changing the URL design is possible with URL rewriting.
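For example, with Apache and mod_rewrite, a couple of rules along these lines could map the friendly URLs onto the existing files. This is only a sketch: it assumes an Apache .htaccess in the webroot, and your pages would then have to accept the slug form ("sedge-warbler") as the bird parameter, or the rules would need an extra lookup step:

    RewriteEngine On

    # /en/birds/sedge-warbler/statistics -> statistics.html?bird=sedge-warbler&lang=1
    RewriteRule ^en/birds/([^/]+)/statistics$ statistics.html?bird=$1&lang=1 [L,QSA]

    # /en/birds/sedge-warbler -> searchbird.html?bird=sedge-warbler&lang=1
    RewriteRule ^en/birds/([^/]+)$ searchbird.html?bird=$1&lang=1 [L,QSA]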
You can use a sitemap generator, for example https://www.xml-sitemaps.com/. You only need to enter your index URL; that website will crawl all the links and generate the sitemap automatically.
If you use WordPress, you can use a plugin like https://wordpress.org/plugins/google-sitemap-generator/.
Hope that helps.
Is it possible to format a Category page on a wiki (I'm working on a MediaWiki, but I suppose it's the same for all) so it shows all the categories in one column instead of the default three columns?
If not, is there a way to create another kind of page that dynamically updates its content as a Category page does? I couldn't find an example on Wikipedia.
Both Extension:Semantic MediaWiki and Extension:DynamicPageList are pretty potent tools that allow the creation of pages with dynamic lists, tables and so on.
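For example, with the DynamicPageList (Wikimedia) extension installed, something along these lines gives a plain, single-column list of a category's members (the category name is just an example, and parameter names differ a little between the DPL variants, so check the one you have):

    <DynamicPageList>
    category = Birds
    mode     = unordered
    </DynamicPageList>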
MediaWiki has the great {{Special:Recent pages}} template that you can transclude to show just a certain number of pages. However, I'd like a simple list of the latest pages created by users to display on the home page. Is there an easy way to do this? Perhaps something with the dynamic lists extension?
Have you looked at the answers to this question?
Embedding Recent Changes on Main Page on MediaWiki
You have several options:
The Dynamic Article List extension
The News extension
Use the Semantic MediaWiki extension and query your pages using the modification date property (see the sketch below)
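For the Semantic MediaWiki option, an inline query along these lines could go on the home page. It's only a sketch: it relies on SMW's built-in "Modification date" special property, and note that it sorts by last edit rather than strictly by creation date:

    {{#ask: [[Modification date::+]]
     | sort  = Modification date
     | order = descending
     | limit = 10
     | format = ul
    }}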
Is it possible to make JSON data readable by a Google spider?
Say, for instance, that I have a JSON feed that contains the data for an e-commerce site. This JSON data is used to populate a human-readable page in the user's browser. (I.e. the translation from JSON data to the displayed page is done inside the user's browser; not my choice, just what I've been given to work with, since it's an old legacy CGI application rather than an actual server-side scripting language.)
My concern is that the Google spiders will not be able to pick up and link directly to the item in question, so a user who clicks a result in Google would be presented with an index page full of all the items rather than being taken directly to the item they clicked on.
Is there any way of "informing" the Google spider, via the JSON, that it should feed the user a different link?
While Google does crawl and index JavaScript in some circumstances, it's still best to serve "normal" (X)HTML content if at all possible. In this case, it would help to know the rest of the site's setup; in particular: is the JSON content just used to create a feed of links to the product pages (which have static content), or are all product pages also generated from JSON feeds?
If the feed is only used to point to the actual product pages (which are static), then one way to make the product pages discoverable could be to create an HTML sitemap page or some other alternate form of navigation. An XML Sitemap file can also help, but I would recommend not using it as the sole way of making the product pages discoverable.
If all of the content is only accessible through JSON feeds, then I think you will have to make some bigger changes if you want that content to be accessible through search results.
One way to handle it could also be to use the new JavaScript crawling/indexing proposal, which basically would result in a headless browser being set up between your site and Google: http://code.google.com/web/ajaxcrawling/ (whether setting this up or revamping the rest of the site is easier is hard to say :-))
You should make a wrapper page in server-side code around the JSON data, and respond to requests with either the wrapper or the regular version depending on the User-Agent.
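A rough sketch of that idea in Python, written CGI-style since the question mentions a legacy CGI application; the feed URL, field names and bot detection are all assumptions for illustration:

    #!/usr/bin/env python3
    # Minimal sketch of a server-side "wrapper" page: it fetches the same JSON
    # feed the browser would use and renders it as plain HTML for crawlers.
    import json
    import os
    import urllib.request
    from html import escape

    FEED_URL = "http://example.com/cgi-bin/items.json"   # hypothetical feed
    BOT_HINTS = ("googlebot", "bingbot", "crawler", "spider")

    def is_crawler(user_agent):
        ua = user_agent.lower()
        return any(hint in ua for hint in BOT_HINTS)

    def render_html(items):
        rows = "\n".join(
            f'<li><a href="/item/{escape(str(i["id"]))}">{escape(i["name"])}</a></li>'
            for i in items
        )
        return f"<!DOCTYPE html><html><body><ul>\n{rows}\n</ul></body></html>"

    if __name__ == "__main__":
        ua = os.environ.get("HTTP_USER_AGENT", "")
        print("Content-Type: text/html\n")
        if is_crawler(ua):
            with urllib.request.urlopen(FEED_URL) as resp:
                items = json.load(resp)
            print(render_html(items))           # plain HTML rendered on the server
        else:
            print(open("regular.html").read())  # hypothetical JS-driven page for users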
What I mean by autolinking is the process by which wiki links inlined in page content are generated into either a hyperlink to the page (if it does exist) or a create link (if the page doesn't exist).
With the parser I am using, this is a two step process - first, the page content is parsed and all of the links to wiki pages from the source markup are extracted. Then, I feed an array of the existing pages back to the parser, before the final HTML markup is generated.
What is the best way to handle this process? It seems as if I need to keep a cached list of every single page on the site, rather than having to extract the index of page titles each time. Or is it better to check each link separately to see if it exists? This might result in a lot of database lookups if the list wasn't cached. Would this still be viable for a larger wiki site with thousands of pages?
In my own wiki I check all the links (without caching), but my wiki is only used by a few people internally. You should benchmark stuff like this.
In my own wiki system the caching is pretty simple: when a page is updated it checks its links to make sure they are valid and applies the correct formatting/target for those that aren't. The cached page is saved as an HTML page in my cache root.
Pages that are marked as 'not created' during the page update are inserted into a database table that holds the page title and a CSV list of the pages that link to it.
When someone creates that page, it triggers a scan through each linking page, which is re-cached with the correct link and formatting.
If you weren't interested in highlighting non-created pages, however, you could simply check whether a page exists when someone tries to access it and, if not, redirect to the creation page. Then just link to pages as normal in other articles.
I tried to do this once and it was a nightmare! My solution was a nasty loop in a SQL procedure, and I don't recommend it.
One thing that gave me trouble was deciding what link to use on a multi-word phrase. Say you had some text saying "I am using Stack Overflow" and your wiki had 3 pages called "stack", "overflow" and "stack overflow"....which part of your phrase gets linked to where? It will happen!
My idea would be to query the titles with something like SELECT title FROM articles and simply check whether each wikilink is in that array of strings. If it is, you link to the page; if not, you link to the create page.
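A rough sketch of that approach in Python; the [[...]] link syntax, table name and URL scheme are assumptions, but the idea is to load the titles once and check membership in a set rather than hitting the database per link:

    # Load all titles once, then rewrite [[Page]] links to either a view link
    # or a create link depending on whether the page exists.
    import re
    import sqlite3
    from urllib.parse import quote

    conn = sqlite3.connect("wiki.db")  # hypothetical database
    existing = {row[0] for row in conn.execute("SELECT title FROM articles")}

    LINK = re.compile(r"\[\[([^\]|]+)(?:\|([^\]]+))?\]\]")  # [[Title]] or [[Title|label]]

    def autolink(html):
        def replace(match):
            title = match.group(1).strip()
            label = match.group(2) or title
            if title in existing:
                return f'<a href="/wiki/{quote(title)}">{label}</a>'
            return f'<a class="new" href="/create/{quote(title)}">{label}</a>'
        return LINK.sub(replace, html)

    print(autolink("See [[Stack Overflow]] and [[Missing Page|this one]]."))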
In a personal project I made with Sinatra, after I run the content through Markdown I do a gsub to replace wiki words and other things (like [[Here is my link]] and whatnot) with proper links, checking for each whether the page exists and linking to the create or view page accordingly.
It's not the best, but I didn't build this app with caching/speed in mind. It's a low resource simple wiki.
If speed was more important, you could wrap the app in something to cache it. For example, Sinatra can be wrapped with Rack's caching middleware.
Based on my experience developing Juli, which is an offline personal wiki with autolinking, a static-HTML-generation approach may fix your issue.
As you suspect, it takes a long time to generate autolinked wiki pages. However, when generating static HTML, an autolinked page only needs to be regenerated when a wiki page is newly added or deleted (in other words, not when an existing page is merely updated), and the regeneration can run in the background, so it usually doesn't matter how long it takes. Users only ever see the generated static HTML.