Get categories from Wikipedia:Vital articles - mediawiki

I'm trying to get a "category tree" from Wikipedia for a project I'm working on. The problem is that I only want the more common topics and fields of study, and the larger dumps I've been able to find include far too many peripheral articles.
I recently found the vital articles pages, which seem to be a collection of exactly what I'm looking for. Unfortunately, I don't really know how to extract the information from those pages or how to filter the larger dumps to include only those categories and articles.
To be explicit, my question is: given a vital article level (say level 4), how can I extract the tree of categories and article names for a given list (e.g. People, Arts, Physical sciences) into a CSV or similar file that I can then import into another program? I don't need the actual content of the articles, just the name (and ideally a reference to the article so I can get more information later).
I'm also open to suggestions about how to better accomplish this task.
Thanks!

Have you tried PetScan? It's a Wikimedia-based tool that lets you extract data from pages based on certain conditions.
You can achieve your goal by opening the tool, going to the "Templates&links" tab, and typing the page name in the "Linked from All of these pages:" field, e.g. Wikipedia:Vital_articles/Level/4/History. To add more than one page in the textarea, enter one page name per line.
Finally, press the "Do it!" button and the data will be generated. You can then download the data from the Output tab.
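If you'd rather script it than click through PetScan, here is a minimal sketch of the same "linked from this page" idea using the MediaWiki Action API (action=query with prop=links). The function names, the User-Agent string and the CSV column names are my own assumptions, not anything PetScan or Wikipedia prescribes. Note that this gives a flat list of linked articles per list page; it does not recover the section headings inside a vital articles page.

```python
import csv
import json
import urllib.parse
import urllib.request

API_URL = "https://en.wikipedia.org/w/api.php"


def build_params(title, cont=None):
    """Query parameters asking for article-namespace links on one page."""
    params = {
        "action": "query",
        "format": "json",
        "prop": "links",
        "titles": title,
        "plnamespace": "0",  # main (article) namespace only
        "pllimit": "max",
    }
    params.update(cont or {})  # continuation tokens from a previous response
    return params


def extract_titles(response):
    """Pull the linked article titles out of one decoded API response."""
    titles = []
    for page in response.get("query", {}).get("pages", {}).values():
        for link in page.get("links", []):
            titles.append(link["title"])
    return titles


def fetch_links(title):
    """Follow API continuation until every link on the page is collected."""
    titles, cont = [], None
    while True:
        url = API_URL + "?" + urllib.parse.urlencode(build_params(title, cont))
        # Hypothetical User-Agent; replace it with one identifying your project.
        request = urllib.request.Request(
            url, headers={"User-Agent": "vital-articles-sketch/0.1"}
        )
        with urllib.request.urlopen(request) as resp:
            data = json.load(resp)
        titles.extend(extract_titles(data))
        cont = data.get("continue")
        if not cont:
            return titles


def write_csv(source_page, titles, out):
    """Write one row per article: list page, article name, article URL."""
    writer = csv.writer(out)
    writer.writerow(["source_page", "article", "url"])
    for title in titles:
        slug = urllib.parse.quote(title.replace(" ", "_"))
        writer.writerow([source_page, title, "https://en.wikipedia.org/wiki/" + slug])


# Usage (performs network requests):
#   page = "Wikipedia:Vital_articles/Level/4/History"
#   with open("history.csv", "w", newline="", encoding="utf-8") as f:
#       write_csv(page, fetch_links(page), f)
```

Running it once per list page (People, Arts, Physical sciences and so on) and keeping the source page in the first column gives you a simple two-level tree in CSV form.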

Related

Search HTML Tables on Multiple Pages

Hello Stack Overflow Community!
I am making a directory of many thousands of custom mods for a game using HTML tables. When I started this project, I thought one HTML page would be slow but adequate for the ~4k files I was expecting. As I progressed, I realized there are tens of thousands of files I need to have in these tables, and users need to be able to search through them to find what they are missing to load up a new scenario. Each entry has about 20 text entries and a small image (~3KB). I only need to be able to search through one column.
I'm thinking of dividing the tables across several pages on my website to help loading speeds and improve overall organization. But then a user would have to navigate to each page, and perform a search there. This could take a while and be very cumbersome.
I'm not great at website programming. Can someone advise a way to allow the user to search through several web pages and tables from one location? Ideally this would jump to the location in the table on the new webpage, or maybe highlight the entry like the browser's search function does.
You can see my current setup here : https://www.loco-dat-directory.site/
Hopefully someone can point me in the right direction, as I'm quite confused now :-)
These would be my steps:
Copy all the info into an Excel spreadsheet, convert that to JSON, and load it as a JavaScript array (myarray). Then add an input field and, on click, use an if statement to check whether input == myarray[0].propertyName (repeating the check for every entry in the array).
If you want something more than an exact match, a utility library such as https://lodash.com/ can help; you'd add it to your project.
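The spreadsheet-to-JSON step can be sketched like this, assuming the sheet has been exported as CSV with a header row. The column names "name" and "pack" are made up for illustration, and the search function only mirrors, in Python, the kind of one-column lookup the JavaScript on the page would do.

```python
import csv
import io
import json


def rows_to_json(csv_text):
    """Convert CSV text (first row = headers) to a JSON array of objects."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    return json.dumps(rows, indent=2)


def search(rows, column, term):
    """Case-insensitive substring match on a single column."""
    needle = term.lower()
    return [row for row in rows if needle in row[column].lower()]


# Hypothetical two-row export; a real export would have one row per mod.
sample = "name,pack\nSteam Loco A,Pack 1\nDiesel B,Pack 2\n"
rows = json.loads(rows_to_json(sample))
matches = search(rows, "name", "diesel")
```

The JSON string returned by rows_to_json can be saved to a file and loaded by the page as the array to search.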
Hacky Solution
There is a browser tool, called TableCapture, to capture data from html tables and load into excel/spreadsheets - where you are basically deferring to spreadsheet software to manage the searching.
You would have to see if:
This type of tool would solve your problem - maybe you can pull each HTML page's contents manually, then merge these pages into a document with multiple "sheets", and then let people download the "spreadsheet" from your website.
If you do not take on the labor above and just tell other people to do it, then you'd have to see if you can teach the people how to perform the search and do this method on their own. eg. "download this plugin, use it on these pages, search"
Why your question is difficult to answer
The reason why it will be hard for people to answer you in stackoverflow.com (usually code solutions) is that you need a more complicated solution (in my opinion) than hard coded tables and html/css/javascript.
This type of situation is exactly why people use databases and APIs to accept requests ("term": "something") for information and deliver responses ( "results": [...] ).
Thank you everyone for your great advice. I wasn't aware most of these potential solutions existed, and it was good to see how other people were tackling problems of similar scope.
I've decided to go with DataTables for their built-in sorting and filtering : https://datatables.net/
I'm also going to use a javascript array with an input field on the main page to allow users to search for which pack their mod is in. This will lead them to separate pages on my site, each with a unique datatable for a mod pack. Separate pages will load up much quicker than one gigantic page trying to show everything.

Extracting file/page names under the hierarchy of url

Given a link, how do I extract the file/page names under its hierarchy?
For example in this stackoverflow exchange,
https://stackoverflow.com/questions/
There are many links that go after this.
stackoverflow.com/questions/31236312
stackoverflow.com/questions/31235818
...
Etc
I know "stackoverflow.com/questions/" and wish to find out these numbers, names that go after this.
Is there any way to do this?
The website I am looking into uses CSS, and
it does not allow access to, for example, stackoverflow.com/questions/ (I get Error 403--Forbidden),
but only allows specific pages that sit under it.
These file names consist of a mixture of numbers and letters, e.g. 72304 or A1103457.
There are over 100 files under that hierarchy and I wish to find out all of their names/URLs.
Many thanks in advance.
In short, you can't. There is no way to just grab every page under a given URL/domain path.
In longer... You could use a spider like
https://github.com/mvdbos/php-spider
to follow links and do a breadth-first or depth-first search, looking for all the links it can find under that given URL. It will, however, load every single page it finds, search it for links, and then continue. So it will be very slow on large sites, and it may get your account locked or break the site's terms of service.
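To make the idea concrete, here is a rough sketch of a breadth-first link crawl using only the Python standard library. This is not how php-spider works internally; the function names, the max_pages cap and the example URLs are my own assumptions. It starts from a page you are allowed to load, collects every link that falls under the given prefix, and only follows those links.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin
import urllib.request


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def http_fetch(url):
    """Download one page and return its HTML as text."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")


def crawl(start_url, prefix, fetch=http_fetch, max_pages=200):
    """Breadth-first search for URLs under `prefix`, starting at `start_url`."""
    seen = {start_url}
    queue = deque([start_url])
    discovered = []
    fetched = 0
    while queue and fetched < max_pages:
        url = queue.popleft()
        try:
            html = fetch(url)
        except OSError:
            continue  # e.g. 403 Forbidden: skip pages we cannot load
        fetched += 1
        parser = LinkParser()
        parser.feed(html)
        for href in parser.hrefs:
            # Resolve relative links and drop any #fragment.
            absolute = urldefrag(urljoin(url, href)).url
            if absolute.startswith(prefix) and absolute not in seen:
                seen.add(absolute)
                discovered.append(absolute)
                queue.append(absolute)
    return discovered


# Usage (performs network requests; both URLs are placeholders):
#   crawl("https://example.com/index.html", "https://example.com/files/")
```

It can only find pages that some other reachable page links to, which is exactly the limitation described above, and the max_pages cap is there to keep the load on the site bounded.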

How can I disable semantic notations in text areas in Semantic MediaWiki Forms?

I am working on a user-moderated database and settled on MediaWiki with Semantic MediaWiki as the engine. I installed Semantic Forms to force end users to conform to a certain standard when creating or editing entries. The problem is that a user can add a semantic notation to any form text input, which can throw off the proper structure of the system. For example, if it were an IMDb clone, a user could add [[Directed by::Forrest Gump]], which would then result in the movie "Forrest Gump" showing up in a list of directors.
I doubt that there's any setting that can simply turn this off or on, but I've had one or two ideas as to how to get it working.
One, perhaps there's a way to disable semantic notation on specific namespaces and put the forms on those namespaces. I have a feeling that this will cause the forms to merely break.
Another idea is to modify the code. This is clearly the less ideal approach. To get started, I believe I would need to create some sort of filter on SFTextAreaInput which would disable semantic notations for the user inserted text, but alas I'm unsure as to how to get started on that.
Well, Semantic MediaWiki is still a Wiki. In your classical enterprise database, you restrict the users' input options as a means of ensuring data integrity. That isn't what wikis do; the thinking with a wiki is, yes, the user can enter incorrect information, but another user will amend it and let the first user know what was wrong.
I wouldn't try to coerce SMW into rigid data acquisition. I mean, you do have options such as removing the standard input fields in forms:
'''Free text:'''
{{{standard input|free text|rows=10}}}
If users are selecting a movie page when they should be selecting a director page, then you probably want to encourage correct selection by populating the form control from the Directors category, like:
{{{field|Director|input type=combobox|values from category=Directors}}}
Yes, they can still go very far out of their way to select "Forrest Gump", but if that happens then the fact that someone wilfully circumvented the preselected correct options is a more pressing concern than the fact that the system permits it.
Wikis work best when the system encourages rather than enforces valid knowledge.
My name is Wolfgang Fahl, and I am behind the smartMediaWiki approach. You might want to go the smartMediaWiki route;
see
http://semantic-mediawiki.org/wiki/SMWCon_Spring_2015/smartMediaWiki
For a start don't go just by the property values but e.g. also by a category.
{{#ask: [[Category:Movie]] [[Directed by::+]]
|?Directed by
}}
will only show pages that have both the property set and are in the correct category.
In the smartMediaWiki approach you'd create a topic "Movie" and the entry of movies would be done via Forms. This is an elaboration of the SemanticForms and semantic PageSchemas idea that recently evolved. You can find out more about this at SMWCon Barcelona 2015 this fall.

Mediawiki: I need a Table Of Contents for the entire wiki

I administer my own company internal wiki using MediaWiki. I like MediaWiki because many people are already familiar with it having used Wikipedia. Also, it was a joy to configure and I didn't run into a lot of issues, not being that familiar with PHP. (So I'm not necessarily looking for another solution, like DokuWiki...)
My requirement is that the opening page be a listing of all pages, broken down alphabetically by category - much like a Table of Contents for the entire wiki. It would look like this (on the "Main Page"):
Category 1
Page A
Page B
Page C
Category 2
Page E
Page N
Page X
Page Z
Category 3
Page Q
Page V
Each page gets the category assigned to it. I know about the Special:Categories page, but that only shows the categories, and one must drill down (follow the link) to see the pages within that category - therefore, I cannot see multiple pages/multiple categories.
I have seen Extension:Hierarchy, but this does not fit my needs because the "Table of Contents" has to be edited rather than being auto generated by declaring the "parent" or "category" on each page itself.
Is there already existing functionality for this for MediaWiki? (I understand that as the wiki grows, so too will this Table of Contents page, but that is okay.)
Alternatively, I know about the MediaWiki API. I can create a server-side process that:
Does a MySQL lookup for all pages and their categories
Sorts them
Uses the MediaWiki API to generate this Table of Contents on the Main Page
And I can run this process periodically. I am up for the challenge, because I am a programmer and it is an interesting exercise, but why reinvent the wheel if I don't have to?
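For what it's worth, the sorting and page-generation part of the process I have in mind could be sketched as below. It assumes the database lookup has already produced (category, page) pairs, for example from the page and categorylinks tables (the exact columns depend on the MediaWiki version), and the heading/bullet layout of the generated wikitext is just my own choice. Writing the result back to the Main Page through the API is left out.

```python
from collections import defaultdict


def build_toc(pairs):
    """Turn (category, page) pairs into wikitext grouped and sorted A-Z."""
    by_category = defaultdict(set)
    for category, page in pairs:
        by_category[category].add(page)  # a set drops duplicate rows

    lines = []
    for category in sorted(by_category, key=str.lower):
        # The leading colon links to the category instead of assigning it.
        lines.append(f"== [[:Category:{category}|{category}]] ==")
        for page in sorted(by_category[category], key=str.lower):
            lines.append(f"* [[{page}]]")
    return "\n".join(lines)


# Hypothetical rows, as they might come back from the database lookup.
pairs = [
    ("Category 2", "Page N"),
    ("Category 1", "Page B"),
    ("Category 1", "Page A"),
    ("Category 2", "Page E"),
]
toc_wikitext = build_toc(pairs)
```

A page that belongs to several categories simply appears under each of them, and uncategorized pages never show up because they produce no pairs.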
CategoryTree is an option. Now, a challenge here is that MediaWiki categories are not strictly hierarchical. In other words, you can have category loops (A>B>C>A). Also, one article can show up in any number of categories, and articles can be without categories. The only thing that has to be done manually is to put <categorytree>Category Name</categorytree> for each category on the home "Table of Contents" page. Given that new categories are not likely to pop up often, this should not be a big issue. However, one way around this inconvenience is to put all your (top-level) categories into Category:Categories and then display that category via the extension (see the depth and hideroot parameters).
Hard to use, but wikistats produces an HTML representation from an XML dump, see e.g. MediaWiki.org categories.
CatGraph is another analysis tool, even more complex it would seem (but I've not tried setting it up for a wiki of mine, unlike wikistats).

Random Article button

I'd like to create a button on a menu bar that can generate a link to a random article from my blog posts (much like Wikipedia has). It's for a client, and they'd like to have this functionality on the site. I'm not familiar with PHP, so I'd like to find a way around that, especially since I don't have access to the root user on my server host's MySQL installation (if this is relevant).
I had a theoretical solution: have a .txt or .xml file containing a list of all the URLs to each of the posts, with a "key" assigned to each of them. Then, when the user clicks the random article button, the current time (ex. 1:45) is hashed and mapped to a specific URL. I am fairly new to Drupal, however, I was wondering if there was some way to have the random article button use a .c file to execute these steps. The site is being hosted on a server that uses Apache 2, and I looked through some modules that were implemented in C code. I'm pretty new to all of this (although proficient in C), and spent many fruitless hours searching for solutions.
In a pure Drupal fashion (I don't know if you are interested in this kind of solution), you could create a view (with a block display) that retrieves blog posts, uses a random sort criterion and limits the results to 1 item. Then configure this view to display fields, add only one field, the post title, and check "link to content" in that field's settings window. You'll get one random blog post title, rendered as a link to that blog post.
Finally in Structure->Block assign your new block in a region to see it.
It's a pure Drupal / Views / no-code-just-clicks :) way, but it will be far more maintainable and easier to set up than introducing C for such a simple feature.
Views module
Let me know if you try this and have problems configuring your view or anything else.
Good luck