Internal link structure of a wiki / page ranker - MediaWiki

I want to extract the internal link structure of a wiki in the MediaWiki format and apply a page ranker to it. I have an extremely large XML dump, the history.xml file. This includes all the internal links in [[ ]] brackets, but it also contains every revised version of every page. I am wondering if there is a way to extract the internal link structure from it. I know the Wikipedia dumps come with this in a separate file, but I only have the history.xml.

Dump the pagelinks table of the wiki (or of a new wiki you have imported the XML dump into).
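If importing the dump into a fresh MediaWiki just to dump pagelinks is more than you need, you can also approximate the link graph by parsing the dump directly. A rough Python sketch, not a definitive solution, assuming the standard MediaWiki export schema, keeping only the latest revision of each page, and pulling links out with a naive [[...]] regex (namespace, File:, and Category: links are treated like any other link):

```python
import re
import xml.etree.ElementTree as ET
from collections import defaultdict

NS = "{http://www.mediawiki.org/xml/export-0.10/}"  # adjust to the schema version in your dump
LINK_RE = re.compile(r"\[\[([^\]|#]+)")             # target part of [[Target|label]]

def latest_revisions(path):
    """Yield (title, wikitext of the latest revision) for each <page> in the dump."""
    title, latest_ts, latest_text = None, "", ""
    for event, elem in ET.iterparse(path, events=("end",)):
        if elem.tag == NS + "title":
            title = elem.text
        elif elem.tag == NS + "revision":
            ts = elem.findtext(NS + "timestamp") or ""
            if ts >= latest_ts:                     # keep only the newest revision
                latest_ts = ts
                latest_text = elem.findtext(NS + "text") or ""
            elem.clear()                            # keep memory bounded on huge dumps
        elif elem.tag == NS + "page":
            yield title, latest_text
            title, latest_ts, latest_text = None, "", ""
            elem.clear()

def build_graph(path):
    graph = defaultdict(set)
    for title, text in latest_revisions(path):
        for target in LINK_RE.findall(text):
            graph[title].add(target.strip())
    return graph

def pagerank(graph, damping=0.85, iterations=30):
    """Plain power-iteration PageRank; ignores dangling-page mass, fine for a rough ranking."""
    nodes = set(graph) | {t for targets in graph.values() for t in targets}
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new = {n: (1 - damping) / len(nodes) for n in nodes}
        for page, targets in graph.items():
            if targets:
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new[t] += share
        rank = new
    return rank

graph = build_graph("history.xml")
for page, score in sorted(pagerank(graph).items(), key=lambda x: -x[1])[:20]:
    print(score, page)
```

The resulting graph maps page titles to the titles they link to; redirects and red links are not resolved, so treat the ranking as an approximation of what the pagelinks table would give you.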

Related

Convert MediaWiki to Word documents

I am trying to move a MediaWiki site to Confluence v5. Confluence no longer supports using the Universal Wiki Converter (UWC) for version 5. Confluence has advised that they accept "page and page tree imports in the form of Word Documents" or "Confluence Space Exports and Confluence Site Exports".
Therefore I need a way on converting the mediawiki pages into word documents, as well as retaining the structure.
Currently, I have used MediaWiki's dumpBackup.php to create an XML dump of my wiki. I then used mediawikiXML_exporter.php. This created an exported_mediawiki_pages/ directory containing all Pages, Files, Categories, Projects, and Users. I checked, and these files contained MediaWiki versions of all my pages, users, etc., but stored as .txt files.
Given that I cannot use UWC to upload this to Confluence, is there a way to convert these to .docx format so that I can upload them into Confluence?
P.S. I'm well aware I may be going about this the wrong way, so if there is a better way to do this from scratch, I'm open to any solution.
You could try https://www.mediawiki.org/wiki/Extension:Collection to convert it to a PDF document, and then see if you can convert that to Word.
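A different route, not part of the answer above and offered only as a hedged alternative: pandoc has a MediaWiki reader and a docx writer, so the exported wikitext .txt files could be converted directly. A small driver sketch (the directory names are placeholders for whatever the exporter produced):

```python
import subprocess
from pathlib import Path

SRC = Path("exported_mediawiki_pages")   # directory created by the exporter
DST = Path("docx_pages")
DST.mkdir(exist_ok=True)

for txt in SRC.rglob("*.txt"):
    out = DST / (txt.stem + ".docx")
    # pandoc reads MediaWiki markup and writes Word documents
    subprocess.run(
        ["pandoc", "-f", "mediawiki", "-t", "docx", str(txt), "-o", str(out)],
        check=True,
    )
    print("converted", txt, "->", out)
```

Template-heavy pages and infoboxes will likely need manual cleanup afterwards, since templates are not expanded in the exported wikitext.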

Parsing URLs containing specific file types in a MediaWiki dump

I have a large .xml file (about 500 MB) which is a dump of a site based on MediaWiki.
My goal is to find all URL links which contain image filename extensions, then group the links by second-level domain and export a result containing only those links, grouped as described.
Example: there are many links beginning with domain.com/*.png, host.com/*.png and image.com/*.png. Grouping them into separate files, one per second-level domain, each containing that domain's links - that's the final result.
So you want to parse the links in the wikitext. Writing a MediaWiki parser is a pain, so you should use an existing parser.
The easiest way (easiest, but not easy) is probably to import your dump into a MediaWiki install, rebuild some tables if needed, then export the externallinks table.
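If standing up a MediaWiki install just for this is too heavy, a cruder approximation is to scan the raw dump for image URLs with a regex and bucket them by second-level domain. A minimal Python sketch with my own assumptions baked in (http/https links only, a fixed extension list, and a naive "last two labels" notion of second-level domain that gets suffixes like .co.uk wrong):

```python
import re
from collections import defaultdict
from urllib.parse import urlparse

# URLs ending in a common image extension; stops at whitespace and wikitext delimiters
URL_RE = re.compile(r'https?://[^\s<>"\]|}]+\.(?:png|jpe?g|gif|svg)', re.IGNORECASE)

groups = defaultdict(set)
with open("dump.xml", encoding="utf-8") as f:
    for line in f:
        for url in URL_RE.findall(line):
            host = urlparse(url).netloc.lower()
            sld = ".".join(host.split(".")[-2:])   # naive second-level domain
            groups[sld].add(url)

# one output file per second-level domain
for sld, urls in groups.items():
    with open(f"links_{sld}.txt", "w", encoding="utf-8") as out:
        out.write("\n".join(sorted(urls)) + "\n")
```

Unlike the externallinks table, this will also pick up image URLs from old revisions and talk pages, so expect some noise.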

MediaWiki edit history in composite files

I have several MediaWiki files which include a list of templates inside them. If I edit one of these files, I can see the edit history. But if I edit a template file, that edit is not shown in the composite file's history. I know that, logically, the history belongs to the template file, not the main file. But is there a way the edit history of a template can be included in a composite file as well?
Not in MediaWiki, but since each history is available as an RSS feed, you could combine them into a single RSS feed, which would probably solve your needs. The big downside is that you have to create each combination manually. I used rssmix.com to generate this example out of an article's history and a template used on that page: http://www.rssmix.com/u/4247980/rss.xml
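If combining the feeds by hand on rssmix.com gets tedious, the same thing can be scripted: every page's history is exposed as a feed at index.php?title=PAGE&action=history&feed=rss. A small sketch using the third-party feedparser library (the wiki URL and page titles below are placeholders):

```python
import feedparser

# history feeds for the article and for a template it transcludes (placeholder URLs)
FEEDS = [
    "https://wiki.example.org/index.php?title=Main_Page&action=history&feed=rss",
    "https://wiki.example.org/index.php?title=Template:Infobox&action=history&feed=rss",
]

entries = []
for url in FEEDS:
    feed = feedparser.parse(url)
    entries.extend(feed.entries)

# newest edits first, regardless of which page they belong to
# (assumes the feed items carry pubDate, which MediaWiki's history feed does)
entries.sort(key=lambda e: e.published_parsed, reverse=True)

for e in entries:
    print(e.published, e.title, e.link)
```

The merged output is one chronological stream of edits across the article and its templates, which is essentially what a combined history would look like.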

Using a JSON file instead of a database - feasible?

Imagine I've created a new JavaScript framework and want to showcase some examples that use it, and let other people add examples if they want. Crucially, I want this to all be on GitHub.
I imagine I would need to provide a template HTML document which includes the framework, and sorts out all the header and footer correctly. People would then add examples into the examples folder.
However, doing it this way, I would just end up with a long list of HTML files. What would I need to do if I wanted to add some sort of metadata about each example, like tags/author/date etc, which I could then provide search functionality on? If it was just me working on this, I think I would probably set up a database. But because it's a collaboration, this is a bit tricky.
Would it work if each HTML file had a corresponding entry in a JSON file listing all the examples where I could put this metadata? Would I be able to create some basic search functionality using this? Would it be a case of: Step 1 : create new example file, step 2: add reference to file and file metadata to JSON file?
A good example of something similar to what I want is wbond's package manager http://wbond.net/sublime_packages/community
(There is not going to be a lot of create/update/destroy going on - mainly just reading.)
Check out this JavaScript database: http://www.taffydb.com/
There are other JavaScript databases that let you load JSON data and then do database operations. Taffy lets you search for documents.
It sounds like a good idea to me, though - making HTML files and an associated JSON document that holds metadata about them.
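For concreteness, a sketch of what one entry per example in such a JSON index could look like; every field name here is just a suggestion, not a requirement:

```json
[
  {
    "file": "examples/draggable-list.html",
    "title": "Draggable list",
    "author": "janedoe",
    "date": "2013-05-01",
    "tags": ["dnd", "lists"]
  },
  {
    "file": "examples/live-chart.html",
    "title": "Live updating chart",
    "author": "johndoe",
    "date": "2013-05-10",
    "tags": ["charts", "ajax"]
  }
]
```

Contributors would add their HTML file and append one entry like this; the showcase page can then fetch the index and filter it client-side (directly, or loaded into something like TaffyDB) to search by tag, author, or date.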

How to handle uploading HTML content to an App Engine application?

I would like to allow my users to upload HTML content to my App Engine web app. However, if I use the Blobstore to upload all the files (HTML files, CSS files, images, etc.), this causes a problem: all the links to other files (pages, resources) will not work.
I see two possibilities, but neither of them is very pretty and I would like to avoid using them:
Go over all the links in the html files and change them to the relevant blob key.
Save a mapping between a file and a blob key, catch all the redirections and serve the blobs (could cause problems with same name files).
How can I solve this elegantly without having to go over and change my user's files?
Because App Engine runs your application on multiple servers, you are not able to write to the filesystem. What you could do is ask them to upload a zip file containing their HTML, CSS, JS, images, and so on. The zipfile module from Python is available in App Engine, so you can unzip these files and store them individually. This way you know the directory structure of the zip, which allows you to create a mapping of relative paths to the content in the blobstore. I don't have enough experience with zipfile to write a full example here; I hope someone more experienced can edit my answer, or create a new one with an example.
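In that spirit, here is a rough sketch of how the unzip-and-store step could look on the old Python 2.7 runtime; the SiteFile model, the key-naming scheme, and the "small files only" assumption are mine, not part of the original answer:

```python
import mimetypes
import zipfile

from google.appengine.ext import blobstore, ndb


class SiteFile(ndb.Model):
    """One unpacked file, keyed by '<site_id>/<relative path inside the zip>'."""
    content = ndb.BlobProperty()            # fine for files well under the 1 MB entity limit
    content_type = ndb.StringProperty()


def unpack_zip(site_id, zip_blob_key):
    """Unpack a zip that was uploaded to the blobstore into datastore entities."""
    zf = zipfile.ZipFile(blobstore.BlobReader(zip_blob_key))  # BlobReader is file-like
    for name in zf.namelist():
        if name.endswith("/"):              # skip directory entries
            continue
        SiteFile(
            id="%s/%s" % (site_id, name),   # the relative path becomes the key name
            content=zf.read(name),
            content_type=mimetypes.guess_type(name)[0] or "application/octet-stream",
        ).put()

# A request for /sites/<site_id>/<path> can then be answered with a single
# datastore get: SiteFile.get_by_id("%s/%s" % (site_id, path)).
```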
Saving a mapping is the best option here. You'll need to identify a group of files in some way, since multiple users may upload a file with the same name, then associate unique pathnames with each file in that group. You can use key names to make it a simple datastore get to find the blob associated with a given path. No redirects are required - just use the standard Blobstore serving approach of setting the blobstore header to have App Engine serve the blob to the user.
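A sketch of that mapping-with-key-names approach using the old webapp2/Blobstore APIs; the UploadedFile model and the /sites/... URL scheme are illustrative, not prescribed by the answer:

```python
import webapp2
from google.appengine.ext import blobstore, ndb
from google.appengine.ext.webapp import blobstore_handlers


class UploadedFile(ndb.Model):
    """Maps '<group_id>/<path>' (the key name) to the blob that holds that file."""
    blob_key = ndb.BlobKeyProperty()


class ServeFile(blobstore_handlers.BlobstoreDownloadHandler):
    def get(self, group_id, path):
        mapping = UploadedFile.get_by_id("%s/%s" % (group_id, path))
        if mapping is None:
            self.abort(404)
        # App Engine streams the blob itself; no redirect is involved
        self.send_blob(mapping.blob_key)


app = webapp2.WSGIApplication([
    webapp2.Route(r"/sites/<group_id>/<path:.*>", ServeFile),
])
```

Relative links inside the uploaded HTML then resolve naturally, as long as each group's files are served under its own /sites/<group_id>/ prefix.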
Another option is to upload a zip, as Frederik suggests. There's no need to unpack and store the files individually, though - you can serve them directly out of the zip in blobstore, as this demo app does.
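I can't vouch for how the referenced demo app does it, but the "serve straight from the zip" idea can be approximated by opening the blob with BlobReader, which is file-like, and handing it to zipfile. A rough sketch (it reads each requested member fully into memory, so it suits small files):

```python
import mimetypes
import zipfile

import webapp2
from google.appengine.ext import blobstore


class ZipServe(webapp2.RequestHandler):
    def get(self, blob_key, path):
        zf = zipfile.ZipFile(blobstore.BlobReader(blob_key))
        try:
            data = zf.read(path)               # raises KeyError if the member is missing
        except KeyError:
            return self.abort(404)
        self.response.headers["Content-Type"] = (
            mimetypes.guess_type(path)[0] or "application/octet-stream")
        self.response.out.write(data)


app = webapp2.WSGIApplication([
    webapp2.Route(r"/zip/<blob_key>/<path:.*>", ZipServe),
])
```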