MediaWiki Special:Export

I've just set up a MediaWiki server. I wanted to export data from Wikipedia, but by default it doesn't allow a pagelink_depth higher than 0. It seems the only way to raise the maximum pagelink_depth is to set up your own MediaWiki and adjust $wgExportMaxLinkDepth. Now I've done all that, but obviously my own MediaWiki has no content. So I was wondering if there is a way to bulk-copy all of Wikipedia onto my own server. From what I've read, this seems doable only about 100 pages at a time. If that's the case, Special:Export would seem to have no purpose at all: you'd need to know exactly which pages you want to import before doing the export, which defeats the point altogether. Any help would be much appreciated.

Special:Export isn't meant for a complete export of a wiki, especially not through the web interface and with so many pages in the database. It's for exporting a known page (or a small set of pages) with all its content so that you can import it into another wiki, e.g. to move a template from one wiki to another. So the Special:Export special page has a valid purpose; you're just trying to use it for a use case it wasn't developed for ;)
If you want to export every page of a MediaWiki wiki, you should use the maintenance script dumpBackup.php (runnable from the command line) or any other backup script in the maintenance folder. That ensures you get what you want.
In the case of Wikipedia you can't run these scripts yourself (I mention them for completeness only), but the Wikimedia Foundation provides database dumps of the Wikimedia wikis, including Wikipedia.
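For example, here is a rough sketch of fetching the latest English Wikipedia articles dump; the URL pattern is the one published on dumps.wikimedia.org, but note the file is tens of gigabytes, so this is illustrative rather than something to run casually:

```python
import shutil
import urllib.request

# Latest English Wikipedia articles dump, as published on dumps.wikimedia.org.
DUMP_URL = ("https://dumps.wikimedia.org/enwiki/latest/"
            "enwiki-latest-pages-articles.xml.bz2")

with urllib.request.urlopen(DUMP_URL) as response, \
        open("enwiki-latest-pages-articles.xml.bz2", "wb") as out:
    shutil.copyfileobj(response, out)

# Then, on your own wiki's server, decompress and import:
#   bunzip2 enwiki-latest-pages-articles.xml.bz2
#   php maintenance/importDump.php < enwiki-latest-pages-articles.xml
```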

"So I was wondering if there was a way to bulk copy all of wikipedia into my own server" I would recommend against this simply on the sheer size of the data & the vast number of open links (or "redlinks" or "bad links") you would be adding if you didn't actually copy it all in. A better approach is to follow all the Wikipedia conventions about page NAMING, to the punctuation mark.. then write a script that checks say once a night whether you have linked to something that is already defined in Wikipedia, and then imports ONLY THAT PAGE and adds a link up top to the EXACT VERSION OF IT that was imported. That way you only bring in what you actually reference, but your database can integrate with Wikipedia's.
This will also come in immensely handy if you have to support multiple languages, such as Spanish or French, since Wikipedia has links to 'the same article in another language', translating at least those concepts for you.
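Here is a minimal sketch of that nightly job in Python, assuming your own wiki's api.php lives at the placeholder URL below. The Wantedpages query page gives you the redlinks, and Wikipedia's Special:Export/<title> URL returns one page as import-ready XML; the output-file convention for feeding importDump.php is just one way to do the import step:

```python
import json
import urllib.parse
import urllib.request

WIKI_API = "https://wiki.example.org/w/api.php"  # placeholder: your own wiki
HEADERS = {"User-Agent": "redlink-importer/0.1 (admin@example.org)"}

def fetch(url):
    req = urllib.request.Request(url, headers=HEADERS)
    with urllib.request.urlopen(req) as resp:
        return resp.read()

def wanted_pages(limit=50):
    """Titles that are linked on your wiki but do not exist yet (redlinks)."""
    params = urllib.parse.urlencode({
        "action": "query", "list": "querypage",
        "qppage": "Wantedpages", "qpplimit": limit, "format": "json",
    })
    data = json.loads(fetch(f"{WIKI_API}?{params}"))
    return [row["title"] for row in data["query"]["querypage"]["results"]]

def export_from_wikipedia(title):
    """Fetch the current revision of one page as MediaWiki export XML."""
    return fetch("https://en.wikipedia.org/wiki/Special:Export/"
                 + urllib.parse.quote(title, safe=""))

# Write each wanted page to a file; feed these to importDump.php afterwards.
for title in wanted_pages():
    with open("import_" + title.replace("/", "_") + ".xml", "wb") as f:
        f.write(export_from_wikipedia(title))
```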

Related

GEDCOM to HTML and RDF

I was wondering if anyone knew of an application that would take a GEDCOM genealogy file and convert it to HTML for viewing and publishing on the web. I'd like separate HTML files for each individual, and perhaps additional files for other content as well. I know there are some tools out there, but I was wondering if anyone has used any of them and could advise. I'm also not sure what form to expect such applications in: they could be Python or PHP files that one can edit, or even JavaScript (maybe), or just executables.
The next issue might deserve a topic of its own: export of GEDCOM to RDF. My interest here would be to align the information with specific vocabularies, such as BIO or REL, which both extend FOAF.
Thanks,
Bruce
Like Rob Kam said, Ged2Html was the most popular such program for a long time.
GRAMPS can also create static HTML sites and has the advantage of being free software and having a native XML format which you could easily modify to fit your needs.
Several years ago, I created a simple Java program to turn GEDCOM into XML. I then used XSLT to generate HTML and RDF. The HTML it generates is pretty rudimentary, so it would probably be better to look elsewhere for that, but the RDF might be useful to you:
http://jay.askren.net/Projects/SemWeb/
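The first step of that pipeline (GEDCOM to XML) is small enough to sketch in Python. This is a toy converter that only understands the level/tag/value line structure, not the full GEDCOM grammar (no CONT/CONC handling, for example):

```python
import xml.etree.ElementTree as ET

def gedcom_to_xml(lines):
    """Nest GEDCOM records into XML elements according to their levels."""
    root = ET.Element("gedcom")
    stack = [(-1, root)]  # (level, element) pairs; root sits below level 0
    for line in lines:
        parts = line.strip().split(" ", 2)
        if len(parts) < 2:
            continue
        level = int(parts[0])
        if parts[1].startswith("@"):  # e.g. "0 @I1@ INDI": xref before tag
            rest = parts[2].split(" ", 1)
            xref, tag = parts[1], rest[0]
            value = rest[1] if len(rest) > 1 else ""
        else:
            xref, tag = "", parts[1]
            value = parts[2] if len(parts) > 2 else ""
        while stack[-1][0] >= level:  # climb back up to this node's parent
            stack.pop()
        elem = ET.SubElement(stack[-1][1], tag)
        if xref:
            elem.set("id", xref.strip("@"))
        if value:
            elem.text = value
        stack.append((level, elem))
    return ET.ElementTree(root)

sample = ["0 @I1@ INDI", "1 NAME John /Smith/", "1 BIRT", "2 DATE 1 JAN 1900"]
gedcom_to_xml(sample).write("out.xml")
```

From that XML, XSLT stylesheets (run with lxml, for instance) can then produce the HTML and RDF views.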
There are a number of these. All listed at http://www.cyndislist.com/gedcom/gedcom-to-web-page-conversion/
Ged2html used to be the most popular and most versatile, but is now no longer being developed. It's an executable, with output customisable through its own scripting syntax.
Family Historian (http://www.family-historian.co.uk) will create exactly what you are looking for, e.g. one file per person, using its built-in web site creator, as will a couple of the other major genealogy packages. I have not seen anything for the RDF part of your question.
I have since tried to produce a genealogy application using Semantic MediaWiki (MediaWiki, the software behind Wikipedia, plus various extensions related to the Semantic Web). I thought it was very easy to use, with the forms and the ability to upload a GEDCOM, but some feedback from people into genealogy was that it appeared too technical and didn't seem to offer anything new.
So now the issue is whether to stay with MediaWiki and make it more user-friendly, or to create an entirely new application that allows adding and updating data in a triple store as well as displaying it. I'm not sure how to generate a graphical family-tree view of the data, like on sites such as ancestry.com, where one can click on a box to see and update details about that person, or click a left or right arrow around a box to navigate the tree. The data would come from SPARQL queries sent to the data set/triple store, both when displaying the initial view and when navigating the tree, where an Ajax call is needed to fetch more data.
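For illustration, here is roughly the lookup the tree view would issue when a box is clicked, sketched with the SPARQLWrapper library; the endpoint URL and person URI are placeholders, and I'm using the REL vocabulary's childOf property as an example of whatever the store actually holds:

```python
from SPARQLWrapper import SPARQLWrapper, JSON

# Placeholder endpoint; point this at your own triple store.
sparql = SPARQLWrapper("http://localhost:3030/genealogy/sparql")

def neighbours(person_uri):
    """Fetch the parents and children of one person for the tree view."""
    sparql.setQuery(f"""
        PREFIX rel: <http://purl.org/vocab/relationship/>
        SELECT ?parent ?child WHERE {{
            OPTIONAL {{ <{person_uri}> rel:childOf ?parent . }}
            OPTIONAL {{ ?child rel:childOf <{person_uri}> . }}
        }}
    """)
    sparql.setReturnFormat(JSON)
    return sparql.query().convert()["results"]["bindings"]

# The Ajax endpoint would serialize these bindings back to the browser.
for row in neighbours("http://example.org/person/I1"):
    print({name: binding["value"] for name, binding in row.items()})
```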
Bruce

Markdown or HTML

I have a requirement for users to create, modify and delete their own articles. I plan on using the WMD editor that SO uses to create the articles.
From what I can gather SO stores the markdown and the HTML. Why does it do this - what is the benefit?
I can't decide whether to store the Markdown, the HTML, or both. If I store both, which one do I retrieve and convert for display to the user?
UPDATE:
OK, I think from the answers so far that I should be storing both the Markdown and the HTML. That seems cool. I have also been reading a blog post from Jeff regarding XSS exploits. Because the WMD editor allows you to input arbitrary HTML, this could cause me some headaches.
The blog post in question is here. I am guessing that I will have to follow the same approach as SO - and sanitize the input on the server side.
Is the sanitize code that SO uses available as Open Source or will I have to start this from scratch?
Any help would be much appreciated.
Thanks
Storing both is extremely useful in terms of performance and compatibility (and, eventually, also social control).
If you store only Markdown (or whatever non-HTML markup), you pay a performance cost by parsing it into HTML on every request. That is not always cheap.
If you store only HTML, you risk bugs silently creeping into the generated HTML, which would lead to a lot of maintenance and bug-fixing headaches. You'd also lose social control, because you would no longer know what the user actually entered; as an admin, for example, you'd like to know which users are trying XSS with <script> and the like. Also, the end user wouldn't be able to edit the data in Markdown format; you'd need to convert it back from HTML.
To keep the HTML current across Markdown converter upgrades, just add one extra field recording the converter version used to generate the HTML output. Whenever that differs from the current version at the moment you retrieve a row, re-parse the Markdown with the new version and update the row in the DB. That's a one-time extra cost per row.
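Here is a sketch of that flow in Python, using the markdown and bleach packages as stand-ins for whichever converter and sanitizer you settle on; the column names and tag whitelist are made up for the example:

```python
import bleach
import markdown

CONVERTER_VERSION = 2  # bump whenever the converter or whitelist changes

ALLOWED_TAGS = ["p", "a", "em", "strong", "ul", "ol", "li",
                "blockquote", "code", "pre"]

def render(md_text):
    """Markdown -> sanitized HTML. Sanitizing AFTER conversion means raw
    HTML typed into the editor also goes through the whitelist."""
    return bleach.clean(markdown.markdown(md_text),
                        tags=ALLOWED_TAGS, strip=True)

def html_for(row):
    """Return display HTML, lazily re-rendering rows with a stale version."""
    if row["converter_version"] != CONVERTER_VERSION:
        row["html"] = render(row["markdown"])
        row["converter_version"] = CONVERTER_VERSION
        # ...persist the updated row back to the DB here...
    return row["html"]

post = {"markdown": "**hi** <script>alert(1)</script>",
        "html": "", "converter_version": 0}
print(html_for(post))  # e.g. <p><strong>hi</strong> alert(1)</p>
```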
By storing both you only have to process the markdown once (when it is posted). You would then retrieve the HTML so that you can load your pages faster.
If you only stored one, you'd forever have to recreate the other for either the display view or the edit view.

What should I put in header comments at the top of source files?

I've got lots of source code files written in various languages, but none of them have a standard comment at the top (sometimes even across the same project). Some of them don't have any header comment at all :-)
I've been thinking about creating a standard template that I can use at the top of my source files, and was wondering what fields I should include.
I know I want to include my name and a short description of what the file contains/does. Should I also include the date created? The date last modified? The programmer who last modified the file? What other fields have you found to be useful?
Any tips and comments welcome.
Thanks,
Cameron
This seems to be a dying practice.
Some people here on Stack Overflow are against code comments altogether, reasoning that code should be written to be self-explanatory. While I wouldn't go that far, some of the anti-comment crowd's points make sense, such as the fact that comments tend to go out of date.
Header blocks of comments suffer from these symptoms even more. At every organization I've been with that used them, they were out of date: an author name of some guy who doesn't even work there any more, a description that doesn't match the code at all (assuming it ever did), and a last-modified date that, once compared with the version control history, seems to have missed its last dozen updates.
In my personal opinion, keep comments close to the code. If you want to know the purpose or history of a code file, use your version control system.
Date created, date modified and author who last changed the file should be stored in your source control software.
I usually put:
The main purpose of the file and things within the file.
The project/module the file belongs to.
The license associated with the file (and a LICENSE file in the project root).
Who is responsible for the file (either the team, the person, or both).
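For what it's worth, a sketch of how those four items might look at the top of a Python module (the project, team, and license names are placeholders):

```python
"""Parsing helpers for incoming invoice files.

Project:  acme-billing
License:  MIT (see LICENSE in the project root)
Contact:  Billing platform team <billing-platform@example.org>
"""
```

Note that nothing in it changes per edit; dates and author lists are left to version control, as other answers here suggest.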
Back in 2002, when I was straight out of college and jobs were few and far between after the dot-com bust, I joined a services company that created customized Java software for its clients. I had to sit in the client's office (a ramshackle room in an electric sub-station, rigged with an AC to keep the servers running), sharing chairs/PCs with the other guys on the team. The other engineers (if I can call them engineers ;) in the group used to make ad hoc changes to the source code, compile the files, and put them into production.
No way to figure out who made what change.
No way to figure out why any change was made.
No way to go to previous version of code, unless the engineer "remembered" what he modified.
Backup: Copy over files from the production server, which were replaced with new files.
Location of backup: Home directory of engineer copying over files to production server.
Reports of production servers going down due to botched attempts at copying files to the server (a file missed, backups lost, wrong files copied, or not all files copied) were met with shrugs (oh no, is it down? let's see what happened; hey, who changed what recently...? ummm...).
During those days, after spending several frustrating days trying to figure out the whos and whys behind the code, I devised a system of comments: a list in the header of the source file that detailed the following:
Date of change made
Who made the change
Why was the change made
Two months later when the list threatened to challenge the size of the source code in the file, the manager had the bright idea of getting a source version control system.
I have never needed to put any comments in headers of source files (except for copyright notices) in any company I worked since. In my current company, everything else is mostly self-evident by looking at the code, or going to the bug reporting system which is integrated with the source version control system.
What fields do you need? If you have to ask whether to put some info there, you don't really need that info. Unless you are forced by some bureaucratic incompetence of your employer, I don't see why you should go looking for more info than you already feel should be there.
In most organizations, all source files have to begin with a legal blurb. If you're really lucky, it's just a one-liner, but in most cases it's a really long block of legalese. As a result, few people ever read these. Our eye just travels to the first program element and then goes up to its documentation.
So if you want to write anything, write it in association with the topmost program element, not the file.
Any other bookkeeping information should generally be part of your version control, not maintained (poorly) in the file itself.
In addition to the answers above about stating the license, the project the file belongs to, etc., I also tend to put any "weird" requirements at the top (such as "built with version X of library Y"), so that you, or the person who picks the file up after you, won't change something the program relies on without realizing it (or, if they do, they will at least know what to change back).
A lot depends on whether you're using an auto-documentation generation tool or not.
While I agree with many of the comments, if you're using JavaDoc or some other documentation generating tool that depends on comments, you'll obviously need to include the things it wants to see.
You did not mention whether you are using a version control system, and your comment on Neil N's answer confirms that you aren't for your older code. Using version control is the best way to go, but I have also seen many situations where the cost of introducing it for older code would not be paid for by the project's sponsor. If you do not have a centralized change history for the project, then the change history can live in the modules. It is good that you are using a version control system for your new code.
Your company name
All rights reserved (c) year - or reference to appropriate license
Project or library this file is for
Module it belongs to
Description of what it contains
History
-------
01/08/2010 - Programmer - version
Initial creation.
01/09/2010 - Programmer - version
Change description.
01/10/2010 - Programmer - version
Change description.
Those fields that you mentioned are good, useful ones: who modified the file and when.
Your version control software should allow for embedding keywords within comments. For example, in CVS, $Id$ resolves to the file name, the date/time of modification, and the user who last modified the file. It is automatically kept up to date with each check-in.
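For example, you commit the literal keyword in a comment, and CVS fills it in at check-in (the expanded line shown here is illustrative):

```python
# $Id$
#
# After a check-in, CVS expands the keyword above to something like:
#   $Id: parser.py,v 1.14 2010/01/09 12:34:56 jsmith Exp $
# i.e. file name, revision, date/time, user, and state, refreshed on
# every commit with no manual upkeep.
```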
Include the following information:
What this file is for. That's the most useful piece of knowledge, more important than anything else. You should tell the reader why there is such a file, why you grouped these functions into a separate file/package/module, and how they are meant to be used. Briefly, maybe one or two lines, but it should be there.
Legal stuff, if applicable.
Leave room for the special commands of console editors, such as Emacs mode lines.
Add any special commands that your auto-documentation system requires.
The things you should not include are:
Who created the file
When it was created
Who modified it the last time
When it was last modified
What was added by the latest modification
You can, and should, retrieve all of that from the version control system, where it is constantly and automatically kept up to date. Besides, most of these points are just useless anyway.

Using Semantic MediaWiki for tabular data

Am I completely off-track to think about using Semantic MediaWiki to store (and organise, report on, etc.) 'tabular' data such as financial transactions or weather readings that would usually live in a spreadsheet or database?
It seems that one would need a separate, tiny page for each tuple; but then, that's by design, and perhaps it's perfectly okay.
I ask, simply because SMW seems like such a quick and easy way to get a collaborative data repository up and running.
Semantic MediaWiki is better suited to keeping track of factual or encyclopedic data, where you can have pages about everything you need to know about a certain topic.
For tabular or numerical data such as measurements or financial or sensor data, you would indeed need to create a little page for each data point, which is not practical in many cases.
However, there are extensions to MediaWiki that let you integrate external data sources (in MySQL databases or CSV files somewhere) with MediaWiki pages. This can give you the best of both worlds: dynamic access and queries over tabular data, plus semantic annotations on the pages around them.
Take a look at:
http://www.mediawiki.org/wiki/Extension:External_Data
No, I don't think it's such a bad idea.
Using SemanticForms you could enter lots of little data pages quickly and easily (for example, an invoice might require additional pages for each line item, but they could all be entered from one form using the 'multiple' feature of the 'for template' form tag). So although I've never tried logging weather data in SMW, I think it would be pretty easy. I don't see what the problem would be with storing data across so many pages; it's easy enough to combine it in whatever formats you require.
Give it a go and let us know how it goes!
You can use either the Semantic Internal Objects extension (SIO) or SMW's built-in subobjects (the former works well with the already-mentioned External Data extension) to store multiple semantic objects, which could be the rows of your spreadsheet, on one page.
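If it helps, here is a rough sketch of pulling those rows back out of the wiki through SMW's 'ask' API module; the wiki URL, category, and property names are placeholders, and the query syntax is the same one used by #ask inline queries:

```python
import json
import urllib.parse
import urllib.request

API = "https://wiki.example.org/w/api.php"  # placeholder: your wiki

# All pages/subobjects in a category, printing out two properties.
ask = "[[Category:Weather reading]]|?Has temperature|?Has date"
params = urllib.parse.urlencode({"action": "ask", "query": ask,
                                 "format": "json"})

with urllib.request.urlopen(f"{API}?{params}") as resp:
    data = json.load(resp)

# Results come back keyed by page (or subobject) name.
for name, result in data["query"]["results"].items():
    print(name, result["printouts"])
```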
However, unless you are really looking for a collaborative tool with semantic capabilities, I doubt SMW is the best suited piece of software for your task.
Edit (November 2015): since SMW version 1.9, there is nothing that SIO can do that the built-in subobjects can't, so I would recommend the latter.

PDF creation - Tags - Authoring

This is so vague it's ridiculous but who knows...
We have got this client who will not budge: they supply PDF files auto-generated by their own software. These files don't import into our (printing) lab management software, which is made by Kodak.
So I emailed Kodak the error log and the relevant files, and got this back:
DP2 supports the importing of PDFs from Adobe Illustrator and Quark Express. Some of the capabilities when importing PDFs as ORDER ITEMS are that the images can be modified, color corrected, or replaced. To accomplish this, the PDF is disassembled. PDFs from Illustrator and Quark contain additional information that tells us where everything goes and how, thus enough information for us to reassemble the PDF. While other applications do generate PDFs, they don't contain this additional information.
After speaking with a third-party 'expert', it seems we need to consider another third-party 'RIP' software package that's fairly expensive. So before I go ahead, I thought I'd ask whether anyone has experience with this stuff?
Cheers
That's a tough one. PDFs can be created in so many different ways that it's hard to tell exactly what any given PDF is composed of. Personally, I'd try some different PDF editors first, to see if you can extract the data you need before going the expensive route.
E.g. Foxit offers a PDF editor (I think it's free, or cheap in any case).
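Also, as a cheap first diagnostic before paying for a RIP, you could open a few of the client's PDFs programmatically and see what they actually contain. Here's a sketch with the pypdf library (pypdf 3+; the filename is a placeholder):

```python
from pypdf import PdfReader

reader = PdfReader("client_file.pdf")
meta = reader.metadata
print("Producer:", meta.producer if meta else "unknown")
print("Pages:", len(reader.pages))

first = reader.pages[0]
# No extractable text usually means the page is one flattened image,
# which would explain why DP2 cannot disassemble it.
print("Has text:", bool((first.extract_text() or "").strip()))
print("Embedded images:", len(first.images))
```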
Darknight