Wikipedia revision history - mediawiki

I am trying to get the full revision history of every English Wikipedia article. I just need each editor's name and the size of each edit (in bytes), along with the article title or ID. The Wikipedia dump with the full revision history is a few TB, and my computer cannot handle it. I also tried querying the revision histories through MediaWiki, but it looks like it will take a very long time to get everything. Are there any other approaches I can try to get the information I want? Thanks.

Taking the problem the other way around, maybe you don't need to download all the data.
For example, if you plan to use SQL, you can do it from the servers without downloading anything.
Please take a look at https://quarry.wmflabs.org/ and its documentation.
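If the API route from the question is still needed for a handful of pages, the request and the parsing could look roughly like the sketch below. It assumes the standard MediaWiki Action API endpoint with `prop=revisions` and `formatversion=2`; the helper names `revisions_url` and `rows_from_response` are made up for illustration, and the network call itself is left out.

```python
from urllib.parse import urlencode

# Assumed endpoint for English Wikipedia's Action API.
API_URL = "https://en.wikipedia.org/w/api.php"


def revisions_url(title, cont=None):
    """Build a query URL asking for the editor and size of each revision.

    `cont` is the "continue" object from a previous response, if any;
    passing it back asks the API for the next batch.
    """
    params = {
        "action": "query",
        "format": "json",
        "formatversion": "2",
        "prop": "revisions",
        "titles": title,
        "rvprop": "user|size",
        "rvlimit": "max",
    }
    if cont:
        params.update(cont)
    return API_URL + "?" + urlencode(params)


def rows_from_response(data):
    """Flatten one decoded JSON response into (title, user, size) tuples."""
    rows = []
    for page in data.get("query", {}).get("pages", []):
        for rev in page.get("revisions", []):
            rows.append((page.get("title"), rev.get("user"), rev.get("size")))
    return rows
```

Note that `size` is the size of the page after the revision, so the size of an individual edit would be the difference from the previous revision. Doing this for every article is still slow, which is why running SQL on the servers is attractive.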

Related

Wiki Dump For All Titles In Mainspace Is Way More Than What Wikipedia Reported

I am querying the full revision history of each Wikipedia page. I downloaded the dump containing the list of page titles in the main namespace from https://dumps.wikimedia.org/enwiktionary/20170320/
However, there seem to be more than 12,000,000 titles in the dump I downloaded, which is far more than what Wikipedia reports (https://en.wikipedia.org/wiki/Wikipedia:Size_comparisons). Can anyone tell me what is going on? Am I using the correct dump?
The reason I am asking is that it looks like it will take a few hundred days to get all the revision histories if I query them by article title. So if there is a better way to extract revision histories, that would be very helpful too.
First of all, that is a dump of pages in Wiktionary. Wikipedia's ID is enwiki. However, even with the right dump, making the counts match takes some effort:
Some pages are redirects
Some pages aren't counted as valid content pages and thus are excluded from the official statistics. To be considered valid, a page should contain at least one internal link.
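The two exclusions above can be sketched as a filter. This is only an illustration: the record fields `is_redirect` and `wikitext` are hypothetical, and checking for `[[` is a crude stand-in for "contains at least one internal link", not the exact rule the official statistics use.

```python
def count_content_pages(pages):
    """Count pages that are not redirects and contain an internal link.

    `pages` is an iterable of dicts with the hypothetical keys
    "is_redirect" (bool) and "wikitext" (str).
    """
    return sum(
        1
        for page in pages
        if not page["is_redirect"] and "[[" in page["wikitext"]
    )
```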

What data storage model is used to store articles in Wikipedia?

Articles in Wikipedia get edited. They can grow, shrink, be updated, and so on. What file system or database storage layout is used underneath to support this? In a database course I read a bit about variable-length records, but those seemed intended for small strings rather than whole documents. In a file system, files can grow and shrink, and I think that is done by chaining blocks together, so that an update does not rewrite the whole file. Perhaps something similar is done here.
I am looking for specific names and terminology, maybe even how the schema in MySQL is defined. (I think Wikipedia uses MySQL.)
Below are links to some write-ups on Wikipedia's architecture, but I have not been able to answer my question from them:
http://swe.web.cs.unibo.it/twiki/pub/WikiFactory/AntonelloDiMuroThesis/Wikipedia-cheapandexplosivescalingwithLAMP.pdf
http://dom.as/uc/workbook2007.pdf
Thanks,
See:
http://www.mediawiki.org/wiki/Manual:Database_layout
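As a rough illustration of what that manual describes, here is a heavily simplified sketch using SQLite from Python. The table and column names follow the classic `page` / `revision` / `text` tables, but the real schema has many more columns and has changed between MediaWiki versions, so treat every detail here as an assumption. The point of the sketch is that each edit inserts a new revision row and a new text row instead of modifying stored text in place.

```python
import sqlite3

# Simplified stand-in for the classic MediaWiki tables (assumption, not the real DDL).
SCHEMA = """
CREATE TABLE page (
    page_id     INTEGER PRIMARY KEY,
    page_title  TEXT UNIQUE,
    page_latest INTEGER            -- rev_id of the current revision
);
CREATE TABLE revision (
    rev_id        INTEGER PRIMARY KEY,
    rev_page      INTEGER,         -- page this revision belongs to
    rev_text_id   INTEGER,         -- row in "text" holding the content
    rev_user_text TEXT,
    rev_len       INTEGER          -- size in bytes
);
CREATE TABLE "text" (
    old_id   INTEGER PRIMARY KEY,
    old_text TEXT
);
"""


def open_wiki():
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def save_edit(conn, title, user, wikitext):
    """Store a new revision; earlier revisions are left untouched."""
    conn.execute("INSERT OR IGNORE INTO page (page_title) VALUES (?)", (title,))
    page_id = conn.execute(
        "SELECT page_id FROM page WHERE page_title = ?", (title,)
    ).fetchone()[0]
    text_id = conn.execute(
        'INSERT INTO "text" (old_text) VALUES (?)', (wikitext,)
    ).lastrowid
    rev_id = conn.execute(
        "INSERT INTO revision (rev_page, rev_text_id, rev_user_text, rev_len) "
        "VALUES (?, ?, ?, ?)",
        (page_id, text_id, user, len(wikitext.encode("utf-8"))),
    ).lastrowid
    conn.execute(
        "UPDATE page SET page_latest = ? WHERE page_id = ?", (rev_id, page_id)
    )
    return rev_id


def current_text(conn, title):
    """Follow page -> latest revision -> text to get the current content."""
    row = conn.execute(
        "SELECT old_text FROM page "
        "JOIN revision ON rev_id = page_latest "
        'JOIN "text" ON old_id = rev_text_id '
        "WHERE page_title = ?",
        (title,),
    ).fetchone()
    return row[0] if row else None
```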

How to Determine If Webpage Has Been Modified?

I know I can check the response header's 'last-modified' value to determine when the web page was last modified, but in many instances that header is NOT provided. Also, in many instances the content itself hasn't changed, but the current time/date is displayed on the page, thus giving the appearance of a modification.
Any suggestions on how to overcome the above issues and determine if a web page has been (truly) modified?
Thanks.
Sure. Define for yourself what counts as a "modification" (for example, only things in the "content" div) and only look at that.
If you can't find a way to decide whether something's been changed, then you can't expect a computer to…
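For instance, if "modification" means "the text inside the content div changed", a minimal sketch with Python's standard `html.parser` could look like this. The div id `content` and the helper name `content_text` are assumptions for illustration; real pages will need their own selector.

```python
from html.parser import HTMLParser


class ContentDivText(HTMLParser):
    """Collect the text found inside <div id="..."> and its descendants."""

    def __init__(self, div_id="content"):
        super().__init__()
        self.div_id = div_id
        self.depth = 0      # 0 means we are outside the target div
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self.depth > 0:
            self.depth += 1                      # nested div inside the target
        elif dict(attrs).get("id") == self.div_id:
            self.depth = 1                       # entering the target div

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)


def content_text(html, div_id="content"):
    """Return the whitespace-normalized text of the chosen div."""
    parser = ContentDivText(div_id)
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())
```

Two snapshots of a page then count as "the same" when `content_text(old) == content_text(new)`, whatever the clock in the footer says.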
You are asking two questions here:
When was it modified?
Was it modified?
To answer question #1, you'd have to check the page often enough to meet your granularity requirements, e.g. every hour, every day, or every week. This could be quite resource intensive, so it depends on whether you really need to know this.
To answer question #2, you need to compare something. You could do what @Paul Rosnia suggested, but if they so much as added a comma, the page will be considered modified.
Then, you might also want to see what has been modified. For that you'd have to save the content and compare the versions to each other in order to highlight the changes.
You could use http://php.net/manual/en/function.file-get-contents.php and a cron job to cache the page on your server and then periodically compare against your cache. The comparison will be the tricky part, since you have to write specific code to ignore the things that don't matter to you, e.g. date/time stamps, header changes, menu changes, etc.
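The comparison step could be sketched as follows, here in Python rather than PHP. It assumes the cached and fresh copies are already available as text, and it uses a deliberately crude rule: any line containing something that looks like a time or an ISO date is dropped before diffing. The pattern `VOLATILE` and the helper names are illustrative only; a real page will need its own rules.

```python
import difflib
import re

# Assumption: lines with a clock time (12:01 or 12:01:55) or an ISO date are noise.
VOLATILE = re.compile(r"\b\d{1,2}:\d{2}(?::\d{2})?\b|\b\d{4}-\d{2}-\d{2}\b")


def stable_lines(text):
    """Keep non-empty lines that do not contain a volatile timestamp."""
    return [
        line.strip()
        for line in text.splitlines()
        if line.strip() and not VOLATILE.search(line)
    ]


def changes(cached, fresh):
    """Return unified-diff lines between the two copies; empty if unchanged."""
    return list(
        difflib.unified_diff(
            stable_lines(cached),
            stable_lines(fresh),
            "cached",
            "fresh",
            lineterm="",
        )
    )
```

An empty result means nothing you care about changed; otherwise the lines starting with `+` or `-` show what was added or removed.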
The sure-fire way to detect page changes is to download and checksum it. If the checksum changes, the page has been edited (with extremely high certainty).
Here's an example that works on the command line:
curl -s news.ycombinator.com | md5 #=> d86582bec138c051b0d8322f7823a23c
That was a few minutes ago. If you run it now you'll get a different answer!
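The same idea in a script, as a minimal sketch: it mirrors the MD5 of the command-line example using Python's `hashlib`, and the function names are made up for illustration. `fetch` needs network access and is shown only for completeness.

```python
import hashlib
from urllib.request import urlopen


def page_checksum(body: bytes) -> str:
    """Return the MD5 hex digest of the raw page bytes."""
    return hashlib.md5(body).hexdigest()


def fetch(url: str) -> bytes:
    """Download the page body (requires network access)."""
    with urlopen(url) as response:
        return response.read()


def has_changed(previous_checksum: str, body: bytes) -> bool:
    """Compare a stored checksum with the checksum of a fresh download."""
    return page_checksum(body) != previous_checksum
```

Store the checksum from each run and compare it with the next one. Any byte that differs, including a timestamp in the page, changes the checksum.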

What should I put in header comments at the top of source files?

I've got lots of source code files written in various languages, but none of them have a standard comment at the top (sometimes even across the same project). Some of them don't have any header comment at all :-)
I've been thinking about creating a standard template that I can use at the top of my source files, and was wondering what fields I should include.
I know I want to include my name and a short description of what the file contains/does. Should I also include the date created? The date last modified? The programmer who last modified the file? What other fields have you found to be useful?
Any tips and comments welcome.
Thanks,
Cameron
This seems to be a dying practice.
Some people here on StackOverflow are against code comments altogether (reasoning that code should be written to be self-explanatory). While I wouldn't go that far, some of the points of the anti-comment crowd make sense, such as the fact that comments tend to be out of date.
Header blocks of comments suffer from these symptoms even more. In every organization I've been with that had these header blocks, they were out of date. They have an author name of some guy who doesn't even work there any more, a description that does not match the code at all (assuming it ever did), and a last-modified date that, once compared with version control history, seems to have missed its last dozen updates.
In my personal opinion, keep comments close to the code. If you want to know the purpose or the history of a code file, use your version control system.
Date created, date modified and author who last changed the file should be stored in your source control software.
I usually put:
The main purpose of the file and things within the file.
The project/module the file belongs to.
The license associated with the file (and a LICENSE file in the project root).
Who is responsible for the file (either the team, person, or both)
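As an illustration of those four fields, here is a small sketch that renders such a header as comment lines. The function name `render_header`, the field labels, and the default `# ` prefix are all just one possible convention, not a standard.

```python
def render_header(purpose, project, license_name, owner, comment_prefix="# "):
    """Render a minimal file header with the four fields listed above."""
    lines = [
        f"Purpose: {purpose}",
        f"Project: {project}",
        f"License: {license_name} (see LICENSE in the project root)",
        f"Owner:   {owner}",
    ]
    return "\n".join(comment_prefix + line for line in lines)


if __name__ == "__main__":
    # Example values are placeholders.
    print(render_header("Parse invoice files", "billing/import", "MIT", "Billing team"))
```

Changing `comment_prefix` (for example to `// `) adapts the same header to other languages.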
Back in 2002, when I was straight out of college and jobs were few and far between after the dot-com bust, I joined a service company which used to create software customized for their clients in Java. I had to sit in the office of a client (which was a ramshackle room in an electric sub-station rigged with an AC to keep the servers running), sharing chairs/PCs with other guys in the team. The other engineers (if I can call them engineers ;) in the group used to make changes ad-hoc to the source code, compile the files and put them into production.
No way to figure out who made what change.
No way to figure out why any change was made.
No way to go to previous version of code, unless the engineer "remembered" what he modified.
Backup: Copy over files from the production server, which were replaced with new files.
Location of backup: Home directory of engineer copying over files to production server.
Reports of production servers going down due to botched attempts of copying over files to the server (missed a file to be copied over, backups getting lost or wrong files being copied over or not all files being copied over) were met with shrugs (oh no, is it down? let's see what happened; hey who changed what recently...? ummm...).
During those days, after spending several frustrating days trying to figure out the whos and whys behind the code, I had devised a system for comments in a list in the header of the source file which detailed the following:
Date of change made
Who made the change
Why was the change made
Two months later when the list threatened to challenge the size of the source code in the file, the manager had the bright idea of getting a source version control system.
I have never needed to put any comments in headers of source files (except for copyright notices) in any company I worked since. In my current company, everything else is mostly self-evident by looking at the code, or going to the bug reporting system which is integrated with the source version control system.
What fields do you need? If you have to ask whether to put some info there, you don't really need that info. Unless you are forced, by some bureaucratic incompetence of your employer, I don't see why you should go looking for more info than you already feel should be there.
In most organizations, all source files have to begin with a legal blurb. If you're really lucky, it's just a one-liner, but in most cases it's a really long block of legalese. As a result, few people ever read these. Our eye just travels to the first program element and then goes up to its documentation.
So if you want to write anything, write it in association with the topmost program element, not the file.
Any other bookkeeping information should generally be part of your version control, not maintained (poorly) in the file itself.
In addition to the comment above stating the license, the project the file belongs to, etc., I also tend to put the "weird" requirements at the top (such as "built with version X of library Y"), so that you, or the person who picks it up after you, won't change something that the program relies on without realizing it (or, if they do, they will at least know what to change back).
A lot depends on whether you're using an auto-documentation generation tool or not.
While I agree with many of the comments, if you're using JavaDoc or some other documentation generating tool that depends on comments, you'll obviously need to include the things it wants to see.
You did not mention that you are using a version control system and your comment in Neil N's answer confirms this for your older code. While using version control is the best way to go I also have experienced many situations where the cost of doing so for older code would not be paid for by the project's sponsor. If you do not have a centralized change history for the project then the change history can be put in the modules. It is good that you are using a version control system for your new code.
Your company name
All rights reserved (c) year - or reference to appropriate license
Project or library this file is for
Module it belongs to
Description of what it contains
History
-------
01/08/2010 - Programmer - version
Initial creation.
01/09/2010 - Programmer - version
Change description.
01/10/2010 - Programmer - version
Change description.
Those useful fields that you mentioned are good ones. Who modified the file and when.
Your version control software should allow for the embedding of keywords within comments. For example, in CVS, the $Id$ will resolve to the file, date/time modified, and user that modified the file. It will automatically be kept up to date with each check-in.
Include the following information:
What this file is for. That's a very useful piece of knowledge and it's more important than anything else. You should tell the reader why there is such a file, why you grouped these functions in a separate file/package/module, and why they are used. Maybe briefly, in one or two lines, but that should be there.
Legal stuff, if applicable.
Leave room for special commands for console editors, such as Emacs.
Add special commands that your auto-documenting system requires.
Things you should not include are:
Who created the file
When it was created
Who modified it the last time
When it was last modified
What was added by the latest modification
You can--and should--retrieve this information via the version control system, where it's constantly and automatically kept up to date. Besides, most of these items are simply useless.

Randomizing pages in Wikipedia with MySQL and Perl?

I found a Perl script that randomizes Wikipedia articles here. The code seems to be slightly computer generated. Due to my present interest in MySQL, I thought you could possibly keep the links and related data in a database.
I know that MySQL is good at maintaining relations between tables, while it seems you can easily implement things with Perl. I find the line between their specialties somewhat fuzzy. So:
How can you randomize Wikipedia articles with MySQL and Perl?
If you really want to know how THEY (Wikipedia) do it, have a look at this code directly from MediaWiki:
http://svn.wikimedia.org/svnroot/mediawiki/trunk/phase3/includes/specials/SpecialRandompage.php
It is open source software after all ;), and that's the beauty of it.
Edit: From having a quick glance at the code, I am pretty sure they're using a field called page_random, set at row creation time. Then, since it's an indexed field, ordering by it with limit 1 is instant (with a given random offset, valid for this application, of course).
This is a very standard way to make random access quick, due to ORDER BY RAND() being extremely slow, as I mentioned in the other answer.
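A minimal sketch of that technique, using SQLite from Python instead of MySQL and Perl so it runs anywhere. It is my reading of the approach, not a copy of MediaWiki's code: the table layout, the index name, and the wrap-around fallback are assumptions made for illustration.

```python
import random
import sqlite3


def build_db(titles, rng):
    """Create a page table where each row gets a random value at insert time."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE page ("
        "page_id INTEGER PRIMARY KEY, page_title TEXT, page_random REAL)"
    )
    # The index is what makes the ORDER BY ... LIMIT 1 lookup cheap.
    conn.execute("CREATE INDEX page_random_idx ON page (page_random)")
    conn.executemany(
        "INSERT INTO page (page_title, page_random) VALUES (?, ?)",
        [(title, rng.random()) for title in titles],
    )
    return conn


def random_page(conn, rng):
    """Pick the first page whose page_random is at or above a random offset."""
    offset = rng.random()
    row = conn.execute(
        "SELECT page_title FROM page WHERE page_random >= ? "
        "ORDER BY page_random LIMIT 1",
        (offset,),
    ).fetchone()
    if row is None:
        # Offset landed above every stored value: wrap around to the smallest.
        row = conn.execute(
            "SELECT page_title FROM page ORDER BY page_random LIMIT 1"
        ).fetchone()
    return row[0] if row else None
```

Unlike `ORDER BY RAND()`, this never sorts the whole table; it only walks the index from the chosen offset. The trade-off is that the distribution is only as even as the stored random values.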
Edit #2: I love how clean and proper OOP MediaWiki's code is. Definitely bookmarking it to show PHP newbies what good PHP code looks like (and to remind myself).
SELECT id FROM articles ORDER BY RAND() LIMIT 1
You could, of course, just link to http://en.wikipedia.org/wiki/Special:Random