How to compare two MediaWiki sites

We moved a private MediaWiki site to a new server. Some months later we discovered that one or two users had continued to update the old MediaWiki site. So we have some edits in the old server that need to be copied into the new server.
Does anyone know of a routine or process to (conveniently?) compare and identify edits in the old site?
Per the comments attached to this post, the Recent Changes page might work if it accepted a starting date. Unfortunately, it is limited to a maximum of 30 days, and in this case I need to review changes from the last 12 months.

Identify and verify edits done by your users since the fork
Using the database (assuming MySQL) and no table prefixes
Give me all the edits done since Dec 01 2018 (including that date):
SELECT rev_id, rev_page, rev_text_id, rev_comment, rev_user, rev_user_text, rev_timestamp
FROM revision
WHERE rev_timestamp > '20181201';
Note that the actual page text is stored in the text table, and the page name in the page table.
Give me all edits done since Dec 01 2018 (including that date) with page name and revision text:
SELECT rev_id, rev_page, page_namespace, page_title, rev_text_id, rev_comment, rev_user, rev_user_text, rev_timestamp, old_text
FROM revision r
LEFT JOIN page p
ON p.page_id = r.rev_page
LEFT JOIN text t
ON t.old_id = r.rev_text_id
WHERE rev_timestamp > '20181201';
Note that with tools like MySQL Workbench you can copy results as MySQL INSERT statements. Depending on what users did to the old wiki, you might only need to transfer records from these three tables; however, if file uploads, deletions or user-right changes were involved, it gets more complicated. You can track those changes through the logging table.
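For example, in the same spirit as the queries above (a sketch against the same pre-1.35 schema; adjust the column names if your wiki stores comments and actors in separate tables), all log entries since Dec 01 2018:
SELECT log_id, log_type, log_action, log_timestamp, log_user_text, log_namespace, log_title, log_comment
FROM logging
WHERE log_timestamp > '20181201';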
Using the Web Interface
It is of course possible to show more than just 500 changes for the last 30 days. The settings that allow you to configure this are $wgRCLinkLimits and $wgRCLinkDays. You can also just open the recent changes page, click "30 days" and change the URL parameters so the URL becomes path/to/index.php?title=Special:RecentChanges&days=90&limit=1500 (a limit of 1500 changes within the last 90 days).
How long recent-changes history is retained depends on $wgRCMaxAge. The default is 90 days, but you might be in luck if the purge job hasn't deleted older entries yet.
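If you have database access as in the section above, a quick sanity check (a sketch) of what is still retained is to look at the oldest entry in the recentchanges table:
SELECT MIN(rc_timestamp) FROM recentchanges;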
Logs can be viewed without that limitation. Visit Special:Log in your wiki.
Using the API
list=allrevisions lists all page revisions (i.e. changes).
It allows specifying start timestamps (arvstart) and continuation.
Example: https://commons.wikimedia.org/w/api.php?action=query&list=allrevisions&arvlimit=1000
To see deletions, user right changes, uploads, ... use list=logevents.
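For example (illustrative URLs; the parameter names come from the standard action API):
https://commons.wikimedia.org/w/api.php?action=query&list=allrevisions&arvstart=2018-12-01T00:00:00Z&arvdir=newer&arvlimit=500
https://commons.wikimedia.org/w/api.php?action=query&list=logevents&lestart=2018-12-01T00:00:00Z&ledir=newer&lelimit=500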
Fix the issue
Either use database scripts (don't forget to back up first) or use Special:Export on the source wiki and Special:Import on the wiki in need of an update.
Avoid the issue
For a future migration to a new server $wgReadOnly might be your friend, avoiding this issue in the first place by making the old wiki read-only.
There is also Extension:Sync, though I am not sure what it is capable of.

Related

Using a MySQL query and BASH, how can I Delete, Rename, or Move all image files used by Drupal nodes before a certain date?

BACKSTORY IF YOU'RE INTERESTED: A friend of mine owns a magazine and has been publishing a corresponding Drupal 7 website since 2011. The site has thousands of articles and around 50,000 images supporting those articles. Unfortunately, due to copyright trolling attorneys, he's already been hit with a couple of copyright infringement lawsuits over images that he thought were from "creative commons." Since his first lawsuit in 2016, he's made sure all images are from a stock image company. But apparently, very recently, yet another image from before 2016 has caused another copyright troll to seek $18,000 (it's literally a photo of a hotdog by the way). Nevertheless, his business insurance company just wants to pay the settlement fees rather than risk anything in court, but has demanded that all potentially suspect images be deleted from the site going forward. Since 95% of the stories that have been published on his site have had fewer than 1000 views anyway (they are worth less than 50 cents from advertisers), he has agreed to take all those images down because $.50 is definitely not worth the risk of feeding any more trolls.
QUESTION: What's the best way to delete, rename or move all the images that are connected to a story node before a certain date in 2016? It would be nice if we could temporarily just change the filenames on the filesystem from "trollfood.jpg" to "trollfood.jpg.bak" (or something) so that if/when he can ensure an image is in fact in the public domain, he can revive it. It would also be nice if we could replace all the potentially suspect image links (in the db) with placeholder image links for the time being (so that people can still read the articles without wondering where the images have gone...perhaps the placeholder will be a brief explanation of the trolling situation). Anyway, it's been a minute since I've done anything with Drupal, so I've forgotten how Drupal links files to nodes (and he has some custom content types powering his main articles).
I've been able to get all the potentially suspect images in a list via mysql:
SELECT fid, filename, timestamp, from_unixtime(timestamp, "%Y-%m-%e")
FROM drupal7_therooster.file_managed
where timestamp between unix_timestamp('2011-01-01') and unix_timestamp('2017-01-01');
// here's sample output:
# fid filename timestamp from_unixtime(timestamp, "%Y-%m-%e")
6154 _MG_5147.jpg 1373763148 2013-07-14
6155 _MG_5179.jpg 1373763148 2013-07-14
6161 The Lone Bellow (4 of 5).jpg 1373866156 2013-07-15
6162 The Lone Bellow (1 of 5).jpg 1373866156 2013-07-15
Now, how can I use this to find the potentially offending stories that use these images, and perform the following:
Create a list of all the stories that use these images so I can save this in case he ever wants to revive these images. I know SQL well enough...I just don't know which tables keep which data.
Create a query that replaces these image associations in these stories with a placeholder image (so if a story uses "trollfood.jpg", that story now uses "safetyimageplaceholder.jpg" instead). Some stories have multiple images attached to them.
Once all the potentially offending articles reference a placeholder image instead, I still need to move all the offending files so they can't be accessed by lawyers...I have access via ssh by the way. Are there any good ways of using bash commands to ONLY move/rename files that match the list I generate from an SQL query? I just want to be careful not to delete/rename/move any images that were NOT part of the query. Bear in mind the file creation date in the filesystem is all 2017+ on the server because the server was moved (or copied) in 2017 so the file system's original creation dates are inaccurate.
I know this is a long question...and it involves a Drupal site, but I think I might need the help of proper SQL and bash experts, so I've posted it here instead of the Drupal specific stackexchange. I'm totally open to any suggestions if another completely different approach is better suited for this problem. Cheers!
I was able to answer my own question. I had to do three main things:
STEP ONE: Create a query for Drupal's MySQL database that would give me a list of all potentially copyright-infringing files that were being used by nodes created between 2012 and 2017:
SELECT fm.fid, fm.filename,
n.title, n.nid, from_unixtime(n.created, "%Y-%m-%d") as 'node_date'
FROM file_managed fm
JOIN file_usage fu ON fm.fid = fu.fid
JOIN node n ON fu.id = n.nid
WHERE created BETWEEN unix_timestamp('2012-01-01') AND unix_timestamp('2017-01-01')
ORDER BY node_date
This is a moderately complex query, but basically it joins columns from three tables (Drupal 7's file_managed, node, and file_usage tables). The file_usage table is a shared key register of which files (via fid) are used on which nodes (via nid).
STEP TWO: Organize and filter the data to create a list of files.
I filtered and ordered the results by node creation date. I got about 48K records from the join query in step one, and then I created a Google spreadsheet to clean up and sort the data. Here's a sample of the Google spreadsheet. This sheet also includes data from the node_counter table, which tracks page views for each node. Using a simple VLOOKUP function to match the total page views for each nid on the main sheet, the main sheet can now be sorted by page views. I did this so I could prioritize which images attached to each node/article I should check first. This is the SQL query I used to get that data from the db, by the way:
SELECT nid, totalcount, daycount, from_unixtime(timestamp, "%Y-%m-%d") as 'date'
FROM node_counter
ORDER BY totalcount DESC
STEP THREE: Write a Shell Script that will take our filtered list of files, and move them somewhere safe (and off the public webserver).
Basically, I needed a simple BASH script that would use the list of files from step two to move them off the web server. Bear in mind, when each image file is uploaded to the server, Drupal can (and did) create about a dozen different aspect ratios and sizes, and place each one of these copies into corresponding folders. For example, one image filename could be copied and resized into:
files/coolimage.jpg
files/large/coolimage.jpg
files/hero/coolimage.jpg
files/thumbnails/coolimage.jpg
files/*/coolimage.jpg etc etc.
So, I have to take a list of ~50K filenames, and check for those filenames in a dozen different subfolders, and if they are present in a subfolder, move each of them to an archived folder all while preserving the folder/file tree structure and leaving behind files that are "safe" to keep on the public web server. So...I ended up writing THIS simple script and open sourced it on Github in case anyone else might benefit from it.
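For reference, here is a minimal sketch of that kind of script (not the published one; it assumes a filenames.txt exported from the query above with one bare filename per line, and illustrative files/ and archived_files/ paths):
#!/usr/bin/env bash
# Move every derivative copy of each listed filename out of the Drupal files
# tree, preserving the directory structure so files can be restored later.
# (Slow but simple: one find per filename.)
FILES_ROOT="files"
ARCHIVE_ROOT="archived_files"

while IFS= read -r name; do
  [ -z "$name" ] && continue
  # matches files/coolimage.jpg, files/large/coolimage.jpg, files/hero/coolimage.jpg, ...
  find "$FILES_ROOT" -type f -name "$name" -print0 |
    while IFS= read -r -d '' path; do
      dest="$ARCHIVE_ROOT/${path#"$FILES_ROOT"/}"
      mkdir -p "$(dirname "$dest")"
      mv -n "$path" "$dest"   # -n: never overwrite an already archived copy
    done
done < filenames.txt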
That's it! Thankfully I knew some SQL and how to use Google spreadsheets...and some basic bash...and, well, how to use Google and solve problems. If Google users are able to find this helpful in the future...cheers!

How to keep updates (diffs) of some entity in the database

What is the best way to keep updates (diffs) of some entity in the database? Here at Stack Overflow, we can edit questions and answers, and then we can look at any revision of a question or answer we want. For example: revisions of some random question. Maybe someone knows how it is implemented at Stack Overflow?
To be clear, in my case I have some entity (an article) with some fields (name, description, content). Many users can edit the same article. I want to keep a history of the article's updates (something like version control), and I want to keep only diffs, not the whole content of the updated article. By the way, I use PostgreSQL, but I can migrate to any other database.
UPD
Open bounty, so here are some requirements. You don't need to satisfy them fully, but if you do, it will be much better. Nevertheless, any answer is much appreciated. So I want to have the ability:
to keep only diffs, so as not to waste space for no purpose.
to fetch any revision (version) of some article. Fetching the last revision of the article must be really quick; fetching speed for other revisions is not so important.
to fetch any diff (and the list of diffs) of some article. An article can have changes in several fields: header, description or content (like Stack Overflow has changes in header and content), so that must be taken into account.
In the past, I have used diff-match-patch with excellent (and fast) results. It is available for several languages (my experience with it was in C#). I did not use it for exactly the process you are describing (we were interested in merging), but it seems to me you could:
Save the initial version of an article's text/header/whatever.
When a change is made, use diff-match-patch to compute a patch between the newly edited version and what is already in the database. To get the latest version in the database, simply apply any patches that have already been generated to the original article in order.
Save the newly generated patch.
If you wanted to speed things up even more, you could cache the latest version of the article in its own row/table/however-you-organize-things so that getting the latest version is a simple SELECT. This way, you'd have the initial version, the list of patches, and the current version, giving you some flexibility and speed.
Since you have a set of patches in sequence, fetching any version of the article would simply be a matter of applying patches up to the one desired.
You can take a look at the patch demo to see what its patches look like and get an idea of how big they are.
Like I said, I have not used it for exactly this scenario, but diff-match-patch has been designed for doing more or less exactly what you are talking about. This library is on my short list of software I can use when I have no restrictions on libraries developed out-of-house.
Update: Some example pseudocode
As an example, you could set up your tables like so (this assumes a few other tables, like Authors):
Articles
--------
id
authorId
title
content
timestamp
ArticlePatches
--------------
id
articleId
patchText
timestamp
CurrentArticleContents
----------------------
id
articleId
content
Then some basic CRUD could look like:
Insert new article:
INSERT INTO Articles (authorId, title, content, timestamp)
VALUES(#authorId, #title, #content, GETDATE())
INSERT INTO CurrentArticleContents(articleId, content)
VALUES(SCOPE_IDENTITY(),#content)
GO
Get all articles with latest content for each:
SELECT
a.id,
a.authorId,
a.title,
cac.content,
a.timestamp AS originalPubDate
FROM Articles a
INNER JOIN CurrentArticleContents cac
ON a.id = cac.articleId
Update an article's content:
//this would have to be done programmatically
currentContent =
(SELECT content
FROM CurrentArticleContents
WHERE articleId = #articleId)
//using the diff-match-patch API
patches = patch_make(currentContent, newContent);
patchText = patch_toText(patches);
//setting #patchText = patchText and #newContent = newContent:
(INSERT INTO ArticlePatches(articleId, patchText, timestamp)
VALUES(#articleId, #patchText, GETDATE())
UPDATE CurrentArticleContents
SET content = #newContent
WHERE articleId = #articleId
GO)
Get the article at a particular point in time:
//again, programmatically
originalContent = (SELECT content FROM Articles WHERE id = #articleId)
patchTexts =
(SELECT patchText
FROM ArticlePatches
WHERE articleId = #articleId
AND timestamp <= #selectedDate
ORDER BY timestamp ASC)
content = originalContent
foreach(patchText in patchTexts)
{
//more diff-match-patch API
patches = patch_fromText(patchText)
content = patch_apply(patches, content)[0]
}
I had a similar issue at my workplace.
I implemented an AFTER UPDATE trigger to record all the needed data to another table (where you can, of course, save only the fields that changed); the current values stay in the real table, while the log lives in the other table.
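A minimal sketch of that trigger approach in PostgreSQL (which the question mentions), with illustrative table and column names:
CREATE TABLE articles_history (
    id          BIGSERIAL PRIMARY KEY,
    article_id  INT NOT NULL,
    old_name    TEXT,
    old_content TEXT,
    changed_at  TIMESTAMPTZ NOT NULL DEFAULT now()
);

CREATE OR REPLACE FUNCTION log_article_update() RETURNS trigger AS $$
BEGIN
    -- store only the fields that actually changed; unchanged fields stay NULL
    INSERT INTO articles_history (article_id, old_name, old_content)
    VALUES (
        OLD.id,
        CASE WHEN NEW.name    IS DISTINCT FROM OLD.name    THEN OLD.name    END,
        CASE WHEN NEW.content IS DISTINCT FROM OLD.content THEN OLD.content END
    );
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER articles_after_update
AFTER UPDATE ON articles
FOR EACH ROW EXECUTE FUNCTION log_article_update();   -- EXECUTE PROCEDURE on PostgreSQL < 11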
OK, first, @PaulGriffin's answer is complete, and @VladimirBaranov's also got me thinking about the optimal way to do updates. I've got a tasty way to handle the use case of frequent updates and less frequent reads: lazy full-record updates.
For example, editing a document online, possibly from different devices. We optimize for light network traffic and less frequent large DB record updates, and the client doesn't need the ability to go to specific versions (it always gets the latest).
The database has a collection of deltas, indexed by version (which could be a timestamp), and a lastDocument with a lastDocUpdate (version/timestamp).
Frequent use case: Client Edits
Send only the delta to the Server & update database with delta
Use Case: Client with old document version requests updates
Send all deltas since current client version
Less Frequent use case: New device, no prior data on client
On Server, look at lastDocument, apply deltas since lastDocUpdate
Save updated lastDocument in db, and send to client
The most expensive operation is updating the full document in the database, but it's only done when necessary, i.e. when a client has no version of the document.
That rare action is what actually triggers the full document update.
This setup has no extra DB writes, minimal data sent to the client (which updates its doc with the deltas), and the large text field is updated only when we are already applying the deltas on the server and the full document is needed by a client.
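A rough sketch of the two pieces of storage this implies (illustrative names, generic SQL):
CREATE TABLE document_deltas (
    doc_id  INT    NOT NULL,
    version BIGINT NOT NULL,      -- could be a timestamp
    delta   TEXT   NOT NULL,      -- the patch sent by the client
    PRIMARY KEY (doc_id, version)
);

CREATE TABLE documents (
    doc_id          INT PRIMARY KEY,
    last_document   TEXT,         -- lazily materialized full text
    last_doc_update BIGINT        -- version/timestamp last_document reflects
);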

Trying to delete several thousand users from database

I checked one of my Joomla! websites this evening and to my horror found that I had thousands of spam registrations. I can't bring up all the users on one page on the website because it crashes; it's obviously too much for the server.
Even if I display 100 users per page, I've got 500 pages; it will take me until next week to delete them. So I thought maybe I could do it from the database. The same thing happened: if I have 30 users showing, there are over 1,000 pages. So I changed the setting to show 1,000 users, but I wasn't able to delete them because the page just crashed again.
So I'm thinking that maybe I can back up my own account from the user table. However, do I have to create another user table in order to reinstall my account? I hope you understand my dilemma.
What I might do is go to phpMyAdmin and export any data you want to keep, even one row at a time.
Then empty the table (i.e. delete all the rows).
Then import all of the data you exported back into the empty table.
If it's just the one record you want to keep, @Sparkup's answer will be quicker, though.
Were you using a user profile plugin? If so you'll want to delete any records there also.
Then, at a minimum, enable reCAPTCHA; and if you don't really want user registration, turn it off in the Global Configuration.
If you want to delete every user except your own you could do :
DELETE FROM users WHERE email != 'your_email';
Please note this will delete every other account
Be sure to make a backup of your database first.
If you want to remove emails with a certain extension :
DELETE FROM users WHERE email LIKE '%.co.uk';
DELETE FROM users WHERE email LIKE '%gmail.com';
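If the standard Joomla tables are in play, the group-mapping and profile tables presumably need the same cleanup (a sketch; jos_ stands in for your actual table prefix, and again back up first):
DELETE m FROM jos_user_usergroup_map m
  JOIN jos_users u ON u.id = m.user_id
 WHERE u.email != 'your_email';

DELETE p FROM jos_user_profiles p
  JOIN jos_users u ON u.id = p.user_id
 WHERE u.email != 'your_email';

DELETE FROM jos_users WHERE email != 'your_email';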
A MySQL DELETE would be a good choice. Access the Joomla database from a terminal (Linux) or cmd (Windows); that would be fast. Using a CAPTCHA might also be useful to stop spamming to a certain extent.

How to store / retrieve large amounts of data sets within XPages?

Currently we're working on a solution where we want to track (for analysis) the articles a user clicks on/opens and 'likes' from a given list of articles. Subsequently, the user needs to be able to see and re-click/open the articles (searching is not needed) in a section of his/her personal user profile. Somewhere around 100 new articles are posted every day. The increasing(!) number of daily visitors (users) is around 2,000 a day. The articles are currently stored and maintained within a MySQL DB.
We could create a new record in the MySQL DB for every article read / 'liked'. 'Worst case', this would create (2,500 * 100 =) 250,000 records a day. That won't hold long, of course… So how would you store (and process) this within XPages, given the scenario?
My thoughts after reading "the article" :) about MIME/Beans: what about keeping 'read articleObjects' in a scope and (periodically) storing/saving them as MIME on the user profile document? This only creates 100 articleObjects a day (or 36,500 a year). In addition, one could come up with a mechanism where articleObjects are shifted from one field to another as time passes, so the active scope would only contain the 'read articleObjects' from the last month or so.
I would say that this is exactly what a relational database is for. My first approach would be to have a managed bean (session scope) to read/access the user's data in MySQL (JDBC). If you want, you can build an internal cache inside the bean.
For the presented use case, I would not bother with the JDBC datasources in ExtLib. Perhaps even the @Jdbc functions would suffice.
Also, you did not say how you are doing the analysis. If you store the information in Domino, you'll probably have to write an export tool.
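For what it's worth, a sketch of the kind of tracking table this implies on the MySQL side (illustrative names); the profile section then only needs a single indexed query:
CREATE TABLE article_activity (
    user_id    INT NOT NULL,
    article_id INT NOT NULL,
    action     ENUM('read','like') NOT NULL,
    created_at DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (user_id, article_id, action),
    KEY idx_user_recent (user_id, created_at)
);

-- the personal profile section only needs the current user's rows:
SELECT article_id, action, created_at
FROM article_activity
WHERE user_id = 42              -- illustrative user id
ORDER BY created_at DESC;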

Handling versions

Currently I'm working on a website for a client who deals in software plugins. The client needs to be able to upload Products, and for these products different Versions, Updates and Patches. For example: veProduct v1.1.4 is Version 1, Update 1 and Patch 4 of the Product veProduct. Customers need to be able to buy a License for a Version. The License is a file that's generated per user, which needs to be available for download. Currently I'm designing the database for the website, and I've run into a problem. How should I handle the different patches and versions?
My current Design:
In this design I created a table Product, which contains the information of the product itself, like the name and the description. Of the Product there can be multiple versions, which require different licenses, so I created another table called Version. In this table there will be a download-link, a changelog and the pricing of this version of the product. Also a discount price for customers who own the license of an older version, but that's not important. After this I created the User table, so that I could link the user and the version in the table License. In this table you can also find the download link to the license file.
After this, my trouble starts. I created the table Update, which is for example v2.3. This means its version_id is 2 and its update_id is 3. Then I created the table Patch, which could be for example v2.3.1, where version_id is 2, update_id is 3 and patch_id is 1. Until now there's no problem, but there's one big flaw in this design: when I want to upload a patch for a version that has no updates yet, like v1.0.1, I have to create a record in the table Update with update_id 0 and no download link or changelog. I don't want to create a record with no purpose other than to be a patch's 'parent'.
Fixing my problem:
I find it hard to think of a solution, so I ask your help.
Would it be a good idea to get rid of the download-link fields in the Version and Update tables and rely on the ones in the Patch table? This way I would always create another patch, even if I only upload a newer version or a completely new product. It just feels wrong to have the Update table only for being a patch's parent.
So, can anyone help me figure this one out? How should I store new products, versions, updates and patches in my database. Where should I store my download-links? And last, but not least, how can I keep my version linked to the customer, and to the download-link for the license? (So every user gets his own license per version of a product, not per product, update or patch)
Sincerely,
Scuba Kay
PS: It really sucks not being able to post pictures if you're a new user, 'cause I'd like to post a screenshot of my MySQL Workbench instead of a bulk of code.
Instead of Update and Patch, why not just have a Release table that looks like this:
release_id,
version_id, --foreign key
type, -- this is either UPDATE or PATCH
release_name, -- this is the full version number, 1.0.1 for instance
changelog,
download_link
This would allow you to keep Version so you can associate it with a customer for licensing purposes, but would let you be flexible about what you create first (updates or patches) without having to insert dummy rows.
BTW: In this model, I'd remove the download link from Version so it only exists at the release level (essentially release 1.0.0).
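For illustration, a hedged DDL sketch of that structure (MySQL syntax; table and column names are only an example of the model described above):
CREATE TABLE product (
    product_id  INT PRIMARY KEY AUTO_INCREMENT,
    name        VARCHAR(255) NOT NULL,
    description TEXT
);

CREATE TABLE version (
    version_id INT PRIMARY KEY AUTO_INCREMENT,
    product_id INT NOT NULL,
    price      DECIMAL(10,2),
    FOREIGN KEY (product_id) REFERENCES product(product_id)
);

CREATE TABLE `release` (
    release_id    INT PRIMARY KEY AUTO_INCREMENT,
    version_id    INT NOT NULL,
    type          ENUM('UPDATE','PATCH') NOT NULL,
    release_name  VARCHAR(32) NOT NULL,   -- e.g. '1.0.1'
    changelog     TEXT,
    download_link VARCHAR(255),
    FOREIGN KEY (version_id) REFERENCES version(version_id)
);

CREATE TABLE license (
    license_id    INT PRIMARY KEY AUTO_INCREMENT,
    user_id       INT NOT NULL,            -- references your User table
    version_id    INT NOT NULL,            -- a license is per version, as required
    download_link VARCHAR(255),
    FOREIGN KEY (version_id) REFERENCES version(version_id)
);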