We have a hosted MediaWiki installation to which we only have limited access: I can change files on the server and query the database, but I do not have a console for executing scripts.
Now I need to get a list of pages changed in the last six months. These are only partly available via recent changes, since, as I understand it, entries older than a certain time range are purged from the database.
So how can I either select the changes via SQL, or use an extension that can list them without relying on the recentchanges table?
You should be able to use the revision table for this; it contains every (non-deleted) revision ever made to the wiki.
For those who are interested in how to use the revision and page tables as suggested by svick, here is a statement to start with:
SELECT DATE_FORMAT(r.rev_timestamp, '%d.%m.%y'), CAST(p.page_title AS CHAR)
FROM revision r
JOIN page p ON (r.rev_page = p.page_id)
ORDER BY r.rev_timestamp DESC
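To restrict this to the last six months, you can filter on rev_timestamp, which MediaWiki stores as a 14-character YYYYMMDDHHMMSS string; a minimal sketch, assuming MySQL:
SELECT DATE_FORMAT(r.rev_timestamp, '%d.%m.%y'), CAST(p.page_title AS CHAR)
FROM revision r
JOIN page p ON (r.rev_page = p.page_id)
-- rev_timestamp is compared as a string, so format the cutoff the same way
WHERE r.rev_timestamp >= DATE_FORMAT(DATE_SUB(NOW(), INTERVAL 6 MONTH), '%Y%m%d%H%i%s')
ORDER BY r.rev_timestamp DESC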
I am collecting SSH brute force data, and store it in a table called "attempts". I want to GROUP BY each IP address with the associated location data and country name, but I cannot do this with only_full_group_by enabled. I disabled it and my query works fine, but I have two questions:
What are the consequences of disabling only_full_group_by? I assume it is enabled by default for a reason; why is that? I can see issues if the same IP address had different location data for each record, but is that the only scenario where things go wrong?
If I wanted to accomplish the query without disabling only_full_group_by, what would that query look like?
My code:
SELECT latitude,longitude,country_name,foreign_ip,g.count as counter
FROM attempts
LEFT OUTER JOIN (
SELECT COUNT(foreign_ip) as count,foreign_ip as fi
FROM attempts
GROUP BY foreign_ip
) as g on attempts.foreign_ip = g.fi
GROUP BY foreign_ip;
I assume it is enabled by default for a reason; why is that?
The only_full_group_by mode prevents using GROUP BY in ways that are incorrect. Having it on by default is a good thing.
You can write the query without the subquery:
SELECT latitude,longitude,country_name, foreign_ip,count(*) as counter
FROM attempts
GROUP BY latitude,longitude,country_name, foreign_ip
MySQL has historically been very tolerant of user errors. That often leads to additional work (e.g. first you store dates and then you need to filter out invalid ones on every select) and to data loss (your column is too short for the data and you end up with truncated values). They are trying to fix that, but they cannot break a million apps that rely on the tolerant behaviour, so the solution has been to add optional SQL modes.
If all columns within groups have the same values, you're right, nothing will break. The problem is when that isn't true. MySQL will not warn you and, instead, will just retrieve an arbitrary (not even random) row per group.
Your current query can be easily fixed to work in either mode:
SELECT latitude,longitude,country_name,foreign_ip,g.count as counter
FROM attempts
LEFT OUTER JOIN (
SELECT COUNT(foreign_ip) as count,foreign_ip as fi
FROM attempts
GROUP BY foreign_ip
) as g on attempts.foreign_ip = g.fi
GROUP BY latitude,longitude,country_name,foreign_ip,g.count
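If you deliberately want MySQL's old pick-any-row behaviour for the location columns while keeping only_full_group_by enabled, MySQL 5.7+ also offers ANY_VALUE(); a sketch against your attempts table:
-- ANY_VALUE() states explicitly that an arbitrary value per group is acceptable
SELECT ANY_VALUE(latitude) AS latitude,
       ANY_VALUE(longitude) AS longitude,
       ANY_VALUE(country_name) AS country_name,
       foreign_ip,
       COUNT(*) AS counter
FROM attempts
GROUP BY foreign_ip;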
We moved a private MediaWiki site to a new server. Some months later we discovered that one or two users had continued to update the old MediaWiki site. So we have some edits in the old server that need to be copied into the new server.
Does anyone know of a routine or process to (conveniently?) compare and identify edits in the old site?
Per the comments attached to this post, the Recent Changes page might work if that page accepted a starting date. Unfortunately, it is limited to a max of 30 days. In this case, I need to review changes for 12 months.
Identify edits done
Identify and verify edits done by your users since the fork
Using the database (assuming MySQL) and no table prefixes
Give me all the edits done since Dec 01 2018 (including that date):
SELECT rev_id, rev_page, rev_text_id, rev_comment, rev_user, rev_user_text, rev_timestamp
FROM revision
WHERE rev_timestamp > '20181201';
Note that the actual page text is stored in the text table, and the page name in the page table.
Give me all edits done since Dec 01 2018 (including that date) with page name and revision text:
SELECT rev_id, rev_page, page_namespace, page_title, rev_text_id, rev_comment, rev_user, rev_user_text, rev_timestamp, old_text
FROM revision r
LEFT JOIN page p
ON p.page_id = r.rev_page
LEFT JOIN text t
ON t.old_id = r.rev_text_id
WHERE rev_timestamp > '20181201';
Note that with tools like MySQL Workbench you can copy the results as MySQL INSERT statements. Depending on what users did to the old wiki, you might just need to transfer records of 3 tables; however, if there were file uploads, deletions or user right changes involved, it gets more complicated. You can track these changes through the logging table.
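A minimal sketch for querying the logging table in the same way (column names can differ between MediaWiki versions, so check your schema first):
SELECT log_id, log_type, log_action, log_timestamp, log_namespace, log_title
FROM logging
WHERE log_timestamp > '20181201';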
Using the Web Interface
It is of course possible to show more changes than just 500 for the last 30 days. The settings that allow you to configure this are $wgRCLinkLimits and $wgRCLinkDays. You can also just open the recent changes page, tap 30 days and change the URL parameters so the URL becomes path/to/index.php?title=Special:RecentChanges&days=90&limit=1500 (a limit of 1500 within the last 90 days).
How long recent changes are retained depends on $wgRCMaxAge. The default is 90 days, but you might be in luck if the purge job has not yet deleted older entries.
Logs can be viewed without that limitation. Visit Special:Log in your wiki.
Using the API
list=allrevisions lists all page revisions (i.e. changes).
It allows specifying start timestamps (arvstart) and continuation.
Example: https://commons.wikimedia.org/w/api.php?action=query&list=allrevisions&arvlimit=1000
To see deletions, user right changes, uploads, ... use list=logevents.
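Example (assuming the standard lelimit/lestart parameters for list=logevents): https://commons.wikimedia.org/w/api.php?action=query&list=logevents&lelimit=100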
Fix the issue
Either use database scripts (don't forget to back up first) or use Special:Export in the source wiki and Special:Import in the wiki that needs the update.
Avoid the issue
For a future migration to a new server, $wgReadOnly might be your friend, avoiding this issue in the first place by making the old wiki read-only.
There is also Extension:Sync, though I am not sure what it is capable of.
I have set up Superset on my Jupyter notebook and it is working, i.e. the sample dashboards etc. work. When I try to create a simple table view to just do a SELECT * from the table to view the whole table (it is a small table), Superset keeps generating the SQL:
SELECT
FROM
  (SELECT Country,
          Region,
          Users,
          Emails
   FROM `UserStats`
   LIMIT 50000) AS expr_qry
LIMIT 50000
The first SELECT FROM and the AS expr_qry LIMIT 50000 are automatically generated and I cannot get rid of them (i.e. in the Slice view it shows this as the query, but won't let me edit it). Why does it generate its own SQL and where do you change this?
I tried to find workarounds for this but I feel I am missing something fundamental here.
There is a ROW_LIMIT setting in superset/config.py (the default is 50000). If you need to remove the LIMIT clause entirely, look at the query_obj function in superset/viz.py.
What is the best way to keep updates (diffs) of some entity in the database? Here at StackOverflow we can edit questions and answers, and then we can look at any revision of the question or answer we want, for example the revisions of some random question. Maybe someone knows how this is implemented at StackOverflow?
To be clear, in my case I have some entity (an article) with some fields (name, description, content). Many users can edit the same article. I want to keep a history of the article updates (something like version control), and I want to keep only diffs, not the whole content of the updated article. By the way, I use PostgreSQL, but can migrate to any other database.
UPD
I opened a bounty, so here are some requirements. You don't need to fully satisfy them, but if you do it will be much better. Nevertheless, any answer is much appreciated. So I want the ability:
to keep only diffs, so as not to waste space for no purpose;
to fetch any revision (version) of some article, but fetching the last revision of the article must be really quick (the fetching speed of other revisions is not so important);
to fetch any diff (and list of diffs) of some article. An article can have changes in several fields: header, description or content (like StackOverflow has changes in header and content), so that must be taken into account.
In the past, I have used diff-match-patch with excellent (and fast) results. It is available for several languages (my experience with it was in C#). I did not use it for exactly the process you are describing (we were interested in merging), but it seems to me you could:
Save the initial version of an article's text/header/whatever.
When a change is made, use diff-match-patch to compute a patch between the newly edited version and what is already in the database. To get the latest version in the database, simply apply any patches that have already been generated to the original article in order.
Save the newly generated patch.
If you wanted to speed things up even more, you could cache the latest version of the article in its own row/table/however-you-organize-things so that getting the latest version is a simple SELECT. This way, you'd have the initial version, the list of patches, and the current version, giving you some flexibility and speed.
Since you have a set of patches in sequence, fetching any version of the article would simply be a matter of applying patches up to the one desired.
You can take a look at the patch demo to see what its patches look like and get an idea of how big they are.
Like I said, I have not used it for exactly this scenario, but diff-match-patch has been designed for doing more or less exactly what you are talking about. This library is on my short list of software I can use when I have no restrictions on libraries developed out-of-house.
Update: Some example pseudocode
As an example, you could set up your tables like so (this assumes a few other tables, like Authors):
Articles
--------
id
authorId
title
content
timestamp
ArticlePatches
--------------
id
articleId
patchText
timestamp
CurrentArticleContents
----------------------
id
articleId
content
Then some basic CRUD could look like:
Insert new article:
INSERT INTO Articles (authorId, title, content, timestamp)
VALUES(#authorId, #title, #content, GETDATE())
INSERT INTO CurrentArticleContents(articleId, content)
VALUES(SCOPE_IDENTITY(),#content)
GO
Get all articles with latest content for each:
SELECT
a.id,
a.authorId,
a.title,
cac.content,
a.timestamp AS originalPubDate
FROM Articles a
INNER JOIN CurrentArticleContents cac
ON a.id = cac.articleId
Update an article's content:
//this would have to be done programmatically
currentContent =
(SELECT content
FROM CurrentArticleContents
WHERE articleId = #articleId)
//using the diff-match-patch API
patches = patch_make(currentContent, newContent);
patchText = patch_toText(patches);
//setting #patchText = patchText and #newContent = newContent:
INSERT INTO ArticlePatches(articleId, patchText, timestamp)
VALUES(#articleId, #patchText, GETDATE())
UPDATE CurrentArticleContents
SET content = #newContent
WHERE articleId = #articleId
GO
Get the article at a particular point in time:
//again, programmatically
originalContent = (SELECT content FROM Articles WHERE articleId = #articleId)
patchTexts =
(SELECT patchText
FROM ArticlePatches
WHERE articleId = #articleId
AND timestamp <= #selectedDate
ORDER BY timestamp ASC)
content = originalContent
foreach(patchText in patchTexts)
{
//more diff-match-patch API
patches = patch_fromText(patchText)
content = patch_apply(patches, content)[0]
}
I had a similar issue at my workplace.
I implemented an AFTER UPDATE trigger to record all the needed data to another table (where you can, of course, store only the fields that changed). The new version stays in the real table, while the history lives in the log table.
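A minimal MySQL-style sketch of that idea, using hypothetical articles and article_history tables (PostgreSQL would need a trigger function instead):
-- record the previous content whenever a row in articles changes
CREATE TRIGGER articles_log_update
AFTER UPDATE ON articles
FOR EACH ROW
INSERT INTO article_history (article_id, old_content, changed_at)
VALUES (OLD.id, OLD.content, NOW());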
OK, first, @PaulGriffin's answer is complete. And @VladimirBaranov's also got me thinking about the optimal way to do updates. I've got a tasty way to handle the use case of frequent updates and less frequent reads: lazy full record updates.
For example, editing a document online, possibly from different devices. You optimize for light network traffic and less frequent large DB record updates, and the client never needs to go to a specific version (it always gets the latest).
The database has a collection of deltas, indexed by version (which could be a timestamp), and a lastDocument with a lastDocUpdate (version/timestamp).
Frequent use case: Client Edits
Send only the delta to the Server & update database with delta
Use Case: Client with old document version requests updates
Send all deltas since current client version
Less Frequent use case: New device, no prior data on client
On Server, look at lastDocument, apply deltas since lastDocUpdate
Save updated lastDocument in db, and send to client
The most expensive operation is updating the full document in the database, but it is only done when necessary, i.e. when a client has no version of the document.
That rare action is what actually triggers the full document update.
This setup has no extra DB writes and sends minimal data to the client (which updates its doc with the deltas); the large text field is updated only when we are already applying the deltas on the server because the full document is needed by a client.
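A minimal schema sketch of this layout, with assumed table and column names:
-- one row per delta, ordered by version (could equally be a timestamp)
CREATE TABLE document_deltas (
    document_id BIGINT NOT NULL,
    version     BIGINT NOT NULL,
    delta       TEXT   NOT NULL,
    PRIMARY KEY (document_id, version)
);

-- lazily maintained full copy: lastDocument plus the version it corresponds to
CREATE TABLE documents (
    document_id     BIGINT PRIMARY KEY,
    last_document   TEXT,
    last_doc_update BIGINT
);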
What is the faster/better way to keep track of statistical data in a message board?
-> number of posts/topics
Update a column like 'number_of_posts' for each incoming post or after a post gets deleted.
Or just count(*) on the posts matching a topicId?
Just use count(*) - it's built into the database. It's well tested, and already written.
Having a special column to do this for you means you need to write the code to manage it and keep it in sync with the actual value (on adds and deletes). Why make more work for yourself?
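For example, assuming a posts table with a topic_id column (an index on topic_id keeps the count fast):
SELECT COUNT(*) AS number_of_posts
FROM posts
WHERE topic_id = 42;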