In general it seems true, at least for a single page, that lower revision IDs in MediaWiki page histories mean earlier edit times. Is this true in general? Are there ever exceptions? How does revision ID minting work?
I am trying to write a function with Pywikipedia that will give the page text as of an arbitrary timestamp. It would simply be more efficient to sort by revision ID rather than building a dict mapping revision IDs to timestamps and then sorting the timestamps.
I found the answer on IRC, thanks to user:halfak. The answer is that there is no guarantee, for at least two reasons:
If pages are imported from another wiki, the timestamps can be unrelated to the revision IDs.
If two edits occur within the same second (which does happen from time to time), they will not necessarily be ordered correctly.
3 tables: Device, SoftwareRevision, Message. All data entered is handled by PHP scripts on an Apache server.
A device can have one software revision. A software revision can have many devices. A device can have many messages. A message can have one device.
The schema looks something like the above.
The issue is that the SoftwareRevision changes how the message is used in the front-end application. This means that when the software on a device is updated, we need older messages to retain the information that they were received under a different software revision.
The TL;DR here is that the fully normalized way I see of doing this becomes a real pain. I've got about 5 of these situations in my current project and 3 of them are nested inside of each other.
I see three ways of doing this:
The first is the above, fully normalized way. In order to find out how to use the message in the front-end application, one must find the latest entry in Device_SoftwareRevision_Records that precedes the datetime of the given message. This gets really fiddly when you have a more complex database and application. Just to get the current SoftwareRevision_ID for a device you have to use a MAX/GROUP BY type of statement (I've ended up having to use views to simplify).
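For illustration, a minimal sketch of that kind of MAX/GROUP BY query; apart from the table name and SoftwareRevision_ID, the column names (Device_ID, AssignedAt) are assumptions rather than anything given in the question:

SELECT r.Device_ID, r.SoftwareRevision_ID
FROM Device_SoftwareRevision_Records r
JOIN (
    -- latest assignment per device
    SELECT Device_ID, MAX(AssignedAt) AS MaxAssignedAt
    FROM Device_SoftwareRevision_Records
    GROUP BY Device_ID
) latest
  ON latest.Device_ID = r.Device_ID
 AND latest.MaxAssignedAt = r.AssignedAt;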
The second is to link the Message directly to the SoftwareRevision. This means you don't have to go through the whole MAX/GROUP BY/WHERE blah blah. The SoftwareRevision_ID is retrieved by a PHP script and then the message is entered. Of course, this is denormalized, so now there is potential for duplicate data.
Aaaand here's our fully denormalized version. The Software_Revision_Records table is purely for bookkeeping purposes. It is easy to use for the front-end application but a pain to update at the back end. The back-end updating can actually be streamlined with triggers for inserting into the Software_Revision_Records table, so the only thing that can really go wrong is that a message gets the wrong software revision when it is entered.
Is there a better way of doing this that I have missed? Is it such a sin to denormalize the database in this situation? Will my decision here cause the business to erupt into flames (probably not)?
If the messages are tied to the software revision for that particular device, then it might make more sense to reflect that relationship in the data model. i.e. have a foreign key from Messages to Device_SoftwareRevision_Records rather than from Messages to Device. You still have the relationship from Messages to Device indirectly, it's normalised, and there's no messing around with dates trying to figure out which messages were created while a given software revision was in place.
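To make that concrete, here is a hypothetical sketch of the two tables involved; only the table names come from the question, and every column name is an assumption:

CREATE TABLE Device_SoftwareRevision_Records (
    ID INT AUTO_INCREMENT PRIMARY KEY,
    Device_ID INT NOT NULL,
    SoftwareRevision_ID INT NOT NULL,
    StartDate DATETIME NOT NULL,
    FOREIGN KEY (Device_ID) REFERENCES Device (ID),
    FOREIGN KEY (SoftwareRevision_ID) REFERENCES SoftwareRevision (ID)
);

CREATE TABLE Message (
    ID INT AUTO_INCREMENT PRIMARY KEY,
    -- link to the device/revision record rather than directly to Device
    Device_SoftwareRevision_Record_ID INT NOT NULL,
    Body TEXT,
    TimeStamp DATETIME NOT NULL,
    FOREIGN KEY (Device_SoftwareRevision_Record_ID) REFERENCES Device_SoftwareRevision_Records (ID)
);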
In cases where you do need dates, it might also be worth considering having both a start and stop date, and filling in any null dates with something like 9999-12-31 (to indicate that a record has not yet been ended). You can easily find the latest record without needing to do a max. It will also make it a lot easier to query the table if you do need to compare it to other dates - you can just do a between on a single record. In this example, you'd just look for this:
where Message.TimeStamp between Device_SoftwareRevision_Records.StartDate and Device_SoftwareRevision_Records.EndDate
That said, I would still - if at all possible - change the model to relate Messages to the correct table rather than rely on dates. Being able to do simple joins will be quicker, more convenient, more obvious if anyone new needs to learn the structure, and is likely to perform better.
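As a rough sketch of the kind of simple join that gives you (again, column names other than the table names are assumptions):

SELECT m.ID AS Message_ID, dsr.SoftwareRevision_ID
FROM Message m
JOIN Device_SoftwareRevision_Records dsr
  ON dsr.ID = m.Device_SoftwareRevision_Record_ID;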
I asked nearly the same question in probably the wrong way, so I apologize for both the near duplicate and lousy original phrasing. I feel like my problem now is attempting to fight Rails, which is, of course, a losing battle. Accordingly, I am looking for the idiomatic Rails way to do this.
I have a table containing rows of user data which is scraped from a third-party site periodically. The old data is just as important as the new data; the old data is, in fact, probably used more often. There are no performance concerns about referencing the new data, because only a couple of people will ever use my service (I keep my standards realistic). But thousands of users are scraped periodically (i.e., way too often). I have named the corresponding models "User" and "UserScrape".
Table users has columns: id, name, email
Table user_scrapes has columns: id, user_id, created_at, address_id, awesomesauce_preference
Note: These are not the real models - user_scrapes has a lot more columns - but you probably get the point
At any given time, I want to find the most recent user_scrapes values for a given user from the externally scraped data. I want to find out what my current awesomesauce_preference is, because lately it's probably 'lamesauce', but before, it was 'saucy_sauce'.
I want to have a convenient method that allows me to access the newest scraped data for each user in such a way that I can combine it with separate WHERE clauses to narrow it down further. That's because in at least a dozen parts of my code, I need to deal with the data from the latest scrape.
What I have done so far is this horrible hack that selects the latest user_scrapes for each user with a regular find_by_sql correlated sub-query, then I pluck out the ids of the scrapes, then I put an additional where clause in any relevant query (that needs the latest data).
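For context, the SQL behind that hack looks roughly like the following correlated sub-query (a sketch only; the column names come from the user_scrapes table above, and the real thing goes through find_by_sql):

SELECT us.id
FROM user_scrapes us
WHERE us.created_at = (
    -- latest scrape per user
    SELECT MAX(inner_scrapes.created_at)
    FROM user_scrapes inner_scrapes
    WHERE inner_scrapes.user_id = us.user_id
);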
This is already an issue performance-wise because I don't want to buffer over a million integers (yes, a lot of pages get scraped very often) then try to pass the MySQL driver a list of these and have it miraculously execute a perfect query plan. In my benchmark it took almost as long as it did for me to write this post, so I lied before. Performance is sort of an issue, but not really.
My question
So with my UserScrape class, how can I make a method called 'current', as in UserScrape.find(1337).current.where(address_id: 1234).awesomesauce_preference, when I live at addresses 1234 and 1235 and I want to find out what my awesomesauce_preference is at my latest address?
I think what you are looking for are scopes:
http://guides.rubyonrails.org/active_record_querying.html#scopes
In particular, you can probably use:
scope :current, order("user_scrapes.created_at DESC").limit(1)
Update:
Scopes are meant to return an ActiveRecord object, so that you can continue chaining methods if you wish. There is nothing to prevent you (last I checked anyways) from writing this instead, however:
scope :current, order("user_scrapes.created_at DESC").first
This returns just the one object, and is not chainable, but it may be a more useful function ultimately.
UserScrape.where(address_id: 1234).current.awesomesauce_preference
A database table is used to store editing changes to a text document.
The database table has four columns: {id, timestamp, user_id, text}
A new row is added to the table each time a user edits the document. The new row has an auto-incremented id, and a timestamp matching the time the data was saved.
To determine what editing changes a user made during a particular edit, the text from the row inserted in response to his or her edit is compared to the text in the previously inserted row.
To determine which row is the previously inserted row, either the id column or the timestamp column could be used. As far as I can see, each method has advantages and disadvantages.
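To make the comparison concrete, here are sketches of both lookups; the columns come from the table described above, while the table name itself (document_edits) is made up for the example:

-- Previously inserted row, by id:
SELECT text FROM document_edits
WHERE id < @current_id
ORDER BY id DESC
LIMIT 1;

-- Previously inserted row, by timestamp:
SELECT text FROM document_edits
WHERE timestamp < @current_timestamp
ORDER BY timestamp DESC
LIMIT 1;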
Determining the creation order using id
Advantage: Immune to problems resulting from an incorrectly set system clock.
Disadvantage: Seems to be an abuse of the id column, since it ascribes meaning to it beyond identity. An administrator might change the values of a set of ids for whatever reason (e.g. during a data migration), since it ought not to matter what the values are so long as they are unique. The creation order of the rows could then no longer be determined.
Determining the creation order using timestamp
Advantage: The id column is used for identity only, and the timestamp is used for time, as it ought to be.
Disadvantage: This method is only reliable if the system clock is known to have been correctly set each time a row was inserted into the table. How could one be convinced that the system clock was correctly set for each insert? And how could the state of the table be fixed if ever it was discovered that the system clock was incorrectly set for a not precisely known period in the past?
I seek a strong argument for choosing one method over the other, or a description of another method that is better than the two I am considering.
Using the sequential id would be simpler, as it's probably(?) a primary key and thus indexed and quicker to access. Given that you have user_id, you can quickly ascertain the last and prior edits.
Using the timestamp is also workable, but it's likely to be a longer entry, we don't know whether it's indexed at all, and there is the potential for collisions. You rightly point out that system clocks can change... whereas sequential ids cannot.
Given your update:
As it's difficult to see what your exact requirements are, I've included this as evidence of what a particular project required for 200K+ complex documents and millions of revisions.
This comes from my own experience building a fully auditable doc/profiling system for an internal team of more than 60 full-time researchers. We ended up using both an id and a number of other fields (including a timestamp) to provide audit trails and full versioning.
The system we built has more than 200 fields for each profile, so versioning a document was far more complex than just storing a block of changed text/content for each one; yet each profile could be edited, approved, rejected, rolled back, published, and even exported as a PDF or another format as ONE document.
What we ended up doing (after a lot of strategy/planning) was to store sequential versions of the profile, but they were keyed primarily on an id field.
Timestamps
Timestamps were also captured as a secondary check, and we made sure to keep the system clocks accurate (across a cluster of servers) through cron scripts that checked the time alignment regularly and corrected it where necessary. We also used ntpd to prevent clock drift.
Other captured data
Other data captured for each edit included (but was not limited to):
User_id
User_group
Action
Approval_id
There were also other tables that fulfilled internal requirements (including automatically generated annotations for the documents), as some of the profile editing was done using data from bots (built using NER/machine learning/AI), but with approval required from one of the team before edits/updates could be published.
An action log was also kept of all user actions, so that in the event of an audit one could look at the actions of an individual user. Even when a user didn't have the permissions to perform an action, the attempt was still logged.
With regard to migration, I don't see it as a big problem, as you can easily preserve the id sequences when moving/dumping/transferring data. Perhaps the only issue would be if you needed to merge datasets. You could always write a migration script in that event, so from a personal perspective I consider that disadvantage somewhat diminished.
It might be worth looking at the Stack Overflow table structures via their Data Explorer (which is reasonably sophisticated). You can see the table structure here: https://data.stackexchange.com/stackoverflow/query/new, which comes from a question on meta: How does SO store revisions?
As a revision system, SO works well and the markdown/revision functionality is probably a good example to pick over.
Use Id. It's simple and works.
The only caveat is if you routinely add rows from a store-and-forward server, so rows may be added later but should be treated as having been added earlier.
Or add another column whose sole purpose is to record the editing order. I suggest you do not use datetime for this.
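A minimal sketch of that idea, using an invented table name and column name since the question doesn't give them:

-- edit_sequence is a hypothetical column whose sole purpose is to record editing order
ALTER TABLE document_edits ADD COLUMN edit_sequence BIGINT NOT NULL;
CREATE INDEX idx_document_edits_sequence ON document_edits (edit_sequence);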
I'm trying to figure out how to design my DB tables to allow Undo-Redo.
Pretend you have a tasks table with the following structure:
id <int>
title <varchar>
memo <string>
date_added <datetime>
date_due <datetime>
Now assume that over a few days and multiple log-ins several edits have taken place, but a user wants to go back to one of the earlier versions.
Would you have a separate table tracking the changes - or - would you try to keep the changes within the tasks table ("ghost" rows, for lack of a better term)?
Would you track all of the columns or just the ones that changed each time?
If it matters, I'm using MySQL. Also, if it matters, I'd like to be able to show the history (ala Photoshop) and allow a user to switch to any version.
Bonus question: Would you save the whole memo cell on a change or would you try to save the delta only? Reason I ask is because the memo cell could be large and only a single word or character might be changed each revision. Granted, saving the delta would require parsing, but if undos aren't expected very often, wouldn't it be better to save space rather than processing time?
Thank you for your help.
I would create a History table for your tasks table. It would have the same structure as tasks plus a new field named previousId. This would hold the id of the previous change, so you can go back and forth through different changes (undo/redo).
Why a new History table? For a simple reason: do not overload the tasks table with things that it was not designed for.
As for space, in the History table, instead of a Memo column, use a binary field and zip the content of the text you want to store. Don't try to detect changes. You will run into buggy code, which will result in frustration and wasted time...
Optimization:
Even better, you may keep only three columns in the History table:
1. taskId (foreign key to tasks)
2. data - a binary field. Before saving in the History table, create an XML string holding only the fields that have changed.
3. previousId (will help maintain a queue of changes and allow navigation back and forth)
As for data field, create an XML string like this:
<task>
<title>Title was changed</title>
<date_added>2011-03-26 01:29:22</date_added>
</task>
This will basically tell you that this time you changed only the title and the date_added fields.
After the XML string is built, just zip it if you want and store it into History table's data field.
XML also allows for flexibility. If you add or remove a field in the tasks table, you don't need to update the History table as well; the structures of the tasks table and the History table stay decoupled, so you don't need to update two tables each time.
PS: don't forget to add some indexes to quickly navigate through the history table. Fields to be indexed: taskId and previousId as you will need fast queries against this table.
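Pulling the above together, a hypothetical MySQL sketch of the trimmed-down History table (an id column is included so previousId has something to point back to):

CREATE TABLE History (
    id INT AUTO_INCREMENT PRIMARY KEY,
    taskId INT NOT NULL,       -- foreign key to tasks
    data BLOB,                 -- zipped XML holding only the changed fields
    previousId INT NULL,       -- previous change in the chain (NULL for the first version)
    FOREIGN KEY (taskId) REFERENCES tasks (id),
    INDEX idx_history_previous (previousId)
);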
Hope this helps.
When I do similar types of things using SQL, I always use a second table for revision history. This prevents your primary table from getting overly large with versions. The rationale is that retrieving the current record happens almost 100% of the time, while viewing history and rolling back (undo) is very infrequent.
If you only have a single UNDO or history, then tracking in-table is probably fine.
Whether you want to save deltas or the entire cell depends on expected growth/usage. If you are comfortable creating the logic to manage deltas, that will save you space. If new versions aren't really created that often, I wouldn't start with that (applying YAGNI).
You might want to compress revisions in delta form but you should still have the current revision in full for quick retrieval.
However, older-to-newer deltas require lots of processing unless you have some non-delta version to build on, and newer-to-older deltas require reprocessing every time something changes. So deltas usually don't buy you much benefit, only greater complexity.
Last I checked, which was some years ago, MediaWiki, the software behind Wikipedia, stored full texts, provided some means to compress older revisions with gzip to save space, and had a dedicated archive table for deleted revisions/pages.
Their website has an ER diagram of their database layout which you might find useful.
Erlang obviously has some notion of namespaces; we use things like application:start() every day.
I would like to know if there is such a thing as a namespace for records. In my application I have defined a record user. Everything was fine until I needed to include rabbit.hrl from RabbitMQ, which also defines user, and that conflicts with mine.
Online searching didn't yield much to resolve this. I have considered renaming my user record by prefixing it with something, say "myapp_user". That would fix this particular issue, but I suspect I would eventually hit another conflict, say with my record "session".
What are my options here? Is adding a prefix like myapp_ to all my records good practice, or is there real support for namespaces with records that I am just not finding?
EDIT: Thank you everyone for your answers. What I've learned is that records are global. The accepted answer made it very clear. I will go with adding prefixes to all my records, as I had expected.
I would argue that Erlang has no namespaces whatsoever. Modules are global (with the exception of a very unpopular extension to the language), names are global (either to the node or the cluster), pids are global, ports are global, references are global, etc.
Everything is laid out flat. Namespacing in Erlang is thus done by convention rather than by any other means. This is why you have <appname>_app, <appname>_sup, etc. as module names. The registered processes also likely follow that pattern, and ETS tables, and so on.
However, you should note that records themselves are not global things: as JUST MY correct OPINION has put it, records are simply a compiler trick over tuples. Because of this, they're local to a module definition. Nobody outside of the module will see a record unless they also include the record definition (either by copying it or with a header file, the latter being the best way to do it).
Now I could argue that because you need to include .hrl files and record definitions on a per-module basis, there is no such thing as namespacing records; they're rather scoped in the module, like a variable would be. There is no reason to ever namespace them: just include the right one.
Of course, it could be the case that you include record definitions from two modules, and both records have the same name. If this happens, renaming the records with a prefix might be necessary, but this is a rather rare occurrence in my experience.
Note that it's also generally a bad idea to expose records to other modules. One of the problems of doing so is that all modules depending on yours now get to include its .hrl file. If your module then changes the record definition, you will have to recompile every other module that depends on it. A better practice would be to implement functions to interact with the data. Note that get(Key, Struct) isn't always a good idea: if you can pick meaningful names (age, name, children, etc.), your code and API should make more sense to readers.
You'll either need to name all of your records in a way that is unlikely to conflict with other records, or you need to just not use them across modules. In most circumstances I'll treat records as opaque data structures and add functionality to the module that defines the record to access it. This will avoid the issue you've experienced.
I may be slapped down soundly by I GIVE TERRIBLE ADVICE here with his deeper knowledge of Erlang, but I'm pretty sure there are no namespaces for records in Erlang. The record name is just an atom grafted onto the front of the tuple that the compiler builds for you behind the scenes. (Records are pretty much just a hack on tuples, you see.) Once compiled, there is no meaningful "namespace" for a record.
For example, let's look at this record.
-record(branch, {element, priority, left, right}).
When you instantiate this record in code...
#branch{element = Element, priority = Priority, left = nil, right = nil}.
...what comes out the other end is a tuple like this:
{branch, Element, Priority, nil, nil}
That's all the record is at this point. There is no actual "record" object and thus namespacing doesn't really make any sense. The name of the record is just an atom tacked onto the front. In Erlang it's perfectly acceptable for me to have that tuple and another that looks like this:
{branch, Twig, Flower}
There's no problem at the run-time level with having both of these.
But...
Of course there is a problem having these in your code as records since the compiler doesn't know which branch I'm referring to when I instantiate. You'd have to, in short, do the manual namespacing you were talking about if you want the records to be exposed in your API.
That last point is the key, however. Why are you exposing records in your API? The code I took my branch record from uses the record as a purely opaque data type. I have a function to build a branch record and that is what will be in my API if I want to expose a branch at all. The function takes the element, priority, etc. values and returns a record (read: a tuple). The user has no need to know about the contents. If I had a module exposing a (biological) tree's structure, it too could return a tuple that happens to have the atom branch as its first element without any kind of conflict.
Personally, to my tastes, exposing records in Erlang APIs is code smell. It may sometimes be necessary, but most of the time it should remain hidden.
There is only one record namespace, and unlike functions and macros there can only be one record with a given name. However, for record fields there is one namespace per record, which means there is no problem in having fields with the same name in different records. This is one reason why the record name must always be included in every record access.