Get written data per day on MediaWiki

I want to graph how much data is written to my MediaWiki per day, as a measure of activity. The exact number of bytes is not important; I just want to see the relative change per day/month/year.
I only found the Statistics Log extension, which has not been maintained since 1.15.
Any solution via an extension, the API, or MySQL would be great. If I can get the byte/character counts, or anything comparable, by any method, I can do the rest.

Not an easy one, but you can get started with the recentchanges table and its related API: https://www.mediawiki.org/wiki/API:RecentChanges
rc_old_len
This field stores the size, in bytes, of the previous revision's text.
rc_new_len
This field stores the size, in bytes, of the current revision's text.
Reference: https://www.mediawiki.org/wiki/Manual:Recentchanges_table#rc_new_len
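
To give an idea of what that looks like in practice, here is a rough Python sketch against the API. The wiki URL is a placeholder, and keep in mind the recentchanges table only holds a limited window of edits ($wgRCMaxAge, 90 days by default), so for a longer history you would need to work from the revision data instead:

    import requests
    from collections import defaultdict

    API_URL = "https://wiki.example.org/w/api.php"   # placeholder for your wiki

    def bytes_written_per_day():
        """Sum (newlen - oldlen) over recent edits, grouped by day."""
        params = {
            "action": "query",
            "list": "recentchanges",
            "rctype": "edit|new",
            "rcprop": "timestamp|sizes",
            "rclimit": "500",
            "format": "json",
        }
        totals = defaultdict(int)
        while True:
            data = requests.get(API_URL, params=params).json()
            for rc in data["query"]["recentchanges"]:
                day = rc["timestamp"][:10]                  # YYYY-MM-DD
                totals[day] += rc["newlen"] - rc.get("oldlen", 0)
            if "continue" not in data:                      # no more result pages
                break
            params.update(data["continue"])
        return dict(sorted(totals.items()))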

Cracking a binary file format if I have the contents of one of these files

I have about 300 measurements (each stored in a dat file) that I would like to read using MATLAB or Python. The files can be exported to text or csv using a proprietary program, but this has to be done one by one.
The question is: what would be the best approach to crack the format of the binary file using the known content from the exported file?
I'm not sure if this makes the cracking any easier, but the files are just two columns of numbers (900k of them), and from the dat files' size (1,800,668 bytes) it appears as if each number is 16 bits (float) and there is some other information (possibly a header).
I tried using HEX-Editor, but wasn't able to pick up any trends from there.
Lastly, I want to make sure to specify that these are measurements I made and the data in them belongs to me. I am not trying to obtain data that I am not supposed to.
Thanks for any help.
EDIT: Reading up a little more, I realized that there may be some kind of compression going on. When you look at the data in StreamWare, it gives 7 decimal places, leading me to believe that it is a single-precision value (4 bytes). However, the size of the files suggests that each value only takes 2 bytes.
After thinking about it a little more, I finally figured it out. This is very specific, but just in case another Dantec StreamWare user runs into the same problem, it could save him/her a little time.
First, the data is actually only a single vector. The time column is calculated from the length of the recorded signal and the sampling frequency. That information is probably in the header (but I wasn't able to crack that portion).
To obtain the values in MATLAB, I skipped the header bytes using fseek(fid, 668, 'bof'), then I read the data as uint16 using fread(fid, 900000, 'uint16'). This gives you integers.
To get the float value, all you have to do is divide by 2^16 (it's a 16 bit resolution system) and multiply by ten. I assume the factor of ten depends on the range of your data acquisition system.
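For anyone who wants to do the same from Python, here is a rough NumPy equivalent of the MATLAB calls above. The header size, sample count, and factor of ten are taken from this answer; the little-endian byte order is my own assumption:

    import numpy as np

    HEADER_BYTES = 668        # 1,800,668 bytes minus 2 bytes * 900,000 samples
    N_SAMPLES = 900_000
    RANGE_FACTOR = 10.0       # presumably depends on the DAQ input range

    def read_streamware_dat(path):
        """Skip the header, read the raw 16-bit counts, and rescale them."""
        counts = np.fromfile(path, dtype="<u2", count=N_SAMPLES, offset=HEADER_BYTES)
        return counts.astype(np.float64) / 2**16 * RANGE_FACTOR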
I hope this helps.

ABAP TVRO field TRAZTD, Route Customizing Data

A customer of mine is looking to mass-create some customizing data related to routes, and as such I have a small program which reads in a CSV file with all of the fields as they would be entered in the customizing transaction.
I'm having a particular problem wrapping my head around a field TVRO-TRAZTD for a couple of reasons.
The user is only filling in a number which represents a number of days.
There is a conversion exit on TRAZTD, except it's obsolete; the documentation says to use CONVERT TIMESTAMP instead
I don't have a timestamp, I have a decimal number representing a part of a day
For example, TRAZTD would be entered as 0,58 from the CSV file, so why is it represented in the table as 135.512?
I tried it the old-fashioned way and multiplied 0,58 * 24, which gives me 13,92. If I take 13,92 * 10 I get 139.200, which isn't the same, but it's the closest I can get, and I don't get where the 10 comes from.
Using the conversion exit, even though it's obsolete, doesn't give me a result either; no matter what number I give it, I always get 0 back. I can't use CONVERT TIMESTAMP either because, well, it's not a timestamp, or I didn't look carefully enough at how to use it (I didn't see anything other than strings and characters).
The other thing I tried was just saying "screw it" and placing the data from the CSV directly into the field, hoping the conversion routine would take care of the work, but that doesn't happen either.
Is there anybody out there who can maybe shed some light on where the number after the conversion comes from?
Everybody, I came to a solution; just in case anybody stumbles upon this same problem:
I took the value from the Excel document and multiplied it by 24 to get the number of hours, and then multiplied that by 10000. Honestly, I can't say why 10000; I picked it more or less at random and it gave usable values.
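A side note for future readers: the 135.512 from the question is exactly what you would get if TRAZTD simply holds the duration packed as hours, minutes and seconds, since 0,58 days is 13:55:12. That is only a guess about the field's format, but the arithmetic is easy to check:

    def day_fraction_to_hhmmss(fraction):
        """Pack a fraction of a day (0,58 in the question) into an HHMMSS number."""
        total_seconds = round(fraction * 24 * 60 * 60)
        hours, rest = divmod(total_seconds, 3600)
        minutes, seconds = divmod(rest, 60)
        return hours * 10000 + minutes * 100 + seconds

    print(day_fraction_to_hhmmss(0.58))   # 135512, i.e. 13:55:12, shown as 135.512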

Storing large, session-level datasets?

I'm working on building a web application that consists of users doing the following:
Browse and search against a Solr server containing millions of entries. (This part of the app is working really well.)
Select a privileged piece of this data (the results of some particular search), and temporarily save it as a "dataset". (I'd like dataset size to be limited to something really large, say half a million results.)
Perform some sundry operations on that dataset.
(The frontend's built in Rails, though I doubt that's really relevant to how to solve this particular problem.)
Step two, and how to retrieve the data for step 3, are what's giving me trouble. I need to be able to temporarily save datasets, recover them when they're needed, and expire them after a while. The problem is, my results have SHA1 checksum IDs, so each ID is 48 characters. A 500,000 record dataset, even if I only store IDs, is 22 MB of data. So I can't just have a single database table and throw a row in it for each dataset that a user constructs.
Has anybody out there ever needed something like this before? What's the best way to approach this problem? Should I generate a separate table for each dataset that a user constructs? If so, what's the best way to expire/delete these tables after a while? I can deploy a MySQL server if needed (though I don't have one up yet, all the data's in Solr), and I'd be open to some crazier software as well if something else fits the bill.
EDIT: Some more detailed info, in response to Jeff Ferland below.
The data objects are immutable, static, and reside entirely within the Solr database. It might be more efficient as files, but I would much rather (for reasons of search and browse) keep them where they are. Neither the data nor the datasets need to be distributed across multiple systems; I don't expect we'll ever get that kind of load. For now, the whole damn thing runs inside a single VM (I can cross that bridge when I get there).
By "recovering when needed," what I mean is something like this: The user runs a really carefully crafted search query, which gives them some set of objects as a result. They then decide they want to manipulate that set. When they (as a random example) click the "graph these objects by year" button, I need to be able to retrieve the full set of object IDs so I can take them back to the Solr server and run more queries. I'd rather store the object IDs (and not the search query), because the result set may change underneath the user as we add more objects.
A "while" is roughly the length of a user session. There's a complication, though, that might matter: I may wind up needing to implement a job queue so that I can defer processing, in which case the "while" would need to be "as long as it takes to process your job."
Thanks to Jeff for prodding me to provide the right kind of further detail.
First trick: don't represent your SHA1 as text, but rather as the 20 bytes it takes up. The hex value you see is a way of showing bytes in human readable form. If you store them properly, you're at 9.5MB instead of 22.
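For example, a minimal sketch in Python (the hex string below is just an arbitrary 40-character digest):

    import binascii

    hex_id = "356a192b7913b04c54574d18c28d46e6395428ab"   # arbitrary SHA-1 hex digest
    raw_id = binascii.unhexlify(hex_id)                    # 20 raw bytes instead of 40 chars

    assert len(hex_id) == 40 and len(raw_id) == 20
    # On the MySQL side: store UNHEX(hex_id) in a BINARY(20) column,
    # and use HEX(item_id) to turn it back into the familiar text form.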
Second, you haven't really explained the nature of what you're doing. Are your saved datasets references to immutable objects in the existing database? What do you mean by recovering them when needed? How long is "a while" when you talk about expiration? Is the underlying data that you're referencing static or dynamic? Can you save the search pattern and an offset, or do you need to save the individual reference?
Does the data related to a session need to be inserted into a database? Might it be more efficient in files? Does that need to be distributed across multiple systems?
There are a lot of questions left in my answer. For that, you need to better express or even define the requirements beyond the technical overview you've given.
Update: There are many possible solutions for this. Here are two:
Write those to a single table (saved_searches or such) that has an incrementing search id, e.g. (search_id unsigned bigint, item_id char(20), primary key (search_id, item_id)). Bonus points for inserting your keys in sorted order: that will really limit fragmentation, keep each search clustered, and free up pages in a roughly sequential order. It's almost a rolling table, and that's about the best case for doing great amounts of insertions and deletions. In that circumstance, you pay a cost for insertion, and double that cost for deletion. You must also iterate the entire search result.
If your search items have an incrementing primary id such that any new insertion to the database will have a higher value than anything that is already in the database, that is the most efficient. Alternatively, inserting a datestamp would achieve the same effect with less efficiency (every row must actually be checked in a query instead of just the index entries). If you take note of that maximum id, and you don't delete records, then you can save searches that use zero space by always setting a maximum id on the saved query.
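A rough sketch of that second idea, with made-up table and column names and SQLite standing in for MySQL (the real filtering would of course run against Solr/MySQL):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE items (id INTEGER PRIMARY KEY AUTOINCREMENT, body TEXT);
        CREATE TABLE saved_searches (
            search_id INTEGER PRIMARY KEY AUTOINCREMENT,
            query     TEXT NOT NULL,
            max_id    INTEGER NOT NULL   -- highest items.id when the search was saved
        );
    """)

    def save_search(query):
        """Save a search as (query, max item id): one row, not 500k."""
        max_id = conn.execute("SELECT COALESCE(MAX(id), 0) FROM items").fetchone()[0]
        cur = conn.execute(
            "INSERT INTO saved_searches (query, max_id) VALUES (?, ?)", (query, max_id)
        )
        return cur.lastrowid

    def rerun_search(search_id):
        """Re-run a saved search, ignoring anything added after it was saved."""
        query, max_id = conn.execute(
            "SELECT query, max_id FROM saved_searches WHERE search_id = ?", (search_id,)
        ).fetchone()
        # Illustrative only: a LIKE match capped at the saved maximum id.
        return conn.execute(
            "SELECT id FROM items WHERE body LIKE ? AND id <= ?", (f"%{query}%", max_id)
        ).fetchall()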

What to use to save important generated data as-is? PDF/XML/DB or other?

My question is about generating invoices and receipts. These bills use rates, names, and values from a database. If the source data stays unchanged, one can generate the same receipt dynamically each time. However, since names, rates, and values may be changed or removed, a dynamically generated receipt is only guaranteed to be correct at the time it is first created. So after creation the receipts are stored, currently as PDFs, with the relative path to each PDF stored in a DB table. That way one can be reasonably sure about the validity of the receipt. Is this the best way to go about it, or should I save the data in a DB, XML, or some other format? How do you people do it?
You could store the PDF in the database in a BLOB column, if that helps. Your way seems good enough to me, unless the number of files is growing unmanageably large.
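If you do go the BLOB route, it is only a few lines; a minimal sketch with SQLite standing in for whatever database you use (file and table names are made up):

    import sqlite3

    conn = sqlite3.connect("receipts.db")            # illustrative database name
    conn.execute("""
        CREATE TABLE IF NOT EXISTS receipts (
            id      INTEGER PRIMARY KEY,
            created TEXT NOT NULL,
            pdf     BLOB NOT NULL
        )
    """)

    with open("invoice_0001.pdf", "rb") as f:        # a previously rendered receipt
        conn.execute(
            "INSERT INTO receipts (created, pdf) VALUES (datetime('now'), ?)",
            (f.read(),),
        )
    conn.commit()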

Gathering pageviews MySQL layout

Hey, does anyone know the proper way to set up a MySQL database to gather pageviews? I want to gather these pageviews to display in a graph later. I have a couple ways mapped out below.
Option A:
Would it be better to count pageviews each time someone visits the site and create a new row for every pageview, with a timestamp? So, 50,000 views = 50,000 rows of data.
Option B:
Count the pageviews per day and have one row that holds the count; every time someone visits the site, the count goes up. So, 50,000 views = 1 row of data per day, and every day a new row is created.
Is either of the options above the correct way of doing what I want, or is there a better, more efficient way?
Thanks.
Option C would be to parse the access logs from the web server. No extra storage is needed, all sorts of extra information is captured, and even requests for images and JavaScript files are recorded.
However, if you just want to track visits to pages where you run your own code, I'd definitely go for Option A, unless you're expecting extreme amounts of traffic on your site.
That way you can create overviews per hour of the day, and store more information than just the timestamp (like the visited page, the user's browser, etc.). You might not need that now, but later on you might thank yourself for not losing that information.
If at some point the table grows too large, you can always think of ways to deal with that.
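To make Option A concrete, here is a minimal sketch with SQLite standing in for MySQL (table and column names are just examples), including the hourly roll-up you would feed to the graph:

    import sqlite3
    from datetime import datetime, timezone

    conn = sqlite3.connect("pageviews.db")
    conn.execute("""
        CREATE TABLE IF NOT EXISTS pageviews (
            id         INTEGER PRIMARY KEY,
            viewed_at  TEXT NOT NULL,     -- ISO timestamp of the request
            page       TEXT NOT NULL,
            user_agent TEXT
        )
    """)

    def log_view(page, user_agent=None):
        """One row per pageview (Option A)."""
        conn.execute(
            "INSERT INTO pageviews (viewed_at, page, user_agent) VALUES (?, ?, ?)",
            (datetime.now(timezone.utc).isoformat(), page, user_agent),
        )
        conn.commit()

    def views_per_hour():
        """Aggregate the raw rows into hourly buckets for graphing."""
        return conn.execute("""
            SELECT substr(viewed_at, 1, 13) AS hour_bucket, COUNT(*) AS views
            FROM pageviews
            GROUP BY hour_bucket
            ORDER BY hour_bucket
        """).fetchall()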
If you care about how your pageviews vary with time in a day, option A keeps that info (though you might still do some bucketing, say per-hour, to reduce overall data size -- but you might do that "later, off-line" while archiving all details). Option B takes much less space because it throws away a lot of info... which you might or might not care about. If you don't know whether you care, I think that, in doubt, you should keep more data rather than less -- it's reasonably easy to "summarize and archive" overabundant data, but it's NOT at all easy to recover data you've aggregated away;-). So, aggregating is riskier...
If you do decide to keep abundant per-day data, one strategy is to use multiple tables, say one per day; this will make it easiest to work with old data (summarize it, archive it, remove it from the live DB) without slowing down current "logging". So, say, pageviews for May 29 would be in PV20090529 -- a different table than the ones for the previous and next days (this does require dynamic generation of the table name, or creative uses of ALTER VIEW e.g. in cron-jobs, etc -- no big deal!). I've often found such "sharding approaches" to have excellent (and sometimes unexpected) returns on investment, as a DB scales up beyond initial assumptions, compared to monolithic ones...
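A tiny sketch of the dynamic table-name part of that idea (the PV prefix comes from this answer; the rest is illustrative):

    from datetime import date

    def pageview_table(day=None):
        """Return the per-day table name, e.g. PV20090529."""
        return (day or date.today()).strftime("PV%Y%m%d")

    # The logging code (or a nightly cron job) would then run MySQL statements such as
    #   CREATE TABLE IF NOT EXISTS PV20090530 LIKE PV20090529;
    #   INSERT INTO PV20090529 (...) VALUES (...);
    # so old tables can be summarized, archived, and dropped without touching today's table.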
If you do decide to keep abundant per-day data, one strategy is to use multiple tables, say one per day; this will make it easiest to work with old data (summarize it, archive it, remove it from the live DB) without slowing down current "logging". So, say, pageviews for May 29 would be in PV20090529 -- a different table than the ones for the previous and next days (this does require dynamic generation of the table name, or creative uses of ALTER VIEW e.g. in cron-jobs, etc -- no big deal!). I've often found such "sharding approaches" to have excellent (and sometimes unexpected) returns on investment, as a DB scales up beyond initial assumptions, compared to monolithic ones...