What is the most efficient way to read and set file modification timestamps for MVS data sets and PDS members?

I am trying to access data set and member metadata in MVS and cannot find a mechanism for getting and setting modification times (OK, and RACF rules, but that's not important right now). One of our (many) goals is to reconcile timestamps in USS with an analogous value in MVS when files are deployed.
The obvious mechanism is to use LISTCAT from TSO, but that only shows the creation year.day (so today is 19.294). It is horrifically slow when I have to scan thousands of files for recent modifications. I am working in a C environment, which has the ability to embed System/360 assembler instructions. The z/OS C/C++ library standard calls, like fstat/stat, do not support MVS files or PDS members.
There are hints in the PDS utility documentation that ISPF sometimes sets modification times in the user area of PDS directories and there are hints that DSCB format 1 is used, but we have not been able to verify this, and the field definitions for that format do not describe modification timestamps.

PDS members are part of a single data set, and the reason you are getting mixed indications is that nothing in the data set itself definitively records such a timestamp.
By default a PDS does not have such a field on a per-member basis. ISPF utilities use the user data field of the directory entry (each directory entry holds per-member information) to record this, but only for members that are edited with ISPF or through the ISPF API (as per LMMSTATS).
Note that the ISPF statistics are not sacrosanct and take up directory space; they can be removed, for example, using ISPF option 3.5 (a common attempted fix for D37 or E37 abends).
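For illustration, here is a rough sketch (in Python; transliterate to C as needed) of pulling those ISPF statistics out of raw directory data. It assumes you can already read the 256-byte directory blocks, for example from your C code by opening the data set without a member name, and that the user data follows the commonly published ISPF statistics layout; verify these offsets on your system before relying on them.

```python
# Sketch only: decode ISPF statistics from raw PDS directory blocks.
# Offsets follow the commonly published ISPF layout -- verify locally.

def packed_julian(b):
    """Decode a 4-byte packed Julian date of the form X'0cyydddF'."""
    digits = b.hex()                      # e.g. b'\x01\x19\x29\x4F' -> '0119294f'
    century = int(digits[1])              # 0 = 19xx, 1 = 20xx
    year = 1900 + century * 100 + int(digits[2:4])
    return year, int(digits[4:7])         # (year, day of year)

def directory_entries(block):
    """Yield (member_name, user_data) pairs from one 256-byte directory block."""
    used = int.from_bytes(block[0:2], "big")   # bytes used, incl. this halfword
    pos = 2
    while pos < used:
        name = block[pos:pos + 8]
        if name == b"\xFF" * 8:                # end-of-directory marker
            return
        c = block[pos + 11]                    # flag byte after the 3-byte TTR
        udlen = (c & 0x1F) * 2                 # user data length (halfwords * 2)
        yield name.decode("cp037").strip(), block[pos + 12:pos + 12 + udlen]
        pos += 12 + udlen

def ispf_mod_stamp(user_data):
    """Return (year, day, hour, minute) of last modification, if ISPF stats exist."""
    if len(user_data) < 30:               # ISPF stats occupy 30 (or 40) bytes;
        return None                       # heuristic only, other tools use this area too
    year, day = packed_julian(user_data[8:12])      # modification date
    hh, mm = user_data[12], user_data[13]           # packed BCD hour and minute
    return year, day, (hh >> 4) * 10 + (hh & 0x0F), (mm >> 4) * 10 + (mm & 0x0F)
```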
If SMF type 42 records are captured, they may be more indicative, but still not all-encompassing, as the information is only recorded when a STOW (update directory) is issued, explicitly or implicitly. Most programs that update, create or delete members should issue a STOW; however, some utilities may not.
You may be interested in subtypes 20, 21, 24 and 25 (22 and 23 are DFSMS-related).

Related

Optimal DB structure for creating user segments

I want to create a segmentation engine and can't seem to figure out the optimal DB or DB structure to use for the task.
Currently I use MySQL as my primary DB, but the segmentation engine is a separate software component and thus can have a different DB if applicable.
Basically I have 10 million unique users identified by UserID (integer). An administrator of the segmentation engine dynamically creates segments with some predefined rules (like age range, geolocation, transaction history, etc.). The application should execute the rules of each segment periodically (once every 15 minutes) to get the current list of all users that belong to the segment (which can be up to 1 million users each) and store it.
Later, the application exposes an API to allow external systems to use the segmentation functionality, namely:
1. Get list of all segments that a particular UserID belongs to.
2. Get list of all UserIDs that a particular segment contains.
Note that because segments need to be updated very frequently (every 15 minutes), this causes massive transaction volume in the DB to "maintain" the segments, where no-longer-applicable users must be removed and new ones added all the time.
I have considered several approaches so far:
1. Plain MySQL, where I have a table of users belonging to segments (SegmentID, UserID). (This approach has two drawbacks: storage space, and constant delete/insert/update activity in MySQL, which will degrade InnoDB performance by introducing page splitting.)
2. Using the JSON data type in MySQL, where I can have a table (UserID, Segments), where Segments is a JSON array of SegmentIDs. (Drawbacks here are slow search and slow updates.)
3. Using Redis with sets (UserID, Segments), where UserID is the key and Segments is the set of SegmentIDs. (The drawback here is that there is no simple way to search by SegmentID; see the sketch below for a reverse-index workaround.)
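For illustration, a minimal sketch of how approach 3 could also maintain a reverse index (segment -> users), which removes the search-by-SegmentID drawback. It assumes the redis-py client; the key naming scheme ("user:<id>:segments", "segment:<id>:users") is illustrative, not a required convention.

```python
import redis

r = redis.Redis()

def apply_segment_result(segment_id, new_user_ids):
    """Replace one segment's membership (a set of ints) and keep per-user sets in sync."""
    seg_key = f"segment:{segment_id}:users"
    current = {int(u) for u in r.smembers(seg_key)}
    added, removed = new_user_ids - current, current - new_user_ids

    pipe = r.pipeline()
    for uid in added:
        pipe.sadd(seg_key, uid)
        pipe.sadd(f"user:{uid}:segments", segment_id)
    for uid in removed:
        pipe.srem(seg_key, uid)
        pipe.srem(f"user:{uid}:segments", segment_id)
    pipe.execute()

# Both API calls then become single set reads:
#   r.smembers("user:42:segments")   -> all segments a user belongs to
#   r.smembers("segment:7:users")    -> all users in a segment
```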
Has anyone worked on a similar task and can provide any guidance?
Any feedback will be appreciated, so I can be pointed in a direction I can research further.
I think you can use Elasticsearch for this task.
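To expand on that a little, here is a rough sketch of the Elasticsearch approach over its plain REST API (using Python's requests here; the index name, field names and endpoints are illustrative, and client-library syntax varies by Elasticsearch version):

```python
import requests

ES = "http://localhost:9200"

# Store each user as a document whose "segments" field is an array of segment IDs.
requests.put(f"{ES}/users/_doc/42", json={"user_id": 42, "segments": [3, 7, 19]})

# 1. All segments a user belongs to: fetch the document.
segments = requests.get(f"{ES}/users/_doc/42").json()["_source"]["segments"]

# 2. All users in a segment: a term query on the array field.
hits = requests.post(f"{ES}/users/_search",
                     json={"query": {"term": {"segments": 7}},
                           "_source": ["user_id"], "size": 10000}).json()
user_ids = [h["_source"]["user_id"] for h in hits["hits"]["hits"]]
```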

How to store historic time series data

We're storing a bunch of time series data from several measurement devices.
All devices may provide different dimensions (energy, temp, etc.)
Currently we're using MySQL to store all this data in different tables (according to the dimension) in the format
idDevice, DateTime, val1, val2, val3
We're also aggregating this data from min -> Hour -> Day -> Month -> Year each time we insert new data
This is running quite fine, but we're running out of disk space as we are growing and in general I doubt that a RDBMS is the right answer to keep archive data.
So we're thinking of moving old/cold data on Amazon S3 and write some fancy getter that can recieve this data.
So here my question comes: what could be a good data format to support the following needs:
The data must be extensible: once in a while a device will provide more data than it did in the past -> the count of rows can grow/increase
The data must be updated. When a customer delivers historic data, we need to be able to update that for the past.
We're using PHP -> would be nice to have connectors/classes :)
I've had a look at HDF5, but it seems there is no PHP lib.
We're also willing to have a look at cloud-based time-series databases.
Thank you in advance!
B
You might consider moving to a dedicated time-series database. I work for InfluxDB and our product meets most of your requirements right now, although it is still pre-1.0 release.
We're also aggregating this data from min -> Hour -> Day -> Month -> Year each time we insert new data
InfluxDB has built-in tools to automatically downsample and expire data. All you do is write the raw points and set up a few queries and retention policies, InfluxDB handles the rest internally.
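As a hedged example of what that setup looks like (database, measurement and policy names are made up, and the InfluxQL shown is the 0.9-era syntax, issued through the influxdb Python client; check it against the release you deploy):

```python
from influxdb import InfluxDBClient

client = InfluxDBClient("localhost", 8086, "root", "root", "metering")

# Keep raw points for 30 days (default policy), hourly roll-ups for ~5 years.
client.query('CREATE RETENTION POLICY "raw_30d" ON "metering" '
             'DURATION 30d REPLICATION 1 DEFAULT')
client.query('CREATE RETENTION POLICY "hourly_5y" ON "metering" '
             'DURATION 260w REPLICATION 1')

# Continuous query: downsample raw energy readings into the long-lived policy.
client.query('CREATE CONTINUOUS QUERY "cq_energy_1h" ON "metering" BEGIN '
             'SELECT MEAN("value") AS "value" '
             'INTO "metering"."hourly_5y"."energy" '
             'FROM "energy" GROUP BY time(1h), "device" END')
```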
The data must be extensible: once in a while a device will provide more data than it did in the past -> the count of rows can grow/increase
As long as historic writes are fairly infrequent they are no problem for InfluxDB. If you are frequently writing in non-sequential data the write performance can slow down, but only while the non-sequential points are being replicated.
InfluxDB is not quite schema-less, but the schema cannot be pre-defined, and is derived from the points inserted. You can add new tags (metadata) or fields (metrics) simply by writing a new point that includes them, and you can automatically compose or decompose series by excluding or including the relevant tags when querying.
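For instance, a single write carrying both tags and fields is enough to establish them; the "schema" is just whatever you write (again via the influxdb Python client, with every name below purely illustrative):

```python
from influxdb import InfluxDBClient

client = InfluxDBClient("localhost", 8086, "root", "root", "metering")
client.write_points([{
    "measurement": "energy",
    "tags": {"device": "meter-0017", "site": "plant-a"},   # new tags: just write them
    "fields": {"value": 42.7, "val2": 3.1, "val3": 0.8},   # new fields: likewise
    "time": "2015-10-21T14:30:00Z",
}])
```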
The data must be updated. When a customer delivers historic data, we need to be able to update that for the past.
InfluxDB silently overwrites points when a new matching point comes in. (Matching means same series and timestamp, to the nanosecond)
We're using PHP -> would be nice to have connectors/classes :)
There are a handful of PHP libraries out there for InfluxDB 0.9. None are officially supported but likely one fits your needs enough to extend or fork.
You haven't specified your requirements in enough detail.
Do you care about latency? If not, just write all your data points to per-interval files in S3, then periodically collect them and process them. (No Hadoop needed, just a simple script downloading the new files should usually be plenty fast enough.) This is how logging in S3 works.
The really nice part about this is you will never outgrow S3 or do any maintenance. If you prefix your files correctly, you can grab a day's worth of data or the last hour of data easily. Then you do your day/week/month roll-ups on that data, then store only the roll-ups in a regular database.
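A minimal sketch of that layout with boto3 (the bucket name and key scheme are hypothetical):

```python
import json
import boto3

s3 = boto3.client("s3")
BUCKET = "my-metering-archive"

def write_interval(device_id, day, hour, points):
    """One object per device per hour, prefixed by date."""
    key = f"raw/{day}/{hour:02d}/{device_id}.json"
    s3.put_object(Bucket=BUCKET, Key=key, Body=json.dumps(points))

def read_day(day):
    """Collect every object for one day, e.g. for the daily roll-up job."""
    resp = s3.list_objects_v2(Bucket=BUCKET, Prefix=f"raw/{day}/")
    for obj in resp.get("Contents", []):        # paginate if >1000 objects
        body = s3.get_object(Bucket=BUCKET, Key=obj["Key"])["Body"].read()
        yield obj["Key"], json.loads(body)

# write_interval("device42", "2015-10-21", 14, [{"t": "14:05", "val1": 3.2}])
# for key, points in read_day("2015-10-21"): ...
```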
Do you need the old data at high resolution? You can use Graphite to roll up your data automatically. The downside is that it loses resolution as it ages. But the upside is that your data is a fixed size and never grows, and writes can be handled quickly. You can even combine the above approach and send data to Graphite for quick viewing, but keep the data in S3 for other uses down the road.
I haven't researched the various TSDBs extensively, but here is a nice HN thread about it. InfluxDB is nice, but new. Cassandra is more mature, but the tooling to use it as a TSDB isn't all there yet.
How much new data do you have? Most tools will handle 10,000 datapoints per second easily, but not all of them can scale beyond that.
I'm with the team that develops Axibase Time-Series Database. It's a non-relational database that allows you to efficiently store timestamped measurements with various dimensions. You can also store device properties (id, location, type, etc) in the same database for filtering and grouped aggregations.
ATSD doesn't delete raw data by default. Each sample takes 3.5+ bytes per tuple: time:value. Period aggregations are performed at request time and the list of functions includes MIN, MAX, AVG, SUM, COUNT, PERCENTILE(n), STANDARD_DEVIATION, FIRST, LAST, DELTA, RATE, WAVG, WTAVG, as well as some additional functions for computing threshold violations per period.
Backfilling historical data is fully supported except that the timestamp has to be greater than January 1, 1970. Time precision is milliseconds or seconds.
As for deployment options, you could host this database on AWS. It runs on most Linux distributions. We could run some storage efficiency and throughput tests for you if you want to post sample data from your dataset here.

Could it make sense to schedule an export of SQL database to NoSQL for graphical data mining?

Would it make sense for me to schedule an export of my SQL database to a graph database (such as Neo4j) in order to generate interactive graphics of relationships such as here?
UPDATE: Or, by extension, should I even be looking to move over to a graph database altogether?
My graphical database would not need to be a live reflection of the relational database - an extract every few days would be more than sufficient.
In my case, I currently have a relational database (MySQL) where I’m recording stock items as they pass between individuals/depots. The concept is as follows:
Items:
STOCKID DISPATCHDATE
0001 2014-01-01
0002 2015-06-03
Individuals:
USERID FIRSTNAME
0001 Tom
0002 Jones
Depots:
DEPOTID ZIPCODE
0001 50421
0002 71028
Owners:
STOCK_ID USER_ID RECEIVED DISPATCHED
0001 0001 2015-05-01 2015-05-10
0001 0002 2015-05-11 2015-05-20
From the NoSQL database I would like to be able to visually see things such as:
The flow of which people an item has passed through (and dates of each relationship)
Which items are at each individual/depot (on a given date)
Which individuals are at which depots (on a given date)
As N.B. says in the comments, if the tool is useful then use it - worst case is you find that the tool isn't useful after all and you stop using it (having wasted some time in setting it up, but such is life).
In general, there are three ways to sync the database:
Two Phase Commit: modify MySql in one transaction, modify Neo4j in another transaction, if either transaction fails then you roll back both transactions; the transactions don't commit until both signal that they can be committed. This provides the highest data integrity but is very expensive.
Loosely synchronized transactions: modify MySql in one transaction, modify Neo4j in another transaction, if one succeeds and the other fails then retry the failed transaction a few times, and if it still fails then decide what to do (e.g. undo the successful transaction, which is complicated by the fact that the transaction has already committed and the values may have been used; or log the error and ask for a database administrator to manually sync the databases; or third option). This offers decent data integrity and is cheaper than two phase commit, but is more difficult to recover from if something goes horribly wrong.
Batch synchronization: modify MySql, and then after a time interval (five minutes, an hour, whatever's appropriate) you sync the changes with Neo4j based on a row version number or a timestamp (note that it's not much of a problem if you sync a bit too much data since you'll just be overwriting a value with the same value, so err on the side of syncing too much per batch). This solution is easy to program, and is appropriate if Neo4j doesn't need the latest and greatest data.
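A minimal sketch of that third option, keyed on an updated_at column (hypothetical) and assuming mysql-connector-python plus the official neo4j Python driver; the table, labels and relationship names are illustrative, not a recommended schema. MERGE keeps the sync idempotent, which is why over-syncing a batch is harmless, as noted above.

```python
import mysql.connector
from neo4j import GraphDatabase

mysql_conn = mysql.connector.connect(user="app", password="secret", database="stock")
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "secret"))

def sync_owners_since(last_sync):
    cur = mysql_conn.cursor(dictionary=True)
    cur.execute("SELECT stock_id, user_id, received, dispatched "
                "FROM owners WHERE updated_at > %s", (last_sync,))
    with driver.session() as session:
        for row in cur.fetchall():
            # MERGE is idempotent, so re-syncing the same row is harmless.
            session.run(
                "MERGE (i:Item {id: $stock_id}) "
                "MERGE (u:User {id: $user_id}) "
                "MERGE (u)-[h:HELD]->(i) "
                "SET h.received = $received, h.dispatched = $dispatched",
                **row)
```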
I worked on a similar project where we were syncing MySql with a key-value NoSQL database (caching expensive queries), using loosely synchronized transactions. We wrote a customized Transaction wrapper that contained a concurrent queue of side-effects (i.e. changes to be made to the key-value database). If the MySql transaction succeeded, we committed all of the side-effects in the queue to the key-value database, with three retries in the case of transient network failure; after that we logged the error, invalidated the key-value database entry (which would result in a fallback to MySql) and notified a database admin. That happened one time, when the key-value database crashed for an extended period, and was solved by running a batch synchronization. If the MySql transaction failed, we discarded the queued side-effects.
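A stripped-down sketch of that wrapper idea, for illustration only (no particular database client assumed; the retry count and backoff are arbitrary):

```python
import logging, queue, time

class TransactionWithSideEffects:
    """Queue secondary-store changes during a SQL transaction; flush only on commit."""

    def __init__(self, sql_conn, retries=3):
        self.sql_conn = sql_conn
        self.retries = retries
        self.side_effects = queue.Queue()      # thread-safe queue of callables

    def add_side_effect(self, fn):
        self.side_effects.put(fn)

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        if exc_type:                           # SQL work failed: discard side-effects
            self.sql_conn.rollback()
            return False
        self.sql_conn.commit()                 # SQL committed: now flush the queue
        while not self.side_effects.empty():
            fn = self.side_effects.get()
            for attempt in range(self.retries):
                try:
                    fn()
                    break
                except Exception:
                    time.sleep(2 ** attempt)   # simple exponential backoff
            else:                              # all retries failed
                logging.error("side effect failed; invalidate the entry and alert an admin")
        return False
```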
I think before starting with the migration there are some questions worth asking yourself:
Can I do the graphical representation without migrating/adding a new data source (using MySQL)?
What degree of efficiency do I want when using such graphical interface?
How easy would it be, in case, to add a new data source?
What you see in that video is done by a visual component on top of data from either databases or flat files, so I'd say the answer to the first question is likely to be yes.
Depending on how many people, and what kind of user, will be using such a graphical representation (internal or external, analyst or not, etc.), this can be another driver for the decision.
As for the third question, without duplicating the other answer, I think Zim-Zam O'Pootertoot has already covered it. As usual, with multiple data sources the problems are keeping everything in sync and entity resolution (which you minimise by using the same dataset).
In my experience what Neo4j is very good at is "pattern" querying: given a specific network pattern (expressed in the Cypher language) it will find it in the network dataset.
When it comes to neighbour queries, an SQL solution can, in small projects, achieve the same result without too many problems. Of course, if your solution has to scale to hundreds of analysts and hundreds of thousands of queries per day, consider moving.
Anyway, given your dataset, it looks to me like you are working with time-based data. In this kind of scenario it could be worth looking at the dynamic behaviour of your network to find temporal patterns as well, not just structural ones.
From the same author of the video you've posted, also have a look at this other graphical representation.
If you want to model a time-based graph, just note that there's no bulletproof solution with any data source yet.
Here's a Neo4j tutorial on how to model and represent the data for a time-based dataset.
I bet you can do similar things with MySQL too (probably with less efficiency and elegance in querying), but I haven't done it myself so I can't give numbers - maybe someone else has and can add some benchmarks here.
Disclaimer: I work in the KeyLines team.

Should id or timestamp be used to determine the creation order of rows within a database table? (given possibility of incorrectly set system clock)

A database table is used to store editing changes to a text document.
The database table has four columns: {id, timestamp, user_id, text}
A new row is added to the table each time a user edits the document. The new row has an auto-incremented id, and a timestamp matching the time the data was saved.
To determine what editing changes a user made during a particular edit, the text from the row inserted in response to his or her edit is compared to the text in the previously inserted row.
To determine which row is the previously inserted row, either the id column or the timestamp column could be used. As far as I can see, each method has advantages and disadvantages.
Determining the creation order using id
Advantage: Immune to problems resulting from incorrectly set system clock.
Disadvantage: Seems to be an abuse of the id column since it prescribes meaning other than identity to the id column. An administrator might change the values of a set of ids for whatever reason (eg. during a data migration), since it ought not matter what the values are so long as they are unique. Then the creation order of rows could no longer be determined.
Determining the creation order using timestamp
Advantage: The id column is used for identity only, and the timestamp is used for time, as it ought to be.
Disadvantage: This method is only reliable if the system clock is known to have been correctly set each time a row was inserted into the table. How could one be convinced that the system clock was correctly set for each insert? And how could the state of the table be fixed if ever it was discovered that the system clock was incorrectly set for a not precisely known period in the past?
I seek a strong argument for choosing one method over the other, or a description of another method that is better than the two I am considering.
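For concreteness, the "previously inserted row" lookup under each method might look like this (sqlite3 is used purely for illustration; the table name "revisions" is hypothetical, the columns are the ones described above):

```python
import sqlite3

conn = sqlite3.connect("edits.db")

def previous_by_id(edit_id):
    return conn.execute(
        "SELECT text FROM revisions WHERE id < ? ORDER BY id DESC LIMIT 1",
        (edit_id,)).fetchone()

def previous_by_timestamp(ts):
    # Only correct if the clock never ran backwards between inserts.
    return conn.execute(
        "SELECT text FROM revisions WHERE timestamp < ? "
        "ORDER BY timestamp DESC LIMIT 1",
        (ts,)).fetchone()
```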
Using the sequential id would be simpler as it's probably(?) a primary key and thus indexed and quicker to access. Given that you have user_id, you can quickly ascertain the last and prior edits.
Using the timestamp is also applicable, but it's likely to be a longer entry, we don't know if it's indexed at all, and there is the potential for collisions. You rightly point out that system clocks can change, whereas sequential ids cannot.
Given your update:
As it's difficult to see what your exact requirements are, I've included this as evidence of what a particular project required for 200K+ complex documents and millions of revisions.
From my own experience building a fully auditable doc/profiling system for an internal team of more than 60 full-time researchers: we ended up using both an id and a number of other fields (including timestamp) to provide audit trailing and full versioning.
The system we built has more than 200 fields for each profile, and thus versioning a document was far more complex than just storing a block of changed text/content for each one; yet each profile could be edited, approved, rejected, rolled back, published and even exported as either a PDF or another format as ONE document.
What we ended up doing (after a lot of strategy/planning) was to store sequential versions of the profile, but they were keyed primarily on an id field.
Timestamps
Timestamps were also captured as a secondary check, and we made sure to keep system clocks accurate (across a cluster of servers) through cron scripts that checked time alignment regularly and corrected it where necessary. We also used ntpd to prevent clock drift.
Other captured data
Other data captured for each edit also included (but not limited to):
User_id
User_group
Action
Approval_id
There were also other tables that fulfilled internal requirements (including automatically generated annotations for the documents) - some of the profile editing was done using data from bots (built using NER/machine learning/AI), but approval by one of the team was required before edits/updates could be published.
An action log was also kept of all user actions, so that in the event of an audit, one could look at the actions of an individual user - even when they didn't have the permissions to perform such an action, it was still logged.
With regard to migration, I don't see it as a big problem, as you can easily preserve the id sequences when moving/dumping/transferring data. The only issue would perhaps be if you needed to merge datasets. You could always write a migration script in that event - so from a personal perspective I consider that disadvantage somewhat diminished.
It might be worth looking at the Stack Overflow table structures for their data explorer (which is reasonably sophisticated). You can see the table structure here: https://data.stackexchange.com/stackoverflow/query/new, which comes from a question on meta: How does SO store revisions?
As a revision system, SO works well and the markdown/revision functionality is probably a good example to pick over.
Use Id. It's simple and works.
The only caveat is if you routinely add rows from a store-and-forward server, so rows may be added later but should be treated as having been added earlier.
Or add another column whose sole purpose is to record the editing order. I suggest you do not use datetime for this.

Medium-term temporary tables - creating tables on the fly to last 15-30 days?

Context
I'm currently developing a tool for managing orders and communicating between technicians and services. The industrial context is broadcast and TV. Multiple clients expecting media files each made to their own specs imply widely varying workflows even within the restricted scope of a single client's orders.
One client can ask one day for a single SD file and the next for a full-blown HD package containing up to fourteen files... In a MySQL db I am trying to store accurate information about all the small tasks composing the workflow, in multiple forms:
DATETIME values every time a task is accomplished, for accurate tracking
paths to the newly created files in the company's file system in VARCHARs
archiving background info in TEXT values (info such as user comments, e.g. when an incident happens and prevents moving forward, they can comment about it in this feed)
Multiply that by 30 different file types and this is way too much for a single table. So I thought I'd break it up by client: one table per client, so that any order only ever requires the use of that one table, which doesn't manipulate more than 15 fields. Still, this is a pretty rigid solution when a client has 9 different transcoding specs and a particular order only requires one. I figure I'd need to add flag fields for each transcoding field to indicate which ones are required for that particular order.
Concept
I then had this crazy idea that maybe I could create a temporary table to last while the order is running (that can range from about 1 day to 1 month). We rarely have more than 25 orders running simultaneously so it wouldn't get too crowded.
The idea is to make a table tailored for each order, eliminating the need for flags and unnecessary, forever-empty fields. Once the order is complete, the table would get flushed, JSON-encoded, into a TEXT or BLOB so it can be restored later if changes need to be made.
Do you have experience with DBMSs (MySQL in particular) struggling under such practices, if this has ever been done? Does this sound like a viable option? I am happy to try (which I have already started), and I am seeking advice on whether to keep going or stop right here.
Thanks for your input!
Well, of course that is possible to do. However, you cannot use MySQL temporary tables for such long-term storage; you will have to use "normal" tables and have some clean-up routine...
However, I do not see why that amount of data would be too much for a single table. If your queries start to run slowly due to the volume of data, then you should add some indexes to your database. I also think there is another con: it will be much harder to build reports later on, because when you have 25 tables with the same kind of data, you will have to run 25 queries and merge the results.
I do not see the point, really. The same kinds of data should be in the same table.
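To make that concrete, here is a sketch of the single-table shape this suggests: one row per task rather than one column (or one table) per task type. Shown with Python's sqlite3 for brevity; every name is hypothetical and the DDL ports to MySQL with minor changes.

```python
import sqlite3

conn = sqlite3.connect("workflow.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS orders (
    id          INTEGER PRIMARY KEY,
    client      TEXT NOT NULL,
    created_at  TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE IF NOT EXISTS tasks (
    id            INTEGER PRIMARY KEY,
    order_id      INTEGER NOT NULL REFERENCES orders(id),
    task_type     TEXT NOT NULL,     -- e.g. 'transcode_hd', 'qc', 'delivery'
    completed_at  TIMESTAMP,         -- NULL until the task is done
    file_path     TEXT,              -- path in the company file system
    notes         TEXT               -- user comments / incident feed
);
CREATE INDEX IF NOT EXISTS idx_tasks_order ON tasks(order_id);
""")

# An order with 3 tasks gets 3 rows, one with 14 gets 14: no per-order tables,
# no flag columns, no forever-empty fields, and cross-client reporting stays
# a single query over one table.
```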