DFHJS2LS - generating JSON structure from COBOL copybook

I have my output structure in COBOL, from which I am trying to generate a JSON structure with DFHJS2LS (an IBM tool). All the fields are generated as required, which causes trouble when generating classes in .NET because not all the fields are always present.
Question: how and where (in COBOL or in DFHJS2LS) can I define fields as optional so that they are generated properly, avoiding a null pointer exception?

According to the documentation you can define your COBOL data items with...
data description OCCURS n TIMES
...and use mapping level 4.1 or higher and specify TRUNCATE-NULL-ARRAYS = ENABLED. There is a reference to "structured arrays" which I take to mean you would need to do something like...
05 Something OCCURS 1 TIMES.
   10 Something-Real PIC X(8).
...so you get...
"type":"array"
"maxItems":1
"minItems":0
"items":{ ... }
You could also specify mapping level 4.0 or higher and use...
data description OCCURS n TO m TIMES DEPENDING ON t
...to obtain...
"field-name":{
"type":"array",
"maxItems":m
"minItems":n
"items":{ ... }
}`
Mapping level is specified by...
//INPUT.SYSUT1 DD *
[...other control statements...]
MAPPING-LEVEL=4.3
[...other control statements...]
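Putting those pieces together, here is a sketch of what the copybook side might look like (data names are invented for illustration; the inline *> comments assume a compiler level that accepts them):
01 OUTPUT-AREA.
   05 ITEM-COUNT     PIC S9(9) COMP.
   *> ITEM-COUNT holds how many elements are present (0 or 1).
   05 OPTIONAL-ITEM  PIC X(8)
        OCCURS 0 TO 1 TIMES
        DEPENDING ON ITEM-COUNT.
At mapping level 4.0 or higher, DFHJS2LS should then describe OPTIONAL-ITEM with "minItems":0 and "maxItems":1, so the .NET side can treat the field as optional rather than required.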

Bioisosteric replacement using SMARTS (KNIME and RDKit)

I am trying to create a KNIME workflow that would accept a list of compounds and carry out bioisosteric replacements (we will use the following example here: carboxylic acid to tetrazole) automatically.
NOTE: I am using the following workflow as inspiration : RDKit-bioisosteres (myexperiment.org). This uses a text file as SMARTS input. I cannot seem to replicate the SMARTS format used here.
For this, I plan to use the Rdkit One Component Reaction node which uses a set of compounds to carry out the reaction on as input and a SMARTS string that defines the reaction.
My issue is the generation of a working SMARTS string describing the reaction.
I would like to input two SDF files (or another format, not particularly attached to SDF): one with the group to replace (carboxylic acid) and one with the list of possible bioisosteric replacements (tetrazole). I would then combine these two in KNIME and generate a SMARTS string for the reaction to then be used in the Rdkit One Component Reaction node.
NOTE: The input SDF files have the structures written with an
attachment point (*COOH for the carboxylic acid for example) which
defines where the group to replace is attached. I suspect this is the
cause of many of the issues I am experiencing.
So far, I can easily generate the reactions in RXN format using the Reaction Builder node from the Indigo node package. However, converting this reaction into a SMARTS string that is accepted by the Rdkit One Component Reaction node has proven tricky.
What I have tried so far:
1. Converting RXN to SMARTS (Molecule Type Cast node): gives the following error: scanner: BufferScanner::read() error
2. Converting the Source and Target molecules into SMARTS (Molecule Type Cast node): gives the following error: SMILES loader: unrecognised lowercase symbol: y
Showing this as a string in KNIME shows that the conversion is not carried out and the string is of SDF format: *filename*.sdf 0 0 0 0 0 0 0 V3000M V30 BEGIN etc.
3. Converting the Source and Target molecules into RDKit first (RDKit from Molecule node) then from RDKit into SMARTS (RDKit to Molecule node, SMARTS option). This outputs the following SMARTS strings:
Carboxylic acid : [#6](-[#8])=[#8]
Tetrazole : [#6]1:[#7H]:[#7]:[#7]:[#7]:1
This is as close as I've managed to get. I can then join these two SMARTS strings with >> in between (output: [#6](-[#8])=[#8]>>[#6]1:[#7H]:[#7]:[#7]:[#7]:1) to create a reaction SMARTS string, but this is not accepted as an input for the Rdkit One Component Reaction node.
Error message in KNIME console :
ERROR RDKit One Component Reaction 0:40 Creation of Reaction from SMARTS value failed: null
WARN RDKit One Component Reaction 0:40 Invalid Reaction SMARTS: missing
Note that the SMARTS strings that this last option (3.) generates are very different from the ones used in the myexperiments.org example ([*:1][C:2]([OH])=O>>[*:1][C:2]1=NNN=N1). I also seem to have lost the attachment point information through these conversions, which is likely to cause issues in the rest of the workflow.
Therefore I am looking for a way to generate the SMARTS strings used in the myexperiments.org example on my own sets of substituents. Obviously doing this by hand is not an option. I would also like this workflow to use only the open-source nodes available in KNIME and not proprietary nodes (Schrodinger etc.).
Hopefully, someone can help me out with this. If you need my current workflow I am happy to upload that with the source files if required.
Thanks in advance for your help,
Stay safe and healthy!
-Antoine
What you're describing is template generation, which has long been an active area of work in reaction prediction and retrosynthesis in cheminformatics.
I'm not particularly familiar with KNIME myself, though I know RDKit extensively. Your last option (3) is closest to what I'd consider a usable workflow. The way I would do this (sketched in code after the following steps):
Load the substitution pair molecules from SDF into RDKit mol objects.
Export these RDKit mol objects as SMARTS strings using rdkit.Chem.MolToSmarts().
Concatenate these strings into the form before_substructure>>after_substructure to generate a reaction SMARTS string.
Load this SMARTS string into a reaction object with rxn = rdkit.Chem.AllChem.ReactionFromSmarts().
Use the rxn.RunReactants() method to generate your bioisosterically substituted products.
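In plain RDKit (outside KNIME) those five steps come out to something like the following sketch; the SDF file names are placeholders, and benzoic acid stands in for an input molecule:
from rdkit import Chem
from rdkit.Chem import AllChem
# Steps 1-2: load the substitution pair from SDF and export as SMARTS.
before = next(Chem.SDMolSupplier("carboxylic_acid.sdf"))  # group to replace
after = next(Chem.SDMolSupplier("tetrazole.sdf"))         # replacement group
# Step 3: concatenate into a reaction SMARTS string.
rxn_smarts = Chem.MolToSmarts(before) + ">>" + Chem.MolToSmarts(after)
# Step 4: build the reaction object.
rxn = AllChem.ReactionFromSmarts(rxn_smarts)
# Step 5: run it over an input molecule (benzoic acid as an example).
# Without atom maps (discussed below) the attachment to the rest of the
# molecule is not preserved in the products.
mol = Chem.MolFromSmiles("OC(=O)c1ccccc1")
for products in rxn.RunReactants((mol,)):
    for product in products:
        Chem.SanitizeMol(product)  # reaction products come back unsanitized
        print(Chem.MolToSmiles(product))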
The error you quote for the RDKit One Component Reaction node input cuts off just before the important information, unfortunately. Running rdkit.Chem.AllChem.ReactionFromSmarts("[#6](-[#8])=[#8]>>[#6]1:[#7H]:[#7]:[#7]:[#7]:1") produces no errors for me locally, which leads me to believe this is specific to the KNIME node functionality.
Note that the difference between [#6](-[#8])=[#8] and [*:1][C:2]([OH])=O is relatively minimal: the former represents an O-C=O substructure, the latter represents a ~COOH group. Within the square brackets of the latter, the :num refers to an optional 'atom map' number, which allows a one-to-one mapping of reactant and product atoms. For example, [C:1][C:3].[C:2][C:4]>>[C:1][C:3][C:4][C:2] allows you to track which carbon is which during a reaction, for situations where it may matter. The token [*:1] means "any atom" and is equivalent to a wavy line in organic chemistry (and it is mapped as atom 1).
There are only two situations I can think of where [#6](-[#8])=[#8] and [*:1][C:2]([OH])=O might differ:
You have methanoic acid as a potential input for substitution (former will match, latter might not - I can't remember how implicit hydrogens are treated in this situation)
Inputs are over/under protonated. (COO- != COOH)
Converting these reaction SMARTS to RDKit reaction objects and running them on input molecule objects should potentially create a number of substituted products. Note: Typically, in extensive projects, there will be some SMARTS templates that require some degree of manual intervention - indicating attachment points, specifying explicit hydrogens, etc. If you need any help or have any questions don't hesitate to drop a comment and I'll do my best to help with specifics.
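As a quick sanity check of that difference: the atom-mapped template quoted from the myexperiments.org example carries the scaffold through [*:1] into the product, which the unmapped template cannot do. Benzoic acid again as a stand-in input:
from rdkit import Chem
from rdkit.Chem import AllChem
# [*:1] is "any atom"; the map number carries it into the product.
mapped = AllChem.ReactionFromSmarts("[*:1][C:2]([OH])=O>>[*:1][C:2]1=NNN=N1")
mol = Chem.MolFromSmiles("OC(=O)c1ccccc1")  # benzoic acid
for products in mapped.RunReactants((mol,)):
    p = products[0]
    Chem.SanitizeMol(p)          # products come back unsanitized
    print(Chem.MolToSmiles(p))   # a phenyl tetrazole: the ring is retained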

Extract tokens from grammar

I have been working through the Advent of Code problems in Perl 6 this year and was attempting to use a grammar to parse Day 3's input.
Given input in this form: #1 @ 1,3: 4x4 and this grammar that I created:
grammar Claim {
    token TOP {
        '#' <id> \s* '@' \s* <coordinates> ':' \s* <dimensions>
    }
    token digits {
        <digit>+
    }
    token id {
        <digits>
    }
    token coordinates {
        <digits> ',' <digits>
    }
    token dimensions {
        <digits> 'x' <digits>
    }
}
say Claim.parse('#1 @ 1,3: 4x4');
I am interested in extracting the actual tokens that were matched i.e. id, x + y from coordinates, and height + width from the dimensions from the resulting parse. I understand that I can pull them from the resulting Match object of Claim.parse(<input>), but I have to dig down through each grammar production to get the value I need e.g.
say $match<id>.hash<digits>.<digit>;
this seems a little messy, is there a better way?
For the particular challenge you're solving, using a grammar is like using a sledgehammer to crack a nut.
Like @Scimon says, a single regex would be fine. You can keep it nicely readable by laying it out appropriately. You can name the captures and keep them all at the top level:
/ ^
  '#' $<id>=(\d+) ' '
  '@ ' $<x>=(\d+) ',' $<y>=(\d+)
  ': ' $<w>=(\d+) 'x' $<d>=(\d+)
  $
/;
say ~$<id x y w d>; # 1 1 3 4 4
(The prefix ~ calls .Str on the value on its right hand side. Called on a Match object it stringifies to the matched strings.)
With that out of the way, your question remains perfectly cromulent as it is, because it's important to know how P6 scales in this regard from simple regexes like the one above to the largest and most complex parsing tasks. So that's what the rest of this answer covers, using your example as the starting point.
Digging less messily
say $match<id>.hash<digits>.<digit>; # [「1」]
this seems a little messy, is there a better way?
Your say includes unnecessary code and output nesting. You could just simplify to something like:
say ~$match<id> # 1
Digging a little deeper less messily
I am interested in extracting the actual tokens that were matched i.e. id, x + y from coordinates, and height + width from the dimensions from the resulting parse.
For matches of multiple tokens you no longer have the luxury of relying on Perl 6 guessing which one you mean. (When there's only one, guess which one it guesses you mean. :))
One way to write your say to get the y coordinate:
say ~$match<coordinates><digits>[1] # 3
If you want to drop the <digits> you can mark which parts of a pattern should be stored in a list of numbered captures. One way to do so is to put parentheses around those parts:
token coordinates { (<digits>) ',' (<digits>) }
Now you've eliminated the need to mention <digits>:
say ~$match<coordinates>[1] # 3
You could also name the new parenthesized captures:
token coordinates { $<x>=(<digits>) ',' $<y>=(<digits>) }
say ~$match<coordinates><y> # 3
Pre-digging
I have to dig down through each grammar production to get the value I need
The above techniques still all dig down into the automatically generated parse tree which by default precisely corresponds to the tree implicit in the grammar's hierarchy of rule calls. The above techniques just make the way you dig into it seem a little shallower.
Another step is to do the digging work as part of the parsing process so that the say is simple.
You could inline some code right into the TOP token to store just the interesting data you've made. Just insert a {...} block in the appropriate spot (for this sort of thing that means the end of the token given that you need the token pattern to have already done its matching work):
my $made;
grammar Claim {
    token TOP {
        '#' <id> \s* '@' \s* <coordinates> ':' \s* <dimensions>
        { $made = ~($<id>, $<coordinates><x y>, $<dimensions><digits>[0,1]) }
    }
...
Now you can write just:
say $made # 1 1 3 4 4
This illustrates that you can just write arbitrary code at any point in any rule -- something that's not possible with most parsing formalisms and their related tools -- and the code can access the parse state as it is at that point.
Pre-digging less messily
Inlining code is quick and dirty. So is using a variable.
The normal thing to do for storing data is to instead use the make function. This hangs data off the match object that's being constructed corresponding to a given rule. It can then be retrieved using the .made method. So instead of $made = you'd have:
{ make ~($<id>, $<coordinates><x y>, $<dimensions><digits>[0,1]) }
And now you can write:
say $match.made # 1 1 3 4 4
That's much tidier. But there's more.
A sparse subtree of a parse tree
.oO ( 🎶 On the first day of an imagined 2019 Perl 6 Christmas Advent calendar 🎶 a StackOverflow title said to me ... )
In the above example I constructed a .made payload for just the TOP node. For larger grammars it's common to form a sparse subtree (a term I coined for this because I couldn't find a standard existing term).
This sparse subtree consists of the .made payload for the TOP that's a data structure referring to .made payloads of lower level rules which in turn refer to lower level rules and so on, skipping uninteresting intermediate rules.
The canonical use case for this is to form an Abstract Syntax Tree after parsing some programming code.
In fact there's an alias for .made, namely .ast:
say $match.ast # 1 1 3 4 4
While this is trivial to use, it's also fully general. P6 uses a P6 grammar to parse P6 code -- and then builds an AST using this mechanism.
Making it all elegant
For maintainability and reusability you can, and typically should, avoid inserting code inline at the end of rules and instead use Action objects, as sketched below.
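Here's a minimal sketch of that approach (assuming the named x/y captures from above, an analogous w/h pair for dimensions, and method names of my own choosing):
grammar Claim {
    token TOP         { '#' <id> \s* '@' \s* <coordinates> ':' \s* <dimensions> }
    token digits      { <digit>+ }
    token id          { <digits> }
    token coordinates { $<x>=(<digits>) ',' $<y>=(<digits>) }
    token dimensions  { $<w>=(<digits>) 'x' $<h>=(<digits>) }
}
# All the tree digging lives in one reusable actions class.
class ClaimActions {
    method TOP ($/) { make %( id => +$<id>, |$<coordinates>.made, |$<dimensions>.made ) }
    method coordinates ($/) { make %( x => +$<x>, y => +$<y> ) }
    method dimensions  ($/) { make %( w => +$<w>, h => +$<h> ) }
}
say Claim.parse('#1 @ 1,3: 4x4', :actions(ClaimActions.new)).made;
# {h => 4, id => 1, w => 4, x => 1, y => 3}
The grammar stays a pure description of the input; swapping in a different actions class reuses the same grammar for a different purpose.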
In summary
There are a range of general mechanisms that scale from simple to complex scenarios and can be combined as best fits any given use case.
Add parentheses as explained above, naming the captures that those parentheses zero in on, if that is a nice simplification for digging into the parse tree.
Inline any action you wish to take during parsing of a rule. You get full access to the parse state at that point. This is great for making it easy to extract just the data you want from a parse because you can use the make convenience function. And you can abstract all actions that are to be taken at the end of successfully matching rules out of a grammar, ensuring this is a clean solution code-wise and that a single grammar remains reusable for multiple actions.
One final thing. You may wish to prune the parse tree to omit unnecessary leaf detail (to reduce memory consumption and/or simplify parse tree displays). To do so, write <.foo>, with a dot preceding the rule name, to switch the default automatic capturing off for that rule.
You can refer to each of your named portions directly. So to get the coordinates you can access:
say $match.<coordinates>.<digits>
This will return the Array of digits matches. If you just want the values, the easiest way is probably:
say $match.<coordinates>.<digits>.map( *.Int )
or say $match.<coordinates>.<digits>>>.Int or even say $match.<coordinates>.<digits>».Int
to cast them to Ints.
For the id field it's even easier: you can just cast the <id> match to an Int:
say $match.<id>.Int

Working on migration of SPL 3.0 to 4.2 (TEDA)

I am working on migrating 3.0 code to the new 4.2 framework. I am facing a few difficulties:
How to do CDR-level deduplication in the new 4.2 framework? (Note: table deduplication is already done.)
Where to implement PostDedupProcessor - context or chainsink custom? In either case, do I need to remove duplicate hashcodes from the list or just reject the tuples? Here I am also updating columns for a few tuples.
My file is not moving into the archive. The temporary output file is generated, but it is empty and outside the load directory. What could be the possible reasons? I have thoroughly checked the config parameters, and after adding logs it seems the correct output is being sent from the transformer custom, so I don't know where it is stuck. I have printed the TableRowGenerator stream for logging (end of DataProcessor).
1. and 2.:
You need to select the type of deduplication; there is not a big difference between choosing table- or CDR-level deduplication.
The ite.businessLogic.transformation.outputType setting affects this. There is only one dedup; you cannot have both.
Select recordStream for CDR-level deduplication, and do the transformation to table row format (e.g. if you like to use the TableFileWriter) in xxx.chainsink.custom::PostContextDataProcessor.
In xxx.chainsink.custom::PostContextDataProcessor you need to add custom code for duplicate handling: reject (discard) tuples, set special column values, or write them to different target tables.
3.:
Possible reasons could be:
Missing forwarding of window punctuations or statistic tuples
An error in the BloomFilter configuration; you would see this easily because the PE is down and the error log gives hints about wrong sha2 functions being used
To troubleshoot your ITE application, I recommend enabling the following debug sinks if checking the StreamsStudio live graph is not sufficient:
ite.businessLogic.transformation.debug=on
ite.businessLogic.group.debug=on
ite.businessLogic.sink.debug=on
Run a test with a single input file only and check the flow of your record and statistic tuples. Debug sinks write punctuation markers to the debug files as well.

IBM WebSphere scripting ID 'containment paths' explained (i.e. AdminConfig.getid(...))

I have been digging through IBM Knowledge Center and have come away frustrated at the lack of definition of some of their scripting interfaces.
IBM keeps talking about a containment path used in their id definitions, and I am only assuming that it corresponds to an XML element within a file in the WebSphere config folder hierarchy, but that's only from observation. I haven't found this declared in any documentation as yet.
Also, to find an ID, there is a syntax that needs to be used that will retrieve it, yet I cannot find a reference to the possible values of 'types' used in the AdminConfig.getid(...) function call.
So to summarize I have a couple questions:
What is the correct definition of the id "hierarchy", both for fetching an id and for the id itself?
e.g. to get the id of a server I would say something like: AdminConfig.getid('/Cell:MYCOMPUTERNode01Cell/Node:MYCOMPUTERNode01/Server:server1'), which would give me an id something like: server1(cells/MYCOMPUTERNode01Cell/nodes/MYCOMPUTERNode01/servers/server1|server.xml#server_1873491723429)
What are the possible /type values for retrieving the id from the WebSphere server?
e.g. in the above example, /Cell, /Node and /Server are examples of type values used in queries.
UPDATE: I have discovered this document that outlines the locations of various configuration files in the configuration directory.
UPDATE: I am starting to think that the configuration files represent complex objects with attributes and nested attributes. Not every object with an 'id' is query-able through AdminConfig.getid(<containment_path>). Object attributes (and this is not 'attributes' in the strict XML sense, as they can be nested nodes with a simple structure within the parent node) may be queried using AdminConfig.showAttribute(<element_id>, <attribute_name>). This function will return either the string value of the inline attribute of the element itself or a string-list representation of the ids of the nested attribute node(s).
UPDATE: I have discovered the AdminConfig.types() function, which displays a list of all manipulable object types in the configuration files, and the AdminConfig.parents() function, which displays what node types are considered parents of the specified type.
Note: The AdminConfig.parents() function does not reveal the parents in order of their hierarchy; rather it seems to just present a list of parents. E.g. AdminConfig.parents('JDBCProvider') gives us this exact list: 'Cell\nDeployment\nNode\nServer\nServerCluster', and even though Server comes before ServerCluster, running AdminConfig.parents('Server') reveals its parents as: 'Node\nServerCluster'. Although some elements can have no parents - e.g. Cell - some elements produce an error when running the parents function - e.g. Deployment.
Due to the apparent lack of a reciprocal AdminConfig.children() function, it appears as though obtaining a full hierarchical tree of parents/children lies in calling this function on each of the elements returned from the AdminConfig.parents(<>) call and combining the results. That being said, a trial and error approach is mostly fruitful due to the sometimes obvious ordering.
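For reference, here is a short wsadmin (Jython) sketch tying these pieces together; the cell/node/server names are the placeholder ones from the example above:
# Run inside wsadmin (e.g. wsadmin -lang jython), which provides the
# AdminConfig object; there is nothing to import.
serverId = AdminConfig.getid('/Cell:MYCOMPUTERNode01Cell/Node:MYCOMPUTERNode01/Server:server1')
print serverId
# Every type usable in a containment path query:
print AdminConfig.types()
# A simple attribute of the object behind an id:
print AdminConfig.showAttribute(serverId, 'name')
# Which types may contain a Server object:
print AdminConfig.parents('Server')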

JSON schema: allowing information to be stored in keys

I want to formally define a schema for a JSON-based protocol.
I have two criteria for the schema:
1. I want to be able to use tools to create parser/serializer (php and .net).
2. Result JSON should be easy to human-read
Here is the context. The schema will be describing a game character, as an example I will take one aspect of the profile - professions.
A character can have up to 2 professions (from a list of 10); each profession is described by a name and a level, e.g.:
Skinning - level 200
Blacksmith - level 300
To satisfy criterion #1 it really helps to have an XSD schema (or JSON Schema) to drive a code generator or a parser library. But that means that my JSON must look something like:
character : {
    professions : [
        { profession : "Skinning", level : 525 },
        { profession : "Blacksmith", level : 745 }
    ]
}
but it feels too chatty; I would rather have the JSON look like this (notice that the profession is used as a key):
character : {
    professions : {
        "Skinning" : 525,
        "Blacksmith" : 745
    }
}
but the latter JSON cannot be described with XSD without having to define an element for each profession.
So I am looking for a solution for my situation, here are options I have identified:
shut up and make JSON XSD-friendly (first snippet above)
shut up and make JSON human-friendly and hand-code the parser/serializer.
but I would really like to find a solution that would satisfy both criteria.
Note: I am aware that Newton-King's JSON library would allow me to parse professions as a Dictionary - but it would require me to hand code the type to map this JSON to. Therefore so far I am leaning towards option #2, but I am open to suggestions.
Rename profession to name so it'd be like:
character : {
professions : [
{ name : "Skinning", level : 525 }
{ name : "Blacksmith", level : 745 }
]
}
Then after it's deserialized on the client, the model would be like this:
profession = character.professions[0]
profession.name
=> "Skinning"
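If it helps to see the trade-off concretely, here is a small Python sketch (a stand-in for the PHP/.NET consumers mentioned in the question) reading both shapes; the array form is a uniform record shape that schema-driven code generators can type, while the key-based form needs a dictionary because the keys themselves carry data:
import json
# Array form: every profession is a record with the same two fields.
array_form = '{"professions": [{"name": "Skinning", "level": 525}, {"name": "Blacksmith", "level": 745}]}'
# Key-based form: more human-readable, but the keys are data.
map_form = '{"professions": {"Skinning": 525, "Blacksmith": 745}}'
for p in json.loads(array_form)["professions"]:
    print(p["name"], p["level"])
for name, level in json.loads(map_form)["professions"].items():
    print(name, level)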
Your options are, as you said...
1. shut up and use XML
2. shut up and build your own
OR maybe 3... http://davidwalsh.name/json-validation
I would do #1:
- because XML seems to be a fairly common way to transform stuff from X => Y formats
- I prefer to work in C#, not JS
- many people use XML; it's an accepted standard, and there are many resources out there to help you along the way