AllenNLP BERT SRL input format ("OntoNotes v. 5.0 formatted") - allennlp

The goal is to train BERT SRL on another data set. According to configuration, it requires conll-formatted-ontonotes-5.0.
Natively, my data comes in a CoNLL format and I converted it to the conll-formatted-ontonotes-5.0 format of the GitHub edition of OntoNotes v.5.0. Reading the data works and training seems to work, except that precision remains at 0. I suspect that either the encoding of SRL arguments (BIO or phrasal?) or the column structure (other OntoNotes editions in CoNLL format differ here) differs from the expected input. Alternatively, the error may arise because the role labels are hard-wired in the code. I followed the reference data in using the long form (ARGM-TMP), but you often see the short form (AM-TMP) in other data.
The question is which dataset and format is expected here. I guess it's one of the CoNLL/Skel formats for OntoNotes 5.0 with a restored WORD column, but:
- The CoNLL edition doesn't seem to be shipped with the LDC edition of OntoNotes.
- It does not seem to be the format of the "conll-formatted-ontonotes-5.0" edition of OntoNotes v.5.0 on GitHub provided by the OntoNotes creators.
- There is at least one other CoNLL/Skel edition of OntoNotes 5.0 data as part of PropBank. This differs from the other one in leaving out 3 columns and in the encoding of predicates. (For parts of my data, this is the native format.)
The SrlReader documentation mentions BIO (IOBES) encoding. This has been used in other CoNLL editions of PropBank data, indeed, but not in the above-mentioned OntoNotes corpora. Other such formats are the CoNLL-2008 and CoNLL-2009 formats, for example, and different variants.
Before I start reverse-engineering the SrlReader, does anyone have a data snippet at hand so that I can prepare my data accordingly?
conll-formatted-ontonotes-5.0 version of my data (sample from EWT corpus):
google/ewt/answers/00/20070404104007AAY1Chs_ans.xml 0 0 where WRB (TOP(S(SBARQ(WHADVP*) - - - - * (ARGM-LOC*) * * -
google/ewt/answers/00/20070404104007AAY1Chs_ans.xml 0 1 can MD (SQ* - - - - * (ARGM-MOD*) * * -
google/ewt/answers/00/20070404104007AAY1Chs_ans.xml 0 2 I PRP (NP*) - - - - * (ARG0*) * * -
google/ewt/answers/00/20070404104007AAY1Chs_ans.xml 0 3 get VB (VP* get 01 - - * (V*) * * -
google/ewt/answers/00/20070404104007AAY1Chs_ans.xml 0 4 morcillas NNS (NP*) - - - - * (ARG1*) * * -
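To make the "BIO or phrasal?" question concrete, here is a minimal sketch of how the bracketed (phrasal) span labels in one predicate-argument column of the sample above could be mapped to BIO tags. This is an illustration only, not the SrlReader's actual code; the function name spans_to_bio is hypothetical, and the sketch assumes that spans within one column are not nested.

```python
from typing import List, Optional


def spans_to_bio(column: List[str]) -> List[str]:
    """Map cells like "(ARG0*)", "(ARG1*", "*", "*)" to BIO tags.

    Assumption: spans in a single predicate column never nest.
    """
    tags: List[str] = []
    current: Optional[str] = None  # label of the span that is currently open
    for cell in column:
        if cell.startswith("("):
            # A span opens on this token, e.g. "(ARG1*" or "(V*)".
            current = cell.strip("()*")
            tags.append("B-" + current)
        elif current is not None:
            # We are inside a span that was opened on an earlier token.
            tags.append("I-" + current)
        else:
            tags.append("O")
        if cell.endswith(")"):
            # The span closes on this token.
            current = None
    return tags


# First predicate-argument column of the five sample rows above.
sample_column = ["(ARGM-LOC*)", "(ARGM-MOD*)", "(ARG0*)", "(V*)", "(ARG1*)"]
print(spans_to_bio(sample_column))
```

Comparing such tags with what the reader produces for the reference data may help to see whether the column layout or the label inventory is the part that differs.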

The "native" format is that of the CoNLL-2012 edition; see cemantix.org/conll/2012/data.html for how to create it.
The Ontonotes class that reads it may, however, encounter difficulties when parsing "native" CoNLL-2012 data, because the CoNLL-2012 preprocessing scripts can lead to invalid parse trees. Parsing with NLTK will naturally lead to a ValueError such as
ValueError: Tree.read(): expected ')' but got 'end-of-string'
at index 1427.
"...LT#.#.) ))"
There is no direct way to solve that at the data level, because the string that is parsed is an intermediate representation, but not the original data. If you want to process CoNLL-2012 data, the ValueError has to be caught, cf. https://github.com/allenai/allennlp/issues/5410.
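A minimal sketch of "catching the ValueError" could look like the following. The helper parse_or_skip and the stand-in parser toy_parse are hypothetical names introduced for illustration; toy_parse only checks bracket balance so that the example runs without extra dependencies. In practice one would presumably pass the real tree parser (for example NLTK's Tree.fromstring) instead, and where exactly to hook this into the reader is left open here.

```python
from typing import Callable, Iterable, List, Tuple


def parse_or_skip(
    parse: Callable[[str], object], tree_strings: Iterable[str]
) -> Tuple[List[object], List[Tuple[int, str]]]:
    """Parse every string; collect failures instead of aborting the whole run."""
    parsed: List[object] = []
    skipped: List[Tuple[int, str]] = []
    for index, text in enumerate(tree_strings):
        try:
            parsed.append(parse(text))
        except ValueError as error:
            # Remember which sentence failed and why, then carry on.
            skipped.append((index, str(error)))
    return parsed, skipped


def toy_parse(text: str) -> str:
    """Stand-in for a real tree parser: it only checks bracket balance."""
    depth = 0
    for char in text:
        if char == "(":
            depth += 1
        elif char == ")":
            depth -= 1
            if depth < 0:
                raise ValueError("unexpected ')'")
    if depth != 0:
        raise ValueError("expected ')' but got 'end-of-string'")
    return text


trees = ["(S (NP I) (VP go))", "(S (NP I) (VP go)"]
parsed, skipped = parse_or_skip(toy_parse, trees)
print(len(parsed), skipped)
```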

DFHJS2LS - generating Json structure from cobol copybook

I have an output structure in COBOL from which I am trying to generate a JSON structure using the IBM tool DFHJS2LS. All fields come out as required, which causes trouble when generating classes in .NET, because not all fields are always present.
Question: How and where (in COBOL or in DFHJS2LS) can I define fields as optional so that they are generated properly and null pointer exceptions are avoided?
According to the documentation you can define your COBOL data items with...
data description OCCURS n TIMES
...and use mapping level 4.1 or higher and specify TRUNCATE-NULL-ARRAYS = ENABLED. There is a reference to "structured arrays" which I take to mean you would need to do something like...
05 Something Occurs 1 Times.
10 Something-Real PIC X(8).
...so you get...
"type":"array"
"maxItems":1
"minItems":0
"items":{ ... }
You could also specify mapping level 4.0 or higher and use...
data description OCCURS n TO m TIMES DEPENDING ON t
...to obtain...
"field-name":{
"type":"array",
"maxItems":m
"minItems":n
"items":{ ... }
}
Mapping level is specified by...
//INPUT.SYSUT1 DD *
[...other control statements...]
MAPPING-LEVEL=4.3
[...other control statements...]
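On the consuming side, an optional field that is modelled as an array with minItems 0 and maxItems 1 has to be unwrapped before use. The following is a small sketch in Python (the question targets .NET, but the idea carries over). The JSON field names "Something" and "Something_Real" are hypothetical; the actual names depend on what DFHJS2LS generates from the copybook.

```python
import json
from typing import Any, Dict, Optional


def optional_item(record: Dict[str, Any], name: str) -> Optional[Dict[str, Any]]:
    """Return the single element of a 0..1 array, or None if it is absent or empty."""
    items = record.get(name) or []
    if len(items) > 1:
        raise ValueError(f"{name}: expected at most one item, got {len(items)}")
    return items[0] if items else None


# Hypothetical payloads: one with the optional structure present, one without.
present = json.loads('{"Something": [{"Something_Real": "ABCDEFGH"}]}')
absent = json.loads('{"Something": []}')

print(optional_item(present, "Something"))
print(optional_item(absent, "Something"))
```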

How to build and send an IDOC from MII to SAP ECC using IDOC_Asynchronous_Inbound

We have a custom built legacy application that collects data from a SQL server database, builds an IDOC and then "sends" that IDOC to ECC. (This application was written in VB6 and uses the SAPGUI 6 SDK to accomplish this.)
I'm attempting to decommission this solution and replace it with a solution built in MII.
As far as I can tell I need to create the IDOC in MII using IDOC_Asynchronous_Inbound but I'm stuck at how I should populate the fields required.
IDOC_Asynchronous_Inbound has two segments: IDOC_CONTROL_REC_40 and IDOC_DATA_REC_40
I guessed which fields to fill in the IDOC_CONTROL_REC_40/item segment by looking at the source code of the old VB application. I think this should do:
IDOC_INBOUND_ASYNCHRONOUS/TABLES/IDOC_CONTROL_REC_40/item
- IDOCTYP: WMMBID01
- MESTYP: WMMBXY
- SNDPRN: <value>
- SNDPRT: LI
- SNDPOR: <value>
- RCVPRN: <value>
- RCVPRT: LS
- EXPRSS: X
Looking at the source code of the old VB app, I should now add a segment of type E1MBXYH with the following fields filled:
- BLDAT: <date>
- BUDAT: <date>
- TCODE: MB31
- XBLNR: <value>
- BKTXT: <value>
Based on guesswork and some blog posts, I'm guessing I have to add this segment as an item segment to the IDOC_DATA_REC_40 segment.
My guess is I should then add item segments of type E1MBXYI for all of the 'records' I'd like to send to SAP with the following fields:
- MATNR: <value>
- WERKS: <value>
- LGORT: <value>
- CHARG: <value>
- BWART: 261
- ERFMG: <value>
- SHKZG: H
- ERFME: <value>
- AUFNR: <value>
- SGTXT: <value>
Now, looking at the IDOC_DATA_REC_40 segment in MII, these are the fields that are available:
- SEGNAM
- MANDT
- DOCNUM
- SEGNUM
- PSGNUM
- HLEVEL
- SDATA
My guess is that the segment name should go into SEGNAM and the data (properly structured/spaced) should go into SDATA. I'm not sure what I should put in the other fields (if anything). (I have the description file for this IDOC type so I know how to 'structure' the data I have to put in the SDATA segment... counting spaces, yay!)
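The "counting spaces" part can be automated. Below is a minimal sketch that builds a fixed-width SDATA string from a list of (field, width) pairs. The helper name build_sdata is hypothetical, and the field order and widths shown for E1MBXYH are placeholders, not the real segment definition; the actual layout has to be taken from the IDoc description file.

```python
from typing import Dict, List, Tuple

# PLACEHOLDER layout: field order and widths are assumptions for illustration.
# Replace them with the values from the segment's description file.
E1MBXYH_LAYOUT: List[Tuple[str, int]] = [
    ("BLDAT", 8),
    ("BUDAT", 8),
    ("XBLNR", 16),
    ("BKTXT", 25),
    ("TCODE", 20),
]


def build_sdata(layout: List[Tuple[str, int]], values: Dict[str, str]) -> str:
    """Concatenate the values, each left-aligned and space-padded to its width."""
    parts = []
    for name, width in layout:
        value = values.get(name, "")
        if len(value) > width:
            raise ValueError(f"{name}: {value!r} is longer than {width} characters")
        parts.append(value.ljust(width))
    return "".join(parts)


header = build_sdata(
    E1MBXYH_LAYOUT,
    {"BLDAT": "20240131", "BUDAT": "20240131", "TCODE": "MB31", "XBLNR": "REF-1"},
)
print(repr(header))
```

Fields that are not supplied are filled with spaces, so the positions of the following fields stay correct.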
To hopefully clarify how the IDOC should be structured, this is a (link to a) screenshot of an IDOC posted by the current VB application:
screenshot of an IDOC in SAP showing the data structure
I hope someone here can confirm I'm on the right track in filling the segments and that there's someone who knows which fields I should fill in the data segments.
Kind regards,
Thomas
P.S. Some of the resources consulted:
How to create and send Idocs to SAP using SAP .Net Connector 3
Goods movement IDOC SAP documentation
How to send IDOCs from SAP MII to SAP ERP
P.P.S. Full disclosure: I've also posted this question on the SAP Community Questions & Answers board.
Correctly dealing with SAP IDocs is unfortunately not as easy as it looks at first glance. It may be a good idea to have a look at the SAP Java IDoc Class Library as mentioned here:
SAP .Net Connector 3.0 - How can I send an idoc from a non-SAP system?
Even if you would not like to switch to Java, it could be at least used as a reference example implementation in order to see how the Remote Function Modules have to be filled with the IDoc data to send.
The SAP Java IDoc Class Library can be downloaded together with the SAP Java Connector from here.
I have no MII system at hand, but you would do better to study the IDoc documentation thoroughly than to read the tea leaves. It can contain helpful hints on how to fill individual segment fields.
Go to WE60 and enter your segment names (IDOC_CONTROL_REC_40/IDOC_DATA_REC_40) or IDoc definition name IDOC_Asynchronous_Inbound.
It may not be very helpful but better than nothing.

Config File Checksum guessing (CRC)

I'm currently "hacking" an old 3D printer, built in 1996. Its software runs on an old Windows PC. I need to modify some parameters that are not accessible from the front end, so I wanted to modify the config files. But if I modify anything, the file can no longer be read. I noticed that there is a checksum at the end of the file, and I'm not really a checksum expert. I assume that, while loading the file, this checksum is calculated again and compared to the one at the end.
I'm having trouble finding out which checksum algorithm is used.
What I already found out: I think it's not just a sum of the bytes in the file. If I swap two characters, a checksum generated by addition would not change, but the software won't accept that file.
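The reasoning about swapped characters can be checked with a few lines of Python. This is only a sketch of a generic 16-bit additive checksum (the function name additive_checksum is made up for illustration); it says nothing about what the printer software actually computes.

```python
def additive_checksum(data: bytes) -> int:
    """Sum of all byte values, truncated to 16 bits."""
    return sum(data) & 0xFFFF


original = b'Value = "English"'
swapped = b'Value = "Engilsh"'  # two neighbouring characters exchanged

# Addition is order-independent, so both variants give the same checksum.
print(hex(additive_checksum(original)), hex(additive_checksum(swapped)))
```

Since the software rejects a file with swapped characters, a position-sensitive algorithm such as a CRC is the more plausible candidate.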
I'm guessing it's some kind of CRC16, because a checksum looks like this:
0x4f20
I have calculated that number with several common CRC16 parameter sets and could not find a match for "4f20", so I assume it must be a custom CRC16.
Here is a complete sample file:
PACKET noname
style 502
last_modified 1511855084 # Tue Nov 28 08:44:44 2017
STRUCTURE MACHINE_OVRL
PARAM distance_units
Value = "millimeters"
ENDPARAM
PARAM language
Value = "English"
ENDPARAM
ENDSTRUCTURE
ENDPACKET
checksum 0x4f20
I think that either the checksum value itself or the complete line "checksum 0x4f20" is excluded from the calculation, because including it would not be possible (?)
Any help is appreciated.
Edit: I have more files with checksums, of course, but they are a lot longer than this one. If needed, I could provide them too.
RevEng was written for this purpose. Given several examples of the input and the associated CRCs, RevEng will derive the CRC parameters. If it is a CRC.
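For checking a handful of candidates by hand, a parametrised bitwise CRC-16 is enough. The sketch below follows the usual parameter model (polynomial, initial value, input/output reflection, final XOR). The preset table and the helper matching_presets are illustrative choices; whether any of these presets matches the printer files is unknown, and which bytes go into the calculation (with or without the checksum line, which line endings) is something that has to be tried out.

```python
from typing import List


def reflect(value: int, width: int) -> int:
    """Reverse the lowest `width` bits of `value`."""
    result = 0
    for _ in range(width):
        result = (result << 1) | (value & 1)
        value >>= 1
    return result


def crc16(data: bytes, poly: int, init: int, refin: bool, refout: bool, xorout: int) -> int:
    """Bitwise CRC-16 with configurable parameters."""
    crc = init
    for byte in data:
        if refin:
            byte = reflect(byte, 8)
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ poly) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    if refout:
        crc = reflect(crc, 16)
    return crc ^ xorout


# A few well-known parameter sets: (poly, init, refin, refout, xorout).
PRESETS = {
    "CRC-16/CCITT-FALSE": (0x1021, 0xFFFF, False, False, 0x0000),
    "CRC-16/XMODEM": (0x1021, 0x0000, False, False, 0x0000),
    "CRC-16/ARC": (0x8005, 0x0000, True, True, 0x0000),
}


def matching_presets(payload: bytes, target: int) -> List[str]:
    """Names of all presets whose CRC over `payload` equals `target`."""
    return [name for name, params in PRESETS.items() if crc16(payload, *params) == target]


# Sanity check against the standard test string "123456789".
print(matching_presets(b"123456789", 0x29B1))
```

With a custom polynomial or initial value none of the presets will match, which is exactly the case where deriving the parameters from several example files is the better route.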

Working on migration of SPL 3.0 to 4.2 (TEDA)

I am working on migration of 3.0 code into new 4.2 framework. I am facing a few difficulties:
1. How to do CDR-level deduplication in the new 4.2 framework? (Note: table deduplication is already done.)
2. Where to implement PostDedupProcessor: in the context or in the chainsink custom? In either case, do I need to remove duplicate hashcodes from the list or just reject the tuples? Here I am also updating columns for a few tuples.
3. My file is not moving into the archive. The temporary output file is generated, but it is empty and located outside the load directory. What could be the possible reasons? I have thoroughly checked the config parameters, and after adding logs it seems the correct output is being sent from the transformer custom, so I don't know where it is stuck. I printed the TableRowGenerator stream for the logs (end of DataProcessor).
1. and 2.:
You need to select the type of deduplication. It does not make a big difference whether you choose "table-" or "cdr-level-deduplication".
The ite.businessLogic.transformation.outputType does affect this. There is only one Dedup. You cannot have both.
Select recordStream for "cdr-level-deduplication", do the transformation to table row format (e.g. if you like to use the TableFileWriter) in xxx.chainsink.custom::PostContextDataProcessor.
In xxx.chainsink.custom::PostContextDataProcessor you need to add custom code for duplicate-handling: reject (discard) tuples or set special column values or write them to different target tables.
3.:
Possible reasons could be:
- Missing forwarding of window punctuations or statistic tuples
- An error in the BloomFilter configuration; you would see this easily, because the PE is down and the error log gives hints about wrong sha2 functions being used
To troubleshoot your ITE application, I recommend enabling the following debug sinks if checking the StreamsStudio live graph is not sufficient:
ite.businessLogic.transformation.debug=on
ite.businessLogic.group.debug=on
ite.businessLogic.sink.debug=on
Run a test with a single input file only and check the flow of your record and statistic tuples. "Debug sinks" also write punctuation markers to the debug files.

Badly Formed hexadecimal uuid string error in Django fixture; json uuid conversion fails issue

File "/home/malikarumi/Projects/cannon/local/lib/python2.7/site-packages/django/db/models/fields/__init__.py", line 2390, in get_db_prep_value
value = uuid.UUID(value)
File "/usr/lib/python2.7/uuid.py", line 134, in __init__
raise ValueError('badly formed hexadecimal UUID string')
ValueError: Problem installing fixture '/home/malikarumi/Projects/cannon/jamf/essell/fixtures/test22byhand.json': badly formed hexadecimal UUID string
I've found the following links so far:
https://github.com/dcramer/django-uuidfield/issues/40
https://github.com/dcramer/django-uuidfield/commit/caae1bc4e45445a06dd11bb22da6a9f07395f78a
Django UUIDField modelfield causes error in Django admin: badly formed hexadecimal UUID string
Django Primary Key: badly formed hexadecimal UUID string
I counted my uuidfield value. It is len=36, because it has dashes in it. At least the string representation I can see is that way. So I replaced it with the same alphanumeric without dashes, as suggested as a test by the bugfix, but I still got the same result.
I checked the model, but there is no max length on any uuid field, nor on the fk link back to the uuid. There's nothing on the fk to suggest it is, or should be limited to, chars, ints, uuids, etc.
Then I found this: http://arthurpemberton.com/2015/04/fixing-uuid-is-not-json-serializable which I hacked into /python2.7/site-packages/django/core/serializers/python.py. The blogger had put it into models.py. But I got the same error, before realizing it was NOT coming from serializers/python.py, as it was yesterday, but from /usr/lib/python2.7/uuid.py, line 134, in __init__. The relevant portions of that code are:
if hex is not None:
    hex = hex.replace('urn:', '').replace('uuid:', '')
    hex = hex.strip('{}').replace('-', '')
    if len(hex) != 32:
        raise ValueError('badly formed hexadecimal UUID string')
    int = long(hex, 16)
Rather than try to hack more core code, given that the indication is the problem is json, not Python, I left this alone for now.
Finally, I looked at this:
https://code.djangoproject.com/ticket/24012
It is stated a couple of times here that Django's "UUIDField generates UUIDs in Python". Now here is some history. I created one row, a single instance of Model A into Django with a fixture that had no uuid and no datefield and had no issues. (The uuidfield is on an abstract model, so it is created when the object is created). I did that because I needed the uuid of that Model A instance for a fk field in Model B, which is the one I am struggling with now. I did that by copy pasting the Model A uuid into the fk field on Model B in a csv file which I then converted to json in order to use it as a fixture.
Is it possible that the uuid ran into problems in this copy paste maneuver, before the conversion to json?
If not, that means even though it was an acceptable Python object when it was created, going thru the json conversion messed it up, correct?
If that's the case, what is a workaround?
Can the Arthur Pemberton code be made to work somewhere else in this process?
If I leave the uuid off, I can probably make this work, but then I have to go back and put all the fk uuids in manually. Is there a better solution? Maybe a bulk insert of that field alone?
This may be a recurring issue for me, because I am also using Scrapy, which supports but does not require json. None of my scraped items will come with uuid, but how do I automate adding their fk's into my process in order to get them into Django?
Or is all of this a good reason to forget uuids altogether?
Thanks.
EDIT/UPDATE per @rolf:
Since I just discovered that the django shell differs more than I realized (the shell can find settings, the regular interpreter can't) I decided to run this once in each one, but the results were the same.
(cannon)malikarumi@Tetuoan2:~/Projects/cannon/jamf$ python manage.py shell
Python 2.7.10 (default, Oct 14 2015, 16:09:02)
IPython 4.0.3 -- An enhanced Interactive Python.
In [1]: uuid.UUID(a82857b6-e336-4c6c-8499-47601770b39d)
File "<ipython-input-1-e282858da374>", line 1
uuid.UUID(a82857b6-e336-4c6c-8499-47601770b39d)
^
SyntaxError: invalid syntax
In [2]: uuid.UUID(a0a69415-6627-43db-8c7a-b57d0c4cefe2)
File "<ipython-input-2-befebf1573ba>", line 1
uuid.UUID(a0a69415-6627-43db-8c7a-b57d0c4cefe2)
^
SyntaxError: invalid syntax
In [3]: uuid.UUID(e6e11b06-ea3b-4e98-a31f-9a83447ad884)
File "<ipython-input-3-a59ea095e61a>", line 1
uuid.UUID(e6e11b06-ea3b-4e98-a31f-9a83447ad884)
^
SyntaxError: invalid syntax
In [4]: uuid.UUID(bd116432-65d7-4612-abfe-9a99dcaf5cad)
File "<ipython-input-4-c4a04434aa3c>", line 1
uuid.UUID(bd116432-65d7-4612-abfe-9a99dcaf5cad)
^
SyntaxError: invalid syntax
Now that I have posted this, I notice that even Stack Overflow treats these uuid differently, i.e., the way they are colored, if that's relevant and meaningful here.
But now that we know this, what do we do with / about it?
2nd Update
This morning I thought, what about a uuid that had never been anywhere but in Django? So here's what I did:
In [5]: e.uuid
Out[5]: UUID('61877565-5fe5-4175-9f2b-d24704df0b74')
In [6]: uuid.UUID(61877565-5fe5-4175-9f2b-d24704df0b74)
File "<ipython-input-6-56137f5f4eb6>", line 1
uuid.UUID(61877565-5fe5-4175-9f2b-d24704df0b74)
^
SyntaxError: invalid syntax
In [7]: uuid.UUID('61877565-5fe5-4175-9f2b-d24704df0b74')
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-7-3b4d3e5bd156> in <module>()
----> 1 uuid.UUID('61877565-5fe5-4175-9f2b-d24704df0b74')
NameError: name 'uuid' is not defined
This is apparently because I left the quote around the alphanumeric, but why that would generate a uuid not defined error, instead of 'string type' or some such error is beyond me.
In [8]: uuid.UUID(61877565-5fe5-4175-9f2b-d24704df0b74)
File "<ipython-input-8-56137f5f4eb6>", line 1
uuid.UUID(61877565-5fe5-4175-9f2b-d24704df0b74)
^
SyntaxError: invalid syntax
The first time I keyed in the characters by hand. I decided to repeat the test by copying and pasting, but as you can see, it made no difference. If there was something weird about the way only the 5 that the caret is pointing to was generated, we might be on to something, but if so, why do I get the same error in the same place when I typed it in by hand myself?
This no longer seems like a json issue to me, since – as far as I know – json has never touched this uuid, unless it did somehow in the internal workings of Django.
Instead, there is either
1. something wrong with the way uuid.UUID generates uuids, or
2. the way it generates them on my system, (Ubuntu 15.10, Django 1.9.1, Python 2.7.10) or
3. the way it reads and evaluates them when they come back, like in uuid.UUID() or being input outside the internal, automatic uuid generation process.
But that also means people using uuid.UUID() to generate uuids will never know there is an issue unless they do what I did, which is try to bring them in from outside. I remember reading somewhere that all uuids are supposed to be compatible. So, unless someone here has a better insight, I think we might be up for a bug report. But is it a Python bug, a Django bug, or both?
Your syntax is wrong:
uuid.UUID('61877565-5fe5-4175-9f2b-d24704df0b74') # note the quotes
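A short sketch of what the shell session was missing. Without quotes, Python tries to read 61877565-5fe5-... as an arithmetic expression, hence the SyntaxError; the NameError in In [7] only means that the uuid module had not been imported in that session. The helper is_valid_uuid is a hypothetical name, added as one possible way to check values before they go into a fixture.

```python
import uuid

# The argument must be a string; dashes are optional.
with_dashes = uuid.UUID("61877565-5fe5-4175-9f2b-d24704df0b74")
without_dashes = uuid.UUID("618775655fe541759f2bd24704df0b74")
print(with_dashes == without_dashes)


def is_valid_uuid(text: str) -> bool:
    """True if uuid.UUID accepts the string, False if it raises ValueError."""
    try:
        uuid.UUID(text)
    except ValueError:
        return False
    return True


# A stray character picked up during copy and paste is enough to break it.
print(is_valid_uuid("61877565-5fe5-4175-9f2b-d24704df0b74"))
print(is_valid_uuid("61877565-5fe5-4175-9f2b-d24704df0b74 "))
```

Running such a check over the fk values in the csv or json file is one way to find out whether the copy and paste step damaged any of them.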