Type in Ruta Script can't be resolved when script is called from a pipeline - type-systems

I'm trying to build a pipeline that incorporates a Ruta script.
I use a UIMA CollectionReader to read the text in and an XMI writer to write out the result. I call my analysis engines with SimplePipeline.runPipeline, and that works fine as long as the Ruta script is not incorporated.
The Ruta script by itself also works fine and produces the desired output. But when I use the script in the pipeline, I get
Caused by: java.lang.IllegalArgumentException: Not able to resolve type: Unsinn
at org.apache.uima.ruta.expression.type.SimpleTypeExpression.getType(SimpleTypeExpression.java:47)
at org.apache.uima.ruta.action.AbstractMarkAction.createAnnotation(AbstractMarkAction.java:42)
at org.apache.uima.ruta.action.MarkAction.execute(MarkAction.java:57)
at org.apache.uima.ruta.rule.AbstractRuleElement.apply(AbstractRuleElement.java:130)
at org.apache.uima.ruta.rule.RuleElementCaretaker.applyRuleElements(RuleElementCaretaker.java:111)
at org.apache.uima.ruta.rule.ComposedRuleElement.applyRuleElements(ComposedRuleElement.java:593)
at org.apache.uima.ruta.rule.AbstractRuleElement.doneMatching(AbstractRuleElement.java:84)
at org.apache.uima.ruta.rule.ComposedRuleElement.fallback(ComposedRuleElement.java:514)
at org.apache.uima.ruta.rule.ComposedRuleElement.fallbackContinue(ComposedRuleElement.java:415)
at org.apache.uima.ruta.rule.RutaRuleElement.startMatch(RutaRuleElement.java:102)
at org.apache.uima.ruta.rule.ComposedRuleElement.startMatch(ComposedRuleElement.java:74)
at org.apache.uima.ruta.rule.RutaRule.apply(RutaRule.java:47)
at org.apache.uima.ruta.rule.RutaRule.apply(RutaRule.java:40)
at org.apache.uima.ruta.rule.RutaRule.apply(RutaRule.java:29)
at org.apache.uima.ruta.RutaScriptBlock.apply(RutaScriptBlock.java:63)
at org.apache.uima.ruta.RutaModule.apply(RutaModule.java:48)
at org.apache.uima.ruta.engine.RutaEngine.process(RutaEngine.java:545)
I pass a full path to the type system in which the type (in this case "Unsinn") is described, but that doesn't help.
AnalysisEngineDescription ruta = AnalysisEngineFactory.createEngineDescription(
        RutaEngine.class,
        RutaEngine.PARAM_MAIN_SCRIPT, "Test",
        RutaEngine.PARAM_SCRIPT_PATHS, "C:/Users/some.user/workspace/Test_Ruta/script/test/",
        RutaEngine.PARAM_DESCRIPTOR_PATHS, "C:/Users/some.user/workspace/Test_Ruta/descriptor/",
        RutaEngine.PARAM_SCRIPT_ENCODING, "UTF-8");
I'm new to UIMA and also to Stack Overflow, but I'm already pretty desperate. Thanks for any help.

The exception indicates that Ruta is not able to resolve an actual type given the short name "Unsinn". Depending on the configuration parameters, this can have one of two causes. In your case, I assume that the type is not present in the type system of the CAS/JCas. The configuration parameters of the RutaEngine do not affect the type system at all. You need to make sure that the type system is included in the CAS when it is created, in your case in the collection reader.
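One way to do that with uimaFIT is to pass the type system explicitly when building the reader description. A minimal sketch; the reader class is a placeholder, and the descriptor name TestTypeSystem.xml is an assumption based on what the Ruta Workbench usually generates next to the engine descriptor:

import org.apache.uima.collection.CollectionReaderDescription;
import org.apache.uima.fit.factory.CollectionReaderFactory;
import org.apache.uima.fit.factory.TypeSystemDescriptionFactory;
import org.apache.uima.resource.metadata.TypeSystemDescription;

// Load the type system descriptor that declares "Unsinn" (file name is an assumption).
TypeSystemDescription tsd = TypeSystemDescriptionFactory.createTypeSystemDescriptionFromPath(
        "C:/Users/some.user/workspace/Test_Ruta/descriptor/test/TestTypeSystem.xml");

// Pass it to the reader description so the types become part of the CAS.
CollectionReaderDescription reader = CollectionReaderFactory.createReaderDescription(
        MyCollectionReader.class,   // placeholder: whatever reader you actually use
        tsd);

SimplePipeline.runPipeline(reader, ruta, xmiWriter);   // xmiWriter: your existing XMI writer description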
I assume that you use uimaFIT for creating the reader and analysis engine (descriptions)? Then you can simply add the type system containing "Unsinn" to uimaFIT's types.txt file so that the type system is picked up during the classpath scan.
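Sketch of that approach: uimaFIT looks for a file META-INF/org.apache.uima.fit/types.txt on the classpath, containing one location pattern per line that points to type system XML descriptors. Assuming the Ruta "descriptor" folder is a resource folder on the classpath (an assumption about your project layout), an entry like the following would pick up the generated type system:

classpath*:test/*TypeSystem.xml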
DISCLAIMER: I am a developer of UIMA Ruta


textX: How to generate object names with ObjectProcessors?

I have a simple example model where I would like to generate names for those objects of the Position rule that were not explicitly given a name with "as <NAME>". This is needed so that I can find them later with the built-in FQN scope provider.
My idea was to do this in the position_name_generator object processor, but that will only be called after the whole model is parsed. I don't really understand the reason for that, since by the time I need a Position object in the Project, the objects have already been created, yet the object processor has not been called.
Another idea would be to do this in a custom scope provider for Position.location, which would first do the name generation and then use the built-in FQN to find the Location object. Although this would work, I consider it hacky and would prefer to avoid it.
What would be the textX way of solving this issue?
(Please take into account that this is only a small example. In reality a similar functionality is required for a rather big and complex model. To change this behaviour with the generated names is not possible since it is a requirement.)
import textx
import textx.scoping.providers

MyLanguage = """
Model
    : (locations+=Location)*
      (employees+=Employee)*
      (positions+=Position)*
      (projects+=Project)*
    ;

Project
    : 'project' name=ID
      ('{'
          ('use' use=[Position])*
      '}')?
    ;

Position
    : 'define' 'position' employee=[Employee|FQN] '->' location=[Location|FQN] ('as' name=ID)?
    ;

Employee
    : 'employee' name=ID
    ;

Location
    : 'location' name=ID
      ('{'
          (sub_location+=Location)+
      '}')?
    ;

FQN
    : ID ('.' ID)*
    ;

Comment:
    /\/\/.*$/
    ;
"""

MyCode = """
location Building
{
    location Entrance
    location Exit
}

employee Hans
employee Juergen

// Shall be referred to with the given name: "EntranceGuy"
define position Hans->Building.Entrance as EntranceGuy

// Shall be referred to with the autogenerated name: <Employee>"At"<LastLocation>
define position Juergen->Building.Exit

project SecurityProject
{
    use EntranceGuy
    use JuergenAtExit
}
"""


def position_name_generator(obj):
    if "" == obj.name:
        obj.name = obj.employee.name + "At" + obj.location.name


def main():
    meta_model = textx.metamodel_from_str(MyLanguage)
    meta_model.register_scope_providers({
        "Position.location": textx.scoping.providers.FQN(),
    })
    meta_model.register_obj_processors({
        "Position": position_name_generator,
    })
    model = meta_model.model_from_str(MyCode)
    assert model, "Could not create model..."


if "__main__" == __name__:
    main()
What is the textX way to solve this...
The use case you describe is to define the name of an object based on other model elements, including references to other model elements. This is currently not covered by any test or use case included in our test suite or the textX documentation.
Object processors are executed at defined stages during model construction (see http://textx.github.io/textX/stable/scoping/#using-the-scope-provider-to-modify-a-model). In the described setup they are executed after reference resolution. Since the name to be defined/deduced is itself required for reference resolution, object processors cannot be used here (even if we allowed controlling whether object processors run before or after reference resolution, the described setup still would not work).
Given the dynamics of model loading (see http://textx.github.io/textX/stable/scoping/#using-the-scope-provider-to-modify-a-model), the solution is located within a scope provider (as you suggested). Here, we allow the order of reference resolution to be controlled, such that references to the object being named by a custom procedure are postponed until the references required to deduce/define the name are resolved.
Possible workaround
A preliminary sketch of how your use case can be solved is discussed in https://github.com/textX/textX/pull/194 (with the attached issue https://github.com/textX/textX/issues/193). This PR contains a version of scoping.py which you could probably use for your project (just copy and rename the module). A full-fledged solution could become part of textX TEP-001, where we plan to make scoping more controllable for the end user.
Playing around with this absolutely interesting issue revealed new aspects of the textX framework to me:
Names can depend on model contents (involving unresolved references). Such name resolution can itself be postponed ("Postponed" in the referenced PR, see below) in terms of our reference-resolution logic.
Even more interesting are the consequences of that: what happens to references pointing to objects whose names are not yet resolved? Here, we must postpone the reference-resolution process, because we cannot know whether the name will match once it is resolved.
Your example is included: https://github.com/textX/textX/blob/analysis/issue193/tests/functional/test_scoping/test_name_resolver/test_issue193_auto_name.py

Working on migration of SPL 3.0 to 4.2 (TEDA)

I am working on migrating 3.0 code to the new 4.2 framework. I am facing a few difficulties:
1. How do I do CDR-level deduplication in the new 4.2 framework? (Note: table deduplication is already done.)
2. Where should I implement PostDedupProcessor - in the context or the chainsink custom? In either case, do I need to remove duplicate hashcodes from the list or just reject the tuples? Here I am also updating columns for a few tuples.
3. My file is not moving into the archive. A temporary output file is generated, but it is empty and outside the load directory. What could be the possible reasons? I have thoroughly checked the config parameters, and after adding logs it seems the correct output is being sent from the transformer custom, so I don't know where it is stuck. I have printed the TableRowGenerator stream for logs (end of DataProcessor).
1. and 2.:
You need to select the type of deduplication; it does not make a big difference whether you choose table-level or CDR-level deduplication.
The parameter ite.businessLogic.transformation.outputType affects this. There is only one dedup stage; you cannot have both.
Select recordStream for CDR-level deduplication and do the transformation to table-row format (e.g. if you like to use the TableFileWriter) in xxx.chainsink.custom::PostContextDataProcessor.
In xxx.chainsink.custom::PostContextDataProcessor you need to add custom code for duplicate handling: reject (discard) tuples, set special column values, or write them to different target tables.
3.:
Possible reasons could be:
Missing forwarding of window punctuations or of the statistic tuple
An error in the BloomFilter configuration; you would notice this easily because the PE is down and the error log gives hints about wrong sha2 functions being used
To troubleshoot your ITE application, I recommend enabling the following debug sinks if checking the Streams Studio live graph is not sufficient:
ite.businessLogic.transformation.debug=on
ite.businessLogic.group.debug=on
ite.businessLogic.sink.debug=on
Run a test with a single input file only and check the flow of your record and statistic tuples. The debug sinks also write punctuation markers to the debug files.

kafka-python 1.3.3: KafkaProducer.send with explicit key fails to send message to broker

(Possibly a duplicate of Can't send a keyedMessage to brokers with partitioner.class=kafka.producer.DefaultPartitioner, although the OP of that question didn't mention kafka-python. And anyway, it never got an answer.)
I have a Python program that has been successfully (for many months) sending messages to the Kafka broker, using essentially the following logic:
producer = kafka.KafkaProducer(bootstrap_servers=[some_addr],
                               retries=3)
...
msg = json.dumps(some_message)
res = producer.send(some_topic, value=msg)
Recently, I tried to upgrade it to send messages to different partitions based on a definite key value extracted from the message:
producer = kafka.KafkaProducer(bootstrap_servers=[some_addr],
                               key_serializer=str.encode,
                               retries=3)
...
try:
    key = some_message[0]
except:
    key = None
msg = json.dumps(some_message)
res = producer.send(some_topic, value=msg, key=key)
However, with this code, no messages ever make it out of the program to the broker. I've verified that the key value extracted from some_message is always a valid string. Presumably I don't need to define my own partitioner, since, according to the documentation:
The default partitioner implementation hashes each non-None key using the same murmur2 algorithm as the java client so that messages with the same key are assigned to the same partition.
Furthermore, with the new code, when I try to determine what happened to my send by calling res.get (to obtain a kafka.FutureRecordMetadata), that call throws a TypeError exception with the message "descriptor 'encode' requires a 'str' object but received a 'unicode'".
(As a side question, I'm not exactly sure what I'd do with the FutureRecordMetadata if I were actually able to get it. Based on the kafka-python source code, I assume I'd want to call either its succeeded or its failed method, but the documentation is silent on the point. The documentation does say that the return value of send "resolves to" RecordMetadata, but I haven't been able to figure out, from either the documentation or the code, what "resolves to" means in this context.)
Anyway: I can't be the only person using kafka-python 1.3.3 who's ever tried to send messages with a partitioning key, and I have not seen anything on teh Intertubes describing a similar problem (except for the SO question I referenced at the top of this post).
I'm certainly willing to believe that I'm doing something wrong, but I have no idea what that might be. Is there some additional parameter I need to supply to the KafkaProducer constructor?
The fundamental problem turned out to be that my key value was a unicode, even though I was quite convinced that it was a str. Hence the selection of str.encode for my key_serializer was inappropriate, and was what led to the exception from res.get. Omitting the key_serializer and calling key.encode('utf-8') was enough to get my messages published, and partitioned as expected.
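In other words, roughly this (a minimal sketch of the working variant; some_addr, some_topic and some_message are the placeholder names from the question):

producer = kafka.KafkaProducer(bootstrap_servers=[some_addr],
                               retries=3)    # no key_serializer here

try:
    key = some_message[0]
except Exception:
    key = None

msg = json.dumps(some_message)
# Encode the key manually instead of relying on str.encode as a serializer.
res = producer.send(some_topic,
                    value=msg,
                    key=key.encode('utf-8') if key is not None else None)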
A large contributor to the obscurity of this problem (for me) was that the kafka-python 1.3.3 documentation does not go into any detail on what a FutureRecordMetadata really is, nor what one should expect in the way of exceptions its get method can raise. The sole usage example in the documentation:
# Asynchronous by default
future = producer.send('my-topic', b'raw_bytes')
# Block for 'synchronous' sends
try:
    record_metadata = future.get(timeout=10)
except KafkaError:
    # Decide what to do if produce request failed...
    log.exception()
    pass
suggests that the only kind of exception it will raise is KafkaError, which is not true. In fact, get can and will (re-)raise any exception that the asynchronous publishing mechanism encountered in trying to get the message out the door.
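So it is worth being prepared for more than KafkaError when blocking on the future. A sketch, assuming producer, some_topic, msg, key and log exist as in the snippets above:

from kafka.errors import KafkaError

future = producer.send(some_topic, value=msg, key=key)
try:
    record_metadata = future.get(timeout=10)
except KafkaError:
    log.exception("produce request failed at the Kafka level")
except Exception:
    # e.g. the TypeError raised by an unsuitable key_serializer, as described above
    log.exception("send failed before the message reached the broker")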
I also faced the same error. Once I added json.dumps while sending the key, it worked.
producer.send(
    topic="first_topic",
    key=json.dumps(key).encode('utf-8'),
    value=json.dumps(msg).encode('utf-8'),
).add_callback(on_send_success).add_errback(on_send_error)

Reading project parameters in Script Task

This is what I'm trying to do in a script task:
long lngMaxRowsToPull = Convert.ToInt64(Dts.Variables["Project::MaxRowsPerPull"].Value);
I get an error message that the variable does not exist.
Yet it's defined as a ReadOnlyVariable for the script, and it does exist as a project parameter.
So close. ;)
Your code is trying to access a variable/parameter named Project::MaxRowsPerPull.
In fact, the $ is significant, so you need to reference $Project::MaxRowsPerPull.
Also note that you have the data type for the parameter as Int32 but are then pushing it into an Int64. You can always put a smaller type into a larger container, but if you tried to fill the parameter with too large a value, your package will asplode.
You need to add $ to the parameter name you fetch, as per the syntax:
long lngMaxRowsToPull = Convert.ToInt64(Dts.Variables["$Project::MaxRowsPerPull"].Value);

How to retrieve useful system information in java?

Which system information is useful in a Java application, especially when tracking down an exception or other problems?
I am thinking of details about exceptions, Java/OS information, memory/object consumption, I/O information, environment/encodings, etc.
Besides the obvious (the exception stack trace), the more info you can get, the better. So you should get all the system properties as well as the environment variables. Also, if your application has any settings, get all their values. Of course, you should put all this info into your log file; I used System.out here for simplicity:
System.out.println("----Java System Properties----");
System.getProperties().list(System.out);
System.out.println("----System Environment Variables----");
Map<String, String> env = System.getenv();
Set<String> keys = env.keySet();
for (String key : keys) {
    System.out.println(key + "=" + env.get(key));
}
In most cases this will be "too much" information, and in most cases the stack trace will be enough. But once you get a tough issue, you will be happy to have all that "extra" information.
Check out the Javadoc for System.getProperties() which documents the properties that are guaranteed to exist in every JVM.
For pure Java applications:
System.getProperty("org.xml.sax.driver")
System.getProperty("java.version")
System.getProperty("java.vm.version")
System.getProperty("os.name")
System.getProperty("os.version")
System.getProperty("os.arch")
In addition, for Java servlet applications:
response.getCharacterEncoding()
request.getSession().getId()
request.getRemoteHost()
request.getHeader("User-Agent")
pageContext.getServletConfig().getServletContext().getServerInfo()
One thing that really helps me: seeing where my classes are getting loaded from.
obj.getClass().getProtectionDomain().getCodeSource().getLocation();
Note: the ProtectionDomain can be null, as can the CodeSource, so do the needed null checks.
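A minimal sketch of that lookup with the null checks in place:

java.security.ProtectionDomain pd = obj.getClass().getProtectionDomain();
java.security.CodeSource source = (pd != null) ? pd.getCodeSource() : null;
if (source != null && source.getLocation() != null) {
    System.out.println("Loaded from: " + source.getLocation());
} else {
    // e.g. classes loaded by the bootstrap class loader have no code source
    System.out.println("No code source location available");
}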