How do we process two different input streams in the onTuple clause of a Custom() operator in IBM InfoSphere Streams?

In a Custom operator, I am trying to open a file that was submitted at launch, tokenize its values, and compare them with the values arriving on the input stream, which comes from another file.

You can have multiple onTuple clauses in the logic of your Custom operator.
Here's an example. We have two input ports, Beacon_1_out0 and Beacon_2_out0. I added an onTuple clause for each input port and process the data arriving on each one; the processing of each port happens independently.
() as Custom_3 = Custom(Beacon_1_out0; Beacon_2_out0)
{
    logic
        onTuple Beacon_1_out0:
        {
            printStringLn((rstring)Beacon_1_out0);
        }
        onTuple Beacon_2_out0:
        {
            printStringLn((rstring)Beacon_2_out0);
        }
}
If you are comparing data from multiple streams, you may want to use the Join operator instead; it makes it easier to correlate tuples arriving on multiple input streams.
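For illustration, here is a minimal sketch of a windowed Join (the stream names, attribute names, and window sizes below are assumed for the example, not taken from the question):
// Assumed schemas: FileLines and LiveData are both stream<rstring key, rstring value>.
stream<rstring key, rstring fileValue, rstring liveValue> Matched = Join(FileLines; LiveData)
{
    window
        FileLines: sliding, count(100);
        LiveData: sliding, count(100);
    param
        match: FileLines.key == LiveData.key;
    output
        Matched: key = FileLines.key, fileValue = FileLines.value, liveValue = LiveData.value;
}
Tuples from the two ports that fall in the current windows and satisfy the match condition are emitted as joined tuples, which is usually a more natural way to compare a reference file with a live stream than correlating them by hand in a Custom operator.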

Related

How to force addition using a metatable in Lua?

I have defined addition using a metatable as follows.
local matrix_meta = {}
matrix_meta.__add = function( ... )
    return matrix.add( ... )
end
I want to add variables using the matrix_meta __add. The following expressions work well.
matrix(p)+q
matrix(p)+matrix(q)
p+matrix(q)
However, the following code doesn't work.
p+q
The reason is obvious: p and q are not recognized as matrix objects, so Lua simply throws an error about trying to perform arithmetic on table values. I am curious how to force addition for matrix objects. Is it possible in Lua to execute something like env-Matrix: p+q or matrix_meta.__add: p, q so that p and q are automatically recognized as matrix objects? So the problem is to perform addition in a matrix environment where the variables are recognized as matrix objects. Note that I don't want this only for two variables; there may be more than two.
As defined in your comment,
local p = {{2,4,6},{8,10,12},{14,16,20}}
local q = {{1,2,3},{8,10,12},{14,16,20}}
so unless you do something like
local p = setmetatable({{2,4,6},{8,10,12},{14,16,20}}, matrix_meta)
p and q are just regular Lua tables with no metamethods.
Arithmetic operations are not defined for Lua tables. Hence the error message.
If you don't like Lua's operators or its syntax, consider using another programming language.
It wouldn't hurt to write something like m({2,4,6},{8,10,12},{14,16,20}) instead of {{2,4,6},{8,10,12},{14,16,20}}.
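For illustration, here is a minimal, self-contained sketch of such a constructor (the matrix.add implementation and the name m are assumptions for the example, not part of the original code); it attaches matrix_meta when the object is built, so that a plain p + q dispatches through __add:
local matrix = {}
local matrix_meta = {}

-- Element-wise addition; assumes both operands have the same dimensions.
function matrix.add(a, b)
    local result = {}
    for i = 1, #a do
        result[i] = {}
        for j = 1, #a[i] do
            result[i][j] = a[i][j] + b[i][j]
        end
    end
    return setmetatable(result, matrix_meta)
end

matrix_meta.__add = function(a, b)
    return matrix.add(a, b)
end

-- Constructor: collects the rows and attaches the metatable.
local function m(...)
    return setmetatable({ ... }, matrix_meta)
end

local p = m({2,4,6}, {8,10,12}, {14,16,20})
local q = m({1,2,3}, {8,10,12}, {14,16,20})
local r = p + q            -- dispatches through matrix_meta.__add
print(r[1][1], r[1][2])    --> 3    6
Because the result of matrix.add also carries the metatable, sums of more than two matrices, such as p + q + r, work as well.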

Using property OR in "conditions" parameter of askargs action with Semantic MediaWiki API

I'm trying to fetch results via the API using the askargs module. I have no problems getting results when I have just one condition, or several conditions combined with the AND operator, where I use the pipe character to separate them (as described in the documentation).
E.g.
[[Category:+]] AND [[Jurisdiction::A]] AND [[Type::B]]
Category:+ | Jurisdiction::A | Type::B
But the pipe character doesn't work with OR.
I need to be able to use both logical conditions with several arguments within the same query.
Am I missing something?
Am I missing something?
No. The API doesn't handle the OR condition, due to simplistic code in the query parameter formatter.
See the file SemanticMediaWiki/src/MediaWiki/Api/ApiRequestParameterFormatter.php at line 132:
protected function formatConditions( $condition ) {
    return "[[$condition]]";
}
Every condition in the query is formatted with surrounding brackets, leading OR to be interpreted as a page title.
An alternative is to use Special:Ask with a URL-encoded query and the JSON format:
https://www.semantic-mediawiki.org/wiki/Special:Ask/-5B-5BHas-20keyword::askargs-5D-5DOR-5B-5BHas-20keyword::ask-5D-5D/-3F%3Dhelp-20page/-3FHas-20description%3Ddescription/format%3Djson
Since I came here from a web search, I'm going to add another neat possibility:
If you use the alternative separator, you can use a double pipe as a logical OR within a condition.
Example:
%1FCategory:+%1FJurisdiction::A%1FType::B||C
This should be read as follows:
Category:+ AND Jurisdiction::A AND (Type::B OR Type::C)
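For context, here is a sketch of how such a request to the askargs module might look (the wiki URL and property names are placeholders, not taken from the question):
https://example.org/w/api.php?action=askargs&conditions=%1FCategory:+%1FJurisdiction::A%1FType::B||C&format=json
The leading %1F tells the MediaWiki API to split the conditions parameter on the unit-separator character instead of the pipe, which leaves the double pipe free to act as the OR inside the last condition.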

Google Dataflow (Apache beam) JdbcIO bulk insert into mysql database

I'm using the Dataflow SDK 2.x Java API (Apache Beam SDK) to write data into MySQL. I've created pipelines based on the Apache Beam SDK documentation to write data into MySQL using Dataflow. It inserts a single row at a time, whereas I need to implement bulk inserts. I can't find any option in the official documentation to enable a bulk-insert mode.
I'm wondering if it's possible to set a bulk-insert mode in a Dataflow pipeline. If yes, please let me know what I need to change in the code below.
.apply(JdbcIO.<KV<Integer, String>>write()
    .withDataSourceConfiguration(JdbcIO.DataSourceConfiguration.create(
            "com.mysql.jdbc.Driver", "jdbc:mysql://hostname:3306/mydb")
        .withUsername("username")
        .withPassword("password"))
    .withStatement("insert into Person values(?, ?)")
    .withPreparedStatementSetter(new JdbcIO.PreparedStatementSetter<KV<Integer, String>>() {
        public void setParameters(KV<Integer, String> element, PreparedStatement query)
                throws SQLException {
            query.setInt(1, element.getKey());
            query.setString(2, element.getValue());
        }
    }));
EDIT 2018-01-27:
It turns out that this issue is related to the DirectRunner. If you run the same pipeline using the DataflowRunner, you should get batches that are actually up to 1,000 records. The DirectRunner always creates bundles of size 1 after a grouping operation.
Original answer:
I've run into the same problem when writing to cloud databases using Apache Beam's JdbcIO. The problem is that while JdbcIO does support writing up to 1,000 records in one batch, I have never actually seen it write more than one row at a time (I have to admit: this was always using the DirectRunner in a development environment).
I have therefore added a feature to JdbcIO where you can control the size of the batches yourself by grouping your data together and writing each group as one batch. Below is an example of how to use this feature based on the original WordCount example of Apache Beam.
p.apply("ReadLines", TextIO.read().from(options.getInputFile()))
// Count words in input file(s)
.apply(new CountWords())
// Format as text
.apply(MapElements.via(new FormatAsTextFn()))
// Make key-value pairs with the first letter as the key
.apply(ParDo.of(new FirstLetterAsKey()))
// Group the words by first letter
.apply(GroupByKey.<String, String> create())
// Get a PCollection of only the values, discarding the keys
.apply(ParDo.of(new GetValues()))
// Write the words to the database
.apply(JdbcIO.<String> writeIterable()
.withDataSourceConfiguration(
JdbcIO.DataSourceConfiguration.create(options.getJdbcDriver(), options.getURL()))
.withStatement(INSERT_OR_UPDATE_SQL)
.withPreparedStatementSetter(new WordCountPreparedStatementSetter()));
The difference from the normal write() method of JdbcIO is the new method writeIterable(), which takes a PCollection<Iterable<RowT>> as input instead of a PCollection<RowT>. Each Iterable is written as one batch to the database.
The version of JdbcIO with this addition can be found here: https://github.com/olavloite/beam/blob/JdbcIOIterableWrite/sdks/java/io/jdbc/src/main/java/org/apache/beam/sdk/io/jdbc/JdbcIO.java
The entire example project containing the example above can be found here: https://github.com/olavloite/spanner-beam-example
(There is also a pull request pending on Apache Beam to include this in the project)
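The pipeline above references two small helper DoFns, FirstLetterAsKey and GetValues, that are not shown in the snippet; here is a minimal sketch of what they could look like (the implementations are assumed for illustration, not taken from the example project):
// Assumes the usual Beam imports: org.apache.beam.sdk.transforms.DoFn and org.apache.beam.sdk.values.KV.

// Keys each formatted line by its first character, so related lines can be grouped.
static class FirstLetterAsKey extends DoFn<String, KV<String, String>> {
    @ProcessElement
    public void processElement(ProcessContext c) {
        String line = c.element();
        String key = line.isEmpty() ? "" : line.substring(0, 1);
        c.output(KV.of(key, line));
    }
}

// Drops the keys after the GroupByKey, leaving one Iterable per group;
// each Iterable then becomes one batch for writeIterable().
static class GetValues extends DoFn<KV<String, Iterable<String>>, Iterable<String>> {
    @ProcessElement
    public void processElement(ProcessContext c) {
        c.output(c.element().getValue());
    }
}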

Connecting output of script to extract performance

When I use the Execute Script operator with a single input arc of type ExampleSet, run, for example, the one-line script return operator.getInput(ExampleSet.class), and then connect the output to an Extract Performance operator (which takes an ExampleSet as input), I get an error: Mandatory input missing at port Performance.example set.
My goal is to check a Petri net for soundness via the Analyse soundness operator that comes with the RapidProm extension, and then to change the first attribute of the first line to either 0 or 1, depending on whether that string contains "is sound", so I can use Extract Performance and combine the result with other performances using Average.
Is doing this with Execute Script the right way to do it, and if so, how should I fix this error?
Firstly: don't worry about the error Mandatory input missing at port Performance.example set.
It will be resolved when you run the model.
Secondly: the output of the operator that checks the soundness of the model is indeed a bit ugly, since it is a very long string that looks like
Woflan diagnosis of net "d1cf46bd-15a9-4801-9f02-946a8f125eaf" - The net is sound End of Woflan diagnosis
You can indeed use Execute Script to resolve this :)
See the script below!
The output is an example set that returns 1 if the model is sound, and 0 otherwise. Furthermore, I like to use some log operators to translate this into a nice table, useful for documentation purposes.
ExampleSet input = operator.getInput(ExampleSet.class);

for (Example example : input) {
    String uglyResult = example["att1"];
    String soundResult = "The net is sound";
    Boolean soundnessCheck = uglyResult.toLowerCase().contains(soundResult.toLowerCase());
    if (soundnessCheck) {
        example["att1"] = "1"; // the net is sound :)
    } else {
        example["att1"] = "0"; // the net is not sound!
    }
}

return input;
See also the attached example model I created.
[Screenshot: RapidMiner Setup]

Returning values from InputFormat via the Hadoop Configuration object

Consider a running Hadoop job in which a custom InputFormat needs to communicate ("return", similarly to a callback) a few simple values to the driver class (i.e., to the class that launched the job) from within its overridden getSplits() method, using the new mapreduce API (as opposed to mapred).
These values should ideally be returned in-memory (as opposed to saving them to HDFS or to the DistributedCache).
If these values were only numbers, one could be tempted to use Hadoop counters. However, in numerous tests, counters do not seem to be available at the getSplits() phase, and in any case they are restricted to numbers.
An alternative could be to use the Configuration object of the job, which, as the source code reveals, should be the same object in memory for both the getSplits() and the driver class.
In such a scenario, if the InputFormat wants to "return" a (say) positive long value to the driver class, the code would look something like:
// In the custom InputFormat.
public List<InputSplit> getSplits(JobContext job) throws IOException
{
    ...
    long value = ... // A value >= 0
    job.getConfiguration().setLong("value", value);
    ...
}

// In the Hadoop driver class.
Job job = ... // Get the job to be launched
...
job.submit(); // Start running the job
...
while (!job.isComplete())
{
    ...
    if (job.getConfiguration().getLong("value", -1) >= 0) // Has getSplits() set the value yet?
    {
        ...
    }
    else
    {
        continue; // Wait for the value to be set by getSplits()
    }
    ...
}
The above works in tests, but is it a "safe" way of communicating values?
Or is there a better approach for such in-memory "callbacks"?
UPDATE
The "in-memory callback" technique may not work in all Hadoop distributions, so, as mentioned above, a safer way is, instead of saving the values to be passed back in the Configuration object, create a custom object, serialize it (e.g., as JSON), saved it (in HDFS or in the distributed cache) and have it read in the driver class. I have also tested this approach and it works as expected.
Using the configuration is a perfectly suitable solution (admittedly for a problem I'm not sure I understand), but once the job has actually been submitted to the job tracker, you will not be able to amend this value (client side or task side) and expect to see the change on the opposite side of the communication (setting configuration values in a map task, for example, will not be propagated to the other mappers, nor to the reducers, nor will it be visible to the job tracker).
So communicating information from within getSplits() back to your client polling loop (to see when the job has actually finished defining the input splits) is fine in your example.
What's your greater aim or use case for using this?