Working on migration of SPL 3.0 to 4.2 (TEDA) - infosphere-spl

I am working on migration of 3.0 code into new 4.2 framework. I am facing a few difficulties:
How to do CDR level deduplication in new 4.2 framework? (Note: Table deduplication is already done).
Where to implement PostDedupProcessor - context or chainsink custom? In either case, do I need to remove duplicate hashcodes from the list or just reject the tuples? Here I am also doing column updating for a few tuples.
My file is not moving into archive. The temporary output file is getting generated and that too empty and outside load directory. What could be the possible reasons? - I have thoroughly checked config parameters and after putting logs, it seems correct output is being sent from transformer custom, so I don't know where it is stuck. I had printed TableRowGenerator stream for logs(end of DataProcessor).

1. and 2.:
You need to select the type of deduplication. It is not a big difference if you choose "table-" or "cdr-level-deduplication".
The ite.businessLogic.transformation.outputType does affect this. There is one Dedup only. You can not have both.
Select recordStream for "cdr-level-deduplication", do the transformation to table row format (e.g. if you like to use the TableFileWriter) in xxx.chainsink.custom::PostContextDataProcessor.
In xxx.chainsink.custom::PostContextDataProcessor you need to add custom code for duplicate-handling: reject (discard) tuples or set special column values or write them to different target tables.
3.:
Possibly reasons could be:
Missing forwarding of window punctuations or statistic tuple
error in BloomFilter configuration, you would see it easily because PE is down and error log gives hints about wrong sha2 functions be used
To troubleshoot your ITE application, I recommend to enable the following debug sinks if checking the StreamsStudio live graph is not sufficient:
ite.businessLogic.transformation.debug=on
ite.businessLogic.group.debug=on
ite.businessLogic.sink.debug=on
Run a test with a single input file only and check the flow of your record and statistic tuples. "Debug sinks" write punctuations markers also to debug files.

Related

Undesired behavior when reloading modified nodes into aws-neptune

I'm using the bulk loader to load data from csv files on S3 into a Neptune DB cluster.
The data is loaded successfully. However, when I reload the data with some of the nodes' property values modified, the new value is not replacing the old one, but rather being added to it ,making it a list of values separated by a comma. For example:
Initial values loaded:
~id,~label,ip:string,creationTime:date
2,user,"1.2.3.4",2019-02-13
If I reload this node with a different ip:
2,user,"5.6.7.8",2019-02-13
Then I run the following traversal: g.V(2).valueMap(), and getting: ip=[1.2.3.4, 5.6.7.8], creationTime=[2019-02-13]
While this behavior may be beneficial for some use-cases, it's mostly undesired. I want the new value to replace the old one.
I couldn't find any reference in the documentation to the loader behavior in case of reloading nodes, and there is no relevant parameter to configure in the API request.
How can I have reloaded nodes overwriting the existing ones?
Update: Neptune now supports single cardinality bulk-loading. Just set
updateSingleCardinalityProperties = TRUE
SOURCE: https://docs.aws.amazon.com/neptune/latest/userguide/load-api-reference-load.html
currently the Neptune bulk loader uses Set cardinality. To update an existing property the best way is to use Gremlin via the HTTP or WS endpoint.
From Gremlin you can specify that you want single cardinality (thus replacing rather than adding to the property value). An example would be
g.V('2').property(single,"ip","5.6.7.8")
Hope that helps,
Kelvin

Output index of ELKI

I am using ELKI to cluster data from CSV file
I use
-resulthandler ResultWriter
-out folder/
to save the outputdata
But as an output I have some strange indexes
ID=2138 0.1799 0.2761
ID=2137 0.1797 0.2778
ID=2136 0.1796 0.2787
ID=2109 0.1161 0.2072
ID=2007 0.1139 0.2047
The ID is more than 2000 despite I have less than 100 training samples
DBIDs are internal; the documentation clearly says that you shouldn't make too much assumptions on them because their implementation may change. The only reason they are written to the output at all is because some methods (such as OPTICS) may require cross-referencing objects by this unique ID.
Because they are meant to be unique identifiers, they are usually continuously incremented. The next time you click on "run" in the MiniGUI, you will get the next n IDs... so clearly, you clicked run more than once.
The "Tips & Tricks" in the ELKI DBID documentation probably answer your underlying question - how to use map DBIDs to line numbers of your input file. The best way is to if you want to have object identifiers, assign object identifiers yourself by using an identifier column (and configuring it to be an external identifier).
For further information, see the documentation: https://elki-project.github.io/dev/dbids

Google Dataflow (Apache beam) JdbcIO bulk insert into mysql database

I'm using Dataflow SDK 2.X Java API ( Apache Beam SDK) to write data into mysql. I've created pipelines based on Apache Beam SDK documentation to write data into mysql using dataflow. It inserts single row at a time where as I need to implement bulk insert. I do not find any option in official documentation to enable bulk inset mode.
Wondering, if it's possible to set bulk insert mode in dataflow pipeline? If yes, please let me know what I need to change in below code.
.apply(JdbcIO.<KV<Integer, String>>write()
.withDataSourceConfiguration(JdbcIO.DataSourceConfiguration.create(
"com.mysql.jdbc.Driver", "jdbc:mysql://hostname:3306/mydb")
.withUsername("username")
.withPassword("password"))
.withStatement("insert into Person values(?, ?)")
.withPreparedStatementSetter(new JdbcIO.PreparedStatementSetter<KV<Integer, String>>() {
public void setParameters(KV<Integer, String> element, PreparedStatement query) {
query.setInt(1, kv.getKey());
query.setString(2, kv.getValue());
}
})
EDIT 2018-01-27:
It turns out that this issue is related to the DirectRunner. If you run the same pipeline using the DataflowRunner, you should get batches that are actually up to 1,000 records. The DirectRunner always creates bundles of size 1 after a grouping operation.
Original answer:
I've run into the same problem when writing to cloud databases using Apache Beam's JdbcIO. The problem is that while JdbcIO does support writing up to 1,000 records in one batch, in I have never actually seen it write more than 1 row at a time (I have to admit: This was always using the DirectRunner in a development environment).
I have therefore added a feature to JdbcIO where you can control the size of the batches yourself by grouping your data together and writing each group as one batch. Below is an example of how to use this feature based on the original WordCount example of Apache Beam.
p.apply("ReadLines", TextIO.read().from(options.getInputFile()))
// Count words in input file(s)
.apply(new CountWords())
// Format as text
.apply(MapElements.via(new FormatAsTextFn()))
// Make key-value pairs with the first letter as the key
.apply(ParDo.of(new FirstLetterAsKey()))
// Group the words by first letter
.apply(GroupByKey.<String, String> create())
// Get a PCollection of only the values, discarding the keys
.apply(ParDo.of(new GetValues()))
// Write the words to the database
.apply(JdbcIO.<String> writeIterable()
.withDataSourceConfiguration(
JdbcIO.DataSourceConfiguration.create(options.getJdbcDriver(), options.getURL()))
.withStatement(INSERT_OR_UPDATE_SQL)
.withPreparedStatementSetter(new WordCountPreparedStatementSetter()));
The difference with the normal write-method of JdbcIO is the new method writeIterable() that takes a PCollection<Iterable<RowT>> as input instead of PCollection<RowT>. Each Iterable is written as one batch to the database.
The version of JdbcIO with this addition can be found here: https://github.com/olavloite/beam/blob/JdbcIOIterableWrite/sdks/java/io/jdbc/src/main/java/org/apache/beam/sdk/io/jdbc/JdbcIO.java
The entire example project containing the example above can be found here: https://github.com/olavloite/spanner-beam-example
(There is also a pull request pending on Apache Beam to include this in the project)

Junit: String return for AssertEquals

I have test cases defined in an Excel sheet. I am reading a string from this sheet (my expected result) and comparing it to a result I read from a database (my actual result). I then use AssertEquals(expectedResult, actualResult) which prints any errors to a log file (i'm using log4j), e.g. I get java.lang.AssertionError: Different output expected:<10> but was:<7> as a result.
I now need to write that result into the Excel sheet (the one that defines the test cases). If only AssertEquals returned String, with the AssertionError text that would be great, as I could just write that immediately to my Excel sheet. Since it returns void though I got stuck.
Is there a way I could read the AssertionError without parsing the log file?
Thanks.
I think you're using junit incorrectly here. THis is why
assertEquals not AssertEquals ( ;) )
you shouldnt need to log. You should just let the assertions do their job. If it's all green then you're good and you dont need to check a log. If you get blue or red (eclipse colours :)) then you have problems to look at. Blue is failure which means that your assertions are wrong. For example you get 7 but expect 10. Red means error. You have a null pointer or some other exception that is throwing while you are running
You should need to read from an excel file or databse for the unit tests. If you really need to coordinate with other systems then you should try and stub or mock them. With the unit test you should work on trying to testing the method in code
if you are bootstrapping on JUnit to try and compare an excel sheet and database then I would ust export the table in excel as well and then just do a comparison in excel between columns
Reading from/writing to files is not really what tests should be doing. The input for the tests should be defined in the test, not in the external file which can change - this can either introduce false negatives or even worse false positives (making your tests effectively useless while also giving false confidence that everything is ok because tests are green).
Given your comment (a loop with 10k different parameters coming from file), I would recommend converting this excel file into JUnit Parameterized test. You may want to put the array definition in another class, because 10k lines is quite a lot.
If it is some corporate bureaucracy, and you need to have this excel file, then it makes sense to not write a classic "test". I would recommend just a main method that does the job - reads the file, runs the code, checks the output using simple if (output.equals(expected)) and then writes back to file.
Wrap your AssertEquals(expectedResult, actualResult) with try catch
in catch
catch(AssertionError e){
//deal with e.getMessage or etc.
}
But it not good idea for some reasons, I guess.
And try google something like soft assert
Documentation on assertEquals is pretty clear on what the method does:
Asserts that two objects are equal. If they are not, an AssertionError
without a message is thrown.
You need to wrap the assertion with try-catch block and in the exception handling do Your logging. You will need to make Your own message using the information from the specific test case, but this what You asked for.
Note:
If expected and actual are null, they are considered equal.

Complex Gremlin queries to output nodes/edges

I am trying to implement a query and graph visualisation framework that allows a user to enter a Gremlin query, returning a D3 graph of results. The D3 graph is built using a JSON - this is created using separate vertices and edges outputs from the Gremlin query. For simple queries such as:
g.V.filter{it.attr_a == "foo"}
this works fine. However, when I try to perform a more complicated query such as the following:
g.E.filter{it.attr_a == 'foo'}.groupBy{it.attr_b}{it.outV.value}.cap.next().findAll{k,e->e.size()<=3}
- Find all instances of *value*
- Grouped by unique *attr_b*
- Where *attr_a* = foo
- And *attr_b* is paired with no more than 2 other instances of *value*
Instead, the output is of the following form:
attr_b1: {value1, value2, value3}
attr_b2: {value4}
attr_b3: {value6, value7}
I would like to know if there is a way for Gremlin to output the results as a list of nodes and edges so I can display the results as a graph. I am aware that I could edit my D3 code to take in this new output but there are currently no restrictions to the type/complexity of the query, so the key/value pairs will no necessarily be the same every time.
Thanks.
You've hit what I consider one of the key problems with visualizing Gremlin results. They can be anything. Gremlin results might not just be a list of vertices and edges. There is no way to really control this that I can think of. At the end of the day, you can really only visualize results that match a pattern that D3 expects. I'd start by trying to detect that pattern and visualize only in those cases (simply display non-recognized patterns as JSON perhaps).
Thinking of your specific example that results like this:
attr_b1: {value1, value2, value3}
attr_b2: {value4}
attr_b3: {value6, value7}
What would you want D3 to visualize there? The vertices/edges that were traversed over to get that result? If so, you might be stuck. Gremlin doesn't give you a way to introspect the pipeline to see what's passing through it. In other words, unless the user explicitly gathers vertices and edges within the pipeline that were touched you won't have access to them. It would be nice to be able to "spy" on a pipeline in that way, but at the moment it doesn't do that. There's been internal discussion within TinkerPop to create a new kind of pipeline implementation that would help with that, but at the moment, it doesn't exist.
So, without the "spying" capability, I think your only workarounds would be to:
detect vertex/edge list on your client side and only render those with d3. this would force users to always write gremlin that returned data in such a format, if they wanted visualization. put it in the users hands.
perhaps supply server-side bindings for a list of vertices/edges that a user could explicitly side-effect their vertices/edges into if their results did not conform to those expected by your visualization engine. again, this would force users to write their gremlin appropriately for your needs if they want visualization.