Gemfire pdxInstance datatype - json

I am writing PdxInstances to GemFire using the sequence: RabbitMQ => Spring XD => GemFire.
If I put this JSON into RabbitMQ, {'ID':11,'value':5}, the value appears as a byte in GemFire. If I put {'ID':11,'value':500}, the value appears as a word (short), and if I put {'ID':11,'value':50000}, it appears as an integer.
A problem arises when I query the data from GemFire and order it. For example, a query such as select * from /my_region order by value fails, saying it cannot compare a byte with a word (or a byte with an integer).
Is there any way to declare the data type in JSON? Or any other method to get rid of this problem?

To add a bit of insight into this problem... in reviewing GemFire/Geode source code, it would seem it is not possible to configure the desired value type and override GemFire/Geode's default behavior, which can be seen in JSONFormatter.setNumberField(..).
I will not go into how GemFire/Geode invokes the JSONFormatter during a Region.put(key, value) operation, as it is rather involved and beyond the scope of this discussion.
However, one could argue that the problem is not necessarily with the JSONFormatter class: storing a numeric value in a byte is more efficient than storing it in an integer, especially when the value does fit into a byte. The real problem is that the Comparator used by the query processor should be able to compare numeric values within the same type family (byte, short, int, long), upcasting where appropriate.
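To make the upcasting idea concrete, here is a minimal sketch (my own illustration, not GemFire/Geode's actual comparator code) of comparing two integral values of different widths by widening both to long first:

// A sketch only, not GemFire/Geode's query Comparator: widening both integral
// operands to long lets byte, short, int and long values compare cleanly.
static int compareIntegral(Number a, Number b) {
    return Long.compare(a.longValue(), b.longValue());
}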
If you feel so inclined, feel free to file a JIRA ticket in the Apache Geode JIRA repository at https://issues.apache.org/jira/browse/GEODE-72?jql=project%20%3D%20GEODE
Note, Apache Geode is the open source "core" of Pivotal GemFire now. See the Apache Geode website for more details.
Cheers!

Your best bet would be to take care of this with a custom module or a Groovy script. You can either write a custom module in Java to do the conversion, upload it into Spring XD, and then reference it like any other processor; or you can write a Groovy script and pass the incoming data through a transform processor.
http://docs.spring.io/spring-xd/docs/current/reference/html/#processors
The actual conversion probably won't be too tricky, but it will vary depending on which method you use. The stream creation would look something like this when you're done:
stream create --name myRabbitStream --definition "rabbit | my-custom-module | gemfire-json-server etc....."
stream create --name myRabbitStream --definition "rabbit | transform --script=file:/transform.groovy | gemfire-json-server etc...."
It seems like you have your source and sink modules set up just fine, so all you need to do is set up your processor module to do the conversion and you should be all set.
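As a rough skeleton (the normalization logic itself is left open, and the field name is simply taken from the example JSON above), a transform.groovy for the transform processor receives the incoming payload and returns the transformed payload:

// transform.groovy - a skeleton only; Spring XD's transform processor binds the
// incoming message payload to 'payload' and forwards whatever the script returns.
import groovy.json.JsonSlurper
import groovy.json.JsonOutput

def obj = new JsonSlurper().parseText(payload.toString())
// normalize obj.value here in whatever way suits your GemFire queries
return JsonOutput.toJson(obj)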

Related

Is Athena a viable/sensible choice for infrequently searching *unstructured* JSON?

I'm recording request and response headers and bodies for all traffic to our API, and from our API to 3rd-party services, into S3 as tiny objects.
I want to be able to query this data infrequently. For example (pseudo-code):
select $.cars[0].color from "objects" where object_path in (....);
Other info:
Many "objects" in S3 won't have a valid path to $.cars[0].color (it's just one example).
I hope to not use Glue.
Cost is important - this is something that will be queried very infrequently. Configuring some ElasticSearch/similar solution is terribly out of budget for the use case.
I hope to not define my own set of schemas (this is simply not feasible).
Athena says it can search unstructured JSON. I'm having trouble creating a proof-of-concept to show this is true.
Is Athena right for me? Am I missing a better solution?
I think Athena will work for your case.
Athena handles missing properties in JSON objects. For example, if you define the cars column as array<struct<color:string>>:
the property can be missing ⇒ SELECT cars … will be NULL
it can be an empty list ⇒ SELECT cars[1] … (Athena arrays start at 1) will result in an error, but element_at(cars, 1) and try(cars[1]) will return NULL
the object may be missing the color property ⇒ SELECT cars[1].color … will be NULL
for completely free-form JSON define the column as string and use the JSON functions to query it.
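For illustration, a minimal sketch of such a table and query (the table name, S3 location and choice of the OpenX JSON SerDe are assumptions, not something from the question):

-- One JSON object per line in S3; missing paths simply come back as NULL.
CREATE EXTERNAL TABLE objects (
  cars array<struct<color:string>>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES ('ignore.malformed.json' = 'true')
LOCATION 's3://my-bucket/api-logs/';

SELECT cars[1].color FROM objects;  -- NULL when cars or color is missing; use element_at/try for empty arrays

-- For completely free-form JSON you could instead expose each line as a single string
-- column (say raw_json) and query it with json_extract_scalar(raw_json, '$.cars[0].color').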
Glue is not necessary. Create the table manually, from your application, or with CloudFormation, and configure it to use partition projection and you will not have to think about using Glue crawlers at all.
Athena doesn't cost anything when you don't use it, and if you will query only infrequently this is key. Make sure to compress your data, and partition it in a way that supports your query patterns (e.g. by date or month if you most often will query recent data).
Not sure what you mean by having to define your "own set of schemas", so perhaps you can clarify that part?

Mapping JSON to Apama Events

I am trying to use the Mapper codec in my connectivity chain to convert a JSON object that looks like this:
{"test2":[
["column1","column2","column3"],
["16091", "449", "05302018"],
["16092", "705", "05302018"]
]}
to an EPL type. To me it looks like a sequence of sequences, so I used
event test1 {
    sequence<string> values;
}
event test2 {
    sequence<test1> tests;
}
But this gives me the error
Unable to parse event test.1: Incorrect type in get (you asked for map but its' actually list)
Any ideas how I should be using the Mapper codec to this end?
Unless explicitly remapped, that won't quite work. You have to consider the entire structure of the document from top to bottom. It's not a sequence of strings - it's a JSON object/dictionary at top-level, with a value that is a sequence of sequences of string.
A JSON object/dictionary can map to an event type based on field names. So as Matt's answer said, a JSON document like yours would need an event type like
event SomeEventType {
    sequence<sequence<string> > test2;
}
If it's not appropriate to create an event type that exactly corresponds to the JSON document's structure, then you'll need to use the mapping codec to rearrange the fields in the JSON document to match the fields and sub-fields in an event type. Or possibly a custom codec; I think Matt's right that the mapper can't do exactly what you want.
Further, because JSON documents are type-less at the top-level, you'll need to make sure that the event type is defined somehow. There are multiple ways of doing that.
(1) If this particular connectivity will only send you events of one type, you can use the 'defaultEventType' configuration option of the apama.eventMap host plug-in at the top of your chain, e.g.
apama.eventMap:
    defaultEventType: SomeEventType
(2) If it depends on the structure of the document, you'll need to use the classifier codec. That can take a message going towards the correlator, and assign it an event type based on the content of fields (or simply their presence). You can learn about it in the documentation.
(3) The transport will sometimes define it on messages being sent towards the correlator. For example, in the case of the Universal Messaging transport, then the 'tag' of the UM event will be used as the type. This may or may not be appropriate.
If you do end up doing anything non-trivial with the classifier or mapper, I'd strongly recommend use of the 'diagnostic codec' to help in developing the classifier or mapper rules. This is a codec you can put anywhere in the chain of codecs that will log every event going through it, so you can see how your rules are operating by seeing what happens before and after classification/mapping. You can read about it in the documentation, but it's usually as simple as putting '- diagnosticCodec' somewhere in your chain. I've found it absolutely invaluable when debugging connectivity chains.
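For reference, a chain definition with the diagnostic codec dropped in might look roughly like this (the chain and transport names are placeholders; your mapper/classifier entries would sit alongside it):

startChains:
  myJsonChain:
    - apama.eventMap:
        defaultEventType: SomeEventType
    - diagnosticCodec
    - jsonCodec
    - myTransport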
you want your event type to look like:
event type1 {
    sequence<sequence<string> > data;
}
it's not possible in the mapper directly to convert to your type2/type1 schema, but you'd be able to write your own codec to do that or do post-filtering in EPL.
HTH,
Matt

Strict JSON parsing in Go

The encoding/json package exposes a forgiving parser: every property that is not present is simply set to its zero value. Is there a better way to make a field required than using bulky switch statements that check every field for its default value? Another problem is that not all zero values are nil. Is there another way to distinguish between a field that was not set and, e.g., 0, other than using pointers so the field can be checked against nil?
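For illustration, a minimal sketch of the pointer approach mentioned above (the struct and field names are made up):

// Pointer fields stay nil when the property is absent from the JSON,
// so a missing "age" can be told apart from an explicit 0.
package main

import (
	"encoding/json"
	"fmt"
)

type person struct {
	Name *string `json:"name"`
	Age  *int    `json:"age"`
}

func main() {
	var p person
	if err := json.Unmarshal([]byte(`{"name":"bob"}`), &p); err != nil {
		panic(err)
	}
	if p.Age == nil {
		fmt.Println("age was not set") // distinct from age == 0
	}
}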
You may look at what is available to implement so-called "JSON schema validation". You may start with this search, which yields github.com/juju/gojsonschema among others; while I have no idea about its quality, it's used as part of Ubuntu's Juju cloud orchestration solution, so I'd expect it to be battle-tested. Still, caveat emptor.

Passing a path as a parameter in Pentaho

In a Job I am checking whether the file that I want to read is available. If this CSV exists, I want to read the data and save it in a database table within a transformation.
This is what I have done so far:
1) I have created the job, 2) I have defined some parameters, one of them holding the path to the file, 3) I have indicated that I am going to pass this value to the transformation.
Now, the thing is, I am sure this should be something very simple to implement, but even though I have followed some blogs, I have not succeeded with this part of the process. I've tried to follow this example:
http://diethardsteiner.blogspot.com.co/2011/03/pentaho-data-integration-scheduling-and.html
My question remains the same: how can I tell the transformation that it has to use the parameter that I am passing to it from the job?
You just mixed up the columns.
Parameter should be the name of the parameter in the transformation you are running.
Value is the value you are passing.
Since you are passing a variable and not a constant value, you use the ${} syntax to indicate this.
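For example, if the transformation declares a parameter named FILE_PATH (a name used here purely for illustration) and the job holds the path in a variable of the same name, the row in the job entry's Parameters tab would be:
Parameter: FILE_PATH
Value: ${FILE_PATH}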

JMeter / AMQ - Substitute substring when reading strings from JSON file

I've been bashing against a brick wall on this ever since Monday, when the customer told me that we needed to simulate up to 50,000 pseudo-concurrent entities for the purposes of performance testing. This is the setup: I have text files full of JSON objects that look a bit like this:
{"customerId"=>"900", "assetId"=>"NN_18_144", "employee"=>"", "visible"=>false,
"GenerationDate"=>"2012-09-21T09:41:39Z", "index"=>52, "Category"=>2...}
It's one object to a line. I'm using JMeter's JMS Publisher to read the lines sequentially:
${__StringFromFile(${PATH_TO_DATA_FILES}scenario_9.json)}
Each of the files contains a different scenario.
What I need to do is read the files in and substitute assetId's value with a randomly selected value from a list of 50,000 non-sequential, pre-generated strings (I can't possibly have a separate file for each assetId, as that would mean littering the load injector with 50,000 files and configuring a thread group within JMeter for each). Programmatically, the substitution is trivial, but it's not so simple to do in JMeter on the fly.
Normally, I'd treat this as the interesting technical challenge that it is and spend a few days working it out, but I only have the weekend, which I suspect I'll spend sleeping overnight in the office anyway.
Can anyone help me with this, please?
Thanks.
For reading your assets, use a CSV Data Set Config; I suppose assetId will be the variable name.
Modify your expression:
${__StringFromFile(${PATH_TO_DATA_FILES}scenario_9.json, lineToSubstitute)}
To do the substitution, add a Beanshell Sampler or a JSR223 Sampler (using Groovy) and code the substitution:
String assetId = vars.get("assetId");
String lineToSubstitute = vars.get("lineToSubstitute");
// one possible substitution, assuming the line format shown above ("assetId"=>"..."):
String lineSubstituted = lineToSubstitute.replaceAll("\"assetId\"=>\"[^\"]*\"",
        "\"assetId\"=>\"" + assetId + "\"");
vars.put("lineSubstituted", lineSubstituted);
If your JSON body is always the same or only changes a little, you should:
Use an HTTP Sampler with RAW POST Body
Put the JSON body in it with variables for asset ids
Put asset ids in CSV Data Set config
Avoid using ${__StringFromFile} as it has a cost.
If you need scripting, use a JSR223 Post Processor with the script in an external file + caching (available since 2.8) so that the script is compiled.
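For illustration, the templated body (the sample line from earlier, with the literal asset id replaced by the variable supplied from the CSV Data Set Config) would look like:
{"customerId"=>"900", "assetId"=>"${assetId}", "employee"=>"", "visible"=>false,
"GenerationDate"=>"2012-09-21T09:41:39Z", "index"=>52, "Category"=>2...}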