How to merge blank nodes in Jena without duplication

Using Jena to deserialize RDF that includes blank nodes results in unique IDs for those nodes each time the same RDF is deserialized. If identical RDF is deserialized multiple times and merged, the blank nodes become duplicated. Is there a way to avoid or remove the duplication?
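// Imports assume Apache Jena 3 or later (org.apache.jena packages).
import java.io.StringReader;
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.riot.Lang;
import org.apache.jena.riot.RDFDataMgr;

public class BlankNodeMerge { // class name is only a placeholder
    static final String RDF =
        "<http://www.foo.com/subject> " +
        "<http://www.foo.com/predicate> " +
        "[ a <http://www.foo.com/bar> , <http://www.foo.com/baz> ] .";

    public static void main(String... args) {
        // Each read assigns a fresh internal ID to the blank node.
        Model m1 = ModelFactory.createDefaultModel().read(new StringReader(RDF), null, "ttl");
        Model m2 = ModelFactory.createDefaultModel().read(new StringReader(RDF), null, "ttl");
        Model m3 = m1.union(m2);
        RDFDataMgr.write(System.out, m3, Lang.TURTLE);
    }
}
// Output:
//<http://www.foo.com/subject>
// <http://www.foo.com/predicate> [ a <http://www.foo.com/bar> , <http://www.foo.com/baz> ] ;
// <http://www.foo.com/predicate> [ a <http://www.foo.com/bar> , <http://www.foo.com/baz> ] .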
This contrived example is a bit silly, but consider that I'm trying to merge RDF files that may or may not be identical.

Related

How to merge a dynamically named record with a static one in Dhall?

I'm creating an AWS Step Function definition in Dhall. However, I don't know how to create a common structure they use for Choice states such as the example below:
{
"Not": {
"Variable": "$.type",
"StringEquals": "Private"
},
"Next": "Public"
}
The Not is pretty straightforward using mapKey and mapValue. If I define a basic Comparison:
{ Type =
{ Variable : Text
, StringEquals : Optional Text
}
, default =
{ Variable = "foo"
, StringEquals = None Text
}
}
And the types:
let ComparisonType = < And | Or | Not >
And adding a helper function to render the type as Text for the mapKey:
let renderComparisonType = \(comparisonType : ComparisonType )
-> merge
{ And = "And"
, Or = "Or"
, Not = "Not"
}
comparisonType
Then I can use them in a function to generate the record halfway:
let renderRuleComparisons =
\( comparisonType : ComparisonType ) ->
\( comparisons : List ComparisonOperator.Type ) ->
let keyName = renderComparisonType comparisonType
let compare = [ { mapKey = keyName, mapValue = comparisons } ]
in compare
If I run that using:
let rando = ComparisonOperator::{ Variable = "$.name", StringEquals = Some "Cow" }
let comparisons = renderRuleComparisons ComparisonType.Not [ rando ]
in comparisons
Using dhall-to-json, it outputs the first part:
{
"Not": {
"Variable": "$.name",
"StringEquals": "Cow"
}
}
... but I've been struggling to merge that with "Next": "Sup". I've used all the record merges like /\, //, etc. and it keeps giving me various type errors I don't truly understand yet.
First, I'll include an approach that does not type-check as a starting point to motivate the solution:
let rando = ComparisonOperator::{ Variable = "$.name", StringEquals = Some "Cow" }
let comparisons = renderRuleComparisons ComparisonType.Not [ rando ]
in comparisons # toMap { Next = "Public" }
toMap is a keyword that converts records to key-value lists, and # is the list concatenation operator. The Dhall CheatSheet has a few examples of how to use both of them.
The above solution doesn't work because # cannot merge lists with different element types. The left-hand side of the # operator has this type:
comparisons : List { mapKey : Text, mapValue : Comparison.Type }
... whereas the right-hand side of the # operator has this type:
toMap { Next = "Public" } : List { mapKey : Text, mapValue : Text }
... so the two Lists cannot be merged as-is due to the different types for the mapValue field.
There are two ways to resolve this:
Approach 1: Use a union whenever there is a type conflict
Approach 2: Use a weakly-typed JSON representation that can hold arbitrary values
Approach 1 is the simpler solution for this particular example and Approach 2 is the more general solution that can handle really weird JSON schemas.
For Approach 1, dhall-to-json will automatically strip non-empty union constructors (leaving behind the value they were wrapping) when translating to JSON. This means that you can transform both arguments of the # operator to agree on this common type:
List { mapKey : Text, mapValue : < State : Text | Comparison : Comparison.Type > }
... and then you should be able to concatenate the two lists of key-value pairs and dhall-to-json will render them correctly.
Approach 2 is the solution for dealing with weakly-typed JSON schemas, and you can learn more about it here:
Dhall Manual - How to convert an existing YAML configuration file to Dhall
The basic idea is that all of the JSON/YAML integrations recognize and support a weakly-typed JSON representation that can hold arbitrary JSON data, including dictionaries with keys of different shapes (like in your example). You don't even need to convert the entire expression to this weakly-typed representation; you only need to use this representation for the subset of your configuration where you run into schema issues.
What this means for your example is that you would change both arguments to the # operator to have this type:
let Prelude = https://prelude.dhall-lang.org/v12.0.0/package.dhall
in List { mapKey : Text, mapValue : Prelude.JSON.Type }
The documentation for Prelude.JSON.Type also has more details on how to use this type.

Azure tables unable to store flattened JSON

I am using the npm flat package, and arrays/objects are flattened, but object/array keys are surrounded by '' , like in 'task_status.0.data' using the object below.
These specific fields do not get stored into AzureTables - other fields go through, but these are silently ignored. How would I fix this?
var obj1 = {
"studentId": "abc",
"task_status": [
{
"status":"Current",
"date":516760078
},
{
"status":"Late",
"date":1516414446
}
],
"student_plan": "n"
}
Here is a simplified example of how I am using it. Again, the entity is written to the table successfully, but the flattened properties are not (see further below):
var flatten = require('flat');
var newObj1 = flatten(obj1);
var entGen = azure.TableUtilities.entityGenerator;
newObj1.PartitionKey = entGen.String(uniqueIDFromMyDB);
newObj1.RowKey = entGen.String(uniqueStudentId);
tableService.insertEntity(myTableName, newObj1, myCallbackFunc);
In the above example, the flattened object would look like:
var newObj1 = {
studentId: "abc",
'task_status.0.status': 'Current',
'task_status.0.date': 516760078,
'task_status.1.status': 'Late',
'task_status.1.date': 1516414446,
student_plan: "n"
}
Then I would add PartitionKey and RowKey.
All the task_status fields would silently fail to be inserted.
EDIT: This has nothing to do with the flattening itself. I just tried a perfectly valid JSON object whose keys contain 'x.y.z', and Azure Tables does not seem to accept such column names, which almost completely destroys the value proposition of storing schema-less data without significant rework.
A '.' in a column name is not supported. You can use a custom delimiter to flatten your objects instead.
For example:
newObj1 = flatten(obj1, {delimiter: '__'});
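To make it clearer what a custom delimiter changes, here is a minimal sketch of the flattening step in plain Python. It is not the flat package itself: the flatten function below is a hypothetical stand-in written for illustration, and it assumes the input contains only dicts, lists and scalar values. Empty containers simply produce no keys.

```python
def flatten(value, delimiter="__", prefix=""):
    """Flatten nested dicts/lists into one dict with delimiter-joined keys.

    Hypothetical helper for illustration; not the npm 'flat' implementation.
    """
    if isinstance(value, dict):
        pairs = value.items()
    elif isinstance(value, list):
        pairs = enumerate(value)  # list positions become key segments
    else:
        return {prefix: value}    # scalar leaf: emit it under the built-up key

    flat = {}
    for key, child in pairs:
        path = f"{prefix}{delimiter}{key}" if prefix else str(key)
        flat.update(flatten(child, delimiter, path))
    return flat


obj1 = {
    "studentId": "abc",
    "task_status": [
        {"status": "Current", "date": 516760078},
        {"status": "Late", "date": 1516414446},
    ],
    "student_plan": "n",
}

for key, value in flatten(obj1).items():
    print(key, "=", value)
```

With the '__' delimiter none of the resulting property names contain a dot, which is the point of the workaround.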

spark.RDD take(n) returns array of element n, n times

I'm using code from https://github.com/alexholmes/json-mapreduce to read a multi-line json file into an RDD.
var data = sc.newAPIHadoopFile(
filepath,
classOf[MultiLineJsonInputFormat],
classOf[LongWritable],
classOf[Text],
conf)
I printed out the first n elements to check if it was working correctly.
data.take(n).foreach { p =>
val (line, json) = p
println
println(new JSONObject(json.toString).toString(4))
}
However when I try to look at the data, the arrays returned from take don't seem to be correct.
Instead of returning an array of the form
[ data[0], data[1], ... data[n] ]
it is in the form
[ data[n], data[n], ... data[n] ]
Is this an issue with the RDD I've created, or an issue with how I'm trying to print it?
I figured out why take was returning an array with duplicate values.
As the API documentation mentions:
Note: Because Hadoop's RecordReader class re-uses the same Writable object
for each record, directly caching the returned RDD will create many
references to the same object. If you plan to directly cache Hadoop
writable objects, you should first copy them using a map function.
Therefore in my case it was reusing the same LongWritable and Text objects. For example if I did:
val foo = data.take(5)
foo.map( r => System.identityHashCode(r._1) )
The output was:
Array[Int] = Array(1805824193, 1805824193, 1805824193, 1805824193, 1805824193)
So in order to prevent it from doing this, I simply mapped the reused objects to their respective values:
val data = sc.newAPIHadoopFile(
filepath,
classOf[MultiLineJsonInputFormat],
classOf[LongWritable],
classOf[Text],
conf ).map(p => (p._1.get, p._2.toString))
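The same effect can be reproduced outside Spark. The sketch below is only an analogy in plain Python: ReusedText and read_records are hypothetical stand-ins for a Hadoop Writable and a RecordReader, not real Spark or Hadoop APIs. It shows why collecting the reused object gives repeated values and why copying the value out first fixes it.

```python
class ReusedText:
    """Stand-in for a Hadoop Writable: one mutable holder reused for every record."""

    def __init__(self):
        self.value = None

    def set(self, value):
        self.value = value

    def __str__(self):
        return str(self.value)


def read_records(lines):
    """Stand-in for a RecordReader: yields the same holder object each time."""
    holder = ReusedText()
    for line in lines:
        holder.set(line)
        yield holder


lines = ["a", "b", "c"]

# Collecting the holders keeps three references to one object,
# so every element shows the last value that was written into it.
aliased = list(read_records(lines))
print([str(r) for r in aliased])       # ['c', 'c', 'c']
print(len({id(r) for r in aliased}))   # 1

# Copying the value out while iterating (the analogue of the .map above)
# keeps each record's own contents.
copied = [str(r) for r in read_records(lines)]
print(copied)                          # ['a', 'b', 'c']
```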

Efficient parsing of first four elements of large JSON arrays

I am using Jackson to parse JSON from an input stream that looks like the following:
[
[ 36,
100,
"The 3n + 1 problem",
56717,
0,
1000000000,
0,
6316,
0,
0,
88834,
0,
45930,
0,
46527,
5209,
200860,
3597,
149256,
3000,
1
],
[
........
],
[
........
],
.....// and almost 5000 arrays like above
]
This is the original feed link: http://uhunt.felix-halim.net/api/p
I want to parse it, keeping only the first 4 elements of every array and skipping the rest:
36
100
The 3n + 1 problem
56717
Code structure I have tried so far:
jsonParser.nextToken(); // consume the outer '['
while (jsonParser.nextToken() != JsonToken.END_ARRAY) { // inner '[' (or the outer ']')
    while (jsonParser.nextToken() != JsonToken.END_ARRAY) {
        // I tried many approaches here but did not find an appropriate one
    }
}
As this feed is pretty big, I need to do this efficiently, with low overhead and memory use.
Also, there are three models for processing JSON: Streaming API, Data Binding and Tree Model. Which one is appropriate for my purpose?
How can I parse this JSON efficiently with Jackson? How can I skip the remaining elements and jump to the next array for better performance?
Edit: (Solution)
Jackson and Gson both work in almost the same way (incremental mode, since content is read and written incrementally). I am switching to Gson because it has a skipValue() function (aptly named). Although Jackson's nextToken() would work like skipValue(), Gson seems more flexible to me. Thanks to @Kowser for the recommendation; I had come across Gson before but somehow ignored it. This is my working code:
reader.beginArray();
while (reader.hasNext()) {
reader.beginArray();
int a = reader.nextInt();
int b = reader.nextInt();
String c = reader.nextString();
int d = reader.nextInt();
System.out.println(a + " " + b + " " + c + " " + d);
while (reader.hasNext())
reader.skipValue();
reader.endArray();
}
reader.endArray();
reader.close();
This is for Jackson.
Follow this tutorial.
Judicious use of jsonParser.nextToken() should help you.
while (jsonParser.nextToken() != JsonToken.END_ARRAY) { // might be JsonToken.START_ARRAY?
The pseudo-code is
find next array
read values
skip other values
skip next end token
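The loop shape in that pseudo-code can be sketched without Jackson. The following uses only Python's standard json module, so it is not the Jackson API, and first_four_of_each_row is a hypothetical name. It assumes the whole feed is already in memory as a string. Each inner array is still decoded in full before being truncated, so this avoids building the complete list of rows but does not skip any parsing work.

```python
import json


def first_four_of_each_row(text, keep=4):
    """Yield the first `keep` elements of each inner array of a JSON array of arrays."""
    decoder = json.JSONDecoder()
    pos = text.index("[") + 1                    # step past the outer '['
    while True:
        # find next array: skip whitespace and the commas between rows
        while pos < len(text) and text[pos] in " \t\r\n,":
            pos += 1
        if pos >= len(text) or text[pos] == "]":  # outer end token reached
            return
        # read values: decode exactly one inner array starting at pos
        row, pos = decoder.raw_decode(text, pos)
        # skip other values: only the leading elements are kept
        yield row[:keep]


feed = '[[36, 100, "The 3n + 1 problem", 56717, 0, 1000000000], [1, 2, "x", 4, 5]]'
for row in first_four_of_each_row(feed):
    print(row)
```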
This is for gson.
Take a look at this tutorial. Consider the second example from the tutorial.
Judicious use of reader.begin* reader.end* and reader.skipValue should do the job for you.
And here is the documentation for JsonReader

What is "compressed JSON"?

I see a lot of references to "compressed JSON" when it comes to different serialization formats. What exactly is it? Is it just gzipped JSON or something else?
Compressed JSON drops the repeated key:value pairs of JSON's encoding and stores keys and values in separate parallel arrays:
// uncompressed
JSON = {
data : [
{ field1 : 'data1', field2 : 'data2', field3 : 'data3' },
{ field1 : 'data4', field2 : 'data5', field3 : 'data6' },
.....
]
};
//compressed
JSON = {
data : [ 'data1','data2','data3','data4','data5','data6' ],
keys : [ 'field1', 'field2', 'field3' ]
};
I found this usage here:
Content from link (http://www.nwhite.net/?p=242)
I rarely find myself in a place where I am writing javascript applications that use AJAX in its pure form. I have long abandoned the ‘X’ and replaced it with ‘J’ (JSON). When working with Javascript, it just makes sense to return JSON. Smaller footprint, easier parsing and an easier structure are all advantages I have gained since using JSON.
In a recent project I found myself unhappy with the large size of my result sets. The data I was returning was tabular data, in the form of objects for each row. I was returning a result set of 50, with 19 fields each. What I realized is if I augment my result set I could get a form of compression.
// uncompressed
JSON = {
data : [
{ field1 : 'data1', field2 : 'data2', field3 : 'data3' },
{ field1 : 'data4', field2 : 'data5', field3 : 'data6' },
.....
]
};
//compressed
JSON = {
data : [ 'data1','data2','data3','data4','data5','data6' ],
keys : [ 'field1', 'field2', 'field3' ]
};
I merged all my values into a single array and stored all my fields in a separate array. Returning a key value pair for each result cost me 8800 bytes (8.6kb). Ripping the fields out and putting them in a separate array cost me 186 bytes. Total savings 8.4kb.
Now I have a much more compressed JSON file, but the structure is different and now harder to work with. So I implemented a solution in Mootools to make the decompression transparent.
Request.JSON.extend({
options : {
inflate : []
}
});
Request.JSON.implement({
success : function(text){
this.response.json = JSON.decode(text, this.options.secure);
if(this.options.inflate.length){
this.options.inflate.each(function(rule){
// store the inflated rows under rule.store if given, otherwise overwrite rule.data
var target = ($defined(rule.store)) ? rule.store : rule.data;
this.response.json[target] = this.expandData(this.response.json[rule.data], this.response.json[rule.keys]);
},this);
}
this.onSuccess(this.response.json, text);
},
expandData : function(data,keys){
var arr = [];
var len = data.length; var klen = keys.length;
var start = 0; var stop = klen;
while(klen && stop <= len){ // <= so the final row is included
arr.push( data.slice(start,stop).associate(keys) );
start = stop; stop += klen;
}
return arr;
}
});
Request.JSON now has an inflate option. You can inflate multiple segments of your JSON object if you so desire.
Usage:
new Request.JSON({
url : 'url',
inflate : [{ 'keys' : 'fields', 'data' : 'data' }],
onComplete : function(json){}
});
Pass as many inflate objects as you like to the option inflate array. It has an optional property called ’store’. If set, the inflated data set will be stored in that key instead.
The ‘keys’ and ‘data’ properties expect strings that match a location in the root of your JSON object.
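For readers not using Mootools, the pack and inflate round trip described above can be sketched in a few lines of Python. This is only an illustration of the layout (one flat data array plus one keys array); compress and expand are hypothetical names, and the sketch assumes every row has exactly the listed keys.

```python
def compress(rows, keys):
    """Pack a list of uniform dicts into a keys array and one flat data array."""
    return {
        "keys": list(keys),
        "data": [row[key] for row in rows for key in keys],
    }


def expand(packed):
    """Rebuild the list of dicts by slicing the flat data array key-count at a time."""
    keys, data = packed["keys"], packed["data"]
    step = len(keys)
    if step == 0:
        return []
    return [dict(zip(keys, data[i:i + step])) for i in range(0, len(data), step)]


rows = [
    {"field1": "data1", "field2": "data2", "field3": "data3"},
    {"field1": "data4", "field2": "data5", "field3": "data6"},
]
packed = compress(rows, ["field1", "field2", "field3"])
print(packed)
print(expand(packed) == rows)
```

The slicing in expand walks the data array in steps of the key count, so the last row is included as long as the data length is a multiple of the number of keys.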
Based on Paniyar's answer, we can convert a List of Objects to the "compressed" JSON format using C# like this:
var JsonString = serializer.Serialize(
new
{
cols = new[] { "field1", "field2", "field3"},
items = data.Select(x => new object[] {x.field1, x.field2, x.field3})
});
I used an array of object because the fields can be int, bool, string...
More Reduction:
If a field's value is repeated very often and it is a string type, you can compress a little bit more by adding a distinct list of that field's values. For instance, fields such as job position, city, etc. are excellent candidates for this. You can add a distinct list of these values and, in each item, replace the value with a reference number. That will make your JSON lighter.
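A minimal sketch of that idea in Python, assuming rows are plain dicts and the chosen field holds hashable values. The names dictionary_encode and dictionary_decode are hypothetical and only illustrate the lookup-table approach.

```python
def dictionary_encode(rows, field):
    """Replace each value of `field` with an index into a list of distinct values."""
    distinct = []   # distinct values in first-seen order
    index_of = {}   # value -> position in `distinct`
    encoded = []
    for row in rows:
        value = row[field]
        if value not in index_of:
            index_of[value] = len(distinct)
            distinct.append(value)
        encoded.append({**row, field: index_of[value]})
    return distinct, encoded


def dictionary_decode(distinct, encoded, field):
    """Restore the original values from the reference numbers."""
    return [{**row, field: distinct[row[field]]} for row in encoded]


people = [
    {"name": "Ann", "city": "Lima"},
    {"name": "Bob", "city": "Oslo"},
    {"name": "Cy", "city": "Lima"},
]
cities, encoded = dictionary_encode(people, "city")
print(cities)
print(encoded)
```

The saving only pays off when the strings are long or heavily repeated, since each row still carries a reference number.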
Compressed:
[["KeyA", "KeyB", "KeyC", "KeyD", "KeyE", "KeyF"],
["ValA1", "ValB1", "ValC1", "ValD1", "ValE1", "ValF1"],
["ValA2", "ValB2", "ValC2", "ValD2", "ValE2", "ValF2"],
["ValA3", "ValB3", "ValC3", "ValD3", "ValE3", "ValF3"],
["ValA4", "ValB4", "ValC4", "ValD4", "ValE4", "ValF4"]]
Uncompressed:
[{KeyA: "ValA1", KeyB: "ValB1", KeyC: "ValC1", KeyD: "ValD1", KeyE: "ValE1", KeyF: "ValF1"},
{KeyA: "ValA2", KeyB: "ValB2", KeyC: "ValC2", KeyD: "ValD2", KeyE: "ValE2", KeyF: "ValF2"},
{KeyA: "ValA3", KeyB: "ValB3", KeyC: "ValC3", KeyD: "ValD3", KeyE: "ValE3", KeyF: "ValF3"},
{KeyA: "ValA4", KeyB: "ValB4", KeyC: "ValC4", KeyD: "ValD4", KeyE: "ValE4", KeyF: "ValF4"}]
The most likely answer is that it really is just gzipped JSON. There is no other standard meaning to this phrase.
Re-organizing a homogeneous array of JSON objects into a pair of arrays is a very useful technique to make the payload smaller and to speed up encoding and decoding, but it is not commonly called "compressed JSON". I haven't ever run across it in open source or any open API, but we use this technique internally and call it "jsontable".