Converting a CSV to RDF where one column is a set of values

I want to convert a CSV to RDF.
One of the columns of that CSV is, in fact, a set of values joined with a separator character (in my case, the space character).
Here is a sample CSV (with header):
col1,col2,col3
"A","B C D","John"
"M","X Y Z","Jack"
I would like the conversion process to create RDF similar to this:
:A :aProperty :B, :C, :D; :anotherProperty "John".
:M :aProperty :X, :Y, :Z; :anotherProperty "Jack".
I usually use Tarql for CSV conversion.
It is fine for iterating per row, but it has no feature to sub-iterate "inside" a column value.
SPARQL-Generate may help (with iter:regex and sub-generate, as far as I understand), but I cannot find any example that matches my use case.
PS: maybe RML can help too, but I have no prior knowledge of this technology.

You can accomplish this with RML and FnO.
First, we need to access each row, which can be accomplished with RML.
RML allows you to iterate over each row of the CSV file (ql:CSV) with a LogicalSource.
Specifying the iterator (rml:iterator) is not needed, since the default iterator in RML is a row-based iterator.
This results in the following RDF (Turtle):
<#LogicalSource>
    a rml:LogicalSource;
    rml:source "data.csv";
    rml:referenceFormulation ql:CSV.
The actual triples are generated with the help of a TriplesMap, which uses the LogicalSource to retrieve the data from each CSV row:
<#MyTriplesMap>
    a rr:TriplesMap;
    rml:logicalSource <#LogicalSource>;
    rr:subjectMap [
        rr:template "http://example.org/{col1}";
    ];
    rr:predicateObjectMap [
        rr:predicate ex:aProperty;
        rr:objectMap <#FunctionMap>;
    ];
    rr:predicateObjectMap [
        rr:predicate ex:anotherProperty;
        rr:objectMap [
            rml:reference "col3";
        ];
    ].
The col3 CSV column is used to create the following triple:
<http://example.org/A> <http://example.org/ns#anotherProperty> "John".
However, the string in the CSV column col2 needs to be split first.
This can be achieved with FnO (the Function Ontology) and an RML processor which supports the execution of FnO functions. Such an RML processor is the RML Mapper, but other processors can be used too.
The following RDF invokes an FnO function which splits the input string on the space separator, with our LogicalSource as the input data:
<#FunctionMap>
    fnml:functionValue [
        rml:logicalSource <#LogicalSource>; # our LogicalSource
        rr:predicateObjectMap [
            rr:predicate fno:executes;
            rr:objectMap [
                rr:constant grel:string_split # function to use
            ];
        ];
        rr:predicateObjectMap [
            rr:predicate grel:valueParameter;
            rr:objectMap [
                rml:reference "col2" # input string
            ];
        ];
        rr:predicateObjectMap [
            rr:predicate grel:p_string_sep;
            rr:objectMap [
                rr:constant " "; # space separator
            ];
        ];
    ].
The FnO functions supported by the RML Mapper are listed here:
https://rml.io/docs/rmlmapper/default-functions/
You can find the function name and its parameters on that page.
Mapping rules
@base <http://example.org> .
@prefix rml: <http://semweb.mmlab.be/ns/rml#> .
@prefix rr: <http://www.w3.org/ns/r2rml#> .
@prefix ql: <http://semweb.mmlab.be/ns/ql#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix fnml: <http://semweb.mmlab.be/ns/fnml#> .
@prefix fno: <https://w3id.org/function/ontology#> .
@prefix grel: <http://users.ugent.be/~bjdmeest/function/grel.ttl#> .
@prefix ex: <http://example.org/ns#> .

<#LogicalSource>
    a rml:LogicalSource;
    rml:source "data.csv";
    rml:referenceFormulation ql:CSV.

<#MyTriplesMap>
    a rr:TriplesMap;
    rml:logicalSource <#LogicalSource>;
    rr:subjectMap [
        rr:template "http://example.org/{col1}";
    ];
    rr:predicateObjectMap [
        rr:predicate ex:aProperty;
        rr:objectMap <#FunctionMap>;
    ];
    rr:predicateObjectMap [
        rr:predicate ex:anotherProperty;
        rr:objectMap [
            rml:reference "col3";
        ];
    ].

<#FunctionMap>
    fnml:functionValue [
        rml:logicalSource <#LogicalSource>;
        rr:predicateObjectMap [
            rr:predicate fno:executes;
            rr:objectMap [
                rr:constant grel:string_split
            ];
        ];
        rr:predicateObjectMap [
            rr:predicate grel:valueParameter;
            rr:objectMap [
                rml:reference "col2"
            ];
        ];
        rr:predicateObjectMap [
            rr:predicate grel:p_string_sep;
            rr:objectMap [
                rr:constant " ";
            ];
        ];
    ].
Output
<http://example.org/A> <http://example.org/ns#aProperty> "B".
<http://example.org/A> <http://example.org/ns#aProperty> "C".
<http://example.org/A> <http://example.org/ns#aProperty> "D".
<http://example.org/A> <http://example.org/ns#anotherProperty> "John".
<http://example.org/M> <http://example.org/ns#aProperty> "X".
<http://example.org/M> <http://example.org/ns#aProperty> "Y".
<http://example.org/M> <http://example.org/ns#aProperty> "Z".
<http://example.org/M> <http://example.org/ns#anotherProperty> "Jack".
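These rules can be executed with the RML Mapper's command-line interface. A minimal sketch, assuming the rules are saved as mapping.ttl (the exact jar file name depends on the release you download):

java -jar rmlmapper.jar -m mapping.ttl -o output.nt

Here -m points to the mapping file and -o to the file the generated triples are written to.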
Note: I contribute to RML and its technologies.

You can test this query on the SPARQL-Generate playground at https://ci.mines-stetienne.fr/sparql-generate/playground.html and check that it behaves as expected:
BASE <http://data.example.com/>
PREFIX : <http://example.com/>
PREFIX iter: <http://w3id.org/sparql-generate/iter/>
PREFIX fun: <http://w3id.org/sparql-generate/fn/>
GENERATE {
    <{?col1}> :anotherProperty ?col3 .
    GENERATE {
        <{?col1}> :aProperty <{?value}> .
    }
    ITERATOR iter:Split(?col2, " ") AS ?value .
}
ITERATOR iter:CSVStream("http://example.com/file.csv", 20, "*") AS ?col1 ?col2 ?col3
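For the sample CSV from the question (served at the file.csv URL the query assumes), the generated RDF should resemble the following; the subject and object IRIs are resolved against the BASE declared above:

<http://data.example.com/A> <http://example.com/anotherProperty> "John" .
<http://data.example.com/A> <http://example.com/aProperty> <http://data.example.com/B> .
<http://data.example.com/A> <http://example.com/aProperty> <http://data.example.com/C> .
<http://data.example.com/A> <http://example.com/aProperty> <http://data.example.com/D> .
<http://data.example.com/M> <http://example.com/anotherProperty> "Jack" .
<http://data.example.com/M> <http://example.com/aProperty> <http://data.example.com/X> .
<http://data.example.com/M> <http://example.com/aProperty> <http://data.example.com/Y> .
<http://data.example.com/M> <http://example.com/aProperty> <http://data.example.com/Z> .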

The Tabular Data Model and related specs target this use case, although as I recall, we didn't provide for combinations of valueUrl and separator to have sub-columns generate multiple URIs.
The metadata to describe this would be something like the following:
{
  "@context": "http://www.w3.org/ns/csvw",
  "url": "test.csv",
  "tableSchema": {
    "columns": [{
      "name": "col1",
      "titles": "col1",
      "datatype": "string",
      "required": true
    }, {
      "name": "col2",
      "titles": "col2",
      "datatype": "string",
      "separator": " "
    }, {
      "name": "col3",
      "titles": "col3",
      "datatype": "string",
      "propertyUrl": "http://example.com/anotherProperty",
      "valueUrl": "http://example.com/{col3}"
    }],
    "primaryKey": "col1",
    "aboutUrl": "http://example.com/{col1}"
  }
}
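For orientation, a csv2rdf processor in minimal mode should then produce triples roughly like the following for the first row; this is a sketch from memory, and the default predicates (table URL plus #column, shown here unresolved) may vary with the processor and base URL:

<http://example.com/A> <test.csv#col1> "A" .
<http://example.com/A> <test.csv#col2> "B" .
<http://example.com/A> <test.csv#col2> "C" .
<http://example.com/A> <test.csv#col2> "D" .
<http://example.com/A> <http://example.com/anotherProperty> <http://example.com/John> .

Note how the separator makes col2 yield one triple per value, but as literals; that is the valueUrl/separator limitation mentioned above.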

Related

How to combine jq array value into same key in json file in shell script?

I have a json file with this content:
[
  {
    "id": "one",
    "msg": [
      "test"
    ],
    "FilePath": [
      "JsonSerializer.cs",
      "ChatClient.cs",
      "MiniJSON.cs"
    ],
    "line": [
      358,
      1241,
      382
    ]
  },
  {
    "id": "two",
    "msg": [
      "secondtest"
    ],
    "FilePath": [
      "Utilities.cs",
      "PhotonPing.cs"
    ],
    "line": [
      88,
      36
    ]
  }
]
I want output where, as you can see, the values are combined into one:
one
[
  "test"
]
[
  "JsonSerializer.cs",358
  "ChatClient.cs",1241
  "MiniJSON.cs",382
]
two
[
  "secondtest"
]
[
  "Utilities.cs",88
  "PhotonPing.cs",36
]
I have tried cat stack.json | jq -r '.[]|.id,.msg,.FilePath,.line'
which gave this output:
one
[
  "test"
]
[
  "JsonSerializer.cs",
  "ChatClient.cs",
  "MiniJSON.cs"
]
[
  358,
  1241,
  382
]
two
[
  "secondtest"
]
[
  "Utilities.cs",
  "PhotonPing.cs"
]
[
  88,
  36
]
Kindly help me resolve this; I have tried a lot to debug it but am unable to get through. Also, the FilePath and line arrays always have the same length for each item: for example, if FilePath has 3 values, line will also have 3 values.
You're looking for transpose.
.[] | .id, .msg, ([.FilePath, .line] | transpose | add), ""
<stack.json jq -r '.[] | .id, .msg, ([.FilePath, .line]|transpose|add)'
gives the output as required by you.
transpose turns [[1,2,3],[4,5,6]] into [[1,4],[2,5],[3,6]], and add then concatenates the resulting pairs into a single array.
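You can verify this in isolation with jq -n (which runs the filter without input) and -c for compact output:

jq -nc '[[1,2,3],[4,5,6]] | transpose | add'
# prints: [1,4,2,5,3,6]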
If you want the file path and line number together on a single line, I suggest formatting them as a string, separated by a colon:
.[] | .id, .msg, ([.FilePath, .line]|transpose|map(join(":")))
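On the sample input, this prints the following for the first item (jq's join converts the numbers to strings):

one
[
  "test"
]
[
  "JsonSerializer.cs:358",
  "ChatClient.cs:1241",
  "MiniJSON.cs:382"
]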

Snowflake Get MIN/MAX values from Nested JSON

I have the following sample structure in a Snowflake column, representing a series of coordinates for a map polygon. Each pair of values is "longitude, latitude", like this:
COLUMN ZONE
{
  "coordinates": [
    [
      [-58.467372, -34.557908],
      [-58.457565, -34.569341],
      [-58.446836, -34.573511],
      [-58.43482, -34.553367],
      [-58.441944, -34.547923]
    ]
  ],
  "type": "POLYGON"
}
I need to get the smallest longitude and latitude and the biggest longitude and latitude for every table row.
So, for this example, I need to get something like:
MIN: -58.467372,-34.573511
MAX: -58.43482,-34.547923
Do you know if this is possible using a query?
I got as far as navigating the JSON to get the sets of coordinates, but I'm not sure how to proceed from there. I tried applying MIN to the coordinates column, but I'm not sure how to reference only the latitude or longitude value.
This obviously doesn't work:
MIN(ZONE['coordinates'][0])
Any suggestions?
You can do this with some Snowflake GIS functions, after massaging the input data for easier parsing:
with data as (
  select '
    {
      "coordinates": [
        [
          [-58.467372, -34.557908],
          [-58.457565, -34.569341],
          [-58.446836, -34.573511],
          [-58.43482, -34.553367],
          [-58.441944, -34.547923]
        ]
      ],
      "type": "POLYGON"
    }
  ' x
)
select st_xmin(g), st_xmax(g), st_ymin(g), st_ymax(g)
from (
  select to_geography(replace(x, 'POLYGON', 'MultiLineString')) g
  from data
)
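To get the exact MIN/MAX strings from the question, you can concatenate the extremes. A sketch, assuming Snowflake implicitly casts the numeric results to strings in || (wrap them in TO_VARCHAR otherwise):

select 'MIN: ' || st_xmin(g) || ',' || st_ymin(g) as min_point,
       'MAX: ' || st_xmax(g) || ',' || st_ymax(g) as max_point
from (
  select to_geography(replace(x, 'POLYGON', 'MultiLineString')) g
  from data
)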

Is there any way in rml / r2rml to take a value as an IRI?

I'm using RMLMapper to transform JSON to RDF. One of the values stored in the JSON is a URL. I would like to use this as the basis of an IRI for the object of RDF statements.
The input is
{
  "documentId": {
    "value": "http://example.org/345299"
  },
  ...
I want the IRI for the subject of statements to be http://example.org/345299#item, e.g. <http://example.org/345299#item> a <http://schema.org/Thing> .
I tried
@prefix rr: <http://www.w3.org/ns/r2rml#>.
@prefix rml: <http://semweb.mmlab.be/ns/rml#>.
@prefix ql: <http://semweb.mmlab.be/ns/ql#>.
@prefix schema: <http://schema.org/>.

<#Mapping> a rr:TriplesMap ;
    rml:logicalSource [
        rml:source "input.json";
        rml:referenceFormulation ql:JSONPath;
        rml:iterator "$"
    ];
    rr:subjectMap [
        rr:template "{documentId.value}#item" ;
        rr:class schema:Thing
    ] .
This gives an error that rr:template "{documentId.value}#item" doesn't produce a valid IRI.
Providing a value for @base gives a valid IRI, but it is the base with the url-encoded value appended, e.g. <http://example.org/http%3A%2F%2Fexample.org%2Fjobposts%2F345299#item>.
So is there any way in r2rml / rml to take a value and just use it as an IRI? Or to convert a string into an IRI?
One option would be to not use a rr:template, but instead an FnO function to concatenate the stored URI with #item.
Documentation on how to do this can be found here.
In the example you give, replacing the subject map with this gives the required solution:
<#Mapping> rr:subjectMap [
    a fnml:FunctionTermMap;
    rr:termType rr:IRI;
    fnml:functionValue [
        rml:logicalSource <#Source> ;
        rr:predicateObjectMap [
            rr:predicate fno:executes ;
            rr:objectMap [ rr:constant grel:array_join ] ;
        ] ;
        rr:predicateObjectMap [
            rr:predicate grel:p_array_a ;
            rr:objectMap [ rml:reference "documentId.value" ] ;
        ] ;
        rr:predicateObjectMap [
            rr:predicate grel:p_array_a ;
            rr:objectMap [ rr:constant "#item" ] ;
        ] ;
    ]
] .
Note: I contribute to RML and its technologies.
rml:reference is close enough:
rr:subjectMap [
    rml:reference "documentId.value" ;
    rr:class schema:Thing
]
It doesn't seem to let me append #item, but it'll do for now.

Compare two nested json files and show user where exactly the change has occurred and which json file using Python?

I have two JSON files and am validating whether the responses are the same or different. I need to show the user where exactly a change occurred: something like "this particular key was added, removed, or changed in this file".
file1.json
[
  {
    "Name": "Jack",
    "region": "USA",
    "tags": [
      {
        "name": "Name",
        "value": "Assistant"
      }
    ]
  },
  {
    "Name": "MATHEW",
    "region": "USA",
    "tags": [
      {
        "name": "Name",
        "value": "Worker"
      }
    ]
  }
]
file2.json
[
  {
    "Name": "Jack",
    "region": "USA",
    "tags": [
      {
        "name": "Name",
        "value": "Manager"
      }
    ]
  },
  {
    "Name": "MATHEW",
    "region": "US",
    "tags": [
      {
        "name": "Name",
        "value": "Assistant"
      }
    ]
  }
]
If you compare the two JSON files, you can see the differences: in file2.json the region for MATHEW has changed to "US", and the tag values have changed ("Assistant" to "Manager" for Jack, "Worker" to "Assistant" for MATHEW). Now I want to show the user that file2.json has changes such as region: "US" and "Manager" changed to "Assistant".
I have used deepdiff for validation purposes.
from deepdiff import DeepDiff

class JsonCompareError(Exception):  # minimal definition so the raises below work
    pass

def difference(oldurl_resp, newurl_resp, file1):
    ddiff = DeepDiff(oldurl_resp, newurl_resp, ignore_order=True)
    if ddiff == {}:
        print("BOTH JSON FILES MATCH !!!")
        return True
    else:
        print("FAILURE")
        output = ddiff
        if 'iterable_item_added' in output:
            test = output.get('iterable_item_added')
            print('The Resource names are ->')
            i = []
            for k in test:
                print("Name: ", test[k]['Name'])
                print("Region: ", test[k]['region'])
                msg = " Name ->" + test[k]['Name'] + " Region:" + test[k]['region'] + ". "
                i.append(msg)
            raise JsonCompareError("The json file has KEYS changed! Please validate " + str(i) + " in " + file1)
        elif 'iterable_item_removed' in output:
            test2 = output.get('iterable_item_removed')
            print('The names are ->')
            i = []
            for k in test2:
                print(test2[k]['Name'])
                print(test2[k]['region'])
                msg = " Resource Name ->" + test2[k]['Name'] + " Region:" + test2[k]['region'] + ". "
                i.append(msg)
            raise JsonCompareError("The json file has Keys Removed!! Please validate " + str(i) + " in " + file1)
This code only shows the resource Name; I also want to show the tags that got changed, added, or removed.
Can anybody guide me?
If you print out the value of the "test" variable, you will find that the "tags" changes are inside it. In this example the value of test will be:
test = {'root[0]': {'region': 'USA', 'Name': 'Jack', 'tags': [{'name': 'Name', 'value': 'Manager'}]}, 'root[1]': {'region': 'US', 'Name': 'MATHEW', 'tags': [{'name': 'Name', 'value': 'Assistant'}]}}
and you can print test[k]['tags'] or add it to your "msg" variable.
Suggestion:
Also, if your data has some primary key (for example an "id", or a fixed order), you can compare the items one by one instead of comparing the whole lists, which gives a better comparison; see the sketch after the example below. For example, if you compare the data for "Jack" in both files, you get the following comparison:
{'iterable_item_removed': {"root['tags'][0]": {'name': 'Name', 'value': 'Assistant'}}, 'iterable_item_added': {"root['tags'][0]": {'name': 'Name', 'value': 'Manager'}}}
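A minimal sketch of that per-item approach, assuming both files are already parsed into lists (file1_data and file2_data are hypothetical names) and the items appear in the same order:

from deepdiff import DeepDiff

# file1_data / file2_data: the two parsed JSON lists (hypothetical names)
for old_item, new_item in zip(file1_data, file2_data):
    diff = DeepDiff(old_item, new_item, ignore_order=True)
    if diff:
        # report which record changed and the full nested diff,
        # which includes changes inside "tags"
        print(old_item["Name"], diff)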
You should try the deepdiff library. It gives you the key where the difference occurs, along with the old and new values.
from deepdiff import DeepDiff
ddiff = DeepDiff(json_object1, json_object2)
# if you want to compare by ignoring order
ddiff = DeepDiff(json_object1, json_object2, ignore_order=True)

Create a new json string from jq output elements

My jq command returns objects in brackets but without comma separators, and I would like to create a new JSON string from them.
This call finds all elements of arr that contain "FooItem" and then returns the texts array from the object at index 3:
jq '.arr[] | select(index("FooItem")) | .[3].texts'
on this JSON (the original has more elements):
{
  "arr": [
    [
      "create",
      "w199",
      "FooItem",
      {
        "index": 0,
        "texts": [
          "aBarfoo",
          "avalue"
        ]
      }
    ],
    [
      "create",
      "w200",
      "NoItem",
      {
        "index": 1,
        "val": 5,
        "hearts": 5
      }
    ],
    [
      "create",
      "w200",
      "FooItem",
      {
        "index": 1,
        "texts": [
          "mybarfoo",
          "bValue"
        ]
      }
    ]
  ]
}
returns this output:
[
  "aBarfoo",
  "avalue"
]
[
  "mybarfoo",
  "bValue"
]
But I'd like to create a new json from these objects that looks like this:
{
  "arr": [
    [
      "aBarfoo",
      "avalue"
    ],
    [
      "mybarfoo",
      "bValue"
    ]
  ]
}
Can jq do this?
EDIT
One more addition: Considering that texts also has strings of zero length, how would you delete those/not have them in the result?
"texts": ["",
"mybarfoo",
"bValue",
""
]
You can always embed a stream of (zero or more) JSON entities within some other JSON structure by decorating the stream, that is, in the present case, by wrapping the STREAM as follows:
{ arr: [ STREAM ] }
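Concretely, with the filter from the question, that wrapper looks like this:

jq '{arr: [ .arr[] | select(index("FooItem")) | .[3].texts ]}'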
In the present case, however, we can also take the view that we are simply editing the original document, and accordingly use a variation of the map(select(...)) idiom:
.arr |= map( select(index("FooItem")) | .[3].texts)
This latter approach ensures that the context of the "arr" key is preserved.
Addendum
To filter out the empty strings, simply add another map(select(...)):
.arr |= map( select(index("FooItem"))
| .[3].texts | map(select(length>0)))
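Applied to the sample document (with the zero-length strings added to a texts array as in the edit), this produces the requested structure:

{
  "arr": [
    [
      "aBarfoo",
      "avalue"
    ],
    [
      "mybarfoo",
      "bValue"
    ]
  ]
}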