Removing duplicates: unexpected Sparql Sample behaviour, missing result - duplicates

I'm querying the IdRef Sparql endpoint to get a researcher's co-authors. In order to get more complete results, I'm doing a federated query against the HAL endpoint.
My query works pretty well but generates duplicates, which I aim to de-duplicate using authorities identifiers (ORCID, ISNI or whatever).
So far, I achieved the following query, but now my problem is that one result is missing.
My query is:
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
SELECT distinct ?aut ?auturi
WHERE {
SELECT distinct (SAMPLE(?auturi) AS ?auturi) (SAMPLE(?aut) AS ?aut) ?ids
WHERE {
{
?uri ?rel <http://www.idref.fr/139753753/id>. #entities our author has a link with
?uri ?relcontrib ?auturi. #other with a link to these entities
?auturi a foaf:Person. #filter for persons
?auturi skos:prefLabel ?aut. #get authors' name
FILTER (?auturi != <http://www.idref.fr/139753753/id>) #exclude the same author we're querying
OPTIONAL {
?auturi owl:sameAs ?ids. #get authors' identifiers
}
} UNION {
<http://www.idref.fr/139753753/id> owl:sameAs ?id.
FILTER (CONTAINS(STR(?id), "archives-ouvertes.fr"))
BIND(URI(REPLACE(STR(?id), "#.*", "")) as ?idHal) #get an ID to query HAL
SERVICE <http://sparql.archives-ouvertes.fr/sparql> {
?idHal foaf:publications ?uri. #same as above
?auturi foaf:publications ?uri.
?auturi foaf:name ?aut.
FILTER (?idHal != ?auturi)
OPTIONAL {
?auturi owl:sameAs ?ids.
}
}
}
}
}
As you can see, I'm using a subquery with sample to perform the "de-duplication", but it doesn't work as expected (or at least as I'd expect): one result is stripped away. You can see here the un-sampled subquery, it returns an extra result matching this uri: https://data.archives-ouvertes.fr/author/marie-masclet-de-barbarin.rdf
At first I thought it was because this result had no matching owl:sameAs object, but another result in the set doesn't either and yet is in the final results set.
I'm quite puzzled by this behaviour and I suspect it is because I don't fully understand how sample works. Maybe there is a more accurate way to achieve what I'm looking for.
Edit: results (with duplicates) are as follows:
# auturi aut
1 http://www.idref.fr/057577889/id Lantenois, Annick (1956-....)
2 http://www.idref.fr/033888760/id Cubaud, Pierre
3 http://www.idref.fr/028984838/id Suber, Peter
4 http://www.idref.fr/165836652/id Cramer, Florian (1969-....)
5 http://www.idref.fr/050447823/id Mounier, Pierre (1970-....)
6 http://www.idref.fr/174428006/id Ena, Alexandra (19..-....)
7 http://www.idref.fr/052212807/id Lebert, Marie
8 https://data.archives-ouvertes.fr/author/pierre-mounier Pierre Mounier
9 https://data.archives-ouvertes.fr/author/patrice-bellot Patrice Bellot
10 https://data.archives-ouvertes.fr/author/marlene-delhaye Marlène Delhaye
11 https://data.archives-ouvertes.fr/author/denis-bertin Denis Bertin
12 https://data.archives-ouvertes.fr/author/emma-bester Emma Bester
13 https://data.archives-ouvertes.fr/author/marie-masclet-de-barbarin Marie Masclet de Barbarin
Basically the only duplicates are #5 & #8. They can be identified as such because they share a common ?ids object (not shown in results here for clarity. See full results, with ?ids, here)

Marie Masclet de Barbarin is hidden precisely because there is another person, Emma Bester, who also does not have an owl:sameAs edge.
Consider this query:
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
SELECT DISTINCT ?auturi ?aut ?ids
WHERE {
<http://www.idref.fr/139753753/id> owl:sameAs ?id.
FILTER (CONTAINS(STR(?id), "archives-ouvertes.fr"))
BIND(URI(REPLACE(STR(?id), "#.*", "")) as ?idHal) #get an ID to query HAL
SERVICE <http://sparql.archives-ouvertes.fr/sparql> {
?idHal foaf:publications ?uri. #same as above
?auturi foaf:publications ?uri.
?auturi foaf:name ?aut.
FILTER (?idHal != ?auturi)
OPTIONAL {
?auturi owl:sameAs ?ids.
}
}
}
This yields 12 results:
Notice that many of these people have multiple values of owl:sameAs, and they are all different between each other.
However, Marie and Emma have no owl:sameAs value, so ?ids is left unbound (effectively a 'null' value) for both of them.
So, when sampling the author name and uri (grouping by ?ids), we can use the following query:
SELECT DISTINCT (SAMPLE(?auturi) AS ?auturi) (SAMPLE(?aut) AS ?aut) ?ids
WHERE {
<http://www.idref.fr/139753753/id> owl:sameAs ?id.
FILTER (CONTAINS(STR(?id), "archives-ouvertes.fr"))
BIND(URI(REPLACE(STR(?id), "#.*", "")) as ?idHal) #get an ID to query HAL
SERVICE <http://sparql.archives-ouvertes.fr/sparql> {
?idHal foaf:publications ?uri. #same as above
?auturi foaf:publications ?uri.
?auturi foaf:name ?aut.
FILTER (?idHal != ?auturi)
OPTIONAL {
?auturi owl:sameAs ?ids.
}
}
}
This only has 11 results however, with Marie missing:
Why? Because the ?ids has a null value for two separate authors, and by sampling we are asking for only one of these authors, so the second one gets skipped.
So why is Marie skipped 100% of the time and not 50%? SAMPLE may return any value from a group, but a given store usually picks it deterministically, most likely based on the order in which the triples were loaded. If you loaded the same data into a different machine, possibly with a different triplestore, Emma might be the one that is skipped.
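To make the grouping effect concrete, here is a minimal Python sketch of what sampling one author per ?ids value does. It is only a model of the behaviour described above, not what the triplestore runs: the shortened URIs, the "id:1" identifier and the helper name sample_per_group are illustrative assumptions, and None stands in for an unbound ?ids.

```python
# Hypothetical rows shaped like (auturi, aut, ids); None models an unbound ?ids.
rows = [
    ("hal:denis-bertin", "Denis Bertin", "id:1"),
    ("hal:emma-bester", "Emma Bester", None),
    ("hal:marie-masclet-de-barbarin", "Marie Masclet de Barbarin", None),
]


def sample_per_group(rows):
    """Keep one (auturi, aut) pair per distinct ids value."""
    groups = {}
    for auturi, aut, ids in rows:
        # setdefault keeps the first row seen for each key; a real SAMPLE
        # is free to pick any row of the group.
        groups.setdefault(ids, (auturi, aut))
    return list(groups.values())


sampled = sample_per_group(rows)
print(sampled)
```

Because both authors without an identifier fall into the same None group, only one of them survives, which mirrors the missing result.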
How to solve this?
The hard part is that Pierre Mounier exists as two almost separate entities, with two ?ids and even two text names, "Pierre Mounier" and "Mounier, Pierre (1970-....)".
Thus, the obvious solution of sampling ?auturi and grouping by ?aut will show Marie, but also will not deduplicate Pierre.
A better solution would be to use COALESCE to bind ?ids to something different for each author, instead of letting it be null for both. This is done like this:
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
SELECT DISTINCT ?auturi ?aut ?idsClean
WHERE {
<http://www.idref.fr/139753753/id> owl:sameAs ?id.
FILTER (CONTAINS(STR(?id), "archives-ouvertes.fr"))
BIND(URI(REPLACE(STR(?id), "#.*", "")) as ?idHal) #get an ID to query HAL
SERVICE <http://sparql.archives-ouvertes.fr/sparql> {
?idHal foaf:publications ?uri. #same as above
?auturi foaf:publications ?uri.
?auturi foaf:name ?aut.
FILTER (?idHal != ?auturi)
OPTIONAL {
?auturi owl:sameAs ?ids.
}
BIND(COALESCE(?ids, CONCAT("No ID: ", ?aut)) AS ?idsClean)
}
}
This will return:
Putting this method to work in the larger query, we obtain:
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
SELECT distinct ?aut ?auturi
WHERE {
SELECT distinct (SAMPLE(?auturi) AS ?auturi) (SAMPLE(?aut) AS ?aut) ?ids_clean
WHERE {
{
?uri ?rel <http://www.idref.fr/139753753/id>. #entities our author has a link with
?uri ?relcontrib ?auturi. #other with a link to these entities
?auturi a foaf:Person. #filter for persons
?auturi skos:prefLabel ?aut. #get authors' name
FILTER (?auturi != <http://www.idref.fr/139753753/id>) #exclude the same author we're querying
OPTIONAL {
?auturi owl:sameAs ?ids. #get authors' identifiers
}
BIND(COALESCE(?ids, CONCAT("No ID: ", ?aut)) AS ?ids_clean)
} UNION {
<http://www.idref.fr/139753753/id> owl:sameAs ?id.
FILTER (CONTAINS(STR(?id), "archives-ouvertes.fr"))
BIND(URI(REPLACE(STR(?id), "#.*", "")) as ?idHal) #get an ID to query HAL
SERVICE <http://sparql.archives-ouvertes.fr/sparql> {
?idHal foaf:publications ?uri. #same as above
?auturi foaf:publications ?uri.
?auturi foaf:name ?aut.
FILTER (?idHal != ?auturi)
OPTIONAL {
?auturi owl:sameAs ?ids.
}
BIND(COALESCE(?ids, CONCAT("No ID: ", ?aut)) AS ?ids_clean)
}
}
}
}
And this yields the correct 12 results:
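The effect of the COALESCE fallback can also be modelled outside the triplestore. The Python sketch below is only an illustration under assumed data: the shortened URIs, the "id:shared" identifier and the helper name sample_with_fallback are hypothetical, and None stands in for an unbound ?ids.

```python
# Hypothetical rows shaped like (auturi, aut, ids); None models an unbound ?ids.
rows = [
    ("idref:050447823", "Mounier, Pierre (1970-....)", "id:shared"),
    ("hal:pierre-mounier", "Pierre Mounier", "id:shared"),
    ("hal:emma-bester", "Emma Bester", None),
    ("hal:marie-masclet-de-barbarin", "Marie Masclet de Barbarin", None),
]


def sample_with_fallback(rows):
    """Group on ids, falling back to a per-author key when ids is missing."""
    groups = {}
    for auturi, aut, ids in rows:
        # Same idea as COALESCE(?ids, CONCAT("No ID: ", ?aut)).
        key = ids if ids is not None else "No ID: " + aut
        groups.setdefault(key, (auturi, aut))
    return list(groups.values())


deduplicated = sample_with_fallback(rows)
print(deduplicated)
```

In this toy data the two Pierre Mounier records share a key and collapse into one, while Emma and Marie each get their own key and are both kept.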

Related

select key-value pairs from json as rows in Snowflake

Assume I have the following JSON:
{
"CURRENCY_CODE": "INR",
"CURRENCY_NAME": "Indian Rupee",
"ID_CURRENCY": 8,
"ISO_CODE": 4217
}
I want to query it in Snowflake so I get the following output:
| Key | Value |
| :------------ | :----------- |
| CURRENCY_CODE | INR |
| CURRENCY_NAME | Indian Rupee |
| ID_CURRENCY | 8 |
| ISO_CODE | 4217 |
I expect something like:
select d.key, d.value
from table('Here goes json') d
Does Snowflake have any function to achieve this?
It is possible to pass the argument directly into FLATTEN:
SELECT f.*
FROM TABLE(FLATTEN(INPUT => PARSE_JSON('<here goes json>'))) f;
You can use LATERAL FLATTEN() in your FROM clause to achieve this.
As an example:
SELECT fdt.KEY, fdt.VALUE
FROM VALUES('{
"CURRENCY_CODE": "INR",
"CURRENCY_NAME": "Indian Rupee",
"ID_CURRENCY": 8,
"ISO_CODE": 4217
}') dt
,lateral flatten( input => parse_json(column1) ) fdt;
| KEY | VALUE |
| :------------ | :----------- |
| CURRENCY_CODE | INR |
| CURRENCY_NAME | Indian Rupee |
| ID_CURRENCY | 8 |
| ISO_CODE | 4217 |
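For readers who want to check the expected shape of the result outside Snowflake, the same key/value expansion can be sketched in plain Python. This only illustrates the row-per-key idea; it is not how FLATTEN is implemented, and the function name flatten_object is a hypothetical helper.

```python
import json

raw = """{
  "CURRENCY_CODE": "INR",
  "CURRENCY_NAME": "Indian Rupee",
  "ID_CURRENCY": 8,
  "ISO_CODE": 4217
}"""


def flatten_object(text):
    """Return one (key, value) row per top-level member of a JSON object."""
    return list(json.loads(text).items())


key_value_rows = flatten_object(raw)
for key, value in key_value_rows:
    print(key, value)
```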

How to select query from json in Postgres

I have JSON data in field hotel_data like this:
{
"title":"foo",
"description":[
{
"locale":"pt",
"content":"pt text"
},
{
"locale":"fr",
"content":"fr text"
}
]
}
I would like to select only the fr description. Is this possible with Postgres, and how?
I tried to use ->> but it is not working...
SELECT
hotel_data->'description'->>'locale' = 'fr' AS description
FROM hotel LIMIT 1;
Note:
I don't want to use SELECT *...
Expected output: {description: "fr text"}
You can use a lateral join and json_to_recordset to expand the json array as a set of records. Then, you can filter on column locale in the generated records, and finally recompose a new json object with your expected result:
select json_build_object('description', d.content) hotel_data_descr_fr
from
mytable,
json_to_recordset(hotel_data->'description') as d("locale" text, "content" text)
where d.locale = 'fr'
Demo on DB Fiddle:
with mytable as (
select '{
"title":"foo",
"description":[
{
"locale":"pt",
"content":"pt text"
},
{
"locale":"fr",
"content":"fr text"
}
]
}'::json hotel_data
)
select json_build_object('description', d.content) hotel_data_descr_fr
from
mytable,
json_to_recordset(hotel_data->'description') as d("locale" text, "content" text)
where d.locale = 'fr'
| hotel_data_descr_fr |
| :------------------------- |
| {"description": "fr text"} |
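The expand-then-filter logic of this query can be mirrored in a few lines of Python, which may help when checking the expected output by hand. This is a sketch under assumptions: description_for is a hypothetical helper, and unlike the SQL (which returns one row per matching array element) it returns only the first match.

```python
import json

hotel_data = json.loads("""{
  "title": "foo",
  "description": [
    {"locale": "pt", "content": "pt text"},
    {"locale": "fr", "content": "fr text"}
  ]
}""")


def description_for(doc, locale):
    """Return {"description": content} for the first entry with this locale."""
    for entry in doc.get("description", []):
        if entry.get("locale") == locale:
            return {"description": entry["content"]}
    return None  # no description in the requested locale


print(description_for(hotel_data, "fr"))
```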
The filtering can be done using the @> (containment) operator, which can use a GIN index on the hotel_data column. This is typically faster than expanding the array.
select ...
from hotel
where hotel_data @> '{"description": [{"locale":"fr"}] }';
This can also be extended to include more properties:
select ...
from hotel
where hotel_data @> '{"description": [{"locale":"fr", "headline": "nice view"}] }';
But you can only express equality conditions on the key/value pairs with that. Using LIKE is not possible. If you need that, you will have to expand the array and apply the condition in the WHERE clause - see GMB's answer.
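To see why only equality conditions are possible, it can help to model jsonb containment (the @> operator) as a recursive comparison. The Python function below is a simplified sketch, not Postgres's implementation: the name contains is hypothetical, and it ignores special cases such as an array containing a bare scalar.

```python
def contains(doc, pattern):
    """Simplified model of jsonb containment: does doc contain pattern?"""
    if isinstance(pattern, dict):
        # Every key of the pattern must exist in doc with a contained value.
        return isinstance(doc, dict) and all(
            key in doc and contains(doc[key], value)
            for key, value in pattern.items()
        )
    if isinstance(pattern, list):
        # Every pattern element must be contained in some element of doc.
        return isinstance(doc, list) and all(
            any(contains(item, wanted) for item in doc) for wanted in pattern
        )
    if isinstance(doc, bool) or isinstance(pattern, bool):
        # Keep booleans apart from numbers (in Python, True == 1).
        return isinstance(doc, bool) and isinstance(pattern, bool) and doc == pattern
    return doc == pattern  # scalars are compared for equality only


hotel = {
    "title": "foo",
    "description": [
        {"locale": "pt", "content": "pt text"},
        {"locale": "fr", "content": "fr text"},
    ],
}
print(contains(hotel, {"description": [{"locale": "fr"}]}))
```

Every leaf of the pattern ends in an equality test, which is why a LIKE-style condition cannot be expressed this way.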
To extract that description, I would use a scalar sub-query:
select (select jsonb_build_object('description', t.descr ->> 'content')
from jsonb_array_elements(h.hotel_data -> 'description') as t(descr)
where t.descr ->> 'locale' = 'fr'
limit 1)
from hotel h
where h.hotel_data @> '{"description": [{"locale":"fr"}] }';
That way you don't need to expand the array for filtering which I expect to be faster if only a few hotels qualify for that condition. But it has the drawback that you need to repeat the condition on the locale in the sub-select.
The limit 1 is only a safety net in case you have more than one French description. If you never have that, it doesn't hurt either.
With Postgres 12 this is easier:
select jsonb_build_object(
'description',
jsonb_path_query_first(hotel_data, '$.description ? (@.locale == "fr")') -> 'content'
)
from hotel
where hotel_data @> '{"description": [{"locale":"fr"}] }'
All of the above assumes hotel_data is a jsonb column. If it's not (which it should be), you need to cast it: hotel_data::jsonb

query to Extract from Json in Postgres

I've a json object in my postgres db, which looks like as given below
{"Actor":[{"personName":"Shashi Kapoor","characterName":"Prem"},{"personName":"Sharmila Tagore","characterName":"Preeti"},{"personName":"Shatrughan Sinha","characterName":"Dr. Amar"]}
Edited (from editor: left the original because it is an invalid json, in my edit I fixed it)
{
"Actor":[
{
"personName":"Shashi Kapoor",
"characterName":"Prem"
},
{
"personName":"Sharmila Tagore",
"characterName":"Preeti"
},
{
"personName":"Shatrughan Sinha",
"characterName":"Dr. Amar"
}
]
}
The column is named xyz, and each row has a corresponding content_id.
I need to retrieve the content_ids that have an Actor with personName = Sharmila Tagore.
I tried many queries; these two seemed the most likely to work, but neither returned what I need.
SELECT content_id
FROM content_table
WHERE cast_and_crew #>> '{Actor,personName}' = '"C. R. Simha"'
.
SELECT cast_and_crew ->> 'content_id' AS content_id
FROM content_table
WHERE cast_and_crew ->> 'Actor' -> 'personName' = 'C. R. Simha'
You should use jsonb_array_elements() to search in a nested jsonb array:
select content_id, value
from content_table,
lateral jsonb_array_elements(cast_and_crew->'Actor');
content_id | value
------------+-----------------------------------------------------------------
1 | {"personName": "Shashi Kapoor", "characterName": "Prem"}
1 | {"personName": "Sharmila Tagore", "characterName": "Preeti"}
1 | {"personName": "Shatrughan Sinha", "characterName": "Dr. Amar"}
(3 rows)
Column value is of the type jsonb so you can use ->> operator for it:
select content_id, value
from content_table,
lateral jsonb_array_elements(cast_and_crew->'Actor')
where value->>'personName' = 'Sharmila Tagore';
content_id | value
------------+--------------------------------------------------------------
1 | {"personName": "Sharmila Tagore", "characterName": "Preeti"}
(1 row)
Note, if you are using json (not jsonb) use json_array_elements() of course.
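The same search can be sketched in Python to show what the query is doing: unnest the Actor array and test each element. The rows below are assumed sample data (the second row is hypothetical), and content_ids_with_actor is an illustrative helper. It returns each content_id once, whereas the SQL above returns one row per matching array element.

```python
# Hypothetical (content_id, cast_and_crew) rows; the second row is made up.
content_rows = [
    (1, {"Actor": [
        {"personName": "Shashi Kapoor", "characterName": "Prem"},
        {"personName": "Sharmila Tagore", "characterName": "Preeti"},
        {"personName": "Shatrughan Sinha", "characterName": "Dr. Amar"},
    ]}),
    (2, {"Actor": [{"personName": "C. R. Simha"}]}),
]


def content_ids_with_actor(rows, person_name):
    """Return the ids of rows whose Actor array holds the given personName."""
    return [
        content_id
        for content_id, cast_and_crew in rows
        if any(
            actor.get("personName") == person_name
            for actor in cast_and_crew.get("Actor", [])
        )
    ]


print(content_ids_with_actor(content_rows, "Sharmila Tagore"))
```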

Finding all records which have a 'true' value in nested JSON

This is my nested JSON:
{
"business_id":"pNQwnY_q4okdlnPiR-3RBA",
"full_address":"6105 S Fort Apache Rd\nSpring Valley\nLas Vegas, NV 89148",
"hours":{ },
"open":true,
"categories":[ ],
"city":"Las Vegas",
"review_count":68,
"name":"Empire Bagels",
"neighborhoods":[
"Spring Valley"
],
"longitude":-115.298175926911,
"state":"NV",
"stars":3.0,
"latitude":36.07728616051,
"attributes":{
"Take-out":true,
"Wi-Fi":"no",
"Good For":{
"dessert":false,
"latenight":false,
"lunch":false,
"dinner":false,
"breakfast":true,
"brunch":false
},
"Caters":true,
"Noise Level":"quiet",
"Takes Reservations":false,
"Delivery":false,
"Ambience":{
"romantic":false,
"intimate":false,
"classy":false,
"hipster":false,
"divey":false,
"touristy":false,
"trendy":false,
"upscale":false,
"casual":true
},
"Parking":{
"garage":false,
"street":false,
"validated":false,
"lot":true,
"valet":false
},
"Has TV":true,
"Outdoor Seating":true,
"Attire":"casual",
"Alcohol":"none",
"Waiter Service":false,
"Accepts Credit Cards":true,
"Good for Kids":true,
"Good For Groups":true,
"Price Range":1
},
"type":"business"
}
I am querying this using Apache Drill. I want to find the top 10 most common 'true' attributes across all restaurants in a city. I want something like:
Accepts Credit Cards : 200,
Alcohol: 300,
Good For Kids : 500
What would my query look like? This is what I did:
select attributes, count(*) attributes from `yelp_dataset` group by attributes;
I get this error:
Error: SYSTEM ERROR: UnsupportedOperationException: Map, Array, Union or repeated scalar type should not be used in group by, order by or in a comparison operator. Drill does not support compare between MAP:REQUIRED and MAP:REQUIRED.
Fragment 0:0
[Error Id: 8fe8a616-92c7-4da0-ab65-b5542d391f47 on 192.168.10.104:31010] (state=,code=0)
What should my query be?
I was not able to auto-flatten the attributes using KVGEN() because of mixed data types, but you might try a CTE with a bit of UNION ALL brute force:
WITH ReviewAttributes AS (
SELECT
reviews.name,
'Accepts Credit Cards' as `AttributeName`,
CASE WHEN reviews.attributes.`Accepts Credit Cards` = true THEN 1 ELSE 0 END as `AttributeValue`
FROM
`yelp_dataset` reviews
UNION ALL
SELECT
reviews.name,
'Alcohol' as `AttributeName`,
CASE WHEN reviews.attributes.`Alcohol` <> 'none' THEN 1 ELSE 0 END as `AttributeValue`
FROM
`yelp_dataset` reviews
UNION ALL
SELECT
reviews.name,
'Good for Kids' as `AttributeName`,
CASE WHEN reviews.attributes.`Good for Kids` = true THEN 1 ELSE 0 END as `AttributeValue`
FROM
`yelp_dataset` reviews
)
SELECT
`AttributeName`,
SUM(`AttributeValue`) as `AttributeCount`
FROM
ReviewAttributes
GROUP BY
`AttributeName`;
The CASE statements might also help you work around some of the differences between boolean and enumerated fields, like counting Alcohol from your sample.
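If the data can be processed outside Drill, the counting itself is short in Python. The sketch below assumes the records are already parsed into dictionaries; the sample records and the function name top_true_attributes are hypothetical. It counts only top-level attributes whose value is the boolean true and skips nested maps such as "Good For" and non-boolean values such as "Alcohol".

```python
from collections import Counter

# Hypothetical, heavily trimmed records in the shape of the JSON above.
businesses = [
    {"city": "Las Vegas", "attributes": {
        "Take-out": True, "Wi-Fi": "no", "Caters": True,
        "Good For": {"breakfast": True}}},
    {"city": "Las Vegas", "attributes": {
        "Take-out": True, "Caters": False, "Price Range": 1}},
    {"city": "Phoenix", "attributes": {"Take-out": True}},
]


def top_true_attributes(records, city, limit=10):
    """Count top-level attributes that are exactly true for one city."""
    counts = Counter()
    for record in records:
        if record.get("city") != city:
            continue
        for name, value in record.get("attributes", {}).items():
            if value is True:  # excludes 1, "yes" and nested maps
                counts[name] += 1
    return counts.most_common(limit)


print(top_true_attributes(businesses, "Las Vegas"))
```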

Adjacency List to JSON graph with Postgres

I have the following schema for the tags table:
CREATE TABLE tags (
id integer NOT NULL,
name character varying(255) NOT NULL,
parent_id integer
);
I need to build a query to return the following structure (here represented as yaml for readability):
- name: Ciencia
parent_id:
id: 7
children:
- name: Química
parent_id: 7
id: 9
children: []
- name: Biología
parent_id: 7
id: 8
children:
- name: Botánica
parent_id: 8
id: 19
children: []
- name: Etología
parent_id: 8
id: 18
children: []
After some trial and error and looking for similar questions on SO, I've come up with this query:
WITH RECURSIVE tagtree AS (
SELECT tags.name, tags.parent_id, tags.id, json '[]' children
FROM tags
WHERE NOT EXISTS (SELECT 1 FROM tags tt WHERE tt.parent_id = tags.id)
UNION ALL
SELECT (tags).name, (tags).parent_id, (tags).id, array_to_json(array_agg(tagtree)) children FROM (
SELECT tags, tagtree
FROM tagtree
JOIN tags ON tagtree.parent_id = tags.id
) v
GROUP BY v.tags
)
SELECT array_to_json(array_agg(tagtree)) json
FROM tagtree
WHERE parent_id IS NULL
But it returns the following results when converted to yaml:
- name: Ciencia
parent_id:
id: 7
children:
- name: Química
parent_id: 7
id: 9
children: []
- name: Ciencia
parent_id:
id: 7
children:
- name: Biología
parent_id: 7
id: 8
children:
- name: Botánica
parent_id: 8
id: 19
children: []
- name: Etología
parent_id: 8
id: 18
children: []
The root node is duplicated.
I could merge the results into the expected result in my app code, but I feel I am close and it could all be done from PG.
Here's an example with SQL Fiddle:
http://sqlfiddle.com/#!15/1846e/1/0
Expected output:
https://gist.github.com/maca/e7002eb10f36fcdbc51b
Actual output:
https://gist.github.com/maca/78e84fb7c05ff23f07f4
Here's a solution using PLV8 for your schema.
First, build a materialized path using a PL/pgSQL function and recursive CTEs.
CREATE OR REPLACE FUNCTION get_children(tag_id integer)
RETURNS json AS $$
DECLARE
result json;
BEGIN
SELECT array_to_json(array_agg(row_to_json(t))) INTO result
FROM (
WITH RECURSIVE tree AS (
SELECT id, name, ARRAY[]::INTEGER[] AS ancestors
FROM tags WHERE parent_id IS NULL
UNION ALL
SELECT tags.id, tags.name, tree.ancestors || tags.parent_id
FROM tags, tree
WHERE tags.parent_id = tree.id
) SELECT id, name, ARRAY[]::INTEGER[] AS children FROM tree WHERE $1 = tree.ancestors[array_upper(tree.ancestors,1)]
) t;
RETURN result;
END;
$$ LANGUAGE plpgsql;
Then, build the tree from the output of the above function.
CREATE OR REPLACE FUNCTION get_tree(data json) RETURNS json AS $$
var root = [];
for(var i in data) {
build_tree(data[i]['id'], data[i]['name'], data[i]['children']);
}
function build_tree(id, name, children) {
var exists = getObject(root, id);
if(exists) {
exists['children'] = children;
}
else {
root.push({'id': id, 'name': name, 'children': children});
}
}
function getObject(theObject, id) {
var result = null;
if(theObject instanceof Array) {
for(var i = 0; i < theObject.length; i++) {
result = getObject(theObject[i], id);
if (result) {
break;
}
}
}
else
{
for(var prop in theObject) {
if(prop == 'id') {
if(theObject[prop] === id) {
return theObject;
}
}
if(theObject[prop] instanceof Object || theObject[prop] instanceof Array) {
result = getObject(theObject[prop], id);
if (result) {
break;
}
}
}
}
return result;
}
return JSON.stringify(root);
$$ LANGUAGE plv8 IMMUTABLE STRICT;
This will yield the required JSON mentioned in your question. Hope that helps.
I've written a detailed post/breakdown of how this solution works here.
Try PL/Python and networkx.
Admittedly, using the following doesn't yield JSON in exactly the requested format, but the information seems to be all there and, if PL/Python is acceptable, this might be adapted into a complete answer.
CREATE OR REPLACE FUNCTION get_adjacency_data(
names text[],
ids integer[],
parent_ids integer[])
RETURNS jsonb AS
$BODY$
pairs = zip(ids, parent_ids)
import networkx as nx
import json
from networkx.readwrite import json_graph
name_dict = dict(zip(ids, names))
G=nx.DiGraph()
G.add_nodes_from(ids)
nx.set_node_attributes(G, 'name', name_dict)
G.add_edges_from(pairs)
return json.dumps(json_graph.adjacency_data(G))
$BODY$ LANGUAGE plpythonu;
WITH raw_data AS (
SELECT array_agg(name) AS names,
array_agg(parent_id) AS parent_ids,
array_agg(id) AS ids
FROM tags
WHERE parent_id IS NOT NULL)
SELECT get_adjacency_data(names, parent_ids, ids)
FROM raw_data;
I was looking for the same solution, and maybe this example will be useful to someone.
Tested on Postgres 10 with a table of the same structure:
a table with columns id, name and pid (the parent_id).
create or replace function get_c_tree(p_parent int8) returns setof jsonb as $$
select
case
when count(x) > 0 then jsonb_build_object('id', c.id, 'name', c.name, 'children', jsonb_agg(f.x))
else jsonb_build_object('id', c.id, 'name', c.name, 'children', null)
end
from company c left join get_c_tree(c.id) as f(x) on true
where c.pid = p_parent or (p_parent is null and c.pid is null)
group by c.id, c.name;
$$ language sql;
select jsonb_agg(get_c_tree) from get_c_tree(null::int8);
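For comparison with the in-database approaches, the merge step mentioned in the question (building the nested structure in application code) can be sketched in Python. This is a minimal sketch, assuming the rows of the tags table have been fetched as (id, name, parent_id) tuples; build_tree is a hypothetical helper, and rows whose parent is missing from the input are treated as roots.

```python
import json

# Rows of the tags table as (id, name, parent_id), from the question's example.
tags = [
    (7, "Ciencia", None),
    (9, "Química", 7),
    (8, "Biología", 7),
    (19, "Botánica", 8),
    (18, "Etología", 8),
]


def build_tree(rows):
    """Turn an adjacency list into a list of nested root nodes."""
    nodes = {
        tag_id: {"name": name, "parent_id": parent_id, "id": tag_id, "children": []}
        for tag_id, name, parent_id in rows
    }
    roots = []
    for node in nodes.values():
        parent = nodes.get(node["parent_id"])
        if parent is None:
            roots.append(node)  # no known parent: top-level node
        else:
            parent["children"].append(node)
    return roots


tree = build_tree(tags)
print(json.dumps(tree, ensure_ascii=False, indent=2))
```

Each node is created exactly once and attached to a single parent, so the root cannot be duplicated the way it is in the recursive CTE result.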