Wikidata query with values from a CSV file

I would like to do a Wikidata query of many values that are listed in a column of a CSV file on my computer.
How can I load the values from the CSV file into the Wikidata query automatically without copying them in manually?
So far I have worked with the Wikidata query in Visual Studio Code.
This is the query I made for one person:
SELECT ?Author ?AuthorLabel ?VIAF ?birthLocation
WHERE {
VALUES ?VIAF {"2467372"}
?Author wdt:P214 ?VIAF ;
wdt:P19 ?birthLocation .
SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],de". }
}
I want to automatically load many values into the curly brackets of the query above from the column of my CSV file.

First, I feel compelled to point out that if you don't already know a programming language, OpenRefine can do this for you in a few clicks.
Having said that, here's a basic Python program that accomplishes what you literally asked for - reading a set of VIAF ids and adding them to your query:
import csv

def expand_query(ids):
    query = """
SELECT ?Author ?AuthorLabel ?VIAF ?birthLocation ?birthLocationLabel WHERE {
VALUES ?VIAF {
""" + '"' + '" "'.join(ids) + '"' """
}
?Author wdt:P214 ?VIAF.
OPTIONAL { ?Author wdt:P19 ?birthLocation. }
SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],de,en". }
}
"""
    return query

def main():
    with open('../data/authors.csv', "rt") as csvfile:
        csvreader = csv.DictReader(csvfile, dialect=csv.excel)
        ids = [row["viaf"] for row in csvreader]
    print(expand_query(ids))

if __name__ == "__main__":
    main()
It expects a CSV file with a column called viaf and will ignore all other columns. e.g.
name,viaf
Douglas Adams,113230702
William Shakespeare,96994048
Bertolt Brecht,2467372
I've tweaked the query slightly to:
always output a row even if the birth location isn't available
output the label for the birth location
add English as an additional fallback language for labels
This assumes that you've got a small enough set of identifiers to be able to use a single query, but you can extend it to (see the sketch after this list):
read identifiers in batches of a convenient size
use SPARQLwrapper to send the query to the Wikidata SPARQL endpoint and parse the results
write the results to a different CSV file in chunks as they're received
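Here is a rough, hedged sketch of that batched approach. It reuses expand_query() from the program above; the SPARQLWrapper calls and endpoint URL are standard, but the batch size, output file name, column names and user-agent string are assumptions to adjust:
import csv
from SPARQLWrapper import SPARQLWrapper, JSON

ENDPOINT = "https://query.wikidata.org/sparql"
BATCH_SIZE = 200  # assumption: pick whatever the endpoint handles comfortably

def run_batch(ids):
    # Send one batch of VIAF ids to the endpoint and return the raw result bindings.
    sparql = SPARQLWrapper(ENDPOINT, agent="viaf-lookup-example/0.1 (contact address)")
    sparql.setQuery(expand_query(ids))  # expand_query() as defined above
    sparql.setReturnFormat(JSON)
    return sparql.query().convert()["results"]["bindings"]

def main():
    with open('../data/authors.csv', "rt") as csvfile:
        ids = [row["viaf"] for row in csv.DictReader(csvfile, dialect=csv.excel)]
    with open('authors_out.csv', "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["viaf", "author", "authorLabel", "birthLocationLabel"])
        for start in range(0, len(ids), BATCH_SIZE):
            for b in run_batch(ids[start:start + BATCH_SIZE]):
                writer.writerow([
                    b["VIAF"]["value"],
                    b["Author"]["value"],
                    b["AuthorLabel"]["value"],
                    b.get("birthLocationLabel", {}).get("value", ""),
                ])

if __name__ == "__main__":
    main()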

So, say you have a file my_file.csv with the following content:
2467372
63468347
12447
First of all, import a python library for reading files (like fileinput).
Then declare the pattern that you want to use for your query, using %s as placeholder for the identifiers.
Now, build a list of identifiers as follows:
identifiers = ['wd:'+line.strip() for line in fileinput.input(files='my_file.csv')]
And finally join the list using a space character as separator and pass this string to your query pattern:
query = query_pattern % ' '.join(identifiers)
This is the final code:
import fileinput
filename = 'my_file.csv'
query_pattern = '''SELECT ?Author ?AuthorLabel ?VIAF ?birthLocation
WHERE {
VALUES ?VIAF { %s }
?Author wdt:P214 ?VIAF ;
wdt:P19 ?birthLocation .
SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],de". }
}'''
identifiers = ['"'+line.strip()+'"' for line in fileinput.input(files=filename)]
query = query_pattern % ' '.join(identifiers)
print(query)
Executing it, you'll get:
SELECT ?Author ?AuthorLabel ?VIAF ?birthLocation
WHERE {
VALUES ?VIAF { "2467372" "63468347" "12447" }
?Author wdt:P214 ?VIAF ;
wdt:P19 ?birthLocation .
SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],de". }
}

Related

how to evaluate String to identify duplicate keys

I have an input string in groovy which is not strictly JSON.
String str = "['OS_Node':['eth0':'1310','eth0':'1312']]"
My issue is to identify the duplicate "eth0". I tried to convert this into a map using Eval.me(), but it automatically removes the duplicate key "eth0" and gives me a Map.
What is the best way for me to identify the presence of a duplicate key?
Note: there could be multiple OS_Node1\2\3 entries; I need to identify duplicates in each of them.
Is there any JSON API that can be used, or do I need to use logic based on substring()?
One way to solve this could be to cheat a little and replace colons with commas, which would transform the maps into lists, and then do a recursive search for duplicates:
def str = "['OS_Node':['eth0':'1310','eth0':'1312'], 'OS_Node':['eth1':'1310','eth1':'1312']]"
def tree = Eval.me(str.replaceAll(":", ","))
def dupes = findDuplicates(tree)
dupes.each { println it }
def findDuplicates(t, path=[], dupes=[]) {
def seen = [] as Set
t.collate(2).each { k, v ->
if (k in seen) dupes << [path: path + k]
seen << k
if (v instanceof List) findDuplicates(v, path+k, dupes)
}
dupes
}
when run, prints:
─➤ groovy solution.groovy
[path:[OS_Node, eth0]]
[path:[OS_Node]]
[path:[OS_Node, eth1]]
i.e. the method finds all paths to duplicated keys, where "path" is defined as the key sequence required to navigate to the duplicate key.
The function returns a list of maps which you can then do whatever you wish with. It should be noted that with this logic the "OS_Node" key is also treated as a duplicate, but you could easily filter that out in a step after this function call.
First of all, the string you have there is not JSON, not only due to the duplicate keys, but also because of the use of [] for maps. This looks a lot more like a Groovy map literal. So if this is your custom format and you cannot do anything about it, I'd write a small parser for it, because sooner or later edge cases or quoting problems will come around the corner.
@Grab("com.github.petitparser:petitparser-core:2.3.1")
import org.petitparser.tools.GrammarDefinition
import org.petitparser.tools.GrammarParser
import org.petitparser.parser.primitive.CharacterParser as CP
import org.petitparser.parser.primitive.StringParser as SP
import org.petitparser.utils.Functions as F
class MappishGrammerDefinition extends GrammarDefinition {
MappishGrammerDefinition() {
define("start", ref("map"))
define("map",
CP.of("[" as Character)
.seq(ref("kv-pairs"))
.seq(CP.of("]" as Character))
.map{ it[1] })
define("kv-pairs",
ref("kv-pair")
.plus()
.separatedBy(CP.of("," as Character))
.map{ it.collate(2)*.first()*.first() })
define("kv-pair",
ref('key')
.seq(CP.of(":" as Character))
.seq(ref('val'))
.map{ [it[0], it[2]] })
define("key",
ref("quoted"))
define("val",
ref("quoted")
.or(ref("map")))
define("quoted",
CP.anyOf("'")
.seq(SP.of("\\''").or(CP.pattern("^'")).star().flatten())
.seq(CP.anyOf("'"))
.map{ it[1].replace("\\'", "'") })
}
// Helper for `def`, which is a keyword in groovy
void define(s, p) { super.def(s,p) }
}
println(new GrammarParser(new MappishGrammerDefinition()).parse("['OS_Node':['eth0':'1310','eth0':'1312'],'OS_Node':['eth0':'42']]").get())
// → [[OS_Node, [[eth0, 1310], [eth0, 1312]]], [OS_Node, [[eth0, 42]]]]

How do I search for a string in this JSON with Python

My JSON file looks something like:
{
"generator": {
"name": "Xfer Records Serum",
....
},
"generator": {
"name: "Lennar Digital Sylenth1",
....
}
}
I ask the user for a search term and the input is searched for in the name key only. All matching results are returned; for example, if I input just 's', both of the entries above would be returned. Please also explain how to return all the object names which are generators. The simpler the method, the better for me. I use the json library, but if another library is required that is not a problem.
Before switching to JSON I tried XML but it did not work.
If your goal is just to search all name properties, this will do the trick:
import re
def search_names(term, lines):
    name_search = re.compile(r'\s*"name"\s*:\s*"(.*' + term + r'.*)",?$', re.I)
    return [x.group(1) for x in [name_search.search(y) for y in lines] if x]

with open('path/to/your.json') as f:
    lines = f.readlines()

print(search_names('s', lines))
which would return both names you listed in your example.
The way the search_names() function works is that it builds a regular expression that will match any line starting with "name": " (with a varying amount of whitespace), followed by your search term with any other characters around it, then terminated with " followed by an optional , and the end of the string. It then applies that to each line from the file. Finally, it filters out any non-matching lines and returns the value of the name property (the capture group contents) for each match.
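If the real file is valid JSON shaped like the sample (duplicate "generator" keys included), a minimal alternative sketch using only the json library is shown below. It relies on object_pairs_hook to keep every key/value pair instead of letting json.load collapse the duplicate keys; the function name and file path are placeholders:
import json

def search_generator_names(path, term):
    # Keep every (key, value) pair instead of collapsing duplicate keys.
    with open(path) as f:
        pairs = json.load(f, object_pairs_hook=lambda p: p)
    names = []
    for key, value in pairs:
        if key == "generator":
            inner = dict(value)  # nested objects arrive as lists of pairs here
            name = inner.get("name", "")
            if term.lower() in name.lower():
                names.append(name)
    return names

print(search_generator_names('path/to/your.json', 's'))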

Retrieve data by Ignoring null values and header row from csv file

I am working on a Groovy script in SoapUI 5.3.0 and facing the below issue while extracting the values from a file into a list.
The purpose of the code below is that the retrieved list has to be compared with another list containing valid values only.
Attaching the code snippet and the sample CSV file for reference.
Code to retrieve the values:
def DBvalue= context["csvfile"] //csv file containing the data
def count= context["dbrowcount"] //here the rowcount is 23
for (i=0;i<count;i++) {
def lines= ""
lines= DBvalue.text.split('\n')
List<String> rows = lines.collect{ it.split(';') }
log.info "list is"+rows
}
The sample CSV file I am working on contains 600 columns of data and 23 rows:
abc;null;1;2;3;5;8;null
cdf;null;2;3;6;null;5;6
hgf;null;null;null;jr;null;II
Currently my code is fetching the below output:
[[abc,null,1,2,3,5,8,null]]
[[abc,null,1,2,3,5,8,null]]
[[abc,null,1,2,3,5,8,null]]
Desired output:
[1,2,3,5,8]
[2,3,6,5,6]
[jr,II]
You should be able to achieve it with the code below; follow the in-line comments.
//Provide your file path; change if needed
def file = new File('/tmp/test.csv')
//To hold all the rows
def list = []
//Change delimiter if needed
def delimiter = ';'
file.readLines().eachWithIndex { line, index ->
if (index) {
//Get the row data by split, filter
def lineData = line.split(delimiter).findAll { 'null' != it && it}
log.info lineData
list << lineData
}
}
//Print all the row data
log.info list

How to convert a JSON file to an SQLite database

If I have some sample data, how do I put it into SQLite (preferably fully automated)?
{"uri":"/","user_agent":"example1"}
{"uri":"/foobar","user_agent":"example1"}
{"uri":"/","user_agent":"example2"}
{"uri":"/foobar","user_agent":"example3"}
I found the easiest way to do this is by using jq and CSV as an intermediary format.
Getting the CSV
First write your data to a file.
I will assume data.json here.
Then construct the header using jq:
% head -1 data.json | jq -r 'keys | @csv'
"uri","user_agent"
The head -1 is because we only want one line.
jq's -r makes the output a plain string instead of a JSON string wrapping the CSV.
We then call the built-in function keys to get the keys of the input as an array.
This we send to the @csv formatter, which outputs a single string with the headers in quoted CSV format.
We then need to construct the data.
% jq -r 'map(tostring) | @csv' < data.json
"/","example1"
"/foobar","example1"
"/","example2"
"/foobar","example3"
We now take each input object and convert it, using map(tostring), into a simple array of its values as strings (on an object, map works on the values, exactly like [.[] | tostring]).
This basically converts each dictionary to an array of its values.
Sent to the @csv formatter, we again get some CSV.
Putting it all together we get a single one-liner in the form of:
% (head -1 data.json | jq -r 'keys | @csv' && jq -r 'map(tostring) | @csv' < data.json) > data.csv
If you need to convert the data on the fly, i.e. without a file, try this:
% cat data.json | (read -r first && jq -r '(keys | @csv),(map(tostring) | @csv)' <<<"${first}" && jq -r 'map(tostring) | @csv')
Loading it into SQLite
Open an SQLite database:
sqlite3 somedb.sqlite
Now in the interactive shell do the following (assuming you wrote the CSV to data.csv and want it in a table called my_table):
.mode csv
.import data.csv my_table
Now close the shell and open it again for a clean environment.
You can now easily SELECT from the database and do whatever you want to.
Edits
Edit:
As pointed out (thanks @Leo), the original question did show newline delimited JSON objects, which each on their own conform to rfc4627, but not all together in that format.
jq can handle a single JSON array of objects much the same way though by preprocessing the file using jq '.[]' <input.json >preprocessed.json.
If you happen to be dealing with JSON text sequences (rfc7464) luckily jq has got your back too with the --seq parameter.
Edit 2:
Both the newline-separated JSON and the JSON text sequences have one important advantage: they reduce memory requirements down to O(1), meaning your total memory requirement only depends on your longest line of input. Putting the entire input into a single array instead requires that either your parser can handle late errors (i.e. a syntax error after the first 100k elements), which generally isn't the case to my knowledge; or that it parses the entire file twice (first validating the syntax, then parsing, discarding previous elements in the process, as is the case with jq --stream), which also happens rarely to my knowledge; or that it parses the whole input at once and returns the result in one step (think of receiving a Python dict which contains the entirety of your, say, 50G input data plus overhead), which is usually memory backed and hence raises your memory footprint by just about your total data size.
Edit 3:
If you hit any obstacles, try using keys_unsorted instead of keys.
I haven't tested that myself (I kind of assume my columns were already sorted), however @Kyle Barron reports that this was needed.
Edit 4:
As pointed out by youngminz in the comment below, the original command fails when working with non-{number,string} values like nested lists.
The command has been updated (with a slightly adapted version from the comment: map(), unlike map_values(), converts an object into an array of its values, the same as [.[]], making the map more readable).
Keys remain unaffected; if you really have complex types as keys (which may not even conform to JSON, but I'm too lazy to look it up right now) you can do the same for the key-related mappings.
A way to do this without CSV or a 3rd-party tool is to use the JSON1 extension of SQLite combined with the readfile extension that is provided in the sqlite3 CLI tool. As well as being a "more direct" solution overall, this has the advantage of handling JSON NULL values more consistently than CSV, which would otherwise import them as empty strings.
If the input file is a well-formed JSON file, e.g. the example given as an array:
[
{"uri":"/","user_agent":"example1"},
{"uri":"/foobar","user_agent":"example1"},
{"uri":"/","user_agent":"example2"},
{"uri":"/foobar","user_agent":"example3"}
]
Then this can be read into the corresponding my_table table as follows. Open the SQLite database file my_db.db using the sqlite3 CLI:
sqlite3 my_db.db
then create my_table using:
CREATE TABLE my_table(uri TEXT, user_agent TEXT);
Finally, the JSON data in my_data.json can be inserted into the table with the CLI command:
INSERT INTO my_table SELECT
json_extract(value, '$.uri'),
json_extract(value, '$.user_agent')
FROM json_each(readfile('my_data.json'));
If the initial JSON file consists of newline-separated JSON elements, it can first be converted into an array using jq:
jq -s '.' <my_data_raw.json >my_data.json
It's likely there is a way to do this directly in SQLite using JSON1, but I didn't pursue that given that I was already using jq to massage the data prior to import to SQLite.
sqlitebiter appears to provide a python solution:
A CLI tool to convert CSV/Excel/HTML/JSON/LTSV/Markdown/SQLite/TSV/Google-Sheets to a SQLite database file. http://sqlitebiter.rtfd.io/
docs:
http://sqlitebiter.readthedocs.io/en/latest/
project:
https://github.com/thombashi/sqlitebiter
last update approximately 3 months ago
last issue closed approximately 1 month ago, none open
noted today, 2018-03-14
You can use spyql.
spyql reads the json files (with 1 json object per line) and generates INSERT statements that you can pipe into sqlite:
$ spyql -Otable=my_table "SELECT json->uri, json->user_agent FROM json TO sql" < sample3.json | sqlite3 my.db
This assumes that you already created an empty table in the sqlite database my.db.
Disclaimer: I am the author of spyql.
To work with a file of newline-delimited JSON objects, including \n in the data: add a header column name and ensure the JSON is compact (1 line per record).
cat <(echo '"line"') source.json | jq -c '.' > source.fauxcsv
Import the JSON and header as a "csv" into a temporary table, using a column separator \t that won't occur in the JSON. Then populate the real table (here records, assumed to exist already) via SQLite's JSON functions.
sqlite3 file.db \
-cmd '.separator \t \n' \
-cmd '.import --schema temp source.fauxcsv temp_json_lines' <<-'EOSQL'
INSERT into records SELECT
json_extract(line, '$.rid'),
coalesce(json_extract(line, '$.created_at'), strftime('%Y-%m-%dT%H:%M:%fZ', 'now')),
json_extract(line, '$.name')
FROM temp_json_lines;
EOSQL
If (as in the original question) the JSON data comes in the form of JSONLines (that is, one JSON entity per line), and if it is desired to create a table with one of these entities per row, then sqlite3 can be used to import the data by setting .mode=line, e.g. as follows:
create table input (
raw JSON
);
.mode=line
.import input.json input
This approach is worth knowing not least because it can easily be adapted to handle cases where the data is not already in JSONLines format. For example, if input.json contains a single very long JSON array, we could use a tool such as jq or gojq to "splat" it:
.mode=line
.import "|jq -c .[] input.json" input
Similarly, if input.json contains a single object with many keys, and if it is desired to create a table of corresponding single-key objects:
.mode=line
.import "|jq -c 'to_entries[] | {(.key): .value}'" input
If the original data is a single very large JSON array or JSON object, jq's streaming parser could be used to save memory. In this context, it may be worth mentioning two CLI tools with minimal memory requirements: my own jm (based on JSON Machine), and jm.py (based on ijson). E.g., to "splat" each array in a file containing one or more JSON arrays:
.mode=line
.import "|jm input.json" input
With the JSON data safely in an SQLite table, it is (thanks to SQLite's support for JSON) now quite straightforward to create indices, populate other tables, etc., etc.
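To illustrate that last step, here is a minimal sketch using Python's standard sqlite3 module; the database file name, the requests table and index names, and the assumption that your SQLite build includes the JSON1 functions are all mine. It pulls typed columns out of the raw input table defined above and indexes one of them:
import sqlite3

con = sqlite3.connect("my_db.db")  # assumed: the database holding the 'input' table above

# Extract typed columns from the raw JSON into a separate table...
con.execute("""
    CREATE TABLE IF NOT EXISTS requests AS
    SELECT json_extract(raw, '$.uri')        AS uri,
           json_extract(raw, '$.user_agent') AS user_agent
    FROM input
""")

# ...and index one of them for faster lookups.
con.execute("CREATE INDEX IF NOT EXISTS idx_requests_uri ON requests(uri)")
con.commit()
con.close()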
Here is the first answer compiled into a deno script:
// just for convenience (pathExists)
import {} from "https://deno.land/x/simple_shell@0.9.0/src/stringUtils.ts";
/**
* @description
* convert a json db to csv and then to sqlite
*
* @note
* `sqliteTableConstructor` is a string that is used to create the table, if it is specified the csv file *should not* contain a header row.
* if it's not specified then the csv file *must* contain a header row so it can be used to infer the column names.
*/
const jsonToSqlite = async (
{
jsonDbPath,
jsonToCsvFn,
sqliteDbPath,
sqliteTableConstructor,
tableName,
}: {
jsonDbPath: string;
sqliteDbPath: string;
tableName: string;
sqliteTableConstructor?: string;
// deno-lint-ignore no-explicit-any
jsonToCsvFn: (jsonDb: any) => string;
},
) => {
// convert it into csv
const csvDbPath = `${jsonDbPath.replace(".json", "")}.csv`;
if (csvDbPath.pathExists()) {
console.log(`${csvDbPath} already exists`);
} else {
const db = JSON.parse(await Deno.readTextFile(jsonDbPath));
const csv = jsonToCsvFn(db);
await Deno.writeTextFile(csvDbPath, csv);
}
// convert it to sqlite
if (sqliteDbPath.pathExists()) {
console.log(`${sqliteDbPath} already exists`);
} else {
const sqlite3 = Deno.spawnChild("sqlite3", {
args: [sqliteDbPath],
stdin: "piped",
stderr: "null", // required to make sqlite3 work
});
await sqlite3.stdin.getWriter().write(
new TextEncoder().encode(
".mode csv\n" +
(sqliteTableConstructor ? `${sqliteTableConstructor};\n` : "") +
`.import ${csvDbPath} ${tableName}\n` +
".exit\n",
),
);
await sqlite3.status;
}
};
Example of usage:
await jsonToSqlite(
{
jsonDbPath: "./static/db/db.json",
sqliteDbPath: "./static/db/db.sqlite",
tableName: "radio_table",
sqliteTableConstructor:
"CREATE TABLE radio_table(name TEXT, country TEXT, language TEXT, votes INT, url TEXT, favicon TEXT)",
jsonToCsvFn: (
db: StationDBType[],
) => {
const sanitize = (str: string) =>
str.trim().replaceAll("\n", " ").replaceAll(",", " ");
return db.filter((s) => s.name.trim() && s.url.trim())
.map(
(station) => {
return (
sanitize(station.name) + "," +
sanitize(station.country) + "," +
sanitize(station.language) + "," +
station.votes + "," +
sanitize(station.url) + "," +
sanitize(station.favicon)
);
},
).join("\n");
},
},
);
Edit1:
Importing CSV into SQLite by default sets all column types to string. In this edit I allow the user to create the table first (via an optional constructor) before importing the CSV into it; this way they can specify the exact column types.
Improve example
Edit2:
It turns out that with deno and sqlite-deno you don't need to use CSV as an intermediate format or shell out to sqlite3; here is an example of how to achieve this.
The next code will create a new SQLite db from the JSON one.
import { DB } from "https://deno.land/x/sqlite@v3.2.1/mod.ts";
export interface StationDBType {
name: string;
country: string;
language: string;
votes: number;
url: string;
favicon: string;
}
export const db = new DB("new.sql");
db.query(
"create TABLE radio_table (name TEXT, country TEXT, language TEXT, votes INT, url TEXT, favicon TEXT)",
);
const jsonDb: StationDBType[] = JSON.parse(
await Deno.readTextFile("static/db/compressed_db.json"),
);
const sanitize = (s: string) => s.replaceAll('"', "").replaceAll("'", "");
db.query(
`insert into radio_table values ${
jsonDb.map((station) =>
"('" +
sanitize(station.name) +
"','" +
sanitize(station.country) +
"','" +
sanitize(station.language) +
"'," +
station.votes +
",'" +
sanitize(station.url) +
"','" +
sanitize(station.favicon) +
"')"
).join(",")
}`,
);
db.close();

Groovy csv to string

I am using Dell Boomi to map data from one system to another. I can use groovy in the maps but have no experience with it. I tried to do this with the other Boomi tools, but have been told that I'll need to use groovy in a script. My inbound data is:
132265,Brown
132265,Gold
132265,Gray
132265,Green
I would like to output:
132265,"Brown,Gold,Gray,Green"
Hopefully this makes sense! Any ideas on the groovy code to make this work?
It can be elegantly solved with groupBy and the spread operator:
@Grapes(
@Grab(group='org.apache.commons', module='commons-csv', version='1.2')
)
import org.apache.commons.csv.*
def csv = '''
132265,Brown
132265,Gold
132265,Gray
132265,Green
'''
def parsed = CSVParser.parse(csv, CSVFormat.DEFAULT.withHeader('code', 'color'))
parsed.records.groupBy({ it.code }).each { k,v -> println "$k,\"${v*.color.join(',')}\"" }
The above prints:
132265,"Brown,Gold,Gray,Green"
Well, I don't know how you are getting your data, but here is a general way to achieve your goal. You can use a library, such as the one below, to parse the CSV.
https://github.com/xlson/groovycsv
The example for your data would be:
@Grab('com.xlson.groovycsv:groovycsv:1.1')
import static com.xlson.groovycsv.CsvParser.parseCsv
def csv = '''
132265,Brown
132265,Gold
132265,Gray
132265,Green
'''
def data = parseCsv(csv)
I believe you want to associate the number with various values of colors. So for each line you can create a map of the number and the colors associated with that number, splitting the line by ",":
map = [:]
for(line in data) {
number = line.split(',')[0]
colour = line.split(',')[1]
if(!map[number])
map[number] = []
map[number].add(colour)
}
println map
So map should contain:
[132265:["Brown","Gold","Gray","Green"]]
Well, if it is not what you want, you can extract the general idea.
Assuming your data is coming in as a comma separated string of data like this:
"132265,Brown 132265,Gold 132265,Gray 132265,Green 122222,Red 122222,White"
The following Groovy script code should do the trick.
def csvString = "132265,Brown 132265,Gold 132265,Gray 132265,Green 122222,Red 122222,White"
LinkedHashMap.metaClass.multiPut << { key, value ->
delegate[key] = delegate[key] ?: []; delegate[key] += value
}
def map = [:]
def csv = csvString.split().collect{ entry -> entry.split(",") }
csv.each{ entry -> map.multiPut(entry[0], entry[1]) }
def result = map.collect{ k, v -> k + ',"' + v.join(",") + '"'}.join("\n")
println result
Would print:
132265,"Brown,Gold,Gray,Green"
122222,"Red,White"
Do you HAVE to use scripting for some reason? This can be easily accomplished with out-of-the-box Boomi functionality.
Create a map function that prepends the ID field to a string of your choice (e.g. 222_concat_fields), then use that value to set a dynamic process property.
The value of the process prop will contain the result of concatenating the name fields. Simply adding this function to your map should take care of it. Then use the final value to populate your result.
Well, it depends on how the data is coming in.
If the data which you have posted in the question is coming in a single document, then you can easily handle this in a map with groovy scripting.
If the data which you have posted in the question is coming in as multiple documents, i.e.:
doc1: 132265,Brown
doc2: 132265,Gold
doc3: 132265,Gray
doc4: 132265,Green
In that case it cannot be handled in a map. You will need to use a Data Process step with custom scripting.
The groovy code you are asking for depends on the input profile in which you are getting the data. Please provide more information, i.e. input profile, fields, etc.