Convert data from CSV to JSON with grouping

Example CSV data (top row is the column header, followed by three data lines):
floor,room,note1,note2,note3
floor1,room1,2people
floor2,room4,6people,projector
floor6,room5,20people,projector,phone
I need the output in JSON, but grouped by floor, like this:
floor
    room
        note1
        note2
        note3
    room
        note1
        note2
        note3
floor
    room
        note1
        note2
        note3
    room
        note1
        note2
        note3
So all floor1 rooms are in their own json grouping, then floor2 rooms etc.
Please could someone point me in the right direction in terms of which tools to look at and any specific functions, e.g. jq + categories? I've done some searching already and got muddled up between lots of different posts relating to csvtojson, jq and some Python scripts. Ideally I would like to include the solution in a shell script rather than a separate program/language (I have sysadmin experience but am not a programmer).
Many thanks

Perhaps this can get you started.
Use a programming language like Python to convert the CSV data into a dictionary data structure by splitting on the commas, and use the JSON library to dump your dictionary out as JSON.
I have assumed that you actually expect to have more than one room per floor, and thus I took the liberty of adjusting your input data a little.
import json

csv = """floor1,room1,note1,note2,note3
floor1,room2,2people
floor1,room3,3people
floor2,room4,6people,projector
floor2,room5,3people,projector
floor3,room6,1person
"""

response = {}
for line in csv.splitlines():
    fields = line.split(",")
    floor, room, data = fields[0], fields[1], fields[2:]
    if floor not in response:
        response[floor] = {}
    response[floor][room] = data

print(json.dumps(response))
If you then run that script and pipe it into jq (where jq is just used for pretty-printing the output on your screen; it is not really required) you will see:
$ python test.py | jq .
{
  "floor1": {
    "room1": [
      "note1",
      "note2",
      "note3"
    ],
    "room2": [
      "2people"
    ],
    "room3": [
      "3people"
    ]
  },
  "floor2": {
    "room4": [
      "6people",
      "projector"
    ],
    "room5": [
      "3people",
      "projector"
    ]
  },
  "floor3": {
    "room6": [
      "1person"
    ]
  }
}
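If you would rather read from a real CSV file than a hard-coded string, here is a minimal sketch using Python's standard csv module; the input file name rooms.csv is an assumption:

import csv
import json

response = {}
with open("rooms.csv", newline="") as f:
    reader = csv.reader(f)
    next(reader)  # skip the header row (floor,room,note1,note2,note3)
    for row in reader:
        floor, room, data = row[0], row[1], row[2:]
        response.setdefault(floor, {})[room] = data

print(json.dumps(response))

You can call this from your shell script and pipe the result through jq exactly as above.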

Related

Dask how to open json with list of dicts

I'm trying to open a bunch of JSON files using read_json, in order to get a DataFrame like the following:
ddf.compute()
id owner pet_id
0 1 "Charlie" "pet_1"
1 2 "Charlie" "pet_2"
3 4 "Buddy" "pet_3"
but I'm getting the following error from this code:
import dask.dataframe as dd
import pandas as pd

_meta = pd.DataFrame(
    columns=["id", "owner", "pet_id"]
).astype({
    "id": int,
    "owner": "object",
    "pet_id": "object"
})
ddf = dd.read_json("mypets/*.json", meta=_meta)
ddf.compute()
*** ValueError: Metadata mismatch found in `from_delayed`.
My JSON files look like
[
    {
        "id": 1,
        "owner": "Charlie",
        "pet_id": "pet_1"
    },
    {
        "id": 2,
        "owner": "Charlie",
        "pet_id": "pet_2"
    }
]
As far as I understand, the problem is that I'm passing a list of dicts, so I'm looking for the right way to specify this in the meta= argument.
PS:
I also tried doing it in the following way
{
    "id": [1, 2],
    "owner": ["Charlie", "Charlie"],
    "pet_id": ["pet_1", "pet_2"]
}
But Dask interprets the data incorrectly:
ddf.compute()
id owner pet_id
0 [1, 2] ["Charlie", "Charlie"] ["pet_1", "pet_2"]
1 [4] ["Buddy"] ["pet_3"]
The invocation you want is the following:
dd.read_json("data.json", meta=meta,
             blocksize=None, orient="records",
             lines=False)
which can be largely gleaned from the docstring.
meta looks OK from your code
blocksize must be None, since you have a whole JSON object per file and cannot split the file
orient "records" means list of objects
lines=False means this is not a line-delimited JSON file (line-delimited is the more common case for Dask); you are not assuming that a newline character means a new record
So why the error? Probably Dask split your file on some newline character, and so a partial record got parsed, which therefore did not match your given meta.
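Putting it together, a minimal end-to-end sketch with the schema and glob from the question might be:

import dask.dataframe as dd
import pandas as pd

# Empty frame describing the expected schema, as in the question
meta = pd.DataFrame(columns=["id", "owner", "pet_id"]).astype(
    {"id": int, "owner": "object", "pet_id": "object"})

# blocksize=None: one partition per file, no splitting on newlines;
# orient="records", lines=False: each file is one JSON array of objects
ddf = dd.read_json("mypets/*.json", meta=meta,
                   blocksize=None, orient="records", lines=False)
print(ddf.compute())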

jq - How to extract domains and remove duplicates

Given the following json:
Full file here: https://pastebin.com/Hzt9bq2a
{
    "name": "Visma Public",
    "domains": [
        "accountsettings.connect.identity.stagaws.visma.com",
        "admin.stage.vismaonline.com",
        "api.home.stag.visma.com",
        "api.workbox.dk",
        "app.workbox.dk",
        "app.workbox.co.uk",
        "authz.workbox.dk",
        "connect.identity.stagaws.visma.com",
        "eaccounting.stage.vismaonline.com",
        "eaccountingprinting.stage.vismaonline.com",
        "http://myservices-api.stage.vismaonline.com/",
        "identity.stage.vismaonline.com",
        "myservices.stage.vismaonline.com"
    ]
}
How can I transform the data into the form below? That is, identify the domains in the format site.SLD.TLD and then remove the duplicates among them (not including the subdomains, protocols or paths, as illustrated below).
{
    "name": "Visma Public",
    "domains": [
        "workbox.co.uk",
        "workbox.dk",
        "visma.com",
        "vismaonline.com"
    ]
}
I would like to do so in jq, as that is what I've used to wrangle the data into this format so far, but at this stage any solution that I can run on Debian (I'm using bash) without any extraneous tooling would ideally be fine.
I'm aware that regex can be used within jq, so I assume the best way is to regex out the domain and then pipe to unique; however, I'm unable to get anything working so far. I'm currently trying the version below, which seems to me to need only the text-transformation stage added in somehow, either during the jq process or with a pass over the output afterwards using something like awk:
jq '[.[] | {name: .name, domain: [.domains[]] | unique}]' testfile.json
This appears to be useful: https://github.com/stedolan/jq/issues/537
One solution was offered which does a regex match to extract the last two strings separated by . and calls the unique function on that. It works up to a point, but doesn't cover a site.SLD.TLD whose suffix has 2 parts; google.co.uk, for example, would return only co.uk with this jq:
jq '.domains |= (map(capture("(?<x>[[:alpha:]]+).(?<z>[[:alpha:]]+)(.?)$") | join(".")) | unique)'
A programming language is much more expressive than jq.
Try the following snippet with python3.
import json
import pprint
import urllib.request
import os

def get_tlds():
    f = urllib.request.urlopen("https://publicsuffix.org/list/effective_tld_names.dat")
    content = f.read()
    lines = content.decode('utf-8').split("\n")
    # remove comments and blank lines
    tlds = [line for line in lines if not line.startswith("//") and not line == ""]
    return tlds

def extract_domain(url, tlds):
    # strip the protocol and any path to get the bare hostname
    url = url.replace("http://", "").replace("https://", "")
    url = url.split("/")[0]
    # get tld/sld candidates
    parts = url.split(".")
    suffix1 = parts[-1]
    sld1 = parts[-2]
    if len(parts) > 2:
        suffix2 = ".".join(parts[-2:])
        sld2 = parts[-3]
    else:
        suffix2 = suffix1
        sld2 = sld1
    # try the longer suffix first
    if suffix2 in tlds:
        tld = suffix2
        sld = sld2
    else:
        tld = suffix1
        sld = sld1
    return sld + "." + tld

def clean(site, tlds):
    site["domains"] = list(set([extract_domain(url, tlds) for url in site["domains"]]))
    return site

if __name__ == "__main__":
    filename = "Hzt9bq2a.json"
    cache_path = "tlds.json"
    if os.path.exists(cache_path):
        with open(cache_path, "r") as f:
            tlds = json.load(f)
    else:
        tlds = get_tlds()
        with open(cache_path, "w") as f:
            json.dump(tlds, f)
    with open(filename) as f:
        d = json.load(f)
    d = [clean(site, tlds) for site in d]
    pprint.pprint(d)
    with open("clean.json", "w") as f:
        json.dump(d, f)
with open("clean.json", "w") as f:
json.dump(d, f)
May I offer you a way of achieving the same query with jtc? The same could be achieved in other languages (and of course in jq); the query is mostly a matter of coming up with a regex that satisfies your ask:
bash $ <file.json jtc -w'<domains>l:>((?:[a-z0-9]+\.)?[a-z0-9]+\.[a-z0-9]+)[^.]*$<R:' -u'{{$1}}' /\
-ppw'<domains>l:><q:' -w'[domains]:<[]>j:' -w'<name>l:'
{
   "domains": [
      "stagaws.visma.com",
      "stage.vismaonline.com",
      "stag.visma.com",
      "api.workbox.dk",
      "app.workbox.dk",
      "workbox.co.uk",
      "authz.workbox.dk"
   ],
   "name": "Visma Public"
}
bash $
Note: it extracts only DOMAIN.TLD, as per your ask. If you would like to extract DOMAIN.SLD.TLD, then the task becomes a bit less trivial.
Update:
Modified solution as per the comment: extract domain.sld.tld where there are 3 or more levels, and domain.tld where there are only 2.
PS. I'm the creator of jtc, a JSON processing utility. This disclaimer is an SO requirement.
One of the solutions presented on this page offers that:
A programming language is much more expressive than jq.
It may therefore be worthwhile pointing out that jq is an expressive, Turing-complete programming language, and that it would be as straightforward (and as tedious) to capture all the intricacies of the "Public Suffix List" using jq as any other programming language that does not already provide support for this list.
It may be useful to illustrate an approach to the problem that passes the (revised) test presented in the Q. This approach could easily be extended in any one of a number of ways:
def extract:
  sub("^[^:]*://"; "")
  | sub("/.*$"; "")
  | split(".")
  | (if (.[-1]|length) == 2 and (.[-2]|length) <= 3
     then -3 else -2 end) as $ix
  | .[$ix:]
  | join(".");

{name, domain: (.domains | map(extract) | unique)}
Output
{
  "name": "Visma Public",
  "domain": [
    "visma.com",
    "vismaonline.com",
    "workbox.co.uk",
    "workbox.dk"
  ]
}
Judging from your example, you don't actually want top-level domains (just one component, e.g. ".com"), and you probably don't really want second-level domains (last two components) either, because some domain registries don't operate at the TLD level. Given www.foo.com.br, you presumably want to find out about foo.com.br, not com.br.
To do that, you need to consult the Public Suffix List. The file format isn't too complicated, but it has support for wildcards and exceptions. I dare say that jq isn't the ideal language to use here — pick one that has a URL-parsing module (for extracting hostnames) and an existing Public Suffix List module (for extracting the domain parts from those hostnames).
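For instance, a minimal Python sketch using the third-party tldextract package, which bundles the Public Suffix List, could process the testfile.json from the question (assuming the file contains the single object shown above):

import json
import tldextract  # third-party: pip install tldextract

with open("testfile.json") as f:
    site = json.load(f)

# registered_domain is the SLD plus the full public suffix,
# e.g. "workbox.co.uk" for "app.workbox.co.uk"
site["domains"] = sorted({tldextract.extract(u).registered_domain
                          for u in site["domains"]})
print(json.dumps(site, indent=2))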

How can I use jq to create a CSV with multiple headers and detail lines?

I'd like to use jq to output in CSV format, but for multiple headers, followed by multiple details. The solutions that I've already seen on Stack Overflow provide a way to insert a single header, but I haven't found anything for multiple headers.
To give you an idea of what I'm talking about, here is some sample JSON input:
[
    {
        "HDR": [1, "abc"],
        "DTL": [ [101, "Descr A"], [102, "Descr B"] ]
    },
    {
        "HDR": [2, "def"],
        "DTL": [ [103, "Descr C"], [104, "Descr D"] ]
    }
]
Desired output:
HDR|1|abc
DTL|101|Descr A
DTL|102|Descr B
HDR|2|def
DTL|103|Descr C
DTL|104|Descr D
I don't know if it's possible, but my approach so far has been to try to create a filter to give me the following, since transforming this to what I need would be trivial:
["HDR", 1, "abc"]
["DTL", 101, "Descr A"]
["DTL", 102, "Descr B"]
["HDR", 2, "def"]
["DTL", 103, "Descr C"]
["DTL", 104, "Descr D"]
To be clear, I know how to do this in any number of scripting languages, but I'm really trying to stick with a single jq filter, if it's at all possible.
Edit: I should clarify that I don't necessarily need to copy the "HDR" and "DTL" keys into the CSV (I can hard-code those), so the sample JSON could look like this, if it makes the problem easier.
[
    [
        [1, "abc"],
        [ [101, "Descr A"], [102, "Descr B"] ]
    ],
    [
        [2, "def"],
        [ [103, "Descr C"], [104, "Descr D"] ]
    ]
]
Edit: This filter technically answers the question with the second sample data I provided (the last one, which is only arrays and no objects), but I would still appreciate a better answer, if for no other reason than that the header length has to be hard-coded, and putting the HDR into two sets of arrays so that it can be flatten()'d later feels wrong. But I'll leave it here for reference.
.[] | flatten(1) | [[["HDR"] + .[0:2]]] as $hdr | .[2:] as $dtl | $dtl | map([["DTL"] + .]) as $dtl | $hdr + $dtl | flatten(1) | .[] | join("|")
This works for your original input, assuming you chose | as the delimiter because none of your fields can contain |.
jq -r 'map(["HDR"]+.HDR, ["DTL"] + .DTL[])[] | join("|")' data.json
map produces multiple array elements per object.
.DTL[] ensures "DTL" is prefixed to each sublist.
The final [] flattens the result of the map.
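For reference, if you ever do fall back to a scripting language, a rough Python equivalent of the same transformation could look like this (the input file name is an assumption):

import json

with open("data.json") as f:
    records = json.load(f)

# One pipe-delimited HDR line per object, then one DTL line per sublist
for rec in records:
    print("|".join(["HDR"] + [str(x) for x in rec["HDR"]]))
    for dtl in rec["DTL"]:
        print("|".join(["DTL"] + [str(x) for x in dtl]))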

How to Change a value in a Dataframe based on a lookup from a json file

I want to practice building models, and I figured I'd do it with something I am familiar with: League of Legends. I'm having trouble replacing an integer in a dataframe with a value looked up from a JSON file.
The datasets I'm using come from Kaggle. You can grab them and run this yourself.
https://www.kaggle.com/datasnaek/league-of-legends
I have a JSON file of the form (it's actually much bigger, but I shortened it):
{
    "type": "champion",
    "version": "7.17.2",
    "data": {
        "1": {
            "title": "the Dark Child",
            "id": 1,
            "key": "Annie",
            "name": "Annie"
        },
        "2": {
            "title": "the Berserker",
            "id": 2,
            "key": "Olaf",
            "name": "Olaf"
        }
    }
}
and a dataframe of the form
print(df)
gameDuration t1_champ1id
0 1949 1
1 1851 2
2 1493 1
3 1758 1
4 2094 2
I want to replace the ID in t1_champ1id with the lookup value in the json.
If both of these were dataframe, then I could use the merge option.
This is what I've tried. I don't know if this is the best way to read in the json file.
import pandas

df = pandas.read_csv("lol_file.csv", header=0)
champ = pandas.read_json("champion_info.json", typ='series')
for i in champ.data[0]:
    for j in df:
        if df.loc[j, ('t1_champ1id')] == i:
            df.loc[j, ('t1_champ1id')] = champ[0][i]['name']
I get the below error:
'the label [gameDuration] is not in the [index]'
I'm not sure that this is the most efficient way to do this, but I'm not sure how to do it at all either.
What do y'all think?
Thanks!
for j in df: iterates over the column names in df, which is unnecessary, since you're only looking to match against the column 't1_champ1id'. A better use of pandas functionality is to condense the id:name pairs from your JSON file into a dictionary, and then map it to df['t1_champ1id'].
# json_file is the parsed JSON dict; .values() iterates the per-champion records
player_names = {v['id']: v['name'] for v in json_file['data'].values()}
df.loc[:, 't1_champ1id'] = df['t1_champ1id'].map(player_names)
# gameDuration t1_champ1id
# 0 1949 Annie
# 1 1851 Olaf
# 2 1493 Annie
# 3 1758 Annie
# 4 2094 Olaf
I created a dataframe from the 'data' in the JSON file (transposed the resulting dataframe and then set the index to what you want to map on, the id), then mapped that onto the original df.
import json
import pandas as pd

with open('champion_info.json') as data_file:
    champ_json = json.load(data_file)

champs = pd.DataFrame(champ_json['data']).T
champs.set_index('id', inplace=True)
df['champ_name'] = df.t1_champ1id.map(champs['name'])

Spark: write several JSON files from a DataFrame based on separation by column value

Suppose I have this DataFrame (df):
user food affinity
'u1' 'pizza' 5
'u1' 'broccoli' 3
'u1' 'ice cream' 4
'u2' 'pizza' 1
'u2' 'broccoli' 3
'u2' 'ice cream' 1
Namely, each user has a certain (computed) affinity to a series of foods. The DataFrame is built from several sources. What I need to do is create a JSON file for each user, with their affinities. For instance, for user 'u1', I want a file containing
[
    {"food": "pizza", "affinity": 5},
    {"food": "broccoli", "affinity": 3},
    {"food": "ice cream", "affinity": 4}
]
This would entail splitting the DataFrame by user, and I cannot think of a way to do this, since writing a JSON file for the full DataFrame would be achieved with
df.write.json(<path_to_file>)
You can use partitionBy (it will give you a single directory per user, possibly with multiple files per user):
df.write.partitionBy("user").json(<path_to_file>)
or repartition followed by partitionBy (it will give you a single directory and a single file per user):
df.repartition(col("user")).write.partitionBy("user").json(<path_to_file>)
Unfortunately none of the above will give you a JSON array.
If you use Spark 2.0 you can try collect_list first:
df.groupBy(col("user")).agg(
    collect_list(struct(col("food"), col("affinity"))).alias("affinities")
)
and partitionBy on write as before.
Prior to 2.0 you'll have to use the RDD API, but it is language-specific.
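Putting the Spark 2.0 pieces together, a minimal PySpark sketch could look like the following; note that write.partitionBy drops the user column from the data itself, and the output is line-delimited JSON rather than a single JSON array (the output path is an assumption):

from pyspark.sql.functions import col, collect_list, struct

# Group each user's rows into an array of {food, affinity} structs,
# then write one directory (and one file) per user
(df.groupBy(col("user"))
   .agg(collect_list(struct(col("food"), col("affinity"))).alias("affinities"))
   .repartition(col("user"))
   .write.partitionBy("user")
   .json("/tmp/affinities"))  # hypothetical output path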