I have a dataset in a JSON file:
[
{
"id": 333831567,
"pieceId": 25395616,
"status": 10800,
"userId": 911,
"startTime": 1490989764,
"endTime": 1491001113
},
{
"id": 333883698,
"pieceId": 25390812,
"status": 10451,
"userId": 88738562,
"startTime": 1491004450,
"endTime": 1491004579
The JSON file has over 15,000 entries. How do I count the unique status values in this dataset?
Using pandas
import pandas as pd
# convert your "data" into pandas dataframe
df = pd.DataFrame.from_dict(data, orient='columns')
# count non unique values for status column
df.loc[: ,'status'].nunique()
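If the data is still sitting in a JSON file on disk, load it into "data" first. A minimal sketch, assuming the file is named data.json (the filename is an assumption); pandas can also read such a file directly:

import json
import pandas as pd

# parse the JSON array from disk into a Python list (filename assumed)
with open('data.json') as f:
    data = json.load(f)

# or let pandas read the records straight from the file
df = pd.read_json('data.json')
print(df['status'].nunique())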
Using dictionary comprehension + len() + set()
res = {key: len(set(sub[key] for sub in data))
       for key in data[0].keys()}
# print the unique count for each key in the dictionary
print("Unique count per key: " + str(res))
# print the unique count for status
print("Unique count of status: " + str(res['status']))
Requirement (stated in the comments): a class with a method def unique_statuses_count(self) -> int
class Jsondt(list):
    def unique_statuses_count(self) -> int:
        res = {key: len(set(sub[key] for sub in self))
               for key in self[0].keys()}
        # res holds the unique count for each key in the dataset;
        # return the count for the "status" column as an integer
        # (or pick any other key present in the data)
        return res['status']

Jsondt(data).unique_statuses_count()
I have a use case where we are receiving millions of JSON files into our GCS bucket, and I am creating an external table on top of the bucket. The problem is that, for one particular field, the data type is not consistent: a few files have a string and others have an array.
My question is: can we alter the JSON to turn these strings into arrays, or is there any other recommended way to handle this?
Example:
**string**:
"ing": {
"info": "abc,def",
"details": []
},
**array**:
"ing": {
"info": [
"abc,def",
"abc,efg"
],
"details": []
},
I tried adding [] around the string value, and querying the external table then works. But I need a way to efficiently alter the ~1M JSON files to add the brackets.
I am expecting to move this data from the external table into a BigQuery table.
I hope the query below gives you a hint for handling your problem.
WITH sample_table AS (
SELECT '{"ing": {"enfo": "abc,def", "details": []}}' json UNION ALL
SELECT '{"ing": {"info": "abc,def", "details": []}}' json UNION ALL
SELECT '{"ing": {"info": ["abc,def", "abc,efg"], "details": []}}' UNION ALL
SELECT '{"ing": {"info": null, "details": []}}'
)
SELECT COALESCE(
JSON_VALUE_ARRAY(json, '$.ing.info'),
ARRAY(SELECT e FROM UNNEST([JSON_VALUE(json, '$.ing.info')]) e WHERE e IS NOT NULL)
) AS info
FROM sample_table;
Query results
External Table
CREATE SCHEMA IF NOT EXISTS `your-project.stackoverflow`;
CREATE OR REPLACE EXTERNAL TABLE `stackoverflow.sample_table` (
json STRING
)
OPTIONS (
format = 'CSV',
field_delimiter = CHR(1),
uris = ['https://drive.google.com/open?id=1CIW3UmvYr2JAmSounOY6l5dUFUJCOJOH']
);
SELECT COALESCE(
JSON_VALUE_ARRAY(json, '$.ing.info'),
[JSON_VALUE(json, '$.ing.info')]
) AS info
FROM `stackoverflow.sample_table`;
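To move the normalized data from the external table into a native BigQuery table (as the question mentions), a simple option is to materialize the same SELECT with CREATE TABLE AS; the target table name below is an assumption:

CREATE OR REPLACE TABLE `stackoverflow.normalized_table` AS
SELECT
  COALESCE(
    JSON_VALUE_ARRAY(json, '$.ing.info'),
    ARRAY(SELECT e FROM UNNEST([JSON_VALUE(json, '$.ing.info')]) e WHERE e IS NOT NULL)
  ) AS info,
  JSON_VALUE_ARRAY(json, '$.ing.details') AS details
FROM `stackoverflow.sample_table`;

This way the ~1M source files never need to be rewritten; the normalization happens once in the query.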
I have a JSON and I'm trying to read part of it to extract keys and values.
Assuming response is my JSON data, here is my code:
data_dump = json.dumps(response)
data = json.loads(data_dump)
Here my data object becomes a list, and I'm trying to get the keys like this:
id = [key for key in data.keys()]
This fails with the error:
'list' object has no attribute 'keys'
How can I get past this to produce the output below?
Here is my JSON:
{
"1": {
"task": [
"wakeup",
"getready"
]
},
"2": {
"task": [
"brush",
"shower"
]
},
"3": {
"task": [
"brush",
"shower"
]
},
"activites": ["standup", "play", "sitdown"],
"statuscheck": {
"time": 60,
"color": 1002,
"change(me)": 9898
},
"action": ["1", "2", "3", "4"]
}
The output I need is as below; I do not need data from the rest of the JSON.
id    task
1     wakeup, getready
2     brush, shower
If you know that the keys you need are "1" and "2", you could try reading the JSON string as a dataframe, unpivoting it, exploding and grouping:
from pyspark.sql import functions as F
df = (spark.read.json(sc.parallelize([data_dump]))
.selectExpr("stack(2, '1', `1`, '2', `2`) (id, task)")
.withColumn('task', F.explode('task.task'))
.groupBy('id').agg(F.collect_list('task').alias('task'))
)
df.show()
# +---+------------------+
# | id| task|
# +---+------------------+
# | 1|[wakeup, getready]|
# | 2| [brush, shower]|
# +---+------------------+
However, it may be easier to deal with it in Python:
data = json.loads(data_dump)
data2 = [(k, v['task']) for k, v in data.items() if k in ['1', '2']]
df = spark.createDataFrame(data2, ['id', 'task'])
df.show()
# +---+------------------+
# | id| task|
# +---+------------------+
# | 1|[wakeup, getready]|
# | 2| [brush, shower]|
# +---+------------------+
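As a side note on the original error: json.loads(data_dump) returns a dict for the JSON shown in the question, so .keys() works on it directly; the AttributeError only appears when the parsed value is a list (i.e. the response is a JSON array). A tiny sketch, assuming data_dump holds the JSON from the question:

import json

data = json.loads(data_dump)
if isinstance(data, list):
    # a JSON array parses to a list; collect keys from each element instead
    keys = [k for item in data for k in item.keys()]
else:
    # a JSON object parses to a dict
    keys = list(data.keys())  # ['1', '2', '3', 'activites', 'statuscheck', 'action']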
I have a PySpark dataframe where columns have JSON string values like this:
col1 col2
{"d1":"2343","v1":"3434"} {"id1":"123"}
{"d1":"2344","v1":"3435"} {"id1":"124"}
I want to update "col1" JSON string values with "col2" JSON string values to get this:
col1 col2
{"d1":"2343","v1":"3434","id1":"123"} {"id1":"123"}
{"d1":"2344","v1":"3435","id1":"124"} {"id1":"124"}
How to do this in PySpark?
Since you're dealing with string-type columns, you can remove the trailing } from "col1", remove the leading { from "col2", and join the two strings with a comma as the delimiter.
Input:
from pyspark.sql import functions as F
df = spark.createDataFrame(
[('{"d1":"2343","v1":"3434"}', '{"id1":"123"}'),
('{"d1":"2344","v1":"3435"}', '{"id1":"124"}')],
["col1", "col2"])
Script:
df = df.withColumn(
"col1",
F.concat_ws(
",",
F.regexp_replace("col1", r"}$", ""),
F.regexp_replace("col2", r"^\{", "")
)
)
df.show(truncate=0)
# +-------------------------------------+-------------+
# |col1 |col2 |
# +-------------------------------------+-------------+
# |{"d1":"2343","v1":"3434","id1":"123"}|{"id1":"123"}|
# |{"d1":"2344","v1":"3435","id1":"124"}|{"id1":"124"}|
# +-------------------------------------+-------------+
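As an alternative sketch (my own variation, not part of the answer above), you could parse the JSON instead of splicing strings, assuming Spark 2.4+ and flat string-only values: read both columns as maps, merge them, and serialize back to JSON.

from pyspark.sql import functions as F
from pyspark.sql.types import MapType, StringType

df = df.withColumn(
    "col1",
    F.to_json(
        F.map_concat(
            # parse each JSON string into a map<string, string>, then merge the maps
            F.from_json("col1", MapType(StringType(), StringType())),
            F.from_json("col2", MapType(StringType(), StringType()))
        )
    )
)

The string-splicing version is cheaper, but this variant also copes with JSON that has surrounding whitespace or is pretty-printed.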
I have the following JSON:
{
"rewards": {
"reward_1": {
"type": "type 1",
"amount": "amount 1"
},
"reward_2": {
"type": "type 2",
"amount": "amount 2"
},
"reward_3": {
"type": "type 3",
"amount": "amount 3"
},
"reward_4": {
"type": "type 4",
"amount": "amount 4"
}
}
}
This JSON is dynamic and I don't necessarily know how many rewards it will contain; here it's 4, but it could be 2 or 8, etc.
I want to write a query in BigQuery that will parse those values dynamically, without knowing how many of them exist, and then split them into columns, like this:
Thank you!
Hope this is helpful.
Since the JSON is dynamic, the first step is to find the maximum reward sequence number (I've used a regular expression and a max_reward UDF).
Then, extract each reward from the JSON rewards field iteratively.
Lastly, reshape the result into wide form with a PIVOT query.
If you want a more generic solution, you need BigQuery dynamic SQL to generate the PIVOT columns; I've hard-coded them in the query (a sketch of the dynamic approach is included after the outputs below).
('reward_1', 'reward_2', 'reward_3', 'reward_4')
query:
CREATE TEMP TABLE sample AS
SELECT 1 AS id, '{"rewards": { "reward_1": { ... ' AS json -- put your json here
UNION ALL
SELECT 2 AS id, '{"rewards": { "reward_1": { ... ' AS json -- put your another json here
;
CREATE TEMP FUNCTION extract_reward(json STRING, seq INT64)
RETURNS STRUCT<type STRING, amount STRING>
LANGUAGE js AS """
return JSON.parse(json)['reward_' + seq];
""";
CREATE TEMP FUNCTION max_reward(arr ARRAY<STRING>) AS ((
SELECT MAX(CAST(v AS INT64)) FROM UNNEST(arr) v
));
SELECT * FROM (
SELECT id,
'reward_' || seq AS reward,
extract_reward(FORMAT('%t', JSON_QUERY(json, '$.rewards')), seq) AS value
FROM sample, UNNEST(GENERATE_ARRAY(1, max_reward(REGEXP_EXTRACT_ALL(json, r'"reward_([0-9]+)"')))) seq
) PIVOT (ANY_VALUE(value) FOR reward IN ('reward_1', 'reward_2', 'reward_3', 'reward_4'));
output:
▶ Split a reward STRUCT column into separate columns
SELECT * FROM (
SELECT id,
'reward_' || seq || '_' || IF (offset = 0, 'type', 'amount') AS reward,
value
FROM sample,
UNNEST(GENERATE_ARRAY(1, max_reward(REGEXP_EXTRACT_ALL(json, r'"reward_([0-9]+)"')))) seq,
UNNEST([extract_reward(FORMAT('%t', JSON_QUERY(json, '$.rewards')), seq)]) pair,
UNNEST([pair.type, pair.amount]) value WITH OFFSET
) PIVOT (ANY_VALUE(value) FOR reward IN ('reward_1_type', 'reward_2_type', 'reward_3_type', 'reward_4_type', 'reward_1_amount', 'reward_2_amount', 'reward_3_amount', 'reward_4_amount'));
output:
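If hard-coding the reward_N list is not acceptable, a rough sketch of the dynamic SQL route (the long_form temp table name is my own; it assumes the temp table and UDFs above were created in the same script) could look like this:

CREATE TEMP TABLE long_form AS
SELECT id,
       'reward_' || seq AS reward,
       extract_reward(FORMAT('%t', JSON_QUERY(json, '$.rewards')), seq) AS value
FROM sample,
     UNNEST(GENERATE_ARRAY(1, max_reward(REGEXP_EXTRACT_ALL(json, r'"reward_([0-9]+)"')))) seq;

-- build the PIVOT column list from the data itself, then run the pivot dynamically
EXECUTE IMMEDIATE (
  SELECT FORMAT(
    "SELECT * FROM long_form PIVOT (ANY_VALUE(value) FOR reward IN (%s))",
    STRING_AGG(DISTINCT FORMAT("'%s'", reward), ', ')
  )
  FROM long_form
);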
1) I am trying to generate a CSV file from a JSON using jq.
2) I need the parent keys along with one key-value pair from the child array.
3) Whichever entry has the latest date in it will be the resulting key-value pair.
4) I need to generate a CSV out of that result.
This is my json
{
"students": [
{
"name": "Name1",
"class": "parentClass1",
"teacher": "teacher1",
"attendance": [
{
"key": "class1",
"value": "01-DEC-2018"
},
{
"key": "class1",
"value": "28-Nov-2018"
},
{
"key": "class1",
"value": "26-Oct-2018"
}
]
},
{
"name": "Name2",
"class": "parentClass2",
"teacher": "teacher2",
"attendance": [
{
"key": "class2",
"value": "05-DEC-2018"
},
{
"key": "class2",
"value": "25-Nov-2018"
},
{
"key": "class2",
"value": "20-Oct-2018"
}
]
}
]
}
I have not made much progress; I am trying to create the CSV like this:
jq '.students[] | [.name, .class, attendance[].key,.properties[].value] | #csv ' main.json
Below is the expected CSV from that JSON:
Name    ParentClass     key     dateValue                                                                        Summary
Name1   parentClass1    class1  150 days ago (difference between today and the latest date, i.e. 01-DEC-2018)   Teacher1.parentClass1
Name2   parentClass2    class2  150 days ago (difference between today and the latest date, i.e. 05-DEC-2018)   Teacher2.parentClass2
Parse the dates using strptime and assign the result back to the values; that way you can get the latest attendance using max_by. Convert the value to seconds since the Epoch using mktime, subtract it from now, and divide by 24 * 60 * 60 to get the number of days since.
$ jq -r '
  def days_since:
    (now - .) / 86400 | floor;
  .students[]
  | [ .name, .class ] +
    ( .attendance
      | map(.value |= strptime("%d-%b-%Y"))
      | max_by(.value)
      | [ .key, "\(.value | mktime | days_since) days ago" ]
    ) +
    [ .teacher + "." + .class ]
  | @tsv' file
Name1 parentClass1 class1 148 days ago teacher1.parentClass1
Name2 parentClass2 class2 144 days ago teacher2.parentClass2
Note that this solution doesn't deal with daylight-saving-time changes.
For production purposes I wouldn't use jq here, because it doesn't allow performing DST-safe date calculations.
I would use Python instead: it allows DST-safe date calculations, ships with JSON support by default, and is installed on most if not all UNIX derivatives.
#!/usr/bin/env python
import argparse
from datetime import datetime
import json


def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument('filename')
    return parser.parse_args()


def main():
    args = parse_args()
    with open(args.filename) as file_desc:
        data = json.load(file_desc)
    print('Name\tParentClass\tKey\tDateValue\tSummary')
    today = datetime.today()
    for record in data['students']:
        for a in record['attendance']:
            date = datetime.strptime(a['value'], '%d-%b-%Y')
            a['since'] = (today - date).days
        # the attendance entry with the fewest days since today is the latest one
        last = sorted(record['attendance'], key=lambda x: x['since'])[0]
        print('\t'.join([
            record['name'],
            record['class'],
            last['key'],
            '{} days ago'.format(last['since']),
            '{}.{}'.format(record['teacher'], record['class']),
        ]))


if __name__ == '__main__':
    main()
Output (on the day when this answer was written):
Name ParentClass Key DateValue Summary
Name1 parentClass1 class1 148 days ago teacher1.parentClass1
Name2 parentClass2 class2 144 days ago teacher2.parentClass2