JSON format to CSV format conversion, special case

I have a JSON file whose rows are in the following format:
{"checkin_info": {"11-3": 17, "8-5": 1, "15-0": 2, "15-3": 2, "15-5": 2, "14-4": 1, "14-5": 3, "14-6": 6, "14-0": 2, "14-1": 2, "14-3": 2, "0-5": 1, "1-6": 1, "11-5": 3, "11-4": 11, "13-1": 1, "11-6": 6, "11-1": 18, "13-6": 5, "13-5": 4, "11-2": 9, "12-6": 5, "12-4": 8, "12-5": 5, "12-2": 12, "12-3": 19, "12-0": 20, "12-1": 14, "13-3": 1, "9-5": 2, "9-4": 1, "13-2": 6, "20-1": 1, "9-6": 4, "16-3": 1, "16-1": 1, "16-5": 1, "10-0": 3, "10-1": 4, "10-2": 4, "10-3": 4, "10-4": 1, "10-5": 2, "10-6": 2, "11-0": 3}, "type": "checkin", "business_id": "KO9CpaSPOoqm0iCWm5scmg"}
and so on... it has 8282 entries like this.
I want to convert it into a CSV file like this.
business_id "0-0" "1-0" "2-0" "3-0" ….. "23-0" "0-1" ……. "23-1" …….. "0-4" …… "23-4" …… "23-6"
1 KO9CpaSPOoqm0iCWm5scmg 2 1 0 1 NA 1 1 NA NA NA NA NA 6 NA 7
2 oRqBAYtcBYZHXA7G8FlPaA 1 2 2 NA NA 2 NA NA 1 NA 2 NA 2 NA 2
I tried this code:
library(RJSONIO)

urlc <- "C:\\Users\\Ayush\\Desktop\\yelp_training_set\\yelp_training_set_checkin.json"
conc <- file(urlc, "r")
inputc <- readLines(conc, -1L)
close(conc)
usec <- lapply(X = inputc, fromJSON)
## collapse each checkin_info list into one comma-separated string
for (i in seq_along(usec))   # 8282 entries
{
  tt <- usec[[i]]$checkin_info
  bb <- toString(tt)
  usec[[i]]$checkin_info <- bb
}
dfc <- data.frame(matrix(unlist(usec), nrow = length(usec), byrow = TRUE))
write.csv(dfc, file = "checkin_tr.csv")
to convert it into a form like this:
                                                       X1             business_id
  1, 1, 1, 1, 1, 1, 2, 1, 2, 1, 1, 2, 1, 1, 1, 2, 1                   D0IB17N66FiyYDCzTlAI4A
  1, 1, 2, 1, 1                                                       HLQGo3EaYVvAv22bONGkIw
  1, 1, 1, 1                                                          J6OojF0R_1OuwNlrZI-ynQ
  2, 1, 2, 1, 2, 1, 1, 1, 1, 4, 1, 1, 1, 1, 1, 1, 2, 1, 2
But I want the entries in column "X1" above split into separate columns, as shown in the first table.
How can I do this? Please help.

Using RJSONIO, you can do something like this:
library(RJSONIO)
## tt is one line of your JSON file, e.g. inputc[1]
tt <- fromJSON(tt)
data.frame(business_id = tt$business_id,
           do.call(rbind, list(tt$checkin_info)))
business_id X11.3 X8.5 X15.0 X15.3 X15.5 X14.4 X14.5 X14.6 X14.0 X14.1 X14.3 X0.5 X1.6 X11.5 X11.4 X13.1 X11.6 X11.1 X13.6 X13.5 X11.2 X12.6 X12.4
1 KO9CpaSPOoqm0iCWm5scmg 17 1 2 2 2 1 3 6 2 2 2 1 1 3 11 1 6 18 5 4 9 5 8
X12.5 X12.2 X12.3 X12.0 X12.1 X13.3 X9.5 X9.4 X13.2 X20.1 X9.6 X16.3 X16.1 X16.5 X10.0 X10.1 X10.2 X10.3 X10.4 X10.5 X10.6 X11.0
1 5 12 19 20 14 1 2 1 6 1 4 1 1 1 3 4 4 4 1 2 2 3
EDIT
I use a different idea here: it is easier to create a long-format data.frame and then convert it to a wide format, using reshape2 for example.
library(RJSONIO)
## I create 2 shorter lines with different id
tt  <- '{"checkin_info": {"11-3": 17, "8-5": 1, "15-0": 2}, "type": "checkin", "business_id": "KO9CpaSPOoqm0iCWm5scmg"}'
tt1 <- '{"checkin_info": {"12-0": 17, "7-5": 1, "15-0": 5}, "type": "checkin", "business_id": "iddd2"}'
## use inputc <- readLines(conc, -1L) in your case
inputc <- list(tt, tt1)
usec <- lapply(X = inputc, function(x) {
  tt <- fromJSON(x)
  data.frame(business_id = tt$business_id,
             names  = names(tt$checkin_info),
             values = unlist(tt$checkin_info))
})
## create a long data frame
dat <- do.call(rbind, usec)
## put in the wide format
library(reshape2)
dcast(business_id ~ names, data = dat)
business_id 11-3 15-0 8-5 12-0 7-5
1 KO9CpaSPOoqm0iCWm5scmg 17 2 1 NA NA
2 iddd2 NA 5 NA 17 1
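To get from there to the CSV file you asked for, the same idea can be applied to the full 8282-line file and the wide data.frame written out with write.csv. A rough sketch, untested on your data and reusing the path from your question (value.var just silences dcast's guess about the value column):
library(RJSONIO)
library(reshape2)

inputc <- readLines("C:\\Users\\Ayush\\Desktop\\yelp_training_set\\yelp_training_set_checkin.json", -1L)

usec <- lapply(inputc, function(x) {
  tt <- fromJSON(x)
  data.frame(business_id = tt$business_id,
             names  = names(tt$checkin_info),
             values = unlist(tt$checkin_info),
             stringsAsFactors = FALSE)
})

dat  <- do.call(rbind, usec)                                   # long format: one row per (business_id, slot)
wide <- dcast(dat, business_id ~ names, value.var = "values")  # one column per "hour-day" slot, NA where missing
write.csv(wide, file = "checkin_tr.csv", row.names = FALSE)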

Related

How to implement a rule in SQL?

I have a table named flup with some data like:
pid, flup_time, degree, oc, flup_type
1, 2018-05-06, 1, 0, 2
1, 2018-08-01, 2, 0, 3
1, 2018-08-13, 2, 0, 1
1, 2018-08-25, 2, 1, 1
1, 2018-11-20, 2, 1, 2
1, 2019-01-09, 2, 1, 2
2, 2018-06-01, 1, 0, 2
2, 2018-08-27, 2, 0, 2
2, 2018-11-30, 2, 0, 2
...
First, group the rows by pid and, within each pid (here pid = 1), order by flup_time ascending. Given a period of time (say from 2018-01-01 to 2019-07-01), apply these rules to every row:
rule1. if degree = 1, then the next flup_time must be within 90 days.
rule2. if degree = 2 and oc != 1, then the next flup_time must be within 15 days.
rule3. if degree = 2 and oc = 1, then the next flup_time must be within 90 days.
I want to create a view (flup_view) that has all the columns of flup plus one more column named pass_check. If the row meets rule 1, 2, or 3, then pass_check = 1; otherwise pass_check = 2. Like:
pid, flup_time, degree, oc, flup_type, pass_check
1, 2018-05-06, 1, 0, 2, -1
1, 2018-08-01, 2, 0, 3, 1
1, 2018-08-13, 2, 0, 1, 1
1, 2018-08-25, 2, 1, 1, 1
1, 2018-11-20, 2, 1, 2, 1
1, 2019-01-09, 2, 1, 2, 1
2, 2018-06-01, 1, 0, 2, -1
2, 2018-08-27, 2, 0, 2, 1
2, 2018-11-30, 2, 0, 2, 2
How can I do this in SQL?
There are a couple of pieces that you'll need for this to work. I'm not sure how strong your SQL background is, so I'll include the basics as well.
First, in order to create the rule, you'll need to use a CASE WHEN:
https://www.w3schools.com/sql/func_mysql_case.asp
Next, to get the following row for each ID, you need to use the LEAD function. Here's a general overview:
https://dev.mysql.com/doc/refman/8.0/en/window-function-descriptions.html
and a tutorial for LAG, which is the same as LEAD, but it checks the row above rather than the row below:
http://www.mysqltutorial.org/mysql-window-functions/mysql-lag-function/
(LEAD didn't exist in early versions of MySQL, so your version might not have it)
Finally, you want to compare dates using the DATE_ADD function:
https://www.w3schools.com/sql/func_mysql_date_add.asp
It will be a little complicated, but these three things should be enough to let you build the query you need.
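Here is a rough, untested sketch of how those three pieces could fit together on MySQL 8+ (column and table names are taken from the question; the reporting period, the last row per pid, and the -1 values in your sample output are left for you to adjust):
-- LEAD pulls the next flup_time per pid; CASE WHEN applies the three rules via DATE_ADD
SELECT f.*,
       CASE
           WHEN f.next_time IS NULL THEN 2
           WHEN f.degree = 1               AND f.next_time <= DATE_ADD(f.flup_time, INTERVAL 90 DAY) THEN 1
           WHEN f.degree = 2 AND f.oc <> 1 AND f.next_time <= DATE_ADD(f.flup_time, INTERVAL 15 DAY) THEN 1
           WHEN f.degree = 2 AND f.oc = 1  AND f.next_time <= DATE_ADD(f.flup_time, INTERVAL 90 DAY) THEN 1
           ELSE 2
       END AS pass_check
FROM (
    SELECT flup.*,
           LEAD(flup_time) OVER (PARTITION BY pid ORDER BY flup_time) AS next_time
    FROM flup
) AS f;
Wrapping that SELECT in CREATE VIEW flup_view AS ... should work on MySQL 8; if your version complains about the derived table inside a view, create the inner SELECT as its own view first.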

Unable to parse a specific value from a particular column of a CSV file

I am making a predictive model to predict revenue, and I am trying to parse the 'cast' value from the data frame, as it is not a list or a dict:
x['cast']
And the output is
0 [{'cast_id': 4, 'character': 'Lou', 'credit_id...
1 [{'cast_id': 1, 'character': 'Mia Thermopolis'...
2 [{'cast_id': 5, 'character': 'Andrew Neimann',...
3 [{'cast_id': 1, 'character': 'Vidya Bagchi', '...
4 [{'cast_id': 3, 'character': 'Chun-soo', 'cred...
5 [{'cast_id': 6, 'character': 'Pinocchio (voice...
6 [{'cast_id': 23, 'character': 'Clyde', 'credit...
7 [{'cast_id': 2, 'character': 'Himself', 'credi...
8 [{'cast_id': 1, 'character': 'Long John Silver...
9 [{'cast_id': 24, 'character': 'Jonathan Steinb...
Name: cast, dtype: object
I need all the 'character' values in a list.
But when I try
x['cast'][0]['character']
It throws this error
TypeError: string indices must be integers
Help me out with this please.
First convert the JSON strings to lists of dictionaries, and then get the value from the first dictionary of each list by its key:
import ast
mask = x['cast'].notna()
x.loc[mask, 'cast'] = x.loc[mask, 'cast'].apply(ast.literal_eval)
#alternative
#x.loc[mask, 'cast'] = x.loc[mask, 'cast'].apply(pd.io.json.loads)
x.loc[mask, 'cast'] = x.loc[mask, 'cast'].apply(lambda x: x[0].get('character', 'not match data'))
EDIT:
If there is still a problem, use Series.str.extract:
import numpy as np
import pandas as pd

x = pd.DataFrame({'cast': [[{'cast_id': 4, 'character': 'Lou'}], np.nan]})
x['cat'] = x['cast'].astype(str).str.extract("'character': '([^'']+)'")
print(x)
cast cat
0 [{'cast_id': 4, 'character': 'Lou'}] Lou
1 NaN NaN
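If the goal is every 'character' value per row rather than just the first, the first snippet can be varied to collect them into lists. A small sketch under the same assumptions (the 'characters' column name and the toy rows are just examples):
import ast
import numpy as np
import pandas as pd

# toy frame standing in for x; the real 'cast' strings come from the CSV
x = pd.DataFrame({'cast': ["[{'cast_id': 4, 'character': 'Lou'}, {'cast_id': 5, 'character': 'Max'}]", np.nan]})

mask = x['cast'].notna()
x.loc[mask, 'cast'] = x.loc[mask, 'cast'].apply(ast.literal_eval)
# collect the 'character' of every dict in each row's list; NaN rows stay NaN
x['characters'] = x.loc[mask, 'cast'].apply(lambda lst: [d.get('character') for d in lst])
print(x['characters'])
# 0    [Lou, Max]
# 1           NaN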

Cyclic group generator of [1, 2, 3, 4, 5, 6] under modulo 7 multiplication

Find all the generators in the cyclic group [1, 2, 3, 4, 5, 6] under modulo 7 multiplication.
I got <1> and <5> as generators. The answer is <3> and <5>. Can somebody please tell why is 3 a generator?
You compute the cyclic subgroup generated by each element of [1, 2, 3, 4, 5, 6] by computing its powers mod 7:
1 = {1^1 mod 7 = 1, 1^2 mod 7 = 1, ...} = {1}
2 = {2^1 mod 7 = 2, 2^2 mod 7 = 4, 2^3 mod 7 = 1} = {2, 4, 1}
3 = {3^1 = 3, 3^2 = 2, 3^3 = 6, 3^4 = 4, 3^5 = 5, 3^6 = 1} = {3, 2, 6, 4, 5, 1} (all mod 7)
4 = {4, 2, 1}
5 = {5, 4, 6, 2, 3, 1}
6 = {6, 1}
From that you can see that 3 and 5 are the only elements whose powers produce the whole group, so 3 and 5 are the generators.
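A quick brute-force check of the same fact, sketched in R: an element g is a generator exactly when g^1 .. g^6 mod 7 hit all six elements.
sapply(1:6, function(g) {
  powers <- cumprod(rep(g, 6)) %% 7   # g^1, g^2, ..., g^6 (mod 7)
  length(unique(powers)) == 6         # TRUE when the powers cover the whole group
})
# [1] FALSE FALSE  TRUE FALSE  TRUE FALSE   -> 3 and 5 are the generators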

With Couchbase, how to sum totals by day

I am looking to total (sum) equipment used by day. The data I have is {job, toDate, fromDate, equipmentUsed}. Would map/reduce be the best approach, and how would I do that with the "to" and "from" dates?
Here is some background. We have many projects, and many work orders for each project. Work orders are by day and have inventory for that day. I want to sum the inventory for each day in a date range to see if we will run out of inventory.
I will post sample data shortly
{"project::100": {"name": "project one"}
,"project::101": {"name": "project two"}
,"workOrder::1000": {"project": "project::100", "dateNeeded": jan 1, "inventory": ["equip1": 2, "equip2": 1, "equip3": 3, "equip4": 4]}
,"workOrder::1001": {"project": "project::100", "dateNeeded": jan 2, "inventory": ["equip1": 1, "equip2": 2, "equip3": 1, "equip4": 4]}
,"workOrder::1002": {"project": "project::100", "dateNeeded": jan 4, "inventory": ["equip1": 1, "equip2": 2, "equip3": 3, "equip4": 1]}
,"workOrder::1000": {"project": "project::101", "dateNeeded": jan 1, "inventory": ["equip1": 1, "equip2": 3, "equip4": 1]}
,"workOrder::1001": {"project": "project::101", "dateNeeded": jan 3, "inventory": ["equip2": 1, "equip3": 3, "equip4": 1]}
,"workOrder::1002": {"project": "project::101", "dateNeeded": jan 4, "inventory": ["equip1": 1, "equip2": 1, "equip3": 2, "equip4": 3]}
}
Can you give an example of what exactly you want? It looks like you want to aggregate equipUsed for overlapping dates, as well as handle gaps in the date ranges, etc. For example:
{J1, Jan 7, Jan 1, 4},
{J2, Jan 4, Jan 2, 7},
{J3, Jan 10, Jan 5, 10},
{J4, Jan 25, Jan 15, 20} etc.,
The output is:
{Jan 1, 4}, {Jan 2, 11 /4 + 7/}, {Jan 3, 11}, {Jan 4, 11}, {Jan 5, 14 /4 + 10/}, {Jan 6, 14}, {Jan 7, 14}, {Jan 8, 10}, {Jan 9, 10}, {Jan 10, 10}, {Jan 11 to 14th, 0}, and {Jan 15th to 25th, 20} etc.
This is some non-trivial logic. You can use the N1QL API with some programming language (Java, Python, Node, etc.) to solve this. For example, an exhaustive algorithm in pseudocode is (assuming the docs are in the 'default' bucket):
minDate = Run_N1QLQuery("SELECT MIN(fromDate) FROM default");
maxDate = Run_N1QLQuery("SELECT MAX(toDate) FROM default");
for d = minDate to maxDate
    sum_used = Run_N1QLQuery("SELECT SUM(equipUsed) FROM default WHERE %s BETWEEN fromDate AND toDate", d);
    d = increment_date(d);
Depending on what exactly is needed, one can write a much more efficient algorithm.
hth,
-Prasad

JSON to R data frame: preserve repeated values

I have a JSON data source that is a list of objects. Some of the object properties are themselves lists. I want to turn the whole thing into a data frame, preserving the lists as data frame values.
Example JSON data:
[{
"id": "A",
"p1": [1, 2, 3],
"p2": "foo"
},{
"id": "B",
"p1": [4, 5, 6],
"p2": "bar"
}]
Desired data frame:
id p2 p1
1 A foo 1, 2, 3
2 B bar 4, 5, 6
Failed attempt 1
I have found this nicely straightforward way of parsing my JSON:
unlisted_data <- lapply(fromJSON(json_str), function(x){unlist(x)})
data.frame(do.call("rbind", unlisted_data))
However, the unlisting process spreads my repeated value across multiple columns:
id p11 p12 p13 p2
1 A 1 2 3 foo
2 B 4 5 6 bar
I expected that calling unlist with the recursive = FALSE option would take care of this, but it doesn't.
Failed attempt 2
I noticed that I can almost do this with the I function:
> data.frame(I(parsed_json[[1]]))
parsed_json..1..
id A
p1 1, 2, 3
p2 foo
But the rows and columns are reversed. Transposing the result mangles the repeated data:
> t(data.frame(I(parsed_json[[1]])))
id p1 p2
parsed_json..1.. "A" Numeric,3 "foo"
The jsonlite package can handle this just fine:
library(jsonlite)
## txt is the JSON string shown above
fromJSON(txt)
# id p1 p2
#1 A 1, 2, 3 foo
#2 B 4, 5, 6 bar
fromJSON(txt)$p1
#[[1]]
#[1] 1 2 3
#
#[[2]]
#[1] 4 5 6