OK, so this question isn't as simple as the title may sound. I've got a table that's structured like this:
| Brand | First Name | Last Name | Amount | e-mail |
|-------|------------|-----------|---------|---------------------|
| A | John | Smith | 920 USD | johnsmith#email.com |
| A | Mary | Smith | 650 USD | johnsmith#email.com |
| A | Margaret | Smith | 400 USD | johnsmith#email.com |
| B | Eric | Davis | 120 USD | jdavis#email.com |
| B | Wanda | Davis | 500 USD | jdavis#email.com |
| B | Jean | Davis | 300 USD | jdavis#email.com |
| A | Daniel | Barnes | 400 USD | dbarnes#email.com |
What I'm ultimately trying to do is generate emails to be sent to inform customers of their credit balance, and in the above example, I'd like to send one email to johnsmith#email.com that says something like "You have credits with Brand A. John Smith has 920 USD, Mary Smith has 650 USD, Margaret Smith has 400 USD."
I don't need to get all the way there with this question, but what I would like to do is have one row for each e-mail which somehow includes the information for each row with that email. Maybe some kind of generated concatenated field? It seems simple in theory, but in practice I'm having a tough time coming up with how exactly to do this in R. Any help would be much appreciated!
Bonus: I'm also fairly experienced with MySQL, so if there's a better way to do it in SQL, that'd be great!
Edit: Dput output (with names and emails edited)
structure(list(BRAND = c("R", "C", "C", "C", "C", "R", "R", "C",
"C", "C"), GUEST_S_LAST_NAME = c("Stockman", "Ericson", "Ericson",
"Alcin", "Andrews", "Smith", "Smith", "Brown", "Brown", "Brown"
), GUEST_S_FIRST_NAME = c("Margaret", "Abraham", "Naomi", "Dina",
"Arthur", "Laura", "Alan", "Gregory", "Marina", "Viktoria"),
COMPENSATIONAMOUNT_OR_PERCENT = c("920 USD", "1363 USD",
"1363 USD", "452 USD", "452 USD", "250 USD", "250 USD", "1019 USD",
"1019 USD", "323 USD"), EXPIRATION_DATE = c("04/30/2022 12:00:00 00 am",
"12/31/2021 12:00:00 00 am", "12/31/2021 12:00:00 00 am",
"12/31/2021 12:00:00 00 am", "12/31/2021 12:00:00 00 am",
"04/30/2022 12:00:00 00 am", "04/30/2022 12:00:00 00 am",
"12/31/2021 12:00:00 00 am", "12/31/2021 12:00:00 00 am",
"12/31/2021 12:00:00 00 am"), EMAIL = c("email1#email.com",
"email2#email.com", "email2#email.com", "email3#email.com",
"email3#email.com", "email4#email.com", "email4#email.com",
"email5#email.com", "email5#email.com", "email5#email.com"
)), row.names = c(NA, -10L), class = c("tbl_df", "tbl", "data.frame"
))
Here is my approach with dplyr:
library(dplyr)
your_data %>%
  group_by(BRAND, EMAIL) %>%
  # first(BRAND) keeps the result at one row per group; the bare BRAND vector
  # would repeat the text once for every input row in the group
  summarize(text = paste0(
      sprintf("You have credits with Brand %s. ", first(BRAND)),
      paste(sprintf("%s %s has %s",
                    GUEST_S_FIRST_NAME,
                    GUEST_S_LAST_NAME,
                    COMPENSATIONAMOUNT_OR_PERCENT),
            collapse = ", "), "."),
    .groups = "drop")
Returns:
# A tibble: 5 x 3
  BRAND EMAIL         text
  <chr> <chr>         <chr>
1 C     email2#email… You have credits with Brand C. Abraham Ericson has 1363 …
2 C     email3#email… You have credits with Brand C. Dina Alcin has 452 USD, A…
3 C     email5#email… You have credits with Brand C. Gregory Brown has 1019 US…
4 R     email1#email… You have credits with Brand R. Margaret Stockman has 920…
5 R     email4#email… You have credits with Brand R. Laura Smith has 250 USD, …
# Data used:
your_data <- structure(list(BRAND = c("R", "C", "C", "C", "C", "R", "R", "C", "C", "C"), GUEST_S_LAST_NAME = c("Stockman", "Ericson", "Ericson", "Alcin", "Andrews", "Smith", "Smith", "Brown", "Brown", "Brown"), GUEST_S_FIRST_NAME = c("Margaret", "Abraham", "Naomi", "Dina", "Arthur", "Laura", "Alan", "Gregory", "Marina", "Viktoria"), COMPENSATIONAMOUNT_OR_PERCENT = c("920 USD", "1363 USD", "1363 USD", "452 USD", "452 USD", "250 USD", "250 USD", "1019 USD", "1019 USD", "323 USD"), EXPIRATION_DATE = c("04/30/2022 12:00:00 00 am", "12/31/2021 12:00:00 00 am", "12/31/2021 12:00:00 00 am", "12/31/2021 12:00:00 00 am", "12/31/2021 12:00:00 00 am", "04/30/2022 12:00:00 00 am", "04/30/2022 12:00:00 00 am", "12/31/2021 12:00:00 00 am", "12/31/2021 12:00:00 00 am", "12/31/2021 12:00:00 00 am"), EMAIL = c("email1#email.com", "email2#email.com", "email2#email.com", "email3#email.com", "email3#email.com", "email4#email.com", "email4#email.com", "email5#email.com", "email5#email.com", "email5#email.com")), row.names = c(NA, -10L), class = c("tbl_df", "tbl", "data.frame"))
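For comparison, the same grouping idea can be sketched outside R. The Python below is only an illustration and not part of the answer above: the `rows` sample and the `build_messages` helper are hypothetical names, and the values are copied from a few rows of the dput data.

```python
from collections import defaultdict

# Sample rows (brand, first name, last name, amount, email), copied from the dput data.
rows = [
    ("R", "Margaret", "Stockman", "920 USD", "email1#email.com"),
    ("C", "Abraham", "Ericson", "1363 USD", "email2#email.com"),
    ("C", "Naomi", "Ericson", "1363 USD", "email2#email.com"),
    ("R", "Laura", "Smith", "250 USD", "email4#email.com"),
    ("R", "Alan", "Smith", "250 USD", "email4#email.com"),
]


def build_messages(rows):
    """Return {(brand, email): message} with one entry per brand/email pair."""
    grouped = defaultdict(list)
    for brand, first, last, amount, email in rows:
        grouped[(brand, email)].append(f"{first} {last} has {amount}")
    return {
        (brand, email): f"You have credits with Brand {brand}. " + ", ".join(parts) + "."
        for (brand, email), parts in grouped.items()
    }


messages = build_messages(rows)
for (brand, email), text in messages.items():
    print(email, "->", text)
```

Keying on the (brand, email) pair mirrors `group_by(BRAND, EMAIL)`: an address with credits at two brands would get one message per brand.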
I have a csv file structured as the one below:
| Taiwan | | US |
| ASUS | MSI | DELL | HP
------------------------------------------
CPU | 50 | 49 | 43 | 65
GPU | 60 | 64 | 75 | 54
HDD | 75 | 70 | 65 | 46
RAM | 60 | 79 | 64 | 63
assembled| 235 | 244 | 254 | 269
and I have to use an awk script to print a comparison between the sum of prices of the individual computer pieces (rows 3 to 6) "versus" the assembled computer price (row 7) displaying also the country each brand comes from. The printed result in the terminal should be something like:
Taiwan Asus 245 235
Taiwan MSI 262 244
US DELL 247 254
US HP 228 269
Here the third column is the sum of the CPU, GPU, HDD and RAM prices, and the fourth column is the assembled price from row 7 for each brand.
So far I have been able to sum the individual columns by adapting the solution from the post linked below, but I don't know how to display the result in the desired format. Could anyone help me with this? I'm a bit desperate at this point.
Sum all values in each column bash
This is the content of the original csv file represented at the top of this message:
,Taiwan,,US,
,ASUS,MSI,DELL,HP
CPU,50,49,43,65
GPU,60,64,75,54
HDD,75,70,65,46
RAM,60,79,64,63
assembled,235,244,254,269
Thank you very much in advance.
$ cat tst.awk
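BEGIN { FS=","; OFS="\t" }
# On line 2 (the brands), p still holds the fields of line 1 (the countries).
NR == 2 {
    for (i=2; i<=NF; i++) {
        # An empty country cell inherits the country from the column to its left.
        corp[i] = (p[i] == "" ? p[i-1] : p[i]) OFS $i
    }
}
# p holds the previous line, so the last line ("assembled") is never added to tot.
NR > 2 {
    for (i=2; i<=NF; i++) {
        tot[i] += p[i]
    }
}
# Remember the current line's fields for the next record (and for END).
{ split($0,p) }
END {
    for (i=2; i<=NF; i++) {
        print corp[i], tot[i], p[i]
    }
}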
$ awk -f tst.awk file
Taiwan ASUS 245 235
Taiwan MSI 262 244
US DELL 247 254
US HP 228 269
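The same logic can be sketched in Python for readers who want to check the numbers outside awk. This is an illustration, not a replacement for the awk script the question asks for; `CSV_TEXT` and `compare_prices` are hypothetical names, and the CSV content is inlined from the question so the sketch is self-contained.

```python
import csv
import io

# The CSV content from the question, inlined for a self-contained example.
CSV_TEXT = """,Taiwan,,US,
,ASUS,MSI,DELL,HP
CPU,50,49,43,65
GPU,60,64,75,54
HDD,75,70,65,46
RAM,60,79,64,63
assembled,235,244,254,269
"""


def compare_prices(text):
    """Return (country, brand, sum of parts, assembled price) per brand column."""
    rows = list(csv.reader(io.StringIO(text)))
    countries, brands = rows[0], rows[1]
    parts, assembled = rows[2:-1], rows[-1]
    result = []
    country = ""
    for col in range(1, len(brands)):
        # An empty country cell inherits the country from the column to its left.
        if col < len(countries) and countries[col]:
            country = countries[col]
        total = sum(int(row[col]) for row in parts)
        result.append((country, brands[col], total, int(assembled[col])))
    return result


comparison = compare_prices(CSV_TEXT)
for country, brand, total, price in comparison:
    print(country, brand, total, price, sep="\t")
```

Like the awk version, it treats the last line as the assembled price and every line between the two header lines and the last line as a part price.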
I have a problem importing a datetime value. I use an ETL JSON configuration to read a CSV file and load the data into an OrientDB database.
csv:
id;id_tag;tag_name;date
1;1;tag1;"2014-3-24 6:49:2"
2;1;tag1;"2009-11-22 13:12:7"
3;1;tag1;"2014-10-18 14:47:6"
4;1;tag1;"2013-2-10 15:23:27"
Json:
{
"config": {
"log": "debug"
},
"source": { "file": { "path": "/Users/jonathanmahe/Documents/OrientTest/GeospatialTest/bd1000/bd1000.csv" } },
"extractor": { "csv": {
"separator": ";",
"columns": ["id:Integer","id_tag:Integer","tag_name:String","date:dateTime"] } },
"transformers": [
{ "command": { "command": "INSERT INTO Tag(id,id_tag,tag_name,date) values('${input.id}','${input.id_tag}','${input.tag_name}','${input.date}')"} }
],
"loader": {
"orientdb": {
"dbURL": "remote:localhost/databases/bd1000",
"dbUser": "admin",
"dbPassword": "admin",
"serverUser": "root",
"serverPassword": "root",
"dbType": "graph",
"batchCommit": 1000
}
}
}
The error I get is:
ERROR exception=Error on conversion of date 'Mon Mar 24 06:49:00 CET 2014' using the format: yyyy-MM-dd HH:mm:ss
Does anyone have an idea?
Before launching the ETL import, you need to modify the database's datetime format:
1. Connect to the database.
2. Run: alter database DATETIMEFORMAT "EEE MMM dd HH:mm:ss zzz yyyy"
3. Run the ETL script.
orientdb {db=bd1000}> select from tag
+----+-----+------+----+------+--------+-----------------------------+
|# |#RID |#CLASS|id |id_tag|tag_name|date |
+----+-----+------+----+------+--------+-----------------------------+
|0 |#21:0|Tag |1 |1 |tag1 |Mon Mar 24 06:49:00 CET 2014 |
|1 |#22:0|Tag |2 |1 |tag1 |Sun Nov 22 13:12:00 CET 2009 |
|2 |#23:0|Tag |3 |1 |tag1 |Sat Oct 18 14:47:00 CEST 2014|
|3 |#24:0|Tag |4 |1 |tag1 |Sun Feb 10 15:23:00 CET 2013 |
+----+-----+------+----+------+--------+-----------------------------+
My end game is to create a tree visualization from a hierarchical JSON file using D3js.
The hierarchy I need to represent is shown in this diagram, where A has children B, C, D; B has children E, F, G; C has children H, I; and D has no children. The nodes will have multiple key:value pairs. I've only listed 3 for simplicity.
-- name:E
| type:dkBlue
| id: 005
|
|-- name:F
-- name:B ------| type:medBlue
| type:blue | id: 006
| id:002 |
| |-- name:G
| type:ltBlue
name:A ----| id:007
type:colors|
id:001 |-- name:C ----|-- name:H
| type:red | type:dkRed
| id:003 | id:008
| |
| |
| |-- name:I
| type:medRed
| id:009
|-- name:D
type:green
id: 004
My source data in R looks like:
nodes <-read.table(header = TRUE, text = "
ID name type
001 A colors
002 B blue
003 C red
004 D green
005 E dkBlue
006 F medBlue
007 G ltBlue
008 H dkRed
009 I medRed
")
links <- read.table(header = TRUE, text = "
startID relation endID
001 hasSubCat 002
001 hasSubCat 003
001 hasSubCat 004
002 hasSubCat 005
002 hasSubCat 006
002 hasSubCat 007
003 hasSubCat 008
003 hasSubCat 009
")
I must convert it to the following JSON:
{"name": "A",
 "type": "colors",
 "id" : "001",
 "children": [
   {"name": "B",
    "type": "blue",
    "id" : "002",
    "children": [
      {"name": "E",
       "type": "dkBlue",
       "id" : "005"},
      {"name": "F",
       "type": "medBlue",
       "id": "006"},
      {"name": "G",
       "type": "ltBlue",
       "id": "007"}
    ]},
   {"name": "C",
    "type": "red",
    "id" : "003",
    "children": [
      {"name": "H",
       "type": "dkRed",
       "id" : "008"},
      {"name": "I",
       "type": "medRed",
       "id": "009"}
    ]},
   {"name": "D",
    "type": "green",
    "id" : "004"}
 ]}
I would appreciate any help you may be able to offer!
[UPDATE 2017-04-18]
Based on Ian's references I looked into R's data.tree. I can recreate my hierarchy if I restructure my data as shown below. Note that I've lost the type of relation (hasSubCat) between each node, the value of which can vary for each link/edge in real life. I am willing to let that go (for now) if I can get a workable hierarchy. The revised data for data.tree:
library(data.tree)
df <- read.table(header = TRUE, text = "
paths type id
A colors 001
A/B blue 002
A/B/E dkBlue 005
A/B/F medBlue 006
A/B/G ltBlue 007
A/C red 003
A/C/H dkRed 008
A/C/I medRed 009
A/D green 004
")
myPaths <- as.Node(df, pathName = "paths")
myPaths$leafCount / (myPaths$totalCount - myPaths$leafCount)
print(myPaths, "type", "id", limit = 25)
The print displays the hierarchy I sketched out in the original post and even contains the key:values for each node. Nice!
levelName type id
1 A colors 1
2 ¦--B blue 2
3 ¦ ¦--E dkBlue 5
4 ¦ ¦--F medBlue 6
5 ¦ °--G ltBlue 7
6 ¦--C red 3
7 ¦ ¦--H dkRed 8
8 ¦ °--I medRed 9
9 °--D green 4
Once again I am at a loss for how to translate this from the tree to nested JSON. The example here https://ipub.com/data-tree-to-networkd3/ , like most examples, assumes key:value pairs only on leaf nodes, not branch nodes. I think the answer is in creating a nested list to feed into RJSONIO or jsonlite, and I have no idea how to do that.
data.tree is very helpful and probably the better way to accomplish your objective. For fun, I will submit a more roundabout way to achieve your nested JSON using igraph and d3r.
nodes <-read.table(header = TRUE, text = "
ID name type
001 A colors
002 B blue
003 C red
004 D green
005 E dkBlue
006 F medBlue
007 G ltBlue
008 H dkRed
009 I medRed
")
links <- read.table(header = TRUE, text = "
startID relation endID
001 hasSubCat 002
001 hasSubCat 003
001 hasSubCat 004
002 hasSubCat 005
002 hasSubCat 006
002 hasSubCat 007
003 hasSubCat 008
003 hasSubCat 009
")
library(d3r)
library(dplyr)
library(igraph)
# make it an igraph
gf <- graph_from_data_frame(links[,c(1,3,2)],vertices = nodes)
# if we know that this is a tree with root as "A"
# we can do something like this
df_tree <- dplyr::bind_rows(
lapply(
all_shortest_paths(gf,from="A")$res,
function(x){data.frame(t(names(unclass(x))), stringsAsFactors=FALSE)}
)
)
# we can discard the first column
df_tree <- df_tree[,-1]
# then make df_tree[1,1] as 1 (A)
df_tree[1,1] <- "A"
# now add node attributes to our data.frame
df_tree <- df_tree %>%
# let's get the last non-NA in each row so we can join with nodes
mutate(
last_non_na = apply(df_tree, MARGIN=1, function(x){tail(na.exclude(x),1)})
) %>%
# now join with nodes
left_join(
nodes,
by = c("last_non_na" = "name")
) %>%
# now remove last_non_na column
select(-last_non_na)
# use d3r to nest as we would like
nested <- df_tree %>%
d3_nest(value_cols = c("ID", "type"))
Consider walking down the levels iteratively converting dataframe columns to a multi-nested list:
library(jsonlite)
...
df2list <- function(i) as.vector(nodes[nodes$name == i,])
# GRANDPARENT LEVEL
jsonlist <- as.list(nodes[nodes$name=='A',])
# PARENT LEVEL
jsonlist$children <- lapply(c('B','C','D'), function(i) as.list(nodes[nodes$name == i,]))
# CHILDREN LEVEL
jsonlist$children[[1]]$children <- lapply(c('E','F','G'), df2list)
jsonlist$children[[2]]$children <- lapply(c('H','I'), df2list)
toJSON(jsonlist, pretty=TRUE)
However, with this approach you will notice that the one-length elements of the inner children are enclosed in brackets. Every element of an R list is itself a vector, and toJSON writes vectors as JSON arrays by default.
Hence, consider cleaning up the extra brackets with nested gsub calls, which still renders valid JSON (toJSON's auto_unbox = TRUE argument is an alternative that unboxes length-one vectors):
output <- toJSON(jsonlist, pretty=TRUE)
gsub('"\\]\n', '"\n', gsub('"\\],\n', '",\n', gsub('": \\["', '": "', output)))
Final Output
{
"ID": "001",
"name": "A",
"type": "colors",
"children": [
{
"ID": "002",
"name": "B",
"type": "blue",
"children": [
{
"ID": "005",
"name": "E",
"type": "dkBlue"
},
{
"ID": "006",
"name": "F",
"type": "medBlue"
},
{
"ID": "007",
"name": "G",
"type": "ltBlue"
}
]
},
{
"ID": "003",
"name": "C",
"type": "red",
"children": [
{
"ID": "008",
"name": "H",
"type": "dkRed"
},
{
"ID": "009",
"name": "I",
"type": "medRed"
}
]
},
{
"ID": "004",
"name": "D",
"type": "green"
}
]
}
A nice way of doing this, if a bit difficult to wrap one's head around, is with a recursive (self-referential) function, as in the following:
nodes <- read.table(header = TRUE, colClasses = "character", text = "
ID name type
001 A colors
002 B blue
003 C red
004 D green
005 E dkBlue
006 F medBlue
007 G ltBlue
008 H dkRed
009 I medRed
")
links <- read.table(header = TRUE, colClasses = "character", text = "
startID relation endID
001 hasSubCat 002
001 hasSubCat 003
001 hasSubCat 004
002 hasSubCat 005
002 hasSubCat 006
002 hasSubCat 007
003 hasSubCat 008
003 hasSubCat 009
")
convert_hier <- function(linksDf, nodesDf, sourceId = "startID",
                         targetId = "endID", nodesID = "ID") {
  makelist <- function(nodeid) {
    child_ids <- linksDf[[targetId]][which(linksDf[[sourceId]] == nodeid)]
    if (length(child_ids) == 0)
      return(as.list(nodesDf[nodesDf[[nodesID]] == nodeid, ]))
    c(as.list(nodesDf[nodesDf[[nodesID]] == nodeid, ]),
      children = list(lapply(child_ids, makelist)))
  }
  ids <- unique(c(linksDf[[sourceId]], linksDf[[targetId]]))
  rootid <- ids[! ids %in% linksDf[[targetId]]]
  jsonlite::toJSON(makelist(rootid), pretty = TRUE, auto_unbox = TRUE)
}
convert_hier(links, nodes)
A few notes:
I added colClasses = "character" to your read.table commands so that the ID numbers are not coerced to integers with no leading zeros and so that the strings are not converted into factors.
I wrapped everything in the convert_hier function to make it easier to adapt to other scenarios, but the real magic is in the makelist function.
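The recursion in makelist can also be sketched in Python, which may make the control flow easier to follow. This is an illustrative translation under stated assumptions, not the answer's own code: `build_tree` and `make` are hypothetical names, the relation column is dropped, and the data is assumed to form a single tree with exactly one root.

```python
import json

# Node and link data from the question; IDs are kept as strings to preserve leading zeros.
nodes = [
    {"ID": "001", "name": "A", "type": "colors"},
    {"ID": "002", "name": "B", "type": "blue"},
    {"ID": "003", "name": "C", "type": "red"},
    {"ID": "004", "name": "D", "type": "green"},
    {"ID": "005", "name": "E", "type": "dkBlue"},
    {"ID": "006", "name": "F", "type": "medBlue"},
    {"ID": "007", "name": "G", "type": "ltBlue"},
    {"ID": "008", "name": "H", "type": "dkRed"},
    {"ID": "009", "name": "I", "type": "medRed"},
]
links = [
    ("001", "002"), ("001", "003"), ("001", "004"),
    ("002", "005"), ("002", "006"), ("002", "007"),
    ("003", "008"), ("003", "009"),
]


def build_tree(nodes, links):
    """Nest nodes under their parents; assumes a single tree with one root."""
    by_id = {node["ID"]: node for node in nodes}
    children = {}
    for start, end in links:
        children.setdefault(start, []).append(end)
    targets = {end for _, end in links}
    # The root is the only node that never appears as a link target.
    root_id = next(node["ID"] for node in nodes if node["ID"] not in targets)

    def make(node_id):
        entry = dict(by_id[node_id])  # copy so the input is not mutated
        kids = children.get(node_id, [])
        if kids:  # leaves get no "children" key at all
            entry["children"] = [make(kid) for kid in kids]
        return entry

    return make(root_id)


tree = build_tree(nodes, links)
print(json.dumps(tree, indent=2))
```

As in the R version, a leaf returns only its own attributes, while a branch appends a children list built by calling the same function on each child ID.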
Let's say I have a table named order, as follows:
o_id | o_uid | o_date
1 | 5 | June 10, 2015
2 | 1 | June 10, 2015
3 | 8 | June 10, 2015
4 | 15 | June 11, 2015
5 | 11 | June 11, 2015
6 | 16 | June 12, 2015
7 | 19 | June 12, 2015
I tried running the following query:
SELECT o_id, o_uid FROM `order` GROUP BY o_date
I expected it to give me a result like this:
[
    "June 10, 2015" => [
        [
            "o_id" => 1,
            "o_uid" => 5
        ],
        [
            "o_id" => 2,
            "o_uid" => 1
        ],
        [
            "o_id" => 3,
            "o_uid" => 8
        ]
    ],
    "June 11, 2015" => [
        [
            "o_id" => 4,
            "o_uid" => 15
        ],
        [
            "o_id" => 5,
            "o_uid" => 11
        ]
    ],
    ...
]
The query does not return the results I expected. I could restructure the rows in PHP, but I would rather let MySQL do it in a single query if possible. I find the GROUP BY clause quite confusing. Is there another clause that can group records by date?
The result I get:
SQL Fiddle
Use:
SELECT o_id, GROUP_CONCAT(o_id),GROUP_CONCAT(o_uid),o_date FROM `order`
GROUP BY o_date
Then split the comma-separated lists in PHP and process them as needed.
Output (formatted output from SQL Fiddle):
o_id GROUP_CONCAT(o_id) GROUP_CONCAT(o_uid) o_date
1 1,2,3 5,1,8 June 10, 2015
4 4,5 15,11 June 11, 2015
6 6,7 16,19 June 12, 2015
Fiddle
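The splitting step can be sketched as follows. The answer suggests doing it in PHP; Python is used here only for illustration, and `result_rows` and `nest_by_date` are hypothetical names. The rows are copied from the output above, without the leading o_id column. The sketch assumes both GROUP_CONCAT columns list the rows of a group in the same order, so the n-th o_id pairs with the n-th o_uid.

```python
# Rows as returned by the GROUP_CONCAT query: (o_date, concatenated o_id, concatenated o_uid).
result_rows = [
    ("June 10, 2015", "1,2,3", "5,1,8"),
    ("June 11, 2015", "4,5", "15,11"),
    ("June 12, 2015", "6,7", "16,19"),
]


def nest_by_date(rows):
    """Turn GROUP_CONCAT rows into {o_date: [{"o_id": ..., "o_uid": ...}, ...]}."""
    nested = {}
    for o_date, ids, uids in rows:
        nested[o_date] = [
            {"o_id": int(o_id), "o_uid": int(o_uid)}
            for o_id, o_uid in zip(ids.split(","), uids.split(","))
        ]
    return nested


orders_by_date = nest_by_date(result_rows)
for o_date, orders in orders_by_date.items():
    print(o_date, orders)
```

Note that MySQL truncates GROUP_CONCAT results at group_concat_max_len (1024 bytes by default), so for large groups it may be safer to select the plain rows ordered by o_date and group them in application code.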