How to write a set of fields to JSON?

I am trying to write a few fields of my DataFrame to JSON. The data structure in the DataFrame is:
Key|col1|col2|col3|col4
key|a   |b   |c   |d
Key|a1  |b1  |c1  |d1
Now I am trying to convert just the col1 to col4 fields to JSON and give a name to the JSON field.
Expected output:
[Key,{cols:[{col1:a,col2:b,col3:c,col4:d},{col1:a1,col2:b1,col3:c1,col4:d1}]}]
I wrote a UDF for this:
val summary = udf(
(col1:String, col2:String, col3:String, col4:String) => "{\"cols\":[" + " {\"col1\":" + col1 + ",\"col2\":" + col2 + ",\"col3\":" + col3 + ",\"col4\":" + col4 + "}]}"
)
val result = input.withColumn("Summary",summary('col1,'col2,'col3,'col4))
val result1 = result.select('Key,'Summary)
result1.show(10)
This is my result
[Key,{cols:[{col1:a,col2:b,col3:c,col4:d}]}]
[Key,{cols:[{col1:a1,col2:b1,col3:c1,col4:d1}]}]
As you can see, they are not grouped. Is there a way to group these rows using the UDF itself? I am new to Scala/Spark and not able to figure out the proper UDF.

// Create your dataset
scala> val ds = Seq((1, "hello", 1L), (2, "world", 2L)).toDF("id", "token", "long")
ds: org.apache.spark.sql.DataFrame = [id: int, token: string ... 1 more field]
// select the fields you want to map to json
scala> ds.select('token, 'long).write.json("your-json")
// check the result
➜ spark git:(master) ✗ ls -ltr your-json/
total 16
-rw-r--r-- 1 jacek staff 27 11 kwi 17:18 part-r-00007-91f81f62-54bb-42ae-bddc-33829a0e3c16.json
-rw-r--r-- 1 jacek staff 0 11 kwi 17:18 part-r-00006-91f81f62-54bb-42ae-bddc-33829a0e3c16.json
-rw-r--r-- 1 jacek staff 0 11 kwi 17:18 part-r-00005-91f81f62-54bb-42ae-bddc-33829a0e3c16.json
-rw-r--r-- 1 jacek staff 0 11 kwi 17:18 part-r-00004-91f81f62-54bb-42ae-bddc-33829a0e3c16.json
-rw-r--r-- 1 jacek staff 27 11 kwi 17:18 part-r-00003-91f81f62-54bb-42ae-bddc-33829a0e3c16.json
-rw-r--r-- 1 jacek staff 0 11 kwi 17:18 part-r-00002-91f81f62-54bb-42ae-bddc-33829a0e3c16.json
-rw-r--r-- 1 jacek staff 0 11 kwi 17:18 part-r-00001-91f81f62-54bb-42ae-bddc-33829a0e3c16.json
-rw-r--r-- 1 jacek staff 0 11 kwi 17:18 part-r-00000-91f81f62-54bb-42ae-bddc-33829a0e3c16.json
-rw-r--r-- 1 jacek staff 0 11 kwi 17:18 _SUCCESS
➜ spark git:(master) ✗ cat your-json/part-r-00003-91f81f62-54bb-42ae-bddc-33829a0e3c16.json
{"token":"hello","long":1}
➜ spark git:(master) ✗ cat your-json/part-r-00007-91f81f62-54bb-42ae-bddc-33829a0e3c16.json
{"token":"world","long":2}

UDFs will map one row to one row. If you have multiple rows in your DataFrame that you want to combine into one element, you're going to need to use a function like reduceByKey that aggregates multiple rows.
There may be a DataFrame-specific function to do this, but I would do this processing with the RDD functionality, like so:
val colSummary = udf(
  (col1: String, col2: String, col3: String, col4: String) =>
    "{\"col1\":" + col1 + ",\"col2\":" + col2 + ",\"col3\":" + col3 + ",\"col4\":" + col4 + "}"
)
val colRDD = input.withColumn("Summary", colSummary('col1, 'col2, 'col3, 'col4)).rdd.map(x => (x.getString(0), x.getString(5)))
This gives us an RDD[(String,String)], which lets us use the PairRDDFunctions such as reduceByKey (see the docs). The key of the tuple is the original key, and the value is the JSON encoding of a single element; those elements need to be aggregated together to make the cols list. We glue them all together into a comma-separated list, then add the beginning and the end, and we're done.
val result = colRDD.reduceByKey((x,y) => (x+","+y)).map(x => "["+x._1+",{\"cols\":["+x._2+"]}]")
result.take(10)
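An alternative that stays in the DataFrame API is to let Spark group the rows and build the JSON for you instead of concatenating strings by hand. This is only a sketch: it assumes Spark 2.1+ (where collect_list, struct and to_json are available), a SparkSession named spark (as in spark-shell), and the input DataFrame with the Key and col1..col4 columns from the question.
import org.apache.spark.sql.functions.{collect_list, struct, to_json}
import spark.implicits._

// gather col1..col4 of every row with the same Key into an array of structs
val grouped = input
  .groupBy('Key)
  .agg(collect_list(struct('col1, 'col2, 'col3, 'col4)).as("cols"))

// wrap the array in a struct field named "cols" and render it as a JSON string
val result = grouped.select('Key, to_json(struct('cols)).as("Summary"))
result.show(10, false)
Because to_json does the serialisation, quoting and escaping of the values are handled by Spark rather than by manual string concatenation.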

Related

How to create a single JSON string from a PostgreSQL table with a SQL query

How can I create a JSON file from a PostgreSQL table "test4json" with a SQL query:
column1 | column2 | column3
name 1  | 0       | 10
name 2  | 0       | 10
name 3  | 0       | 10
I need a single JSON file, all in one row without CRLF, like this:
{"name 1": {"column2": 0,"column3": 10},"name 2": {"column2": 0,"column3": 10},"name 3":{"column2": 0,"column3": 10}}
For the values of column1 I don't need the name of the column!
And how can I create a test4json.json file from the result in the directory c:\test4json?
The original table is:
name | x | y  | width | height | pixelRatio | sdf
S1R1 | 0 | 0  | 20    | 10     | 1          | false
S1R2 | 0 | 10 | 20    | 10     | 1          | false
S1R3 | 0 | 20 | 20    | 10     | 1          | false
S1R4 | 0 | 30 | 20    | 10     | 1          | false
S1R5 | 0 | 40 | 20    | 10     | 1          | false
Thanks!
You can use jsonb_object_agg() for this:
select jsonb_object_agg(column1, to_jsonb(t) - 'column1')
from the_table t;
How you save that as a JSON file depends completely on the SQL client you are using. In psql you could use the \o ("output to") meta command.
copy (select jsonb_object_agg(column1, to_jsonb(t) - 'column1')
      from the_table t) to 'c:\test4json\test4json.json';
This works for writing the output to a JSON file.

Undefined columns selected using panelvar package

Has anyone used panelvar in R?
I'm currently using the R package panelvar, and I'm getting this error:
Error in `[.data.frame`(data, , c(colnames(data)[panel_identifier], required_vars)) :
undefined columns selected
My syntax is currently:
model1 <- pvargmm(
  dependent_vars = c("Change.."),
  lags = 2,
  exog_vars = c("Price"),
  transformation = "fd",
  data = base1,
  panel_identifier = c("id", "t"),
  steps = c("twostep"),
  system_instruments = FALSE,
  max_instr_dependent_vars = 99,
  min_instr_dependent_vars = 2L,
  collapse = FALSE)
I don't know why my panel_identifier is not working; it's pretty similar to the example given by the panelvar package, yet it doesn't work. I want to point out that base1 is in data.frame format. Any ideas? Also, my data is structured like this:
head(base1)
id t country DDMMYY month month_text day Date_txt year Price Open
1 1 1296 China 1-4-2020 4 Apr 1 Apr 01 2020 12588.24 12614.82
2 1 1295 China 31-3-2020 3 Mar 31 Mar 31 2020 12614.82 12597.61
High Low Vol. Change..
1 12775.83 12570.32 NA -0.0021
2 12737.28 12583.05 NA 0.0014
Thanks in advance!
Check the documentation of the package and the SSRN paper. For me it helped to ensure all input formats are identical (you can check this with the str(base1) command). For example, they write:
library(panelvar)
data("Dahlberg")
ex1_dahlberg_data <-
pvargmm(dependent_vars = .......
When I look at it I get
>str(Dahlberg)
'data.frame': 2385 obs. of 5 variables:
$ id : Factor w/ 265 levels "114","115","120",..: 1 1 1 1 1 1 1 1 1 2 ...
$ year : Factor w/ 9 levels "1979","1980",..: 1 2 3 4 5 6 7 8 9 1 ...
$ expenditures: num 0.023 0.0266 0.0273 0.0289 0.0226 ...
$ revenues : num 0.0182 0.0209 0.0211 0.0234 0.018 ...
$ grants : num 0.00544 0.00573 0.00566 0.00589 0.00559 ...
For example, the input data must be a plain data.frame (in my case it carried additional class attributes like tibble or data.table). I resolved it by casting with as.data.frame().

How do I convert my loop output to a .txt file?

My current code is:
count1 = 0
for i in range(30):
    if i % 26 == 0:
        b = [i+1, i+2, i+3, i+4, i+5, i+6, i+7, i+8, i+9, i+10]
        count1 += 1
        print([count1])
        print(*b, sep=' ')
    elif (i-10) % 26 == 0:
        b = [i+1, i+2, i+3, i+4, i+5, i+6, i+7, i+8, i+9]
        count1 += 1
        print([count1])
        print(*b, sep=' ')
    elif (i-16) % 32 == 0:
        b = [i+1, i+2, i+3, i+4, i+5, i+6, i+7, i+8, i+9, i+10]
        count1 += 1
        print([count1])
        print(*b, sep=' ')
which produces lines:
[1]
1 2 3 4 5 6 7 8 9 10
[2]
11 12 13 14 15 16 17 18 19
[3]
17 18 19 20 21 22 23 24 25 26
[4]
27 28 29 30 31 32 33 34 35 36
I'd like to output these lines to a simple text file. I'm familiar with the open and write functions, but do not know how to apply them to my specific example.
Thanks!
On GNU/Linux systems, execute the program in the console and add > followed by the name of the file.
Example:
Assuming that you are in the directory which contains the executable:
./[name of the program] > [name of the file]
./helloworld > helloworld.txt
This will save all the text printed to the console into a text file.

Standard unambiguous format [R] MySQL imported data

OK, to set the scene, I have written a function to import multiple tables from MySQL (using RODBC) and run randomForest() on them.
This function is run on multiple databases (as separate instances).
In one particular database, and one particular table, the "error in as.POSIXlt.character(x, tz,.....): character string not in a standard unambiguous format" error is thrown. The function runs on around 150 tables across two databases without any issues except this one table.
Here is a head() print from the table:
MQLTime bar5 bar4 bar3 bar2 bar1 pat1 baXRC
1 2014-11-05 23:35:00 184 24 8 24 67 147 Flat
2 2014-11-05 23:57:00 203 184 204 67 51 147 Flat
3 2014-11-06 00:40:00 179 309 49 189 75 19 Flat
4 2014-11-06 00:46:00 28 192 60 49 152 147 Flat
5 2014-11-06 01:20:00 309 48 9 11 24 19 Flat
6 2014-11-06 01:31:00 24 177 64 152 188 19 Flat
And here is the function:
GenerateRF <- function(db, countstable, RFcutoff) {
'load required libraries'
library(RODBC)
library(randomForest)
library(caret)
library(ff)
library(stringi)
'connection and data preparation'
connection <- odbcConnect ('TTODBC', uid='root', pwd='password', case="nochange")
'import count table and check if RF is allowed to be built'
query.str <- paste0 ('select * from ', db, '.', countstable, ' order by RowCount asc')
row.counts <- sqlQuery (connection, query.str)
'Operate only on tables that have >= RFcutoff'
for (i in 1:nrow (row.counts)) {
table.name <- as.character (row.counts[i,1])
col.count <- as.numeric (row.counts[i,2])
row.count <- as.numeric (row.counts[i,3])
if (row.count >= 20) {
'Delete old RFs and DFs for input pattern'
if (file.exists (paste0 (table.name, '_RF.Rdata'))) {
file.remove (paste0 (table.name, '_RF.Rdata'))
}
if (file.exists (paste0 (table.name, '_DF.Rdata'))) {
file.remove (paste0 (table.name, '_DF.Rdata'))
}
'import and clean data'
query.str2 <- paste0 ('select * from ', db, '.', table.name, ' order by mqltime asc')
raw.data <- sqlQuery(connection, query.str2)
'partition data into training/test sets'
set.seed(489)
index <- createDataPartition(raw.data$baXRC, p=0.66, list=FALSE, times=1)
data.train <- raw.data [index,]
data.test <- raw.data [-index,]
'find optimal trees to grow (without outcome and dates)'
data.mtry <- as.data.frame (tuneRF (data.train [, c(-1,-col.count)], data.train$baXRC, ntreetry=100,
stepFactor=.5, improve=0.01, trace=TRUE, plot=TRUE, dobest=FALSE))
best.mtry <- data.mtry [which (data.mtry[,2] == min (data.mtry[,2])), 1]
'compress df'
data.ff <- as.ffdf (data.train)
'run RF. Originally set to 1000 trees but M1 dataset is too large for laptop. Maybe train at the lab?'
data.rf <- randomForest (baXRC~., data=data.ff[,-1], mtry=best.mtry, ntree=500, keep.forest=TRUE,
importance=TRUE, proximity=FALSE)
'generate and print variable importance plot'
varImpPlot (data.rf, main = table.name)
'predict on test data'
data.test.pred <- as.data.frame( predict (data.rf, data.test, type="prob"))
'get dates and name date column'
data.test.dates <- data.frame (data.test[,1])
colnames (data.test.dates) <- 'MQLTime'
'attach dates to prediction df'
data.test.res <- cbind (data.test.dates, data.test.pred)
'force date coercion to attempt negating unambiguous format error '
data.test.res$MQLTime <- format(data.test.res$MQLTime, format = "%Y-%m-%d %H:%M:%S")
'delete row names, coerce to dataframe, generate row table name and export outcomes to MySQL'
rownames (data.test.res)<-NULL
data.test.res <- as.data.frame (data.test.res)
root.table <- stri_sub(table.name, 0, -5)
sqlUpdate (connection, data.test.res, tablename = paste0(db, '.', root.table, '_outcome'), index = "MQLTime")
'save RF and test df/s for future use; save latest version of row_counts to MQL4 folder'
save (data.rf, file = paste0 ("C:/Users/user/Documents/RF_test2/", table.name, '_RF.Rdata'))
save (data.test, file = paste0 ("C:/Users/user/Documents/RF_test2/", table.name, '_DF.Rdata'))
write.table (row.counts, paste0("C:/Users/user/AppData/Roaming/MetaQuotes/Terminal/71FA4710ABEFC21F77A62A104A956F23/MQL4/Files/", db, "_m1_rowcounts.csv"), sep = ",", col.names = F,
row.names = F, quote = F)
'end of conditional block'
}
'end of for loop'
}
'close all connection to MySQL'
odbcCloseAll()
'clear workspace'
rm(list=ls())
'end of function'
}
At this line:
data.test.res$MQLTime <- format(data.test.res$MQLTime, format = "%Y-%m-%d %H:%M:%S")
I have tried coercing MQLTime using various functions including: as.character(), as.POSIXct(), as.POSIXlt(), as.Date(), format(), as.character(as.Date())
and have also tried:
"%y" vs "%Y" and "%OS" vs "%S"
All variants seem to have no effect on the error and the function is still able to run on all other tables. I have checked the table manually (which contains almost 1500 rows) and also in MySQL looking for NULL dates or dates like "0000-00-00 00:00:00".
Also, if I run the function line by line in the R terminal, this offending table is processed without any problems, which just confuses the hell out of me.
I've exhausted all the functions/solutions I can think of (and also all those I could find through Dr. Google) so I am pleading for help here.
I should probably mention that the MQLTime column is stored as varchar() in MySQL. This was done to try and get around issues with type conversions between R and MySQL.
SHOW VARIABLES LIKE "%version%";
innodb_version, 5.6.19
protocol_version, 10
slave_type_conversions,
version, 5.6.19
version_comment, MySQL Community Server (GPL)
version_compile_machine, x86
version_compile_os, Win32
> sessionInfo()
R version 3.0.2 (2013-09-25)
Platform: i386-w64-mingw32/i386 (32-bit)
Edit: str() output on the data as imported from MySQL, showing MQLTime is already in POSIXct format:
> str(raw.data)
'data.frame': 1472 obs. of 8 variables:
$ MQLTime: POSIXct, format: "2014-11-05 23:35:00" "2014-11-05 23:57:00" "2014-11-06 00:40:00" "2014-11-06 00:46:00" ...
$ bar5 : int 184 203 179 28 309 24 156 48 309 437 ...
$ bar4 : int 24 184 309 192 48 177 48 68 60 71 ...
$ bar3 : int 8 204 49 60 9 64 68 27 192 147 ...
$ bar2 : int 24 67 189 49 11 152 27 56 437 67 ...
$ bar1 : int 67 51 75 152 24 188 56 147 71 0 ...
$ pat1 : int 147 147 19 147 19 19 147 19 147 19 ...
$ baXRC : Factor w/ 3 levels "Down","Flat",..: 2 2 2 2 2 2 2 2 2 3 ...
So I have tried declaring stringsAsFactors = FALSE in the data frame operations and this had no effect.
Interestingly, if the offending table is removed from processing through an additional conditional statement in the first 'if' block, the function stops on the table immediately preceding the blocked table.
If both the original and the new offending tables are removed from processing, then the function stops on the table immediately prior to them. I have never seen this sort of behavior before and it really has me stumped.
I watched system resources during the function and they never seem to max out.
Could this be a problem with the 'for' loop and not necessarily date formats?
There appears to be some egg on my face. The table following the table where the function was stopping had a row with value '0000-00-00 00:00:00'. I added another statement in my MySQL function to remove these rows when pre-processing the tables. Thanks to those that had a look at this.

Sphinx Error: failed to merge index

While merging an index with its delta in Sphinx, I got this error:
~: /usr/local/bin/indexer --merge myindex myindexDelta --rotate;
Sphinx 2.0.6-release (r3473)
Copyright (c) 2001-2012, Andrew Aksyonoff
Copyright (c) 2008-2012, Sphinx Technologies Inc ( http://sphinxsearch.com )
using config file '/usr/local/etc/sphinx.conf'...
merging index 'myindexDelta' into index 'myindex'...
read 414.6 of 414.6 MB, 100.0% done
FATAL: failed to merge index 'myindexDelta' into index 'myindex': failed to open /server/sphinx/data/myindex.sps: No such file or directory
My configuration in sphinx.conf is as follows:
source myindex
{
    type                = mysql
    sql_host            = localhost
    sql_user            = db
    sql_pass            =
    sql_db              = db
    sql_query_pre       = SET SESSION query_cache_type=OFF
    sql_query_pre       = REPLACE INTO sph_counter SELECT 1, MAX(id) FROM mytable
    sql_query_pre       = SET NAMES utf8
    sql_query           = \
        SELECT id,title FROM mytable \
        WHERE id<=( SELECT max_doc_id FROM sph_counter WHERE counter_id=1 )
    sql_ranged_throttle = 0
}

source myindexDelta : myindex
{
    sql_query_pre = SET SESSION query_cache_type=OFF
    sql_query_pre = SET NAMES utf8
    sql_query     = \
        SELECT id,title FROM mytable \
        WHERE id > ( SELECT max_doc_id FROM sph_counter WHERE counter_id=1 )
}

index myindex
{
    source        = myindex
    path          = /server/sphinx/data/myindex
    min_word_len  = 3
    min_infix_len = 0
}

index myindexDelta : myindex
{
    source        = myindexDelta
    path          = /server/sphinx/data/myindexDelta
    min_word_len  = 3
    min_infix_len = 0
}
Index file listing with permissions:
~: ls -lh /server/sphinx/data/
-rw-r--r-- 1 root root 0 Nov 11 21:40 myindexDelta.spa
-rw-r--r-- 1 root root 290K Nov 11 21:40 myindexDelta.spd
-rw-r--r-- 1 root root 328 Nov 11 21:40 myindexDelta.sph
-rw-r--r-- 1 root root 106K Nov 11 21:40 myindexDelta.spi
-rw-r--r-- 1 root root 0 Nov 11 21:40 myindexDelta.spk
-rw------- 1 root root 0 Nov 11 21:40 myindexDelta.spl
-rw-r--r-- 1 root root 0 Nov 11 21:40 myindexDelta.spm
-rw-r--r-- 1 root root 223K Nov 11 21:40 myindexDelta.spp
-rw-r--r-- 1 root root 1 Nov 11 21:40 myindexDelta.sps
-rw-r--r-- 1 root root 0 Jul 3 21:17 myindex.spa
-rw-r--r-- 1 root root 7.0G Jul 3 23:54 myindex.spd
-rw-r--r-- 1 root root 290 Jul 3 23:54 myindex.sph
-rw-r--r-- 1 root root 397M Jul 3 23:54 myindex.spi
-rw-r--r-- 1 root root 0 Jul 3 23:54 myindex.spk
-rw------- 1 root root 0 Nov 11 21:08 myindex.spl
-rw-r--r-- 1 root root 0 Jul 3 21:17 myindex.spm
-rw-r--r-- 1 root root 9.2G Jul 3 23:54 myindex.spp
I am sure the code explains everything, adding description is not necessary.
I'm guessing that the original 'myindex' was made by a different version of Sphinx (i.e. I don't think 2.0.6-release would have been available in July).
Somewhere in that version update the requirement for a .sps file changed: the new version requires it, whereas the old one doesn't. You have no string attributes, which is why the file contains no data in the delta.
I would suggest either rebuilding myindex with your current version of indexer, so the versions are identical.
Or maybe you could try copying myindexDelta.sps to myindex.sps. It contains no data (1 dummy byte!) so it shouldn't corrupt anything. You would only need to do this once.