Greenplum: turn JSON strings into a table

Everyone, how does GP turn strings that look like a dictionary, as below, into a table with field names and values?
I mean, how do I do it in SQL?
{"useCoupon":true,"useLevel":false,"usePoints":false,"useActivity" :false}

create table test(a json);
insert into test(a) values('{"useCoupon":true,"useLevel":false,"usePoints":false,"useActivity" :false}');
insert into test(a) values('{"useCoupon":false,"useLevel":false,"usePoints":true,"useActivity" :true}');
After loading the data into a staging table using COPY, gpfdist, or any other utility, select the required fields as below and load them into the final table.
gpadmin=# select a -> 'useCoupon' as useCoupon, a -> 'useLevel' as useLevel, a -> 'usePoints' as usePoints, a -> 'useActivity' as useActivity from test;
 usecoupon | uselevel | usepoints | useactivity
-----------+----------+-----------+-------------
 true      | false    | false     | false
 false     | false    | true      | true
(2 rows)
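If you want the extracted values as text rather than json, the ->> operator does that, and text casts cleanly to other types. A minimal sketch of loading the extracted fields into a final table — final_table and its boolean columns are made-up names, and ->> assumes a Greenplum/PostgreSQL version that ships it alongside -> :
-- Hypothetical final table; adjust names and types to your schema.
create table final_table(
    usecoupon   boolean,
    uselevel    boolean,
    usepoints   boolean,
    useactivity boolean
);

-- ->> returns text instead of json, so the values can be cast directly.
insert into final_table
select (a ->> 'useCoupon')::boolean,
       (a ->> 'useLevel')::boolean,
       (a ->> 'usePoints')::boolean,
       (a ->> 'useActivity')::boolean
from test;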

Related

Loading quoted numbers into a Snowflake table from CSV with COPY INTO <table>

I have a problem loading CSV data into a Snowflake table. Fields are wrapped in double quote marks, and hence there is a problem importing them into the table.
I know that COPY INTO has the CSV-specific option FIELD_OPTIONALLY_ENCLOSED_BY = '"', but it's not working at all.
Here are some pieces of the table definition and the copy command:
CREATE TABLE ...
(
    GamePlayId NUMBER NOT NULL,
    etc...
....);

COPY INTO ...
FROM '...csv.gz'
FILE_FORMAT = (TYPE = CSV
    STRIP_NULL_VALUES = TRUE
    FIELD_DELIMITER = ','
    SKIP_HEADER = 1
    ERROR_ON_COLUMN_COUNT_MISMATCH = FALSE
    FIELD_OPTIONALLY_ENCLOSED_BY = '"'
)
ON_ERROR = "ABORT_STATEMENT"
;
Csv file looks like this:
"3922000","14733370","57256","2","3","2","2","2019-05-23 14:14:44",",00000000",",00000000",",00000000",",00000000","1000,00000000","1000,00000000","1317,50400000","1166,50000000",",00000000",",00000000",",00000000",",00000000",",00000000",",00000000",",00000000",",00000000",",00000000",",00000000",",00000000",",00000000",",00000000",",00000000",",00000000",",00000000"
I get an error:
Numeric value '"3922000"' is not recognized
I'm pretty sure it's because the NUMBER value is interpreted as a string when Snowflake reads the "" marks, but since I use
FIELD_OPTIONALLY_ENCLOSED_BY = '"'
it shouldn't even be there... Does anyone have a solution to this?
Maybe something is incorrect with your file? I was just able to run the following without issue.
1. create the test table:
CREATE OR REPLACE TABLE dbNameHere.schemaNameHere.stacko_58322339 (
    num1 NUMBER,
    num2 NUMBER,
    num3 NUMBER);
2. create test file, contents as follows
1,2,3
"3922000","14733370","57256"
3,"2",1
4,5,"6"
3. create stage and put file in stage
4. run the following copy command
COPY INTO dbNameHere.schemaNameHere.STACKO_58322339
FROM @stageNameHere/stacko_58322339.csv.gz
FILE_FORMAT = (TYPE = CSV
    STRIP_NULL_VALUES = TRUE
    FIELD_DELIMITER = ','
    SKIP_HEADER = 0
    ERROR_ON_COLUMN_COUNT_MISMATCH = FALSE
    FIELD_OPTIONALLY_ENCLOSED_BY = '"'
)
ON_ERROR = "CONTINUE";
5. results
+--------------------------------------+--------+-------------+-------------+-------------+-------------+-------------+------------------+-----------------------+-------------------------+
| file                                 | status | rows_parsed | rows_loaded | error_limit | errors_seen | first_error | first_error_line | first_error_character | first_error_column_name |
|--------------------------------------+--------+-------------+-------------+-------------+-------------+-------------+------------------+-----------------------+-------------------------|
| stageNameHere/stacko_58322339.csv.gz | LOADED |           4 |           4 |           4 |           0 | NULL        | NULL             | NULL                  | NULL                    |
+--------------------------------------+--------+-------------+-------------+-------------+-------------+-------------+------------------+-----------------------+-------------------------+
1 Row(s) produced. Time Elapsed: 2.436s
6. view the records
>SELECT * FROM dbNameHere.schemaNameHere.stacko_58322339;
+---------+----------+-------+
| NUM1    | NUM2     | NUM3  |
|---------+----------+-------|
|       1 |        2 |     3 |
| 3922000 | 14733370 | 57256 |
|       3 |        2 |     1 |
|       4 |        5 |     6 |
+---------+----------+-------+
Can you try with a similar test as this?
EDIT: A quick look at your data shows many of your numeric fields appear to start with commas, so something is definitely amiss with the data.
Assuming your numbers are European-formatted, with , as the decimal separator and . for thousands, then reading the numeric formatting help, it seems Snowflake does not support this as input. I'd open a feature request.
But if you read the column in as text and then use REPLACE, like
SELECT '100,1234'::text as A
      ,REPLACE(A, ',', '.') as B
      ,TRY_TO_DECIMAL(B, 20, 10) as C;
gives:
A        | B        | C
---------+----------+----------------
100,1234 | 100.1234 | 100.1234000000
Safer would be to strip the thousands placeholders first, like
SELECT '1.100,1234'::text as A
      ,REPLACE(A, '.') as B
      ,REPLACE(B, ',', '.') as C
      ,TRY_TO_DECIMAL(C, 20, 10) as D;

Trying to make a ping-pong stat-tracking database with a stored procedure

I'm using a stored procedure to (try to) write to 3 different tables in MySQL to track ping-pong data and show cool statistics.
So I'm a complete noob to MySQL (and Stack Overflow) and haven't really used any sort of database language before, so all of this is pretty new to me. I'm trying to make a stored procedure that writes ping-pong stats that come from Ignition (I'm fairly certain that Ignition isn't the problem; it tells me the writes failed, so I think it's a problem with my stored procedure).
I currently have one stored procedure that writes to the players table and can add wins, losses, and total games played when a button is pressed. My problem now is that I want to add statistics where I can track the score and who played against who so I could make graphs and stuff.
This stored procedure is supposed to search the pingpong table to find whether the names passed in have played against each other before, so I can find the corresponding MatchupID. If the players haven't played before, it should create a new row with a new MatchupID (this is the key, so it should be unique every time). Once I have the MatchupID, I can figure out how many games the players have played against each other before, what the scores were, who beat whom, and so on.
Here's what I've written. MySQL says it's fine, but obviously it's not working. I know it's not completely finished, but I really need some guidance, since this is only my second time doing anything with MySQL or any database language, and I don't think it should be failing on any sort of write.
CREATE DEFINER=`root`@`localhost` PROCEDURE `Matchups`(
    #these are passed from Ignition and should be working
    IN L1Name VARCHAR(255),  #Player 1 name on the left side
    IN L2Name VARCHAR(255),  #Player 2 name on the left side
    IN R1Name VARCHAR(255),  #Player 3 name on the right side
    IN R2Name VARCHAR(255),  #Player 4 name on the right side
    IN TWOvTWO int,  #If this is 1, then L1,L2,R1,R2 are playing instead of L1,R1
    IN LeftScore int,
    IN RightScore int)
BEGIN
    DECLARE x int DEFAULT 0;
    IF((
        SELECT MatchupID
        FROM pingpong
        WHERE (PlayerL1 = L1Name AND PlayerR1 = R1Name) OR (PlayerL1 = R1Name AND PlayerR1 = L1Name)
    ) IS NULL) THEN
        INSERT INTO pingpong (PlayerL1, PlayerL2, PlayerR1, PlayerR2) VALUES (L1Name, L2Name, R1Name, R2Name);
        INSERT INTO pingponggames (MatchupID, Lscore, Rscore)
        VALUES ((SELECT MatchupID
                 FROM pingpong
                 WHERE (PlayerL1 = L1Name AND PlayerR1 = R1Name) OR (PlayerL1 = R1Name AND PlayerR1 = L1Name)),
                LeftScore, RightScore);
    END IF;
END
Here are what my tables currently look like:
pingpong
PlayerL1 | PlayerL2 | PlayerR1 | PlayerR2 | MatchupID
-----------------------------------------------------
L1       | NULL     | R1       | NULL     | 1
L1       | NULL     | L2       | NULL     | 3
L1       | NULL     | R2       | NULL     | 4
L1       | NULL     | test2    | NULL     | 5

pingponggames
GameID | MatchupID | LScore | RScore
------------------------------------
1      | 1         | NULL   | NULL

pingpongplayers
Name  | TotalWins | TotalLosses | GamesPlayed
---------------------------------------------
L1    | 8         | 5           | NULL
L2    | 1         | 1           | NULL
R1    | 1         | 6           | 7
R2    | 1         | 1           | NULL
test2 | 1         | 0           | 1
test1 | 0         | 0           | 0
I've explained some features in the comments below; if you need more, I'll need more info.
CREATE DEFINER=`root`@`localhost` PROCEDURE `Matchups`(
    # These are passed from Ignition and should be working.
    IN L1Name VARCHAR(255),  # Player 1 name on the left side
    IN L2Name VARCHAR(255),  # Player 2 name on the left side
    IN R1Name VARCHAR(255),  # Player 3 name on the right side
    IN R2Name VARCHAR(255),  # Player 4 name on the right side
    -- What will the input be other than 1? It distinguishes doubles from
    -- singles, right? So taking 0 as singles and 1 as doubles.
    IN TWOvTWO INT,  # If this is 1, then L1,L2,R1,R2 are playing instead of L1,R1
    IN LeftScore INT,
    IN RightScore INT)
BEGIN
    DECLARE x INT DEFAULT 0;         # I assume you use this elsewhere in the procedure.
    DECLARE v_matchupid INT;         # Used INT; if the MatchupID column's type differs, match it.
    DECLARE inserted_matchupid INT;  -- Likewise, base the type on pingpong.MatchupID.

    IF (TWOvTWO = 0) THEN  -- singles
        # Your original query only ever matched singles, so this branch keeps that
        # check. Avoid comparing raw name strings: better to keep a player master
        # table and refer to its id from the other tables.
        SELECT MatchupID INTO v_matchupid
        FROM pingpong
        WHERE L1Name IN (PlayerL1, PlayerR1) AND R1Name IN (PlayerL1, PlayerR1);

        # The IF part checks whether the matchup is new and inserts into both tables.
        IF (v_matchupid IS NULL) THEN
            INSERT INTO pingpong (PlayerL1, PlayerR1) VALUES (L1Name, R1Name);
            SET inserted_matchupid = LAST_INSERT_ID();
            INSERT INTO pingponggames (MatchupID, Lscore, Rscore)
            VALUES (inserted_matchupid, LeftScore, RightScore);
            /*
              "Once I have the MatchupID, I can then figure out how many games the
              players have played against each other before."
              Note: that won't apply to a brand-new matchup, since its MatchupID
              was only just created.
            */
        ELSE
            # Assuming that when a match is found you update pingponggames with
            # the matched MatchupID; adjust this to what you actually need.
            UPDATE pingponggames SET Lscore = LeftScore, Rscore = RightScore
            WHERE MatchupID = v_matchupid;
        END IF;
    ELSE  -- doubles
        # Assuming TWOvTWO is only ever 0 or 1; if there are more values, make
        # this block "ELSEIF (TWOvTWO = 1)".
        SELECT MatchupID INTO v_matchupid
        FROM pingpong
        # Note: if two players share a name this is ambiguous, so better to use
        # a unique id as the reference.
        WHERE L1Name IN (PlayerL1, PlayerL2, PlayerR1, PlayerR2) AND
              L2Name IN (PlayerL1, PlayerL2, PlayerR1, PlayerR2) AND
              R1Name IN (PlayerL1, PlayerL2, PlayerR1, PlayerR2) AND
              R2Name IN (PlayerL1, PlayerL2, PlayerR1, PlayerR2);

        IF (v_matchupid IS NULL) THEN
            INSERT INTO pingpong (PlayerL1, PlayerL2, PlayerR1, PlayerR2)
            VALUES (L1Name, L2Name, R1Name, R2Name);
            SET inserted_matchupid = LAST_INSERT_ID();
            INSERT INTO pingponggames (MatchupID, Lscore, Rscore)
            VALUES (inserted_matchupid, LeftScore, RightScore);
        ELSE
            UPDATE pingponggames SET Lscore = LeftScore, Rscore = RightScore
            WHERE MatchupID = v_matchupid;
        END IF;
    END IF;
END
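For reference, calling the procedure from the client would look something like this (the names and scores are made up, and LAST_INSERT_ID() above assumes MatchupID is AUTO_INCREMENT):
-- Singles: L1 vs. R1, 21-15; the unused doubles slots are passed as NULL.
CALL Matchups('L1', NULL, 'R1', NULL, 0, 21, 15);

-- Doubles: L1/L2 vs. R1/R2, 18-21.
CALL Matchups('L1', 'L2', 'R1', 'R2', 1, 18, 21);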

spark df.write quote all fields but not null values

I am trying to create a CSV from values stored in the table:
| col1  | col2  | col3  |
| "one" | null  | "one" |
| "two" | "two" | "two" |
hive > select * from table where col2 is null;
one    null    one
I am generating the CSV using the code below:
df.repartition(1)
.write.option("header",true)
.option("delimiter", ",")
.option("quoteAll", true)
.option("nullValue", "")
.csv(S3Destination)
The CSV I get:
"col1","col2","col3"
"one","","one"
"two","two","two"
Expected CSV (with no double quotes for the null value):
"col1","col2","col3"
"one",,"one"
"two","two","two"
Any help is appreciated in finding out whether the dataframe writer has options to do this.
You can take a UDF approach and apply it on the columns (using withColumn on the repartitioned dataframe above) where a double-quoted empty string is possible; see the sample code below.
sqlContext.udf().register("convertToEmptyWithOutQuotes",
    (String abc) -> (abc.trim().length() > 0 ? abc : abc.replace("\"", " ")),
    DataTypes.StringType);
String has a replace method, which does the job.
val a = Array("'x'","","z")
println(a.mkString(",").replace("\"", " "))
will produce 'x',,z
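Applying the registered UDF with withColumn, as suggested above, might look roughly like this. This is a sketch, not a tested fix: col2 is just the column from the sample data, and whether the quotes disappear still depends on the writer's null handling in your Spark version.
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import static org.apache.spark.sql.functions.callUDF;
import static org.apache.spark.sql.functions.col;

// Run col2 (a column that may hold empty strings) through the UDF
// registered above, then write with the same options as before.
Dataset<Row> cleaned = df.withColumn("col2",
    callUDF("convertToEmptyWithOutQuotes", col("col2")));

cleaned.repartition(1)
    .write().option("header", true)
    .option("quoteAll", true)
    .csv(S3Destination);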

spark rdd filter by query mysql

I use Spark Streaming to stream data from Kafka, and I want to filter the data based on data in MySQL.
For example, I get data from Kafka that looks like:
{"id":1, "data":"abcdefg"}
and the data in MySQL looks like this:
id | state
1  | "success"
I need to query MySQL to get the state for each id.
I can open a MySQL connection inside the filter function, and it works. The code looks like this:
def isSuccess(x):
    id = x["id"]
    sql = """
        SELECT *
        FROM Test
        WHERE id = "{0}"
    """.format(id)
    conn = mysql_connection(......)
    result = rdbi.query_one(sql)
    if result == None:
        return False
    else:
        return True

successRDD = rdd.filter(isSuccess)
But it will open a connection for every row of the RDD, which wastes a lot of computing resources.
How should this be done in a filter?
I suggest you use mapPartitions, available in Apache Spark, to prevent initialization of a MySQL connection for every row.
This is the MySQL table that I created:
create table test2(id varchar(10), state varchar(10));
With the following values:
+------+---------+
| id   | state   |
+------+---------+
| 1    | success |
| 2    | stopped |
+------+---------+
Use the following PySpark code as a reference:
import MySQLdb

data1 = [["1", "afdasds"], ["2", "dfsdfada"], ["3", "dsfdsf"]]  # sample data; in your case this comes from streaming
rdd = sc.parallelize(data1)

def func1(partition):
    # One connection and one query per partition, instead of per record.
    con = MySQLdb.connect(host="127.0.0.1", user="root", passwd="yourpassword", db="yourdb")
    c = con.cursor()
    c.execute("select * from test2;")
    data = c.fetchall()
    states = {}
    for x in data:
        states[x[0]] = x[1]
    list1 = []
    for x in partition:
        if x[0] in states:
            list1.append([x[0], x[1], states[x[0]]])
        else:
            list1.append([x[0], x[1], "none"])  # assign "none" when the streamed id is not in the table
    return iter(list1)

print rdd.mapPartitions(func1).filter(lambda x: "none" not in x[2]).collect()
The output that I got was:
[['1', 'afdasds', 'success'], ['2', 'dfsdfada', 'stopped']]
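If you only need the filter itself rather than the joined output, the same one-connection-per-partition pattern applies. A sketch, assuming the same test2 table and that each streamed record is a dict like {"id": 1, "data": "abcdefg"}:
import MySQLdb

def is_success_partition(rows):
    # One connection and one query per partition, not per record.
    con = MySQLdb.connect(host="127.0.0.1", user="root",
                          passwd="yourpassword", db="yourdb")
    c = con.cursor()
    c.execute("select id from test2 where state = 'success';")
    success_ids = set(r[0] for r in c.fetchall())
    con.close()
    for row in rows:
        if str(row["id"]) in success_ids:
            yield row

successRDD = rdd.mapPartitions(is_success_partition)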

mysql recursive self join

create table test(
container varchar(1),
contained varchar(1)
);
insert into test values('X','A');
insert into test values('X','B');
insert into test values('X','C');
insert into test values('Y','D');
insert into test values('Y','E');
insert into test values('Y','F');
insert into test values('A','P');
insert into test values('P','Q');
insert into test values('Q','R');
insert into test values('R','Y');
insert into test values('Y','X');
select * from test;
mysql> select * from test;
+-----------+-----------+
| container | contained |
+-----------+-----------+
| X         | A         |
| X         | B         |
| X         | C         |
| Y         | D         |
| Y         | E         |
| Y         | F         |
| A         | P         |
| P         | Q         |
| Q         | R         |
| R         | Y         |
| Y         | X         |
+-----------+-----------+
11 rows in set (0.00 sec)
Can I find out all the distinct values contained under 'X' using a single self join?
EDIT
Like, here:
X contains A, B and C
A contains P
P contains Q
Q contains R
R contains Y
Y contains C, D and E...
So I want to display A,B,C,D,E,P,Q,R,Y when I query for X.
EDIT
Got it working in application code.
package com.catgen.helper;

import java.sql.Connection;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

import com.catgen.factories.Nm2NmFactory;

public class Nm2NmHelper {
    private List<String> fetched;
    private List<String> fresh;

    public List<String> findAllContainedNMByMarketId(Connection conn, String marketId) throws SQLException {
        fetched = new ArrayList<String>();
        fresh = new ArrayList<String>();
        fresh.add(marketId.toLowerCase());
        while (fresh.size() > 0) {
            fetched.add(fresh.get(0).toLowerCase());
            fresh.remove(0);
            List<String> tempList = Nm2NmFactory.getContainedNmByContainerNm(conn, fetched.get(fetched.size() - 1));
            if (tempList != null) {
                for (int i = 0; i < tempList.size(); i++) {
                    String current = tempList.get(i).toLowerCase();
                    if (!fetched.contains(current) && !fresh.contains(current)) {
                        fresh.add(current);
                    }
                }
            }
        }
        return fetched;
    }
}
Not the same table and fields though. But I hope you get the concept.
Thanks guys.
You can't get all the contained objects recursively using a single join with that data structure. You would need a recursive query, but MySQL doesn't yet support that.
You could, however, construct a closure table; then you can do it with a simple query. See Bill Karwin's slideshow Models for hierarchical data for more details and other approaches (for example, nested sets). Slide 69 compares the different designs for ease of implementing 'Query subtree'. Your chosen design (adjacency list) is the most awkward of all four designs for this type of query.
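To illustrate the closure-table idea: you store one row per container/contained pair, including the indirect ones, maintained whenever a containment edge is added, and the subtree question becomes a single lookup. A minimal sketch (the closure table name is made up):
-- One row per container/contained pair, direct or transitive,
-- kept up to date whenever a containment edge is inserted.
create table closure(
    container varchar(1),
    contained varchar(1),
    primary key (container, contained)
);

-- Everything contained under X, at any depth:
select distinct contained
from closure
where container = 'X';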
What about reading the whole table into a PHP array and determining the children via a function that calls itself?
But this is not a good solution if the table has more than 10,000 rows...