convert json string to integer with pyspark


I want to convert a string field read from a JSON file into an integer using PySpark.
df1.select(df1["`result.price`"]).dtypes
Out[15]: [('result.price', 'string')]
df1=df1.withColumn(df1.select(df1["`result.price`"]),F.col(df1.select(df1["`result.price`"])).cast(T.IntegerType()))
'DataFrame' object has no attribute '_get_object_id'

If you want to modify inline:
Since you are trying to modify the data type of a nested struct field, I think you need to apply a new StructType.
Take a look at this answer: https://stackoverflow.com/a/63270808/2956135
If you are okay with extracting to a different column:
df1 = df1.withColumn('price', F.col('result.price').cast(T.IntegerType()))
TL;DR
Why does your line give an error?
There are a few mistakes in this syntax:
df1 = df1.withColumn(df1.select(df1["`result.price`"]),F.col(df1.select(df1["`result.price`"])).cast(T.IntegerType()))
First, the 1st argument of withColumn has to be a string: the name of the column you want to save the result as.
Second, F.col's argument has to be a column name string or a reference to the column, not a DataFrame, which is what df1.select(...) returns.
So the following syntax should not throw an error; however, the cast value is saved to a new column:
df1 = df1.withColumn('result.price', F.col('result.price').cast(T.IntegerType()))

Related

mySql JSON string field returns encoded

This is my first week dealing with a MySQL database and JSON field types, and I cannot figure out why values are encoded automatically and then returned in encoded format.
Given the following SQL
-- create a multiline string with a tab example
SET @str ="Line One
Line 2 Tabbed out
Line 3";
-- encode it
SET @j = JSON_OBJECT("str", @str);
-- extract the value by name
SET @strOut = JSON_EXTRACT(@J, "$.str");
-- show the object and attribute value.
SELECT @j, @strOut;
You end up with what appears to be a fully formed JSON object with a single encoded attribute.
@j = {"str": "Line One\n\tLine 2\tTabbed out\n\tLine 3"}
But when I use JSON_EXTRACT to get the attribute value, I get the encoded version, including the outer quotes.
@strOut = "Line One\n\tLine 2\tTabbed out\n\tLine 3"
I would expect to get my original string, with the \n and \t unescaped to their original values and no outer quotes, like this:
Line One
Line 2 Tabbed out
Line 3
I can't seem to find any JSON_DECODE or JSON_UNESCAPE or similar functions.
I did find a JSON_ESCAPE() function but that appears to be used to manually build a JSON object structure in a string.
What am I missing to extract the values to the original format?

I like to use the handy ->> operator for this.
It was introduced in MySQL 5.7.13, and basically combines JSON_EXTRACT() and JSON_UNQUOTE():
SET @strOut = @J ->> '$.str';

You are looking for the JSON_UNQUOTE function:
SET @strOut = JSON_UNQUOTE( JSON_EXTRACT(@J, "$.str") );

The result of JSON_EXTRACT() is intentionally a JSON document, not a string.
A JSON document may be:
An object enclosed in { }
An array enclosed in [ ]
A scalar string value enclosed in " "
A scalar number or boolean value
A null — but this is not an SQL NULL, it's a JSON null. This leads to confusing cases because you can extract a JSON field whose JSON value is null, and yet in an SQL expression, this fails IS NULL tests, and it also fails to be equal to an SQL string 'null'. Because it's a JSON type, not a scalar type.
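As a rough analogy only (this is Python's standard json module, not MySQL), the same distinction between a JSON string scalar and the plain string it encodes can be sketched as follows. The variable names are illustrative.

```python
import json

original = "Line One\n\tLine 2\tTabbed out\n\tLine 3"

# Encoding wraps the string in double quotes and escapes the newline and
# tab characters, similar to the JSON string scalar JSON_EXTRACT() returns.
encoded = json.dumps(original)

# Decoding strips the outer quotes and unescapes \n and \t, which is
# roughly what JSON_UNQUOTE() does for a JSON string scalar.
decoded = json.loads(encoded)

# A JSON null is its own value: it decodes to None, not to the text "null".
json_null = json.loads("null")

print(encoded)
print(decoded)
print(json_null is None)
```

The first print shows the quoted, escaped form; the second shows the original multi-line text.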

In Python, how to concisely get nested values in json data?

I have data loaded from JSON and am trying to extract arbitrary nested values using a list as input, where the list corresponds to the names of successive children. I want a function get_value(data,lookup) that returns the value from data by treating each entry in lookup as a nested child.
In the example below, when lookup=['alldata','TimeSeries','rates'], the return value should be [1.3241,1.3233].
json_data = {'alldata':{'name':'CAD/USD','TimeSeries':{'dates':['2018-01-01','2018-01-02'],'rates':[1.3241,1.3233]}}}
def get_value(data,lookup):
    res = data
    for item in lookup:
        res = res[item]
    return res
lookup = ['alldata','TimeSeries','rates']
get_value(json_data,lookup)
My example works, but there are two problems:
It's inefficient - In my for loop, I copy the whole TimeSeries object to res, only to then replace it with the rates list. (Edit: as @Andrej Kesely explained, res is only a reference at each iteration, so the data isn't actually being copied.)
It's not concise - I was hoping to be able to find a concise (eg one or two line) way of extracting the data using something like list comprehension syntax

If you want a one-liner and you are using Python 3.8 or later, you can use an assignment expression (the "walrus operator"):
json_data = {'alldata':{'name':'CAD/USD','TimeSeries':{'dates':['2018-01-01','2018-01-02'],'rates':[1.3241,1.3233]}}}
def get_value(data,lookup):
    return [data:=data[item] for item in lookup][-1]
lookup = ['alldata','TimeSeries','rates']
print( get_value(json_data,lookup) )
Prints:
[1.3241, 1.3233]

I don't think you can do it without a loop, but you could use a reducer here to increase readability.
import functools
functools.reduce(dict.get, lookup, json_data)
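One caveat with dict.get as the reducer: it only works while every intermediate value is a dict, so it cannot step into a list by index. A minimal sketch of a variation, assuming you also want list indices in the lookup path, uses operator.getitem instead. The name first_rate and the trailing index 0 are illustrative additions, not part of the question.

```python
from functools import reduce
from operator import getitem

json_data = {'alldata': {'name': 'CAD/USD',
                         'TimeSeries': {'dates': ['2018-01-01', '2018-01-02'],
                                        'rates': [1.3241, 1.3233]}}}

def get_value(data, lookup):
    # getitem(obj, key) is obj[key], so this works for dicts and lists alike
    # and raises KeyError/IndexError if a step in the path is missing.
    return reduce(getitem, lookup, data)

rates = get_value(json_data, ['alldata', 'TimeSeries', 'rates'])
first_rate = get_value(json_data, ['alldata', 'TimeSeries', 'rates', 0])
print(rates)
print(first_rate)
```

Unlike dict.get, this raises an exception on a missing key instead of returning None, which may or may not be what you want.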

reading .csv file with decimals separated by a comma with CSV.jl

I am trying to read some data into a Julia data frame to work with it. A minimal example of the .csv file could look like this:
A; B; C; D
ab; 1,23; 4; 9,2
ab; 3,4; 7; 1,1
ba; 6; 2,3; 8,6
I load the following two packages and read the data:
using DataFrames
using CSV
d = CSV.read( "test.csv", delim=";")
Julia recognizes the following types:
eltypes(d)
CategoricalArrays.CategoricalString{UInt32}
String
String
String
How could I now turn whole columns to floats with the comma replaced by a dot? My first idea was to use:
float(d[1,2])
But I did not find an option to tell julia to replace the comma with a dot.
My next idea was to first replace the comma and then convert it:
float(replace(d[1,2], ",", "."))
That works fine on a single cell but not on a whole column:
float(replace(d[:,2], ",", "."))
MethodError: no method matching
replace(::WeakRefStrings.WeakRefStringArray{WeakRefString{UInt8},1,Union{}},
::String, ::String)
I also tried:
d = CSV.read( "test.csv", delim=";", decimal=",")
which also just gives an error ...
Any ideas how to handle this problem and how to efficiently read the data into julia?
Thanks a lot!
Best regards.

One straightforward way is to read the file to string, replace the comma decimal separators by dots and then create the DataFrame from it:
s = replace(readstring("test.csv"), ",", ".")
CSV.read(IOBuffer(s); delim=';', types=[String, Float64, Float64, Float64])
Note that you can use the types keyword to specify the column types (it will then implicitly parse the string entries).
EDIT: According to this GitHub issue, CSV.jl's read method supports a decimal keyword (from version v0.2.0 on), which allows you to do
CSV.read("test.csv"; delim=';', decimal=',', types=[String, Float64, Float64, Float64])
EDIT: Removed hint to alternatively use readtable from DataFrames.jl because it seems to be deprecated in favor of CSV.read.

jsx:decode json into string value instead of binary

I read in the jsx docs that the post_decode option was removed because it prevented the evolution of jsx.
So what is the alternative now if I want to do what post_decode could do? For example, with this option I could supply a function to convert binary values into string values:
F = fun(E) when is_binary(E) -> binary_to_list(E) end,
jsx:decode(BinaryJsonString, [{post_decode,F}]).
How do I do that now?

How can I loop over a map of String List (with an iterator) and load another String List with the values of InputArray?

How can I iterate over an InputArray and load another array with the same values, except in lower case (I know that there is a string to-lower function)?
Question: How to iterate over a String List with a LOOP structure?
InputArray: A, B, C
OutputArray should be: a, b, c

If you want to keep the inputArray as it is and save the lowercase values in an outputArray, follow the steps in the image below, which is self-explanatory:
In the loop Step, Input Array should be /inputArray and Output Array should be /outputArray.

Your InputArray field looks like a string field. It's not a string list.
You need to use pub.string:tokenize from the WmPublic package to split your strings into a string list and then loop through the string list.
A string field looks like this in the pipeline:
A string list looks like this in the pipeline:
See the subtle difference in the little icon at the left ?

I can see two cases here.
If your input is a string:
1. Convert the string to a string list with the pub.string:tokenize service.
2. Loop over the string list by providing the name of the string list in the Input Array property of the loop.
3. Within the loop, use the pub.string:toLower service as a transformer and map the output to an output string.
4. Put the output string name in the Output Array property of the loop.
5. Once you come out of the loop you will see two string lists, one with upper case and one with lower case.
If your input is a string list:
In this case, follow steps 2 to 5 as mentioned above.
