reading .csv file with decimals separated by a comma with CSV.jl

I am trying to read some data into a Julia data frame to work with it. A minimal example of the .csv file could look like this:
A; B; C; D
ab; 1,23; 4; 9,2
ab; 3,4; 7; 1,1
ba; 6; 2,3; 8,6
I load the following two packages and read the data:
using DataFrames
using CSV
d = CSV.read( "test.csv", delim=";")
Julia recognizes the following types:
eltypes(d)
CategoricalArrays.CategoricalString{UInt32}
String
String
String
How could I now turn whole columns to floats with the comma replaced by a dot? My first idea was to use:
float(d[1,2])
But I did not find an option to tell julia to replace the comma with a dot.
My next idea was to first replace the comma and then convert it:
float(replace(d[1,2], ",", "."))
That works fine on a single cell but not on a whole column:
float(replace(d[:,2], ",", "."))
MethodError: no method matching
replace(::WeakRefStrings.WeakRefStringArray{WeakRefString{UInt8},1,Union{}},
::String, ::String)
I also tried:
d = CSV.read( "test.csv", delim=";", decimal=",")
which also just gives an error ...
Any ideas how to handle this problem and how to efficiently read the data into julia?
Thanks a lot!
Best regards.

One straightforward way is to read the file into a string, replace the comma decimal separators with dots, and then create the DataFrame from it:
s = replace(readstring("test.csv"), ",", ".")
CSV.read(IOBuffer(s); delim=';', types=[String, Float64, Float64, Float64])
Note that you can use the types keyword to specify the column types (it will then implicitly parse the string entries).
EDIT: According to this GitHub issue, CSV.jl's read method supports a decimal keyword (from v0.2.0 on), which allows you to write:
CSV.read("test.csv"; delim=';', decimal=',', types=[String, Float64, Float64, Float64])
EDIT: Removed hint to alternatively use readtable from DataFrames.jl because it seems to be deprecated in favor of CSV.read.
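For illustration, the replace-then-parse idea can also be sketched outside Julia. The following Python sketch uses only the standard library and the sample data from the question. The helper name parse_decimal_comma is hypothetical, and the sketch assumes the numeric fields contain no thousands separators.

```python
import csv
import io

# Sample data from the question: semicolon-delimited, comma as decimal mark.
raw = "A; B; C; D\nab; 1,23; 4; 9,2\nab; 3,4; 7; 1,1\nba; 6; 2,3; 8,6\n"


def parse_decimal_comma(text):
    """Convert a string such as '1,23' to the float 1.23 (hypothetical helper)."""
    return float(text.strip().replace(",", "."))


# skipinitialspace drops the blank that follows each semicolon in the sample.
reader = csv.reader(io.StringIO(raw), delimiter=";", skipinitialspace=True)
header = next(reader)

# Keep the first column as text and convert the remaining columns to floats.
rows = [[row[0]] + [parse_decimal_comma(cell) for cell in row[1:]] for row in reader]
```

Replacing the comma only inside the numeric fields, rather than in the whole file, leaves any commas in text columns untouched.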

data type mismatch in SQL Spark

I'm trying to extract a small part of an array. I cast the array to string type and then use split/split_part to extract the data, but Jupyter keeps saying that the column, which I have already cast from array to string, cannot be resolved due to a data type mismatch.
here's my sql code:
TRIM(SPLIT(CAST(SPLIT(CAST(log as STRING),' ',4) as STRING),'OpenLevel39',2)) as server_launch_date
or another line of code is also using the same method:
datediff('day', DATE(TRIM(SPLIT(SPLIT(CAST(log as STRING),' ',4),'OpenLevel39',2))), current_date) as server_create_in_days
and here's what the error says:
AnalysisException: cannot resolve 'trim(split(CAST(split(CAST(spark_catalog.jxm.timeframe.log AS STRING), ' ', 4) AS STRING), 'OpenLevel39', 2))' due to data type mismatch: argument 1 requires string type, however, 'split(CAST(split(CAST(spark_catalog.jxm.timeframe.log AS STRING), ' ', 4) AS STRING), 'OpenLevel39', 2)' is of array type.; line 20 pos 0;
please can anyone help me with this problem? much appreciated.
Spark's split returns an array with a length of at most limit as stated in the documentation.
On the other hand, trim requires the first parameter to be of type string; you are passing an array.
You can try to cast the array to string first, then use trim, as below:
trim(CAST(split(CAST(split(CAST(spark_catalog.jxm.timeframe.log AS STRING), ' ', 4) AS STRING), 'OpenLevel39', 2) as STRING))
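However, this does not accomplish much, because the string form of an array has no leading or trailing spaces to trim. To extract a single piece, select one array element by index first (for example, split(...)[1]) and trim that element instead.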
Good luck!
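The mismatch described in this answer (split produces a sequence, trim wants a single string) can be shown with a plain-Python analogy. This is not Spark code, and the log text is hypothetical because the question does not show the real contents of the log column.

```python
# Hypothetical log text; the real contents of the log column are not shown.
log_text = "server started OpenLevel39 2023-01-02 "

# maxsplit=1 yields at most two pieces, similar in spirit to limit = 2 in Spark's split.
parts = log_text.split("OpenLevel39", 1)

# A list has no strip method, just as trim in Spark rejects an array argument.
try:
    parts.strip()
    strip_on_list_failed = False
except AttributeError:
    strip_on_list_failed = True

# Select one element first, then trim it.
launch_date = parts[1].strip()
```

The same order of operations applies in Spark SQL: index into the array returned by split, then pass that single string to trim.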

convert json string to integer with pyspark

I want to convert a string object from a JSON file into an integer using pyspark.
df1.select(df1["`result.price`"]).dtypes
Out[15]: [('result.price', 'string')]
df1=df1.withColumn(df1.select(df1["`result.price`"]),F.col(df1.select(df1["`result.price`"])).cast(T.IntegerType()))
'DataFrame' object has no attribute '_get_object_id'
If you want to modify inline:
Since you are trying to modify the data type of a nested struct field, I think you need to apply a new StructType.
Take a look at this https://stackoverflow.com/a/63270808/2956135
If you are okay with extracting to a different column:
df1 = df1.withColumn('price', F.col('result.price').cast(T.IntegerType()))
TL;DR
Why does your line give an error?
There are a few mistakes in this syntax.
df1 = df1.withColumn(df1.select(df1["`result.price`"]),F.col(df1.select(df1["`result.price`"])).cast(T.IntegerType()))
First, the first argument of withColumn has to be a string: the name of the column you want to save to.
Second, F.col's argument has to be a column name string or a reference to the column.
So the following syntax does not throw an error; however, the cast value is saved to a new column.
df1 = df1.withColumn('result.price', F.col('result.price').cast(T.IntegerType()))

Error parsing JSON: more than one document in the input (Redshift to Snowflake SQL)

I'm trying to convert a query from Redshift to Snowflake SQL.
The Redshift query looks like this:
SELECT
cr.creatives as creatives
, JSON_ARRAY_LENGTH(cr.creatives) as creatives_length
, JSON_EXTRACT_PATH_TEXT(JSON_EXTRACT_ARRAY_ELEMENT_TEXT (cr.creatives,0),'previewUrl') as preview_url
FROM campaign_revisions cr
The Snowflake query looks like this:
SELECT
cr.creatives as creatives
, ARRAY_SIZE(TO_ARRAY(ARRAY_CONSTRUCT(cr.creatives))) as creatives_length
, PARSE_JSON(PARSE_JSON(cr.creatives)[0]):previewUrl as preview_url
FROM campaign_revisions cr
It seems like JSON_EXTRACT_PATH_TEXT isn't converted correctly, as the Snowflake query results in error:
Error parsing JSON: more than one document in the input
cr.creatives is formatted like this:
"[{""previewUrl"":""https://someurl.com/preview1.png"",""device"":""desktop"",""splitId"":null,""splitType"":null},{""previewUrl"":""https://someurl.com/preview2.png"",""device"":""mobile"",""splitId"":null,""splitType"":null}]"
It seems to me that you are not working with valid JSON data inside Snowflake.
Please review your file format used for the copy into command.
If you open the "JSON" text provided in a text editor, you will see that it is not parsed or formatted as JSON because of the quoting. Once the double quotes / escaped quotes are handled, you should be able to make good progress.
If you are not inclined to reload your data, see if you can create a JavaScript user-defined function to remove the extra quotes from your string; then you can use Snowflake to process the variant column.
The following plain JavaScript works and removes the doubled double quotes for you.
var textOriginal = '[{""previewUrl"":""https://someurl.com/preview1.png"",""device"":""desktop"",""splitId"":null,""splitType"":null},{""previewUrl"":""https://someurl.com/preview2.png"",""device"":""mobile"",""splitId"":null,""splitType"":null}]';
// Turn each doubled double quote into a single one, then parse the result as JSON.
function parseText(input){
var a = input.replaceAll('""','"');
a = JSON.parse(a);
return a;
}
var x = parseText(textOriginal);
console.log(x);
For anyone else seeing this double double quote issue in JSON fields coming from CSV files in a Snowflake external stage (slightly different issue than the original question posted):
The issue is likely that you need to use the FIELD_OPTIONALLY_ENCLOSED_BY setting, specifically FIELD_OPTIONALLY_ENCLOSED_BY = '"', when setting up your file format.
Example of creating such a file format:
create or replace file format mydb.myschema.my_tsv_file_format
type = CSV
field_delimiter = '\t'
FIELD_OPTIONALLY_ENCLOSED_BY = '"';
And example of querying from a stage using this file format:
select
$1 field_one,
$2 field_two
-- ...and so on
from '@my_s3_stage/path/to/file/my_tab_separated_file.csv' (file_format => 'my_tsv_file_format')
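To see why the enclosing-quote setting matters, here is a small Python sketch (standard library only, not Snowflake code). It treats the creatives value from the question as one CSV field enclosed in double quotes, where a doubled quote stands for one literal quote. The sketch assumes the data really was exported that way.

```python
import csv
import io
import json

# The creatives value from the question, as stored: a CSV-quoted field.
raw = (
    '"[{""previewUrl"":""https://someurl.com/preview1.png"",""device"":""desktop"",'
    '""splitId"":null,""splitType"":null},'
    '{""previewUrl"":""https://someurl.com/preview2.png"",""device"":""mobile"",'
    '""splitId"":null,""splitType"":null}]"'
)

# csv.reader treats "..." as an enclosed field and "" inside it as one literal quote.
field = next(csv.reader(io.StringIO(raw)))[0]

# After CSV unquoting, the field is valid JSON.
creatives = json.loads(field)
```

Once the enclosing quotes are interpreted by the CSV layer, no manual quote replacement is needed before parsing the JSON.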

mySql JSON string field returns encoded

This is my first week dealing with a MySQL database and JSON field types, and I cannot figure out why values are encoded automatically and then returned in encoded format.
Given the following SQL
-- create a multiline string with a tab example
SET @str ="Line One
Line 2 Tabbed out
Line 3";
-- encode it
SET @j = JSON_OBJECT("str", @str);
-- extract the value by name
SET @strOut = JSON_EXTRACT(@J, "$.str");
-- show the object and attribute value.
SELECT @j, @strOut;
You end up with what appears to be a fully formed JSON object with a single encoded attribute:
@j = {"str": "Line One\n\tLine 2\tTabbed out\n\tLine 3"}
But when I use JSON_EXTRACT to get the attribute value, I get the encoded version, including the outer quotes:
@strOut = "Line One\n\tLine 2\tTabbed out\n\tLine 3"
I would expect to get my original string, with the \n and \t sequences unescaped to their original values and no outer quotes, like this:
Line One
Line 2 Tabbed out
Line 3
I can't seem to find any JSON_DECODE or JSON_UNESCAPE or similar functions.
I did find a JSON_ESCAPE() function but that appears to be used to manually build a JSON object structure in a string.
What am I missing to extract the values to the original format?
I like to use the handy ->> operator for this.
It was introduced in MySQL 5.7.13, and basically combines JSON_EXTRACT() and JSON_UNQUOTE():
SET @strOut = @J ->> '$.str';
You are looking for the JSON_UNQUOTE function:
SET @strOut = JSON_UNQUOTE( JSON_EXTRACT(@J, "$.str") );
The result of JSON_EXTRACT() is intentionally a JSON document, not a string.
A JSON document may be:
An object enclosed in { }
An array enclosed in [ ]
A scalar string value enclosed in " "
A scalar number or boolean value
A null, which is a JSON null rather than an SQL NULL. This leads to confusing cases: you can extract a JSON field whose JSON value is null, and yet in an SQL expression it fails IS NULL tests and is also not equal to the SQL string 'null', because it is a JSON type, not a scalar type.
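The difference between a JSON string value and the plain text it encodes can be illustrated with a Python analogy (standard library only, not MySQL code). The variable names are hypothetical; the string is the one from the question.

```python
import json

original = "Line One\n\tLine 2\tTabbed out\n\tLine 3"

# Build the JSON document, comparable to JSON_OBJECT("str", @str).
document = json.dumps({"str": original})

# Re-encode the extracted value as JSON text: this is the quoted, escaped form,
# comparable to what JSON_EXTRACT hands back.
extracted = json.dumps(json.loads(document)["str"])

# Decoding the JSON string gives the plain text, comparable to JSON_UNQUOTE.
unquoted = json.loads(extracted)

# A JSON null decodes to None, which is not the string "null".
json_null = json.loads("null")
```

Here extracted still carries the outer quotes and the \n and \t escapes, while unquoted is equal to the original multi-line string.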

jsx:decode json into string value instead of binary

I read in the jsx documentation that the post_decode option was removed because it prevented the evolution of jsx.
So what is the option now if I want what post_decode could do? For example, with this option I could supply a function that converts binary values into string values:
F = fun(E) when is_binary(E) -> binary_to_list(E) end,
jsx:decode(BinaryJsonString, [{post_decode,F}]).
How do I do that now?