PyArrow issue with timestamp data - pyarrow

I am trying to load data from a CSV into a Parquet file using pyarrow. I am using the convert options to set the columns to their proper data types, and the timestamp_parsers option to dictate how the timestamp data should be interpreted. Please see my "csv" below:
time,data
01-11-19 10:11:56.132,xxx
Please see my code sample below.
import pyarrow as pa
from pyarrow import csv
from pyarrow import parquet
convert_dict = {
    'time': pa.timestamp('us', None),
    'data': pa.string()
}

convert_options = csv.ConvertOptions(
    column_types=convert_dict,
    strings_can_be_null=True,
    quoted_strings_can_be_null=True,
    timestamp_parsers=['%d-%m-%y %H:%M:%S.%f']
)
table = csv.read_csv('test.csv', convert_options=convert_options)
print(table)
parquet.write_table(table, 'test.parquet')
Basically, pyarrow doesn't like some strptime directives. Specifically in this case, it does not like "%f", which is for fractional seconds (https://www.geeksforgeeks.org/python-datetime-strptime-function/). Any help to get pyarrow to do what I need would be appreciated.
Just to be clear, I can get the code to run if I edit the data to not have fractional seconds and then remove the "%f" from the timestamp_parsers option. However, I need to maintain the integrity of the data, so this is not an option. To me it seems like either a bug in pyarrow or I'm an idiot and missing something obvious. Open to both options, I just want to know which it is.

%f is not supported in pyarrow and most likely won't be, as it's a Python-specific flag. See the discussion here: https://issues.apache.org/jira/browse/ARROW-15883. PRs are of course always welcome!
As a workaround you could first read the timestamps as strings, then process them by slicing off the fractional part, casting it to pa.duration, and adding it back to the parsed timestamps:
import pyarrow as pa
import pyarrow.compute as pc
ts = pa.array(["1970-01-01T00:00:59.123456789", "2000-02-29T23:23:23.999999999"], pa.string())
ts2 = pc.strptime(pc.utf8_slice_codeunits(ts, 0, 19), format="%Y-%m-%dT%H:%M:%S", unit="ns")
d = pc.utf8_slice_codeunits(ts, 20, 99).cast(pa.int64()).cast(pa.duration("ns"))
pc.add(ts2, d)
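Tying that back to the CSV from the question, here is a minimal sketch of the same idea (my own adaptation, assuming the 'time' values always carry exactly three fractional digits like the sample row): read 'time' as a string, parse the supported part, and add the milliseconds back as a duration.
import pyarrow as pa
import pyarrow.compute as pc
from pyarrow import csv

# Read the 'time' column as plain strings so %f is never needed.
convert_options = csv.ConvertOptions(column_types={'time': pa.string(), 'data': pa.string()})
table = csv.read_csv('test.csv', convert_options=convert_options)

ts = table['time']                                       # e.g. "01-11-19 10:11:56.132"
base = pc.strptime(pc.utf8_slice_codeunits(ts, 0, 17),   # "01-11-19 10:11:56"
                   format='%d-%m-%y %H:%M:%S', unit='us')
millis = pc.utf8_slice_codeunits(ts, 18, 99).cast(pa.int64())  # "132" -> 132
frac = pc.multiply(millis, 1000).cast(pa.duration('us'))       # milliseconds -> microseconds
table = table.set_column(0, 'time', pc.add(base, frac))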

So I have found that for timestamp data, you should just try to have the data in the default parser format (ISO 8601). For example, if you convert CSV data into Parquet using the pyarrow timestamp data type, just have the CSV data in this format:
No time zone:
YYYY-MM-DDTHH:MI:SS.FF6
With time zone:
YYYY-MM-DDTHH:MI:SS.FF6TZH:TZM
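As a quick sketch of what that looks like in practice (iso.csv is a hypothetical file whose time column is already ISO 8601), no timestamp_parsers option is needed at all because the default parser handles it:
import pyarrow as pa
from pyarrow import csv

# iso.csv (assumed contents):
# time,data
# 2019-11-01T10:11:56.132000,xxx
convert_options = csv.ConvertOptions(
    column_types={'time': pa.timestamp('us'), 'data': pa.string()}
)
table = csv.read_csv('iso.csv', convert_options=convert_options)
print(table)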

Related

Merging and/or Reading 88 JSON Files into Dataframe - different datatypes

I basically have a procedure where I make multiple calls to an API and, using a token within the JSON return, pass that back to a function to call the API again to get a "paginated" file.
In total I have to call and download 88 JSON files that total 758 MB. The JSON files are all formatted the same way and have the same "schema", or at least they should. I have tried reading each JSON file after it has been downloaded into a data frame, and then attempted to union that dataframe to a master dataframe, so essentially I'll have one big data frame with all 88 JSON files read into it.
However, the problem I encounter is that roughly on file 66 the system (Python/Databricks/Spark) decides to change the data type of a field. It is always a string, and then, I'm guessing, when a value actually appears in that field it changes to a boolean. The problem is then that the unionByName fails because of different datatypes.
What is the best way for me to resolve this? I thought about using "extend" to merge all the JSON files into one big file, however a 758 MB JSON file would be a huge read and undertaking.
Could the other solution be to explicitly set the schema that the JSON file is read into so that it is always the same type?
If you know the attributes of those files, you can define the schema before reading them and create an empty df with that schema, so you can do a unionByName with allowMissingColumns=True:
something like:
from pyspark.sql.types import *

my_schema = StructType([
    StructField('file_name', StringType(), True),
    StructField('id', LongType(), True),
    StructField('dataset_name', StringType(), True),
    StructField('snapshotdate', TimestampType(), True)
])

output = sqlContext.createDataFrame(sc.emptyRDD(), my_schema)
df_json = spark.read.[...your JSON file...]
output = output.unionByName(df_json, allowMissingColumns=True)
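Building on that, a rough sketch of the full loop over all 88 files (the json_file_paths list is hypothetical; my_schema and output are the variables defined above). Forcing the schema on every read keeps the string field from flipping to boolean:
# Hypothetical list of the 88 downloaded file paths.
json_file_paths = ["/mnt/raw/file_001.json", "/mnt/raw/file_002.json"]  # ...and so on

for path in json_file_paths:
    df_json = spark.read.schema(my_schema).json(path)
    output = output.unionByName(df_json, allowMissingColumns=True)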
I'm not sure if this is exactly what you are looking for, but I hope it helps.

How do I specify a dtype for all columns when reading a CSV file with pyarrow?

I want to read a big CSV file with pyarrow. All my columns are float64s, but pyarrow seems to be inferring int64.
How do I specify a dtype for all columns?
import gcsfs
import pyarrow.dataset as ds
fs = gcsfs.GCSFileSystem(project='my-google-cloud-project')
my_dataset = ds.dataset("bucket/foo/bar.csv", format="csv", filesystem=fs)
my_dataset.to_table()
which produces:
ArrowInvalid Traceback (most recent call last)
........py in <module>
----> 65 my_dataset.to_table()
File /opt/conda/envs/py39/lib/python3.9/site-packages/pyarrow/_dataset.pyx:491, in pyarrow._dataset.Dataset.to_table()
File /opt/conda/envs/py39/lib/python3.9/site-packages/pyarrow/_dataset.pyx:3235, in pyarrow._dataset.Scanner.to_table()
File /opt/conda/envs/py39/lib/python3.9/site-packages/pyarrow/error.pxi:143, in pyarrow.lib.pyarrow_internal_check_status()
File /opt/conda/envs/py39/lib/python3.9/site-packages/pyarrow/error.pxi:99, in pyarrow.lib.check_status()
ArrowInvalid: In CSV column #172: Row #28: CSV conversion error to int64: invalid value '6.58841482364418'
Pyarrow's dataset module reads CSV files in chunks (the default is 1MB, I think) and processes those chunks in parallel. This makes column inference a bit tricky, and it handles this by using the first chunk to infer data types. So the error you are getting is very common when the first chunk of the file has a column that looks integral but in later chunks the column has decimal values.
If you know the column names in advance then you can specify the data types of the columns:
import pyarrow as pa
import pyarrow.csv as csv
import pyarrow.dataset as ds
column_types = {'a': pa.float64(), 'b': pa.float64(), 'c': pa.float64()}
convert_options = csv.ConvertOptions(column_types=column_types)
custom_csv_format = ds.CsvFileFormat(convert_options=convert_options)
dataset = ds.dataset('/tmp/foo.csv', format=custom_csv_format)
If you don't know the column names then things are a bit trickier. However, it sounds like ALL columns are float64. In that case, since you only have one file, you can probably do something like this as a workaround:
dataset = ds.dataset('/tmp/foo.csv', format='csv')
column_types = {}
for field in dataset.schema:
    column_types[field.name] = pa.float64()
# Now use column_types as above
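For completeness, a short sketch of that second pass, reusing the imports and the column_types dict built in the loop above (the calls are the same as in the known-column-names example):
convert_options = csv.ConvertOptions(column_types=column_types)
custom_csv_format = ds.CsvFileFormat(convert_options=convert_options)
dataset = ds.dataset('/tmp/foo.csv', format=custom_csv_format)
table = dataset.to_table()  # all columns now read as float64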
This works, but we call ds.dataset(...) twice, so it will have a small bit of overhead: each time we call ds.dataset(...), pyarrow opens the first chunk of the first file in the dataset to determine the schema (which is why we can use dataset.schema).
If you have multiple files with different columns then this approach won't work. In that case I'd recommend emailing the Arrow user@ mailing list and we can have a more general discussion about different ways to solve the problem.

Date fields transformation from AWS Glue table to Redshift Spectrum external table

I am trying to transform a JSON dataset from S3, via the Glue table schema, into a Redshift Spectrum external table for data analysis. While creating the external tables, how do I transform the DATE fields?
Need to highlight that the source data is coming from MongoDB in ISODate format. Here is the Glue table format:
struct $date:string
I tried the following formats within the external table:
startDate:struct<$date:varchar(40)>
startDate:struct<date:varchar(40)>
startDate:struct<date:timestamp>
Is there a workaround within Redshift Spectrum or Glue to handle ISODate formats? Or is the recommendation to go back to the source and convert the ISODate format?
Assuming you are using Python in Glue, and assuming Python understands your field as a date, you could do something like:
from pyspark.sql.functions import date_format
from awsglue.dynamicframe import DynamicFrame
from awsglue.context import GlueContext

def out_date_format(to_format):
    """Formats the passed date into MM/dd/yyyy format."""
    return date_format(to_format, "MM/dd/yyyy")

# If you have a dynamic frame you will need to convert it to a dataframe first:
# dataframe = dynamic_frame.toDF()
dataframe = dataframe.withColumn("new_column_name", out_date_format("your_old_date_column_name"))

# Assuming you are outputting via Glue, you will need to convert the dataframe back into a dynamic frame:
# glue_context = GlueContext(spark_context)
# final = DynamicFrame.fromDF(dataframe, glue_context, "final")
Depending on how you are getting the data, there may be other options to use mapping or formatting.
If Python doesn't understand your field as a date object, you will need to parse it first, something like:
import dateutil.parser
from pyspark.sql.functions import udf

# and the convert would change to a UDF, so the parsing runs on each row's value:
@udf("string")
def out_date_format(to_format):
    """Parses the passed value and formats it into MM/dd/yyyy format."""
    return dateutil.parser.parse(to_format).strftime("%m/%d/%Y")
Note that if dateutil isn't built into Glue, you will need to add it to your job parameters with syntax like:
"--additional-python-modules" = "python-dateutil==2.8.1"

CSV data exported/copied to HDFS going in weird format

I am using a Spark job for reading CSV file data from a staging area and copying that data into HDFS using the following code:
val conf = new SparkConf().setAppName("WCRemoteReadHDFSWrite").set("spark.hadoop.validateOutputSpecs", "true");
val sc = new SparkContext(conf)
val rdd = sc.textFile(source)
rdd.saveAsTextFile(destination)
The CSV file has data in the following format:
CTId,C3UID,region,product,KeyWord
1,1004634181441040000,East,Mobile,NA
2,1004634181441040000,West,Tablet,NA
whereas when the data goes into HDFS it ends up in the following format:
CTId,C3UID,region,product,KeyWord
1,1.00463E+18,East,Mobile,NA
2,1.00463E+18,West,Tablet,NA
I am not able to find any valid reason behind this.
Any kind of help would be appreciated.
Regards,
Bhupesh
What happens is that because your C3UID is a large number, it gets parsed as a Double and is then saved in standard Double notation. You need to fix the schema and make sure you read the second column as Long, BigDecimal or String; then there will be no change in the string representation.
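A minimal sketch of that fix, written in PySpark to match the other examples in this thread (the Scala DataFrame reader takes the same options; the column names come from the sample header above, and the paths are placeholders):
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, LongType, StringType

spark = SparkSession.builder.appName("WCRemoteReadHDFSWrite").getOrCreate()

# Placeholder paths; use the same source/destination as in the Scala snippet above.
source = "file:///staging/input.csv"
destination = "hdfs:///data/output"

# Explicit schema so C3UID is never inferred as a Double.
csv_schema = StructType([
    StructField("CTId", IntegerType(), True),
    StructField("C3UID", LongType(), True),   # or StringType() to keep the value exactly as written
    StructField("region", StringType(), True),
    StructField("product", StringType(), True),
    StructField("KeyWord", StringType(), True)
])

df = spark.read.csv(source, header=True, schema=csv_schema)
df.write.csv(destination, header=True)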
Sometimes your CSV file itself could be the culprit. Do NOT open the CSV file in Excel, as Excel can convert those big numeric values into exponential format, and once you then use the Spark job for importing the data into HDFS, it is copied as-is in that string format.
So be very sure that your CSV data is never opened in Excel before importing it to HDFS with the Spark job. If you really want to see the contents of the file, use Notepad++ or any other text editor.

How to load a large CSV with many fields into Spark

Happy New Year!!!
I know this type of similar question has been asked/answered before; however, mine is different:
I have a large CSV with 100+ fields and 100MB+ in size, and I want to load it into Spark (1.6) for analysis. The CSV's header looks like the attached sample (only one line of the data).
Thank you very much.
UPDATE 1(2016.12.31.1:26pm EST):
I used the following approach and was able to load the data (sample data with limited columns); however, I need to automatically assign the header (from the CSV) as the field names in the DataFrame, BUT the DataFrame looks like:
Can anyone tell me how to do it? Note, any manual approach is what I want to avoid.
>>> import csv
>>> rdd = sc.textFile('file:///root/Downloads/data/flight201601short.csv')
>>> rdd = rdd.mapPartitions(lambda x: csv.reader(x))
>>> rdd.take(5)
>>> df = rdd.toDF()
>>> df.show(5)
As noted in the comments, you can use spark.read.csv for Spark 2.0.0+ (https://spark.apache.org/docs/2.0.0/api/python/pyspark.sql.html):
df = spark.read.csv('your_file.csv', header=True, inferSchema=True)
Setting header to True will parse the header into the column names of the DataFrame. Setting inferSchema to True will infer the table schema (but will slow down reading).
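A short usage sketch against the file from the update above (assumes a Spark 2.x SparkSession named spark):
df = spark.read.csv('file:///root/Downloads/data/flight201601short.csv',
                    header=True, inferSchema=True)
df.printSchema()   # column names now come from the CSV header
df.show(5)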
See also here:
Load CSV file with Spark