DirectoryPartitioning in pyarrow (Python)

I can read datasets with the pyarrow dataset API:
import pyarrow.dataset as ds
dataset = ds.dataset("abfs://test", format="parquet", partitioning="hive", filesystem=fs)
But how can I write a dataset with a different partitioning scheme? I can import DirectoryPartitioning, for example, but I cannot figure out a way to save the data so that it produces a directory structure like this:
import pyarrow as pa
from pyarrow.dataset import DirectoryPartitioning
partitioning = DirectoryPartitioning(pa.schema([("year", pa.int16()), ("month", pa.int8()), ("day", pa.int8())]))
print(partitioning.parse("/2009/11/3"))
Will we continue to use write_to_dataset to write Parquet files, or will there be a new method specific to the datasets class?
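For what it's worth, newer pyarrow releases expose ds.write_dataset(), which accepts a partitioning object directly; here is a minimal sketch, assuming a local base directory and an in-memory example table (both placeholders):
import pyarrow as pa
import pyarrow.dataset as ds
from pyarrow.dataset import DirectoryPartitioning

# Example table; the partition columns must exist as ordinary columns.
table = pa.table({
    "year": pa.array([2009, 2009], pa.int16()),
    "month": pa.array([11, 12], pa.int8()),
    "day": pa.array([3, 1], pa.int8()),
    "value": [1.0, 2.0],
})

partitioning = DirectoryPartitioning(
    pa.schema([("year", pa.int16()), ("month", pa.int8()), ("day", pa.int8())]))

# Files are laid out as <base_dir>/<year>/<month>/<day>/part-*.parquet
ds.write_dataset(table, "some/base/dir", format="parquet", partitioning=partitioning)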


Plotting specific column of GIS data using geopandas

I am trying to recreate a map of NY state geographical data using geopandas and the data available here: https://pubs.usgs.gov/of/2005/1325/#NY. I can plot the map but cannot figure out how to make use of the other files to plot their columns.
Any help would be appreciated.
What exactly are you trying to do?
Here's a quick setup you can use to download the Shapefiles from the site and have access to a GeoDataFrame:
import geopandas as gpd
from io import BytesIO
import requests as r
# Link to file
shp_link = 'https://pubs.usgs.gov/of/2005/1325/data/NYgeol_dd.zip'
# Downloading the file into memory
my_req = r.get(shp_link)
# Creating a file stream for GeoPandas
my_zip = BytesIO(my_req.content)
# Loading the data into a GeoDataFrame
my_geodata = gpd.read_file(my_zip)
# Printing all of the column names
for this_col in my_geodata.columns:
    print(this_col)
Now you can access the columns in my_geodata using square brackets. For example, if I want to access the data stored in the column called "SOURCE", I can just use my_geodata["SOURCE"].
Now it's just a matter of figuring out what exactly you want to do with that data.
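If the goal is to colour the map by one of those columns, GeoDataFrame.plot() accepts a column argument; a minimal sketch, assuming matplotlib is installed and reusing the "SOURCE" column from above:
import matplotlib.pyplot as plt

# Colour the polygons by the values in the "SOURCE" column and add a legend.
ax = my_geodata.plot(column="SOURCE", legend=True, figsize=(10, 8))
ax.set_title("NY geology by SOURCE")
plt.show()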

Can Pyarrow non-legacy parquet datasets read and write to Azure Blob? (legacy system and Dask are able to)

Is it possible to read a parquet dataset from Azure Blob using the new, non-legacy dataset API?
I can read and write to blob storage with the legacy system, where fs is an fsspec filesystem:
pq.write_to_dataset(table=table.replace_schema_metadata(),
                    root_path=path,
                    partition_cols=[
                        'year',
                        'month',
                    ],
                    filesystem=fs,
                    version='2.0',
                    flavor='spark',
                    )
With Dask, I am able to read the data using storage options:
ddf = dd.read_parquet(path='abfs://analytics/iag-cargo/zendesk/ticket-metric-events',
                      storage_options={
                          'account_name': base.login,
                          'account_key': base.password,
                      })
But when I try using
import pyarrow.dataset as ds
dataset = ds.dataset()
Or
dataset = pq.ParquetDataset(path_or_paths=path, filesystem=fs, use_legacy_dataset=False)
I run into errors about invalid filesystem URIs. I have tried every combination I could think of, but I cannot work out why Dask and the legacy system can read and write these files while the new dataset API cannot.
I'd like to test the row filtering and non-Hive partitioning.
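One thing worth trying (a sketch, not verified against this exact setup; adlfs and the credential placeholders are assumptions) is to build the fsspec filesystem explicitly, wrap it for pyarrow, and pass a plain container path rather than an abfs:// URI when the filesystem is supplied:
import adlfs
import pyarrow.dataset as ds
from pyarrow.fs import PyFileSystem, FSSpecHandler

# fsspec filesystem for Azure Blob (credentials are placeholders).
abfs = adlfs.AzureBlobFileSystem(account_name="<account>", account_key="<key>")

# Wrap the fsspec filesystem so the pyarrow dataset API can use it.
pa_fs = PyFileSystem(FSSpecHandler(abfs))

# Note the scheme-less path: the filesystem is passed explicitly, so the
# path should not be an abfs:// URI.
dataset = ds.dataset("analytics/iag-cargo/zendesk/ticket-metric-events",
                     format="parquet",
                     partitioning="hive",
                     filesystem=pa_fs)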

Date fields transformation from AWS Glue table to RedShift Spectrum external table

I am trying to expose a JSON dataset, stored in S3 and catalogued with a Glue table schema, as a Redshift Spectrum external table for data analysis. When creating the external table, how do I transform the DATE fields?
Note that the source data comes from MongoDB in ISODate format. Here is the Glue table format:
struct $date:string
I tried the following formats within the external table:
startDate:struct<$date:varchar(40)>
startDate:struct<date:varchar(40)>
startDate:struct<date:timestamp>
Is there a workaround within Redshift Spectrum or Glue to handle ISODate formats, or is the recommendation to go back to the source and convert the ISODate format?
Assuming you are using Python in Glue, and assuming Python understands your field as a date, you could do something like:
from pyspark.sql.functions import date_format
from awsglue.dynamicframe import DynamicFrame
from awsglue.context import GlueContext

def out_date_format(to_format):
    """Formats the passed date column into MM/dd/yyyy format."""
    return date_format(to_format, "MM/dd/yyyy")

# If you have a dynamic frame you will need to convert it to a dataframe first:
# dataframe = dynamic_frame.toDF()
dataframe = dataframe.withColumn("new_column_name", out_date_format("your_old_date_column_name"))

# Assuming you are outputting via Glue, you will need to convert the dataframe back into a dynamic frame:
# glue_context = GlueContext(spark_context)
# final = DynamicFrame.fromDF(dataframe, glue_context, "final")
Depending on how you are getting the data, there may be other options to use mapping or formatting.
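As one illustration of the mapping route (a sketch only; the nested "startDate.$date" path and the target column name are assumptions, not confirmed against this schema), Glue's ApplyMapping transform can rename and retype nested fields in a DynamicFrame:
from awsglue.transforms import ApplyMapping

# Map the nested startDate.$date string up to a top-level timestamp column.
# Whether Glue accepts the "$date" segment in a dotted path is an assumption.
mapped = ApplyMapping.apply(
    frame=dynamic_frame,
    mappings=[("startDate.$date", "string", "start_date", "timestamp")],
    transformation_ctx="mapped")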
If Python doesn't understand your field as a date object, you will need to parse it first. Since dateutil works on individual string values, wrap the conversion in a UDF so it runs per row, something like:
import dateutil.parser
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def out_date_format(to_format):
    """Parses the passed ISO date string and formats it as MM/dd/yyyy."""
    return dateutil.parser.parse(to_format).strftime("%m/%d/%Y")

out_date_format_udf = udf(out_date_format, StringType())
dataframe = dataframe.withColumn("new_column_name", out_date_format_udf("your_old_date_column_name"))
Note that if dateutil isn't built into your Glue environment, you will need to add it to your job parameters with syntax like:
"--additional-python-modules" = "python-dateutil==2.8.1"

running nested jobs in spark

I am using PySpark. I have a list of gzipped JSON files on S3 which I have to access, transform, and then export in Parquet to S3. Each JSON file contains around 100k lines, so parallelizing within a file won't make much sense (but I am open to it); there are, however, around 5k files, which I do need to parallelize over. My approach is: pass the JSON file list to the script -> run parallelize on the list -> run map (this is where I am getting blocked). How do I access and transform the JSON, create a DataFrame out of the transformed JSON, and dump it as Parquet into S3?
To read json in a distributed fashion, you will need to parallelize your keys as you mention. To do this while reading from s3, you'll need to use boto3. Below is a skeleton sketch of how to do so. You'll likely need to modify distributedJsonRead to fit your use case.
import boto3
import json
from pyspark.sql import Row

def distributedJsonRead(s3Key):
    s3obj = boto3.resource('s3').Object(bucket_name='bucketName', key=s3Key)
    contents = json.loads(s3obj.get()['Body'].read().decode('utf-8'))
    return Row(**contents)

pkeys = sc.parallelize(keyList)  # keyList is a list of s3 keys
dataRdd = pkeys.map(distributedJsonRead)
Boto3 Reference: http://boto3.readthedocs.org/en/latest/guide/quickstart.html
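From there, one way to get the merged data into Parquet on S3 (a sketch; the output path is a placeholder and sqlContext is assumed to be available) is to build a DataFrame from the RDD of Rows and write it out:
# Convert the RDD of Rows into a DataFrame and write it as one Parquet dataset.
dataDf = sqlContext.createDataFrame(dataRdd)
dataDf.write.parquet('s3n://bucketName/output/merged-parquet')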
Edit: to address the 1:1 mapping of input files to output files
Later on, it's likely that having a merged parquet data set would be easier to work with. But if this is the way you need to do it, you could try something like this
for k in keyList:
    rawtext = sqlContext.read.json(k)  # or whichever method you need to use to read in the data
    outpath = k[:-4] + 'parquet'
    rawtext.write.parquet(outpath)
I don't think you will be able to parallelize these operations if you want a 1:1 mapping of JSON to Parquet files. Spark's read/write functionality is designed to be called by the driver, and needs access to sc and sqlContext. This is another reason why having one Parquet directory is likely the way to go.

PySpark: How to Read Many JSON Files, Multiple Records Per File

I have a large dataset stored in a S3 bucket, but instead of being a single large file, it's composed of many (113K to be exact) individual JSON files, each of which contains 100-1000 observations. These observations aren't on the highest level, but require some navigation within each JSON to access.
i.e.
json["interactions"] is a list of dictionaries.
I'm trying to utilize Spark/PySpark (version 1.1.1) to parse through and reduce this data, but I can't figure out the right way to load it into an RDD, because it's neither all records in one file (in which case I'd use sc.textFile, though with the added complication here of JSON) nor each record in its own file (in which case I'd use sc.wholeTextFiles).
Is my best option to use sc.wholeTextFiles and then use a map (or in this case flatMap?) to pull the multiple observations from being stored under a single filename key to their own key? Or is there an easier way to do this that I'm missing?
I've seen answers here that suggest just using json.loads() on all files loaded via sc.textFile, but it doesn't seem like that would work for me because the JSONs aren't simple highest-level lists.
The previous answers are not going to read the files in a distributed fashion (see reference). To do so, you would need to parallelize the s3 keys and then read in the files during a flatMap step like below.
import boto3
import json
from pyspark.sql import Row

def distributedJsonRead(s3Key):
    s3obj = boto3.resource('s3').Object(bucket_name='bucketName', key=s3Key)
    contents = json.loads(s3obj.get()['Body'].read().decode('utf-8'))
    for dicts in contents['interactions']:
        yield Row(**dicts)

pkeys = sc.parallelize(keyList)  # keyList is a list of s3 keys
dataRdd = pkeys.flatMap(distributedJsonRead)
Boto3 Reference
What about using DataFrames?
Does
testFrame = sqlContext.read.json('s3n://<bucket>/<key>')
give you what you want from one file?
Does every observation have the same "columns" (# of keys)?
If so, you could use boto to list each object you want to add, read them in, and union them with each other.
from pyspark.sql import SQLContext
import boto3
sqlContext = SQLContext(sc)
s3 = boto3.resource('s3')
bucket = s3.Bucket('<bucket>')
aws_secret_access_key = '<secret>'
aws_access_key_id = '<key>'
# Configure spark with your S3 access keys
sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", aws_access_key_id)
sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", aws_secret_access_key)
object_list = [k for k in bucket.objects.all()]
key_list = [k.key for k in bucket.objects.all()]
paths = ['s3n://' + o.bucket_name + '/' + o.key for o in object_list]
dataframes = [sqlContext.read.json(path) for path in paths]
df = dataframes[0]
for frame in dataframes[1:]:
    df = df.unionAll(frame)
I'm new to spark myself so I'm wondering if there's a better way to use dataframes with a lot of s3 files, but so far this is working for me.
The name is misleading (because it's singular), but sparkContext.textFile() (at least in the Scala case) also accepts a directory name or a wildcard path, so you should just be able to say textFile("/my/dir/*.json").
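Putting that together with the structure described in the question, here is a sketch of the textFile route (assuming each file stores its whole JSON document on a single line; the bucket path is a placeholder):
import json

# Read every JSON file under the prefix; each RDD element is the text of one
# file, assuming one JSON document per line/file.
raw = sc.textFile("s3n://<bucket>/<prefix>/*.json")

# Parse each document and flatten out the nested "interactions" records.
interactions = raw.flatMap(lambda doc: json.loads(doc)["interactions"])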