Can Pyarrow non-legacy parquet datasets read and write to Azure Blob? (legacy system and Dask are able to) - pyarrow

Is it possible to read a parquet dataset from Azure Blob using the new non-legacy dataset API?
I can read and write to Blob Storage with the old system, where fs is an fsspec filesystem:
pq.write_to_dataset(table=table.replace_schema_metadata(),
                    root_path=path,
                    partition_cols=[
                        'year',
                        'month',
                    ],
                    filesystem=fs,
                    version='2.0',
                    flavor='spark',
                    )
With Dask, I am able to read the data using storage options:
ddf = dd.read_parquet(path='abfs://analytics/iag-cargo/zendesk/ticket-metric-events',
                      storage_options={
                          'account_name': base.login,
                          'account_key': base.password,
                      })
But when I try using
import pyarrow.dataset as ds
dataset = ds.dataset()
Or
dataset = pq.ParquetDataset(path_or_paths=path, filesystem=fs, use_legacy_dataset=False)
I run into errors about invalid filesystem URIs. I have tried every combination I could think of, and I can't figure out why Dask and the legacy system can read and write files while the new one can't.
I'd like to test the row filtering and non-Hive partitioning.
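For reference, one combination I still want to try is wrapping the fsspec filesystem in pyarrow's FSSpecHandler so the new dataset API sees a native-looking filesystem; a rough sketch (the container path and the year filter are just examples):
import adlfs
import pyarrow.dataset as ds
from pyarrow.fs import PyFileSystem, FSSpecHandler

# fsspec filesystem for Azure Blob, same credentials as in the Dask snippet
fs = adlfs.AzureBlobFileSystem(account_name=base.login, account_key=base.password)

# wrap the fsspec filesystem for the new dataset API
wrapped_fs = PyFileSystem(FSSpecHandler(fs))

# pass a plain path (no abfs:// scheme) because the filesystem is given explicitly
dataset = ds.dataset('analytics/iag-cargo/zendesk/ticket-metric-events',
                     format='parquet',
                     partitioning='hive',
                     filesystem=wrapped_fs)
table = dataset.to_table(filter=ds.field('year') == 2020)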

Related

DirectoryPartitioning in pyarrow (python)

dataset = ds.dataset("abfs://test", format="parquet", partitioning="hive", filesystem=fs)
I can read datasets with the pyarrow dataset feature, but how can I write to a dataset with a different schema?
I seem to be able to import DirectoryPartitioning, for example, but I cannot figure out a way to save the data with a partition layout like this:
import pyarrow as pa
from pyarrow.dataset import DirectoryPartitioning

partitioning = DirectoryPartitioning(pa.schema([("year", pa.int16()), ("month", pa.int8()), ("day", pa.int8())]))
print(partitioning.parse("/2009/11/3"))
Will we continue to use write_to_dataset to write Parquet files, or will there be a new method specific to the datasets class?
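For the writing side, a rough sketch of what I have in mind using the newer ds.write_dataset, assuming a pyarrow Table named table and the filesystem fs from above (I have not confirmed this is the intended replacement for write_to_dataset):
import pyarrow as pa
import pyarrow.dataset as ds
from pyarrow.dataset import DirectoryPartitioning

partitioning = DirectoryPartitioning(pa.schema([("year", pa.int16()),
                                                ("month", pa.int8()),
                                                ("day", pa.int8())]))

# table must contain year/month/day columns matching the partitioning schema
ds.write_dataset(table, "test/output",
                 format="parquet",
                 partitioning=partitioning,
                 filesystem=fs)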

Move JSON Data from DocumentDB (or CosmosDB) to Azure Data Lake

I have a lot of JSON files (millions of them) in Cosmos DB (formerly called Document DB) and I want to move them into Azure Data Lake for cold storage.
I found this document https://learn.microsoft.com/en-us/dotnet/api/microsoft.azure.documents.client.documentclient.readdocumentfeedasync?view=azure-dotnet but it doesn't have any samples to start with.
How should I proceed? Any code samples are highly appreciated.
Thanks.
I suggest using Azure Data Factory to implement your requirement.
Please refer to this doc on how to export JSON documents from Cosmos DB and this doc on how to import data into Azure Data Lake.
Hope it helps you.
Updated answer:
Please refer to this: Azure Cosmos DB as source; you can create the query in the pipeline.
Yeah, the change feed will do the trick.
You have two options. The first one (which is probably what you want in this case) is to use it via the SDK.
Microsoft has a detailed page on how to do this, including code examples, here: https://learn.microsoft.com/en-us/azure/cosmos-db/change-feed#rest-apis
The second one is the Change Feed Library, which allows you to have a service running at all times, listening for changes and processing them based on your needs. More details, with code examples of the Change Feed Library, here: https://learn.microsoft.com/en-us/azure/cosmos-db/change-feed#change-feed-processor
(Both pages, which are really the same page, just different sections, contain a link to a Microsoft GitHub repo with code examples.)
Keep in mind you will still be charged for using this in terms of RU/s, but from what I've seen it is relatively low (or at least lower than what you'd pay if you started reading the collections themselves).
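If you go the SDK route from Python, the newer azure-cosmos package exposes the change feed on the container client. A rough sketch (the endpoint, key, database/collection names, and the upload helper are placeholders, not taken from the pages above):
from azure.cosmos import CosmosClient

client = CosmosClient("<endpoint>", credential="<key>")
container = client.get_database_client("<database>").get_container_client("<collection>")

# iterate the change feed from the beginning and hand each document to your sink
for doc in container.query_items_change_feed(is_start_from_beginning=True):
    upload_to_data_lake(doc)  # hypothetical helper that writes the JSON document to ADLS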
You could also read the change feed via Spark. The following Python code example generates Parquet files, partitioned by ingest date, for the changed data. It works in an Azure Databricks notebook on a daily schedule:
# Get DB secrets
endpoint = dbutils.preview.secret.get(scope = "cosmosdb", key = "endpoint")
masterkey = dbutils.preview.secret.get(scope = "cosmosdb", key = "masterkey")
# database & collection
database = "<yourdatabase>"
collection = "<yourcollection>"
# Configs
dbConfig = {
    "Endpoint" : endpoint,
    "Masterkey" : masterkey,
    "Database" : database,
    "Collection" : collection,
    "ReadChangeFeed" : "True",
    "ChangeFeedQueryName" : database + collection + " ",
    "ChangeFeedStartFromTheBeginning" : "False",
    "ChangeFeedUseNextToken" : "True",
    "RollingChangeFeed" : "False",
    "ChangeFeedCheckpointLocation" : "/tmp/changefeedcheckpointlocation",
    "SamplingRatio" : "1.0"
}
# Connect via Spark connector to create Spark DataFrame
df = spark.read.format("com.microsoft.azure.cosmosdb.spark").options(**dbConfig).load()
# set partition to current date
import datetime
from pyspark.sql.functions import lit
partition_day= datetime.date.today()
partition_datetime=datetime.datetime.now().isoformat()
# new dataframe with ingest date (=partition key)
df_part= df.withColumn("ingest_date", lit(partition_day))
# write parquet files partitioned by ingest date
df_part.write.partitionBy('ingest_date').mode('append').parquet('dir')
You could also use a Logic App with a Timer trigger. This would be a no-code solution:
Query for documents
Loop through documents
Add to Data Lake
The advantage is that you can apply any rules before sending to Data Lake.

running nested jobs in spark

I am using PySpark. I have a list of gzipped JSON files on S3 which I have to access, transform, and then export in Parquet to S3. Each JSON file contains around 100k lines, so parallelizing within a file won't make much sense (but I am open to it); however, there are around 5k files, which I do have to parallelize over. My approach is: pass the JSON file list to the script -> run parallelize on the list -> run map (this is where I am getting blocked). How do I access and transform the JSON, create a DataFrame out of the transformed JSON, and dump it as Parquet into S3?
To read JSON in a distributed fashion, you will need to parallelize your keys, as you mention. To do this while reading from S3, you'll need to use boto3. Below is a skeleton sketch of how to do so. You'll likely need to modify distributedJsonRead to fit your use case.
import boto3
import json
from pyspark.sql import Row

def distributedJsonRead(s3Key):
    # fetch the object for this key and parse its JSON body into a Row
    s3obj = boto3.resource('s3').Object(bucket_name='bucketName', key=s3Key)
    contents = json.loads(s3obj.get()['Body'].read().decode('utf-8'))
    return Row(**contents)

pkeys = sc.parallelize(keyList)  # keyList is a list of s3 keys
dataRdd = pkeys.map(distributedJsonRead)
Boto3 Reference: http://boto3.readthedocs.org/en/latest/guide/quickstart.html
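From there, one way to finish the pipeline the question describes is to turn the Row RDD into a DataFrame and dump it to S3 as Parquet; a rough sketch (sqlContext and the output path are assumed):
# build a DataFrame from the Row RDD and write a single merged parquet dataset to S3
dataDf = sqlContext.createDataFrame(dataRdd)
dataDf.write.parquet('s3n://bucketName/transformed-parquet/')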
Edit: to address the 1:1 mapping of input files to output files
Later on, it's likely that having a merged parquet data set would be easier to work with. But if this is the way you need to do it, you could try something like this:
for k in keyList:
    rawtext = sqlContext.read.json(k)  # or whichever method you need to use to read in the data
    outpath = k[:-4] + 'parquet'
    rawtext.write.parquet(outpath)
I don't think you will be able to parallelize these operations if you want a 1:1 mapping of JSON to Parquet files. Spark's read/write functionality is designed to be called by the driver and needs access to sc and sqlContext. This is another reason why having one Parquet directory is likely the way to go.

Recommended ways to load large csv to RDB like mysql

Aim: Build a small ETL framework to take a huge CSV and dump it into an RDB (say MySQL).
The current approach we are thinking about is to load the CSV into a DataFrame using Spark, persist it, and later use a framework like Apache Sqoop to load it into MySQL.
We need recommendations on which format to persist in, and on the approach itself.
Edit:
The CSV will have around 50 million rows with 50-100 columns.
Since our tasks involve lots of transformations before dumping into the RDB, we thought using Spark was a good idea.
Spark SQL supports writing to an RDB directly. You can load your huge CSV as a DataFrame, transform it, and call the API below to save it to the database.
Please refer to the API below:
org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils

def saveTable(df: DataFrame,
              url: String,
              table: String,
              properties: Properties): Unit
Saves the RDD to the database in a single transaction.
Example Code:
val url: String = "jdbc:oracle:thin:@your_domain:1521/dbname"
val driver: String = "oracle.jdbc.OracleDriver"
val props = new java.util.Properties()
props.setProperty("user", "username")
props.setProperty("password", "userpassword")
org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils.saveTable(dataFrame, url, "table_name", props)
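Note that JdbcUtils.saveTable is an internal API and may change between Spark versions. A rough equivalent through the public DataFrameWriter, sketched in PySpark (the host, credentials, and paths are placeholders, and the MySQL JDBC driver jar must be available to Spark):
# load the huge CSV, transform it, then write it to MySQL through the public JDBC writer
df = (spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("/path/to/huge.csv"))

# ... apply your transformations here ...

df.write.jdbc(url="jdbc:mysql://your_host:3306/dbname",
              table="table_name",
              mode="append",
              properties={"user": "username",
                          "password": "userpassword",
                          "driver": "com.mysql.jdbc.Driver"})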

Persisting RDD on Amazon S3

I have a large text file containing JSON objects on Amazon S3. I am planning to process this data using Spark on Amazon EMR.
Here are my questions:
How do I load the text file containing JSON objects into Spark?
Is it possible to persist the internal RDD representation of this data on S3 after the EMR cluster is turned off?
If I am able to persist the RDD representation, is it possible to directly load the data in RDD format next time I need to analyze the same data?
This should cover #1, as long as you're using pyspark:
#Configure spark with your S3 access keys
sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", "MY-ACCESS-KEY")
sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", "MY-SECRET-ACCESS-KEY")
#Retrieve the data
my_data = sc.textFile("s3n://my-bucket-name/my-key")
my_data.count() #Count all rows
my_data.take(20) #Take the first 20 rows
#Parse it
import json
my_data.map(lambda x: json.loads(x)).take(20) #Take the first 20 rows of json-parsed content
Note that the S3 address is s3n://, not s3://. This is a legacy thing from Hadoop.
Also, my-key can point to a whole S3 directory*. If you're using a Spark cluster, importing several medium-sized files is usually faster than a single big one.
For #2 and #3, I'd suggest looking up Spark's Parquet support. You can also save text back to S3:
my_data.map(lambda x: json.dumps(x)).saveAsTextFile('s3://my-bucket-name/my-new-key')
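For the Parquet route specifically, a rough sketch along the same lines (it assumes the parsed JSON from above and a placeholder output key):
from pyspark.sql import SQLContext, Row
sqlContext = SQLContext(sc)

#Parse each line into a Row, build a DataFrame, and write it out as parquet
parsed = my_data.map(lambda x: Row(**json.loads(x)))
df = sqlContext.createDataFrame(parsed)
df.write.parquet('s3n://my-bucket-name/my-parquet-key')

#Next time, load it straight back as a DataFrame
df2 = sqlContext.read.parquet('s3n://my-bucket-name/my-parquet-key')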
Not knowing the size of your dataset and the computational complexity of your pipeline, I can't say which way of storing intermediate data to S3 will be the best use of your resources.
*S3 doesn't really have directories, but you know what I mean.