Azure Function in Python: get schema of parquet file

Is it possible to get the schema of a parquet file using an Azure Function in Python without downloading the file from the data lake? I'm using BlobServiceClient to connect to the data lake and list the containers and files, but I have no idea how to run the command with, for example, pyarrow.
About pyarrow: https://arrow.apache.org/docs/python/parquet.html
BlobServiceClient: https://learn.microsoft.com/en-us/azure/storage/blobs/storage-quickstart-blobs-python-legacy

Regarding the issue, please refer to the following script:
import io

import pyarrow.parquet as pq
from azure.storage.blob import BlobServiceClient

blob_service_client = BlobServiceClient.from_connection_string(conn_str)
container_client = blob_service_client.get_container_client('test')
blob_client = container_client.get_blob_client('test.parquet')
with io.BytesIO() as f:
    # Download the whole blob into the in-memory buffer
    download_stream = blob_client.download_blob(0)
    download_stream.readinto(f)
    # Rewind the buffer before handing it to pyarrow
    f.seek(0)
    schema = pq.read_schema(f)
print(schema)

It is possible to read both the parquet schema and the parquet metadata without reading the full file content, because read_schema and read_metadata only read the file's footer:
import pyarrow.parquet as pq

fname = 'filename.parquet'
meta = pq.read_metadata(fname)    # row counts, row groups, column statistics
schema = pq.read_schema(fname)    # column names and types
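Combining the two ideas, a hedged sketch that avoids pulling the whole blob down first: adlfs (an assumption; it is a separate fsspec-based package, not part of azure-storage-blob) exposes the blob as a seekable file, so pyarrow can fetch just the footer bytes it needs.
import pyarrow.parquet as pq
from adlfs import AzureBlobFileSystem

# Assumes the same connection string and 'test' container as above
fs = AzureBlobFileSystem(connection_string=conn_str)
with fs.open('test/test.parquet', 'rb') as f:
    schema = pq.read_schema(f)  # seeks to the footer instead of reading everything
print(schema)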

Related

python api json dict in dataframe

I want to scrape data at the county level from https://apidocs.covidactnow.org
However, I could only get a dataframe with one row per county, where the data for each date is stored in a dictionary inside that row. I would like to access this data and store it in long format (i.e., one row per county-date).
import os

import pandas as pd
import requests

if __name__ == '__main__':
    os.chdir('/home/username/Desktop/')
    url = 'https://api.covidactnow.org/v2/counties.timeseries.json?apiKey=ENTER_YOUR_KEY'
    response = requests.get(url).json()
    data = pd.DataFrame(response)
This seems like a trivial question, but I've tried for hours. What would be the best way to achieve this?
Do you mean something like this?
import requests

url = 'https://api.covidactnow.org/v2/states.timeseries.csv?apiKey=YOURAPIKEY'
response = requests.get(url)
csv_response = response.text
# Then you can transform the string to CSV
Check this for string to CSV --> python parsing string to csv format
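Alternatively, staying with the JSON endpoint from the question, a hedged sketch using pandas.json_normalize (pandas 1.0+) to get one row per county-date. The key names 'actualsTimeseries', 'fips', 'county', and 'state' are assumptions about the payload shape; adjust them to the actual API schema.
import pandas as pd
import requests

url = 'https://api.covidactnow.org/v2/counties.timeseries.json?apiKey=ENTER_YOUR_KEY'
response = requests.get(url).json()

long_df = pd.json_normalize(
    response,
    record_path='actualsTimeseries',   # assumed key: one record per date
    meta=['fips', 'county', 'state'],  # assumed county-level columns, repeated per row
)
print(long_df.head())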

using csv as a value python

I'm still green in programming and trying to adjust and learn Python, but I am struggling with reading a CSV file and using the content of the file as a value.
I have looked and googled to death, and all the solutions put out
['Deon:app2018:value:1538402685271'] or a vertical result.
example:
session = file content of csv
Here is the closest I got.
Code:
import urllib.request
import csv

# Raw string: otherwise '\t' in the path is read as a tab character
with open(r'F:\test\session\main\data\credentials\session_id.csv', 'r') as file:
    session_ID = csv.reader(file)
    for row in session_ID:
        session = "".join(row)
        print(session)
url = 'http://webrates.app.com/rates/connect.html?id=' + session
print(url)
What I get
Deon:app2018:value:1538402685271
http://webrates.app.com/rates/connect.html?id=
What I want
Deon:app2018:value:1538402685271
http://webrates.app.com/rates/connect.html?id=Deon:app2018:value:1538402685271
Kind Regards
Deon
After lots of trial and error,
this solved my problem:
# Raw string path again; strip the trailing newline instead of appending one
with open(r'F:\test\session\main\data\credentials\session_id.csv', 'r') as file:
    id = file.read().strip()
url = 'http://webrates.app.com/rates/connect.html?id=' + id
print(url)
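One hedged refinement (an addition, not part of the original answer): if the session string can ever contain characters that are unsafe in a query string, URL-encode it with the standard library before appending.
from urllib.parse import quote

# quote() percent-encodes reserved characters; safe='' also encodes ':' and '/'
url = 'http://webrates.app.com/rates/connect.html?id=' + quote(id, safe='')
print(url)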

Not able to read streaming files using Spark structured streaming

I have a set of CSV files which need to be read through Spark structured streaming. After creating a DataFrame, I need to load it into a Hive table.
When a file is already present before running the code through spark-submit, the data is loaded into Hive successfully. But when I add new CSV files at runtime, they are not inserted into Hive at all.
Code is:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.streaming.OutputMode

val spark = SparkSession.builder().appName("Spark SQL Example")
  .config("hive.metastore.uris", "thrift://hostname:port")
  .enableHiveSupport()
  .config("hive.exec.dynamic.partition", "true")
  .config("hive.exec.dynamic.partition.mode", "nonstrict")
  .getOrCreate()
spark.conf.set("spark.sql.streaming.schemaInference", true)
import spark.implicits._

val df = spark.readStream.option("header", true).csv("file:///folder path/")
val query = df.writeStream.queryName("tab").format("memory")
  .outputMode(OutputMode.Append()).start()
spark.sql("insert into hivetab select * from tab").show()
query.awaitTermination()
Am I missing out something here?
Any suggestions would be helpful.
Thanks
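The memory sink plus a one-time INSERT only runs once at startup, which matches the behavior you see. A hedged sketch of one common fix (an assumption, not from the thread), shown here in PySpark and assuming Spark 2.4+: write each micro-batch into the Hive table with foreachBatch.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("Spark SQL Example")
         .config("hive.metastore.uris", "thrift://hostname:port")
         .enableHiveSupport()
         .getOrCreate())
spark.conf.set("spark.sql.streaming.schemaInference", True)

df = spark.readStream.option("header", True).csv("file:///folder path/")

def write_batch(batch_df, batch_id):
    # Each micro-batch arrives as a static DataFrame, so the normal
    # batch writer can append it to the existing Hive table
    batch_df.write.mode("append").insertInto("hivetab")

query = (df.writeStream
         .option("checkpointLocation", "file:///tmp/hivetab_ckpt")  # assumed path
         .foreachBatch(write_batch)
         .start())
query.awaitTermination()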

How to save JSON data fetched from URL in PySpark?

I have fetched some .json data from an API.
import urllib2
test=urllib2.urlopen('url')
print test
How can I save it as a table or data frame? I am using Spark 2.0.
This is how I succeeded in importing .json data from the web into a df:
from urllib.request import urlopen

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
url = 'https://web.url'
jsonData = urlopen(url).read().decode('utf-8')
# Parallelize the JSON string into an RDD so spark.read.json can parse it
rdd = spark.sparkContext.parallelize([jsonData])
df = spark.read.json(rdd)
For this you can do some research and try using sqlContext. Here is sample code:
>>> df2 = sqlContext.jsonRDD(test)
>>> df2.first()
Moreover, visit the link below and check for more details:
https://spark.apache.org/docs/1.6.2/api/python/pyspark.sql.html
Adding to Rakesh Kumar's answer, the way to do it in Spark 2.0 is:
http://spark.apache.org/docs/2.1.0/sql-programming-guide.html#data-sources
As an example, the following creates a DataFrame based on the content of a JSON file:
# spark is an existing SparkSession
df = spark.read.json("examples/src/main/resources/people.json")
# Displays the content of the DataFrame to stdout
df.show()
Note that the file that is offered as a json file is not a typical JSON file. Each line must contain a separate, self-contained valid JSON object. For more information, please see JSON Lines text format, also called newline-delimited JSON. As a consequence, a regular multi-line JSON file will most often fail.
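If the file really is a regular multi-line JSON document, a hedged workaround (assuming Spark 2.2+) is the multiLine reader option:
# multiLine tells Spark to treat the whole file as one JSON document
# rather than expecting one JSON object per line
df = spark.read.option("multiLine", True).json("examples/src/main/resources/people.json")
df.show()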
from pyspark import SparkFiles
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Project").getOrCreate()
# Click on "raw" in GitHub and then copy the URL
zip_url = "https://raw.githubusercontent.com/spark-examples/spark-scala-examples/master/src/main/resources/zipcodes.json"
spark.sparkContext.addFile(zip_url)
# addFile downloads the file to every node; SparkFiles.get resolves the local path
zip_df = spark.read.json("file://" + SparkFiles.get("zipcodes.json"))

django upload file using JSON

I have a JSON request like this:
object: { "fields":{ "src" : "http://dss.com/a.jpg", "data" : " //file is here" } }
I have a model like this:
class FileMy(models.Model):
    f = models.FileField(upload_to='file_path/')
How do I save the file?
You may use urllib to read the file and then you can add it to your model.
Take a look at this post:
Django: add image in an ImageField from image url
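A hedged sketch of that approach (the urlopen call, filename, and helper name are illustrative assumptions): fetch the bytes from the "src" URL and attach them to the FileField via ContentFile.
from urllib.request import urlopen

from django.core.files.base import ContentFile

from .models import FileMy

def save_from_url(src_url):
    data = urlopen(src_url).read()  # raw bytes from the remote file
    m = FileMy()
    # save=True also persists the model instance
    m.f.save('downloaded.jpg', ContentFile(data), save=True)
    return m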
You may be able to wrap the data in a ContentFile which inherits from File and then save the file to the model directly.
from __future__ import unicode_literals
from django.core.files.base import ContentFile
from .models import FileMy

# ContentFile accepts either text or bytes
f1 = ContentFile("esta sentencia está en español")
f2 = ContentFile(b"these are bytes")

m1 = FileMy()
m2 = FileMy()
m1.f.save("filename", f1, save=True)
m2.f.save("filename", f2, save=True)
First of all, get the raw file data out of the JSON request body; binary content will typically arrive base64-encoded there.
from tempfile import NamedTemporaryFile
from django.core.files import File

def save_file_to_field(field, file_name, raw_content):
    # field: reference to the model object instance field
    img_temp = NamedTemporaryFile(delete=True)
    img_temp.write(raw_content)
    # Flush pending writes before Django reads the file back
    img_temp.flush()
    field.save(
        file_name,
        File(img_temp)
    )
What does this do:
creates a temporary file on your system that holds the data
uses the file field save method to trigger the usual file handling
deletes the temporary file
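A hedged usage sketch tying it together (the view name, filename, and the assumption that "data" is base64-encoded are all illustrative):
import base64
import json

from django.http import JsonResponse

from .models import FileMy

def upload_view(request):
    payload = json.loads(request.body)
    # Assumption: the client base64-encodes the file bytes into "data"
    raw = base64.b64decode(payload['fields']['data'])
    m = FileMy()
    save_file_to_field(m.f, 'upload.bin', raw)
    return JsonResponse({'src': m.f.url})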