I have PySpark code that reads three JSON files into DataFrames, registers the DataFrames as temporary tables, and runs SQL queries against them.
from pyspark.sql import SparkSession, SQLContext
spark = SparkSession \
    .builder \
    .appName("project") \
    .getOrCreate()
sc = spark.sparkContext
sqlContext = SQLContext(sc)
reviewFile = sqlContext.read.json("review.json")
usersFile = sqlContext.read.json("user.json")
businessFile = sqlContext.read.json("business.json")
reviewFile.createOrReplaceTempView("review")
usersFile.createOrReplaceTempView("user")
businessFile.createOrReplaceTempView("business")
review_user = spark.sql("select r.review_id,r.user_id,r.business_id,r.stars,r.date,u.name,u.review_count,u.yelping_since from (review r join user u on r.user_id = u.user_id)")
review_user.createOrReplaceTempView("review_user")
review_user_business= spark.sql("select r.review_id,r.user_id,r.business_id,r.stars,r.date,r.name,r.review_count,r.yelping_since,b.address,b.categories,b.city,b.latitude,b.longitude,b.name,b.neighborhood,b.postal_code,b.review_count,b.stars,b.state from review_user r join business b on r.business_id= b.business_id")
review_user_business.createOrReplaceTempView("review_user_business")
#categories= spark.sql("select distinct(categories) from review_user_business")
categories= spark.sql("select distinct(r.categories) from review_user_business r where 'Food' in r.categories")
categories.show(50)
You can find the description of the data at the link below:
https://www.yelp.com/dataset/documentation/json
What I'm trying to do is get the rows that have 'Food' as one of their categories.
Can someone help me with it?
When using the expression A in B in PySpark, A should be a column object, not a constant value.
What you are looking for is array_contains:
categories= spark.sql("select distinct(r.categories) from review_user_business r \
where array_contains(r.categories, 'Food')")
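If you prefer the DataFrame API over raw SQL, the same filter can be expressed with the array_contains function from pyspark.sql.functions; a minimal sketch, reusing the review_user_business DataFrame built in the question:

from pyspark.sql.functions import array_contains

# Equivalent DataFrame-API filter (assumes `categories` is an array column)
food_rows = review_user_business.where(array_contains("categories", "Food"))
food_rows.select("categories").distinct().show(50, truncate=False)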
Related
I am trying to write data into MySQL using Spark. For this I am using foreachBatch, but it is not working. I am doing this in the spark-shell. Below is the complete code.
spark-shell --driver-class-path mysql-connector-java-5.1.36-bin.jar --jars mysql-connector-java-5.1.36-bin.jar
import org.apache.commons.lang3.StringUtils
import org.apache.spark.SparkContext
import org.apache.spark.sql.streaming.{OutputMode, StreamingQuery}
import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}
import org.apache.spark.sql.functions.{window, column, col, desc}
val staticDataFrame=spark.read.format("csv").option("inferSchema","true").option("header","true").load("/user/ajeet20028137/by-day/*.csv")
staticDataFrame.createOrReplaceTempView("retail_data")
val staticSchema = staticDataFrame.schema
val streamingDataFrame=spark.readStream.schema(staticSchema).option("maxFilesPerTrigger",1).format("csv").option("header","true").load("/user/ajeet20028137/by-day/*.csv")
val purchaseByCustomerPerHour=streamingDataFrame.selectExpr("customerid","unitprice * quantity as total_cost","invoiceDate").groupBy(col("customerid"), window(col("invoiceDate"),"1 day")).sum("total_cost")
val query = purchaseByCustomerPerHour.writeStream.outputMode(OutputMode.Complete()).foreachBatch((batchDF:DataFrame,batchId:Long)=>{batchDF.coalesce(1).write.mode(SaveMode.Overwrite).format("jdbc").option("url","jdbc:mysql://cxln2.c.thelab-240901.internal/retail_db").option("dbtable","purchasebycustomerperday").option("user","sqoopuser").option("password","password").save()}).start()
I have a big JSON file that I want to use in Spark Structured Streaming. I don't want to re-type this JSON as a Spark schema expression manually. Can I do this automatically once?
I wrote this:
from pyspark.sql import SparkSession
spark = SparkSession \
    .builder \
    .appName("Infer Schema") \
    .getOrCreate()

df = spark \
    .read \
    .option("multiline", True) \
    .json("file_examples/dataflow/row01.json")
df.printSchema()
df.show()
with open("dataflow_schema.json", "w") as fp:
fp.write(df.schema.json())
Is this ok?
You are on the right path. You can save your schema as JSON and load it later. Be sure to parse the JSON and convert it back to a StructType before use:
import json
from pyspark.sql.types import StructType
with open("dataflow_schema.json", "r") as fp:
json_schema_str = fp.read()
my_schema = StructType.fromJson(json.loads(json_schema_str))
In your Structured Streaming query, if you have a JSON column you can use the from_json function to convert the JSON into a struct type, and eventually into several columns, e.g.:
from pyspark.sql.functions import from_json,col
# Assume that we have a kafkaStream
kafkaStream.selectExpr("CAST(value as string)") \
    .select(from_json(col("value"), my_schema).alias("json_value")) \
    .selectExpr("json_value.*")  # extract as columns
I know 2 ways to import a CSV file in PySpark:
1) I can use SparkSession. Here is my full code in Jupyter Notebook.
from pyspark import SparkContext
sc = SparkContext()
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('Spark Session 1').getOrCreate()
df = spark.read.csv('mtcars.csv', header = True)
2) I can use the Spark-CSV module from Databricks.
from pyspark import SparkContext
sc = SparkContext()
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
df = sqlContext.read.format('com.databricks.spark.csv').options(header = 'true', inferschema = 'true').load('mtcars.csv')
1) What are the advantages of SparkSession over Spark-CSV?
2) What are the advantages of Spark-CSV over SparkSession?
3) If SparkSession is perfectly capable of importing CSV files, why did Databricks invent the Spark-CSV module?
Let me answer the 3rd question first: since Spark 2.0.0, CSV support is built in, but in older versions of Spark you had to use the spark-csv library. Databricks created spark-csv in the early days (Spark 1.3+).
To address your 1st and 2nd questions:
It is essentially a Spark 1.6 vs. 2.0+ comparison. If you use SparkSession you get every feature provided by spark-csv plus the Spark 2.0 features; if you stick with spark-csv you lose those features.
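In other words, on Spark 2.0+ the built-in reader already does what the spark-csv package did; a quick sketch of both spellings:

# Built-in CSV reader on Spark 2.0+ (no external package needed)
df = spark.read.csv('mtcars.csv', header=True, inferSchema=True)

# Equivalent long form
df = spark.read.format('csv') \
    .option('header', 'true') \
    .option('inferSchema', 'true') \
    .load('mtcars.csv')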
Hope this helps.
I am new to Python, and for my thesis work I am trying to convert JSON to CSV. I am able to download the data as JSON, but when I write it back out using dictionaries it does not convert the JSON to CSV with every column.
import pandas as pd
import statsmodels.formula.api as smf
import statsmodels.api as sm
import matplotlib.pyplot as plt
import numpy as np
import requests
from pprint import pprint
import csv
from time import sleep
s1 = 'https://fantasy.premierleague.com/drf/element-summary/'
print(s1)
players = []
for player_link in range(1, 450, 1):
    link = s1 + str(player_link)
    print(link)
    r = requests.get(link)
    print(r)
    player = r.json()
    players.append(player)
    sleep(1)

with open(r'C:\Users\dell\Downloads\players_new2.csv', 'w') as f:  # Just use 'w' mode in 3.x
    w = csv.DictWriter(f, player.keys())
    w.writeheader()
    for player in players:
        w.writerow(player)
I have uploaded the expected output (dec_15_expected.csv) and the program output with the file name "player_new_wrong_output.csv":
https://drive.google.com/drive/folders/0BwKYmRU_0K6tZUljd3Q0aG1LT0U?usp=sharing
It would be a great help if someone could tell me what I am doing wrong.
Converting JSON to CSV is simple with pandas. Try this:
import pandas as pd
df=pd.read_json("input.json")
df.to_csv('output.csv')
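Since the loop in the question already collects the responses into the players list, you could also build the DataFrame directly from that list of dicts and skip the intermediate file; a minimal sketch (nested fields may still need flattening):

import pandas as pd

# Build the DataFrame straight from the list of dicts collected in the loop
df = pd.DataFrame(players)
df.to_csv('output.csv', index=False)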
I have a huge dataset with multiple tables. Each table is split into hundreds of csv.gz files, and I need to import them into Spark through PySpark. Any idea how to import these csv.gz files into Spark? Does SparkContext or SparkSession provide a function to import this type of file?
You can import gzipped csv files natively using spark.read.csv():
from pyspark.sql import SparkSession
spark = SparkSession \
    .builder \
    .appName("stackOverflow") \
    .getOrCreate()
fpath1 = "file1.csv.gz"
DF = spark.read.csv(fpath1, header=True)
where DF is a spark DataFrame.
You can read from multiple files by feeding in a list of files:
fpath1 = "file1.csv.gz"
fpath2 = "file2.csv.gz"
DF = spark.read.csv([fpath1, fpath2], header=True)
You can also create a "temporary view" allowing for SQL queries:
fpath1 = "file1.csv.gz"
fpath2 = "file2.csv.gz"
DF = spark.read.csv([fpath1, fpath2], header=True)
DF.createOrReplaceTempView("table_name")
DFres = spark.sql("SELECT * FROM table_name")
where DFres is a spark DataFrame generated from the query.
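For hundreds of files per table it is usually easier to point the reader at a directory or glob pattern instead of listing every file; a short sketch (the path below is hypothetical):

# Read every compressed CSV for one table in a single call (hypothetical path)
DF = spark.read.csv("data/table1/*.csv.gz", header=True)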