I'm trying to create a DataFrame from JSON strings stored in a text file. First I merge the JValues:
// json4s imports (assuming the jackson backend)
import org.json4s._
import org.json4s.jackson.JsonMethods.{compact, render}

def mergeSales(salesArray: JArray, metrics: JValue): List[String] = {
  salesArray.children
    .map(sale => sale merge metrics)
    .map(merged => compact(render(merged)))
}
Then I write the strings to a file:
out.write(mergedSales.flatMap(s => s getBytes "UTF-8").toArray)
The data in the resulting file looks like this; there are no commas between the objects and no newlines:
{"store":"New Sore_1","store_id":"10","store_metric":"1234567"}{"store":"New Sore_1","store_id":"10","store_metric":"98765"}
The problem is that when I create a DataFrame, it contains only the first row (with store_metric 1234567) and ignores the second one.
What is my mistake in creating a DataFrame? And what should I do for the data to be parsed correctly?
Here is how I'm trying to create a DataFrame:
val df = sqlContext.read.json(sc.wholeTextFiles("data.txt").values)
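A hedged sketch of one likely fix (assuming each merged string is a complete JSON object): the JSON reader expects one document per record, and wholeTextFiles returns the entire file as a single record, so parsing stops after the first object. Writing one object per line and reading the file line by line should let every record through.
// write newline-delimited JSON instead of concatenated objects
out.write(mergedSales.mkString("\n").getBytes("UTF-8"))
// sc.textFile yields one record per line, so each object is parsed separately
val df = sqlContext.read.json(sc.textFile("data.txt"))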
Related
I read a .csv file to create a data frame, and I want to write the data to a Kafka topic. The code is the following:
df = spark.read.format("csv").option("header", "true").load(f'{file_location}')
kafka_df = df.selectExpr("to_json(struct(*)) AS value").selectExpr("CAST(value AS STRING)")
kafka_df.show(truncate=False)
And the data frame looks like this:
value
"{""id"":""d215e9f1-4d0c-42da-8f65-1f4ae72077b3"",""latitude"":""-63.571457254062715"",""longitude"":""-155.7055842710919""}"
"{""id"":""ca3d75b3-86e3-438f-b74f-c690e875ba52"",""latitude"":""-53.36506636464281"",""longitude"":""30.069167069917597""}"
"{""id"":""29e66862-9248-4af7-9126-6880ceb3b45f"",""latitude"":""-23.767505281795835"",""longitude"":""174.593140405442""}"
"{""id"":""451a7e21-6d5e-42c3-85a8-13c740a058a9"",""latitude"":""13.02054867061598"",""longitude"":""20.328402498420786""}"
"{""id"":""09d6c11d-7aae-4d17-8cd8-183157794893"",""latitude"":""-81.48976715040848"",""longitude"":""1.1995769642056189""}"
"{""id"":""393e8760-ef40-482a-a039-d263af3379ba"",""latitude"":""-71.73949722379649"",""longitude"":""112.59922770487054""}"
"{""id"":""d6db8fcf-ee83-41cf-9ec2-5c2909c18534"",""latitude"":""-4.034680969008576"",""longitude"":""60.59645511854336""}"
After I write it to Kafka, I want to read it back and transform the binary data from the "value" column into a JSON string, but the result is that the value contains only the id, not the whole string. Any idea why?
from pyspark.sql import functions as F
df = consume_from_event_hub(topic, bootstrap_servers, config, consumer_group)
string_df = df.select(F.col("value").cast("string"))
string_df.display()
value
794541bc-30e6-4c16-9cd0-3c5c8995a3a4
20ea5b50-0baa-47e3-b921-f9a3ac8873e2
598d2fc1-c919-4498-9226-dd5749d92fc5
86cd5b2b-1c57-466a-a3c8-721811ab6959
807de968-c070-4b8b-86f6-00a865474c35
e708789c-e877-44b8-9504-86fd9a20ef91
9133a888-2e8d-4a5a-87ce-4a53e63b67fc
cd5e3e0d-8b02-45ee-8634-7e056d49bf3b
The CSV format is this:
id,latitude,longitude
bd6d98e1-d1da-4f41-94ba-8dbd8c8fce42,-86.06318155350924,-108.14300138138589
c39e84c6-8d7b-4cc5-b925-68a5ea406d52,74.20752175171859,-129.9453606091319
011e5fb8-6ab7-4ee9-97bb-acafc2c71e15,19.302250885973592,-103.2154291337162
You need to remove selectExpr("CAST(value AS STRING)"), since to_json already returns a string column:
from pyspark.sql.functions import col, to_json, struct
df = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load(f'{file_location}')
kafka_df = df.select(to_json(struct(col("*"))).alias("value"))
kafka_df.show(truncate=False)
I'm not sure what's wrong with the consumer. That should have worked, unless consume_from_event_hub does something specific to extract the id column.
I have the following JSON array in one of the columns of a DataFrame in Spark (2.4):
[{"quoteID":"12411736"},{"quoteID":"12438257"},{"quoteID":"12438288"},{"quoteID":"12438296"},{"quoteID":"12438299"}]
I am trying to merge it as:
{"quoteIDs":["12411736","12438257","12438288","12438296","12438299"]}
I need help with it.
Use from_json to read your column data as array<struct>, then explode to flatten the data.
Finally, do groupBy + collect_list to create an array and use to_json to produce the required JSON.
Example:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.{ArrayType, StructType, StringType}
import spark.implicits._  // assuming a SparkSession named spark; needed for toDF

val df=Seq(("a","""[{"quoteID":"123"},{"quoteID":"456"}]""")).toDF("i","j")
df.show(false)
//+---+-------------------------------------+
//|i |j |
//+---+-------------------------------------+
//|a |[{"quoteID":"123"},{"quoteID":"456"}]|
//+---+-------------------------------------+
val sch=ArrayType(new StructType().add("quoteID",StringType))
val drop_cols=Seq("tmp1","tmp2","j")
//if you need as column in the dataframe
df.withColumn("tmp1",from_json(col("j"),sch)).
withColumn("tmp2",explode(col("tmp1"))).
selectExpr("*","tmp2.*").
groupBy("i").
agg(collect_list("quoteID").alias("quoteIDS")).
withColumn("quoteIDS",to_json(struct(col("quoteIDS")))).
drop(drop_cols:_*).
show(false)
//+---+--------------------------+
//|i |quoteIDS |
//+---+--------------------------+
//|a |{"quoteIDS":["123","456"]}|
//+---+--------------------------+
//if you need as json string
val df1=df.withColumn("tmp1",from_json(col("j"),sch)).
withColumn("tmp2",explode(col("tmp1"))).
selectExpr("*","tmp2.*").
groupBy("i").
agg(collect_list("quoteID").alias("quoteIDS"))
//toJSON (or) write as json file
df1.toJSON.first
//String = {"i":"a","quoteIDS":["123","456"]}
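A possible alternative for Spark 2.4, sketched here as a suggestion rather than part of the original answer: the higher-order function transform (available in SQL expressions since 2.4) can pull quoteID out of each struct without exploding and re-grouping. This reuses df and sch from above.
val parsed = df.withColumn("arr", from_json(col("j"), sch))
val result = parsed
  .withColumn("quoteIDS", to_json(struct(expr("transform(arr, x -> x.quoteID)").alias("quoteIDs"))))
  .drop("arr", "j")
result.show(false)
//expected output (assumption): {"quoteIDs":["123","456"]}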
How do I create a DataFrame from JSON data where two fields have the same key but differ only in capitalization? For example,
{"abc1":"some-value", "ABC1":"some-other-value", "abc":"some-value"}
{"abc1":"some-value", "ABC1":"some-other-value", "abc":"some-value1"}
At present, I am getting the following error:
org.apache.spark.sql.AnalysisException: Reference 'ABC1' is ambiguous.
This is how I created the DataFrame:
val df = sqlContext.read.json(inputPath)
I also tried creating an RDD first, where I read every line and changed the key name in the JSON string, and then converted the RDD to a DataFrame. That approach was very slow.
I tried multiple ways, but the same problem remains.
I tried to rename the column names, but the error is still there:
val modDf = df
.withColumnRenamed("MCC", "MCC_CAP")
.withColumnRenamed("MNC", "MNC_CAP")
.withColumnRenamed("MCCMNC", "MCCMNC_CAP")
I created another DataFrame with renamed columns:
val cols = df.columns.map(line => if (line.startsWith("M"))line.concat("_cap") else line)
val smallDf = df.toDF(cols: _*)
I also tried dropping the duplicate columns:
val capCols = df.columns.filter(line => line.startsWith("M"))
val smallDf = df.drop(capCols: _*)
You should read it as a text RDD using sparkContext and then use sqlContext to read the RDD as JSON:
sqlContext.read.json(sc.textFile("path to your json file"))
You should get your dataframe as:
+----------------+-----------+----------+
|ABC1 |abc |abc1 |
+----------------+-----------+----------+
|some-other-value|some-value |some-value|
|some-other-value|some-value1|some-value|
+----------------+-----------+----------+
The generated dataframe would still have a defect, though, as duplicate column names (case-insensitive) are not allowed in Spark dataframes.
So I would recommend changing the duplicate column names before you convert the JSON lines to a dataframe:
val rdd = sc.textFile("path to your json file").map(jsonLine => jsonLine.replace("\"ABC1\":", "\"AAA\":"))
sqlContext.read.json(rdd)
Now you should have the dataframe as:
+----------------+-----------+----------+
|AAA |abc |abc1 |
+----------------+-----------+----------+
|some-other-value|some-value |some-value|
|some-other-value|some-value1|some-value|
+----------------+-----------+----------+
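A related alternative, offered as a hedged sketch rather than part of the original answer: Spark's analyzer is case-insensitive by default, which is why 'ABC1' becomes ambiguous; turning on case-sensitive resolution may avoid the rename entirely.
// assumption: case-sensitive column resolution is acceptable for the rest of the job
sqlContext.setConf("spark.sql.caseSensitive", "true")
val dfCaseSensitive = sqlContext.read.json(inputPath)
dfCaseSensitive.select("ABC1", "abc1", "abc").show(false)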
When I create a dataframe from a JSON file, the fields from the JSON file are sorted alphabetically by default in the dataframe. How do I avoid this sorting?
The JSON file has one JSON message per line:
{"name":"john","age":10,"class":2}
{"name":"rambo","age":11,"class":3}
When I create a data frame from this file as:
val jDF = sqlContext.read.json("/user/inputfiles/sample.json")
a DF is created as jDF: org.apache.spark.sql.DataFrame = [age: bigint, class: bigint, name: string]. In the DF the fields are sorted by default.
How do we avoid this from happening?
I am unable to understand what is going wrong here.
I'd appreciate any help in sorting out the problem.
For Question 1:
A simple way is to do select on the DataFrame:
val newDF = jDF.select("name","age","class")
The order of parameters is the order of the columns you want.
But this could be verbose if there are many columns and you have to define the order yourself.
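Another option, sketched here as a suggestion rather than part of the original answer: the alphabetical order comes from schema inference, so supplying an explicit schema keeps the fields in the declared order without having to list them again in a select.
import org.apache.spark.sql.types.{StructType, StructField, StringType, LongType}

// declare the fields in the order you want them to appear
val schema = StructType(Seq(
  StructField("name", StringType),
  StructField("age", LongType),
  StructField("class", LongType)
))
val jDF = sqlContext.read.schema(schema).json("/user/inputfiles/sample.json")
// jDF.printSchema() should now show name, age, class in that order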
A relative newbie to Spark, HBase, and Scala here.
I have JSON (stored as byte arrays) in HBase cells, all in the same column family but spread across several thousand column qualifiers. Example (simplified):
Table name: 'Events'
rowkey: rk1
column family: cf1
column qualifier: cq1, cell data (in bytes): {"id":1, "event":"standing"}
column qualifier: cq2, cell data (in bytes): {"id":2, "event":"sitting"}
etc.
Using Scala, I can read rows by specifying a time range:
val scan = new Scan()
val start = 1460542400
val end = 1462801600
scan.setTimeRange(start, end)  // apply the time range mentioned above
val hbaseContext = new HBaseContext(sc, conf)
val getRdd = hbaseContext.hbaseRDD(TableName.valueOf("Events"), scan)
If I try to load my HBase RDD (getRdd) into a dataframe (after converting the byte arrays into strings etc.), it only reads the first cell in every row (in the example above, I would only get "standing").
This code only loads a single cell for every row returned:
val resultsString = getRdd.map(s=>Bytes.toString(s._2.value()))
val resultsDf = sqlContext.read.json(resultsString)
In order to get every cell, I have to iterate as below:
val jsonRDD = getRdd.map(
  row => {
    val str = new StringBuilder
    str.append("[")
    val it = row._2.listCells().iterator()
    while (it.hasNext) {
      val cell = it.next()
      val cellstring = Bytes.toString(CellUtil.cloneValue(cell))
      str.append(cellstring)
      if (it.hasNext) {
        str.append(",")
      }
    }
    str.append("]")
    str.toString()
  }
)
val hbaseDataSet = sqlContext.read.json(jsonRDD)
I need to add the square brackets and the commas so it's properly formatted JSON for the dataframe to read.
Questions:
Is there a more elegant way to construct the JSON, i.e. some parser that takes the individual JSON strings and concatenates them so the result is properly formed JSON?
Is there a better way to flatten HBase cells so I don't need to iterate? (One possible approach is sketched after these questions.)
For the jsonRDD, the closure that is computed should include the str local variable, so a task executing this code on a node should not be missing the "[", "]", or ","; i.e. I won't get parser errors once I run this on the cluster instead of local[*]. Is that correct?
Finally, is it better to just create a pair RDD from the JSON, or to use data frames to perform simple things like counts? Is there some way to measure the efficiency and performance of one vs. the other?
Thank you.
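A hedged sketch for question 2 (not from the original post): since the DataFrame JSON reader accepts one JSON document per record, each cell can become its own record, which avoids building the bracketed array string by hand.
import scala.collection.JavaConverters._
import org.apache.hadoop.hbase.CellUtil
import org.apache.hadoop.hbase.util.Bytes

// emit one JSON string per cell instead of one JSON array per row
val cellJsonRDD = getRdd.flatMap { case (_, result) =>
  result.listCells().asScala.map(cell => Bytes.toString(CellUtil.cloneValue(cell)))
}
val hbaseDataSet = sqlContext.read.json(cellJsonRDD)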