How can I query an RDD with complex types such as maps/arrays?
For example, when I was writing this test code:
case class Test(name: String, map: Map[String, String])
val map = Map("hello" -> "world", "hey" -> "there")
val map2 = Map("hello" -> "people", "hey" -> "you")
val rdd = sc.parallelize(Array(Test("first", map), Test("second", map2)))
I thought the syntax would be something like:
sqlContext.sql("SELECT * FROM rdd WHERE map.hello = world")
or
sqlContext.sql("SELECT * FROM rdd WHERE map[hello] = world")
but I get
Can't access nested field in type MapType(StringType,StringType,true)
and
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved attributes
respectively.
It depends on the type of the column. Let's start with some dummy data:
import org.apache.spark.sql.functions.{udf, lit}
import scala.util.Try
case class SubRecord(x: Int)
case class ArrayElement(foo: String, bar: Int, vals: Array[Double])
case class Record(
an_array: Array[Int], a_map: Map[String, String],
a_struct: SubRecord, an_array_of_structs: Array[ArrayElement])
val df = sc.parallelize(Seq(
Record(Array(1, 2, 3), Map("foo" -> "bar"), SubRecord(1),
Array(
ArrayElement("foo", 1, Array(1.0, 2.0, 2.0)),
ArrayElement("bar", 2, Array(3.0, 4.0, 5.0)))),
Record(Array(4, 5, 6), Map("foz" -> "baz"), SubRecord(2),
Array(ArrayElement("foz", 3, Array(5.0, 6.0)),
ArrayElement("baz", 4, Array(7.0, 8.0))))
)).toDF
df.registerTempTable("df")
df.printSchema
// root
// |-- an_array: array (nullable = true)
// | |-- element: integer (containsNull = false)
// |-- a_map: map (nullable = true)
// | |-- key: string
// | |-- value: string (valueContainsNull = true)
// |-- a_struct: struct (nullable = true)
// | |-- x: integer (nullable = false)
// |-- an_array_of_structs: array (nullable = true)
// | |-- element: struct (containsNull = true)
// | | |-- foo: string (nullable = true)
// | | |-- bar: integer (nullable = false)
// | | |-- vals: array (nullable = true)
// | | | |-- element: double (containsNull = false)
array (ArrayType) columns:
Column.getItem method
df.select($"an_array".getItem(1)).show
// +-----------+
// |an_array[1]|
// +-----------+
// | 2|
// | 5|
// +-----------+
Hive brackets syntax:
sqlContext.sql("SELECT an_array[1] FROM df").show
// +---+
// |_c0|
// +---+
// | 2|
// | 5|
// +---+
a UDF:
val get_ith = udf((xs: Seq[Int], i: Int) => Try(xs(i)).toOption)
df.select(get_ith($"an_array", lit(1))).show
// +---------------+
// |UDF(an_array,1)|
// +---------------+
// | 2|
// | 5|
// +---------------+
In addition to the methods listed above, Spark supports a growing list of built-in functions operating on complex types. Notable examples include higher-order functions like transform (SQL 2.4+, Scala 3.0+, PySpark / SparkR 3.1+):
df.selectExpr("transform(an_array, x -> x + 1) an_array_inc").show
// +------------+
// |an_array_inc|
// +------------+
// | [2, 3, 4]|
// | [5, 6, 7]|
// +------------+
import org.apache.spark.sql.functions.transform
df.select(transform($"an_array", x => x + 1) as "an_array_inc").show
// +------------+
// |an_array_inc|
// +------------+
// | [2, 3, 4]|
// | [5, 6, 7]|
// +------------+
filter (SQL 2.4+, Scala 3.0+, Python / SparkR 3.1+)
df.selectExpr("filter(an_array, x -> x % 2 == 0) an_array_even").show
// +-------------+
// |an_array_even|
// +-------------+
// | [2]|
// | [4, 6]|
// +-------------+
import org.apache.spark.sql.functions.filter
df.select(filter($"an_array", x => x % 2 === 0) as "an_array_even").show
// +-------------+
// |an_array_even|
// +-------------+
// | [2]|
// | [4, 6]|
// +-------------+
aggregate (SQL 2.4+, Scala 3.0+, PySpark / SparkR 3.1+):
df.selectExpr("aggregate(an_array, 0, (acc, x) -> acc + x, acc -> acc) an_array_sum").show
// +------------+
// |an_array_sum|
// +------------+
// | 6|
// | 15|
// +------------+
import org.apache.spark.sql.functions.aggregate
df.select(aggregate($"an_array", lit(0), (x, y) => x + y) as "an_array_sum").show
// +------------+
// |an_array_sum|
// +------------+
// | 6|
// | 15|
// +------------+
array processing functions (array_*) like array_distinct (2.4+):
import org.apache.spark.sql.functions.array_distinct
df.select(array_distinct($"an_array_of_structs.vals"(0))).show
// +-------------------------------------------+
// |array_distinct(an_array_of_structs.vals[0])|
// +-------------------------------------------+
// | [1.0, 2.0]|
// | [5.0, 6.0]|
// +-------------------------------------------+
array_max (array_min, 2.4+):
import org.apache.spark.sql.functions.array_max
df.select(array_max($"an_array")).show
// +-------------------+
// |array_max(an_array)|
// +-------------------+
// | 3|
// | 6|
// +-------------------+
flatten (2.4+)
import org.apache.spark.sql.functions.flatten
df.select(flatten($"an_array_of_structs.vals")).show
// +---------------------------------+
// |flatten(an_array_of_structs.vals)|
// +---------------------------------+
// | [1.0, 2.0, 2.0, 3...|
// | [5.0, 6.0, 7.0, 8.0]|
// +---------------------------------+
arrays_zip (2.4+):
import org.apache.spark.sql.functions.arrays_zip
df.select(arrays_zip($"an_array_of_structs.vals"(0), $"an_array_of_structs.vals"(1))).show(false)
// +--------------------------------------------------------------------+
// |arrays_zip(an_array_of_structs.vals[0], an_array_of_structs.vals[1])|
// +--------------------------------------------------------------------+
// |[[1.0, 3.0], [2.0, 4.0], [2.0, 5.0]] |
// |[[5.0, 7.0], [6.0, 8.0]] |
// +--------------------------------------------------------------------+
array_union (2.4+):
import org.apache.spark.sql.functions.array_union
df.select(array_union($"an_array_of_structs.vals"(0), $"an_array_of_structs.vals"(1))).show
// +---------------------------------------------------------------------+
// |array_union(an_array_of_structs.vals[0], an_array_of_structs.vals[1])|
// +---------------------------------------------------------------------+
// | [1.0, 2.0, 3.0, 4...|
// | [5.0, 6.0, 7.0, 8.0]|
// +---------------------------------------------------------------------+
slice (2.4+):
import org.apache.spark.sql.functions.slice
df.select(slice($"an_array", 2, 2)).show
// +---------------------+
// |slice(an_array, 2, 2)|
// +---------------------+
// | [2, 3]|
// | [5, 6]|
// +---------------------+
map (MapType) columns:
using Column.getField method:
df.select($"a_map".getField("foo")).show
// +----------+
// |a_map[foo]|
// +----------+
// | bar|
// | null|
// +----------+
using Hive brackets syntax:
sqlContext.sql("SELECT a_map['foz'] FROM df").show
// +----+
// | _c0|
// +----+
// |null|
// | baz|
// +----+
using a full path with dot syntax:
df.select($"a_map.foo").show
// +----+
// | foo|
// +----+
// | bar|
// |null|
// +----+
using a UDF:
val get_field = udf((kvs: Map[String, String], k: String) => kvs.get(k))
df.select(get_field($"a_map", lit("foo"))).show
// +--------------+
// |UDF(a_map,foo)|
// +--------------+
// | bar|
// | null|
// +--------------+
A growing number of map_* functions like map_keys (2.3+):
import org.apache.spark.sql.functions.map_keys
df.select(map_keys($"a_map")).show
// +---------------+
// |map_keys(a_map)|
// +---------------+
// | [foo]|
// | [foz]|
// +---------------+
or map_values (2.3+)
import org.apache.spark.sql.functions.map_values
df.select(map_values($"a_map")).show
// +-----------------+
// |map_values(a_map)|
// +-----------------+
// | [bar]|
// | [baz]|
// +-----------------+
Please check SPARK-23899 for a detailed list.
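For instance, element_at (2.4+) gives a direct key lookup on a map column; a minimal sketch against the a_map column defined above:
import org.apache.spark.sql.functions.element_at
// look up a key directly; missing keys yield null
df.select(element_at($"a_map", "foo")).show
// expected output (approximately):
// +----------------------+
// |element_at(a_map, foo)|
// +----------------------+
// |                   bar|
// |                  null|
// +----------------------+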
struct (StructType) columns using full path with dot syntax:
with DataFrame API
df.select($"a_struct.x").show
// +---+
// | x|
// +---+
// | 1|
// | 2|
// +---+
with raw SQL
sqlContext.sql("SELECT a_struct.x FROM df").show
// +---+
// | x|
// +---+
// | 1|
// | 2|
// +---+
Fields inside arrays of structs can be accessed using dot syntax, field names, and standard Column methods:
df.select($"an_array_of_structs.foo").show
// +----------+
// | foo|
// +----------+
// |[foo, bar]|
// |[foz, baz]|
// +----------+
sqlContext.sql("SELECT an_array_of_structs[0].foo FROM df").show
// +---+
// |_c0|
// +---+
// |foo|
// |foz|
// +---+
df.select($"an_array_of_structs.vals".getItem(1).getItem(1)).show
// +------------------------------+
// |an_array_of_structs.vals[1][1]|
// +------------------------------+
// | 4.0|
// | 8.0|
// +------------------------------+
Fields of user-defined types (UDTs) can be accessed using UDFs. See Spark SQL referencing attributes of UDT for details.
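A minimal sketch of the UDF approach, assuming a hypothetical DataFrame vecDf with a column named features of the ML VectorUDT type (the DataFrame and column names are illustrative, not part of the data above):
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.functions.{udf, lit}
// hypothetical helper: pull the i-th element out of a Vector column
val vectorElement = udf((v: Vector, i: Int) => v(i))
// usage, assuming vecDf has a VectorUDT column named "features":
// vecDf.select(vectorElement($"features", lit(0))).show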
Notes:
Depending on the Spark version, some of these methods may be available only with HiveContext. UDFs should work independently of the version with both the standard SQLContext and HiveContext.
Generally speaking, nested values are second-class citizens. Not all typical operations are supported on nested fields. Depending on the context, it could be better to flatten the schema and/or explode collections:
df.select(explode($"an_array_of_structs")).show
// +--------------------+
// | col|
// +--------------------+
// |[foo,1,WrappedArr...|
// |[bar,2,WrappedArr...|
// |[foz,3,WrappedArr...|
// |[baz,4,WrappedArr...|
// +--------------------+
Dot syntax can be combined with the wildcard character (*) to select (possibly multiple) fields without specifying names explicitly:
df.select($"a_struct.*").show
// +---+
// | x|
// +---+
// | 1|
// | 2|
// +---+
JSON columns can be queried using the get_json_object and from_json functions. See How to query JSON data column using Spark DataFrames? for details.
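A minimal sketch, assuming a hypothetical DataFrame whose json_col field holds JSON strings (and that the usual implicits are in scope for toDF and $):
import org.apache.spark.sql.functions.{get_json_object, from_json}
import org.apache.spark.sql.types.{StructType, StructField, StringType}
val jsonDf = Seq("""{"a": "x", "b": "y"}""").toDF("json_col")
// extract a single field with a JSONPath expression
jsonDf.select(get_json_object($"json_col", "$.a")).show
// or parse the whole column into a struct (from_json, 2.1+) and use getField / dot access
val jsonSchema = StructType(Seq(StructField("a", StringType), StructField("b", StringType)))
jsonDf.select(from_json($"json_col", jsonSchema).getField("a")).show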
Once you convert it to a DataFrame, you can simply fetch the data as:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

// assumes an RDD of (name, map) pairs; adapt the field extraction to your record type
val rddRow = rdd.map(kv => {
  val k = kv._1
  val v = kv._2
  Row(k, v)
})

val myFld1 = StructField("name", StringType, true)
val myFld2 = StructField("map", MapType(StringType, StringType), true)
val arr = Array(myFld1, myFld2)
val schema = StructType(arr)

val rowrddDF = sqc.createDataFrame(rddRow, schema)
rowrddDF.registerTempTable("rowtbl")
val rowrddDFFinal = rowrddDF.select(rowrddDF("map.one"))
or
val rowrddDFFinal = rowrddDF.select("map.one")
Here is what I did, and it worked:
case class Test(name: String, m: Map[String, String])
val map = Map("hello" -> "world", "hey" -> "there")
val map2 = Map("hello" -> "people", "hey" -> "you")
val rdd = sc.parallelize(Array(Test("first", map), Test("second", map2)))
val rdddf = rdd.toDF
rdddf.registerTempTable("mytable")
sqlContext.sql("select m.hello from mytable").show
Results
+------+
| hello|
+------+
| world|
|people|
+------+
Related
I'm trying to use to_html(formatters={column: function}) in order to make visual changes.
df = pd.DataFrame({'name':['Here',4.45454,5]}, index=['A', 'B', 'C'])
def _form(val):
value = 'STRING' if isinstance(val, str) else '{:0.0f}'.format(val)
return value
df.to_html(formatters={'name':_form})
I get
|          | name     |
| -------- | -------- |
| A        | STRING   |
| B        | 4.45454  |
| C        | 5        |
instead of
| | name |
| -------- | -------------- |
| A | STRING |
| B | 4 |
| C | 5 |
The problem here is that the float value doesn't change.
On the other hand, when all the values are floats or integers, it gives the desired result:
df = pd.DataFrame({'name':[323.322,4.45454,5]}, index=['A', 'B', 'C'])
def _form(val):
value = 'STRING' if isinstance(val, str) else '{:0.0f}'.format(val)
return value
df.to_html(formatters={'name':_form})
How can it be fixed?
Thank you.
I have a nested source JSON file that contains an array of structs. The number of structs varies greatly from row to row, and I would like to use Spark (Scala) to dynamically create new DataFrame columns from the struct's keys/values, where the key is the column name and the value is the column value.
Example minified JSON record:
{"key1":{"key2":{"key3":"AK","key4":"EU","key5":{"key6":"001","key7":"N","values":[{"name":"valuesColumn1","value":"9.876"},{"name":"valuesColumn2","value":"1.2345"},{"name":"valuesColumn3","value":"8.675309"}]}}}}
DataFrame schema:
scala> val df = spark.read.json("file:///tmp/nested_test.json")
root
|-- key1: struct (nullable = true)
| |-- key2: struct (nullable = true)
| | |-- key3: string (nullable = true)
| | |-- key4: string (nullable = true)
| | |-- key5: struct (nullable = true)
| | | |-- key6: string (nullable = true)
| | | |-- key7: string (nullable = true)
| | | |-- values: array (nullable = true)
| | | | |-- element: struct (containsNull = true)
| | | | | |-- name: string (nullable = true)
| | | | | |-- value: string (nullable = true)
What's been done so far:
df.select(
($"key1.key2.key3").as("key3"),
($"key1.key2.key4").as("key4"),
($"key1.key2.key5.key6").as("key6"),
($"key1.key2.key5.key7").as("key7"),
($"key1.key2.key5.values").as("values")).
show(truncate=false)
+----+----+----+----+----------------------------------------------------------------------------+
|key3|key4|key6|key7|values |
+----+----+----+----+----------------------------------------------------------------------------+
|AK |EU |001 |N |[[valuesColumn1, 9.876], [valuesColumn2, 1.2345], [valuesColumn3, 8.675309]]|
+----+----+----+----+----------------------------------------------------------------------------+
There is an array of 3 structs here, but the structs need to be split into 3 separate columns dynamically (the number can vary greatly), and I am not sure how to do it.
Sample desired output
Notice that 3 new columns were produced, one for each element of the values array.
+----+----+----+----+-------------+-------------+-------------+
|key3|key4|key6|key7|valuesColumn1|valuesColumn2|valuesColumn3|
+----+----+----+----+-------------+-------------+-------------+
|AK  |EU  |001 |N   |9.876        |1.2345       |8.675309     |
+----+----+----+----+-------------+-------------+-------------+
Reference
I believe that the desired solution is similar to what was discussed in this SO post, but with 2 main differences:
The number of columns is hardcoded to 3 in the SO post, but in my case the number of array elements is unknown.
The column names need to be driven by the name field and the column values by the value field.
...
| | | | |-- element: struct (containsNull = true)
| | | | | |-- name: string (nullable = true)
| | | | | |-- value: string (nullable = true)
You could do it this way:
val sac = new SparkContext("local[*]", "first Program")
val sqlc = new SQLContext(sac)
import sqlc.implicits._
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
val json = """{"key1":{"key2":{"key3":"AK","key4":"EU","key5":{"key6":"001","key7":"N","values":[{"name":"valuesColumn1","value":"9.876"},{"name":"valuesColumn2","value":"1.2345"},{"name":"valuesColumn3","value":"8.675309"}]}}}}"""
val df1 = sqlc.read.json(Seq(json).toDS())
val df2 = df1.select(
($"key1.key2.key3").as("key3"),
($"key1.key2.key4").as("key4"),
($"key1.key2.key5.key6").as("key6"),
($"key1.key2.key5.key7").as("key7"),
($"key1.key2.key5.values").as("values")
)
val numColsVal = df2
.withColumn("values_size", size($"values"))
.agg(max($"values_size"))
.head()
.getInt(0)
val finalDFColumns = df2
  .select(explode($"values").as("values"))
  .select("values.*")
  .select("name")
  .distinct
  .map(_.getAs[String](0))
  .orderBy($"value".asc)
  .collect
  .foldLeft(df2.limit(0))((cdf, c) => cdf.withColumn(c, lit(null)))
  .columns
val finalDF = df2.select($"*" +: (0 until numColsVal).map(i => $"values".getItem(i)("value").as($"values".getItem(i)("name").toString)): _*)
finalDF.columns.zip(finalDFColumns).foldLeft(finalDF)((fdf, column) => fdf.withColumnRenamed(column._1, column._2)).show(false)
finalDF.columns.zip(finalDFColumns).foldLeft(finalDF)((fdf, column) => fdf.withColumnRenamed(column._1, column._2)).drop($"values").show(false)
The resulting final output:
+----+----+----+----+-------------+-------------+-------------+
|key3|key4|key6|key7|valuesColumn1|valuesColumn2|valuesColumn3|
+----+----+----+----+-------------+-------------+-------------+
|AK |EU |001 |N |9.876 |1.2345 |8.675309 |
+----+----+----+----+-------------+-------------+-------------+
Hope I got your question right!
----------- EDIT with Explanation----------
This block gets the number of columns to be created for the array structure.
val numColsVal = df2
.withColumn("values_size", size($"values"))
.agg(max($"values_size"))
.head()
.getInt(0)
finalDFColumns holds the column names of a DataFrame built with all the expected output columns, initialized to null values.
The block below returns the distinct column names that need to be created from the array structure.
df2.select(explode($"values").as("values")).select("values.*").select("name").distinct.map(_.getAs[String](0)).orderBy($"value".asc).collect
The block below combines the above new columns with the other columns of df2, initialized with empty/null values.
foldLeft(df2.limit(0))((cdf, c) => cdf.withColumn(c, lit(null)))
Combining these two blocks, if you print the output you will get:
+----+----+----+----+------+-------------+-------------+-------------+
|key3|key4|key6|key7|values|valuesColumn1|valuesColumn2|valuesColumn3|
+----+----+----+----+------+-------------+-------------+-------------+
+----+----+----+----+------+-------------+-------------+-------------+
Now the structure is ready; we need the values for the corresponding columns. The block below gets us the values:
df2.select($"*" +: (0 until numColsVal).map(i => $"values".getItem(i)("value").as($"values".getItem(i)("name").toString)): _*)
This results in the following:
+----+----+----+----+--------------------+---------------+---------------+---------------+
|key3|key4|key6|key7| values|values[0][name]|values[1][name]|values[2][name]|
+----+----+----+----+--------------------+---------------+---------------+---------------+
| AK| EU| 001| N|[[valuesColumn1, ...| 9.876| 1.2345| 8.675309|
+----+----+----+----+--------------------+---------------+---------------+---------------+
Now we need to rename the columns to match those from the first block above. So we use the zip function to pair up the columns and then the foldLeft method to rename the output columns, as below:
finalDF.columns.zip(finalDFColumns).foldLeft(finalDF)((fdf, column) => fdf.withColumnRenamed(column._1, column._2)).show(false)
This results in the below structure:
+----+----+----+----+--------------------+-------------+-------------+-------------+
|key3|key4|key6|key7| values|valuesColumn1|valuesColumn2|valuesColumn3|
+----+----+----+----+--------------------+-------------+-------------+-------------+
| AK| EU| 001| N|[[valuesColumn1, ...| 9.876| 1.2345| 8.675309|
+----+----+----+----+--------------------+-------------+-------------+-------------+
We are almost there. We now just need to remove the unwanted values column like this:
finalDF.columns.zip(finalDFColumns).foldLeft(finalDF)((fdf, column) => fdf.withColumnRenamed(column._1, column._2)).drop($"values").show(false)
This gives the expected output:
+----+----+----+----+-------------+-------------+-------------+
|key3|key4|key6|key7|valuesColumn1|valuesColumn2|valuesColumn3|
+----+----+----+----+-------------+-------------+-------------+
|AK |EU |001 |N |9.876 |1.2345 |8.675309 |
+----+----+----+----+-------------+-------------+-------------+
I'm not sure if I was able to explain it clearly, but if you break the above statements apart and print the intermediate results, you will see how we arrive at the output. You can find explanations with examples for the individual functions used in this logic online.
I found that this approach, using explode and pivot, performed much better and was easier to understand:
val json = """{"key1":{"key2":{"key3":"AK","key4":"EU","key5":{"key6":"001","key7":"N","values":[{"name":"valuesColumn1","value":"9.876"},{"name":"valuesColumn2","value":"1.2345"},{"name":"valuesColumn3","value":"8.675309"}]}}}}"""
val df = spark.read.json(Seq(json).toDS())
// schema
df.printSchema
root
|-- key1: struct (nullable = true)
| |-- key2: struct (nullable = true)
| | |-- key3: string (nullable = true)
| | |-- key4: string (nullable = true)
| | |-- key5: struct (nullable = true)
| | | |-- key6: string (nullable = true)
| | | |-- key7: string (nullable = true)
| | | |-- values: array (nullable = true)
| | | | |-- element: struct (containsNull = true)
| | | | | |-- name: string (nullable = true)
| | | | | |-- value: string (nullable = true)
// create final df
val finalDf = df.
select(
$"key1.key2.key3".as("key3"),
$"key1.key2.key4".as("key4"),
$"key1.key2.key5.key6".as("key6"),
$"key1.key2.key5.key7".as("key7"),
explode($"key1.key2.key5.values").as("values")
).
groupBy(
$"key3", $"key4", $"key6", $"key7"
).
pivot("values.name").
agg(min("values.value")).alias("values.name")
// result
finalDf.show
+----+----+----+----+-------------+-------------+-------------+
|key3|key4|key6|key7|valuesColumn1|valuesColumn2|valuesColumn3|
+----+----+----+----+-------------+-------------+-------------+
| AK| EU| 001| N| 9.876| 1.2345| 8.675309|
+----+----+----+----+-------------+-------------+-------------+
I'm new to Rust and still learning. There is a Rust application with main.rs and routes.rs; main.rs has the server configuration and routes.rs has the handler methods with their paths.
main.rs
#[macro_use]
extern crate log;
use actix_web::{App, HttpServer};
use dotenv::dotenv;
use listenfd::ListenFd;
use std::env;
mod search;
#[actix_rt::main]
async fn main() -> std::io::Result<()> {
dotenv().ok();
env_logger::init();
let mut listenfd = ListenFd::from_env();
let mut server = HttpServer::new(||
App::new()
.configure(search::init_routes)
);
server = match listenfd.take_tcp_listener(0)? {
Some(listener) => server.listen(listener)?,
None => {
let host = env::var("HOST").expect("Host not set");
let port = env::var("PORT").expect("Port not set");
server.bind(format!("{}:{}", host, port))?
}
};
info!("Starting server");
server.run().await
}
routes.rs
use crate::search::User;
use actix_web::{get, post, put, delete, web, HttpResponse, Responder};
use serde_json::json;
extern crate reqwest;
extern crate serde;
use reqwest::Error;
use serde::{Deserialize};
use rocket_contrib::json::Json;
use serde_json::Value;
// mod bargainfindermax;
#[get("/users")]
async fn find_all() -> impl Responder {
HttpResponse::Ok().json(
vec![
User { id: 1, email: "tore@cloudmaker.dev".to_string() },
User { id: 2, email: "tore@cloudmaker.dev".to_string() },
]
)
}
pub fn init_routes(cfg: &mut web::ServiceConfig) {
cfg.service(find_all);
}
Now what I want is to fetch an API using a method in a separate .rs file (fetch_test.rs) and route it in the routes.rs file. Then I want to get the response in a web browser by visiting that route path (link).
How can I do these things? I searched everywhere but found nothing helpful, and sometimes I didn't understand the documentation either.
Update:
fetch_test.rs
extern crate reqwest;
use hyper::header::{Headers, Authorization, Basic, ContentType};
pub fn authenticate() -> String {
fn construct_headers() -> Headers {
let mut headers = Headers::new();
headers.set(
Authorization(
Basic {
username: "HI:ABGTYH".to_owned(),
password: Some("%8YHT".to_owned())
}
)
);
headers.set(ContentType::form_url_encoded());
headers
}
let client = reqwest::Client::new();
let resz = client.post("https://api.test.com/auth/token")
.headers(construct_headers())
.body("grant_type=client_credentials")
.json(&map)
.send()
.await?;
}
Errors.
Compiling sabre-actix-kist v0.1.0 (E:\wamp64\www\BukFlightsNewLevel\flights\APIs\sabre-actix-kist)
error[E0425]: cannot find value `map` in this scope
--> src\search\routes\common.rs:28:12
|
28 | .json(&map)
| ^^^ not found in this scope
error[E0728]: `await` is only allowed inside `async` functions and blocks
--> src\search\routes\common.rs:25:12
|
4 | pub fn authenticate() -> String {
| ------------ this is not `async`
...
25 | let resz = client.post("https://api-crt.cert.havail.sabre.com/v2/auth/token")
| ____________^
26 | | .headers(construct_headers())
27 | | .body("grant_type=client_credentials")
28 | | .json(&map)
29 | | .send()
30 | | .await?;
| |__________^ only allowed inside `async` functions and blocks
error[E0277]: the trait bound `std::result::Result<search::routes::reqwest::Response, search::routes::reqwest::Error>: std::future::Future` is not satisfied
--> src\search\routes\common.rs:25:12
|
25 | let resz = client.post("https://api-crt.cert.havail.sabre.com/v2/auth/token")
| ____________^
26 | | .headers(construct_headers())
27 | | .body("grant_type=client_credentials")
28 | | .json(&map)
29 | | .send()
30 | | .await?;
| |__________^ the trait `std::future::Future` is not implemented for `std::result::Result<search::routes::reqwest::Response, search::routes::reqwest::Error>`
error[E0277]: the `?` operator can only be used in a function that returns `Result` or `Option` (or another type that implements `std::ops::Try`)
--> src\search\routes\common.rs:25:12
|
4 | / pub fn authenticate() -> String {
5 | |
6 | | let res = reqwest::get("http://api.github.com/users")
7 | | .expect("Couldnt")
... |
25 | | let resz = client.post("https://api-crt.cert.havail.sabre.com/v2/auth/token")
| |____________^
26 | || .headers(construct_headers())
27 | || .body("grant_type=client_credentials")
28 | || .json(&map)
29 | || .send()
30 | || .await?;
| ||___________^ cannot use the `?` operator in a function that returns `std::string::String`
31 | |
32 | | }
| |_- this function should return `Result` or `Option` to accept `?`
|
= help: the trait `std::ops::Try` is not implemented for `std::string::String`
= note: required by `std::ops::Try::from_error`
error[E0308]: mismatched types
--> src\search\routes\common.rs:4:26
|
4 | pub fn authenticate() -> String {
| ------------ ^^^^^^ expected struct `std::string::String`, found `()`
| |
| implicitly returns `()` as its body has no tail or `return` expression
Update again:
extern crate reqwest;
use hyper::header::{Headers, Authorization, Basic, ContentType};
fn construct_headers() -> Headers {
let mut headers = Headers::new();
headers.set(
Authorization(
Basic {
username: "HI:ABGTYH".to_owned(),
password: Some("%8YHT".to_owned())
}
)
);
headers.set(ContentType::form_url_encoded());
headers
}
pub async fn authenticate() -> Result<String, reqwest::Error> {
let client = reqwest::Client::new();
let resz = client.post("https://api.test.com/auth/token")
.headers(construct_headers())
.body("grant_type=client_credentials")
.json(&map)
.send()
.await?;
}
New error:
error[E0425]: cannot find value `map` in this scope
--> src\search\routes\common.rs:24:12
|
24 | .json(&map)
| ^^^ not found in this scope
error[E0277]: the trait bound `impl std::future::Future: search::routes::serde::Serialize` is not satisfied
--> src\search\routes.rs:24:29
|
24 | HttpResponse::Ok().json(set_token)
| ^^^^^^^^^ the trait `search::routes::serde::Serialize` is not implemented for `impl std::future::Future`
error[E0308]: mismatched types
--> src\search\routes\common.rs:22:14
|
22 | .headers(construct_headers())
| ^^^^^^^^^^^^^^^^^^^ expected struct `search::routes::reqwest::header::HeaderMap`, found struct `hyper::header::Headers`
|
= note: expected struct `search::routes::reqwest::header::HeaderMap`
found struct `hyper::header::Headers`
error[E0599]: no method named `json` found for struct `search::routes::reqwest::RequestBuilder` in the current scope
--> src\search\routes\common.rs:24:6
|
24 | .json(&map)
| ^^^^ method not found in `search::routes::reqwest::RequestBuilder`
error[E0308]: mismatched types
--> src\search\routes\common.rs:18:63
|
18 | pub async fn authenticate() -> Result<String, reqwest::Error> {
| _______________________________________________________________^
19 | |
20 | | let client = reqwest::Client::new();
21 | | let resz = client.post("https://api.test.com/auth/token")
... |
27 | |
28 | | }
| |_^ expected enum `std::result::Result`, found `()`
|
= note: expected enum `std::result::Result<std::string::String, search::routes::reqwest::Error>`
found unit type `()`
Can I clarify your question? As I understand it, you already know how to use functions from another file. Do you need to know how to make API requests and pass the result of a request back as the response?
First, you need to create fetch_test.rs using, for example, the reqwest library. Note that the call has to live inside an async fn, and map is just some serializable request body (a HashMap is used here as an illustration):
use std::collections::HashMap;

let client = reqwest::Client::new();
let mut map = HashMap::new();
map.insert("grant_type", "client_credentials");
let res = client.post("http://httpbin.org/post")
    .json(&map)
    .send()
    .await?;
Map the result or pass it along as-is.
Then return the result in routes.rs: HttpResponse::Ok().json(res)
I hope it will help you.
There is the following dataframe:
>>> df.printSchema()
root
|-- I: string (nullable = true)
|-- F: string (nullable = true)
|-- D: string (nullable = true)
|-- T: string (nullable = true)
|-- S: string (nullable = true)
|-- P: string (nullable = true)
Column F is in dictionary format:
{"P1":"1:0.01","P2":"3:0.03,4:0.04","P3":"3:0.03,4:0.04",...}
I need to read column F as follows and create two new columns, P and N:
P1 => "1:0.01"
P2 => "3:0.03,4:0.04"
and so on
+----+-----+---------------+----+----+----+----+
| I  | P   | N             | D  | T  | S  | P  |
+----+-----+---------------+----+----+----+----+
| i1 | p1  | 1:0.01        | d1 | t1 | s1 | p1 |
| i1 | p2  | 3:0.03,4:0.04 | d1 | t1 | s1 | p1 |
| i1 | p3  | 3:0.03,4:0.04 | d1 | t1 | s1 | p1 |
| i2 | ... | ...           | d2 | t2 | s2 | p2 |
+----+-----+---------------+----+----+----+----+
Any suggestion in PySpark?
Try this:
The DataFrame you have
from pyspark.sql import functions as F
df = spark.createDataFrame([('id01', '{"P1":"1:0.01","P2":"3:0.03,4:0.04","P3":"3:0.03,4:0.04"}')], ['I', 'F'])
df.printSchema()
df.show(truncate=False)
You can see that the schema and data are the same as in your post.
root
|-- I: string (nullable = true)
|-- F: string (nullable = true)
+----+---------------------------------------------------------+
|I |F |
+----+---------------------------------------------------------+
|id01|{"P1":"1:0.01","P2":"3:0.03,4:0.04","P3":"3:0.03,4:0.04"}|
+----+---------------------------------------------------------+
Process the string to distinguish sub-dicts
# remove '{' and '}'
df = df.withColumn('array', F.regexp_replace('F', r'\{', ''))
df = df.withColumn('array', F.regexp_replace('array', r'\}', ''))
# replace the comma with '#' between each sub-dict so we can split on them
df = df.withColumn('array', F.regexp_replace('array', '","', '"#"' ))
df = df.withColumn('array', F.split('array', '#'))
df.show(truncate=False)
Here are the intermediate results:
+----+---------------------------------------------------------+-----------------------------------------------------------+
|I |F |array |
+----+---------------------------------------------------------+-----------------------------------------------------------+
|id01|{"P1":"1:0.01","P2":"3:0.03,4:0.04","P3":"3:0.03,4:0.04"}|["P1":"1:0.01", "P2":"3:0.03,4:0.04", "P3":"3:0.03,4:0.04"]|
+----+---------------------------------------------------------+-----------------------------------------------------------+
Now generate one row for each sub-dict
# generate one row for each element in the array
df = df.withColumn('exploded', F.explode(df['array']))
# Need to distinguish ':' in the dict and in the value
df = df.withColumn('exploded', F.regexp_replace('exploded', '":"', '"#"' ))
df = df.withColumn('exploded', F.split('exploded', '#'))
# extract the name and value
df = df.withColumn('P', F.col('exploded')[0])
df = df.withColumn('N', F.col('exploded')[1])
df.select('I', 'exploded', 'P', 'N').show(truncate=False)
The final output:
+----+-----------------------+----+---------------+
|I |exploded |P |N |
+----+-----------------------+----+---------------+
|id01|["P1", "1:0.01"] |"P1"|"1:0.01" |
|id01|["P2", "3:0.03,4:0.04"]|"P2"|"3:0.03,4:0.04"|
|id01|["P3", "3:0.03,4:0.04"]|"P3"|"3:0.03,4:0.04"|
+----+-----------------------+----+---------------+
This is how I solved it in the end:
# This method replaces '","' with '";"' to
# distinguish it from the other commas in the string, so we can split on it
def _comma_replacement(val):
    if val:
        val = val.replace('","', '";"').replace('{', '').replace('}', '')
    return val
# imports assumed for this snippet
from pyspark.sql.functions import UserDefinedFunction, col, split, explode
from pyspark.sql.types import ArrayType, StringType

replacing = UserDefinedFunction(lambda x: _comma_replacement(x))
new_df = df.withColumn("F", replacing(col("F")))
new_df = new_df.withColumn("F", split(col("F"), ";").cast(ArrayType(StringType())))
exploded_df = new_df.withColumn("F", explode("F"))
df_sep = exploded_df.withColumn("F", split(col("F"), '":"').cast(ArrayType(StringType())))
dff = df_sep.withColumn("P", df_sep["F"].getItem(0))
dff_new = dff.withColumn("N", dff["F"].getItem(1))
dff_new = dff_new.drop('F')
Using another UDF, I removed the extra characters that remained from the string manipulation.
The solution above takes the same approach. The key idea is to distinguish the commas that separate components from the commas inside them. For that, I suggested the _comma_replacement(val) method, which is called in a UDF. The solution above uses the same idea but with regexp_replace, which can be better optimized.
I now have JSON data as follows:
{"Id":11,"data":[{"package":"com.browser1","activetime":60000},{"package":"com.browser6","activetime":1205000},{"package":"com.browser7","activetime":1205000}]}
{"Id":12,"data":[{"package":"com.browser1","activetime":60000},{"package":"com.browser6","activetime":1205000}]}
......
This JSON records the active time of apps; the purpose is to analyze the total active time of each app.
I use Spark SQL to parse the JSON (Scala):
val sqlContext = sc.sqlContext
val behavior = sqlContext.read.json("behavior-json.log")
behavior.cache()
behavior.createOrReplaceTempView("behavior")
val appActiveTime = sqlContext.sql ("SELECT data FROM behavior") // SQL query
appActiveTime.show (100100) // print dataFrame
appActiveTime.rdd.foreach(println) // print RDD
But the printed DataFrame looks like this:
+----------------------------------------------------------------------+
| data|
+----------------------------------------------------------------------+
| [[60000, com.browser1], [12870000, com.browser]]|
| [[60000, com.browser1], [120000, com.browser]]|
| [[60000, com.browser1], [120000, com.browser]]|
| [[60000, com.browser1], [1207000, com.browser]]|
| [[120000, com.browser]]|
| [[60000, com.browser1], [1204000, com.browser5]]|
| [[60000, com.browser1], [12075000, com.browser]]|
| [[60000, com.browser1], [120000, com.browser]]|
| [[60000, com.browser1], [1204000, com.browser]]|
| [[60000, com.browser1], [120000, com.browser]]|
| [[60000, com.browser1], [1201000, com.browser]]|
| [[1200400, com.browser5]]|
| [[60000, com.browser1], [1200400, com.browser]]|
|[[60000, com.browser1], [1205000, com.browser6], [1205000, com.browser7]]|
The RDD looks like this:
[WrappedArray ([60000, com.browser1], [60000, com.browser1])]
[WrappedArray ([120000, com.browser])]
[WrappedArray ([60000, com.browser1], [1204000, com.browser5])]
[WrappedArray ([12075000, com.browser], [12075000, com.browser])]
And I want to turn the data into:
com.browser1 60000
com.browser1 60000
com.browser 12075000
com.browser 12075000
...
I want to turn the array elements of each row in the RDD into individual rows. Of course, it could be another structure that is easy to analyze. Because I am only just learning Spark and Scala, I have tried for a long time but failed, so I hope you can guide me.
From your given JSON data, you can view the schema of your DataFrame with printSchema and use it:
appActiveTime.printSchema()
root
|-- data: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- activetime: long (nullable = true)
| | |-- package: string (nullable = true)
Since you have an array, you need to explode the data and select the struct fields as below:
import org.apache.spark.sql.functions._
appActiveTime.withColumn("data", explode($"data"))
.select("data.*")
.show(false)
Output:
+----------+------------+
|activetime| package|
+----------+------------+
| 60000|com.browser1|
| 1205000|com.browser6|
| 1205000|com.browser7|
| 60000|com.browser1|
| 1205000|com.browser6|
+----------+------------+
Hope this helps!
With #Shankar Koirala's help, I learned how to use explode to handle a JSON array.
val df = sqlContext.sql("SELECT data FROM behavior")
df.select(explode(df("data"))).toDF("data")
  .select("data.package", "data.activetime")
  .show(false)
For Apache Spark with Java, we need to do something like the below:
Dataset<Row> dataDF = spark.read()
.option("header", "true")
.json("/file_path");
dataDF.createOrReplaceTempView("behavior");
String sqlQuery = "SELECT data from behavior";
Dataset<Row> jsonData = spark.sql(sqlQuery);
jsonData.withColumn("data", explode(jsonData.col("data"))).select("data.*").show();