Spark non-JSON to JSON and to DataFrame error

I have a JSON-like file (not a real JSON structure). I converted it to JSON and read it through Spark's read.json (we are on Spark 1.6.0, so I can't use the multiline feature from Spark 2 yet). It displays results, but at the same time it errors out. Any help is greatly appreciated.
I have a document like this (just one example, but it is an array):
$result = [
{
'name' => 'R-2018:1583',
'issue_date' => '2018-05-17 02:51:06',
'type' => 'Product Enhancement Advisory',
'last_modified_date' => '2018-05-17 03:51:00',
'id' => 273,
'update_date' => '2018-05-17 02:51:06',
'synopsis' => ' enhancement update',
'advory' => 'R:1583'
}
]
I used it like this:
jsonRDD = sc.wholeTextFiles("/user/xxxx/aa.json").map(lambda x: x[1]).map(lambda x:x.replace('$result =','')).map(lambda x: x.replace("'",'"')).map(lambda x:x.replace("\n","")).map(lambda x:x.replace("=>",":")).map(lambda x:x.replace(" ",""))
sqlContext.read.json(jsonRDD).show()
It displays the results, but I also get the error below. Please help with this.
18/08/31 11:19:30 WARN util.ExecutionListenerManager: Error executing query execution listener
java.lang.ArrayIndexOutOfBoundsException: 0
at org.apache.spark.sql.query.analysis.QueryAnalysis$$anonfun$getInputMetadata$2.apply(QueryAnalysis.scala:121)
at org.apache.spark.sql.query.analysis.QueryAnalysis$$anonfun$getInputMetadata$2.apply(QueryAnalysis.scala:108)
at scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:111)
at scala.collection.immutable.List.foldLeft(List.scala:84)
at org.apache.spark.sql.query.analysis.QueryAnalysis$.getInputMetadata(QueryAnalysis.scala:108)
at com.cloudera.spark.lineage.ClouderaNavigatorListener.writeQueryMetadata(ClouderaNavigatorListener.scala:74)
at com.cloudera.spark.lineage.ClouderaNavigatorListener.onSuccess(ClouderaNavigatorListener.scala:54)
at org.apache.spark.sql.util.ExecutionListenerManager$$anonfun$onSuccess$1$$anonfun$apply$mcV$sp$1.apply(QueryExecutionListener.scala:100)
at org.apache.spark.sql.util.ExecutionListenerManager$$anonfun$onSuccess$1$$anonfun$apply$mcV$sp$1.apply(QueryExecutionListener.scala:99)
at org.apache.spark.sql.util.ExecutionListenerManager$$anonfun$org$apache$spark$sql$util$ExecutionListenerManager$$withErrorHandling$1.apply(QueryExecutionListener.scala:121)
at org.apache.spark.sql.util.ExecutionListenerManager$$anonfun$org$apache$spark$sql$util$ExecutionListenerManager$$withErrorHandling$1.apply(QueryExecutionListener.scala:119)
at scala.collection.immutable.List.foreach(List.scala:318)
at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:32)
at scala.collection.mutable.ListBuffer.foreach(ListBuffer.scala:45)
at org.apache.spark.sql.util.ExecutionListenerManager.org$apache$spark$sql$util$ExecutionListenerManager$$withErrorHandling(QueryExecutionListener.scala:119)
at org.apache.spark.sql.util.ExecutionListenerManager$$anonfun$onSuccess$1.apply$mcV$sp(QueryExecutionListener.scala:99)
at org.apache.spark.sql.util.ExecutionListenerManager$$anonfun$onSuccess$1.apply(QueryExecutionListener.scala:99)
at org.apache.spark.sql.util.ExecutionListenerManager$$anonfun$onSuccess$1.apply(QueryExecutionListener.scala:99)
at org.apache.spark.sql.util.ExecutionListenerManager.readLock(QueryExecutionListener.scala:132)
at org.apache.spark.sql.util.ExecutionListenerManager.onSuccess(QueryExecutionListener.scala:98)
at org.apache.spark.sql.DataFrame.withCallback(DataFrame.scala:2116)
at org.apache.spark.sql.DataFrame.head(DataFrame.scala:1389)
at org.apache.spark.sql.DataFrame.take(DataFrame.scala:1471)
at org.apache.spark.sql.DataFrame.showString(DataFrame.scala:184)
at sun.reflect.GeneratedMethodAccessor55.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:209)
at java.lang.Thread.run(Thread.java:745)

The json function takes the path of a JSON file as a parameter, so you need to first save the JSON somewhere and then read that file back.
Something like this should work:
jsonRDD = (sc.wholeTextFiles("/user/xxxx/aa.json")
    .map(lambda x: x[1])
    .map(lambda x: x.replace('$result =', ''))
    .map(lambda x: x.replace("'", '"'))
    .map(lambda x: x.replace("\n", ""))
    .map(lambda x: x.replace("=>", ":"))
    .map(lambda x: x.replace(" ", "")))

jsonRDD.saveAsTextFile("/user/xxxx/aa_transformed.json")
sqlContext.read.json("/user/xxxx/aa_transformed.json").show()

Related

How to convert JSON objects into arrays [(String, Double)] in Scala json4s?

I don't know how to use a CustomSerializer to convert JSON into Scala data.
I don't need to serialize these parts, only deserialize them, but the deserialization is difficult.
Can someone help me? Thank you very much. The JSON looks like this:
{
  "parameters": {
    "field_owner": true,
    "sample_field": "gender",
    "sample_strategy": "by_ratio",
    "sample_config": [["male", 1500.0], ["female", 1500.0]],
    "with_replacement": true,
    "random_seed": 114514,
    "least_sample_num": 100
  },
  "input_artifacts": {"dataset": {"object_id": "upload/ml_classifier_predict_result"}},
  "output_artifacts": {"predict_result": {"object_id": "ingest_output.csv"}}
}
// Imports used by this snippet; the json4s backend is not shown in the
// question, so jackson is assumed (use org.json4s.native.JsonMethods.parse
// for the native backend).
import org.json4s._
import org.json4s.jackson.JsonMethods.parse
import scala.io.Source

class TupleSerializer extends CustomSerializer[(String, Double)](tuple_formats => (
  {
    // deserialize a two-element array such as ["male", 1500.0] into a tuple
    case JArray(List(JString(x), JDouble(y))) => (x, y)
  },
  {
    // serialization is not needed here, so this side is just a stub
    case x: Tuple2[String, Double] => null
  }
))

implicit val formats: Formats = DefaultFormats + new TupleSerializer
// path and StratifiedSampleArgument come from surrounding code that is not shown
val fileBuffer = Source.fromFile(path)
val jsonString = fileBuffer.mkString
parse(jsonString).extract[StratifiedSampleArgument]
}
Error message:
No usable value for parameters
No usable value for sample_config
Expected object with 1 element but got JArray(List(JString(male), JDouble(1500.0)))
at com.alipay.morse.sgxengine.driver.StratifiedSampleDriverTest.main(StratifiedSampleDriverTest.scala:15)
Caused by: org.json4s.package$MappingException:
No usable value for sample_config
Expected object with 1 element but got JArray(List(JString(male), JDouble(1500.0)))
at com.alipay.morse.sgxengine.driver.StratifiedSampleDriverTest.main(StratifiedSampleDriverTest.scala:15)
Caused by: org.json4s.package$MappingException: Expected object with 1 element but got JArray(List(JString(male), JDouble(1500.0)))
at com.alipay.morse.sgxengine.driver.StratifiedSampleDriverTest.main(StratifiedSampleDriverTest.scala:15)
[INFO]
[INFO] Results:
[INFO]
[ERROR] Errors:
[ERROR] StratifiedSampleDriverTest.main:15 » Mapping No usable value for parameters
No...
[INFO]
[ERROR] Tests run: 1, Failures: 0, Errors: 1, Skipped: 0
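If the CustomSerializer route keeps being bypassed, the pairs can also be pulled out of the parsed tree by hand, without any serializer. This is only a minimal sketch, reusing jsonString from the snippet above and again assuming the jackson backend; sampleConfig is just an illustrative name, not part of the original code:
import org.json4s._
import org.json4s.jackson.JsonMethods.parse

// Walk the parsed JValue down to parameters.sample_config and collect the
// two-element arrays as (String, Double) pairs; anything else is skipped.
val sampleConfig: List[(String, Double)] =
  parse(jsonString) \ "parameters" \ "sample_config" match {
    case JArray(items) =>
      items.collect {
        case JArray(List(JString(k), JDouble(v))) => (k, v)
        case JArray(List(JString(k), JInt(v)))    => (k, v.toDouble)
      }
    case _ => Nil
  }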

Logstash with Elasticsearch input and output keeps looping results

I would like to reindex my logs and filter them again. From what I found on the Internet, the way to do this is to use Logstash to filter the data again.
I tried it, and it does split my data into different fields; however, the data keeps looping. That is, I have 100,000 logs, but after filtering and outputting to Elasticsearch, more than 100,000 logs end up in Elasticsearch and the logs are duplicated. Does anyone have an idea about this?
Moreover, I get the log below when running Logstash. Although it says there was an error parsing JSON, I found that the log can still be filtered. Why is that?
Thank you!
Here is my logstash config:
input {
  elasticsearch {
    hosts => "10.0.132.56"
    index => "logstash-2018.01.04"
  }
}
filter {
  grok {
    match => { "message" => "%{TIMESTAMP_ISO8601:logdate} %{GREEDYDATA:vmname} %{GREEDYDATA:message}" }
    overwrite => [ "message" ]
  }
}
filter {
  json {
    source => "scrmsg"
  }
}
output {
  elasticsearch {
    hosts => ["10.0.132.64:9200"]
    manage_template => false
    index => "logstash-2018.01.04-1"
  }
}
Here is the error log:
[2018-01-11T15:15:32,010][WARN ][logstash.filters.json ] Error parsing json {:source=>"scrmsg", :raw=>"Trident/5.0)\",\"geoip_country\":\"US\",\"allowed\":\"1\",\"threat_score\":\"268435456\",\"legacy_unique_id\":\"\",\"cache_status\":\"-\",\"informed_id\":\"\",\"primitive_id\":\"2BC2D8AD-7AD0-3CAD-9453-B0335F409701\",\"valid_ajax\":\"0\",\"orgin_response_time\":\"0.081\",\"request_id\":\"cd2ae0a8-0921-48b6-b03f-15c71a55100b\",\"bytes_returned_origin\":\"83\",\"server_ip\":\"10.0.10.16\",\"origin_status_code\":\"\",\"calculated_pages_per_min\":\"1\",\"calculated_pages_per_session\":\"1\",\"calculated_session_length\":\"0\",\"k_s\":\"\",\"origin_address\":\"10.0.10.16:443\",\"request_protocol\":\"https\",\"server_serial\":\"5c3eb4ad-3799-4bd8-abb2-42edecd54b99\",\"nginx_worker_process\":\"19474\",\"origin_content_type\":\"application/json;charset=UTF-8\",\"lb_request_time\":\"\",\"SID\":\"\",\"geoip_org\":\"Drake Holdings LLC\",\"accept\":\"*/*\",\"accept_encoding\":\"gzip, deflate\",\"accept_language\":\"\",\"connection\":\"Keep-Alive\",\"http_request_length\":\"418\",\"real_ip_header_value\":\"204.79.180.18\",\"http_host\":\"www.honeyworkshop.com\",\"machine_learning_score\":\"\",\"HSIG\":\"ALE_UHCF\",\"ZID\":\"\",\"ZUID\":\"\",\"datacenter_id\":\"363\",\"new_platform_domain_id\":\"3063fc0b-5b48-4413-9bc7-600039caf64c\",\"whitelist_score\":\"0\",\"billable\":\"1\",\"distil_action\":\"#proxy\",\"js_additional_threats\":\"\",\"js_kv_additional_threats\":\"\",\"re_field_1\":\"\",\"re_field_2\":\"\",\"re_field_3\":\"\",\"http_accept_charset\":\"\",\"sdk_token_id\":\"\",\"sdk_application_instance_id\":\"\",\"per_path_calculated_pages_per_minute\":\"1\",\"per_path_calculated_pages_per_session\":\"1\",\"path_security_type\":\"api\",\"identification_provider\":\"web\",\"identifier_record_pointer\":\"\",\"identifier_record_value\":\"\",\"path_rule_scope_id\":\"\",\"experiment_id\":\"0\",\"experiment_score\":\"\",\"experiment_group_id\":\"\",\"experiment_auxiliary_string\":\"\",\"type\":\"distil\"}\n", :exception=>#<LogStash::Json::ParserError: Unrecognized token 'Trident': was expecting ('true', 'false' or 'null') at [Source: (byte[])"Trident/5.0)","geoip_country":"US","allowed":"1","threat_score":"268435456","legacy_unique_id":"","cache_status":"-","informed_id":"","primitive_id":"2BC2D8AD-7AD0-3CAD-9453-B0335F409701","valid_ajax":"0","orgin_response_time":"0.081","request_id":"cd2ae0a8-0921-48b6-b03f-15c71a55100b","bytes_returned_origin":"83","server_ip":"10.0.10.16","origin_status_code":"","calculated_pages_per_min":"1","calculated_pages_per_session":"1","calculated_session_length":"0","k_s":"","origin_address":"10.0.10"[truncated 1180 bytes]; line: 1, column: 9]>}

Throwing an error using Unisharp Laravel file manager

I have updated Laravel 5.3 to 5.4, and then I got this error using the Laravel filemanager package Unisharp.
FatalErrorException in ItemsController.php line 0: Method
Illuminate\View\View::__toString() must not throw an exception, caught
ErrorException: Cannot use object of type stdClass as array (View:
ProjectName\resources\views\vendor\laravel-filemanager\item.blade.php)
(View:
ProjectName\resources\views\vendor\laravel-filemanager\item.blade.php)
return [
    'html' => (string) view($this->getView())->with([
        'files' => $files,
        'directories' => $directories,
        'items' => array_merge($directories, $files)
    ]),
    'working_dir' => parent::getInternalPath($path)
];
Add this in the vendor views item.blade.php at line 0:
<?php $file = json_decode(json_encode($file),true) ; ?>

MongoDB: How to deal with invalid ObjectIDs

Below is my code to find a document by ObjectID:
def find(selector: JsValue, projection: Option[JsValue], sort: Option[JsValue],
         page: Int, perPage: Int): Future[Seq[JsValue]] = {
  var query = collection.genericQueryBuilder.query(selector).options(
    QueryOpts(skipN = page * perPage)
  )
  projection.map(value => query = query.projection(value))
  sort.map(value => query = query.sort(value.as[JsObject]))
  // this is the line where the call crashes
  query.cursor[JsValue].collect[Vector](perPage).transform(
    success => success,
    failure => failure match {
      case e: LastError => DaoException(e.message, Some(DATABASE_ERROR))
    }
  )
}
Now let's suppose we invoke this method with an invalid ObjectID:
// ObjectId 53125e9c2004006d04b605abK is invalid (ends with a K)
find(Json.obj("_id" -> Json.obj("$oid" -> "53125e9c2004006d04b605abK")), None, None, 0, 25)
The call above causes the following exception when executing query.cursor[JsValue].collect[Vector](perPage) in the find method:
Caused by: java.util.NoSuchElementException: JsError.get
at play.api.libs.json.JsError.get(JsResult.scala:11) ~[play-json_2.10.jar:2.2.1]
at play.api.libs.json.JsError.get(JsResult.scala:10) ~[play-json_2.10.jar:2.2.1]
at play.modules.reactivemongo.json.collection.JSONGenericHandlers$StructureBufferWriter$.write(jsoncollection.scala:44) ~[play2-reactivemongo_2.10-0.10.2.jar:0.10.2]
at play.modules.reactivemongo.json.collection.JSONGenericHandlers$StructureBufferWriter$.write(jsoncollection.scala:42) ~[play2-reactivemongo_2.10-0.10.2.jar:0.10.2]
at reactivemongo.api.collections.GenericQueryBuilder$class.reactivemongo$api$collections$GenericQueryBuilder$$write(genericcollection.scala:323) ~[reactivemongo_2.10-0.10.0.jar:0.10.0]
at reactivemongo.api.collections.GenericQueryBuilder$class.cursor(genericcollection.scala:342) ~[reactivemongo_2.10-0.10.0.jar:0.10.0]
at play.modules.reactivemongo.json.collection.JSONQueryBuilder.cursor(jsoncollection.scala:110) ~[play2-reactivemongo_2.10-0.10.2.jar:0.10.2]
at reactivemongo.api.collections.GenericQueryBuilder$class.cursor(genericcollection.scala:331) ~[reactivemongo_2.10-0.10.0.jar:0.10.0]
at play.modules.reactivemongo.json.collection.JSONQueryBuilder.cursor(jsoncollection.scala:110) ~[play2-reactivemongo_2.10-0.10.2.jar:0.10.2]
at services.common.mongo.MongoDaoComponent$MongoDao$$anon$1.find(MongoDaoComponent.scala:249) ~[classes/:na]
... 25 common frames omitted
Any idea? Thanks.
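One way to deal with this is to validate the id before it ever reaches the query builder, since a legal ObjectID string is exactly 24 hexadecimal characters. A minimal sketch, reusing the find method from the snippet above; the findById helper and the IllegalArgumentException are only illustrative, not part of the original DAO:
import scala.concurrent.Future
import play.api.libs.json.{JsValue, Json}

// 24 hex characters is the only shape a valid ObjectID string can have
val objectIdPattern = "^[0-9a-fA-F]{24}$".r

def findById(id: String): Future[Seq[JsValue]] = id match {
  // well-formed id: build the usual {"_id": {"$oid": ...}} selector
  case objectIdPattern() =>
    find(Json.obj("_id" -> Json.obj("$oid" -> id)), None, None, 0, 25)
  // malformed id: fail fast instead of letting JsError.get blow up inside the driver
  case _ =>
    Future.failed(new IllegalArgumentException(s"Invalid ObjectID: $id"))
}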

Export to CSV using FasterCSV and CSV::Writer (Ruby on Rails)

What I am trying to do: export data to CSV.
I have a form which allows the user to select the format (from a drop-down menu). Based on the selected format, the output is displayed using an Ajax call. It works fine for HTML, but when I select CSV as the format I don't see any pop-up on the screen (asking to save or open the file), and no file gets downloaded directly either.
I tried using FasterCSV (but the problem is that I don't see any pop-up asking me whether I want to save or open the file) and CSV::Writer, where I get this error message on the console:
NoMethodError (You have a nil object when you didn't expect it!
The error occurred while evaluating nil.bytesize):
actionpack (2.3.4) lib/action_controller/streaming.rb:142:in `send_data'
Code using FasterCSV:
def export_to_csv
  csv_string = FasterCSV.generate(:col_sep => ",") do |csv|
    members = ["Versions / Project Members"]
    members_selected.each { |member| members << Stat.member_name(member) }
    Stat.project_members(project).each { |user| members << user.name }
    csv << ["some text", "text 2", "text 3"]
  end
  return csv_string
end
and this is how I am sending the data:
send_data(export_to_csv,:type => 'text/csv; charset=iso-8859-1; header=present',
:disposition => "attachment", :filename => "filename.csv")
I see the response as "some text, text 2, text 3" in the Firebug console, but there is no pop-up asking whether I want to save or open the file.
This is what I am doing using CSV::Writer:
def export_to_csv
  report = StringIO.new
  CSV::Writer.generate(report, ',') do |csv|
    csv << ['c1', 'c2']
  end
end
and call it as:
send_data(export_to_csv,:type => 'text/csv; charset=iso-8859-1; header=present',
:disposition => "attachment", :filename => "filename.csv")
This is the error which is thrown on the console:
NoMethodError (You have a nil object when you didn't expect it!
The error occurred while evaluating nil.bytesize):
actionpack (2.3.4) lib/action_controller/streaming.rb:142:in `send_data'
send_data is trying to reference an object that is out of scope. Check your closing 'end' statement