How to write a CSV file with headers using Akka Stream Alpakka?

I can't seem to find it, hence I turn to Slack to ask: is there a way to write a CSV file with its headers using Akka Stream Alpakka?
The only thing I see is https://doc.akka.io/docs/alpakka/current/data-transformations/csv.html#csv-formatting
but there is no reverse of the CSV-to-map operation, i.e. nothing that writes a header row out.
My use case is that I need to read a few CSV files, filter their content, and write the cleaned content to a corresponding file originalcsvfilename-cleaned.csv.
If it is not directly supported, is there any recommendation?

You can do something like this:
// `source` is your Source[T, _] of domain elements; T is your record type
def csv_header(elem: T): List[String] = ???
def csv_line(elem: T): List[String] = ???

def firstTrueIterator(): Iterator[Boolean] = (Iterator single true) ++ (Iterator continually false)
def firstTrueSource: Source[Boolean, _] = Source fromIterator firstTrueIterator

def processData(elem: T, firstRun: Boolean): List[List[String]] = {
  if (firstRun) {
    List(
      csv_header(elem),
      csv_line(elem)
    )
  } else {
    List(csv_line(elem))
  }
}

val finalSource = source
  .zipWith(firstTrueSource)(processData)
  .mapConcat(identity)
  .via(CsvFormatting.format())
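For the original use case (read a CSV that already contains a header line, filter its rows, and write the cleaned rows to originalcsvfilename-cleaned.csv), the parsing and formatting stages can be combined with FileIO and the zip trick is not even needed. This is only a rough sketch under those assumptions; the keepRow predicate and the file names are placeholders, not part of the original answer.
import java.nio.file.Paths
import akka.actor.ActorSystem
import akka.stream.scaladsl.FileIO
import akka.stream.alpakka.csv.scaladsl.{CsvFormatting, CsvParsing}

implicit val system: ActorSystem = ActorSystem("csv-clean") // provides the materializer on recent Akka versions

def keepRow(row: List[String]): Boolean = ??? // hypothetical filtering rule

val in  = Paths.get("originalcsvfilename.csv")
val out = Paths.get("originalcsvfilename-cleaned.csv")

val done = FileIO.fromPath(in)
  .via(CsvParsing.lineScanner())                            // ByteString chunks -> List[ByteString] per CSV line
  .map(_.map(_.utf8String))                                 // decode every column to String
  .zipWithIndex
  .filter { case (row, idx) => idx == 0L || keepRow(row) }  // always keep the header line
  .map { case (row, _) => row }
  .via(CsvFormatting.format())                              // columns -> CSV-formatted ByteString
  .runWith(FileIO.toPath(out))                              // Future[IOResult]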

Related

pandas to_json, django and d3.js for visualisation

I have a pandas dataframe that I converted to JSON in order to create graphs and visualize them with d3.js. I would like to know how to send this JSON from Django (in the view or the template) so that it can be visualized with d3.js.
def parasol_view(request):
    parasol = function_parasol()
    parasol_json = parasol.to_json(orient='records')
    parasol = parasol.to_html(index=False, table_id="table_parasol")
    context = {
        'parasol': parasol,
        'parasol_json': parasol_json
    }
    return render(request, 'parasol.html', context)
Template:
{% block content %}
{{ parasol|safe }}
{{ parasol_json|safe }}
{% endblock content %}
I'm not sure what the parasol.to_html is for, therefore I left that part untouched.
But this is what I would do in order to use your .json file:
Views.py:
def parasol_view(request):
    parasol = function_parasol()
    # parasol_json = parasol.to_json(orient='records')
    parasol = parasol.to_html(index=False, table_id="table_parasol")
    context = {
        'parasol': parasol
        # 'parasol_json': parasol_json
    }
    return render(request, 'parasol.html', context)
Function parasol:
def function_parasol():
    # whatever code you have here that builds the parasol dataframe
    # I made a new variable parasol2 so that the parasol that is returned
    # stays the same as before and can still be used for parasol.to_html
    parasol2 = parasol.to_json(orient='records')
    text_file = open("parasol.json", "w")  # name of the file, make sure it ends with .json
    text_file.write(parasol2)
    text_file.close()
    return parasol
The JavaScript file where you want to make a graph with d3:
// so basically var_name = parasol.json
d3.json("/parasol.json", function(var_name) {
    // Go make your graphs
});
P.S. If you upload another file that gets parsed into a JSON file, parasol.json will simply be overwritten each time, so you won't end up with an abundance of JSON files.

How to properly merge multiple FlowFiles?

I use MergeContent 1.3.0 in order to merge FlowFiles from 2 sources: 1) from ListenHTTP and 2) from QueryElasticsearchHTTP.
The problem is that the merging result is a list of JSON strings. How can I convert them into a single JSON string?
{"event-date":"2017-08-08T00:00:00"}{"event-date":"2017-02-23T00:00:00"}{"eid":1,"zid":1,"latitude":38.3,"longitude":2.4}
I would like to get this result:
{"event-date":["2017-08-08T00:00:00","2017-02-23T00:00:00"],"eid":1,"zid":1,"latitude":38.3,"longitude":2.4}
Is it possible?
UPDATE:
After changing the data structure in Elastic, I was able to come up with the following output from MergeContent. Now I have a common field eid in both JSON strings. I would like to merge these strings by eid in order to get a single JSON file. Which processor should I use?
{"eid":"1","zid":1,"latitude":38.3,"longitude":2.4}{"eid":"1","dates":{"event-date":["2017-08-08","2017-02-23"]}}
I need to get the following output:
{"eid":"1","zid":1,"latitude":38.3,"longitude":2.4,"dates":{"event-date":["2017-08-08","2017-02-23"]}}
It was suggested to use ExecuteScript to merge files. However, I cannot figure out how to do this. This is what I tried:
import json
import java.io
from org.apache.commons.io import IOUtils
from java.nio.charset import StandardCharsets
from org.apache.nifi.processor.io import StreamCallback

class ModJSON(StreamCallback):
    def __init__(self):
        pass

    def process(self, inputStream, outputStream):
        text = IOUtils.toString(inputStream, StandardCharsets.UTF_8)
        obj = json.loads(text)
        newObj = {
            "eid": obj['eid'],
            "zid": obj['zid'],
            ...
        }
        outputStream.write(bytearray(json.dumps(newObj, indent=4).encode('utf-8')))

flowFile1 = session.get()
flowFile2 = session.get()
if (flowFile1 != None and flowFile2 != None):
    # WHAT SHOULD I PUT HERE??
    flowFile = session.write(flowFile, ModJSON())
    flowFile = session.putAttribute(flowFile, "filename", flowFile.getAttribute('filename').split('.')[0] + '_translated.json')
session.transfer(flowFile, REL_SUCCESS)
session.commit()
Here is an example of how to read multiple files from the incoming queue using filtering.
Assume you have multiple pairs of flow files with the following content:
{"eid":"1","zid":1,"latitude":38.3,"longitude":2.4}
and
{"eid":"1","dates":{"event-date":["2017-08-08","2017-02-23"]}}
The same value of the eid field provides the link between the files in a pair.
Before merging, we have to extract the value of the eid field and put it into an attribute of the flow file for fast filtering.
Use the EvaluateJsonPath processor with properties:
Destination : flowfile-attribute
eid : $.eid
After this you'll have a new eid attribute on the flow file.
Then use the ExecuteScript processor with Groovy as the script language and the following code:
import org.apache.nifi.processor.FlowFileFilter;
import org.apache.nifi.processor.io.OutputStreamCallback
import org.apache.nifi.flowfile.FlowFile
import groovy.json.JsonSlurper
import groovy.json.JsonBuilder

//get first flow file
def ff0 = session.get()
if(!ff0)return

def eid = ff0.getAttribute('eid')

//try to find files with the same attribute in the incoming queue
def ffList = session.get(new FlowFileFilter(){
    public FlowFileFilterResult filter(FlowFile ff) {
        if( eid == ff.getAttribute('eid') )return FlowFileFilterResult.ACCEPT_AND_CONTINUE
        return FlowFileFilterResult.REJECT_AND_CONTINUE
    }
})

//let's assume you require exactly one additional file in the queue with the same attribute
if( !ffList || ffList.size()<1 ){
    //if less than required
    //rollback the current session and penalize the retrieved files so they go to the end of the incoming queue
    //with the pre-configured penalty delay (default 30sec)
    session.rollback(true)
    return
}

//let's put all in one list to simplify later iterations
ffList.add(ff0)

if( ffList.size()>2 ){
    //unexpected situation: you have more files than expected
    //redirect all of them to failure
    session.transfer(ffList, REL_FAILURE)
    return
}

//create empty map (aka json object)
def json = [:]

//iterate through the files, parse and merge their content
ffList.each{ff->
    session.read(ff).withStream{rawIn->
        def fjson = new JsonSlurper().parse(rawIn)
        json.putAll(fjson)
    }
}

//create a new flow file and write the merged json as its content
def ffOut = session.create()
ffOut = session.write(ffOut,{rawOut->
    rawOut.withWriter("UTF-8"){writer->
        new JsonBuilder(json).writeTo(writer)
    }
} as OutputStreamCallback )

//set mime-type
ffOut = session.putAttribute(ffOut, "mime.type", "application/json")

session.remove(ffList)
session.transfer(ffOut, REL_SUCCESS)
Joining together two different types of data is not really what MergeContent was made to do.
You would need to write a custom processor, or custom script, that understood your incoming data formats and created the new output.
If you have ListenHttp connected to QueryElasticSearchHttp, meaning that you are triggering the query based on the flow file coming out of ListenHttp, then you may want to make a custom version of QueryElasticSearchHttp that takes the content of the incoming flow file and joins it together with any of the outgoing results.
Here is where the query result is currently written to a flow file:
https://github.com/apache/nifi/blob/master/nifi-nar-bundles/nifi-elasticsearch-bundle/nifi-elasticsearch-processors/src/main/java/org/apache/nifi/processors/elasticsearch/QueryElasticsearchHttp.java#L360
Another option is to use ExecuteScript and write a script that could take multiple flow files and merge them together in the way you described.

How to get all the data I have inserted?

I have made a small app using JSON and ReactiveMongo which inserts students' information.
object Applications extends Controller {
  val studentDao = StudentDaoAndEntity
  val studentqueryReader: Reads[JsObject] = implicitly[Reads[JsObject]]

  def saveStudent = Action.async(parse.json) { request =>
    request.body.validate[StudentInfo].map {
      k => studentDao.insertStudent(k).map {
        l => Ok("Successfully inserted")
      }
    }.getOrElse(Future.successful(BadRequest("Invalid Json")))
  }
}
In the database:
object StudentDaoAndEntity {
  val sreader: Reads[StudentInfo] = Json.reads[StudentInfo]
  val swriter: Writes[StudentInfo] = Json.writes[StudentInfo]
  val studentqueryReader: Reads[JsObject] = implicitly[Reads[JsObject]]

  def db = ReactiveMongoPlugin.db
  def collection: JSONCollection = db[JSONCollection]("student")

  def insertStudent(student: StudentInfo): Future[JsObject] = {
    val modelToJsObj = swriter.writes(student).as[JsObject]
    collection.insert(modelToJsObj) map (_ => modelToJsObj)
  }
}
This works fine. Now I need to get all the data I have inserted. How can I do that? I am not asking for code but for the idea.
First of all: it seems that you are using Play-ReactiveMongo (as far as I know, JSONCollection is not part of ReactiveMongo itself). If this is the case, then your code is unnecessarily complex. Instead of doing the JSON conversions manually, you can just pass your StudentInfo objects directly to insert. Minimal example:
val studentInfo: StudentInfo = ...
def collection: JSONCollection = db[JSONCollection]("student")
collection.insert(studentInfo)
That's the elegant part of the Play plugin. Yes, MongoDB persists data as JSON (or BSON, to be more precise), but you don't have to deal with it. Just make sure the implicit Writes (or Reads, in case of querying) is in scope, as well as other necessary imports (e.g. play.modules.reactivemongo.json._).
Now I need to get all the data I have inserted. How can I do that? I am not asking for code but for the idea.
Well, you want to have a look at the documentation (scroll down for examples); it's quite simple and there is not much more to it. In your case, it could look like this:
// perform query via cursor
val cursor: Cursor[StudentInfo] =
  collection.find(Json.obj("lastName" -> "Regmi")).cursor[StudentInfo]
// gather results as list
val futureStudents: Future[List[StudentInfo]] = cursor.collect[List]()
In this case, you get all students with the last name Regmi. If you really want to retrieve all students, then you probably need to pass an empty JsObject as your query. Again, it's not necessary to deal with JSON conversions, as long as the implicit Reads is in scope.
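For the find-all case, a minimal sketch under those assumptions could look like the following (names are taken from the question; the Reads defined by sreader has to be available implicitly):
// an empty query object matches every document in the collection
def findAllStudents(): Future[List[StudentInfo]] =
  collection.find(Json.obj())
    .cursor[StudentInfo]
    .collect[List]()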
Here is the complete answer to my own question:
package controllers

def findAll = Action.async {
  val allStd = Json.obj() // empty query: match all documents
  StudentDaoAndEntity.findAllStudent(allStd) map {
    case Nil => Ok("Student Not Found")
    case l: Seq[JsObject] => Ok(Json.toJson(l))
  }
}

def findAllStudent(allStd: JsObject): Future[Seq[JsObject]] = {
  // gather all the JsObjects in a list
  collection.find(allStd).cursor[JsObject].collect[List]()
}

Spark file load - `try` and `except` in scala

I wish to read a file that is in one of two locations: I want to try the first location and, if that fails, try the second. In Python I would use try, and if an IOError (file does not exist) is raised, read from the second location in the except block. I can read one location in Scala like this:
val vertices_raw = sqlContext.read.json("location_a/file.json")
I have tried the following, using getOrElse:
val vertices_raw = sqlContext.read.json("location_a/file.json") getOrElse vertices_raw = sqlContext.read.json("location_b/file.json")
However, this did not compile.
You can do the same thing in Scala:
val vertices_raw: DataFrame = try {
  sqlContext.read.json("location_a/file.json")
} catch {
  case e: Exception => sqlContext.read.json("location_b/file.json")
}
Or alternatively:
import scala.util.Try

val vertices_raw =
  Try(sqlContext.read.json("location_a/file.json"))
    .getOrElse(sqlContext.read.json("location_b/file.json"))
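If the file could live in more than two places, the same Try-based idea generalizes to a list of candidate paths. A rough sketch, where the list of locations is a made-up placeholder:
import scala.util.{Success, Try}

// hypothetical candidate locations, tried in order until one can be read
val candidates = Seq("location_a/file.json", "location_b/file.json", "location_c/file.json")

val vertices_raw = candidates.iterator
  .map(path => Try(sqlContext.read.json(path)))   // lazily attempt each location
  .collectFirst { case Success(df) => df }        // stop at the first successful read
  .getOrElse(sys.error("file.json was not found in any of the candidate locations"))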

Optimal way to read out JSON from MongoDB into a Scalatra API

I have a pre-formatted JSON blob stored as a string in MongoDB as a field in one of my collections. Currently, in my Scalatra-based API, I have a before filter that renders all of my responses with a JSON content type. An example of how I return the content looks like the following:
get ("/boxscore", operation(getBoxscore)) {
val game_id:Int = params.getOrElse("game_id", "3145").toInt
val mongoColl = mongoDb.apply("boxscores")
val q: DBObject = MongoDBObject("game_id" -> game_id)
val res = mongoColl.findOne(q)
res match {
case Some(j) => JSON.parseFull(j("json_body").toString)
case None => NotFound("Requested document could not be found.")
}
}
Now this certainly does work, but it doesn't seem like the "Scala" way of doing things, and I feel like it can be optimized. The worrisome part to me is that when I add a caching layer and the cache does not hit, I am spending additional CPU time re-parsing a String I already formatted as JSON in MongoDB:
JSON.parseFull(j("json_body").toString)
I have to take the result from findOne(), run .toString on it, and then re-parse it into JSON. Is there a more optimal route? Since the JSON is already stored as a String in MongoDB, I'm guessing a serializer / case class isn't the right solution here. Of course I can just leave what's here, but I'd like to learn if there's a way that would be more Scala-like and CPU-friendly going forward.
There is the option to extend Scalatra's render pipeline with handling for MongoDB classes. The following two routes act as an example. They return a MongoCursor and a DBObject as result. We are going to convert those to a string.
get("/") {
mongoColl.find
}
get("/:key/:value") {
val q = MongoDBObject(params("key") -> params("value"))
mongoColl.findOne(q) match {
case Some(x) => x
case None => halt(404)
}
}
In order to handle the types we need to define a partial function which takes care of the conversion and sets the appropriate content type.
There are two cases: the first one handles a DBObject. The content type is set to "application/json" and the object is converted to a string by calling the toString method. The second case handles a MongoCursor. Since it implements TraversableOnce, the map function can be used.
def renderMongo = {
  case dbo: DBObject =>
    contentType = "application/json"
    dbo.toString

  case xs: TraversableOnce[_] => // handles a MongoCursor, be aware of type erasure here
    contentType = "application/json"
    val ls = xs map (x => x.toString) mkString(",")
    "[" + ls + "]"
}: RenderPipeline
(Note the following type definition: type RenderPipeline = PartialFunction[Any, Any])
Now the method needs to get hooked in. After an HTTP call has been handled, the result is forwarded to the render pipeline for further conversion. Custom handling can be added by overriding the renderPipeline method from ScalatraBase. With the following definition the renderMongo function is called first:
override protected def renderPipeline = renderMongo orElse super.renderPipeline
This is a basic approach to handle MongoDB types. There are other options as well, for example by making use of json4s-mongo.
Here is the previous code in a working sample project.
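As a follow-up to the original concern about re-parsing: once the content type is set explicitly, the stored json_body string can also be returned as-is, without the JSON.parseFull round trip. This is only a sketch based on the route from the question, not part of the original answer:
get("/boxscore", operation(getBoxscore)) {
  val game_id: Int = params.getOrElse("game_id", "3145").toInt
  val q: DBObject = MongoDBObject("game_id" -> game_id)
  mongoDb("boxscores").findOne(q) match {
    case Some(doc) =>
      contentType = "application/json" // the field already holds valid JSON
      doc("json_body").toString        // return the pre-formatted string untouched
    case None => NotFound("Requested document could not be found.")
  }
}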