How to filter pyarrow.lib.Date32Value values?

For example, if you load a pyarrow parquet dataset, you can get at the data, but is there an easy way of filtering it before converting to datetime.date? Since datetime.date is a Python object, it would be good to have a fast way of cutting the data down before constructing many objects. Or maybe I am missing something that makes this unnecessary.
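Something along these lines is what I have in mind; this is just a rough sketch assuming a recent pyarrow that ships the pyarrow.compute module, and the column name event_date and the file path are placeholders:

import datetime

import pyarrow as pa
import pyarrow.compute as pc
import pyarrow.parquet as pq

table = pq.read_table("data.parquet")           # placeholder path

# Build a boolean mask on the date32 column while everything is still Arrow,
# so no datetime.date objects are created for rows that will be dropped.
cutoff = pa.scalar(datetime.date(2020, 1, 1), type=pa.date32())
mask = pc.greater_equal(table["event_date"], cutoff)

filtered = table.filter(mask)                   # still Arrow data
dates = filtered["event_date"].to_pylist()      # date objects only for the survivors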

Related

Is there a way to get the column names of a dataframe in pyspark without reading the whole dataset?

I have huge datasets in my HDFS environment, say 500+ datasets, all of them around 100M+ rows. I want to get only the column names of each dataset without reading the whole dataset, because that would take too long. My data are JSON formatted and I'm reading them with the classic Spark JSON reader: spark.read.json('path'). So what's the best way to get the column names without wasting time and memory?
Thanks...
From the official doc:
If the schema parameter is not specified, this function goes through the input once to determine the input schema.
Therefore, spark.read.json cannot give you the column names by reading only the first line.
Still, you can do an extra step first: extract one line, create a dataframe from it, and then extract the column names.
One answer could be the following :
Read the data using the spark.read.text('path') method
Limit the number of rows to 1 with limit(1), since we just want one record to read the column names from
Convert the DataFrame to an RDD and collect it as a list with collect()
Convert the first collected row from a unicode string to a Python dict (since I'm working with JSON formatted data)
The keys of that dict are exactly what we are looking for (the column names, as a Python list)
This code worked for me:
from ast import literal_eval

# Read the file as raw text, keep only the first line, pull it back to the
# driver, parse that single JSON line into a dict, and take its keys.
literal_eval(spark.read.text('path').limit(1)
             .rdd.flatMap(lambda x: x)   # Row(value=...) -> plain string
             .collect()[0]).keys()
The reason this is faster is probably that pyspark won't parse the whole dataset with all its field structures when you read it in text format (everything is read as one big string per line); it's lighter and more efficient for this specific case.

Change result set structure in Elasticsearch

Is it possible to change the result set structure when retrieving data from Elasticsearch?
The problem is that the time series data sometimes run to 3000-8000 records, returned as a JSON array with JSON objects in it. Parsing that is not really efficient or necessary in this case, so I thought: could a result set be transformed into, let's say, a simple JSON object with an array of times and an array of values, nothing more?
I could do this in Java or PHP, but since we want an efficient way of dealing with large datasets, we are currently evaluating our options.
You can control what elasticsearch returns using source filtering:
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-source-filtering.html
It lets you pick which parts of the indexed document are returned, which, depending on your index structure, could already be an array of times and values, or at least something very easily mapped to one in the language of your choice.
Another possibility is to use scripting to control the results. If you map the result that way, you should be able to get the hits object back as a JSON array of key: value pairs.
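For illustration, a rough sketch using the elasticsearch-py client; the index and field names ("metrics", "timestamp", "value") are made up, the "_source" list is the source-filtering part, and the reshape at the end is the "easily mapped" step in your language of choice:

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

resp = es.search(
    index="metrics",
    body={
        "size": 8000,
        "query": {"match_all": {}},
        # source filtering: only return the two fields we actually need
        "_source": ["timestamp", "value"],
    },
)

# Reshape the hits into the two parallel arrays the question asks for
hits = resp["hits"]["hits"]
times = [h["_source"]["timestamp"] for h in hits]
values = [h["_source"]["value"] for h in hits]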

Possibilities for structuring ingested JSON data using NiFi

Is it possible, using NiFi, to load a JSON file into a structured table?
I've pulled the following weather forecast data (from 6000 weather stations), which I'm currently loading into HDFS. It all appears on one line:
{"SiteRep":{"Wx":{"Param":[{"name":"F","units":"C","$":"Feels Like Temperature"},{"name":"G","units":"mph","$":"Wind Gust"},{"name":"H","units":"%","$":"Screen Relative Humidity"},{"name":"T","units":"C","$":"Temperature"},{"name":"V","units":"","$":"Visibility"},{"name":"D","units":"compass","$":"Wind Direction"},{"name":"S","units":"mph","$":"Wind Speed"},{"name":"U","units":"","$":"Max UV Index"},{"name":"W","units":"","$":"Weather Type"},{"name":"Pp","units":"%","$":"Precipitation Probability"}]},"DV":{"dataDate":"2017-01-12T22:00:00Z","type":"Forecast","Location":[{"i":"14","lat":"54.9375","lon":"-2.8092","name":"CARLISLE AIRPORT","country":"ENGLAND","continent":"EUROPE","elevation":"50.0","Period":{"type":"Day","value":"2017-01-13Z","Rep":{"D":"WNW","F":"-3","G":"25","H":"67","Pp":"0","S":"13","T":"2","V":"EX","W":"1","U":"1","$":"720"}}},{"i":"22","lat":"53.5797","lon":"-0.3472","name":"HUMBERSIDE AIRPORT","country":"ENGLAND","continent":"EUROPE","elevation":"24.0","Period":{"type":"Day","value":"2017-01-13Z","Rep":{"D":"NW","F":"-2","G":"43","H":"63","Pp":"3","S":"25","T":"4","V":"EX","W":"3","U":"1","$":"720"}}}, .....
Ideally, I want this structured into a 6000-row table.
I've tried writing a schema to pass the above into Pig, but haven't been successful, probably because I'm not familiar enough with JSON to translate it correctly.
Casting around for an easy way to add some structure to the data, I've spotted that there's a PutHBaseJson processor in NiFi.
Can anyone advise if this PutHBaseJson processor would work with the above data structure? And if so, can anyone point me towards a decent tutorial to give me a starting point on the configuration?
Greatly appreciate any guidance.
You probably want to use the SplitJson processor to split the 6000 record JSON structure into 6000 individual flowfiles. If you need to "inject" the parameter definitions from the top-level response, you can do a ReplaceText or JoltTransformJSON operation to manipulate the individual JSON records. Here is a good article by Yolanda Davis describing how to perform Jolt transforms (JSON -> JSON) in NiFi.
Once you have the individual flowfiles containing a single JSON record, putting them into HBase is very easy. Bryan Bende wrote an article describing the necessary configurations for the PutHBaseJson processor.
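To make that concrete, here is a plain-Python illustration (not NiFi configuration) of roughly what a SplitJson on something like $.SiteRep.DV.Location, followed by a small per-record transform, produces from the payload above; the flattened field names are just one possible layout:

import json

with open("forecast.json") as f:              # hypothetical local copy of the response
    doc = json.load(f)

records = []
for loc in doc["SiteRep"]["DV"]["Location"]:
    rep = loc["Period"]["Rep"]                # the forecast values for this station
    records.append({
        "station": loc["name"],
        "lat": loc["lat"],
        "lon": loc["lon"],
        **rep,                                # D, F, G, H, Pp, S, T, V, W, U, "$"
    })

# Each element of records corresponds to one flowfile, i.e. one row of the
# 6000-row table the question is after.
print(len(records), records[0])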

Read small JSON files with a declared schema from an S3 bucket - Spark performance issue

I have a huge number (35k) of small (16kb) JSON files stored in an S3 bucket. I need to load them into a DataFrame for further processing; here is my extraction code:
val jsonData = sqlContext.read.json("s3n://bucket/dir1/dir2")
.where($"nod1.filter1"==="filterValue")
.where($"nod2.subNode1.subSubNode2.created"(0)==="filterValue2")
I'm storing this data in a temp table and using it for further operations (exploding nested structures into separate data frames):
jsonData.registerTempTable("jsonData")
So now I have an auto-generated schema for this deeply nested dataframe.
With the above code I have terrible performance issues. I presume it's caused by not using sc.parallelize during the bucket load, and moreover I'm pretty sure that the schema auto-generation in the read.json() method is taking a lot of time.
Questions part:
What should my bucket load look like to make it more efficient and faster?
Is there any way to declare this schema in advance (I need to work around the case class tuple problem, though) to avoid auto-generation?
Does filtering the data during load make sense, or should I simply load everything and filter afterwards?
Found so far:
sqlContext.jsonRDD(rdd, schema)
It takes care of the auto-generated schema part, but IntelliJ screams that the method is deprecated; is there any alternative to it?
As an alternative to a case class, use a custom class that implements the Product interface; the DataFrame will then use the schema exposed by your class members without the case class constraints. See the in-line comment here: http://spark.apache.org/docs/latest/sql-programming-guide.html#inferring-the-schema-using-reflection
If your JSON is composed of unrooted fragments, you could use s3distcp to group the files and concatenate them into fewer files. Also try the s3a protocol, as it performs better than s3n.
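On the declare-the-schema-in-advance question, here is a pyspark-flavoured sketch of handing the reader an explicit schema so the inference pass over all 35k files is skipped; the Scala DataFrameReader has the same schema(...) method, and the field types below are guesses based only on the filters in the question:

from pyspark.sql.types import ArrayType, StringType, StructField, StructType

schema = StructType([
    StructField("nod1", StructType([
        StructField("filter1", StringType()),
    ])),
    StructField("nod2", StructType([
        StructField("subNode1", StructType([
            StructField("subSubNode2", StructType([
                StructField("created", ArrayType(StringType())),
            ])),
        ])),
    ])),
])

jsonData = (sqlContext.read
            .schema(schema)                     # no inference pass over the bucket
            .json("s3a://bucket/dir1/dir2")     # s3a, as suggested above
            .where("nod1.filter1 = 'filterValue'"))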

JSON from Mongo to CSV in Scala

I am retrieving some records from Mongo using Scala. These are plain JSON records, and I need to convert them to CSV. Is there a method or library that does this? As far as I've searched, there are no such converters or libraries for Scala. Basically I want to do something like this: the fields are not known in advance, but for a particular query the fields returned will be the same. For example, if I query for apple, every result will have the same fields, like:
{ "id" : "some", "type" : "no-type", "extra" : "somedata" }
Say there are 100 records; how do I find out the fields in these records and export them to a CSV file?
Sadly, for Scala you're out of luck. You will need to decide how you want your CSV file to be formatted and how you will deal with any missing tags (if possible).
There is a good example here, not in Scala but the process should be similar: http://michelleminkoff.com/2011/02/01/making-the-structured-usable-transform-json-into-a-csv/
For reading JSON and using Mongo, check out Beaucatcher: http://beaucatcher.org/
For writing files, use the scala-io library.
Here's a good CSV parser in Scala for future use: https://gist.github.com/115557
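In the same spirit as that walkthrough, a rough Python sketch of the process (a Scala version would follow the same steps): take the union of keys across all records as the header, then write one row per record, leaving missing fields blank. The file names are placeholders and the input is assumed to hold one JSON record per line:

import csv
import json

with open("apple_records.json") as f:          # hypothetical export of the query result
    records = [json.loads(line) for line in f]

# Union of all field names, so records with missing tags still fit the header
fieldnames = sorted({key for rec in records for key in rec})

with open("apple_records.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=fieldnames, restval="")
    writer.writeheader()
    writer.writerows(records)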