How to convert pyarrow.Table columnar data to tabular, row-like, data - pyarrow

I have some columnar data in a PyArrow Table. How can I convert it back from a dict of lists to a list of dicts?

The following code snippet lets you iterate the table efficiently, using pyarrow.RecordBatch.to_pydict() as a working buffer.
See the full example below.
"""Columnar data manipulation utilities."""
from typing import Iterable, Dict
def iterate_columnar_dicts(inp: Dict[str, list]) -> Iterable[Dict[str, object]]:
"""Iterates columnar dict data as rows.
Useful for constructing rows/objects out from :py:class:`pyarrow.Table` or :py:class:`pyarrow.RecordBatch`.
Example:
.. code-block:: python
#classmethod
def create_from_pyarrow_table(cls, table: pa.Table) -> "PairUniverse":
pairs = {}
for batch in table.to_batches(max_chunksize=5000):
d = batch.to_pydict()
for row in iterate_columnar_dicts(d):
pairs[row["pair_id"]] = DEXPair.from_dict(row)
return PairUniverse(pairs=pairs)
:param inp: Input dictionary of lists e.g. one from :py:method:`pyarrow.RecordBatch.to_pydict`. All lists in the input must be equal length.
:return: Iterable that gives one dictionary per row after transpose
"""
keys = inp.keys()
for i in range(len(inp)):
item = {key: inp[key][i] for key in keys}
yield item
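For example, a minimal round trip might look like this (a sketch; the table contents here are made up for illustration):

import pyarrow as pa

table = pa.table({"pair_id": [1, 2], "price": [1.5, 2.5]})
for batch in table.to_batches(max_chunksize=5000):
    for row in iterate_columnar_dicts(batch.to_pydict()):
        print(row)
# {'pair_id': 1, 'price': 1.5}
# {'pair_id': 2, 'price': 2.5}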

Related

How to convert a multi-dimensional array in JSON into separate columns in pandas

I have a DB collection consisting of nested strings. I am trying to convert the contents under the "status" column into separate columns against each order ID, in order to track the time taken from "order confirmed" to "pick up confirmed". The string looks as follows:
I have tried the same using
xyz_db = db.logisticsOrders             # DB collection
df = pd.DataFrame(list(xyz_db.find()))  # JSON to dataframe
Using normalize:
parse1=pd.json_normalize(df['status'])
It works fine in the case of non-nested arrays, but since status is a nested array the output is as follows:
Using a for loop:
data = df[['orderid', 'status']]
data = list(data['status'])
dfy = pd.DataFrame(columns=['statuscode', 'statusname', 'laststatusupdatedon'])
for i in range(0, len(data)):
    result = data[i]
    dfy.loc[i] = [data[i][0], data[i][0], data[i][0], data[i][0]]
It gives the result in the form of appended rows, which is not the format I am trying to achieve.
The output I am trying to get is:
Please help out!
I'll share the approach I used for reading JSON; maybe it helps you. You can use it with two or more lists:
def jsonify(z):
    genr = []
    if z == z and z is not None:  # z == z is False for NaN, so this skips missing values
        z = eval(z)  # parse the string into a Python object
        if type(z) in (dict, list, tuple):
            for dic in z:
                for key, val in dic.items():
                    if key == "name":
                        genr.append(val)
        else:
            return None
    else:
        return None
    return genr

top_genr['genres_N'] = top_genr['genres'].apply(jsonify)
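A more direct route for the original question, assuming each status entry is a list of dicts per order (the column names below mirror the question, but the sample values are made up), is pd.json_normalize with record_path:

import pandas as pd

df = pd.DataFrame({
    "orderid": [101],
    "status": [[
        {"statuscode": "OC", "statusname": "order confirmed", "laststatusupdatedon": "2021-01-01"},
        {"statuscode": "PC", "statusname": "pick up confirmed", "laststatusupdatedon": "2021-01-02"},
    ]],
})

# record_path explodes the nested list; meta carries orderid along with each row
flat = pd.json_normalize(df.to_dict("records"), record_path="status", meta=["orderid"])
print(flat)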

Get JuliaDB.loadtable() to parse all columns as String

I want JuliaDB.loadtable() to read a CSV (really a bunch of CSVs, but for simplicity let's try just one), where all columns are parsed as String.
Here's what I've tried:
using CSV
using DataFrames
using JuliaDB
df1 = DataFrame(
    [['a', 'b', 'c'], [1, 2, 3]],
    ["name", "id"]
)
CSV.write("df1.csv", df1)
# This works, but if I have 10+ columns it would get unwieldy
df1 = loadtable("df1.csv"; colparsers=Dict(:name=>String, :id=>String),)
# This doesn't work
df1 = loadtable("df1.csv"; colparsers=String,)
# MethodError: no method matching iterate(::Type{String})
Here's how it's done in R:
df1 = read.csv("df1.csv", colClasses = "character")
If you know the number of columns (or just an upper bound on it), you can use types, I should think (from the CSV.jl documentation):
types: a Vector or Dict of types to be used for column types; a Dict can map column index Int, or name Symbol or String, to the type for a column, i.e. Dict(1=>Float64) will set the first column as a Float64, Dict(:column1=>Float64) will set the column named column1 to Float64, and Dict("column1"=>Float64) will set column1 to Float64; if a Vector is provided, it must match the # of columns provided or detected in the header
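As an aside, the same "everything as strings" read in pandas (not JuliaDB, just for comparison with the R call above) would be:

import pandas as pd

# dtype=str forces every column to be parsed as a string
df1 = pd.read_csv("df1.csv", dtype=str)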

Spark - creating a DataFrame from JSON - only the first row is processed

I'm trying to create a DataFrame with JSON strings from a text file. First I merge JValues:
def mergeSales(salesArray: JArray, metrics: JValue): List[String] = {
  salesArray.children
    .map(sale => sale merge metrics)
    .map(merged => compact(render(merged)))
}
Then I write the strings to a file:
out.write(mergedSales.flatMap(s => s getBytes "UTF-8").toArray)
The data in the resulting file looks like this; there are no commas between the objects and no newlines:
{"store":"New Sore_1","store_id":"10","store_metric":"1234567"}{"store":"New Sore_1","store_id":"10","store_metric":"98765"}
The problem is that when I create a DataFrame, it contains only the first row (with store_metric 1234567) and ignores the second one.
What is my mistake in creating the DataFrame, and what should I do for the data to be parsed correctly?
Here is how I'm trying to create a DataFrame:
val df = sqlContext.read.json(sc.wholeTextFiles("data.txt").values)
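For reference, Spark's default JSON reader expects newline-delimited JSON, one object per line. A minimal Python sketch of writing the merged sales in that format (the records are copied from the sample data above):

import json

sales = [
    {"store": "New Sore_1", "store_id": "10", "store_metric": "1234567"},
    {"store": "New Sore_1", "store_id": "10", "store_metric": "98765"},
]

# One JSON object per line (JSON Lines), so a line-based JSON reader
# sees each record as a separate row.
with open("data.txt", "w", encoding="utf-8") as f:
    for sale in sales:
        f.write(json.dumps(sale) + "\n")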

Removing characters from column in pandas data frame

My goal is to (1) import Twitter JSON, (2) extract data of interest, (3) create pandas data frame for the variables of interest. Here is my code:
import json
import pandas as pd

tweets = []
for line in open('00.json'):
    try:
        tweet = json.loads(line)
        tweets.append(tweet)
    except:
        continue

# Tweets often have missing data, therefore use -if- when extracting "keys"
tweet = tweets[0]
ids = [tweet['id_str'] for tweet in tweets if 'id_str' in tweet]
text = [tweet['text'] for tweet in tweets if 'text' in tweet]
lang = [tweet['lang'] for tweet in tweets if 'lang' in tweet]
geo = [tweet['geo'] for tweet in tweets if 'geo' in tweet]
place = [tweet['place'] for tweet in tweets if 'place' in tweet]

# Create a data frame (using pd.Index may be "incorrect", but I am a noob)
df = pd.DataFrame({'Ids': pd.Index(ids),
                   'Text': pd.Index(text),
                   'Lang': pd.Index(lang),
                   'Geo': pd.Index(geo),
                   'Place': pd.Index(place)})

# Create a data frame satisfying conditions:
df2 = df[(df['Lang'] == ('en')) & (df['Geo'].dropna())]
So far, everything seems to be working fine.
Now, the extracted values for Geo result in the following example:
df2.loc[1921,'Geo']
{'coordinates': [39.11890951, -84.48903638], 'type': 'Point'}
To get rid of everything except the coordinate values inside the square brackets I tried using:
df2.Geo.str.replace("[({':]", "") ### results in NaN
# and also this:
df2['Geo'] = df2['Geo'].map(lambda x: x.lstrip('{'coordinates': [').rstrip('], 'type': 'Point'')) ### results in syntax error
Please advise on the correct way to obtain coordinates values only.
The following line from your question indicates that this is an issue with understanding the underlying data type of the returned object:
df2.loc[1921,'Geo']
{'coordinates': [39.11890951, -84.48903638], 'type': 'Point'}
You are returning a Python dictionary here -- not a string! If you want to return just the values of the coordinates, use the 'coordinates' key, e.g.
df2.loc[1921,'Geo']['coordinates']
[39.11890951, -84.48903638]
The returned object in this case will be a Python list object containing the two coordinate values. If you want just one of the values, you can slice the list, e.g.
df2.loc[1921,'Geo']['coordinates'][0]
39.11890951
This workflow is much easier to deal with than casting the dictionary to a string, parsing the string, and recapturing the coordinate values as you are trying to do.
So let's say you want to create a new column called "geo_coord0" which contains all of the coordinates in the first position (as shown above). You could use something like the following:
df2["geo_coord0"] = [x['coordinates'][0] for x in df2['Geo']]
This uses a Python list comprehension to iterate over all entries in the df2['Geo'] column and for each entry it uses the same syntax we used above to return the first coordinate value. It then assigns these values to a new column in df2.
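If some rows lack a Geo entry (plausible given the earlier dropna step; this guard is an assumption, not part of the original answer), a tolerant variant might look like:

# Hypothetical guarded version: emit None when Geo is not a dict
df2["geo_coord0"] = [
    x["coordinates"][0] if isinstance(x, dict) else None
    for x in df2["Geo"]
]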
See the Python documentation on data structures for more details on the data structures discussed above.

How can I flatten HBase cells so I can process the resulting JSON using a Spark RDD or DataFrame in Scala?

A relative newbie to Spark, HBase, and Scala here.
I have JSON (stored as byte arrays) in HBase cells in the same column family, but across several thousand column qualifiers. Example (simplified):
Table name: 'Events'
rowkey: rk1
column family: cf1
column qualifier: cq1, cell data (in bytes): {"id":1, "event":"standing"}
column qualifier: cq2, cell data (in bytes): {"id":2, "event":"sitting"}
etc.
Using Scala, I can read rows by specifying a time range:
val scan = new Scan()
val start = 1460542400
val end = 1462801600
val hbaseContext = new HBaseContext(sc, conf)
val getRdd = hbaseContext.hbaseRDD(TableName.valueOf("Events"), scan)
If I try to load my HBase RDD (getRdd) into a DataFrame (after converting the byte arrays into strings, etc.), it only reads the first cell in every row (in the example above, I would only get "standing").
This code only loads a single cell for every row returned:
val resultsString = getRdd.map(s=>Bytes.toString(s._2.value()))
val resultsDf = sqlContext.read.json(resultsString)
In order to get every cell, I have to iterate as below:
val jsonRDD = getRdd.map(
  row => {
    val str = new StringBuilder
    str.append("[")
    val it = row._2.listCells().iterator()
    while (it.hasNext) {
      val cell = it.next()
      val cellstring = Bytes.toString(CellUtil.cloneValue(cell))
      str.append(cellstring)
      if (it.hasNext()) {
        str.append(",")
      }
    }
    str.append("]")
    str.toString()
  }
)
val hbaseDataSet = sqlContext.read.json(jsonRDD)
I need to add the square brackets and the commas so it's properly formatted JSON for the DataFrame to read.
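For reference, the bracket-and-comma assembly above just builds a JSON array out of the individual object strings. A minimal Python sketch of the same idea (cell values copied from the simplified example):

import json

cells = ['{"id":1, "event":"standing"}', '{"id":2, "event":"sitting"}']

# Joining the per-cell JSON strings with commas inside brackets yields
# one well-formed JSON array per row, like the StringBuilder loop above.
row_json = "[" + ",".join(cells) + "]"
assert json.loads(row_json) == [
    {"id": 1, "event": "standing"},
    {"id": 2, "event": "sitting"},
]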
Questions:
Is there a more elegant way to construct the JSON, i.e. some parser that takes in the individual JSON strings and concatenates them together so the result is properly formed JSON?
Is there a better way to flatten HBase cells so I don't need to iterate?
For the jsonRDD, the computed closure should include the str local variable, so the task executing this code on a node should not be missing the "[", "]", or ",". That is, I won't get parser errors once I run this on the cluster instead of local[*], correct?
Finally, is it better to just create a pair RDD from the JSON, or to use DataFrames to perform simple things like counts? Is there some way to measure the efficiency and performance of one vs. the other?
Thank you.