Pyarrow: How to specify partial schema

I am creating a table with some known columns and some dynamic columns. I would like to specify the data types for the known columns and infer the data types for the unknown columns. Is there a way to do this?
If I create a schema with only the known columns, then the other columns are ignored when creating the table:
import pyarrow as pa

n_legs = pa.array([2, 4, 5, 100])
animals = pa.array(["Flamingo", "Horse", "Brittle stars", "Centipede"])
pydict = {'n_legs': n_legs, 'animals': animals}
partialSchema = pa.schema([('n_legs', pa.int32())])
pa.Table.from_pydict(pydict, schema=partialSchema)
pyarrow.Table
n_legs: int32
----
n_legs: [[2,4,5,100]]
^^^ The animals column was omitted instead of inferred.

One solution could be to specify the data types when you create your arrays, before you build the table; then you do not need to pass a schema at all:
n_legs = pa.array([2, 4, 5, 100], pa.int32())
animals = pa.array(["Flamingo", "Horse", "Brittle stars", "Centipede"])
pydict = {'n_legs': n_legs, 'animals': animals}
pa.Table.from_pydict(pydict)
pyarrow.Table
n_legs: int32
animals: string
----
n_legs: [[2,4,5,100]]
animals: [["Flamingo","Horse","Brittle stars","Centipede"]]

Related

Get JuliaDB.loadtable() to parse all columns as String

I want JuliaDB.loadtable() to read a CSV (really a bunch of CSVs, but for simplicity let's try just one), where all columns are parsed as String.
Here's what I've tried:
using CSV
using DataFrames
using JuliaDB
df1 = DataFrame(
    [['a', 'b', 'c'], [1, 2, 3]],
    ["name", "id"]
)
CSV.write("df1.csv", df1)
# This works, but if I have 10+ columns it would get unwieldy
df1 = loadtable("df1.csv"; colparsers=Dict(:name=>String, :id=>String),)
# This doesn't work
df1 = loadtable("df1.csv"; colparsers=String,)
# MethodError: no method matching iterate(::Type{String})
Here's how it's done in R:
df1 = read.csv("df1.csv", colClasses = "character")
If you know the number of columns (or just an upper bound on it), you can use types, I should think (from CSV.jl documentation):
types: a Vector or Dict of types to be used for column types; a Dict can map column index Int, or name Symbol or String to type for a column, i.e. Dict(1=>Float64) will set the first column as a Float64, Dict(:column1=>Float64) will set the column named column1 to Float64 and Dict("column1"=>Float64) will set the column1 to Float64; if a Vector is provided, it must match the # of columns provided or detected in header

AWS Glue Crawler for JSONB column in PostgreSQL RDS

I've created a crawler that looks at a PostgreSQL 9.6 RDS table with a JSONB column but the crawler identifies the column type as "string". When I then try to create a job that loads data from a JSON file on S3 into the RDS table I get an error.
How can I map a JSON file source to a JSONB target column?
It's not quite a direct copy, but an approach that has worked for me is to define the column on the target table as TEXT. After the Glue job populates the field, I then convert it to JSONB. For example:
alter table postgres_table
alter column column_with_json set data type jsonb using column_with_json::jsonb;
Note the use of the cast for the existing text data. Without that, the alter column would fail.
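If you want to script that post-job conversion step rather than run it by hand, a minimal sketch in Python with psycopg2 could look like this (the connection string is a placeholder; the table and column names are carried over from the ALTER statement above):

import psycopg2

# Hypothetical DSN; substitute your RDS endpoint and credentials.
conn = psycopg2.connect("host=my-rds-endpoint dbname=mydb user=me password=secret")
with conn, conn.cursor() as cur:
    cur.execute(
        "ALTER TABLE postgres_table "
        "ALTER COLUMN column_with_json SET DATA TYPE jsonb "
        "USING column_with_json::jsonb"
    )
conn.close()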
The crawler will identify the JSONB column type as "string", but you can use Glue's Unbox class to convert this column to JSON.
Let's take the following table in PostgreSQL:
create table persons (id integer, person_data jsonb, creation_date timestamp )
Here is an example of one record from the persons table:
ID = 1
PERSON_DATA = {
"firstName": "Sergii",
"age": 99,
"email":"Test#test.com"
}
CREATION_DATE = 2021-04-15 00:18:06
The following code needs to be added to the Glue job:
from awsglue.transforms import Unbox

# 1. Create a dynamic frame from the catalog
df_persons = glueContext.create_dynamic_frame.from_catalog(database = "testdb", table_name = "persons", transformation_ctx = "df_persons")
# 2. In path, give the name of the jsonb column that needs to be converted to json
df_persons_json = Unbox.apply(frame = df_persons, path = "person_data", format = "json")
# 3. Convert the dynamic frame to a data frame
datf_persons_json = df_persons_json.toDF()
# 4. After that you can process this column as a json datatype, or build a dataframe
#    with all necessary columns; each json data element can become a separate column:
final_df_person = datf_persons_json.select("id", "person_data.age", "person_data.firstName", "creation_date")
You can also check the following link:
https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-pyspark-transforms-Unbox.html

Expanding a JSON column in R

I am reading in a data table from a CSV file. Some elements in the CSV are in JSON format, so one of the columns has JSON formatted data, for example:
user_id tv_sec action_info
1: 47074 1426791420 {"foo": {"bar":12345,"baz":309}, "type": "type1"}
2: 47074 1426791658 {"foo": {"bar":23409,"baz":903}, "type": "type2"}
3: 47074 1426791923 {"foo": {"bar":97241,"baz":218}, "type": "type3"}
I would like to flatten out the action_info column and add the data as columns, as follows:
user_id tv_sec bar baz type
1: 47074 1426791420 12345 309 type1
2: 47074 1426791658 23409 903 type2
3: 47074 1426791923 97241 218 type3
I am not sure how to achieve this. I found a library to parse JSON strings in R (RJSONIO), but I'm having a hard time figuring out what to do next. When I experiment with just trying to parse every row of the action_info column with the command userActions[,.(fromJSON(action_info))], I basically get a data table where all the values seem to be accumulated in some way that's not entirely clear to me. For example, running that with my (non-example) data I get:
V1
1: 2.188603e+12,2.187628e+12,2.186202e+12,1.164000e+03
2: type1
Warning messages:
1: In if (is.na(encoding)) return(0L) :
the condition has length > 1 and only the first element will be used
2: In if (is.na(i)) { :
the condition has length > 1 and only the first element will be used
So, I'm trying to figure out:
how to operate on the column to parse the JSON into values (I think I am doing this correctly, but I'm not certain)
how to get the values and create columns out of them in either the current or new data table.
Rather ugly but should work:
library(dplyr)
library(data.table)
lapply(as.character(df$action_info), RJSONIO::fromJSON) %>%
  lapply(function(e) list(bar=e$foo[1], baz=e$foo[2], type=e$type)) %>%
  rbindlist() %>%
  cbind(df) %>%
  select(-action_info)
Data:
library(data.table)
df <- data.table(structure(list(
  user_id = c(47074L, 47074L, 47074L),
  tv_sec = c(1426791420L, 1426791658L, 1426791923L),
  action_info = c(
    "{\"foo\": {\"bar\":12345,\"baz\":309}, \"type\": \"type1\"}",
    "{\"foo\": {\"bar\":23409,\"baz\":903}, \"type\": \"type2\"}",
    "{\"foo\": {\"bar\":97241,\"baz\":218}, \"type\": \"type3\"}"
  )), .Names = c("user_id", "tv_sec", "action_info"),
  row.names = c(NA, -3L), class = "data.frame"))
Here's one way to do it with data.table:
library(RJSONIO)
df[, c('bar', 'baz', 'type') := as.list(unlist(fromJSON(action_info[1]))),
   by=action_info]
How it works:
The by=action_info essentially makes sure we just call fromJSON once per unique action_info (once per row in your case); this is because fromJSON doesn't work on vectorised input.
The fromJSON(action_info[1]) parses the action_info JSON (the [1] is on the off chance that you have multiple rows with the same action_info, since fromJSON doesn't work on vector input).
The unlist flattens the nested "foo: {bar...}" (do fromJSON(df$action_info[1]) and unlist(fromJSON(df$action_info[1])) to see what I mean).
The as.list converts the result back into a list, with one element per "column" (data.table needs this to do the multiple assignment)
Then the c('bar', 'baz', 'type'):= assigns the output back out to the columns.
Note we don't match by name, so 'bar' is always the first part of the JSON, 'baz' is always the second, etc. If your action_info could have a {bar: ..., baz: ...} as well as a {baz: ..., bar: ...}, the baz of the second will be assigned to the bar column. If you want to assign by name, you will have to do something cleverer (for example, you could do as.list(...)[c('foo.bar', 'foo.baz', 'type')] to ensure the elements are in the right order before assigning).

Removing Unnecessary JSON fields using SPARK (SQL)

I'm a new Spark user currently playing around with Spark and some big data, and I have a question related to Spark SQL, or more formally the SchemaRDD. I'm reading a JSON file containing data about some weather forecasts, and I'm not really interested in all of the fields that I have ... I only want 10 fields out of the 50+ fields returned for each record. Is there a way (similar to filter) to specify the names of the fields I want removed, in Spark?
Just a small descriptive example: consider I have the schema "Person" with 3 fields "Name", "Age", and "Gender"; I'm not interested in the "Age" field and would like to remove it. Can I somehow use Spark to do that? Thanks
If you are using Spark 1.2, you can do the following (using Scala)...
If you already know what fields you want to use, you can construct the schema for these fields and apply this schema to the JSON dataset. Spark SQL will return a SchemaRDD. Then, you can register it and query it as a table. Here is a snippet...
// sc is an existing SparkContext.
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
// The schema is encoded in a string
val schemaString = "name gender"
// Import Spark SQL data types.
import org.apache.spark.sql._
// Generate the schema based on the string of schema
val schema =
  StructType(
    schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, true)))
// Create the SchemaRDD for your JSON file "people" (every line of this file is a JSON object).
val peopleSchemaRDD = sqlContext.jsonFile("people.txt", schema)
// Check the schema of peopleSchemaRDD
peopleSchemaRDD.printSchema()
// Register peopleSchemaRDD as a table called "people"
peopleSchemaRDD.registerTempTable("people")
// Only values of name and gender fields will be in the results.
val results = sqlContext.sql("SELECT * FROM people")
When you look at the schema of peopleSchemaRDD (peopleSchemaRDD.printSchema()), you will only see the name and gender fields.
Or, if you want to explore the dataset and determine what fields you want after you see all fields, you can ask Spark SQL to infer the schema for you. Then, you can register the SchemaRDD as a table and use projection to remove unneeded fields. Here is a snippet...
// Spark SQL will infer the schema of the given JSON file.
val peopleSchemaRDD = sqlContext.jsonFile("people.txt")
// Check the schema of peopleSchemaRDD
peopleSchemaRDD.printSchema()
// Register peopleSchemaRDD as a table called "people"
peopleSchemaRDD.registerTempTable("people")
// Project name and gender field.
sqlContext.sql("SELECT name, gender FROM people")
You can also specify which fields you would like to have in the SchemaRDD. Below is an example: create a case class with only the fields that you need, read the data into an RDD, then keep only the fields you need, in the same order as in the case class.
Sample Data: People.txt
foo,25,M
bar,24,F
Code:
case class Person(name: String, gender: String)
// In Spark 1.2, this import provides the implicit conversion from an RDD of
// case classes to a SchemaRDD, which registerTempTable needs.
import sqlContext.createSchemaRDD
val people = sc.textFile("People.txt").map(_.split(",")).map(p => Person(p(0), p(2)))
people.registerTempTable("people")

Encoding a binary tree to json

I'm using SQLAlchemy to store binary tree data in the DB:
class Distributor(Base):
    __tablename__ = "distributors"
    id = Column(Integer, primary_key=True)
    upline_id = Column(Integer, ForeignKey('distributors.id'))
    left_id = Column(Integer, ForeignKey('distributors.id'))
    right_id = Column(Integer, ForeignKey('distributors.id'))
How can I generate JSON "tree"-format data like the following:
{"id": 1, "children": [{"id": 2, "children": [{"id": 3}, {"id": 4}]}]}
I'm guessing you're asking to store the data in a JSON format? Or are you trying to construct JSON from the standard relational data?
If the former, why don't you just create entries like:
{id: XX, parentId: XX, left: XX, right: XX, value: "foo"}
For each of the nodes, and then reconstruct the tree manually from the entries? Just start from the head (parentId == null) and then assemble the branches.
You could also add an additional identifier for the tree itself, in case you have multiple trees in the database. Then you would just query where the treeId was XXX, and then construct the tree from the entries.
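As a minimal sketch of that reconstruction (assuming the flat entries have already been fetched as Python dicts; the field names mirror the hypothetical entry above):

import json

rows = [
    {"id": 1, "parentId": None, "left": 2, "right": 3, "value": "foo"},
    {"id": 2, "parentId": 1, "left": None, "right": None, "value": "bar"},
    {"id": 3, "parentId": 1, "left": None, "right": None, "value": "baz"},
]

# Build one node dict per row, then wire up children and find the head.
nodes = {r["id"]: {"id": r["id"], "value": r["value"], "children": []} for r in rows}
root = None
for r in rows:
    for side in ("left", "right"):
        if r[side] is not None:
            nodes[r["id"]]["children"].append(nodes[r[side]])
    if r["parentId"] is None:
        root = nodes[r["id"]]

print(json.dumps(root))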
I hesitate to provide this answer, because I'm not sure I really understand the problem you're trying to solve (a binary tree, JSON, SQLAlchemy: none of these is a problem by itself).
What you can do with this kind of structure is iterate over each row, adding edges as you go along. You'll start with what is basically a cache of objects, which will eventually become the tree you need.
import collections

idmap = collections.defaultdict(dict)
for distributor in session.query(Distributor):
    dist_dict = idmap[distributor.id]
    dist_dict['id'] = distributor.id
    dist_dict.setdefault('children', [])
    if distributor.left_id:
        dist_dict['children'].append(idmap[distributor.left_id])
    if distributor.right_id:
        dist_dict['children'].append(idmap[distributor.right_id])
So we've got a big collection of linked-up dicts that can represent the tree. We don't know which one is the root, though:
import json

root_dist = session.query(Distributor).filter(Distributor.upline_id == None).one()
json_data = json.dumps(idmap[root_dist.id])