Convert JSON to dataframe with two indices - json

I'm trying to convert some financial data provided in JSON format into a single row in a dataframe. However, this JSON has the data under two indices, or nested indices? I'm not sure how to describe the data appropriately.
So below is the code I'm using to pull the financial data.
import requests
import pandas as pd
stock ='AAPL'
BS = requests.get(f"https://financialmodelingprep.com/api/v3/financials/balance-sheet-statement/{stock}?period=quarter")
data = BS.json()
The output looks like this:
{'symbol': 'AAPL',
'financials': [{'date': '2019-12-28',
'Cash and cash equivalents': '39771000000.0',
'Short-term investments': '67391000000.0',
'Cash and short-term investments': '1.07162e+11',
'Receivables': '20970000000.0',...}
I've tried the following:
df = pd.DataFrame.from_dict(data, orient='index')
and
df = pd.DataFrame.from_dict(json_normalize(data), orient='columns')
Neither gets me what I want. Somehow I need to get rid of the 'financials' level so that each financial item ends up as its own column.
How do I do this?

Just use the list of records under 'financials' when creating the dataframe.
import requests
import pandas as pd
stock ='AAPL'
BS = requests.get(f"https://financialmodelingprep.com/api/v3/financials/balance-sheet-statement/{stock}?period=quarter")
data = BS.json()
df = pd.DataFrame.from_dict(data['financials'])
print(df)
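If you also want the dates as the index and proper numeric dtypes (not asked for above, just a possible next step), something along these lines should work; it is only a sketch:
df = pd.DataFrame(data['financials'])
# index by date and coerce the string values to numbers
df = df.set_index('date').apply(pd.to_numeric, errors='coerce')
print(df.head())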

Related

Best approach for geospatial indexes in Palantir Foundry

What is the recommended approach for building a pipeline that needs to find a point contained in a polygon (shape) in Palantir Foundry? In the past, this has been pretty difficult in Spark. GeoSpark has been pretty popular, but can still lag. If there is nothing specific to Foundry I can implement something with GeoSpark. I have ~13k shapes and batches of thousands of points.
How large are the datasets? With a big enough driver and some optimizations, I previously got it working using geopandas. Just make sure that the coordinate points are in the same projection (CRS) as the polygons.
Here is a helper function:
from shapely import geometry
import json
import geopandas
from pyspark.sql import functions as F
def geopandas_spatial_join(df_left, df_right, geometry_left, geometry_right, how='inner', op='intersects'):
    '''
    Computes a spatial join of two geopandas dataframes. Implements the geopandas "sjoin" method, reference: https://geopandas.org/reference/geopandas.sjoin.html.
    Expects both dataframes to contain a GeoJSON geometry column, whose names are passed as the 'geometry_left' and 'geometry_right' arguments.
    Inputs:
        df_left (PANDAS_DATAFRAME): Left input dataframe.
        df_right (PANDAS_DATAFRAME): Right input dataframe.
        geometry_left (string): Name of the geometry column of the left dataframe.
        geometry_right (string): Name of the geometry column of the right dataframe.
        how (string): The type of join, one of {'left', 'right', 'inner'}.
        op (string): Binary predicate, one of {'intersects', 'contains', 'within'}.
    Outputs:
        (PANDAS_DATAFRAME): Joined dataframe.
    '''
    # parse the GeoJSON strings into shapely geometries and wrap each side in a GeoDataFrame
    df1 = df_left
    df1["geometry_left_shape"] = df1[geometry_left].apply(json.loads)
    df1["geometry_left_shape"] = df1["geometry_left_shape"].apply(geometry.shape)
    gdf_left = geopandas.GeoDataFrame(df1, geometry="geometry_left_shape")

    df2 = df_right
    df2["geometry_right_shape"] = df2[geometry_right].apply(json.loads)
    df2["geometry_right_shape"] = df2["geometry_right_shape"].apply(geometry.shape)
    gdf_right = geopandas.GeoDataFrame(df2, geometry="geometry_right_shape")

    # run the spatial join, then drop the helper geometry columns before returning
    joined = geopandas.sjoin(gdf_left, gdf_right, how=how, op=op)
    joined = joined.drop(joined.filter(items=["geometry_left_shape", "geometry_right_shape"]).columns, axis=1)
    return joined
We can then run the join:
import pandas as pd

# inside a Foundry transform (or anywhere `spark`, `points_df` and `polygon_df` are in scope)
left_df = points_df.toPandas()
left_geo_column = "point_geometry"
right_df = polygon_df.toPandas()
right_geo_column = "polygon_geometry"

pdf = geopandas_spatial_join(left_df, right_df, left_geo_column, right_geo_column)
return_df = spark.createDataFrame(pdf).dropDuplicates()
return return_df
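On the projection caveat above: if the two GeoJSON columns are in different coordinate reference systems, recent geopandas versions can reproject one side before the join. A rough sketch, with placeholder EPSG codes:
# hypothetical EPSG codes; replace with whatever your data actually uses
gdf_left = gdf_left.set_crs(epsg=4326)
gdf_right = gdf_right.set_crs(epsg=3857).to_crs(epsg=4326)  # reproject the right side to match the left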

Json string written to Kafka using Spark is not converted properly on reading

I read a .csv file to create a data frame and I want to write the data to a Kafka topic. The code is the following:
df = spark.read.format("csv").option("header", "true").load(f'{file_location}')
kafka_df = df.selectExpr("to_json(struct(*)) AS value").selectExpr("CAST(value AS STRING)")
kafka_df.show(truncate=False)
And the data frame looks like this:
value
"{""id"":""d215e9f1-4d0c-42da-8f65-1f4ae72077b3"",""latitude"":""-63.571457254062715"",""longitude"":""-155.7055842710919""}"
"{""id"":""ca3d75b3-86e3-438f-b74f-c690e875ba52"",""latitude"":""-53.36506636464281"",""longitude"":""30.069167069917597""}"
"{""id"":""29e66862-9248-4af7-9126-6880ceb3b45f"",""latitude"":""-23.767505281795835"",""longitude"":""174.593140405442""}"
"{""id"":""451a7e21-6d5e-42c3-85a8-13c740a058a9"",""latitude"":""13.02054867061598"",""longitude"":""20.328402498420786""}"
"{""id"":""09d6c11d-7aae-4d17-8cd8-183157794893"",""latitude"":""-81.48976715040848"",""longitude"":""1.1995769642056189""}"
"{""id"":""393e8760-ef40-482a-a039-d263af3379ba"",""latitude"":""-71.73949722379649"",""longitude"":""112.59922770487054""}"
"{""id"":""d6db8fcf-ee83-41cf-9ec2-5c2909c18534"",""latitude"":""-4.034680969008576"",""longitude"":""60.59645511854336""}"
After I write it to Kafka I want to read it back and cast the binary data from the "value" column to a JSON string, but the result is that the value contains only the id, not the whole string. Any idea why?
from pyspark.sql import functions as F
df = consume_from_event_hub(topic, bootstrap_servers, config, consumer_group)
string_df = df.select(F.col("value").cast("string"))
string_df.display()
value
794541bc-30e6-4c16-9cd0-3c5c8995a3a4
20ea5b50-0baa-47e3-b921-f9a3ac8873e2
598d2fc1-c919-4498-9226-dd5749d92fc5
86cd5b2b-1c57-466a-a3c8-721811ab6959
807de968-c070-4b8b-86f6-00a865474c35
e708789c-e877-44b8-9504-86fd9a20ef91
9133a888-2e8d-4a5a-87ce-4a53e63b67fc
cd5e3e0d-8b02-45ee-8634-7e056d49bf3b
The CSV format is this:
id,latitude,longitude
bd6d98e1-d1da-4f41-94ba-8dbd8c8fce42,-86.06318155350924,-108.14300138138589
c39e84c6-8d7b-4cc5-b925-68a5ea406d52,74.20752175171859,-129.9453606091319
011e5fb8-6ab7-4ee9-97bb-acafc2c71e15,19.302250885973592,-103.2154291337162
You need to remove selectExpr("CAST(value AS STRING)"), since to_json already returns a string column:
from pyspark.sql.functions import col, to_json, struct
df = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load(f'{file_location}')
kafka_df = df.select(to_json(struct(col("*"))).alias("value"))
kafka_df.show(truncate=False)
I'm not sure what's wrong with the consumer. That should have worked, unless consume_from_event_hub does something that specifically extracts the id column.
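For reference, reading the topic directly and parsing the JSON with from_json would look roughly like the sketch below. It bypasses consume_from_event_hub, and the topic / bootstrap-server values are placeholders:
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType

schema = StructType([
    StructField("id", StringType()),
    StructField("latitude", StringType()),
    StructField("longitude", StringType()),
])

raw = (spark.read.format("kafka")
       .option("kafka.bootstrap.servers", bootstrap_servers)  # placeholder
       .option("subscribe", topic)                            # placeholder
       .load())

parsed = (raw.select(F.col("value").cast("string").alias("json"))
             .select(F.from_json("json", schema).alias("data"))
             .select("data.*"))
parsed.show(truncate=False)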

How can I append 2 dataframes in PySpark and keep both headers in the first and second rows? Or add the 2nd header to a dataframe that already has 1 header?

I'm trying to write a Spark SQL (PySpark) job in Databricks which appends 2 dataframes into a CSV file, but I need to keep both headers of these 2 dataframes because I am going to use the two headers as metadata in the next step. Could you please help me with how I could do that? The first two rows should be read as headers.
%python
from pyspark.sql import DataFrameWriter
from pyspark.sql.functions import col
df12 = spark.read.csv("/FileStore/tables/BBrate.csv")
df12.write.csv(path="/opt/Output/test5.csv", mode="append")
df12 = df12.select(col("_c0").alias("IntCurrates"), col("_c1").alias(" "),
                   col("_c2").alias(" "), col("_c3").alias(" "), col("_c4").alias(" "),
                   col("_c5").alias(" "))
df12.createOrReplaceTempView("BBXrate")
sqlDF = spark.sql("SELECT * FROM BBXrate")
sqlDF.show()
To be honest, I'm a complete beginner in Spark. Frankly, what I need is to take 2 dataframes and append both of them into one CSV file, something like the below (the second table is just one row, its header), but I know that the code below is not correct. Another point is that mode="append" doesn't work anymore.
%python
from pyspark.sql import DataFrameWriter
from pyspark.sql.functions import col
df8 = spark.read.csv("/FileStore/tables/BBrate.csv")
df9 = spark.read.csv("/FileStore/tables/ETLy.csv")
df8.coalesce(1).write.format('com.databricks.spark.csv') \
    .save("/FileStore/tables/ETLyardi2.csv").mode("append")
df9.coalesce(1).write.format('com.databricks.spark.csv') \
    .save("/FileStore/tables/ETLy.csv", mode='append')
df9.createOrReplaceTempView("BBXrate")
sqlDF9 = spark.sql("SELECT * FROM BBXrate")
sqlDF9.show()
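One possible approach, sketched under the assumption that both files are read without header=True (so every column is a string and each original header line is already an ordinary data row): build the extra header as a one-row dataframe and union it on top before writing. The header values below are placeholders, and note that Spark does not strictly guarantee output row order:
df12 = spark.read.csv("/FileStore/tables/BBrate.csv")          # all columns come back as strings

extra_header = spark.createDataFrame(
    [("IntCurrates", "c1", "c2", "c3", "c4", "c5")],           # placeholder second header row
    df12.schema,
)

combined = extra_header.union(df12)                            # second header first, then the original data
combined.coalesce(1).write.mode("overwrite").csv("/FileStore/tables/two_header_output")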

How to add/change column names with pyarrow.read_csv?

I am currently trying to import a big CSV file (50GB+) without any headers into a pyarrow table, with the overall target to export this file to the Parquet format and further process it in a Pandas or Dask DataFrame. How can I specify the column names and column dtypes within pyarrow for the CSV file?
I already thought about appending the header to the CSV file. However, this forces a complete rewrite of the file, which looks like unnecessary overhead. As far as I know, pyarrow provides schemas to define the dtypes for specific columns, but the docs are missing a concrete example of doing so while transforming a CSV file to an arrow table.
Imagine that, for an easy example, this CSV file just has the two columns "A" and "B".
My current code looks like this:
import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.csv  # the csv submodule must be imported for pa.csv to be available

df_with_header = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
print(df_with_header)
df_with_header.to_csv("data.csv", header=False, index=False)

df_without_header = pd.read_csv('data.csv', header=None)
print(df_without_header)

opts = pa.csv.ConvertOptions(column_types={'A': 'int8', 'B': 'int8'})
table = pa.csv.read_csv(input_file="data.csv", convert_options=opts)
print(table)
If I print out the final table, the column names have not changed:
pyarrow.Table
1: int64
3: int64
How can I now change the loaded column names and dtypes? Is there maybe also a possibility to pass in, for example, a dict containing the names and their dtypes?
You can specify type overrides for columns:
import io
import pyarrow as pa
from pyarrow import csv

fp = io.BytesIO(b'one,two,three\n1,2,3\n4,5,6')
fp.seek(0)
table = csv.read_csv(
    fp,
    convert_options=csv.ConvertOptions(
        column_types={
            'one': pa.int8(),
            'two': pa.int8(),
            'three': pa.int8(),
        }
    ))
But in your case you don't have a header, and as far as I can tell this use case is not supported in arrow:
fp = io.BytesIO(b'1,2,3\n4,5,6')
fp.seek(0)
table = csv.read_csv(
    fp,
    parse_options=csv.ParseOptions(header_rows=0)
)
This raises:
pyarrow.lib.ArrowInvalid: header_rows == 0 needs explicit column names
The code is here: https://github.com/apache/arrow/blob/3cf8f355e1268dd8761b99719ab09cc20d372185/cpp/src/arrow/csv/reader.cc#L138
This is similar to this question apache arrow - reading csv file
There should be a fix for it in the next version: https://github.com/apache/arrow/pull/4898
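Once that fix lands, something along these lines should work for a header-less file (a sketch against the newer API, using the column names 'A' and 'B' from the example above):
import pyarrow as pa
from pyarrow import csv

read_opts = csv.ReadOptions(column_names=['A', 'B'])     # the file has no header row
convert_opts = csv.ConvertOptions(column_types={'A': pa.int8(), 'B': pa.int8()})

table = csv.read_csv('data.csv', read_options=read_opts, convert_options=convert_opts)
print(table)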

Timestamp problem when fetching data from iforge and importing it to csv

OK, I am a Python beginner trying to fetch data from iforge. However, I get a problem with the timestamp when exporting to CSV. I think the timestamp should look like "2019-03-22 23:00:00", but instead I get 1553460483. Why is that, and how do I fix it so it comes out in the correct format in the CSV file?
# coding: utf-8
import json
import csv
import urllib.request
import datetime

# `request` is an open urllib.request response object (the URL was omitted in the question)
data = json.load(request)
time = data[0]['timestamp']
price = data[0]['price']
data = json.load(request) contains this:
[{'symbol': 'EURUSD',
'bid': 1.2345,
'ask': 1.2399,
'price': 1.2343,
'timestamp': 1553460483}]
But since I was only interested in price and timestamp, I did:
time = data[0]['timestamp']
price = data[0]['price']
myprice = {'Date':time,'price':price}
And then made a CSV from myprice... it works, but I don't know if it's correct =)
Now to the problem: how do I fix the timestamp so it shows up correctly in the CSV?
You would have to figure out what unit 'timestamp' is in. My guess is seconds since the Unix epoch, so go for:
import pandas as pd
pd.to_datetime(1553460483, unit='s')
Out: Timestamp('2019-03-24 20:48:03')
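If you'd rather avoid pandas, the standard library gives the same result before you write the CSV (this sketch assumes the timestamp is UTC seconds):
import datetime

ts = 1553460483
print(datetime.datetime.utcfromtimestamp(ts).strftime('%Y-%m-%d %H:%M:%S'))
# 2019-03-24 20:48:03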