What is the recommended approach for building a pipeline in Palantir Foundry that needs to find which polygon (shape) contains a given point? In the past, this has been pretty difficult in Spark. GeoSpark has been fairly popular, but it can still lag. If there is nothing specific to Foundry, I can implement something with GeoSpark. I have ~13k shapes and batches of thousands of points.
How large are the datasets? With a big enough driver and some optimizations, I previously got this working using GeoPandas. Just make sure that the coordinate points are in the same projection (CRS) as the polygons.
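For instance, here is a minimal sketch of aligning the CRSs in GeoPandas before any point-in-polygon test (the file names are hypothetical):
import geopandas
points = geopandas.read_file("points.geojson")      # hypothetical input
polygons = geopandas.read_file("polygons.geojson")  # hypothetical input
# Reproject the points into the polygons' CRS so the join compares like with like
points = points.to_crs(polygons.crs)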
Here is a helper function:
from shapely import geometry
import json
import geopandas
def geopandas_spatial_join(df_left, df_right, geometry_left, geometry_right, how='inner', op='intersects'):
    '''
    Computes a spatial join of two GeoPandas dataframes. Implements the GeoPandas "sjoin" method, reference: https://geopandas.org/reference/geopandas.sjoin.html.
    Expects both dataframes to contain a GeoJSON geometry column, whose names are passed as the 'geometry_left' and 'geometry_right' arguments.
    Inputs:
        df_left (PANDAS_DATAFRAME): Left input dataframe.
        df_right (PANDAS_DATAFRAME): Right input dataframe.
        geometry_left (string): Name of the geometry column of the left dataframe.
        geometry_right (string): Name of the geometry column of the right dataframe.
        how (string): The type of join, one of {'left', 'right', 'inner'}.
        op (string): Binary predicate, one of {'intersects', 'contains', 'within'}.
    Outputs:
        (PANDAS_DATAFRAME): Joined dataframe.
    '''
    # Copy the inputs so we don't mutate the caller's dataframes
    df1 = df_left.copy()
    # Parse the GeoJSON strings into Shapely geometry objects
    df1["geometry_left_shape"] = df1[geometry_left].apply(json.loads)
    df1["geometry_left_shape"] = df1["geometry_left_shape"].apply(geometry.shape)
    gdf_left = geopandas.GeoDataFrame(df1, geometry="geometry_left_shape")
    df2 = df_right.copy()
    df2["geometry_right_shape"] = df2[geometry_right].apply(json.loads)
    df2["geometry_right_shape"] = df2["geometry_right_shape"].apply(geometry.shape)
    gdf_right = geopandas.GeoDataFrame(df2, geometry="geometry_right_shape")
    joined = geopandas.sjoin(gdf_left, gdf_right, how=how, op=op)
    # Drop the temporary Shapely columns so the result serializes cleanly
    joined = joined.drop(joined.filter(items=["geometry_left_shape", "geometry_right_shape"]).columns, axis=1)
    return joined
We can then run the join (this snippet assumes it runs inside a function, such as a Foundry transform, where `spark` is in scope):
import pandas as pd
left_df = points_df.toPandas()
left_geo_column = "point_geometry"
right_df = polygon_df.toPandas()
right_geo_column = "polygon_geometry"
pdf = geopandas_spatial_join(left_df, right_df, left_geo_column, right_geo_column)
return_df = spark.createDataFrame(pdf).dropDuplicates()
return return_df
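Pulling it together, a minimal sketch of how this could sit inside a Foundry Python transform, assuming the standard transforms API (the dataset paths are hypothetical):
from transforms.api import transform_df, Input, Output
@transform_df(
    Output("/project/joined_points_polygons"),  # hypothetical output path
    points_df=Input("/project/points"),         # hypothetical input path
    polygon_df=Input("/project/polygons"),      # hypothetical input path
)
def compute(ctx, points_df, polygon_df):
    # Collect both sides to the driver; ~13k polygons and batches of
    # thousands of points fit comfortably in driver memory
    pdf = geopandas_spatial_join(
        points_df.toPandas(), polygon_df.toPandas(),
        "point_geometry", "polygon_geometry")
    return ctx.spark_session.createDataFrame(pdf).dropDuplicates()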
I'm trying to convert some financial data provided in JSON format into a single row of a dataframe. However, the JSON nests the data under two levels of keys, and I'm not sure how to describe that structure appropriately.
So below is the code I'm using to pull the financial data.
import requests
import pandas as pd
stock ='AAPL'
BS = requests.get(f"https://financialmodelingprep.com/api/v3/financials/balance-sheet-statement/{stock}?period=quarter")
data = BS.json()
The output looks like this
{'symbol': 'AAPL',
'financials': [{'date': '2019-12-28',
'Cash and cash equivalents': '39771000000.0',
'Short-term investments': '67391000000.0',
'Cash and short-term investments': '1.07162e+11',
'Receivables': '20970000000.0',...}
I've tried the following
df = pd.DataFrame.from_dict(data, orient='index')
and
df = pd.DataFrame.from_dict(json_normalize(data), orient='columns')
Neither gets me what I want. Somehow I need to get rid of the 'financials' level, so that each field (date, 'Cash and cash equivalents', and so on) becomes a column of the data frame. How do I do this?
So just use the list stored under 'financials' when creating the dataframe:
import requests
import pandas as pd
stock ='AAPL'
BS = requests.get(f"https://financialmodelingprep.com/api/v3/financials/balance-sheet-statement/{stock}?period=quarter")
data = BS.json()
df = pd.DataFrame.from_dict(data['financials'])
print(df)
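If you then want one row per statement keyed by date, with real numbers instead of strings (the sample output above shows the values arrive as strings), a small follow-up sketch:
# The API returns numeric fields as strings; convert them and index by date
df = df.set_index('date').apply(pd.to_numeric, errors='coerce')
print(df.loc['2019-12-28', 'Cash and cash equivalents'])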
I downloaded a dataset for facial keypoint detection. The images and the labels were in a CSV file, which I extracted using pandas, but I don't know how to convert the result into tensors and load it into a data loader for training.
import numpy as np
import pandas as pd
dataframe = pd.read_csv("training_facial_keypoints.csv")
# Each 'Image' cell is a space-separated string of pixel values
dataframe['Image'] = dataframe['Image'].apply(lambda i: np.fromstring(i, sep=' '))
dataframe = dataframe.dropna()
# Stack into one array and scale pixel values to [0, 1]
images_array = np.vstack(dataframe['Image'].values) / 255.0
images_array = images_array.astype(np.float32)
images_array = images_array.reshape(-1, 96, 96, 1)
print(images_array.shape)
# All columns except the last ('Image') are keypoint coordinates;
# normalize them to roughly [-1, 1]
labels_array = dataframe[dataframe.columns[:-1]].values
labels_array = (labels_array - 48) / 48
labels_array = labels_array.astype(np.float32)
I have the images and labels in two arrays. How do I create a training set from this, apply transforms, and then load it using a DataLoader?
Create a subclass of torch.utils.data.Dataset and fill it with your data.
You can pass desired torchvision.transforms to it and apply them to your data in __getitem__(self, index).
Then you can pass it to torch.utils.data.DataLoader, which allows multi-threaded loading of data.
PyTorch also has extensive documentation that you should refer to first.
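A minimal sketch of such a Dataset, assuming the images_array and labels_array prepared above (the class name and batch size are illustrative):
import torch
from torch.utils.data import Dataset, DataLoader
class KeypointsDataset(Dataset):
    def __init__(self, images, labels, transform=None):
        self.images = images        # (N, 96, 96, 1) float32
        self.labels = labels        # (N, num_keypoints) float32
        self.transform = transform  # e.g. torchvision.transforms
    def __len__(self):
        return len(self.images)
    def __getitem__(self, index):
        image = self.images[index]
        if self.transform is not None:
            image = self.transform(image)
        else:
            # (H, W, C) -> (C, H, W), as PyTorch models expect
            image = torch.from_numpy(image).permute(2, 0, 1)
        return image, torch.from_numpy(self.labels[index])
dataset = KeypointsDataset(images_array, labels_array)
loader = DataLoader(dataset, batch_size=64, shuffle=True, num_workers=2)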
I'm trying to create a GeoDataFrame with 2 zip codes per row, whose distances from each other I want to compare.
I took a list of approx. 220 zip codes and ran an itertools combination on them to get all combos, then unpacked the tuples into two columns:
code_combo = list(itertools.combinations(df_with_all_zip_codes['code'], 2))
df_distance_ctr = pd.DataFrame(code_combo, columns=['first_code','second_code'])
Then I did some standard pandas merges and column renaming to get the polygon/geometry column from the original geodataframe into this new one, right beside the respective zip code columns.
The problem is I can't seem to get the polygon columns to be read as geometry, even after 1) attempting to convert the dataframe to a geodataframe (AttributeError: No geometry data set yet) and 2) applying wkt.loads to the geometry column (AttributeError: 'MultiPolygon' object has no attribute 'encode').
I've tried to look for a way to convert a Series to a GeoSeries but can't find anything on SO or in the documentation. Can anyone please point out where I'm likely going wrong?
Looking at the __init__ method of a GeoDataFrame at https://github.com/geopandas/geopandas/blob/master/geopandas/geodataframe.py, it looks like a GeoDataFrame can only have one active geometry column at a time. The other columns you've created should still contain geometry objects, though.
Since you still have geometry objects in each column, you could write a method that uses Shapely's distance method, like so:
import pandas as pd
import geopandas
from shapely.geometry import Point
lats = [-34.58, -15.78, -33.45, 4.60, 10.48]
lons = [-58.66, -47.91, -70.66, -74.08, -66.86]
df = pd.DataFrame(
    {'City': ['Buenos Aires', 'Brasilia', 'Santiago', 'Bogota', 'Caracas'],
     'Country': ['Argentina', 'Brazil', 'Chile', 'Colombia', 'Venezuela'],
     'Latitude': lats,
     'Longitude': lons})
df['Coordinates'] = list(zip(df.Longitude, df.Latitude))
df['Coordinates'] = df['Coordinates'].apply(Point)
# Second geometry column: the same points in reverse order
df['Coordinates_2'] = list(zip(lons[::-1], lats[::-1]))
df['Coordinates_2'] = df['Coordinates_2'].apply(Point)
gdf = geopandas.GeoDataFrame(df, geometry='Coordinates')
def get_distance(row):
    distance = row.Coordinates.distance(row.Coordinates_2)
    print(distance)
    return distance
gdf['distance'] = gdf.apply(get_distance, axis=1)
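Since only one geometry column can be active at a time, wrapping the second column in a GeoSeries gives a vectorized alternative to the row-wise apply; a sketch against the frame built above:
# GeoSeries.distance aligns on the index and computes element-wise distances
gdf['distance'] = gdf['Coordinates'].distance(geopandas.GeoSeries(gdf['Coordinates_2']))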
As for the AttributeError: 'MultiPolygon' object has no attribute 'encode': MultiPolygon is a Shapely geometry class, while encode is a method on strings. The error means wkt.loads is being handed objects that are already Shapely geometries, so you can probably just remove that call.
I am currently trying to import a big CSV file (50GB+) without any headers into a pyarrow table, with the overall goal of exporting this file to the Parquet format and processing it further in a Pandas or Dask DataFrame. How can I specify the column names and column dtypes within pyarrow for the CSV file?
I already thought about appending a header to the CSV file, but that forces a complete rewrite of the file, which looks like unnecessary overhead. As far as I know, pyarrow provides schemas to define the dtypes for specific columns, but the docs are missing a concrete example of doing so while transforming a CSV file to an arrow table.
Imagine, as an easy example, that this CSV file just has the two columns "A" and "B".
My current code looks like this:
import pandas as pd
import pyarrow as pa
import pyarrow.csv
# Write a small headerless CSV to reproduce the problem
df_with_header = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
print(df_with_header)
df_with_header.to_csv("data.csv", header=False, index=False)
df_without_header = pd.read_csv('data.csv', header=None)
print(df_without_header)
opts = pa.csv.ConvertOptions(column_types={'A': 'int8',
                                           'B': 'int8'})
table = pa.csv.read_csv(input_file="data.csv", convert_options=opts)
print(table)
If I print out the final table, the column names have not changed (pyarrow treated the first row of data as the header):
pyarrow.Table
1: int64
3: int64
How can I change the loaded column names and dtypes? Is there a way, for example, to pass in a dict mapping the names to their dtypes?
You can specify type overrides for columns:
import io
import pyarrow as pa
from pyarrow import csv
fp = io.BytesIO(b'one,two,three\n1,2,3\n4,5,6')
fp.seek(0)
table = csv.read_csv(
    fp,
    convert_options=csv.ConvertOptions(
        column_types={
            'one': pa.int8(),
            'two': pa.int8(),
            'three': pa.int8(),
        }
    ))
But in your case you don't have a header, and as far as I can tell this use case is not supported in arrow:
fp = io.BytesIO(b'1,2,3\n4,5,6')
fp.seek(0)
table = csv.read_csv(
fp,
parse_options=csv.ParseOptions(header_rows=0)
)
This raises:
pyarrow.lib.ArrowInvalid: header_rows == 0 needs explicit column names
The code is here: https://github.com/apache/arrow/blob/3cf8f355e1268dd8761b99719ab09cc20d372185/cpp/src/arrow/csv/reader.cc#L138
This is similar to the question "apache arrow - reading csv file".
There should be a fix for it in the next version: https://github.com/apache/arrow/pull/4898
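Once that lands, ReadOptions.column_names should cover the headerless case; a sketch, assuming a pyarrow version that includes it:
import io
import pyarrow as pa
from pyarrow import csv
fp = io.BytesIO(b'1,2,3\n4,5,6')
table = csv.read_csv(
    fp,
    # Name the columns explicitly since the file has no header row
    read_options=csv.ReadOptions(column_names=['A', 'B', 'C']),
    convert_options=csv.ConvertOptions(column_types={
        'A': pa.int8(),
        'B': pa.int8(),
        'C': pa.int8(),
    }),
)
print(table.schema)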
My goal is to (1) import Twitter JSON, (2) extract the data of interest, and (3) create a pandas data frame for the variables of interest. Here is my code:
import json
import pandas as pd
tweets = []
for line in open('00.json'):
    try:
        tweet = json.loads(line)
        tweets.append(tweet)
    except json.JSONDecodeError:
        # Skip malformed lines
        continue
# Tweets often have missing data, therefore use `if` when extracting keys
ids = [tweet['id_str'] for tweet in tweets if 'id_str' in tweet]
text = [tweet['text'] for tweet in tweets if 'text' in tweet]
lang = [tweet['lang'] for tweet in tweets if 'lang' in tweet]
geo = [tweet['geo'] for tweet in tweets if 'geo' in tweet]
place = [tweet['place'] for tweet in tweets if 'place' in tweet]
# Create a data frame (using pd.Index may be "incorrect", but I am a noob)
df=pd.DataFrame({'Ids':pd.Index(ids),
'Text':pd.Index(text),
'Lang':pd.Index(lang),
'Geo':pd.Index(geo),
'Place':pd.Index(place)})
# Keep only English tweets that have geo data
df2 = df[(df['Lang'] == 'en') & (df['Geo'].notnull())]
So far, everything seems to be working fine.
Now, the extracted values for Geo result in the following example:
df2.loc[1921,'Geo']
{'coordinates': [39.11890951, -84.48903638], 'type': 'Point'}
To get rid of everything except the coordinate values inside the square brackets, I tried:
df2.Geo.str.replace("[({':]", "") ### results in NaN
# and also this:
df2['Geo'] = df2['Geo'].map(lambda x: x.lstrip('{'coordinates': [').rstrip('], 'type': 'Point'')) ### results in syntax error
Please advise on the correct way to obtain coordinates values only.
The following line from your question indicates that this is an issue with understanding the underlying data type of the returned object.
df2.loc[1921,'Geo']
{'coordinates': [39.11890951, -84.48903638], 'type': 'Point'}
You are returning a Python dictionary here -- not a string! If you want to return just the values of the coordinates, you should just use the 'coordinates' key to return those values, e.g.
df2.loc[1921,'Geo']['coordinates']
[39.11890951, -84.48903638]
The returned object in this case will be a Python list object containing the two coordinate values. If you want just one of the values, you can slice the list, e.g.
df2.loc[1921,'Geo']['coordinates'][0]
39.11890951
This workflow is much easier to deal with than casting the dictionary to a string, parsing the string, and recapturing the coordinate values as you are trying to do.
So let's say you want to create a new column called "geo_coord0" which contains all of the coordinates in the first position (as shown above). You could use something like the following:
df2["geo_coord0"] = [x['coordinates'][0] for x in df2['Geo']]
This uses a Python list comprehension to iterate over all entries in the df2['Geo'] column and for each entry it uses the same syntax we used above to return the first coordinate value. It then assigns these values to a new column in df2.
See the Python documentation on data structures for more details on the data structures discussed above.
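Building on that list comprehension, a short sketch (the lat/lon column names are illustrative) that expands both coordinate values into their own columns:
# Expand each two-element coordinates list into separate columns
df2[['lat', 'lon']] = pd.DataFrame(
    [x['coordinates'] for x in df2['Geo']], index=df2.index)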