How do I parse a GeoJSON shapefile into a dataset with one row per Feature? - palantir-foundry

I'm working on a project and need to parse a GeoJSON shapefile of flight route airspace in the US. The data is coming from the FAA open data portal: https://adds-faa.opendata.arcgis.com/datasets/faa::route-airspace/about
There seems to be some relevant documentation at /workspace/documentation/product/geospatial-docs/vector_data_in_transforms where it mentions:
A typical Foundry Ontology pipeline for geospatial vector data may include the following steps:
- Convert into rows with GeoJSON representation of the shape for each feature.
However there isn't actually any guidance on how to go about doing this when the source is a single GeoJSON file with a FeatureCollection and the desired output is a dataset with one row per Feature in the collection.
Anyone have a code snippet for accomplishing this? Seems like a pretty generic task in Foundry.

I typically do something like this:
import json

with open('Route_Airspace.geojson', 'r') as f:
    data = json.load(f)

rows = []
for feature in data['features']:
    row = {
        'geometry': json.dumps(feature['geometry']),
        'properties': json.dumps(feature['properties']),
        'id': feature['properties']['OBJECTID']
    }
    rows.append(row)
Note that you can leave out the properties, but I like to keep them at this step in case I need them later. Also note that this is a good place to set each row's primary key (the Features in this dataset have an OBJECTID property, but this may vary).
This rows list can then be used to initialise a Pandas dataframe:
import pandas as pd
df = pd.DataFrame(rows)
or a Spark dataframe (assuming you're doing this within a transform, where the ctx context is available):
df = ctx.spark_session.createDataFrame(rows)
The resulting dataframes will have one row per Feature, where that feature's shape is contained within the geometry column.
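If you'd rather not rely on Spark inferring the schema from a list of dicts (which can emit a deprecation warning), you can pass one explicitly. A minimal sketch, assuming OBJECTID is an integer; the field types here are an assumption, not part of the original answer:

from pyspark.sql import types as T

# Hypothetical explicit schema matching the row dicts built above
schema = T.StructType([
    T.StructField('id', T.LongType()),
    T.StructField('geometry', T.StringType()),
    T.StructField('properties', T.StringType()),
])
df = ctx.spark_session.createDataFrame(rows, schema=schema)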
Full example within transform:
from transforms.api import transform, Input, Output
import json


@transform(
    out=Output('Path/to/output'),
    source_df=Input('Path/to/source'),
)
def compute(source_df, out, ctx):
    with source_df.filesystem().open('Route_Airspace.geojson', 'r') as f:
        data = json.load(f)

    rows = []
    for feature in data['features']:
        row = {
            'geometry': json.dumps(feature['geometry']),
            'properties': json.dumps(feature['properties']),
            'id': feature['properties']['OBJECTID']
        }
        rows.append(row)

    df = ctx.spark_session.createDataFrame(rows)
    out.write_dataframe(df)
Note that for this to work your GeoJSON file needs to be uploaded into a "dataset without a schema" so the raw file becomes accessible via the FileSystem API.
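If you'd rather not hard-code the file name, the transform's FileSystem can also be listed with a glob. A minimal sketch along the lines of the transform above; the '*.geojson' pattern is an assumption about your dataset's contents:

# Sketch: parse every .geojson file in the schemaless input dataset
fs = source_df.filesystem()
rows = []
for status in fs.ls(glob='*.geojson'):  # FileStatus entries with a .path attribute
    with fs.open(status.path, 'r') as f:
        collection = json.load(f)
    for feature in collection['features']:
        rows.append({
            'geometry': json.dumps(feature['geometry']),
            'properties': json.dumps(feature['properties']),
            'id': feature['properties']['OBJECTID'],
        })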

Related

Micropython: bytearray in json-file

I'm using MicroPython in the newest version. I also use a DS18B20 temperature sensor. An address of one of these sensors is, for example, "b'(b\xe5V\xb5\x01<:'". This is the string representation of a bytearray. If I use this to save the address in a JSON file, I run into some problems:
If I store "b'(b\xe5V\xb5\x01<:'" directly, then after reading the JSON file there are no single backslashes left and I get b'(bxe5Vxb5x01<:' inside Python.
If I escape the backslashes, like "b'(b\\xe5V\\xb5\\x01<:'", I get double backslashes in Python: b'(b\\xe5V\\xb5\\x01<:'.
How do I get a single backslash?
Thank you
You can't save bytes in JSON with MicroPython. As far as JSON is concerned, that's just some string. Even if you got it to give you what you think you want (i.e. single backslashes), it still wouldn't be bytes. So, you are faced with making some form of conversion, no matter what.
One idea is that you could convert it to an int, and then convert it back when you open it. Below is a simple example. Of course you don't have to have a class and staticmethods to do this. It just seemed like a good way to wrap it all into one, and not even need an instance of it hanging around. You can dump the entire class in some other file, import it in the necessary file, and just call its methods as you need them.
import math, ujson, utime


class JSON(object):

    @staticmethod
    def convert(data: dict, convert_keys=None) -> dict:
        if isinstance(convert_keys, (tuple, list)):
            for key in convert_keys:
                if isinstance(data[key], (bytes, bytearray)):
                    data[key] = int.from_bytes(data[key], 'big')
                elif isinstance(data[key], int):
                    data[key] = data[key].to_bytes(1 if not data[key] else int(math.log(data[key], 256)) + 1, 'big')
        return data

    @staticmethod
    def save(filename: str, data: dict, convert_keys=None) -> None:
        # dump doesn't seem to like working directly with open
        with open(filename, 'w') as doc:
            ujson.dump(JSON.convert(data, convert_keys), doc)

    @staticmethod
    def open(filename: str, convert_keys=None) -> dict:
        return JSON.convert(ujson.load(open(filename, 'r')), convert_keys)


# example with both styles of bytes for the sake of being thorough
json_data = dict(address=bytearray(b'\xFF\xEE\xDD\xCC'), data=b'\x00\x01\x02\x03', date=utime.mktime(utime.localtime()))
keys = ['address', 'data']  # list of keys to convert to int/bytes
JSON.save('test.json', json_data, keys)
json_data = JSON.open('test.json', keys)
print(json_data)  # {'date': 1621035727, 'data': b'\x00\x01\x02\x03', 'address': b'\xff\xee\xdd\xcc'}
You may also want to note that with this method you never actually touch any JSON. You put in a dict, you get out a dict. All the JSON is managed "behind the scenes". Regardless of all of this, I would say using struct would be a better option. You said JSON, though, so my answer is about JSON.
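As an aside, a different way to keep such an address JSON-friendly (not the approach above) is to store it as a hex string. A minimal sketch, assuming MicroPython's ubinascii and ujson modules are available:

import ubinascii, ujson

address = bytearray(b'\xff\xee\xdd\xcc')

# bytes -> hex string, which JSON can hold directly
with open('sensor.json', 'w') as doc:
    ujson.dump({'address': ubinascii.hexlify(address).decode()}, doc)

# hex string -> bytes again
with open('sensor.json', 'r') as doc:
    restored = ubinascii.unhexlify(ujson.load(doc)['address'])
print(restored)  # b'\xff\xee\xdd\xcc'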

How to convert a multi-dimensional dictionary to json file?

I have uploaded a *.mat file that contains a 'struct' to my jupyter lab using:
from pymatreader import read_mat
data = read_mat(mat_file)
Now I have a multi-dimensional dictionary, for example:
data['Forces']['Ss1']['flap'].keys()
Gives the output:
dict_keys(['lf', 'rf', 'lh', 'rh'])
I want to convert this into a JSON file, keeping exactly the keys that already exist, without doing it manually, because I want to apply this to many *.mat files with varying numbers of keys.
EDIT:
Unfortunately, I no longer have access to MATLAB.
An example for desired output would look something like this:
json_format = {
"Forces": {
"Ss1": {
"flap": {
"lf": [1,2,3,4],
"rf": [4,5,6,7],
"lh": [23 ,5,6,654,4],
"rh": [4 ,34 ,35, 56, 66]
}
}
}
}
ANOTHER EDIT:
So after making lists of the subkeys (I won't elaborate on it), I did this:
FORCES = []
for ind in individuals:
    for force in forces:
        for wing in wings:
            FORCES.append({
                ind: {
                    force: {
                        wing: data['Forces'][ind][force][wing].tolist()
                    }
                }
            })
Then, to save:
with open(f'{ROOT_PATH}/Forces.json', 'w') as f:
    json.dump(FORCES, f)
That worked, but only because I looked up all of the keys manually... Also, for some reason, I have square brackets at the beginning and at the end of this JSON file.
The json package will output dictionaries to JSON:
import json

with open('filename.json', 'w') as f:
    json.dump(data, f)
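One caveat: if read_mat returned NumPy arrays (as it usually does for MATLAB numeric data), json.dump will raise a TypeError on them. A minimal sketch of a recursive conversion that keeps the existing key structure; the helper name and file name here are illustrative:

import json
import numpy as np

def to_jsonable(obj):
    # Recursively turn NumPy arrays into lists so json.dump can handle them
    if isinstance(obj, dict):
        return {key: to_jsonable(value) for key, value in obj.items()}
    if isinstance(obj, (list, tuple)):
        return [to_jsonable(value) for value in obj]
    if isinstance(obj, np.ndarray):
        return obj.tolist()
    return obj

with open('Forces.json', 'w') as f:
    json.dump(to_jsonable(data), f)

Because this walks the nested dict directly, the output keeps the original keys and avoids the enclosing list (and hence the square brackets) produced by appending to FORCES.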
If you are using MATLAB R2016b or later and want to go straight from MATLAB to JSON, check out jsonencode and jsondecode. For your purposes, jsonencode
encodes data and returns a character vector in JSON format.
MathWorks Docs
Here is a quick example that assumes your data is in the MATLAB variable test_data and writes it to a file specified in the variable json_file
json_data = jsonencode(test_data);
writematrix(json_data,json_file);
Note: Some MATLAB data formats cannot be translated into JSON data due to limitations in the JSON specification. However, it sounds like your data fits well with the JSON specification.

Choropleth map fails to render shading for geojson boundaries

Folium renders the choropleth map with the police force boundaries but these are all grey and not colour matched for the data in the dataframe.
I have also ensured that the new version of the documentation is followed, i.e. folium.Choropleth.
I have also checked that I am using key_on='feature.properties.pfa16nm' by inspecting the JSON in geojson.io.
Feature is spelt with a capital when checking the GeoJSON; however, when I change it to this I get an error and no map renders. I have also renamed the file so that it only has .geojson as an extension, and that didn't work.
import pandas as pd
import folium
import json
import os

adults_trafficked = pd.read_excel('Adults trafficked.xlsx')
force_boundaries = 'Police_Force_Areas_December_2016_Generalised_Clipped_Boundaries_in_England_and_Wales.geojson.json'

m = folium.Map([52.6333, -1.1333], zoom_start=4)
folium.Choropleth(
    geo_data=force_boundaries,
    data=adults_trafficked,
    columns=['Police_Force', 'Adults_Exploited'],
    key_on='feature.properties.pfa16nm',
    threshold_scale=[0, 25, 50, 75, 100, 125, 150, 175],
    fill_color='BuPu',
    legend_name='Trafficked Humans',
).add_to(m)
m
Output that I am getting
I expect the Leaflet map to render with each police boundary shaded to the appropriate level based on the dataframe column data. The Choropleth map renders perfectly with the boundaries, however these are all grey and do not contain the tonal colour range one would expect. Please find the code, data and the json link here.
The problem is that the excel file doesn't match the json file. When you use
columns=['Police_Force', 'Adults_Exploited'],
key_on='feature.properties.pfa16nm',
the Police_Force should match the pfa16nm in your json file.
This code will give you the pfa16nms in your json.
import json

policejson = json.load(open('Police_Force_Areas_December_2016_Generalised_Clipped_Boundaries_in_England_and_Wales.geojson.json'))
for x in policejson['features']:
    print(x['properties']['pfa16nm'])
Then you need to fix your Excel file and make sure that the Police_Force column matches the names in your JSON file.
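A quick way to see exactly which names disagree is to compare the two sets of names directly. A minimal sketch, assuming the same file names as above:

import json
import pandas as pd

adults_trafficked = pd.read_excel('Adults trafficked.xlsx')
with open('Police_Force_Areas_December_2016_Generalised_Clipped_Boundaries_in_England_and_Wales.geojson.json') as f:
    policejson = json.load(f)

# Names as they appear in each source
geojson_names = {feature['properties']['pfa16nm'] for feature in policejson['features']}
excel_names = set(adults_trafficked['Police_Force'])

print('In the Excel file but not in the GeoJSON:', excel_names - geojson_names)
print('In the GeoJSON but not in the Excel file:', geojson_names - excel_names)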
The problem is that the name of the key and the name of the Police_Force in your data aren't matching. So, after analysing your data as well as the JSON file, I did some preprocessing of your data so that the names match the key in the JSON file.
Here is a full-fledged solution to your question.
# import libraries
import pandas as pd
import folium
import json
import webbrowser

# read data
adults_trafficked = pd.read_excel('Adults trafficked.xlsx')

# pre-processing of data
adults_trafficked['Police_Force'] = adults_trafficked['Police_Force'].replace('Police|Constabulary', '', regex=True).replace('&', 'and', regex=True)
adults_trafficked.loc[adults_trafficked['Police_Force'] == "Metropolitan Service", 'Police_Force'] = 'Metropolitan Police'

# remove any trailing or leading white spaces
adults_trafficked['Police_Force'] = adults_trafficked['Police_Force'].str.strip()

# border json file
force_boundaries = 'Police_Force_Areas_December_2016_Generalised_Clipped_Boundaries_in_England_and_Wales.geojson.json'

# choropleth map
m = folium.Map([52.6333, -1.1333], zoom_start=7)
m.choropleth(
    geo_data=force_boundaries,
    data=adults_trafficked,
    columns=['Police_Force', 'Adults_Exploited'],
    key_on='feature.properties.pfa16nm',
    threshold_scale=[0, 40, 80, 120, 160, 200],
    fill_color='BuPu',
    legend_name='Trafficked Humans',
)
m.save('map.html')
webbrowser.open('map.html')
Note that the length of threshold_scale cannot be more than 6; I can see yours was 8. Also, there are only 44 police force entries in your JSON file while the length of your dataset is 47, so the 3 entries which didn't match were ignored by folium.
This is what you will get
In case you have trouble understanding any part of the code, please comment below.

GCP Proto Datastore encode JsonProperty in base64

I store a blob of Json in the datastore using JsonProperty.
I don't know the structure of the json data.
I am using endpoints proto datastore in order to retrieve my data.
The problem is that the JSON property is encoded in base64 and I want a plain JSON object.
For the example, the json data will be:
{
    "first": 1,
    "second": 2
}
My code looks something like:
import endpoints
from google.appengine.ext import ndb
from protorpc import remote
from endpoints_proto_datastore.ndb import EndpointsModel


class Model(EndpointsModel):
    data = ndb.JsonProperty()


@endpoints.api(name='myapi', version='v1', description='My Sample API')
class DataEndpoint(remote.Service):

    @Model.method(path='mymodel2', http_method='POST',
                  name='mymodel.insert')
    def MyModelInsert(self, my_model):
        my_model.data = {"first": 1, "second": 2}
        my_model.put()
        return my_model

    @Model.method(path='mymodel/{entityKey}',
                  http_method='GET',
                  name='mymodel.get')
    def getMyModel(self, model):
        print(model.data)
        return model


API = endpoints.api_server([DataEndpoint])
When I call the api for getting a model, I get:
POST /_ah/api/myapi/v1/mymodel2
{
"data": "eyJzZWNvbmQiOiAyLCAiZmlyc3QiOiAxfQ=="
}
where eyJzZWNvbmQiOiAyLCAiZmlyc3QiOiAxfQ== is the base64 encoded of {"second": 2, "first": 1}
And the print statement gives me: {u'second': 2, u'first': 1}
So, in the method, I can explore the json blob data as a python dict.
But, in the api call, the data is encoded in base64.
I expected the API call to give me:
{
    'data': {
        'second': 2,
        'first': 1
    }
}
How can I get this result?
After the discussion in the comments of your question, let me share with you a sample code that you can use in order to store a JSON object in Datastore (it will be stored as a string), and later retrieve it in such a way that:
It will show as plain JSON after the API call.
You will be able to parse it again to a Python dict using eval.
I hope I understood your issue correctly, and that this helps you with it.
import endpoints
from google.appengine.ext import ndb
from protorpc import remote
from endpoints_proto_datastore.ndb import EndpointsModel


class Sample(EndpointsModel):
    column1 = ndb.StringProperty()
    column2 = ndb.IntegerProperty()
    column3 = ndb.StringProperty()


@endpoints.api(name='myapi', version='v1', description='My Sample API')
class MyApi(remote.Service):

    # URL: .../_ah/api/myapi/v1/mymodel - POSTS A NEW ENTITY
    @Sample.method(path='mymodel', http_method='GET', name='Sample.insert')
    def MyModelInsert(self, my_model):
        my_dict = {'first': 1, 'second': 2}
        dict_str = str(my_dict)
        my_model.column1 = "Year"
        my_model.column2 = 2018
        my_model.column3 = dict_str
        my_model.put()
        return my_model

    # URL: .../_ah/api/myapi/v1/mymodel/{ID} - RETRIEVES AN ENTITY BY ITS ID
    @Sample.method(request_fields=('id',), path='mymodel/{id}', http_method='GET', name='Sample.get')
    def MyModelGet(self, my_model):
        if not my_model.from_datastore:
            raise endpoints.NotFoundException('MyModel not found.')
        my_dict = eval(my_model.column3)
        print("This is the Python dict recovered from a string: {}".format(my_dict))
        return my_model


application = endpoints.api_server([MyApi], restricted=False)
I have tested this code using the development server, but it should work the same in production using App Engine with Endpoints and Datastore.
After querying the first endpoint, it will create a new Entity which you will be able to find in Datastore, and which contains a property column3 with your JSON data in string format:
Then, if you use the ID of that entity to retrieve it, in your browser it will show the string without any strange encoding, just plain JSON:
And in the console, you will be able to see that this string can be converted to a Python dict (or also a JSON, using the json module if you prefer):
I hope I have not missed any point of what you want to achieve, but I think all the most important points are covered with this code: a property holding a JSON object, storing it in Datastore, retrieving it in a readable format, and being able to use it again as a JSON/dict.
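A small note on the eval step: the same idea works with the json module, which avoids evaluating arbitrary strings. A sketch of the two lines that would change, assuming the model above:

import json

# When storing: serialize the dict to a JSON string instead of str(dict)
my_model.column3 = json.dumps({'first': 1, 'second': 2})

# When retrieving: parse the JSON string back into a dict instead of using eval()
recovered = json.loads(my_model.column3)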
Update:
I think you should have a look at the list of available Property Types yourself, in order to find which one fits your requirements better. However, as an additional note, I have done a quick test working with a StructuredProperty (a property inside another property), by adding these modifications to the code:
# Define the nested model (your JSON object)
class Structured(EndpointsModel):
    first = ndb.IntegerProperty()
    second = ndb.IntegerProperty()


# Here I added a new property for simplicity; remember, StackOverflow does not write code for you :)
class Sample(EndpointsModel):
    column1 = ndb.StringProperty()
    column2 = ndb.IntegerProperty()
    column3 = ndb.StringProperty()
    column4 = ndb.StructuredProperty(Structured)


# Modify this endpoint definition to add the new property
@Sample.method(request_fields=('id',), path='mymodel/{id}', http_method='GET', name='Sample.get')
def MyModelGet(self, my_model):
    if not my_model.from_datastore:
        raise endpoints.NotFoundException('MyModel not found.')
    # Add the new nested property here
    recovered = eval(my_model.column3)
    my_model.column4 = recovered
    print(json.dumps(my_model.column3))  # requires `import json` at the top of the module
    print("This is the Python dict recovered from a string: {}".format(recovered))
    return my_model
With these changes, the response of the call to the endpoint looks like:
Now column4 is a JSON object itself (although it is not printed in-line, I do not think that should be a problem).
I hope this helps too. If this is not the exact behavior you want, maybe you should play around with the Property Types available, but I do not think there is a type to which you can assign a Python dict (or JSON object) without previously converting it to a string.

Solve issue with nested keys in JSON

I am trying to adapt some Python code from an awesome guide for dark web scanning/graph creation.
I have thousands of JSON files created with OnionScan, and I have this code that should wrap everything into a Gephi graph. Unfortunately, this code is old; the JSON files are now formatted differently and the code does not work anymore:
code (partial):
import glob
import json
import networkx
import shodan

file_list = glob.glob("C:\\test\\*.json")
graph = networkx.DiGraph()

for json_file in file_list:
    with open(json_file, "rb") as fd:
        scan_result = json.load(fd)
        edges = []
        if scan_result['linkedOnions'] is not None:
            edges.extend(scan_result['linkedOnions'])
In fact, at this point I get "KeyError", because linkedOnions is one-level nested like this:
"identifierReport": {
"privateKeyDetected": false,
"foundApacheModStatus": false,
"serverVersion": "",
"relatedOnionServices": null,
"relatedOnionDomains": null,
"linkedOnions": [many urls here]
Could you please help me fix the code above?
I would be VERY grateful :)
Lorenzo
This is the correct way to read nested JSON:
if scan_result['identifierReport']['linkedOnions'] is not None:
    edges.extend(scan_result['identifierReport']['linkedOnions'])
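If some scan results are missing the identifierReport block entirely, a slightly more defensive version using dict.get avoids both the KeyError and a crash on None. A minimal sketch:

# Fall back gracefully when the keys are absent or null
linked = (scan_result.get('identifierReport') or {}).get('linkedOnions')
if linked is not None:
    edges.extend(linked)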
Try this; it will work for you if your JSON file is in the correct format:
try:
    scan_result = json.load(fd)
    edges = []
    # index through the nested identifierReport key, as shown above
    if scan_result['identifierReport']['linkedOnions'] is not None:
        edges.extend(scan_result['identifierReport']['linkedOnions'])
except Exception as e:
    # print your message or log it
    print(e)