I'm trying to build a GeoJSON file with the Python geojson module, consisting of a regular 2-D grid of points whose 'properties' are associated with geophysical variables (velocity, temperature, etc.). The information comes from a netCDF file.
So the code is something like this:
from netCDF4 import Dataset
import numpy as np
import geojson
ncfile = Dataset('20140925-0332-n19.nc', 'r')
u = ncfile.variables['Ug'][:,:] # [T,Z,Y,X]
v = ncfile.variables['Vg'][:,:]
lat = ncfile.variables['lat'][:]
lon = ncfile.variables['lon'][:]
features = []
for i in range(0, len(lat)):
    for j in range(0, len(lon)):
        coords = (lon[j], lat[i])
        features.append(geojson.Feature(geometry=geojson.Point(coords), properties={"u": u[i, j], "v": v[i, j]}))
In this case the point has velocity components in the 'properties' object. The error I receive is on the features.append() line, with the following message:
*ValueError: -5.4989638 is not JSON compliant number*
which corresponds to a longitude value. Can someone explain to me what can be wrong?
I simply used a conversion to float, which eliminated that error without needing numpy:
coords = (float(lon[j]),float(lat[i]))
I found the solution. The geojson module only supports the standard Python numeric types, while numpy provides a couple of dozen scalar types of its own. Unfortunately, the netCDF4 module needs numpy to load arrays from netCDF files. I solved it using the numpy.asscalar() method, as explained here. So in the code above, for example:
coords = (lon[j],lat[i])
is replaced by
coords = (np.asscalar(lon[j]),np.asscalar(lat[i]))
and it also works for the rest of the variables coming from the netCDF file.
Anyway, thanks Bret for your comment, which provided me the clue to solve it.
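Note that numpy.asscalar() has been deprecated in newer NumPy releases; the .item() method on numpy scalars does the same conversion. A minimal sketch of the loop with that conversion applied everywhere (same variable names as above, untested):
features = []
for i in range(len(lat)):
    for j in range(len(lon)):
        # .item() turns a numpy scalar into the equivalent native Python type,
        # which the geojson module can serialize
        coords = (lon[j].item(), lat[i].item())
        props = {"u": u[i, j].item(), "v": v[i, j].item()}
        features.append(geojson.Feature(geometry=geojson.Point(coords), properties=props))
collection = geojson.FeatureCollection(features)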
I am trying to recreate a map of NY state geographical data using geopandas and the data available here: https://pubs.usgs.gov/of/2005/1325/#NY. I can print the map but cannot figure out how to make use of the other files to plot their columns.
Any help would be appreciated.
What exactly are you trying to do?
Here's a quick setup you can use to download the Shapefiles from the site and have access to a GeoDataFrame:
import geopandas as gpd
from io import BytesIO
import requests as r
# Link to file
shp_link = 'https://pubs.usgs.gov/of/2005/1325/data/NYgeol_dd.zip'
# Downloading the file into memory
my_req = r.get(shp_link)
# Creating a file stream for GeoPandas
my_zip = BytesIO(my_req.content)
# Loading the data into a GeoDataFrame
my_geodata = gpd.read_file(my_zip)
# Printing all of the column names
for this_col in my_geodata.columns:
    print(this_col)
Now you can access the multiple columns in my_geodata using square brackets. For example, if I want to access the data stored in the column called "SOURCE", I can just use my_geodata["SOURCE"].
Now it's just a matter of figuring out what exactly you want to do with that data.
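If the goal is to color the map by one of those columns, GeoPandas can plot straight from the GeoDataFrame; a minimal sketch using the "SOURCE" column from the example above (untested):
import matplotlib.pyplot as plt

# Color each polygon by its value in the "SOURCE" column
my_geodata.plot(column="SOURCE", legend=True)
plt.show()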
I'm trying to find a library that contains a keyword to help me with that, but I haven't succeeded.
What I'm doing at the moment is converting each JSON response to a dictionary and then comparing the dictionaries, but I hate it.
I was trying to find similar libraries and found this Python code, but I don't know how to make this function work for me.
import re
import json as JSON
import jsondiff

# Note: this is a method taken from a larger class; ID and ID_REP are a regex
# pattern and its replacement defined elsewhere in that code.
def _verify_json_file(self, result, exp):
    '''
    Verifies if two json files are different
    '''
    with open(exp) as json_data:
        data = re.sub(ID, ID_REP, json_data.read())
        expected = JSON.loads(data)
    differences = jsondiff.diff(expected, result, syntax='explicit')
    if not differences:
        return True
    if differences == expected or differences == result:
        raise AssertionError("ERROR! Jsons have different structure")
    return False
APPROACH#0
To make the above function work for you, you just have to create a Python file, put your function in that file, keep that file on the PYTHONPATH, and use it in your Robot code by importing it in the Settings section with the Library keyword. I have answered this question in detail, with all the steps mentioned, in this link.
Create a Python file (comparejsons.py) with the above code in it
Keep the above Python file on the PYTHONPATH
Use Library comparejsons.py under the Settings section in your robot file, as sketched below
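As a rough Python sketch of what comparejsons.py could look like once wrapped as a keyword library (the class and keyword names here are just examples, and Robot Framework does not expose methods whose names start with an underscore, so the original function would need to be renamed):
# comparejsons.py - hypothetical keyword library built around the function above
import json
import jsondiff


class comparejsons(object):
    ROBOT_LIBRARY_SCOPE = 'GLOBAL'

    def jsons_should_be_equal(self, result, expected_path):
        """Fail the test if the JSON document in expected_path differs from result."""
        with open(expected_path) as f:
            expected = json.load(f)
        differences = jsondiff.diff(expected, result, syntax='explicit')
        if differences:
            raise AssertionError("JSONs differ: %s" % differences)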
APPROACH#1
You should create a custom keyword which makes use of the below library and then compare the 2 jsons.
You can make use of "robotframework-jsonvalidator" module
Sample code below,
*** Settings ***
Library           JsonValidator
Library           OperatingSystem

*** Test Cases ***
Check Element
    ${json_example}=    OperatingSystem.Get File    ${CURDIR}${/}json_example.json
    Element should exist    ${json_example}    .author:contains("Evelyn Waugh")
APPROACH#2
After converting the JSON to a dictionary you can just make use of the Collections library keyword below; here, values=True is important.
Dictionaries Should Be Equal    ${dict1}    ${dict2}    values=True
I am running a SQL query via the google.cloud.bigquery.Client.query method in AWS Lambda (Python 2.7 runtime). The native BQ object extracted from a query is a BigQuery Row(), i.e.,
Row((u'exampleEmail#gmail.com', u'XXX1234XXX'), {u'email': 0, u'email_id': 1})
I need to convert this to JSON, i.e.,
[{'email_id': 'XXX1234XXX', 'email': 'exampleEmail#gmail.com'}]
When running locally, I am able to just call the Python dict() constructor on the row to transform it, i.e.,
client = bigquery.Client()
queryJob = client.query(sql)
results = []
for row in queryJob.result():
    # at this point row is the BQ sample Row object shown above
    tmp = dict(row)
    results.append(tmp)
but when I load this into AWS Lambda it throws the error:
ValueError: dictionary update sequence element #0 has length 22; 2 is required
I have tried forcing it in different ways, breaking it out into sections, etc., but cannot get this into the desired JSON format.
I took a brief dive into the rabbit hole of transforming the QueryJob into a Pandas dataframe and then from there into a JSON object, which also works locally but runs into numpy package errors in AWS Lambda which seems to be a bit of a known issue.
I feel like this should have an easy solution but just haven't found it yet.
Try doing it like this:
L = []
sql = "<sql_statement>"  # your SQL string
query_job = client.query(sql)  # API request
query_job.result()
for row in query_job:
    email_id = row.get('email_id')
    email = row.get('email')
    L.append([email_id, email])
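If the target really is the list-of-dicts JSON shown in the question, a minimal sketch that builds it explicitly from the row fields might look like this (assuming client is an authenticated bigquery.Client and sql is the query string; untested):
import json

query_job = client.query(sql)

rows = []
for row in query_job.result():
    # Row objects support access by field name, so build a plain dict per row
    rows.append({"email": row["email"], "email_id": row["email_id"]})

payload = json.dumps(rows)  # '[{"email": "...", "email_id": "..."}]'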
Is there a way to change values, or assign new variables, in a JSON file and afterwards write it back out in the same format?
The rjson package can be used to read the JSON file into R as a data.frame, but how do I convert this data.frame back to JSON after my changes?
EDIT:
sample code:
json file:
{"__v":1,"_id":{"$oid":"559390f6fa76bc94285fa68a"},"accountId":6,"api":false,"countryCode":"no","countryName":"Norway","date":{"$date":"2015-07-01T07:04:22.265Z"},"partnerId":1,"query":{"search":[{"label":"skill","operator":"and","terms":["java"],"type":"required"}]},"terms":[{"$oid":"559390f6fa76bc94285fa68b"}],"time":19,"url":"eyJzZWFyY2giOlt7InRlcm1zIjpbImphdmEiXSwibGFiZWwiOiJza2lsbCIsInR5cGUiOiJyZXF1aXJlZCIsIm9wZXJhdG9yIjoiYW5kIn1dfQ","user":11}
{"__v":1,"_id":{"$oid":"5593910cfa76bc94285fa68d"},"accountId":6,"api":false,"countryCode":"se","countryName":"Sweden","date":{"$date":"2015-07-01T07:04:44.565Z"},"partnerId":1,"query":{"search":[{"label":"company","operator":"or","terms":["microsoft"],"type":"required"},{"label":"country","operator":"or","terms":["se"],"type":"required"}]},"terms":[{"$oid":"5593910cfa76bc94285fa68e"},{"$oid":"5593910cfa76bc94285fa68f"}],"time":98,"url":"eyJzZWFyY2giOlt7InRlcm1zIjpbIm1pY3Jvc29mdCJdLCJsYWJlbCI6ImNvbXBhbnkiLCJ0eXBlIjoicmVxdWlyZWQiLCJvcGVyYXRvciI6Im9yIn0seyJ0ZXJtcyI6WyJzZSJdLCJsYWJlbCI6ImNvdW50cnkiLCJ0eXBlIjoicmVxdWlyZWQiLCJvcGVyYXRvciI6Im9yIn1dfQ","user":13}
Code:
library('rjson')
c <- file(Usersfile,'r')
l <- readLines(c,-1L)
json <- lapply(X=l,fromJSON)
json[[1]]$countryName <- 'Jamaica'
result <- cat(toJSON(json))
Output (it is one line and starts with [):
[{"__v":1,"_id":{"$oid":"559390f6fa76bc94285fa68a"},"accountId":6,"api":false,"countryCode":"no","countryName":"Jamaica","date":{"$date":"2015-07-01T07:04:22.265Z"},"partnerId":1,"query":{"search":[{"label":"skill","operator":"and","terms":"java","type":"required"}]},"terms":[{"$oid":"559390f6fa76bc94285fa68b"}],"time":19,"url":"eyJzZWFyY2giOlt7InRlcm1zIjpbImphdmEiXSwibGFiZWwiOiJza2lsbCIsInR5cGUiOiJyZXF1aXJlZCIsIm9wZXJhdG9yIjoiYW5kIn1dfQ","user":11},{"__v":1,"_id":{"$oid":"5593910cfa76bc94285fa68d"},"accountId":6,"api":false,"countryCode":"se","countryName":"Sweden","date":{"$date":"2015-07-01T07:04:44.565Z"},"partnerId":1,"query":{"search":[{"label":"company","operator":"or","terms":"microsoft","type":"required"},{"label":"country","operator":"or","terms":"se","type":"required"}]},"terms":[{"$oid":"5593910cfa76bc94285fa68e"},{"$oid":"5593910cfa76bc94285fa68f"}],"time":98,"url":"eyJzZWFyY2giOlt7InRlcm1zIjpbIm1pY3Jvc29mdCJdLCJsYWJlbCI6ImNvbXBhbnkiLCJ0eXBlIjoicmVxdWlyZWQiLCJvcGVyYXRvciI6Im9yIn0seyJ0ZXJtcyI6WyJzZSJdLCJsYWJlbCI6ImNvdW50cnkiLCJ0eXBlIjoicmVxdWlyZWQiLCJvcGVyYXRvciI6Im9yIn1dfQ","user":13}]
convert data frame to json
So this question has already been answered in full here ^^^
Quick Summary ::
There are 2 options presented.
(A) rjson library
import the library
use the toJSON() method to create a JSON object. (Not exactly sure what the unname() function does... :p ).
(B) jsonlite library
import the jsonlite library
just use the toJSON() method (same as above, but with no modification).
cat() the above object.
Code examples are at that link. Hope this helps!
I have a large dataset stored in a S3 bucket, but instead of being a single large file, it's composed of many (113K to be exact) individual JSON files, each of which contains 100-1000 observations. These observations aren't on the highest level, but require some navigation within each JSON to access.
i.e.
json["interactions"] is a list of dictionaries.
I'm trying to utilize Spark/PySpark (version 1.1.1) to parse through and reduce this data, but I can't figure out the right way to load it into an RDD, because it's neither all records in one file (in which case I'd use sc.textFile, though there's the added complication here of JSON) nor each record in its own file (in which case I'd use sc.wholeTextFiles).
Is my best option to use sc.wholeTextFiles and then use a map (or in this case flatMap?) to pull the multiple observations from being stored under a single filename key to their own key? Or is there an easier way to do this that I'm missing?
I've seen answers here that suggest just using json.loads() on all files loaded via sc.textFile, but it doesn't seem like that would work for me because the JSONs aren't simple highest-level lists.
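For concreteness, a rough sketch of the wholeTextFiles + flatMap idea I have in mind (untested; the bucket path is a placeholder and "interactions" is the nested key from my data):
import json

# Each (filename, file_content) pair expands into one record per interaction
raw = sc.wholeTextFiles("s3n://my-bucket/my-prefix/*.json")
interactions = raw.flatMap(lambda kv: json.loads(kv[1])["interactions"])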
The previous answers are not going to read the files in a distributed fashion (see reference). To do so, you would need to parallelize the s3 keys and then read in the files during a flatMap step like below.
import boto3
import json
from pyspark.sql import Row
def distributedJsonRead(s3Key):
    s3obj = boto3.resource('s3').Object(bucket_name='bucketName', key=s3Key)
    contents = json.loads(s3obj.get()['Body'].read().decode('utf-8'))
    for dicts in contents['interactions']:
        yield Row(**dicts)

pkeys = sc.parallelize(keyList)  # keyList is a list of s3 keys
dataRdd = pkeys.flatMap(distributedJsonRead)
Boto3 Reference
What about using DataFrames?
Does
testFrame = sqlContext.read.json('s3n://<bucket>/<key>')
give you what you want from one file?
Does every observation have the same "columns" (# of keys)?
If so, you could use boto to list each object you want to add, read them in, and union them with each other.
from pyspark.sql import SQLContext
import boto3
from pyspark.sql.types import *
sqlContext = SQLContext(sc)
s3 = boto3.resource('s3')
bucket = s3.Bucket('<bucket>')
aws_secret_access_key = '<secret>'
aws_access_key_id = '<key>'
#Configure spark with your S3 access keys
sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", aws_access_key_id)
sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", aws_secret_access_key)
object_list = [k for k in bucket.objects.all()]
paths = ['s3n://' + o.bucket_name + '/' + o.key for o in object_list]
dataframes = [sqlContext.read.json(path) for path in paths]
df = dataframes[0]
for frame in dataframes[1:]:  # skip the first frame, it is already in df
    df = df.unionAll(frame)
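The union loop can also be written as a single reduce over the list, which some may find tidier (a sketch assuming the same dataframes list as above):
from functools import reduce

# Fold the list of DataFrames into one by repeatedly unioning pairs
df = reduce(lambda a, b: a.unionAll(b), dataframes)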
I'm new to spark myself so I'm wondering if there's a better way to use dataframes with a lot of s3 files, but so far this is working for me.
The name is misleading (because it's singular), but sparkContext.textFile() (at least in the Scala case) also accepts a directory name or a wildcard path, so you should just be able to say textFile("/my/dir/*.json").
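A rough sketch of that wildcard approach applied to the question's data (assuming each file holds a single JSON document on one line; otherwise wholeTextFiles is needed):
import json

# One record per file when each file is a single-line JSON document
raw = sc.textFile("s3n://<bucket>/<prefix>/*.json")
interactions = raw.flatMap(lambda doc: json.loads(doc)["interactions"])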