I am trying to recreate a map of NY state geographical data using geopandas and the data available here: https://pubs.usgs.gov/of/2005/1325/#NY. I can print the map but cannot figure out how to make use of the other files to plot their columns.
Any help would be appreciated.
What exactly are you trying to do?
Here's a quick setup you can use to download the Shapefiles from the site and have access to a GeoDataFrame:
import geopandas as gpd
from io import BytesIO
import requests as r
# Link to file
shp_link = 'https://pubs.usgs.gov/of/2005/1325/data/NYgeol_dd.zip'
# Downloading the file into memory
my_req = r.get(shp_link)
# Creating a file stream for GeoPandas
my_zip = BytesIO(my_req.content)
# Loading the data into a GeoDataFrame
my_geodata = gpd.read_file(my_zip)
# Printing all of the column names
for this_col in my_geodata.columns:
    print(this_col)
Now you can access the columns in my_geodata using square brackets. For example, if I want to access the data stored in the column called "SOURCE", I can just use my_geodata["SOURCE"].
Now it's just a matter of figuring out what exactly you want to do with that data.
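Since the goal is to plot the map coloured by one of those columns, here is a minimal sketch of what that could look like (assuming matplotlib is installed and "SOURCE" is the column you care about):
import matplotlib.pyplot as plt
# Colour the geology polygons by the values in the "SOURCE" column
my_geodata.plot(column="SOURCE", legend=True, figsize=(10, 10))
plt.show()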
I basically have a procedure where I make multiple calls to an API and, using a token within the JSON return, pass that back to a function to call the API again to get a "paginated" file.
In total I have to call and download 88 JSON files that total 758 MB. The JSON files are all formatted the same way and have the same "schema", or at least they should. I have tried reading each JSON file after it has been downloaded into a dataframe, and then attempted to union that dataframe to a master dataframe, so essentially I'll have one big dataframe with all 88 JSON files read into it.
However, the problem I encounter is that at roughly file 66 the system (Python/Databricks/Spark) decides to change the data type of a field. It is always a string, and I'm guessing that when a value actually appears in that field it changes to a boolean. The problem is then that the unionByName fails because of the different data types.
What is the best way for me to resolve this? I thought about using "extend" to merge all the JSON files into one big file, however reading a single 758 MB JSON file would be a huge undertaking.
Could the other solution be to explicitly set the schema that the JSON file is read into so that it is always the same type?
If you know the attributes of those files, you can define the schema before reading them and create an empty df with that schema, so you can do a unionByName with allowMissingColumns=True:
something like:
from pyspark.sql.types import *

my_schema = StructType([
    StructField('file_name', StringType(), True),
    StructField('id', LongType(), True),
    StructField('dataset_name', StringType(), True),
    StructField('snapshotdate', TimestampType(), True)
])

output = sqlContext.createDataFrame(sc.emptyRDD(), my_schema)
df_json = spark.read.[...your JSON file...]
output.unionByName(df_json, allowMissingColumns=True)
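As a variant (just a sketch, where the input path is a hypothetical placeholder), you can also pass the same schema directly at read time, so Spark never infers the conflicting boolean type:
# Hypothetical path; enforce my_schema instead of letting Spark infer types
df_json = spark.read.schema(my_schema).json('/mnt/raw/page_66.json')
output = output.unionByName(df_json, allowMissingColumns=True)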
I'm not sure this is what you are looking for. I hope it helps
I have a large dataset in JSON format from which I want to extract the important attributes which capture the most variance. I want to extract these attributes to build a search engine on the dataset, with these attributes being the hash key.
The main question being asked here is how to do feature selection on JSON data.
You could read the data into a pandas DataFrame object with the pandas.read_json() function. You can use this DataFrame to gain insight into your data. For example:
import pandas
data = pandas.read_json(json_file)
data.head() # Displays the top five rows
data.info() # Displays description of the data
Or you can use matplotlib on this DataFrame to plot a histogram for each numerical attribute:
import matplotlib.pyplot as plt
data.hist(bins=50, figsize=(20, 15))
plt.show()
If you are interested in the correlation of attributes, you can use the pandas.plotting.scatter_matrix() function.
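For instance, a small sketch (the column names here are only placeholders for whatever numeric attributes your JSON happens to contain):
from pandas.plotting import scatter_matrix
# Pairwise scatter plots for a handful of hypothetical numeric attributes
numeric_attrs = ["attr_a", "attr_b", "attr_c"]
scatter_matrix(data[numeric_attrs], figsize=(12, 8))
plt.show()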
You have to manually pick the attributes that fit your task best, and these tools help you to understand the data and gain insight into it.
I'm trying to build a GeoJSON file with the Python geojson module, consisting of a regular 2-D grid of points whose 'properties' are associated with geophysical variables (velocity, temperature, etc.). The information comes from a netCDF file.
So the code is something like this:
from netCDF4 import Dataset
import numpy as np
import geojson
ncfile = Dataset('20140925-0332-n19.nc', 'r')
u = ncfile.variables['Ug'][:,:] # [T,Z,Y,X]
v = ncfile.variables['Vg'][:,:]
lat = ncfile.variables['lat'][:]
lon = ncfile.variables['lon'][:]
features=[]
for i in range(0, len(lat)):
    for j in range(0, len(lon)):
        coords = (lon[j], lat[i])
        features.append(geojson.Feature(geometry=geojson.Point(coords), properties={"u": u[i, j], "v": v[i, j]}))
In this case the point has velocity components in the 'properties' object. The error I receive is on the features.append() line with the following message:
*ValueError: -5.4989638 is not JSON compliant number*
which corresponds to a longitude value. Can someone explain to me what can be wrong?
I simply used a conversion to float and it eliminated that error without needing numpy.
coords = (float(lon[j]),float(lat[i]))
I found the solution. The geojson module only supports the standard Python data types, while numpy extends these with around 24 scalar types of its own. Unfortunately the netCDF4 module needs numpy to load arrays from netCDF files. I solved it using the numpy.asscalar() method as explained here. So in the code above, for example:
coords = (lon[j],lat[i])
is replaced by
coords = (np.asscalar(lon[j]),np.asscalar(lat[i]))
and it also works for the rest of the variables coming from the netCDF file.
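For completeness, a sketch of the loop with the conversion applied to the properties as well (note that newer numpy releases deprecate asscalar() in favour of the equivalent .item() method):
features = []
for i in range(len(lat)):
    for j in range(len(lon)):
        # .item() turns a numpy scalar into a plain Python float
        coords = (lon[j].item(), lat[i].item())
        props = {"u": u[i, j].item(), "v": v[i, j].item()}
        features.append(geojson.Feature(geometry=geojson.Point(coords), properties=props))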
Anyway, thanks Bret for your comment that provided me the clue to solve it.
I am using PySpark. I have a list of gzipped JSON files on S3 which I have to access, transform, and then export in parquet to S3. Each JSON file contains around 100k lines, so parallelizing within a file won't make much sense (but I am open to parallelizing it), but there are around 5k files which I have to parallelize over. My approach is: pass the JSON file list to the script -> run parallelize on the list -> run map (this is where I am getting blocked). How do I access and transform the JSON, create a DF out of the transformed JSON, and dump it as parquet into S3?
To read json in a distributed fashion, you will need to parallelize your keys as you mention. To do this while reading from s3, you'll need to use boto3. Below is a skeleton sketch of how to do so. You'll likely need to modify distributedJsonRead to fit your use case.
import boto3
import json
from pyspark.sql import Row

def distributedJsonRead(s3Key):
    # Fetch the object for this key and turn its JSON body into a Row
    s3obj = boto3.resource('s3').Object(bucket_name='bucketName', key=s3Key)
    contents = json.loads(s3obj.get()['Body'].read().decode('utf-8'))
    return Row(**contents)

pkeys = sc.parallelize(keyList)  # keyList is a list of s3 keys
dataRdd = pkeys.map(distributedJsonRead)
Boto3 Reference: http://boto3.readthedocs.org/en/latest/guide/quickstart.html
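From there, since the goal is parquet on S3, a minimal sketch of one way to finish (assuming a sqlContext is available; the bucket name and output prefix below are placeholders):
# Build a DataFrame from the Row RDD, transform it, and write it out as parquet
df = sqlContext.createDataFrame(dataRdd)
transformed = df  # apply whatever transformations you need here
transformed.write.parquet('s3n://bucketName/output/')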
Edit: to address the 1:1 mapping of input files to output files
Later on, it's likely that having a merged parquet data set would be easier to work with. But if this is the way you need to do it, you could try something like this
for k in keyList:
    rawtext = sqlContext.read.json(k)  # or whichever method you need to use to read in the data
    outpath = k[:-4] + 'parquet'
    rawtext.write.parquet(outpath)
I don't think you will be able to parallelize these operations if you want a 1:1 mapping of JSON to parquet files. Spark's read/write functionality is designed to be called from the driver, and needs access to sc and sqlContext. This is another reason why having one parquet directory is likely the way to go.
I have a large dataset stored in a S3 bucket, but instead of being a single large file, it's composed of many (113K to be exact) individual JSON files, each of which contains 100-1000 observations. These observations aren't on the highest level, but require some navigation within each JSON to access.
i.e.
json["interactions"] is a list of dictionaries.
I'm trying to utilize Spark/PySpark (version 1.1.1) to parse through and reduce this data, but I can't figure out the right way to load it into an RDD, because it's neither all records in one file (in which case I'd use sc.textFile, though with the added complication here of JSON) nor each record in its own file (in which case I'd use sc.wholeTextFiles).
Is my best option to use sc.wholeTextFiles and then use a map (or in this case flatMap?) to pull the multiple observations from being stored under a single filename key to their own key? Or is there an easier way to do this that I'm missing?
I've seen answers here that suggest just using json.loads() on all files loaded via sc.textFile, but it doesn't seem like that would work for me because the JSONs aren't simple highest-level lists.
The previous answers are not going to read the files in a distributed fashion (see reference). To do so, you would need to parallelize the s3 keys and then read in the files during a flatMap step like below.
import boto3
import json
from pyspark.sql import Row

def distributedJsonRead(s3Key):
    # Fetch the object, parse it, and yield one Row per interaction
    s3obj = boto3.resource('s3').Object(bucket_name='bucketName', key=s3Key)
    contents = json.loads(s3obj.get()['Body'].read().decode('utf-8'))
    for dicts in contents['interactions']:
        yield Row(**dicts)

pkeys = sc.parallelize(keyList)  # keyList is a list of s3 keys
dataRdd = pkeys.flatMap(distributedJsonRead)
Boto3 Reference: http://boto3.readthedocs.org/en/latest/guide/quickstart.html
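Once dataRdd exists it behaves like any other RDD of Rows, so a quick sanity check (purely illustrative, assuming a sqlContext is available) might be:
# Count the flattened observations and peek at the schema Spark infers
print(dataRdd.count())
interactionsDf = sqlContext.createDataFrame(dataRdd)
interactionsDf.printSchema()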
What about using DataFrames? Does
testFrame = sqlContext.read.json('s3n://<bucket>/<key>')
give you what you want from one file?
Does every observation have the same "columns" (# of keys)?
If so, you could use boto to list each object you want to add, read them in, and union them with each other.
from pyspark.sql import SQLContext
from pyspark.sql.types import *
import boto3

sqlContext = SQLContext(sc)

s3 = boto3.resource('s3')
bucket = s3.Bucket('<bucket>')
aws_secret_access_key = '<secret>'
aws_access_key_id = '<key>'

# Configure spark with your S3 access keys
sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", aws_access_key_id)
sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", aws_secret_access_key)

object_list = [o for o in bucket.objects.all()]
paths = ['s3n://' + o.bucket_name + '/' + o.key for o in object_list]

dataframes = [sqlContext.read.json(path) for path in paths]

# Start from the first frame, then union the remaining ones into it
df = dataframes[0]
for frame in dataframes[1:]:
    df = df.unionAll(frame)
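As a small aside, the per-frame union loop can also be written as a single reduce over the same dataframes list (just a sketch of an equivalent formulation):
from functools import reduce
# Fold all of the per-file DataFrames into one with successive unionAll calls
df = reduce(lambda left, right: left.unionAll(right), dataframes)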
I'm new to spark myself so I'm wondering if there's a better way to use dataframes with a lot of s3 files, but so far this is working for me.
The name is misleading (because it's singular), but sparkContext.textFile() (at least in the Scala case) also accepts a directory name or a wildcard path, so you should just be able to say textFile("/my/dir/*.json").
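Building on that, here is a rough sketch of how a wildcard path could be combined with the flatMap idea from the question (this assumes every file is a single JSON document with a top-level 'interactions' list, uses placeholder bucket/prefix names, and reads whole files since each document spans multiple lines):
import json
# (filename, file_content) pairs for every object matching the wildcard
raw = sc.wholeTextFiles('s3n://<bucket>/<prefix>/*.json')
# Parse each document and flatten the nested observations into one RDD
interactions = raw.flatMap(lambda kv: json.loads(kv[1])['interactions'])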