Masking netcdf file with shapefile using GeoPandas Python - intersection

I have a netcdf file of the EDGAR emission inventory and a shapefile of US Census data. I want to extract only the netcdf data that overlaps/intersects with the NYC region of the shapefile, so that I can calculate the total emissions in NYC.
Shapefile is 2018 US Census cartographic Boundary Files for Urban Areas.
https://www2.census.gov/geo/tiger/GENZ2018/shp/cb_2018_us_ua10_500k.zip
Netcdf file is EDGAR Emission inventories for N2O in 2015.
https://edgar.jrc.ec.europa.eu/gallery.php?release=v50&substance=N2O&sector=TOTALS
I have never worked with shapefiles/GeoPandas, so bear with me. I am able to read in the shapefile, filter for a specific region, and convert the netcdf to a GeoDataFrame. I want to keep only the netcdf data that falls within the filtered region from the shapefile in order to do the analysis.
Update: I tried using sjoin and clip, but when I execute those commands the resulting dataframes have no data, and when I plot the sjoin result I get the error "The GeoDataFrame you are attempting to plot is empty. Nothing has been displayed."
import netCDF4
import numpy as np
from osgeo import gdal,osr,ogr
import matplotlib.pyplot as plt
import geopandas as gpd
import pandas as pd
import xarray as xr
# read in file path for shapefile
fp_shp = "C:/Users/cb_2018_us_ua10_500k/cb_2018_us_ua10_500k.shp"
# read in netcdf file path
ncs = "C:/Users/v50_N2O_2015.0.1x0.1.nc"
# Read in NETCDF as a pandas dataframe
# Xarray provides a simple method of opening netCDF files, and converting them to pandas dataframes
ds = xr.open_dataset(ncs)
edgar = ds.to_dataframe()
# the index in the df is a Pandas.MultiIndex. To reset it, use df.reset_index()
edgar = edgar.reset_index()
# Read shapefile using gpd.read_file()
shp = gpd.read_file(fp_shp)
# read the netcdf data file
#nc = netCDF4.Dataset(ncs,'r')
# quick check for shpfile plotting
shp.plot(figsize=(12, 8));
# filter out shapefile for SPECIFIC city/region
# how to filter rows in DataFrame that contains string
# extract NYC from shapefile dataframe
nyc_shp = shp[shp['NAME10'].str.contains("New York")]
# export shapefile
#nyc_shp.to_file('NYC.shp', driver ='ESRI Shapefile')
# use geopandas points_from_xy() to transform Longitude and Latitude into a list of shapely.Point objects and set it as a geometry while creating the GeoDataFrame
edgar_gdf = gpd.GeoDataFrame(edgar, geometry=gpd.points_from_xy(edgar.lon, edgar.lat))
print(edgar_gdf.head())
# check CRS coordinates
nyc_shp.crs #shapefile
edgar_gdf.crs #geodataframe netcdf
# set coordinates equal to each other
# PointsGeodataframe.crs = PolygonsGeodataframe.crs
edgar_gdf.crs = nyc_shp.crs
# check coordinates after setting coordinates equal to each other
edgar_gdf.crs #geodataframe netcdf
# Clip points, lines, or polygon geometries to the mask extent.
mask = gpd.clip(edgar_gdf, nyc_shp)

I figured it out! I need to make sure my netcdf file has the same longitude degrees as my shapefile. So instead of [0, 360] I converted it to be [-180, 180] to match before converting into a GeoDataFrame. Then the above code works!
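For reference, a minimal sketch of that longitude conversion with xarray, assuming the coordinate is named lon as in the code above:

ds = xr.open_dataset(ncs)
# shift longitudes from [0, 360) to [-180, 180) and re-sort so they line up with the shapefile
ds = ds.assign_coords(lon=(((ds.lon + 180) % 360) - 180)).sortby('lon')
edgar = ds.to_dataframe().reset_index()

After rebuilding edgar_gdf and re-running gpd.clip, the clipped points can be summed with something like mask['emi_n2o'].sum(); the column name emi_n2o is an assumption (check edgar_gdf.columns), and since EDGAR reports fluxes, an absolute total would still need weighting by grid-cell area.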

Related

Python: Creating PDF from PNG images and CSV tables using reportlab

I am trying to create a PDF document from a series of PNG images and a series of CSV tables using the python package reportlab. The tables are giving me a little bit of grief.
This is my code so far:
import os
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from reportlab.lib.pagesizes import letter
from reportlab.platypus import SimpleDocTemplate
from reportlab.pdfgen.canvas import Canvas
from reportlab.platypus import *
from reportlab.platypus.tables import Table
from PIL import Image
from matplotlib.backends.backend_pdf import PdfPages
# Set the path to the folder containing the images and tables
folder_path = 'Files'
# Create a new PDF document
pdf_filename = 'testassessment.pdf'
canvas = Canvas(pdf_filename)
# Iterate through the files in the folder
for file in os.listdir(folder_path):
    file_path = os.path.join(folder_path, file)
    # If the file is an image, draw it on the PDF
    if file.endswith('.png'):
        canvas.drawImage(file_path, 105, 148.5, width=450, height=400)
        canvas.showPage() #ends page
    # If the file is a table, draw it on the PDF
    elif file.endswith('.csv'):
        df = pd.read_csv(file_path)
        table = df.to_html()
        canvas.drawString(10, 10, table)
        canvas.showPage()
# Save the PDF
canvas.save()
The tables are not working. When I use .drawString it ends up looking like this:
Does anyone know how I can get the table to be properly inserted into the PDF?
According to the reportlab docs, page 14, "The draw string methods draw single lines of text on the canvas." You might want to have a look at "The text object methods" on the same page.
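For instance, a canvas text object handles multi-line output; a minimal sketch, reusing the canvas and df from the question (the start position is an arbitrary example):

text_obj = canvas.beginText(40, 740)  # arbitrary starting position
for line in df.to_string().splitlines():
    text_obj.textLine(line)
canvas.drawText(text_obj)
canvas.showPage()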
You might want to consider using PyMuPDF with Stories; it allows for more flexibility of layout from a data input. For an example of something very similar to what you are trying to achieve, see: https://pymupdf.readthedocs.io/en/latest/recipes-stories.html#how-to-display-a-list-from-json-data
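If you would rather stay with reportlab, the platypus Table flowable that the question already imports can lay a DataFrame out as a real table. A minimal sketch, assuming a small CSV and a hypothetical file path:

import pandas as pd
from reportlab.lib.pagesizes import letter
from reportlab.platypus import SimpleDocTemplate, Table

df = pd.read_csv('Files/example.csv')  # hypothetical path
data = [df.columns.tolist()] + df.values.tolist()  # header row followed by data rows
doc = SimpleDocTemplate('tables.pdf', pagesize=letter)
doc.build([Table(data)])  # the Table flowable handles layout and splits across pages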

How to convert multiple nested JSON files into single CSV file using python?

I have about 200 nested JSON files with nesting that varies from one to three levels. Each JSON file consists of more than a thousand data points, and the keys are the same in all the files. My objective is to combine the data from all the files in tabular form in a single CSV file so that I can read and analyze all the data at once. I am looking for simple python code with a brief explanation of each step to help me understand the whole sequence.
You can use this code snippet.
First of all, install pandas using
pip install pandas
After that, you can use this code to convert JSON files to CSV.
# code to save all data to a single file
import pandas as pd
import glob
path = './path to directory/*.json'
files = glob.glob(path)
data_frames = []
for file in files:
    f = open(file, 'r')
    data_frames.append(pd.read_json(f))
    f.close()
pd.concat(data_frames).to_csv("data.csv")
# code to save each JSON file to its own CSV file
import pandas as pd
import glob
path = './path to directory/*.json'
files = glob.glob(path)
for file in files:
    f = open(file, 'r')
    jsonData = pd.read_json(f.read())
    jsonData.to_csv(f.name + ".csv")
    f.close()
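Since the question mentions nesting of up to three levels, note that pd.read_json keeps nested objects as dict values inside a single column. A hedged sketch that flattens them first with pandas.json_normalize, assuming each file holds a list of records:

import glob
import json
import pandas as pd

frames = []
for path in glob.glob('./path to directory/*.json'):
    with open(path, 'r') as f:
        records = json.load(f)
    # nested keys become dotted column names, e.g. "address.city"
    frames.append(pd.json_normalize(records))
pd.concat(frames, ignore_index=True).to_csv("data_flat.csv", index=False)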

Create Multiple .json Files from an Excel file with multiple sheets using Pandas

I'm trying to convert a very large number of Excel files with multiple sheets (some of them also very big) to .json files. So I created a list of the sheet names, then looped over it to create a data frame for each sheet and write that data frame to a .json file. My code is:
from zipfile import ZipFile
from bs4 import BeautifulSoup
import pandas as pd
file = 'filename.xlsx'
with ZipFile(file) as zipped_file:
    summary = zipped_file.open(r'xl/workbook.xml').read()
soup = BeautifulSoup(summary, "xml")
sheets = [sheet.get("name") for sheet in soup.find_all("sheet")]
for i in sheets:
    df = pd.read_excel(file, sheet_name=i, index=False, header=1)
    json_file = df.to_json("{}.json".format(i))
This code works like a charm when the sheets are not very big. When I run it on an Excel file it works and creates the json files I want, up to the point where it hits a very big sheet with a lot of data, and then it crashes.
So my question is: is there a different, more efficient way to do this without crashing the program? When I run the df = pd.read_excel command separately for each sheet it works without a problem, but I need this to happen in a loop.
Import numpy and declare an empty numpy array, out_array. Then, given a list of paths, paths, for each path in paths: read the file into a temporary dataframe, temp_df; get its values via the .values attribute and store them in a temporary numpy array, temp_array; then concatenate out_array and temp_array using numpy.concatenate.
Once the loop completes, convert out_array to a dataframe, out_df, using pandas.DataFrame, and finally set the column names for your new dataframe.
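A minimal sketch of that approach, assuming every file shares the same columns; paths is a hypothetical list of Excel file paths:

import numpy as np
import pandas as pd

paths = ['file_1.xlsx', 'file_2.xlsx']  # hypothetical list of paths
out_array = None
columns = None
for path in paths:
    temp_df = pd.read_excel(path)
    temp_array = temp_df.values  # .values is an attribute, not a method
    if out_array is None:
        out_array, columns = temp_array, temp_df.columns
    else:
        out_array = np.concatenate([out_array, temp_array])
# convert the accumulated array back to a dataframe and restore the column names
out_df = pd.DataFrame(out_array, columns=columns)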

Loading a CSV file from Blob Storage Container using PySpark

I am unable to load a CSV file directly from Azure Blob Storage into an RDD using PySpark in a Jupyter Notebook.
I have read through just about all of the other answers to similar problems, but I haven't found specific instructions for what I am trying to do. I know I could also load the data into the Notebook using Pandas, but then I would need to convert the Pandas DataFrame into an RDD afterwards.
My ideal solution would look something like this, but this specific code gives me the error that it can't infer a schema for CSV.
#Load Data
source = <Blob SAS URL>
elog = spark.read.format("csv").option("inferSchema", "true").option("url",source).load()
I have also taken a look at this answer: reading a csv file from azure blob storage with PySpark
but I am having trouble defining the correct path.
Thank you very much for your help!
Here is my sample code that uses Pandas to read a blob URL with a SAS token and then converts the Pandas dataframe to a PySpark one.
First, get a Pandas dataframe object by reading the blob URL.
import pandas as pd
source = '<a csv blob url with SAS token>'
df = pd.read_csv(source)
print(df)
Then, you can convert it to a PySpark one.
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("testDataFrame").getOrCreate()
spark_df = spark.createDataFrame(df)
spark_df.show()
Or, the same result with the code below.
from pyspark.sql import SQLContext
from pyspark import SparkContext
sc = SparkContext()
sqlContext = SQLContext(sc)
spark_df = sqlContext.createDataFrame(df)
spark_df.show()
Hope it helps.
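If you want to skip Pandas entirely, here is a hedged sketch of reading the blob straight into Spark. It assumes the hadoop-azure (WASB) connector is available on your cluster; the container, account, path, and SAS token below are placeholders:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("blobCsv")
         # spark.hadoop.* settings are forwarded to the Hadoop configuration
         .config("spark.hadoop.fs.azure.sas.<container>.<account>.blob.core.windows.net", "<sas token>")
         .getOrCreate())
elog = (spark.read
        .option("header", "true")
        .option("inferSchema", "true")
        .csv("wasbs://<container>@<account>.blob.core.windows.net/<path>/file.csv"))
elog.show()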

Reading all CSV files in current working directory into pandas with correct filenames

I'm trying to use a loop to read in multiple CSVs (just CSVs for now, but a mix of CSV and xls files in the future).
I'd like each pandas data frame to have the same name as its file in my folder, excluding the file extension.
import os
import pandas as pd
files = filter(os.path.isfile, os.listdir( os.curdir ) )
files # this shows a list of the files that I want to use/have in my directory- they are all CSVs if that matters
# i want to load these into pandas data frames with the corresponding filenames
# not sure if this is the right approach....
# but what is wrong is the variable is named 'weather_today.csv'... i need to drop the .csv or .xlsx or whatever it might be
for each_file in files:
    frame = pd.read_csv(each_file)
    each_file = frame
Bernie's answer seems great, but there is one problem:
for each_file in files:
    frame = pd.read_csv(each_file)
    filename_only = os.path.splitext(each_file)[0]
    # Right below I am assigning my looped data frame to the literal variable name "filename_only",
    # rather than to the value that filename_only holds (what I see if I print(filename_only))
    filename_only = frame
For example, if my files list contains weather_today.csv and earthquakes.csv (in that order), then neither a 'weather_today' nor an 'earthquakes' data frame is created.
However, if I simply type 'filename_only' and press enter in python, I see the earthquakes dataframe. If I have 100 files, only the last one survives under the name 'filename_only'; the other 99 are never assigned, because each iteration overwrites the previous one.
You can use os.path.splitext() for this to "split the pathname path into a pair (root, ext) such that root + ext == path, and ext is empty or begins with a period and contains at most one period."
for each_file in files:
    frame = pd.read_csv(each_file)
    filename_only = os.path.splitext(each_file)[0]
    filename_only = frame
As asked in a comment, if you want to filter for just CSV files you can do something like this:
files = [file for file in os.listdir( os.curdir ) if file.endswith(".csv")]
Use a dictionary to store your frames:
frames = {}
for each_file in files:
    frames[os.path.splitext(each_file)[0]] = pd.read_csv(each_file)
Now you can get the DataFrame of your choice with:
frames[filename_without_ext]
Simple, right? Be careful about RAM usage, though: reading a bunch of files can quickly fill up system memory and cause a crash.
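Since the question mentions a future mix of CSV and xls files, the same dictionary pattern can dispatch on the extension; a small hedged sketch:

import os
import pandas as pd

readers = {".csv": pd.read_csv, ".xls": pd.read_excel, ".xlsx": pd.read_excel}
frames = {}
for each_file in os.listdir(os.curdir):
    root, ext = os.path.splitext(each_file)
    if ext.lower() in readers:
        frames[root] = readers[ext.lower()](each_file)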