How can I call a specific variable from an xarray dataset? - glob

I am using an Argo Float data set, which consists of hundreds of different floats, each of which contains a plethora of data. I want to be able to call the variable "TEMP" from each float, but since these are NetCDF files, I do not understand how to do this. I have a folder containing 291 different floats, and each one is a NetCDF file.
Here are the packages I am using:
from netCDF4 import Dataset
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import xarray as xr
import glob
Here is what I did next:
path = r"C:\Users\Mason\OneDrive\Documents\STAT 315\ArgoFloat\Data"
all_files = glob.glob(path + "/*.nc")
li = []
for filename in all_files:
    dsnew = xr.open_dataset(filename)
    li.append(dsnew)
li[0]
Below is the output for li[0]:
<xarray.Dataset>
Dimensions: (DEPTH: 1001, LATITUDE: 6, LONGITUDE: 6, POSITION: 6, TIME: 6)
Coordinates:
* TIME (TIME) datetime64[ns] 2022-01-07T07:47:45 ... 2022-02-26T07:38:55
* LATITUDE (LATITUDE) float32 11.627 12.003 ... 12.094 12.386
* LONGITUDE (LONGITUDE) float32 -99.835 -99.37 ... -99.792
Dimensions without coordinates: DEPTH, POSITION
Data variables:
TIME_QC (TIME) float32 ...
POSITION_QC (POSITION) float32 ...
DC_REFERENCE (TIME) object ...
DIRECTION (TIME) object ...
VERTICAL_SAMPLING_SCHEME (TIME) object ...
PRES (TIME, DEPTH) float32 ...
PRES_QC (TIME, DEPTH) float32 ...
PRES_ADJUSTED (TIME, DEPTH) float32 ...
PRES_ADJUSTED_QC (TIME, DEPTH) float32 ...
TEMP (TIME, DEPTH) float64 ...
TEMP_QC (TIME, DEPTH) float32 ...
TEMP_ADJUSTED (TIME, DEPTH) float64 ...
TEMP_ADJUSTED_QC (TIME, DEPTH) float32 ...
PSAL (TIME, DEPTH) float64 ...
PSAL_QC (TIME, DEPTH) float32 ...
PSAL_ADJUSTED (TIME, DEPTH) float64 ...
PSAL_ADJUSTED_QC (TIME, DEPTH) float32 ...
Attributes:
data_type: OceanSITES vertical profile
format_version: 1.4
platform_code: 1902272
institution: Pacific Marine Environmental Laboratory (...
institution_edmo_code: 1440
site_code:
wmo_platform_code: 1902272
coriolis_platform_code: 1902272
platform_name: 1283
ices_platform_code:
source: drifting subsurface profiling float
source_platform_category_code: 46
wmo_inst_type: 863
references: http://marine.copernicus.eu http://www.ma...
comment:
Conventions: CF-1.6 Copernicus-InSituTAC-FormatManual-...
netcdf_version: netCDF-4 classic model
title: Global Ocean - In Situ Observation Copern...
summary:
naming_authority: Copernicus Marine In Situ
id: GL_PR_PF_1902272
cdm_data_type: profile
area: Global Ocean
bottom_depth:
institution_references:
contact: cmems-service@ifremer.fr
data_assembly_center: Ifremer
pi_name: GREGORY C. JOHNSON
distribution_statement: These data follow Copernicus standards; t...
citation: These data were collected and made freely...
update_interval: P1D
qc_manual: Recommendations for in-situ data Near Rea...
doi: https://doi.org/10.13155/59938 https://do...
geospatial_lat_min: 11.62700
geospatial_lat_max: 12.38600
geospatial_lon_min: -99.83500
geospatial_lon_max: -99.06700
geospatial_vertical_min: 4.00
geospatial_vertical_max: 2003.40
time_coverage_start: 2022-01-07T07:47:45Z
time_coverage_end: 2022-02-26T07:38:55Z
last_date_observation: 2022-02-26T07:38:55Z
last_latitude_observation: 12.38600
last_longitude_observation: -99.79200
date_update: 2022-02-28T22:06:02Z
history: 2022-02-28T22:06:02Z : Creation
data_mode: R
I want to know how I can take the variable "TEMP" from this dataset. How can I select it and print it?
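For reference, xarray exposes each data variable as a DataArray that can be selected by name; a minimal sketch, using the li list built above:
# dictionary-style access; attribute access (li[0].TEMP) also works
temp = li[0]["TEMP"]
print(temp)          # DataArray summary with dims (TIME, DEPTH)
print(temp.values)   # the underlying numpy array of temperatures
# TEMP from every float in the folder:
all_temps = [ds["TEMP"] for ds in li]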

Related

best_score_ vs. accuracy_score vs. AUROC performance score for classification models (binary)

I have these three metrics for my classification task. Can someone tell me, in plain English, what the differences are, which one(s) to use, and when to use them?
Thank you
for name, model in fitted_models.items():
    print(name, model.best_score_)
l1
0.8493863035326624
l2
0.8493863035326624
rf
0.9796513913558318
gb
0.9752980461811722
///////////////////////////////////////////////
for name, model in fitted_models.items():
    pred = model.predict(X_test)
    print(name, accuracy_score(y_test, pred))
l1
0.8603411513859275
l2
0.8603411513859275
rf
0.9790334044065387
gb
0.9758351101634684
///////////////////////////////////////////////
for name, model in fitted_models.items():
    pred = model.predict_proba(X_test)
    pred = [p[1] for p in pred]
    print(name, roc_auc_score(y_test, pred))
l1
0.9015388373737675
l2
0.9015381433597084
rf
0.9915194952019338
gb
0.988678201643009
1. First thing: model.best_score_ gives you a score measured on the training data (the mean cross-validated score from the search), but the two other metrics in this question work with the test data.
2. model.best_score_ and accuracy_score(y_test, pred) both return the mean accuracy over all classes:
accuracy = (tp + tn) / (tp + fp + fn + tn)
but roc_auc_score(y_test, pred) computes the area under the ROC curve from the predicted probabilities for each class.
3. Accuracy on the training data is not a good metric, and ROC AUC is better than accuracy whenever we can use predict_proba().
4. In the end, precision and recall are better metrics than accuracy, and we must choose wisely between the two of them based on our needs.
I suggest classification_report; this method returns all of these metrics for each class and averages them:
from sklearn.metrics import classification_report
print(classification_report(y_true=y_test, y_pred=pred))  # pred from model.predict(X_test)
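To make point 4 concrete, a short sketch computing precision and recall directly, assuming the same fitted_models, X_test, and y_test as in the question:
from sklearn.metrics import precision_score, recall_score
# binary task, as in the question
for name, model in fitted_models.items():
    pred = model.predict(X_test)
    print(name, precision_score(y_test, pred), recall_score(y_test, pred))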

How are n-dimensional state vectors represented in Q-learning?

Using this code:
import gym
import numpy as np
import time
"""
SARSA on policy learning python implementation.
This is a python implementation of the SARSA algorithm in the Sutton and Barto's book on
RL. It's called SARSA because - (state, action, reward, state, action). The only difference
between SARSA and Qlearning is that SARSA takes the next action based on the current policy
while qlearning takes the action with maximum utility of next state.
Using the simplest gym environment for brevity: https://gym.openai.com/envs/FrozenLake-v0/
"""
def init_q(s, a, type="ones"):
    """
    #param s the number of states
    #param a the number of actions
    #param type random, ones or zeros for the initialization
    """
    if type == "ones":
        return np.ones((s, a))
    elif type == "random":
        return np.random.random((s, a))
    elif type == "zeros":
        return np.zeros((s, a))

def epsilon_greedy(Q, epsilon, n_actions, s, train=False):
    """
    #param Q Q values state x action -> value
    #param epsilon for exploration
    #param s number of states
    #param train if true then no random actions selected
    """
    # Note: here epsilon is the probability of taking the greedy action,
    # not the probability of exploring.
    if train or np.random.rand() < epsilon:
        action = np.argmax(Q[s, :])
    else:
        action = np.random.randint(0, n_actions)
    return action

def sarsa(alpha, gamma, epsilon, episodes, max_steps, n_tests, render=True, test=False):
    """
    #param alpha learning rate
    #param gamma decay factor
    #param epsilon for exploration
    #param max_steps for max step in each episode
    #param n_tests number of test episodes
    """
    env = gym.make('Taxi-v3')
    n_states, n_actions = env.observation_space.n, env.action_space.n
    Q = init_q(n_states, n_actions, type="ones")
    print('Q shape:', Q.shape)
    timestep_reward = []
    for episode in range(episodes):
        print(f"Episode: {episode}")
        total_reward = 0
        s = env.reset()
        print('s:', s)
        a = epsilon_greedy(Q, epsilon, n_actions, s)
        t = 0
        done = False
        while t < max_steps:
            if render:
                env.render()
            t += 1
            s_, reward, done, info = env.step(a)
            total_reward += reward
            a_ = epsilon_greedy(Q, epsilon, n_actions, s_)
            if done:
                Q[s, a] += alpha * (reward - Q[s, a])
            else:
                # SARSA update: bootstraps on the action a_ actually chosen by
                # the policy (Q-learning would use the max over actions instead)
                Q[s, a] += alpha * (reward + (gamma * Q[s_, a_]) - Q[s, a])
            s, a = s_, a_
            if done:
                if render:
                    print(f"This episode took {t} timesteps and reward {total_reward}")
                timestep_reward.append(total_reward)
                break
    # print('Updated Q values:', Q)
    if render:
        print(f"Here are the Q values:\n{Q}\nTesting now:")
    if test:
        test_agent(Q, env, n_tests, n_actions)
    return timestep_reward

def test_agent(Q, env, n_tests, n_actions, delay=0.1):
    for test in range(n_tests):
        print(f"Test #{test}")
        s = env.reset()
        done = False
        epsilon = 0
        total_reward = 0
        while True:
            time.sleep(delay)
            env.render()
            a = epsilon_greedy(Q, epsilon, n_actions, s, train=True)
            print(f"Chose action {a} for state {s}")
            s, reward, done, info = env.step(a)
            total_reward += reward
            if done:
                print(f"Episode reward: {total_reward}")
                time.sleep(1)
                break

if __name__ == "__main__":
    alpha = 0.4
    gamma = 0.999
    epsilon = 0.9
    episodes = 200
    max_steps = 20
    n_tests = 20
    timestep_reward = sarsa(alpha, gamma, epsilon, episodes, max_steps, n_tests)
    print(timestep_reward)
From:
https://towardsdatascience.com/reinforcement-learning-temporal-difference-sarsa-q-learning-expected-sarsa-on-python-9fecfda7467e
A sample Q-table generated is:
[[ 1. 1. 1. 1. 1. 1. ]
[ 0.5996 0.5996 0.5996 0.35936 0.5996 1. ]
[ 0.19936016 0.35936 0.10336026 0.35936 0.35936 -5.56063984]
...
[ 0.35936 0.5996 0.35936 0.5996 1. 1. ]
[ 1. 0.5996 1. 1. 1. 1. ]
[ 0.35936 0.5996 1. 1. 1. 1. ]]
The columns represent the actions and the rows represent the corresponding states.
Can the state be represented by a vector? The Q-table cells are not indexed by vectors of size > 1, so how should these states be represented? For example, if I'm in state [2], can this be represented as an n-dimensional vector?
Put another way, if Q[1,3] = 4, can the Q state 1 with action 3 be represented as the vector [1,3,2,12,3]? If so, is the state_number -> state_attributes mapping stored in a separate lookup table?
Yes, states can be represented by anything you want, including vectors of arbitrary length. Note, however, that if you are using a tabular version of Q-learning (or SARSA as in this case), you must have a discrete set of states. Therefore, you need a way to map the representation of your state (for example, a vector of potentially continuous values) to a set of discrete states.
Expanding on the example you have given, imagine that you have three states represented by vectors:
s0 = [1, 3, 2, 12, 3]
s1 = [3, 1, 2, 2, 23]
s2 = [2, 12, 3, 2, 1]
In the end, you only have 3 states, regardless of how they are represented. You could map the vectors to the states s0, s1 and s2, and use a simple Q-table. Or you can use other data structures that use the vector representation (e.g. [1, 3, 2, 12, 3]) as an index, as in the sketch below.
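A minimal sketch of that lookup idea in Python (the vectors are the hypothetical example states from above):
import numpy as np

# hypothetical vector observations, stored as hashable tuples
s0 = (1, 3, 2, 12, 3)
s1 = (3, 1, 2, 2, 23)
s2 = (2, 12, 3, 2, 1)

# map each vector to a discrete state index for the Q-table
state_index = {s0: 0, s1: 1, s2: 2}

n_actions = 6
Q = np.ones((len(state_index), n_actions))  # tabular Q: rows = states

observation = (3, 1, 2, 2, 23)   # vector coming back from the environment
s = state_index[observation]     # discrete state used to index Q
best_action = np.argmax(Q[s, :])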
On the other hand, if your state space is continuous and you don't want to discretize it, then you can use function approximators (e.g., a neural network) to store Q values instead of a table. But that's another topic (more info in chapter 8 of the Sutton & Barto RL book).

Pyarrow table memory compared to raw csv size

I have a 2GB CSV file that I read into a pyarrow table with the following:
from pyarrow import csv
tbl = csv.read_csv(path)
When I call tbl.nbytes I get 3.4GB. I was surprised at how much larger the data is in Arrow memory than as a CSV. Maybe I have a fundamental misunderstanding of what pyarrow is doing under the hood, but I thought if anything it would be smaller due to its columnar nature (I also probably could have squeezed out more gains using ConvertOptions, but I wanted a baseline). I definitely wasn't expecting an increase of almost 75%. Also, when I convert it from an Arrow table to a pandas df, the df took up roughly the same amount of memory as the CSV, which was expected.
Can anyone help explain the difference in memory for Arrow tables compared to a CSV / pandas df?
Thanks.
UPDATE
Full code and output below.
In [2]: csv.read_csv(r"C:\Users\matth\OneDrive\Data\Kaggle\sf-bay-area-bike-shar
...: e\status.csv")
Out[2]:
pyarrow.Table
station_id: int64
bikes_available: int64
docks_available: int64
time: string
In [3]: tbl = csv.read_csv(r"C:\Users\generic\OneDrive\Data\Kaggle\sf-bay-area-bik
...: e-share\status.csv")
In [4]: tbl.schema
Out[4]:
station_id: int64
bikes_available: int64
docks_available: int64
time: string
In [5]: tbl.nbytes
Out[5]: 3419272022
In [6]: tbl.to_pandas().info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 71984434 entries, 0 to 71984433
Data columns (total 4 columns):
# Column Dtype
--- ------ -----
0 station_id int64
1 bikes_available int64
2 docks_available int64
3 time object
dtypes: int64(3), object(1)
memory usage: 2.1+ GB
There are two problems:
The integer columns are using int64, but int32 would be more appropriate (unless the values are big)
The time column is interpreted as a string. It doesn't help that the input format isn't following any standard (%Y/%m/%d %H:%M:%S)
The first problem is easy to solve, using ConvertOptions:
import pyarrow as pa  # pa was not imported in the question's snippet

tbl = csv.read_csv(
    <path>,
    convert_options=csv.ConvertOptions(
        column_types={
            'station_id': pa.int32(),
            'bikes_available': pa.int32(),
            'docks_available': pa.int32(),
            'time': pa.string()
        }))
The second one is a bit more complicated because, as far as I can tell, the read_csv API doesn't let you provide a format for the time column, and there's no easy way to convert string columns to datetime in pyarrow. So you have to use pandas instead:
import pandas as pd

series = tbl.column('time').to_pandas()
series_as_datetime = pd.to_datetime(series, format='%Y/%m/%d %H:%M:%S')
tbl2 = pa.table(
    {
        'station_id': tbl.column('station_id'),
        'bikes_available': tbl.column('bikes_available'),
        'docks_available': tbl.column('docks_available'),
        'time': pa.chunked_array([series_as_datetime])
    })
tbl2.nbytes
>>> 1475683759
1475683759 is about the number you should expect, and you can't get much better: each row is 20 bytes (4 + 4 + 4 + 8), and 71,984,434 rows × 20 bytes ≈ 1.44 GB; the small remainder is the per-column validity bitmaps (one bit per value).
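As an aside, newer pyarrow releases can parse timestamps during the read itself. A sketch, under the assumption that your pyarrow version supports the timestamp_parsers option of ConvertOptions (path is a hypothetical stand-in for your CSV path):
import pyarrow as pa
from pyarrow import csv

# timestamp_parsers gives read_csv a strptime format to try when a
# column is typed as a timestamp
tbl = csv.read_csv(
    path,
    convert_options=csv.ConvertOptions(
        column_types={
            'station_id': pa.int32(),
            'bikes_available': pa.int32(),
            'docks_available': pa.int32(),
            'time': pa.timestamp('s')
        },
        timestamp_parsers=['%Y/%m/%d %H:%M:%S']))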

How to rescale range of numbers shifting the centre in spark/scala?

Which function in Spark can transform / rescale values in a range like -infinity to +infinity, or -2 to 130, etc., to a defined maximum value?
In the example below, I want to ensure that 55 maps to 100 and 100+ maps to 0:
before      | after
45-55       | 90-100
35-44       | 80-89
...
100+ or < 0 | 0-5
Are any of the ML feature functions useful?
I was able to solve it; thanks @user6910411 for your help.
You can use a dense or sparse vector depending on the data, replace MinMaxScaler with MaxAbsScaler, and extract values using linalg.Vectors or DenseVector.
The idea is to split the data at the point of the required median, reverse the scale for one half, then scale both halves and merge the DataFrames.
// assumes a spark-shell session (spark.implicits._ in scope) and an
// input DataFrame df with columns id and score
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.ml.feature.Normalizer
import org.apache.spark.ml.feature.MaxAbsScaler
import org.apache.spark.ml.feature.MinMaxScaler
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.DenseVector
import org.apache.spark.sql.functions.{col, lit, round, udf}

val vectorToColumn = udf { (x: DenseVector, index: Int) => x(index) }

// reverse the scale for the upper half by negating the score
val gt50 = df.filter("score >= 55").select('id, ('score * -1).as("score"))
val lt50 = df.filter("score < 55")

val assembler = new VectorAssembler()
  .setInputCols(Array("score"))
  .setOutputCol("features")

val ass_lt50 = assembler.transform(lt50)
val ass_gt50 = assembler.transform(gt50)

val scaler = new MinMaxScaler()
  .setInputCol("features")
  .setOutputCol("featuresScaled")
  .setMax(100)
  .setMin(0)

val feat_lt50 = scaler.fit(ass_lt50).transform(ass_lt50).drop('score)
val feat_gt50 = scaler.fit(ass_gt50).transform(ass_gt50).drop('score)

val scaled_lt50 = feat_lt50.select('id,
  round(vectorToColumn(col("featuresScaled"), lit(0))).as("scaled_score"))
val scaled_gt50 = feat_gt50.select('id,
  round(vectorToColumn(col("featuresScaled"), lit(0))).as("scaled_score"))

val scaled = scaled_lt50.unionAll(scaled_gt50)

Plotting a date from a csv file in pylab

I'm trying to plot dates from a csv file column against three other columns of numbers. I'm new to Python and have so far managed to import the columns into Python and read them as an array, but I can't seem to parse them with the datetime module and plot the dates along the x axis along with my data.
Please can anyone help?
At the minute I keep getting the error message:
Traceback (most recent call last):
File "H:\AppliedGIS\Python\woops.py", line 24, in <module>
date = datetime.datetime.strptime['x', '%d/%m/%Y']
AttributeError: type object 'datetime.datetime' has no attribute 'datetime'
But I'm sure I'm going wrong in more than one place...
The data itself is formatted in four columns and when printed looks like this: ('04/03/2013', 7.0, 12.0, 17.0) ('11/03/2013', 23.0, 15.0, 23.0).
Here is the complete code
import csv
import numpy as np
import pylab as pl
import datetime
from datetime import datetime
data = np.genfromtxt('H:/AppliedGIS/Python/AssignmentData/GrowthDistribution/full.csv', names=True, usecols=(0, 1, 2, 3), delimiter= ',', dtype =[('Date', 'S10'),('HIGH', '<f8'), ('Medium', '<f8'), ('Low', '<f8')])
print data
x = [foo['Date'] for foo in data]
y = [foo['HIGH'] for foo in data]
y2 = [foo['Medium'] for foo in data]
y3 = [foo['Low'] for foo in data]
print x, y, y2, y3
dates = []
for x in data:
    date = datetime.datetime.strptime['x', '%d/%m/%Y']
    dates.append(date)
pl.plot(data[:, x], data[:, y], '-r', label= 'High Stocking Rate')
pl.plot(data[:, x], data[:, y2], '-g', label= 'Medium Stocking Rate')
pl.plot(data[:, x], data[:, y3], '-b', label= 'Low Stocking Rate')
pl.title('Amount of Livestock Grazing per hectare', fontsize=18)
pl.ylabel('Livestock per ha')
pl.xlabel('Date')
pl.grid(True)
pl.ylim(0,100)
pl.show()
The problem is in the way you have imported datetime.
The datetime module contains a class, also called datetime. At the moment, you are just importing the class as datetime, from which you can use the strptime method, like so:
from datetime import datetime
...
x = [foo['Date'] for foo in data]
...
dates = []
for i in x:
    date = datetime.strptime(i, '%d/%m/%Y')
    dates.append(date)
Alternatively, you can import the complete datetime module, and then access the datetime class using datetime.datetime:
import datetime
...
x = [foo['Date'] for foo in data]
...
dates = []
for i in x:
    date = datetime.datetime.strptime(i, '%d/%m/%Y')
    dates.append(date)
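Once dates holds datetime objects, the plotting step becomes straightforward. A minimal sketch, assuming the y, y2 and y3 lists from the question (note that data[:, x] indexing does not work on the structured array returned by genfromtxt, so plot the lists directly):
# matplotlib accepts datetime objects on the x axis directly
pl.plot(dates, y, '-r', label='High Stocking Rate')
pl.plot(dates, y2, '-g', label='Medium Stocking Rate')
pl.plot(dates, y3, '-b', label='Low Stocking Rate')
pl.legend()
pl.show()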