Improving read performance of pyarrow - pyarrow

I have a partitioned dataset stored on an internal S3 cloud. I am reading the dataset with pyarrow.dataset:
import pyarrow.dataset as ds
my_dataset = ds.dataset(ds_name, format="parquet", filesystem=s3file, partitioning="hive")
fragments = list(my_dataset.get_fragments())
required_fragment = fragments.pop()
The metadata from the required fragment shows the following:
required_fragment.metadata
<pyarrow._parquet.FileMetaData object at 0x00000291798EDF48>
created_by: parquet-cpp-arrow version 9.0.0
num_columns: 22
num_rows: 949650
num_row_groups: 29
format_version: 1.0
serialized_size: 68750
Converting this to a table, however, takes a long time:
%timeit required_fragment.to_table()
6min 29s ± 1min 15s per loop (mean ± std. dev. of 7 runs, 1 loop each)
The size of the table itself is about 272 MB:
required_fragment.to_table().nbytes
272850898
Any ideas how I can speed up converting the dataset fragment to a table?
Updates
So instead of pyarrow.dataset, I tried using pyarrow.parquet.
The only part of my code that changed is:
import pyarrow.parquet as pq
my_dataset = pq.ParquetDataset(ds_name, filesystem=s3file, partitioning="hive", use_legacy_dataset=False)
fragments = my_dataset.fragments
required_fragment = fragments.pop()
When I tried the code again, the performance was much better:
%timeit required_fragment.to_table()
12.4 s ± 1.56 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
While I am happy with the better performance, it still feels confusing: with use_legacy_dataset=False, both code paths should be using the same dataset machinery under the hood and have similar outcomes.
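One possible factor, offered as an assumption worth verifying rather than a confirmed explanation: pq.ParquetDataset appears to default to pre-buffering column chunks (pre_buffer=True), while passing format="parquet" to ds.dataset may not. A minimal sketch of enabling it explicitly on the ds.dataset path, assuming pyarrow 9.x and the same ds_name and s3file as above:
import pyarrow.dataset as ds

# Hedged sketch: turn on pre-buffering of column chunks, which coalesces many
# small S3 range requests into fewer larger ones (an assumption about the
# source of the slowdown, not a confirmed diagnosis).
scan_opts = ds.ParquetFragmentScanOptions(pre_buffer=True)
pq_format = ds.ParquetFileFormat(default_fragment_scan_options=scan_opts)
my_dataset = ds.dataset(ds_name, format=pq_format,
                        filesystem=s3file, partitioning="hive")
required_fragment = list(my_dataset.get_fragments()).pop()
table = required_fragment.to_table()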
PC Information
Installed RAM: 21.0 GB
Software: Windows 10 Enterprise
Internet speed: 10 Mbps / 156 Mbps (download / upload)
s3 location: Asia

Related

How to free RAM space in Google Colab

I am working on a Deep Learning project where I am trying different CNN architectures with CIFAR10. I've built some custom functions and use some nested for-loops to iterate over my different architectures. The problem I get is that the 12 GB of RAM get close to 100% and I cannot free that space to continue. I would like a solution other than "reset your runtime environment"; I want to free that space, given that 12 GB should be enough for what I am doing if it is managed correctly.
What I've done so far:
Added gc.collect() at the end of each training epoch
Added keras.backend.clear_session() after each model is trained
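For context, a minimal sketch of how those two cleanup steps combine between runs (build_model, x_train and y_train are illustrative placeholders, not the actual code):
import gc
from tensorflow import keras

def train_and_release(build_model, x_train, y_train):
    # Sketch: train one architecture, then release its memory before the next.
    model = build_model()
    model.fit(x_train, y_train, epochs=1, verbose=0)
    score = model.evaluate(x_train, y_train, verbose=0)
    del model                         # drop the reference to the finished model
    keras.backend.clear_session()     # clear Keras/TensorFlow graph state
    gc.collect()                      # ask Python to reclaim unreferenced memory
    return score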
I've also tried to see the locals() using
import sys

def sizeof_fmt(num, suffix='B'):
    ''' by Fred Cirera, https://stackoverflow.com/a/1094933/1870254, modified'''
    for unit in ['', 'Ki', 'Mi', 'Gi', 'Ti', 'Pi', 'Ei', 'Zi']:
        if abs(num) < 1024.0:
            return "%3.1f %s%s" % (num, unit, suffix)
        num /= 1024.0
    return "%.1f %s%s" % (num, 'Yi', suffix)

for name, size in sorted(((name, sys.getsizeof(value)) for name, value in locals().items()),
                         key=lambda x: -x[1])[:10]:
    print("{:>30}: {:>8}".format(name, sizeof_fmt(size)))
Which yields
xtrain: 1.1 GiB
xtest: 234.4 MiB
_i13: 3.4 KiB
So I cannot understand how the other 10 GB are allocated in my current session.

Understanding a WAV file exported from a DAW

I have generated a tone in Audacity at 440 Hz with an amplitude of 1 for 1 second.
I understand that this is going to create 440 peaks in 1 sec with Amplitude as 1.
Here I see that it is a 32-bit file and the sample rate is 44100 Hz, which means there are 44100 samples per second. The amplitude is 1, which is as expected because that is what I chose.
What I don't understand is: what is the unit of this amplitude? When right-clicked it shows linear (-1 to +1).
There is an option to select dB, which shows (0 to -60 to 0); I don't understand how this is converted.
Now I use this WAV file in Python with scipy to read the wave and get values of time and amplitude.
How do I match, or get the relation between, what I generated and what I see when I read the WAV file?
The peak amplitude is 32767.987724003342 and the frequency is 439.99002267573695.
The code I have used in Python is:
import numpy as np
from scipy.io import wavfile

wavFileName = "440Hz.wav"
sample_rate, sample_data = wavfile.read(wavFileName)
print("Sample Rate or Sampling Frequency is", sample_rate, "Hz")
l_audio = len(sample_data.shape)
print("Channels", l_audio, "Audio data shape", sample_data.shape, "l_audio", l_audio)
if l_audio == 2:
    sample_data = sample_data.sum(axis=1) / 2
N = sample_data.shape[0]
length = N / sample_rate
print("Duration of audio wav file in secs", length, "Number of Samples chosen", sample_data.shape[0])
time = np.linspace(0, length, sample_data.shape[0])
sampling_interval = time[1] - time[0]
Notice in Audacity that when you created the one second of audio with an amplitude choice of 1.0, right before saving the file it says signed 16-bit integer. So an amplitude from -1 to +1 means the WAV file in PCM format stores your raw audio as signed integers varying from their maximum negative to their maximum positive value. Since 2^16 is 65536, the signed 16-bit integer range is -32768 to 32767, in other words from -2^15 to (+2^15 - 1). To get a better plot I suggest you choose a time period much shorter than one second, say 0.1 seconds; once you are OK with that, boost it back to a full second, which is hard to visualize on a plot due to the 44100 samples.
import os
import scipy.io
import scipy.io.wavfile
import numpy as np
import matplotlib.pyplot as plt
myAudioFilename = '/home/olof/sine_wave_440_Hz.wav'
samplerate, audio_buffer = scipy.io.wavfile.read(myAudioFilename)
duration = len(audio_buffer)/samplerate
time = np.arange(0,duration,1/samplerate) #time vector
plt.plot(time,audio_buffer)
plt.xlabel('Time [s]')
plt.ylabel('Amplitude')
plt.title(myAudioFilename)
plt.show()
Here is 0.1 seconds of 440 Hz using signed 16-bit samples; notice that the Y-axis amplitude range matches the min-to-max signed integer range mentioned above.
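To relate the raw integer samples back to the linear -1 to +1 scale Audacity displays, a small added sketch (not part of the original answer, and assuming the file was exported as signed 16-bit PCM as described above) is to divide by 2^15:
import numpy as np
from scipy.io import wavfile

# Sketch: normalise signed 16-bit PCM samples to the -1..+1 "linear" scale.
sample_rate, samples = wavfile.read("440Hz.wav")    # file name from the question
normalised = samples.astype(np.float32) / 32768.0   # -32768..32767 -> -1.0..~+1.0
print(normalised.min(), normalised.max())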

Spark CSV GZip to Parquet?

I am using Spark 2.3.1 PySpark (AWS EMR)
I am getting memory errors:
Container killed by YARN for exceeding memory limits
Consider boosting spark.yarn.executor.memoryOverhead
I have an input of 160 files, each approximately 350-400 MB, each in gzipped CSV format.
To read the csv.gz files (with a wildcard) I use this PySpark:
dfgz = spark.read.load("s3://mybucket/yyyymm=201708/datafile_*.csv.gz",
    format="csv", sep="^", inferSchema="false", header="false",
    multiLine="true", quote="^", nullValue="~", schema="id string,....")
To save the data frame I use this (PySpark)
(dfgz
.write
.partitionBy("yyyymm")
.mode("overwrite")
.format("parquet")
.option("path", "s3://mybucket/mytable_parquet")
.saveAsTable("data_test.mytable")
)
One line of code to save all 160 files.
I tried this with 1 file and it works fine.
Total size of all 160 files (csv.gz) is about 64 GB.
Each file, as pure CSV when unzipped, is approximately 3.5 GB. I am assuming Spark may unzip each file in RAM and then convert it to Parquet in RAM?
I want to convert each csv.gzip file to Parquet format i.e. I want 160 Parquet files as output (ideally).
The task runs for a while and it seems to create 1 Parquet file for each csv.gz file. After some time it always fails with a YARN memory error.
I tried various settings for executor memory and memoryOverhead, and all resulted in no change: the job always fails. I tried memoryOverhead of up to 1-8 GB and executor memory of 8 GB.
Apart from manually breaking up the 160-file input workload into many small workloads, what else can I do?
Do I need a Spark cluster with a total RAM capacity of much greater than 64 GB?
I use 4 slave nodes, each with 8 CPUs and 16 GB of RAM, plus one master with 4 CPUs and 8 GB of RAM.
This is (with overhead) less than the 64 GB of input gzip CSV files I am trying to process, but the files are evenly sized at 350-400 MB, so I don't understand why Spark is throwing memory errors: it could easily process these one file at a time per executor, discard it, and move on to the next file. It does not appear to work this way. I feel it is trying to load all input csv.gz files into memory, but I have no way of knowing (I am still new to Spark 2.3.1).
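For reference, a hedged sketch (an added illustration, not from the original post) of one commonly suggested mitigation: gzip files are not splittable, so each input file becomes a single task holding the whole decompressed file, and repartitioning after the read spreads the rows over more, smaller write tasks. The factor of 8 partitions per input file is an arbitrary illustrative choice:
# Hedged sketch: read as before, then repartition before the Parquet write.
dfgz = spark.read.load("s3://mybucket/yyyymm=201708/datafile_*.csv.gz",
    format="csv", sep="^", inferSchema="false", header="false",
    multiLine="true", quote="^", nullValue="~", schema="id string,....")

(dfgz
 .repartition(160 * 8)   # illustrative: ~8 smaller tasks per input file
 .write
 .partitionBy("yyyymm")
 .mode("overwrite")
 .format("parquet")
 .option("path", "s3://mybucket/mytable_parquet")
 .saveAsTable("data_test.mytable"))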
Late Update: I managed to get it to work with the following memory config:
4 slave nodes, each 8 CPU and 16 GB of RAM
1 master node, 4 CPU and 8 GB of RAM:
spark maximizeResourceAllocation false
spark-defaults spark.driver.memoryOverhead 1g
spark-defaults spark.executor.memoryOverhead 2g
spark-defaults spark.executor.instances 8
spark-defaults spark.executor.cores 3
spark-defaults spark.default.parallelism 48
spark-defaults spark.driver.memory 6g
spark-defaults spark.executor.memory 6g
Needless to say - I cannot explain why this config worked!
Also, this took over 2 hours to process 64 GB of gzip data, which seems slow even for a small 4+1 node cluster with a total of 32+4 CPUs and 64+8 GB of RAM. Perhaps S3 was the bottleneck.
FWIW I just did not expect to micro-manage a database cluster for memory, disk I/O or CPU allocation.
Update 2:
I just ran another load on the same cluster with the same config: a smaller load of 129 files of the same sizes, and it failed with the same YARN memory errors.
I am very disappointed with Spark 2.3.1 memory management.
Thank you for any guidance

Python 1Hz measurement temperature

I want to create a CSV file in Python, storing data from the sensor plus a timestamp of each reading. But the sensor measures fast and I need exactly one measurement from the sensor exactly every second. For example, the sensor value is 20 at time 12:34:15 and I need the value exactly at 12:34:16. I cannot use time.sleep because it creates a delay of more than a second, which will affect the log file if I have to take more than a hundred readings.
Consumer PCs do not have real-time operating systems; there is no guarantee that a particular process will execute at least once per second, and certainly no guarantee that it will be executing at each 1-second interval. If you want precisely timed measurements with Python, you should look at MicroPython executing on a microcontroller board. It may be able to do what you want. Python on a Raspberry Pi board might also work better than a PC.
On a regular PC, I would start with something using perf_counter.
from time import perf_counter as timer
from somewhere import sensor, save  # read temperature, save value

t0 = t1 = timer()
delta = .99999  # adjust by experiment to average 1 sec reading intervals
while True:
    while t1 - t0 < delta:
        t1 = timer()
    value = sensor()
    save(value)
    t0 = t1
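Since the question also asks about logging each reading with a timestamp to a CSV file, here is a small added sketch combining the timing loop above with the csv module (the file name and the count of 100 readings are illustrative; sensor is the same placeholder import as above):
import csv
from datetime import datetime
from time import perf_counter as timer
from somewhere import sensor  # placeholder sensor read, as above

with open("sensor_log.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["timestamp", "value"])
    t0 = t1 = timer()
    delta = .99999
    for _ in range(100):        # e.g. one hundred one-second readings
        while t1 - t0 < delta:  # busy-wait until ~1 second has elapsed
            t1 = timer()
        writer.writerow([datetime.now().isoformat(), sensor()])
        t0 = t1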

Multiple regression with lagged time series using libsvm

I'm trying to develop a forecaster for electric consumption, so I want to perform a regression using daily data for an entire year. My dataset has several features. Googling, I've found that my problem is a multiple regression problem (please correct me if I am mistaken).
What I want to do is train an SVM for regression with several independent variables and one dependent variable with n lagged days. Here's a sample of my independent variables; I actually have around 10. (We used PCA to determine which variables had some correlation to our problem.)
Day Indep1 Indep2 Indep3
1 1.53 2.33 3.81
2 1.71 2.36 3.76
3 1.83 2.81 3.64
... ... ... ...
363 1.5 2.65 3.25
364 1.46 2.46 3.27
365 1.61 2.72 3.13
And independent variable 1 is actually my dependent variable in the future. So, for example, with p=2 (lagged days) I would expect my SVM to train with the first 2 time steps of all three independent variables.
Indep1 Indep2 Indep3
1.53 2.33 3.81
1.71 2.36 3.76
And the output value of the dependent variable would be "1.83" (independent variable 1 at time 3).
My main problem is that I don't know how to train properly. What I was doing is just putting all features for the p lagged days in one array for my "x" variables, and for my "y" variable I'm putting my independent variable at time p+1, since I want to predict the next day's power consumption.
Example of training.
x with p = 2 and 3 independent variables y for next day
[1.53, 2.33, 3.81, 1.71, 2.36, 3.76] [1.83]
I tried with x being a two-dimensional array, but when I combine it for several days it becomes a 3D array, and libsvm says it can't handle that.
Perhaps I should change from libsvm to another tool or maybe it's just that I'm training incorrectly.
Thanks for your help,
Aldo.
Let me answer with the python / numpy notation.
Assume the original time series data matrix with columns (Indep1, Indep2, Indep3, ...) is a numpy array data with shape (n_samples, n_variables). Let's generate it randomly for this example:
>>> import numpy as np
>>> n_samples, n_variables = 100, 5
>>> data = np.random.randn(n_samples, n_variables)
>>> data.shape
(100, 5)
If you want to use a window size of 2 time-steps, then the training set can be built as follows:
>>> targets = data[2:, 0] # shape is (n_samples - 2,)
>>> targets.shape
(98,)
>>> features = np.hstack([data[0:-2, :], data[1:-1, :]]) # shape is (n_samples - 2, n_variables * 2)
>>> features.shape
(98, 10)
Now you have your 2D input array + 1D targets that you can feed to libsvm or scikit-learn.
Edit: it might very well be the case that extracting more time-series-oriented features such as a moving average, moving min, moving max, moving differences (time-based derivatives of the signal) or an STFT would help your SVM model make better predictions.
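For illustration only (an added sketch under the same random-data assumption as above; the window size of 3 is arbitrary), such features can be derived with plain numpy:
import numpy as np

# Same assumption as above: data has shape (n_samples, n_variables).
data = np.random.randn(100, 5)

window = 3
# Moving average of each column over `window` consecutive time steps.
moving_avg = np.vstack([
    np.convolve(data[:, j], np.ones(window) / window, mode="valid")
    for j in range(data.shape[1])
]).T                             # shape (n_samples - window + 1, n_variables)

# First-order differences, a crude time-based derivative of each signal.
diffs = np.diff(data, axis=0)    # shape (n_samples - 1, n_variables)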