Can one create a Date32Array with pyarrow from 3 integer arrays for days, months and years?

Suppose I have three integer arrays:
days: e.g. [23, 12, 2]
months: e.g. [3, 5, 11]
years: e.g. [2008, 2011, 2019]
Can one create a Date32Array, Date64Array or TimestampArray with pyarrow in a vectorized way?
I know I can, e.g., create individual Python datetime.date() objects and feed a list of those to pyarrow. However, I would like to avoid a slow loop that creates a large number of Python objects just to discard them in the next step.
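For what it's worth, here is a minimal sketch of one vectorized approach (my own assumption, not an officially documented recipe): build NumPy datetime64[D] values arithmetically, so no Python date objects are ever created, and hand the result to pyarrow.

import numpy as np
import pyarrow as pa

days = np.array([23, 12, 2])
months = np.array([3, 5, 11])
years = np.array([2008, 2011, 2019])

# Years since the 1970 epoch, plus month and day offsets, all vectorized;
# numpy promotes the units step by step, ending at datetime64[D].
dates = (
    (years - 1970).astype("datetime64[Y]")
    + (months - 1).astype("timedelta64[M]")
    + (days - 1).astype("timedelta64[D]")
)

arr = pa.array(dates, type=pa.date32())  # Date32Array
print(arr)

A Date64Array or TimestampArray should then be obtainable by casting, e.g. arr.cast(pa.date64()) or arr.cast(pa.timestamp("s")); again, this is sketched under the assumption that pyarrow accepts datetime64[D] input, which is worth verifying against your pyarrow version.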

Related

How do I load length-frequency histogram data into mixtools?

I want to use mixtools to separate 1-, 2-, and 3+-year-old cohorts in shellfish length-frequency data. I am totally new to R coding. The package example uses the Old Faithful geyser data, but that is merely a list of 272 data points. I have various tables of lengths (size-class midpoints) and frequencies, generally about 15 length classes with counts between 0 and 50 in each. I can create a data frame from my MS Excel table, but I am not sure how to call it with normalmixEM(). Thanks.

Cumulative Frequency Tables and Chart Output

I'm working with some rather large time-series data sets related to futures prices and am in the process of converting some calculations, which I previously did in Excel, to R. This conversion has been relatively straightforward so far, but I'm having a bit of trouble replicating my histograms with their cumulative frequency distributions in R as I had them in Excel. If you're familiar with Excel, the Histogram function in the Data Analysis ToolPak automatically creates a cumulative frequency distribution table with the cumulative percentage of each (in this case) price level, next to the histogram.
I've had some success creating some basic histograms using ggplot, here is a snippet of that code:
ggplot(data = CrudeRaw, aes(x = X7_1_F)) +
  geom_histogram(breaks = seq(X7_F_M_L, X7_F_M_H, by = 0.01),
                 col = "blue",
                 fill = "white",
                 alpha = 0.2) +
  labs(title = "X7 1 Month Price Distribution",
       x = "Price Levels",
       y = "Frequency") +
  xlim(c(X7_F_M_L, X7_F_M_H)) +
  ylim(c(0, 100))
Several questions regarding formatting and usage.
a) CrudeRaw is a data frame that contains roughly 276 rows and no fewer than 50 columns. For the purposes of this project I've chopped the data into 20-period, 60-period, 120-period, 180-period, and 240-period subsets. The data is in chronological order by date.
Question(s): ggplot cannot take plain numeric vectors, only data frames, so I can only feed it the entire data frame, even though I am interested in creating distributions for the aforementioned subsets. Is there a way I can still do this?
b) How do I get every bin (price) to show up on the x-axis, rather than a label marking every 5 bins (-15, -10, -5, 0, 5, ..., 15)?
c) I've successfully created a cumulative frequency table using the following code:
round(cbind(cumsum(table(X7_F))) / NROW(X7_F), 2)
But I'd like a way either to output each of these tables (of which there are many) to a CSV file or, ideally, to create a "report" of sorts with R that can be saved to a PDF, perhaps even alongside the histogram with which each table is associated.
d) I've done some searching on how to output data to a CSV file, but it wasn't clear from the examples I went over how I could output multiple arrays to the same sheet or workbook en masse. That is, I would like to output my 20-, 60-, 120-, 180-, and 240-period arrays of prices to the same workbook. I'm thinking that by creating another data frame I could then pass these subsets of the data to the ggplot function, as I mentioned I was having trouble doing in part a).
e) Lastly (for now) how do I overlay the CFD onto my histograms?
Please advise if you require any additional information or colour in order to help me and many thanks in advance for your responses!

Storing multiple number ranges in json for later lookup

I'm trying to find a good way to store multiple ranges of numbers and single digits within a JSON array that I can later look up in a graph database (likely Neo4j).
So numbers and ranges such as
1
5-12
25-99
and later on I want to be able to check whether, say, the number 27 is in there. What's the best way to structure this in a JSON string, and is it possible to use Neo4j to check whether the number 27 is within one of the ranges?
One way to approach this is to use a Range node for each numeric range; a single number is simply a range whose min and max are equal.
For example:
(:Range {min: 1, max: 1})
(:Range {min: 5, max: 12})
(:Range {min: 25, max: 99})
And here is an example of how you could find all Foo nodes that have a range that includes 27:
MATCH (n:Foo)-[:HAS_RANGE]->(r:Range)
WHERE r.min <= 27 <= r.max
RETURN n;
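On the JSON side of the question, the same min/max convention can serve as the storage format before the data is loaded into Neo4j. A minimal Python sketch, assuming that encoding (the encoding itself is my assumption, mirroring the Range nodes above, not something prescribed by Neo4j):

import json

# Hypothetical storage format: one object per range; a single number
# is stored as a range with min == max (assumption, mirroring the answer).
ranges_json = '[{"min": 1, "max": 1}, {"min": 5, "max": 12}, {"min": 25, "max": 99}]'

ranges = json.loads(ranges_json)

def contains(ranges, n):
    # A number is "in there" if it falls inside any stored range.
    return any(r["min"] <= n <= r["max"] for r in ranges)

print(contains(ranges, 27))  # True: 27 lies within 25-99
print(contains(ranges, 3))   # False: 3 falls in no stored range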

Weka MLP multiple outputs

I'm using the multilayer perceptron in Weka to classify data. The classification output should be a unique binary vector associated with a certain input, e.g., 1, 1, -1, 1, -1, -1, 1. The output vector is 31 elements long, while the input is a 39-element vector of real numbers. That is, the output cannot be represented by one column in the CSV file; rather, I should have 31 columns for the output values (class) beside the 39 columns of the input. I know how to use Weka when I have one-column classes, but with such a vector output I have a problem. I must have it like that because I need to compare it with an MLP ANN in Matlab that has 31 outputs in the output layer; therefore, I cannot assign an abstract symbol to each unique combination in order to have one column in my CSV. Your help is highly appreciated. Thanks in advance and have a nice day.

The most efficient way to calculate an integral in a dataset range

I have an array of 10 rows by 20 columns. Each column corresponds to a data set that cannot be fitted with any sort of continuous mathematical function (it's a series of numbers derived experimentally). I would like to calculate the integral of each column between row 4 and row 8, then store the results in a new array (20 rows x 1 column).
I have tried using different scipy.integrate routines (e.g. quad, trapz, ...).
The problem is that, from what I understand, scipy.integrate must be applied to functions, and I am not sure how to convert each column of my initial array into a function. As an alternative, I thought of calculating the average of each column between row 4 and row 8, then multiplying this number by 4 (i.e. 8 - 4 = 4, the x-interval) and storing the result in my final 20x1 array. The problem is... ehm... that I don't know how to calculate the average over a given range. The questions I am asking are:
Which method is more efficient/straightforward?
Can integrals be calculated over a data set like the one that I have described?
How do I calculate the average over a range of rows?
Since you know only the data points, the best choice is to use trapz (the trapezoidal approximation to the integral, based on the data points you know).
You most likely don't want to convert your data sets to functions, and with trapz you don't need to.
So if I understand correctly, you want to do something like this:
import numpy as np

# x-coordinates for the data points
x = np.array([0, 0.4, 1.6, 1.9, 2, 4, 5, 9, 10])

# some random data: 3 whatever data sets (sharing the same x-coordinates)
y = np.zeros([len(x), 3])
y[:, 0] = 123
y[:, 1] = 1 + x
y[:, 2] = np.cos(x / 5.)
print(y)

# compute approximations for integral(dataset, x=0..10) for datasets i=0,1,2
yi = np.trapz(y, x[:, np.newaxis], axis=0)
# what happens here: x must broadcast against the shape of y;
# newaxis adds a "virtual" axis to x, in effect saying that the
# x-coordinates are the same for each data set

# approximations of the integrals based on the datasets
# (here we also know the exact values, so print them too)
print(yi[0], 123 * 10)
print(yi[1], 10 + 10 * 10 / 2.)
print(yi[2], np.sin(10. / 5.) * 5.)
To get the sum of the entries 4 to 8 (including both ends) in each column, use
a = np.arange(200).reshape(10, 20)
a[4:9].sum(axis=0)
(The first line is just to create an example array of the desired shape.)
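And for the remaining question, the average over that same range of rows, a short sketch (assuming, as the asker does, evenly spaced rows one x-unit apart, so that mean times interval length gives a rough integral):

import numpy as np

a = np.arange(200).reshape(10, 20)  # same example array as above

# Average of rows 4..8 (inclusive) for each column:
col_means = a[4:9].mean(axis=0)

# The asker's alternative: mean * x-interval (8 - 4 = 4) as a rough integral
# approximation; trapz above is generally the more accurate choice.
approx_integrals = col_means * 4
print(approx_integrals)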