Octave (loading files with multiple NA values) - octave

I have a dataset which has a lot of NA values. I have no idea how to load it into Octave.
The problem is that whenever I use
data=load("a,txt")
it says error: value on the right hand side is undefined

First, ensure that you are using data=load("a.txt"); and not data=load("a,txt");
If you want to load all of the data from the file into one matrix then use
data = dlmread("a.txt", "emptyvalue", NA);
This reads the data from a.txt, inferring the delimiter, and replaces all empty values with NA. NaN and Inf values are preserved.
If you have multiple data sets in the file you'll need to get creative, and it may be simpler to just use the above code and segment the data sets in Octave.
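If the separator rows between data sets load as all-NA rows (for example, lines whose fields are all empty), one way to segment them is to split on those rows. A minimal sketch, assuming that layout:
data = dlmread("a.txt", "emptyvalue", NA);
sep = all(isna(data), 2);                 % separator rows: every column is NA
idx = cumsum(sep) + 1;                    % running data-set number for each row
idx(sep) = 0;                             % discard the separator rows themselves
sets = arrayfun(@(k) data(idx == k, :), 1:max(idx), "UniformOutput", false);
Each cell of sets then holds one data set.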

Related

Using lapply or for loop on JSON parsed text to calculate mean

I have a json file that has a multi-layered list (already parsed text). Buried within the list, there is a layer that includes several calculations that I need to average. I have code to do this for each line individually, but that is not very time efficient.
mean(json_usage$usage_history[[1]]$used[[1]]$lift)
This returns an average for the numbers in the lift layer of the list for the 1st row. As mentioned, this isn't time efficient when you have a dataset with multiple rows. Unfortunately, I haven't had much success in using either a loop or lapply to do this on the entire dataset.
This is what happens when I try the for loop:
for(i in json_usage$usage_history[[i]]$used[[1]]$lift){
json_usage$mean_lift <- mean(json_usage$usage_history[[i]]$used[[1]]$lift)
}
Error in json_usage$affinity_usage_history[[i]] :
subscript out of bounds
This is what happens when I try lapply:
mean_lift <- lapply(lift_list, mean(lift_list$used$lift))
Error in match.fun(FUN) :
'mean(lift_list$used$lift)' is not a function, character or symbol
In addition: Warning message:
In mean.default(lift_list$used$lift) :
argument is not numeric or logical: returning NA
I am new to R, so I know I am likely doing it wrong, but I haven't found any examples of what I'm trying to do. I'm running out of ideas and growing increasingly frustrated. Please help!
Thank you!
The jsonlite package has a very useful function called flatten that you can use to convert the nested lists that commonly appear when parsing JSON data to a more usable dataframe. That should make it simpler to do the calculations you need.
Documentation is here: https://cran.r-project.org/web/packages/jsonlite/jsonlite.pdf
For an answer to a vaguely similar question I asked (though my issue was with NA data within JSON results), see here: Converting nested list with missing values to data frame in R
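As a side note on the lapply error in the question: lapply (or sapply) expects a function as its second argument, not an already-evaluated call such as mean(lift_list$used$lift). A minimal sketch of looping over the rows directly, assuming the nesting shown in the question (usage_history[[i]]$used[[1]]$lift):
mean_lift <- sapply(json_usage$usage_history,
                    function(h) mean(h$used[[1]]$lift))
json_usage$mean_lift <- mean_lift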

Can't load sparse matrix correctly into Octave

I have the task of loading symmetric positive definite sparse matrices from The University of Florida Sparse Matrix Collection into GNU Octave. I need to study different ordering algorithms, like symamd, but I can't use them since the matrices are not loaded as square matrices.
I have chosen for example bcsstk17.
I've tried different load methods with the .mat files:
load -mat bcsstk17
load -mat-binary bcsstk17
load -6 bcsstk17
load -v6 bcsstk17
load -7 bcsstk17
load -mat4-binary bcsstk17
error: load: can't read binary file
load -4 bcsstk17
But none of them worked, since my workspace's variables are empty.
When I load the Matrix Market format file with load bcsstk17.mtx, I get a 219813x3 matrix.
I've tried the full command, but I get the same 219813x3 matrix.
What am I doing wrong?
Not sure why you're trying to load the .mtx file when there's a matlab/octave specific .mat format offered there.
Just download the bcsstk17.mat file, and load it:
load bcsstk17.mat
You will then see in your workspace a variable called Problem, which is of type struct. This contains several fields, including an A field which seems to hold your data in the form of a sparse matrix. In other words, your data can be accessed as Problem.A.
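For example, a minimal sketch of loading the file and applying the symamd ordering mentioned in the question (the spy call is just one way of inspecting the result, not something required):
load bcsstk17.mat              % creates the struct variable Problem
A = Problem.A;                 % the sparse, square, symmetric matrix
p = symamd(A);                 % symmetric approximate minimum degree ordering
spy(A(p, p));                  % sparsity pattern of the reordered matrix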
You shouldn't be bothering with the .mtx file at all. However, for completeness, I will explain what you're seeing when you load it. The .mat file is a binary format, whereas a .mtx file seems to be a human-readable format (i.e. it contains normal ASCII text). In particular it seems to consist of a 'header' of comments, which start with a % character, then a row which encodes the size of the sparse matrix in each dimension, and then "space-delimited" data, where presumably each row represents an element in the matrix and the three columns represent the row, the column, and the value of that element.
When matlab comes across an ASCII file containing data (+comments), regardless of the extension, as long as the data seems like a valid 2D array of numbers, it loads the data contents of this file onto a variable with the same name as the file.
Clearly this is not what you want, not least because the first row will be interpreted as a normal row of data in an Nx3 matrix. In other words, matlab/octave is just loading a standard file it perceives as text-based, and it loads the values it sees inside onto a variable. The extension .mtx here is irrelevant as far as matlab/octave is concerned, and it is most definitely not interpreting or decoding the .mtx file in any way related to the .mtx specification.
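That said, if you ever do need to build a sparse matrix from that Nx3 data, Octave's spconvert can do it. A minimal sketch, assuming the first loaded row is the dimensions line from the .mtx header (and keeping in mind that a symmetric .mtx file typically stores only one triangle):
T = load("bcsstk17.mtx");        % Nx3 triplets; the first row holds the dimensions
S = spconvert(T(2:end, :));      % sparse matrix from the [row, column, value] triplets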

Spark - load numbers from a CSV file with non-US number format

I have a CSV file which I want to convert to Parquet for further processing. Using
sqlContext.read()
.format("com.databricks.spark.csv")
.schema(schema)
.option("delimiter",";")
.(other options...)
.load(...)
.write()
.parquet(...)
works fine when my schema contains only Strings. However, some of the fields are numbers that I'd like to be able to store as numbers.
The problem is that the file arrives not as an actual "csv" but as a semicolon-delimited file, and the numbers are formatted with German notation, i.e. a comma is used as the decimal separator.
For example, what in the US would be 123.01 is stored in this file as 123,01.
Is there a way to force reading the numbers in a different Locale, or some other workaround that would let me convert this file without first converting the CSV file to a different format? I looked in the Spark code and one nasty thing that seems to be causing the issue is in CSVInferSchema.scala line 268 (Spark 2.1.0) - the parser enforces US formatting rather than, e.g., relying on the Locale set for the JVM, or allowing this to be configured somehow.
I thought of using UDT but got nowhere with that - I can't work out how to get it to let me handle the parsing myself (couldn't really find a good example of using UDT...)
Any suggestions on a way of achieving this directly, i.e. on parsing step, or will I be forced to do intermediate conversion and only then convert it into parquet?
For anybody else who might be looking for answer - the workaround I went with (in Java) for now is:
JavaRDD<Row> convertedRDD = sqlContext.read()
.format("com.databricks.spark.csv")
.schema(stringOnlySchema)
.option("delimiter",";")
.(other options...)
.load(...)
.javaRDD()
.map ( this::conversionFunction );
sqlContext.createDataFrame(convertedRDD, schemaWithNumbers).write().parquet(...);
The conversion function takes a Row and needs to return a new Row with fields converted to numerical values as appropriate (or, in fact, this could perform any conversion). Rows in Java can be created by RowFactory.create(newFields).
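For illustration, a minimal sketch of such a conversion function, assuming a hypothetical two-column layout (a string key followed by one German-formatted decimal column); the real function would mirror whatever stringOnlySchema and schemaWithNumbers actually contain:
import java.text.NumberFormat;
import java.text.ParseException;
import java.util.Locale;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;

private Row conversionFunction(Row row) {
    NumberFormat german = NumberFormat.getInstance(Locale.GERMAN);
    try {
        double value = german.parse(row.getString(1)).doubleValue(); // "123,01" -> 123.01
        return RowFactory.create(row.getString(0), value);
    } catch (ParseException e) {
        return RowFactory.create(row.getString(0), null);            // or handle the bad value however you prefer
    }
}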
I'd be happy to hear any other suggestions how to approach this but for now this works. :)

DT_TEXT concatenating rows on Flat File Import

I have a project that imports a TSV file with a field set as text stream (DT_TEXT).
When I have invalid rows that get redirected, the DT_TEXT fields from the invalid rows get appended to the first valid row that follows.
Here's my test data:
Tab-delimited input file: ("tsv IN")
CatID Descrip
y "desc1"
z "desc2"
3 "desc3"
CatID is set as an integer (DT_I8)
Descrip is set as text stream (DT_TEXT)
Here's my basic Data Flow Task:
(I apologize, I can't post images until my rep is above 10 :-/ )
So my 2 invalid rows get redirected, and my 3rd row goes to success,
But here is my "Success" output:
"CatID","Descrip"
"3","desc1desc2desc3"
Is this a bug when using DT_TEXT fields? I am fairly new to SSIS, so maybe I misunderstand the use of text streams. I chose DT_TEXT because I was having truncation issues with DT_STR.
If it's helpful, my tsv Fail output is below:
Flat File Source Error Output Column,ErrorCode,ErrorColumn
x "desc1"
,-1071607676,10
y "desc2"
,-1071607676,10
Thanks in advance.
You should really try and avoid using the DT_TEXT, DT_NTEXT or DT_IMAGE data types within SSIS fields as they can severely impact dataflow performance. The problem is that these types come through not as a CLOB (Character Large OBject), but as a BLOB (Binary Large OBject).
For reference see:
CLOB: http://en.wikipedia.org/wiki/Character_large_object
BLOB: http://en.wikipedia.org/wiki/BLOB
Difference: Help me understand the difference between CLOBs and BLOBs in Oracle
Using DT_TEXT you cannot just pull out the characters as you would from a large array. This type is represented as an array of bytes and can store any type of data, which in your case is not needed and is creating problems concatenating your fields. (I recreated the problem in my environment)
My suggestion would be to stick with DT_STR for your description, giving it a large OutputColumnWidth. Make it large enough that no truncation will occur when reading from your source file, and test it out.

genfromtxt dtype=None returns wrong shape

I'm a newcomer to numpy, and am having a hard time reading CSVs into a numpy array with genfromtxt.
I found a CSV file on the web that I'm using as an example. It's a mixture of floats and strings. It's here: http://pastebin.com/fMdRjRMv
I'm using numpy via pylab (initializing on an Ubuntu system via: ipython -pylab). numpy.version.version is 1.3.0.
Here's what I do:
Example #1:
data = genfromtxt("fMdRjRMv.txt", delimiter=',', dtype=None)
data.shape
(374, 15)
data[10,10] ## Take a look at an example element
'30'
type(data[10,10])
<type 'numpy.string_'>
There are no errant quotation marks in the CSV file, so I've no idea why it should think that the number is a string. Does anyone know why this is the case?
Example #2 (skipping the first row):
data = genfromtxt("fMdRjRMv.txt", delimiter=',', dtype=None, skiprows=1)
data.shape
(373,)
Does anyone know why it would not read all of this into a 1-dimensional array?
Thanks so much!
In your example #1, the problem is that all the values in a single column must share the same datatype. Since the first line of your data file has the column names, this means that the datatype of every column is string.
You have the right idea in example #2 of skipping the first row. Note however that 1.3.0 is a rather old version (I have 1.6.1). In newer versions skiprows is deprecated and you should use skip_header instead.
The reason that the shape of the array is (373,) is that it is a structured array (see http://docs.scipy.org/doc/numpy/user/basics.rec.html), which is what numpy uses to represent inhomogeneous data. So data[10] gives you an entire row of your table. You can also access the data columns by name, for example data['f10']. You can find the names of the columns in data.dtype.names. It is also possible to use the original column names that are defined in the first line of your data file:
data = genfromtxt("fMdRjRMv.txt", dtype=None, delimiter=',', names=True)
then you can access a column like data['Age'].
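Putting it together, a minimal sketch of the named-column approach (the Age column is taken from the example above and only works if that name actually appears in the file's header row):
import numpy as np
data = np.genfromtxt("fMdRjRMv.txt", dtype=None, delimiter=",", names=True)
print(data.dtype.names)      # column names read from the header row
print(data["Age"].mean())    # a named column comes back as a 1-D array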