Can't load sparse matrix correctly into Octave

I have the task of loading symmetric positive definite sparse matrices from The University of Florida Sparse Matrix Collection into GNU Octave. I need to study different ordering algorithms, like symamd, but I can't use them since the matrices are not stored as square matrices.
As an example, I have chosen bcsstk17.
I've tried different load methods with the .mat files:
load -mat bcsstk17
load -mat-binary bcsstk17
load -6 bcsstk17
load -v6 bcsstk17
load -7 bcsstk17
load -mat4-binary bcsstk17
error: load: can't read binary file
load -4 bcsstk17
But none of them worked; my workspace's variables remain empty.
When I load the Matrix Market format file with load bcsstk17.mtx, I get a 219813x3 matrix.
I've also tried the full command, but I get the same 219813x3 matrix.
What am I doing wrong?

Not sure why you're trying to load the .mtx file when there's a matlab/octave specific .mat format offered there.
Just download the bcsstk17.mat file, and load it:
load bcsstk17.mat
You will then see in your workspace a variable called Problem, which is of type struct. This contains several fields, including an A field which seems to hold your data in the form of a sparse matrix. In other words, your data can be accessed as Problem.A
You shouldn't be bothering with the .mtx file at all. However, for completeness, I will explain what you're seeing when you load it. The .mat file is a binary format, whereas a .mtx file is a human-readable format (i.e. it contains normal ASCII text). In particular, it seems to consist of a 'header' of comment lines, which start with a % character, followed by a row which encodes the size of the sparse matrix in each dimension, and then space-delimited data, where presumably each row represents an element in the matrix, and the three columns represent the row, the column, and the value of that element.
When matlab comes across an ASCII file containing data (plus comments), regardless of the extension, as long as the contents look like a valid 2D array of numbers, it loads the data from the file into a variable with the same name as the file.
Clearly this is not what you want, not least because the first row will be interpreted as a normal row of data in an Nx3 matrix. In other words, matlab/octave is just loading a standard file it perceives as text-based, and it loads the values it sees inside into a variable. The extension .mtx here is irrelevant as far as matlab/octave is concerned, and it is most definitely not interpreting or decoding the .mtx file in any way related to the .mtx specification.
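For completeness, the triplet layout described above can be sketched outside Octave. A minimal Python illustration, using a made-up 3x3 symmetric matrix (not bcsstk17):

```python
# Minimal sketch of the Matrix Market coordinate layout described above.
# This is an illustration in Python, not how Octave's load works; the
# sample text below is a made-up 3x3 symmetric matrix, not bcsstk17.
mtx_text = """%%MatrixMarket matrix coordinate real symmetric
% comment lines start with %
3 3 4
1 1 2.0
2 1 -1.0
2 2 2.0
3 3 1.0
"""

def parse_mtx(text):
    # Drop the comment 'header' lines, which start with %
    lines = [ln for ln in text.splitlines() if ln and not ln.startswith("%")]
    # First non-comment row encodes the matrix size and the entry count
    nrows, ncols, nnz = (int(tok) for tok in lines[0].split())
    entries = {}
    for ln in lines[1:]:
        r, c, v = ln.split()
        entries[(int(r), int(c))] = float(v)  # 1-based (row, col) -> value
    assert len(entries) == nnz
    return nrows, ncols, entries

nrows, ncols, entries = parse_mtx(mtx_text)
```

This is exactly why loading the .mtx naively produces an Nx3 array: every remaining line is just three numbers.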

Related

Limit precision on JSONEncoder's Double output

I'm working on a document-based SwiftUI application, and my file's contents include large arrays of doubles. The application involves importing some data from CSV, and the file then saves this data in a data structure.
However, when looking at the file sizes, the custom document type I declare in my application is 2-3 times larger than the CSV file. The other bits of metadata could not possibly be that large.
Here's the JSON output that is saved to file:
Here's the CSV file that was imported:
On looking at the raw text of the JSON output and comparing it with the original CSV, it became obvious that the file format I declare was using a lot of unnecessary precision. How do I make Swift's JSONEncoder only use, for example, 4 decimal places of precision?
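The underlying idea behind limiting serialized precision, quantizing the values before (or while) they are encoded, is language-agnostic. A minimal sketch in Python rather than Swift, with made-up sample values:

```python
import json

# The question is about Swift's JSONEncoder, but the general approach,
# round the values before serializing them, works in any language.
# The sample values below are made up for illustration.
samples = [32.02345678901234, 1013.2567890123, -0.000123456]

# Quantize to 4 decimal places before encoding
rounded = [round(x, 4) for x in samples]
encoded = json.dumps(rounded)
```

In Swift the equivalent step would be applying the rounding inside a custom encode path rather than after the fact, but the effect on the output size is the same.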

How to read 100GB of Nested json in pyspark on Databricks

There is a nested JSON file with a very deep structure. The file is in json.gz format and is 3.5 GB; once uncompressed it is 100 GB.
The JSON is multiline (only when multiline=true is passed to spark.read.json do we get to see the proper JSON schema).
Also, this file has a single record, in which there are two columns of struct-type arrays, with multilevel nesting.
How should I read this file and extract the information? What kind of cluster / technique should I use to extract the relevant data from this file?
Structure of the JSON (multiline)
This is a single record, and the entire data is present in 2 columns: in_netxxxx and provider_xxxxx
I was able to achieve this in a slightly different way.
I used the utility Big Text File Splitter (BigTextFileSplitter, Withdata Software, https://www.withdata.com), as the file was huge and nested to multiple levels. The split record size I kept was 500, which generated around 24 split files of around 3 GB each. The entire process took 30-40 minutes.
I then read each split file using:
spark.read.option("mode", "DROPMALFORMED").json(file_path)
This option drops the _corrupt_record entries and also removes the null rows.
I processed the _corrupt_record entries separately and populated the required information from them.
Once the information is fetched from each file, we can merge all the files into a single file, as per the standard process.
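The splitting step above used a third-party GUI tool with 500 records per file; the same idea can be sketched in plain Python, assuming the data can be laid out one JSON record per line (JSON Lines). The chunk size and file naming here are illustrative, not from the answer:

```python
import os

# Sketch of the "split the huge file into smaller pieces" step, assuming
# one JSON record per line (JSON Lines). Chunk size and naming are
# illustrative; the answer used a third-party splitter with 500 records.
def split_jsonl(path, records_per_file, out_dir):
    """Split a line-delimited file into pieces of records_per_file lines."""
    out_paths, chunk = [], []
    with open(path) as src:
        for line in src:
            chunk.append(line)
            if len(chunk) == records_per_file:
                out_paths.append(write_chunk(chunk, len(out_paths), out_dir))
                chunk = []
    if chunk:  # leftover records that didn't fill a whole chunk
        out_paths.append(write_chunk(chunk, len(out_paths), out_dir))
    return out_paths

def write_chunk(chunk, idx, out_dir):
    out_path = os.path.join(out_dir, "split_%03d.json" % idx)
    with open(out_path, "w") as dst:
        dst.writelines(chunk)
    return out_path
```

Note this only works once the data is line-delimited; a single multiline record, as in the question, first has to be broken up by a structure-aware tool like the one used above.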

Julia: Visualize images saved in csv form

What would be the best way of visualizing images saved in .csv format?
The following doesn't work:
using CSV, ImageView
data = CSV.read("myfile.csv");
imshow(data)
This is the error:
MethodError: no method matching pixelspacing(::DataFrames.DataFrame)
Closest candidates are:
pixelspacing(!Matched::MappedArrays.AbstractMultiMappedArray) at /Users/xxx/.julia/packages/ImageCore/yKxN6/src/traits.jl:63
pixelspacing(!Matched::MappedArrays.AbstractMappedArray) at /Users/xxx/.julia/packages/ImageCore/yKxN6/src/traits.jl:62
pixelspacing(!Matched::OffsetArrays.OffsetArray) at /Users/xxx/.julia/packages/ImageCore/yKxN6/src/traits.jl:67
...
Stacktrace:
[1] imshow(::Any, ::Reactive.Signal{GtkReactive.ZoomRegion{RoundingIntegers.RInt64}}, ::ImageView.SliceData, ::Any; name::Any, aspect::Any) at /Users/xxx/.julia/packages/ImageView/sCn9Q/src/ImageView.jl:269
[2] imshow(::Any; axes::Any, name::Any, aspect::Any) at /Users/xxx.julia/packages/ImageView/sCn9Q/src/ImageView.jl:260
[3] imshow(::Any) at /Users/xxx/.julia/packages/ImageView/sCn9Q/src/ImageView.jl:259
[4] top-level scope at In[5]:2
[5] include_string(::Function, ::Module, ::String, ::String) at ./loading.jl:1091
Reference on github.
This question was answered at https://github.com/JuliaImages/ImageView.jl/issues/241. Copying the answer here:
imshow(Matrix(data))
where data is your DataFrame. But CSV is a poor choice for images; use Netpbm if you simply must have text-formatted images, otherwise a binary format would be recommended. Binary Netpbm files are especially easy to write if you have to write your own (e.g., if the images are coming from some language that doesn't support other file formats); otherwise PNG is typically a good choice.
Does the CSV file have a header line of names for its columns, or is it just a delimited file full of text number values?
If the CSV file is actually in the form of a matrix of values, such that the values are the bytes of a 2D image, you may use DelimitedFiles -- see the readdlm() docs. Read the file with readdlm() into a matrix and see if ImageView can display the result.
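The answers above are Julia; the core point, that the CSV is really just a 2D matrix of pixel values which any image viewer can display once it is a plain array, can be sketched in Python. The 2x3 grayscale sample below is made up:

```python
import csv
import io

# The CSV is treated as a plain 2D matrix of pixel intensities.
# The 2x3 grayscale sample below is made up for illustration.
csv_text = "0,128,255\n64,32,16\n"

pixels = [[int(v) for v in row] for row in csv.reader(io.StringIO(csv_text))]
height, width = len(pixels), len(pixels[0])
# pixels is now a plain 2D array: the same shape imshow(Matrix(data))
# expects on the Julia side, which is why converting the DataFrame
# to a Matrix makes the error go away.
```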

Octave (loading files with multiple NA values)

I have a dataset which has a lot of NA values. I have no idea how to load it into Octave.
The problem is that whenever I use
data=load("a,txt")
it says error: value on the right hand side is undefined
First, ensure that you are using data=load("a.txt"); and not data=load("a,txt");
If you want to load all of the data from the file into one matrix then use
data = dlmread("a.txt", "emptyvalue", NA);
This reads in the data from a.txt, inferring the delimiter, and replacing all empty values with NA. The code preserves NaN and Inf values.
If you have multiple data sets in the file you'll need to get creative, and it may be simpler to just use the above code and segment the data sets in Octave.
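The "replace every empty field with NA" behaviour of the dlmread call above can be sketched in Python, using NaN as the stand-in for Octave's NA. The sample line is made up:

```python
import math

# Sketch of the "empty field -> NA" replacement done by Octave's dlmread
# above, using float("nan") as the stand-in for Octave's NA.
# The sample line is made up for illustration.
def parse_line(line, sep=","):
    return [float(tok) if tok.strip() else float("nan")
            for tok in line.rstrip("\n").split(sep)]

row = parse_line("1.5,,3,,-2")
# row holds 1.5, NaN, 3.0, NaN, -2.0: numeric values pass through
# unchanged, empty fields become the missing-value marker.
```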

CSV format for OpenCV machine learning algorithms

Machine learning algorithms in OpenCV appear to use data read in CSV format. See for example this cpp file. The data is read into the OpenCV machine learning class CvMLData using the following code:
CvMLData data;
data.read_csv( filename );
However, there does not appear to be any readily available documentation on the required format for the csv file. Does anyone know how the csv file should be arranged?
Other (non-OpenCV) programs tend to have a line per training example, and begin with an integer or string indicating the class label.
If I read the source for that class, particularly the str_to_flt_elem function, and the class documentation, I conclude that the valid formats for individual items in the file are:
1. Anything that can be parsed to a double by strtod
2. A question mark (?) or the empty string, to represent missing values
3. Any string that doesn't parse to a double
Items 1 and 2 are only valid for features; anything matched by item 3 is assumed to be a class label, and as far as I can deduce the order of the items doesn't matter. The read_csv function automatically assigns each column in the csv file the correct type, and (if you want) you can override the labels with set_response_index. Delimiter-wise, you can use the default (,) or set it to whatever you like before calling read_csv with set_delimiter (as long as you don't use the decimal point).
So this should work, for example, for 6 data points in 3 classes with 3 features per point:
A,1.2,3.2e-2,+4.1
A,3.2,?,3.1
B,4.2,,+0.2
B,4.3,2.0e3,.1
C,2.3,-2.1e+3,-.1
C,9.3,-9e2,10.4
You can move your text label to any column you want, or even have multiple text labels.
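The per-item rules deduced above can be sketched in Python rather than OpenCV's C++, checked against one of the sample rows:

```python
# Sketch, in Python rather than OpenCV's C++, of the per-item rules
# deduced above: "?" or empty -> missing value, anything that parses
# as a double -> feature, anything else -> class label.
def classify_item(item):
    if item in ("?", ""):
        return "missing"
    try:
        float(item)  # strtod-like parse: accepts 1.2, 3.2e-2, +4.1, .1
        return "feature"
    except ValueError:
        return "label"

# One of the sample rows from above: label B, a feature, a missing
# value (empty field), and another feature.
row = "B,4.2,,+0.2".split(",")
kinds = [classify_item(item) for item in row]
```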