Clojure: write a 2D array to CSV

I understand how to write to CSV using clojure.data.csv, but I am at a loss as to how to write the CSV in this specific format.
The data I want to write is the result of a DB query using clojure.java.jdbc with the :as-arrays? true option. That returns a 2D array where row [0] holds the column names, which need to become the headers in the CSV, and each following row [x] holds the data to write under those headers, so [1][0] is the first returned row's value for the first column and should end up under the first heading.
(with-open [out-file (io/writer "out-file.csv")]
  (csv/write-csv out-file
                 [["abc" "def"]
                  ["ghi" "jkl"]]))
The above is an example of writing to a CSV file, but I am unsure how to take the result of my query and write its values to CSV.
The data will look like this:
[[header1, header2, header3]
 [val1, val2, val3]
 [val1, val2, val3]]
The query looks like this:
(j/query db ["$SOME_QUERY"] :as-arrays? true)
Can somebody help with this?
Edit: this is what I have so far:
(defn write-query-to-csv [query db output-filename]
  (log/info (str "Executing " query " on " db))
  (let [results (j/query db ["$QUERY"]
                         :as-arrays? true)
        header  (->> results
                     first)
        data    (->> results)]
    (with-open [out-file (io/writer output-filename)]
      (csv/write-csv out-file
                     (reduce conj (conj [] header) data)))
    (io/file output-filename)))
The header data is correct, but I'm unsure how to populate the data variable :/

It looks to me like results is a sequence of sequences, and in the let you pull the header sequence out but don't strip it off of the data. So header contains the sequence of header labels, while data contains the header sequence plus the data sequences (one sequence per row). The reduce line then adds the header sequence back onto that sequence of sequences, which now contains two header sequences. Most of that isn't necessary. Since results is already in the correct format for passing to write-csv, the let only needs to bind results, and you can pass results unmodified as the second argument to write-csv, like this:
(defn write-query-to-csv [query db output-filename]
  (log/info (str "Executing " query " on " db))
  (let [results (j/query db ["$QUERY"]
                         :as-arrays? true)]
    (with-open [out-file (io/writer output-filename)]
      (csv/write-csv out-file results)
      (io/file output-filename))))
So you don't need the reduce line here, but for future reference, it would probably be clearer to replace (conj [] header) with (vector header). Also, another way to write the entire reduce expression would be (cons header data). That will return a different kind of sequence than your reduce line, but write-csv won't care, and I think performance should be similar. You could also use (into (vector header) data).
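For illustration, here is a minimal sketch (with made-up header and data values) showing that all three spellings build the same rows for write-csv:

(let [header ["col-a" "col-b"]            ; hypothetical header row
      data   [["1" "2"] ["3" "4"]]]       ; hypothetical data rows
  ;; each expression yields the rows [["col-a" "col-b"] ["1" "2"] ["3" "4"]],
  ;; just as a different kind of sequence
  [(reduce conj (conj [] header) data)    ; vector, as in the original reduce
   (cons header data)                     ; lazy-seq-friendly list
   (into (vector header) data)])          ; vector again, more direct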

Related

How to stream a large CSV response from a compojure API so that the whole response is not held in memory at once?

I'm new to using compojure, but have been enjoying using it so far. I'm
currently encountering a problem in one of my API endpoints that is generating
a large CSV file from the database and then passing this as the response body.
The problem I seem to be encountering is that the whole CSV file is kept in memory, which then causes an out-of-memory error in the API. What is the best way to handle and generate this, ideally as a gzipped file? Is it possible to stream the response so that a few thousand rows are returned at a time? Returning a JSON response body for the same data works without any problem.
Here is the current code I'm using to return this:
(defn complete
  "Returns metrics for each completed benchmark instance"
  [db-client response-format]
  (let [benchmarks (completed-benchmark-metrics {} db-client)]
    (case response-format
      :json (json-grouped-output field-mappings benchmarks)
      :csv  (csv-output benchmarks))))

(defn csv-output [data-seq]
  (let [header (map name (keys (first data-seq)))
        out    (java.io.StringWriter.)
        write  #(csv/write-csv out (list %))]
    (write header)
    (dorun (map (comp write vals) data-seq))
    (.toString out)))
The data-seq is the results returned from the database, which I think is a
lazy sequence. I'm using yesql to perform the database call.
Here is my compojure resource for this API endpoint:
(defresource results-complete [db]
  :available-media-types ["application/json" "text/csv"]
  :allowed-methods [:get]
  :handle-ok (fn [request]
               (let [response-format (keyword (get-in request [:request :params :format] :json))
                     disposition (str "attachment; filename=\"nucleotides_benchmark_metrics."
                                      (name response-format) "\"")
                     response {:headers {"Content-Type" (content-types response-format)
                                         "Content-Disposition" disposition}
                               :body (results/complete db response-format)}]
                 (ring-response response))))
Thanks to all the suggestions that were provided in this thread, I was able to create a solution using piped-input-stream:
(defn csv-output [data-seq]
  (let [headers    (map name (keys (first data-seq)))
        rows       (map vals data-seq)
        stream-csv (fn [out]
                     (csv/write-csv out (cons headers rows))
                     (.flush out))]
    (piped-input-stream #(stream-csv (io/make-writer % {})))))
This differs from my earlier solution because it does not realise the sequence using dorun and does not create a large String object either. Instead it writes to a PipedInputStream connection asynchronously, as described by the documentation:
Create an input stream from a function that takes an output stream as its
argument. The function will be executed in a separate thread. The stream
will be automatically closed after the function finishes.
Your csv-output function completely realises the dataset and turns it into a string. To lazily stream the data, you'll need to return something other than a concrete data type like a String. This suggests Ring supports returning a stream that can be lazily realised by Jetty. The answer to this question might prove useful.
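As a rough sketch of that idea (the route and the get-benchmarks function below are hypothetical, and csv-output is the piped-input-stream version from the accepted solution above), a handler can hand the stream straight to the Ring response :body so Jetty reads rows as they are produced instead of buffering a String:

(require '[compojure.core :refer [defroutes GET]])

;; Sketch only: get-benchmarks stands in for whatever returns the lazy result seq.
(defroutes streaming-routes
  (GET "/metrics.csv" []
    {:status  200
     :headers {"Content-Type"        "text/csv"
               "Content-Disposition" "attachment; filename=\"metrics.csv\""}
     :body    (csv-output (get-benchmarks))}))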
I was also struggling with streaming a large CSV file. My solution was to use an http-kit channel to stream every single line of the data-seq to the client and then close the channel. My solution looks like this:
(require '[org.httpkit.server :refer :all])
(fn handler [req]
  (with-channel req channel
    (let [header   "your$header"
          data-seq ["your$seq-data"]]
      (doseq [line (cons header data-seq)]
        (send! channel
               {:status  200
                :headers {"Content-Type" "text/csv"}
                :body    (str line "\n")}
               false))
      (close channel))))

Julia - Rewriting a CSV

Complete Julia newbie here.
I'd like to do some processing on a CSV. Something along the lines of:
using CSV
in_file = CSV.Source("/dir/in.csv")
out_file = CSV.Sink("/dir/out.csv")
for line in CSV.eachline(in_file)
    replace!(line, "None", "")
    CSV.writeline(out_file, line)
end
This is in pseudocode, those aren't existing functions.
Idiomatically, should I iterate on 1:CSV.countlines(in_file)? Do a while and check something?
If all you want to do is replace a string in the line, you do not need any CSV parsing utilities. All you do is read the file line by line, replace, and write. So:
infile = "/path/to/input.csv"
outfile = "/path/to/output.csv"
out = open(outfile, "w+")
for line in readlines(infile)
    newline = replace(line, "a", "b")
    write(out, newline)
end
close(out)
This will replicate the pseudocode you have in your question.
If you need to parse and read the CSV field by field, use the readcsv function in Base.
data=readcsv(infile)
typeof(data) #Array{Any,2}
This will return the data in the file as a 2 dimensional array. You can process this data any way you want, and write it back using the writecsv function.
for i in 1:size(data, 1)  # iterate by rows
    data[i, 1] = "This is " * data[i, 1]  # Add text to first column
end
writecsv(outfile, data)
Documentation for these functions:
http://docs.julialang.org/en/release-0.5/stdlib/io-network/?highlight=readcsv#Base.readcsv
http://docs.julialang.org/en/release-0.5/stdlib/io-network/?highlight=readcsv#Base.writecsv

Loading column from CSV file as a list assigned to a variable

Given is a function f(a,b,x,y) in gnuplot, where we have a 3D space with x, y, z (using splot).
Also given is a CSV file (without any header) with the following structure:
2 4
1 9
6 7
...
Is there a way to read out all the values of the first column and assign them to the variable a? Implicitly it should create something like:
a = [2,1,6]
b = [4,9,7]
The idea is to plot the function f(a,b,x,y), iterating over all (a,b) tuples.
I've read through other posts that I hoped were related, e.g. Reading dataset value into a gnuplot variable (start of X series), but I could not make any progress.
Is there a way to go through all rows of a csv file with two columns, using the each column value of a row as the parameter of a function?
Say you have the following data file called data:
1 4
2 5
3 6
You can load the 1st and 2nd column values to variables a and b easily using an awk system call (you can also do this using plot preprocessing with gnuplot but it's more complicated that way):
a=system("awk '{print $1}' data")
b=system("awk '{print $2}' data")
f(a,b,x,y)=a*x+b*y # Example function
set yrange [-1:1]
set xrange [-1:1]
splot for [i in a] for [j in b] f(i,j,x,y)
This is a gnuplot-only solution without the need for a system call:
a=""
b=""
splot "data" u (a=sprintf(" %s %f", a, $1), b=sprintf(" %s %f", b, \
$2)):(1/0):(1/0) not, for [i in a] for [j in b] f(i,j,x,y)

How to read json file into Clojure defrecord (to be searched later)

I have created a defrecord in a Clojure REPL:
user=> (defrecord Data [column1 column2 column3])
user.Data
How do I automate adding data to this record by reading in a .json file? Each of the columns in the defrecord corresponds exactly to a key in the json data. If the file contained a single record it would look similar to this:
[
  {
    "column1" : "value1",
    "column2" : "value2",
    "column3" : "value3"
  }
]
But there are many thousands of such records in the file.
I can slurp the contents of the file like this:
(json/read-json (slurp "path/to/file.json"))
The dependency for the read-json function is added to the project.clj file in the directory where I run lein repl from the command line: :dependencies [[org.clojure/data.json "0.2.1"]].
I would just like to be able to search the values of the records using a Clojure function, such that the value I am passing to the search function is between the values of a single record's column1 and column2 values (i.e., nth-record.column1.value <= query <= nth-record.column2.value). Once I've found a matching record, I want to return the value of another column in that same record (nth-record.column3.value). The values of columns 1 and 2 will be unique, representing a non-overlapping range of values. The value of column3 is not unique.
This seems like a fairly trivial task, but I can't figure out how to do it using the Clojure documentation or the examples I've found online. It doesn't matter to me how the records are stored internally in Clojure, as long as I can search them and return the value of a related field in the same record.
Using the data.json package:
(require '[clojure.data.json :as json])
Read values into memory:
(def all-records (json/read-str (slurp "path/to/file.json")
                                :key-fn keyword))
;; ==> [ { :column1 "value1", :column2 "value2", :column3 "value3" }, ...]
Find matching records:
(def query "some-value")
(def matching (filter #(and (< (:column1 %) query) (< query (:column2 %)))
                      all-records))
Get column3:
(map :column3 matching)
Collecting it all together (and making it more flexible):
(defn find-matching [select-fn result-fn records]
  (map result-fn (filter select-fn records)))

(defn select-within [rec query]
  (and (< (:column1 rec) query) (< query (:column2 rec))))

(find-matching #(select-within % "some-value") :column3 all-records)
You should probably use cheshire for speed (a rough sketch of the cheshire version follows below).
If your queries get sufficiently complex, consider Lucene; Clojure has a nice wrapper.
I think you're assuming records are somehow more suitable for this than maps; as far as I can tell, you're not using any features that make records special, like polymorphism. There might be a way to make cheshire spit out records, but I wouldn't bother.
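If you do swap in cheshire, the read step might look roughly like this; it's a sketch that assumes the same file path as above, and parse-string's second argument set to true keywordizes the keys, so the filtering code above works unchanged:

(require '[cheshire.core :as cheshire])

;; Same shape of result as json/read-str above: a vector of maps with keyword keys.
(def all-records (cheshire/parse-string (slurp "path/to/file.json") true))

;; The helpers defined above can then be reused as-is:
(find-matching #(select-within % "some-value") :column3 all-records)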

Scientific data

I want to import data from a corrupted CSV file. It contains scientific numbers, and it's a big data set with about 300000 rows and 27 columns. When I import it using
Import["data.csv", "HeaderLines" -> 1]
the data comes back as strings, so I convert it to a table format with
StringSplit[ToString[data[[#]]], ";"] & /@
  Range[Dimensions[Import["data.csv"]][[1]]]
and I need to use the first column to analyse the data. The problem is that this column contains scientific numbers stored as strings! I want to convert them to numbers. I used this command:
ToExpression[Internal`StringToDouble[fdata[[All, 1]][[#]]]] & /@
  Range[291407];
But it takes hours to run! Do you have any idea how I can do this without wasting so much time?
You could try the following:
(* read the first 5 rows *)
d = ReadList["data.csv", Table[Number, {27}], 5]
(* read the rows 100 to 150 *)
s = OpenRead["data.csv"];
Skip[s, Record, 99]
d = ReadList[s, Table[Number, {27}], 51]
Close[s]
And d[[All,1]] will get you the first column.