Julia Create Column name from a variable when writing to a csv file - csv

I'm trying to write a column whose name is built from a variable. The other thing I'm trying to do is append data column-wise.
CSV.write("File_Name.csv",(;"column$i"::String=sort(val)),append=true)
where i is generated in a for loop.
Also, how do I append to the next column? E.g. if there are 2 columns
column 1 | column2 |
then what's the way to add a new column next to them as column 3?

That's a pretty unusual way of going about things, and the right answer is probably to change your approach more fundamentally: build up the table in a reasonable format in your code, and write it out once you have it.
Fundamentally, I believe you can't append columns to CSV files; the append keyword works on a row basis. You can transpose whatever you're reading in, though, so you could append your columns as rows and then read the file in transposed.
An example where there are five columns to be written out, each of which holds ten strings sorted alphabetically:
julia> using CSV, DataFrames, Random
julia> for i ∈ 1:5
           val = sort([randstring(3) for _ ∈ 1:10])
           CSV.write("test.csv", DataFrame(permutedims(val), :auto), append = true)
       end
julia> CSV.read("test.csv", DataFrame; transpose = true, header = false)
10×5 DataFrame
 Row │ Column1  Column2  Column3  Column4  Column5
     │ String3  String3  String3  String3  String3
─────┼─────────────────────────────────────────────
   1 │ 98g      AW6      02L      0HL      10n
   2 │ Anq      Bgv      8JB      0jf      8RH
   3 │ FCL      D2n      8Z1      2wd      O5N
   4 │ SUf      QwW      FhK      7jk      QfP
   5 │ ewW      Sd2      Jxw      EEv      XgW
   6 │ mG9      Thi      Tbk      Lx1      cqi
   7 │ mm5      jI6      WI8      QsI      lbm
   8 │ nEa      ou5      bRs      S3o      sIF
   9 │ u2w      rnz      hb1      TPh      tJD
  10 │ zKW      tZh      x2J      bJb      tPn
Here val is a column in the final table: 10 random strings of length 3, sorted alphabetically. I then put this vector in a DataFrame transposed, which means that instead of one column of length 10, I get a table with 10 columns and one row:
julia> DataFrame(permutedims(val), :auto)
1×10 DataFrame
Row │ x1 x2 x3 x4 x5 x6 x7 x8 x9 x10
│ String String String String String String String String String String
─────┼────────────────────────────────────────────────────────────────────────────────
1 │ 3vh J4M JDu Y2P Zcb dLA dTy oU6 rhG tN2
The column names x1 to x10 come from the :auto argument, but they are irrelevant here because CSV.write with append = true will not write the header anyway.
Doing this in a loop I therefore end up with a 5 row, 10 column csv file. Reading this in with transpose = true will give me a 10x5 table, and header = false means that CSV will just assign Column1...Column5 as column names.
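If the columns fit in memory, the "build the table first, then write it out" approach suggested at the start of this answer could look like the following minimal sketch. The file name test_columns.csv is only illustrative, and the sketch assumes a DataFrames version that accepts strings as column names.

```julia
using CSV, DataFrames, Random

df = DataFrame()                        # start from an empty table
for i in 1:5
    val = sort([randstring(3) for _ in 1:10])
    df[!, "column$i"] = val             # column name built from the loop variable
end

CSV.write("test_columns.csv", df)       # a single write, header included
```

Assigning to `df[!, "column$i"]` creates the column when it does not exist yet, so the loop variable can be interpolated straight into the name and no transposing is needed when reading the file back.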


Julia set data from DataFrames to JSON

related post: DataFrames to Database tables
Indeed, my plan is to pass data from a DataFrame to JSON. I am trying to use the Genie framework and Genie.json; the sample data is as follows:
 Row │ id     name    address   age      sex
     │ Int64  String  String?   String?  String?
─────┼───────────────────────────────────────────
   1 │     1  Ono     Kyoto     60       m
   2 │     2  Serena  PA        38       F
   3 │     3  Edita   Slovakia  32       F
and this data is stored in res as a DataFrame. Then, with
json( Dict( "alldata" => res ))
the JSON data is laid out column by column:
(A):{"alldata":{"columns":[[1,2,3], ["Ono","Serana","Edita"],["Kyoto","PA","Slovakia"],["60","38","32"],["m","f","f"]]}}
But, of course, I would like to get each row like this
(B):{"alldata":{"columns":[[1,"Ono","Kyoto","60","m"],[2,"Serana","PA","38","f"],[3,"Edita","Slovakia","32","f"]]}}
I posted the same question on the Genie community and got an answer that uses the JSON3 package after serializing the rows. That procedure made sense, but I wonder whether there is a similar approach that avoids the manual serialization step. Ideally, I would pass the DataFrame directly to json and get JSON in form (B).
To json by serializing
oDf::DataFrame = SearchLight.query( sSelect )
if !isempty(oDf)
    for oRow in eachrow(oDf)
        _iid::Int = oRow.id
        _sname::String = oRow.name
        ・
        ・
        push!( aTmp["alldata"], Dict("columns"=>[_iid,_sname,_saddress,_sage,_ssex]))
    end
    aTmp["data_ready"] = 1
end
sRetJ = JSON3.write(aTmp)
I looked at DataFrames.jl but did not find a solution. Maybe the book has it, but I have not found it yet.
Thanks in advance.
I assume you want the following:
julia> df = DataFrame(a=[1, 2, 3], b=[11, 12, 13])
3×2 DataFrame
 Row │ a      b
     │ Int64  Int64
─────┼──────────────
   1 │     1     11
   2 │     2     12
   3 │     3     13
julia> Genie.Renderer.Json.json(Dict("alldata" => "columns" => Vector.(eachrow(df))))
HTTP.Messages.Response:
"""
HTTP/1.1 200 OK
Content-Type: application/json; charset=utf-8
{"alldata":{"columns":[[1,11],[2,12],[3,13]]}}"""
(however, it is strange that you want to call this "columns" while these are rows of your data)
If you want to retain column names you can do:
julia> Genie.Renderer.Json.json(Dict("alldata" => "columns" => copy.(eachrow(df))))
HTTP.Messages.Response:
"""
HTTP/1.1 200 OK
Content-Type: application/json; charset=utf-8
{"alldata":{"columns":[{"a":1,"b":11},{"a":2,"b":12},{"a":3,"b":13}]}}"""
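To see what the two broadcasts hand to the JSON renderer, the intermediate values can be inspected without Genie. A small sketch using only DataFrames (the variable names are illustrative):

```julia
using DataFrames

df = DataFrame(a = [1, 2, 3], b = [11, 12, 13])

# One plain vector per row: the column names are dropped
rows_as_vectors = Vector.(eachrow(df))

# One NamedTuple per row: the column names are kept
rows_as_tuples = copy.(eachrow(df))
```

With columns of mixed types, as in the question's data, each row vector would have element type Any, which the JSON output above suggests serializes as an ordinary array.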

How can I use Julia CSV package rowWriter?

I'm using Julia. I would like to append a single row, again and again, to an existing CSV file.
I think 'CSV.RowWriter' can make it happen, but I don't know how to use it. Can anyone show me an example?
CSV.RowWriter is an iterator that produces consecutive rows of a table as strings. Here is an example:
julia> df = DataFrame(a=1:5, b=11:15)
5×2 DataFrame
 Row │ a      b
     │ Int64  Int64
─────┼──────────────
   1 │     1     11
   2 │     2     12
   3 │     3     13
   4 │     4     14
   5 │     5     15
julia> for row in CSV.RowWriter(df)
           @show row
       end
row = "a,b\n"
row = "1,11\n"
row = "2,12\n"
row = "3,13\n"
row = "4,14\n"
row = "5,15\n"
You would now just need to write these strings to a file in append mode.
Most likely, since you want to append, you want to drop the header. You can do it e.g. like this:
julia> for row in CSV.RowWriter(df, writeheader=false)
           @show row
       end
row = "1,11\n"
row = "2,12\n"
row = "3,13\n"
row = "4,14\n"
row = "5,15\n"
If you want me to show how to write to a file please comment.
The reason why I do not show it is that you do not need to use CSV.RowWriter to achieve what you want. Just do the following:
CSV.write(file, table, append=true)
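When the same call runs repeatedly and the file may not exist on the first run, one common pattern is to let append depend on whether the file is already there, so that the header is written exactly once. A minimal sketch; `append_rows` is a hypothetical helper name, not part of CSV.jl:

```julia
using CSV, DataFrames

# Hypothetical helper: append `table` to `path`; the header is only
# written when the file does not exist yet (append = false).
function append_rows(path, table)
    CSV.write(path, table; append = isfile(path))
end

path = tempname() * ".csv"                       # fresh temporary file name
append_rows(path, DataFrame(a = [1], b = [3]))   # creates the file, with header
append_rows(path, DataFrame(a = [2], b = [4]))   # appends one more row
lines = readlines(path)
```

This relies on CSV.write skipping the column names by default when append = true.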
EDIT: example of writing with CSV.RowWriter:
julia> using DataFrames, CSV
julia> df = DataFrame(a=[1, 2], b=[3, 4])
2×2 DataFrame
 Row │ a      b
     │ Int64  Int64
─────┼──────────────
   1 │     1      3
   2 │     2      4
julia> isfile("test.txt") # make sure the file does not exist yet
false
julia> open("test.txt", "w") do io # create a file and write with header as the file does not exist
           foreach(row -> print(io, row), CSV.RowWriter(df))
       end
julia> readlines("test.txt") # check all is as expected
3-element Vector{String}:
"a,b"
"1,3"
"2,4"
julia> open("test.txt", "a") do io # append to file and write without header
           foreach(row -> print(io, row), CSV.RowWriter(df, writeheader=false))
       end
julia> readlines("test.txt") # check that all is as expected
5-element Vector{String}:
"a,b"
"1,3"
"2,4"
"1,3"
"2,4"

How to plot TimeArray in julia with zoom in for hourly, zoom out for daily/monthly?

Here is some sample CSV data (the real data has millisecond precision):
using TimeSeries, Plots
s="DateTime,Open,High,Low,Close,Volume
2020/01/05 16:14:01,20,23,19,20,30
2020/01/05 16:14:11,23,27,19,22,20
2020/01/05 17:14:01,24,28,19,23,10
2020/01/05 18:14:01,25,29,20,24,40
2020/01/06 08:02:01,26,30,22,25,50"
ta=readtimearray(IOBuffer(s),format="yyyy/mm/dd HH:MM:SS")
plot(ta.Volume)
I found that the TimeSeries and Temporal packages are geared toward daily plots. Is there any easy way to aggregate the data into minute/hourly/daily/weekly periods and plot them?
For the Open value, it should keep the first value during the period.
For the High value, it should be the maximum value during the period.
For the Low value, it should be the minimum value during the period.
For the Close value, it should be the last value during the period.
For the Volume value, it should be the sum value during the period.
I expect the hourly volume to look like tb:
s="DateTime,Volume
2020/01/05 16:00:00,50
2020/01/05 17:00:00,10
2020/01/05 18:00:00,40
2020/01/06 08:00:00,50"
tb=readtimearray(IOBuffer(s),format="yyyy/mm/dd HH:MM:SS")
plot(tb.Volume)
Method 1: I found a workable but not perfect method. For example, to plot Volume hourly:
using DataFrames,Statistics,Dates
df = DataFrame(ta)
df.ms = Dates.value.(df.timestamp)   # milliseconds since the Dates epoch
df.hour = df.ms .÷ (60*60*1000)      # whole hours since the epoch
# aggregate comes from older DataFrames releases; newer ones use combine(groupby(df, :hour), :Volume => sum)
df2 = aggregate(df[:, [:hour, :Volume]], :hour, sum)
df2.timestamp = convert.(DateTime, Dates.Millisecond.(df2.hour.*(60*60*1000)))
tb=TimeArray(df2[:,[:timestamp,:Volume_sum]], timestamp=:timestamp)
plot(tb)
the content of tb
4×1 TimeArray{Float64,1,DateTime,Array{Float64,1}} 2020-01-05T16:00:00 to 2020-01-06T08:00:00
│ │ Volume_sum │
├─────────────────────┼────────────┤
│ 2020-01-05T16:00:00 │ 50.0 │
│ 2020-01-05T17:00:00 │ 10.0 │
│ 2020-01-05T18:00:00 │ 40.0 │
│ 2020-01-06T08:00:00 │ 50.0 │
Method 2: There seems to be an easier way using the floor function
df.hour2 = floor.(df.timestamp, Dates.Hour(1))
df2 = aggregate(df[:, [:hour2, :Volume]], :hour2, sum)
tb=TimeArray(df2[:,[:hour2,:Volume_sum]], timestamp=:hour2)
Method 3: Just use the second form of the collapse syntax
using Statistics
tb1 = collapse(ta[:, :Open], hour, first, first)
tb2 = collapse(ta[:, :High], hour, first, maximum)
tb3 = collapse(ta[:, :Low], hour, first, minimum)
tb4 = collapse(ta[:, :Close], hour, first, last)
tb5 = collapse(ta[:, :Volume], hour, first, sum)
tb = merge(tb1, tb2, tb3, tb4, tb5)
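The five aggregation rules from the question can also be sketched with the Dates standard library alone, which makes the role of floor explicit. This is a minimal illustration rather than a replacement for collapse: `ohlcv` is a hypothetical helper name, the ticks are copied from the sample data, and the sketch assumes they are already in chronological order (otherwise first and last would not mean open and close).

```julia
using Dates

# (timestamp, open, high, low, close, volume), copied from the question
ticks = [
    (DateTime(2020, 1, 5, 16, 14, 1), 20, 23, 19, 20, 30),
    (DateTime(2020, 1, 5, 16, 14, 11), 23, 27, 19, 22, 20),
    (DateTime(2020, 1, 5, 17, 14, 1), 24, 28, 19, 23, 10),
    (DateTime(2020, 1, 5, 18, 14, 1), 25, 29, 20, 24, 40),
    (DateTime(2020, 1, 6, 8, 2, 1), 26, 30, 22, 25, 50),
]

# Hypothetical helper: bucket ticks by the start of their period,
# then apply first / maximum / minimum / last / sum per bucket.
function ohlcv(ticks, period)
    T = eltype(ticks)
    buckets = Dict{DateTime, Vector{T}}()
    for t in ticks
        push!(get!(buckets, floor(t[1], period), T[]), t)
    end
    map(sort(collect(keys(buckets)))) do ts
        rows = buckets[ts]
        (timestamp = ts,
         Open = first(rows)[2],
         High = maximum(r[3] for r in rows),
         Low = minimum(r[4] for r in rows),
         Close = last(rows)[5],
         Volume = sum(r[6] for r in rows))
    end
end

hourly = ohlcv(ticks, Hour(1))
daily = ohlcv(ticks, Day(1))
```

Changing the period argument (for example Minute(1) or Day(1)) changes the zoom level; the resulting vector of NamedTuples could then be converted to a DataFrame or TimeArray for plotting.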

readtable() when a string ends with \

When I read in a csv file containing
"number","text"
1,"row1text\"
2,"row2text"
with the commands
using DataFrames
readtable("filename.csv")
I get a dataframe with only one row. Apparently, the backslash at the end of the text in the first row is a problem. Is this expected behavior? Is there an alternative way where this problem is avoided?
As a side note: The following works fine (i.e. I get two rows) but is obviously impractical for reading in big files
df = csv"""
"number","text"
1,"row1text\"
2,"row2text"
"""
Since the backslash is the escape character by default, it escapes the quote mark and messes everything up. One workaround would be to use the CSV.jl package and specify a different escape character:
julia> using CSV
julia> CSV.read("filename.csv", escapechar = '~')
2×2 DataFrames.DataFrame
│ Row │ number │ text │
├─────┼────────┼─────────────┤
│ 1 │ 1 │ "row1text\" │
│ 2 │ 2 │ "row2text" │
But then you have to make sure the ~ chars are not escaping something else. There might be a better way of doing this, but this would be one hack to get around the problem.
Another way would be to process the data row by row. Here is a way over-complicated example of doing so:
julia> open("filename.csv", "r") do f
           for (i, line) in enumerate(eachline(f))
               if i == 1
                   colnames = map(Symbol, split(line, ','))
                   global df = DataFrame(String, 0, length(colnames))
                   rename!(df,
                           Dict([(old_name, new_name) for (old_name, new_name) in zip(names(df), colnames)]))
               else
                   new_row = map(String, split(replace(line, "\\\"", "\""), ','))
                   # strip the quotes around values
                   new_row = map(x -> replace(x, "\"", ""), new_row)
                   push!(df, new_row)
               end
           end
       end
julia> df
2×2 DataFrames.DataFrame
│ Row │ "number" │ "text" │
├─────┼──────────┼────────────┤
│ 1 │ "1" │ "row1text" │
│ 2 │ "2" │ "row2text" │
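When fields contain no embedded commas or quotes (an assumption of this sketch, which is not a general CSV parser), the row-by-row idea can be written more compactly using only Base. Unlike the example above, it keeps the trailing backslash as data instead of dropping it:

```julia
# Same content as the file in the question; `\\` is one literal backslash
text = """
"number","text"
1,"row1text\\"
2,"row2text"
"""

# Split each line on commas, then trim the surrounding double quotes
rows = [strip.(split(line, ','), '"') for line in eachline(IOBuffer(text))]

header, records = rows[1], rows[2:end]
```

Here the backslash is never treated as an escape character, so it cannot swallow the closing quote.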

Readtable() with differing number of columns - Julia

I'm trying to read a CSV file into a DataFrame using readtable(). There is an unfortunate issue with the CSV file in that if the last x columns of a given row are blank, instead of generating that number of commas, it just ends the line. For example, I can have:
Col1,Col2,Col3,Col4
item1,item2,,item4
item5
Notice how in the third line, there is only one entry. Ideally, I would like readtable to fill the values for Col2, Col3, and Col4 with NA, NA, and NA; however, because of the lack of commas and therefore lack of empty strings, readtable() simply sees this as a row that doesn't match the number of columns. If I run readtable() in Julia with the sample CSV above, I get the error "Saw 2 Rows, 2 columns, and 5 fields, * Line 1 has 6 columns". If I add in 3 commas after item5, then it works.
Is there any way around this, or do I have to fix the CSV file?
If the CSV parsing doesn't need too much quote logic, it is easy to write a special purpose parser to handle the case of missing columns. Like so:
using DataFrames, DataStructures  # OrderedDict comes from DataStructures

function bespokeread(s)
    headers = split(strip(readline(s)),',')
    ncols = length(headers)
    data = [String[] for i=1:ncols]
    while !eof(s)
        newline = split(strip(readline(s)),',')
        # pad short rows with empty strings
        length(newline)<ncols && append!(newline,["" for i=1:ncols-length(newline)])
        for i=1:ncols
            push!(data[i],newline[i])
        end
    end
    return DataFrame(;OrderedDict(Symbol(headers[i])=>data[i] for i=1:ncols)...)
end
Then the file:
Col1,Col2,Col3,Col4
item1,item2,,item4
item5
Would give:
julia> df = bespokeread(f)
2×4 DataFrames.DataFrame
│ Row │ Col1 │ Col2 │ Col3 │ Col4 │
├─────┼─────────┼─────────┼──────┼─────────┤
│ 1 │ "item1" │ "item2" │ "" │ "item4" │
│ 2 │ "item5" │ "" │ "" │ "" │
Dan Getz's answer is nice, but it converts everything to strings.
The following solution instead "fills" the gaps and writes a new file (in a memory-efficient way) that can then be imported normally using readtable():
function fillAll(iF,oF,d=",")
    open(iF, "r") do i
        open(oF, "w") do o # "w" for writing
            headerRow = strip(readline(i))
            headers = split(headerRow,d)
            nCols = length(headers)
            write(o, headerRow*"\n")
            for ln in eachline(i)
                nFields = length(split(strip(ln),d))
                write(o, strip(ln))
                for y in 1:nCols-nFields # write delimiters to match headers
                    write(o,d)
                end
                write(o,"\n")
            end
        end
    end
end
fillAll("data.csv","data_out.csv",";")
Even better: just use CSV.jl.
julia> f = IOBuffer("Col1,Col2,Col3,Col4\nitem1,item2,,item4\nitem5"); # or the filename
julia> CSV.read(f)
2×4 DataFrames.DataFrame
│ Row │ Col1 │ Col2 │ Col3 │ Col4 │
├─────┼─────────┼─────────┼───────┼─────────┤
│ 1 │ "item1" │ "item2" │ #NULL │ "item4" │
│ 2 │ "item5" │ #NULL │ #NULL │ #NULL │
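For reference, the padding behaviour shown above can also be sketched without any packages. The function below (`read_ragged` is a hypothetical name) treats both empty and absent trailing fields as missing and returns a NamedTuple of columns; it assumes there are no quoted fields or embedded delimiters.

```julia
function read_ragged(io; delim = ',')
    header = split(readline(io), delim)
    columns = [Union{String, Missing}[] for _ in header]
    for line in eachline(io)
        fields = split(line, delim)
        for j in eachindex(header)
            # a value is missing if the row is too short or the field is empty
            present = j <= length(fields) && !isempty(fields[j])
            push!(columns[j], present ? String(fields[j]) : missing)
        end
    end
    # column names taken from the header row
    return (; (Symbol(h) => c for (h, c) in zip(header, columns))...)
end

table = read_ragged(IOBuffer("Col1,Col2,Col3,Col4\nitem1,item2,,item4\nitem5"))
```

A NamedTuple of vectors is a valid Tables.jl column table, so the result could be passed to DataFrame(table) if a DataFrame is needed.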