Polars List Type to Comma Separated String - csv

I have a df that I'd like to groupby and write to csv format. However, one of the columns has a list type that prevents writing the df to csv.
df = pl.DataFrame({"Column A": ["Variable 1", "Variable 2", "Variable 2", "Variable 3", "Variable 3", "Variable 4"],
"Column B": ["AB", "AB", "CD", "AB", "CD", "CD"]})
Which I want to group by as below:
df.groupby(by="Column A").agg(pl.col("Column B").unique())
Output:
shape: (4, 2)
┌────────────┬──────────────┐
│ Column A ┆ Column B │
│ --- ┆ --- │
│ str ┆ list[str] │
╞════════════╪══════════════╡
│ Variable 3 ┆ ["AB", "CD"] │
│ Variable 1 ┆ ["AB"] │
│ Variable 4 ┆ ["CD"] │
│ Variable 2 ┆ ["CD", "AB"] │
└────────────┴──────────────┘
When trying to write the above dataframe to csv it comes up with an error: "ComputeError: CSV format does not support nested data. Consider using a different data format. Got: 'list[str]'"
Trying to convert the list type to pl.Utf8 also leads to an error:
(df
.groupby(by="Column A").agg(pl.col("Column B").unique())
.with_columns(pl.col("Column B").cast(pl.Utf8))
)
Output: "ComputeError: Cannot cast list type"
If I try to explode the list in the groupby context:
df.groupby(by="Column A").agg(pl.col("Column B").unique().explode())
The output is not what I want, since the strings get split into individual characters:
shape: (4, 2)
┌────────────┬─────────────────────┐
│ Column A ┆ Column B │
│ --- ┆ --- │
│ str ┆ list[str] │
╞════════════╪═════════════════════╡
│ Variable 1 ┆ ["A", "B"] │
│ Variable 3 ┆ ["A", "B", ... "D"] │
│ Variable 2 ┆ ["A", "B", ... "B"] │
│ Variable 4 ┆ ["A", "B", ... "D"] │
└────────────┴─────────────────────┘
What would be the most convenient way for me to groupby and then write to csv?
Desired output written in csv:
shape: (4, 2)
┌────────────┬──────────────┐
│ Column A ┆ Column B │
│ --- ┆ --- │
│ str ┆ list[str] │
╞════════════╪══════════════╡
│ Variable 3 ┆ ["AB", "CD"] │
│ Variable 1 ┆ ["AB"] │
│ Variable 4 ┆ ["CD"] │
│ Variable 2 ┆ ["CD", "AB"] │
└────────────┴──────────────┘

There was a recent discussion about why this is the case.
It is possible to use ._s.get_fmt() to "stringify" the lists:
print(
    df
    .groupby(by="Column A").agg(pl.col("Column B").unique())
    .with_columns(
        pl.col("Column B").map(lambda row:
            [row._s.get_fmt(n, 0) for n in range(row.len())]
        ).flatten()
    )
    .write_csv(),
    end=""
)
Column A,Column B
Variable 3,"[""AB"", ""CD""]"
Variable 1,"[""AB""]"
Variable 4,"[""CD""]"
Variable 2,"[""AB"", ""CD""]"
Another way is using str(), as @FObersteiner has suggested.
print(
    df.groupby("Column A").agg(
        pl.col("Column B")
        .unique()
        .apply(lambda col: str(col.to_list()))
    ).write_csv(),
    end=""
)
Column A,Column B
Variable 2,"['CD', 'AB']"
Variable 1,['AB']
Variable 3,"['CD', 'AB']"
Variable 4,['CD']
The main problem with "stringifying" lists is that when you read the CSV data back in, you no longer have a list[] type.
import io

pl.read_csv(io.StringIO(
    'Column A,Column B\nVariable 4,"[""CD""]"\n'
    'Variable 1,"[""AB""]"\nVariable 2,"[""AB"", ""CD""]"\n'
    'Variable 3,"[""CD"", ""AB""]"\n'
))
shape: (4, 2)
┌────────────┬──────────────┐
│ Column A   ┆ Column B     │
│ ---        ┆ ---          │
│ str        ┆ str          │
╞════════════╪══════════════╡
│ Variable 4 ┆ ["CD"]       │
│ Variable 1 ┆ ["AB"]       │
│ Variable 2 ┆ ["AB", "CD"] │
│ Variable 3 ┆ ["CD", "AB"] │
└────────────┴──────────────┘
This is the reason for the recommendation of using an alternative format.
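If all you actually need in the CSV is a comma-separated string per row (as the title asks), another option is to join the list elements into a single string before writing, which also sidesteps the nested-data error. A minimal sketch, assuming a Polars version that provides Expr.list.join (older releases named it arr.join and spelled the grouping method groupby):
import polars as pl

df = pl.DataFrame({
    "Column A": ["Variable 1", "Variable 2", "Variable 2", "Variable 3", "Variable 3", "Variable 4"],
    "Column B": ["AB", "AB", "CD", "AB", "CD", "CD"],
})

out = (
    df.group_by("Column A")
    .agg(pl.col("Column B").unique())
    # join each list into one string, e.g. ["AB", "CD"] -> "AB, CD"
    .with_columns(pl.col("Column B").list.join(", "))
)
print(out.write_csv(), end="")
When reading the file back in, the list type can then be recovered with something like pl.col("Column B").str.split(", ").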

Related

Julia Create Column name from a variable when writing to a csv file

I'm trying to write a column name that contains a variable. The other thing I'm trying to get working is appending column-wise.
CSV.write("File_Name.csv",(;"column$i"::String=sort(val)),append=true)
where i is generated in a for loop.
Also, how do I append in the next column? E.g. if there are 2 columns
column 1 | column2 |
then what's the way to add a new column next to them as column 3?
That's a pretty unusual way of going about things, and the right answer is probably to change your approach more fundamentally: build up the table in a reasonable format in your code, and write it out once you have it.
Fundamentally I believe you can't append columns to csv files; the append keyword works on a row basis. You can transpose whatever you're reading in, though, so you could just append your columns as rows and then read the file in transposed.
An example where there are five columns to be written out, each of which holds ten strings sorted alphabetically:
julia> using CSV, DataFrames, Random

julia> for i ∈ 1:5
           val = sort([randstring(3) for _ ∈ 1:10])
           CSV.write("test.csv", DataFrame(permutedims(val), :auto), append = true)
       end

julia> CSV.read("test.csv", DataFrame; transpose = true, header = false)
10×5 DataFrame
Row │ Column1 Column2 Column3 Column4 Column5
│ String3 String3 String3 String3 String3
─────┼─────────────────────────────────────────────
1 │ 98g AW6 02L 0HL 10n
2 │ Anq Bgv 8JB 0jf 8RH
3 │ FCL D2n 8Z1 2wd O5N
4 │ SUf QwW FhK 7jk QfP
5 │ ewW Sd2 Jxw EEv XgW
6 │ mG9 Thi Tbk Lx1 cqi
7 │ mm5 jI6 WI8 QsI lbm
8 │ nEa ou5 bRs S3o sIF
9 │ u2w rnz hb1 TPh tJD
10 │ zKW tZh x2J bJb tPn
Here val is a column in the final table: 10 random strings of length 3, sorted alphabetically. I then put this vector in a DataFrame transposed, which means that instead of one column of length 10, I get a table with 10 columns and one row:
julia> DataFrame(permutedims(val), :auto)
1×10 DataFrame
Row │ x1 x2 x3 x4 x5 x6 x7 x8 x9 x10
│ String String String String String String String String String String
─────┼────────────────────────────────────────────────────────────────────────────────
1 │ 3vh J4M JDu Y2P Zcb dLA dTy oU6 rhG tN2
The column names x1 to x10 come from the :auto kwarg, but they are irrelevant here because CSV.write with append = true will ignore the header anyway.
Doing this in a loop I therefore end up with a 5 row, 10 column csv file. Reading this in with transpose = true will give me a 10x5 table, and header = false means that CSV will just assign Column1...Column5 as column names.
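As for the column-name-from-a-variable part of the question: if you do build the table up in memory first, you can construct each name with string interpolation and a Symbol. A minimal sketch (the column contents and the file name log.csv are placeholders):
using CSV, DataFrames

df = DataFrame()
for i in 1:3
    val = sort(["b$i", "a$i", "c$i"])    # stand-in for the question's sort(val)
    df[!, Symbol("column$i")] = val      # column name built from the loop variable
end
CSV.write("log.csv", df)                 # write once, with all columns in place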

How can I use Julia CSV package rowWriter?

I'm using Julia. I would like to write a single row again and again to an existing CSV file.
I think CSV.RowWriter can make it happen, but I don't know how to use it. Can anyone show me an example?
CSV.RowWriter is an iterator that produces consecutive rows of a table as strings. Here is an example:
julia> df = DataFrame(a=1:5, b=11:15)
5×2 DataFrame
Row │ a b
│ Int64 Int64
─────┼──────────────
1 │ 1 11
2 │ 2 12
3 │ 3 13
4 │ 4 14
5 │ 5 15
julia> for row in CSV.RowWriter(df)
           @show row
       end
row = "a,b\n"
row = "1,11\n"
row = "2,12\n"
row = "3,13\n"
row = "4,14\n"
row = "5,15\n"
You would now just need to write these strings to a file in append mode.
Most likely, since you want to append, you will want to drop the header. You can do it e.g. like this:
julia> for row in CSV.RowWriter(df, writeheader=false)
           @show row
       end
row = "1,11\n"
row = "2,12\n"
row = "3,13\n"
row = "4,14\n"
row = "5,15\n"
If you want me to show how to write to a file please comment.
The reason why I do not show it is that you do not need to use CSV.RowWriter to achieve what you want. Just do the following:
CSV.write(file, table, append=true)
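For instance, a minimal sketch of that append pattern (the file name log.csv is a placeholder); a NamedTuple of vectors is a valid Tables.jl table, so a one-row write is easy:
using CSV

CSV.write("log.csv", (a=[1], b=[11]))                       # first write creates the file with a header
for i in 2:5
    CSV.write("log.csv", (a=[i], b=[10 + i]), append=true)  # append=true skips the header
end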
EDIT: example of writing with CSV.RowWriter:
julia> using DataFrames, CSV
julia> df = DataFrame(a=[1, 2], b=[3, 4])
2×2 DataFrame
Row │ a b
│ Int64 Int64
─────┼──────────────
1 │ 1 3
2 │ 2 4
julia> isfile("test.txt") # make sure the file does not exist yet
false
julia> open("test.txt", "w") do io # create a file and write with header as the file does not exist
foreach(row -> print(io, row), CSV.RowWriter(df))
end
julia> readlines("test.txt") # chceck all is as expected
3-element Vector{String}:
"a,b"
"1,3"
"2,4"
julia> open("test.txt", "a") do io # append to file and write without header
foreach(row -> print(io, row), CSV.RowWriter(df, writeheader=false))
end
julia> readlines("test.txt") # check that all is as expected
5-element Vector{String}:
"a,b"
"1,3"
"2,4"
"1,3"
"2,4"

Google Apps Script Utilities.parseCsv() change decimal and thousand separator

I am new to GAS and I am struggling badly with the problem that I have. (I haven't found a similar question on the site that would have solved my problem, therefore I am asking a new one)
Goal: Import CSV from Google Drive into Google Sheets
Problem:
Currencies in the csv file are "1,000.57" --> US format
Currency format that I need "1.000,57" --> European format
Currently with the Utilities.parseCsv() the formats just gets messed up and the currencies are plain wrong.
Question: Is there a way to change "," to "." and "." to "," during the parse? If so, will there be further problems since the delimiter for the csv is "," as well.
I already found some code snippets in the web (not my code: props to spreadsheet.dev) and tried to change the following, but it does not seem to work:
//Imports a CSV file in Google Drive into the Google Sheet
function importCSVFromDrive() {
  var fileName = promptUserForInput("Please enter the name of the CSV file to import from Google Drive:");
  var files = findFilesInDrive(fileName);
  if(files.length === 0) {
    displayToastAlert("No files with name \"" + fileName + "\" were found in Google Drive.");
    return;
  } else if(files.length > 1) {
    displayToastAlert("Multiple files with name " + fileName + " were found. This program does not support picking the right file yet.");
    return;
  }
  var file = files[0];
  var csvString = file.getBlob().getDataAsString();
  var escapedString = csvString.replace(",", ".")
                               .replace(".", ",");
  var contents = Utilities.parseCsv(escapedString);
  var sheetName = writeDataToSheet(contents);
  displayToastAlert("The CSV file was successfully imported into " + sheetName + ".");
}
//Prompts the user for input and returns their response
function promptUserForInput(promptText) {
  var ui = SpreadsheetApp.getUi();
  var prompt = ui.prompt(promptText);
  var response = prompt.getResponseText();
  return response;
}

//Returns files in Google Drive that have a certain name.
function findFilesInDrive(filename) {
  var files = DriveApp.getFilesByName(filename);
  var result = [];
  while(files.hasNext())
    result.push(files.next());
  return result;
}

//Inserts a new sheet and writes a 2D array of data in it
function writeDataToSheet(data) {
  var ss = SpreadsheetApp.getActive();
  var sheet = ss.insertSheet();
  sheet.getRange(1, 1, data.length, data[0].length).setValues(data);
  return sheet.getName();
}
What am I doing wrong?
As for what's going wrong in your code: String.replace with a string argument only replaces the first occurrence, so your two chained calls touch at most one character each; and even with a global regex, swapping the separators before parsing would corrupt the field delimiters, since "," delimits the CSV too. Beyond that, Utilities.parseCsv() is a hot mess. I recommend you not use it. Instead, try the Advanced Google Service - Drive V2
You will need to add Drive under Services.
Here is the code snippet you will need:
function insertFromCsv(fileName) {
  var blob = DriveApp.getFilesByName(fileName).next().getBlob();
  var tempFile = Drive.Files.insert({title: "tempSheet"}, blob, {
    convert: true
  });
  var tempSsId = tempFile.getId();
  var tempSheet = SpreadsheetApp.openById(tempSsId).getSheets()[0];
  var newSheet = tempSheet.copyTo(SpreadsheetApp.getActive());
  DriveApp.getFileById(tempSsId).setTrashed(true);
  return newSheet.getName();
}
and change importCSVFromDrive as follows:
function importCSVFromDrive() {
  var fileName = promptUserForInput("Please enter the name of the CSV file to import from Google Drive:");
  var files = findFilesInDrive(fileName);
  if(files.length === 0) {
    displayToastAlert("No files with name \"" + fileName + "\" were found in Google Drive.");
    return;
  } else if(files.length > 1) {
    displayToastAlert("Multiple files with name " + fileName + " were found. This program does not support picking the right file yet.");
    return;
  }
  var file = files[0];
  // var csvString = file.getBlob().getDataAsString();
  // var escapedString = csvString.replace(",", ".")
  //                              .replace(".", ",");
  // var contents = Utilities.parseCsv(escapedString);
  // var sheetName = writeDataToSheet(contents);
  var sheetName = insertFromCsv(file.getName());
  displayToastAlert("The CSV file was successfully imported into " + sheetName + ".");
}
So, a sample of your CSV data is here: https://drive.google.com/file/d/1ASevYOWtu8YL6YA4w-UqDuNXAS0RfaJF/view?usp=sharing
It looks to me like just plain CSV data:
Trades,Header,DataDiscriminator,Asset Category,Currency,Symbol,Date/Time,Quantity,T.Price
Trades,Data,Order,Stocks,USD,ALGN,"2021-06-28,10:50:27",3,627.17,621.52,-1881.51,-1,1882.51,0,-16.95,O
Trades,Data,Order,Stocks,USD,AMAT,"2021-06-29,09:38:53",14,142.15,141.92,-1990.1,-1,1991.1,0,-3.22,O
Trades,Data,Order,Stocks,USD,APH,"2021-07-02,09:30:01",30,69.438,69.95,-2083.14,-1,2084.14,0,15.36,O
I see no "european" formatted numbers out there.
I believe it can be parsed correctly into this:
Trades | Header | DataDiscriminator | Asset Category | Currency | Symbol | Date/Time | Quantity | T.Price
Trades | Data | Order | Stocks | USD | ALGN | "2021-06-28,10:50:27" | 3 | 627.17 | 621.52 | -1881.51 | -1 | 1882.51 | 0 | -16.95 | O
Trades | Data | Order | Stocks | USD | AMAT | "2021-06-29,09:38:53" | 14 | 142.15 | 141.92 | -1990.1 | -1 | 1991.1 | 0 | -3.22 | O
Trades | Data | Order | Stocks | USD | APH | "2021-07-02,09:30:01" | 30 | 69.438 | 69.95 | -2083.14 | -1 | 2084.14 | 0 | 15.36 | O
I haven't tried to do it with Utilities.parseCsv(); instead I wrote my own little csv-parser, just to be sure that the task is doable and my assumptions are correct:
var s = `Trades,Header,DataDiscriminator,Asset Category,Currency,Symbol,Date/Time,Quantity,T.Price
Trades,Data,Order,Stocks,USD,ALGN,"2021-06-28,10:50:27",3,627.17,621.52,-1881.51,-1,1882.51,0,-16.95,O
Trades,Data,Order,Stocks,USD,AMAT,"2021-06-29,09:38:53",14,142.15,141.92,-1990.1,-1,1991.1,0,-3.22,O
Trades,Data,Order,Stocks,USD,APH,"2021-07-02,09:30:01",30,69.438,69.95,-2083.14,-1,2084.14,0,15.36,O`;
// replace ',' with '_' inside quotes
s.match(/("[^,]+),(.+")/g).forEach(t=>s=s.split(t).join(t.replace(/,/g,'_')));
// replace ',' with '\t', replace '_' with ',' and split string into 2-d array
var array = s.replace(/,/g,"\t").replace(/_/g,',').split('\n').map(x => x.split('\t'));
console.table(array);
Output:
┌─────────┬──────────┬──────────┬─────────────────────┬──────────────────┬────────────┬──────────┬─────────────────────────┬────────────┬───────────┬──────────┬────────────┬──────┬───────────┬─────┬──────────┬─────┐
│ (index) │ 0 │ 1 │ 2 │ 3 │ 4 │ 5 │ 6 │ 7 │ 8 │ 9 │ 10 │ 11 │ 12 │ 13 │ 14 │ 15 │
├─────────┼──────────┼──────────┼─────────────────────┼──────────────────┼────────────┼──────────┼─────────────────────────┼────────────┼───────────┼──────────┼────────────┼──────┼───────────┼─────┼──────────┼─────┤
│ 0 │ 'Trades' │ 'Header' │ 'DataDiscriminator' │ 'Asset Category' │ 'Currency' │ 'Symbol' │ 'Date/Time' │ 'Quantity' │ 'T.Price' │ │ │ │ │ │ │ │
│ 1 │ 'Trades' │ 'Data' │ 'Order' │ 'Stocks' │ 'USD' │ 'ALGN' │ '"2021-06-28,10:50:27"' │ '3' │ '627.17' │ '621.52' │ '-1881.51' │ '-1' │ '1882.51' │ '0' │ '-16.95' │ 'O' │
│ 2 │ 'Trades' │ 'Data' │ 'Order' │ 'Stocks' │ 'USD' │ 'AMAT' │ '"2021-06-29,09:38:53"' │ '14' │ '142.15' │ '141.92' │ '-1990.1' │ '-1' │ '1991.1' │ '0' │ '-3.22' │ 'O' │
│ 3 │ 'Trades' │ 'Data' │ 'Order' │ 'Stocks' │ 'USD' │ 'APH' │ '"2021-07-02,09:30:01"' │ '30' │ '69.438' │ '69.95' │ '-2083.14' │ '-1' │ '2084.14' │ '0' │ '15.36' │ 'O' │
└─────────┴──────────┴──────────┴─────────────────────┴──────────────────┴────────────┴──────────┴─────────────────────────┴────────────┴───────────┴──────────┴────────────┴──────┴───────────┴─────┴──────────┴─────┘
If you use range.setValues(array) instead of console.table(array) you will probably get a proper table in your sheet.
Update
To replace 123.45 --> 123,45 in the array you need to add one line at the end:
array = array.map(row => row.map(cell => cell.replace(/(\d)\.(\d)/g, '$1,$2')));
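Putting the pieces together, a rough sketch of a complete import that swaps only the decimal separator (it reuses the writeDataToSheet helper from the question, and assumes the decimal swap is the only transformation needed):
function importCsvEuropean(fileName) {
  var csv = DriveApp.getFilesByName(fileName).next().getBlob().getDataAsString();
  var data = Utilities.parseCsv(csv);              // parse first, so the "," delimiters stay intact
  data = data.map(function(row) {
    return row.map(function(cell) {
      return cell.replace(/(\d)\.(\d)/g, '$1,$2'); // 123.45 -> 123,45
    });
  });
  writeDataToSheet(data);                          // helper from the question
}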
I created a csv file and tested it here: https://docs.google.com/spreadsheets/d/148muW2MhTBwXfppFhO85gQv6GRduvyb5OMHeS68VWH4/edit?usp=sharing
function importCsvFromId() {
  var id = '1yc04iLf20k6oVlwSMpWm-xythMPKJoBO';
  var csv = DriveApp.getFileById(id).getBlob().getDataAsString();
  var csvData = Utilities.parseCsv(csv);
  var f = SpreadsheetApp.getActiveSpreadsheet().getActiveSheet();
  f.getRange(1, 1, csvData.length, csvData[0].length).setValues(csvData);
}

readtable() when a string ends with \

When I read in a csv file containing
"number","text"
1,"row1text\"
2,"row2text"
with the commands
using DataFrames
readtable("filename.csv")
I get a dataframe with only one row. Apparently, the backslash at the end of the text in the first row is a problem. Is this expected behavior? Is there an alternative way where this problem is avoided?
As a side note: The following works fine (i.e. I get two rows) but is obviously impractical for reading in big files
df = csv"""
"number","text"
1,"row1text\"
2,"row2text"
"""
Since the backslash is the escape character by default, it escapes the quote mark and messes everything up. One workaround would be to use the CSV.jl package and specify a different escape character:
julia> using CSV
julia> CSV.read("filename.csv", escapechar = '~')
2×2 DataFrames.DataFrame
│ Row │ number │ text │
├─────┼────────┼─────────────┤
│ 1 │ 1 │ "row1text\" │
│ 2 │ 2 │ "row2text" │
But then you have to make sure the ~ chars are not escaping something else. There might be a better way of doing this, but this would be one hack to get around the problem.
Another way would be to process the data row by row. Here is a way over-complicated example of doing so:
julia> open("filename.csv", "r") do f
for (i, line) in enumerate(eachline(f))
if i == 1
colnames = map(Symbol, split(line, ','))
global df = DataFrame(String, 0, length(colnames))
rename!(df,
Dict([(old_name, new_name) for (old_name, new_name) in zip(names(df), colnames)]))
else
new_row = map(String, split(replace(line, "\\\"", "\""), ','))
# replace quotes around vales
new_row = map(x -> replace(x, "\"", ""), new_row)
push!(df, new_row)
end
end
end
julia> df
2×2 DataFrames.DataFrame
│ Row │ "number" │ "text" │
├─────┼──────────┼────────────┤
│ 1 │ "1" │ "row1text" │
│ 2 │ "2" │ "row2text" │
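For what it's worth, in more recent versions of CSV.jl the default escape character is the quote character itself (quotes inside fields are escaped by doubling them, per RFC 4180), so a trailing backslash in a quoted field should parse without any workaround. A quick sketch, assuming a current CSV.jl/DataFrames setup:
using CSV, DataFrames

io = IOBuffer("""
"number","text"
1,"row1text\\"
2,"row2text"
""")
df = CSV.read(io, DataFrame)  # the backslash is read as a literal character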

Readtable() with differing number of columns - Julia

I'm trying to read a CSV file into a DataFrame using readtable(). There is an unfortunate issue with the CSV file in that if the last x columns of a given row are blank, instead of generating that number of commas, it just ends the line. For example, I can have:
Col1,Col2,Col3,Col4
item1,item2,,item4
item5
Notice how in the third line, there is only one entry. Ideally, I would like readtable to fill the values for Col2, Col3, and Col4 with NA, NA, and NA; however, because of the lack of commas and therefore lack of empty strings, readtable() simply sees this as a row that doesn't match the number of columns. If I run readtable() in Julia with the sample CSV above, I get the error "Saw 2 Rows, 2 columns, and 5 fields, * Line 1 has 6 columns". If I add in 3 commas after item5, then it works.
Is there any way around this, or do I have to fix the CSV file?
If the CSV parsing doesn't need too much quote logic, it is easy to write a special purpose parser to handle the case of missing columns. Like so:
using DataFrames, DataStructures  # OrderedDict comes from DataStructures

function bespokeread(s)
    headers = split(strip(readline(s)), ',')
    ncols = length(headers)
    data = [String[] for i = 1:ncols]
    while !eof(s)
        newline = split(strip(readline(s)), ',')
        # pad short rows with empty strings so every row has ncols fields
        length(newline) < ncols && append!(newline, ["" for i = 1:ncols-length(newline)])
        for i = 1:ncols
            push!(data[i], newline[i])
        end
    end
    return DataFrame(; OrderedDict(Symbol(headers[i]) => data[i] for i = 1:ncols)...)
end
Then the file:
Col1,Col2,Col3,Col4
item1,item2,,item4
item5
Would give:
julia> df = bespokeread(f)
2×4 DataFrames.DataFrame
│ Row │ Col1 │ Col2 │ Col3 │ Col4 │
├─────┼─────────┼─────────┼──────┼─────────┤
│ 1 │ "item1" │ "item2" │ "" │ "item4" │
│ 2 │ "item5" │ "" │ "" │ "" │
Dan Getz's answer is nice, but it converts everything to strings.
The following solution instead "fills" the gaps and writes a new file (in a memory-efficient way) that can then be imported normally using readtable():
function fillAll(iF,oF,d=",")
open(iF, "r") do i
open(oF, "w") do o # "w" for writing
headerRow = strip(readline(i))
headers = split(headerRow,d)
nCols = length(headers)
write(o, headerRow*"\n")
for ln in eachline(i)
nFields = length(split(strip(ln),d))
write(o, strip(ln))
[write(o,d) for y in 1:nCols-nFields] # write delimiters to match headers
write(o,"\n")
end
end
end
end
fillAll("data.csv","data_out.csv",";")
Even better: just use CSV.jl.
julia> f = IOBuffer("Col1,Col2,Col3,Col4\nitem1,item2,,item4\nitem5"); # or the filename
julia> CSV.read(f)
2×4 DataFrames.DataFrame
│ Row │ Col1 │ Col2 │ Col3 │ Col4 │
├─────┼─────────┼─────────┼───────┼─────────┤
│ 1 │ "item1" │ "item2" │ #NULL │ "item4" │
│ 2 │ "item5" │ #NULL │ #NULL │ #NULL │