F# merge CSV files with different columns

I'm fairly new to F# but I'm fascinated by it and want to apply it to some applications. Currently, I have multiple CSV files, each containing a timestamp and some sensor values; the timestamps are unique, but the value columns differ between files.
For example, I have two CSV files:
csv1:
timestamp, sensor1
time1, 1.0
csv2:
timestamp, sensor1, sensor2
time2, 2.0, 3.0
The result I want is
timestamp, sensor1, sensor2
time1, 1.0,
time2, 2.0, 3.0
I wonder if there is an easy way to do this in F#. Thanks.
UPDATE 1:
Here is my current solution. It uses LumenWorks.Framework.IO.Csv (https://www.nuget.org/packages/LumenWorksCsvReader) to parse each CSV file into a Data.DataTable, and Deedle (https://www.nuget.org/packages/Deedle) to convert the merged Data.DataTable into a Frame whose SaveCsv method writes the result to a CSV file.
open System.IO
open System
open LumenWorks.Framework.IO.Csv
open Deedle

// get list of csv files
let filelist = expression_to_get_list_of_csv_file_path

// read a csv file from a path and return a Data.DataTable
let funcReadCSVtoDataTable (path:string) =
    use csv = new CachedCsvReader(new StreamReader(path), true)
    let tmpdata = new Data.DataTable()
    tmpdata.Load(csv)
    tmpdata

// map the list of file paths to a list of DataTables
let allTables = List.map funcReadCSVtoDataTable filelist

// create the allData table and merge every DataTable into it
let allData = new Data.DataTable()
List.iter (fun (x:Data.DataTable) -> allData.Merge(x)) allTables

// convert the DataTable to a Deedle Frame and save it to a csv file
let df = Frame.ReadReader (allData.CreateDataReader())
df.SaveCsv("./final_csv.csv")
The reason for using LumenWorks.Framework.IO.Csv is that I need to parse a few thousand files at once, and according to this article (https://www.codeproject.com/Articles/11698/A-Portable-and-Efficient-Generic-Parser-for-Flat-F) LumenWorks.Framework.IO.Csv is the fastest parser.
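Since parsing the individual files is the slow part, the reads could also be done in parallel before the sequential merge; a rough, untested sketch reusing the funcReadCSVtoDataTable defined above:

// parse each file on the thread pool, then go back to a list for the merge
let allTablesParallel =
    filelist
    |> List.toArray
    |> Array.Parallel.map funcReadCSVtoDataTable
    |> Array.toList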
UPDATE 2: FINAL SOLUTION
Thanks to Tomas for the row-key mapping solution (see his answer below); I adapted his code to handle a list of files.
// get list of csv files
let filelist = expression_to_get_list_of_csv_file_path

// function to merge two Frames
let domerge (df0:Frame<int,string>) (df1:Frame<int,string>) =
    df1
    |> Frame.mapRowKeys (fun k -> k + df0.Rows.KeyCount)
    |> Frame.merge df0

// read the file list into a list of Frames
let dflist = filelist |> List.map (fun (x:string) -> Frame.ReadCsv x)

// use List.fold to fold over the list with dflist.[0] as the initial state
let dffinal = List.tail dflist |> List.fold domerge (List.head dflist)
dffinal.SaveCsv("./final_csv.csv")
Now the code looks "functional". However, I get a small warning that Frame.ReadCsv is not intended to be used from F#, but it works anyway.

If you are happy to use an external library, then you can do this very easily using the data frame manipulation library called Deedle. Deedle lets you read data frames from CSV files and when you merge data frames, it makes sure to align column and row keys for you:
open Deedle

let f1 = Frame.ReadCsv("c:/temp/f1.csv")
let f2 = Frame.ReadCsv("c:/temp/f2.csv")

let merged =
    f2
    |> Frame.mapRowKeys (fun k -> k + f1.Rows.KeyCount)
    |> Frame.merge f1

merged.SaveCsv("c:/temp/merged.csv")
The one tricky thing that we have to do here is to use mapRowKeys. When you read the frames, Deedle automatically generates ordinal row keys for your data and so merging would fail because you have two rows with a key 0. The mapRowKeys function lets us transform the keys so that they are unique and the frames can be merged. (Saving the CSV file does not automatically write the row keys to the output, so the result of this is exactly what you wanted.)
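To make the key collision concrete, here is a minimal sketch with tiny hand-built frames standing in for the two CSV files (the record types and values are made up for illustration):

open Deedle

type Row1 = { timestamp: string; sensor1: float }
type Row2 = { timestamp: string; sensor1: float; sensor2: float }

let f1 = Frame.ofRecords [ { Row1.timestamp = "time1"; sensor1 = 1.0 } ]
let f2 = Frame.ofRecords [ { Row2.timestamp = "time2"; sensor1 = 2.0; sensor2 = 3.0 } ]

// Both frames get the ordinal row key 0, so merging them directly fails.
// Shifting f2's keys by f1's row count gives keys 0 and 1, and the merge
// then leaves a missing value in the sensor2 column for the first row.
let merged =
    f2
    |> Frame.mapRowKeys (fun k -> k + f1.Rows.KeyCount)
    |> Frame.merge f1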

If you do a lot of processing like this, you should look into the CSV type provider and parser, or my favorite, FileHelpers.
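For illustration, here is a minimal sketch of the CSV type provider approach (this assumes the FSharp.Data package is referenced; the inline sample string and column names are made up to match the question's shape):

open FSharp.Data

// the inline sample defines the expected columns and their inferred types
type SensorCsv = CsvProvider<"Timestamp,Sensor1\ntime1,1.0">

// parse more data with the same shape (Load could read a file instead)
let data = SensorCsv.Parse("Timestamp,Sensor1\ntime2,2.0\ntime3,3.5")
for row in data.Rows do
    printfn "%s %A" row.Timestamp row.Sensor1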
If you don't want to use any third party libraries, here's a quick step-by-step process to read, re-assemble and write out the file:
open System.IO
open System

let csv1path = @"E:\tmp\csv1.csv"
let csv2path = @"E:\tmp\csv2.csv"

/// Read the file, split it up, and remove the header from the first csv file
let csv1 =
    File.ReadAllLines(csv1path)
    |> Array.map (fun x -> x.Split(','))
    |> Array.tail

let csv2 =
    File.ReadAllLines(csv2path)
    |> Array.map (fun x -> x.Split(','))

/// Split the header and data in the second csv file
let header', data = (csv2.[0], Array.tail csv2)
let header = String.Join(",", header')

/// Put the data back together; this is an array of arrays
let csv3 =
    Array.append csv1 data

/// Sort the combined data, join each row back into a csv line and add back the header
let csv4 =
    csv3
    |> Array.sort
    |> Array.map (fun x -> String.Join(",", x))
    |> Array.append [|header|]

/// Write it out
File.WriteAllLines(@"E:\tmp\combined.csv", csv4)

Related

Write to file from string and array

I'm trying to write a CSV file where some of the values come from arrays.
let handle = "thetitle"
let title = "The Title"
let body = "ipsum lorem"
let mutable variantPrice = [|1000,2000,3000,4000|]
let mutable variantComparePrice = [|2000,4000,6000,8000|]
let mutable storlek = ["50x50","50x60","50x70","50x80"]
let Header = [|
    (handle, title, body, variantPrice, variantComparePrice, storlek)
|]
let lines = Header |> Array.map (fun (h, t, vp,vcp,b,s) -> sprintf "Handle\tTitle\tStorlek\tVariantPrice\tVariantComparePrice\tBody\n %s\t%s\t%s\t%s"h t s vp vcp b)
File.WriteAllLines( "data\\test.csv", lines, Encoding.UTF8)
But the problem is that the expression in lines expects a string, while I'm passing in a string[].
Ideally the csv file would look something like this:
|handle|title|body|variantPrice|variantComparePrice|storlek|
|thetitle|The Title|ipsum lorem|1000|2000|50x50|
|thetitle| | |2000|4000|50x60|
|thetitle| | |3000|6000|50x70|
|thetitle| | |4000|8000|50x80|
The first issue is that your variables storing data like variantPrice are currently arrays containing just a single element, which is a tuple - this is because you've separated elements using , rather than ;. Most likely, you'll want something like:
let variantPrice = [|1000;2000;3000;4000|]
let variantComparePrice = [|2000;4000;6000;8000|]
let storlek = [|"50x50";"50x60";"50x70";"50x80"|]
With this, you can then use Array.zip3 to get a single array with all the data (one item per row).
let data = Array.zip3 variantPrice variantComparePrice storlek
Now you can use Array.map to format the individual lines. The following is my guess based on your sample:
let lines = data |> Array.map (fun (vp, vcp, s) ->
    sprintf "|%s| | |%d|%d|%s" handle vp vcp s)
This is an array of lines represented as strings. Finally, you can append the header to the lines and write this to a file:
let header = "|handle|title|body|variantPrice|variantComparePrice|storlek|"
System.IO.File.WriteAllLines("c:/temp/test.csv",
    Array.append [| header |] lines, Encoding.UTF8)

F# CSV Type Provider: how to ignore some rows?

I am a beginner starting to use the FSharp.Data library:
http://fsharp.github.io/FSharp.Data/library/CsvProvider.html
let rawfile = CsvFile.Load("mydata.csv")
for row in rawfile.Rows do
    let d = System.DateTime.Parse (row.GetColumn("Date"))
    let p = float (row.GetColumn("Close Price"))
    printfn "%A %A" d p
    price_table.[BTC].Add (d,p)
I have a csv file whose last lines I would like to ignore because
they are something like "this data was produced by ...."
By the way, even if I delete those lines and save the file, when I reopen it those cells reappear... sticky ones!
There's an overload for CsvFile.Load that takes a TextReader-derived parameter.
If you know how many lines to skip, you can create a StreamReader on the file and skip lines with ReadLine.
use reader = new StreamReader("mydata.csv")
reader.ReadLine() |> ignore
reader.ReadLine() |> ignore
let rawfile = CsvFile.Load(reader)
If you're OK with loading the whole csv into memory, then you can simply reverse your rows, skip what you want and then (optionally) reverse back.
Example:
let skipNLastValues (n:int) (xs:seq<'a>) =
    xs
    |> Seq.rev
    |> Seq.skip n
    |> Seq.rev

for i in (skipNLastValues 2 {1..10}) do
    printfn "%A" i
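Applied to the CSV from the question, that could look like the following sketch (assuming the last two lines are the junk to drop; skipNLastValues is the helper defined above):

let rawfile = CsvFile.Load("mydata.csv")
// drop the trailing junk lines before parsing the values
for row in rawfile.Rows |> skipNLastValues 2 do
    printfn "%s %s" (row.GetColumn("Date")) (row.GetColumn("Close Price"))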

Deedle Frame From Database, What is the Schema?

I am new to Deedle, and in the documentation I can't find how to solve my problem.
I bind an SQL Table to a Deedle Frame using the following code:
namespace teste

open FSharp.Data.Sql
open Deedle
open System.Linq

module DatabaseService =

    [<Literal>]
    let connectionString = "Data Source=*********;Initial Catalog=*******;Persist Security Info=True;User ID=sa;Password=****"

    type bd = SqlDataProvider<
                ConnectionString = connectionString,
                DatabaseVendor = Common.DatabaseProviderTypes.MSSQLSERVER >

    type Database() =
        static member contextDbo() =
            bd.GetDataContext().Dbo
        static member acAgregations() =
            Database.contextDbo().AcAgregations |> Frame.ofRecords
        static member acBusyHourDefinition() =
            Database.contextDbo().AcBusyHourDefinition
            |> Frame.ofRecords "alternative_reference_table_scan", "formula"]
        static member acBusyHourDefinitionFilterByTimeAgregationTipe(value:int) =
            Database.acBusyHourDefinition()
            |> Frame.getRows
These things appear to work, but I can't understand the data frame schema; to my surprise, it is not a representation of the table.
My question is:
How can I access my database elements by rows instead of columns (columns are the Deedle default)? I tried what is shown in the documentation, but unfortunately the column names are not recognized, as they are in the CSV example on the Deedle website.
With Frame.ofRecords you can extract the table into a dataframe and then operate on its rows or columns. In this case I have a very simple table. This is for SQL Server, but I assume MySQL will work the same. If you provide more details in your question, the solution can be narrowed down.
The table is indexed by ID, which is an Int64.
You can work with the rows or the columns:
#if INTERACTIVE
#load @"..\..\FSLAB\packages\FsLab\FsLab.fsx"
#r "System.Data.Linq.dll"
#r "FSharp.Data.TypeProviders.dll"
#endif

//open FSharp.Data
//open System.Data.Linq
open Microsoft.FSharp.Data.TypeProviders
open Deedle

[<Literal>]
let connectionString1 = @"Data Source=(LocalDB)\MSSQLLocalDB;AttachDbFilename=C:\Users\userName\Documents\tes.sdf.mdf"

type dbSchema = SqlDataConnection<connectionString1>
let dbx = dbSchema.GetDataContext()
let table1 = dbx.Table_1

query { for row in table1 do
        select row } |> Seq.takeWhile (fun x -> x.ID < 10L) |> Seq.toList
// check if we can connect to the DB.

let df = table1 |> Frame.ofRecords           // pull the table into a df
let df = df.IndexRows<System.Int64>("ID")    // if you need an index
df.GetRows(2L)                               // get the second row, but this can be any kind of index/key
df.["Number"].GetSlice(Some 2L, Some 5L)     // get the 2nd to 5th row from the Number column
Will get you the following output:
val it : Series<System.Int64,float> =
2 -> 2
>
val it : Series<System.Int64,float> =
2 -> 2
3 -> 3
4 -> 4
5 -> 5
Depending on what you're trying to do, Selecting Specific Rows in Deedle might also work.
Edit
From your comment you appear to be working with a large table. Depending on how much memory you have and how large the table is, you might still be able to load it. If not, these are some of the things you can do, in increasing order of complexity:
Use a query { } expression like the one above to narrow the dataset on the database server and convert just part of the result into a dataframe (see the sketch after this list). You can do quite complex transformations, so you might not even need the dataframe in the end. This is basically Linq2Sql.
Use lazy loading in Deedle. This works with series so you can get a few series and reassemble a dataframe.
Use Big Deedle which is designed for this sort of thing.
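As a rough sketch of the first option, reusing table1 and the Int64 ID column from the example above (the cut-off value is arbitrary):

// narrow the data on the server, then convert only the result into a frame
let dfSmall =
    query { for row in table1 do
            where (row.ID < 1000L)
            select row }
    |> Frame.ofRecords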

how to create JSON string in Erlang manually

I am new to Erlang and noticed that there is no native function to create a JSON string from lists (or is there?). I use this method to create a JSON string in Erlang, but I do not know whether it will malfunction.
Here is an example of my method:
-module(index).
-export([test/0]).

test() ->
    Ma = "Hello World", Mb = "Hello Erlang",
    A = "{\"Messages\" : [\"" ++ Ma ++ "\", \""++Mb++"\"], \"Usernames\" : [\"Username1\", \"Username2\"]}",
    A.
The result is:
388> test().
"{\"Messages\" : [\"Hello World\", \"Hello Erlang\"], \"Usernames\" : [\"Username1\", \"Username2\"]}"
389>
I think this is the expected result, but is there any chance that this method may malfunction when special characters are included, such as <, >, &, /, \ or "?
What precautions should I take to make this method stronger?
If Ma or Mb contains double quotes or whatever control characters, the parsing from string to JSON will fail. This parsing may never occur in Erlang, as Erlang does not have string to JSON conversion built-in.
It's a good idea to use binaries (<<"I am a binary string">>), as lists consume a lot more resources.
We're using jiffy, which is implemented as a NIF and hence is reasonably fast and it allows for document construction like so:
jiffy:decode(<<"{\"foo\": \"bar\"}">>).
{[{<<"foo">>,<<"bar">>}]}
Doc = {[{foo, [<<"bing">>, 2.3, true]}]}.
{[{foo,[<<"bing">>,2.3,true]}]}
jiffy:encode(Doc).
<<"{\"foo\":[\"bing\",2.3,true]}">>
I had this very same problem, searched high and low, and in the end came up with my own method. This is purely to point people in the right direction to find a solution for themselves. Note: I tried jiffy, but as I'm using rebar3 it's not currently compatible.
I'm using MS SQL Server, so I use the Erlang odbc module: http://erlang.org/doc/man/odbc.html
The odbc:sql_query/2 call gives me back {selected, Columns, Results}.
From here I can take Columns, which is a list of strings, and Results, a list of rows each represented as a tuple, then create a few functions that output valid Erlang terms which serialize correctly to JSON based on a number of factors. Here's the full code.
Make the initial query:
Sql = "SELECT * FROM alloys;",
Ref = connect(),
case odbc:sql_query(Ref, Sql) of
    {selected, Columns, Results} ->
        set_json_from_sql(Columns, Results, []);
    {error, Reason} ->
        {error, Reason}
end.
Then the entry point is set_json_from_sql/3, which calls the functions below:
format_by_type(Item) ->
    if
        is_list(Item) -> list_to_binary(io_lib:format("~s", [Item]));
        is_integer(Item) -> Item;
        is_boolean(Item) -> io_lib:format("~a", [Item]);
        is_atom(Item) -> Item
    end.

json_by_type([H], [Hc], Data) ->
    NewH = format_by_type(H),
    set_json_flatten(Data, Hc, NewH);
json_by_type([H|T], [Hc|Tc], Data) ->
    NewH = format_by_type(H),
    NewData = set_json_flatten(Data, Hc, NewH),
    json_by_type(T, Tc, NewData).

set_json_flatten(Data, Column, Row) ->
    ResTuple = {list_to_binary(Column), Row},
    lists:flatten(Data, [ResTuple]).

set_json_from_sql([], [], Data) -> jsone:encode([{<<"data">>, lists:reverse(Data)}]);
set_json_from_sql(Columns, [H], Data) ->
    NewData = set_json_merge(H, Columns, Data),
    set_json_from_sql([], [], NewData);
set_json_from_sql(Columns, [H|T], Data) ->
    NewData = set_json_merge(H, Columns, Data),
    set_json_from_sql(Columns, T, NewData).

set_json_merge(Row, Columns, Data) ->
    TupleRow = json_by_type(tuple_to_list(Row), Columns, []),
    lists:append([TupleRow], Data).
So set_json_from_sql/3 gives you your Json output after matching set_json_from_sql([], [], Data).
The key points here are that you need to call list_to_binary/1 for strings and atoms. Use jsone to encode Erlang terms to JSON: https://github.com/sile/jsone
Also, notice that format_by_type/1 is used to match against Erlang term types; not ideal, but it works as long as you are aware of your DB's types, or you can add extra guards to accommodate them.
This works for me
test() ->
    Ma = "Hello World", Mb = "Hello Erlang",
    A = "{\"Messages\" : {{\"Ma\":\"" ++ Ma ++ "\"}, {\"Mb\":\""++Mb++"\"}}, {\"Usernames\" : {\"Username1\":\"usrname1\"}, {\"Username2\":\"usrname2\"}}",
    io:format("~s~n",[A]).
Output
10> io:format("~s~n",[A]).
{"Messages" : {{"Ma":Hello World"}, {"Mb":Hello Erlang"}}, {"Usernames" : {"Username1":"usrname1"}, {"Username2":"usrname2"}}
ok
Or use one of the many libraries on GitHub to convert Erlang terms to JSON. My Tuple-to-JSON module is simple but effective.
Do it like a pro
-define(JSON_WRAPPER(Proplist), {Proplist}).

-spec from_list(json_proplist()) -> object().
from_list([]) -> new();
from_list(L) when is_list(L) -> ?JSON_WRAPPER(L).

-spec to_binary(atom() | string() | binary() | integer() | float() | pid() | iolist()) -> binary().
to_binary(X) when is_float(X) -> to_binary(mochinum:digits(X));
to_binary(X) when is_integer(X) -> list_to_binary(integer_to_list(X));
to_binary(X) when is_atom(X) -> list_to_binary(atom_to_list(X));
to_binary(X) when is_list(X) -> iolist_to_binary(X);
to_binary(X) when is_pid(X) -> to_binary(pid_to_list(X));
to_binary(X) when is_binary(X) -> X.

-spec recursive_from_proplist(any()) -> object().
recursive_from_proplist([]) -> new();
recursive_from_proplist(List) when is_list(List) ->
    case lists:all(fun is_integer/1, List) of
        'true' -> List;
        'false' ->
            from_list([{to_binary(K), recursive_from_proplist(V)}
                       || {K,V} <- List
                      ])
    end;
recursive_from_proplist(Other) -> Other.

How to read value of property depending on an argument

How can I get the value of a property given a string argument?
I have an object CsvProvider.Row which has attributes a, b, c.
I want to get an attribute's value depending on the property name given as a string argument.
I tried something like this:
let getValue (tuple, name: string) =
    snd tuple |> Seq.averageBy (fun (y: CsvProvider<"s.csv">.Row) -> y.``name``)
but it gives me the following error:
Unexpected reserved keyword in lambda expression. Expected incomplete
structured construct at or before this point or other token.
A simple invocation of the function should look like this:
getValue(tuple, "a")
and it should be equivalent to the following function:
let getValue (tuple) =
    snd tuple |> Seq.averageBy (fun (y: CsvProvider<"s.csv">.Row) -> y.a)
Is something like this even possible?
Thanks for any help!
The CSV type provider is great if you are accessing data by column names statically, because you get nice auto-completion with type inference and checking.
However, for dynamic access, it might be easier to use the underlying CsvFile (also a part of F# Data) directly, rather than using the type provider:
// Read the given file
let file = CsvFile.Load("c:/test.csv")

// Look at the parsed headers and find the index of column "A"
let aIdx = file.Headers.Value |> Seq.findIndex (fun k -> k = "A")

// Iterate over rows and print A values
for r in file.Rows do
    printfn "%A" (r.Item(aIdx))
The only unfortunate thing is that the items are accessed by index, so you need to build some lookup table if you want to easily access them by their name.
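For example, a small sketch of such a lookup table, building on the file value above:

// build a column-name -> index dictionary once from the parsed headers
let columnIndex =
    file.Headers.Value
    |> Seq.mapi (fun i name -> name, i)
    |> dict

// then rows can be read "by name" through the lookup
for r in file.Rows do
    printfn "%s" (r.Item(columnIndex.["A"]))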