I am new to Deedle, and I can't find how to solve my problem in the documentation.
I bind an SQL Table to a Deedle Frame using the following code:
namespace teste

open FSharp.Data.Sql
open Deedle
open System.Linq

module DatabaseService =

    [<Literal>]
    let connectionString = "Data Source=*********;Initial Catalog=*******;Persist Security Info=True;User ID=sa;Password=****"

    type bd = SqlDataProvider<
                  ConnectionString = connectionString,
                  DatabaseVendor = Common.DatabaseProviderTypes.MSSQLSERVER >

    type Database() =

        static member contextDbo() =
            bd.GetDataContext().Dbo

        static member acAgregations() =
            Database.contextDbo().AcAgregations |> Frame.ofRecords

        static member acBusyHourDefinition() =
            Database.contextDbo().AcBusyHourDefinition
            |> Frame.ofRecords
            |> Frame.getCols ["alternative_reference_table_scan"; "formula"]
        static member acBusyHourDefinitionFilterByTimeAgregationTipe(value:int) =
            Database.acBusyHourDefinition()
            |> Frame.getRows
These things are working properly, but I can't understand the data frame schema; to my surprise, it is not a representation of the table.
My question is:
how can I access my database elements by rows instead of columns (columns are the Deedle default)? I tried what is shown in the documentation, but unfortunately the column names are not recognized, as they are in the CSV example on the Deedle website.
With Frame.ofRecords you can extract the table into a dataframe and then operate on its rows or columns. In this case I have a very simple table. This is for SQL Server but I assume MySQL will work the same. If you provide more details in your question the solution can be narrowed down.
This is the table, indexed by ID, which is Int64:
You can work with the rows or the columns:
#if INTERACTIVE
#load @"..\..\FSLAB\packages\FsLab\FsLab.fsx"
#r "System.Data.Linq.dll"
#r "FSharp.Data.TypeProviders.dll"
#endif

//open FSharp.Data
//open System.Data.Linq
open Microsoft.FSharp.Data.TypeProviders
open Deedle

[<Literal>]
let connectionString1 = @"Data Source=(LocalDB)\MSSQLLocalDB;AttachDbFilename=C:\Users\userName\Documents\tes.sdf.mdf"

type dbSchema = SqlDataConnection<connectionString1>
let dbx = dbSchema.GetDataContext()
let table1 = dbx.Table_1

query { for row in table1 do
        select row } |> Seq.takeWhile (fun x -> x.ID < 10L) |> Seq.toList
// check if we can connect to the DB.

let df = table1 |> Frame.ofRecords           // pull the table into a df
let df = df.IndexRows<System.Int64>("ID")    // if you need an index
df.GetRows(2L)                               // Get the second row, but this can be any kind of index/key
df.["Number"].GetSlice(Some 2L, Some 5L)     // get the 2nd to 5th row from the Number column
Will get you the following output:
val it : Series<System.Int64,float> =
2 -> 2
>
val it : Series<System.Int64,float> =
2 -> 2
3 -> 3
4 -> 4
5 -> 5
Depending on what you're trying to do, Selecting Specific Rows in Deedle might also work.
Edit
From your comment you appear to be working with some large table. Depending on how much memory you have and how large the table is, you still might be able to load it. If not, these are some of the things you can do, in increasing complexity:
Use a query { } expression like above to narrow the dataset on the database server and convert just part of the result into a dataframe (see the sketch after this list). You can do quite complex transformations, so you might not even need the dataframe in the end. This is basically Linq2Sql.
Use lazy loading in Deedle. This works with series, so you can get a few series and reassemble a dataframe.
Use BigDeedle, which is designed for this sort of thing.
Related
Is there a way to sort data and drop duplicates using pure pyarrow tables? My goal is to retrieve the latest version of each ID based on the maximum update timestamp.
Some extra details: my datasets are normally structured into at least two versions:
historical
final
The historical dataset would include all updated items from a source, so it is possible to have duplicates for a single ID, one for each change that happened to it (picture a Zendesk or ServiceNow ticket, for example, where a ticket can be updated many times).
I then read the historical dataset using filters, convert it into a pandas DF, sort the data, and then drop duplicates on some unique constraint columns.
dataset = ds.dataset(history, filesystem, partitioning)
table = dataset.to_table(filter=filter_expression, columns=columns)
df = table.to_pandas().sort_values(sort_columns, ascending=True).drop_duplicates(unique_constraint, keep="last")
table = pa.Table.from_pandas(df=df, schema=table.schema, preserve_index=False)
# ds.write_dataset(final, filesystem, partitioning)
# I tend to write the final dataset using the legacy dataset so I can make use of the partition_filename_cb - that way I can have one file per date_id. Our visualization tool connects to these files directly
# container/dataset/date_id=20210127/20210127.parquet
pq.write_to_dataset(final, filesystem, partition_cols=["date_id"], use_legacy_dataset=True, partition_filename_cb=lambda x: str(x[-1]).split(".")[0] + ".parquet")
It would be nice to cut out that conversion to pandas and then back to a table, if possible.
Edit March 2022: PyArrow is adding more functionality, though this particular one isn't here yet. My approach now would be:
import numpy as np
import pyarrow as pa
import pyarrow.compute as pc

def drop_duplicates(table: pa.Table, column_name: str) -> pa.Table:
    # keep only the first row for each distinct value in column_name
    unique_values = pc.unique(table[column_name])
    unique_indices = [pc.index(table[column_name], value).as_py() for value in unique_values]
    mask = np.full((len(table)), False)
    mask[unique_indices] = True
    return table.filter(mask=mask)
(end of edit)
I saw your question because I had a similar one, and I solved it for my work (due to IP issues I can't post the whole code, but I'll try to answer as well as I can; I've never done this before).
import pyarrow.compute as pc
import pyarrow as pa
import numpy as np

array = table.column(column_name)
dicts = {dct['values']: dct['counts'] for dct in pc.value_counts(array).to_pylist()}

for key, value in dicts.items():
    # do stuff with each unique value and its count
    pass
I used 'value_counts' to find the unique values and how many of them there are (https://arrow.apache.org/docs/python/generated/pyarrow.compute.value_counts.html). Then I iterated over those values. If the count was 1, I selected the row by using
mask = pa.array(np.array(array) == key)
row = table.filter(mask)
and if the count was more than 1, I selected either the first or last one by using numpy boolean arrays as a mask again.
After iterating, it was just as simple as pa.concat_tables(tables).
Warning: this is a slow process. If you need something quick & dirty, try the "unique" option (also in the same link I provided).
Edit/extra: you can make it a bit faster/less memory intensive by maintaining a numpy array of boolean masks while iterating over the dictionary, and then at the end returning table.filter(mask=boolean_mask).
I don't know how to calculate the speed though...
edit2:
(sorry for the many edits. I've been doing a lot of refactoring and trying to get it to work faster.)
You can also try something like:
def drop_duplicates(table: pa.Table, col_name: str) -> pa.Table:
    # keep only the first occurrence of each value in col_name
    column_array = table.column(col_name)
    mask_x = np.full((table.shape[0]), False)
    _, mask_indices = np.unique(np.array(column_array), return_index=True)
    mask_x[mask_indices] = True
    return table.filter(mask=mask_x)
The following gives good performance: about 2 minutes for a table with half a billion rows. The reason I don't do combine_chunks(): there is a bug, Arrow seemingly cannot combine chunked arrays if their size is too large. See details: https://issues.apache.org/jira/browse/ARROW-10172?src=confmacro
# build a global row index across all chunks of the ID column
a = [len(tb3['ID'].chunk(i)) for i in range(len(tb3['ID'].chunks))]
c = np.array([np.arange(x) for x in a])
a = ([0] + a)[:-1]
c = pa.chunked_array(c + np.cumsum(a))
tb3 = tb3.set_column(tb3.shape[1], 'index', c)

# for each ID keep the smallest index, then filter the table down to those rows
selector = tb3.group_by(['ID']).aggregate([("index", "min")])
tb3 = tb3.filter(pc.is_in(tb3['index'], value_set=selector['index_min']))
I found duckdb can give better performance on group by. Changing the last two lines above into the following gives a 2X speedup:
import duckdb
duck = duckdb.connect()
sql = "select first(index) as idx from tb3 group by ID"
duck_res = duck.execute(sql).fetch_arrow_table()
tb3 = tb3.filter(pc.is_in(tb3['index'], value_set=duck_res['idx']))
I'm fairly new to F#, but I'm fascinated by it and want to apply it to some applications. Currently, I have multiple CSV files which contain just a timestamp and some sensor values; the timestamps are unique but the columns are different.
For example, I have two CSV files.
csv1:
timestamp, sensor1
time1, 1.0
csv2:
timestamp, sensor1, sensor2
time2, 2.0, 3.0
The result I want is
timestamp, sensor1, sensor2
time1, 1.0,
time2, 2.0, 3.0
I wonder if there is any easy way to do this in F#. Thanks.
UPDATE 1:
Here is my current solution, which involves using LumenWorks.Framework.IO.Csv (https://www.nuget.org/packages/LumenWorksCsvReader) to parse CSV into a Data.DataTable, and Deedle (https://www.nuget.org/packages/Deedle) to convert the Data.DataTable into a Frame and use its SaveCsv method to save to CSV files.
open System.IO
open System
open LumenWorks.Framework.IO.Csv
open Deedle

// get list of csv files
let filelist = expression_to_get_list_of_csv_file_path

// func to read a csv from a path and return a Data.DataTable
let funcReadCSVtoDataTable (path:string) =
    use csv = new CachedCsvReader(new StreamReader(path), true)
    let tmpdata = new Data.DataTable()
    tmpdata.Load(csv)
    tmpdata

// map the list of file paths to get a list of DataTables
let allTables = List.map funcReadCSVtoDataTable filelist

// create the allData table and iterate over the list
let allData = new Data.DataTable()
List.iter (fun (x:Data.DataTable) -> allData.Merge(x)) allTables

// convert the DataTable to a Deedle Frame and save to a csv file
let df = Frame.ReadReader (allData.CreateDataReader())
df.SaveCsv("./final_csv.csv")
The reason for using LumenWorks.Framework.IO.Csv is that I need to parse a few thousand files at the same time, and according to this article (https://www.codeproject.com/Articles/11698/A-Portable-and-Efficient-Generic-Parser-for-Flat-F) LumenWorks.Framework.IO.Csv is the fastest.
UPDATE 2: FINAL SOLUTION
Thanks to Tomas for the row-key mapping solution (see his answer below); I reworked his code for the case of a list of files:
// get list of csv files
let filelist = expression_to_get_list_of_csv_file_path

// function to merge two Frames
let domerge (df0:Frame<int,string>) (df1:Frame<int,string>) =
    df1
    |> Frame.mapRowKeys (fun k -> k + df0.Rows.KeyCount)
    |> Frame.merge df0

// read filelist into a Frame list
let dflist = filelist |> List.map (fun (x:string) -> Frame.ReadCsv x)

// use List.fold to "fold" through the list with dflist.[0] as the initial state
let dffinal = List.tail dflist |> List.fold domerge (List.head dflist)
dffinal.SaveCsv("./final_csv.csv")
Now the code looks "functional"; however, I get a small warning on Frame.ReadCsv saying the method is not intended for use from F#, but it works anyway.
If you are happy to use an external library, then you can do this very easily using the data frame manipulation library called Deedle. Deedle lets you read data frames from CSV files and when you merge data frames, it makes sure to align column and row keys for you:
open Deedle

let f1 = Frame.ReadCsv("c:/temp/f1.csv")
let f2 = Frame.ReadCsv("c:/temp/f2.csv")

let merged =
    f2
    |> Frame.mapRowKeys (fun k -> k + f1.Rows.KeyCount)
    |> Frame.merge f1

merged.SaveCsv("c:/temp/merged.csv")
The one tricky thing that we have to do here is to use mapRowKeys. When you read the frames, Deedle automatically generates ordinal row keys for your data and so merging would fail because you have two rows with a key 0. The mapRowKeys function lets us transform the keys so that they are unique and the frames can be merged. (Saving the CSV file does not automatically write the row keys to the output, so the result of this is exactly what you wanted.)
If you do a lot of processing like this, you should look into the CSV type provider and parser, or my favorite, FileHelpers.
If you don't want to use any third party libraries, here's a quick step-by-step process to read, re-assemble and write out the file:
open System.IO
open System

let csv1path = @"E:\tmp\csv1.csv"
let csv2path = @"E:\tmp\csv2.csv"

/// Read the file, split it up, and remove the header from the first csv file
let csv1 =
    File.ReadAllLines(csv1path)
    |> Array.map (fun x -> x.Split(','))
    |> Array.tail

let csv2 =
    File.ReadAllLines(csv2path)
    |> Array.map (fun x -> x.Split(','))

/// Split the header and data in the second csv file
let header', data = (csv2.[0], Array.tail csv2)
let header = String.Join(",", header')

/// Put the data back together; this is an array of arrays
let csv3 =
    Array.append csv1 data

/// Sort the combined file, put it back together as a csv and add back the header
let csv4 =
    csv3
    |> Array.sort
    |> Array.map (fun x -> String.Join(",", x))
    |> Array.append [|header|]

/// Write it out
File.WriteAllLines(@"E:\tmp\combined.csv", csv4)
I am a beginner and starting to use the FSharp.Data library:
http://fsharp.github.io/FSharp.Data/library/CsvProvider.html
let rawfile = CsvFile.Load("mydata.csv")
for row in rawfile.Rows do
    let d = System.DateTime.Parse (row.GetColumn("Date"))
    let p = float (row.GetColumn("Close Price"))
    printfn "%A %A" d p
    price_table.[BTC].Add (d,p)
I have a CSV file whose last lines I would like to ignore because
they are something like "this data was produced by ....".
By the way, even if I delete those lines and save the file, when I reopen it those cells reappear... sticky ones!
There's an overload for CsvFile.Load that takes a TextReader-derived parameter.
If you know how many lines to skip, you can create a StreamReader on the file and skip lines with ReadLine before handing the reader to CsvFile.Load:
use reader = new StreamReader("mydata.csv")
reader.ReadLine() |> ignore
reader.ReadLine() |> ignore
let rawfile = CsvFile.Load(reader)
If you're OK with loading the whole CSV into memory, then you can simply reverse your rows, skip what you want, and then (optionally) reverse back.
Example:
let skipNLastValues (n:int) (xs:seq<'a>) =
    xs
    |> Seq.rev
    |> Seq.skip n
    |> Seq.rev

for i in (skipNLastValues 2 {1..10}) do
    printfn "%A" i
How can I get the value of a property given a string argument?
I have an object of type CsvProvider<"s.csv">.Row which has properties a, b, c.
I want to get the property value depending on the property name given as a string argument.
I tried something like this:
let getValue (tuple, name: string) =
    snd tuple |> Seq.averageBy (fun (y: CsvProvider<"s.csv">.Row) -> y.``name``)
but it gives me the following error:
Unexpected reserved keyword in lambda expression. Expected incomplete
structured construct at or before this point or other token.
A simple invocation of the function should look like this:
getValue(tuple, "a")
and it should be equivalent to the following function:
let getValue (tuple) =
    snd tuple |> Seq.averageBy (fun (y: CsvProvider<"s.csv">.Row) -> y.a)
Is something like this even possible?
Thanks for any help!
The CSV type provider is great if you are accessing data by column names statically, because you get nice auto-completion with type inference and checking.
However, for dynamic access, it might be easier to use the underlying CsvFile (also a part of F# Data) directly, rather than using the type provider:
// Read the given file
let file = CsvFile.Load("c:/test.csv")

// Look at the parsed headers and find the index of column "A"
let aIdx = file.Headers.Value |> Seq.findIndex (fun k -> k = "A")

// Iterate over rows and print A values
for r in file.Rows do
    printfn "%A" (r.Item(aIdx))
The only unfortunate thing is that the items are accessed by index, so you need to build some lookup table if you want to easily access them by their name.
I would like to transfer a SQL table (let's say (i) two columns, one containing user IDs and one containing user ages, and (ii) n rows) containing only integers into an F# matrix (same dimensions).
I managed to do so with the following F# code, but I am convinced it is not the most efficient way.
Indeed, the only way I found to dimension the F# matrix was to create two tables with a single value each (the number of rows and the number of columns respectively) using MySQL and transfer these values into F#.
Is it possible to import a MySQL table into an F# matrix with F# code which "recognizes" the dimensions of the matrix? Basically, I would like a function which takes a table address as an argument and returns a matrix.
Here is my code:
#r "FSharp.PowerPack.dll"
#r "Microsoft.Office.Interop.Excel"
open System
open System.Data
open System.Data.SqlClient
open Microsoft.Office.Interop
open Microsoft.FSharp.Math
open System.Collections.Generic
//Need of three types : User, number of rows and number of columns
type user = {
ID : int;
Age : int;}
type nbrRows = {NbreL : int ;}
type nbrCol = {NbreC : int ;}
// I. Import the SQL data into F#
// I.1. Import the number of rows of the table into F#
let NbrRows = seq {
use cnn = new SqlConnection(#"myconnection; database=MyDataBase; integrated security=true")
use cmd1 = new SqlCommand("Select * from theTablesWhichContainsTheNumberOfRows", cnn)
cnn.Open()
use reader = cmd1.ExecuteReader()
while reader.Read() do
yield {
NbreL = unbox(reader.["Expr1"])
}
}
let NbrRowsList = NbrRows |> Seq.toList // convert the sequence into a List
// I.2. Same code to import the number of columns of the table
let NbrCol = seq {
use cnn = new SqlConnection(#"MyConnection; database=myDatabase; integrated security=true")
use cmd1 = new SqlCommand("Select * from theTablesWhichContainsTheNumberOfColumns", cnn)
cnn.Open()
use reader = cmd1.ExecuteReader()
while reader.Read() do
yield {
NbreC = unbox(reader.["Expr1"])
}
}
let NbrColsList = NbrCol |> Seq.toList
// Initialisation of the Matrix
let matrixF = Matrix.create NbrRowsList.[0].NbreL NbrColsList.[0].NbreC 0.
//Transfer of the mySQL User table into F# through a sequence as previously
let GetUsers = seq {
use cnn = new SqlConnection(#"myConnection, database=myDatabase; integrated security=true")
use cmd = new SqlCommand("Select * from tableUser ORDER BY ID", cnn)
cnn.Open()
use reader = cmd.ExecuteReader()
while reader.Read() do
yield {
ID = unbox(reader.["ID"])
Age = unbox(reader.["Age"])
}
}
// Sequence to list
let UserDatabaseList = GetUsers |> Seq.toList
// Fill the user matrix
for i in 0 .. (NbrRowsList.[0].NbreL - 1) do
    matrixF.[i,0] <- UserDatabaseList.[i].ID |> float
    matrixF.[i,1] <- UserDatabaseList.[i].Age |> float

matrixF
There are various ways to initialize a matrix if you don't know its size in advance. For example, Matrix.ofList takes a list of lists and calculates the size automatically.
If you have just UserDatabaseList (which you can create without knowing the number of rows and columns), then you should be able to write:
Matrix.ofList
    [ // Create list containing rows from the database
      for row in UserDatabaseList do
          // For each row, return list of columns (float values)
          yield [ float row.ID; float row.Age ] ]
Aside - the F# matrix is really useful mainly if you're going to do some matrix operations (and even then, it is not the most efficient option). If you're doing some data processing, then it may be easier to keep the data in an ordinary list. If you're doing some serious math, then you may want to check how to use the Math.NET library from F#, which has a more efficient matrix type.