I have a requirement where I need to fetch data every 5 minutes from multiple source systems (MySQL instances), then join and enrich it with some other data (present in S3, let's say).
I want to do this processing in Spark so I can distribute the execution across multiple executors.
The main problem is that every time I do a lookup in MySQL, I only want to fetch the latest records (say, rows with lastModifiedOn > timestamp).
How can this selective fetch of MySQL rows be done effectively?
This is what I have tried:
val filmDf = sqlContext.read.format("jdbc")
  .option("url", "jdbc:mysql://localhost/sakila")
  .option("driver", "com.mysql.jdbc.Driver")
  .option("dbtable", "film")
  .option("user", "root")
  .option("password", "")
  .load()
You should use Spark SQL with the JDBC data source. Here is an example:
import java.util.Properties

val res = spark.read.jdbc(
  url = "jdbc:mysql://localhost/test?user=minty&password=greatsqldb",
  table = "TEST.table",
  columnName = "lastModifiedOn",   // column used to split the read into partitions
  lowerBound = lowerTimestamp,
  upperBound = upperTimestamp,
  numPartitions = 20,
  connectionProperties = new Properties()
)
There are more examples in the Apache Spark test suite: https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCSuite.scala
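If it helps, here is a rough PySpark variant of the same idea that also pushes the lastModifiedOn filter down into MySQL by reading from a subquery (the table and column names come from the question; the timestamp literal is just a placeholder):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("incremental-mysql-read").getOrCreate()

# Only rows newer than the last run cross the network, because the filter is
# part of the query MySQL executes. The timestamp value is a placeholder.
incremental_query = "(SELECT * FROM film WHERE lastModifiedOn > '2021-01-01 00:00:00') AS t"

film_df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:mysql://localhost/sakila")
    .option("driver", "com.mysql.jdbc.Driver")
    .option("dbtable", incremental_query)
    .option("user", "root")
    .option("password", "")
    .load()
)

Note that columnName, lowerBound, upperBound and numPartitions in the call above only control how the read is split into parallel partitions; they do not by themselves drop older rows, which is why pushing the predicate into the query text helps.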
Is there a way to sort data and drop duplicates using pure pyarrow tables? My goal is to retrieve the latest version of each ID based on the maximum update timestamp.
Some extra details: my datasets are normally structured into at least two versions:
historical
final
The historical dataset would include all updated items from a source, so it is possible to have duplicates for a single ID, one for each change that happened to it (picture a Zendesk or ServiceNow ticket, for example, where a ticket can be updated many times).
I then read the historical dataset using filters, convert it into a pandas DF, sort the data, and then drop duplicates on some unique constraint columns.
import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.parquet as pq

dataset = ds.dataset(history, filesystem=filesystem, partitioning=partitioning)
table = dataset.to_table(filter=filter_expression, columns=columns)
# Round-trip through pandas to sort and drop duplicates on the unique-constraint columns.
df = table.to_pandas().sort_values(sort_columns, ascending=True).drop_duplicates(unique_constraint, keep="last")
table = pa.Table.from_pandas(df=df, schema=table.schema, preserve_index=False)
# ds.write_dataset(final, filesystem=filesystem, partitioning=partitioning)
# I tend to write the final dataset with the legacy writer so I can use partition_filename_cb
# and get one file per date_id (e.g. container/dataset/date_id=20210127/20210127.parquet);
# our visualization tool connects to these files directly.
pq.write_to_dataset(table, final, partition_cols=["date_id"], filesystem=filesystem,
                    use_legacy_dataset=True,
                    partition_filename_cb=lambda x: str(x[-1]).split(".")[0] + ".parquet")
It would be nice to cut out that conversion to pandas and then back to a table, if possible.
Edit March 2022: PyArrow is adding more functionality, though this particular one isn't there yet. My approach now would be:
import numpy as np
import pyarrow as pa
import pyarrow.compute as pc

def drop_duplicates(table: pa.Table, column_name: str) -> pa.Table:
    # Index of the first occurrence of each unique value in the column.
    unique_values = pc.unique(table[column_name])
    unique_indices = [pc.index(table[column_name], value).as_py() for value in unique_values]
    mask = np.full(len(table), False)
    mask[unique_indices] = True
    return table.filter(mask=mask)
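A quick usage sketch on a toy table (the data is made up just to show that the first row per id is kept):

toy = pa.table({
    "id": [1, 1, 2, 3],
    "updated_at": ["2021-01-01", "2021-01-05", "2021-01-02", "2021-01-03"],
})
deduped = drop_duplicates(toy, "id")
print(deduped.to_pydict())
# {'id': [1, 2, 3], 'updated_at': ['2021-01-01', '2021-01-02', '2021-01-03']}

Since pc.index returns the first occurrence, sort the table so the row you want to keep comes first (e.g. descending by update timestamp) before applying the helper if you need the latest version per ID.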
//end edit
I saw your question because I had a similar one, and I solved it for my work (due to IP issues I can't post the whole code, but I'll try to answer as well as I can; I've never done this before).
import pyarrow.compute as pc
import pyarrow as pa
import numpy as np

array = table.column(column_name)
dicts = {dct['values']: dct['counts'] for dct in pc.value_counts(array).to_pylist()}
for key, value in dicts.items():
    ...  # do stuff
I used 'value_counts' to find the unique values and how many of them there are (https://arrow.apache.org/docs/python/generated/pyarrow.compute.value_counts.html). Then I iterated over those values. If the count was 1, I selected the row by using
mask = pa.array(np.array(array) == key)
row = table.filter(mask)
and if the count was more than 1, I selected either the first or the last occurrence by using NumPy boolean arrays as a mask again.
After iterating, it was as simple as pa.concat_tables(tables).
Warning: this is a slow process. If you need something quick and dirty, try the "unique" option (also in the same link I provided).
Edit/extra: you can make it a bit faster and less memory intensive by keeping a single NumPy boolean mask up to date while iterating over the dictionary, and then returning table.filter(mask=boolean_mask) at the end; see the sketch below.
I don't know how to calculate the speed though...
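For what it's worth, here is a minimal sketch of that mask-accumulation idea as I understand it (table, column_name and the keep argument are placeholders, not the original code):

import numpy as np
import pyarrow as pa
import pyarrow.compute as pc

def drop_duplicates_by_mask(table: pa.Table, column_name: str, keep: str = "last") -> pa.Table:
    # Plain NumPy copy of the key column, used to locate each key's rows.
    values = np.array(table[column_name])
    # One boolean mask for the whole table, filled in while iterating.
    mask = np.full(len(table), False)
    for entry in pc.value_counts(table[column_name]).to_pylist():
        positions = np.nonzero(values == entry["values"])[0]
        # Keep either the first or the last occurrence of this key.
        mask[positions[-1] if keep == "last" else positions[0]] = True
    return table.filter(mask=mask)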
edit2:
(sorry for the many edits. I've been doing a lot of refactoring and trying to get it to work faster.)
You can also try something like:
def drop_duplicates(table: pa.Table, col_name: str) -> pa.Table:
    column_array = table.column(col_name)
    mask_x = np.full(table.shape[0], False)
    # np.unique returns the index of the first occurrence of each unique value.
    _, mask_indices = np.unique(np.array(column_array), return_index=True)
    mask_x[mask_indices] = True
    return table.filter(mask=mask_x)
The following gives good performance: about 2 minutes for a table with half a billion rows. The reason I don't call combine_chunks(): there is a bug where Arrow seemingly cannot combine chunked arrays if they are too large. See details: https://issues.apache.org/jira/browse/ARROW-10172?src=confmacro
# Build a global row index that matches the existing chunk layout of tb3.
chunk_lengths = [len(chunk) for chunk in tb3['ID'].chunks]
offsets = np.cumsum([0] + chunk_lengths[:-1])
index = pa.chunked_array([np.arange(n) + off for n, off in zip(chunk_lengths, offsets)])
tb3 = tb3.append_column('index', index)
# For each ID keep the smallest row index, i.e. the first occurrence.
selector = tb3.group_by(['ID']).aggregate([("index", "min")])
tb3 = tb3.filter(pc.is_in(tb3['index'], value_set=selector['index_min']))
I found that DuckDB gives better performance on the group by. Changing the last two lines above into the following gives a 2x speedup:
import duckdb
duck = duckdb.connect()
sql = "select first(index) as idx from tb3 group by ID"
duck_res = duck.execute(sql).fetch_arrow_table()
tb3 = tb3.filter(pc.is_in(tb3['index'], value_set=duck_res['idx']))
I'm trying to output the optimal hyperparameters for a decision tree classifier I trained using Spark's MLlib to a csv file using Dataframes and spark-csv. Here's a snippet of my code:
// Split the data into training and test sets (10% held out for testing)
val Array(trainingData, testData) = assembledData.randomSplit(Array(0.9, 0.1))
// Define cross validation with a hyperparameter grid
val crossval = new CrossValidator()
.setEstimator(classifier)
.setEstimatorParamMaps(paramGrid)
.setEvaluator(new BinaryClassificationEvaluator)
.setNumFolds(10)
// Train model
val model = crossval.fit(trainingData)
// Find best hyperparameter combination and create an RDD
val bestModel = model.bestModel
val hyperparamList = new ListBuffer[(String, String)]()
bestModel.extractParamMap().toSeq.foreach { pair =>
  val hyperparam: (String, String) = (pair.param.name, pair.value.toString)
  hyperparamList += hyperparam
}
val hyperparameters = sqlContext.sparkContext.parallelize(hyperparamList.toSeq)
// Print the best hyperparameters
bestModel.extractParamMap().toSeq.foreach { pair =>
  println(s"${pair.param.parent} ${pair.param.name}")
  println(pair.value)
}
// Define csv path to output results
var csvPath: String = "/root/results/decision-tree"
val hyperparametersPath: String = csvPath+"/hyperparameters"
val hyperparametersFile: File = new File(hyperparametersPath)
val results = (hyperparameters, hyperparametersPath, hyperparametersFile)
// Convert RDD to Dataframe and write it as csv
val dfToSave = spark.createDataFrame(results._1)
dfToSave.write.format("csv").mode("overwrite").save(results._2)
// Stop spark session
spark.stop()
After the Spark job finishes, I can see the part-00*... and _SUCCESS files inside the path as expected. However, although there are 13 hyperparameters in total in this case (confirmed by printing them on screen), cat-ing the csv files shows that not every hyperparameter was written:
user#master:~$ cat /root/results/decision-tree/hyperparameters/part*.csv
checkpointInterval,10
featuresCol,features
maxDepth,5
minInstancesPerNode,1
Also, the hyperparameters that do get written change in every execution. This is executed on an HDFS-based Spark cluster with 1 master and 3 workers that have exactly the same hardware. Could it be a race condition? If so, how can I solve it?
Thanks in advance.
I think I figured it out. I expected dfToSave.write.format("csv").save(path) to write everything to the master node, but since the tasks are distributed to all workers, each worker saves its part of the hyperparameters to a local CSV in its filesystem. Because in my case the master node is also a worker, I can see its part of the hyperparameters. The "inconsistent behaviour" (i.e. seeing different parts in each execution) is caused by whatever algorithm Spark uses for distributing partitions among workers.
My solution will be to collect the CSVs from all workers using something like scp or rsync to build the full results.
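For what it's worth, a rough sketch of that collection step in Python (the hostnames and paths are placeholders, and it assumes passwordless SSH from wherever the script runs):

import subprocess

# Placeholder worker hostnames and result paths; adjust to your cluster.
workers = ["worker1", "worker2", "worker3"]
remote_dir = "/root/results/decision-tree/hyperparameters/"
merged_dir = "/root/results/decision-tree/hyperparameters-merged/"

for host in workers:
    # Pull each worker's part files into a single local directory.
    subprocess.run(["rsync", "-av", f"{host}:{remote_dir}", merged_dir], check=True)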
I have a MySQL source from which I am creating a Glue Dynamic Frame with predicate push down condition as follows
datasource = glueContext.create_dynamic_frame_from_catalog(
    database=source_catalog_db,
    table_name=source_catalog_tbl,
    push_down_predicate="id > 1531812324",
    transformation_ctx="datasource")
I am always getting all the records in 'datasource', whatever condition I put in 'push_down_predicate'.
What am I missing?
Pushdown predicates work for partitioning columns only. In other words, your data files should be placed in hierarchically structured folders. For example, if the data is located in s3://bucket/dataset/ and partitioned by year, month and day, then the structure should be the following:
s3://bucket/dataset/year=2018/month=7/day=18/<data-files-here>
In that case, the pushdown predicate would work for the columns year, month and day only:
datasource = glueContext.create_dynamic_frame_from_catalog(
    database=source_catalog_db,
    table_name=source_catalog_tbl,
    push_down_predicate="year = 2017 and month > 6 and day between 3 and 10",
    transformation_ctx="datasource")
Besides that, keep in mind that pushdown predicates work with S3 data sources only.
Here is a nice blog post written by AWS Glue devs about data partitioning.
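Since the predicate is ignored for a JDBC source such as the MySQL table in the question, one workaround (a sketch, not the only option) is to load the frame and filter it afterwards with Glue's Filter transform; the variable names come from the question:

from awsglue.transforms import Filter

# Load the MySQL-backed catalog table, then filter in Spark instead of at the source.
datasource = glueContext.create_dynamic_frame_from_catalog(
    database=source_catalog_db,
    table_name=source_catalog_tbl,
    transformation_ctx="datasource")

# Keep only the rows the push_down_predicate was meant to select.
filtered = Filter.apply(
    frame=datasource,
    f=lambda row: row["id"] > 1531812324,
    transformation_ctx="filtered")

This still reads the whole table from MySQL; the filtering just happens in Spark rather than in the database.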
This is great! I was able to use it to obtain the last 30 days of data using my "dt" partition column:
datasource0 = glueContext.create_dynamic_frame.from_catalog(
    database="my_db",
    table_name="my_table",
    push_down_predicate="to_date(dt) >= date_sub(current_date, 30)",
    transformation_ctx="datasource0"
)
I'm using Glue 1.0 - Spark 2.4 - Python 2.
I am new to Deedle, and I can't find how to solve my problem in the documentation.
I bind an SQL Table to a Deedle Frame using the following code:
namespace teste

open FSharp.Data.Sql
open Deedle
open System.Linq

module DatabaseService =

    [<Literal>]
    let connectionString = "Data Source=*********;Initial Catalog=*******;Persist Security Info=True;User ID=sa;Password=****"

    type bd = SqlDataProvider<
                  ConnectionString = connectionString,
                  DatabaseVendor = Common.DatabaseProviderTypes.MSSQLSERVER >

    type Database() =

        static member contextDbo() =
            bd.GetDataContext().Dbo

        static member acAgregations() =
            Database.contextDbo().AcAgregations |> Frame.ofRecords

        static member acBusyHourDefinition() =
            Database.contextDbo().AcBusyHourDefinition
            |> Frame.ofRecords
            // (intended to keep only the "alternative_reference_table_scan" and "formula" columns)

        static member acBusyHourDefinitionFilterByTimeAgregationTipe(value: int) =
            Database.acBusyHourDefinition()
            |> Frame.getRows
These things run, but I can't understand the resulting data frame schema; to my surprise, it is not a representation of the table.
My question is: how can I access my database elements by rows instead of columns (columns being the Deedle default)? I tried what is shown in the documentation, but unfortunately the column names are not recognized, as they are in the CSV example on the Deedle website.
With Frame.ofRecords you can extract the table into a data frame and then operate on its rows or columns. In this case I have a very simple table. This is for SQL Server, but I assume MySQL will work the same. If you provide more details in your question, the solution can be narrowed down.
This is the table, indexed by ID, which is Int64:
You can work with the rows or the columns:
#if INTERACTIVE
#load @"..\..\FSLAB\packages\FsLab\FsLab.fsx"
#r "System.Data.Linq.dll"
#r "FSharp.Data.TypeProviders.dll"
#endif
//open FSharp.Data
//open System.Data.Linq
open Microsoft.FSharp.Data.TypeProviders
open Deedle
[<Literal>]
let connectionString1 = @"Data Source=(LocalDB)\MSSQLLocalDB;AttachDbFilename=C:\Users\userName\Documents\tes.sdf.mdf"
type dbSchema = SqlDataConnection<connectionString1>
let dbx = dbSchema.GetDataContext()
let table1 = dbx.Table_1
query { for row in table1 do
select row} |> Seq.takeWhile (fun x -> x.ID < 10L) |> Seq.toList
// check if we can connect to the DB.
let df = table1 |> Frame.ofRecords // pull the table into a df
let df = df.IndexRows<System.Int64>("ID") // if you need an index
df.GetRows(2L) // Get the second row, but this can be any kind of index/key
df.["Number"].GetSlice(Some 2L, Some 5L) // get the 2nd to 5th row from the Number column
Will get you the following output:
val it : Series<System.Int64,float> =
2 -> 2
>
val it : Series<System.Int64,float> =
2 -> 2
3 -> 3
4 -> 4
5 -> 5
Depending on what you're trying to do Selecting Specific Rows in Deedle might also work.
Edit
From your comment you appear to be working with a large table. Depending on how much memory you have and how large the table is, you still might be able to load it. If not, here are some things you can do, in increasing complexity:
Use a query { } expression like above to narrow the dataset on the database server and convert just part of the result into a dataframe. You can do quite complex transformations so you might not even need the dataframe in the end. This is basically Linq2Sql.
Use lazy loading in Deedle. This works with series so you can get a few series and reassemble a dataframe.
Use Big Deedle which is designed for this sort of thing.
I have a very large MySQL table (billions of rows, with dozens of columns) I would like to convert into a ColumnFamily in Cassandra. I'm using Hector.
I first create my schema as follows:
String clusterName = "Test Cluster";
String host = "cassandra.lanhost.com:9160";
String newKeyspaceName = "KeyspaceName";
String newColumnFamilyName = "CFName";
ThriftCluster cassandraCluster;
CassandraHostConfigurator cassandraHostConfigurator;
cassandraHostConfigurator = new CassandraHostConfigurator(host);
cassandraCluster = new ThriftCluster(clusterName, cassandraHostConfigurator);
BasicColumnFamilyDefinition columnFamilyDefinition = new BasicColumnFamilyDefinition();
columnFamilyDefinition.setKeyspaceName(newKeyspaceName);
columnFamilyDefinition.setName(newColumnFamilyName);
columnFamilyDefinition.setDefaultValidationClass("UTF8Type");
columnFamilyDefinition.setKeyValidationClass(ComparatorType.UTF8TYPE.getClassName());
columnFamilyDefinition.setComparatorType(ComparatorType.UTF8TYPE);
BasicColumnDefinition columnDefinition = new BasicColumnDefinition();
columnDefinition.setName(StringSerializer.get().toByteBuffer("id"));
columnDefinition.setIndexType(ColumnIndexType.KEYS);
columnDefinition.setValidationClass(ComparatorType.INTEGERTYPE.getClassName());
columnDefinition.setIndexName("id_index");
columnFamilyDefinition.addColumnDefinition(columnDefinition);
columnDefinition = new BasicColumnDefinition();
columnDefinition.setName(StringSerializer.get().toByteBuffer("status"));
columnDefinition.setIndexType(ColumnIndexType.KEYS);
columnDefinition.setValidationClass(ComparatorType.ASCIITYPE.getClassName());
columnDefinition.setIndexName("status_index");
columnFamilyDefinition.addColumnDefinition(columnDefinition);
.......
ColumnFamilyDefinition cfDef = new ThriftCfDef(columnFamilyDefinition);
KeyspaceDefinition keyspaceDefinition =
HFactory.createKeyspaceDefinition(newKeyspaceName, "org.apache.cassandra.locator.SimpleStrategy", 1, Arrays.asList(cfDef));
cassandraCluster.addKeyspace(keyspaceDefinition);
Once that is done, I load my data, stored in a List since I'm fetching the MySQL data with a NamedParameterJdbcTemplate, as follows:
String clusterName = "Test Cluster";
String host = "cassandra.lanhost.com:9160";
String KeyspaceName = "KeyspaceName";
String ColumnFamilyName = "CFName";
final StringSerializer serializer = StringSerializer.get();
public void insert(List<SqlParameterSource> dataToInsert) throws ExceptionParserInterrupted {
    Keyspace workingKeyspace = null;
    Cluster cassandraCluster = HFactory.getOrCreateCluster(clusterName, host);
    workingKeyspace = HFactory.createKeyspace(KeyspaceName, cassandraCluster);
    Mutator<String> mutator = HFactory.createMutator(workingKeyspace, serializer);
    ColumnFamilyTemplate<String, String> template = new ThriftColumnFamilyTemplate<String, String>(workingKeyspace, ColumnFamilyName, serializer, serializer);

    long t1 = System.currentTimeMillis();
    for (SqlParameterSource data : dataToInsert) {
        String keyId = "id" + (Integer) data.getValue("id");
        mutator.addInsertion(keyId, ColumnFamilyName, HFactory.createColumn("id", (Integer) data.getValue("id"), StringSerializer.get(), IntegerSerializer.get()));
        mutator.addInsertion(keyId, ColumnFamilyName, HFactory.createStringColumn("status", data.getValue("status").toString()));
        ...............
    }
    mutator.execute();
    // Elapsed time in milliseconds.
    System.out.println(System.currentTimeMillis() - t1);
}
I'm inserting 100,000 rows in approximately 1 hour, which is really slow. I heard about multi-threading my inserts, but in this particular case I don't know what to do. Should I use BatchMutate?
Yes, you should run your insertion code from multiple threads. Take a look at the following stress-testing code for an example of how to do this efficiently with Hector:
https://github.com/zznate/cassandra-stress
An additional source of your insert performance issue may be the number of secondary indexes you are applying on the column family (each secondary index creates an additional column family 'under the hood').
Correctly designed data models should not really need a large number of secondary indexes. The following article provides a good overview of data modeling in Cassandra:
http://www.datastax.com/docs/1.0/ddl/index
There is one alternate way of achieving this. You can try exploring https://github.com/impetus-opensource/Kundera. You would love it.
Kundera is a JPA 2.0 compliant Object-Datastore Mapping Library for NoSQL Datastores and currently supports Cassandra, HBase, MongoDB and all relational datastores (Kundera internally uses Hibernate for all relational datastores).
In your case you can use your existing objects along with JPA annotations to store them in Cassandra. Since Kundera supports polyglot persistence, you can also use a MySQL + Cassandra combination, where you use MySQL for most of your data and Cassandra for transactional data. And since all you need to care about is objects and JPA annotations, your job would be much easier.
For performance you can have a look at https://github.com/impetus-opensource/Kundera/wiki/Kundera-Performance