Slick MySQL streaming to avoid GC and OOM issues

While querying records from the DB for a specified date range I am running into GC issues because the number of returned records is very large. Being new to Slick, I am not familiar with streaming. Could someone help translate the method below into streaming logic?
val res = query.filter { row =>
  (row.category === ServiceConstants.CATEGORY_TYPE.name) &&
  (row.ftrxDate >= trxDateLowerLimit && row.ftrxDate <= trxDateUpperLimit)
}.result
db.run(res)

You can find information on how to stream data from the database in the manual:
https://scala-slick.org/doc/3.3.2/dbio.html#streaming
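Roughly, your query translates to a streaming action like this. This is a minimal sketch assuming Slick 3.x with the MySQL profile and the same query, db, trxDateLowerLimit and trxDateUpperLimit as in your snippet; process is just a placeholder for whatever you do with each row. Note that the MySQL JDBC driver only streams when the statement is forward-only, read-only and the fetch size is Int.MinValue; otherwise it buffers the whole result set in memory anyway.

import slick.jdbc.MySQLProfile.api._
import slick.jdbc.{ResultSetConcurrency, ResultSetType}

val action = query
  .filter { row =>
    row.category === ServiceConstants.CATEGORY_TYPE.name &&
    row.ftrxDate >= trxDateLowerLimit && row.ftrxDate <= trxDateUpperLimit
  }
  .result
  .withStatementParameters(
    rsType = ResultSetType.ForwardOnly,
    rsConcurrency = ResultSetConcurrency.ReadOnly,
    fetchSize = Int.MinValue)
  .transactionally

// db.stream returns a Reactive Streams publisher; rows arrive one at a time
// instead of being materialised in memory. It can also be wrapped in an
// Akka Streams Source via Source.fromPublisher for back-pressured processing.
val publisher = db.stream(action)
publisher.foreach(row => process(row)) // `process` is a hypothetical per-row handler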

Related

Why does my Spark Streaming program process so slowly?

I am currently writing a Spark Streaming job. My task is pretty simple: receive JSON messages from Kafka and do some text filtering (contains TEXT1, TEXT2, TEXT3, TEXT4). The code looks like:
val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, topics)

messages.foreachRDD { rdd =>
  val originrdd = rdd.count()
  val record = rdd.map(_._2)
    .filter(x => x.contains(TEXT1))
    .filter(x => x.contains(TEXT2))
    .filter(x => x.contains(TEXT3))
    .filter(x => x.contains(TEXT4))
  val afterrdd = record.count()
  println("original number of record: ", originrdd)
  println("after filtering number of records:", afterrdd)
}
Each JSON message is about 4 KB, and around 50,000 records arrive from Kafka every second.
The processing time for the above task is about 3 seconds per batch, so it cannot achieve real-time performance. I have a Storm topology doing the same task, and it performs much faster.
Well, you have created three unnecessary RDDs in this process; the four chained filters can be collapsed into one:
val record = rdd.map(_._2).filter(x => {
  x.contains(TEXT1) &&
  x.contains(TEXT2) &&
  x.contains(TEXT3) &&
  x.contains(TEXT4)
})
Also worth reading: https://spark.apache.org/docs/latest/streaming-programming-guide.html#performance-tuning
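For reference, here is the whole foreachRDD body with the single combined filter, assuming the same messages DStream and TEXT1-TEXT4 values as above. Counting the mapped RDD instead of the raw one gives the same number, and caching it lets the second count reuse the already-extracted payloads instead of recomputing them:

messages.foreachRDD { rdd =>
  val values = rdd.map(_._2).cache()         // extract the JSON payloads once and cache them
  val filtered = values.filter { x =>
    x.contains(TEXT1) && x.contains(TEXT2) &&
    x.contains(TEXT3) && x.contains(TEXT4)   // one pass instead of four chained filters
  }
  println("original number of records: " + values.count())
  println("after filtering number of records: " + filtered.count())
  values.unpersist()
}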

MySQL data processing in Spark

I have a requirement where I need to fetch data every 5 minutes from multiple source systems (MySQL instances) and join and enrich it with some other data (say, present in S3).
I want to do this processing in Spark to distribute the execution across multiple executors.
The main problem is that every time I do a lookup in MySQL, I only want to fetch the latest records (say, with lastModifiedOn > timestamp).
How can this selective fetch of MySQL rows happen effectively?
This is what I have tried:
val filmDf = sqlContext.read.format("jdbc")
  .option("url", "jdbc:mysql://localhost/sakila")
  .option("driver", "com.mysql.jdbc.Driver")
  .option("dbtable", "film")
  .option("user", "root")
  .option("password", "")
  .load()
You should use Spark SQL with the JDBC data source. Here is an example:
val res = spark.read.jdbc(
  url = "jdbc:mysql://localhost/test?user=minty&password=greatsqldb",
  table = "TEST.table",
  columnName = "lastModifiedOn",
  lowerBound = lowerTimestamp,
  upperBound = upperTimestamp,
  numPartitions = 20,
  connectionProperties = new Properties()
)
There are more examples in the Apache Spark test suite: https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCSuite.scala
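One caveat: with that overload, lowerBound and upperBound only decide how the reads are split into partitions on the lastModifiedOn column; they do not filter rows, so all rows in the table are still returned. To fetch only the latest records you have to push the predicate down yourself, for example by passing a subquery as the table. A minimal sketch, assuming a SparkSession named spark and a lastRunTimestamp watermark kept by the job:

import java.util.Properties

// Only rows modified since the last run are read; MySQL evaluates the WHERE clause.
val latestRows = spark.read.jdbc(
  url = "jdbc:mysql://localhost/test?user=minty&password=greatsqldb",
  table = s"(SELECT * FROM TEST.table WHERE lastModifiedOn > '$lastRunTimestamp') AS t",
  properties = new Properties()
)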

Very slow running as_json (Mongoid + Sinatra)

I'm using Sinatra (1.3.2) with Mongoid (2.4.10). I'm noticing that it is taking a VERY long time to convert about 350 mongo documents to JSON.
I added a few benchmark wrappers just to see what is taking the most time:
get '/games' do
  content_type :text
  obj = nil
  t1 = Benchmark.measure { @games = filtered_games.entries }
  t2 = Benchmark.measure { obj = @games.as_json }
  t3 = Benchmark.measure { obj.to_json }
  "Query: #{t1}\nTo Object: #{t2}\nJSON: #{t3}"
end
(filtered_games just returns the results of a Mongoid query using parameters passed in the URL)
This is a typical response:
Query: 0.100000 0.000000 0.100000 ( 0.234351)
To Object: 3.560000 0.010000 3.570000 ( 3.569813)
JSON: 0.220000 0.000000 0.220000 ( 0.217941)
So, it looks like it's spending the majority of its time just converting the Mongoid objects into a basic JSON structure (as_json) (over 3.5 seconds), NOT converting that structure into a JSON string.
The documents aren't terribly large (about 450 bytes, 15-20 fields per document).
I suppose what really confuses me is that the time it takes to perform the actual query against MongoDB, parse the response, and deserialize it into Mongoid objects is MUCH faster.
Why is this? Any suggestions on how I can further optimize this? I suppose I could just use native calls to Mongo and return those results, but I'd like to be able to continue to use the scopes I've defined in Mongoid.
EDIT: I previously was not actually running the query in the first benchmark, because Mongoid lazily defers it until the as_json call.
So, as it turns out, rolling back to a previous version of Mongoid fixed this issue. I'm guessing that this is because it pulls in an earlier version of Active Model or Active Support.
Mongo: 1.4.0
Mongoid: 2.3.5
Active Model: 3.1.6
Active Support: 3.1.6
These are the new benchmark results:
Query: 0.110000 0.010000 0.120000 ( 0.243558)
To Object: 0.200000 0.000000 0.200000 ( 0.196342)
JSON: 0.440000 0.000000 0.440000 ( 0.444311)
If I get a chance to dig down into the code, I'll try to come back and update with anything I find.

Use of custom expression in LINQ leads to a query for each use

I have the following problem: In our database we record helpdesk tickets and we book hours under tickets. Between those is a visit report. So it is: ticket => visitreport => hours.
Hours have a certain 'kind' which is not determined by a type indicator in the hour record, but derived by checking various properties of an hour. For example, an hour which has a customer but is not a service hour is always an invoice hour.
The last thing I want is for the definitions of those 'kinds' to be scattered all over the code; they must live in one place. Second, I want to be able to calculate totals of hours from various collections of hours. For example, a flattened collection of tickets with a certain date and a certain customer, or all registrations which are marked as 'solution'.
I have decided to use a 'layered' database access approach. The same functions may provide data for screen representation but also for a PDF report. So the first step gathers all relevant data. That can be used for PDF creation, but also for screen representation; in that case, it must be paged and ordered in a second step. That way I don't need separate queries which basically use the same data.
The amount of data may be large, like the creation of year totals. So the data from the first step should be queryable, not enumerable. To ensure I stay queryable even when I add the summation of hours to the results, I made the following function:
public static decimal TreeHours(this IEnumerable<Uren> h, FactHourType ht)
{
    IQueryable<Uren> hours = h.AsQueryable();
    ParameterExpression pe = Expression.Parameter(typeof(Uren), "Uren");

    Expression left = Expression.Property(pe, typeof(Uren).GetProperty("IsOsab"));
    Expression right = Expression.Constant(true, typeof(Boolean));
    Expression isOsab = Expression.Equal(Expression.Convert(left, typeof(Boolean)), Expression.Convert(right, typeof(Boolean)));

    left = Expression.Property(pe, typeof(Uren).GetProperty("IsKlant"));
    right = Expression.Constant(true, typeof(Boolean));
    Expression isCustomer = Expression.Equal(Expression.Convert(left, typeof(Boolean)), Expression.Convert(right, typeof(Boolean)));

    Expression notOsab;
    Expression notCustomer;
    Expression final;

    switch (ht)
    {
        case FactHourType.Invoice:
            notOsab = Expression.Not(isOsab);
            final = Expression.And(notOsab, isCustomer);
            break;
        case FactHourType.NotInvoice:
            notOsab = Expression.Not(isOsab);
            notCustomer = Expression.Not(isCustomer);
            final = Expression.And(notOsab, notCustomer);
            break;
        case FactHourType.OSAB:
            final = Expression.And(isOsab, isCustomer);
            break;
        case FactHourType.OsabInvoice:
            final = Expression.Equal(isCustomer, Expression.Constant(true, typeof(Boolean)));
            break;
        case FactHourType.Total:
            final = Expression.Constant(true, typeof(Boolean));
            break;
        default:
            throw new Exception("");
    }

    MethodCallExpression whereCallExpression = Expression.Call(
        typeof(Queryable),
        "Where",
        new Type[] { hours.ElementType },
        hours.Expression,
        Expression.Lambda<Func<Uren, bool>>(final, new ParameterExpression[] { pe })
    );

    IQueryable<Uren> result = hours.Provider.CreateQuery<Uren>(whereCallExpression);
    return result.Sum(u => u.Uren1);
}
The idea behind this function is that it should remain queryable, so that I don't pull a shipload of data into memory as an enumerable.
I managed to stay queryable until the end. In step 1 I gather the raw data. In step 2 I order the data and subsequently I page it. In step 3 the data is converted to JSon and sent to the client. It totals hours by ticket.
The problem is: I get one query for the hours for each ticket. That's hundreds of queries! That's too much...
I tried the following approach:
DataLoadOptions options = new DataLoadOptions();
options.LoadWith<Ticket>(t => t.Bezoekrapport);
options.LoadWith<Bezoekrapport>(b => b.Urens);
dc.LoadOptions = options;
Bezoekrapport is simply Dutch for 'visitreport'. When I look at the query which retrieves the tickets, I see it joins the Bezoekrapport/visitreport but not the hours which are attached to it.
A second approach I have tried is manually joining the hours in LINQ, but that does not work either.
I must do something wrong. What is the best approach here?
The following code snippets show how I retrieve the data. Upon calling ToList() on strHours in the last step, I get a hailstorm of queries. I've been trying for two days to work around it but it just doesn't work... Something must be wrong in my approach or in the TreeHours function.
Step 1:
IQueryable<RelationHoursTicketItem> HoursByTicket =
    from Ticket t in allTickets
    let RemarkSolved = t.TicketOpmerkings.SingleOrDefault(tr => tr.IsOplossing)
    let hours = t.Bezoekrapport.Urens
        .Where(h =>
            (dateFrom == null || h.Datum >= dateFrom)
            && (dateTo == null || h.Datum <= dateTo)
            && h.Uren1 > 0)
    select new RelationHoursTicketItem
    {
        Date = t.DatumCreatie,
        DateSolved = RemarkSolved == null ? (DateTime?)null : RemarkSolved.Datum,
        Ticket = t,
        Relatie = t.Relatie,
        HoursOsab = hours.TreeHours(FactHourType.OSAB),
        HoursInvoice = hours.TreeHours(FactHourType.Invoice),
        HoursNonInvoice = hours.TreeHours(FactHourType.NotInvoice),
        HoursOsabInvoice = hours.TreeHours(FactHourType.OsabInvoice),
        TicketNr = t.Id,
        TicketName = t.Titel,
        TicketCategorie = t.TicketCategorie,
        TicketPriority = t.TicketPrioriteit,
        TicketRemark = RemarkSolved
    };
Step 2
sort = sort ?? "TicketNr";
IQueryable<RelationHoursTicketItem> hoursByTicket = GetRelationHours(relation, dateFrom, dateTo, withBranches);
IOrderedQueryable<RelationHoursTicketItem> orderedResults;

if (dir == "ASC")
{
    orderedResults = hoursByTicket.OrderBy(sort);
}
else
{
    orderedResults = hoursByTicket.OrderByDescending(sort);
}

IEnumerable<RelationHoursTicketItem> pagedResults = orderedResults.Skip(start ?? 0).Take(limit ?? 25);
records = hoursByTicket.Count();
return pagedResults;
Step 3:
IEnumerable<RelationHoursTicketItem> hours = _hourReportService.GetRelationReportHours(relation, dateFrom, dateTo, metFilialen, start, limit, dir, sort, out records);
var strHours = hours.Select(h => new
{
    h.TicketNr,
    h.TicketName,
    RelationName = h.Relatie.Naam,
    h.Date,
    TicketPriority = h.TicketPriority.Naam,
    h.DateSolved,
    TicketCategorie = h.TicketCategorie == null ? "" : h.TicketCategorie.Naam,
    TicketRemark = h.TicketRemark == null ? "" : h.TicketRemark.Opmerking,
    h.HoursOsab,
    h.HoursInvoice,
    h.HoursNonInvoice,
    h.HoursOsabInvoice
});
I don't think your TreeHours extension method can be converted to SQL by LINQ in one go, so the sums are evaluated as each result row is constructed, causing 4 calls to the database per row in this case.
I would simplify your LINQ query to return the raw data from SQL, using a simple JOIN to get all tickets and their hours, and then group and filter the hours by type in memory. Otherwise, if you really need to perform your operations in SQL, look at the CompiledQuery.Compile method. This should be able to avoid making a query per row. I'm not sure you'd get the switch in there, but you may be able to convert it using the ?: operator.

How can I use LINQ to SQL to efficiently get nested lists?

I have three tables that I would like to query: Streams, Entries, and FieldInstances.
I'm wanting to get a list of entries inside a stream. A Stream could be a blog or a page, etc., and an Entry is the actual instance of the stream, i.e. "stream:Page entry:Welcome" or "stream:blog entry:News about something".
The thing is, each entry has custom data fields associated with it through FieldInstance, e.g.:
stream: Page
entry: Welcome
fieldInstance: Welcome Image Path
I'm trying to figure out the best way to get a list of all entries inside of one stream and also have the custom field instances that are associated with each entry.
I've been playing around with code like this:
var stream = genesisRepository.Streams.First(x => x.StreamUrl == streamUrl);
IQueryable<StreamEntry> entry = genesisRepository.StreamEntries.Where(x => x.StreamID == stream.StreamID);
IQueryable<FieldInstance> fieldInstances = genesisRepository.FieldInstances.Where(
// doesn't work because entry is basically returning a collection of some kind.
// and i can't figure out how to compare a single ID with a list/collection of IDs
x => x.fiStreamEntryID == entry.Where(e => e.StreamID == stream.StreamID)
);
This of course doesn't work. Initially I was thinking to get all entries in the stream and then all fieldInstances in the stream, then display the data using lambdas after I have everything... hopefully keeping the SQL queries down to two or three. But I can't figure out how to write the LINQ to SQL to execute in just two or three queries. I keep thinking that I need to execute queries in a loop to get the fieldInstances for each entry.
Is there a LINQ to SQL query that will select all FieldInstances where the StreamEntryID (FK) is in the list of entries whose StreamID (FK) matches the Stream?
Sorry, I don't quite get what columns you are joining in the third statement, but:
var stream = genesisRepository.Streams.First(x => x.StreamUrl == streamUrl);
IQueryable<StreamEntry> entries = genesisRepository.StreamEntries.Where(x => x.StreamID == stream.StreamID);
IQueryable<FieldInstance> fieldInstances = from entry in entries
from instance in genesisRepository.FieldInstances
where entry.entryId == instance.fiStreamEntryID
select instance;
Does this help? I'm still new to EF but that's how I'd have a crack at it...
var results = genesisRepository.FieldInstances
.Include("StreamEntry.Stream")
.Where(fi => fi.StreamEntry.Stream.StreamUrl == streamUrl);