AWS Glue predicate push down condition has no effect - mysql

I have a MySQL source from which I am creating a Glue DynamicFrame with a push-down predicate, as follows:
datasource = glueContext.create_dynamic_frame_from_catalog(
    database = source_catalog_db,
    table_name = source_catalog_tbl,
    push_down_predicate = "id > 1531812324",
    transformation_ctx = "datasource")
I always get all the records in 'datasource', whatever condition I put in 'push_down_predicate'.
What am I missing?

Pushdown predicates work for partition columns only. In other words, your data files must be placed in hierarchically structured folders. For example, if the data is located in s3://bucket/dataset/ and partitioned by year, month and day, then the structure should be as follows:
s3://bucket/dataset/year=2018/month=7/day=18/<data-files-here>
In that case the pushdown predicate works for the columns year, month and day only:
datasource = glueContext.create_dynamic_frame_from_catalog(
    database = source_catalog_db,
    table_name = source_catalog_tbl,
    push_down_predicate = "year = 2017 and month > 6 and day between 3 and 10",
    transformation_ctx = "datasource")
Also keep in mind that pushdown predicates work with S3 data sources only.
Here is a nice blog post written by AWS Glue devs about data partitioning.
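To illustrate why the predicate only applies to partition columns, here is a minimal plain-Python sketch of partition pruning. It is not Glue code: the object keys, PARTITION_RE, partition_values and prune are hypothetical names made up for this example. The point is that files are selected from the folder names alone, without opening any data file, so a column that lives only inside the files (like id in the question) cannot be used.

```python
import re

# Hypothetical object keys laid out in Hive-style partition folders.
KEYS = [
    "dataset/year=2017/month=5/day=4/part-0000.parquet",
    "dataset/year=2017/month=7/day=5/part-0000.parquet",
    "dataset/year=2017/month=8/day=12/part-0000.parquet",
    "dataset/year=2018/month=7/day=18/part-0000.parquet",
]

# Matches the "column=value" folder names in a key.
PARTITION_RE = re.compile(r"(\w+)=(\d+)")


def partition_values(key):
    """Extract {column: int} from the partition folders of an object key."""
    return {name: int(value) for name, value in PARTITION_RE.findall(key)}


def prune(keys, predicate):
    """Keep only the keys whose partition values satisfy the predicate.

    The decision is made from the path alone; no data file is read.
    """
    return [key for key in keys if predicate(partition_values(key))]


# Same condition as "year = 2017 and month > 6 and day between 3 and 10".
selected = prune(
    KEYS,
    lambda p: p["year"] == 2017 and p["month"] > 6 and 3 <= p["day"] <= 10,
)
print(selected)
```

Only the key under year=2017/month=7/day=5 survives the pruning step in this sample.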

This is great! I was able to use it to obtain the last 30 days of data using my "dt" partition column:
datasource0 = glueContext.create_dynamic_frame.from_catalog(
    database = "my_db",
    table_name = "my_table",
    push_down_predicate = "to_date(dt) >= date_sub(current_date, 30)",
    transformation_ctx = "datasource0"
)
I'm using Glue 1.0 - Spark 2.4 - Python 2.

Related

slick mysql streaming to avoid GC and OOM issues

While querying records from the DB for a specified date range, I run into GC issues because the number of returned records is very large. Being new to Slick, I am not familiar with streaming. Could someone help translate the method below into streaming logic?
val res = query.filter { row =>
  (row.category === ServiceConstants.CATEGORY_TYPE.name) &&
  (row.ftrxDate >= trxDateLowerLimit && row.ftrxDate <= trxDateUpperLimit)
}.result
db.run(res)
You can find information on how to stream data from the database in the manual:
https://scala-slick.org/doc/3.3.2/dbio.html#streaming

How to obtain and process mysql records using Airflow?

I need to:
1. Run a SELECT query on a MySQL DB and fetch the records.
2. Process the records with a Python script.
I am unsure how to proceed. Is XCom the way to go here? Also, MySqlOperator only executes the query; it doesn't fetch the records. Is there a built-in transfer operator I can use? How can I use a MySQL hook here?
you may want to use a PythonOperator that uses the hook to get the data,
apply transformation and ship the (now scored) rows back some other place.
Can someone explain how to proceed with this?
Refer - http://markmail.org/message/x6nfeo6zhjfeakfe
def do_work():
    mysqlserver = MySqlHook(connection_id)
    sql = "SELECT * from table where col > 100 "
    row_count = mysqlserver.get_records(sql, schema='testdb')
    print(row_count[0][0])

# python_callable must reference the function defined above
callMYSQLHook = PythonOperator(
    task_id='fetch_from_testdb',
    python_callable=do_work,
    dag=dag
)
Is this the correct way to proceed?
Also, how do we use XComs to store the records for the following MySqlOperator?
t = MySqlOperator(
    conn_id='mysql_default',
    task_id='basic_mysql',
    sql="SELECT count(*) from table1 where id > 10",
    dag=dag)
I struggled with this for the past 90 minutes; here is a more declarative way for newcomers to follow:
from airflow.hooks.mysql_hook import MySqlHook

def fetch_records():
    request = "SELECT * FROM your_table"
    mysql_hook = MySqlHook(mysql_conn_id = 'the_connection_name_sourced_from_the_ui', schema = 'specific_db')
    connection = mysql_hook.get_conn()
    cursor = connection.cursor()
    cursor.execute(request)
    sources = cursor.fetchall()
    print(sources)

...your DAG() as dag: code
task = PythonOperator(
    task_id = 'fetch_records',
    python_callable = fetch_records
)
This prints the results of your DB query to the task log.
I hope this is of use to someone else.
Sure, just create a hook or operator and call the get_records() method: https://airflow.apache.org/docs/apache-airflow/stable/_modules/airflow/hooks/dbapi.html
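The fetch-then-process pattern discussed above can be sketched without an Airflow installation. The example below is only an illustration under stated assumptions: an in-memory SQLite database stands in for the MySQL connection, and the items table, fetch_records and process_records are hypothetical names. The cursor calls are the same DB-API calls you would make on the connection returned by the hook's get_conn().

```python
import sqlite3


def fetch_records(conn, min_value):
    """Run a SELECT and return all rows as a list of tuples."""
    cursor = conn.cursor()
    cursor.execute(
        "SELECT id, col FROM items WHERE col > ? ORDER BY id", (min_value,)
    )
    rows = cursor.fetchall()
    cursor.close()
    return rows


def process_records(rows):
    """Hypothetical processing step: double the value of each row."""
    return [(row_id, value * 2) for row_id, value in rows]


# Assumption: in-memory SQLite replaces the real MySQL connection here.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, col INTEGER)")
conn.executemany(
    "INSERT INTO items (id, col) VALUES (?, ?)",
    [(1, 50), (2, 150), (3, 300)],
)
conn.commit()

rows = fetch_records(conn, 100)
processed = process_records(rows)
print(processed)
conn.close()
```

On the XCom part of the question: as far as I know, the value returned by a PythonOperator callable is pushed to XCom by default, so returning the rows from the callable makes them available to a downstream task. XCom is meant for small payloads, so for large result sets it is probably better to process the rows inside the same task.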

Mysql data processing in Spark

I have a requirement where I need to fetch data every 5 minutes from multiple source systems (MySQL instances) and join and enrich it with some other data (present in S3, let's say).
I want to do this processing in Spark to distribute my execution across multiple executors.
The main problem is that every time I do a lookup in MySQL, I only want to fetch the latest records (let's say with lastModifiedOn > timestamp).
How can this selective fetch of MySQL rows be done efficiently?
This is what I have tried:
val filmDf = sqlContext.read.format("jdbc")
  .option("url", "jdbc:mysql://localhost/sakila")
  .option("driver", "com.mysql.jdbc.Driver")
  .option("dbtable", "film")
  .option("user", "root")
  .option("password", "")
  .load()
You should use Spark SQL with the JDBC data source. Here is an example:
import java.util.Properties

val res = spark.read.jdbc(
  url = "jdbc:mysql://localhost/test?user=minty&password=greatsqldb",
  table = "TEST.table",
  columnName = "lastModifiedOn",
  lowerBound = lowerTimestamp,
  upperBound = upperTimestamp,
  numPartitions = 20,
  connectionProperties = new Properties()
)
There are more examples in Apache Spark test suite: https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCSuite.scala
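One caveat worth checking against the Spark JDBC documentation: as far as I know, lowerBound and upperBound only determine how the column range is split into partitions; they do not filter rows. To fetch only rows with lastModifiedOn > timestamp, a common approach is to pass a parenthesised subquery as the table, so the WHERE clause runs in MySQL. A minimal sketch that builds such a string; incremental_table and the alias recent_rows are hypothetical names for illustration.

```python
from datetime import datetime


def incremental_table(table, column, since, alias="recent_rows"):
    """Build a derived-table expression selecting rows newer than `since`.

    Table and column names are interpolated as-is, so only pass trusted
    identifiers (never user input).
    """
    cutoff = since.strftime("%Y-%m-%d %H:%M:%S")
    return f"(SELECT * FROM {table} WHERE {column} > '{cutoff}') AS {alias}"


dbtable = incremental_table(
    "film", "lastModifiedOn", datetime(2018, 7, 18, 12, 0, 0)
)
print(dbtable)
```

The resulting string would be passed where the table name normally goes, for example as the "dbtable" option of the JDBC reader, and the timestamp of the last successful run would be stored somewhere between runs.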

Optimize Linq query with PredicateBuilder with N-N join

I'm using Linq to query MS CRM 2011 Web Services. I've got a query that results in very poor SQL, it fetches too much intermediary data and its performance is horrible!! I'm new to it, so it may very well be the way I'm using it...
I've got two entities linked via an N-N relationship: Product and SalesLink. I want to recover a bunch of Product from their SerialNumber along with all SalesLink associated to them.
This is the query I have using PredicateBuilder:
// Build inner OR predicate on Serial Number list
var innerPredicate = PredicateBuilder.False<Xrm.c_product>();
foreach (string sn in serialNumbers) {
    string temp = sn; // This temp assignment is important!
    innerPredicate = innerPredicate.Or(p => p.c_SerialNumber == temp);
}
// Combine predicate with outer AND predicate
var predicate = PredicateBuilder.True<Xrm.c_product>();
predicate = predicate.And(innerPredicate);
predicate = predicate.And(p => p.statecode == (int)CrmStateValueType.Active);
// Inner Join Query
var prodAndLinks = from p in orgContext.CreateQuery<Xrm.c_product>().AsExpandable()
.Where(predicate)
.AsEnumerable()
join link in orgContext.CreateQuery<Xrm.c_saleslink>()
on p.Id equals link.c_ProductSalesLinkId.Id
where link.statecode == (int)CrmStateValueType.Active
select new {
productId = p.Id
, productSerialNumber = p.c_SerialNumber
, accountId = link.c_Account.Id
, accountName = link.c_Account.Name
};
...
Using SQL profiler, I saw that it causes an intermediate SQL query that has no WHERE clause, looking like this:
select
top 5001 "c_saleslink0".statecode as "statecode"
...
, "c_saleslink0".ModifiedOnBehalfByName as "modifiedonbehalfbyname"
, "c_saleslink0".ModifiedOnBehalfByYomiName as "modifiedonbehalfbyyominame"
from
c_saleslink as "c_saleslink0" order by
"c_saleslink0".c_saleslinkId asc
This returns a huge amount of (useless) data. I think the join is done on the client side instead of on the DB side...
How should I improve this query? It runs in around 3 minutes and that's totally unacceptable.
Thanks.
"Solution"
Based on Daryl's answer to use QueryExpression instead of Linq to CRM, I got this which gets the exact same result.
var qe = new QueryExpression("c_product");
qe.ColumnSet = new ColumnSet("c_serialnumber");
var filter = qe.Criteria.AddFilter(LogicalOperator.Or);
filter.AddCondition("c_serialnumber", ConditionOperator.In, serialNumbers.ToArray());
var link = qe.AddLink("c_saleslink", "c_productid", "c_productsaleslinkid");
link.LinkCriteria.AddCondition("statecode", ConditionOperator.Equal, (int)CrmStateValueType.Active);
link.Columns.AddColumns("c_account");
var entities = serviceProxy.RetrieveMultiple(qe).Entities.ToList();
var prodAndLinks = entities.Select(x => x.ToEntity<Xrm.c_product>()).Select(x =>
new {
productId = x.c_productId
, productSerialNumber = x.c_SerialNumber
, accountId = ((Microsoft.Xrm.Sdk.EntityReference)((Microsoft.Xrm.Sdk.AliasedValue)x["c_saleslink1.c_account"]).Value).Id
, accountName = ((Microsoft.Xrm.Sdk.EntityReference)((Microsoft.Xrm.Sdk.AliasedValue)x["c_saleslink1.c_account"]).Value).Name
}).ToList();
I really would have liked to find a solution using Linq, but it seems that Linq to CRM is just not there yet...
95% of the time when you're having performance issues with a complicated query in CRM, the easiest way to improve the performance is to run a straight SQL query against the database (assuming this is not CRM online of course). This may be one of the 5% of the time.
In your case, the major performance issue you're experiencing is due to the predicate builder forcing a join of the data on the CRM server side (not in the SQL database). If you used a Query Expression (which is what your Linq statement gets translated to), you could specify a Condition Expression with an IN operator that would allow you to pass in your serialNumbers collection. You could also use FetchXml. Both of these methods would allow CRM to perform a SQL-side join.
Edit:
This should get you 80% of the way with Query Expressions:
IOrganizationService service = GetService();
var qe = new QueryExpression("c_product");
var filter = qe.Criteria.AddFilter(LogicalOperator.Or);
filter.AddCondition("c_serialnumber", ConditionOperator.In, serialNumbers.ToArray());
var link = qe.AddLink("c_saleslink", "c_productid", "c_productsaleslinkid");
link.LinkCriteria.AddCondition("statecode", ConditionOperator.Equal, (int)CrmStateValueType.Active);
link.Columns.AddColumns("c_Account");
var entities = service.RetrieveMultiple(qe).Entities.ToList();
You will probably find you can get better control by not using Linq to CRM. You could try:
FetchXml: an XML syntax, similar in approach to T-SQL (MSDN).
QueryExpression (MSDN).
Issuing a RetrieveRequest (blog).

Use of custom expression in LINQ leads to a query for each use

I have the following problem: In our database we record helpdesk tickets and we book hours under tickets. Between those is a visit report. So it is: ticket => visitreport => hours.
Hours have a certain 'kind' which is not determined by a type indicator in the hour record, but compiled by checking various properties of an hour. For example, an hour which has a customer but is not a service hour is always an invoice hour.
Last thing I want is that the definitions of those 'kinds' roam everywhere in the code. They must be at one place. Second, I want to be able to calculate totals of hours from various collections of hours. For example, a flattened collection of tickets with a certain date and a certain customer. Or all registrations which are marked as 'solution'.
I have decided to use a 'layered' database access approach. The same functions may provide data for screen representation but also for a report in .pdf . So the first step gathers all relevant data. That can be used for .pdf creation, but also for screen representation. In that case, it must be paged and ordered in a second step. That way I don't need separate queries which basically use the same data.
The amount of data may be large, like the creation of year totals. So the data from the first step should be queryable, not enumerable. To ensure I stay queryable even when I add the summation of hours in the results, I made the following function:
public static decimal TreeHours(this IEnumerable<Uren> h, FactHourType ht)
{
IQueryable<Uren> hours = h.AsQueryable();
ParameterExpression pe = Expression.Parameter(typeof(Uren), "Uren");
Expression left = Expression.Property(pe, typeof(Uren).GetProperty("IsOsab"));
Expression right = Expression.Constant(true, typeof(Boolean));
Expression isOsab = Expression.Equal(Expression.Convert(left, typeof(Boolean)), Expression.Convert(right, typeof(Boolean)));
left = Expression.Property(pe, typeof(Uren).GetProperty("IsKlant"));
right = Expression.Constant(true, typeof(Boolean));
Expression isCustomer = Expression.Equal(Expression.Convert(left, typeof(Boolean)), Expression.Convert(right, typeof(Boolean)));
Expression notOsab;
Expression notCustomer;
Expression final;
switch (ht)
{
case FactHourType.Invoice:
notOsab = Expression.Not(isOsab);
final = Expression.And(notOsab, isCustomer);
break;
case FactHourType.NotInvoice:
notOsab = Expression.Not(isOsab);
notCustomer = Expression.Not(isCustomer);
final = Expression.And(notOsab, notCustomer);
break;
case FactHourType.OSAB:
final = Expression.And(isOsab, isCustomer);
break;
case FactHourType.OsabInvoice:
final = Expression.Equal(isCustomer, Expression.Constant(true, typeof(Boolean)));
break;
case FactHourType.Total:
final = Expression.Constant(true, typeof(Boolean));
break;
default:
throw new Exception("");
}
MethodCallExpression whereCallExpression = Expression.Call(
typeof(Queryable),
"Where",
new Type[] { hours.ElementType },
hours.Expression,
Expression.Lambda<Func<Uren, bool>>(final, new ParameterExpression[] { pe })
);
IQueryable<Uren> result = hours.Provider.CreateQuery<Uren>(whereCallExpression);
return result.Sum(u => u.Uren1);
}
The idea behind this function is that it should remain queryable so that I don't switch a shipload of data to enumerable.
I managed to stay queryable until the end. In step 1 I gather the raw data. In step 2 I order the data and subsequently page it. In step 3 the data is converted to JSON and sent to the client. It totals hours by ticket.
The problem is: I get one query for the hours for each ticket. That's hundreds of queries! That's too much...
I tried the following approach:
DataLoadOptions options = new DataLoadOptions();
options.LoadWith<Ticket>(t => t.Bezoekrapport);
options.LoadWith<Bezoekrapport>(b => b.Urens);
dc.LoadOptions = options;
Bezoekrapport is simply Dutch for 'visitreport'. When I look at the query which retrieves the tickets, I see it joins the Bezoekrapport/visitreport but not the hours which are attached to it.
A second approach I have tried is manually joining the hours in LINQ, but that does not work either.
I must do something wrong. What is the best approach here?
The following code snippets are how I retrieve the data. Upon calling toList() on strHours in the last step, I get a hailstorm of queries. I've been trying for two days to work around it but it just doesn't work... Something must be wrong in my approach or in the function TreeHours.
Step 1:
IQueryable<RelationHoursTicketItem> HoursByTicket =
from Ticket t in allTickets
let RemarkSolved = t.TicketOpmerkings.SingleOrDefault(tr => tr.IsOplossing)
let hours = t.Bezoekrapport.Urens.
Where(h =>
(dateFrom == null || h.Datum >= dateFrom)
&& (dateTo == null || h.Datum <= dateTo)
&& h.Uren1 > 0)
select new RelationHoursTicketItem
{
Date = t.DatumCreatie,
DateSolved = RemarkSolved == null ? (DateTime?)null : RemarkSolved.Datum,
Ticket = t,
Relatie = t.Relatie,
HoursOsab = hours.TreeHours(FactHourType.OSAB),
HoursInvoice = hours.TreeHours(FactHourType.Invoice),
HoursNonInvoice = hours.TreeHours(FactHourType.NotInvoice),
HoursOsabInvoice = hours.TreeHours(FactHourType.OsabInvoice),
TicketNr = t.Id,
TicketName = t.Titel,
TicketCategorie = t.TicketCategorie,
TicketPriority = t.TicketPrioriteit,
TicketRemark = RemarkSolved
};
Step 2
sort = sort ?? "TicketNr";
IQueryable<RelationHoursTicketItem> hoursByTicket = GetRelationHours(relation, dateFrom, dateTo, withBranches);
IOrderedQueryable<RelationHoursTicketItem> orderedResults;
if (dir == "ASC")
{
orderedResults = hoursByTicket.OrderBy(sort);
}
else
{
orderedResults = hoursByTicket.OrderByDescending(sort);
}
IEnumerable<RelationHoursTicketItem> pagedResults = orderedResults.Skip(start ?? 0).Take(limit ?? 25);
records = hoursByTicket.Count();
return pagedResults;
Step 3:
IEnumerable<RelationHoursTicketItem> hours = _hourReportService.GetRelationReportHours(relation, dateFrom, dateTo, metFilialen, start, limit, dir, sort, out records);
var strHours = hours.Select(h => new
{
h.TicketNr,
h.TicketName,
RelationName = h.Relatie.Naam,
h.Date,
TicketPriority = h.TicketPriority.Naam,
h.DateSolved,
TicketCategorie = h.TicketCategorie == null ? "" : h.TicketCategorie.Naam,
TicketRemark = h.TicketRemark == null ? "" : h.TicketRemark.Opmerking,
h.HoursOsab,
h.HoursInvoice,
h.HoursNonInvoice,
h.HoursOsabInvoice
});
I don't think your TreeHours extension method can be converted to SQL by LINQ in one go. So the calls are evaluated when each row is constructed, which in this case causes 4 calls to the database per row.
I would simplify your LINQ query to return the raw data from SQL, using a simple JOIN to get all tickets and their hours. I would then group and filter the hours by type in memory. Otherwise, if you really need to perform your operations in SQL, then look at the CompiledQuery.Compile method. This should be able to avoid making a query per row. I'm not sure you'd get the switch in there, but you may be able to convert it using the ?: operator.
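The in-memory approach can be sketched as follows. This is a language-neutral illustration written in Python rather than C#, and it assumes the joined query has already returned one flat row per hour record. The row keys (ticket_id, is_osab, is_klant, hours) and the names HOUR_KINDS and totals_by_ticket are hypothetical. The kind definitions mirror the switch in TreeHours as written in the question and stay in one place.

```python
from collections import defaultdict
from decimal import Decimal

# Single place that defines each hour kind, mirroring the TreeHours switch.
HOUR_KINDS = {
    "osab": lambda h: h["is_osab"] and h["is_klant"],
    "invoice": lambda h: not h["is_osab"] and h["is_klant"],
    "not_invoice": lambda h: not h["is_osab"] and not h["is_klant"],
    "osab_invoice": lambda h: h["is_klant"],
}


def totals_by_ticket(rows):
    """Sum hours per ticket and per kind in one pass over the joined rows."""
    totals = defaultdict(lambda: {kind: Decimal("0") for kind in HOUR_KINDS})
    for row in rows:
        for kind, matches in HOUR_KINDS.items():
            if matches(row):
                totals[row["ticket_id"]][kind] += row["hours"]
    return dict(totals)


# Hypothetical result of a single ticket/hours JOIN query.
rows = [
    {"ticket_id": 1, "is_osab": True, "is_klant": True, "hours": Decimal("1.5")},
    {"ticket_id": 1, "is_osab": False, "is_klant": True, "hours": Decimal("2.0")},
    {"ticket_id": 2, "is_osab": False, "is_klant": False, "hours": Decimal("0.5")},
]

totals = totals_by_ticket(rows)
print(totals)
```

With this shape the database is hit once for the joined rows, and the four totals per ticket are computed in a single pass in memory instead of one query per total per ticket.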