Very slow running as_json (Mongoid + Sinatra)

I'm using Sinatra (1.3.2) with Mongoid (2.4.10). I'm noticing that it is taking a VERY long time to convert about 350 mongo documents to JSON.
I added a few benchmark wrappers just to see what is taking the most time:
get '/games' do
  content_type :text
  obj = nil
  t1 = Benchmark.measure { @games = filtered_games.entries }
  t2 = Benchmark.measure { obj = @games.as_json }
  t3 = Benchmark.measure { obj.to_json }
  "Query: #{t1}\nTo Object: #{t2}\nJSON: #{t3}"
end
(filtered_games just returns the results of a Mongoid query using parameters passed in the URL)
This is a typical response:
Query: 0.100000 0.000000 0.100000 ( 0.234351)
To Object: 3.560000 0.010000 3.570000 ( 3.569813)
JSON: 0.220000 0.000000 0.220000 ( 0.217941)
So, it looks like it's spending the majority of its time just converting the Mongoid objects into a basic JSON structure (as_json), over 3.5 seconds, NOT converting that structure into a JSON string.
The documents aren't terribly large (about 450 bytes, 15-20 fields per document).
I suppose what really confuses me is that performing the actual query against MongoDB, parsing the response, and deserializing it into Mongoid objects is MUCH faster.
Why is this? Any suggestions on how I can further optimize this? I suppose I could just use native calls to Mongo and return those results, but I'd like to be able to continue to use the scopes I've defined in Mongoid.
EDIT: I previously was not actually running the query in the first benchmark; because of Mongoid's lazy loading, the query did not execute until the as_json call.
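For reference, a minimal illustration of that lazy loading (the Game model and its field here are hypothetical, not taken from the app above):

# Mongoid criteria are lazy: building the scope does not hit the database.
criteria = Game.where(:status => 'active')  # no query has run yet
games    = criteria.entries                 # .entries (or .to_a) executes the query now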

So, as it turns out, rolling back to a previous version of Mongoid fixed this issue. I'm guessing that this is because it pulls in an earlier version of Active Model or Active Support.
Mongo: 1.4.0
Mongoid: 2.3.5
Active Model: 3.1.6
Active Support: 3.1.6
These are the new benchmark results:
Query: 0.110000 0.010000 0.120000 ( 0.243558)
To Object: 0.200000 0.000000 0.200000 ( 0.196342)
JSON: 0.440000 0.000000 0.440000 ( 0.444311)
If I get a chance to dig down into the code, I'll try to come back and update with anything I find.

RDF4J SPARQL query to JSON

I am trying to move data from a SPARQL endpoint to a JSONObject. Using RDF4J.
RDF4J documentation does not address this directly (some info about using endpoints, less about converting to JSON, and nothing where these two cases meet up).
So far I have:
SPARQLRepository repo = new SPARQLRepository(<My Endpoint>);
Map<String, String> headers = new HashMap<String, String>();
headers.put("Accept", "SPARQL/JSON");
repo.setAdditionalHttpHeaders(headers);

try (RepositoryConnection conn = repo.getConnection())
{
    String queryString = "SELECT * WHERE {GRAPH <urn:x-evn-master:mwadata> {?s ?p ?o}}";
    GraphQuery query = conn.prepareGraphQuery(queryString);
    debug("Mark 2");
    try (GraphQueryResult result = query.evaluate())
This fails because "Server responded with an unsupported file format: application/sparql-results+json".
I figured a SPARQLGraphQuery should take the place of GraphQuery, but RepositoryConnection does not have a relevant prepare statement.
If I exchange
try (RepositoryConnection conn = repo.getConnection())
with
try (SPARQLConnection conn = (SPARQLConnection)repo.getConnection())
I run into the problem that SPARQLConnection does not generate a SPARQLGraphQuery. The closest I can get is:
SPARQLGraphQuery query = (SPARQLGraphQuery)conn.prepareQuery(QueryLanguage.SPARQL, queryString);
which gives a runtime error as these types cannot be cast to each other.
I do not know how to proceed from here. Any help or advice is much appreciated. Thank you.
Regarding the error "Server responded with an unsupported file format: application/sparql-results+json":
In RDF4J, SPARQL SELECT queries are tuple queries, so named because each result is a set of bindings, which are tuples of the form (name, value). In contrast, CONSTRUCT (and DESCRIBE) queries are graph queries, so called because their result is a graph, that is, a collection of RDF statements.
Furthermore, setting additional headers for the response format as you have done here is not necessary (except in rare circumstances); the RDF4J client handles this for you automatically, based on the registered set of parsers.
So, in short, simplify your code as follows:
SPARQLRepository repo = new SPARQLRepository(<My Endpoint>);
try (RepositoryConnection conn = repo.getConnection()) {
    String queryString = "SELECT * WHERE {GRAPH <urn:x-evn-master:mwadata> {?s ?p ?o}}";
    TupleQuery query = conn.prepareTupleQuery(queryString);
    debug("Mark 2");
    try (TupleQueryResult result = query.evaluate()) {
        ...
    }
}
If you want to write the result of the query in JSON format, you could use a TupleQueryResultHandler, for example the SPARQLResultsJSONWriter, as follows:
SPARQLRepository repo = new SPARQLRepository(<My Endpoint>);
try (RepositoryConnection conn = repo.getConnection()) {
    String queryString = "SELECT * WHERE {GRAPH <urn:x-evn-master:mwadata> {?s ?p ?o}}";
    TupleQuery query = conn.prepareTupleQuery(queryString);
    query.evaluate(new SPARQLResultsJSONWriter(System.out));
}
This will write the result of the query (in this example to standard output) using the SPARQL Query Results JSON format. If you have a non-standard format in mind, you could of course also create your own TupleQueryResultHandler implementation.
For more details on the various ways in which you can process the result (including iterating, streaming, adding to a List, or just directly sending to a result handler), see the documentation on querying a repository. As an aside, the javadoc on the RDF4J APIs is pretty extensive too, so if your Java editing environment has support for displaying that, I'd advise you to make use of it.
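For instance, here is a rough sketch of iterating over the bindings yourself instead of streaming to a writer, reusing repo and queryString from the snippets above (imports from org.eclipse.rdf4j.query and exception handling are assumed):

try (RepositoryConnection conn = repo.getConnection()) {
    TupleQuery query = conn.prepareTupleQuery(queryString);
    try (TupleQueryResult result = query.evaluate()) {
        while (result.hasNext()) {
            // each BindingSet is one query solution; values are looked up by variable name
            BindingSet solution = result.next();
            System.out.println(solution.getValue("s") + " "
                    + solution.getValue("p") + " "
                    + solution.getValue("o"));
        }
    }
}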

Inconsistent behaviour when attempting to write Dataframe to CSV in Apache Spark

I'm trying to output the optimal hyperparameters for a decision tree classifier I trained using Spark's MLlib to a csv file using Dataframes and spark-csv. Here's a snippet of my code:
// Split the data into training and test sets (10% held out for testing)
val Array(trainingData, testData) = assembledData.randomSplit(Array(0.9, 0.1))

// Define cross validation with a hyperparameter grid
val crossval = new CrossValidator()
  .setEstimator(classifier)
  .setEstimatorParamMaps(paramGrid)
  .setEvaluator(new BinaryClassificationEvaluator)
  .setNumFolds(10)

// Train model
val model = crossval.fit(trainingData)

// Find best hyperparameter combination and create an RDD
val bestModel = model.bestModel
val hyperparamList = new ListBuffer[(String, String)]()
bestModel.extractParamMap().toSeq.foreach(pair => {
  val hyperparam: Tuple2[String, String] = (pair.param.name, pair.value.toString)
  hyperparamList += hyperparam
})
val hyperparameters = sqlContext.sparkContext.parallelize(hyperparamList.toSeq)

// Print the best hyperparameters
println(bestModel.extractParamMap().toSeq.foreach(pair => {
  println(s"${pair.param.parent} ${pair.param.name}")
  println(pair.value)
}))

// Define csv path to output results
var csvPath: String = "/root/results/decision-tree"
val hyperparametersPath: String = csvPath + "/hyperparameters"
val hyperparametersFile: File = new File(hyperparametersPath)
val results = (hyperparameters, hyperparametersPath, hyperparametersFile)

// Convert RDD to Dataframe and write it as csv
val dfToSave = spark.createDataFrame(results._1.map(x => Row(x._1, x._2)))
dfToSave.write.format("csv").mode("overwrite").save(results._2)

// Stop spark session
spark.stop()
After finishing a Spark job, I can see the part-00*... and _SUCCESS files inside the path as expected. However, though there are 13 hyperparameters total in this case (confirmed by printing them on screen), cat-ing the csv files shows not every hyperparameter was written to csv:
user#master:~$ cat /root/results/decision-tree/hyperparameters/part*.csv
checkpointInterval,10
featuresCol,features
maxDepth,5
minInstancesPerNode,1
Also, the hyperparameters that do get written change with every execution. This is executed on an HDFS-based Spark cluster with 1 master and 3 workers that have exactly the same hardware. Could it be a race condition? If so, how can I solve it?
Thanks in advance.
I think I figured it out. I expected dfToSave.write.format("csv").save(path) to write everything to the master node, but since the tasks are distributed across all workers, each worker saves its part of the hyperparameters to a local CSV in its own filesystem. Because in my case the master node is also a worker, I can see its part of the hyperparameters. The "inconsistent behaviour" (i.e. seeing different parts in each execution) is caused by however Spark happens to distribute the partitions among the workers.
My solution will be to collect the CSVs from all workers using something like scp or rsync to build the full results.
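A different workaround worth noting (not the scp/rsync approach above, and only sensible because the hyperparameter DataFrame is tiny): coalesce the DataFrame to a single partition before writing, so Spark emits one part file, and preferably write to a shared filesystem such as HDFS rather than a worker-local path. A rough sketch reusing dfToSave from the question, with an illustrative HDFS path:

// One partition => one output part file; fine here because the data is only a handful of rows.
dfToSave
  .coalesce(1)
  .write
  .format("csv")
  .mode("overwrite")
  .save("hdfs:///results/decision-tree/hyperparameters")  // a shared path avoids files scattered across worker filesystems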

Formatting BigQuery query to ML appropriate JSON to Pass through ML Predict

Using Python 2.7, I want to pass a query from BigQuery to ML Predict, which has a specific formatting requirement.
First: Is there an easier way to go directly from the BigQuery query to JSON in the correct format so it can be passed to requests.post() instead of going through pandas (from what I understand pandas is still not supported for GCP Standard)?
Second: Is there a way to construct the query to go directly to a JSON format and then modify the JSON to reflect the ML Predict JSON requirements?
Currently my code looks like this:
#I used the bigquery to dataframe option here to view the output.
#I would like to not use pandas in the end code.
logs = log_data.execute(output_options=bq.QueryOutput.dataframe()).result()
data = logs.to_json(orient='index')
print data
'{"0":{"end_time":"2018-04-19","device":"iPad","device_os":"iOS","device_os_version":"5.1.1","latency":0.150959,"megacycles":140.0,"cost":"1.3075e-08","device_brand":"Apple","device_family":"iPad","browser_version":"5.1","app":"567","ua_parse":"0"}}'
#The JSON needs to be in this format according to google documentation.
#data = {
# 'instances': [
# {
# 'key':'',
# 'end_time': '2018-04-19',
# 'device': 'iPad',
# 'device_os': 'iOS',
# 'device_os_version': '5.1.1',
# 'latency': 0.150959,
# 'megacycles':140.0,
# 'cost':'1.3075e-08',
# 'device_brand':'Apple',
# 'device_family':'iPad',
# 'browser_version':'5.1',
# 'app':'567',
# 'ua_parse':'40.9.8'
# }
# ]
#}
So all I would need to change is the leading key '0' to 'instances' and I should be all set to pass it into requests.post().
Is there a way to accomplish this?
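One direct way to make exactly that change is to parse the string and re-wrap the row dicts under 'instances'; a minimal sketch using the data value printed above (the empty 'key' mirrors the commented-out target format):

import json

parsed = json.loads(data)            # {'0': {'end_time': ..., 'device': ..., ...}}
instances = list(parsed.values())    # drop the '0'-style index keys, keep the row dicts
for row in instances:
    row.setdefault('key', '')        # the target format carries an empty 'key' field
payload = {'instances': instances}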
Edit: Adding the BigQuery query:
%%bq query --n log_data
WITH `my.table` AS (
SELECT ARRAY<STRUCT<end_time STRING, device STRING, device_os STRING, device_os_version STRING, latency FLOAT64, megacycles FLOAT64,
cost STRING, device_brand STRING, device_family STRING, browser_version STRING, app STRING, ua_parse STRING>>[] instances
)
SELECT TO_JSON_STRING(t)
FROM `my.table` AS t
WHERE end_time >='2018-04-19'
LIMIT 1
data = log_data.execute().result()
Thanks to @MikhailBerlyant I have adjusted my query and code to look like this:
%%bq query --n log_data
SELECT [TO_JSON_STRING(t)] AS instance
FROM `yourproject.yourdataset.yourtable` AS t
WHERE end_time >='2018-04-19'
LIMIT 1
But when I run logs = log_data.execute().result() I get a QueryResultsTable back, which results in this error when passing it into requests.post():
TypeError: QueryResultsTable job_zfVEiPdf2W6msBlT6bBLgMusF49E is not JSON serializable
Is there a way within execute() to just return the JSON?
First: Is there an easier way to go directly from the BigQuery query to JSON in the correct format
See example below
#standardSQL
WITH yourTable AS (
SELECT ARRAY<STRUCT<id INT64, type STRING>>[(1, 'abc'), (2, 'xyz')] instances
)
SELECT TO_JSON_STRING(t)
FROM yourTable t
with a result in the format you asked for:
{"instances":[{"id":1,"type":"abc"},{"id":2,"type":"xyz"}]}
The above demonstrates the query and how it works.
In your real case, you should use something like below:
SELECT TO_JSON_STRING(t)
FROM `yourproject.yourdataset.yourtable` AS t
WHERE end_time >='2018-04-19'
LIMIT 1
hope this helps :o)
Update based on comments
SELECT [TO_JSON_STRING(t)] AS instance
FROM `yourproject.yourdataset.yourtable` t
WHERE end_time >='2018-04-19'
LIMIT 1
I wanted to add this in case someone has the same problem I had, or is at least stuck on where to go once you have the query.
I was able to write a function that formats the query result in the way Google ML Predict wants it to be passed into requests.post(). This is most likely a horrible way to accomplish this, but I could not find a direct way to go from BigQuery to ML Predict in the correct format.
def logs(query):
    client = gcb.Client()
    query_job = client.query(query)
    CSV_COLUMNS = 'end_time,device,device_os,device_os_version,latency,megacycles,cost,device_brand,device_family,browser_version,app,ua_parse'.split(',')
    for row in query_job.result():
        var = list(row)
        l1 = dict(zip(CSV_COLUMNS, var))
        l1.update({'key': ''})
    l2 = {'instances': [l1]}
    return l2
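For completeness, a hedged sketch of how the returned dictionary could then be posted; PREDICT_URL and AUTH_HEADERS are placeholders rather than details from the original post (a real ML Engine predict call also needs OAuth credentials):

import json
import requests

body = logs(query)  # {'instances': [{'end_time': ..., 'key': '', ...}]}
headers = {'Content-Type': 'application/json'}
headers.update(AUTH_HEADERS)  # placeholder for whatever auth the endpoint requires
response = requests.post(PREDICT_URL, data=json.dumps(body), headers=headers)
print(response.status_code, response.text)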

Play + Slick: How to do partial model updates?

I am using Play 2.2.x with Slick 2.0 (with a MySQL backend) to write a REST API. I have a User model with a bunch of fields like age, name, gender, etc. I want to create a route PATCH /users/:id which takes a partial user object (i.e. a subset of the fields of the full user model) in the body and updates the user's info. I am not sure how to achieve this:
How do I use PATCH verb in Play 2.2.x?
What is a generic way to parse the partial user object into an update query to execute in Slick 2.0? I am expecting to execute a single SQL statement, e.g. update users set age=?, dob=? where id=?
Disclaimer: I haven't used Slick, so I'm just going by their documentation about Plain SQL Queries for this.
To answer your first question:
PATCH is just another HTTP verb in your routes file, so for your example:
PATCH /users/:id controllers.UserController.patchById(id)
Your UserController could then be something like this:
val possibleUserFields = Seq("firstName", "middleName", "lastName", "age")

def patchById(id: String) = Action(parse.json) { request =>
  def addClause(fieldName: String) = {
    (request.body \ fieldName).asOpt[String].map { fieldValue =>
      s"$fieldName=$fieldValue"
    }
  }
  val clauses = possibleUserFields.flatMap(addClause)
  val updateStatement = "update users set " + clauses.mkString(",") + s" where id = $id"
  // TODO: Actually make the Slick call, possibly using the 'sqlu' interpolator (see docs)
  Ok(s"$updateStatement")
}
What this does:
Defines the list of JSON field names that might be present in the PATCH JSON
Defines an Action that will parse the incoming body as JSON
Iterates over all of the possible field names, testing whether they exist in the incoming JSON
If so, adds a clause of the form fieldname=<newValue> to a list
Builds an SQL update statement, comma-separating each of these clauses as required
I don't know if this is generic enough for you; there's probably a way to get the field names (i.e. the Slick column names) out of Slick, but like I said, I'm not even a Slick user, let alone an expert :-)
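To sketch the TODO above: with Slick 2.0's plain SQL support the call could look roughly like this for a fixed set of columns (the Database setup is a placeholder; in a Play app it would normally come from configuration or the play-slick plugin):

import scala.slick.driver.MySQLDriver.simple._
import scala.slick.jdbc.StaticQuery.interpolation

// Placeholder connection details, for illustration only.
val db = Database.forURL("jdbc:mysql://localhost/mydb", driver = "com.mysql.jdbc.Driver")

def updateAge(id: String, age: Int): Unit =
  db.withSession { implicit session =>
    // sqlu binds $age and $id as SQL parameters rather than splicing them into the string
    sqlu"update users set age = $age where id = $id".execute
  }

Unlike the string-built updateStatement above, the interpolated values here go in as bind parameters, so values containing quotes or other special characters don't break (or hijack) the statement.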

Mahout 0.7: Failed to get recommendations with large data using MySQLJDBCDataModel

I am using Mahout to build an item-based CF recommendation engine.
I created a MahoutHelper class which has this constructor:
public MahoutHelper(String serverName, String user, String password,
        String DatabaseName, String tableName) {
    source = new MysqlConnectionPoolDataSource();
    source.setServerName(serverName);
    source.setUser(user);
    source.setPassword(password);
    source.setDatabaseName(DatabaseName);
    source.setCachePreparedStatements(true);
    source.setCachePrepStmts(true);
    source.setCacheResultSetMetadata(true);
    source.setAlwaysSendSetIsolation(true);
    source.setElideSetAutoCommits(true);

    DBmodel = new MySQLJDBCDataModel(source, tableName, "userId", "itemId",
            "value", null);
    similarity = new TanimotoCoefficientSimilarity(DBmodel);
}
and the recommend method is:
public List<RecommendedItem> recommendation() throws TasteException {
    Recommender recommender = null;
    recommender = new GenericItemBasedRecommender(DBmodel, similarity);
    List<RecommendedItem> recommendations = null;
    recommendations = recommender.recommend(userId, maxNum);
    System.out.println("query completed");
    return recommendations;
}
It's using the data source to build the data model, but the problem is that when MySQL holds only a little data (fewer than 100 rows) the program works fine for me, while when the scale grows to over 1,000,000 rows the program gets stuck doing the recommendation and never moves forward. I have no idea how that happens. By the way, I used the same data to build a FileDataModel with a .dat file, and it takes only 2~3 seconds to complete the analysis. I am confused.
Using the database directly will only work for tiny data sets, like maybe a hundred thousand data points. Beyond that, the overhead of such a data-intensive computation means it will never run quickly; a single recommendation can take thousands of SQL queries or more.
Instead you must load and re-load into memory. You can still pull from the database; look at ReloadFromJDBCDataModel as a wrapper.
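As a rough sketch of that suggestion, reusing source and the column names from the question's constructor (imports from org.apache.mahout.cf.taste.* are assumed, and the surrounding code still has to handle TasteException):

MySQLJDBCDataModel jdbcModel = new MySQLJDBCDataModel(source, tableName,
        "userId", "itemId", "value", null);
// Loads the whole table into memory once; recommendations are then served from the
// in-memory copy. Call refresh(null) later to re-load from the database.
DataModel inMemoryModel = new ReloadFromJDBCDataModel(jdbcModel);
ItemSimilarity similarity = new TanimotoCoefficientSimilarity(inMemoryModel);
Recommender recommender = new GenericItemBasedRecommender(inMemoryModel, similarity);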