I am developing a GeoMesa client to perform basic read, write, and delete operations. I have also created a function that should return the matching feature count for a specified query, but it always returns zero. I also tried the DataStore stats API for fetching the matching feature count; it gives the correct result, but the operation is very slow. Below is my client code:
public int getRideCount(Long rideId) throws Exception {
    int count = 0;
    if (rideId != null) {
        count = fs.getCount(new Query(tableName, CQL.toFilter("r=" + rideId)));
        // count = ((Long) ds.stats().getCount(sft, CQL.toFilter("r=" + rideId), true).get()).intValue();
    }
    return count;
}
Can anyone help me figure out why it returns 0 even though features exist in the feature collection? Or is there another preferred technique for fetching the matching feature count? Any suggestions or clarifications are welcome.
Based on the additional info from your email to the geomesa dev list, I believe this is caused by a bug in simple feature types that don't have a date attribute. I've opened a ticket here and a PR here for the issue. It should be fixed in the next release (1.3.2), or you can build the branch locally.
In the meantime, the 'exact' counts should still work, although they will be slower. Instructions for enabling exact counts are here and here.
I have a project that requires we allow users to create custom columns, enter custom values, and use these custom values to execute user defined functions.
Similar Functionality In Google Data Studio
We have exhausted all implementation strategies we can think of (executing formulas on the front end, in isolated execution environments, etc.).
Short of writing our own interpreter, the only implementation we could find that meets the performance, functionality, and scalability requirements is to execute these functions directly within MySQL: basically, taking the expressions entered by the user and dynamically rolling up a query that computes the results server side in MySQL.
This obviously opens a can of worms security wise.
Quick aside: I expect to get the "you shouldn't do it that way" response. Trust me, I hate that this is the best solution we can find. The resources online describing similar problems are remarkably scarce, so if there are any suggestions for where to find information on analogous problems/solutions/implementations, I would greatly appreciate it.
With that said, assuming that we don't have alternatives, my question is: How do we go about doing this safely?
We have a few current safeguards set up:
Executing the user-defined expressions against a tightly controlled subquery that limits the "inner context" that the dynamic portion of the query can pull from (see the sketch after this list).
Blacklisting certain phrases that should never be used (SELECT, INSERT, UNION, etc.). This introduces issues, because a user should be able to enter something like: CASE WHEN {{var}} = "union pacific railroad" THEN... but that is a tradeoff we are willing to make.
Limiting the MySQL connection making the query so that it only has access to the tables/functionality needed for the feature.
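To make that first safeguard concrete, here is a rough sketch of the wrapping (shown in C# with hypothetical table, column, and class names; our actual stack differs):

public static class CustomExpressionQuery
{
    // The user's expression may only reference columns exposed by this
    // tightly controlled inner subquery ("context"); nothing outside it
    // is visible to the dynamic portion of the query.
    // Table and column names here are hypothetical.
    private const string InnerContext =
        "SELECT order_id, customer_name, freight_cost FROM orders";

    public static string Build(string userExpression)
    {
        return "SELECT (" + userExpression + ") AS result " +
               "FROM (" + InnerContext + ") AS context";
    }
}

A user expression like CASE WHEN customer_name = "union pacific railroad" THEN freight_cost ELSE 0 END then evaluates against the context derived table only.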
This gets us pretty far. But I'm still not comfortable with it. One additional option that I couldn't find any info online about was using the query execution plan as a means of detecting if the query is going outside of its bounds.
So prior to actually executing the query/getting the results, you would wrap it within an EXPLAIN statement to see what the dynamic query was doing. From the results of the EXPLAIN query, you should be able to detect any operations (subqueries, key references, UNIONs, etc.) that fall outside the bounds of what the query is allowed to do.
Is this a useful validation method? It seems to me that this would be a powerful tool for protecting against a suite of SQL injections, but I couldn't seem to find any information online.
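To make the idea concrete, here is a sketch of the kind of check I have in mind (in C# with ADO.NET's MySQL client purely for illustration; the allow-lists are hypothetical and would need tuning against your actual plans):

using System;
using System.Collections.Generic;
using MySql.Data.MySqlClient;

public static class ExplainValidator
{
    // select_type values the controlled query shape is expected to produce;
    // PRIMARY/DERIVED appear because the context is a derived table.
    // Anything else (UNION, SUBQUERY, DEPENDENT SUBQUERY, ...) suggests
    // the user expression escaped its bounds.
    private static readonly HashSet<string> AllowedSelectTypes =
        new HashSet<string> { "SIMPLE", "PRIMARY", "DERIVED" };

    // Base tables the plan is allowed to touch (hypothetical name).
    private static readonly HashSet<string> AllowedTables =
        new HashSet<string> { "orders" };

    public static void Validate(MySqlConnection conn, string sql)
    {
        using (var cmd = new MySqlCommand("EXPLAIN " + sql, conn))
        using (var reader = cmd.ExecuteReader())
        {
            int typeOrd = reader.GetOrdinal("select_type");
            int tableOrd = reader.GetOrdinal("table");

            while (reader.Read())
            {
                string selectType = reader.GetString(typeOrd);
                string table = reader.IsDBNull(tableOrd) ? "" : reader.GetString(tableOrd);

                if (!AllowedSelectTypes.Contains(selectType))
                    throw new InvalidOperationException(
                        "Disallowed operation in plan: " + selectType);

                // MySQL reports derived tables as "<derivedN>".
                if (table.Length > 0 && !table.StartsWith("<derived") &&
                    !AllowedTables.Contains(table))
                    throw new InvalidOperationException(
                        "Disallowed table in plan: " + table);
            }
        }
    }
}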
Thanks in advance!
(from Comment)
Some examples showing the actual autogenerated queries being used: there are both visual and list examples of the query execution plan for both malicious and valid custom functions.
GRANT only SELECT on the table(s) that they are allowed to manipulate. This allows arbitrarily complex SELECT queries to be run. (The one flaw: such queries may run for a long time and/or take a lot of resources. MariaDB has more facilities for preventing runaway SELECTs.)
Provide limited "write" access via Stored Routines with expanded privileges, but do not pass arbitrary values into them. See SQL SECURITY: DEFINER has the privileges of the person creating the routine. (As opposed to INVOKER is limited to SELECT on the tables mentioned above.)
Another technique that may or may not be useful is creating VIEWs with SELECT privileges. This, for example, can let the user see most information about employees while hiding the salaries.
Related to that is the ability to GRANT different permissions on different columns, even in the same table.
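A rough sketch of that setup (hypothetical schema, user, and column names; these are ordinary one-time SQL statements an admin would run directly, wrapped in C# here only for consistency with the other examples in this document):

using MySql.Data.MySqlClient;

public static class PrivilegeSetup
{
    // Hypothetical schema/user names; run once over an admin connection.
    private static readonly string[] Statements =
    {
        // Read-only access that still permits arbitrarily complex SELECTs:
        "GRANT SELECT ON app_db.orders TO 'report_user'@'%'",

        // A view that hides sensitive columns (e.g. salaries):
        "CREATE VIEW app_db.employees_public AS " +
        "SELECT id, name, department FROM app_db.employees",
        "GRANT SELECT ON app_db.employees_public TO 'report_user'@'%'",

        // Column-level grants on the same table:
        "GRANT SELECT (id, name, department) ON app_db.employees TO 'report_user'@'%'"
    };

    public static void Apply(MySqlConnection adminConn)
    {
        foreach (var sql in Statements)
        {
            using (var cmd = new MySqlCommand(sql, adminConn))
            {
                cmd.ExecuteNonQuery();
            }
        }
    }
}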
(I have implemented a similar web app, and released it to everyone in the company. And I could 'sleep at night'.)
I don't see subqueries and Unions as issues. I don't see the utility of EXPLAIN other than to provide more info in case the user is a programmer trying out queries.
EXPLAIN can help in discovering long-running queries, but it is imperfect. Ditto for LIMIT.
More
I think "UDF" is either "normalization" or "EAV"; it is hard to tell which. Please provide SHOW CREATE TABLE.
This is inefficient because it builds a temp table before removing the 'NULL' items:
FROM ( SELECT ...
           FROM ...
           LEFT JOIN ...
     ) AS context
WHERE ... IS NULL
This is better because it can do the filtering sooner:
FROM ( SELECT ...
           FROM ...
           LEFT JOIN ...
           WHERE ... IS NULL
     ) AS context
I wanted to share a solution I found for anyone who comes across this in the future.
To prevent someone from entering malicious SQL in a "custom expression", we decided to preprocess and analyze the SQL prior to sending it to the MySQL database.
Our server is running NodeJS, so we used a parsing library to construct an abstract syntax tree from their custom SQL. From here we can traverse the tree and identify any operations that shouldn't be taking place.
The mock code (it won't run in this example) would look something like:
// e.g. node-sql-parser -- any library that produces a SQL AST works similarly
const { Parser } = require("node-sql-parser");
const parser = new Parser();

const valid_types = ["case", "when", "else", "column_ref", "binary_expr", "single_quote_string", "number"];
const valid_tables = ["context"];

// Create a mock sql expression and parse the AST
var exp = YOUR_CUSTOM_EXPRESSION;
var ast = parser.astify(exp);

// Check for attempted multi-statement injections
if (Array.isArray(ast) && ast.length > 1) {
    throw new Error("Multiple statements detected");
}

// Recursively check the AST for unallowed operations
recursive_ast_check([], "columns", ast.columns);

function recursive_ast_check(path, p_key, ast_node) {
    // If the parent key is the "type" of operation, check it against allowed values
    if (p_key === "type") {
        if (valid_types.indexOf(ast_node) === -1) {
            throw new Error("Invalid type '" + ast_node + "' found at following path: " + JSON.stringify(path));
        }
        return;
    }

    // If the parent key is a table reference, the value should always be "context"
    if (p_key === "table") {
        if (valid_tables.indexOf(ast_node) === -1) {
            throw new Error("Invalid table reference '" + ast_node + "' found at following path: " + JSON.stringify(path));
        }
        return;
    }

    // Ignore null or empty nodes
    if (!ast_node) { return; }

    // Recursively search array values down the chain
    if (Array.isArray(ast_node)) {
        for (var i = 0; i < ast_node.length; i++) {
            recursive_ast_check([...path, p_key], i, ast_node[i]);
        }
        return;
    }

    // Recursively search object keys down the chain
    if (typeof ast_node === "object") {
        for (let key of Object.keys(ast_node)) {
            recursive_ast_check([...path, p_key], key, ast_node[key]);
        }
    }
}
This is just a mockup adapted from our implementation, but hopefully it provides some guidance. I should also note that it is best to implement all of the strategies discussed above as well; many safeguards are better than just one.
I am currently designing a REST API and am a little stuck on performance matters for 2 of the use cases in the system:
List all campaigns (api/campaigns) - needs to return the campaign data required for listing and paging campaigns. It may return up to 1000 records, and it would take ages to retrieve and return detailed data. The data needed for the listing can be returned in a single DB call.
Retrieve campaign item (api/campaigns/id) - needs to return all data about the campaign and may take up to a second to run. Multiple DB calls are needed to get all the data for a single campaign.
My question is: is it valid to return different JSON responses for those 2 calls (if well documented) even though they concern the same resource? I am thinking that the list response would be a subset of the retrieve response. The reason for this is to save DB calls, bandwidth, and parsing.
Thanks in advance!
I think it's both fine and expected for /campaigns and /campaigns/{id} to return different information. I would suggest using query parameters to limit the amount of information you need to return. For instance, only return a URI to each player unless you see a ?expand=players query parameter, in which case you return detailed player information.
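For example, the two representations might look something like this (hypothetical C# types and property names):

using System.Collections.Generic;

// Summary item returned by GET /api/campaigns (cheap: one DB call per page)
public class CampaignSummary
{
    public int Id { get; set; }
    public string Name { get; set; }
    public string DetailUri { get; set; } // link to /api/campaigns/{id}
}

// Full item returned by GET /api/campaigns/{id}; the expensive fields are
// also available from the list endpoint when ?expand=players is supplied
public class CampaignDetail
{
    public int Id { get; set; }
    public string Name { get; set; }
    public List<string> PlayerUris { get; set; }    // default: URIs only
    public List<PlayerDetail> Players { get; set; } // filled when expanded
}

public class PlayerDetail
{
    public int Id { get; set; }
    public string Name { get; set; }
}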
I have some tables in a MySQL database that represent records from a sensor. One of the features of the system I'm developing is to display these records to the web user, so I used the ADO.NET Entity Data Model to create an ORM, used LINQ to SQL to get the data from the database, and stored it in a ViewModel I designed, so I can display it using the MVCContrib Grid Helper:
public IQueryable<TrendSignalRecord> GetTrends()
{
    var dataContext = new SmgerEntities();
    var trendSignalRecords = from e in dataContext.TrendSignalRecords
                             select e;
    return trendSignalRecords;
}

public IQueryable<TrendRecordViewModel> GetTrendsProjected()
{
    var projectedTrendRecords = from t in GetTrends()
                                select new TrendRecordViewModel
                                {
                                    TrendID = t.ID,
                                    TrendName = t.TrendSignalSetting.Name,
                                    GeneratingUnitID = t.TrendSignalSetting.TrendSetting.GeneratingUnit_ID,
                                    //{...}
                                    Unit = t.TrendSignalSetting.Unit
                                };
    return projectedTrendRecords;
}
I call the GetTrendsProjected method and then use LINQ to SQL to select only the records I want. It works fine in my development scenario, but when I test it in a real scenario, where the number of records is far greater (around a million), it stops working.
I put in some debug messages to test it, and everything works fine until it reaches the return View() statement, where it simply stops, throwing a MySqlException: Timeout expired. That left me wondering whether the data I send to the page is retrieved by the page itself (i.e., it only queries the database for the displayed items when the page itself needs them, or something like that).
All of my other pages use the same set of tools: MVCContrib Grid Helper, ADO.NET, Linq to SQL, MySQL, and everything else works alright.
You absolutely should paginate your data set before executing your query if you have millions of records. This can be done using the .Skip and .Take extension methods, and they should be called before running any query against your database.
Trying to fetch millions of records from a database without pagination would very likely cause a timeout, at best.
Well, assuming the information in this blog is correct, the .AsPagination method requires you to sort your data by a particular column. It's possible that doing an OrderBy on a table with millions of records in it is simply a time-consuming operation and times out.
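For illustration, a minimal sketch of paging the GetTrendsProjected() query from the question (assumes using System.Linq; the page parameters are hypothetical):

public IQueryable<TrendRecordViewModel> GetTrendsPage(int pageIndex, int pageSize)
{
    // OrderBy gives Skip/Take a deterministic page; all three operators
    // compose into the generated SQL, so only one page of rows is ever
    // pulled from MySQL instead of the whole million-row table.
    return GetTrendsProjected()
        .OrderBy(t => t.TrendID)
        .Skip(pageIndex * pageSize)
        .Take(pageSize);
}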
I am getting an IQueryable from my database and then deriving another IQueryable from the first one; that is, I am filtering the first one.
My question is: does this affect performance? How many times will the code call the database? Thank you.
Code:
DataContext _dc = new DataContext();

IQueryable offers = from o in _dc.Offers
                    select o;

IQueryable filtered = from o in offers
                      select new { ... };

return View(filtered);
The code you have given will never call the database, since you never use the results of the query anywhere.
IQueryable collections aren't filled until you iterate through them... and you're not iterating through anything in that code sample (ah, the beauty of lazy initialization).
That also means that each of those statements would be executed as its own query against the database, which results in no performance cost over doing two completely independent queries.
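To make the deferred execution visible, here is a small sketch built on the types from the question (the IsActive, Id, and Title members are hypothetical; assumes using System.Linq):

public void Demo()
{
    var _dc = new DataContext();

    // Neither statement sends any SQL; each one only builds up an
    // expression tree describing the query.
    IQueryable<Offer> offers = from o in _dc.Offers
                               select o;
    var filtered = from o in offers
                   where o.IsActive              // hypothetical member
                   select new { o.Id, o.Title }; // hypothetical members

    // The database is queried only here, when the results are enumerated.
    var results = filtered.ToList();
}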
SO is not a replacement for developer tools. There are many good free tools able to tell you exactly what this code translates into and how it works. Use Reflector on this method and look at what code is generated and reason for yourself what is going on from there.
This is a brief description of my DB: there are issues, which have categories. I have various queries that get issues based on all sorts of criteria, but that's not important to the question.
I want to be able to take a list of issues I have queried (let's say, for example, the ones that occurred yesterday) and group them by category.
I have a method:
public static IEnumerable<Category> GroupIssuesByCategory(IEnumerable<Issue> issues)
{
    return from i in issues
           group i by i.Category into c
           select c.Key;
}
Category has a nice mapping which allows it to list the Issues within it. That's great for what I want, but in this scenario it will pull back all the issues in that category, rather than just the ones from the set I provided. How do I get around this?
Can I get around this?
I worked out why my original code wasn't compiling and updated the question.
Alas, I still have my main problem.
I'm not sure about the second part of the question, but your compilation problem is that the return type of a grouping is IEnumerable<IGrouping<Category, Issue>>, which I think is what you are looking to return from your method. Also, you don't really need the into c select c bit - that's only useful if you want to do some processing on the result of the grouping to get a different list.
IGrouping<S,T> has a Key property, which is the Category value, and it is itself IEnumerable<T>, giving you the list of Issues in that Category.
Try this as your method:
public static IEnumerable<IGrouping<Category, Issue>> GroupIssuesByCategory(IEnumerable<Issue> issues)
{
    return from i in issues
           group i by i.Category;
}
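Consuming the result would then look something like this (the Name and Title properties are hypothetical):

var groups = GroupIssuesByCategory(yesterdaysIssues);

foreach (var group in groups)
{
    // group.Key is the Category; iterating the group yields only the
    // issues from the set passed in, not every issue in that category.
    Console.WriteLine(group.Key.Name);
    foreach (var issue in group)
    {
        Console.WriteLine("  " + issue.Title);
    }
}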