I am trying to display the current position of every device registered in my geomesa-accumulo database through GeoServer's WPS. Each device sends its position every X seconds, so I am using GeoMesa's TrackLabel process to get the last position of each device. The WPS process setup is:
track: device_id
dtg: date_time
I run the process and display the results using Leaflet, but I think the results are not what I expected, because if I run the following query in a Jupyter notebook:
spark.sql("select device_id, date_time, position from positions where device_id = 145 order by date_time desc limit 1").show()
It returns that the last position was at 2016-05-17 20:47 but the TrackLabel process says 2016-03-05 20:12.
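For a cross-check of all devices at once, a per-device version of the same check can be done in Spark SQL with a window function (a sketch against the same positions table as the query above), which makes it easy to compare the latest timestamp per device against what the WPS returns:
spark.sql("""
  select device_id, date_time, position
  from (
    select device_id, date_time, position,
           row_number() over (partition by device_id order by date_time desc) as rn
    from positions
  ) t
  where rn = 1
""").show()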
My questions: if this is the correct approach, then what am I missing?
Or what should be the correct approach for this problem?
Since you're querying the entire dataset, you may be hitting the WFS result limit. See here for details.
I'm a student intern on a business team and my coworkers don't have a CS background, so I hope to get some feedback and suggestions for improvement on the database design for the Flask web application I will be working on. Also, I taught myself SQL a couple of years ago by following tutorials on YouTube.
When a new work order is received by the business, it is then passed to a line of 5 stations to process it further. Currently the status of the work order is either started or finished. We hope to track it better by knowing the current station/stage (A, B, C, D, E) of the work order and then help improve the flow by letting the operator at each station know what's next in line.
My idea is to create a web app (using Python 3, Flask, and PostgreSQL) that updates the database when an operator at each station scans the work order's barcode and two other static barcodes (in_station_X and out_station_X). Each station will have a tablet connected to a scanner.
I. Station Operator perspective (for example Station 1)
Scan the batch of all incoming work orders (barcodes) for that shift. For each item, they would also scan the in_station_1 barcode to record the time_in for each work order.
The work orders come in a queue, so eventually the web app running on the tablet can show them what's next in line.
When an item is processed, the operator would scan the work order again and also the out_station_1 barcode to record the time_out for each work order.
The item coming out of that station may not have the same order as the incoming queue due to different priority (boolean Yes/No).
II. Admin/dashboard perspective:
See the current station and cycle time of each work order in that day.
Modify the priority of a work order if need be.
Also, the possibility to see a reloop if a work order fails to be processed at station 2 and needs to go back to station 1.
III. The database:
a. Work Order Info table that contains fields such as:
id, workorder_barcode, requestor, priority (boolean Yes/No), date_created.
b. The tracking table: I'm thinking of having columns like the following (a rough sketch of both tables follows this list):
- id (automatically generated for new row)
- workorder_barcode (nullable = False)
- current_station (nullable = False)
- time_in
- time_out
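A rough PostgreSQL sketch of the two tables above (table and column names are illustrative only, and the types are assumptions on my part):
create table work_order_info (
    id                serial primary key,
    workorder_barcode text not null unique,
    requestor         text,
    priority          boolean not null default false,  -- the Yes/No priority flag
    date_created      timestamp not null default now()
);

create table tracking (
    id                serial primary key,
    workorder_barcode text not null references work_order_info (workorder_barcode),
    current_station   char(1) not null,   -- 'A' through 'E'
    time_in           timestamp,          -- set when the in_station_X barcode is scanned
    time_out          timestamp           -- set when the out_station_X barcode is scanned
);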
I have several questions/concerns related to this tracking table:
Every time a work order is scanned in or out, a new row will be created (which means one of the two time columns is blank). Do you see any issues with this approach vs. looking up the same work order that has a time_in and filling in its time_out? The reason for this is to avoid multiple lookups when the database grows large.
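For comparison, the "look up the open row and fill in time_out" alternative might look roughly like this (a sketch against the tracking table sketched earlier; :scanned_barcode and :station are placeholders for the values coming from the scanner):
update tracking
set    time_out = now()
where  workorder_barcode = :scanned_barcode
and    current_station   = :station
and    time_out is null;
If that lookup ever becomes a concern, a partial index such as create index on tracking (workorder_barcode, current_station) where time_out is null should keep it cheap as the table grows.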
Since the app screen at each station will show what's next in line, do you think a simple query with ORDER BY to show the order needed would suffice? What concerns me is showing the next item based on both the priority of each item and the current incoming order. I think I can sort by multiple columns (time_in and priority) and filter by current_station. However, as you can see below, I think the current table design may be more suitable for capturing events than for doing queue control.
For example: the table for today would look like
id, workorder_barcode, current_station, time_in, time_out
61, 100.1, A, 6:00pm, null
62, 100.3, A, 6:01pm, null
63, 100.2, A, 6:02pm, null
...
70, 100.1, A, null, 6:03pm
71, 100.1, B, 6:04pm, null
...
74, 100.5, C, 6:05pm, null
At 6:05pm, the queue at each station would be
Station A queue: 100.3, 100.2
Station B queue: 100.1
Station C queue: 100.5
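A sketch of the "what's next in line" query under this event-per-scan design (table names follow the DDL sketch earlier; the NOT EXISTS is needed precisely because the in-scan and the out-scan live in separate rows):
select t.workorder_barcode, w.priority, t.time_in
from   tracking t
join   work_order_info w
       on w.workorder_barcode = t.workorder_barcode
where  t.current_station = 'A'
and    t.time_in is not null
and    not exists (select 1
                   from   tracking o
                   where  o.workorder_barcode = t.workorder_barcode
                   and    o.current_station   = t.current_station
                   and    o.time_out is not null
                   and    o.time_out >= t.time_in)
order by w.priority desc, t.time_in;
Running it once per station (substituting 'A' through 'E') gives each tablet its own view of the shared table.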
I think this can get complicated, with all 5 stations sharing the same table but seeing different queues. Is there a queue-based database that you would recommend I look into?
Thank you so much for reading this. I appreciate any questions, comments, and suggestions, since I'm trying to learn more about databases as I get hands-on with this project.
I am totally clueless about how to get the following kind of result from the same table in MySQL.
Required Result:
The raw data is as shown in the image below.
mc_id and op_id can be different. For example, if mc_id is 4 and op_id is 10, then it has to loop through each vouid and extract done_on_date; then it has to loop again for the same mc_id 4 and op_id 10 and extract the done_on_date that falls after the first extracted done_on_date. This second extracted done_on_date we refer to as next_done_on_date, just to distinguish it. Continue accordingly until the end of the table. I hope I am clear enough now.
The idea is basically to see when a particular operation (op_id) was carried out for the machine with a given mc_id. The first time the operation is done is referred to as done_on_date, and when the same operation is carried out for the same machine the next time, we refer to it as next_done_on_date, although inside the database table it is still stored as done_on_date.
Let me know if anything still needs to be clarified.
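One possible approach (a sketch only; I'm assuming the table is called ops_log and has the columns mentioned above: mc_id, op_id, vouid, done_on_date) is a correlated subquery that finds, for each row, the earliest later done_on_date for the same mc_id and op_id; on MySQL 8.0+ the LEAD() window function does the same thing more directly:
select a.mc_id,
       a.op_id,
       a.vouid,
       a.done_on_date,
       (select min(b.done_on_date)
        from   ops_log b
        where  b.mc_id = a.mc_id
        and    b.op_id = a.op_id
        and    b.done_on_date > a.done_on_date) as next_done_on_date
from   ops_log a
order by a.mc_id, a.op_id, a.done_on_date;

-- MySQL 8.0+ equivalent using a window function:
select mc_id, op_id, vouid, done_on_date,
       lead(done_on_date) over (partition by mc_id, op_id
                                order by done_on_date) as next_done_on_date
from   ops_log;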
This query creates an export for UPS from the deliveries history:
select 'key'
, ACC.Name
, CON.FullName
, CON.Phone
, ADR.AddressLine1
, ADR.AddressLine2
, ADR.AddressLine3
, ACC.Postcode
, ADR.City
, ADR.Country
, ACC.Code
, DEL.DeliveryNumber
, CON.Email
, case
when CON.Email is not null
then 'Y'
else 'N'
end
Ship_Not_Option
, 'Y' Ship_Not
, 'ABCDEFG' Description_Goods
, '1' numberofpkgs
, 'PP' billing
, 'CP' pkgstype
, 'ST' service
, '1' weight
, null Shippernr
from ExactOnlineREST..GoodsDeliveries del
join ExactOnlineREST..Accounts acc
on ACC.ID = del.DeliveryAccount
join ExactOnlineREST..Addresses ADR
on ADR.ID = DEL.DeliveryAddress
join ExactOnlineREST..Contacts CON
on CON.ID = DEL.DeliveryContact
where DeliveryDate between $P{P_SHIPDATE_FROM} and $P{P_SHIPDATE_TO}
order
by DEL.DeliveryNumber
It takes many minutes to run. The number of deliveries and accounts grows by several hundred each day. Addresses and contacts are mostly 1:1 with accounts. How can this query be optimized for speed in Invantive Control for Excel?
Probably this query is run at most once per day, since DeliveryDate does not contain a time component. Therefore, the number of rows selected from ExactOnlineREST..GoodsDeliveries is several hundred. Based upon the statistics given, the number of accounts, delivery addresses and contacts is also approximately several hundred.
Normally, such a query would be optimized by a solution such as the one in "Exact Online query with joins runs more than 15 minutes", but that solution will not work here: the third value of join_set(soe, orderid, 100) is the maximum number of rows on the left-hand side to be used with index joins. At this moment, the maximum number on the left-hand side is something like 125, based upon constraints on the URL length for OData requests to Exact Online. Please remember that the actual OData query is a GET using a URL, not a POST with unlimited size for the filter.
The alternatives are:
Split volume
Data Cache
Data Replicator
Have SQL engine or Exact Online adapted :-)
Split Volume
In a separate query select the eligible GoodsDeliveries and put them in an in-memory or database table using for instance:
create or replace table gdy#inmemorystorage as select ... from ...
Then create a temporary table per 100 or similar rows such as:
create or replace table gdysubpartition1#inmemorystorage as select ... from ... where rowidx$ between 0 and 99
... etc for 100, 200, 300, 400, 500
And then run the query several times, each time with a different gdysubpartition1..gdysubpartition5 instead of the original from ExactOnlineREST..GoodsDeliveries.
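As a rough sketch of those two steps, reusing the date filter from the original query (the exact syntax of the partitioning step may need adjusting to your Invantive SQL version):
create or replace table gdy#inmemorystorage
as
select del.*
from   ExactOnlineREST..GoodsDeliveries del
where  del.DeliveryDate between $P{P_SHIPDATE_FROM} and $P{P_SHIPDATE_TO}

create or replace table gdysubpartition1#inmemorystorage
as
select *
from   gdy#inmemorystorage
where  rowidx$ between 0 and 99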
Of course, you can also avoid the use of intermediate tables by using an inline view like:
from (select * from goodsdeliveries where date... limit 100)
or similar.
Data Cache
When you run the query multiple times per day (unlikely, but I don't know), you might want to cache the Accounts in a relational database and update it every day.
You can also use `local memorize results clipboard` and `local save results clipboard to ...` to save the last results to a file manually, and later restore them using `local load results clipboard from ...` and `local insert results clipboard in table ...`. And maybe then `insert into ... from exactonlinerest..accounts where datecreated > trunc(sysdate)`.
Data Replicator
With Data Replicator enabled, you can have replicas created and maintained automatically within an on-premise or cloud relational database for Exact Online API entities. For low latency, you will need to enable the Exact webhooks.
Have SQL Engine or Exact adapted
You can also register a request to have the SQL engine allow a higher number in the join_set hint, which would require addressing the Exact Online APIs in another way. Or register a request with Exact to also allow POST requests to the API with the filter in the body.
Quick synopsis of the problem:
I am working on a graph page to map the performance of a device my company is working on.
I get a new stat point (timestamp, stats, nodeid, volumeid, clusterid) every 2 seconds, from every node.
This results in approximately 43k records per day, per node, per stat.
Now let's say I have 13 stats; that's roughly 520k records a day.
So a row would look something like:
timestamp typeid clusterid nodeid volumeid value
01/02/2016 05:02:22 0 1 1 1 82.20
A brief explanation: we decided to go with MySQL because it's easily scalable in Amazon. I was using InfluxDB before, which could easily solve this problem, but there is no way to auto-scale InfluxDB in Amazon.
My ultimate goal is to get a return value that looks like:
object[ {
node1-stat1: 20.0,
node2-stat1: 23.2,
node3-stat1: xx.x,
node1-stat2: 20.0,
node2-stat2: xx.x,
node3-stat2: xx.x,
timestamp: unixtimestamp
},
{
node1-stat1: 20.0,
node2-stat1: 23.2,
node3-stat1: xx.x,
node1-stat2: 20.0,
node2-stat2: xx.x,
node3-stat2: xx.x,
timestamp: unixtimestamp + 2 seconds
}]
I currently have a query that gathers all the unique timestamps,
and then loops over those to get the values belonging to each timestamp,
which get put into an object.
That results in the desired output, but it takes FOREVER and it's over a million queries.
Can something like this even be done in MySQL? Should I go back to a time-series DB and just deal with scaling it manually?
// EDIT //
I think I might have solved my problem:
SELECT data_points.*, data_types.friendly_name as friendly_name
FROM data_points, data_types
WHERE (cluster_id = '5'
AND data_types.id = data_points.data_type_id
AND unix_timestamp(timestamp) BETWEEN '1456387200' AND '1457769599')
ORDER BY timestamp, friendly_name, node_id, volume_id
This gives me all the fields I need.
I then loop over these data points, create a new "object" for each timestamp, and add stats to that object for all the rows that match the timestamp.
This executes in under a second while going over a million records.
I will for sure try to see if swapping to a time-series DB makes an improvement in the future.
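For completeness, the pivoting can also be pushed into MySQL itself with conditional aggregation. This is only a sketch against the same data_points/data_types tables; it assumes the measurement column is called value, as in the sample row above, and you would generate one MAX(CASE ...) expression per node/stat combination:
select unix_timestamp(dp.timestamp) as ts,
       max(case when dt.friendly_name = 'stat1' and dp.node_id = 1 then dp.value end) as node1_stat1,
       max(case when dt.friendly_name = 'stat1' and dp.node_id = 2 then dp.value end) as node2_stat1,
       max(case when dt.friendly_name = 'stat2' and dp.node_id = 1 then dp.value end) as node1_stat2
       -- ... one expression per node/stat combination
from   data_points dp
join   data_types dt on dt.id = dp.data_type_id
where  dp.cluster_id = 5
and    unix_timestamp(dp.timestamp) between 1456387200 and 1457769599
group by dp.timestamp
order by dp.timestamp;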
I'm using the Gerrit REST API to query all changes whose status is "merged". My query is:
https://android-review.googlesource.com/changes/?q=status:merged&n=2
where "n=2" limits the size of query results to 2. So I got a JSON object like:
Of course there are more results. According to the REST document:
If the n query parameter is supplied and additional changes exist that match the query beyond the end, the last change object has a _more_changes: true JSON field set. Callers can resume a query with the N query parameter, supplying the last change’s _sortkey field as the value.
So I added the query parameter N with the _sortkey of the last change, 100309. The new query is:
https://android-review.googlesource.com/changes/?q=status:merged&n=2&N=002e4203000187d5
With this new query, I was hoping to get the next 2 results, since I provided the _sortkey as a cursor into my previous search results.
However, it's really weird that this new query returns exactly the same results as the previous query, instead of the next 2 results as I expected. It seems like providing "N=002e4203000187d5" has no effect at all.
Does anybody know why using _sortkey to resume my query doesn't work?
I chatted with one of the developers at Google, and he confirmed that _sortkey has been removed from the newer versions of Gerrit they are running at android-review and gerrit-review. The N= parameter is no longer valid. The documentation will be updated to reflect this.
The alternative is to use &S=x to skip x results, which I tested and works well.
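For example, to fetch the next page after the first two results, the request becomes something like:
https://android-review.googlesource.com/changes/?q=status:merged&n=2&S=2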
sortkey is deprecated in Gerrit v2.9 -
see the (Gerrit) ReleaseNotes-2.9.txt, under REST API - Changes:
[[sortkey-deprecation]]
Results returned by the [query changes] endpoint are now paginated using offsets instead of sortkeys.
The sortkey and sortkey_prev parameters on the endpoint are deprecated.
The results are now paginated using the --limit (-n) option to limit the number of results, and the -S option to set the start point.
Queries with sortkeys are still supported against old index versions, to enable online reindexing while clients have an older JS version.
See also here -
PSA: Removing the "sortkey" field from the gerrit-on-borg query interface:
...
Our solution is to kill the sortkey field and its related search operators (sortkey_before, sortkey_after, and resume_sortkey).
There are two ways you can achieve similar functionality.
Add "&S=" to your query to skip a fixed number of results.
(Note that this redoes the search so new results may have jumped ahead and
you might process the same change twice.
This is true of the resume_sortkey implementation as well,
so your code should already be able to handle this.)
Use the before/after operators.
Instead of taking the sortkey field from the last returned change and
using it in a resume_sortkey operator, you take the updated field from
the last returned change and use it in a before operator.
(This has slightly different semantics than the sortkey field, which
uses the change number as a tiebreaker when changes have similar updated times.)
...
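Putting that together, a resumed query using the before operator would look something like the following sketch, where the quoted placeholder must be replaced with the URL-encoded updated field of the last change returned by the previous page:
https://android-review.googlesource.com/changes/?q=status:merged+before:"<updated of last change>"&n=2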