My company is evaluating whether we can use Google Dataflow.
I have run a Dataflow job on Google Cloud Platform. The console shows 5 hr 25 min in the "Reserved CPU Time" field on the right.
Worker configuration: n1-standard-4
Starting 8 workers...
How do I calculate the cost of the Dataflow job? According to this page,
the price is $0.01 per GCEU per hour. How can I find the number of GCEUs consumed by my job, and the number of hours?
You can find the number of GCEUs per machine here: https://cloud.google.com/compute/docs/machine-types. For example, n1-standard-4s are 11 GCEUs.
The cost of a batch Dataflow job (in addition to the raw cost of VMs) is then
(Reserved CPU time in hours) / (Cores per machine) * (GCEUs) * $.01
Then, the total cost of the job is
(machine hours) * ((GCEUs) * $.01 + (machine cost per hour) + (PD cost per hour for attached disks))
For example, for n1-standard-4 with 250GB disks, this works out to (11 * $.01 + $.152 + ($.04 * 250 / 30 / 24)) = $.276 per machine-hour.
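Applied to the job in the question: a Reserved CPU time of 5 hr 25 min ≈ 5.42 hours on n1-standard-4 workers (4 cores, 11 GCEUs) gives 5.42 / 4 ≈ 1.35 machine-hours, so the Dataflow service charge is roughly 1.35 * 11 * $.01 ≈ $.15, and with the $.276 per machine-hour figure above (which assumes 250GB disks) the total is roughly 1.35 * $.276 ≈ $.37.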
There is a new pricing model for Dataflow since 2018-05-03.
Now you should use the following formula:
(vcpu_hours * vcpu_hourly_price) +
(mem_hours * mem_hourly_price) +
(disk_hours * disk_hourly_price)
Additional costs for Shuffle may apply.
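As a rough illustration for the job in the question, assuming the reported 5 hr 25 min corresponds to about 5.42 vCPU-hours on n1-standard-4 workers (which have 3.75 GB of memory per vCPU): vcpu_hours ≈ 5.42, mem_hours ≈ 5.42 * 3.75 ≈ 20.3 GB-hours, and disk_hours is the machine-hours multiplied by the attached persistent disk size in GB. Multiply each by the current per-unit rates from the Dataflow pricing page; batch and streaming jobs have different rates.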
If you enable billing export to BigQuery, it is easy to compute the cost of a single Dataflow job with the query below, filling in the correct values for GCP_PROJECT, BILLING_TABLE_NAME and DATAFLOW_JOB_ID. The query is:
SELECT
l.value AS job_id,
ROUND(SUM(cost),3) AS cost
FROM `$GCP_PROJECT.$BILLING_TABLE_NAME` bill, UNNEST(bill.labels) l
WHERE service.description = 'Cloud Dataflow' AND l.value = '$DATAFLOW_JOB_ID'
GROUP BY 1;
You can find the value for DATAFLOW_JOB_ID in the Dataflow UI and BILLING_TABLE_NAME in the BigQuery UI. The BILLING_TABLE_NAME will be of the format gcp_billing_export_resource_$ACCOUNT_ID
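To avoid accidentally matching some other label that happens to carry the same value, you can also filter on the label key. Assuming your export contains the standard Dataflow job labels (the key name below is how they typically appear; treat it as an assumption and check your own export), the query becomes:
SELECT
l.value AS job_id,
ROUND(SUM(cost),3) AS cost
FROM `$GCP_PROJECT.$BILLING_TABLE_NAME` bill, UNNEST(bill.labels) l
WHERE service.description = 'Cloud Dataflow'
AND l.key = 'goog-dataflow-job-id'
AND l.value = '$DATAFLOW_JOB_ID'
GROUP BY 1;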
NOTE: From personal experience it seems to take quite a while before the billing table is populated with the pricing information.
I have an issue with the limit on the number of records I can process in my Zoho Creator script, so I have been using ranges (i.e. run from record 1 to, say, 100, then 101 to 200, 201 to 300, 301...); but now I have very many records (40,000). Is there a way I can write two or more functions that can run through the records without my defining the ranges time after time?
I would suggest you get the number of available records first.
Based on that count, batch the work.
Example: for 4102 records the batch count is (4102 / 100) + ((4102 % 100) > 0 ? 1 : 0) = 41 + 1 = 42 batches of at most 100 records each,
and download them with a sleep between requests, to prevent the requests from being blocked.
This query creates an export for UPS from the deliveries history:
select 'key'
, ACC.Name
, CON.FullName
, CON.Phone
, ADR.AddressLine1
, ADR.AddressLine2
, ADR.AddressLine3
, ACC.Postcode
, ADR.City
, ADR.Country
, ACC.Code
, DEL.DeliveryNumber
, CON.Email
, case
when CON.Email is not null
then 'Y'
else 'N'
end
Ship_Not_Option
, 'Y' Ship_Not
, 'ABCDEFG' Description_Goods
, '1' numberofpkgs
, 'PP' billing
, 'CP' pkgstype
, 'ST' service
, '1' weight
, null Shippernr
from ExactOnlineREST..GoodsDeliveries del
join ExactOnlineREST..Accounts acc
on ACC.ID = del.DeliveryAccount
join ExactOnlineREST..Addresses ADR
on ADR.ID = DEL.DeliveryAddress
join ExactOnlineREST..Contacts CON
on CON.ID = DEL.DeliveryContact
where DeliveryDate between $P{P_SHIPDATE_FROM} and $P{P_SHIPDATE_TO}
order
by DEL.DeliveryNumber
It takes many minutes to run. The number of deliveries and accounts grows by several hundred each day. Addresses and contacts are mostly 1:1 with accounts. How can this query be optimized for speed in Invantive Control for Excel?
This query is probably run at most once per day, since the delivery date does not contain a time. Therefore, the number of rows selected from ExactOnlineREST..GoodsDeliveries is several hundred. Based upon the statistics given, the number of accounts, delivery addresses and contacts is also approximately several hundred.
Normally, such a query would be optimized with a solution like the one in "Exact Online query with joins runs more than 15 minutes", but that solution will not work here: the third value of a join_set(soe, orderid, 100) hint is the maximum number of rows on the left-hand side to be used with index joins. At this moment, the maximum number on the left-hand side is something like 125, based upon constraints on the URL length for OData requests to Exact Online. Please remember the actual OData query is a GET using a URL, not a POST with unlimited size for the filter.
The alternatives are:
Split volume
Data Cache
Data Replicator
Have SQL engine or Exact Online adapted :-)
Split Volume
In a separate query select the eligible GoodsDeliveries and put them in an in-memory or database table using for instance:
create or replace table gdy#inmemorystorage as select ... from ...
Then create a temporary table per 100 rows (or a similar chunk size), such as:
create or replace table gdysubpartition1#inmemorystorage as select ... from ... where rowidx$ between 0 and 99
... etc for 100, 200, 300, 400, 500
And then run the query several times, each time with a different gdysubpartition1..gdysubpartition5 instead of the original from ExactOnlineREST..GoodsDeliveries.
Of course, you can also avoid the use of intermediate tables by using an inline view like:
from (select * from goodsdeliveries where date... limit 100)
or similar.
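To make the inline-view variant concrete, here is a minimal sketch of one slice of the original query (column list abbreviated; repeat it per slice, for instance by selecting different rowidx$ ranges from the in-memory table instead of using limit):
select DEL.DeliveryNumber
, ACC.Name
, ADR.City
, CON.Email
from ( select *
       from ExactOnlineREST..GoodsDeliveries
       where DeliveryDate between $P{P_SHIPDATE_FROM} and $P{P_SHIPDATE_TO}
       limit 100
     ) DEL
join ExactOnlineREST..Accounts ACC
on ACC.ID = DEL.DeliveryAccount
join ExactOnlineREST..Addresses ADR
on ADR.ID = DEL.DeliveryAddress
join ExactOnlineREST..Contacts CON
on CON.ID = DEL.DeliveryContact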
Data Cache
When you run the query multiple times per day (unlikely, but I don't know), you might want to cache the Accounts in a relational database and update it every day.
You can also use 'local memorize results clipboard' and 'local save results clipboard to ...' to save the last results to a file manually, and later restore them using 'local load results clipboard from ...' and 'local insert results clipboard in table ...'. And maybe then an insert into ... from exactonlinerest..accounts where datecreated > trunc(sysdate).
Data Replicator
With Data Replicator enabled, you can have replicas created and maintained automatically within an on-premise or cloud relational database for Exact Online API entities. For low latency, you will need to enable the Exact webhooks.
Have SQL Engine or Exact adapted
You can also register a request to have the SQL engine allow a higher number in the join_set hint, which would require addressing the Exact Online APIs in another way. Or register a request at Exact to also allow POST requests to the API with the filter in the body.
The SQL engine hides away all the nifty details of what API calls are being made. However, some cloud solutions have pricing per API call.
For instance:
select *
from transactionlines
retrieves all Exact Online transaction lines of the current company, but:
select *
from transactionlines
where financialyear = 2016
filters it effectively on the REST API of Exact Online to just that year, reducing the data volume. And:
select *
from gltransactionlines
where year_attr = 2016
retrieves all data, since the where clause is not forwarded to this XML API of Exact.
Of course I can attach Fiddler or Wireshark and try to analyze the data volume, but is there an easier way to analyze the data volume of the API calls with Invantive SQL?
First of all, all calls handled by Invantive SQL are logged in the Invantive Cloud together with:
the time
data volume in both directions
duration
to enable consistent API use monitoring across all supported cloud platforms. The actual data is not logged and travels directly.
You can query the same numbers from within your session, for instance:
select * from exactonlinerest..projects where code like 'A%'
retrieves all projects with a code starting with 'A'. And then:
select * from sessionios#datadictionary
shows you the API calls that were made.
You can also put a query like the following at the end of your session before logging off:
select main_url
, sum(bytes_received) bytes_received
, sum(duration_ms) duration_ms
from ( select regexp_replace(url, '\?.*', '') main_url
, bytes_received
, duration_ms
from sessionios#datadictionary
)
group
by main_url
with a result showing the total bytes received and total duration per API endpoint.
Quick synopsis of the problem:
I am working on a graph page to map the performance of a device my company is working on.
I get a new stat point (timestamp, stats, nodeid, volumeid, clusterid) every 2 seconds from every node.
This results in approximately 43k records per day per node per stat.
Now let's say I have 13 stats; that's roughly 520k records a day.
So a row would look something like:
timestamp typeid clusterid nodeid volumeid value
01/02/2016 05:02:22 0 1 1 1 82.20
A brief explanation: we decided to go with MySQL because it's easily scalable in Amazon. I was using InfluxDB before, which could easily solve this problem, but there is no way to auto-scale InfluxDB in Amazon.
My ultimate goal is to get a return value that looks like:
object[ {
node1-stat1: 20.0,
node2-stat1: 23.2,
node3-stat1: xx.x,
node1-stat2: 20.0
node2-stat2: xx.x,
node3-stat2: xx.x,
timestamp: unixtimestamp
},
{
node1-stat1: 20.0,
node2-stat1: 23.2,
node3-stat1: xx.x,
node1-stat2: 20.0
node2-stat2: xx.x,
node3-stat2: xx.x,
timestamp: unixtimestamp + 2 seconds
}]
I currently have a query that gathers all the unique timestamps,
then loops over those to get the values belonging to each timestamp,
and puts the result into an object.
That produces the desired output, but it takes forever and issues over a million queries.
Can something like this even be done in MySQL? Should I go back to a time-series DB and just deal with scaling it manually?
// EDIT //
I think I might have solved my problem:
SELECT data_points.*, data_types.friendly_name as friendly_name
FROM data_points, data_types
WHERE (cluster_id = '5'
AND data_types.id = data_points.data_type_id
AND unix_timestamp(timestamp) BETWEEN '1456387200' AND '1457769599')
ORDER BY timestamp, friendly_name, node_id, volume_id
This gives me all the fields I need.
I then loop over these data points, create a new "object" for each timestamp, and add the stats that match that timestamp to the object.
This executes in under a second while going over a million records.
I will definitely try to see whether swapping to a time-series DB will make an improvement in the future.
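If you want to skip the application-side loop entirely, conditional aggregation (a simple pivot) can also produce one row per timestamp directly in MySQL. Here is a minimal sketch using the table and column names from the query above; the node ids and stat names are placeholders, and every node/stat combination you want becomes one extra column:
SELECT unix_timestamp(dp.timestamp) AS ts,
       MAX(CASE WHEN dp.node_id = 1 AND dt.friendly_name = 'stat1' THEN dp.value END) AS `node1-stat1`,
       MAX(CASE WHEN dp.node_id = 2 AND dt.friendly_name = 'stat1' THEN dp.value END) AS `node2-stat1`,
       MAX(CASE WHEN dp.node_id = 1 AND dt.friendly_name = 'stat2' THEN dp.value END) AS `node1-stat2`
FROM data_points dp
JOIN data_types dt ON dt.id = dp.data_type_id
WHERE dp.cluster_id = 5
  AND unix_timestamp(dp.timestamp) BETWEEN 1456387200 AND 1457769599
GROUP BY ts
ORDER BY ts;
If you track multiple volumes per node, add dp.volume_id to the CASE conditions as well.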
I am trying to implement a recovery community meeting finder. I have a database and a map set up. I am trying to add a variable to display the current day's meetings, based on a "nearest location" priority. How do I use today's date in my database to selectively display only that day's meetings? I'm using the Google Maps API.
Thanks,
Terry
In your database you have a datetime field in which you store the date and time of the meetings, I am guessing.
When you pull the information from your database to put on the map, simply use the appropriate SQL to select only those records that match the date you need.
This is actually more of a SQL problem.
The closest-location part may not be catered for yet, but what you need to do there is also a SQL question. You need to add lat and long fields to your database and store the lat/long for each of the meeting locations.
Then you can do a distance-based SQL search once you have those and the lat/long of the user.
Maps doesn't come into the selection process at all.
...
Edit.
Selecting by time is fairly simple, but I thought I would share a distance-based SQL SELECT I used a while back. Note: it was used with MySQL, but I think it should work with almost any SQL DB.
"SELECT name, tel, (((acos(sin((".$latitude."*pi()/180)) * sin((`latitude`*pi()/180))+cos((".$latitude."*pi()/180)) * cos((`latitude`*pi()/180)) * cos(((".$longitude."- `longitude`)*pi()/180))))*180/pi())*60*1.1515) AS distance FROM takeaway WHERE active AND (((acos(sin((".$latitude."*pi()/180)) * sin((`latitude`*pi()/180))+cos((".$latitude."*pi()/180)) * cos((`latitude`*pi()/180)) * cos(((".$longitude."- `longitude`)*pi()/180))))*180/pi())*60*1.1515) < 10 AND tel AND type = 'transport/taxi' ORDER BY distance LIMIT 5"
That gives you the basics for editing and reusing. Just remember to add the time/date check into your final string.
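For the time/date check itself, here is a minimal sketch, assuming a meetings table with a meeting_datetime column (the table and column names are placeholders, so adapt them to your schema):
SELECT name, latitude, longitude, meeting_datetime
FROM meetings
WHERE DATE(meeting_datetime) = CURDATE()
ORDER BY meeting_datetime
If your meetings repeat weekly rather than having individual dates, compare the day of week instead, e.g. WHERE DAYOFWEEK(meeting_datetime) = DAYOFWEEK(CURDATE()). Combine that condition with the distance calculation above in the same WHERE clause.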