I use Beanstalkd with the Yii2 framework.
To add a job to the queue I use something like this:
Yii::$app->beanstalk
->putInTube('tube2', ['param' => 'val'], PheanstalkInterface::DEFAULT_PRIORITY, PheanstalkInterface::DEFAULT_DELAY);
But now I need to schedule a task to run at a specified time. Is that possible with Beanstalkd, or do I need something like Resque?
You can schedule a task for a specified time by calculating the delay (in seconds) between now and that time, and passing it as the delay parameter in your example above.
Alternatively, you could store time-based lists in something like Redis and have a cron job that reads the expired entries every minute and loads those jobs into beanstalkd.
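A minimal sketch of the delay calculation, written in Python for brevity rather than PHP. The function name delay_seconds is only illustrative; its result is the value that would be passed in place of DEFAULT_DELAY, since beanstalkd takes the delay as a whole number of seconds.

```python
import math
from datetime import datetime, timezone


def delay_seconds(run_at: datetime, now: datetime) -> int:
    """Whole seconds to wait from `now` until `run_at`, never negative.

    Rounds up so the job does not become ready before `run_at`.
    """
    remaining = (run_at - now).total_seconds()
    return max(0, math.ceil(remaining))


# Example: schedule a job 30 minutes ahead of a fixed "now".
now = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
run_at = datetime(2024, 1, 1, 12, 30, 0, tzinfo=timezone.utc)
print(delay_seconds(run_at, now))  # 1800
```

A target time that is already in the past yields a delay of 0, so the job becomes ready immediately.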
We used to use the guava cache and we want to change it to caffeine.
We want to set for each entity its own "expiration time", something like - put(K key, V value, long expiration_time).
I saw the 3 functions above and I wonder what exactly they are doing. If you can explain the meaning and the operation of each one of them, it will be great.
For example, should the return value of expireAfterCreate be the duration we want for this entity from its creation until its expiration? Or something else?
I'm also wondering why we have the parameter "currentTime" in both expireAfterRead and expireAfterUpdate if we don't use it in the function?
When we used the guava cache we used the expireAfterAccess, what is the substitution for it in caffeine?
My last question is how can I set a default value for entities without a unique expiration time.
Thank you,
May
When we used the guava cache we used the expireAfterAccess, what is the substitution for it in caffeine?
We mirror the Guava API, so this is also available on the cache builder.
My last question is how can I set a default value for entities without a unique expiration time.
Use expireAfterAccess, expireAfterWrite, or return a constant duration with expireAfter(Expiry).
I saw the 3 functions above and I wonder what exactly they are doing. If you can explain the meaning and the operation of each one of them, it will be great.
Expiry is a callback interface where a single timestamp value is updated. The invoked method corresponds to the operation performed on the cache entry (created, updated, read). An update or read that should have no effect can return currentDuration to no-op.
For example, should the return value of expireAfterCreate be the duration we want for this entity from its creation until its expiration? Or something else?
Yes. However, if expireAfterUpdate returns a custom value (something other than currentDuration), then that overrides the prior expiration duration.
I'm also wondering why we have the parameter "currentTime" in both expireAfterRead and expireAfterUpdate if we don't use it in the function?
This can most often be ignored, but is provided if somehow useful. It is the current nano timestamp from the Ticker (not wall clock time).
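To make the callback semantics concrete, here is a toy model in Python. It is not Caffeine's implementation or API: the class names, the snake_case method names, and the injected ticker are all assumptions made for illustration. It only shows the idea described above: each entry has a single deadline, the callback matching the operation decides the new remaining lifetime, and returning the current duration leaves the deadline unchanged.

```python
import time


class Expiry:
    """Callbacks return the entry's new remaining lifetime in nanoseconds."""

    def expire_after_create(self, key, value, current_time):
        raise NotImplementedError

    def expire_after_update(self, key, value, current_time, current_duration):
        return current_duration  # no-op: keep the existing deadline

    def expire_after_read(self, key, value, current_time, current_duration):
        return current_duration  # no-op: keep the existing deadline


class ToyCache:
    """Toy cache with one deadline per entry (illustration only)."""

    def __init__(self, expiry, ticker=time.monotonic_ns):
        self._expiry = expiry
        self._ticker = ticker  # nanosecond clock, not wall-clock time
        self._entries = {}     # key -> (value, deadline)

    def put(self, key, value):
        now = self._ticker()
        entry = self._entries.get(key)
        if entry is None or entry[1] <= now:
            duration = self._expiry.expire_after_create(key, value, now)
        else:
            duration = self._expiry.expire_after_update(key, value, now, entry[1] - now)
        self._entries[key] = (value, now + duration)

    def get(self, key):
        now = self._ticker()
        entry = self._entries.get(key)
        if entry is None or entry[1] <= now:
            self._entries.pop(key, None)  # missing or expired
            return None
        value, deadline = entry
        duration = self._expiry.expire_after_read(key, value, now, deadline - now)
        self._entries[key] = (value, now + duration)
        return value


class TtlFromValue(Expiry):
    """Each value carries its own lifetime (hypothetical 'ttl_ns' field)."""

    def expire_after_create(self, key, value, current_time):
        return value["ttl_ns"]


clock = [0]  # fake ticker so the example is deterministic
cache = ToyCache(TtlFromValue(), ticker=lambda: clock[0])
cache.put("a", {"ttl_ns": 100})
clock[0] = 60
print(cache.get("a") is not None)  # True: the read did not extend the lifetime
clock[0] = 100
print(cache.get("a"))  # None: the deadline set at creation has passed
```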
We want to set for each entity its own "expiration time", something like - put(K key, V value, long expiration_time).
The callback Expiry is required and generally recommended, because ideally entries are loaded through the cache to avoid stampedes (e.g. LoadingCache). A stampede is when multiple threads look up the same entry, miss, load it, and overwrite each other putting it in. That is wasted work compared with having only one thread perform the load while the others wait for the result.
That said, this method is available under Cache.policy().expireVariably(). Those configuration-specific methods are stashed in that area to offer more power when deemed necessary.
Thank you,
You're very welcome.
I am trying to ingest data from a 3rd party API into a Dataflow pipeline. Since the 3rd party doesn't make webhooks available, I wrote a custom script that constantly polls their endpoint for more data.
The data is refreshed every 15 minutes, but since I don't want to miss any datapoints and I want to consume as soon as new data is available, my "crawler" runs every 1 minute. The script then sends the data to a PubSub topic. Easy to see that PubSub will receive about 15 repeated messages for each datapoint in the source.
My first attempt to identify and discard those repeated messages was to add a custom attribute to each PubSub message (eventid), created from a hash of its [ID + updated_time] at source.
const attributes = {
eventid: Buffer.from(`${item.lastupdate}|${item.segmentid}`).toString('base64'),
timestamp: item.timestamp.toString()
};
const dataBuffer = Buffer.from(JSON.stringify(item))
publisher.publish(dataBuffer, attributes)
Then I configured Dataflow with a withIdAttribute() (which is the new idLabel(), based on Record IDs).
PCollection<String> input = p
.apply("ReadFromPubSub", PubsubIO
.readStrings()
.fromTopic(String.format("projects/%s/topics/%s", options.getProject(), options.getIncomingDataTopic()))
.withTimestampAttribute("timestamp")
.withIdAttribute("eventid"))
.apply("OutputToBigQuery", ...)
With that implementation, I was expecting that when the script sends the same datapoint a second time, the repeated eventid would be the same and the message discarded. But for some reason, I still see duplicates on the output dataset.
Some questions:
Is there a clever way to ingest the data to dataflow from that 3rd party API if they don't provide webhooks?
Any ideas on why Dataflow is not discarding the messages in this situation?
I know about the 10-minute restriction for deduplication on dataflow, but I see duplicated data even on the 2nd insertion (2 minutes).
Any help will be greatly appreciated!
I think you are on the right track, but instead of the hash I recommend using timestamps. A better way to do this is with windows. Review this document, which filters out data that falls outside the window.
Regarding the additional duplicate data: if you are using pull subscriptions and the acknowledgement deadline is reached before the data has been processed, the message will be resent, as per at-least-once delivery. In that case, increase the acknowledgement deadline; the default is 10 seconds.
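As a rough illustration of ID-based deduplication within a time window, here is a toy model in Python. It is not how Dataflow implements deduplication; the class WindowedDeduper, its eviction rule, and the sample item are assumptions. The event_id helper mirrors the base64 construction from the question.

```python
import base64


def event_id(item):
    """Same construction as in the question: base64 of 'lastupdate|segmentid'."""
    raw = f"{item['lastupdate']}|{item['segmentid']}"
    return base64.b64encode(raw.encode("utf-8")).decode("ascii")


class WindowedDeduper:
    """Accepts an id once per window; times are plain seconds."""

    def __init__(self, window_seconds=600):
        self._window = window_seconds
        self._first_seen = {}  # event id -> time it was accepted

    def accept(self, eid, now):
        # Forget ids whose window has fully elapsed.
        for key in [k for k, t in self._first_seen.items() if now - t >= self._window]:
            del self._first_seen[key]
        if eid in self._first_seen:
            return False  # duplicate inside the window
        self._first_seen[eid] = now
        return True


item = {"lastupdate": "2019-01-01T10:00:00Z", "segmentid": 42}  # hypothetical datapoint
dedup = WindowedDeduper(window_seconds=600)
print(dedup.accept(event_id(item), now=0))    # True: first time seen
print(dedup.accept(event_id(item), now=120))  # False: repeat after 2 minutes
print(dedup.accept(event_id(item), now=600))  # True: the toy window has elapsed
```

In this toy model, a datapoint that is republished every minute for 15 minutes is accepted again once the 10-minute window has passed, which is worth keeping in mind when the polling period is longer than the deduplication window.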
This issue has been bugging me for some time now. To test it I installed a fresh Apigility, set up the db (PDO:mysql) and added a DB-Connected service. The table has 40+ records. When I make a GET collection request, the response looks OK (with the default HAL content negotiation). Then I change the content negotiation to JSON. Now when I make a GET collection request, my response contains only 10 elements.
So my question is: where do I set/change this limit?
You can set the page size manually, like so:
$paginator = $this->getAlbumTable()->fetchAll(true);
// set the current page to what has been passed in query string, or to 1 if none set
$paginator->setCurrentPageNumber((int) $this->params()->fromQuery('page', 1));
// set the number of items per page to 10
$paginator->setItemCountPerPage(10);
http://framework.zend.com/manual/current/en/tutorials/tutorial.pagination.html
Could you please send the page_size, total_items part at the end of the json output?
it's like:
"page_count": 140002,
"page_size": 25,
"total_items": 3500035,
"page": 1
This is not an ideal fix, because it requires you to go into the source code rather than using the page size given in the UI.
The collection class that is auto-generated for you by the DB-Connected style derives from Zend\Paginator\Paginator. That class defines the protected static member $defaultItemCountPerPage, which defaults to 10. That's why you're only getting 10 results. If you open the auto-generated collection class for your entity and add: protected static $defaultItemCountPerPage = 100; to the otherwise empty class, you will see that you now get up to 100 results in the response. You can look at the other Paginator class variables and methods that you could override in your derived class to get the behavior you want.
This is not an ideal solution. I'd prefer that the generated code automatically used the same configured page size that the HalJson strategy uses. Maybe I'll contribute a PR to change that. Or maybe I'll just use the HalJson approach. It does seem like the better way to go. You should have some limit on how much data you load from the DB at a time, so that you don't end up with an overly long-running query or an overly large collection of data to deal with. And whatever limit you set, what do you do when you hit it? With the simple Json method, you can't ever get "page 2" of the data. So if you are going to work with a sizeable amount of data, it might be better to use HalJson and then have some logic on the client side to grab pages of data as needed. The returned JSON structure is a little more complicated, but not terribly so.
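The client-side paging logic mentioned above could look roughly like this Python sketch. It assumes the usual HAL layout, with the items under _embedded and the next page under _links.next.href. The fetch callable, the URLs, and the "albums" collection key are hypothetical stand-ins for a real HTTP client and service.

```python
def iter_items(fetch, first_url, collection_key):
    """Yield every item of a HAL collection by following 'next' links.

    `fetch` is any callable that takes a URL and returns the decoded JSON page.
    """
    url = first_url
    while url:
        page = fetch(url)
        for item in page.get("_embedded", {}).get(collection_key, []):
            yield item
        # No 'next' link on the last page, so the loop ends there.
        url = page.get("_links", {}).get("next", {}).get("href")


# Fake two-page response standing in for real HTTP calls.
fake_pages = {
    "/albums?page=1": {
        "_links": {"next": {"href": "/albums?page=2"}},
        "_embedded": {"albums": [{"id": 1}, {"id": 2}]},
        "page_count": 2, "page_size": 2, "total_items": 3, "page": 1,
    },
    "/albums?page=2": {
        "_links": {},
        "_embedded": {"albums": [{"id": 3}]},
        "page_count": 2, "page_size": 2, "total_items": 3, "page": 2,
    },
}

ids = [item["id"] for item in iter_items(fake_pages.__getitem__, "/albums?page=1", "albums")]
print(ids)  # [1, 2, 3]
```

Because iter_items is a generator, a client can stop early and only the pages it actually consumed are fetched.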
I'm probably in the same spot you are -- I'm trying to do a simple little api to play with while keeping everything simple and so I didn't want the client to have to deal with the other stuff in HalJson, but probably better to deal with that complexity and have a smooth way to page through data if you're going to use this with some real set of data. At least, that's the pep talk I'm giving myself right now. :-)
Given: an extension that has a properly configured cronjob (let's say, every 5 minutes) in config.xml. Also, the system cron is set to run Magento's cron.sh. The cronjob has to run a couple of times after the extension is installed, and once it has no more data to process it becomes obsolete.
Problem: the job isn't needed after it has processed all the data. However, its setup in config.xml causes it to run every 5 minutes forever, just to check that there is no more data and exit.
Question: is there any proper way (maybe with the cron_schedule table...) to 'dismiss' the cronjob programmatically from its own PHP when it sees that there is no more data? Or any other way?
The cron is used since the extension installation process shouldn't be interrupted. Maybe it's possible to schedule some PHP code in some other way than cron (but within Magento)? Thought about threading but since there is no guarantee that this feature will be built in, this doesn't seem to be the option....
Thanks in advance!
So, I found 2 possible solutions: 1) it seems to be possible to create/remove crontabs via core_config_data table without config.xml; 2) remove the crontab node from config.xml after all data is processed + clean the cache + remove all pending tasks. I've managed to implement the 2nd, and it works (I know that the 1st approach is much better, but I just had no time to dig it out).
The 2nd looks like:
if ($more_data) {
    // processing...
} else { // Dismissing the cron
    $config_xml_path = Mage::getModuleDir('etc', 'the_extension') . '/config.xml';
    $config_xml = simplexml_load_file($config_xml_path) or die("Error: Cannot create object");
    if (isset($config_xml) && isset($config_xml->crontab)) {
        unset($config_xml->crontab);
        $config_xml->asXML($config_xml_path);
    }
    // Cleaning
    Mage::app()->cleanCache();
    $schedule = Mage::getModel('cron/schedule');
    $sch_col = $schedule->getCollection()
        ->addFilter('job_code', 'the_extension_cronFunc')
        ->addFilter('status', 'pending');
    foreach ($sch_col as $s) {
        $s->delete();
    }
}
Question on agents: I specifically want to create a Periodic Task, but I only want to run it once a day, say at 1am, not every 30 minutes, which is the default. In OnInvoke, do I simply check the hour and run it only if the current hour matches the desired hour?
But on the next OnInvoke call, it will try to run again in 30 minutes, maybe when it's 1:31am.
So I guess I'd use a stored boolean in the app settings to mark as "already run for today" or similar, and then check against that value?
If you specifically want to run a custom action at 1 am, I'm not sure that a single boolean would be enough to make it work.
I guess you plan to reset your boolean at 1:31 to prepare for the next day's run, but what if your periodic task is also invoked at 1:51 (that is, more than twice between 1 am and 2 am)?
How could this happen? Maybe it could happen if the device is rebooted, but I'm not quite sure about that. In any case, storing the last execution datetime somewhere and comparing it to the current one is a safer way to ensure that your action is only invoked once per day.
One question remains: where to store your boolean or datetime (depending on which one you pick)?
AppSettings does not seem to be a recommended place, according to MSDN:
Passing information between the foreground app and background agents
can be challenging because it is not possible to predict if the agent
and the app will run simultaneously. The following are recommended
patterns for this.
For Periodic and Resource-intensive Agents: Use LINQ 2 SQL or a file in isolated storage that is guarded with a Mutex. For
one-direction communication where the foreground app writes and the
agent only reads, we recommend using an isolated storage file with a
Mutex. We recommend that you do not use IsolatedStorageSettings to
communicate between processes because it is possible for the data to
become corrupt.
A simple file in isolated storage should get the job done.
If you're going by date (once per day) and it's valid that the task can run at 11pm on a day and 1am the next, then after the agent has run you could store the current date (forgetting about time). Then whenever the agent runs again in 30 minutes, check if the date the task last ran is the same as the current date.
protected override void OnInvoke(ScheduledTask task)
{
    var settings = IsolatedStorageSettings.ApplicationSettings;
    // On the very first run the key does not exist yet;
    // lastRunDate then stays at DateTime.MinValue and the task runs.
    DateTime lastRunDate;
    settings.TryGetValue("LastRunDate", out lastRunDate);
    if (DateTime.Today.Subtract(lastRunDate).Days > 0)
    {
        // it's a greater date than when the task last ran
        // DO STUFF!
        // save the date - we only care about the date part
        settings["LastRunDate"] = DateTime.Today;
        settings.Save();
    }
    NotifyComplete();
}
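Combining the date check above with the "not before 1 am" requirement from the question, the decision can be sketched as a small pure function. This is written in Python for brevity, and the name should_run and its parameters are illustrative only. It assumes that running later in the day is acceptable if the agent was not invoked around 1 am.

```python
from datetime import date, datetime
from typing import Optional


def should_run(now: datetime, last_run: Optional[date], target_hour: int = 1) -> bool:
    """True if the daily task is due: it is past the target hour today
    and the task has not yet run on today's date."""
    if now.hour < target_hour:
        return False  # too early today, wait for a later invocation
    return last_run is None or last_run < now.date()


print(should_run(datetime(2024, 5, 2, 1, 10), date(2024, 5, 1)))  # True: first invocation after 1 am
print(should_run(datetime(2024, 5, 2, 1, 40), date(2024, 5, 2)))  # False: already ran today
print(should_run(datetime(2024, 5, 2, 0, 40), date(2024, 5, 1)))  # False: not yet 1 am
```

Storing only the date of the last run keeps the check robust against extra invocations within the same hour, such as the 1:31 and 1:51 calls discussed earlier.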