Parsing a large JSON file using a Symfony controller

I have a large JSON file (not large in size, but in number of elements): it has 30,000 JSON elements, and I am trying to produce entities from it as it is read.
So far I have it reading the file with Guzzle, and it ends up producing about 1,500 entities before it crashes. I feel that I must be doing this the wrong way.
Here is my code:
public function generateEntities(Request $request, $number)
{
    $client = new \GuzzleHttp\Client();
    $request = new \GuzzleHttp\Psr7\Request('GET', 'http://www.example.com/file.json');
    $promise = $client->sendAsync($request)->then(function ($response) {
        $batchSize = 20;
        $i = 0;
        foreach (json_decode($response->getBody()) as $entityItem) {
            $entity = new Entity();
            $entity->setEntityItem($entityItem->string);
            $em = $this->getDoctrine()->getManager();
            $em->persist($entity);
            if (($i % $batchSize) === 0) {
                $em->flush(); // Executes all updates
            }
            $i++;
        }
        $em->flush();
    });
    $promise->wait();
    return $this->redirect($this->generateUrl('show_entities'));
}
I worked out from research that I should be clearing the entity manager frequently, so I added batch sizing to flush it every 20 entities created. This did help, but it is not enough to load the full 30,000 elements.
Maybe I am completely wrong and should be handling this a different way?
Is it possible for someone to please point me in the right direction? I am happy to work it out on my own, I am just a little unsure where to proceed from here.
Thank you!

You could improve your process in two ways:
1) Increase the time limit for the execution of the controller action with the function set_time_limit, so put this as the first line of the controller:
public function generateEntities(Request $request, $number)
{
set_time_limit(0); // set to zero, no time limit is imposed
2) Free as much memory as possible on each iteration: flush the data to the database, then detach the entity and free the memory, as follows:
$em->persist($entity);
$em->flush($entity); // flush only this entity (note: single-entity flush is deprecated in newer Doctrine versions)
$em->detach($entity); // stop the entity manager from tracking this entity
$em->clear(); // or detach ALL managed entities at once (clear() takes an optional entity class name, not an instance)
unset($entity); // release the PHP reference so the memory can be reclaimed
Hope this helps.

The script is running out of memory because all 30,000 entities are kept managed in memory. You need to detach the entities from the manager periodically to make sure they are "garbage collected". Use $em->clear(); in your batch flushing block to ensure memory isn't exhausted. See the Doctrine page on batch operations for more information.
Keep in mind, though, that $em->clear() will detach all entities from the manager, not just the ones you are using in this loop.
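As an illustration, the loop might look like the following sketch, reusing the variable names from the question (the batch size of 20 is the question's own choice):
$em = $this->getDoctrine()->getManager();
$batchSize = 20;
$i = 0;
foreach (json_decode($response->getBody()) as $entityItem) {
    $entity = new Entity();
    $entity->setEntityItem($entityItem->string);
    $em->persist($entity);
    if ((++$i % $batchSize) === 0) {
        $em->flush(); // push the pending inserts to the database
        $em->clear(); // detach ALL managed entities so PHP can garbage-collect them
    }
}
$em->flush(); // flush the final, partial batch
$em->clear();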

Related

Mallet API - Get consistent results

I am new to LDA and Mallet, and I have the following query.
I tried running Mallet-LDA from the command line, and by setting --random-seed to a fixed value I was able to get consistent results across multiple runs of the algorithm.
However, when I tried the Mallet Java API, I got different output every time I ran the program.
I googled around and found out that the random seed needs to be fixed, and I do fix it in my Java code, but I am still getting different results.
Could anyone let me know what other parameters I need to consider for consistent results (when run multiple times)?
I might add that train-topics, when run multiple times from the command line, yields the same result. However, when I rerun import-dir and then run train-topics, the results do not match the previous ones (probably as expected).
I am OK with running import-dir just once and then experimenting with different numbers of topics and iterations by running train-topics.
Similarly, what needs to be changed or kept constant if I want to replicate this when using the Java API?
I was able to solve this.
I will respond in detail here:
There are two ways in which Mallet can be run:
a. Command-line mode
b. Using the Java API
To get consistent results across different runs, we need to fix the 'random seed', and on the command line we have an option for setting it. No surprises there.
However, while using the API, although we also have an option for setting the 'random seed', we need to know that it must be done at the proper point, or it does not work (see the code below).
I have pasted the code here that creates a model (read: InstanceList) file from the data;
we can then reuse that model file, set the random seed, and get consistent (read: identical) results every time we run.
Creating and saving the model for later use.
Note: follow this link to see the format of the input file:
http://mallet.cs.umass.edu/ap.txt
public void getModelReady(String inputFile) throws IOException {
    if (inputFile != null && (!inputFile.isEmpty())) {
        List<Pipe> pipeList = new ArrayList<Pipe>();
        pipeList.add(new Target2Label());
        pipeList.add(new Input2CharSequence("UTF-8"));
        pipeList.add(new CharSequence2TokenSequence());
        pipeList.add(new TokenSequenceLowercase());
        pipeList.add(new TokenSequenceRemoveStopwords());
        pipeList.add(new TokenSequence2FeatureSequence());

        Reader fileReader = new InputStreamReader(new FileInputStream(new File(inputFile)), "UTF-8");
        CsvIterator ci = new CsvIterator(fileReader,
                Pattern.compile("^(\\S*)[\\s,]*(\\S*)[\\s,]*(.*)$"),
                3, 2, 1); // data, label, name fields

        InstanceList instances = new InstanceList(new SerialPipes(pipeList));
        instances.addThruPipe(ci);

        ObjectOutputStream oos;
        oos = new ObjectOutputStream(new FileOutputStream("Resources\\Input\\Model\\Model.vectors"));
        oos.writeObject(instances);
        oos.close();
    }
}
Once the model file is saved, the following method uses it to generate topics:
public void applyLDA(ParallelTopicModel model) throws IOException {
    InstanceList training = InstanceList.load(new File("Resources\\Input\\Model\\Model.vectors"));
    logger.debug("InstanceList Data loaded.");

    if (training.size() > 0 && training.get(0) != null) {
        Object data = training.get(0).getData();
        if (!(data instanceof FeatureSequence)) {
            logger.error("Topic modeling currently only supports feature sequences.");
            System.exit(1);
        }
    }

    // The random seed HAS to be set here, BEFORE calling addInstances().
    model.setRandomSeed(5);

    model.addInstances(training);
    model.estimate();
    model.printTopWords(new File("Resources\\Output\\OutputFile\\topic_keys_java.txt"), 25, false);
    model.printDocumentTopics(new File("Resources\\Output\\OutputFile\\document_topicssplit_java.txt"));
}
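For completeness, a minimal sketch of how the two methods might be wired together; the driver class name, input path, and topic count here are placeholders, not part of the original answer:
// Hypothetical driver code: build the instance list once, then train with a fixed seed.
MalletTopics runner = new MalletTopics(); // placeholder for whatever class holds the two methods above
runner.getModelReady("Resources\\Input\\input.txt"); // run once; writes Model.vectors
ParallelTopicModel model = new ParallelTopicModel(50); // 50 topics, chosen arbitrarily here
runner.applyLDA(model); // sets the seed before addInstances(), then estimates and prints results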

How to solve AttachAsync of a DownloadOperation not returning immediately?

When using the Background Transfer API, we must iterate through the current data transfers to start them again when the app restarts after a termination (i.e. a system shutdown). To get progress information, and to be able to cancel the data transfers, they must be attached using AttachAsync.
My problem is that AttachAsync only returns when the data transfer is finished. That makes sense in some scenarios, but when there are multiple data transfers, the next transfer in the list is not started until the currently attached one has finished. My solution to this problem was to handle the Task that AttachAsync().AsTask() returns in the classic way (no await, but continuations):
IReadOnlyList<DownloadOperation> currentDownloads =
    await BackgroundDownloader.GetCurrentDownloadsAsync();

foreach (var downloadOperation in currentDownloads)
{
    Task task = downloadOperation.AttachAsync().AsTask();
    DownloadOperation operation = downloadOperation;

    task.ContinueWith(_ =>
    {
        // Handle success
        ...
    }, CancellationToken.None, TaskContinuationOptions.OnlyOnRanToCompletion,
       TaskScheduler.FromCurrentSynchronizationContext());

    task.ContinueWith(_ =>
    {
        // Handle cancellation
        ...
    }, CancellationToken.None, TaskContinuationOptions.OnlyOnCanceled,
       TaskScheduler.FromCurrentSynchronizationContext());

    task.ContinueWith(t =>
    {
        // Handle errors
        ...
    }, CancellationToken.None, TaskContinuationOptions.OnlyOnFaulted,
       TaskScheduler.FromCurrentSynchronizationContext());
}
It kind of works (in the actual code I add the downloads to a ListBox). The loop iterates through all downloads and executes AttachAsync on each. But the downloads are not really started all at the same time: only one is running at a time, and only when it finishes does the next one continue.
Any solution for this problem?
The whole point of Task is to give you the option of parallel operations. If you await, you are telling the code to serialize the operations; if you don't await, you are telling the code to parallelize them.
What you can do is add each download task to a list, telling the code to parallelize. You can then wait for the tasks to finish, one by one.
How about something like:
IReadOnlyList<DownloadOperation> currentDownloads =
    await BackgroundDownloader.GetCurrentDownloadsAsync();

if (currentDownloads.Count > 0)
{
    List<Task<DownloadOperation>> tasks = new List<Task<DownloadOperation>>();
    foreach (DownloadOperation downloadOperation in currentDownloads)
    {
        // Attach progress and completion handlers without waiting for completion
        tasks.Add(downloadOperation.AttachAsync().AsTask());
    }

    while (tasks.Count > 0)
    {
        // wait for ANY download task to finish
        Task<DownloadOperation> task = await Task.WhenAny<DownloadOperation>(tasks);
        tasks.Remove(task);

        // process the completed task...
        if (task.IsCanceled)
        {
            // handle cancellation
        }
        else if (task.IsFaulted)
        {
            // handle the exception
        }
        else if (task.IsCompleted)
        {
            DownloadOperation dl = task.Result;
            // handle completion (e.g. add to your listbox)
        }
        else
        {
            // should never get here...
        }
    }
}
I hope this is not too late, but I know exactly what you are talking about; I am also trying to resume all downloads when the application starts.
After hours of trying, here is the solution that works.
The trick is to let the download operation resume first, before attaching the progress handler:
downloadOperation.Resume();
// cts is a CancellationTokenSource defined elsewhere in the poster's code
await downloadOperation.AttachAsync().AsTask(cts.Token);

Amfphp is working slowly with Flex 4

Well,
I am trying to use a remote object (Amfphp) in a project that currently uses HTTPService. I heard it would make my application faster, but when I tried Amfphp in a datagrid for testing purposes, I found it takes even more time than HTTPService. Here is what I have done so far.
AS3 code to call the PHP function:
public function init():void {
    var params:Array = new Array();
    params.push("1234");

    _amf = new RemoteObject();
    _amf.destination = "dummyDestination";
    _amf.endpoint = "http://insight2.ultralysis.com/Amfphp/Amfphp/"; // http://insight2.ultralysis.com
    _amf.source = "manager1";
    _amf.addEventListener(ResultEvent.RESULT, handleResult);
    _amf.addEventListener(FaultEvent.FAULT, handleFault);
    _amf.init(params);
}

public function handleResult(event:ResultEvent):void {
    myGrid.dataProvider = event.result.grid;
}
And the PHP function to fetch data from the MySQL database:
class output {
    public $grid;
    public $week;
}

function form()
{
    $arrayOut = new output();
    $arrayOut->grid = $this->gridValue();
    $arrayOut->week = $this->getAllWeek($this->ThisYear);
    return $arrayOut;
}
Everything works fine, but it takes almost 5 seconds to fetch and render 280 rows of data.
Can anyone please help me make it as fast as it should be? I have already tried the optimization tips from Silex Labs.
I used a packet sniffer, and its stats say that latency is consuming most of the time, about 5 seconds. What is that latency? I could really use some help here.
Try using Amfphp 1.9.
The Amfphp 2.x version is unfortunately slower than 1.9.

Rx throttle on WinRT TemplatedControl

I have some problems with thread synchronization in my templated control (I am trying to build an AutoComplete control).
Inside my control I have this code:
protected override void OnApplyTemplate()
{
    base.OnApplyTemplate();
    var searchTextBox = GetTemplateChild("SearchTextBox") as TextBox;
    if (searchTextBox != null)
    {
        var searchDelegate = SearchAsync;
        Observable.FromEventPattern(searchTextBox, "TextChanged")
            .Select(keyup => searchTextBox.Text)
            .Where(TextIsLongEnough)
            .Throttle(TimeSpan.FromMilliseconds(500))
            .Do(ShowProgressBar)
            .SelectMany(searchDelegate)
            .ObserveOn(Dispatcher)
            .Subscribe(async results => await RunOnDispatcher(() =>
            {
                IsInProgress = false;
                SearchResults.Clear();
                foreach (var result in results)
                {
                    SearchResults.Add(result);
                }
            }));
    }
}
It complains that, inside my ShowProgressBar method, I am trying to access code that was marshalled by another thread.
If I comment out both the Throttle and the ObserveOn(Dispatcher), it works just fine, but it does not throttle my service calls as I want it to.
If I only comment out the Throttle part, nothing happens at all.
Asti's got the right idea, but a far better approach would be to provide the IScheduler argument to Throttle instead:
// NB: Too lazy to look up real name
.Throttle(TimeSpan.FromMilliseconds(500), CoreDispatcherScheduler.Instance)
This will make operations below it happen on the UI thread (including your ShowProgressBar), up until the SelectMany.
Every dependency object requires that any changes to its dependency properties be made only on the Dispatcher thread. The Throttle is using a different scheduler, so making any changes to the UI in a subsequent Do combinator results in an access exception.
You can resolve this in two ways:
1) Add an ObserveOnDispatcher before any actions that cause side effects on the Dispatcher; optionally use another scheduler further down the pipeline.
2) Use Dispatcher.Invoke to execute side effects through the dispatcher, e.g. .Do(() => Dispatcher.Invoke(new Action(ShowProgressBar)))
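For illustration, a minimal sketch of the first option, reusing the pipeline and the ObserveOn(Dispatcher) call from the question's own code:
Observable.FromEventPattern(searchTextBox, "TextChanged")
    .Select(keyup => searchTextBox.Text)
    .Where(TextIsLongEnough)
    .Throttle(TimeSpan.FromMilliseconds(500)) // runs on a non-UI scheduler
    .ObserveOn(Dispatcher)                    // marshal back to the UI thread...
    .Do(ShowProgressBar)                      // ...so this UI side effect is now safe
    .SelectMany(searchDelegate)               // the search itself may run elsewhere
    .ObserveOn(Dispatcher)                    // marshal the results back again
    .Subscribe(results =>
    {
        IsInProgress = false;
        SearchResults.Clear();
        foreach (var result in results)
        {
            SearchResults.Add(result);
        }
    });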

Is it possible to reset a jQuery deferred object state?

Is it possible to reset a resolved jQuery Deferred object to an 'unresolved' state and kick off its initialization and callbacks all over again?
The specific thing I'm doing is that I have a jQuery deferred wrapper over the local file system api. From there I build up higher level deferreds for the things I care about:
getFs = defFs.requestQuota(PERSISTENT, 1024 * 1024)
  .pipe (bytes) -> defFs.requestFs(PERSISTENT, bytes)

getCacheContents = getFs.pipe (fileSystem) ->
  defFs.getDirectory('Cache', fileSystem.root).pipe (dir) ->
    defFs.readEntries(dir)
Now, most of the time when I call getCacheContents I don't mind the memoized values being returned; in fact, I prefer it. But on the occasions when I want to write to the cache, I would really like the ability to reset that pipe and have it re-select and re-scan the cache the next time it is accessed.
I could cobble something together from $.Callbacks, but a deferred-based solution would really be ideal.
No. A Promise is by definition a thing that resolves only once: from unresolved to fulfilled OR rejected. You will not be able to do this with jQuery's Deferreds.
What you are actually looking for are Signals. They can be fired more than once, but provide a similar interface. There are some implementations around; you might check out js-signals or wire.js.
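For a flavor of the difference, a minimal sketch using js-signals (the listener and payload here are hypothetical, just to illustrate repeat firing):
// A signal can fire any number of times, unlike a one-shot deferred.
var cacheChanged = new signals.Signal();

cacheChanged.add(function (entries) {
    console.log('cache rescanned, ' + entries.length + ' entries');
});

// Dispatch whenever the cache is rewritten; listeners run every time.
cacheChanged.dispatch([]);        // logs: cache rescanned, 0 entries
cacheChanged.dispatch([1, 2, 3]); // fires again, unlike a resolved deferred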
The only solution I could find is to reset the $.Deferred object and return a new Promise from it. It works together with some internal dirty checking of the API (whether something was edited or deleted), but it would be more performant to just reset the existing $.Deferred and let it re-resolve on the next Promise request.
An example of a possible solution:
$.myDeferredList = {};
$.createRestorableDeferred = function (a, b) {
    // built on a simple $.when().then()
    $.myDeferredList[a] = {
        deferred: $.Deferred(),
        then: b,
        restore: function () {
            $.myDeferredList[a].deferred = $.Deferred();
            $.when($.myDeferredList[a].deferred).then($.myDeferredList[a].then);
        },
        resolve: function () {
            $.myDeferredList[a].deferred.resolve();
        }
    };
    $.when($.myDeferredList[a].deferred).then($.myDeferredList[a].then);
    window[a] = $.myDeferredList[a];
};
var counter = 0;
$.createRestorableDeferred('myReady', function () {
    console.log('>> myReady WHEN called', ++counter);
    $.myDeferredList['myReady'].restore();
});

// RESOLVING ways
$.myDeferredList['myReady'].deferred.resolve();
$.myDeferredList.myReady.deferred.resolve();
myReady.resolve();
Results in console:
>> myReady WHEN called 1
>> myReady WHEN called 2
>> myReady WHEN called 3