I have a Windows Store application which manages a collection of objects and stores them in the application's local folder. The objects are serialized on the file system using JSON. As I need to be able to edit and persist these items individually, I opted for an individual file per object instead of one large file. Objects are stored following this pattern:
Local Folder
|
--- db
     |
     --- AB283376-7057-46B4-8B91-C32E663EC964
     |     |
     |     --- AB283376-7057-46B4-8B91-C32E663EC964.json
     |     --- AB283376-7057-46B4-8B91-C32E663EC964.jpg
     |
     --- B506EFC5-E853-45E6-BA32-64193BB49ACD
     |     |
     |     --- B506EFC5-E853-45E6-BA32-64193BB49ACD.json
     |     --- B506EFC5-E853-45E6-BA32-64193BB49ACD.jpg
     |
     ...
Each object has its own folder node, which contains the JSON-serialized object and any other associated resources.
Everything was fine when I ran some write, read and delete tests. Where it got complicated was when I tried to load a large collection of objects at application startup. I estimated the largest number of items one would store at 10,000, so I wrote 10,000 entries and then tried to load them... it took the application more than 3 minutes to complete the operation, which of course is unacceptable.
So my questions are: what could be optimized in the code I wrote for reading and deserializing objects (below)? Is there a way to implement a paging system so loading would be dynamic in my WinRT application? Is my storage method (the pattern above) too heavy in terms of IO/CPU? Am I missing something in WinRT?
public async Task<IEnumerable<Release>> GetReleases()
{
    List<Release> items = new List<Release>();
    var dbFolder = await ApplicationData.Current.LocalFolder.CreateFolderAsync(dbName, CreationCollisionOption.OpenIfExists);

    foreach (var releaseFolder in await dbFolder.GetFoldersAsync())
    {
        var releaseFile = await releaseFolder.GetFileAsync(releaseFolder.DisplayName + ".json");
        var stream = await releaseFile.OpenAsync(FileAccessMode.Read);
        using (var inStream = stream.GetInputStreamAt(0))
        {
            DataContractJsonSerializer serializer = new DataContractJsonSerializer(typeof(Release));
            Release release = (Release)serializer.ReadObject(inStream.AsStreamForRead());
            items.Add(release);
        }
        stream.Dispose();
    }

    return items;
}
Thanks for your help.
NB: I already had a look at SQLite, and I don't need such a sophisticated system.
Supposedly JSON.NET is better than the built-in serializers. If you are not sending the data over the wire, then the quickest way is binary serialization rather than JSON or XML. Finally, think about whether you really need to load all the data when your application starts. Serialize your data as a list of binary records and create an index that lets you jump straight to the range of records you actually need to use.
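To illustrate the binary-records-plus-index idea, here is only a rough sketch, not your existing code: the Release fields, the file names and the use of plain System.IO streams are assumptions (in a Store app you would open the streams through StorageFile and the AsStreamForRead/AsStreamForWrite extensions):

// needs System, System.Collections.Generic and System.IO
// Sketch: append every record to one data file and record its byte offset
// in a separate index file, so a later read can seek straight to record i.
static void SaveReleases(IEnumerable<Release> releases, string dataPath, string indexPath)
{
    using (var data = new BinaryWriter(File.Create(dataPath)))
    using (var index = new BinaryWriter(File.Create(indexPath)))
    {
        foreach (var r in releases)
        {
            index.Write(data.BaseStream.Position);  // 8-byte offset of this record
            data.Write(r.Id);                       // assumed string property
            data.Write(r.Title);                    // assumed string property
        }
    }
}

// Read only records [first, first + count) by seeking via the index.
static List<Release> LoadReleaseRange(string dataPath, string indexPath, int first, int count)
{
    var items = new List<Release>();
    using (var index = new BinaryReader(File.OpenRead(indexPath)))
    using (var data = new BinaryReader(File.OpenRead(dataPath)))
    {
        index.BaseStream.Seek((long)first * sizeof(long), SeekOrigin.Begin);
        for (int i = 0; i < count && index.BaseStream.Position < index.BaseStream.Length; i++)
        {
            data.BaseStream.Seek(index.ReadInt64(), SeekOrigin.Begin);
            items.Add(new Release { Id = data.ReadString(), Title = data.ReadString() });
        }
    }
    return items;
}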
As Filip already mentioned, you probably don't need to load all the data at startup. Even if you really want to show all the items on the first page (showing 10,000 items at once to a user doesn't sound like a good idea to me), you don't need all their properties available: usually only a couple of them are shown in the list, and the rest are only needed when the user navigates to an individual item's details. You could have a separate "index" file containing only the data you need for the list. This does mean duplication, but it will help you with performance.
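A hedged sketch of that idea (the ReleaseSummary shape and the index.json name are made up; only the fields your list page actually shows would go in it):

// Sketch: one small index file read at startup instead of 10,000 individual files.
[DataContract]
public class ReleaseSummary
{
    [DataMember] public string Id { get; set; }     // assumed fields
    [DataMember] public string Title { get; set; }
}

public async Task<List<ReleaseSummary>> GetReleaseIndex()
{
    var dbFolder = await ApplicationData.Current.LocalFolder
        .CreateFolderAsync(dbName, CreationCollisionOption.OpenIfExists);
    var indexFile = await dbFolder.CreateFileAsync("index.json", CreationCollisionOption.OpenIfExists);

    using (var stream = await indexFile.OpenStreamForReadAsync()) // System.IO WindowsRuntime extension
    {
        if (stream.Length == 0) return new List<ReleaseSummary>();
        var serializer = new DataContractJsonSerializer(typeof(List<ReleaseSummary>));
        return (List<ReleaseSummary>)serializer.ReadObject(stream);
    }
}

You would rewrite index.json whenever an item is added, edited or deleted, and only open the per-item folder when the user drills into that item.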
Although you've mentioned that you don't need SQLite because it is too sophisticated for your needs, you really should take a closer look at it. It is designed to efficiently handle structured data such as yours. I'm pretty sure that if you switch to it, performance will be much better and your code might even end up simpler. Try it out.
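If you do give it a try, a minimal sketch with the sqlite-net package (assumed here, as is the Release/column shape) shows how little code is involved, and it gives you paging almost for free:

// Sketch using the sqlite-net package; SQLiteConnection, [PrimaryKey], Table<T>() come from that library.
public class Release
{
    [PrimaryKey] public string Id { get; set; }
    public string Title { get; set; }
}

public class ReleaseStore
{
    private readonly SQLiteConnection _db;

    public ReleaseStore(string path)
    {
        _db = new SQLiteConnection(path);
        _db.CreateTable<Release>();              // no-op if the table already exists
    }

    public void Save(Release r) { _db.InsertOrReplace(r); }

    // Page through the data instead of loading all 10,000 items at once.
    public List<Release> GetPage(int pageIndex, int pageSize)
    {
        return _db.Table<Release>().Skip(pageIndex * pageSize).Take(pageSize).ToList();
    }
}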
When I run my PsychoPy experiment, PsychoPy saves a CSV file that contains my trials and the values of my variables.
Among these, there are some variables I would like NOT to be included. There are some variables I deliberately chose to include in the CSV, but many others ended up in it automatically.
Is there a way to manually force (from a code block) the exclusion of some variables from the CSV?
Is there a way to decide the order of the saved columns/variables in the CSV?
It is not really important, and I know I could just create an output file myself instead of using the one from PsychoPy, or easily clean it up afterwards, but I was just curious.
PsychoPy spits out all the variables it thinks you could need. If you want to drop some of them, that is a task for the analysis stage, and is easily done in any processing pipeline. Unless you are analysing data in a spreadsheet (which you really shouldn't), the number of columns in the output file shouldn't really be an issue. The philosophy is that you shouldn't back yourself into a corner by discarding data at the recording stage - what about the reviewer who asks about the influence of a variable that you didn't think was important?
If you are using the Builder interface, the saving of onset & offset times for each component is optional, and is controlled in the "data" tab of each component dialog.
The order of variables is also not under direct control of the user, but again, can be easily manipulated at the analysis stage.
As you note, you can of course write code to save custom output files of your own design.
There is a special block called session_variable_order: [var1, var2, var3] in the experiment_config.yaml file, which you probably should be using. Also, consider these methods:
from psychopy import data
data.ExperimentHandler.saveAsWideText(fileName = 'exp_handler.csv', delim='\t', sortColumns = False, encoding = 'utf-8')
data.TrialHandler.saveAsText(fileName = 'trial_handler.txt', delim=',', encoding = 'utf-8', dataOut = ('n', 'all_mean', 'all_raw'), summarised = False)
Notice the sortColumns and dataOut params.
I am trying to export a large feature collection from GEE. I realize that the Python API allows for this more easily than the JavaScript API does, but given a time constraint on my research, I'd like to see if I can extract the feature collection in pieces and then append the separate CSV files once exported.
I tried to use a filtering function to perform the task, one that I've seen used before with image collections. Here is a mini example of what I am trying to do:
Given a feature collection of 10 spatial points called "points" I tried to create a new feature collection that includes only the first five points:
var points_chunk1 = points.filter(ee.Filter.rangeContains('system:index', 0, 5));
When I execute this function, I receive the following error: "An internal server error has occurred"
I am not sure why this code is not executing as expected. If you know more than I do about this issue, please advise on alternative approaches to splitting my sample, or on where the error in my code lurks.
Many thanks!
system:index is actually an ID given by GEE to each feature, and it's not supposed to be used like an index into an array. I think the JS API should be enough to export a large FeatureCollection, but there is a way to do what you want without relying on system:index, as that might not be consistent.
First, it would be a good idea to know the number of features you are dealing with, because when you call size().getInfo() on a large feature collection the UI can freeze and the tab can become unresponsive. Here I have defined chunk and collectionSize. They have to be defined on the client side because we want to call Export inside the loop, which is not possible in server-side loops. Within the loop, you simply create a subset of features starting from a different offset each time, by converting the collection to a list and converting the subset back to a feature collection.
var chunk = 1000;
var collectionSize = 10000;
for (var i = 0; i < collectionSize; i = i + chunk) {
  // take `chunk` features starting at offset i and export each slice as its own asset
  var subset = ee.FeatureCollection(fc.toList(chunk, i));
  Export.table.toAsset(subset, "description", "/asset/id");
}
This issue has been bugging me for some time now. To test it, I just installed a fresh Apigility, set up the db (PDO:mysql) and added a DB-Connected service. The table contains 40+ records. When I make a GET collection request, the response looks OK (with the default HAL content negotiation). Then I change the content negotiation to JSON. Now when I make a GET collection request, my response contains only 10 elements.
So my question is: where do I set/change this limit?
You can set the page size manually, like so:
$paginator = $this->getAlbumTable()->fetchAll(true);
// set the current page to what has been passed in query string, or to 1 if none set
$paginator->setCurrentPageNumber((int) $this->params()->fromQuery('page', 1));
// set the number of items per page to 10
$paginator->setItemCountPerPage(10);
http://framework.zend.com/manual/current/en/tutorials/tutorial.pagination.html
Could you please send the page_size, total_items part at the end of the JSON output?
It looks like this:
"page_count": 140002,
"page_size": 25,
"total_items": 3500035,
"page": 1
This is not an ideal fix, because it requires you to go into the source code rather than using the page size given in the UI.
The collection class that is auto-generated for you by the DB-Connected style derives from Zend\Paginator\Paginator. This class defines the $defaultItemCountPerPage static protected member, which defaults to 10. That's why you're only getting 10 results. If you open the auto-generated collection class for your entity and add protected static $defaultItemCountPerPage = 100; to the otherwise empty class, you will see that you now get up to 100 results in the response. You can look at other Paginator class variables and methods that you could override in your derived class to get your desired behavior.
This is not an ideal solution. I'd prefer that the generated code automatically used the same configured page size that the HalJson strategy uses. Maybe I'll contribute a PR to change that. Or maybe I'll just use the HalJson approach; it does seem like the better way to go. You should have some limit on how much data you load from the DB at a time, so you don't end up with an overly long-running query or an overly large collection of data to deal with. And whatever limit you set, what do you do when you hit it? With the plain JSON method you can never get "page 2" of the data. So, if you are going to work with a sizeable amount of data, it might be better to use HalJson and add some logic on the client side to grab pages of data as needed. The returned JSON structure is a little more complicated, but not terribly so.
I'm probably in the same spot you are -- I'm trying to build a simple little API to play with while keeping everything simple, so I didn't want the client to have to deal with the extra structure in HalJson. But it's probably better to deal with that complexity and have a smooth way to page through data if you're going to use this with a real set of data. At least, that's the pep talk I'm giving myself right now. :-)
I like the user experience of cubism, and would like to use this on top of a backend we have.
I've read the API docs and some of the code, but most of this seems to be abstracted away. How could I begin to use other data sources exactly?
I have a data store of about 6k individual machines with 5 minute precision on around 100 or so stats.
I would like to query some web app with a specific identifier for that machine and then render a dashboard similar to cubism via querying a specific mongo data store.
Writing the webapp or the querying to mongo isn't the issue.
The issue is more in line with the fact that cubism seems to require querying whatever data store you use for each individual data point (say you have 100 stats across a window of a week...expensive).
Is there another way I could leverage this tool to look at data that gets loaded using something similar to the code below?
var data = [];
d3.json("/initial", function(json) { data.concat(json); });
d3.json("/update", function(json) { data.push(json); });
Cubism takes care of initialization and updates for you: the initial request is for the full visible window (start to stop, typically 1,440 data points), while subsequent requests are only for the few most recent values (typically 7 data points).
Take a look at context.metric for how to implement a new data source. The simplest possible implementation is like this:
var foo = context.metric(function(start, stop, step, callback) {
d3.json("/data", function(data) {
if (!data) return callback(new Error("unable to load data"));
callback(null, data);
});
});
You would extend this to change the "/data" URL as appropriate, passing in the start, stop and step times, and whatever else you want to use to identify a metric. For example, both Cube and Graphite use a metric expression as an additional query parameter.
Let's say you want to construct an Iterator that spits out File objects. What type of data do you usually provide to the constructor of such an Iterator?
an array of pre-constructed File objects, or
simply raw data (a multidimensional array, for instance), and let the Iterator create File objects on the fly as it is iterated through?
Edit:
Although my question was actually meant to be as general as possible, it seems my example is a bit too broad to tackle generally, so I'll elaborate a bit more. The File objects I'm talking about are actually file references from a database. See these two tables:
folder
| id | folderId | name |
------------------------------------
| 1 | null | downloads |
file
| id | folderId | name |
------------------------------------
| 1 | 1 | instructions.pdf |
They reference actual folders and files on a filesystem.
Now, I created a FileManager object. This will be able to return a listing of folders and files. For instance:
FileManager::listFiles( Folder $folder );
... would return an Iterator of File objects (or, come to think of it, rather FileReference objects) from the database.
So what my question boils down to is:
If the FileManager object constructs the Iterator in listFiles(), would you do something like this (pseudo code):
listFiles( Folder $folder )
{
    // let's assume the following returns a multidimensional array of rows
    $filesData = $db->fetch( $sqlForFetchingFilesFromFolder );

    // let the Iterator take care of constructing the FileReference objects with each iteration
    return new FileIterator( $filesData );
}
or (pseudo code):
listFiles( Folder $folder )
{
    // let's assume the following returns a multidimensional array of rows
    $filesData = $db->fetch( $sqlForFetchingFilesFromFolder );

    $files = array();
    foreach ( $filesData as $fileData )
    {
        $files[] = new FileReference( $fileData );
    }

    // provide the Iterator with precomposed FileReference objects
    return new FileIterator( $files );
}
Hope this clarifies things a bit.
What is your "File" object meant to be? An open handle to a file, or a representation of a file system path which can be opened in turn?
It would generally be a bad idea to open all the files at once - after all, part of the point of using an iterator is that you only access one object at a time. Your iterator could yield one open file at a time, and let the caller take responsibility for closing it, although again that might be slightly odd to use.
Your requirements aren't clear, to be honest - in my experience, most iterators which yield a series of files use something like Directory.GetFiles(pattern) - you don't pass them the raw data at all, you pass them something which they can use to find the data for you.
It's not obvious what you're trying to get at - it feels like you're trying to ask a general question, but you haven't provided enough information to let us advise you. It's like asking, "Do I want to use a string or an integer?" without giving any context.
EDIT: I would probably push all of that logic into FileIterator, personally. Otherwise it's hard to see what value it's really providing. In a language like C# or Python you wouldn't need a separate class in the first place - you'd just use a generator of some description. In that sense this question isn't language agnostic :(
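For instance, a minimal C# sketch of that generator approach (FetchFileRows and the FileReference constructor are assumptions standing in for your data access):

// Sketch: no separate iterator class; each FileReference is built lazily, one per iteration.
public IEnumerable<FileReference> ListFiles(Folder folder)
{
    foreach (var row in FetchFileRows(folder))   // hypothetical data-access call
    {
        yield return new FileReference(row);     // nothing is materialized until the caller iterates
    }
}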
What exactly is your iterator supposed to do? Write data to files? Create them?
An iterator is a pattern for iterating through data, which means providing sequential access to data in a uniform way, not mutating it.
I find the question to be unclear.
Are we talking Iterator or Factory?
To me, an Iterator operates on a pre-existing collection of things and allows the caller to work on each thing in turn.
When you say "Spits Out" do you mean allows the client to work with one file from a pre-existing set of files or do you mean that you are iterating some data and intend to store that data in files you are generting. If we are geneating, then we've got a File factory.
My guess is that you are intending to process some files in a file system. I think that your Iterator is akin to a Directory: it can give you the next file it knows about. So I construct the "Directory" by passing enough data to allow it to know which files you mean (could be just an OS path, could be some kind of "find" expression, a list of ftp-like references, etc.) and expect it to give me the next File as I iterate.
----updated following question clarification
I think that the key question here is when the individual files should be opened. The Iterator itself will reasonably return a File object corresponding to an open file handle, and the caller can then just work with the file. But internally, should the iterator work against a list of pre-opened files, or against a list of file references, with each file being opened as the iterator's next() is used?
I think we should do the latter, because there is overhead in having an open file, hence we should open the files only when we need them.
That leads to one other point: who closes the file? We can't afford to keep them all open. Perhaps the iterator should close each file as next() is called. This implies that the iterator itself needs a close() method to allow tidy-up of the currently open file. Alternatively, we need to explicitly document that closing is the client's responsibility.
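To make that concrete, here is a rough C# sketch (the question's pseudo code is PHP-like, but the idea carries over; the list of paths stands in for your FileReference rows) of an iterator that closes the previously yielded file on each advance and tidies up the last one when the iterator itself is disposed:

using System;
using System.Collections;
using System.Collections.Generic;
using System.IO;

// Sketch: open one file at a time, close the previous one on each next()/MoveNext(),
// and let Dispose() act as the iterator's close() for the final open file.
public sealed class FileStreamIterator : IEnumerator<FileStream>
{
    private readonly IEnumerator<string> _paths;

    public FileStreamIterator(IEnumerable<string> paths)
    {
        _paths = paths.GetEnumerator();
    }

    public FileStream Current { get; private set; }
    object IEnumerator.Current { get { return Current; } }

    public bool MoveNext()
    {
        if (Current != null) Current.Dispose();       // close the file handed out last time
        if (!_paths.MoveNext()) { Current = null; return false; }
        Current = File.OpenRead(_paths.Current);      // open lazily, only when asked for
        return true;
    }

    public void Reset() { throw new NotSupportedException(); }

    public void Dispose()
    {
        if (Current != null) Current.Dispose();       // the iterator's own close()
        _paths.Dispose();
    }
}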