I have about 10GB of data that I would like to import into Parse. The data is currently in JSON format, which is great for importing with the Parse importer.
However, I have no unique identifier for these objects. They do have unique properties, e.g. a URL, but the IDs pointing to specific objects need to be constant.
What would be the best way to edit this large amount of data in bulk on their server without running into request-limit issues (I'm currently on the free pricing tier) and without the edits taking too much time?
Option 1
Import the data once and export it as JSON with the newly assigned objectIds. Then edit the export locally, matching on the URL, and replace the class with the edited data. Any new additions will receive a new objectId from Parse.
How much downtime would there be between import and export, given that I would need to delete the class and recreate it? Are there any other concerns with this approach?
Option 2
Query for the URL (or an array of URLs), edit the matching objects, then re-save them. This means the data persists the whole time, but since the edit would touch hundreds of thousands of objects, would it most likely overrun the request limit?
Option 3
Is there a better option I am missing?
The best option is to upload to Parse and then edit through their normal channels. With a few tricks it is possible to stay below the 30 requests/second offered as part of the free tier. You can iterate over the data using background jobs (written in JavaScript) -- you may need to slow down your processing so you don't hit the limits. The super hacky way is to download from the table to a client (iOS/Android) app and then push the changes back up to Parse. If you do this in batches (not a synchronous for loop, by the way), then the latency alone will keep you under the 30 requests/second limit.
I'm not sure why you're worried about downtime. If the data isn't already uploaded to Parse, can't you upload it, pull it down and edit it, and re-upload it -- taking as long as you'd like? Do this in a separate table from any you are using in production, and you should be just fine.
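For the "pull it down, edit it, re-upload it" route, here is a minimal sketch in Python of how the throttled batch editing could look, assuming the classic hosted Parse REST API and its batch endpoint (the class name "Item", the credentials, and the sleep interval are placeholders, not anything from the question):

```python
import time
import requests

# Placeholder credentials and class name -- substitute your own.
APP_ID = "YOUR_APP_ID"
REST_KEY = "YOUR_REST_API_KEY"
HEADERS = {
    "X-Parse-Application-Id": APP_ID,
    "X-Parse-REST-API-Key": REST_KEY,
    "Content-Type": "application/json",
}
BASE = "https://api.parse.com/1"


def batch_update(updates, batch_size=50, delay=2.0):
    """Send updates as Parse batch requests, throttled to stay under the rate limit.

    `updates` is a list of (objectId, fields_dict) pairs for the class 'Item'.
    Each batch counts against the request limit, so a short sleep between
    batches keeps the overall request rate low.
    """
    for i in range(0, len(updates), batch_size):
        chunk = updates[i:i + batch_size]
        body = {
            "requests": [
                {
                    "method": "PUT",
                    "path": "/1/classes/Item/{}".format(object_id),
                    "body": fields,
                }
                for object_id, fields in chunk
            ]
        }
        resp = requests.post(BASE + "/batch", json=body, headers=HEADERS)
        resp.raise_for_status()
        time.sleep(delay)  # crude throttle between batches
```

A self-hosted Parse Server would use its own server URL and keys, but the batching and throttling idea is the same.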
I am using GraphDB and have a data update problem.
Data in the repository comes from two sources:
Millions of triples come from an external source and are replaced in full each week.
Thousands of triples are created by users and are permanent. They use the same ontology as the external source and are stored in the same repository, so that SPARQL queries can run over both sets of data without any difference. However, a simple SPARQL query can retrieve all of the user triples.
The problem is the weekly update of the external source.
My first idea was to:
Export the user data
Import the new external dataset with a full replace
Re-import the user data
Problem: I need to re-import the exported data, but imports must be in an RDF format, which is not available for the export.
Another way (which is about the same):
Import the weekly update into a new repository
Copy the user data from the 'old' repo to the new one
Switch the server to the new repo.
Problem: in order to copy the user data I would need an "INSERT ... SELECT" SPARQL statement across repositories using SERVICE, something that exists in SQL (without services) but not in SPARQL.
Finally, GraphDB OntoRefine could do the work, but not efficiently on a weekly basis.
Another option could be to store the user data in a separate repository, but SPARQL queries that involve sorting across both could become hard to maintain and slow to run.
I can also export the user data in JSON format, programmatically generate RDF/XML files, and send them to the GraphDB API. This is technically possible (I do it in a few special cases and it works fine), but it is not reliable for a large amount of data, it is slow, and it is a lot of development work.
In short: I am stuck!
See the comments: this was actually not a problem, but a misunderstanding of the role of named graphs within a repository.
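To make that named-graph resolution concrete, here is a rough sketch in Python against GraphDB's RDF4J-style REST endpoints (the repository name, graph URI, and file path are invented for illustration): keep the user triples in their own named graph, and do the weekly update by clearing and reloading only the external graph.

```python
import requests

# Hypothetical endpoint, repository, and graph URI -- adjust to your setup.
GRAPHDB = "http://localhost:7200/repositories/myrepo"
EXTERNAL_GRAPH = "http://example.org/graphs/external"
UPDATE_ENDPOINT = GRAPHDB + "/statements"


def weekly_replace(new_dump_path):
    """Replace only the external data, leaving the user graph untouched."""
    # 1. Drop everything in the external named graph.
    requests.post(
        UPDATE_ENDPOINT,
        data="CLEAR GRAPH <%s>" % EXTERNAL_GRAPH,
        headers={"Content-Type": "application/sparql-update"},
    ).raise_for_status()

    # 2. Load the new weekly dump into that same graph.
    with open(new_dump_path, "rb") as f:
        requests.post(
            UPDATE_ENDPOINT,
            params={"context": "<%s>" % EXTERNAL_GRAPH},
            data=f,
            headers={"Content-Type": "text/turtle"},  # match your dump's format
        ).raise_for_status()
```

For millions of triples you would more likely use GraphDB's own bulk import tooling rather than a single HTTP POST, but the point stands: the weekly replace never touches the graph holding the user triples.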
The git repo for my Django app includes several .tsv files which contain the initial entries to populate my app's database. During app setup, these items are imported into the app's SQLite database. The SQLite database is not stored in the app's git repo.
During normal app usage, I plan to add more items to the database through the admin panel. However, I also want these entries saved as fixtures in the app repo. I was thinking that a JSON file might be ideal for this purpose, since it is text-based and so works well with git version control. These files would then become additional fixtures for the app, which would be imported during initial configuration.
How can I configure my app so that any time I add a new entry through the admin panel, a copy of that entry is saved to a JSON file as well?
I know that you can use the manage.py dumpdata command to dump the entire database to JSON, but I do not want the entire database; I just want JSON for new entries in specific tables/models.
I was thinking that I could try to hack the save method on the model to write a JSON representation of the item to a file, but I am not sure that is ideal.
Is there a better way to do this?
Overriding the save method for something that can go wrong, or that can take longer than it should, is not recommended. You usually override save when the changes are simple and essential.
You could use signals, but in your case that is probably more work than it is worth. You can instead write a function that does the export for you, though not necessarily right after each save; doing it immediately on every save adds a lot of overhead unless it is really important for the file to stay up to date.
I recommend using something like Celery to run a function in the background, separate from the rest of your Django code. You can call it on every data update, or every hour for example, and have it update your backup file. You could even create a table to monitor the update process.
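A minimal sketch of that idea, assuming a working Celery setup and a hypothetical catalog.Item model and fixture path; it routes the export through Django's own dumpdata command so only the chosen models end up in the fixture file:

```python
# tasks.py -- illustrative only; 'catalog.Item' and the fixture path are made up.
from celery import shared_task
from django.core.management import call_command


@shared_task
def export_fixtures():
    """Dump only the chosen models to a JSON fixture kept under version control."""
    with open("fixtures/catalog_items.json", "w") as f:
        # dumpdata accepts app_label.ModelName, so only these tables are exported.
        call_command("dumpdata", "catalog.Item", indent=2, stdout=f)
```

You could then schedule export_fixtures with Celery beat (hourly, nightly, whatever fits), and the resulting file can be loaded back with loaddata during initial setup.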
Which solution is best depends largely on you and on how important the data is. Keep in mind that editing a file can be a heavy process too, so creating a backup, say, once a day might be a better idea anyway.
I have to process an Excel file in order to insert some data into a MySQL database in a web-based application. I am using Spring MVC as the architecture.
The problem is that I need a mid-step in which the user can review the data before it is actually inserted. So the typical flow would be for the user to upload the file, then the application shows another page with the processed information and an "Apply changes" button that actually takes all this information and stores it in my database.
At first, to propagate the data from the file through the three steps (the upload page, the mid-step page, and the final controller action), I just used a form with hidden fields holding the data, to avoid processing the file twice (once to present the results in the mid-step, and once for the actual storing). The problem is that the Excel file contains so much information that my mid-step page gets overloaded and takes too long to render (and too long to process in the controller when reading the parameters back).
So I thought about using either a temporary file or a temporary table in my database to store the data between steps. Are there any other ways to propagate this data between controller actions? I wouldn't like to leave garbage data behind, nor process the file again (since it takes quite some time), so what would be the best approach?
I would go for a temporary table in this case, provided the processing does not take too long. You can use the same sequences as in the final table, and all you need to do is issue an "INSERT INTO ... SELECT ... FROM" statement once the user clicks OK on the second screen. This would be a simple solution to implement, I'd say, and by that point all the processing is already done.
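The SQL is really the whole trick here; below is a rough sketch of the two statements involved. Table and column names are invented, and it is written in plain Python with the MySQL connector just to keep the example short; in the actual Spring app you would run the same statements through JdbcTemplate or your DAO layer.

```python
import mysql.connector  # pip install mysql-connector-python

conn = mysql.connector.connect(
    host="localhost", user="app", password="secret", database="appdb"
)
cur = conn.cursor()

# Step 1: while parsing the Excel file, stage each row (done once, at upload time).
cur.executemany(
    "INSERT INTO staging_orders (upload_id, customer, amount) VALUES (%s, %s, %s)",
    [(42, "ACME", 10.5), (42, "Globex", 99.0)],  # rows parsed from the spreadsheet
)

# Step 2: when the user clicks "Apply changes", copy the reviewed rows over
# and clean up the staging rows for that upload in the same transaction.
cur.execute(
    "INSERT INTO orders (customer, amount) "
    "SELECT customer, amount FROM staging_orders WHERE upload_id = %s",
    (42,),
)
cur.execute("DELETE FROM staging_orders WHERE upload_id = %s", (42,))
conn.commit()
```

Keying the staging rows by an upload id also means abandoned uploads are easy to sweep up later, so no garbage is left behind.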
If the processing part is huge and the user only needs to review certain things before that big task, another solution could be to split the processing into two parts, store the first results in a temporary file, and build the review screen on top of that. After the user clicks OK you can launch an async task that does the heavy lifting and cleans up the file once it's done.
I am designing a system with 30,000 objects or so and can't decide between two approaches: either have a JSON file pre-computed for each object and fetch the data by pointing at the URL of that file (I think Twitter does something similar), or have a PHP/Perl/whatever script that produces the JSON object on the fly when requested, say from a database, and sends it back. Is one approach more suitable than the other? I guess if it takes a long time to generate the JSON data it is better to have the files generated in advance. What if generating is as quick as accessing the database? Although I suppose one would have a dedicated table in the database specifically for that. The data doesn't change very often, so updating is not a constant thing; in that respect the data is static for all intents and purposes.
Anyway, any thoughts would be much appreciated!
Alex
You might want to try MongoDB, which returns objects as JSON, is highly scalable, and is easy to set up.
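If you go that route, here is a tiny sketch of what serving one of those objects could look like with PyMongo (the database, collection, and field names are made up for illustration):

```python
from bson.json_util import dumps  # handles ObjectId and other BSON types
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
collection = client["mysite"]["objects"]


def object_as_json(slug):
    """Fetch one stored object and return it as a JSON string for the response."""
    doc = collection.find_one({"slug": slug}, {"_id": 0})
    return dumps(doc)
```

Since the data rarely changes, this sits somewhere between your two options: the objects are stored ready-to-serve, but you still hand them out through a small script rather than static files.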
I have worked with CGI pages a lot and dealt with cookies and storing the data in the /tmp directory in Linux.
Basically I am running a query for millions of records using SQL and saving the result in a hash. I want to transfer that data to an Ajax call (which will eventually perform some calculations and render a graph using the Google API).
Or I want to transfer that data to another CGI page somehow.
PS: The data I am talking about here is on the order of 10-100+ MB.
Until now, I've been saving that data to a file on the server, but again, it's a hassle to deal with that data on the server for each query.
You don't mention why it's a hassle to deal with the data on the server for each query, but assuming the hassle is working with the file, DBM::Deep might make it relatively easy to write the hash out and read it back again. Once you have that, you could create a simple script that returns it as JSON and access it as needed from JavaScript or other pages. Although I suspect the browser would struggle with a 100MB JSON data structure.
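The answer above refers to a Perl module, but the shape of the idea is language-agnostic: persist the expensive query result to disk once, then have a lightweight script hand back only the slice a page needs as JSON. A rough Python sketch of that pattern, with the cache path and key names invented, just to show the flow:

```python
import json
import os

CACHE_PATH = "/tmp/report_cache.json"  # illustrative location


def load_or_build(build_query_results):
    """Reuse the cached query result if present, otherwise build and persist it."""
    if os.path.exists(CACHE_PATH):
        with open(CACHE_PATH) as f:
            return json.load(f)
    data = build_query_results()  # the expensive multi-million-row query
    with open(CACHE_PATH, "w") as f:
        json.dump(data, f)
    return data


def slice_as_json(data, key):
    """Return just the part the page actually needs, instead of 100 MB at once."""
    return json.dumps(data.get(key, []))
```

Serving slices keyed by whatever the graph needs keeps the payload to the browser small, which matters far more than how the cache itself is stored.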