I'm playing with the idea of building on the torrent protocol used today in download clients such as uTorrent or Vuze to create:
Client software that would:
Let you select the files you would like to back up
Create a torrent-like descriptor file for each file (see the sketch after this list)
Offer optional encryption of your files based on a key phrase
Let you select the amount of redundancy you would like to trade with other clients
(Redundancy would be based on a give-and-take principle: if you want to back up 100 MB five times, you would have to offer an extra 500 MB of your own storage space in exchange. The backup would not be distributed among just five clients; it would use as many clients offering storage in exchange as possible, chosen by the physical distance specified in the settings.)
Optionally:
I'm thinking of including edge file sharing: if you had non-encrypted files in your backup storage, you could prefer clients that have port 80 open for public HTTP sharing. But this gets tricky, since I'm having a hard time coming up with a simple scheme by which a visitor would pick the closest backup client.
Include a file manager that would allow FTP-with-a-GUI-style file transfers between two systems using the torrent protocol.
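To make the "torrent-like descriptor" idea concrete, here's a minimal Python sketch of what such a descriptor might contain: the file name, its length, and SHA-1 hashes of fixed-size pieces, bencoded the way a .torrent info dictionary is. The piece size and field names are illustrative assumptions, not a finished format.

```python
import hashlib
import os

PIECE_SIZE = 256 * 1024  # hypothetical piece size; real clients vary this

def bencode(value):
    """Minimal bencoder covering only the types used below (int, str, bytes, dict)."""
    if isinstance(value, int):
        return b"i%de" % value
    if isinstance(value, str):
        value = value.encode("utf-8")
    if isinstance(value, bytes):
        return b"%d:%s" % (len(value), value)
    if isinstance(value, dict):
        items = sorted((k.encode("utf-8"), bencode(v)) for k, v in value.items())
        return b"d" + b"".join(b"%d:%s%s" % (len(k), k, v) for k, v in items) + b"e"
    raise TypeError(type(value))

def make_descriptor(path):
    """Build a torrent-like descriptor: file length plus the SHA-1 of each piece."""
    pieces = b""
    with open(path, "rb") as f:
        while chunk := f.read(PIECE_SIZE):
            pieces += hashlib.sha1(chunk).digest()
    return bencode({
        "name": os.path.basename(path),
        "length": os.path.getsize(path),
        "piece length": PIECE_SIZE,
        "pieces": pieces,
    })
```

A real client would add tracker/DHT bootstrap information and, for the encrypted case, hash the ciphertext rather than the plaintext.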
I'm thinking about creating this as a service/API project (sort of like http://www.elasticsearch.org ) that could be integrated with any container such as Tomcat and Spring, or just plain Swing.
This would be a P2P open source project. Since I'm not completely confident in my understanding of the torrent protocol, the question is:
Is the above feasible with the current state of torrent technology (and where should I look to recruit Java developers for this project)?
If this is the wrong spot to post this, please move it to a more appropriate site.
You are considering the wrong technology for the job. What you want is an erasure code based on Vandermonde matrices. It lets you get the same level of protection against lost data without needing to store nearly as many copies. There's an open source implementation by Luigi Rizzo that works perfectly.
What this code allows you to do is take an 8 MB chunk of data and cut it into any number of 1 MB chunks such that any eight of them can reconstruct the original data. That gives you the same level of protection as tripling the stored data without even doubling it.
You can tune the parameters any way you want. With Luigi Rizzo's implementation, there's a limit of 256 chunks. But you can control the chunk size and the number of chunks required to reconstruct the data.
You do not need to generate or store all the possible chunks. If you cut an 80MB chunk of data into 8MB chunks such that any ten can recover the original data, you can construct up to 256 such chunks. You will likely only want 20 or so.
You might have great difficulty enforcing the reciprocal storage feature, which I believe is critical to large-scale adoption (finally, a good use for those three-terabyte drives that you get in cereal boxes!). You might wish to study the mechanisms of Bitcoin to see if there are any tools you can steal or adapt for your own needs for distributed, non-repudiable proof of storage.
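To make the parameters concrete, here is a toy Python sketch of the underlying idea: k data symbols become n shares, and any k surviving shares reconstruct the original. It works on small integers over a prime field purely for illustration; Rizzo's fec code (and wrappers such as zfec) do the equivalent over GF(2^8) on raw bytes, far faster, and are what you would actually use.

```python
# Toy erasure code over a prime field, for illustration only.
# Real implementations (Rizzo's fec, zfec) work over GF(2^8) on raw bytes.
P = 2_147_483_647  # a Mersenne prime; all arithmetic is done mod P

def _interp_eval(points, x):
    """Evaluate the unique degree < k polynomial through `points` at `x` (mod P)."""
    total = 0
    for xi, yi in points:
        num, den = 1, 1
        for xj, _ in points:
            if xj != xi:
                num = num * (x - xj) % P
                den = den * (xi - xj) % P
        total = (total + yi * num * pow(den, P - 2, P)) % P  # Fermat inverse
    return total

def encode(data, n):
    """Treat the k data symbols as polynomial values at x = 1..k and emit
    n shares: the polynomial's values at x = 1..n (the first k shares equal the data)."""
    base = list(enumerate(data, start=1))
    return [(x, _interp_eval(base, x)) for x in range(1, n + 1)]

def decode(shares, k):
    """Recover the original k symbols from any k surviving (x, y) shares."""
    pts = shares[:k]
    return [_interp_eval(pts, x) for x in range(1, k + 1)]

if __name__ == "__main__":
    data = [101, 202, 303, 404, 505, 606, 707, 808]  # k = 8 symbols
    shares = encode(data, n=12)                       # any 8 of the 12 suffice
    survivors = shares[3:11]                          # pretend 4 shares were lost
    assert decode(survivors, k=8) == data
```

Scaled up to byte-sized chunks, this is the "any ten of the chunks recover the data" scenario above: the storage overhead is n/k, and you only generate as many extra shares as you actually want to place on other clients.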
My friend already has his own working web site (selling some stuff). We have an idea to create an iOS app for the site to attract more people (and, for me, to gain some badly needed experience).
The UI is going to be simple; the real problem is using the web site's data. We need the app to hold some data locally, so that people who do not have internet access can still use it.
But, of course, we want the information in the app to be up to date, so I need to use the MySQL data somehow (meaning that if the person has internet access, the app can download fresh data; if not, the app must contain some data to show). I want the app to be really good, so my question is: which combination is better to use?
Use Core Data and create a data model (it is huge and difficult to reproduce, with a lot of classes to create). I can do it, but how would I update the data afterwards? I have no idea.
Create an SQLite database, then use something like PHP code to fetch the data, encode it into JSON, and parse it in the app.
Maybe I should connect to MySQL directly from the app and use its data, since it may be impossible to keep the same data locally?
Or just parse it, using JSON or XML?
Please help me out; I need the app to be solid and robust, but I don't know how to do it. Maybe you can tell me a better way to solve this problem?
Generally you'll have to build a similar database inside your application using SQLite and import data from MySQL through some kind of API bridge. A simple way to do this data interchange is via JSON that encodes the record's attributes. XML is also a possible transport mechanism but tends to have more overhead and ends up being trickier to use. What you'll be sending back and forth is generally sets of key-value pairs, not entire documents.
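As a rough illustration of such a bridge (a sketch assuming a Python/Flask server and the mysql-connector-python driver; the products table, column names, and endpoint are invented for the example), the server side could be as small as this, returning only rows changed since the client's last sync:

```python
# A minimal sketch of a JSON "API bridge" over MySQL, assuming Flask and
# mysql-connector-python; table, column, and endpoint names are hypothetical.
from flask import Flask, jsonify, request
import mysql.connector

app = Flask(__name__)

def db():
    return mysql.connector.connect(
        host="localhost", user="api", password="secret", database="shop"
    )

@app.route("/products")
def products():
    # Clients pass the timestamp of their last sync so only changed rows
    # are returned; this keeps responses small on slow mobile links.
    since = request.args.get("since", "1970-01-01 00:00:00")
    conn = db()
    cur = conn.cursor(dictionary=True)
    cur.execute(
        "SELECT id, name, price, updated_at FROM products WHERE updated_at > %s",
        (since,),
    )
    rows = cur.fetchall()
    conn.close()
    for r in rows:
        r["updated_at"] = r["updated_at"].isoformat()  # make it JSON-friendly
        r["price"] = float(r["price"])
    return jsonify(rows)

if __name__ == "__main__":
    app.run()
```

The app would then GET /products?since=<last sync time> when it has connectivity, upsert the returned records into its local store (Core Data or SQLite), and fall back to whatever it already has when offline.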
Stick to Core Data unless you have an exceptionally good reason to use something else. Finding it irritating or different is not a good reason. It can be a bit tricky to get the hang of at first, but in practice it tends to be mostly unobtrusive if used correctly.
Unless you're writing something that's expressly a MySQL client, never connect directly to MySQL in an iOS application. Period. Don't even think about doing this. Not only is it impossible to secure effectively, but iOS networking is expected to be extremely unreliable, slow, and often unavailable entirely. Your application must be able to make use of limited bandwidth, deal with very high latency, and break up operations into small transactions that are likely to succeed instead of one long-running operation that is bound to fail.
How do you sync data between your database and your client? That depends on what web stack you're going to be using. You have a lot of options here, but at the very least you should prototype in something like Ruby on Rails, Django, or NodeJS. PHP is viable, but without a database framework will quickly become very messy.
I am working with HTML and am trying to find the most efficient way to pull in images that will be used for banners, backgrounds, etc.
Options
1) Store them on my server and access them by server path.
2) Store them on a third-party site such as ImageShack or Photobucket and access them via URL.
3) Store them in a MySQL database and access them via path. (Not planning on having a ton of images, but just throwing this out there.)
I am looking to be efficient in retrieving all the images that are going to be displayed on a page, but would also like to limit the amount of resources my server is responsible for. Between these 3 options, is there a choice that is overwhelmingly obvious and I am just not seeing it?
Answers such as the one below would be perfect. (I am looking at my options like this.)
Store on server - Rely heavily on my personal server; downloads etc. will hit my server; high load/high responsibility
Store on third-party site - Images off my server, saves me space and some of the load (but is it worth the hassle?)
DB link - Quickest, best for tons of images, relies heavily on my personal server
If this question is too opinion-based, then I will assume all 3 ways are equally effective and will close the question.
Store the images on a CDN and store the URLs of the images in a database.
The primary advantage here, not present in the other options, is caching. If you use a database, it needs to be queried and a server script (an .ashx handler in the .NET framework I often use) needs to return the resource. With ImageShack and the like I'm not sure, but I think the retrieved images are not cached.
The advantage here is that you don't lose bandwidth and storage space.
No advantages I can think of other than if you need to version control your images or something.
If you're working solely in HTML, then the database option isn't possible, as you would need a server-side language to connect the DB to the page. If you have some PHP, ASP, Ruby, etc. knowledge, then you can go the database route.
I think the answer is dependent on what the site/application is.
If (like you said) you're using the images for banners, backgrounds and things like that, then it's probably easiest to store them on your server and link to them on the page like <img src="/Source" alt="Image"/> (or set the backgrounds in CSS).
Make sure you are caching images so that they'll load more quickly for users after the first view (see the sketch below).
Most servers are pretty fast so I wouldn't worry too much about speeds ... unless the images you're using are huge (which anyone would tell you isn't recommended for a website anyway)
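As a sketch of the caching advice (assuming a small Python/Flask app; the folder name and one-year max-age are arbitrary choices), serving images with a long-lived Cache-Control header can look like this:

```python
# A minimal sketch of serving images with long-lived cache headers, assuming
# Flask; the directory name and one-year max-age are arbitrary choices.
from flask import Flask, send_from_directory

app = Flask(__name__)

@app.route("/images/<path:filename>")
def image(filename):
    resp = send_from_directory("static/images", filename)
    # Tell browsers (and any CDN in front) to keep the image for a year;
    # change the file name (e.g. logo.v2.png) when the image changes.
    resp.headers["Cache-Control"] = "public, max-age=31536000"
    return resp

if __name__ == "__main__":
    app.run()
```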
Is there someone out there who is using ZooKeeper for their sites? If so, what do you use it for? I just want to see real-world use cases.
I've just started researching the use of ZooKeeper for a number of cases in my company's infrastructure.
The one that seems to fit ZK best is where we have an array of 30+ dynamic content servers that rely heavily on file-based caching (memcached is too slow). Each of these servers has an agent watching a specific ZK path; when a new node shows up, all servers join a barrier lock, and once all of them are present, they all update their configuration at exactly the same time. This way we can keep all 30 servers' configuration/run-states consistent.
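A rough sketch of the watch half of that pattern, assuming the Python kazoo client (the ZK path and host list are made up, and the barrier step is only indicated in a comment):

```python
# A sketch of an agent that reacts to new config versions, assuming the
# Python kazoo client; the ZK path and host list are hypothetical.
from kazoo.client import KazooClient

zk = KazooClient(hosts="zk1:2181,zk2:2181,zk3:2181")
zk.start()
zk.ensure_path("/config/releases")

@zk.ChildrenWatch("/config/releases")
def on_new_release(children):
    if not children:
        return
    latest = sorted(children)[-1]
    # Here each server would enter a barrier (e.g. a double-barrier recipe)
    # so that all 30+ agents apply `latest` at the same moment.
    print("applying configuration", latest)
```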
Second use case: we receive 45-70 million page views a day in a typical bell-curve-like pattern. The caching strategy falls from the client, to the CDN, to memcache, and then to a file cache before we decide whether to make a DB call. Even with a series of locks in place, it's pretty typical to get race conditions (I've nicknamed them stampedes) that can strain our backend. The hope is that ZK can provide a tool for building a consistent and unified locking service across multiple servers and maybe data centers.
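For the stampede case, a distributed lock around cache regeneration might look like this; again a kazoo-based sketch, where cache_get, cache_put and rebuild_from_db are placeholders for the real cache and database code:

```python
# A sketch of stampede protection with a ZooKeeper lock, assuming kazoo;
# cache_get/cache_put/rebuild_from_db stand in for real cache and DB code.
from kazoo.client import KazooClient

zk = KazooClient(hosts="zk1:2181")
zk.start()

def get_page(key):
    value = cache_get(key)                    # hypothetical cache lookup
    if value is not None:
        return value
    lock = zk.Lock("/locks/cache/" + key, "web-42")
    with lock:                                # only one server rebuilds
        value = cache_get(key)                # someone may have filled it meanwhile
        if value is None:
            value = rebuild_from_db(key)      # hypothetical expensive call
            cache_put(key, value)
    return value
```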
You may be interested in the recently published scientific paper on ZooKeeper:
http://research.yahoo.com/node/3280
The paper also describes three use cases and comparable projects.
We do use ZK as a dependency of HBase and have implemented a scheduled work queue for a feed reader (millions of feeds) with it.
The ZooKeeper "PoweredBy" page has some detail that you might find interesting:
https://cwiki.apache.org/confluence/display/ZOOKEEPER/PoweredBy
HBase uses ZK and is open source (Apache) which would allow you to look at actual code.
http://hbase.apache.org/
I very much appreciate CouchDB's attempt to use universal web formats in everything it does: RESTful HTTP methods in every interaction, JSON objects, and JavaScript code to customize databases and documents.
CouchDB seems to scale pretty well, but the per-request cost usually scares "relational" people away.
Many small business applications have to deal with only one machine, and that's all. In that case the scalability talk doesn't mean much; we need more performance per request, or people will not use it.
BERT (Binary ERlang Term, http://bert-rpc.org/ ) has proven to be a faster and lighter format than JSON, and it is native to Erlang, the language CouchDB is written in. Could we benefit from that by using BERT documents instead of JSON ones?
I don't mean just for retrieving views, but for everything CouchDB does, including syncing, and, as a consequence, using Erlang functions instead of JavaScript ones.
This would change some of CouchDB's original principles, because today it is very web-oriented. But since I imagine few people make their database API public, and its data is usually accessed by users through an application, it would be a good deal to be able to configure CouchDB to work faster this way. HTTP+JSON calls could still be handled by CouchDB, with an extra parsing cost in those cases.
You can have a look at hovercraft. It provides a native Erlang interface to CouchDB. Combining this with Erlang views, which CouchDB already supports, you can have a sort-of all-Erlang CouchDB (some external libraries, such as ICU, will still need to be installed).
CouchDB wants maximum data portability. You can't parse BERT in the browser, which means it's not portable to the largest platform in the world, so that's kind of a non-starter for base CouchDB. There might, however, be a place for BERT in hovercraft, as mentioned above.
I think it would first be good to measure how much of the overhead is due to JSON processing: JSON handling can be very efficient. For example, these results suggest that JSON is the fastest schema-less data format on the Java platform (Protocol Buffers require a strict schema, ditto for Avro; Kryo is a Java serializer), and I would assume the same could be achieved on other platforms too (with Erlang, and in browsers via native support).
So, "if it ain't broke, don't fix it". JSON is very fast, when properly implemented; and if space usage is concern, it compresses well just like any textual formats.
I'm in need of a distributed file system that must scale to very large sizes (about 100 TB realistic max). File sizes are mostly in the 10-1500 KB range, though some files may peak at about 250 MB.
I very much like the thought of systems like GFS with built-in redundancy for backup, which would, statistically, render file loss a thing of the past.
I have a couple of requirements:
Open source
No SPOFs
Automatic file replication (that is, no need for RAID)
Managed client access
Flat namespace of files - preferably
Built in versioning / delayed deletes
Proven deployments
I've looked seriously at MogileFS, as it does fulfill most of the requirements. It does not have any managed clients, but it should be rather straightforward to do a port of the Java client. However, there is no versioning built in. Without versioning, I will have to do normal backups in addition to the file replication built into MogileFS.
Basically I need protection from a programming error that suddenly purges a lot of files it shouldn't have. While MogileFS does protect me from disk & machine errors by replicating my files over X number of devices, it doesn't save me if I do an unwarranted delete.
I would like to be able to specify that a delete operation doesn't actually take effect until after Y days. The delete will logically have taken place, but I can restore the file for Y days until it's actually deleted. Also, MogileFS does not have the ability to check for disk corruption during writes, though again, this could be added.
Since we're a Microsoft shop (Windows, .NET, MSSQL) I'd optimally like the core parts to be running on Windows for easy maintainability, while the storage nodes run *nix (or a combination) due to licensing.
Before I even consider rolling my own, do you have any suggestions for things to look at? I've also checked out HadoopFS, OpenAFS, Lustre & GFS, but none of them seem to match my requirements.
Do you absolutely need to host this on your own servers? Much of what you need could be provided by Amazon S3. The delayed delete feature could be implemented by recording deletes to a SimpleDB table and running a garbage collection pass periodically to expunge files when necessary.
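A rough sketch of that delayed-delete idea (boto3 for the S3 side; a plain in-memory list stands in for the SimpleDB tombstone table, and the bucket name and grace period are invented):

```python
# A sketch of delayed deletes on S3: deletes only write a tombstone, and a
# periodic GC pass expunges objects older than the grace period. boto3 is
# used for S3; the tombstone "table" is a plain list standing in for SimpleDB.
import time
import boto3

GRACE_SECONDS = 30 * 24 * 3600          # Y = 30 days, arbitrary
BUCKET = "my-backup-bucket"             # hypothetical bucket name

s3 = boto3.client("s3")
tombstones = []                          # [(key, deleted_at), ...]

def soft_delete(key):
    """Logically delete: record a tombstone instead of touching S3."""
    tombstones.append((key, time.time()))

def restore(key):
    """Undo a logical delete within the grace period."""
    tombstones[:] = [(k, t) for k, t in tombstones if k != key]

def gc_pass():
    """Physically delete anything whose grace period has expired."""
    now = time.time()
    expired = [(k, t) for k, t in tombstones if now - t > GRACE_SECONDS]
    for key, _ in expired:
        s3.delete_object(Bucket=BUCKET, Key=key)
    tombstones[:] = [(k, t) for k, t in tombstones if now - t <= GRACE_SECONDS]
```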
There is still a single point of failure if you rely on a single internet connection. And of course you could consider Amazon themselves to be a point of failure but the failure rate is always going to be far lower because of scale.
And hopefully you realize the other benefits, the ability to scale to any capacity. No need for IT staff to replace failed disks or systems. Usage costs will continually drop as disk capacity and bandwidth gets cheaper (while disks you purchase depreciate in value).
It's also possible to take a hybrid approach: use S3 as a secure backend archive, cache "hot" data locally, and find a caching strategy that best fits your usage model. This can greatly reduce bandwidth usage and improve I/O, especially if data changes infrequently.
Downsides:
Files on S3 are immutable; they can only be replaced entirely or deleted. This is great for caching, not so great for efficiency when making small changes to large files.
Latency and bandwidth are those of your network connection. Caching can help improve this, but you'll never get the same level of performance.
Versioning would also be a custom solution, but it could be implemented using SimpleDB along with S3 to track sets of revisions to a file. Overall, it really depends on your use case whether this would be a good fit.
You could try running a source control system on top of your reliable file system. The problem then becomes how to expunge old check-ins after your timeout. You can set up an Apache server with DAV_SVN, and it will commit each change made through the DAV interface. I'm not sure how well this will scale with the large file sizes you describe.
#tweakt
I've considered S3 extensively as well, but I don't think it'll be satisfactory for us in the long run. We have a lot of files that must be stored securely, not through file ACLs but through our application layer. While this can also be done with S3, we'd have a bit less control over our file storage. Furthermore, there would be a major downside in the form of latency when we do file operations, both for initial saves (which can be done asynchronously, though) and when we later read the files and have to perform operations on them.
As for the SPOF, that's not really an issue. We do have redundant connections to our datacenter and while I do not want any SPOFs, the little downtime S3 has had is acceptable.
Unlimited scalability and no need for maintenance is definitely an advantage.
Regarding a hybrid approach: if we are to serve directly from S3 (which would be the case unless we want to store everything locally anyway and just use S3 as backup), the bandwidth prices are simply too steep once we add S3 + CloudFront (CloudFront would be necessary, as we have clients from all around). Currently we host everything from our datacenter in Europe, and we have our own reverse Squids set up in the US for low-budget CDN functionality.
While it's very domain-dependent, immutability is not an issue for us. We may replace files (that is, key X gets new content), but we will never make minor modifications to a file. All our files are blobs.