MySQL: Raspberry Pi Storage Media

With regard to the physical media used to store MySQL databases on a RasPi (e.g. SD card or USB drive), is there a documented preference / best practice?
I've searched and searched but cannot find anything documented; hopefully I've just overlooked it.
Can frequent database writes to a flash card lead to (early) corruption?
Would the database files be better placed on USB?

Typically, data is the most important asset your application handles. There are plenty of applications that have long been obsolete, yet the databases they managed still hold high value. In sum, you don't want to lose your data. At all. Unless, of course, this is a proof of concept or a testing application.
I would strongly advise against an SD card for storing important data: a database is written to very often and will burn through the card's write cycles quickly, which will force the card into read-only mode after a fairly short period of time.
A USB pendrive is not much better than the SD card (maybe a bit). They are still not very reliable under constant writes, which is exactly what a database application produces.
Any medium to serious application should use a real hard drive. For my local wiki at home (which incidentally uses MySQL) I use an SSD that can handle a high level of write operations without issues; I ended up spending $100 on it because I wanted a high-quality one. Nevertheless, it's a good idea to back up your database automatically once a day to another machine (another $5 Pi with an SD card, in my case). Though it's less likely, the primary SSD can fail too.
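To make the nightly backup suggestion concrete, here is a minimal sketch of the kind of job I mean, assuming mysqldump and scp are installed, credentials live in ~/.my.cnf, and passwordless SSH is set up for the backup host; the database name, host name and paths are placeholders:

```python
#!/usr/bin/env python3
"""Nightly MySQL dump copied to a second machine; a sketch, not production code."""
import datetime
import subprocess

DB_NAME = "wiki"                  # database to back up (placeholder)
BACKUP_HOST = "backup-pi.local"   # the second Pi that keeps the copies (placeholder)
REMOTE_DIR = "/home/pi/backups"

def backup() -> None:
    stamp = datetime.date.today().isoformat()
    dump_file = f"/tmp/{DB_NAME}-{stamp}.sql"

    # --single-transaction gives a consistent snapshot for InnoDB tables
    # without locking the database for the duration of the dump.
    with open(dump_file, "w") as out:
        subprocess.run(["mysqldump", "--single-transaction", DB_NAME],
                       stdout=out, check=True)

    # Push the dump to the other machine; schedule this script daily from cron.
    subprocess.run(["scp", dump_file, f"{BACKUP_HOST}:{REMOTE_DIR}/"], check=True)

if __name__ == "__main__":
    backup()
```

A cron entry along the lines of `0 3 * * * /home/pi/backup_db.py` would run it nightly.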


Why are "Paas" (aka cloud hosting) so expensive these days?

As a developer, I love the new hosting platforms that have appeared lately, such as Pantheon, Platform.sh or Acquia Cloud.
Automated workflows and task runners based on Git and simple YAML files are great features.
However, those platforms are quite expensive for someone who simply wants to host a personal website.
I'm wondering why PaaS offerings (aka managed hosts) are so expensive these days compared to other hosting solutions such as shared hosting or a VPS. The latter have seen their prices drop significantly in the last few years.
In my opinion, the price of a hosting service should be mainly based on...
the amount of traffic
the disk usage
... not on the technology sustaining the platform.
I am really not sure this is a Stack Overflow kind of question (more of a Quora one?). Still, I am going to have a stab at answering it, as I may be able to provide some insight (I work for Platform.sh):
What services like ours offer is a bargain of time vs. money. Basically, if we can save a company 100 hours a month of a skilled engineer's work, that is roughly $10,000 of value delivered (at around $100/hour).
Now, in order to achieve this, companies like the ones you have cited invest heavily in R&D... but also in support and ops people who are around 24/7.
The time spent on the project doesn't disappear; it gets shared between anonymous/invisible people. You may not see it, but it's there. The time you did not spend updating the kernel or a PHP version... the time you did not spend defending against a vulnerability. Someone spent that time (it is shared between many, many projects and rolled up into the global price).
... and there is also a lot of "invisible infrastructure" beyond the resources assigned to your specific project.
I think all three providers you cited offer multiple development/staging clusters (in our case 3 staging clusters). So when you see 5GB there are probably 20GB allocated...
If you were to install and run all the software necessary to run your single project under the same conditions... well, you would see CPU, memory and storage costs pile up quite rapidly: redundant storage nodes, storage for backups, monitoring and logging systems, multiple firewall layers, a build farm... and the coordination and control servers needed to run all of that.
So this is very much apples and oranges.
Often, when you work on a side/personal project, you don't have resources other than your time. So of course this kind of trade-off may be less appealing. You may think: well, I just need the CPU, I am a competent sysadmin, I don't need scaling and I don't need monitoring, nobody would scream if my blog were down... I am doing this by myself and I don't need multiple development environments... and you might very well be right...
I don't think providers such as the ones you have cited are a fit for every possible type of project. And none of us was dipped in a magic cauldron at birth; everything that we do can be replicated... if you spend enough time and resources on it.
I hope this answer gives you some clarity as to why they cannot possibly cost the same as bare-bones hosting.

Considerations for binary serializations (Protobuf, CBOR, MessagePack, etc.) for a long-term archive data format

In discussions for a next-generation scientific data format, a need for some kind of JSON-like data structures (logical grouping of fields) has been identified. Additionally, it would be preferable to leverage an existing encoding instead of inventing a custom binary structure. There are many options among serialization formats. Any guidance or insight from those who have experience with these kinds of encodings is appreciated.
Requirements: In our format, data needs to be packed in records, normally no bigger than 4096 bytes. Each record must be independently usable. The data must remain readable for decades to come. Data archiving and exchange is done by storing and transmitting a sequence of records. Data corruption must only affect the corrupted records, leaving all others in the file/stream/object readable.
Priorities (roughly in order) are:
stability, long term archive usage
performance, mostly read
ability to store opaque blobs
size
simplicity
broad software (aka library) support
stream-ability, transmitted and readable as a record is generated (if possible)
We have started to look at Protobuf (Protocol Buffers), CBOR (an IETF RFC) and, a bit, at MessagePack.
Any information from those with experience that would help us determine the best fit or, more importantly, avoid pitfalls and dead-ends, would be greatly appreciated.
Thanks in advance!
Late answer, but: you may want to decide whether you want a schema-based or a self-describing format. The Amazon Ion overview talks about some of the pros and cons of these design decisions, and so does this other ION (completely different from Amazon Ion).
Neither of those fully meets your criteria, but the articles should surface a few more criteria you might want to consider. Obviously, actually being a standard and being widely adopted are far better guarantees of longevity than any technical design criterion.
Your goal of recovering from data corruption is almost certainly something that should be addressed in a separate architectural layer from the encoding of the records. How many records to pack into a blob/file/stream is really more a question of how many records you can afford to read through sequentially before finding the one you need.
An optimal solution to storage corruption depends on what kind of corruption you consider likely. For example, if you store data on spinning disks, your best protection might be different from what you'd use for tape. But the details of that are really not an application-level concern; it's better to abstract/outsource that sort of concern.
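To make the "corruption affects only the damaged record" requirement concrete, here is a minimal framing sketch, independent of which encoding (CBOR, Protobuf, MessagePack...) you pick for the payload; the header layout and magic value are made up for illustration:

```python
import struct
import zlib
from typing import Iterator

MAGIC = b"RC"                     # per-record marker, helps resynchronise after corruption
HEADER = struct.Struct(">2sII")   # magic, payload length, CRC-32 of payload

def write_record(stream, payload: bytes) -> None:
    """Frame one already-encoded record (CBOR/Protobuf/...) for archiving."""
    stream.write(HEADER.pack(MAGIC, len(payload), zlib.crc32(payload)))
    stream.write(payload)

def read_records(stream) -> Iterator[bytes]:
    """Yield intact records; skip any record whose checksum does not match."""
    while True:
        header = stream.read(HEADER.size)
        if len(header) < HEADER.size:
            return                       # end of stream
        magic, length, crc = HEADER.unpack(header)
        payload = stream.read(length)
        if magic == MAGIC and len(payload) == length and zlib.crc32(payload) == crc:
            yield payload
        # else: the record is damaged; a real reader would scan forward for the
        # next MAGIC marker to resynchronise instead of just falling through.
```

Because each record carries its own length and checksum, a damaged record can be detected and skipped while every other record in the file, stream or object stays readable, regardless of the encoding inside.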
Modern cloud-based data storage services provide extremely robust protection against corruption, measured in the industry as "durability". For example, even Microsoft Azure's lowest-cost storage option, Locally Redundant Storage (LRS), stores at least three copies of any data received and maintains at least that level of protection for as long as you want. If any copy gets damaged, another is made from one of the undamaged ones as soon as possible. That results in an annual durability of 11 nines (99.999999999%), and that's the "low-cost" option at Microsoft. The normal redundancy plan, Geo-Redundant Storage (GRS), offers durability exceeding 16 nines. See Azure Storage redundancy.
According to Wasabi, eleven-nines durability means that if you have 1 million files stored, you might lose one file every 659,000 years. You are about 411 times more likely to get hit by a meteor than losing a file.
P.S. I previously worked on the Microsoft Azure Storage team, so that's the service I know best. However, other cloud-storage options offer similar durability protection; for example, Amazon S3 Standard and Wasabi hot storage are comparable to Azure LRS with eleven nines of durability. If you are not worried about a meteor strike, you can rest assured that these services won't lose your data anytime soon.

Adobe AIR unique ID issue

I created an AIR app which sends an ID to my server to verify the user's licence.
I created it using
NetworkInfo.networkInfo.findInterfaces(), and I use the first "name" value whose "displayName" contains "LAN" (or the first MAC address I get if the user is on a Mac).
But I get a problem:
sometimes users connect to the internet using a USB stick (provided by a mobile phone company) and that changes the serial number I get; probably the USB stick becomes the first value in the vector returned by findInterfaces().
I could take the last value, but I think I could get similar problems too.
So is there a better way to identify the computer even with these small hardware changes?
It would be nice to get the motherboard or CPU serial number, but that doesn't seem to be possible. I've found some workarounds to get it, but they work on Windows and not on a Mac.
I don't want to store data on the user's computer for authentication, to make the software "a little" more difficult to hack.
Any idea?
Thanks
Nadia
So is there a better way to identify the computer even with these small hardware changes?
No, there is no best practice for identifying a personal computer and building software licensing on top of it. You should use a server-side licensing manager to provide that functionality. It will also give your users more flexibility with your desktop software. It's much easier both for the product owner (you don't need a call centre fielding a call every time a network card, hard drive or whatever changes) and for the users.
Briefly speaking, a user's personal computer is insecure (frankly, you have no way to store something valuable there safely) and a very dynamic environment (hardware is replaced on too short a cycle to use it as part of a licensing scheme).
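To make the server-side suggestion concrete, here is a rough sketch (in Python rather than ActionScript, with made-up field names and secrets) of a licensing server issuing and checking signed, short-lived tokens, so the validation rules live on the server and can change without touching the installed app:

```python
import hashlib
import hmac
import json
import time

SERVER_SECRET = b"replace-with-a-real-secret"   # known only to the server

def issue_token(license_key: str, fingerprint: str) -> str:
    """Server side: validate license_key against your database, then sign a short-lived token."""
    payload = json.dumps({
        "license": license_key,
        "fingerprint": fingerprint,   # soft hint only, not the source of truth
        "expires": int(time.time()) + 7 * 24 * 3600,
    }, sort_keys=True)
    signature = hmac.new(SERVER_SECRET, payload.encode(), hashlib.sha256).hexdigest()
    return payload + "." + signature

def verify_token(token: str) -> bool:
    """Server side, on a later request: check the signature and the expiry."""
    payload, _, signature = token.rpartition(".")
    expected = hmac.new(SERVER_SECRET, payload.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(signature, expected):
        return False
    return json.loads(payload)["expires"] > time.time()
```

The client only ever forwards its key and whatever soft fingerprint it has; because the decision and the secret live server-side, a changed network card means a new fingerprint hint at worst, not a broken licence.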
I am in much the same boat as you, and I am now finally starting to address this... I have researched it for over a year and there are a couple of options out there.
The biggest thing to watch out for when using a 3rd-party system is the leech effect. Nearly all of them want a percentage of your profit, which in my mind makes them nothing more than vampireware. This is on top of the percentage you WILL pay to PayPal, your merchant processor, etc.
The route I will end up taking is creating a secondary ANE, probably written in Java, because of 1) transitioning my knowledge and 2) the ability to run on various architectures. I have to concede this solution is not foolproof, since reverse engineering Java is nearly as easy as reverse engineering anything running on Flash Player. The point is just to make it harder, not bulletproof.
As a side note, to any naysayers bringing up CPU/motherboard changes: this is extremely rare, if it's even still done at all. I work on a laptop, and obviously once that hardware cycle is over I need to re-register everything on a new one. So please...
Zarqon was developed by: Cliff Hall
This appears to be a good solution for small scale. The reason I do not believe it scales well (say beyond a few thousand users), based on the documentation, is that it appears to be a completely manual process, i.e. there is no ability to tie into a payment system to auto-generate the key and notify the user of it (I could be wrong about this).
Other helpful resources:
http://www.adobe.com/devnet/flex/articles/flex_paypal.html

How to get your code ready for load balancing

As we did this in the past, I'd like to gather useful information for everyone moving to load balancing, as there are issues your code must be aware of.
We moved from a single Apache server to Squid as a reverse proxy/load balancer with three Apache servers behind it.
We are using PHP/MySQL, so issues may differ.
Things we had to solve:
Sessions
We moved from "default" PHP sessions (files) to distributed memcached sessions. Simple solution, has to be done. This way, you also don't need "sticky sessions" on your load balancer (see the sketch below).
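The PHP side of this is mostly configuration (roughly, session.save_handler = memcached with session.save_path pointing at the pool). The underlying pattern is the interesting part: session state lives in a store every web node can reach, keyed by the session ID from the cookie, so the load balancer can send a request anywhere. A language-neutral sketch of that pattern in Python, using the pymemcache client purely for illustration and a placeholder host name:

```python
import json
from pymemcache.client.base import Client   # pip install pymemcache

# Every web node talks to the same memcached pool, so it does not matter
# which backend the load balancer picks for a given request.
sessions = Client(("memcached.internal", 11211))   # placeholder host

SESSION_TTL = 3600   # seconds of inactivity before the session expires

def load_session(session_id: str) -> dict:
    raw = sessions.get(f"sess:{session_id}")
    return json.loads(raw) if raw else {}

def save_session(session_id: str, data: dict) -> None:
    sessions.set(f"sess:{session_id}", json.dumps(data), expire=SESSION_TTL)
```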
Caching
To the non-distributed APC cache on each web server, we added another memcached layer for distributed object caching, and replaced all old/outdated file-caching systems with it.
Uploads
Uploads go to a shared (NFS) folder.
Things we optimized for speed:
Static Files
Our main NFS server runs lighttpd, serving (among other things, user-uploaded) images. Squid is aware of that and never queries our Apache nodes for images, which gave a nice performance boost. Squid is also configured to cache those files in RAM.
What did you do to get your code/project ready for load balancing? Are there any other concerns for people thinking about this move, and which platform/language are you using?
When doing this:
For HTTP nodes, I push hard for a single system image (OCFS2 is good for this) and use either Pound or Crossroads as a load balancer, depending on the scenario. Nodes should have a small local disk for swap and to avoid most (but not all) headaches of CDSLs.
Then I bring Xen into the mix. If you place a small, ephemeral amount of information on XenBus (e.g. how much virtual memory Linux has actually promised to processes per VM, a.k.a. Committed_AS), you can quickly detect a brain-dead load balancer and adjust it. Oracle caught on to this too... and is now working to improve the balloon driver in Linux.
After that, I look at the cost of splitting the database usage for any given app across SQLite3 and whatever DB the app wants natively, while realizing that I need to split the DB so posix_fadvise() can do its job and not pollute kernel buffers needlessly (see the sketch below). Since most DBMS services want to do their own buffering, you must also let them do their own clustering. This really dictates the type of DB cluster I use and what I do to the balloon driver.
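The posix_fadvise() point is easier to see with a concrete call. A small, Linux-specific sketch (Python exposes the syscall as os.posix_fadvise) that streams a large file and then asks the kernel to drop those pages from the page cache, so they don't compete with a DBMS that does its own buffering:

```python
import os

def stream_without_polluting_cache(path: str, chunk_size: int = 1 << 20) -> int:
    """Read a file sequentially and tell the kernel we won't need it cached."""
    total = 0
    fd = os.open(path, os.O_RDONLY)
    try:
        # Hint that access is sequential so kernel read-ahead is tuned for it.
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_SEQUENTIAL)
        while True:
            chunk = os.read(fd, chunk_size)
            if not chunk:
                break
            total += len(chunk)
        # Drop the pages we just read from the page cache.
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
    finally:
        os.close(fd)
    return total
```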
Memcache servers then boot from a skinny initrd, again while the privileged domain watches their memory and CPU use so it knows when to boot more.
The choice of heartbeat / takeover really depends on the given network and the expected usage of the cluster. It's hard to generalize that one.
The end result is typically 5 or 6 physical nodes with quite a bit of memory booting a virtual machine monitor + guests while attached to mirrored storage.
Storage is also hard to describe in general terms... sometimes I use clustered LVM, sometimes not. The "not" will change when LVM2 finally moves away from its current string-based API.
Finally, all of this coordination results in something like Augeas updating configurations on the fly, based on events communicated via Xenbus. That includes ocfs2 itself, or any other service where configurations just can't reside on a single system image.
This is really an application-specific question... can you give an example? I love memcache, but not everyone can benefit from using it, for instance. Are we reviewing your configuration or talking about best practices in general?
Edit:
Sorry for being so Linux-centric... it's typically what I use when designing a cluster.

Distributed filesystem sanity check [closed]

I'm in need of a distributed file system that must scale to very large sizes (about 100 TB realistic max). File sizes are mostly in the 10-1500 KB range, though some files may peak at about 250 MB.
I very much like the thought of systems like GFS with built-in redundancy for backup which would - statistically - render file loss a thing of the past.
I have a couple of requirements:
Open source
No SPOFs
Automatic file replication (that is, no need for RAID)
Managed client access
Flat namespace of files - preferably
Built in versioning / delayed deletes
Proven deployments
I've looked seriously at MogileFS, as it fulfills most of the requirements. It does not have any managed clients, but it should be rather straightforward to do a port of the Java client. However, there is no versioning built in. Without versioning, I will have to do normal backups in addition to the file replication built into MogileFS.
Basically I need protection from a programming error that suddenly purges a lot of files it shouldn't have. While MogileFS does protect me from disk & machine errors by replicating my files over X number of devices, it doesn't save me if I do an unwarranted delete.
I would like to be able to specify that a delete operation doesn't actually take effect until after Y days. The delete will logically have taken place, but I can restore the file state for Y days until it's actually deleted. MogileFS also does not have the ability to check for disk corruption during writes, though again, this could be added.
Since we're a Microsoft shop (Windows, .NET, MSSQL) I'd optimally like the core parts to be running on Windows for easy maintainability, while the storage nodes run *nix (or a combination) due to licensing.
Before I even consider rolling my own, do you have any suggestions for me to look at? I've also checked out HadoopFS, OpenAFS, Lustre & GFS, but none of them seems to match my requirements.
Do you absolutely need to host this on your own servers? Much of what you need could be provided by Amazon S3. The delayed delete feature could be implemented by recording deletes to a SimpleDB table and running a garbage collection pass periodically to expunge files when necessary.
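As an illustration of that garbage-collection idea, here is a sketch using boto3; instead of a SimpleDB table, it records "deletes" by moving objects under a trash/ prefix, and a periodic pass expunges anything older than the grace period. The bucket name, prefix and retention window are placeholders:

```python
import datetime
import boto3   # pip install boto3

s3 = boto3.client("s3")
BUCKET = "my-archive-bucket"              # placeholder
TRASH_PREFIX = "trash/"
RETENTION = datetime.timedelta(days=30)   # the "Y days" grace period

def soft_delete(key: str) -> None:
    """Logical delete: move the object under trash/ instead of removing it."""
    s3.copy_object(Bucket=BUCKET, Key=TRASH_PREFIX + key,
                   CopySource={"Bucket": BUCKET, "Key": key})
    s3.delete_object(Bucket=BUCKET, Key=key)

def garbage_collect() -> None:
    """Periodic pass that expunges trash older than the retention window."""
    cutoff = datetime.datetime.now(datetime.timezone.utc) - RETENTION
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=BUCKET, Prefix=TRASH_PREFIX):
        for obj in page.get("Contents", []):
            if obj["LastModified"] < cutoff:
                s3.delete_object(Bucket=BUCKET, Key=obj["Key"])
```

S3 bucket versioning combined with lifecycle rules can give you much the same delayed-delete behaviour without custom code, if that fits your setup.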
There is still a single point of failure if you rely on a single internet connection. And of course you could consider Amazon themselves to be a point of failure, but their failure rate is always going to be far lower because of scale.
And hopefully you realize the other benefits: the ability to scale to any capacity, and no need for IT staff to replace failed disks or systems. Usage costs will continually drop as disk capacity and bandwidth get cheaper (while the disks you purchase depreciate in value).
It's also possible to take a hybrid approach: use S3 as a secure backend archive, cache "hot" data locally, and find a caching strategy that best fits your usage model. This can greatly reduce bandwidth usage and improve I/O, especially if data changes infrequently.
Downsides:
Files on S3 are immutable; they can only be replaced entirely or deleted. This is great for caching, but not so great for efficiency when making small changes to large files.
Latency and bandwidth are those of your network connection. Caching can help improve this, but you'll never get the same level of performance.
Versioning would also be a custom solution, but it could be implemented using SimpleDB along with S3 to track sets of revisions to a file. Overall, it really depends on your use case whether this would be a good fit.
You could try running a source control system on top of your reliable file system. The problem then becomes how to expunge old check-ins after your timeout. You can set up an Apache server with DAV_SVN, and it will commit each change made through the DAV interface. I'm not sure how well this would scale with the large file sizes you describe.
@tweakt
I've considered S3 extensively as well, but I don't think it will be satisfactory for us in the long run. We have a lot of files that must be stored securely, not through file ACLs but through our application layer. While this can also be done with S3, we would have a bit less control over our file storage. Furthermore, there would be a major downside in the form of latency when we do file operations, both on initial saves (which can be done asynchronously, though) and when we later read the files and have to perform operations on them.
As for the SPOF, that's not really an issue. We do have redundant connections to our datacenter and while I do not want any SPOFs, the little downtime S3 has had is acceptable.
Unlimited scalability and no need for maintenance is definitely an advantage.
Regarding a hybrid approach: if we are to host directly from S3 (which would be the case unless we want to store everything locally anyway and just use S3 as backup), the bandwidth prices are simply too steep once we add S3 + CloudFront (CloudFront would be necessary, as we have clients from all around). Currently we host everything from our datacenter in Europe, and we have our own reverse Squid proxies set up in the US for low-budget CDN functionality.
While it's very domain-dependent, immutability is not an issue for us. We may replace files (that is, key X gets new content), but we will never make minor modifications to a file. All our files are blobs.