I need to store a large amount of data in a MySQL database, and I would like to save disk space by compressing the text data before storing it.
I know there will be a performance hit for compressing/decompressing the data, but I am going to cache the decompressed data on a CDN, and for the most part the data will not become stale for months or even years.
Can you please recommend some good compression/decompression techniques? I am also open to alternatives other than compressing/decompressing the data.
If you want a pure MySQL solution, you could always try using the ARCHIVE storage engine for your table. The documentation describes it as an insert-only, no-update engine meant specifically for what you describe: stashing away things that won't change for years.
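A minimal sketch of what that looks like from application code, assuming the mysql-connector-python driver; the connection details and the documents table are made up for illustration, not from the question:

    import mysql.connector

    # Connection details are placeholders.
    conn = mysql.connector.connect(user="app", password="secret", database="archive_demo")
    cur = conn.cursor()

    # ARCHIVE tables compress rows with zlib and only support INSERT and SELECT.
    cur.execute("""
        CREATE TABLE IF NOT EXISTS documents (
            doc_id INT UNSIGNED NOT NULL,
            body   MEDIUMTEXT   NOT NULL
        ) ENGINE=ARCHIVE
    """)

    cur.execute("INSERT INTO documents (doc_id, body) VALUES (%s, %s)",
                (1, "lots of text that will not change for years ..."))
    conn.commit()

    cur.execute("SELECT body FROM documents WHERE doc_id = %s", (1,))
    print(cur.fetchone()[0])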
To do the same thing in a conventional engine would require running zlib over your data streams, but remember that compression performs very poorly on already compressed data such as most popular image and video formats. You describe your data as mostly text, which usually compresses quite well.
Ruby has Zlib::Deflate, which can compress and expand data on demand. You could write your own wrapper, similar to the JSON one, by implementing encode and decode methods on your module.
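The suggestion above is Ruby's Zlib::Deflate; purely as an illustration of the same encode/decode wrapper idea, here is a rough equivalent sketched with Python's zlib (the function names are my own):

    import zlib

    def encode(text: str) -> bytes:
        """Compress a unicode string to bytes suitable for a BLOB column."""
        return zlib.compress(text.encode("utf-8"), 6)

    def decode(blob: bytes) -> str:
        """Decompress bytes read back from the database."""
        return zlib.decompress(blob).decode("utf-8")

    if __name__ == "__main__":
        original = "some long article text ... " * 100
        packed = encode(original)
        assert decode(packed) == original
        print(f"{len(original)} bytes -> {len(packed)} bytes compressed")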
One thing to consider is that you can probably store the compressed data on your CDN, as long as you can be sure your clients support gzip encoding. I don't know of any major browsers that don't, as asset compression has become quite standard, especially in the mobile space.
If the data really is as static as you say, then save it as zipped XML files.
You can unzip them and zip them up again really easily as and when needed, and in Rails generating an XML file is dead simple using SomeModel.to_xml, the output of which can easily be sent to a file, so maintaining them will be simple too. You can just as easily work this the other way round, so that when it comes to reading the data in you can simply convert it back into a model. (Rails 3.x has ActiveModel, which would be ideal for this scenario: the data is not backed by a database, but you still get an ActiveRecord-style API and all the juice that AR gives you, meaning that your views, controllers etc. work with a consistent API and consistent behaviour.)
You have other options as well such as using ActiveResource but I wouldn't think that was necessary.
This would not be a recommended approach if you were not caching the data in the way you suggest (which is a neat solution, BTW).
I am in the process of building my first live Node.js web app. It contains a form that accepts data regarding my client's current stock. When submitted, an object is made and saved to an array of current stock. This stock is then permanently displayed on their website until the entry is modified or deleted.
It is unlikely that there will ever be more than 20 objects stored at any time, and these will only be updated perhaps once a week. I am not sure whether it is necessary to use MongoDB to store these, or whether there is a simpler, more appropriate alternative. Perhaps the objects could be stored in a JSON file instead? Or would this have too big an impact on page load times?
You could potentially store it in a JSON file, or even in a cache of sorts such as Redis, but I still think MongoDB would be your best bet for a live site.
Storing data in a JSON file is not scalable, so if you end up storing a lot more data than originally planned (this often happens) you may find you run out of storage on your server's hard drive. Also, if you end up scaling and putting your app behind a load balancer, you will need to make sure there are matching copies of that JSON file on each server. Furthermore, it is easy to run into race conditions when updating a JSON file: if two processes try to update the file at the same time, you are potentially going to lose data. Technically speaking, a JSON file would work, but it's not recommended.
Storing in memory (i.e. in Redis) has a similar implication in that the data is only available on that one server. Also, the data is not persistent by default, so if your server restarted for whatever reason you could lose what was stored in memory.
For all intents and purposes, MongoDB is your best bet.
The only way to know for sure is to test it with a load test. But as you probably read HTML and JS files from the file system when serving web pages anyway, the extra load of reading a few JSON files shouldn't be a problem.
If you want to go the simpler way, i.e. a JSON file, use the nedb API, which is plenty fast as well.
I am making a new version of an old static website that grew to 50+ static pages.
So I made a JSON file with the old content so the new website can be more CMS-like (with templates for common pages) and so the backend gets more DRY.
I wonder if I can serve that content to my views from the JSON file, or if I should keep it in a MySQL database.
I am using Node.js, and in Node I can store that JSON file in memory, so no file reading is done when a user asks for the data.
Is there a correct practice for this? Are there performance differences between serving a cached JSON file and serving via MySQL?
The file in question is about 400KB, if the file size is relevant to the choice of one technology over the other.
Why add another layer of indirection? Just serve the views straight from JSON.
Normally, a database is used for serving dynamic content that changes frequently, where records have one-to-many or many-to-many relationships, and where you need to query the data based on various criteria.
In the case you described, it looks like you will be fine with a JSON file cached in server memory. Just make sure you update the cache whenever the content of the file changes, e.g. by restarting the server, triggering a cache update via an HTTP request, or monitoring the file at the file-system level.
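The question is about Node, but the load-once-and-refresh pattern is the same in any runtime; here is a minimal sketch in Python under that assumption (the file path is made up):

    import json
    import os

    CONTENT_PATH = "content.json"
    _cache = {"mtime": None, "data": None}

    def get_content():
        """Return the parsed JSON, re-reading the file only when it has changed."""
        mtime = os.path.getmtime(CONTENT_PATH)
        if _cache["mtime"] != mtime:
            with open(CONTENT_PATH, encoding="utf-8") as fh:
                _cache["data"] = json.load(fh)
            _cache["mtime"] = mtime
        return _cache["data"]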
Aside from that, you should consider caching static files on the server and in the browser for better performance:
Cache and gzip static files (html, js, css, jpg) in server memory on startup. This can easily be done using an npm package like connect-static.
Use the client's browser cache by setting proper response headers. One way to do it is adding the maxAge option to the Express static middleware, e.g.:
app.use "/bower", express.static("bower-components", {maxAge:
31536000})
Here is a good article about browser caching
If you are already storing your views as JSON and using Node, it may be worth considering using a MEAN stack (MongoDB, Express, Angular, Node):
http://meanjs.org/
http://mean.io/
This way you can code the whole thing in JS, including the document store in the MongoDB. I should point out I haven't used MEAN myself.
MySQL can store and serve JSON no problem, but as it doesn't parse it, it's very inflexible unless you split it out into components, and indexing within the document is close to impossible.
Whether you 'should' do this depends entirely on your individual project and whether it is/how it is likely to evolve.
As you are implementing a new version of the website (with a CMS), that suggests it is live and subject to growth or change, and perhaps storing JSON in MySQL is storing up problems for the future. If it really is just one file, pulling it from the file system and caching it in RAM is probably easier for now.
I have stored JSON in MySQL for our projects before, and in all but a few niche cases ended up splitting up the component data.
400KB is tiny. All the data will live in RAM, so I/O won't be an issue.
Dynamically building pages -- All the heavy hitters do that, if for no other reason than inserting ads. (I used to work in the bowels of such a company. There were millions of pages live all the time; only a few were "static".)
Which CMS -- too many to choose from. Pick a couple that sound easy; then see if you can get comfortable with them. Then pick between them.
Linux/Windows; Apache/Tomcat/nginx; PHP/Perl/Java/VB. Again, your comfort level is an important criterion for this tiny web site; any of them can do the task.
Where might it go wrong? I'm sure you have hit web pages that are miserably slow to render. So, it is obviously possible to go the wrong direction. You are already switching gears; be prepared to switch gears a year or two from now if your decision turns out to be less than perfect.
Do avoid any CMS that is too heavy into EAV (key-value) schemas. They might work ok for 400KB of data, but they are ugly to scale.
It's good practice to serve the JSON directly from RAM if your data size will not grow in the future, but if the data is going to grow, that approach becomes a worst case for the application.
If you are not expecting to add (m)any new pages, I'd go for the simplest solution: read the JSON once into memory, then serve from memory. 400KB is very little memory.
No need to involve a database. Sure, you can do it, but it's overkill here.
I would recommend generating static HTML content at build time (use Grunt or ...). If you would like to apply changes, trigger a build, regenerate the static content and deploy it.
We are in the process of setting up a web application (a start-up at present). The web application will quickly grow in terms of the number of JSON files that it needs to handle. We are probably talking about 5-10 million files. The individual JSON files are not particularly large, maybe in the region of 150K per file. Files are unlikely to be accessed concurrently, as individual users have their own set of files.
The question I would like to put out there is simply how best to store the JSON files. Is a CDN best, where links are stored in a relational database? Or should I jump on the bandwagon and go down the route of a NoSQL database? Or maybe there are other solutions I haven't thought about?
I'm really looking for some good advice, ideally from someone with experience of large databases.
Many thanks in advance!
Markus
I would consider looking into MongoDB, since it already stores its documents in a JSON-like format.
You could also stick them into a regular relational DB, but the nice thing about working with JSON documents in Mongo is that you get query capabilities against the documents, so you don't always have to load the entire document.
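As a rough illustration of that, here is a sketch using pymongo; the driver choice, database, collection and field names are assumptions, not from the question:

    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")
    docs = client.mydb.user_documents

    # Insert one of the JSON documents as-is.
    docs.insert_one({"user_id": 42, "title": "Q3 report", "payload": {"rows": [1, 2, 3]}})

    # Query inside the document and project only the fields you need,
    # instead of loading the whole ~150K document.
    result = docs.find_one({"user_id": 42}, {"title": 1, "_id": 0})
    print(result)  # {'title': 'Q3 report'}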
If all you want is quick access to a write-once-read-many type of storage, then you can also consider DBM. It is fast, cheap, reliable.
Assuming you will compress the file contents, JSON-ness is probably a non-factor from a storage perspective.
Reliability - can you tolerate some statistical loss? If not, an all-or-nothing DB is the only choice left. If you can, filesystem-based storage may be an alternative. Filesystems are not as fanatical as DBs about whole-data integrity checks, and they are much better supported. Serving files is easier, but keeping track of versions takes more design-time effort. A common enough pattern is to serve product images and other collateral out of the filesystem while keeping the rest of the data in an RDBMS.
If you are considering CDN -> relational DB, then you could also consider CDN -> {filesystem, inode}, keeping filesystems balanced explicitly in terms of file count.
A NoSQL database like MongoDB might have restart and recovery times beyond your tolerance levels; otherwise it's a great tool. Many RDBMSs have raw partition support for much better IO. At 150KB you must use a TEXT or CLOB field, which is just a minor annoyance.
HTH. I would appreciate it if you shared back what you actually end up using.
I am developing a live chat application using Node.js, Socket.IO and a JSON file. I am using the JSON file to read and write the chat data. Now I am stuck on one issue: when I do stress testing, i.e. pushing continuous messages into the JSON file, the JSON format becomes invalid and my application crashes. Although I am using forever.js, which should keep the application up, the application still crashes.
Does anybody have an idea about this?
Thanks in advance for any help.
It is highly recommended that you reconsider your approach to persisting data to disk.
Among other things, one really big issue is that you will likely experience data loss. If we both read the file at the exact same time - {"foo":"bar"} - and we both make a change, and you save yours before I save mine, my change will overwrite yours, because I started from the same content as you: although you saved before me, I didn't re-read the file after you saved.
What you are possibly seeing now in an append-only approach is that we're both adding bits and pieces without regard to valid JSON structure (IE: {"fo"bao":r":"ba"for"o"} from {"foo":"bar"} x 2).
Disk I/O is actually pretty slow. Even with an SSD hard drive. Memory is where it's at.
As recommended, you may want to consider MongoDB, MySQL, or something similar. This may be a decent use case for Couchbase, which is an in-memory key/value store based on memcached that persists data to disk as soon as possible. It is extremely JSON-friendly (it is actually largely built around JSON), offers great map/reduce support for querying data, is super easy to scale to multiple servers, and has a node.js module.
This would allow you to very easily migrate your existing data storage routine into a database. Also, it provides CAS support, which will protect you from data loss in the scenarios outlined earlier.
At minimum, though, you should modify an in-memory object that you save to disk every so often to prevent permanent data loss. However, this only works well with one server, and then you're likely back to needing a database.
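The app in question is Node, but the idea above is language-agnostic; here is a minimal sketch of it in Python, using a write-to-temp-then-rename step so a crash mid-write never leaves half-written, invalid JSON (the file name and the 5-second interval are assumptions):

    import json
    import os
    import tempfile
    import threading

    messages = []          # the in-memory chat log the app appends to
    DATA_FILE = "chat.json"

    def flush_to_disk():
        """Write a full snapshot atomically, then schedule the next flush."""
        fd, tmp_path = tempfile.mkstemp(dir=".", suffix=".tmp")
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump(messages, fh)
        os.replace(tmp_path, DATA_FILE)               # atomic rename
        threading.Timer(5.0, flush_to_disk).start()   # flush again in 5 seconds

    flush_to_disk()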
I'm writing a webcrawler in Python that will store the HTML code of a large set of pages in a MySQL database. I'd like to make sure my methods of storage and processing are optimal before I begin processing data. I would like to:
Minimize storage space used in the database - possibly by minifying HTML code, Huffman encoding, or some other form of compression. I'd like to maintain the possibility of fulltext searching the field - I don't know if compression algorithms like Huffman encoding will allow this.
Minimize the processor usage necessary to encode and store large volumes of rows.
Does anyone have any suggestions or experience with this or a similar issue? Is Python the optimal language for this, given that it's going to require a number of HTTP requests and regular expressions, plus whatever compression is optimal?
If you don't mind the HTML being opaque to MySQL, you can use the COMPRESS function to store the data and UNCOMPRESS to retrieve it. You won't be able to use the HTML contents in a WHERE clause (using, e.g., LIKE).
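A minimal sketch of that from Python, assuming the mysql-connector-python driver and a pages table with a url column and a LONGBLOB html column (the table and column names are illustrative):

    import mysql.connector

    conn = mysql.connector.connect(user="crawler", password="secret", database="crawl")
    cur = conn.cursor()

    html = "<html>...page source...</html>"

    # Let MySQL compress on the way in ...
    cur.execute("INSERT INTO pages (url, html) VALUES (%s, COMPRESS(%s))",
                ("http://example.com/", html))
    conn.commit()

    # ... and decompress on the way out. Note you can't usefully run LIKE
    # against the compressed column. The driver returns the BLOB as bytes.
    cur.execute("SELECT UNCOMPRESS(html) FROM pages WHERE url = %s",
                ("http://example.com/",))
    print(cur.fetchone()[0].decode("utf-8"))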
Do you actually need to store the source in the database?
Trying to run 'LIKE' queries against the data is going to suck big time anyway.
Store the raw data on the file system as standard files. Just don't stick them all in one folder; use hashes of the id to store them in predictable folders.
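A minimal sketch of that hashed-folder layout (the root path and the two-level fan-out are assumptions):

    import hashlib
    import os

    STORAGE_ROOT = "/var/crawler/pages"

    def path_for(page_id: str) -> str:
        """Map an id like '123456' to .../ab/cd/123456.html using a hash prefix."""
        digest = hashlib.md5(page_id.encode("utf-8")).hexdigest()
        return os.path.join(STORAGE_ROOT, digest[:2], digest[2:4], f"{page_id}.html")

    def save_page(page_id: str, html: str) -> None:
        path = path_for(page_id)
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "w", encoding="utf-8") as fh:
            fh.write(html)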
(While it is of course perfectly possible to store the text in the database, it bloats the size of your database and makes it harder to work with: backups get (much!) bigger, changing storage engine becomes more painful, etc. Scaling your filesystem is usually just a case of adding another hard disk. That doesn't work so easily with a database - you start needing to shard.)
... to do any sort of searching on the data, you're looking at building an index. I only have experience with SphinxSearch, but that allows you to specify a filename in the input database.