Loading XML "Cache" Versus Querying DB. The Drawbacks? - mysql

For a read-only application, I am currently storing data in a relational database, but rather than querying it via the app, I am doing a nightly write of the data, including its relationships, to an XML file.
Granted, it is not a lot of data -- the XML represents less than 1000 objects.
Then, through client side code, I am loading that data, and "querying" it as necessary.
No write operations are required -- the app's sole function is search and display.
I've developed the app in such a way that whether it queries the db or the loaded xml can be switched very easily, and so I am able to compare performance.
I find that, for example, full-text search is effectively instant with the loaded-XML approach.
However, I know there are drawbacks to this approach, and I would greatly appreciate it if any of you could help me flesh out when and why this is or is not a valid approach.
Thanks in advance.

When you load XML into a good XML processing engine, it constructs appropriate data structures to speed up XPath queries or the tree traversal in general.
When you keep the data in a relational database and query it, the query optimizer builds a query plan which will access data in some optimized way too.
Which method best suits you completely depends on the nature of your queries.
Note that loading an XML document and parsing it on each client call may be quite expensive. Unless you use some kind of application server that keeps the parsed XML tree in memory, a database query will most probably be the better way, since with 1000 records your whole table will fit into the cache.
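For illustration, here is a minimal sketch of the keep-the-parsed-tree-in-memory idea using Python and lxml; the file name, element name, and searched attribute are assumptions, not details from the question:

    # Parse the nightly XML export once, keep the tree in memory,
    # and answer search requests with XPath instead of re-parsing per request.
    # Assumes a hypothetical export "catalog.xml" containing <item name="..."> elements.
    from lxml import etree

    _TREE = None  # parsed once, reused for every query

    def get_tree(path="catalog.xml"):
        global _TREE
        if _TREE is None:
            _TREE = etree.parse(path)  # the expensive step, done once per process
        return _TREE

    def search_by_name(term):
        # Substring match on the hypothetical "name" attribute.
        return get_tree().xpath('//item[contains(@name, $term)]', term=term)

The point is that the expensive parse happens once, so every subsequent "query" is just an XPath evaluation over an already-built tree.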

It sounds valid to me.
The drawbacks would come if the size of the data was prohibitively large and threatened your available memory, or if it was shared in such a way that thread safety was an issue.
But you say it's small and read-only, so it sounds fine to me. Keeping the data close to where it's needed is something that every hardware designer would understand.
You say it's stored as XML, but I assume you read the file once per day, parse and store it in an in-memory DOM object, and query it using XPath. Is the XPath performance adequate for your needs? That would be my only concern.

It all comes down to resource management. If you have the resources to run queries, that's the better road, because your data is "live", as opposed to caching an XML file and then parsing it. If you are worried about performance: unless you are querying tens of millions of rows of data, I wouldn't be too concerned. We have a box with 60+ clients that constantly run queries all day, and it performs quite well with everything running. XML parsing is, most of the time, more stressful to the server than a query.

Related

Where to store large number of JSON files

We are in the process of setting up a web application (a start-up at present). The web application will quickly grow in terms of the number of JSON files it needs to handle. We are probably talking about 5-10 million files. The individual JSON files are not particularly large - maybe in the region of 150K per file. Files are unlikely to be accessed concurrently, since each user has their own set of files.
The question I would like to put out there is simply how best to store the JSON files. Is a CDN best, with links stored in a relational database? Or should I jump on the bandwagon and go down the route of a NoSQL database? Or maybe there are other solutions I haven't thought about?
Really looking for some good advice, ideally from someone with experience of large databases.
Many thanks in advance!
Markus
I would consider looking into MongoDB since it already stores its documents in a json format.
You could also stick it into a regular relational DB, but the nice thing about working with JSON documents in Mongo is that you have query capabilities against the documents, so you don't always have to load the entire document.
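As a rough sketch of what querying inside the stored documents looks like, here is a pymongo example; the database, collection, and field names are made up for illustration:

    # Query fields inside stored JSON documents and return only a projection,
    # instead of loading each ~150 KB document whole.
    # "mydb", "userdocs", "owner", and "meta.title" are hypothetical names.
    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")
    docs = client.mydb.userdocs

    # Find one user's documents, but fetch only the small "meta.title" part.
    for doc in docs.find({"owner": "markus"}, {"meta.title": 1, "_id": 0}):
        print(doc)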
If all you want is quick access to a write-once-read-many type of storage, then you can also consider DBM. It is fast, cheap, reliable.
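For a sense of scale, a DBM-style store is only a few lines with Python's standard library; the key scheme here is invented for illustration:

    # Write-once-read-many storage of JSON blobs in a DBM file.
    # The key scheme "user:<id>:<filename>" is a hypothetical choice.
    import dbm
    import json

    with dbm.open("jsonstore.db", "c") as db:
        db["user:42:profile.json"] = json.dumps({"name": "Markus"}).encode("utf-8")

    with dbm.open("jsonstore.db", "r") as db:
        profile = json.loads(db["user:42:profile.json"].decode("utf-8"))
        print(profile["name"])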
Assuming you will compress the file contents, JSON-ness is probably a nonfactor from a storage perspective.
Reliability - can you tolerate some statistical loss? If not, an all-or-nothing database is the only choice left. If you can, filesystem-based storage may be an alternative. Filesystems are not as fanatical as databases about whole-data integrity checks, and they are much better supported. Serving files is easier, but keeping track of versions takes more design-time effort. A common enough pattern is to serve product images and other collateral out of the filesystem while keeping other data in an RDBMS.
If you are considering CDN -> relational DB, then you could also consider CDN -> {filesystem, inode}, keeping the filesystems explicitly balanced in terms of file count.
A NoSQL database like MongoDB might have restart and recovery times beyond your tolerance levels; otherwise it's a great tool. Many RDBMSs have raw-partition support for much better IO. At 150 KB you would need a TEXT or CLOB field, which is just a minor annoyance.
HTH. I would appreciate it if you shared back what you actually used.

Optimal way to "mirror" JSON data from a third-party to a Meteor Collection

We have a Meteor-based system that basically polls for data from a third-party REST API, loops through the retrieved data, and inserts or updates each record in a Meteor collection.
But then it hit me: What happens when an entry is deleted from the data of the third-party?
One approach would be to insert/update the data, then loop through the collection and find which documents aren't in the fetched data (a set-difference version of this is sketched below). True, that's one way of doing it.
Another would be to clear the collection, and rewrite everything from the fetched data.
But with thousands of entries (currently at 1500+ records, and it will potentially explode), both seem very slow and CPU-intensive.
What is the most optimal procedure to mirror data from a JS object to a Meteor/Mongo collection in such a way that items deleted from the source data are also deleted from the collection?
I think code is irrelevant here since this could be applicable to other languages that can do a similar feat.
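For what it's worth, the first approach mentioned above (diff the fetched data against the collection) does not have to be a per-document loop: one bulk upsert plus one delete of everything not in the fetched set does the same job. The sketch below uses pymongo-style calls as a stand-in for the Meteor collection API, and "external_id" is a hypothetical unique key from the REST API:

    # Set-difference mirror: upsert everything that was fetched, then delete
    # anything in the collection that the third party no longer returns.
    from pymongo import MongoClient, UpdateOne

    coll = MongoClient()["mydb"]["mirror"]

    def mirror(fetched_records):
        # 1. Upsert each fetched record, keyed on its external id.
        ops = [UpdateOne({"external_id": r["id"]},
                         {"$set": dict(r, external_id=r["id"])},
                         upsert=True)
               for r in fetched_records]
        if ops:
            coll.bulk_write(ops)

        # 2. Remove documents whose external id was not in this fetch.
        fetched_ids = [r["id"] for r in fetched_records]
        coll.delete_many({"external_id": {"$nin": fetched_ids}})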
For this kind of usage, try using something that's more optimized. The Meteor guys are working on using Meteor as a sort of replica MongoDB set to get and set data.
For the moment there is Smart Collections, which uses MongoDB's oplog to significantly boost performance. It could work in a sort of one-size-fits-all scenario without optimizing for specifics. There are benchmarks that show this.
When Meteor 1.0 comes out I think they'll have optimized their own mongodb driver.
I think this may help with thousands of entries. If you're changing thousands of documents every second you need to get something closer to mongodb. Meteor employs lots of caching techniques which aren't too optimal for this. I think it polls the database every 5 seconds to refresh its cache.
Smart Collections: http://meteorhacks.com/introducing-smart-collections.html
Please do let me know if it helps; I'm interested to know whether it's useful in this scenario.
If this doesn't work, Redis might be helpful too, since everything is stored in memory. I'm not sure what your use case is, but if you don't need persistence, Redis would squeeze out more performance than Mongo.
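If the Redis route were taken, the cache could be as simple as this redis-py sketch; the key name and expiry are arbitrary assumptions:

    # Cache the fetched third-party payload in Redis (no persistence assumed).
    # The key "thirdparty:latest" and the 60-second expiry are arbitrary choices.
    import json
    import redis

    r = redis.Redis(host="localhost", port=6379)

    def cache_payload(records):
        r.set("thirdparty:latest", json.dumps(records), ex=60)

    def read_payload():
        raw = r.get("thirdparty:latest")
        return json.loads(raw) if raw else None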

XML or MySQL for User Database?

Might seem a strange question but would there be a performance benefit in using XML for a database rather than MySQL and tables?
To put this into context, I will be creating a website that has user profiles. I know more XML than MySQL, and I know most people will use MySQL as standard, but I was wondering if anyone could throw some pennies this way about how the two compare, and whether this suggestion is as outrageous to anyone who understands big-O notation as it could be...
The bigger the XML file, the more memory you use, because you'll have to load the entire XML file into RAM while running your script.
An average MySQL database is about 4 MB. Take the equivalent as a 4 MB XML file: it gets loaded from disk into RAM on every page view, so with about 25 visitors at any given moment that's already 100 MB gone, and if they flick through pages a lot it quickly adds up to a gigabyte of RAM.
Not to mention you'll add about 1 second to page load every time, if not longer.
Not to mention continuous disk load for reading and writing changed values, and concurrency issues when two visitors want to update the same XML file.
These problems you don't have with an SQL server.
MySQL has indexes, and it's optimized for the binary values you will be storing. All you have with an XML file is a plain file, and any optimizations (caching, indexing, anything you can think of) will be up to you to implement.
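To make the "up to you to implement" point concrete, here is a hand-rolled index over a flat XML user file in Python; the element and attribute names are invented, and with MySQL the equivalent is a single CREATE INDEX:

    # A do-it-yourself "index" over a hypothetical <users><user name="..."/></users> file.
    # You must build, refresh, and invalidate this yourself; MySQL does it for you.
    import xml.etree.ElementTree as ET

    tree = ET.parse("users.xml")
    by_name = {u.get("name"): u for u in tree.findall(".//user")}

    def find_user(name):
        return by_name.get(name)  # fast, but only because the dict was built above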
XML is a great format for transport, everybody speaks it.. but you do not want to use it for storage.
And if you already know XML, but not yet MySQL.. I would say you're ahead of the game. You'll probably find writing SQL queries and fetching the results more straightforward than working with xml data.
As I see it, there are several XML DB solutions available; these appear in a simple Google search:
http://exist-db.org/exist/index.xml
http://basex.org/
http://www.oracle.com/technetwork/database/features/xmldb/index.html
http://www.sedna.org/
So all that matters here is the speed of development. If you're mostly familiar with XML, then using one of those could be a booster for development time.
However, there are plenty of relational DB ORM products, depending on the programming language, that take most of the dev effort out of it and make it easy to use a database for a web site. So if you don't have specific needs for your web site, you might go with any of the options above.
It depends on the structure of your database. This question can't be given a definite answer without knowing anything about your data; any comparison of XML versus a relational database depends heavily on which data you choose and what type of operations you plan.
For example, say you want to store, index, and query more than a million rows, where each row has the same set of fields. That's a simple, fixed structure, and it's the same for all records. It's a perfect fit for a relational database and can be stored in a single table; relational databases handle such fixed records very efficiently.
Well, there are two main questions here.
First, if you're going to use a database, you have a choice between an XML database and a relational database. The choice depends primarily on the nature of your data (especially its complexity, but also the way in which it is used).
Then you have the choice between using a database and using a simple file (for example an XML file). That choice depends primarily on the quantity of data and the transaction throughput.
Since you haven't told us much about the nature of the data or its quantity or the throughput requirements, it's hard to advise you specifically on either question.

What are some cons of storing html in a database for use?

Although it's very easy to do a search about the topic, it's not as easy to come to a conclusion. What are some cons of storing HTML in a database for use?
HTML is static, and querying the data from a database uses database resources. Database resources are typically among the more restricted on moderate- to heavy-use systems, so it makes sense not to store HTML in the database, but to place it on the filesystem, where it can be retrieved without using critical resources.
In the broadest sense, HTML is a document markup language and serves to structure data into a document. The database on the other hand should contain raw data organized along its logical relations. Documents use formatting and may present data redundantly, but the true, underlying data is always fixed. Thus you should store the most immediate, raw form of data that you possibly can, and retrieve it in meaningful ways using both the query language itself to create suitable views for your purposes, and other, output-specific data processing to generate documents.
Of course you may like to cache the result of an output formatting operation, and you may choose to store the cache in a database, too. That's fine of course. But concerning the raw payload data, I would always go for the above.
That depends on the use of the HTML in the database. If it's data that you only ever access as a blob (meaning you never/rarely query the contents of the HTML), then I think it can be a good idea in some cases. Then the question is essentially the same as "Should I store files in xyz format in my database?" And the answer to questions like that depends on several things:
How large are the files? Would storing them on the filesystem, with just their filename/path in the DB be more efficient?
Do you need to replicate the data to other servers? If so, then storing raw files in the DB may be easier than on the FS, if you already have DB-sync infrastructure in place.
What are your query uses like? Are they friendlier to a DB or a file system storage?
Now, if you're talking about storing HTML data that you frequently have to query, that changes the game entirely.
Any database normalization nazi would tell you never to do it. But there might be cases when it's useful. For instance, if you're using some sort of full-text searching engine, you may want that in a database--or in whatever form the full-text search engine uses.

SQL Assemblies vs Application code for complicated queries on large XML columns

I have a table with a few relational columns and one XML column which sometimes holds a fairly large chunk of data. I also have a simple webservice which uses the database. I need to be able to report on things like all the instances of a certain element within the XML column, a list of all the distinct values for a certain element, things like that.
I was able to get a list of all the distinct values for an element, but didn't get much further than that. I ended up writing incredibly complex T-SQL code to do something that seems pretty simple in C#: go through all the rows in this table, and apply this ( XPath | XQuery | XSLT ) to the XML column. I can filter on the relational columns to reduce the amount of data, but this is still a lot of data for some of the queries.
My plan was to embed an assembly in SQL Server (I'm using 2008 SP2) and have it create an indexed view on the fly for a given query (I'd have other logic to clean this view up). This would allow me to keep the network traffic down, and possibly also allow me to use tools like Excel and MSRS reports as a cheap user interface, but I'm seeing a lot of people saying "just use application logic rather than SQL assemblies". (I could be barking entirely up the wrong tree here, I guess).
Grabbing the big chunk of data to the web service and doing the processing there would have benefits as well - I'm less constrained by the SQL Server environment (since I don't live inside it) and my setup process is easier. But it does mean I'm bringing a lot of data over the network, storing it in memory while I process it, then throwing some of it away.
Any advice here would be appreciated.
Thanks
Edit:
Thanks guys, you've all been a big help. The issue was that we were generating one row in the table per file, each file could have multiple results, and we were doing this each time we ran a particular build job. I wanted to flatten this out into a table view.
Each execution of this build job checked thousands of files for several attributes, and in some cases each of these tests generated thousands of results (MSIVAL tests were the worst culprit).
The answer (duh!) is to flatten it out before it goes into the database! Based on your feedback, I decided to try creating a row for each result of each test on each file, with the XML holding just the details of that one result - this made the query much simpler. Of course, we now have hundreds of thousands of rows each time we run this tool, but the performance is much better. I now have a view which creates a flattened version of one of the classes of results emitted by the build job - it returns >200,000 rows and takes <5 seconds, compared to around 3 minutes for the equivalent (complicated) query before I went the flatter route, and between 10 and 30 minutes for the XML file processing of the old (non-database) version.
I now have some issues with the number of times I connect, but I have an idea of how to fix that.
Thanks again! +1's all round
I suggest using the standard XML tools in T-SQL (http://msdn.microsoft.com/en-us/library/ms189075.aspx). If you don't wish to use those, I would recommend processing the XML on another machine.
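If the processing is moved off the SQL Server box as suggested, a client-side pass over the XML column might look roughly like this; the connection string, table, column, and XPath expression are all placeholders, not details from the question:

    # Pull the XML column out of SQL Server and shred it client-side with lxml
    # instead of doing XQuery in T-SQL. "Results", "Payload", "JobId" and the
    # XPath expression are hypothetical placeholders.
    import pyodbc
    from lxml import etree

    conn = pyodbc.connect("DRIVER={ODBC Driver 17 for SQL Server};SERVER=myserver;"
                          "DATABASE=mydb;Trusted_Connection=yes")

    values = []
    for (xml_text,) in conn.execute("SELECT Payload FROM Results WHERE JobId = ?", 42):
        root = etree.fromstring(xml_text.encode("utf-8"))
        values.extend(e.text for e in root.xpath("//Result/Value") if e.text)

    print(sorted(set(values)))  # e.g. the distinct values of one element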
SQLCLR is perfect for smaller functions, but with the restrictions on the usable methods it tends to become an exercise in frustration once you are trying to do more advanced things.
What you're asking about is really a huge balancing act and it totally depends on several factors. First, what's the current load on your database? If you're running this on a database that is already under heavy load, you're probably going to want to do this parsing on the web service. XML shredding and querying is an incredibly expensive procedure in SQL Server, especially if you're doing it on un-indexed columns that don't have a schema defined for them. Schemas and indexes help with this processing overhead, but they can't eliminate the fact that XML parsing isn't cheap. Secondly, the amount of data you're working with. It's entirely possible that you just have too much data to push over the network. Depending on the location of your servers and the amount of data, you could face insurmountable problems here.
Finally, what are the relative specs of your machines? If your web service machine has low memory, it's going to be thrashing data in and out of virtual memory trying to parse the XML which will destroy your performance. Maybe you're not running the most powerful database hardware and shredding XML is going to be performance prohibitive for the CPU you've got on your database machine.
At the end of the day, the only way to really know is to try both ways and figure out what makes sense for you. Doing the development on your web service machine will almost undoubtedly be easier, as LINQ to XML is a more elegant way of parsing XML than XQuery shoehorned into T-SQL. My inclination, given the information you provided in your question, is that T-SQL is going to perform better for you in the long run, because you're doing XML parsing on every row, or at least most rows, in the database for reporting purposes. Pushing that kind of information over the network is just ugly. That said, if performance isn't that important, there's something to be said for taking the easier and more maintainable route of doing all the parsing on the application server.