Optimal way to search large number of XML documents - json

I've got millions of XML documents in my filesystem and would like to query data from them fast. I'd be looking at getting data in one or multiple nodes.
I've looked at XML databases, but the ecosystem doesn't look very
mature ; I'm worried of going down a rabbit hole trying to implement it.
it.
I could convert the XML to JSON and put it in a NoSQL db like Mongo, but would be losing the nested structure of XML.
Having all the XMLs in memory and searching with lXML for instance is not an option, there are too many of them.
What would be the practical approaches in that case ? Any best practices ?

Definitely a case for an XML database like BaseX or EXist.

Related

On Node MySQL vs JSON

I'm currently using a MySQL where do I store JSON information and I'd fetch them from MySQL and parse them on my application. I would like to get rid of MySQL, but first I would like to know is that wise?
Is that efficient if I move to way that I store the data into data folder that contains .json files and these contains the data I need? There will be my app's coordinate data per user who wants to track themselves on map. Will that cause any issues? I don't need "query", but what about big data like 50K lines in example? Same amount will be in MySQL too. Amount doesn't change, but will there be any problems that appears when moving from "reading from sql to reading from json files"
It's difficult to answer all these questions in one, but I'll address some of them:
There are dedicated NoSQL databases that are very good at the type of data storage you're talking about: MongoDB, CouchDB etc. It might be worth checking these out. They are very good at dealing with JSON data. Querying and parsing are very simple in Node.js.
You can store JSON in MySQL (or other RDMS systems), I've done it in several projects with good results. As of MySQL 5.7.8 there is a dedicated JSON type. Queries can actually work surprisingly well, I know I've queried tables with tens of millions of JSON entries pretty quickly.
Make sure you consider backup and restore scenarios, what happens in the event of a data loss. Using MySQL or a NoSQL database will simplify this for you. Either way make sure you have this covered!
I wouldn't call 50K lines big data! I dealt with databases with tens of millions of rows.. this still wouldn't be called big data.
I would probably not recommend storing your data in files. I've worked in telematics before, we stored millions of JSON blobs in relational databases with very little problems. Later on we planned to move to a NoSQL database for these, but the relational database worked surprising well, especially because you can adopt a hybrid approach of using relational queries and including JSON data in the results (to be parsed by clients).
You might not need the ability to query, but it's very useful to get for example "Give me all JSON for user id 100". An RDBMS or NoSQL system would make this very easy.

Use XML file or DB

For my simple App, i have a ftp server where i can store file (json or xml) or DB. Multiple clients could access that file or DB to read or write (DB or file would have only up to 100 entries).
From one point of view, DB is more suited for having big amounts of entries, due to indexing. But from other point of view, i am not sure if there would be issue with xml or json file if multiple clients try to read or write at the same time from the same file? So i am thinking to use DB just to avoid that issue.
I'd suggest using a database for a few reasons:
Databases are designed for exactly this scenario.
If you ever need to work on a larger scale you won't need to change your code.
You'll get to practice writing db code that will be useful in future, larger scale projects.
There are some really good database technologies that will work well with what you need, for example MongoDB, MySQL, SQL Server. They all have great support, code examples and you'll be able to use Stack Overflow to ask questions about them.
After googling, it seems that SQLite is the best choice. It is good approach for small DB, it is self-contained, allowing safe access from multiple processes or threads. Exactly what i needed.

which database suits my application mysql or mongodb ? using Node.js , Backbone , Now.js

I want to make an application like docs.google.com (without its api,completely on my own server) using
frontend : backbone
backend : node
What database would u think is better ? mysql or mongodb ? Should support good scalability .
I am familiar with mysql with php and i will be happy if the answer is mysql.
But many tutorials i saw, they used mongodb, why did they use mongodb without mysql ?
What should i use ?
Can anyone give me link for some sample application(with source) build using backbone , Node , mysql (or mongo) . or atleast app. with Node and mysql
Thanks
With MongoDB, you can just store JSON objects and retrieve them fully-formed, so you don't really need an ORM layer and you spend less CPU time translating your data back-and-forth. The developers behind MongoDB have also made horizontally scaling the database a higher priority and let you run arbitrary Javascript code to pre-process data on the DB side (allowing map-reduce style filtering of data).
But you lose some for these gains: You can't join records. Actually, the JSON structure you store could only be done via joins in SQL, but in MongoDB you only have that one structure to your data, while in SQL you can query differently and get your data represented in alternate ways much easier, so if you need to do a lot of analytics on your database, MongoDB will make that harder.
The query language in MongoDB is "rougher", in my opinion, than SQL's, partly because it's less familiar, and partly because the querying features "feel" haphazardly put together, partially to make it valid JSON, and partially because there are literally a couple of ways of doing the same thing, and some are older ways that aren't as useful or regularly-formatted as the others. And there's the added complexity of the array and sub-object types over SQL's simple row-based design, so the syntax has to be able to handle querying for arrays that contain some of the values you defined, contain all of the values you defined, contain only the values you defined, and contain none of the values you defined. The same distinctions apply to object keys and their values, and this makes the query syntax harder to grasp. (And while I can see the need for edge-cases, the $where query parameter, which takes a javascript function that is run on every record of the data and returns a boolean, is a Siren song because you can easily define what objects you want to return or not, but it has to run on every record in the database, no indexes can be used.)
So, it depends on what you want to do, but since you say it's for a Google Docs clone, you probably don't care about any representation but the document representation, itself, and you're probably only going to query based on document ID, document name, or the owner's ID/name, nothing too complex in the querying.
Then, I'd say being able to take the JSON representation of the document your user is editing, and just throw it into the database and have it automatically index these important fields, is worth the price of learning a new database.
I was also struggling with this choice looking at the hype created by using MongoDB for tasks it was not built for. So my 2 cents are:
Storing and retrieving hierarchical objects, that your documents probably are, is easier in MongoDB, as David says. It becomes more complicated if you want to store documents that are bigger than 16Mb though - MongoDB's answer is GridFS.
Organising documents in folders, groups, keeping track of which user owns which documents and who he/she provided access to them is definitely easier with MySQL - you have the advantage of powerful SQL queries with joins etc., built in EXPLAIN optimization, triggers, functions, stored procedures, etc. MongoDB is nowhere near.
So what prevents you from using both MySQL to organize the documents and MongoDB to store one collection of documents identified by id (or several collections - one for each document type)? It seems to me the best choice and using two databases in one application is not a problem, really.
MySQL will store users, groups, folders, permissions - whatever you fancy - and for each document it will store a reference to the collection and the document id (MongoDB has a special format for it - DBRefs). MongoDB will store documents themselves in collections, if they are all less than 16MB, or the previews and metadata of documents in collections and the whole documents in GridFS.
David provided a good answer. A few things to add to it.
MongoDB's flexible nature permits for easy agile / iterative development.
MongoDB like node.js is asyncronous in nature and works very well within asyncronous environments.
Mongoose is a good ODM (object document mapper) that makes working with MongoDB with Node.js feel very natural. Unlike ORMs this is a very thin layer.
For Google Doc like functionality, the flexibility & very rich data structure provided by MongoDB feels like a much better fit.
You can find some good example posts by searching for mongoose, node and MongoDB.
Here's one that also uses backbone.js and looks good http://mattkopala.com/blog/2012/02/12/getting-started-with-nodejs/

Storing simple data on the server, db or json?

I have an app and it needs to store a few simple objects that change, and it needs to be able to persist them. Should I save the data to a db like mongo and read it in on startup or should I just save to a json file?
Thanks for any tips.
Either way can work. 'It depends' which is best in your situation.
Things to consider:
Flat files are not (or not easily) queriable.
You will have to manage flat files (e.g. folders, file names, etc)
Is it possible that there will be a collision? Are you going to read & write often?
If you are only persisting a few objects, is a database like mongo overkill?
Is something like redis a better solution?
I'm currently working on a project that involves several data stores. MySQL, flat-files, mongo, et cetera.
The flat files work really well for storing data that will be pulled up later. It's suitable for data that isn't read or written often.
Mongo has modules like Mongoose that take away a lot of complexities. But the documentation for both Mongoose and Mongo can be a bit dry and confusing.
Think about how the data will be used, how it will evolve, and if you need to query it. If it just needs to be pulled up at the app's startup, then a flat file can work. If it is going to grow into larger objects then flat files can still be the solution. But if there's a chance for additional objects in the future, then flat-files will become harder to manage, and it will be a pain to add them into a working system.
And overall, if you want any type of querying (other than manually opening files and inspecting them) then go with Mongoose or a similar database.
It depends mongo will be easier if u want to alter the data. It is basically a storage engine for json lol. Depends on your use case. If you think this will grow, you might as well use mongo. If your storage needs will never change use a json file.

What are some cons of storing html in a database for use?

Altough its very easy to do a search about the topic, it's not as easy to come to a conclusion. What are some cons of storing html in a database for use?
HTML is static, and querying the data from a database uses database resources; database resources are typically among the more restricted on moderate to heavy use systems, therefore it makes sense to not store HTML in the database, but to place it on the filesystem, where it can be retrieved without using critical resources.
In the broadest sense, HTML is a document markup language and serves to structure data into a document. The database on the other hand should contain raw data organized along its logical relations. Documents use formatting and may present data redundantly, but the true, underlying data is always fixed. Thus you should store the most immediate, raw form of data that you possibly can, and retrieve it in meaningful ways using both the query language itself to create suitable views for your purposes, and other, output-specific data processing to generate documents.
Of course you may like to cache the result of an output formatting operation, and you may choose to store the cache in a database, too. That's fine of course. But concerning the raw payload data, I would always go for the above.
That depends on the use of the HTML in the database. If it's data that you only ever access as a blob (meaning you never/rarely query the contents of the HTML), then I think it can be a good idea in some cases. Then the question is essentially the same as "Should I store files in xyz format in my database?" And the answer to questions like that depends on several things:
How large are the files? Would storing them on the filesystem, with just their filename/path in the DB be more efficient?
Do you need to replicate the data to other servers? If so, then storing raw files in the DB may be easier than on the FS, if you already have DB-sync infrastructure in place.
What are your query uses like? Are they friendlier to a DB or a file system storage?
Now, if you're talking about storing HTML data that you frequently have to query, that changes the game entirely.
Any database normalization nazi would tell you never to do it. But there might be cases when it's useful. For instance, if you're using some sort of full-text searching engine, you may want that in a database--or in whatever form the full-text search engine uses.