Grails App with Huge Tables - csv

I'm trying to create a database from existing csv files that are about 20,000 columns wide and 700 rows deep. In grails I would like the 20,000 column domain to belongTo another simpler domain (about 200 columns). But upon compilation I get:
java.lang.RuntimeException: Class file too large!
Which is understandable because it's way too much data. My question is, what is the best approach to handle this problem in grails? Should I simply break up the big table into separate domains? Look for a different table format?
I'm specifically worried about:
1) Search time, parsing search methods then delegating to sub domains.
2) Importing the data from the huge csv file into the domains.

When you crash into a JVM size limit like this, take it as a big hint that your approach is way off. As I mentioned in another question earlier this week, we shouldn't even know what these limits are, much less be anywhere near hitting them.
I don't see much benefit in using something like GORM or even an O-O approach in general to this much data. It's not an object in any realistic, usable sense - it's a massive bunch of data. You'll need to programmatically access everything anyway even if it did work, since hand-managing the code for that would be crazy amounts of code. Do you really plan on creating one or more instances of these beasts and passing them around as method args?
You'll need to look at this from a big data perspective, not an ORM perspective.

Related

Best way to store/edit GEOJSON working on Wordpress [duplicate]

I want to create a website that will have an ajax search. It will fetch the data or from a JSON file or from a database.I do not know which technology to use to store the data. JSON file or MySQL. Based on some quick research it is gonna be about 60000 entries. So the file size if i use JSON will be around 30- 50 MB and if use MySQL will have 60000 rows. What are the limitations of each technique and what are the benefits?
Thank you
I can't seem to comment since I need 50 rep. for commenting, so I will give it as an answer:
MySQL will be preferable for many reasons, not the least of which being you do not want your web server process to have write access to the filesystem (except for possibly logging) because that is an easy way to get exploited.
Also, the MySQL team has put a lot of engineering effort into things such as replication, concurrent access to data, ACID compliance, and data integrity.
Imagine if, for instance, you add a new field that is required in whatever data structure you are storing. If you store in JSON files, you will have to have some process that opens each file, adds the field, then saves it. Compare this to the difficulty of using ALTER TABLE with a DEFAULT value for the field. (A bit of a contrived example, but how many hacks do you want to leave in your codebase for dealing with old data?) so to be really blunt about, MySQL is a database while JSON is not, so the correct answer is MySQL, without hesitation. JSON is just a language, and barely even that. JSON was never designed to handle anything like concurrent connections or any sort of data manipulation, since its own function is to represent data, not to manage it.
So go with MySQL for storing the data. Then you should use some programming language to read that database, and send that information as JSON, rather than actually storing anything in JSON.
If you store the data in files, whether in JSON format or anything else, you will have all sorts of problems that people have stopped worrying about since databases started being used for the same thing. Size limitations, locks, name it. It's good enough when you have one user, but the moment you add more of them, you'll start solving so many problems that you would probably end up by writing an entire database engine just to handle the files for you, while all along you could have simply used an actual database. Do note! Don't take my word for granted, I am not an expert on this field, so let others post their answer and then judge by that. I think enough people here on stackoverflow have more experience then I do haha. These are NOT entirely my words, but I have taken out the parts that were true from what I knew and know and added some of my own knowledge :) Have a great time making your website
For MySQl :you can select specific rows,or specific column using queries ,filter data based on a key,order alphabetically
downside:need a REST API to fetch data because it can't be accessed directly,you have to use php or python or whatever programming language for backend code.
for json file :benefits :no backend code directly accessed using GET http request.
downside:no filtering ,ordering or any queries,you have to do it manually.

How to migrate data from mongodb to mysql?

I am currently working on an application like to analitics, i has Angularjs app which communicates with Spring REST Client App from which user creates token(trackingID) and use generated script with this id putting on his website to collect information about visitor's actions through another Spring REST tracking App, for tracking app i am using as mongodb to collect visitor actions/visitor info for fast insertion, but for rest client app mysql with user/accounts details.
My question is how to migrate mongo data from tracking app to mysql maybe for getting posibility of join for easily and fastest way of analyze data with any kind of filters from angularjs client app, to create manually any workers that periodically will transfer data from last point to present state from mongo to mysql, or are any existed tools that can be setted for this transfer?
There is no official library to do this.
But you can use mongoexport feature from mongoDB to export it in a CSV format and mysqlimport to import them into MySQL.
Here are links to the documentation MySQL import and MongoDB Export.
One more method you can try to write a program in one of your favorite language and read from MongoDB and write into MySQL
MySQL 5.7 has a new JSON data type, that can be very convenient.
You can create a table at MySQL to receive the JSON messages AS IS, and then use SQL to query it or do a post processing to load the data in a structured set of database tables.
Check this out: https://dev.mysql.com/doc/refman/5.7/en/json.html
I realise this question is a few years old - but recently I've had a number of people enquiring whether a tool I developed (https://virtual.blue/apps/json-converter) can do exactly what the OP is asking (convert MongoDB to SQL) so I am guessing it is still something people want. Keep reading to find out why I am honestly not surprised by this.
The short answer to whether the tool can help you is: perhaps. If your existing data relationships are not too complicated, and your database is not enormous, it may well be worth a try.
However, I thought it might help to try and explain what the issues are with this kind of conversion, since all the answers I have seen so far are along the lines of "try tool X" or "first convert to format Y and then you can slurp it into MySQL using utility Z". ie there is no thought to whether what you get at the end of doing this is going to make sense in terms of data relationships and integrity.
For example, you could just stick your entire database dump in a single field of a single SQL table (ok space limitations might prevent this in reality, but hopefully you get my point). Then your database would be "in MySQL format", but it would be absolutely no use to anyone.
The point is, what you actually want is a fully defined database model, correctly encapsulating all of the intrinsic data relationships. ("Database normalization" as it is known.) If your conversion process gets those relationships wrong, then you have a broken model, and any queries you try to run over it are likely to return nonsense. Unfortunately there is no magic tool that is just going to "know" the best way to represent your data in MySQL, and closing your eyes and shovelling it into a bunch of random tools is unlikely to miraculously get you what you want.
And herein lies the fundamental problem with the "NoSQL" philosophy (fad). They sold people the bogus notion of "non-relational data". My first thought when I heard this was, "How does that work? Surely all data is relational?" By the looks of things we are steadily getting more and more evidence that my instincts were right. ("NoSQL? Why stop there? I go with 'NoDatabase'. It returns no results at all, but it sure is fast!")
The NoSQL madness throws several important fundamental engineering principles to the wind. We shouted "don't hard code!", "DRY!" (Don't Repeat Yourself) because these actions infuse inflexibility into systems. Traditional wisdom makes precisely the same flexibility argument when it advises "create a fully described model with all the data relationships represented". Then you can execute any arbitrary query over it and expect meaningful results. "Yes but there are a whole bunch of queries we are never going to need to run," says the NoSQL proponent. But surely we learnt our lesson on things we are "never going to need to do"? ("I hard code liberally, because I know I am never going to want to change my code." Hmm...)
The arguments about speed are largely moot. Say it turns out you are frequently doing a complex 9 table join, with unsurprisingly sluggish performance. So create an index. Cache it. Swap some disk space for speed. The NoSQL philosophy is to swap data integrity for speed, which makes no sense at all.
When you generate your fast lookup index (cache/table/map/whatever) what you are really doing is creating a view over your model. If your model changes, you can readily update your view. Going from a model to a view is easy - it's a one to many operation and you are on the right side of entropy.
However, when you went with MongoDB you effectively decided to create views without bothering to describe your fundamental model. Now you discover there are queries you want to run, but can't - and so it's no wonder you want to move over to SQL and actually have your data modelled correctly. The problem is you now want to go from a view to a model. Now you're on the wrong side of entropy. Your view is a lossy representation of the model's fundamental relationships. You can't expect a tool to "translate" your database, because you are asking it to insert new relationships which were not originally defined. These are real world relationships that are not machine-guessable. The tool cannot know what relationships were intended.
In short the only way you can do this reliably is to get your hands dirty. An intelligent human, with complete understanding of the system you are modelling needs to sit down and carefully come up with (possibly a substantial amount of) code which effectively picks through the data and resolves all of the insufficiently represented data relationships. If your data is complex then it's going to be a headache and there is no way to cheat.
If your data is still relatively simple then I would suggest making the conversion as soon as possible, before it becomes difficult. In this case my tool (https://virtual.blue/apps/json-converter) may be able to help.
(They really should have asked a Physicist before they came up with all this nonsense...!)
You can download a trial version of Studio 3T for Mongo and export your database to SQL (or JSON) directly

Which is better: Many files or a singular file with a lot of data

I am working with a lot of separate data entries and unfortunately do not know SQL, so I need to know which is the faster method of storing data.
I have several hundred, if not in the thousands, individual files storing user data. In this case they are all lists of Strings and nothing else, so I have been listing them line by line as such, accessing the files as needed. Encryption is not necessary.
test
buyhome
foo
etc. (About 75 or so entries)
More recently I have learned how to use JSON and had this question: Would it be faster to leave these as individual files to read as necessary, or as a very large JSON file I can keep in memory?
In memory access will always be much faster than disk access, however if your in memory data is modified and the system crashes you will lose that data if it has not been saved to a form of persistent data storage.
Given the amount of data you say you are working with, you really should be using a database of some sort. Either drop everything and go learn some SQL (the basics are not that hard) or leverage what you know about JSON and look into a NoSQL database like MongoDB.
You will find that using the right tool for the job often saves you more time in the long run than trying to force the tool you currently have to work. Even if you need to invest some time upfront to learn something new.
First thing is: DO NOT keep data in memory. Unless you are creating portal like SO or Reddit, RAM as a storage is a bad idea.
Second thing is: reading a file is slow. Opening and closing a file is slow too. Try to keep number of files as low as possible.
If you are gonna use each and every of those files (key issue is EVERY), keep them together. If you will only need some of them, store them separately.

neo4j or neo4j+mysql for partial graph dataset

Even though I read another question here advising not to use both neo4j and mysql (neo4j - graph database along with a relational database?), I was wondering what approach would be the best for dataset that has some data which can be modeled like a graph and the rest looks relational. For some reasons, I can't post the kind of data I'm using.
I can shoehorn the relational part into neo4j but it looks ugly and complex, something I would want to avoid.
On the other hand, if I use both together, I'll have to do double the amount of queries to get the result, decreasing performance (assume the DBs are on cloud in separate machines).
I can't use mysql alone because one of the queries requires a depth of around 20-30 which I assume can't be handled by mysql.
Have any of you encountered such a situation before ? If so, how did you solve it ?
As everyone else says: "give us a better idea of what data you are trying to model so we can best give you a suggestion".
That being said, dealing with 2 DBs is not an issue and its more common than people think: often-times you use a Full-Text store for searches and then get back a list of Document IDs which you then hit the relational DB for additional metadata. Or hitting Redis to get a list of IDs which you also hit the relational DB for more data.
I proof-of-concepted a system of Neo4j+MySQL for targeted searching based on your social network ("show me all restaurants my network has recommended ordered by depth (e.g. 1st level friend recs are weighted higher than 2nd level, and so on) and it didn't feel awkward. But I also didn't take it to scale.
You will be having to keep both datastores in sync. So in my case when a user recommends a place on the web app (which inserts it into MySQL) you then need to turn around and do the same insert into Neo. You probably want to do this asynchronously as well, so you'll need to setup a message queue with workers.

SQL Assemblies vs Application code for complicated queries on large XML columns

I have a table with a few relational columns and one XML column which sometimes holds a fairly large chunk of data. I also have a simple webservice which uses the database. I need to be able to report on things like all the instances of a certain element within the XML column, a list of all the distinct values for a certain element, things like that.
I was able to get a list of all the distinct values for an element, but didn't get much further than that. I ended up writing incredibly complex T-SQL code to do something that seems pretty simple in C#: go through all the rows in this table, and apply this ( XPath | XQuery | XSLT ) to the XML column. I can filter on the relational columns to reduce the amount of data, but this is still a lot of data for some of the queries.
My plan was to embed an assembly in SQL Server (I'm using 2008 SP2) and have it create an indexed view on the fly for a given query (I'd have other logic to clean this view up). This would allow me to keep the network traffic down, and possibly also allow me to use tools like Excel and MSRS reports as a cheap user interface, but I'm seeing a lot of people saying "just use application logic rather than SQL assemblies". (I could be barking entirely up the wrong tree here, I guess).
Grabbing the big chunk of data to the web service and doing the processing there would have benefits as well - I'm less constrained by the SQL Server environment (since I don't live inside it) and my setup process is easier. But it does mean I'm bringing a lot of data over the network, storing it in memory while I process it, then throwing some of it away.
Any advice here would be appreciated.
Thanks
Edit:
Thanks guys, you've all been a big help. The issue was that we were generating a row in the table for a file, and each file could have multiple results, and we would doing this each time we ran a particular build job. I wanted to flatten this out into a table view.
Each execution of this build job checked thousands of files for several attributes, and in some cases each of these tests these were generating thousands of results (MSIVAL tests were the worst culprit).
The answer (duh!) is to flatten it out before it goes into the database! Based on your feedback, I decided to try creating a row for each result for each test on each file, and the XML just had the details of that one result - this made the query much simpler. Of course, we now have hundreds of thousands of rows each time we run this tool but the performance is much better. I now have a view which creates a flattened version of one of the classes of results that are emitted by the build job - this returns >200,000 and takes <5 seconds, compared to around 3 minutes for the equivalent (complicated) query before I went the flatter route, and between 10 and 30 minutes for the XML file processing of the old (non-database) version.
I now have some issues with the number of times I connect, but I have an idea of how to fix that.
Thanks again! +1's all round
I suggest using the standard xml tools in TSQL. (http://msdn.microsoft.com/en-us/library/ms189075.aspx). If you don't wish to use this I would recommend processing the xml on another machine.
SQLCLR is perfect for smaller functions, but with the restrictions on the usable methods it tends to become an exercise in frustration once you are trying to do more advanced things.
What you're asking about is really a huge balancing act and it totally depends on several factors. First, what's the current load on your database? If you're running this on a database that is already under heavy load, you're probably going to want to do this parsing on the web service. XML shredding and querying is an incredibly expensive procedure in SQL Server, especially if you're doing it on un-indexed columns that don't have a schema defined for them. Schemas and indexes help with this processing overhead, but they can't eliminate the fact that XML parsing isn't cheap. Secondly, the amount of data you're working with. It's entirely possible that you just have too much data to push over the network. Depending on the location of your servers and the amount of data, you could face insurmountable problems here.
Finally, what are the relative specs of your machines? If your web service machine has low memory, it's going to be thrashing data in and out of virtual memory trying to parse the XML which will destroy your performance. Maybe you're not running the most powerful database hardware and shredding XML is going to be performance prohibitive for the CPU you've got on your database machine.
At the end of the day, the only way to really know is to try both ways and figure out what makes sense for you. Doing the development on your web services machine will almost undoubtedly be easier as LINQ to XML is a more elegant way of parsing through XML than XQuery shoehorned into T-SQL is. My indication, given the information you provided in your question, is that T-SQL is going to perform better for you in the long run because you're doing XML parsing on every row or at least most rows in the database for reporting purposes. Pushing that kind of information over the network is just ugly. That said, if performance isn't that important, there's something to be said about taking the easier and more maintainable route of doing all the parsing on the application server.