Which type of database is a good fit for this work load? - mysql

I know this is a somewhat subjective question but I want to throw it out there anyway because if there is some insight that I've missed, asking this question will save me (and hopefully others) a great deal of searching :-)
Basically, what is the best type of database for when you have a number of items interlinked. For example:
A->B->C->D->E->F->G->H->I
And you want to be able to quickly find the shortest path (or the number of paths ideally) between A and I?
I would normally just use a relational database for this, but I'm not sure a MapReduce style database is any better and then I wondered if there was a type of database I hadn't considered altogether...
As always, all help gratefully received :)

This is exactly the sort of thing neo4j is designed for. It's a graph database that provides operations for doing graph traversal and other common operations.

I would suggest looking into MongoDB - It's NOT a relational database, and all your data gets saved and passed as JSON, it's easy scale-able and build for performance, and comes with a couple of tricks up its sleeve as well, like MapReduce build into mongoDB and way way more.
Since the data is saved as JSON you can save objects within objects and arrays well:
{
foo:"bar",
animals: ['cat','dog','fish'],
blog:{
post:"Hello World!",
comments:["this is cool","Hello back"]
}
}
Checkout - http://www.mongodb.org/
or http://www.mongodb.org/display/DOCS/MapReduce/
for more info

Related

Can we do Parallel Streaming, with IJSON for bulky JSON data Parsing/processing

I am aware how IJSON is solving, the bulky JSON reading and processing challenges. However i am not able to find any article which specifies how to speed up this:
I have seen few things to achieve that
1. Use YAZL backend
2. Play with buff_size parameter(not seeing any significant improvement)
Question which I had in my mind or I guess many are already working on that:
Now if we want to utilize parallel processing power of the machine, will IJSON supports that.
I know at no point IJSON knows entire stream size, so splitting is out of question.
My knowledge is quite limited in this area.
Any thread, document or link would be a nice start for me to understand this more clearly.

Converting large JSON file to XLS/CSV file (Kickstarter campaigns)

As part of my Master's thesis, I'm trying to run some statistics on which factors affect whether crowdfunding campaigns get funded or not. I've been trying to get data from the largest platform Kickstarter.com. Unfortunately, they have removed all the non-successful campaigns from their website (unless you have the direct link).
Luckily, I'm not the only one looking for this data.
Webrobots.io have a scraper robot which crawls all Kickstarter projects and collects data in JSON format (http://webrobots.io/kickstarter-datasets/).
The latest dataset can be found on:
http://webrobots.io/wp-content/uploads/2015/10/Kickstarter_2015-10-22.json_.zip
However, my programming skills are limited, and I don't know how to convert it into an excel file where I can manipulate the data and run my analysis. I found a few online converters, but the file is far too big for it (approx 300 mb).
Can someone please help me get the file converted?
It will earn you an acknowledgement in my Master's thesis when it gets published :)
Thanks in advance!!!
I guess the answer for this varies massively on a few things.
What subject is the masters covering? (mainly to appease many people who will probably assume you're hoping for people to do your homework for you! This might explain why the thread has been down-voted already)
You mention your programming skills are limited... What programming skills do you have? What language would you be using to achieve this goal? Bear in mind that even with a fully coded solution, if it's not in the language you know, you might not be able to compile it!
What kind of information do you want from the JSON file?
With regards to question 3, I've looked in the JSON file and it contains hierarchical data which is pretty difficult to replicate in a flat file i.e. an Excel or CSV file (I should know, we had to do this a lot in a previous job of mine).
But, I would look at the following plan of action to achieve what you're after:
Use a JSON parser to serialize the data into a class structure (Visual Studio can create the classes for you... See this S/O thread - How to show the "paste Json class" in visual studio 2012 when clicking on Paste Special?)
Once you've got the objects in memory, you can then step through them one by one and pick out the data you want and append them to a comma-separated string (in C# I'd use the StringBuilder) and write the rows of data out to a file on disk.
Once this is complete, you'll have the data you want.
Depending on what data you want from the JSON file, step 2 could be the most difficult part as you'd need to step into the different levels of the data hierarchy.
Hope this points you in the right direction?
You may want to look at this Blog.
http://jdunkerley.co.uk/2015/09/04/downloading-and-parsing-met-office-historic-station-data-with-alteryx/
He uses a process with Alteryx that may line up with what you are trying to do. I am looking to do something similar, but haven't tried it yet. I'll update this answer if I get it to work.

How can I handle relational data in Meteor?

I am learning Meteor using the Discover Meteor book.
I come from a PHP and MySQL background, and the application I am thinking of doing as a side-project is a real-time Backgammon web application. While Meteor's reactivity is a very, very big plus, I am stumped on how I can handle relational data (e.g. games, users, tournaments, friends, teams, etc).
I have read a lot of answers (ranging from old to very old) on StackOverflow on how one can use MySQL with Meteor. My search has led me to numtel/meteor-mysql. However, when I look at the examples provided in that repository, it is nowhere as clean as Meteor's own implementation of MongoDB.
My options, as I understand them, are the following:
Use MongoDB, and rewrite a lot of the features present in RDBMS in Javascript
Use an RDBMS that is not as well-supported in Meteor as MongoDB
IMO, option two is much less work, and I think might lead to less problems in the future. Take the problem in the epilogue of Why You Should Never Use MongoDB, for example.
We could also model this data as a set of nested hashes. The set of information about a particular TV show is one big nested key/value data structure. Inside a TV show, there’s an array of seasons, each of which is also a hash. Within each season, an array of episodes, each of which is a hash, and so on. This is how MongoDB models the data. Each TV show is a document that contains all the information we need for one show.
But then, how would you query for the TV shows that someone has starred in?
Back to my original question: is there something I'm missing here? Handling relational data is something that a lot of applications will need to do, but I can't seem to find a clean solution
It will be much less work if you go with option 1 in my opinion.
It won't be difficult to learn to use MongoDB, and since MongoDB uses JSON objects and is supported natively by Meteor and all it's packages, it will be much less work.
I advise having a look at the aldeed packages: collection2 and simple-schema to structure your collections. I also advice using the collection-helpers package to help with joins.
If you have a posts collection with name, authorId and content fields, then to get the author of the post, you'd write Meteor.users.findOne(userId).
Hope that clears things up a bit and gets you on your way.

Is there a way to formerly define a time interval for configuring a process?

Horribly worded question...I know.
I'm working on an application that processes data for the previous day. The problem is that I know the customer is going to eventually ask to it for every hour or some other arbitrary time interval. I know that languages such as Java or SQL have masks for defining dates. Well what about a way to define a time interval?
Let me ask it this way. If someone asked you to create a configurable piece of software how would you allow the user to specify the time intervals?
UPDATE
What I'm looking for is a textual representation. Imagine somethign that would be put into a config file.
Here's how Google App Engine's cron API is configured, it's a bit more user-friendly than cron:
http://code.google.com/appengine/docs/python/config/cron.html#The_Schedule_Format
Python (and Java?) source to implement it should be available in the SDK. Haven't looked to see how easy it would be to extract, but it should at least provide some ideas.
I think it'd be possible to add various things to the format as required - it's structured enough to be extensible. For example it's currently missing the ability to say "every hour at 4 minutes past", which is common in UNIX cron but not really relevant to GAE because although perhaps the engine could figure out which minutes are busy and which aren't, and balance load, the user certainly can't.
Obviously the major weakness of offering something that looks like natural language to the average user, is that they'll think your code is psychic, and expect it to also understand things like "every other Wednesday except the week after Easter", or "whenever the clocks change" ;-)
I'm not sure if I understand your question correctly. Are you asking for a way to input time intervals?
I don't know your environment, but most language frameworks offer some more or less sophisticated GUI components. Something like a calendar control may be what you are looking for?
Edit: Ah, command line (or config file) it is.
Can you parse an XML file? In this case, you could just serialize a timespan (most languages have a class like this). The use may edit the XML file, typically just replacing, say, the value of the <minutes> part. Deserializing this XML file to obtain a Timespan again is easy.
If you prefer plaintext or command line input, it is a bit less easy. However, many languages support some kind of Timespan.Parse (string text, string format). You should have a look if this concept is present in your environment.
Many environment offer some kind of sscanf(), that parses an input string. Sometimes there is a "time" format as well.
If the earlier ideas don't provide a solution, you can try to parse the date using regex. There should be many resources on this topic on this site as well as the web.
If you dislike regex, split the input string according to your own format rules. That's kind of an ugly solution though.
If you don't feel like splitting user input stringy by hand, use Random.NextInt(1000) and hope nobody notices. 0:-)

On K.I.S.S and paving cowpaths [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 4 years ago.
Improve this question
I'm currently developing a PHP application that's using an Access database as a backend. Not by choice you understand... the database is what the client used originally and using it is part of the requirements.
One of the problems with this database is that the column names have the most insane naming convention you could possibly imagine. Uppercase, lowercase, underscores, spaces and the plain insane. For example the column "gender" holds a date. And so does column "User2". There's a lot more but you get the idea.
Faced with this I decided to create an array to map the database columns to PHP variables so we can isolate the code from the madness. However my colleague believes that I'm over-complicating things and we should use the database's column names for the corresponding PHP variables so we don't need to go through the mapping array to find what goes where.
So my question is this... am I doing the right thing or am I complicating things?
Absolutely you are on the right track. If you don't abstract away the madness you will eventually succumb to the madness yourself.
Your colleague has a valid point though, so I suggest you also code an easy way to determine the data to column mapping in PHP.
This isn't about keeping it simple, it's about retrofitting a solid foundation to build upon.
The thing that would worry me is that this kind of random design often hides certain business rules, things like "...if the gender is a date then they must have purchased a widget at some point therefore they can't be allowed to fribbish the lubdub... " - crazy I know but more common than it should be.
Names are exceptionally important. If you want your application to be maintainable, fix them before the code base grows further.
I wouldn't say you are complicating things.
Eric Evan's book Domain Driven Design has a lovely term for this: Anti Corruption Layer
To play Devil's Advocate, there's something to be said for not having an unnecessary layer of indirection in your short-term memory load for working with the system. Once familiar with the code, you will know what goes in which variable, so the main benefit is to someone new picking up the code from scratch. However, fixing that problem properly would also require fixing up the database schema which would (a) be a significant body of work, and (b) largely make the problem go away.
There is no black-and-white answer to this question, and the lack of an obvious answer to your specific problem suggests that you may want to let sleeping dogs lie.
On the other hand, if a cleanup operation is within the bounds of possibility then you may want to do it on a re-factoring type basis, incrementally fixing up the DB column names as the opportunity arises.
Just create views where it is most needed.
This is a good question as it talks to the heart of coding IMHO.
I would go with you and abstract out the bad names into readable decent names. The result being a little complication for much more logically understandable and readable code.
You didn't say you can't rename the columns in Access, so....do that! Another possibility would be to create views for each table, and rename the columns in the view. Then instead of working with table Employees, you work with view vEmployees. If I recall correctly, Access lets you update views as well as select from them. If you are using an ORM with PHP, that may not support updating views however.
Hard coding table names and column names is never a good idea even when the names make sense.
I don't know if using arrays is the best solution though. I'm not really familiar with PHP but I would have gone with something like constant strings to store the table names. In the languages I work in this would lead to more readable code.
You are very unlucky to be stuck with this database but I think on the whole a way of abstracting the field names into something more sensible is smarter.
I would perhaps create a data structure containing the database name, sanitised name, type and a field for the content when you're pulling the data out of the DB. That would give a convenient way of drawing things together so you're not only mapping away the crazy name scheme.
Absolutely you're doing the right thing. In my opinion it's better to implement some sanity there. Going forward, you're logic wouldn't be throw away if they decided to change that database or any of it's column names. If you build your mapping the right way, it should be easy to just plug the new tables/columns right in.
If anything, what you're doing improves the agility of your overall solution.
Of course I would still say KISS applies to the method of your mapping!
Using proper column names in your end of the application is the best you can do. And you should do it unless you want to have to look up "what that field was supposed to be again?" when you have to look at it again after you did something else.
Your colleague's point is not to overcomplicate things. That's valid, too.
So encapsulate access to the fields in a method or methods and have that method do the translation. Using maps this shouldn't be a performance problem.
In fact putting all the mapping to the data source in one object might help you if your customer reconsiders to use a real database. And customers love to change their opinion.
Why not create a datalayer with classes that map on to each table. Then you can define the class methods to access the columns and give the methods whatever names you want. Then the datalayer database access code is the only thing that needs to know about the real column names. I suspect that someone (perhaps several soneones) has already developed a framework to do this. Google "php orm".
Use a ORM, you will be changing the db soon...
You still need to maintain database. One possible approach I can suggest is to map field names in application code as you plan it to do. But then sooner or later you have to start handling this naming madness with field names and fix it. It is not good idea just to screen from a problem and imagine that it is a safe solution and good way to go. It is only temporary workaround. Do not full your self about it.