How do I set up the architecture for a "big data" analysis project? - csv

A friend of mine and I are in our senior year and will be starting a senior project soon. We had the idea to do a data analysis and data visualization project for it. Our project involves reading a CSV file that is updated every 2 minutes, parsing that data, then storing it in a database. Once that data is stored we want to run some analysis on it and provide an API through which we could access that data to visualize it in some way. Our end goal would be to build an Android app that displays some of the raw data from the CSV and the analysis in a user-friendly format.
I talked to another CS major and he explained that I would need a few different servers to accomplish this: one for the storage, another for analysis, and another for some type of queue that would make sure things don't get screwy while we are doing scraping and analysis.
The problem is, I don't really know where to start with this. I've done some work with a SQL database before and a PHP front end, but nothing with multiple servers. I've heard of tools to use with big data projects like Hadoop, but I'm not exactly sure where it fits in. If someone could point me to a resource of some kind to explain, or explain themselves, how I would start to structure this kind of project, that would be awesome!

Since you don't have much experience with these things, you'll probably want to look at projects like Cloudera. Specifically, their resources page has a nice set of videos and articles.
Another source of solid information (that I personally use) is clicking on a Stack Overflow tag and selecting the votes option. Many good questions on a plethora of big data topics already exist.
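To get a feel for the pipeline in the question before adding servers, here is a minimal single-process sketch of the "read CSV, parse, store, analyze" steps using only Python's standard library. The column names (sensor_id, taken_at, value) and the readings table are made up for illustration; the real CSV layout is not given in the question. The primary key makes re-reading the same file harmless, which matters when the file is polled every 2 minutes.

```python
import csv
import io
import sqlite3


def ingest_csv(conn, csv_text):
    """Parse CSV text and store its rows; rows already stored are skipped."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS readings (
               sensor_id TEXT NOT NULL,
               taken_at  TEXT NOT NULL,
               value     REAL NOT NULL,
               PRIMARY KEY (sensor_id, taken_at)
           )"""
    )
    rows = [
        (r["sensor_id"], r["taken_at"], float(r["value"]))
        for r in csv.DictReader(io.StringIO(csv_text))
    ]
    with conn:  # one transaction per file
        conn.executemany("INSERT OR IGNORE INTO readings VALUES (?, ?, ?)", rows)
    return len(rows)


def average_per_sensor(conn):
    """A stand-in for the 'analysis' step: mean value per sensor."""
    cursor = conn.execute(
        "SELECT sensor_id, AVG(value) FROM readings GROUP BY sensor_id"
    )
    return dict(cursor.fetchall())


# Hypothetical sample data standing in for the real CSV file.
sample = (
    "sensor_id,taken_at,value\n"
    "a,2024-01-01T00:00,1.0\n"
    "a,2024-01-01T00:02,3.0\n"
    "b,2024-01-01T00:00,5.0\n"
)
conn = sqlite3.connect(":memory:")
ingest_csv(conn, sample)
ingest_csv(conn, sample)  # the same file read again adds nothing
print(average_per_sensor(conn))
```

Whether this needs to be split across several machines depends on the data volume, which the question does not state.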

Related

Best practice to use several APIs or data sources for one application

I want to build an application that uses data from several endpoints.
Let's say I have:
JSON API for getting cinema data
XML Export for getting data about ???
Another JSON API for something else
A csv-file for some more shit ...
In my application I want to bring all this data together and build views for it and so on ...
My idea was to set up a database by creating schemas for all these data sources, so I can write some kind of "import scripts" which I can call whenever I want to get the latest data.
I thought of schemas because I want to be able to easily adapt to a new API with any kind of schema.
Please enlighten me about the possibilities and best practices out there (theory and practice if possible :P)
You are totally right on making a database. But the real problem is probably not going to be how to store your data. It's going to be how to make it fit together logically and semantically.
I suggest you first take a good look at what your endpoints can provide. Get several samples from every source and analyze them if you can. How will you know which data is new? How can you match it against existing data and against data from other sources? If existing data changes or gets deleted, how will you detect and handle that? What if sources disagree on something? How and when should you run the synchronization? What will you do if one of your sources goes down? Etc.
It is extremely difficult to make data consistent if your data sources are not. As a rule, if the sources are different, they are not consistent. Thus the proverb "garbage in, garbage out". We humans have no problem dealing with small inconsistencies, but algorithms cannot work correctly if there are discrepancies. Even if everything fits together on paper, one usually forgets that data can change over time...
At least that's my experience in such cases.
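Two of the questions above ("which data is new?" and "how do I detect changes?") can be sketched like this, assuming every source gives each record a stable id. That assumption, the items table, and the title field are all hypothetical; real sources may need a different matching key. Each record is fingerprinted with a hash so a later import can tell new, changed, and unchanged records apart.

```python
import hashlib
import json
import sqlite3


def sync_source(conn, source, records):
    """Upsert records from one source; return counts of new/changed/unchanged."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS items (
               source      TEXT NOT NULL,
               external_id TEXT NOT NULL,
               title       TEXT NOT NULL,
               digest      TEXT NOT NULL,
               PRIMARY KEY (source, external_id)
           )"""
    )
    counts = {"new": 0, "changed": 0, "unchanged": 0}
    with conn:
        for rec in records:
            key = (source, str(rec["id"]))
            # Fingerprint of the whole record, independent of key order.
            blob = json.dumps(rec, sort_keys=True).encode("utf-8")
            digest = hashlib.sha256(blob).hexdigest()
            row = conn.execute(
                "SELECT digest FROM items WHERE source = ? AND external_id = ?",
                key,
            ).fetchone()
            if row is None:
                conn.execute(
                    "INSERT INTO items VALUES (?, ?, ?, ?)",
                    key + (rec["title"], digest),
                )
                counts["new"] += 1
            elif row[0] != digest:
                conn.execute(
                    "UPDATE items SET title = ?, digest = ? "
                    "WHERE source = ? AND external_id = ?",
                    (rec["title"], digest) + key,
                )
                counts["changed"] += 1
            else:
                counts["unchanged"] += 1
    return counts


conn = sqlite3.connect(":memory:")
first = sync_source(conn, "cinema_api", [
    {"id": 1, "title": "Alien"},
    {"id": 2, "title": "Heat"},
])
second = sync_source(conn, "cinema_api", [
    {"id": 1, "title": "Alien"},
    {"id": 2, "title": "Heat (1995)"},
])
print(first, second)
```

This sketch does not handle deletions or disagreements between sources; those need a policy decision first, not just code.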
I'm not sure whether you want the application to display all the data in the same view or whether you are going to create different views for each of the sources. If you want to display the data in the same view, like a grid, I would recommend using inheritance or an interface, depending on your data and needs. I would recommend setting this structure up in the database too, using different tables for the different sources and a parent table, related to all of them, that has a type associated with it.
Here's a good thread with discussion about choosing an interface or inheritance.
Inheritance vs. interface in C#
And here are some examples of representing inheritance in a database.
How can you represent inheritance in a database?
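As a rough sketch of the "parent table with a type, one table per source" layout, here is one way it could look in SQLite. The table and column names (entry, cinema_entry, csv_entry, city, file_name) are invented for the example, not taken from the question. The shared columns live in the parent, so a single grid view only has to read one table.

```python
import sqlite3

SCHEMA = """
CREATE TABLE entry (
    id          INTEGER PRIMARY KEY,
    source_type TEXT NOT NULL CHECK (source_type IN ('cinema', 'csv_row')),
    title       TEXT NOT NULL
);
CREATE TABLE cinema_entry (
    entry_id INTEGER PRIMARY KEY REFERENCES entry(id),
    city     TEXT NOT NULL
);
CREATE TABLE csv_entry (
    entry_id  INTEGER PRIMARY KEY REFERENCES entry(id),
    file_name TEXT NOT NULL
);
"""


def add_cinema(conn, title, city):
    """Insert the shared part into the parent, the rest into the child table."""
    with conn:
        cur = conn.execute(
            "INSERT INTO entry (source_type, title) VALUES ('cinema', ?)", (title,)
        )
        conn.execute("INSERT INTO cinema_entry VALUES (?, ?)", (cur.lastrowid, city))
    return cur.lastrowid


def add_csv_row(conn, title, file_name):
    with conn:
        cur = conn.execute(
            "INSERT INTO entry (source_type, title) VALUES ('csv_row', ?)", (title,)
        )
        conn.execute("INSERT INTO csv_entry VALUES (?, ?)", (cur.lastrowid, file_name))
    return cur.lastrowid


def grid_rows(conn):
    """Rows for a combined view: only the parent table is needed."""
    return conn.execute(
        "SELECT id, source_type, title FROM entry ORDER BY id"
    ).fetchall()


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
add_cinema(conn, "Odeon", "Leeds")
add_csv_row(conn, "row 17", "export.csv")
print(grid_rows(conn))
```

Source-specific details are fetched with a join on entry_id only when a view actually needs them.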

How should I store similar entities - in one table or several?

I am creating a CV website, but unlike most I am trying to build it with a database. I mean that usually such websites are static and all of the information is hard-coded in the HTML. Since I am a back-end developer, I would like to make it so that everything, including buttons and welcome messages, is taken from the database. I am trying to store projects that I have worked on. There are several types:
GitHub repository - a project that is done purely on GitHub.
Work related - a project I have done at work where there is no GitHub repository, only a link to view the final result.
UpWork or other freelance website - as a freelancer I have projects to fix something on a website. Those projects can be viewed only on my profile there, and I would like to list them with a link to UpWork or wherever there is information on what exactly I was hired to do.
Now my question is: should I have different entities, and therefore different tables, for these types of projects, or should I have all of the possible properties in one table? For example, if it is GitHub there is a repository field, and if it is work related there is a company field. If it is freelance it has a link to the website I was hired on. Also there are different sub-types: web applications, desktop applications, games and so on.
As you can guess, the differences are small (1 or 2 properties). I could very easily leave some properties empty and have another property, projectType, but is this the right way? Should I have different tables and entities for them?
To give some info: I can work with both MySQL and NoSQL and I haven't decided yet which one my website should be built on. I am currently thinking about NoSQL. This means I am asking how to store the projects in both MySQL and NoSQL (by NoSQL I mean MongoDB). If it helps, the languages I am choosing from are PHP (MySQL) and JavaScript (NoSQL).
I know that usually questions without code are downvoted, but this is more of a logic based problem as I know how to do it, but I don't know the best practices for my situation. This being said here is a small code for you -
console.log('Thank you in advance')
MongoDB lends itself very well to this exact situation.
You can create a collection where documents leave out certain fields if they are not needed for that type. MongoDB's query operators let you check $exists on fields if you need to, and documents are stored efficiently, only taking up space for the fields they actually contain.
You can even set up a sparse index, which only contains entries for documents that have the indexed field. As long as your core document structure is the same, it is a good idea to keep them in one collection and vary them based on their type.
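To make the "one collection, optional fields" idea concrete, here is a small sketch using plain Python dicts in place of MongoDB documents, so it runs without a server. The field names and URLs are placeholders. where_exists is a hypothetical helper that mimics the filter {"link": {"$exists": True}}; it is not part of any MongoDB driver.

```python
# Each dict stands in for one document in a single "projects" collection.
# Fields that do not apply to a project type are simply left out.
projects = [
    {"type": "github", "name": "parser",
     "repository": "https://github.com/example/parser"},
    {"type": "work", "name": "intranet",
     "company": "Example Corp", "link": "https://example.com/intranet"},
    {"type": "freelance", "name": "checkout fix",
     "link": "https://example.com/profile"},
]


def where_exists(documents, field):
    """Return documents that contain `field` (like {field: {"$exists": True}})."""
    return [doc for doc in documents if field in doc]


def by_type(documents, project_type):
    """Return documents of one project type (like {"type": project_type})."""
    return [doc for doc in documents if doc.get("type") == project_type]


print([doc["name"] for doc in where_exists(projects, "link")])
print([doc["name"] for doc in by_type(projects, "github")])
```

The same shape maps onto a single SQL table with a projectType column and nullable columns for the type-specific properties.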

Ruby on Rails - Database or excel

I am currently doing a project in Ruby on Rails and I have been presented with a dilemma.
The users of my system will be uploading an Excel spreadsheet. Should I read straight from this spreadsheet into my front end, or should I load the spreadsheet into my MySQL database and serve my front end from there?
I have asked numerous people about this issue and have researched on-line to no avail.
Any help would be much appreciated.
The Excel file is not a database. If you need to allow it as source input, parse it, copy the data into a real database and connect to it.
The database is more flexible and efficient for querying and processing information.
I can think of two benefits, or rather options, of having them upload the excel spreadsheet for processing by your back end.
1) It would serve your tracking purposes (who sent what, and what the back end did with it). Consider also that other formats or versions could be introduced: would it be important to keep the files to identify what went wrong, or to answer "How can we handle this new format?"
2) On the other side, the front-end way, you offload processing from the back end. But that means the browser app could get fairly complex, and if your Excel file has many relationships, sending that data up to the server could be complex too. However, if it is simply a flat spreadsheet, say simple rows without totals, tax calculations and so on, then loading it in the browser and sending those rows up to the server might be an advantage, if offloading processing is of any importance.
However, point 2 is really outweighed by point 1, which to me is of greater importance for future migration of this service. So I personally would choose uploading the file and processing it on the back end.
Update
As you clarified in the comments, if you are asking about using Excel on the back end as a database, I agree with Simone Carletti's answer here: a real database gives you much more flexibility, more tools, and more performance. With the file you have to load it, parse it into some structure, then save it again (unless you are using some .NET framework), and even then a database (MySQL, MongoDB...) would give you much more flexibility in structuring and querying, which outweighs the headache of managing DB connections. You might just want to write a sample of both to evaluate; the DB solution will probably win you over.

Custom digital asset management tool - where to start

I work at a production studio that has hundreds of assets (2D images, videos, 3D models, etc) that we use over and over again in our library. Right now it is just a folder on our server, but because I am a particularly adventurous person I am looking to create a database/application that allows users (approximately 20) to search for and "grab" items from our internal network. I would also need a way for them to upload items to the database - every project we work on we're creating new assets for the library and it grows daily.
I'm a very amateur programmer, mostly working in JavaScript and HTML, so what I'm looking for is any advice on where to start. From the research I've done, I imagine that I would build a MySQL database to store all of the information, and then create an HTML site that all of the users can access via their web browser as the GUI. I know a little bit of Python and really like it, so I'm thinking I'll use Python as the back end, talking to MySQL.
I'd love to hear any advice the community can give me! I plan to do this on zero budget, so open-source all the way. The closest tool I can think of to what I want is Adobe Bridge - which I love but which isn't quite what I'm looking for and doesn't have robust enough searching and tagging (and doesn't support anything but images and video).
As a database, MySQL isn't particularly suited to this task. The challenge you'll run into is that users will want to access the files in a folder-like structure, but for performance reasons you probably will not want a parent-child schema (at least not using InnoDB; I can't speak to the other storage engines). It is certainly possible to create a performant parent-child schema on InnoDB, but it is not a challenge to be undertaken casually.
If you have access to MSSQL 2012, it makes a tremendous effort at solving this exact problem: http://technet.microsoft.com/en-us/library/ff929144.aspx
I love recommending MySQL, but in this case I'd recommend a different database choice.
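For readers who have not seen one, this is roughly what a parent-child (adjacency list) schema for a folder tree looks like. The sketch uses SQLite only because it ships with Python, so it says nothing about MySQL or InnoDB performance; the node table and the sample names are made up. The point to notice is that rebuilding a full path takes a recursive query that walks up one level per step.

```python
import sqlite3


def build_demo(conn):
    """Create a tiny folder tree: each row points at its parent row."""
    conn.executescript(
        """
        CREATE TABLE node (
            id        INTEGER PRIMARY KEY,
            parent_id INTEGER REFERENCES node(id),
            name      TEXT NOT NULL
        );
        INSERT INTO node VALUES (1, NULL, 'library');
        INSERT INTO node VALUES (2, 1, 'textures');
        INSERT INTO node VALUES (3, 2, 'wood.png');
        INSERT INTO node VALUES (4, 1, 'models');
        """
    )


def full_path(conn, node_id):
    """Walk from a node up to the root and join the names into a path."""
    rows = conn.execute(
        """
        WITH RECURSIVE chain(id, parent_id, name, depth) AS (
            SELECT id, parent_id, name, 0 FROM node WHERE id = ?
            UNION ALL
            SELECT n.id, n.parent_id, n.name, c.depth + 1
            FROM node n JOIN chain c ON n.id = c.parent_id
        )
        SELECT name FROM chain ORDER BY depth DESC
        """,
        (node_id,),
    ).fetchall()
    return "/".join(name for (name,) in rows)


conn = sqlite3.connect(":memory:")
build_demo(conn)
print(full_path(conn, 3))
```

Deep trees and frequent path lookups are where this layout starts to need careful indexing or a different tree representation.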

Serialized data in a MySQL database to use in a Business Intelligence tool

I have a MySQL database and need to store some results from a web service monthly.
The data can have 10 results today but may have 200 next month.
I need to use a BI tool to create charts and what not.
Someone proposed serializing the data and saving the blobs in the database. While that solution seems to work, I have a gut feeling that when the time comes to hook it up to the BI tool, all hell will break loose.
Has anyone had this issue before?
Thanks
Edit: adding extra info.
The problem is that we haven't chosen the BI tool yet. But what it needs to do is create charts for the results. Some of the results come from Google Analytics, so we will be charting the number of visitors to a site for the last 6 months, or the number of pages viewed.
The answer is simple: do not store serialized data in a database.
Do some research, atomize your data, and create a proper data structure for it.
Once you've done it, you will be able to use any BI tool in the world.
That's the purpose of a database and what distinguishes a database from a flat file.
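As an illustration of "atomized" storage, here is a sketch where every result becomes its own row, so it does not matter whether a month has 10 results or 200. The metric_result table and its columns (month, site, metric, value) are assumptions for the example, not the asker's real data. Once the values are plain columns, a chart query is ordinary SQL that any BI tool can run.

```python
import sqlite3


def store_results(conn, month, results):
    """Store one row per result instead of one serialized blob per month."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS metric_result (
               month  TEXT NOT NULL,
               site   TEXT NOT NULL,
               metric TEXT NOT NULL,
               value  INTEGER NOT NULL,
               PRIMARY KEY (month, site, metric)
           )"""
    )
    with conn:
        conn.executemany(
            "INSERT OR REPLACE INTO metric_result VALUES (?, ?, ?, ?)",
            [(month, r["site"], r["metric"], r["value"]) for r in results],
        )


def monthly_totals(conn, metric):
    """The kind of query a chart needs: one total per month for a metric."""
    return conn.execute(
        "SELECT month, SUM(value) FROM metric_result "
        "WHERE metric = ? GROUP BY month ORDER BY month",
        (metric,),
    ).fetchall()


conn = sqlite3.connect(":memory:")
store_results(conn, "2024-01", [
    {"site": "a.example", "metric": "visitors", "value": 100},
    {"site": "b.example", "metric": "visitors", "value": 50},
])
store_results(conn, "2024-02", [
    {"site": "a.example", "metric": "visitors", "value": 120},
    {"site": "b.example", "metric": "visitors", "value": 180},
    {"site": "a.example", "metric": "pageviews", "value": 900},
])
print(monthly_totals(conn, "visitors"))
```

With a serialized blob, the same chart would require application code to unpack every row before any aggregation could happen.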