Repository organization for a Hadoop project - Mercurial

I am starting on a new Hadoop project that will have multiple Hadoop jobs (and hence multiple jar files). Using Mercurial for source control, I was wondering what the optimal way of organizing the repository structure would be. Should each job live in a separate repo, or would it be more efficient to keep them in the same repo but break it down into folders?

If you're pipelining the Hadoop jobs (the output of one is the input of another), I've found it's better to keep most of the code in the same repository, since I tend to write a lot of common methods that I can reuse across the various MR jobs.
Personally, I keep the streaming jobs in a separate repo from my more traditional jobs since there are generally no dependencies.
Are you planning on using the DistributedCache or streaming jobs? You might want a separate directory for files you distribute. Do you really need a JAR per Hadoop job? I've found I don't.
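On the one-JAR point, here is a minimal sketch of the kind of driver I mean, using Hadoop's ProgramDriver (the same pattern as the hadoop-examples JAR). WordCountJob and LogJoinJob are hypothetical job classes with ordinary main methods, and on older Hadoop versions the call is driver.driver(args) rather than driver.run(args):

```java
import org.apache.hadoop.util.ProgramDriver;

// One JAR, several jobs: each job is picked by name on the command line, e.g.
//   hadoop jar myjobs.jar wordcount <input> <output>
public class JobDriver {
    public static void main(String[] args) {
        ProgramDriver driver = new ProgramDriver();
        int exitCode = -1;
        try {
            // WordCountJob and LogJoinJob are illustrative placeholders
            driver.addClass("wordcount", WordCountJob.class, "Counts words in the input");
            driver.addClass("logjoin", LogJoinJob.class, "Joins raw logs with user records");
            exitCode = driver.run(args);
        } catch (Throwable t) {
            t.printStackTrace();
        }
        System.exit(exitCode);
    }
}
```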
If you give more details about what you plan on doing with Hadoop, I can see what else I can suggest.

Related

How to maintain a simple database in git?

I have a JSON (multiline) file with lots of project settings and a list of included modules for a few projects. It's version-controlled by git in the same repository as my projects. It is constantly growing and works just fine for setting up and tuning my projects. The only problem is that when working with a team and branches I constantly get merge conflicts that need to be resolved manually, and in 99% of cases the resolution is "use both" because the conflict is just new entries. So what are the alternatives? I need this database to have the same versioning and branching as the code, since it holds project settings and dependencies, but I want to reduce conflicts to a minimum. And I do not want a separate database that I'd have to maintain in parallel with git; it needs to stay perfectly in sync automatically when switching between branches or commits. Thanks!
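One building block often suggested for this kind of append-only "use both" conflict is git's built-in union merge driver, enabled per file via .gitattributes. A minimal sketch, assuming the file is named settings.json (union merge is purely line-based, so it keeps both sides' new entries but won't repair JSON syntax such as missing commas):

```
# .gitattributes (committed at the repository root)
settings.json merge=union
```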

Logstash configuration best practices

I'm new to logstash but I like how easy it makes shipping logs and aggregating them. Basically it just works. One problem I have is I'm not sure how to go about making my configurations maintainable. Do people usually have one monolithic configuration file with a bunch of conditionals or do they separate them out into different configurations and launch an agent for each one?
We heavily use Logstash to monitor ftbpro.com. I have two notes which you might find useful:
You should run one agent (process) per machine, not more. A Logstash agent requires a fair amount of CPU and memory, especially under high load, so you don't want to run more than one on a single machine.
We manage our Logstash configurations with Chef. We have a separate template for each configuration, and Chef assembles the configuration according to the roles of the machine. So the final result is one large configuration on each machine, but in our repository the configurations are separate and thus maintainable.
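If you are not using Chef, a rough sketch of the same idea with plain Logstash, which can load every file in a configuration directory and concatenate them (the file names and the nginx filter below are just examples):

```
# /etc/logstash/conf.d/, started with:  logstash -f /etc/logstash/conf.d/
#   10-inputs.conf  20-filter-nginx.conf  90-outputs.conf
# The files are concatenated, so conditionals keep each filter scoped:
filter {
  if [type] == "nginx" {
    grok { match => [ "message", "%{COMBINEDAPACHELOG}" ] }
  }
}
```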
Hope this helps you.
I'll offer the following advice:
Send your data to Redis as a "channel" rather than a "list", with keys based on time and date; this makes managing Redis a lot easier.
http://www.nightbluefruit.com/blog/2014/03/managing-logstash-with-the-redis-client/
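For illustration, a minimal output block along those lines; the host and the key pattern are placeholders, and data_type is the redis output option that switches between "list" and "channel":

```
output {
  redis {
    host      => "127.0.0.1"
    data_type => "channel"
    # one channel per day keeps Redis housekeeping simple
    key       => "logstash-%{+YYYY.MM.dd}"
  }
}
```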

Pros and cons for keeping code and data in separate repositories

We have a project that has data and code, bundled into a single Mercurial repository. The data is just as important as the code (it contains parameters for business logic, some inputs, etc.). However, the format of the data files rarely changes, and it's quite natural to change the data files independently of the code.
One advantage of the unified repository is that we don't have to keep track of multiple revisions: if we ever need to recreate output from a previous run, we only need to update the system to the single revision number stored in the output log.
One disadvantage is that if we modify the data while multiple heads are active, we may lose the data changes unless we manually copy those changes to each head.
Are there any other pros/cons to splitting the code and the data into separate repositories?
Multiple repos:
Pros:
component-based approach (you identify groups of files that can evolve independently of one another)
configuration specification: you list the references (here "revisions") you need for your system to work. If you want to modify one part without changing the other, you update that list.
partial clones: if you don't need all components, you can clone only the ones you want (doesn't apply in your case)
Cons:
configuration management: you need to track that configuration, usually through a parent repo registering subrepos (see the .hgsub sketch below)
in your case, the data is quite dependent on certain versions of the project (you can have new data which doesn't make sense for old versions of the project)
One repo:
Pros:
system-based approach: you see your modules as one system (project and data)
repo management: everything in one place
tight link between modules (which can make sense for data)
Cons:
data propagation (when, as you mention, several heads are active)
intermediate revisions (not to reflect a new feature, but just because some data changed)
larger clones (not relevant here, unless your data includes large binaries)
For non-binary data with infrequent changes, I would still keep it in the same repo.
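To make the "parent repo registering subrepos" point above concrete, here is a minimal sketch of Mercurial's mechanism (the paths are illustrative): the parent repo carries a .hgsub file, and every commit of the parent pins the exact subrepo revisions in an auto-generated .hgsubstate file, which is the "list of references" mentioned under the pros.

```
# .hgsub in the parent repository
code = code
data = data
```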
Yes, you should separate code and data. Keep your code in version control and your data in a database.
I love version control: I have been a programmer for more than ten years and I like this job.
But during the last months I have realized that data must not live in version control. Sometimes it is hard for a person who is familiar with git (or another version control system) to "let it go".
You need a good ORM which supports database schema migrations. The migrations (schema migrations and data migrations) are kept in version control, but the data is not.
I know your question was about using one or two repositories, but maybe my answer helps you get a different viewpoint.

RoR: efficiently testing a project with mysql and sqlite

I'd like to continually test and benchmark my RoR app running over both mysql and sqlite, and I'm looking for techniques to simplify that. Ideally, I'd like a few things:
simultaneous autotest / rspec testing with mysql and sqlite versions of the app so I'll know right away if I've broken something
a dependable construct for writing db-specific code, since I need to break into `ActiveRecord::Base.connection.select_all()` once in a while.
The latter seems easy, the former seems difficult. I've considered having two separate source trees, each with its own db-specific config files (e.g. Gemfile, config/database.yml) and using filesystem links to share all common files, but that might frighten and confuse git.
A cleaner approach would be a command-line switch to rails to say which configuration to use as it starts up. Though it would be nice, I don't think such a switch exists.
How do other people handle this?
If I were you, I would do two things:
Don't check database.yml into your code repo. It contains database passwords, and if you're working with other developers on different machines, it will be a headache trying to keep track of which database is on which machine. It's considered bad practice and not a habit you should get into (a common alternative is sketched after these two points).
For the files that should be checked into source control (Gemfile and Gemfile.lock), I would manage this using Git branches. I would have one master branch that uses one database, and another branch that uses the other. If you are working off the master branch and have it set up with MySQL, you can just rebase or merge into the SQLite branch whenever you make code changes. As long as you're not writing a lot of database-specific queries, you shouldn't have conflict problems.
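For the first point, a common pattern (a sketch, not something spelled out above) is to commit a sample file and ignore the real one:

```
# .gitignore
config/database.yml

# config/database.yml.example stays in the repo; each developer runs:
#   cp config/database.yml.example config/database.yml
```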
Okay, with just a couple of tweaks, there's a simple way to run your app and tests under any one of several databases. I describe the technique in:
RoR: how do I test my app against multiple databases?
It works well for me -- someone else might find it useful.
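The linked answer isn't reproduced here, but the general shape of that kind of setup is extra sections in config/database.yml selected by RAILS_ENV. A rough sketch with hypothetical environment names (custom environments also need matching files under config/environments/):

```yaml
# config/database.yml -- hypothetical environments, one per adapter
test_mysql:
  adapter:  mysql2
  database: myapp_test
  username: root

test_sqlite:
  adapter:  sqlite3
  database: db/test.sqlite3

# run as: RAILS_ENV=test_mysql bundle exec rspec
```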

How to manage a dataset together with an application?

The application's code and configuration files are maintained in a code repository. But sometimes, as part of the project, I also have some data (which in some cases can be >100MB, or even >1GB), which is stored in a database. Git does a nice job of handling the code and its changes, but how can the development team easily share the data?
It doesn't really fit in the code version control system, as it is mostly large binary files and would make pulling updates a nightmare. But it does have to be synchronised with the repository, because some code revisions change the schema (i.e. migrations).
How do you handle such situations?
We have the data and schema stored in XML and use Liquibase to handle the updates to both the schema and the data. The advantage here is that you can diff the files to see what's going on, it plays nicely with any VCS, and you can automate it.
Given the size of your database, this would mean a sizable "version 0" file. But with the migration strategy, the updates after that should be manageable, as they would only be deltas. You might also be able to convert your existing migrations one-to-one to Liquibase, which might be nicer than a big-bang approach.
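As a rough illustration of a Liquibase changeset that carries both a schema change and data (the table, column, and file names are made up):

```xml
<databaseChangeLog
    xmlns="http://www.liquibase.org/xml/ns/dbchangelog"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.liquibase.org/xml/ns/dbchangelog
        http://www.liquibase.org/xml/ns/dbchangelog/dbchangelog-3.1.xsd">

  <!-- a delta on top of the big "version 0" changelog -->
  <changeSet id="42-add-region" author="dev">
    <addColumn tableName="customer">
      <column name="region" type="varchar(32)"/>
    </addColumn>
    <!-- reference data lives next to the changelog and stays diffable -->
    <loadData tableName="region" file="data/regions.csv"/>
  </changeSet>
</databaseChangeLog>
```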
You can also leverage @belisarius' strategy if your deltas are very large, so each developer doesn't have to apply the delta individually.
It seems to me that your database has a lot of parallels with a binary library dependency: it's large (well, much larger than a reasonable code library!), binary, and has its own versions which must correspond to various versions of your codebase.
With this in mind, why not integrate a dependency manager (e.g. Apache Ivy) with your build process and let it manage your database? This seems like just the sort of task that a dependency manager was built for.
Regarding the sheer size of the data/download, I don't think there's any magic bullet (short of some serious document pre-loading infrastructure) unless you can serialize the data into a delta-able format (the XML/JSON/SQL you mentioned).
A second approach (maybe not so compatible with dependency management): If the specifics of your code allow it, you could keep a second file that is a manual diff that can take a base (version 0) database and bring it up to version X. Every developer will need to keep a clean version 0. A pull (of a version with a changed DB) will consist of: pull diff file, copy version 0 to working database, apply diff file. Note that applying the diff file might take a while for a sizable DB, so you may not be saving as much time over the straight download as it first seems.
We usually use a database sync or replication scheme.
Each developer has 2 copies of the database: one for working and the other just for keeping the synced version.
When the code is synchronized, the script syncs the database too (the central DB against the "dead" developer's copy). After that, each developer updates his or her own working copy. Sometimes a developer needs to keep some of his/her data, so these second updates are not always driven by the standard script.
It is only as robust as the replication scheme ... and sometimes (depending on the DB) that is not good news.
DataGrove is a new product that gives you version control for databases. We allow you to store the entire database (schema and data), tag, restore and share the database at any point in time.
This sounds like what you are looking for.
We're currently working on features to allow git-like (push-pull) behaviors so developers can share their repositories across machines; that way, I could load the latest version of your database whenever I need it.