How to solve jenkins 'Disk space is too low' issue? - hudson

I have deployed Jenkins on my CentOS machine. Jenkins was working well for 3 days, but yesterday I got a "Disk space is too low. Only 1.019GB left." problem.
How can I solve this problem? It took my master offline for hours.

You can easily change the threshold from the Jenkins UI, under Manage Jenkins -> Manage Nodes -> Configure (my version is 1.651.3).
Update: How to ensure high disk space
This feature is meant to prevent working on slaves with low free disk space. Lowering the threshold would not fix the underlying problem: some jobs do not properly clean up after they finish.
Depending on what you're building:
Make sure you understand what the disk output of your build is - if possible, restrict the output to the job workspace only. Use the Workspace Cleanup plugin to clean up the workspace as a post-build step.
If the process must write some data to external folders, clean them up manually in a post-build step (see the sketch after this list).
Alternative 1 - provision a new slave per job (use spot slaves - there are many plugins that integrate with different cloud providers to provision machines on demand, on the fly).
Alternative 2 - run the build inside a container. Everything will be discarded once the build is finished.
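For the manual external-folder cleanup, a post-build "Execute shell" step can be as small as the sketch below; the paths are illustrative assumptions, and $WORKSPACE is the standard Jenkins environment variable for the job workspace.
# Hypothetical post-build cleanup step - adjust the paths to whatever your build actually writes
rm -rf /tmp/myjob-scratch          # external scratch area used by the build (assumed path)
rm -rf "$WORKSPACE/build/tmp"      # large intermediates inside the workspace (assumed path)
df -h .                            # report the remaining disk space on this node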

Besides the above solutions, there is a more common approach - directly delete the largest space consumers on the Linux machine. You can follow the steps below:
Log in to the Jenkins machine (e.g. via PuTTY).
cd to the Jenkins installation path.
Use ls -lart to list hidden folders as well; normally the Jenkins installation is placed in the .jenkins/ folder:
[xxxxx ~]$ ls -lart
drwxrwxr-x 12 xxxx 4096 Feb 8 02:08 .jenkins/
List the folder sizes:
Use df -h to show disk space at a high level.
du -sh ./*/ lists the total disk usage of each subfolder in the current path.
du -a /etc/ | sort -n -r | head -n 10 lists the top 10 directories eating disk space in /etc/.
Delete old builds or other large folders.
Normally the ./jobs/ folder or ./workspace/ folder is the largest. Go inside and delete based on your needs (DO NOT delete the entire folder):
rm -rf theFolderToDelete
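If you prefer to trim by age instead of deleting a folder wholesale, here is a hedged sketch; the path assumes the default ~/.jenkins layout and a 30-day cutoff, and Jenkins may need "Reload Configuration from Disk" afterwards.
# Remove build record directories older than 30 days (adjust path and age to your setup)
find ~/.jenkins/jobs/*/builds -mindepth 1 -maxdepth 1 -type d -mtime +30 -exec rm -rf {} +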

You can limit the loss of disk space by discarding old builds. There's a checkbox for this in the project configuration.

This is actually a legitimate question, so I don't understand the downvotes; perhaps it belongs on Super User or Server Fault. This is a soft warning threshold, not a hard limit where the disk is out of space.
For Hudson, see "where to configure hudson node disk temp space thresholds" - this is talking about the host, not nodes.
Jenkins is the same. The conclusion is that for many small projects the system property hudson.diagnosis.HudsonHomeDiskUsageChecker.freeSpaceThreshold could be decreased.
That said, I haven't tested it, and there is a disclaimer:
No compatibility guarantee
In general, these switches are often experimental in nature, and subject to change without notice. If you find some of those useful, please file a ticket to promote it to the official feature.
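If you do want to experiment with that property, here is a sketch of passing it when starting Jenkins from the WAR; the value appears to be interpreted in bytes, so verify against the system-properties documentation for your version.
# Assumed invocation; 2147483648 bytes = a 2 GB warning threshold
java -Dhudson.diagnosis.HudsonHomeDiskUsageChecker.freeSpaceThreshold=2147483648 -jar jenkins.war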

I got the same issue. My Jenkins version is 2.3 and its UI is slightly different. Putting it here so that it may help someone. Increasing both disk space thresholds to 5GB fixed the issue.

I have a cleanup job with the following build steps. You can schedule it #daily or #weekly.
Execute system groovy script build step to clean up old jobs:
import jenkins.model.Jenkins
import hudson.model.Job
BUILDS_TO_KEEP = 5
for (job in Jenkins.instance.items) {
    println job.name
    def recent = job.builds.limit(BUILDS_TO_KEEP)
    for (build in job.builds) {
        if (!recent.contains(build)) {
            println "Preparing to delete: " + build
            build.delete()
        }
    }
}
You'd need to have the Groovy plugin installed.
Execute shell build step to clean cache directories
rm -r ~/.gradle/
rm -r ~/.m2/
echo "Disk space"
du -h -s /

To check the free space as Jenkins Job:
Parameters
FREE_SPACE: Needed free space in GB.
Job
#!/usr/bin/env bash
free_space="$(df -Ph . | awk 'NR==2 {print $4}')"
if [[ "${free_space}" = *G* ]]; then
    # strip the unit suffix, e.g. "12G" -> "12"
    free_space_gb=${free_space/[^0-9]*/}
    if [[ ${free_space_gb} -lt ${FREE_SPACE} ]]; then
        echo "Warning! Low space: ${free_space}"
        exit 2
    fi
else
    echo "Warning! Unknown: ${free_space}"
    exit 1
fi
echo "Free space: ${free_space}"
Plugins
Set build description
Post-Build Actions
Regular expression: Free space: (.*)
Description: Free space: \1
Regular expression for failed builds: Warning! (.*)
Description for failed builds: \1

For people who do not know where the configs are, download the tmpcleaner from
https://updates.jenkins-ci.org/download/plugins/tmpcleaner/
You will get an .hpi file there. Go to Manage Jenkins -> Manage Plugins -> Advanced, upload the .hpi file there, and restart Jenkins.
You can immediately see a difference if you go to Manage Nodes.
Since my Jenkins is installed on a Debian server, I did not understand most of the answers related to this, since I cannot find a /etc/default folder or jenkins file.
If someone knows where the /tmp folder is or how to configure it for Debian, do let me know in the comments.

Related

Deploying an application with database inside mysql container inside docker [duplicate]

I'm trying to wrap my head around Docker from the point of view of deploying an application which is intended to run on the user's desktop. My application is simply a Flask web application and a Mongo database. Normally I would install both in a VM and forward a host port to the guest web app. I'd like to give Docker a try, but I'm not sure how I'm meant to use more than one program. The documentation says there can only be one ENTRYPOINT, so how can I have Mongo and my Flask application? Or do they need to be in separate containers, in which case how do they talk to each other, and how does this make distributing the app easy?
There can be only one ENTRYPOINT, but that target is usually a script that launches as many programs as are needed. You can additionally use, for example, Supervisord or similar to take care of launching multiple services inside a single container. This is an example of a docker container running mysql, apache and wordpress within a single container.
Say you have one database that is used by a single web application. Then it is probably easier to run both in a single container.
If you have a shared database that is used by more than one application, then it would be better to run the database in its own container and the applications each in their own containers.
There are at least two possibilities for how the applications can communicate with each other when they are running in different containers:
Use exposed IP ports and connect via them.
Recent docker versions support linking.
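As a sketch of those two options with plain docker commands (the image and container names below are assumptions, not something from the question; run one option or the other):
# Option 1: publish ports and let the app connect through the host
docker run -d --name mongo -p 27017:27017 mongo
docker run -d --name web -p 5000:5000 my-flask-image    # app points at the host's port 27017

# Option 2: legacy container linking - the app reaches Mongo via the alias "mongo"
docker run -d --name mongo mongo
docker run -d --name web -p 5000:5000 --link mongo:mongo my-flask-image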
I strongly disagree with some previous solutions that recommended running both services in the same container. It's clearly stated in the documentation that it's not recommended:
It is generally recommended that you separate areas of concern by using one service per container. That service may fork into multiple processes (for example, Apache web server starts multiple worker processes). It’s ok to have multiple processes, but to get the most benefit out of Docker, avoid one container being responsible for multiple aspects of your overall application. You can connect multiple containers using user-defined networks and shared volumes.
There are good use cases for supervisord or similar programs but running a web application + database is not part of them.
You should definitely use docker-compose to do that and orchestrate multiple containers with different responsibilities.
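A minimal docker-compose sketch for the Flask + Mongo case from the question; the service names, build context and ports are assumptions:
cat > docker-compose.yml <<'EOF'
version: "3"
services:
  web:
    build: .            # Dockerfile for the Flask app (assumed)
    ports:
      - "5000:5000"
    depends_on:
      - db
  db:
    image: mongo
EOF
docker-compose up -d    # the web service reaches Mongo at the hostname "db"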
I had a similar requirement of running a LAMP stack, MongoDB and my own services.
Docker is OS-based virtualisation, which is why it isolates its container around a running process; hence it requires at least one process running in the FOREGROUND.
So you provide your own startup script as the entry point; your startup script then becomes an extended Docker image script in which you can stack any number of services, as long as AT LEAST ONE FOREGROUND SERVICE IS STARTED, AND THAT TOWARDS THE END.
So my Docker image file has two line below in the very end:
COPY myStartupScript.sh /usr/local/myscripts/myStartupScript.sh
CMD ["/bin/bash", "/usr/local/myscripts/myStartupScript.sh"]
In my script I run MySQL, MongoDB, Tomcat etc. At the end I run Apache as a foreground process:
source /etc/apache2/envvars
/usr/sbin/apache2 -DFOREGROUND
This enables me to start all my services and keep the container alive, with the last service started being in the foreground.
Hope it helps.
UPDATE: Since I last answered this question, new things have come up like Docker Compose, which can help you run each service in its own container yet bind all of them together as dependencies among those services. Try learning more about docker-compose and use it; it is a more elegant approach unless your needs do not match it.
Although it's not recommended, you can run 2 processes in the foreground by using wait. Just make a bash script with the following content, e.g. start.sh:
# runs 2 commands simultaneously:
mongod & # your first application
P1=$!
python script.py & # your second application
P2=$!
wait $P1 $P2
In your Dockerfile, start it with
CMD bash start.sh
I would recommend to set up a local Kubernetes cluster if you want to run multiple processes simultaneously. You can 'distribute' the app by providing them a simple Kubernetes manifest.
They can be in separate containers, and indeed, if the application was also intended to run in a larger environment, they probably would be.
A multi-container system would require some more orchestration to be able to bring up all the required dependencies, though in Docker v0.6.5+ there is a new facility to help with that built into Docker itself - Linking. With a multi-machine solution, it's still something that has to be arranged from outside the Docker environment, however.
With two different containers, the two parts still communicate over TCP/IP, but unless the ports have been locked down specifically (not recommended, as you'd be unable to run more than one copy), you would have to pass the new port that the database has been exposed on to the application so that it can communicate with Mongo. This is, again, something that Linking can help with.
For a simpler, small installation, where all the dependencies are going in the same container, having both the database and Python runtime started by the program that is initially called as the ENTRYPOINT is also possible. This can be as simple as a shell script, or some other process controller - Supervisord is quite popular, and a number of examples exist in the public Dockerfiles.
Docker provides a couple of examples on how to do it. The lightweight option is to:
Put all of your commands in a wrapper script, complete with testing
and debugging information. Run the wrapper script as your CMD. This is
a very naive example. First, the wrapper script:
#!/bin/bash

# Start the first process
./my_first_process -D
status=$?
if [ $status -ne 0 ]; then
  echo "Failed to start my_first_process: $status"
  exit $status
fi

# Start the second process
./my_second_process -D
status=$?
if [ $status -ne 0 ]; then
  echo "Failed to start my_second_process: $status"
  exit $status
fi

# Naive check runs checks once a minute to see if either of the processes exited.
# This illustrates part of the heavy lifting you need to do if you want to run
# more than one service in a container. The container will exit with an error
# if it detects that either of the processes has exited.
# Otherwise it will loop forever, waking up every 60 seconds
while /bin/true; do
  ps aux | grep my_first_process | grep -q -v grep
  PROCESS_1_STATUS=$?
  ps aux | grep my_second_process | grep -q -v grep
  PROCESS_2_STATUS=$?
  # If the greps above find anything, they will exit with 0 status
  # If they are not both 0, then something is wrong
  if [ $PROCESS_1_STATUS -ne 0 -o $PROCESS_2_STATUS -ne 0 ]; then
    echo "One of the processes has already exited."
    exit -1
  fi
  sleep 60
done
Next, the Dockerfile:
FROM ubuntu:latest
COPY my_first_process my_first_process
COPY my_second_process my_second_process
COPY my_wrapper_script.sh my_wrapper_script.sh
CMD ./my_wrapper_script.sh
I agree with the other answers that using two containers is preferable, but if you have your heart set on bundling multiple services in a single container you can use something like supervisord.
In Hipache, for instance, the included Dockerfile runs supervisord, and the file supervisord.conf specifies for both hipache and redis-server to be run.
If a dedicated script seems like too much overhead, you can spawn separate processes explicitly with sh -c. For example:
CMD sh -c 'mini_httpd -C /my/config -D &' \
&& ./content_computing_loop
In docker, there are two ways you can run a program
CMD
ENTRYPOINT
If you want to know the difference between them, please refer here
In CMD/ENTRYPOINT, there are two formats to run a command
SHELL format
EXEC format
SHELL format:
CMD executable_first arg1; executable_second arg1 arg2
ENTRYPOINT executable_first arg1; executable_second arg1 arg2
This version will create a shell and execute the above command. Here you can use any shell syntax such as ";", "&", "|", etc., so you can run any number of commands here. If you have a complex set of commands to run, you can create a separate shell script and use it:
CMD my_script.sh arg1
ENTRYPOINT my_script.sh arg1
EXEC format:
CMD ["executable", "parameter 1", "parameter 2", …]
ENTRYPOINT ["executable", "parameter 1", "parameter 2", …]
Here you can notice that only the first parameter is an executable. From the second parameter onwards, everything becomes arguments/parameters for that executable.
To run multiple commands in EXEC format:
CMD ["/bin/sh", "-c", "executable_first arg1; executable_second"]
ENTRYPOINT ["/bin/sh", "-c", "executable_first arg1; executable_second"]
In the above commands, we have used a shell command as the executable to run the commands. This is the only way to run multiple commands in EXEC format.
The following are WRONG:
CMD ["executable_first parameter", "executable_second parameter"]
ENTRYPOINT ["executable_first parameter", "executable_second parameter"]
CMD ["executable_first", "parameter", ";", "executable_second", "parameter"]
ENTRYPOINT ["executable_first", "parameter", ";", "executable_second", "parameter"]
Can I run multiple programs in a Docker container?
Yes, but with significant risks.
Below is the same answer as above, but with details and a recommended resolution, if you're interested.
Not Recommended
Warning: using the same container for multiple services is not recommended by the Docker community. The Docker documentation reads: "It is generally recommended that you separate areas of concern by using one service per container." Sources:
• https://archive.ph/3Roa6#selection-307.2-307.100
• https://docs.docker.com/config/containers/multi-service_container/
If you choose to ignore the recommendation above, your container risks weaker security, increasing instability, and painful growth in the future.
If you are OK with those risks, the documentation for using one container for multiple services is at:
• https://archive.ph/3Roa6#selection-335.0-691.1
• https://docs.docker.com/config/containers/multi-service_container/
Recommended
If you need container(s) with stronger security, more stability, room to scale bigger in the future, and better performance, then the Docker community recommends these two steps:
Use one service per Docker container. The end result is that you will have multiple containers.
Use this Docker "Networking" feature to connect any of those containers to your liking.
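A minimal sketch of that networking step (the container, network and image names are illustrative):
docker network create app-net
docker run -d --name db --network app-net mongo
docker run -d --name web --network app-net -p 5000:5000 my-flask-image
# "web" can now reach the database simply via the hostname "db"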

What does the command line arguments for PM2 startup mean precisely

I am a little confused about startup scripts and the command line options. I am building a small Raspberry Pi based server for my node applications. In order to provide maximum protection against power failures and flash write corruption, the root file system is read-only, and that embraces the home directory of my main user, where the production versions of my apps (two of them) are stored. Because the .pm2 directory here is no good for logs etc., I currently set the PM2_HOME environment variable to a place in /var (which has 512kb of unused space around it to ensure writes to it). The eco-system.json file also reads this environment variable to determine where to place its logs.
In case I need to, I also have a secondary user with a read-write home directory in another partition (protected by buffer space around it). This contains development versions of my application code which, because of the convenience of setting environments up etc., I also want to monitor with PM2. If I need to investigate a problem I can log in to that user and run and test the application there.
Since this is a headless box, with watchdog and kernel-panic restarts built in, I want pm2 to start during boot and at minimum restart the two production apps. Ideally it should also start the two development versions of the apps, but I can live without that if it's impossible.
I can switch the read only root partition to read/write - indeed it does so automatically when I ssh into my production user account. It switches back to read only automatically when I log out.
So I went to this account to try and create a startup script. It then said (unsurprisingly) that I had to run a sudo command like so:-
sudo su -c "env PATH=$PATH:/usr/local/bin pm2 startup ubuntu -u pi --hp /home/pi"
The key issue for me here is the --hp switch. I went searching for some clue as to what it means. It's clearly a home directory, but it doesn't match PM2_HOME - which is set to /var/pas in my case to take it out of the read-only area. I don't want to spray my home directory with files that shouldn't be there, so I am asking for some guidance here.
I found out by experiment what it does with an "ubuntu" startup script: it uses it to set PM2_HOME in the script by appending "/.pm2" to it.
However, there is nothing stopping you editing the script once it has been created and setting PM2_HOME to whatever you want.
So effectively it's a helper for the script, but only that and nothing more special.
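For illustration only - the exact contents of the generated startup script vary with the pm2 version and init system, but the edit described above amounts to something like this (paths taken from the question):
# Line produced by "pm2 startup ubuntu -u pi --hp /home/pi" (approximate):
export PM2_HOME="/home/pi/.pm2"
# ...edited to point at the writable location used for the production apps:
export PM2_HOME="/var/pas"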

How to add external NTFS drive to Google Cloud Engine app?

Since using docker takes up a lot of space for images, I would like to attach an external hard drive to my 10GB Ubuntu VM instance. However, I've added a blank disk and attached it, but I end up with this message when I type "fdisk -l":
Disk /dev/sdb doesn't contain a valid partition table.
How do I create an external NTFS drive and mount it to my filesystem?
Just as on any other Ubuntu instance, once you've attached an unformatted or unsuitably-formatted drive... one good set of instructions is for example at https://help.ubuntu.com/community/InstallingANewHardDrive .
To do it manually, run, as root (sudo bash for example):
$ apt-get install ntfsprogs
$ df -k # just to check nothing is mounted on /dev/sdb...
$ # umount /dev/sdb if df -k shows something mounted there
$ fdisk /dev/sdb # to fix the partition table, see http://linux.die.net/man/8/fdisk
$ # if you need a tutorial, http://www.howtogeek.com/106873/how-to-use-fdisk-to-manage-partitions-on-linux/
$ mkfs.ntfs -f /dev/sdb1 # if you're in a hurry, or
$ # mkfs.ntfs /dev/sdb1 # if you have all the time in the world
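After creating the filesystem you still need to mount it somewhere; a sketch follows (the mount point is arbitrary, and read-write NTFS mounting needs the ntfs-3g driver):
$ mkdir -p /mnt/extra
$ mount -t ntfs-3g /dev/sdb1 /mnt/extra
$ # add a matching /etc/fstab entry if it should be mounted on every boot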
Incidentally, this is a system administration question, not a software development one, so you might be happier asking it over at Server Fault -- we do monitor the google-cloud-platform tag there, too.
Two side issues -- (1) why NTFS? You're unlikely to be using this PD with Windows, so a native Linux file system might be preferable... (2) what does this have to do with google-app-engine? Did you mistype that tag meaning actually google-compute-engine instead...?

What will happen with a gsutil command if a DRA bucket's contents are unavailable?

I'm on a DRA (Durable Reduced Availability) bucket and I perform the gsutil rsync command quite often to upload/download files to/from the bucket.
Since file(s) could be unavailable (because of the DRA), what exactly will happen during a gsutil rsync session when such a scenario is hit?
Will gsutil just wait until the unavailable files becomes available and complete the task, thus always downloading everything from the bucket?
Or will gsutil exit with a warning about a certain file not being available, and if so exactly what output is being used (so that I can make a script to look for this type of message)?
What will the return code be of the gsutil command in a session where files are found to be unavailable?
I need to be 100% sure that I download everything from the bucket, which I'm guessing can be difficult to keep track of when downloading hundreds of gigabytes of data. In case gsutil rsync completes without downloading unavailable files, is it possible to construct a command which retries the unavailable files until all such files have been successfully downloaded?
If your files exceed the resumable threshold (as of 4.7, this is 8MB), any availability issues will be retried with exponential backoff according to the num_retries and max_retry_delay configuration variables. If the file is smaller than the threshold, it will not be retried (this will be improved in 4.8 so small files also get retries).
If any file(s) fail to transfer successfully, gsutil will halt and output an exception depending on the failure encountered. If you are using gsutil -m rsync or gsutil rsync -C, gsutil will continue on errors and at the end, you'll get a CommandException with the message 'N file(s)/object(s) could not be copied/removed'
If retries are exhausted and/or either of the failure conditions described in #2 occur, the exit code will be nonzero.
In order to ensure that you download all files from the bucket, you can simply rerun gsutil rsync until you get a successful (zero) exit code.
Note that gsutil rsync relies on listing objects. Listing in Google Cloud Storage is eventually consistent. So if you upload files to the bucket and then immediately run gsutil rsync, it is possible you will miss newly uploaded files, but the next run of gsutil rsync should pick them up.
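As a concrete sketch of that "rerun until success" approach (the bucket and directory names are placeholders):
until gsutil -m rsync -r gs://my-bucket /local/dir; do
    echo "rsync failed, retrying in 60s..." >&2
    sleep 60
done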
I did some tests on a project and could not get gsutil to throw any errors. AFAIK, gsutil operates on the directory level; it is not looking for a specific file.
When you run, for example, $ gsutil rsync local_dir gs://bucket, gsutil is not expecting any particular file, it just takes whatever you have in "local_dir" and uploads it to gs://bucket, so:
gsutil will not wait, it will complete.
you will not get any errors - the only errors I got were when the local directory or the bucket were missing entirely.
if, let's say, a file is missing in local_dir but it is available in the bucket, and you then run $ gsutil rsync -r local_dir gs://bucket, then nothing will change in the bucket. With the "-d" option, the file will be deleted on the bucket side.
As a suggestion, you could just add a crontab entry to rerun the gsutil command a couple of times a day or at night.
Another way is to create a simple script and add it to your crontab to run every hour or so. This will check whether a given file exists locally and, if it does not, run the gsutil command:
#!/bin/bash
FILE=/home/user/test.txt
if [ -f $FILE ];
then
echo "file exists..or something"
else
gsutil rsync /home/user gs://bucket
fi
UPDATE :
I think this may be what you need. In ~/ you should have a .boto file.
~$ more .boto | grep max
# num_retries = <integer value>
# max_retry_delay = <integer value>
Uncomment those lines and add your numbers. The default is 6 retries, so you could do something like 24 retries and put 3600s in between. This in theory should always keep looping.
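Uncommented, the relevant lines would look something like this (they normally sit under the [Boto] section of ~/.boto):
[Boto]
num_retries = 24
max_retry_delay = 3600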
Hope this helps !

Using Git to track mysql schema - some questions

Is this recommended?
Can I ask for some git command examples of how to track versions of a mysql schema?
Should we use another repository other than the one we normally use on our application root?
Should I use something called a hook?
Update:
1) We navigate to our project root, where the .git database resides.
2) We create a subfolder called hooks.
3) We put something like this inside a file called db-commit:
#!/bin/sh
mysqldump -u DBUSER -pDBPASSWORD DATABASE --no-data=true > SQLVersionControl/vc.sql
git add SQLVersionControl/vc.sql
exit 0
Now we can:
4) git commit -m
This commit will include a mysql schema dump that has been run just before the commit.
The source of the above is here:
http://edmondscommerce.github.io/git/using-git-to-track-db-schema-changes-with-git-hook.html
If this is an acceptable way of doing it, can I please ask someone with patience to comment line by line and with as much detail as possible, what is happening here:
#!/bin/sh
mysqldump -u DBUSER -pDBPASSWORD DATABASE --no-data=true > SQLVersionControl/vc.sql
git add SQLVersionControl/vc.sql
exit 0
Thanks a lot.
Assuming you have a git repo already, do the following in a shell script or whatever:
#!/bin/bash -e
# -e means exit if any command fails
DBHOST=dbhost.yourdomain.com
DBUSER=dbuser
DBPASS=dbpass # do this in a more secure fashion
DBNAME=dbname
GITREPO=/path/to/git/repo
cd $GITREPO
mysqldump -h $DBHOST -u $DBUSER -p$DBPASS -d $DBNAME > $GITREPO/schema.sql # the -d flag means "no data"
git add schema.sql
git commit -m "$DBNAME schema version $(`date`)"
git push # assuming you have a remote to push to
Then start this script on a daily basis from a cron job or what have you.
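For example, a crontab entry to run it every night at 02:30 (the script path is a placeholder):
30 2 * * * /path/to/schema-dump.sh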
EDIT: By placing a script in $gitdir/hooks/pre-commit (the name is important), the script will be executed before every commit. This way the state of the DB schema is captured for each commit, which makes sense. If you automatically run this sql script every time you commit, you will blow away your database, which does not make sense.
#!/bin/sh
This line specifies that it's a shell script.
mysqldump -u DBUSER -pDBPASSWORD DATABASE --no-data=true > SQLVersionControl/vc.sql
This is the same as in my answer above; taking the DDL only from the database and storing it in a file.
git add SQLVersionControl/vc.sql
This adds the SQL file to every commit made to your repository.
exit 0
This exits the script with success. This is possibly dangerous. If mysqldump or git add fails, you may blow away something you wanted to keep.
If you're just tracking the schema, put all of the CREATE statements into one .sql file, and add the file to git.
$> mkdir myschema && cd myschema
$> git init
$> echo "CREATE TABLE ..." > schema.sql
$> git add schema.sql
$> git commit -m "Initial import"
IMO the best approach is described here: http://viget.com/extend/backup-your-database-in-git. For your convenience I repeat the most important pieces here.
The trick is to use mysqldump --skip-extended-insert, which creates dumps that can be better tracked/diffed by git.
There are also some hints regarding the best repository configuration in order to reduce disk size. Copied from here:
core.compression = 9 : Flag for gzip to specify the compression level for blobs and packs. Level 1 is fast with larger file sizes, level 9 takes more time but results in better compression.
repack.usedeltabaseoffset = true : Defaults to false for compatibility reasons, but is supported with Git >=1.4.4.
pack.windowMemory = 100m : (Re)packing objects may consume lots of memory. To prevent all your resources from going down the drain, it's useful to put some limits on that. There is also pack.deltaCacheSize.
pack.window = 15 : Defaults to 10. With a higher value, Git tries harder to find similar blobs.
gc.auto = 1000 : Defaults to 6700. As indicated in the article it is recommended to run git gc every once in a while. Personally I run git gc --auto every day, so things only get packed when there's enough garbage. git gc --auto normally only triggers the packing mechanism when there are 6700 loose objects around. This flag lowers this amount.
gc.autopacklimit = 10: Defaults to 50. Every time you run git gc, a new pack is generated of the loose objects. Over time you get too many packs which waste space. It is a good idea to combine all packs once in a while into a single pack, so all objects can be combined and deltified. By default git gc does this when there are 50 packs around. But for this situation a lower number may be better.
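These settings can be applied per repository with plain git config commands, e.g.:
git config core.compression 9
git config repack.usedeltabaseoffset true
git config pack.windowMemory 100m
git config pack.window 15
git config gc.auto 1000
git config gc.autopacklimit 10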
Old versions can be pruned via:
git rebase --onto master~8 master~7
(copied from here)
The following includes a git pre-commit hook to capture mysql database/schema, given user='myuser', password='mypassword', database_name='dbase1'. Properly bubbles errors up to the git system (the exit 0's in other answers could be dangerous and may not handle error scenarios properly). Optionally, can add a database import to a post-checkout hook (when capturing all the data, not just schema), but take care given your database size. Details in bash-script comments below.
pre-commit hook:
#!/bin/bash
# exit upon error
set -e
# another way to set "exit upon error", for readability
set -o errexit
mysqldump -umyuser -pmypassword dbase1 --no-data=true > dbase1.sql
# Uncomment following line to dump all data with schema,
# useful when used in tandem for the post-checkout hook below.
# WARNING: can greatly expand your git repo when employing for
# large databases, so carefully evaluate before employing this method.
# mysqldump -umyuser -pmypassword dbase1 > dbase1.sql
git add dbase1.sql
(optional) post-checkout hook:
#!/bin/bash
# mysqldump (above) is presumably run without '--no-data=true' parameter.
set -e
mysql -umyuser -pmypassword dbase1 < dbase1.sql
Versions of apps, OS I'm running:
root#node1 Dec 12 22:35:14 /var/www# mysql --version
mysql Ver 14.14 Distrib 5.1.54, for debian-linux-gnu (x86_64) using readline 6.2
root#node1 Dec 12 22:35:19 /var/www# git --version
git version 1.7.4.1
root#node1 Dec 12 22:35:22 /var/www# lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 11.04
Release: 11.04
Codename: natty
root#node1 Dec 12 22:35:28 /var/www#
While I am not using Git, I have used source control for over 15 years. A best practice to adhere to when deciding where and how to store your src and accompanying resources in source control: if the DB schema is used within the project, then you should be versioning the schema and all other project resources in "that" project. If you develop a set of schemas or programming resources that you reuse in other projects, then you should have a separate repository for those reusable resources. That separate reusable-resources project will be versioned on its own and will track the versions of the actual reusable resources in that repository.
If you use a versioned resource out of the reusable repository in a different project, then you have the following scenario (just an example): Project XYZ version 1.0 is now using DB Schema_ABC version 4.0. In this case you will understand that you have used a specific version of a reusable resource, and since it is versioned you will be able to track its use throughout your project. If you get a bug report on DBSchema_ABC, you will be able to fix the schema and re-version it, as well as understand where else DBSchema_ABC is used and where you may have to make some changes. From there you will also understand which projects contain which versions of which reusable resources... You just have to understand how to track your resources.
Adopting this type of development environment and resource management strategy is key to releasing usable software and managing a break/fix enhancement environment. Even if you're developing for your own edification on your own time, you should be using source control... as you are.
As for Git, I would find a GUI front end or a dev environment integration if I can. Git is pretty big, so I am sure it has plenty of front end support, maybe?
As brilliant as it sounds (the idea did occur to me as well), when I tried to implement it, I hit a wall. In theory, by using the --skip-extended-insert flag, even though the initial dump would be big, the diffs between daily dumps should be minimal, hence the size increase of the repository over time could be assumed to be minimal as well, right? Wrong!
Git stores snapshots, not diffs, which means on each commit it will take the entire dump file, not just the diff. Moreover, since a dump with --skip-extended-insert will use all field names on every single insert line, it will be huge compared to a dump done without --skip-extended-insert. This results in an explosion in size, the exact opposite of what one would expect.
In my case, with a ~300MB sql dump, the repository went to gigabytes in days. So, what did I do? I first tried the same thing, only removing --skip-extended-insert, so that dumps would be smaller and snapshots would be proportionally smaller as well. This approach held for a while, but in time it became unusable as well.
Still, the diff usage with --skip-extended-insert actually still seemed like a good idea, only now I try to use Subversion instead of git. I know, compared to git, svn is ancient history, yet it seems to work better, since it actually does use diffs instead of snapshots.
So in short, I believe best solution is doing the above, but with subversion instead of git.
(shameless plug)
The dbvc commandline tool allows you to manage your database schema updates in your repository.
It creates and uses a table _dbvc in the database which holds a list of the updates that have been run. You can easily run the updates that haven't been applied to your database schema yet.
The tool uses git to determine the correct order of executing the updates.
DBVC usage
Show a list of commands
dbvc help
Show help on a specific command
dbvc help init
Initialise DBVC for an existing database.
dbvc init
Create a database dump. This is used to create the DB on a new environment.
mysqldump foobar > dev/schema.php
Create the DB using the schema.
dbvc create
Add an update file. These are used to update the DB on other environments.
echo 'ALTER TABLE `foo` ADD COLUMN `status` BOOL DEFAULT 1;' > dev/updates/add-status-to-foo.sql
Mark an update as already run.
dbvc mark add-status-to-foo
Show a list of updates that need to be run.
dbvc status
Show all updates with their status.
dbvc status --all
Update the database.
dbvc update
I have found the following options to be mandatory for a version control / git-compatible mysqldump.
mysqldump --skip-opt --skip-comments |sed -e 's/DEFINER[ ]*=[ ]*[^*]*\*/\*/'
(and maybe --no-data)
--skip-opt is very useful, it takes away all of --add-drop-table --add-locks --create-options --disable-keys --extended-insert --lock-tables --quick --set-charset. The DEFINER sed is necessary when the database contains triggers.
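Putting the pieces together, a schema-only dump suitable for committing might look like this (credentials and names are placeholders):
mysqldump -u DBUSER -pDBPASSWORD --no-data --skip-opt --skip-comments DBNAME \
    | sed -e 's/DEFINER[ ]*=[ ]*[^*]*\*/\*/' > schema.sql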