Mercurial hook to disallow committing large binary files

I want a Mercurial hook that runs before a transaction is committed and aborts the transaction if any binary file being committed is larger than 1 megabyte. I found the following code, which works fine except for one problem: if my changeset involves removing a file, the hook throws an exception.
The hook (I'm using pretxncommit = python:checksize.newbinsize):
from mercurial import context, util
from mercurial.i18n import _
import mercurial.node as dpynode
'''hooks to forbid adding binary file over a given size
Ensure the PYTHONPATH is pointing where hg_checksize.py is and setup your
repo .hg/hgrc like this:
[hooks]
pretxncommit = python:checksize.newbinsize
pretxnchangegroup = python:checksize.newbinsize
preoutgoing = python:checksize.nopull
[limits]
maxnewbinsize = 10240
'''
def newbinsize(ui, repo, node=None, **kwargs):
    '''forbid to add binary files over a given size'''
    forbid = False
    # default limit is 10 MB
    limit = int(ui.config('limits', 'maxnewbinsize', 10000000))
    tip = context.changectx(repo, 'tip').rev()
    ctx = context.changectx(repo, node)
    for rev in range(ctx.rev(), tip + 1):
        ctx = context.changectx(repo, rev)
        print ctx.files()
        for f in ctx.files():
            fctx = ctx.filectx(f)
            filecontent = fctx.data()
            # check only for new files
            if not fctx.parents():
                if len(filecontent) > limit and util.binary(filecontent):
                    msg = 'new binary file %s of %s is too large: %ld > %ld\n'
                    hname = dpynode.short(ctx.node())
                    ui.write(_(msg) % (f, hname, len(filecontent), limit))
                    forbid = True
    return forbid
The exception:
$ hg commit -m 'commit message'
error: pretxncommit hook raised an exception: apps/helpers/templatetags/include_extends.py#bced6272d8f4: not found in manifest
transaction abort!
rollback completed
abort: apps/helpers/templatetags/include_extends.py#bced6272d8f4: not found in manifest!
I'm not familiar with writing Mercurial hooks, so I'm pretty confused about what's going on. Why does the hook care that a file was removed if hg already knows about it? Is there a way to fix this hook so that it works all the time?
Update (solved):
I modified the hook to filter out files that were removed in the changeset.
import itertools

def newbinsize(ui, repo, node=None, **kwargs):
    '''forbid to add binary files over a given size'''
    forbid = False
    # default limit is 10 MB
    limit = int(ui.config('limits', 'maxnewbinsize', 10000000))
    ctx = repo[node]
    for rev in xrange(ctx.rev(), len(repo)):
        ctx = context.changectx(repo, rev)
        # do not check the size of files that have been removed;
        # files that have been removed do not have filecontexts,
        # so to test whether a file was removed, test for the existence of a filecontext
        filecontexts = list(ctx)

        def file_was_removed(f):
            """Returns True if the file was removed"""
            if f not in filecontexts:
                return True
            else:
                return False

        for f in itertools.ifilterfalse(file_was_removed, ctx.files()):
            fctx = ctx.filectx(f)
            filecontent = fctx.data()
            # check only for new files
            if not fctx.parents():
                if len(filecontent) > limit and util.binary(filecontent):
                    msg = 'new binary file %s of %s is too large: %ld > %ld\n'
                    hname = dpynode.short(ctx.node())
                    ui.write(_(msg) % (f, hname, len(filecontent), limit))
                    forbid = True
    return forbid

This is really easy to do in a shell hook in recent Mercurial:
if hg locate -r tip "set:(added() or modified()) and binary() and size('>100k')"; then
    echo "bad files!"
    exit 1
else
    exit 0
fi
What's going on here? First we have a fileset to find all the changed files that are problematic (see 'hg help filesets' in hg 1.9). The 'locate' command is like status, except it just lists files and returns 0 if it finds anything. And we specify '-r tip' to look at the pending commit.
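To wire this up, the repository's .hg/hgrc might look roughly like this (a sketch; the script path is a placeholder for wherever you save the shell snippet above):
[hooks]
# hypothetical path; point it at the shell snippet above
pretxncommit.binsize = /path/to/check_binsize.sh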

for f in ctx.files() will include removed files; you need to filter those out.
(You can also replace for rev in range(ctx.rev(), tip+1): with for rev in xrange(ctx.rev(), len(repo)): and remove the tip = ... line.)
If you're using a modern hg, you don't do ctx = context.changectx(repo, node) but ctx = repo[node] instead.
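Putting those three suggestions together, the hook might look roughly like this (a sketch against the old Python 2 Mercurial API; it relies on the fact that a changectx supports the in operator for checking whether a file is still in the manifest):
def newbinsize(ui, repo, node=None, **kwargs):
    '''forbid adding binary files over a given size'''
    forbid = False
    limit = int(ui.config('limits', 'maxnewbinsize', 10000000))
    for rev in xrange(repo[node].rev(), len(repo)):
        ctx = repo[rev]
        for f in ctx.files():
            if f not in ctx:
                continue  # removed in this changeset, nothing to check
            fctx = ctx.filectx(f)
            data = fctx.data()
            # only complain about newly added binary files
            if not fctx.parents() and len(data) > limit and util.binary(data):
                ui.write('new binary file %s is too large: %ld > %ld\n'
                         % (f, len(data), limit))
                forbid = True
    return forbid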

Related

Connection issues in Storage trigger GCF

For my application, each new file uploaded to storage is read and its data is appended to a main file. The new file contains 2 lines: a header and an array whose values are separated by commas. The main file will grow to a maximum of 265 MB; the new files are at most 30 MB each.
import os
import numpy as np

def write_append_to_ecg_file(filename, ecg, patientdata):
    file1 = open('/tmp/' + filename, "w+")
    file1.write(":".join(patientdata))
    file1.write('\n')
    file1.write(",".join(ecg.astype(str)))
    file1.close()

def storage_trigger_function(data, context):
    # Download the segment file
    download_files_storage(bucket_name, new_file_name, storage_folder_name=blob_path)
    # Read the segment file
    data_from_new_file, meta = read_new_file(new_file_name, scale=1, fs=125, include_meta=True)
    print("Length of ECG data from segment {} file {}".format(segment_no, len(data_from_new_file)))
    os.remove(new_file_name)
    # Check if the main ecg file exists
    file_exists = blob_exists(bucket_name, blob_with_the_main_file)
    print("File status {}".format(file_exists))
    data_from_main_file = []
    if file_exists:
        download_files_storage(bucket_name, main_file_name, storage_folder_name=blob_with_the_main_file)
        data_from_main_file, meta = read_new_file(main_file_name, scale=1, fs=125, include_meta=True)
        print("ECG data from main file {}".format(len(data_from_main_file)))
        os.remove(main_file_name)
        data_from_main_file = np.append(data_from_main_file, data_from_new_file)
        print("data after appending {}".format(len(data_from_main_file)))
        write_append_to_ecg_file(main_file, data_from_main_file, meta)
        token = upload_files_storage(bucket_name, main_file, storage_folder_name=main_file_blob, upload_file=True)
    else:
        write_append_to_ecg_file(main_file, data_from_new_file, meta)
        token = upload_files_storage(bucket_name, main_file, storage_folder_name=main_file_blob, upload_file=True)
The GCF is deployed with:
gcloud functions deploy storage_trigger_function --runtime python37 --trigger-resource patch-us.appspot.com --trigger-event google.storage.object.finalize --timeout 540s --memory 8192MB
For the first file, I was able to read the file and write the data to the main file. But after uploading the 2nd file, it gives Function execution took 70448 ms, finished with status: 'connection error'. On uploading the 3rd file, it gives Function invocation was interrupted. Error: memory limit exceeded. Despite deploying the function with 8192 MB of memory, I am getting this error. Can I get some help on this?

Splitting a csv file into multiple files

I have a csv file of 150500 rows, and I want to split it into multiple files containing 500 rows (entries) each.
I'm using Jupyter, and I know how to open and read the file. However, I don't know how to specify an output_path to record the newly created files resulting from the split.
I found this code online, but once again, since I don't know what my output_path is, I don't know how to use it. Moreover, for this block of code I don't understand how we specify the input file.
import os

def split(filehandler, delimiter=',', row_limit=1000,
          output_name_template='output_%s.csv', output_path='.', keep_headers=True):
    import csv
    reader = csv.reader(filehandler, delimiter=delimiter)
    current_piece = 1
    current_out_path = os.path.join(
        output_path,
        output_name_template % current_piece
    )
    current_out_writer = csv.writer(open(current_out_path, 'w'), delimiter=delimiter)
    current_limit = row_limit
    if keep_headers:
        headers = reader.next()
        current_out_writer.writerow(headers)
    for i, row in enumerate(reader):
        if i + 1 > current_limit:
            current_piece += 1
            current_limit = row_limit * current_piece
            current_out_path = os.path.join(
                output_path,
                output_name_template % current_piece
            )
            current_out_writer = csv.writer(open(current_out_path, 'w'), delimiter=delimiter)
            if keep_headers:
                current_out_writer.writerow(headers)
        current_out_writer.writerow(row)
My file is named DataSet2.csv, and it's in the same folder in Jupyter as the ipynb notebook I'm running.
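For reference, calling the function above on that file would look roughly like this (a sketch; the output name template is made up, and output_path='.' writes the pieces next to the notebook):
f = open('DataSet2.csv', 'rb')  # 'rb' because the snippet above is Python 2 (it uses reader.next())
split(f, row_limit=500, output_name_template='DataSet2_part_%s.csv', output_path='.')
f.close()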
number_of_small_files = 301
lines_per_small_file = 500
largeFile = open('large.csv', 'r')
header = largeFile.readline()
for i in range(number_of_small_files):
    smallFile = open(str(i) + '_small.csv', 'w')
    smallFile.write(header)  # This line copies the header to all small files
    for x in range(lines_per_small_file):
        line = largeFile.readline()
        smallFile.write(line)
    smallFile.close()
largeFile.close()
This will create many small files in the same directory: 301 of them (150500 rows at 500 rows per file works out to roughly 301), named from 0_small.csv to 300_small.csv.
Using standard unix utilities:
cat DataSet2.csv | tail -n +2 | split -l 500 --additional-suffix=.csv - output_
This pipeline takes the original file, strips off the header line with 'tail -n +2', and then splits the rest into 500-line chunks; the '-' tells split to read from standard input, and the chunks are put into files whose names start with 'output_' and end with '.csv' (output_aa.csv, output_ab.csv, and so on).

In mercurial, is there a way to recover from a deleted 00changelog.d?

00changelog.d got deleted accidentally.
Is there a way to recover from this?
Or recover the changes made since the last push?
It is technically possible, though with some information loss, as the 00changelog.d file contains information such as the changeset author, the date of the commit, the commit message, and the named branch of the changeset. That said, it is possible to reconstruct the data to a high degree of certainty based on parent information in the 00manifest.{d,i}.
The data in 00changelog.d is part of the changeset's hash, so it is important that the data shared with an upstream repository is exactly the same; in order to reconstruct the changelog you therefore need a clone of your upstream repository (for what was local only we have some leeway).
In short, preparation before we begin:
A clone of the upstream repository (let's assume it's in ./upstream/)
A backup of your local repository (let's assume it's in ./local-backup/)
Your local repository (let's assume it's in ./local/)
rm ./local/.hg/store/00changelog.* to start from a clean slate
Next you need the following Mercurial extension I wrote for the occasion:
import os
import weakref
from mercurial.commands import command
from mercurial import scmutil, changelog, error, node

@command('reconstruct', [], '<path to 00changelog.i>')
def reconstruct(ui, repo, changelog_index):
    'reconstructs repository from upstream changelog if local changelog has been deleted'
    other_opener = scmutil.vfs(os.path.dirname(changelog_index),
                               expandpath=True,
                               realpath=True)
    upstream_changelog = changelog.changelog(other_opener)
    local_manifest = repo.manifest
    local_changelog = repo.changelog
    if len(local_changelog) != 0:
        raise error.Abort('not running on repository with actual changelog data')
    local_manifest_nodes = {local_manifest.node(rev): rev for rev in local_manifest}
    reconstructed_manifests = set()
    reconstructed_manifests_map = {}
    lock = repo.lock()
    try:
        tr = repo.transaction('reconstruct')
        trp = weakref.proxy(tr)
        for rev in upstream_changelog:
            data = upstream_changelog.read(upstream_changelog.node(rev))
            prevs = upstream_changelog.parentrevs(rev)
            p1 = node.nullid if prevs[0] == -1 else upstream_changelog.node(prevs[0])
            p2 = node.nullid if prevs[1] == -1 else upstream_changelog.node(prevs[1])
            if data[0] in local_manifest_nodes:
                n = local_changelog.add(data[0], data[3], data[4],
                                        trp, p1, p2,
                                        data[1], data[2], data[5])
                reconstructed_manifests.add(data[0])
                reconstructed_manifests_map[data[0]] = n
        missing_manifests = sorted(
            ((n, r) for n, r in local_manifest_nodes.iteritems()
             if n not in reconstructed_manifests),
            key=lambda (x, y): y)
        for n, rev in missing_manifests:
            p1, p2 = local_manifest.parents(n)
            p1 = node.nullid if p1 == node.nullid else reconstructed_manifests_map[p1]
            p2 = node.nullid if p2 == node.nullid else reconstructed_manifests_map[p2]
            n2 = local_changelog.add(n, '', 'missing', trp, p1, p2, 'user')
            reconstructed_manifests_map[n] = n2
        tr.close()
    finally:
        if tr:
            tr.release()
        lock.release()
Place this file in ./reconstruct.py. Now we're able to revive the changes:
cd local
hg --config extensions.reconstruct=../reconstruct.py reconstruct ../upstream/.hg/store/00changelog.i
wait for a while (there is no progress output)
All changes that were local only will have completely bogus changelog information, but this can be fixed up manually using either the histedit or mq extensions.
Note that the extension above makes a simplifying assumption: that the manifests correspond 1:1 to your changelog. Since you only deleted your 00changelog.d and not your 00changelog.i, it would be possible to use the more accurate parent information from 00changelog.i, but that would require substantially more code.
Also note that the extension above uses somewhat recent Python features, so your Mercurial will need to be based on Python 2.7.

using nginx' lua to validate GitHub webhooks and delete cron-lock-file

What I have:
GNU/Linux host
nginx is up and running
there is a cron-job scheduled to run immediately after a specific file has been removed (similar to run-crons)
GitHub sends a webhook when someone pushes to a repository
What I want:
I now want to run either Lua or anything comparable to parse GitHub's request and validate it, and then delete a file (if the request was valid, of course).
Preferably, all of this should happen without the hassle of maintaining an additional PHP installation (there is currently none), or the need to use fcgiwrap or similar.
Template:
On the nginx side I have something equivalent to
location /deploy {
    # execute lua (or equivalent) here
}
To read the JSON body of the GitHub webhook you need the JSON4Lua lib, and to validate the HMAC signature, use luacrypto.
Preconfigure
Install required modules
$ sudo luarocks install JSON4Lua
$ sudo luarocks install luacrypto
In Nginx define location for deploy
location /deploy {
    client_body_buffer_size 3M;
    client_max_body_size 3M;
    content_by_lua_file /path/to/handler.lua;
}
client_max_body_size and client_body_buffer_size should be equal to prevent the error:
request body in temp file not supported
https://github.com/openresty/lua-nginx-module/issues/521
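The snippets below assume that handler.lua starts with a preamble roughly like this (a sketch; the require names follow the modules installed above, and the secret and branch values are placeholders):
-- preamble sketch for /path/to/handler.lua; the later snippets use these locals
local json = require "json"            -- JSON4Lua
local crypto = require "crypto"        -- luacrypto
local secret = "my-webhook-secret"     -- placeholder: must match the secret configured on the GitHub webhook
local branch = "refs/heads/master"     -- only deploy pushes to this ref
local headers = ngx.req.get_headers()  -- used for X-Hub-Signature below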
Process webhook
Get the request payload data and check that it is present
ngx.req.read_body()
local data = ngx.req.get_body_data()
if not data then
    ngx.log(ngx.ERR, "failed to get request body")
    return ngx.exit(ngx.HTTP_BAD_REQUEST)
end
Verify the GitHub signature using luacrypto
local function verify_signature (hub_sign, data)
    local sign = 'sha1=' .. crypto.hmac.digest('sha1', data, secret)
    -- this is simple comparison, but it's better to use a constant time comparison
    return hub_sign == sign
end

-- validate GH signature
if not verify_signature(headers['X-Hub-Signature'], data) then
    ngx.log(ngx.ERR, "wrong webhook signature")
    return ngx.exit(ngx.HTTP_FORBIDDEN)
end
Parse the data as JSON and check that the push is to the master branch before deploying
data = json.decode(data)
-- on master branch
if data['ref'] ~= branch then
    ngx.say("Skip branch ", data['ref'])
    return ngx.exit(ngx.HTTP_OK)
end
If everything is correct, call the deploy function
local function deploy ()
    -- run command for deploy
    local handle = io.popen("cd /path/to/repo && sudo -u username git pull")
    local result = handle:read("*a")
    handle:close()
    ngx.say(result)
    return ngx.exit(ngx.HTTP_OK)
end
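For the original question (deleting the cron lock file instead of pulling a repository), the deploy function might instead look something like this sketch; the lock file path is a placeholder and the nginx worker user needs write permission on its directory:
local function deploy ()
    -- remove the lock file so the scheduled cron job fires
    local ok, err = os.remove("/path/to/cron-lock-file")  -- placeholder path
    if not ok then
        ngx.log(ngx.ERR, "could not remove lock file: ", err)
        return ngx.exit(ngx.HTTP_INTERNAL_SERVER_ERROR)
    end
    ngx.say("lock file removed")
    return ngx.exit(ngx.HTTP_OK)
end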
Example
An example constant-time string compare:
local function const_eq (a, b)
    -- check whether the strings are equal, in constant time
    getmetatable('').__index = function (str, i)
        return string.sub(str, i, i)
    end
    local diff = string.len(a) == string.len(b)
    for i = 1, math.min(string.len(a), string.len(b)) do
        diff = (a[i] == b[i]) and diff
    end
    return diff
end
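To use it, the plain comparison in verify_signature above could be swapped for const_eq, roughly like so:
local function verify_signature (hub_sign, data)
    local sign = 'sha1=' .. crypto.hmac.digest('sha1', data, secret)
    return const_eq(hub_sign, sign)  -- constant-time compare instead of ==
end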
A complete example of how I use it is in this GitHub gist: https://gist.github.com/Samael500/5dbdf6d55838f841a08eb7847ad1c926
This solution does not implement verification for GitHub's hooks and assumes you have the lua extension and the cjson module installed:
location = /location {
    default_type 'text/plain';
    content_by_lua_block {
        local cjson = require "cjson.safe"
        ngx.req.read_body()
        local data = ngx.req.get_body_data()
        if data then
            local obj = cjson.decode(data)
            -- checksum checking should go here
            if (obj and obj.repository and obj.repository.full_name) == "user/reponame" then
                local file = io.open("<your file>", "w")
                if file then
                    file:close()
                    ngx.say("success")
                else
                    ngx.exit(ngx.HTTP_INTERNAL_SERVER_ERROR)
                end
            else
                ngx.exit(ngx.HTTP_UNAUTHORIZED)
            end
        else
            ngx.exit(ngx.HTTP_NOT_ALLOWED)
        end
    }
}

How to get list of changed files since last build in Jenkins/Hudson

I have set up Jenkins, but I would like to find out what files were added/changed between the current build and the previous build. I'd like to run some long running tests depending on whether or not certain parts of the source tree were changed.
Having scoured the Internet I can find no mention of this ability within Hudson/Jenkins though suggestions were made to use SVN post-commit hooks. Maybe it's so simple that everyone (except me) knows how to do it!
Is this possible?
I have done it the following way. I am not sure if that is the right way, but it seems to be working. You need to have the Jenkins Groovy plugin installed and run the following system Groovy script.
import hudson.model.*;
import hudson.util.*;
import hudson.scm.*;
import hudson.plugins.accurev.*
def thr = Thread.currentThread();
def build = thr?.executable;
def changeSet= build.getChangeSet();
changeSet.getItems();
ChangeSet.getItems() gives you the changes. Since I use accurev, I did List<AccurevTransaction> accurevTransList = changeSet.getItems();.
Here, the modified list contains duplicate files/names if a file has been committed more than once during the current build window.
The CI server will show you the list of changes, if you are polling for changes and using SVN update. However, you seem to want to be changing the behaviour of the build depending on which files were modified. I don't think there is any out-of-the-box way to do that with Jenkins alone.
A post-commit hook is a reasonable idea. You could parameterize the job, and have your hook script launch the build with the parameter value set according to the changes committed. I'm not sure how difficult that might be for you.
However, you may want to consider splitting this into two separate jobs - one that runs on every commit, and a separate one for the long-running tests that you don't always need. Personally I prefer to keep job behaviour consistent between executions. Otherwise traceability suffers.
echo $SVN_REVISION
svn_last_successful_build_revision=`curl $JOB_URL'lastSuccessfulBuild/api/json' | python -c 'import json,sys;obj=json.loads(sys.stdin.read());print obj["'"changeSet"'"]["'"revisions"'"][0]["'"revision"'"]'`
diff=`svn di -r$SVN_REVISION:$svn_last_successful_build_revision --summarize`
You can use the Jenkins Remote Access API to get a machine-readable description of the current build, including its full change set. The subtlety here is that if you have a 'quiet period' configured, Jenkins may batch multiple commits to the same repository into a single build, so relying on a single revision number is a bit naive.
I like to keep my Subversion post-commit hooks relatively simple and hand things off to the CI server. To do this, I use wget to trigger the build, something like this...
/usr/bin/wget --output-document "-" --timeout=2 \
https://ci.example.com/jenkins/job/JOBID/build?token=MYTOKEN
The job is then configured on the Jenkins side to execute a Python script that leverages the BUILD_URL environment variable and constructs the URL for the API from that. The URL ends up looking like this:
https://ci.example.com/jenkins/job/JOBID/BUILDID/api/json/
Here's some sample Python code that could be run inside the shell script. I've left out any error handling or HTTP authentication stuff to keep things readable here.
import os
import json
import urllib2

# Make the URL
build_url = os.environ['BUILD_URL']
api = build_url + 'api/json/'

# Call the Jenkins server and figure out what changed
f = urllib2.urlopen(api)
build = json.loads(f.read())
change_set = build['changeSet']
items = change_set['items']
touched = []
for item in items:
    touched += item['affectedPaths']
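Building on that, a rough sketch of how the touched list could decide whether to run the long tests the question asks about (the path prefix and the test script are hypothetical):
import subprocess

# only run the expensive suite when a watched part of the tree changed
if any(path.startswith('src/slow_component/') for path in touched):
    subprocess.check_call(['./run_long_tests.sh'])  # hypothetical test runner
else:
    print 'No relevant changes; skipping the long-running tests.'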
Using the Build Flow plugin and Git:
final changeSet = build.getChangeSet()
final changeSetIterator = changeSet.iterator()
while (changeSetIterator.hasNext()) {
    final gitChangeSet = changeSetIterator.next()
    for (final path : gitChangeSet.getPaths()) {
        println path.getPath()
    }
}
With Jenkins pipelines (pipeline supporting APIs plugin 2.2 or above), this solution is working for me:
def changeLogSets = currentBuild.changeSets
for (int i = 0; i < changeLogSets.size(); i++) {
    def entries = changeLogSets[i].items
    for (int j = 0; j < entries.length; j++) {
        def entry = entries[j]
        def files = new ArrayList(entry.affectedFiles)
        for (int k = 0; k < files.size(); k++) {
            def file = files[k]
            println file.path
        }
    }
}
See How to access changelogs in a pipeline job.
Through Groovy:
<!-- CHANGE SET -->
<% changeSet = build.changeSet
if (changeSet != null) {
    hadChanges = false %>
    <h2>Changes</h2>
    <ul>
    <% changeSet.each { cs ->
        hadChanges = true
        aUser = cs.author %>
        <li>Commit <b>${cs.revision}</b> by <b><%= aUser != null ? aUser.displayName : it.author.displayName %>:</b> (${cs.msg})
            <ul>
            <% cs.affectedFiles.each { %>
                <li class="change-${it.editType.name}"><b>${it.editType.name}</b>: ${it.path}</li>
            <% } %>
            </ul>
        </li>
    <% }
    if (!hadChanges) { %>
        <li>No Changes !!</li>
    <% } %>
    </ul>
<% } %>
#!/bin/bash
set -e

job_name="whatever"
JOB_URL="http://myserver:8080/job/${job_name}/"
FILTER_PATH="path/to/folder/to/monitor"

python_func="import json, sys
obj = json.loads(sys.stdin.read())
ch_list = obj['changeSet']['items']
_list = [ j['affectedPaths'] for j in ch_list ]
for outer in _list:
    for inner in outer:
        print inner
"

_affected_files=`curl --silent ${JOB_URL}${BUILD_NUMBER}'/api/json' | python -c "$python_func"`

if [ -z "`echo \"$_affected_files\" | grep \"${FILTER_PATH}\"`" ]; then
    echo "[INFO] no changes detected in ${FILTER_PATH}"
    exit 0
else
    echo "[INFO] changed files detected: "
    for a_file in `echo "$_affected_files" | grep "${FILTER_PATH}"`; do
        echo " $a_file"
    done;
fi;
It is slightly different - I needed a script for Git on a particular folder...
So, I wrote a check based on jollychang's answer.
It can be added directly to the job's exec shell script. If no files are detected it will exit 0, i.e. SUCCESS... this way you can always trigger on check-ins to the repository, but only build when files in the folder of interest change.
But... if you wanted to build on-demand (i.e. by clicking Build Now) with the changes from the last build, you would change _affected_files to:
_affected_files=`curl --silent $JOB_URL'lastSuccessfulBuild/api/json' | python -c "$python_func"`
Note: You have to use Jenkins' own SVN client to get a change list. Doing it through a shell build step won't list the changes in the build.
It's simple, but this works for me:
$DirectoryA = "D:\Jenkins\jobs\projectName\builds"  #### Jenkins directory
$firstfolder = Get-ChildItem -Path $DirectoryA | Where-Object {$_.PSIsContainer} | Sort-Object LastWriteTime -Descending | Select-Object -First 1
$DirectoryB = $DirectoryA + "\" + $firstfolder
$sVnLoGfIle = $DirectoryB + "\" + "changelog.xml"
write-host $sVnLoGfIle
I tried to add this as a comment, but code doesn't format well in comments:
I just want to prettify the code from heroin's answer:
def changedFiles = []
def changeLogSets = currentBuild.changeSets
for (entries in changeLogSets) {
    for (entry in entries) {
        for (file in entry.affectedFiles) {
            echo "Found changed file: ${file.path}"
            changedFiles += "${file.path}"
        }
    }
}
Keep in mind that in some cases the git plugin returns an empty changeSet, for example:
the first run in a newly created branch
a build started with the 'Build now' button
Refer to https://issues.jenkins-ci.org/browse/JENKINS-26354 for more details.
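Building on the snippet above, a rough sketch of how changedFiles could gate the long-running tests from the original question in a scripted pipeline (the path prefix and test script are hypothetical, and this assumes you are already inside a node block):
// run the expensive stage only when a watched path changed
if (changedFiles.any { it.startsWith('module/slow/') }) {   // hypothetical prefix
    stage('Long tests') {
        sh './run_long_tests.sh'                            // hypothetical test runner
    }
} else {
    echo 'No changes under module/slow/, skipping long tests.'
}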