Task at hand: I have three versions of some code, developed by different coders (one “parent” and two “children”), and I need to work out which child is closer to the parent.
The size of the code rules out counting diffs by hand, and I couldn't find any aggregate similarity statistics in the popular diff/merge tools I've tried.
How do I get a single percent “similarity” number?
Thanks.
You could count the lines of the diff. On Linux you would do:
diff -r parent child1 | wc -l
diff -r parent child2 | wc -l
This way you get a rough difference in lines of code.
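If you want a single percentage rather than a raw line count, one very rough way (a sketch only; it assumes the two trees live in directories called parent and child1, and it ignores files that exist only on one side) is to relate the number of differing lines to the total:

# Rough sketch, not a rigorous metric: percent of parent lines that survive unchanged in the child.
parent_lines=$(find parent -type f -exec cat {} + | wc -l)
removed=$(diff -r parent child1 | grep -c '^<')   # parent-side lines that differ
echo "scale=1; 100 * ($parent_lines - $removed) / $parent_lines" | bc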
Perhaps you can use a Copy-Paste detector tool such as http://pmd.sourceforge.net/cpd.html. I haven't used it personally but it seems to be able to generate statistics.
Our product has been growing steadily over the last few years, and we are now at a turning point as far as data size goes for some of our tables: we expect those tables to double or triple in size in the next few months, and to grow even more in the next few years. We are talking about roughly 1.4M rows now, so over 3M by the end of the summer, and (since we expect growth to be exponential) around 10M by the end of the year (M being million, not mega/1000).
The table we are talking about is essentially a logging table. The application receives data files (CSV/XLS) on a daily basis and the data is transferred into said table. It is then used in the application for a certain amount of time - a couple of weeks or months - after which it becomes rather redundant. That is, if all goes well: if there is some problem down the road, the data in those rows can be useful to inspect for troubleshooting.
What we would like to do is periodically clean up the table, removing any number of rows based on certain requirements, but instead of actually deleting the rows, move them 'somewhere else'.
We currently use MySQL as our database, and the 'somewhere else' could be MySQL as well, but it could be anything. For other projects we have a Master/Slave setup where the whole database is involved, but that's not what we want or need here: it's just a few tables, where the Master table would only get shorter and the Slave table only bigger, not a one-on-one sync.
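To make the idea concrete, the most naive in-MySQL version we can picture would be a nightly job along these lines (the table names, columns and retention window below are placeholders, not our actual schema):

# Hypothetical nightly cron job; assumes the mysql client credentials are already configured.
mysql our_database <<'SQL'
-- Copy rows older than the retention window into an archive table, then remove them.
INSERT INTO log_entries_archive
SELECT * FROM log_entries WHERE created_at < NOW() - INTERVAL 3 MONTH;
DELETE FROM log_entries WHERE created_at < NOW() - INTERVAL 3 MONTH;
SQL

In practice both statements would need to agree on the same cutoff, which is part of why we're asking around.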
The main requirement for the secondary store is that the data should be easy to inspect/query when needed, either with SQL or another DSL, or just with visual tooling. So we are not interested in backing the data up to one or more CSV files or another plain-text format, since that is not as easy to inspect: the logs would then sit somewhere on S3 and we would have to download them and grep/sed/awk through them. We'd much rather have something database-like that we can consult.
I hope the problem is clear?
For the record: while the solution can be anything, we prefer the simplest solution possible. It's not that we don't want, say, Apache Kafka, but then we'd have to learn it, install it and maintain it. Every new piece of technology adds to our stack; the lighter the stack stays, the more we like it ;).
Thanks!
PS: we are not just being lazy here, we have done some research, but we thought it'd be a good idea to get some more insight into the problem.
I have a large 294,000-row CSV with URLs in column 1 and numbers in column 2.
I need to sort the rows from the smallest number to the largest. I have loaded the file into the program 'CSVed' and it handles it okay (it doesn't crash or anything), but when I click the top of the column to sort it, the rows don't come out in order from smallest to largest; it's all just muddled up.
Anyone have any ideas? I've been searching around all day, so I thought I might ask here.
Thanks.
If you have access to a Unix system (and your URLs don't have commas in them), this should do the trick:
sort -t',' -n -k2 filename
Where -t says columns are delimited by commas, -n says the data is numeric, and -k2 says to sort based on the second column.
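If your file has a header row, something like this (untested here; data.csv and sorted.csv are made-up names) keeps the header in place and writes the sorted result to a new file:

# Keep the header line, sort the remaining rows numerically on column 2.
{ head -n 1 data.csv; tail -n +2 data.csv | sort -t',' -k2,2n; } > sorted.csv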
You can use GNU sort. It has a small memory footprint and can even use multiple CPUs for sorting.
sort -t , -k 2n file.csv
GNU sort is available by default in most Linux distributions, and macOS also ships a sort by default (though the latter has slightly different options). You can install it on Windows as well, for example from the CoreUtils for Windows page.
For more information about sort invocation, see the manual.
So I have a huge file containing hundreds of thousands of lines. I want to know how many different sessions or IDs it contains. I really thought it wouldn't be that hard to do, but I'm unable to find a way.
Sessions look like this:
"session":"1425654508277"
There will be a few thousand lines with that session, then it will switch to another one, not necessarily incrementing by one at all; I don't know the pattern, if there is one. So I just want to know how many distinct sessions appear in the document (they SHOULD be consecutive, but that's not a requirement, just something I noticed).
Is there an easy way to do this? The only things I've found that come even remotely close are Excel macros and scripts, which leads me to think I'm not asking the right questions. I also found this: Notepad++ incrementally replace, but it does not help in my case.
Thanks in advance.
Consider using jq. You can extract session with [.session], then apply unique, then length.
https://stedolan.github.io/jq/manual/
I am no jq expert, and have not tested this, but it seems that the program
unique_by(.session) | length
might give you what you want.
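If the file really is one big JSON array of objects, the full invocation would be something along these lines (again untested; sessions.json is a made-up file name):

# Count distinct session values in a JSON array of objects.
jq 'unique_by(.session) | length' sessions.json

If it's a stream of objects rather than an array, jq's -s (slurp) flag can collect them into an array first.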
According to your profile, you know JavaScript, so you can use that:
Load the file.
Look for session. (If this is JSON, this could be as simple as myJson['session'].)
Keyed on session value, add to a map, e.g. myCounts[sessionValue] = doesNotMatter.
Count the number of keys in the map.
There are easier ways, like torazaburo's suggestion to use cat data | uniq | wc, but it doesn't sound like you want to learn Unix, so you may as well practice your JavaScript (I do this myself when learning programming languages: use it for everything).
You won't be able to achieve this with Notepad++, but you can use a Linux shell command, e.g.:
sort sessions.txt | uniq | wc -l
(uniq only collapses adjacent duplicate lines, so sort first; wc -l prints just the line count.)
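If each line carries more than just the session field, a variant that extracts the values first might look like this (assuming the session values are all digits, as in your example):

# Pull out the "session":"..." tokens, de-duplicate, and count them.
grep -o '"session":"[0-9]*"' sessions.txt | sort -u | wc -l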
Adding to my own question: if you manage to get the strings you want into columns in Excel, Excel has a Filter option which automatically shows you the distinct values you can filter a column by.
Applied to my case, this means that if I get the key-value pairs ("session":"idSession", the 100,000 values each on their own row) all into one column, apply the filter, and count the filter entries manually, I get the number of different values.
I didn't get to try the wc/Unix option because I found this while trying to apply the other method.
I'd like to find all differences between two Mercurial revisions. I'd primarily like to see the history of the differences (i.e. the changeset log messages), not the internal details of what changed in the files.
Example: compare revisions 105 and 106
   /---101---103---105
100        \
   \---102---104---106
Here, revision 106 includes changesets 106, 104 and 102, which 105 doesn't have, and 105 in turn includes 103 and 105, which 106 doesn't have. How can I easily get this list, ideally taking grafts into account too?
The following revision set query almost works:
(ancestors(105) - ancestors(106)) + (ancestors(106) - ancestors(105))
However, that's a fairly long query for something that seems like a fairly common question: why exactly does this branch differ from my local version? I also believe it fails to take into account grafts and it unfortunately includes uninteresting changesets such as merges.
Bonus points for including the git equivalent.
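For concreteness, the best I have so far is wrapping that revset in hg log and subtracting merges (untested beyond my own tree; the revision numbers are just from the example above), with git's three-dot range as what I assume is the rough equivalent (branchA/branchB standing in for the two versions):

# Mercurial: log messages in the symmetric difference, merges excluded.
hg log -r '((ancestors(105) - ancestors(106)) + (ancestors(106) - ancestors(105))) - merge()' --template '{rev}: {desc|firstline}\n'

# Git: commits reachable from one side but not the other, merges excluded.
git log --oneline --no-merges branchA...branchB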
Edit: The reason I want this is to explain to humans how these versions differ. I've got a complex source tree, and I need to be able to tell people that version X includes features A & B and bugfix P, but version Y includes features C & D and bugfix Q - and that they're otherwise the same.
If I go back to my example: merges themselves aren't interesting (so in the example above 104 isn't interesting), but the changesets the merges consist of are very interesting - meaning 101 and 102. Merges combine lots of changes into one changeset that lacks reasonable log information. In particular, if I just find the nearest ancestor, I'd find 101, and then it'd look like 102 isn't of particular interest. In terms of the actual patches applied, this information is complete - I don't need to see how merge changeset 104 was constructed, only the result. However, if I want to know why it contains those changes, I need the log messages from 102.
Hrm, I've not tested it, but would:
ancestor(X,Y)::X + ancestor(X,Y)::Y
get you the same list. I think it would, and would also likely be faster.
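Wrapped in hg log with the example revisions, that would presumably be (untested):

hg log -r 'ancestor(105,106)::105 + ancestor(105,106)::106' --template '{rev}: {desc|firstline}\n'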
For a school project, I need to create a way to build personalized queries based on end-user choices.
Since the user can choose basically any fields from any combination of tables, I need to find a way to map the tables in order to build a join and not pull in extraneous data (this may lead to incoherent reports, but we're willing to live with that).
For up to two tables, I have already managed to design an algorithm that works fine. However, when I add another table, I can't find a way to plot a path through my database. All tables available for the personalized reports can be linked together, so it really comes down to finding which path to use.
You might be able to try some form of A* algorithm. Basically, it looks at each of the possible next options, applies a heuristic to each (a function that estimates roughly how far that node is from your goal), chooses the one that looks closest, and repeats. The hardest part of implementing A* is designing a good heuristic.
Without more information on how the tables fit together, or what you mean by a 'path' through the tables, it's hard to recommend something though.
Looks like it didn't like my link, probably the * in it, try:
http://en.wikipedia.org/wiki/A*_search_algorithm
Edit:
If that is the whole database, I'd go with a depth-first exhaustive search.
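Not A*, but a minimal depth-first sketch over an invented table graph might look like the following, just to show the shape of such a search; the table names and edges below are made up, not taken from your schema:

#!/usr/bin/env bash
# Minimal depth-first search over an invented table-relationship graph.
declare -A edges=(
  [A]="B"
  [B]="A C D"
  [C]="B"
  [D]="B"
)

dfs() {                       # dfs <current-table> <goal-table> <path-so-far>
  local node=$1 goal=$2 path=$3
  if [[ $node == "$goal" ]]; then
    echo "$path"
    return 0
  fi
  local next
  for next in ${edges[$node]}; do
    # Skip tables already on the current path to avoid cycles.
    [[ " $path " == *" $next "* ]] && continue
    dfs "$next" "$goal" "$path $next" && return 0
  done
  return 1
}

dfs A C "A"    # prints: A B C

Running it prints "A B C", i.e. a join path from table A to table C via B; for your "A, B and C" case you would search between each pair of chosen tables and merge the results.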
I thought about using A* or a similar algorithm, but as you said, the hardest part is designing the heuristic.
My tables are centered around a sort of backbone, with quite a few branches each leading to at most a single leaf node. Here is the actual map (table names removed because I'm paranoid). Assuming I want to view data from the A, B and C tables, I need an algorithm to find the blue path.