Delete duplicate JSON files based on one of the attributes

I have two directories on my Linux system, /dir and /dir2.
Both contain more than 4000 JSON files. The content of every file looks like this:
{
  "someattribute": "someValue",
  "url": [
    "https://www.someUrl.com/xyz"
  ],
  "someattribute": "someValue"
}
Note that url is an array, but it always contains exactly one element (the URL).
The URL makes a file unique: if a file with the same URL exists in both /dir and /dir2, then it's a duplicate and the copy needs to be deleted.
I want to automate this operation, preferably with a shell command. Any opinions on how I should go about it?

Use jq to build a NUL-delimited list of the duplicates:
jq -nrj '[
  foreach inputs.url as [$url] ({};
    .[$url] += 1;
    if .[$url] > 1 then input_filename
    else empty end
  )
] | join("\u0000")' /{dir1,dir2}/*.json
And to remove them, pipe the above command's output to xargs:
xargs -0 rm --
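Putting the two together, it's prudent to dry-run with echo first and only delete once the listed files look right:
jq -nrj '[
  foreach inputs.url as [$url] ({};
    .[$url] += 1;
    if .[$url] > 1 then input_filename
    else empty end
  )
] | join("\u0000")' /{dir1,dir2}/*.json | xargs -0 echo rm --
Drop the echo to delete for real. Since the shell expands /dir1/*.json before /dir2/*.json, the first file seen with each URL is kept and the later copies are the ones flagged as duplicates.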

Here's a quick and dirty bash script that uses jq to extract the URL from the JSON files, and awk to detect and delete the duplicates:
#!/bin/bash
rm -f urls-dir1.txt urls-dir2.txt
# Build "filename<TAB>url" lists for both directories.
# jq -r strips the JSON quotes; quoting the command substitution avoids word splitting.
for file in dir1/*.json; do
    printf "%s\t%s\n" "$file" "$(jq -r '.url[0]' "$file")" >> urls-dir1.txt
done
for file in dir2/*.json; do
    printf "%s\t%s\n" "$file" "$(jq -r '.url[0]' "$file")" >> urls-dir2.txt
done
# Remember dir1's URLs, then delete every dir2 file whose URL was already seen
awk -F $'\t' 'FNR == NR { urls[$2] = 1; next }
    $2 in urls { system("rm -f \"" $1 "\"") }' urls-dir1.txt urls-dir2.txt
rm -f urls-dir1.txt urls-dir2.txt
It assumes that dir2 has the files that are to be deleted as duplicates and the ones in dir1 should be untouched.
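For a dry run, swap the system() action in the awk step for a plain print; the script then lists the doomed dir2 files instead of removing them:
awk -F $'\t' 'FNR == NR { urls[$2] = 1; next }
    $2 in urls { print $1 }' urls-dir1.txt urls-dir2.txt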

You can use the following Java approach to achieve this:
Set<String> urls = new HashSet<>();
try (Stream<Path> paths = Files.list(Paths.get("/path/to/your/folder"))) {
    paths
        .map(path -> new FileInfo(path, extractUrl(path)))
        .filter(info -> info.getUrl() != null)
        // Set.add returns false if the URL was already present, so only duplicates pass
        .filter(info -> !urls.add(info.getUrl()))
        .forEach(info -> {
            try {
                Files.delete(info.getPath());
            } catch (IOException e) {
                e.printStackTrace();
            }
        });
} catch (IOException e) {
    e.printStackTrace();
}
This uses the following FileInfo class:
public class FileInfo {
    private Path path;
    private String url;
    // constructor and getters
}
First it lists all files in the given directory and extracts the URL from each one. The HashSet takes care of spotting duplicates: Set.add returns false when the URL is already present, so the filter lets only duplicates through, and those files are then deleted.
There are multiple options for extracting the URL from each file:
Quick and dirty, using a regex:
private String extractUrl(Path path) {
    try {
        String content = String.join("\n", Files.readAllLines(path));
        Pattern pattern = Pattern.compile("\"url\".+\\s+\"(?<url>[^\\s\"]+)\"");
        Matcher matcher = pattern.matcher(content);
        if (matcher.find()) {
            return matcher.group("url");
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
    return null;
}
A better solution would be to use a JSON parser library like Jackson:
private String extractUrl(Path path) {
    try (BufferedReader reader = Files.newBufferedReader(path)) {
        ObjectMapper mapper = new ObjectMapper();
        MyObject object = mapper.readValue(reader, MyObject.class);
        return object.getUrls().stream().findFirst().orElse(null);
    } catch (IOException e) {
        e.printStackTrace();
    }
    return null;
}
This uses an Object representation of the file content:
public class MyObject {
    @JsonProperty("url")
    private List<String> urls;
    // getter and setter
}
In the end, though, the most performant solution would probably be a shell script.

Here is a quick and simple awk script that does all the work from the base directory.
The awk script, named script1.awk:
/https/ {
    if ($1 in urlArr) {
        cmd = "rm \"" FILENAME "\"";
        print cmd;
        # system(cmd);
    } else {
        urlArr[$1] = FILENAME;
    }
}
Initially run the script with the following command:
awk -f script1.awk dir{1,}/*.json
When you're ready to actually remove the duplicate JSON files, just uncomment the 5th line (the line containing system(cmd)) and run it again. Note that awk comments start with #, not //; an uncommented-looking //system(cmd) line would actually run on every match.
Here are some explanations:
The awk command runs the script script1.awk over all JSON files in the subdirectories dir1 and dir.
The script scans each file for lines containing https; on such a line the URL is the only field, so it lands in $1.
If $1 already exists in the associative array urlArr, the file is a duplicate: print the rm command (and, once uncommented, execute it).
Otherwise, record the current file in urlArr.
Hope you like this simple solution.

Related

Tcl: can catch { exec } know whether a final newline was output?

Consider the following:
% catch { exec echo "test" } result
0
% catch { exec echo -n "test" } resultnonl
0
% if { $result == $resultnonl } { echo "true" }
true
Question: Is there a way for the two resulting variables to be different?
Use case: I'm retrieving the contents of the clipboard and cannot differentiate between these two cases. In Emacs, it is very common for me to kill (cut) a line without its final newline, and also very common to kill a whole line. The clipboard only differs by the newline.
Check out the -keepnewline flag to exec: normally exec strips a single trailing newline from the command's output, and -keepnewline retains it, so the two cases become distinguishable. Watch:
catch { exec -keepnewline -- echo "test" } result
string length $result
This reports 5 ("test" plus the newline), whereas the -n variant would give 4.

Delete unused key:value properties in JSON file

I have a key:value JSON object that is used in my JavaScript project. The values are strings, and the object looks like this:
{
  key1: {
    someKey: "Some text",
    someKey2: "Some text2"
  },
  key2: {
    someKey3: {
      someKey4: "Some text3",
      someKey5: "Some text4"
    }
  }
}
I use it in the project like this: key1.someKey and key2.someKey3.someKey4. Any idea how to delete the unused properties? Say key2.someKey3.someKey5 is not used in any file in the project; I then want it deleted from the JSON file. To the people in the comments: I didn't say I want to use JavaScript for this, and I don't want to run it in a browser or on a server. I just want a script that can do this on my local computer.
If you live within JavaScript and Node, you can use something like this to get all the paths:
Using some modified code from here: https://stackoverflow.com/a/70763473/999943
var lodash=require('lodash') // use this if calling from the node REPL
// import lodash from 'lodash'; // use this if calling from a script
const allPaths = (o, prefix = '', out = []) => {
  if (lodash.isObject(o) || lodash.isArray(o))
    Object.entries(o).forEach(([k, v]) =>
      allPaths(v, prefix === '' ? k : `${prefix}.${k}`, out));
  else out.push(prefix);
  return out;
};
let j = {
key1: { someKey: 'Some text', someKey2: 'Some text2' },
key2: { someKey3: { someKey4: 'Some text3', someKey5: 'Some text4' } }
}
allPaths(j)
[
'key1.someKey',
'key1.someKey2',
'key2.someKey3.someKey4',
'key2.someKey3.someKey5'
]
That's all well and good, but now you want to take that list and look through your codebase for usage.
The main choices are text searching with grep, awk, or ag, or parsing the language and looking through its symbolic representation once it's loaded into your project. Tree-shaking can do this for libraries; I haven't looked into how to do tree-shaking for dictionary keys, or some other undefined-reference check like a linter may do for a language.
Then once you have all the instances found, then you either manually modify your list or use a json library to modify it.
My weapons of choice in this instance are jq, bash, and grep.
It's not infallible, but it's a start (use with caution).
setup_test.sh
#!/usr/bin/env bash
mkdir src
echo "key2.someKey3.someKey4" > src/a.js
echo "key1.someKey2" > src/b.js
echo "key3.otherKey" > src/c.js
test.json
{
  "key1": {
    "someKey": "Some text",
    "someKey2": "Some text2"
  },
  "key2": {
    "someKey3": {
      "someKey4": "Some text3",
      "someKey5": "Some text4"
    }
  }
}
check_for_dict_references.sh
#!/usr/bin/env bash
json_input=$1
code_path=$2

cat << HEREDOC
json_input=$json_input
code_path=$code_path
HEREDOC

echo "Paths found in json"
paths="$(jq -r 'paths | join(".")' "$json_input")"

# Collect every path that no file under $code_path mentions
no_refs=
for path in $paths; do
    escaped_path=$(echo "$path" | sed -e "s|\.|\\\\.|g")
    if ! grep -r "$escaped_path" "$code_path" ; then
        no_refs="$no_refs $path"
    fi
done
echo "Missing paths..."
echo "$no_refs"

echo "Creating a new json file without the unused paths"
del_paths_list=
for path in $no_refs; do
    del_paths_list+=".$path, "
done
del_paths_list=${del_paths_list:0:-2} # remove the trailing comma and space

# Quote the expansion so the whole del(...) program reaches jq as one argument
jq -r "del(${del_paths_list})" "$json_input" > "${json_input}.new.json"
After running setup_test.sh, we can test the jq + grep solution:
$ ./check_for_dict_references.sh test.json src
json_input=test.json
code_path=src
Paths found in json
src/b.js:key1.someKey2
src/b.js:key1.someKey2
src/b.js:key1.someKey2
src/a.js:key2.someKey3.someKey4
src/a.js:key2.someKey3.someKey4
src/a.js:key2.someKey3.someKey4
Missing paths...
key2.someKey3.someKey5
Creating a new json file without the unused paths
If you look closely, you'd want it to also print key1.someKey, but that one got "found" in the middle of the name key1.someKey2. There are fancier regex tricks to avoid this, but for the purpose of this script it may be enough.
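One cheap tightening, assuming GNU grep (which understands \b), is a word-boundary anchor in the loop:
    if ! grep -r "${escaped_path}\b" "$code_path" ; then
        no_refs="$no_refs $path"
    fi
Because 2 is a word character, there is no boundary between someKey and 2, so key1.someKey no longer matches inside key1.someKey2.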
Now look in your directory for the new json file:
$ cat test.json.new.json
{
  "key1": {
    "someKey": "Some text",
    "someKey2": "Some text2"
  },
  "key2": {
    "someKey3": {
      "someKey4": "Some text3"
    }
  }
}
Hope that helps.

Parsing JSON from shell script using JSON.sh

I'm working on parsing JSON data using JSON.sh, and I want to read data from a JSON file (test.json) whose content looks something like this:
{
  "/home/ukrishnan/projects/test.yml": {
    "LOG_DRIVER": "syslog",
    "IMAGE": "mysql:5.6"
  },
  "/home/ukrishnan/projects/mysql/app.xml": {
    "ENV_ACCOUNT_BRIDGE_ENDPOINT": "/u01/src/test/sample.txt"
  }
}
I try to parse this JSON with JSON.sh like this:
test_parser=`sh ./lib/JSON.sh < test/test.json`
echo $test_parser
It prints,
["/home/ukrishnan/projects/test.yml","LOG_DRIVER"] "syslog" ["/home/ukrishnan/projects/test.yml","IMAGE"] "mysql:5.6" ["/home/ukrishnan/projects/test.yml"] {"LOG_DRIVER":"syslog","IMAGE":"mysql:5.6"} ["/home/ukrishnan/projects/mysql/app.xml","ENV_ACCOUNT_BRIDGE_ENDPOINT"] "/u01/src/test/sample.txt" ["/home/ukrishnan/projects/mysql/app.xml"] {"ENV_ACCOUNT_BRIDGE_ENDPOINT":"/u01/src/test/sample.txt"} [] {"/home/ukrishnan/projects/test.yml":{"LOG_DRIVER":"syslog","IMAGE":"mysql:5.6"},"/home/ukrishnan/projects/mysql/app.xml":{"ENV_ACCOUNT_BRIDGE_ENDPOINT":"/u01/src/test/sample.txt"}}
Whereas the same command (sh ./lib/JSON.sh < test/test.json), run directly in a terminal, prints with line breaks:
["/home/ukrishnan/projects/test.yml","LOG_DRIVER"] "syslog"
["/home/ukrishnan/projects/test.yml","IMAGE"] "mysql:5.6"
["/home/ukrishnan/projects/test.yml"] {"LOG_DRIVER":"syslog","IMAGE":"mysql:5.6"}
["/home/ukrishnan/projects/mysql/app.xml","ENV_ACCOUNT_BRIDGE_ENDPOINT"] "/u01/src/test/sample.txt"
["/home/ukrishnan/projects/mysql/app.xml"] {"ENV_ACCOUNT_BRIDGE_ENDPOINT":"/u01/src/test/sample.txt"}
[] {"/home/ukrishnan/projects/test.yml":{"LOG_DRIVER":"syslog","IMAGE":"mysql:5.6"},"/home/ukrishnan/projects/mysql/app.xml":{"ENV_ACCOUNT_BRIDGE_ENDPOINT":"/u01/src/test/sample.txt"}}
I wanted to read this and assign to bash variables like,
file_name='/home/ukrishnan/projects/test.yml'
key='LOG_DRIVER'
value='syslog'
As I'm almost completely new to shell scripting and to grep and awk, I don't have much idea how to achieve this. Any help on this would be greatly appreciated.
I wrote a JSON serializer / deserializer for gawk, if you're interested. Save that script and modify it, replacing everything above # === FUNCTIONS === with the following:
#!/usr/bin/gawk -f
# capture the JSON string from beginning to end into a scalar variable
{ json = json ORS $0 }
END {
    # objectify the JSON string into the multilevel array "obj"
    deserialize(json, obj)
    for (filename in obj) {
        print "file_name=" quote(filename)
        for (key in obj[filename]) {
            # print key="value"
            print key "=" quote(obj[filename][key])
        }
    }
}
Do chmod 755 json.awk and execute it. Output will resemble this:
$ ./json.awk test5.json
file_name="/home/ukrishnan/projects/mysql/app.xml"
ENV_ACCOUNT_BRIDGE_ENDPOINT="/u01/src/test/sample.txt"
file_name="/home/ukrishnan/projects/test.yml"
LOG_DRIVER="syslog"
IMAGE="mysql:5.6"
Hopefully the logic is reasonably easy to follow. If you prefer to output file_name=, key=, and value= on every loop iteration, modify the nested for loops accordingly:
for (filename in obj) {
    for (key in obj[filename]) {
        print "file_name=" quote(filename)
        print "key=" quote(key)
        print "value=" quote(obj[filename][key])
    }
}
That change will result in the following output:
$ ./json.awk test5.json
file_name="/home/ukrishnan/projects/mysql/app.xml"
key="ENV_ACCOUNT_BRIDGE_ENDPOINT"
value="/u01/src/test/sample.txt"
file_name="/home/ukrishnan/projects/test.yml"
key="LOG_DRIVER"
value="syslog"
file_name="/home/ukrishnan/projects/test.yml"
key="IMAGE"
value="mysql:5.6"
Anyway, with that output, you can do something silly in bash like this to populate and act upon the variables:
#!/bin/bash
./json.awk test5.json | while read -r line; do {
    eval "$line"
    [ "${line/=*/}" = "value" ] && {
        echo "bash: file_name=$file_name"
        echo "bash: key=$key"
        echo "bash: value=$value"
        echo "------"
    }
}; done
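Note that eval executes whatever is on the line, including anything hostile hiding in a value. A more defensive sketch of the same loop (assuming the three-lines-per-record output format just shown) parses each line with parameter expansions instead:
#!/bin/bash
./json.awk test5.json | while IFS= read -r line; do
    name=${line%%=*}              # text before the first '='
    val=${line#*=}                # text after it, still wrapped in quotes
    val=${val#\"}; val=${val%\"}  # strip the surrounding quotes
    case $name in
        file_name) file_name=$val ;;
        key)       key=$val ;;
        value)     echo "bash: file_name=$file_name"
                   echo "bash: key=$key"
                   echo "bash: value=$val"
                   echo "------" ;;
    esac
done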
It'd probably be more graceful just to do all processing within gawk from start to finish and not mess with the polyglot handoff, though.
Getting back to json.awk: if you prefer to keep json.awk modular for easy reuse in future projects, you could remove everything above # === FUNCTIONS ===, create a separate main.awk containing the code block at the top of this answer, and @include "json.awk" as a helper library pretty much anywhere outside of END {...} (just below the shebang, for example).
JSON.sh (from http://json.org) offers a nice bash-friendly means of flattening out a JSON file, which you've already shown in your question. The flattened form follows the format:
[node] tab value
Thinking in UNIX tools, you'll note that the lines you're interested in follow this pattern:
["filename","key"] tab "value"
In regex notation, we replace:
filename with (.*)
key with (.*)
tab with \t
value with (.*)
We can retrieve the first, second and third matching groups with \1, \2, \3 respectively.
When used in sed, note that the symbols [ ] ( ) need to be escaped with a backslash \, resulting in the following script:
./lib/JSON.sh < test/test.json | sed 's/\["\(.*\)","\(.*\)\"]\t"\(.*\)"/\1,\2,\3/;t;d'
/home/ukrishnan/projects/test.yml,LOG_DRIVER,syslog
/home/ukrishnan/projects/test.yml,IMAGE,mysql:5.6
/home/ukrishnan/projects/mysql/app.xml,ENV_ACCOUNT_BRIDGE_ENDPOINT,/u01/src/test/sample.txt
Now we put the lines in a loop and, for each line, extract filename, key, and value (reading line by line, rather than word-splitting the whole output, keeps values containing spaces intact):
./lib/JSON.sh < test/test.json |
sed 's/\["\(.*\)","\(.*\)\"]\t"\(.*\)"/\1,\2,\3/;t;d' |
while IFS= read -r line
do
    IFS="," read -ra arr <<< "$line"
    filename=${arr[0]}
    key=${arr[1]}
    value=${arr[2]}
    cat <<EOF
filename : $filename
key : $key
value : $value
EOF
done
Which outputs:
filename : /home/ukrishnan/projects/test.yml
key : LOG_DRIVER
value : syslog
filename : /home/ukrishnan/projects/test.yml
key : IMAGE
value : mysql:5.6
filename : /home/ukrishnan/projects/mysql/app.xml
key : ENV_ACCOUNT_BRIDGE_ENDPOINT
value : /u01/src/test/sample.txt

Cannot call function in PowerShell

Function.psm1
function split-release {
    Param
    (
        [string]$Release
    )
    # Regex to match semantic versioning
    if ($release -notmatch '\d+[.]\d+[.]\d+')
    {
        Write-Error "Invalid Release Number"
    }
    # Split the version string into an array
    $RelVersion = $release.split(".")
    #{"Major"=$RelVersion[0];"Minor"=$RelVersion[1];"Patch"=$RelVersion[2]}
}
split.psm1
Import-Module .\Function.psm1
split-release
I call the function as
PS c:\ > .\split.psm1 1.2.3
It doesn't print any output or error out.
It seems to print to the console when I test it by importing just that function in a psm1 file and then, in a separate file, importing the module and passing "0.0.0" to split-release.
The .\ syntax indicates that the desired file is in the same directory as the caller. Is that the case with your files? Is there any additional code that may be obscuring the output?
Other minor points:
Write-Host will not write to your output stream. In PS the alias to echo is Write-Output.
You can use a hash table to return these as a single object with properties.
Modified function:
function split-release {
    Param
    (
        [string]$Release
    )
    # Regex to match semantic versioning
    if ($release -notmatch '\d+[.]\d+[.]\d+')
    {
        Write-Error "Invalid Release Number"
    }
    # Split the version string into an array
    $RelVersion = $release.split(".")
    # Return the pieces as a single object with Major/Minor/Patch properties
    @{"Major"=$RelVersion[0]; "Minor"=$RelVersion[1]; "Patch"=$RelVersion[2]}
}
Output:
PS C:/ > $release = "1.2.3"
PS C:/ > $result = split-release -Release $release
PS C:/ > $result.Major
1
PS C:/ > $result.Minor
2
PS C:/ > $result.Patch
3
More Info:
about_functions
I tried this finally and it appears to work.
File1.psm1
function split-release ($release) {
    $RelVersion = $release.split(".")
    $Relmajor = $RelVersion[0]
    $Relminor = $RelVersion[1]
    $Relpatch = $RelVersion[2]
    write-host $Relmajor $Relminor $Relpatch
}
File2.ps1
param(
    [string]$release = $(throw "Release number required as script parameter")
)
Import-Module ./File1.psm1
split-release "$release"
Finally, run it as PS C:\ > ./file2.ps1 1.2.3

What are all the ways to traverse directory trees?

How do you traverse a directory tree in your favorite language?
What do you need to know to traverse a directory tree in different operating systems? On different filesystems?
What's your favorite library/module for aiding in traversing a directory tree?
In Python:
If you're looking for a quick, clean, and portable solution try:
# Python 2 (os.path.walk was removed in Python 3)
import os
base_dir = '.'

def foo(arg, curr_dir, files):
    print curr_dir
    print files

os.path.walk(base_dir, foo, None)
Note that you can modify foo to do something other than just printing the names. Also, os.path.walk was removed in Python 3, so if you're migrating you will have to use os.walk() instead.
In Java:
Recursion is useful here. Following is a Java code snippet that's been all over the Internet for ages. Not sure who deserves the credit for it.
// Process all files and directories under dir
public static void visitAllDirsAndFiles(File dir) {
    process(dir); // do something useful with the file or dir
    if (dir.isDirectory()) {
        String[] children = dir.list();
        for (int i = 0; i < children.length; i++) {
            visitAllDirsAndFiles(new File(dir, children[i]));
        }
    }
}
bash:
#!/bin/bash
function walk_tree {
    local directory="$1"
    local i
    echo "Directory: $directory"
    for i in "$directory"/*; do
        if [ -d "$i" ]; then
            # add a command here to process all files in the directory (i.e. ls -l "$i/"*)
            walk_tree "$i" # DO NOT COMMENT OUT THIS LINE!!
        else
            echo "File: $i" # replace with whatever per-file processing you need
        fi
    done
}
walk_tree "$HOME"
(adapted from http://ubuntuforums.org/showthread.php?t=886272 Comment #4)
In C#:
Stack<DirectoryInfo> dirs = new Stack<DirectoryInfo>();
dirs.Push(new DirectoryInfo("C:\\"));
while (dirs.Count > 0) {
    DirectoryInfo current = dirs.Pop();
    // Do something with 'current' (if you want)
    Array.ForEach(current.GetFiles(), delegate(FileInfo f)
    {
        // Do something with 'f'
    });
    Array.ForEach(current.GetDirectories(), delegate(DirectoryInfo d)
    {
        dirs.Push(d);
    });
}
C++
#include <utility>
#include <boost/filesystem.hpp>
#include <boost/foreach.hpp>
#define foreach BOOST_FOREACH
namespace fs = boost::filesystem;

fs::recursive_directory_iterator it(top), eod;
foreach (fs::path const & p, std::make_pair(it, eod)) {
    if (is_directory(p)) {
        ...
    } else if (is_regular_file(p)) {
        ...
    } else if (is_symlink(p)) {
        ...
    }
}
On Linux with GNU tools
find -print0 | xargs -0 md5sum
or
find -print0 | xargs -0 -I ASD echo 'this file "ASD" should be dealt with like this (ASD)'
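Another common shell pattern, for when the per-file work needs to run in the current shell rather than in a child process, is to read the NUL-delimited list in a loop (a bash sketch; process substitution is a bashism):
while IFS= read -r -d '' f; do
    printf 'Found: %s\n' "$f"
done < <(find . -print0)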
mmmm, C# with a dose of recursion.....
public static List<string> CrawlPath(string path, bool IncludeSubFolders)
{
    List<string> fileList = new List<string>();
    try
    {
        Stack<string> filez = new Stack<string>(Directory.GetFiles(path));
        while (filez.Count > 0)
        {
            fileList.Add(filez.Pop());
        }
        if (IncludeSubFolders)
        {
            filez = new Stack<string>(Directory.GetDirectories(path));
            while (filez.Count > 0)
            {
                string curDir = filez.Pop();
                fileList.AddRange(CrawlPath(curDir, IncludeSubFolders));
            }
        }
    }
    catch (System.Exception err)
    {
        Console.WriteLine("Error: " + err.Message);
    }
    return fileList;
}