Regex to extract all Starred Items URLs from Google Reader JSON file

Sadly, it was announced that Google Reader will be shut down in the middle of the year.
Since I have a large number of starred items in Google Reader, I'd like to back them up.
This is possible via Google Reader Takeout. It produces a file in JSON format.
Now I would like to extract all of the article URLs from this several-megabyte file.
At first I thought it would be best to use a general URL regex, but it seems better to use a regex that finds just the article URLs. This will prevent extracting other URLs that are not needed.
Here is a short example of how parts of the JSON file look:
"published" : 1359723602,
"updated" : 1359723602,
"canonical" : [ {
"href" : "http://arstechnica.com/apple/2013/02/omni-group-unveils-omnifocus-2-omniplan-omnioutliner-4-for-mac/"
} ],
"alternate" : [ {
"href" : "http://feeds.arstechnica.com/~r/arstechnica/everything/~3/EphJmT-xTN4/",
"type" : "text/html"
} ],
I just need the URLs you can find here:
"canonical" : [ {
"href" : "http://arstechnica.com/apple/2013/02/omni-group-unveils-omnifocus-2-omniplan-omnioutliner-4-for-mac/"
} ],
Is anyone in the mood to say what a regex would have to look like to extract all these URLs?
The benefit would be a quick-and-dirty way to extract starred item URLs from Google Reader and, once processed, import them into services like Pocket or Evernote.

I know you asked about regex, but I think there's a better way to handle this problem. Multi-line regular expressions are a PITA, and in this case there's no need for that kind of brain damage.
I would start with grep rather than one big regex. The -A1 parameter says "return the line that matches, and one line after it":
grep -A1 "canonical" <file>
This will return lines like this:
"canonical" : [ {
"href" : "http://arstechnica.com/apple/2013/02/omni-group-unveils-omnifocus-2-omniplan-omnioutliner-4-for-mac/"
Then, I'd grep again for the href:
grep -A1 "canonical" <file> | grep "href"
giving
"href" : "http://arstechnica.com/apple/2013/02/omni-group-unveils-omnifocus-2-omniplan-omnioutliner-4-for-mac/"
Now I can use awk to get just the URL:
grep -A1 "canonical" <file> | grep "href" | awk -F'" : "' '{ print $2 }'
which strips the leading quote from the URL but leaves the trailing one:
http://arstechnica.com/apple/2013/02/omni-group-unveils-omnifocus-2-omniplan-omnioutliner-4-for-mac/"
Now I just need to get rid of the extra quote:
grep -A1 "canonical" <file> | grep "href" | awk -F'" : "' '{ print $2 }' | tr -d '"'
That's it!
http://arstechnica.com/apple/2013/02/omni-group-unveils-omnifocus-2-omniplan-omnioutliner-4-for-mac/
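If you happen to have jq available, a one-liner can pull the same hrefs in a single pass. This is just a sketch, and it assumes the starred entries sit under a top-level items array, as in typical Reader takeout files:
jq -r '.items[]? | .canonical[]? | .href? // empty' starred.json
The trailing // empty silently skips entries that have no canonical link.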

Related

JQ write each object to subdirectory file

I'm new to jq (around 24 hours). I'm getting the filtering/selection already, but I'm wondering about advanced I/O features. Let's say I have an existing jq query that works fine, producing a stream (not a list) of objects. That is, if I pipe them to a file, it produces:
{
"id": "foo",
"value": "123"
}
{
"id": "bar",
"value": "456"
}
Is there some fancy expression I can add to my jq query to output each object individually in a subdirectory, keyed by the id, in the form id/id.json? For example current-directory/foo/foo.json and current-directory/bar/bar.json?
As @pmf has pointed out, an "only-jq" solution is not possible. A solution using jq and awk is as follows, though it is far from robust:
<input.json jq -rc '.id, .' | awk '
# odd lines carry the raw id, even lines the compact object itself
id=="" { id=$0; next; }
{ path=id; gsub(/[/]/, "_", path);   # sanitize slashes for the directory name
system("mkdir -p " path);
print >> (path "/" id ".json");
id="";
}
'
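For reference, the jq -rc '.id, .' stage feeds awk an alternating stream of raw ids and compact objects, which is what the id=="" rule keys off:
foo
{"id":"foo","value":"123"}
bar
{"id":"bar","value":"456"}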
As you will need help from outside jq anyway (see @peak's answer using awk), you might also want to consider using another JSON processor that offers more I/O features. One that comes to mind is mikefarah/yq, a jq-inspired processor for YAML, JSON, and other formats. It can split documents into multiple files, and since its v4.27.2 release it also supports reading multiple JSON documents from a single input source.
$ yq -p=json -o=json input.json -s '.id'
$ cat foo.json
{
"id": "foo",
"value": "123"
}
$ cat bar.json
{
"id": "bar",
"value": "456"
}
The argument following -s defines the evaluation filter for each output file's name, .id in this case (the .json suffix is added automatically), and it can be adapted to further needs, e.g. -s '"file_with_id_" + .id'. However, adding slashes will not result in subdirectories being created, so this (from here on comparatively easy) part is left for post-processing in the shell.
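That post-processing step can be a short shell loop. A minimal sketch, assuming the split files were written into an otherwise empty working directory and the ids contain no slashes:
for f in *.json; do
  id="${f%.json}"                    # strip the .json suffix to recover the id
  mkdir -p "$id" && mv "$f" "$id/$id.json"
done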

jq: filter result by value (contains) is very slow

I am trying to use jq to filter a large number of JSON files and extract the ids of the objects that belong to a specific domain, as well as the full URLs within that domain. Here's a sample of the data:
{
"items": [
{
"completeness": 5,
"dcLanguageLangAware": {
"def": [
"de"
]
},
"edmIsShownBy": [
"https://gallica.example/image/2IC6BQAEGWUEG4OP7AYBDGIGYAX62KZ6H366KXP2IKVAF4LKY37Q/presentation_images/5591be60-01fc-11e6-8e10-fa163e091926/node-3/image/SBB/Berliner_Börsenzeitung/1920/02/27/F_065_098_0/F_SBB_00007_19200227_065_098_0_001/full/full/0/default.jpg"
],
"id": "/9200355/BibliographicResource_3000117730632",
"type": "TEXT",
"ugc": [
false
]
}
]
}
Bigger sample here: https://www.dropbox.com/s/0s0zjtxe01mecjc/AoQhRn%2B56KDm5AJJPwEvOTIwMDUyMC9hcmtfXzEyMTQ4X2JwdDZrMTAyNzY2Nw%3D%3D.json?dl=0
I can extract both the id and the URL of every object whose URL contains the string "gallica" using the following command:
jq '[ .items[] | select(.edmIsShownBy[] | contains ("gallica")) | {id: .id, link: .edmIsShownBy[] }]'
However, I have more than 28,000 JSON files to process, and it is taking a long time (around one file per minute). I am processing the files in bash with the command:
find . -name "*.json" -exec cat '{}' ';' | jq '[ .items[] | select(.edmIsShownBy[] | contains ("gallica")) | {id: .id, link: .edmIsShownBy[] }]'
I was wondering if the slowness is due to the instructions given to jq, and if so, whether there is a faster way to filter on a string contained in a chosen value. Any ideas?
It would probably be wise not to attempt to cat all the files at once; indeed, it would probably be best to avoid cat altogether.
For example, assuming program.jq contains whichever jq program you decide on (and there is nothing wrong with using contains here), you could try:
find . -name "*.json" -exec jq -f program.jq '{}' +
Using the non-standard + instead of ';' minimizes the number of times jq must be called, though the overhead of invoking jq is actually quite small. If your find does not support + and you wish to avoid calling jq once per file, then consider using xargs, or GNU parallel with the --xargs option.
If you know the JSON files of interest are in the pwd, you could also speed up find by specifying -maxdepth 1.
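As a sketch of the xargs variant mentioned above (assuming GNU find and xargs, which support the null-delimiter flags):
find . -maxdepth 1 -name "*.json" -print0 | xargs -0 jq -f program.jq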

How to find something in a json file using Bash

I would like to search a JSON file for some key or value, and have it print where it was found.
For example, when using jq to print out my Firefox extensions.json, I get something like this (using "..." here to skip long parts):
{
"schemaVersion": 31,
"addons": [
{
"id": "wetransfer#extensions.thunderbird.net",
"syncGUID": "{e6369308-1efc-40fd-aa5f-38da7b20df9b}",
"version": "2.0.0",
...
},
{
...
}
]
}
Say I would like to search for "wetransfer#extensions.thunderbird.net", and would like an output which shows me where it was found with something like this:
{ "addons": [ {"id": "wetransfer#extensions.thunderbird.net"} ] }
Is there a way to get that with jq or with some other json tool?
I also tried to simply list the various ids in that file, and hoped that I would get it with jq '.id', but that just returned null, because it apparently needs the full path.
In other words, I'm looking for a command-line JSON parser which I could use in a way similar to XPath tools.
The path() function comes in handy:
$ jq -c 'path(.. | select(. == "wetransfer#extensions.thunderbird.net"))' input.json
["addons",0,"id"]
The resulting path is interpreted as "In the addons field of the initial object, the first array element's id field matches". You can use it with getpath(), setpath(), delpaths(), etc. to get or manipulate the value it describes.
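For instance, feeding that path back into getpath() returns the matched value, which is a quick way to sanity-check it against the same input.json:
$ jq 'getpath(["addons",0,"id"])' input.json
"wetransfer#extensions.thunderbird.net"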
Using your example with modifications to make it valid JSON:
< input.json jq -c --arg s wetransfer#extensions.thunderbird.net '
paths as $p | select(getpath($p) == $s) | null | setpath($p;$s)'
produces:
{"addons":[{"id":"wetransfer#extensions.thunderbird.net"}]}
Note
If there are N paths to the given value, the above will produce N lines. If you want only the first, you could wrap everything in first(...).
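A sketch of that first(...) variant, using the same invocation as above:
< input.json jq -c --arg s wetransfer#extensions.thunderbird.net '
first(paths as $p | select(getpath($p) == $s) | null | setpath($p;$s))'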
Listing all the "id" values
I also tried to simply list the various ids in that file
Assuming that "id" values of false and null are of no interest, you can print all the "id" values of interest using the jq filter:
.. | .id? // empty
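For example, run against the truncated extensions.json sample above:
$ jq '.. | .id? // empty' input.json
"wetransfer#extensions.thunderbird.net"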

Grep from txt file (JSON format)

I have a txt in a JSON format:
{
"items": [ {
"downloadUrl" : "some url",
"path": "yxxsf",
"id" : "abc",
"repository" : "example",
"format" : "zip",
"checksum" : {
"sha1" : "kdhjfksjdfasdfa",
"md5" : "skjfhkjshdfkjshfkjsdhf"
}
}],
"continuationToken" : null
}
I want to extract the download URL (in this example I want "some url") using grep and store it in another txt file. TBH, I have never used grep.
Using grep
grep -oP '"downloadUrl"\s*:\s*"\K[^"]+' myfile > urlFile.txt
See this Regex in action: https://regex101.com/r/DvnXCO/1
A better way to do this is to use jq
Download jq for Windows: https://stedolan.github.io/jq/download/
jq ".items[0].downloadUrl" myfile > urlFile.txt
Although a JSON string may contain a double-quote character escaped by a backslash, both double quotes and backslashes in URLs should be percent-encoded according to RFC 3986. You can therefore extract the URL with:
tr "[:space:]" " " < file.json | grep -Po '"downloadUrl"\s*:\s*\K"[^"]+"'
The tr pre-processes the JSON file, converting all whitespace characters (including newlines) into plain spaces, so the grep that follows works even when the name and the value are on separate (but consecutive) lines.
The \K operator in the regex is a variable-length look-behind that excludes the preceding pattern from the matched result.
Note that the command above works with the provided example but may not be robust enough for arbitrary inputs. I'd still recommend using jq if strictness is required.
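In that spirit, a jq one-liner that handles every element of the items array (a sketch, assuming the structure shown in the question):
jq -r '.items[].downloadUrl' file.json > urlFile.txt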
If you want to use only grep:
grep downloadUrl myfile > new_file.txt
If you prefer a cleaner option, adding cut command:
grep downloadUrl myfile | cut -d\" -f4 > new_file.txt

Find and edit a Json file using bash

I have multiple files in the following format with different categories like:
{
"id": 1,
"flags": ["a", "b", "c"],
"name": "test",
"category": "video",
"notes": ""
}
Now I want to append the string "d" to the flags of all files whose category is video. So my final file should look like the file below:
{
"id": 1,
"flags": ["a", "b", "c", "d"],
"name": "test",
"category": "video",
"notes": ""
}
Now, using the following command, I am able to find the files of interest, but I am stuck on the editing part, which I am unable to do by hand, as there are hundreds of files to edit, e.g.
find . -name "*" | xargs grep "\"category\": \"video\"" | awk '{print $1}' | sed 's/://g'
You can do this
find . -type f | xargs grep -l '"category": "video"' | xargs sed -i -e '/flags/ s/]/, "d"]/'
This will find all the filenames which contain a line with "category": "video", and then add the "d" flag.
Details:
find . -type f
=> Will get all the filenames in your directory
xargs grep -l '"category": "video"'
=> Will get those filenames which contain the line "category": "video"
xargs sed -i -e '/flags/ s/]/, "d"]/'
=> Will add the "d" letter to the flags:line.
"TWEET!!" ... (yellow flag thown to the ground) ... Time Out!
What you have, here, is "a JSON file." You also have, at your #!shebang command, your choice of(!) full-featured programming languages ... with intimate and thoroughly-knowledgeale support for JSON ... with which you can very-speedily write your command-file.
Even if it is "theoretically possible" to do this using "bash scripts," this is roughly equivalent to "putting a beautiful stone archway over the front-entrance to a supermarket." Therefore, "waste ye no time" in such an utterly-profitless pursuit. Write a script, using a language that "honest-to-goodness knows about(!) JSON," to decode the contents of the file, then manipulate it (as a data-structure), then re-encode it again.
Here is a more appropriate approach using PHP in shell:
FILE=foo2.json php -r '$file = $_SERVER["FILE"]; $arr = json_decode(file_get_contents($file)); if ($arr->category == "video") { $arr->flags[] = "d"; file_put_contents($file,json_encode($arr)); }'
Which will load the file, decode into array, add "d" into flags property only when category is video, then write back to the file in JSON format.
To run this for every json file, you can use find command, e.g.
find . -name "*.json" -print0 | while IFS= read -r -d '' file; do
FILE=$file
# run above PHP command in here
done
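Putting the two together, the loop body carries the same one-liner, just reading the filename from the loop variable:
find . -name "*.json" -print0 | while IFS= read -r -d '' file; do
  FILE="$file" php -r '$file = $_SERVER["FILE"]; $arr = json_decode(file_get_contents($file)); if ($arr->category == "video") { $arr->flags[] = "d"; file_put_contents($file, json_encode($arr)); }'
done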
If the files are in the same format, this command may help (version for a single file):
ex +':/category.*video/norm kkf]i, "d"' -scwq file1.json
or:
ex +':/flags/,/category/s/"c"/"c", "d"/' -scwq file1.json
which is basically using Ex editor (now part of Vim).
Explanation:
+ - executes Vim command (man ex)
:/pattern_or_range/cmd - find pattern, if successful execute another Vim commands (:h :/)
norm kkf]i - executes keystrokes in normal mode
kk - move cursor up twice
f] - find ]
i, "d" - insert , "d"
-s - silent mode
-cwq - executes wq (write & quit)
For multiple files, use find and -execdir or extend above ex command to:
ex +'bufdo!:/category.*video/norm kkf]i, "d"' -scxa *.json
Where bufdo! executes the command for every file, and -cxa saves every file. Add -V1 for extra verbose messages.
If the flags line is not 2 lines above, then you may perform a backward search instead, or use a similar approach to @sps by replacing ] with , "d"].
See also: How to change previous line when the pattern is found? at Vim.SE.
Using jq:
find . -type f | xargs cat | jq 'select(.category=="video") | .flags |= . + ["d"]'
Explanation:
jq 'select(.category=="video") | .flags |= . + ["d"]'
# select(.category=="video") => filters by category field
# .flags |= . + ["d"] => Updates the flags array