batch base64 image decode - html

I've got a large (117MB!) HTML file that has thousands of images encoded as base64. I'd like to decode them to JPGs, but my bash-fu isn't enough to do this and I haven't been able to find an answer online.

In general, HTML can't be parsed properly with regular expressions, but if you have a specific limited format then it could work.
Given a simple format like
<body>
<img src="data:image/jpeg;base64,DpFDPGOIg3renreGR43LGLJKds==">
<img src="data:image/jpeg;base64,DpFDPGOIg3renreGR43LGLJKds=="><img src="data:image/jpeg;base64,DpFaPGOIg3renreGR43LGLJKds==">
<div><img src="data:image/jpeg;base64,DpFdPGOIg3renreGR43LGLJKds=="></div>
</body>
the following can pull out the data
i=0; awk 'BEGIN{RS="<"} /="data:image\/jpeg;base64,[^\"]*"/ { match($0, /="data:image\/jpeg;base64,([^\"]*)"/, data); print data[1]; }' test.html | while read d; do echo $d | base64 -d > img$i.jpg; i=$(($i+1)); done
To break that down:
i=0 Keep a counter so we can output different filenames for each image.
awk 'BEGIN{RS="<"} Run awk with the Record Separator changed from the default newline to <, so we always treat each HTML element as a separate record.
/="data:image\/jpeg;base64,[^\"]*"/ Only run the following commands on records that have embedded base64 jpeg data.
{ match($0, /="data:image\/jpeg;base64,([^\"]*)"/, data); print data[1]; }' Pull out the data itself, the part matched with parentheses between the comma and the trailing quotation mark, then print it.
test.html Just the input filename.
| while read d; do Pipe the output base64 data to a loop. read will put each line into d until there's no more input.
echo $d | base64 -d > img$i.jpg; Pass the current image through the base64 decoder and store the output to a file.
i=$(($i+1)); Increment to change the next filename.
done Done.
There are a few things that could probably be done better here:
There should be a way to get the line-match regexp to capture the base64 data directly, instead of repeating the regexp in a call to the match() function, but I couldn't get it to work.
I don't like the technique of reading a pipe into the variable d, only to echo it back out to another pipe - it would be nicer to just pipe straight through - but base64 doesn't know to only use one line of the input.
Incrementing the counter directly where it's used (i.e. echo $d | base64 -d > img$((i++)).jpg) only wrote to the first file, even though echo $d > img$((i++)).b64 correctly wrote the encoded data to multiple files. The reason is that each command in a pipeline runs in its own subshell: the $((i++)) in the redirection is expanded in base64's subshell, so the increment is lost and i is still 0 on the next iteration, whereas the plain echo redirection is expanded in the loop's own shell. Splitting the increment into its own command sidesteps this.
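For the record, here is a variant that addresses the first two points; a sketch assuming GNU awk, untested against a file of the original size. gawk's match() can serve as the pattern itself, so the regexp isn't repeated, and a here-string replaces the echo-through-a-pipe:
i=0
awk 'BEGIN{RS="<"} match($0, /="data:image\/jpeg;base64,([^"]*)"/, m) { print m[1] }' test.html |
while read -r d; do
    base64 -d <<< "$d" > "img$i.jpg"
    i=$((i+1))
done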

You can try scraping the encoded strings of the images using Python.
Then check this out for converting the encoded strings to images.

Use regex to direct the base64 images to separate files
Write loop to iterate through your files.
The bash command to decode each file will be along the lines of:
cat base64_file1 | base64 -d > file1.jpg
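Putting the loop together, a sketch assuming the encoded strings were saved as base64_file1, base64_file2, and so on (the names are illustrative):
for f in base64_file*; do
    base64 -d "$f" > "$f.jpg"
done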

Calling Imagemagick from awk?

I have a CSV of image details I want to loop over in a bash script. awk seems like an obvious choice to loop over the data.
For each row, I want to take the values, and use them to do Imagemagick stuff. The following isn't working (obviously):
awk -F, '{ magick "source.png" "$1.jpg" }' images.csv
GNU AWK excels at processing structured text data. Although it can be used to invoke commands via its system() function, it is less handy for that than some other languages; Python, for example, has the subprocess module in its standard library, which is more feature-rich.
If you wish to use awk for this task anyway, then I suggest preparing output to be fed into bash. Say you have file.txt with the following content
file1.jpg,file1.bmp
file2.png,file2.bmp
file3.webp,file3.bmp
and the files listed in the 1st column exist in the current working directory, you wish to convert them to the files shown in the 2nd column, and you have access to the convert command; then you might do
awk 'BEGIN{FS=","}{print "convert \"" $1 "\" \"" $2 "\""}' file.txt | bash
which is equivalent to starting bash and running
convert "file1.jpg" "file1.bmp"
convert "file2.png" "file2.bmp"
convert "file3.webp" "file3.bmp"
Observe that I have used literal " characters to enclose the filenames, so it should work with names containing spaces. Disclaimer: it might still fail for names containing special characters, e.g. " itself.
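If you'd rather skip the pipe to bash, the system() function mentioned above can run each command straight from awk, with the same quoting caveats (a sketch, using the same file.txt as above):
awk 'BEGIN{FS=","}{ system("convert \"" $1 "\" \"" $2 "\"") }' file.txt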

How to split text file into multiple files and extract filename from line prefix?

I have a simple log file with content like:
1504007980.039:{"key":"valueA"}
1504007990.359:{"key":"valueB", "key2": "valueC"}
...
That I'd like to output to multiple files that each have as content the JSON part that comes after the timestamp. So I would get as a result the files:
1504007980039.json
1504007990359.json
...
This is similar to How to split one text file into multiple *.txt files? but the name of the file should be extracted from each line (and remove an extra dot), and not generated via an index
Preferably I'd want a one-liner that can be executed in bash.
Since you aren't using GNU awk, you need to close output files as you go to avoid the "too many open files" error. To avoid that, along with issues around specific values in your JSON and undefined behavior during output redirection, this is what you need:
awk '{
fname = $0
sub(/\./,"",fname)        # drop the dot from the timestamp
sub(/:.*/,".json",fname)  # replace everything from the first ":" with ".json"
sub(/[^:]+:/,"")          # strip the timestamp prefix from the record itself
print >> fname
close(fname)              # close each file to avoid "too many open files"
}' file
You can of course squeeze it onto 1 line if you see some benefit to that:
awk '{f=$0;sub(/\./,"",f);sub(/:.*/,".json",f);sub(/[^:]+:/,"");print>>f;close(f)}' file
awk solution:
awk '{ idx=index($0,":"); fn=substr($0,1,idx-1)".json"; sub(/\./,"",fn);
print substr($0,idx+1) > fn; close(fn) }' input.log
idx=index($0,":") - capturing the index of the 1st :
fn=substr($0,1,idx-1)".json" - preparing the filename (the dot is then removed by sub())
print substr($0,idx+1) > fn - printing everything after the : into that file
close(fn) - closing the file handle to avoid "too many open files"
Viewing results (for 2 sample lines from the question):
for f in *.json; do echo "$f"; cat "$f"; echo; done
The output (filename -> content):
1504007980039.json
{"key":"valueA"}
1504007990359.json
{"key":"valueB"}

Cloudformation echo json env variable

My question is similar to this, where I am running into issues with putting JSON into a file. The issue is, no matter how I've formatted my strings inside the userData section of the CloudFormation template, I can't seem to capture an env $variable while maintaining a valid JSON object (with double quotes around the keys and values)
Below are two different ways I've tried to get the object into a file (via echo and cat << EOF > env-config.json), with virtually every combination of string escaping (single quotes wrapped around double quotes escaped around object keys, etc.):
echo '{\"development\": {\"EnvironmentConfig\": {\"api\": \" 'http://$ip:8000/api' \"}}}' >> env-config.json\n"
cat << EOF > env-config.json
{\"development\": {\"EnvironmentConfig\": {\"api\": \" 'http://$ip:8000/api' \"}}}
EOF
How can I place my perfectly formatted JSON object into a file while capturing an env $variable in it from the userData section of CloudFormation?
Thank you!
edit
Tools involved: gulp-ng-config, bash, cloudformation, json
Using gulp-ng-config to create a module with constants from the env-config.json file
I found out the answer: I needed single quotes around the URL portion (as well as the double quotes) of my JSON, like the below. This is what the whole line looks like in CloudFormation; I hope this helps someone:
"echo '{\"development\": {\"EnvironmentConfig\": {\"api\": \"'http://$ip:8000/api'\"}}}' >> env-config.json\n",

Efficiently get the first record of a JSONL file

Is it possible to efficiently get the first record of a JSONL file without consuming the entire stream / file? One way I have been able to inefficiently do so is the following:
curl -s http://example.org/file.jsonl | jq -s '.[0]'
I realize that head could be used here to extract the first line, but assume that the file may not use a newline as the record separator and may simply be concatenated objects or arrays.
If I'm understanding correctly, the JSONL format is just a stream of JSON objects, which jq handles quite nicely. Since you only want the first item, you can use the input filter to grab one item from the stream.
I think you could just do this:
$ curl -s http://example.org/file.jsonl | jq -n 'input'
You need the null input -n so jq doesn't consume the input as usual; then input grabs exactly one value from the stream. There's no need to go through the rest of the input.
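That also covers the concern about input that isn't newline-separated, since jq parses a stream of concatenated JSON values; a quick check with inline sample data:
printf '{"a":1}{"b":2}' | jq -n 'input'
# prints {"a": 1} (pretty-printed) and never emits the second object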

Output of array as comma separated BASH

I'm trying to pull variables from an API in json format and then put them back together with one variable changed and fire them back as a put.
Only issue is that every value has quote marks in it and must go back to the API separated by commas only.
An example of what it should see, with redacted information and the variables inside the **'s:
curl -skv -u redacted:redacted -H "Content-Type: application/json" -X PUT -d '{properties:{basic:{request_rules:[**"/(req) testrule","/test-body","/(req) test - Admin","test-Caching"**]}}}' https://x.x.x.x:9070/api/tm/1.0/config/active/vservers/xxx-xx
Obviously if I fire them as a plain array I get spaces instead of commas. However I tried outputting it as a plain string
longstr=$(echo ${valuez[@]})
output=$(echo $longstr |sed -e 's/" /",/g')
And due to the way bash interprets it, it seems to either handle the quotes wrong or something else. I guess it might well be the single ticks encapsulating the data after the PUT -d as well, but I'm not sure how I can get a variable into something that's wrapped in single ticks.
If I put the raw data in manually it works so it's either the way the variable is being sent or the single ticks. I don't get an error and when I echo the line out it looks perfect.
Any ideas?
valuez=( "/(req) testrule" "/test-body" "/(req) test - Admin" "test-Caching" )
# Temporarily set IFS to some character which is known not to appear in the array.
oifs=$IFS
IFS=$'\014'
# Flatten the array with the * expansion giving a string containing the array's elements separated by the first character of $IFS.
d_arg="${valuez[*]}"
IFS=$oifs
# If necessary, quote or escape embedded quotation marks. (Implementation-specific, using doubled double quotes as an example.)
d_arg="${d_arg//\"/\"\"}"
# Substitute the known-to-be-absent character for the desired quote+separator+quote.
d_arg="${d_arg//$'\014'/\",\"}"
# Prepend and append quotes.
d_arg="\"$d_arg\""
# insert the prepared arg into the final string.
d_arg="{properties:{basic:{request_rules:[${d_arg}]}}}"
curl ... -d"$d_arg" ...
If you have GNU awk version 4 or above, which supports FPAT:
output=$(echo $longstr |awk '$1=$1' FPAT="(\"[^\"]+\")" OFS=",")
Explanation
FPAT
This is a regular expression (as a string) that tells gawk to create the fields based on text that matches the regular expression. Assigning a value to FPAT overrides the use of FS and FIELDWIDTHS for field splitting. See Splitting By Content, for more information.
If gawk is in compatibility mode (see Options), then FPAT has no special meaning, and field-splitting operations occur based exclusively on the value of FS.
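To see the field splitting in isolation, with sample input inline:
echo '"/(req) testrule" "/test-body"' | awk '$1=$1' FPAT='("[^"]+")' OFS=','
# "/(req) testrule","/test-body"
The $1=$1 assignment forces awk to rebuild the record, joining the FPAT-matched fields with the , output separator.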
valuez=( "/(req) testrule" "/test-body" "/(req) test - Admin" "test-Caching" )
csv="" sep=""
for v in "${valuez[@]}"; do csv+="$sep\"$v\""; sep=,; done
echo "$csv"
"/(req) testrule","/test-body","/(req) test - Admin","test-Caching"
If it's something you need to do repeatedly, put it into a function:
toCSV () {
local csv sep val
for val; do
csv+="$sep\"$val\""
sep=,
done
echo "$csv"
}
csv=$(toCSV "${valuez[@]}")
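With the comma-separated string in hand, the single-tick problem from the question goes away if the whole payload is built in a variable and expanded inside double quotes; a sketch with the endpoint and credentials redacted as in the question:
payload="{properties:{basic:{request_rules:[$csv]}}}"
curl -skv -u redacted:redacted -H "Content-Type: application/json" -X PUT -d "$payload" https://x.x.x.x:9070/api/tm/1.0/config/active/vservers/xxx-xx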