Parse XML from a not-well-formed page using XPath - html

Notice:
While writing this question, I noticed that there is a GitHub API that solves my problem without HTML parsing: https://api.github.com/repos/mozilla/geckodriver/releases/latest. I decided to ask anyway since I'm interested in how to solve the described problem of parsing malformed HTML itself. So please don't downvote just because there is a GitHub API for it! GitHub could be replaced by any other page throwing validation errors.
I want to download the latest version of geckodriver. By fetching the redirection target of the latest tag, I end up on the releases page:
curl $(curl -s "https://github.com/mozilla/geckodriver/releases/latest" --head | grep -i location | awk '{print $2}' | sed 's/\r//g') > /tmp/geckodriver.html
The first asset named geckodriver-vx.xxx-linux64.tar.gz is the required link. Since XML is a strict, well-defined format, it should be possible to parse the page properly. Tools like xmllint can query it using XPath expressions. Since XPath is new for me, I tried a simple query on the header, but xmllint throws a lot of errors:
$ xmllint --xpath '//div[@class="Header"]' /tmp/geckodriver.html
/tmp/geckodriver.html:51: parser error : Specification mandate value for attribute data-pjax-transient
<meta name="selected-link" value="repo_releases" data-pjax-transient>
^
/tmp/geckodriver.html:107: parser error : Opening and ending tag mismatch: link line 105 and head
</head>
^
/tmp/geckodriver.html:145: parser error : Entity 'nbsp' not defined
Sign up
^
/tmp/geckodriver.html:172: parser error : Entity 'rarr' not defined
es <span class="Bump-link-symbol float-right text-normal text-gray-light">→
...
There are a lot more. It seems that the GitHub page is not well formed as the specification requires. I also tried xmlstarlet:
xmlstarlet sel -t -v -m '//div[@class="Header"]' /tmp/geckodriver.html
but the result is similar.
Is it not possible to extract some data using those tools when the HTML is not well formed?

curl $(curl -s "https://github.com/mozilla/geckodriver/releases/latest" --head | grep -i location | awk '{print $2}' | sed 's/\r//g') > /tmp/geckodriver.html
It may be simpler to use -L, and have curl follow the redirection:
curl -L https://github.com/mozilla/geckodriver/releases/latest
Then, xmllint accepts an --html argument, to use an HTML parser:
xmllint --html --xpath '//div[@class="Header"]'
However, this doesn't match anything on that page, so perhaps you want to base your XPath on something like:
'string((//a[span[contains(.,"linux")]])[1]/@href)'
Which yields:
/mozilla/geckodriver/releases/download/v0.26.0/geckodriver-v0.26.0-linux32.tar.gz
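As a self-contained illustration of why --html helps, the snippet below feeds xmllint a fragment with exactly the constructs the XML parser rejected (an unquoted attribute value, an unclosed tag); the fragment and the "linux64" filter are made-up stand-ins for the real page:

```shell
# The HTML parser recovers from markup the strict XML parser cannot handle;
# 2>/dev/null hides the (harmless) HTML parser warnings.
html='<div class=Header><a href="/x/geckodriver-v0.26.0-linux64.tar.gz"><span>linux64</span></a>'
echo "$html" | xmllint --html --xpath \
  'string((//a[span[contains(.,"linux64")]])[1]/@href)' - 2>/dev/null
# → /x/geckodriver-v0.26.0-linux64.tar.gz
```

The same XPath against the live page (fetched with `curl -L`) returns a relative link, so prepend https://github.com to it before downloading.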


AOSP How to verify OTA updates by their metadata

I'm building an OTA update for my custom Android 10 build as follows:
./build/make/tools/releasetools/ota_from_target_files \
--output_metadata_path metadata.txt \
target-files.zip \
ota.zip
The resulting ota.zip can be applied by extracting payload.bin and payload_properties.txt according to the Android documentation for update_engine_client.
update_engine_client --payload=file:///<wherever>/payload.bin \
--update \
--headers=<Contents of payload_properties.txt>
This all works, so I'm pretty sure from this result that I've created the OTA correctly. However, I'd like to be able to download the metadata and verify that the payload can be applied before having the client download the entire payload.
Looking at the update_engine_client --help options, it appears one can verify the metadata as follows:
update_engine_client --verify --metadata=<path to metadata.txt from above>
This is where I'm failing to achieve the desired result, though. I get an error that says it failed to parse the payload header. It's failing with kDownloadInvalidMetadataMagicString, which, when I read the source, appears to check the first 4 bytes of the metadata. Apparently the metadata.txt I created isn't what the verification tool expects.
So I'm hoping someone can point me in the right direction to either generate the metadata correctly or tell me how to use the tool correctly.
It turns out the metadata generated by the OTA tool is in a human-readable format, while the verify method expects a binary file. That binary file is not a separate entry in the zip contents; instead, it's prepended to payload.bin. So the first bytes of payload.bin are actually payload_metadata.bin, and those bytes will work with the verify method of update_engine_client to determine whether the payload is applicable.
I'm extracting the payload_metadata.bin in a makefile as follows:
$(DEST)/%.meta: $(DEST)/%.zip
	unzip $< -d /tmp META-INF/com/android/metadata
	python -c 'import re; meta=open("/tmp/META-INF/com/android/metadata").read(); \
	m=re.match(".*payload_metadata.bin:([0-9]*):([0-9]*)", meta); \
	s=int(m.groups()[0]); l=int(m.groups()[1]); \
	z=open("$<","rb").read(); \
	open("$@","wb").write(z[s:s+l])'
	rm -rf /tmp/META-INF

About Amster '--body' option (OpenAM command line)

There is a '--body' option for most of the Amster commands. This option allows you to send the body of a request in JSON syntax. However, if the body of your request is big, the --body argument becomes big too, and the Amster command gets unwieldy in your terminal. Is there any option to specify this JSON text in a way that is less uncomfortable on the command line?
Perhaps there is an option that allows you to indicate the path of a JSON file, or something like that.
I will be very grateful for any answer.
My answer below is based on the latest available Amster (6.0.0).
You can use Amster in script mode.
Essentially, you can write your Amster commands in a separate file; let's call it myscript.amster (please note, the extension is not important).
You can then put your entire command, including the JSON, in your script, for example to create a realm. Please note the use of `\` to split the JSON across multiple lines:
create Realms --global --body '{ \
"name": "test", \
"active": false, \
"parentPath": "/", \
"aliases": [ "testing" ] \
}'
Now, you can run this script in two modes:
From within the amster shell:
am> :load <pathToYourScript>
Without having to enter the script mode:
amster/amster <pathToYourScript>
In this mode, do remember to connect to your OpenAM server before running your commands and to :quit at the end. You should find some more samples in the samples directory of your Amster installation.
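A complete script for the second mode might look like the sketch below; the server URL, key path, and connect options are placeholder assumptions, so adjust them to your deployment:

```
# myscript.amster -- run with: amster/amster myscript.amster
# (placeholder server URL and key path; adjust to your setup)
connect -k /path/to/amster_rsa http://openam.example.com/openam
create Realms --global --body '{ \
"name": "test", \
"active": false, \
"parentPath": "/", \
"aliases": [ "testing" ] \
}'
:quit
```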

Postfix and Altermime config. Send html emails with disclaimer

I am currently running a CentOS 7 instance. Already installed: Postfix, Dovecot, alterMIME. Sending and receiving emails works fine. However, what I am trying to do is attach a disclaimer to every mail sent.
My problem is that the disclaimer config has a disclaimer text option and a disclaimer HTML option, but my email is always sent as plain text, so I receive emails with a disclaimer showing raw HTML tags.
How can I configure Postfix or alterMIME to send HTML emails, or to send plain text with an image attached?
Here's my disclaimer config:
from_address=`grep -m 1 "From:" in.$$ | cut -d "<" -f 2 | cut -d ">" -f 1`
if [ `grep -wi ^${from_address}$ ${DISCLAIMER_ADDRESSES}` ]; then
/usr/bin/altermime --input=in.$$ \
--disclaimer-html=/etc/postfix/disclaimer.html \
--disclaimer=/etc/postfix/disclaimer.txt \
--force-for-bad-html \
--xheader="X-Copyrighted-Material: Please visit http://www.company.com/privacy.h$
{ echo Message content rejected; exit $EX_UNAVAILABLE; }
fi
I've tried adding the HTML text in disclaimer.txt and in disclaimer.html, but the email is always received as plain text showing the HTML tags.
Any help would be gladly appreciated.

Output Shell Script Variable to HTML

I'm rather new to coding and looking for a little help with writing a variable's value to a local HTML file. Both the script and the HTML are on the same machine. The script pulls the signal level from a modem, and I would like to have that displayed on a local HTML loading screen.
Signal Script:
#!/bin/bash
MODEM=/dev/ttyUSB3
MODEMCMD=AT+CSQ
{
  # Send the signal-quality query (AT+CSQ) to the modem.
  echo -ne "$MODEMCMD"'\r\n' > "$MODEM"
  if [ $? -ne 0 ]; then
    echo "error"
    exit
  fi
  {
    # Wait up to 1 second per line for the reply "+CSQ: <rssi>,<ber>".
    while read -t 1; do
      if [[ $REPLY == +CSQ* ]]; then
        arr1=$(echo "$REPLY" | cut -d' ' -f2)  # "<rssi>,<ber>"
        arr2=$(echo "$arr1" | cut -d',' -f1)   # "<rssi>"
        echo "$arr2"
        exit
      fi
    done
    echo "error"
  } < "$MODEM"
} 2>/dev/null
I would like the output of this to display in a table on my HTML page. Thanks!
When you host your own web server, the CGI protocol allows you to do server-side programming in any language you like; including Bash.
Here's a simple example that serves a web page that displays the current date and time, and refreshes every 10 seconds. Put the Bash script below in a file named test.cgi in the cgi-bin folder (on Linux/Apache: /usr/lib/cgi-bin) and make it executable (chmod +x test.cgi). The page's URL is: http://MyWebServer/cgi-bin/test.cgi
#!/bin/bash
cat << EOT
content-type: text/html

<!DOCTYPE html>
<html>
<head>
<title>Test</title>
<meta http-equiv="refresh" content="10" />
</head>
<body>
$(date)
</body>
</html>
EOT
Please note: the empty line below content-type is relevant in the CGI protocol.
Replace date with your own Bash script; make sure it outputs something that resembles valid HTML. This means you need to take some effort to add HTML tags for your markup and to HTML-encode some characters (< > &) in your content.
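For the HTML-encoding step, here is a minimal sed-based sketch (html_escape is a hypothetical helper name; it only covers the three characters mentioned above):

```shell
#!/bin/bash
# Encode the characters that are special in HTML content.
# '&' must be handled first, or it would re-escape the other entities.
html_escape() {
  sed -e 's/&/\&amp;/g' -e 's/</\&lt;/g' -e 's/>/\&gt;/g'
}

echo 'signal < -70 dBm & rising' | html_escape
# → signal &lt; -70 dBm &amp; rising
```

Pipe your script's output through such a filter before wrapping it in table cells.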

Get list of repositories using gitweb for external use

To be used by another external script, we need the list of repositories hosted on a Git repository server. We also have GitWeb enabled on the server.
Does anyone know if GitWeb exposes some API through which we can get the list of repositories? Something like the GitBlit RPC (http://gitblit.com/rpc.html, e.g. https://your.gitblit.url/rpc?req=LIST_REPOSITORIES)?
Thanks.
No. From what I can see of the gitweb.cgi implementation (from gitweb/gitweb.perl), there is no RPC API with JSON return messages.
The list is only available through the web page.
In the bottom right corner there is a small button that reads: TXT
You can get the list of projects there, for example:
For sourceware, the gitweb page: https://sourceware.org/git/
The TXT button links here: https://sourceware.org/git/?a=project_index
It returns a list of projects, each line being basically
<name of the git repository> <owner>
in plain text, perfectly parseable by a script.
But if you want JSON, you'd have to convert it with something like this:
$ wget -q -O- "https://sourceware.org/git/?a=project_index" \
| jq -R -n '[ inputs | split(" ")[0:2] | {"project": .[0], "owner": .[1]} ]'