Bash Script Loop through MySQL row and use curl and grep - mysql

I have a MySQL database with a table:
url | words
And data like this, for example:
------Column URL------- -------Column Words------
www.firstwebsite.com | hello, hi
www.secondwebsite.com | someword, someotherword
I want to loop through that table to check if the word is present in the content of the website specified by the url.
I have something like this:
#!/bin/bash
mysql --user=USERNAME --password=PASSWORD DATABASE --skip-column-names -e "SELECT url, keyword FROM things" | while read url keyword; do
content=$(curl -sL $url)
echo $content | egrep -q $keyword
status=$?
if test $status -eq 0 ; then
# Found...
else
# Not found...
fi
done
One problem:
It's very slow: how can I set curl to reduce the load time of each website (not loading images, things like that)?
Also, is it a good idea to put things like that in a shell script, or is it better to create a PHP script and call it with curl?
Thanks !

As it stands your script will not work as you might expect when you have multiple keywords per row as in your example. The reason is that when you pass hello, hi to egrep it will look for the exact string "hello, hi" in its input, not for either "hello" or "hi". You can fix this without making changes to what's in your database by turning each list of keywords into an egrep-compatible regular expression with sed. You'll also need to remove the | from mysql's output, e.g, with awk.
curl doesn't retrieve images when downloading a webpage's HTML. If the order in which the URLs are queried does not matter to you then you can speed things up by making the whole thing asynchronous with &.
#!/bin/bash
handle_url() {
if curl -sL "$1" | egrep -q "$2"; then
echo 1 # Found...
else
echo 0 # Not found...
fi
}
mysql --user=USERNAME --password=PASSWORD DATABASE --skip-column-names -e "SELECT url, keyword FROM things" | awk -F \| '{ print $1, $2 }' | while read url keywords; do
keywords=$(echo $keywords | sed -e 's/, /|/g;s/^/(/;s/$/)/;')
handle_url "$url" "$keywords" &
done
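The keyword conversion can be checked without touching the database or the network. This is a minimal sketch, assuming the comma-separated keyword format from the question; the function names keywords_to_regex and check_content and the sample HTML strings are made up for illustration:

```shell
#!/bin/bash
# Convert a list such as "hello, hi" into the alternation "(hello|hi)".
keywords_to_regex() {
    printf '%s\n' "$1" | sed -e 's/, */|/g' -e 's/^/(/' -e 's/$/)/'
}

# Report whether the text on stdin matches any keyword (1 = found, 0 = not).
check_content() {
    if grep -Eq "$1"; then echo 1; else echo 0; fi
}

regex=$(keywords_to_regex "hello, hi")
first=$(printf '%s\n' '<p>oh hi there</p>' | check_content "$regex")
second=$(printf '%s\n' '<p>no match here</p>' | check_content "$regex")
echo "$regex $first $second"
```

In the real script, adding wait after the loop makes the script exit only once all background checks have finished, and curl's --max-time option can bound the time spent on slow sites.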

Related

Storing aws ssm parameter as variable in bash script [duplicate]

I have a pretty simple script that is something like the following:
#!/bin/bash
VAR1="$1"
MOREF='sudo run command against $VAR1 | grep name | cut -c7-'
echo $MOREF
When I run this script from the command line and pass it the arguments, I am not getting any output. However, when I run the commands contained within the $MOREF variable, I am able to get output.
How can one take the results of a command that needs to be run within a script, save it to a variable, and then output that variable on the screen?
In addition to backticks `command`, command substitution can be done with $(command) or "$(command)", which I find easier to read, and allows for nesting.
OUTPUT=$(ls -1)
echo "${OUTPUT}"
MULTILINE=$(ls \
-1)
echo "${MULTILINE}"
Quoting (") does matter to preserve multi-line variable values; it is optional on the right-hand side of an assignment, as word splitting is not performed, so OUTPUT=$(ls -1) would work fine.
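The effect of quoting on multi-line values can be seen with a small sketch (the variable names are only illustrative):

```shell
#!/bin/bash
# Capture three lines of output in one variable.
output=$(printf 'one\ntwo\nthree\n')

# Unquoted: word splitting turns the newlines into single spaces.
unquoted=$(echo $output)

# Quoted: the embedded newlines survive.
quoted=$(echo "$output")

echo "$unquoted"
echo "$quoted"
```

The first echo prints everything on one line, the second prints three separate lines.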
$(sudo run command)
If you're going to use an apostrophe, you need `, not '. This character is called "backticks" (or "grave accent"):
#!/bin/bash
VAR1="$1"
VAR2="$2"
MOREF=`sudo run command against "$VAR1" | grep name | cut -c7-`
echo "$MOREF"
Some Bash tricks I use to set variables from commands
Sorry, this is a long answer, but bash is a shell, whose main goal is to run other Unix commands and react to their result code and/or output (commands are often piped, filtered, etc.).
Storing command output in variables is something basic and fundamental.
Therefore, depending on
compatibility (posix)
kind of output (filter(s))
number of variable to set (split or interpret)
execution time (monitoring)
error trapping
repeatability of request (see long running background process, further)
interactivity (considering user input while reading from another input file descriptor)
do I miss something?
First simple, old (obsolete), and compatible way
myPi=`echo '4*a(1)' | bc -l`
echo $myPi
3.14159265358979323844
Compatible, second way
As nesting backticks quickly becomes heavy, the parenthesis syntax was introduced for this:
myPi=$(bc -l <<<'4*a(1)')
Using backticks in script is to be avoided today.
Nested sample:
SysStarted=$(date -d "$(ps ho lstart 1)" +%s)
echo $SysStarted
1480656334
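To compare how the two syntaxes nest, here is a small sketch that avoids system-dependent commands; the path is only an example and does not need to exist:

```shell
#!/bin/bash
path=/usr/local/bin/tool   # example path; it does not need to exist

# $(...) nests without any escaping.
nested=$(basename "$(dirname "$path")")

# The backtick form needs the inner backticks escaped.
legacy=`basename \`dirname $path\``

echo "$nested $legacy"
```

Both variables end up holding the name of the parent directory, but the backtick version gets harder to read with every extra level.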
bash features
Reading more than one variable (with Bashisms)
df -k /
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/dm-0 999320 529020 401488 57% /
If I just want a used value:
array=($(df -k /))
you could see an array variable:
declare -p array
declare -a array='([0]="Filesystem" [1]="1K-blocks" [2]="Used" [3]="Available" [
4]="Use%" [5]="Mounted" [6]="on" [7]="/dev/dm-0" [8]="999320" [9]="529020" [10]=
"401488" [11]="57%" [12]="/")'
Then:
echo ${array[9]}
529020
But I often use this:
{ read -r _;read -r filesystem size using avail prct mountpoint ; } < <(df -k /)
echo $using
529020
( The first read _ will just drop header line. ) Here, in only one command, you will populate 6 different variables (shown by alphabetical order):
declare -p avail filesystem mountpoint prct size using
declare -- avail="401488"
declare -- filesystem="/dev/dm-0"
declare -- mountpoint="/"
declare -- prct="57%"
declare -- size="999320"
declare -- using="529020"
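The same pattern can be tried against fixed input, so the result does not depend on the local disk. In this sketch, sample_df is a hypothetical stand-in that prints the df output shown above:

```shell
#!/bin/bash
# Hypothetical stand-in for `df -k /`, printing the sample output from above.
sample_df() {
    printf '%s\n' \
        'Filesystem 1K-blocks Used Available Use% Mounted on' \
        '/dev/dm-0 999320 529020 401488 57% /'
}

# The first read drops the header; the second splits the data line into fields.
{ read -r _; read -r filesystem size using avail prct mountpoint; } < <(sample_df)

echo "$filesystem $using $prct"
```

Because the reads run in the current shell (process substitution rather than a pipe), the variables are still set after the block.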
Or
{ read -a head;varnames=(${head[@]//[K1% -]});
read ${varnames[@],,} ; } < <(LANG=C df -k /)
Then:
declare -p varnames ${varnames[@],,}
declare -a varnames=([0]="Filesystem" [1]="blocks" [2]="Used" [3]="Available" [4]="Use" [5]="Mounted" [6]="on")
declare -- filesystem="/dev/dm-0"
declare -- blocks="999320"
declare -- used="529020"
declare -- available="401488"
declare -- use="57%"
declare -- mounted="/"
declare -- on=""
Or even:
{ read _ ; read filesystem dsk[{6,2,9}] prct mountpoint ; } < <(df -k /)
declare -p mountpoint dsk
declare -- mountpoint="/"
declare -a dsk=([2]="529020" [6]="999320" [9]="401488")
(Note Used and Blocks is switched there: read ... dsk[6] dsk[2] dsk[9] ...)
... will work with associative arrays too: read _ disk[total] disk[used] ...
Other related sample: Parsing xrandr output: and end of Firefox tab by bash in a size of x% of display size? or at AskUbuntu.com Parsing xrandr output
Dedicated fd using unnamed fifo:
There is an elegant way! In this sample, I will read the /etc/passwd file:
users=()
while IFS=: read -u $list user pass uid gid name home bin ;do
((uid>=500)) &&
printf -v users[uid] "%11d %7d %-20s %s\n" $uid $gid $user $home
done {list}</etc/passwd
Reading this way (... read -u $list; ... {list}<inputfile) leaves STDIN free for other purposes, like user interaction.
Then
echo -n "${users[@]}"
1000 1000 user /home/user
...
65534 65534 nobody /nonexistent
and
echo ${!users[@]}
1000 ... 65534
echo -n "${users[1000]}"
1000 1000 user /home/user
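The dedicated-descriptor technique can be tried without reading the real /etc/passwd. This is a sketch using made-up passwd-style records in a here-document; it assumes bash 4.1 or later for the {list} syntax:

```shell
#!/bin/bash
# Needs bash 4.1 or later for the {list} redirection syntax.
names=()
while IFS=: read -r -u "$list" user _ uid _; do
    if (( uid >= 1000 )); then
        names+=("$user")
    fi
done {list}<<'EOF'
root:x:0:0
alice:x:1000:1000
bob:x:1001:1001
EOF
exec {list}<&-   # close the dedicated descriptor once done

echo "${names[*]}"
```

Since the loop reads from the descriptor stored in $list, a plain read inside the loop body would still read from the terminal or whatever STDIN is connected to.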
This could be used with static files or even /dev/tcp/xx.xx.xx.xx/yyy with x for ip address or hostname and y for port number or with the output of a command:
{
read -u $list -a head # read header in array `head`
varnames=(${head[@]//[K1% -]}) # drop illegal chars for variable names
while read -u $list ${varnames[@],,} ;do
((pct=available*100/(available+used),pct<10)) &&
printf "WARN: FS: %-20s on %-14s %3d <10 (Total: %11u, Use: %7s)\n" \
"${filesystem#*/mapper/}" "$mounted" $pct $blocks "$use"
done
} {list}< <(LANG=C df -k)
And of course with inline documents:
while IFS=\; read -u $list -a myvar ;do
echo ${myvar[2]}
done {list}<<"eof"
foo;bar;baz
alice;bob;charlie
$cherry;$strawberry;$memberberries
eof
Practical sample parsing CSV files:
As this answer is loong enough, for this paragraph,
I just will let you refer to
this answer to How to parse a CSV file in Bash?, I read a file by using an unnamed fifo, using syntax like:
exec {FD}<"$file" # open unnamed fifo for read
IFS=';' read -ru $FD -a headline
while IFS=';' read -ru $FD -a row ;do ...
... But using bash loadable CSV module.
On my website, you may find the same script, reading CSV as inline document.
Sample function for populating some variables:
#!/bin/bash
declare free=0 total=0 used=0 mpnt='??'
getDiskStat() {
{
read _
read _ total used free _ mpnt
} < <(
df -k ${1:-/}
)
}
getDiskStat $1
echo "$mpnt: Tot:$total, used: $used, free: $free."
Nota: declare line is not required, just for readability.
About sudo cmd | grep ... | cut ...
shell=$(cat /etc/passwd | grep $USER | cut -d : -f 7)
echo $shell
/bin/bash
(Please avoid useless cat!) This is just one fork less:
shell=$(grep $USER </etc/passwd | cut -d : -f 7)
Every pipe (|) implies a fork, where another process has to be run, accessing the disk, making library calls and so on.
So using sed, for example, limits the subprocesses to a single fork:
shell=$(sed </etc/passwd "s/^$USER:.*://p;d")
echo $shell
And with Bashisms:
But for many actions, mostly on small files, Bash could do the job itself:
while IFS=: read -a line ; do
[ "$line" = "$USER" ] && shell=${line[6]}
done </etc/passwd
echo $shell
/bin/bash
or
while IFS=: read loginname encpass uid gid fullname home shell;do
[ "$loginname" = "$USER" ] && break
done </etc/passwd
echo $shell $loginname ...
Going further about variable splitting...
Have a look at my answer to How do I split a string on a delimiter in Bash?
Alternative: reducing forks by using backgrounded long-running tasks
In order to prevent multiple forks like
myPi=$(bc -l <<<'4*a(1)')
myRay=12
myCirc=$(bc -l <<<" 2 * $myPi * $myRay ")
or
myStarted=$(date -d "$(ps ho lstart 1)" +%s)
mySessStart=$(date -d "$(ps ho lstart $$)" +%s)
This works fine, but running many forks is heavy and slow.
And commands like date and bc can process many operations, line by line!
See:
bc -l <<<$'3*4\n5*6'
12
30
date -f - +%s < <(ps ho lstart 1 $$)
1516030449
1517853288
So we could use a long running background process to make many jobs, without having to initiate a new fork for each request.
You could have a look how reducing forks make Mandelbrot bash, improve from more than eight hours to less than 5 seconds.
Under bash, there is a built-in function: coproc:
coproc bc -l
echo 4*3 >&${COPROC[1]}
read -u $COPROC answer
echo $answer
12
echo >&${COPROC[1]} 'pi=4*a(1)'
ray=42.0
printf >&${COPROC[1]} '2*pi*%s\n' $ray
read -u $COPROC answer
echo $answer
263.89378290154263202896
printf >&${COPROC[1]} 'pi*%s^2\n' $ray
read -u $COPROC answer
echo $answer
5541.76944093239527260816
As bc is already running in the background and its I/O is ready too, there is no delay: nothing to load, open or close before or after each operation. Only the operation itself! This is a lot quicker than forking a new bc for each operation!
Side effect: while bc stays running, it holds all its registers, so some variables or functions can be defined at an initialisation step, as the first write to ${COPROC[1]}, just after starting the task (via coproc).
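The request/answer pattern can also be tried without bc. In this sketch the coprocess is a plain bash loop named SQUARE (a made-up name) that squares each number it receives, so that the example does not depend on any external tool:

```shell
#!/bin/bash
# Start one long-running background worker: it squares each number it reads.
coproc SQUARE { while read -r n; do echo $((n * n)); done; }

results=()
for n in 3 4 5; do
    echo "$n" >&"${SQUARE[1]}"          # send a request to the worker
    read -r -u "${SQUARE[0]}" answer    # read its reply
    results+=("$answer")
done

# Closing the write end lets the worker see end-of-file and exit.
eval "exec ${SQUARE[1]}>&-"

echo "${results[*]}"
```

Three requests are answered by a single background process; nothing new is forked inside the for loop.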
Into a function newConnector
You may find my newConnector function on GitHub.com or on my own site. (Note: on GitHub there are two files; on my site, the function and demo are bundled into one single file, which can be sourced for use or just run for the demo.)
Sample:
source shell_connector.sh
tty
/dev/pts/20
ps --tty pts/20 fw
PID TTY STAT TIME COMMAND
29019 pts/20 Ss 0:00 bash
30745 pts/20 R+ 0:00 \_ ps --tty pts/20 fw
newConnector /usr/bin/bc "-l" '3*4' 12
ps --tty pts/20 fw
PID TTY STAT TIME COMMAND
29019 pts/20 Ss 0:00 bash
30944 pts/20 S 0:00 \_ /usr/bin/bc -l
30952 pts/20 R+ 0:00 \_ ps --tty pts/20 fw
declare -p PI
bash: declare: PI: not found
myBc '4*a(1)' PI
declare -p PI
declare -- PI="3.14159265358979323844"
The function myBc lets you use the background task with simple syntax.
Then for date:
newConnector /bin/date '-f - +%s' @0 0
myDate '2000-01-01'
946681200
myDate "$(ps ho lstart 1)" boottime
myDate now now
read utm idl </proc/uptime
myBc "$now-$boottime" uptime
printf "%s\n" ${utm%%.*} $uptime
42134906
42134906
ps --tty pts/20 fw
PID TTY STAT TIME COMMAND
29019 pts/20 Ss 0:00 bash
30944 pts/20 S 0:00 \_ /usr/bin/bc -l
32615 pts/20 S 0:00 \_ /bin/date -f - +%s
3162 pts/20 R+ 0:00 \_ ps --tty pts/20 fw
From there, if you want to end one of background processes, you just have to close its fd:
eval "exec $DATEOUT>&-"
eval "exec $DATEIN>&-"
ps --tty pts/20 fw
PID TTY STAT TIME COMMAND
4936 pts/20 Ss 0:00 bash
5256 pts/20 S 0:00 \_ /usr/bin/bc -l
6358 pts/20 R+ 0:00 \_ ps --tty pts/20 fw
which is not needed, because all fd close when the main process finishes.
As others have already indicated, you should use `backticks`.
The proposed alternative, $(command), works as well and is also easier to read. It is specified by POSIX and supported by Bash, KornShell and other modern shells, but not by the original Bourne shell still found as /bin/sh on some old Unix systems,
so if your scripts have to be really portable across such systems, you should prefer the old backticks notation.
I know three ways to do it:
Functions are suitable for such tasks:
func (){
ls -l
}
Invoke it by saying func.
Also another suitable solution could be eval:
var="ls -l"
eval $var
The third one is using variables directly:
var=$(ls -l)
OR
var=`ls -l`
You can get the output of the third solution in a good way:
echo "$var"
And also in a nasty way:
echo $var
Just to be different:
MOREF=$(sudo run command against $VAR1 | grep name | cut -c7-)
When setting a variable make sure you have no spaces before and/or after the = sign. I literally spent an hour trying to figure this out, trying all kinds of solutions! This is not cool.
Correct:
WTFF=`echo "stuff"`
echo "Example: $WTFF"
Will Fail with error "stuff: not found" or similar
WTFF= `echo "stuff"`
echo "Example: $WTFF"
If you want to do it with multiline/multiple command/s then you can do this:
output=$( bash <<EOF
# Multiline/multiple command/s
EOF
)
Or:
output=$(
# Multiline/multiple command/s
)
Example:
#!/bin/bash
output="$( bash <<EOF
echo first
echo second
echo third
EOF
)"
echo "$output"
Output:
first
second
third
Using heredoc, you can simplify things pretty easily by breaking down your long single line code into a multiline one. Another example:
output="$( ssh -p $port $user@$domain <<EOF
# Breakdown your long ssh command into multiline here.
EOF
)"
You need to use either
$(command-here)
or
`command-here`
Example
#!/bin/bash
VAR1="$1"
VAR2="$2"
MOREF="$(sudo run command against "$VAR1" | grep name | cut -c7-)"
echo "$MOREF"
If the command that you are trying to execute fails, it would write the output onto the error stream and would then be printed out to the console.
To avoid it, you must redirect the error stream:
result=$(ls -l something_that_does_not_exist 2>&1)
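A short sketch of the difference, using a path that is assumed not to exist; it also keeps the exit status, which a plain assignment preserves:

```shell
#!/bin/bash
missing=/nonexistent_path_for_demo   # assumed not to exist

# Only stdout is captured here; stderr is discarded to keep the demo quiet.
stdout_only=$(ls -l "$missing" 2>/dev/null)

# With 2>&1, the error message ends up in the variable too.
result=$(ls -l "$missing" 2>&1)
status=$?   # exit status of ls, preserved by the plain assignment

echo "status=$status"
echo "stdout only: '$stdout_only'"
echo "captured: $result"
```

Checking $status right after the assignment is a simple way to tell a failed command from one that just printed nothing.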
This is another way and is good to use with some text editors that are unable to correctly highlight every intricate code you create:
read -r -d '' str < <(cat somefile.txt)
echo "${#str}"
echo "$str"
You can use backticks (also known as accent graves) or $().
Like:
OUTPUT=$(x+2);
OUTPUT=`x+2`;
Both have the same effect, but OUTPUT=$(x+2) is more readable and is the newer syntax.
Here are two more ways:
Please keep in mind that space is very important in Bash. So, if you want your command to run, use as is without introducing any more spaces.
The following assigns harshil to L and then prints it
L=$"harshil"
echo "$L"
The following assigns the output of the command tr to L2. tr is being operated on another variable, L1.
L2=$(echo "$L1" | tr '[:upper:]' '[:lower:]')
macOS still ships an old Bash version, e.g. GNU bash, version 3.2.57(1)-release (arm64-apple-darwin21). In this case, one can use:
new_variable="$(some_command)"
A concrete example:
newvar="$(echo $var | tr -d '123')"
Note the (), instead of the usual {} in Bash 4.
Some may find this useful.
Integer values in variable substitution, where the trick is using $(()) double brackets:
N=3
M=3
COUNT=$N-1
ARR[0]=3
ARR[1]=2
ARR[2]=4
ARR[3]=1
while (( COUNT < ${#ARR[@]} ))
do
ARR[$COUNT]=$((ARR[COUNT]*M))
(( COUNT=$COUNT+$N ))
done
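Note that a plain assignment such as COUNT=$N-1 stores text, not a number; it only behaves like a number because (( )) evaluates it as an expression. A small sketch of the difference (the variable names literal and computed are illustrative):

```shell
#!/bin/bash
N=3
literal=$N-1          # plain assignment: stores the string "3-1"
computed=$((N - 1))   # arithmetic expansion: stores 2

echo "$literal $computed"

# Inside (( )) the string is evaluated as an expression, so both compare equal.
(( literal == computed )) && echo "same value in arithmetic context"
```

Using $(( )) at assignment time avoids surprises when the variable is later used outside an arithmetic context, for example in echo or a string comparison.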

JQ can't parse \u2022 character

I'm trying to perform a bulk upload to Elasticsearch (around 1mln documents). In order to do that, I'm using jq to reformat the JSON file extracted from MySQL database and curl to post the data to Elasticsearch:
cat dataset.json | jq -r -c '.[] | { "index" : { } }, .' | curl -u login:password -H "Content-Type: application/json" -XPOST "https://.../skills/default/_bulk?pretty" --data-binary @-
I get an error:
parse error: Invalid string: control characters from U+0000 through U+001F must be escaped at line 276249, column 317
I found that the character jq can't parse is \u2022. I tried adding the "-r" option to the jq command, but the error still occurs. How can I handle this for all occurrences of \u2022?
Here's verification that \u2022 is properly handled by various versions of jq in a Mac environment:
$ echo '"\u2022"' | jq-1.4 .
"•"
$ echo '"•"' | jq-1.6 .
"•"
$ echo '"•"' | jq-1.5 .
"•"
$ echo '"•"' | jq-1.4 .
"•"
$
Perhaps the problem is related to a bug that was fixed since the release of jq 1.5 (see e.g. https://github.com/stedolan/jq/issues/1311).
If you are having difficulties with jq version 1.6 (the current version), please provide a minimal complete verifiable example
with further details about the computing environment.

Bash - How to extract JSON from a web page?

I'm trying to extract JSON from this URL: here
The output that I want is like this: https://pastebin.com/BVzUrk6s. Sorry, I can't paste it here because of the Stack Overflow character limit.
Here is what I have tried:
curl 'https://www.lazada.co.id/-i160040703-s181911730.html?spm=a2o4j.order_details.details_title.1.52ec6664luQAQs&urlFlag=true&mp=1' | grep -Poz '(?<=app.run\()(.*\n)*.*(?=\);)'
But that command still doesn't extract the JSON data. How do I solve this ? I want to use a pure bash script without installing any programs to do this if possible.
It's a Bad Idea (TM) to attempt JSON parsing this way.
It seems like a Good Idea (TM) to find out what is possible regardless.
#!/bin/bash
function parseUrl() {
local url=$1
echo '"childCategories": ['
curl --silent ${url} \
| awk '/<script type="text" class=J_data/ { show=1 } show; /<\/script>/ { show=0 }' \
| egrep -v "script" \
| sed -e 's/]//g' -e 's/\[//g' -e 's/{"childCategoryName":"","childCategoryUrl":""},//g' -e 's/}$/},/g' \
| sed -e 's/,{/,\'$'\n{/g' -e 's/^[ ]*//g' -e 's/{/ {/g' \
| sed -e 's/childCategoryName/name/g' -e 's/childCategoryUrl/url/g'
echo ' ]'
}
parseUrl 'https://www.lazada.co.id/-i160040703-s181911730.html?spm=a2o4j.order_details.details_title.1.52ec6664luQAQs&urlFlag=true&mp=1' \
| tee /tmp/extracted.json
So there you go: curl, awk, egrep, sed. Use at your own risk.
Code like this isn't extensible, meaning you can't extract nested JSON easily.
It is quite brittle, meaning if someone changes the layout or even CSS, it's bye-bye data extraction.

Grepping a word buried in a <p> on a website

I am having trouble grepping a word on a website. This is the command I'm using
wget -q http://bcbioinformaticsgrad.ca/our-faculty/james-piret/ | grep 'medical'
which is returning nothing, when it should be returning
[name of the website]:Many recent developments in biological and medical
.
.
.
.
.
.
The overall goal of what I'm trying to do is find a certain word within all the links of the website
My script is written like this
#!/bin/bash
#$1 is the parent website
#This pipeline obtains all the links located on a website
wget -qO- $1 | grep -Eoi '<a [^>]+>' | grep -Eo 'href="[^\"]+"' | cut -c 7- | rev | cut -c 2- | rev > .linksLocated
#$2 is the word being looked for
#This loop goes though every link and tries to locate a word
while IFS='' read -r line || [[ -n "$line" ]]; do
wget -q $line | grep "$2"
done < .linksLocated
#rm .linksLocated
Wget doesn't put the downloaded file to standard output, so it's trying to grep the word from nothing (since you added the -q flag).
Add -O - to print the page to stdout:
wget -q http://bcbioinformaticsgrad.ca/our-faculty/james-piret/ -O - | grep 'medical'
I see you used it with the first wget in your script, so just add it to the second one, too.
It's also possible to use curl, which does that by default, without any parameters:
curl http://bcbioinformaticsgrad.ca/our-faculty/james-piret/ | grep 'medical'
Edit: this tool is super useful when you actually need to select certain HTML elements in the downloaded page, might suit some use cases better than grep: https://github.com/ericchiang/pup
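The link-extraction step of the script can be tested offline by feeding it a literal HTML string instead of a wget download. A minimal sketch, in which the HTML snippet and the variable names are made up, and cut -d '"' -f 2 is swapped in for the rev/cut combination:

```shell
#!/bin/bash
# Made-up HTML standing in for the output of `wget -qO- "$1"`.
html='<p><a class="x" href="http://example.com/a">A</a> and <a href="/b">B</a></p>'

# Keep the <a ...> tags, then the href attributes, then the quoted value.
links=$(printf '%s\n' "$html" \
    | grep -Eoi '<a [^>]+>' \
    | grep -Eo 'href="[^"]+"' \
    | cut -d '"' -f 2)

printf '%s\n' "$links"
```

Relative links such as /b come out as they appear in the page, so they would need the site's base URL prepended before being passed to wget or curl.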

Store Codeship build ID in variable using GREP

I am trying to use grep to search JSON output. I used a curl command to return the data for a particular Codeship build, and I want to use grep to store the build ID value in a variable. However, after I run the command and echo the variable, it is blank.
Below are the commands:
export API_KEY=abc123
export PROJECT_ID=123456
export LAST_BUILD_ID=$(curl -s https://codeship.com/api/v1/projects/$PROJECT_ID.json?api_key=$API_KEY | grep -Eo '"builds":\[{"id":\d+' | grep -Eo --color=never '\d+' | tail -1)
export LAST_BUILD_URL=$(echo "https://codeship.com/api/v1/builds/$LAST_BUILD_ID/restart.json?api_key=$API_KEY")
My response: never use grep or regex to parse JSON.
Instead, use a proper JSON parser.
In the shell, take a look at jq.
Example (adapt it a bit):
#!/bin/bash
API_KEY=abc123
PROJECT_ID=123456
html=$(curl -s https://codeship.com/api/v1/projects/$PROJECT_ID.json?api_key=$API_KEY)
LAST_BUILD_ID=$(jq '.builds | .[] | .never' <<< "$html") # just guessing
LAST_BUILD_URL=$(echo "https://codeship.com/api/v1/builds/$LAST_BUILD_ID/restart.json?api_key=$API_KEY")
Note
If you provide the JSON, I will be able to be more specific with the jq command