How to get ksh to read null fields - tabs

I have a tab-delimited file in which some fields may contain no data. In ksh, though, 'read' treats multiple consecutive tabs as a single delimiter. Is there any way to change that behavior so that I can have blank fields too? I.e. when encountering 2 tabs, it would treat them as enclosing a null field? Or do I have to use awk?
# where <TAB> would be a real tab:
while IFS="<TAB>" read a b c d; do echo $c; done < file.txt
cf.
awk -F"\t" '{print $3}' file.txt
The shell version will output the wrong field if the 1st or 2nd field is blank.

It is indeed possible to use modern Korn Shell natively to treat each tab char as a column delimiter such that multiple consecutive tabs will delimit null fields without sed, awk, or perl.
The trick is to set the IFS variable to 2 consecutive tab chars, like so:
IFS=$'\t\t'
The while loop in the following code will read a tab-separated-values file, putting the fields of each line into a simple indexed array.
The inner for loop simply prints out what it has read, one field per line of output:
typeset -a Cols
while IFS=$'\t\t' read -A Cols
do
for (( i=0 ; i < ${#Cols[@]} ; i++ ))
do
print "Cols[$i] '${Cols[$i]}' "
done
done
And yes, this will also properly treat a line beginning with a tab char as having a null value for column 1, i.e. in the above Cols[0] would be set to null.
I have tested this on /bin/ksh 'AJM93u+ 2012-08-01' on macOS High Sierra
but it should work with AT&T AST open-sourced ksh versions going back 10 years or more. See also https://github.com/att/ast
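For shells where the doubled-tab IFS is not an option, one workaround is to translate each tab into a non-whitespace character before calling read, because non-whitespace IFS characters are not collapsed. The following is only a sketch: the function name third_field and the choice of the ASCII unit separator (octal 037) as stand-in delimiter are my assumptions, and it presumes that character never occurs in the data.

```shell
# Print the third field of one tab-separated line, keeping empty fields.
# Assumption: the unit separator (octal 037) never appears in the data.
third_field() {
    us=$(printf '\037')
    printf '%s\n' "$1" | tr '\t' '\037' | {
        # A non-whitespace IFS character delimits one field per occurrence.
        IFS=$us read -r a b c d
        printf '%s\n' "$c"
    }
}

tab=$(printf '\t')
third_field "one${tab}${tab}three${tab}four"   # second field is empty
third_field "${tab}two${tab}three${tab}four"   # first field is empty
```

Both calls print the third column rather than a shifted one.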

read skips leading IFS whitespace before it looks for the first field. Another demonstration of this problem is
echo " b c d e" | while read a b c d e; do echo c=$c; done
I'll keep on using a space as the IFS, just a bit easier to test.
Avoiding awk is possible with cut:
echo c=$(echo " b c d e" | cut -d" " -f3)
When you want to assign all fields in one run, cut will not get you there.
sed accepts several -e options and applies them in the order given.
You can get the fields by
eval $(echo " b c d e" |
sed -e 's/^/a=/' -e 's/ /;b=/' -e 's/ /;c=/' -e 's/ /;d=/' -e 's/ /;e=/')
echo check:
set | grep "^[a-e]="
Do you trust your input or do you prefer awk above sed?

Related

Find / Replace / Append JSON String in bashscript without using jq

I have a JSON string and need to extract the values in the square brackets with a bash script and validate them against the expected values. If the expected values exist, leave the string as it is; otherwise add the missing values inside the square brackets.
"hosts": ["unix://","tcp://0.0.0.0:2376"]
I cannot use jq.
Expected:
Verify that the values "unix://" and "tcp://0.0.0.0:2376" exist for the key "hosts". Add them if they don't exist.
I tried using like below,
$ echo "\"hosts\":[\"unix://\",\"tcp://0.0.0.0:2376\"]" | cut -d: -f2
["unix
$ echo "\"hosts\":[\"unix://\",\"tcp://0.0.0.0:2376\"]" | sed 's/:.*//'
"hosts"
I have tried multiple possibilities with sed & cut but cannot achieve what I expect. I'm a shell script beginner.
How can I achieve this with sed or cut ?
You need to detect the presence of "unix://" and "tcp://0.0.0.0:2376" in your string. You can do it like this:
#!/bin/bash
#
string='"hosts": ["unix://","tcp://0.0.0.0:2376"]'
check1=$(echo "$string" | grep -c "unix://")
check2=$(echo "$string" | grep -c "tcp://0.0.0.0:2376")
(( total = check1 + check2 ))
if [[ "$total" -eq 2 ]]
then
echo "they are both in, nothing to do"
else
echo "they are NOT both there, fix variable string"
string='"hosts": ["unix://","tcp://0.0.0.0:2376"]'
fi
grep -c counts how many times a specific string appears. In your case, both strings have to be found once, so adding them together will produce 0, 1 or 2. Only when it is equal to 2 is the string correct.
cut will extract some string based on a certain delimiter. But it is not typically used to verify if a string is in there, grep does that.
sed has many uses, such as replacing text (with 's///'). But again, grep is the tool that was built to detect strings in other strings (or files).
Now when it comes to adding text, you say that if one of "unix://" or "tcp://0.0.0.0:2376" is missing, add it. Well that comes back to redefining the whole string with the correct values, so just assign it.
Finally, if you think about it, you want to ensure that string is "hosts": ["unix://","tcp://0.0.0.0:2376"]. So there is no need to verify anything; just hardcode it at the start of your script. The end result will be the same.
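If you do keep the check, a slightly stricter variant is sketched here. The function name has_both_hosts is made up for illustration. It uses grep -F so that the dots in 0.0.0.0 are matched literally instead of as "any character", and -q so that only the exit status is used:

```shell
# Return 0 only if both expected host entries appear literally in $1.
# -F: treat the pattern as a fixed string; -q: no output, exit status only.
has_both_hosts() {
    printf '%s\n' "$1" | grep -qF '"unix://"' &&
    printf '%s\n' "$1" | grep -qF '"tcp://0.0.0.0:2376"'
}

string='"hosts": ["unix://","tcp://0.0.0.0:2376"]'
if has_both_hosts "$string"; then
    echo "both present"
else
    echo "missing entry"
fi
```

Including the surrounding double quotes in the patterns also avoids matching a longer value that merely contains "unix://".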
Part 2
If you MUST use cut, you could:
#!/bin/bash
#
string='"hosts": ["unix://","tcp://0.0.0.0:2376"]'
firstelement=$(echo "$string" | cut -d',' -f1 | cut -d'"' -f4)
echo $firstelement
# will display unix://
secondelement=$(echo "$string" | cut -d',' -f2 | cut -d'"' -f2)
echo $secondelement
# will display tcp://0.0.0.0:2376
Then you can use if statements to compare to your desired values. But note that this approach will fail if you do not have at least 2 elements in your text between the [ ]. Ex. ["unix://"] will fail cut -d',' since there is no ',' character in the string.
Part 3
If you MUST use sed:
#!/bin/bash
#
string='"hosts": ["unix://","tcp://0.0.0.0:2376"]'
firstelement=$(echo "$string" | sed 's/.*\["\(.*\)",".*/\1/')
echo "$firstelement"
# will output unix://
secondelement=$(echo "$string" | sed 's/.*","\(.*\)"\]/\1/')
echo $secondelement
# will output tcp://0.0.0.0:2376
Again here, the main character to work with is the ,.
firstelement explanation
sed 's/.*\["\(.*\)",".*/\1/'
.* anything...
\[" followed by [ and ". Since [ means something to sed, you have to \ it
\(.*\) followed by anything at all (. matches any character, * matches any number of these characters).
"," followed by ",". This only happens for the first element.
.* followed by anything
\1 keep only the characters enclosed between \( and \)
Similarly, for the second element the s/// is modified to keep only what follows ",", up to the last "] at the end of the string.
Again like with cut above, use if statements to verify if the extracted values are what you wanted.
Again, read my last comments in the first approach, you might not need all this...
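Both the cut and the sed variants above depend on there being exactly two elements. As a rough sketch that copes with any number of elements, you could strip everything outside the brackets and then print one element per line. The function name list_hosts is hypothetical, and the sketch assumes the values themselves contain no commas, quotes or brackets:

```shell
# Print each element of the bracketed list on its own line, unquoted.
# Assumption: values contain no commas, double quotes or square brackets.
list_hosts() {
    printf '%s\n' "$1" |
        sed 's/.*\[\(.*\)\].*/\1/' |
        tr ',' '\n' |
        tr -d '"'
}

list_hosts '"hosts": ["unix://","tcp://0.0.0.0:2376"]'
```

The output can then be checked line by line, for example with grep -Fx, which only matches whole lines.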

Similar strings, different results

I'm creating a Bash script to parse the air pollution levels from the webpage:
http://aqicn.org/city/beijing/m/
There is a lot of stuff in the file, but this is the relevant bit:
"iaqi":[{"p":"pm25","v":[59,21,112],"i":"Beijing pm25 (fine
particulate matter) measured by U.S Embassy Beijing Air Quality
Monitor
(\u7f8e\u56fd\u9a7b\u5317\u4eac\u5927\u4f7f\u9986\u7a7a\u6c14\u8d28\u91cf\u76d1\u6d4b).
Values are converted from \u00b5g/m3 to AQI levels using the EPA
standard."},{"p":"pm10","v":[15,5,69],"i":"Beijing pm10
(respirable particulate matter) measured by Beijing Environmental
Protection Monitoring Center
I want the script to parse and display 2 numbers: the current PM2.5 and PM10 levels (59 and 15 in the excerpt above, the first number in each "v" array).
CITY="beijing"
AQIDATA=$(wget -q http://aqicn.org/city/$CITY/m/ -O -)
PM25=$(awk -v FS="(\"p\":\"pm25\",\"v\":\\\[|,[0-9]+)" '{print $2}' <<< $AQIDATA)
PM100=$(awk -v FS="(\"p\":\"pm10\",\"v\":\\\[|,[0-9]+)" '{print $2}' <<< $AQIDATA)
echo $PM25 $PM100
Even though I can get PM2.5 levels to display correctly, I cannot get PM10 levels to display. I cannot understand why, because the strings are similar.
Anyone here able to explain?
The following approach is based on two steps:
(1) Extracting the relevant JSON;
(2) Extracting the relevant information from the JSON using a JSON-aware tool -- here jq.
(1) Ideally, the web service would provide a JSON API that would allow one to obtain the JSON directly, but as the URL you have is intended for viewing with a browser, some form of screen-scraping is needed. There is a certain amount of brittleness to such an approach, so here I'll just provide something that currently works:
wget -O - http://aqicn.org/city/beijing/m |
gawk 'BEGIN{RS="function"}
$1 ~/getAqiModel/ {
sub(/.*var model=/,"");
sub(/;return model;}/,"");
print}'
(gawk or an awk that supports multi-character RS can be used; if you have another awk, then first split on "function", using e.g.:
sed $'s/function/\\\n/g' # three backslashes )
The output of the above can be piped to the following jq command, which performs the filtering envisioned in (2) above.
(2)
jq -c '.iaqi | .[]
| select(.p? =="pm25" or .p? =="pm10") | [.p, .v[0]]'
The result:
["pm25",59]
["pm10",15]
I think your problem is that you have a single line HTML file that contains a script that contains a variable that contains the data you are looking for.
Your field delimiters are either "p":"pm10","v":[ or a comma followed by some digits.
For pm25 this works, because it is the first, and there are no occurrences of ,21 or something similar before it.
However, for pm10, there are some that are associated with pm25 ahead of it. So the second field contains the empty string between ,21 and ,112
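To see this for yourself, you can print every field awk produces. This is only an illustration: sample is a shortened stand-in for the real page (the long "i" descriptions are replaced by x and y), and show_fields is a made-up helper name.

```shell
# Shortened stand-in for the relevant part of the page (assumption).
sample='"iaqi":[{"p":"pm25","v":[59,21,112],"i":"x"},{"p":"pm10","v":[15,5,69],"i":"y"}]'

# Print each field with its index, using the same separator as the question:
# either "p":"<key>","v":[ or a comma followed by digits.
show_fields() {
    printf '%s\n' "$sample" |
        awk -v key="$1" '
            BEGIN { FS = "(\"p\":\"" key "\",\"v\":\\[|,[0-9]+)" }
            { for (i = 1; i <= NF; i++) printf "%d=<%s>\n", i, $i }
        '
}

show_fields pm25
show_fields pm10
```

For pm25, field 2 is 59. For pm10, field 2 is the empty string between ,21 and ,112, and the wanted 15 only shows up as field 4.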
@karakfa has a hack that seems to work -- but he doesn't explain very well why it works.
What he does is use awk's record separator (which is usually a newline) and sets it to either of :, ,, or [. So in your case, one of the records would be "pm25", because it is preceded by a colon, which is a separator, and succeeded by a comma, also a separator.
Once it hits the matching content ("pm25") it sets a counter to 4. Then, for this and the next records, it counts this counter down. "pm25" itself, "v", the empty string between : and [, and finally reaches one when hitting the record with the number you want to output: 4 && ! 3 is false, 3 && ! 2 is false, 2 && ! 1 is false, but 1 && ! 0 is true. Since there is no execution block, awk simply prints this record, which is the value you want.
A more robust approach would probably be to use xpath to find the script, then use a JSON parser or similar to get the value.
chw21's helpful answer explains why your approach didn't work.
peak's helpful answer is the most robust, because it employs proper JSON parsing.
If you don't want to or can't use third-party utility jq for JSON parsing, I suggest using sed rather than awk, because awk is not a good fit for field-based parsing of this data.
$ sed -E 's/^.*"pm25"[^[]+\[([0-9]+).+"pm10"[^[]+\[([0-9]+).*$/\1 \2/' <<< "$AQIDATA"
59 15
The above should work with both GNU and BSD/OSX sed.
To read the result into variables:
read pm25 pm10 < \
<(sed -E 's/^.*"pm25"[^[]+\[([0-9]+).+"pm10"[^[]+\[([0-9]+).*$/\1 \2/' <<< "$AQIDATA")
Note how I've chosen lowercase variable names, because it's best to avoid all upper-case variables in shell programming, so as to avoid conflicts with special shell and environment variables.
If you can't rely on the order of the values in the source string, use two separate sed commands:
pm25=$(sed -E 's/^.*"pm25"[^[]+\[([0-9]+).*$/\1/' <<< "$AQIDATA")
pm10=$(sed -E 's/^.*"pm10"[^[]+\[([0-9]+).*$/\1/' <<< "$AQIDATA")
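The two separate commands can be folded into one small helper. This is a sketch with a made-up function name, aqi_value. Unlike the commands above, it uses -n together with the p flag, so a key that is not found yields empty output instead of echoing the whole input back:

```shell
# Print the first number of the "v" array that follows the given key.
# $1: pollutant key (e.g. pm25), $2: the text to search.
aqi_value() {
    printf '%s\n' "$2" |
        sed -nE 's/^.*"'"$1"'"[^[]+\[([0-9]+).*$/\1/p'
}

# Shortened stand-in for the downloaded page (assumption).
sample='"iaqi":[{"p":"pm25","v":[59,21,112],"i":"x"},{"p":"pm10","v":[15,5,69],"i":"y"}]'
pm25=$(aqi_value pm25 "$sample")
pm10=$(aqi_value pm10 "$sample")
echo "$pm25 $pm10"
```

An empty result is then easy to test for with [ -z "$pm10" ].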
awk to the rescue!
If you have to, you can use this hacky way using smart counters with hand-crafted delimiters. Setting RS instead of FS transfers looping through fields to awk itself. Multi-char RS is not available for all awks (gawk supports it).
$ awk -v RS='[:,[]' '$0=="\"pm25\""{c=4} c&&!--c' file
59
$ awk -v RS='[:,[]' '$0=="\"pm10\""{c=4} c&&!--c' file
15

removing commas from numbers in CSV file

I have a file that has many columns and I only need two of those columns. I am getting the columns I need using
cut -f 2-3 -d, file1.csv > file2.csv
The issue I am having is that the first column is an ID, and once it gets past 999 it becomes 1,000 and so is treated as an extra column. I can't get rid of all commas because I need them to separate the data. Is there a way to use sed to remove only the commas that show up between digits 0-9?
I'd use a real CSV parser, and count backwards from the end of the line:
ruby -rcsv -ne '
row = $_.parse_csv
puts row[-5..-4].to_csv :force_quotes => true
' <<END
999,"someone@example.com","Doe, John","Doe","555-1212","address"
1,234,"email@email.com","name","lastname","phone","address"
END
"someone@example.com","Doe, John"
"email@email.com","name"
This works for the example in the comments:
awk -F'"?,"' '{print $2, $3}' file
The field separator is zero or one " followed by ,". This means that the comma in the first number doesn't count.
To separate the two fields with a comma instead of a space, you can change the OFS variable like this:
awk -F'"?,"' -v OFS=',' '{print $2, $3}' file
Or like this:
awk -F'"?,"' 'BEGIN{OFS=","}{print $2, $3}' file
Alternatively, if you want the quotes as well, you can use printf:
awk -F'"?,"' '{printf "\"%s\",\"%s\"\n", $2, $3}' file
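If you also want to keep the ID with its thousands separator stripped, the same field separator can be combined with gsub on the first field. This is a sketch under the same assumption as above, namely that every field after the ID is double-quoted; the function name clean_ids is made up:

```shell
# Read CSV lines on stdin; print the ID without commas plus the next two fields.
# Assumption: all fields after the leading ID are double-quoted.
clean_ids() {
    awk -F'"?,"' '{
        gsub(/,/, "", $1)    # 1,234 becomes 1234
        printf "%s,\"%s\",\"%s\"\n", $1, $2, $3
    }'
}

printf '%s\n' \
    '999,"someone@example.com","Doe, John","Doe","555-1212","address"' \
    '1,234,"email@email.com","name","lastname","phone","address"' |
    clean_ids
```

The comma inside "Doe, John" survives because it is followed by a space, not by a double quote.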
From your comments, it sounds like there is a comma and a space (', ') pattern between tokens.
If this is the case, you can do this easily with sed. The strategy is to first replace all occurrences of ', ' (comma followed by a space) with some unique character sequence (like maybe ||).
's:, :||:g'
From there you can remove all commas:
's:,::g'
Finally, replace the double pipes with comma-space again.
's:||:, :g'
Putting it into one statement:
sed -i -e 's:, :||:g;s:,::g;s:||:, :g' your_odd_file.csv
And a command-line example to try before you buy:
bash$ sed -e 's:, :||:g;s:,::g;s:||:, :g' <<< "1,200,000, hello world, 123,456"
1200000, hello world, 123456
If you are in the unfortunate situation where there is not a space between fields in the CSV - you can attempt to 'fake it' by detecting changes in data type - like where there is a numeric field followed by a text field.
's:,\([^0-9]\):, \1:g' # numeric followed by non-numeric
's:\([^0-9]\),:\1, :g' # non-numeric field followed by something (anything)
You can put this all together into one statement, but you are venturing into dangerous waters here - this will definitely be a one-off solution and should be taken with a large grain of salt.
sed -e 's:,\([^0-9]\):, \1:g;s:\([^0-9]\),:\1, :g' \
-e 's:, :||:g;s:,::g;s:||:, :g' file1.csv > file2.csv
And another example:
bash$ sed -e 's:,\([^0-9]\):, \1:g;s:\([^0-9]\),:\1, :g' \
-e 's:, :||:g;s:,::g;s:||:, :g' <<< "1,200,000,hello world,123,456"
1200000, hello world, 123456

Select mysql query with bash

How can I run a MySQL select query from bash so that each column ends up in a separate array element?
I've tried the following command, but it only works if the content is one word. For example:
id=11, text=hello, important=1
If text contains a whole article, for instance, the code does not work properly. I guess I could use cut -f -d, but if "text" contains special characters it won't work either.
while read -ra line; do
id=$(echo "${line[1]}")
text=$(echo "${line[2]}")
important=$(echo "${line[3]}")
echo "id: $id"
echo "text: $text"
echo "important: $important"
done < <(mysql -e "${selectQ}" -u${user} -p${password} ${database} -h ${host})
Bash by default splits strings at any whitespace character. First you need an unambiguous column separator in the output; you can use mysql --batch to get tab-separated output.
From the MySQL man page:
--batch, -B
Print results using tab as the column separator, with each row on a new line. With this option, mysql does not use the history file.
Batch mode results in nontabular output format and escaping of special characters. Escaping may be disabled by using raw mode; see the description for the --raw option
You want the result to be escaped, so don't use --raw, otherwise a tab character in your result data will break the loop again.
To skip the first row (column names) you can use the option --skip-column-names in addition
Now you can walk through each line and split it by tab character.
You can force bash to split by tab only by overriding the IFS variable (Internal Field Separator) temporarily.
Example
# myread prevents collapsing of empty fields
myread() {
local input
IFS= read -r input || return $?
while (( $# > 1 )); do
IFS= read -r "$1" <<< "${input%%[$IFS]*}"
input="${input#*[$IFS]}"
shift
done
IFS= read -r "$1" <<< "$input"
}
# loop through the result rows
while IFS=$'\t' myread id name surname url created; do
echo "id: ${id}";
echo "name: ${name}";
echo "surname: ${surname}";
echo "url: ${url}";
echo "created: ${created}";
done < <(mysql --batch --skip-column-names -e "SELECT id, name, surname, url, created FROM users")
All credit for the myread function goes to this answer by Stefan Kriwanek.
Attention:
You need to be very careful with quotes and variable delimiters.
If you just echo $row[0] without the curly brackets, you will get the wrong result
EDIT
You still have a problem when a column returns an empty string, because the internal field separator matches any number of consecutive occurrences of the defined character:
row1\t\trow3 will create an array [row1,row3] instead of [row1,,row3]
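The collapse is easy to reproduce without a database. A minimal sketch (split3 is a made-up name) that splits one line on tabs with a plain read and shows what each variable received:

```shell
# Split one line on tabs into three variables and show what read stored.
split3() {
    tab=$(printf '\t')
    printf '%s\n' "$1" | {
        IFS=$tab read -r a b c
        printf 'a=<%s> b=<%s> c=<%s>\n' "$a" "$b" "$c"
    }
}

tab=$(printf '\t')
split3 "row1${tab}${tab}row3"
# prints: a=<row1> b=<row3> c=<>
```

The value that belongs in c lands in b, because the two adjacent tabs count as a single separator.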
I found a very nice approach to fix this, updated the example above.
Also, read can directly separate the input stream into variables.

Can aspell output line number and not offset in pipe mode?

Can aspell output line number and not offset in pipe mode for html and xml files? I can't read the file line by line because in this case aspell can't identify closed tag (if tag situated on the next line).
This will output all occurrences of misspelt words with line numbers:
# Get aspell output...
<my_document.txt aspell pipe list -d en_GB --personal=./aspell.ignore.txt |
# Process the aspell output...
grep '[a-zA-Z]\+ [0-9]\+ [0-9]\+' -oh | \
grep '[a-zA-Z]\+' -o | \
while read word; do grep -on "\<$word\>" my_document.txt; done
Where:
my_document.txt is your original document
en_GB is your primary dictionary choice (e.g. try en_US)
aspell.ignore.txt is an aspell personal dictionary (example below)
aspell_output.txt is the output of aspell in pipe mode (ispell style)
result.txt is a final results file
aspell.ignore.txt example:
personal_ws-1.1 en 500
foo
bar
example results.txt output (for an en_GB dictionary):
238:color
302:writeable
355:backends
433:dataonly
You can also print the whole line by changing the last grep -on into grep -n.
This is just an idea, I haven't really tried it yet (I'm on a windows machine :(). But maybe you could pipe the html file through head (with byte limit) and count newlines using grep to find your line number. It's neither efficient nor pretty, but it might just work.
cat icantspell.html | head -c <offset from aspell> | egrep -Uc "$"
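A slightly tidier sketch of the same idea, assuming you already have a byte position counted from the start of the whole file (whether aspell's offset can be used that way is something I have not verified). The function name line_of_byte is made up:

```shell
# Print the number of the line that contains the given byte.
# $1: file, $2: 1-based byte position counted from the start of the file.
line_of_byte() {
    # grep -c '' counts lines, including a final partial line.
    head -c "$2" "$1" | grep -c ''
}

# Usage (file name taken from the command above):
#   line_of_byte icantspell.html 1234
```

For a 0-based offset, pass offset + 1 as the position.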
I use the following script to perform spell-checking and to work-around the awkward output of aspell -a / ispell. At the same time, the script also works around the problem that ordinals like 2nd aren't recognized by aspell by simply ignoring everything that aspell reports which is not a word of its own.
#!/bin/bash
set +o pipefail
if [ -t 1 ] ; then
color="--color=always"
fi
! for file in "$@" ; do
<"$file" aspell pipe list -p ./dict --mode=html |
grep '[[:alpha:]]\+ [0-9]\+ [0-9]\+' -oh |
grep '[[:alpha:]]\+' -o |
while read word ; do
grep $color -n "\<$word\>" "$file"
done
done | grep .
You even get colored output if the stdout of the script is a terminal, and you get an exit status of 1 in case the script found spelling mistakes, otherwise the exit status of the script is 0.
Also, the script protects itself from pipefail, which is a somewhat popular option to set (e.g. in a Makefile) but doesn't work for this script. Last but not least, this script explicitly uses [[:alpha:]] instead of [a-zA-Z], which is less confusing when it also matches non-ASCII characters like German äöüÄÖÜß and others. [a-zA-Z] also does, but that comes as something of a surprise.
aspell pipe / aspell -a / ispell output one empty line for each input line (after reporting the errors of the line).
Demonstration printing the line number with awk:
$ aspell pipe < testFile.txt |
awk '/^$/ { countedLine=countedLine+1; print "#L=" countedLine; next; } //'
produces this output:
@(#) International Ispell Version 3.1.20 (but really Aspell 0.60.7-20110707)
& iinternational 7 0: international, Internationale, internationally, internationals, intentional, international's, Internationale's
#L=1
*
*
*
& reelly 22 11: Reilly, really, reel, rely, rally, relay, resell, retell, Riley, rel, regally, Riel, freely, real, rill, roll, reels, reply, Greeley, cruelly, reel's, Reilly's
#L=2
*
#L=3
*
*
& sometypo 18 8: some typo, some-typo, setup, sometime, someday, smote, meetup, smarty, stupor, Smetana, somatic, symmetry, mistype, smutty, smite, Sumter, smut, steppe
#L=4
with testFile.txt
iinternational
I say this reelly.
hello
here is sometypo.
(Still not as nice as hunspell -u (https://stackoverflow.com/a/10778071/4124767). But hunspell misses some command line options I like.)
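Building on the blank-line counting above, a small filter can reduce the pipe output to line:word pairs. This is a sketch; number_misspellings is a made-up name, and it assumes the ispell-style format in which lines starting with & carry suggestions and lines starting with # are misses without suggestions:

```shell
# Read aspell pipe output on stdin and print "line:word" per misspelling.
number_misspellings() {
    awk '
        /^$/    { line++; next }                        # blank line: one input line finished
        /^[&#]/ { printf "%d:%s\n", line + 1, $2 }      # & has suggestions, # has none
    '
}

# Usage, with the test file from above:
#   aspell pipe < testFile.txt | number_misspellings
```

The version header is skipped automatically because it starts with @.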
For others using aspell with one of the filter modes (tex, html, etc), here's a way to only print line numbers for misspelled words in the filtered text. So for example, it won't print misspellings in the comments.
ASPELL_ARGS="--mode=html --personal=./.aspell.en.pws"
for file in "$@"; do
for word in $(aspell $ASPELL_ARGS list < "$file" | sort -u); do
grep -no "\<$word\>" <(aspell $ASPELL_ARGS filter < "$file")
done | sort -n
done
This works because aspell filter does not delete empty lines. I realize this isn't using aspell pipe as requested by OP, but it's in the same spirit of making aspell print line numbers.