Have Perl regex match newline followed by tab in macOS

I have these two lines in a macOS .plist file:
<key>Disabled</key>
<true/>
I want to replace true with false but only if the preceding line equals <key>Disabled</key>.
I know the 1st line ends with \n and using Perl I can match/replace it with:
perl -pi -w -e 's/Disabled<\/key>\n/DISABLED<\/key>\n/g' file
Note: I don't want to change Disabled to DISABLED, this is just to show the match is working.
I know the 2nd line begins with \t and I can match/replace it with:
perl -pi -w -e 's/\t<true/\t<false/g' file
However, combining the 2 patterns doesn't match/replace anything:
perl -pi -w -e 's/Disabled<\/key>\n\t<true/Disabled<\/key>\n\t<false/g' file
I thought there might be a hidden character between the \n at the end of the 1st line and the \t at the beginning of the 2nd line, but I've tried this regex in BBEdit and it works perfectly:
Disabled</key>\n\t<true
Any help would be very much appreciated.

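It would appear my understanding of Perl is very limited.
After some further digging I now realise that, in the code I was using, Perl was reading the file line by line, so a pattern spanning two lines could never match. What I needed to do was instruct Perl to read the file in one go or, put another way, to "slurp" the file. Using -0 sets the input record separator to the NUL character, which a text file does not contain, so Perl reads the whole file as a single record; -0777 is the explicit slurp mode.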
My original non-working code:
perl -pi -w -e 's/Disabled<\/key>\n\t<true/Disabled<\/key>\n\t<false/g' file
My amended working code using -0:
perl -0pi -w -e 's/Disabled<\/key>\n\t<true/Disabled<\/key>\n\t<false/g' file
To take it one step further I replaced \n\t in the match pattern with (\s*) so as to match any space, tab or newline character zero or more times and \n\t in the replace pattern with $1:
perl -0pi -w -e 's/Disabled<\/key>(\s*)<true/Disabled<\/key>$1<false/g' file
This seems better for matching any number of whitespace characters that appear between the 2 lines.

Grep ignore special characters before applying regular expression

General
I am trying to recursively search through hundreds of JSON files under a specific directory for lines that match a specific regular expression.
grep -rh works great for searching recursively for specific lines. I am having a problem applying a regular expression with the search because all the lines in the JSON files begin with a " and end in either ", or ".
Example: If I want to apply a regular expression to get all the lines that begin with zxc I will not be able to do it because the lines actually begin with "zxc
Code
The following command would work if the lines had no " at the beginning.
/bin/grep -rh -E "^(zxc)" "/etc/json_dir/"
The following command works, but I do not want grep to get hundreds of thousands of lines from all the JSON files and then apply a regular expression afterwards.
/bin/grep -rh -E ".*" "/etc/json_dir/" | /bin/sed -e 's/^"//g' -e 's/,$//g' -e 's/"$//g' | /bin/grep -E "^(zxc)"
Question
Is there a way for grep to ignore the " character at the beginning and " and ", characters at the end of the lines before it applies a regular expression ?
If there's no way, is there a way to do it with some other bash command, perl, python or some other language.
You can go with awk, if I understand your question properly:
awk '{ gsub(/^"|",?$/, "") }  # strip the leading " and the trailing " or ",
/^WHAT/ { print }             # or any other processing
' **/*.json
Note: the **/* requires the globstar recursive globbing option in (modern) bash (shopt -s globstar).
See it in action at Ideone.
You can shorten it somewhat to:
awk '/^"?WHAT/' **/* # this executes the default printing action
But awk|sed|grep might not be the right tool to search JSON.
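If you would rather stay with grep, one possible sketch is to make the leading quote part of the pattern instead of stripping it first. The function name grep_json_prefix is hypothetical, and the prefix is inserted into the regular expression as-is, so it is assumed to contain no regex metacharacters.

```shell
#!/bin/sh
# Hypothetical helper: recursively print lines whose content starts with a
# given prefix, allowing optional leading whitespace and an optional quote.
# Assumes the prefix contains no regex metacharacters.
grep_json_prefix() {
    prefix=$1
    dir=$2
    grep -rhE "^[[:space:]]*\"?${prefix}" "$dir"
}

# Example call, using the directory from the question:
#   grep_json_prefix zxc /etc/json_dir/
```

This keeps everything in a single grep pass, so no intermediate sed stage has to see every line of every file.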

Fart.exe replace two carriage returns in a row

I have a CSV file where I'm trying to replace two carriage returns in a row with a single carriage return using Fart.exe. First off, is this possible? If so, the text within the CSV is laid out like the below, where "CRLF" is an actual carriage return.
,CRLF
CRLF
But I want it to be just this without the extra carriage return on the second line:
,CRLF
I thought I could just do the below but it won't work:
CALL "C:\tmp\fart.exe" -C "C:\tmp\myfile.csv" ,\r\n\r\n ,\r\n
I need to know what to change ,\r\n\r\n to in order to make this work. Any ideas how I could make this happen? Thanks!
As Squashman has suggested, you are simply trying to remove empty lines.
There is no need for a 3rd party tool to do this. You can simply use FINDSTR to discard empty lines:
findstr /v "^$" myFile.txt >myFile.txt.new
move /y myFile.txt.new *. >nul
However, this will only work if all the lines end with CRLF. If you have a unix formatted file that ends each line with LF, then it will not work.
A more robust option would be to use JREPL.BAT - a regular expression command line text processing utility.
jrepl "^$" "" /r 0 /f myFile.txt /o -
Be sure to use CALL JREPL if you put the command within a batch script.
FART processes one line at a time, and the CRLF is not considered to be part of the line. So you can't use a normal FART command to remove CRLF. If you really want to use FART, then you will need to use the -B binary mode. You also need to use -C to get support for the escape sequences.
I've never used FART, so I can't be sure - but I believe the following would work
call fart -B -C myFile.txt "\r\n\r\n" "\r\n"
If you have many consecutive empty lines, then you will need to run the FART command repeatedly until there are no more changes.

Find a string between 2 other strings in document

I have found a ton of solutions to do what I want, with only one exception.
I need to search a .html document and pull a string.
The line containing the string will look like this (1 line, no newlines)
<script type="text/javascript">g_initHeader(0);LiveSearch.attach(ge('oh2345v5ks'));var _ = g_items;_[60]={icon:'INV_Chest_Leather_09',name_enus:'Layered Tunic'};_[6076]={icon:'INV_Pants_11',name_enus:'Tapered Pants'};_[3070]={icon:'INV_Misc_Cape_01',name_enus:'Ensign Cloak'};</script>
The text I need to get is
INV_Chest_Leather_09
When I use awk, grep, and sed, I extract the data between icon:' and ',name_
The problem is, all three of these scripts scan the entire line and use the last occurring ',name_ thus I end up with
INV_Chest_Leather_09',name_enus:'Layered Tunic'};_[6076]={icon:'INV_Pants_11',name_enus:'Tapered Pants'};_[3070]={icon:'INV_Misc_Cape_01
Here's the last one I tried
grep -Po -m 1 "(?<=]={icon:').*(?=',name_)"
I've tried awk and sed too, and I don't really have a preference of which one to use.
So basically, I need to search the entire html file, find the first occurrence of icon:', extract the text right after it until the first occurrence after icon:' of ',name_.
With GNU awk for the 3rd arg to match():
$ awk 'match($0,/icon:\047([^\047]+)/,a){print a[1]}' file
INV_Chest_Leather_09
Simple perl approach:
perl -ne 'print "$1\n" if /\bicon:\047([^\047]+)/' file
The output:
INV_Chest_Leather_09
The .* in your regular expression is greedy, so it first matches to the end of the line and then backtracks only as far as the last ',name_. You could try replacing the .* with something like [^,]* (i.e. match anything except a comma):
grep -Po -m 1 "(?<=]={icon:')[^,]*(?=',name_)"
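For an awk without the GNU three-argument match(), a minimal sketch can use the standard two-argument form, which sets RSTART and RLENGTH to the leftmost match. The function name first_icon and the shortened sample line are illustrative only.

```shell
#!/bin/sh
# Hypothetical helper: print the text after the first icon:<quote> up to the
# next single quote, then stop. \047 is the octal escape for a single quote,
# which avoids quoting trouble inside the single-quoted awk program.
first_icon() {
    awk '
        match($0, /icon:\047[^\047]+/) {
            # RSTART points at "icon:", so skip those 5 characters plus the quote
            print substr($0, RSTART + 6, RLENGTH - 6)
            exit
        }
    ' "$@"
}

# Shortened sample of the line from the question:
printf '%s\n' "_[60]={icon:'INV_Chest_Leather_09',name_enus:'Layered Tunic'};_[6076]={icon:'INV_Pants_11',name_enus:'Tapered Pants'};" |
    first_icon
```

Because [^\047]+ cannot cross a single quote, the match ends at the first closing quote rather than the last one on the line.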

Updating files using AWK: Why do I get weird newline character after each replacement?

I have a .csv containing a few columns. One of those columns needs to be updated to the same number in ~1000 files. I'm trying to use AWK to edit each file, but I'm not getting the intended result.
What the original .csv looks like
heading_1,heading_2,heading_3,heading_4
a,b,c,1
d,e,f,1
g,h,i,1
j,k,m,1
I'm trying to update column 4 from 1 to 15.
awk '$4="15"' FS=, OFS=, file > update.csv
When I run this on a .csv generated in Excel, the result is a ^M (carriage return) character after the first line (which it updates to 15), and then it terminates and does not update any of the other lines.
It repeats the same mistake on each file when running through all files in a directory.
for file in *.csv; do awk '$4="15"' FS=, OFS=, "$file" > "${file}_updated.csv"; done
Alternatively, if someone has a better way to do this task, I'm open to suggestions.
Excel is generating the control-Ms, not awk. Run dos2unix or similar on your file before running awk on it.
Well, I couldn't reproduce your problem on my Linux machine, as writing 15 to the last column overwrites the \r (the ^M is actually 0x0D, or \r) before the newline \n, but you could always remove the \r first:
$ awk '{sub(/\r$/,"")} ...' file
I have had some issues with non-ASCII characters when a file was processed in a differing locale, for example a file with ISO-8859-1 encoding processed with GNU awk in a UTF-8 shell.
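Putting the two suggestions together, a sketch of one possible awk command strips the trailing carriage return and then rewrites column 4. The function name set_col4 is hypothetical, and leaving the header row unchanged is an assumption about what is wanted, not something the question states.

```shell
#!/bin/sh
# Hypothetical helper: set column 4 of a comma-separated file to a value,
# tolerating CRLF line endings such as those written by Excel.
set_col4() {
    value=$1
    shift
    awk -v val="$value" '
        BEGIN { FS = OFS = "," }
        { sub(/\r$/, "") }      # drop the trailing carriage return, if any
        FNR > 1 { $4 = val }    # assumption: leave each header row untouched
        { print }
    ' "$@"
}

# Sample input with CRLF line endings:
printf 'heading_1,heading_2,heading_3,heading_4\r\na,b,c,1\r\nd,e,f,1\r\n' |
    set_col4 15
```

The output uses plain LF line endings, so pipe it through unix2dos or similar if the files have to go back to Excel as CRLF.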

Can aspell output line number and not offset in pipe mode?

Can aspell output line numbers rather than offsets in pipe mode for HTML and XML files? I can't feed it the file line by line, because then aspell can't identify a closing tag if it sits on the next line.
This will output all occurrences of misspelt words with line numbers:
# Get aspell output...
<my_document.txt aspell pipe list -d en_GB --personal=./aspell.ignore.txt |
# Process the aspell output...
grep '[a-zA-Z]\+ [0-9]\+ [0-9]\+' -oh | \
grep '[a-zA-Z]\+' -o | \
while read word; do grep -on "\<$word\>" my_document.txt; done
Where:
my_document.txt is your original document
en_GB is your primary dictionary choice (e.g. try en_US)
aspell.ignore.txt is an aspell personal dictionary (example below)
aspell_output.txt is the output of aspell in pipe mode (ispell style)
result.txt is a final results file
aspell.ignore.txt example:
personal_ws-1.1 en 500
foo
bar
example results.txt output (for an en_GB dictionary):
238:color
302:writeable
355:backends
433:dataonly
You can also print the whole line by changing the last grep -on into grep -n.
This is just an idea, I haven't really tried it yet (I'm on a windows machine :(). But maybe you could pipe the html file through head (with byte limit) and count newlines using grep to find your line number. It's neither efficient nor pretty, but it might just work.
cat icantspell.html | head -c <offset from aspell> | egrep -Uc "$"
I use the following script to perform spell-checking and to work-around the awkward output of aspell -a / ispell. At the same time, the script also works around the problem that ordinals like 2nd aren't recognized by aspell by simply ignoring everything that aspell reports which is not a word of its own.
#!/bin/bash
set +o pipefail
# Colorize matches only when stdout is a terminal.
if [ -t 1 ] ; then
    color="--color=always"
fi
# "!" inverts the pipeline status: exit 1 if grep . saw any output, else 0.
! for file in "$@" ; do
    <"$file" aspell pipe list -p ./dict --mode=html |
    grep '[[:alpha:]]\+ [0-9]\+ [0-9]\+' -oh |
    grep '[[:alpha:]]\+' -o |
    while read word ; do
        grep $color -n "\<$word\>" "$file"
    done
done | grep .
You even get colored output if the stdout of the script is a terminal, and you get an exit status of 1 in case the script found spelling mistakes, otherwise the exit status of the script is 0.
Also, the script protects itself from pipefail, which is a somewhat popular option to set, e.g. in a Makefile, but doesn't work for this script. Last but not least, this script explicitly uses [[:alpha:]] instead of [a-zA-Z], which is less confusing when it also matches non-ASCII characters like German äöüÄÖÜß and others. [a-zA-Z] also does, but that comes as something of a surprise.
aspell pipe / aspell -a / ispell output one empty line for each input line (after reporting the errors of the line).
Demonstration printing the line number with awk:
$ aspell pipe < testFile.txt |
awk '/^$/ { countedLine=countedLine+1; print "#L=" countedLine; next; } //'
produces this output:
@(#) International Ispell Version 3.1.20 (but really Aspell 0.60.7-20110707)
& iinternational 7 0: international, Internationale, internationally, internationals, intentional, international's, Internationale's
#L=1
*
*
*
& reelly 22 11: Reilly, really, reel, rely, rally, relay, resell, retell, Riley, rel, regally, Riel, freely, real, rill, roll, reels, reply, Greeley, cruelly, reel's, Reilly's
#L=2
*
#L=3
*
*
& sometypo 18 8: some typo, some-typo, setup, sometime, someday, smote, meetup, smarty, stupor, Smetana, somatic, symmetry, mistype, smutty, smite, Sumter, smut, steppe
#L=4
with testFile.txt
iinternational
I say this reelly.
hello
here is sometypo.
(Still not as nice as hunspell -u (https://stackoverflow.com/a/10778071/4124767). But hunspell misses some command line options I like.)
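Building on the same empty-line counting, a small sketch could print only line:word pairs instead of annotating the raw output. The function name misspelled_with_lines is made up, and the script assumes ispell-style output in which "&" lines carry suggestions and "#" lines carry none; the input here is a hand-written sample, not a live aspell run.

```shell
#!/bin/sh
# Hypothetical helper: read ispell-style output on stdin and print
# LINE:WORD for every reported misspelling.
misspelled_with_lines() {
    awk '
        BEGIN { line = 1 }
        /^$/ { line++; next }             # an empty line ends one input line
        /^[&#] / { print line ":" $2 }    # "&" has suggestions, "#" has none
    '
}

# Sample shaped like the output shown above (suggestion lists shortened):
printf '%s\n' \
    '@(#) International Ispell Version 3.1.20 (but really Aspell 0.60.7-20110707)' \
    '& iinternational 7 0: international, Internationale' '' \
    '*' '*' '*' '& reelly 22 11: Reilly, really' '' \
    '*' '' \
    '*' '*' '& sometypo 18 8: some typo, some-typo' '' |
    misspelled_with_lines
```

In real use the function would sit at the end of a pipeline such as aspell pipe < testFile.txt | misspelled_with_lines.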
For others using aspell with one of the filter modes (tex, html, etc), here's a way to only print line numbers for misspelled words in the filtered text. So for example, it won't print misspellings in the comments.
ASPELL_ARGS="--mode=html --personal=./.aspell.en.pws"
for file in "$@"; do
    for word in $(aspell $ASPELL_ARGS list < "$file" | sort -u); do
        grep -no "\<$word\>" <(aspell $ASPELL_ARGS filter < "$file")
    done | sort -n
done
This works because aspell filter does not delete empty lines. I realize this isn't using aspell pipe as requested by OP, but it's in the same spirit of making aspell print line numbers.