Cut file by phrase in ubuntu terminal - html

I downloaded some website pages with the wget utility, but the HTML pages contain a lot of information I don't need. I want each file to contain only the text before the
</article> tag. I suspect this can be done with grep, but which parameters do I need? And how do I apply such a command to all files in a directory?

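Here's the script. It writes to a temporary file first, because redirecting straight back to $i would truncate the file before grep reads it:
for i in *.htm; do grep -i -B 9999 "</article>" "$i" > "$i.tmp" && mv "$i.tmp" "$i"; done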
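The same idea can be sketched with awk, which stops at the first matching line and has no 9999-line limit. This is a minimal sketch, not the poster's method: the function name cut_at_article is made up for illustration, and it assumes mktemp is available (it is on Ubuntu).

```shell
#!/bin/sh
# cut_at_article FILE...: keep each file only up to and including the first
# line that contains </article>. Hypothetical helper name.
cut_at_article() {
    for f in "$@"; do
        [ -f "$f" ] || continue
        # Work on a temporary copy so the input is never truncated while
        # it is still being read.
        tmp=$(mktemp) || return 1
        # Print every line; after printing the first line that contains
        # </article> (case-insensitive), stop reading.
        awk '{ print } tolower($0) ~ /<\/article>/ { exit }' "$f" > "$tmp" &&
            cat "$tmp" > "$f"
        rm -f "$tmp"
    done
}
```

To apply it to every page in the current directory, call cut_at_article *.htm. A file with no </article> line is left with its full content.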

Related

Grepping a word buried in a <p> on a website

I am having trouble grepping a word on a website. This is the command I'm using
wget -q http://bcbioinformaticsgrad.ca/our-faculty/james-piret/ | grep 'medical'
which is returning nothing, when it should be returning
[name of the website]:Many recent developments in biological and medical
.
.
.
.
.
.
The overall goal of what I'm trying to do is find a certain word within all the links of the website
My script is written like this
#!/bin/bash
#$1 is the parent website
#This pipeline obtains all the links located on a website
wget -qO- $1 | grep -Eoi '<a [^>]+>' | grep -Eo 'href="[^\"]+"' | cut -c 7- | rev | cut -c 2- | rev > .linksLocated
#$2 is the word being looked for
#This loop goes through every link and tries to locate a word
while IFS='' read -r line || [[ -n "$line" ]]; do
wget -q $line | grep "$2"
done < .linksLocated
#rm .linksLocated
By default, wget saves the downloaded page to a file instead of writing it to standard output, and the -q flag suppresses its messages, so grep receives no input at all.
Add -O - to print the page to stdout:
wget -q http://bcbioinformaticsgrad.ca/our-faculty/james-piret/ -O - | grep 'medical'
I see you used it with the first wget in your script, so just add it to the second one, too.
It's also possible to use curl, which does that by default, without any parameters:
curl http://bcbioinformaticsgrad.ca/our-faculty/james-piret/ | grep 'medical'
Edit: this tool is super useful when you actually need to select certain HTML elements in the downloaded page, might suit some use cases better than grep: https://github.com/ericchiang/pup
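Putting the fix into the whole script might look like the sketch below. The function names extract_hrefs and search_links are made up for illustration; the link extraction is the question's own grep pipeline with sed replacing the cut/rev steps. It assumes every href is an absolute URL, since relative links are not resolved here.

```shell
#!/bin/sh
# extract_hrefs: read HTML on stdin and print the value of each
# double-quoted href found inside an <a ...> tag, one per line.
extract_hrefs() {
    grep -Eoi '<a [^>]+>' | grep -Eo 'href="[^"]+"' |
        sed -e 's/^href="//' -e 's/"$//'
}

# search_links PARENT_URL WORD: fetch PARENT_URL, then grep WORD in every
# page it links to. Hypothetical wrapper around the question's loop.
search_links() {
    wget -qO- "$1" | extract_hrefs | while IFS= read -r link; do
        # -O- sends the page to stdout, so grep has something to read.
        wget -qO- "$link" | grep -- "$2" | while IFS= read -r hit; do
            # Prefix each matching line with the link it came from.
            printf '%s:%s\n' "$link" "$hit"
        done
    done
}
```

Piping the links straight into the loop also removes the need for the .linksLocated temporary file.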

Download all links to .zip files on entire website wget

Basically, I want to download all zip files on a given website using wget, and I'm having a hard time. I'm new to this, so please bear with me. The website DOES NOT have a page that lists all the zip files. Is there a way to have wget go through the entire site like a web crawler and download all the zip files? I've tried commands like:
1) wget -r -np -l 1 -A zip http://site/path/
2) wget -A zip -m -p -E -k -K -np http://site/path/
3) wget --no-clobber --convert-links --random-wait -r -p -E -e robots=off -U mozilla http://site/path/
Supposedly they search through the entire site, but I haven't been getting those results. Help or a pointer in the right direction would be very much appreciated!

Creating custom DVD for RHEL 7 with kickstart

I am trying to create a custom CD/DVD to deploy RHEL 7 with kickstart file. Here is what I did:
Edited isolinux.cfg (in the ISOLinux folder) and grub.cfg file (in the EFI\BOOT folder).
Created ISO using mkisofs.
But it is not working. Am I using correct files/method?
Edit the ISO image and put the ks.cfg file that you have created.
Preferably, put the ks.cfg file inside ks directory. More information can be found here.
You need to use the new command. Here is an example of what will work:
Add the kickstart file to your downloaded and extracted ISO.
Run this command in the area with the ISO and kickstart and point to another location to build the ISO:
genisoimage -r -v -V "OEL6 with KS for OVM Manager" -cache-inodes -J -l -b isolinux/isolinux.bin -c isolinux/boot.cat -no-emul-boot -boot-load-size 4 -boot-info-table -o OEL6U6_OVM_Manager.iso /var/www/html/Template/ISO/
I found the way to create custom DVD from the RHEL7 page.
Mount the downloaded image
mount -t iso9660 -o loop path/to/image.iso /mnt/iso
Create a working directory - a directory where you want to place the contents of the ISO image.
mkdir /tmp/ISO
Copy all contents of the mounted image to your new working directory. Make sure to use the -p option to preserve file and directory permissions and ownership.
cp -pRf /mnt/iso /tmp/ISO
Unmount the image.
umount /mnt/iso
Make sure your current working directory is the top-level directory of the extracted ISO image - e.g. /tmp/ISO/iso. Create the new ISO image using genisoimage:
genisoimage -U -r -v -T -J -joliet-long -V "RHEL-7.1 Server.x86_64" -Volset "RHEL-7.1 Server.x86_64" -A "RHEL-7.1 Server.x86_64" -b isolinux/isolinux.bin -c isolinux/boot.cat -no-emul-boot -boot-load-size 4 -boot-info-table -eltorito-alt-boot -e images/efiboot.img -no-emul-boot -o ../NEWISO.iso .
Hope this answer is helpful.
I am editing my answer due to the comments posted. Here is a more comprehensive solution:
(A) You need to create the ISO properly. I found helpful information in this URL.
Here is the line that I actually ended up with, for my MBR/UEFI ISO creation:
mkisofs -U -A "<Volume Header>" -V "RHEL-7.1 x86_64" -volset "RHEL-7.1 x86_64" -J -joliet-long -r -v -T -x ./lost+found -o ${OUTPUT}/${HOST}.iso -b isolinux/isolinux.bin -c isolinux/boot.cat -no-emul-boot -boot-load-size 4 -boot-info-table -eltorito-alt-boot -e images/efiboot.img -no-emul-boot -boot-load-size 18755 /dir/where/sources/for/ISO/are/located
Be careful with the -V parameter, as it has to match what the kernel has defined for inst.stage2. In the default grub.cfg included in the boot disk, it is configured to be "hd:LABEL=RHEL-7.1\x20x86_64" which matches with the settings above.
(B) You need the correct EFI setup for RHEL7. This has changed from RHEL6, where you could just use /EFI/BOOT/BOOTX64.conf; RHEL7 uses /EFI/BOOT/grub.cfg instead. The Red Hat manuals say to add the inst.ks= parameter to the kernel line. The grub.cfg in the /EFI/BOOT directory of the RHEL7 boot ISO actually uses the linuxefi directive instead of kernel; I would guess the two work the same way. If you are including the KS file on the CD, this should get you there.
Good Luck!
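Adding the inst.ks= parameter to both boot configurations can be scripted. The sketch below is only an illustration: the function name add_boot_param is made up, the value inst.ks=cdrom:/ks/ks.cfg is an assumed example that should be checked against the Red Hat documentation for your release, and it assumes the kernel arguments sit on lines starting with append (isolinux.cfg) or linuxefi (grub.cfg).

```shell
#!/bin/sh
# add_boot_param FILE PARAM: append PARAM to every line of FILE that starts
# with "append" or "linuxefi". Hypothetical helper name.
add_boot_param() {
    file=$1
    tmp=$(mktemp) || return 1
    # PARAM is passed through the environment so that awk does not expand
    # backslash sequences such as \x20 in a volume label.
    PARAM="$2" awk '
        /^[ \t]*(append|linuxefi)[ \t]/ { $0 = $0 " " ENVIRON["PARAM"] }
        { print }
    ' "$file" > "$tmp" && cat "$tmp" > "$file"
    rm -f "$tmp"
}

# Example, run from the top of the extracted ISO tree (assumed value):
#   add_boot_param isolinux/isolinux.cfg 'inst.ks=cdrom:/ks/ks.cfg'
#   add_boot_param EFI/BOOT/grub.cfg     'inst.ks=cdrom:/ks/ks.cfg'
```

Run it once per file before building the ISO; running it twice would append the parameter twice.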

How to download logs from child gears

I have OpenShift Enterprise 2.0 running in a multi-node setup. I am running a simple JBoss scaled app (3 gears, so HAProxy and 2 JBoss gears). I have used a pre_start_jbossews script in .openshift/action_hooks to configure verbose GC logging (with just gc.log as the file name). However, I can't figure out how to get the gc.log files from the gears running JBoss.
[Interestingly enough, there is an empty gc.log file in the head/parent gear (running HAProxy). Looks like there is a java process started there, that might be a bug.]
I tried to run
rhc scp <appname> download . jbossews/gc.log --gears
hoping that it would be implemented like the ssh --gears option, but it just tells me 'invalid option'. So my question is, how can I actually download logs from child gears?
I don't think that you can use RHC directly to get what you want.
That may require a Request for Enhancement to be filed against the rhc scp command.
File that here: https://github.com/openshift/rhc/issues
However you can use the following to find all of your GEARS:
rhc app show APP_NAME --gears | awk '{print $5}' | tail -n +3
From this list you can list all the logs for each gear that are part of that application.
for url in $(rhc app show APP_NAME --gears | awk '{print $5}' | tail -n +3); do for dir in $(ssh $url "ls -R | grep -i log.*:"); do echo -n $url:${dir%?}; echo; done; done
With that you can use simple scp commands to get the files you need from all of the gears:
for file_dir in $(for url in $(rhc app show APP_NAME --gears | awk '{print $5}' | tail -n +3); do for dir in $(ssh $url "ls -R | grep -i log.*:"); do echo -n $url:${dir%?}; echo; done; done); do scp "$file_dir/*" .; done
If you need to download any files, you can use an SFTP client like FileZilla, so you can copy files from the server.
I know it's been a while since the original question was posted, but I just bumped into the same issue today and found that you can use the scp command directly if you know the gear SSH URL:
scp local_file user@gear_ssh:remote_file
to upload a file to the gear, or
scp user@gear_ssh:remote_file local_file
to download from the gear.

How to extract .depot file on HPUX?

How can I extract a .depot file on HPUX?
The .depot file is a tarred directory structure, with some of the files gzipped under the same name as the original.
Note that my environment is quite limited - I can't have root, I don't have swinstall.
http://forums13.itrc.hp.com/service/forums/questionanswer.do?admit=109447627+1259826031876+28353475&threadId=1143807
At best, the solution should work on Linux too.
I have tried to untar and gunzip -f -r -d -v --suffix= .
But the problem is that the gzipped files have no suffix, so in the end, gzip deletes them.
It was relatively easy:
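# Give each file a .gz suffix so gunzip accepts it, then decompress it.
# An explicit "." also works with non-GNU find; names with spaces still break.
for f in $(find . -type f); do
    mv "$f" "$f.gz"
    # If the file was not gzipped after all, restore its original name.
    gunzip "$f.gz" || mv "$f.gz" "$f"
done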
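A variant that skips the renaming step is sketched below. It feeds each file to gzip on standard input, so the missing suffix does not matter, and files that are not gzip data are left untouched. The function name gunzip_in_place is made up for illustration, and the sketch assumes gzip and a mktemp that creates the file it names are available, which may not hold on every HPUX box.

```shell
#!/bin/sh
# gunzip_in_place DIR: decompress every gzipped file under DIR, keeping
# each file's name. Hypothetical helper name.
gunzip_in_place() {
    # The scratch file lives outside DIR so find never lists it.
    tmp=$(mktemp) || return 1
    find "$1" -type f | while IFS= read -r f; do
        # gzip fails on input that is not gzip data; in that case the
        # original file is left as it is.
        if gzip -dc < "$f" > "$tmp" 2>/dev/null; then
            cat "$tmp" > "$f"
        fi
    done
    rm -f "$tmp"
}
```

After untarring the depot, call gunzip_in_place . from the top of the extracted tree. File names containing spaces are handled; names containing newlines are not.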