How can I change specific recurring text on a very large HTML file? - html

I have a very big HTML file (talking about 20MB) and I need to remove from the file a large amount of nodes of the form:
<tr><td>SPECIFIC-STRING</td><td>RANDOM-STRING</td><td>RANDOM-STRING</td></tr><tr><td style="padding-top:0" colspan="3">RANDOM-STRING</td></tr>
The file I need to work on is basically made of thousands of these strings, and I only need to remove those that have a specific first string, for instance, all those with the first string being "banana":
<tr><td>banana</td><td>RANDOM-STRING</td><td>RANDOM-STRING</td></tr><tr><td style="padding-top:0" colspan="3">RANDOM-STRING</td></tr>
I tried achieving this opening the file in Geany and using the replace feature with this regex:
<tr><td>banana<\/td><td>(.*)<\/td><td>(.*)<\/td><\/tr><tr><td(.*)<\/td><\/tr>
but the console output was that it removed X amount of occurrences, when I know there are way more occurrences than that in the file.
Firefox, Chrome and Brackets fail even to view the html code of the file due to it's size. I can't think of another way to do this due to my large unexperience with HTML.

You could be using a stream editor which as the name suggest streams the file content, thus never loads the whole file into the main memory.
A popular editor is sed. It does support RegEx.
Your command would have the following structure.
sed -i -E 's/SEARCH_REGEX/REPLACEMENT/g' INPUTFILE
-E for support of extended RegEx
-i for in-place editing mode
s denotes that you want to replace values
g is for global. By default sed would only replace the first occurrence so to replace all occrrences you must provide g
SEARCH_REGEX is the RegEx you need to find the substrings you want to replace
REPLACEMENT is the value you want to replace all matches with
INPUTFILE is the file sed is gonna read line-by line and do the replacement for you.

While regex may not be the best tool to do this kinda job, try this adjustment to your pattern:
<tr><td>banana<\/td><td>(.*?)<\/td><td>(.*?)<\/td><\/tr><tr><td(.*?)<\/td><\/tr>
That's making your .* matches lazy. I am wondering if those patterns are consuming too much.

Related

Sublime Text - find all instances of an html class name project-wide

I want to find all instances of a class named "validation" in all of my html files project wide. It's a very large project and a search for the word "validation" gives me hundreds of irrelevant results (js functions, css, js/css minified, other classes, functions and html page content containing the word validation, etc). It can sometimes be the second, third, or fourth class declared so searching for "class='validation" doesn't work.
Is there a way to specify that I only want results where validation is a class declared on an html block?
Yes. In the sublime menu go to Find --> Find in Files...
Then match what is in the following image.
The first thing you will want to do is consider other possibilities with how you can solve this problem. Currently, it sounds like you are only using sublime text. Have you considered trying to use a command-line tool like grep?
Here is an example of how it could be used.
I have a project called enfold-child with a bunch of frontend assets for a wordpress project. Let's say, I want to find all of my scss files with the class "home" listed in them somewhere, but I do NOT want to pull in built css files, or anything in my node_modules folder. The way i would do that is as follows:
Folder structure:
..
|build
|scss_files
|node_modules
|css_files
|style.css
grep -rnw build --exclude=*{.css} --exclude-dir=node_modules -e home
grep = handy search utility.
-r = recursive search.
-n = provide line numbers for each match
-w = Select only those lines containing matches that form whole words.
-e = match against a regular expression.
home = the expression I want to search for.
In general, the command line has most anything one could want/need to do most of the nifty operations offered by most text-editors -- such as Sublime. Becoming familiar with the command line will save you a bunch of time and headaches in the future.
In SublimeText, right-click on the folder you want to start the search from and click on Find in Folder. Make sure regex search is enabled (the .* button in the search panel) and use this regex as the search string:
class="([^"]+ )?validation[ "]
That regex will handle cases where "validation" is the only classname as well as cases where its one of several classnames (in which case it can be anywhere in the list).
If you didn't stick to double quotes, this version will work with single or double quotes:
class=['"]([^'"]+ )?validation[ '"]
If you want to use these regexes from the command line with grep, you'll need to include a -E argument for "extended regular expressions".

How to add html attributes and values for all lines quickly with vim and plugins?

My os:debian8.
uname -a
Linux debian 3.16.0-4-amd64 #1 SMP Debian 3.16.39-1+deb8u2 (2017-03-07) x86_64 GNU/Linux
Here is my base file.
home
help
variables
compatibility
modelines
searching
selection
markers
indenting
reformatting
folding
tags
makefiles
mapping
registers
spelling
plugins
etc
I want to create a html file as bellow.
home
help
variables
compatibility
modelines
searching
selection
markers
indenting
reformatting
folding
tags
makefiles
mapping
registers
spelling
plugins
etc
Every line was added href and id attributes,whose values are line content pasted .html and line content itself correspondingly.
How to add html attributes and values for all lines quickly with vim and plugins?
sed,awk,sublime text 3 are all welcomed to solve the problem.
$ sed 's:.*:&:' file
home
help
variables
compatibility
modelines
searching
selection
markers
indenting
reformatting
folding
tags
makefiles
mapping
registers
spelling
plugins
etc
if you want to do this in vi itself, no plug-in neccessary
Open the file, type : and insert this line as the command
%s:.*:&
it will make all the substitutions in the file.
sed is the best solution (simple and pretty fast here) if your are sure of the content, if not it need a bit of complexity that is better treated by awk:
awk '
{
# change special char for HTML constraint
Org = URL = HTML = $0
# sample of modification
gsub( / /, "%20", URL)
gsub( /</, "%3C", HTML)
printf( "%s\n", URL, Org, HTML)
}
' YourFile
To complete this easily in Sublime Text, without any plugins added:
Open the base file in Sublime Text
Type Ctrl+Shift+P and in the fuzzy search input type syn html to set the file syntax to HTML.
In the View menu, make sure Word Wrap is toggled off.
Ctrl+A to select all.
Ctrl+Shift+L to break selection into multi-line edit.
Ctrl+C to copy selection into clipboard as multiple lines.
Alt+Shift+W to wrap each line with a tag-- then tap a to convert the default <p> tag into an <a> tag (hit esc to quit out of any context menus that might pop up)
Type a space then href=" -- you should see this being added to every line as they all have cursors. Also you should note that Sublime has automatically closed your quotes for you, so you have href="" with the cursor between the quotes.
ctrl+v -- this is where the magic happens-- your clipboard contains every lines worth of contents, so it will paste each appropriate value into the quotes where the cursor is lying. Then you simply type .html to add the extension.
Use the right arrow to move the cursors outside of the quotes for the href attribute and follow the two previous steps to similarly add an id attribute with the intended ids pasted in.
Voila! You're done.
Multi-line editing is very powerful as you learn how to combine it with other keyboard shortcuts. It has been a huge improvement in my workflow. If you have any questions please feel free to comment and I'll adjust as needed.
With bash one-liner:
while read v; do printf '%s\n' "$v" "$v" "$v"; done < file
(OR)
while read v; do echo "$v"; done < file
Try this -
awk '{print a$1b$1c$1d}' a='' d='' file
home
help
variables
compatibility
modelines
searching
selection
markers
indenting
reformatting
folding
tags
makefiles
mapping
registers
spelling
plugins
etc
Here I have created 4 variable a,b,c & d which you can edit as per your choice.
OR
while read -r i;do echo ""$i";done < f
home
help
variables
compatibility
To execute it directly in vim:
!sed 's:.*:&:' %
In awk, no regex, no nothing, just print strings around $1s, escaping "s:
$ awk '{print "" $1 ""}' file
home
help
If you happen to have empty lines in there just add /./ before the {:
/./{print ...
list=$(cat basefile.txt)
for val in $list
do
echo ""$val"" >> newfile.html
done
Using bash, you can always make a script or type this into the command line.
This vim replacement pattern handles your base file:
s#^\s*\(.\{-}\)\s*$#\1#
^\s* matches any leading spaces, then
.\{-} captures everything after that, non-greedily — allowing
\s$ to match any trailing spaces.
This avoids giving you stuff like home .
You can also process several base files with vim at once:
vim -c 'bufdo %s#^\s*\(.\{-}\)\s*$#\1# | saveas! %:p:r.html' some.txt more.txt`
bufdo %s#^\s*\(.\{-}\)\s*$#\1# runs the replacement on each buffer loaded into vim,
saveas! %:p:r.html saves each buffer with an html extension, overwriting if necessary,
vim will open and show you the saved more.html, which you can correct as needed, and
you can use :n and :prev to visit some.html.
Something like sed’s probably best for big jobs, but this lets you tweak the conversions in vim right after it’s made them, use :u to undo, etc. Enjoy!

How to remove all content before and after two different HTML tags from the Linux command line

I have a file, links.html, and want to remove everything before:
<div id="data_div"
and everything after AND including
<div style="\"overflow:auto\"">
and save to a new file; links1.html.
Is it possible to complete this in a single operation, either with BASH string manipulation or sed?
If so, how?
Yes it is entirely possible. Below are two examples, although you will have to substitute "pattern" with what you referenced above as the pattern.
To remove everything after a "pattern" to include the pattern itself, you can use the following command
sed -i '/pattern/,$d' filename
To remove everything before a pattern to include the pattern itself, you can use the following command
sed -i '1,/pattern/d' filename
If you would like to not edit the file in place, you could remove the "-i" portion and redirect the output to a separate file.

In Stata, how do I add variable labels from a separate csv file?

I have a set of csv files that are very simple to load into Stata using the -insheet- command. But they have very uninformative variable names. For each of these files, I also have a file of metadata consisting of two columns: the original (uninformative) variable names, and a description of what the variables actually mean. I'd like to use these metadata files to create variable labels, preferably without going through and typing up all the separate label commands or turning the metadata file into a dictionary for each file. It seems like there must be a quick way of loading the metadata file into Stata and looping through it to generate the label commands, but I don't know what it is. Any thoughts?
Ideally each line of the metadata is something like
varname1 "more interesting description"
in which case you can prefix each line with
label var
and then run the file as if it were a do-file using do. See the help for label. That is easy in a decent text editor, as for example searching for the start of each line and replacing it with label var (note the need for the space).
What could bite here includes:
You don't have double quotes " " as delimiters, in which case you need to insert them.
The extra information does not qualify as a variable label because it is more than 80 characters long. See help limits.
There are other ways to do this with Stata. You could write a program to read in the metadata and write out a do-file using file, but if this were my problem I would reach first for my text editor. (Most experienced Stata programmers use something else as well as doedit.)

Find and Replace with Notepad++

I have a document that was converted from PDF to HTML for use on a company website to be referenced and indexed for search. I'm attempting to format the converted document to meet my needs and in doing so I am attempting to clean up some of the junk that was pulled over from when it was a PDF such as page numbers, headers, and footers. luckily all of these lines that need to be removed are in blocks of 4 lines unfortunately they are not exactly the same therefore cannot be removed with a simple literal replace. The lines contain numbers which are incremental as they correlate with the pages. How can I remove the following example from my html file.
Title<br>
10<br>
<hr>
<A name=11></a>Footer<br>
I've tried many different regular expression attempts but as my skill in that area is limited I can't find the proper syntax. I'm sure i'm missing something fairly easy as it would seem all I need is a wildcard replace for the two numbers in the code and the rest is literal.
any help is apprciated
The search & replace of npp is quite odd. I can't find newline charactes with regular expression, although the documentation says:
As of v4.9 the Simple find/replace (control+h) has changed, allowing the use of \r \n and \t in regex mode and the extended mode.
I updated to the last version, but it just doesn't work. Using the extended mode allows me to find newlines, but I can't specify wildcards.
However, you can use the macros to overcome this problems.
prepare a search that will find a unique passage (like Title<br>\r\n, here you can use the extended mode)
start recording a macro
press F3 to use your search
mark the four lines and delete them
stop recording the macro ... done!
Just replay it and it deletes what you wanted to delete.
If I have understood your request correctly this pattern matches your string:
Title<br>( ?)\n([0-9]+)<br>( ?)\n<hr>( ?)\n<A name=([0-9]+)></a>Footer<br>
I use the Regex Coach to try out complicated regex patterns. Other utilities are available.
edit
As I do not use Notepad++ I cannot be sure that this pattern will work for you. Apologies if that transpires to be the case. (I'm a TextPad man myself, and it does work with that tool).