Splitting HTML file using AWK


I was wondering if it's possible to split an HTML file into separate .html files using awk? I'd like to look for the pattern:
<div class="post">
And when it finds this, create a new file for each instance. I've tried to put the command together but can't get it working. My file is called working.html, and this is the command I've constructed:
awk '/<div class="post">/{x="F"++i;}{print > x;}' working.html
Any ideas?

It looks like it's bombing out because x is not initialized and can't be used as a filename until it is first set on a <div> line.
One way to fix that is to add a BEGIN pattern to initialize it.
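BEGIN {
    x = "F0"        # lines before the first match go to F0
}
/<div class="post">/ {
    x = "F" ++i     # each match starts the next file: F1, F2, ...
}
{ print > x }       # every line is written to the current file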
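For anyone who wants the same logic outside awk, here is a minimal Python sketch of the approach above. The function name split_posts and the out_dir parameter are made up for illustration; the F0, F1, ... naming follows the awk answer. It reads bytes so that no assumption is made about the file's encoding, and it closes each output file before opening the next, which may help where an awk implementation limits the number of open files.

```python
from pathlib import Path

# Marker from the question, as bytes so the file's encoding does not matter.
MARKER = b'<div class="post">'


def split_posts(source, out_dir="."):
    """Split `source` into F0, F1, ...; a new file starts at each marker line.

    Hypothetical helper: lines before the first marker go to F0,
    mirroring the BEGIN { x = "F0" } initialisation in the awk answer.
    Returns the list of file names written, in order.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    index = 0
    current = None
    names = []
    try:
        with open(source, "rb") as src:
            for line in src:
                if MARKER in line:
                    index += 1
                    if current is not None:
                        current.close()  # finished with the previous piece
                        current = None
                if current is None:
                    names.append(f"F{index}")
                    current = open(out / names[-1], "wb")
                current.write(line)
    finally:
        if current is not None:
            current.close()
    return names
```

Calling split_posts("working.html", "pieces") would write the pieces into a pieces directory and return their names.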

Related

SSIS Script howto append text to end of each row in flat file?

I currently have a flat file with around 1 million rows.
I need to add a text string to the end of each row in the file.
I've been trying to adapt the following code but am not having any success:
public void Main()
{
    // TODO: Add your code here
    var lines = System.IO.File.ReadAllLines(@"E:\SSISSource\Source\Source.txt");
    foreach (string item in lines)
    {
        var str = item.Replace("\n", "~20221214\n");
        var subitems = str.Split('\n');
        foreach (var subitem in subitems)
        {
            // write the data back to the file
        }
    }
    Dts.TaskResult = (int)ScriptResults.Success;
}
I can't seem to get the code to recognise the newline "\n", and I'm not sure how to write the row back to the file so that it replaces the existing row rather than adding a new one. Or is the above code sending me down a rabbit hole, and is there an easier method?
Many thanks for any pointers &/or assistance.

File.ReadAllLines strips the line terminator from each record, so your Replace never finds a "\n" to work on.
Simply append your string to each line, and otherwise use @billinKC's solution.
BONUS:
I think DateTime.Now.ToString("yyyyMMdd"); is what you are trying to append to each line

Thanks @billinKC & @KeithL
KeithL, you were correct that the \n was stripped off. So I used a slightly amended version of @billinKC's code to get what I wanted:
string origFile = @"E:\SSISSource\Source\Source.txt";
string fixedFile = @"E:\SSISSource\Source\Source.fixed.txt";
// Make a blank file
System.IO.File.WriteAllText(fixedFile, "");
var lines = System.IO.File.ReadAllLines(origFile);
foreach (string item in lines)
{
    // ReadAllLines strips the line terminator, so append the suffix plus a new "\n"
    var str = item + "~20221214\n";
    System.IO.File.AppendAllText(fixedFile, str);
}
As an aside KeithL - thanks for the DateTime code however the text that I am appending is obtained from a header row in the source file which is being read into a variable in an earlier step.

I read your code as
For each line in the file, replace the existing newline character with ~20221214 newline
At that point, the value of str is what you need, just write that! Instead, you split based on the new line which gets you an array of values which could be fine but why do the extra operations?
string origFile = @"E:\SSISSource\Source\Source.txt";
string fixedFile = @"E:\SSISSource\Source\Source.fixed.txt";
// Make a blank file
System.IO.File.WriteAllText(fixedFile, "");
var lines = System.IO.File.ReadAllLines(origFile);
foreach (string item in lines)
{
    // Caveat: ReadAllLines already strips "\n", so this Replace matches nothing;
    // appending the suffix to item (item + "~20221214\n") is what actually works.
    var str = item.Replace("\n", "~20221214\n");
    System.IO.File.AppendAllText(fixedFile, str);
}
Something like this ought to be what you're looking for.

Converting HTML with equations pages to docx

I am trying to convert an html document to docx using pandoc.
pandoc -s Template.html --mathjax -o Test.docx
During the conversion to docx everything goes smoothly except for the equations.
In the html file an equation looks like this:
<div class="jp-Cell jp-MarkdownCell jp-Notebook-cell">
<div class="jp-Cell-inputWrapper">
<div class="jp-Collapser jp-InputCollapser jp-Cell-inputCollapser">
</div>
<div class="jp-InputArea jp-Cell-inputArea"><div class="jp-RenderedHTMLCommon jp-RenderedMarkdown jp-MarkdownOutput " data-mime-type="text/markdown">
\begin{equation}
\log_{10}(\mu)={-2.64}+\frac{4437.038}{T-544.391}
\end{equation}
</div>
</div>
</div>
</div>
After running the pandoc command the result in the docx document is:
\begin{equation} \log_{10}(\mu)={-2.64}+\frac{4437.038}{T-544.391} \end{equation}
Do you have any idea how I can overcome this issue?
Thanks

A Lua filter can help here. The code below looks for div elements with a data-mime-type="text/markdown" attribute and, somewhat paradoxically, parses their content as LaTeX. The original div is then replaced with the parse result.
local stringify = pandoc.utils.stringify
function Div (div)
  -- pandoc's HTML reader drops the "data-" prefix, so data-mime-type shows up as mime-type
  if div.attributes['mime-type'] == 'text/markdown' then
    return pandoc.read(stringify(div), 'latex').blocks
  end
end
Save the code to a file parse-math.lua and let pandoc use it with the --lua-filter / -L option:
pandoc --lua-filter parse-math.lua ...
As noted in a comment, this gets slightly more complicated if there are other HTML elements with the text/markdown media type. In that case we'll check if the parse result contains only math, and keep the original content otherwise.
local stringify = pandoc.utils.stringify
function Div (div)
  if div.attributes['mime-type'] == 'text/markdown' then
    local result = pandoc.read(stringify(div), 'latex').blocks
    -- inlines of the first parsed block, or an empty list if there is none
    local first = result[1] and result[1].content or {}
    -- replace the div only when the parse result is a single Math element
    return (#first == 1 and first[1].t == 'Math')
      and result
      or nil
  end
end

Razor page, server code with html

I can't figure out how to mix html with server code in this scenario.
@{
    var i = 0;
    foreach (var match in Model.StagingRooms)
    {
        if (i % 2 == 0)
        {
            <div class="row">
        }
        Html.Partial("_MatchCard", match.Value);
        i++;
        if (i % 2 == 0)
        {
            </div>
        }
    }
}
Using the code above, instead of rows of cards, I get my code echoed back in the output.
If I add @ to Html.Partial and the increment
I also tried prefixing each server code line with @ and removing the @{} block, but that doesn't compile at all; I get a bunch of red squiggles in my code.
Edit:
When I add @ to every server code snippet, I get squiggles and can't compile.
If I remove @ from the last if statement, I can run the app, but that piece of code is displayed back to me in the browser page.

You need to use an @ here: @Html.Partial(....).
Also @foreach and @if, as Brad said.
And code lines go in brackets: @(i++)

Flash ABC : What does the number part of <file>.as$<number> in a swfdump

If I take a swf, and run it through swfdump
swfdump.exe -abc file.swf > ABC.txt
On the first run I may get some output in ABC.txt like this
ObjectConfig.as$60
And on a subsequent run of the same SWF get a different output
ObjectConfig.as$61
What is the meaning of the number after the $ ?

This is part of the debug metadata that the mxmlc compiler adds to the bytecode when you do a debug compile, debug=true. If you do a normal release compile, this info is omitted.
This metadata stores filenames and line numbers so that you can see the location in your source while debugging. Although I'm not sure on the exact meaning of these particular numbers, they seem to be a unique identifier or index of that file for the debugger, perhaps in case of two classes with the same name.

The best I can see is in the source code for swfdump, it calls swf_GetString. Somewhere in this chain it adds what looks like a debugLine or a scopeDepth to the end of the class name:
char* swf_GetString(TAG*t)
{
    int pos = t->pos;
    while(t->pos < t->len && swf_GetU8(t));
    /* make sure we always have a trailing zero byte */
    if(t->pos == t->len) {
        if(t->len == t->memsize) {
            swf_ResetWriteBits(t);
            swf_SetU8(t, 0);
            t->len = t->pos;
        }
        t->data[t->len] = 0;
    }
    return (char*)&(t->data[pos]);
}

Best way to find illegal characters in a bunch of ISO-8859-1 web pages?

I have a bunch of html files in a site that were created in the year 2000 and have been maintained to this day. We've recently begun an effort to replace illegal characters with their html entities. Going page to page looking for copyright symbols and trademark tags seems like quite a chore. Do any of you know of an app that will take a bunch of html files and tell me where I need to replace illegal characters with html entities?

You could write a PHP script (if you can; if not, I'd be happy to help), but I assume you already converted some of the "special characters", so that does make the task a little harder (although I still think it's possible)...

Any good text editor will do a file contents search for you and return a list of matches.
I do this with EditPlus. Several other editors, such as Notepad++ and TextPad, will easily do the same.
You do not have to open the files. You just specify the path where the files are stored, the mask (*.html) and the contents to search for, such as "©". The editor comes back with a list of matches, and double-clicking a match opens the file at the matching line.
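If you would rather script the search than run it once per character in an editor, the idea can be sketched in a few lines of Python. This is only a sketch under a few assumptions: the files really are ISO-8859-1, "illegal" means any non-ASCII character, and the function name find_non_ascii is made up for illustration. The entity names come from Python's standard html.entities table, with a numeric reference as a fallback.

```python
from html.entities import codepoint2name
from pathlib import Path


def find_non_ascii(root, pattern="*.html", encoding="iso-8859-1"):
    """Yield (path, line_number, character, suggested_entity) for every
    non-ASCII character found in files matching `pattern` under `root`.

    Hypothetical helper; `encoding` is an assumption about how the
    pages were saved and should be adjusted if that is wrong.
    """
    for path in sorted(Path(root).rglob(pattern)):
        text = path.read_text(encoding=encoding)
        for line_number, line in enumerate(text.split("\n"), start=1):
            for ch in line:
                if ord(ch) > 127:
                    name = codepoint2name.get(ord(ch))
                    # Named entity when one exists, numeric reference otherwise.
                    entity = f"&{name};" if name else f"&#{ord(ch)};"
                    yield str(path), line_number, ch, entity
```

Looping over find_non_ascii("path/to/site") and printing each tuple gives a file-and-line report similar to the editor's match list. Pages that were actually saved as Windows-1252 (where the trademark sign lives) would need encoding="cp1252" instead.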

I also have a website that needs to regularly convert large numbers of file names back and forth between character sets. While a text editor can do this, a portable solution using 2 steps in php was preferable. First, add the filenames to an array, then do the search and replace. An extra piece of code in the function excludes certain file types from the array.
function listdir($start_dir = '.') {
    $nonFilesArray = array('index.php', 'index.html', 'help.html'); // files to exclude
    $filesArray = array(); // collected file paths
    if (!is_dir($start_dir)) {
        return false; // no such folder
    }
    $fh = opendir($start_dir);
    while (($tmpFile = readdir($fh)) !== false) { // get each filename without its path
        if ($tmpFile == '.' || $tmpFile == '..') continue; // skip . & ..
        $filepath = $start_dir . '/' . $tmpFile; // name the relative path/to/file
        if (is_dir($filepath)) { // if path/to/file is a folder, recurse into it
            $filesArray = array_merge($filesArray, listdir($filepath));
        } elseif (!in_array($tmpFile, $nonFilesArray)
                && pathinfo($tmpFile, PATHINFO_EXTENSION) == 'html') {
            // strip initial part of $filepath (first 17 characters) and add it to the array
            $filesArray[] = substr_replace($filepath, '', 0, 17);
        }
    }
    closedir($fh);
    return $filesArray;
}
$filesArray = listdir($targetdir); // call the function for this directory
$numNewFiles = count($filesArray); // get number of records
for ($i = 0; $i < $numNewFiles; $i++) { // read the filenames and replace unwanted characters
    $tmplnk = $linkpath . $filesArray[$i];
    $outname = basename($filesArray[$i], ".html");
    $outname = str_replace('-', ' ', $outname);
}
