What's the best way to automate text replacing? - html

Here's the situation:
I have a lot of HTML files, and these HTML files link to a lot of documents. The documents have ALL been renamed. I have an Excel sheet which lists the old name and the new name of each file.
What would be the quickest way to change the links inside the HTML files to accommodate the new names?
The method I'm using now:
Have all the HTML files opened in Notepad++
Use Notepad++'s 'Replace in All Opened Documents' function to replace all occurrences of a certain link with the new file name.
Is there a quicker, better way?

Use Perl's regular expressions.
Elaboration, in pseudocode:
open each file read-only and read its contents into a list
close the files
foreach element in the list
    # do the desired text replacement; \Q...\E keeps the dots in file names literal
    s/\Q$oldtext\E/$newtext/g;
open each file once more, now for writing
write out the new text
It's not hard, but it requires some testing. If you have a lot of edits (and more may come later), this is more efficient than replacing by hand.

There are several free and open-source tools that replace text across multiple files. One of the open-source ones is FART.
If you prefer something with a GUI, try the free Text Crawler.

First save the Excel sheet as something nice and simple like a CSV file, so it's easy to read in your favourite language, e.g. Perl. Iterate over each file and do the search and replace. One gotcha, though: do it all in one pass, otherwise you can create problems when links have changed in complex ways. E.g. if file a.html changed to b.html and b.html changed to a.html, you can mess up the links if you do it in multiple passes. So load all the changes into memory, then cycle through each file and replace all the links in it simultaneously.
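The single-pass idea can be sketched in Python along these lines. This is only an illustration under some assumptions: the CSV has two columns (old name, new name) with no header row, the HTML files are UTF-8, and the function names are made up for the example. It matches plain substrings, so an old name like a.html would also match inside data.html; tighten the pattern if your names overlap like that.

```python
import csv
import re
from pathlib import Path


def load_mapping(csv_path):
    """Read old-name,new-name pairs from a two-column CSV file (assumed layout)."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        return {row[0]: row[1] for row in csv.reader(handle) if len(row) >= 2}


def replace_all(text, mapping):
    """Replace every old name with its new name in a single pass over the text."""
    if not mapping:
        return text
    # Longest names first, so "a.html" is tried before its prefix "a.htm".
    names = sorted(mapping, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(name) for name in names))
    # Each match is looked up once, so a.html -> b.html and b.html -> a.html
    # cannot undo each other the way two separate passes would.
    return pattern.sub(lambda match: mapping[match.group(0)], text)


def rewrite_links(html_dir, csv_path):
    """Apply the whole mapping to every .html file under html_dir."""
    mapping = load_mapping(csv_path)
    for path in Path(html_dir).rglob("*.html"):
        original = path.read_text(encoding="utf-8")
        updated = replace_all(original, mapping)
        if updated != original:
            path.write_text(updated, encoding="utf-8")
```

Try it on a copy of the files first, e.g. rewrite_links("site_copy", "renames.csv"), where both names are placeholders.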
Because this is specifically an HTML search and replace, a tool like this would be ideal:
http://www.aliassoftware.com/
"Finds and Replaces multiple text strings in multiple files at once!"

Related

How can I replace some text in html on python?

My situation is...
I have a few hundred Chrome HTML files in one folder, and I want to replace certain text (e.g. james) with other text (e.g. tom) in every HTML file. Honestly, I'm just a beginner at Python, so may I get detailed code for it? I need to know: 1. how to open every HTML file in one folder, 2. how to find certain text in the HTML, 3. how to replace it with other text (in Python). Thanks a lot.
You can just open the directory in VS Code and bulk-replace all instances of any string across all the HTML files directly. I needed to do the same and found this to be a very convenient method.
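If you would rather do it in Python, as the question asks, a minimal sketch could look like the one below. It assumes the files are UTF-8 and sit directly in one folder; the function name is made up for the example. It replaces the text everywhere it occurs, including inside tags and attributes, so work on a copy of the folder first.

```python
from pathlib import Path


def replace_in_folder(folder, old, new):
    """Replace old with new in every .html file directly inside folder.

    Returns the names of the files that were changed.
    """
    changed = []
    for path in sorted(Path(folder).glob("*.html")):   # 1. open every HTML file
        text = path.read_text(encoding="utf-8")
        if old in text:                                 # 2. find the text
            path.write_text(text.replace(old, new), encoding="utf-8")  # 3. replace it
            changed.append(path.name)
    return changed
```

For the example in the question that would be replace_in_folder("my_folder", "james", "tom"), with my_folder standing in for the real path.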

Converting several html files into one word file

I received web-service documentation in HTML format, but it is very unfriendly when it comes to searching for a specific word. Using the index file, it displays a list of the names of each request on the left, and when you click on a particular one, it displays the description and content of that request on the right.
Unfortunately I have to do some mapping with web services that we already have. Searching with CTRL + F only goes through the left side (the list); it doesn't matter if you place the cursor over the description on the right, click, and try to search that way too. It doesn't work.
My idea is to extract all the HTML files that have been provided to us into one Word document (this way I can search through the descriptions, not only through the list of names). Unfortunately, all I have managed so far is to open these files as separate Word files (one Word file per HTML file). There are almost 1000 requests to be mapped, and working this way is going to take forever...
So the question is: how do I combine more than one HTML file into one Word file?
There are two ways to merge HTML files.
Using the command line:
Copy all the HTML files that you want to merge into a folder.
Navigate to that folder using a terminal or command prompt.
Execute the following command.
On Mac/Linux:
cat *.html > output.html
On Windows:
type *.html > output.html
Note that output.html itself matches *.html, so delete it (or write it to another folder) before running the command again.
Using already available tools:
https://www.sobolsoft.com/howtouse/combine-html-files.htm, html-merge (Windows only)
To convert the merged HTML file to a Word document, read here.
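The same concatenation can be sketched in Python if you want control over the order of the files. This is just an illustration: it assumes UTF-8 files, merges them sorted by file name, and the function name is made up. Like cat and type, it simply glues the files together, so the result contains several html and body elements in a row.

```python
from pathlib import Path


def merge_html(folder, output_path):
    """Concatenate every .html file in folder into output_path, sorted by name.

    Returns the names of the files that were merged.
    """
    output = Path(output_path).resolve()
    # Skip the output file itself so that rerunning does not merge an old result.
    sources = [p for p in sorted(Path(folder).glob("*.html")) if p.resolve() != output]
    with open(output, "w", encoding="utf-8") as out:
        for source in sources:
            out.write(source.read_text(encoding="utf-8"))
            out.write("\n")
    return [p.name for p in sources]
```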

Angular 5 : How to integrate html data (which is a formatted text) in a .docx file?

I'm still a bit of a newbie in the code game, and I would like some advice from senpai.
Context:
I'm making an Angular 5 app with a form that also uses QuillJS, a rich text editor, for only one question (the previous questions are simple input fields for strings or numbers). My goal is to let my users download the form, together with the text they entered in QuillJS, as a .docx (Word) file. And of course I'm doing this because I want to keep the formatted text from QuillJS; otherwise I would have just used a good ol' string.
Issue:
The point is, I'm already building a docx file for the first questions of the form, and the only method I have found so far to put my HTML string from QuillJS into a Word-readable data type is to use the html-docx-js library.
This post even explains how. But, BUT, I don't want to use the saveAs function (see the post), which creates a file and puts the content in it. I want to put the content in the docx file I'm already creating.
So here is my question: how would you, senpai, do it?
The thing is that I've got a Blob (cf. the post), but I don't know how to put it in my docx file. I tried to see if FileReader could do the job, but well... I don't get how to integrate this special Blob type (which is application/vnd.openxmlformats-officedocument.wordprocessingml.document) into the docx file.
Maybe there is another way; I'm open to any suggestions, and I don't mind at all changing my approach.
Thank you. Save internet, give me a tip.
The official documentation for html-docx-js does not mention any option other than the asBlob method. I suggest two options:
Decoding the DOCX:
The Blob file type is not special. The Blob is just a binary representation of the docx. I found in an SE question that a docx is in fact a zipped XML document. You could unzip it using JSZip or another JS solution, then read it using FileReader and try to deal with it in a DOM manner. I'm not qualified to go into detail on how that could work.
Adding HTML to the user input first and then outputting it as a whole:
This changes the way you want to do it. Here, I would first create formatted HTML from the data you collected in the other parts of the questionnaire. Then you append the rich data from the rich text editor. Finally, you take this HTML and save it into a single file using the asBlob function.
The second solution may strip some customization from your original approach, but it seems much faster to implement.
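To see for yourself that a docx is just a zip archive (the first option), here is a small sketch. It is in Python rather than JS/JSZip, purely to show the structure, and the function names are made up. It assumes the main body lives in word/document.xml, which is where Word normally puts it; a file produced by another tool may lay things out differently.

```python
import io
import zipfile


def list_package_parts(docx_bytes):
    """Return the names of the files stored inside a .docx package."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as package:
        return package.namelist()


def read_part(docx_bytes, name="word/document.xml"):
    """Return one part of the package as text (default: the assumed main body part)."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as package:
        return package.read(name).decode("utf-8")
```

Listing the parts of a file generated by html-docx-js and of one generated by your own code is probably the quickest way to judge how hard it would be to combine the two.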

Mass-upload many text files to MediaWiki

I have many text files that I want to upload to a wiki running MediaWiki.
I don't even know if this is really possible, but I want to give it a shot.
Each text file's name will be the title of the wiki page.
One wiki page for one file.
I want to upload all text files from the same folder as the program is in.
Perhaps asking you to code it all is asking too much, so could you tell me at least which language I should look for to give it a shot?
What you probably want is a bot to create the articles for you using the MediaWiki API. Probably the best known bot framework is pywikipedia for Python, but there are API libraries and bot frameworks for many other languages too.
In fact, pywikipedia comes with a script called pagefromfile.py that does something pretty close to what you want. By default, it creates multiple pages from a single file, but if you know some Python, it shouldn't be too hard to change that.
Actually, if the files are on the same server your wiki runs on (or you can upload them there), then you don't even need a bot at all: there's a MediaWiki maintenance script called importTextFile.php that can do it for you. You can run it for all files in a given directory with a simple shell script, e.g.:
for file in directory/*.txt; do
    php /path/to/your/mediawiki/maintenance/importTextFile.php "$file";
done
(Obviously, replace directory with the directory containing the text files and /path/to/your/mediawiki with the actual path of your MediaWiki installation.)
By default, importTextFile.php will base the name of the created page on the filename, stripping any directory prefixes and extensions. Also, per standard MediaWiki page naming rules, underscores will be replaced by spaces and the first letter will be capitalized (unless you've turned that off in your LocalSettings.php); thus, for example, the file directory/foo_bar.txt would be imported as the page "Foo bar". If you want finer control over the page naming, importTextFile.php also supports an explicit --title parameter. Or you could always copy the script and modify it yourself to change the page naming rules.
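To preview which page titles you would end up with before importing anything, the naming rule described above can be mimicked in a few lines of Python. This is only my illustration of the rule as stated here, not code taken from MediaWiki, and the function name is made up.

```python
import os


def title_from_filename(filename, capitalize=True):
    """Approximate the default page title derived from a file name.

    Mirrors the rule described above: drop the directory and the extension,
    turn underscores into spaces, and capitalize the first letter.
    """
    base = os.path.basename(filename)
    stem = base.split(".")[0]          # everything before the first dot
    title = stem.replace("_", " ")
    if capitalize and title:
        title = title[0].upper() + title[1:]
    return title


print(title_from_filename("directory/foo_bar.txt"))  # prints "Foo bar"
```

Pass capitalize=False if you have turned off first-letter capitalization in your LocalSettings.php.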
Ps. There's also another MediaWiki maintenance script called edit.php that does pretty much the same thing as importTextFile.php, except that it reads the page text from standard input and doesn't have the convenient default page naming rules of importTextFile.php. It can be quite handy for automated edits using Unix pipelines, though.
Addendum: The importTextFile.php script expects the file names and contents to be in the UTF-8 encoding. If your files are in some other encoding, you'll have to either fix them first or modify the script to do the conversion, e.g. using mb_convert_encoding().
In particular, the following modifications to the script ought to do it:
To convert the file names to UTF-8, edit the titleFromFilename() function, near the bottom of the script, and replace its last line:
return $parts[0];
with:
return mb_convert_encoding( $parts[0], "UTF-8", "your-encoding" );
where your-encoding should be the character encoding used for your file names (or auto to attempt auto-detection).
To also convert the contents of the files, make a similar change higher up, inside the main code of the script, replacing the line:
$text = file_get_contents( $filename );
with:
$text = file_get_contents( $filename );
$text = mb_convert_encoding( $text, "UTF-8", "your-encoding" );
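If you would rather fix the files first than modify the script, the contents can be re-encoded beforehand. A minimal sketch in Python follows; the function names are made up, and the source encoding is something you have to know and pass in yourself. It only converts file contents, not file names. Run it only once: a file that is already UTF-8 would be garbled by a second conversion.

```python
from pathlib import Path


def to_utf8(raw, source_encoding):
    """Decode bytes from source_encoding and re-encode them as UTF-8."""
    return raw.decode(source_encoding).encode("utf-8")


def convert_folder(folder, source_encoding):
    """Rewrite every .txt file in folder as UTF-8, in place."""
    for path in Path(folder).glob("*.txt"):
        path.write_bytes(to_utf8(path.read_bytes(), source_encoding))
```

For example, convert_folder("directory", "latin-1") would convert Latin-1 files, with both arguments being placeholders for your own values.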
In MediaWiki 1.27, there is a new maintenance script, importTextFiles.php, which can do this. See https://www.mediawiki.org/wiki/Manual:ImportTextFiles.php for information. It improves on the old (now removed) importTextFile.php script in that it can handle file wildcards, so it allows the import of many text files at once.

(OpenXML) Add data pages to xml package without framework

Lately I've been combining multiple OpenXML spreadsheets via PHPExcel, which
showed me that this framework has certain issues that make it pretty much unusable
for what I want to do (my related SO question).
To make it short: it's hard to guarantee that all formatting features of Excel 2007 will
survive a file merge performed with that particular framework.
Anyway, now I'm thinking of a more general approach. I want to open a template XLSX
which contains various formatting and add some plain alphanumeric data worksheets 'at the end' of the workbook.
Is it sensibly possible to do the following:
unzip the template XLSX
parse the XML files
add worksheets
save the XML files
rezip the files to get a valid XLSX
Any hints or experiences would be highly appreciated.
Thanks in advance
K
I haven't worked with .xlsx too much, but I've altered .docx files by manually adding and editing the XML.
The biggest concern with adding new parts to a document is to make sure you update the .rels files. The best way to figure out what needs to be updated is to create a new .xlsx document in Excel, add a worksheet, save the file and then unzip it to see what has changed. You can also use the DocumentReflector tool that comes with the OpenXML SDK if you want to see the internals of the file without having to unzip it.
I found the OpenXML reference manual very helpful when hand editing files because it tells you what elements you have to keep and what elements are optional to make a valid document. It makes it easier to work with when you can remove some of the extraneous elements that Excel adds automatically.
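For what it's worth, the unzip / edit / rezip route from the question can be sketched without any framework. The example below is in Python rather than PHP and is only a sketch under assumptions: the template was written by Excel, so xl/workbook.xml uses the r: prefix for relationships and contains a sheets element, and the workbook relationships live in xl/_rels/workbook.xml.rels. The helper names are made up. It inserts text before the closing tags instead of re-serializing the XML, so everything already in the template stays byte-for-byte the same. Three places need to know about the new sheet: the workbook, its .rels file, and [Content_Types].xml.

```python
import io
import re
import zipfile
from xml.sax.saxutils import escape, quoteattr

SHEET_TYPE = "application/vnd.openxmlformats-officedocument.spreadsheetml.worksheet+xml"
SHEET_REL = "http://schemas.openxmlformats.org/officeDocument/2006/relationships/worksheet"
MAIN_NS = "http://schemas.openxmlformats.org/spreadsheetml/2006/main"


def column_letters(index):
    """Convert a zero-based column index to Excel letters (0 -> A, 26 -> AA)."""
    letters = ""
    index += 1
    while index:
        index, remainder = divmod(index - 1, 26)
        letters = chr(ord("A") + remainder) + letters
    return letters


def sheet_xml(rows):
    """Build a minimal worksheet part: numbers go in <v>, the rest as inline strings."""
    out = ['<?xml version="1.0" encoding="UTF-8" standalone="yes"?>',
           '<worksheet xmlns="%s"><sheetData>' % MAIN_NS]
    for r, row in enumerate(rows, start=1):
        out.append('<row r="%d">' % r)
        for c, value in enumerate(row):
            ref = "%s%d" % (column_letters(c), r)
            if isinstance(value, (int, float)) and not isinstance(value, bool):
                out.append('<c r="%s"><v>%s</v></c>' % (ref, value))
            else:
                out.append('<c r="%s" t="inlineStr"><is><t>%s</t></is></c>'
                           % (ref, escape(str(value))))
        out.append("</row>")
    out.append("</sheetData></worksheet>")
    return "".join(out)


def add_sheet(xlsx_bytes, name, rows):
    """Return a copy of the workbook with one extra worksheet appended at the end."""
    with zipfile.ZipFile(io.BytesIO(xlsx_bytes)) as src:
        parts = {info.filename: src.read(info.filename) for info in src.infolist()}
    workbook = parts["xl/workbook.xml"].decode("utf-8")
    rels = parts["xl/_rels/workbook.xml.rels"].decode("utf-8")
    types = parts["[Content_Types].xml"].decode("utf-8")

    # Pick a part name, relationship id and sheet id that are not in use yet.
    number = 1
    while "xl/worksheets/sheet%d.xml" % number in parts:
        number += 1
    used_rels = [int(n) for n in re.findall(r'Id="rId(\d+)"', rels)]
    used_sheets = [int(n) for n in re.findall(r'sheetId="(\d+)"', workbook)]
    rel_id = "rId%d" % (max(used_rels or [0]) + 1)
    sheet_id = max(used_sheets or [0]) + 1

    new_part = "xl/worksheets/sheet%d.xml" % number
    parts[new_part] = sheet_xml(rows).encode("utf-8")
    workbook = workbook.replace(
        "</sheets>",
        '<sheet name=%s sheetId="%d" r:id="%s"/></sheets>'
        % (quoteattr(name), sheet_id, rel_id), 1)
    rels = rels.replace(
        "</Relationships>",
        '<Relationship Id="%s" Type="%s" Target="worksheets/sheet%d.xml"/></Relationships>'
        % (rel_id, SHEET_REL, number), 1)
    types = types.replace(
        "</Types>",
        '<Override PartName="/%s" ContentType="%s"/></Types>' % (new_part, SHEET_TYPE), 1)
    parts["xl/workbook.xml"] = workbook.encode("utf-8")
    parts["xl/_rels/workbook.xml.rels"] = rels.encode("utf-8")
    parts["[Content_Types].xml"] = types.encode("utf-8")

    # Rezip everything in the original order, with the new part at the end.
    out = io.BytesIO()
    with zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as dst:
        for filename, data in parts.items():
            dst.writestr(filename, data)
    return out.getvalue()
```

Usage would be something like reading template.xlsx as bytes, calling add_sheet(data, "Data", rows) and writing the result to a new file; the file and sheet names are placeholders. Comparing the output against a workbook where you added a sheet in Excel itself, as suggested above, is still the best check.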