Mass-upload many text files to MediaWiki - mediawiki

I have many text files that I want to upload to a wiki running MediaWiki.
I don't even know if this is really possible, but I want to give it a shot.
Each text file's name will be the title of the wiki page.
One wiki page for one file.
I want to upload all text files from the same folder as the program is in.
Perhaps asking you to code it all is asking too much, so could you at least tell me which language I should look into to give it a shot?

What you probably want is a bot to create the articles for you using the MediaWiki API. Probably the best known bot framework is pywikipedia for Python, but there are API libraries and bot frameworks for many other languages too.
In fact, pywikipedia comes with a script called pagefromfile.py that does something pretty close to what you want. By default, it creates multiple pages from a single file, but if you know some Python, it shouldn't be too hard to change that.
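As a rough sketch of the bot approach: assuming Pywikibot (the current name of the pywikipedia framework) is installed and already configured for your wiki through its user-config.py, something along these lines would create one page per text file. The helper names collect_pages and upload_pages and the edit summary are made up for illustration; this is not how pagefromfile.py itself works.

```python
from pathlib import Path


def collect_pages(folder):
    """Map page title -> page text for every .txt file in `folder`.

    The title is simply the file name without its extension.
    """
    pages = {}
    for path in sorted(Path(folder).glob("*.txt")):
        pages[path.stem] = path.read_text(encoding="utf-8")
    return pages


def upload_pages(pages, summary="Bulk import of text files"):
    """Create (or overwrite) one wiki page per entry using Pywikibot."""
    # Imported here so the collecting step can be tried without Pywikibot.
    import pywikibot

    site = pywikibot.Site()  # the wiki configured in user-config.py
    for title, text in pages.items():
        page = pywikibot.Page(site, title)
        page.text = text
        page.save(summary=summary)


# Example (not run here): upload every .txt file in the current folder.
# upload_pages(collect_pages("."))
```

To pick up the files that sit next to the script rather than in the current working directory, Path(__file__).resolve().parent could be passed to collect_pages instead of ".".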
Actually, if the files are on the same server your wiki runs on (or you can upload them there), then you don't even need a bot at all: there's a MediaWiki maintenance script called importTextFile.php that can do it for you. You can run it for all files in a given directory with a simple shell script, e.g.:
for file in directory/*.txt; do
php /path/to/your/mediawiki/maintenance/importTextFile.php "$file";
done
(Obviously, replace directory with the directory containing the text files and /path/to/your/mediawiki with the actual path of your MediaWiki installation.)
By default, importTextFile.php will base the name of the created page on the filename, stripping any directory prefixes and extensions. Also, per standard MediaWiki page naming rules, underscores will be replaced by spaces and the first letter will be capitalized (unless you've turned that off in your LocalSettings.php); thus, for example, the file directory/foo_bar.txt would be imported as the page "Foo bar". If you want finer control over the page naming, importTextFile.php also supports an explicit --title parameter. Or you could always copy the script and modify it yourself to change the page naming rules.
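To preview which page names a batch of files would end up with, the rules just described can be approximated in a few lines of Python. This is only a sketch of the behaviour as described here (the function name and the capital_links switch are invented for illustration); MediaWiki's real title normalization does more, such as rejecting illegal characters.

```python
import os


def title_from_filename(filename, capital_links=True):
    """Approximate the default page name derived from a file name."""
    base = os.path.basename(filename)   # strip directory prefixes
    title = base.split(".")[0]          # strip extensions (everything after the first dot)
    title = title.replace("_", " ")     # underscores become spaces
    if capital_links:                   # first-letter capitalization can be turned off
        title = title[:1].upper() + title[1:]
    return title


print(title_from_filename("directory/foo_bar.txt"))  # prints "Foo bar"
```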
P.S. There's also another MediaWiki maintenance script called edit.php that does pretty much the same thing as importTextFile.php, except that it reads the page text from standard input and doesn't have the convenient default page naming rules of importTextFile.php. It can be quite handy for automated edits using Unix pipelines, though.
Addendum: The importTextFile.php script expects the file names and contents to be in the UTF-8 encoding. If your files are in some other encoding, you'll have to either fix them first or modify the script to do the conversion, e.g. using mb_convert_encoding().
In particular, the following modifications to the script ought to do it:
To convert the file names to UTF-8, edit the titleFromFilename() function, near the bottom of the script, and replace its last line:
return $parts[0];
with:
return mb_convert_encoding( $parts[0], "UTF-8", "your-encoding" );
where your-encoding should be the character encoding used for your file names (or auto to attempt auto-detection).
To also convert the contents of the files, make a similar change higher up, inside the main code of the script, replacing the line:
$text = file_get_contents( $filename );
with:
$text = file_get_contents( $filename );
$text = mb_convert_encoding( $text, "UTF-8", "your-encoding" );
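If you would rather fix the files first than patch the script, a small conversion step can be run before the import. The following Python sketch is one way to do that, assuming all files share a single known encoding; the function name and the directory layout are hypothetical, and it only converts file contents, not file names.

```python
from pathlib import Path


def convert_to_utf8(src_dir, dst_dir, source_encoding):
    """Write a UTF-8 copy of every .txt file in src_dir into dst_dir."""
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    converted = []
    for path in sorted(Path(src_dir).glob("*.txt")):
        # decode() raises UnicodeDecodeError if the bytes do not fit the encoding
        text = path.read_bytes().decode(source_encoding)
        (dst / path.name).write_bytes(text.encode("utf-8"))
        converted.append(path.name)
    return converted


# Example (not run here): convert_to_utf8("directory", "directory-utf8", "koi8_r")
```

Writing the converted copies to a separate directory keeps the originals untouched in case the assumed encoding turns out to be wrong.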

In MediaWiki 1.27, there is a new maintenance script, importTextFiles.php, which can do this. See https://www.mediawiki.org/wiki/Manual:ImportTextFiles.php for information. It improves on the old (now removed) importTextFile.php script in that it can handle file wildcards, so it allows the import of many text files at once.

Related

save html page from the server by URL with no changes - get the exact copy, the clone

Let's say I have a URL http://example.com/path/to/document.html
That's the html document, the file, that has no external css or js.
If I open it in Google Chrome and save it locally with Ctrl+S, the content is changed. The saved file starts with <!-- saved from url= which is not what I want at all. I need the exact HTML document; even spaces count.
The second option is to open it with Ctrl+U (View Source), Select All, paste it into a new document, save it and rename it. This is better, but spaces, tabs and the end of the file will differ depending on which operating system I'm using.
I need an exact copy of that HTML file, byte for byte.
How can I do this?
This is a practical question, as I need to modify that document slightly.
I'm sorry there is no source code in my question, but it is about web development.
Any ideas?
Thank you.
P.S. Of course that document could be generated by php or whatever, the part of the code can be even extracted from the db, but not in my case. I know that's a plain file.
I'd delete the comment after saving from Chrome, use wget in a linux environment, or open the page as an InputStream in Java. Do all three, run a diff, and if two arrived identical assume that's the file on the server.
Why do you need a byte-for-byte copy of the file on the server anyway, and why can't you ftp the file? There is always the chance that the server will serve different html files depending on your user-agent, but there are other tools which may be better than Chrome for getting your copy and many can spoof a user-agent as well.
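For what it's worth, a scripted download avoids the browser's rewriting altogether. Here is a minimal Python sketch using only the standard library; the function name and the default User-Agent string are placeholders chosen for illustration. It writes the response body in binary mode, so no newline or encoding translation takes place, and returns a SHA-256 digest that can be compared against copies obtained in other ways.

```python
import hashlib
import urllib.request


def fetch_exact(url, dest, user_agent="Mozilla/5.0"):
    """Save the response body to `dest` without decoding or rewriting it."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(request) as response:
        body = response.read()       # raw bytes, exactly as received
    with open(dest, "wb") as out:    # binary mode: no newline translation
        out.write(body)
    return hashlib.sha256(body).hexdigest()


# Example (not run here):
# fetch_exact("http://example.com/path/to/document.html", "document.html")
```

This only guarantees that the saved file matches what the server sent for this particular request; if the server varies its output by user agent, the result can still differ from the file on disk.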

Method of identifying plaintext files as scripts

I am creating a filter for files coming onto a Unix machine. I only want to allow plain text files that do not look like scripts to pass through.
For checking plain text I am checking the executable bit of the file and using the -T file test from perl. (I understand this is not 100%, but it will catch the binary files I most want to avoid). I think this will be sufficient, but any suggestions are welcome.
My main question is in recognizing when a plain text file is a script. Every script I've ever written has started out with a #! line, so my first thought is to read in the file's first line and block any containing that. Are there common non-script plain text files that start with the #! line that I will flag with a false-positive? Are there better/additional methods of identifying a script?
That's what the file command (see Wikipedia) is for. It recognizes much more than just the she-bang (#!), and can tell you what kind of script it is, if any.
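If you still want the simple check described in the question as a first line of defence, it can be sketched in Python as below. The function name is made up, and this covers only the executable bit and the #! prefix; it is far less thorough than what the file command recognizes.

```python
import os


def looks_like_script(path):
    """Heuristic: any executable bit set, or the file starts with '#!'."""
    if os.stat(path).st_mode & 0o111:   # owner, group or other execute bit
        return True
    with open(path, "rb") as handle:
        return handle.read(2) == b"#!"
```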

download links from a web page with renaming

I'm trying to find a way to automatically download all links from a web page, but I also want to rename them. For example:
<a href = fileName.txt> Name I want to have </a>
I want to be able to get a file named 'Name I want to have' (I don't worry about the extension).
I am aware that I could get the page source, then parse all the links, and download them all manually, but I'm wondering if there are any built-in tools for that.
lynx --dump | grep http:// | cut -d ' ' -f 4
will print all the links that can be batch fetched with wget - but is there a way to rename the links on the fly?
I doubt anything does this out of the box. I suggest you write a script in Python or similar to download the page, and load the source (try the Beautiful Soup library for tolerant parsing). Then it's a simple matter of traversing the source to capture the links with their attributes and text, and download the files with the names you want. With the exception of Beautiful Soup (if you need to be able to parse sloppy HTML), all you need is built in with Python.
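A minimal sketch of that idea, using only Python's standard library (html.parser rather than Beautiful Soup, so it will be less forgiving of sloppy HTML). The class and function names are invented for illustration, and the file-name sanitizing is deliberately crude.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
import os
import urllib.request


class LinkCollector(HTMLParser):
    """Collect (href, link text) pairs from <a> elements."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            name = " ".join("".join(self._text).split())  # collapse whitespace
            self.links.append((self._href, name))
            self._href = None


def extract_links(html):
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


def download_renamed(page_url, html, dest_dir="."):
    """Fetch every link and save it under its link text."""
    for href, name in extract_links(html):
        if not name:
            continue  # nothing to name the file after
        safe_name = name.replace("/", "_")
        target = os.path.join(dest_dir, safe_name)
        urllib.request.urlretrieve(urljoin(page_url, href), target)
```

Two links with the same text would overwrite each other here, so a real script would probably want to append a counter or keep the original extension.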
I solved the problem by converting the web page entirely to Unicode on the first pass (using Notepad++'s built-in conversion).
Then I wrote a small shell script that used cat, awk and wget to fetch all the data.
Unfortunately, I couldn't automate the process, since I didn't find any tools for Linux that would convert an entire page from KOI8-R to Unicode.

Output reformatted text within a file included in a JSP

I have a few HTML files that I'd like to include via tags in my webapp.
Within some of the files, I have pseudo-dynamic code - specially formatted bits of text that, at runtime, I'd like to be resolved to their respective bits of data in a MySQL table.
For instance, the HTML file might include a line that says:
Welcome, [username].
I want this resolved to (via a logged-in user's data):
Welcome, user#domain.com.
This would be simple to do in a JSP file, but requirements dictate that the files will be created by people who know basic HTML, but not JSP. Simple text-tags like this should be easy enough for me to explain to them, however.
I have the code set up to do resolutions like that for strings, but can anyone think of a way to do it across files? I don't actually need to modify the file on disk - just load the content, modify it, and output it w/in the containing JSP file.
I've been playing around with trying to load the files into strings via the apache readFileToString, but I can't figure out how to load files from a specific folder within the webapp's content directory without hardcoding it in and having to worry about it breaking if I deploy to a different system in the future.
If those files are located in the webcontent, use ServletContext#getRealPath() to convert a relative web path to an absolute disk file system path. This works if the WAR is exploded in the appserver (most do this by default; only WebLogic doesn't, but this is configurable IIRC). Inside servlets you can obtain the ServletContext via the inherited getServletContext() method.
String relativeWebappURL = "/html/file.html";
String absoluteFilePath = getServletContext().getRealPath(relativeWebappURL);
File file = new File(absoluteFilePath);
// ...
Alternatively, you can put it in the classpath of the webapplication and make use of ClassLoader#getResource():
// ClassLoader resource names are relative to the classpath root, so no leading slash.
String relativeClasspathURL = "html/file.html";
URL absoluteClasspathURL = Thread.currentThread().getContextClassLoader().getResource(relativeClasspathURL);
File file = new File(absoluteClasspathURL.toURI());
// ...
As for the bigger picture, have you considered an existing templating framework like FreeMarker or Velocity to make the whole job easier?

What's the best way to automate text replacing?

Here's the situation:
I have a lot of HTML files, and these HTML files link to a lot of documents. The documents have ALL been renamed. I have an excel sheet which has the old name of the file and the new name of the file.
What would be the quickest way to change the links inside the HTML files to accommodate the new names?
The method I'm using now:
Have all the HTML files opened in Notepad++
Use Notepad++'s 'Replace in All Opened Documents' function to replace all occurrences of a certain link with the new file name.
Is there a quicker, better way?
Perl's regular expressions.
Elaboration, in pseudocode:
open each file read-only and read its contents into a list
close the files
foreach element in the list
    # do the desired text replacement
    s/$oldtext/$newtext/g;
open each file once more, now for writing
write out the new text
It's not hard, but it requires some testing. If you have a lot of edits (and more may happen later), this is more efficient.
There are several free and open-source tools that replace text in several files; one of the open-source ones is FART.
If you prefer something with a GUI, try the free Text Crawler
First save the Excel sheet to something nice and simple like a CSV file, so it's easy to read in your favourite language, e.g. Perl. Iterate over each file and do the search and replace. One gotcha, though: do it all in one pass, otherwise you could create problems if links have changed in complex ways. For example, if file a.html was renamed to b.html and b.html was renamed to a.html, you can mess up the links if you do it in multiple passes. So load all the changes into memory, then cycle through each file and replace all links in it simultaneously.
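To illustrate the single-pass idea, here is a short Python sketch. It assumes the spreadsheet was exported as a two-column CSV (old name, new name) without a header row; the function names are hypothetical. All old names are combined into one regular expression, so each position in the text is rewritten at most once and a swap such as a.html and b.html comes out correctly.

```python
import csv
import re


def load_renames(csv_path):
    """Read old-name,new-name rows from a CSV export of the spreadsheet."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        return {row[0]: row[1] for row in csv.reader(handle) if len(row) >= 2}


def replace_all(text, renames):
    """Apply every rename in a single pass over `text`."""
    if not renames:
        return text
    # Longest names first, so "doc.html" wins over "doc.htm" at the same position.
    names = sorted(renames, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(name) for name in names))
    return pattern.sub(lambda match: renames[match.group(0)], text)
```

Note that this matches plain substrings, so an old name like a.html would also match inside data.html; anchoring the pattern to the surrounding quotes or path separator would tighten it if that matters for your files.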
Because this is specifically an HTML search and replace, a tool like this would be ideal:
http://www.aliassoftware.com/
It finds and replaces multiple text strings in multiple files at once.