If I have downloaded Wikipedia XML dumps, is there any way of removing all of the internal links from within an XML file?
Thanks
One thing you could do, if you are importing them into a local wiki, is to import all the files you want, then use a bot (e.g. pywikipediabot, which is easy to use) to get rid of all the internal links.
Wikipedia database dumps and information about using them are located here: Wikipedia:Database download. You should do this instead of writing a script to scrape Wikipedia.
I would try to use XSLT to transform the XML file into another XML file.
You could do a search and replace in your favorite text editor, replacing [[ and ]] with nothing. Keep in mind that piped links such as [[Target|display text]] will still leave the target and the pipe behind, so a regular expression handles them more cleanly.
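If you'd rather not do it by hand, here is a minimal Python sketch of that search-and-replace. The file names are placeholders, and it assumes the usual [[Target]] / [[Target|display text]] markup without nested links (such as image captions):

    import re

    # Placeholder file names; only simple [[Target]] and
    # [[Target|display text]] links are handled, not nested links.
    link = re.compile(r"\[\[(?:[^\[\]|]*\|)?([^\[\]|]*)\]\]")

    with open("dump.xml", encoding="utf-8") as src, \
         open("dump-nolinks.xml", "w", encoding="utf-8") as dst:
        for line in src:
            # Keep the display text (or the target if there is no pipe),
            # drop the [[ ]] markup around it.
            dst.write(link.sub(r"\1", line))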
Related
My situation is...
I have a few hundred Chrome HTML files in one folder, and I want to replace certain text (e.g. james) with other text (e.g. tom) in every HTML file. Honestly, I'm just a beginner at Python, so may I get detailed code for it? I need to know: 1. how to open every HTML file in one folder, 2. how to find certain text in the HTML, and 3. how to replace it with other text (in Python). Thanks a lot.
You can just open the directory in VS Code and bulk-replace all instances of any string in all the HTML files directly. I needed to do the same and found this to be a very convenient method.
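If you do want to script it in Python as the question asks, here is a minimal sketch. The folder path and the search/replace strings are placeholders, and it assumes the files are UTF-8 and that a plain text substitution (no HTML parsing) is enough:

    from pathlib import Path

    folder = Path("C:/path/to/html/files")   # placeholder path
    old, new = "james", "tom"                # text to find and its replacement

    # 1. open every .html file in the folder
    for path in folder.glob("*.html"):
        text = path.read_text(encoding="utf-8")
        # 2. find the text and 3. replace it, then write the file back
        if old in text:
            path.write_text(text.replace(old, new), encoding="utf-8")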
I tried several conversions -- I've used Altova and a couple of other free software options. I don't know how to attach the XML file for you to view, but I would really appreciate some help!
When I use the Export Bookmarks option in the Chrome browser, the only option it allows by default is to save as an HTML file. If you did mistakenly save it as an XML file, then editing it with an XML editor like the Altova one you probably used will most likely create more errors.
1) I'd suggest you first right-click on the XML bookmarks file, go to the Properties tab, and see if you can restore the oldest, least-modified version of it.
2) Then simply try renaming the extension of the bookmarks file from .xml to .html to see if it now works.
I have both a JSON file and a JL file on my computer, but when I open them in Notepad their structure looks the same. What is the difference between them? Where should I use each one?
Actually, at the time I asked this question I didn't know that "the file type is no guarantee of what is inside it". In other words, I thought that every extension had its own separate specification, and that if a file's name ends in ".something", there is a unique format behind it. But now I know that I can create a file, write anything I want into it, name it ".peyman", and there is nothing special about it!
What was that file? It was in the JSON Lines file format.
Where did I find it? In Scrapy: instead of writing scrapy crawl name -o file.json, I saw that somebody wrote scrapy crawl name -o file.jl. I tried it, and the file was 99% like a JSON file, so I wondered and asked this question here.
So:
What is the difference between a .JSON file and a .JL file? Now I know that the better question is "What is the difference between a .JSON file and a .JL file in Scrapy?"
JSON Lines is like JSON but without the "[" and "]" at the beginning and the end: each line is one complete JSON value, with no commas between items. Scrapy uses it because it can append each scraped item as a new line, without having to rewrite the whole file or risk producing an invalid JSON array when appending to an existing export.
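A small sketch of that difference (the file names and items are made up): a .json export is one array that must be valid as a whole, while a .jl export is one object per line, so new items can simply be appended:

    import json

    items = [{"name": "laptop", "price": 999}, {"name": "mouse", "price": 19}]

    # JSON: one array, written (and rewritten) as a whole
    with open("items.json", "w", encoding="utf-8") as f:
        json.dump(items, f)

    # JSON Lines: one object per line, so new items can simply be appended
    with open("items.jl", "a", encoding="utf-8") as f:
        for item in items:
            f.write(json.dumps(item) + "\n")

    # Reading JSON Lines back, one record at a time
    with open("items.jl", encoding="utf-8") as f:
        records = [json.loads(line) for line in f if line.strip()]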
There are quite a few things that a .jl file extension could refer to. If I remember correctly, it originally had something to do with the window manager Sawfish.
Sawfish was developed in Lisp, and a .jl file was a Lisp source file for Sawfish. However, I'm guessing (because you said what's inside looks like JSON) that's not what you're asking about.
In that case, I do recall a few projects on GitHub... JSON lambda and Julia.
Both of those may be the reason why you're seeing JSON in a jl file. Without more information on where you got that file, or what it was part of, though, we won't be able to help you much.
That said, file extensions rarely matter on Linux. On Windows they're far more important, but on Linux you could literally append anything to a file as an "extension" (i.e. thisfile.whatever) and still open it in an editor. The same is true for most editors on Windows.
Most likely, the packager of that file decided on .jl for their own reasons, rather than following the convention of using .json.
I guess the JL extension is used for many purposes, but JL is also one of the few extensions used for JSON Lines (also known as NDJSON or JSONL).
This format can contain multiple JSON values, one JSON value (in "compact" formatting) per line, and it is useful for e.g. streaming or logging.
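For example, a log written this way can be processed one record at a time without parsing the whole file as a single JSON document (the file name and the "level" field here are made up):

    import json

    # Stream a large NDJSON log one record at a time instead of
    # loading the whole file into memory.
    with open("events.ndjson", encoding="utf-8") as log:
        errors = sum(1 for line in log
                     if line.strip() and json.loads(line).get("level") == "error")
    print(errors, "error events")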
I just wonder if I can put a .doc or .txt file in the HTML instead of placing too much code to show the data. I think there should be some method, but I'm not sure about it.
You can put a direct URL to a .doc or .txt file on your server without even using HTML, if that's most convenient. A browser will typically display a .txt file right in the browser itself. A .doc file would likely be offered for saving to disk so you can view it in a program like Word.
If you are talking about embedding data into an existing HTML page there are ways to do so but it would require knowing more about your server. Are you using PHP to respond to requests?
You can use a number of methods to achieve this. The most commonly used are PHP includes, if the server is capable of executing PHP scripts. JavaScript is also commonly used, and there are many examples of how to do this. It could also be achieved using SSI (server-side includes), but that method is not commonly used and requires renaming the file with an .shtml extension.
Hope this helps.
What would be a good tool, or set of tools, to download a list of URLs and extract only the text content?
Spidering is not required, but control over the downloaded file names and threading would be a bonus.
The platform is linux.
wget -O - <url> | html2ascii
Note: html2ascii can also be called html2a or html2text (and I wasn't able to find a proper man page on the net for it).
See also: lynx.
Python Beautiful Soup allows you to build a nice extractor.
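A minimal sketch of such an extractor, assuming the requests and beautifulsoup4 packages are installed; the URL list and the output naming scheme are placeholders, and the thread pool covers the threading wish from the question:

    from concurrent.futures import ThreadPoolExecutor
    from urllib.parse import urlparse

    import requests
    from bs4 import BeautifulSoup

    urls = ["https://example.com/", "https://example.org/"]   # placeholder list

    def fetch_text(url):
        html = requests.get(url, timeout=30).text
        text = BeautifulSoup(html, "html.parser").get_text(separator="\n", strip=True)
        # Derive an output file name from the host name (placeholder scheme)
        name = urlparse(url).netloc.replace(":", "_") + ".txt"
        with open(name, "w", encoding="utf-8") as out:
            out.write(text)
        return name

    with ThreadPoolExecutor(max_workers=4) as pool:
        for saved in pool.map(fetch_text, urls):
            print("saved", saved)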
I know that w3m can be used to render an HTML document and put the text content in a text file:
w3m www.google.com > file.txt for example.
For the remainder, I'm sure that wget can be used.
Look for the Simple HTML DOM parser for PHP on SourceForge. Use it to parse HTML that you have downloaded with cURL. Each DOM element will have a "plaintext" attribute, which should give you only the text. I used this combination very successfully in a lot of applications for quite some time.
Perl (Practical Extraction and Report Language) is a scripting language that is excellent for this type of work. http://search.cpan.org/ contains a lot of modules that have the required functionality.
Use wget to download the required HTML and then run html2text on the output files.
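A sketch of that pipeline driven from Python, assuming wget and html2text are installed and on your PATH, and that urls.txt (a made-up name) holds one URL per line:

    import subprocess
    from pathlib import Path

    # Assumptions: wget and html2text are on the PATH; urls.txt and the
    # pages/ directory are placeholder names.
    Path("pages").mkdir(exist_ok=True)
    subprocess.run(["wget", "--input-file=urls.txt", "--directory-prefix=pages"],
                   check=True)

    for html_file in Path("pages").iterdir():
        if html_file.suffix == ".txt":
            continue  # skip text files produced by a previous run
        text = subprocess.run(["html2text", str(html_file)],
                              capture_output=True, text=True, check=True).stdout
        html_file.with_suffix(".txt").write_text(text, encoding="utf-8")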