Excessive depth in document: XML_PARSE_HUGE option for xml2::read_html() in R

First, I would like to apologize for asking a new question, as my profile does not yet allow me to comment on other people's posts, especially on two SO posts I've seen. So please bear with this older guy :-)
I am trying to read a list of 100 text files, ranging in size from around 90KB to 2MB, and then, using the qdap package, do some statistics on the text I extract from them, namely counting sentences, words, etc. The files contain webpage source previously scraped using RSelenium::remoteDriver$getPageSource() and saved to file using write(pgSource, fileName.txt). I am reading the files in a loop using:
library(xml2)   # for read_html()
pgSource <- readChar(file.path(fPath, fileNames[i]), nchars = 1e6)
doc <- read_html(pgSource)
which, for some of the files, throws:
Error in eval(substitute(expr), envir, enclos) :
Excessive depth in document: 256 use XML_PARSE_HUGE option [1]
I have seen two posts, SO33819103 and SO31419409, that point to similar problems, but I cannot fully understand how to use @shabbychef's workaround as suggested in both posts, i.e. the snippet suggested by @glossarch in the first link above:
library(drat)
drat:::add("shabbychef")
install.packages("xml2")
library(xml2)
EDIT: I noticed that when I previously ran another script, scraping the data live from the webpages using their URLs, I did not encounter this problem. The code was the same; I was just calling doc <- read_html(pgSource) after getting the source from RSelenium's remoteDriver.
What I would like to ask this gentle community is whether I am following the right steps in installing and loading xml2 after adding @shabbychef's drat repository, or whether I need to add some other step, as suggested in post SO17154308. Any help or suggestions are greatly appreciated. Thank you.
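For reference, here is a rough sketch (not from the original post) of how one could find out which of the saved files trip the depth limit; fPath and fileNames are assumed to be defined as in the loop above:
library(xml2)
failed <- character(0)
for (f in fileNames) {
  path <- file.path(fPath, f)
  # file.size() avoids truncating the ~2MB files that nchars = 1e6 would cut short
  src <- readChar(path, nchars = file.size(path))
  ok <- tryCatch({ read_html(src); TRUE },
                 error = function(e) { message(f, ": ", conditionMessage(e)); FALSE })
  if (!ok) failed <- c(failed, f)
}
failed   # the files that need the HUGE / patched-xml2 treatment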

I don't know if this is the right thing to do, but my question was answered by @hrbrmstr in one of his comments. I decided to post an answer so that people stumbling upon this question see that it has at least one.
The problem is basically solved by using the "HUGE" option when reading the HTML source. My problem only occurred when I loaded previously saved source; I did not see it while using the "live" version of the application, i.e. reading the source directly from the website.
Anyway, the August 2016 update of the excellent xml2 package now permits the use of the HUGE option as follows:
doc <- read_html(pageSource, options = "HUGE")
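For completeness, this is roughly how it fits into the loop from the question (fPath and fileNames as defined there); the extra "HUGE" entry is added on top of read_html()'s usual defaults:
library(xml2)
for (i in seq_along(fileNames)) {
  path <- file.path(fPath, fileNames[i])
  pgSource <- readChar(path, nchars = file.size(path))
  # "HUGE" lifts libxml2's depth limit; the other three options are read_html()'s defaults
  doc <- read_html(pgSource, options = c("RECOVER", "NOERROR", "NOBLANKS", "HUGE"))
  # ... qdap sentence/word statistics on xml_text(doc) go here ...
}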
For more information, please read the xml2 reference manual on CRAN: CRAN-xml2
I wish to thank @hrbrmstr again for his valuable contribution.

Related

VSCode not emulating any file other than my old version of the index

Faced with the prospect of having to code in order to build a digital annex for my thesis, I wrote a bunch of very raw lines: an index with some image links; it's a catalogue of sculptures. Each sculpture has its own standard page, called n0, designed to be a "paper" form ready to be filled in with info.
A friend of mine improved my original n0, but when I try to emulate it via VSCode, the software shows only my main index page. Why is that? I mean, I didn't even try to open the index! I've looked around to see if someone has had the same basic issue, but the fact is, noob as I am, I don't know exactly what to search for or which words to use... so I can't tell whether those people had the same issue I'm having.

Converting large JSON file to XLS/CSV file (Kickstarter campaigns)

As part of my Master's thesis, I'm trying to run some statistics on which factors affect whether crowdfunding campaigns get funded or not. I've been trying to get data from the largest platform Kickstarter.com. Unfortunately, they have removed all the non-successful campaigns from their website (unless you have the direct link).
Luckily, I'm not the only one looking for this data.
Webrobots.io have a scraper robot which crawls all Kickstarter projects and collects data in JSON format (http://webrobots.io/kickstarter-datasets/).
The latest dataset can be found on:
http://webrobots.io/wp-content/uploads/2015/10/Kickstarter_2015-10-22.json_.zip
However, my programming skills are limited, and I don't know how to convert it into an Excel file where I can manipulate the data and run my analysis. I found a few online converters, but the file is far too big for them (approx. 300 MB).
Can someone please help me get the file converted?
It will earn you an acknowledgement in my Master's thesis when it gets published :)
Thanks in advance!!!
I guess the answer to this depends massively on a few things:
1. What subject is the Master's covering? (Mainly to appease the many people who will probably assume you're hoping for others to do your homework for you! This might explain why the thread has already been down-voted.)
2. You mention your programming skills are limited... What programming skills do you have? What language would you be using to achieve this goal? Bear in mind that even with a fully coded solution, if it's not in a language you know, you might not be able to compile it!
3. What kind of information do you want from the JSON file?
With regard to question 3, I've looked at the JSON file and it contains hierarchical data, which is pretty difficult to replicate in a flat file, i.e. an Excel or CSV file (I should know, we had to do this a lot at a previous job of mine).
But I would look at the following plan of action to achieve what you're after (a rough R sketch of the same idea follows after the plan):
1. Use a JSON parser to deserialize the data into a class structure (Visual Studio can create the classes for you; see this S/O thread: How to show the "paste Json class" in visual studio 2012 when clicking on Paste Special?).
2. Once you've got the objects in memory, step through them one by one, pick out the data you want, append it to a comma-separated string (in C# I'd use a StringBuilder), and write the rows of data out to a file on disk.
Once this is complete, you'll have the data you want.
Depending on what data you want from the JSON file, step 2 could be the most difficult part, as you'd need to step into the different levels of the data hierarchy.
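If you would rather avoid the Visual Studio route altogether, here is a rough sketch of the same flatten-and-export idea in R using the jsonlite package. The file name is hypothetical and I am assuming the unzipped dump is either one big JSON document or newline-delimited JSON, so treat it as a starting point rather than a tested converter:
library(jsonlite)
json_path <- "Kickstarter_2015-10-22.json"   # hypothetical path to the unzipped file
# If the dump is one big JSON document, fromJSON() reads it directly;
# flatten = TRUE turns nested records into "parent.child" columns.
kick <- fromJSON(json_path, flatten = TRUE)
# If it turns out to be newline-delimited JSON (one record per line),
# read it record by record instead and flatten afterwards:
# kick <- flatten(stream_in(file(json_path)))
# Inspect the result with str(kick) first; assuming it is (or contains) a data
# frame, keep only the flat columns so write.csv() does not choke on any
# remaining list-columns, then export for Excel.
flat <- kick[, vapply(kick, is.atomic, logical(1)), drop = FALSE]
write.csv(flat, "kickstarter.csv", row.names = FALSE)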
Hope this points you in the right direction?
You may want to look at this Blog.
http://jdunkerley.co.uk/2015/09/04/downloading-and-parsing-met-office-historic-station-data-with-alteryx/
He uses a process with Alteryx that may line up with what you are trying to do. I am looking to do something similar, but haven't tried it yet. I'll update this answer if I get it to work.

Error Creating "Large" mediawiki Post

I recently installed the latest version of MediaWiki, and it's more or less running fine. However, whenever I try to post what I might consider a "large" entry, I get an error saying I cannot write to index.php, and so the post fails. I have looked through a lot of the documentation, including the variable settings, and cannot seem to nail down the issue or the solution. Is it possible that some of the characters in the post are preventing it? Or is there a limit on the amount of text content (characters or total size)? Any help would be greatly appreciated!
Mark
For starters, check that $wgMaxArticleSize (set in LocalSettings.php, in kilobytes) is greater than what you are trying to post. Even in that case, though, you should get an error message, not an outright failure. The content of the post is unlikely to cause problems; MediaWiki is UTF-8 safe.
Run through the checklist here as well: http://www.mediawiki.org/wiki/Manual:Errors_and_symptoms
Have you tried writing the text in a text editor and then pasting it into MediaWiki in smaller chunks, saving the page, then pasting another piece? As long as you don't need to do this too often, that could be significantly easier than trying to solve the underlying problem.

What data storage model is used to store articles in Wikipedia?

Articles in Wikipedia get edited; they can grow, shrink, be updated, etc. What file system/database storage layout is used underneath to support this? In a database course I read a bit about variable-length records, but that seemed aimed at small strings rather than whole documents. In a file system, files can grow and shrink, and I think that is done by chaining blocks together, so that each time we update a file the whole file does not have to be rewritten. Perhaps something similar is done here.
I am looking for specific names and terminology, maybe even how the schema in MySQL is defined (I think Wikipedia uses MySQL).
Below are links to some write-ups on Wikipedia's architecture, but I have not been able to answer my question from them:
http://swe.web.cs.unibo.it/twiki/pub/WikiFactory/AntonelloDiMuroThesis/Wikipedia-cheapandexplosivescalingwithLAMP.pdf
http://dom.as/uc/workbook2007.pdf
Thanks,
See:
http://www.mediawiki.org/wiki/Manual:Database_layout

How do I convert an hdb file? ... believed to be from an ACT! source

Any ideas?
I think the original source was a GoldMine database. Looking around, it appears that the file was likely built using an application called ACT!, which I gather is a huge product I don't really want to be deploying for a one-off file with a total size of less than 5 MB.
So ...
Does anyone know of a simple tool that I can run this file through to convert it to a standard CSV or something?
It does appear (when looking at it in Notepad and Excel) to be in some sort of CSV-type format, but it's as if the data is encrypted somehow.
OK, this is weird.
I got a little confused because the data looked like a complete mess; in actual fact the mess was the data, and that's what it was meant to look like.
Simply put, I opened the file in Notepad, it seemed to have a sort of pattern, so I dropped it on Excel.
Apparently Excel has no issues reading these files... strange, huh!
I am unaware of any third-party tooling for opening these files specifically, although there is an SDK available for C# which could resolve your problem with a little elbow grease.
The SDK can be acquired for free here.
Also, there is a developer forum which could provide some valuable resources, including training material with sample code, here.
Resources will be provided with the SDK.
Also, out of interest, since ACT! is a Sage product, do you have any Sage software floating about that you could attempt to access the data with? Most offices have!
Failing all of the above, there is a trial available for ACT! here!
Good luck with your problem!