R Markdown HTML output doesn't match the RStudio output

Apologies in advance if this question sounds stupid: I am at ease with R but relatively new to Markdown. I understand that an .Rmd script is meant to be reproducible, so everything it uses has to come from the script itself and not from the Global Environment or another script. I have done the tedious work of copying my very long initial .R script into an .Rmd, with explanations, like a report. My problem is the following: after running the code in the .Rmd script I get the outputs below each chunk. I then knit it, and the outputs in the HTML document are not the same. The essentials are the same, but the model summaries are not. I simply cannot understand why.
I have of course tried restarting RStudio, cleaning up the Global Environment and starting again from a blank script. The tooth-grinding problem is that my script is long and some chunks are heavy (like imputation of missing data using MICE). So every time I have a problem and have to re-compute everything, it's a very long coffee break.
While I cannot include the code for this reason, I still hope very much that someone has encountered this problem before and can share their experience. I particularly want to know what happens if you leave some chunks {r eval=FALSE} and run them manually for the first time only. Could this be a source of the problems? If so, how do you guys Knit long computation-heavy scripts?
Thanks very much in advance.
P.S. After throwing this bottle into the sea, I'll go and try splitting my script into a few scripts to pinpoint the problem (and to be able to include the part that causes it).

So, apparently the bug above has the following explanation:
The outputs shown below the code chunks in RStudio (.Rmd) are based on the data held in the Global Environment.
The knitted HTML, by contrast, is rendered by re-running the code in the .Rmd itself.
Normally this isn't a problem. But if some code chunks are set to eval=FALSE to skip a lengthy repeated computation (in my case, data imputation using MICE), then the Global Environment holds imputed data while the knit runs on non-imputed data. So the models in the knitted HTML are fitted on an incomplete data set and are all off.
Before receiving the suggestion to use cache=TRUE, I found another workaround: do all required transformations and imputations once, save the data in a new code chunk, then set eval=FALSE for that chunk and for the chunks above it that no longer have to be run (even though some of them still have to be shown).
Then I import the treated data in a hidden chunk (eval=TRUE, include=FALSE) and run the rest of the training, etc. While it's technically not the best in terms of reproducibility, it saved my neck and computation time.
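The compute-once, save, reload pattern described above is not specific to R. Here is a minimal sketch of it in Python rather than R, purely for illustration; the helper name load_or_compute and the use of pickle are assumptions of this sketch, not part of the original workflow.

```python
import os
import pickle


def load_or_compute(cache_path, compute):
    """Return the cached result if it exists, else compute and save it.

    `compute` is a zero-argument function standing in for the expensive
    step (for example, an imputation). Hypothetical helper for illustration.
    """
    if os.path.exists(cache_path):
        # Cheap path: reload what an earlier run already produced.
        with open(cache_path, "rb") as fh:
            return pickle.load(fh)
    # Expensive path: run once, then persist for later runs.
    result = compute()
    with open(cache_path, "wb") as fh:
        pickle.dump(result, fh)
    return result
```

As with the hidden-chunk workaround, the cached file has to be deleted by hand whenever the upstream steps change, otherwise stale data is reloaded.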


Need help downloading and reading a zipped CSV file in memory with Clojure

I have an external site from which I want to download a zipped CSV file. Currently, I download the zipped file, save it to disk, unzip it, save the unzipped file to disk, and then read the unzipped file with the CSV reader. Several of these steps are unnecessary, so I set out to trim them.
This amazing answer helped me to get myself going. I tried to use the first option linked there (GZIPInputStream), but I get a "Not GZIP format" error, so I suppose I have to go to the second option.
This is my current code, and it does what I want it to do:
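(import 'java.util.zip.ZipInputStream)

;; Returns the HTTP response body as an InputStream (note the empty argument vector).
(defn download-zipped-stream! []
  (:body (clj-http.client/get "www.example.com" {:as :stream})))

(with-open [stream (ZipInputStream. (download-zipped-stream!))]
  ;; Position the stream at the first entry of the archive before reading.
  (.getNextEntry stream)
  (doall (clojure.data.csv/read-csv (clojure.java.io/reader stream) :separator \;)))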
I literally got to this by trial and error. There are mainly three things I'd like to change / understand about this code.
Ideally, I would want to break my code in two parts: one to download and unzip the content, returning a stream - the reason being that I want to decide later whether I want to read it as a csv directly, or write to disk (I don't want to lose this option, because, during development, it is much easier to read a pre-downloaded csv file than downloading the big content every single time). Turns out that, if I try to access the stream outside of the with-open call, I get a "stream closed" error (which, from what I understand, makes total sense).
In the above code, I have to call .getNextEntry, or I get an empty list. As someone who is striving to write functional code, this bothers me because, from what I can understand, I'm dealing with state here: my stream object looks mutable, which is something I really don't want. Isn't there a way to work around this step and simply not have it there?
I tried to call read-csv directly on the stream object, but read-csv apparently doesn't know how to handle a ZipInputStream. Seeing this, I simply and hopefully threw an io/reader call in between, and it worked. I don't know if this is the best approach, though. Is it correct?
I'm quite new to Clojure, and I'm completely clueless about Java in general, so, as you can see, my knowledge of those stream objects is pretty limited. I tried to read something about them in Java, but I quit because I was not sure how much of it would be useful for someone learning Clojure, so any pointers are also appreciated.
I think you are on the right approach. Suggestions to consider:
Consider using wget to manually download the *.csv.gz file to your local disk. Then, just open that local file instead of using clj-http.client/get.
I haven't played much with ZipInputStream, but if using .getNextEntry() seems to be required, just go with it.
The examples for read-csv show using a Reader to give access to the input file, so this is the expected behavior.
This template project shows how I like to organize a Clojure project & source code. Be sure to peruse the list of documentation provided.
Don't forget to utilize cljdoc.org for looking up Clojure library API docs. For example, see the API docs for data.csv.
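For comparison, here is a minimal sketch of the same three steps (open the archive from memory, move to its first entry, wrap it in a text reader for the CSV parser) written in Python rather than Clojure. The function name read_zipped_csv is hypothetical, and the semicolon separator and UTF-8 encoding are assumptions.

```python
import csv
import io
import zipfile


def read_zipped_csv(zipped_bytes, delimiter=";", encoding="utf-8"):
    """Parse the first CSV entry of an in-memory zip archive into rows."""
    with zipfile.ZipFile(io.BytesIO(zipped_bytes)) as archive:
        # Equivalent of .getNextEntry: pick the first entry in the archive.
        first_entry = archive.namelist()[0]
        with archive.open(first_entry) as raw:
            # Equivalent of io/reader: decode the byte stream to text.
            text = io.TextIOWrapper(raw, encoding=encoding, newline="")
            # Materialize the rows before the stream is closed (like doall).
            return list(csv.reader(text, delimiter=delimiter))
```

The rows are read eagerly for the same reason doall is needed in the Clojure version: a lazy result would try to read from the stream after it has been closed.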
Update
You may also want to review this answer.
Use https://github.com/techascent/tech.ml.dataset, optionally with https://scicloj.github.io/tablecloth/index.html (a dplyr-like API for TMD).
It also has the advantage of being extremely fast and able to handle datasets that can't fit in memory, and it talks SQL, Arrow, et al. Join the conversation about it here:
https://clojurians.zulipchat.com/#narrow/stream/151924-data-science/topic/tech.2Eml.2Edataset

searching in html/txt without loading it into program [duplicate]

I have a FindFile routine in my program which will list files, but if the "Containing Text" field is filled in, then it should only list files containing that text.
If the "Containing Text" field is entered, then I search each file found for the text. My current method of doing that is:
var
  FileContents: TStringList;
begin
  FileContents := TStringList.Create; // the list must be created before use
  try
    FileContents.LoadFromFile(Filepath);
    Found := Pos(TextToFind, FileContents.Text) > 0;
  finally
    FileContents.Free;
  end;
The above code is simple, and it generally works okay. But it has two problems:
It fails for very large files (e.g. 300 MB)
I feel it could be faster. It isn't bad, but why wait 10 minutes searching through 1000 files, if there might be a simple way to speed it up a bit?
I need this to work for Delphi 2009 and to search text files that may or may not be Unicode. It only needs to work for text files.
So how can I speed this search up and also make it work for very large files?
Bonus: I would also want to allow an "ignore case" option. That's a tougher one to make efficient. Any ideas?
Solution:
Well, mghie pointed out my earlier question How Can I Efficiently Read The First Few Lines of Many Files in Delphi, and as I answered, it was different and didn't provide the solution.
But he got me thinking that I had done this before and I had. I built a block reading routine for large files that breaks it into 32 MB blocks. I use that to read the input file of my program which can be huge. The routine works fine and fast. So step one is to do the same for these files I am looking through.
So now the question was how to efficiently search within those blocks. Well I did have a previous question on that topic: Is There An Efficient Whole Word Search Function in Delphi? and RRUZ pointed out the SearchBuf routine to me.
That solves the "bonus" as well, because SearchBuf has options which include Whole Word Search (the answer to that question) and MatchCase/noMatchCase (the answer to the bonus).
So I'm off and running. Thanks once again SO community.
The best approach here is probably to use memory mapped files.
First you need a file handle, use the CreateFile windows API function for that.
Then pass that to CreateFileMapping to get a file mapping handle. Finally use MapViewOfFile to map the file into memory.
To handle large files, MapViewOfFile can map just a range of the file into memory, so you can e.g. map the first 32MB, then use UnmapViewOfFile to unmap it, followed by a MapViewOfFile for the next 32MB, and so on. (EDIT: as was pointed out below, make sure that consecutive blocks overlap by at least the length of the text you are searching for, so that you do not overlook text split at a block boundary. Each view must also start at an offset that is a multiple of the system allocation granularity, which GetSystemInfo reports and which is typically 64 KB rather than 4 KB, so round the overlap up to a multiple of that.)
To do the actual searching once (part of) the file is mapped into memory, you can make a copy of the source for StrPosLen from SysUtils.pas (it's unfortunately defined in the implementation section only and not exposed in the interface). Leave one copy as is and make another copy, replacing Wide with Ansi every time. Also, if you want to be able to search in binary files which might contain embedded #0's, you can remove the (Str1[I] <> #0) and part.
Either find a way to identify if a file is ANSI or Unicode, or simply call both the Ansi and Unicode version on each mapped part of the file.
Once you are done with each file, make sure to call CloseHandle first on the file mapping handle and then on the file handle. (And don't forget to call UnmapViewOfFile before either of them.)
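The windowed mapping described above can be sketched in a few lines. This is a Python illustration, not Delphi, using the standard mmap module in place of the CreateFileMapping/MapViewOfFile calls; the function name mmap_contains and the default window size are assumptions of the sketch.

```python
import mmap
import os


def mmap_contains(path, needle, window=32 * 1024 * 1024):
    """Return True if the bytes `needle` occur in the file at `path`.

    The file is mapped one window at a time; consecutive windows overlap
    so that a match straddling a window boundary is not missed.
    """
    size = os.path.getsize(path)
    if not needle or size < len(needle):
        return False  # this sketch treats an empty needle as "not found"
    gran = mmap.ALLOCATIONGRANULARITY
    # Overlap of at least len(needle) - 1 bytes, rounded up to the
    # granularity so every window offset stays properly aligned.
    overlap = -(-(len(needle) - 1) // gran) * gran
    window = max(window // gran * gran, overlap + gran)
    step = window - overlap
    with open(path, "rb") as fh:
        offset = 0
        while offset < size:
            length = min(window, size - offset)
            with mmap.mmap(fh.fileno(), length,
                           access=mmap.ACCESS_READ, offset=offset) as view:
                if view.find(needle) != -1:
                    return True  # stop at the first match
            offset += step
    return False
```

The search inside each window uses the built-in find here; a hand-written routine such as the StrPosLen copy mentioned above would take its place in Delphi.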
EDIT:
A big advantage of using memory mapped files instead of using e.g. a TFileStream to read the file into memory in blocks is that the bytes will only end up in memory once.
Normally, on file access, Windows first reads the bytes into the OS file cache. It then copies them from there into the application's memory.
If you use memory mapped files, the OS can directly map the physical pages from the OS file cache into the address space of the application without making another copy (saving the time needed for the copy and halving memory usage).
Bonus Answer: By calling StrLIComp instead of StrLComp you can do a case insensitive search.
If you are looking for text string searches, look into the Boyer-Moore search algorithm. Combined with memory mapped files, it makes a really fast search engine. There are some Delphi units around that contain implementations of this algorithm.
To give you an idea of the speed: I currently search through 10-20MB files and it takes on the order of milliseconds.
Oh, I just read that it might be Unicode. I'm not sure if it supports that, but definitely look down this path.
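To show the idea behind this family of algorithms, here is a minimal Python sketch of the Horspool simplification of Boyer-Moore (bad-character shifts only, not the full algorithm and not any of the Delphi units mentioned). It works on bytes, so the text encoding is left to the caller; the function name horspool_find is hypothetical.

```python
def horspool_find(haystack: bytes, needle: bytes) -> int:
    """Return the index of the first occurrence of needle, or -1."""
    m, n = len(needle), len(haystack)
    if m == 0:
        return 0
    if m > n:
        return -1
    # For each byte of the needle except the last: how far the window may
    # jump when that byte is aligned with the window's final position.
    shift = {byte: m - 1 - i for i, byte in enumerate(needle[:-1])}
    pos = 0
    while pos <= n - m:
        if haystack[pos:pos + m] == needle:
            return pos
        # Bytes that do not occur in the needle allow a full-length jump.
        pos += shift.get(haystack[pos + m - 1], m)
    return -1
```

The speed comes from the jumps: with a long search string, most of the file is never inspected byte by byte.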
This is a problem connected with your previous question How Can I Efficiently Read The First Few Lines of Many Files in Delphi, and the same answers apply. If you don't read the files completely but in blocks then large files won't pose a problem. There's also a big speed-up to be had for files containing the text, in that you should cancel the search upon the first match. Currently you read the whole files even when the text to be found is in the first few lines.
May I suggest a component? If so, I would recommend ATStreamSearch.
It handles ANSI and Unicode (and even EBCDIC, Korean, and more).
Or the class TUTBMSearch from JclUnicode (JEDI JCL). It was mainly written by Mike Lischke (VirtualTreeview). It uses a tuned Boyer-Moore algorithm that ensures speed. The drawback in your case is that it works entirely in Unicode (WideStrings), so converting from String to WideString risks being costly.
It depends on what kind of data you are going to search. To get really efficient results, let your program parse the directories of interest, including all the files in them, and keep the data in a database. You can then query that database each time for a specific word in a specific list of files, which can be generated from the search path. A database query can return results in milliseconds.
The issue is that you have to let it run and parse all the files after installation, which may take more than an hour depending on the amount of data you wish to parse.
This database should be updated each time your program starts. This can be done by comparing the MD5 value of each file to see whether it has changed, so you don't have to parse all your files every time.
This way of working is worthwhile if all your data sits in a fixed place and you analyse the same files repeatedly rather than entirely new files each time. Some code analysers work like this and they are really efficient. You invest some time in parsing and saving the interesting data, and in return you can jump to the exact place where a search word appears and list every place it appears in a very short time.
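The MD5-based update check can be sketched as follows, in Python for illustration. The names file_md5 and changed_files are hypothetical, and a plain dictionary stands in for the database table of stored hashes.

```python
import hashlib


def file_md5(path, block_size=1 << 20):
    """Hash a file in blocks so large files never sit fully in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for block in iter(lambda: fh.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def changed_files(paths, known_hashes):
    """Return the paths whose content differs from the stored hash.

    `known_hashes` maps path -> hex digest and is updated in place, so
    only the returned files need to be parsed again.
    """
    changed = []
    for path in paths:
        current = file_md5(path)
        if known_hashes.get(path) != current:
            known_hashes[path] = current
            changed.append(path)
    return changed
```

Hashing still reads every file once per start-up; comparing file size and modification time first would be a cheaper, less strict check.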
If the files are to be searched multiple times, it could be a good idea to use a word index.
This is called "Full Text Search".
It will be slower the first time (text must be parsed and indexes must be created), but any future search will be immediate: in short, it will use only the indexes, and not read all text again.
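A word index of this kind is essentially a map from each word to the set of files containing it. Here is a minimal Python sketch under simplifying assumptions: documents are already loaded as strings, words are runs of letters, digits and underscores, and a query matches files containing all of its words. The names build_index and search are hypothetical.

```python
import re
from collections import defaultdict

WORD = re.compile(r"\w+")


def build_index(docs):
    """Map each lower-cased word to the set of document names containing it."""
    index = defaultdict(set)
    for name, text in docs.items():
        for word in WORD.findall(text.lower()):
            index[word].add(name)
    return index


def search(index, query):
    """Return the documents that contain every word of the query."""
    words = WORD.findall(query.lower())
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result
```

Because the index is lower-cased when it is built, case-insensitive lookups come for free; exact phrase matching would additionally need word positions.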
You have the exact parser you need in The Delphi Magazine Issue 78, February 2002:
"Algorithms Alfresco: Ask A Thousand Times
Julian Bucknall discusses word indexing and document searches: if you want to know how Google works its magic this is the page to turn to."
There are several FTS implementations for Delphi:
Rubicon
Mutis
ColiGet
Google is your friend..
I'd like to add that most databases have an embedded FTS engine. SQLite3 even has a very small but efficient implementation, with page ranking and such.
We provide direct access from Delphi, with ORM classes, to this Full Text Search engine, named FTS3/FTS4.

.tbc to .tcl file

This is a strange question; I searched but couldn't find a satisfactory answer.
I have a compiled Tcl file, i.e. a .tbc file. Is there a way to convert this .tbc file back to a .tcl file?
I read here that someone mentioned ::tcl_traceCompile and said it could be used to disassemble the .tbc file. But being a novice Tcl user, I am not sure whether this is possible or, more to the point, how exactly to use it.
I know that the Tcl compiler doesn't compile all statements, so those statements can easily be seen in the .tbc file, but can we get the whole Tcl source back from the .tbc file?
Any comment would be great.
No, or at least not without a lot of work; you're doing something that quite a bit of effort was put in to prevent (the TBC format is intended for protecting commercial code from prying eyes).
The TBC file format is an encoding of Tcl's bytecode, which is not normally saved at all; the TBC stands for Tcl ByteCode. The TBC format data is only produced by one tool, the commercial “Tcl Compiler” (originally written by either Sun or Scriptics; the tool dates from about the time of the transition), which really is a leveraging of the built-in compiler that every Tcl system has together with some serialization code. It also strips as much of the original source code away as possible. The encoding used is unpleasant; you want to avoid writing your own loader of it if you can, and instead use the tbcload extension to do the work.
You'll then need to use it with a custom build of Tcl that disables a few defensive checks so that you can disassemble the loaded code with the tcl::unsupported::disassemble command (which normally refuses to take apart anything coming from tbcload); that command exists from Tcl 8.5 onwards. After that, you'll have to piece together what the code is doing from the bytecodes; I'm not aware of any tools for doing that at all, but the bytecodes are mostly fairly high level so it's not too difficult for small pieces of code.
There's no manual page for disassemble; it's formally unsupported after all! However, that wiki page I linked to should cover most of the things you need to get started.
I can say partially "yes", and conditionally too. The condition is that the original Tcl code is written in a namespace and the procs are defined within the namespace's curly braces. Then you can source the .tbc file in tkcon/wish and see the code using info procs and the namespace command. Of course, you need to know the namespace name, but that can also be found.

SSIS Common rownumber for both outputs on a flatfile source

I have a small problem (I assume...)
I'm loading a flat file (CSV) and I want to add a row number to the data flow. Using the RowNumber transformation works well for both output paths (source and error) individually. But what if you want to use the same row number in both paths, to be able to track where (in the file) an error occurred? I have scratched my head long enough now, and I'm just throwing it out here since I'm pretty sure other people have stumbled across this one...
I have tried the script transformation which seems to work for a while but then it hangs the load.
Any suggestion on how to solve this issue is greatly appreciated.
If I understand you correctly, dynamically generating the number with a script component for the dataflow is not a problem for you.
What I would recommend is adopting the following philosophy for stable ETL processes that load from files:
Never cast anything in the connector; just import the fields as nvarchars of the maximum length they will reach.
Cast and control each column to your specification.
If a row cannot be read, you will not know the index, but you will know that the file is malformed (extremely rare in my experience, for half transferred files), and it should be rejected anyway.
A quick screenshot of a part of a file loading process shows how the rejection (after assigning row_id) can work (link to dataflow image). To this you can add further countless checks (duplicates...) and even have a repository for the loaded files to check upon the rejects and whatever else you might want to control (Link to control flow image).
In some of my processes, I even use a flat file connector and just import each row as a bulk text and then split it in columns with an intermediate script component, allowing for different versions of the columns in the files.
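The approach (read every row as raw text, number it first, then split and cast it, rejecting rows that fail) can be sketched outside SSIS. This is a Python illustration, not a script component; the function name split_rows, the semicolon delimiter and the list of cast functions are assumptions of the sketch.

```python
def split_rows(lines, casters, delimiter=";"):
    """Number every raw row, then route it to accepted or rejected.

    `casters` holds one conversion function per expected column. Because
    the row id is assigned before any parsing, both outputs share the
    same numbering and a reject can be traced back to its line in the file.
    """
    accepted, rejected = [], []
    for row_id, raw in enumerate(lines, start=1):
        fields = raw.rstrip("\r\n").split(delimiter)
        if len(fields) != len(casters):
            rejected.append((row_id, raw, "unexpected column count"))
            continue
        try:
            values = [cast(field) for cast, field in zip(casters, fields)]
        except ValueError as error:
            rejected.append((row_id, raw, str(error)))
            continue
        accepted.append((row_id, values))
    return accepted, rejected
```

A call such as split_rows(open_file, [int, str]) would treat the first column as an integer and the second as text; further checks (duplicates and so on) can be added as extra rejection branches.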
Anyway, sorry not to be more detailed (due to my status I can't add more links or any images), but I hope that you understand the concept.
Regards,
Francisco.

Free text search integrated with code coverage

Is there any tool which will allow me to perform a free text search over a system's code, but only over the code which was actually executed during a particular invocation?
To give a bit of background, when learning my way around a new system, I frequently find myself wanting to discover where some particular value came from, but searching the entire code base turns up far more matches than I can reasonably assess individually.
For what it's worth, I've wanted this in Perl and Java at one time or another, but I'd love to know if any languages have a system supporting this feature.
You can generally twist a code coverage tool's arm and get a report that shows the paths that have been executed during a given run. This report should show the code itself, with the first few columns marked up according to the coverage tool's particular notation on whether a given path was executed.
You might be able to use this straight up, or you might have to preprocess it and either remove the code that was not executed, or add a new notation on each line that tells whether it was executed (most tools will only show path information at control points):
So from a coverage tool you might get a report like this:
T- if(sometest)
{
x somecode;
}
else
{
- someother_code;
}
The notation T- indicates that the if statement only ever evaluated to true, and so only the first part of the code executed. The later notation 'x' indicates that this line was executed.
You should be able to form a regex that matches only when the first column contains a T, F, or x so you can capture all the control statements executed and lines executed.
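Such a filter can be sketched as follows, in Python for illustration. It assumes the toy report format shown above: an optional marker made of T, F, x and - characters, then whitespace, then the code. The marker syntax of a real coverage tool will differ, and an unmarked code line that happens to start with a lone x, T or F followed by a space would be misread. The function name search_executed is hypothetical.

```python
import re

# A marker counts as "executed" if it contains at least one T, F or x.
EXECUTED = re.compile(r"^\s*([TFx-]*[TFx][TFx-]*)\s+(.*)$")


def search_executed(report_lines, term):
    """Return (line number, code) for executed lines containing `term`."""
    hits = []
    for number, line in enumerate(report_lines, start=1):
        match = EXECUTED.match(line)
        if match and term in match.group(2):
            hits.append((number, match.group(2)))
    return hits
```

Searching only the captured code, rather than the whole line, keeps the marker characters themselves from producing false matches.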
Sometimes you'll only get coverage information at each control point, which then requires you to parse the C file and mark the executed lines yourself. Not as easy, but not impossible either.
Still, this sounds like an interesting question where the solution is probably more work than it's worth...
-Adam