I am in the process of converting a few older repositories, splicing and dicing them as needed to get one future repository. A bit of scar tissue in the history (most of it due to semantics in the previous VCS) had to be removed as well, but all in all the conversion seems to be smooth.
It just takes very long.
Background: I am using reposurgeon, which works based on git-fast-import streams. Target format is Mercurial.
How can I import new increments of code introduced into the old VCSs while they are still live, such that I don't have to run through the full conversion routine every time?
I have about 30,000 very tiny JSON files that I am attempting to load into a Spark DataFrame (from a mounted S3 bucket). It is reported here and here that there may be performance issues, an issue described as the Hadoop small files problem. Unlike what has been previously reported, I am not recursing into directories (all my JSON files are in one sub-folder). My code to load the JSON files looks like the following.
val df = spark
.read
.option("multiline", "true")
.json("/mnt/mybucket/myfolder/*.json")
.cache
So far, my job seems "stuck". I see 2 stages.
Job 0, Stage 0: Listing leaf files and directories
Job 1, Stage 1: val df = spark .read .option("multiline", "...
Job 0, Stage 0 is quite fast, less than 1 minute. Job 1, Stage 1, however, takes forever to even show up (lost track of time, but between the two, we are talking 20+ minutes), and when it does show up on the jobs UI, it seems to be "stuck" (I am still waiting on any progress to be reported after 15+ minutes). Interestingly, Job 0, Stage 0 has 200 tasks (I see 7 executors being used), and Job 1, Stage 1 has only 1 task (seems like only 1 node/executor is being used! what a waste!).
Is there any way to make this seemingly simple step of loading 30,000 files faster or more performant?
Something that I thought about was to simply "merge" these files into larger ones; for example, merge 1,000 JSON files into 30 bigger ones (using NDJSON). However, I am skeptical of this approach, since merging the files (let's say using Python) might itself take a long time (even the native Linux ls command in this directory takes an awfully long time to return); also, this approach might defeat the purpose of cluster computing end-to-end (not very elegant).
Merging the JSON files into much larger, newline-delimited files (aim for one, or at most 10 files, not 30) is really the only option here.
Python opening 30K files isn't going to be any slower than what you're already doing; it just won't be distributed.
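To make that concrete, here is a minimal single-machine sketch of the merge, written in Scala to match the question's code rather than Python. It assumes each small file holds exactly one JSON object, the output directory already exists, and the paths/file name are hypothetical; it writes a single merged NDJSON file (repeat per batch if you want up to 10).
import java.io.{File, PrintWriter}
import scala.io.Source
import com.fasterxml.jackson.databind.ObjectMapper
val inDir = new File("/mnt/mybucket/myfolder")                       // hypothetical mount path
val out = new PrintWriter(new File("/mnt/mybucket/merged/part-00000.json"))
val mapper = new ObjectMapper()                                      // already on Spark's classpath; used only to re-serialise
for (f <- inDir.listFiles() if f.getName.endsWith(".json")) {
  val src = Source.fromFile(f)
  // Collapse the (possibly pretty-printed) object onto a single line -> NDJSON
  try out.println(mapper.writeValueAsString(mapper.readTree(src.mkString)))
  finally src.close()
}
out.close()
Once the handful of merged files exist, the original spark.read.json call (without multiline) can split them by block and parallelise sensibly.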
Besides that, multiline=true was only added for the case where you already have a really large JSON file whose contents are a single top-level array or object. Before that option existed, "JSON Lines" was the only format Spark could read.
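For reference, JSON Lines simply means one complete JSON object per line (the field names here are made up):
{"id": 1, "lat": 52.1, "lon": 4.3}
{"id": 2, "lat": 52.2, "lon": 4.4}
whereas multiline mode exists for a file whose single top-level value is spread across many lines.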
The most consistent solution here would be to fix the ingestion pipeline that's writing all these files, so that you can accumulate records ahead of time and then dump larger batches. Or just use Kafka rather than reading data from S3 (or any similar filesystem).
There are two HTTP requests per file read, one HEAD and one GET; if the files are all kept in the same directory then the listing cost is simply one LIST per 5,000 objects, so 6 list calls. You'll pay ~$25 for 30K HEAD & GET calls.
If you use Spark to take that listing and generate a record from each individual file, you pay for all of that as well as the overhead of scheduling a task per file. The trick is to do the listing yourself (say in Python), make that listing the input RDD (i.e. one row per file), and have the map() read the file and emit the record representing that single file; a Scala sketch of this appears below. This addresses the Spark scheduling overhead, since the input listing is split into bigger parts pushed out to the workers, leaving only those HTTP HEAD/GET calls.
For this to work efficiently, use Hadoop 2.8+ JARs and do the listing with FileSystem.listFiles(Path, true), which performs a single recursive listing of the entire directory tree under the path and so uses the S3 LIST API at its most efficient.
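Here is a hedged Scala sketch of that approach. It assumes Spark 2.2+ (where spark.read.json accepts a Dataset[String]) plus Hadoop 2.8+ on the classpath; the bucket path and the partition count of 64 are purely illustrative.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import scala.collection.mutable.ArrayBuffer
val root = new Path("s3a://mybucket/myfolder")                       // illustrative path
val fs = root.getFileSystem(spark.sparkContext.hadoopConfiguration)
// One recursive listing of the whole tree: this is the call that drives the S3 LIST API efficiently.
val files = ArrayBuffer[String]()
val it = fs.listFiles(root, true)
while (it.hasNext) files += it.next().getPath.toString
// The listing becomes the input RDD (one row per file); the map does the HEAD/GET per file.
val texts = spark.sparkContext.parallelize(files.toSeq, 64).mapPartitions { paths =>
  val conf = new Configuration()                                     // picks up core-site.xml on the executors
  paths.map { p =>
    val path = new Path(p)
    val in = path.getFileSystem(conf).open(path)
    try scala.io.Source.fromInputStream(in).mkString finally in.close()
  }
}
import spark.implicits._
val df = spark.read.json(spark.createDataset(texts))                 // parse the per-file JSON strings
Compared with json("…/*.json"), the scheduling unit is now one partition of the listing rather than one task per file.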
(Once you've done this, why not post the code up somewhere for others?)
I recently downloaded my location history from Google, covering 2014 to the present.
The resulting .json file was 997,000 lines, plus a few.
All of the online converters would freeze and lock up unless I did it in really small slices, which isn't an option (time constraints).
I've gotten a manual process down between Sublime Text and LibreOffice to get my information transferred, but I know there's an easier way somewhere.
I even tried the fastFedora plug-in, which I couldn't get to work.
Even though I'm halfway done, and will likely finish up using my process, is there an easier way?
I can play with Java though I'm no pro. Any other languages that play well with .json?
I need a solution that supports nesting without flattening the file. Location data is nested and needs to remain nested (or at least grouped) to make sense.
Following only the instructions here - https://www.chromium.org/developers/how-tos/get-the-code - I have been able to successfully build and get a Chromium executable, which I can then run.
I have been playing around with the code (adding new buttons to the browser, etc.) for learning purposes. Each time I make a change (like adding a new button in the settings toolbar) and build with the ninja command, it takes over 3 hours to finish before I can run the executable. I guess it builds each and every file again.
I have a decently powerful machine (i7, 8GB RAM) running 64-bit Ubuntu. Are there ways to speed up the builds? (At the moment, I have literally just followed the instructions in the above-mentioned link, with no other optimizations to speed things up.)
Thank you very very much!
If all you're doing is modifying a few files and rebuilding, ninja will only rebuild the objects that were affected by those files. When you run ninja -C ..., the console displays the number of targets that need to be built. If you're modifying only a few files, that should be ~2000 at the high end (modifying popular header files can touch lots of objects). Modifying a single .cpp would result in rebuilding just that object.
Of course, you still have to relink, which can take a very long time. To make linking faster, try using a component build, which keeps everything in separate shared libraries rather than one big one that needs to be relinked for any change. If you're using GN, add is_component_build=true via gn args out/${build_dir}. For GYP, see this page.
You can also peruse faster linux builds and see if any of those tips apply to you. Unfortunately, Chrome is a massive project so builds will naturally be long. However, once you've done the initial build, incremental builds should be on the order of minutes rather than hours.
Follow the recently updated instructions here:
https://chromium.googlesource.com/chromium/src/+/HEAD/docs/windows_build_instructions.md#Faster-builds
In addition to using component builds you can disable NaCl, use jumbo builds, turn off symbols for WebCore, etc. Jumbo builds are still experimental at this point, but they already help build times and they will gradually help more.
Full builds will always take a long time even with jumbo builds, but component builds should let incremental builds be quite fast in many cases.
For building on Linux, you can see how to build faster at: https://chromium.googlesource.com/chromium/src/+/master/docs/linux_build_instructions.md#faster-builds
Most of them require adding build arguments. To edit build arguments, see the GN build configuration page at: https://www.chromium.org/developers/gn-build-configuration.
You can edit the build arguments on a build directory by:
$ gn args out/mybuild
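In the editor that opens you could then add arguments along these lines for faster development builds. The flag names below are from roughly the era of these docs and are examples, not an exhaustive or guaranteed list; run gn args out/mybuild --list to see what your checkout actually supports.
is_component_build = true               # link many small shared libraries instead of one huge binary
symbol_level = 1                        # lighter debug info, faster links
enable_nacl = false                     # skip building Native Client
remove_webcore_debug_symbols = true     # "turn off symbols for webcore", as mentioned above
use_jumbo_build = true                  # the experimental jumbo builds mentioned above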
I have a small problem (I assume...)
I'm loading a flat file (CSV) and I want to add a row number to the data flow. Using the Row Number transformation works fine for both output paths (source and error) individually. But what if you want to use the same row number in both paths, to be able to track where in the file an error occurred? I have scratched my head long enough now and I'm just throwing it out here, since I'm pretty sure other people have stumbled across this one...
I have tried the Script transformation, which seems to work for a while, but then it hangs the load.
Any suggestion on how to solve this issue is greatly appreciated.
If I understand you correctly, dynamically generating the number with a script component for the dataflow is not a problem for you.
What I would recommend is to adopt the following philosophy for stable ETL processes that load from files:
Never cast anything in the connector; just import the fields as nvarchars of the maximum length they will reach.
Cast and control each column to your specification.
If a row cannot be read at all, you will not know its index, but you will know that the file is malformed (extremely rare in my experience, usually a half-transferred file), and it should be rejected anyway.
A quick screenshot of part of a file-loading process shows how the rejection (after assigning row_id) can work (link to dataflow image). To this you can add countless further checks (duplicates...) and even keep a repository of the loaded files to check against the rejects and whatever else you might want to control (link to control flow image).
In some of my processes, I even use a flat file connector and just import each row as bulk text, then split it into columns with an intermediate script component, allowing for different versions of the columns in the files.
Anyway, sorry not to be more detailed (due to my status I can't add more links or any images), but I hope that you understand the concept.
Regards,
Francisco.
MonoDevelop creates those for every project. Should I include them in source control?
From a MonoDevelop blog post:
There were several long time pending bug reports, and I also wanted to improve a bit the performance and memory use. MonoDevelop creates a Parser Information Database (pidb) file for each assembly or project. This file contains all the information about classes implemented in an assembly, together with documentation pulled from Monodoc. A pidb file has three sections: the first one is a header which contains, among other things, the version of the file format (that version is checked when loading the pidb, and the file will be regenerated if it doesn't match the current implementation version). The second section is the index of the pidb file. It contains an index of all classes in the database. The index is always fully loaded in memory to be able to quickly locate classes. The third section of the file contains all the class information: list of methods, fields, properties, documentation for each of those, and so on. Each entry in the index has a file offset field, which can be used to completely load all the information of a class (the index only contains the name).
So it sounds like it's really just an optimization. I would personally not include it in source control unless you find it makes a big difference to performance: my guess is it will only really stay valid if only one person is working on the project at a time. (If it's big and changes regularly, you could find it adds significant overhead to the repository too. I haven't checked to see what the size is actually like, but it's worth checking.)
They're just cached code completion data. As the post Jon linked explains, the main reason is to save memory, though they do also save you from waiting for MD to parse all the source files and referenced assemblies when you open a project.
The pidb files can be regenerated pretty quickly, so there's no advantage to keeping them in the VCS. Indeed, as well as the VCS repository overhead, it could also cause problems if people are using different versions of MD with different pidb formats, so I'd strongly recommend against keeping them in source control.
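If the project happens to live in git, for example, a one-line ignore entry keeps them out of the repository (Subversion and Mercurial have equivalent ignore mechanisms):
# .gitignore: MonoDevelop parser information databases (regenerated automatically)
*.pidb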