Automatic generation of example index - concordion

A common case is to have examples organized in a file hierarchy with a top-level index that produces a report summary.
Currently we write this index file by hand, making it easy to forget to add a new example file.
Is there a way to generate this index file from the example file hierarchy?

I'm aware that some companies have done this, but nothing exists in the Concordion project to do this currently.
One option I've been considering is a runAll command that would run all children of an index. Would something like this work for you?
We'd need to consider what to actually run. Different patterns are:
1) We have just a top-level index specification that should run all descendants.
2) We have an index specification at each level and should run siblings and immediate child index specifications.
3) Maybe we have an index specification at each level and should run all immediate children including index specifications.
This could result in multiple commands, e.g. runSiblings, runChildren, runChildIndexes, runDescendants.
What do you think?

I don't think we need to introduce a new file type as an index file.
In the current state, we have an md file as a module of specification, and the run command that can invoke another module.
The problem we have is that, when using run, we have to write a title and the relative path for each subordinate module.
This can be avoided with a variant of the run command that:
1) takes a glob expression that evaluates to a list of modules to be run
2) is replaced in the report with the list of the sub-modules' titles, decorated with the pass or fail status of their respective executions
The resulting report would be equivalent to one resulting from a hand-written index module.
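For reference, a hand-written index module in Concordion's Markdown format looks roughly like the sketch below (file names are illustrative); each entry repeats exactly the title and relative path that a glob-based variant would generate automatically:

[Creating a user](CreateUser.md "c:run")
[Deleting a user](DeleteUser.md "c:run")
[Searching for users](search/SearchUsers.md "c:run")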

Related

How to write one Json file for each row from the dataframe in Scala/Spark and rename the files

I need to create one JSON file for each row of the dataframe. I'm using partitionBy, which creates subfolders for each file. Is there a way to avoid creating the subfolders and to rename the JSON files with the unique key?
Or any other alternatives? It's a huge dataframe with thousands (~300K) of unique values, so repartitioning is eating up a lot of resources and taking time. Thanks.
import org.apache.spark.sql.functions.col

df.select(Seq(col("UniqueField").as("UniqueField_Copy")) ++
    df.columns.map(col): _*)
  .write.partitionBy("UniqueField")
  .mode("overwrite").format("json").save("c:/temp/json/")
Putting all the output in one directory
Your example code is calling partitionBy on a DataFrameWriter object. The documentation tells us that this function:
Partitions the output by the given columns on the file system. If specified, the output is laid out on the file system similar to Hive's partitioning scheme. As an example, when we partition a dataset by year and then month, the directory layout would look like:
year=2016/month=01/
year=2016/month=02/
This is the reason you're getting subdirectories. Simply removing the call to partitionBy will get all your output in one directory.
Getting one row per file
Spark SQL
You had the right idea partitioning your data by UniqueField, since Spark writes one file per partition. Rather than using DataFrameWriter's partitionBy, you can use
df.repartitionByRange(numberOfJson.toInt, $"UniqueField")
to get the desired number of partitions, with one JSON file per partition. Notice that this requires you to know the number of JSON files you will end up with in advance. You can compute it with
val numberOfJson = df.select(count($"UniqueField")).first.getAs[Long](0)
However, this adds an additional action to your query, which will cause your entire dataset to be computed again. It sounds like your dataset is too big to fit in memory, so you'll need to carefully consider if caching (or checkpointing) with df.cache (or df.checkpoint) actually saves you computation time. (For large datasets that don't require intensive computation to create, recomputation can actually be faster)
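Putting those two snippets together, a minimal sketch might look like this (assuming spark.implicits._ is in scope for the $ syntax, and using an illustrative output path):

import org.apache.spark.sql.functions.count

// One extra action to find out how many output files we need
val numberOfJson = df.select(count($"UniqueField")).first.getAs[Long](0)

// One range partition per row (UniqueField is unique per row), then one JSON part file per partition
df.repartitionByRange(numberOfJson.toInt, $"UniqueField")
  .write.mode("overwrite").format("json").save("c:/temp/json/")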
RDD
An alternative to using the Spark SQL API is to drop down to the lower-level RDD API. Partitioning by key (in PySpark) for RDDs was discussed thoroughly in the answer to this question. In Scala, you'd have to specify a custom Partitioner, as described in this question.
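As a rough sketch of that custom-Partitioner idea (assuming UniqueField is a string column and that collecting its ~300K distinct values to the driver is acceptable; names and the output path are illustrative):

import org.apache.spark.Partitioner
import org.apache.spark.sql.functions.{col, struct, to_json}

// One partition per distinct key, so each key ends up in its own output file.
class KeyPartitioner(keys: Seq[String]) extends Partitioner {
  private val index = keys.zipWithIndex.toMap
  override def numPartitions: Int = keys.size
  override def getPartition(key: Any): Int = index(key.asInstanceOf[String])
}

val keys = df.select($"UniqueField").as[String].distinct.collect().toSeq

// Pair each row's JSON representation with its key, partition by key, and save.
df.select($"UniqueField", to_json(struct(df.columns.map(col): _*)).as("json"))
  .as[(String, String)]
  .rdd
  .partitionBy(new KeyPartitioner(keys))
  .values
  .saveAsTextFile("c:/temp/json")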
Renaming Spark's output files
This is a fairly common question, and AFAIK, the consensus is it's not possible.
Hope this helps, and welcome to Stack Overflow!

Unable to use query parameters

I am trying to use the new query parameters to do searches for elements based off child values. Ideally, I want to be able to do something like
https://dinosaur-facts.firebaseio.com/.json?orderBy="hash"&startAt=123&endAt=123
to get a specific element as long as I have given it a unique hash value. In trying to do this, I realized I couldn't do any sorting except for using orderBy="$key". I even went so far as to make a clone of the demo dinosaur-facts data set. I exported the data using the 'export json' button, then imported it into my data set using the 'import json' button, and verified that all the data is the same. I then tried to do the demo queries outlined here, replacing "dinosaur-facts" with my own domain, and it STILL doesn't work.
When I try
curl https://myapp.firebaseio.com/.json?orderBy="height"
The error I get is:
{"error" : "Index not defined"}
However, if you try
curl https://dinosaur-facts.firebaseio.com/.json?orderBy="height"
You get exactly what you would expect: all the dinosaurs sorted by their height. Is this an issue with my rules? Why can't I use this functionality that is being claimed? Has it not been rolled out to everyone? Do I need to pass my secret token for every one of these requests? Because when I do that, I get an error saying my auth token could not be parsed. I really have no idea what is happening, I just want to be able to do queries...
To be able to sort on a specific child, there must be an index on that child node. You can add such an index by adding an .indexOn rule to the security/rules in your dashboard, e.g.
".indexOn": ["hash"]
Most client-side APIs that Firebase provides ship their own implementation of the ordering/filtering, which allows them to order/filter data even when no index is present. This is handy for development purposes.
But the REST API has no client side, so ordering/filtering is only possible after you define the proper index.
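In the Security Rules JSON in your dashboard, that index rule sits under the path you query; a minimal sketch (the "dinosaurs" node and the indexed fields are illustrative and must match your own data and queries):

{
  "rules": {
    "dinosaurs": {
      ".indexOn": ["hash", "height"]
    }
  }
}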
See: https://www.firebase.com/docs/security/guide/indexing-data.html:
Indexes are not required for development unless you are using the REST API. The realtime client libraries can execute ad-hoc queries without specifying indexes. Performance will degrade as the data you query grows, so it is important to add indexes before you launch your app if you anticipate querying a large set of data.

SSIS Flat File - How to handle file versions / generations

I am working on a data warehouse project with a lot of source systems delivering flat files, which we load into our staging tables using SSIS; we are currently using the Flat File Source component.
However, after a while we need an extra column in one of the files, and from a given date the file specification changes to add that extra column. This happens quite frequently, and over time we accumulate quite a lot of versions.
According to answers I can find here and on the rest of the internet, the agreed method to handle this scenario seems to be to set up a new flat file source in a new, separate data flow for this version, to keep re-runnability of the ETL process for old files.
Method is outlined here for example: SSIS pkg with flat-file connection with fewer columns will fail
In our specific setup, columns are only ever added (old columns are never removed), and for logical reasons the new columns cannot be mandatory if we are to keep re-runnability for the older files in their separate data flows.
I don't think the method of creating a duplicate data flow handling the same set of columns over and over again is a good answer for a data warehouse project like ours. I would prefer a source component that takes the latest file version and has the ability to mark columns as "not mandatory", delivering nulls if they are missing.
Is anybody aware of an SSIS Flat File component that is more flexible in handling old file versions, or does anyone have a better solution for this problem?
I assume that such a component would need to approach the files on a named column basis rather than the existing left-to-right approach?
Any thoughts or suggestions are welcome!
The following will lose efficiency when processing (over having separate data flows), but will provide you with the flexibility to handle multiple file types within a single data flow.
You can arrange your flat file connection to return whole lines rather than individual columns, by only specifying the row delimiter. Connect this to a flat file source component, which will output a single column per row. We now have a single row that represents one of the many file types that you are aware of – the next step is to determine which file type you have.
Consume the output from the flat file source with a script component. Pass in the single column and pass out the superset of all possible columns. We have lost the metadata normally gleaned from a file source, so you will need to build up the column names / types / sizes in the script component's output definition.
Within the script component, take the line and break it into its component columns. You will have to perform a pattern match (maybe using System.Text.RegularExpressions.Regex.Match) to identify where a new column starts. Hopefully the file is well formed, which will aid you - beware of quotes and commas within text columns.
You can now determine the file type from the number of columns you have and default the missing columns. Set the row's output columns to pass out the constituent parts. You may want to attach a new column to record the file type with your output.
The rest of the process should be able to load your table with a single data flow as you have catered for all file types within your script.
I would not recommend that you undertake the above lightly. The benefit of SSIS is somewhat reduced when you have to code up all the columns / types etc. yourself; however, it will provide you with a single data flow to handle every file version, and it can be extended as new columns appear.
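To illustrate only the shape of that split-and-pad logic (sketched in Scala for brevity; the actual SSIS script component would be written in C# or VB, the comma delimiter is an assumption, and this naive split ignores quoted delimiters):

// Split one raw line into the superset of columns, padding missing ones with null
// and inferring the file version from the number of columns found.
def parseLine(line: String, supersetSize: Int): (Array[String], Int) = {
  val parts = line.split(",", -1)          // limit -1 keeps trailing empty columns
  val fileVersion = parts.length           // version inferred from column count
  (parts.padTo(supersetSize, null: String), fileVersion)
}

// e.g. parseLine("1,Smith,John", supersetSize = 5)
//      -> (Array("1", "Smith", "John", null, null), 3)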

Spring Batch custom partitioner for input file

I'm trying to partition a flat input file containing ';'-separated items.
The first item on a line indicates a category, and I would like to partition on this category so that a partition is created for each category, each handled by a dedicated thread.
But I'm puzzled as to how I can implement this partitioning logic in a custom Partitioner.
The partitioning seems to happen before the chunk-oriented step, thus before reading and writing, so it looks like I need to read the file line by line in the custom Partitioner, get the category field from each line, collect lines with equal categories, and create an ExecutionContext for each of these collections?
Am I looking in the right direction?
Can someone with experience provide a small example using a file (may be pseudo code)?
I've just hit this question myself. I think that a custom Partitioner needs to be paired with a custom ItemReader. The ItemReader gets initialized with data from the slave step's ExecutionContext (put there by the Partitioner) and consequently reads only the items that are right for that step.
You can find some custom partitioner implementations in these links here and here to get an overall idea. But I think you can't escape creating an ExecutionContext for each partition.
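To make that concrete, here is a rough sketch of such a Partitioner (written in Scala for brevity; the same shape works in Java). It reads the file once, finds the distinct categories, and puts one category into each ExecutionContext for a matching custom ItemReader to filter on:

import org.springframework.batch.core.partition.support.Partitioner
import org.springframework.batch.item.ExecutionContext
import scala.io.Source
import scala.jdk.CollectionConverters._

class CategoryPartitioner(inputFile: String) extends Partitioner {
  override def partition(gridSize: Int): java.util.Map[String, ExecutionContext] = {
    val source = Source.fromFile(inputFile)
    try {
      // The category is the first ';'-separated field on each line
      val categories = source.getLines().map(_.split(";")(0)).toSet
      // One ExecutionContext per category; the slave step's reader filters on "category"
      categories.map { category =>
        val ctx = new ExecutionContext()
        ctx.putString("category", category)
        s"partition-$category" -> ctx
      }.toMap.asJava
    } finally source.close()
  }
}

A step-scoped reader can then pick the value up with, e.g., @Value("#{stepExecutionContext['category']}") and skip every line whose first field does not match, which is the pairing described above.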

Namespaces and records in erlang

Erlang obviously has a notion of namespaces; we use things like application:start() every day.
I would like to know if there is such a thing as a namespace for records. In my application I have defined a user record. Everything was fine until I needed to include rabbit.hrl from RabbitMQ, which also defines user, conflicting with mine.
Online search didn't yield much to resolve this. I have considered renaming my user record by prefixing it with something, say "myapp_user". This will fix this particular issue until, I suspect, I hit another conflict, say with my "session" record.
What are my options here? Is adding a myapp_ prefix to all my records a good practice, or is there real support for namespaces with records that I am just not finding?
EDIT: Thank you everyone for your answers. What I've learned is that the records are global. The accepted answer made it very clear. I will go with adding prefixes to all my records, as I have expected.
I would argue that Erlang has no namespaces whatsoever. Modules are global (with the exception of a very unpopular extension to the language), names are global (either to the node or the cluster), pids are global, ports are global, references are global, etc.
Everything is laid out flat. Namespacing in Erlang is thus done by convention rather than by any other means. This is why you have <appname>_app, <appname>_sup, etc. as module names. The registered processes likely follow that pattern too, as do ETS tables, and so on.
However, you should note that records themselves are not global things: as JUST MY correct OPINION has put it, records are simply a compiler trick over tuples. Because of this, they're local to a module definition. Nobody outside of the module will see a record unless they also include the record definition (either by copying it or with a header file, the latter being the best way to do it).
Now I could argue that because you need to include .hrl files and record definitions on a per-module basis, there is no such thing as namespacing records; they're rather scoped in the module, like a variable would be. There is no reason to ever namespace them: just include the right one.
Of course, it could be the case that you include record definitions from two modules, and both records have the same name. If this happens, renaming the records with a prefix might be necessary, but this is a rather rare occurrence in my experience.
Note that it's also generally a bad idea to expose records to other modules. One of the problems of doing so is that all modules depending on yours now get to include its .hrl file. If your module then changes the record definition, you will have to recompile every other module that depends on it. A better practice is to implement functions to interact with the data. Note that get(Key, Struct) isn't always a good idea: if you can pick meaningful names (age, name, children, etc.), your code and API will make more sense to readers.
You'll either need to name all of your records in a way that is unlikely to conflict with other records, or you need to just not use them across modules. In most circumstances I'll treat records as opaque data structures and add functionality to the module that defines the record to access it. This will avoid the issue you've experienced.
I may be slapped down soundly by I GIVE TERRIBLE ADVICE here, with his deeper knowledge of Erlang, but I'm pretty sure there are no namespaces for records in Erlang. The record name is just an atom grafted onto the front of the tuple that the compiler builds for you behind the scenes. (Records are pretty much just a hack on tuples, you see.) Once compiled, there is no meaningful "namespace" for a record.
For example, let's look at this record.
-record(branch, {element, priority, left, right}).
When you instantiate this record in code...
#branch{element = Element, priority = Priority, left = nil, right = nil}.
...what comes out the other end is a tuple like this:
{branch, Element, Priority, nil, nil}
That's all the record is at this point. There is no actual "record" object and thus namespacing doesn't really make any sense. The name of the record is just an atom tacked onto the front. In Erlang it's perfectly acceptable for me to have that tuple and another that looks like this:
{branch, Twig, Flower}
There's no problem at the run-time level with having both of these.
But...
Of course there is a problem having these in your code as records since the compiler doesn't know which branch I'm referring to when I instantiate. You'd have to, in short, do the manual namespacing you were talking about if you want the records to be exposed in your API.
That last point is the key, however. Why are you exposing records in your API? The code I took my branch record from uses the record as a purely opaque data type. I have a function to build a branch record and that is what will be in my API if I want to expose a branch at all. The function takes the element, priority, etc. values and returns a record (read: a tuple). The user has no need to know about the contents. If I had a module exposing a (biological) tree's structure, it too could return a tuple that happens to have the atom branch as its first element without any kind of conflict.
Personally, to my tastes, exposing records in Erlang APIs is code smell. It may sometimes be necessary, but most of the time it should remain hidden.
There is only one record namespace and, unlike functions and macros, there can only be one record with a given name. However, for record fields there is one namespace per record, which means that there are no problems in having fields with the same name in different records. This is one reason why the record name must always be included in every record access.