I'd like to transform data from a CSV file like this:
ID 1 2 3 4 5 6 7 8 9
1 0 0 0 0 1 0 1 0 0
2 1 0 1 0 1 0 0 0 0
3 0 0 0 0 1 1 0 0 0
into a CSV file like this:
ID Item
1 5
1 7
2 1
2 3
2 5
3 5
3 6
How do I transform that file?
Get the data with a CSV file input step, pivot it with a Row Normalizer step around the ID column, filter out the rows whose value is 0 with a Filter rows step, and write the result to a Text file output step in CSV format.
The only challenging part may be the definition of the normalizer.
The Filter rows step is straightforward. If it is the first time you use it, note that if you specify a step to send true data to, you must also specify a step to send false data to. In your case, specify neither, so only true data will flow through.
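If you want to prototype the same reshaping outside of Kettle first, here is a minimal Python sketch of the pivot-and-filter logic using pandas (the library choice and the file names input.csv and output.csv are my assumptions, not part of the transformation above):

import pandas as pd

# Read the wide matrix; the sample above is whitespace-delimited.
df = pd.read_csv("input.csv", sep=r"\s+")

# Pivot the item columns into (ID, Item, value) rows -- the Row Normalizer step.
long_rows = df.melt(id_vars="ID", var_name="Item", value_name="value")

# Keep only the rows whose value is 1 -- the Filter rows step.
result = long_rows[long_rows["value"] == 1][["ID", "Item"]].sort_values(["ID", "Item"])

# Write the (ID, Item) pairs -- the Text file output step.
result.to_csv("output.csv", index=False)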
May I suggest you read the official getting-started guide: Introduction to transformations.
You can also get your hands on the Kettle book: Pentaho Kettle Solutions: Building Open Source ETL Solutions with Pentaho Data Integration by M. Casters, R. Bouman, and J. van Dongen. It is a huge and heavy book, but exhaustive and not yet outdated.
You may also have a look at the samples directory shipped with your distribution. It contains working examples for almost every step.
I have a small VB.NET app that creates DXF files from scratch, containing polylines and some text objects. It is working as intended and does its job at the moment, making use of some "minimum DXF requirements" info I found online.
As an upgrade for the app, I have decided to add some xdata to the polylines, and that's where I am having some trouble.
I have added the following lines inside the polyline definition in ENTITIES section:
1001
MYAPPID01
1002
{
1000
-Some string I want to associate with the polyline-
1002
}
And also created a table section for the appid as follows:
0
SECTION
2
TABLES
0
TABLE
2
APPID
2
MYAPPID01
70
0
0
ENDTAB
0
ENDSEC
I have also added an auto-load process in acaddoc lsp file to register the app:
(if (not (tblsearch "APPID" "MYAPPID01"))
(regapp "MYAPPID01")
)
My DXF files aren't loading and give an "Invalid application" error. What must I do to add this xdata with minimal additions to my normal DXF routine?
Any help about APPIDs and their registration would be great.
Thank you all in advance.
What's missing is the maximum table entry count tag (70, count) after the table type definition tag (2, APPID); the table entries that follow each start with a (0, APPID) tag. (Solution for DXF R12.)
0
SECTION <<< start table section
2
TABLES <<< section type
0
TABLE <<< start table
2 <<< group code 2 for
APPID <<< table type definition
70
10 <<< max table entry count
0 <<< group code 0 for
APPID <<< first table entry
2
ACAD <<< application name
70
0 <<< flags, 0 is good
0
APPID <<< second table entry
2
MYAPPID01 <<< application name
70
0 <<< flags
... and so on
0
ENDTAB
0
ENDSEC
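Since you generate the file from scratch, the corrected table boils down to writing the right sequence of group-code/value pairs. Here is a minimal Python sketch of that idea (the function name, file name, and file handling are my own illustration; your app is VB.NET, but the tag sequence is the same):

def write_appid_table(out, app_names):
    # Write a DXF R12 TABLES section containing only an APPID table.
    def tag(code, value):
        out.write(f"{code}\n{value}\n")

    tag(0, "SECTION")
    tag(2, "TABLES")
    tag(0, "TABLE")
    tag(2, "APPID")
    tag(70, len(app_names))      # maximum table entry count
    for name in app_names:
        tag(0, "APPID")          # each entry starts with (0, APPID)
        tag(2, name)             # application name
        tag(70, 0)               # standard flags, 0 is good
    tag(0, "ENDTAB")
    tag(0, "ENDSEC")

with open("tables.dxf", "w") as f:
    write_appid_table(f, ["ACAD", "MYAPPID01"])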
You can find more information here (valid for DXF R13 and later):
If I am correct, memoization is associated with READ operations. What is the term which refers to the same technique used for WRITE operations?
Example:
Let's say an app receives the following inputs,
0 1 2 2 2 2 2 3 4 3 3 3 3 3 4 4 4 2 1 2 5 5 5 5 3
Instead of saving everything, we tune the app to save only the transitions (i.e., ignore consecutive duplicates):
0 1 2 3 4 3 4 2 1 2 5 3
What is the (standard) term that can be used to describe the above technique?
I feel bad about using the same term, since the final outcome is quite different. In READ operations, if memoization is used, the final outcome remains the same. But in the above example for WRITE operations, the final output differs from the original input.
"Deduplication of adjacent/most-recent entries". Your example looks like what the uniq tool does.
If you preserved the count of duplicates, it would be a form of RLE (run-length encoding).
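Both techniques are easy to sketch with Python's itertools.groupby, which groups consecutive equal items (a minimal sketch using the numbers from the question):

from itertools import groupby

values = [0, 1, 2, 2, 2, 2, 2, 3, 4, 3, 3, 3, 3, 3, 4, 4, 4, 2, 1, 2, 5, 5, 5, 5, 3]

# Adjacent deduplication: keep only the transitions, like the uniq tool.
transitions = [key for key, _ in groupby(values)]
print(transitions)  # [0, 1, 2, 3, 4, 3, 4, 2, 1, 2, 5, 3]

# Run-length encoding: additionally remember how long each run was.
runs = [(key, sum(1 for _ in group)) for key, group in groupby(values)]
print(runs)  # [(0, 1), (1, 1), (2, 5), (3, 1), (4, 1), ...]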
As an aside, I guess you mean memoization as a way to speed up reads and this as a method to speed up writes. But I wouldn't say this is the opposite of memoization: it is the opposite of the general goal, yet not related to the particular method.
As far as I know, there is no applicable terminology for what you are asking.
And it is NOT memoization ... or (to my mind) the reverse of memoization.
(Likewise, there is no word in English for a cat with three legs.)
I want to import CSV files with about 40 million lines into Neo4j. For this I am trying to use the "batch importer" from https://github.com/jexp/batch-import.
Maybe it's a problem that I provide my own IDs. Here is an example:
nodes.csv
i:id l:label
315041100 Person
201215100 Person
315041200 Person
rels.csv:
start end type relart
315041100 201215100 HAS_RELATION 30006
315041200 315041100 HAS_RELATION 30006
The content of batch.properties:
use_memory_mapped_buffers=true
neostore.nodestore.db.mapped_memory=1000M
neostore.relationshipstore.db.mapped_memory=5000M
neostore.propertystore.db.mapped_memory=4G
neostore.propertystore.db.strings.mapped_memory=2000M
neostore.propertystore.db.arrays.mapped_memory=1000M
neostore.propertystore.db.index.keys.mapped_memory=1500M
neostore.propertystore.db.index.mapped_memory=1500M
batch_import.node_index.node_auto_index=exact
./import.sh graph.db nodes.csv rels.csv
This runs without errors, but it takes about 60 seconds!
Importing 3 Nodes took 0 seconds
Importing 2 Relationships took 0 seconds
Total import time: 54 seconds
When I use smaller IDs - for example 3150411 instead of 315041100 - it takes just 1 second!
Importing 3 Nodes took 0 seconds
Importing 2 Relationships took 0 seconds
Total import time: 1 seconds
Actually I would like to use even bigger IDs with 10 digits. I don't know what I'm doing wrong. Can anyone see an error?
JDK 1.7
batchimporter 2.1.3 (with neo4j 2.1.3)
OS: ubuntu 14.04
Hardware: 8-Core-Intel-CPU, 16GB RAM
I think the problem is that the batch importer is interpreting those IDs as actual physical record IDs on disk, so the time is spent in the file system, inflating the store files up to the size where they can fit those high IDs.
The IDs that you're giving are intended to be "internal" to the batch import, aren't they? Although I'm not sure how to tell the batch importer that this is the case.
#michael-hunger any input there?
The problem is that those IDs are internal to Neo4j, where they represent disk record IDs. If you provide high values there, Neo4j will create a lot of empty records until it reaches your IDs.
So either you create your node IDs starting from 0 and store your original ID as a normal node property, like this:
i:id id:long l:label
0 315041100 Person
1 201215100 Person
2 315041200 Person
start:id end:id type relart
0 1 HAS_RELATION 30006
2 0 HAS_RELATION 30006
Or you don't provide node IDs at all and only look up nodes via their "business ID value"; for that you have to configure and use an index:
id:long:people l:label
315041100 Person
201215100 Person
315041200 Person
id:long:people id:long:people type relart
315041100 201215100 HAS_RELATION 30006
315041200 315041100 HAS_RELATION 30006
HTH Michael
Alternatively, you can also just write a small Java or Groovy program to import your data if handling those IDs with the batch importer is too tricky.
See: http://jexp.de/blog/2014/10/flexible-neo4j-batch-import-with-groovy/
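If the first option (sequential internal IDs, business ID kept as a property) works for you, the renumbering can be done with a small preprocessing script. A hedged Python sketch, assuming tab-separated files named nodes.csv and rels.csv with the headers shown in the question (the output file names are my own choice):

import csv

mapping = {}  # business ID -> sequential internal ID

# Rewrite nodes.csv: prepend a 0-based i:id column, keep the old ID as a property.
with open("nodes.csv") as src, open("nodes_new.csv", "w", newline="") as dst:
    reader = csv.reader(src, delimiter="\t")
    writer = csv.writer(dst, delimiter="\t")
    next(reader)  # skip the original header
    writer.writerow(["i:id", "id:long", "l:label"])
    for business_id, label in reader:
        mapping[business_id] = len(mapping)
        writer.writerow([mapping[business_id], business_id, label])

# Rewrite rels.csv: translate both endpoints to the new internal IDs.
with open("rels.csv") as src, open("rels_new.csv", "w", newline="") as dst:
    reader = csv.reader(src, delimiter="\t")
    writer = csv.writer(dst, delimiter="\t")
    next(reader)
    writer.writerow(["start:id", "end:id", "type", "relart"])
    for start, end, rel_type, relart in reader:
        writer.writerow([mapping[start], mapping[end], rel_type, relart])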
I have a matrix in a .csv file of approximately 0.6 million data points that I would like to visualize in a 3D plot. Since my computer had trouble with that amount of data, I changed the command line from:
splot "file.csv" matrix w pm3d
to
splot "file.csv" matrix every 5::50::3000 w pm3d
My intention was to plot only rows 50 to 3000, using every 5th row. A row contains 100 columns, by the way. The command, however, cut the first 50 rows and columns, used every 5th row and column, and ended with line 3500.
How do I use the every command on my rows only?
I also tried to combine the using command with the every command in order to apply every to my rows only, but I couldn't get it to work properly.
Short answer: Use
splot "file.csv" matrix every :5::50::3000 w pm3d
Long answer: The description of the every option is:
plot 'file' every {<point_incr>}
                  {:{<block_incr>}
                    {:{<start_point>}
                      {:{<start_block>}
                        {:{<end_point>}
                          {:<end_block>}}}}}
The description of point and block refers to the usual data file structure, where two data blocks are separated by an empty line.
When using the matrix data format, replace point by column and block by row. That means that every 1:1 selects all points, every 2:1 selects every second column and every row, and every 1:2 (or every :2) selects every column and every second row.
Just use a simple data file
0 0 0 0 0 0
0 0 0 0 0 0
0 0 0 0 0 0
0 0 0 0 0 0
and test:
splot 'file' matrix with lines, '' every :2
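If it helps to reason about the selection in array terms: every :5::50::3000 corresponds to taking every 5th row from row 50 through row 3000 while keeping all columns. A small numpy sketch of the equivalent slicing (numpy, whitespace-separated data, and the file name are my assumptions; gnuplot itself does not need this):

import numpy as np

data = np.loadtxt("file.csv")   # full matrix, one matrix row per line

# Equivalent of: every :5::50::3000 -- rows 50..3000 in steps of 5, all columns.
subset = data[50:3001:5, :]
print(subset.shape)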
I need to write a function to satisfy this input -> output list:
0 -> 0
1 -> 1
3 -> 2
4 -> 3
5 -> 5
7 -> 13
9 -> 34
f(x) = ??
Well, that is incredibly easy... if you aren't concerned about over-fitting, then you can do:
switch (input):
    case 0: report 0
    case 1: report 1
    case 3: report 2
    ...
    default: report whatever
You probably need more constraints on the problem if you want a good solution. You might also consider graphing the function to see if there is any obvious pattern, or maybe showing the bits involved. It would also be useful to know whether the inputs and outputs are integer-valued or real-valued (is the function supposed to be continuous or discrete?). Without this information, it's a little hard to help.
Edit
Showing the missing numbers helps:
0 -> 0
1 -> 1
2 -> 1
3 -> 2
4 -> 3
5 -> 5
6 -> 8
7 -> 13
8 -> 21
9 -> 34
(It's the Fibonacci numbers: f(x) = f(x-1) + f(x-2), where f(0) = 0 and f(1) = 1.)
PS
This is a function for which dynamic programming or memoization is particularly useful.
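For completeness, a minimal memoized Python version (functools.lru_cache does the caching; without it the naive recursion recomputes the same values exponentially often):

from functools import lru_cache

@lru_cache(maxsize=None)
def f(x):
    # Fibonacci: f(x) = f(x-1) + f(x-2), with f(0) = 0 and f(1) = 1.
    if x < 2:
        return x
    return f(x - 1) + f(x - 2)

print([f(x) for x in range(10)])  # [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]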
Solved by Eureqa
round(exp(0.4807*input - 0.799938))
I don't know if this is homework or not, but this is an extremely well-known sequence. A (rather large) hint is that f(n) depends on f(n-1) and f(n-2). If you're not concerned with just being told the answer, click here. Implementing the sequence recursively is pretty trivial, but edit your post if you're having trouble with it and specify which part you're stuck on.