How to merge two IPython notebooks correctly without getting a JSON error?

I have tried:
cat file1.ipynb file2.ipynb > filecomplete.ipynb
since the notebooks are simply JSON files, but this gives me the error
Unreadable Notebook: Notebook does not appear to be JSON: '{\n "metadata": {'
I think these must be valid JSON files because file1 and file2 each load individually into nbviewer, so I am not entirely sure what I am doing wrong.

This Python script concatenates all the notebooks named with a given prefix and present at the first level of a given folder. The resulting notebook is saved in the same folder under the name "compil_" + prefix + ".ipynb".
import json
import os

folder = "slides"
prefix = "quiz"
paths = [os.path.join(folder, name) for name in os.listdir(folder)
         if name.startswith(prefix) and name.endswith(".ipynb")]

# Start from the first notebook and append the cells of the other notebooks to it
result = json.loads(open(paths.pop(0), "r").read())
for path in paths:
    result["worksheets"][0]["cells"].extend(
        json.loads(open(path, "r").read())["worksheets"][0]["cells"])
open(os.path.join(folder, "compil_%s.ipynb" % prefix), "w").write(json.dumps(result, indent=1))
Warning: the metadata is taken from the first notebook, and the cells come from the first worksheet only (which seems to contain all the cells, at least in my notebooks).

Concatenating two objects that each have some property does not always yield an object with the same property. Here is an increasing sequence of numbers: 4 8 15 16 23 42; here is another one: 1 2 3 4 5 6 7. The concatenation of the two is not strictly increasing: 4 8 15 16 23 42 1 2 3 4 5 6 7. The same goes for JSON.
You need to load the JSON files with a JSON library and do the merge you want yourself. I suppose you "just" want to concatenate the cells, but maybe you want to concatenate worksheets, or maybe you want to merge the metadata.
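If your notebooks are in the newer nbformat 4 layout, the cells live at the top level of the file rather than inside worksheets, so a minimal merge (a sketch only; the file names are the placeholders from the question) looks like this:

import json

# Sketch for nbformat 4 notebooks, where "cells" is a top-level key
with open("file1.ipynb") as f:
    merged = json.load(f)
with open("file2.ipynb") as f:
    merged["cells"].extend(json.load(f)["cells"])
with open("filecomplete.ipynb", "w") as f:
    json.dump(merged, f, indent=1)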

Related

Spark: input_line like input_file_name when reading line-separated JSON

I have a newline-separated JSON file. Each line consists of a JSON array which contains different types of documents. The standard behavior of Spark is that when I read it naively
df = spark.read.json(file.path)
Spark creates a new row for each document in each array, while each of the different document types becomes a column, and for each row only one of them is non-null.
I need to recover the line number of each document, as documents on the same line have a relation which is otherwise not recoverable.
I imagined there would be something similar to input_file_name(), but there is none. Is there a way to achieve this?
An example input is
[{"firstAttribute":1, "secondAttribute":2},{"firstAttribute":10, "secondAttribute":20}]
[{"firstAttribute":3, "secondAttribute":4},{"thirdAttribute":5, "secondAttribute":6}]
The resulting dataframe now looks like this:
firstAttribute | secondAttribute | thirdAttribute
1              | 2               | null
10             | 20              | null
3              | 4               | null
null           | 6               | 5
Now I would like to see the line number of the source file for each entry, like this:
line number | firstAttribute | secondAttribute | thirdAttribute
1           | 1              | 2               | null
1           | 10             | 20              | null
2           | 3              | 4               | null
2           | null           | 6               | 5
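One possible way to keep the line number (a sketch only, with a hypothetical schema and the placeholder path from the question) is to read the file as plain text, attach an index with zipWithIndex, and only then parse and explode the JSON arrays:

from pyspark.sql import SparkSession, functions as F, types as T

spark = SparkSession.builder.getOrCreate()

# Read each physical line as a string and pair it with its position in the file
rdd = spark.sparkContext.textFile("file.path").zipWithIndex()
lines = spark.createDataFrame(rdd, ["value", "line_number"])

# Hypothetical schema covering the attributes from the example input
doc_schema = T.ArrayType(T.StructType([
    T.StructField("firstAttribute", T.LongType()),
    T.StructField("secondAttribute", T.LongType()),
    T.StructField("thirdAttribute", T.LongType()),
]))

# Parse each line into an array of documents, then explode into one row per document
df = (lines
      .withColumn("docs", F.from_json("value", doc_schema))
      .select((F.col("line_number") + 1).alias("line_number"),
              F.explode("docs").alias("doc"))
      .select("line_number", "doc.*"))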

JMeter: Parameter settings

Is it possible for each thread to select the same row from the CSV file?
e.g. I have 5 users and only 5 records (rows) in my CSV file. In each iteration, the 1st value from the CSV should be assigned to User1, and similarly for all users.
User1: myID1,pass1,item1,product1
User2: myID2,pass2,item2,product2
User3: myID3,pass3,item3,product3
User4: myID14,pass4,item4,product4
User5: myID15,pass5,item5,product5
.
.
Any solution, please?
If you have only 5 threads and 5 lines in the CSV, I would suggest switching to User Parameters instead of working with a CSV file.
If your CSV file can have more than 5 lines, your test can have more than 5 virtual users, and a requirement like "user 1 takes line 1" is a must, you will have to pre-load the CSV file into memory with a scripting test element such as the Beanshell Sampler:
Add a setUp Thread Group to your Test Plan (with 1 thread and 1 iteration)
Add a Beanshell Sampler and put the following code into the "Script" area:
import org.apache.commons.io.FileUtils;

// Read the whole CSV into a list of lines and share it with all virtual users
List lines = FileUtils.readLines(new File("test.csv"));
bsh.shared.lines = lines;
The above code will read the contents of the test.csv file (replace it with a relative or full path to your CSV file) and store them in the bsh.shared namespace.
Add a Beanshell PreProcessor as a child of the request where you need to use the values from the CSV file and put the following code into the "Script" area:
// Thread numbers are zero-based: user 1 gets line 0, user 2 gets line 1, and so on
int user = ctx.getThreadNum();
String line = bsh.shared.lines.get(user);
String[] tokens = line.split(",");
vars.put("ID", tokens[0]);
vars.put("pass", tokens[1]);
vars.put("item", tokens[2]);
vars.put("product", tokens[3]);
The above code will fetch the line from the list stored in the bsh.shared namespace based on the current virtual user number, split it by comma, and store the values into JMeter Variables so you will be able to access them as:
${ID}
${pass}
${item}
${product}
See How to Use BeanShell: JMeter's Favorite Built-in Component guide for more information on using Beanshell scripting in JMeter tests.

Creating unique node and relationship NEO4J over huge dataset

My question is very similar to this one:
How to create unique nodes and relationships by csv file imported in neo4j?
I have a text file with around 2.5 million lines and two columns, each containing a node ID:
1234 345
1234 568
345 984
... ...
Each line represents a relationship (so 2.5 million relationships): first_column nodeid-> FOLLOWS -> second_column nodeid. There are around 80,000 unique nodes in this file.
Based on the link above, I did:
USING PERIODIC COMMIT 1000
LOAD CSV FROM 'file:///home/user_name/Desktop/bigfile.csv' AS line FIELDTERMINATOR ' '
MERGE (n:Userid { id: toInt(line[0]) })
WITH line, n
MERGE (m:Userid { id: toInt(line[1]) })
WITH m,n
MERGE (n)-[:FOLLOWS]->(m)
I am assuming this code
creates node n or m if it doesn't exist (and finds it if it does), and creates a relationship from n to m.
If n or m already exists and already has many other edges (relationships) pointing to and from other nodes, this just adds another edge from n to m (it does not create a brand new node when one already exists).
My main question is how to make this process faster.
This is being done on Ubuntu, and I changed the memory setting from 512 to 2048 MB in the conf/neo4j-wrapper.conf file (the maximum I can allocate on my Virtual Machine).
Should I try the Import tool?
Based on the example on this website, neo4j.com/developer/guide-import-csv/, under "Super Fast Batch Importer For Huge Datasets":
./bin/neo4j-import --into mydatabase.db --id-type INTEGER \
--nodes allnodes.csv \
--delimiter " " \
--relationships:FOLLOWS bigfile.csv
And to do this, I need to reformat the files so that:
allnodes.csv shows
userID:ID(Userid)
1234
5678
...
And bigfile.csv shows
:START_ID(Userid) :END_ID(Userid)
1234 345
1234 568
345 984
(two columns delimited by a space)
And when I run this import, I get this error:
Input error: Expected '--nodes' to have at least 1 valid item, but had 0 []
Caused by:Expected '--nodes' to have at least 1 valid item, but had 0 []
java.lang.IllegalArgumentException: Expected '--nodes' to have at least 1 valid item, but had 0 []
How do I fix this error? And for the CSV files, do I put them in the same folder where I run this command (the neo4j folder)?
Your command line probably has the wrong paths for your two CSV files.
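For example, if the reformatted CSV files live in the same Desktop folder used in the LOAD CSV attempt above, passing full paths (the paths here are only illustrative) should give --nodes a valid item:
./bin/neo4j-import --into mydatabase.db --id-type INTEGER \
--nodes /home/user_name/Desktop/allnodes.csv \
--delimiter " " \
--relationships:FOLLOWS /home/user_name/Desktop/bigfile.csv
Relative paths also work, as long as they are relative to the directory you run the command from.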

pandas returning empty DataFrames for CSV

I have some large csv and xlsx files which I need to set up pandas DataFrames for. I have code which locates these files within the directory (when printed, these show correct pathnames). These paths are then passed to a helper function which is meant to set up the required DataFrames for the files, then the data will be passed to other functions for some manipulation. I intend to have the data written to a file (by loading a template, writing the data to it, and saving this file) once this is completed.
I currently have code like:
import pandas
from io import StringIO
# some set-up functions (which work; verified using print statements)

def createDataFrame(filename):
    if filename.endswith('.csv'):
        df = pandas.read_csv(StringIO(filename), skip_blank_lines=True, index_col=False,
                             encoding="utf-8", skipinitialspace=True)
When I try print(df), I get:
Empty DataFrame
Columns: [a.csv]
Index: []
and print(StringIO(filename)) gives me:
<_io.StringIO object at 0x004D1990>
However, when I leave out the StringIO() around filename in the function, I get this error:
OSError: File b'a.csv' does not exist
Everywhere I've been able to find information on this either just says to import pandas and start using it, or talks about using read_csv() rather than from_csv() (from this question, which wasn't very helpful here), and even the current pandas docs basically say that it should be as easy as passing the file to pandas.read_csv().
1) I've checked that I have full permissions and that the file is valid and exists. Why am I getting the OSError?
2) When I use StringIO(), why am I still getting an empty DataFrame here? How can I fix this?
Thanks in advance.
I have solved this.
StringIO was the root cause of this problem. Because I'm on Windows, my file-existence check (written as os.path.is_file(), which doesn't actually exist) was failing, and I got the error:
OSError: File b'a.csv' does not exist
It wasn't until I stumbled upon this page from the Python 2.5 docs that I discovered that the call should actually be os.path.isfile(); on Windows, os.path uses ntpath behind the scenes, which handles the difference in pathnames between systems (Windows uses '\', Unix uses '/').
Because I had something weird going on in my paths, pandas was unable to properly load the CSV files into DataFrames.
By simply changing my code from this:
import pandas
from io import StringIO
# some set-up functions (which work; verified using print statements)

def createDataFrame(filename):
    if filename.endswith('.csv'):
        df = pandas.read_csv(StringIO(filename), skip_blank_lines=True, index_col=False,
                             encoding="utf-8", skipinitialspace=True)
to this:
import os
import pandas
# some set-up functions (which have been updated)

def createDataFrame(filename):
    basepath = config.complete_datasets_dir  # config is the project's own settings module
    fullpath = os.path.join(basepath, filename)
    if filename.endswith('.csv'):
        df = pandas.read_csv(fullpath, skip_blank_lines=True, index_col=False,
                             encoding="utf-8", skipinitialspace=True)
        return df
and appropriately updating the function which calls that function:
def somefunc():
    dfs = []
    data_lists = getInputFiles()  # checks data directory for files containing info
    for item in data_lists:
        tdata = createDataFrame(item)
        dfs.append(tdata)
    print(dfs)
I was able to get the output I was looking for:
[ 1 2 3 4 5 6 7 8 9 10
0 11 12 13 14 15 16 17 18 19 20
1 21 22 23 24 25 26 27 28 29 30
2 31 32 33 34 35 36 37 38 39 40, 1 2 3 4 5 6 7 8 9 10
0 11 12 13 14 15 16 17 18 19 20
1 21 22 23 24 25 26 27 28 29 30]
which is a list of two DataFrames, the first of which came from a CSV containing only the numbers 1-40 (on 4 rows total, no headers); the second file contains only the numbers 1-30 (formatted the same way).
I hope this helps someone in the future.

How to store data from 2 files into arrays and compare them in Tcl

I am new to Tcl and still learning.
Now, I have 2 files containing different data. I want to store them into arrays, compare them, and print the difference between the two files to a new text file. For example, file1.txt has the data
1
2
3
While file2.txt has the data
2
4
5
After comparing and finding the difference, write it into a new text file, file3.txt, which should look like
4
5
You can use the struct::set package from Tcllib. Read the values from the files into lists,
package require struct::set
::struct::set difference {2 4 5} {1 2 3}
and then write out the result.
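A minimal end-to-end sketch of that idea (assuming one value per line, and using the file names from the question) could look like this:

package require struct::set

# Read a file into a list, one value per line
proc readValues {path} {
    set f [open $path r]
    set data [read $f]
    close $f
    return [split [string trim $data] "\n"]
}

set list1 [readValues file1.txt]
set list2 [readValues file2.txt]

# Values present in file2.txt but not in file1.txt: {4 5} for the sample data
set diff [::struct::set difference $list2 $list1]

# Write the result to file3.txt, one value per line
set out [open file3.txt w]
puts $out [join $diff "\n"]
close $out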