Spark: input_line like input_file_name when reading line-separated JSON

I have a newline-separated JSON file. Each line consists of a JSON array which contains different types of documents. The standard behavior of Spark is that when I read it naively
df = spark.read.json(file.path)
Spark creates a new row for each document in each array; the attributes of the different document types each become a column, and in every row only the attributes of that row's document are non-null.
I need to recover the line number of each document, because documents on the same line are related to each other, and that relation is otherwise lost.
I imagined there would be something similar to input_file_name(), but there is none. Is there a way to achieve this?
An example input is
[{"firstAttribute":1, "secondAttribute":2},{"firstAttribute":10, "secondAttribute":20}]
[{"firstAttribute":3, "secondAttribute":4},{"thirdAttribute":5, "secondAttribute":6}]
The resulting dataframe now looks like this:
firstAttribute   secondAttribute   thirdAttribute
1                2                 null
10               20                null
3                4                 null
null             6                 5
Now I would like to see the line number from the source file for each entry, like this:
line number   firstAttribute   secondAttribute   thirdAttribute
1             1                2                 null
1             10               20                null
2             3                4                 null
2             null             6                 5
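There is no built-in function for the input line number, but one possible workaround is sketched below: read the file as plain text first, attach a line number with zipWithIndex, and only then parse each line as a JSON array and explode it into one row per document. The path, the hard-coded attribute names, and the string-typed map schema are assumptions for illustration; with several input files the index would also be global rather than per file.

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import ArrayType, MapType, StringType

spark = SparkSession.builder.getOrCreate()

# Read the file as plain text so every physical line stays intact.
lines = spark.read.text("file.json")  # placeholder path

# zipWithIndex numbers the lines in file order; +1 makes the count 1-based.
numbered = (lines.rdd
            .zipWithIndex()
            .map(lambda pair: (pair[0].value, pair[1] + 1))
            .toDF(["json_line", "line_number"]))

# Parse each line as a JSON array of documents and explode it into one row
# per document, keeping the line number. All values come back as strings
# with this map schema, so cast them afterwards if needed.
docs = (numbered
        .withColumn("doc", F.explode(F.from_json(
            "json_line", ArrayType(MapType(StringType(), StringType())))))
        .select("line_number",
                F.col("doc").getItem("firstAttribute").alias("firstAttribute"),
                F.col("doc").getItem("secondAttribute").alias("secondAttribute"),
                F.col("doc").getItem("thirdAttribute").alias("thirdAttribute")))

docs.show()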

Related

Spark - Strange characters when reading CSV file

I hope someone can help me, please. My problem is the following:
To read a CSV file in Spark I'm using the code
val df=spark.read.option("header","true").option("inferSchema","true").csv("/home/user/Documents/filename.csv")
assuming that my file is called filename.csv and the path is /home/user/Documents/
To show the first 10 results I use
df.show(10)
but instead I get the following result, which contains the character � and does not show the 10 rows as desired:
scala> df.show(10)
+--------+---------+---------+-----------------+
| c1| c2| c3| c4|
+--------+---------+---------+-----------------+
|��1.0|5450|3007|20160101|
+--------+---------+---------+-----------------+
The CSV file looks something like this
c1 c2 c3 c4
1 5450 3007 20160101
2 2156 1414 20160107
1 78229 3656 20160309
1 34963 4484 20160104
1 7897 3350 20160105
11 13247 3242 20160303
2 4957 3350 20160124
1 73083 4211 20160207
The file that I'm trying to read is big. When I try smaller files I don't get the strange character and I can see the first 10 results without problem.
Any help is appreciated.
Sometimes the problem is not caused by Spark settings. Try re-saving ("Save As") your CSV file as "CSV UTF-8 (comma delimited)", then rerun your code; the strange characters will be gone. I had a similar problem when reading a CSV file containing German words; after doing the above, it was all good.
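If re-saving the file is not an option, a small sketch of an alternative (shown in PySpark syntax; the same options exist on the Scala reader) is to tell Spark the file's encoding explicitly. The ISO-8859-1 guess is an assumption; use whatever encoding the file was actually exported with.

# Tell the CSV reader which encoding to decode the file with instead of the
# default UTF-8; "ISO-8859-1" here is only a guess for illustration.
df = (spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .option("encoding", "ISO-8859-1")
      .csv("/home/user/Documents/filename.csv"))

df.show(10)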

How to store data from 2 files into arrays and compare them in Tcl

I am new to Tcl and still learning.
Now, I have 2 files containing different data. I want to store them in arrays and compare them, then write the difference between the two files into a new text file. For example, file1.txt has the data
1
2
3
While file2.txt has data
2
4
5
After comparing and finding the difference, write it into a new text file, file3.txt, which should look like
4
5
You can use the struct::set package from Tcllib. Read the values from the files into lists,
package require struct::set
::struct::set difference {2 4 5} {1 2 3}
and then write out the result.

How to import csv data where some observations are on two rows

I have a dataset with a couple million rows. It is in csv format. I wish to import it into Stata. I can do this, but there is a problem - a small percentage (but still many) of the observations appear on two lines in the CSV file. Most of the entries occur on only one line. The troublesome observations that take up 2 lines still follow the same pattern as far as being delimited by commas. But in the Stata dataset, the observation shows up on two rows, both rows containing only part of the total data.
I used import delimited to import the data. Is there anything that can be done at the data import stage of the process in Stata? I would prefer to not have to deal with this in the original CSV file if possible.
Update:
Here is an example of what the csv file looks like:
var1,var2,var3,var4,var5
text 1, text 2,text 3 ,text 4,text 5
text 6,text 7,text 8,text9,text10
text 11,text 1
2,text 13,text14,text15
text16,text17,text18,text19,text20
Notice that there is no comma at the end of the line. Also notice that the problem is with the observation that begins with text 11.
This is basically how it shows up in Stata:
var1 var2 var3 var4 var5
1 text 1 text 2 text 3 text 4 text 5
2 text 6 text 7 text 8 text9 text10
3 text 11 text 1
4 2 text 13 text14 text15
5 text16 text17 text18 text19 text20
The fact that the number is sometimes right next to the text is not a mistake - it is just to illustrate that the data is more complex than shown here.
Of course, this is how I need the data:
var1 var2 var3 var4 var5
1 text 1 text 2 text 3 text 4 text 5
2 text 6 text 7 text 8 text9 text10
3 text 11 text 12 text 13 text14 text15
4 text16 text17 text18 text19 text20
A convoluted way is (comments inline):
clear
set more off
*----- example data -----
// change delimiter, if necessary
insheet using "~/Desktop/stata_tests/test.csv", names delim(;)
list
*----- what you want -----
// compute number of commas
gen numcom = length(var1var2var3var4var5) ///
- length(subinstr(var1var2var3var4var5, ",", "", .))
// save all data
tempfile orig
save "`orig'"
// keep observations that are fine
drop if numcom != 4
// save fine data
tempfile origfine
save "`origfine'"
*-----
// load all data
use "`orig'", clear
// keep offending observations
drop if numcom == 4
// for the -reshape-
gen i = int((_n-1)/2) +1
bysort i : gen j = _n
// check that pairs add up to 4 commas
by i : egen check = total(numcom)
assert check == 4
// no longer necessary
drop numcom check
// reshape wide
reshape wide var1var2var3var4var5, i(i) j(j)
// gen definitive variable
gen var1var2var3var4var5 = var1var2var3var4var51 + var1var2var3var4var52
keep var1var2var3var4var5
// append new observations with original good ones
append using "`origfine'"
// split
split var1var2var3var4var5, parse(,) gen(var)
// we're "done"
drop var1var2var3var4var5 numcom
list
But we don't really have the details of your data, so this may or may not work. It's just meant to be a rough draft. Depending on the memory space occupied by your data, and other details, you may need to improve parts of the code to make it more efficient.
Note: the file test.csv looks like
var1,var2,var3,var4,var5
text 1, text 2,text 3 ,text 4,text 5
text 6,text 7,text 8,text9,text10
text 11,text 1
2,text 13,text14,text15
text16,text17,text18,text19,text20
Note 2: I'm using insheet because I don't have Stata 13 at the moment. import delimited is the way to go if available.
Note 3: details on how the counting of commas works can be reviewed at Stata tip 98: Counting substrings within strings, by Nick Cox.
I would try the following strategy.
1. Import as a single string variable.
2. Count commas on each line and combine following lines if lines are incomplete.
3. Delete redundant material.
The comma count will be
length(variable) - length(subinstr(variable, ",", "", .))
If the observations in question are quoted in the CSV file, then you can use the bindquote(strict) option.
A bit of speculation without seeing the exact data: following Roberto Ferrer's comment, you might find the Stata command filefilter useful in cleaning the csv file before importing. You can substitute new and old string patterns, using basic characters as well as more complex \n and \r terms.
I can't offer any code at the moment, but I suggest you take a good look at help import. The infile and infix commands state:
An observation can be on more than one line.
(I don't know if this means that all observations should be on several lines, or if it can handle cases where only some observations are on more than one line.)
Check also the manuals if the examples and notes in the help files turn out to be insufficient.

How to merge two ipython notebooks correctly without getting json error?

I have tried:
cat file1.ipynb file2.ipynb > filecomplete.ipynb
since the notebooks are simply json files, but this gives me the error
Unreadable Notebook: Notebook does not appear to be JSON: '{\n "metadata": {'
I think these must be valid json files because file1 and file2 each load individually into nbviewer, and so I am not entirely sure what I am doing wrong.
This Python script concatenates all the notebooks named with a given prefix and present at the first level of a given folder. The resulting notebook is saved in the same folder under the name "compil_" + prefix + ".ipynb".
import json
import os
folder = "slides"
prefix = "quiz"
paths = [os.path.join(folder, name) for name in os.listdir(folder) if name.startswith(prefix) and name.endswith(".ipynb")]
result = json.loads(open(paths.pop(0), "r").read())
for path in paths:
    result["worksheets"][0]["cells"].extend(json.loads(open(path, "r").read())["worksheets"][0]["cells"])
open(os.path.join(folder, "compil_%s.ipynb" % prefix), "w").write(json.dumps(result, indent = 1))
Warning: the metadata are those of the first notebook, and the cells are those of the first worksheet only (which seems to contain all the cells, at least in my notebook).
Concatenating 2 objects that each have some property does not always yield an object with the same property. Here is an increasing sequence of numbers: 4 8 15 16 23 42; here is another one: 1 2 3 4 5 6 7. The concatenation of the two is not increasing: 4 8 15 16 23 42 1 2 3 4 5 6 7. The same goes for JSON.
You need to load the JSON files with a JSON library and do the merge you want yourself. I suppose you "just" want to concatenate the cells, but maybe you want to concatenate worksheets, or maybe you want to merge the metadata.
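For instance, a minimal sketch of that manual merge, assuming the file names from the question and a notebook format where the cells sit at the top level (nbformat 4; older notebooks keep them under ["worksheets"][0]["cells"] instead):

import json

# Load both notebooks (file names taken from the question).
with open("file1.ipynb") as f:
    merged = json.load(f)
with open("file2.ipynb") as f:
    second = json.load(f)

# Concatenate the cell lists; the metadata of the first notebook is kept as-is.
merged["cells"].extend(second["cells"])

with open("filecomplete.ipynb", "w") as f:
    json.dump(merged, f, indent=1)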

Unable to import term sets from CSV in sharepoint 2010

I'm unable to import a term set using CSV in SharePoint 2010. I get the following error:
An error was encountered while attempting to import the term set at line [last line] of the submitted file and some data may not have been cleaned up. Please ensure that this file is a valid csv file and adheres to the correct format as in the sample file ImportTermSet.csv
I've tried re-using the sample file itself to create my term set, creating the term set in Notepad (+ensuring that it is a UTF-8 CSV), but all in vain :(
Could someone please help me here?
http://blog.jussipalo.com/2011/04/sharepoint-importing-term-set-gives.html
For the record: all repeating instances of terms must have exactly the same case.
Example
,,,TRUE,,Continent,Political Entity,,,,,
,,,TRUE,,continent,Political Entity,Country,,,,
,,,TRUE,,Continent,Political Entity,Country,Province or State,,,
The above will fail because level 1 term "Continent" has a lower case "c" on the second row.
Pay attention to how the nesting levels are declared.
This is an example of a CORRECT CSV file (note the order of the nested levels):
"Term Set Name","Term Set Description","LCID","Available for Tagging","Term Description","Level 1 Term","Level 2 Term","Level 3 Term","Level 4 Term","Level 5 Term","Level 6 Term","Level 7 Term"
"My Term Set","Description of My Term Set",,"True",,,,,,,,
,,,"True",,"Italy",,,,,,
,,,"True",,"Italy","Venice",,,,,
,,,"True",,"France",,,,,,
,,,"True",,"France","Paris",,,,,
,,,"True",,"Germany","Berlin",,,,,
This is an example of an INCORRECT CSV file:
"Term Set Name","Term Set Description","LCID","Available for Tagging","Term Description","Level 1 Term","Level 2 Term","Level 3 Term","Level 4 Term","Level 5 Term","Level 6 Term","Level 7 Term"
"My Term Set","Description of My Term Set",,"True",,,,,,,,
,,,"True",,"Italy",,,,,,
,,,"True",,"Italy","Rome",,,,,
,,,"True",,"France","Paris",,,,,
,,,"True",,"France",,,,,,
The last line will give an import error since "France" already exists!