How can I merge/join multiple columns from two dataframes, depending on a matching pattern - language-agnostic

I would like to merge two dataframes based on similar patterns in the chromosome column. I have made various attempts in R and BASH, e.g. with data.table, tidyverse, and merge(). Could someone help me by providing alternative solutions in R, BASH, Python, Perl, etc. for solving this problem? I would like to merge based on the chromosome information and retain both counts/RXNs.
NOTE: These two DFs are not aligned, and I am also curious what happens if some values are missing.
Thanks and cheers!
DF1:
Chromosome;RXN;ID
1009250;q9hxn4;NA
1010820;p16256;NA
31783;p16588;"PNTOt4;PNTOt4pp"
203;3-DEHYDROQUINATE-DEHYDRATASE-RXN;"DHQTi;DQDH"
DF2:
Chromosome;Count1;Count2;Count3;Count4;Count5
203;1;31;1;0;0;0
1010820;152;7;0;11;4
1009250;5;0;0;17;0
31783;1;0;0;0;0;0
Expected Result:
Chromosome;RXN;Count1;Count2;Count3;Count4;Count5
1009250;q9hxn4;5;0;0;17;0
1010820;p16256;152;7;0;11;4
31783;p16588;1;0;0;0;0
203;3-DEHYDROQUINATE-DEHYDRATASE-RXN;1;31;1;0;0;0

As bash was mentioned in the text body, I offer you an awk solution. The dataframes are in files df1 and df2:
$ awk '
BEGIN {
    FS=OFS=";"            # input and output field delimiters
}
NR==FNR {                 # process df1
    a[$1]=$2              # hash to an array, 1st field is the key, 2nd the value
    next                  # process next record
}
{                         # process df2
    $2=(a[$1] OFS $2)     # prepend RXN field to 2nd field of df2
}1' df1 df2               # 1 is the output command; mind the file order
The last two lines could perhaps be written more clearly:
...
{
    print $1, a[$1], $2, $3, $4, $5, $6
}' df1 df2
Output:
Chromosome;RXN;Count1;Count2;Count3;Count4;Count5
203;3-DEHYDROQUINATE-DEHYDRATASE-RXN;1;31;1;0;0;0
1010820;p16256;152;7;0;11;4
1009250;q9hxn4;5;0;0;17;0
31783;p16588;1;0;0;0;0;0
The output will be in the order of df2. A chromosome present in df1 but not in df2 will not be included; a chromosome present in df2 but not in df1 will be output from df2 with an empty RXN field. Also, if there are duplicate chromosomes in df1, the last one is used. This can be fixed if it is an issue.

If I understand your request correctly, this should do it in Python. I've made the Chromosome column the index of each DataFrame.
import pandas as pd
from io import StringIO
txt1 = '''Chromosome;RXN;ID
1009250;q9hxn4;NA
1010820;p16256;NA
31783;p16588;"PNTOt4;PNTOt4pp"
203;3-DEHYDROQUINATE-DEHYDRATASE-RXN;"DHQTi;DQDH"'''
txt2 = """Chromosome;Count1;Count2;Count3;Count4;Count5;Count6
203;1;31;1;0;0;0
1010820;152;7;0;11;4
1009250;5;0;0;17;0
31783;1;0;0;0;0;0"""
df1 = pd.read_csv(
    StringIO(txt1),
    sep=';',
    index_col=0,
    header=0
)
df2 = pd.read_csv(
    StringIO(txt2),
    sep=';',
    index_col=0,
    header=0
)
DF1:
RXN ID
Chromosome
1009250 q9hxn4 NaN
1010820 p16256 NaN
31783 p16588 PNTOt4;PNTOt4pp
203 3-DEHYDROQUINATE-DEHYDRATASE-RXN DHQTi;DQDH
DF2:
Count1 Count2 Count3 Count4 Count5 Count6
Chromosome
203 1 31 1 0 0 0.0
1010820 152 7 0 11 4 NaN
1009250 5 0 0 17 0 NaN
31783 1 0 0 0 0 0.0
result = pd.concat(
    [df1.sort_index(), df2.sort_index()],
    axis=1
)
print(result)
RXN ID Count1 Count2 Count3 Count4 Count5 Count6
Chromosome
203 3-DEHYDROQUINATE-DEHYDRATASE-RXN DHQTi;DQDH 1 31 1 0 0 0.0
31783 p16588 PNTOt4;PNTOt4pp 1 0 0 0 0 0.0
1009250 q9hxn4 NaN 5 0 0 17 0 NaN
1010820 p16256 NaN 152 7 0 11 4 NaN
The concat call also handles mismatched indices by simply filling in NaN values for the columns of, e.g., df1 if df2 doesn't have the same index, and vice versa.
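Since merge() came up in the question, here is a minimal alternative sketch with an explicit join, reusing the df1/df2 objects built above. DataFrame.join performs a left join on the index by default, so df2's row order is kept and a chromosome missing from df1 would get NaN in RXN (mirroring the awk behaviour):
result = df2.join(df1['RXN'])                  # left join on the Chromosome index
result = result[['RXN'] + list(df2.columns)]   # move RXN to the front
print(result)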


How to apply CountVectorizer to bigrams in a pandas dataframe

I'm trying to apply CountVectorizer to a dataframe containing bigrams, to convert it into a frequency matrix showing the number of times each bigram appears in each row, but I keep getting error messages.
This is what I tried:
cereal['bigrams'].head()
0 [(best, thing), (thing, I), (I, have),....
1 [(eat, it), (it, every), (every, morning),...
2 [(every, morning), (morning, my), (my, brother),...
3 [(I, have), (five, cartons), (cartons, lying),...
.........
from sklearn.feature_extraction.text import CountVectorizer

bow = CountVectorizer(max_features=5000, ngram_range=(2,2))
train_bow = bow.fit_transform(cereal['bigrams'])
train_bow
Expected results
(best,thing) (thing, I) (I, have) (eat,it) (every,morning)....
0 1 1 1 0 0
1 0 0 0 1 1
2 0 0 0 0 1
3 0 0 1 0 0
....
I see you are trying to convert a pd.Series into a count representation of each term.
That's a bit different from what CountVectorizer does.
From the function description:
Convert a collection of text documents to a matrix of token counts
The official usage example is:
>>> from sklearn.feature_extraction.text import CountVectorizer
>>> corpus = [
... 'This is the first document.',
... 'This document is the second document.',
... 'And this is the third one.',
... 'Is this the first document?',
... ]
>>> vectorizer = CountVectorizer()
>>> X = vectorizer.fit_transform(corpus)
>>> print(vectorizer.get_feature_names())
['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
>>> print(X.toarray())
[[0 1 1 1 0 0 1 0 1]
[0 2 0 1 0 1 1 0 1]
[1 0 0 1 1 0 1 1 1]
[0 1 1 1 0 0 1 0 1]]
So, as one can see, it takes as input a list where each term is a "document".
That's probably the cause of the errors you are getting: you are passing a pd.Series where each term is a list of tuples.
For you to use CountVectorizer you would have to transform your input into the proper format.
If you have the original corpus/text, you can easily apply CountVectorizer on top of it (with the ngram_range parameter) to get the desired result.
Otherwise, the best solution would be to treat it as what it is: a series with a list of items, which must be counted/pivoted.
Sample workaround (it would be a lot easier if you just used the text corpus instead):
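What follows is a minimal sketch of that count/pivot idea. It assumes the bigrams column really holds Python lists of tuples; the cereal frame below is a small stand-in for your data.
import pandas as pd

# Stand-in for the question's data: each cell is a list of bigram tuples
cereal = pd.DataFrame({'bigrams': [
    [('best', 'thing'), ('thing', 'I'), ('I', 'have')],
    [('eat', 'it'), ('it', 'every'), ('every', 'morning')],
]})

# One row per bigram, then cross-tabulate the original row index
# against the bigram values to get the frequency matrix
exploded = cereal['bigrams'].explode()
freq = pd.crosstab(exploded.index, exploded)
print(freq)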
Hope it helps!

Function does not return the list correctly

I have written code for adding the numbers from two different text files. For very big data (2-3 GB) I get a MemoryError, so I am writing new code using some functions to avoid loading the whole data into memory.
This code opens the input file 'd.txt' and reads the numbers that follow certain lines of a bigger data set, as follows:
SCALAR
ND 3
ST 0
TS 1000
1.0
1.0
1.0
SCALAR
ND 3
ST 0
TS 2000
3.3
3.4
3.5
SCALAR
ND 3
ST 0
TS 3000
1.7
1.8
1.9
and adds them to the numbers read from a smaller text file 'e.txt', as follows:
SCALAR
ND 3
ST 0
TS 0
10.0
10.0
10.0
The result is written to a text file 'output.txt' like this:
SCALAR
ND 3
ST 0
TS 1000
11.0
11.0
11.0
SCALAR
ND 3
ST 0
TS 2000
13.3
13.4
13.5
SCALAR
ND 3
ST 0
TS 3000
11.7
11.8
11.9
The code which I prepared:
def add_list_same(list1, list2):
    """
    list2 has the same size as list1
    """
    c = [a+b for a, b in zip(list1, list2)]
    print(c)
    return c

def list_numbers_after_ts(n, f):
    result = []
    for line in f:
        if line.startswith('TS'):
            for node in range(n):
                result.append(float(next(f)))
    return result

def writing_TS(f1):
    TS = []
    ND = []
    for line1 in f1:
        if line1.startswith('ND'):
            ND = float(line1.split()[-1])
        if line1.startswith('TS'):
            x = float(line1.split()[-1])
            TS.append(x)
    return TS, ND

with open('d.txt') as depth_dat_file, \
     open('e.txt') as elev_file, \
     open('output.txt', 'w') as out:
    m = writing_TS(depth_dat_file)
    print('number of TS', m[1])
    for j in range(0, int(m[1])-1):
        i = m[1]*j
        out.write('SCALAR\nND {0:2f}\nST 0\nTS {0:2f}\n'.format(m[1], m[0][j]))
        list1 = list_numbers_after_ts(int(m[1]), depth_dat_file)
        list2 = list_numbers_after_ts(int(m[1]), elev_file)
        Eh = add_list_same(list1, list2)
        out.writelines(["%.2f\n" % item for item in Eh])
The output.txt looks like this:
SCALAR
ND 3.000000
ST 0
TS 3.000000
SCALAR
ND 3.000000
ST 0
TS 3.000000
SCALAR
ND 3.000000
ST 0
TS 3.000000
The addition of the lists does not work; besides, I checked the functions separately and they work. I can't find the error. I have changed it a lot, but it still does not work. Any suggestions? I really appreciate any help you can provide!
You can use a grouper to read the files in fixed-size groups of lines. The following code should work if the order of lines within the groups is unchanged.
from itertools import zip_longest

# Split-by-group iterator
# See http://stackoverflow.com/questions/434287/what-is-the-most-pythonic-way-to-iterate-over-a-list-in-chunks
def grouper(iterable, n, padvalue=None):
    return zip_longest(*[iter(iterable)]*n, fillvalue=padvalue)

add_numbers = []
with open("e.txt") as f:
    # Read data by 7 lines
    for lines in grouper(f, 7):
        # Suppress first SCALAR line
        for line in lines[1:]:
            # add last number in every line to array (6 elements)
            add_numbers.append(float(line.split()[-1].strip()))

# template for every group
template = 'SCALAR\nND {:.2f}\nST {:.2f}\nTS {:.2f}\n{:.2f}\n{:.2f}\n{:.2f}\n'
with open("d.txt") as f, open('output.txt', 'w') as out:
    # As before
    for lines in grouper(f, 7):
        data_numbers = []
        for line in lines[1:]:
            data_numbers.append(float(line.split()[-1].strip()))
        # in result_numbers, sum the elements of the two arrays by pair (6 elements)
        result_numbers = [x + y for x, y in zip(data_numbers, add_numbers)]
        # * unpacks result_numbers as 6 arguments of format
        out.write(template.format(*result_numbers))
I had to change some small things in the code and now it works, but only for small input files, because many variables are loaded into memory. Can you please tell me how I can work with yield?
from itertools import zip_longest

def grouper(iterable, n, padvalue=None):
    return zip_longest(*[iter(iterable)]*n, fillvalue=padvalue)

def writing_ND(f1):
    for line1 in f1:
        if line1.startswith('ND'):
            ND = float(line1.split()[-1])
    return ND

def writing_TS(f):
    for line2 in f:
        if line2.startswith('TS'):
            x = float(line2.split()[-1])
            TS.append(x)
    return TS

TS = []
ND = []
x = 0.0
n = 0
add_numbers = []

with open("e.txt") as f, open("d.txt") as f1,\
     open('output.txt', 'w') as out:
    ND = writing_ND(f)
    TS = writing_TS(f1)
    n = int(ND) + 4
    f.seek(0)
    f1.seek(0)  # d.txt must be rewound as well, since writing_TS consumed it
    for lines in grouper(f, int(n)):
        for item in lines[4:]:
            add_numbers.append(float(item))
    i = 0
    for l in grouper(f1, n):
        data_numbers = []
        for line in l[4:]:
            data_numbers.append(float(line.split()[-1].strip()))
        result_numbers = [x + y for x, y in zip(data_numbers, add_numbers)]
        del data_numbers
        out.write('SCALAR\nND %d\nST 0\nTS %0.2f\n' % (ND, TS[i]))
        i += 1
        for item in result_numbers:
            out.write('%s\n' % item)
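To actually use yield here, one option is to wrap the block reading in a generator so that only one SCALAR block of d.txt is in memory at a time. A minimal sketch under the same assumptions as above (7-line blocks: four header lines followed by the ND values; blocks is an illustrative name, not from the thread):
def blocks(path, n):
    """Yield (header_lines, numbers) for each n-line block, one block at a time."""
    with open(path) as f:
        while True:
            header = [f.readline() for _ in range(4)]  # SCALAR, ND, ST, TS
            if not header[0]:
                return  # end of file
            numbers = [float(f.readline()) for _ in range(n - 4)]
            yield header, numbers

# e.txt holds a single block; d.txt may be arbitrarily large
_, add_numbers = next(blocks('e.txt', 7))
with open('output.txt', 'w') as out:
    for header, nums in blocks('d.txt', 7):
        out.writelines(header)  # copy the SCALAR/ND/ST/TS lines unchanged
        out.writelines('%.2f\n' % (a + b) for a, b in zip(nums, add_numbers))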

Select records based on a specific index string value and then remove subsequent fields using Python

I have a .csv file named file01.csv that contains many records. Some records are required and some are not. I find that the required records contain the string "Mi", which does not exist in the unnecessary records. So, I want to select the required records based on the string value "Mi" in a field of each record.
Finally, I want to delete the subsequent fields of each record, starting from the field that contains the value "Mi". Any suggestions and advice are appreciated.
Optional:
In addition, I want to delete the first column.
Split column BB into two columns named a_id and c_id. Separate the value by _ (underscore); the left side goes to a_id and the right side to c_id.
My fileO.csv is as follows:
AA BB CC DD EE FF GG
1 1_1.csv (=0 =10" 27" =57 "Mi"
0.97 0.9 0.8 NaN 0.9 od 0.2
2 1_3.csv (=0 =10" 27" "Mi" 0.5
0.97 0.5 0.8 NaN 0.9 od 0.4
3 1_6.csv (=0 =10" "Mi" =53 cnt
0.97 0.9 0.8 NaN 0.9 od 0.6
4 2_6.csv No Bi 000 000 000 000
5 2_8.csv No Bi 000 000 000 000
6 6_9.csv less 000 000 000 000
7 7_9.csv s(=0 =26" =46" "Mi" 121
My Expected results files (outFile.csv):
a_id b_id CC DD EE FF GG
1 1 0 10 27 57
1 3 0 10 27
1 6 0 10
7 9 0 26 46
The following approach should work fine using the Python csv module:
import csv
import re
import string

output_header = ['a_id', 'b_id', 'CC', 'DD', 'EE', 'FF', 'GG']

sanitise_table = string.maketrans("", "")
nodigits_table = sanitise_table.translate(sanitise_table, string.digits)

def find_mi(row):
    for index, col in enumerate(row):
        if col.find('Mi') != -1:
            return index
    return -1

def sanitise_cell(cell):
    return cell.translate(sanitise_table, nodigits_table)  # Keep digits

f_input = open('fileO.csv', 'rb')
f_output = open('outFile.csv', 'wb')

csv_input = csv.reader(f_input)
csv_output = csv.writer(f_output)

input_header = next(f_input)
csv_output.writerow(output_header)

for row in csv_input:
    #print '%2d %s' % (len(row), row)
    if len(row) >= 2:
        bb = re.match(r'(\d+)_(\d+)\.csv', row[1])  # match names like 1_1.csv
        mi = find_mi(row)
        if bb and mi != -1:
            row[:] = row[:mi] + [''] * (len(row) - mi)
            row[:] = [sanitise_cell(col) for col in row]
            row[0] = bb.group(1)
            row[1] = bb.group(2)
            csv_output.writerow(row)

f_input.close()
f_output.close()
outFile.csv will contain the following:
a_id,b_id,CC,DD,EE,FF,GG
1,1,0,10,27,57,
1,3,0,10,27,,
1,6,0,10,,,
7,9,0,26,46,,
Tested using Python 2.6.6
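If you need this under Python 3, note that the string.maketrans/translate pair used above no longer exists in that form, and the csv module expects text-mode files (open(name, newline='') rather than 'rb'/'wb'). A minimal sketch of an equivalent digit-only sanitiser:
import re

def sanitise_cell(cell):
    # Keep only the digits in a cell (Python 3 replacement for the
    # string.maketrans/translate version used above)
    return re.sub(r'\D', '', cell)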

write items from a list to csv file column by column using pandas dataframe.to_csv

I have a list named items:
items = ['a', 'b', 'c']
The code is:
df = pandas.DataFrame(items)
df.to_csv("myfile.csv", header=False, index=False)
The values are written to the file in different rows but the same column (vertically).
But I want the values to be written as: a b c, i.e. in the same row but different columns.
Help, please.
You get each element in a different row because you loaded the df that way.
If you want them in different columns, I would suggest transposing:
df = df.T
or you can load the data as one row, like below:
items=[['a' , 'b','c']]
df = pd.DataFrame(items)
df
Out[22]:
0 1 2
0 a b c
And then write the output to csv, e.g.:
df = pandas.DataFrame(items)
df = df.T
df.to_csv("myfile.csv", header=False, index=False)
df = pd.DataFrame(items)
df
Out[5]:
0
0 a
1 b
2 c
df.T
Out[11]:
0 1 2
0 a b c

Using JSON schema as column headers in dataframe

OK, as per a previous question (here) I've now managed to read a load of JSON data into R and get the data into a data frame. Here's the code:
getCall <- GET("http://long-url.com",
               authenticate("myusername", "password"))
contJSON <- content(getCall)
contJSON <- sub("\n\r\n", "", contJSON)
df1 <- fromJSON(sprintf("[%s]", gsub("\n", ",", contJSON)), asText=TRUE)
df <- data.frame(matrix(unlist(df1), nrow=31, byrow=T))
Which gets me a data frame that looks as follows:-
head(df[,1:8])
X1 X2 X3 X4 X5 X6 X7 X8
1 2013-05-01 33682 11838 8023 3815 84 177.000000 177.000000
2 2013-05-02 32622 11626 7945 3681 58 210.000000 210.000000
3 2013-05-03 28467 11102 7786 3316 56 186.000000 186.000000
4 2013-05-04 20884 9031 6670 2361 51 7.000000 7.000000
5 2013-05-05 20481 8782 6390 2392 58 1.000000 1.000000
6 2013-05-06 25175 10019 7082 2937 62 24.000000 24.000000
However, there are no column names in my data frame. When I search for "names" in my JSON object, R returns NULL, so that doesn't give me anything useful.
I am wondering if there is any simple way (that might be repeatable in more general cases) to get the names for the column headers from the JSON schema.
I'm aware there are similar questions elsewhere on the site, but this one did not appear to be covered.
EDIT: As per the comment, here is the structure of the contJSON object.
"{\"metricDate\":\"2013-05-01\",\"pageCountTotal\":\"33682\",\"landCountTotal\":\"11838\",\"newLandCountTotal\":\"8023\",\"returnLandCountTotal\":\"3815\",\"spiderCountTotal\":\"84\",\"goalCountTotal\":\"177.000000\",\"callGoalCountTotal\":\"177.000000\",\"callCountTotal\":\"237.000000\",\"onlineGoalCountTotal\":\"0.000000\",\"conversionPerc\":\"1.50\",\"callConversionPerc\":\"74.68\"}\n{\"metricDate\":\"2013-05-02\",\"pageCountTotal\":\"32622\",\"landCountTotal\":\"11626\",\"newLandCountTotal\":\"7945\",\"returnLandCountTotal\":\"3681\",\"spiderCountTotal\":\"58\",\"goalCountTotal\":\"210.000000\",\"callGoalCountTotal\":\"210.000000\",\"callCountTotal\":\"297.000000\",\"onlineGoalCountTotal\":\"0.000000\",\"conversionPerc\":\"1.81\",\"callConversionPerc\":\"70.71\"}\n{\"metricDate\":\"2013-05-03\",\"pageCountTotal\":\"28467\",\"landCountTotal\":\"11102\",\"newLandCountTotal\":\"7786\",\"returnLandCountTotal\":\"3316\",\"spiderCountTotal\":\"56\",\"goalCountTotal\":\"186.000000\",\"callGoalCountTotal\":\"186.000000\",\"callCountTotal\":\"261.000000\",\"onlineGoalCountTotal\":\"0.000000\",\"conversionPerc\":\"1.68\",\"callConversionPerc\":\"71.26\"}\n{\"metricDate\":\"2013-05-04\",\"pageCountTotal\":\"20884\",\"landCountTotal\":\"9031\",\"newLandCountTotal\":\"6670\",\"returnLandCountTotal\":\"2361\",\"spiderCountTotal\":\"51\",\"goalCountTotal\":\"7.000000\",\"callGoalCountTotal\":\"7.000000\",\"callCountTotal\":\"44.000000\",\"onlineGoalCountTotal\":\"0.000000\",\"conversionPerc\":\"0.08\",\"callConversionPerc\":\"15.91\"}\n{\"metricDate\":\"2013-05-05\",\"pageCountTotal\":\"20481\",\"landCountTotal\":\"8782\",\"newLandCountTotal\":\"6390\",\"returnLandCountTotal\":\"2392\",\"spiderCountTotal\":\"58\",\"goalCountTotal\":\"1.000000\",\"callGoalCountTotal\":\"1.000000\",\"callCountTotal\":\"8.000000\",\"onlineGoalCountTotal\":\"0.000000\",\"conversionPerc\":\"0.01\",\"callConversionPerc\":\"12.50\"}\n{\"metricDate\":\"2013-05-06\",\"pageCountTotal\":\"25175\",\"landCountTotal\":\"10019\",\"newLandCountTotal\":\"7082\",\"returnLandCountTotal\":\"2937\",\"spiderCountTotal\":\"62\",\"goalCountTotal\":\"24.000000\",\"callGoalCountTotal\":\"24.000000\",\"callCountTotal\":\"47.000000\",\"onlineGoalCountTotal\":\"0.000000\",\"conversionPerc\":\"0.24\",\"callConversionPerc\":\"51.06\"}\n{\"metricDate\":\"2013-05-07\",\"pageCountTotal\":\"35892\",\"landCountTotal\":\"12615\",\"newLandCountTotal\":\"8391\",\"returnLandCountTotal\":\"4224\",\"spiderCountTotal\":\"62\",\"goalCountTotal\":\"239.000000\",\"callGoalCountTotal\":\"239.000000\",\"callCountTotal\":\"321.000000\",\"onlineGoalCountTotal\":\"0.000000\",\"conversionPerc\":\"1.89\",\"callConversionPerc\":\"74.45\"}\n{\"metricDate\":\"2013-05-08\",\"pageCountTotal\":\"34106\",\"landCountTotal\":\"12391\",\"newLandCountTotal\":\"8389\",\"returnLandCountTotal\":\"4002\",\"spiderCountTotal\":\"90\",\"goalCountTotal\":\"221.000000\",\"callGoalCountTotal\":\"221.000000\",\"callCountTotal\":\"295.000000\",\"onlineGoalCountTotal\":\"0.000000\",\"conversionPerc\":\"1.78\",\"callConversionPerc\":\"74.92\"}\n{\"metricDate\":\"2013-05-09\",\"pageCountTotal\":\"32721\",\"landCountTotal\":\"12447\",\"newLandCountTotal\":\"8541\",\"returnLandCountTotal\":\"3906\",\"spiderCountTotal\":\"54\",\"goalCountTotal\":\"207.000000\",\"callGoalCountTotal\":\"207.000000\",\"callCountTotal\":\"280.000000\",\"onlineGoalCountTotal\":\"0.000000\",\"conversionPerc\":\"1.66\",\"callConversionPerc\":\"73.93\"}\n{\"metricDate\":\"2013-05-10\",\"pageCountTotal\":\"29724\",\"landCou
ntTotal\":\"11616\",\"newLandCountTotal\":\"8063\",\"returnLandCountTotal\":\"3553\",\"spiderCountTotal\":\"139\",\"goalCountTotal\":\"207.000000\",\"callGoalCountTotal\":\"207.000000\",\"callCountTotal\":\"301.000000\",\"onlineGoalCountTotal\":\"0.000000\",\"conversionPerc\":\"1.78\",\"callConversionPerc\":\"68.77\"}\n{\"metricDate\":\"2013-05-11\",\"pageCountTotal\":\"22061\",\"landCountTotal\":\"9660\",\"newLandCountTotal\":\"6971\",\"returnLandCountTotal\":\"2689\",\"spiderCountTotal\":\"52\",\"goalCountTotal\":\"3.000000\",\"callGoalCountTotal\":\"3.000000\",\"callCountTotal\":\"40.000000\",\"onlineGoalCountTotal\":\"0.000000\",\"conversionPerc\":\"0.03\",\"callConversionPerc\":\"7.50\"}\n{\"metricDate\":\"2013-05-12\",\"pageCountTotal\":\"23341\",\"landCountTotal\":\"9935\",\"newLandCountTotal\":\"6960\",\"returnLandCountTotal\":\"2975\",\"spiderCountTotal\":\"45\",\"goalCountTotal\":\"0.000000\",\"callGoalCountTotal\":\"0.000000\",\"callCountTotal\":\"12.000000\",\"onlineGoalCountTotal\":\"0.000000\",\"conversionPerc\":\"0.00\",\"callConversionPerc\":\"0.00\"}\n{\"metricDate\":\"2013-05-13\",\"pageCountTotal\":\"36565\",\"landCountTotal\":\"13583\",\"newLandCountTotal\":\"9277\",\"returnLandCountTotal\":\"4306\",\"spiderCountTotal\":\"69\",\"goalCountTotal\":\"246.000000\",\"callGoalCountTotal\":\"246.000000\",\"callCountTotal\":\"324.000000\",\"onlineGoalCountTotal\":\"0.000000\",\"conversionPerc\":\"1.81\",\"callConversionPerc\":\"75.93\"}\n{\"metricDate\":\"2013-05-14\",\"pageCountTotal\":\"35260\",\"landCountTotal\":\"13797\",\"newLandCountTotal\":\"9375\",\"returnLandCountTotal\":\"4422\",\"spiderCountTotal\":\"59\",\"goalCountTotal\":\"212.000000\",\"callGoalCountTotal\":\"212.000000\",\"callCountTotal\":\"283.000000\",\"onlineGoalCountTotal\":\"0.000000\",\"conversionPerc\":\"1.54\",\"callConversionPerc\":\"74.91\"}\n{\"metricDate\":\"2013-05-15\",\"pageCountTotal\":\"35836\",\"landCountTotal\":\"13792\",\"newLandCountTotal\":\"9532\",\"returnLandCountTotal\":\"4260\",\"spiderCountTotal\":\"94\",\"goalCountTotal\":\"187.000000\",\"callGoalCountTotal\":\"187.000000\",\"callCountTotal\":\"258.000000\",\"onlineGoalCountTotal\":\"0.000000\",\"conversionPerc\":\"1.36\",\"callConversionPerc\":\"72.48\"}\n{\"metricDate\":\"2013-05-16\",\"pageCountTotal\":\"33136\",\"landCountTotal\":\"12821\",\"newLandCountTotal\":\"8755\",\"returnLandCountTotal\":\"4066\",\"spiderCountTotal\":\"65\",\"goalCountTotal\":\"192.000000\",\"callGoalCountTotal\":\"192.000000\",\"callCountTotal\":\"260.000000\",\"onlineGoalCountTotal\":\"0.000000\",\"conversionPerc\":\"1.50\",\"callConversionPerc\":\"73.85\"}\n{\"metricDate\":\"2013-05-17\",\"pageCountTotal\":\"29564\",\"landCountTotal\":\"11721\",\"newLandCountTotal\":\"8191\",\"returnLandCountTotal\":\"3530\",\"spiderCountTotal\":\"213\",\"goalCountTotal\":\"166.000000\",\"callGoalCountTotal\":\"166.000000\",\"callCountTotal\":\"222.000000\",\"onlineGoalCountTotal\":\"0.000000\",\"conversionPerc\":\"1.42\",\"callConversionPerc\":\"74.77\"}\n{\"metricDate\":\"2013-05-18\",\"pageCountTotal\":\"23686\",\"landCountTotal\":\"9916\",\"newLandCountTotal\":\"7335\",\"returnLandCountTotal\":\"2581\",\"spiderCountTotal\":\"56\",\"goalCountTotal\":\"5.000000\",\"callGoalCountTotal\":\"5.000000\",\"callCountTotal\":\"34.000000\",\"onlineGoalCountTotal\":\"0.000000\",\"conversionPerc\":\"0.05\",\"callConversionPerc\":\"14.71\"}\n{\"metricDate\":\"2013-05-19\",\"pageCountTotal\":\"23528\",\"landCountTotal\":\"9952\",\"newLandCountTotal\":\"7184\",\"returnLandCountTotal
\":\"2768\",\"spiderCountTotal\":\"57\",\"goalCountTotal\":\"1.000000\",\"callGoalCountTotal\":\"1.000000\",\"callCountTotal\":\"14.000000\",\"onlineGoalCountTotal\":\"0.000000\",\"conversionPerc\":\"0.01\",\"callConversionPerc\":\"7.14\"}\n{\"metricDate\":\"2013-05-20\",\"pageCountTotal\":\"37391\",\"landCountTotal\":\"13488\",\"newLandCountTotal\":\"9024\",\"returnLandCountTotal\":\"4464\",\"spiderCountTotal\":\"69\",\"goalCountTotal\":\"227.000000\",\"callGoalCountTotal\":\"227.000000\",\"callCountTotal\":\"291.000000\",\"onlineGoalCountTotal\":\"0.000000\",\"conversionPerc\":\"1.68\",\"callConversionPerc\":\"78.01\"}\n{\"metricDate\":\"2013-05-21\",\"pageCountTotal\":\"36299\",\"landCountTotal\":\"13174\",\"newLandCountTotal\":\"8817\",\"returnLandCountTotal\":\"4357\",\"spiderCountTotal\":\"77\",\"goalCountTotal\":\"164.000000\",\"callGoalCountTotal\":\"164.000000\",\"callCountTotal\":\"221.000000\",\"onlineGoalCountTotal\":\"0.000000\",\"conversionPerc\":\"1.24\",\"callConversionPerc\":\"74.21\"}\n{\"metricDate\":\"2013-05-22\",\"pageCountTotal\":\"34201\",\"landCountTotal\":\"12433\",\"newLandCountTotal\":\"8388\",\"returnLandCountTotal\":\"4045\",\"spiderCountTotal\":\"76\",\"goalCountTotal\":\"195.000000\",\"callGoalCountTotal\":\"195.000000\",\"callCountTotal\":\"262.000000\",\"onlineGoalCountTotal\":\"0.000000\",\"conversionPerc\":\"1.57\",\"callConversionPerc\":\"74.43\"}\n{\"metricDate\":\"2013-05-23\",\"pageCountTotal\":\"32951\",\"landCountTotal\":\"11611\",\"newLandCountTotal\":\"7757\",\"returnLandCountTotal\":\"3854\",\"spiderCountTotal\":\"68\",\"goalCountTotal\":\"167.000000\",\"callGoalCountTotal\":\"167.000000\",\"callCountTotal\":\"231.000000\",\"onlineGoalCountTotal\":\"0.000000\",\"conversionPerc\":\"1.44\",\"callConversionPerc\":\"72.29\"}\n{\"metricDate\":\"2013-05-24\",\"pageCountTotal\":\"28967\",\"landCountTotal\":\"10821\",\"newLandCountTotal\":\"7396\",\"returnLandCountTotal\":\"3425\",\"spiderCountTotal\":\"106\",\"goalCountTotal\":\"167.000000\",\"callGoalCountTotal\":\"167.000000\",\"callCountTotal\":\"203.000000\",\"onlineGoalCountTotal\":\"0.000000\",\"conversionPerc\":\"1.54\",\"callConversionPerc\":\"82.27\"}\n{\"metricDate\":\"2013-05-25\",\"pageCountTotal\":\"19741\",\"landCountTotal\":\"8393\",\"newLandCountTotal\":\"6168\",\"returnLandCountTotal\":\"2225\",\"spiderCountTotal\":\"78\",\"goalCountTotal\":\"0.000000\",\"callGoalCountTotal\":\"0.000000\",\"callCountTotal\":\"28.000000\",\"onlineGoalCountTotal\":\"0.000000\",\"conversionPerc\":\"0.00\",\"callConversionPerc\":\"0.00\"}\n{\"metricDate\":\"2013-05-26\",\"pageCountTotal\":\"19770\",\"landCountTotal\":\"8237\",\"newLandCountTotal\":\"6009\",\"returnLandCountTotal\":\"2228\",\"spiderCountTotal\":\"79\",\"goalCountTotal\":\"0.000000\",\"callGoalCountTotal\":\"0.000000\",\"callCountTotal\":\"8.000000\",\"onlineGoalCountTotal\":\"0.000000\",\"conversionPerc\":\"0.00\",\"callConversionPerc\":\"0.00\"}\n{\"metricDate\":\"2013-05-27\",\"pageCountTotal\":\"26208\",\"landCountTotal\":\"9755\",\"newLandCountTotal\":\"6779\",\"returnLandCountTotal\":\"2976\",\"spiderCountTotal\":\"82\",\"goalCountTotal\":\"26.000000\",\"callGoalCountTotal\":\"26.000000\",\"callCountTotal\":\"40.000000\",\"onlineGoalCountTotal\":\"0.000000\",\"conversionPerc\":\"0.27\",\"callConversionPerc\":\"65.00\"}\n{\"metricDate\":\"2013-05-28\",\"pageCountTotal\":\"36980\",\"landCountTotal\":\"12463\",\"newLandCountTotal\":\"8226\",\"returnLandCountTotal\":\"4237\",\"spiderCountTotal\":\"132\",\"goalCountTotal\":\"208.000000\",\"c
allGoalCountTotal\":\"208.000000\",\"callCountTotal\":\"276.000000\",\"onlineGoalCountTotal\":\"0.000000\",\"conversionPerc\":\"1.67\",\"callConversionPerc\":\"75.36\"}\n{\"metricDate\":\"2013-05-29\",\"pageCountTotal\":\"34190\",\"landCountTotal\":\"12014\",\"newLandCountTotal\":\"8279\",\"returnLandCountTotal\":\"3735\",\"spiderCountTotal\":\"90\",\"goalCountTotal\":\"179.000000\",\"callGoalCountTotal\":\"179.000000\",\"callCountTotal\":\"235.000000\",\"onlineGoalCountTotal\":\"0.000000\",\"conversionPerc\":\"1.49\",\"callConversionPerc\":\"76.17\"}\n{\"metricDate\":\"2013-05-30\",\"pageCountTotal\":\"33867\",\"landCountTotal\":\"11965\",\"newLandCountTotal\":\"8231\",\"returnLandCountTotal\":\"3734\",\"spiderCountTotal\":\"63\",\"goalCountTotal\":\"160.000000\",\"callGoalCountTotal\":\"160.000000\",\"callCountTotal\":\"219.000000\",\"onlineGoalCountTotal\":\"0.000000\",\"conversionPerc\":\"1.34\",\"callConversionPerc\":\"73.06\"}\n{\"metricDate\":\"2013-05-31\",\"pageCountTotal\":\"27536\",\"landCountTotal\":\"10302\",\"newLandCountTotal\":\"7333\",\"returnLandCountTotal\":\"2969\",\"spiderCountTotal\":\"108\",\"goalCountTotal\":\"173.000000\",\"callGoalCountTotal\":\"173.000000\",\"callCountTotal\":\"226.000000\",\"onlineGoalCountTotal\":\"0.000000\",\"conversionPerc\":\"1.68\",\"callConversionPerc\":\"76.55\"}\n\r\n"
One thing that works is to split on newlines, call fromJSON on each row, then recombine the result.
contJSON <- sub("\n\r\n", "", contJSON) #as before
rowJSON <- strsplit(contJSON, "\n")[[1]]
row <- lapply(rowJSON, fromJSON)
as.data.frame(do.call(rbind, row))