Tools / ways to process unstructured medical text data to CSV - data-analysis

10/03/2014 16:55 Local Title: TRANSFER OUT NOTE
Standard Title: TRANSFER SUMMARIZATION NOTE
AUTHOR: D,WARD
XYZ MEDICAL INSTITUTE
ABC NAGAR, PQW CITY-101011
******************************************************************
TRANSFER OUT NOTE
******************* OCT 03, 2014
UHID:000-01-0202 PATIENT NAME: NAME , SINGH
AGE/SEX:42/FEMALE
DOA:Sep 30,2014
DEPARTMENT:GYNAE AND OBSTETRICS UNIT:II
TRANSFERRED FROM:D3
NAME , SINGH 000-01-0202 DOB: 01/01/1972
TRANSFERRED TO : MCU
DIAGNOSIS:pop- em lscs with male baby nicu B
TREATMENT:
inj.cefazolin 1 gm bd
inj.rantac 1 amp tds
inj.perinorm 1 amp tds
inj.pcm 1 gm tds
inj.texid 1 gm tds
PATIENT STATUS AT THE TIME OF SHIFTING:
g.c. fair on iv fluid ..
NAME , SINGH 000-01-0202 DOB: 01/01/1972
VITALS AT THE TIME OF SHIFTING:
TEMP:98.6F
HR:88/MIN RR:24/MIN
GCS: E V M
< THE ABOVE NOTE IS UNSIGNED >
- DRAFT COPY * DRAFT COPY * DRAFT COPY * DRAFT COPY * DRAFT COPY * DRAFT COPY -
09/21/2014 23:01 Local Title: MED ONCO IRCH DISCHARGE SUMMARY
Standard Title: DISCHARGE SUMMARY
AUTHOR: KUMAR,UVW
LOCAL TITLE: MED ONCO IRCH DISCHARGE SUMMARY
STANDARD TITLE: DISCHARGE SUMMARY
NAME , SINGH 000-01-0202 DOB: 01/01/1972
DATE OF NOTE: SEP 21, 2014#22:04 ENTRY DATE: SEP 21, 2014#22:04:42
AUTHOR: UVW KUMAR
REGISTRATION DETAILS
********************
UHID No:000-01-0202 IRCH No:000222 CR No:111000
NAME: NAME AGE:22 YEAR GENDER:MALE
DOA:Sep 2, 2014 DOD:Sep 18, 2014 DURATION OF STAY: days
WARD: MRO Ward BED No:14
CONSULTANT INCHARGE:Dr UVW Kumar
DIAGNOSIS & REASON FOR CURRENT ADMISSION
****************************************
DIAGNOSIS:Acute Promyelocytic leukemia (Intermediate Risk)
ADMITTED FOR :Chemotherapy
CASE SUMMARY:NAME Singh presented with complaints of bleeding gums, fever,
NAME , SINGH 000-01-0202 DOB: 01/01/1972
blurring of vision and gum hypertrophy. He diagnosed as APML in PQW
hospital based on PS, BMA and PML/RARa positive. He started on ATRA and after
that reffered here. His basline hemorem at PQW Hospital was s/o Hb :
4.6, TLC: 1580/cu.mm, Platlet: 6000/cu.mm. So he is classified as
intermideate risk APML. After coming here diagnosis reconfirmed,
daunorubicin given 60mg/m2 and continoued on ATRA. No features of
ATRA syndrome noticed during ward stay. His fibrinogen level were > 450
mg/dl. He remained afebrile and hemodynamically stable and dischared on
stable condition.
PRESENTATION AT CURRENT ADMISSION
*********************************
VITAL SIGNS:
TEMP:99 F RESP:19/min PULSE:98/min
BP:121/78 mm of Hg SPO2:99% on RA
NAME , SINGH 000-01-0202 DOB: 01/01/1972
GENERAL PHYSICAL EXAMINATION: PERFORMANCE STATUS: I
PALLOR:+ ICTERUS:- OEDEMA:- CYANOSIS:-
STERNAL TENDERNESS:- CLUBBING:- GUM HYPERTROPHY:+
LYMPHNODES: -
BIOMETRIC DETAILS: WEIGHT: 45 kg HEIGHT:166 cms BSA: 1.4 m2
INVESTIGATIONS AT CURRENT ADMISSSION
************************************
PS (3/9/2014) : N2, L8, E-, M1, B-, Meta-, Myelo-, Blast 89%. Blast and abnormal
promyelocytes present. F/S/O Acute promyelocytic leukemia.
BMA (3/9/2014): Cellular BM shows 90% blast and abnormal promyelocyte. F/S/O
APML.
Flow Cytometery (3/9/2014): 87% abnormal promyelocyte, Positive : CD45, CD15,
NAME , SINGH 000-01-0202 DOB: 01/01/1972
CD11b, CD13, CD33, CD64, CD9, CD18, cMPO.
Negative for CD2, CD14, CD117, CD19, HLADR, CCD79a, cCD3.
Day 12 PS (9/9/2014): N78, L20, E-, M2, B-, Meta-, Myelo_ Promyelo Nil, Blast
Nil.
Condition at discharge:
VITAL SIGNS:
TEMP:99 F RESP:18/min PULSE:78/min
BP:112/74 mm of Hg SPO2:99% on RA
Plan At discharge and follow up: As written in OPD card
NAME , SINGH 000-01-0202 DOB: 01/01/1972
< THE ABOVE NOTE IS UNSIGNED >
- DRAFT COPY * DRAFT COPY * DRAFT COPY * DRAFT COPY * DRAFT COPY * DRAFT COPY -
09/21/2014 22:04 Local Title: MED ONCO IRCH DISCHARGE SUMMARY
Standard Title: DISCHARGE SUMMARY
AUTHOR: UVW,AMIT
REGISTRATION DETAILS
********************
UHID No:000-01-0202 IRCH No:000222 CR No:111000
NAME: NAME , SINGH AGE:42 GENDER:FEMALE
DOA:Sep 2, 2014 DOD:Sep 18, 2014 DURATION OF STAY: days
WARD: MRO Ward BED No:14
CONSULTANT INCHARGE:Dr Lalit Kumar
ADDRESS: ,
NAME , SINGH 000-01-0202 DOB: 01/01/1972
DIAGNOSIS & REASON FOR CURRENT ADMISSION
****************************************
DIAGNOSIS:
Acute Promyelocytic leukemia (Intermediate Risk)
ADMITTED FOR :Chemotherapy
CASE SUMMARY:NAME Singh presented with complaints of bleeding gums,
fever, blurring of vision and gum hypertrophy. He diagnosed as APML in
UVW hospital based on PS and PML/RARa positive. He started on ATRA and
after that reffered to XYZ hospital
PRESENTATION AT CURRENT ADMISSION
*********************************
VITAL SIGNS:
TEMP:F RESP:/min PULSE:/min
BP:/mm of Hg SPO2:%
NAME , SINGH 000-01-0202 DOB: 01/01/1972
GENERAL PHYSICAL EXAMINATION: PERFORMANCE STATUS:
PALLOR: ICTERUS: OEDEMA: CYANOSIS:
STERNAL TENDERNESS: CLUBBING: GUM HYPERTROPHY:
LYMPHNODES:
SPECIFIC FINDINGS:
BIOMETRIC DETAILS: WEIGHT:kgS HEIGHT:cms BSA: m2
INVESTIGATIONS AT CURRENT ADMISSSION
************************************
< THE ABOVE NOTE IS UNSIGNED >
- DRAFT COPY * DRAFT COPY * DRAFT COPY * DRAFT COPY * DRAFT COPY * DRAFT COPY -
NAME , SINGH 000-01-0202 DOB: 01/01/1972
This is the text content I need to convert to CSV. It contains the details of one patient who came to the hospital multiple times. I want to extract the medical data into separate columns [Age, Sex, UHID, DOA, Department, Diagnosis, Treatment, Patient Status, Vitals, Local Title, Standard Title, Case Summary, Admitted For, General Physical Examination].
As you can see, "DIAGNOSIS" repeats, and there is a chance that the column names differ between notes as well.
The file to be processed is 15 GB.
Please suggest a way to solve this. I have tried Python, OpenRefine and the cTAKES tool.
Please shed some light on how to approach this kind of problem. The restriction is that we have to use only free, open-source tools.

You can do some of this with gawk. The multiline fields, like vitals and treatment, may prove tricky to shoehorn into CSV format, but here's a start on the single-value fields.
function dump() {
    print age "," sex "," uhid "," doa "," dept "," diagnosis
    # clear the fields so values from one note cannot leak into the next
    age = sex = uhid = doa = dept = diagnosis = ""
}
BEGIN { onfirst = 1 }
END { dump() }
{
    sub(/^ */, "")           # strip leading whitespace
    sub(/UHID No/, "UHID")   # normalise both UHID spellings
}
match($0, /UHID:([^ ]*)/, a) {
    # a UHID line starts a new note: flush the previous one first
    if (onfirst)
        onfirst = 0
    else
        dump()
    uhid = a[1]
}
match($0, /AGE\/SEX:([0-9]*)\/(.*[^ ]) *$/, a) {
    age = a[1]
    sex = a[2]
}
match($0, /DOA:([^ ][^ ]* *[^ ][^ ]* *[^ ][^ ]*)/, a) {
    doa = a[1]
}
match($0, /DEPARTMENT:(.*[^ ]) *UNIT/, a) {
    dept = a[1]
}
match($0, /DIAGNOSIS:(.*[^ ]) *$/, a) {
    diagnosis = a[1]
}
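Since you already tried Python, here is a minimal sketch of the same idea in pure Python: read the file line by line (so the 15 GB size is not a problem), match one regex per single-value field, and flush a CSV row whenever a new UHID line appears. The file names, column set and exact regexes here are illustrative assumptions and would need tuning against the real notes.

```python
import csv
import re

# Hypothetical file names for illustration.
IN_PATH = "notes.txt"
OUT_PATH = "notes.csv"

# One regex per single-value field; "UHID" also marks the start of a new note.
FIELDS = {
    "uhid": re.compile(r"UHID(?: No)?:\s*(\S+)"),
    "age": re.compile(r"AGE/SEX:\s*(\d+)/"),
    "sex": re.compile(r"AGE/SEX:\s*\d+/(\S+)"),
    "doa": re.compile(r"DOA:\s*(\w+ \d+,\s*\d{4})"),
    "department": re.compile(r"DEPARTMENT:\s*(.+?)\s*UNIT"),
    "diagnosis": re.compile(r"DIAGNOSIS:\s*(.+?)\s*$"),
}

def extract(in_path, out_path):
    record = {}
    with open(in_path, encoding="utf-8", errors="replace") as fin, \
         open(out_path, "w", newline="") as fout:
        writer = csv.DictWriter(fout, fieldnames=list(FIELDS))
        writer.writeheader()
        for line in fin:  # streams the file; never loads all 15 GB at once
            for name, rx in FIELDS.items():
                m = rx.search(line)
                if not m:
                    continue
                if name == "uhid" and record:
                    writer.writerow(record)  # a new UHID starts the next note
                    record = {}
                # keep the first occurrence of each field within a note
                record.setdefault(name, m.group(1))
        if record:
            writer.writerow(record)
```

Missing fields come out as empty cells (DictWriter's default), which makes it easier to spot notes where a heading was spelled differently so you can add an alternative pattern for it.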

Related

Difference in Difference in R (Callaway & Sant'Anna)

I'm trying to implement the DiD package by Callaway and Sant'Anna in my master thesis, but I'm coming across errors when I run the DiD code and when I try to view the summary.
did1 <- att_gt(yname = "countgreen",
gname = "signing_year",
idname = "investorid",
tname = "dealyear",
data = panel8)
This code warns me that:
"Be aware that there are some small groups in your dataset.
Check groups: 2006,2007,2008,2011. Dropped 109 observations that had missing data.overlap condition violated for 2009 in time period 2001Not enough control units for group 2009 in time period 2001 to run specified regression"
This error is repeated several hundred times.
Does this mean I need to re-match my treatment firms to control firms using a 1:3 ratio (treat:control) rather than the 1:1 I used previously?
Then when I run this code:
summary(did1)
I get this message:
Error in Math.data.frame(list(`mpobj$group` = c(2009L, 2009L, 2009L, 2009L, : non-numeric variable(s) in data frame: mpobj$att
I'm really not too sure what this means.
Can anyone help troubleshoot?
Thanks,
Rory
I don't know the DiD package, but I can comment on the summary(did1) error.
If you do str(did1) you should have something like this :
'data.frame': 6 obs. of 7 variables:
$ cluster : int 1 2 3 4 5 6
$ price_scal : num -0.572 -0.132 0.891 1.091 -0.803 ...
$ hd_scal : num -0.778 0.63 0.181 -0.24 0.244 ...
$ ram_scal : num -0.6937 0.00479 0.46411 0.00653 -0.31204 ...
$ screen_scal: num -0.457 2.642 -0.195 2.642 -0.325 ...
$ ads_scal : num 0.315 -0.889 0.472 0.47 -0.822 ...
$ trend_scal : num -0.604 1.267 -0.459 -0.413 1.156 ...
But in your case, the error says mpobj$att is a factor or character column rather than numeric. Converting it to numeric may also make the DiD code run.

JSON string to date with Javascript in Google Apps Script editor

I am working through w3schools, particularly https://www.w3schools.com/js/js_json_parse.asp
I ran this example and got an unexpected result
let dashText = '{ "name":"John", "birth":"1986-12-14", "city":"New York"}';
let objD = JSON.parse(dashText);
console.log("objD: ", objD);
objD.birth = new Date(objD.birth);
console.log("objD.birth: ", objD.birth);
3:09:04 PM Info objD: { name: 'John', birth: '1986-12-14', city: 'New York' }
3:09:04 PM Info objD.birth: Sat Dec 13 1986 18:00:00 GMT-0600 (Central Standard Time)
Note the difference in the dates. I then changed the dashes to slashes out of curiosity and the date was correctly determined from the string.
let slashText = '{ "name":"John", "birth":"1986/12/14", "city":"New York"}';
let objS = JSON.parse(slashText);
console.log("objS: ", objS);
objS.birth = new Date(objS.birth);
console.log("objS.birth: ", objS.birth);
3:09:04 PM Info objS: { name: 'John', birth: '1986/12/14', city: 'New York' }
3:09:04 PM Info objS.birth: Sun Dec 14 1986 00:00:00 GMT-0600 (Central Standard Time)
Can anyone explain the results?
JavaScript parses date strings differently based on how they are formatted. Dash-separated dates like "1986-12-14" are treated as ISO 8601 and interpreted as midnight UTC. You can see this in the timezone conversion: the time becomes 18:00:00 on the previous day to account for the 6-hour shift from Universal Time to Central Standard Time. Slash-separated dates are parsed as local time, so no timezone adjustment is applied.
Here's a w3schools link that goes over this in more detail.

Web-scraping in IBM Watson Studio Jupyter Notebook using BeautifulSoup not working

I'm looking to scrape data in an IBM Watson Studio Jupyter Notebook from this search result page:
https://www.aspc.co.uk/search/?PrimaryPropertyType=Rent&SortBy=PublishedDesc&LastUpdated=AddedAnytime&SearchTerm=&PropertyType=Residential&PriceMin=&PriceMax=&Bathrooms=&OrMoreBathrooms=true&Bedrooms=&OrMoreBedrooms=true&HasCentralHeating=false&HasGarage=false&HasDoubleGarage=false&HasGarden=false&IsNewBuild=false&IsDevelopment=false&IsParkingAvailable=false&IsPartExchangeConsidered=false&PublicRooms=&OrMorePublicRooms=true&IsHmoLicense=false&IsAllowPets=false&IsAllowSmoking=false&IsFullyFurnished=false&IsPartFurnished=false&IsUnfurnished=false&ExcludeUnderOffer=false&IncludeClosedProperties=true&ClosedDatesSearch=14&MapSearchType=EDITED&ResultView=LIST&ResultMode=NONE&AreaZoom=13&AreaCenter[lat]=57.14955426557916&AreaCenter[lng]=-2.0927401123046785&EditedZoom=13&EditedCenter[lat]=57.14955426557916&EditedCenter[lng]=-2.0927401123046785
I've tried BeautifulSoup and attempted Selenium (full disclosure: I am a beginner) with multiple variations of the code. I've gone over dozens of questions on Stack Overflow, Medium articles, etc., and I cannot understand what I'm doing wrong.
The latest one I'm doing is:
from bs4 import BeautifulSoup
html_soup = BeautifulSoup(response.text, 'html.parser')
type(html_soup)
properties_containers = html_soup.find_all('div', class_ = 'information-card property-card col ')
print(type(properties_containers))
print(len(properties_containers))
This returns 0.
<class 'bs4.element.ResultSet'>
0
Can someone please guide me in the right direction as to what I'm doing wrong/ missing?
The data you see is loaded via JavaScript. BeautifulSoup cannot execute it, but you can use the requests module to load the data from their API.
For example:
import json
import requests

url = 'https://www.aspc.co.uk/search/?PrimaryPropertyType=Rent&SortBy=PublishedDesc&LastUpdated=AddedAnytime&SearchTerm=&PropertyType=Residential&PriceMin=&PriceMax=&Bathrooms=&OrMoreBathrooms=true&Bedrooms=&OrMoreBedrooms=true&HasCentralHeating=false&HasGarage=false&HasDoubleGarage=false&HasGarden=false&IsNewBuild=false&IsDevelopment=false&IsParkingAvailable=false&IsPartExchangeConsidered=false&PublicRooms=&OrMorePublicRooms=true&IsHmoLicense=false&IsAllowPets=false&IsAllowSmoking=false&IsFullyFurnished=false&IsPartFurnished=false&IsUnfurnished=false&ExcludeUnderOffer=false&IncludeClosedProperties=true&ClosedDatesSearch=14&MapSearchType=EDITED&ResultView=LIST&ResultMode=NONE&AreaZoom=13&AreaCenter[lat]=57.14955426557916&AreaCenter[lng]=-2.0927401123046785&EditedZoom=13&EditedCenter[lat]=57.14955426557916&EditedCenter[lng]=-2.0927401123046785'
api_url = 'https://api.aspc.co.uk/Property/GetProperties?{}&Sort=PublishedDesc&Page=1&PageSize=12'

params = url.split('?')[-1]
data = requests.get(api_url.format(params)).json()

# print(json.dumps(data, indent=4))  # <-- uncomment this to see all data received from server

# print some data to screen:
for property_ in data:
    print(property_['Location']['AddressLine1'])
    print(property_['CategorisationDescription'])
    print('Bedrooms:', property_["Bedrooms"])
    print('Bathrooms:', property_["Bathrooms"])
    print('PublicRooms:', property_["PublicRooms"])
    # .. etc.
    print('-' * 80)
Prints:
44 Roslin Place
Fully furnished 2 Bdrm 1st flr Flat. Hall. Lounge. Dining kitch. 2 Bdrms. Bathrm (CT band - C). Deposit 1 months rent. Parking. No pets. No smokers. Rent £550 p.m Entry by arr. Viewing contact solicitors. Landlord reg: 871287/100/26061. (EPC band - B).
Bedrooms: 2
Bathrooms: 1
PublicRooms: 1
--------------------------------------------------------------------------------
Second Floor Left, 173 Victoria Road
Unfurnished 1 Bdrm 2nd flr Flat. Hall. Lounge. Dining kitch. Bdrm. Bathrm (CT Band - A). Deposit 1 months rent. No pets. No smokers. Rent £375 p.m Immed entry. Viewing contact solicitors. Landlord reg: 1261711/100/09072. (EPC band - D).
Bedrooms: 1
Bathrooms: 1
PublicRooms: 1
--------------------------------------------------------------------------------
102 Bedford Road
Fully furnished 3 Bdrm 1st flr Flat. Hall. Lounge. Kitch. 3 Bdrms. Bathrm (CT band - B). Deposit 1 months rent. Garden. HMO License. No pets. No smokers. Rent £750 p.m Entry by arr. Viewing contact solicitors. Landlord reg: 49171/100/27130. (EPC band - D).
Bedrooms: 3
Bathrooms: 1
PublicRooms: 1
--------------------------------------------------------------------------------
... and so on.

Merging a weird html-like txt file with an Excel file

I got two files which I'm supposed to merge (most likely using statistical software such as R or SPSS). One of them is a normal Excel table with 3 variables (names at the top of the columns). The second one, however, was sent to me in a format I haven't seen before: a large txt file with one entry per case (identified by the ID variable, which I would also use to merge with the Excel file), and it looks like this:
<organizations>
<organization id="B0101">
<type1>E</type1>
<type2>v</type2>
<name>International Association for Official Statistics</name>
<acronym>IAOS</acronym>
<country_first_address>not known</country_first_address>
<city_first_address>not known</city_first_address>
<countries_in_which_members_located>not known</countries_in_which_members_located>
<subject_headings>Government; Statistics</subject_headings>
<foundation_year>1985</foundation_year>
<history>[[History]] Founded 1985, Amsterdam (Netherlands), at 45th Session of #A2590, as a specialized section of ISI. Absorbed, 1989, #D1316, which had been set up 22 Oct 1958, Geneva (Switzerland), following recommendations of ISI, as [International Association of Municipal Statisticians -- Association internationale de statisticiens municipaux]. </history>
<history_relations>#A2590; #D1316</history_relations>
<consultative_status>none known</consultative_status>
<igo_relations>none known</igo_relations>
<ngo_relations>#E1209; #M4975; #D1976; #E2125; #E3673; #D2578; #M0084</ngo_relations>
<member_organizations>none known</member_organizations>
</organization>
<organization id="B8500">
<type1>B</type1>
<type2>y</type2>
<name>World Blind Union</name>
<acronym>WBU</acronym>
<country_first_address>Canada</country_first_address>
<city_first_address>Toronto</city_first_address>
<countries_in_which_members_located>Algeria; Angola; Benin; Burkina Faso; Burundi; Cameroon; Cape Verde; Central African Rep; Chad; Congo Brazzaville; Congo DR; Côte d'Ivoire; Djibouti; Egypt; Equatorial Guinea; Eritrea; Ethiopia; Gabon; Gambia; Ghana; Guinea; Guinea-Bissau; Kenya; Lesotho; Liberia; Libyan AJ; Madagascar; Malawi; Mali; Mauritania; Mauritius; Morocco; Mozambique; Namibia; Niger; Nigeria; Rwanda; Sao Tomé-Principe; Senegal; Seychelles; Sierra Leone; Somalia; South Africa; South Sudan; Sudan; Swaziland; Tanzania UR; Togo; Tunisia; Uganda; Zambia; Zimbabwe; Anguilla; Antigua-Barbuda; Argentina; Bahamas; Barbados; Belize; Bolivia; Brazil; Canada; Chile; Colombia; Costa Rica; Cuba; Dominica; Dominican Rep; Ecuador; El Salvador; Grenada; Guatemala; Guyana; Haiti; Honduras; Jamaica; Martinique; Mexico; Montserrat; Nicaragua; Panama; Paraguay; Peru; St Kitts-Nevis; St Lucia; St Vincent-Grenadines; Trinidad-Tobago; Turks-Caicos; Uruguay; USA; Venezuela; Virgin Is UK; Afghanistan; Bahrain; Bangladesh; Brunei Darussalam; Cambodia; China; Hong Kong; India; Indonesia; Iraq; Israel; Japan; Jordan; Kazakhstan; Korea Rep; Kuwait; Kyrgyzstan; Laos; Lebanon; Macau; Malaysia; Mongolia; Myanmar; Nepal; Pakistan; Philippines; Qatar; Singapore; Sri Lanka; Syrian AR; Taiwan; Tajikistan; Thailand; Timor-Leste; Turkmenistan; United Arab Emirates; Uzbekistan; Vietnam; Yemen; Australia; Fiji; New Zealand; Tonga; Albania; Armenia; Austria; Azerbaijan; Belarus; Belgium; Bosnia-Herzegovina; Bulgaria; Croatia; Cyprus; Czech Rep; Denmark; Estonia; Finland; France; Georgia; Germany; Greece; Hungary; Iceland; Ireland; Italy; Latvia; Lithuania; Luxembourg; Macedonia; Malta; Moldova; Montenegro; Netherlands; Norway; Poland; Portugal; Romania; Russia; Serbia; Slovakia; Slovenia; Spain; Sweden; Switzerland; Turkey; UK; Ukraine;</countries_in_which_members_located>
<subject_headings>Blind, Visually Impaired</subject_headings>
<foundation_year>1984</foundation_year>
<history>[[History]] Founded 26 Oct 1984, Riyadh (Saudi Arabia), as one united world body composed of representatives of national associations of the blind and agencies serving the blind, successor body to both #B3499, set up 20 July 1951, Paris (France), and #B2024, formed in Aug 1964, New York NY (USA). Constitution adopted 26 Oct 1984; amended at: 3rd General Assembly, 2-6 Nov 1992, Cairo (Egypt); 26-30 Aug 1996, Toronto (Canada); 20-24 Nov 2000, Melbourne (Australia); 22-26 Nov 2004, Cape Town (South Africa); 18-22 Aug 2008, Geneva (Switzerland); 12-16 Nov 2012, Bangkok (Thailand). Registered in accordance with French law, 20 Dec 1984, Paris and again 20 Dec 2004, Paris. Incorporated in Canada as not-share-capital not-for-profit corporation, 16 Mar 2007. </history>
<history_relations>#B3499; #B2024</history_relations>
<consultative_status>#E3377; #B2183; #B3548; #B0971; #F3380; #B3635</consultative_status>
<igo_relations>#E7552; #F1393; #A3375; #B3408</igo_relations>
<ngo_relations>#E0409; #E6422; #J5215; #F5821; #C1224; #D5392; #F6792; #A1945; #B2314; #D1758; #F5810; #D1612; #J0357; #D1038; #G6537; #B2221; #B0094; #B3536; #D7556</ngo_relations>
<member_organizations>#F6063; #F4959; #J1979; #C1224; #B0094; #D5392; #A1945; #D2362; #F2936; #J4730; #F3167; #D8743; #F1898; #D0043; #G0853</member_organizations>
</organization>
Any help would be appreciated: what type of file is this, and how can I transform it into a manageable table?
I think your data is XML. I copied your sample data, pasted it into a blank file, and saved it as sample.xml. I made sure to add in a line with </organizations> at the very end (line 37 in your sample), to close off that tag.
Then I followed the instructions here to read it in:
library(XML)
xmlfile <- xmlTreeParse(file = "sample.xml")
xmltop = xmlRoot(xmlfile)
orgs <- xmlSApply(xmltop, function(x) xmlSApply(x, xmlValue))
orgs_df <- data.frame(t(orgs),row.names=NULL)
This returns a dataframe orgs_df with 2 obs. of 15 variables. I presume you can now go ahead and merge this with your Excel file as you please.

Comparing words with the original file in R

I have original dataset in json format. Let's load it in R.
library("rjson")
setwd("mydir")
getwd()
json_data <- fromJSON(paste(readLines("N1.json"), collapse=""))
uu <- unlist(json_data)
uutext <- uu[names(uu) == "text"]
And I have another dataset mydata2
mydata=read.csv(path to data/words)
I need to find which words from mydata2 are present in the messages in the json file, and then write those messages into a new document, "xyz.txt". How can I do this?
chalk indirect pick reaction team skip pumpkin surprise bless ignorance
1 time patient road extent decade cemetery staircase monarch bubble abbey
2 service conglomerate banish pan friendly position tight highlight rice disappear
3 write swear break tire jam neutral momentum requirement relationship matrix
4 inspire dose jump promote trace latest absolute adjust joystick habit
5 wrong behave claim dedicate threat sell particle statement teach lamb
6 eye tissue prescription problem secretion revenge barrel beard mechanism platform
7 forest kick face wisecrack uncertainty ratio complain doubt reflection realism
8 total fee debate hall soft smart sip ritual pill category
9 contain headline lump absorption superintendent digital increase key banner second
I mean:
chalk -1 number1 indirect -2 number2
template
Word1-1 number1-1; Word1-2 number 1-2; …; Word 1-10 number 1-10
Word2-1 number2-1; Word2-2 number 2-2; …; Word 2-10 number 2-10
Next time please include real data. Simplified model:
library(data.table)
word = c("test","meh","blah")
jsonF = c("let's do test", "blah is right", "test blah", "test test")
outp <- list()
for (i in 1:length(word)) {
    outp[[i]] = as.data.frame(grep(word[i], jsonF, v = T, fixed = T)) # possibly, ignore.case = T
}
qq = rbindlist(outp)
qq = unique(qq)
print(qq)
1: let's do test
2: test blah
3: test test
4: blah is right
Edit: quick and dirty paste/collapse:
library(data.table)
x = LETTERS[1:10]
y = LETTERS[11:20]
df = rbind(x, y)
L = list()
for (i in 1:nrow(df)) {
    L[i] = paste0(df[i,], "-", seq(1,10), " ", i, "-", seq(1,10), collapse = "; ")
}
Fin = cbind(L)
View(Fin)
Gives:
> Fin
L
[1,] "A-1 1-1; B-2 1-2; C-3 1-3; D-4 1-4; E-5 1-5; F-6 1-6; G-7 1-7; H-8 1-8; I-9 1-9; J-10 1-10"
[2,] "K-1 2-1; L-2 2-2; M-3 2-3; N-4 2-4; O-5 2-5; P-6 2-6; Q-7 2-7; R-8 2-8; S-9 2-9; T-10 2-10"