Data preprocessing for Named Entity Recognition? - deep-learning

I'm working on Named Entity Recognition on a resume dataset, and we have entities like dates, phone numbers, emails, etc.
I'm wondering how to preprocess those entities. I'm currently adding a space after each punctuation mark, like this:
DAVID B-Name
John I-Name
, O
IT O
Washington B-Address
, I-Address
DC I-Address
( B-Phone
107 I-Phone
) I-Phone
155 I-Phone
- I-Phone
4838 I-Phone
david B-Email
. I-Email
John I-Email
# I-Email
gmail I-Email
. I-Email
com I-Email
But I'm starting to question how to handle such text during inference. I'm assuming that even at inference time I have to preprocess the text with the same process, i.e. adding a space after each punctuation mark, isn't it?
But then it won't be very readable, right?
For example, at inference I would have to provide input text like test # example . com, which isn't readable. The model would only be able to predict entities in that format.
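To be concrete, the splitting I do is roughly this (a simplified sketch of the idea, not my exact preprocessing code):
import re

text = "david.John#gmail.com"
# Add a space around every punctuation character before tokenizing.
spaced = re.sub(r"([^\w\s])", r" \1 ", text)
print(spaced)  # david . John # gmail . com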

The problem you're trying to deal with is called tokenization. To deal with the formatting issue that you raise, frameworks will often extract the tokens from the underlying text in a way that preserves the original text, for example by keeping track of the character start and end of each token.
For instance, SpaCy in Python returns an object that stores all of this information:
import spacy
from pprint import pprint
nlp = spacy.load("en_core_web_sm")
doc = nlp("DAVID John, IT\nWashington, DC (107) 155-4838 david.John#gmail.com")
pprint([(token.text, token.idx, token.idx + len(token.text)) for token in doc])
output:
[('DAVID', 0, 5),
('John', 6, 10),
(',', 10, 11),
('IT', 12, 14),
('\n', 14, 15),
('Washington', 15, 25),
(',', 25, 26),
('DC', 27, 29),
('(', 30, 31),
('107', 31, 34),
(')', 34, 35),
('155', 36, 39),
('-', 39, 40),
('4838', 40, 44),
('david.John#gmail.com', 45, 65)]
You could either do the same sort of thing for yourself (e.g. keep a counter as you add spaces) or use an existing tokenizer (such as SpaCy, CoreNLP, tensorflow, etc.)
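If you roll your own, here is a minimal sketch of tracking offsets with a regular expression (the pattern is only an illustration, not necessarily the tokenizer your model was trained with):
import re

def tokenize_with_offsets(text):
    # Words and single punctuation marks, each with its (start, end)
    # character offsets in the ORIGINAL string, so predictions can be
    # mapped back without altering the text the user sees.
    return [(m.group(), m.start(), m.end())
            for m in re.finditer(r"\w+|[^\w\s]", text)]

print(tokenize_with_offsets("david.John#gmail.com"))
# [('david', 0, 5), ('.', 5, 6), ('John', 6, 10), ('#', 10, 11),
#  ('gmail', 11, 16), ('.', 16, 17), ('com', 17, 20)]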

Related

How to read a csv file into a list of lists in SWI prolog where the inner list represents each line of the CSV?

I have a CSV file that looks something like the one below (i.e. not in Prolog format):
james,facebook,intel,samsung
rebecca,intel,samsung,facebook
Ian,samsung,facebook,intel
I am trying to write a Prolog predicate that reads the file and returns a list that looks like
[[james,facebook,intel,samsung],[rebecca,intel,samsung,facebook],[Ian,samsung,facebook,intel]]
to be used further in other predicates.
I am still a beginner; I have found some good information on SO and modified it to see if I can get what I want, but I'm stuck because I only generate a list that looks like this:
[[(james,facebook,intel,samsung)],[(rebecca,intel,samsung,facebook)],[(Ian,samsung,facebook,intel)]]
which means when I call the head of the inner lists I get (james,facebook,intel,samsung) and not james.
Here is the code being used (seen on SO and modified):
stream_representations(Input, Lines) :-
    read_line_to_codes(Input, Line),
    (   Line == end_of_file
    ->  Lines = []
    ;   atom_codes(FinalLine, Line),
        term_to_atom(LineTerm, FinalLine),
        Lines = [[LineTerm] | FurtherLines],
        stream_representations(Input, FurtherLines)
    ).

main(Lines) :-
    open('file.txt', read, Input),
    stream_representations(Input, Lines),
    close(Input).
The problem lies with term_to_atom(LineTerm,FinalLine).
First we read a line of the CSV file into a list of character codes in
read_line_to_codes(Input,Line).
Let's simulate input with atom_codes/2:
?- atom_codes('james,facebook,intel,samsung',Line).
Line = [106, 97, 109, 101, 115, 44, 102, 97, 99|...].
Then we recompose the original atom that was read into FinalLine (this seems wasteful; there must be a way to hoover up a line into an atom directly):
?- atom_codes('james,facebook,intel,samsung',Line),
atom_codes(FinalLine, Line).
Line = [106, 97, 109, 101, 115, 44, 102, 97, 99|...],
FinalLine = 'james,facebook,intel,samsung'.
Then we try to map this atom in FinalLine into a term, LineTerm, using term_to_atom/2:
?- atom_codes('james,facebook,intel,samsung',Line),
atom_codes(FinalLine, Line),
term_to_atom(LineTerm,FinalLine).
Line = [106, 97, 109, 101, 115, 44, 102, 97, 99|...],
FinalLine = 'james,facebook,intel,samsung',
LineTerm = (james, facebook, intel, samsung).
You see the problem here: LineTerm is not quite a list, but a nested term using the functor , to separate elements:
?- atom_codes('james,facebook,intel,samsung',Line),
atom_codes(FinalLine, Line),
term_to_atom(LineTerm,FinalLine),
write_canonical(LineTerm).
','(james,','(facebook,','(intel,samsung)))
Line = [106, 97, 109, 101, 115, 44, 102, 97, 99|...],
FinalLine = 'james,facebook,intel,samsung',
LineTerm = (james, facebook, intel, samsung).
This ','(james,','(facebook,','(intel,samsung))) term will thus also be in the final result, just written differently: (james,facebook,intel,samsung) and packed into a list:
[(james,facebook,intel,samsung)]
You do not want this term, you want a list. You could use atomic_list_concat/2 to create a new atom that can be read as a list:
?- atom_codes('james,facebook,intel,samsung',Line),
atom_codes(FinalLine, Line),
atomic_list_concat(['[',FinalLine,']'],ListyAtom),
term_to_atom(LineTerm,ListyAtom),
LineTerm = [V1,V2,V3,V4].
Line = [106, 97, 109, 101, 115, 44, 102, 97, 99|...],
FinalLine = 'james,facebook,intel,samsung',
ListyAtom = '[james,facebook,intel,samsung]',
LineTerm = [james, facebook, intel, samsung],
V1 = james,
V2 = facebook,
V3 = intel,
V4 = samsung.
But that's rather barbaric.
We must do this whole processing in fewer steps:
Read a line of comma-separated strings on input.
Transform this into a list of either atoms or strings directly.
DCGs seem like the correct solution. Maybe someone can add a two-liner.

How to read JSON file in Prolog

I found a few SO posts on related issues which were unhelpful. I finally figured it out, so here's how to read the contents of a .json file. Say the path is /home/xxx/dnns/test/params.json; I want to turn the dictionary in the .json file into a Prolog dict:
{
"type": "lenet_1d",
"input_channel": 1,
"output_size": 130,
"batch_norm": 1,
"use_pooling": 1,
"pooling_method": "max",
"conv1_kernel_size": 17,
"conv1_num_kernels": 45,
"conv1_stride": 1,
"conv1_dropout": 0.0,
"pool1_kernel_size": 2,
"pool1_stride": 2,
"conv2_kernel_size": 12,
"conv2_num_kernels": 35,
"conv2_stride": 1,
"conv2_dropout": 0.514948804688646,
"pool2_kernel_size": 2,
"pool2_stride": 2,
"fcs_hidden_size": 109,
"fcs_num_hidden_layers": 2,
"fcs_dropout": 0.8559119274655482,
"cost_function": "SmoothL1",
"optimizer": "Adam",
"learning_rate": 0.0001802763794651928,
"momentum": null,
"data_is_target": 0,
"data_train": "/home/xxx/data/20180402_L74_70mm/train_2.h5",
"data_val": "/home/xxx/data/20180402_L74_70mm/val_2.h5",
"batch_size": 32,
"data_noise_gaussian": 1,
"weight_decay": 0,
"patience": 20,
"cuda": 1,
"save_initial": 0,
"k": 4,
"save_dir": "DNNs/20181203090415_11_created/k_4"
}
To read a JSON file with SWI-Prolog, query
?- use_module(library(http/json)). % to enable json_read_dict/2
?- FPath = '/home/xxx/dnns/test/params.json', open(FPath, read, Stream), json_read_dict(Stream, Dicty).
You'll get
FPath = 'DNNs/test/k_4/model_params.json',
Stream = <stream>(0x7fa664401750),
Dicty = _12796{batch_norm:1, batch_size:32, conv1_dropout:0.0, conv1_kernel_size:17,
  conv1_num_kernels:45, conv1_stride:1, conv2_dropout:0.514948804688646,
  conv2_kernel_size:12, conv2_num_kernels:35, conv2_stride:1, cost_function:"SmoothL1",
  cuda:1, data_is_target:0, data_noise_gaussian:1,
  data_train:"/home/xxx/Downloads/20180402_L74_70mm/train_2.h5",
  data_val:"/home/xxx/Downloads/20180402_L74_70mm/val_2.h5",
  fcs_dropout:0.8559119274655482, fcs_hidden_size:109, fcs_num_hidden_layers:2,
  input_channel:1, k:4, learning_rate:0.0001802763794651928, momentum:null,
  optimizer:"Adam", output_size:130, patience:20, pool1_kernel_size:2, pool1_stride:2,
  pool2_kernel_size:2, pool2_stride:2, pooling_method:"max",
  save_dir:"DNNs/20181203090415_11_created/k_4", save_initial:0, type:"lenet_1d",
  use_pooling:1, weight_decay:0}.
where Dicty is the desired dictionary.
If you want to define this as a predicate, you could do:
:- use_module(library(http/json)).
get_dict_from_json_file(FPath, Dicty) :-
    open(FPath, read, Stream),
    json_read_dict(Stream, Dicty),
    close(Stream).
Even DEC10 Prolog, released 40 years ago, could handle JSON just as a normal term. There should be no need for a specialized library or parser for JSON, because Prolog can parse it directly.
?- X={"a":3,"b":"hello","c":undefined,"d":null} .
X = {"a":3, "b":"hello", "c":undefined, "d":null}.
?-

JSON hierarchies encoding and event-capture

Consider the following JSON structure:
{'100': {'Time': '02:00:00', 'Group': 'A', 'Similar events': [101, 102, 104, 120]},
 '101': {'Time': '02:01:00', 'Group': 'B', 'Similar events': [100, 103, 105, 111]},
 '102': {'Time': '04:00:00', 'Group': 'A', 'Similar events': [104, 100, 107, 121]}}
The top-level keys (e.g. '100', '101', etc.) are unique identifiers. I have come to find this is not the ideal way to store JSON (attempting to load this structure - with many more events - crashed my PC).
After some digging, I believe this is the proper way (or, at least, a much more canonical way) of encoding these data in JSON:
{'Time': [{'100': '02:00:00'},
{'101': '02:01:00'},
{'102': '04:00:00'}],
'Group': [{'100': 'A'},
{'101': 'B'},
{'102': 'A'}],
'Similar events': [{'100': [101, 102, 104, 120]},
{'101': [100, 103, 105, 111]},
{'102': [104, 100, 107, 121]}]}
My machine is able to handle this last attempt much better. Why does my former method of using unique events as (what I think are) individual "rows" cause so much trouble? My gut tells me that each "column" or key within each record in the former attempt becomes a new field, since it's found under a unique identifier (a unique key).
It's difficult to say without more details such as the total size of your data, the amount of memory on your computer, the software you're using and the specific operations you're trying to do, but it may be that the working set of the second representation is smaller for your problem.
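If it helps to see the two layouts side by side, here is a minimal Python sketch (with made-up events, not your data) that pivots the first layout into the second:
import json

# Row-oriented: one object per event id (the first layout in the question).
events = {
    "100": {"Time": "02:00:00", "Group": "A", "Similar events": [101, 102, 104, 120]},
    "101": {"Time": "02:01:00", "Group": "B", "Similar events": [100, 103, 105, 111]},
    "102": {"Time": "04:00:00", "Group": "A", "Similar events": [104, 100, 107, 121]},
}

# Column-oriented: one list per field (the second layout in the question).
columns = {}
for event_id, fields in events.items():
    for field, value in fields.items():
        columns.setdefault(field, []).append({event_id: value})

print(json.dumps(columns, indent=2))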

convert h5 file to csv file or text file for data processing

I have a dataset of about 1.85 GB which contains h5 files, and I need to process these files using Hadoop; for this I may need to convert them to text or CSV.
Is there any way Hadoop can read h5 files? Or any good online tool to convert h5 files to CSV or text files? Or can anyone give a link where I can download a huge dataset which contains text or CSV files?
Thanks in advance.
Have you tried the OPeNDAP Hyrax server with the hdf5_handler module?
For example, from the sample HDF5 file [1], you can get the following ASCII data [2]:
Dataset: grid_1_2d.h5
temperature[0], 10, 10, 10, 10, 10, 10, 10, 10
temperature[1], 11, 11, 11, 11, 11, 11, 11, 11
temperature[2], 12, 12, 12, 12, 12, 12, 12, 12
temperature[3], 13, 13, 13, 13, 13, 13, 13, 13
...
The OPeNDAP Hyrax server with hdf5_handler is a great tool/service because you can also easily select (and subset) a dataset from an HDF5 file using an HTML form [3]. You can find detailed information about the OPeNDAP hdf5_handler at [4].
[1] http://eosdap.hdfgroup.org:8080/opendap/data/hdf5/grid_1_2d.h5
[2] http://eosdap.hdfgroup.org:8080/opendap/data/hdf5/grid_1_2d.h5.ascii
[3] http://eosdap.hdfgroup.org:8080/opendap/data/hdf5/grid_1_2d.h5.html
[4] http://hdfeos.org/software/hdf5_handler.php
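If a local, scripted conversion is an option (this is not covered by the answer above), here is a minimal sketch using h5py and Python's csv module, assuming the file holds simple 2-D numeric datasets; the file and dataset names are placeholders:
import csv
import h5py

# Write every top-level 2-D dataset in the HDF5 file to its own CSV file.
with h5py.File("grid_1_2d.h5", "r") as h5:
    for name, dset in h5.items():
        if isinstance(dset, h5py.Dataset) and dset.ndim == 2:
            with open(f"{name}.csv", "w", newline="") as out:
                csv.writer(out).writerows(dset[:].tolist())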

pm3d in gnuplot with binary data

I have some data files with content
a1 b1 c1 d1
a1 b2 c2 d2
...
[blank line]
a2 b1 c1 d1
a2 b2 c2 d2
...
I plot this with gnuplot using
splot 'file' u 1:2:3:4 w pm3d.
Now, I want to use a binary file. I created the file with Fortran using unformatted stream-access (direct or sequential access did not work directly). By using gnuplot with
splot 'file' binary format='%float%float%float%float' u 1:2:3
I get a normal 3D-plot. However, the pm3d-command does not work as I don't have the blank lines in the binary file. I get the error message:
>splot 'file' binary format='%float%float%float%float' u 1:2:3:4 w pm3d
Warning: Single isoline (scan) is not enough for a pm3d plot.
Hint: Missing blank lines in the data file? See 'help pm3d' and FAQ.
According to the demo script at http://gnuplot.sourceforge.net/demo/image2.html, I have to specify the record length (which I still don't fully understand). However, using the file from that demo page, the command with pm3d gives the same error message:
splot 'scatter2.bin' binary record=30:30:29:26 u 1:2:3 w pm3d
So how is it possible to plot this four dimensional data from a binary file correctly?
Edit: Thanks, mgilson. Now it works fine. Just for the record, my Fortran code snippet:
open(unit=83, file=fname, action='write', status='replace', access='stream', form='unformatted')
a = 0.d0
b = 0.d0
do i=1,200
   do j=1,100
      write(83) real(a), real(b), c(i,j), d(i,j)
      b = b + db
   end do
   a = a + da
   b = 0.d0
end do
close(83)
The gnuplot commands:
set pm3d map
set contour
set cntrparam levels 20
set cntrparam bspline
unset clabel
splot 'fname' binary record=(100,-1) format='%float' u 1:2:3:4 t 'd as pm3d-projection, c as contour'
Great question, and thanks for posting it. This is a corner of gnuplot I hadn't spent much time with before. First, I need to generate a little test data -- I used python, but you could use fortran just as easily:
Note that my input array (b) is just a 10x10 array. The first two "columns" in the datafile are just the index (i,j), but you could use anything.
>>> import struct
>>> import numpy as np
>>> a = np.arange(10)
>>> b = a[None,:]+a[:,None]
>>> b
array([[ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
[ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
[ 2, 3, 4, 5, 6, 7, 8, 9, 10, 11],
[ 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
[ 4, 5, 6, 7, 8, 9, 10, 11, 12, 13],
[ 5, 6, 7, 8, 9, 10, 11, 12, 13, 14],
[ 6, 7, 8, 9, 10, 11, 12, 13, 14, 15],
[ 7, 8, 9, 10, 11, 12, 13, 14, 15, 16],
[ 8, 9, 10, 11, 12, 13, 14, 15, 16, 17],
[ 9, 10, 11, 12, 13, 14, 15, 16, 17, 18]])
>>> with open('foo.dat','wb') as foo:
... for (i,j),dat in np.ndenumerate(b):
... s = struct.pack('4f',i,j,dat,dat)
... foo.write(s)
...
So here I just write four floating-point values to the file for each data point. Again, this is what you've already done using Fortran. Now to plot it:
splot 'foo.dat' binary record=(10,-1) format='%float' u 1:2:3:4 w pm3d
I believe that this specifies that each "scan" is a "record". Since I know that each scan is 10 data points long, that becomes the first index in the record list. The -1 indicates that gnuplot should keep reading records until it finds the end of the file.
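Not part of the original answer, but as a quick sanity check of the layout, you could read the file back with numpy and confirm there are 10 points per scan with 4 float32 values each:
import numpy as np

# 10x10 grid, 4 float32 values (i, j, dat, dat) per point.
data = np.fromfile("foo.dat", dtype=np.float32).reshape(10, 10, 4)
print(data[0, :3])
# [[0. 0. 0. 0.]
#  [0. 1. 1. 1.]
#  [0. 2. 2. 2.]]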