This is the second part of another question I posted. The two are different enough to be separate questions, but they could be related.
Previous question
Building a Custom Named Entity Recognition with Spacy , using random text as a sample
I have built a custom Named Entity Recognition (NER) model using the method described in the previous question. From there, I copied the method for building the NER from the spaCy website (under "Named Entity Recognizer" at https://spacy.io/usage/training#ner).
The custom NER works, sort of. If I sentence-tokenize the text and lemmatize the words (so "strawberries" becomes "strawberry"), it can pick up an entity. However, it stops there: it sometimes picks up two entities, but very rarely.
Is there anything I can do to improve its accuracy?
Here is the code. I have TRAIN_DATA in this format, but for food items:
TRAIN_DATA = [
    ("Uber blew through $1 million a week", {"entities": [(0, 4, "ORG")]}),
    ("Google rebrands its business apps", {"entities": [(0, 6, "ORG")]}),
]
The data is in the object train_food
import random
import warnings

import nltk
import spacy
from spacy.util import minibatch, compounding

nlp = spacy.blank("en")

# Create the built-in NER pipeline component and add it to the pipeline
if "ner" not in nlp.pipe_names:
    ner = nlp.create_pipe("ner")
    nlp.add_pipe(ner, last=True)
else:
    ner = nlp.get_pipe("ner")

# Register every entity label that appears in the food training data
for _, annotations in train_food:
    for ent in annotations.get("entities"):
        ner.add_label(ent[2])

# get names of other pipes to disable them during training
pipe_exceptions = ["ner", "trf_wordpiecer", "trf_tok2vec"]
other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]
model = "en"
n_iter = 20

# only train NER
with nlp.disable_pipes(*other_pipes), warnings.catch_warnings():
    # show warnings for misaligned entity spans once
    warnings.filterwarnings("once", category=UserWarning, module='spacy')
    # reset and initialize the weights randomly - but only if we're
    # training a new model
    nlp.begin_training()
    for itn in range(n_iter):
        random.shuffle(train_food)
        losses = {}
        # batch up the examples using spaCy's minibatch
        batches = minibatch(train_food, size=compounding(4.0, 32.0, 1.001))
        for batch in batches:
            texts, annotations = zip(*batch)
            nlp.update(
                texts,  # batch of texts
                annotations,  # batch of annotations
                drop=0.5,  # dropout - make it harder to memorise data
                losses=losses,
            )
        print("Losses", losses)
text = "mike went to the supermarket today. he went and bought a potatoes, carrots, towels, garlic, soap, perfume, a fridge, a tomato, tomatoes and tuna."
After this, using text as a sample, I ran this code:
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

def text_processor(text):
    # lowercase, tokenize, and lemmatize every token
    text = text.lower()
    token = nltk.word_tokenize(text)
    ls = []
    for x in token:
        p = lemmatizer.lemmatize(x)
        ls.append(f"{p}")
    new_text = " ".join(map(str, ls))
    return new_text

def ner(text):
    # split the processed text into sentences and print the entities of each
    new_text = text_processor(text)
    tokenizer = nltk.PunktSentenceTokenizer()
    sentences = tokenizer.tokenize(new_text)
    for sentence in sentences:
        doc = nlp(sentence)
        for ent in doc.ents:
            print(ent.text, ent.label_)

ner(text)
This results in
potato FOOD
carrot FOOD
Running the following code
ner("mike went to the supermarket today. he went and bought garlic and tuna")
Results in
garlic FOOD
Ideally, I want the NER to pick up potato, carrot and garlic. Is there anything I can do?
Thank you
Kah
While you are training your model, you can try some information retrieval techniques, such as:
1. Lowercasing all of the words
2. Replacing words with their synonyms
3. Removing stop words
4. Rewriting sentences (this can be done automatically using back-translation, i.e. translating into another language such as Arabic and then translating back into English)
Also, consider using better models, such as:
http://nlp.stanford.edu:8080/corenlp
https://huggingface.co/models
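As a rough illustration of the first technique, here is a minimal sketch that lowercases training examples in the TRAIN_DATA format from the question. The function name lowercase_examples and the sample sentence are made up for this example. It relies on the assumption that lowercasing does not change the length of the text, so the character offsets of the entities stay valid; examples where the length changes are skipped.

```python
def lowercase_examples(train_data):
    """Return lowercased copies of (text, annotations) training examples."""
    lowered_examples = []
    for text, annotations in train_data:
        lowered = text.lower()
        # Assumption: offsets only stay valid if the length is unchanged.
        # A few non-ASCII characters change length when lowercased.
        if len(lowered) != len(text):
            continue
        lowered_examples.append((lowered, {"entities": list(annotations["entities"])}))
    return lowered_examples


# Hypothetical example in the same format as TRAIN_DATA
sample = [
    ("I bought Potatoes and Garlic", {"entities": [(9, 17, "FOOD"), (22, 28, "FOOD")]}),
]

for text, annotations in lowercase_examples(sample):
    for start, end, label in annotations["entities"]:
        print(text[start:end], label)
```

Techniques that add or remove characters (synonym replacement, stop word removal, back-translation) would need the entity offsets to be recomputed, which this sketch does not attempt.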
In Flink, parsing a CSV file using readCsvFile raises an exception when encountering a field containing quotes, like "Fazenda São José ""OB"" Airport":
org.apache.flink.api.common.io.ParseException: Line could not be parsed: '191,"SDOB","small_airport","Fazenda São José ""OB"" Airport",-21.425199508666992,-46.75429916381836,2585,"SA","BR","BR-SP","Tapiratiba","no","SDOB",,"SDOB",,,'
I've found in this mailing list thread and this JIRA issue that quotes inside a field should be escaped with the \ character, but I don't have control over the data to modify it. Is there a way to work around this?
I've also tried using ignoreInvalidLines() (which is the less preferable solution) but it gave me the following error:
08:49:05,737 INFO org.apache.flink.api.common.io.LocatableInputSplitAssigner - Assigning remote split to host localhost
08:49:05,765 ERROR org.apache.flink.runtime.operators.BatchTask - Error in task code: CHAIN DataSource (at main(Job.java:53) (org.apache.flink.api.java.io.TupleCsvInputFormat)) -> Map (Map at main(Job.java:54)) -> Combine(SUM(1), at main(Job.java:56) (2/8)
java.lang.ArrayIndexOutOfBoundsException: -1
at org.apache.flink.api.common.io.GenericCsvInputFormat.skipFields(GenericCsvInputFormat.java:443)
at org.apache.flink.api.common.io.GenericCsvInputFormat.parseRecord(GenericCsvInputFormat.java:412)
at org.apache.flink.api.java.io.CsvInputFormat.readRecord(CsvInputFormat.java:111)
at org.apache.flink.api.common.io.DelimitedInputFormat.nextRecord(DelimitedInputFormat.java:454)
at org.apache.flink.api.java.io.CsvInputFormat.nextRecord(CsvInputFormat.java:79)
at org.apache.flink.runtime.operators.DataSourceTask.invoke(DataSourceTask.java:176)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:559)
at java.lang.Thread.run(Thread.java:745)
Here is my code:
DataSet<Tuple2<String, Integer>> csvInput = env.readCsvFile("resources/airports.csv")
    .ignoreFirstLine()
    .ignoreInvalidLines()
    .parseQuotedStrings('"')
    .includeFields("100000001")
    .types(String.class, String.class)
    .map((Tuple2<String, String> value) -> new Tuple2<>(value.f1, 1))
    .groupBy(0)
    .sum(1);
If you cannot change the input data, then you should turn off parseQuotedStrings(). The parser will then simply look for the next field delimiter and return everything in between as a string (including the quotation marks). You can then remove the leading and trailing quotation marks in a subsequent map operation.
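The per-field cleanup could look roughly like the sketch below. It is written in Python only to show the string logic; the same steps would go into the Flink map function. The function name strip_csv_quotes is made up for this example, and collapsing doubled quotes into a single quote is an extra step I am assuming you want, based on the sample field in the question.

```python
def strip_csv_quotes(field: str) -> str:
    """Remove one pair of surrounding quotes and collapse doubled quotes."""
    if len(field) >= 2 and field.startswith('"') and field.endswith('"'):
        field = field[1:-1]
    # Assumption: a doubled quote inside the field stands for one literal quote
    return field.replace('""', '"')


print(strip_csv_quotes('"Fazenda São José ""OB"" Airport"'))
# prints: Fazenda São José "OB" Airport
```

Note that with quote parsing turned off, a field delimiter that appears inside a quoted field will split that field, so this only works if the quoted values contain no commas.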
I have a complex JSON file that looks like this: http://pastebin.com/4UfadbqS
I would like to load only several values from these JSON objects using Pig Latin. I tried doing that like this:
mydata = LOAD 'data.json'
    USING JsonLoader('id:chararray, created_at:chararray,
                      user: {(language:chararray)}');
STORE mydata
    INTO 'output';
But it seems that Pig Latin just takes the first 3 values from the JSON and saves them (it does not treat the column name as a key). Is there a way to achieve this? Or should I list all the values from the JSON in Pig and filter them after that?
There are a few problems with the above approach:
1. JsonLoader always expects the full schema of your input, but you gave only three fields.
2. JsonLoader always expects each record to be on a single line, but your input is multiline.
3. JsonLoader does not support nested schemas, but your input contains nested data.
To solve all of the above problems, you have to use the third-party elephant-bird library.
Download the jar files (elephant-bird-pig-4.1.jar and elephant-bird-hadoop-compat-4.1.jar) from this link
http://www.java2s.com/Code/Jar/e/elephant.htm and try the approach below.
I copied your entire input and formatted it as a single line, as shown below.
input.json
{"filter_level":"medium","retweeted":false,"in_reply_to_screen_name":null,"possibly_sensitive":false,"truncated":false,"lang":"en","in_reply_to_status_id_str":null,"id":488927960280211456,"in_reply_to_user_id_str":null,"in_reply_to_status_id":null,"created_at":"Tue Jul 15 06:08:04 +0000 2014","favorite_count":0,"place":null,"coordinates":null,"text":"RT #BulleyBufton: #MinaANDMaya PLEASE RT /VOTE BULLEY. Last day to help me win my old rescue #HilbraesDogs £5k https://t.co/Y8g47fLYY1 http\u2026","contributors":null,"retweeted_status":{"filter_level":"low","contributors":null,"text":"#MinaANDMaya PLEASE RT /VOTE BULLEY. Last day to help me win my old rescue #HilbraesDogs £5k https://t.co/Y8g47fLYY1 http://t.co/DDco9wVXtP","geo":null,"retweeted":false,"in_reply_to_screen_name":"MinaANDMaya","possibly_sensitive":false,"truncated":false,"lang":"en","entities":{"trends":[],"symbols":[],"urls":[{"expanded_url":"https://www.animalfriendsquote.co.uk/fb-worldcup/","indices":[93,116],"display_url":"animalfriendsquote.co.uk/fb-worldcup/","url":"https://t.co/Y8g47fLYY1"}],"hashtags":[],"media":[{"sizes":{"thumb":{"w":150,"resize":"crop","h":150},"small":{"w":340,"resize":"fit","h":455},"large":{"w":706,"resize":"fit","h":946},"medium":{"w":600,"resize":"fit","h":803}},"id":488926730481332224,"media_url_https":"https://pbs.twimg.com/media/BskERVuIcAAJZGu.jpg","media_url":"http://pbs.twimg.com/media/BskERVuIcAAJZGu.jpg","expanded_url":"http://twitter.com/BulleyBufton/status/488926827394904064/photo/1","indices":[117,139],"id_str":"488926730481332224","type":"photo","display_url":"pic.twitter.com/DDco9wVXtP","url":"http://t.co/DDco9wVXtP"}],"user_mentions":[{"id":132204038,"name":"Mina*Bad Yoga Kitty*","indices":[0,12],"screen_name":"MinaANDMaya","id_str":"132204038"},{"id":2308374684,"name":"Julianna Kaminski","indices":[75,88],"screen_name":"HilbraesDogs","id_str":"2308374684"}]},"in_reply_to_status_id_str":null,"id":488926827394904064,"source":"<a href=\"http://twitter.com/download/android\" rel=\"nofollow\">Twitter for Android<\/a>","in_reply_to_user_id_str":"132204038","favorited":false,"in_reply_to_status_id":null,"retweet_count":6,"created_at":"Tue Jul 15 06:03:34 +0000 2014","in_reply_to_user_id":132204038,"favorite_count":3,"id_str":"488926827394904064","place":null,"user":{"location":"CHICAGO , USA","default_profile":false,"statuses_count":8868,"profile_background_tile":true,"lang":"en","profile_link_color":"AD54E8","profile_banner_url":"https://pbs.twimg.com/profile_banners/225136520/1403608773","id":225136520,"following":null,"favourites_count":5082,"protected":false,"profile_text_color":"3D1957","verified":false,"description":"I'm Bulley, I'm proof that there is always hope.\r\nI was in rescue kennels in UK for 9yrs. #ada_bscakes took me in.\r\nWe've moved to America to start a new life.","contributors_enabled":false,"profile_sidebar_border_color":"000000","name":"BULLEY","profile_background_color":"0A0A0A","created_at":"Fri Dec 10 19:55:17 +0000 2010","default_profile_image":false,"followers_count":3421,"profile_image_url_https":"https://pbs.twimg.com/profile_images/486614595457789952/gtcLac9w_normal.jpeg","geo_enabled":true,"profile_background_image_url":"http://pbs.twimg.com/profile_background_images/378800000166829702/isbjd7O4.jpeg","profile_background_image_url_https":"https://pbs.twimg.com/profile_background_images/378800000166829702/isbjd7O4.jpeg","follow_request_sent":null,"url":null,"utc_offset":-39600,"time_zone":"International Date Line West","notifications":null,"profile_use_background_image":true,"friends_count":3702,"profile_sidebar_fill_color":"7AC3EE","screen_name":"BulleyBufton","id_str":"225136520","profile_image_url":"http://pbs.twimg.com/profile_images/486614595457789952/gtcLac9w_normal.jpeg","listed_count":29,"is_translator":false},"coordinates":null},"geo":null,"entities":{"trends":[],"symbols":[],"urls":[{"expanded_url":"https://www.animalfriendsquote.co.uk/fb-worldcup/","indices":[111,134],"display_url":"animalfriendsquote.co.uk/fb-worldcup/","url":"https://t.co/Y8g47fLYY1"}],"hashtags":[],"media":[{"sizes":{"thumb":{"w":150,"resize":"crop","h":150},"small":{"w":340,"resize":"fit","h":455},"large":{"w":706,"resize":"fit","h":946},"medium":{"w":600,"resize":"fit","h":803}},"id":488926730481332224,"media_url_https":"https://pbs.twimg.com/media/BskERVuIcAAJZGu.jpg","media_url":"http://pbs.twimg.com/media/BskERVuIcAAJZGu.jpg","expanded_url":"http://twitter.com/BulleyBufton/status/488926827394904064/photo/1","source_status_id_str":"488926827394904064","indices":[139,140],"source_status_id":488926827394904064,"id_str":"488926730481332224","type":"photo","display_url":"pic.twitter.com/DDco9wVXtP","url":"http://t.co/DDco9wVXtP"}],"user_mentions":[{"id":225136520,"name":"BULLEY","indices":[3,16],"screen_name":"BulleyBufton","id_str":"225136520"},{"id":132204038,"name":"Mina*Bad Yoga Kitty*","indices":[18,30],"screen_name":"MinaANDMaya","id_str":"132204038"},{"id":2308374684,"name":"Julianna Kaminski","indices":[93,106],"screen_name":"HilbraesDogs","id_str":"2308374684"}]},"source":"<a href=\"http://twitter.com/download/android\" rel=\"nofollow\">Twitter for Android<\/a>","favorited":false,"in_reply_to_user_id":null,"retweet_count":0,"id_str":"488927960280211456","user":{"location":"","default_profile":false,"statuses_count":1370,"profile_background_tile":true,"lang":"zh-tw","profile_link_color":"038544","profile_banner_url":"https://pbs.twimg.com/profile_banners/2272804116/1404662156","id":2272804116,"following":null,"favourites_count":2000,"protected":false,"profile_text_color":"333333","verified":false,"description":"No More Sorrow","contributors_enabled":false,"profile_sidebar_border_color":"000000","name":"Winnie","profile_background_color":"14DBBA","created_at":"Thu Jan 02 10:13:01 +0000 2014","default_profile_image":false,"followers_count":311,"profile_image_url_https":"https://pbs.twimg.com/profile_images/478106512083017728/4ao_8JjE_normal.jpeg","geo_enabled":false,"profile_background_image_url":"http://pbs.twimg.com/profile_background_images/431815421189029888/YrRNpUfd.jpeg","profile_background_image_url_https":"https://pbs.twimg.com/profile_background_images/431815421189029888/YrRNpUfd.jpeg","follow_request_sent":null,"url":null,"utc_offset":null,"time_zone":null,"notifications":null,"profile_use_background_image":true,"friends_count":455,"profile_sidebar_fill_color":"DDEEF6","screen_name":"winnie341881","id_str":"2272804116","profile_image_url":"http://pbs.twimg.com/profile_images/478106512083017728/4ao_8JjE_normal.jpeg","listed_count":0,"is_translator":false}}
PigScript:
REGISTER '/tmp/elephant-bird-hadoop-compat-4.1.jar';
REGISTER '/tmp/elephant-bird-pig-4.1.jar';
A = LOAD 'input.json' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad') AS myMap;
B = FOREACH A GENERATE myMap#'id' AS ID,myMap#'created_at' AS createdAT,myMap#'user' AS User;
DUMP B;
Output:
(488927960280211456,Tue Jul 15 06:08:04 +0000 2014,[location#,default_profile#false,profile_background_tile#true,statuses_count#1370,lang#zh-tw,profile_link_color#038544,profile_banner_url#https://pbs.twimg.com/profile_banners/2272804116/1404662156,id#2272804116,following#,protected#false,favourites_count#2000,profile_text_color#333333,contributors_enabled#false,description#No More Sorrow,verified#false,name#Winnie,profile_sidebar_border_color#000000,profile_background_color#14DBBA,created_at#Thu Jan 02 10:13:01 +0000 2014,default_profile_image#false,followers_count#311,geo_enabled#false,profile_image_url_https#https://pbs.twimg.com/profile_images/478106512083017728/4ao_8JjE_normal.jpeg,profile_background_image_url#http://pbs.twimg.com/profile_background_images/431815421189029888/YrRNpUfd.jpeg,profile_background_image_url_https#https://pbs.twimg.com/profile_background_images/431815421189029888/YrRNpUfd.jpeg,follow_request_sent#,url#,utc_offset#,time_zone#,notifications#,friends_count#455,profile_use_background_image#true,profile_sidebar_fill_color#DDEEF6,screen_name#winnie341881,id_str#2272804116,profile_image_url#http://pbs.twimg.com/profile_images/478106512083017728/4ao_8JjE_normal.jpeg,is_translator#false,listed_count#0])
The elephant-bird library stores all values as key/value pairs (i.e. the MAP datatype), so it is easy to extract the required fields from the loaded data.
In the above Pig script I have extracted the values of 'id', 'created_at' and 'user' as per your need.
Suppose you want to extract some fields from the 'user' data (e.g. 'friends_count' and 'followers_count'). In that case you need to project the 'user' field and then extract the required data from it. Sample code below.
PigScript:
REGISTER '/tmp/elephant-bird-hadoop-compat-4.1.jar';
REGISTER '/tmp/elephant-bird-pig-4.1.jar';
A = LOAD 'input.json' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad') AS myMap;
B = FOREACH A GENERATE myMap#'user' AS User;
C = FOREACH B GENERATE User#'friends_count', User#'followers_count';
DUMP C;
Output:
(455,311)
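To sanity-check the expected values outside of Pig, the same two-step projection (take the 'user' map, then pick keys from it) can be sketched in plain Python. The record below is a hypothetical, heavily shortened version of the tweet; only the keys needed here are kept.

```python
import json

# Hypothetical, shortened record: the real tweet has many more fields
line = (
    '{"id": 488927960280211456, '
    '"created_at": "Tue Jul 15 06:08:04 +0000 2014", '
    '"user": {"friends_count": 455, "followers_count": 311}}'
)

record = json.loads(line)   # comparable to loading the line as a map
user = record["user"]       # comparable to myMap#'user'
counts = (user["friends_count"], user["followers_count"])
print(counts)
# prints: (455, 311)
```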
So my input looks like
{"selling":"0","quantity":"2","price":"80000","date":"1401384212","rs_name":"overhault","contact":"PM","notes":""}
{"selling":"0","quantity":"100","price":"80000","date":"1401383271","rs_name":"sammvarnish","contact":"PM","notes":"Seers Bank W321 :)"}
{"selling":"0","quantity":"100","price":"70000","date":"1401383168","rs_name":"pwnoramaa","contact":"PM","notes":""}
and the output I want must look like
0,2,80000,1401384212,overhault,PM,""
0,100,80000,1401383271,sammvarnish,PM,"Seers Bank W321 :)"
0,100,70000,1401383168,pwnoramaa,PM,""
What's the best way to do this in bash?
EDIT: changed my needs.
The new output I want is, for
{"selling":"0","quantity":"2","price":"80000","date":"1401384212","rs_name":"overhault","contact":"PM","notes":"testnote"}
as input,
rs name: \t overhault
quantity: \t 2
price: \t 80000
date: \t 29-05 19:23
contact: \t PM
notes: \t testnote
Where \t is a tab character (like in echo "\t").
As you can see, this one is a tad bit more complicated.
For example, it changes the order, and requires the UNIX timestamp to be converted to an alternative format.
I'll use any tool you can offer me as long as you explain clearly how I can use it from a bash script. The input will consist of three such lines, delimited by a newline character, and it must print the output with an empty line between each of the results.
Don't do this with regular expressions or bash; there are JSON parsers for this kind of task. A simple Python example:
import json
data = json.loads('{"selling":"0","quantity":"2"}')
data = ','.join(data.values())
print(data)
I strongly suggest you just use a simple script like this which you make executable and then call.
EDIT: here's a version which preserves the order:
import json
data = json.loads('{"selling":"0","quantity":"2", "price":"80000"}')
orderedkeys = ['selling', 'quantity', 'price']
values = [data[key] for key in orderedkeys]
values = ','.join(values)
print(values)
output:
0,2,80000
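For the edited requirement (reordered fields, tab-separated labels, and a converted timestamp), a sketch along the same lines might look like this. The names FIELDS, LOCAL_TZ and format_record are made up for this example. The UTC+2 offset is an assumption inferred from the sample output in the question (19:23 for 1401384212), so adjust it to your own timezone.

```python
import json
from datetime import datetime, timedelta, timezone

# Assumption: the sample output in the question corresponds to UTC+2
LOCAL_TZ = timezone(timedelta(hours=2))

# (label, JSON key) pairs in the output order the question asks for
FIELDS = [
    ("rs name", "rs_name"),
    ("quantity", "quantity"),
    ("price", "price"),
    ("date", "date"),
    ("contact", "contact"),
    ("notes", "notes"),
]


def format_record(line):
    """Turn one JSON line into the labelled, tab-separated block."""
    record = json.loads(line)
    rows = []
    for label, key in FIELDS:
        value = record[key]
        if key == "date":
            # Convert the UNIX timestamp to day-month hour:minute
            moment = datetime.fromtimestamp(int(value), tz=LOCAL_TZ)
            value = moment.strftime("%d-%m %H:%M")
        rows.append(f"{label}: \t {value}")
    return "\n".join(rows)


sample_lines = [
    '{"selling":"0","quantity":"2","price":"80000","date":"1401384212",'
    '"rs_name":"overhault","contact":"PM","notes":"testnote"}',
]

# An empty line separates the results, as requested
print("\n\n".join(format_record(line) for line in sample_lines))
```

To call it from a bash script, one option is to iterate over sys.stdin instead of sample_lines and pipe the input into the script, for example python3 format_records.py < input.txt (both file names are placeholders).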