I'm trying to merge various .json files so I can later run sentiment analysis on them.
I have already tried other approaches, but they always end up with an error.
I checked whether the .json files are correctly formatted and can't find any issues there. I have also attached an example of a .json file.
The error message is attached below my code.
import glob
import json

# list all files containing News from Guardian API
files = list(glob.iglob('/Users/xxx/tempdata/articles_data/*.json'))

news_data = []
for file in files:
    news_file = open(file, "r", encoding='utf-8')
    # Read in news and store in list: news_data
    for line in news_file:
        news = json.loads(line)
        news_data.append(news)
    news_file.close()
Updated Error Output
AttributeError Traceback (most recent call last)
<ipython-input-86-3019ee85b15b> in <module>
12 # Read in news and store in list: news_data
13 for line in news_file:
---> 14 news = json.load(line)
15 news_data.append(news)
16
~/opt/anaconda3/lib/python3.8/json/__init__.py in load(fp, cls, object_hook, parse_float, parse_int, parse_constant, object_pairs_hook, **kw)
291 kwarg; otherwise ``JSONDecoder`` is used.
292 """
--> 293 return loads(fp.read(),
294 cls=cls, object_hook=object_hook,
295 parse_float=parse_float, parse_int=parse_int,
AttributeError: 'str' object has no attribute 'read'
2019-11-01.json
{
"id":"business/2019/nov/01/google-snaps-up-fitbit-for-21bn",
"type":"article",
"sectionId":"business",
"sectionName":"Business",
"webPublicationDate":"2019-11-01T14:26:19Z",
"webTitle":"Google snaps up Fitbit for $2.1bn",
"webUrl":"https://www.theguardian.com/business/2019/nov/01/google-snaps-up-fitbit-for-21bn",
"apiUrl":"https://content.guardianapis.com/business/2019/nov/01/google-snaps-up-fitbit-for-21bn",
"fields":{
"headline":"Google snaps up Fitbit for $2.1bn",
"standfirst":"<p>Takeover allows web giant to take on Apple in fast-growing smartwatch and wearables business</p>",
"trailText":"Takeover allows web giant to take on Apple in fast-growing smartwatch and wearables business",
"byline":"Kalyeena Makortoff",
"main":"<figure class=\"element element-image\" data-media-id=\"fc8abb0f70105fcab3aee86dea6c89e211337660\"> <img src=\"https://media.guim.co.uk/fc8abb0f70105fcab3aee86dea6c89e211337660/0_158_3571_2143/1000.jpg\" alt=\"The wireless activity tracker Zip by Fitbit Inc\" width=\"1000\" height=\"600\" class=\"gu-image\" /> <figcaption> <span class=\"element-image__caption\">The wireless activity tracker Zip by Fitbit Inc. Google has confirmed it will buy Fitbit for $2.1bn.</span> <span class=\"element-image__credit\">Photograph: Franck Robichon/EPA</span> </figcaption> </figure>",
"body":"<p>Google has snapped up the Fitbit... ",
"newspaperPageNumber":"38",
"wordcount":"679",
"firstPublicationDate":"2019-11-01T14:25:58Z",
"isInappropriateForSponsorship":"false",
"isPremoderated":"false",
"lastModified":"2019-11-01T18:56:38Z",
"newspaperEditionDate":"2019-11-02T00:00:00Z",
"productionOffice":"UK",
"publication":"The Guardian",
"shortUrl":"https://gu.com/p/cjeze",
"shouldHideAdverts":"false",
"showInRelatedContent":"true",
"thumbnail":"https://media.guim.co.uk/fc8abb0f70105fcab3aee86dea6c89e211337660/0_158_3571_2143/500.jpg",
"legallySensitive":"false",
"lang":"en",
"isLive":"true",
"bodyText":"Google has snapped up the Fitbit activity tracker business in a $2.1bn (\u00a31.6bn) deal that will enable the search giant to go toe-to-toe with Apple in the fast-growing smartwatch and wearables business..." ,
"charCount":"4149",
"shouldHideReaderRevenue":"false",
"showAffiliateLinks":"false",
"bylineHtml":"Kalyeena Makortoff"
},
"isHosted":false,
"pillarId":"pillar/news",
"pillarName":"News"
},
If you are reading JSON from a file, you should use json.load instead of json.loads; json.loads is for reading JSON from a string. See the json.load documentation.
For example:
import json

with open('ts.json', 'r') as f:
    content = json.load(f)

print(content)
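Applied to your loop, a minimal sketch might look like the following. It assumes each .json file holds a single JSON document (as in the attached 2019-11-01.json example), so the whole file is parsed in one call rather than line by line:
import glob
import json

news_data = []
for path in glob.iglob('/Users/xxx/tempdata/articles_data/*.json'):
    # Parse the whole file as one JSON document and collect it.
    with open(path, 'r', encoding='utf-8') as news_file:
        news_data.append(json.load(news_file))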
I'm trying to get the vocabulary from some publicly available pre-trained models (that aren't mine) using the Python interface of AllenNLP, via self.vocab. However, I'm running into problems trying to load the model. I'm looking to get the vocabulary from the dygiepp models, using the following code:
from allennlp.models.model import Model
scierc_model = Model.from_archive('https://s3-us-west-2.amazonaws.com/ai2-s2-research/dygiepp/master/scierc.tar.gz')
However, I get the following error:
---------------------------------------------------------------------------
ConfigurationError Traceback (most recent call last)
/tmp/local/63381207/ipykernel_7616/3549263982.py in <module>
----> 1 scierc_model = Model.from_archive('https://s3-us-west-2.amazonaws.com/ai2-s2-research/dygiepp/master/scierc.tar.gz')
~/anaconda3/envs/dygiepp/lib/python3.7/site-packages/allennlp/models/model.py in from_archive(cls, archive_file, vocab)
480 from allennlp.models.archival import load_archive # here to avoid circular imports
481
--> 482 model = load_archive(archive_file).model
483 if vocab:
484 model.vocab.extend_from_vocab(vocab)
~/anaconda3/envs/dygiepp/lib/python3.7/site-packages/allennlp/models/archival.py in load_archive(archive_file, cuda_device, overrides, weights_file)
231 # Instantiate model and dataset readers. Use a duplicate of the config, as it will get consumed.
232 dataset_reader, validation_dataset_reader = _load_dataset_readers(
--> 233 config.duplicate(), serialization_dir
234 )
235 model = _load_model(config.duplicate(), weights_path, serialization_dir, cuda_device)
~/anaconda3/envs/dygiepp/lib/python3.7/site-packages/allennlp/models/archival.py in _load_dataset_readers(config, serialization_dir)
267
268 dataset_reader = DatasetReader.from_params(
--> 269 dataset_reader_params, serialization_dir=serialization_dir
270 )
271 validation_dataset_reader = DatasetReader.from_params(
~/anaconda3/envs/dygiepp/lib/python3.7/site-packages/allennlp/common/from_params.py in from_params(cls, params, constructor_to_call, constructor_to_inspect, **extras)
586 "type",
587 choices=as_registrable.list_available(),
--> 588 default_to_first_choice=default_to_first_choice,
589 )
590 subclass, constructor_name = as_registrable.resolve_class_name(choice)
~/anaconda3/envs/dygiepp/lib/python3.7/site-packages/allennlp/common/params.py in pop_choice(self, key, choices, default_to_first_choice, allow_class_names)
322 """{"model": "my_module.models.MyModel"} to have it imported automatically."""
323 )
--> 324 raise ConfigurationError(message)
325 return value
326
ConfigurationError: dygie not in acceptable choices for dataset_reader.type: ['babi', 'conll2003', 'interleaving', 'multitask', 'multitask_shim', 'sequence_tagging', 'sharded', 'text_classification_json']. You should either use the --include-package flag to make sure the correct module is loaded, or use a fully qualified class name in your config file like {"model": "my_module.models.MyModel"} to have it imported automatically.
The error message describes how to fix the problem from the command line, but not in the Python interface. I additionally tried adding the line import dygie to my code to import the missing package, but that didn't solve the problem.
Does anyone know how to get around this?
To run this model, you'll need to have the code from this repo: https://github.com/dwadden/dygiepp.
In particular, you need to import the DyGIE dataset reader from here: https://github.com/dwadden/dygiepp/blob/master/dygie/data/dataset_readers/dygie.py#L29
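In the Python interface, importing that module before calling Model.from_archive is what registers the "dygie" dataset reader, playing the role of the --include-package flag. A minimal sketch, assuming the cloned dygiepp repository's root directory is on your PYTHONPATH (the module path is taken from the link above):
from allennlp.models.model import Model

# Importing this module registers the DyGIE dataset reader with AllenNLP's registry;
# it assumes the cloned dygiepp repo is importable (its root is on PYTHONPATH).
import dygie.data.dataset_readers.dygie  # noqa: F401

scierc_model = Model.from_archive(
    'https://s3-us-west-2.amazonaws.com/ai2-s2-research/dygiepp/master/scierc.tar.gz'
)
vocab = scierc_model.vocab
If loading then fails on the model type as well, importing the remaining dygie modules (e.g. the dygie.models package) in the same way should register them too.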
I have mounted my GDrive and have a csv file in a folder. I am following the tutorial. However, when I call tf.keras.utils.get_file(), I get a ValueError as follows.
data_folder = r"/content/drive/My Drive/NLP/project2/data"
import os
print(os.listdir(data_folder))
It returns:
['crowdsourced_labelled_dataset.csv',
'P2_Testing_Dataset.csv',
'P2_Training_Dataset_old.csv',
'P2_Training_Dataset.csv']
TRAIN_DATA_URL = os.path.join(data_folder, 'P2_Training_Dataset.csv')
train_file_path = tf.compat.v1.keras.utils.get_file("train.csv", TRAIN_DATA_URL)
But this returns:
Downloading data from /content/drive/My Drive/NLP/project2/data/P2_Training_Dataset.csv
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-16-5bd642083471> in <module>()
2 TRAIN_DATA_URL = os.path.join(data_folder, 'P2_Training_Dataset.csv')
3 TEST_DATA_URL = os.path.join(data_folder, 'P2_Testing_Dataset.csv')
----> 4 train_file_path = tf.compat.v1.keras.utils.get_file("train.csv", TRAIN_DATA_URL)
5 test_file_path = tf.compat.v1.keras.utils.get_file("eval.csv", TEST_DATA_URL)
6 frames
/usr/lib/python3.6/urllib/request.py in _parse(self)
382 self.type, rest = splittype(self._full_url)
383 if self.type is None:
--> 384 raise ValueError("unknown url type: %r" % self.full_url)
385 self.host, self.selector = splithost(rest)
386 if self.host:
ValueError: unknown url type: '/content/drive/My Drive/NLP/project2/data/P2_Training_Dataset.csv'
What am I doing wrong please?
As per the docs, this is the signature of tf.keras.utils.get_file:
tf.keras.utils.get_file(
fname,
origin,
untar=False,
md5_hash=None,
file_hash=None,
cache_subdir='datasets',
hash_algorithm='auto',
extract=False,
archive_format='auto',
cache_dir=None
)
By default the file at the url origin is downloaded to the cache_dir ~/.keras, placed in the cache_subdir datasets, and given the filename fname. The final location of a file example.txt would therefore be ~/.keras/datasets/example.txt.
Returns:
Path to the downloaded file
Since you already have the data in your drive, there's no need to download it again (and IIUC, the function is expecting an accessible URL). Also, there's no need to obtain the file name from a function call because you already know it.
Assuming the drive is mounted, you can replace your file paths as below:
train_file_path = os.path.join(data_folder, 'P2_Training_Dataset.csv')
test_file_path = os.path.join(data_folder, 'P2_Testing_Dataset.csv')
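Those local paths can then be fed straight into the CSV-loading step of the tutorial. A minimal sketch, assuming the tutorial's tf.data.experimental.make_csv_dataset call; the label column name below is hypothetical and should be replaced with the real one:
import os
import tensorflow as tf

data_folder = r"/content/drive/My Drive/NLP/project2/data"
train_file_path = os.path.join(data_folder, 'P2_Training_Dataset.csv')

# The local file on the mounted drive is read directly; no download via get_file is needed.
train_dataset = tf.data.experimental.make_csv_dataset(
    train_file_path,
    batch_size=32,
    label_name='label',  # hypothetical column name; use your dataset's actual label column
    num_epochs=1,
)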
Trying to run the Stanford NER Tagger and NLTK from a Jupyter notebook.
I am continuously getting
OSError: Java command failed
I have already tried the hack at
https://gist.github.com/alvations/e1df0ba227e542955a8a
and the thread
Stanford Parser and NLTK
I am using
NLTK==3.3
Ubuntu==16.04LTS
Here is my Python code:
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.tag.stanford import StanfordNERTagger

Sample_text = "Google, headquartered in Mountain View, unveiled the new Android phone"
sentences = sent_tokenize(Sample_text)
tokenized_sentences = [word_tokenize(sentence) for sentence in sentences]

PATH_TO_GZ = '/home/root/english.all.3class.caseless.distsim.crf.ser.gz'
PATH_TO_JAR = '/home/root/stanford-ner.jar'

sn_3class = StanfordNERTagger(PATH_TO_GZ,
                              path_to_jar=PATH_TO_JAR,
                              encoding='utf-8')

annotations = [sn_3class.tag(sent) for sent in tokenized_sentences]
I got these files using following commands:
wget http://nlp.stanford.edu/software/stanford-ner-2015-04-20.zip
wget http://nlp.stanford.edu/software/stanford-postagger-full-2015-04-20.zip
wget http://nlp.stanford.edu/software/stanford-parser-full-2015-04-20.zip
# Extract the zip file.
unzip stanford-ner-2015-04-20.zip
unzip stanford-parser-full-2015-04-20.zip
unzip stanford-postagger-full-2015-04-20.zip
I am getting the following error:
CRFClassifier invoked on Thu May 31 15:56:19 IST 2018 with arguments:
-loadClassifier /home/root/english.all.3class.caseless.distsim.crf.ser.gz -textFile /tmp/tmpMDEpL3 -outputFormat slashTags -tokenizerFactory edu.stanford.nlp.process.WhitespaceTokenizer -tokenizerOptions "tokenizeNLs=false" -encoding utf-8
tokenizerFactory=edu.stanford.nlp.process.WhitespaceTokenizer
Unknown property: |tokenizerFactory|
tokenizerOptions="tokenizeNLs=false"
Unknown property: |tokenizerOptions|
loadClassifier=/home/root/english.all.3class.caseless.distsim.crf.ser.gz
encoding=utf-8
Unknown property: |encoding|
textFile=/tmp/tmpMDEpL3
outputFormat=slashTags
Loading classifier from /home/root/english.all.3class.caseless.distsim.crf.ser.gz ... Error deserializing /home/root/english.all.3class.caseless.distsim.crf.ser.gz
Exception in thread "main" java.lang.RuntimeException: java.lang.ClassCastException: java.util.ArrayList cannot be cast to [Ledu.stanford.nlp.util.Index;
at edu.stanford.nlp.ie.AbstractSequenceClassifier.loadClassifierNoExceptions(AbstractSequenceClassifier.java:1380)
at edu.stanford.nlp.ie.AbstractSequenceClassifier.loadClassifierNoExceptions(AbstractSequenceClassifier.java:1331)
at edu.stanford.nlp.ie.crf.CRFClassifier.main(CRFClassifier.java:2315)
Caused by: java.lang.ClassCastException: java.util.ArrayList cannot be cast to [Ledu.stanford.nlp.util.Index;
at edu.stanford.nlp.ie.crf.CRFClassifier.loadClassifier(CRFClassifier.java:2164)
at edu.stanford.nlp.ie.AbstractSequenceClassifier.loadClassifier(AbstractSequenceClassifier.java:1249)
at edu.stanford.nlp.ie.AbstractSequenceClassifier.loadClassifier(AbstractSequenceClassifier.java:1366)
at edu.stanford.nlp.ie.AbstractSequenceClassifier.loadClassifierNoExceptions(AbstractSequenceClassifier.java:1377)
... 2 more
---------------------------------------------------------------------------
OSError Traceback (most recent call last)
<ipython-input-15-5621d0f8177d> in <module>()
----> 1 ne_annot_sent_3c = [sn_3class.tag(sent) for sent in tokenized_sentences]
/home/root1/.virtualenv/demos/local/lib/python2.7/site-packages/nltk/tag/stanford.pyc in tag(self, tokens)
79 def tag(self, tokens):
80 # This function should return list of tuple rather than list of list
---> 81 return sum(self.tag_sents([tokens]), [])
82
83 def tag_sents(self, sentences):
/home/root1/.virtualenv/demos/local/lib/python2.7/site-packages/nltk/tag/stanford.pyc in tag_sents(self, sentences)
102 # Run the tagger and get the output
103 stanpos_output, _stderr = java(cmd, classpath=self._stanford_jar,
--> 104 stdout=PIPE, stderr=PIPE)
105 stanpos_output = stanpos_output.decode(encoding)
106
/home/root1/.virtualenv/demos/local/lib/python2.7/site-packages/nltk/__init__.pyc in java(cmd, classpath, stdin, stdout, stderr, blocking)
134 if p.returncode != 0:
135 print(_decode_stdoutdata(stderr))
--> 136 raise OSError('Java command failed : ' + str(cmd))
137
138 return (stdout, stderr)
OSError: Java command failed : [u'/usr/bin/java', '-mx1000m', '-cp', '/home/root/stanford-ner.jar', 'edu.stanford.nlp.ie.crf.CRFClassifier', '-loadClassifier', '/home/root/english.all.3class.caseless.distsim.crf.ser.gz', '-textFile', '/tmp/tmpMDEpL3', '-outputFormat', 'slashTags', '-tokenizerFactory', 'edu.stanford.nlp.process.WhitespaceTokenizer', '-tokenizerOptions', '"tokenizeNLs=false"', '-encoding', 'utf-8']
Download Stanford Named Entity Recognizer version 3.9.1: see the ‘Download’ section of the Stanford NLP website.
Unzip it and move the two files "stanford-ner.jar" and "english.all.3class.distsim.crf.ser.gz" to your folder.
Open a Jupyter notebook or IPython prompt in that folder and run the following Python code:
import nltk
from nltk.tag.stanford import StanfordNERTagger
sentence = u"Twenty miles east of Reno, Nev., " \
"where packs of wild mustangs roam free through " \
"the parched landscape, Tesla Gigafactory 1 " \
"sprawls near Interstate 80."
jar = './stanford-ner.jar'
model = './english.all.3class.distsim.crf.ser.gz'
ner_tagger = StanfordNERTagger(model, jar, encoding='utf8')
words = nltk.word_tokenize(sentence)
# Run NER tagger on words
print(ner_tagger.tag(words))
I tested this on NLTK==3.3 and Ubuntu 16.04 LTS.
I am trying to read a JSON file into Pandas. It's a relatively large file (41k records), mostly text.
{"sanders": [{"date": "February 8, 2016 Monday", "source": "Federal News
Service", "subsource": "MSNBC \"MSNBC Live\" Interview with Sen. Bernie
Sanders (I-VT), Democratic", "quotes": ["Well, it's not very progressive to
take millions of dollars from Wall Street as well.", "That's a very good
question, and I wish I could give her a definitive answer. QUOTE SHORTENED FOR
SPACE"]}, {"date": "February 7, 2016 Sunday", "source": "CBS News Transcripts", "subsource": "SHOW: CBS FACE THE NATION 10:30 AM EST", "quotes":
["Well, John -- John, I think that`s a media narrative that goes around and
around and around. I don`t accept that media narrative.", "Well, that`s what
she said about Barack Obama in 2008. "]},
I tried:
quotes = pd.read_json("/quotes.json")
I expected it to read in cleanly because it was a file created in Python. However, I got this error:
ValueError Traceback (most recent call last)
<ipython-input-19-c1acfdf0dbc6> in <module>()
----> 1 quotes = pd.read_json("/Users/kate/Documents/99Antennas/Client\
Files/Fusion/data/quotes.json")
/Users/kate/venv/lib/python2.7/site-packages/pandas/io/json.pyc in
read_json(path_or_buf, orient, typ, dtype, convert_axes, convert_dates,
keep_default_dates, numpy, precise_float, date_unit)
208 obj = FrameParser(json, orient, dtype, convert_axes,
convert_dates,
209 keep_default_dates, numpy, precise_float,
--> 210 date_unit).parse()
211
212 if typ == 'series' or obj is None:
/Users/kate/venv/lib/python2.7/site-packages/pandas/io/json.pyc in parse(self)
276
277 else:
--> 278 self._parse_no_numpy()
279
280 if self.obj is None:
/Users/kate/venv/lib/python2.7/site-packages/pandas/io/json.pyc in _
parse_no_numpy(self)
493 if orient == "columns":
494 self.obj = DataFrame(
--> 495 loads(json, precise_float=self.precise_float),
dtype=None)
496 elif orient == "split":
497 decoded = dict((str(k), v)
ValueError: Expected object or value
After reading the documentation and Stack Overflow, I also tried adding convert_dates=False to the parameters, but that did not correct the problem. I would welcome suggestions on how to handle this error.
Try removing the leading forward slash in the filename. If you run this Python code from the same directory where the file is sitting, it should work.
quotes = pd.read_json("quotes.json")
SPKoder mentioned the forward slash. I was looking for an answer when I realized I hadn't added a / when combining the filename and path (i.e. c:/path/herefile.json instead of c:/path/here/file.json). Anyway, the error I received was...
ValueError: Expected object or value
Not a very intuitive error message, but that is what causes it.
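Building the path explicitly with os.path.join avoids both mistakes; a minimal sketch, assuming the directory shown in the traceback above:
import os
import pandas as pd

# Directory taken from the traceback; adjust to wherever quotes.json actually lives.
data_dir = "/Users/kate/Documents/99Antennas/Client Files/Fusion/data"
quotes = pd.read_json(os.path.join(data_dir, "quotes.json"))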
I download the names of different companies from different websites via my localhost, and sometimes I face this problem, which interrupts the download procedure. My script works fine for other countries, but when I download the Czech Republic this type of error occurs.
Total companies processed so far:0
Traceback (most recent call last):
  File "process1.py", line 261, in <module>
    print "Company Name: "+hit.text
  File "C:\Python27\lib\encodings\cp437.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character u'\xfd' in position 33: character maps to <undefined>
My code is here:
if companyAlreadyKnown == 0:
    for hit in soup2.findAll("h1"):
        print "Company Name: "+hit.text
        pCompanyName = hit.text
        flog.write("\nCompany Name: "+str(pCompanyName))
        companyObj.setCompanyName(pCompanyName)
I don't know why this happens. Any suggestions on this problem?
The Czech language contains a lot of non-ASCII characters. u'\xfd' is the Unicode representation of ý. You need to decode UTF-8. An even better solution is to detect which encoding the website you are scraping uses and decode with that one.
if companyAlreadyKnown == 0:
    for hit in soup2.findAll("h1"):
        company_name = hit.text.decode('utf-8')
        print "Company Name: " + company_name
        flog.write("\nCompany Name: " + company_name)
        companyObj.setCompanyName(company_name)
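For the detect-the-encoding route, a minimal sketch, assuming the pages are fetched with requests (the library, the url variable, and the parser choice here are assumptions, not part of the original script):
import requests
from bs4 import BeautifulSoup

resp = requests.get(url)                  # url: whichever company page you are scraping
resp.encoding = resp.apparent_encoding    # let requests guess the page's real encoding
soup2 = BeautifulSoup(resp.text, 'html.parser')  # soup2 then yields properly decoded text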