Wget -i gives no output or results - csv

I'm learning data analysis in Zeppelin. I'm a mechanical engineer, so this is outside my expertise.
I am trying to download two CSV files using a file that contains the URLs, test2.txt. When I run it I get no output, but no error message either. I've included a link to a screenshot showing my code and the results.
When I go into Ambari Sandbox I cannot find any files created. I'm assuming the directory the file is in is where the CSV files will be downloaded to. I've also tried -P, with no luck. I've checked man wget, but it did not help.
So I have several questions:
How do I show the output from running wget?
Where is the default directory that wget stores files?
Do I need additional data in the file other than just the URLs?
Screenshot: Code and Output for %sh
Thanks for any and all help.
%sh
wget -i /tmp/test2.txt

%sh
# list the current working directory
pwd # output: /home/zeppelin
# make a new folder, created in "tmp" because it is temporary
mkdir -p /home/zeppelin/tmp/Folder_Name
# change directory to new folder
cd /home/zeppelin/tmp/Folder_Name
# copy the URL list from HDFS to the new local folder
hadoop fs -get /tmp/test2.txt /home/zeppelin/tmp/Folder_Name/
# download every URL listed in the file
wget -i test2.txt
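On the questions about location and file contents: by default wget saves into the current working directory (unless -P is given), it reports progress on standard error rather than standard output, and the input file only needs one URL per line. The Python sketch below lists where each URL from such a file would land under those defaults. The function name planned_downloads is hypothetical, and the naming rule (last path segment of the URL, index.html when there is none) is a simplification that ignores redirects and name collisions.

```python
import os
import tempfile
from urllib.parse import urlparse


def planned_downloads(url_file, dest_dir="."):
    """Return (url, local_path) pairs for every URL listed in url_file."""
    plan = []
    with open(url_file, encoding="utf-8") as handle:
        for line in handle:
            url = line.strip()
            if not url:
                continue  # blank lines carry no URL
            # Simplified naming rule: last path segment, or index.html if empty
            name = os.path.basename(urlparse(url).path) or "index.html"
            plan.append((url, os.path.join(dest_dir, name)))
    return plan


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        url_file = os.path.join(tmp, "test2.txt")
        with open(url_file, "w", encoding="utf-8") as handle:
            handle.write("https://example.com/data/a.csv\n")
            handle.write("https://example.com/b.csv\n")
        for url, path in planned_downloads(url_file, dest_dir=tmp):
            print(url, "->", path)
```

Running this against your own test2.txt with dest_dir set to the output of pwd shows which paths to look for after wget finishes.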

How can I create a hyperlink via the command line?

I would like to generate a directory of links for some friends who are not technologically savvy. I'm running Ubuntu and would like to do this via the command line.
My attempts so far have been:
touch https:...
which returns:
touch: cannot touch 'https:...': No such file or directory
cat >> https://...
which also returns the No such file or directory error.
I also tried echo, where the link was the filename and the file type was .html, which returned the same error.
If I drag and drop the link from the address bar into a folder, it creates the hyperlink - however I would like to batch these according to a list of links.
EDIT: This can be done in Python.
I was able to find a question on SO which supplied an alternative using Python.
You could try the following:
$ ln -s {source-filename} {symbolic-filename}
source-filename - the target file for which you want to create the link
symbolic-filename - the name of the symbolic link
Example:
ln -s source_file.txt link_file.txt
You can verify that the link was created using the following command:
ls -l link_file.txt
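Note that ln -s links to files on disk; it does not produce a clickable web link. For the batch case in the question, one possible approach (a sketch, not necessarily what the referenced SO answer does) is to generate a single HTML page from a list of links, which friends can open in any browser. The function name build_link_page and the sample link are hypothetical.

```python
import html


def build_link_page(links, title="Links"):
    """Return an HTML page with one clickable entry per (label, url) pair."""
    items = "\n".join(
        '<li><a href="{}">{}</a></li>'.format(
            html.escape(url, quote=True), html.escape(label)
        )
        for label, url in links
    )
    return (
        "<!DOCTYPE html>\n"
        '<html><head><meta charset="utf-8"><title>{0}</title></head>\n'
        "<body><h1>{0}</h1>\n<ul>\n{1}\n</ul>\n</body></html>\n"
    ).format(html.escape(title), items)


if __name__ == "__main__":
    # Hypothetical sample entry; replace with your own list of links
    print(build_link_page([("Example site", "https://example.com/?a=1&b=2")]))
```

Writing the returned string to a file such as links.html gives one page that holds the whole directory of links.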

Importing large datasets into Couchbase

I am having difficulty importing large datasets into Couchbase. I have experience doing this very fast with Redis via the command line, but I have not seen anything comparable yet for Couchbase.
I have tried using the PHP SDK and it imports about 500 documents / second. I have also tried the cbdocloader script in the Couchbase bin folder, but it seems to want each document in its own JSON file. It is a bit of work to create all these files and then load them. Is there some other import process I am missing? If cbdocloader is the only way to load data fast, is it possible to put multiple documents into 1 JSON file?
Take the file that has all the JSON documents in it and zip up the file:
zip somefile.zip somefile.json
Place the zip file(s) into a directory. I used ~/json_files/ in my home directory.
Then load the file or files by the following command:
cbdocloader -u Administrator -p s3kre7Pa55 -b MyBucketToLoad -n 127.0.0.1:8091 -s 1000 \
~/json_files/somefile.zip
Note: '-s 1000' is the memory size (the RAM quota for the bucket, in MB). You'll need to adjust this value for your bucket.
If successful, you'll see output stating how many documents were loaded, whether the load succeeded, etc.
Here is a brief script to load up a lot of .zip files in a given directory:
#!/bin/bash
# Directory holding the .zip files to load
JSON_Dir=~/json_files
# Quote the variables so paths containing spaces still work
for ZipFile in "$JSON_Dir"/*.zip
do
  /Applications/Couchbase\ Server.app/Contents/Resources/couchbase-core/bin/cbdocloader \
    -u Administrator -p s3kre7Pa55 -b MyBucketToLoad \
    -n 127.0.0.1:8091 -s 1000 "$ZipFile"
done
UPDATED: Keep in mind that this script will only work if your data is formatted correctly and each document is smaller than the maximum single-document size of 20MB (the limit applies to each document extracted from the zip, not to the zip file itself).
I have created a blog post describing bulk loading from a single file as well and it is listed here:
Bulk Loading Documents Into Couchbase
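If the loader does insist on one document per file, producing those files does not have to be manual work. The Python sketch below packs a dictionary of documents into a zip, one JSON entry per document, and skips any document above the 20MB limit mentioned above. The function name pack_documents is hypothetical, and the docs/ folder inside the archive is an assumption about the layout cbdocloader expects; check it against your Couchbase version before relying on it.

```python
import json
import zipfile

MAX_DOC_BYTES = 20 * 1024 * 1024  # 20MB single-document limit quoted above


def pack_documents(docs, zip_path, max_bytes=MAX_DOC_BYTES):
    """Write each document in docs (key -> dict) as its own JSON entry.

    Returns the keys that were skipped for exceeding max_bytes.
    """
    skipped = []
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for key, doc in docs.items():
            payload = json.dumps(doc).encode("utf-8")
            if len(payload) > max_bytes:
                skipped.append(key)
                continue
            # ASSUMPTION: documents live under docs/ inside the archive
            archive.writestr("docs/{}.json".format(key), payload)
    return skipped
```

A call such as pack_documents(all_docs, "somefile.zip") would produce an archive that can then be passed to the cbdocloader command shown earlier.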

How do I give permission for apache to retrieve imgs from directory?

I'm using Fedora and I need to set up Apache so that it can serve images from a directory. So far it can write to my image directory, since I ran these commands:
semanage fcontext -a -t httpd_sys_rw_content_t '/var/www/profile-pics'
restorecon -v '/var/www/profile-pics'
but when I try to access these files through the localhost/profile-pics/my_pic.jpg, the URL isn't being found. And as I said before, my PHP script is able to upload images from the user and write to this directory. I just don't understand why I can't retrieve the images and send them back to the client.
You'll need to allow access using a <Directory> section in your httpd or vhost configuration: http://httpd.apache.org/docs/2.2/mod/core.html#directory

Tesseract running error

I have a problem running the tesseract-ocr engine on Linux. I've downloaded the RUS language data and put it in the tessdata directory (/usr/local/share/tessdata). When I try to run tesseract with the command tesseract blob.jpg out -l rus, it displays an error:
Error opening data file /usr/local/share/tessdata/eng.traineddata
Please make sure the TESSDATA_PREFIX environment variable is set to the parent directory of your "tessdata" directory.
Failed loading language eng
Tesseract couldn't load any languages!
Could not initialize tesseract.
According to the compiling guide, I used export TESSDATA_PREFIX='/usr/local/share/'
to point to my tessdata directory.
Do I need to edit any config files? Tesseract tries to load the 'eng' data files instead of 'rus'.
Screenshot:
http://i.stack.imgur.com/I0Guc.png
You can grab eng.traineddata from GitHub:
wget https://github.com/tesseract-ocr/tessdata/raw/main/eng.traineddata
Check https://github.com/tesseract-ocr/tessdata for a full list of trained language data.
When you grab the file(s), move them to the /usr/local/share/tessdata folder. Warning: some Linux distributions (such as openSUSE and Ubuntu) may be expecting it in /usr/share/tessdata instead.
# If you got the data from Google, unzip it first!
gunzip eng.traineddata.gz
# Move the data
sudo mv -v eng.traineddata /usr/local/share/tessdata/
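Since the expected location differs between systems, it can help to check where a given language file actually exists before moving things around. The Python sketch below is only an illustration: the function name find_traineddata is hypothetical, the default directory list is taken from paths mentioned in this thread, and TESSDATA_PREFIX is tried both as the tessdata directory itself and as its parent, because the error messages quoted here describe it both ways.

```python
import os

# Directories mentioned in this thread; adjust for your system.
DEFAULT_DIRS = (
    "/usr/local/share/tessdata",
    "/usr/share/tessdata",
    "/usr/share/tesseract-ocr/tessdata",
)


def find_traineddata(lang, search_dirs=DEFAULT_DIRS, env=None):
    """Return every existing path to <lang>.traineddata in the candidates."""
    env = os.environ if env is None else env
    filename = lang + ".traineddata"
    candidates = []
    prefix = env.get("TESSDATA_PREFIX")
    if prefix:
        # Try the prefix as the tessdata directory and as its parent
        candidates.append(prefix)
        candidates.append(os.path.join(prefix, "tessdata"))
    candidates.extend(search_dirs)
    found = []
    for directory in candidates:
        path = os.path.join(directory, filename)
        if os.path.isfile(path):
            found.append(path)
    return found


if __name__ == "__main__":
    for lang in ("eng", "rus"):
        print(lang, find_traineddata(lang))
```

An empty list for a language suggests that the file is missing from all of these places, which matches the "Failed loading language" error.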
The simplest way is to install the needed package:
sudo apt-get install tesseract-ocr-eng #for english
sudo apt-get install tesseract-ocr-tam #for tamil
sudo apt-get install tesseract-ocr-deu #for deutsch (German)
As you can see, the same naming pattern applies to other languages (e.g. tesseract-ocr-fra).
I had this error too on a Windows machine.
My solution:
1) Download your language files from
https://github.com/tesseract-ocr/tessdata/tree/3.04.00
For example, for eng, I downloaded all files with the eng prefix.
2) Put them into a tessdata directory inside some folder. Add this folder to the system environment variables as TESSDATA_PREFIX.
The result will be
System env var: TESSDATA_PREFIX=D:/Java/OCR
and the OCR folder contains tessdata with the language files.
No previous solution worked for me.
I installed the data both via apt-get and by manually downloading tessdata, moved it around /usr and so on, and nothing worked, even though I exported the variable a thousand times.
Finally, as a last try before starting to cry, I passed the path directly to the Tesseract() instance.
In Python: tr = Tesseract("/usr/local/share/tesseract-ocr/") and now it works. To clarify, I'm using the tesserwrap module.
For Windows Users:
In Environment Variables, add a new system variable named "TESSDATA_PREFIX" with the value "C:\Program Files (x86)\Tesseract-OCR\tessdata"
tesseract --tessdata-dir <tessdata-folder> <image-path> stdout --oem 2 -l <lng>
In my case, these are the mistakes I made, or the attempts that were not a success:
1) I cloned the GitHub repo and copied files from there to
/usr/local/share/tessdata/
/usr/share/tesseract-ocr/tessdata/
/usr/share/tessdata/
2) Used TESSDATA_PREFIX with the above paths
3) sudo apt-get install tesseract-ocr-eng
The first 2 attempts did not work because the files from git clone did not work, for reasons that I do not know. I am not sure why the #3 attempt worked for me.
Finally,
1) I downloaded the eng.traineddata file using wget
2) Copied it to some directory
3) Used --tessdata-dir with the directory name
The takeaway for me is to learn the tool well and make use of it, rather than relying on package manager installation and directories.
For me the problem was in how I downloaded the trained data files. Make sure you use the raw link.
Initially I was using:
wget https://github.com/tesseract-ocr/tessdata_best/blob/master/eng.traineddata
When I changed it to:
wget https://github.com/tesseract-ocr/tessdata_best/raw/master/eng.traineddata
It worked
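The difference matters because the /blob/ URL returns GitHub's HTML page for the file rather than the file itself. As a rough illustration, the Python sketch below rewrites a /blob/ URL into its /raw/ form and checks whether an already downloaded file looks like an HTML page. Both function names are hypothetical, and the HTML check is a simple heuristic on the first bytes of the file.

```python
def to_raw_url(url):
    """Turn a GitHub /blob/ page URL into the matching /raw/ URL."""
    return url.replace("/blob/", "/raw/", 1)


def looks_like_html(path):
    """Heuristic: True if the file starts like an HTML page."""
    with open(path, "rb") as handle:
        head = handle.read(512).lstrip().lower()
    return head.startswith(b"<!doctype html") or head.startswith(b"<html")


if __name__ == "__main__":
    blob = "https://github.com/tesseract-ocr/tessdata_best/blob/master/eng.traineddata"
    print(to_raw_url(blob))
```

If looks_like_html returns True for your eng.traineddata, the download most likely came from the page URL and should be repeated with the raw link.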
For Ubuntu just run the below command and the Environment variable error will disappear.
command:
export TESSDATA_PREFIX=Path_of_your_tessdata_folder
Command Example:
export TESSDATA_PREFIX=/home/amar/Desktop/OCR/tesseract-4.1.1/tessdata
This command will set the tessdata folder's path to the environment variable with name TESSDATA_PREFIX and the above error will be resolved.
You can call the Tesseract API functions from C++ code:
#include <cstdio> // printf
#include <tesseract/baseapi.h>
#include <tesseract/ocrclass.h> // ETEXT_DESC
using namespace tesseract;
class TessAPI : public TessBaseAPI {
public:
void PrintRects(int len);
};
...
TessAPI *api = new TessAPI();
int res = api->Init(NULL, "rus");
api->SetAccuracyVSpeed(AVS_MOST_ACCURATE);
api->SetImage(data, w0, h0, bpp, stride);
api->SetRectangle(x0,y0,w0,h0);
char *text;
ETEXT_DESC monitor;
api->RecognizeForChopTest(&monitor);
text = api->GetUTF8Text();
printf("text: %s\n", text);
delete[] text; // GetUTF8Text() returns a buffer the caller must free
// count and progress are integer fields, so print them with %d
printf("m.count: %d\n", monitor.count);
printf("m.progress: %d\n", monitor.progress);
api->RecognizeForChopTest(&monitor);
text = api->GetUTF8Text();
printf("text: %s\n", text);
delete[] text;
...
api->End();
And build this code:
g++ -g -I. -I/usr/local/include -o _test test.cpp -ltesseract_api -lfreeimageplus
(I need FreeImage for picture loading.)
I'm using Windows. I tried all the solutions above and none of them worked.
Finally, I installed Tesseract-OCR on the D drive (where I run my Python script from) instead of the C drive, and it works.
So, if you are using Windows, run your Python script on the same drive as your Tesseract-OCR installation.
In Google Colab I resolved the issue in this way:
!sudo apt-get install tesseract-ocr-*
This is because the command !sudo apt install tesseract-ocr installs only 2 languages, so when you intend to work with non-English languages, the former command is the one that works.
Afterwards, run !pip install pytesseract
You can also check the installed languages with !tesseract --list-langs
I'm using Visual Studio 2017 Community Edition.
I solved this problem by making a directory called tessdata in the Debug directory of my project. Then I put the eng.traineddata file into said directory.
C# developer working on Windows here. What works for me is to simply download the file eng.traineddata from the following URL:
https://github.com/tesseract-ocr/tessdata/blob/master/eng.traineddata
and copy it to the following directory in my Console Application project:
[Project Directory]\bin\Debug\tessdata
I did manually create the tessdata folder above.
tessdata_dir_config = r'--tessdata-dir "/usr/local/Cellar/tesseract/4.1.1/share/tessdata"'
pytesseract.image_to_string(imgCrop,lang='eng',config=tessdata_dir_config)
Add this to your code :
instance.setDatapath("C:\\somepath\\tessdata");
instance.setLanguage("eng");
How I solved the problem in my Manjaro Xfce:
Message “TesseractError: (1, 'Error opening data file /home/julio/snap/tesseract/common/eng.traineddata Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory. Failed loading language 'eng' Tesseract couldn't load any languages! Could not initialize tesseract.')”
Then, in my Manjaro, I typed: sudo pacman -S tesseract
Then the system installed both “tesseract” and a package named “leptonica”.
After this step, I thought everything was OK, and tried to run my simple script. However, the error message changed to something like this (the previous “/home” location changed to a “/usr”-like location):
“Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory. Failed loading language 'eng' Tesseract couldn't load any languages! Could not initialize tesseract.')”
Then I realized that this message had appeared when I installed “tesseract” with pacman: “You must install one of tesseract-data-* packages or whole tesseract-data group”
So, I tried the command “sudo pacman -S tesseract-data”, and the system presented lots of language options to me. I chose some languages, installed them as follows, and the module started to work like a charm:
sudo pacman -S tesseract-data-eng
sudo pacman -S tesseract-data-por
sudo pacman -S tesseract-data-fra
sudo pacman -S tesseract-data-spa
I tried some Portuguese special characters (like "ão"), which only worked when I passed the argument lang='por', as in pytesseract.image_to_string(img,lang='por')
As of 2021, my solution for Ubuntu is to download the zip files from https://github.com/tesseract-ocr/tessdata_best/releases/tag/4.1.0, extract them, and copy the necessary .traineddata files into /usr/local/share/tessdata. This is the default folder in which tesseract 4.1.1 searches for trained data.
I had the same problem with DEU language on macOS. I could solve it by installing all additional languages like so:
brew install tesseract-lang
as suggested on https://formulae.brew.sh/formula/tesseract
If you are on Windows, add your Tesseract-OCR installation to the system environment variables. For example:
1) Find the path where Tesseract is installed on your C drive (in my case r"C:\Program Files\Tesseract-OCR\tesseract.exe").
2) Make sure you have the required files, i.e. tessdata and langdata; if not, download them from https://github.com/tesseract-ocr/tessdata and https://github.com/tesseract-ocr/langdata (at least for the languages you want to convert).
3) Paste them into the main directory, in my case C:\Program Files\Tesseract-OCR
4) Add the path of the directory to your system environment variable. To do that:
search for "environment variable" in the Start bar
go to Environment Variables
click Path in your system environment variables (NOT in the user environment variables)
paste the path of Tesseract-OCR
That's all.

Hunspell - Can't open affix or dictionary files for dictionary named en_US

I'd like to use hunspell to spell check my repo. However when I try to run it I get the following error:
Can't open affix or dictionary files for dictionary named "en_US".
How can I fix this? I'm on a Mac.
Thanks, Kevin
Execute hunspell -D. You should get output like this:
SEARCH PATH:
.::/usr/share/hunspell:/usr/share/myspell:
/usr/share/myspell/dicts:/Library/Spelling:
AVAILABLE DICTIONARIES (path is not mandatory for -d option):
/Library/Spelling/en_GB
LOADED DICTIONARY:
/Library/Spelling/en_GB.aff
/Library/Spelling/en_GB.dic
This lists the directories in which hunspell is searching for dictionary files, as well as the dictionaries it has found. If the dictionary en_US isn't listed, you haven't got that particular dictionary installed.
To install a dictionary, search for it in the LibreOffice extension repository. Download it then extract the .aff and .dic files to one of the locations listed by hunspell -D. For example:
# First download dict-en.oxt
unzip dict-en.oxt -d dict-en
cp dict-en/en_GB.aff dict-en/en_GB.dic ~/Library/Spelling/
rm -r dict-en
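The same extraction can be scripted. The Python sketch below treats the .oxt file as a zip archive (which is what the unzip call above relies on) and copies only the .aff and .dic files for one dictionary name into a target directory. The function name install_dictionary is hypothetical, and the target directory is whatever location hunspell -D lists on your machine.

```python
import os
import shutil
import zipfile


def install_dictionary(oxt_path, dest_dir, name):
    """Copy <name>.aff and <name>.dic from the .oxt archive into dest_dir."""
    wanted = {name + ".aff", name + ".dic"}
    installed = []
    os.makedirs(dest_dir, exist_ok=True)
    with zipfile.ZipFile(oxt_path) as archive:
        for member in archive.namelist():
            base = os.path.basename(member)
            if base not in wanted:
                continue
            # Flatten any folder structure inside the archive
            target = os.path.join(dest_dir, base)
            with archive.open(member) as src, open(target, "wb") as dst:
                shutil.copyfileobj(src, dst)
            installed.append(target)
    return sorted(installed)
```

For example, install_dictionary("dict-en.oxt", os.path.expanduser("~/Library/Spelling"), "en_US") would mirror the shell steps above for the en_US dictionary, provided the archive contains those two files.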
I'm using Emacs on Windows with msys2. I installed the following 2 packages:
pacman -S mingw-w64-x86_64-hunspell-en mingw-w64-x86_64-hunspell
The mingw-w64-x86_64-hunspell-en package installs English dictionaries in /mingw64/share/hunspell, but you should check whether the files (en_US.dic and en_US.aff) are actually there.
Steps:
1) Set the environment variables in .bashrc within msys2. Without DICPATH it was not working for me.
export DICTIONARY=en_US
export DICPATH=/d/../msys2/mingw64/share/hunspell
2) Run hunspell.exe -D
SEARCH PATH:
.;... ;...;...
AVAILABLE DICTIONARIES (path is not mandatory for -d option):
D:/xx/mysys64/mingw64/share/hunspell/en_AG
D:/xx/mysys64/mingw64/share/hunspell/en_AU
...
Hunspell 1.6.0
I was lucky to find my language here: https://extensions.openoffice.org/en/search?query=de_CH&sort_by=field_project_stats_year&sort_order=DESC
With the comment from #RobDavenport I was able to rename the extension and extract the files. I reread it at this link and gave it a try.
I dropped the .dic, .dat and .aff files into my ~/Library/Spelling/ folder.