How can I convert an .h5 file to its corresponding .csv file? These .h5 files are the output of DeepLabCut.
import h5py
import pandas as pd
path_to_file = '/scratch3/3d_pose/animalpose/experiments/moth-filtered-Mona-2019-12-06_10k_iters/videos/mothDLC_resnet50_moth-filteredDec6shuffle1_10000.h5'
f = h5py.File(path_to_file, 'r')
print(f.keys())
I get:
runfile('/scratch3/3d_pose/animalpose/moth_original/converth5tocsv.py', wdir='/scratch3/3d_pose/animalpose/moth_original')
/scratch3/pycharm-2019.3/plugins/python/helpers/pydev/_pydev_bundle/pydev_import_hook.py:21: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
module = self._system_import(name, *args, **kwargs)
KeysView(<HDF5 file "mothDLC_resnet50_moth-filteredDec6shuffle1_10000.h5" (mode r)>)
So I am not sure how to proceed when I don't get the keys.
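As a side note, print(list(f.keys())) would show the actual key names; the KeysView repr in that h5py version just hides them. More directly, DeepLabCut's prediction .h5 files are typically pandas DataFrames saved with to_hdf, so a minimal sketch like the following may be enough (assuming the file really is a pandas-format HDF; the path is the one from the question):
import pandas as pd
path_to_file = '/scratch3/3d_pose/animalpose/experiments/moth-filtered-Mona-2019-12-06_10k_iters/videos/mothDLC_resnet50_moth-filteredDec6shuffle1_10000.h5'
# read_hdf works without an explicit key when the file stores a single pandas object
df = pd.read_hdf(path_to_file)
df.to_csv(path_to_file.replace('.h5', '.csv'))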
Using:
pip install pydeequ==1.0.1
deequ-2.0.0-spark-3.1.jar and
deequ-1.0.7_scala-2.12_spark-3.0.0.jar (tried both)
Spark Version 3.1
Python 3.8
Spark Config:
spark = (SparkSession.builder
    .config("spark.jars.packages", pydeequ.deequ_maven_coord)
    .config("spark.jars.excludes", pydeequ.f2j_maven_coord)
    .getOrCreate())
This is the code:
import pydeequ
from pydeequ.repository import *
from pydeequ.analyzers import *
from pydeequ.checks import *
from pydeequ.verification import *
## Initialize the Deequ Validation API
checkResult = VerificationSuite(spark).onData(df)
checkResult.addCheck(Check(spark, CheckLevel.Warning, "DQ Validation").isComplete(column_to_validate))
checkResult.addCheck(Check(spark, CheckLevel.Warning, "DQ Validation").hasDataType(column_to_validate, ConstrainableDataTypes.Integral))
.......
checkResult.run()
checkResult_df = VerificationResult.checkResultsAsDataFrame(spark, checkResult)
checkResult_df.show()
This code is similar to: https://pydeequ.readthedocs.io/en/latest/README.html#constraint-verification
But it gives the following error. Why?
~/cluster-env/clonedenv/lib/python3.8/site-packages/pydeequ/verification.py in checkResultsAsDataFrame(cls, spark_session, verificationResult, forChecks, pandas)
135
136 df = spark_session._jvm.com.amazon.deequ.VerificationResult.checkResultsAsDataFrame(
--> 137 spark_session._jsparkSession, verificationResult.verificationRun, forChecks
138 )
139 sql_ctx = SQLContext(
AttributeError: 'VerificationRunBuilder' object has no attribute 'verificationRun'
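For what it's worth, the error suggests that checkResult is still the VerificationRunBuilder: run() returns the verification result, but the code above never assigns it, so checkResultsAsDataFrame receives the builder instead. A minimal sketch of the pattern from the linked pydeequ README (column_to_validate stands in for the real column name):
# build the check once, chaining the constraints
check = (Check(spark, CheckLevel.Warning, "DQ Validation")
         .isComplete(column_to_validate)
         .hasDataType(column_to_validate, ConstrainableDataTypes.Integral))
# run() returns the result object that checkResultsAsDataFrame expects
checkResult = (VerificationSuite(spark)
               .onData(df)
               .addCheck(check)
               .run())
checkResult_df = VerificationResult.checkResultsAsDataFrame(spark, checkResult)
checkResult_df.show()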
I am trying to install GeoMesa for PySpark, and while initialising I get an error.
command: geomesa_pyspark.init_sql(spark)
~/opt/anaconda3/envs/geomesa-pyspark/lib/python3.7/site-packages/geomesa_pyspark/__init__.py in init_sql(spark)
113
114 def init_sql(spark):
--> 115 spark._jvm.org.apache.spark.sql.SQLTypes.init(spark._jwrapped)
TypeError: 'JavaPackage' object is not callable
I used the following to install it:
pyspark == 2.4.8
geomesa_pyspark using https://repo.eclipse.org/content/repositories/geomesa-releases/org/locationtech/geomesa/
geomesa_pyspark-2.4.0.tar.gz
geomesa-accumulo-spark-runtime_2.11-2.4.0.jar
python 3.7
import geomesa_pyspark
conf = geomesa_pyspark.configure(
jars=['./jars/geomesa-accumulo-spark-runtime_2.11-2.4.0.jar', './jars/postgresql-42.3.1.jar', './jars/geomesa-spark-sql_2.11-2.4.0.jar'],
packages=['geomesa_pyspark','pytz'],
spark_home='/Users/user/opt/anaconda3/envs/geomesa-pyspark/lib/python3.7/site-packages/pyspark').\
setAppName('MyTestApp')
spark = ( SparkSession
.builder
.config(conf=conf)
.config('spark.driver.memory', '15g')
.config('spark.executor.memory', '15g')
.config('spark.default.parallelism', '10')
.config('spark.sql.shuffle.partitions', '10')
.master("local")
.getOrCreate()
)
I replaced
jars=['./jars/geomesa-accumulo-spark-runtime_2.11-2.4.0.jar', './jars/postgresql-42.3.1.jar', './jars/geomesa-spark-sql_2.11-2.4.0.jar'],
with
jars=['./jars/geomesa-accumulo-spark-runtime_2.11-2.4.0.jar'],
and for PostgreSQL I passed .option("driver", "org.postgresql.Driver") while loading data through PySpark, which fixed the issue.
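For reference, a minimal sketch of that PostgreSQL load (the connection URL and table name below are placeholders, not from the original setup):
df = (spark.read
      .format("jdbc")
      .option("url", "jdbc:postgresql://localhost:5432/mydb")  # placeholder connection string
      .option("dbtable", "my_table")                           # placeholder table name
      .option("driver", "org.postgresql.Driver")               # the driver option mentioned above
      .load())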
System: WIN10
IDE: MS VSCode
Language: Python version 3.7.3
Library: pandas version 1.0.1
Data source: https://hoopshype.com/salaries/#hh-tab-team-payroll
Dataset: team payrolls
I am having an issue when trying to convert a str(table) into a DataFrame; I think I am grossly missing something here. The sample code is supplied below, and it keeps putting only the top row of data into the DataFrame. I think I am missing some other transformation step.
Steps taken:
searched the net for various tutorials
tried splitting the data first and then converting it, with no luck
Code:
# import Libraries
import pandas as pd
import requests
from bs4 import BeautifulSoup
from tabulate import tabulate
# data visualization
import matplotlib.pyplot as plt
import seaborn as sns
# setting: For output purposes to show all columns
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 2000)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
# parsing the webpage
url = 'https://hoopshype.com/salaries/#hh-tab-team-payroll'
r = requests.get(url)
data = r.text
# create a beautfulsoup object
soup = BeautifulSoup(r.content,'lxml')
soup.prettify
# team salaries by year = tsby
table = soup.find_all('table')[0]
nbatsby = pd.read_html(str(table))
# this is where I am stuck
df = pd.DataFrame(nbatsby)
df.head(100)
pandas has read_html, which can scrape an HTML table and convert it to a DataFrame directly:
import pandas as pd
import requests
r = requests.get("https://hoopshype.com/salaries/#hh-tab-team-payroll")
data = pd.read_html(r.text, attrs = {'class': 'hh-salaries-ranking-table'})[0]
print(data)
Output:
Unnamed: 0 Team 2019/20 2020/21 2021/22 2022/23 2023/24 2024/25
0 1.0 Portland $XXX,XXX,XXX $XXXX,XXX,XXX $XX,XXX,XXX $XX,XXX,XXX $XX,XXX,XXX $XX,XXX,XXX
1 2.0 Miami $XXX,XXX,XXX $XX,XXX,XXX $XX,XXX,XXX $XX,XXX,XXX $0 $0
................................................................................................................
28 29.0 Indiana $XXX,XXX,XXX $XXX,XXX,XXX $XX,XXX,XXX $XX,XXX,XXX $XX,XXX,XXX $0
29 30.0 New York $XXX,XXX,XXX $XX,XXX,XXX $XX,XXX,XXX $0 $0 $0
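To explain why the original approach kept only the top row: pd.read_html returns a list of DataFrames, one per table it finds, so wrapping that list in pd.DataFrame produces a single-row frame holding the list's elements. A minimal fix to the original code is to index into the list instead:
# nbatsby is a list of DataFrames; take the first (and only) table
df = nbatsby[0]
df.head(100)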
MCVE
In a terminal, while loading an IPython/Jupyter notebook as .json:
$ python
Python 3.5.2 | packaged by conda-forge | (default, Jul 26 2016, 01:32:08)
[GCC 4.8.2 20140120 (Red Hat 4.8.2-15)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import json
>>> import io
>>> with io.open('Untitled.ipynb','r',encoding='utf-8') as sf:
... data = json.load(sf) # error on this line
... print(data)
...
Traceback (most recent call last):
File "<stdin>", line 4, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character '\xf3' in position 189292: ordinal not in range(128)
How to make this work?
Also:
>>> print(sys.getdefaultencoding())
utf-8
>>> print(sys.stdout.encoding)
ANSI_X3.4-1968 # I assume this is the main reason why it does not work
What have I tried?
export PYTHONIOENCODING=UTF-8 does not work.
import importlib; importlib.reload(sys); sys.setdefaultencoding("UTF-8") does not work.
sys.stdout = codecs.getwriter('utf-8')(sys.stdout.buffer, 'strict') gives TypeError: write() argument must be str, not bytes
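One direction that may be worth trying (a sketch, assuming the failure really comes from printing to the ASCII-configured stdout rather than from json.load itself): rewrap sys.stdout with a UTF-8 writer before printing.
import io
import json
import sys
# force a UTF-8 text wrapper around stdout so non-ASCII characters can be printed
# even when the locale reports ANSI_X3.4-1968
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf-8', errors='replace')
with io.open('Untitled.ipynb', 'r', encoding='utf-8') as sf:
    data = json.load(sf)
    print(data)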
I am trying to create namespace packages for a modular project.
The core system has the following packages:
homie
homie.api
homie.events
homie.mods
I want to install the respective modules as sub-packages of homie.mods.
Therefore I provided the respective homie.__init__.py and homie.mods.__init__.py with the following content:
from pkg_resources import declare_namespace
declare_namespace(__name__)
My testing module is structured as follows:
homie
homie/__init__.py
homie/mods
homie/mods/__init__.py
homie/mods/test
homie/mods/test/__init__.py
Where homie/__init__.py and homie/mods/__init__.py contain:
from pkgutil import extend_path
__path__ = extend_path(__path__, __name__)
and homie/mods/test/__init__.py contains
print('Works!')
The core package is being set up with the following setup.py:
from setuptools import setup
<snip>
setup(
name='homie',
version=version,
author=author,
author_email=author_email,
install_requires=[
'docopt',
'setproctitle',
'homeinfo',
'homeinfo-crm',
'openimmo',
'zmq'
],
packages=[
'homie',
'homie.api',
'homie.events',
'homie.mods'
],
namespace_packages=['homie', 'homie.mods'],
<snip>
which I install first.
Secondly, I install the module using its setup.py:
#! /usr/bin/env python3
from setuptools import setup
setup(name='homie-test',
version=version,
author=author,
author_email=email,
install_requires=['homie'],
packages=['homie.mods.test'],
namespace_packages=['homie', 'homie.mods'],
license=open('LICENSE.txt').read()
)
When trying to import stuff, I get the following:
>>> from homie import api, events, mods
>>> from homie.mods import test
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ImportError: cannot import name 'test'
>>>
What am I missing?
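One thing that stands out (an observation, not a confirmed fix): the core package declares the namespace with pkg_resources.declare_namespace, while the test module uses the pkgutil extend_path style, and setuptools documents that the two styles should not be mixed across distributions sharing a namespace. A minimal sketch would be to give every namespace __init__.py in both distributions the same pkg_resources-style declaration, matching the namespace_packages entries in the two setup.py files:
# homie/__init__.py and homie/mods/__init__.py, identical in both distributions
__import__('pkg_resources').declare_namespace(__name__)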