Python HDFS client



Because the world needs yet another way to talk to HDFS from Python.


This library provides a Python client for WebHDFS. NameNode HA is supported by passing in both NameNodes. Responses are returned as nice Python classes, and any failed operation will raise some subclass of HdfsException matching the Java exception.

Example usage:

>>> fs = pyhdfs.HdfsClient(hosts=',', user_name='someone')
>>> fs.list_status('/')
[FileStatus(pathSuffix='benchmarks', permission='777', type='DIRECTORY', ...), FileStatus(...), ...]
>>> fs.listdir('/')
['benchmarks', 'hbase', 'solr', 'tmp', 'user', 'var']
>>> fs.mkdirs('/fruit/x/y')
True
>>> fs.create('/fruit/apple', 'delicious')
>>> fs.append('/fruit/apple', ' food')
>>> with contextlib.closing(fs.open('/fruit/apple')) as f:
...     f.read()
...
b'delicious food'
>>> fs.get_file_status('/fruit/apple')
FileStatus(length=14, owner='someone', type='FILE', ...)
>>> fs.get_file_status('/fruit/apple').owner
'someone'
>>> fs.get_content_summary('/fruit')
ContentSummary(directoryCount=3, fileCount=1, length=14, quota=-1, spaceConsumed=14, spaceQuota=-1)
>>> list(fs.walk('/fruit'))
[('/fruit', ['x'], ['apple']), ('/fruit/x', ['y'], []), ('/fruit/x/y', [], [])]
>>> fs.exists('/fruit/apple')
True
>>> fs.delete('/fruit')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File ".../", line 525, in delete
pyhdfs.HdfsPathIsNotEmptyDirectoryException: `/fruit is non empty': Directory is not empty
>>> fs.delete('/fruit', recursive=True)
True
>>> fs.exists('/fruit/apple')
False
>>> issubclass(pyhdfs.HdfsFileNotFoundException, pyhdfs.HdfsIOException)
True

The methods and return values generally map directly to WebHDFS endpoints. The client also provides convenience methods that mimic Python os methods and HDFS CLI commands (e.g. walk and copy_to_local).
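As a point of comparison, the tuples shown for fs.walk above have the same (dirpath, dirnames, filenames) shape that os.walk produces. The sketch below does not touch HDFS: it rebuilds the /fruit tree in a local temporary directory and walks it with os.walk. The helper name local_walk is hypothetical and exists only for this illustration.

```python
import os
import tempfile


def local_walk(root):
    """Walk a local tree and report paths relative to root's parent directory."""
    base = os.path.dirname(root)
    return [
        (
            "/" + os.path.relpath(dirpath, base).replace(os.sep, "/"),
            sorted(dirnames),
            sorted(filenames),
        )
        for dirpath, dirnames, filenames in os.walk(root)
    ]


with tempfile.TemporaryDirectory() as tmp:
    # Local stand-in for the /fruit tree from the example session.
    root = os.path.join(tmp, "fruit")
    os.makedirs(os.path.join(root, "x", "y"))
    with open(os.path.join(root, "apple"), "wb") as f:
        f.write(b"delicious food")
    tree = local_walk(root)

print(tree)
```

Code written against os.walk should therefore need little more than a change of entry point to iterate over an HDFS tree with fs.walk.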

pyhdfs logs all HDFS actions at the INFO level, so turning on INFO level logging will give you a debug record for your application.
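A minimal way to turn that logging on, using only the standard logging module. It assumes pyhdfs emits its records under a logger named "pyhdfs" (the module name); if the library uses a different logger name, adjust it accordingly.

```python
import logging

# Send log records to stderr; WARNING keeps other libraries quiet.
logging.basicConfig(
    level=logging.WARNING,
    format="%(asctime)s %(name)s %(levelname)s %(message)s",
)

# Assumption: pyhdfs logs under a logger named "pyhdfs" (the module name).
hdfs_logger = logging.getLogger("pyhdfs")
hdfs_logger.setLevel(logging.INFO)
```

Raising the level on one logger rather than on the root logger keeps the record focused on HDFS actions.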

For more information, see the full API docs.


pip install pyhdfs

Python 3 is required.

Development testing

First run x.y.z, which will download, extract, and run the HDFS NN/DN processes in the current directory. (Replace x.y.z with a real version.) Then run the following commands. Note they will create and delete hdfs://localhost/tmp/pyhdfs_test.


python3 -m venv env
source env/bin/activate
pip install -e .
pip install -r dev_requirements.txt
  • client should return some info when successfully creating a file

    For example, the HDFS server may return a response with headers like this:

    HTTP/1.1 201 Created
    Location: webhdfs://<HOST>:<PORT>/<PATH>
    Content-Length: 0

    I want to get the location from the response headers; however, client.create does not return anything.

    opened by cosven 7
  • Write error

    Hello. Mkdir and listdir work fine, but create does not:

    fs.create('/fruit/apple', 'delicious')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/root/miniconda2/lib/python2.7/site-packages/", line 426, in create
        metadata_response.headers['location'], data=data, **self._requests_kwargs)
      File "/root/miniconda2/lib/python2.7/site-packages/requests/", line 126, in put
        return request('put', url, data=data, **kwargs)
      File "/root/miniconda2/lib/python2.7/site-packages/requests/", line 58, in request
        return session.request(method=method, url=url, **kwargs)
      File "/root/miniconda2/lib/python2.7/site-packages/requests/", line 512, in request
        resp = self.send(prep, **send_kwargs)
      File "/root/miniconda2/lib/python2.7/site-packages/requests/", line 622, in send
        r = adapter.send(request, **kwargs)
      File "/root/miniconda2/lib/python2.7/site-packages/requests/", line 513, in send
        raise ConnectionError(e, request=request)
    requests.exceptions.ConnectionError: HTTPConnectionPool(host='1566bb80c4dc', port=50075): Max retries exceeded with url: /webhdfs/v1/fruit/apple?op=CREATE& (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f644f364510>: Failed to establish a new connection: [Errno -2] Name or service not known',))
    opened by albertoRamon 4
  • requests.exceptions.ConnectionError: ('Connection aborted.', ConnectionResetError(10054, 'An existing connection was forcibly closed by the remote host.', None, 10054, None))

    Traceback (most recent call last):
      File "D:\Anaconda3\lib\site-packages\urllib3\", line 601, in urlopen
        chunked=chunked)
      File "D:\Anaconda3\lib\site-packages\urllib3\", line 357, in _make_request
        conn.request(method, url, **httplib_request_kw)
      File "D:\Anaconda3\lib\http\", line 1239, in request
        self._send_request(method, url, body, headers, encode_chunked)
      File "D:\Anaconda3\lib\http\", line 1285, in _send_request
        self.endheaders(body, encode_chunked=encode_chunked)
      File "D:\Anaconda3\lib\http\", line 1234, in endheaders
        self._send_output(message_body, encode_chunked=encode_chunked)
      File "D:\Anaconda3\lib\http\", line 1065, in _send_output
        self.send(chunk)
      File "D:\Anaconda3\lib\http\", line 986, in send
        self.sock.sendall(data)
    ConnectionResetError: [WinError 10054] An existing connection was forcibly closed by the remote host.

    During handling of the above exception, another exception occurred:

    Traceback (most recent call last):
      File "D:\Anaconda3\lib\site-packages\requests\", line 440, in send
        timeout=timeout
      File "D:\Anaconda3\lib\site-packages\urllib3\", line 639, in urlopen
        _stacktrace=sys.exc_info()[2])
      File "D:\Anaconda3\lib\site-packages\urllib3\util\", line 357, in increment
        raise six.reraise(type(error), error, _stacktrace)
      File "D:\Anaconda3\lib\site-packages\urllib3\packages\", line 685, in reraise
        raise value.with_traceback(tb)
      File "D:\Anaconda3\lib\site-packages\urllib3\", line 601, in urlopen
        chunked=chunked)
      File "D:\Anaconda3\lib\site-packages\urllib3\", line 357, in _make_request
        conn.request(method, url, **httplib_request_kw)
      File "D:\Anaconda3\lib\http\", line 1239, in request
        self._send_request(method, url, body, headers, encode_chunked)
      File "D:\Anaconda3\lib\http\", line 1285, in _send_request
        self.endheaders(body, encode_chunked=encode_chunked)
      File "D:\Anaconda3\lib\http\", line 1234, in endheaders
        self._send_output(message_body, encode_chunked=encode_chunked)
      File "D:\Anaconda3\lib\http\", line 1065, in _send_output
        self.send(chunk)
      File "D:\Anaconda3\lib\http\", line 986, in send
        self.sock.sendall(data)
    urllib3.exceptions.ProtocolError: ('Connection aborted.', ConnectionResetError(10054, 'An existing connection was forcibly closed by the remote host.', None, 10054, None))

    During handling of the above exception, another exception occurred:

    Traceback (most recent call last):
      File "D:\workspace\phdfs\", line 144, in
        fs.copy_from_local(parname,"/test/fcst/china/10d_arwpost_sta/near/" + wrflisttime.format("YYYYMMDD") + "/" + parname,overwrite = True)
      File "D:\Anaconda3\lib\site-packages\", line 753, in copy_from_local
        self.create(dest, f, **kwargs)
      File "D:\Anaconda3\lib\site-packages\", line 426, in create
        metadata_response.headers['location'], data=data, **self._requests_kwargs)
      File "D:\Anaconda3\lib\site-packages\requests\", line 126, in put
        return request('put', url, data=data, **kwargs)
      File "D:\Anaconda3\lib\site-packages\requests\", line 58, in request
        return session.request(method=method, url=url, **kwargs)
      File "D:\Anaconda3\lib\site-packages\requests\", line 508, in request
        resp = self.send(prep, **send_kwargs)
      File "D:\Anaconda3\lib\site-packages\requests\", line 618, in send
        r = adapter.send(request, **kwargs)
      File "D:\Anaconda3\lib\site-packages\requests\", line 490, in send
        raise ConnectionError(err, request=request)
    requests.exceptions.ConnectionError: ('Connection aborted.', ConnectionResetError(10054, 'An existing connection was forcibly closed by the remote host.', None, 10054, None))

    opened by Georege 4
  • BUG: Chinese characters can't be copied to HDFS

    UnicodeEncodeError: 'latin-1' codec can't encode characters in position 2-3: Body ('张三') is not valid Latin-1. Use body.encode('utf-8') if you want to send it encoded in UTF-8.

    opened by yiershanxll 3
  • Help me, please. The second run of the function in the script produces an abnormal result

    I am a beginner.

    The code is as follows:

    list_info = [{"tenant": "coco", "hive_path": "/user/open_001_dev", "ftp_path": "/files/prov/001"},
                     {"tenant": "lili", "hive_path": "/user/open_002_dev", "ftp_path": "/files/prov/002"}]
    result = 0
    def hive_content_size():
        global result
        for item in range(2):
            if "hive_path" in list_info[item]:

    The result of the first loop iteration is output normally, but the output of the second iteration is abnormal.

    The error report is below:

    ContentSummary(directoryCount=1258, fileCount=3773, length=141829751002, quota=4000000, spaceConsumed=425489253006, spaceQuota=659706976665600)
    Failed to reach to (attempt 3/3)
    Traceback (most recent call last):
      File "/usr/local/python/lib/python3.9/site-packages/urllib3-1.26.4-py3.9.egg/urllib3/", line 445, in _make_request
        six.raise_from(e, None)
      File "<string>", line 3, in raise_from
      File "/usr/local/python/lib/python3.9/site-packages/urllib3-1.26.4-py3.9.egg/urllib3/", line 440, in _make_request
        httplib_response = conn.getresponse()
      File "/usr/local/python/lib/python3.9/http/", line 1347, in getresponse
      File "/usr/local/python/lib/python3.9/http/", line 307, in begin
        version, status, reason = self._read_status()
      File "/usr/local/python/lib/python3.9/http/", line 268, in _read_status
        line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
      File "/usr/local/python/lib/python3.9/", line 704, in readinto
        return self._sock.recv_into(b)
    socket.timeout: timed out
    During handling of the above exception, another exception occurred:
    Traceback (most recent call last):
      File "/usr/local/python/lib/python3.9/site-packages/requests-2.25.1-py3.9.egg/requests/", line 439, in send
        resp = conn.urlopen(
      File "/usr/local/python/lib/python3.9/site-packages/urllib3-1.26.4-py3.9.egg/urllib3/", line 755, in urlopen
        retries = retries.increment(
      File "/usr/local/python/lib/python3.9/site-packages/urllib3-1.26.4-py3.9.egg/urllib3/util/", line 532, in increment
        raise six.reraise(type(error), error, _stacktrace)
      File "/usr/local/python/lib/python3.9/site-packages/urllib3-1.26.4-py3.9.egg/urllib3/packages/", line 735, in reraise
        raise value
      File "/usr/local/python/lib/python3.9/site-packages/urllib3-1.26.4-py3.9.egg/urllib3/", line 699, in urlopen
        httplib_response = self._make_request(
      File "/usr/local/python/lib/python3.9/site-packages/urllib3-1.26.4-py3.9.egg/urllib3/", line 447, in _make_request
        self._raise_timeout(err=e, url=url, timeout_value=read_timeout)
      File "/usr/local/python/lib/python3.9/site-packages/urllib3-1.26.4-py3.9.egg/urllib3/", line 336, in _raise_timeout
        raise ReadTimeoutError(
    urllib3.exceptions.ReadTimeoutError: HTTPConnectionPool(host='', port=9000): Read timed out. (read timeout=10)
    During handling of the above exception, another exception occurred:
    Traceback (most recent call last):
      File "/usr/local/python/lib/python3.9/site-packages/PyHDFS-0.3.1-py3.9.egg/pyhdfs/", line 418, in _request
        response = self._requests_session.request(
      File "/usr/local/python/lib/python3.9/site-packages/requests-2.25.1-py3.9.egg/requests/", line 542, in request
        resp = self.send(prep, **send_kwargs)
      File "/usr/local/python/lib/python3.9/site-packages/requests-2.25.1-py3.9.egg/requests/", line 655, in send
        r = adapter.send(request, **kwargs)
      File "/usr/local/python/lib/python3.9/site-packages/requests-2.25.1-py3.9.egg/requests/", line 529, in send
        raise ReadTimeout(e, request=request)
    requests.exceptions.ReadTimeout: HTTPConnectionPool(host='', port=19888): Read timed out. (read timeout=10)
    Traceback (most recent call last):
      File "/home/hadoop/shay/monthly_report/", line 24, in <module>
      File "/home/hadoop/shay/monthly_report/", line 22, in hive_content_size
      File "/usr/local/python/lib/python3.9/site-packages/PyHDFS-0.3.1-py3.9.egg/pyhdfs/", line 633, in get_content_summary
      File "/usr/local/python/lib/python3.9/site-packages/PyHDFS-0.3.1-py3.9.egg/pyhdfs/", line 450, in _get
      File "/usr/local/python/lib/python3.9/site-packages/PyHDFS-0.3.1-py3.9.egg/pyhdfs/", line 442, in _request
    pyhdfs.HdfsNoServerException: Could not use any of the given hosts

    ask for help~~!!!

    opened by qwe55982 2
  • HdfsFileAlreadyExistsException is not implemented?

    Hi! Thanks for your great work. I have noticed that some exceptions do not seem to be implemented yet.

    For example, if I try to upload a file to a path that already exists, Python raises ConnectionError instead of HdfsFileAlreadyExistsException.

    The error message is as follows:

    Traceback (most recent call last):
      File "", line 12, in <module>
        fs.create('/xxx/xxx/images/test.png', data=file)
      File "/home/chiuhongyu/workplace/xxx/venv/lib/python3.6/site-packages/pyhdfs/", line 504, in create
        metadata_response.headers['location'], data=data, **self._requests_kwargs)
      File "/home/chiuhongyu/workplace/xxx/venv/lib/python3.6/site-packages/requests/", line 132, in put
        return request('put', url, data=data, **kwargs)
      File "/home/chiuhongyu/workplace/xxx/venv/lib/python3.6/site-packages/requests/", line 61, in request
        return session.request(method=method, url=url, **kwargs)
      File "/home/chiuhongyu/workplace/xxx/venv/lib/python3.6/site-packages/requests/", line 542, in request
        resp = self.send(prep, **send_kwargs)
      File "/home/chiuhongyu/workplace/xxx/venv/lib/python3.6/site-packages/requests/", line 655, in send
        r = adapter.send(request, **kwargs)
      File "/home/chiuhongyu/workplace/xxx/venv/lib/python3.6/site-packages/requests/", line 498, in send
        raise ConnectionError(err, request=request)
    requests.exceptions.ConnectionError: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))
    opened by james77777778 1
  • Support customized WEBHDFS_PATH

    In the latest version of pyhdfs, the WebHDFS path is set to the constant '/webhdfs/v1'. This works well in most scenarios, but users may have a customized HTTP URL. For example, users may run their own WebHDFS service using Pylon and access their RESTful server through a customized URL pattern such as http://<HOST>:<HTTP_PORT>/webhdfs/api/v2/<PATH>?op=...

    opened by SparkSnail 1
  • TypeError: __new__() got an unexpected keyword argument 'storagePolicy'

    I am using Hadoop 2.6 (with Docker: sudo docker run -i -t sequenceiq/hadoop-docker:2.6.0 /etc/ -bash).

    When I use PyHDFS to call client.list_status, I get this error:

    Traceback (most recent call last):
      File "", line 3, in <module>
      File "...testenv/lib/python3.4/site-packages/", line 428, in list_status
        _json(self._get(path, 'LISTSTATUS', **kwargs))['FileStatuses']['FileStatus']
      File "...testenv/lib/python3.4/site-packages/", line 427, in <listcomp>
        FileStatus(**item) for item in
    TypeError: __new__() got an unexpected keyword argument 'storagePolicy'

    The code:

    from pyhdfs import HdfsClient
    client = HdfsClient(hosts='')

    This issue occurs because the JSON from the server has an extra property, storagePolicy; adding it fixes the error. However, I would like to know whether this is a standard property of HDFS/WebHDFS.

    opened by robberphex 1
  • Why assert that the response is empty?

    In, line 424

    assert not metadata_response.content

    In my client, I get a response body when uploading files:

    b'<html>\r\n<head><title>307 Temporary Redirect</title></head>\r\n<body bgcolor="white">\r\n<center><h1>307 Temporary Redirect</h1></center>\r\n<hr><center>nginx/1.13.8</center>\r\n</body>\r\n</html>\r\n'

    This response does not mean the upload failed, and I can upload my files successfully when I delete this line. Why was this line added? Could you please help me figure out this problem?

    opened by SparkSnail 0
  • Support setting webhdfs_path

    In the latest version of pyhdfs, the WebHDFS path is set to the constant '/webhdfs/v1'. This works well in most scenarios, but users may have a customized HTTP URL. For example, users may run their own WebHDFS service using Pylon and access their RESTful server through a customized URL pattern such as http://<HOST>:<HTTP_PORT>/webhdfs/api/v2/<PATH>?op=...

    opened by SparkSnail 0
  • Let pyhdfs access HDFS in a Kerberos environment

    When HDFS requires Kerberos authentication, you cannot access HDFS. So perhaps authentication information should be added. In fact, Python calls the requests module when it accesses HDFS, so the authentication information can be added there.

    opened by LuckyNemo 0
  • Got a TypeError while appending to a file

      File "/usr/local/lib/python3.6/site-packages/pyhdfs/", line 520, in append
        path, 'APPEND', expected_status=HTTPStatus.TEMPORARY_REDIRECT, **kwargs)
      File "/usr/local/lib/python3.6/site-packages/pyhdfs/", line 466, in _post
        return self._request('post', path, op, expected_status, **kwargs)
      File "/usr/local/lib/python3.6/site-packages/pyhdfs/", line 431, in _request
        _check_response(response, expected_status)
      File "/usr/local/lib/python3.6/site-packages/pyhdfs/", line 933, in _check_response
        remote_exception['message'] = exception_name + ' - ' + remote_exception['message']
    TypeError: must be str, not NoneType

    opened by BingoZ 0
  • Can't parse JSON with unprintable characters

    If a file with an unusual, non-UTF-8 name is created in HDFS, the client fails because it cannot interpret the response as a valid JSON string.

    e.g. it's possible to put a ctrl-r in the file name

    opened by jingw 0
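The failure described in the last issue can be reproduced with the standard json module alone. By default, json.loads rejects raw control characters inside strings, while passing strict=False accepts them. The snippet below is a sketch of that behavior, not a description of how pyhdfs parses responses; the sample payload and the helper name parse_lenient are made up for illustration.

```python
import json

# A LISTSTATUS-style fragment whose file name contains a raw Ctrl-R (0x12).
raw = '{"pathSuffix": "bad\x12name", "type": "FILE"}'


def parse_lenient(text):
    """Parse JSON, allowing raw control characters inside strings."""
    return json.loads(text, strict=False)


try:
    json.loads(raw)
    strict_ok = True
except json.JSONDecodeError:
    # The default strict mode refuses the unescaped control character.
    strict_ok = False

status = parse_lenient(raw)
print(strict_ok, repr(status["pathSuffix"]))
```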